As I couldn’t make it tonight I created a Storify from Hacks/Hackers London.
It is one of those days when you remember where you were. I was sitting at my desk at the Buckinghamshire Advertiser office in Chalfont St Peter. We had just put the latest edition to bed, and were planning the next one, when the editor walked out of his office. I’ll never forget his words:
My wife’s just phoned and said a plane has crashed into one of the World Trade Centre buildings.
It was such a different time. We had no television, no radio and just a dial-up internet connection on one machine in the office. It was jealously guarded by the deputy editor, but he fired it up.
When the editor came out of his office a few minutes later and told us his wife had called to say another plane had flown into the second tower, we knew it was an attack. I grabbed the phone, called Heathrow and kept hitting redial as the deputy editor asked for someone to call. “Already on it,” I said.
Once we established there was no immediate local link we knew we didn’t have to ‘hold the front page’. The one radio we had didn’t work without earphones, so I listened as we worked on. Then the horror hit.
I’ll never forget the terror in the correspondent’s voice (I think we were listening to Radio 4) as she started to describe the first tower falling. I relayed her words and I remember the shocked looks on my colleagues’ faces. A few minutes later the editor returned; we hadn’t noticed he’d gone. He had a small TV and an indoor aerial.
We gathered around the TV when it was set up and saw the second tower go. Again I’ll never forget the editor’s words:
We have just watched thousands of people die, and we’re going to know some of them.
Six months later I went to Oli Bennett’s memorial service. I had interviewed his mother, Joy Bennett, before it and a few times afterwards. She was against the Iraq war and featured in Roger Graef’s film September Mourning.
It was a moving memorial. One of Oli’s colleagues spoke about the many friends he had lost that day, his voice breaking with the emotion.
I don’t think they’ve found any trace of Oli. The Bennetts buried an urn of ash from Ground Zero in the churchyard at Penn Street, near their home. Even though I didn’t know Oli, I always think of him, and particularly the loving parents he left behind, every September 11 and whenever I hear the ELO song Mr Blue Sky, which was played at his memorial.
Sex, lies and instant messenger
If you don’t want your partner to catch you cheating, don’t use the internet. This was Alec Muffett’s advice at Hacks/Hackers London’s August gathering.
There was a serious message behind Alec’s advice on keeping your illicit affair and dodgy porn habit a secret: anyone who needs to keep a secret, whether a dissident or a whistle-blower, needs to consider how they communicate.
If you need to keep a secret, don’t use Skype, Google, Facebook, Twitter, Flickr, smartphones, applications with pop-ups, iTunes, massively multiplayer online role-playing games (MMORPGs), work hardware and so on. All of this can come back to bite you.
Skype shares everything with every machine you’ve installed it on. Even if you delete and reinstall, messages come back from the dead.
Facebook ends up everywhere, as does Google. These companies also comply with US law, which means if the National Security Agency (NSA) asks for your data, it’s handed over.
Smartphones also have comedy potential. Alec told a tale of a person whose boyfriend was sending saucy messages, which their boss read when the details popped up on their iPhone.
What can you do to protect yourself?
- Create a complex password only you know.
- Use a very boring pseudonym such as Edward Wilson or Carole Smith, because anything unique will come back to you.
- Avoid linking identities and never describe yourself.
- Use a different browser for day-to-day use and keep Firefox for your naughty secrets.
- Log out of Facebook.
- Clear cookies, don’t accept third-party cookies.
- Switch off everything.
- Don’t bookmark.
- Don’t save passwords.
- Do not leave voicemail.
Illicit affairs and unusual kinks provide plenty of entertainment to geeks like Alec. It is amazing what can be retrieved from hardware sold on eBay, he said.
His slides of advice are here: dropsafe.crypticide.com/article/5078
How digital destroyed the news cycle and what you can do about it
Demand for newspapers is falling but people still clamour for information. Tools from Twitter to Tumblr are quick ways to share information. As Martin Belam pointed out, there was one event he was following on Twitter when official sources had no details at all. (It was something about football I didn’t understand).
Digital has destroyed the traditional news cycle, but it has created a new one. Print newspapers are an enjoyable read but are always historical. Online is live and as up-to-date as possible, although social media sources can be unreliable.
Martin simplified the news cycle as: write newspaper, print newspaper, wrap fish and chips in newspaper, before adding embellishments including sub-editing, layout, legal checks and loading everything online. The Guardian has a digital-first policy and publishes across a multitude of platforms from iPhone to Kindle, with iPad, Windows Phone and Android apps due for release soon (nothing for BlackBerry). The way the newspaper presents its product on different platforms is something Martin says needs to be addressed.
“Stop the shovels,” he said. Mobiles are not tablets, tablets are not desktops and we don’t read stories as PDFs online. One of the frustrations facing user experience architects like Martin is making content work across a multitude of mediums. He highlighted problems with visualisations on the Telegraph and Guardian websites which required a mouse (not helpful when you’re on an iPad), and copy exported without thought for the medium, with notes guiding readers to images from print which aren’t online.
Interactivity has changed the way the Guardian works. When there are mistakes “we are subbed by the comments very quickly”, Martin explained. Journalists are actively using Twitter as a news-gathering source. Paul Lewis was asking where trouble was flaring up during the recent riots, which resulted in his reports from the thick of the violence.
The Guardian is known for its liveblogging. It is a platform Martin described as a “native digital format”. Throughout the day there are political, sport and TV liveblogs generating a huge amount of traffic and an engaged readership interacting with the information.
Digital is part of storytelling now. Martin was critical of a report compiled by the NCTJ* where editors ranked web skills and social media below time management as key skills for journalists.
“It is unfair to be equipping young journalists for a job they would have been doing in the 80s and 90s.”
He pointed out the survey shows the entrenched attitude of people in control of newspapers. “They’re not interested in turning the tanker around,” Martin said.
When I teach online journalism, I tell the students I am providing them with the skills they will need to be employed in five years’ time. Martin has the same opinion and advised journalists to keep learning and developing, as he knows digital is the future.
His summing up of the future of news raised a round of applause from the crowd:
“Let the digital people get on with saving this business properly.”
* Martin was critical of a different website that didn’t link to the report. I would like to point out the piece I sub-edited did have a link to the NCTJ page. However, this does not overcome the issue with the report page which advises people to click on the link to the right…
My colleague Alex Therrien has found out the hard way about selling on a great story.
Picking up the phone paid off for him with a great story about Tammy Page from Worthing who had the tip of her finger bitten off by a fox.
The Worthing Herald hits the streets on a Wednesday afternoon, despite the Thursday publishing date, so any reporter who wants to make some much-needed cash needs to get on the phone pretty quickly before someone else jumps in.
At least Alex discovered he is not alone, as all of us could list great stories we had just been too late to sell but which had made it to the nationals.
How your money is really being spent
Wanted to look at local government spending in various areas. Looked at government account figures published on the Number 10 website.
Government temp spending is triple its own staff budget
Found a page with 190,000 individual data entries.
Had someone writing in Java, one in Ruby, and found the data was a bit rubbish.
Date columns were not filled, or had the number of days since 1900.
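For anyone hitting the same problem, here is a minimal sketch of converting those “days since 1900” serial numbers back into dates; the conversion convention is Excel’s, but the code is mine, not the team’s.

```python
from datetime import date, timedelta

# Convert an Excel-style "days since 1900" serial number into a real date.
# The 1899-12-30 epoch accounts for Excel counting 1900-01-01 as day 1 and
# wrongly treating 1900 as a leap year. A sketch, not the team's code.
def serial_to_date(serial):
    return date(1899, 12, 30) + timedelta(days=int(serial))

print(serial_to_date(40179))  # 2010-01-01
```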
Cannot trust the data, have to ask if it’s correct and can be validated.
Had a massive amount of data and tried to break it down by agency and temp staff, cutting back a massive spreadsheet.
Used Zoho(?) where you can see things pretty quickly.
Visuals were created once the costs were separated. Need to dig deep into the data to find the quirks.
Take-home lessons: learning the accuracy of the data, a structured database, other axes of investigation, getting the data clean, updating automatically.
Is it worth it?
Took extensive salary data.
Put in a location and a job, and the function shows if it’s worth living there.
A Welsh teacher earning £45,000: not competitive.
Someone in London working as an accountant at £45,000: the data showed 16 applicants per job, making it a 50/50.
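The team didn’t show their scoring code, but a toy version of the idea might look like the sketch below; the formula, weighting and figures are my assumptions.

```python
# A toy, assumed version of the "is it worth it?" score: weigh the salary
# against the regional average, then discount by the competition per job.
def worth_it(salary, regional_avg_salary, applicants_per_job):
    pay_ratio = salary / regional_avg_salary  # above 1 beats the local average
    odds = 1 / (1 + applicants_per_job)       # more applicants, slimmer odds
    return pay_ratio * odds

# The London accountant on £45,000 facing 16 applicants per job:
print(round(worth_it(45000, 45000, 16), 3))
```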
From the initial data service a map was created where you can choose a function, a job title and a region to find out visually whether a job is worth it. It drills down per region.
Can also zoom in to regions using a slider system.
A splendid and complex visualisation. (The winner)
Started with the idea of truck stops and which ones were safe.
Started looking for data on the Highways Agency site and found it wanting.
Found a map with decent truck stop sites.
Had the XML source and started to develop a scraper on Scraperwiki, getting a view on Google Maps.
Plotted all the points. The letter on each point shows how safe it is, from analysing which ones had CCTV and various security measures.
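As a rough illustration of the approach (not the team’s actual scraper), parsing an XML feed of truck stops and grading each point by its security measures might look like this; the URL and tag names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed; the real source and tag names will differ.
feed = urllib.request.urlopen("http://example.com/truckstops.xml")
tree = ET.parse(feed)

points = []
for stop in tree.iter("stop"):
    cctv = stop.findtext("cctv") == "yes"
    fencing = stop.findtext("fencing") == "yes"
    grade = "A" if cctv and fencing else ("B" if cctv or fencing else "C")
    points.append({
        "name": stop.findtext("name"),
        "lat": float(stop.findtext("lat")),
        "lng": float(stop.findtext("lng")),
        "grade": grade,  # the letter shown on each map point
    })
print(points[:3])
```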
Further on, they wanted to find out more about truck crime. Looked at the TruckPol website, took the data from PDFs and put it in a spreadsheet.
Updated the view with the information about crimes. Red ones are not so great, blue are good and purple is okay.
(Winner of the best scraper award from Scraperwiki and third place overall).
Takeover watch
The UK Takeover Panel was the prime source of information showing all takeovers in play. The aim was to create something to provide details about companies.
Had scraped data but needed to add sector and revenue to create context.
Also used Investigate.co.uk
Had a live table showing activity from the last two days.
Have different sectors and can pull information out to see what’s happening in different areas.
Creates a map showing areas affected by snow so you can see where the nearest snow hole is. (See the snow hole blog.)
How people move around the chemical world
Used Google Refine to play with the data. Pulled out the geocode to map where the companies were.
Google Fusion also used.
Top 100 chemical companies. Merged Google Finance information with Isis.
Created a visual showing how sales had gone down, with chemical industry sales halving from 2007 to 2008.
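The merge step was done with Google Refine and Fusion, but in code it amounts to a join on company name; this pandas sketch, with made-up column names and figures, shows the shape of it.

```python
import pandas as pd

# Illustrative figures only, not the real Isis or Google Finance data.
finance = pd.DataFrame({
    "company": ["ChemCo", "PolyPlc"],
    "sales_2007": [120.0, 80.0],  # £m
    "sales_2008": [60.0, 42.0],
})
locations = pd.DataFrame({
    "company": ["ChemCo", "PolyPlc"],
    "lat": [53.48, 51.51],
    "lng": [-2.24, -0.13],
})

merged = finance.merge(locations, on="company")
merged["change"] = merged["sales_2008"] / merged["sales_2007"] - 1
print(merged)  # sales roughly halving from 2007 to 2008
```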
We were quite a large group to start with, so we’ve ended up splitting in two. One group is working on scraping details of registered care homes, and I’m in a group working on turning information already gathered into an interesting and informative visual.
Our first battle was making sure Scraperwiki could read our data so we could work with it.
First of all I uploaded it to Google Docs, but the comma separated values (CSV) scraper didn’t like it. Then, when the spreadsheet was published as a web page as suggested, it still wasn’t happy because it wanted us to be signed into Google.
Francis Irving also suggested scraping What Do They Know, because it was Freedom of Information data.
After much fiddling Matt managed to pull out the raw data by popping (pulling from the top of the list) and using a Python scraper.
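I didn’t keep Matt’s code, but the popping approach looks roughly like this sketch; the file name and the comma-split are assumptions.

```python
# Treat the raw rows as a list and pull records off the top one at a time.
with open("raw_data.txt") as f:  # hypothetical file name
    rows = f.read().splitlines()

records = []
while rows:
    line = rows.pop(0)  # "popping": pulling from the top of the list
    if line.strip():
        records.append(line.split(","))

print(len(records), "records pulled out")
```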
It turned out the data we had was so unstructured it wasn’t possible to work with it.
After lunch we’re working on a different project.
Francis Irving of Scraperwiki explains how it works.
Take the Gulf oil spill. You can find a list of oil fields around the UK, but it’s all in a strange lump.
He shows a piece of Python code that reads the oil field pages and turns them into a piece of data.
It’s quite simple to make a map view, and there is also code to make more complicated views.
Scraperwiki is automatic data conversion.
Scrape internet pages, parse them, organise the data, collect it and model it into a view. It will keep running and update the dataset constantly.
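A minimal sketch of that scrape-parse-store loop, under my own assumptions about the page layout (this is not Francis’s actual oil fields code):

```python
import re
import sqlite3
import urllib.request

# Placeholder URL; imagine each oil field appears as a table row like
# <td>Forties</td><td>57.75</td><td>0.97</td>
html = urllib.request.urlopen("http://example.com/oilfields").read().decode()
rows = re.findall(r"<td>(\w+)</td><td>([\d.-]+)</td><td>([\d.-]+)</td>", html)

db = sqlite3.connect("oilfields.db")
db.execute("CREATE TABLE IF NOT EXISTS fields (name TEXT, lat REAL, lng REAL)")
db.executemany("INSERT INTO fields VALUES (?, ?, ?)", rows)
db.commit()  # re-run on a schedule and the dataset stays up to date
```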
There are two kinds of journalism to do with the data: you can make specific tools, or you can find a story.
In Belfast they took a list of historic houses in the UK. The data scraper looked through a host of websites using Python (you can also use Ruby).
There are a multitude of visuals available. The Belfast project showed a spike in 1979, explained by a sectarian political issue.
Answering a question, Francis confirms you can scrape more than one website at a time.
Francis would like to see more linked data and merging datasets together.
Asked about licensing for commercial use. Francis says it’s mainly used for public data. Scraperwiki blocks scraping Facebook because it’s private data, but the code can be adjusted.
Interested areas for projects today are: farming, local government budgets, public sector salaries, mapping chemical companies and distributors, environment, transport, road transport crime, a truck stops map, energy data, a country profile link to carbon emissions, e-waste, airline data, plastics data, empty shops, infotainment to make users interested in the data, another visualisation on companies ranking based on customer reviews, using the crowd to share information with data and create interesting information, data annotating and enriching content, health data… and anything else we’re doing.
Francis talking about two different stories on the internet.
It used to be the case you had to check the division list to find out how MPs voted.
Created a web scraper pulling out the information and created The Public Whip, showing how MPs voted.
You have to be a parliament nerd to understand it, even when it’s broken down.
They Work for You simplifies the information even more; it tells you something about your MP.
Bring the division information together. Take a list from Public Whip and create a summary of how they voted.
Checking how one MP voted on the Iraq War: he voted with the majority in favour of the war on three votes, and abstained from the first and then the final three. It’s almost a deal with the electorate.
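As a toy sketch of how a “voted strongly/moderately for” summary might be computed from a division list (the thresholds are my guesses, not Public Whip’s actual rules):

```python
votes = ["aye", "aye", "aye", "absent", "absent", "absent", "absent"]

ayes, noes = votes.count("aye"), votes.count("no")
cast = ayes + noes
if cast == 0:
    print("never voted on this issue")
else:
    share = ayes / cast
    strength = "strongly" if share >= 0.8 or share <= 0.2 else "moderately"
    side = "for" if share >= 0.5 else "against"
    print(f"voted {strength} {side}")  # here: "voted strongly for"
```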
One MP asked to have “voted moderately” removed because he found it misleading. A number of MPs have complained, but the votes were checked.
Richard Pope, founder of Scraperwiki, made a website after the demolition of his local pub (a fine-looking establishment called The Queen) and created the PlanningAlerts.com website.
It helps people access information from outside the immediate catchment area. He wrote lots of web scrapers, for example for different councils’ planning application systems.
Scraperwiki is like Wikipedia, but for data. It’s a technical product for use when you’re not technical. You can look at different data scrapers and copy what others are doing without learning Perl or Python.
Planning Alerts is being moved over to Scraperwiki. Can tag it on Scraperwiki and find information. Can find stories and in-depth information.
Can request a dataset and have something built for you.
Francis was asked: is it legal? In the UK, if it’s public data and not for sale, you can reuse it. They would take things down if asked, but it’s open stuff.
Could it be stopped? It would be ill-advised to stop people, and journalists, from reading public information.
Public Whip and They Work for You look at numerous votes.
Looking at ways to fund it, such as private scrapers, or scrapers in a cocoon. Looking at a white-label version for intranet use. There’s a market for data and developers who want to give data; they want to match developers with data. Currently funded by Channel 4. Want to remain free for the public.
Does it make people lazy? No, the data is already published, but it makes access easier. There’s a movement of people trying to get publishers of data to change. There’s always a need to pull data out in a variety of formats.
Running Hacks and Hackers days working together finding stories and hunting around.
Have had data scraped from the What Do They Know site.
James Ball from the Bureau of Investigative Journalism
He was the chief data analyst for Dispatches and Al Jazeera, turning the logs into English to help journalists working on the programmes.
Stories on torture; civilian deaths at US checkpoints; 109,032 dead; 183,991, one in 50 detailed; 1,300 allegations of torture against Iraqi troops, 300 against American forces.
US helicopters gun down surrendering insurgents.
US claim to have killed 103 civilians.
Getting the data: Data.gov.uk; the Freedom of Information Act; web scrapers (ScraperWiki.com); or turning up at an undisclosed location at 1am on a Sunday and being told not to go straight home after picking up a USB stick.
It was a 400MB text file: almost 400,000 documents and almost 40 million words of dense military jargon.
Couldn’t read it or open it up. It’s a data cleaning problem. Had a text file and a comma separated file, and these did not work. Dates were creeping into wrong columns.
Had to scrap that and look at a MySQL file. Used UltraEdit, which worked really well.
Turning it into something workable meant knocking up bits of code.
Dates didn’t work and were inconsistent. Found Google Refine a useful new tool for cleaning up information.
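The same clean-up can be sketched in code: try a handful of formats until one parses. The formats below are guesses at what the logs held, not the actual ones.

```python
from datetime import datetime

FORMATS = ["%Y-%m-%d %H:%M", "%d/%m/%Y", "%Y%m%d"]  # assumed formats

def parse_date(raw):
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None  # flag for manual checking rather than guessing

print(parse_date("2007-03-14 18:20"))
print(parse_date("not a date"))  # None: goes in the "cannot trust" pile
```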
Old Excel cut the file off, so you couldn’t see more than a scrap. Needed to find a way to help people view it when there were a limited number of computers to look at it on.
A low-tech solution was small PDFs, but these were really helpful.
People always asked what the data looked like, so exporting sections as 800-page PDFs gave them something to see. Not good for data crunching, but good for reading several hundred reports. It worked well for reporters, particularly when looking at a specific area or the torture records.
Used mail merge as a handy way to get the data out.
Ran a MySQL database and got a tech person to build a web interface.
The War Logs Diary Dig is very neat, but it’s not the best thing.
Searching for information such as “escalation of force” or “blue on white” finds few reports. Search for “friendly actions” and you find more. These are attacks with civilian categories.
Asking the right questions and searching brought out the right stories. You had to be very sure you were asking the data the right question.
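In query terms, the point is that the same database looks empty or full depending on the category you ask about; a sketch, with sqlite3 standing in for their MySQL setup and assumed table and column names:

```python
import sqlite3

db = sqlite3.connect("warlogs.db")  # hypothetical database
for term in ("escalation of force", "blue on white", "friendly action"):
    count = db.execute(
        "SELECT COUNT(*) FROM reports WHERE category LIKE ?",
        (f"%{term}%",),
    ).fetchone()[0]
    print(term, "->", count, "reports")
```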
Searched for the Prime Minister’s name and found out more about stories already reported; the data had the in-depth detail. It covered all areas, not just the places where the few embedded journalists were.
Used great software to show incidents over periods of time, colour coded to show deaths: civilians, enemies, police, friendlies and so on.
Ten thousand were killed through ethnic cleansing murders. The data showed more people were killed in murders than in IED explosions.
Discovered a category of incident marked as causing media outcry.
http://v.gd/q4zxDz – Tutorial.
Used Tableau to see the data. The free version is limited to 100,000 records.
Searches of the data found civilians killed at checkpoints due to car bombs exploding. They also had people reading 800 reports to get the real story behind the numbers.
Found it was great to use, particularly visually, without worrying about code.
People liked word highlights and PDF was the best way to use it.
Used the data as part of the research. They didn’t set out thinking “let’s do maps and data images”, but they did.
Had maps showing where fatal incidents happened.
Powerful information, especially when you pull out from central Baghdad.
Team on the ground went out to Baghdad talking to people for Dispatches.
All the data was geocoded. Took an area and pulled out every report from the area. Used in a map view to see what had happened.
The map helped reporters speak to people on the ground.
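Pulling every geocoded report inside an area comes down to a bounding-box filter; a minimal sketch with illustrative coordinates and row format:

```python
reports = [
    {"id": 1, "lat": 33.31, "lng": 44.36, "title": "checkpoint incident"},
    {"id": 2, "lat": 36.19, "lng": 44.01, "title": "patrol report"},
]

# A rough box around central Baghdad (an assumption for the example)
LAT_MIN, LAT_MAX = 33.2, 33.4
LNG_MIN, LNG_MAX = 44.2, 44.5

in_area = [r for r in reports
           if LAT_MIN <= r["lat"] <= LAT_MAX
           and LNG_MIN <= r["lng"] <= LNG_MAX]
print(in_area)  # the reports a ground team could follow up on
```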
Had video of a man in a white sedan getting out of his vehicle and then being gunned down by an Apache. Using geodata, they found the report in the Iraq logs mentioning the sedan. The report didn’t show the driver getting out and surrendering; the video did.
Checking the details found it was within range of the Apache, and a lawyer cleared the footage for Dispatches.
The information tells a story that doesn’t look like a data story. A man shot while surrendering is a stronger story, although he had a mortar tube in his car.
It wasn’t found with clever tricks but with 10 weeks of 25 people working more than 18 hours a day reading detailed reports: 30,000 reports read in detail, 5,000 read closely.
Richard Dixon from The Times asks if the leak will make this type of data more difficult to come across and unlock.
James suggests not because of the way it was leaked.
Francis Irving asked: who paid? Funding came from the David and Elaine Potter Foundation. Dispatches paid a standard fee, and a fee was also taken from Al Jazeera. This gave a budget to cover the research.
Mechanical Turk is used for mundane repeat tasks, but something like this was too sensitive to farm out to different nationalities. They needed researchers who were trusted and had been working on it for some time, because the information was so sensitive.
Judith Townend asked if there were issues with mainstream media taking up the story. James said it was difficult, but explaining the data and making it clear helped. They put across the idea that it’s battlefield data, but to trust the data. In data journalism the numbers change as you’re going through.
As people became more comfortable with it, it didn’t become difficult to ‘sell’ at all.
The Bureau of Investigative Journalism put all the information, maps and animations on the web. It also put the raw data, heavily redacted, online. Wikileaks put it all online.