James Ball from the Bureau of Investigative Journalism
He was the chief data analyst for Dispatches and Al Jazeera, turning the logs into plain English to help journalists working on the programmes.
Stories on torture; civilian deaths at US checkpoints; 109,032 dead; 183,991 one in 50 detailed; 1,300 allegations of torture against Iraqi troops, 300 against American forces.
US helicopters gun down surrendering insurgents.
US claim to have killed 103 civilians.
Getting the data: Data.gov.uk; Freedom of Information Act; web scrapers (ScraperWiki.com); or turn up at an undisclosed location at 1am on a Sunday and be told not to go straight home after picking up a USB stick.
It was a 400MB text file. Almost 400,000 documents and almost 40 million words of dense military jargon.
Couldn’t read it or open it up. It’s a data cleaning problem. Had a text file and a comma-separated file, and neither worked. Dates were creeping into the wrong columns.
Had to scrap those and look at the MySQL file. Used UltraEdit, which worked really well.
Turning it into something workable meant knocking off bits of code.
Dates didn’t work and were also inconsistent. Finds Google Refine a useful new tool for cleaning up information.
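A minimal sketch of the kind of date clean-up described above, in Python. The talk does not say which date formats appeared in the logs or how the team fixed them, so the formats in `CANDIDATE_FORMATS` and the helper names are assumptions for illustration only. Rows whose date column does not parse are set aside, which is one way to spot values that have crept into the wrong column.

```python
from datetime import datetime

# Hypothetical formats: the talk does not list the ones found in the logs.
CANDIDATE_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%d-%m-%Y"]


def normalise_date(raw):
    """Return an ISO 8601 string, or None if no candidate format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue
    return None


def split_clean_and_suspect(rows, date_column):
    """Separate rows with a parseable date from rows that need a human look."""
    clean, suspect = [], []
    for row in rows:
        iso = normalise_date(row[date_column]) if date_column < len(row) else None
        if iso is None:
            # Probably a shifted column or an unknown format.
            suspect.append(row)
        else:
            clean.append(row[:date_column] + [iso] + row[date_column + 1:])
    return clean, suspect
```

The suspect rows are kept rather than dropped, so nothing is silently lost while the formats list is extended.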
Old versions of Excel cut the data off, so you can’t see more than a scrap. Needed a way to help people view it with only a limited number of computers to look at it on.
Low-tech solutions were small PDFs, but these were really helpful.
Always asked what the data looks like, so exporting sections as 800-page PDFs gave people something to see. Not good for data crunching, but good for reading several hundred reports. Worked well for reporters, particularly when looking at a specific area or torture records.
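The export step can be sketched as splitting the reports into fixed-size reading packs. This is only an illustration: the talk mentions mail merge and PDFs, not a script, and the `id` and `text` field names are hypothetical. The sketch renders plain text; turning each pack into a PDF would need a separate tool.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def render_pack(reports):
    """Join reports into one readable block of text (hypothetical fields)."""
    return "\n\n".join(f"Report {r['id']}\n{r['text']}" for r in reports)
```

Each pack can then be handed to one reader, which suits a team with few computers better than a single huge file.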
Used mail merge as a handy way to get the data out.
Ran a MySQL database and got a tech person to build a web interface.
War Logs diary dig is very neat but it’s not the best thing.
Searching for terms such as escalation of force, or blue on white, finds few reports. Searching for friendly actions finds more. These are attacks with civilian categories.
Asking the right questions and searching brought out the right stories. Had to be sure of asking the data the right question.
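To illustrate why the choice of search term matters, here is a small sketch that matches a term against both the category label and the free-text summary. It uses Python's built-in `sqlite3` as a stand-in for the MySQL database mentioned in the talk. The table layout, column names and the rows in `DEMO_ROWS` are placeholders invented for the example, not real log entries.

```python
import sqlite3

# Placeholder rows for illustration only; not taken from the logs.
DEMO_ROWS = [
    ("Escalation of Force", "placeholder text one"),
    ("Friendly Action", "placeholder text two, checkpoint"),
    ("Friendly Action", "placeholder text three"),
    ("Blue-White", "placeholder text four, checkpoint"),
]


def build_demo_db(rows):
    """Create an in-memory database with a hypothetical reports table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reports ("
        "id INTEGER PRIMARY KEY, category TEXT, summary TEXT)"
    )
    conn.executemany(
        "INSERT INTO reports (category, summary) VALUES (?, ?)", rows
    )
    return conn


def search(conn, term):
    """Return (id, category) for reports matching term in either column."""
    pattern = f"%{term}%"
    cur = conn.execute(
        "SELECT id, category FROM reports "
        "WHERE category LIKE ? OR summary LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return cur.fetchall()
```

With these placeholder rows, a search on one category label returns only the reports filed under it, while a search on a word in the summary cuts across categories.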
Searched for the Prime Minister’s name. Found out more about stories already reported. The data had them in depth. Covered all areas, not just limited to where the few journalists were embedded.
Used great software to show incidents over periods of time. Colour coded to show deaths, civilians, enemies, police, friendlies etc.
Ten thousand killed through ethnic cleansing murders. More people killed in murders than IED explosions, found in data.
Discovered a category of incident marked as causing media outcry.
http://v.gd/q4zxDz – Tutorial.
Used Tableau to see data. Limit to free version of up to 100,000 records.
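One way to fit a larger dataset under a record limit like this is to draw a uniform random sample while streaming through the file. The talk does not say how the team handled the limit, so this is only a sketch of one option (reservoir sampling), with a hypothetical function name.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

Filtering to one region or one incident type before loading is the other obvious route, and keeps every record in the slice being studied.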
Searches of the data found civilians killed at checkpoints due to car bombs exploding. Had people reading 800 reports to get the real story behind the numbers, too.
Found Tableau great to use, particularly visually, without worrying about code.
People liked word highlights and PDF was the best way to use it.
Used the data as part of the research. Didn’t think, let’s do maps and data images, but did.
Had maps showing where fatal incidents happened.
Powerful information, especially when you pull out from central Baghdad.
Team on the ground went out to Baghdad talking to people for Dispatches.
All the data was geocoded. Took an area and pulled out every report from the area. Used in a map view to see what had happened.
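Pulling every report from an area can be sketched as a bounding-box filter on the coordinates. The `lat` and `lon` field names and the function names are assumptions, since the talk does not describe how the geodata was stored or queried.

```python
def in_bounding_box(lat, lon, south, west, north, east):
    """True if the point lies inside the box (edges included)."""
    return south <= lat <= north and west <= lon <= east


def reports_in_area(reports, south, west, north, east):
    """Return the reports whose coordinates fall inside the box."""
    return [
        r for r in reports
        if in_bounding_box(r["lat"], r["lon"], south, west, north, east)
    ]
```

The result is the list a map view would plot for that area. A box that crosses the 180th meridian would need extra handling, which this sketch leaves out.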
The map helped reporters speak to people on the ground.
Had video of a man in a white sedan getting out of his vehicle, who was then gunned down by an Apache. Used geodata to find the report in the Iraq logs mentioning the sedan. The report didn’t show the driver getting out and surrendering; the video did.
Checking details found it was within range of Apache, and lawyer cleared the footage for Dispatches.
Information tells a story that doesn’t look like a data story. Man shot while surrendering is a stronger story, although he had a mortar tube in his car.
It wasn’t found with clever tricks but with 10 weeks of 25 people reading detailed reports, working more than 18 hours a day. 30,000 reports read in detail. 5,000 read closely.
Richard Dixon from The Times asks if the leak will make this type of data more difficult to come across and unlock.
James suggests not because of the way it was leaked.
Francis Irving asked who paid. Funding came from the David and Elaine Potter Foundation. Dispatches paid a standard fee. Also took a fee from Al Jazeera. This gave a budget to cover research.
Mechanical Turk used for mundane repeat tasks, but something like this is too sensitive for farming out to different nationalities. Needed researchers who were trusted and had been working on it for some time because the information was so sensitive.
Judith Townend asked if there were issues with mainstream media taking up the story. James said it was difficult but explaining the data and making it clear helped. Put across idea it’s battlefield data but trust the data. The numbers change as you’re going through in data journalism.
As people became more comfortable with it, it didn’t become difficult to ‘sell’ at all.
Bureau of Investigative Journalism put all information, maps and animations on the web. Also put the raw data, heavily redacted, online. Wikileaks put it all online.