Sarah Booker


Data projects presented at #hhhrbi

Filed under: journalism,technical — Sarah Booker Lewis @ 6:45 pm


How your money is really being spent.

Wanted to look at local government spending in various areas. Looked at government account figures published on Number 10 website.

Government temp spending is triple its own staff budget.

Found a page with 190,000 individual data entries.

Had someone writing in Java, one in Ruby, and found the data was a bit rubbish.

Date columns were not filled, or had the number of days since 1900.
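
As a rough illustration of the day-count problem, here is a minimal Python sketch. It assumes the numbers follow the Excel 1900 date system (the notes don't say which tool produced them), and the function name is made up for this example.

```python
from datetime import date, timedelta

# Assumption: the serials use the Excel "1900 date system", where 1 is
# 1 January 1900. Excel wrongly treats 1900 as a leap year, so for serials
# from 61 upward the usual trick is to count from 30 December 1899.
EXCEL_EPOCH = date(1899, 12, 30)


def parse_date_cell(cell):
    """Return a date for a serial-number cell, or None if the cell is blank."""
    text = str(cell).strip()
    if not text:
        return None  # an unfilled date column
    serial = int(float(text))
    if serial < 61:
        raise ValueError(f"serial {serial} falls before Excel's phantom leap day")
    return EXCEL_EPOCH + timedelta(days=serial)


print(parse_date_cell("40179"))  # 2010-01-01
print(parse_date_cell(""))       # None
```

Blank cells come back as None rather than a guessed date, so they can be counted and queried with the publisher.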

Cannot trust the data, have to ask if it’s correct and can be validated.

Had a massive amount of data, tried to break it down by agency and temp staff. Cutting back a massive spreadsheet.

Used Zoho(?) where you can see things pretty quickly.

Visuals created once separated the costs. Need to dig deep into the data to find the quirks.

Take-home lessons: accuracy of data, structured database, other axes of investigation, getting data clean, automatically updating.

Is it worth it?

Took extensive salary data.

Put in location and job and then the function shows if it’s worth living there.

A Welsh teacher earning £45,000, not competitive.

Someone in London working as an accountant at £45,000, data showed 16 applicants per job making it a 50/50.

From the initial data service a map was created where you can choose a function, a job title and a region to find out visually whether a job is worth it. It pushes down per region.

Can also zoom in to regions using a slider system.

A splendid and complex visualisation. (The winner)

Truck stops

Started with the idea of truck stops and which ones were safe.

Started looking for data on the Highways Agency site and found it wanting.

Found a map with decent truck stop sites.

Had the xml source and started to develop a scraper on Scraperwiki and got a view on Google maps.

Plotted all the points. Letter on the point shows how safe by analysing which ones had CCTV and various security measures.
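
The team's actual scoring isn't recorded in these notes, but the idea of turning security measures into a letter could be sketched in Python like this. The feature names and the A to E scale are hypothetical, chosen only for illustration.

```python
# Hypothetical feature names and grade scale, for illustration only.
FEATURES = ("cctv", "fencing", "lighting", "guards")


def safety_letter(stop):
    """Grade a truck stop from A (all measures present) to E (none)."""
    score = sum(1 for feature in FEATURES if stop.get(feature))
    return "EDCBA"[score]


print(safety_letter({"cctv": True, "fencing": True}))  # C
```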

Further on wanted to find out more about truck crime. Looked at the TruckPol website and took the data from PDFs and put it in a spreadsheet.

Updated the view with the information about crimes. Red ones are not so great, blue ones are good and purple ones are okay.

(Winner of the best scraper award from Scraperwiki and third place overall).

Take over watch

The UK Takeover Panel was the prime source of information, showing all takeovers in play. The aim was to create something to provide details about companies.

Had scraped data but needed to add sector and revenue to create context.

Also used

Had a live table showing activity from the last two days.

Have different sectors and can pull information out to see what’s happening in different areas

Snow Holes

Creates a map showing areas affected by snow so you can see where the nearest snow hole is. (See snow hole blog)


How people move around the chemical world

Used Google Refine to play with the data. Pulled out the geocode to map where the companies were.

Google Fusion also used.

Top 100 chemical companies. Merged Google finance information with Isis.

Created a visual showing how sales had gone down with the chemical industry sales halving from 2007-08.

(Second place)

Creating something informative and visually interesting #hhhrbi

Filed under: technical — Sarah Booker Lewis @ 5:08 pm

After spending the morning running before we could walk, the team I’m in (Mike Beardmore, Dominic Clay, Matt Holmes and I) has discussed putting together something simple.

Matt had used C# (C sharp) to pull out all the #uksnow tweets and we plan to create a mash-up map using the Highways Agency RSS feed to build a regularly updated map.

The first step was removing all the tweets without a postcode.
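
Matt's filter was written in C#, and its code isn't shown here. As a stand-in, this is a minimal Python sketch of the same idea. The pattern is an assumption: it only looks for a postcode district (such as BN1 or SW19), and the sample tweets are made up.

```python
import re

# Assumption: a postcode district is one or two letters, one or two digits
# and an optional trailing letter (for example "SW19" or "EC1A").
POSTCODE_DISTRICT = re.compile(r"\b([A-Z]{1,2}[0-9]{1,2}[A-Z]?)\b")


def extract_district(tweet):
    """Return the first postcode district found in a tweet, or None."""
    match = POSTCODE_DISTRICT.search(tweet.upper())
    return match.group(1) if match else None


# Made-up tweets for illustration only.
tweets = [
    "#uksnow BN1 6/10 coming down fast",
    "Loving the #uksnow today!",
    "#uksnow sw19 2/10",
]

# Keep only the tweets that carry a postcode district.
located = [(extract_district(t), t) for t in tweets if extract_district(t)]
print([district for district, _ in located])  # ['BN1', 'SW19']
```

A loose pattern like this will also pick up things that merely look like postcodes, such as road numbers, so the results still need a sanity check.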

Mike has also suggested mashing the #uksnow with a re-written scrape on Scraperwiki with details of Harvester restaurants in the UK.

However, we had an issue with Twitter, as too much information was coming in at once. It’s snowing, #uksnow is a popular hashtag and the API couldn’t deal with it.

Mike took the Twitter feed for #uksnow and extracted the postcodes, took the Scraperwiki feed of Harvesters and extracted the postcodes from those, and created a data file formatted as XML so it would show up on Google Maps.
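
The notes don't say which XML format Mike used. One format Google Maps can read is KML, so here is a minimal Python sketch on that assumption. The point names and coordinates are placeholders, and turning a postcode into a latitude and longitude is left out.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def tag(name):
    """Qualify a tag name with the KML namespace."""
    return f"{{{KML_NS}}}{name}"


def build_kml(points):
    """Build a KML string from (name, latitude, longitude) tuples."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(tag("kml"))
    document = ET.SubElement(kml, tag("Document"))
    for name, lat, lon in points:
        placemark = ET.SubElement(document, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = name
        point = ET.SubElement(placemark, tag("Point"))
        # KML lists longitude first, then latitude.
        ET.SubElement(point, tag("coordinates")).text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Placeholder points for illustration only.
points = [
    ("Snow report BN1", 50.82, -0.14),
    ("Restaurant (placeholder)", 50.84, -0.13),
]
print(build_kml(points))
```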

The plan is for a pointer showing the location of a restaurant in the snow, a snow hole, providing warmth.

Mike has managed to get it to work on his own because he’s very capable with the code and produced a map showing areas where heavy snow is reported and the location of the Harvester restaurants nearby.

The potential future for this map would be to show a wide variety of restaurants, service areas and places offering shelter to people who find themselves trapped by snow while travelling.

Creating something visually stimulating from data #hhhrbi

Filed under: journalism,technical — Sarah Booker Lewis @ 12:56 pm

We were quite a large group to start with, so we’ve ended up splitting in two. One group is working on scraping details of registered care homes, and I’m in a group working on information already gathered, creating an interesting and informative visual.

Our first battle was making sure Scraperwiki could read our data so we could work with it.

First of all I uploaded it to Google Docs, but the comma separated values (CSV) scraper didn’t like it. Then, when the spreadsheet was published as a web page as suggested, it still wasn’t happy because it wanted to be signed into Google.

Matt suggested putting the CSV onto his server, so I exported it and sent it over to him.

Francis Irving also suggested scraping What Do They Know, because it was Freedom of Information data.

After much fiddling Matt managed to pull out the raw data by popping (pulling from the top of the list) and using a Python scraper.
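
Matt's scraper isn't reproduced here, but popping from the top of a list looks something like this minimal Python sketch. The sample CSV text and column names are made up.

```python
import csv
import io

# Made-up sample standing in for the exported CSV file.
raw = "name,value\nalpha,12\nbeta,7\n"

rows = list(csv.reader(io.StringIO(raw)))
header = rows.pop(0)  # pop the header row off the top of the list

records = []
while rows:
    row = rows.pop(0)  # keep pulling rows from the top until none are left
    records.append(dict(zip(header, row)))

print(records)
```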

It turned out the data we had was so unstructured it wasn’t possible to work with it.

After lunch we’re working on a different project.

Introduction to ScraperWiki #hhhrbi

Filed under: journalism,technical — Sarah Booker Lewis @ 10:44 am

Francis Irving of Scraperwiki explains how it works.

Take the Gulf oil spill. You can find a list of oil fields around the UK, but it’s all in a strange lump.

He shows a piece of Python code reading the oil field pages and turning them into data.

It’s quite simple to make a map view, but you can also write code to make more complicated views.

Scraperwiki is automatic data conversion.


Scrape internet pages, parse them, organise the results, collect them and model them into a view. It will keep running and give the dataset constantly.
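
To make those stages concrete, here is a minimal Python sketch using only the standard library. It is not Scraperwiki's own API: the page is an inline stand-in rather than a real download, and the table, class and column names are invented for the example.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a fetched page; a real scraper would download this.
PAGE = """
<table>
  <tr><th>Field</th><th>Operator</th></tr>
  <tr><td>Field A</td><td>Operator X</td></tr>
  <tr><td>Field B</td><td>Operator Y</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:  # header rows hold no <td> cells, so they are skipped
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


# Parse: turn the lump of HTML into rows.
parser = TableParser()
parser.feed(PAGE)

# Organise and collect: store the rows in a small database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fields (name TEXT, operator TEXT)")
db.executemany("INSERT INTO fields VALUES (?, ?)", parser.rows)

# View: query the stored data back out.
for name, operator in db.execute("SELECT name, operator FROM fields ORDER BY name"):
    print(name, operator)
```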


There are two kinds of journalism to use with the data. You can make tools, specific tools and find a story.

In Belfast, took a list of historic houses in the UK. The data scraper looked through a host of websites using Python; can also use Ruby.

There are a multitude of visuals available. The Belfast project showed a spike in 1979, which was explained by a sectarian political issue.

Answering a question, Francis confirms you can scrape more than one website at a time.

Francis would like to see more linked data and merging datasets together.

Asked about licensing for commercial use. Francis says it’s mainly used for public data. Scraperwiki blocks scraping Facebook because it’s private data, but the code can be adjusted.

Interested areas for projects today are: farming, local government budgets, public sector salaries, mapping chemical companies and distributors, environment, transport, road transport crime, truckstops map, energy data, countryprofile link to carbon emissions, e-waste, airline data, plastics data, empty shops, infotainment to make user interested in the data, another visualisation on companies ranking based on customer reviews, using the crowd to share information with data and create interesting information, data annotating content and enriching content, health data… and anything else we’re doing.



Programming for the public (@frabcus) #hhldn

Francis talking about two different stories on the internet.

It used to be the case you had to check the division list to find out how MPs voted.

Created a web scraper pulling out the information and created The Public Whip, showing how MPs voted.

Have to be a parliament nerd to understand, even when it’s broken down.

They Work for You simplifies the information even more, it tells you something about your MP.

Bring the division information together. Take a list from public whip and create a summary of how they voted.
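
How the real sites build their summaries isn't covered in the talk notes. As a rough idea of what summarising a division list involves, here is a minimal Python sketch with made-up records and hypothetical category names.

```python
from collections import Counter

# Made-up division records: (division, how the MP voted, how the majority voted).
divisions = [
    ("Division 1", "aye", "aye"),
    ("Division 2", "absent", "aye"),
    ("Division 3", "no", "aye"),
    ("Division 4", "aye", "aye"),
]


def summarise(records):
    """Count how often an MP voted with the majority, against it, or was absent."""
    summary = Counter()
    for _, vote, majority in records:
        if vote == "absent":
            summary["absent"] += 1
        elif vote == majority:
            summary["with majority"] += 1
        else:
            summary["against majority"] += 1
    return summary


print(summarise(divisions))
```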

Checking how one MP voted on the Iraq War. Voted with the majority in favour of the war on three votes and abstained from the first and then the final three. It’s almost a deal with the electorate.

MP asked to have “voted moderately” removed because found it misleading. A number of MPs have complained, but checked the votes.


Richard Pope, founder of Scraperwiki, created the Planning Alerts website after the demolition of his local pub (a fine-looking establishment called The Queen).

It helps people access information from outside the immediate catchment area. He wrote lots of web scrapers. Example of different councils’ planning application systems.

Scraperwiki is like Wikipedia but for data. It’s a technical product for use when you’re not technical. Can look at different data scrapers and copy what others are doing without learning Perl or Python.

Planning Alerts is being moved over to Scraperwiki. Can tag it on Scraperwiki and find information. Can find stories and in-depth information.

Can request a dataset and have something built for you.

Francis was asked,  is it legal? In the UK if it’s public data, not for sale, you can reuse it. Would take things down if asked, but it’s open stuff.

Could it be stopped? Would be ill-advised to stop people, and journalists, reading public information.

The Public Whip and They Work for You look at numerous votes.

Looking at ways to fund it such as private scrapers, or scrapers in a cocoon. Looking at white label for intranet use. There’s a market for data and developers who want to give data. Want to match developers with data. Currently funded by Channel4. Want to remain free for the public.

Does it make people lazy? No, it’s already published but it makes it easier. Movement of people trying to get publishers of data to change. Always a need to pull out in a variety of formats.

Running Hacks and Hackers days working together finding stories and hunting around.

Have had data scraped from What do they know site.



The Iraq War logs – How data drove the story (@jamesrbuk) #hhldn

Filed under: journalism — Sarah Booker Lewis @ 7:45 pm

James Ball from the Bureau of Investigative Journalism

He was the chief data analyst for Dispatches and Al Jazeera, turning the logs into English to help journalists working on the programmes.

Stories on torture; civilian deaths at US checkpoints; 109,032 dead; 183,991 one in 50 detailed; 1,300 allegations of torture against Iraqi troops, 300 against American forces.

US helicopters gun down surrendering insurgents.

US claim to have killed 103 civilians.

Getting the data: Freedom of Information Act, web scrapers, or turn up at an undisclosed location at 1am on Sunday and be told not to go straight home after picking up a USB stick.

It was a 400mb text file. Almost 400,000 documents and almost 40 million words of dense military jargon.

Couldn’t read it or open it up. It’s a data cleaning problem. Had a text file, a comma separated file and these did not work. Dates creeping into wrong columns.
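
This is not how the logs themselves were cleaned, but a first check for values creeping into the wrong columns could look like this minimal Python sketch. The sample rows, column layout and date format are made up.

```python
import csv
import io
from datetime import datetime

# Made-up sample: the second data row has lost a field, so its date has
# shifted into the first column.
raw = (
    "id,date,summary\n"
    "1,2006-01-05,report text\n"
    "2006-01-06,more text\n"
    "3,2006-01-07,\"text, with a comma\"\n"
)


def is_clean(row, width):
    """A row is clean if it has the right number of fields and a valid date."""
    if len(row) != width:
        return False
    try:
        datetime.strptime(row[1], "%Y-%m-%d")
    except ValueError:
        return False
    return True


reader = csv.reader(io.StringIO(raw))
header = next(reader)
good, bad = [], []
for row in reader:
    (good if is_clean(row, len(header)) else bad).append(row)

print(len(good), "clean rows;", len(bad), "to inspect by hand")
```

Setting the suspect rows aside, rather than silently dropping them, keeps them available for manual repair.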

Had to scrap that and look at the MySQL file. Used UltraEdit, which worked really well.

To turn it into something workable was knocking off bits of code.

Dates didn’t work, also inconsistent. Find Google Refine a useful new tool to clean up information.

Old Excel cut off, so you can’t see more than a scrap. Needed to find a way to help people view it when there were a limited number of computers to look at it.

Low tech solutions were small PDFs but these were really helpful.

Always asked what data looks like, so by exporting sections as 800 page PDFs it worked to give something for people to see. Not good for data crunching, but good for reading several hundred reports. Worked well for reporters, particularly when looking at a specific area or torture records.

Used mail merge as a handy way to free out the data.

Ran a MySQL database and got a tech person to build a web interface.

War Logs diary dig is very neat but it’s not the best thing.

Searching for information such as escalation of force, or blue on white, find few reports. Search for friendly actions, find more. These are attacks with civilian categories.

Asking the right questions and searching brought out the right stories. Had to be so sure asking data the right question.

Searched for Prime Minister’s name. Found out more about stories already reported. Data had it from the in-depth. Covered all areas, not just limited to where the few journalists were embedded.

Used great software to show incidents over periods of time. Colour coded to show deaths, civilians, enemies, police, friendlies etc.

Ten thousand killed through ethnic cleansing murders. More people killed in murders than IED explosions, found in data.

Discovered a category of incident marked as causing media outcry. – Tutorial.

Used Tableau to see data. Limit to free version of up to 100,000 records.

Searches of the data found civilians killed at checkpoints due to car bombs exploding.  Had people reading 800 reports to get the real story behind the numbers, too.

Found was great to use, particularly visually without worrying about code.

People liked word highlights and PDF was the best way to use it.

Used the data as part of the research. Didn’t think, let’s do maps and data images, but did.

Had maps showing where fatal incidents happened.

Powerful information, especially when you pull out from central Baghdad.

Team on the ground went out to Baghdad talking to people for Dispatches.

All the data was geocoded. Took an area and pulled out every report from the area. Used in a map view to see what had happened.
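
The Bureau's own tooling isn't described beyond this, but pulling every report from an area can be as simple as a bounding-box filter. A minimal Python sketch, with made-up report IDs and coordinates:

```python
# Made-up reports with coordinates; the real logs are not reproduced here.
reports = [
    {"id": "r1", "lat": 33.31, "lon": 44.36},
    {"id": "r2", "lat": 36.34, "lon": 43.13},
    {"id": "r3", "lat": 33.35, "lon": 44.40},
]


def in_area(report, south, west, north, east):
    """True if the report's coordinates fall inside the bounding box."""
    return south <= report["lat"] <= north and west <= report["lon"] <= east


# A hypothetical bounding box around the area of interest.
area = dict(south=33.2, west=44.2, north=33.5, east=44.6)
nearby = [r["id"] for r in reports if in_area(r, **area)]
print(nearby)  # ['r1', 'r3']
```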

The map helped reporters speak to people on the ground.

Had video of man in a white sedan come out of his vehicle who was then gunned down by an Apache. Found the report in the Iraq log mentioning the sedan using geodata. Report didn’t show the driver getting out and surrendering, the video did.

Checking details found it was within range of Apache, and lawyer cleared the footage for Dispatches.

Information tells story that doesn’t look like a data story. Man shot while surrendering is a stronger story, although he had a mortar tube in his car.

It wasn’t found with clever tricks but 10 weeks, with 25 people reading detailed reports working more than 18 hours a day. 30,000 reports read in detail. 5,000 read closely.


Richard Dixon from The Times asks if the leak will make this type of data more difficult to come across and unlock.

James suggests not because of the way it was leaked.

Francis Irving asked who paid? Funding from the David and Elaine Potter foundation. Dispatches paid a standard fee. Also took a fee from Al Jazeera. This gave a budget to cover research.

Mechanical Turk used for mundane repeat tasks, but something like this is too sensitive for farming out to different nationalities. Needed researchers who were trusted and had been working on it for some time because the information was so sensitive.


Judith Townend asked if there were issues with mainstream media taking up the story. James said it was difficult but explaining the data and making it clear helped. Put across idea it’s battlefield data but trust the data. The numbers change as you’re going through in data journalism.

As people became more comfortable with it, it didn’t become difficult to ‘sell’ at all.

Bureau of Investigative Journalism put all information, maps, animations on the web. Also put the raw data, heavily redacted, online. Wikileaks put it all online.


Great people for journalists to follow on Twitter #ff

Alan Rusbridger’s article today, Why Twitter matters for media organisations, listed a great many reasons for using Twitter.

During my years on Twitter I have found it is a great way to learn and I continue to learn a great deal by following other digital journalists, educators and developers.

In an effort to help journalists stepping into the Twittersphere for the first time I have compiled a list of really useful people to follow and learn from.

Teaching and learning

Paul Bradshaw – Lecturer and social media consultant Online journalism blog – great tips

BBC Journalism College

Clay Shirky – Influential future media blogger

Glynn Mottershead – Journalism lecturer

Andy Dickinson – Online journalism lecturer and links;

Jeff Jarvis – The Buzz Machine blogger and journalism professor

Sue Llewellyn – BBC social media trainer and TV journo

Steve Yelvington – Newsroom trainer

Jay Rosen – Journalism lecturer at NYU

Roy Greenslade – City University, media commentator


Alison Gow – Executive Editor, digital, for the Liverpool Daily Post & Liverpool Echo

Marc Reeves – The Business Desk, West Midlands

Richard Kendall – Web editor Peterborough Evening Telegraph

David Higgerson – Head of Multimedia, Trinity Mirror

Sam Shepherd – Bournemouth Echo digital projects

Jo Wadsworth – Brighton Argus web editor

Matt Cornish – journalist and author of Monkeys and Typewriters

Louise Bolotin – Journalist and hyperlocal blogger

Sarah Booker (me because I try to be useful)

Joanna Geary – Guardian digital development editor and

Adam Tinworth –  Consultant and ex-Reed Business Information editorial development manager

Adam Westbrook – Lecturer and multimedia journalist

Patrick Smith – The Media Briefing

Shane Richmond – Telegraph Head of technology

Edward Roussel – Telegraph digital editor

Damian Thompson – Telegraph blogs editor

Kate Day – Telegraph communities editor

Ilicco Elia – Former Head of mobile Reuters

Sarah Hartley – Guardian local

Jemima Kiss – Guardian media/tech reporter

Kate Bevan – Guardian media/tech reporter

Josh Halliday – Media Guardian

Jessica Reid – Guardian Comment is Free

Charles Arthur – Tech Guardian editor

Heather Brooke – Investigative journalist, FOI campaigner

Kevin Anderson – Journalist, ex BBC, ex Guardian

Wannabehacks – Journalism students and trainees

Simon Rogers – Guardian data journalist and editor of the datastore

Jon Slattery – Journalist

Laura Oliver –

Johann Hari – Journalist, The Independent (personal)

Guy Clapperton – Journalist and writer

Alan Rusbridger – Guardian editor


George Hopkin – Seo evangelist

Nieman Journalism Lab – Harvard

Martin Belam – Guardian internet advisor

Tony Hirst – OU lecturer and data mash up artist

Christian Payne – Photography, video, mobile media

David Allen Green – Lawyer and writer

Judith Townend – Meeja Law & From the Online

Richard Pope – Scraperwiki director

Suw Charman-Anderson – social software consultant and writer

Scraperwiki – Data scraping and information

Chris Taggart – Founder of Openly Local and They Work for You

Suzanne Kavanagh – Publishing sector manager at Skillset, personal account

Greg Hadfield – Director of strategic projects at Cogapp, ex Fleet Streets

Francis Irving – Scraperwiki

Ben Goldacre – Bad Science

Philip John – Journal Local, Lichfield Blog

David McCandless – Information is Beautiful

Flying Binary – Cloud computing and visual analytics

Rick Waghorn – Journalist and founder of Addiply

News sources

Journalism news

Journalism blogs

Mike Butcher – TechCrunch UK

Richard MacManus – Read Write Web

The Media Blog

Press Gazette

Hold the Front Page

Mashable – Social media blog

Media Guardian

Guardian tech weekly

Paid Content

The Media Brief

BBC news

Channel4 news

Channel4 newsroom blogger

Sky News

House of Twits –  Houses of Parliament

Telegraph Technology


Heather Brooke at The web data revolution #iweu live blog

Filed under: journalism — Sarah Booker Lewis @ 4:47 pm

As questions became more complex, the only way questions could be answered was through data.

First issue was with police not showing up when 999 was called. Wondered how often they didn’t turn up.

Didn’t get an answer. Formed basis for Your Right to Know.

Want people to stop being deferential, but get the data before making a decision.

Made 52 FOI requests into response times to 999 calls.

Lots of data coming back as spreadsheets, complex, with 999 calls divided into priorities. Sometimes say how many incidents there were and how they were responded to at the time.

Found inaccurate even if had the facts at all.

Deferential conclusion, if meant to be responded to in 10 minutes, wasn’t kept track. Assumed it had been successful.

Kept coming across attitude public can’t be trusted with the raw data.

Thought Parliament ought to uphold laws. Found stonewalling of expenses requests frustrating.

If MPs won’t disclose, why should councils or hospitals? Taking hold as a symbol.

AK – How MPs get around?

HB – In Silent State, data is collected in the name of the public but the public can’t use it.

Who working for? The public. Seeing a change in attitude. They work for us and have to give out this data.

Information is Beautiful – David McCandless #iweu live blog

Filed under: journalism — Sarah Booker Lewis @ 4:39 pm

Billion Dollar-o-gram



The Billion Dollar-o-gram from Information is Beautiful by David McCandless



Visualising the massive numbers after scraping data from New York Times and Guardian.

Different colours for different uses of vast amounts of cash.

Gives perspective and shows relationships between spending.

US gives more money than UK

Opec gives a tiny proportion to environment

Money spent on deficit could more than wipe out serious diseases like AIDS.

Relate the money spent on daily items with the tax paid per day.

Shows the peak break-up times, created from Facebook data.

Pure example of data journalism. Saturated by data journalism. Ask the right question it will reveal itself.

Data is the new oil.

Data is the new soil, it’s a creative medium and we create flowers

Mountains out of molehill visualisation showing swine flu, killer wasps, using Google insights to track trends.

Hidden patterns, shows the line for video games. Peaks at the same time of year. November big month for buying.

April is a big month for computer game fear as it is the Columbine anniversary. It’s co-dependent.

Data is a prism to correct vision.

USA military budget bigger than the African debt and UK budget deficit. Can fit all the other top 10s.

Budget as proportion, largest is Myanmar and US is seventh.

Big spenders Jordan, Georgia, Saudi, Kyrgyzstan, Burundi, Oman…

China has the biggest army. Huge population.

Adjusted biggest army in terms of percentage and the top five are North Korea, Eritrea, Israel, Djibouti, Iraq.

USA 45th. China is 124th.

Visual CV. Journalism images create the story.

Design literacy and language, Looking for patterns and visual relief.

Speak two languages, data and visual.