Sarah Booker

29/11/2010

Data projects presented at #hhhrbi

Filed under: journalism,technical — Sarah Booker Lewis @ 6:45 pm

Transparency

How your money is really being spent.

Wanted to look at local government spending in various areas. Looked at government account figures published on Number 10 website.

Government temp spending is triple its own staff budget

Found a page with 190,000 individual data entries.

Had one person writing in Java and one in Ruby, and found the data was a bit rubbish.

Date columns were either empty or held the number of days since 1900 (Excel's internal serial date format).
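Those "days since 1900" values are Excel's serial date format, which can be converted back to readable dates. A minimal sketch (the serial number in the example is just an illustration):

```python
from datetime import date, timedelta

# Excel stores dates as days since its epoch. Because Excel wrongly
# treats 1900 as a leap year, serial numbers from March 1900 onwards
# convert cleanly from an epoch of 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial):
    """Convert an Excel serial day number to a real date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))

print(excel_serial_to_date(40505))  # 2010-11-23
```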

You cannot trust the data; you have to ask whether it's correct and whether it can be validated.

Had a massive amount of data and tried to break it down by agency and temp staff, cutting back a massive spreadsheet.

Used Zoho(?) where you can see things pretty quickly.

Visuals were created once the costs were separated. You need to dig deep into the data to find the quirks.

Taking home: learning the accuracy of the data, building a structured database, finding other axes of investigation, getting the data clean, updating it automatically.

Is it worth it?
Took extensive salary data.

Put in a location and a job, and the function shows whether it's worth living there.

A Welsh teacher earning £45,000, not competitive.

For someone in London working as an accountant at £45,000, the data showed 16 applicants per job, making it a 50/50.
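The "is it worth it?" check can be imagined as a small function comparing a salary with a regional average and factoring in competition for the job. A toy sketch — the salary table, the weighting and the function names are invented for illustration, not the hack day team's actual model:

```python
# Hypothetical regional average salaries, £/year.
REGIONAL_AVERAGE = {
    ("teacher", "Wales"): 52000,
    ("accountant", "London"): 45000,
}

def worth_it(job, region, salary, applicants_per_job):
    """Return (pay ratio vs regional average, crude odds of getting the job)."""
    avg = REGIONAL_AVERAGE[(job, region)]
    pay_ratio = salary / avg              # below 1.0 means under the average
    competition = 1 / applicants_per_job  # crude odds of landing the job
    return pay_ratio, competition

ratio, odds = worth_it("accountant", "London", 45000, 16)
print(ratio, odds)  # 1.0 0.0625
```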

From the initial data service a map was created where you can choose a function, a job title and a region to find out visually whether a job is worth it. It drills down per region.

Can also zoom in to regions using a slider system.

A splendid and complex visualisation. (The winner)

Truck stops

Started with the idea of truck stops and which ones were safe.

Started looking for data on the Highways Agency site and found it wanting.

Found a map with decent truck stop sites.

Had the xml source and started to develop a scraper on Scraperwiki and got a view on Google maps.
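A scraper along those lines can be sketched with Python's standard library. The XML element and attribute names below are illustrative, not the real feed's schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML in the shape a truck-stop map source might use;
# the element and attribute names are made up for illustration.
SAMPLE = """
<stops>
  <stop name="A1 Services" lat="52.10" lon="-0.20" cctv="yes" fenced="yes"/>
  <stop name="M6 Layby" lat="53.40" lon="-2.50" cctv="no" fenced="no"/>
</stops>
"""

def scrape_stops(xml_text):
    """Parse truck stops and score each on simple security measures."""
    rows = []
    for stop in ET.fromstring(xml_text).iter("stop"):
        security = sum(stop.get(k) == "yes" for k in ("cctv", "fenced"))
        rows.append({
            "name": stop.get("name"),
            "lat": float(stop.get("lat")),
            "lon": float(stop.get("lon")),
            "security_score": security,  # would feed the letter rating on the map
        })
    return rows

for row in scrape_stops(SAMPLE):
    print(row["name"], row["security_score"])
```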

Plotted all the points. A letter on each point shows how safe it is, based on analysing which stops had CCTV and various other security measures.

Further on, they wanted to find out more about truck crime. They looked at the TruckPol website, took the data from PDFs and put it in a spreadsheet.

Updated the view with the information about crimes: red points are not so great, blue are good and purple is okay.

(Winner of the best scraper award from Scraperwiki and third place overall).

Take over watch

The UK Takeover Panel was the prime source of information showing all takeovers in play. The aim was to create something to provide details about companies.

Had scraped data but needed to add sector and revenue to create context.

Also used Investigate.co.uk

Had a live table showing activity from the last two days.

Have different sectors and can pull information out to see what’s happening in different areas

Snow Holes

Creates a map showing areas affected by snow and see where the nearest snow hole is. (See snow hole blog)

Plantlife

How people move around the chemical world

Used Google Refine to play with the data. Pulled out the geocode to map where the companies were.

Google Fusion also used.

Top 100 chemical companies. Merged Google finance information with Isis.

Created a visual showing how sales had fallen, with chemical industry sales halving from 2007 to 2008.

(Second place)

Creating something visually stimulating from data #hhhrbi

Filed under: journalism,technical — Sarah Booker Lewis @ 12:56 pm

We were quite a large group to start with, so we've ended up splitting in two. One group is working on scraping details of registered care homes, and I'm in a group working on turning information already gathered into an interesting and informative visual.

Our first battle was making sure Scraperwiki could read our data so we could work with it.

First of all I uploaded it to Google Docs, but the comma separated values (CSV) scraper didn't like it. Then, when the spreadsheet was published as a web page, as suggested, it still wasn't happy because it wanted users to be signed into Google.

Matt suggested putting the CSV onto his server, so I exported it and sent it over to him.
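Once the CSV is reachable over plain HTTP, reading it from Python is straightforward. A minimal sketch using only the standard library, parsing from an inline string for illustration — the column names and values are made up:

```python
import csv
import io

# Illustrative CSV of the kind exported from a spreadsheet; in a real
# scraper this text would be fetched from the server over HTTP.
CSV_TEXT = """name,area,value
Example One,Brighton,24
Example Two,Worthing,18
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)
print(rows[0]["name"], rows[0]["value"])  # Example One 24
```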

Francis Irving also suggested scraping What Do They Know, because it was Freedom of Information data.

After much fiddling Matt managed to pull out the raw data by popping (pulling from the top of the list) and using a Python scraper.

It turned out the data we had was so unstructured it wasn’t possible to work with it.

After lunch we’re working on a different project.

Introduction to ScraperWiki #hhhrbi

Filed under: journalism,technical — Sarah Booker Lewis @ 10:44 am

Francis Irving of Scraperwiki explains how it works.

Take the Gulf oil spill. You can find a list of oil fields around the UK, but it’s all in a strange lump.

He shows a piece of Python code that reads the oil field pages and turns them into structured data.
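A scraper of that kind boils down to walking the page's markup and collecting the cells. A minimal sketch using only Python's standard library — the toy page below stands in for the real oil field listing, whose actual structure I don't know:

```python
from html.parser import HTMLParser

# A toy page in roughly the shape of a listing of oil fields;
# the contents are illustrative, not the real page.
PAGE = """
<table>
  <tr><td>Brent</td><td>North Sea</td></tr>
  <tr><td>Forties</td><td>North Sea</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect table rows as lists of cell texts."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

parser = CellCollector()
parser.feed(PAGE)
print(parser.rows)  # [['Brent', 'North Sea'], ['Forties', 'North Sea']]
```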

It's quite simple to make a map view, and you can also write code to make more complicated views.

Scraperwiki is automatic data conversion.

Scrape internet pages, parse the data, organise it, collect it and model it into a view. It will keep running and keep the dataset constantly updated.

There are two kinds of journalism to do with the data: you can make tools, including specific tools, or you can find a story.

In Belfast a team took a list of historic houses in the UK. The data scraper looked through a host of websites using Python (Ruby can also be used).
There are a multitude of visuals available. The Belfast project showed a spike in 1979, which was explained by a political sectarian issue.

Answering a question, Francis confirms you can scrape more than one website at a time.

Francis would like to see more linked data and merging datasets together.

Asked about licensing for commercial use. Francis says it’s mainly used for public data. Scraperwiki blocks scraping Facebook because it’s private data, but the code can be adjusted.

Areas of interest for projects today are: farming, local government budgets, public sector salaries, mapping chemical companies and distributors, environment, transport, road transport crime, a truck stops map, energy data, country profiles linked to carbon emissions, e-waste, airline data, plastics data, empty shops, infotainment to make users interested in the data, another visualisation ranking companies based on customer reviews, using the crowd to share information and create interesting information with data, data annotating and enriching content, health data… and anything else we're doing.

24/11/2010

Programming for the public (@frabcus) #hhldn

Francis talking about two different stories on the internet.

It used to be the case you had to check the division list to find out how MPs voted.

Created a web scraper pulling out the information and created The Public Whip, showing how MPs voted.

Have to be a parliament nerd to understand, even when it’s broken down.

They Work for You simplifies the information even more, it tells you something about your MP.

Bring the division information together: take a list from Public Whip and create a summary of how they voted.
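Such a summary can be sketched as a small function over a list of votes. The vote data and the wording thresholds here are invented for illustration, not Public Whip's actual method:

```python
from collections import Counter

# Toy record of an MP's votes on one policy area; the values are made up.
votes = ["aye", "aye", "aye", "absent", "absent", "absent", "absent"]

def summarise(votes):
    """Turn a raw vote list into a plain-English leaning."""
    counts = Counter(votes)
    cast = counts["aye"] + counts["no"]
    if cast == 0:
        return "never voted"
    leaning = "for" if counts["aye"] >= counts["no"] else "against"
    # "strongly" when every cast vote went the same way, else "moderately".
    strength = "strongly" if max(counts["aye"], counts["no"]) == cast else "moderately"
    return f"voted {strength} {leaning}"

print(summarise(votes))  # voted strongly for
```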

Checking how one MP voted on the Iraq War: he voted with the majority in favour of the war on three votes, and abstained from the first and then the final three. It's almost a deal with the electorate.

An MP asked to have "voted moderately" removed because they found it misleading. A number of MPs have complained, but the votes were checked.

Richard Pope, founder of Scraperwiki, created the PlanningAlerts.com website after the demolition of his local pub (a fine-looking establishment called The Queen).

It helps people access information from outside the immediate catchment area. He wrote lots of web scrapers, for example for different councils' planning application systems.

Scraperwiki is like Wikipedia, but for data. It's a technical product for use when you're not technical. You can look at different data scrapers and copy what others are doing without learning Perl or Python.

Planning Alerts is being moved over to Scraperwiki. Can tag it on Scraperwiki and find information. Can find stories and in-depth information.

Can request a dataset and have something built for you.

Francis was asked: is it legal? In the UK, if it's public data and not for sale, you can reuse it. They would take things down if asked, but it's open stuff.

Could it be stopped? Would be ill-advised to stop people, and journalists, reading public information.

Public Whip and They Work for You look at numerous votes.

Looking at ways to fund it, such as private scrapers, or scrapers in a cocoon. Looking at a white label for intranet use. There's a market for data and developers who want to supply data; they want to match developers with data. Currently funded by Channel 4. They want to remain free for the public.

Does it make people lazy? No — the data is already published; it just makes access easier. There's a movement of people trying to get publishers of data to change. There is always a need to pull data out in a variety of formats.

Running Hacks and Hackers days working together finding stories and hunting around.

Have had data scraped from the What Do They Know site.

19/11/2010

Great people for journalists to follow on Twitter #ff

Alan Rusbridger‘s article today, Why Twitter matters for media organisations listed a great many reasons for using Twitter.

During my years on Twitter I have found it is a great way to learn and I continue to learn a great deal by following other digital journalists, educators and developers.

In an effort to help journalists stepping into the Twittersphere for the first time I have compiled a list of really useful people to follow and learn from.

Teaching and learning

Paul Bradshaw – Lecturer and social media consultant Online journalism blog – great tips  Twitter.com/ojblog

BBC Journalism College

Clay Shirky – Influential future media blogger

Glynn Mottershead – Journalism lecturer

Andy Dickinson – Online journalism lecturer and links; twitter.com/linkydickinson

Jeff Jarvis – The Buzz Machine blogger and journalism professor

Sue Llewellyn – BBC social media trainer and TV journo

Steve Yelvington – Newsroom trainer

Jay Rosen – Journalism lecturer at NYU

Roy Greenslade – City University, media commentator

Journalists

Alison Gow – Executive Editor, digital, for the Liverpool Daily Post & Liverpool Echo

Marc Reeves – The Business Desk, West Midlands

Richard Kendall – Web editor Peterborough Evening Telegraph

David Higgerson – Head of Multimedia, Trinity Mirror

Sam Shepherd – Bournemouth Echo digital projects

Jo Wadsworth – Brighton Argus web editor

Matt Cornish – journalist and author of Monkeys and Typewriters

Louise Bolotin – Journalist and hyperlocal blogger

Sarah Booker (me because I try to be useful)

Joanna Geary – Guardian digital development editor twitter.com/joannageary and  twitter.com/joannaslinks

Adam Tinworth –  Consultant and ex-Reed Business Information editorial development manager

Adam Westbrook – Lecturer and multimedia journalist

Patrick Smith – The Media Briefing

Shane Richmond – Telegraph Head of technology

Edward Roussel – Telegraph digital editor

Damian Thompson – Telegraph blogs editor

Kate Day – Telegraph communities editor

Ilicco Elia – Former Head of mobile Reuters

Sarah Hartley – Guardian Local

Jemima Kiss – Guardian media/tech reporter

Kate Bevan – Guardian media/tech reporter

Josh Halliday – Media Guardian

Jessica Reid – Guardian Comment is Free

Charles Arthur – Tech Guardian editor

Heather Brooke – Investigative journalist, FOI campaigner

Kevin Anderson – Journalist, ex BBC, ex Guardian

Wannabehacks – Journalism students and trainees

Simon Rogers – Guardian data journalist and editor of the datastore

Jon Slattery – Journalist

Laura Oliver – Journalism.co.uk

Johann Hari – Journalist, The Independent (personal)

Guy Clapperton – Journalist and writer

Alan Rusbridger – Guardian editor

Specialists

George Hopkin – SEO evangelist

Nieman Journalism Lab – Harvard

Martin Belam – Guardian internet advisor

Tony Hirst – OU lecturer and data mash up artist

Christian Payne – Photography, video, mobile media

David Allen Green – Lawyer and writer

Judith Townend – Meeja Law & From the Online

Richard Pope – Scraperwiki director

Suw Charman-Anderson – social software consultant and writer

Scraperwiki – Data scraping and information

Chris Taggart – Founder of Openly Local and They Work for You

Suzanne Kavanagh – Publishing sector manager at Skillset, personal account

Greg Hadfield – Director of strategic projects at Cogapp, ex-Fleet Street

Francis Irving – Scraperwiki

Ben Goldacre – Bad Science

Philip John – Journal Local, Lichfield Blog, twitter.com/hyperaboutlocal

David McCandless – Information is Beautiful

Flying Binary – Cloud computing and visual analytics

Rick Waghorn – Journalist and founder of Addiply

News sources

Journalism news

Journalism blogs

Mike Butcher – TechCrunch UK

Richard MacManus – Read Write Web

The Media Blog

Press Gazette

Hold the Front Page

Mashable – Social media blog

Media Guardian

Guardian tech weekly

Paid Content

The Media Brief

BBC news

Channel4 news

Channel4 newsroom blogger

Sky News

House of Twits –  Houses of Parliament

Telegraph Technology
