Visualization: handle with care. (Quite happy with this one. Less happy with the hair. I’ve fixed the hair now, at least.)
Filmed by Steve Allen at sameAs.
Not too long ago – a week ago today, in fact – I gave one of the keynotes at J-P Stacey’s Oxford Geek Nights. The team there video all the talks, so here you are; if you’ve got some spare time, check out the others as well.
If you’d rather read, though, here’s what I was planning to say. The talk melted a bit under the stagelights; this really isn’t an exact transcript. More like a setlist, really, or a business plan, and we all know how long they last in the wild before they become unrecognisable!
Hi there!
You know how it is when you get a phrase in your head and you can’t get it to go away? Kind of like the songs which keep coming back to you, I mean; just a few words that have a good sound, and a few images and impressions which you associate with them, and they become a nucleus for a web of notions and hunches and intuitions: a low-flying flock of ideas.
When J-P asked me if I was interested in coming to speak with you all, this phrase was the one which kept popping back into my head: “accidental journalism”. So I’m going to talk about a few ideas and a story or two: really, I’m trying to work out what the phrase means, because I think it’s important. It’s going to be more philosophical than technical, but I hope you’ll indulge me.
I was at News Innovation: London on Friday, which was an unconference for journalists and people working in these kind-of para-journalism spaces. One of the big themes there was that people are really, really worried about how you fund investigative journalism in the future. Investigative journalism is, in a way, the counterbalancing part of the social contract newspapers have had: they had a pretty-much artificial monopoly on local advertising, but they paid for that, in a hippy karmic kind of sense, by being watchdogs and the trusted information brokers. Now the advertising model’s toast, Craigslist and eBay and Gumtree have seen to that. Maybe Craigslist’ll start sponsoring journalism: they seem pretty socially conscious, and weirder things have happened. But it’s not exactly something you can rely on, is it?
So this really important pillar of society has been a happy accident. There’s that word again; accident. We’ve been relying on serendipity, and our luck’s running out. How do we engineer that kind of good fortune back into our future media?
And you might wonder how we fit into this; my colleagues and I, we’re not exactly traditional journalists. There’s three of us behind Timetric, Dan Wilson, Toby White and me; we all come from science backgrounds, we met doing postdocs designing markup and building data management systems for really absurd volumes of output from quantum mechanical simulations — for us cache is something you find in CPUs, not something you find in dodgy brown envelopes handed to MPs in darkened alleys. One of us isn’t even into the idea of liquid lunches, and that’s definitely a disqualification.
So it does seem kind of unlikely, even given that on top of all the programming you’d expect us to be doing, Dan’s a really really sharp designer (though he’d hate me saying that, it’s true, he’s great) and I’m a full-on obsessive ex-student-radio media-geek, that we’d be the kind of people to do anything about this problem. What we’ve been trying to do is build a really good platform for managing numerical data on the Web. But when you think about that problem, well…
First up, what kind of numbers are really compelling? The ones which give shape to the stories in our lives, is the obvious answer; fuel prices, exchange rates, crime and employment statistics, inflation. And with these, the thing is, their absolute values aren’t as interesting as the trends they represent: whether prices are increasing or decreasing, whether employment’s going up or down. And anything you’re recording is going to make more sense if you can compare it against these big indicators: numbers don’t exist in isolation. If your salary goes up three percent, but inflation’s up four and a half, you just got a pretty substantial pay cut.
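To put numbers on that last bit, here’s a quick sketch in Python. The function name `real_change` is just mine, for illustration; the idea is simply to divide your pay rise by the rise in prices, rather than subtracting one percentage from the other.

```python
def real_change(nominal_rise: float, inflation: float) -> float:
    """Return the real-terms change in pay, as a fraction.

    Both arguments are fractions, so 0.03 means a 3% rise.
    """
    return (1 + nominal_rise) / (1 + inflation) - 1


# A 3% rise while prices go up 4.5%.
change = real_change(0.03, 0.045)
print(f"Real-terms change: {change:.1%}")  # prints "Real-terms change: -1.4%"
```

So a three percent rise against four and a half percent inflation leaves you about 1.4% worse off in real terms.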
So you’ve got to track the history of these numbers over time. You don’t want to build a platform for managing numbers; you want to build a platform for managing, comparing, sharing, recording, and building models on time series, and you want to fill it with data people care about: from the Government, the European Central Bank, the US Federal Reserve, really wherever you can get it. And you want to make it really easy for people to put their own data in.
So we built Timetric. It’s at timetric.com: here’s a quick video showing you some of what you can do with it.
Another thing: a surprisingly big cultural/zeitgeisty thing right now, and over the last couple of years, is that being into data’s becoming, well, cool. Being into data is big, big entertainment business. One of the biggest online entertainment businesses is nothing much apart from massively multiplayer Excel.
And I don’t mean Warcraft. Never played it, cos I suspect if I had, I wouldn’t be here, I’d be locked in a shady corner of a darkened room babbling about orcs. I’m talking about fantasy football and fantasy baseball. Anyone else here ever played them? The basic mechanic is that they turn stats into points, and we all know points make prizes, right? So you wind up playing a numbers game, trying to work out trends and predict performance. It’s applied statistics gone redneck. It’s great.
And fantasy sports techniques have been bleeding back into real football and real baseball. Here’s a site called FootballOutsiders, which I really, really like: they’ve been breaking down and analyzing the tapes of NFL games so they can get better data to feed into their models. They’ve got some fantastically insightful stuff, and it’s pressuring the mainstream sports journalists to get more in-depth in their analysis: less cliche, more thought.
Baseball’s even further down that road: there are whole teams built around what they call “sabermetric” principles. There’s a book called Moneyball which is meant to be great: I’ve got to admit I haven’t got round to it yet.
And speaking of baseball, one of the guys behind Baseball Prospectus is an economist called Nate Silver. He applied his economics training, and the models he built predicting how baseball pitchers would develop, to predicting how states would vote in the last US election on his site, fivethirtyeight.com. And he went from zero to Colbert Report guest in about two months.
And I haven’t even mentioned Freakonomics yet. Or the Armchair Economist. Or Bad Science. Everywhere you look, this is happening: a lot of the coolest, best new insight is coming from data-analysis. So for all the talk about geek chic, data’s where the action is. Or, at least, that’s what I tell myself at two in the morning when I’m staring at another bloody spreadsheet.
But, seriously: I think there’s something deep here, and that’s that if you can make it fun, make it easy to look at data, people will look through it, and they will find stuff. What you’ve got to do is make it really quick to go from first hypothesis — “I wonder if this is correlated with that?” — to first test. If it takes thirty seconds, you’re in there; if it takes ten minutes, or an hour, only the most obsess… um, dedicated, will get there.
So that’s a kind of quantitative measure of goodness. How much effort is it for people to ask a simple question of the data you’re making available? I used to be a theoretical chemist and a hillwalker, and I’ve seen this kind of thing a lot. The barrier here’s how much effort it is to ask something; make that low and you’ll get lots more questions, some of those will have interesting answers, and hey presto, by accident you’ve engineered a lot more journalism.
There’s those words again. It makes some kind of sense too. Was anyone else at OpenTech? It’s like something Ben “Bad Science” Goldacre said there: he was talking about building a ’shits and giggles’ economy in his ongoing war on the stupid and the venal, and if you can’t pay for journalism with cash, you’d better make it fun for people to get involved and really easy for people to share what they’ve done, so they can get the kudos, the respect of their peers, and the attention of their admirers.
If you can make it playful for geeks too, you’re on a big winner. You need to have good APIs, and we reckon we do; we’re on the RESTy end of things, and you can do things like take these little embedded microcontrollers from ARM, hook them up to sensors, and have them post data at series on Timetric to log it in real time. We think there’s a bunch of interesting potential applications there; if you want to know more, grab me later. Anyway, that was a bit of a digression, so moving on…
If you’re going to think about the antithesis of fun, though, government websites would be high on the list. Take the Office for National Statistics. Has anyone here actually tried to use their website? A few of you, then. It’s painful. It’s about the opposite of playful, whatever that is. It makes you want to avoid asking questions. Maybe that’s the effect they were going for – it makes it really, really hard to ask questions. It’s weird: I kind of look forward to using that website, because it feels so, so good when you stop and go and look at something, anything else.
Have a look at this. It’s kind of comic, really, how hard it is to get a relatively simple bit of data like headline inflation out of this website; I’ve not counted the number of clicks, but it’s a lot. And once you’ve got there, you get a HTML table, which you’d need to chuck into a spreadsheet, then turn into a graph, then upload that graph somewhere, and only then can you put it into your blog or whatever.
That sucks. That really, really sucks.
So a while back we reverse-engineered the file format beneath all of these things, and uploaded the data to Timetric, and worked out how to break down the titles and glossaries associated with the data to put good tags on it, and once you’ve done all of that, you get a search interface like this.
[video]
And with this it’s really, really quick to grab some data, compare it, ask new questions, generally just mess around and test theories. One thing we did was work out how much lunch costs. We like cheese sandwiches, and the government publishes how much the ingredients — cheese, bread, butter — cost, because they’re part of the basket of staple foods that the government uses to work out inflation. And using Timetric, you can find the data, graph it, compare it, put in a formula which will be kept up to date whenever new inflation data comes in, and here you are; a graph of how much it’d cost you to make yourself a nice cheese sandwich, going back to the early seventies.
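If you want a feel for what a formula like that is doing, here’s a rough sketch in Python. It isn’t Timetric’s formula language or our actual recipe: the quantities in `RECIPE_KG`, the function name `sandwich_index` and every price below are invented placeholders, not real government figures. The point is only that a custom index is a weighted sum of other series, worked out afresh for each date.

```python
# Illustrative recipe: kilograms of each ingredient in one sandwich.
# These quantities are guesses for the sake of the example.
RECIPE_KG = {"bread": 0.08, "cheese": 0.05, "butter": 0.01}


def sandwich_index(prices_per_kg: dict[str, dict[int, float]]) -> dict[int, float]:
    """Combine per-ingredient price series (year -> price per kg) into one series."""
    # Only keep years for which every ingredient has a price.
    years = set.intersection(*(set(prices_per_kg[name]) for name in RECIPE_KG))
    return {
        year: sum(RECIPE_KG[name] * prices_per_kg[name][year] for name in RECIPE_KG)
        for year in sorted(years)
    }


# Placeholder prices, invented for this example; NOT real published data.
prices = {
    "bread": {2007: 1.00, 2008: 1.25},
    "cheese": {2007: 6.00, 2008: 7.00},
    "butter": {2007: 4.00, 2008: 5.00, 2009: 5.50},
}

for year, cost in sandwich_index(prices).items():
    print(year, round(cost, 2))
```

Because the index is recomputed from its inputs, adding a new year of prices to each ingredient gives you a new point on the sandwich graph without touching the formula.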
That’s fun! And it’s useful, because how much a cheese sandwich costs is something which anyone can relate to. It’s a custom index, like the FTSE 100 is an index. And the implications of a platform where you can aggregate all the models that people build and mine them for the relationships they reveal between concepts is really exciting.
But it turns out that a lot of the playfulness, the secret sauce which makes it fast to find the data you want and do something with it, is in the metadata, and that’s not a phrase I ever expected I’d be able to say straight-facedly.
So we’ve built this thing, and it turns out that our first customers were the Guardian. We like them a lot, and not just because they pay us and sponsor events like this, though gotta say, they do have good taste. We’d been speaking with them for a while, but it was at the Rewired State event at their offices in early March where things started really moving forward. (Its tagline: “The Government isn’t very good at computers. They spend millions to produce mediocre websites, hide away really useful public information and generally get it wrong. Which is a shame.” It’s hard to argue with that.) We’re making the data in the Guardian Data Store, which is essentially a collection of Google spreadsheets filled with data curated by Guardian journalists, more fun and more appealing. So we’ve been doing a few things with them, and it’s been going well.
But, as it happens, the first story we did with them was one we cooked up; no pun intended, but I did promise you drugs. So here’s a story of a real piece of accidental journalism which happened through Timetric.
Part of my job’s getting interesting data into Timetric. We’ve been writing about bits of it, as and when, on our data blog, Byline – that’s at byline.timetric.com. One of the bits of data the Home Office were pushing out looked particularly juicy: drug seizure data. The number of drug busts and the purity of the gear lifted by Customs and by the police. I had to get that into Timetric, alongside some street-drug-price index data from the EMCDDA, the pan-European drug monitoring agency.
So from an email to an Excel spreadsheet to Timetric, and then I wrote a blog post about it; here that is. Maybe the most compelling graph in it was showing the precipitous fall in the quality of cocaine on the streets: that was a nice little story buried in the data, and it was much easier to find once I could upload the data and click around in it to see what we might be able to find. Gary Penn, the games journalist, talks about the “toyetics” of systems: that’s an idea I picked up from Matt Jones of Schulze and Webb and latterly Dopplr (http://www.slideshare.net/blackbeltjones/interesting2007?src=embed). By making the data into a toy, making it playful, we let the story emerge from it.
Anyway, we were visiting the Guardian later that week, and Simon Rogers, a journalist there, was rather taken with the tale, and put it up on the Data Blog. We were really excited. A story we’d originated on a national newspaper! Using our software! Really cool. That was a good Friday. Here’s what it looked like on the Guardian.
And then it got weird. Vice Magazine picked the story up. Seriously. And they connected it to a pet theory I’d been carrying around for years: the hidden links between how fast and aggressive dance music is and what the drugs are like…
…and the following Monday, this remarkably similar story appears on the BBC; except they’ve got more recent data there. The journalist there said he hadn’t seen our story, and to be honest I believe him, but that’s not to say that someone else didn’t see it and suggest that it was worth looking into. Even if it was a coincidence, we beat them to press by three, four days, and we were only being journalists by accident!
Accidental journalists. Serendipitous journalism. And that’s where we came in.
So here we are. We’re three erstwhile academics with a thing about data, based in a small office in Cambridge, and through writing systems which make it easier and more fun to play around with datasets, we’re helping national newspapers explain and break news stories.
That’s tremendously exciting. Really motivating. And I reckon everyone in this room’s got stories in them waiting to get out, either by being given the right tools or by going out there and building them yourselves. Of course, I hope you’re motivated to go and have a play with ours, record and upload some data, beat on our APIs, play around and see what you can find. Stick what you’ve come up with on a blog, tell us about it, and we’ll link to it. And through that, together maybe we can find some interesting things.
Cheers for your time, and if you’ve got any questions, grab me later and I’d love to talk with you all. Thanks very much!
Today’s been pretty great, really – I’ve been at Rewired State in London with some very smart political-programmer types. There’ll, hopefully, soon be links out there to a lot of the things people’ve been making here, but for now here’s what I’ve been up to today…
I’m really into the idea of data journalism. People are justly cynical about statistics, and I reckon part of this is because they can’t see them – the data isn’t put in front of people for them to see, and play with, themselves, so why should they trust it? It’s just another bit of talking-head nonsense if you can’t see and use the facts. And, usually, the facts are in some bunch of Excel spreadsheets on a backwater government website.
But journalists and bloggers shouldn’t really have to be computer programmers to be able to do their jobs.
I live in Cambridge, along with…
other voters. (Where they all went in 2007 and 2008, well, your guess is as good as mine!). Now, probably the number one thing most people care about from their council is how much their council tax is. For an average house – a Band D – in Cambridge, that’s…
around £1350 a year. (I pay a bit less, ’cause I rent a flat, but still.)
You’re better off to live somewhere nice and (relatively) rich, like Cambridge, though…
From bottom to top: the City of London, Cambridge, and Liverpool’s council tax. There must be a story in there somewhere.
There’s a weird spike in the City of London’s council tax around ‘02-’03, too. Wonder why. How many people live there?
Not very many – a few thousand. Is there any connection between the number of people living there and the council tax?
Could be. Correlation isn’t causation and all that, but you can’t rule it out.
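If you’d rather put a number on “could be”, the usual first test is a Pearson correlation coefficient between the two series. Here’s a minimal sketch in Python; the function name `pearson` is mine, and the figures at the bottom are made-up placeholders rather than the real City of London data.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder figures, invented for this example; NOT real data.
population_thousands = [7.2, 7.4, 7.1, 7.8, 8.0]
council_tax_pounds = [780.0, 800.0, 770.0, 850.0, 845.0]

r = pearson(population_thousands, council_tax_pounds)
print(f"r = {r:.2f}")
```

A value near 1 or -1 says the two move together; it still doesn’t tell you why.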
That’s all very well, but what’s my point here? I’ve got a couple, really: people shouldn’t have to convert this data every time they use it, and there’s value in simply collecting all this data into one place – you can start trying out connections and seeing if they work. If it’s too hard to get the data into a form you can use, people won’t bother; it’s a rubbish use of their time, if nothing else.
So I’ve converted the data…
… and now it’s way easier to talk about this stuff with the facts to hand.