At The Guardian, data scientists focus on the story data can tell
Knowing its audience is the key focus for The Guardian, Mary Hamilton, the paper's executive editor for audience, told delegates at the 2017 Big Data for Media Week conference in February.
Hamilton reflected that back in 2011, the main issue facing The Guardian was a lack of data that could offer staff insight into what they were doing right or wrong.
Introducing data into the newsroom was vital, but there were hurdles along the way, she said: “Knowledge about our audience helps us make better decisions in the newsroom … but editors and journalists are not that keen and prefer words.”
Hamilton advocated that news outlets should first think like a journalist: “The job of most journalists is asking interesting questions that get illuminating answers that make people make real change…. The question isn’t what’s the data. It’s what’s the story?”
This focus on the story, with data as a facilitator, is reflected in The Guardian's approach to its editors. Hamilton explained that while The Guardian uses the live stats tool Ophan, its mission is to keep editors' "gut instinct" at the forefront: data is treated as informative, never "underpinning the editor's judgement call."
“There is no good metric, but a front page editor understands that if they change the order of stories on a page, they’ll make a difference. Similarly, a journalist can understand that if they write something that is well read, they will boost traffic.”
The recent emphasis on data at The Guardian has been prompted by a shift towards membership and subscriptions as a business model.
Showing conference participants the Ophan live tool, Hamilton described the various ways The Guardian can now see how audiences engage with its content, noting that physicist Brian Cox had recently caused a spike in engagement for the news outlet because of a tweet.
Other developments include a new tool called Abacus that allows staff to test various questions quickly. Hamilton said that since the introduction of Abacus there have been "three times more tests in the first quarter of this year. That's three times the speed of production."
Abacus "allows us to run multifarious tests that The Guardian owns," Hamilton said. "That's Web, iOS, and Android. The tool tells you if you have a big enough sample and whether there is a significant statistical difference."
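Hamilton did not detail the statistics behind Abacus, but a minimal sketch of the kind of check she describes (is the sample big enough, and is the difference significant?) could be built on a two-proportion z-test. The function names, figures, and thresholds below are illustrative assumptions, not The Guardian's implementation.

```python
# Illustrative sketch of an A/B significance check of the sort Abacus is
# described as performing. Names, data, and thresholds are assumptions.
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test (normal approximation); returns z and p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

def sample_big_enough(conv_a, n_a, conv_b, n_b, minimum=5):
    """Rule of thumb: expected successes and failures in each arm >= minimum."""
    return all(x >= minimum for x in
               (conv_a, n_a - conv_a, conv_b, n_b - conv_b))

# Hypothetical test: variant B of a sign-up prompt against control A.
conv_a, n_a, conv_b, n_b = 120, 10_000, 165, 10_000
z, p = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
if not sample_big_enough(conv_a, n_a, conv_b, n_b):
    print("Keep collecting data: the sample is too small to call.")
elif p < 0.05:
    print(f"Statistically significant difference (p = {p:.4f}).")
else:
    print(f"No significant difference yet (p = {p:.4f}).")
```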
Hamilton concluded by discussing the most recent development at The Guardian, which, after analysing its own comments section last year, introduced an automated tool to speed up the moderating process and enhance user experience.
"So a year ago we started to look at the commenting data set in some depth; we ran a project called 'The Dark Side of Comments.' We analysed 70 million comments on The Guardian site and looked at patterns of removals of those comments. We have comment data going back 20 years," she explained.
The site receives on average 70,000 comments a day, but human moderators read only 10% of them and block a further 7%. Since finding that women and ethnic minority journalists suffer more online abuse, The Guardian has developed a tool called Eirene, which uses signals from the existing comment data set and from elsewhere to predict whether a comment is at high risk of breaching moderation rules.
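Hamilton did not describe Eirene's internals, but a minimal sketch of the general approach she outlines, training a classifier on past comments and their removal decisions to score new ones for risk, might look like the following. The features, model choice, example data, and threshold are illustrative assumptions, not The Guardian's system.

```python
# Illustrative sketch of comment-risk scoring in the spirit of Eirene:
# learn from past moderation decisions, flag risky new comments for review.
# All data, features, and thresholds here are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: past comments and whether moderators removed them.
comments = [
    "Great piece, really thoughtful analysis.",
    "You are an idiot and should be fired.",
    "Interesting point about the data.",
    "Nobody wants to read your garbage.",
]
removed = [0, 1, 0, 1]  # 1 = removed by moderators

# Word n-gram features feeding a simple probabilistic classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(),
)
model.fit(comments, removed)

# Score an incoming comment; route high-risk ones to a human moderator first.
new_comment = "This article is garbage and so are you."
risk = model.predict_proba([new_comment])[0][1]
if risk > 0.7:  # illustrative threshold
    print(f"High risk ({risk:.2f}): route to a moderator before publishing.")
else:
    print(f"Lower risk ({risk:.2f}): publish and sample for later review.")
```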
Yet there is more room for improvement, as Hamilton explained: "We don't yet have the machine learning analysis that [makes] those decisions critically and cleanly."
The presentation ended with a reflection on data at large as Hamilton returned to her main point, that data is a facilitator and not the end goal: "Data isn't the point, it's how you use it. Knowledge doesn't just come from having numbers from a big bucket. So don't just show people the numbers. The key thing to know is what is the question they're trying to answer."
Ella Wilks-Harper is an interactive journalism MA student at City, University of London. She can be reached at ellawilksharper@gmail.com.