The ethics of scraping data in journalism

April 6, 2020

By Helen Wieffering  

In early January, The Detroit News published a sweeping investigation that showed major flaws in the city’s tax system. Officials had failed to reassess home values in the years since the Great Recession, overtaxing homeowners by more than a half billion dollars and putting thousands on the path to foreclosure.

The story was only possible with property data from the county treasurer’s office. But when The News requested a list of indebted homes, it received a bill for $235,000.

Reporters scraped the website instead of paying the fee.

Here, scraping was used to powerful effect when a request for data hit a dead end. But when is it ethical to circumvent officials and data gatekeepers? And more broadly, what are the ground rules for scraping as a journalist?

Data is everywhere on the web, from government agencies to shopping sites, and much of it bears on the public interest. It can take many forms: a table of baseball wins and losses, price fluctuations on Amazon, or details from a layoff notice, to name a few.

“Scraping” data is a programming tactic that automates the tedium of copying and clicking through online data. Despite the single term, scraping involves multiple steps: analyzing a website’s structure, “crawling” its pages for data and organizing the results.
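In practice, those steps can take only a few lines of code. Here is a minimal sketch in Python, a common choice in newsrooms; the URL and the table layout are hypothetical placeholders, not a real government site:

```python
# A minimal scrape: fetch one page, pull the rows out of an HTML table,
# and save them to a CSV file. Uses the requests and beautifulsoup4 libraries.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.gov/tax-records"  # placeholder, not a real endpoint

response = requests.get(URL, timeout=30)
response.raise_for_status()  # stop early if the page didn't load

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table tr"):  # analyze: find the table's rows
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)  # crawl: collect each row's cells

with open("records.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # organize: write the results to CSV
```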

Scraping is especially useful for investigative journalists; it’s primed to document the inner workings of a system. The Columbia Journalism Review describes a growing number of “essential” stories that have relied on scraping, including The Atlanta Journal-Constitution’s “Doctors & Sex Abuse” series and a Reuters investigation that unearthed a spree of adoption posts on Yahoo. The former, like the Detroit investigation, began with a stymied request for public records.

Jonathan Soma helps teach journalists how to scrape data as director of the Lede Program, a summer course in data journalism at Columbia University. He said the most common concern he hears from students is whether journalists are legally allowed to scrape.

“And my response, much to everyone’s chagrin, is you can always be sued for anything,” he said.

Soma’s irreverence comes from closely following the lawsuits around scraping. Most judges have ruled in its favor, he said. In Ticketmaster Corp. v. Tickets.com, Inc., for instance, the court found that copyright law does not protect facts. A more recent decision in hiQ Labs v. LinkedIn allowed hiQ Labs to scrape public LinkedIn profiles and sell the results.

An anti-hacking law called the Computer Fraud and Abuse Act might give journalists pause, as it bars users from accessing websites “without authorization.” But no journalists have been prosecuted under the CFAA, according to the Columbia Journalism Review, which described the courts as “reluctant” to punish scraping as a form of breaking and entering.

“You can pretty much ignore what people say about how you can’t scrape their site,” Soma said.

Legal specifics aside, journalists who scrape data are still tasked with translating their ethics to the realm of technology.

“Regardless of whether you have permission or don’t have permission, it’s important to be responsible,” said Todd Wallack, a Boston Globe reporter. Wallack has used scraping to uncover racial bias on Craigslist and rampant fraud within nonprofits. As a gesture of transparency, Wallack typically includes his name, email address and title in his scraping code so that IT specialists don’t suspect a malicious attack.
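One common way to make that gesture is to send contact details in the request’s User-Agent header, which shows up in a site’s server logs. A sketch of the idea in Python; the name, outlet and email address below are placeholders, not Wallack’s actual code:

```python
import requests

# Placeholder contact details: a reporter would substitute their own name,
# outlet and email so administrators see a journalist, not an attack.
headers = {
    "User-Agent": (
        "NewsroomScraper/1.0 "
        "(Jane Reporter, Example Gazette, jane.reporter@example.com)"
    )
}

response = requests.get("https://example.gov/records", headers=headers, timeout=30)
```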

Ethical scraping also requires forethought to ensure the code, and the reporter by extension, does no harm. Aggressive or deadline-driven code could make too many requests to a webpage too quickly and crash the site.
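A simple safeguard is to pause between requests so the scraper behaves more like a patient reader than a flood of traffic. A sketch of that kind of throttling, again with a placeholder URL:

```python
import time

import requests

PAGE_URL = "https://example.gov/records?page={}"  # placeholder endpoint

for page in range(1, 51):
    response = requests.get(PAGE_URL.format(page), timeout=30)
    response.raise_for_status()
    # ... parse and save this page's records here ...
    time.sleep(2)  # wait between requests so the server isn't overwhelmed
```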

At the 2016 NICAR conference, journalist David Eads recalled learning that lesson firsthand. Eads had been scraping inmate information from the Cook County Jail in Illinois when he accidentally overloaded the county’s server.

“Our scraper actually took down the inmate tracker,” Eads said in his presentation. “Public defenders, family, friends, are relying on this thing to know where they can call their loved one, or where they can call their client. They needed to know when the next court date is.”

Eads’ scraper also dealt with inmates who were innocent until proven guilty — an important privacy concern, he said. Scraping often generates a dataset that didn’t exist in the public realm previously. Just as they would with any source, journalists should verify and carefully present their results.

Steve Doig, a Pulitzer Prize-winning journalist and professor at Arizona State University, echoes that sentiment around publishing data. “It’s one thing for it to be on a perhaps obscure government site, but (another) for it to be published in the newspaper,” he said.

Doig takes a more cautious approach to scraping than Soma. In his view, journalists should have a goal for every scraping request; it should always be done in pursuit of stories. Wallack, too, thinks of scraping as a tool of last resort. Requesting the data, he said, might actually turn up more information than is offered online.

For government data, Doig thinks journalists shouldn’t have to scrape at all. “If the data needs to be gathered by the public agency to do its job … (and) public money is being used to do that job, then the data should be available to everybody,” he said.

He saw nothing but persistence in how The Detroit News obtained the property data. After all, scraping saved the county time and effort, and it made the $235,000 bill moot.

Though it’s easy to get bogged down in technical concerns, scraping remains an important reporting tool, one rooted in a fundamental search for truth.

In the months since The News scraped the data and published its story, Detroit has begun making plans to compensate homeowners who were overtaxed by the city’s mistakes.

This piece was written for an investigative journalism course taught by Walter V. Robinson at the Walter Cronkite School of Journalism and Mass Communication. Many thanks to the experts who shared their thoughts and concerns.

Endnotes:

1. Read the backstory of The Detroit News investigation, written by Christine MacDonald.

2. Scraping definition from Krotov, V., & Leiser, S. (2018). Legality and ethics of web scraping. Twenty-Fourth Americas Conference on Information Systems, p. 2.

3. The Atlanta Journal-Constitution scraped data for its Pulitzer-finalist series on sexual abuse by doctors. Source: Columbia Journalism Review, Sept. 2018: “When [AJC] reporters’ public record requests to medical boards and regulatory agencies in every state yielded very little return, the newsroom’s data journalism team wrote multiple scripts that crawled the regulators’ websites.” https://www.cjr.org/tow_center_reports/data-journalism-and-the-law.php

4. David Eads quote sourced from IRE audio recording in 2016: “Best practices for scraping: From ethics to techniques.” https://www.ire.org/resource-center/audio/820/
