Once you’ve wrangled the data, how do you tame it?

By Alyssa A. Botelho

You’ve submitted your public records request and finally received thousands of pages of long-sought data for your next investigative story. But through the carelessness (or perhaps the malice) of your source, you’ve received an “image PDF,” one of those PDFs that don’t allow you to highlight and copy text or search for key terms. This is what freelance reporter Tyler Dukes of reporterslab.org described as an investigative reporter’s “nightmare document.”

The time and labor that investigative reporting projects consume can push costs upwards of $200,000 for major news organizations, Dukes said. It’s an unthinkable price for freelancers with tight deadlines and even tighter budgets.

But in his NASW workshop, “Tools for tackling nightmare documents and data,” Dukes presented an Internet toolkit that can make investigative stories a more feasible prospect.

To this end, he presented three online resources that make handling data and documents cheaper and easier: a PDF-file converter, a document sorting program, and a website that allows you to recruit people online to transcribe your interviews and complete other methodical jobs.

All tools are free of charge except for the last, which Dukes said usually costs between $20 and $50 per job.

The first tool, called CometDocs, is a website that allows you to upload PDF files, including those difficult “image PDFs,” and convert them into Word documents, Excel spreadsheets, and other user-friendly formats. In a live test run, Dukes uploaded a PDF containing a long list of North Carolina traffic violations onto the site and punched in his email address. Within two minutes, an Excel spreadsheet of the data from the PDF had been sent to his inbox.

The second tool, DocumentCloud, is an “entity extraction” program that can filter through PDF files and pull out names, places, organizations, emails, and phone numbers that appear frequently on a page of interest or in the entire file. The program also has a “timeline function” that picks out dates from the document set and displays them in a chronological list with links to the documents where the dates were found. This “extraction” capability, Dukes said, is handy when trying to find power players, key events, and contact information that would otherwise remain hidden without a thorough read.

DocumentCloud was developed exclusively for journalists, and though it is free, users must provide credentials showing that they are working reporters before they can download the program from the web. It can also be used to annotate and share documents among colleagues in a newsroom.
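
To give a rough sense of what “entity extraction” and a “timeline function” involve, here is a minimal Python sketch that pulls emails, phone numbers, and dates out of plain text using regular expressions. It is not how DocumentCloud works internally, which the workshop did not describe. The function name `extract_entities`, the patterns, and the sample documents are all assumptions made for illustration, and the patterns handle only simple U.S.-style phone numbers and dates written like “March 5, 2012.”

```python
import re
from collections import Counter
from datetime import datetime

# Illustrative patterns only (assumptions, not DocumentCloud's rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def extract_entities(documents):
    """Count emails and phone numbers and build a date timeline.

    `documents` maps a (hypothetical) document name to its text.
    Returns (email_counts, phone_counts, timeline), where timeline is a
    chronologically sorted list of (date, document_name) pairs.
    """
    emails, phones, timeline = Counter(), Counter(), []
    for name, text in documents.items():
        emails.update(match.lower() for match in EMAIL_RE.findall(text))
        phones.update(PHONE_RE.findall(text))
        for raw in DATE_RE.findall(text):
            parsed = datetime.strptime(raw, "%B %d, %Y").date()
            timeline.append((parsed, name))
    timeline.sort()  # earliest date first
    return emails, phones, timeline


# Made-up sample text standing in for converted PDF pages.
sample = {
    "memo.txt": "Contact clerk@example.gov or 919-555-0100. "
                "Hearing held on March 5, 2012.",
    "letter.txt": "Filed January 17, 2011; questions to clerk@example.gov",
}
emails, phones, timeline = extract_entities(sample)
for day, source in timeline:
    print(day.isoformat(), source)
```

Counting how often each email or phone number appears is a crude stand-in for the “high frequency” filtering described above. Names, places, and organizations are much harder to catch with patterns alone and typically require a dedicated language-processing tool.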

Dukes’ final tool was the most technical: Mechanical Turk, a crowdsourcing website created by Amazon. The site is an Internet marketplace where a person can advertise a short task that a computer can’t easily do, such as transcribing audio, to a network of Internet users who agree to a small payment set by the person posting the task. To demonstrate, Dukes described how he broke up a 47-minute recording of an N.C. General Assembly meeting into two-minute clips and uploaded each as a short transcription “task” that paid one dollar. Within three hours, he said, the recording was transcribed by a cohort of 25 online workers (normal Internet users who want a quick buck or Amazon credit) at a cost of about 26 dollars. Mechanical Turk workers can browse interview clips before accepting a transcription job, and those who don’t do a good job can be blocked from doing tasks you post the next time around.
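
The arithmetic behind that demonstration can be sketched in a few lines of Python. This is only a planning aid under stated assumptions: the function name `plan_segments` is hypothetical, the clip boundaries are computed in whole minutes, and the code does not cut audio or talk to Mechanical Turk.

```python
import math


def plan_segments(total_minutes, clip_minutes):
    """Return (start, end) minute offsets for each clip of a recording."""
    count = math.ceil(total_minutes / clip_minutes)
    return [
        (i * clip_minutes, min((i + 1) * clip_minutes, total_minutes))
        for i in range(count)
    ]


# Figures from the demonstration: a 47-minute recording, two-minute
# clips, one dollar paid per transcription task.
segments = plan_segments(47, 2)
reward_per_clip = 1  # dollars
reward_total = len(segments) * reward_per_clip

print(len(segments), "clips; last clip covers minutes", segments[-1])
print("Worker payments:", reward_total, "dollars")
```

Splitting 47 minutes into two-minute pieces gives 24 clips, the last only one minute long, for 24 dollars in worker payments. The article does not break down the rest of the roughly 26-dollar total; it may reflect marketplace fees or repeated tasks, but that is a guess.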

Dukes noted that none of these tools is perfect: Excel documents from CometDocs often don’t come perfectly formatted, DocumentCloud offers a large set of tools but none is of the highest quality, and Mechanical Turk can be tricky to price and to use with sensitive interview material.

“But it’s about getting close,” Dukes said. “It’s about taking that haystack — that big discovery process — and making the haystack a lot smaller.”

Read the slides from Tyler Dukes’ workshop on his website at http://www.reporterslab.org/nightmaredocs.
