Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

How automation can save you on deadline

leave a comment »

In the data journalism world, data dumps are pretty common occurrences. And often times these dumps are released periodically — every month, every six months, every year, etc.

Despite their periodic nature, these dumps can catch you off guard. Especially if you are working on a million things at once (but nobody does that, right?).

ss

Case in point: Every year the Centers for Medicare & Medicaid Services releases data on how much money doctors received from medical companies. If I’m not mistaken, the centers release the data now because of the great investigative work of ProPublica. The data they release is important and something our readers care about.

Last year, I spent a ton of time parsing the data to put together a site that shows our readers the top paid doctors in Iowa. We decided to limit the number of doctors shown to 100, mostly because we ran out of time.

In all, I spent at least a week on the project last year. But this year, I was not afforded that luxury because the data dump caught us off guard. Instead of a few weeks, we had a few days to put together the site and story.

Fortunately, I’ve spent quite a bit of time in the last year moving away from editing spreadsheets in Excel or Google Docs to editing them using shell scripts.

The problem with GUI editors is they aren’t easily reproducible. If you make 20 changes to a spreadsheet in Excel one year and then the next year you get an updated version of that same spreadsheet, you have to make those exactly same 20 changes again.

That sucks.

Fortunately with tools like CSVKit and Agate, you can write A LOT of Excel-like functions in a shell script or a Python script. This allows you to document what you are doing so you don’t forget. I can’t tell you how many times I’ve done something in Excel, saved the spreadsheet, then came back to it a few days later and completely forgotten what the hell I was doing.

It also makes it so you can automate your data analysis by allowing you to re-run the functions over and over again. Let’s say you perform five functions in Excel and trim 20 columns from your spreadsheet. Now let’s say you want to do that again, either because you did it wrong the first time or because you received new data. Now would you rather run a shell script that takes seconds to perform this task or do everything over in Excel?

Another nice thing about shell scripts is you can hook your spreadsheet into any number of data processing programs. Want to port your data into a SQLite database so you can perform queries? No problem. You can also create separate files for those SQL queries and run them in your shell script as well.

All of this came in handy when this year’s most paid doctors data was released. Sure, I still spent a stressful day editing the new spreadsheets. But if I hadn’t been using shell scripts, I doubt I would have gotten done by our deadline.

Another thing I was able to do was increase the number of doctors listed online. We went from 100 total doctors to every doctor who received $500 or more  (totaling more than 2,000 doctors). This means it’s much more likely that readers will find their doctors this year.

The project is static and doesn’t have a database backend so I didn’t necessarily need to limit the number of doctors last year. We just ran out of time. This year, we were able to not only update the app but give our readers much more information about Iowa doctors. And we did it in a day instead of a week.

The project is online here. The code I used to edit the 15+ gigabytes worth of data from CMS is available here.

 

 

Advertisements

Written by csessig

September 7, 2016 at 4:59 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: