Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Databases’ Category

How automation can save you on deadline

leave a comment »

In the data journalism world, data dumps are pretty common occurrences. And often times these dumps are released periodically — every month, every six months, every year, etc.

Despite their periodic nature, these dumps can catch you off guard. Especially if you are working on a million things at once (but nobody does that, right?).

ss

Case in point: Every year the Centers for Medicare & Medicaid Services releases data on how much money doctors received from medical companies. If I’m not mistaken, the centers release the data now because of the great investigative work of ProPublica. The data they release is important and something our readers care about.

Last year, I spent a ton of time parsing the data to put together a site that shows our readers the top paid doctors in Iowa. We decided to limit the number of doctors shown to 100, mostly because we ran out of time.

In all, I spent at least a week on the project last year. But this year, I was not afforded that luxury because the data dump caught us off guard. Instead of a few weeks, we had a few days to put together the site and story.

Fortunately, I’ve spent quite a bit of time in the last year moving away from editing spreadsheets in Excel or Google Docs to editing them using shell scripts.

The problem with GUI editors is they aren’t easily reproducible. If you make 20 changes to a spreadsheet in Excel one year and then the next year you get an updated version of that same spreadsheet, you have to make those exactly same 20 changes again.

That sucks.

Fortunately with tools like CSVKit and Agate, you can write A LOT of Excel-like functions in a shell script or a Python script. This allows you to document what you are doing so you don’t forget. I can’t tell you how many times I’ve done something in Excel, saved the spreadsheet, then came back to it a few days later and completely forgotten what the hell I was doing.

It also makes it so you can automate your data analysis by allowing you to re-run the functions over and over again. Let’s say you perform five functions in Excel and trim 20 columns from your spreadsheet. Now let’s say you want to do that again, either because you did it wrong the first time or because you received new data. Now would you rather run a shell script that takes seconds to perform this task or do everything over in Excel?

Another nice thing about shell scripts is you can hook your spreadsheet into any number of data processing programs. Want to port your data into a SQLite database so you can perform queries? No problem. You can also create separate files for those SQL queries and run them in your shell script as well.

All of this came in handy when this year’s most paid doctors data was released. Sure, I still spent a stressful day editing the new spreadsheets. But if I hadn’t been using shell scripts, I doubt I would have gotten done by our deadline.

Another thing I was able to do was increase the number of doctors listed online. We went from 100 total doctors to every doctor who received $500 or more  (totaling more than 2,000 doctors). This means it’s much more likely that readers will find their doctors this year.

The project is static and doesn’t have a database backend so I didn’t necessarily need to limit the number of doctors last year. We just ran out of time. This year, we were able to not only update the app but give our readers much more information about Iowa doctors. And we did it in a day instead of a week.

The project is online here. The code I used to edit the 15+ gigabytes worth of data from CMS is available here.

 

 

Advertisements

Written by csessig

September 7, 2016 at 4:59 pm

How-to: Creating a public salary database

leave a comment »

before_after_teaserNote: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

People love knowing what public employees earn in a year. They always make for interesting conversation and plenty of controversy.

In honor of Sunshine Week, I’m going to go through the steps of creating a quick and dirty salary database using DataTables.

Here’s the process:

1. Find a data set and clean, clean, clean

For this example, I’m going to use the 2012 Salary Book for public employees in Iowa. The spreadsheet has records for more than 59,000 public employees. And there are plenty of things in the spreadsheet you’ll probably want to clean up.

I’m not going to dive too deeply into cleaning data here except to point out that it is VERY IMPORTANT. I like to first bring my spreadsheets into Excel and simply sort each row from A to Z to find empty cells, cells with funky text or characters, etc. Here’s a handy guide for cleaning data in Excel.

I then use a great program called Open Refine (formerly Google Refine) to help clean the data. Their website includes some video tutorials on how to use the program. Be sure to also read this post on how investigative journalists should use Refine. Finally, this walkthrough also teaches journalists to use Refine.

The process of cleaning data is probably the most important step in creating a database. It make take several hours. But just as you wouldn’t print a story without first proofreading it, you shouldn’t make data available on your site that you haven’t cleaned to the best of your abilities.

2. Hosting your data

When you’re done cleaning the data, you’ll export the spreadsheet in Google Refine. You’ll then you’ll need to upload the data and host it somewhere. You have a few options:

a. Don’t host it at all. Instead follow this method I mentioned in an earlier blog post that converts the spreadsheet into an HTML table, which is then converted into a DataTable. The disadvantage of this method is if you want to update the spreadsheet, you’ll have to do the conversion over and paste the new HTML table into your document. For spreadsheets that are changed repeatedly, this could be a pain.

b. Upload the spreadsheet into Google Drive and use an awesome service called Tabletop.JS to feed the spreadsheet to DataTables. This is quickly becoming one of my preferred ways to host data simply because it allows several journalists and editors to edit the same Google spreadsheet. All their changes will be reflected on your page, which will be updated automatically as the spreadsheet is updated.

And in fact, my hope was to upload our spreadsheet of Iowa public employee salaries to Google and use Tabletop.JS for this project. But I couldn’t because Google only allows spreadsheets with less than 400,000 fields. This spreadsheet, which was 59,000 rows and seven columns, just eclipses Google’s limit.

For an example of Tabletop.JS in action, check out this repo on my Github page. If you have any questions on how it works, shoot me an e-mail or leave a comment on this article.

c. You can also use an open source tool called DataSet from the Miso Project. Instead of hosting it with Google, you just load the CSV file on your own server and pull the data directly from the file. You can convert the CSV information into Javascript arrays, objects, etc., which allows the data to work with DataTables and tons of other awesome Javascript libraries.

The advantage to DataSet is you can work with huge spreadsheets like this one on Iowa public employee salaries and you don’t have to worry about Google’s size limits. The downfall is only one person can edit the spreadsheet at a time. And they’ll have to download the CSV, re-save it, then re-upload it to the server every time they make a change.

3. The code

The first thing you’ll need to do is go to my Github page and download the directory for this project.

I’m going to break down the Javascript file into chunks to give you an idea of what is happening:

// DataTables currency
// Use to sort the table via currency
jQuery.extend( jQuery.fn.dataTableExt.oSort, {
	"currency-pre": function ( a ) {
		a = (a==="-") ? 0 : a.replace( /[^\d\-\.]/g, "" );
		return parseFloat( a );
	},
	
	"currency-asc": function ( a, b ) {
		return a - b;
	},
	
	"currency-desc": function ( a, b ) {
		return b - a;
	}
});

This function allows us to sort the table based on currency. Basically, when the reader firsts loads the database, the employee with the highest salary will be shown. For more information on the different sorting techniques offered by DataTables, click here.

// Where we'll put the data
var newDataSet = [];
 
// Our column headers
// Change these to fit your table
var tableColumnSet =   [
  { "sTitle": "Employee" },
	{ "sTitle": "Department" },
	{ "sTitle": "Total FY12 Salary", "sType": "currency" },
	{ "sTitle": "Gender", "sClass": "center" },
	{ "sTitle": "County", "sClass": "center" },
	{ "sTitle": "Position", "sClass": "center" }
];

This creates a blank array called “newDataSet.” Later, we will push all of the content in our CSV into this array, which will be used by DataTables and displayed on the page. The “tableColumnSet” variable is where we set our column header names. For future projects, you’ll want to change those to match your own CSV file.

Notice also how we set the “Total FY12 Salary” column to currency, which allows us to sort the table by currency.

// Our DataTable information
// Don't need to change anything here
// Unless you want to customize the table format
function showInfo() {
  $('#table_div').html( '<table cellpadding="0" cellspacing="0" border="0" class="table table-striped table-bordered" id="table"></table>' );
 
	// Push the data to the table
	$('#table').dataTable( {
		"bProcessing":true,
		"sPaginationType": "bootstrap",
		"iDisplayLength": 100,
		"aaData": newDataSet,
		"aoColumns": tableColumnSet,
		// Pick which column we will sort by default
		// For this table, we'll sort ascending by Total FY12 Salary
		"aaSorting": [[ 2, "desc" ]],
		"oLanguage": {
			"sLengthMenu": "_MENU_ records per page"
		}
	});
}

DataTables provides us with several different options for displaying the data, including how to sort the data, where to the pull the data from, how many rows to show at a time, etc. We’re putting all this in a function called “showInfo”, which we will call below.

For more on the options available to us with DataTables, click here.

You may also notice that this table is styled using Twitter’s Bootstrap framework. Basically that gives the table the zebra look to it, as well as style the buttons and search field. For more information on working with Bootstrap, click here.

// Load up the CSV using Miso.Dataset
$(document).ready( function() {
  // Change URL to the right path for your CSV
	var ds = new Miso.Dataset({
  		url : 'csvs/salary_book_clean.csv',
  		delimiter : ','
  	});
 
	// Run this after we load our CSV
	ds.fetch({ success : function() {
    	this.each(function(row, rowIndex) {
    		// Change these variables to match your column names
    		var employeeData = row.Employee;
    		var departmentData = row.Department;
    		var salaryData = row.TotalFY12Salary;
    		var genderData = row.Gender;
    		var countyData = row.County;
    		var positionData = row.Position;
    		// Put information in an array and push it to our table
    		// Change these variables to match variables above
    		var myArray = [employeeData, departmentData, salaryData, genderData, countyData, positionData];
			newDataSet.push(myArray);
  		});
    	// Call DataTable function showInfo
		showInfo();
	}});
});

The last part of this file will load when our document is ready. It uses Miso’s DataSet library to pull in the information from our CSV file. Once the information is loading, a loop for each row in our data. A Javascript array is created for each row containing that row’s data and then that is pushed to our newDataSet variable, which we created earlier.

For future projects, make sure you change the variable names in each loop to fit your CSV file. So change the words “Employee”, “Department”, etc. to make the column headers in your CSV file.

Finally, once that is done, we call our “showInfo” function, which puts the content in our CSV on the page using DataTables.

If you have any questions, feel free to leave me a comment or e-mail me directly at chris.essig@wcfcourier.com.

Written by csessig

March 15, 2013 at 10:03 am

Turning Excel spreadsheets into searchable databases in five minutes

with 21 comments

Note: This is cross-posted from Lee’s data journalism blog and includes references to the Blox content management system, which is what Lee newspapers use. Reporters at Lee newspapers can read my blog over there by clicking here.

data_tables_screenshotHere’s a scenario I run into all the time: A government agency sends along a spreadsheet of data to go along with a story one of our reporters is working on. And you want to post the spreadsheet(s) online with the story in a reader-friendly way.

One way is to create a sortable database with the information. There are a couple of awesome options on the table put out by other news organizations. One is TableSetter, published by ProPublica, which we’ve used in the past. The other one is TableStacker, which is a spin off of TableSetter. We have also used this before.

The only problem is making these tables typically take a little bit of leg work. And of course, you’re on deadline.

Here’s one quick and dirty option:

1. First open the spreadsheet in Excel or Google Docs and highlight the fields you want to make into the sortable database.

2. Then go to this wonderful website called Mr. Data Converter and paste the spreadsheet information into the top box. Then select the output as “HTML.”

3. We will use a service called DataTables to create the sortable databases. It’s a great and easy jQuery plugin to create the sortable tables.

4. Now create an HTML asset in Blox and paste in this DataTables template below:

Note: I’ve added some CSS styling to make the tables look better.

<html>
<head>
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_page.css">
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_table.css">

<style>
table {
	font-size: 12px;
	font-family: Arial, Helvetica, sans-serif;
float: left
}
table th, table td {
    text-align: center;
}

th, td {
	padding-top: 10px;
	padding-bottom: 10px;
	font-size: 14px;
}

label {
	width: 100%;
	text-align: left;
}

table th {
	font-weight: bold;
}

table thead th {
    vertical-align: middle;
}

label, input, button, select, textarea {
    line-height: 30px;
}
input, textarea, select, .uneditable-input {
    height: 25px;
    line-height: 25px;
}

select {
    width: 100px;
}

.dataTables_length {
    padding-left: 10px;
}
.dataTables_filter {
	padding-right: 10px;
}

</style>

<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.js"></script>
<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.dataTables.min.js"></script>

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>
</head>

<body>

<--- Enter HTML table here --->

</body>

</html>

– This will link the page to the necessary CSS spreadsheets and Javascript files to get the DataTable working. The other option is go to the DataTable’s website and download the files yourself and post them on your own server, then link to those files instead of the ones hosted by WCFCourier.com.

5. Where you see the text “<— Enter HTML table here —>,” paste in your HTML table from Mr. Data Converter.

6. The last thing you will need to do is create an “id” for the table and link that “id” to the DataTable’s plugin. In the example above, the “id” is “five_year.” It is noted in this line of code in the DataTable template:

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>

– The header on your HTML table that you post into the template will look like so:

<table id="five_year" style="width: 620px;">
  <thead>
    <tr>
      <th class="NAME-cell">NAME</th>
      <th class="2008 Enrollment-cell">2008 Enrollment</th>
      <th class="2012 Enrollment-cell">2012 Enrollment</th>
      <th class="Increase/Decrease-cell">Increase/Decrease</th>
      <th class="Percent Increase/Decrease-cell">Percent Increase/Decrease</th>
    </tr>
  </thead>

– Here’s an live example of two sortable tables. The first table has an “id” of “five_year.” The second has an “id” of “one_year.” The full code for the two tables is available here.

– As an alternative, you can use a jQuery plugin called TableSorter (not to be confused with the TableSorter project mentioned above). The process of creating the table is very similar.

7. That’s it! Of course, DataTables provides several customization options that are worth looking into if you want to make your table look fancier.

Written by csessig

January 3, 2013 at 4:34 pm