Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Public Affairs Reporting’ Category

How-to: Creating a public salary database

leave a comment »

before_after_teaserNote: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

People love knowing what public employees earn in a year. They always make for interesting conversation and plenty of controversy.

In honor of Sunshine Week, I’m going to go through the steps of creating a quick and dirty salary database using DataTables.

Here’s the process:

1. Find a data set and clean, clean, clean

For this example, I’m going to use the 2012 Salary Book for public employees in Iowa. The spreadsheet has records for more than 59,000 public employees. And there are plenty of things in the spreadsheet you’ll probably want to clean up.

I’m not going to dive too deeply into cleaning data here except to point out that it is VERY IMPORTANT. I like to first bring my spreadsheets into Excel and simply sort each row from A to Z to find empty cells, cells with funky text or characters, etc. Here’s a handy guide for cleaning data in Excel.

I then use a great program called Open Refine (formerly Google Refine) to help clean the data. Their website includes some video tutorials on how to use the program. Be sure to also read this post on how investigative journalists should use Refine. Finally, this walkthrough also teaches journalists to use Refine.

The process of cleaning data is probably the most important step in creating a database. It make take several hours. But just as you wouldn’t print a story without first proofreading it, you shouldn’t make data available on your site that you haven’t cleaned to the best of your abilities.

2. Hosting your data

When you’re done cleaning the data, you’ll export the spreadsheet in Google Refine. You’ll then you’ll need to upload the data and host it somewhere. You have a few options:

a. Don’t host it at all. Instead follow this method I mentioned in an earlier blog post that converts the spreadsheet into an HTML table, which is then converted into a DataTable. The disadvantage of this method is if you want to update the spreadsheet, you’ll have to do the conversion over and paste the new HTML table into your document. For spreadsheets that are changed repeatedly, this could be a pain.

b. Upload the spreadsheet into Google Drive and use an awesome service called Tabletop.JS to feed the spreadsheet to DataTables. This is quickly becoming one of my preferred ways to host data simply because it allows several journalists and editors to edit the same Google spreadsheet. All their changes will be reflected on your page, which will be updated automatically as the spreadsheet is updated.

And in fact, my hope was to upload our spreadsheet of Iowa public employee salaries to Google and use Tabletop.JS for this project. But I couldn’t because Google only allows spreadsheets with less than 400,000 fields. This spreadsheet, which was 59,000 rows and seven columns, just eclipses Google’s limit.

For an example of Tabletop.JS in action, check out this repo on my Github page. If you have any questions on how it works, shoot me an e-mail or leave a comment on this article.

c. You can also use an open source tool called DataSet from the Miso Project. Instead of hosting it with Google, you just load the CSV file on your own server and pull the data directly from the file. You can convert the CSV information into Javascript arrays, objects, etc., which allows the data to work with DataTables and tons of other awesome Javascript libraries.

The advantage to DataSet is you can work with huge spreadsheets like this one on Iowa public employee salaries and you don’t have to worry about Google’s size limits. The downfall is only one person can edit the spreadsheet at a time. And they’ll have to download the CSV, re-save it, then re-upload it to the server every time they make a change.

3. The code

The first thing you’ll need to do is go to my Github page and download the directory for this project.

I’m going to break down the Javascript file into chunks to give you an idea of what is happening:

// DataTables currency
// Use to sort the table via currency
jQuery.extend( jQuery.fn.dataTableExt.oSort, {
	"currency-pre": function ( a ) {
		a = (a==="-") ? 0 : a.replace( /[^\d\-\.]/g, "" );
		return parseFloat( a );
	},
	
	"currency-asc": function ( a, b ) {
		return a - b;
	},
	
	"currency-desc": function ( a, b ) {
		return b - a;
	}
});

This function allows us to sort the table based on currency. Basically, when the reader firsts loads the database, the employee with the highest salary will be shown. For more information on the different sorting techniques offered by DataTables, click here.

// Where we'll put the data
var newDataSet = [];
 
// Our column headers
// Change these to fit your table
var tableColumnSet =   [
  { "sTitle": "Employee" },
	{ "sTitle": "Department" },
	{ "sTitle": "Total FY12 Salary", "sType": "currency" },
	{ "sTitle": "Gender", "sClass": "center" },
	{ "sTitle": "County", "sClass": "center" },
	{ "sTitle": "Position", "sClass": "center" }
];

This creates a blank array called “newDataSet.” Later, we will push all of the content in our CSV into this array, which will be used by DataTables and displayed on the page. The “tableColumnSet” variable is where we set our column header names. For future projects, you’ll want to change those to match your own CSV file.

Notice also how we set the “Total FY12 Salary” column to currency, which allows us to sort the table by currency.

// Our DataTable information
// Don't need to change anything here
// Unless you want to customize the table format
function showInfo() {
  $('#table_div').html( '<table cellpadding="0" cellspacing="0" border="0" class="table table-striped table-bordered" id="table"></table>' );
 
	// Push the data to the table
	$('#table').dataTable( {
		"bProcessing":true,
		"sPaginationType": "bootstrap",
		"iDisplayLength": 100,
		"aaData": newDataSet,
		"aoColumns": tableColumnSet,
		// Pick which column we will sort by default
		// For this table, we'll sort ascending by Total FY12 Salary
		"aaSorting": [[ 2, "desc" ]],
		"oLanguage": {
			"sLengthMenu": "_MENU_ records per page"
		}
	});
}

DataTables provides us with several different options for displaying the data, including how to sort the data, where to the pull the data from, how many rows to show at a time, etc. We’re putting all this in a function called “showInfo”, which we will call below.

For more on the options available to us with DataTables, click here.

You may also notice that this table is styled using Twitter’s Bootstrap framework. Basically that gives the table the zebra look to it, as well as style the buttons and search field. For more information on working with Bootstrap, click here.

// Load up the CSV using Miso.Dataset
$(document).ready( function() {
  // Change URL to the right path for your CSV
	var ds = new Miso.Dataset({
  		url : 'csvs/salary_book_clean.csv',
  		delimiter : ','
  	});
 
	// Run this after we load our CSV
	ds.fetch({ success : function() {
    	this.each(function(row, rowIndex) {
    		// Change these variables to match your column names
    		var employeeData = row.Employee;
    		var departmentData = row.Department;
    		var salaryData = row.TotalFY12Salary;
    		var genderData = row.Gender;
    		var countyData = row.County;
    		var positionData = row.Position;
    		// Put information in an array and push it to our table
    		// Change these variables to match variables above
    		var myArray = [employeeData, departmentData, salaryData, genderData, countyData, positionData];
			newDataSet.push(myArray);
  		});
    	// Call DataTable function showInfo
		showInfo();
	}});
});

The last part of this file will load when our document is ready. It uses Miso’s DataSet library to pull in the information from our CSV file. Once the information is loading, a loop for each row in our data. A Javascript array is created for each row containing that row’s data and then that is pushed to our newDataSet variable, which we created earlier.

For future projects, make sure you change the variable names in each loop to fit your CSV file. So change the words “Employee”, “Department”, etc. to make the column headers in your CSV file.

Finally, once that is done, we call our “showInfo” function, which puts the content in our CSV on the page using DataTables.

If you have any questions, feel free to leave me a comment or e-mail me directly at chris.essig@wcfcourier.com.

Advertisements

Written by csessig

March 15, 2013 at 10:03 am

How We Did It: Waterloo crime map

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Last week we launched a new feature on the Courier’s website: A crime map for the city of Waterloo that will be updated daily Monday through Friday.

The map uses data provided by the Waterloo police department. It’s presented in a way to allow readers to make their own stories out of the data.

(Note: The full code for this project is available here.)

Here’s a quick run-through of what we did to get the map up and running:

1. Turning a PDF into manageable data

The hardest part of this project was the first step: Turning a PDF into something usable. Every morning, the Waterloo police department updates their calls for service PDF with the latest service calls. It’s a rolling PDF that keeps track of about a week of calls.

The first step I took was turning the PDF into a HTML document using the command line tool PDFtoHTMLFor Mac users, you can download it by going to the command line and typing in “brew install pdftohtml.” Then run “pdftohtml -c (ENTER NAME OF PDF HERE)” to turn the PDF into an HTML document.

The PDF we are converting is basically a spreadsheet. Each cell of the spreadsheet is turned into a DIV with PDFtoHTML. Each page of the PDF is turned into its own HTML document. We will then scrape these HTML documents using the programming language Python, which I have blogged about before. The Python library that will allow us to scrape the information is Beautiful Soup.

The “-c” command adds a bunch of inline CSS properties to these DIVs based on where they are on the page. These inline properties are important because they help us get the information off the spreadsheet we want.

All dates and times, for instance, are located in the second column. As a result, all the dates and times have the exact same inline left CSS property of “107” because they are all the same distance from the left side of the page.

The same goes for the dispositions. They are in the fifth column and are farther from the left side of the page so they have an inline left CSS property of “677.”

We use these properties to find the columns of information we want. The first thing we want is the dates. With our Python scraper, we’ll grab all the data in the second column, which is all the DIVs that have an inline left CSS property of “107.”

We then have a second argument that uses regular expressions to make sure the data is in the correct format i.e. numbers and not letters. We do this to make sure we are pulling dates and not text accidently.

The second argument is basically an insurance policy. Everything we pull with the CSS property of “107” should be a date. But we want to be 100% so we’ll make sure it’s integers and not a string with regular expressions.

The third column is the reported crimes. But in our converted HTML document, crimes are actually located in the DIV previous to the date + time DIV. So once we have grabbed a date + time DIV with our Python scraper, we will check the previous DIV to see if it matches one of the seven crimes we are going to map. For this project, we decided not to map minor reports like business checks and traffic stops. Instead we are mapping the seven most serious reports.

If it is one of our seven crimes, we will run one final check to make sure it’s not a cancelled call, an unfounded call, etc. We do this by checking the disposition DIVs (column five in the spreadsheet), which are located before the crime DIVs. Also remember that all these have an inline left CSS property of “677”.

So we check these DIVs with our dispositions to make sure they don’t contain words like “NOT NEEDED” or “NO REPORT” or “CALL CANCELLED.”

Once we know it’s a crime that fits into one of our seven categories and it wasn’t a cancelled call, we add the crime, the date, the time, the disposition and the location to a CSV spreadsheet.

The full Python scraper is available here.

2. Using Google to get latitude, longitude and JSON

The mapping service I used was Leaflet, as opposed to Google Maps. But we will need to geocode our addresses to get latitude and longitude information for each point to use with Leaflet. We also need to convert our spreadsheet into a Javascript object file, also known as a JSON file.

Fortunately that is an easy and quick process thanks to two gadgets available to us using Google Docs.

The first thing we need to do is upload our CSV to Google Docs. Then we can use this gadget to get latitude and longitude points for each address. Then we can use this gadget to get the JSON file we will use with the map.

3. Powering the map with Leaflet, jQRangeSlider, DataTables and Bootstrap

As I mentioned, Leaflet powers the map. It uses the latitude and longitude points from the JSON file to map our crimes.

For this map, I created my own icons. I used a free image editor known as Seashore, which is a fantastic program for those who are too cheap to shell out the dough for Adobe’s Photoshop.

The date range slider below the map is a very awesome tool called jQRangeSlider. Basically every time the date range is moved, a Javascript function is called that will go through the JSON file and see if the crimes are between those two dates.

This Javascript function also checks to see if the crime has been selected by the user. Notice on the map the check boxes next to each crime logo under “Types of Crimes.”

If the crime is both between the dates on the slider and checked by the users, it is mapped.

While this is going on, an HTML table of this information is being created below the map. We use another awesome tool called DataTables to make that table of crimes interactive. With it, readers can display up to a 100 records on the page or search through the records.

Finally, we create a pretty basic bar chart using the Progress Bars made available by Bootstrap, an awesome interface released by the people who brought us Twitter.

Creating these bars are easy: We just need to create DIVs and give them a certain class so Bootstrap knows how to style them. We create a bar for each crime that is automatically updated when we tweak the map

For more information on progress bars, check out the documentation from Bootstrap. I also want to thank the app team at the Chicago Tribune for providing the inspiration behind the bar chart with their 2012 primary election app.

The full Javascript file is available here.

4. Daily upkeep

This map is not updated automatically so every day, Monday through Friday, I will be adding new crimes to our map.

Fortunately, this only takes about 5-10 minutes of work. Basically I scrape the last few pages of the police’s crime log PDF, pull out the crimes that are new, pull them into Google Docs, get the latitude and longitude information, output the JSON file and put that new file into our FTP server.

Trust me, it doesn’t take nearly as long as it sounds to do.

5. What’s next?

Besides minor tweaks and possible design improvements, I have two main goals for this project in the future:

A. Create a crime map for Cedar Falls – Cedar Falls is Waterloo’s sister city and like the Waterloo police department, the Cedar Falls police department keeps a daily log of calls for service. They also post PDFs, so I’m hoping the process of pulling out the data won’t be drastically different that what I did for the Waterloo map.

B. Create a mobile version for both crime maps – Maps don’t work tremendously well on the mobile phone. So I’d like to develop some sort of alternative for mobile users. Fortunately, we have all the data. We just need to figure out how to display it best for smartphones.

Have any questions? Feel free to e-mail me at chris.essig@wcfcourier.com.

My arm!

leave a comment »

Written by csessig

May 17, 2010 at 8:44 pm

Sheriffs and mental health advocates: Quinn cuts will hit hard

leave a comment »

Many across the state are lining up to oppose Gov Pat Quinn’s proposed cuts, complaining they will be devastating, not only for schools across Illinois who could face more a $1 billion in reductions, but to all facets of life. Of course, this is to be expected. But sheriffs, in particular, paint a pretty bleak picture:

SPRINGFIELD — Laying off state troopers may lead to slower response times to accidents and could hamper a recently launched methamphetamine prevention campaign in southern Illinois, a number of county sheriffs said Thursday.

“The cuts to the Illinois State Police are going to be a tremendous blow to us,” Union County Sheriff David Livesay said Thursday. “We don’t have the personnel to put out there.”

More…

But because state police are often called out to accidents, the response time from law enforcement could be delayed.“Any time you lose resources, it’s a bad thing,” [Saline County Sheriff Keith Brown] said. “We are stressed in southern Illinois, especially with manpower.”

For instance, three bicyclists were hit by a van in Gallatin County on Wednesday. One bicyclist was killed and all were airlifted to a hospital. The state police is needed for situations like these because the agency specializes in accident reconstruction, said Brown.

Ugh. Here’s the link.

Obviously, no one wants to risk public security. And Quinn’s budget is far from finalized. But something eventually has to be done: either a tax increase or massive cuts. Which will it be, lawmakers?

Mental health advocates are also concerned their needs will be lost in the shuffle:

[Director David] Cole [of the the Moultrie County Counseling Center] agreed the budget is likely to be tweaked several times before it is enacted, but he fears so much attention will be placed on education that mental health services will be all but forgotten.

“We may not get the awareness we need to get a decent budget,” Cole said.

Link.

When the governor proposing slashing education funding so severely, it’s going to get a lot of coverage. But Cole has a good point: with so much uproar centered around the education budget blows, other areas could become easier to cut politically.

Either way, the situation is dire. And only getting worse.

Nothing to do with politics

leave a comment »

Ironically, the first story of mine to make it on the front page of any Lee Enterprise web site had nothing to do with politics. Instead, the story featured the unveiling of a new Abraham Lincoln penny. Imagine that. Here’s an excerpt:

SPRINGFIELD — Temperatures were in the teens Thursday morning but that didn’t stop a host of out-of-state residents from trekking to the state’s capitol to be among the first batch of people to purchase a newly-minted Abraham Lincoln penny.

Kentucky, Missouri and Indiana were among the states represented at the unveiling of the new coin, which came on the eve of Lincoln’s 201st birthday. The line to purchase the pennies formed as early as 5 a.m. and wrapped around the Abraham Lincoln Presidential Museum shortly before the ceremony kicked off at 9:30 a.m.

And the link.

There were plenty of angles to take on this story, but I thought it was fascinating how far people were willing to travel to purchase a few rolls of pennies. The new coins will be around for the next 49 years, after all. But coin collectors, like other enthusiasts, are very passionate about their trade and more than willing to go that extra mile (or 200, which every be the case) to support the hobby they love.

Of course, I had to screen capture the beautiful front pages from the Bloomington Pantagraph and the Decatur Herald & Review. Here are the screen caps:

Breathtaking.

Written by csessig

February 15, 2010 at 12:33 am

Legislative session in full swing

leave a comment »

The first month of the new decade in Illinois politics has been dominated by primary election coverage and for good reason. There were so many close races, it was hard to keep up with all of them. But now that the elections are done and most of the winners have been announced, it’s time for our local politicians to the return to the Capitol for this year’s legislative session. True, they were in session for a few days in January but very little got done because of the lingering February primary. But that is in the past. And, boy, do the lawmakers have a big agenda to tackle.

The issue that will dominate this session is the state’s budget woes, which are among the worst in the nation. Our deficit is a whopping $12 billion and every state entity under the moon is starting to fasten their belts…in a big way. Many universities are still waiting millions of dollars worth of payments from the state and are implementing furloughs and layoffs until the money comes through. And that is just one, tiny example.

Democrats are somewhat united behind an income and corporate tax increase, which would give the state much needed breathing room in its finances. But tax increases are terribly unpopular, especially during a recession that has produced 11 percent unemployment across the state. And the general election coming in November makes it even more difficult for lawmakers to approve a politically-suicidal tax increase. The Senate did pass an increase last session, but the House has refused to bring it to the floor.

Republicans, on the other hand, are try to cut, cut, cut their way out of the deficit, which would mean spending reductions that could affect nearly everyone. But every little bit helps, they argue. One such proposal I’ve been covering  would remove front license plates on registered cars and trucks. I’ve blogged about this before but today it was brought up in committee. And it failed:

SPRINGFIELD — A proposed law that would have removed the front license plates on Illinois cars and trucks failed an early test Tuesday.

A House committee voted 4-2 against the measure, which would require drivers have just one plate in the rear of their vehicle. State Rep. Jerry Mitchell, R-Sterling, sponsored the measure and said it would save the state $800,000.

The committee was concerned the measure would hinder the police force and compromise public safety. It could be brought up in the future, but we will see.

Obviously, every cut you make affects someone or something. For every cut there is a compelling argument for why so-and-so or such-and-such really needs the funding. And some are easier to argue than others. But if a cut like this can’t even make it out of committee, it’s going to be very tough to implement enough cuts to get the state out of this fiscal nightmare. I would assume much more drastic and controversial cuts would need to be implemented to trim a deficit that is tens of billions of dollars.

I could be wrong. But many contend a tax increase is the only way to realistically shore up the state’s deficit. But even then, cuts may be needed. And who is to say a tax increase would really bring down the state deficit? Could lawmakers just spend the money on local pork projects to help win favor in their district and leave the state’s finances in dismal shape? In short, is a tax increase really the answer?

But before we get too ahead of ourselves, we have to realize that many lawmakers won’t vote for something so controversial because their job will be on the line in nine months. In the mean time, the state continues to wobble along, barely scraping by…

As the saying goes, it make “blood on the sidewalks” for lawmakers to vote on something that is damaging enough to get the state out of the red ink.

Image posted by Flickr user klynslis. It is used under a Creative Commons license.

Written by csessig

February 10, 2010 at 8:38 pm

Post-election analysis stories

leave a comment »

While the Illinois primary season officially ended a week ago, neither party knows what their ticket is going to look like in the fall. For one, the Democrats are missing a lieutenant governor. The party is still hurting because their former candidate, Scott Lee Cohen, was engulfed in controversy due to revelations into his past, which included horrific tales of violence, sexual misconduct and steroid use. How much of it is true is unknown. But it was enough for him to bow out of the race on Super Bowl Sunday, just five days after he was nominated. The Dems are surely relieved he stepped down, but they now face a new problem: finding a replacement. Rumors are flying everywhere, but it may be a while before they officially name someone.

One would think this controversy would be the perfect battle cry for Republicans to rally behind their candidate for governor. Just one problem: they don’t have one. State Sen. Bill Brady, of Bloomington, is currently in the lead, but it’s so slim, his opponent, State Sen. Kirk Dillard, of Hinsdale, is refusing to give up before all of the provisional and absentee votes are counted. This leaves their ticket even more in limbo than their opponents on the left.

1. With that in mind, it was interesting to see former Gov. Jim Edgar and David Yepsen, director of the Paul Simon Public Policy Institute at Southern Illinois University in Carbondale, give their post-election analysis before a number of movers and shakers Thursday afternoon at a luncheon sponsored by UIS. It was an unprecedented election with unprecedented results. And both acknowledged these points.

The one thing Edgar said that I thought was most interesting was a quick line about how he thought Brady’s 400-vote lead would hold up. Of course, Dillard was Edgar’s chief of staff for a while and Dillard diligently touted his experience under the Edgar administration every day, it seemed, of the election. The fact that he was calling such a close race caught me off guard and eventually became my story:

“I’m saying this as if the 406 votes are going to hold up. I suspect it will,” Edgar said.

Edgar said the key to the Republican race was three candidates from the Chicago area — Dillard, former GOP chairman Andy McKenna and former Attorney General Jim Ryan — split the city vote. All three candidates finished within striking distance of Brady.

“The key factor was geography. It wasn’t ideology,” Edgar said.

A number of media outlets picked up on the story as well, including Capitol Fax, Progress Illinois, Illinois Observer and Illinois Review. It will be interesting to see if Edgar’s prediction comes true in the coming weeks.

2. The governor’s race wasn’t the only one addressed at the luncheon, however. The U.S. Senate race also dominated the conversation and for good reason. This year’s race between Treasurer Alexi Giannoulias and Congressman Mark Kirk should be a barn burner. National Republicans are giddy at the chance of winning Obama’s old seat, while Democrats are poised not to be embarrassed in a state that has been predominately blue for quite some time.

Anways, a few quotes from Edgar and Yepsen made it into another story my colleague Mike was working on. Here is a quick look:

“The Republicans have probably one of the best chances in the United States of picking up a Democratic-held Senate seat,” Edgar said of the Illinois race. “Kind of the icing on the cake is this is a seat that Barack Obama held.”

This race will be watched closely, not only in Illinois but across the nation. Stay tuned…

3. Lastly, one other quick point about the primary election: voter turnout was dismal. Officials pinned the voting public at less than 30 percent of registered voters, which is lower than usual. Primary turnout is usually horrendous but many lawmakers blamed the February primary (the earliest in the nation) for this year’s low numbers. I talked to a number of local Lee lawmakers and did a quick story:

State Rep. Dan Brady, R-Bloomington, agrees the primary is too early and also is co-sponsoring its move to June. In his district, McLean County reported 22 percent of registered voters made their voices heard Tuesday night, which is a slight increase from the 2006 primary when 20 percent voted. In 2002, turnout was 30 percent.

“I think a big part of it was people not being prepared and ready to a deal with a primary election at this point in time,” he said. “It caught them off guard.”

Moving the primary date is going to be a hot issue this session, especially for the minority party who feel the early primary unfairly gives incumbents an edge. I’ll keep you posted.

Well I’m running out of wind once again, so I’m going to stop this post now. What a doozy! I hope you enjoyed every last word…

Image posted by Flickr user myoldpostcards. It is used under a Creative Commons license.