Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa


How to: Using OpenRefine, Excel to clean messy, user-submitted data


Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

We at The Courier ran into an interesting problem last week that I’m sure other news organizations run into as well.

Every year, The Courier puts on a Best of the Best competition. We select a series of categories like food, dining, jewelry stores, etc., and ask our readers to pick the best business in town for each category.

The businesses with the most votes in each category are rewarded by being featured in the newspaper. They also get plaques they can hang at their business to show off their nifty award.

We have several forms that people use to submit their votes. Readers are allowed to type in their favorite businesses instead of picking from a preset list of choices.

We store their votes in a database. We then spit out the data into a spreadsheet with all the votes. Each row is one vote. For week two, we had a spreadsheet with 3,000+ rows of entries.

To view that spreadsheet, click here. We have a row for entry id, category, entrant id and the actual entry. The “entry” field is what we will focus on throughout this walkthrough.

The problem: Everyone has their own spelling for businesses in the “entry” column. For instance, “Famous Footwear” showed up as “Famous Footwear, Cedar Falls” and “Famous Footware”; sometimes it was partially upper-cased, and sometimes it was misspelled altogether.

How do we merge all those different spellings for businesses into one spelling and count them up?

We could do it by hand, of course, but that is painful, tedious and time-consuming. We could also pay a company to do it, but that could be expensive. At least one company that we know of charges $4,000 for such services.

Fortunately, there is a better option out there that is FREE and not horribly time-consuming.

(Screenshot: OpenRefine’s cluster screen.)

Here’s how we cleaned this spreadsheet with user-submitted data:

1. First download a program called OpenRefine and open it up. This program was originally created by the fine people at Google and then later open sourced. Its purpose is to clean “dirty” data (like our user-submitted data). Their website has three videos on how to use the program that you should definitely check out before firing it up.

2. With OpenRefine open, create a new project and import the spreadsheet we want to clean. Then find the column of data we want to clean. For us, all of our user-submitted businesses were in a column called “entry.”

3. In OpenRefine, there is a little arrow next to the column title (“entry”). Click that to see a list of options for cleaning the data. If you look under “Edit cells” and “Common transforms,” you’ll see a list of common edits like trimming the leading and trailing whitespace in the entries. So ” U.S. Cellular ” would be changed to “U.S. Cellular” (notice how the white space is removed).
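– Tip: You can make the same kind of fix by hand on any column. Under “Edit cells” and “Transform…”, entering the expression value.trim() strips the leading and trailing whitespace from every cell in that column. (That’s a small aside of mine; the original steps stick to the “Common transforms” menu.)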

4. The real magic, however, is in the cluster function. This uses algorithms to find different spellings for the same business and then allows you to pick ONE common spelling for ALL of them.

– This makes it much easier down the road to count how many times a particular business was voted on because we don’t have a wide range of spellings for that business. Instead, we can narrow it down to just a couple of spellings. Or if we are really lucky, we can convert all of the different spellings into one common spelling using the algorithms.

– To pull them up, click “Facet” then “Text facet”. That will pull up a list of unique entries in the “entry” column. If you click “count”, you can see how many times a business was entered. For instance, “Scheels” was voted on 115 times. Then click “Cluster”.

– Clustering is one of OpenRefine’s most powerful tools. When you open this section up, you’ll see a list of clusters with different spellings for what the algorithm thinks is the same business.

– If those are accurate clusters — or if the different spellings listed in the cluster really are different variations for the same business — you can select the check box under “Merge?” to convert all of those spellings in the cluster to the same, single spelling.

– OpenRefine comes packed with several clustering algorithms. You can change “Method,” “Keying Function” etc. to check them all out and see which ones work best for you. When you’re done with one algorithm, you can select “Merge Selected & Re-Cluster” to mess with other algorithms or “Merge Selected & Close” to close out of this section.

– I think it’s best to go through all of the algorithms and play around with them a bit to cluster as many businesses together as you can. (If you’re curious what the default “fingerprint” method is actually doing, there’s a rough sketch below.)
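For the curious: the default keying method, called “fingerprint,” boils each entry down to a normalized key and groups entries whose keys match. Here’s a rough JavaScript sketch of the idea (my own illustration, not OpenRefine’s actual code):

// Rough "fingerprint"-style key (illustration only, not OpenRefine's code):
// lowercase the entry, strip punctuation, split it into words,
// sort the words, drop duplicates and glue them back together.
function fingerprint(entry) {
	var tokens = entry
		.trim()
		.toLowerCase()
		.replace(/[^a-z0-9\s]/g, "")	// drop punctuation
		.split(/\s+/)			// split into words
		.sort();
	var unique = [];
	for (var i = 0; i < tokens.length; i++) {
		if (unique.indexOf(tokens[i]) === -1) {
			unique.push(tokens[i]);
		}
	}
	return unique.join(" ");
}

// Both of these produce the key "famous footwear", so they would land in the same cluster:
fingerprint("Famous  Footwear ");	// "famous footwear"
fingerprint("footwear, FAMOUS");	// "famous footwear"

A true misspelling like “Famous Footware” still gets its own key, which is why it pays to try the other methods too.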

5. Once you are done, export the spreadsheet with the button in the top right corner of the screen and click “Excel.”

Note: We used OpenRefine to cluster together different spellings for the same business. Now we’ll use Excel to create a list of uniquely-titled businesses and how many times those were voted on.

– After we are done, we will still probably have different spellings for the same business. It’s likely that the OpenRefine algorithms will be able to find many of the wide range of spellings for a particular business but not all of them. When you allow users to type in their own business names, you open yourself up to A LOT of spellings for the same thing.

– But the hope is we can narrow this list of spellings down to as few as possible so the manual work of tallying up the votes is as painless as possible.

Now open your spreadsheet up in Excel:

– Right now, we still have a spreadsheet that is about 3,000 rows long. The advantage is we have limited the number of different spellings for all the businesses people have voted on. What we need to do now is create a list of uniquely titled businesses and how many times they were voted on.

6. A lot of people accidentally put in their phone number for the entry. Some entered “====”. Others didn’t fill in the spot at all. We need to get rid of all those.

– I first moved the “entry” column from column E to column A. I then sorted the data from A to Z so the entries with phone numbers are at the top. I deleted those. I then sorted it from Z to A to put the empty entries at the top. And then I deleted those too.

– We can also delete the “entry_id” and “entrant_id” columns because for our purposes, they are useless.

– For those following along, here’s what my spreadsheet looks like now.

– One tricky aspect of this is people can vote for the same business in different categories. So we need to make sure the business and the category of each vote stay together. That way, a vote for “Ace Hardware” in the “Hardware Store” category doesn’t get lumped in with a vote for “Ace Hardware” in the “Buy a Lawnmower” category.

7. I did this by merging my entry column with my category column into a new column. This can be done with the concatenate formula:

=CONCATENATE(A2, " | ", B2)

– A2 is our entry column and B2 is our category column. The “ | ” isn’t necessary but is useful if we want to break apart this column again in the future. We then fill down that function to the last row in the spreadsheet using the fill handle.
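– If you’d rather not use CONCATENATE, Excel’s & operator builds the same combined value (a small aside of mine, not part of the original steps):

=A2 & " | " & B2

Fill it down the same way and you get identical results.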

– Here’s what my spreadsheet looks like now.

Tip: When using formulas on big spreadsheets, Excel tends to slow down A LOT. So when you are done using a formula, you can copy it and paste just the values (not the formulas) in its place to help Excel run smoother. I do this throughout the walkthrough to make our spreadsheets much easier to work with.

8. The next step is to get a list of uniquely-titled businesses and categories. The best walkthrough I found to accomplish this was this one from Microsoft. Complete steps 1-6 under “Count the number of unique values by using a filter” to create a new column of uniquely-titled businesses.

– I also deleted the entry and cat_id columns. We will not need those any more.

– Here’s what my spreadsheet looks like now.

– We are down to under 1,000 rows of uniquely-titled businesses and categories. And honestly, if I had messed with the algorithms in OpenRefine more, I could have narrowed that list even further.

9. Now we have a column for each vote (column A) and a column with each unique business and category (column B). We need to count how many times each unique business and category (column B) was voted on — or how many times each unique business and category appears in column A.

– We’ll do this by counting the number of times column B occurred in column A using this SUM formula:

=SUM(IF($A:$A=B2,1,0))

– This basically goes through each row in column A and, if it is equal to the value in column B, adds one. Then it adds all of those together. The result is how many times our unique business and category was voted on.

Note: The $A:$A reference tells Excel to look at the entire column A, and the dollar signs keep that reference from shifting as the formula is filled down.
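– Depending on your version of Excel, you may need to enter that SUM(IF()) formula as an array formula by pressing Ctrl+Shift+Enter instead of Enter. A simpler alternative (my suggestion, not the formula used above) is COUNTIF, which does the same tally in one step:

=COUNTIF($A:$A, B2)

Fill it down the column and the counts should match.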

– Here’s what my spreadsheet looks like now.

10. We’re almost done. All we need to do now is split up our business category column into a column for businesses and a column for categories.

– Do this by first copying and pasting the value of the “unique_entry_cat” column into a new column. We have to do this first, otherwise our sum formula in our “count” column won’t work properly.

– We then break apart that new column using this method and use “|” as the delimiter to split them up. Then delete the old columns and copy and paste the values over the formulas in the spreadsheet.

– Here’s what my spreadsheet looks like now.

11. Finally, copy and paste the values of the columns onto a new spreadsheet and custom sort the spreadsheet first from A to Z for the category column and then largest to smallest in the count column.

– Here’s what the final spreadsheet looks like.

– That’s it! I realize that this is a VERY long-winded blog post and may be confusing to follow at times. If you have any questions, don’t hesitate to e-mail me at chris.essig@wcfcourier.com


Written by csessig

August 16, 2013 at 3:06 pm

How-to: Creating a public salary database


Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

People love knowing what public employees earn in a year. They always make for interesting conversation and plenty of controversy.

In honor of Sunshine Week, I’m going to go through the steps of creating a quick and dirty salary database using DataTables.

Here’s the process:

1. Find a data set and clean, clean, clean

For this example, I’m going to use the 2012 Salary Book for public employees in Iowa. The spreadsheet has records for more than 59,000 public employees. And there are plenty of things in the spreadsheet you’ll probably want to clean up.

I’m not going to dive too deeply into cleaning data here except to point out that it is VERY IMPORTANT. I like to first bring my spreadsheets into Excel and simply sort each column from A to Z to find empty cells, cells with funky text or characters, etc. Here’s a handy guide for cleaning data in Excel.

I then use a great program called Open Refine (formerly Google Refine) to help clean the data. Their website includes some video tutorials on how to use the program. Be sure to also read this post on how investigative journalists should use Refine. Finally, this walkthrough also teaches journalists to use Refine.

The process of cleaning data is probably the most important step in creating a database. It may take several hours. But just as you wouldn’t print a story without first proofreading it, you shouldn’t make data available on your site that you haven’t cleaned to the best of your abilities.

2. Hosting your data

When you’re done cleaning the data, you’ll export the spreadsheet from Google Refine. Then you’ll need to upload the data and host it somewhere. You have a few options:

a. Don’t host it at all. Instead follow this method I mentioned in an earlier blog post that converts the spreadsheet into an HTML table, which is then converted into a DataTable. The disadvantage of this method is that if you want to update the spreadsheet, you’ll have to redo the conversion and paste the new HTML table into your document. For spreadsheets that are changed repeatedly, this could be a pain.

b. Upload the spreadsheet into Google Drive and use an awesome service called Tabletop.JS to feed the spreadsheet to DataTables. This is quickly becoming one of my preferred ways to host data simply because it allows several journalists and editors to edit the same Google spreadsheet. All their changes will be reflected on your page, which will be updated automatically as the spreadsheet is updated.

And in fact, my hope was to upload our spreadsheet of Iowa public employee salaries to Google and use Tabletop.JS for this project. But I couldn’t because Google only allows spreadsheets with fewer than 400,000 cells. This spreadsheet, at 59,000 rows and seven columns (roughly 413,000 cells), just eclipses Google’s limit.

For an example of Tabletop.JS in action, check out this repo on my Github page. (There’s also a bare-bones sketch just below these options.) If you have any questions on how it works, shoot me an e-mail or leave a comment on this article.

c. You can also use an open source tool called DataSet from the Miso Project. Instead of hosting it with Google, you just load the CSV file on your own server and pull the data directly from the file. You can convert the CSV information into Javascript arrays, objects, etc., which allows the data to work with DataTables and tons of other awesome Javascript libraries.

The advantage to DataSet is you can work with huge spreadsheets like this one on Iowa public employee salaries and you don’t have to worry about Google’s size limits. The downside is only one person can edit the spreadsheet at a time. And they’ll have to download the CSV, re-save it, then re-upload it to the server every time they make a change.
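For reference, wiring a published Google spreadsheet up to a page with Tabletop.JS looks roughly like this (a bare-bones sketch; the spreadsheet key below is made up, and the real working version lives in the Github repo mentioned above):

// Bare-bones Tabletop.JS sketch (the spreadsheet key is made up for illustration)
var publicSpreadsheetKey = '0AsmHExampleSpreadsheetKey123456';

$(document).ready( function() {
	Tabletop.init({
		key: publicSpreadsheetKey,
		// With simpleSheet set to true, "data" is an array of objects, one per spreadsheet row
		simpleSheet: true,
		callback: function(data, tabletop) {
			console.log(data);
		}
	});
});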

3. The code

The first thing you’ll need to do is go to my Github page and download the directory for this project.

I’m going to break down the Javascript file into chunks to give you an idea of what is happening:

// DataTables currency
// Use to sort the table via currency
jQuery.extend( jQuery.fn.dataTableExt.oSort, {
	"currency-pre": function ( a ) {
		a = (a==="-") ? 0 : a.replace( /[^\d\-\.]/g, "" );
		return parseFloat( a );
	},
	
	"currency-asc": function ( a, b ) {
		return a - b;
	},
	
	"currency-desc": function ( a, b ) {
		return b - a;
	}
});

This function allows us to sort the table based on currency. Basically, when the reader first loads the database, the employee with the highest salary will be shown. For more information on the different sorting techniques offered by DataTables, click here.
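To see what the “currency-pre” step actually does, here’s a quick before-and-after with a sample salary (my own example value, not one from the data):

// What "currency-pre" does to a salary string before sorting (sample value of my own)
var salary = "$52,431.00";
salary = salary.replace( /[^\d\-\.]/g, "" );	// "52431.00" -- strips the $ and the comma
parseFloat( salary );				// 52431 -- a plain number DataTables can compare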

// Where we'll put the data
var newDataSet = [];
 
// Our column headers
// Change these to fit your table
var tableColumnSet = [
	{ "sTitle": "Employee" },
	{ "sTitle": "Department" },
	{ "sTitle": "Total FY12 Salary", "sType": "currency" },
	{ "sTitle": "Gender", "sClass": "center" },
	{ "sTitle": "County", "sClass": "center" },
	{ "sTitle": "Position", "sClass": "center" }
];

This creates a blank array called “newDataSet.” Later, we will push all of the content in our CSV into this array, which will be used by DataTables and displayed on the page. The “tableColumnSet” variable is where we set our column header names. For future projects, you’ll want to change those to match your own CSV file.

Notice also how we set the “Total FY12 Salary” column to currency, which allows us to sort the table by currency.

// Our DataTable information
// Don't need to change anything here
// Unless you want to customize the table format
function showInfo() {
  $('#table_div').html( '<table cellpadding="0" cellspacing="0" border="0" class="table table-striped table-bordered" id="table"></table>' );
 
	// Push the data to the table
	$('#table').dataTable( {
		"bProcessing":true,
		"sPaginationType": "bootstrap",
		"iDisplayLength": 100,
		"aaData": newDataSet,
		"aoColumns": tableColumnSet,
		// Pick which column we will sort by default
		// For this table, we'll sort descending by Total FY12 Salary
		"aaSorting": [[ 2, "desc" ]],
		"oLanguage": {
			"sLengthMenu": "_MENU_ records per page"
		}
	});
}

DataTables provides us with several different options for displaying the data, including how to sort the data, where to pull the data from, how many rows to show at a time, etc. We’re putting all this in a function called “showInfo”, which we will call below.

For more on the options available to us with DataTables, click here.

You may also notice that this table is styled using Twitter’s Bootstrap framework. Basically, that gives the table its zebra-striped look, as well as styling the buttons and search field. For more information on working with Bootstrap, click here.

// Load up the CSV using Miso.Dataset
$(document).ready( function() {
	// Change URL to the right path for your CSV
	var ds = new Miso.Dataset({
		url : 'csvs/salary_book_clean.csv',
		delimiter : ','
	});

	// Run this after we load our CSV
	ds.fetch({ success : function() {
		this.each(function(row, rowIndex) {
			// Change these variables to match your column names
			var employeeData = row.Employee;
			var departmentData = row.Department;
			var salaryData = row.TotalFY12Salary;
			var genderData = row.Gender;
			var countyData = row.County;
			var positionData = row.Position;
			// Put information in an array and push it to our table
			// Change these variables to match variables above
			var myArray = [employeeData, departmentData, salaryData, genderData, countyData, positionData];
			newDataSet.push(myArray);
		});
		// Call DataTable function showInfo
		showInfo();
	}});
});

The last part of this file runs when our document is ready. It uses Miso’s DataSet library to pull in the information from our CSV file. Once the information is loaded, we loop through each row in our data. A Javascript array is created for each row containing that row’s data, and then that array is pushed to our newDataSet variable, which we created earlier.

For future projects, make sure you change the variable names in each loop to fit your CSV file. So change the words “Employee”, “Department”, etc. to match the column headers in your CSV file.

Finally, once that is done, we call our “showInfo” function, which puts the content in our CSV on the page using DataTables.

If you have any questions, feel free to leave me a comment or e-mail me directly at chris.essig@wcfcourier.com.

Written by csessig

March 15, 2013 at 10:03 am

Spreadsheet tutorials to help journalists clean and format data


Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

I was talking to a colleague the other day who was interested in getting into computer programming and more data projects. He asked where the best place to start was.

My gut reaction was to tell him to learn the basics of spreadsheets. Almost all of the data I have used for projects — whether they ended up as a map or a graphic — was initially set up in spreadsheet form.

Most of us are familiar with spreadsheets and have likely worked with Microsoft Excel. Agencies both big and small often gather spreadsheets of useful information for us to use with our stories.

The problem is most of the time the data isn’t formatted right or contains inaccuracies like misspellings. That is where the magic of spreadsheet formulas can come in to help organize your data.

Here are a few resources you might find handy for working with spreadsheets.

1. Poynter: How journalists can use Excel to organize data for stories

This walkthrough is great because it is written directly for journalists. It is also intended for beginners so those with no spreadsheet knowledge will be able to keep up. Finally, the walkthrough is intended for both print and online journalists. So journalists who have no intention of making visualizations will still find a handful of features in Excel to help them with their day-to-day reporting.

2. My Favorite (Excel) Things

This PDF provided by Mary Jo Webster at the St. Paul Pioneer Press is a great, concise list of Excel formulas she uses all the time. It includes formulas on how to format dates, run sums and even run if-else statements in Excel. It’s one of my favorite resources for Excel and definitely worth bookmarking.
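As a quick taste of what those look like, here’s an if-else statement in Excel (my own example, not one pulled from the PDF):

=IF(C2 > 50000, "Above average", "Below average")

That fills the cell with one label or the other, depending on the number in C2.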

3. Google Docs – Spreadsheet resources

Not a fan of Excel or don’t have it installed on your work computer? Fortunately you can make a spreadsheet with Google by logging into Google Drive and clicking “Create > Spreadsheet.” The best part is spreadsheets you create with Google can be accessed from any computer as long as you log in with your Google account.

Here’s some resources for getting started with Google spreadsheets:

4. Google Refine resources

Sometimes you need more than just spreadsheet formulas to clean dirty data. That’s where the powerful Google Refine program can come into play. The program was designed to clean dirty data by finding inconsistencies in your spreadsheets. It can also help you sort data, add to data, transform it from one format to another and much, much more.

Here’s some resources you might find handy:

5. NICAR-L e-mail list

Still stuck? Fortunately, there is a wonderful community of computer-assisted reporters who are more than willing to help others out. If you want great information on spreadsheets or any other data journalism topic, check out the National Institute for Computer-Assisted Reporting’s email list. Questions on Excel come up almost every day.

Have any other useful resources not listed here? E-mail me at chris.essig@wcfcourier.com.

Written by csessig

June 25, 2012 at 9:37 am