Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for August 2013

How to: Using OpenRefine, Excel to clean messy, user-submitted data

with one comment

Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

We at The Courier ran into an interesting problem last week that I’m sure other news organizations run into as well.

Every year, The Courier puts on a Best of the Best competition. We select a series of categories like food, dining,  jewelry stores, etc. and ask our readers to pick the best business in town for each category.

The businesses with the most votes in each category are rewarded by being featured in the newspaper. They also get plaques they can hang at their business to show off their nifty award.

We have several forms that people use to submit their votes. They are allowed to type in their favorite businesses, instead of pick from a field of choices.

We store their votes in a database. We then spit out the data into a spreadsheet with all the votes. Each row is one vote. For week two, we had a spreadsheet with 3,000+ rows of entries.

To view that spreadsheet, click here. We have a row for entry id, category, entrant id and the actual entry. The “entry” field is what we will focus on throughout this walkthrough.

The problem: Everyone has their own spelling for businesses in the “entry” column. For instance, “Famous Footwear” was spelled “Famous Footwear, Cedar Falls” and “Famous Footware”, it sometimes was partially upper-cased and sometimes it was misspelled altogether.

How do we merge all those different spellings for businesses into one spelling and count them up?

We could do it by hand, of course, but that is painful, tedious and time-consuming. We could also pay a company to do it but that could be expensive. At least one company that we know of charges $4,000 for such serves.

Fortunately there is a better option out there that is FREE and not horribly time consuming.

refine_cluster_ss

Here’s how we cleaned this spreadsheet with user-submitted data:

1. First download a program called OpenRefine and open it up. This program was originally created by the fine people at Google and then later open sourced. Its purpose is to clean “dirty” data (like our user-submitted data). Their website has three videos on how to use the program that you should definitely check out before firing it up.

2. With OpenRefine open, create a new project and import in the spreadsheet we want to clean. Then find the column of data we want to clean. For us, all of our user-submitted businesses were in a column called “entry.”

3. In OpenRefine, there is a little arrow next to the column title (“entry”). Click that to see a list of options for cleaning the data. If you look under “Edit cells” and “Common transforms,” you’ll see a list of common edits like trimming the leading and trailing whitespace in the entries. So ” U.S. Cellular ” would be changed to “U.S. Cellular” (notice how the white space is removed).

4. The real magic, however, is in the cluster function. This uses algorithms to find different spellings for the same business and then allows you to pick ONE common spelling for ALL of them.

– This makes it much easier down the road to count how many times a particular business was voted on because we don’t have a wide range of spellings for that business. Instead, we can narrow down it to just a couple of spellings. Or if we are really lucky, we can convert all of the different spellings into one common spelling using the algorithms.

– To pull them up, click “Facet” then “Text facet”. That will pull up a list of unique entries in the “entry” column. If you click “count”, you can see how many times a business was entered. For instance, “Scheels” was voted on 115 times. Then click “Cluster”.

– Clustering is one of OpenRefine’s most powerful tools. When you open this section up, you’ll see a list of clusters with different spellings for what the algorithm thinks is the same business.

– If those are accurate clusters — or if the different spellings listed in the cluster really are different variations for the same business — you can select the check box under “Merge?” to convert all of those spellings in the cluster to the same, single spelling.

– OpenRefine comes packed with several clustering algorithms. You can change “Method,” “Keying Function” etc. to check them all out and see which ones work best for you. When you’re done with one algorithm, you can select “Merge Selected & Re-Cluster” to mess with other algorithms or “Merge Selected & Close” to close out of this section.

– I think it’s best to go through all of the algorithms and play around with them a bit to cluster as many businesses together as you can.

6. Once you are done, export the spreadsheet with the button in the top right corner of the screen and click “Excel.”

Note: We used OpenRefine to cluster together different spellings for the same business. Now we’ll use Excel to create a list of uniquely-titled businesses and how many times those were voted on.

– After we are done, we will still probably have different spellings for the same business. It’s likely that the OpenRefine algorithms will be able to find many of the wide range of spellings for a particular business but not all of them. When you allow users to type in their own business names, you open yourself up to A LOT of spellings for the same thing.

– But the hope is we can narrow this list of spellings down to as few as possible so the manual work of tallying up the votes is as painless as possible.

Now open you spreadsheet up in Excel:

– Right now, we still have a spreadsheet that is about 3,000 rows long. The advantage is we have limited the number of different spellings for all the businesses people have voted on. What we need to do now is create a list of uniquely titled businesses and how many times they were voted on.

6. A lot of people accidentally put in their phone number for the entry. Some entered “====”. Others didn’t fill in the spot at all. We need to get rid of all those.

– I first moved the “entry” column from column E to column A. I then sorted the data from A to Z so the entries with phone numbers are at the top. I deleted those. I then sorted it from Z to A to put the empty entries at the top. And then I deleted those too.

– We can also delete the “entry_id” and “entrant_id” columns because for our purposes, they are useless.

– For those following along, here’s what my spreadsheet looks like now.

– One tricky aspect of this is people can vote for the same business in different categories. So we need to make sure the business and the category of each vote stay to together. That way, a vote for “Ace Hardware” in the “Hardware Store” category doesn’t count the same as a vote for “Ace Hardware” in the “Buy a Lawnmower” category.

7. I did this by merging my entry column with my category column into a new column. This can be done with the concatenate formula:

=CONCATENATE(A2, ” | “, B2)

– A2 is our entry column and B2 is our category column. The ” | ” isn’t necessary but is useful is we want to break apart this column again in the future. We then fill down that function to the last row in the spreadsheet using the fill handle.

– Here’s what my spreadsheet looks like now.

Tip: When using formulas on big spreadsheets, Excel tends to slow down A LOT. So when you are done using a formula, you can copy it and paste just the values (not the formulas) in its place to help Excel run smoother. I do this throughout the walkthrough to make our spreadsheets much easier work with.

8. The next step is to get a list of uniquely-titled businesses and categories. The best walkthrough I found to accomplish this was this one from Microsoft. Complete steps 1-6 under “Count the number of unique values by using a filter” to create a new column of uniquely-titled businesses.

– I also deleted the entry and cat_id columns. We will not need those any more.

– Here’s what my spreadsheet looks like now.

– We are down to under 1,000 rows of unique-titled businesses and categories. And honestly, if I would have messed with the algorithms in OpenRefine more, I could have narrowed that list even farther.

8. Now we have a column for each vote (column A) and column with each unique business and category (column B). We need to count how many times each unique business and category (column B) was voted on — or how many times each unique business and category appears in column A.

– We’ll do this by counting the number of times column B occurred in column A using this SUM formula:

=SUM(IF($A:$A=B2,1,0))

– This basically goes through each row in column A and if it is equal to what is in column B, it adds one. Then its adds all that together. The result is how many times our unique business and category was voted on.

Note: The dollar signs in the formula basically means go through every row in column A.

– Here’s what my spreadsheet looks like now.

9. We’re almost done. All we need to do now is split up our business category column into a column for businesses and a column for categories.

– Do this by first copying and pasting the value of the “unique_entry_cat” column into a new column. We have to do this first, otherwise our sum formula in our “count” column won’t work properly.

– We then break apart that new column using this method and use “|” as the delimiter to split them up. Then delete the old columns and copy and paste the values over the formulas in the spreadsheet.

– Here’s what my spreadsheet looks like now.

10. Finally, copy and paste the values of the columns onto a new spreadsheet and custom sort the spreadsheet first from A to Z for the category column and then largest to smallest in the count column.

– Here’s what the final spreadsheet looks like.

– That’s it! I realize that this is a VERY long-winded blog post and may be confusing to follow at times. If you have any questions, don’t hesitate to e-mail me at chris.essig@wcfcourier.com

Written by csessig

August 16, 2013 at 3:06 pm

Our experiment with an ultra-light CMS

leave a comment »

So I was featured in Source recently. No not that Source. This Source.  It’s a blog for journalists who code. It’s a Knight-Mozilla OpenNews project, and it’s maintained by some of the brightest journo-techies in the industry.

The article was about newspapers using “ultra-light CMSes” instead of their own full-blown CMSes to put publish web projects. I interviewed with Katie Zhu, a very talented digital journalist who currently works for the New York Times. She talked to me about how we use Tarbell to maintain our news apps landing page. We used Tarbell instead of BLOX, which is our day-to-day CMS, to deploy the app outside of the CMS.

The article covers Tarbell — and other ultra-light CMSes — in great detail so I won’t go into detail about it here. Instead, you should read the entire blog post. I’m quoted near the bottom!

Written by csessig

August 12, 2013 at 9:40 am