Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Tutorial’ Category

Building your first Leaflet.js map

leave a comment »

Earlier this year, I had the privilege of teaching a class on building your first Leaflet.js map at the NICAR conference. I just realized I forgot to post the code in this blog so I figured I’d post it now. Better later than never.

If you’re interested in building your first map, check out my Github repo for the presentation.

Written by csessig

August 17, 2016 at 7:58 am

How to: Using OpenRefine, Excel to clean messy, user-submitted data

with one comment

Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

We at The Courier ran into an interesting problem last week that I’m sure other news organizations run into as well.

Every year, The Courier puts on a Best of the Best competition. We select a series of categories like food, dining,  jewelry stores, etc. and ask our readers to pick the best business in town for each category.

The businesses with the most votes in each category are rewarded by being featured in the newspaper. They also get plaques they can hang at their business to show off their nifty award.

We have several forms that people use to submit their votes. They are allowed to type in their favorite businesses, instead of pick from a field of choices.

We store their votes in a database. We then spit out the data into a spreadsheet with all the votes. Each row is one vote. For week two, we had a spreadsheet with 3,000+ rows of entries.

To view that spreadsheet, click here. We have a row for entry id, category, entrant id and the actual entry. The “entry” field is what we will focus on throughout this walkthrough.

The problem: Everyone has their own spelling for businesses in the “entry” column. For instance, “Famous Footwear” was spelled “Famous Footwear, Cedar Falls” and “Famous Footware”, it sometimes was partially upper-cased and sometimes it was misspelled altogether.

How do we merge all those different spellings for businesses into one spelling and count them up?

We could do it by hand, of course, but that is painful, tedious and time-consuming. We could also pay a company to do it but that could be expensive. At least one company that we know of charges $4,000 for such serves.

Fortunately there is a better option out there that is FREE and not horribly time consuming.

refine_cluster_ss

Here’s how we cleaned this spreadsheet with user-submitted data:

1. First download a program called OpenRefine and open it up. This program was originally created by the fine people at Google and then later open sourced. Its purpose is to clean “dirty” data (like our user-submitted data). Their website has three videos on how to use the program that you should definitely check out before firing it up.

2. With OpenRefine open, create a new project and import in the spreadsheet we want to clean. Then find the column of data we want to clean. For us, all of our user-submitted businesses were in a column called “entry.”

3. In OpenRefine, there is a little arrow next to the column title (“entry”). Click that to see a list of options for cleaning the data. If you look under “Edit cells” and “Common transforms,” you’ll see a list of common edits like trimming the leading and trailing whitespace in the entries. So ” U.S. Cellular ” would be changed to “U.S. Cellular” (notice how the white space is removed).

4. The real magic, however, is in the cluster function. This uses algorithms to find different spellings for the same business and then allows you to pick ONE common spelling for ALL of them.

– This makes it much easier down the road to count how many times a particular business was voted on because we don’t have a wide range of spellings for that business. Instead, we can narrow down it to just a couple of spellings. Or if we are really lucky, we can convert all of the different spellings into one common spelling using the algorithms.

– To pull them up, click “Facet” then “Text facet”. That will pull up a list of unique entries in the “entry” column. If you click “count”, you can see how many times a business was entered. For instance, “Scheels” was voted on 115 times. Then click “Cluster”.

– Clustering is one of OpenRefine’s most powerful tools. When you open this section up, you’ll see a list of clusters with different spellings for what the algorithm thinks is the same business.

– If those are accurate clusters — or if the different spellings listed in the cluster really are different variations for the same business — you can select the check box under “Merge?” to convert all of those spellings in the cluster to the same, single spelling.

– OpenRefine comes packed with several clustering algorithms. You can change “Method,” “Keying Function” etc. to check them all out and see which ones work best for you. When you’re done with one algorithm, you can select “Merge Selected & Re-Cluster” to mess with other algorithms or “Merge Selected & Close” to close out of this section.

– I think it’s best to go through all of the algorithms and play around with them a bit to cluster as many businesses together as you can.

6. Once you are done, export the spreadsheet with the button in the top right corner of the screen and click “Excel.”

Note: We used OpenRefine to cluster together different spellings for the same business. Now we’ll use Excel to create a list of uniquely-titled businesses and how many times those were voted on.

– After we are done, we will still probably have different spellings for the same business. It’s likely that the OpenRefine algorithms will be able to find many of the wide range of spellings for a particular business but not all of them. When you allow users to type in their own business names, you open yourself up to A LOT of spellings for the same thing.

– But the hope is we can narrow this list of spellings down to as few as possible so the manual work of tallying up the votes is as painless as possible.

Now open you spreadsheet up in Excel:

– Right now, we still have a spreadsheet that is about 3,000 rows long. The advantage is we have limited the number of different spellings for all the businesses people have voted on. What we need to do now is create a list of uniquely titled businesses and how many times they were voted on.

6. A lot of people accidentally put in their phone number for the entry. Some entered “====”. Others didn’t fill in the spot at all. We need to get rid of all those.

– I first moved the “entry” column from column E to column A. I then sorted the data from A to Z so the entries with phone numbers are at the top. I deleted those. I then sorted it from Z to A to put the empty entries at the top. And then I deleted those too.

– We can also delete the “entry_id” and “entrant_id” columns because for our purposes, they are useless.

– For those following along, here’s what my spreadsheet looks like now.

– One tricky aspect of this is people can vote for the same business in different categories. So we need to make sure the business and the category of each vote stay to together. That way, a vote for “Ace Hardware” in the “Hardware Store” category doesn’t count the same as a vote for “Ace Hardware” in the “Buy a Lawnmower” category.

7. I did this by merging my entry column with my category column into a new column. This can be done with the concatenate formula:

=CONCATENATE(A2, ” | “, B2)

– A2 is our entry column and B2 is our category column. The ” | ” isn’t necessary but is useful is we want to break apart this column again in the future. We then fill down that function to the last row in the spreadsheet using the fill handle.

– Here’s what my spreadsheet looks like now.

Tip: When using formulas on big spreadsheets, Excel tends to slow down A LOT. So when you are done using a formula, you can copy it and paste just the values (not the formulas) in its place to help Excel run smoother. I do this throughout the walkthrough to make our spreadsheets much easier work with.

8. The next step is to get a list of uniquely-titled businesses and categories. The best walkthrough I found to accomplish this was this one from Microsoft. Complete steps 1-6 under “Count the number of unique values by using a filter” to create a new column of uniquely-titled businesses.

– I also deleted the entry and cat_id columns. We will not need those any more.

– Here’s what my spreadsheet looks like now.

– We are down to under 1,000 rows of unique-titled businesses and categories. And honestly, if I would have messed with the algorithms in OpenRefine more, I could have narrowed that list even farther.

8. Now we have a column for each vote (column A) and column with each unique business and category (column B). We need to count how many times each unique business and category (column B) was voted on — or how many times each unique business and category appears in column A.

– We’ll do this by counting the number of times column B occurred in column A using this SUM formula:

=SUM(IF($A:$A=B2,1,0))

– This basically goes through each row in column A and if it is equal to what is in column B, it adds one. Then its adds all that together. The result is how many times our unique business and category was voted on.

Note: The dollar signs in the formula basically means go through every row in column A.

– Here’s what my spreadsheet looks like now.

9. We’re almost done. All we need to do now is split up our business category column into a column for businesses and a column for categories.

– Do this by first copying and pasting the value of the “unique_entry_cat” column into a new column. We have to do this first, otherwise our sum formula in our “count” column won’t work properly.

– We then break apart that new column using this method and use “|” as the delimiter to split them up. Then delete the old columns and copy and paste the values over the formulas in the spreadsheet.

– Here’s what my spreadsheet looks like now.

10. Finally, copy and paste the values of the columns onto a new spreadsheet and custom sort the spreadsheet first from A to Z for the category column and then largest to smallest in the count column.

– Here’s what the final spreadsheet looks like.

– That’s it! I realize that this is a VERY long-winded blog post and may be confusing to follow at times. If you have any questions, don’t hesitate to e-mail me at chris.essig@wcfcourier.com

Written by csessig

August 16, 2013 at 3:06 pm

Courses, tutorials and more for those looking to code

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Without a doubt, there is an abundance of resources online for programmers and non-programers alike to learn to code.

This, of course, is great news for journalists like us who are looking to use programming to make visualizations, scrape websites or simply pick up a new skill.

Here’s a list of courses and tutorials I’ve found in the last couple months that have either helped me personally or look very promising:

1. Codecademy

Is 2012 the year of code? The startup service Codecademy sure thinks it is. They have made it their mission to teach every one who is willing how to code within one year. The idea was so intriguing that the New York Times ran a front page story (at least online) on it.

Basically, users create an account with the service and every week they are sent new exercises that will teach them how to code. The first exercises focused on Javascript. Now, users are moving into HTML and CSS. Each exercise takes a couple hours to complete and build off the previous week’s exercsies. And best of all, it’s FREE.

If you are a huge nerd like me, you’ll gladly spend your free time completing the courses.

2. Coursera

Want to take courses from Stanford University, Princeton University, University of Michigan and University of Pennsylvania for free? Yeah, I didn’t really think it was possible either until I found Coursera, which offers a wide variety of courses in computer science and other topics.

Right now, I am enrolled in Computer Science 101, which is a six-week course that focuses on learning the basics. Each week, you are e-mailed about an hour of video lectures, as well as exercises based on those lectures. There is also a discussion forum so you can meet your peers. This isn’t nearly as time consuming as Codecademy is, which might be appealing to some.

3. Udacity

Like Coursesra, Udacity offers a number of computer science classes on beginner, intermediate and advanced topics. The classes are also based on video lectures put together by some very, very smart people. I have not used this service, however, so I can’t speak to it too much. It looks promising though. And who wouldn’t want to learn how to program a robotic car?

4. Code School

This service offers screencasts on a host of topics like Javascript, jQuery, Ruby, HTML, CSS and more. The downside, however, is this service does cost: $20 a month or $55 a screencast. If you are looking to try it out, check out their free beginner’s screencast on the Javascript library jQuery, which is the best beginner’s introduction to jQuery I’ve seen. They also have a free screencast for the Ruby programming language.

5. PeepCode

If you are looking for screencasts but are on a tighter budget, check out PeepCode and their list of programming screencasts. Each are about $12, are downloadable and typically include source code for the programs to help you follow along at home. One of my favorites is “Meet the Command Line,” which will get you started with the Unix Command Line. Be warned though because some of their screencasts are geared towards more advanced users. A good understanding of programming is recommended before diving into some of these (An exception is the command line tutorial mentioned above).

6. Net Tuts+

Many of the tutorials on this site are geared towards programmers wanting to learn very specific things or solve specific problems. This tutorial, for instance, runs through how to make borders in CSS. And this one deals with the Command Line text editor called Vim. So if you have a particular problem but don’t have a ton of time to sit through video tutorials, you might want to check out this site’s extensive catalog.

7. ScraperWiki

Web scraping is a great skill for journalists to have because it can help us pull a large amount of information from websites in a matter of seconds. If you are looking for a place to start, check out some of the screencasts offered by ScraperWiki, a service that specializes in — you guessed it — web scraping.

8. Coding blogs

The number of blogs out there devoted to coding and programming is both vast and impressive. Two of my favorite are Life and Code and Baby Steps in Data Journalism. Both are geared towards journalists. In fact, many of the sites I listed here were initially posted on one of these blogs.

– Got a cool website that has helped you out?

I’d love to hear about it! Feel free to leave a comment or e-mail me at chris.essig@wcfcourier.com

Written by csessig

May 3, 2012 at 8:22 am