Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘ProPublica’ Category

Turning Excel spreadsheets into searchable databases in five minutes


Note: This is cross-posted from Lee’s data journalism blog and includes references to the Blox content management system, which is what Lee newspapers use. Reporters at Lee newspapers can read my blog over there by clicking here.

Here's a scenario I run into all the time: A government agency sends along a spreadsheet of data to go with a story one of our reporters is working on, and you want to post the spreadsheet(s) online with the story in a reader-friendly way.

One way is to create a sortable database with the information. There are a couple of awesome options on the table put out by other news organizations. One is TableSetter, published by ProPublica, which we've used in the past. The other is TableStacker, a spin-off of TableSetter, which we have also used before.

The only problem is that making these tables typically takes a bit of legwork. And of course, you're on deadline.

Here’s one quick and dirty option:

1. First open the spreadsheet in Excel or Google Docs and highlight the fields you want to make into the sortable database.

2. Then go to this wonderful website called Mr. Data Converter and paste the spreadsheet information into the top box. Select "HTML" as the output.

3. We will use a jQuery plugin called DataTables to create the sortable tables. It's great and easy to work with.

4. Now create an HTML asset in Blox and paste in this DataTables template below:

Note: I’ve added some CSS styling to make the tables look better.

<html>
<head>
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_page.css">
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_table.css">

<style>
table {
	font-size: 12px;
	font-family: Arial, Helvetica, sans-serif;
	float: left;
}
table th, table td {
    text-align: center;
}

th, td {
	padding-top: 10px;
	padding-bottom: 10px;
	font-size: 14px;
}

label {
	width: 100%;
	text-align: left;
}

table th {
	font-weight: bold;
}

table thead th {
    vertical-align: middle;
}

label, input, button, select, textarea {
    line-height: 30px;
}
input, textarea, select, .uneditable-input {
    height: 25px;
    line-height: 25px;
}

select {
    width: 100px;
}

.dataTables_length {
    padding-left: 10px;
}
.dataTables_filter {
	padding-right: 10px;
}

</style>

<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.js"></script>
<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.dataTables.min.js"></script>

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>
</head>

<body>

<!-- Enter HTML table here -->

</body>

</html>

– This will link the page to the necessary CSS stylesheets and JavaScript files to get DataTables working. The other option is to go to the DataTables website, download the files yourself and post them on your own server, then link to those files instead of the ones hosted by WCFCourier.com.

5. Where you see the text "<!-- Enter HTML table here -->," paste in your HTML table from Mr. Data Converter.

6. The last thing you will need to do is create an "id" for the table and link that "id" to the DataTables plugin. In the example above, the "id" is "five_year." It is set in this chunk of code from the DataTables template:

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>

– The header of the HTML table that you paste into the template will look like this:

<table id="five_year" style="width: 620px;">
  <thead>
    <tr>
      <th class="NAME-cell">NAME</th>
      <th class="2008 Enrollment-cell">2008 Enrollment</th>
      <th class="2012 Enrollment-cell">2012 Enrollment</th>
      <th class="Increase/Decrease-cell">Increase/Decrease</th>
      <th class="Percent Increase/Decrease-cell">Percent Increase/Decrease</th>
    </tr>
  </thead>

– Here’s an live example of two sortable tables. The first table has an “id” of “five_year.” The second has an “id” of “one_year.” The full code for the two tables is available here.

– As an alternative, you can use a jQuery plugin called TableSorter (not to be confused with the TableSetter project mentioned above). The process of creating the table is very similar.

7. That’s it! Of course, DataTables provides several customization options that are worth looking into if you want to make your table look fancier.

Written by csessig

January 3, 2013 at 4:34 pm

Turning Blox assets into timelines: Last Part


Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said you could probably learn a thing or two about web-scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part two, click here

 

If you’ve followed along thus far, you’ll see we’re almost done turning a collection of Blox assets — which can include articles, photos and videos — into  timelines using a tool made available by ProPublica called TimelineSetter.

Right now, we are in the process of creating a CSV file, which we will use with the program TimelineSetter. This program takes the CSV file and creates a nice-looking timeline out of it.

The CSV file is created by using a Python script that scrapes a web page we created with all of our content on it. The full code for this script is available here.

In the second part of the series, we went ahead and created a Python scraper that will pull all the information on the page we need.

1. If you’ve gone through the second part, you can go ahead and run python timeline.py now on your command line (More information on how to run the scraper is available in the first part of the blog).

You’ll notice the script will output a CSV that has all the information we need. But some of it is ugly. We need to delete certain tags, for instance, that appear before and after some of the text.

But before we can run Python commands to delete these unnecessary tags, we need to convert all of the text into strings. A string is basically just text; integers, by contrast, are numbers.

So let’s do that first:

# Convert that information into strings
    date2 = str(date)
    link2 = str(link)
    headline2 = str(headline)
    image2 = str(image)

    # These are pulled from outside the loop
    description2 = str(description)

Note: This code and the following chunks should go inside the loop statement we created in the second part of the series, because we want these changes applied to every item we scrape. If you are wondering what a loop statement is, I tried to explain them in my second post; your best bet, though, is to ask Google.
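To make that nesting concrete, here is a schematic sketch (not the full script) of how these cleanup lines sit inside the loop from the second part of the series. The find command is the same one we wrote there:

# Schematic only: each pass of the loop scrapes one article, then cleans it up
for x in events:
    headline = x.find('h3').text   # scraped, as set up in part two
    headline2 = str(headline)      # converted to a string, as above
    # ...the rest of the cleanup and the f.write line go here, still indented inside the loop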

2. Now that we’ve converted everything into strings, we can get rid of the ugly tags. You’ll notice that the descriptions of our stories start and end with with a “p” tag. Similarly, all dates start and end with an “em” tag.

Because of this, I added a few lines to the Python script. These lines will replace the ugly tags with nothing…effectively deleting them before they are put into the CSV file:

# Extra formatting needed for dates to get rid of em tags and unnecessary formatting
    date4 = date3.replace('[<em>', "")
    date5 = date4.replace('</em>]', "")
    date6 = date5.replace('- ', "")
    date7 = date6.replace("at ", "")

    # Extra formatting is also needed for the description to get rid of p tags and new line returns
    description4 = description3.replace('[<p>', "")
    description5 = description4.replace('</p>]', "")
    description6 = description5.replace('\n', " ")
    description7 = description6.replace('[]', "")

3. For images, the width attribute the script scrapes is a three-digit number, which we match with the regular expression "\d\d\d" and replace so that every image on the page is 300 pixels wide. Also, we will delete the text "None," which Python spits out if an image doesn't exist for a particular asset.

# We will adjust the width of all images to 300 pixels. Also, Python spits out the word 'None' if it doesn't find an image. Delete that.
    image4 = re.sub(r'width="\d\d\d"', 'width="300"', image3)
    image5 = image4.replace('None', "")

4. If you are at all familiar with the way articles work in Blox, you know that when you update them, red text shows up next to the headline telling you when the story was last updated. The code that formats that text and makes it red is useless to us now. So we will delete this and replace it with the current time and date using the Python datetime module.

To use the datetime module, we have to first import it at the top of our Python script. We then need to call the object within the module that we want. A good introduction to Python modules is available here.

The method we want is called "datetime.now()". As the name suggests, it returns the current date and time. I then put the result in a variable called "now", which makes it easier to work with later in the script.

So the top of our page should look like this:

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

Inside the loop we call the “datetime.now()” object and replace the text “Updated” with the current date and time:

# If the story has been updated recently, an em class tag will appear on the page showing the time but not the date. We will delete the class and replace it with today's date. We can change the date in the CSV if we need to.
    date8 = date7.replace('[<em class="item-updated badge">Updated:', str(now.strftime("%Y-%m-%d %H:%M")))
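In case you're wondering what that replacement text looks like, strftime() simply turns the date and time into a plain string using the format codes in quotes. A quick example (the exact output depends on when you run it):

# "%Y-%m-%d %H:%M" produces something like: 2012-03-16 08:38
print now.strftime("%Y-%m-%d %H:%M")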

5. Now there is just one last bit of cleaning up we will need to do. For those who don't know, CSV stands for comma-separated values, meaning the columns in our spreadsheet are separated by commas. It's the preferred type of spreadsheet for most programs because it's so simple.

We can run into problems, however, if some of our data includes commas. So these next few lines in our script will replace all of the commas we scraped with dashes. You can change it to whatever character or characters you want:

# We'll replace commas with dashes so we don't screw up the CSV. You can change the dash to whatever character you want
    date3 = date2.replace(",", " -")
    link3 = link2.replace(",", " -")
    headline3 = headline2.replace(",", " -")
    description3 = description2.replace(",", " -")
    image3 = image2.replace(",", " -")

If you want to put the commas back into the timeline, you can do so after the final HTML file is created (that is, after you run the timeline-setter command covered in the first part of this series). Typically I'll replace all the commas with "////" and then do a simple find and replace on the final HTML file with a text editor, changing every instance of "////" back into a comma.
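If you'd rather not do that find and replace by hand, a couple of lines of Python will do the same thing. This is just a sketch; it assumes your placeholder is "////" and that the finished file is named Timeline.html, as in part one:

# Swap the '////' placeholder back into commas in the finished HTML file
html = open('Timeline.html').read()
open('Timeline.html', 'w').write(html.replace('////', ','))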

Now we have clean, concise data! We will now put this into the CSV file using the “write” command. Again, all these commands are put inside the loop statement we created in the second part of the series, so every image, description, date, etc. we scrape will be cleaned up and ready to go:

# Write the information to the file. The HTML code is based on coding recognized by TimelineSetter
    f.write(date8 + "," + description7 + "," + link3 + "," + '<h2 class="timeline-img-hed">' + headline3 + '</h2>' + image5 + "\n")

The headlines, you’ll notice, are put into an HTML tag “h2” tag. This will bold and increase the size of the headlines so they stick out when a reader clicks on a particular event in the timeline.

That’s it for the loop statement. Now we will add a simple line outside of the loop statement that closes the CSV file when all the loops we want run are done:

#You're done! Close file.
f.close()
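As an aside, another common Python idiom for this is the "with" statement, which closes the file for you when the block ends. It's not what this script uses, just an alternative worth knowing about; here's a minimal sketch with a made-up file name:

# Minimal sketch of the 'with' idiom; the file closes automatically at the end of the block
with open('example.csv', 'w') as out:
    out.write("date" + "," + "description" + "\n")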

And there you have it folks. We are now done with our Python script. We can now run it (see part 1 of the series) and have a nice, clean looking CSV file we can turn into a timeline.

If you have any questions, PLEASE don’t hesitate to e-mail or Tweet me. I’d be more than happy to help.

Written by csessig

March 16, 2012 at 8:38 am

Turning Blox assets into timelines: Part 2


Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said you could probably learn a thing or two about webscraping even if you don’t use Blox.

For part one of this tutorial, click here. For part three, click here

 

On my last blog, I discussed how you can turn Blox assets into a  timeline using a tool made available by ProPublica called TimelineSetter.

If you recall, most of the magic happens with a little Python script called Timeline.py. It scrapes information from a page and puts it into a CSV file, which can then be used with TimelineSetter.

So what’s behind this Timeline.py file? I’ll go through the code by breaking it down into chunks. The full code is here and is heavily commented to help you follow along.

(NOTE: This python script is based off this tutorial from BuzzData. You should definitely check it out!)

– The first part of the script is basically the preliminary work. We’re not actually scraping the web page yet. This code first imports the necessary libraries for the script to run. We are using a Python library called BeautifulSoup that was designed for web scraping.

We then create a CSV to put the data in using the open() function and write an initial header row to it with the write() method. Also be sure to enter the URL of the page you want to scrape.

Note: For now, ignore the line “now = datetime.datetime.now().” We will discuss it later.

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

# Create a CSV where we'll save our data. See further docs:
# http://propublica.github.com/timeline-setter/#csv
f = open('timeline.csv', 'w')

# Make the header rows. These are based on headers recognized by TimelineSetter.
f.write("date" + "," + "description" + "," + "link" + "," + "html" + "\n")

# URL we will scrape
url = 'http://wcfcourier.com/test/scrape/dunkerton/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

– Before we go any further, we need to look at the page we are scraping, which in this example is this page. It’s basically a running list of articles about a particular subject. All of these stories will go on the timeline.

Now we’ll ask: what do we actually want to pull from this page? For each article we want to pull: the headline, the date, the photo, the first paragraph of the story and the link to the article.

Now we need to become familiar with the HTML of the page so we can tell BeautifulSoup what HTML attributes we want to pull from it. Go ahead and open the page up and view its source (Right click > View page source for Firefox and Chrome users).

One of the easiest things we can do is just search for the headline of the first article. So type in "Mayor's arrest rattles Dunkerton." This will take us to the chunk of code for that article. You'll notice that the headline and all the other attributes for the story are contained in a DIV with the class "story-block."

All stories on this page are formatted the same so every story is put into a DIV with the class ‘story-block.’ Thus, the number of DIVs with the class ‘story-block’ is also equal to the number of articles on the page we want to scrape.

– For the next line of code, we will put that whole set of DIVs into a variable called 'events.' The line after that is what is known as a 'for loop.' Together, these two lines run the 'for loop' once for every DIV with the class 'story-block' on the page.

So if we have five articles we want to scrape, the 'for loop' will run five times. If we have 25 articles, it will run 25 times.

events = soup.findAll('div', attrs={'class': 'story-block'})
for x in events:
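If you want to double-check that BeautifulSoup is finding what you expect, you can add one optional line right after the 'events' line (and before the 'for loop') to print how many 'story-block' DIVs it found:

# Optional sanity check: the count should match the number of articles on the page
print len(events)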

– Inside the ‘for loop,’ we need to tell it what information from each article we want to pull. Now go back to the source of the page we are scraping and find the headline, the date, the photo, the first paragraph of the story and the link to the article. You should see that:

  • The date is in a paragraph tag with the class 'story-more'
  • The link appears several times, including within a tag called ‘fb:like,’ which is the Facebook like button people can click to share the article on Facebook.
  • The headline is in an h3 tag, which is a header tag.
  • The first few paragraphs of the story are contained within a DIV with the id ‘blox-story-text.’ Note: In the Python script, we will tell BeautifulSoup to pull only the first paragraph.
  • The photo is contained within an img tag, which shouldn’t be a surprise.

So let’s put all of that in the ‘for loop’ so it knows what we want from each article. The code below uses BeautifulSoup syntax, which you can find out about by reading their documentation.

    # Information on the page that we will scrape
    date = x.find('p', attrs={'class': 'story-more'})('em')
    link = x.find('fb:like')['href']
    headline = x.find('h3').text
    description = x.find('div', attrs={'id': 'blox-story-text'})('p', limit=1)
    image = x.find('img')

One note about the above code: 'x' stands for whichever article the 'for loop' is currently working on. Say we want to scrape 20 articles. The first time through the 'for loop,' 'x' holds the first article's 'story-block' DIV. The second time through, it holds the second article's DIV. The last time through, it holds the 20th.

That is why every find command above starts with 'x.' It tells BeautifulSoup to look for the headline, date, link, photo and description inside the article we are currently on, so we pull information from a different article each time we go through the 'for loop.'

Without 'x,' we'd have no way to step through the page one article at a time. The 'x' in combination with the 'for loop' basically tells BeautifulSoup to start with one article, then move on to the next, and the next, until we've scraped every article we want. The short sketch below shows what each pass of the loop sees.

– Now you should be well on your way to creating timelines with Blox assets. In the third and final part of this tutorial, we will clean up the data a little bit so it looks nice on the page. Look for the final post of this series soon!

Written by csessig

March 7, 2012 at 2:21 pm

Turning Blox assets into timelines: Part 1


Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. For part two of this tutorial, click here. For part three, click here.

 

A couple weeks ago I blogged about the command line and a few reasons journalists should learn it. Among those reasons was a timeline tool made available by ProPublica called TimelineSetter. Here are two live examples to give you an idea of what the tool looks like.

To create the timeline, you will first need to make a specially-structured CSV file (CSV is a type of spreadsheet file. Excel can export CSV files). Rows in the CSV file will represent particular events on the timeline. Columns will include information on those events, like date, description, photo, etc.

ProPublica has a complete list of available columns here. To give you an idea of what the final product will look like BEFORE you make the timeline, you can download one CSV file we used to make a timeline by clicking here.
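If it helps to see the shape of that file before you build one, here is a minimal sketch, with made-up data, of what a TimelineSetter-ready CSV might contain, written here with Python's csv module just to show the structure. The header names are ones TimelineSetter recognizes; everything else in this snippet is hypothetical:

import csv

# Made-up example row; see ProPublica's docs for the full list of recognized columns
f = open('timeline.csv', 'w')
writer = csv.writer(f)
writer.writerow(['date', 'description', 'link', 'html'])
writer.writerow(['2012-02-24 10:00', 'Mayor arrested in Dunkerton', 'http://example.com/story', '<h2>Mayor arrested</h2>'])
f.close()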

After you have your CSV file, you run a simple command and voilà! A beautiful timeline will be created. For more information on the command you have to run, check out the TimelineSetter page. (Hint: The command you run is: timeline-setter -c timeline.csv)

By far the most tedious part of all this is tracking down events and articles you want to include in the timeline and making your CSV file. That is why I wrote a simple Python script that will help turn Blox assets into a CSV file you can use with TimelineSetter.

Here’s a walkthrough of how to use it:

1. The first thing you need to do is go to this GitHub page and follow the instructions in the ReadMe file. After that, you will have a page set up with all of the events you want to include in the timeline. Here's an example of what that page should look like.

2. Download the Python script (Timeline.py). What this script will actually be doing is scraping the web page we just created. And by that I mean it will be going to that page and pulling out all the information we need to create our timeline. So it will be grabbing photos, headlines, dates, etc. It will then create a CSV file with all that information. We can then use that CSV file with TimelineSetter.

3. The script uses a Python library called Beautiful Soup. If you don’t have that downloaded already, click here. It takes only a few seconds to install.

4. In the Timeline.py file on line 16 is a spot for the URL of the page we want to scrape. Make sure you change that to whatever URL you created.

5. Run the command python timeline.py from your command line in the same directory as the Python script you downloaded. This will output a CSV file.

6. You will need to download TimelineSetter, which is also really easy to do. Just run this command: gem install timeline_setter. For more information, click here.

7. Now navigate to the folder with the CSV file and run this command: timeline-setter -c timeline.csv. (Or whatever your CSV file is called).

8. You should end up with a directory of JavaScript files, a CSS file and a Timeline.html file. This is your timeline. Now put it on your server and embed it using an HTML asset in Blox (or do whatever else you want with it).

9. Do the happy dance (Mandatory step)

This will get you pushing out timelines in no time. On my next blog, I will be going through that Python script (Timeline.py) and what it actually does to create the CSV file.

Written by csessig

March 2, 2012 at 9:59 am

ProPublica to the rescue


ProPublica, known for producing excellent investigative journalism, also has a wonderful staff of developers who have put out several tools to help fellow journalists like me. Here's a quick run-through of two of their tools I've used in the last month.

TimelineSetter – I’m a big fan of timelines. They can help newspapers show passage of time, obviously, as well as keep their stories on a particular subject in one central location. This is exactly how we used ProPublica’s TimelineSetter tool when ‘Extreme Makeover: Home Edition’ announced they were going to build a new home for a local family. From the print side, we ran several stories, including about one a day during the week of the build. The photo department also put out four photo galleries on the build and a fundraiser. Finally, our videographer shot several videos. Our audience ate up our coverage, raking up more than 100,000 page views on the photo galleries alone. But unless you wanted to attach every story, gallery and video to any new story we did (which would be both cumbersome and unattractive), it would have been hard to get a full scope of our coverage. That’s were the ProPublica tool came into play. Simply, it helped compile all of our coverage of the event on one page.

I’m not going to go into detail on how I put together the timeline. Instead, I will revert you to their fantastic and easy to use documentation. Best of all, the timeline is easy to customize and upload to your site. It’s also free, unlike the popular timeline-maker Dipity. Check it out!

TableSetter – This tool is equally impressive and fairly easy to use. The final output is basically an easy-to-navigate-and-sort spreadsheet. And, again, the documentation is comprehensive. Run through it and you'll have a sorted table up in no time! I've put together two already in the last week or so.

The first is a list of farmers markets in Iowa, with links to their home pages (if available) and a Google Maps link, which was formatted using a formula in Microsoft Excel. The formula for the first row looked like this: =CONCATENATE("http://www.google.com/maps?q=", " ", B2, " ", C2, " ", E2, " ", "Iowa")

The first part is the Google Maps link, obviously. B2 is the cell with the address; C2 is the city; E2 is the ZIP code; and finally "Iowa" tells Google Maps where to look. In between each field I put a space so Google can read the text and try to map it out (I should note that not every location could be mapped). Then I just copied and pasted this formula for every row in the table. At this point, I had a standard XLS Excel file, which I saved as a CSV file. TableSetter takes that CSV file and formats it using a YML file to produce the final output. Go and read the docs; it's easier than having me try to explain it all. Here's what my CSV looked like; here's my YML file; and finally the table, which was posted on our site.
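If you'd rather script that step than drag a formula down a column, here is a rough Python sketch that builds the same Google Maps link for each row. It is not the workflow I used; the file name (markets.csv) and the column positions are assumptions based on the description above:

import csv

# Assumed layout: column B = address, C = city, E = ZIP; skip the header row, then print a link per row
reader = csv.reader(open('markets.csv'))
reader.next()
for row in reader:
    address, city, zip_code = row[1], row[2], row[4]
    print 'http://www.google.com/maps?q= ' + address + ' ' + city + ' ' + zip_code + ' Iowa'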

In the same vein, I put together this table on what each state department is requesting from the government in the upcoming fiscal year.

I should also note here that the Data Desk at the LA Times has a variation of ProPublica's TableSetter that offers more features (like embedding photos in the table, for instance). It's called Table Stacker and works in a very similar fashion to TableSetter. I recommend checking it out after you get a feel for ProPublica's TableSetter.

Learning the Command Line: Both of these tools require the use of the command line through the Terminal program installed on your computer. If you are deathly afraid of that mysterious black box like I was, I highly recommend watching PeepCode's video introduction to the command line called "Meet the Command Line." And when you're done, go ahead and jump into the second part, "Advanced Command Line." Yes, they both cost money ($12 apiece), but there is so much information packed into each hour-long screencast that they are both completely worth it. I was almost instantly comfortable with the command line after watching both screencasts.

Written by csessig

October 21, 2011 at 3:23 pm