Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Lee Enterprises’ Category

Turning Excel spreadsheets into searchable databases in five minutes

with 21 comments

Note: This is cross-posted from Lee’s data journalism blog and includes references to the Blox content management system, which is what Lee newspapers use. Reporters at Lee newspapers can read my blog over there by clicking here.

Here’s a scenario I run into all the time: A government agency sends along a spreadsheet of data to go along with a story one of our reporters is working on, and you want to post the spreadsheet(s) online with the story in a reader-friendly way.

One way is to create a sortable database with the information. There are a couple of awesome options out there put out by other news organizations. One is TableSetter, published by ProPublica, which we’ve used in the past. The other is TableStacker, a spin-off of TableSetter, which we have also used.

The only problem is that making these tables typically takes a little bit of legwork. And of course, you’re on deadline.

Here’s one quick and dirty option:

1. First open the spreadsheet in Excel or Google Docs and highlight the fields you want to make into the sortable database.

2. Then go to this wonderful website called Mr. Data Converter and paste the spreadsheet information into the top box. Select the output as “HTML.”

3. We will use a service called DataTables, a great and easy jQuery plugin, to create the sortable tables.

4. Now create an HTML asset in Blox and paste in this DataTables template below:

Note: I’ve added some CSS styling to make the tables look better.

<html>
<head>
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_page.css">
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_table.css">

<style>
table {
	font-size: 12px;
	font-family: Arial, Helvetica, sans-serif;
	float: left;
}
table th, table td {
    text-align: center;
}

th, td {
	padding-top: 10px;
	padding-bottom: 10px;
	font-size: 14px;
}

label {
	width: 100%;
	text-align: left;
}

table th {
	font-weight: bold;
}

table thead th {
    vertical-align: middle;
}

label, input, button, select, textarea {
    line-height: 30px;
}
input, textarea, select, .uneditable-input {
    height: 25px;
    line-height: 25px;
}

select {
    width: 100px;
}

.dataTables_length {
    padding-left: 10px;
}
.dataTables_filter {
	padding-right: 10px;
}

</style>

<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.js"></script>
<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.dataTables.min.js"></script>

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>
</head>

<body>

<!-- Enter HTML table here -->

</body>

</html>

– This will link the page to the necessary CSS stylesheets and Javascript files to get the DataTables plugin working. The other option is to go to the DataTables website, download the files yourself and post them on your own server, then link to those files instead of the ones hosted by WCFCourier.com.
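If you go the self-hosted route, the four lines in the template’s head simply point at your copies of the files instead. Here’s a minimal sketch, assuming you uploaded everything to a hypothetical “/app/datatables/” folder on your own site (swap in your actual paths):

<link rel="stylesheet" type="text/css" href="/app/datatables/css/demo_page.css">
<link rel="stylesheet" type="text/css" href="/app/datatables/css/demo_table.css">
<script type="text/javascript" src="/app/datatables/js/jquery.js"></script>
<script type="text/javascript" src="/app/datatables/js/jquery.dataTables.min.js"></script>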

5. Where you see the text “<!-- Enter HTML table here -->,” paste in your HTML table from Mr. Data Converter.

6. The last thing you will need to do is create an “id” for the table and link that “id” to the DataTables plugin. In the example above, the “id” is “five_year.” It is noted in this chunk of code from the DataTables template:

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>

– The header of the HTML table that you paste into the template will look like so:

<table id="five_year" style="width: 620px;">
  <thead>
    <tr>
      <th class="NAME-cell">NAME</th>
      <th class="2008 Enrollment-cell">2008 Enrollment</th>
      <th class="2012 Enrollment-cell">2012 Enrollment</th>
      <th class="Increase/Decrease-cell">Increase/Decrease</th>
      <th class="Percent Increase/Decrease-cell">Percent Increase/Decrease</th>
    </tr>
  </thead>

– Here’s an live example of two sortable tables. The first table has an “id” of “five_year.” The second has an “id” of “one_year.” The full code for the two tables is available here.

– As an alternative, you can use a jQuery plugin called TableSorter (not to be confused with TableSetter, mentioned above). The process of creating the table is very similar.
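Here’s a rough sketch of what the TableSorter version would look like, assuming you’ve downloaded the plugin file (the src path below is hypothetical) and that your table has both a thead and a tbody:

<script type="text/javascript" src="jquery.tablesorter.min.js"></script>
<script type="text/javascript">
$(document).ready(function() {
	// Make the same "five_year" table sortable with TableSorter instead of DataTables
	$('#five_year').tablesorter();
});
</script>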

7. That’s it! Of course, DataTables provides several customization options that are worth looking into if you want to make your table look fancier.


Written by csessig

January 3, 2013 at 4:34 pm

Turning Blox assets into timelines: Last Part

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to be running the Blox CMS for this to work. That said, you could probably learn a thing or two about web scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part two, click here.

 

If you’ve followed along thus far, you’ll see we’re almost done turning a collection of Blox assets — which can include articles, photos and videos — into  timelines using a tool made available by ProPublica called TimelineSetter.

Right now, we are in the process of creating a CSV file, which we will use with the program TimelineSetter. This programs takes the CSV file and creates a nice looking timeline out of it.

The CSV file is created by using a Python script that scrapes a web page we created with all of our content on it. The full code for this script is available here.

In the second part of the series, we went ahead and created a Python scraper that will pull all the information on the page we need.

1. If you’ve gone through the second part, you can go ahead and run python timeline.py now on your command line (More information on how to run the scraper is available in the first part of the blog).

You’ll notice the script will output a CSV that has all the information we need. But some of it is ugly. We need to delete certain tags, for instance, that appear before and after some of the text.

But before we can run Python commands to delete these unnecessary tags, we need to convert all of the text into strings. A string is basically just text; integers, by contrast, are numbers.

So let’s do that first:

# Extract that information in strings
    date2 = str(date)
    link2 = str(link)
    headline2 = str(headline)
    image2 = str(image)

    # These are pulled from outside the loop
    description2 = str(description)

Note: This code and the following chunks should go inside the loop statement we created in the second part of the series, because we want these changes to take effect on every item we scrape. If you are wondering, I tried to explain loop statements in my second post; your best bet, though, is to ask Google about them.

2. Now that we’ve converted everything into strings, we can get rid of the ugly tags. You’ll notice that the descriptions of our stories start and end with with a “p” tag. Similarly, all dates start and end with an “em” tag.

Because of this, I added a few lines to the Python script. These lines will replace the ugly tags with nothing…effectively deleting them before they are put into the CSV file:

# Extra formatting needed for dates to get rid of em tags and unnecessary formatting
    date4 = date3.replace('[<em>', "")
    date5 = date4.replace('</em>]', "")
    date6 = date5.replace('- ', "")
    date7 = date6.replace("at ", "")

    # Extra formatting is also needed for the description to get rid of p tags and new line returns
    description4 = description3.replace('[<p>', "")
    description5 = description4.replace('</p>]', "")
    description6 = description5.replace('\n', " ")
    description7 = description6.replace('[]', "")

3. For images, the scraped tags come with varying three-digit widths (which is why the script looks for “\d\d\d” in the width property). We will change that so every image on the page is 300 pixels wide. Also, we will delete the text “None,” which Python spits out if an image doesn’t exist for a particular asset.

# We will adjust the width of all images to 300 pixels. Also, Python spits out the word 'None' if it doesn't find an image. Delete that.
    image4 = re.sub(r'width="\d\d\d"', 'width="300"', image3)
    image5 = image4.replace('None', "")

4. If you are at all familiar with the way articles work in Blox, you know that when you update them, red text shows up next to the headline telling you when the story was last updated. The code that formats that text and makes it red is useless to us now. So we will delete this and replace it with the current time and date using the Python datetime module.

To use the datetime module, we have to first import it at the top of our Python script. We then need to call the object within the module that we want. A good introduction to Python modules is available here.

The object we want is called “datetime.now()”. As the name suggests, it returns the current date and time. I then put the result in a variable called “now”, which makes it easier to use later in the script.

So the top of our page should look like this:

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

Inside the loop we call the “datetime.now()” object and replace the text “Updated” with the current date and time:

# If the story has been updated recently, an em class tag will appear on the page showing the time but not the date. We will delete the class and replace it with today's date. We can change the date in the CSV if we need to.
    date8 = date7.replace('[<em class="item-updated badge">Updated:', str(now.strftime("%Y-%m-%d %H:%M")))

5. Now there is just one last bit of cleaning up we will need to do. For those who don’t know, CSV stands for comma-separated values. This means that basically the columns we are creating in our spreadsheet are separated by commas. This is the preferred type of spreadsheet for most programs because it’s simple.

We can run into problems, however, if some of our data includes commas. So these next few lines in our script will replace all of the commas we scraped with dashes. You can change it to whatever character or characters you want:

# We'll replace commas with dashes so we don't screw up the CSV. You can change the dash to whatever character you want
    date3 = date2.replace(",", " -")
    link3 = link2.replace(",", " -")
    headline3 = headline2.replace(",", " -")
    description3 = description2.replace(",", " -")
    image3 = image2.replace(",", " -")

If you want to put the commas back into the timeline, you can do so after the final HTML file is created (after you run python timeline.py). Typically I’ll replace all the commas with “////” and then do a simple find and replace on the final HTML file with a text editor and change every instance of “////” back into a comma.

Now we have clean, concise data! We will now write it to the CSV file using the “write” method. Again, all of these commands go inside the loop statement we created in the second part of the series, so every image, description, date, etc. we scrape will be cleaned up and ready to go:

# Write the information to the file. The HTML code is based on coding recognized by TimelineSetter
    f.write(date8 + "," + description7 + "," + link3 + "," + '<h2 class="timeline-img-hed">' + headline3 + '</h2>' + image5 + "\n")

The headlines, you’ll notice, are put into an HTML tag “h2” tag. This will bold and increase the size of the headlines so they stick out when a reader clicks on a particular event in the timeline.

That’s it for the loop statement. Now we will add a simple line outside of the loop statement that closes the CSV file when all the loops we want run are done:

#You're done! Close file.
f.close()

And there you have it folks. We are now done with our Python script. We can now run it (see part 1 of the series) and have a nice, clean looking CSV file we can turn into a timeline.

If you have any questions, PLEASE don’t hesitate to e-mail or Tweet me. I’d be more than happy to help.

Written by csessig

March 16, 2012 at 8:38 am

Turning Blox assets into timelines: Part 2

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to be running the Blox CMS for this to work. That said, you could probably learn a thing or two about web scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part three, click here.

 

In my last post, I discussed how you can turn Blox assets into a timeline using a tool made available by ProPublica called TimelineSetter.

If you recall, most of the magic happens with a little Python script called Timeline.py. It scrapes information from a page and puts it into a CSV file, which can then be used with TimelineSetter.

So what’s behind this Timeline.py file? I’ll go through the code by breaking it down into chunks. The full code is here and is heavily commented to help you follow along.

(NOTE: This Python script is based on this tutorial from BuzzData. You should definitely check it out!)

– The first part of the script is basically the preliminary work. We’re not actually scraping the web page yet. This code first imports the necessary libraries for the script to run. We are using a Python library called BeautifulSoup that was designed for web scraping.

We then create a CSV to put the data in with the open() function and write an initial header row to the CSV file with the write() method. Also be sure to enter the URL of the page you want to scrape.

Note: For now, ignore the line “now = datetime.datetime.now().” We will discuss it later.

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

# Create a CSV where we'll save our data. See further docs:
# http://propublica.github.com/timeline-setter/#csv
f = open('timeline.csv', 'w')

# Make the header rows. These are based on headers recognized by TimelineSetter.
f.write("date" + "," + "description" + "," + "link" + "," + "html" + "\n")

# URL we will scrape
url = 'http://wcfcourier.com/test/scrape/dunkerton/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

– Before we go any further, we need to look at the page we are scraping, which in this example is this page. It’s basically a running list of articles about a particular subject. All of these stories will go on the timeline.

Now we’ll ask: what do we actually want to pull from this page? For each article we want to pull: the headline, the date, the photo, the first paragraph of the story and the link to the article.

Now we need to become familiar with the HTML of the page so we can tell BeautifulSoup what HTML attributes we want to pull from it. Go ahead and open the page up and view its source (Right click > View page source for Firefox and Chrome users).

One of the easiest things we can do is just search for the headline of the first article. So type in “Mayor’s arrest rattles Dunkerton.” This will take us to the chunk of code for that article. You’ll notice how the headline and all the other attributes for the story are contained in a DIV with the class ‘story-block.’

All stories on this page are formatted the same so every story is put into a DIV with the class ‘story-block.’ Thus, the number of DIVs with the class ‘story-block’ is also equal to the number of articles on the page we want to scrape.

– The next line of code uses BeautifulSoup’s findAll to grab every DIV with the class ‘story-block’ and put them into a variable called ‘events.’ The line after that is what is known as a ‘for loop.’ Together, these two lines tell BeautifulSoup how many times we want to run the ‘for loop’: once for every article on the page.

So if we have five articles we want to scrape, the ‘for loop’ will run five times. If we have 25 articles, it will run 25 times.

events = soup.findAll('div', attrs={'class': 'story-block'})
for x in events:

– Inside the ‘for loop,’ we need to tell it what information from each article we want to pull. Now go back to the source of the page we are scraping and find the headline, the date, the photo, the first paragraph of the story and the link to the article. You should see that:

  • The date is in a paragraph tag with the class ‘story-more’
  • The link appears several times, including within a tag called ‘fb:like,’ which is the Facebook like button people can click to share the article on Facebook.
  • The headline is in an h3 tag, which is a header tag.
  • The first few paragraphs of the story are contained within a DIV with the id ‘blox-story-text.’ Note: In the Python script, we will tell BeautifulSoup to pull only the first paragraph.
  • The photo is contained within an img tag, which shouldn’t be a surprise.

So let’s put all of that in the ‘for loop’ so it knows what we want from each article. The code below uses BeautifulSoup syntax, which you can find out about by reading their documentation.

    # Information on the page that we will scrape
    date = x.find('p', attrs={'class': 'story-more'})('em')
    link = x.find('fb:like')['href']
    headline = x.find('h3').text
    description = x.find('div', attrs={'id': 'blox-story-text'})('p', limit=1)
    image = x.find('img')

One note about the above code: The ‘x’ represents the article the ‘for loop’ is currently on. For example, say we want to scrape 20 articles. The first time we run the ‘for loop,’ ‘x’ will be the first article’s ‘story-block’ DIV. The second time through, ‘x’ will be the second article’s. The last time through, it will be the 20th.

We use the ‘x’ so we pull information from a different article each time we go through the ‘for loop.’ The first time through, we pull information from the first article because that is what ‘x’ holds. And the second time through, we pull information from the second article.

Without ‘x,’ we’d have no way of telling BeautifulSoup which article to look at on each pass. The ‘x’ in combination with the ‘for loop’ basically tells BeautifulSoup to start with one article, then move on to the next and then the next and so on until we’ve scraped all the articles we want to scrape.

– Now you should be well on your way to creating timelines with Blox assets. In the third and final part of this tutorial, we will clean up the data a little bit so it looks nice on the page. Look for the final post of this series soon!

Written by csessig

March 7, 2012 at 2:21 pm

Multiple layers and rollover effects for Fusion Table maps

with 3 comments

Note: Because of a change in the Fusion Tables API, the method for using rollover effects no longer works.

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

If you haven’t noticed by now, a lot of journalists are in love with Google Fusion Tables. I’m one of them. It’s helped me put together a ton of handy maps on deadline with little programming needed.

For those getting started, I suggest these two walkthroughs on Poynter. For those who have some experience with FT, here are a couple of options that may help you spruce up your next map.

Multiple Fusion tables on one map

Hypothetically speaking, let’s say you have two tables with information: one has county data and the other city data. And you want to display that data on just one map.

Fusion Tables makes it very easy to display both at the same time. We’ll start from the top and create a simple Javascript function to display our map (via the Google Maps API):

function initialize() {
	map = new google.maps.Map(document.getElementById('map_canvas'), {
		center: new google.maps.LatLng(42.5, -92.2),
		zoom: 10,
		minZoom: 8,
		maxZoom: 15,
		mapTypeId: google.maps.MapTypeId.TERRAIN
	});
	loadmap();
}

At the end of the initialize function we call a function, “loadmap();”. With this function, we will actually pull in our Fusion Tables layers. For this example we’ll bring in two layers instead of one. Notice how strikingly similar the two are:

function loadmap() {
	layer2 = new google.maps.FusionTablesLayer({
		query: {
			select: 'geometry',
			from: 2814002
		}
	});
	layer2.setMap(map);

	layer = new google.maps.FusionTablesLayer({
		query: {
			select: 'Mappable_location',
			from: 2813443
		}
	});
	layer.setMap(map);
}

That’s it! You now have one map pulling in two sets of data. To see the full code for this map, click here.

Rollover effects for Fusion Tables

One feature often requested in the Fusion Tables forums is to enable mouse rollover events for Fusion Table layers. Typically, readers who look at a map have to click on a point, polygon, etc. to open up new data about that point, polygon, etc. A mouseover event would allow new data to pop up if a reader hovers over a point, polygon, etc. with their mouse.

A few months ago, some very smart person rolled out a “workable solution” for the rollover request. Here’s the documentation and here’s an example of it in the wild.

Another example is this map on poverty rates in Iowa. The code below is from this map and is very similar to the code on the documentation page:

layer.enableMapTips({
	select: "'Number', 'Tract', 'County', 'Population for whom poverty status is determined - Total', 'Population for whom poverty status is determined - Below poverty level', 'Population for whom poverty status is determined - Percent below poverty level', 'One race - White', 'One race - Black', 'Other', 'Two or more races'", // list of columns to query; typically you need only one column
	from: 2415095, // fusion table id
	geometryColumn: 'geometry', // geometry column name
	suppressMapTips: true, // optional, whether to show map tips. default false
	delay: 1, // milliseconds mouse pause before sending a server query. default 300
	tolerance: 6 // tolerance in pixels around mouse. default is 6
});

// here's the pseudo-hover
google.maps.event.addListener(layer, 'mouseover', function(fEvent) {
	var NumVal = fEvent.row['Number'].value;
	layer.setOptions({
		styles: [{
			where: "'Number' = " + NumVal,
			polygonOptions: {
				fillColor: "#4D4D4D",
				fillOpacity: 0.6
			}
		}]
	});
});

Note: It’s easiest to think of your Fusion Table as a list of polygons with certain values attached to it. For this poverty map, each row represents a Census Tract with values like Tract name, number of people within that tract that live in poverty, etc. And for this map, I made it so each polygon in the Fusion Table has its own, unique number in the “Number” column.

Here’s a run through of what the above code does:

1.  We’ve already declared “layer” as a particular Google Fusion Table layer (see the second box of code above). Now the “layer.enableMapTips” will allow us to add rollover effects to that Fusion Table layer. The “select” option represents all the columns in that Fusion Table layer that you want to use with the rollover effect.

For instance, here’s the Fusion Table I’m calling in the above “enableMapTips” function. Notice how I’ve called all the columns with data (‘Tract’, ‘County’, etc.). I then told it which Fusion Table to look for with “from: 2415095.” Each Fusion Table has its own unique number. The number for my poverty Fusion Table is 2415095, which is called. To find out what number your Fusion Table is, click File > About.

Finally, I’ve told it what column contains the geometry information for this Fusion Table (Again, go through this Poynter walkthrough to find out all you need to know about the geometry field). Mine is simply called “geometry.” Each row in the “geometry” column represents one polygon.

2. The second step is the “google.maps.event.addListener(layer, ‘mouseover’, function(fEvent))” call. Basically this says, “anytime the reader rolls over a polygon, the following will happen.”

In this function, “fEvent” represents the polygon that the reader is currently hovering over. Let’s say I’m rolling over the polygon that is represented by the first row in the Fusion Table. It’s Census Tract 9601 and has the value of “1” in the “Number” column.

Every time a reader rolls over Census Tract 9601, the code “fEvent.row[‘Number’].value” goes into the Google Fusion Table, finds the Census Tract 9601 row and returns the value of the Number column, which is “1.” So var “NumVal” would become “1” when a reader rolls over Census Tract 9601.

The next part changes that polygon’s color. This happens with the “where” statement. This is saying, “when I roll over a polygon, find the polygon in the Fusion Table that represents ‘NumVal’ and change its color.” Since the variable “NumVal” represents the polygon currently being hovered over, this is the polygon that changes colors. For a reader, the output is simple: I roll over a polygon; it changes colors.

In short: I roll over Census Tract 9601, firing off the “google.maps.event.addListener” function. This function finds the value of the “Number” column. It returns “1.” The code then says change the color of the polygon that has a value of “1” in the “Number” column. This is Census Tract 9601, which I rolled over. It changes colors and life is good.

MapTips

If you go back up to “layer.enableMapTips” in the third box of code, you’ll notice there is an option for “suppressMapTips.” For the poverty map, I have it set to true. But what if you set it to false? Basically, any time a reader hovers over a point or polygon, a small box shows up next to it containing information on that point or polygon. Notice the small yellow box that pops up on this example page.

This is a nifty feature and a great replacement for the traditional InfoBox (the box that opens when you click on a point in a Google map). The only problem is the default text size is almost too small to read. How do we change that? Fairly easily:

1. Download a copy of the FusionTips javascript file.

2. Copy the file to the same folder your map is in and add this at the top of your document header:

<script type="text/javascript" src="fusiontips.js"></script>

3. Open the FusionTips file and look for “var div = document.createElement(‘DIV’).” It’s near the top of the Javascript file.

4. This ‘DIV’ represents the MapTips box. By editing this, you can change how the box and its text will display when a reader hovers over a point on the map. For instance, this map of historical places in Iowa used MapTips but the text is larger, the background is white, etc. Here’s what the DIV looks like in my FusionTips javascript file:

FusionTipOverlay.prototype.onAdd = function() {
    var div = document.createElement('DIV');
    div.style.border = "1px solid #999999";
    div.style.opacity = ".85";
    div.style.position = "absolute";
    div.style.whiteSpace = "nowrap";
    div.style.backgroundColor = "#ffffff";
    div.style.fontSize = '13px';
    div.style.padding = '10px';
    div.style.fontWeight = 'bold';
    div.style.margin = '10px';
    div.style.lineHeight = '1.3em';
    if (this.style_) {
      for (var x in this.style_) {
        if (this.style_.hasOwnProperty(x)) {
          div.style[x] = this.style_[x]
        }
      }
    }

Much better! Here’s the code for this map. And here are my three Fusion Tables.

I hope some of these tips help and if you have any questions, send me an e-mail. I’d be more than happy to help.

Written by csessig

February 7, 2012 at 4:34 pm

A few reasons to learn the command line

leave a comment »

Note: This is my first entry for Lee Enterprises’ data journalism blog. Reporters at Lee newspapers can read the blog by clicking here.

As computer users, we have grown accustomed to what is known as the graphical user interface (GUI). What’s a GUI, you ask? Here are a few examples: When you drag and drop a text document into the trash, that’s a GUI in action. Or when you create a shortcut on your desktop, that’s a GUI in action. Or how about simply navigating from one folder to the next? You guessed it: that’s a GUI in action.

A GUI, basically, is the process of interacting with images (most notably icons on computers) to get things done on electronic devices. It’s easy and we all do it all the time. But there is another way to accomplish many tasks on a computer: using text-based commands. Welcome to the command line.

So where do you enter these text-based commands and accomplish these tasks? There is a nifty little program called the Terminal on your computer that does the work. If you’ve never opened up your computer’s Terminal, now would be a good time. On my Mac, it’s located in the Applications > Utilities folder.

A scary black box will open up. Trust me, I know: I was scared of it just a few months ago. But I promise there are compelling reasons for journalists to learn the basics of the command line. Here are a few:

 

1. Several programs created by journalists for journalists require the command line.

Two of my favorite tools out there for journalists come from ProPublica: TimelineSetter and TableSetter.

The first makes it easy to create timelines. We’ve made a few at the Courier. The second makes easily searchable tables out of spreadsheets (more specifically, CSV files), which we’ve also used at the Courier. But to create the timelines and tables, you’ll need to run very basic commands in your Terminal.

It’s worth noting the LA Times also has its own version of TableSetter called TableStacker that offers more customizations. We used it recently to break down candidates running in our local elections. Again, these tables are created after running a simple command.

The New York Times has a host of useful tools for journalists. Some, like Fech, require the command line to run. Fech can help journalists extract data from the Federal Election Commission to show who is spending money on whom in the current presidential campaign cycle.

 

2. Other programs/tools that journalists can use:

Let’s say you want to pull a bunch of information from a website to use in a story or visualization, but copy and pasting the text is not only tedious but very time consuming.

Why not just scrape the data using a program made in a language like Python or Ruby and put it in a spreadsheet or Word document? After all, computers are great at performing tedious tasks in just a few minutes.

One of my favorite web scraping walkthroughs comes from BuzzData. It shows how to pull water usage rates for every ward in Toronto and can easily be applied to other scenarios (I used it to pull precinct information from the Iowa GOP website). The best way to run this program and scrape the data is to run it through your command line.

Another great walkthrough on data scraping is this one from ProPublica’s Dan Nguyen. Instead of using the Python programming language, like the one above, it uses Ruby. But the goal remains the same: making data manageable for both journalists and readers.

A neat mapping service that is gaining popularity at news organizations is TileMill. Here are a few examples to help get you motivated.

One of the best places to start with TileMill is this walkthrough from the application team at the Chicago Tribune. But beware: you’ll need to know the command line before you start (trust me, I learned the hard way).

 

3. You’ll impress your friends because the command line kind of looks like the Matrix

And who doesn’t want that?

 

Okay I’m sort of interested…How do I learn?

I can’t tell people enough how much these two command line tutorials from PeepCode helped me. I should note that each costs $12 but are well worth it, in my opinion.

Also, there is this basic tutorial from HypeXR that will help. And these shortcuts from LifeHacker are also great.

Otherwise, a quick Google or YouTube search will turn up thousands of results. A lot of programmers use the command line and, fortunately, many are willing to help teach others.

Written by csessig

January 31, 2012 at 9:21 am

Better map rollover option

with 2 comments

A month ago, I blogged about an attempt I made to use a new feature from Google Fusion Tables that allowed map makers to customize their maps based on mouseovers. The idea was that readers could roll over a point/state/census tract/whatever on a map and some sort of data would pop up. You can also customize it so the polygon changes colors, polygon borders grow in size or whatever option you decide to use to let the reader know they have rolled over a particular object. I used it on a recent map of poverty rates in Iowa. The result worked, but the mouseover events seemed delayed and clunky. Not so user friendly, especially if you have a slow Internet connection. So I looked for a new option.

What I found was this great library for polygons and rollover effects from the New York Times’ Albert Sun. Granted, this takes much longer to put together. But the result is much smoother and more user friendly, IMO. I used it on a recent map of heroin rates in the U.S.

Here’s brief synopsis on how I put it together:

1. First I grabbed data from the Substance Abuse and Mental Health Data Archive on reported heroin cases at substance abuse centers, broken down by state. I then grabbed a shapefile from the U.S. Census Bureau of each state in the U.S. This is the file that maps each state based on its boundaries.

2. The data related to drug cases from the SAMHDA was sorted by state and year. You can download individual spreadsheets of data for 2009 and earlier years. I downloaded spreadsheets for 2005 through 2009, then pulled out the information I needed from each one and merged them, ending up with a final spreadsheet that had the following data for each state dating back to 2005: the number of people who were admitted to a substance abuse treatment center for heroin, the number of people who were admitted to a substance abuse treatment center for any drug and the percentage of those who were heroin users.

3. I first uploaded that final spreadsheet to Google Fusion Tables. I then uploaded the shapefile I downloaded into Google Fusion Tables using the AWESOME Shpescape tool. Important: Make sure you select “Create a Simplified Geometry column” under advanced options before you import the shapefile, otherwise it will be too big. Finally I merged the shapefile with the spreadsheet and ended up with this final table.

4. I exported that table into a KML file (if you click Visualize > Map, you’ll see the option to download a KML file) and converted that to a GeoJSON file using this Ogre web client. I only did this because the KML wasn’t working on Internet Explorer. Anyways, converting it into a GeoJSON file was a TOTAL PAIN IN THE ASS that required me to mess with the GeoJSON file much more than I wanted. Finally, I ended up with this GeoJSON file.

5. Now for the code. My final script is here. I’m not going to run through it all but will point out a few lines of note:

  • Line 179 brings in the JSON file. Underneath it are polygon highlighting options based on Albert Sun’s code. I have it set up so the opacity on the state polygons is light enough to see the state names underneath when you first pull up the map. When you roll over a state, the polygon gets shaded in all the way and you can’t read the text underneath it (on the Google map). But if you click a polygon, you can once again see the text. I also have a red border pop up when you roll over a state (strokeColor and strokeWeight).
  • Line 207 is the highlightCallback function. This creates a variable for each line of data: number of heroin users, number of drug users, percentage, etc. for 2005 to 2009. It’s what you see under “Figures” in the DataTable on the map when you roll over a state. I first made each line a string by adding quotes to each variable.
  • Each variable is called into the function selected_district (line 229). This function creates the Google DataTable via “new google.visualization.DataTable().” I’ve used this table in the past on a map for prep high school football teams. Check this past blog post for more information.
  • Line 255 is a function that puts in commas for numbers in the thousands. I didn’t make it; it’s freely available online, so please take it and use it as you see fit. (There’s a quick sketch of that kind of helper after this list.)
  • Lines 107 to 154 are the legend.
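For reference, here’s a minimal sketch of a comma-formatting helper like the one mentioned above. It’s a generic version, not necessarily the exact function in my script:

// Add commas to a number in the thousands: 1234567 becomes "1,234,567"
function addCommas(num) {
	// Insert a comma before every group of three digits, working from the right
	return String(num).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}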

Per usual, I used Colorbrewer to come up with the colors…

I’m happy with the resulting map and hope to use this polygon feature in the future. If you have any questions, feel free to leave a comment. I’d be more than happy to pass my limited amount of knowledge along to others.

Written by csessig

January 21, 2012 at 5:07 pm

The caucus night that almost didn’t end

leave a comment »

All eyes were on Iowa last night as the Iowa caucuses took place. It was pretty much the longest work day I’ve ever had…by a lot. Anyways, we did a ton of updating on WCFCourier.com all day and night…I wish I had a screenshot of all the photos/stories we put on the front of our website. I did take one at about 2:30 a.m., which is shown above. It’s the site after Romney was (finally) declared the winner. The template is now being used on our Iowa caucus website.

Here’s a quick summary of our online coverage: It started with a general Iowa caucus coverage, switched over to our live coverage from the UNI-Dome (which hosted the largest caucus in the state Tuesday night) and then to the statewide race between Santorum-Romney-Paul and, finally, the grudge match between Santorum and Romney. At 1:30 a.m., the Iowa GOP announced Romney had won by eight flippin’ votes. At one point, Santorum was leading by ONE vote with ONE precinct left. What are the odds?

My schedule on caucus night went something like this:

– 9 a.m. – 2:30 p.m. – Preparing for the day / arranging plans with reporters / posting stories, photos and other content. We had caucus stories going up all day, obviously. I also posted and helped monitor a live chat, which was shared with other Iowa newspapers and was active all day, as well as posted live video from KCRG, which played from about 7 p.m. to 12 a.m. I opted to put both the live video and the live chat on the same page, making it easier for readers to follow the action at home.

– 2:30 p.m. – 2:31 p.m. – Lunch

– 2:31 p.m. – 4:30 p.m. – More preparing and posting. We also posted two maps with our coverage: one of live caucus results (which started coming in after the 7 p.m. caucus start). This map was provided by the Iowa Republican Party and is pictured to the left. You may have seen it on several news sites… Many had it or a variation of it.

The second map I made myself; it featured all (I believe, although I haven’t counted) 1,700 caucus locations in the state of Iowa. The addresses were pulled from the Iowa GOP website, which listed every site. Basically, I wrote a Python program that scraped the data from their site and put it into a spreadsheet, pulled it into Google Fusion Tables and mapped the locations based on their addresses. The Python scraper is based on this FANTASTIC walkthrough by BuzzData on how to scrape data from websites. Check it out!

(NOTE: Here’s the code for the Python scraper. Here’s my Google Fusion Table.)

At about 3 p.m., we rolled over the site to feature one huge photo and story (see the screenshot at the top of this post). It was caucus night, after all, so we had to go big.

– 4:30 p.m. – 5 p.m. – Mad dash to the UNI-Dome, where Black Hawk County was caucusing. The doors opened at 5:30 p.m. and I wanted to get there and set up before either Bachmann or Gingrich stopped by to speak.

– 5 p.m. – 10 p.m. – Posted up at the UNI-Dome. At about 5:30 p.m., we switched our main story to our Dome coverage…This was basically when our first photo and update came in. Throughout the evening, we posted small updates from the Dome and new photos. We also had three videos from the Dome.

At about 6:30 p.m., Bachmann and Gingrich spoke. I took a few photos for our live chat (which, BTW, had more than 5,000 viewers at one point!) and posted a fresh candidate Dome story when it came in.

At about 8:30 p.m., the Dome action was winding down and our attention turned to the statewide race between Santorum, Romney and Paul and then Santorum and Romney. We relied on the AP and our Lee Des Moines bureau for our main story on the site, adding photos from the Dome and the wire with it.

– 10 p.m. – 10:30 p.m. – Mad dash back to the newsroom. I was actually afraid they might announce the winner while I was on the road back to the newsroom but I was off by about three hours.

– 10:30 p.m. – 12:30 a.m. – We waited. And waited. And made some jokes. And waited some more. The precinct results continued to flood in and, amazingly, the margin between Romney and Santorum dwindled. Santorum was actually in the lead for much of the night. By 16 votes. Then one vote. Then four votes. Just ridiculous.

– 12:30 a.m. – 1:30 a.m. – At about this time, they announced they were down to three precincts then one precinct…At that point, I knew I would be in the newsroom until the final precinct was counted. The lone holdout was in Clinton County (eastern Iowa along the river) and apparently there was some confusion about whether or not they had submitted their results to the state yet.

– 1:30 a.m. – 2 a.m. – The Iowa GOP finally announced Romney won by eight (!!!) votes. Hurray! I slapped a quick update on top of the story we had online and added a new photo. At this point, I just wanted to make sure those who got up in the morning would see the final results.

– 2 a.m. – 3 a.m. – The longest day ever came to a close. I took down the big photo, big story template we had used all night (see the screenshot at the top) and returned the site to our standard carousel template with five rotating stories on the front (see WCFCourier.com). I also added a teaser to our Iowa caucus site at the top so people could see all of our caucus coverage from the night/morning. Because there was a ton of it.

– 3 a.m. – Sleep

Here’s what the Courier’s front page looked like on Wednesday. We’re a morning paper so we were able to get the final results in:

Written by csessig

January 4, 2012 at 1:42 pm