Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Turning Blox assets into timelines: Last Part

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said you could probably learn a thing or two about web-scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part two, click here

 

If you’ve followed along thus far, you’ll see we’re almost done turning a collection of Blox assets — which can include articles, photos and videos — into  timelines using a tool made available by ProPublica called TimelineSetter.

Right now, we are in the process of creating a CSV file, which we will use with the program TimelineSetter. This programs takes the CSV file and creates a nice looking timeline out of it.

The CSV file is created by using a Python script that scrapes a web page we created with all of our content on it. The full code for this script is available here.

In the second part of the series, we went ahead and created a Python scraper that will pull all the information on the page we need.

1. If you’ve gone through the second part, you can go ahead and run python timeline.py now on your command line (More information on how to run the scraper is available in the first part of the blog).

You’ll notice the script will output a CSV that has all the information we need. But some of it is ugly. We need to delete certain tags, for instance, that appear before and after some of the text.

But before we can run Python commands to delete these unnecessary tags, we need to convert all of the text into strings. This is basically just text. Integers, by contrast, are numbers.

So let’s do that first:

# Extract that information in strings
    date2 = str(date)
    link2 = str(link)
    headline2 = str(headline)
    image2 = str(image)

    # These are pulled from outside the loop
    description2 = str(description)

Note: This code and the following chunks should go inside the loop statement we created in the second part of the series. This is because we want these changes to take effect on every item we scrape. If you are wondering, I tried to explain loop statements in my second blog. Your best bet though is to ask Google about loop statements.

2. Now that we’ve converted everything into strings, we can get rid of the ugly tags. You’ll notice that the descriptions of our stories start and end with with a “p” tag. Similarly, all dates start and end with an “em” tag.

Because of this, I added a few lines to the Python script. These lines will replace the ugly tags with nothing…effectively deleting them before they are put into the CSV file:

# Extra formatting needed for dates to get rid of em tags and unnecessary formatting
    date4 = date3.replace('[<em>', "")
    date5 = date4.replace('</em>]', "")
    date6 = date5.replace('- ', "")
    date7 = date6.replace("at ", "")

    # Extra formatting is also need for the description to get rid of p tags and new line returns
    description4 = description3.replace('[<p>', "")
    description5 = description4.replace('</p>]', "")
    description6 = description5.replace('\n', " ")
    description7 = description6.replace('[]', "")

3. For images, the script spits out “\d\d\d” for the width property. We will change this so it’s 300px, making all images on the page 300px wide. Also, we will delete the text “None,” which shows up if an image doesn’t exist for a particular asset.

# We will adjust the width of all images to 300 pixels. Also, Python spits out the word 'None' if it doesn't find an image. Delete that.
    image4 = re.sub(r'width="\d\d\d"', 'width="300"', image3)
    image5 = image4.replace('None', "")

4. If you are at all familiar with the way articles work in Blox, you know that when you update them, red text shows up next to the headline telling you when the story was last updated. The code that formats that text and makes it red is useless to us now. So we will delete this and replace it with the current time and date using the Python datetime module.

To use the datetime module, we have to first import it at the top of our Python script. We then need to call the object within the module that we want. A good introduction to Python modules is available here.

The module we want is called “datetime.now()”. As the name suggests, it returns the current date and time. I then put it in a variable called “now”, which makes it more reader friendly to use later in the script.

So the top of our page should look like this:

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

Inside the loop we call the “datetime.now()” object and replace the text “Updated” with the current date and time:

# If the story has been updated recently, an em class tag will appear on the page showing the time but not the date. We will delete the class and         replace it with today's date. We can change the date in the CSV if we need to.
    date8 = date7.replace('[<em class="item-updated badge">Updated:', str(now.strftime("%Y-%m-%d %H:%M")))

5. Now there is just one last bit of cleaning up we will need to do. For those who don’t know, CSV stands for comma-separated values. This means that basically the columns we are creating in our spreadsheet are separated by commas. This is the preferred type of spreadsheet for most programs because it’s simple.

We can run into problems, however, if some of our data includes commas. So these next few lines in our script will replace all of the commas we scraped with dashes. You can change it to whatever character or characters you want:

# We'll replace commas with dashes so we don't screw up the CSV. You can change the dash to whatever character you want
    date3 = date2.replace(",", " -")
    link3 = link2.replace(",", " -")
    headline3 = headline2.replace(",", " -")
    description3 = description2.replace(",", " -")
    image3 = image2.replace(",", " -")

If you want to put the commas back into the timeline, you can do so after the final HTML file is created (after you run python timeline.py). Typically I’ll replace all the commas with “////” and then do a simple find and replace on the final HTML file with a text editor and change every instance of “////” back into a comma.

Now we have clean, concise data! We will now put this into the CSV file using the “write” command. Again, all these commands are put inside the loop statement we created in the second part of the series, so every image, description, date, etc. we scrape will be cleaned up and ready to go:

# Write the information to the file. The HTML code is based on coding recognized by TimelineSetter
    f.write(date8 + "," + description7 + "," + link3 + "," + '<h2 class="timeline-img-hed">' + headline3 + '</h2>' + image5 + "\n")

The headlines, you’ll notice, are put into an HTML tag “h2” tag. This will bold and increase the size of the headlines so they stick out when a reader clicks on a particular event in the timeline.

That’s it for the loop statement. Now we will add a simple line outside of the loop statement that closes the CSV file when all the loops we want run are done:

#You're done! Close file.
f.close()

And there you have it folks. We are now done with our Python script. We can now run it (see part 1 of the series) and have a nice, clean looking CSV file we can turn into a timeline.

If you have any questions, PLEASE don’t hesitate to e-mail or Tweet me. I’d be more than happy to help.

Advertisements

Written by csessig

March 16, 2012 at 8:38 am

2 Responses

Subscribe to comments with RSS.

  1. […] For part one of this tutorial, click here. For part three, click here […]

  2. […] Turning Blox assets into timelines: Last Part […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: