Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Python’ Category

Using hexagon maps to show areas with high levels of reported crime

leave a comment »

Screenshot: Violent crime in Waterloo

Crime maps are a big hit at news organizations these days. And with good reason: Good crime maps can tell your readers a whole lot about crime, not only in their city but in their backyard.

For the last year and a half, we’ve been doing just that. By collecting data every week from our local police departments, we’ve been able to provide our community a great, constantly updated resource on crime in Waterloo and Cedar Falls. And we’ve even dispelled some myths on where crime happens along the way.

The process, which I’ve blogged about in the past, has evolved since the project was first introduced. But basically it goes like this: I download PDFs of crime reports from the Waterloo and Cedar Falls police departments every week. Those PDFs are scraped using a Python script and exported into a CSV. The script pulls out reports of the most serious crimes and disregards minor reports like assistance calls.

The CSV is then uploaded to Google spreadsheets as a backup. We also use this website to get latitude and longitude coordinates from the addresses in the Google spreadsheet. Finally, the spreadsheet is converted into JSON using Mr. Data Converter and loaded onto our FTP server. We use Leaflet to map the JSON files.

We’ve broken the map into months and cities, meaning we have a map and a corresponding JSON file for each month for both Waterloo and Cedar Falls. We also have JSON files with all of the year’s crime reports for both Waterloo and Cedar Falls.

The resulting maps are great for readers looking for detailed information on crime in their neighborhood. It also provides plenty of filters so readers can see what areas have more shootings, assaults, burglaries, etc.

But it largely fails to give a good overview of what areas of the cities had the most reported crimes in 2013, for two reasons: 1) Because of the sheer number of crime reports (3,684 reports in 2013 for Waterloo alone) and the presence of multiple reports coming from the same address, it’s nearly impossible to simply look at the 2013 maps and see what areas of the cities had the most reports; and 2) The JSON files containing all of the year’s reports run well on modern browsers but slowly on outdated browsers (I’m looking at you, Internet Explorer), which aren’t equipped to iterate over JSON files with thousands of items.

As a result, I decided to create separate choropleth maps that serve one purpose: To show what areas of the cities had the most reported crimes. The cities would be divided into sections and areas with more crimes would be darker on the map, while the areas with fewer crimes would be lighter. Dividing the city into sections would also mean the resulting JSON file would have dozens of items instead of thousands. The map would serve a simple purpose and load quickly.

What would the areas of the cities actually look like? I’ve been fascinated with hex maps since I saw this gorgeous map from the L.A. Times on 9/11 response times. The map not only looked amazing but did an incredibly effective job displaying the data and telling the story. While we want our map to look nice, telling its story and serving our journalistic purpose will always be our No. 1 goal. Fortunately, hex maps do both.

And even more fortunately, the process of creating hex maps got a whole lot easier when data journalist Kevin Schaul introduced a command line tool called binify that turns dot density shapefiles into shapefiles with hexagon shapes. Basically the map is cut into hexagon shapes, and all the points inside each hexagon are counted. Each hexagon is then given a corresponding count value for the number of points that fall within its borders.
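If you’re curious what that counting step looks like, here’s a toy sketch in Python (not binify’s actual code) using the shapely library, with made-up coordinates:

import math
from shapely.geometry import Point, Polygon

# one hexagon with vertices every 60 degrees around (0, 0)
hexagon = Polygon([(math.cos(math.radians(a)), math.sin(math.radians(a)))
                   for a in range(0, 360, 60)])

# made-up crime points
points = [Point(0.1, 0.2), Point(0.9, 0.0), Point(2.0, 2.0)]

# binify does this for every hexagon in its grid and stores the total
# as the count value on that hexagon in the output shapefile
count = sum(1 for p in points if hexagon.contains(p))
print count  # 2 -- the third point falls outside the hexagon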

I used binify to convert our 2013 crime maps into hex maps. As I mentioned earlier, we started out with CSVs with all the crimes. My workflow for creating the hex maps and then converting them into JSON files for Leaflet is in a gist below. It utilizes binify and the fantastic command line tool ogr2ogr, which is used to convert map files.

# We start with a CSV called "crime_master_wloo_violent.csv" that has violent crime reports in Waterloo.
# Included in csv are columns for latitude ("lat") and longitude ("long"):
https://github.com/csessig86/crime_map2013/blob/master/csv/crime_master_wloo_violent.csv

# Command to turn CSV into DBF with the same name
ogr2ogr -f "ESRI Shapefile" crime_master_wloo_violent.dbf crime_master_wloo_violent.csv

# Create a file called "crime_master_wloo_violent.vrt"
# Change "SrcLayer" to the name of the source
# Change OGRVRTLayer name to "wloo_violent"
# Add name of lat, long columns to "GeometryField"
<OGRVRTDataSource>
  <OGRVRTLayer name="wloo_violent">
    <SrcDataSource relativeToVRT="1">./</SrcDataSource>
    <SrcLayer>crime_master_wloo_violent</SrcLayer>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>WGS84</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="long" y="lat"/>
  </OGRVRTLayer>
</OGRVRTDataSource>

# Create "violent" folder to stash shp file
mkdir violent

# Convert "vrt" file into shp and put it in our "violent" folder
ogr2ogr -f "ESRI Shapefile" violent crime_master_wloo_property.vrt

# Binify the "wloo_violent" shp into a shp called "wloo_violent_binify" with three options
# The first sets the number of hexagons across to 45
# The second sets the custom extent to a box around Waterloo
# Custom extent is basically the extent of the area that the overlay of hexagons will cover
# The third tells binify to ignore hexagons that have zero points
binify wloo_violent.shp wloo_violent_binify.shp -n 45 -E -92.431873 -92.260867 42.4097259 42.558403 --exclude-empty

# Convert binified shp into JSON file called "crime_data_review_violent"
ogr2ogr -f "GeoJSON" crime_data_review_violent.json wloo_violent_binify.shp

# Prepend this variable declaration to the top of the JSON file so it can be loaded as a Javascript object
var crime_data_review_violent =

# Here's the final files:
https://github.com/csessig86/crime_map2013/tree/master/shps/wloo_shps

# And our JSON file:
https://github.com/csessig86/crime_map2013/blob/master/JSON/crime_data_review_violent.json

# And the Javascript file that is used with Leaflet to run the map.
# Lines 113 through 291 are the ones related to pulling the data from the JSON files,
# coloring them, setting the legend and setting a mouseover for each hexagon
https://github.com/csessig86/crime_map2013/blob/master/js/script_map_review.js

# And the map online:
http://wcfcourier.com/app/crime_map2013/index_wloo.php
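One step in the gist that isn’t a command is adding the Javascript variable to the JSON file. If you’d rather script that than open the file in a text editor, a few lines of Python will prepend the declaration for you (a quick sketch, assuming the file name from the gist):

# Prepend the Javascript variable declaration to the GeoJSON
# so the file can be loaded with a plain script tag
with open('crime_data_review_violent.json', 'r') as f:
    geojson = f.read()

with open('crime_data_review_violent.json', 'w') as f:
    f.write('var crime_data_review_violent = ' + geojson)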

The gist above shows the process I took to create our hex map of violent crimes in Waterloo. I duplicated this process three more times to get four maps in all: one showing property crimes in Waterloo and two showing violent and property crimes in Cedar Falls.

* Note: For more information on converting shapefiles with ogr2ogr, check out this blog post from Ben Balter. This cheat sheet of ogr2ogr commands was also invaluable.

Another great resource is this blog post on minifying GeoJSON files from Bjørn Sandvik. While the post deals with GeoJSON files specifically, the process works with JSON files as well. Minifying my JSON files made them about 50-150 KB smaller. While it’s not a drastic difference, every little bit counts.
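If you just want to squeeze out the whitespace yourself, Python’s json module can do the minifying (run this on the plain GeoJSON, before the Javascript variable declaration is added; the file names here are from my project):

import json

# re-dumping with compact separators strips the whitespace ogr2ogr leaves in
with open('crime_data_review_violent.json') as f:
    data = json.load(f)

with open('crime_data_review_violent.min.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))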

To get choropleth maps up and running in Leaflet, check out this tutorial. You can also check out my Javascript code for our crime maps.

One last note: The map is responsive. Check it out.


Written by csessig

January 11, 2014 at 4:42 pm

How We Did It: Waterloo crime map

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Last week we launched a new feature on the Courier’s website: A crime map for the city of Waterloo that will be updated daily Monday through Friday.

The map uses data provided by the Waterloo police department. It’s presented in a way to allow readers to make their own stories out of the data.

(Note: The full code for this project is available here.)

Here’s a quick run-through of what we did to get the map up and running:

1. Turning a PDF into manageable data

The hardest part of this project was the first step: Turning a PDF into something usable. Every morning, the Waterloo police department updates their calls for service PDF with the latest service calls. It’s a rolling PDF that keeps track of about a week of calls.

The first step I took was turning the PDF into an HTML document using the command line tool pdftohtml. For Mac users, you can install it by going to the command line and typing in “brew install pdftohtml.” Then run “pdftohtml -c (ENTER NAME OF PDF HERE)” to turn the PDF into an HTML document.
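If you’d rather not type that every morning, the conversion step can be scripted with Python’s subprocess module (the PDF name here is just a placeholder):

import subprocess

# same as running "pdftohtml -c calls_for_service.pdf" on the command line
subprocess.call(['pdftohtml', '-c', 'calls_for_service.pdf'])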

The PDF we are converting is basically a spreadsheet. Each cell of the spreadsheet is turned into a DIV with PDFtoHTML. Each page of the PDF is turned into its own HTML document. We will then scrape these HTML documents using the programming language Python, which I have blogged about before. The Python library that will allow us to scrape the information is Beautiful Soup.

The “-c” command adds a bunch of inline CSS properties to these DIVs based on where they are on the page. These inline properties are important because they help us get the information off the spreadsheet we want.

All dates and times, for instance, are located in the second column. As a result, all the dates and times have the exact same inline left CSS property of “107” because they are all the same distance from the left side of the page.

The same goes for the dispositions. They are in the fifth column and are farther from the left side of the page so they have an inline left CSS property of “677.”

We use these properties to find the columns of information we want. The first thing we want is the dates. With our Python scraper, we’ll grab all the data in the second column, which is all the DIVs that have an inline left CSS property of “107.”

We then have a second argument that uses regular expressions to make sure the data is in the correct format, i.e. numbers and not letters. We do this to make sure we are pulling dates and not text accidentally.

The second argument is basically an insurance policy. Everything we pull with the CSS property of “107” should be a date. But we want to be 100% sure, so we use regular expressions to check that we got numbers and not a string.

The third column is the reported crimes. But in our converted HTML document, crimes are actually located in the DIV previous to the date + time DIV. So once we have grabbed a date + time DIV with our Python scraper, we will check the previous DIV to see if it matches one of the seven crimes we are going to map. For this project, we decided not to map minor reports like business checks and traffic stops. Instead we are mapping the seven most serious reports.

If it is one of our seven crimes, we will run one final check to make sure it’s not a cancelled call, an unfounded call, etc. We do this by checking the disposition DIVs (column five in the spreadsheet), which are located before the crime DIVs. Also remember that all these have an inline left CSS property of “677”.

So we check these DIVs with our dispositions to make sure they don’t contain words like “NOT NEEDED” or “NO REPORT” or “CALL CANCELLED.”

Once we know it’s a crime that fits into one of our seven categories and it wasn’t a cancelled call, we add the crime, the date, the time, the disposition and the location to a CSV spreadsheet.
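Here’s a condensed sketch of that logic (not the full scraper, which is linked below; it assumes the inline styles look like “left:107px” and “left:677px,” and it uses a shortened crime list and an example file name):

import re
from BeautifulSoup import BeautifulSoup

CRIMES = ['ASSAULT', 'BURGLARY', 'ROBBERY']  # stand-ins for the seven categories
BAD_DISPOSITIONS = ['NOT NEEDED', 'NO REPORT', 'CALL CANCELLED']

soup = BeautifulSoup(open('calls-1.html').read())

# grab every DIV in the date/time column (inline left property of "107")
for div in soup.findAll('div', style=re.compile(r'left:\s*107')):
    date_time = div.text
    # insurance policy: skip anything that doesn't start with a number
    if not re.match(r'\d', date_time):
        continue
    # the crime sits in the DIV just before the date/time DIV
    crime_div = div.findPreviousSibling('div')
    if crime_div is None or crime_div.text not in CRIMES:
        continue
    # the disposition column has an inline left property of "677";
    # skip cancelled, unfounded or not-needed calls
    disposition_div = div.findPrevious('div', style=re.compile(r'left:\s*677'))
    disposition = disposition_div.text if disposition_div else ''
    if any(bad in disposition for bad in BAD_DISPOSITIONS):
        continue
    print crime_div.text, date_time, disposition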

The full Python scraper is available here.

2. Using Google to get latitude, longitude and JSON

The mapping service I used was Leaflet, as opposed to Google Maps. But we will need to geocode our addresses to get latitude and longitude information for each point to use with Leaflet. We also need to convert our spreadsheet into a JSON (Javascript Object Notation) file.

Fortunately that is an easy and quick process thanks to two gadgets available to us using Google Docs.

The first thing we need to do is upload our CSV to Google Docs. Then we can use this gadget to get latitude and longitude points for each address. Then we can use this gadget to get the JSON file we will use with the map.

3. Powering the map with Leaflet, jQRangeSlider, DataTables and Bootstrap

As I mentioned, Leaflet powers the map. It uses the latitude and longitude points from the JSON file to map our crimes.

For this map, I created my own icons. I used a free image editor known as Seashore, which is a fantastic program for those who are too cheap to shell out the dough for Adobe’s Photoshop.

The date range slider below the map is a very awesome tool called jQRangeSlider. Basically every time the date range is moved, a Javascript function is called that will go through the JSON file and see if the crimes are between those two dates.

This Javascript function also checks to see if the crime has been selected by the user. Notice on the map the check boxes next to each crime logo under “Types of Crimes.”

If the crime is both between the dates on the slider and checked by the users, it is mapped.
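The actual function lives in the Javascript file linked at the end of this post, but the test it runs boils down to something like this (sketched here in Python with made-up data):

from datetime import datetime

# hypothetical slider positions and checkbox state
slider_start = datetime(2012, 7, 1)
slider_end = datetime(2012, 7, 31)
checked_types = set(['ASSAULT', 'BURGLARY'])

crimes = [
    {'type': 'ASSAULT', 'date': datetime(2012, 7, 4)},
    {'type': 'THEFT', 'date': datetime(2012, 7, 10)},
    {'type': 'BURGLARY', 'date': datetime(2012, 8, 2)},
]

# a crime is mapped only if it passes both tests
visible = [c for c in crimes
           if slider_start <= c['date'] <= slider_end
           and c['type'] in checked_types]

print visible  # only the July 4 assault passes both tests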

While this is going on, an HTML table of this information is being created below the map. We use another awesome tool called DataTables to make that table of crimes interactive. With it, readers can display up to 100 records on the page or search through the records.

Finally, we create a pretty basic bar chart using the Progress Bars made available by Bootstrap, an awesome front-end framework released by the people who brought us Twitter.

Creating these bars is easy: We just need to create DIVs and give them a certain class so Bootstrap knows how to style them. We create a bar for each crime that is automatically updated when we tweak the map.

For more information on progress bars, check out the documentation from Bootstrap. I also want to thank the app team at the Chicago Tribune for providing the inspiration behind the bar chart with their 2012 primary election app.

The full Javascript file is available here.

4. Daily upkeep

This map is not updated automatically so every day, Monday through Friday, I will be adding new crimes to our map.

Fortunately, this only takes about 5-10 minutes of work. Basically I scrape the last few pages of the police’s crime log PDF, pull out the crimes that are new, pull them into Google Docs, get the latitude and longitude information, output the JSON file and put that new file into our FTP server.

Trust me, it doesn’t take nearly as long as it sounds to do.

5. What’s next?

Besides minor tweaks and possible design improvements, I have two main goals for this project in the future:

A. Create a crime map for Cedar Falls – Cedar Falls is Waterloo’s sister city, and like the Waterloo police department, the Cedar Falls police department keeps a daily log of calls for service. They also post PDFs, so I’m hoping the process of pulling out the data won’t be drastically different from what I did for the Waterloo map.

B. Create a mobile version for both crime maps – Maps don’t work tremendously well on mobile phones. So I’d like to develop some sort of alternative for mobile users. Fortunately, we have all the data. We just need to figure out how to display it best for smartphones.

Have any questions? Feel free to e-mail me at chris.essig@wcfcourier.com.

Courses, tutorials and more for those looking to code

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Without a doubt, there is an abundance of resources online for programmers and non-programmers alike to learn to code.

This, of course, is great news for journalists like us who are looking to use programming to make visualizations, scrape websites or simply pick up a new skill.

Here’s a list of courses and tutorials I’ve found in the last couple months that have either helped me personally or look very promising:

1. Codecademy

Is 2012 the year of code? The startup service Codecademy sure thinks it is. They have made it their mission to teach everyone who is willing how to code within one year. The idea was so intriguing that the New York Times ran a front page story (at least online) on it.

Basically, users create an account with the service and every week they are sent new exercises that will teach them how to code. The first exercises focused on Javascript. Now, users are moving into HTML and CSS. Each exercise takes a couple hours to complete and builds off the previous week’s exercises. And best of all, it’s FREE.

If you are a huge nerd like me, you’ll gladly spend your free time completing the courses.

2. Coursera

Want to take courses from Stanford University, Princeton University, University of Michigan and University of Pennsylvania for free? Yeah, I didn’t really think it was possible either until I found Coursera, which offers a wide variety of courses in computer science and other topics.

Right now, I am enrolled in Computer Science 101, which is a six-week course that focuses on learning the basics. Each week, you are e-mailed about an hour of video lectures, as well as exercises based on those lectures. There is also a discussion forum so you can meet your peers. This isn’t nearly as time consuming as Codecademy is, which might be appealing to some.

3. Udacity

Like Coursera, Udacity offers a number of computer science classes on beginner, intermediate and advanced topics. The classes are also based on video lectures put together by some very, very smart people. I have not used this service, however, so I can’t speak to it too much. It looks promising though. And who wouldn’t want to learn how to program a robotic car?

4. Code School

This service offers screencasts on a host of topics like Javascript, jQuery, Ruby, HTML, CSS and more. The downside, however, is that this service costs money: $20 a month or $55 a screencast. If you are looking to try it out, check out their free beginner’s screencast on the Javascript library jQuery, which is the best beginner’s introduction to jQuery I’ve seen. They also have a free screencast for the Ruby programming language.

5. PeepCode

If you are looking for screencasts but are on a tighter budget, check out PeepCode and their list of programming screencasts. Each costs about $12, is downloadable and typically includes source code for the programs to help you follow along at home. One of my favorites is “Meet the Command Line,” which will get you started with the Unix command line. Be warned though: some of their screencasts are geared towards more advanced users. A good understanding of programming is recommended before diving into some of these (an exception is the command line tutorial mentioned above).

6. Net Tuts+

Many of the tutorials on this site are geared towards programmers wanting to learn very specific things or solve specific problems. This tutorial, for instance, runs through how to make borders in CSS. And this one deals with the Command Line text editor called Vim. So if you have a particular problem but don’t have a ton of time to sit through video tutorials, you might want to check out this site’s extensive catalog.

7. ScraperWiki

Web scraping is a great skill for journalists to have because it can help us pull a large amount of information from websites in a matter of seconds. If you are looking for a place to start, check out some of the screencasts offered by ScraperWiki, a service that specializes in — you guessed it — web scraping.

8. Coding blogs

The number of blogs out there devoted to coding and programming is both vast and impressive. Two of my favorites are Life and Code and Baby Steps in Data Journalism. Both are geared towards journalists. In fact, many of the sites I listed here were initially posted on one of these blogs.

– Got a cool website that has helped you out?

I’d love to hear about it! Feel free to leave a comment or e-mail me at chris.essig@wcfcourier.com

Written by csessig

May 3, 2012 at 8:22 am

Turning Blox assets into timelines: Last Part

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said, you could probably learn a thing or two about web scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part two, click here

 

If you’ve followed along thus far, you’ll see we’re almost done turning a collection of Blox assets — which can include articles, photos and videos — into timelines using a tool made available by ProPublica called TimelineSetter.

Right now, we are in the process of creating a CSV file, which we will use with the program TimelineSetter. This program takes the CSV file and creates a nice-looking timeline out of it.

The CSV file is created by using a Python script that scrapes a web page we created with all of our content on it. The full code for this script is available here.

In the second part of the series, we went ahead and created a Python scraper that will pull all the information on the page we need.

1. If you’ve gone through the second part, you can go ahead and run python timeline.py now on your command line (More information on how to run the scraper is available in the first part of the blog).

You’ll notice the script will output a CSV that has all the information we need. But some of it is ugly. We need to delete certain tags, for instance, that appear before and after some of the text.

But before we can run Python commands to delete these unnecessary tags, we need to convert all of the text into strings. This is basically just text. Integers, by contrast, are numbers.

So let’s do that first:

# Extract that information in strings
    date2 = str(date)
    link2 = str(link)
    headline2 = str(headline)
    image2 = str(image)

    # These are pulled from outside the loop
    description2 = str(description)

Note: This code and the following chunks should go inside the loop statement we created in the second part of the series. This is because we want these changes to take effect on every item we scrape. If you are wondering, I tried to explain loop statements in my second blog. Your best bet though is to ask Google about loop statements.

2. Now that we’ve converted everything into strings, we can get rid of the ugly tags. You’ll notice that the descriptions of our stories start and end with a “p” tag. Similarly, all dates start and end with an “em” tag.

Because of this, I added a few lines to the Python script. These lines will replace the ugly tags with nothing…effectively deleting them before they are put into the CSV file:

# Extra formatting needed for dates to get rid of em tags and unnecessary formatting
    date4 = date3.replace('[<em>', "")
    date5 = date4.replace('</em>]', "")
    date6 = date5.replace('- ', "")
    date7 = date6.replace("at ", "")

    # Extra formatting is also need for the description to get rid of p tags and new line returns
    description4 = description3.replace('[<p>', "")
    description5 = description4.replace('</p>]', "")
    description6 = description5.replace('\n', " ")
    description7 = description6.replace('[]', "")

3. For images, we use a regular expression to match the three-digit width property (“\d\d\d”) and replace it with 300, making all images on the page 300 pixels wide. Also, we will delete the text “None,” which shows up if an image doesn’t exist for a particular asset.

# We will adjust the width of all images to 300 pixels. Also, Python spits out the word 'None' if it doesn't find an image. Delete that.
    image4 = re.sub(r'width="\d\d\d"', 'width="300"', image3)
    image5 = image4.replace('None', "")

4. If you are at all familiar with the way articles work in Blox, you know that when you update them, red text shows up next to the headline telling you when the story was last updated. The code that formats that text and makes it red is useless to us now. So we will delete this and replace it with the current time and date using the Python datetime module.

To use the datetime module, we have to first import it at the top of our Python script. We then need to call the object within the module that we want. A good introduction to Python modules is available here.

The module we want is called “datetime.now()”. As the name suggests, it returns the current date and time. I then put it in a variable called “now”, which makes it more reader friendly to use later in the script.

So the top of our page should look like this:

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

Inside the loop we call the “datetime.now()” object and replace the text “Updated” with the current date and time:

# If the story has been updated recently, an em class tag will appear on the page showing the time but not the date. We will delete the class and replace it with today's date. We can change the date in the CSV if we need to.
    date8 = date7.replace('[<em class="item-updated badge">Updated:', str(now.strftime("%Y-%m-%d %H:%M")))

5. Now there is just one last bit of cleaning up we will need to do. For those who don’t know, CSV stands for comma-separated values. This means that basically the columns we are creating in our spreadsheet are separated by commas. This is the preferred type of spreadsheet for most programs because it’s simple.

We can run into problems, however, if some of our data includes commas. So these next few lines in our script will replace all of the commas we scraped with dashes. You can change it to whatever character or characters you want:

# We'll replace commas with dashes so we don't screw up the CSV. You can change the dash to whatever character you want
    date3 = date2.replace(",", " -")
    link3 = link2.replace(",", " -")
    headline3 = headline2.replace(",", " -")
    description3 = description2.replace(",", " -")
    image3 = image2.replace(",", " -")

If you want to put the commas back into the timeline, you can do so after the final HTML file is created (i.e. after you run the timeline-setter command on the CSV). Typically I’ll replace all the commas with “////” and then do a simple find and replace on the final HTML file with a text editor and change every instance of “////” back into a comma.
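If you’d rather script that last find-and-replace than open a text editor, a couple lines of Python will do it (this assumes you used “////” as your stand-in character and that TimelineSetter wrote its output to Timeline.html):

# swap the stand-in characters back into commas in the finished timeline
html = open('Timeline.html').read()
open('Timeline.html', 'w').write(html.replace('////', ','))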

Now we have clean, concise data! We will now put this into the CSV file using the “write” command. Again, all these commands are put inside the loop statement we created in the second part of the series, so every image, description, date, etc. we scrape will be cleaned up and ready to go:

# Write the information to the file. The HTML code is based on coding recognized by TimelineSetter
    f.write(date8 + "," + description7 + "," + link3 + "," + '<h2 class="timeline-img-hed">' + headline3 + '</h2>' + image5 + "\n")

The headlines, you’ll notice, are put into an HTML “h2” tag. This will bold and increase the size of the headlines so they stick out when a reader clicks on a particular event in the timeline.

That’s it for the loop statement. Now we will add a simple line outside of the loop statement that closes the CSV file when all the loops we want to run are done:

#You're done! Close file.
f.close()

And there you have it folks. We are now done with our Python script. We can now run it (see part 1 of the series) and have a nice, clean looking CSV file we can turn into a timeline.

If you have any questions, PLEASE don’t hesitate to e-mail or Tweet me. I’d be more than happy to help.

Written by csessig

March 16, 2012 at 8:38 am

Turning Blox assets into timelines: Part 2

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said, you could probably learn a thing or two about web scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part three, click here

 

On my last blog, I discussed how you can turn Blox assets into a timeline using a tool made available by ProPublica called TimelineSetter.

If you recall, most of the magic happens with a little Python script called Timeline.py. It scrapes information from a page and puts it into a CSV file, which can then be used with TimelineSetter.

So what’s behind this Timeline.py file? I’ll go through the code by breaking it down into chunks. The full code is here and is heavily commented to help you follow along.

(NOTE: This python script is based off this tutorial from BuzzData. You should definitely check it out!)

– The first part of the script is basically the preliminary work. We’re not actually scraping the web page yet. This code first imports the necessary libraries for the script to run. We are using a Python library called BeautifulSoup that was designed for web scraping.

We then create a CSV to put the data in with the open function and add an initial header row to the CSV file with the write method. Also be sure to enter the URL of the page you want to scrape.

Note: For now, ignore the line “now = datetime.datetime.now().” We will discuss it later.

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

# Create a CSV where we'll save our data. See further docs:
# http://propublica.github.com/timeline-setter/#csv
f = open('timeline.csv', 'w')

# Make the header rows. These are based on headers recognized by TimelineSetter.
f.write("date" + "," + "description" + "," + "link" + "," + "html" + "\n")

# URL we will scrape
url = 'http://wcfcourier.com/test/scrape/dunkerton/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

– Before we go any further, we need to look at the page we are scraping, which in this example is this page. It’s basically a running list of articles about a particular subject. All of these stories will go on the timeline.

Now we’ll ask: what do we actually want to pull from this page? For each article we want to pull: the headline, the date, the photo, the first paragraph of the story and the link to the article.

Now we need to become familiar with the HTML of the page so we can tell BeautifulSoup what HTML attributes we want to pull from it. Go ahead and open the page up and view its source (Right click > View page source for Firefox and Chrome users).

One of the easiest things we can do is just search for the headline of the first article. So type in “Mayor’s arrest rattles Dunkerton.” This will take us to the chunk of code for that article. You’ll notice how the headline and all the other attributes for the story are contained in a DIV with the class ‘story-block.’

All stories on this page are formatted the same so every story is put into a DIV with the class ‘story-block.’ Thus, the number of DIVs with the class ‘story-block’ is also equal to the number of articles on the page we want to scrape.

– For the next line of code, we will use findAll to grab all of those ‘story-block’ DIVs and put them into a variable called ‘events.’ The line after that is what is known as a ‘for loop.’ Together, these two lines tell BeautifulSoup to run the ‘for loop’ once for every DIV in ‘events,’ which means once for every article on the page.

So if we have five articles we want to scrape, the ‘for loop’ will run five times. If we have 25 articles, it will run 25 times.

events = soup.findAll('div', attrs={'class': 'story-block'})
for x in events:

– Inside the ‘for loop,’ we need to tell it what information from each article we want to pull. Now go back to the source of the page we are scraping and find the headline, the date, the photo, the first paragraph of the story and the link to the article. You should see that:

  • The date is in a paragraph tag with the class ‘story-more’
  • The link appears several times, including within a tag called ‘fb:like,’ which is the Facebook like button people can click to share the article on Facebook.
  • The headline is in a h3 tag, which is a header tag.
  • The first few paragraphs of the story are contained within a DIV with the id ‘blox-story-text.’ Note: In the Python script, we will tell BeautifulSoup to pull only the first paragraph.
  • The photo is contained within an img tag, which shouldn’t be a surprise.

So let’s put all of that in the ‘for loop’ so it knows what we want from each article. The code below uses BeautifulSoup syntax, which you can find out about by reading their documentation.

    # Information on the page that we will scrape
    date = x.find('p', attrs={'class': 'story-more'})('em')
    link = x.find('fb:like')['href']
    headline = x.find('h3').text
    description = x.find('div', attrs={'id': 'blox-story-text'})('p', limit=1)
    image = x.find('img')

One note about the above code: The ‘x’ stands for whichever article the ‘for loop’ is currently on. For example, say we want to scrape 20 articles. The first time through the ‘for loop,’ ‘x’ is the first article’s ‘story-block’ DIV. The second time through, it’s the second article’s DIV. The last time through, it’s the 20th.

We use ‘x’ so we pull information from a different article each time we go through the ‘for loop.’ All of the find commands above are run on ‘x,’ so the first time through the loop we pull information from the first article, the second time through we pull information from the second article, and so on.

If we didn’t use ‘x,’ we’d run through the ‘for loop’ 20 times but we’d pull the same information each time. The ‘x’ in combination with the ‘for loop’ basically tells BeautifulSoup to start with one article, then move onto the next and then the next until we’ve scraped all the articles we want to scrape.

– Now you should be well on your way to creating timelines with Blox assets. For the third and final part of this tutorial, we will just clean up the data a little bit so it looks nice on the page. Look for the final post of this series soon!

Written by csessig

March 7, 2012 at 2:21 pm

Turning Blox assets into timelines: Part 1

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. For part two of this tutorial, click here. For part three, click here.

 

A couple weeks ago I blogged about the command line and a few reasons journalists should learn it. Among the reasons was a timeline tool made available by ProPublica called TimelineSetter, which is shown to the left. Here are two live examples to give you an idea of what the tool looks like.

To create the timeline, you will first need to make a specially-structured CSV file (CSV is a type of spreadsheet file. Excel can export CSV files). Rows in the CSV file will represent particular events on the timeline. Columns will include information on those events, like date, description, photo, etc.

ProPublica has a complete list of available columns here. To give you an idea of what the final product will look like BEFORE you make the timeline, you can download one CSV file we used to make a timeline by clicking here.

After you have your CSV file, you run a simple command and voila! A beautiful timeline will be created. For more information on the command you have to run, check out the TimelineSetter page. (Hint: The command you run is: timeline-setter -c timeline.csv)

By far the most tedious part of all this is tracking down events and articles you want to include in the timeline and making your CSV file. That is why I wrote a simple Python script that will help turn Blox assets into a CSV file you can use with TimelineSetter.

Here’s a walkthrough of how to use it:

1. The first thing you need to do is go to this GitHub page and follow the instructions in the ReadMe file. After that you will have a page set up with all of the events you want to include in the timeline. Here’s an example of what that page should look like.

2. Download the Python script (Timeline.py). What this script will actually be doing is scraping the web page we just created. And by that I mean it will be going to that page and pulling out all the information we need to create our timeline. So it will be grabbing photos, headlines, dates, etc. It will then create a CSV file with all that information. We can then use that CSV file with TimelineSetter.

3. The script uses a Python library called Beautiful Soup. If you don’t have that downloaded already, click here. It takes only a few seconds to install.

4. On line 16 of the Timeline.py file is a spot for the URL of the page we want to scrape. Make sure you change that to whatever URL you created.

5. Run the command python timeline.py from your command line in the same directory as the Python script you downloaded. This will output a CSV file.

6. You will need to download TimelineSetter, which is also really easy to do. Just run this command: gem install timeline_setter. For more information, click here.

7. Now navigate to the folder with the CSV file and run this command: timeline-setter -c timeline.csv. (Or whatever your CSV file is called).

8. You should end up with a directory of Javascript files, a CSS file and a Timeline.html file. This is your timeline. Now put it on your server and embed it using an HTML asset in Blox (or whatever you want to do with it).

9. Do the happy dance (Mandatory step)

This will get you pushing out timelines in no time. On my next blog, I will be going through that Python script (Timeline.py) and what it actually does to create the CSV file.

Written by csessig

March 2, 2012 at 9:59 am