Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘web scraping’ Category

How We Did It: Waterloo crime map

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Last week we launched a new feature on the Courier’s website: A crime map for the city of Waterloo that will be updated daily Monday through Friday.

The map uses data provided by the Waterloo police department. It’s presented in a way to allow readers to make their own stories out of the data.

(Note: The full code for this project is available here.)

Here’s a quick run-through of what we did to get the map up and running:

1. Turning a PDF into manageable data

The hardest part of this project was the first step: Turning a PDF into something usable. Every morning, the Waterloo police department updates their calls for service PDF with the latest service calls. It’s a rolling PDF that keeps track of about a week of calls.

The first step I took was turning the PDF into an HTML document using the command-line tool pdftohtml. Mac users can install it by going to the command line and typing in “brew install pdftohtml.” Then run “pdftohtml -c (ENTER NAME OF PDF HERE)” to turn the PDF into an HTML document.
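
If you want to script that conversion instead of typing it out each morning, a rough sketch using Python's subprocess module might look like this (the filename is just a placeholder, and this isn't part of the scraper itself):

# A rough sketch of scripting the pdftohtml conversion step.
# Assumes pdftohtml is already installed (e.g. via "brew install pdftohtml")
# and that you swap in the real PDF filename.
import subprocess

pdf_name = 'calls_for_service.pdf'  # placeholder filename

# The -c flag keeps the inline CSS positioning we rely on below
subprocess.call(['pdftohtml', '-c', pdf_name])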

The PDF we are converting is basically a spreadsheet. Each cell of the spreadsheet is turned into a DIV with PDFtoHTML. Each page of the PDF is turned into its own HTML document. We will then scrape these HTML documents using the programming language Python, which I have blogged about before. The Python library that will allow us to scrape the information is Beautiful Soup.

The “-c” command adds a bunch of inline CSS properties to these DIVs based on where they are on the page. These inline properties are important because they help us get the information off the spreadsheet we want.

All dates and times, for instance, are located in the second column. As a result, all the dates and times have the exact same inline left CSS property of “107” because they are all the same distance from the left side of the page.

The same goes for the dispositions. They are in the fifth column and are farther from the left side of the page so they have an inline left CSS property of “677.”

We use these properties to find the columns of information we want. The first thing we want is the dates. With our Python scraper, we’ll grab all the data in the second column, which is all the DIVs that have an inline left CSS property of “107.”

We then have a second argument that uses regular expressions to make sure the data is in the correct format, i.e. numbers and not letters. We do this to make sure we are pulling dates and not text accidentally.

The second argument is basically an insurance policy. Everything we pull with the CSS property of “107” should be a date. But we want to be 100% sure, so we’ll use regular expressions to confirm we’re getting digits and not stray text.
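
Here is a minimal sketch of that step (not the full scraper, which is linked at the end of this section). The filename and the exact inline style string are assumptions about how pdftohtml labels the converted page:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, the same library used elsewhere on this blog

# One page of the converted PDF; the filename here is just a placeholder
soup = BeautifulSoup(open('calls-1.html').read())

# Every DIV sitting 107 pixels from the left -- the date/time column
date_divs = soup.findAll('div', style=re.compile('left:107px'))

for div in date_divs:
    # Insurance policy: only keep cells that start with a digit,
    # so we know we grabbed a date and not stray text
    if re.match(r'\d', div.text):
        print div.text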

The third column is the reported crimes. But in our converted HTML document, crimes are actually located in the DIV previous to the date + time DIV. So once we have grabbed a date + time DIV with our Python scraper, we will check the previous DIV to see if it matches one of the seven crimes we are going to map. For this project, we decided not to map minor reports like business checks and traffic stops. Instead we are mapping the seven most serious reports.

If it is one of our seven crimes, we will run one final check to make sure it’s not a cancelled call, an unfounded call, etc. We do this by checking the disposition DIVs (column five in the spreadsheet), which are located before the crime DIVs. Also remember that all these have an inline left CSS property of “677”.

So we check these DIVs with our dispositions to make sure they don’t contain words like “NOT NEEDED” or “NO REPORT” or “CALL CANCELLED.”

Once we know it’s a crime that fits into one of our seven categories and it wasn’t a cancelled call, we add the crime, the date, the time, the disposition and the location to a CSV spreadsheet.
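
Putting those checks together, the heart of the scraper looks something like the sketch below. The crime list, disposition words and filename are examples to show the logic; the full scraper linked below is the real thing:

import csv
import re
from BeautifulSoup import BeautifulSoup

# Example values -- the real lists live in the full scraper linked below
CRIMES = ['ASSAULT', 'BURGLARY', 'ROBBERY', 'THEFT']
BAD_DISPOSITIONS = ['NOT NEEDED', 'NO REPORT', 'CALL CANCELLED']

soup = BeautifulSoup(open('calls-1.html').read())   # placeholder filename
writer = csv.writer(open('crimes.csv', 'a'))

for div in soup.findAll('div', style=re.compile('left:107px')):
    if not re.match(r'\d', div.text):                # insurance policy from above
        continue

    # The crime sits in the DIV just before the date/time DIV
    crime_div = div.findPreviousSibling('div')
    if crime_div is None or crime_div.text.strip() not in CRIMES:
        continue

    # The disposition column (677 pixels from the left) tells us whether
    # the call was cancelled, unfounded, etc.
    disp_div = div.findPrevious('div', style=re.compile('left:677px'))
    disposition = disp_div.text if disp_div else ''
    if any(bad in disposition for bad in BAD_DISPOSITIONS):
        continue

    # Location column omitted here to keep the sketch short
    writer.writerow([crime_div.text.strip(), div.text, disposition])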

The full Python scraper is available here.

2. Using Google to get latitude, longitude and JSON

The mapping service I used was Leaflet, as opposed to Google Maps. But we will need to geocode our addresses to get latitude and longitude information for each point to use with Leaflet. We also need to convert our spreadsheet into a JSON (JavaScript Object Notation) file.

Fortunately that is an easy and quick process thanks to two gadgets available to us using Google Docs.

The first thing we need to do is upload our CSV to Google Docs. Then we can use this gadget to get latitude and longitude points for each address. Then we can use this gadget to get the JSON file we will use with the map.
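
If you ever want to skip the JSON gadget, here is a rough Python alternative (not the workflow we used) that turns an already-geocoded CSV into a JSON file Leaflet can read. The column names are assumptions, so adjust them to match your spreadsheet:

import csv
import json

crimes = []
for row in csv.DictReader(open('crimes_geocoded.csv')):
    crimes.append({
        'crime': row['crime'],
        'date': row['date'],
        'disposition': row['disposition'],
        'lat': float(row['latitude']),
        'lng': float(row['longitude']),
    })

# Write the list of crimes out as one JSON file for the map
json.dump(crimes, open('crimes.json', 'w'))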

3. Powering the map with Leaflet, jQRangeSlider, DataTables and Bootstrap

As I mentioned, Leaflet powers the map. It uses the latitude and longitude points from the JSON file to map our crimes.

For this map, I created my own icons. I used a free image editor known as Seashore, which is a fantastic program for those who are too cheap to shell out the dough for Adobe’s Photoshop.

The date range slider below the map is a very awesome tool called jQRangeSlider. Basically every time the date range is moved, a Javascript function is called that will go through the JSON file and see if the crimes are between those two dates.

This Javascript function also checks to see if the crime has been selected by the user. Notice on the map the check boxes next to each crime logo under “Types of Crimes.”

If the crime is both between the dates on the slider and checked by the users, it is mapped.

While this is going on, an HTML table of this information is being created below the map. We use another awesome tool called DataTables to make that table of crimes interactive. With it, readers can display up to 100 records on the page or search through the records.

Finally, we create a pretty basic bar chart using the Progress Bars made available by Bootstrap, an awesome front-end framework released by the people who brought us Twitter.

Creating these bars is easy: We just need to create DIVs and give them a certain class so Bootstrap knows how to style them. We create a bar for each crime that is automatically updated when we tweak the map.

For more information on progress bars, check out the documentation from Bootstrap. I also want to thank the app team at the Chicago Tribune for providing the inspiration behind the bar chart with their 2012 primary election app.

The full Javascript file is available here.

4. Daily upkeep

This map is not updated automatically so every day, Monday through Friday, I will be adding new crimes to our map.

Fortunately, this only takes about 5-10 minutes of work. Basically I scrape the last few pages of the police’s crime log PDF, pull out the crimes that are new, pull them into Google Docs, get the latitude and longitude information, output the JSON file and put that new file into our FTP server.

Trust me, it doesn’t take nearly as long as it sounds to do.

5. What’s next?

Besides minor tweaks and possible design improvements, I have two main goals for this project in the future:

A. Create a crime map for Cedar Falls – Cedar Falls is Waterloo’s sister city and like the Waterloo police department, the Cedar Falls police department keeps a daily log of calls for service. They also post PDFs, so I’m hoping the process of pulling out the data won’t be drastically different from what I did for the Waterloo map.

B. Create a mobile version for both crime maps – Maps don’t work tremendously well on mobile phones. So I’d like to develop some sort of alternative for mobile users. Fortunately, we have all the data. We just need to figure out how to display it best for smartphones.

Have any questions? Feel free to e-mail me at chris.essig@wcfcourier.com.


Turning Blox assets into timelines: Last Part

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said you could probably learn a thing or two about web-scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part two, click here.

 

If you’ve followed along thus far, you’ll see we’re almost done turning a collection of Blox assets — which can include articles, photos and videos — into timelines using a tool made available by ProPublica called TimelineSetter.

Right now, we are in the process of creating a CSV file, which we will use with the program TimelineSetter. This program takes the CSV file and creates a nice-looking timeline out of it.

The CSV file is created by using a Python script that scrapes a web page we created with all of our content on it. The full code for this script is available here.

In the second part of the series, we went ahead and created a Python scraper that will pull all the information on the page we need.

1. If you’ve gone through the second part, you can go ahead and run python timeline.py now on your command line (More information on how to run the scraper is available in the first part of the blog).

You’ll notice the script will output a CSV that has all the information we need. But some of it is ugly. We need to delete certain tags, for instance, that appear before and after some of the text.

But before we can run Python commands to delete these unnecessary tags, we need to convert all of the text into strings. A string is basically just text; integers, by contrast, are numbers.

So let’s do that first:

# Extract that information in strings
    date2 = str(date)
    link2 = str(link)
    headline2 = str(headline)
    image2 = str(image)

    # These are pulled from outside the loop
    description2 = str(description)

Note: This code and the following chunks should go inside the loop statement we created in the second part of the series. This is because we want these changes to take effect on every item we scrape. If you are wondering, I tried to explain loop statements in my second blog. Your best bet though is to ask Google about loop statements.

2. Now that we’ve converted everything into strings, we can get rid of the ugly tags. You’ll notice that the descriptions of our stories start and end with a “p” tag. Similarly, all dates start and end with an “em” tag.

Because of this, I added a few lines to the Python script. These lines will replace the ugly tags with nothing…effectively deleting them before they are put into the CSV file:

# Extra formatting needed for dates to get rid of em tags and unnecessary formatting
    date4 = date3.replace('[<em>', "")
    date5 = date4.replace('</em>]', "")
    date6 = date5.replace('- ', "")
    date7 = date6.replace("at ", "")

    # Extra formatting is also needed for the description to get rid of p tags and new line returns
    description4 = description3.replace('[<p>', "")
    description5 = description4.replace('</p>]', "")
    description6 = description5.replace('\n', " ")
    description7 = description6.replace('[]', "")

3. For images, the width attribute comes through as whatever three-digit number the original had. We will use a regular expression to change it to 300, making all images on the page 300 pixels wide. Also, we will delete the text “None,” which shows up if an image doesn’t exist for a particular asset.

# We will adjust the width of all images to 300 pixels. Also, Python spits out the word 'None' if it doesn't find an image. Delete that.
    image4 = re.sub(r'width="\d\d\d"', 'width="300"', image3)
    image5 = image4.replace('None', "")

4. If you are at all familiar with the way articles work in Blox, you know that when you update them, red text shows up next to the headline telling you when the story was last updated. The code that formats that text and makes it red is useless to us now. So we will delete this and replace it with the current time and date using the Python datetime module.

To use the datetime module, we have to first import it at the top of our Python script. We then need to call the method within the module that we want. A good introduction to Python modules is available here.

The method we want is “datetime.now()”. As the name suggests, it returns the current date and time. I then put the result in a variable called “now”, which makes it more reader-friendly to use later in the script.

So the top of our page should look like this:

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

Inside the loop we call the “datetime.now()” object and replace the text “Updated” with the current date and time:

# If the story has been updated recently, an em class tag will appear on the page showing the time but not the date. We will delete the class and replace it with today's date. We can change the date in the CSV if we need to.
    date8 = date7.replace('[<em class="item-updated badge">Updated:', str(now.strftime("%Y-%m-%d %H:%M")))

5. Now there is just one last bit of cleaning up we will need to do. For those who don’t know, CSV stands for comma-separated values. This means that basically the columns we are creating in our spreadsheet are separated by commas. This is the preferred type of spreadsheet for most programs because it’s simple.

We can run into problems, however, if some of our data includes commas. So these next few lines in our script will replace all of the commas we scraped with dashes. You can change it to whatever character or characters you want:

# We'll replace commas with dashes so we don't screw up the CSV. You can change the dash to whatever character you want
    date3 = date2.replace(",", " -")
    link3 = link2.replace(",", " -")
    headline3 = headline2.replace(",", " -")
    description3 = description2.replace(",", " -")
    image3 = image2.replace(",", " -")

If you want to put the commas back into the timeline, you can do so after the final HTML file is created (after you run python timeline.py). Typically I’ll replace all the commas with “////” and then do a simple find and replace on the final HTML file with a text editor and change every instance of “////” back into a comma.
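
If you’d rather not do that find and replace by hand, a couple of lines of Python will do the same thing on the generated file (this assumes you used “////” as the placeholder and that the file is named Timeline.html, as it is in part one of this series):

# Swap the "////" placeholders back into commas in the finished timeline
html = open('Timeline.html').read()
open('Timeline.html', 'w').write(html.replace('////', ','))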

Now we have clean, concise data! We will now put this into the CSV file using the “write” command. Again, all these commands are put inside the loop statement we created in the second part of the series, so every image, description, date, etc. we scrape will be cleaned up and ready to go:

# Write the information to the file. The HTML code is based on coding recognized by TimelineSetter
    f.write(date8 + "," + description7 + "," + link3 + "," + '<h2 class="timeline-img-hed">' + headline3 + '</h2>' + image5 + "\n")

The headlines, you’ll notice, are put inside an HTML “h2” tag. This will bold and increase the size of the headlines so they stick out when a reader clicks on a particular event in the timeline.

That’s it for the loop statement. Now we will add a simple line outside of the loop statement that closes the CSV file when all the loops we want run are done:

#You're done! Close file.
f.close()

And there you have it folks. We are now done with our Python script. We can now run it (see part 1 of the series) and have a nice, clean looking CSV file we can turn into a timeline.

If you have any questions, PLEASE don’t hesitate to e-mail or Tweet me. I’d be more than happy to help.

Written by csessig

March 16, 2012 at 8:38 am

Turning Blox assets into timelines: Part 2

with 2 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said you could probably learn a thing or two about web scraping even if you don’t use Blox.

For part one of this tutorial, click here. For part three, click here.

 

On my last blog, I discussed how you can turn Blox assets into a timeline using a tool made available by ProPublica called TimelineSetter.

If you recall, most of the magic happens with a little Python script called Timeline.py. It scrapes information from a page and puts it into a CSV file, which can then be used with TimelineSetter.

So what’s behind this Timeline.py file? I’ll go through the code by breaking it down into chunks. The full code is here and is heavily commented to help you follow along.

(NOTE: This python script is based off this tutorial from BuzzData. You should definitely check it out!)

– The first part of the script is basically the preliminary work. We’re not actually scraping the web page yet. This code first imports the necessary libraries for the script to run. We are using a Python library called BeautifulSoup that was designed for web scraping.

We then create a CSV to put the data in with the open function and create an initial header row in the CSV file with the write method. Also be sure to enter the URL of the page you want to scrape.

Note: For now, ignore the line “now = datetime.datetime.now().” We will discuss it later.

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

# Create a CSV where we'll save our data. See further docs:
# http://propublica.github.com/timeline-setter/#csv
f = open('timeline.csv', 'w')

# Make the header rows. These are based on headers recognized by TimelineSetter.
f.write("date" + "," + "description" + "," + "link" + "," + "html" + "\n")

# URL we will scrape
url = 'http://wcfcourier.com/test/scrape/dunkerton/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

– Before we go any further, we need to look at the page we are scraping, which in this example is this page. It’s basically a running list of articles about a particular subject. All of these stories will go on the timeline.

Now we’ll ask: what do we actually want to pull from this page? For each article we want to pull: the headline, the date, the photo, the first paragraph of the story and the link to the article.

Now we need to become familiar with the HTML of the page so we can tell BeautifulSoup what HTML attributes we want to pull from it. Go ahead and open the page up and view its source (Right click > View page source for Firefox and Chrome users).

One of the easiest things we can do is just search for the headline of the first article. So type in “Mayor’s arrest rattles Dunkerton.” This will take us to the chunk of code for that article. You’ll notice how the headline and all the other attributes for the story are contained in a DIV with the class ‘story-block.’

All stories on this page are formatted the same so every story is put into a DIV with the class ‘story-block.’ Thus, the number of DIVs with the class ‘story-block’ is also equal to the number of articles on the page we want to scrape.

– For the next line of code, we will put that collection of DIVs into a variable called ‘events.’ The line after that is what is known as a ‘for loop.’ Together, these two lines tell BeautifulSoup to run the ‘for loop’ once for every ‘story-block’ DIV we found.

So if we have five articles we want to scrape, the ‘for loop’ will run five times. If we have 25 articles, it will run 25 times.

events = soup.findAll('div', attrs={'class': 'story-block'})
for x in events:

– Inside the ‘for loop,’ we need to tell it what information from each article we want to pull. Now go back to the source of the page we are scraping and find the headline, the date, the photo, the first paragraph of the story and the link to the article. You should see that:

  • The date is in a paragraph tag with the class ‘story-more’
  • The link appears several times, including within a tag called ‘fb:like,’ which is the Facebook like button people can click to share the article on Facebook.
  • The headline is in a h3 tag, which is a header tag.
  • The first few paragraphs of the story are contained within a DIV with the id ‘blox-story-text.’ Note: In the Python script, we will tell BeautifulSoup to pull only the first paragraph.
  • The photo is contained within an img tag, which shouldn’t be a surprise.

So let’s put all of that in the ‘for loop’ so it knows what we want from each article. The code below uses BeautifulSoup syntax, which you can find out about by reading their documentation.

    # Information on the page that we will scrape
    date = x.find('p', attrs={'class': 'story-more'})('em')
    link = x.find('fb:like')['href']
    headline = x.find('h3').text
    description = x.find('div', attrs={'id': 'blox-story-text'})('p', limit=1)
    image = x.find('img')

One note about the above code: The ‘x’ is the article the ‘for loop’ is currently working on. Each time through the loop, BeautifulSoup hands us the next ‘story-block’ DIV and calls it ‘x.’ For example, say we want to scrape 20 articles: the first time through the ‘for loop,’ ‘x’ is the first article on the page. The second time through, it’s the second article. The last time through, it’s the 20th.

We use ‘x’ inside the loop so we pull information from a different article each time we go through the ‘for loop.’ If we searched the whole page instead of searching within ‘x,’ we’d run through the ‘for loop’ 20 times but pull the same information from the same article each time. The ‘x’ in combination with the ‘for loop’ basically tells BeautifulSoup to start with one article, then move onto the next and then the next and so on until we’ve scraped all the articles we want to scrape.

– Now you should be well on your way to creating timelines with Blox assets. For the third and final part of this tutorial, we will just clean up the data a little bit so it looks nice on the page. Look for the final post of this series soon!

Written by csessig

March 7, 2012 at 2:21 pm

Turning Blox assets into timelines: Part 1

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. For part two of this tutorial, click here. For part three, click here.

 

A couple weeks ago I blogged about the command line and a few reasons journalists should learn it. Among the reasons was a timeline tool made available by ProPublica called TimelineSetter. Here are two live examples to give you an idea of what the tool looks like.

To create the timeline, you will first need to make a specially-structured CSV file (CSV is a type of spreadsheet file. Excel can export CSV files). Rows in the CSV file will represent particular events on the timeline. Columns will include information on those events, like date, description, photo, etc.

ProPublica has a complete list of available columns here. To give you an idea of what the final product will look like BEFORE you make the timeline, you can download one CSV file we used to make a timeline by clicking here.
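
To make that concrete, here is a tiny, made-up example of building a TimelineSetter-ready CSV in Python. The header row matches the columns our script writes later in this series; the event itself is hypothetical:

import csv

writer = csv.writer(open('timeline.csv', 'w'))
# These headers are recognized by TimelineSetter
writer.writerow(['date', 'description', 'link', 'html'])
# A made-up event row
writer.writerow([
    '2012-02-24',
    'First paragraph of the story goes here',
    'http://wcfcourier.com/example-story/',
    '<h2 class="timeline-img-hed">Example headline</h2>',
])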

After you have your CSV file, you run a simple command and voila! A beautiful timeline will be created. For more information on the command you have to run, check out the TimelineSetter page. (Hint: The command you run is: timeline-setter -c timeline.csv)

By far the most tedious part of all this is tracking down events and articles you want to include in the timeline and making your CSV file. That is why I wrote a simple Python script that will help turn Blox assets into a CSV file you can use with TimelineSetter.

Here’s a walkthrough of how to use it:

1. The first thing you need to do is go to this GitHub page and follow the instructions in the ReadMe file. After that you will have a page set up with all of the events you want to include in the timeline. Here’s an example of what that page should look like.

2. Download the Python script (Timeline.py). What this script will actually be doing is scraping the web page we just created. And by that I mean it will be going to that page and pulling out all the information we need to create our timeline. So it will be grabbing photos, headlines, dates, etc. It will then create a CSV file with all that information. We can then use that CSV file with TimelineSetter.

3. The script uses a Python library called Beautiful Soup. If you don’t have that downloaded already, click here. It takes only a few seconds to install.

4. In the Timeline.py file on line 16 is a spot for the URL of the page we want to scrape. Make sure you change that to whatever URL you created.

5. Run the command python timeline.py from your command line in the same directory as the Python script you downloaded. This will output a CSV file.

6. You will need to download TimelineSetter, which is also really easy to do. Just run this command: gem install timeline_setter. For more information, click here.

7. Now navigate to the folder with the CSV file and run this command: timeline-setter -c timeline.csv. (Or whatever your CSV file is called).

8. You should end up with a directory of javascript files, a CSS file and a Timeline.html file. This is your timeline. Now put it on your server and embed it using an HTML asset in Blox (or whatever you want to do with it).

9. Do the happy dance (Mandatory step)

This will get you pushing out timelines in no time. On my next blog, I will be going through that Python script (Timeline.py) and what it actually does to create the CSV file.

Written by csessig

March 2, 2012 at 9:59 am