Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Command Line’ Category

How automation can save you on deadline


In the data journalism world, data dumps are pretty common occurrences. And oftentimes these dumps are released periodically — every month, every six months, every year, etc.

Despite their periodic nature, these dumps can catch you off guard, especially if you are working on a million things at once (but nobody does that, right?).


Case in point: Every year the Centers for Medicare & Medicaid Services releases data on how much money doctors received from medical companies. If I’m not mistaken, the centers release the data now because of the great investigative work of ProPublica. The data they release is important and something our readers care about.

Last year, I spent a ton of time parsing the data to put together a site that shows our readers the top paid doctors in Iowa. We decided to limit the number of doctors shown to 100, mostly because we ran out of time.

In all, I spent at least a week on the project last year. But this year, I was not afforded that luxury because the data dump caught us off guard. Instead of a week or more, we had a few days to put together the site and story.

Fortunately, I’ve spent quite a bit of time in the last year moving away from editing spreadsheets in Excel or Google Docs to editing them using shell scripts.

The problem with GUI editors is that the work you do in them isn’t easily reproducible. If you make 20 changes to a spreadsheet in Excel one year and then the next year you get an updated version of that same spreadsheet, you have to make those exact same 20 changes again.

That sucks.

Fortunately, with tools like csvkit and agate, you can write A LOT of Excel-like functions in a shell script or a Python script. This allows you to document what you are doing so you don’t forget. I can’t tell you how many times I’ve done something in Excel, saved the spreadsheet, then come back to it a few days later and completely forgotten what the hell I was doing.

Scripts also let you automate your data analysis because you can re-run the same functions over and over again. Let’s say you perform five functions in Excel and trim 20 columns from your spreadsheet. Now let’s say you want to do that again, either because you did it wrong the first time or because you received new data. Would you rather run a shell script that performs the task in seconds or do everything over in Excel?
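To make that concrete, here’s a minimal sketch of that kind of scripted cleanup using agate (the file and column names are hypothetical):

import agate

# A hypothetical cleanup script: file and column names are placeholders
table = agate.Table.from_csv('payments_raw.csv')

# Keep only the columns we actually publish
trimmed = table.select(['physician_name', 'city', 'state', 'amount'])

# Limit the data to Iowa doctors who received $500 or more
iowa = trimmed.where(lambda row: row['state'] == 'IA' and
                     row['amount'] is not None and row['amount'] >= 500)

# Write the result out, sorted by payment, ready for the next step
iowa.order_by('amount', reverse=True).to_csv('payments_iowa.csv')

When next year’s dump lands, you point the script at the new file and run it again.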

Another nice thing about shell scripts is you can hook your spreadsheet into any number of data processing programs. Want to port your data into a SQLite database so you can perform queries? No problem. You can also keep those SQL queries in separate files and run them from your shell script.
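As a rough sketch of that idea, here’s what loading a CSV into SQLite and querying it can look like with Python’s built-in csv and sqlite3 modules (the file, table and column names are made up for illustration):

import csv
import sqlite3

# Load a (hypothetical) payments CSV into SQLite so it can be queried
conn = sqlite3.connect('payments.db')
conn.execute('CREATE TABLE IF NOT EXISTS payments (physician TEXT, state TEXT, amount REAL)')

with open('payments.csv') as f:
    rows = [(r['physician'], r['state'], float(r['amount'])) for r in csv.DictReader(f)]

conn.executemany('INSERT INTO payments VALUES (?, ?, ?)', rows)
conn.commit()

# Top 10 doctors in Iowa by total payments
query = """
    SELECT physician, SUM(amount) AS total
    FROM payments
    WHERE state = 'IA'
    GROUP BY physician
    ORDER BY total DESC
    LIMIT 10
"""
for physician, total in conn.execute(query):
    print(physician, total)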

All of this came in handy when this year’s top-paid doctors data was released. Sure, I still spent a stressful day editing the new spreadsheets. But if I hadn’t been using shell scripts, I doubt I would have finished by our deadline.

Another thing I was able to do was increase the number of doctors listed online. We went from 100 total doctors to every doctor who received $500 or more (totaling more than 2,000 doctors). This means it’s much more likely that readers will find their doctors this year.

The project is static and doesn’t have a database backend so I didn’t necessarily need to limit the number of doctors last year. We just ran out of time. This year, we were able to not only update the app but give our readers much more information about Iowa doctors. And we did it in a day instead of a week.

The project is online here. The code I used to edit the 15+ gigabytes worth of data from CMS is available here.

 

 

Written by csessig

September 7, 2016 at 4:59 pm

How laziness led me to produce news apps quicker


Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

An important trait for programmers to adopt, especially for those who are writing code on deadline, is laziness. While that sounds counter-productive, it can be incredibly beneficial.

When writing code, you do a lot of repetitive tasks. If you’re using a Javascript library for mapping, for instance, you’ll need to copy and paste the Javascript and CSS files into a new directory every time you start a new project. If you want to use plugins with that map, you will need to include their files in the directory as well.

The more complicated your project, the more libraries you will likely use, which means a lot of tedious work putting all those files in the new directory.

And that’s before you even start to code. If you want your projects to have a similar look (which we do at the Courier), you will need to create code that you can replicate in the future. And then every time you want to use it, you will need to rewrite the same code or, at the very least, copy code from your old projects and paste it into your new project.

I did this for a long time. Whenever I was creating a new app, I’d open a bunch of projects I’d already completed, go into their CSS and JS files, copy some of it and paste it into the new files. It was maddening because I knew with every new project, I would waste hours basically replicating work I had already done.

This is where laziness pays off. If you are too lazy to do the same things over and over again, you’ll start coming up with solutions that automatically do these repetitive tasks.

That’s why many news apps teams have created templates for their applications. When they create new projects, they start with this template. This allows their apps to have a similar look. It also means they don’t have to do the same things over and over again every time they are working with a project.

App templating is something I’ve dabbled with before. But those attempts were minor. For the most part, I did a lot of the same mind-numbing tasks every time I wanted to create a map, for instance.

Fortunately I found a good solution a few months ago: Yeoman, which is dubbed “the web’s scaffolding tool for modern web apps,” has helped me eliminate much of that boring, repetitive work.

Yeoman is built in Node. Before I stumbled upon it a few months ago, I knew very little about Node. I still don’t know a ton but this video helped me grasp the basics. One advantage of Node is it allows you to run Javascript on the command line. Pretty cool.

Yeoman is built on three parts. The first is yo, which helps create new applications and all the files that go with them. It also uses Grunt, which I have come to love. Grunt is used for running tasks, like minifying code, starting a local server so you can test your code, copying files into new directories and a whole bunch of other things that make coding quicker, easier and less repetitive.

Finally, Yeoman uses Bower, which is a directory of all the useful packages you can use with applications, from CSS frameworks to Javascript mapping libraries. With a combination of Bower and Grunt, you can download the files you need for your projects and put them in the appropriate directories using a Grunt task. Tasks are written in a Gruntfile and run on the command line. The packages you want to download from Bower and use in your project are stored in a bower.json file.

Yeoman also relies on generators. A generator makes it easy to create new apps using a template that’s already been created. You can either create your own generator or use one of the 400+ community generators. The generators themselves are directories of files, like images, JS files, CSS stylesheets, etc. The advantage of using generators is you can create a stylesheet, for instance, save it in the generator directory and use that same stylesheet with every new project you create. If you decide to edit it, those changes will be reflected in every new project.

When you use a generator, those files are copied into a new directory, along with Grunt tasks, which are great for when you are testing and deploying your apps. Yeoman also has a built-in prompt system that asks you a series of questions whenever you create an app. This helps copy over the appropriate files and code, depending on what you are trying to make.

I decided I would take a stab at creating a generator. The code is here, along with a lot more explanation on how to use it. Basically, I created it so it would be easier to produce simple apps. Whenever I create a new Leaflet map, for instance, I can run this generator and a sample map with sample data will be output into a new directory. I also have a few options depending on how I want to style the map, as well as options for different data types: GeoJSON, JSON with latitude and longitude coordinates and Tabletop. Which files are scaffolded out depends on how you respond to the prompts that are displayed when you first fire up the generator.

There is also an option for non-maps. Finally, you can use Handlebars for templating your data, if that’s your thing.

My hope is to expand upon this generator and create more options in the future. But for now, this is an easy way to avoid a bunch of repetitive tasks. And thus, laziness pays off.

If you are interested in creating your own generator, Yeoman has a great tutorial that you should check out.

Written by csessig

February 19, 2014 at 12:08 pm

Using hexagon maps to show areas with high levels of reported crime


Screenshot: Violent crime in Waterloo

Crime maps are a big hit at news organizations these days. And with good reason: Good crime maps can tell your readers a whole lot about crime not only in their city but their backyard.

For the last year and a half, we’ve been doing just that. By collecting data every week from our local police departments, we’ve been able to provide our community a great, constantly updated resource on crime in Waterloo and Cedar Falls. And we’ve even dispelled some myths on where crime happens along the way.

The process, which I’ve blogged about in the past, has evolved since the project was first introduced. But basically it goes like this: I download PDFs of crime reports from the Waterloo and Cedar Falls police departments every week. Those PDFs are scraped using a Python script and exported into a CSV. The script pulls out reports of the most serious crimes and disregards minor reports like assistance calls.

The CSV is then uploaded to Google spreadsheets as a backup. We also use this website to get latitude and longitude coordinates from the addresses in the Google spreadsheet. Finally, the spreadsheet is converted into JSON using Mr. Data Converter and loaded onto our FTP server. We use Leaflet to map the JSON files.
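If you ever wanted to script that last conversion step instead of pasting the spreadsheet into Mr. Data Converter, a short Python sketch along these lines would do it. This one writes GeoJSON, which Leaflet also reads; the file name and the crime, date and address columns are stand-ins:

import csv
import json

# Read the crime CSV and build one GeoJSON feature per report
# (column names other than "lat" and "long" are placeholders)
features = []
with open('crime_reports.csv') as f:
    for row in csv.DictReader(f):
        features.append({
            'type': 'Feature',
            'geometry': {
                'type': 'Point',
                'coordinates': [float(row['long']), float(row['lat'])],
            },
            'properties': {
                'crime': row['crime'],
                'date': row['date'],
                'address': row['address'],
            },
        })

# Write a FeatureCollection that Leaflet can load directly
with open('crime_reports.json', 'w') as out:
    json.dump({'type': 'FeatureCollection', 'features': features}, out)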

We’ve broken the map into months and cities, meaning we have a map and a corresponding JSON file for each month for both Waterloo and Cedar Falls. We also have JSON files with all of the year’s crime reports for both Waterloo and Cedar Falls.

The resulting maps are great for readers looking for detailed information on crime in their neighborhood. They also provide plenty of filters so readers can see which areas have more shootings, assaults, burglaries, etc.

But they largely fail to give a good overview of which areas of the cities had the most reported crimes in 2013, for two reasons: 1) Because of the sheer number of crime reports (3,684 reports in 2013 for Waterloo alone) and the presence of multiple reports coming from the same address, it’s nearly impossible to simply look at the 2013 maps and see which areas of the cities had the most reports; and 2) The JSON files containing all of the year’s reports run well on modern browsers but more slowly on outdated browsers (I’m looking at you, Internet Explorer), which aren’t equipped to iterate over JSON files with thousands of items.

As a result, I decided to create separate choropleth maps that serve one purpose: To show what areas of the cities had the most reported crimes. The cities would be divided into sections and areas with more crimes would be darker on the map, while the areas with fewer crimes would be lighter. Dividing the city into sections would also mean the resulting JSON file would have dozens of items instead of thousands. The map would serve a simple purpose and load quickly.

What would the areas of the cities actually look like? I’ve been fascinated with hex maps since I saw this gorgeous map from the L.A. Times on 911 response times. The map not only looked amazing but did an incredibly effective job displaying the data and telling the story. While we want our map to look nice, telling its story and serving our journalistic purpose will always be our No. 1 goal. Fortunately, hex maps do both.

And even more fortunately, the process of creating hex maps got a whole lot easier when data journalist Kevin Schaul introduced a command line tool called binify that turns dot density shapefiles into shapefiles with hexagon shapes. Basically the map is cut into hexagon shapes, and all the points inside each hexagon are counted. Then each hexagon is given a corresponding count value for the number of points that fall within its borders.

I used binify to convert our 2013 crime maps into hex maps. As I mentioned earlier, we started out with CSVs with all the crimes. My workflow for creating the hex maps and then converting them into JSON files for Leaflet is in a gist below. It utilizes the fantastic command line tool ogr2ogr, which is used to convert map files, and binify.

# We start with a CSV called "crime_master_wloo_violent.csv" that has violent crime reports in Waterloo.
# Included in csv are columns for latitude ("lat") and longitude ("long"):
https://github.com/csessig86/crime_map2013/blob/master/csv/crime_master_wloo_violent.csv

# Command to turn CSV into DBF with the same name
ogr2ogr -f "ESRI Shapefile" crime_master_wloo_violent.dbf crime_master_wloo_violent.csv

# Create a file called "crime_master_wloo_violent.vrt"
# Change "SrcLayer" to the name of the source
# Change OGRVRTLayer name to "wloo_violent"
# Add name of lat, long columns to "GeometryField"
<OGRVRTDataSource>
  <OGRVRTLayer name="wloo_violent">
    <SrcDataSource relativeToVRT="1">./</SrcDataSource>
    <SrcLayer>crime_master_wloo_violent</SrcLayer>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>WGS84</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="long" y="lat"/>
  </OGRVRTLayer>
</OGRVRTDataSource>

# Create "violent" folder to stash shp file
mkdir violent

# Convert "vrt" file into shp and put it in our "violent" folder
ogr2ogr -f "ESRI Shapefile" violent crime_master_wloo_property.vrt

# Binify the "wloo_violent" shp into a shp called "wloo_violent_binify" with three options
# The first sets the number of hexagons across to 45
# The second sets the custom extent to a box around Waterloo
# Custom extent is basically the extent of the area that the overlay of hexagons will cover
# The third tells binify to ignore hexagons that have zero points
binify wloo_violent.shp wloo_violent_binify.shp -n 45 -E -92.431873 -92.260867 42.4097259 42.558403 --exclude-empty

# Convert binified shp into JSON file called "crime_data_review_violent"
ogr2ogr -f "GeoJSON" crime_data_review_violent.json wloo_violent_binify.shp

# Prepend this variable declaration to the JSON file
# so it can be loaded like a Javascript file:
var crime_data_review_violent =

# Here's the final files:
https://github.com/csessig86/crime_map2013/tree/master/shps/wloo_shps

# And our JSON file:
https://github.com/csessig86/crime_map2013/blob/master/JSON/crime_data_review_violent.json

# And the Javascript file that is used with Leaflet to run the map.
# Lines 113 through 291 are the ones related to pulling the data from the JSON files,
# coloring them, setting the legend and setting a mouseover for each hexagon
https://github.com/csessig86/crime_map2013/blob/master/js/script_map_review.js

# And the map online:
http://wcfcourier.com/app/crime_map2013/index_wloo.php

The above shows the process I took for creating our hex map of violent crimes in Waterloo. I duplicated this process three more times to get four maps in all. One showed property crimes in Waterloo, while the other two show violent and property crime in Cedar Falls.

* Note: For more information on converting shapefiles with ogr2ogr, check out this blog post from Ben Balter. Also, this cheat sheet of ogr2ogr command line tools was invaluable.

Another great resource is this blog post on minifying GeoJSON files from Bjørn Sandvik. While the post deals with GeoJSON files specifically, the process works with JSON files as well. Minifying my JSON files made them about 50-150 KB smaller. While it’s not a drastic difference, every little bit counts.
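If you’d rather not leave Python, the built-in json module can do the same sort of minification by re-serializing the file without extra whitespace. This assumes you run it on the plain JSON before any “var” declaration is added to the front of the file; the file names are placeholders:

import json

# Re-serialize the file with no extra whitespace (file names are placeholders)
with open('crime_data_review_violent.json') as f:
    data = json.load(f)

with open('crime_data_review_violent.min.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))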

To get choropleth maps up and running in Leaflet, check out this tutorial. You can also check out my Javascript code for our crime maps.

One last note: The map is responsive. Check it out.

Written by csessig

January 11, 2014 at 4:42 pm

How We Did It: Waterloo crime map


Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Last week we launched a new feature on the Courier’s website: A crime map for the city of Waterloo that will be updated daily Monday through Friday.

The map uses data provided by the Waterloo police department. It’s presented in a way to allow readers to make their own stories out of the data.

(Note: The full code for this project is available here.)

Here’s a quick run-through of what we did to get the map up and running:

1. Turning a PDF into manageable data

The hardest part of this project was the first step: Turning a PDF into something usable. Every morning, the Waterloo police department updates their calls for service PDF with the latest service calls. It’s a rolling PDF that keeps track of about a week of calls.

The first step I took was turning the PDF into an HTML document using the command line tool PDFtoHTML. For Mac users, you can download it by going to the command line and typing in “brew install pdftohtml.” Then run “pdftohtml -c (ENTER NAME OF PDF HERE)” to turn the PDF into an HTML document.

The PDF we are converting is basically a spreadsheet. Each cell of the spreadsheet is turned into a DIV with PDFtoHTML. Each page of the PDF is turned into its own HTML document. We will then scrape these HTML documents using the programming language Python, which I have blogged about before. The Python library that will allow us to scrape the information is Beautiful Soup.

The “-c” flag adds a bunch of inline CSS properties to these DIVs based on where they are on the page. These inline properties are important because they help us get the information we want off the spreadsheet.

All dates and times, for instance, are located in the second column. As a result, all the dates and times have the exact same inline left CSS property of “107” because they are all the same distance from the left side of the page.

The same goes for the dispositions. They are in the fifth column and are farther from the left side of the page so they have an inline left CSS property of “677.”

We use these properties to find the columns of information we want. The first thing we want is the dates. With our Python scraper, we’ll grab all the data in the second column, which is all the DIVs that have an inline left CSS property of “107.”

We then have a second argument that uses regular expressions to make sure the data is in the correct format, i.e. numbers and not letters. We do this to make sure we are pulling dates and not text accidentally.

The second argument is basically an insurance policy. Everything we pull with the CSS property of “107” should be a date. But we want to be 100% sure, so we’ll use regular expressions to make sure we’re getting numbers and not a string of text.

The third column is the reported crimes. But in our converted HTML document, crimes are actually located in the DIV previous to the date + time DIV. So once we have grabbed a date + time DIV with our Python scraper, we will check the previous DIV to see if it matches one of the seven crimes we are going to map. For this project, we decided not to map minor reports like business checks and traffic stops. Instead we are mapping the seven most serious reports.

If it is one of our seven crimes, we will run one final check to make sure it’s not a cancelled call, an unfounded call, etc. We do this by checking the disposition DIVs (column five in the spreadsheet), which are located before the crime DIVs. Also remember that all these have an inline left CSS property of “677”.

So we check these DIVs with our dispositions to make sure they don’t contain words like “NOT NEEDED” or “NO REPORT” or “CALL CANCELLED.”

Once we know it’s a crime that fits into one of our seven categories and it wasn’t a cancelled call, we add the crime, the date, the time, the disposition and the location to a CSV spreadsheet.
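Here’s a simplified sketch of that scraping logic using the bs4 flavor of BeautifulSoup. The exact pixel offsets, date format, file name and crime list are illustrative placeholders; the full scraper linked below handles the real details, including the disposition check:

import csv
import re

from bs4 import BeautifulSoup

# Crimes we want to keep and the date pattern we expect in column two
CRIMES = ('ASSAULT', 'BURGLARY', 'ROBBERY', 'THEFT', 'ARSON', 'SHOTS FIRED')
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}')

with open('calls_for_service-1.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

divs = soup.find_all('div')
rows = []

for i, div in enumerate(divs):
    style = div.get('style', '')

    # Column two (date and time) sits 107 pixels from the left of the page
    if 'left:107px' not in style:
        continue

    # Insurance policy: make sure this really is a date, not stray text
    if not DATE_RE.search(div.text):
        continue

    # The reported crime lives in the DIV just before the date/time DIV
    crime = divs[i - 1].text.strip()
    if not any(c in crime.upper() for c in CRIMES):
        continue

    # The real scraper also checks the disposition column (left offset "677")
    # here to throw out cancelled, unfounded and similar calls
    rows.append([crime, div.text.strip()])

with open('crimes.csv', 'w') as out:
    csv.writer(out).writerows(rows)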

The full Python scraper is available here.

2. Using Google to get latitude, longitude and JSON

The mapping service I used was Leaflet, as opposed to Google Maps. But we will need to geocode our addresses to get latitude and longitude information for each point to use with Leaflet. We also need to convert our spreadsheet into a Javascript object file, also known as a JSON file.

Fortunately that is an easy and quick process thanks to two gadgets available to us using Google Docs.

The first thing we need to do is upload our CSV to Google Docs. Then we can use this gadget to get latitude and longitude points for each address. Then we can use this gadget to get the JSON file we will use with the map.

3. Powering the map with Leaflet, jQRangeSlider, DataTables and Bootstrap

As I mentioned, Leaflet powers the map. It uses the latitude and longitude points from the JSON file to map our crimes.

For this map, I created my own icons. I used a free image editor known as Seashore, which is a fantastic program for those who are too cheap to shell out the dough for Adobe’s Photoshop.

The date range slider below the map is a very awesome tool called jQRangeSlider. Basically every time the date range is moved, a Javascript function is called that will go through the JSON file and see if the crimes are between those two dates.

This Javascript function also checks to see if the crime has been selected by the user. Notice on the map the check boxes next to each crime logo under “Types of Crimes.”

If the crime is both between the dates on the slider and checked by the users, it is mapped.

While this is going on, an HTML table of this information is being created below the map. We use another awesome tool called DataTables to make that table of crimes interactive. With it, readers can display up to 100 records on the page or search through the records.

Finally, we create a pretty basic bar chart using the Progress Bars made available by Bootstrap, an awesome front-end framework released by the people who brought us Twitter.

Creating these bars is easy: We just need to create DIVs and give them a certain class so Bootstrap knows how to style them. We create a bar for each crime that is automatically updated when we tweak the map.

For more information on progress bars, check out the documentation from Bootstrap. I also want to thank the app team at the Chicago Tribune for providing the inspiration behind the bar chart with their 2012 primary election app.

The full Javascript file is available here.

4. Daily upkeep

This map is not updated automatically so every day, Monday through Friday, I will be adding new crimes to our map.

Fortunately, this only takes about 5-10 minutes of work. Basically I scrape the last few pages of the police’s crime log PDF, pull out the crimes that are new, pull them into Google Docs, get the latitude and longitude information, output the JSON file and put that new file into our FTP server.

Trust me, it doesn’t take nearly as long as it sounds to do.

5. What’s next?

Besides minor tweaks and possible design improvements, I have two main goals for this project in the future:

A. Create a crime map for Cedar Falls – Cedar Falls is Waterloo’s sister city and, like the Waterloo police department, the Cedar Falls police department keeps a daily log of calls for service. They also post PDFs, so I’m hoping the process of pulling out the data won’t be drastically different from what I did for the Waterloo map.

B. Create a mobile version for both crime maps – Maps don’t work tremendously well on mobile phones. So I’d like to develop some sort of alternative for mobile users. Fortunately, we have all the data. We just need to figure out how best to display it for smartphones.

Have any questions? Feel free to e-mail me at chris.essig@wcfcourier.com.

Turning Blox assets into timelines: Part 2


Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Also note: You will need to run on the Blox CMS for this to work. That said you could probably learn a thing or two about webscraping even if you don’t use Blox.

For part one of this tutorial, click here. For part three, click here.

 

In my last post, I discussed how you can turn Blox assets into a timeline using a tool made available by ProPublica called TimelineSetter.

If you recall, most of the magic happens with a little Python script called Timeline.py. It scrapes information from a page and puts it into a CSV file, which can then be used with TimelineSetter.

So what’s behind this Timeline.py file? I’ll go through the code by breaking it down into chunks. The full code is here and is heavily commented to help you follow along.

(NOTE: This python script is based off this tutorial from BuzzData. You should definitely check it out!)

– The first part of the script is basically the preliminary work. We’re not actually scraping the web page yet. This code first imports the necessary libraries for the script to run. We are using a Python library called BeautifulSoup that was designed for web scraping.

We then create a CSV to put the data in with the open function and write an initial header row to the CSV file with the write method. Also be sure to enter the URL of the page you want to scrape.

Note: For now, ignore the line “now = datetime.datetime.now().” We will discuss it later.

import urllib2
from BeautifulSoup import BeautifulSoup
import datetime
import re

now = datetime.datetime.now()

# Create a CSV where we'll save our data. See further docs:
# http://propublica.github.com/timeline-setter/#csv
f = open('timeline.csv', 'w')

# Make the header rows. These are based on headers recognized by TimelineSetter.
f.write("date" + "," + "description" + "," + "link" + "," + "html" + "\n")

# URL we will scrape
url = 'http://wcfcourier.com/test/scrape/dunkerton/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

– Before we go any further, we need to look at the page we are scraping, which in this example is this page. It’s basically a running list of articles about a particular subject. All of these stories will go on the timeline.

Now we’ll ask: what do we actually want to pull from this page? For each article we want to pull: the headline, the date, the photo, the first paragraph of the story and the link to the article.

Now we need to become familiar with the HTML of the page so we can tell BeautifulSoup what HTML attributes we want to pull from it. Go ahead and open the page up and view its source (Right click > View page source for Firefox and Chrome users).

One of the easiest things we can do is just search for the headline of the first article. So type in “Mayor’s arrest rattles Dunkerton.” This will take us to the chunk of code for that article. You’ll notice how the headline and all the other attributes for the story are contained in a DIV with the class “story-block.’

All stories on this page are formatted the same so every story is put into a DIV with the class ‘story-block.’ Thus, the number of DIVs with the class ‘story-block’ is also equal to the number of articles on the page we want to scrape.

– For the next line of code, we will put all of those ‘story-block’ DIVs into a variable called ‘events.’ The line after that is what is known as a ‘for loop.’ Together, these two lines tell BeautifulSoup to run the ‘for loop’ once for every article on the page.

So if we have five articles we want to scrape, the ‘for loop’ will run five times. If we have 25 articles, it will run 25 times.

events = soup.findAll('div', attrs={'class': 'story-block'})
for x in events:

– Inside the ‘for loop,’ we need to tell it what information from each article we want to pull. Now go back to the source of the page we are scraping and find the headline, the date, the photo, the first paragraph of the story and the link to the article. You should see that:

  • The date is in a paragraph tag with the class ‘story-more’
  • The link appears several times, including within a tag called ‘fb:like,’ which is the Facebook like button people can click to share the article on Facebook.
  • The headline is in a h3 tag, which is a header tag.
  • The first few paragraphs of the story are contained within a DIV with the id ‘blox-story-text.’ Note: In the Python script, we will tell BeautifulSoup to pull only the first paragraph.
  • The photo is contained within an img tag, which shouldn’t be a surprise.

So let’s put all of that in the ‘for loop’ so it knows what we want from each article. The code below uses BeautifulSoup syntax, which you can find out about by reading their documentation.

    # Information on the page that we will scrape
    date = x.find('p', attrs={'class': 'story-more'})('em')
    link = x.find('fb:like')['href']
    headline = x.find('h3').text
    description = x.find('div', attrs={'id': 'blox-story-text'})('p', limit=1)
    image = x.find('img')

One note about the above code: The ‘x’ refers to whichever article the ‘for loop’ is currently on. For example, say we want to scrape 20 articles. The first time we run the ‘for loop,’ ‘x’ will be the first article’s DIV. The second time through, ‘x’ will be the second article’s DIV. The last time through, it will be the 20th.

We use ‘x’ so we pull information from a different article each time we go through the ‘for loop.’ The first time through, we pull information from the first article because ‘x’ is the first article’s DIV. The second time through, we pull information from the second article because ‘x’ is the second article’s DIV.

Without the ‘for loop’ handing us a new ‘x’ each time, we’d just pull the same information from the same article over and over. The ‘x’ in combination with the ‘for loop’ basically tells BeautifulSoup to start with one article, then move on to the next and then the next and so on until we’ve scraped all the articles we want to scrape.
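From there, each pass through the ‘for loop’ ends by writing the article’s fields to the CSV as a single row. The full Timeline.py file linked above has the author’s actual version; a bare-bones illustration might look something like this:

    # Still inside the 'for' loop: write one row per article to the CSV.
    # This is a stripped-down illustration; part three of this series covers
    # cleaning these values up (stripping tags, reformatting dates) so the
    # timeline looks nice.
    f.write(str(date[0]) + "," + str(description[0]) + "," + link + "," + str(image) + "\n")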

– Now you should be well on your way to creating timelines with Blox assets. For the third and final part of this tutorial, we will just clean up the data a little bit so it looks nice on the page. Look for the final post of this series soon!

Written by csessig

March 7, 2012 at 2:21 pm

A few reasons to learn the command line


Note: This is my first entry for Lee Enterprises’ data journalism blog. Reporters at Lee newspapers can read the blog by clicking here.

As computer users, we have grown accustomed to what is known as the graphical user interface (GUI). What’s GUI, you ask? Here are a few examples: When you drag and drop a text document into the trash, that’s GUI in action. Or when you create a shortcut on your desktop, that’s GUI in action. Or how about simply navigating from one folder to the next? You guessed it: that’s GUI in action.

GUI, basically, is the process of interacting with images (most notably icons on computers) to get things done on electronic devices. It’s easy and we all do it all the time. But there is another way to accomplish many tasks on a computer: using text-based commands. Welcome to the command line.

So where do you enter these text-based commands and accomplish these tasks? There is a nifty little program called the Terminal on your computer that does the work. If you’ve never opened up your computer’s Terminal, now would be a good time. On my Mac, it’s located in the Applications > Utilities folder.

A scary black box will open up. Trust me, I know: I was scared of it just a few months ago. But I promise there are compelling reasons for journalists to learn the basics of the command line. Here are a few:

 

1. Several programs created by journalists for journalists require the command line.

Two of my favorite tools out there for journalists come from ProPublica: TimelineSetter and TableSetter.

The first makes it easy to create timelines. We’ve made a few at the Courier. The second makes easily searchable tables out of spreadsheets (more specifically, CSV files), which we’ve also used at the Courier. But to create the timelines and tables, you’ll need to run very basic commands in your Terminal.

It’s worth noting the LA Times also has its own version of TableSetter called TableStacker that offers more customizations. We used it recently to break down candidates running in our local elections. Again, these tables are created after running a simple command.

The New York Times has a host of useful tools for journalists. Some, like Fech, require the command line to run. Fech can help journalists extract data from the Federal Election Commission to show who is spending money on whom in the current presidential campaign cycle.

 

2. Other programs/tools that journalists can use:

Let’s say you want to pull a bunch of information from a website to use in a story or visualization, but copying and pasting the text is not only tedious but very time-consuming.

Why not just scrape the data using a program made in a language like Python or Ruby and put it in a spreadsheet or Word document? After all, computers are great at performing tedious tasks in just a few minutes.

One of my favorite web scraping walkthroughs comes from BuzzData. It shows how to pull water usage rates for every ward in Toronto and can easily be applied to other scenarios (I used it to pull precinct information from the Iowa GOP website). The best way to run this program and scrape the data is to run it through your command line.

Another great walkthrough on data scraping is this one from ProPublica’s Dan Nguyen. Instead of using the Python programming language, like the one above, it uses Ruby. But the goal remains the same: making data manageable for both journalists and readers.

A neat mapping service that is gaining popularity at news organizations is TileMill. Here are a few examples to help get you motivated.

One of the best places to start with TileMill is this walkthrough from the application team at the Chicago Tribune. But beware: you’ll need to know the command line before you start (trust me, I learned the hard way).

 

3. You’ll impress your friends because the command line kind of looks like the Matrix

And who doesn’t want that?

 

Okay I’m sort of interested…How do I learn?

I can’t tell people enough how much these two command line tutorials from PeepCode helped me. I should note that each costs $12 but is well worth it, in my opinion.

Also, there is this basic tutorial from HypeXR that will help. And these shortcuts from LifeHacker are also great.

Otherwise, a quick Google or YouTube search will turn up thousands of results. A lot of programmers use the command line and, fortunately, many are willing to help teach others.

Written by csessig

January 31, 2012 at 9:21 am

ProPublica to the rescue


ProPublica, known for producing excellent investigative journalism, also has a wonderful staff of developers who have put out several tools to help fellow journalists like myself. Here’s a quick run-through of two of their tools I’ve used in the last month.

TimelineSetter – I’m a big fan of timelines. They can help newspapers show the passage of time, obviously, as well as keep their stories on a particular subject in one central location. This is exactly how we used ProPublica’s TimelineSetter tool when ‘Extreme Makeover: Home Edition’ announced they were going to build a new home for a local family. From the print side, we ran several stories, including about one a day during the week of the build. The photo department also put out four photo galleries on the build and a fundraiser. Finally, our videographer shot several videos. Our audience ate up our coverage, racking up more than 100,000 page views on the photo galleries alone. But unless you wanted to attach every story, gallery and video to any new story we did (which would be both cumbersome and unattractive), it would have been hard to get a full scope of our coverage. That’s where the ProPublica tool came into play. Simply put, it helped compile all of our coverage of the event on one page.

I’m not going to go into detail on how I put together the timeline. Instead, I will refer you to their fantastic and easy-to-use documentation. Best of all, the timeline is easy to customize and upload to your site. It’s also free, unlike the popular timeline-maker Dipity. Check it out!

TableSetter – This tool is equally impressive and fairly easy to use. The final output is basically an easy-to-navigate-and-sort spreadsheet. And, again, the documentation is comprehensive. Run through it and you’ll have a sorted table up in no time! I’ve put together two already in the last week or so.

The first is a list of farmers markets in Iowa, with links to their home page (if available) and a Google map link, which was formatted using a formula in Microsoft Excel. The formula for the first row looked like this: =CONCATENATE("http://www.google.com/maps?q=", " ", B2, " ", C2, " ", E2, " ", "Iowa")

The first part is the Google Map link, obviously. B2 represented the cell with the address; C2 = City; E2 = Zip code and finally “Iowa” so Google Maps knows where to look. In between each field I put in a space so Google can read the text and try to map it out using Google Maps (I should note that not every location was able to be mapped out). Then I just copied and pasted this for every row in the table. At this point, I had a standard XLS Excel file, which I saved as a CSV file. TableSetter uses that CSV file and formats it using a YML file to produce the final output. Go and read the docs… It’s easier than having me try to explain it all. Here’s what my CSV looked like; here’s my YML file; and finally the table, which was posted on our site.
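For what it’s worth, the same link-building step can also be scripted in Python instead of Excel, which makes it painless to redo if the spreadsheet changes (the file and column names here are hypothetical):

import csv

# Add a Google Maps link column to a (hypothetical) farmers market CSV
with open('markets.csv') as f, open('markets_with_links.csv', 'w') as out:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ['map_link'])
    writer.writeheader()
    for row in reader:
        # Mirror the CONCATENATE formula: URL, then address, city, zip, "Iowa"
        row['map_link'] = 'http://www.google.com/maps?q=' + ' '.join(
            [row['address'], row['city'], row['zip'], 'Iowa'])
        writer.writerow(row)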

In the same vein, I put together this table on what each state department is requesting from the government in the upcoming fiscal year.

I should also note here that the Data Desk at the LA Times has a variation of ProPublica’s TableSetter that offers more features (like embedding photos into the table, for instance). It’s called Table Stacker and works in a very similar fashion to TableSetter. I recommend checking it out after you get a feel for ProPublica’s TableSetter.

Learning the Command Line: Both of these tools require the use of the command line via the Terminal program installed on your computer. If you are deathly afraid of that mysterious black box like I was, I highly recommend watching PeepCode’s video introduction to the command line called “Meet the Command Line.” And when you’re done, go ahead and jump into their second part, called “Advanced Command Line.” Yes, they both cost money ($12 apiece), but there is so much information packed into each hour-long screencast that they are both completely worth it. I was almost instantly comfortable with the command line after watching both screencasts.

Written by csessig

October 21, 2011 at 3:23 pm