Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Spreadsheets’ Category

Turning Excel spreadsheets into searchable databases in five minutes

with 20 comments

Note: This is cross-posted from Lee’s data journalism blog and includes references to the Blox content management system, which is what Lee newspapers use. Reporters at Lee newspapers can read my blog over there by clicking here.

data_tables_screenshotHere’s a scenario I run into all the time: A government agency sends along a spreadsheet of data to go along with a story one of our reporters is working on. And you want to post the spreadsheet(s) online with the story in a reader-friendly way.

One way is to create a sortable database with the information. There are a couple of awesome options on the table put out by other news organizations. One is TableSetter, published by ProPublica, which we’ve used in the past. The other one is TableStacker, which is a spin off of TableSetter. We have also used this before.

The only problem is making these tables typically take a little bit of leg work. And of course, you’re on deadline.

Here’s one quick and dirty option:

1. First open the spreadsheet in Excel or Google Docs and highlight the fields you want to make into the sortable database.

2. Then go to this wonderful website called Mr. Data Converter and paste the spreadsheet information into the top box. Then select the output as “HTML.”

3. We will use a service called DataTables to create the sortable databases. It’s a great and easy jQuery plugin to create the sortable tables.

4. Now create an HTML asset in Blox and paste in this DataTables template below:

Note: I’ve added some CSS styling to make the tables look better.

<html>
<head>
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_page.css">
<link rel="stylesheet" type="text/css" href="http://wcfcourier.com/app/special/data_tables/media/css/demo_table.css">

<style>
table {
	font-size: 12px;
	font-family: Arial, Helvetica, sans-serif;
float: left
}
table th, table td {
    text-align: center;
}

th, td {
	padding-top: 10px;
	padding-bottom: 10px;
	font-size: 14px;
}

label {
	width: 100%;
	text-align: left;
}

table th {
	font-weight: bold;
}

table thead th {
    vertical-align: middle;
}

label, input, button, select, textarea {
    line-height: 30px;
}
input, textarea, select, .uneditable-input {
    height: 25px;
    line-height: 25px;
}

select {
    width: 100px;
}

.dataTables_length {
    padding-left: 10px;
}
.dataTables_filter {
	padding-right: 10px;
}

</style>

<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.js"></script>
<script type="text/javascript" language="javascript" src="http://wcfcourier.com/app/special/data_tables/media/js/jquery.dataTables.min.js"></script>

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>
</head>

<body>

<--- Enter HTML table here --->

</body>

</html>

– This will link the page to the necessary CSS spreadsheets and Javascript files to get the DataTable working. The other option is go to the DataTable’s website and download the files yourself and post them on your own server, then link to those files instead of the ones hosted by WCFCourier.com.

5. Where you see the text “<— Enter HTML table here —>,” paste in your HTML table from Mr. Data Converter.

6. The last thing you will need to do is create an “id” for the table and link that “id” to the DataTable’s plugin. In the example above, the “id” is “five_year.” It is noted in this line of code in the DataTable template:

<script type="text/javascript" charset="utf-8">
$(document).ready(function() {
$('#five_year').dataTable({
"iDisplayLength": 25
});
});
</script>

– The header on your HTML table that you post into the template will look like so:

<table id="five_year" style="width: 620px;">
  <thead>
    <tr>
      <th class="NAME-cell">NAME</th>
      <th class="2008 Enrollment-cell">2008 Enrollment</th>
      <th class="2012 Enrollment-cell">2012 Enrollment</th>
      <th class="Increase/Decrease-cell">Increase/Decrease</th>
      <th class="Percent Increase/Decrease-cell">Percent Increase/Decrease</th>
    </tr>
  </thead>

– Here’s an live example of two sortable tables. The first table has an “id” of “five_year.” The second has an “id” of “one_year.” The full code for the two tables is available here.

– As an alternative, you can use a jQuery plugin called TableSorter (not to be confused with the TableSorter project mentioned above). The process of creating the table is very similar.

7. That’s it! Of course, DataTables provides several customization options that are worth looking into if you want to make your table look fancier.

Advertisements

Written by csessig

January 3, 2013 at 4:34 pm

How We Did It: Waterloo crime map

with 3 comments

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

Last week we launched a new feature on the Courier’s website: A crime map for the city of Waterloo that will be updated daily Monday through Friday.

The map uses data provided by the Waterloo police department. It’s presented in a way to allow readers to make their own stories out of the data.

(Note: The full code for this project is available here.)

Here’s a quick run-through of what we did to get the map up and running:

1. Turning a PDF into manageable data

The hardest part of this project was the first step: Turning a PDF into something usable. Every morning, the Waterloo police department updates their calls for service PDF with the latest service calls. It’s a rolling PDF that keeps track of about a week of calls.

The first step I took was turning the PDF into a HTML document using the command line tool PDFtoHTMLFor Mac users, you can download it by going to the command line and typing in “brew install pdftohtml.” Then run “pdftohtml -c (ENTER NAME OF PDF HERE)” to turn the PDF into an HTML document.

The PDF we are converting is basically a spreadsheet. Each cell of the spreadsheet is turned into a DIV with PDFtoHTML. Each page of the PDF is turned into its own HTML document. We will then scrape these HTML documents using the programming language Python, which I have blogged about before. The Python library that will allow us to scrape the information is Beautiful Soup.

The “-c” command adds a bunch of inline CSS properties to these DIVs based on where they are on the page. These inline properties are important because they help us get the information off the spreadsheet we want.

All dates and times, for instance, are located in the second column. As a result, all the dates and times have the exact same inline left CSS property of “107” because they are all the same distance from the left side of the page.

The same goes for the dispositions. They are in the fifth column and are farther from the left side of the page so they have an inline left CSS property of “677.”

We use these properties to find the columns of information we want. The first thing we want is the dates. With our Python scraper, we’ll grab all the data in the second column, which is all the DIVs that have an inline left CSS property of “107.”

We then have a second argument that uses regular expressions to make sure the data is in the correct format i.e. numbers and not letters. We do this to make sure we are pulling dates and not text accidently.

The second argument is basically an insurance policy. Everything we pull with the CSS property of “107” should be a date. But we want to be 100% so we’ll make sure it’s integers and not a string with regular expressions.

The third column is the reported crimes. But in our converted HTML document, crimes are actually located in the DIV previous to the date + time DIV. So once we have grabbed a date + time DIV with our Python scraper, we will check the previous DIV to see if it matches one of the seven crimes we are going to map. For this project, we decided not to map minor reports like business checks and traffic stops. Instead we are mapping the seven most serious reports.

If it is one of our seven crimes, we will run one final check to make sure it’s not a cancelled call, an unfounded call, etc. We do this by checking the disposition DIVs (column five in the spreadsheet), which are located before the crime DIVs. Also remember that all these have an inline left CSS property of “677”.

So we check these DIVs with our dispositions to make sure they don’t contain words like “NOT NEEDED” or “NO REPORT” or “CALL CANCELLED.”

Once we know it’s a crime that fits into one of our seven categories and it wasn’t a cancelled call, we add the crime, the date, the time, the disposition and the location to a CSV spreadsheet.

The full Python scraper is available here.

2. Using Google to get latitude, longitude and JSON

The mapping service I used was Leaflet, as opposed to Google Maps. But we will need to geocode our addresses to get latitude and longitude information for each point to use with Leaflet. We also need to convert our spreadsheet into a Javascript object file, also known as a JSON file.

Fortunately that is an easy and quick process thanks to two gadgets available to us using Google Docs.

The first thing we need to do is upload our CSV to Google Docs. Then we can use this gadget to get latitude and longitude points for each address. Then we can use this gadget to get the JSON file we will use with the map.

3. Powering the map with Leaflet, jQRangeSlider, DataTables and Bootstrap

As I mentioned, Leaflet powers the map. It uses the latitude and longitude points from the JSON file to map our crimes.

For this map, I created my own icons. I used a free image editor known as Seashore, which is a fantastic program for those who are too cheap to shell out the dough for Adobe’s Photoshop.

The date range slider below the map is a very awesome tool called jQRangeSlider. Basically every time the date range is moved, a Javascript function is called that will go through the JSON file and see if the crimes are between those two dates.

This Javascript function also checks to see if the crime has been selected by the user. Notice on the map the check boxes next to each crime logo under “Types of Crimes.”

If the crime is both between the dates on the slider and checked by the users, it is mapped.

While this is going on, an HTML table of this information is being created below the map. We use another awesome tool called DataTables to make that table of crimes interactive. With it, readers can display up to a 100 records on the page or search through the records.

Finally, we create a pretty basic bar chart using the Progress Bars made available by Bootstrap, an awesome interface released by the people who brought us Twitter.

Creating these bars are easy: We just need to create DIVs and give them a certain class so Bootstrap knows how to style them. We create a bar for each crime that is automatically updated when we tweak the map

For more information on progress bars, check out the documentation from Bootstrap. I also want to thank the app team at the Chicago Tribune for providing the inspiration behind the bar chart with their 2012 primary election app.

The full Javascript file is available here.

4. Daily upkeep

This map is not updated automatically so every day, Monday through Friday, I will be adding new crimes to our map.

Fortunately, this only takes about 5-10 minutes of work. Basically I scrape the last few pages of the police’s crime log PDF, pull out the crimes that are new, pull them into Google Docs, get the latitude and longitude information, output the JSON file and put that new file into our FTP server.

Trust me, it doesn’t take nearly as long as it sounds to do.

5. What’s next?

Besides minor tweaks and possible design improvements, I have two main goals for this project in the future:

A. Create a crime map for Cedar Falls – Cedar Falls is Waterloo’s sister city and like the Waterloo police department, the Cedar Falls police department keeps a daily log of calls for service. They also post PDFs, so I’m hoping the process of pulling out the data won’t be drastically different that what I did for the Waterloo map.

B. Create a mobile version for both crime maps – Maps don’t work tremendously well on the mobile phone. So I’d like to develop some sort of alternative for mobile users. Fortunately, we have all the data. We just need to figure out how to display it best for smartphones.

Have any questions? Feel free to e-mail me at chris.essig@wcfcourier.com.

Spreadsheet tutorials to help journalists clean and format data

leave a comment »

Note: This is cross-posted from Lee’s data journalism blog. Reporters at Lee newspapers can read my blog over there by clicking here.

I was talking to colleague the other day who was interested in getting into computer programming and more data projects. He asked where the best place to start was.

My gut reaction was to tell him to learn the basics of spreadsheets. Almost all of the data I have used with projects — whether they end up being a map or a graphic — were initially set up in spreadsheet form.

Most of us are familiar with spreadsheets and have likely worked with Microsoft Excel. Agencies both big and small often gather spreadsheets of useful information for us to use with our stories.

The problem is most of the time the data isn’t formatted right or contains inaccuracies like misspellings. That is where the magic of spreadsheet formulas can come in to help organize your data.

Here’s a few resources you might find handy for working with spreadsheets.

1. Poynter: How journalists can use Excel to organize data for stories

This walkthrough is great because it is written directly for journalists. It is also intended for beginners so those with no spreadsheet knowledge will be able to keep up. Finally, the walkthrough is intended for both print and online journalists. So journalists who have no intention of making visualizations will still find a handful of features in Excel to help them with their day-to-day reporting.

2. My Favorite (Excel) Things

This PDF provided by Mary Jo Webster at the St. Paul Pioneer Press is a great, concise list of Excel formulas she uses all the time. It includes formulas on how to format dates, run sums and even run if-else statements in Excel. It’s one of my favorite resources for Excel and definitely worth bookmarking.

3. Google Docs – Spreadsheet resources

Not a fan of Excel or don’t have it installed on your work computer? Fortunately you can make a spreadsheet with Google by logging into Google Drive and clicking “Create > Spreadsheet.” The best part is spreadsheets you create with Google can be accessed from any computer as long as you log in with your Google account.

Here’s some resources for getting started with Google spreadsheets:

4. Google Refine resources

Sometimes you need more than just spreadsheet formulas to clean dirty data. That’s where the powerful Google Refine program can come into play. The program was designed to clean dirty data by finding inconsistencies in your spreadsheets. It can also help you sort data, add to data, transform it from one service to another and much, much more.

Here’s some resources you might find handy:

 5. NICAR-L e-mail list

Still stuck? Fortunately, there is a wonderful community of computer-assisted reporters who are more than willing to help others out. If you want great information on spreadsheets or any other data journalism topic, check out the National Institute of Computer-Assisted Reporting’s email list. Questions on Excel come up almost every day.

Have any other useful resources not listed here? E-mail me at chris.essig@wcfcourier.com.

Written by csessig

June 25, 2012 at 9:37 am