Chris Essig

Walkthroughs, tips and tricks from a data journalist in eastern Iowa

Archive for the ‘Data journalism’ Category

How automation can save you on deadline

leave a comment »

In the data journalism world, data dumps are pretty common occurrences. And often times these dumps are released periodically — every month, every six months, every year, etc.

Despite their periodic nature, these dumps can catch you off guard. Especially if you are working on a million things at once (but nobody does that, right?).

ss

Case in point: Every year the Centers for Medicare & Medicaid Services releases data on how much money doctors received from medical companies. If I’m not mistaken, the centers release the data now because of the great investigative work of ProPublica. The data they release is important and something our readers care about.

Last year, I spent a ton of time parsing the data to put together a site that shows our readers the top paid doctors in Iowa. We decided to limit the number of doctors shown to 100, mostly because we ran out of time.

In all, I spent at least a week on the project last year. But this year, I was not afforded that luxury because the data dump caught us off guard. Instead of a few weeks, we had a few days to put together the site and story.

Fortunately, I’ve spent quite a bit of time in the last year moving away from editing spreadsheets in Excel or Google Docs to editing them using shell scripts.

The problem with GUI editors is they aren’t easily reproducible. If you make 20 changes to a spreadsheet in Excel one year and then the next year you get an updated version of that same spreadsheet, you have to make those exactly same 20 changes again.

That sucks.

Fortunately with tools like CSVKit and Agate, you can write A LOT of Excel-like functions in a shell script or a Python script. This allows you to document what you are doing so you don’t forget. I can’t tell you how many times I’ve done something in Excel, saved the spreadsheet, then came back to it a few days later and completely forgotten what the hell I was doing.

It also makes it so you can automate your data analysis by allowing you to re-run the functions over and over again. Let’s say you perform five functions in Excel and trim 20 columns from your spreadsheet. Now let’s say you want to do that again, either because you did it wrong the first time or because you received new data. Now would you rather run a shell script that takes seconds to perform this task or do everything over in Excel?

Another nice thing about shell scripts is you can hook your spreadsheet into any number of data processing programs. Want to port your data into a SQLite database so you can perform queries? No problem. You can also create separate files for those SQL queries and run them in your shell script as well.

All of this came in handy when this year’s most paid doctors data was released. Sure, I still spent a stressful day editing the new spreadsheets. But if I hadn’t been using shell scripts, I doubt I would have gotten done by our deadline.

Another thing I was able to do was increase the number of doctors listed online. We went from 100 total doctors to every doctor who received $500 or more  (totaling more than 2,000 doctors). This means it’s much more likely that readers will find their doctors this year.

The project is static and doesn’t have a database backend so I didn’t necessarily need to limit the number of doctors last year. We just ran out of time. This year, we were able to not only update the app but give our readers much more information about Iowa doctors. And we did it in a day instead of a week.

The project is online here. The code I used to edit the 15+ gigabytes worth of data from CMS is available here.

 

 

Written by csessig

September 7, 2016 at 4:59 pm

Leaflet formula: Counting markers within a radius

with 5 comments

ssThe last project I worked on at The Courier was a comprehensive guide to mental health and disability services in NE Iowa. For the project, we first wanted to explain the MHDS system to our readers and how it works. It’s a relatively new system with a lot of moving parts and no good way to search for services. We really wanted to help residents and families in need of these services, so we designed everything — from the wording we used to the way items were laid out — to be focused on our readers.

After laying out the details of the new system, we wanted to allow our readers to search for services near them. A map seemed like an obvious choice; but we wanted to do more than just have them search for their address and plot a dot on the map.

(As a side note, we used Ben Welsh‘s jquery-geocodify plugin to geocode the addresses because it’s awesome.)

Instead, we wanted to show readers how many providers were close to them, as well as how to get ahold of the service provider and directions to the facility. For the former task, we used the following process:

First, the providers are plotted on the map. Then, we ask the reader to type in their address, which is plotted on the map using jquery-geocodify. They also have a second option for radius, which will be used to determine how many providers are within whatever radius they have given. They have four options: 5, 10, 25 and 50 miles.

Then, as the markers are getting plotted on the map, we run a few Javascript functions to calculate how many markers on the map are within the radius.

I’ve pulled out just this part of the mental health project and open-sourced it. You can view the full code over at my Github page.

The demo uses data on mental health providers, which is contained in this JSON fileand plots them on the map based on each provider’s latitude and longitude coordinates.

Here’s a quick run through the Javascript that makes all this run. It’s contained within the project’s script.js file. We’ll start from the bottom of the file and move our way up.

Note: This walkthrough assumes a decent understanding of Leaflet. I’m also using Leaflet-Awesome Markers and Font Awesome in this project.

First, we’ll add the base map tiles to our map. We’ll also  add the json_group to the map, which is a FeatureGroup we will create later in the file:

// Base map
var layer = new L.StamenTileLayer('toner-background');
var map = new L.Map('map', {
    center: new L.LatLng(42,-93.3),
    minZoom: 4,
    maxZoom: 10,
    zoom: 6,
    keyboard: false,
    boxZoom: false,
    doubleClickZoom: false,
    scrollWheelZoom: false,
    maxBounds: [[33.154799,-116.586914],[50.190089,-77.563477]]
});
// Add base layer to group
map.addLayer(layer);
// Add our markers in our JSON file on the map
map.addLayer(json_group);

Next, we’ll add the code for jquery-geocodify. For the onSelect function, we will grab the location of reader’s address and then call a geocodePlaceMarkersOnMap function, which we will create later:

// jQuery Geocodify
var maxY = 43.749935;
var minY = 40.217754;
var minX = -96.459961;
var maxX = -90.175781;

var search_marker;
var search_icon = L.AwesomeMarkers.icon({
    icon: 'icon-circle',
    color: 'green'
});

$('#geocoder').geocodify({
    onSelect: function (result) {
        // Extract the location from the geocoder result
        var location = result.geometry.location;

        // Call function and place markers, circle on map
        geocodePlaceMarkersOnMap(location);
    },
    initialText: 'Zip code, city, etc...',
    regionBias: 'US',
    // Lat, long information for Cedar Valley enter here
    viewportBias: new google.maps.LatLngBounds(
        new google.maps.LatLng(40.217754, -96.459961),
        new google.maps.LatLng(43.749935, -90.175781)
    ),
    width: 300,
    height: 26,
    fontSize: '14px',
    filterResults: function (results) {
        var filteredResults = [];
        $.each(results, function (i, val) {
            var location = val.geometry.location;
            if (location.lat() > minY && location.lat() < maxY) {
                if (location.lng() > minX && location.lng() < maxX) {
                    filteredResults.push(val);
                }
            }
        });
        return filteredResults;
    }
});

Next, we’ll loop through our providers.json file, which has a variable of json_data, and create a marker for each object in our array. We’ll then add it to our json_group FeatureGroup, which is defined at the top of our script file:

// This loops through the data in our JSON file
// And puts it on the map
_.each(json_data, function(num) {
    var dataLat = num['latitude'];
    var dataLong = num['longitude'];

    // Add to our marker
    var marker_location = new L.LatLng(dataLat, dataLong);

    // Options for our circle marker
    var layer_marker = L.circleMarker(marker_location, {
        radius: 7,
        fillColor: "#984ea3",
        color: "#FFFFFF",
        weight: 1,
        opacity: 1,
        fillOpacity: 0.8
    });

    // Add events to marker
    layer_marker.on({
        // What happens when mouse hovers markers
        mouseover: function(e) {
            var layer_marker = e.target;
            layer_marker.setStyle({
                radius: 8,
                fillColor: "#FFFFFF",
                color: "#000000",
                weight: 1,
                opacity: 1,
                fillOpacity: 1
            });
        },
        // What happens when mouse leaves the marker
        mouseout: function(e) {
            var layer_marker = e.target;
            layer_marker.setStyle({
                radius: 7,
                fillColor: "#984ea3",
                color: "#FFFFFF",
                weight: 1,
                opacity: 1,
                fillOpacity: 0.8
            });
        }
    // Close event add for markers
    });

    json_group.addLayer(layer_marker);
// Close for loop
}, this);

The changeCircleRadius function is called anytime the radius is changed on the page. It first calls the pointsInCircle function, which we will define later, and then sets the radius of the circle using Leaflet’s setRadius function for circles.

Leaflet uses meters, so we will need to convert miles to meters using the milesToMeters function, which will be defined later. The mile’s value is contained within the DIV with the id of radius-selected. We’ll grab the value, which is the radius in miles, convert it to meters and then set the radius of the circle using Leaflet.

// Change circle radius when changed on page
function changeCircleRadius(e) {
    // Determine which geocode box is filled
    // And fire click event

    // This will determine how many markers are within the circle
    pointsInCircle(circle, milesToMeters( $('#radius-selected').val() ) )

    // Set radius of circle only if we already have one on the map
    if (circle) {
        circle.setRadius( milesToMeters( $('#radius-selected').val() ) );
    }
}

$('select').change(function() {
    changeCircleRadius();
});

The geocodePlaceMarkersOnMap function is called every time the reader enters an address into our geocoder and presses enter. It passes with it the location of their address, which we get from jquery-geocodify. With this function, we will create both a circle and a marker on top of the given address.

First, we set the view of the map based on this location and remove the circle on the map if there already is one. Then we actually create a circle and set how large it is based on the radius set by the user.

We then remove the marker if it’s already on the map and create a new one. When we create a new marker, we’ll set it to draggable and set its dragend method to reset the view of the map every time the reader drags the marker somewhere on the. It then calls the pointsInCircle function, which we will define next. Finally, we’ll call the same pointsInCircle function and add the marker to the map.

// This places marker, circle on map
function geocodePlaceMarkersOnMap(location) {
    // Center the map on the result
    map.setView(new L.LatLng(location.lat(), location.lng()), 10);

    // Remove circle if one is already on map
    if(circle) {
        map.removeLayer(circle);
    }
    
    // Create circle around marker with our selected radius
    circle = L.circle([location.lat(), location.lng()], milesToMeters( $('#radius-selected').val() ), {
        color: 'red',
        fillColor: '#f03',
        fillOpacity: 0.1,
        clickable: false
    }).addTo(map);
    
    // Remove marker if one is already on map
    if (search_marker) {
        map.removeLayer(search_marker);
    }
        
    // Create marker
    search_marker = L.marker([location.lat(), location.lng()], {
        // Allow user to drag marker
        draggable: true,
        icon: search_icon
    });

    // Reset map view on marker drag
    search_marker.on('dragend', function(event) {
        map.setView( event.target.getLatLng() ); 
        circle.setLatLng( event.target.getLatLng() );

        // This will determine how many markers are within the circle
        pointsInCircle( circle, milesToMeters( $('#radius-selected').val() ) );

        // Redraw: Leaflet function
        circle.redraw();

        // Clear out address in geocoder
        $('#geocoder-input').val('');
    });

    // This will determine how many markers are within the circle
    // Called when points are initially loaded
    pointsInCircle( circle, milesToMeters( $('#radius-selected').val() ) );

    // Add marker to the map
    search_marker.addTo(map);
        
// Close geocodePlaceMarkersOnMap
}

The pointsInCircle function is the last function called when one of two things happen: the user enters an address or the user drags the marker on the map. Its purpose is to find out how many markers on the map are within the circle we created and then display that number in the legend.

Fortunately, Leaflet has a really handy distanceTo method, which determines how many meters are within two points. It uses the two points’ latitude and longitude coordinates to find this out.

With the pointsInCircle function, we first capture the lat, long of the circle we’re putting on the map. We then loop through our providers.json file, find the latitude and longitude coordinates of each object and determine how far it is from our circle using the distanceTo method.

We then decide if that amount is higher or lower than the radius set by the user. If it is lower, we add it to a counter. When we’re done, that counter will equal the number of markers within the selected radius. It will be put on the map under the “Results” header.

Lastly, we will determine how many results we have and use the correct wording on the page. If we have one result, we’ll use the word “provider” (singular) in the map’s legend. Otherwise, we’ll use “providers.”

// This figures out how many points are within out circle
function pointsInCircle(circle, meters_user_set ) {
    if (circle !== undefined) {
        // Only run if we have an address entered
        // Lat, long of circle
        circle_lat_long = circle.getLatLng();

        // Singular, plural information about our JSON file
        // Which is getting put on the map
        var title_singular = 'provider';
        var title_plural = 'providers';

        var selected_provider = $('#dropdown_select').val();
        var counter_points_in_circle = 0;

        // Loop through each point in JSON file
        json_group.eachLayer(function (layer) {

            // Lat, long of current point
            layer_lat_long = layer.getLatLng();

            // Distance from our circle marker
            // To current point in meters
            distance_from_layer_circle = layer_lat_long.distanceTo(circle_lat_long);

            // See if meters is within raduis
            // The user has selected
            if (distance_from_layer_circle <= meters_user_set) {
                counter_points_in_circle += 1;
            }
        });

        // If we have just one result, we'll change the wording
        // So it reflects the category's singular form
        // I.E. facility not facilities
        if (counter_points_in_circle === 1) {
            $('#json_one_title').html( title_singular );
        // If not one, set to plural form of word
        } else {
            $('#json_one_title').html( title_plural );
        }
        
        // Set number of results on main page
        $('#json_one_results').html( counter_points_in_circle );
    }
// Close pointsInCircle
};

The final piece is to add the variables we’re using to the top of the script file. We’ll also add the milesToMeters function, which converts miles to meters.

// We'll append our markers to this global variable
var json_group = new L.FeatureGroup();
// This is the circle on the map that will be determine how many markers are around
var circle;
// Marker in the middle of the circle
var search_marker;
// Marker icon
var search_icon = L.AwesomeMarkers.icon({
    icon: 'icon-circle',
    color: 'red'
});


// Convert miles to meters to set radius of circle
function milesToMeters(miles) {
    return miles * 1069.344;
};

I hope this walkthrough helps fellow Leaflet users! If you have any questions, leave them in the comments.

Happy coding.

Written by csessig

June 22, 2014 at 1:32 pm

How laziness led me to produce news apps quicker

with one comment

Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

ssAn important trait for programmers to adopt, especially for those who are writing code on deadline, is laziness. While that sounds counter-productive, it can be incredibly beneficial.

When writing code, you  do a lot of repetitive tasks. If you’re using a Javascript library for mapping, for instance, you’ll need to copy and paste the Javascript and CSS files into a new directory every time you start a new project. If you want to use plugins with that map, you will need to include their files in the directory as well.

The more complicated your project, the more libraries you will likely use, which means a lot of tedious work putting all those files in the new directory.

And that’s before you even start to code. If you want your projects to have a similar look to them (which we do at the Courier), you will need to create code that you can to replicate in the future. And then every time you want to use it, you will need to rewrite the same code or, at the very least, copy code from your old projects and paste it into your new project.

I did this for a long time. Whenever I was creating  a new app, I’d open a bunch of projects I’ve already completed, go into their CSS and JS files, copy some of it and paste it into the new files. It was maddening because I knew with every new project, I would waste hours basically replicating work I had already done.

This is where laziness pays off. If you are too lazy to do the same things over and over again, you’ll start coming up with solutions that automatically does these repetitive tasks.

That’s why many news apps teams have created templates for their applications. When they create new projects, they start with this template. This allows their apps to have a similar look. It also means they don’t have to do the same things over and over again every time they are working with a project.

App templating is something I’ve dabbled with before. But those attempts were minor. For the most part, I did a lot of the same mind-numbing tasks every time I wanted to create a map, for instance.

Fortunately I found a good solution a few months ago: Yeoman, which is dubbed “the web’s scaffolding tool for modern web apps,” has helped me eliminate much of that boring, repetitive work.

Yeoman is built in Node. Before I stumbled upon it a few months ago, I knew very little about Node. I still don’t know a ton but this video helped me grasp the basics. One advantage of Node is it allows you to run Javascript on the command line. Pretty cool.

Yeoman is built on three parts. The first is yo, which helps create new applications and all the files that go with it. It also uses Grunt, which I have come to love. Grunt is used for running tasks, like minifying code, starting a local server so you test your code, copy files into new directories and a whole bunch of other things that makes coding quicker, easier and less repetitive.

Finally, Yeoman uses Bower, which is a directory of all the useful packages you use with applications, from CSS frameworks to Javascript mapping libraries. With a combination of Bower and Grunt, you can download the files you need for your projects and put them in the appropriate directories using a Grunt task. Tasks are written in a Gruntfile and run on the command line. The packages you want to download from Bower and use in your project are stored in bower.json file.

In Yeoman, generators are created. A generator makes it easy to create new apps using a template that’s already been created. You can either create your own generator or use one of the 400+ community generators. The generators themselves are directories of files, like  images, JS files, CSS stylesheets, etc. The advantage of using generators is you can create a stylesheet, for instance, save it in the generator directory and use that same stylesheet with every new project you create. If you decide to edit it, those changes will be reflected in every new project.

When you use a generator, those files are copied into a new directory, along with Grunt tasks, which are great for when you are testing and deploying your apps. Yeoman also has a built-in prompt system that asks you a series of questions whenever you create an app. This helps copy over the appropriate files and code, depending on what you are trying to make.

I decided I would take a stab at creating a generator. The code is here, along with a lot more explanation on how to use it. Basically, I created it so it would be easier to produce simple apps. Whenever I create a new Leaflet map, for instance, I can run this and a sample map with sample data will be outputted into a new directory. I also have a few options depending on how I want to style the map. I also have options for different data types: GeoJSON, JSON with latitude and longitude coordinates and Tabletop. What files are scaffolded out depends on how you responded to the prompts that are displayed when you first fire up the generator.

There is also an option for non-maps. Finally, you can use Handlebars for templating your data, if that’s your thing.

My hope is to expand upon this generator and create more options in the future. But for now, this is easy way to avoid a bunch of repetitive tasks. And thus, laziness pays off.

If you are interested in creating your own generator, Yeoman has a great tutorial that you should check out.

Written by csessig

February 19, 2014 at 12:08 pm

Using hexagon maps to show areas with high levels of reported crime

leave a comment »

Screenshot: Violent crime in WaterlooCrime maps are a big hit at news organizations these days. And with good reason: Good crime maps can tell your readers a whole lot about crime not only in their city but their backyard.

For the last year and a half, we’ve been doing just that. By collecting data every week from our local police departments, we’ve been able to provide our community a great, constantly updated resource on crime in Waterloo and Cedar Falls. And we’ve even dispelled some myths on where crime happens along the way.

The process, which I’ve blogged about in the past, has evolved since the project was first introduced. But basically it goes like this: I download PDFs of crime reports from the Waterloo and Cedar Falls police departments every week. Those PDFs are scraped using a Python script and exported into an CSV. The script pulls out reports of the most serious crimes and disregards minor reports like assistance calls.

The CSV is then uploaded to Google spreadsheets as a back up. We also use this website to get latitude, longitude coordinates from the addresses in the Google spreadsheet. Finally, the spreadsheet is converted into JSON using Mr. Data Converter and loaded onto our FTP server. We use Leaflet to map the JSON files.

We’ve broken the map into months and cities, meaning we have a map and a corresponding JSON file for each month for both Waterloo and Cedar Falls. We also have JSON files with all of the year’s crime reports for both Waterloo and Cedar Falls.

The resulting maps are great for readers looking for detailed information on crime in their neighborhood. It also provides plenty of filters so readers can see what areas have more shootings, assaults, burglaries, etc.

But it largely fails to give a good overview of what areas of the cities had the most reported crimes in 2013 for two reasons: 1) Because of the sheer number of crime reports (3,684 reports in 2013 for Waterloo alone) and the presence of multiple reports coming from the same address, it’s nearly impossible to simply look at the 2013 maps and see what areas of the cities had the most reports, and; 2) The JSON files containing all of the year’s reports runs well on modern browsers but runs more slowly outdated browsers (I’m looking at you, Internet Explorer), which aren’t equipped to iterate over JSON files with thousands of items.

As a result, I decided to create separate choropleth maps that serve one purpose: To show what areas of the cities had the most reported crimes. The cities would be divided into sections and areas with more crimes would be darker on the map, while the areas with fewer crimes would be lighter. Dividing the city into sections would also mean the resulting JSON file would have dozens of items instead of thousands. The map would serve a simple purpose and load quickly.

What would the areas of the cities actually look like? I’ve been fascinated with hex maps since I saw this gorgeous map from the L.A. Times on 9/11 response times. The map not only looked amazing but did an incredibly effective job displaying the data and telling the story. While we want our map to look nice, telling  its story and serving our journalistic purpose will always be our No. 1 goal. Fortunately, hex maps do both.

And even more fortunately, the process of creating hex maps got a whole lot easier when data journalist Kevin Schaul introduced a command line tool called binify that turns dot density shapefiles into shapefiles with hexagon shapes. Basically the map is cut into hexagon shapes, and all the points inside each hexagon are counted. Then each hexagon is given a corresponding count value for the number points that fall in between its borders.

I used binify to convert our 2013 crime maps into hex maps. As I mentioned earlier, we started out with CSVs with all the crimes. My workflow for creating the hex maps and then converting them into JSON files for Leaflet is in a gist below. It utilizes the fantastic command line tool og2ogr, which is used to convert map files, and binify.

# We start with a CSV called "crime_master_wloo_violent.csv" that has violent crime reports in Waterloo.
# Included in csv are columns for latitude ("lat") and longitude ("long"):
https://github.com/csessig86/crime_map2013/blob/master/csv/crime_master_wloo_violent.csv

# Command to turn CSV into DBF with the same name
ogr2ogr -f "ESRI Shapefile" crime_master_wloo_violent.dbf crime_master_wloo_violent.csv

# Create a file called "crime_master_wloo_violent.vrt"
# Change "SrcLayer" to the name of the source
# Change OGRVRTLayer name to "wloo_violent"
# Add name of lat, long columns to "GeometryField"
<OGRVRTDataSource>
  <OGRVRTLayer name="wloo_violent">
    <SrcDataSource relativeToVRT="1">./</SrcDataSource>
    <SrcLayer>crime_master_wloo_violent</SrcLayer>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>WGS84</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="long" y="lat"/>
  </OGRVRTLayer>
</OGRVRTDataSource>

# Create "violent" folder to stash shp file
mkdir violent

# Convert "vrt" file into shp and put it in our "violent" folder
ogr2ogr -f "ESRI Shapefile" violent crime_master_wloo_property.vrt

# Binify the "wloo_violent" shp into a shp called "wloo_violent_binify" with three options
# The first sets the number of hexagons across to 45
# The second sets the custom extent to a box around Waterloo
# Custom extent is basically the extent of the area that the overlay of hexagons will cover
# The third tells binify to ignore hexagons that have zero points
binify wloo_violent.shp wloo_violent_binify.shp -n 45 -E -92.431873 -92.260867 42.4097259 42.558403 --exclude-empty

# Convert binified shp into JSON file called "crime_data_review_violent"
ogr2ogr -f "GeoJSON" crime_data_review_violent.json wloo_violent_binify.shp

# Add this variable to JSON file
var crime_data_review_violent

# Here's the final files:
https://github.com/csessig86/crime_map2013/tree/master/shps/wloo_shps

# And our JSON file:
https://github.com/csessig86/crime_map2013/blob/master/JSON/crime_data_review_violent.json

# And the Javascript file that is used with Leaflet to run the map.
# Lines 113 through 291 are the ones related to pulling the data from the JSON files,
# coloring them, setting the legend and setting a mouseover for each hexagon
https://github.com/csessig86/crime_map2013/blob/master/js/script_map_review.js

# And the map online:
http://wcfcourier.com/app/crime_map2013/index_wloo.php

The above shows the process I took for creating our hex map of violent crimes in Waterloo. I duplicated this process three more times to get four maps in all. One showed property crimes in Waterloo, while the other two show violent and property crime in Cedar Falls.

* Note: For more information on converting shapefiles with ogr2ogr, check out this blog post from Ben Balter. Also, this cheat sheet of ogr2ogr command line tools was invaluable as well.

Another great resource is this blog post on minifying GeoJSON files from Bjørn Sandvik. While the post deals with GeoJSON files specifically, the process works with JSON files as well. Minifying my JSON files made them about 50-150 KB smaller. While it’s not a drastic difference, every little bit counts.

To get choropleth maps up and running in Leaflet, check out this tutorial. You can also check out my Javascript code for our crime maps.

One last note: The map is responsive. Check it out.

Written by csessig

January 11, 2014 at 4:42 pm

How we used a Google spreadsheet to power our election app

with one comment

Election 2013 app screen shot

Last Tuesday was election night in the Cedar Valley. While we didn’t have a huge number of contested races on the ballot, we did have a mayor’s race in Waterloo and Waverly, Iowa, as well as city council races in several area towns.

As NPR’s Jeremy Bowers proudly proclaimed, election nights are exciting times for news nerds. This election, we decided to do a little bit of experimenting.

Before the election, we posted biographical information for all the candidates running in a contested race in Waterloo, Cedar Falls and Waverly. We also promoted other races in smaller towns. This gave our readers a good overview of the races on the ballot and information on the candidates. And best of all, it was all in one place.

Three reporters were responsible for getting the biographical information for each candidate and entering it online. We used a Google spreadsheet to store the information. This allowed the reporters to enter the information online themselves. I could then go into the spreadsheet, edit the text and make sure it was formatted correctly. We then used Tabletop.js to import the data into our app and Handlebars.js to template it.

The basic setup for the app is available on my Github page. This is a very similar setup mentioned in my last blog post.

Before the app went live, I exported all the data in the Google spreadsheet into JSON format using a method mentioned here. I did this for two reasons: 1) It spend up the load time of the app because browsers didn’t have to connect to the spreadsheet, download the data, format the data into JSON (which is what Tabletop.js does with the data in this app) and then display it online using templates rendered in Handlebars.js. Instead, it’s already downloaded, formatted into JSON and ready to be templated. And: 2) Tabletop.js has a bug that may cause some readers not to see any of the data at all. I wanted to avoid this problem.

The night of the election, we wanted to update the election results live. And we wanted to use the same app to display the results. Fortunately, the process was easy to do: I added new columns in our spreadsheet for vote totals and precincts reported. The new data was then pulled into our app and displayed with simple bar charts.

To display the results live, we had to ditch the exported JSON data and use the Google spreadsheet to power the app. This allowed us to update the spreadsheet and have the results display live on our website. We had one reporter at the county courthouse who punched in numbers for our Waterloo and Cedar Falls races. We had another reporter in Waverly who punched in results for races in that town. And we had another reporter in the newsroom who was monitoring the races in rural towns and giving me updated election results to enter into the spreadsheet.

The workflow worked great. While most news outlets were waiting for the county websites to update with election results, we were able to display the results right away. Unfortunately, the Black Hawk County website never wound up working on election, making our app the only place readers could go to get election results.

Our effort paid off: The app received about 11,500 pageviews, with about 10,000 of those pageviews coming the night of election (about 7,000 pageviews) and the following day. The app was more popular than any single story on our website for the month of October and November.

Reporters at the party headquarters said many of the candidates first tuned into the local television station to get election results but quickly went to our webpage when they realized we were the only ones with updated results. In fact, Waterloo’s mayor found out he won his race by looking at our website. Here’s a photo of him checking out our website on election night.

Which leads me to my last point: You’ll notice how our mayor is checking out the results on a smartphone. Our app (like all of our apps) was responsively designed, which means it looks great on mobile phones. It’s critical that news producers make sure their apps work on mobile. It’s pretty much mandatory. Because as our mayor shows, people love checking the news on their phones.

Written by csessig

November 10, 2013 at 10:19 pm

How to: Using OpenRefine, Excel to clean messy, user-submitted data

with one comment

Note: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

We at The Courier ran into an interesting problem last week that I’m sure other news organizations run into as well.

Every year, The Courier puts on a Best of the Best competition. We select a series of categories like food, dining,  jewelry stores, etc. and ask our readers to pick the best business in town for each category.

The businesses with the most votes in each category are rewarded by being featured in the newspaper. They also get plaques they can hang at their business to show off their nifty award.

We have several forms that people use to submit their votes. They are allowed to type in their favorite businesses, instead of pick from a field of choices.

We store their votes in a database. We then spit out the data into a spreadsheet with all the votes. Each row is one vote. For week two, we had a spreadsheet with 3,000+ rows of entries.

To view that spreadsheet, click here. We have a row for entry id, category, entrant id and the actual entry. The “entry” field is what we will focus on throughout this walkthrough.

The problem: Everyone has their own spelling for businesses in the “entry” column. For instance, “Famous Footwear” was spelled “Famous Footwear, Cedar Falls” and “Famous Footware”, it sometimes was partially upper-cased and sometimes it was misspelled altogether.

How do we merge all those different spellings for businesses into one spelling and count them up?

We could do it by hand, of course, but that is painful, tedious and time-consuming. We could also pay a company to do it but that could be expensive. At least one company that we know of charges $4,000 for such serves.

Fortunately there is a better option out there that is FREE and not horribly time consuming.

refine_cluster_ss

Here’s how we cleaned this spreadsheet with user-submitted data:

1. First download a program called OpenRefine and open it up. This program was originally created by the fine people at Google and then later open sourced. Its purpose is to clean “dirty” data (like our user-submitted data). Their website has three videos on how to use the program that you should definitely check out before firing it up.

2. With OpenRefine open, create a new project and import in the spreadsheet we want to clean. Then find the column of data we want to clean. For us, all of our user-submitted businesses were in a column called “entry.”

3. In OpenRefine, there is a little arrow next to the column title (“entry”). Click that to see a list of options for cleaning the data. If you look under “Edit cells” and “Common transforms,” you’ll see a list of common edits like trimming the leading and trailing whitespace in the entries. So ” U.S. Cellular ” would be changed to “U.S. Cellular” (notice how the white space is removed).

4. The real magic, however, is in the cluster function. This uses algorithms to find different spellings for the same business and then allows you to pick ONE common spelling for ALL of them.

– This makes it much easier down the road to count how many times a particular business was voted on because we don’t have a wide range of spellings for that business. Instead, we can narrow down it to just a couple of spellings. Or if we are really lucky, we can convert all of the different spellings into one common spelling using the algorithms.

– To pull them up, click “Facet” then “Text facet”. That will pull up a list of unique entries in the “entry” column. If you click “count”, you can see how many times a business was entered. For instance, “Scheels” was voted on 115 times. Then click “Cluster”.

– Clustering is one of OpenRefine’s most powerful tools. When you open this section up, you’ll see a list of clusters with different spellings for what the algorithm thinks is the same business.

– If those are accurate clusters — or if the different spellings listed in the cluster really are different variations for the same business — you can select the check box under “Merge?” to convert all of those spellings in the cluster to the same, single spelling.

– OpenRefine comes packed with several clustering algorithms. You can change “Method,” “Keying Function” etc. to check them all out and see which ones work best for you. When you’re done with one algorithm, you can select “Merge Selected & Re-Cluster” to mess with other algorithms or “Merge Selected & Close” to close out of this section.

– I think it’s best to go through all of the algorithms and play around with them a bit to cluster as many businesses together as you can.

6. Once you are done, export the spreadsheet with the button in the top right corner of the screen and click “Excel.”

Note: We used OpenRefine to cluster together different spellings for the same business. Now we’ll use Excel to create a list of uniquely-titled businesses and how many times those were voted on.

– After we are done, we will still probably have different spellings for the same business. It’s likely that the OpenRefine algorithms will be able to find many of the wide range of spellings for a particular business but not all of them. When you allow users to type in their own business names, you open yourself up to A LOT of spellings for the same thing.

– But the hope is we can narrow this list of spellings down to as few as possible so the manual work of tallying up the votes is as painless as possible.

Now open you spreadsheet up in Excel:

– Right now, we still have a spreadsheet that is about 3,000 rows long. The advantage is we have limited the number of different spellings for all the businesses people have voted on. What we need to do now is create a list of uniquely titled businesses and how many times they were voted on.

6. A lot of people accidentally put in their phone number for the entry. Some entered “====”. Others didn’t fill in the spot at all. We need to get rid of all those.

– I first moved the “entry” column from column E to column A. I then sorted the data from A to Z so the entries with phone numbers are at the top. I deleted those. I then sorted it from Z to A to put the empty entries at the top. And then I deleted those too.

– We can also delete the “entry_id” and “entrant_id” columns because for our purposes, they are useless.

– For those following along, here’s what my spreadsheet looks like now.

– One tricky aspect of this is people can vote for the same business in different categories. So we need to make sure the business and the category of each vote stay to together. That way, a vote for “Ace Hardware” in the “Hardware Store” category doesn’t count the same as a vote for “Ace Hardware” in the “Buy a Lawnmower” category.

7. I did this by merging my entry column with my category column into a new column. This can be done with the concatenate formula:

=CONCATENATE(A2, ” | “, B2)

– A2 is our entry column and B2 is our category column. The ” | ” isn’t necessary but is useful is we want to break apart this column again in the future. We then fill down that function to the last row in the spreadsheet using the fill handle.

– Here’s what my spreadsheet looks like now.

Tip: When using formulas on big spreadsheets, Excel tends to slow down A LOT. So when you are done using a formula, you can copy it and paste just the values (not the formulas) in its place to help Excel run smoother. I do this throughout the walkthrough to make our spreadsheets much easier work with.

8. The next step is to get a list of uniquely-titled businesses and categories. The best walkthrough I found to accomplish this was this one from Microsoft. Complete steps 1-6 under “Count the number of unique values by using a filter” to create a new column of uniquely-titled businesses.

– I also deleted the entry and cat_id columns. We will not need those any more.

– Here’s what my spreadsheet looks like now.

– We are down to under 1,000 rows of unique-titled businesses and categories. And honestly, if I would have messed with the algorithms in OpenRefine more, I could have narrowed that list even farther.

8. Now we have a column for each vote (column A) and column with each unique business and category (column B). We need to count how many times each unique business and category (column B) was voted on — or how many times each unique business and category appears in column A.

– We’ll do this by counting the number of times column B occurred in column A using this SUM formula:

=SUM(IF($A:$A=B2,1,0))

– This basically goes through each row in column A and if it is equal to what is in column B, it adds one. Then its adds all that together. The result is how many times our unique business and category was voted on.

Note: The dollar signs in the formula basically means go through every row in column A.

– Here’s what my spreadsheet looks like now.

9. We’re almost done. All we need to do now is split up our business category column into a column for businesses and a column for categories.

– Do this by first copying and pasting the value of the “unique_entry_cat” column into a new column. We have to do this first, otherwise our sum formula in our “count” column won’t work properly.

– We then break apart that new column using this method and use “|” as the delimiter to split them up. Then delete the old columns and copy and paste the values over the formulas in the spreadsheet.

– Here’s what my spreadsheet looks like now.

10. Finally, copy and paste the values of the columns onto a new spreadsheet and custom sort the spreadsheet first from A to Z for the category column and then largest to smallest in the count column.

– Here’s what the final spreadsheet looks like.

– That’s it! I realize that this is a VERY long-winded blog post and may be confusing to follow at times. If you have any questions, don’t hesitate to e-mail me at chris.essig@wcfcourier.com

Written by csessig

August 16, 2013 at 3:06 pm

Creating responsive maps with Leaflet, Google Docs

with 9 comments

wloo_history_teaserNote: This is cross-posted from Lee’s data journalism blog, which you can read over there by clicking here.

Quick: Name five people you know who don’t have a smartphone.

Stumped? Yeah, me too. The fact is more and more people have smartphones and are using them to keep up with the world.

What does that mean for news app developers? We need to be especially conscience of the mobile platform and make sure everything we build for the web is compatible on the smallest of screens.

One great way to do this is creating apps through responsive design. The idea behind responsive design is creating one web page for all users, as opposed to making separate pages for desktop and mobile computers.

Then we simply add, rearrange, subtract or tweak features on a web page based on the size of the browser the user has when they are viewing the app.

Maps can be difficult to manage on mobile platforms, especially when you add in legends, info boxes, etc. But they are not impossible. Fortunately Leaflet, an alternative to Google maps, is designed to work especially well on mobile platforms.

In this example, we will be loading data into a Google spreadsheet and using Leaflet to map the data on a responsive map.

1. Learn a little bit of Leaflet

Before we start, it would probably be best if you familiarize yourself with Leaflet. Fortunately, their website has some wonderful walk-throughs. I’d recommend going through this one before going anything further.

2. Grab the code

Now, go to my Github page and download the template.

The Readme file includes instructions on how to set up a basic map using Google spreadsheets and Tabletop.js, which is a wonderful tool that allows us to do all kinds of things with data in a Google spreadsheet, including map it using Leaflet.

3. Edit your index.html page

After you have followed the instructions on my Github page, you should have a map ready to go.

All you have to do is go into the index.html page and edit the title of the map, as well as add your own information into the “sidebar_content” div. Also make sure you add your name to the credits because you deserve credit for this awesome map you are putting together.

4. How does it work?

Now open up your map in a browser. If you rearrange the browser window size, you’ll notice that the map rearranges its size. The other components on the page also automatically readjust.

Some of this is done with the Bootstrap web framework, which was designed with responsive designing in mind.

I’ve also added my own CSS. One easy thing I’ve done with elements on the page is declare their widths and heights using percentages instead of pixels. This ensures that the components will automatically be adjusted regardless of the screen size.

Take a look at our css/styles.css file to get an idea of what I’m talking about: 

/* Body */
body {
    padding-left: 0px;
    padding-right: 0px;
  margin: 0;
    height: 100%;
}

html {
	height: 100%;
}


/* Map */
#map {
	position: absolute;
	float: left;
	top: 1%;
	height: 98%;
	width: 100%;
	z-index: 1;
}

The map’s height is 98 percent and width is 100 percent, ensuring it changes its size when the browser changes its size. If we set it to 600 pixels wide, the map would stay 600 pixels wide, even when the browser was adjusted.

– You’ll notice some other changes. For instance, if you have a wide screen, the map’s sidebar will be on the right side of the screen. We did this by using absolute positioning to place the sidebar and its content on the page:

/* Sidebar */
#sidebar {
  position: absolute;
	top: 2%;
	right: 1%;
	height: 96%;
	width: 30%;
	z-index: 2;
	border: 1px solid #999;
	padding-left: 1%;
	padding-right: 1%;
	background-color: #FFFFFF;
    background-color:rgba(255,255,255,0.9);
}

#sidebar h3 {
	line-height: 30px;
}

#sidebar_content {
	float: left;
	width: 30%;
	height: 70%;
	position: fixed;
	overflow: auto;
	padding-top: 5px;
}

The sidebar’s “right” position is set to 1 percent. This ensures that the sidebar will appear only 1 percent from the right side of the page. Additionally, its “top” position is set to 2 percent. This, effectively, pushes it to the top right corner of the screen.

We also used percentages to declare widths, heights and padding lengths for the sidebar.

– You’ll also notice when your browser is reduced drastically, the content of the sidebar disappears off the page. Instead, we have just the title of the sidebar at the top of the page. This is done with CSS media queries:

/* Styles from mobile devices */
@media (max-width: 625px) {
  
  #sidebar_content {
  	display: none;
	}

  /* Sidebar */
  #sidebar {
		position: relative;
		margin-top: 0%;
		float: left;
		left: 0%;
		right: 0%;
		top: 0%;
		padding-left: 2%;
		padding-right: 2%;
		height: 35px;
		width: 96.5%;
	}

}

Basically what the above code is saying is: If the browser is 625 pixels wide or smaller, apply the following CSS styles. These styles would therefore apply to almost all mobile phones. So what you are saying to the browser is: If this is a mobile browser, apply these styles to the elements on the page.

The first thing we do is hide the “sidebar_content” div, which is within our main “sidebar” div. Besides the “sidebar_content” div, we also have a div within the “sidebar” div called “sidebar_header” for our title. The template sets the title to “Tabletop to Leaflet” initially, although you should change that to match your project.

We hide the “sidebar_content” div with the property “display: none.” Hiding it ensures that the only thing left in our “sidebar” div is the title. Then the sidebar is pushed to the top left corner of the page using absolute positioning.

So what do we do with that information we have hidden? We put it in another div using some Javascript. Then we toggle that div from hidden to visible using a button with the class “toggle_description.” This toggle feature is enabled using jQuery.

From our js/script.js file:

// Toggle for 'About this map' and X buttons
// Only visible on mobile
isVisibleDescription = false;
// Grab header, then content of sidebar
sidebarHeader = $('#sidebar_header').html();
sidebarContent = $('#sidebar_content').html();
// Then grab credit information
creditsContent = $('#credits_content').html();
$('.toggle_description').click(function() {
  if (isVisibleDescription === false) {
		$('#description_box_cover').show();
		// Add Sidebar header into our description box
		// And 'Scroll to read more...' text on wide mobile screen
		$('#description_box_header').html(sidebarHeader + '<div id="scroll_more"><strong>Scroll to read more...</strong></div>');
		// Add the rest of our sidebar content, credit information
		$('#description_box_content').html(sidebarContent + '<br />' + 'Credits:' + creditsContent);
		$('#description_box').show();
		isVisibleDescription = true;
	} else {
		$('#description_box').hide();
		$('#description_box_cover').hide();
		isVisibleDescription = false;
	}
});

The above code first grabs the information from the “sidebar_content” div, then places it in the “description_box” div. It also sets our toggle function, which is activated when the user clicks on the button with the class “toggle_description.”

– The “description_box” div is also styled similarly to the “sidebar” div. The big difference is the “description_box” div is hidden by default because we only want it shown if we are on a mobile phone. The button with the class “toggle_description” is also hidden by default.

From our css/styles.css file:

/* 'About this map' button, description box */
/* Mobile only */
.toggle_description {
  display: none;
	z-index: 8;
	position: relative;
	float: right;
    right: 0%;
    top: 0%;
}

#description_box_cover {
	display: none;
	z-index: 10;
	position: absolute;
	top: 0%;
	width: 100%;
	height: 100%;
    background-color: #444444;
    background-color:rgba(44,44,44,0.9);
}

#description_box {
	position: absolute;
	display: none;
	z-index: 11;
	width: 92%;
	height: 93%;
	padding-top: 1%;
	padding-left: 1%;
    padding-right: 1%;
    left: 2.5%;
    top: 2.5%;
    border: 1px solid #999;
    background-color: #FFFFFF;
    background-color:rgba(255,255,255,0.9);
}

#description_box h3 {
	padding-bottom: 0px;
	line-height: 15px;
}

Now we do the opposite with the “toggle_description” when compared to what we did with the “sidebar_content” div.

With the “sidebar_content” div, we had it shown by default then hidden on mobile phones using CSS media queries. With the button, we hide it by default and then show it on mobile phones using a CSS property of “display: inline.”

From our css/styles.css file:

/* Styles from mobile devices */
@media (max-width: 625px) {

	.toggle_description {
		display: inline;
	}

}

As a noted above, when someone clicks on that button, jQuery toggles between hidden and shown on the button with the class “toggle_description”. It is hidden by default, so it is shown when the user first clicks  it. Then when the user clicks the blue X button (which also has the class of “toggle description”), the box disappears.

A similar philosophy is in place to hide and show the credits box.

That should give you a good idea of what is happening with this map. Feel free to fork the repo and create your own awesome maps.

Have any questions? Don’t hesitate to leave a comment.