Archive for the ‘Data Sources’ Category

Python Geocoding Take 2 – US Addresses

Monday, May 30th, 2016

In my previous post I discussed my recent adventures with geocoding addresses outside the US. In contrast, there are countless options for batch geocoding addresses within the United States. I’ll discuss a few of those options here, but will focus primarily on the US Census Geocoder and a Python script I’ve written to batch match addresses using their API. The code and documentation is available on my lab’s resources page.

A Few Different Options

ESRI’s geocoding services allow you (with an account) to access their geocoding servers through tools in the ArcToolbox, or you can write a script and access them through an API. QGIS has a third-party plugin for accessing Google’s services (2500 records a day free) or the Open Streetmap. You can still do things the old fashioned way, by downloading geocoded street files and creating a matching service.

Alternatively, you can subscribe to any number of commercial or academic services where you can upload a file, do the matching, and download results. For years I’ve used the geocoding services at Texas A&M that allow you to do just that. Their rates are reasonable, or if you’re an academic institution and partner with them (place some links to their service on their website) you can request free credits for doing matches in batches.

The Census Geocoder and API, and a Python Script for Batch Geocoding

The Census Bureau’s TIGER and address files are often used as the foundational layers for building these other services, to which the service providers add refinements and improvements. You can access the Census Bureau’s services directly through the Census Geocoder, where you can match an address one at a time, or you can upload a batch of 1000 records. It returns longitude and latitude coordinates in NAD 83, and you can get names and codes for all the census geographies where the address is located. The service is pretty picky about the structure of the upload file (must be plain text, csv, with an id column and then columns with the address components in a specific order – with no other attributes allowed) but the nice thing is it requires no login and no key. It’s also public domain, so you can do whatever you want with the data you’ve retrieved. A tutorial for using it is available on our lab’s census tutorials page.

census geocoder

They also have an API with some basic documentation. You can match parsed and unparsed addresses, and can even do reverse geocoding. So I took a stab at writing a script to batch process addresses in text-delimited files (csv or txt). Unfortnately, the Census Geocoding API is not one of the services covered by the Python Geocoder that I mentioned in my previous post, but I did find another third party module called censusgeocode which provides a thin wrapper you can use. I incorporated that module into my Python 3 script, which I wrote as a function that takes the following inputs:

census_geocode(datafile,delim,header,start,addcol)
(str,str,str,int,list[int]) -> files

  • datafile – this is the name of the file you want to process (file name and extension). If you place the geocode_census_funct.py file in the same directory as your data file, then you just need to provide the name of the file. Otherwise, you need to provide the full path to the file.
  • delim – this is the delimiter or character that separates the values in your data file. Common delimiters includes commas ‘,’, tabs ‘\t’, and pipes ‘|’.
  • header – here you specify whether your file has a header row, i.e. column names. Enter ‘y’ or ‘yes’ if it does, ‘n’ or ‘no’ if it doesn’t.
  • start – type 0 to specify that you want to start reading the file from the beginning. If you were previously running the script and it broke and exited for some reason, it provides an index number where it stopped reading; if that’s the case you can provide that index number here, to pick up where you left off.
  • addcol – provide a list that indicates the position number of the columns that contain the address components in your data file. For an unparsed address, you provide just one position number. For a parsed address, you provide 4 positions: address, city, state, and ZIP code. Whether you provide 1 or 4, the numbers must be supplied in brackets, as the function requires a Python list.

You can open the script in IDLE, run it to load it into memory, and then type the function with the necessary parameters in the shell to execute it. Some examples:

  • A tab-delimited, unparsed address file with a header that’s stored in the same folder as the script. Start from the beginning and the address is in the 2nd column: census_geocode('my_addresses.txt','\t','y',0,[2])
  • A comma-delimited, parsed address file with no header that’s stored in the same folder as the script. Start from the beginning and the addresses are in the 2nd through 5th columns: census_geocode('addresses_to_match.csv',',','n',0,[2,3,4,5])
  • A comma-delimited, unparsed address file with a header that’s not in the same folder as the script. We ran the file before and it stopped at index 250, so restart there – the address is in the 3rd column: census_geocode('C:\address_data\data1.csv',',','y',250,[3])

The beginning of the script “sets the table”: we read the address columns into variables, create the output files (one for matches, one for non-matches, and a summary report), and we handle whether or not there’s a header row. For reading the file I used Python’s CSV module. Typically I don’t use this module, as I find it’s much simpler to do the basic: read a line in, split it on a delimiter, strip whitespace, read it into a list, etc. But in this case the CSV module allows you to handle a wider array of input files; if the input data was a csv and there happened to be commas embedded in the values themselves, the CSV module easily takes care of it; if you ignore it, the parsing would get thrown off for that record.

Handling Exceptions and Server Errors

In terms of expanding my skills, the new things I had to learn were exception handling and control flows. Since the censusgeocoding module is a thin wrapper, it had no built in mechanism for retrying a match a certain number of times if the server timed out. This is an absolute necessity, because the census server often times out, is busy, or just hiccups, returning a generic error message. I had already learned how to handle crashes in my earlier geocoding experiments, where I would write the script to match and write a record one by one as it went along. It would try to do a match, but if any error was raised, it would exit that loop cleanly, write a report, and all would be saved and you could pick up where you left off. But in this case, if that server non-response error was returned I didn’t want to give up – I wanted to keep trying.

So on the outside there is a loop to try and do a match, unless any error happens, then exit the loop cleanly and wrap up. But inside there is another try loop, where we try to do a match but if we get that specific server error, continue: go back to the top of that for loop and try again. That loop begins with While True – if we successfully get to the end, then we start with the next record. If we get that server error we stay in that While loop and keep trying until we get a match, or we run out of tries (5) and write as a non-match.

error handling

In doing an actual match, the script does a parsed or unparsed match based on user input. But there was another sticking point; in some instances the API would return a matched result (we got coordinates!), but some of the objects that it returned were actually errors because of some java problem (failed to get the tract number or county name – here’s an error message instead!) To handle this, we have a for i in range loop. If we have a matched record and we don’t have a status message (that indicates an error) then we move along and grab all the info we need – the coordinates, and all the census geography where that coordinate falls, and write it out, and then that for loop ends with a break. But if we receive an error message we continue – go back to the top of that loop and try doing the match again. After 3 tries we give up and write no match.

Figuring all that out took a while – where do these loops go and what goes in them, how do I make sure that I retry a record rather than passing over it to the next one, etc. Stack Exchange to the rescue! Difference between continue, pass and break, returning to the beginning of a loop, breaking out of a nested loop, and retrying after an exception. The rest is pretty straightforward. Once the matching is done we close the files, and write out a little report that tells us how many matches we got versus fails. The Census Geocoder via the API is pretty unforgiving; it either finds a match, or it doesn’t. There is no match score or partial matching, and it doesn’t give you a ZIP Code or municipal centroid if it can’t find the address. It’s all or nothing; if you have partial or messy addresses or PO Boxes, it’s pretty much guaranteed that you won’t get matches.

There’s no limit on number of matches, but I’ve built in a number of pauses so I’m not hammering the server too hard – one second after each match, 5 seconds after every 1000 matches, a couple seconds before retrying after an error. Your mileage will vary, but the other day I did about 2500 matches in just under 2 hours. Their server can be balky at times – in some cases I’ve encountered only a couple problems for every 100 records, but on other occasions there were hang-ups on every other record. For diagnostic purposes the script prints every 100th record to the screen, as well as any problems it encountered (see pic below). If you launch a process and notice the server is hanging on every other record and repeatedly failing to get matches, it’s probably best to bail out and come back later. Recently, I’ve noticed fewer problems during off-peak times: evenings and weekends.

Script Running

Wrap Up

The script and the documentation are posted on our labs resources page, for all to see and use – you just have to install the third party censusgeocode module before using it. When would you want to use this? Well, if you need something that’s free, this is a good choice. If you have batches in the 10ks to do, this would be a good solution. If you’re in the 100ks, it could be a feasible solution – one of my colleagues has confirmed that he’s used the script to match about 40k addresses, so the service is up to the task for doing larger jobs.

If you have less than a couple thousand records, you might as well use their website and upload files directly. If you’re pushing a million or more – well, you’ll probably want to set up something locally. PostGIS has a TIGER module that lets you do desktop matching if you need to go into the millions, or you simply have a lot to do on a consistent basis. The excellent book PostGIS in Action has a chapter dedicated to to this.

In some cases, large cities or counties may offer their own geocoding services, and if you know you’re just going to be doing matches for your local area those sources will probably have greater accuracy, if they’re adding value with local knowledge. For example, my results with NYC’s geocoding API for addresses in the five boroughs are better than the Census Bureau’s and is customized for local quirks; for example, I can pass in a borough name instead of a postal city and ZIP Code, and it’s able to handle those funky addresses in Queens that have dashes and similar names for multiple streets (35th st, 35th ave, 35th dr…). But for a free, public domain service that requires no registration, no keys, covers the entire country, and is the foundation for just about every US geocoding platform out there, the Census Geocoder is hard to beat.

Python Geocoding Take 1 – International Addresses

Monday, May 16th, 2016

This past semester has been the semester of geocoding. I’ve had a number of requests for processing large batches of addresses. Now that the term is drawing to a close, I’ll share some of my trials and tribulations. In this post, I’ll focus on my adventures in international geocoding.

First, it’s necessary to provide some context. As an academic librarian I’m primarily engaged with assisting students and faculty with their coursework and their research. My users are interested in getting coordinates for data so that they can do both analysis and visualization, which requires them to download the actual coordinate data in a batch and integrate it with the rest of their projects.

This is an important distinction to make, because in many cases the large web mapping companies (Google, Bing, Mapquest, etc) are not catering to this population – they provide services and APIs to web developers, so these folks can integrate geocoding services into the Google, Bing, etc maps they are embedding in their website. They geocoding providers specifically prohibit (in the fine print of their terms of use) anyone from using their services to create and download geocoded data. This essentially excludes a lot of academic use – which, is something I hadn’t fully grasped at the outset.

Google’s Geocoding API Perhaps?

My adventure began when a professor asked me for help in geocoding about 1 million addresses – in Turkey. Right from the beginning, many of the usual sources I would turn to (for US addresses) were out the window. I knew that I could do small scale batches of international addresses with the mmQGIS geocoding plugin, so I started testing there. The address file we had consisted of unparsed addresses, and the formating looked rather chaotic – but after doing some research I discovered that geocoding Turkish addresses was a tough proposition. The Open Street Map plugin (using Nominatim) returned no matches for our 1000 test cases. The Google results were much better, so we decided to investigate writing a script and using an API and to pay for the matching. According to the documentation, it would end up costing $500 to do 1 million addresses.

I searched around for some Python APIs and found what I believed was the official one for google maps geocoding. So I spent a day writing a script that would loop through the addresses, which we divided into batches of 100k records each (which is the max you can do per day with Google if you set up billing), and the professor obtained an API key and set up billing for the account. The interface for setting up and managing the Google APIs was ridiculously confusing. Eventually we were set and I let the script rip, and found that it wouldn’t rip for long. It would consistently stop after doing a few thousand records. I had written it to write results one by one as they were obtained, and to exit cleanly in the case of errors. Upon exit, it provides the index number of the record where it stopped, so I was able to pick up where it left off. But the server would constantly time out – sometimes it could do 10 to 12k records in a stretch, but often less, so I could never leave it unattended for long. The matches themselves were a mixed bag – you could throw absolute garbage at the Google geocoder and still get a match – if not to an address or property, then to a street segment, and beyond that to useless things like postal codes, administrative districts, and the country as a whole (i.e. I can’t find your address, so here are the coordinates for the geographic center of Istanbul, or for all of Turkey. Have a nice day).

It seemed like it was going to be a long climb to get to 1 million – but after about 100k we could go no further. Google simply refused every additional request. A new API key would get us a little further, but soon after that nothing would work and we wouldn’t get any useful error messages to explain why. Having never done anything like this before, I started to investigate why, and eventually discovered the problem: these web mapping geocoding services, even if you pay for them, are not meant to be used this way. Buried in the documentation I found the license restrictions, which stipulate that you are not allowed to download any of the data, and you had to plot every coordinate you retrieved onto a Google map. This is a service for web mapping developers, not researchers.

Why hadn’t I realized this before? One, I simply had never made this distinction as I thought geocoding was geocoding, and in my world of course people are going to want to download the coordinates. Two, the Internet is full of thousands of little blog posts and tutorials which demonstrate how to use the Google Maps APIs, so I thought this was possible. But they never mention any of the caveats about what you can and can’t do with these services. In addition to violating the service terms, what I was doing was akin to yelling in the back of a crowded room, as I was hammering their server, sending requests as fast as I could with no limit. A normal web mapping application (which is what the service is designed for) would send a fraction of those requests in that amount of time. No wonder the requests were refused. Thus ended my Google geocoding experiment.

Nope – How About ESRI Instead?

So what to do next? I found that most of the other commercial web mapping services didn’t provide anything near the maximum caps and low prices that Google was offering. Mapquest for example requires that you subscribe to an account on a monthly or annual basis, and 100k is the amount you could do in a month. Most of the other commercial services also prohibit any downloads.

The big exception is ESRI – they are one of the few that understand and cater to the academic market, and they do allow downloads: they say quite plainly: “Take your Coordinates with you. Once you have the results of a Geocode operation, they’re yours to take anywhere.” My university has a site license for ArcGIS, but it doesn’t include geocoding. You can create an account and have a certain number of free credits, and after that you pay. 1 mil records was going to cost about $4000 – substantially more than Google, but totally legal. ESRI provides lists of countries and ranks them according to how complete their street network coverage is. You can use their API via a script, or you can set up the service in ArcGIS Desktop and do the matching through the ArcToolbox. This would be painfully slow if you were doing a large job (like this one) but for the purpose of testing it out with a few hundred records this is what I tried. Unfortunately, in our case the results still weren’t good. Most of the addresses were to administrative or postal areas; not specific enough.

The Python Geocoder and a Wealth of Options

What often happens in librarianship when a patron makes an initial request (this should be a piece of cake, right?) and then discovers that what they’re looking for is more involved (ahhh this will be tougher than we thought), is that they reframe the question. He went back through the addresses with a research partner and winnowed them down based on what they really, absolutely needed, so now we were down from 1 million to just finding a match for about 300k. His colleague also suggested that we use Yandex, the Russian search and mapping engine. The structure of Russian addresses is quite similar to Turkish ones, and since Russia is closer to Turkey geographically and economically Yandex might do a better job.

I was dubious of this at first, but was quickly surprised. I found the Python Geocoder module, which provides a common, uniform API to over two dozen different geocoding services – including Google. Given the simplicity and flexibility of this module, it’s the one I should have used in the first place. And while Google limits you to 2500 free matches in one day, Yandex allows you 25k – that’s 25,000 – free matches in one day, without having to request an API key! I modified the original script I wrote to use the Python Geocoder module with Yandex, and the initial small-batch tests were successful. Here’s a small portion of the code – it loops through a file where the address is stored in one field (unparsed):

for index, line in enumerate(readfile):
        address=line.strip().split(delim)
        result=geocoder.yandex(address[add]).json

And it spits you back this JSON result (you could also do XML if you prefer):

{‘quality’: ‘street’, ‘address’: ‘Türkiye, İstanbul, Fatih, Cankurtaran Mh., Ayasofya Meydanı’, ‘location’: ‘Hagia Sophia Museum, Sultanahmet Mh., Ayasofya Meydanı, Fatih/İstanbul’, ‘state’: ‘İstanbul’, ‘lng’: ‘28.979031’, ‘accuracy’: ‘street’, ‘encoding’: ‘utf-8’, ‘provider’: ‘yandex’, ‘country_code’: ‘TR’, ‘ok’: True, ‘status_code’: 200, ‘lat’: ‘41.00772’, ‘country’: ‘Türkiye’, ‘county’: ‘Fatih’, ‘confidence’: 10, ‘bbox’: {‘northeast’: [41.008156, 28.979714], ‘southwest’: [41.007285, 28.978349]}, ‘street’: ‘Ayasofya Meydanı’, ‘status’: ‘OK’}

If the result you get back is not OK (ok is False – nothing matched), then write the record to the unmatched file. Otherwise, get the bits and pieces out of the json object that you want, append them to the record, and write the whole record out to a matched file.

        if result.get('ok')==False:
                nomatch.append(address)
                nomatchfile.writelines('\t'.join(address)+'\n')
        else:
                lng=result.get('lng')
                lat=result.get('lat')
                qual=result.get('quality')
                accu=result.get('accuracy')
                matchadd=result.get('address')
                newitems=lng,lat,qual,accu,matchadd
                address.extend(newitems)
                matched.append(address)
                matchfile.writelines('\t'.join(address)+'\n')

But is it legal? It was unclear to me; they specify that map data is meant for personal/noncommercial use and in the same sentence: “Any copying of the Data, their reproduction, conversion, distribution, promulgation (publication) in the Internet, any use of the Data in mass media and/or for commercial purposes without a prior written consent of the right holder, shall be prohibited”. Does that mean any copying, or just copying for commercial use or for redistributing the data? In our case, this is for academic non-profit use and the data (individual geocoded records) wasn’t going to be republished – it would be used for plotting distances between locations and making highly generalized static dot maps for an article. At this stage we seemed to be out of options – if you need to geocode a large batch of international addresses, AND you are willing to pay for it, where on Earth can you go?

Ultimately, I left it up to the professor to contact them or not, and we decided to roll the dice. For my part, I engineered the script to put a minimum load on their servers – essentially I could take 24 hours to do 25k records. I used the time and random modules in Python to build pauses in between records to slow things down. In sharp contrast to Google, the Yandex servers were amazingly reliable – they were able to do batches of 25k records every single time without timing out – not even once – and in less than a couple weeks we were finished. About 50% of the matches were good, and for the others he and a research assistant went back and cleaned up unmatched records, and I gave them the script so they could try again.

International Geocoding: The Take-Aways

  1. If you need to geocode a large batch of foreign addresses for academic or research purposes, forget Google. Their service was less than stellar (to put it mildly) and anyway it’s a violation of their license agreement. And all those lousy little blog posts out there that show you how to use the Google Map APIs with Python and say “Gee isn’t this great!” are largely useless for practical purposes.
  2. The Python Geocoder module is simple to use and let’s you write a single script to access a ton of different geocoding services, including Open Streetmap, Yandex, and ESRI. But you still need to review the terms of service for each one to see what’s allowed and what the daily limits are.
  3. If you have funding for your research project, and ESRI geocoding has good coverage for your geographic area (based on their documentation but also on your own testing) then go with them, as you’re free and clear to download data under their terms. Arc Desktop will be too sluggish for large batches so write a script – you can use the Python Geocoder.
  4. Otherwise – the Open Street Map / Nominatim services are worth a try but your success will vary by country. I had used them before for addresses in France with fair success, but it didn’t help me with Turkey.
  5. You can also crawl through the GIS Stackexchange for advice. I’ve found that most of the suggestions are either for US geocoding, or are companies that are answering posts saying “Hey you can try my service!”

Happy geocoding, comrades! In my next post I’ll discuss my experience with batch geocoding addresses here in the US of A with Python.

Average Distance to Public Libraries in the US

Monday, February 22nd, 2016

A few months ago I had a new article published in LISR, but given the absurd restrictions of academic journal publishing I’m not allowed to publicly post the article, and have to wait 12 months before sharing my post-print copy. It is available via your local library if they have a subscription to the Science Direct database (you can also email me to request a copy). I am sharing some of the un-published state-level data that was generated for the project here.

Citation and Abstract

Regional variations in average distance to public libraries in the United States
F. Donnelly
Library & Information Science Research
Volume 37, Issue 4, October 2015, Pages 280–289
http://dx.doi.org/10.1016/j.lisr.2015.11.008

Abstract

“There are substantive regional variations in public library accessibility in the United States, which is a concern considering the civic and educational roles that libraries play in communities. Average population-weighted distances and the total population living within one mile segments of the nearest public library were calculated at a regional level for metropolitan and non-metropolitan areas, and at a state level. The findings demonstrate significant regional variations in accessibility that have been persistent over time and cannot be explained by simple population distribution measures alone. Distances to the nearest public library are higher in the South compared to other regions, with statistically significant clusters of states with lower accessibility than average. The national average population-weighted distance to the nearest public library is 2.1 miles. While this supports the use of a two-mile buffer employed in many LIS studies to measure library service areas, the degree of variation that exists between regions and states suggests that local measures should be applied to local areas.”

Purpose

I’m not going to repeat all the findings, but will provide some context.

As a follow-up to my earlier work, I was interested in trying an alternate approach for measuring public library spatial equity. I previously used the standard container approach – draw a buffer at some fixed distance around a library and count whether people are in or out, and as an approximation for individuals I used population centroids for census tracts. In my second approach, I used straight-line distance measurements from census block groups (smaller than tracts) to the nearest public library so I could compare average distances for regions and states; I also summed populations for these areas by calculating the percentage of people that lived within one-mile rings of the nearest library. I weighted the distances by population, to account for the fact that census areas vary in population size (tracts and block groups are designed to fall within an ideal population range – for block groups it’s between 600 and 3000 people).

Despite the difference in approach, the outcome was similar. Using the earlier approach (census tract centroids that fell within a library buffer that varied from 1 to 3 miles based on urban or rural setting), two-thirds of Americans fell within a “library service area”, which means that they lived within a reasonable distance to a library based on standard practices in LIS research. Using the latest approach (using block group centroids and measuring the distance to the nearest library) two-thirds of Americans lived within two miles of a public library – the average population weighted distance was 2.1 miles. Both studies illustrate that there is a great deal of variation by geographic region – people in the South consistently lived further away from public libraries compared to the national average, while people in the Northeast lived closer. Spatial Autocorrelation (LISA) revealed a cluster of states in the South with high distances and a cluster in the Northeast with low distances.

The idea in doing this research was not to model actual travel behavior to measure accessibility. People in rural areas may be accustomed to traveling greater distances, public transportation can be a factor, people may not visit the library that’s close to their home for several reasons, measuring distance along a network is more precise than Euclidean distance, etc. The point is that libraries are a public good that provide tangible benefits to communities. People that live in close proximity to a public library are more likely to reap the benefits that it provides relative to those living further away. Communities that have libraries will benefit more than communities that don’t. The distance measurements serve as a basic metric for evaluating spatial equity. So, if someone lives more than six miles away from a library that does not mean that they don’t have access; it does means they are less likely to utilize it or realize it’s benefits compared to someone who lives a mile or two away.

Data

I used the 2010 Census at the block group level, and obtained the location of public libraries from the 2010 IMLS. I improved the latter by geocoding libraries that did not have address-level coordinates, so that I had street matches for 95% of the 16,720 libraries in the dataset. The tables that I’m providing here were not published in the original article, but were tacked on as supplementary material in appendices. I wanted to share them so others could incorporate them into local studies. In most LIS research the prevailing approach for measuring library service areas is to use a buffer of 1 to 2 miles for all locations. Given the variation between states, if you wanted to use the state-average for library planning in your own state you can consider using these figures.

To provide some context, the image below shows public libraries (red stars) in relation to census block group centroids (white circles) for northern Delaware (primarily suburban) and surrounding areas (mix of suburban and rural). The line drawn between the Swedesboro and Woodstown libraries in New Jersey is 5-miles in length. I used QGIS and Spatialite for most of the work, along with Python for processing the data and Geoda for the spatial autocorrelation.

Map Example - Northern Delaware

The three tables I’m posting on the resources page are for states: one counts the 2010 Census population within one to six mile rings of the nearest public library, the second is the percentage of the total state population that falls within that ring, and the third is a summary table that calculates the mean and population-weighted distance to the nearest library by state. One set of tables is formatted text (for printing or just looking up numbers) while the other set are CSV files that you can use in a spreadsheet. I’ve included a metadata record with some methodological info, but you can read the full details in the article.

In the article itself I tabulated and examined data at a broader, regional level (Northeast, Midwest, South, and West), and also broke it down into metropolitan and non-metropolitan areas for the regions. Naturally people that live in non-metropolitan areas lived further away, but the same regional patterns existed: more people in the South in both metro and non-metro areas lived further away compared to their counterparts in other parts of the country. This weekend I stumbled across this article in the Washington Post about troubles in the Deep South, and was struck by how these maps mirrored the low library accessibility maps in my past two articles.

Review of The Census Reporter

Monday, February 8th, 2016

Picking up where I left off from my previous post (gee – welcome to 2016!) I thought I’d give a brief review of another census resource, The Census Reporter at http://censusreporter.org/.

The Census Reporter was created to make it easier for journalists to write stories using census data. To that end, they’ve created a really slick and easy to use web site that makes the data accessible and fun to explore. From the homepage you have three ways of diving into the data: you can pull up a profile by typing in the name of a place, you can enter an address and explore places that contain that address, or you can explore tables by topic.

Census Reporter Homepage

First, the place-based approach. You can type in a named place, like a state, county, or a census place (incorporated cities and towns, or census designated places) to get started. This will give you a selection of data from the most recent release of the American Community Survey. For larger areas where the data is available, it gives you 1-year ACS data by default; otherwise you get the latest 5-year data.

You’re presented with a map of the location at the top, and a series of attractive looking graphs and charts sorted by the demographic profile table source – social, economic, housing, and demographic. If you hover over a data point in a table it gives you some geographic context by comparing this place’s value with that of larger places where it’s contained. For example, if I search for Philadelphia I can hover over the chart to get the value for the Philly metro area and the State of Pennsylvania. I can click a link below each chart to open the full table, which includes both estimates and margins of error. There are small links for viewing the table by itself on a separate page (which also gives you the ability to download it) and for embedding the chart in a website.

Census Reporter Chart and Table

Viewing the table gives you additional options, like adding additional places for comparison, or subdividing the place into smaller areas for comparison. So if I’m looking at Philadelphia, I can break it down into tracts, block groups, or ZIP Codes. From there I can toggle away from the table view to view a map or a distribution bar to explore that variable by individual geographies.

View Data by Sub-Areas, Distribution Bars

The place-based search is great at allowing you to drill down either by topic or by these smaller geographies. But if you wanted to access a fuller range of geographies like congressional districts or PUMAs, it seems easier to do an address-based search. Back on the homepage, selecting the address button and typing in an address brings you to a map with the address pin-pointed, and on the left you can choose any geography that encloses that address. Once you do that, you get a profile for that geography and can start doing the same sorts of operations for changing the topics or tables, or adding or subdividing geographies for comparison.

Address Search

The topic-based search lets you search just by topic and then figure out the geography piece later. Of the three types of searches this one is the toughest, given the sheer number of tables and cross-tabulations. You can click on a link for a general topic to narrow things down a bit before beginning a search.

In downloading the data you have a variety of useful options: CSV, Excel, GeoJSON, KML, and shapefiles. So in theory you can download data that’s readily mappable – in practice I wasn’t able to download a shapefile, but could grab a KML or GeoJson and was able to visualize it in QGIS. One challenge in downloading any of the files is that the column names use the identifier codes, and the actual names of the variables aren’t included in the download format you choose – they’re included in a json file. So you can use that for reference, but it can’t be readily incorporated into the table.

So – where would this resource fall within the pantheon of US Census data resources? I think it’s great for accessing and, especially, visualizing profiles (profile = lots of data for one place) from the most recent ACS releases. It’s easy to use and succeeds at making the data interesting; for that reason I certainly would incorporate it into undergraduate courses where I’m introducing data. The ability to embed the charts into websites is certainly a bonus, and they deserve a big thumbs up for incorporating the margin of error data, rather than hiding or discarding it like other resources do.

The ability to create and view comparison tables (comparison = one piece of data for many places) is good – select an area and then break it down – but not as strong as the profile options. If you want to get a profile for a non-named place like a tract, ZIP Code, or PUMA you can’t do that from the profile search. You can do an address search and back out (if you know an address for that you’re interested in) or you can drill down by topic, which lets you search by summary area in addition to named places.

For users who need to download a lot of data, or for folks who need datasets that aren’t the most recent ACS release, this resource isn’t the place to go. The focus here is on providing the data in an easy and compelling way, as is. In viewing the profiles, it’s not clear if you can choose 5-year data over 1-year data for places where both datasets are available – even for large geographic areas with high population, sometimes it’s preferable to use the 5-year data to take advantage of the smaller margins of error. I also didn’t see an option for choosing decennial census data.

In short, this resource is well-designed and definitely worth exploring. It seems clear why this would be a go-to source for journalists, but it can be for many others as well.

The New NYC Census Factfinder

Wednesday, September 9th, 2015

As I’m updating my presentations and handouts for the new academic year, I’m taking two new census resources for a test drive. I’ll talk about the first resource in this post.

The NYC Department of City Planning has been collating census data and publishing it for the City for quite some time. They’ve created neighborhood tabulation areas (NTAs) by aggregating census tracts, so that they could publish more reliable ACS data for small areas (since the margins of error for census tracts can be quite large) and so that New Yorkers have data for neighborhood-like areas that they would recognize. The City also publishes PUMA-level data that’s associated with the City’s Community Districts, as well as borough and city-level data. All of this information is available in a large series of Excel spreadsheets or PDFs in the form of comparison tables for each dataset.

The Department of Planning also created the NYC Census Factfinder, a web-mapping interface that let’s users explore census tract and NTA level data profiles. You could plug in an address or click on the map and get a 2010 Census profile, or a demographic change profile that showed shifts between the 2000 and 2010 Census.

pic1_factfinder

It was a nice application, but they’ve just made a series of updates that make it infinitely better:

  1. They’ve added the American Community Survey data from 2009-2013, and you can view the four demographic profile tables (demographics, social, economic, and housing) for tracts and NTAs.
  2. Unlike many other sources, they do publish the margin of error for all of the ACS data, which is immensely important. Estimates that have a high margin of error (as defined by a coefficient of variation) appear in grey instead of solid black. While the actual margins are not shown by default, you can simply click the Show radio button to turn on the Reliability data.
  3. Tracts or neighborhoods can be compared to the City as a whole or to an individual borough by selecting the drop down for the column header.
  4. This is especially cool – if you’re viewing census tracts you can use the select pointer and hold down the Control key (Command key on a Mac) to select multiple tracts, and then the data tables will aggregate the tract-level data for you (so essentially you can build your own neighborhoods). What’s noteworthy here is that it also calculates the new margins of error for all of the derived estimates, AND it even calculates new medians and averages with margins of error! This is something that I’ve never seen in any other application.
  5. In addition to searching for locations by address, you can hit the search type drop down and you have a number of additional options like Intersection, Place of Interest, and even Subway Stations.

nyc_factfinder_table

There are a few quirks:

  1. I had trouble viewing the map in Firefox – this isn’t a consistent problem but something I noticed today when I went exploring. Hopefully something temporary that will be corrected. Had no problems in IE.
  2. If you want to click to select an area on the map, you have to hit the select button first (the arrow beside the zoom slider and print button) and then click on your area to select it. Just clicking on the map without hitting select first won’t do much – it will just highlight the area and tell you it’s name. Clicking the arrow button turns it blue and allows you to select features, clicking it again turns it white and lets you identify features and pan around the map.
  3. factfinder_buttons

  4. The one bummer is that there isn’t a way to download any of the profiles – particularly the ones you custom design by selecting tracts. Hitting the Get Data button takes you out of the Factfinder and back to the page with all of the pre-compiled comparison tables. You can print the table out to a PDF for presentation purposes, but if you want a data-friendly format you’ll have to highlight and select the table on the page, copy, and paste into a spreadsheet.

These are just small quibbles that I’m sure will eventually be addressed. As is stands, with the addition of the ACS and the new features they’ve added, I’ll definitely be integrating the NYC Census Factfinder into my presentations and will be revising my NYC Neighborhood Census data handout to add it as a source. It’s unique among resources in that it provides NTA-level data in addition to tract data, has 2000 and 2010 historical change and the latest 5-year ACS (with margins of error) in one application, and allows you to build your own neighborhoods to aggregate tract data WITH new margins of error for all derived estimates. It’s well-suited for users who want basic Census demographic profiles for neighborhood-like areas in NYC.

Update Your Links to the New Baruch Geoportal

Thursday, August 13th, 2015

A few weeks ago I launched a new version of our college’s GIS data repository, the Baruch Geoportal. At the back end I have a simplified process for getting data onto our server, and on the front end we did away with manually updating HTML and CSS webpages in favor of using a Confluence wiki. My college has a subscription to Confluence, and I’ve been using an internal wiki for documenting and administering all aspects of our projects. A public, external wiki for providing our data seemed like a nice way to go – we can focus more on the content and it’s easier for my team and I to collaborate.

Since it’s a new site with a new address, many of the links to projects I’ve referred to throughout the years on this blog are no longer valid. Redirects are in place, but they won’t last forever. Some notable links to update:

The new site has a dedicated blog that you can follow (via RSS) for the latest updates to the portal. The portal also has a number of relatively new and publicly accessible datasets that we’ve posted over the last year (but that I haven’t had time to post about). These include the NYC Mass Transit Spatial Layers series and population centroids for US census geographies. We’ve been creating ISO spatial metadata for all of our new layers, but we still need to create XML stylesheets to make them more human-readable. That will be one of many projects to do for this academic year.

baruch_geoportal

Census Proposes to Cut 3-year ACS in Fiscal 2016

Friday, February 6th, 2015

I’m coming out of my blog hibernation for this announcement – the US Census Bureau is proposing that they drop the 3-year series of the American Community Survey in fiscal year 2016. A colleague mentioned that he overheard this at a meeting yesterday. Searching the web, I found a post at the Free Government Information site which points to this Census Bureau Press release. The press release cites the predictable reasons (budget constraints, funding priorities, etc.) for dropping the series. Oddly, the news comes through some random site and not through the Census Bureau’s website, where there’s no mention of it. I saw that Stanford also had a post, where they shared the same press release.

I kept searching for some definitive proof, and through someone’s tweet I found a link to a PDF of the US Census Bureau’s Budget Estimates for Fiscal Year 2016, presented to Congress this February 2015. I found confirmation buried on page CEN – 106 (the 100th page in a 190 page doc):

Data Products

Restoration of ACS Data Products ($1.5 million): Each year, the ACS releases a wide range of data products widely used by policymakers, Federal, state and local governments, businesses and the public to make decisions on allocation of taxpayer-funds, the location of businesses and the placement of products, emergency management plans, and a host of other matters. Resource constraints have led to the cancellation of data products for areas with populations between 20 and 60 thousand based on 3-year rolling averages of ACS data (known as the “3-Year Data” Product).They have also resulted in delays in the release of the 1- and 5- year Public Use Macro Sample (PUMS) data files and canceled the release of the 5- year Comparison Profile data product and the Spanish Translation of the 1- and 5- year Puerto Rico data products.

The Census Bureau proposes to terminate permanently the 3-Year Data Product. The Census Bureau intended to produce this data product for a few years when the ACS was a new survey. Now that the ACS has collected data for nearly a decade, this product can be discontinued without serious impacts on the availability of the estimates for these communities.

The ACS would like to restore the timely release of the other essential products in FY2016. The continued absence of these data products will impact the availability of data – especially for Puerto Rico – to public and private sector decision makers.

So at this point it’s still just a proposal. The benefits, besides the ability to release other datasets in a timely fashion, would be simplification for users. Instead of choosing between three datasets now there will only be two – the one year and the five year. You choose the one year for large areas and the five year for every place else. In terms of disadvantages, consider this example – here are the number of children enrolled in nursery school in NY State PUMA 03808, which covers Murray Hill, Gramercy, and Stuyvesant Town in the eastern half of Midtown Manhattan:

PUMA NY 03808

Population Over 3 Years Old Enrolled in Nursery / Pre-school

  • 1 year 2013: 1,166 +/- 609
  • 3 year 2011-2013: 1,549 +/- 530
  • 5 year 2009-2013: 1,819 +/- 409

Since PUMAs are statistical areas built to contain 100k people, data for all of them is available in each series. Like all the ACS estimates these have a 90% confidence interval. Look at the data for the 1-year series. The margin of error (ME) is so large that’s it’s approximately 50% of the estimate, which in my opinion makes it worthless for just about any application. The estimate itself is much lower than the estimate for the other two series. It’s true that it’s only capturing the latest year, but administrative data and news reports suggest that the number of nursery school children in the district that covers this area has been relatively stable over time, with modest increases (geographically the district covers an area much larger than this PUMA). This suggests that the estimate itself is not so great.

The 5 year estimate may be closer to reality, and its ME is only 20% of the estimate. But it covers five years in time. If you wanted something that was a compromise – more timely than the five year but with a lower ME than the one year, then the three year series was your choice, in this case with an ME that’s about 33% of the estimate. But under this proposal, this choice goes away and you have to make do with either 1-year estimates (which will be lousy for geographies that aren’t far above the 65k population threshold, and lousy for small population groups where ever they are located), or better 5-year estimates that cover a greater time span.

NYC Geodatabase Updates: Spatialite Upgrade & ZIPs to ZCTAs

Wednesday, July 30th, 2014

I released the latest version of the NYC geodatabase (nyc_gdb) a few weeks ago. In addition to simply updating the data (for subway stations and ridership, city point features, and ZIP Code Business Patterns data) I had to make a couple of serious upgrades.

Spatialite Updates

The first was that is was time for me to update the version of Spatialite I was using, from 2.4 to 4.1, and to update my documentation and tutorial from the Spatialite GUI 1.4 to 1.7. I used the spatialite_convert tool (see the bottom of this page for info)to upgrade and had no problem. There were some major benefits to making the switch. For one, writing statements that utilize spatial indexes is much simpler – this was version 2.4, generating a neighbor list of census tracts:

SELECT tract1.tractid AS tract, tract2.tractid AS neighbor
FROM a_tracts AS tract1, a_tracts AS tract2
WHERE ST_Touches(tract1.geometry, tract2.geometry) AND tract2.ROWID IN (
SELECT pkid FROM idx_a_tracts_geometry
WHERE pkid MATCH RTreeIntersects (MbrMinX(tract1.geometry), MbrMinY(tract1.geometry),
MbrMaxX(tract1.geometry), MbrMaxY(tract1.geometry)))

And here’s the same statement in 4.1 (for zctas instead of tracts):

SELECT zcta1.zcta AS zcta, zcta2.zcta AS neighbor
FROM a_zctas AS zcta1, a_zctas AS zcta2
WHERE ST_Touches(zcta1.geometry, zcta2.geometry)
AND zcta1.rowid IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name=’a_zctas’ AND search_frame=zcta2.geometry)
ORDER BY zcta, neighbor

There are also a number of improvements in the GUI. Tables generated by the user are now grouped under one heading for user data, and the internal tables are grouped under subsequent headings, so that users don’t have to sift through all the objects in the database to see what they need. The import options have improved – with shapefiles and dbfs you can now designate your own primary keys on import. You also have the option of importing Excel spreadsheets of the 97-2003 variety (.xls). In practice, if you want the import to go smoothly you have to designate data types (format-cells) in the Excel sheet (including number of decimal places) prior to importing.

spatialite_gui_17

I was hesitant to make the leap, because version 2.4 was the last version where they made pre-compiled binaries for all operating systems; after that, the only binaries are for MS Windows and for Mac and Linux you have to compile from source – which is daunting for many Mac users that I am ill-equipped to help. But now that Spatialite seems to be more fully integrated with QGIS (you can create databases with Q and using the DB Manager you can export layers to an existing database) I can always steer folks there as an alternative. As for Linux, more distros are including updated version of the GUI in their repositories which makes installation simple.

One of the latest features in Spatialite 4.1.1 is the ability to import XML ISO metadata into the database, where it’s stored as an XML-type blob in a dedicated table. Now that I’m doing more work with metadata this is something I’ll explore for the future.

ZIPs to ZCTAs

The other big change was how the ZIP Code Business Patterns data is represented in the database. The ZBP data is reported for actual ZIP Codes that are taken from the addresses of the business establishments, while the boundaries in the nyc_gdb database are for ZIP Code Tabulation Areas (ZCTAs) from the Census. Previously, the ZBP data in the database only represented records for ZIP Codes that had a matching ZCTA number. As a result, ZIP Codes that lacked a corollary because they didn’t have any meaningful geographic area – the ZIP represented clusters of PO Boxes or large organizations that process a lot of mail – were omitted entirely.

In order to get a more accurate representation of business establishments in the City, I instituted a process to aggregate the ZIP Code data in the ZBP to the ZCTA level. I used the crosswalk table provided by the MCDC which assigns ZIPs to ZCTAs, so those PO Boxes and large institutions are assigned to the ZCTA where they are physically located. I used SQLite to import that crosswalk, imported the ZBP data, joined the two on the ZIP Code and did a group by on the ZCTA to sum employment, establishments, etc. For ZIPs that lacked data due to disclosure regulations, I added some note or flag columns that indicate how many businesses in a ZCTA are missing data. So now the data tables represent records for ZCTAs instead of ZIPs, and they can be joined to the ZCTA features and mapped.

The latest ZBP data in the new database is for 2012. I also went back and performed the same operation on the 2011 and 2010 ZBP data that was included in earlier databases, and have provided that data in CSV format for download in the archives page (in case anyone is using the old data and wants to go back and update it).

Government Shutdown: Alternate Resources for US Census Data

Sunday, October 13th, 2013

As the US government shutdown continues (thanks to a handful of ideological nutcases in congress) those of us who work with and rely on government data are re-learning the lesson of why it’s important to keep copies of things. This includes having alternate sources of information floating around on the web and in the cloud, as well as the tried and true approach of downloading and saving datasets locally. There have been a number of good posts (like this succinct one) to point users to alternatives to the federal sources that many of us rely on. I’ll go into more detail here with my suggestions on where to access US Census data, based on user-level and need.

  • The Social Explorer: this web-mapping resource for depicting and accessing US Census data from 1790 to present (including the 2010 Census and the latest American Community Survey data) is intuitive and user-friendly. Many academic and public libraries subscribe to the premium edition that provides full access to all features and datasets (so check with yours to see if you have access), while a basic free version is available online. Given the current circumstances the Social Explorer team has announced that it will open the hatch and provide free access to users who request it.
  • The NHGIS (National Historic GIS): this project is managed by the Minnesota Population Center and also provides access to all US Census data from 1790 to present. While it’s a little more complex than the Social Exlorer, the NHGIS is the better option for downloading lots of data en-masse, and is the go-to place if you need access to all datasets in their entirety, including all the detail from the American Community Survey (as the Social Explorer does not include margins of error for any of the ACS estimates) or if you need access to other datasets like the County Business Patterns. Lastly – it is the alternative to the TIGER site for GIS users who need shapefiles of census geography. You have to register to use NHGIS, but it’s free. For users who need microdata (decennial census, ACS, Current Population Survey), you can visit a related MPC project to the NHGIS: IPUMS.
  • The Missouri Census Data Center (MCDC): I’ve mentioned a number of their tools in the past; they provide easy-to-access profiles from the 2010 Census and American Community Survey, as well as historical trend reports for the ACS. For intermediate users they provide extract applications for the 2010 Census and ACS for creating spreadsheets and SAS files for download, and for advanced users the Dexter tool for downloading data en-masse from 1980 to present. Unlike the other resources no registration or sign-up is required. I also recommend the MCDC’s ACS and 2010 Census profiles to web designers and web mappers; if you’ve created online resources that tapped directly into the American Factfinder via deep links (like I did), you can use the MCDC’s profiles as an alternative. The links to their profiles are persistent and use a logical syntax (as it looks like there’s no end in site to this shutdown I may make the change-over this week). Lastly, the MCDC is a great resource for technical documentation about geography and datasets.
  • State and local government: thankfully many state and local governments have taken subsets of census data of interest to people in their areas and have recompiled and republished it on the web. These past few weeks I’ve been constantly sending students to the NYC Department of City Planning’s population resources. Take a look at your state data center’s resources, as well as local county or city planning departments, transportation agencies, or economic development offices to see what they provide.

Downloading Data for Small Census Geographies in Bulk

Tuesday, May 7th, 2013

I needed to download block group level census data for a project I’m working on; there was one particular 2010 Census table that I needed for every block group in the US. I knew that the American Factfinder was out – you can only download block group data county by county (which would mean over 3,000 downloads if you want them all). I thought I’d share the alternatives I looked at; as I searched around the web I found many others who were looking for the same thing (i.e. data for the smallest census geographies covering a large area).

The Census FTP site at http://www2.census.gov/

This would be the first logical step, but in the end it wasn’t optimal based on my need. When you drill down through Census 2010, Summary File 1, you see a file for every state and a national file. Initially I thought – great! I’ll just grab the national file. But the national file does NOT contain the small census statistical areas – no tracts, block groups, or blocks. If you want those small areas you have to download the files for each of the states – 51 downloads. When you download the data you can also download an MS Access database, which is an empty shell with the geography and field headers, and you can import each of the text file data tables (there a lot of them for 2010 SF1) into the db and match them to the headers during import (the instructions that were included for doing this were pretty good). This is great if you need every variable in every table for every geography, but I was only interested in one table for one geography. I could just import the one text file with my table, but then I’d have to do this import process 51 times. The alternative is to use some Python to get that one text file for every state into one big file and then do the import once, but I opted for a different route.

The NHGIS at https://www.nhgis.org/

I always recommend this resource to anyone who’s looking for historical census data or boundary files, but it’s also good if you want current data for these small areas. I was able to use their query window to widdle down the selection by dataset (2010 SF1), geography (block groups), and topic (Hispanic origin and race in my case), then I was able to choose the table I needed. On the last screen before download I was able to check a box to include all 50 states plus DC and PR in one file. I had to wait a couple minutes for the request to process, then downloaded the file as a CSV and loaded it into my database. This was the best solution for my circumstances by far – one table for all block groups in the country. If you had to download a lot (or all) of the tables or variables for every block group or block it may take quite awhile, and plugging through all of those menus to select everything would be tedious – if that’s your situation it may be easier to grab everything using the Census FTP.

nhgis

UExplore / Dexter at http://mcdc.missouri.edu/applications/uexplore.shtml

The Missouri Census Data Center’s UExplore / Dexter tool lets you choose a dataset and takes you to a window that resembles a file system, with a ton of files in it. The MCDC takes their extracts directly from the Census, so they’re structured in a similar way to the FTP site as state-based files. They begin with the state prefix and have a name that indicates geography – there are files for block groups, blocks, and one for everything else. There are national files (which don’t contain small census areas) that begin with ‘us’. The difference here is – when you click on a file, it launches a query window that let’s you customize the extract. The interface may look daunting at first, but it’s worth exploring (and there’s a tutorial to help guide you). You can choose from several output formats, specific variables or tables (if you don’t want them all), and there are a bunch of handy options that you can specify like aggregation or percent totals. In addition to the complete datasets, they’ve also created ‘Standard Extracts’ that have the most common variables, if you want just a core subset. While the NHGIS was the best choice for my specific need, the customization abilities in Dexter may fit your needs – and the state-level block group and block data is conveniently broken out from the other files.

Lastly…

There are a few others tools – I’ll give an honorable mention to the Summary File Retrieval tool, which is an Excel plugin that lets you tap directly into the American Community Survey from a spreadsheet. So if you wanted tracts or block groups for a wide area for but a small number of variables (I think 20 is the limit) that could be a winner, provided you’re using Excel 2007 or later and are just looking at the ACS. No dice in my case, as I needed Decennial Census data and use OpenOffice at home.


Copyright © 2017 Gothos. All Rights Reserved.
No computers were harmed in the 0.447 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.