Posts Tagged ‘united states’

Python Geocoding Take 2 – US Addresses

Monday, May 30th, 2016

In my previous post I discussed my recent adventures with geocoding addresses outside the US. In contrast, there are countless options for batch geocoding addresses within the United States. I’ll discuss a few of those options here, but will focus primarily on the US Census Geocoder and a Python script I’ve written to batch match addresses using their API. The code and documentation is available on my lab’s resources page.

A Few Different Options

ESRI’s geocoding services allow you (with an account) to access their geocoding servers through tools in the ArcToolbox, or you can write a script and access them through an API. QGIS has a third-party plugin for accessing Google’s services (2500 records a day free) or the Open Streetmap. You can still do things the old fashioned way, by downloading geocoded street files and creating a matching service.

Alternatively, you can subscribe to any number of commercial or academic services where you can upload a file, do the matching, and download results. For years I’ve used the geocoding services at Texas A&M that allow you to do just that. Their rates are reasonable, or if you’re an academic institution and partner with them (place some links to their service on their website) you can request free credits for doing matches in batches.

The Census Geocoder and API, and a Python Script for Batch Geocoding

The Census Bureau’s TIGER and address files are often used as the foundational layers for building these other services, to which the service providers add refinements and improvements. You can access the Census Bureau’s services directly through the Census Geocoder, where you can match an address one at a time, or you can upload a batch of 1000 records. It returns longitude and latitude coordinates in NAD 83, and you can get names and codes for all the census geographies where the address is located. The service is pretty picky about the structure of the upload file (must be plain text, csv, with an id column and then columns with the address components in a specific order – with no other attributes allowed) but the nice thing is it requires no login and no key. It’s also public domain, so you can do whatever you want with the data you’ve retrieved. A tutorial for using it is available on our lab’s census tutorials page.

census geocoder

They also have an API with some basic documentation. You can match parsed and unparsed addresses, and can even do reverse geocoding. So I took a stab at writing a script to batch process addresses in text-delimited files (csv or txt). Unfortnately, the Census Geocoding API is not one of the services covered by the Python Geocoder that I mentioned in my previous post, but I did find another third party module called censusgeocode which provides a thin wrapper you can use. I incorporated that module into my Python 3 script, which I wrote as a function that takes the following inputs:

census_geocode(datafile,delim,header,start,addcol)
(str,str,str,int,list[int]) -> files

  • datafile – this is the name of the file you want to process (file name and extension). If you place the geocode_census_funct.py file in the same directory as your data file, then you just need to provide the name of the file. Otherwise, you need to provide the full path to the file.
  • delim – this is the delimiter or character that separates the values in your data file. Common delimiters includes commas ‘,’, tabs ‘\t’, and pipes ‘|’.
  • header – here you specify whether your file has a header row, i.e. column names. Enter ‘y’ or ‘yes’ if it does, ‘n’ or ‘no’ if it doesn’t.
  • start – type 0 to specify that you want to start reading the file from the beginning. If you were previously running the script and it broke and exited for some reason, it provides an index number where it stopped reading; if that’s the case you can provide that index number here, to pick up where you left off.
  • addcol – provide a list that indicates the position number of the columns that contain the address components in your data file. For an unparsed address, you provide just one position number. For a parsed address, you provide 4 positions: address, city, state, and ZIP code. Whether you provide 1 or 4, the numbers must be supplied in brackets, as the function requires a Python list.

You can open the script in IDLE, run it to load it into memory, and then type the function with the necessary parameters in the shell to execute it. Some examples:

  • A tab-delimited, unparsed address file with a header that’s stored in the same folder as the script. Start from the beginning and the address is in the 2nd column: census_geocode('my_addresses.txt','\t','y',0,[2])
  • A comma-delimited, parsed address file with no header that’s stored in the same folder as the script. Start from the beginning and the addresses are in the 2nd through 5th columns: census_geocode('addresses_to_match.csv',',','n',0,[2,3,4,5])
  • A comma-delimited, unparsed address file with a header that’s not in the same folder as the script. We ran the file before and it stopped at index 250, so restart there – the address is in the 3rd column: census_geocode('C:\address_data\data1.csv',',','y',250,[3])

The beginning of the script “sets the table”: we read the address columns into variables, create the output files (one for matches, one for non-matches, and a summary report), and we handle whether or not there’s a header row. For reading the file I used Python’s CSV module. Typically I don’t use this module, as I find it’s much simpler to do the basic: read a line in, split it on a delimiter, strip whitespace, read it into a list, etc. But in this case the CSV module allows you to handle a wider array of input files; if the input data was a csv and there happened to be commas embedded in the values themselves, the CSV module easily takes care of it; if you ignore it, the parsing would get thrown off for that record.

Handling Exceptions and Server Errors

In terms of expanding my skills, the new things I had to learn were exception handling and control flows. Since the censusgeocoding module is a thin wrapper, it had no built in mechanism for retrying a match a certain number of times if the server timed out. This is an absolute necessity, because the census server often times out, is busy, or just hiccups, returning a generic error message. I had already learned how to handle crashes in my earlier geocoding experiments, where I would write the script to match and write a record one by one as it went along. It would try to do a match, but if any error was raised, it would exit that loop cleanly, write a report, and all would be saved and you could pick up where you left off. But in this case, if that server non-response error was returned I didn’t want to give up – I wanted to keep trying.

So on the outside there is a loop to try and do a match, unless any error happens, then exit the loop cleanly and wrap up. But inside there is another try loop, where we try to do a match but if we get that specific server error, continue: go back to the top of that for loop and try again. That loop begins with While True – if we successfully get to the end, then we start with the next record. If we get that server error we stay in that While loop and keep trying until we get a match, or we run out of tries (5) and write as a non-match.

error handling

In doing an actual match, the script does a parsed or unparsed match based on user input. But there was another sticking point; in some instances the API would return a matched result (we got coordinates!), but some of the objects that it returned were actually errors because of some java problem (failed to get the tract number or county name – here’s an error message instead!) To handle this, we have a for i in range loop. If we have a matched record and we don’t have a status message (that indicates an error) then we move along and grab all the info we need – the coordinates, and all the census geography where that coordinate falls, and write it out, and then that for loop ends with a break. But if we receive an error message we continue – go back to the top of that loop and try doing the match again. After 3 tries we give up and write no match.

Figuring all that out took a while – where do these loops go and what goes in them, how do I make sure that I retry a record rather than passing over it to the next one, etc. Stack Exchange to the rescue! Difference between continue, pass and break, returning to the beginning of a loop, breaking out of a nested loop, and retrying after an exception. The rest is pretty straightforward. Once the matching is done we close the files, and write out a little report that tells us how many matches we got versus fails. The Census Geocoder via the API is pretty unforgiving; it either finds a match, or it doesn’t. There is no match score or partial matching, and it doesn’t give you a ZIP Code or municipal centroid if it can’t find the address. It’s all or nothing; if you have partial or messy addresses or PO Boxes, it’s pretty much guaranteed that you won’t get matches.

There’s no limit on number of matches, but I’ve built in a number of pauses so I’m not hammering the server too hard – one second after each match, 5 seconds after every 1000 matches, a couple seconds before retrying after an error. Your mileage will vary, but the other day I did about 2500 matches in just under 2 hours. Their server can be balky at times – in some cases I’ve encountered only a couple problems for every 100 records, but on other occasions there were hang-ups on every other record. For diagnostic purposes the script prints every 100th record to the screen, as well as any problems it encountered (see pic below). If you launch a process and notice the server is hanging on every other record and repeatedly failing to get matches, it’s probably best to bail out and come back later. Recently, I’ve noticed fewer problems during off-peak times: evenings and weekends.

Script Running

Wrap Up

The script and the documentation are posted on our labs resources page, for all to see and use – you just have to install the third party censusgeocode module before using it. When would you want to use this? Well, if you need something that’s free, this is a good choice. If you have batches in the 10ks to do, this would be a good solution. If you’re in the 100ks, it could be a feasible solution – one of my colleagues has confirmed that he’s used the script to match about 40k addresses, so the service is up to the task for doing larger jobs.

If you have less than a couple thousand records, you might as well use their website and upload files directly. If you’re pushing a million or more – well, you’ll probably want to set up something locally. PostGIS has a TIGER module that lets you do desktop matching if you need to go into the millions, or you simply have a lot to do on a consistent basis. The excellent book PostGIS in Action has a chapter dedicated to to this.

In some cases, large cities or counties may offer their own geocoding services, and if you know you’re just going to be doing matches for your local area those sources will probably have greater accuracy, if they’re adding value with local knowledge. For example, my results with NYC’s geocoding API for addresses in the five boroughs are better than the Census Bureau’s and is customized for local quirks; for example, I can pass in a borough name instead of a postal city and ZIP Code, and it’s able to handle those funky addresses in Queens that have dashes and similar names for multiple streets (35th st, 35th ave, 35th dr…). But for a free, public domain service that requires no registration, no keys, covers the entire country, and is the foundation for just about every US geocoding platform out there, the Census Geocoder is hard to beat.

Average Distance to Public Libraries in the US

Monday, February 22nd, 2016

A few months ago I had a new article published in LISR, but given the absurd restrictions of academic journal publishing I’m not allowed to publicly post the article, and have to wait 12 months before sharing my post-print copy. It is available via your local library if they have a subscription to the Science Direct database (you can also email me to request a copy). I am sharing some of the un-published state-level data that was generated for the project here.

Citation and Abstract

Regional variations in average distance to public libraries in the United States
F. Donnelly
Library & Information Science Research
Volume 37, Issue 4, October 2015, Pages 280–289
http://dx.doi.org/10.1016/j.lisr.2015.11.008

Abstract

“There are substantive regional variations in public library accessibility in the United States, which is a concern considering the civic and educational roles that libraries play in communities. Average population-weighted distances and the total population living within one mile segments of the nearest public library were calculated at a regional level for metropolitan and non-metropolitan areas, and at a state level. The findings demonstrate significant regional variations in accessibility that have been persistent over time and cannot be explained by simple population distribution measures alone. Distances to the nearest public library are higher in the South compared to other regions, with statistically significant clusters of states with lower accessibility than average. The national average population-weighted distance to the nearest public library is 2.1 miles. While this supports the use of a two-mile buffer employed in many LIS studies to measure library service areas, the degree of variation that exists between regions and states suggests that local measures should be applied to local areas.”

Purpose

I’m not going to repeat all the findings, but will provide some context.

As a follow-up to my earlier work, I was interested in trying an alternate approach for measuring public library spatial equity. I previously used the standard container approach – draw a buffer at some fixed distance around a library and count whether people are in or out, and as an approximation for individuals I used population centroids for census tracts. In my second approach, I used straight-line distance measurements from census block groups (smaller than tracts) to the nearest public library so I could compare average distances for regions and states; I also summed populations for these areas by calculating the percentage of people that lived within one-mile rings of the nearest library. I weighted the distances by population, to account for the fact that census areas vary in population size (tracts and block groups are designed to fall within an ideal population range – for block groups it’s between 600 and 3000 people).

Despite the difference in approach, the outcome was similar. Using the earlier approach (census tract centroids that fell within a library buffer that varied from 1 to 3 miles based on urban or rural setting), two-thirds of Americans fell within a “library service area”, which means that they lived within a reasonable distance to a library based on standard practices in LIS research. Using the latest approach (using block group centroids and measuring the distance to the nearest library) two-thirds of Americans lived within two miles of a public library – the average population weighted distance was 2.1 miles. Both studies illustrate that there is a great deal of variation by geographic region – people in the South consistently lived further away from public libraries compared to the national average, while people in the Northeast lived closer. Spatial Autocorrelation (LISA) revealed a cluster of states in the South with high distances and a cluster in the Northeast with low distances.

The idea in doing this research was not to model actual travel behavior to measure accessibility. People in rural areas may be accustomed to traveling greater distances, public transportation can be a factor, people may not visit the library that’s close to their home for several reasons, measuring distance along a network is more precise than Euclidean distance, etc. The point is that libraries are a public good that provide tangible benefits to communities. People that live in close proximity to a public library are more likely to reap the benefits that it provides relative to those living further away. Communities that have libraries will benefit more than communities that don’t. The distance measurements serve as a basic metric for evaluating spatial equity. So, if someone lives more than six miles away from a library that does not mean that they don’t have access; it does means they are less likely to utilize it or realize it’s benefits compared to someone who lives a mile or two away.

Data

I used the 2010 Census at the block group level, and obtained the location of public libraries from the 2010 IMLS. I improved the latter by geocoding libraries that did not have address-level coordinates, so that I had street matches for 95% of the 16,720 libraries in the dataset. The tables that I’m providing here were not published in the original article, but were tacked on as supplementary material in appendices. I wanted to share them so others could incorporate them into local studies. In most LIS research the prevailing approach for measuring library service areas is to use a buffer of 1 to 2 miles for all locations. Given the variation between states, if you wanted to use the state-average for library planning in your own state you can consider using these figures.

To provide some context, the image below shows public libraries (red stars) in relation to census block group centroids (white circles) for northern Delaware (primarily suburban) and surrounding areas (mix of suburban and rural). The line drawn between the Swedesboro and Woodstown libraries in New Jersey is 5-miles in length. I used QGIS and Spatialite for most of the work, along with Python for processing the data and Geoda for the spatial autocorrelation.

Map Example - Northern Delaware

The three tables I’m posting on the resources page are for states: one counts the 2010 Census population within one to six mile rings of the nearest public library, the second is the percentage of the total state population that falls within that ring, and the third is a summary table that calculates the mean and population-weighted distance to the nearest library by state. One set of tables is formatted text (for printing or just looking up numbers) while the other set are CSV files that you can use in a spreadsheet. I’ve included a metadata record with some methodological info, but you can read the full details in the article.

In the article itself I tabulated and examined data at a broader, regional level (Northeast, Midwest, South, and West), and also broke it down into metropolitan and non-metropolitan areas for the regions. Naturally people that live in non-metropolitan areas lived further away, but the same regional patterns existed: more people in the South in both metro and non-metro areas lived further away compared to their counterparts in other parts of the country. This weekend I stumbled across this article in the Washington Post about troubles in the Deep South, and was struck by how these maps mirrored the low library accessibility maps in my past two articles.

Review of The Census Reporter

Monday, February 8th, 2016

Picking up where I left off from my previous post (gee – welcome to 2016!) I thought I’d give a brief review of another census resource, The Census Reporter at http://censusreporter.org/.

The Census Reporter was created to make it easier for journalists to write stories using census data. To that end, they’ve created a really slick and easy to use web site that makes the data accessible and fun to explore. From the homepage you have three ways of diving into the data: you can pull up a profile by typing in the name of a place, you can enter an address and explore places that contain that address, or you can explore tables by topic.

Census Reporter Homepage

First, the place-based approach. You can type in a named place, like a state, county, or a census place (incorporated cities and towns, or census designated places) to get started. This will give you a selection of data from the most recent release of the American Community Survey. For larger areas where the data is available, it gives you 1-year ACS data by default; otherwise you get the latest 5-year data.

You’re presented with a map of the location at the top, and a series of attractive looking graphs and charts sorted by the demographic profile table source – social, economic, housing, and demographic. If you hover over a data point in a table it gives you some geographic context by comparing this place’s value with that of larger places where it’s contained. For example, if I search for Philadelphia I can hover over the chart to get the value for the Philly metro area and the State of Pennsylvania. I can click a link below each chart to open the full table, which includes both estimates and margins of error. There are small links for viewing the table by itself on a separate page (which also gives you the ability to download it) and for embedding the chart in a website.

Census Reporter Chart and Table

Viewing the table gives you additional options, like adding additional places for comparison, or subdividing the place into smaller areas for comparison. So if I’m looking at Philadelphia, I can break it down into tracts, block groups, or ZIP Codes. From there I can toggle away from the table view to view a map or a distribution bar to explore that variable by individual geographies.

View Data by Sub-Areas, Distribution Bars

The place-based search is great at allowing you to drill down either by topic or by these smaller geographies. But if you wanted to access a fuller range of geographies like congressional districts or PUMAs, it seems easier to do an address-based search. Back on the homepage, selecting the address button and typing in an address brings you to a map with the address pin-pointed, and on the left you can choose any geography that encloses that address. Once you do that, you get a profile for that geography and can start doing the same sorts of operations for changing the topics or tables, or adding or subdividing geographies for comparison.

Address Search

The topic-based search lets you search just by topic and then figure out the geography piece later. Of the three types of searches this one is the toughest, given the sheer number of tables and cross-tabulations. You can click on a link for a general topic to narrow things down a bit before beginning a search.

In downloading the data you have a variety of useful options: CSV, Excel, GeoJSON, KML, and shapefiles. So in theory you can download data that’s readily mappable – in practice I wasn’t able to download a shapefile, but could grab a KML or GeoJson and was able to visualize it in QGIS. One challenge in downloading any of the files is that the column names use the identifier codes, and the actual names of the variables aren’t included in the download format you choose – they’re included in a json file. So you can use that for reference, but it can’t be readily incorporated into the table.

So – where would this resource fall within the pantheon of US Census data resources? I think it’s great for accessing and, especially, visualizing profiles (profile = lots of data for one place) from the most recent ACS releases. It’s easy to use and succeeds at making the data interesting; for that reason I certainly would incorporate it into undergraduate courses where I’m introducing data. The ability to embed the charts into websites is certainly a bonus, and they deserve a big thumbs up for incorporating the margin of error data, rather than hiding or discarding it like other resources do.

The ability to create and view comparison tables (comparison = one piece of data for many places) is good – select an area and then break it down – but not as strong as the profile options. If you want to get a profile for a non-named place like a tract, ZIP Code, or PUMA you can’t do that from the profile search. You can do an address search and back out (if you know an address for that you’re interested in) or you can drill down by topic, which lets you search by summary area in addition to named places.

For users who need to download a lot of data, or for folks who need datasets that aren’t the most recent ACS release, this resource isn’t the place to go. The focus here is on providing the data in an easy and compelling way, as is. In viewing the profiles, it’s not clear if you can choose 5-year data over 1-year data for places where both datasets are available – even for large geographic areas with high population, sometimes it’s preferable to use the 5-year data to take advantage of the smaller margins of error. I also didn’t see an option for choosing decennial census data.

In short, this resource is well-designed and definitely worth exploring. It seems clear why this would be a go-to source for journalists, but it can be for many others as well.

Census Proposes to Cut 3-year ACS in Fiscal 2016

Friday, February 6th, 2015

I’m coming out of my blog hibernation for this announcement – the US Census Bureau is proposing that they drop the 3-year series of the American Community Survey in fiscal year 2016. A colleague mentioned that he overheard this at a meeting yesterday. Searching the web, I found a post at the Free Government Information site which points to this Census Bureau Press release. The press release cites the predictable reasons (budget constraints, funding priorities, etc.) for dropping the series. Oddly, the news comes through some random site and not through the Census Bureau’s website, where there’s no mention of it. I saw that Stanford also had a post, where they shared the same press release.

I kept searching for some definitive proof, and through someone’s tweet I found a link to a PDF of the US Census Bureau’s Budget Estimates for Fiscal Year 2016, presented to Congress this February 2015. I found confirmation buried on page CEN – 106 (the 100th page in a 190 page doc):

Data Products

Restoration of ACS Data Products ($1.5 million): Each year, the ACS releases a wide range of data products widely used by policymakers, Federal, state and local governments, businesses and the public to make decisions on allocation of taxpayer-funds, the location of businesses and the placement of products, emergency management plans, and a host of other matters. Resource constraints have led to the cancellation of data products for areas with populations between 20 and 60 thousand based on 3-year rolling averages of ACS data (known as the “3-Year Data” Product).They have also resulted in delays in the release of the 1- and 5- year Public Use Macro Sample (PUMS) data files and canceled the release of the 5- year Comparison Profile data product and the Spanish Translation of the 1- and 5- year Puerto Rico data products.

The Census Bureau proposes to terminate permanently the 3-Year Data Product. The Census Bureau intended to produce this data product for a few years when the ACS was a new survey. Now that the ACS has collected data for nearly a decade, this product can be discontinued without serious impacts on the availability of the estimates for these communities.

The ACS would like to restore the timely release of the other essential products in FY2016. The continued absence of these data products will impact the availability of data – especially for Puerto Rico – to public and private sector decision makers.

So at this point it’s still just a proposal. The benefits, besides the ability to release other datasets in a timely fashion, would be simplification for users. Instead of choosing between three datasets now there will only be two – the one year and the five year. You choose the one year for large areas and the five year for every place else. In terms of disadvantages, consider this example – here are the number of children enrolled in nursery school in NY State PUMA 03808, which covers Murray Hill, Gramercy, and Stuyvesant Town in the eastern half of Midtown Manhattan:

PUMA NY 03808

Population Over 3 Years Old Enrolled in Nursery / Pre-school

  • 1 year 2013: 1,166 +/- 609
  • 3 year 2011-2013: 1,549 +/- 530
  • 5 year 2009-2013: 1,819 +/- 409

Since PUMAs are statistical areas built to contain 100k people, data for all of them is available in each series. Like all the ACS estimates these have a 90% confidence interval. Look at the data for the 1-year series. The margin of error (ME) is so large that’s it’s approximately 50% of the estimate, which in my opinion makes it worthless for just about any application. The estimate itself is much lower than the estimate for the other two series. It’s true that it’s only capturing the latest year, but administrative data and news reports suggest that the number of nursery school children in the district that covers this area has been relatively stable over time, with modest increases (geographically the district covers an area much larger than this PUMA). This suggests that the estimate itself is not so great.

The 5 year estimate may be closer to reality, and its ME is only 20% of the estimate. But it covers five years in time. If you wanted something that was a compromise – more timely than the five year but with a lower ME than the one year, then the three year series was your choice, in this case with an ME that’s about 33% of the estimate. But under this proposal, this choice goes away and you have to make do with either 1-year estimates (which will be lousy for geographies that aren’t far above the 65k population threshold, and lousy for small population groups where ever they are located), or better 5-year estimates that cover a greater time span.

Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by any one who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

The citation and link:

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I’ve disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

Government Shutdown: Alternate Resources for US Census Data

Sunday, October 13th, 2013

As the US government shutdown continues (thanks to a handful of ideological nutcases in congress) those of us who work with and rely on government data are re-learning the lesson of why it’s important to keep copies of things. This includes having alternate sources of information floating around on the web and in the cloud, as well as the tried and true approach of downloading and saving datasets locally. There have been a number of good posts (like this succinct one) to point users to alternatives to the federal sources that many of us rely on. I’ll go into more detail here with my suggestions on where to access US Census data, based on user-level and need.

  • The Social Explorer: this web-mapping resource for depicting and accessing US Census data from 1790 to present (including the 2010 Census and the latest American Community Survey data) is intuitive and user-friendly. Many academic and public libraries subscribe to the premium edition that provides full access to all features and datasets (so check with yours to see if you have access), while a basic free version is available online. Given the current circumstances the Social Explorer team has announced that it will open the hatch and provide free access to users who request it.
  • The NHGIS (National Historic GIS): this project is managed by the Minnesota Population Center and also provides access to all US Census data from 1790 to present. While it’s a little more complex than the Social Exlorer, the NHGIS is the better option for downloading lots of data en-masse, and is the go-to place if you need access to all datasets in their entirety, including all the detail from the American Community Survey (as the Social Explorer does not include margins of error for any of the ACS estimates) or if you need access to other datasets like the County Business Patterns. Lastly – it is the alternative to the TIGER site for GIS users who need shapefiles of census geography. You have to register to use NHGIS, but it’s free. For users who need microdata (decennial census, ACS, Current Population Survey), you can visit a related MPC project to the NHGIS: IPUMS.
  • The Missouri Census Data Center (MCDC): I’ve mentioned a number of their tools in the past; they provide easy-to-access profiles from the 2010 Census and American Community Survey, as well as historical trend reports for the ACS. For intermediate users they provide extract applications for the 2010 Census and ACS for creating spreadsheets and SAS files for download, and for advanced users the Dexter tool for downloading data en-masse from 1980 to present. Unlike the other resources no registration or sign-up is required. I also recommend the MCDC’s ACS and 2010 Census profiles to web designers and web mappers; if you’ve created online resources that tapped directly into the American Factfinder via deep links (like I did), you can use the MCDC’s profiles as an alternative. The links to their profiles are persistent and use a logical syntax (as it looks like there’s no end in site to this shutdown I may make the change-over this week). Lastly, the MCDC is a great resource for technical documentation about geography and datasets.
  • State and local government: thankfully many state and local governments have taken subsets of census data of interest to people in their areas and have recompiled and republished it on the web. These past few weeks I’ve been constantly sending students to the NYC Department of City Planning’s population resources. Take a look at your state data center’s resources, as well as local county or city planning departments, transportation agencies, or economic development offices to see what they provide.

Article on Working With the American Community Survey

Monday, June 17th, 2013

I’ve got another article that’s just hit the presses. In this one I discuss the American Community Survey: how it differs from the Decennial Census, when you should use it versus other summary data sets, how to work with the different period estimates, and how to create derived estimates and calculate their margins of error. For that last piece I’ve essentially done an extended version of this old post on Excel formulas, with several different and updated examples.

The article is available via Emerald’s journal database. If you don’t have access to it from your library feel free to contact me and I’ll send you a copy (can’t share this one freely online).

Title: The American Community Survey: practical considerations for researchers
Author(s): Francis P. Donnelly
Citation: Francis P. Donnelly, (2013) “The American Community Survey: practical considerations for researchers”, Reference Services Review, Vol. 41 Iss: 2, pp.280 – 297
Keywords: American Community Survey, Census, Census geography, Data handling, Decennial census, Demographic data, Government data processing, Government information, Margins of error, Sample-based data, United States of America, US Census Bureau
Article type: Technical paper
DOI: 10.1108/00907321311326228 (Permanent URL)
Publisher: Emerald Group Publishing Limited

Downloading Data for Small Census Geographies in Bulk

Tuesday, May 7th, 2013

I needed to download block group level census data for a project I’m working on; there was one particular 2010 Census table that I needed for every block group in the US. I knew that the American Factfinder was out – you can only download block group data county by county (which would mean over 3,000 downloads if you want them all). I thought I’d share the alternatives I looked at; as I searched around the web I found many others who were looking for the same thing (i.e. data for the smallest census geographies covering a large area).

The Census FTP site at http://www2.census.gov/

This would be the first logical step, but in the end it wasn’t optimal based on my need. When you drill down through Census 2010, Summary File 1, you see a file for every state and a national file. Initially I thought – great! I’ll just grab the national file. But the national file does NOT contain the small census statistical areas – no tracts, block groups, or blocks. If you want those small areas you have to download the files for each of the states – 51 downloads. When you download the data you can also download an MS Access database, which is an empty shell with the geography and field headers, and you can import each of the text file data tables (there a lot of them for 2010 SF1) into the db and match them to the headers during import (the instructions that were included for doing this were pretty good). This is great if you need every variable in every table for every geography, but I was only interested in one table for one geography. I could just import the one text file with my table, but then I’d have to do this import process 51 times. The alternative is to use some Python to get that one text file for every state into one big file and then do the import once, but I opted for a different route.

The NHGIS at https://www.nhgis.org/

I always recommend this resource to anyone who’s looking for historical census data or boundary files, but it’s also good if you want current data for these small areas. I was able to use their query window to widdle down the selection by dataset (2010 SF1), geography (block groups), and topic (Hispanic origin and race in my case), then I was able to choose the table I needed. On the last screen before download I was able to check a box to include all 50 states plus DC and PR in one file. I had to wait a couple minutes for the request to process, then downloaded the file as a CSV and loaded it into my database. This was the best solution for my circumstances by far – one table for all block groups in the country. If you had to download a lot (or all) of the tables or variables for every block group or block it may take quite awhile, and plugging through all of those menus to select everything would be tedious – if that’s your situation it may be easier to grab everything using the Census FTP.

nhgis

UExplore / Dexter at http://mcdc.missouri.edu/applications/uexplore.shtml

The Missouri Census Data Center’s UExplore / Dexter tool lets you choose a dataset and takes you to a window that resembles a file system, with a ton of files in it. The MCDC takes their extracts directly from the Census, so they’re structured in a similar way to the FTP site as state-based files. They begin with the state prefix and have a name that indicates geography – there are files for block groups, blocks, and one for everything else. There are national files (which don’t contain small census areas) that begin with ‘us’. The difference here is – when you click on a file, it launches a query window that let’s you customize the extract. The interface may look daunting at first, but it’s worth exploring (and there’s a tutorial to help guide you). You can choose from several output formats, specific variables or tables (if you don’t want them all), and there are a bunch of handy options that you can specify like aggregation or percent totals. In addition to the complete datasets, they’ve also created ‘Standard Extracts’ that have the most common variables, if you want just a core subset. While the NHGIS was the best choice for my specific need, the customization abilities in Dexter may fit your needs – and the state-level block group and block data is conveniently broken out from the other files.

Lastly…

There are a few others tools – I’ll give an honorable mention to the Summary File Retrieval tool, which is an Excel plugin that lets you tap directly into the American Community Survey from a spreadsheet. So if you wanted tracts or block groups for a wide area for but a small number of variables (I think 20 is the limit) that could be a winner, provided you’re using Excel 2007 or later and are just looking at the ACS. No dice in my case, as I needed Decennial Census data and use OpenOffice at home.

The Geography of US Public Libraries

Monday, March 18th, 2013

Last month my article on the geographic disribution of US public libraries was pre-published online in JOLIS, with a print date pending. I can’t share this one freely on-line, so if you don’t have access via a library database (Sage Journals, ERIC, various LIS dbs) you can contact me if you’re interested in getting a copy.

Title: The geographic distribution of United States public libraries : An analysis of locations and service areas
Author: Francis P. Donnelly
Journal: Journal of Librarianship and Information Science (JOLIS)
Year: 2014, Volume 46, Issue 2, Pages 110-129
ISSN: 0961-0006
DOI: 10.1177/0961000612470276
Publisher: Sage

Abstract

This article explores the geography of public libraries in the United States. The distribution of libraries is examined using maps and spatial statistics to analyze spatial patterns. Methods for delineating and studying library service areas used in previous LIS research are modified and applied using geographic information systems to study variations in library accessibility by state and by socio-economic group at the national level. A history of library development is situated within the broader economic and demographic history of the US to provide insight to contemporary patterns, and Louis Round Wilson’s Geography of Reading is used as a focal point for studying historical trends. Findings show that while national library coverage is extensive, the percentage of the population that lives in a library’s geographic service area varies considerably by region and state, with Southern and Western states having lower values than Northeastern and Midwestern states.

Keywords

Geographic information systems, geography, public libraries, service areas, spatial equity, United States

This OCLC flier (How Public Libraries Stack Up) piqued my interest in public libraries as community resources, public goods, and placemaking institutions. If the presence of a public library brings so much value to a community, then by extension the lack of a public library could leave a community at a disadvantage. This led to the next set of logical questions: how are libraries distributed across the country, and which people and communties are being served and which aren’t?

I took a few different approaches to answer these questions. The first approach was to use (and learn) spatial statistics so the overall distribution could be characterized, and the second was to use spatial extraction methods to select census areas and populations that were within the service areas of each library, to see differences in how populations were served and to study these differences across different states. The LIS literature is rich with research that uses GIS to study library use, so I provide a thorough summary of what’s come before. Then after I had the results I spent a good deal of time researching how the contemporary pattern came to be, and coupled the research on the history of public libraries with the broader history of urban and economic development in the United States.

I had a few unstated motives – one of them was to learn spatial statistics, with the help of: OpenGeoda and its documentation, this excellent book on Spatial Data Analysis (for theory), these great examples from Spatial Justice, and invaluable advice from Deborah Balk, a Professor of Spatial Demography with the CUNY Institute for Demographic Research.

One of my other goals was to use only open source software – QGIS, GRASS, and OpenGeoda, which was also a success. Although in my next study I’ll probably rely on QGIS and Spatialite; I found I was doing a lot of attribute data crunching using the SQLite Manager, since the attributes of GRASS vectors can be stored in SQLite, and I could probably save time (and frustration) by using Spatialite’s features instead. I did get to learn a lot about GRASS, but for my purposes it was overkill and I would have been just fine with a spatial database. I was definetely able to sharpen my Python skills, as processing the American Community Survey data for every census tract in the US manually would have been crazy.

In a project this size there are always some pieces that end up on the cutting room floor, so I thought I’d share one here – a dot map that shows where all 16,700 public libraries are. In the article I went with a county choropleth map to show the distribution, because I was doing other county-level stuff and because the dimension restrictions on the graphic made it a more viable option. The dot map reveals that libraries are typically where people are, except that the south looks emptier and the midwest fuller than it should be, if libraries were in fact evenly distributed by population. As my research shows – they’re not.

US Public Libraries

ACS Trend Reports and Census Geography Guide

Sunday, February 12th, 2012

I recently received my first question from someone who wanted to compare 2005-2007 ACS data with 2008-2010. With the release of the latter, we can make historical comparisons with the three year data for the first time since we have estimates that don’t overlap. We should be able to make some interesting comparisons, since the first set covers the real estate boom years (remember those?) and the second covers the Great Recession. One resource that makes such comparisons relatively painless is over at the Missouri Census Data Center. They’ve put together a really clean and simple interface called the ACS Trends Menu, which allows you to select either two one period estimates or two three period estimates and compare them for several different census geographies – states, counties, MCDs, places, metros, Congressional Districts, PUMAs, and a few others – for the entire US (not just Missouri). The end result is a profile that groups data into the Economic, Demographic, Social, and Housing categories that the Census uses for its Demographic Profile tables. The calculations for change and percent change for the estimates and margins of error are done for you.

Downloading the data is not as straightforward – the links to extract it just brought me some error messages, so it’s still a work in progress. Until then, a simple copy and paste into your spreadsheet of choice will work fine.

ACS Trends Menu

If you like the interface, they’ve created separate ones for downloading profiles from any of the ACS periods or from the 2010 Census. The difference here is that you’re looking at one time frame; not across time periods. The interface and the output are the same, but in these menus you can compare four different geographies at once in one profile. Unlike the Trends reports, both the ACS and 2010 Census profiles have easy, clear cut ways to download the profiles as a PDF or a spreadsheet. If you’re happy with data in a profile format and want an interface that’s a little less confusing to navigate than the American Factfinder, these are all great alternatives (and if you’re building web applications these profiles are MUCH easier to work with – you can easily build permanent links or generate them on the fly).

The US Census Bureau also recently put together a great resource called the Guide to State and Local Census Geography. They provide a census geography overview of each state: 2010 population, land area, bordering states, year of entry into the union, population centroids, and a description of how local government is organized in the state – (i.e. do they have municipal civil divisions or only incorporated cities and unincorporated land, etc). You get counts for every type of geography – how many counties, tracts, ZCTAs, and so on, AND best of all you can download all of this data directly in tab delimited files. Need a list of every county subdivision in a state, with codes, land area, and coordinates? No problem – it’s all there.


Copyright © 2017 Gothos. All Rights Reserved.
No computers were harmed in the 0.460 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.