Posts Tagged ‘zctas’

Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by any one who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

The citation and link:

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21).

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I’ve disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

NYC Geodatabase Updates: Spatialite Upgrade & ZIPs to ZCTAs

Wednesday, July 30th, 2014

I released the latest version of the NYC geodatabase (nyc_gdb) a few weeks ago. In addition to simply updating the data (for subway stations and ridership, city point features, and ZIP Code Business Patterns data) I had to make a couple of serious upgrades.

Spatialite Updates

The first was that is was time for me to update the version of Spatialite I was using, from 2.4 to 4.1, and to update my documentation and tutorial from the Spatialite GUI 1.4 to 1.7. I used the spatialite_convert tool (see the bottom of this page for info)to upgrade and had no problem. There were some major benefits to making the switch. For one, writing statements that utilize spatial indexes is much simpler – this was version 2.4, generating a neighbor list of census tracts:

SELECT tract1.tractid AS tract, tract2.tractid AS neighbor
FROM a_tracts AS tract1, a_tracts AS tract2
WHERE ST_Touches(tract1.geometry, tract2.geometry) AND tract2.ROWID IN (
SELECT pkid FROM idx_a_tracts_geometry
WHERE pkid MATCH RTreeIntersects (MbrMinX(tract1.geometry), MbrMinY(tract1.geometry),
MbrMaxX(tract1.geometry), MbrMaxY(tract1.geometry)))

And here’s the same statement in 4.1 (for zctas instead of tracts):

SELECT zcta1.zcta AS zcta, zcta2.zcta AS neighbor
FROM a_zctas AS zcta1, a_zctas AS zcta2
WHERE ST_Touches(zcta1.geometry, zcta2.geometry)
AND zcta1.rowid IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name=’a_zctas’ AND search_frame=zcta2.geometry)
ORDER BY zcta, neighbor

There are also a number of improvements in the GUI. Tables generated by the user are now grouped under one heading for user data, and the internal tables are grouped under subsequent headings, so that users don’t have to sift through all the objects in the database to see what they need. The import options have improved – with shapefiles and dbfs you can now designate your own primary keys on import. You also have the option of importing Excel spreadsheets of the 97-2003 variety (.xls). In practice, if you want the import to go smoothly you have to designate data types (format-cells) in the Excel sheet (including number of decimal places) prior to importing.


I was hesitant to make the leap, because version 2.4 was the last version where they made pre-compiled binaries for all operating systems; after that, the only binaries are for MS Windows and for Mac and Linux you have to compile from source – which is daunting for many Mac users that I am ill-equipped to help. But now that Spatialite seems to be more fully integrated with QGIS (you can create databases with Q and using the DB Manager you can export layers to an existing database) I can always steer folks there as an alternative. As for Linux, more distros are including updated version of the GUI in their repositories which makes installation simple.

One of the latest features in Spatialite 4.1.1 is the ability to import XML ISO metadata into the database, where it’s stored as an XML-type blob in a dedicated table. Now that I’m doing more work with metadata this is something I’ll explore for the future.


The other big change was how the ZIP Code Business Patterns data is represented in the database. The ZBP data is reported for actual ZIP Codes that are taken from the addresses of the business establishments, while the boundaries in the nyc_gdb database are for ZIP Code Tabulation Areas (ZCTAs) from the Census. Previously, the ZBP data in the database only represented records for ZIP Codes that had a matching ZCTA number. As a result, ZIP Codes that lacked a corollary because they didn’t have any meaningful geographic area – the ZIP represented clusters of PO Boxes or large organizations that process a lot of mail – were omitted entirely.

In order to get a more accurate representation of business establishments in the City, I instituted a process to aggregate the ZIP Code data in the ZBP to the ZCTA level. I used the crosswalk table provided by the MCDC which assigns ZIPs to ZCTAs, so those PO Boxes and large institutions are assigned to the ZCTA where they are physically located. I used SQLite to import that crosswalk, imported the ZBP data, joined the two on the ZIP Code and did a group by on the ZCTA to sum employment, establishments, etc. For ZIPs that lacked data due to disclosure regulations, I added some note or flag columns that indicate how many businesses in a ZCTA are missing data. So now the data tables represent records for ZCTAs instead of ZIPs, and they can be joined to the ZCTA features and mapped.

The latest ZBP data in the new database is for 2012. I also went back and performed the same operation on the 2011 and 2010 ZBP data that was included in earlier databases, and have provided that data in CSV format for download in the archives page (in case anyone is using the old data and wants to go back and update it).

NYC Geodatabase in Spatialite

Wednesday, February 6th, 2013

I spent much of the fall semester and winter interim compiling and creating the NYC geodatabase (nyc_gdb), a desktop geodatabase resource for doing basic mapping and analysis at a neighborhood level – PUMAs, ZIP Codes / ZCTAs, and census tracts. There were several motivations for doing this. First and foremost, as someone who is constantly introducing new people to GIS it’s a pain sending people to a half dozen different websites to download shapefiles and process basic features and data before actually doing a project. By creating this resource I hoped to lower the hurdles a bit for newcomers; eventually they still need to learn about the original sources and data processing, but this gives them a chance to experiment and see the possibilities of GIS before getting into nitty gritty details.

Second, for people who are already familiar with GIS and who have various projects to work on (like me) this saves a lot of duplicated effort, as the db provides a foundation to build on and saves the trouble of starting from scratch each time.

Third, it gave me something new to learn and will allow me to build a second part to my open source GIS workshops. I finally sat down and hammered away with Spatialite (went through the Spatialite Cookbook from start to finish) and learned spatial SQL, so I could offer a resource that’s open source and will compliment my QGIS workshop. I was familiar with the Access personal geodatabases in ArcGIS, but for the most part these serve as simple containers. With the ability to run all the spatial SQL operations, Spatialite expands QGIS functionality, which was something I was really looking for.

My original hope was to create a server-based PostGIS database, but at this point I’m not set up to do that on my campus. I figured Spatialite was a good alternative – the basic operations and spatial SQL commands are relatively the same, and I figured I could eventually scale up to PostGIS when the time comes.

I also created an identical, MS Access version of the database for ArcGIS users. Once I got my features in Spatialite I exported them all out as shapefiles and imported them all via ArcCatalog – not too arduous as I don’t have a ton of features. I used the SQLite ODBC driver to import all of my data tables from SQLite into Access – that went flawlessly and was a real time saver; it just took a little bit of time to figure out how to set up (but this blog post helped).

The databases are focused on NYC features and resources, since that’s what my user base is primarily interested in. I purposefully used the Census TIGER files as the base, so that if people wanted to expand the features to the broader region they easily could. I spent a good deal of time creating generalized layers, so that users would have the primary water / coastline and large parks and wildlife areas as reference features for thematic maps, without having every single pond and patch of grass to clutter things up. I took several features (schools, subway stations, etc) from the City and the MTA that were stored in tables and converted them to point features so they’re readily useable.

Given that focus, it’s primarily of interest to NYC folks, but I figured it may be useful for others who wish to experiment with Spatialite. I assumed that most people who would be interested in the database would not be familiar with this format, so I wrote a tutorial that covers the database and it’s features, how to add and map data in QGIS, how to work with the data and do SQL / spatial SQL in the Spatialite GUI, and how to map data in ArcGIS using the Access Geodb. It’s Creative Commons, Attribution, Non-Commercial, Share-alike, so feel free to give it a try.

I spent a good amount of time building a process rather than just a product, so I’ll be able to update the db twice a year, as city features (schools, libraries, hospitals, transit) change and new census data (American Community Survey, ZIP Business Patterns) is released. Many of the Census features, as well as the 2010 Census data, will be static until 2020.

American Factfinder Tutorial & Census Geography Updates

Monday, July 23rd, 2012

I’ve been en-meshed in the census lately as I’ve been writing a paper about the American Community Survey. Here are a few a things to share:

  • Since I frequently receive questions about how to use the American Factfinder, I’ve created a brief tutorial with screenshots demonstrating a few ways to navigate it. I illustrate how to download a profile for a single census tract from the American Community Survey, and how to download a table for all ZIP Code Tabulation Areas (ZCTAs) in a county using the 2010 Census.
  • New boundaries for PUMAs based on 2010 census geography have been released; they’re not available from the TIGER web-based interface yet but you get can state-based files from the FTP site. I’ve downloaded the boundaries for New York and there are small changes here and there from the 2000 Census boundaries; not surprising as PUMAs are built from tracts and tract boundaries have changed. One big bonus is that PUMAs now have names associated with them, based on local government suggestions. In NY State they either take the name of counties with some directional element (east, central, south, etc), or the name of MCDs that are contained within them. In NYC they’ve been given the names of community districts.
  • I’ve done some digging through the FAQs at and discovered that the census is going to stick with the old 2000 PUMA boundaries for the next release of the American Community Survey – the 2011 ACS will be released at the end of this year. 2010 PUMAs won’t be used until the 2012 ACS, to be released at the end of 2013.
  • Urban Areas are the other holdovers in the ACS that use 2000 vintage boundaries. The ACS will also transition to the 2010 boundaries for urban areas in the 2012 ACS.
  • In the course of my digging I discovered that the census will begin including ZCTA-level data as part of the 5-year ACS estimates, beginning with the 2011 release this year. 2010 ZCTA boundaries are already available, and 2010 Census data has already been released for ZCTAs. The ACS will use the 2010 vintage ZCTAs for each release until they’re redrawn for 2020.

ZIP Code KML Map for NYC Census Data

Saturday, September 10th, 2011

With the release of both the 2010 Census profiles for ZCTAs (ZIP Code Tabulation Areas) and the TIGER line files for 2010 Census geographies, I created another Google Map finding aid for NYC neighborhood data by ZIP code (I previously created one for PUMAs with American Community Survey data). Once again I used the Export to KML plugin that was created for ArcGIS. This allowed me to use the TIGER shapefile in ArcGIS to create the map I wanted and then export it as a KML, while using fields in the attribute table of each feature to insert the ZCTA number into stable links for the census profiles, automatically generating unique urls for each feature. Click on the ZCTA in the map, and then click on a link to open a profile directly from the new American Factfinder.

There were two new obstacles I had to contend with this time. The first was that my department has finally migrated to Windows 7 from Windows XP, and I upgraded from ArcGIS 9.3 to 10. I had to reinstall the Export to KML plugin (version 2.5.5) and ran into trouble; fortunately all the work-arounds were included in the plugin’s documentation. I don’t have administrator rights on my machine, so I had to have someone install the plugin as an administrator; this included running the initial setup file AND running Arc as an administrator as you add and turn the plugin on. That was straightforward, but when I ran it the first time I got an error message – there’s a particular Windows dll or ocx file that the plugin needs and it was missing (presumably something that was included in XP but not in 7). I downloaded the necessary file, and with administrator rights moved it into the system32 folder and registered the file via the command line. After that I was good to go.

The second issue was with the Census Bureau’s new American Factfinder. With the old Factfinder the urls that were generated as you built and accessed tables were static and you could simply save and bookmark them. Not the case in the new Factfinder; you can bookmark some basic tables but most of them are “too complex to bookmark”; you can save and download queries from the online ap but that’s it. After some digging I found a CB document that tells you how you can create deep links to any query you run and table you create. The url consists of a fixed series of codes that identify the dataset, year, table, and geography. So this link:

Tells us that were getting a table from version 1.0 of the American Factfinder in English. It’s from the Decennial Census, 2010 Demographic Profiles, Demographic Profile Table 1, for ZCTA 10010 (860 is the summary level code that indicates we’re looking at ZCTAs). So for the plugin to create the links, I just included this URL but for the last five digits I specified the attribute from the ZCTA shapefile that held the ZCTA code. So when the plugin creates the KML, each KML feature has a link generated that is specific to it:[ZCTA5CE10]

You can see this previous post for details on how the Export to KML plugin works.

For now, the 2010 and 2000 Census are in the new American Factfinder. The American Community Survey, the Economic Census, population estimates, and a few other datasets are still in the older, legacy Factfinder. According to the CB all of this data will be migrated to the new Factfinder by the end of 2011 and the legacy version will disappear. At that point I’ll have to update my PUMA map so that it points to the profiles in the new Factfinder.

You can take a look at the ZCTA map and profiles below (I’m hosting it on the NYC data resource guide I’ve created for my college). As I’ve written before, ZCTAs are odd Census geographies since they are approximations of residential USPS ZIP Codes created by aggregating census blocks based on addresses; you can see in many instances where boundaries have a blocky teeth-like appearance instead of straight lines. Since they’re created directly by aggregating blocks, ZCTAs don’t correspond or mesh with other census boundaries like tracts or PUMAs, or even legal boundaries like counties. In some cases my assignment of county-based colors doesn’t ring true. For example, ZCTA 11370 includes part of the East Elmhurst neighborhood in Queens and Rikers Island, which is in the Bronx. ZCTA 10463 includes the Bronx neighborhoods of Kingsbridge and Spuyten Duyvil and the Manhattan neighborhood of Marble Hill (a geographic anomaly; it’s not on the Island of Manhattan but it’s part of Manhattan borough).

The most salient issue with ZCTAs is that they are only tabulated for the decennial census and not the American Community Survey; the currency of data and spectrum of census variables will be limited compared to other types of geography. ***NOTE*** This is no longer the case – ZCTA-level data is now available as part of the 5-year ACS, beginning with the 2007-2011 series.

View Larger Map

Relating ZIP Codes / ZCTAs to PUMAs

Saturday, March 19th, 2011

Ever since I created the Google Maps finding aid for census data for NYC PUMAs and the associated PUMA – NYC neighborhood names maps, I’ve received several requests for tables or maps that relate PUMAs to ZIP Codes. These are usually from non-profits in NYC who have lists of donors, members, or constituents with addresses, and they want to relate the addresses (using the ZIP) to recent demographic data from American Community Survey (ACS) for the broader neighborhood where the ZIP is located.

The problem is that ZIP Codes are an all around pain. They actually don’t exist as areas with distinct boundaries; ZIP Codes are all address based, with ZIPs tied to addresses along street segments. The USPS doesn’t publish these tables or create maps; they contract this out for private companies to do, who turn around and sell these products for hefty fees.

Fortunately the Census Bureau has used these address tables to create approximations of ZIP Codes that they call ZCTAs or ZIP Code Tabulation Areas. ZCTAs are aggregates of census blocks that attempt to mimic ZIP Codes that exist as areas; codes associated with specific single-point firms or organization are dropped. Since ZIPS were created by the USPS, ZCTAs do not nest or mesh with any census geography; they cross PUMA, county, and in some cases even state boundaries. They are also less stable than census geography, with frequent changes, and as statistical areas they vary widely in area and population. For this reason ZCTA data is only published every ten years in the decennial census; it’s not included in the ACS (so far).

With these caveats in mind, I used the Missouri Census Data Center’s MABLE/GEOCORR engine to correlate ZCTAs with PUMAs. While the interface looks a little retro and daunting, it’s actually pretty simple. You choose the state, the two geographies you want to relate, the weighting method for allocating one to the other, and an output format that includes CSV or HTML. I also used an option that lets you type in FIPS codes for the counties you want, so I didn’t end up with the entire state.

This method was the way to go, as they give you the option to allocate geographies based on population and not simply land area; each ZCTA was allocated to PUMAs based on where the majority of the ZCTA’s population lived using 2000 census block data. The final output contains one row for each ZCTA to PUMA combination. So you had multiple rows for ZCTAs that weren’t contained within a single PUMA, and for each of those ZCTAs you had fields that showed the percentage of the ZCTA’s population that lived in each PUMA (along with the actual population number) as well as the percentage of the PUMA’s population that lived in that ZCTA.

I took that table and cleaned it up in a spreadsheet, so that I was left with one row for each ZCTA, where the ZCTA was allocated to one PUMA based on where the majority of it’s population lives. I used some ZCTA and PUMA boundaries that I had originally downloaded and subsequently cleaned up from the 2009 TIGER shapefiles page, added them to QGIS, joined the ZCTA allocation table to the ZCTA geography, and mapped the result. I color-coded ZCTAs so that clusters of ZCTAs within a particular PUMA had the same color. Then I overlaid the PUMA boundaries on top to see how well they corresponded.

In the end, they didn’t correspond all that well. There was a fairly good relationship in Manhattan, ok relationship in Queens and Staten Island, and a rather lousy relationship in the Bronx and Brooklyn. I overlaid greenspace and facilities (airports, shipyards, etc) boundaries I had, and that made some difference; you could see in some areas where ZCTAs overlapped two PUMAs that the overlap coincided with parks, cemeteries, or other areas with low or no residential population in one of the PUMAs.

I’ve posted both sets of tables, maps, and some instructions on the NYC neighborhoods resource page. You can use the original MABLE / GEOCORR table to judge where allocations were good and were they were not so good based on population. For now, the engine is still based on 2000 Census geography and data. Even though the Census has started releasing 2010 TIGER files based on 2010 Census geography, ZCTAs and PUMAs are often some of the last geographies to be updated; current releases of the ACS are still based on the 2000 geographies. Stay tuned to the Census Bureau and MCDC websites for news on updates, and keep the MABLE / GEOCORR in mind if you want to create lists to relate census geographies by population or land area.

Copyright © 2017 Gothos. All Rights Reserved.
No computers were harmed in the 0.513 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.