Archive for the ‘Data Processing’ Category

Screen Scraping Data with Python

Friday, March 9th, 2012

I had a request recently for population centers (aka population centroids) for all the counties in the US. The Census provides the 2010 centroids in state level files and in one national file for download, but the 2000 centroids were provided in HTML tables on individual web pages for each state. Rather than doing the tedious work of copying and pasting 51 web pages into a spreadsheet, I figured this was my chance to learn how to do some screen scraping with Python. I’m certainly no programmer, but based on what I’ve learned (I took a three day workshop a couple years ago) and by consulting books and crawling the web for answers when I get stuck, I’ve been able to write some decent scripts for processing data.

For screen scraping there’s a must-have module called Beautiful Soup which easily let’s you parse web pages, well or ill-formed. After reading the Beautiful Soup Quickstart and some nice advice I found on a post on Stack Overflow, I was able to build a script that looped through each of the state web pages, scraped the data from the tables, and dumped it into a delimited text file. Here’s the code:

## Frank Donnelly Feb 29, 2012
## Scrapes 2000 centers of population for counties from individual state web pages
## and saves in one national-level text file.

from urllib.request import urlopen
from bs4 import BeautifulSoup

output_file=open('CenPop2000_Mean_CO.txt','a')
header=['STATEFP','COUNTYFP','COUNAME','STNAME','POPULATION','LATITUDE','LONGITUDE']
output_file.writelines(",".join(header)+"\n")

url='http://www.census.gov/geo/www/cenpop/county/coucntr%s.html'

fips=['01','02','04','05','06','08','09','10',
'11','12','13','15','16','17','18','19','20',
'21','22','23','24','25','26','27','28','29','30',
'31','32','33','34','35','36','37','38','39','40',
'41','42','44','45','46','47','48','49','50',
'51','53','54','55','56']

for i in fips:
  soup = BeautifulSoup(urlopen(url %i).read())
  titleTag = soup.html.head.title
  list=titleTag.string.split()
  name=(list[4:])
  state=' '.join(name)  

  for row in soup('table')[1].tbody('tr'):
    tds = row('td')
    line=tds[0].string, tds[1].string, tds[2].string, state,
    tds[3].string.replace(',',''), tds[4].string, tds[5].string

    output_file.writelines(",".join(line)+"\n")     

output_file.close()

After installing the modules step 1 is to import them into the script. I initially got a little stuck here, because there are also some standard modules for working with urls (urllib and urlib2) that I’ve seen in books and other examples that weren’t working for me. I discovered that since I’m using Python 3.x and not the 2.x series, something had changed recently and I had to change how I was referencing urllib.

With that out of the way I created a a text file, a list with the column headings I want, and then wrote those column headings to my file.

Next I read in the url. Since the Census uses a static URL that varies for each state by FIPS code, I was able to assign the URL to a variable and inserted the % symbol to substitute where the FIPS code goes. I created a list of all the FIPS codes, and then I run through a loop – for every FIPS code in the list I pass that code into the url where the % place holder is, and process that page.

The first bit of info I need to grab is the name of the state, which doesn’t appear in the table. I grab the title tag from the page and save it as a list, and then grab everything from the fourth element (fifth word) to the end of the list to capture the state name, and then collapse those list elements back into one string (have to do this for states that have multiple words – New, North, South, etc.).

So we go from the HTML Title tag:

County Population Centroids for New York

To a list with elements 0 to 5:

list=["County", "Population", "Centroids", "for", "New", "York"]

To a shorter list with elements 4 to end:

name=["New","York"]

To a string:

state=”New York”

But the primary goal here is to grab everything in the table. So we identify the table in the HTML that we want – the first table in those pages [0] is just an empty frame and the second one [1] is the one with the data. For every row (tr) in the table we can reference and grab each cell (td), and string those cells together as a line by referencing them in the list. As I string these together I also insert the state name so that it appears on every line, and for the third list element (total population in 2000) I strip out any commas (numbers in the HTML table included commas, a major no-no that leads to headaches in a csv file). After we grab that line we dump it into the output file, with each value separated by a comma and each record on it’s own line (using the new line character). Once we’ve looped through each table on each page for each state, we close the file.

There are a few variations I could have tried; I could have read the FIPS codes in from a table rather than inserting them into the script, but I preferred to keep everything together. I could have read the state names in as a list, or coupled them with the codes in a dictionary. This would have been less risky then relying on the state name in the title tag, but since the pages were well-formed and I wanted to experiment a little I went the title tag route. Instead of typing the codes in by hand I used Excel trickery to concatenate commas to the end of each code, and then concatenated all the values together in one cell so I could copy and paste the list into the script.

You can go here to see an individual state page and source, and here to see what the final output looks like. Or if you’re just looking for a national level file of 2000 population centroids for counties that you can download, look no further!

Thiessen Polygons and Listing Neighboring Features

Monday, January 2nd, 2012

I was helping someone with a project recently that I thought would be straightforward but turned out to be rather complex. We had a list of about 10,000 addresses that had to be plotted as coordinates, and then we needed to create Thiessen or Voroni polygons for each point to create market areas. Lastly we needed to generate an adjacency table or list of neighbors; for every polygon list all the neighboring polygons.

For step one I turned to the USC Geocoding service to geocode the addresses; I became a partner a ways back so I could batch geocode datasets for students and faculty on my campus. Once I had coordinates I plotted them in ArcGIS 10 (and learned that the Add XY data feature had been moved to File > Add Data > Add XY Data). Step 2 seemed easy enough; in Arc you go to ArcToolbox > Analysis Tools > Proximity > Create Thiessen Polygons. This creates a polygon for each point and assigns the attributes of each point to the polygon.

I hit a snag with Step 3 – Arc didn’t have a tool for generating the adjacency table. After a thorough search of the ESRI and Stack Exchange forums, I stumbled on the Find Adjacent Features Script by Ken Buja which did exactly what I wanted in ArcGIS 9.2 and 9.3, but not in 10. I had used this script before on a previous project, but I’ve since upgraded and can’t go back. So I searched some more until I found the Find Adjacent & Neighboring Polygons Tool by cmaene. I was able to add this custom toolbox directly to ArcToolbox, and it did exactly what I wanted in ArcGIS 10. I get to select the unique identifying field, and for every ID I get a list of the IDs of the neighboring polygons in a text file (just like Ken’s tool). This tool also had the option of saving the list of neighbors for each feature directly in the attribute table of a shapefile (which is only OK for small files with few neighbors; fields longer than 254 characters get truncated), and it gave you the option of listing neighbors to the next degree (a list of all the neighbor’s neighbors).

Everything seemed to run fine, so I re-ran the tool on a second set of Thiessen polygons that I had clipped with an outline of the US to create something more geographically realistic (so polygons that share a boundary only in the ocean or across the Great Lakes are not considered neighbors).

THEN – TROUBLE. I took some samples of the output table and checked the neighbors of a few features visually in Arc. I discovered two problems. First, I was missing about a thousand records or so in the output. When I geocoded them I couldn’t get a street-level address match for every record; the worse case scenario was a plot to the ZCTA / ZIP code centroid for the address, which was an acceptable level of accuracy for this project. The problem is that if there are many point features plotted to the same coordinate (because they share the same ZIP), a polygon was created for one feature and the overlapping ones fell away (you can’t have overlapping Thiessen polygons). Fortunately this also wasn’t an issue for the person I was helping; we just needed to join the output table back to the master one to track which ones fell out and live with the result.

The bigger problem was the output was wrong. I discovered that the neighbor list for most of the features I checked, especially polygons that had borders on the outer edge of the space, had incomplete lists; each feature had several (and in some cases, all) neighbors missing. Instead of using a shapefile of Thiessen’s I tried running the tool on polygons that I generated as feature classes within an Arc geodatabase, and got the same output. For the heck of it I tried dissolving all the Thiessen’s into one big polygon, and when I did that I noticed that I had orphaned lines and small gaps in what should have been one big, solid rectangle. I tried checking the geometry of the polygons and there were tons of problems. This led me to conclude that Arc did a lousy job when constructing the topology of the polygons, and the neighbor tool was giving me bad output as a result.

Since I’ve been working more with GRASS, I remembered that GRASS vectors have strict topology rules, where features have shared boundaries (instead of redundant overlapping ones). So I imported my points layer from a shapefile into GRASS and then used the v.voroni tool to create the polygons. The geometry looked sound, the attributes of each point were assigned to a polygon, and for overlapping points one polygon was created and attributes of the shared points were dumped. I exported the polygons out as a shapefile and brought them back into Arc, ran the Find Adjacent & Neighboring Polygons tool, spot checked the neighbors of some features, and voila! The output was good. I clipped these polygons with my US outline, ran the tool again, and everything checked out.

Morals of this story? When geocoding addresses consider how the accuracy of the results will impact your project. If a tool or feature doesn’t exist assume that someone else has encountered the same problem and search for solutions. Never blindly accept output; take a sample and do manual checks. If one tool or piece of software doesn’t work, try exporting your data out to something else that will. Open source software and Creative Commons tools can save the day!

Footnote – apparently it’s possible to create lists of adjacent polygons in GRASS using the sides option in v.to.db, although it isn’t clear to me how this is accomplished; the documentation talks about categories of areas on the right and left of a boundary, but not on all sides of an area. Since I already had a working solution I didn’t investigate further.

Giving GRASS GIS a Try

Saturday, July 30th, 2011

I’ve been taking the plunge in learning GRASS GIS this summer, as I’ve been working at home (and thus don’t have access to ArcGIS) on a larger and more long-term project (and thus don’t want to mess around with shapefiles and csv tables). I liked the idea of working with GRASS vectors, as everything is contained in one folder and all my attributes are stored rather neatly in a SQLite database.

I started out using QGIS to create my mapset and to connect it to my SQLite db which I had created and loaded with some census data. Then I thought, why not give the GRASS interface a try? I started using the newer Python-wx GUI and as I’m trying different things, I bounce back and forth between using the GUI for launching commands and the command line for typing them in – all the while I have Open Source GIS A GRASS GIS Approach at my side and the online manual a click away . So far, so good.

I loaded and cleaned a shapefile with the GRASS GUI (the GUI launches v.in.ogr, v.build, abd v.clean) and it’s attributes were loaded into the SQLite database I had set (using db.connect – need to do this otherwise a DBF is created by default for storing attributes). Then I had an age-old task to perform – the US Census FIPS / ANSI codes where stored in separate fields, and in order to join them to my attribute tables I had to concatenate them. I also needed to append some zeros to census tract IDs that lacked them – FIPS codes for states are two digits long, counties are three digits, and tracts are between four and six digits, but to standardize them four digit tracts should have two zeros appended.

Added the new JOIN_ID column using v.db.addcol, then did the following using db.execute:

UPDATE tracts_us99_nad83
SET JOIN_ID = STATE || COUNTY || TRACT

Then:

UPDATE tracts_us99_nad83
SET JOIN_ID = JOIN_ID || ’00′
WHERE length(JOIN_ID)=9

So this:

STATE COUNTY TRACT
01 077 0113
01 077 0114
01 077 011502
01 077 011602

Becomes this:

JOIN_ID
01077011300
01077011400
01077011502
01077011602

db.execute GRASS GUI

I could have done this a few different ways from within GRASS: instead of the separate v.db.addcol command I could have written a SQL statement in db.execute to alter the table and add a column. Or, instead of db.execute I could have used the v.db.update command.

My plan is to use GRASS for geoprocessing and analysis (will be doing some buffering, geographic selection, and basic spatial stats), and QGIS for displaying and creating final maps. I just used v.in.db to transform an attribute table with coordinates in my db to a point vector. But now I’m realizing that in order to draw buffers, I’ll need a projected coordinate system that uses meters or feet, as inputting degrees for a buffer distance (for points throughout the US) isn’t going to work too well. I’m also having trouble figuring out how to link my attribute data to my vectors – I can easily use v.db.join to fuse the two together, but there is a way to link them more loosely using the CAT ID number for each vector, but I’m getting stuck. We’ll see how it goes.

Some final notes – since I’m working with large datasets (every census tract in the US) and GRASS uses a topological data model where common node and boundaries between polygons are shared, geoprocessing can take awhile. I’ve gotten in the habit of testing things out on a small subset of my vectors, and once I get it right I run the process on the total set.

Lastly, there are times where I read about commands in the manual and for the life of me I can’t find them in the GUI – for example, finding a menu to delete (i.e. permanently remove) layers. But if you type the command without any of its parameters in the command line (in this case, g.remove) it will launch the right window in the GUI.

GRASS GIS Interface

Define Projection for a Batch of Shapefiles

Sunday, June 12th, 2011

I was working on a project where I had downloaded 51 shapefiles (state-based census tract files) from the Census Generalized Cartographic Boundary Files. Each file lacked a projection .prj file, so I had to define each one as NAD83. Not wanting to do this one at a time, I used the GDAL / OGR tools and a bash script to process them all in a batch. I wrote a little script in a text file and then pasted it in the command line:

#!/bin/bash
for i in $(ls *.shp); do
ogr2ogr -f “ESRI Shapefile” -a_srs “EPSG:4269″ ./nad83 $i
done

It iterates through a list of all the shapefiles in a directory, uses OGR to define them as NAD83, then writes them to a new subdirectory called NAD83.

After searching through the web for some guidance on this, I later realized that there was a nice, succinct example of this in a book that I had (yeah – remember books? They’re still great!)

#!/bin/bash
# from Sherman (2008) Desktop GIS Mapping the Planet With Open Source Tools pp 243-44

for shp in *.shp
do
echo “Processing $shp”
ogr2ogr -f “ESRI Shapefile” -t_srs EPSG:4326 geo/$shp $shp
done

This does the same thing, difference here is that it prints a message to the command line for each file that’s processed and uses the -t_srs switch (transform projection) rather than the -a_srs (assign an output projection), which in this case seems to do the same thing. Of course you could tweak this a little to transform projections from one system to another as well.

This is fine and good if you’re using Linux and can use bash (go here for more info about bash). If you’re using Windows, you can do this if you’re using a Linux / UNIX terminal emulator like MSYS; otherwise you can use the DOS Command Prompt and write a batch (.bat) file to do this instead – the post on this forum is the first thing I found in my quest to figure all of this out.

Joining CSV Files in QGIS

Monday, April 4th, 2011

DBF files are the other thorny issue that comes up when I’m teaching the Intro to GIS workshops with QGIS – specifically how do you create and edit them? Doing that has been pretty inconvenient in the Windows world since they were deprecated in Excel 2007. You can download plugins or basic stand-alone software to do the job. Since I’m a Linux user I have the Open Office suite by default, and I can easily work with DBFs in Calc. That’s an option for Windows and Mac users too, but it’s kind of a drag to download an entire office suite just for working with DBFs (assuming most folks are use MS Office).

Another possibility is to dump DBF all together; even though the join table option in the fTools menu in QGIS only presents you the option of joining shapfiles or DBF, it turns out if you choose the DBF radio button option you can actually point to DBFs OR CSV files when you’re browsing to point to your data table. The join proceeds the same way, and you get a new shapefile with the data from the CSV joined to it. CSVs can be easily created from any spreadsheet, text editor, or database program in any operating system, so you can prep your attribute data in your program of choice before exporting to CSV and importing to GIS.

The problem with this approach is that all of the fields from the CSV file are automatically saved as text or strings when they’re appended to the shapefile. This means that any numeric data you have can’t be treated numerically; you can’t perform calculations or classify data as value ranges for mapping. You can go into the attribute table, enter the edit mode, and use the field calculator to create new fields where you convert the values to integers, but this adds a bunch of duplicate fields and is rather messy. You can’t delete columns from shapefiles within QGIS; you have to edit the attributes outside the software to remove extra columns.

Here’s an easy work around: you can create a CSVT file to go along with your CSV file. Open a text editor like Notepad or gedit and create a one line file where you specify the data type for each of the fields in your CSV file. Save the CSVT with the SAME NAME as your CSV file. Now when you go to join your CSV to your shapefile in QGIS, it will read the CSVT and save all of your fields properly in the new shapefile.

So for a CSV file like this with six fields:

CODE, ST, ID_NAICS51, EMP_51, ID_TOTAL, TOTAL_EMP
01, AL, ENU0100010551, 27013, ENU0100010510, 1570188
02, AK, ENU0200010551, 6988, ENU0200010510, 237708
04, AZ, ENU 0400010551, 41833, ENU0400010510, 2174919 …

You’d create a CSVT file like this:

“String”, “String”, “String”, “Integer”, “String”, “Integer”

So the 4th and 6th fields that have numeric values are saved as integers. There are a few other field types you can use, like Real for decimal numbers and some Date / Time options, and you can specify length and precision – see this post for details.

Using shapefiles, CSVs, and CSVTs are fine for small to medium projects and for the introductory workshops I’m teaching; geodatabases are another option and are certainly better for medium to large projects, but introducing them in my intro workshop is a little too much.

(NOTE – in QGIS 1.7 the join tool has been dropped from the ftools menu. To join a csv or dbf in 1.7+, you have to add your data table as a vector to your map, and then use the join tab within the properties menu of the feature you want to join it to. See this post for details.)

Calculating Standard Deviation for Summarized Data

Tuesday, March 29th, 2011

This isn’t a geospatial issue per say, but I thought it would be useful to share. I have a spreadsheet where I’m tracking course evaluation responses for the GIS workshops I’m teaching. I have to report the total number of responses, the mean, and standard deviation for each question. The worksheet I designed tracks aggregate responses; the total number of people who responded to each question in each category, on a scale of 5 (strongly agree) to 1 (strongly disagree). For example:

The problem I had was that Excel’s standard deviation formula doesn’t work for summaries – you need to give the formula individual responses or raw scores for arguments. In other words:

So I was fixated on trying to find a formula, through the help and by searching the forums, where somehow I could calculate standard deviation using summaries or aggregates. It finally dawned on me (duh) that I could plug in the standard deviation formula myself and modify it.

To calculate the standard deviation for an entire population you compute the difference of each data point from the mean and square each result. Then you calculate the average of all these values and take the square root.

So for each question I subtract the mean score for that question from the score category for that question, square it, and then multiply the result by the number of people who answered in that category. So if 10 people strongly agreed with the question and strongly agreed is associated with a score of 5, I subtract the average score (4.71) from 5, square the result, and multiply it by 10 (since ten people responded that they strongly agreed).

((score value – mean score)^2)*respondents

I perform the same operation for each category. So if 4 people said they agreed with a question and agreed is a value of 4, subtract 4.71 from 4, square it, and multiply by 4. After I do this this for each category, I sum the values for each one and take the square root of the whole thing.

SQRT ((((score5 – mean score)^2)*respondents)+(((score4 – mean score)^2)*respondents))

SQRT ((((5-4.71)^2)*10)+(((4-4.71)^2)*4))

For my spreadsheet the formula is repeated for each of the 5 possible scores, references are used to pull in the mean and respondent values from other cells, and I round the entire result to 1 decimal place. The number of parens makes it a little confusing; I’ve inserted a color-coded image below so it’s a little clearer.

=ROUND((SQRT(((((5-H9)^2)*B9)+(((4-H9)^2)*C9)+(((3-H9)^2)*D9)+(((2-H9)^2)*E9)+(((1-H9)^2)*F9))/G9)),1)

Given all that can go wrong with one misplaced parens, I tested this by inputting some raw scores by hand and running the STDEVPA formula to verify that I get the same result.

Relating ZIP Codes / ZCTAs to PUMAs

Saturday, March 19th, 2011

Ever since I created the Google Maps finding aid for census data for NYC PUMAs and the associated PUMA – NYC neighborhood names maps, I’ve received several requests for tables or maps that relate PUMAs to ZIP Codes. These are usually from non-profits in NYC who have lists of donors, members, or constituents with addresses, and they want to relate the addresses (using the ZIP) to recent demographic data from American Community Survey (ACS) for the broader neighborhood where the ZIP is located.

The problem is that ZIP Codes are an all around pain. They actually don’t exist as areas with distinct boundaries; ZIP Codes are all address based, with ZIPs tied to addresses along street segments. The USPS doesn’t publish these tables or create maps; they contract this out for private companies to do, who turn around and sell these products for hefty fees.

Fortunately the Census Bureau has used these address tables to create approximations of ZIP Codes that they call ZCTAs or ZIP Code Tabulation Areas. ZCTAs are aggregates of census blocks that attempt to mimic ZIP Codes that exist as areas; codes associated with specific single-point firms or organization are dropped. Since ZIPS were created by the USPS, ZCTAs do not nest or mesh with any census geography; they cross PUMA, county, and in some cases even state boundaries. They are also less stable than census geography, with frequent changes, and as statistical areas they vary widely in area and population. For this reason ZCTA data is only published every ten years in the decennial census; it’s not included in the ACS (so far).

With these caveats in mind, I used the Missouri Census Data Center’s MABLE/GEOCORR engine to correlate ZCTAs with PUMAs. While the interface looks a little retro and daunting, it’s actually pretty simple. You choose the state, the two geographies you want to relate, the weighting method for allocating one to the other, and an output format that includes CSV or HTML. I also used an option that lets you type in FIPS codes for the counties you want, so I didn’t end up with the entire state.

This method was the way to go, as they give you the option to allocate geographies based on population and not simply land area; each ZCTA was allocated to PUMAs based on where the majority of the ZCTA’s population lived using 2000 census block data. The final output contains one row for each ZCTA to PUMA combination. So you had multiple rows for ZCTAs that weren’t contained within a single PUMA, and for each of those ZCTAs you had fields that showed the percentage of the ZCTA’s population that lived in each PUMA (along with the actual population number) as well as the percentage of the PUMA’s population that lived in that ZCTA.

I took that table and cleaned it up in a spreadsheet, so that I was left with one row for each ZCTA, where the ZCTA was allocated to one PUMA based on where the majority of it’s population lives. I used some ZCTA and PUMA boundaries that I had originally downloaded and subsequently cleaned up from the 2009 TIGER shapefiles page, added them to QGIS, joined the ZCTA allocation table to the ZCTA geography, and mapped the result. I color-coded ZCTAs so that clusters of ZCTAs within a particular PUMA had the same color. Then I overlaid the PUMA boundaries on top to see how well they corresponded.

In the end, they didn’t correspond all that well. There was a fairly good relationship in Manhattan, ok relationship in Queens and Staten Island, and a rather lousy relationship in the Bronx and Brooklyn. I overlaid greenspace and facilities (airports, shipyards, etc) boundaries I had, and that made some difference; you could see in some areas where ZCTAs overlapped two PUMAs that the overlap coincided with parks, cemeteries, or other areas with low or no residential population in one of the PUMAs.

I’ve posted both sets of tables, maps, and some instructions on the NYC neighborhoods resource page. You can use the original MABLE / GEOCORR table to judge where allocations were good and were they were not so good based on population. For now, the engine is still based on 2000 Census geography and data. Even though the Census has started releasing 2010 TIGER files based on 2010 Census geography, ZCTAs and PUMAs are often some of the last geographies to be updated; current releases of the ACS are still based on the 2000 geographies. Stay tuned to the Census Bureau and MCDC websites for news on updates, and keep the MABLE / GEOCORR in mind if you want to create lists to relate census geographies by population or land area.

Learning Python at PyCamp

Thursday, June 10th, 2010

I got back from leave a couple week ago, and spent part of it at a Python boot camp. I’ve gotten tired of hacking away at data in spreadsheets and read in several places that Python is a good language to learn for beginning programmers – it’s also open source, flexible, and is used by many in the GIS community for processing data and building plugins and software (the instructor for the camp, Chris Calloway, pointed me to this presentation on Python scripting techniques for ArcGIS).

The workshop was a three-day event hosted at Penn State by the Triangle Zope and Python Users Group (TriZPUG). It was geared towards beginners and non-programmers (although many of my fellow classmates were IT and systems people) and provided a pretty thorough review of all of the elements of the language – now it’s up to me to tie it all together! The price was extremely reasonable (only $300 for a 3 day class!) and I’d certainly recommend it if there’s a camp in your area; although I would also recommend reading a book or taking a tutorial to familiarize yourself with the basics BEFORE attending the class; I did, and as a result I think I got more out of it than I would have had going in cold.

The next PyCamp is being held in LA in a few days, and the following one will be in Toronto from Aug 30th to Sept 3rd (although this isn’t posted on the website yet); the normal workshop is a five day affair, the one I attended was a mini 3 day version which suited my needs pretty well.

There are tons of Python tutorials on the web and Python’s site is pretty definitive. If you’re looking for a book, I’d recommend Practical Programming: An Introduction to Computer Science Using Python. Unlike the “Learn Language X” books, this one introduces you to general theory and practice in programming, and the authors illustrate the applications with practical examples using Python – it’s been immensely helpful to me. Now that I’m around the initial learning curve, I’ve been relying more on Beginning Python: From Novice to Professional, which is better as a reference book and good for illustrating many of the uses for individual objects, methods, etc (which I had a hard time grasping before I covered the basics of programming).

Google Maps to Create a Census Finding Aid

Thursday, May 13th, 2010

Yikes! It’s been quite awhile since my last post (the past couple months have been a little tough for me), but I just finished an interesting project that I can share.

I constantly get questions from students who are interested in getting recent demographic and socio-economic profiles for neighborhoods in New York City. The problem is that neighborhoods are not officially defined, so we have to look for a surrogate. The City has created neighborhood-like areas out of census tracts called community districts and they publish profiles for them, but this data is from the decennial census  and not current enough for their needs.  ZIP code data is also only available from the decennial census.

We can use PUMAs (Public Use Microdata Areas) to approximate neighborhoods in large cities, and they are published as part of the 3 year estimates of the American Community Survey. The problem is, in order to look up the data from the census you need to search by PUMA number – there are no qualitative place names. The city and the census have worked together to assign names to neighborhoods as part of the NYC Housing and Vacancy Survey, but this is the only place (I’ve found) that uses these names. You need to look in several places to figure out what the PUMA number and boundaries for an area are and then navigate through the census site to find it. Too much for the average student who visits me at the reference desk or emails me looking for data.

My solution was to create a finding aid in Google maps that tied everything together:

View Larger Map

I downloaded PUMA boundaries from the Census TIGER file site in a shapefile format. I opened them up in ArcGIS and used an excellent script that I downloaded called Export to KML. ArcGIS 9.3 does support KML exports via the toolbox, and there are a number of other scripts and stand-alone programs that can do this (I tried several) but Export to KML was best (assuming you have access to ArcGIS) in terms of the level of customization and the thoroughness of the user documentation. I symbolized the PUMAs in ArcGIS using the colors and line thickness that I wanted and fired up the tool. It allows you to automatically group and color features based on the layer’s symbology. I was able to add a “snippet” to each feature to help identify it (I used the PUMA number as the attribute name and the neighborhood name as my snippet, so both appear in the legend) and added a description that would appear in the pop up window when that feature is clicked. In that description, I added the URL from the ACS census profile page for a particular PUMA – the cool part here is that the URL is consistent and contains the PUMA number. So, I replaced the specific number and inserted the [field] name from the PUMAs attribute table that contained the number. When I did the export, the URLs for each individual feature were created with their PUMA number inserted into the link.

There were a few quirks – I discovered that you can’t automatically display labels on a Google Map without subterfuge, like creating the labels as images and not text. Google Earth (but not Maps) supports labels if you create multi-geometry where you have a point for a label and a polygon for the feature. If you select a labeling attribute on the initial options screen of the Export to KML tool, you create an icon in the middle of each polygon that has a different description pop-up (which I didn’t want so I left it to none and lived without labels). I made my features 75% transparent (a handy feature of Export to KML) so that you could see the underlying Google Map features through the PUMA, but this made the fill AND the lines transparent, making the features too difficult to see. After the export I opened the KML in a text editor and changed the color values for the lines / boundaries by hand, which was easy since the styles are saved by feature group (boroughs) and not by individual feature (pumas). I also manually changed the value of the folder open element (from 0 to 1) so that the feature and feature groups (pumas and boroughs) are expanded by default when someone opens the map.

After making the manual edits, I uploaded the KML to my webserver and pasted the url for it into the Google Maps search box, which overlayed my KML on the map. Then I was able to get a persistent link to the map and code for embedding it into websites via the Google Map Interface. No need to add it to Google My Maps, as I have my own space. One big quirk – it’s difficult to make changes to an existing KML once you’ve uploaded and displayed it. After I uploaded what I thought would be my final version I noticed a typo. So I fixed it locally, uploaded the KML and overwrote the old one. But – the changes I made didn’t appear. I tried reloading and clearing the cache in my browser, but no good – once the KML is uploaded and Google caches it, you won’t see any of your changes until Google re-caches. The conventional wisdom is to change the name of the file every single time – which is pretty dumb as you’ll never be able to have a persistent link to anything. There are ways to circumvent the problem, or you can just wait it out. I waited one day and by the next the file was updated; good enough for me, as I’ll only need to update it once a year.

I’m hosting the map, along with some static PDF maps and a spreadsheet of PUMA names and neighborhood numbers, from the NYC Data LibGuide I created (part of my college’s collection of research guides). If you’re looking for neighborhood names to associate with PUMA numbers for your city, you’ll have to hunt around and see if a local planning agency or non-profit has created them for a project or research study (as the Census Bureau does not create them). For example, the County of Los Angeles Department of Mental Health uses pumas in a large study they did where they associated local place names with each puma.

If you’re interested in dabbling in some KML, there’s Google’s KML tutorial. I’d also recommend The KML Handbook by Josie Wernecke. The catch for any guide to KML is that while all KML elements are supported by Google Earth, there’s only partial support for Google Maps.

Calculated Fields in SpatiaLite / SQLite

Wednesday, February 3rd, 2010

After downloading data, it’s pretty common that you’ll want to create calculated fields, such as percent totals or change, to use for analysis and mapping. The next step in my QGIS / SpatiaLite experiment was to create a calculated field (aka derived field). I’ll run through three ways of accomplishing this, using my subway commuter data to calculate the percentage of workers in each NYC PUMA who commute to work. Just to keep everything straight:

  • sub_commuters is a census data table for all PUMAs in NY State
    • [SUBWAY] field that has the labor force that commutes by subway
    • [WORKERS_16] field with the total labor force
    • [SUB_PER] a calculated field with the % of labor force that commutes by subway
    • [GEO_ID2] the primary key field, FIPS code that is the unqiue identifier
  • nyc_pumas is a feature class with all PUMAs in NYC
    • [PUMA5ID00] is the primary key field, FIPS code that is the unqiue identifier
  • pumas_nyc_subcom is the data table that results from joining sub_commuters and nyc_pumas; it can be converted to a feature class for mapping

Spreadsheet

The first method would be to add the calculated field to the data after downloading it from the census in a spreadsheet, as part of the cleaning / preparation stage. You could then save it as a delimited text file for import to SpatiaLite. No magic there, so I’ll skip to the second method.

SpatiaLite

The second method would be to create the calculated field in the SpatiaLite database. I’ll go through the steps I used to figure this out. The basic SQL select query:

SELECT *, (SUBWAY / WORKERS_16) AS SUB_PER FROM sub_commuters

This gives us the proper result, but there are two problems. First, the data in my SUBWAY and WORKERS_16 field are stored as integers, and when you divide the result is rounded to the nearest whole number. Not very helpful here, as my percentage results get rounded to 0 or 1. There are many ways to work around this: set the numeric fields as double, real, or float in the spreadsheet before import (didn’t work for me), specify the field types when importing (didn’t get that option with the SpatiaLite GUI, but maybe you can with the command line), add * 100 to the expression to multiply the percentage to a whole number (ok unless you need decimals in your result) or use the CAST operator. CAST converts the current data type of a field to a specified data type in the result of the expression. So:

SELECT *, (CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL)) AS SUB_PER FROM sub_commuters

This gave me the percentages with several decimal places (since we’re casting the fields as real instead of integer), which is what I needed. The second problem is that this query just produces a temporary view; in order to map this data, we need to create a new table to make the calculated field permanent and join it to a feature class. Here’s how we do that:

CREATE TABLE pumas_nyc_subcom AS
SELECT *, (CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL)) AS SUB_PER
FROM sub_commuters, nyc_pumas
WHERE nyc_pumas.PUMA5ID00=sub_commuters.geo_id2

The CREATE TABLE AS statement let’s us create a new table from the existing two tables – the data table of subway commuters and the feature class table for NYC PUMAs. We select all the fields in both while throwing in the new calculated field, and we join the data table to the feature class all in one step, and via the join we end up with just data from NYC (the data for the rest of the state gets dropped). After that, it’s just a matter of taking our new table and enabling the geometry to make it a feature class (as explained in the previous post).

This seems like it should work – but I discovered another problem. The resulting calculated field that has the percentage of subway commuters per PUMA, SUB_PER, has no data type associated with it. Looking at the schema for the table in SpatiaLite shows that the data type is blank. If I bring this into QGIS, I’m not able to map this field as a numeric value, because QGIS doesn’t know what it is. I have to define the data type for this field. SpatiaLite (SQLite really) doesn’t allow you to re-define an existing field – we have to create and define a new blank field, and the set the value of our calculated field equal to it. Here are the SQL statements to make it all happen:

ALTER TABLE sub_commuters ADD SUB_PER REAL

UPDATE sub_commuters SET SUB_PER=(CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL))

CREATE TABLE pumas_nyc_subcom AS
SELECT * FROM sub_commuters, nyc_pumas
WHERE nyc_pumas.PUMA5ID00=sub_commuters.geo_id2

So, we add a new blank field to our data table and define it as real. Then we update our data table by seting that blank field equal to our expression, thus filling the field with the result of our expression. Once we have the defined calculated field, we can create a new table from the data plus the features based on the ID they share in common. Once the table is created, then we can activate the geometry (right click on geometry field in the feature class and activate – see previous post for details) so we can map it in QGIS. Phew!

QGIS

The third method is to create the calculated field within QGIS, using the new field calculator. It’s pretty easy to do – you select the layer in the table of contents and go into an edit mode. Open the attribute table for the features and click the last button in the row of buttons underneath the table – this is the field calculator button. Once we’re in the field calculator window, we can choose to update an existing field or create a new field. We give the output field a name and a data type, enter our expression SUBWAY / WORKERS_16, hit OK, and we have our new field. Save the edits and we should be good to go. HOWEVER – I wasn’t able to add a calculated fields to features in a SpatiaLite geodatabase without getting errors. I posted to the QGIS forum – initially it was thought that the SpatiaLite driver was read only, but it turns out that’s not the case and so and the developers are investigating a possible bug. The investigation continues – stay tuned. I have tried the field calculator with shapefiles and it works perfectly (incidentally, you can export SpatiaLite features out of the database as shapefiles).

I’m providing the database I created here for download, if anyone wants to experiment.


Copyright © 2012 Gothos. All Rights Reserved.
No computers were harmed in the 0.396 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.