Archive for March, 2012

GIS Workshops This Apr & May

Sunday, March 25th, 2012

This semester I’ll be teaching three workshops with Prof. Deborah Balk in spatial tools and analysis. Sponsored by the CUNY Institute of Demographic Research (CIDR), the workshops will be held on Baruch College’s campus in midtown NYC on Friday afternoons. The course is primarily intended for data and policy analysts who want to gain familiarity with the basics of map making and spatial analysis; registration is open to anyone. The workshops progress from basic to intermediate skills that cover making a map (Apr 27th), geospatial calculations (May 4th), and geospatial analysis (May 11th). We’ll be using QGIS and participants will work off of their own laptops; we’ll also be demonstrating some of the processes in ArcGIS and participants will receive an evaluation copy of that software. Each workshop is $300 or you can register for all three for $750.

For full details check out this flier. You can register via the College’s CAPS website; do a search for DEM and register for each session (DEM0003, DEM0004, and DEM0005).

Screen Scraping Data with Python

Friday, March 9th, 2012

I had a request recently for population centers (aka population centroids) for all the counties in the US. The Census provides the 2010 centroids in state level files and in one national file for download, but the 2000 centroids were provided in HTML tables on individual web pages for each state. Rather than doing the tedious work of copying and pasting 51 web pages into a spreadsheet, I figured this was my chance to learn how to do some screen scraping with Python. I’m certainly no programmer, but based on what I’ve learned (I took a three day workshop a couple years ago) and by consulting books and crawling the web for answers when I get stuck, I’ve been able to write some decent scripts for processing data.

For screen scraping there’s a must-have module called Beautiful Soup which easily let’s you parse web pages, well or ill-formed. After reading the Beautiful Soup Quickstart and some nice advice I found on a post on Stack Overflow, I was able to build a script that looped through each of the state web pages, scraped the data from the tables, and dumped it into a delimited text file. Here’s the code:

## Frank Donnelly Feb 29, 2012
## Scrapes 2000 centers of population for counties from individual state web pages
## and saves in one national-level text file.

from urllib.request import urlopen
from bs4 import BeautifulSoup

output_file=open('CenPop2000_Mean_CO.txt','a')
header=['STATEFP','COUNTYFP','COUNAME','STNAME','POPULATION','LATITUDE','LONGITUDE']
output_file.writelines(",".join(header)+"\n")

url='http://www.census.gov/geo/www/cenpop/county/coucntr%s.html'

fips=['01','02','04','05','06','08','09','10',
'11','12','13','15','16','17','18','19','20',
'21','22','23','24','25','26','27','28','29','30',
'31','32','33','34','35','36','37','38','39','40',
'41','42','44','45','46','47','48','49','50',
'51','53','54','55','56']

for i in fips:
  soup = BeautifulSoup(urlopen(url %i).read())
  titleTag = soup.html.head.title
  list=titleTag.string.split()
  name=(list[4:])
  state=' '.join(name)  
 
  for row in soup('table')[1].tbody('tr'):
    tds = row('td')
    line=tds[0].string, tds[1].string, tds[2].string, state, 
    tds[3].string.replace(',',''), tds[4].string, tds[5].string

    output_file.writelines(",".join(line)+"\n")     

output_file.close()

After installing the modules step 1 is to import them into the script. I initially got a little stuck here, because there are also some standard modules for working with urls (urllib and urlib2) that I’ve seen in books and other examples that weren’t working for me. I discovered that since I’m using Python 3.x and not the 2.x series, something had changed recently and I had to change how I was referencing urllib.

With that out of the way I created a a text file, a list with the column headings I want, and then wrote those column headings to my file.

Next I read in the url. Since the Census uses a static URL that varies for each state by FIPS code, I was able to assign the URL to a variable and inserted the % symbol to substitute where the FIPS code goes. I created a list of all the FIPS codes, and then I run through a loop – for every FIPS code in the list I pass that code into the url where the % place holder is, and process that page.

The first bit of info I need to grab is the name of the state, which doesn’t appear in the table. I grab the title tag from the page and save it as a list, and then grab everything from the fourth element (fifth word) to the end of the list to capture the state name, and then collapse those list elements back into one string (have to do this for states that have multiple words – New, North, South, etc.).

So we go from the HTML Title tag:

County Population Centroids for New York

To a list with elements 0 to 5:

list=["County", "Population", "Centroids", "for", "New", "York"]

To a shorter list with elements 4 to end:

name=["New","York"]

To a string:

state=”New York”

But the primary goal here is to grab everything in the table. So we identify the table in the HTML that we want – the first table in those pages [0] is just an empty frame and the second one [1] is the one with the data. For every row (tr) in the table we can reference and grab each cell (td), and string those cells together as a line by referencing them in the list. As I string these together I also insert the state name so that it appears on every line, and for the third list element (total population in 2000) I strip out any commas (numbers in the HTML table included commas, a major no-no that leads to headaches in a csv file). After we grab that line we dump it into the output file, with each value separated by a comma and each record on it’s own line (using the new line character). Once we’ve looped through each table on each page for each state, we close the file.

There are a few variations I could have tried; I could have read the FIPS codes in from a table rather than inserting them into the script, but I preferred to keep everything together. I could have read the state names in as a list, or coupled them with the codes in a dictionary. This would have been less risky then relying on the state name in the title tag, but since the pages were well-formed and I wanted to experiment a little I went the title tag route. Instead of typing the codes in by hand I used Excel trickery to concatenate commas to the end of each code, and then concatenated all the values together in one cell so I could copy and paste the list into the script.

You can go here to see an individual state page and source, and here to see what the final output looks like. Or if you’re just looking for a national level file of 2000 population centroids for counties that you can download, look no further!


Copyright © 2013 Gothos. All Rights Reserved.
No computers were harmed in the 0.461 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.