## Posts Tagged ‘python’

### Python Geocoding Take 2 – US Addresses

Monday, May 30th, 2016

In my previous post I discussed my recent adventures with geocoding addresses outside the US. In contrast, there are countless options for batch geocoding addresses within the United States. I’ll discuss a few of those options here, but will focus primarily on the US Census Geocoder and a Python script I’ve written to batch match addresses using their API. The code and documentation is available on my lab’s resources page.

A Few Different Options

ESRI’s geocoding services allow you (with an account) to access their geocoding servers through tools in the ArcToolbox, or you can write a script and access them through an API. QGIS has a third-party plugin for accessing Google’s services (2500 records a day free) or the Open Streetmap. You can still do things the old fashioned way, by downloading geocoded street files and creating a matching service.

Alternatively, you can subscribe to any number of commercial or academic services where you can upload a file, do the matching, and download results. For years I’ve used the geocoding services at Texas A&M that allow you to do just that. Their rates are reasonable, or if you’re an academic institution and partner with them (place some links to their service on their website) you can request free credits for doing matches in batches.

The Census Geocoder and API, and a Python Script for Batch Geocoding

The Census Bureau’s TIGER and address files are often used as the foundational layers for building these other services, to which the service providers add refinements and improvements. You can access the Census Bureau’s services directly through the Census Geocoder, where you can match an address one at a time, or you can upload a batch of 1000 records. It returns longitude and latitude coordinates in NAD 83, and you can get names and codes for all the census geographies where the address is located. The service is pretty picky about the structure of the upload file (must be plain text, csv, with an id column and then columns with the address components in a specific order – with no other attributes allowed) but the nice thing is it requires no login and no key. It’s also public domain, so you can do whatever you want with the data you’ve retrieved. A tutorial for using it is available on our lab’s census tutorials page.

They also have an API with some basic documentation. You can match parsed and unparsed addresses, and can even do reverse geocoding. So I took a stab at writing a script to batch process addresses in text-delimited files (csv or txt). Unfortnately, the Census Geocoding API is not one of the services covered by the Python Geocoder that I mentioned in my previous post, but I did find another third party module called censusgeocode which provides a thin wrapper you can use. I incorporated that module into my Python 3 script, which I wrote as a function that takes the following inputs:
 census_geocode(datafile,delim,header,start,addcol) (str,str,str,int,list[int]) -> files

• datafile – this is the name of the file you want to process (file name and extension). If you place the geocode_census_funct.py file in the same directory as your data file, then you just need to provide the name of the file. Otherwise, you need to provide the full path to the file.
• delim – this is the delimiter or character that separates the values in your data file. Common delimiters includes commas ‘,’, tabs ‘\t’, and pipes ‘|’.
• header – here you specify whether your file has a header row, i.e. column names. Enter ‘y’ or ‘yes’ if it does, ‘n’ or ‘no’ if it doesn’t.
• start – type 0 to specify that you want to start reading the file from the beginning. If you were previously running the script and it broke and exited for some reason, it provides an index number where it stopped reading; if that’s the case you can provide that index number here, to pick up where you left off.
• addcol – provide a list that indicates the position number of the columns that contain the address components in your data file. For an unparsed address, you provide just one position number. For a parsed address, you provide 4 positions: address, city, state, and ZIP code. Whether you provide 1 or 4, the numbers must be supplied in brackets, as the function requires a Python list.

You can open the script in IDLE, run it to load it into memory, and then type the function with the necessary parameters in the shell to execute it. Some examples:

• A tab-delimited, unparsed address file with a header that’s stored in the same folder as the script. Start from the beginning and the address is in the 2nd column: census_geocode('my_addresses.txt','\t','y',0,[2])
• A comma-delimited, parsed address file with no header that’s stored in the same folder as the script. Start from the beginning and the addresses are in the 2nd through 5th columns: census_geocode('addresses_to_match.csv',',','n',0,[2,3,4,5])
• A comma-delimited, unparsed address file with a header that’s not in the same folder as the script. We ran the file before and it stopped at index 250, so restart there – the address is in the 3rd column: census_geocode('C:\address_data\data1.csv',',','y',250,[3])

The beginning of the script “sets the table”: we read the address columns into variables, create the output files (one for matches, one for non-matches, and a summary report), and we handle whether or not there’s a header row. For reading the file I used Python’s CSV module. Typically I don’t use this module, as I find it’s much simpler to do the basic: read a line in, split it on a delimiter, strip whitespace, read it into a list, etc. But in this case the CSV module allows you to handle a wider array of input files; if the input data was a csv and there happened to be commas embedded in the values themselves, the CSV module easily takes care of it; if you ignore it, the parsing would get thrown off for that record.

Handling Exceptions and Server Errors

In terms of expanding my skills, the new things I had to learn were exception handling and control flows. Since the censusgeocoding module is a thin wrapper, it had no built in mechanism for retrying a match a certain number of times if the server timed out. This is an absolute necessity, because the census server often times out, is busy, or just hiccups, returning a generic error message. I had already learned how to handle crashes in my earlier geocoding experiments, where I would write the script to match and write a record one by one as it went along. It would try to do a match, but if any error was raised, it would exit that loop cleanly, write a report, and all would be saved and you could pick up where you left off. But in this case, if that server non-response error was returned I didn’t want to give up – I wanted to keep trying.

So on the outside there is a loop to try and do a match, unless any error happens, then exit the loop cleanly and wrap up. But inside there is another try loop, where we try to do a match but if we get that specific server error, continue: go back to the top of that for loop and try again. That loop begins with While True – if we successfully get to the end, then we start with the next record. If we get that server error we stay in that While loop and keep trying until we get a match, or we run out of tries (5) and write as a non-match.

In doing an actual match, the script does a parsed or unparsed match based on user input. But there was another sticking point; in some instances the API would return a matched result (we got coordinates!), but some of the objects that it returned were actually errors because of some java problem (failed to get the tract number or county name – here’s an error message instead!) To handle this, we have a for i in range loop. If we have a matched record and we don’t have a status message (that indicates an error) then we move along and grab all the info we need – the coordinates, and all the census geography where that coordinate falls, and write it out, and then that for loop ends with a break. But if we receive an error message we continue – go back to the top of that loop and try doing the match again. After 3 tries we give up and write no match.

Figuring all that out took a while – where do these loops go and what goes in them, how do I make sure that I retry a record rather than passing over it to the next one, etc. Stack Exchange to the rescue! Difference between continue, pass and break, returning to the beginning of a loop, breaking out of a nested loop, and retrying after an exception. The rest is pretty straightforward. Once the matching is done we close the files, and write out a little report that tells us how many matches we got versus fails. The Census Geocoder via the API is pretty unforgiving; it either finds a match, or it doesn’t. There is no match score or partial matching, and it doesn’t give you a ZIP Code or municipal centroid if it can’t find the address. It’s all or nothing; if you have partial or messy addresses or PO Boxes, it’s pretty much guaranteed that you won’t get matches.

There’s no limit on number of matches, but I’ve built in a number of pauses so I’m not hammering the server too hard – one second after each match, 5 seconds after every 1000 matches, a couple seconds before retrying after an error. Your mileage will vary, but the other day I did about 2500 matches in just under 2 hours. Their server can be balky at times – in some cases I’ve encountered only a couple problems for every 100 records, but on other occasions there were hang-ups on every other record. For diagnostic purposes the script prints every 100th record to the screen, as well as any problems it encountered (see pic below). If you launch a process and notice the server is hanging on every other record and repeatedly failing to get matches, it’s probably best to bail out and come back later. Recently, I’ve noticed fewer problems during off-peak times: evenings and weekends.

Wrap Up

The script and the documentation are posted on our labs resources page, for all to see and use – you just have to install the third party censusgeocode module before using it. When would you want to use this? Well, if you need something that’s free, this is a good choice. If you have batches in the 10ks to do, this would be a good solution. If you’re in the 100ks, it could be a feasible solution – one of my colleagues has confirmed that he’s used the script to match about 40k addresses, so the service is up to the task for doing larger jobs.

If you have less than a couple thousand records, you might as well use their website and upload files directly. If you’re pushing a million or more – well, you’ll probably want to set up something locally. PostGIS has a TIGER module that lets you do desktop matching if you need to go into the millions, or you simply have a lot to do on a consistent basis. The excellent book PostGIS in Action has a chapter dedicated to to this.

In some cases, large cities or counties may offer their own geocoding services, and if you know you’re just going to be doing matches for your local area those sources will probably have greater accuracy, if they’re adding value with local knowledge. For example, my results with NYC’s geocoding API for addresses in the five boroughs are better than the Census Bureau’s and is customized for local quirks; for example, I can pass in a borough name instead of a postal city and ZIP Code, and it’s able to handle those funky addresses in Queens that have dashes and similar names for multiple streets (35th st, 35th ave, 35th dr…). But for a free, public domain service that requires no registration, no keys, covers the entire country, and is the foundation for just about every US geocoding platform out there, the Census Geocoder is hard to beat.

### Python Geocoding Take 1 – International Addresses

Monday, May 16th, 2016

This past semester has been the semester of geocoding. I’ve had a number of requests for processing large batches of addresses. Now that the term is drawing to a close, I’ll share some of my trials and tribulations. In this post, I’ll focus on my adventures in international geocoding.

First, it’s necessary to provide some context. As an academic librarian I’m primarily engaged with assisting students and faculty with their coursework and their research. My users are interested in getting coordinates for data so that they can do both analysis and visualization, which requires them to download the actual coordinate data in a batch and integrate it with the rest of their projects.

This is an important distinction to make, because in many cases the large web mapping companies (Google, Bing, Mapquest, etc) are not catering to this population – they provide services and APIs to web developers, so these folks can integrate geocoding services into the Google, Bing, etc maps they are embedding in their website. They geocoding providers specifically prohibit (in the fine print of their terms of use) anyone from using their services to create and download geocoded data. This essentially excludes a lot of academic use – which, is something I hadn’t fully grasped at the outset.

Google’s Geocoding API Perhaps?

My adventure began when a professor asked me for help in geocoding about 1 million addresses – in Turkey. Right from the beginning, many of the usual sources I would turn to (for US addresses) were out the window. I knew that I could do small scale batches of international addresses with the mmQGIS geocoding plugin, so I started testing there. The address file we had consisted of unparsed addresses, and the formating looked rather chaotic – but after doing some research I discovered that geocoding Turkish addresses was a tough proposition. The Open Street Map plugin (using Nominatim) returned no matches for our 1000 test cases. The Google results were much better, so we decided to investigate writing a script and using an API and to pay for the matching. According to the documentation, it would end up costing $500 to do 1 million addresses. I searched around for some Python APIs and found what I believed was the official one for google maps geocoding. So I spent a day writing a script that would loop through the addresses, which we divided into batches of 100k records each (which is the max you can do per day with Google if you set up billing), and the professor obtained an API key and set up billing for the account. The interface for setting up and managing the Google APIs was ridiculously confusing. Eventually we were set and I let the script rip, and found that it wouldn’t rip for long. It would consistently stop after doing a few thousand records. I had written it to write results one by one as they were obtained, and to exit cleanly in the case of errors. Upon exit, it provides the index number of the record where it stopped, so I was able to pick up where it left off. But the server would constantly time out – sometimes it could do 10 to 12k records in a stretch, but often less, so I could never leave it unattended for long. The matches themselves were a mixed bag – you could throw absolute garbage at the Google geocoder and still get a match – if not to an address or property, then to a street segment, and beyond that to useless things like postal codes, administrative districts, and the country as a whole (i.e. I can’t find your address, so here are the coordinates for the geographic center of Istanbul, or for all of Turkey. Have a nice day). It seemed like it was going to be a long climb to get to 1 million – but after about 100k we could go no further. Google simply refused every additional request. A new API key would get us a little further, but soon after that nothing would work and we wouldn’t get any useful error messages to explain why. Having never done anything like this before, I started to investigate why, and eventually discovered the problem: these web mapping geocoding services, even if you pay for them, are not meant to be used this way. Buried in the documentation I found the license restrictions, which stipulate that you are not allowed to download any of the data, and you had to plot every coordinate you retrieved onto a Google map. This is a service for web mapping developers, not researchers. Why hadn’t I realized this before? One, I simply had never made this distinction as I thought geocoding was geocoding, and in my world of course people are going to want to download the coordinates. Two, the Internet is full of thousands of little blog posts and tutorials which demonstrate how to use the Google Maps APIs, so I thought this was possible. But they never mention any of the caveats about what you can and can’t do with these services. In addition to violating the service terms, what I was doing was akin to yelling in the back of a crowded room, as I was hammering their server, sending requests as fast as I could with no limit. A normal web mapping application (which is what the service is designed for) would send a fraction of those requests in that amount of time. No wonder the requests were refused. Thus ended my Google geocoding experiment. Nope – How About ESRI Instead? So what to do next? I found that most of the other commercial web mapping services didn’t provide anything near the maximum caps and low prices that Google was offering. Mapquest for example requires that you subscribe to an account on a monthly or annual basis, and 100k is the amount you could do in a month. Most of the other commercial services also prohibit any downloads. The big exception is ESRI – they are one of the few that understand and cater to the academic market, and they do allow downloads: they say quite plainly: “Take your Coordinates with you. Once you have the results of a Geocode operation, they’re yours to take anywhere.” My university has a site license for ArcGIS, but it doesn’t include geocoding. You can create an account and have a certain number of free credits, and after that you pay. 1 mil records was going to cost about$4000 – substantially more than Google, but totally legal. ESRI provides lists of countries and ranks them according to how complete their street network coverage is. You can use their API via a script, or you can set up the service in ArcGIS Desktop and do the matching through the ArcToolbox. This would be painfully slow if you were doing a large job (like this one) but for the purpose of testing it out with a few hundred records this is what I tried. Unfortunately, in our case the results still weren’t good. Most of the addresses were to administrative or postal areas; not specific enough.

The Python Geocoder and a Wealth of Options

What often happens in librarianship when a patron makes an initial request (this should be a piece of cake, right?) and then discovers that what they’re looking for is more involved (ahhh this will be tougher than we thought), is that they reframe the question. He went back through the addresses with a research partner and winnowed them down based on what they really, absolutely needed, so now we were down from 1 million to just finding a match for about 300k. His colleague also suggested that we use Yandex, the Russian search and mapping engine. The structure of Russian addresses is quite similar to Turkish ones, and since Russia is closer to Turkey geographically and economically Yandex might do a better job.

I was dubious of this at first, but was quickly surprised. I found the Python Geocoder module, which provides a common, uniform API to over two dozen different geocoding services – including Google. Given the simplicity and flexibility of this module, it’s the one I should have used in the first place. And while Google limits you to 2500 free matches in one day, Yandex allows you 25k – that’s 25,000 – free matches in one day, without having to request an API key! I modified the original script I wrote to use the Python Geocoder module with Yandex, and the initial small-batch tests were successful. Here’s a small portion of the code – it loops through a file where the address is stored in one field (unparsed):

for index, line in enumerate(readfile):
result=geocoder.yandex(address[add]).json

And it spits you back this JSON result (you could also do XML if you prefer):

{‘quality’: ‘street’, ‘address’: ‘Türkiye, İstanbul, Fatih, Cankurtaran Mh., Ayasofya Meydanı’, ‘location’: ‘Hagia Sophia Museum, Sultanahmet Mh., Ayasofya Meydanı, Fatih/İstanbul’, ‘state’: ‘İstanbul’, ‘lng’: ‘28.979031’, ‘accuracy’: ‘street’, ‘encoding’: ‘utf-8’, ‘provider’: ‘yandex’, ‘country_code’: ‘TR’, ‘ok’: True, ‘status_code’: 200, ‘lat’: ‘41.00772’, ‘country’: ‘Türkiye’, ‘county’: ‘Fatih’, ‘confidence’: 10, ‘bbox’: {‘northeast’: [41.008156, 28.979714], ‘southwest’: [41.007285, 28.978349]}, ‘street’: ‘Ayasofya Meydanı’, ‘status’: ‘OK’}

If the result you get back is not OK (ok is False – nothing matched), then write the record to the unmatched file. Otherwise, get the bits and pieces out of the json object that you want, append them to the record, and write the whole record out to a matched file.

        if result.get('ok')==False:
else:
lng=result.get('lng')
lat=result.get('lat')
qual=result.get('quality')
accu=result.get('accuracy')
matchfile.writelines('\t'.join(address)+'\n')

But is it legal? It was unclear to me; they specify that map data is meant for personal/noncommercial use and in the same sentence: “Any copying of the Data, their reproduction, conversion, distribution, promulgation (publication) in the Internet, any use of the Data in mass media and/or for commercial purposes without a prior written consent of the right holder, shall be prohibited”. Does that mean any copying, or just copying for commercial use or for redistributing the data? In our case, this is for academic non-profit use and the data (individual geocoded records) wasn’t going to be republished – it would be used for plotting distances between locations and making highly generalized static dot maps for an article. At this stage we seemed to be out of options – if you need to geocode a large batch of international addresses, AND you are willing to pay for it, where on Earth can you go?

Ultimately, I left it up to the professor to contact them or not, and we decided to roll the dice. For my part, I engineered the script to put a minimum load on their servers – essentially I could take 24 hours to do 25k records. I used the time and random modules in Python to build pauses in between records to slow things down. In sharp contrast to Google, the Yandex servers were amazingly reliable – they were able to do batches of 25k records every single time without timing out – not even once – and in less than a couple weeks we were finished. About 50% of the matches were good, and for the others he and a research assistant went back and cleaned up unmatched records, and I gave them the script so they could try again.

International Geocoding: The Take-Aways

1. If you need to geocode a large batch of foreign addresses for academic or research purposes, forget Google. Their service was less than stellar (to put it mildly) and anyway it’s a violation of their license agreement. And all those lousy little blog posts out there that show you how to use the Google Map APIs with Python and say “Gee isn’t this great!” are largely useless for practical purposes.
2. The Python Geocoder module is simple to use and let’s you write a single script to access a ton of different geocoding services, including Open Streetmap, Yandex, and ESRI. But you still need to review the terms of service for each one to see what’s allowed and what the daily limits are.
3. If you have funding for your research project, and ESRI geocoding has good coverage for your geographic area (based on their documentation but also on your own testing) then go with them, as you’re free and clear to download data under their terms. Arc Desktop will be too sluggish for large batches so write a script – you can use the Python Geocoder.
4. Otherwise – the Open Street Map / Nominatim services are worth a try but your success will vary by country. I had used them before for addresses in France with fair success, but it didn’t help me with Turkey.
5. You can also crawl through the GIS Stackexchange for advice. I’ve found that most of the suggestions are either for US geocoding, or are companies that are answering posts saying “Hey you can try my service!”

Happy geocoding, comrades! In my next post I’ll discuss my experience with batch geocoding addresses here in the US of A with Python.

### Writing Functions and Building a Jinja Template

Thursday, August 6th, 2015

In previous posts I demonstrated how to pull data from a sqlite / spatialite data to generate reports using Python and Jinja, where Jinja2 is used as a template engine for creating LaTeX documents and the NYC Geodatabase is used as my test case. Up until now the scripts pulled the data “as is”. In this post I’ll demonstrate how I created derived variables, and how I created the Jinja2 template for the report. Please note – instead of duplicating all of the code I’m just going to illustrate the new pieces – you should check out the earlier posts to see how all the pieces fit together.

### Aggregating Variables

Aggregating census data is a pretty common operation, and when working with American Community Survey estimates it’s also necessary to calculate a new margin of error for each derived value. I wrote two functions to accomplish this. For each function you pass in the keys for values you want to aggregate, a name which will be the name of the new variable, and a dictionary that contains all the keys and values that were taken from a database table for a specific geography.

#Functions for summing individual values and calculating margins of error
#for individual values

tosum=[]
for val in keys:
agg=sum(tosum)

sqrd=[]
for val in keys:
if item=='':
pass
else:
sqrd.append(item**2)
moe=round(math.sqrt(sum(sqrd)))


Later in the script, as we’re looping through all the geographies and gathering the necessary data into dictionaries that represent each data table, we call the function. In this example we’re combining household income brackets so that we don’t have so many categories:

for geog in geodict.keys():

name=geodict.get(geog)
filename='zzpuma_' + geog + '.tex'
folder='puma_rept'
outpath=os.path.join(folder,filename)

acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)

calc_sums(['INC03_E','INC04_E'],'INC10K_E',acs1dict)
calc_moe(['INC03_M','INC04_M'],'INC10K_M',acs1dict)
calc_sums(['INC05_E','INC06_E'],'INC25K_E',acs1dict)
calc_moe(['INC05_M','INC06_M'],'INC25K_M',acs1dict)
calc_sums(['INC09_E','INC10_E'],'INC100K_E',acs1dict)
calc_moe(['INC09_M','INC10_M'],'INC100K_M',acs1dict)


Rather than creating a new dictionary, these new values are simply appended to the existing dictionaries that contain the data taken from each of the ACS data tables in the database. They can be referenced in the template using their new column name.

### Calculating Areas

I also want to include the geographic size of the PUMA as one of the report items. Columns for the area are included in the spatial table for the PUMAs – the features originally came from the TIGER files, and all TIGER files have an ALAND and an AWATER column that has land and water area in square meters. So we don’t have to calculate the area from the geometry – we can just use this function to convert the land and water attributes to square miles, and then calculate a total area:

def calc_area(adict,land,water,total):
totalarea=landarea+waterarea


In the body of our script, we invoke our pulltab function (explained in an earlier post) to grab all the data from the PUMA spatial boundary table:

area=pulltab('c_bndy_pumas2010','geoid10',geog)


And then we can call our area function. We pass in the area dictionary, and what we want the new output column names to be – area for land, water, and total:

calc_area(area, 'LAND_SQM','WAT_SQM','TOT_SQM')


Like our previous aggregate script, this function appends our new values to the existing table-dictionary – in this case, one called area.

### Aggregating Geographies

Our last function is a little more complicated. In all of our previous examples, we pulled PUMA-level data from the American Community Survey tables. What if we wanted 2010 Census data for the PUMAs? Decennial census data is not tabulated at the PUMA level, but it is tabulated at the census tract level. Since PUMAs are created by aggregating tracts, we can aggregate the census tract data in the NYC Geodatabase into PUMAs. Here’s our function:

#Function aggregates all values in a table with a group by field from a
#joined table, then creates a dictionary consisting of column names and values
#for a specific geography

def sumtab(tabname,jointab,id1,id2,gid,geog):
query='SELECT * FROM %s LIMIT 1' %(tabname)
curs.execute(query)
col_names = [cn[0] for cn in curs.description]
tosum=[]
for var in col_names[3:]:
tosum.append("SUM("+var+") AS '0_"+var+"'")
summer=', '.join(str(command) for command in tosum)
query='SELECT %s, %s FROM %s, %s WHERE %s = %s and %s = %s GROUP BY %s' %(gid,summer,tabname,jointab,id1,id2,gid,geog,gid)
curs.execute(query)
col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
for row in rows:
thedict=dict(zip(col_names,row))
return thedict


What’s going on here? The first thing we need to do is associate the census tracts with the PUMAs they’re located in. The NYC Geodatabase does NOT have a relationship table for this, so I had to create one. We have to pass in the table name, the relationship table, the unique IDs for each, and then the ID and the geography that we’re interested in (remember our script is looping through PUMA geographies one by one). The first thing we do is a little trick – we get the names of every column in the existing data table, and we append them to a list where we create a new column name based on the existing one (in this case, append a 0 in front of the column name – in retrospect I realize this is a bad idea as column names should not begin with numbers, so this is something I will change). Then we can take the list of column names and create a giant string out of them.

With that giant string (called summer) we can now pass all of the parameters that we need into the SQL query. This selects all of our columns (using the summer string), the table names and join info, for the specific geographic area that we want and then groups the data by that geography (i.e. all tracts that have the same PUMA number). Then we zip the column names and values together in a dictionary that the function returns.

Later on in our script, we call the function:

    census10=sumtab('b_tracts_2010census','b_tracts_to_pumas','GEOID2','tractid','pumaid',geog)


Which creates a new dictionary called census10 that has all the 2010 census data for our PUMA. Like the rest of our dictionaries, census10 is passed out to the Jinja2 template and its values can be invoked using the dictionary keys (the column headings):

outfile=open(outpath,'w')
outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict, area=area,
c2010=census10))
outfile.close()


### Designing the Template

The Jinja template is going to look pretty busy compared to our earlier examples, and in both cases they’re not complete (this is still a work in progress).

I wanted to design the entire report first, to get a sense for how to balance everything I want on the page, without including any Jinja code to reference specific variables in the database. So I initially worked just in LaTeX and focused on designing the document with placeholders. Ultimately I decided to use the LaTeX minipage environment as it seemed the best approach in giving me control in balancing items on the page. The LaTeX wikibook entries on floats, figures, and captions and on boxes was invaluable for figuring this out. I used rule to draw boxes to serve as placeholders for charts and figures. Since the report is being designed as a document (ANSI A 8 1/2 by 11 inches) I had no hang-up with specifying precise dimensions (i.e. this isn’t going into a webpage that could be stretched or mushed on any number of screens). I loaded the xcolor package so I could modify the row colors of the tables, as well as a number of other packages that make it easy to balance table and figure captions on the page (caption, subscaption, and multicol).

Once I was satisfied with the look and feel, I made a copy of this template and started modifying the copy with the Jinja references. The references look awfully busy, but this is the same thing I’ve illustrated in earlier posts. We’re just getting the values from the dictionaries we created by invoking their keys, regardless of whether we’re taking new derived values that we created or simply pulling existing values that were in the original data tables. Here’s a snippet of the LaTeX with Jinja that includes both derived (2010 Census, area) and existing (ACS) variables:

%Orientation - detail map and basic background info
\begin{minipage}{\textwidth}
\begin{minipage}[h]{3in}
\centering
\rule{3in}{3in}
\captionof{figure}{Race by 2010 Census Tract}
\end{minipage}
\hfill
\begin{minipage}[h]{4in}
\centering
\captionof{table}{Geography}
\begin{tabular}{cccc}\hline
& Land & Water & Total\\ \hline
Area (sq miles) &  \num{\VAR{area.get('LAND_SQM')}} & \num {\VAR{area.get('WAT_SQM')}} &  \num {\VAR{area.get('TOT_SQM')}} \\ \hline
\vspace{10pt}
\end{tabular}
\captionof{table}{Basic Demographics}
\rowcolors{3}{SpringGreen}{white}
\begin{tabular}{cccc}\hline
& \textbf{2010 Census} & \textbf{2009-2013} & \textbf{ACS Margin}\\
& & \textbf{ACS} & \textbf{of Error}\\ \hline
Population & \num {\VAR{c2010.get('0_HD01_S001')}} & \num{\VAR{acs2.get('SXAG01_E')}} & +/- \num{\VAR{acs2.get('SXAG01_M')}}\\
Males & \num {\VAR{c2010.get('0_HD01_S026')}} & \num{\VAR{acs2.get('SXAG02_E')}} & +/- \num{\VAR{acs2.get('SXAG02_M')}}\\
Females & \num {\VAR{c2010.get('0_HD01_S051')}}& \num{\VAR{acs2.get('SXAG03_E')}} & +/- \num{\VAR{acs2.get('SXAG03_M')}}\\
Median Age (yrs) & \num {99999} & \num{\VAR{acs2.get('SXAG17_E')}} & +/- \num{\VAR{acs2.get('SXAG17_M')}}\\
Households & \num {\VAR{c2010.get('0_HD01_S150')}} & \num{\VAR{acs1.get('HSHD01_E')}} & +/- \num{\VAR{acs1.get('HSHD01_M')}}\\
Housing Units & \num {\VAR{c2010.get('0_HD01_S169')}} & \num{\VAR{acs2.get('HOC01_E')}} & +/- \num{\VAR{acs2.get('HOC01_M')}}\\ \hline
\end{tabular}
\end{minipage}
\end{minipage}


And here’s a snippet of the resulting PDF:

### What Next?

You may have noticed references to figures and charts in some of the code above. I’ll discuss my trials and tribulations with trying to use matplotlib to create charts in some future post. Ultimately I decided not to take that approach, and was experimenting with using various LaTeX packages to produce charts instead.

### Looping Through a Database to Create Reports

Tuesday, July 28th, 2015

I’ve got a lot of ground to cover, picking up where I left off several months ago. In earlier posts I presented the concept for creating reports from sqlite databases using Python, Jinja, and LaTeX, and looked at different methods for passing data from the database to the template. I’m using the NYC Geodatabase as our test case. In this entry I’ll cover how I implemented my preferred approach – creating Python dictionaries to pass to the Jinja template.

One of the primary decisions I had to make was how to loop through the database. Since the reports we’re making are profiles (lots of different data for one geographic area), we’re going to want to loop through the database by geography. So, for each geography select all the data from a specific table, pass the data out to the template where the pertinent variables are pulled, build the report and move on to the next geography. In contrast, if we were building comparison tables (one specific variable for many geographic areas) we would want to loop through the data by variable.

In the beginning of the script we import the necessary modules, set up the Jinja environment, and specify our template (not going to repeat that code here – see the previous post). Then we have our function that creates a dictionary for a specific data table for a specific geography:

def pulltab(tabname,idcol,geog):
query='SELECT * FROM %s WHERE %s = %s' %(tabname,idcol,geog)
curs.execute(query)
col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
for row in rows:
thedict=dict(zip(col_names,row))
return thedict


We connect to the database and create a dictionary of all the geographies (limited to 3 PUMAs since this is just a test):

#Connect to database and create dictionary of all geographies

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT geoid10, namelsad10 FROM a_pumas2010 ORDER BY geoid10 LIMIT 3')
rows = curs.fetchall()
geodict=dict(rows)


And then we generate reports by looping through all the geographies in that dictionary, and we pass in the ID of each geography to pull all data from a data table for that geography out of the table and into a dictionary.

#Generate reports by looping through geographies and passing out
#dictionaries of values

for geog in geodict.keys():
acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)
name=geodict.get(geog)
filename='zzpuma_' + geog + '.tex'
folder='test5'
outpath=os.path.join(folder,filename)


Lastly, we pass the dictionaries out to the template, and run LaTeX to generate the report from the template:

  outfile=open(outpath,'w')
outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict))
outfile.close()

os.system("pdflatex -output-directory=" + folder + " " + outpath)

conn.close()


The Jinja template (as a LaTeX file) is below – the example here is similar to what I covered in my previous post. We passed two dictionaries into the template, one for each data table. The key is the name of the variable (the column name in the table) and the value is the American Community Survey estimate and the margin of error. We pass in the key and get the value in return. The PDF output follows.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage{graphicx}
\usepackage[labelformat=empty]{caption}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{acs1.get('GEOLABEL') | replace("&","\&")} \VAR{acs1.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{table}[h]
\centering
\caption{Commuting to Work - Workers 16 years and over}
\begin{tabular}{|c|c|c|c|c|}

\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\
\hline

Car, truck, or van alone & \num{\VAR{acs1.get('COM02_E')}} & +/- \num{\VAR{acs1.get('COM02_M')}}
& \num{\VAR{acs1.get('COM02_PC')}} & +/- \num{\VAR{acs1.get('COM02_PM')}}\\

Car, truck, or van carpooled & \num{\VAR{acs1.get('COM03_E')}} & +/- \num{\VAR{acs1.get('COM03_M')}}
& \num{\VAR{acs1.get('COM03_PC')}} & +/- \num{\VAR{acs1.get('COM03_PM')}}\\

Public transit & \num{\VAR{acs1.get('COM04_E')}} & +/- \num{\VAR{acs1.get('COM04_M')}}
& \num{\VAR{acs1.get('COM04_PC')}} & +/- \num{\VAR{acs1.get('COM04_PM')}}\\

Walked & \num{\VAR{acs1.get('COM05_E')}} & +/- \num{\VAR{acs1.get('COM05_M')}}
& \num{\VAR{acs1.get('COM05_PC')}} & +/- \num{\VAR{acs1.get('COM05_PM')}}\\

Other means & \num{\VAR{acs1.get('COM06_E')}} & +/- \num{\VAR{acs1.get('COM06_M')}}
& \num{\VAR{acs1.get('COM06_PC')}} & +/- \num{\VAR{acs1.get('COM06_PM')}}\\

Worked at home & \num{\VAR{acs1.get('COM07_E')}} & +/- \num{\VAR{acs1.get('COM07_M')}}
& \num{\VAR{acs1.get('COM07_PC')}} & +/- \num{\VAR{acs1.get('COM07_PM')}}\\
\hline

\end{tabular}
\end{table}

\begin{table}[h]
\centering
\caption{Housing Tenure}
\begin{tabular}{|c|c|c|c|c|}

\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\
\hline

Occupied housing units & \num{\VAR{acs2.get('HTEN01_E')}} & +/- \num{\VAR{acs2.get('HTEN01_M')}} &  &\\

Owner-occupied & \num{\VAR{acs2.get('HTEN02_E')}} & +/- \num{\VAR{acs2.get('HTEN02_M')}}
& \num{\VAR{acs2.get('HTEN02_PC')}} & +/- \num{\VAR{acs2.get('HTEN02_PM')}}\\

Renter-occupied & \num{\VAR{acs2.get('HTEN03_E')}} & +/- \num{\VAR{acs2.get('HTEN03_M')}}
& \num{\VAR{acs2.get('HTEN03_PC')}} & +/- \num{\VAR{acs2.get('HTEN03_PM')}}\\
\hline

\end{tabular}
\end{table}
\end{document}


In this example we took the simple approach of grabbing all the variables that were in a particular table, and then we just selected what we wanted within the template. This is fine since we’re only dealing with 55 PUMAs and a table that has 200 columns or so. If we were dealing with gigantic tables or tons of geographies, we could modify the Python script to pull just the variables we wanted to speed up the process; my inclination would be to create a list of variables in a text file, read that list into the script and modify the SQL function to just select those variables.

What if we want to modify some of the variables before we pass them into the template? I’ll cover that in the next post.

### Inserting Data into Templates with Python and Jinja

Friday, April 3rd, 2015

In this post, I’m picking up where I left off and will cover the different methods I experimented with to get data out of a SQLite database and into a Jinja LaTeX template using Python. I’m using the NYC Geodatabase as my test case.

### Standard Elements – Used Each Time

First – the Python script. For each iteration, the top half of the script remains the same. I import the necessary modules, and I set up my Jinja2 environment. This tells Jinja how to handle LaTeX syntax. I borrowed this code directly from the invaluable slides posted here. The only part that gets modified each time is the .get_template() bit, which is the actual LaTeX template with Jinja mark-up that is used for creating the reports.

import sqlite3

import jinja2
import os
from jinja2 import Template

latex_jinja_env = jinja2.Environment(
block_start_string = '\BLOCK{',
block_end_string = '}',
variable_start_string = '\VAR{',
variable_end_string = '}',
comment_start_string = '\#{',
comment_end_string = '}',
line_statement_prefix = '%-',
line_comment_prefix = '%#',
trim_blocks = True,
autoescape = False,
)
# Modify to specify the template
template = latex_jinja_env.get_template('test1.tex')


The method for connecting to a SQLite database is also the same each time. There are a zillion tutorials and posts for working with Python and SQLite so I won’t belabor that here. Take a look at this excellent one or this awesome one.

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT * FROM b_pumas_2013acs1 ORDER BY GEOID2 LIMIT 3')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
conn.close()


### First Iteration – Pass Individual Variables to the Template

Here’s the bit that I modify each time. Using the example from the tutorial slides, I loop through the rows returned from my database, and I specify individual variables each time by slicing the elements in the row and assigning them a name which is passed out to the template with template.render(). Then I make a call to LaTeX to generate the PDF file (straightforward since I’m using Linux), one for each row (which represent geographic areas). Each file is named using the unique ID number of the geography, which we grabbed from our row list.

for row in rows:
filename='zpuma_' + row[0] + '.tex'
folder='test1'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(geoid=row[0], geolabel=row[1], hshld=row[2], hshldmoe=row[3]))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


That’s the Python piece. The LaTeX template with the Jinja mark-up looks like this:

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{geolabel | replace("&","\&")} \VAR{geoid}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|}
\hline
& Estimate & Margin of Error\\
Households: & \num{\VAR{hshld}} & +/- \num{\VAR{hshldmoe}}\\
\hline
\end{tabular}

\end{document}


You can see here where I’m passing in the variables with \VAR – I’m using the same variable names that I created in the script to hold the row elements. I have to do a little bit of formatting to get this to work. First, one of my variables is text description that consistently contains an ampersand, so I have to use replace (a construct from Jinja) to replace & with \& so LaTeX can properly escape it. Second, I want to format my numeric variables with a thousands separator. Here I use a LaTeX construct with the siunitx package, and every place a number appears I mark it with \num. For this to work I always need to know that this variable will be a number; if it’s text or null LaTeX will throw an error and the doc won’t compile (an alternative to using this LaTeX solution would be to use Python’s formatting constructs). My simple output is below.

### Second Iteration – Pass Variables to Template in a List

Since I’m going to be passing lots of variables out to my template, it would be tedious if I had to declare them all individually, one by one. It would be better if I could pass out an entire list, and then do the slicing to get what I want in the template. Here’s the Python for doing that:

for row in rows:
filename='zzpuma_' + row[0] + '.tex'
folder='test2'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(thelist=row))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


And here’s the LaTeX template – in this example I modified the variables a bit.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{thelist[2] | replace("&","\&")} \VAR{thelist[1]}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Percent Margin of Error\\
Car, truck, or van alone: & \num{\VAR{thelist[171]}} & +/- \num{\VAR{thelist[172]}} & \num{\VAR{thelist[173]}} & +/- \num{\VAR{thelist[174]}}\\
Car, truck, or van carpooled: & \num{\VAR{thelist[175]}} & +/- \num{\VAR{thelist[176]}} & \num{\VAR{thelist[177]}} & +/- \num{\VAR{thelist[178]}}\\
\hline
\end{tabular}

\end{document}


While this is a bit better, the template is harder to read – you can’t really figure out what’s in there as you just have a bunch of list slices. You also have to keep careful track of which indices apply to what element, so you know what you’re generating. I thought I could improve this by creating nested lists where the column headings from the database get carried along, and I could reference them somehow. Then I had a better idea.

### Third Iteration – Pass Variables to Template in a Dictionary

I decided to use a dictionary instead of a list. Here’s the Python – since I grabbed the columns back in the database section of my code, I can loop through the elements in each row and create a dictionary by zipping the column names and row elements together, so the column name becomes the key and the row element is my data value. Then I pass the whole dictionary out to the template.

for row in rows:
thedict=dict(zip(col_names,row))
filename='zzpuma_' + row[0] + '.tex'
folder='test3'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(d=thedict))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


Now in the template, using Jinja I embed dict.get() for each variable and specify the key (column name) and the output will be the value. This is now highly readable, as I can see the names of the columns for the variables and there’s less potential for a mix-up.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{d.get('GEOLABEL') | replace("&","\&")} \VAR{d.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\

Car, truck, or van alone: & \num{\VAR{d.get('COM02_E')}} & +/- \num{\VAR{d.get('COM02_M')}} & \num{\VAR{d.get('COM02_PC')}} & +/- \num{\VAR{d.get('COM02_PM')}}\\

Car, truck, or van carpooled: & \num{\VAR{d.get('COM03_E')}} & +/- \num{\VAR{d.get('COM03_M')}} & \num{\VAR{d.get('COM03_PC')}} & +/- \num{\VAR{d.get('COM03_PM')}}\\
\hline
\end{tabular}

\end{document}


In this case, the output looks the same as it did in our last iteration. Those are some basic methods for getting data into a template, and in my case I think the dictionary is the ideal data structure for this. In going further, my goal is to keep all the formatting and presentation issues in LaTeX, and all the data processing and selection pieces in Python.

### Creating Reports with SQLite, Python, Jinja2, and LaTeX

Sunday, March 29th, 2015

For a long time, I’ve been wanting to figure out a way to generate reports from a SQLite / Spatialite database. For example, I’d like to reach into a database and generate profiles for different places that contain tables, charts, and maps. I know I can use Python to connect to the db and pull out variables. I also learned how to use LaTeX several years back when I revised the GIS Practicum manual, and routinely use it for writing reports, articles, and hand-outs.

I finally have time to devote to this, and am going to share what I’m learning in a series of posts. In this post I’ll describe how I got started, and will record some useful projects and posts that I’ve found.

### Figuring Out What the Pieces Are

In searching the web for building reports in Python, I’ve discovered a number of solutions. Many people have written modules that are in various states of production, from active to defunct. Prettytable was something I’ve used for generating basic text-file reports. It’s absolutely great at what it does, but I’m looking for something that’s more robust. Of all the tools out there, ReportLab seemed to be the most prominent package that would appear again and again. I’ve shied away from it, because I wanted a solution that was a little more general – if that makes sense. Something where every component is not so tightly bound to a specific module.

Luckily I found this post, which was perfect for helping me to understand conceptually what I wanted to do. The author describes how he automatically generates song sheets by using a programming language (JAVA in this case) to reach into a database and insert the content into a template (LaTeX in this case) using a template engine (Apache Velocity) to produce good looking output. In this case, the template has the shell of a document and place-holders where variables will be passed in from the scripting language and rendered using the engine. He included this helpful diagram from wikimedia in his post:

I started looking for a template engine that would work well with both LaTeX and Python. The author had mentioned Cheetah as another engine, and it turns out that Cheetah is often used in conjunction with Python and LaTeX. After digging around some more, I discovered another template engine called Jinja (or Jinja2) which I’ve adopted as my solution, largely because I’ve found that the project documentation was quite good and there are numerous user examples that I can follow. Jinja2 allows you to do much more than simply passing variables into the template and rendering it; you have the option to run a lot of Pythonesque code from within the template itself.

### Putting the Pieces Together

While Jinja is often used for generating HTML and XML (for example), it’s also used for LaTeX (for example). I found that this series of slides was the perfect introduction for me. They’re written in German, but since most of the syntax in the scripting and mark-up languages is in English it’s easy to grasp (and those three-years of German I took way back in high school are now reaping dividends!)

The slides break down how you can use Python to generate LaTeX reports in several iterations. The first iteration involves no templating at all – you simply use Python to generate the LaTeX code that you want (or if you prefer, Python serves as the template generator). The limit of this are obvious, in that you have to hard code variables into the output, or use string substitution to find and replace variable names with the intended output. In the next iteration, he demonstrates how to use Jinja2. This section is invaluable, as it provides an example of setting the Jinja2 environment so that you can escape all of the necessary characters and syntax that LaTeX needs to function. He demonstrates how to pass a variable from Python to render in a template that you create in LaTeX and mark-up with Jinja2 code (slides 18 to 20). He goes on to show how you can loop through lists to generate output.

The third iteration displays how you can pull data out of SQLite and then use Python and LaTeX to generate output. With a little imagination, you can combine this piece with his previous one and voila, you have a SQLite-Python-Jinja-Latex combo. He has a final piece that incorporates screen-scraping using Beautiful Soup, which is pretty neat but beyond my needs for this project.

Now that I understand the conceptual model and I have the four tools I’ll use with some examples, I’m ready to start experimenting. I know there will be several additional pieces I’ll need to incorporate, to generate charts (matplotlib) and maps (perhaps some of the Python modules from QGIS). There are some instances where I’ll also have to write functions to create derivatives of the data I’m pulling, so I imagine NumPy/SciPy and GDAL will come in handy for that. But first things first – I need to get the four basic pieces – SQLite – Python – Jinja2 – LaTeX – working together. That will be the topic of my next post.

### Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by any one who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

The citation and link:

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I’ve disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

### Creating Reports with SQLite, Python, and prettytable

Friday, February 8th, 2013

In addition to providing the NYC Geodatabase as a resource, I also wanted to use it to generate reports and build applications. None of the open source SQLite GUIs that I’m familiar with have built in report generating capabilities, so I thought I could use Python to connect to the database and generate them. I have some grand ambitions here, but decided to start out small.

Python has a built-in module, sqlite3, that you can use to work with SQLite databases. This is pretty well documented – do a search and you’ll find a ton of brief tutorials. Take a look at this great post for a comprehensive intro.

For generating reports I gave prettytable a shot: it lets you create nice looking ASCII text tables that you can copy and paste from the prompt or export out to a file. The tutorial for the module was pretty clear and covers the basics quite nicely. In the examples he directly embeds the data in the script and generates the table from it, which makes the tutorial readily understandable. For my purposes I wanted to pull data out of a SQLite database and into a formatted table, so that’s what I’ll demonstrate here.

Initially I had some trouble getting the module to load, primarily (I think) because I’m using Python 3.x and the setup file for the module was written for Python 2.x; the utility you use for importing 3rd party modules has changed between versions. I’m certainly no Python expert, so instead of figuring it out I just downloaded the module, dumped it into the site-packages folder (as suggested in the prettytable installation instructions under “The Harder Way” – but it wasn’t hard at all) and unzipped it. In my script I couldn’t get the simple “import prettytable” to work without throwing an error, but when I added the name of the specific function “import PrettyTable from prettytable” it worked. Your mileage may vary.

So here was my first go at it. I created a test database and loaded a table of population estimates from the US Census Bureau into it (you can download it if you want to experiment):

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

x = PrettyTable(col_names)
x.align[col_names[1]] = "l"
x.align[col_names[2]] = "r"
for row in rows:

print (x)
tabstring = x.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()


The first piece is the standard SQLite piece – connect, activate a cursor, and execute a SQL statement. Here I’m grabbing three columns from the table for records that represent Northeastern states (Region 1). I read in the names of the columns from the first row into the col_names list, and I grab everything else and dump them into rows, a list that contains a tuple for each record:

>>> col_names
['State', 'Name', 'Est2010']
>>> rows
[('09', 'Connecticut', 3574097), ('23', 'Maine', 1328361), ('25', 'Massachusetts', 6547629),
('33', 'New Hampshire', 1316469), ('34', 'New Jersey', 8791898), ('36', 'New York', 19378104),
('42', 'Pennsylvania', 12702379), ('44', 'Rhode Island', 1052567), ('50', 'Vermont', 625741)]
>>>


The second piece will make sense after you have a quick look at the prettytable tutorial. Here I grab the list of columns names and specify how cells for the columns should be aligned (default is center) and padded (default is one space). Then I add each row from the nested list of tuples to the table, row by row. There are two outputs: print directly to the screen, and dump the whole table into a string. That string can then be dumped into a text file, along with a title. Here’s the screen output:

+-------+---------------+----------+
| State | Name          |  Est2010 |
+-------+---------------+----------+
|   09  | Connecticut   |  3574097 |
|   23  | Maine         |  1328361 |
|   25  | Massachusetts |  6547629 |
|   33  | New Hampshire |  1316469 |
|   34  | New Jersey    |  8791898 |
|   36  | New York      | 19378104 |
|   42  | Pennsylvania  | 12702379 |
|   44  | Rhode Island  |  1052567 |
|   50  | Vermont       |   625741 |
+-------+---------------+----------+


The one hangup I had was the formatting for the numbers: I really want some commas in there since the values are so large. I couldn’t figure out how to do this using the approach above – I’m writing all the rows in one swoop, and couldn’t step in and and format the last value for each row.

Unless – instead of constructing the table by rows, I construct it by columns. Here’s my second go at it:

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

y=PrettyTable()
y.add_column(col_names[0],[row[0] for row in rows])
y.add_column(col_names[1],[row[1] for row in rows])
y.add_column(col_names[2],[format(row[2],',d') for row in rows])
y.align[col_names[1]]="l"
y.align[col_names[2]]="r"

print(y)
tabstring = y.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()


To add by column, you don’t provide any arguments to the PrettyTable function. You just add the columns one by one: here I call the appropriate values using the index, first for the column name and then for all of the values from the rows that are in the same position. For the last value (the population estimate) I use format to display the value like a decimal number (this works in Python 3.1+ – for earlier versions there’s a similar command – see this post for details). I tried this in my first example but I couldn’t get the format to stick, or got an error. Since I’m specifically calling these row values and then writing them I was able to get it to work in this second example. In this version the alignment specifications have to come last. Here’s the result:

+-------+---------------+------------+
| State | Name          |    Est2010 |
+-------+---------------+------------+
|   09  | Connecticut   |  3,574,097 |
|   23  | Maine         |  1,328,361 |
|   25  | Massachusetts |  6,547,629 |
|   33  | New Hampshire |  1,316,469 |
|   34  | New Jersey    |  8,791,898 |
|   36  | New York      | 19,378,104 |
|   42  | Pennsylvania  | 12,702,379 |
|   44  | Rhode Island  |  1,052,567 |
|   50  | Vermont       |    625,741 |
+-------+---------------+------------+


prettytable gives you a few other options, like the ability to sort records by a certain column or to return only the first “n” records from a table. In this example, since we’re pulling the data from a database we could (and did) specify sorting and other constraints in the SQL statement instead. prettytable also gives you the option of exporting the table as HTML, which can certainly come in handy.

### Screen Scraping Data with Python

Friday, March 9th, 2012

I had a request recently for population centers (aka population centroids) for all the counties in the US. The Census provides the 2010 centroids in state level files and in one national file for download, but the 2000 centroids were provided in HTML tables on individual web pages for each state. Rather than doing the tedious work of copying and pasting 51 web pages into a spreadsheet, I figured this was my chance to learn how to do some screen scraping with Python. I’m certainly no programmer, but based on what I’ve learned (I took a three day workshop a couple years ago) and by consulting books and crawling the web for answers when I get stuck, I’ve been able to write some decent scripts for processing data.

For screen scraping there’s a must-have module called Beautiful Soup which easily let’s you parse web pages, well or ill-formed. After reading the Beautiful Soup Quickstart and some nice advice I found on a post on Stack Overflow, I was able to build a script that looped through each of the state web pages, scraped the data from the tables, and dumped it into a delimited text file. Here’s the code:

 

## Frank Donnelly Feb 29, 2012
## Scrapes 2000 centers of population for counties from individual state web pages
## and saves in one national-level text file.

from urllib.request import urlopen
from bs4 import BeautifulSoup

output_file=open('CenPop2000_Mean_CO.txt','a')

url='http://www.census.gov/geo/www/cenpop/county/coucntr%s.html'

fips=['01','02','04','05','06','08','09','10',
'11','12','13','15','16','17','18','19','20',
'21','22','23','24','25','26','27','28','29','30',
'31','32','33','34','35','36','37','38','39','40',
'41','42','44','45','46','47','48','49','50',
'51','53','54','55','56']

for i in fips:
soup = BeautifulSoup(urlopen(url %i).read())
list=titleTag.string.split()
name=(list[4:])
state=' '.join(name)

for row in soup('table')[1].tbody('tr'):
tds = row('td')
line=tds[0].string, tds[1].string, tds[2].string, state,
tds[3].string.replace(',',''), tds[4].string, tds[5].string

output_file.writelines(",".join(line)+"\n")

output_file.close()

 

After installing the modules step 1 is to import them into the script. I initially got a little stuck here, because there are also some standard modules for working with urls (urllib and urlib2) that I’ve seen in books and other examples that weren’t working for me. I discovered that since I’m using Python 3.x and not the 2.x series, something had changed recently and I had to change how I was referencing urllib.

With that out of the way I created a a text file, a list with the column headings I want, and then wrote those column headings to my file.

Next I read in the url. Since the Census uses a static URL that varies for each state by FIPS code, I was able to assign the URL to a variable and inserted the % symbol to substitute where the FIPS code goes. I created a list of all the FIPS codes, and then I run through a loop – for every FIPS code in the list I pass that code into the url where the % place holder is, and process that page.

The first bit of info I need to grab is the name of the state, which doesn’t appear in the table. I grab the title tag from the page and save it as a list, and then grab everything from the fourth element (fifth word) to the end of the list to capture the state name, and then collapse those list elements back into one string (have to do this for states that have multiple words – New, North, South, etc.).

So we go from the HTML Title tag:

County Population Centroids for New York

To a list with elements 0 to 5:

list=[“County”, “Population”, “Centroids”, “for”, “New”, “York”]

To a shorter list with elements 4 to end:

name=[“New”,”York”]

To a string:

state=”New York”

But the primary goal here is to grab everything in the table. So we identify the table in the HTML that we want – the first table in those pages [0] is just an empty frame and the second one [1] is the one with the data. For every row (tr) in the table we can reference and grab each cell (td), and string those cells together as a line by referencing them in the list. As I string these together I also insert the state name so that it appears on every line, and for the third list element (total population in 2000) I strip out any commas (numbers in the HTML table included commas, a major no-no that leads to headaches in a csv file). After we grab that line we dump it into the output file, with each value separated by a comma and each record on it’s own line (using the new line character). Once we’ve looped through each table on each page for each state, we close the file.

There are a few variations I could have tried; I could have read the FIPS codes in from a table rather than inserting them into the script, but I preferred to keep everything together. I could have read the state names in as a list, or coupled them with the codes in a dictionary. This would have been less risky then relying on the state name in the title tag, but since the pages were well-formed and I wanted to experiment a little I went the title tag route. Instead of typing the codes in by hand I used Excel trickery to concatenate commas to the end of each code, and then concatenated all the values together in one cell so I could copy and paste the list into the script.

You can go here to see an individual state page and source, and here to see what the final output looks like. Or if you’re just looking for a national level file of 2000 population centroids for counties that you can download, look no further!

### Learning Python at PyCamp

Thursday, June 10th, 2010

I got back from leave a couple week ago, and spent part of it at a Python boot camp. I’ve gotten tired of hacking away at data in spreadsheets and read in several places that Python is a good language to learn for beginning programmers – it’s also open source, flexible, and is used by many in the GIS community for processing data and building plugins and software (the instructor for the camp, Chris Calloway, pointed me to this presentation on Python scripting techniques for ArcGIS).

The workshop was a three-day event hosted at Penn State by the Triangle Zope and Python Users Group (TriZPUG). It was geared towards beginners and non-programmers (although many of my fellow classmates were IT and systems people) and provided a pretty thorough review of all of the elements of the language – now it’s up to me to tie it all together! The price was extremely reasonable (only \$300 for a 3 day class!) and I’d certainly recommend it if there’s a camp in your area; although I would also recommend reading a book or taking a tutorial to familiarize yourself with the basics BEFORE attending the class; I did, and as a result I think I got more out of it than I would have had going in cold.

The next PyCamp is being held in LA in a few days, and the following one will be in Toronto from Aug 30th to Sept 3rd (although this isn’t posted on the website yet); the normal workshop is a five day affair, the one I attended was a mini 3 day version which suited my needs pretty well.

There are tons of Python tutorials on the web and Python’s site is pretty definitive. If you’re looking for a book, I’d recommend Practical Programming: An Introduction to Computer Science Using Python. Unlike the “Learn Language X” books, this one introduces you to general theory and practice in programming, and the authors illustrate the applications with practical examples using Python – it’s been immensely helpful to me. Now that I’m around the initial learning curve, I’ve been relying more on Beginning Python: From Novice to Professional, which is better as a reference book and good for illustrating many of the uses for individual objects, methods, etc (which I had a hard time grasping before I covered the basics of programming).