## Archive for the ‘Data Processing’ Category

### Inserting Data into Templates with Python and Jinja

Friday, April 3rd, 2015

In this post, I’m picking up where I left off and will cover the different methods I experimented with to get data out of a SQLite database and into a Jinja LaTeX template using Python. I’m using the NYC Geodatabase as my test case.

### Standard Elements – Used Each Time

First – the Python script. For each iteration, the top half of the script remains the same. I import the necessary modules, and I set up my Jinja2 environment. This tells Jinja how to handle LaTeX syntax. I borrowed this code directly from the invaluable slides posted here. The only part that gets modified each time is the .get_template() bit, which is the actual LaTeX template with Jinja mark-up that is used for creating the reports.

import sqlite3

import jinja2
import os
from jinja2 import Template

latex_jinja_env = jinja2.Environment(
block_start_string = '\BLOCK{',
block_end_string = '}',
variable_start_string = '\VAR{',
variable_end_string = '}',
comment_start_string = '\#{',
comment_end_string = '}',
line_statement_prefix = '%-',
line_comment_prefix = '%#',
trim_blocks = True,
autoescape = False,
)
# Modify to specify the template
template = latex_jinja_env.get_template('test1.tex')


The method for connecting to a SQLite database is also the same each time. There are a zillion tutorials and posts for working with Python and SQLite so I won’t belabor that here. Take a look at this excellent one or this awesome one.

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT * FROM b_pumas_2013acs1 ORDER BY GEOID2 LIMIT 3')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
conn.close()


### First Iteration – Pass Individual Variables to the Template

Here’s the bit that I modify each time. Using the example from the tutorial slides, I loop through the rows returned from my database, and I specify individual variables each time by slicing the elements in the row and assigning them a name which is passed out to the template with template.render(). Then I make a call to LaTeX to generate the PDF file (straightforward since I’m using Linux), one for each row (which represent geographic areas). Each file is named using the unique ID number of the geography, which we grabbed from our row list.

for row in rows:
filename='zpuma_' + row[0] + '.tex'
folder='test1'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(geoid=row[0], geolabel=row[1], hshld=row[2], hshldmoe=row[3]))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


That’s the Python piece. The LaTeX template with the Jinja mark-up looks like this:

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{geolabel | replace("&","\&")} \VAR{geoid}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|}
\hline
& Estimate & Margin of Error\\
Households: & \num{\VAR{hshld}} & +/- \num{\VAR{hshldmoe}}\\
\hline
\end{tabular}

\end{document}


You can see here where I’m passing in the variables with \VAR – I’m using the same variable names that I created in the script to hold the row elements. I have to do a little bit of formatting to get this to work. First, one of my variables is text description that consistently contains an ampersand, so I have to use replace (a construct from Jinja) to replace & with \& so LaTeX can properly escape it. Second, I want to format my numeric variables with a thousands separator. Here I use a LaTeX construct with the siunitx package, and every place a number appears I mark it with \num. For this to work I always need to know that this variable will be a number; if it’s text or null LaTeX will throw an error and the doc won’t compile (an alternative to using this LaTeX solution would be to use Python’s formatting constructs). My simple output is below.

### Second Iteration – Pass Variables to Template in a List

Since I’m going to be passing lots of variables out to my template, it would be tedious if I had to declare them all individually, one by one. It would be better if I could pass out an entire list, and then do the slicing to get what I want in the template. Here’s the Python for doing that:

for row in rows:
filename='zzpuma_' + row[0] + '.tex'
folder='test2'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(thelist=row))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


And here’s the LaTeX template – in this example I modified the variables a bit.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{thelist[2] | replace("&","\&")} \VAR{thelist[1]}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Percent Margin of Error\\
Car, truck, or van alone: & \num{\VAR{thelist[171]}} & +/- \num{\VAR{thelist[172]}} & \num{\VAR{thelist[173]}} & +/- \num{\VAR{thelist[174]}}\\
Car, truck, or van carpooled: & \num{\VAR{thelist[175]}} & +/- \num{\VAR{thelist[176]}} & \num{\VAR{thelist[177]}} & +/- \num{\VAR{thelist[178]}}\\
\hline
\end{tabular}

\end{document}


While this is a bit better, the template is harder to read – you can’t really figure out what’s in there as you just have a bunch of list slices. You also have to keep careful track of which indices apply to what element, so you know what you’re generating. I thought I could improve this by creating nested lists where the column headings from the database get carried along, and I could reference them somehow. Then I had a better idea.

### Third Iteration – Pass Variables to Template in a Dictionary

I decided to use a dictionary instead of a list. Here’s the Python – since I grabbed the columns back in the database section of my code, I can loop through the elements in each row and create a dictionary by zipping the column names and row elements together, so the column name becomes the key and the row element is my data value. Then I pass the whole dictionary out to the template.

for row in rows:
thedict=dict(zip(col_names,row))
filename='zzpuma_' + row[0] + '.tex'
folder='test3'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(d=thedict))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


Now in the template, using Jinja I embed dict.get() for each variable and specify the key (column name) and the output will be the value. This is now highly readable, as I can see the names of the columns for the variables and there’s less potential for a mix-up.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{d.get('GEOLABEL') | replace("&","\&")} \VAR{d.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\

Car, truck, or van alone: & \num{\VAR{d.get('COM02_E')}} & +/- \num{\VAR{d.get('COM02_M')}} & \num{\VAR{d.get('COM02_PC')}} & +/- \num{\VAR{d.get('COM02_PM')}}\\

Car, truck, or van carpooled: & \num{\VAR{d.get('COM03_E')}} & +/- \num{\VAR{d.get('COM03_M')}} & \num{\VAR{d.get('COM03_PC')}} & +/- \num{\VAR{d.get('COM03_PM')}}\\
\hline
\end{tabular}

\end{document}


In this case, the output looks the same as it did in our last iteration. Those are some basic methods for getting data into a template, and in my case I think the dictionary is the ideal data structure for this. In going further, my goal is to keep all the formatting and presentation issues in LaTeX, and all the data processing and selection pieces in Python.

### Creating Reports with SQLite, Python, Jinja2, and LaTeX

Sunday, March 29th, 2015

For a long time, I’ve been wanting to figure out a way to generate reports from a SQLite / Spatialite database. For example, I’d like to reach into a database and generate profiles for different places that contain tables, charts, and maps. I know I can use Python to connect to the db and pull out variables. I also learned how to use LaTeX several years back when I revised the GIS Practicum manual, and routinely use it for writing reports, articles, and hand-outs.

I finally have time to devote to this, and am going to share what I’m learning in a series of posts. In this post I’ll describe how I got started, and will record some useful projects and posts that I’ve found.

### Figuring Out What the Pieces Are

In searching the web for building reports in Python, I’ve discovered a number of solutions. Many people have written modules that are in various states of production, from active to defunct. Prettytable was something I’ve used for generating basic text-file reports. It’s absolutely great at what it does, but I’m looking for something that’s more robust. Of all the tools out there, ReportLab seemed to be the most prominent package that would appear again and again. I’ve shied away from it, because I wanted a solution that was a little more general – if that makes sense. Something where every component is not so tightly bound to a specific module.

Luckily I found this post, which was perfect for helping me to understand conceptually what I wanted to do. The author describes how he automatically generates song sheets by using a programming language (JAVA in this case) to reach into a database and insert the content into a template (LaTeX in this case) using a template engine (Apache Velocity) to produce good looking output. In this case, the template has the shell of a document and place-holders where variables will be passed in from the scripting language and rendered using the engine. He included this helpful diagram from wikimedia in his post:

I started looking for a template engine that would work well with both LaTeX and Python. The author had mentioned Cheetah as another engine, and it turns out that Cheetah is often used in conjunction with Python and LaTeX. After digging around some more, I discovered another template engine called Jinja (or Jinja2) which I’ve adopted as my solution, largely because I’ve found that the project documentation was quite good and there are numerous user examples that I can follow. Jinja2 allows you to do much more than simply passing variables into the template and rendering it; you have the option to run a lot of Pythonesque code from within the template itself.

### Putting the Pieces Together

While Jinja is often used for generating HTML and XML (for example), it’s also used for LaTeX (for example). I found that this series of slides was the perfect introduction for me. They’re written in German, but since most of the syntax in the scripting and mark-up languages is in English it’s easy to grasp (and those three-years of German I took way back in high school are now reaping dividends!)

The slides break down how you can use Python to generate LaTeX reports in several iterations. The first iteration involves no templating at all – you simply use Python to generate the LaTeX code that you want (or if you prefer, Python serves as the template generator). The limit of this are obvious, in that you have to hard code variables into the output, or use string substitution to find and replace variable names with the intended output. In the next iteration, he demonstrates how to use Jinja2. This section is invaluable, as it provides an example of setting the Jinja2 environment so that you can escape all of the necessary characters and syntax that LaTeX needs to function. He demonstrates how to pass a variable from Python to render in a template that you create in LaTeX and mark-up with Jinja2 code (slides 18 to 20). He goes on to show how you can loop through lists to generate output.

The third iteration displays how you can pull data out of SQLite and then use Python and LaTeX to generate output. With a little imagination, you can combine this piece with his previous one and voila, you have a SQLite-Python-Jinja-Latex combo. He has a final piece that incorporates screen-scraping using Beautiful Soup, which is pretty neat but beyond my needs for this project.

Now that I understand the conceptual model and I have the four tools I’ll use with some examples, I’m ready to start experimenting. I know there will be several additional pieces I’ll need to incorporate, to generate charts (matplotlib) and maps (perhaps some of the Python modules from QGIS). There are some instances where I’ll also have to write functions to create derivatives of the data I’m pulling, so I imagine NumPy/SciPy and GDAL will come in handy for that. But first things first – I need to get the four basic pieces – SQLite – Python – Jinja2 – LaTeX – working together. That will be the topic of my next post.

### Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by any one who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I’ve disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

Wednesday, July 30th, 2014

I released the latest version of the NYC geodatabase (nyc_gdb) a few weeks ago. In addition to simply updating the data (for subway stations and ridership, city point features, and ZIP Code Business Patterns data) I had to make a couple of serious upgrades.

The first was that is was time for me to update the version of Spatialite I was using, from 2.4 to 4.1, and to update my documentation and tutorial from the Spatialite GUI 1.4 to 1.7. I used the spatialite_convert tool (see the bottom of this page for info)to upgrade and had no problem. There were some major benefits to making the switch. For one, writing statements that utilize spatial indexes is much simpler – this was version 2.4, generating a neighbor list of census tracts:

SELECT tract1.tractid AS tract, tract2.tractid AS neighbor
FROM a_tracts AS tract1, a_tracts AS tract2
WHERE ST_Touches(tract1.geometry, tract2.geometry) AND tract2.ROWID IN (
SELECT pkid FROM idx_a_tracts_geometry
WHERE pkid MATCH RTreeIntersects (MbrMinX(tract1.geometry), MbrMinY(tract1.geometry),
MbrMaxX(tract1.geometry), MbrMaxY(tract1.geometry)))

And here’s the same statement in 4.1 (for zctas instead of tracts):

SELECT zcta1.zcta AS zcta, zcta2.zcta AS neighbor
FROM a_zctas AS zcta1, a_zctas AS zcta2
WHERE ST_Touches(zcta1.geometry, zcta2.geometry)
AND zcta1.rowid IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name=’a_zctas’ AND search_frame=zcta2.geometry)
ORDER BY zcta, neighbor

There are also a number of improvements in the GUI. Tables generated by the user are now grouped under one heading for user data, and the internal tables are grouped under subsequent headings, so that users don’t have to sift through all the objects in the database to see what they need. The import options have improved – with shapefiles and dbfs you can now designate your own primary keys on import. You also have the option of importing Excel spreadsheets of the 97-2003 variety (.xls). In practice, if you want the import to go smoothly you have to designate data types (format-cells) in the Excel sheet (including number of decimal places) prior to importing.

I was hesitant to make the leap, because version 2.4 was the last version where they made pre-compiled binaries for all operating systems; after that, the only binaries are for MS Windows and for Mac and Linux you have to compile from source – which is daunting for many Mac users that I am ill-equipped to help. But now that Spatialite seems to be more fully integrated with QGIS (you can create databases with Q and using the DB Manager you can export layers to an existing database) I can always steer folks there as an alternative. As for Linux, more distros are including updated version of the GUI in their repositories which makes installation simple.

One of the latest features in Spatialite 4.1.1 is the ability to import XML ISO metadata into the database, where it’s stored as an XML-type blob in a dedicated table. Now that I’m doing more work with metadata this is something I’ll explore for the future.

### ZIPs to ZCTAs

The other big change was how the ZIP Code Business Patterns data is represented in the database. The ZBP data is reported for actual ZIP Codes that are taken from the addresses of the business establishments, while the boundaries in the nyc_gdb database are for ZIP Code Tabulation Areas (ZCTAs) from the Census. Previously, the ZBP data in the database only represented records for ZIP Codes that had a matching ZCTA number. As a result, ZIP Codes that lacked a corollary because they didn’t have any meaningful geographic area – the ZIP represented clusters of PO Boxes or large organizations that process a lot of mail – were omitted entirely.

In order to get a more accurate representation of business establishments in the City, I instituted a process to aggregate the ZIP Code data in the ZBP to the ZCTA level. I used the crosswalk table provided by the MCDC which assigns ZIPs to ZCTAs, so those PO Boxes and large institutions are assigned to the ZCTA where they are physically located. I used SQLite to import that crosswalk, imported the ZBP data, joined the two on the ZIP Code and did a group by on the ZCTA to sum employment, establishments, etc. For ZIPs that lacked data due to disclosure regulations, I added some note or flag columns that indicate how many businesses in a ZCTA are missing data. So now the data tables represent records for ZCTAs instead of ZIPs, and they can be joined to the ZCTA features and mapped.

The latest ZBP data in the new database is for 2012. I also went back and performed the same operation on the 2011 and 2010 ZBP data that was included in earlier databases, and have provided that data in CSV format for download in the archives page (in case anyone is using the old data and wants to go back and update it).

### Introducing – Data!

Wednesday, April 9th, 2014

Professors invite me to their classes each semester to give students a crash course in finding data for neighborhoods in New York City, with a particular emphasis on Census data. I typically visit courses in journalism and public affairs, but this semester I added classes in management and – theater – to the list. Before I dive into what the Census is and what sources they should use, I preface the presentation with a discussion of what neighborhoods are and how we define them. This is important because neighborhoods are locally and informally defined, and when searching for datasets we often have to use a proxy, like census tracts, ZIP codes, PUMAs, or local legal or administrative areas, to approximate them.

But before we get this far, I always begin the discussion with some basic questions to set the stage: what is data, and what can we use it for? For the journalism students, I explain that data can help support a story. If they’re covering a town hall or community board meeting where affordabale housing is the topic of discussion, they’re going to want to provide some context and include some facts to support their story – what is the rent like in the neighborhood? How many people live there? Alternatively, data can provide the basis for a story. I point to one of many numerous examples in NYC where journalists have taken a big lump of unrefined data – the NYPD’s stop and frisk data, traffic fatality incidents, 311 complaints – and have refined it to produce information that leads them to an interesting story that was hidden in these numbers. Lastly, data is a story – whenever the Census releases a new dataset, someone is writing to announce the release and tell us what’s in there.

This idea of refining leads us to our first basic definition – data can be considered as raw and unrefined information. It doesn’t tell us much in and of itself, but if we sift through and refine it we can turn it into information that we can use to tell or support a story or reveal some fact or truth that was previously unknown. Data can be quantitative or qualitative – journalists for example may interview someone for two or three hours, but they’re not going to turn around and publish that entire interview. They’re going to write an article that summarizes it and gives us the most important bits, or edit it for a radio broadcast that covers the high lights. With quantitative data the issue is similar – I use a basic example of population data for the 50 states and show them this image of a comma delimited text file:

I explain that this is what data looks lke in a raw state. It’s in a basic format suitable for preservation or transit between systems, but is not in a presentable state. There are a lot of codes that are meaningless to the average person, the data isn’t sorted in a meaningful way, the column headings seem ambiguous, and the numbers aren’t formated for viewing. This isn’t something that they’d want to insert directly into their story or paper. But if they take this and do a little bit of work:

They can take that raw data and turn it into information. Here we’ve moved from raw data to a presentable table. The statistics are sorted in a logical order based on total population, columns are given comprehensible names, and unecessary information (for presentation purposes) is dropped. We add commas to the numbers so they’re more legible, and we create some value by adding a percent total column. Now we have something we can use to communicate a message. But we can go further – we can take that same information and turn it into this:

Now we have a chart. At this point I turn to the students and ask them what the benefit of using the chart is, followed by a discussion of trade-offs; we’ve gained something but lost something too. On the plus side, we can appeal to people’s visual sensibilities, and we can see more clearly that California has twice as many people as New York. The chart is also more concise, as it’s taking up less real estate on the page or on the screen. But we’ve exchanged conciseness for preciseness; we can no longer tell what the exact population numbers are with the chart; we can only approximate. But we can also go further:

We can take that same dataset and turn it into a map. Once again, we discuss the pluses and minuses. Now we can key into to people’s geographic knowledge as well as their visual senses; Ohio may be more meaningful now that we can see it on a map, rather than just seeing a number in a table. We can also see geographic patterns of clustering or diffusion, which the table or chart couldn’t show us. But with the map we’ve lost even more precision. Now we can only see that a state’s population number falls within a given range; we can’t see the precise number and can’t approximate it like we could with the chart.

At this point, one student will point out that if the chart or map is on the web, we can have the best of all worlds. If the graphic is interactive we can hover over it and see the exact population number. This leads to a discussion of the trade-offs between interactive web-based information and static information. The interactive chart or map let’s us keep precision and conciseness, but the sacrifice is complexity, portability, and preservation. It’s more complex to create, and it can only exist in it’s native environment, within a specific bundle of technology that includes programming and scripting langauges, software libraries, browsers, and operting systems. Such things go obsolete quickly and can easily break, so the shiny chart, map, or app you have today is non-functional in a year or two, and difficult to preserve. Contrast that with a static image or text, which is simple, easy to move around, depends on little else, and can make the jump from a screen to the printed page.

We sum up this little talk with the basis of what they’re trying to achieve – I use the DIK pyramid, which I was introduced to in library school (OK – this pic is the DIKW pyramid, with wisdom thrown on top – it’s public domain so I can safely use it):

As journalists or researchers, you’re taking data and refining it to turn it into information to support your work or to commuicate a message. You take those pieces of new information and bring them together to tell a bigger story and paint a bigger picture, which we hope will lead to greater knowledge (which, unlike data and information, is something that can only be learned and not simply assesmbled and communicated). The weather is a good example – a giant log of temperature and precitiptation data isn’t going to do me much good. But if you process that data to calculate the high, low, and mean, now you have information I can use. Take that information and combine it with a radar picture and a forecast and now I have a rich information object. I can take that object and piece it together with other information – another forecast I hear on the radio, what I see out the window, my previous experiences of getting wet, my wife’s advice – to formulate a decision that I can act on. By considering all of this information – my experiences, contextualized information, and know how – and weighing it to reach a conslucion, I am using my knowledge. In this case I’ll use it carry an umbrella.

The final point is that, in their papers, the students must take the information objects that they’ve created or acquired and integrate them into their work. Many students will just copy and download a table and stick it at the back of the paper, and assume that it speaks for itself. I tell them – it doesn’t! You have to explain why it’s there; make reference to it in the paper and weave it into your research.

Overall this presentation / discussion takes all of about 10 minutes, AND THEN we move into the discussion of neighborhoods, the census, and specific datasets. I’ve contemplated skipping it all together, but ultimately decided that it’s necessary. I think it’s essential to provide some context and theory coupled with the actual sources and the pragmatic nature of finding the data. There are some librarians who are completely adverse to teaching “tools” and will speak completely in the abstract, while there are others who cut directly to listing the sources and leaving it at that. The first approach is useless because the students won’t learn what to actually do; the second apporach makes assumptions about what they know and fails to prepare them for what they’ll face. There also seems to be a clear need for me to do this – I’ve heard many faculty who have commented that students are simply tacking data tables they’ve copied off the web into the back of papers without any explanation. When I present the slide that depicts the csv file, I was initally shocked by the looks of shock on many student’s faces – like they’d never seen or heard of this before and were worried that they’d have to wrestle with it. Here’s the data-driven world, step 1.

Tuesday, March 4th, 2014

This example takes over where the Loading Data Into PostGIS – Text Files left off; we’ll use the same schema usa_general. We’ll load two shapefiles from the Census Bureau’s 2010 Cartographic Boundary files from https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html.

Loading shapefiles can be done via the pgAdmin III GUI (for this example, used version 1.18 for MS Windows) from a client machine using the shp2pgsql-gui. This is an extra plugin that must be downloaded and installed. A command line version of the tool (shp2pgsql) is automatically bundled with PostGIS, but can’t be used by a client unless they either log into the server via a terminal or install PostGIS locally.

The shp2pgsql-gui can be downloaded for MS Windows from the extras folder on the PostGIS website: http://download.osgeo.org/postgis/windows/. Version 2.01 for 32 bit environment was downloaded and unzipped. Per the instructions in the README file, the postgisgui folder was dragged into the PG BIN path as designated in pgAdmin under File > Options > Binary Paths. Once pgAdmin was restarted the gui was available under the Plugins menu.

### Set Options

Under Plugins > PostGIS Shapefile and DBF Loader, Check the Options Menu. There are often problems when dealing with UTF-8 encoding in DBF files and when working in an MS Windows environment. Change encoding from UTF-8 to WINDOWS-1252 (another possibility would be LATIN-1). Keep boxes for creating spatial index and using COPY instead of INSERT checked.

Browse and select both shapefiles. Once they appear in the import list the following elements must be changed: switch schema from public to desired schema (usa_general), change table to give each one a meaningful name, and specify the SRID number for each (since these files are from the US Census they’re in NAD 83, which is EPSG 4269). Once you hit Load, the log window will display the results of each import.

Refresh the database and both shapefiles will appear as tables.

### Check Geometries

Because of the size of some geometries, they may appear blank if you preview the tables in pgAdmin – this is a well known and documented issue. The simplest way to verify is to check and make sure that none of the geometry columns are null:

SELECT COUNT(*)
FROM usa_general.states
WHERE geom IS NULL;

“0”

### Primary Keys

By default, PostGIS will automatically create a field called gid using sequential integers, and will designate this field as the primary key. If the data already has a natural key that you want to use, you would have to drop the key constraint on gid and add the constraint to the desired field. Since features from the US Census have unique ANSI / FIPS codes, we’ll designate that as the key.

ALTER TABLE usa_general.states
DROP CONSTRAINT states_pkey

Query returned successfully with no result in 125 ms.

ALTER TABLE usa_general.states
ADD CONSTRAINT states_pkey PRIMARY KEY (“state”);

Query returned successfully with no result in 125 ms.

### Constraints

An alternative to changing the primary key would be to add UNIQUE and NOT NULL constraints to additional columns; gid remains as the primary key but the other columns will automatically be indexed when constraints are added.

ALTER TABLE usa_general.metros ADD CONSTRAINT cbsa_unq UNIQUE (cbsa);
ALTER TABLE usa_general.metros ALTER COLUMN cbsa SET NOT NULL;

### Delete Records

For sake of example, we’ll delete all the records for micropolitan areas in the metros table, so we’re left with just metropolitan areas.

DELETE FROM usa_general.metros

Query returned successfully: 581 rows affected, 219 ms execution time.

## Spatial SQL

As a test we can do a query to select all of the populated places that are within the NYC metropolitan area but are not in Pennsylvania, that have an elevation greater than or equal to 500 feet.

SELECT feature_name, state_alpha, county_name, elev_in_ft
FROM usa_general.pop_places, usa_general.metros
WHERE metros.cbsa=’35620′
AND ST_WITHIN(pop_places.geom, metros.geom)
AND elev_in_ft >= 500
AND state_alpha !=’PA’
ORDER BY elev_in_ft DESC

“Kampe”;”NJ”;”Sussex”;1293
“Highland Lakes”;”NJ”;”Sussex”;1283
“Edison”;”NJ”;”Sussex”;1234″…

## Connect to QGIS

We can look at the geographic features by establishing a connection between the database and QGIS. In QGIS we can either hit the add PostGIS Layer button or use the browser to connect to a our database; we need to add the connection name, host, port, and database name. The username and password can be left blank and Q will prompt the user for it (if you don’t want it saved on the system). Once the connection is established you can add the layers to the view and change the symbolization.

Tuesday, February 25th, 2014

I’ve fallen out of the blogosphere for quite awhile. Time to catch up – lately we’ve been experimenting with deploying our own instance of Open Geoportal (see previous post) and square one is getting a functioning data repository up and running. I’ve been tinkering with PostgreSQL / PostGIS and am documenting tests in a wiki we’ve created. The wiki is private, so I thought I’d re-post some of the tests here.

I’m working with a developer who’s installed and configured the database on a server, and I’m accessing it remotely using pgAdmin. I’ve started loading data and running queries and so far I’m relieved that most of what I’m seeing seems familiar, having worked with SQLite / Spatialite for a few years now. I’m using the excellent book PostGIS in Action as my guide. Here’s my take on loading a text file with coordinates into the db and building geometry for it. Loading shapefiles, spatial SQL experiments, and connecting to Geoserver will follow.

## Verify Version

SELECT postgis_full_version();

“POSTGIS=”2.1.1 r12113″ GEOS=”3.4.2-CAPI-1.8.2 r3921″ PROJ=”Rel. 4.8.0, 6 March 2012″ GDAL=”GDAL 1.9.2, released 2012/10/08″ LIBXML=”2.7.6″ LIBJSON=”UNKNOWN” TOPOLOGY RASTER”

## Create Schema for Holding Related Objects

CREATE SCHEMA usa_general;

Query returned successfully with no result in 297 ms.

## Import Delimited Text File

Test file is the US Populated Places gazetteer file from the USGS Geographic Names Information Service (GNIS). It is a pipe-delimited text file in encoded in UTF-8 with longitude and latitude coordinates in both DMS and DEC format. http://geonames.usgs.gov/domestic/download_data.htm

### Create Table for Holding Data

CREATE TABLE usa_general.pop_places
(
feature_id int NOT NULL PRIMARY KEY,
feature_name varchar,
feature_class varchar,
state_alpha char(2),
state_numeric char(2),
county_name varchar,
county_numeric char(3),
primary_lat_dms varchar,
prim_long_dms varchar,
prim_lat_dec float4,
prim_long_dec float4,
source_lat_dms varchar,
source_long_dms varchar,
source_lat_dec float4,
source_long_dec float4,
elev_in_m int,
elev_in_ft int,
map_name varchar,
date_created date,
date_edited date
);

Query returned successfully with no result in 390 ms.

### Copy Data From File

Must be run from the PSQL Console plugin in order to load data from a local client machine. If data was stored on the server, you could use the PGAdmin GUI and use COPY in an SQL statement instead of \copy. When running the PSQL Console on MS Windows the default character encoding is WIN1252. In this example, the data contains characters unsupported by this encoding; the file is encoded in UTF-8 and the console must be set to match. COPY is not a standard SQL command but is native to PostgreSQL.

— The optional HEADER specifies that the data file has a header column that should be skipped when importing

\encoding UTF8
\copy usa_general.pop_places
FROM ‘C:\Workspace\postgis_testing\POP_PLACES_20140204.txt’

### Basic Query to Test that Load was Success

SELECT *
FROM usa_general.pop_places
WHERE state_alpha=’NY’ and county_name=’New York’
ORDER BY feature_name

### Create Geometry from Coordinates and Make Spatial Index

4269 is the EPSG code for NAD 83, the basic geographic coordinate system used by US government agencies.

“usa_general.pop_places.geom SRID:4269 TYPE:POINT DIMS:2 ”

UPDATE usa_general.pop_places
SET geom = ST_GeomFromText(‘POINT(‘|| prim_long_dec || ‘ ‘|| prim_lat_dec || ‘)’,4269);

Query returned successfully: 200359 rows affected, 18268 ms execution time.

CREATE INDEX idx_pop_places_geom
ON usa_general.pop_places USING gist(geom);

Query returned successfully with no result in 8439 ms.

## Basic SQL Query to Test Geometry

ST_AsEWKT transforms the output of geometry from binary code into the OGC Well Known Text format (WKT).

SELECT feature_name, ST_AsEWKT(geom) AS geometry
FROM usa_general.pop_places
WHERE state_alpha=’NY’ AND county_name=’New York’
ORDER BY feature_name

“Amalgamated Dwellings”;”SRID=4269;POINT(-73.9828 40.715)”
“Amsterdam Houses”;”SRID=4269;POINT(-73.9881 40.7731)”
“Battery Park City”;”SRID=4269;POINT(-74.0163 40.7115)”…

## Basic Spatial SQL Query to Test Geometry

Selects the northernmost point in the table.

SELECT feature_name, state_alpha, ST_AsEWKT(geom) AS geometry
FROM usa_general.pop_places
WHERE ST_Xmax(geom) IN(
SELECT Max(ST_Xmax(geom))
FROM usa_general.pop_places)

“Amchitka”;”AK”;”SRID=4269;POINT(178.878 51.5672)”

## Drop Columns

Drop some columns that have no data.

ALTER TABLE usa_general.pop_places
DROP COLUMN source_lat_dms,
DROP COLUMN source_long_dms,
DROP COLUMN source_lat_dec,
DROP COLUMN source_long_dec;

Query returned successfully with no result in 180 ms.

### Article on Working With the American Community Survey

Monday, June 17th, 2013

I’ve got another article that’s just hit the presses. In this one I discuss the American Community Survey: how it differs from the Decennial Census, when you should use it versus other summary data sets, how to work with the different period estimates, and how to create derived estimates and calculate their margins of error. For that last piece I’ve essentially done an extended version of this old post on Excel formulas, with several different and updated examples.

The article is available via Emerald’s journal database. If you don’t have access to it from your library feel free to contact me and I’ll send you a copy (can’t share this one freely online).

Title: The American Community Survey: practical considerations for researchers
Author(s): Francis P. Donnelly
Citation: Francis P. Donnelly, (2013) “The American Community Survey: practical considerations for researchers”, Reference Services Review, Vol. 41 Iss: 2, pp.280 – 297
Keywords: American Community Survey, Census, Census geography, Data handling, Decennial census, Demographic data, Government data processing, Government information, Margins of error, Sample-based data, United States of America, US Census Bureau
Article type: Technical paper
DOI: 10.1108/00907321311326228 (Permanent URL)
Publisher: Emerald Group Publishing Limited

Tuesday, May 7th, 2013

I needed to download block group level census data for a project I’m working on; there was one particular 2010 Census table that I needed for every block group in the US. I knew that the American Factfinder was out – you can only download block group data county by county (which would mean over 3,000 downloads if you want them all). I thought I’d share the alternatives I looked at; as I searched around the web I found many others who were looking for the same thing (i.e. data for the smallest census geographies covering a large area).

The Census FTP site at http://www2.census.gov/

This would be the first logical step, but in the end it wasn’t optimal based on my need. When you drill down through Census 2010, Summary File 1, you see a file for every state and a national file. Initially I thought – great! I’ll just grab the national file. But the national file does NOT contain the small census statistical areas – no tracts, block groups, or blocks. If you want those small areas you have to download the files for each of the states – 51 downloads. When you download the data you can also download an MS Access database, which is an empty shell with the geography and field headers, and you can import each of the text file data tables (there a lot of them for 2010 SF1) into the db and match them to the headers during import (the instructions that were included for doing this were pretty good). This is great if you need every variable in every table for every geography, but I was only interested in one table for one geography. I could just import the one text file with my table, but then I’d have to do this import process 51 times. The alternative is to use some Python to get that one text file for every state into one big file and then do the import once, but I opted for a different route.

The NHGIS at https://www.nhgis.org/

UExplore / Dexter at http://mcdc.missouri.edu/applications/uexplore.shtml

The Missouri Census Data Center’s UExplore / Dexter tool lets you choose a dataset and takes you to a window that resembles a file system, with a ton of files in it. The MCDC takes their extracts directly from the Census, so they’re structured in a similar way to the FTP site as state-based files. They begin with the state prefix and have a name that indicates geography – there are files for block groups, blocks, and one for everything else. There are national files (which don’t contain small census areas) that begin with ‘us’. The difference here is – when you click on a file, it launches a query window that let’s you customize the extract. The interface may look daunting at first, but it’s worth exploring (and there’s a tutorial to help guide you). You can choose from several output formats, specific variables or tables (if you don’t want them all), and there are a bunch of handy options that you can specify like aggregation or percent totals. In addition to the complete datasets, they’ve also created ‘Standard Extracts’ that have the most common variables, if you want just a core subset. While the NHGIS was the best choice for my specific need, the customization abilities in Dexter may fit your needs – and the state-level block group and block data is conveniently broken out from the other files.

Lastly…

There are a few others tools – I’ll give an honorable mention to the Summary File Retrieval tool, which is an Excel plugin that lets you tap directly into the American Community Survey from a spreadsheet. So if you wanted tracts or block groups for a wide area for but a small number of variables (I think 20 is the limit) that could be a winner, provided you’re using Excel 2007 or later and are just looking at the ACS. No dice in my case, as I needed Decennial Census data and use OpenOffice at home.

### Creating Reports with SQLite, Python, and prettytable

Friday, February 8th, 2013

In addition to providing the NYC Geodatabase as a resource, I also wanted to use it to generate reports and build applications. None of the open source SQLite GUIs that I’m familiar with have built in report generating capabilities, so I thought I could use Python to connect to the database and generate them. I have some grand ambitions here, but decided to start out small.

Python has a built-in module, sqlite3, that you can use to work with SQLite databases. This is pretty well documented – do a search and you’ll find a ton of brief tutorials. Take a look at this great post for a comprehensive intro.

For generating reports I gave prettytable a shot: it lets you create nice looking ASCII text tables that you can copy and paste from the prompt or export out to a file. The tutorial for the module was pretty clear and covers the basics quite nicely. In the examples he directly embeds the data in the script and generates the table from it, which makes the tutorial readily understandable. For my purposes I wanted to pull data out of a SQLite database and into a formatted table, so that’s what I’ll demonstrate here.

Initially I had some trouble getting the module to load, primarily (I think) because I’m using Python 3.x and the setup file for the module was written for Python 2.x; the utility you use for importing 3rd party modules has changed between versions. I’m certainly no Python expert, so instead of figuring it out I just downloaded the module, dumped it into the site-packages folder (as suggested in the prettytable installation instructions under “The Harder Way” – but it wasn’t hard at all) and unzipped it. In my script I couldn’t get the simple “import prettytable” to work without throwing an error, but when I added the name of the specific function “import PrettyTable from prettytable” it worked. Your mileage may vary.

So here was my first go at it. I created a test database and loaded a table of population estimates from the US Census Bureau into it (you can download it if you want to experiment):

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

x = PrettyTable(col_names)
x.align[col_names[1]] = "l"
x.align[col_names[2]] = "r"
for row in rows:

print (x)
tabstring = x.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()


The first piece is the standard SQLite piece – connect, activate a cursor, and execute a SQL statement. Here I’m grabbing three columns from the table for records that represent Northeastern states (Region 1). I read in the names of the columns from the first row into the col_names list, and I grab everything else and dump them into rows, a list that contains a tuple for each record:

>>> col_names
['State', 'Name', 'Est2010']
>>> rows
[('09', 'Connecticut', 3574097), ('23', 'Maine', 1328361), ('25', 'Massachusetts', 6547629),
('33', 'New Hampshire', 1316469), ('34', 'New Jersey', 8791898), ('36', 'New York', 19378104),
('42', 'Pennsylvania', 12702379), ('44', 'Rhode Island', 1052567), ('50', 'Vermont', 625741)]
>>>


The second piece will make sense after you have a quick look at the prettytable tutorial. Here I grab the list of columns names and specify how cells for the columns should be aligned (default is center) and padded (default is one space). Then I add each row from the nested list of tuples to the table, row by row. There are two outputs: print directly to the screen, and dump the whole table into a string. That string can then be dumped into a text file, along with a title. Here’s the screen output:

+-------+---------------+----------+
| State | Name          |  Est2010 |
+-------+---------------+----------+
|   09  | Connecticut   |  3574097 |
|   23  | Maine         |  1328361 |
|   25  | Massachusetts |  6547629 |
|   33  | New Hampshire |  1316469 |
|   34  | New Jersey    |  8791898 |
|   36  | New York      | 19378104 |
|   42  | Pennsylvania  | 12702379 |
|   44  | Rhode Island  |  1052567 |
|   50  | Vermont       |   625741 |
+-------+---------------+----------+


The one hangup I had was the formatting for the numbers: I really want some commas in there since the values are so large. I couldn’t figure out how to do this using the approach above – I’m writing all the rows in one swoop, and couldn’t step in and and format the last value for each row.

Unless – instead of constructing the table by rows, I construct it by columns. Here’s my second go at it:

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

y=PrettyTable()
y.align[col_names[1]]="l"
y.align[col_names[2]]="r"

print(y)
tabstring = y.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()


To add by column, you don’t provide any arguments to the PrettyTable function. You just add the columns one by one: here I call the appropriate values using the index, first for the column name and then for all of the values from the rows that are in the same position. For the last value (the population estimate) I use format to display the value like a decimal number (this works in Python 3.1+ – for earlier versions there’s a similar command – see this post for details). I tried this in my first example but I couldn’t get the format to stick, or got an error. Since I’m specifically calling these row values and then writing them I was able to get it to work in this second example. In this version the alignment specifications have to come last. Here’s the result:

+-------+---------------+------------+
| State | Name          |    Est2010 |
+-------+---------------+------------+
|   09  | Connecticut   |  3,574,097 |
|   23  | Maine         |  1,328,361 |
|   25  | Massachusetts |  6,547,629 |
|   33  | New Hampshire |  1,316,469 |
|   34  | New Jersey    |  8,791,898 |
|   36  | New York      | 19,378,104 |
|   42  | Pennsylvania  | 12,702,379 |
|   44  | Rhode Island  |  1,052,567 |
|   50  | Vermont       |    625,741 |
+-------+---------------+------------+


prettytable gives you a few other options, like the ability to sort records by a certain column or to return only the first “n” records from a table. In this example, since we’re pulling the data from a database we could (and did) specify sorting and other constraints in the SQL statement instead. prettytable also gives you the option of exporting the table as HTML, which can certainly come in handy.