## Inserting Data into Templates with Python and Jinja

April 3rd, 2015

In this post, I’m picking up where I left off and will cover the different methods I experimented with to get data out of a SQLite database and into a Jinja LaTeX template using Python. I’m using the NYC Geodatabase as my test case.

### Standard Elements – Used Each Time

First – the Python script. For each iteration, the top half of the script remains the same. I import the necessary modules, and I set up my Jinja2 environment. This tells Jinja how to handle LaTeX syntax. I borrowed this code directly from the invaluable slides posted here. The only part that gets modified each time is the .get_template() bit, which is the actual LaTeX template with Jinja mark-up that is used for creating the reports.

import sqlite3

import jinja2
import os
from jinja2 import Template

latex_jinja_env = jinja2.Environment(
block_start_string = '\BLOCK{',
block_end_string = '}',
variable_start_string = '\VAR{',
variable_end_string = '}',
comment_start_string = '\#{',
comment_end_string = '}',
line_statement_prefix = '%-',
line_comment_prefix = '%#',
trim_blocks = True,
autoescape = False,
)
# Modify to specify the template
template = latex_jinja_env.get_template('test1.tex')


The method for connecting to a SQLite database is also the same each time. There are a zillion tutorials and posts for working with Python and SQLite so I won’t belabor that here. Take a look at this excellent one or this awesome one.

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT * FROM b_pumas_2013acs1 ORDER BY GEOID2 LIMIT 3')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
conn.close()


### First Iteration – Pass Individual Variables to the Template

Here’s the bit that I modify each time. Using the example from the tutorial slides, I loop through the rows returned from my database, and I specify individual variables each time by slicing the elements in the row and assigning them a name which is passed out to the template with template.render(). Then I make a call to LaTeX to generate the PDF file (straightforward since I’m using Linux), one for each row (which represent geographic areas). Each file is named using the unique ID number of the geography, which we grabbed from our row list.

for row in rows:
filename='zpuma_' + row[0] + '.tex'
folder='test1'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(geoid=row[0], geolabel=row[1], hshld=row[2], hshldmoe=row[3]))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


That’s the Python piece. The LaTeX template with the Jinja mark-up looks like this:

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{geolabel | replace("&","\&")} \VAR{geoid}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|}
\hline
& Estimate & Margin of Error\\
Households: & \num{\VAR{hshld}} & +/- \num{\VAR{hshldmoe}}\\
\hline
\end{tabular}

\end{document}


You can see here where I’m passing in the variables with \VAR – I’m using the same variable names that I created in the script to hold the row elements. I have to do a little bit of formatting to get this to work. First, one of my variables is text description that consistently contains an ampersand, so I have to use replace (a construct from Jinja) to replace & with \& so LaTeX can properly escape it. Second, I want to format my numeric variables with a thousands separator. Here I use a LaTeX construct with the siunitx package, and every place a number appears I mark it with \num. For this to work I always need to know that this variable will be a number; if it’s text or null LaTeX will throw an error and the doc won’t compile (an alternative to using this LaTeX solution would be to use Python’s formatting constructs). My simple output is below.

### Second Iteration – Pass Variables to Template in a List

Since I’m going to be passing lots of variables out to my template, it would be tedious if I had to declare them all individually, one by one. It would be better if I could pass out an entire list, and then do the slicing to get what I want in the template. Here’s the Python for doing that:

for row in rows:
filename='zzpuma_' + row[0] + '.tex'
folder='test2'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(thelist=row))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


And here’s the LaTeX template – in this example I modified the variables a bit.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{thelist[2] | replace("&","\&")} \VAR{thelist[1]}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Percent Margin of Error\\
Car, truck, or van alone: & \num{\VAR{thelist[171]}} & +/- \num{\VAR{thelist[172]}} & \num{\VAR{thelist[173]}} & +/- \num{\VAR{thelist[174]}}\\
Car, truck, or van carpooled: & \num{\VAR{thelist[175]}} & +/- \num{\VAR{thelist[176]}} & \num{\VAR{thelist[177]}} & +/- \num{\VAR{thelist[178]}}\\
\hline
\end{tabular}

\end{document}


While this is a bit better, the template is harder to read – you can’t really figure out what’s in there as you just have a bunch of list slices. You also have to keep careful track of which indices apply to what element, so you know what you’re generating. I thought I could improve this by creating nested lists where the column headings from the database get carried along, and I could reference them somehow. Then I had a better idea.

### Third Iteration – Pass Variables to Template in a Dictionary

I decided to use a dictionary instead of a list. Here’s the Python – since I grabbed the columns back in the database section of my code, I can loop through the elements in each row and create a dictionary by zipping the column names and row elements together, so the column name becomes the key and the row element is my data value. Then I pass the whole dictionary out to the template.

for row in rows:
thedict=dict(zip(col_names,row))
filename='zzpuma_' + row[0] + '.tex'
folder='test3'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(d=thedict))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


Now in the template, using Jinja I embed dict.get() for each variable and specify the key (column name) and the output will be the value. This is now highly readable, as I can see the names of the columns for the variables and there’s less potential for a mix-up.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{d.get('GEOLABEL') | replace("&","\&")} \VAR{d.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\

Car, truck, or van alone: & \num{\VAR{d.get('COM02_E')}} & +/- \num{\VAR{d.get('COM02_M')}} & \num{\VAR{d.get('COM02_PC')}} & +/- \num{\VAR{d.get('COM02_PM')}}\\

Car, truck, or van carpooled: & \num{\VAR{d.get('COM03_E')}} & +/- \num{\VAR{d.get('COM03_M')}} & \num{\VAR{d.get('COM03_PC')}} & +/- \num{\VAR{d.get('COM03_PM')}}\\
\hline
\end{tabular}

\end{document}


In this case, the output looks the same as it did in our last iteration. Those are some basic methods for getting data into a template, and in my case I think the dictionary is the ideal data structure for this. In going further, my goal is to keep all the formatting and presentation issues in LaTeX, and all the data processing and selection pieces in Python.

## Creating Reports with SQLite, Python, Jinja2, and LaTeX

March 29th, 2015

For a long time, I’ve been wanting to figure out a way to generate reports from a SQLite / Spatialite database. For example, I’d like to reach into a database and generate profiles for different places that contain tables, charts, and maps. I know I can use Python to connect to the db and pull out variables. I also learned how to use LaTeX several years back when I revised the GIS Practicum manual, and routinely use it for writing reports, articles, and hand-outs.

I finally have time to devote to this, and am going to share what I’m learning in a series of posts. In this post I’ll describe how I got started, and will record some useful projects and posts that I’ve found.

### Figuring Out What the Pieces Are

In searching the web for building reports in Python, I’ve discovered a number of solutions. Many people have written modules that are in various states of production, from active to defunct. Prettytable was something I’ve used for generating basic text-file reports. It’s absolutely great at what it does, but I’m looking for something that’s more robust. Of all the tools out there, ReportLab seemed to be the most prominent package that would appear again and again. I’ve shied away from it, because I wanted a solution that was a little more general – if that makes sense. Something where every component is not so tightly bound to a specific module.

Luckily I found this post, which was perfect for helping me to understand conceptually what I wanted to do. The author describes how he automatically generates song sheets by using a programming language (JAVA in this case) to reach into a database and insert the content into a template (LaTeX in this case) using a template engine (Apache Velocity) to produce good looking output. In this case, the template has the shell of a document and place-holders where variables will be passed in from the scripting language and rendered using the engine. He included this helpful diagram from wikimedia in his post:

I started looking for a template engine that would work well with both LaTeX and Python. The author had mentioned Cheetah as another engine, and it turns out that Cheetah is often used in conjunction with Python and LaTeX. After digging around some more, I discovered another template engine called Jinja (or Jinja2) which I’ve adopted as my solution, largely because I’ve found that the project documentation was quite good and there are numerous user examples that I can follow. Jinja2 allows you to do much more than simply passing variables into the template and rendering it; you have the option to run a lot of Pythonesque code from within the template itself.

### Putting the Pieces Together

While Jinja is often used for generating HTML and XML (for example), it’s also used for LaTeX (for example). I found that this series of slides was the perfect introduction for me. They’re written in German, but since most of the syntax in the scripting and mark-up languages is in English it’s easy to grasp (and those three-years of German I took way back in high school are now reaping dividends!)

The slides break down how you can use Python to generate LaTeX reports in several iterations. The first iteration involves no templating at all – you simply use Python to generate the LaTeX code that you want (or if you prefer, Python serves as the template generator). The limit of this are obvious, in that you have to hard code variables into the output, or use string substitution to find and replace variable names with the intended output. In the next iteration, he demonstrates how to use Jinja2. This section is invaluable, as it provides an example of setting the Jinja2 environment so that you can escape all of the necessary characters and syntax that LaTeX needs to function. He demonstrates how to pass a variable from Python to render in a template that you create in LaTeX and mark-up with Jinja2 code (slides 18 to 20). He goes on to show how you can loop through lists to generate output.

The third iteration displays how you can pull data out of SQLite and then use Python and LaTeX to generate output. With a little imagination, you can combine this piece with his previous one and voila, you have a SQLite-Python-Jinja-Latex combo. He has a final piece that incorporates screen-scraping using Beautiful Soup, which is pretty neat but beyond my needs for this project.

Now that I understand the conceptual model and I have the four tools I’ll use with some examples, I’m ready to start experimenting. I know there will be several additional pieces I’ll need to incorporate, to generate charts (matplotlib) and maps (perhaps some of the Python modules from QGIS). There are some instances where I’ll also have to write functions to create derivatives of the data I’m pulling, so I imagine NumPy/SciPy and GDAL will come in handy for that. But first things first – I need to get the four basic pieces – SQLite – Python – Jinja2 – LaTeX – working together. That will be the topic of my next post.

## Census Proposes to Cut 3-year ACS in Fiscal 2016

February 6th, 2015

I’m coming out of my blog hibernation for this announcement – the US Census Bureau is proposing that they drop the 3-year series of the American Community Survey in fiscal year 2016. A colleague mentioned that he overheard this at a meeting yesterday. Searching the web, I found a post at the Free Government Information site which points to this Census Bureau Press release. The press release cites the predictable reasons (budget constraints, funding priorities, etc.) for dropping the series. Oddly, the news comes through some random site and not through the Census Bureau’s website, where there’s no mention of it. I saw that Stanford also had a post, where they shared the same press release.

I kept searching for some definitive proof, and through someone’s tweet I found a link to a PDF of the US Census Bureau’s Budget Estimates for Fiscal Year 2016, presented to Congress this February 2015. I found confirmation buried on page CEN – 106 (the 100th page in a 190 page doc):

Data Products

Restoration of ACS Data Products (\$1.5 million): Each year, the ACS releases a wide range of data products widely used by policymakers, Federal, state and local governments, businesses and the public to make decisions on allocation of taxpayer-funds, the location of businesses and the placement of products, emergency management plans, and a host of other matters. Resource constraints have led to the cancellation of data products for areas with populations between 20 and 60 thousand based on 3-year rolling averages of ACS data (known as the “3-Year Data” Product).They have also resulted in delays in the release of the 1- and 5- year Public Use Macro Sample (PUMS) data files and canceled the release of the 5- year Comparison Profile data product and the Spanish Translation of the 1- and 5- year Puerto Rico data products.

The Census Bureau proposes to terminate permanently the 3-Year Data Product. The Census Bureau intended to produce this data product for a few years when the ACS was a new survey. Now that the ACS has collected data for nearly a decade, this product can be discontinued without serious impacts on the availability of the estimates for these communities.

The ACS would like to restore the timely release of the other essential products in FY2016. The continued absence of these data products will impact the availability of data – especially for Puerto Rico – to public and private sector decision makers.

So at this point it’s still just a proposal. The benefits, besides the ability to release other datasets in a timely fashion, would be simplification for users. Instead of choosing between three datasets now there will only be two – the one year and the five year. You choose the one year for large areas and the five year for every place else. In terms of disadvantages, consider this example – here are the number of children enrolled in nursery school in NY State PUMA 03808, which covers Murray Hill, Gramercy, and Stuyvesant Town in the eastern half of Midtown Manhattan:

Population Over 3 Years Old Enrolled in Nursery / Pre-school

• 1 year 2013: 1,166 +/- 609
• 3 year 2011-2013: 1,549 +/- 530
• 5 year 2009-2013: 1,819 +/- 409

Since PUMAs are statistical areas built to contain 100k people, data for all of them is available in each series. Like all the ACS estimates these have a 90% confidence interval. Look at the data for the 1-year series. The margin of error (ME) is so large that’s it’s approximately 50% of the estimate, which in my opinion makes it worthless for just about any application. The estimate itself is much lower than the estimate for the other two series. It’s true that it’s only capturing the latest year, but administrative data and news reports suggest that the number of nursery school children in the district that covers this area has been relatively stable over time, with modest increases (geographically the district covers an area much larger than this PUMA). This suggests that the estimate itself is not so great.

The 5 year estimate may be closer to reality, and its ME is only 20% of the estimate. But it covers five years in time. If you wanted something that was a compromise – more timely than the five year but with a lower ME than the one year, then the three year series was your choice, in this case with an ME that’s about 33% of the estimate. But under this proposal, this choice goes away and you have to make do with either 1-year estimates (which will be lousy for geographies that aren’t far above the 65k population threshold, and lousy for small population groups where ever they are located), or better 5-year estimates that cover a greater time span.

## Article on Processing Government Data With Python

August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by any one who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I’ve disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

July 30th, 2014

I released the latest version of the NYC geodatabase (nyc_gdb) a few weeks ago. In addition to simply updating the data (for subway stations and ridership, city point features, and ZIP Code Business Patterns data) I had to make a couple of serious upgrades.

The first was that is was time for me to update the version of Spatialite I was using, from 2.4 to 4.1, and to update my documentation and tutorial from the Spatialite GUI 1.4 to 1.7. I used the spatialite_convert tool (see the bottom of this page for info)to upgrade and had no problem. There were some major benefits to making the switch. For one, writing statements that utilize spatial indexes is much simpler – this was version 2.4, generating a neighbor list of census tracts:

SELECT tract1.tractid AS tract, tract2.tractid AS neighbor
FROM a_tracts AS tract1, a_tracts AS tract2
WHERE ST_Touches(tract1.geometry, tract2.geometry) AND tract2.ROWID IN (
SELECT pkid FROM idx_a_tracts_geometry
WHERE pkid MATCH RTreeIntersects (MbrMinX(tract1.geometry), MbrMinY(tract1.geometry),
MbrMaxX(tract1.geometry), MbrMaxY(tract1.geometry)))

And here’s the same statement in 4.1 (for zctas instead of tracts):

SELECT zcta1.zcta AS zcta, zcta2.zcta AS neighbor
FROM a_zctas AS zcta1, a_zctas AS zcta2
WHERE ST_Touches(zcta1.geometry, zcta2.geometry)
AND zcta1.rowid IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name=’a_zctas’ AND search_frame=zcta2.geometry)
ORDER BY zcta, neighbor

There are also a number of improvements in the GUI. Tables generated by the user are now grouped under one heading for user data, and the internal tables are grouped under subsequent headings, so that users don’t have to sift through all the objects in the database to see what they need. The import options have improved – with shapefiles and dbfs you can now designate your own primary keys on import. You also have the option of importing Excel spreadsheets of the 97-2003 variety (.xls). In practice, if you want the import to go smoothly you have to designate data types (format-cells) in the Excel sheet (including number of decimal places) prior to importing.

I was hesitant to make the leap, because version 2.4 was the last version where they made pre-compiled binaries for all operating systems; after that, the only binaries are for MS Windows and for Mac and Linux you have to compile from source – which is daunting for many Mac users that I am ill-equipped to help. But now that Spatialite seems to be more fully integrated with QGIS (you can create databases with Q and using the DB Manager you can export layers to an existing database) I can always steer folks there as an alternative. As for Linux, more distros are including updated version of the GUI in their repositories which makes installation simple.

One of the latest features in Spatialite 4.1.1 is the ability to import XML ISO metadata into the database, where it’s stored as an XML-type blob in a dedicated table. Now that I’m doing more work with metadata this is something I’ll explore for the future.

### ZIPs to ZCTAs

The other big change was how the ZIP Code Business Patterns data is represented in the database. The ZBP data is reported for actual ZIP Codes that are taken from the addresses of the business establishments, while the boundaries in the nyc_gdb database are for ZIP Code Tabulation Areas (ZCTAs) from the Census. Previously, the ZBP data in the database only represented records for ZIP Codes that had a matching ZCTA number. As a result, ZIP Codes that lacked a corollary because they didn’t have any meaningful geographic area – the ZIP represented clusters of PO Boxes or large organizations that process a lot of mail – were omitted entirely.

In order to get a more accurate representation of business establishments in the City, I instituted a process to aggregate the ZIP Code data in the ZBP to the ZCTA level. I used the crosswalk table provided by the MCDC which assigns ZIPs to ZCTAs, so those PO Boxes and large institutions are assigned to the ZCTA where they are physically located. I used SQLite to import that crosswalk, imported the ZBP data, joined the two on the ZIP Code and did a group by on the ZCTA to sum employment, establishments, etc. For ZIPs that lacked data due to disclosure regulations, I added some note or flag columns that indicate how many businesses in a ZCTA are missing data. So now the data tables represent records for ZCTAs instead of ZIPs, and they can be joined to the ZCTA features and mapped.

The latest ZBP data in the new database is for 2012. I also went back and performed the same operation on the 2011 and 2010 ZBP data that was included in earlier databases, and have provided that data in CSV format for download in the archives page (in case anyone is using the old data and wants to go back and update it).

## Some QGIS Odds and Ends

July 3rd, 2014

My colleague Joe Paccione recently finished a QGIS tutorial on working with raster data. My introductory tutorial for the GIS Practicum gives only cursory treatment to rasters, so this project was initially conceived to give people additional opportunities to learn about working with them. It focuses on elevation modeling and uses DEMs and DRGs to introduce tiling and warping, and creating hillshades and contour lines.

The tutorial was written using QGIS 2.0 and was tested with version 2.4; thus it’s readily usable with any 2.x version of QGIS. With the rapid progression of QGIS my introductory tutorial for the workshop is becoming woefully outdated, having been written in version 1.8. It’s going to take me quite a while to update (among other things, the image for every darn button has changed) but I plan to have a new version out sometime in the fall, but probably not at the beginning of semester. Since I have a fair amount of work to do any way, I’m going to rethink all of the content and exercises. Meanwhile, Lex Berman at Harvard has updated his wonderfully clear and concise tutorial to Q version 2.x.

The workshops have been successful for turning people on to open source GIS on my campus, to the point were people are using it and coming back to teach me new things – especially when it comes to uncovering useful plugins:

• I had a student who needed to geocode a bunch of addresses, but since many of them were international I couldn’t turn to my usual geocoding service at Texas A & M. While I’ve used the MMQGIS plugin for quite a while (it has an abundance of useful tools), I NEVER NOTICED that it includes a geocoding option that acts as a GUI for accessing both the Google and Open Streetmap API for geocoding. He discovered it, and it turned out quite well for him.
• I was helping a prof who was working with a large point file of street signs, and we discovered a handy plugin called Points2One that allowed us to take those points and turn them into lines based on an attribute the points held in common. In this case every sign on a city block shared a common id that allowed us to create lines representing each side of the street on each block.
• After doing some intersect and difference geoprocessing with shapefiles I was ending up with some dodgy results – orphaned nodes and lines that had duplicate attributes with good polygons. If I was in a database, an easy trick to find these duplicates would be to run a select query where you group by ID and count them, and anything with a count more than two are duplicates – but this was a shapefile. Luckily there’s a handy plugin called Group Stats that lets you create pivot tables, so I used that to do a summary count and identified the culprits. The plugin allowed me to select all the features that matched my criteria of having an id count of 2 or more, so I could eyeball them in the map view and the attribute table. I calculated the geometry for all the features and sorted the selected ones, revealing that all the duplicates had infinitesimally small areas. Then it was a simple matter of select and delete.
• ## Introducing – Data!

April 9th, 2014

Professors invite me to their classes each semester to give students a crash course in finding data for neighborhoods in New York City, with a particular emphasis on Census data. I typically visit courses in journalism and public affairs, but this semester I added classes in management and – theater – to the list. Before I dive into what the Census is and what sources they should use, I preface the presentation with a discussion of what neighborhoods are and how we define them. This is important because neighborhoods are locally and informally defined, and when searching for datasets we often have to use a proxy, like census tracts, ZIP codes, PUMAs, or local legal or administrative areas, to approximate them.

But before we get this far, I always begin the discussion with some basic questions to set the stage: what is data, and what can we use it for? For the journalism students, I explain that data can help support a story. If they’re covering a town hall or community board meeting where affordabale housing is the topic of discussion, they’re going to want to provide some context and include some facts to support their story – what is the rent like in the neighborhood? How many people live there? Alternatively, data can provide the basis for a story. I point to one of many numerous examples in NYC where journalists have taken a big lump of unrefined data – the NYPD’s stop and frisk data, traffic fatality incidents, 311 complaints – and have refined it to produce information that leads them to an interesting story that was hidden in these numbers. Lastly, data is a story – whenever the Census releases a new dataset, someone is writing to announce the release and tell us what’s in there.

This idea of refining leads us to our first basic definition – data can be considered as raw and unrefined information. It doesn’t tell us much in and of itself, but if we sift through and refine it we can turn it into information that we can use to tell or support a story or reveal some fact or truth that was previously unknown. Data can be quantitative or qualitative – journalists for example may interview someone for two or three hours, but they’re not going to turn around and publish that entire interview. They’re going to write an article that summarizes it and gives us the most important bits, or edit it for a radio broadcast that covers the high lights. With quantitative data the issue is similar – I use a basic example of population data for the 50 states and show them this image of a comma delimited text file:

I explain that this is what data looks lke in a raw state. It’s in a basic format suitable for preservation or transit between systems, but is not in a presentable state. There are a lot of codes that are meaningless to the average person, the data isn’t sorted in a meaningful way, the column headings seem ambiguous, and the numbers aren’t formated for viewing. This isn’t something that they’d want to insert directly into their story or paper. But if they take this and do a little bit of work:

They can take that raw data and turn it into information. Here we’ve moved from raw data to a presentable table. The statistics are sorted in a logical order based on total population, columns are given comprehensible names, and unecessary information (for presentation purposes) is dropped. We add commas to the numbers so they’re more legible, and we create some value by adding a percent total column. Now we have something we can use to communicate a message. But we can go further – we can take that same information and turn it into this:

Now we have a chart. At this point I turn to the students and ask them what the benefit of using the chart is, followed by a discussion of trade-offs; we’ve gained something but lost something too. On the plus side, we can appeal to people’s visual sensibilities, and we can see more clearly that California has twice as many people as New York. The chart is also more concise, as it’s taking up less real estate on the page or on the screen. But we’ve exchanged conciseness for preciseness; we can no longer tell what the exact population numbers are with the chart; we can only approximate. But we can also go further:

We can take that same dataset and turn it into a map. Once again, we discuss the pluses and minuses. Now we can key into to people’s geographic knowledge as well as their visual senses; Ohio may be more meaningful now that we can see it on a map, rather than just seeing a number in a table. We can also see geographic patterns of clustering or diffusion, which the table or chart couldn’t show us. But with the map we’ve lost even more precision. Now we can only see that a state’s population number falls within a given range; we can’t see the precise number and can’t approximate it like we could with the chart.

At this point, one student will point out that if the chart or map is on the web, we can have the best of all worlds. If the graphic is interactive we can hover over it and see the exact population number. This leads to a discussion of the trade-offs between interactive web-based information and static information. The interactive chart or map let’s us keep precision and conciseness, but the sacrifice is complexity, portability, and preservation. It’s more complex to create, and it can only exist in it’s native environment, within a specific bundle of technology that includes programming and scripting langauges, software libraries, browsers, and operting systems. Such things go obsolete quickly and can easily break, so the shiny chart, map, or app you have today is non-functional in a year or two, and difficult to preserve. Contrast that with a static image or text, which is simple, easy to move around, depends on little else, and can make the jump from a screen to the printed page.

We sum up this little talk with the basis of what they’re trying to achieve – I use the DIK pyramid, which I was introduced to in library school (OK – this pic is the DIKW pyramid, with wisdom thrown on top – it’s public domain so I can safely use it):

As journalists or researchers, you’re taking data and refining it to turn it into information to support your work or to commuicate a message. You take those pieces of new information and bring them together to tell a bigger story and paint a bigger picture, which we hope will lead to greater knowledge (which, unlike data and information, is something that can only be learned and not simply assesmbled and communicated). The weather is a good example – a giant log of temperature and precitiptation data isn’t going to do me much good. But if you process that data to calculate the high, low, and mean, now you have information I can use. Take that information and combine it with a radar picture and a forecast and now I have a rich information object. I can take that object and piece it together with other information – another forecast I hear on the radio, what I see out the window, my previous experiences of getting wet, my wife’s advice – to formulate a decision that I can act on. By considering all of this information – my experiences, contextualized information, and know how – and weighing it to reach a conslucion, I am using my knowledge. In this case I’ll use it carry an umbrella.

The final point is that, in their papers, the students must take the information objects that they’ve created or acquired and integrate them into their work. Many students will just copy and download a table and stick it at the back of the paper, and assume that it speaks for itself. I tell them – it doesn’t! You have to explain why it’s there; make reference to it in the paper and weave it into your research.

Overall this presentation / discussion takes all of about 10 minutes, AND THEN we move into the discussion of neighborhoods, the census, and specific datasets. I’ve contemplated skipping it all together, but ultimately decided that it’s necessary. I think it’s essential to provide some context and theory coupled with the actual sources and the pragmatic nature of finding the data. There are some librarians who are completely adverse to teaching “tools” and will speak completely in the abstract, while there are others who cut directly to listing the sources and leaving it at that. The first approach is useless because the students won’t learn what to actually do; the second apporach makes assumptions about what they know and fails to prepare them for what they’ll face. There also seems to be a clear need for me to do this – I’ve heard many faculty who have commented that students are simply tacking data tables they’ve copied off the web into the back of papers without any explanation. When I present the slide that depicts the csv file, I was initally shocked by the looks of shock on many student’s faces – like they’d never seen or heard of this before and were worried that they’d have to wrestle with it. Here’s the data-driven world, step 1.

March 4th, 2014

This example takes over where the Loading Data Into PostGIS – Text Files left off; we’ll use the same schema usa_general. We’ll load two shapefiles from the Census Bureau’s 2010 Cartographic Boundary files from https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html.

Loading shapefiles can be done via the pgAdmin III GUI (for this example, used version 1.18 for MS Windows) from a client machine using the shp2pgsql-gui. This is an extra plugin that must be downloaded and installed. A command line version of the tool (shp2pgsql) is automatically bundled with PostGIS, but can’t be used by a client unless they either log into the server via a terminal or install PostGIS locally.

The shp2pgsql-gui can be downloaded for MS Windows from the extras folder on the PostGIS website: http://download.osgeo.org/postgis/windows/. Version 2.01 for 32 bit environment was downloaded and unzipped. Per the instructions in the README file, the postgisgui folder was dragged into the PG BIN path as designated in pgAdmin under File > Options > Binary Paths. Once pgAdmin was restarted the gui was available under the Plugins menu.

### Set Options

Under Plugins > PostGIS Shapefile and DBF Loader, Check the Options Menu. There are often problems when dealing with UTF-8 encoding in DBF files and when working in an MS Windows environment. Change encoding from UTF-8 to WINDOWS-1252 (another possibility would be LATIN-1). Keep boxes for creating spatial index and using COPY instead of INSERT checked.

Browse and select both shapefiles. Once they appear in the import list the following elements must be changed: switch schema from public to desired schema (usa_general), change table to give each one a meaningful name, and specify the SRID number for each (since these files are from the US Census they’re in NAD 83, which is EPSG 4269). Once you hit Load, the log window will display the results of each import.

Refresh the database and both shapefiles will appear as tables.

### Check Geometries

Because of the size of some geometries, they may appear blank if you preview the tables in pgAdmin – this is a well known and documented issue. The simplest way to verify is to check and make sure that none of the geometry columns are null:

SELECT COUNT(*)
FROM usa_general.states
WHERE geom IS NULL;

“0”

### Primary Keys

By default, PostGIS will automatically create a field called gid using sequential integers, and will designate this field as the primary key. If the data already has a natural key that you want to use, you would have to drop the key constraint on gid and add the constraint to the desired field. Since features from the US Census have unique ANSI / FIPS codes, we’ll designate that as the key.

ALTER TABLE usa_general.states
DROP CONSTRAINT states_pkey

Query returned successfully with no result in 125 ms.

ALTER TABLE usa_general.states
ADD CONSTRAINT states_pkey PRIMARY KEY (“state”);

Query returned successfully with no result in 125 ms.

### Constraints

An alternative to changing the primary key would be to add UNIQUE and NOT NULL constraints to additional columns; gid remains as the primary key but the other columns will automatically be indexed when constraints are added.

ALTER TABLE usa_general.metros ADD CONSTRAINT cbsa_unq UNIQUE (cbsa);
ALTER TABLE usa_general.metros ALTER COLUMN cbsa SET NOT NULL;

### Delete Records

For sake of example, we’ll delete all the records for micropolitan areas in the metros table, so we’re left with just metropolitan areas.

DELETE FROM usa_general.metros

Query returned successfully: 581 rows affected, 219 ms execution time.

## Spatial SQL

As a test we can do a query to select all of the populated places that are within the NYC metropolitan area but are not in Pennsylvania, that have an elevation greater than or equal to 500 feet.

SELECT feature_name, state_alpha, county_name, elev_in_ft
FROM usa_general.pop_places, usa_general.metros
WHERE metros.cbsa=’35620′
AND ST_WITHIN(pop_places.geom, metros.geom)
AND elev_in_ft >= 500
AND state_alpha !=’PA’
ORDER BY elev_in_ft DESC

“Kampe”;”NJ”;”Sussex”;1293
“Highland Lakes”;”NJ”;”Sussex”;1283
“Edison”;”NJ”;”Sussex”;1234″…

## Connect to QGIS

We can look at the geographic features by establishing a connection between the database and QGIS. In QGIS we can either hit the add PostGIS Layer button or use the browser to connect to a our database; we need to add the connection name, host, port, and database name. The username and password can be left blank and Q will prompt the user for it (if you don’t want it saved on the system). Once the connection is established you can add the layers to the view and change the symbolization.

February 25th, 2014

I’ve fallen out of the blogosphere for quite awhile. Time to catch up – lately we’ve been experimenting with deploying our own instance of Open Geoportal (see previous post) and square one is getting a functioning data repository up and running. I’ve been tinkering with PostgreSQL / PostGIS and am documenting tests in a wiki we’ve created. The wiki is private, so I thought I’d re-post some of the tests here.

I’m working with a developer who’s installed and configured the database on a server, and I’m accessing it remotely using pgAdmin. I’ve started loading data and running queries and so far I’m relieved that most of what I’m seeing seems familiar, having worked with SQLite / Spatialite for a few years now. I’m using the excellent book PostGIS in Action as my guide. Here’s my take on loading a text file with coordinates into the db and building geometry for it. Loading shapefiles, spatial SQL experiments, and connecting to Geoserver will follow.

## Verify Version

SELECT postgis_full_version();

“POSTGIS=”2.1.1 r12113″ GEOS=”3.4.2-CAPI-1.8.2 r3921″ PROJ=”Rel. 4.8.0, 6 March 2012″ GDAL=”GDAL 1.9.2, released 2012/10/08″ LIBXML=”2.7.6″ LIBJSON=”UNKNOWN” TOPOLOGY RASTER”

## Create Schema for Holding Related Objects

CREATE SCHEMA usa_general;

Query returned successfully with no result in 297 ms.

## Import Delimited Text File

Test file is the US Populated Places gazetteer file from the USGS Geographic Names Information Service (GNIS). It is a pipe-delimited text file in encoded in UTF-8 with longitude and latitude coordinates in both DMS and DEC format. http://geonames.usgs.gov/domestic/download_data.htm

### Create Table for Holding Data

CREATE TABLE usa_general.pop_places
(
feature_id int NOT NULL PRIMARY KEY,
feature_name varchar,
feature_class varchar,
state_alpha char(2),
state_numeric char(2),
county_name varchar,
county_numeric char(3),
primary_lat_dms varchar,
prim_long_dms varchar,
prim_lat_dec float4,
prim_long_dec float4,
source_lat_dms varchar,
source_long_dms varchar,
source_lat_dec float4,
source_long_dec float4,
elev_in_m int,
elev_in_ft int,
map_name varchar,
date_created date,
date_edited date
);

Query returned successfully with no result in 390 ms.

### Copy Data From File

Must be run from the PSQL Console plugin in order to load data from a local client machine. If data was stored on the server, you could use the PGAdmin GUI and use COPY in an SQL statement instead of \copy. When running the PSQL Console on MS Windows the default character encoding is WIN1252. In this example, the data contains characters unsupported by this encoding; the file is encoded in UTF-8 and the console must be set to match. COPY is not a standard SQL command but is native to PostgreSQL.

— The optional HEADER specifies that the data file has a header column that should be skipped when importing

\encoding UTF8
\copy usa_general.pop_places
FROM ‘C:\Workspace\postgis_testing\POP_PLACES_20140204.txt’

### Basic Query to Test that Load was Success

SELECT *
FROM usa_general.pop_places
WHERE state_alpha=’NY’ and county_name=’New York’
ORDER BY feature_name

### Create Geometry from Coordinates and Make Spatial Index

4269 is the EPSG code for NAD 83, the basic geographic coordinate system used by US government agencies.

“usa_general.pop_places.geom SRID:4269 TYPE:POINT DIMS:2 ”

UPDATE usa_general.pop_places
SET geom = ST_GeomFromText(‘POINT(‘|| prim_long_dec || ‘ ‘|| prim_lat_dec || ‘)’,4269);

Query returned successfully: 200359 rows affected, 18268 ms execution time.

CREATE INDEX idx_pop_places_geom
ON usa_general.pop_places USING gist(geom);

Query returned successfully with no result in 8439 ms.

## Basic SQL Query to Test Geometry

ST_AsEWKT transforms the output of geometry from binary code into the OGC Well Known Text format (WKT).

SELECT feature_name, ST_AsEWKT(geom) AS geometry
FROM usa_general.pop_places
WHERE state_alpha=’NY’ AND county_name=’New York’
ORDER BY feature_name

“Amalgamated Dwellings”;”SRID=4269;POINT(-73.9828 40.715)”
“Amsterdam Houses”;”SRID=4269;POINT(-73.9881 40.7731)”
“Battery Park City”;”SRID=4269;POINT(-74.0163 40.7115)”…

## Basic Spatial SQL Query to Test Geometry

Selects the northernmost point in the table.

SELECT feature_name, state_alpha, ST_AsEWKT(geom) AS geometry
FROM usa_general.pop_places
WHERE ST_Xmax(geom) IN(
SELECT Max(ST_Xmax(geom))
FROM usa_general.pop_places)

“Amchitka”;”AK”;”SRID=4269;POINT(178.878 51.5672)”

## Drop Columns

Drop some columns that have no data.

ALTER TABLE usa_general.pop_places
DROP COLUMN source_lat_dms,
DROP COLUMN source_long_dms,
DROP COLUMN source_lat_dec,
DROP COLUMN source_long_dec;

Query returned successfully with no result in 180 ms.

## Notes from the Open Geoportal National Summit

October 30th, 2013

This past weekend I had the privilege of attending the Open Geoportal (OGP) National Summit in Boston, hosted by Tufts University and funded by the Sloan Foundation. The Open Geoportal (OGP) is a map-based search engine that allows users to discover and retrieve geospatial data from many repositories. The OGP serves as the front-end of a three-tiered system that includes a spatial database (like PostGIS) at the back and some middleware (Like OpenLayers) to communicate between the two.

Users navigate via a web map (Google by default but you can choose other options), and as they change the extent by panning or zooming a list of available spatial layers appears in a table of contents beside the map. Hovering over a layer in the contents reveals a bounding box that indicates its spatial extent. Several algorithms determine the ranking order of the results based on the spatial intersection of bounding boxes with the current map view. For instance, layers that are completely contained in the map view have priority over those that aren’t, and layers that have their geographic center in the view are also pushed higher in the results. Non-spatial search filters for date, data type, institution, and keywords help narrow down a search. Of course, the quality of the results is completely dependent on the underlying metadata for the layers, which is stored in the various repositories.

The project was pioneered by Tufts, Harvard, and MIT , and now about a dozen other large research universities are actively working with it, and others are starting to experiment. The purpose of the summit was to begin creating a cohesive community to manage and govern the project, and to increase and outline the possibilities for collaborating across institutions. At the back end, librarians and metadata experts are loading layers and metadata into their repositories; metadata creation is an exacting and time-consuming process, but the OGP will allow institutions to share their metadata records in the hope of avoiding duplicated effort. The OGP also allows for the export of detailed spatial metadata from FGDC and ISO to MODS and MARC, so that records for the spatial layers can be exported to other content management systems and library catalogs.

The summit gave metadata experts the opportunity to discuss best practices for metadata creation and maintenance, in the hopes of providing a consistent pool of records that can be shared; it also gave software developers the chance to lay out their road map for how they’ll function as an open source project (the OGP community could look towards the GeoNetwork opensource project, a forerunner in spatial metadata and search that’s used in Europe and by many international organizations). Series of five-minute talks called Ignite sessions gave librarians and developers the ability to share the work they were doing at their institutions, either with OGP in particular or with metadata and spatial search in general, which sparked further discussion.

The outcome of all the governance, resource sharing, and best practices discussions are available on a series of pages dedicated to the summit, on the opengeoportal.org project website. You can also experiment with the OGP via http://geodata.tufts.edu/, Tuft’s gateway to their repository. As you search for data you can identify which repository the data is coming from (Tufts, Harvard, or MIT) based on the little icon that appears beside each layer name. Public datasets (like US census layers) can be downloaded by anyone, while copyrighted sets that the schools’ purchased for their users require authentication.

OGP is a simple yet elegant open source project that operates under OGC standards and is awesome for spatial search, but the real gem here is the community of people that are forming around it. I was blown away by the level of expertise, dedication, and over all professionalism that each of the librarians, information specialists, and software developers exuded, via the discussions and particularly by the examples of the work they were doing at their institutions. Beyond just creating software, this project is poised to enhance the quality and compatibility of spatial metadata to keep our growing pile of geospatial stuff find-able.