## Posts Tagged ‘Data Processing’

### Writing Functions and Building a Jinja Template

Thursday, August 6th, 2015

In previous posts I demonstrated how to pull data from a sqlite / spatialite database to generate reports using Python and Jinja, where Jinja2 is used as a template engine for creating LaTeX documents and the NYC Geodatabase is used as my test case. Up until now the scripts pulled the data “as is”. In this post I’ll demonstrate how I created derived variables, and how I created the Jinja2 template for the report. Please note – instead of duplicating all of the code I’m just going to illustrate the new pieces – you should check out the earlier posts to see how all the pieces fit together.

### Aggregating Variables

Aggregating census data is a pretty common operation, and when working with American Community Survey estimates it’s also necessary to calculate a new margin of error for each derived value. I wrote two functions to accomplish this. For each function you pass in the keys for values you want to aggregate, a name which will be the name of the new variable, and a dictionary that contains all the keys and values that were taken from a database table for a specific geography.

#Functions for summing individual values and calculating margins of error
#for individual values. Each takes the keys to aggregate, a name for the new
#variable, and the dictionary of values pulled for one geography
#(signatures reconstructed from the description above)

def calc_sums(keys,name,tabdict):
    tosum=[]
    for val in keys:
        item=tabdict.get(val)
        if item=='':
            pass
        else:
            tosum.append(item)
    agg=sum(tosum)
    tabdict[name]=agg

def calc_moe(keys,name,tabdict):
    sqrd=[]
    for val in keys:
        item=tabdict.get(val)
        if item=='':
            pass
        else:
            sqrd.append(item**2)
    moe=round(math.sqrt(sum(sqrd)))
    tabdict[name]=moe
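As a quick worked example with made-up numbers: if two income brackets have margins of error of +/- 110 and +/- 75, the margin of error for their sum is the square root of 110 squared plus 75 squared, which rounds to 133. That’s the standard Census Bureau formula for the margin of error of an aggregated ACS estimate, and it’s exactly what calc_moe computes.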

Later in the script, as we’re looping through all the geographies and gathering the necessary data into dictionaries that represent each data table, we call the function. In this example we’re combining household income brackets so that we don’t have so many categories:

for geog in geodict.keys():

    name=geodict.get(geog)
    filename='zzpuma_' + geog + '.tex'
    folder='puma_rept'
    outpath=os.path.join(folder,filename)

    acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
    acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)

    calc_sums(['INC03_E','INC04_E'],'INC10K_E',acs1dict)
    calc_moe(['INC03_M','INC04_M'],'INC10K_M',acs1dict)
    calc_sums(['INC05_E','INC06_E'],'INC25K_E',acs1dict)
    calc_moe(['INC05_M','INC06_M'],'INC25K_M',acs1dict)
    calc_sums(['INC09_E','INC10_E'],'INC100K_E',acs1dict)
    calc_moe(['INC09_M','INC10_M'],'INC100K_M',acs1dict)

Rather than creating a new dictionary, these new values are simply appended to the existing dictionaries that contain the data taken from each of the ACS data tables in the database. They can be referenced in the template using their new column name.

### Calculating Areas

I also want to include the geographic size of the PUMA as one of the report items. Columns for the area are included in the spatial table for the PUMAs – the features originally came from the TIGER files, and all TIGER files have ALAND and AWATER columns that hold land and water area in square meters. So we don’t have to calculate the area from the geometry – we can use a small function to convert the land and water attributes to square miles, and then calculate a total area:

totalarea=landarea+waterarea
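The line above is the heart of it. A minimal sketch of what the full function might look like – the conversion factor (2,589,988.11 square meters per square mile) is standard, but the dictionary keys for the TIGER attributes are assumptions that would need to match the actual column names:

#Sketch of an area function: converts the ALAND / AWATER attributes (square
#meters) to square miles and appends the results to the geography's dictionary.
#The 'aland10' and 'awater10' keys are assumed here.
def calc_area(tabdict,landname,watername,totalname):
    landarea=round(tabdict.get('aland10')/2589988.11,2)
    waterarea=round(tabdict.get('awater10')/2589988.11,2)
    totalarea=landarea+waterarea
    tabdict[landname]=landarea
    tabdict[watername]=waterarea
    tabdict[totalname]=totalarea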

In the body of our script, we invoke our pulltab function (explained in an earlier post) to grab all the data from the PUMA spatial boundary table:

area=pulltab('c_bndy_pumas2010','geoid10',geog)

And then we can call our area function. We pass in the area dictionary, and what we want the new output column names to be – area for land, water, and total:

calc_area(area, 'LAND_SQM','WAT_SQM','TOT_SQM')

Like our previous aggregate script, this function appends our new values to the existing table-dictionary – in this case, one called area.

### Aggregating Geographies

Our last function is a little more complicated. In all of our previous examples, we pulled PUMA-level data from the American Community Survey tables. What if we wanted 2010 Census data for the PUMAs? Decennial census data is not tabulated at the PUMA level, but it is tabulated at the census tract level. Since PUMAs are created by aggregating tracts, we can aggregate the census tract data in the NYC Geodatabase into PUMAs. Here’s our function:

#Function aggregates all values in a table with a group by field from a
#joined table, then creates a dictionary consisting of column names and values
#for a specific geography

def sumtab(tabname,jointab,id1,id2,gid,geog):
    query='SELECT * FROM %s LIMIT 1' %(tabname)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    tosum=[]
    for var in col_names[3:]:
        tosum.append("SUM("+var+") AS '0_"+var+"'")
    summer=', '.join(str(command) for command in tosum)
    query='SELECT %s, %s FROM %s, %s WHERE %s = %s and %s = %s GROUP BY %s' %(gid,summer,tabname,jointab,id1,id2,gid,geog,gid)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    for row in rows:
        thedict=dict(zip(col_names,row))
    return thedict

What’s going on here? The first thing we need to do is associate the census tracts with the PUMAs they’re located in. The NYC Geodatabase does NOT have a relationship table for this, so I had to create one. We pass in the data table name, the relationship table, the unique IDs for each, and then the ID column and the geography that we’re interested in (remember our script is looping through PUMA geographies one by one). Then comes a little trick – we get the names of every column in the existing data table, and we append them to a list where we create a new column name based on the existing one (in this case, a 0 prefixed to the column name – in retrospect I realize this is a bad idea, as column names should not begin with numbers, so this is something I will change). Then we can take that list of column names and create a giant string out of them.

With that giant string (called summer) we can now pass all of the parameters that we need into the SQL query. This selects all of our columns (using the summer string), the table names and join info, for the specific geographic area that we want and then groups the data by that geography (i.e. all tracts that have the same PUMA number). Then we zip the column names and values together in a dictionary that the function returns.
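As a small illustration of that string-building step (using two column names that appear later in the template – otherwise this is just the same logic as the function’s loop):

#Illustration only: how the summed-column string gets assembled
cols=['HD01_S001','HD01_S026']
tosum=["SUM("+var+") AS '0_"+var+"'" for var in cols]
summer=', '.join(tosum)
#summer is now: SUM(HD01_S001) AS '0_HD01_S001', SUM(HD01_S026) AS '0_HD01_S026'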

Later on in our script, we call the function:

census10=sumtab('b_tracts_2010census','b_tracts_to_pumas','GEOID2','tractid','pumaid',geog)

Which creates a new dictionary called census10 that has all the 2010 census data for our PUMA. Like the rest of our dictionaries, census10 is passed out to the Jinja2 template and its values can be invoked using the dictionary keys (the column headings):

outfile=open(outpath,'w')
outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict, area=area,
c2010=census10))
outfile.close()

### Designing the Template

The Jinja template is going to look pretty busy compared to our earlier examples, and neither the plain LaTeX mock-up nor the Jinja version described below is complete (this is still a work in progress).

I wanted to design the entire report first, to get a sense for how to balance everything I want on the page, without including any Jinja code to reference specific variables in the database. So I initially worked just in LaTeX and focused on designing the document with placeholders. Ultimately I decided to use the LaTeX minipage environment as it seemed the best approach for giving me control over balancing items on the page. The LaTeX wikibook entries on floats, figures, and captions and on boxes were invaluable for figuring this out. I used \rule to draw boxes to serve as placeholders for charts and figures. Since the report is being designed as a document (ANSI A 8 1/2 by 11 inches) I had no hang-up with specifying precise dimensions (i.e. this isn’t going into a webpage that could be stretched or mushed on any number of screens). I loaded the xcolor package so I could modify the row colors of the tables, as well as a number of other packages that make it easy to balance table and figure captions on the page (caption, subcaption, and multicol).

Once I was satisfied with the look and feel, I made a copy of this template and started modifying the copy with the Jinja references. The references look awfully busy, but this is the same thing I’ve illustrated in earlier posts. We’re just getting the values from the dictionaries we created by invoking their keys, regardless of whether we’re taking new derived values that we created or simply pulling existing values that were in the original data tables. Here’s a snippet of the LaTeX with Jinja that includes both derived (2010 Census, area) and existing (ACS) variables:

%Orientation - detail map and basic background info
\begin{minipage}{\textwidth}
\begin{minipage}[h]{3in}
\centering
\rule{3in}{3in}
\captionof{figure}{Race by 2010 Census Tract}
\end{minipage}
\hfill
\begin{minipage}[h]{4in}
\centering
\captionof{table}{Geography}
\begin{tabular}{cccc}\hline
& Land & Water & Total\\ \hline
Area (sq miles) &  \num{\VAR{area.get('LAND_SQM')}} & \num {\VAR{area.get('WAT_SQM')}} &  \num {\VAR{area.get('TOT_SQM')}} \\ \hline
\vspace{10pt}
\end{tabular}
\captionof{table}{Basic Demographics}
\rowcolors{3}{SpringGreen}{white}
\begin{tabular}{cccc}\hline
& \textbf{2010 Census} & \textbf{2009-2013} & \textbf{ACS Margin}\\
& & \textbf{ACS} & \textbf{of Error}\\ \hline
Population & \num {\VAR{c2010.get('0_HD01_S001')}} & \num{\VAR{acs2.get('SXAG01_E')}} & +/- \num{\VAR{acs2.get('SXAG01_M')}}\\
Males & \num {\VAR{c2010.get('0_HD01_S026')}} & \num{\VAR{acs2.get('SXAG02_E')}} & +/- \num{\VAR{acs2.get('SXAG02_M')}}\\
Females & \num {\VAR{c2010.get('0_HD01_S051')}}& \num{\VAR{acs2.get('SXAG03_E')}} & +/- \num{\VAR{acs2.get('SXAG03_M')}}\\
Median Age (yrs) & \num {99999} & \num{\VAR{acs2.get('SXAG17_E')}} & +/- \num{\VAR{acs2.get('SXAG17_M')}}\\
Households & \num {\VAR{c2010.get('0_HD01_S150')}} & \num{\VAR{acs1.get('HSHD01_E')}} & +/- \num{\VAR{acs1.get('HSHD01_M')}}\\
Housing Units & \num {\VAR{c2010.get('0_HD01_S169')}} & \num{\VAR{acs2.get('HOC01_E')}} & +/- \num{\VAR{acs2.get('HOC01_M')}}\\ \hline
\end{tabular}
\end{minipage}
\end{minipage}

And here’s a snippet of the resulting PDF:

### What Next?

You may have noticed references to figures and charts in some of the code above. I’ll discuss my trials and tribulations with trying to use matplotlib to create charts in some future post. Ultimately I decided not to take that approach, and was experimenting with using various LaTeX packages to produce charts instead.

### Looping Through a Database to Create Reports

Tuesday, July 28th, 2015

I’ve got a lot of ground to cover, picking up where I left off several months ago. In earlier posts I presented the concept for creating reports from sqlite databases using Python, Jinja, and LaTeX, and looked at different methods for passing data from the database to the template. I’m using the NYC Geodatabase as our test case. In this entry I’ll cover how I implemented my preferred approach – creating Python dictionaries to pass to the Jinja template.

One of the primary decisions I had to make was how to loop through the database. Since the reports we’re making are profiles (lots of different data for one geographic area), we’re going to want to loop through the database by geography. So, for each geography select all the data from a specific table, pass the data out to the template where the pertinent variables are pulled, build the report and move on to the next geography. In contrast, if we were building comparison tables (one specific variable for many geographic areas) we would want to loop through the data by variable.
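Purely to illustrate that contrast (the table and column names are ones that appear elsewhere in these posts), a comparison-style pull might grab one variable for every PUMA rather than every variable for one PUMA:

#Illustration only: one variable (households) for all PUMAs
curs.execute('SELECT GEOID2, HSHD01_E FROM b_pumas_2013acs1 ORDER BY GEOID2')
household_by_puma=dict(curs.fetchall())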

In the beginning of the script we import the necessary modules, set up the Jinja environment, and specify our template (not going to repeat that code here – see the previous post). Then we have our function that creates a dictionary for a specific data table for a specific geography:

def pulltab(tabname,idcol,geog):
    query='SELECT * FROM %s WHERE %s = %s' %(tabname,idcol,geog)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    for row in rows:
        thedict=dict(zip(col_names,row))
    return thedict

We connect to the database and create a dictionary of all the geographies (limited to 3 PUMAs since this is just a test):

#Connect to database and create dictionary of all geographies

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT geoid10, namelsad10 FROM a_pumas2010 ORDER BY geoid10 LIMIT 3')
rows = curs.fetchall()
geodict=dict(rows)

And then we generate reports by looping through all the geographies in that dictionary, and we pass in the ID of each geography to pull all data from a data table for that geography out of the table and into a dictionary.

#Generate reports by looping through geographies and passing out
#dictionaries of values

for geog in geodict.keys():
    acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
    acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)
    name=geodict.get(geog)
    filename='zzpuma_' + geog + '.tex'
    folder='test5'
    outpath=os.path.join(folder,filename)

Lastly, we pass the dictionaries out to the template, and run LaTeX to generate the report from the template:

outfile=open(outpath,'w')
outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict))
outfile.close()

os.system("pdflatex -output-directory=" + folder + " " + outpath)

conn.close()

The Jinja template (as a LaTeX file) is below – the example here is similar to what I covered in my previous post. We passed two dictionaries into the template, one for each data table. The key is the name of the variable (the column name in the table) and the value is the American Community Survey estimate and the margin of error. We pass in the key and get the value in return. The PDF output follows.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage{graphicx}
\usepackage[labelformat=empty]{caption}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{acs1.get('GEOLABEL') | replace("&","\&")} \VAR{acs1.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{table}[h]
\centering
\caption{Commuting to Work - Workers 16 years and over}
\begin{tabular}{|c|c|c|c|c|}

\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\
\hline

Car, truck, or van alone & \num{\VAR{acs1.get('COM02_E')}} & +/- \num{\VAR{acs1.get('COM02_M')}}
& \num{\VAR{acs1.get('COM02_PC')}} & +/- \num{\VAR{acs1.get('COM02_PM')}}\\

Car, truck, or van carpooled & \num{\VAR{acs1.get('COM03_E')}} & +/- \num{\VAR{acs1.get('COM03_M')}}
& \num{\VAR{acs1.get('COM03_PC')}} & +/- \num{\VAR{acs1.get('COM03_PM')}}\\

Public transit & \num{\VAR{acs1.get('COM04_E')}} & +/- \num{\VAR{acs1.get('COM04_M')}}
& \num{\VAR{acs1.get('COM04_PC')}} & +/- \num{\VAR{acs1.get('COM04_PM')}}\\

Walked & \num{\VAR{acs1.get('COM05_E')}} & +/- \num{\VAR{acs1.get('COM05_M')}}
& \num{\VAR{acs1.get('COM05_PC')}} & +/- \num{\VAR{acs1.get('COM05_PM')}}\\

Other means & \num{\VAR{acs1.get('COM06_E')}} & +/- \num{\VAR{acs1.get('COM06_M')}}
& \num{\VAR{acs1.get('COM06_PC')}} & +/- \num{\VAR{acs1.get('COM06_PM')}}\\

Worked at home & \num{\VAR{acs1.get('COM07_E')}} & +/- \num{\VAR{acs1.get('COM07_M')}}
& \num{\VAR{acs1.get('COM07_PC')}} & +/- \num{\VAR{acs1.get('COM07_PM')}}\\
\hline

\end{tabular}
\end{table}

\begin{table}[h]
\centering
\caption{Housing Tenure}
\begin{tabular}{|c|c|c|c|c|}

\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\
\hline

Occupied housing units & \num{\VAR{acs2.get('HTEN01_E')}} & +/- \num{\VAR{acs2.get('HTEN01_M')}} &  &\\

Owner-occupied & \num{\VAR{acs2.get('HTEN02_E')}} & +/- \num{\VAR{acs2.get('HTEN02_M')}}
& \num{\VAR{acs2.get('HTEN02_PC')}} & +/- \num{\VAR{acs2.get('HTEN02_PM')}}\\

Renter-occupied & \num{\VAR{acs2.get('HTEN03_E')}} & +/- \num{\VAR{acs2.get('HTEN03_M')}}
& \num{\VAR{acs2.get('HTEN03_PC')}} & +/- \num{\VAR{acs2.get('HTEN03_PM')}}\\
\hline

\end{tabular}
\end{table}
\end{document}

In this example we took the simple approach of grabbing all the variables that were in a particular table, and then we just selected what we wanted within the template. This is fine since we’re only dealing with 55 PUMAs and a table that has 200 columns or so. If we were dealing with gigantic tables or tons of geographies, we could modify the Python script to pull just the variables we wanted to speed up the process; my inclination would be to create a list of variables in a text file, read that list into the script and modify the SQL function to just select those variables.
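A rough sketch of that idea, assuming the variable list lives one name per line in a text file (the file name, function name, and variables file are hypothetical):

#Sketch: read the desired columns from a text file and select only those
#(the 'acs1_vars.txt' file and the function name are hypothetical)
def pulltab_selected(tabname,idcol,geog,varfile):
    with open(varfile) as f:
        wanted=[line.strip() for line in f if line.strip()]
    query='SELECT %s FROM %s WHERE %s = %s' %(', '.join(wanted),tabname,idcol,geog)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    for row in rows:
        thedict=dict(zip(col_names,row))
    return thedict

acs1dict=pulltab_selected('b_pumas_2013acs1','GEOID2',geog,'acs1_vars.txt')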

What if we want to modify some of the variables before we pass them into the template? I’ll cover that in the next post.

### Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by anyone who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

### Introducing – Data!

Wednesday, April 9th, 2014

Professors invite me to their classes each semester to give students a crash course in finding data for neighborhoods in New York City, with a particular emphasis on Census data. I typically visit courses in journalism and public affairs, but this semester I added classes in management and – theater – to the list. Before I dive into what the Census is and what sources they should use, I preface the presentation with a discussion of what neighborhoods are and how we define them. This is important because neighborhoods are locally and informally defined, and when searching for datasets we often have to use a proxy, like census tracts, ZIP codes, PUMAs, or local legal or administrative areas, to approximate them.

But before we get this far, I always begin the discussion with some basic questions to set the stage: what is data, and what can we use it for? For the journalism students, I explain that data can help support a story. If they’re covering a town hall or community board meeting where affordable housing is the topic of discussion, they’re going to want to provide some context and include some facts to support their story – what is the rent like in the neighborhood? How many people live there? Alternatively, data can provide the basis for a story. I point to one of numerous examples in NYC where journalists have taken a big lump of unrefined data – the NYPD’s stop and frisk data, traffic fatality incidents, 311 complaints – and have refined it to produce information that leads them to an interesting story that was hidden in these numbers. Lastly, data is a story – whenever the Census releases a new dataset, someone writes a piece announcing the release and telling us what’s in there.

This idea of refining leads us to our first basic definition – data can be considered as raw and unrefined information. It doesn’t tell us much in and of itself, but if we sift through and refine it we can turn it into information that we can use to tell or support a story or reveal some fact or truth that was previously unknown. Data can be quantitative or qualitative – journalists for example may interview someone for two or three hours, but they’re not going to turn around and publish that entire interview. They’re going to write an article that summarizes it and gives us the most important bits, or edit it for a radio broadcast that covers the highlights. With quantitative data the issue is similar – I use a basic example of population data for the 50 states and show them this image of a comma delimited text file:

I explain that this is what data looks like in a raw state. It’s in a basic format suitable for preservation or transit between systems, but is not in a presentable state. There are a lot of codes that are meaningless to the average person, the data isn’t sorted in a meaningful way, the column headings seem ambiguous, and the numbers aren’t formatted for viewing. This isn’t something that they’d want to insert directly into their story or paper. But if they take this and do a little bit of work:

They can take that raw data and turn it into information. Here we’ve moved from raw data to a presentable table. The statistics are sorted in a logical order based on total population, columns are given comprehensible names, and unnecessary information (for presentation purposes) is dropped. We add commas to the numbers so they’re more legible, and we create some value by adding a percent total column. Now we have something we can use to communicate a message. But we can go further – we can take that same information and turn it into this:

Now we have a chart. At this point I turn to the students and ask them what the benefit of using the chart is, followed by a discussion of trade-offs; we’ve gained something but lost something too. On the plus side, we can appeal to people’s visual sensibilities, and we can see more clearly that California has twice as many people as New York. The chart is also more concise, as it’s taking up less real estate on the page or on the screen. But we’ve traded precision for conciseness; we can no longer tell what the exact population numbers are with the chart; we can only approximate. But we can also go further:

We can take that same dataset and turn it into a map. Once again, we discuss the pluses and minuses. Now we can key into people’s geographic knowledge as well as their visual senses; Ohio may be more meaningful now that we can see it on a map, rather than just seeing a number in a table. We can also see geographic patterns of clustering or diffusion, which the table or chart couldn’t show us. But with the map we’ve lost even more precision. Now we can only see that a state’s population number falls within a given range; we can’t see the precise number and can’t approximate it like we could with the chart.

At this point, one student will point out that if the chart or map is on the web, we can have the best of all worlds. If the graphic is interactive we can hover over it and see the exact population number. This leads to a discussion of the trade-offs between interactive web-based information and static information. The interactive chart or map lets us keep precision and conciseness, but the sacrifice is complexity, portability, and preservation. It’s more complex to create, and it can only exist in its native environment, within a specific bundle of technology that includes programming and scripting languages, software libraries, browsers, and operating systems. Such things go obsolete quickly and can easily break, so the shiny chart, map, or app you have today is non-functional in a year or two, and difficult to preserve. Contrast that with a static image or text, which is simple, easy to move around, depends on little else, and can make the jump from a screen to the printed page.

We sum up this little talk with the basis of what they’re trying to achieve – I use the DIK pyramid, which I was introduced to in library school (OK – this pic is the DIKW pyramid, with wisdom thrown on top – it’s public domain so I can safely use it):

As journalists or researchers, you’re taking data and refining it to turn it into information to support your work or to communicate a message. You take those pieces of new information and bring them together to tell a bigger story and paint a bigger picture, which we hope will lead to greater knowledge (which, unlike data and information, is something that can only be learned and not simply assembled and communicated). The weather is a good example – a giant log of temperature and precipitation data isn’t going to do me much good. But if you process that data to calculate the high, low, and mean, now you have information I can use. Take that information and combine it with a radar picture and a forecast and now I have a rich information object. I can take that object and piece it together with other information – another forecast I hear on the radio, what I see out the window, my previous experiences of getting wet, my wife’s advice – to formulate a decision that I can act on. By considering all of this information – my experiences, contextualized information, and know-how – and weighing it to reach a conclusion, I am using my knowledge. In this case I’ll use it to carry an umbrella.

The final point is that, in their papers, the students must take the information objects that they’ve created or acquired and integrate them into their work. Many students will just copy and download a table and stick it at the back of the paper, and assume that it speaks for itself. I tell them – it doesn’t! You have to explain why it’s there; make reference to it in the paper and weave it into your research.

Overall this presentation / discussion takes all of about 10 minutes, AND THEN we move into the discussion of neighborhoods, the census, and specific datasets. I’ve contemplated skipping it altogether, but ultimately decided that it’s necessary. I think it’s essential to provide some context and theory coupled with the actual sources and the pragmatic nature of finding the data. There are some librarians who are completely averse to teaching “tools” and will speak completely in the abstract, while there are others who cut directly to listing the sources and leave it at that. The first approach is useless because the students won’t learn what to actually do; the second approach makes assumptions about what they know and fails to prepare them for what they’ll face. There also seems to be a clear need for me to do this – I’ve heard many faculty comment that students are simply tacking data tables they’ve copied off the web onto the back of papers without any explanation. When I present the slide that depicts the csv file, I was initially struck by the looks of shock on many students’ faces – like they’d never seen or heard of this before and were worried that they’d have to wrestle with it. Here’s the data-driven world, step 1.

### Article on Working With the American Community Survey

Monday, June 17th, 2013

I’ve got another article that’s just hit the presses. In this one I discuss the American Community Survey: how it differs from the Decennial Census, when you should use it versus other summary data sets, how to work with the different period estimates, and how to create derived estimates and calculate their margins of error. For that last piece I’ve essentially done an extended version of this old post on Excel formulas, with several different and updated examples.

The article is available via Emerald’s journal database. If you don’t have access to it from your library feel free to contact me and I’ll send you a copy (can’t share this one freely online).

Title: The American Community Survey: practical considerations for researchers
Author(s): Francis P. Donnelly
Citation: Francis P. Donnelly, (2013) “The American Community Survey: practical considerations for researchers”, Reference Services Review, Vol. 41 Iss: 2, pp.280 – 297
Keywords: American Community Survey, Census, Census geography, Data handling, Decennial census, Demographic data, Government data processing, Government information, Margins of error, Sample-based data, United States of America, US Census Bureau
Article type: Technical paper
DOI: 10.1108/00907321311326228 (Permanent URL)
Publisher: Emerald Group Publishing Limited

### Creating Reports with SQLite, Python, and prettytable

Friday, February 8th, 2013

In addition to providing the NYC Geodatabase as a resource, I also wanted to use it to generate reports and build applications. None of the open source SQLite GUIs that I’m familiar with have built in report generating capabilities, so I thought I could use Python to connect to the database and generate them. I have some grand ambitions here, but decided to start out small.

Python has a built-in module, sqlite3, that you can use to work with SQLite databases. This is pretty well documented – do a search and you’ll find a ton of brief tutorials. Take a look at this great post for a comprehensive intro.

For generating reports I gave prettytable a shot: it lets you create nice looking ASCII text tables that you can copy and paste from the prompt or export out to a file. The tutorial for the module was pretty clear and covers the basics quite nicely. In the examples he directly embeds the data in the script and generates the table from it, which makes the tutorial readily understandable. For my purposes I wanted to pull data out of a SQLite database and into a formatted table, so that’s what I’ll demonstrate here.

Initially I had some trouble getting the module to load, primarily (I think) because I’m using Python 3.x and the setup file for the module was written for Python 2.x; the utility you use for importing 3rd party modules has changed between versions. I’m certainly no Python expert, so instead of figuring it out I just downloaded the module, dumped it into the site-packages folder (as suggested in the prettytable installation instructions under “The Harder Way” – but it wasn’t hard at all) and unzipped it. In my script I couldn’t get the simple “import prettytable” to work without throwing an error, but when I imported the specific name with “from prettytable import PrettyTable” it worked. Your mileage may vary.

So here was my first go at it. I created a test database and loaded a table of population estimates from the US Census Bureau into it (you can download it if you want to experiment):

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

x = PrettyTable(col_names)
x.align[col_names[1]] = "l"
x.align[col_names[2]] = "r"
for row in rows:
    x.add_row(row)

print (x)
tabstring = x.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()

The first piece is the standard SQLite piece – connect, activate a cursor, and execute a SQL statement. Here I’m grabbing three columns from the table for records that represent Northeastern states (Region 1). I read the column names from the cursor’s description into the col_names list, and I grab all of the records and dump them into rows, a list that contains a tuple for each record:

>>> col_names
['State', 'Name', 'Est2010']
>>> rows
[('09', 'Connecticut', 3574097), ('23', 'Maine', 1328361), ('25', 'Massachusetts', 6547629),
('33', 'New Hampshire', 1316469), ('34', 'New Jersey', 8791898), ('36', 'New York', 19378104),
('42', 'Pennsylvania', 12702379), ('44', 'Rhode Island', 1052567), ('50', 'Vermont', 625741)]
>>>

The second piece will make sense after you have a quick look at the prettytable tutorial. Here I grab the list of column names and specify how cells for the columns should be aligned (default is center) and padded (default is one space). Then I add each row from the nested list of tuples to the table, row by row. There are two outputs: print directly to the screen, and dump the whole table into a string. That string can then be dumped into a text file, along with a title. Here’s the screen output:

+-------+---------------+----------+
| State | Name          |  Est2010 |
+-------+---------------+----------+
|   09  | Connecticut   |  3574097 |
|   23  | Maine         |  1328361 |
|   25  | Massachusetts |  6547629 |
|   33  | New Hampshire |  1316469 |
|   34  | New Jersey    |  8791898 |
|   36  | New York      | 19378104 |
|   42  | Pennsylvania  | 12702379 |
|   44  | Rhode Island  |  1052567 |
|   50  | Vermont       |   625741 |
+-------+---------------+----------+

The one hangup I had was the formatting for the numbers: I really want some commas in there since the values are so large. I couldn’t figure out how to do this using the approach above – I’m writing all the rows in one swoop, and couldn’t step in and format the last value for each row.

Unless – instead of constructing the table by rows, I construct it by columns. Here’s my second go at it:

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

y=PrettyTable()
y.add_column(col_names[0],[row[0] for row in rows])
y.add_column(col_names[1],[row[1] for row in rows])
y.add_column(col_names[2],[format(row[2],',d') for row in rows])
y.align[col_names[1]]="l"
y.align[col_names[2]]="r"

print(y)
tabstring = y.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()

To add by column, you don’t provide any arguments to the PrettyTable function. You just add the columns one by one: here I call the appropriate values using the index, first for the column name and then for all of the values from the rows that are in the same position. For the last value (the population estimate) I use format to display the value like a decimal number (this works in Python 3.1+ – for earlier versions there’s a similar command – see this post for details). I tried this in my first example but I couldn’t get the format to stick, or got an error. Since I’m specifically calling these row values and then writing them I was able to get it to work in this second example. In this version the alignment specifications have to come last. Here’s the result:

+-------+---------------+------------+
| State | Name          |    Est2010 |
+-------+---------------+------------+
|   09  | Connecticut   |  3,574,097 |
|   23  | Maine         |  1,328,361 |
|   25  | Massachusetts |  6,547,629 |
|   33  | New Hampshire |  1,316,469 |
|   34  | New Jersey    |  8,791,898 |
|   36  | New York      | 19,378,104 |
|   42  | Pennsylvania  | 12,702,379 |
|   44  | Rhode Island  |  1,052,567 |
|   50  | Vermont       |    625,741 |
+-------+---------------+------------+

prettytable gives you a few other options, like the ability to sort records by a certain column or to return only the first “n” records from a table. In this example, since we’re pulling the data from a database we could (and did) specify sorting and other constraints in the SQL statement instead. prettytable also gives you the option of exporting the table as HTML, which can certainly come in handy.
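For reference, a quick sketch of a few of those options as they appear in the current prettytable API (attribute and method names may differ in older versions of the module):

#A few prettytable extras, following the module's documented API
x.padding_width = 1                    #padding on either side of each cell
x.sortby = 'Est2010'                   #sort the rows by a column
print(x.get_string(start=0, end=5))    #print only the first five records
html = x.get_html_string()             #export the table as HTML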

### Altering Tables in SQLite / Spatialite

Thursday, February 7th, 2013

In building the Spatialite geodatabase (see previous post), one of the fundamental things I learned was how to manage table creation and alteration in SQLite, which was quite different from my previous experience working with MS Access and ArcGIS.

In SQLite, the ALTER TABLE statement is limited to changing the name of a table or adding new columns. If you want to make any other changes, the process is: create a new, blank table that’s structured the way you want, and then write an INSERT statement to copy the data you want from the existing table into the new table. So if you have a table that has a bunch of columns you want to drop or change, create a new table that has your desired structure, then insert what you want into that new table. The same thing goes for primary keys or other constraints. If your existing table doesn’t have a specified key you can’t alter it by specifying one: create a new table that does, then copy your data over.

Let’s say we have this table (de_data) with basic population data for Delaware’s three counties:

USPS	GEOID	NAME			POP10	HU10	ALAND_SQMI	AWATER_SQMI
DE	10001	Kent County		162310	65338	586.179		212.152
DE	10003	New Castle County	538479	217511	426.286		67.717
DE	10005	Sussex County		197145	123036	936.079		260.312

And let’s say that we want to change some of the column names, drop the field for housing units, and specify our data types and GEOID as our key. First, create the table:

CREATE TABLE de_pop (STATE TEXT, GEOID TEXT NOT NULL PRIMARY KEY,
COUNTY TEXT, POP10 INTEGER, ALAND REAL, AWATER REAL)

GEOID is the FIPS / ANSI code that uniquely identifies each county – the two-digit state code followed by the three-digit county code. Since these codes may have leading zeros (the state codes for AL through CT do), we designate the column as text.

Second, insert the data we want from the existing table:

INSERT INTO de_pop (STATE, GEOID, COUNTY, POP10, ALAND, AWATER)
SELECT USPS, GEOID, NAME, POP10, ALAND_SQMI, AWATER_SQMI
FROM de_data

Order here matters – it’s going to insert columns from the original table into the new one in sequence: USPS into STATE, GEOID into GEOID, etc. Sometimes it’s possible to use an * as a shortcut to insert and copy everything, instead of listing every field, if both tables contain the same number of columns and they’re in the right order. But this is always a bit risky.

Lastly, if we don’t need that original table we could delete it:

DROP TABLE de_data

SELECT * FROM de_pop

STATE	GEOID	COUNTY			POP10	ALAND	AWATER
DE	10001	Kent County		162310	586.179	212.152
DE	10003	New Castle County	538479	426.286	67.717
DE	10005	Sussex County		197145	936.079	260.312

The process is similar if we want to take a query or view and turn it into a permanent table. SQLite does support CREATE TABLE AS, followed by a select query, so you can create a table out of a query. However – this is usually NOT the best course of action. If you do this, you won’t be able to specify a primary key for the new table (you really never want a table that lacks a key). Furthermore, if you create a new, calculated field you won’t be able to specify a type for it. This is particularly a problem if you’re working in Spatialite and want to create a spatial view or table that you want to join to some features and map in QGIS. If no type is specified, QGIS won’t know how to handle that new field, and it will be unviewable.

So, we take the same approach as before. Let’s say we want to create a table with population density as a calculated field. First, create the table:

CREATE TABLE de_popdens (STATE TEXT, GEOID TEXT NOT NULL PRIMARY KEY,
COUNTY TEXT, POP10 INTEGER, ALAND REAL, AWATER REAL, POPDENS REAL)

Then we write our insert statement, and in the insert we create the calculated field:

INSERT INTO de_popdens (STATE, GEOID, COUNTY, POP10, ALAND, AWATER, POPDENS)
SELECT STATE, GEOID, COUNTY, POP10, ALAND, AWATER, ROUND(POP10/ALAND, 1)
FROM de_pop

SELECT * FROM de_popdens

STATE	GEOID	COUNTY			POP10	ALAND	AWATER	POPDENS
DE	10001	Kent County		162310	586.179	212.152	276.9
DE	10003	New Castle County	538479	426.286	67.717	1263.2
DE	10005	Sussex County		197145	936.079	260.312	210.6

This gives us a new table with density properly typed. Once again, this is only necessary if you want the calculated field to be permanent and function with other objects in the database, and particularly if you want to join this table to a geodatabase feature or shapefile to map the attributes. If you simply want the new field to answer a specific question or to export the data to output, you can just do a SELECT query and save it as a view.

I also had to do this procedure for every table and shapefile that I imported to the geodatabase. In the Spatialite GUI you don’t have the option to specify a primary key or data types for columns when you do an import; for the latter it takes its best guess. So to get it well-structured, I imported (or created a virtual link), created a new table, inserted the data over, then deleted the original table or severed the link. If it was a shapefile I went through the extra step of activating the geometry (check out the Spatialite Cookbook or the tutorial I wrote for the NYC geodatabase for details).

Since I was dealing with some enormous tables with hundreds of columns, I used some trickery to avoid typing all the statements by hand. If the data was small enough and came from a spreadsheet, I used a series of concatenate formulas to build the CREATE TABLE and INSERT statements by copying the field names and stringing them together with type names and necessary syntax, so I could just copy and paste a statement into the SQL dialog box. For larger datasets I used Python to do the processing, and had Python grab field names and write statements that I could copy and paste.
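A rough sketch of that kind of helper – it reads a header row and prints statements that can be pasted into the SQL dialog box (the file name, table names, and the blanket INTEGER type are all hypothetical; in practice you would tailor the types):

#Sketch: build CREATE TABLE and INSERT statements from a CSV header row
#(file name, table names, and the column types here are hypothetical)
import csv

with open('big_table_header.csv') as f:
    fields=next(csv.reader(f))

cols=', '.join(field+' INTEGER' for field in fields[1:])
print('CREATE TABLE new_table (%s TEXT NOT NULL PRIMARY KEY, %s)' %(fields[0],cols))
print('INSERT INTO new_table SELECT %s FROM old_table' %(', '.join(fields)))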

The import issue was particular to the Spatialite GUI, and not SQLite in general. If you’re dealing with just data tables and use the SQLite Manager (Firefox plugin), it asks you to specify column names, keys and constraints, and types for the columns you’re importing. It does the latter by having you select from a dropdown box for each column – this works fine if you have 10 or 20 columns, but it’s rather tedious if you have hundreds. The Manager also gives you the ability to alter additional elements of the table, like column names, by essentially performing the same operations (create table, insert records, drop table) behind the scenes while holding things temporarily in memory. It prefaces this with a warning message that this operation is not part of the standard SQLite commands, and there’s a chance something could go awry.

### Screen Scraping Data with Python

Friday, March 9th, 2012

I had a request recently for population centers (aka population centroids) for all the counties in the US. The Census provides the 2010 centroids in state level files and in one national file for download, but the 2000 centroids were provided in HTML tables on individual web pages for each state. Rather than doing the tedious work of copying and pasting 51 web pages into a spreadsheet, I figured this was my chance to learn how to do some screen scraping with Python. I’m certainly no programmer, but based on what I’ve learned (I took a three day workshop a couple years ago) and by consulting books and crawling the web for answers when I get stuck, I’ve been able to write some decent scripts for processing data.

For screen scraping there’s a must-have module called Beautiful Soup which lets you easily parse web pages, well- or ill-formed. After reading the Beautiful Soup Quickstart and some nice advice I found on a post on Stack Overflow, I was able to build a script that looped through each of the state web pages, scraped the data from the tables, and dumped it into a delimited text file. Here’s the code:

## Frank Donnelly Feb 29, 2012
## Scrapes 2000 centers of population for counties from individual state web pages
## and saves in one national-level text file.

from urllib.request import urlopen
from bs4 import BeautifulSoup

output_file=open('CenPop2000_Mean_CO.txt','a')

#write the column headings first (these heading names are a reconstruction)
titles=['STATEFP','COUNTYFP','COUNAME','STNAME','POPULATION','LATITUDE','LONGITUDE']
output_file.writelines(",".join(titles)+"\n")

url='http://www.census.gov/geo/www/cenpop/county/coucntr%s.html'

fips=['01','02','04','05','06','08','09','10',
'11','12','13','15','16','17','18','19','20',
'21','22','23','24','25','26','27','28','29','30',
'31','32','33','34','35','36','37','38','39','40',
'41','42','44','45','46','47','48','49','50',
'51','53','54','55','56']

for i in fips:
    #request the page for this state's FIPS code and parse it
    soup=BeautifulSoup(urlopen(url %i))
    titleTag=soup.title
    list=titleTag.string.split()
    name=(list[4:])
    state=' '.join(name)

    #the second table on the page holds the data; grab each cell in each row
    for row in soup('table')[1].tbody('tr'):
        tds = row('td')
        line=(tds[0].string, tds[1].string, tds[2].string, state,
        tds[3].string.replace(',',''), tds[4].string, tds[5].string)

        output_file.writelines(",".join(line)+"\n")

output_file.close()

After installing the modules step 1 is to import them into the script. I initially got a little stuck here, because there are also some standard modules for working with urls (urllib and urllib2) that I’ve seen in books and other examples that weren’t working for me. I discovered that since I’m using Python 3.x and not the 2.x series, something had changed recently and I had to change how I was referencing urllib.

With that out of the way I created a text file, a list with the column headings I want, and then wrote those column headings to my file.

Next I read in the url. Since the Census uses a consistent URL pattern that varies for each state only by FIPS code, I was able to assign the URL to a variable and insert the %s symbol as a placeholder where the FIPS code goes. I created a list of all the FIPS codes, and then I run through a loop – for every FIPS code in the list I pass that code into the url where the placeholder is, and process that page.

The first bit of info I need to grab is the name of the state, which doesn’t appear in the table. I grab the title tag from the page and save it as a list, and then grab everything from the fourth element (fifth word) to the end of the list to capture the state name, and then collapse those list elements back into one string (have to do this for states that have multiple words – New, North, South, etc.).

So we go from the HTML Title tag:

County Population Centroids for New York

To a list with elements 0 to 5:

list=["County", "Population", "Centroids", "for", "New", "York"]

To a shorter list with elements 4 to end:

name=["New","York"]

To a string:

state="New York"

But the primary goal here is to grab everything in the table. So we identify the table in the HTML that we want – the first table in those pages [0] is just an empty frame and the second one [1] is the one with the data. For every row (tr) in the table we can reference and grab each cell (td), and string those cells together as a line by referencing them in the list. As I string these together I also insert the state name so that it appears on every line, and for the population element I strip out any commas (numbers in the HTML table included commas, a major no-no that leads to headaches in a csv file). After we grab that line we dump it into the output file, with each value separated by a comma and each record on its own line (using the new line character). Once we’ve looped through the table on each state’s page, we close the file.

There are a few variations I could have tried; I could have read the FIPS codes in from a table rather than inserting them into the script, but I preferred to keep everything together. I could have read the state names in as a list, or coupled them with the codes in a dictionary. This would have been less risky than relying on the state name in the title tag, but since the pages were well-formed and I wanted to experiment a little I went the title tag route. Instead of typing the codes in by hand I used Excel trickery to concatenate commas to the end of each code, and then concatenated all the values together in one cell so I could copy and paste the list into the script.

You can go here to see an individual state page and source, and here to see what the final output looks like. Or if you’re just looking for a national level file of 2000 population centroids for counties that you can download, look no further!

### Define Projection for a Batch of Shapefiles

Sunday, June 12th, 2011

I was working on a project where I had downloaded 51 shapefiles (state-based census tract files) from the Census Generalized Cartographic Boundary Files. Each file lacked a .prj projection file, so I had to define each one as NAD83. Not wanting to do this one at a time, I used the GDAL / OGR tools and a bash script to process them all in a batch. I wrote a little script in a text file and then pasted it into the command line:

#!/bin/bash
for i in $(ls *.shp); do ogr2ogr -f "ESRI Shapefile" -a_srs "EPSG:4269" nad83/$i $i
done

It iterates through a list of all the shapefiles in a directory, uses OGR to define them as NAD83, then writes them to a new subdirectory called NAD83.

After searching through the web for some guidance on this, I later realized that there was a nice, succinct example of this in a book that I had (yeah – remember books? They’re still great!)

#!/bin/bash
# from Sherman (2008) Desktop GIS Mapping the Planet With Open Source Tools pp 243-44

for shp in *.shp
do
  echo "Processing $shp"
  ogr2ogr -f "ESRI Shapefile" -t_srs EPSG:4326 geo/$shp $shp
done

This does the same thing; the difference here is that it prints a message to the command line for each file that’s processed and uses the -t_srs switch (transform projection) rather than -a_srs (assign an output projection), which in this case seems to do the same thing. Of course you could tweak this a little to transform projections from one system to another as well.

This is fine and good if you’re using Linux and can use bash (go here for more info about bash). If you’re using Windows, you can do this if you’re using a Linux / UNIX terminal emulator like MSYS; otherwise you can use the DOS Command Prompt and write a batch (.bat) file to do this instead – the post on this forum is the first thing I found in my quest to figure all of this out.

### Joining CSV Files in QGIS

Monday, April 4th, 2011

DBF files are the other thorny issue that comes up when I’m teaching the Intro to GIS workshops with QGIS – specifically, how do you create and edit them? Doing that has been pretty inconvenient in the Windows world since they were deprecated in Excel 2007. You can download plugins or basic stand-alone software to do the job. Since I’m a Linux user I have the Open Office suite by default, and I can easily work with DBFs in Calc. That’s an option for Windows and Mac users too, but it’s kind of a drag to download an entire office suite just for working with DBFs (assuming most folks use MS Office).

Another possibility is to dump DBF altogether; even though the join table option in the fTools menu in QGIS only presents you the option of joining shapefiles or DBF, it turns out that if you choose the DBF radio button option you can actually point to DBFs OR CSV files when you’re browsing to your data table. The join proceeds the same way, and you get a new shapefile with the data from the CSV joined to it. CSVs can be easily created from any spreadsheet, text editor, or database program in any operating system, so you can prep your attribute data in your program of choice before exporting to CSV and importing to GIS.

The problem with this approach is that all of the fields from the CSV file are automatically saved as text or strings when they’re appended to the shapefile. This means that any numeric data you have can’t be treated numerically; you can’t perform calculations or classify data as value ranges for mapping. You can go into the attribute table, enter the edit mode, and use the field calculator to create new fields where you convert the values to integers, but this adds a bunch of duplicate fields and is rather messy. You can’t delete columns from shapefiles within QGIS; you have to edit the attributes outside the software to remove extra columns.

Here’s an easy work around: you can create a CSVT file to go along with your CSV file. Open a text editor like Notepad or gedit and create a one line file where you specify the data type for each of the fields in your CSV file. Save the CSVT with the SAME NAME as your CSV file. Now when you go to join your CSV to your shapefile in QGIS, it will read the CSVT and save all of your fields properly in the new shapefile.

So for a CSV file like this with six fields:

CODE, ST, ID_NAICS51, EMP_51, ID_TOTAL, TOTAL_EMP
01, AL, ENU0100010551, 27013, ENU0100010510, 1570188
02, AK, ENU0200010551, 6988, ENU0200010510, 237708
04, AZ, ENU0400010551, 41833, ENU0400010510, 2174919 …

You’d create a CSVT file like this:

“String”, “String”, “String”, “Integer”, “String”, “Integer”

So the 4th and 6th fields that have numeric values are saved as integers. There are a few other field types you can use, like Real for decimal numbers and some Date / Time options, and you can specify length and precision – see this post for details.

Using shapefiles, CSVs, and CSVTs are fine for small to medium projects and for the introductory workshops I’m teaching; geodatabases are another option and are certainly better for medium to large projects, but introducing them in my intro workshop is a little too much.

(NOTE – in QGIS 1.7 the join tool has been dropped from the ftools menu. To join a csv or dbf in 1.7+, you have to add your data table as a vector to your map, and then use the join tab within the properties menu of the feature you want to join it to. See this post for details.)