## Posts Tagged ‘data’

Thursday, August 13th, 2015

A few weeks ago I launched a new version of our college’s GIS data repository, the Baruch Geoportal. At the back end I have a simplified process for getting data onto our server, and on the front end we did away with manually updating HTML and CSS webpages in favor of using a Confluence wiki. My college has a subscription to Confluence, and I’ve been using an internal wiki for documenting and administering all aspects of our projects. A public, external wiki for providing our data seemed like a nice way to go – we can focus more on the content and it’s easier for my team and I to collaborate.

Since it’s a new site with a new address, many of the links to projects I’ve referred to throughout the years on this blog are no longer valid. Redirects are in place, but they won’t last forever. Some notable links to update:

The new site has a dedicated blog that you can follow (via RSS) for the latest updates to the portal. The portal also has a number of relatively new and publicly accessible datasets that we’ve posted over the last year (but that I haven’t had time to post about). These include the NYC Mass Transit Spatial Layers series and population centroids for US census geographies. We’ve been creating ISO spatial metadata for all of our new layers, but we still need to create XML stylesheets to make them more human-readable. That will be one of many projects to do for this academic year.

### Inserting Data into Templates with Python and Jinja

Friday, April 3rd, 2015

In this post, I’m picking up where I left off and will cover the different methods I experimented with to get data out of a SQLite database and into a Jinja LaTeX template using Python. I’m using the NYC Geodatabase as my test case.

### Standard Elements – Used Each Time

First – the Python script. For each iteration, the top half of the script remains the same. I import the necessary modules, and I set up my Jinja2 environment. This tells Jinja how to handle LaTeX syntax. I borrowed this code directly from the invaluable slides posted here. The only part that gets modified each time is the .get_template() bit, which is the actual LaTeX template with Jinja mark-up that is used for creating the reports.

import sqlite3

import jinja2
import os
from jinja2 import Template

latex_jinja_env = jinja2.Environment(
block_start_string = '\BLOCK{',
block_end_string = '}',
variable_start_string = '\VAR{',
variable_end_string = '}',
comment_start_string = '\#{',
comment_end_string = '}',
line_statement_prefix = '%-',
line_comment_prefix = '%#',
trim_blocks = True,
autoescape = False,
)
# Modify to specify the template
template = latex_jinja_env.get_template('test1.tex')


The method for connecting to a SQLite database is also the same each time. There are a zillion tutorials and posts for working with Python and SQLite so I won’t belabor that here. Take a look at this excellent one or this awesome one.

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT * FROM b_pumas_2013acs1 ORDER BY GEOID2 LIMIT 3')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
conn.close()


### First Iteration – Pass Individual Variables to the Template

Here’s the bit that I modify each time. Using the example from the tutorial slides, I loop through the rows returned from my database, and I specify individual variables each time by slicing the elements in the row and assigning them a name which is passed out to the template with template.render(). Then I make a call to LaTeX to generate the PDF file (straightforward since I’m using Linux), one for each row (which represent geographic areas). Each file is named using the unique ID number of the geography, which we grabbed from our row list.

for row in rows:
filename='zpuma_' + row[0] + '.tex'
folder='test1'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(geoid=row[0], geolabel=row[1], hshld=row[2], hshldmoe=row[3]))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


That’s the Python piece. The LaTeX template with the Jinja mark-up looks like this:

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{geolabel | replace("&","\&")} \VAR{geoid}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|}
\hline
& Estimate & Margin of Error\\
Households: & \num{\VAR{hshld}} & +/- \num{\VAR{hshldmoe}}\\
\hline
\end{tabular}

\end{document}


You can see here where I’m passing in the variables with \VAR – I’m using the same variable names that I created in the script to hold the row elements. I have to do a little bit of formatting to get this to work. First, one of my variables is text description that consistently contains an ampersand, so I have to use replace (a construct from Jinja) to replace & with \& so LaTeX can properly escape it. Second, I want to format my numeric variables with a thousands separator. Here I use a LaTeX construct with the siunitx package, and every place a number appears I mark it with \num. For this to work I always need to know that this variable will be a number; if it’s text or null LaTeX will throw an error and the doc won’t compile (an alternative to using this LaTeX solution would be to use Python’s formatting constructs). My simple output is below.

### Second Iteration – Pass Variables to Template in a List

Since I’m going to be passing lots of variables out to my template, it would be tedious if I had to declare them all individually, one by one. It would be better if I could pass out an entire list, and then do the slicing to get what I want in the template. Here’s the Python for doing that:

for row in rows:
filename='zzpuma_' + row[0] + '.tex'
folder='test2'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(thelist=row))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


And here’s the LaTeX template – in this example I modified the variables a bit.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{thelist[2] | replace("&","\&")} \VAR{thelist[1]}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Percent Margin of Error\\
Car, truck, or van alone: & \num{\VAR{thelist[171]}} & +/- \num{\VAR{thelist[172]}} & \num{\VAR{thelist[173]}} & +/- \num{\VAR{thelist[174]}}\\
Car, truck, or van carpooled: & \num{\VAR{thelist[175]}} & +/- \num{\VAR{thelist[176]}} & \num{\VAR{thelist[177]}} & +/- \num{\VAR{thelist[178]}}\\
\hline
\end{tabular}

\end{document}


While this is a bit better, the template is harder to read – you can’t really figure out what’s in there as you just have a bunch of list slices. You also have to keep careful track of which indices apply to what element, so you know what you’re generating. I thought I could improve this by creating nested lists where the column headings from the database get carried along, and I could reference them somehow. Then I had a better idea.

### Third Iteration – Pass Variables to Template in a Dictionary

I decided to use a dictionary instead of a list. Here’s the Python – since I grabbed the columns back in the database section of my code, I can loop through the elements in each row and create a dictionary by zipping the column names and row elements together, so the column name becomes the key and the row element is my data value. Then I pass the whole dictionary out to the template.

for row in rows:
thedict=dict(zip(col_names,row))
filename='zzpuma_' + row[0] + '.tex'
folder='test3'
outpath=os.path.join(folder,filename)
outfile=open(outpath,'w')
outfile.write(template.render(d=thedict))
outfile.close()
os.system("pdflatex -output-directory=" + folder + " " + outpath)


Now in the template, using Jinja I embed dict.get() for each variable and specify the key (column name) and the output will be the value. This is now highly readable, as I can see the names of the columns for the variables and there’s less potential for a mix-up.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{d.get('GEOLABEL') | replace("&","\&")} \VAR{d.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{tabular}{|c|c|c|c|c|}
\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\

Car, truck, or van alone: & \num{\VAR{d.get('COM02_E')}} & +/- \num{\VAR{d.get('COM02_M')}} & \num{\VAR{d.get('COM02_PC')}} & +/- \num{\VAR{d.get('COM02_PM')}}\\

Car, truck, or van carpooled: & \num{\VAR{d.get('COM03_E')}} & +/- \num{\VAR{d.get('COM03_M')}} & \num{\VAR{d.get('COM03_PC')}} & +/- \num{\VAR{d.get('COM03_PM')}}\\
\hline
\end{tabular}

\end{document}


In this case, the output looks the same as it did in our last iteration. Those are some basic methods for getting data into a template, and in my case I think the dictionary is the ideal data structure for this. In going further, my goal is to keep all the formatting and presentation issues in LaTeX, and all the data processing and selection pieces in Python.

### Creating Reports with SQLite, Python, Jinja2, and LaTeX

Sunday, March 29th, 2015

For a long time, I’ve been wanting to figure out a way to generate reports from a SQLite / Spatialite database. For example, I’d like to reach into a database and generate profiles for different places that contain tables, charts, and maps. I know I can use Python to connect to the db and pull out variables. I also learned how to use LaTeX several years back when I revised the GIS Practicum manual, and routinely use it for writing reports, articles, and hand-outs.

I finally have time to devote to this, and am going to share what I’m learning in a series of posts. In this post I’ll describe how I got started, and will record some useful projects and posts that I’ve found.

### Figuring Out What the Pieces Are

In searching the web for building reports in Python, I’ve discovered a number of solutions. Many people have written modules that are in various states of production, from active to defunct. Prettytable was something I’ve used for generating basic text-file reports. It’s absolutely great at what it does, but I’m looking for something that’s more robust. Of all the tools out there, ReportLab seemed to be the most prominent package that would appear again and again. I’ve shied away from it, because I wanted a solution that was a little more general – if that makes sense. Something where every component is not so tightly bound to a specific module.

Luckily I found this post, which was perfect for helping me to understand conceptually what I wanted to do. The author describes how he automatically generates song sheets by using a programming language (JAVA in this case) to reach into a database and insert the content into a template (LaTeX in this case) using a template engine (Apache Velocity) to produce good looking output. In this case, the template has the shell of a document and place-holders where variables will be passed in from the scripting language and rendered using the engine. He included this helpful diagram from wikimedia in his post:

I started looking for a template engine that would work well with both LaTeX and Python. The author had mentioned Cheetah as another engine, and it turns out that Cheetah is often used in conjunction with Python and LaTeX. After digging around some more, I discovered another template engine called Jinja (or Jinja2) which I’ve adopted as my solution, largely because I’ve found that the project documentation was quite good and there are numerous user examples that I can follow. Jinja2 allows you to do much more than simply passing variables into the template and rendering it; you have the option to run a lot of Pythonesque code from within the template itself.

### Putting the Pieces Together

While Jinja is often used for generating HTML and XML (for example), it’s also used for LaTeX (for example). I found that this series of slides was the perfect introduction for me. They’re written in German, but since most of the syntax in the scripting and mark-up languages is in English it’s easy to grasp (and those three-years of German I took way back in high school are now reaping dividends!)

The slides break down how you can use Python to generate LaTeX reports in several iterations. The first iteration involves no templating at all – you simply use Python to generate the LaTeX code that you want (or if you prefer, Python serves as the template generator). The limit of this are obvious, in that you have to hard code variables into the output, or use string substitution to find and replace variable names with the intended output. In the next iteration, he demonstrates how to use Jinja2. This section is invaluable, as it provides an example of setting the Jinja2 environment so that you can escape all of the necessary characters and syntax that LaTeX needs to function. He demonstrates how to pass a variable from Python to render in a template that you create in LaTeX and mark-up with Jinja2 code (slides 18 to 20). He goes on to show how you can loop through lists to generate output.

The third iteration displays how you can pull data out of SQLite and then use Python and LaTeX to generate output. With a little imagination, you can combine this piece with his previous one and voila, you have a SQLite-Python-Jinja-Latex combo. He has a final piece that incorporates screen-scraping using Beautiful Soup, which is pretty neat but beyond my needs for this project.

Now that I understand the conceptual model and I have the four tools I’ll use with some examples, I’m ready to start experimenting. I know there will be several additional pieces I’ll need to incorporate, to generate charts (matplotlib) and maps (perhaps some of the Python modules from QGIS). There are some instances where I’ll also have to write functions to create derivatives of the data I’m pulling, so I imagine NumPy/SciPy and GDAL will come in handy for that. But first things first – I need to get the four basic pieces – SQLite – Python – Jinja2 – LaTeX – working together. That will be the topic of my next post.

### Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by any one who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I’ve disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

Wednesday, July 30th, 2014

I released the latest version of the NYC geodatabase (nyc_gdb) a few weeks ago. In addition to simply updating the data (for subway stations and ridership, city point features, and ZIP Code Business Patterns data) I had to make a couple of serious upgrades.

The first was that is was time for me to update the version of Spatialite I was using, from 2.4 to 4.1, and to update my documentation and tutorial from the Spatialite GUI 1.4 to 1.7. I used the spatialite_convert tool (see the bottom of this page for info)to upgrade and had no problem. There were some major benefits to making the switch. For one, writing statements that utilize spatial indexes is much simpler – this was version 2.4, generating a neighbor list of census tracts:

SELECT tract1.tractid AS tract, tract2.tractid AS neighbor
FROM a_tracts AS tract1, a_tracts AS tract2
WHERE ST_Touches(tract1.geometry, tract2.geometry) AND tract2.ROWID IN (
SELECT pkid FROM idx_a_tracts_geometry
WHERE pkid MATCH RTreeIntersects (MbrMinX(tract1.geometry), MbrMinY(tract1.geometry),
MbrMaxX(tract1.geometry), MbrMaxY(tract1.geometry)))

And here’s the same statement in 4.1 (for zctas instead of tracts):

SELECT zcta1.zcta AS zcta, zcta2.zcta AS neighbor
FROM a_zctas AS zcta1, a_zctas AS zcta2
WHERE ST_Touches(zcta1.geometry, zcta2.geometry)
AND zcta1.rowid IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name=’a_zctas’ AND search_frame=zcta2.geometry)
ORDER BY zcta, neighbor

There are also a number of improvements in the GUI. Tables generated by the user are now grouped under one heading for user data, and the internal tables are grouped under subsequent headings, so that users don’t have to sift through all the objects in the database to see what they need. The import options have improved – with shapefiles and dbfs you can now designate your own primary keys on import. You also have the option of importing Excel spreadsheets of the 97-2003 variety (.xls). In practice, if you want the import to go smoothly you have to designate data types (format-cells) in the Excel sheet (including number of decimal places) prior to importing.

I was hesitant to make the leap, because version 2.4 was the last version where they made pre-compiled binaries for all operating systems; after that, the only binaries are for MS Windows and for Mac and Linux you have to compile from source – which is daunting for many Mac users that I am ill-equipped to help. But now that Spatialite seems to be more fully integrated with QGIS (you can create databases with Q and using the DB Manager you can export layers to an existing database) I can always steer folks there as an alternative. As for Linux, more distros are including updated version of the GUI in their repositories which makes installation simple.

One of the latest features in Spatialite 4.1.1 is the ability to import XML ISO metadata into the database, where it’s stored as an XML-type blob in a dedicated table. Now that I’m doing more work with metadata this is something I’ll explore for the future.

### ZIPs to ZCTAs

The other big change was how the ZIP Code Business Patterns data is represented in the database. The ZBP data is reported for actual ZIP Codes that are taken from the addresses of the business establishments, while the boundaries in the nyc_gdb database are for ZIP Code Tabulation Areas (ZCTAs) from the Census. Previously, the ZBP data in the database only represented records for ZIP Codes that had a matching ZCTA number. As a result, ZIP Codes that lacked a corollary because they didn’t have any meaningful geographic area – the ZIP represented clusters of PO Boxes or large organizations that process a lot of mail – were omitted entirely.

In order to get a more accurate representation of business establishments in the City, I instituted a process to aggregate the ZIP Code data in the ZBP to the ZCTA level. I used the crosswalk table provided by the MCDC which assigns ZIPs to ZCTAs, so those PO Boxes and large institutions are assigned to the ZCTA where they are physically located. I used SQLite to import that crosswalk, imported the ZBP data, joined the two on the ZIP Code and did a group by on the ZCTA to sum employment, establishments, etc. For ZIPs that lacked data due to disclosure regulations, I added some note or flag columns that indicate how many businesses in a ZCTA are missing data. So now the data tables represent records for ZCTAs instead of ZIPs, and they can be joined to the ZCTA features and mapped.

The latest ZBP data in the new database is for 2012. I also went back and performed the same operation on the 2011 and 2010 ZBP data that was included in earlier databases, and have provided that data in CSV format for download in the archives page (in case anyone is using the old data and wants to go back and update it).

### Introducing – Data!

Wednesday, April 9th, 2014

Professors invite me to their classes each semester to give students a crash course in finding data for neighborhoods in New York City, with a particular emphasis on Census data. I typically visit courses in journalism and public affairs, but this semester I added classes in management and – theater – to the list. Before I dive into what the Census is and what sources they should use, I preface the presentation with a discussion of what neighborhoods are and how we define them. This is important because neighborhoods are locally and informally defined, and when searching for datasets we often have to use a proxy, like census tracts, ZIP codes, PUMAs, or local legal or administrative areas, to approximate them.

But before we get this far, I always begin the discussion with some basic questions to set the stage: what is data, and what can we use it for? For the journalism students, I explain that data can help support a story. If they’re covering a town hall or community board meeting where affordabale housing is the topic of discussion, they’re going to want to provide some context and include some facts to support their story – what is the rent like in the neighborhood? How many people live there? Alternatively, data can provide the basis for a story. I point to one of many numerous examples in NYC where journalists have taken a big lump of unrefined data – the NYPD’s stop and frisk data, traffic fatality incidents, 311 complaints – and have refined it to produce information that leads them to an interesting story that was hidden in these numbers. Lastly, data is a story – whenever the Census releases a new dataset, someone is writing to announce the release and tell us what’s in there.

This idea of refining leads us to our first basic definition – data can be considered as raw and unrefined information. It doesn’t tell us much in and of itself, but if we sift through and refine it we can turn it into information that we can use to tell or support a story or reveal some fact or truth that was previously unknown. Data can be quantitative or qualitative – journalists for example may interview someone for two or three hours, but they’re not going to turn around and publish that entire interview. They’re going to write an article that summarizes it and gives us the most important bits, or edit it for a radio broadcast that covers the high lights. With quantitative data the issue is similar – I use a basic example of population data for the 50 states and show them this image of a comma delimited text file:

I explain that this is what data looks lke in a raw state. It’s in a basic format suitable for preservation or transit between systems, but is not in a presentable state. There are a lot of codes that are meaningless to the average person, the data isn’t sorted in a meaningful way, the column headings seem ambiguous, and the numbers aren’t formated for viewing. This isn’t something that they’d want to insert directly into their story or paper. But if they take this and do a little bit of work:

They can take that raw data and turn it into information. Here we’ve moved from raw data to a presentable table. The statistics are sorted in a logical order based on total population, columns are given comprehensible names, and unecessary information (for presentation purposes) is dropped. We add commas to the numbers so they’re more legible, and we create some value by adding a percent total column. Now we have something we can use to communicate a message. But we can go further – we can take that same information and turn it into this:

Now we have a chart. At this point I turn to the students and ask them what the benefit of using the chart is, followed by a discussion of trade-offs; we’ve gained something but lost something too. On the plus side, we can appeal to people’s visual sensibilities, and we can see more clearly that California has twice as many people as New York. The chart is also more concise, as it’s taking up less real estate on the page or on the screen. But we’ve exchanged conciseness for preciseness; we can no longer tell what the exact population numbers are with the chart; we can only approximate. But we can also go further:

We can take that same dataset and turn it into a map. Once again, we discuss the pluses and minuses. Now we can key into to people’s geographic knowledge as well as their visual senses; Ohio may be more meaningful now that we can see it on a map, rather than just seeing a number in a table. We can also see geographic patterns of clustering or diffusion, which the table or chart couldn’t show us. But with the map we’ve lost even more precision. Now we can only see that a state’s population number falls within a given range; we can’t see the precise number and can’t approximate it like we could with the chart.

At this point, one student will point out that if the chart or map is on the web, we can have the best of all worlds. If the graphic is interactive we can hover over it and see the exact population number. This leads to a discussion of the trade-offs between interactive web-based information and static information. The interactive chart or map let’s us keep precision and conciseness, but the sacrifice is complexity, portability, and preservation. It’s more complex to create, and it can only exist in it’s native environment, within a specific bundle of technology that includes programming and scripting langauges, software libraries, browsers, and operting systems. Such things go obsolete quickly and can easily break, so the shiny chart, map, or app you have today is non-functional in a year or two, and difficult to preserve. Contrast that with a static image or text, which is simple, easy to move around, depends on little else, and can make the jump from a screen to the printed page.

We sum up this little talk with the basis of what they’re trying to achieve – I use the DIK pyramid, which I was introduced to in library school (OK – this pic is the DIKW pyramid, with wisdom thrown on top – it’s public domain so I can safely use it):

As journalists or researchers, you’re taking data and refining it to turn it into information to support your work or to commuicate a message. You take those pieces of new information and bring them together to tell a bigger story and paint a bigger picture, which we hope will lead to greater knowledge (which, unlike data and information, is something that can only be learned and not simply assesmbled and communicated). The weather is a good example – a giant log of temperature and precitiptation data isn’t going to do me much good. But if you process that data to calculate the high, low, and mean, now you have information I can use. Take that information and combine it with a radar picture and a forecast and now I have a rich information object. I can take that object and piece it together with other information – another forecast I hear on the radio, what I see out the window, my previous experiences of getting wet, my wife’s advice – to formulate a decision that I can act on. By considering all of this information – my experiences, contextualized information, and know how – and weighing it to reach a conslucion, I am using my knowledge. In this case I’ll use it carry an umbrella.

The final point is that, in their papers, the students must take the information objects that they’ve created or acquired and integrate them into their work. Many students will just copy and download a table and stick it at the back of the paper, and assume that it speaks for itself. I tell them – it doesn’t! You have to explain why it’s there; make reference to it in the paper and weave it into your research.

Overall this presentation / discussion takes all of about 10 minutes, AND THEN we move into the discussion of neighborhoods, the census, and specific datasets. I’ve contemplated skipping it all together, but ultimately decided that it’s necessary. I think it’s essential to provide some context and theory coupled with the actual sources and the pragmatic nature of finding the data. There are some librarians who are completely adverse to teaching “tools” and will speak completely in the abstract, while there are others who cut directly to listing the sources and leaving it at that. The first approach is useless because the students won’t learn what to actually do; the second apporach makes assumptions about what they know and fails to prepare them for what they’ll face. There also seems to be a clear need for me to do this – I’ve heard many faculty who have commented that students are simply tacking data tables they’ve copied off the web into the back of papers without any explanation. When I present the slide that depicts the csv file, I was initally shocked by the looks of shock on many student’s faces – like they’d never seen or heard of this before and were worried that they’d have to wrestle with it. Here’s the data-driven world, step 1.

### Mapping Domestic Migration with IRS Data

Friday, November 18th, 2011

Forbes magazine just published a neat interactive map on American migration using data NOT from the Census, but from – the IRS. Whether you fill it out virtually or the old fashioned way, everyone fills in their address at the top of the 1040, and the IRS stores this in a database. If you file from a different address from one year to the next you must have moved, and the IRS publishes a summary file of where people went (with all personal information and practically all filing data stripped away) .

The Forbes map taps into five years of this data and lets you see all domestic in-migration and out-migration from a particular county. The map is a flow or line map with lines going from the county you choose to each target – net in-migration to your county is colored in blue and net out-migration is red. You can also hover over the sending and receiving counties to see how many people moved. Click on the map to choose your county or search by name; you also have the option of searching for cities or towns, as the largest place within each county is helpfully identified and tied to the data.

It’s relatively straightforward and fun to explore. Some of the trends are pretty striking – the difference between declining cities (Wayne County – Detroit MI) and growing ones (Travis County – Austin TX) is pretty vivid, as is the change in migration during the height of the housing boom period in 2005 compared to the depth of the bust in 2009 (see Maricopa County – Phoenix AZ). More subtle is the difference in the scope of migration between urban and rural counties, with the former having more numerous and broader connections and the latter having smaller, more localized exchanges. Case in point is my home state of Delaware – urban New Castle County (Wilmington) compared to rural Sussex County (Seaford). There are many other stories to see here – the exodus from New Orleans after Katrina and the subsequent return of residents, the escape from Los Angeles to the surrounding mountain states, and the pervasiveness of Florida as a destination for everybody (click on the thumbnails below for full images of each map).

 Wayne Co MI (Detroit) 2009 Travis Co TX (Austin) 2009 Mariciopa Co AZ (Phoenix) 2005 Mariciopa Co AZ (Phoenix) 2009 New Castle Co DE (Wilmington) 2009 Sussex Co DE (Seaford) 2009

While the map is great, the even better news is that the data is free and can be downloaded by anyone from the IRS Statistics page. They provide a lot of summary data – information for individuals is never reported. The individual tax data page with data gleaned from the 1040 has the most data that is geographic in nature. If you wanted to see how much and what kind of tax is collected by state, county, and ZIP code you could get it there. The US Population Migration data used to create the Forbes map is also there and the years from 2005 to 2009 are free (migration data from 1991 to 2004 is available for purchase).

You can download separate files for county inflow and county outflow on a state by state basis in Excel (.xls) format, or you can download the entire enormous dataset in .dat or .csv format. The data that’s reported is the number of filings and exemptions that represent a change in address by county from one year to the next, and includes the aggregated adjusted gross income of the total filers. There are some limitations – in order to protect confidentiality, if the flow from one county to another has less than 10 moves that data is lumped into an “other” category. International migration is also lumped into one interntaional category (on the Forbes map, both the other category where two counties have a flow less than 10 and the foreign migration category are not depicted).

The IRS migration data is often used when creating population estimates; when combined with vital stats on births and deaths it can serve as the migration piece of the demographic equation.

### 2010 Census Redistricting Data

Sunday, April 17th, 2011

The Redistricting Summary Data [P.L. 94-171] from the 2010 Census has all been published for the nation, states, counties, and places, and is available via the new American Factfinder. The redistricting data includes basic demographic data: total population, race, Hispanic or Latino origin, and number of housing units occupied and vacant. Data is available down to census blocks and is available for most (but not all – no ZCTAs or PUMAs) geographies.

If you don’t want all the data for a state, don’t want to slog through the Factfinder, and are comfortable working with large text files, you can FTP the summary data from the Redistricting Data homepage. If you want basic summary data for states, counties, and places and don’t want to fuss with the Factfinder or text files, you can download Excel spreadsheets from the Redistricting Data Press Kit. They also have some pdf / jpg maps showing county level population and population change, plus interactive map widgets like the one below for the country and for each state. 2010 Redistricting TIGER Shapefiles have also been released for geographies included in the redistricting dataset.

The full 2010 Census for all geographies will be released throughout this summer and into the fall in Summary File 1 [SF1]. Stay tuned.

### Freely Available World Bank Country Data

Saturday, September 11th, 2010

This actually happend a little while ago, but for various reasons I haven’t been able to keep up with posting…

Our library had been subscribing to the WDI (World Development Indicators) database from the World Bank, but we were recently informed that the product was being discontinued and all of the data from the WDI and a number of other World Bank datasets would now be freely available from their data portal at http://data.worldbank.org/.

You can download an indicator for all countries by browsing through a list of all 300, or drill down by broad topics. Select an indicator and you can view a table with the most recent data, or a graduated circle map. If you download a table you can choose between an Excel or XML format. If you download the Excel format you get all years for all countries for that particular indicator from 1960 to present; but for many indicators you end up with a lot of null values up until this decade. If you go the XML route, the nulls are omitted and only years with data are provided. Unfortunately, in neither case do you get any unique identifiers like an ISO code.

Fortunately, power users can opt to download an entire data set, such as all of the WDI Indicators, in one file via their data catalog. In this case you have the option for Excel (xlsx only) or CSV, and the records I looked at DID contain ISO codes for each country (3 letter alpha). It looks like they’re also letting people tap into an API, so you can build web applications that harness the data directly from their repository.

In addition to browsing through indicators, you also have the ability to pull up a profile for a particular country to view several indicators for one particular place. They have a snazzy dashboard with stats, charts, and a reference map.

### 2007 Economic Census

Monday, June 8th, 2009

The US Census Bureau has begun releasing data for the 2007 Economic Census. The bureau conducts the survey of businesses every five years – all medium to large size businesses and multi-part businesses are counted, samples are taken for smaller businesses, and various administrative records are used to calculate businesses with no employees (i.e. freelancers). All businesses are categorized hierarchically by North American Industrial Classification System (NAICS) codes and the data is reported by industry and geography. The number of establishments, employees, payroll, and sales are counted for the nation, states, counties, metro areas, places, and zip codes.

At this point national industry totals for the broadest categories of NAICS are available, as are preliminary numbers for the most specific NAICS categories (six digit) at the national level. Data for smaller geographic areas will be released between October 2009 and August 2010.

The biggest change from the 2002 Economic Census is the delivery method for the data. There will be no more 90 page pdf files or HTML tables that drill down six levels. All of the data will be released via the American Factfinder only. Other changes include the addition of some new geography (CDPs with at least 5000 people), new metro area definitions, and the revised 2007 definitions for NAICS which include small changes to the Finance, Insurance, Real Estate, Professional Services, and Administrative Services categories.

Additional changes for 2007, the data release schedule, NAICS codes, and methodology docs are all available at the 2007 Economic Census homepage within the Census Bureau’s website.

All of the data is aggregated by industry and geography – you cannot get lists of businesses with names and addresses as this information is kept confidential. Furthermore, to maintain confidentiality, if one company controls a large share of the market for a specific sector within a specific geographic area, or if there few businesses within a sector in a specific geographic area, much of the data (with the exception of the number of businesses) remains classified (marked with a D for disclosure). Oftentimes this means that data for industries within small areas (big box retail in a small town) and data for industries with few establishments in an area (mining establishments in New York City) are hidden. The smaller the geography, the more likely it is that the data will not be disclosed. This becomes a technical issue if you want / need to move this data into a database, as these pesky disclosure notes are stored in the same columns as the data and prevent you for designating the fields as numeric.

Given the delay between the time the data is collected and the time it is released, it isn’t particularly helpful for analyzing our current economic climate, but it does provide a snapshot of the way the US economy looked at that moment, and is useful in understanding how the economy is evolving. Be aware that when making comparisons to past data, you have to correct for changes in geography and NAICS definitions. The differences between 2002 and 2007 are not too great, but more adjustments are necessary as you go further back in time. The Bureau provides data back to 1997 through the American Factfinder and some data from 1992 on an older page. If you need to go back further, you’ll be entering the realm of (gasp!) CD-ROMs or the paper reports.