Article on Processing Government Data With Python

August 28th, 2014

Last month I had an article published in the code{4}lib journal: a case study on using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that anyone can use to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I've mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.
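The gist of a place-based extract is simple: filter the records down to a set of ZIP Codes covering your area of interest. Here's a minimal sketch of the idea using Python's csv module – the field names and sample rows are invented for illustration and are not the IRS file's actual layout:

```python
import csv
import io

# Hypothetical sample of a delimited file of tax-exempt organizations.
# Real field names and layout differ; this just shows the filtering pattern.
SAMPLE = """ein,name,zip
111111111,Food Pantry,11201
222222222,Arts Council,90210
333333333,Youth Center,11215
"""

# ZIP Codes covering the geographic area of interest
target_zips = {"11201", "11215"}

reader = csv.DictReader(io.StringIO(SAMPLE))
extract = [row for row in reader if row["zip"] in target_zips]

for row in extract:
    print(row["ein"], row["name"])
```

In practice you'd read the file from disk and draw the target ZIPs from the ZIP-to-ZCTA masterfile rather than hard-coding them.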

I also provide a brief explanation of using OpenRefine to clean data with its text facet tools. One thing I forgot to mention in the article is that OpenRefine records the history of the fixes you apply. So if you have to process an update of the same file in the future (which I'll have to do repeatedly), you can simply re-apply all the fixes you made in the past, which are saved as a JSON file.
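For reference, that saved history is just a JSON array of operations that you can extract and re-apply to a new project. A minimal, hypothetical example with a single text transform might look like this (the column name and expression are invented for illustration):

```json
[
  {
    "op": "core/text-transform",
    "columnName": "ORG_NAME",
    "expression": "value.trim()",
    "onError": "keep-original",
    "description": "Text transform on cells in column ORG_NAME using expression value.trim()"
  }
]
```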

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

The citation and link:

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21). http://journal.code4lib.org/articles/9652.

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I've used Python quite a bit, this is the first time I've written anything serious and released it publicly. If there are ways I could improve it, I'd appreciate your feedback. Other than a three-day workshop I took years ago, I'm entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I disabled the blog comments here a long time ago, but feel free to send me an email. If there's enough interest I'll do a follow-up post with the suggestions – mail AT gothos DOT info.

NYC Geodatabase Updates: Spatialite Upgrade & ZIPs to ZCTAs

July 30th, 2014

I released the latest version of the NYC geodatabase (nyc_gdb) a few weeks ago. In addition to simply updating the data (for subway stations and ridership, city point features, and ZIP Code Business Patterns data) I had to make a couple of serious upgrades.

Spatialite Updates

The first was that it was time for me to update the version of Spatialite I was using, from 2.4 to 4.1, and to update my documentation and tutorial from the Spatialite GUI 1.4 to 1.7. I used the spatialite_convert tool (see the bottom of this page for info) to upgrade and had no problems. There were some major benefits to making the switch. For one, writing statements that utilize spatial indexes is much simpler. Here's version 2.4, generating a neighbor list of census tracts:

SELECT tract1.tractid AS tract, tract2.tractid AS neighbor
FROM a_tracts AS tract1, a_tracts AS tract2
WHERE ST_Touches(tract1.geometry, tract2.geometry) AND tract2.ROWID IN (
SELECT pkid FROM idx_a_tracts_geometry
WHERE pkid MATCH RTreeIntersects (MbrMinX(tract1.geometry), MbrMinY(tract1.geometry),
MbrMaxX(tract1.geometry), MbrMaxY(tract1.geometry)))

And here’s the same statement in 4.1 (for zctas instead of tracts):

SELECT zcta1.zcta AS zcta, zcta2.zcta AS neighbor
FROM a_zctas AS zcta1, a_zctas AS zcta2
WHERE ST_Touches(zcta1.geometry, zcta2.geometry)
AND zcta1.rowid IN (
SELECT rowid FROM SpatialIndex
WHERE f_table_name='a_zctas' AND search_frame=zcta2.geometry)
ORDER BY zcta, neighbor

There are also a number of improvements in the GUI. Tables generated by the user are now grouped under one heading for user data, and the internal tables are grouped under subsequent headings, so users don't have to sift through all the objects in the database to find what they need. The import options have improved – with shapefiles and dbfs you can now designate your own primary keys on import. You also have the option of importing Excel spreadsheets of the 97-2003 variety (.xls). In practice, if you want the import to go smoothly you have to designate data types in the Excel sheet (via Format Cells, including the number of decimal places) prior to importing.

spatialite_gui_17

I was hesitant to make the leap, because version 2.4 was the last version with pre-compiled binaries for all operating systems; after that, binaries are only provided for MS Windows, while Mac and Linux users have to compile from source – which is daunting for many Mac users, and something I'm ill-equipped to help with. But now that Spatialite is more fully integrated with QGIS (you can create databases with Q, and using the DB Manager you can export layers to an existing database) I can always steer folks there as an alternative. As for Linux, more distros are including updated versions of the GUI in their repositories, which makes installation simple.

One of the latest features in Spatialite 4.1.1 is the ability to import XML ISO metadata into the database, where it’s stored as an XML-type blob in a dedicated table. Now that I’m doing more work with metadata this is something I’ll explore for the future.

ZIPs to ZCTAs

The other big change was how the ZIP Code Business Patterns data is represented in the database. The ZBP data is reported for actual ZIP Codes taken from the addresses of the business establishments, while the boundaries in the nyc_gdb database are ZIP Code Tabulation Areas (ZCTAs) from the Census. Previously, the database only included ZBP records for ZIP Codes that had a matching ZCTA number. As a result, ZIP Codes that lacked a counterpart because they didn't cover any meaningful geographic area – ZIPs representing clusters of PO Boxes or large organizations that process a lot of mail – were omitted entirely.

In order to get a more accurate representation of business establishments in the City, I instituted a process to aggregate the ZIP Code data in the ZBP to the ZCTA level. I used the crosswalk table provided by the MCDC which assigns ZIPs to ZCTAs, so those PO Boxes and large institutions are assigned to the ZCTA where they are physically located. I used SQLite to import that crosswalk, imported the ZBP data, joined the two on the ZIP Code and did a group by on the ZCTA to sum employment, establishments, etc. For ZIPs that lacked data due to disclosure regulations, I added some note or flag columns that indicate how many businesses in a ZCTA are missing data. So now the data tables represent records for ZCTAs instead of ZIPs, and they can be joined to the ZCTA features and mapped.
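The join-and-aggregate step described above is easy to sketch in SQLite. The table layouts and figures here are illustrative, not the actual MCDC crosswalk or ZBP schemas:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Crosswalk: every ZIP (including PO Box ZIPs) is assigned to a ZCTA
cur.execute("CREATE TABLE crosswalk (zip TEXT PRIMARY KEY, zcta TEXT)")
cur.executemany("INSERT INTO crosswalk VALUES (?, ?)",
                [("10001", "10001"), ("10118", "10001"), ("10002", "10002")])

# ZBP-style data reported by ZIP Code (establishments, employment)
cur.execute("CREATE TABLE zbp (zip TEXT PRIMARY KEY, est INTEGER, emp INTEGER)")
cur.executemany("INSERT INTO zbp VALUES (?, ?, ?)",
                [("10001", 500, 9000), ("10118", 200, 4000), ("10002", 300, 5000)])

# Join on the ZIP Code and group by ZCTA to sum up to the tabulation area
cur.execute("""
    SELECT c.zcta, SUM(z.est) AS est, SUM(z.emp) AS emp
    FROM zbp z JOIN crosswalk c ON z.zip = c.zip
    GROUP BY c.zcta ORDER BY c.zcta
""")
results = cur.fetchall()
print(results)  # 10118 (an office-building ZIP) folds into ZCTA 10001
```

The real process also carries along the note/flag columns for ZIPs suppressed under disclosure rules, but the join and GROUP BY are the heart of it.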

The latest ZBP data in the new database is for 2012. I also went back and performed the same operation on the 2011 and 2010 ZBP data that was included in earlier databases, and have provided that data in CSV format for download in the archives page (in case anyone is using the old data and wants to go back and update it).

Some QGIS Odds and Ends

July 3rd, 2014

My colleague Joe Paccione recently finished a QGIS tutorial on working with raster data. My introductory tutorial for the GIS Practicum gives only cursory treatment to rasters, so this project was initially conceived to give people additional opportunities to learn about working with them. It focuses on elevation modeling and uses DEMs and DRGs to introduce tiling and warping, and creating hillshades and contour lines.

topo_contour

The tutorial was written using QGIS 2.0 and was tested with version 2.4, so it's readily usable with any 2.x version of QGIS. With the rapid progression of QGIS, my introductory tutorial for the workshop is becoming woefully outdated, having been written for version 1.8. It's going to take me quite a while to update (among other things, the image for every darn button has changed), but I plan to have a new version out sometime in the fall – probably not by the beginning of the semester. Since I have a fair amount of work to do anyway, I'm going to rethink all of the content and exercises. Meanwhile, Lex Berman at Harvard has updated his wonderfully clear and concise tutorial to Q version 2.x.

The workshops have been successful in turning people on to open source GIS on my campus, to the point where people are using it and coming back to teach me new things – especially when it comes to uncovering useful plugins:

  • I had a student who needed to geocode a bunch of addresses, but since many of them were international I couldn't turn to my usual geocoding service at Texas A & M. While I've used the MMQGIS plugin for quite a while (it has an abundance of useful tools), I never noticed that it includes a geocoding option that acts as a GUI for accessing both the Google and OpenStreetMap geocoding APIs. He discovered it, and it turned out quite well for him.
  • I was helping a prof who was working with a large point file of street signs, and we discovered a handy plugin called Points2One that allowed us to take those points and turn them into lines based on an attribute the points held in common. In this case every sign on a city block shared a common id that allowed us to create lines representing each side of the street on each block.
  • After doing some intersect and difference geoprocessing with shapefiles I was ending up with some dodgy results – orphaned nodes and lines that had duplicate attributes with good polygons. In a database, an easy trick to find these duplicates would be to run a select query where you group by ID and count: anything with a count of two or more is a duplicate – but this was a shapefile. Luckily there's a handy plugin called Group Stats that lets you create pivot tables, so I used it to do a summary count and identify the culprits. The plugin allowed me to select all the features that matched my criteria of having an ID count of two or more, so I could eyeball them in the map view and the attribute table. I calculated the geometry for all the features and sorted the selected ones, revealing that all the duplicates had infinitesimally small areas. Then it was a simple matter of select and delete.
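The group-by-and-count trick from that last bullet looks like this in SQLite – the table and values are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE features (fid INTEGER, block_id TEXT, area REAL)")
cur.executemany("INSERT INTO features VALUES (?, ?, ?)",
                [(1, "A", 1200.0),     # good polygon
                 (2, "A", 0.000001),   # orphaned sliver sharing A's attributes
                 (3, "B", 950.0)])

# Group by the shared id; groups with a count of two or more hold duplicates
cur.execute("""
    SELECT block_id, COUNT(*) AS n
    FROM features
    GROUP BY block_id
    HAVING COUNT(*) >= 2
""")
dupes = cur.fetchall()
print(dupes)
```

The Group Stats plugin effectively does this same summary count, just without leaving the shapefile.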
Introducing – Data!

    April 9th, 2014

    Professors invite me to their classes each semester to give students a crash course in finding data for neighborhoods in New York City, with a particular emphasis on Census data. I typically visit courses in journalism and public affairs, but this semester I added classes in management and – theater – to the list. Before I dive into what the Census is and what sources they should use, I preface the presentation with a discussion of what neighborhoods are and how we define them. This is important because neighborhoods are locally and informally defined, and when searching for datasets we often have to use a proxy, like census tracts, ZIP codes, PUMAs, or local legal or administrative areas, to approximate them.

    But before we get that far, I always begin the discussion with some basic questions to set the stage: what is data, and what can we use it for? For the journalism students, I explain that data can help support a story. If they're covering a town hall or community board meeting where affordable housing is the topic of discussion, they're going to want to provide some context and include some facts to support their story – what is the rent like in the neighborhood? How many people live there? Alternatively, data can provide the basis for a story. I point to one of numerous examples in NYC where journalists have taken a big lump of unrefined data – the NYPD's stop and frisk data, traffic fatality incidents, 311 complaints – and refined it to produce information that led them to an interesting story hidden in the numbers. Lastly, data is a story – whenever the Census releases a new dataset, someone writes to announce the release and tell us what's in there.

    This idea of refining leads us to our first basic definition – data can be considered raw, unrefined information. It doesn't tell us much in and of itself, but if we sift through and refine it we can turn it into information we can use to tell or support a story, or reveal some fact that was previously unknown. Data can be quantitative or qualitative – journalists, for example, may interview someone for two or three hours, but they're not going to turn around and publish the entire interview. They're going to write an article that summarizes it and gives us the most important bits, or edit it for a radio broadcast that covers the highlights. With quantitative data the issue is similar – I use a basic example of population data for the 50 states and show them this image of a comma-delimited text file:

    csv

    I explain that this is what data looks like in a raw state. It's in a basic format suitable for preservation or transit between systems, but it's not in a presentable state. There are a lot of codes that are meaningless to the average person, the data isn't sorted in a meaningful way, the column headings seem ambiguous, and the numbers aren't formatted for viewing. This isn't something they'd want to insert directly into a story or paper. But if they take this and do a little bit of work:

    table

    They can take that raw data and turn it into information. Here we've moved from raw data to a presentable table. The statistics are sorted in a logical order based on total population, columns are given comprehensible names, and unnecessary information (for presentation purposes) is dropped. We add commas to the numbers so they're more legible, and we create some value by adding a percent-of-total column. Now we have something we can use to communicate a message. But we can go further – we can take that same information and turn it into this:

    chart

    Now we have a chart. At this point I turn to the students and ask them what the benefit of using the chart is, followed by a discussion of trade-offs; we've gained something but lost something too. On the plus side, we can appeal to people's visual sensibilities, and we can see more clearly that California has twice as many people as New York. The chart is also more concise, taking up less real estate on the page or screen. But we've exchanged precision for conciseness; we can no longer tell what the exact population numbers are from the chart; we can only approximate. But we can go further still:

    map

    We can take that same dataset and turn it into a map. Once again, we discuss the pluses and minuses. Now we can key into to people’s geographic knowledge as well as their visual senses; Ohio may be more meaningful now that we can see it on a map, rather than just seeing a number in a table. We can also see geographic patterns of clustering or diffusion, which the table or chart couldn’t show us. But with the map we’ve lost even more precision. Now we can only see that a state’s population number falls within a given range; we can’t see the precise number and can’t approximate it like we could with the chart.

    At this point, one student will point out that if the chart or map is on the web, we can have the best of all worlds. If the graphic is interactive we can hover over it and see the exact population number. This leads to a discussion of the trade-offs between interactive, web-based information and static information. The interactive chart or map lets us keep precision and conciseness, but the sacrifice is complexity, portability, and preservation. It's more complex to create, and it can only exist in its native environment, within a specific bundle of technology that includes programming and scripting languages, software libraries, browsers, and operating systems. Such things go obsolete quickly and can easily break, so the shiny chart, map, or app you have today is non-functional in a year or two, and difficult to preserve. Contrast that with a static image or text, which is simple, easy to move around, depends on little else, and can make the jump from the screen to the printed page.
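The raw-data-to-presentable-table step from earlier is easy to demonstrate in code. A minimal Python sketch, using the 2010 census counts for three states (a full version would of course use all 50), that sorts by population, adds a percent-of-total column, and formats the numbers with thousands separators:

```python
# 2010 census population counts for three states (FIPS code, name, population)
raw = [("06", "California", 37253956),
       ("36", "New York", 19378102),
       ("39", "Ohio", 11536504)]

# Percent of the three-state total here; the real table would use all states
total = sum(pop for _, _, pop in raw)

# Sort descending by population and present with readable formatting
table = sorted(raw, key=lambda r: r[2], reverse=True)

print(f"{'State':<12}{'Population':>14}{'% of Total':>12}")
for _, name, pop in table:
    print(f"{name:<12}{pop:>14,}{pop / total * 100:>11.1f}%")
```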

    We sum up this little talk with the basis of what they’re trying to achieve – I use the DIK pyramid, which I was introduced to in library school (OK – this pic is the DIKW pyramid, with wisdom thrown on top – it’s public domain so I can safely use it):

    DIKW-diagram

    As journalists or researchers, you're taking data and refining it into information to support your work or communicate a message. You take those pieces of new information and bring them together to tell a bigger story and paint a bigger picture, which we hope will lead to greater knowledge (which, unlike data and information, is something that can only be learned, not simply assembled and communicated). The weather is a good example – a giant log of temperature and precipitation data isn't going to do me much good. But if you process that data to calculate the high, low, and mean, now you have information I can use. Take that information and combine it with a radar picture and a forecast, and now I have a rich information object. I can take that object and piece it together with other information – another forecast I hear on the radio, what I see out the window, my previous experiences of getting wet, my wife's advice – to formulate a decision I can act on. By considering all of this information – my experiences, contextualized information, and know-how – and weighing it to reach a conclusion, I am using my knowledge. In this case I'll use it to carry an umbrella.

    The final point is that, in their papers, the students must take the information objects that they’ve created or acquired and integrate them into their work. Many students will just copy and download a table and stick it at the back of the paper, and assume that it speaks for itself. I tell them – it doesn’t! You have to explain why it’s there; make reference to it in the paper and weave it into your research.

    Overall this presentation / discussion takes all of about ten minutes, and then we move into the discussion of neighborhoods, the census, and specific datasets. I've contemplated skipping it altogether, but ultimately decided that it's necessary. I think it's essential to provide some context and theory coupled with the actual sources and the pragmatic business of finding the data. There are some librarians who are completely averse to teaching "tools" and will speak only in the abstract, while others cut directly to listing the sources and leave it at that. The first approach is useless because the students won't learn what to actually do; the second approach makes assumptions about what they know and fails to prepare them for what they'll face. There also seems to be a clear need for this – many faculty have commented that students simply tack data tables they've copied off the web onto the back of papers without any explanation. When I first presented the slide that depicts the csv file, I was shocked by the looks of shock on many students' faces – as if they'd never seen or heard of this before and were worried they'd have to wrestle with it. Here's the data-driven world, step 1.

    Loading Data Into PostGIS – Shapefiles

    March 4th, 2014

    This example picks up where Loading Data Into PostGIS – Text Files left off; we'll use the same schema, usa_general. We'll load two shapefiles from the Census Bureau's 2010 Cartographic Boundary files at https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html.

    Loading shapefiles can be done via the pgAdmin III GUI (this example uses version 1.18 for MS Windows) from a client machine using shp2pgsql-gui, an extra plugin that must be downloaded and installed. A command line version of the tool (shp2pgsql) is automatically bundled with PostGIS, but can't be used from a client unless you either log into the server via a terminal or install PostGIS locally.

    Download Plugin

    The shp2pgsql-gui can be downloaded for MS Windows from the extras folder on the PostGIS website: http://download.osgeo.org/postgis/windows/. Version 2.01 for 32-bit environments was downloaded and unzipped. Per the instructions in the README file, the postgisgui folder was dragged into the PG BIN path designated in pgAdmin under File > Options > Binary Paths. Once pgAdmin was restarted, the GUI was available under the Plugins menu.

    Load Shapefiles

    Set Options

    Under Plugins > PostGIS Shapefile and DBF Loader, check the Options menu. There are often problems with UTF-8 encoding in DBF files when working in an MS Windows environment, so change the encoding from UTF-8 to WINDOWS-1252 (another possibility would be LATIN-1). Keep the boxes for creating a spatial index and using COPY instead of INSERT checked.

    import_options

    Load

    Browse and select both shapefiles. Once they appear in the import list, the following elements must be changed: switch the schema from public to the desired schema (usa_general), change each table name to something meaningful, and specify the SRID number for each (since these files are from the US Census they're in NAD 83, which is EPSG 4269). Once you hit Load, the log window will display the results of each import.

    shape_loader

    Refresh the database and both shapefiles will appear as tables.

    Check Geometries

    Because of the size of some geometries, they may appear blank if you preview the tables in pgAdmin – this is a well known and documented issue. The simplest way to verify the load is to check that none of the geometry columns are null:

    SELECT COUNT(*)
    FROM usa_general.states
    WHERE geom IS NULL;

    "0"

    Post-Load Operations

    Primary Keys

    By default, PostGIS will automatically create a field called gid using sequential integers, and will designate this field as the primary key. If the data already has a natural key that you want to use, you would have to drop the key constraint on gid and add the constraint to the desired field. Since features from the US Census have unique ANSI / FIPS codes, we’ll designate that as the key.

    ALTER TABLE usa_general.states
    DROP CONSTRAINT states_pkey;

    Query returned successfully with no result in 125 ms.

    ALTER TABLE usa_general.states
    ADD CONSTRAINT states_pkey PRIMARY KEY ("state");

    Query returned successfully with no result in 125 ms.

    Constraints

    An alternative to changing the primary key is to add UNIQUE and NOT NULL constraints to additional columns; gid remains the primary key, but the other columns are automatically indexed when the constraints are added.

    ALTER TABLE usa_general.metros ADD CONSTRAINT cbsa_unq UNIQUE (cbsa);
    ALTER TABLE usa_general.metros ALTER COLUMN cbsa SET NOT NULL;

    Delete Records

    For sake of example, we’ll delete all the records for micropolitan areas in the metros table, so we’re left with just metropolitan areas.

    DELETE FROM usa_general.metros
    WHERE lsad='Micro';

    Query returned successfully: 581 rows affected, 219 ms execution time.

    Spatial SQL

    As a test, we can run a query to select all of the populated places within the NYC metropolitan area that are not in Pennsylvania and have an elevation greater than or equal to 500 feet.

    SELECT feature_name, state_alpha, county_name, elev_in_ft
    FROM usa_general.pop_places, usa_general.metros
    WHERE metros.cbsa='35620'
    AND ST_WITHIN(pop_places.geom, metros.geom)
    AND elev_in_ft >= 500
    AND state_alpha !=’PA’
    ORDER BY elev_in_ft DESC

    "Kampe";"NJ";"Sussex";1293
    "Highland Lakes";"NJ";"Sussex";1283
    "Edison";"NJ";"Sussex";1234 …

    Connect to QGIS

    We can look at the geographic features by establishing a connection between the database and QGIS. In QGIS we can either hit the add PostGIS Layer button or use the browser to connect to our database; we need to add the connection name, host, port, and database name. The username and password can be left blank and Q will prompt for them (if you don't want them saved on the system). Once the connection is established you can add the layers to the view and change the symbolization.

    qgis_postgis

    Loading Data Into PostGIS – Text Files

    February 25th, 2014

    I've fallen out of the blogosphere for quite a while. Time to catch up – lately we've been experimenting with deploying our own instance of Open Geoportal (see previous post), and square one is getting a functioning data repository up and running. I've been tinkering with PostgreSQL / PostGIS and documenting tests in a wiki we've created. The wiki is private, so I thought I'd re-post some of the tests here.

    I’m working with a developer who’s installed and configured the database on a server, and I’m accessing it remotely using pgAdmin. I’ve started loading data and running queries and so far I’m relieved that most of what I’m seeing seems familiar, having worked with SQLite / Spatialite for a few years now. I’m using the excellent book PostGIS in Action as my guide. Here’s my take on loading a text file with coordinates into the db and building geometry for it. Loading shapefiles, spatial SQL experiments, and connecting to Geoserver will follow.

    Verify Version

    SELECT postgis_full_version();

    "POSTGIS="2.1.1 r12113" GEOS="3.4.2-CAPI-1.8.2 r3921" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.6" LIBJSON="UNKNOWN" TOPOLOGY RASTER"

    Create Schema for Holding Related Objects

    CREATE SCHEMA usa_general;

    Query returned successfully with no result in 297 ms.

    Import Delimited Text File

    The test file is the US Populated Places gazetteer file from the USGS Geographic Names Information System (GNIS). It is a pipe-delimited text file encoded in UTF-8, with longitude and latitude coordinates in both DMS and DEC format. http://geonames.usgs.gov/domestic/download_data.htm
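Before designing the table, it helps to peek at the file's structure. The pipe-delimited format reads naturally with Python's csv module – here's a sketch with an abridged, two-line sample (the real file has many more columns):

```python
import csv
import io

# Abridged stand-in for the GNIS Populated Places file; the actual file
# has additional columns (feature class, county codes, elevations, etc.)
SAMPLE = ("FEATURE_ID|FEATURE_NAME|STATE_ALPHA|PRIM_LAT_DEC|PRIM_LONG_DEC\n"
          "970571|Battery Park City|NY|40.7115|-74.0163\n")

# DictReader keys each value by the header row; note delimiter='|'
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="|")
rows = list(reader)
print(rows[0]["FEATURE_NAME"], rows[0]["PRIM_LAT_DEC"])
```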

    Create Table for Holding Data

    CREATE TABLE usa_general.pop_places
    (
    feature_id int NOT NULL PRIMARY KEY,
    feature_name varchar,
    feature_class varchar,
    state_alpha char(2),
    state_numeric char(2),
    county_name varchar,
    county_numeric char(3),
    primary_lat_dms varchar,
    prim_long_dms varchar,
    prim_lat_dec float4,
    prim_long_dec float4,
    source_lat_dms varchar,
    source_long_dms varchar,
    source_lat_dec float4,
    source_long_dec float4,
    elev_in_m int,
    elev_in_ft int,
    map_name varchar,
    date_created date,
    date_edited date
    );

    Query returned successfully with no result in 390 ms.

    Copy Data From File

    This must be run from the PSQL Console plugin in order to load data from a local client machine. If the data were stored on the server, you could use the pgAdmin GUI and COPY in an SQL statement instead of \copy. When running the PSQL Console on MS Windows the default character encoding is WIN1252. In this example the data contains characters unsupported by that encoding; the file is encoded in UTF-8 and the console must be set to match. COPY is not a standard SQL command but is native to PostgreSQL.

    -- The optional HEADER specifies that the data file has a header row that should be skipped when importing

    \encoding UTF8
    \copy usa_general.pop_places
    FROM 'C:\Workspace\postgis_testing\POP_PLACES_20140204.txt'
    DELIMITER '|' CSV HEADER

    Basic Query to Test that the Load was a Success

    SELECT *
    FROM usa_general.pop_places
    WHERE state_alpha='NY' AND county_name='New York'
    ORDER BY feature_name

    Create Geometry from Coordinates and Make Spatial Index

    4269 is the EPSG code for NAD 83, the basic geographic coordinate system used by US government agencies.

    SELECT AddGeometryColumn('usa_general','pop_places','geom',4269,'POINT',2)

    "usa_general.pop_places.geom SRID:4269 TYPE:POINT DIMS:2"

    UPDATE usa_general.pop_places
    SET geom = ST_GeomFromText('POINT(' || prim_long_dec || ' ' || prim_lat_dec || ')',4269);

    Query returned successfully: 200359 rows affected, 18268 ms execution time.
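The UPDATE above builds WKT strings by concatenating coordinates; the same construction in Python makes the format easy to see. The coordinates here are illustrative:

```python
def point_wkt(lon, lat):
    """Build an OGC WKT point: longitude (X) first, latitude (Y) second."""
    return f"POINT({lon} {lat})"

# Note the X/Y order - a common stumbling block when building point geometry
wkt = point_wkt(-74.0163, 40.7115)
print(wkt)
```

Getting the longitude/latitude order backwards is one of the easiest mistakes to make at this step, and the database won't complain; the points just land in the wrong hemisphere.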

    CREATE INDEX idx_pop_places_geom
    ON usa_general.pop_places USING gist(geom);

    Query returned successfully with no result in 8439 ms.

    Basic SQL Query to Test Geometry

    ST_AsEWKT transforms the output of geometry from binary code into the OGC Well Known Text format (WKT).

    SELECT feature_name, ST_AsEWKT(geom) AS geometry
    FROM usa_general.pop_places
    WHERE state_alpha='NY' AND county_name='New York'
    ORDER BY feature_name

    "Amalgamated Dwellings";"SRID=4269;POINT(-73.9828 40.715)"
    "Amsterdam Houses";"SRID=4269;POINT(-73.9881 40.7731)"
    "Battery Park City";"SRID=4269;POINT(-74.0163 40.7115)" …

    Basic Spatial SQL Query to Test Geometry

    Selects the point with the maximum X (longitude) value in the table – the easternmost point by raw coordinate.

    SELECT feature_name, state_alpha, ST_AsEWKT(geom) AS geometry
    FROM usa_general.pop_places
    WHERE ST_Xmax(geom) IN(
    SELECT Max(ST_Xmax(geom))
    FROM usa_general.pop_places)

    "Amchitka";"AK";"SRID=4269;POINT(178.878 51.5672)"

    Drop Columns

    Drop some columns that have no data.

    ALTER TABLE usa_general.pop_places
    DROP COLUMN source_lat_dms,
    DROP COLUMN source_long_dms,
    DROP COLUMN source_lat_dec,
    DROP COLUMN source_long_dec;

    Query returned successfully with no result in 180 ms.

    pgadmin3

    Notes from the Open Geoportal National Summit

    October 30th, 2013

    This past weekend I had the privilege of attending the Open Geoportal (OGP) National Summit in Boston, hosted by Tufts University and funded by the Sloan Foundation. The Open Geoportal is a map-based search engine that allows users to discover and retrieve geospatial data from many repositories. The OGP serves as the front end of a three-tiered system that includes a spatial database (like PostGIS) at the back and some middleware (like OpenLayers) to communicate between the two.

    Users navigate via a web map (Google by default but you can choose other options), and as they change the extent by panning or zooming a list of available spatial layers appears in a table of contents beside the map. Hovering over a layer in the contents reveals a bounding box that indicates its spatial extent. Several algorithms determine the ranking order of the results based on the spatial intersection of bounding boxes with the current map view. For instance, layers that are completely contained in the map view have priority over those that aren’t, and layers that have their geographic center in the view are also pushed higher in the results. Non-spatial search filters for date, data type, institution, and keywords help narrow down a search. Of course, the quality of the results is completely dependent on the underlying metadata for the layers, which is stored in the various repositories.
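The bounding-box ranking idea can be sketched in a few lines. This is my own simplified illustration, not OGP's actual scoring code – the scores and layer boxes are invented:

```python
# Boxes are (minx, miny, maxx, maxy) in lon/lat
def contains(view, box):
    """Is the layer's bounding box wholly inside the map view?"""
    return (view[0] <= box[0] and view[1] <= box[1]
            and box[2] <= view[2] and box[3] <= view[3])

def center_in(view, box):
    """Does the layer's geographic center fall inside the map view?"""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return view[0] <= cx <= view[2] and view[1] <= cy <= view[3]

def score(view, box):
    # Invented weights: full containment outranks a center hit
    s = 0
    if contains(view, box):
        s += 2
    if center_in(view, box):
        s += 1
    return s

view = (-74.3, 40.4, -73.6, 41.0)  # current map extent (roughly NYC)
layers = {
    "nyc_tracts": (-74.26, 40.49, -73.70, 40.92),  # fully within the view
    "nj_counties": (-75.0, 40.0, -73.5, 41.2),     # spills past the view
    "us_states": (-125.0, 24.0, -66.9, 49.4),      # center far outside it
}
ranked = sorted(layers, key=lambda k: score(view, layers[k]), reverse=True)
print(ranked)
```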

    opengeoport_tufts

    The project was pioneered by Tufts, Harvard, and MIT, and now about a dozen other large research universities are actively working with it, while others are starting to experiment. The purpose of the summit was to begin creating a cohesive community to manage and govern the project, and to increase and outline the possibilities for collaborating across institutions. At the back end, librarians and metadata experts are loading layers and metadata into their repositories; metadata creation is an exacting and time-consuming process, but the OGP will allow institutions to share their metadata records in the hope of avoiding duplicated effort. The OGP also allows for the export of detailed spatial metadata from FGDC and ISO to MODS and MARC, so that records for the spatial layers can be exported to other content management systems and library catalogs.

    The summit gave metadata experts the opportunity to discuss best practices for metadata creation and maintenance, in the hope of providing a consistent pool of records that can be shared; it also gave software developers the chance to lay out their road map for how they’ll function as an open source project (the OGP community could look to the GeoNetwork opensource project, a forerunner in spatial metadata and search that’s used in Europe and by many international organizations). A series of five-minute talks, called Ignite sessions, let librarians and developers share the work they were doing at their institutions, either with OGP in particular or with metadata and spatial search in general, which sparked further discussion.

    The outcomes of all the governance, resource sharing, and best practices discussions are available on a series of pages dedicated to the summit on the opengeoportal.org project website. You can also experiment with the OGP via http://geodata.tufts.edu/, Tufts’ gateway to their repository. As you search for data you can identify which repository a layer comes from (Tufts, Harvard, or MIT) by the little icon that appears beside its name. Public datasets (like US census layers) can be downloaded by anyone, while copyrighted sets that the schools purchased for their users require authentication.

    OGP is a simple yet elegant open source project that operates under OGC standards and is awesome for spatial search, but the real gem here is the community of people forming around it. I was blown away by the level of expertise, dedication, and overall professionalism that the librarians, information specialists, and software developers exuded, in the discussions and particularly in the examples of the work they were doing at their institutions. Beyond just creating software, this project is poised to enhance the quality and compatibility of spatial metadata to keep our growing pile of geospatial stuff findable.

    Government Shutdown: Alternate Resources for US Census Data

    October 13th, 2013

    As the US government shutdown continues (thanks to a handful of ideological nutcases in congress) those of us who work with and rely on government data are re-learning the lesson of why it’s important to keep copies of things. This includes having alternate sources of information floating around on the web and in the cloud, as well as the tried and true approach of downloading and saving datasets locally. There have been a number of good posts (like this succinct one) to point users to alternatives to the federal sources that many of us rely on. I’ll go into more detail here with my suggestions on where to access US Census data, based on user-level and need.

    • The Social Explorer: this web-mapping resource for depicting and accessing US Census data from 1790 to present (including the 2010 Census and the latest American Community Survey data) is intuitive and user-friendly. Many academic and public libraries subscribe to the premium edition that provides full access to all features and datasets (so check with yours to see if you have access), while a basic free version is available online. Given the current circumstances the Social Explorer team has announced that it will open the hatch and provide free access to users who request it.
    • The NHGIS (National Historical GIS): this project is managed by the Minnesota Population Center and also provides access to all US Census data from 1790 to present. While it’s a little more complex than the Social Explorer, the NHGIS is the better option for downloading lots of data en masse, and is the go-to place if you need access to all datasets in their entirety, including all the detail from the American Community Survey (as the Social Explorer does not include margins of error for any of the ACS estimates) or if you need access to other datasets like the County Business Patterns. Lastly – it is the alternative to the TIGER site for GIS users who need shapefiles of census geography. You have to register to use NHGIS, but it’s free. For users who need microdata (decennial census, ACS, Current Population Survey), you can visit a related MPC project: IPUMS.
    • The Missouri Census Data Center (MCDC): I’ve mentioned a number of their tools in the past; they provide easy-to-access profiles from the 2010 Census and American Community Survey, as well as historical trend reports for the ACS. For intermediate users they provide extract applications for the 2010 Census and ACS for creating spreadsheets and SAS files for download, and for advanced users the Dexter tool for downloading data en masse from 1980 to present. Unlike the other resources, no registration or sign-up is required. I also recommend the MCDC’s ACS and 2010 Census profiles to web designers and web mappers; if you’ve created online resources that tapped directly into the American Factfinder via deep links (like I did), you can use the MCDC’s profiles as an alternative. The links to their profiles are persistent and use a logical syntax (as it looks like there’s no end in sight to this shutdown, I may make the change-over this week). Lastly, the MCDC is a great resource for technical documentation about geography and datasets.
    • State and local government: thankfully many state and local governments have taken subsets of census data of interest to people in their areas and have recompiled and republished it on the web. These past few weeks I’ve been constantly sending students to the NYC Department of City Planning’s population resources. Take a look at your state data center’s resources, as well as local county or city planning departments, transportation agencies, or economic development offices to see what they provide.

    Article on Working With the American Community Survey

    June 17th, 2013

    I’ve got another article that’s just hit the presses. In this one I discuss the American Community Survey: how it differs from the Decennial Census, when you should use it versus other summary data sets, how to work with the different period estimates, and how to create derived estimates and calculate their margins of error. For that last piece I’ve essentially done an extended version of this old post on Excel formulas, with several different and updated examples.
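For readers who want the gist of those calculations, the Census Bureau’s standard approximation formulas for derived ACS estimates translate easily from Excel into a few lines of Python. This sketch is mine, not from the article, and the figures in the example are made up:

```python
import math

# Census Bureau approximation formulas for derived ACS estimates.

def moe_sum(*moes):
    """MOE of a sum or difference of estimates: root of the sum of squares."""
    return math.sqrt(sum(m ** 2 for m in moes))

def moe_proportion(num, moe_num, den, moe_den):
    """MOE of a proportion p = num/den, where num is a subset of den."""
    p = num / den
    radicand = moe_num ** 2 - (p ** 2) * (moe_den ** 2)
    if radicand < 0:  # Census guidance: fall back to the ratio formula
        radicand = moe_num ** 2 + (p ** 2) * (moe_den ** 2)
    return math.sqrt(radicand) / den

# Example: combine two tract-level estimates, then find the MOE of the
# first tract's share of the combined total (hypothetical figures).
total = 1500 + 2200
total_moe = moe_sum(210, 180)
share_moe = moe_proportion(1500, 210, total, total_moe)
```

The same root-sum-of-squares pattern covers sums and differences; proportions and ratios each get their own formula, as the article describes in detail.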

    The article is available via Emerald’s journal database. If you don’t have access to it from your library feel free to contact me and I’ll send you a copy (can’t share this one freely online).

    Title: The American Community Survey: practical considerations for researchers
    Author(s): Francis P. Donnelly
    Citation: Francis P. Donnelly, (2013) “The American Community Survey: practical considerations for researchers”, Reference Services Review, Vol. 41 Iss: 2, pp.280 – 297
    Keywords: American Community Survey, Census, Census geography, Data handling, Decennial census, Demographic data, Government data processing, Government information, Margins of error, Sample-based data, United States of America, US Census Bureau
    Article type: Technical paper
    DOI: 10.1108/00907321311326228 (Permanent URL)
    Publisher: Emerald Group Publishing Limited

    Downloading Data for Small Census Geographies in Bulk

    May 7th, 2013

    I needed to download block group level census data for a project I’m working on; there was one particular 2010 Census table that I needed for every block group in the US. I knew that the American Factfinder was out – you can only download block group data county by county (which would mean over 3,000 downloads if you want them all). I thought I’d share the alternatives I looked at; as I searched around the web I found many others who were looking for the same thing (i.e. data for the smallest census geographies covering a large area).

    The Census FTP site at http://www2.census.gov/

    This would be the first logical step, but in the end it wasn’t optimal for my need. When you drill down through Census 2010, Summary File 1, you see a file for every state and a national file. Initially I thought – great! I’ll just grab the national file. But the national file does NOT contain the small census statistical areas – no tracts, block groups, or blocks. If you want those small areas you have to download the files for each of the states – 51 downloads. When you download the data you can also download an MS Access database, which is an empty shell with the geography and field headers, and you can import each of the text file data tables (there are a lot of them for 2010 SF1) into the db and match them to the headers during import (the instructions that were included for doing this were pretty good). This is great if you need every variable in every table for every geography, but I was only interested in one table for one geography. I could just import the one text file with my table, but then I’d have to do this import process 51 times. The alternative is to use some Python to get that one text file for every state into one big file and then do the import once, but I opted for a different route.
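If you do want to try that route, the concatenation step is only a few lines of Python. This is a sketch, and the glob pattern in the comment is a guess – match it to the actual SF1 segment file names in your downloads:

```python
import glob

def concat_files(pattern, out_path):
    """Append every file matching pattern, in sorted order, to out_path."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    out.write(line)

# e.g., combine one SF1 segment from every state's download directory into
# a single file for a one-time Access import (hypothetical pattern):
# concat_files("downloads/*/??000062010.sf1", "sf1_segment_all_states.txt")
```

Since the SF1 segment files have no header rows, a straight concatenation preserves the record layout and the Access import only has to be run once.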

    The NHGIS at https://www.nhgis.org/

    I always recommend this resource to anyone who’s looking for historical census data or boundary files, but it’s also good if you want current data for these small areas. I was able to use their query window to whittle down the selection by dataset (2010 SF1), geography (block groups), and topic (Hispanic origin and race in my case), then I was able to choose the table I needed. On the last screen before download I was able to check a box to include all 50 states plus DC and PR in one file. I had to wait a couple minutes for the request to process, then downloaded the file as a CSV and loaded it into my database. This was the best solution for my circumstances by far – one table for all block groups in the country. If you had to download a lot (or all) of the tables or variables for every block group or block it may take quite a while, and plugging through all of those menus to select everything would be tedious – if that’s your situation it may be easier to grab everything using the Census FTP.
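For the final load-into-a-database step, here is a minimal sketch using Python’s built-in sqlite3 module, which takes the column names from the CSV’s own header row. The sample names in the test are placeholders; real NHGIS extracts carry table-specific variable codes, explained in the codebook bundled with each download:

```python
import csv
import sqlite3

# Load a CSV extract into a SQLite table, using the file's header row
# for the column names (all columns are stored untyped, i.e. as text).
def load_csv(csv_path, db_path, table):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        con = sqlite3.connect(db_path)
        con.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
        con.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)
        con.commit()
        con.close()
```

Quoting each column name sidesteps any NHGIS codes that collide with SQL keywords; if you need typed columns for analysis, cast them when querying or define the table schema explicitly first.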

    [Screenshot: the NHGIS data finder]

    UExplore / Dexter at http://mcdc.missouri.edu/applications/uexplore.shtml

    The Missouri Census Data Center’s UExplore / Dexter tool lets you choose a dataset and takes you to a window that resembles a file system, with a ton of files in it. The MCDC takes their extracts directly from the Census, so they’re structured in a similar way to the FTP site as state-based files. They begin with the state prefix and have a name that indicates geography – there are files for block groups, blocks, and one for everything else. There are national files (which don’t contain small census areas) that begin with ‘us’. The difference here is – when you click on a file, it launches a query window that lets you customize the extract. The interface may look daunting at first, but it’s worth exploring (and there’s a tutorial to help guide you). You can choose from several output formats, specific variables or tables (if you don’t want them all), and there are a bunch of handy options that you can specify like aggregation or percent totals. In addition to the complete datasets, they’ve also created ‘Standard Extracts’ that have the most common variables, if you want just a core subset. While the NHGIS was the best choice for my specific need, the customization abilities in Dexter may fit yours – and the state-level block group and block data is conveniently broken out from the other files.

    Lastly…

    There are a few other tools – I’ll give an honorable mention to the Summary File Retrieval tool, which is an Excel plugin that lets you tap directly into the American Community Survey from a spreadsheet. So if you wanted tracts or block groups for a wide area but only a small number of variables (I think 20 is the limit), that could be a winner, provided you’re using Excel 2007 or later and are just looking at the ACS. No dice in my case, as I needed Decennial Census data and use OpenOffice at home.


    Copyright © 2014 Gothos. All Rights Reserved.