Posts Tagged ‘open source’

Article on Processing Government Data With Python

Thursday, August 28th, 2014

Last month I had an article published in the code{4}lib journal, about a case study using Python to process IRS data on tax-exempt organizations (non-profits). It includes a working Python script that can be used by anyone who wishes to make a place-based extract of that dataset for their geographic area of interest. The script utilizes the ZIP to ZCTA masterfile that I’ve mentioned in a previous post, and I include a discussion on wrestling with ZIP Code data. Both the script and the database are included in the download files at the bottom of the article.

I also provide a brief explanation of using OpenRefine to clean data using their text facet tools. One thing I forgot to mention in the article is that after you apply your data fixes with OpenRefine, it records the history. So if you have to process an update of the same file in the future (which I’ll have to do repeatedly), you can simply re-apply all the fixes you made in the past (which are saved in a JSON file).

While the article is pragmatic in nature, I did make an attempt to link this example to the bigger picture of data librarianship, advocating that data librarians can work to add value to datasets for their users, rather than simply pointing them to unrefined resources that many won’t be able to use.

The citation and link:

Donnelly, F. P. (2014). Processing government data: ZIP Codes, Python, and OpenRefine. code{4}lib Journal, 25 (2014-07-21).

As always the journal has a great mix of case studies, and this issue included an article on geospatial metadata.

While I’ve used Python quite a bit, this is the first time that I’ve written anything serious that I’ve released publicly. If there are ways I could improve it, I’d appreciate your feedback. Other than a three-day workshop I took years ago, I’m entirely self-taught and seldom have the opportunity to bounce ideas off people for this type of work. I disabled the blog comments here a long time ago, but feel free to send me an email. If there’s enough interest I’ll do a follow-up post with the suggestions – mail AT gothos DOT info.

Some QGIS Odds and Ends

Thursday, July 3rd, 2014

My colleague Joe Paccione recently finished a QGIS tutorial on working with raster data. My introductory tutorial for the GIS Practicum gives only cursory treatment to rasters, so this project was initially conceived to give people additional opportunities to learn about working with them. It focuses on elevation modeling and uses DEMs and DRGs to introduce tiling and warping, and creating hillshades and contour lines.


The tutorial was written using QGIS 2.0 and was tested with version 2.4; thus it’s readily usable with any 2.x version of QGIS. With the rapid progression of QGIS my introductory tutorial for the workshop is becoming woefully outdated, having been written in version 1.8. It’s going to take me quite a while to update (among other things, the image for every darn button has changed); I plan to have a new version out sometime in the fall, though probably not at the beginning of the semester. Since I have a fair amount of work to do anyway, I’m going to rethink all of the content and exercises. Meanwhile, Lex Berman at Harvard has updated his wonderfully clear and concise tutorial to QGIS version 2.x.

The workshops have been successful for turning people on to open source GIS on my campus, to the point where people are using it and coming back to teach me new things – especially when it comes to uncovering useful plugins:

  • I had a student who needed to geocode a bunch of addresses, but since many of them were international I couldn’t turn to my usual geocoding service at Texas A&M. While I’ve used the MMQGIS plugin for quite a while (it has an abundance of useful tools), I NEVER NOTICED that it includes a geocoding option that acts as a GUI for accessing both the Google and OpenStreetMap geocoding APIs. He discovered it, and it turned out quite well for him.
  • I was helping a prof who was working with a large point file of street signs, and we discovered a handy plugin called Points2One that allowed us to take those points and turn them into lines based on an attribute the points held in common. In this case every sign on a city block shared a common id that allowed us to create lines representing each side of the street on each block.
  • After doing some intersect and difference geoprocessing with shapefiles I was ending up with some dodgy results – orphaned nodes and lines that had duplicate attributes with good polygons. If I were working in a database, an easy trick to find these duplicates would be to run a select query where you group by ID and count them, and anything with a count greater than one is a duplicate – but this was a shapefile. Luckily there’s a handy plugin called Group Stats that lets you create pivot tables, so I used that to do a summary count and identified the culprits. The plugin allowed me to select all the features that matched my criteria of having an id count of 2 or more, so I could eyeball them in the map view and the attribute table. I calculated the geometry for all the features and sorted the selected ones, revealing that all the duplicates had infinitesimally small areas. Then it was a simple matter of select and delete.
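The database trick mentioned in the last item (group by ID, count, and flag anything that occurs more than once) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `parcels`, the columns `feat_id` and `area`, and the sample rows are all made up for the example.

```python
import sqlite3

# Hypothetical table, columns, and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (feat_id TEXT, area REAL)")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?)",
    [("A1", 1250.0), ("A2", 980.5), ("A2", 0.0004), ("A3", 2210.7), ("A3", 0.0001)],
)

# Group by ID and keep only the IDs that occur more than once.
duplicate_ids = conn.execute(
    """
    SELECT feat_id, COUNT(*) AS n
    FROM parcels
    GROUP BY feat_id
    HAVING COUNT(*) > 1
    ORDER BY feat_id
    """
).fetchall()

# For each duplicated ID, the sliver is the copy with the smallest area.
slivers = conn.execute(
    """
    SELECT feat_id, MIN(area)
    FROM parcels
    WHERE feat_id IN (
        SELECT feat_id FROM parcels GROUP BY feat_id HAVING COUNT(*) > 1
    )
    GROUP BY feat_id
    ORDER BY feat_id
    """
).fetchall()

print(duplicate_ids)
print(slivers)
```

The second query mirrors the manual step of sorting the selected features by area: in this toy data the unwanted copy of each duplicated ID is the one with a near-zero area.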

    Notes from the Open Geoportal National Summit

    Wednesday, October 30th, 2013

    This past weekend I had the privilege of attending the Open Geoportal (OGP) National Summit in Boston, hosted by Tufts University and funded by the Sloan Foundation. The OGP is a map-based search engine that allows users to discover and retrieve geospatial data from many repositories. The OGP serves as the front-end of a three-tiered system that includes a spatial database (like PostGIS) at the back and some middleware (like OpenLayers) to communicate between the two.

    Users navigate via a web map (Google by default but you can choose other options), and as they change the extent by panning or zooming a list of available spatial layers appears in a table of contents beside the map. Hovering over a layer in the contents reveals a bounding box that indicates its spatial extent. Several algorithms determine the ranking order of the results based on the spatial intersection of bounding boxes with the current map view. For instance, layers that are completely contained in the map view have priority over those that aren’t, and layers that have their geographic center in the view are also pushed higher in the results. Non-spatial search filters for date, data type, institution, and keywords help narrow down a search. Of course, the quality of the results is completely dependent on the underlying metadata for the layers, which is stored in the various repositories.
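To make the ranking idea concrete, here is a toy sketch of scoring layers by how their bounding boxes relate to the current map view. It is not OGP's actual algorithm: the `BBox` type, the function names, the weights, and the sample extents are all assumptions made for illustration, based only on the two rules described above (full containment and center-in-view both raise a layer's rank).

```python
from typing import NamedTuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch."""
    return (a.west <= b.east and b.west <= a.east
            and a.south <= b.north and b.south <= a.north)


def contains(outer: BBox, inner: BBox) -> bool:
    """True if inner lies completely inside outer."""
    return (outer.west <= inner.west and inner.east <= outer.east
            and outer.south <= inner.south and inner.north <= outer.north)


def center_in(view: BBox, layer: BBox) -> bool:
    """True if the layer's geographic center falls inside the view."""
    cx = (layer.west + layer.east) / 2
    cy = (layer.south + layer.north) / 2
    return view.west <= cx <= view.east and view.south <= cy <= view.north


def score(view: BBox, layer: BBox) -> int:
    """Illustrative weights only (an assumption, not OGP's values)."""
    if not intersects(view, layer):
        return 0
    points = 1                      # the layer overlaps the view at all
    if center_in(view, layer):
        points += 1                 # its center is visible
    if contains(view, layer):
        points += 2                 # it fits entirely in the view
    return points


def rank(view: BBox, layers: dict) -> list:
    """Layer names ordered from most to least relevant for this view."""
    return sorted(layers, key=lambda name: score(view, layers[name]), reverse=True)


# Made-up extents for the example.
view = BBox(-75.0, 40.0, -73.0, 42.0)
layers = {
    "world": BBox(-180.0, -90.0, 180.0, 90.0),
    "california": BBox(-124.4, 32.5, -114.1, 42.0),
    "northeast_states": BBox(-80.0, 37.0, -68.0, 45.0),
    "nyc_tracts": BBox(-74.3, 40.5, -73.7, 40.9),
}
print(rank(view, layers))
```

In this sketch a layer that sits entirely inside the view outranks a regional layer that merely has its center in view, which in turn outranks a global layer that only overlaps the view.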


    The project was pioneered by Tufts, Harvard, and MIT, and now about a dozen other large research universities are actively working with it, and others are starting to experiment. The purpose of the summit was to begin creating a cohesive community to manage and govern the project, and to increase and outline the possibilities for collaborating across institutions. At the back end, librarians and metadata experts are loading layers and metadata into their repositories; metadata creation is an exacting and time-consuming process, but the OGP will allow institutions to share their metadata records in the hope of avoiding duplicated effort. The OGP also allows for the export of detailed spatial metadata from FGDC and ISO to MODS and MARC, so that records for the spatial layers can be exported to other content management systems and library catalogs.

    The summit gave metadata experts the opportunity to discuss best practices for metadata creation and maintenance, in the hopes of providing a consistent pool of records that can be shared; it also gave software developers the chance to lay out their road map for how they’ll function as an open source project (the OGP community could look towards the GeoNetwork opensource project, a forerunner in spatial metadata and search that’s used in Europe and by many international organizations). A series of five-minute talks called Ignite sessions gave librarians and developers the ability to share the work they were doing at their institutions, either with OGP in particular or with metadata and spatial search in general, which sparked further discussion.

    The outcomes of all the governance, resource sharing, and best practices discussions are available on a series of pages dedicated to the summit, on the project website. You can also experiment with the OGP via Tufts’ gateway to their repository. As you search for data you can identify which repository the data is coming from (Tufts, Harvard, or MIT) based on the little icon that appears beside each layer name. Public datasets (like US census layers) can be downloaded by anyone, while copyrighted sets that the schools purchased for their users require authentication.

    OGP is a simple yet elegant open source project that operates under OGC standards and is awesome for spatial search, but the real gem here is the community of people that is forming around it. I was blown away by the level of expertise, dedication, and overall professionalism that each of the librarians, information specialists, and software developers exuded, via the discussions and particularly by the examples of the work they were doing at their institutions. Beyond just creating software, this project is poised to enhance the quality and compatibility of spatial metadata to keep our growing pile of geospatial stuff find-able.

    NYC Geodatabase in Spatialite

    Wednesday, February 6th, 2013

    I spent much of the fall semester and winter interim compiling and creating the NYC geodatabase (nyc_gdb), a desktop geodatabase resource for doing basic mapping and analysis at a neighborhood level – PUMAs, ZIP Codes / ZCTAs, and census tracts. There were several motivations for doing this. First and foremost, as someone who is constantly introducing new people to GIS it’s a pain sending people to a half dozen different websites to download shapefiles and process basic features and data before actually doing a project. By creating this resource I hoped to lower the hurdles a bit for newcomers; eventually they still need to learn about the original sources and data processing, but this gives them a chance to experiment and see the possibilities of GIS before getting into nitty gritty details.

    Second, for people who are already familiar with GIS and who have various projects to work on (like me) this saves a lot of duplicated effort, as the db provides a foundation to build on and saves the trouble of starting from scratch each time.

    Third, it gave me something new to learn and will allow me to build a second part to my open source GIS workshops. I finally sat down and hammered away with Spatialite (went through the Spatialite Cookbook from start to finish) and learned spatial SQL, so I could offer a resource that’s open source and will complement my QGIS workshop. I was familiar with the Access personal geodatabases in ArcGIS, but for the most part these serve as simple containers. With the ability to run all the spatial SQL operations, Spatialite expands QGIS functionality, which was something I was really looking for.

    My original hope was to create a server-based PostGIS database, but at this point I’m not set up to do that on my campus. Spatialite seemed like a good alternative – the basic operations and spatial SQL commands are largely the same, and I figured I could eventually scale up to PostGIS when the time comes.

    I also created an identical, MS Access version of the database for ArcGIS users. Once I got my features in Spatialite I exported them all out as shapefiles and imported them all via ArcCatalog – not too arduous as I don’t have a ton of features. I used the SQLite ODBC driver to import all of my data tables from SQLite into Access – that went flawlessly and was a real time saver; it just took a little bit of time to figure out how to set up (but this blog post helped).

    The databases are focused on NYC features and resources, since that’s what my user base is primarily interested in. I purposefully used the Census TIGER files as the base, so that if people wanted to expand the features to the broader region they easily could. I spent a good deal of time creating generalized layers, so that users would have the primary water / coastline and large parks and wildlife areas as reference features for thematic maps, without having every single pond and patch of grass to clutter things up. I took several features (schools, subway stations, etc) from the City and the MTA that were stored in tables and converted them to point features so they’re readily useable.

    Given that focus, it’s primarily of interest to NYC folks, but I figured it may be useful for others who wish to experiment with Spatialite. I assumed that most people who would be interested in the database would not be familiar with this format, so I wrote a tutorial that covers the database and its features, how to add and map data in QGIS, how to work with the data and do SQL / spatial SQL in the Spatialite GUI, and how to map data in ArcGIS using the Access Geodb. It’s Creative Commons, Attribution, Non-Commercial, Share-alike, so feel free to give it a try.

    I spent a good amount of time building a process rather than just a product, so I’ll be able to update the db twice a year, as city features (schools, libraries, hospitals, transit) change and new census data (American Community Survey, ZIP Business Patterns) is released. Many of the Census features, as well as the 2010 Census data, will be static until 2020.

    New Version of Introductory GIS Tutorial Now Available

    Sunday, October 7th, 2012

    The latest version of my Introduction to GIS tutorial using QGIS is now available. I’ve completely revised how it’s organized and presented; I wrote the first two manuals in HTML, since I wanted something that gave me flexibility with inserting many images in a large document (word processors are notoriously poor at this). Over the summer I learned how to use LaTeX, and the result for this 3rd edition is an infinitely better document, for classroom use or self study.

    I also updated the manual for use with QGIS 1.8. I’m thinking that the addition of the Data Browser and the ability to simply select the CRS of the current layer or project when you’re doing a Save As (rather than having to select the CRS from the master list) will save a lot of valuable time in class. With every operation that we perform we’re constantly creating new files as the result of selections and geoprocessing, and I always lose a few people each time we’re forced to crawl through the file system to add new layers we’ve created. These simple changes should speed things up. I’ve updated the manual throughout to reflect these changes, and have also updated the datasets to reflect what’s currently available. I provide a summary of the most salient changes in the introduction.

    Plan Your Trip through the Roman Empire with ORBIS

    Wednesday, June 27th, 2012

    If you wanted to know the fastest route from Roma to Londinium in June of 300 AD or how much it would cost to ride the shortest distance at ox cart speed to Constantinopolis, check out ORBIS. Researchers at Stanford have created a model of Ancient Roman transport networks over land and sea composed of 751 points (cities, landmarks, mountain passes) and thousands of linkages.

    The model simulates the average distance of a large group of travelers taking a given route in a given month. The frictions of distance, terrain, climate, and monetary expense are all accounted for in the model and you have the ability to set many of the options. The technical aspects of the project as well as its historical bases are thoroughly documented. The output consists of route maps (which you can download as KML or as CSV) and interactive cartograms. The platform is an open source stack – PostgreSQL with PostGIS, OpenLayers, GeoServer, and some JavaScript libraries.


    The fastest route from Roma to Londinium in June? A boat ride across the Mediterranean to Narbo, foot/army/pack animal across southern Gaul, and a coast-hugging boat ride from Burdigala will get you there in 26.6 days and 2,974 kilometers. That carriage to Constantinopolis would cost you about 2,087 denarii and would take 128 days at ox cart speed – perhaps you should consider a fast military march instead?

    Goings on at FOSS4G 2011

    Thursday, September 15th, 2011

    I’m at FOSS4G in Denver this week (Free and Open Source for Geospatial conference) and have learned a few things (eventually all presentations, audio and visuals of slides, will be available online):

    • There will be a QGIS update, version 1.7.1, sometime this month; it’s a minor release that will fix a few bugs. Some future version of QGIS will include a Data Browser (think ArcCatalog).
    • For folks who have asked me how they can get more cartographic production power out of QGIS, Inkscape looks like a good option – folks at UC Davis have been experimenting with it with some success.
    • Learned about a documentation system for open source (or any) projects called Sphinx; documents are stored as reStructuredText files with some Python scripts that link them together and provide formatting for output and display.
    • Got a great, clear, concise overview of what’s involved with an open source web mapping stack.
    • There’s a study at Idaho State (affiliated with the group of folks there that created MapWindow) that’s attempted to define the core functions of GIS based on a survey of GIS users. You can view their data by contacting the project lead.
    • Educators at a community college in Arizona are experimenting with an open source raster program called Opticks; a viable alternative to more expensive packages like ERDAS and IDRISI.
    • There are some new Python libraries you can use to create and mine KML data.
    • The FCC used a clever method for collapsing / aggregating US Census geography from the block level to create their Broadband Map.
    • While I’ve heard of and poked around the OpenStreetMap project, I never realized that many of the users were contributing to the project by walking, cycling, and driving around with GPS units, and uploading the tracks to create and update road networks around the world. They also use some free datasets (like the Census TIGER files and equivalents from other countries) to augment and provide a frame of reference for their systems.
    • Data in the UK is finally opening up some more, and demand for products from the Ordnance Survey has been off the charts.
    • My presentation on using QGIS in an academic library went pretty well, and I was pleased to discover I’m not the only GIS librarian at the conference! I’ve met folks from Ontario, Alberta, and Kansas.

    Giving GRASS GIS a Try

    Saturday, July 30th, 2011

    I’ve been taking the plunge in learning GRASS GIS this summer, as I’ve been working at home (and thus don’t have access to ArcGIS) on a larger and more long-term project (and thus don’t want to mess around with shapefiles and csv tables). I liked the idea of working with GRASS vectors, as everything is contained in one folder and all my attributes are stored rather neatly in a SQLite database.

    I started out using QGIS to create my mapset and to connect it to my SQLite db which I had created and loaded with some census data. Then I thought, why not give the GRASS interface a try? I started using the newer Python-wx GUI and as I’m trying different things, I bounce back and forth between using the GUI for launching commands and the command line for typing them in – all the while I have Open Source GIS: A GRASS GIS Approach at my side and the online manual a click away. So far, so good.

    I loaded and cleaned a shapefile with the GRASS GUI (the GUI launches several commands, including v.clean) and its attributes were loaded into the SQLite database I had set (using db.connect – you need to do this, otherwise a DBF is created by default for storing attributes). Then I had an age-old task to perform – the US Census FIPS / ANSI codes were stored in separate fields, and in order to join them to my attribute tables I had to concatenate them. I also needed to append some zeros to census tract IDs that lacked them – FIPS codes for states are two digits long, counties are three digits, and tracts are between four and six digits, so to standardize them four-digit tracts should have two zeros appended.

    Added the new JOIN_ID column using v.db.addcol, then did the following using db.execute:

    UPDATE tracts_us99_nad83


    UPDATE tracts_us99_nad83
    SET JOIN_ID = JOIN_ID || '00'
    WHERE length(JOIN_ID)=9

    So this:

    01 077 0113
    01 077 0114
    01 077 011502
    01 077 011602

    Becomes this:

    01077011300
    01077011400
    01077011502
    01077011602

    [Screenshot: db.execute in the GRASS GUI]

    I could have done this a few different ways from within GRASS: instead of the separate v.db.addcol command I could have written a SQL statement in db.execute to alter the table and add a column. Or, instead of db.execute I could have used the v.db.update command.
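For anyone who wants to try the concatenate-then-pad logic outside of GRASS, here is a minimal sketch using Python's built-in sqlite3 module on an in-memory table. The table name and JOIN_ID come from the example above, but the STATE, COUNTY, and TRACT column names are assumptions made for illustration; substitute whatever your fields are called.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# STATE, COUNTY and TRACT are hypothetical column names for the separate FIPS fields.
conn.execute(
    "CREATE TABLE tracts_us99_nad83 (STATE TEXT, COUNTY TEXT, TRACT TEXT, JOIN_ID TEXT)"
)
conn.executemany(
    "INSERT INTO tracts_us99_nad83 (STATE, COUNTY, TRACT) VALUES (?, ?, ?)",
    [
        ("01", "077", "0113"),
        ("01", "077", "0114"),
        ("01", "077", "011502"),
        ("01", "077", "011602"),
    ],
)

# Step 1: concatenate the separate codes into one ID.
conn.execute("UPDATE tracts_us99_nad83 SET JOIN_ID = STATE || COUNTY || TRACT")

# Step 2: IDs that are only 9 characters long have a four-digit tract; pad with '00'.
conn.execute(
    "UPDATE tracts_us99_nad83 SET JOIN_ID = JOIN_ID || '00' WHERE length(JOIN_ID) = 9"
)

join_ids = [
    row[0]
    for row in conn.execute("SELECT JOIN_ID FROM tracts_us99_nad83 ORDER BY rowid")
]
print(join_ids)
```

Storing the codes as text matters here: if the fields were numeric, the leading zeros in state codes like 01 would be lost before the concatenation ever ran.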

    My plan is to use GRASS for geoprocessing and analysis (will be doing some buffering, geographic selection, and basic spatial stats), and QGIS for displaying and creating final maps. I just transformed an attribute table with coordinates in my db into a point vector. But now I’m realizing that in order to draw buffers, I’ll need a projected coordinate system that uses meters or feet, as inputting degrees for a buffer distance (for points throughout the US) isn’t going to work too well. I’m also having trouble figuring out how to link my attribute data to my vectors – I can easily use v.db.join to fuse the two together, and there is a way to link them more loosely using the CAT ID number for each vector, but that’s where I’m getting stuck. We’ll see how it goes.
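A quick bit of arithmetic shows why degrees make a poor buffer distance across the whole US. The sketch below uses the common spherical approximation that one degree of longitude spans about 111.32 km at the equator and shrinks with the cosine of the latitude; the city latitudes are rounded and the function name is just for illustration.

```python
import math

# Approximate length of one degree of longitude at the equator (spherical assumption).
KM_PER_DEGREE_AT_EQUATOR = 111.32


def km_per_degree_longitude(latitude_deg: float) -> float:
    """Approximate east-west length in km of one degree of longitude at a latitude."""
    return KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude_deg))


# Rounded latitudes for three US cities.
for place, lat in [("Miami", 25.8), ("New York", 40.7), ("Seattle", 47.6)]:
    print(f"{place}: {km_per_degree_longitude(lat):.1f} km per degree of longitude")
```

Under this approximation a one-degree buffer would reach about 100 km east-west around Miami but only about 75 km around Seattle, while its north-south reach stays roughly constant, so the "circle" is a different size and shape at every point. A projected system in meters or feet avoids that.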

    Some final notes – since I’m working with large datasets (every census tract in the US) and GRASS uses a topological data model where common nodes and boundaries between polygons are shared, geoprocessing can take a while. I’ve gotten in the habit of testing things out on a small subset of my vectors, and once I get it right I run the process on the total set.

    Lastly, there are times where I read about commands in the manual and for the life of me I can’t find them in the GUI – for example, finding a menu to delete (i.e. permanently remove) layers. But if you type the command without any of its parameters in the command line (in this case, g.remove) it will launch the right window in the GUI.

    [Screenshot: GRASS GIS interface]

    FOSS4G In Denver This Sept

    Monday, June 20th, 2011

    I’m all set to go to FOSS4G 2011, the global conference on Free and Open Source Software for Geospatial, organized by OSGeo. The conference takes place in Denver, CO from Mon Sept 12 to Fri the 16th. The first two days (12th-13th) consist of morning and afternoon workshops while the main conference takes place from the 14th to the 16th and features talks, presentations, tutorials, exhibits, and some fun social events.

    The full program is available here, and it looks like it’s chock full of interesting presentations and lots of great learning opportunities via the workshops and tutorials. I’ll be presenting on Weds afternoon, for those interested in my adventures in introducing QGIS on a college campus.

    If you’re on the fence about attending, consider this: this is the sixth year for the conference and it’s only the second time that it’s been held in North America (Canada hosted the 2nd conference in 2007) and the first time it’s being hosted in the US. So if you’re in North America and getting funding from your organization for travel is an issue, now’s your best chance to go. This is truly an international conference (it has also been hosted in Switzerland, South Africa, Australia, and Spain) so it probably won’t be back on these shores for a while.

    Here’s some more motivation – early registration at the discounted rate ends on June 30th!

    Define Projection for a Batch of Shapefiles

    Sunday, June 12th, 2011

    I was working on a project where I had downloaded 51 shapefiles (state-based census tract files) from the Census Generalized Cartographic Boundary Files. Each file lacked a projection .prj file, so I had to define each one as NAD83. Not wanting to do this one at a time, I used the GDAL / OGR tools and a bash script to process them all in a batch. I wrote a little script in a text file and then pasted it in the command line:

    for i in *.shp; do
        ogr2ogr -f "ESRI Shapefile" -a_srs "EPSG:4269" ./nad83 "$i"
    done

    It iterates through a list of all the shapefiles in a directory, uses OGR to define them as NAD83, then writes them to a new subdirectory called nad83.

    After searching through the web for some guidance on this, I later realized that there was a nice, succinct example of this in a book that I had (yeah – remember books? They’re still great!)

    # from Sherman (2008) Desktop GIS Mapping the Planet With Open Source Tools pp 243-44

    for shp in *.shp
    do
        echo "Processing $shp"
        ogr2ogr -f "ESRI Shapefile" -t_srs EPSG:4326 geo/$shp $shp
    done

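    This follows the same batch pattern; the differences are that it prints a message to the command line for each file that’s processed, and it uses the -t_srs switch (which reprojects the data into the target system, here EPSG:4326 / WGS 84) rather than -a_srs (which only assigns a coordinate system to the output without changing any coordinates). Since -t_srs needs to know the source coordinate system, for files that lack a .prj you’d stick with -a_srs, or supply the source system with -s_srs. Of course you could tweak this a little to transform projections from one system to another as well.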

    This is fine and good if you’re using Linux and can use bash (go here for more info about bash). If you’re using Windows, you can do this if you’re using a Linux / UNIX terminal emulator like MSYS; otherwise you can use the DOS Command Prompt and write a batch (.bat) file to do this instead – the post on this forum is the first thing I found in my quest to figure all of this out.
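Another route that should work the same on Linux, Mac, and Windows is a short Python script that calls ogr2ogr for each file. This is a sketch rather than something from the original workflow: it assumes ogr2ogr is installed and on your PATH, the function names are made up for the example, and the `nad83` output folder and `EPSG:4269` default simply mirror the bash loop above.

```python
import subprocess
from pathlib import Path


def build_commands(src_dir, out_dir="nad83", srs="EPSG:4269"):
    """Return one ogr2ogr argument list per shapefile found directly in src_dir."""
    src = Path(src_dir)
    dst = src / out_dir
    return [
        # -a_srs assigns the coordinate system without reprojecting the data.
        ["ogr2ogr", "-f", "ESRI Shapefile", "-a_srs", srs, str(dst / shp.name), str(shp)]
        for shp in sorted(src.glob("*.shp"))
    ]


def define_projection(src_dir, out_dir="nad83", srs="EPSG:4269"):
    """Run ogr2ogr on every shapefile, writing the results to a subdirectory."""
    (Path(src_dir) / out_dir).mkdir(exist_ok=True)
    for cmd in build_commands(src_dir, out_dir, srs):
        print("Processing", cmd[-1])
        # check=True stops the batch if ogr2ogr reports an error.
        subprocess.run(cmd, check=True)


# Example call (the path is a placeholder):
# define_projection("/path/to/shapefiles")
```

Keeping the command building separate from the running makes it easy to print the commands first and confirm they look right before touching any files.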

    Copyright © 2017 Gothos. All Rights Reserved.