Some QGIS Odds and Ends

July 3rd, 2014

My colleague Joe Paccione recently finished a QGIS tutorial on working with raster data. My introductory tutorial for the GIS Practicum gives only cursory treatment to rasters, so this project was initially conceived to give people additional opportunities to learn about working with them. It focuses on elevation modeling and uses DEMs and DRGs to introduce tiling, warping, and the creation of hillshades and contour lines.

[Image: topo_contour]

The tutorial was written using QGIS 2.0 and was tested with version 2.4, so it’s readily usable with any 2.x version of QGIS. With the rapid progression of QGIS, my introductory tutorial for the workshop is becoming woefully outdated, having been written for version 1.8. It’s going to take me quite a while to update (among other things, the image for every darn button has changed), but I plan to have a new version out sometime in the fall, though probably not at the beginning of the semester. Since I have a fair amount of work to do anyway, I’m going to rethink all of the content and exercises. Meanwhile, Lex Berman at Harvard has updated his wonderfully clear and concise tutorial for QGIS 2.x.

The workshops have been successful for turning people on to open source GIS on my campus, to the point where people are using it and coming back to teach me new things – especially when it comes to uncovering useful plugins:

  • I had a student who needed to geocode a bunch of addresses, but since many of them were international I couldn’t turn to my usual geocoding service at Texas A&M. While I’ve used the MMQGIS plugin for quite a while (it has an abundance of useful tools), I NEVER NOTICED that it includes a geocoding option that acts as a GUI for both the Google and OpenStreetMap geocoding APIs. He discovered it, and it turned out quite well for him.
  • I was helping a prof who was working with a large point file of street signs, and we discovered a handy plugin called Points2One that allowed us to take those points and turn them into lines based on an attribute the points held in common. In this case every sign on a city block shared a common id that allowed us to create lines representing each side of the street on each block.
  • After doing some intersect and difference geoprocessing with shapefiles I was ending up with some dodgy results – orphaned nodes and lines that had duplicate attributes with good polygons. If I were working in a database, an easy trick for finding these duplicates would be to run a select query where you group by ID and count the records; anything with a count of two or more has duplicates (a sketch of that query appears after this list) – but this was a shapefile. Luckily there’s a handy plugin called Group Stats that lets you create pivot tables, so I used that to do a summary count and identified the culprits. The plugin allowed me to select all the features that matched my criteria of having an id count of 2 or more, so I could eyeball them in the map view and the attribute table. I calculated the geometry for all the features and sorted the selected ones, revealing that all the duplicates had infinitesimally small areas. Then it was a simple matter of select and delete.
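
For reference, here’s a minimal sketch of that database trick, assuming a hypothetical table called features with a shared id column; it’s just the SQL equivalent of the idea, not what the Group Stats plugin actually runs:

-- Hypothetical table and column names; group by the shared id
-- and keep only the groups that occur more than once.
SELECT id, COUNT(*) AS feature_count
FROM features
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY feature_count DESC;

Any id that shows up in the result belongs to a set of duplicate features.
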
    Introducing – Data!

    April 9th, 2014

    Professors invite me to their classes each semester to give students a crash course in finding data for neighborhoods in New York City, with a particular emphasis on Census data. I typically visit courses in journalism and public affairs, but this semester I added classes in management and – theater – to the list. Before I dive into what the Census is and what sources they should use, I preface the presentation with a discussion of what neighborhoods are and how we define them. This is important because neighborhoods are locally and informally defined, and when searching for datasets we often have to use a proxy, like census tracts, ZIP codes, PUMAs, or local legal or administrative areas, to approximate them.

    But before we get this far, I always begin the discussion with some basic questions to set the stage: what is data, and what can we use it for? For the journalism students, I explain that data can help support a story. If they’re covering a town hall or community board meeting where affordable housing is the topic of discussion, they’re going to want to provide some context and include some facts to support their story – what is the rent like in the neighborhood? How many people live there? Alternatively, data can provide the basis for a story. I point to one of numerous examples in NYC where journalists have taken a big lump of unrefined data – the NYPD’s stop and frisk data, traffic fatality incidents, 311 complaints – and have refined it to produce information that leads them to an interesting story that was hidden in these numbers. Lastly, data is a story – whenever the Census releases a new dataset, someone writes a story to announce the release and tell us what’s in it.

    This idea of refining leads us to our first basic definition – data can be considered raw, unrefined information. It doesn’t tell us much in and of itself, but if we sift through and refine it we can turn it into information that we can use to tell or support a story or reveal some fact or truth that was previously unknown. Data can be quantitative or qualitative – journalists for example may interview someone for two or three hours, but they’re not going to turn around and publish that entire interview. They’re going to write an article that summarizes it and gives us the most important bits, or edit it for a radio broadcast that covers the highlights. With quantitative data the issue is similar – I use a basic example of population data for the 50 states and show them this image of a comma-delimited text file:

    [Image: csv]

    I explain that this is what data looks like in a raw state. It’s in a basic format suitable for preservation or transit between systems, but is not in a presentable state. There are a lot of codes that are meaningless to the average person, the data isn’t sorted in a meaningful way, the column headings seem ambiguous, and the numbers aren’t formatted for viewing. This isn’t something that they’d want to insert directly into their story or paper. But if they take this and do a little bit of work:

    [Image: table]

    They can take that raw data and turn it into information. Here we’ve moved from raw data to a presentable table. The statistics are sorted in a logical order based on total population, columns are given comprehensible names, and unnecessary information (for presentation purposes) is dropped. We add commas to the numbers so they’re more legible, and we create some value by adding a percent total column. Now we have something we can use to communicate a message. But we can go further – we can take that same information and turn it into this:

    [Image: chart]

    Now we have a chart. At this point I turn to the students and ask them what the benefit of using the chart is, followed by a discussion of trade-offs; we’ve gained something but lost something too. On the plus side, we can appeal to people’s visual sensibilities, and we can see more clearly that California has twice as many people as New York. The chart is also more concise, as it takes up less real estate on the page or on the screen. But we’ve traded precision for conciseness; we can no longer tell what the exact population numbers are from the chart; we can only approximate them. But we can also go further:

    [Image: map]

    We can take that same dataset and turn it into a map. Once again, we discuss the pluses and minuses. Now we can key into people’s geographic knowledge as well as their visual senses; Ohio may be more meaningful now that we can see it on a map, rather than just seeing a number in a table. We can also see geographic patterns of clustering or diffusion, which the table or chart couldn’t show us. But with the map we’ve lost even more precision. Now we can only see that a state’s population number falls within a given range; we can’t see the precise number and can’t approximate it like we could with the chart.

    At this point, one student will point out that if the chart or map is on the web, we can have the best of all worlds. If the graphic is interactive we can hover over it and see the exact population number. This leads to a discussion of the trade-offs between interactive web-based information and static information. The interactive chart or map lets us keep precision and conciseness, but the sacrifice is complexity, portability, and preservation. It’s more complex to create, and it can only exist in its native environment, within a specific bundle of technology that includes programming and scripting languages, software libraries, browsers, and operating systems. Such things go obsolete quickly and can easily break, so the shiny chart, map, or app you have today may be non-functional in a year or two, and difficult to preserve. Contrast that with a static image or text, which is simple, easy to move around, depends on little else, and can make the jump from a screen to the printed page.

    We sum up this little talk by framing what they’re trying to achieve – I use the DIK pyramid, which I was introduced to in library school (OK – this pic is the DIKW pyramid, with wisdom thrown on top – it’s public domain so I can safely use it):

    [Image: DIKW-diagram]

    As journalists or researchers, you’re taking data and refining it to turn it into information to support your work or to communicate a message. You take those pieces of new information and bring them together to tell a bigger story and paint a bigger picture, which we hope will lead to greater knowledge (which, unlike data and information, is something that can only be learned and not simply assembled and communicated). The weather is a good example – a giant log of temperature and precipitation data isn’t going to do me much good. But if you process that data to calculate the high, low, and mean, now you have information I can use. Take that information and combine it with a radar picture and a forecast and now I have a rich information object. I can take that object and piece it together with other information – another forecast I hear on the radio, what I see out the window, my previous experiences of getting wet, my wife’s advice – to formulate a decision that I can act on. By considering all of this information – my experiences, contextualized information, and know-how – and weighing it to reach a conclusion, I am using my knowledge. In this case I’ll use it to carry an umbrella.

    The final point is that, in their papers, the students must take the information objects that they’ve created or acquired and integrate them into their work. Many students will just copy and download a table and stick it at the back of the paper, and assume that it speaks for itself. I tell them – it doesn’t! You have to explain why it’s there; make reference to it in the paper and weave it into your research.

    Overall this presentation / discussion takes all of about 10 minutes, AND THEN we move into the discussion of neighborhoods, the census, and specific datasets. I’ve contemplated skipping it altogether, but ultimately decided that it’s necessary. I think it’s essential to provide some context and theory coupled with the actual sources and the pragmatic nature of finding the data. There are some librarians who are completely averse to teaching “tools” and will speak only in the abstract, while there are others who cut directly to listing the sources and leave it at that. The first approach is useless because the students won’t learn what to actually do; the second approach makes assumptions about what they know and fails to prepare them for what they’ll face. There also seems to be a clear need for me to do this – I’ve heard many faculty comment that students are simply tacking data tables they’ve copied off the web onto the back of papers without any explanation. When I first presented the slide that depicts the csv file, I was shocked by the looks of shock on many students’ faces – like they’d never seen or heard of this before and were worried that they’d have to wrestle with it. Here’s the data-driven world, step 1.

    Loading Data Into PostGIS – Shapefiles

    March 4th, 2014

    This example picks up where Loading Data Into PostGIS – Text Files left off; we’ll use the same schema, usa_general. We’ll load two shapefiles from the Census Bureau’s 2010 Cartographic Boundary files at https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html.

    Loading shapefiles can be done via the pgAdmin III GUI (version 1.18 for MS Windows was used for this example) from a client machine using the shp2pgsql-gui. This is an extra plugin that must be downloaded and installed. A command-line version of the tool (shp2pgsql) is automatically bundled with PostGIS, but it can’t be used by a client unless they either log into the server via a terminal or install PostGIS locally.

    Download Plugin

    The shp2pgsql-gui can be downloaded for MS Windows from the extras folder on the PostGIS website: http://download.osgeo.org/postgis/windows/. Version 2.01 for the 32-bit environment was downloaded and unzipped. Per the instructions in the README file, the postgisgui folder was dragged into the PG BIN path as designated in pgAdmin under File > Options > Binary Paths. Once pgAdmin was restarted, the GUI was available under the Plugins menu.

    Load Shapefiles

    Set Options

    Under Plugins > PostGIS Shapefile and DBF Loader, check the Options menu. There are often problems when dealing with UTF-8 encoding in DBF files and when working in an MS Windows environment. Change the encoding from UTF-8 to WINDOWS-1252 (another possibility would be LATIN-1). Keep the boxes for creating a spatial index and using COPY instead of INSERT checked.

    [Image: import_options]

    Load

    Browse and select both shapefiles. Once they appear in the import list, the following elements must be changed: switch the schema from public to the desired schema (usa_general), change the table names to something meaningful, and specify the SRID number for each (since these files are from the US Census they’re in NAD 83, which is EPSG 4269). Once you hit Load, the log window will display the results of each import.

    [Image: shape_loader]

    Refresh the database and both shapefiles will appear as tables.
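
    If you want to double-check that the SRID was registered as expected, a query along these lines should return 4269 (shown here for the states table):

    SELECT Find_SRID('usa_general', 'states', 'geom');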

    Check Geometries

    Because of the size of some geometries, they may appear blank if you preview the tables in pgAdmin – this is a well-known and documented issue. The simplest way to verify the load is to check that none of the geometry values are null:

    SELECT COUNT(*)
    FROM usa_general.states
    WHERE geom IS NULL;

    "0"
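
    Beyond checking for nulls, you could also confirm that the imported polygons are valid; this isn’t a step from the original workflow, just a general PostGIS check:

    -- Count any invalid geometries (0 means everything loaded cleanly)
    SELECT COUNT(*)
    FROM usa_general.states
    WHERE NOT ST_IsValid(geom);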

    Post-Load Operations

    Primary Keys

    By default, the shapefile loader creates a field called gid populated with sequential integers and designates it as the primary key. If the data already has a natural key that you want to use, you have to drop the key constraint on gid and add the constraint to the desired field. Since features from the US Census have unique ANSI / FIPS codes, we’ll designate that as the key.

    ALTER TABLE usa_general.states
    DROP CONSTRAINT states_pkey

    Query returned successfully with no result in 125 ms.

    ALTER TABLE usa_general.states
    ADD CONSTRAINT states_pkey PRIMARY KEY ("state");

    Query returned successfully with no result in 125 ms.

    Constraints

    An alternative to changing the primary key would be to add UNIQUE and NOT NULL constraints to additional columns; gid remains the primary key, and a unique index is automatically created on a column when the UNIQUE constraint is added.

    ALTER TABLE usa_general.metros ADD CONSTRAINT cbsa_unq UNIQUE (cbsa);
    ALTER TABLE usa_general.metros ALTER COLUMN cbsa SET NOT NULL;

    Delete Records

    For the sake of example, we’ll delete all the records for micropolitan areas in the metros table, so we’re left with just metropolitan areas.

    DELETE FROM usa_general.metros
    WHERE lsad='Micro'

    Query returned successfully: 581 rows affected, 219 ms execution time.
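
    Before running a DELETE like this, it can be worth previewing how many rows will match; a quick count (not shown in the original workflow) would be:

    SELECT COUNT(*)
    FROM usa_general.metros
    WHERE lsad='Micro';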

    Spatial SQL

    As a test, we can write a query to select all of the populated places within the NYC metropolitan area (excluding those in Pennsylvania) that have an elevation greater than or equal to 500 feet.

    SELECT feature_name, state_alpha, county_name, elev_in_ft
    FROM usa_general.pop_places, usa_general.metros
    WHERE metros.cbsa='35620'
    AND ST_WITHIN(pop_places.geom, metros.geom)
    AND elev_in_ft >= 500
    AND state_alpha != 'PA'
    ORDER BY elev_in_ft DESC

    "Kampe";"NJ";"Sussex";1293
    "Highland Lakes";"NJ";"Sussex";1283
    "Edison";"NJ";"Sussex";1234…
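
    The same query can be written with an explicit join; it’s equivalent to the version above, just a stylistic alternative:

    SELECT p.feature_name, p.state_alpha, p.county_name, p.elev_in_ft
    FROM usa_general.pop_places p
    JOIN usa_general.metros m
    ON ST_Within(p.geom, m.geom)
    WHERE m.cbsa='35620'
    AND p.elev_in_ft >= 500
    AND p.state_alpha != 'PA'
    ORDER BY p.elev_in_ft DESC;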

    Connect to QGIS

    We can look at the geographic features by establishing a connection between the database and QGIS. In QGIS we can either hit the Add PostGIS Layer button or use the browser to connect to our database; we need to add the connection name, host, port, and database name. The username and password can be left blank and Q will prompt the user for them (if you don’t want them saved on the system). Once the connection is established you can add the layers to the view and change the symbolization.

    [Image: qgis_postgis]

    Loading Data Into PostGIS – Text Files

    February 25th, 2014

    I’ve fallen out of the blogosphere for quite a while. Time to catch up – lately we’ve been experimenting with deploying our own instance of Open Geoportal (see previous post) and square one is getting a functioning data repository up and running. I’ve been tinkering with PostgreSQL / PostGIS and am documenting tests in a wiki we’ve created. The wiki is private, so I thought I’d re-post some of the tests here.

    I’m working with a developer who’s installed and configured the database on a server, and I’m accessing it remotely using pgAdmin. I’ve started loading data and running queries and so far I’m relieved that most of what I’m seeing seems familiar, having worked with SQLite / Spatialite for a few years now. I’m using the excellent book PostGIS in Action as my guide. Here’s my take on loading a text file with coordinates into the db and building geometry for it. Loading shapefiles, spatial SQL experiments, and connecting to Geoserver will follow.

    Verify Version

    SELECT postgis_full_version();

    "POSTGIS="2.1.1 r12113" GEOS="3.4.2-CAPI-1.8.2 r3921" PROJ="Rel. 4.8.0, 6 March 2012" GDAL="GDAL 1.9.2, released 2012/10/08" LIBXML="2.7.6" LIBJSON="UNKNOWN" TOPOLOGY RASTER"

    Create Schema for Holding Related Objects

    CREATE SCHEMA usa_general;

    Query returned successfully with no result in 297 ms.

    Import Delimited Text File

    The test file is the US Populated Places gazetteer file from the USGS Geographic Names Information System (GNIS). It is a pipe-delimited text file encoded in UTF-8, with longitude and latitude coordinates in both DMS and decimal (DEC) format. http://geonames.usgs.gov/domestic/download_data.htm

    Create Table for Holding Data

    CREATE TABLE usa_general.pop_places
    (
    feature_id int NOT NULL PRIMARY KEY,
    feature_name varchar,
    feature_class varchar,
    state_alpha char(2),
    state_numeric char(2),
    county_name varchar,
    county_numeric char(3),
    primary_lat_dms varchar,
    prim_long_dms varchar,
    prim_lat_dec float4,
    prim_long_dec float4,
    source_lat_dms varchar,
    source_long_dms varchar,
    source_lat_dec float4,
    source_long_dec float4,
    elev_in_m int,
    elev_in_ft int,
    map_name varchar,
    date_created date,
    date_edited date
    );

    Query returned successfully with no result in 390 ms.

    Copy Data From File

    This must be run from the PSQL Console plugin in order to load data from a local client machine. If the data were stored on the server, you could use the pgAdmin GUI and use COPY in an SQL statement instead of \copy. When running the PSQL Console on MS Windows the default character encoding is WIN1252. In this example the data contains characters unsupported by that encoding; the file is encoded in UTF-8 and the console must be set to match. COPY is not a standard SQL command but is native to PostgreSQL.

    -- The optional HEADER specifies that the data file has a header row that should be skipped when importing

    \encoding UTF8
    \copy usa_general.pop_places
    FROM 'C:\Workspace\postgis_testing\POP_PLACES_20140204.txt'
    DELIMITER '|' CSV HEADER
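
    For comparison, the server-side COPY mentioned above would look something like this; the path here is hypothetical and has to be readable by the PostgreSQL server process, not the client:

    -- Hypothetical server-side path; COPY reads from the server's file system
    COPY usa_general.pop_places
    FROM '/srv/data/POP_PLACES_20140204.txt'
    WITH (FORMAT csv, DELIMITER '|', HEADER true, ENCODING 'UTF8');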

    Basic Query to Test that the Load was Successful

    SELECT *
    FROM usa_general.pop_places
    WHERE state_alpha='NY' AND county_name='New York'
    ORDER BY feature_name

    Create Geometry from Coordinates and Make Spatial Index

    4269 is the EPSG code for NAD 83, the basic geographic coordinate system used by US government agencies.

    SELECT AddGeometryColumn('usa_general','pop_places','geom',4269,'POINT',2)

    "usa_general.pop_places.geom SRID:4269 TYPE:POINT DIMS:2 "

    UPDATE usa_general.pop_places
    SET geom = ST_GeomFromText('POINT(' || prim_long_dec || ' ' || prim_lat_dec || ')', 4269);

    Query returned successfully: 200359 rows affected, 18268 ms execution time.
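
    As an aside, the same geometry can be built without the string concatenation; in PostGIS 2.x you could add a typed geometry column and construct the points directly (an alternative to the AddGeometryColumn and ST_GeomFromText statements above, not an additional step):

    -- Alternative to AddGeometryColumn: a typed geometry column (PostGIS 2.x)
    ALTER TABLE usa_general.pop_places
    ADD COLUMN geom geometry(Point,4269);

    -- Build the points from the decimal coordinates without text parsing
    UPDATE usa_general.pop_places
    SET geom = ST_SetSRID(ST_MakePoint(prim_long_dec, prim_lat_dec), 4269);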

    CREATE INDEX idx_pop_places_geom
    ON usa_general.pop_places USING gist(geom);

    Query returned successfully with no result in 8439 ms.

    Basic SQL Query to Test Geometry

    ST_AsEWKT transforms the geometry output from its binary representation into Extended Well-Known Text (EWKT), a variant of the OGC Well-Known Text (WKT) format that includes the SRID.

    SELECT feature_name, ST_AsEWKT(geom) AS geometry
    FROM usa_general.pop_places
    WHERE state_alpha='NY' AND county_name='New York'
    ORDER BY feature_name

    "Amalgamated Dwellings";"SRID=4269;POINT(-73.9828 40.715)"
    "Amsterdam Houses";"SRID=4269;POINT(-73.9881 40.7731)"
    "Battery Park City";"SRID=4269;POINT(-74.0163 40.7115)"…

    Basic Spatial SQL Query to Test Geometry

    Selects the point with the largest longitude (X) value in the table.

    SELECT feature_name, state_alpha, ST_AsEWKT(geom) AS geometry
    FROM usa_general.pop_places
    WHERE ST_Xmax(geom) IN(
    SELECT Max(ST_Xmax(geom))
    FROM usa_general.pop_places)

    "Amchitka";"AK";"SRID=4269;POINT(178.878 51.5672)"
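
    To select the northernmost point instead, the same pattern works with the Y (latitude) value:

    SELECT feature_name, state_alpha, ST_AsEWKT(geom) AS geometry
    FROM usa_general.pop_places
    WHERE ST_Ymax(geom) IN(
    SELECT Max(ST_Ymax(geom))
    FROM usa_general.pop_places)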

    Drop Columns

    Drop some columns that have no data.

    ALTER TABLE usa_general.pop_places
    DROP COLUMN source_lat_dms,
    DROP COLUMN source_long_dms,
    DROP COLUMN source_lat_dec,
    DROP COLUMN source_long_dec;

    Query returned successfully with no result in 180 ms.

    [Image: pgadmin3]

    Notes from the Open Geoportal National Summit

    October 30th, 2013

    This past weekend I had the privilege of attending the Open Geoportal (OGP) National Summit in Boston, hosted by Tufts University and funded by the Sloan Foundation. The Open Geoportal (OGP) is a map-based search engine that allows users to discover and retrieve geospatial data from many repositories. The OGP serves as the front end of a three-tiered system that includes a spatial database (like PostGIS) at the back and some middleware (like OpenLayers) to communicate between the two.

    Users navigate via a web map (Google by default but you can choose other options), and as they change the extent by panning or zooming a list of available spatial layers appears in a table of contents beside the map. Hovering over a layer in the contents reveals a bounding box that indicates its spatial extent. Several algorithms determine the ranking order of the results based on the spatial intersection of bounding boxes with the current map view. For instance, layers that are completely contained in the map view have priority over those that aren’t, and layers that have their geographic center in the view are also pushed higher in the results. Non-spatial search filters for date, data type, institution, and keywords help narrow down a search. Of course, the quality of the results is completely dependent on the underlying metadata for the layers, which is stored in the various repositories.

    [Image: opengeoport_tufts]

    The project was pioneered by Tufts, Harvard, and MIT, and now about a dozen other large research universities are actively working with it, and others are starting to experiment. The purpose of the summit was to begin creating a cohesive community to manage and govern the project, and to increase and outline the possibilities for collaborating across institutions. At the back end, librarians and metadata experts are loading layers and metadata into their repositories; metadata creation is an exacting and time-consuming process, but the OGP will allow institutions to share their metadata records in the hope of avoiding duplicated effort. The OGP also allows for the export of detailed spatial metadata from FGDC and ISO to MODS and MARC, so that records for the spatial layers can be exported to other content management systems and library catalogs.

    The summit gave metadata experts the opportunity to discuss best practices for metadata creation and maintenance, in the hopes of providing a consistent pool of records that can be shared; it also gave software developers the chance to lay out their road map for how they’ll function as an open source project (the OGP community could look towards the GeoNetwork opensource project, a forerunner in spatial metadata and search that’s used in Europe and by many international organizations). A series of five-minute talks called Ignite sessions gave librarians and developers the chance to share the work they were doing at their institutions, either with OGP in particular or with metadata and spatial search in general, which sparked further discussion.

    The outcomes of all the governance, resource-sharing, and best-practices discussions are available on a series of pages dedicated to the summit on the opengeoportal.org project website. You can also experiment with the OGP via http://geodata.tufts.edu/, Tufts’ gateway to their repository. As you search for data you can identify which repository the data is coming from (Tufts, Harvard, or MIT) based on the little icon that appears beside each layer name. Public datasets (like US census layers) can be downloaded by anyone, while copyrighted sets that the schools purchased for their users require authentication.

    OGP is a simple yet elegant open source project that operates under OGC standards and is awesome for spatial search, but the real gem here is the community of people forming around it. I was blown away by the level of expertise, dedication, and overall professionalism that the librarians, information specialists, and software developers exuded in the discussions, and particularly in the examples of the work they were doing at their institutions. Beyond just creating software, this project is poised to enhance the quality and compatibility of spatial metadata to keep our growing pile of geospatial stuff findable.

    Government Shutdown: Alternate Resources for US Census Data

    October 13th, 2013

    As the US government shutdown continues (thanks to a handful of ideological nutcases in Congress), those of us who work with and rely on government data are re-learning the lesson of why it’s important to keep copies of things. This includes having alternate sources of information floating around on the web and in the cloud, as well as the tried and true approach of downloading and saving datasets locally. There have been a number of good posts (like this succinct one) pointing users to alternatives to the federal sources that many of us rely on. I’ll go into more detail here with my suggestions on where to access US Census data, based on user level and need.

    • The Social Explorer: this web-mapping resource for depicting and accessing US Census data from 1790 to present (including the 2010 Census and the latest American Community Survey data) is intuitive and user-friendly. Many academic and public libraries subscribe to the premium edition that provides full access to all features and datasets (so check with yours to see if you have access), while a basic free version is available online. Given the current circumstances the Social Explorer team has announced that it will open the hatch and provide free access to users who request it.
    • The NHGIS (National Historical GIS): this project is managed by the Minnesota Population Center and also provides access to all US Census data from 1790 to present. While it’s a little more complex than the Social Explorer, the NHGIS is the better option for downloading lots of data en masse, and is the go-to place if you need access to all datasets in their entirety, including all the detail from the American Community Survey (as the Social Explorer does not include margins of error for any of the ACS estimates) or if you need access to other datasets like the County Business Patterns. Lastly – it is the alternative to the TIGER site for GIS users who need shapefiles of census geography. You have to register to use NHGIS, but it’s free. For users who need microdata (decennial census, ACS, Current Population Survey), you can visit IPUMS, a related MPC project.
    • The Missouri Census Data Center (MCDC): I’ve mentioned a number of their tools in the past; they provide easy-to-access profiles from the 2010 Census and American Community Survey, as well as historical trend reports for the ACS. For intermediate users they provide extract applications for the 2010 Census and ACS for creating spreadsheets and SAS files for download, and for advanced users the Dexter tool for downloading data en masse from 1980 to present. Unlike the other resources, no registration or sign-up is required. I also recommend the MCDC’s ACS and 2010 Census profiles to web designers and web mappers; if you’ve created online resources that tapped directly into the American FactFinder via deep links (like I did), you can use the MCDC’s profiles as an alternative. The links to their profiles are persistent and use a logical syntax (as it looks like there’s no end in sight to this shutdown, I may make the change-over this week). Lastly, the MCDC is a great resource for technical documentation about geography and datasets.
    • State and local government: thankfully many state and local governments have taken subsets of census data of interest to people in their areas and have recompiled and republished it on the web. These past few weeks I’ve been constantly sending students to the NYC Department of City Planning’s population resources. Take a look at your state data center’s resources, as well as local county or city planning departments, transportation agencies, or economic development offices to see what they provide.

    Article on Working With the American Community Survey

    June 17th, 2013

    I’ve got another article that’s just hit the presses. In this one I discuss the American Community Survey: how it differs from the Decennial Census, when you should use it versus other summary data sets, how to work with the different period estimates, and how to create derived estimates and calculate their margins of error. For that last piece I’ve essentially done an extended version of this old post on Excel formulas, with several different and updated examples.

    The article is available via Emerald’s journal database. If you don’t have access to it from your library feel free to contact me and I’ll send you a copy (can’t share this one freely online).

    Title: The American Community Survey: practical considerations for researchers
    Author(s): Francis P. Donnelly
    Citation: Francis P. Donnelly, (2013) “The American Community Survey: practical considerations for researchers”, Reference Services Review, Vol. 41 Iss: 2, pp.280 – 297
    Keywords: American Community Survey, Census, Census geography, Data handling, Decennial census, Demographic data, Government data processing, Government information, Margins of error, Sample-based data, United States of America, US Census Bureau
    Article type: Technical paper
    DOI: 10.1108/00907321311326228 (Permanent URL)
    Publisher: Emerald Group Publishing Limited

    Downloading Data for Small Census Geographies in Bulk

    May 7th, 2013

    I needed to download block group level census data for a project I’m working on; there was one particular 2010 Census table that I needed for every block group in the US. I knew that the American FactFinder was out – you can only download block group data county by county (which would mean over 3,000 downloads if you wanted them all). I thought I’d share the alternatives I looked at; as I searched around the web I found many others who were looking for the same thing (i.e. data for the smallest census geographies covering a large area).

    The Census FTP site at http://www2.census.gov/

    This would be the first logical step, but in the end it wasn’t optimal for my needs. When you drill down through Census 2010, Summary File 1, you see a file for every state and a national file. Initially I thought – great! I’ll just grab the national file. But the national file does NOT contain the small census statistical areas – no tracts, block groups, or blocks. If you want those small areas you have to download the files for each of the states – 51 downloads. When you download the data you can also download an MS Access database, which is an empty shell with the geography and field headers, and you can import each of the text file data tables (there are a lot of them for 2010 SF1) into the db and match them to the headers during import (the instructions that were included for doing this were pretty good). This is great if you need every variable in every table for every geography, but I was only interested in one table for one geography. I could just import the one text file with my table, but then I’d have to do this import process 51 times. The alternative is to use some Python to get that one text file for every state into one big file and then do the import once, but I opted for a different route.

    The NHGIS at https://www.nhgis.org/

    I always recommend this resource to anyone who’s looking for historical census data or boundary files, but it’s also good if you want current data for these small areas. I was able to use their query window to whittle down the selection by dataset (2010 SF1), geography (block groups), and topic (Hispanic origin and race in my case), and then I was able to choose the table I needed. On the last screen before download I was able to check a box to include all 50 states plus DC and PR in one file. I had to wait a couple of minutes for the request to process, then downloaded the file as a CSV and loaded it into my database. This was the best solution for my circumstances by far – one table for all block groups in the country. If you had to download a lot (or all) of the tables or variables for every block group or block it may take quite a while, and plugging through all of those menus to select everything would be tedious – if that’s your situation it may be easier to grab everything using the Census FTP.

    [Image: nhgis]

    UExplore / Dexter at http://mcdc.missouri.edu/applications/uexplore.shtml

    The Missouri Census Data Center’s UExplore / Dexter tool lets you choose a dataset and takes you to a window that resembles a file system, with a ton of files in it. The MCDC takes their extracts directly from the Census, so they’re structured in a similar way to the FTP site as state-based files. They begin with the state prefix and have a name that indicates geography – there are files for block groups, blocks, and one for everything else. There are national files (which don’t contain small census areas) that begin with ‘us’. The difference here is – when you click on a file, it launches a query window that lets you customize the extract. The interface may look daunting at first, but it’s worth exploring (and there’s a tutorial to help guide you). You can choose from several output formats, specific variables or tables (if you don’t want them all), and there are a bunch of handy options that you can specify like aggregation or percent totals. In addition to the complete datasets, they’ve also created ‘Standard Extracts’ that have the most common variables, if you want just a core subset. While the NHGIS was the best choice for my specific need, the customization abilities in Dexter may fit yours – and the state-level block group and block data is conveniently broken out from the other files.

    Lastly…

    There are a few other tools – I’ll give an honorable mention to the Summary File Retrieval tool, which is an Excel plugin that lets you tap directly into the American Community Survey from a spreadsheet. So if you wanted tracts or block groups for a wide area but only a small number of variables (I think 20 is the limit) that could be a winner, provided you’re using Excel 2007 or later and are just looking at the ACS. No dice in my case, as I needed Decennial Census data and use OpenOffice at home.

    The Geography of US Public Libraries

    March 18th, 2013

    Last month my article on the geographic distribution of US public libraries was pre-published online in JOLIS, with a print date pending. I can’t share this one freely online, so if you don’t have access via a library database (Sage Journals, ERIC, various LIS dbs) you can contact me if you’re interested in getting a copy.

    Title: The geographic distribution of United States public libraries: An analysis of locations and service areas
    Author: Francis P. Donnelly
    Journal: Journal of Librarianship and Information Science (JOLIS)
    Year: 2013; Volume, Issue, and Pages pending print release
    ISSN: 0961-0006
    DOI: 10.1177/0961000612470276
    Publisher: Sage

    Abstract

    This article explores the geography of public libraries in the United States. The distribution of libraries is examined using maps and spatial statistics to analyze spatial patterns. Methods for delineating and studying library service areas used in previous LIS research are modified and applied using geographic information systems to study variations in library accessibility by state and by socio-economic group at the national level. A history of library development is situated within the broader economic and demographic history of the US to provide insight into contemporary patterns, and Louis Round Wilson’s Geography of Reading is used as a focal point for studying historical trends. Findings show that while national library coverage is extensive, the percentage of the population that lives in a library’s geographic service area varies considerably by region and state, with Southern and Western states having lower values than Northeastern and Midwestern states.

    Keywords

    Geographic information systems, geography, public libraries, service areas, spatial equity, United States

    This OCLC flier (How Public Libraries Stack Up) piqued my interest in public libraries as community resources, public goods, and placemaking institutions. If the presence of a public library brings so much value to a community, then by extension the lack of a public library could leave a community at a disadvantage. This led to the next set of logical questions: how are libraries distributed across the country, and which people and communities are being served and which aren’t?

    I took a few different approaches to answer these questions. The first approach was to use (and learn) spatial statistics so the overall distribution could be characterized, and the second was to use spatial extraction methods to select the census areas and populations that were within the service area of each library, to see how well different populations were served and to compare these patterns across states. The LIS literature is rich with research that uses GIS to study library use, so I provide a thorough summary of what’s come before. Then, after I had the results, I spent a good deal of time researching how the contemporary pattern came to be, and coupled the research on the history of public libraries with the broader history of urban and economic development in the United States.

    I had a few unstated motives – one of them was to learn spatial statistics, with the help of: OpenGeoda and its documentation, this excellent book on Spatial Data Analysis (for theory), these great examples from Spatial Justice, and invaluable advice from Deborah Balk, a Professor of Spatial Demography with the CUNY Institute for Demographic Research.

    One of my other goals was to use only open source software (QGIS, GRASS, and OpenGeoda), which was also a success. In my next study, though, I’ll probably rely on QGIS and Spatialite; I found I was doing a lot of attribute data crunching using the SQLite Manager, since the attributes of GRASS vectors can be stored in SQLite, and I could probably save time (and frustration) by using Spatialite’s features instead. I did get to learn a lot about GRASS, but for my purposes it was overkill and I would have been just fine with a spatial database. I was definitely able to sharpen my Python skills, as processing the American Community Survey data for every census tract in the US manually would have been crazy.

    In a project this size there are always some pieces that end up on the cutting room floor, so I thought I’d share one here – a dot map that shows where all 16,700 public libraries are. In the article I went with a county choropleth map to show the distribution, because I was doing other county-level stuff and because the dimension restrictions on the graphic made it a more viable option. The dot map reveals that libraries are typically where people are, except that the south looks emptier and the midwest fuller than it should be, if libraries were in fact evenly distributed by population. As my research shows – they’re not.

    [Image: US Public Libraries]

    Creating Reports with SQLite, Python, and prettytable

    February 8th, 2013

    In addition to providing the NYC Geodatabase as a resource, I also wanted to use it to generate reports and build applications. None of the open source SQLite GUIs that I’m familiar with have built in report generating capabilities, so I thought I could use Python to connect to the database and generate them. I have some grand ambitions here, but decided to start out small.

    Python has a built-in module, sqlite3, that you can use to work with SQLite databases. This is pretty well documented – do a search and you’ll find a ton of brief tutorials. Take a look at this great post for a comprehensive intro.

    For generating reports I gave prettytable a shot: it lets you create nice-looking ASCII text tables that you can copy and paste from the prompt or export out to a file. The tutorial for the module was pretty clear and covers the basics quite nicely. In its examples the data is embedded directly in the script and the table is generated from it, which makes the tutorial readily understandable. For my purposes I wanted to pull data out of a SQLite database and into a formatted table, so that’s what I’ll demonstrate here.

    Initially I had some trouble getting the module to load, primarily (I think) because I’m using Python 3.x and the setup file for the module was written for Python 2.x; the utility you use for importing 3rd party modules has changed between versions. I’m certainly no Python expert, so instead of figuring it out I just downloaded the module, dumped it into the site-packages folder (as suggested in the prettytable installation instructions under “The Harder Way” – but it wasn’t hard at all) and unzipped it. In my script I couldn’t get the simple “import prettytable” to work without throwing an error, but when I imported the specific function with “from prettytable import PrettyTable” it worked. Your mileage may vary.

    So here was my first go at it. I created a test database and loaded a table of population estimates from the US Census Bureau into it (you can download it if you want to experiment):

    from prettytable import PrettyTable
    import sqlite3
    
    conn = sqlite3.connect('pop_test.sqlite')
    curs = conn.cursor()
    curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')
    
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    
    x = PrettyTable(col_names)
    x.align[col_names[1]] = "l" 
    x.align[col_names[2]] = "r" 
    x.padding_width = 1    
    for row in rows:
        x.add_row(row)
    
    print (x)
    tabstring = x.get_string()
    
    output=open("export.txt","w")
    output.write("Population Data"+"\n")
    output.write(tabstring)
    output.close()
    
    conn.close()
    

    The first piece is the standard SQLite piece – connect, activate a cursor, and execute a SQL statement. Here I’m grabbing three columns from the table for records that represent Northeastern states (Region 1). I read the column names from the cursor’s description into the col_names list, and I grab everything else and dump it into rows, a list that contains a tuple for each record:

    >>> col_names
    ['State', 'Name', 'Est2010']
    >>> rows
    [('09', 'Connecticut', 3574097), ('23', 'Maine', 1328361), ('25', 'Massachusetts', 6547629),
     ('33', 'New Hampshire', 1316469), ('34', 'New Jersey', 8791898), ('36', 'New York', 19378104),
     ('42', 'Pennsylvania', 12702379), ('44', 'Rhode Island', 1052567), ('50', 'Vermont', 625741)]
    >>> 
    

    The second piece will make sense after you have a quick look at the prettytable tutorial. Here I grab the list of column names and specify how cells in those columns should be aligned (the default is center) and padded (the default is one space). Then I add each row from the nested list of tuples to the table, row by row. There are two outputs: print directly to the screen, and dump the whole table into a string. That string can then be dumped into a text file, along with a title. Here’s the screen output:

    +-------+---------------+----------+
    | State | Name          |  Est2010 |
    +-------+---------------+----------+
    |   09  | Connecticut   |  3574097 |
    |   23  | Maine         |  1328361 |
    |   25  | Massachusetts |  6547629 |
    |   33  | New Hampshire |  1316469 |
    |   34  | New Jersey    |  8791898 |
    |   36  | New York      | 19378104 |
    |   42  | Pennsylvania  | 12702379 |
    |   44  | Rhode Island  |  1052567 |
    |   50  | Vermont       |   625741 |
    +-------+---------------+----------+
    

    The one hangup I had was the formatting for the numbers: I really want some commas in there since the values are so large. I couldn’t figure out how to do this using the approach above – I’m writing all the rows in one swoop, and couldn’t step in and format the last value of each row.

    Unless – instead of constructing the table by rows, I construct it by columns. Here’s my second go at it:

    from prettytable import PrettyTable
    import sqlite3
    
    conn = sqlite3.connect('pop_test.sqlite')
    curs = conn.cursor()
    curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')
    
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    
    y=PrettyTable()
    y.padding_width = 1
    y.add_column(col_names[0],[row[0] for row in rows])
    y.add_column(col_names[1],[row[1] for row in rows])
    y.add_column(col_names[2],[format(row[2],',d') for row in rows])
    y.align[col_names[1]]="l"
    y.align[col_names[2]]="r"
    
    print(y)
    tabstring = y.get_string()
    
    output=open("export.txt","w")
    output.write("Population Data"+"\n")
    output.write(tabstring)
    output.close()
    
    conn.close()
    

    To add by column, you don’t provide any arguments to the PrettyTable function. You just add the columns one by one: here I call the appropriate values using the index, first for the column name and then for all of the values from the rows that are in the same position. For the last value (the population estimate) I use format to display the value with commas as thousands separators (this works in Python 3.1+ – for earlier versions there’s a similar command – see this post for details). I tried this in my first example but I couldn’t get the format to stick, or got an error. Since I’m specifically calling these row values and then writing them, I was able to get it to work in this second example. In this version the alignment specifications have to come last. Here’s the result:

    +-------+---------------+------------+
    | State | Name          |    Est2010 |
    +-------+---------------+------------+
    |   09  | Connecticut   |  3,574,097 |
    |   23  | Maine         |  1,328,361 |
    |   25  | Massachusetts |  6,547,629 |
    |   33  | New Hampshire |  1,316,469 |
    |   34  | New Jersey    |  8,791,898 |
    |   36  | New York      | 19,378,104 |
    |   42  | Pennsylvania  | 12,702,379 |
    |   44  | Rhode Island  |  1,052,567 |
    |   50  | Vermont       |    625,741 |
    +-------+---------------+------------+
    

    prettytable gives you a few other options, like the ability to sort records by a certain column or to return only the first “n” records from a table. In this example, since we’re pulling the data from a database we could (and did) specify sorting and other constraints in the SQL statement instead. prettytable also gives you the option of exporting the table as HTML, which can certainly come in handy.

