Mapping Hard to Count Areas for Census 2010

There was an interesting article in the New York Times today about neighborhoods in New York that typically get under-counted in the Census. These include areas with high immigrant populations as well as places that have had new construction since the last census, as the buildings haven’t been added to the Census Bureau’s master address file.

What the article didn’t mention is that CUNY’s Center for Urban Research has created a great online ap called the Census 2010 Hard to Count mapping site. The site is built on the Census Bureau’s Tract Level Planning Database, which identified twelve population and housing variables, such as language isolation, recent movers, poverty, and crowded housing, that were associated with low mail response in the 2000 Census. This tool was designed to help Census reps, local government officials, and community activists identify traditionally under-counted areas to insure a more complete count this time around.

The database is national in scope, and you can easily map tracts for a particular state, county, city, metro area, or tribal area, and you can search for an area using an individual address. The map is built on a Google Maps interface, and zooming in will change the units mapped from larger units (states, counties, etc) to tracts. You can easily select one of the twelve variables color-coded in the menu to the left of the map, or a Hard to Count index of all the variables.

Calculated Fields in SpatiaLite / SQLite

After downloading data, it’s pretty common that you’ll want to create calculated fields, such as percent totals or change, to use for analysis and mapping. The next step in my QGIS / SpatiaLite experiment was to create a calculated field (aka derived field). I’ll run through three ways of accomplishing this, using my subway commuter data to calculate the percentage of workers in each NYC PUMA who commute to work. Just to keep everything straight:

  • sub_commuters is a census data table for all PUMAs in NY State
    • [SUBWAY] field that has the labor force that commutes by subway
    • [WORKERS_16] field with the total labor force
    • [SUB_PER] a calculated field with the % of labor force that commutes by subway
    • [GEO_ID2] the primary key field, FIPS code that is the unqiue identifier
  • nyc_pumas is a feature class with all PUMAs in NYC
    • [PUMA5ID00] is the primary key field, FIPS code that is the unqiue identifier
  • pumas_nyc_subcom is the data table that results from joining sub_commuters and nyc_pumas; it can be converted to a feature class for mapping

Spreadsheet

The first method would be to add the calculated field to the data after downloading it from the census in a spreadsheet, as part of the cleaning / preparation stage. You could then save it as a delimited text file for import to SpatiaLite. No magic there, so I’ll skip to the second method.

SpatiaLite

The second method would be to create the calculated field in the SpatiaLite database. I’ll go through the steps I used to figure this out. The basic SQL select query:

SELECT *, (SUBWAY / WORKERS_16) AS SUB_PER FROM sub_commuters

This gives us the proper result, but there are two problems. First, the data in my SUBWAY and WORKERS_16 field are stored as integers, and when you divide the result is rounded to the nearest whole number. Not very helpful here, as my percentage results get rounded to 0 or 1. There are many ways to work around this: set the numeric fields as double, real, or float in the spreadsheet before import (didn’t work for me), specify the field types when importing (didn’t get that option with the SpatiaLite GUI, but maybe you can with the command line), add * 100 to the expression to multiply the percentage to a whole number (ok unless you need decimals in your result) or use the CAST operator. CAST converts the current data type of a field to a specified data type in the result of the expression. So:

SELECT *, (CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL)) AS SUB_PER FROM sub_commuters

This gave me the percentages with several decimal places (since we’re casting the fields as real instead of integer), which is what I needed. The second problem is that this query just produces a temporary view; in order to map this data, we need to create a new table to make the calculated field permanent and join it to a feature class. Here’s how we do that:

CREATE TABLE pumas_nyc_subcom AS
SELECT *, (CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL)) AS SUB_PER
FROM sub_commuters, nyc_pumas
WHERE nyc_pumas.PUMA5ID00=sub_commuters.geo_id2

The CREATE TABLE AS statement let’s us create a new table from the existing two tables – the data table of subway commuters and the feature class table for NYC PUMAs. We select all the fields in both while throwing in the new calculated field, and we join the data table to the feature class all in one step, and via the join we end up with just data from NYC (the data for the rest of the state gets dropped). After that, it’s just a matter of taking our new table and enabling the geometry to make it a feature class (as explained in the previous post).

This seems like it should work – but I discovered another problem. The resulting calculated field that has the percentage of subway commuters per PUMA, SUB_PER, has no data type associated with it. Looking at the schema for the table in SpatiaLite shows that the data type is blank. If I bring this into QGIS, I’m not able to map this field as a numeric value, because QGIS doesn’t know what it is. I have to define the data type for this field. SpatiaLite (SQLite really) doesn’t allow you to re-define an existing field – we have to create and define a new blank field, and the set the value of our calculated field equal to it. Here are the SQL statements to make it all happen:

ALTER TABLE sub_commuters ADD SUB_PER REAL

UPDATE sub_commuters SET SUB_PER=(CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL))

CREATE TABLE pumas_nyc_subcom AS
SELECT * FROM sub_commuters, nyc_pumas
WHERE nyc_pumas.PUMA5ID00=sub_commuters.geo_id2

So, we add a new blank field to our data table and define it as real. Then we update our data table by seting that blank field equal to our expression, thus filling the field with the result of our expression. Once we have the defined calculated field, we can create a new table from the data plus the features based on the ID they share in common. Once the table is created, then we can activate the geometry (right click on geometry field in the feature class and activate – see previous post for details) so we can map it in QGIS. Phew!

QGIS

The third method is to create the calculated field within QGIS, using the new field calculator. It’s pretty easy to do – you select the layer in the table of contents and go into an edit mode. Open the attribute table for the features and click the last button in the row of buttons underneath the table – this is the field calculator button. Once we’re in the field calculator window, we can choose to update an existing field or create a new field. We give the output field a name and a data type, enter our expression SUBWAY / WORKERS_16, hit OK, and we have our new field. Save the edits and we should be good to go. HOWEVER – I wasn’t able to add a calculated fields to features in a SpatiaLite geodatabase without getting errors. I posted to the QGIS forum – initially it was thought that the SpatiaLite driver was read only, but it turns out that’s not the case and so and the developers are investigating a possible bug. The investigation continues – stay tuned. I have tried the field calculator with shapefiles and it works perfectly (incidentally, you can export SpatiaLite features out of the database as shapefiles).

I’m providing the database I created here for download, if anyone wants to experiment.

SpatiaLite and QGIS: Loading, Joining, Mapping Shapefiles and Tables

I stuck with with the Long Term Support Version of QGIS (1.02) last semester while I was teaching, but now I finally have had a chance to experiment with the latest version (1.4) which has a lot of great new features including: improved symbolization, labeling, print layouts, and support for SpatiaLite – a personal (single file) geodaatbase based on SQLite. For a summary of the new QGIS features check out the QGIS blog and this developer’s blog, and for an overview of SpatialLite you can go to the official docs page and this tutorial. The latter will show you the obvious strengths of SpatialLite – the ability to store features and attributes in one container, with the ability to run standard SQL and spatial queries on both. Since that’s covered pretty well, I thought I’d run through a basic operation – how do you load a shapefile and an attribute table in SpatialLite, join them, connect to the database in QGIS and map the data. I’m using the SpatialLite GUI, but for those more inclined you could use the command line tool instead.

Loading shapefile in SpatialiteFire up the GUI, and create a new, empty geodatabase under the File menu.Once we have a container, we can hit the load a shapefile button. I have a census PUMA layer for NYC that I’ve formatted by erasing water features. Click load, go to the path, give the file a nice brief name, and specify the SRID – the EPSG code that specifies what coordinate system my shapefile is in. In this case, it’s 4269 as the layer is in NAD83 (you can check your files by opening the prj file in a text editor or by using the OGR tools).

Table viewOnce it’s loaded, you can expand the listing in the table of contents to see all the field names of the feature, and you can right click on it and choose the edit option to see all of the data in the attribute table.

Next we can load a data table. I have a 2006-2008 ACS census table in tab-delimited text format that I’ve pre-formatted. The table has the number of workers (labor force age 16+) and number of workers who commuted to work via the subway for every PUMA in the State of New York (it’s faster to download the whole state and filter out the city PUMAs later). Hit the load txt/csv button, specify a path, a new table name (subway_commuters), the delimiter used, and load the table. It’s given a different icon in the table of contents (toc), to distinguish a regular data table from a feature class.

spatlite4The next step is to join them together; I already insured that they both share a common, unique identifier; a FIPS code that has a state and PUMA code. If I run a standard SELECT query I can join the tables in a temporary view – but that’s not what I want. I can save the query as a view, but I won’t be able to access the view within QGIS (at least not with this current stable version of SpatialLite, 2.31). What we have to do here is create a brand new table that combines both the puma feature class and the subway commuter table (referred to in Microsoft Access land as a Make Table Query). Here’s the SQL that we type in the command window:

CREATE TABLE pumas_nyc_subcom AS
SELECT *
FROM nyc_pumas, sub_commuters
WHERE PUMA5ID00=GEO_ID2

Execute the query, and we get a message that an empty results set was generated. Uh, ok. But then if we select the database path at the top of the TOC , right-click, and refresh, we’ll see our new combined table, pumas_nyc_subcom, and we can expand it and take a look at the data. The join worked, but we’re not done yet. Right now this is just a regular old data table (notice the icon?) We have to turn this into a feature class next.

Joined and created feature classExpand the fields for the new table in the TOC, select the Geometry field, right click, and check the geometry. We’ll see that it’s MULTIPOLYGON geometry, the projection is still NAD83, and there are 55 features (the non-NYC PUMAS were filtered out during the join, leaving us just with NYC data). Right click on Geometry again, choose the option to Recover Geometry. Specify the geometry type and the SRID, run, refresh the database, and success. A little globe appears next to pumas_nyc_subcom, indicating that it’s now a feature class.

spatlite7

QGIS Spatialite connection interfaceAt this point we can fire up QGIS. In the toolbar for versions post 1.02, there should be a connect to SpatialLite button. Hit connect, add a New database, and browse to get to it. Once it’s loaded, then we can hit connect to connect to it, and we’ll be able to see all feature classes (but NOT data tables, which is why we had to go through the join). Select pumas_nyc_subcom, which has features and data, and click add.

As with any GIS, now we have to symbolize the features to map the subway commuters. Right click on the layer in the table of contents, select properties, and you’ll get to the recently redesigned properties menu. Go to Symbology, map the subway commuters field by graduated values, change some colors, and voila, a map!

QGIS map with data and new labelsThe developers are still experimenting with improvements – there’s a button in the upper right-hand corner of the symbology tab that asks you if you want to try the New Symbology – this is a new layout, with the introduction of graduated color palettes. It’s pretty slick, but still a work in progress (color ranges are assigned from dark to light, with the lowest values getting the darkest color; the opposite of cartographic convention). The same label properties are there too, but you can experiment with the improved labeling engine under the Plugins menu. The automatic placement of labels is vastly improved.

Mapping totals for subway commuters isn’t as interesting as mapping the percentage of commuters in each PUMA who ride the subway. So I’ll share my experiments working with calculated fields (in SpatialLite and QGIS) in my next post.

Natural Earth Vector and Raster Data

I haven’t been posting regularly as I’ve been swamped this semester – but now that it’s coming to an end I should be able to crank out a post or two each month.

I recently saw a message on Maps-L about a new GIS data source, Natural Earth, and just got around to taking a look at it. It’s run by a volunteer organization dedicated to providing free, integrated, public domain map layers for producing high-quality maps at small scales. They have a pretty comprehensive website that includes a blog, feature list, contributor information, and details on how to volunteer.

Natural Earth provides smooth, generalized vector and raster layers at three scales: 1:10m, 1:50m, and 1:110m. See my screen shot of the Delmarva peninsula to see the distinctions (beige area is 110m, red line is 50m, and blue line is 10m).

nat_earth

Having a choice of scales with vector and raster data layers from the same source is a huge plus (many other country-level boundary files available on the web are detailed and suitable for large scale maps, but look messy when you zoom out to a smaller scale). Natural Earth also provide outlines for land and water (including legal water boundaries for all the Pacific islands), hydrographic features generalized to the different scales, ice shelves, urban areas, and several lat/long grid line layers.

For country boundaries they’ve gotten around the tangled issue of country definitions by providing different layers for different definitions, so you can choose the one that’s most appropriate – sovereign states (so, Greenland would be part of the Denmark polygon, Alaska and Puerto Rico part of the US, and French Guiana part of France), countries (Greenland separate from the Denmark polygon, Puerto Rico separate from the US, Alaska part of the US, and French Guiana part of France), and subunits (each place its own polygon). As you move down this hierarchy, places are linked back to their whole (so there are fields in the subunit file that list which country and sovereign state it’s part of).

At this point subdivisions (states / provinces) are only provided for the US and Canada. They do provide some descriptive metadata for each layer on the website, but the metadata doesn’t follow any standardized format for geographic data. The biggest missing link is unique identifiers – none of the countries have ISO or FIPS codes, so there aren’t any fields to join attribute data to for thematic mapping (except country name, which never works smoothly given the amount of variation with names).

Overall this looks like a great resource. Vector data is in shapefile format, raster data is in tiff, and everything is defined as simple WGS 84, so these files should work with almost any GIS package, ready to go.

Update on Some Data Sources

Here’s my last chance to squeeze in a post before the month is over. There have been a lot of changes and updates with some key data sites lately. Here’s a summary:

  • The homepage for gdata, which provides global GIS data that was created as part of UC Berkeley’s Biogeomancer project, has moved to the DIVA-GIS website. DIVA-GIS is a free GIS software project designed specifically for biology and ecology applications, with support from UC Berkeley as well as several other research institutions and independent contributors. It looks like the old download interface has been incorporated into the DIVA-GIS page.
  • The US Census Bureau has recently released its latest iteration of the TIGER shapefiles, the 2009 TIGER/Line Shapefiles. Since they seem to be making annual updates, which has involved changing the URLs around, it may be better to link to their main TIGER shapefile page where you can get to the latest and previous versions of the files.
  • The bureau has released its latest American Community Survey (ACS) data: 2008 annual estimates for geographic areas with 65,000 plus people, and three year 2006-2008 estimates for geographic areas with 20,000 plus people. Available through the American Factfinder.
  • Over the summer, UM Information Studies student Clint Newsom and I created a 2005-2007 PUMA-level New York Metropolitan ACS Geodatabase (NYMAG). It’s available for download on the new Baruch Geoportal, which was re-launched as a public website this past September. It’s a personal geodatabase in Microsoft Access format, so it can only be directly used with ArcGIS. I plan on creating the 2006-2008 version sometime between January and March 2010, and hope to release an Access and SQLite version, as the latest development versions of QGIS now offer direct support for SQlite geodatabases in the Spatialite format (which is awesome!).
  • While it’s not a source for GIS data or attribute tables, it’s still worth mentioning that the CIA World Factbook completely revised their website this past summer. The previous web versions of the factbook took their design cues from the old paper copies of the report. The CIA revamped the entire site and apparently will be using a model of continuous rather than annual updates. It’s a great site for getting country profiles – another good option is the UN World Statistics Pocketbook, which is part of the UNdata page.

UNdata Processing, Calc Data Pilot

I’d downloaded some data from the UNdata website and cleaned it up so I could use it for my class, and thought I’d share some tips here. In many cases when you download data from UNdata you get multiple records for each country; one record for each year for each data point. In order to bring this data into GIS, I needed to re-arrange it to move the years from rows to columns, so that I’d have one record for each country with multiple columns for years.

You can do this in Excel using a pivot table, but since I was working off of my Linux notebook, I accomplished this using the Data Pilot tool in Open Office 3.0’s Calc spreadsheet. Here’s what I did:

  • screenshot1I opened the csv file I downloaded from UNdata in Calc, and accepted the defaults on the text import screen. Once it was imported, I saved the file in a spreadsheet format – you can use Calc’s odt format, or you can save it in Excel xls format if you need to use Excel later. But you have to get out of the csv format – Calc crashed on me a couple of times when I was running the Pilot and creating multiple sheets in the csv.
  • I went up to the Data menu, selected Data Pilot, and Start, which opens the Data Pilot Menu. I clicked the More button to see the full range of options.
  • screenshot3Then it was a simple matter of dragging the field names into the right places. I dragged the country code and name fields into the rows box, the year into the column box (as I wanted to move years to columns), and the actual data field into the values box. Under the options listed under More, I changed the Results drop down box to save the table in a new sheet, and I unchecked all of the boxes listed below (for adding filters, creating totals rows and columns, etc). Then clicked OK.
  • screenshot4Voila! I had my newly formatted table, with one row for each country and one column for each year. But since I’ll be bringing this data into GIS (and will have to save the data in DBF format as I want my students to bring it into QGIS), I need to make sure that my data doesn’t have any funky formatting that may mess up joining my data to a shapefile. So I added a blank worksheet, copied my new pilot table, and did a Paste Special into the blank worksheet and pasted only text and numbers – with formatting, formulas, and anything else funky left out.
  • screenshot5Once I had my plain, reformatted data in my new sheet, I deleted the top row (which had labels for Sum Value and Year) so I’d be left with only one header row, and I changed the field names to something more database friendly (truncating names and removing spaces). Lastly, I deleted the original data sheet and the formatted data pilot table sheet, so I was left with just the final copy.

That’s it! Sort of. Since I now have the year’s in columns, I could create a few calculated fields to show change over time.

But the last piece will be dealing with the country codes. To get a data table with codes from the UNdata website, you have to choose an Add Columns option from their data browser page before you download, as you don’t get the country codes by default. Then, the codes you get could be anything. Since these data tables are coming from dozens of different organizations, agencies, and bureaus within the UN, the country codes will vary based on what that agency did. In some cases I’ve downloaded data that had the ISO two-digit alpha codes, and in other cases I had three digit numerical ISO codes (stored incorrectly as numbers, so leading zeros were dropped).

Most of the tables I’ve been downloading come from from the World Health Organization (WHO), and came with no standardized country codes. Instead, the codes are sequential numbers assigned to the countries in alphabetical order from 1 to 193. Doh! Then, if a new country gets added they tacked on the next available number regardless of the alphabet – so the country of Montenegro is assigned 194, after Zambia which is 193. Typically, data from countries that are not UN members or observers (like Liechtenstein and Vatican City) and are dependencies (Greenland, Falkland Islands, French Polynesia, etc) are not included in the data sets.

So, I’ll be typing in ISO alpha two codes into one of my data tables and will end up with a table that connects their sequential number system to ISO. Then I can bring this bridge and all of the other tables into a database, relate them to the bridge based on the sequential number, and create new tables out of them that have ISO numbers, so I can join them to my GIS file based on ISO. Or I guess I could add the sequential number field to my countries shapefile and join each table to it based on the sequential number.

Anyway – happy Data Piloting (or Pivoting, if you prefer).

Reading List for Geographic Information Course

The fall semester is here, and I’m about to start teaching the class I mentioned in my last post (an information studies course on geographic information). I thought I’d share my reading list and try out the Open Book plugin. I chose my readings based on: my particular audience (undergraduate students from many disciplines with little or no background in geography), relevance (materials appropriate in a hybrid information studies / geography course), cost (wanting to assign the students a single textbook that’s reasonably priced and covers all the bases, and will supplement with other readings), and copyright (staying within the bounds of fair use by not assigning too much from a single work). Here goes:

Making Maps
John Krygier, Denis Wood; The Guilford Press 2005

I decided to go with Krygier and Woods Making Maps as my assigned text book. Since cartography is a visual and technical art, I thought it made sense to use a book that relies on visuals for explanations rather than text. It’s approachable, particularly for my students who won’t be coming from a geography background, affordable, wonderfully quirky, and covers all of the essentials of the geographic framework and map interpretation and design independent of specific GIS software.

Place
Tim Cresswell; Blackwell Publishing Limited 2004

I’m using the first chapter of Cresswell’s book as a succinct introduction to how individuals define places, but would recommend the rest of the text for classes that cover geographic concepts and methods.

Georeferencing
Linda L. Hill; The MIT Press 2006

I’m assigning the second and third chapters of Hill’s book. The second chapter, which discusses how people process, store, and use geographic information is the best summary of this topic that I’ve ever seen, and the third chapter is a good overview of the different types of geographic objects. As a librarian-geo nerd, I love the chapters that deal with coordinate metadata and gazetteers, but won’t be using them in this class.

Image Of The City
Kevin Lynch; M.I.T. Press 1960

This is an urban planning / design classic, and I’ll have my students read the summary of Lynch’s city elements (based on his research, Lynch proposed that people mentally break the urban environment down into five types of elements in order to organize and navigate the city: paths, barriers, districts, nodes, and landmarks).

Realms, Regions And Concepts

This is the only traditional textbook that I’ll be borrowing from (I actually used it when I was a Freshmen, way back when). While I’m using the previous three books to discuss egocentric places, or how we as individuals conceive of place, I’m using the first chapter of this book to give the students an overview of geocentric places – the formal, defined hierarchy of places that exist in the world – and to introduce them to the concept of regions.

How To Lie With Maps
Mark S. Monmonier; University Of Chicago Press 1991

This has become a modern classic and I almost assigned it as a second textbook. I am assigning the chapter on maps for propaganda as a background to our discussion on map interpretation and communication, and will later use the chapter on census maps to talk about the effects of data classification and choice of enumeration units.

Desktop GIS
Gary Sherman; Pragmatic Bookshelf 2008

This is the only software book that I’ll be using chapters from, so the students have some formal guide for using QGIS (in addition to the QGIS documentation). I’m using the chapters on vector and raster data.

Key Concepts And Techniques In GIS
Jochen Albrecht; Sage Publications Ltd 2007

This concise, excellent book deals strictly with the concepts and principles behind GIS. I’m using the chapters on spatial search and geoprocessing, but would recommend the entire book for any GIS course, novice to advanced.

In addition to chapters from these books, I’ll also be using:

  • “Revolutions in Mapping” by John Noble Wilford, National Geographic Feb 1998 – a great overview of the history of cartography
  • USGS GIS poster – if there is such a thing, this is a “web classic” and an accessible intro to GIS
  • One article from a scholarly journal and one article from a mass market magazine to illustrate how geographic research is covered and used
  • And for shameless self-promotion, I summary I wrote about US Census data – In Three Parts

Finally, an honorable mention:

A Primer Of GIS
Francis Harvey; The Guilford Press 2008

If I was teaching an introductory GIS course in a geography or earth sciences department, this is certainly the book I would use, and for those of you in that boat I’d recommend checking it out. It does an excellent job of covering GIS principles without being software specific, contains exercises at the end of each chapter, and is well written and affordable. Since the scope of my course is broader than GIS and my audience more general and diverse, I opted to leave it out (but may still assign a chapter).

Geographic Information: Literacy and Systems

I’ve been spending a good portion of my summer working on the course that I’m going to teach this fall. The library at my college offers credit courses in Information Studies which students can take as a minor – they can choose two 3000 level courses and then a 4000 level capstone course. My course is a 3000 level special topics course which I’ve called Geographic Information: Literacy and Systems.

My situation is rather peculiar. I can’t teach this course as a pure GIS course, since it’s an information studies class and not geography or earth sciences. Beyond that, my college does not have a geography department, and earth sciences are not an individual department but are combined with other natural and physical sciences. With the exception of a regional geography class offered by the anthropology department, my college doesn’t offer geography instruction. So even if I could teach a pure GIS class, it’s unlikely that any of the students would have any foundational geographic knowledge.

I also can’t teach the course as a “library” class where I’m training people to be map or GIS librarians, because that isn’t the point of the info studies minor. The minor is meant to introduce students to the foundational principles of information – what is information, how do we search for it, organize it, what is its context in society, etc. I also could not teach the course as a basic software class, as that isn’t really appropriate for a college course. In short, I couldn’t find a model that I could follow, as what I’m doing falls outside these traditional realms.

So I decided to build the course around the concept of geographic information where I’ll cover some foundational geography,cartography, and GIS from an information science perspective that encompasses:  organization, search and retrieval, data processing, and assessment and analysis of GI. I’ve divided the class into four units that cover geographic information and fundamental geography, maps as information objects, and two units of GIS. In the first GIS unit we’ll cover the theoretical aspects and the basics of using the software with datasets that I’ll provide. In the second unit we’ll deal with the nitty gritty of actually searching for and processing freely available GIS data. In the last couple of weeks I’ll spend some time on web mapping and on geographic analysis and research.

Many of the concepts that I’ll be teaching are things that I never formally learned in a college course, such as a discussion of the kinds of administrative and statistical divisions that exist in the world, why they exist, and how data is collected for them. The second GIS unit on data processing is something that I feel is never adequately covered in GIS classes, but is essential for doing just about anything in GIS. I think this is also poignant in information studies, as it involves a discussion of the difference between data and information and how you can turn one into the other.

I’ve decided to use all open source software. Since these are undergraduate students who probably won’t be entering a geography related field, and we are a commuter campus where students have to make special trips to get to computer labs, I didn’t see any logic in using ArcGIS. With the open source software they can use it anywhere and there will be a better chance that they’ll use it after the course is over (and after they graduate). I’ve opted to go with QGIS as it covers all the bases I need. I liked gvSIG but had too many problems with the map layout – I might be able to cut my way through them, but can sophomore business and english majors? QGIS is also more thoroughly documented (in english), which is important since this is an introductory class.

I’m using Krygier and Woods Making Maps as my textbook, along with a few chapters here and there from other texts. I have looked to the pages Krygier’s created for his courses for guidance, and like the stream of consciousness style he used for writing his notes. I’ll post an annotated reading list later.

Since I’m breaking molds, I’ve also decided not to use Blackboard to organize the whole course and am using a blog and various other bits and pieces of software for creating assignments, organizing the roster, etc. If you’re interested you can follow along on my course blog – (only students can register). Classes start on August 31st…

Print Composer in QGIS – ACS Puma Maps

ny_youth_pumasI wrapped up a project recently where I created some thematic maps of 2005-2007 ACS PUMA level census data for New York State. I decided to do all the mapping in open source QGIS, and was quite happy with the result, which leads me to retract a statement from a post I made last year, where I suggested that QGIS may not be the best for map layout. The end product looked just as good as maps I’ve created in ArcGIS. There were a few tricks and quirks in using the QGIS Print Composer and I wanted to share those here. I’m using QGIS Kore 1.02, and since I was at work I was using Windows XP with SP3 (I run Ubuntu at home but haven’t experimented with all of these steps yet using Linux). Please note that the data in this map isn’t very strong – the subgroup I was mapping was so small that there were large margins of errors for many of the PUMAs, and in many cases the data was suppressed. But the map itself is a good example of what an ACS PUMA map can look like, and is a good example of what QGIS can do.

  • Inset Map – The map was of New York State, but I needed to add an inset map of New York City so the details there were not obscured. This was just a simple matter of using the Add New Map button for the first map, and doing it a second time for the inset. In the item tab for the map, I changed the preview from rectangle to cache and I had maps of NY state in each map. Changing the focus and zoom of the inset map was easy, once I realized that I could use the scroll on my mouse to zoom in and out and the Move Item Content button (hand over the globe) to re-position the extent (you can also manually type in the scale in the map item tab). Unlike other GIS software I’ve experimented with, the extent of the map layout window is not dynamically tied to the data view – which is a good thing! It means I can have these two maps with different extents based on data in one data window. Then it was just a matter of using the buttons to raise or lower one element over another.
  • Legend – Adding the legend was a snap, and editing each aspect of the legend, the data class labels, and the categories was a piece of cake. You can give your data global labels in the symbology tab for the layer, or you can simply alter them in the legend. One quirk for the legend and the inset map – if you give assign a frame outline that’s less than 1.0, and you save and exit your map, QGIS doesn’t remember this setting if when you open your map again – it sets the outline to zero.
  • Text Boxes / Labels – Adding them was straightforward, but you have to make sure that the label box is large enough to grab and move. One annoyance here is, if you accidentally select the wrong item and move your map frame instead of the label, there is no undo button or hotkey. If you have to insert a lot of labels or free text, it can be tiresome because you can’t simply copy and paste the label – you have to create a new one each time, which means you have to adjust your font size and type, change the opacity, turn the outline to zero, etc each time. Also, if the label looks “off” compared to any automatic labeling you’ve done in the data window, don’t sweat it. After you print or export the map it will look fine.
  • North Arrow – QGIS does have a plugin for north arrows, but the arrow appears in the data view and not in the print layout. To get a north arrow, I inserted a text label, went into the font menu, and chose a font called ESRI symbols, which contains tons of north arrows. I just had to make the font really large, and experiment with hitting keys to get the arrow I wanted.
  • Scale Bar – This was the biggest weakness of the print composer. The scale bar automatically takes the unit of measurement from your map, and there doesn’t seem to be an option to convert your measurement units. Which means you’re showing units in feet, meters, or decimal degrees instead of miles or kilometers, which doesn’t make a lot of sense. Since I was making a thematic map, I left the scale bar off. If anyone has some suggestions for getting around this or if I’m totally missing something, please chime in.
  • Exporting to Image – I exported my map to an image file, which was pretty simple. One quirk here – regardless of what you set as your paper size, QGIS will ignore this and export your map out as the optimal size based on the print quality (dpi) that you’ve set (this isn’t unique to QGIS – ArcGIS behaves the same way when you export a map). If you create an image that you need to insert into a report or web page, you’ll have to mess around with the dpi to get the correct size. The map I’ve linked to in this post uses the default 300 dpi in a PNG format.
  • Printing to PDF – QGIS doesn’t have a built in export function for PDF, so you have to use a PDF print driver via your print screen (if you don’t have the Adobe PDF printer or a reasonable facsimile pre-installed, there are a number  of free ones available on sourceforge – PDFcreator is a good one). I tried Adobe and PDFcreator and ran into trouble both times. For some reason when I printed to PDF it was unable to print the polygon layer I had in either the inset map or the primary map (I had a polygon layer of pumas and a point layer of puma centroids showing MOEs). It appeared that it started to draw the polygon layer but then stopped near the top of the map. I fiddled with the internal settings of both pdf drivers endlessly to no avail, and after endless tinkering found the answer. Right before I go to print to pdf, if I selected the inset map, chose the move item content button (hand with globe), used the arrow key to move the extent up one, and then back one to get it to it’s original position, then printed the map, it worked! I have no idea why, but it did the trick. After printing the map once, to print it again you have to re-do this trick. I also noticed that after hitting print, if the map blinked and I could see all the elements, I knew it would work. But, if the map blinked and I momentarily didn’t see the polygon layer, I knew it wouldn’t export correctly.

Despite a few quirks (what software doesn’t have them), I was really happy with the end result and find myself using QGIS more and more for making basic to intermediate maps at work. Not only was the print composer good, but I was also able to complete all of the pre-processing steps using QGIS or another open source tool. I’ll wrap up by giving you the details of the entire process, and links to previous posts where I discuss those particular issues.

I used 2005-2007 American Community Survey (ACS) date from the US Census Bureau, and mapped the data at the PUMA level. I had to aggregate and calculate percentages for the data I downloaded, which required using a number of spreadsheet formulas to calculate new margins of error; (MOEs). I downloaded a PUMA shapefile layer from the US Census Generalized Cartographic Boundary files page, since generalized features were appropriate at the scale I was using. The shapefile had an undefined coordinate system, so I used the Ftools add-on in QGIS I converted the shapefile from single-part to multi-part features. Then I used Ftools to join my shapefile to the ACS data table I had downloaded and cleaned-up (I had to save the data table as a DBF in order to do the join). Once they were joined, I classified the data using natural breaks (I sorted and eyeballed the data and manually created breaks based on where I thought there were gaps). I used the Color Brewer tool to choose a good color scheme, and entered the RGB values in the color / symbology screen. Once I had those colors, I saved them as custom colors so I could use them again and again. Then I used Ftools to create a polygon centroid layer out of my puma/data layer. I used this new point layer to map my margin of error values. Finally, I went into the print composer and set everything up. I exported my maps out as PNGs, since this is a good image format for preserving the quality of the maps, and as PDFs.

Formulas for Working With Census ACS Data in Excel / Calc

After downloading US census data, you often need to reformat it before using it. It’s quite common that you download files where the population is broken down by gender and age, and you need to aggregate the data to get a total or divide a particular characteristic to get a percent total. This is pretty straightforward if you’re working with decennial census data, but data from the American Community Survey (ACS) is a little trickier to deal with since you’re working with estimates that have a margin of error. When creating new data, you also have to calculate what the margin of error is for your derived numbers. I’ll walk through some examples of how you would do this in a spreadsheet (the formulas below will work in either Excel or Calc).

Creating an Aggregate

We’ll use the following data in our example:

screenshot1

We have the total population of people three years and older who are enrolled in school, and a breakdown of this population enrolled in grades 1 through 4 and grades 5 through 8 in a few counties in New York, with margins of error for each data point. Our data is from the 3 year averaged 2005-2007 American Community Survey.

Let’s say we want to create a total for students who are enrolled in grades 1 through 8 for each county. We create a new column and sum the estimates for each county with the formula e3+g3, or sum(e3:g3).

To calculate a margin or error (MOE) for our grade 1 to 8 data, first we have to use the find and replace command to get rid of the “+/-” signs in the MOE column, so our spreadsheet will treat our values as numbers and not text (this is an issue if you downloaded the data as an Excel file – if you download a txt file the +/- is not included). Depending on the dataset you’re working with, you may also need to replace dashes, which represent data that was null or not estimated.

Once the data is cleaned up, we can insert a new column with this formula:

=SQRT((F3^2)+(H3^2))

This calculates our new margin of error by squaring the moes for each of our data points, summing the results together, and taking the square root of that sum. In other words,

=SQRT((MOE1^2)+(MOE2^2))

Once that’s done, you may want to round the new MOE to a whole number.

Creating a Percent Total

Let’s calculate the percentage of the population 3 years and older enrolled in school that are in grades 1 through 8. Based on what we have thus far (I hid the columns E,F,G, and H for grades 1-4 and 5-8 in this screenshot, as we don’t need them):

screenshot-2

We insert a new column where we divide our subgroup by the total, as you would expect – I3/C3. In the next column we insert the following formula to create a MOE for our new percent total:

=(SQRT((J3^2)-((K3^2)*(D3^2))))/C3

This one’s a little weightier than our last formula. We’re taking the square of our percent total (K3) and the square of the MOE of the total population (D3), multiplying them together, then subtracting that number from the square of the MOE of our subgroup (J3). Then we take the square root of the whole thing, then divide it by our total population (C3). If you’re saying – HUH? Maybe this is clearer:

=(SQRT((MOEsubset^2)-((PercentTotal^2)*(MOEtotalpop^2))))/TotalPop

Finally, we have something like this:

screenshot-3

Based on our data, we can say things like “There were approximately 30,556 students enrolled in 1st through 8th grade per year in Dutchess County, NY between 2005 and 2007, plus or minus 1,184 students. An estimated 37% of the population enrolled in school in the county was in the 1st through 8th grade, plus or minus 1%.” The ACS estimates have a 90% confidence interval.

Wrap Up

In this example we worked with aggregating and calculating percentages based on characteristics. We could also use these same formulas to aggregate data by geography, if we wanted to add the characteristics for all the counties together.

For the full documentation on working with ACS data, take a look at the appendix in the Census’ ACS Compass Guide, What General Data Users Need to Know. It provides you with the formulas in their proper statistical notation (for those of you more mathematically inclined than I) and includes formulas for calculating other kinds of numbers, such as ratios and percent change. It does provide you with worked-through examples, but not with spreadsheet formulas. I used their examples when I created formulas the first time around, so I could compare my formula results to their examples to insure that I was getting it right. I’d strongly recommend doing that before you start plugging away with your own data – one misplaced parentheses and you could end up with a different (and incorrect) result.