Common Map Projection Definitions

April 3rd, 2011

Just finished teaching my second intro to GIS workshop using open source software (QGIS). Coordinate systems and map projections are always one of the toughest hurdles for novices. It’s hard enough just teaching the basic concepts using the existing CRS libraries in QGIS; having to custom define a number of common projected coordinate systems makes the process more daunting. For example, when we’re producing a thematic map of the US I want to use Lambert Conformal Conic for North America, but I have to give the students a proj4 definition in a text file and explain how you have to comb through the Spatial Reference site to find it.

For reference purposes and to make things a bit simpler, I’m providing some codes and definitions for some common coordinate reference systems (common for the participants in class) in this post. You can look up projection definitions at Spatial Reference and use the map projection resources at Radical Cartography and the USGS to see depictions and explanations of different systems. I created the projection images in this post using NASA’s G.Projector, a lightweight cross-platform tool for experimenting with projections.

The following CRS are pretty common and are included in the EPSG library used by QGIS – no need to custom define them, just search by name and code (the EPSG codes are ID codes for each CRS):

Geographic Coordinate Systems:

WGS84 (EPSG 4326): World Geodetic System of 1984, commonly used by organizations that provide GIS data for the entire globe or many countries and used by most web-based mapping engines (Google Maps)

NAD83 (EPSG 4269): North American Datum of 1983, commonly used by most US and Canadian federal government agencies (the US Census Bureau in particular) that provide GIS data

Since WGS84, NAD83, and all other geographic coordinate systems are unprojected, they will all look like Equirectangular or “Plate Carrée” projections, which preserve distances along meridians.
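As a rough illustration of why unprojected data looks this way, the sketch below treats longitude and latitude directly as x and y, which is essentially what a Plate Carrée display does. The spherical radius and the function names are my own assumptions for illustration; they are not anything QGIS exposes.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean spherical radius, illustration only


def plate_carree(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Map longitude/latitude in degrees to x/y in meters by simple scaling."""
    return radius * math.radians(lon_deg), radius * math.radians(lat_deg)


def true_east_west_m(lat_deg, dlon_deg, radius=EARTH_RADIUS_M):
    """Ground length of an east-west arc of dlon_deg degrees at a latitude."""
    return radius * math.cos(math.radians(lat_deg)) * math.radians(dlon_deg)


# One degree of latitude is drawn at the same length everywhere on the map.
north_south_equator = plate_carree(0, 1)[1] - plate_carree(0, 0)[1]
north_south_60 = plate_carree(0, 61)[1] - plate_carree(0, 60)[1]

# One degree of longitude at 60N is drawn wider than it is on the ground.
drawn_width = plate_carree(1, 60)[0] - plate_carree(0, 60)[0]
stretch_at_60 = drawn_width / true_east_west_m(60, 1)

print(round(north_south_equator), round(north_south_60), round(stretch_at_60, 2))
```

North-south spacing stays constant, while east-west distances are stretched by 1/cos(latitude), which is why areas near the poles look wider than they are.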

Local Projected Coordinate Systems:

NAD 83 / New York Long Island (ft US) (EPSG 2263): The State Plane zone that covers Long Island and New York City is used by all NYC agencies that produce GIS data. Many city and state agencies produce data in their specific state plane zone. An alternate projection, EPSG 32118, represents the same zone but uses meters instead of feet.

NAD 83 / UTM Zone 18N (EPSG 26918): An alternative to State Plane that is better for larger regions; satellite or ortho imagery is often provided based on the UTM zone where the tile is. UTM Zone 18N covers much of the east coast of the US. An alternate projection, EPSG 32618, uses WGS 84 as a datum instead of NAD 83.

The following CRS are common continental and global projected coordinate systems that are NOT included in the EPSG library that is part of QGIS; you have to custom define them using the proj4 definitions.

North America Lambert Conformal Conic: Perhaps the most common map projection for North America; a conformal map preserves angles. LCC can be modified for optimally displaying specific countries (e.g. USA and Canada) or other continents (e.g. South America, Asia, etc.)

+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs

North America Albers Equal Area Conic: an alternative to LCC; all areas in an AEAC map are proportional to the same areas on the Earth. Can also be modified for specific countries or other continents. Visually it looks more “compact” east to west versus LCC.

+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs

Robinson: a global map projection used by the National Geographic Society for its world maps from 1988 to 1998. The Robinson map is a compromise projection; it doesn’t preserve any aspect of the earth precisely but makes the earth “look right” visually based on our common perceptions.

+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

Mollweide: a global map projection that preserves areas, often used in the sciences for depicting global distributions on small maps.

+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
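To make the definitions above a little less opaque, here is a minimal sketch that splits a proj4 string into its parameters. The parse_proj4 function is a hypothetical helper written for this post, not part of QGIS or the PROJ library. For the conic projections, lat_1 and lat_2 are the standard parallels, lat_0 is the latitude of origin, and lon_0 is the central meridian; those are the values you would change to adapt LCC or Albers to another country or continent.

```python
def parse_proj4(definition):
    """Split a proj4 string into a dict of parameter names and values."""
    params = {}
    for token in definition.split():
        if not token.startswith("+"):
            continue  # skip anything that is not a +key or +key=value token
        key, sep, value = token[1:].partition("=")
        # Flags such as +no_defs carry no value; store them as True.
        params[key] = value if sep else True
    return params


lcc = parse_proj4(
    "+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 "
    "+ellps=GRS80 +datum=NAD83 +units=m +no_defs"
)

# Projection name, standard parallels, and central meridian.
print(lcc["proj"], lcc["lat_1"], lcc["lat_2"], lcc["lon_0"])
```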

Calculating Standard Deviation for Summarized Data

March 29th, 2011

This isn’t a geospatial issue per se, but I thought it would be useful to share. I have a spreadsheet where I’m tracking course evaluation responses for the GIS workshops I’m teaching. I have to report the total number of responses, the mean, and standard deviation for each question. The worksheet I designed tracks aggregate responses: the total number of people who responded to each question in each category, on a scale of 5 (strongly agree) to 1 (strongly disagree). For example:

The problem I had was that Excel’s standard deviation formula doesn’t work for summaries – you need to give the formula individual responses or raw scores for arguments. In other words:

So I was fixated on trying to find a formula, through the help and by searching the forums, where somehow I could calculate standard deviation using summaries or aggregates. It finally dawned on me (duh) that I could plug in the standard deviation formula myself and modify it.

To calculate the standard deviation for an entire population you compute the difference of each data point from the mean and square each result. Then you calculate the average of all these values and take the square root.

So for each question I subtract the mean score for that question from the score category for that question, square it, and then multiply the result by the number of people who answered in that category. So if 10 people strongly agreed with the question and strongly agreed is associated with a score of 5, I subtract the average score (4.71) from 5, square the result, and multiply it by 10 (since ten people responded that they strongly agreed).

((score value – mean score)^2)*respondents

I perform the same operation for each category. So if 4 people said they agreed with a question and agreed is a value of 4, subtract 4.71 from 4, square it, and multiply by 4. After I do this for each category, I sum the values, divide the sum by the total number of respondents, and take the square root of the whole thing.

SQRT (((((score5 – mean score)^2)*respondents5)+(((score4 – mean score)^2)*respondents4))/total respondents)

SQRT (((((5-4.71)^2)*10)+(((4-4.71)^2)*4))/14)

For my spreadsheet the formula is repeated for each of the 5 possible scores, references are used to pull in the mean and respondent values from other cells, and I round the entire result to 1 decimal place. The number of parens makes it a little confusing; I’ve inserted a color-coded image below so it’s a little clearer.

=ROUND((SQRT(((((5-H9)^2)*B9)+(((4-H9)^2)*C9)+(((3-H9)^2)*D9)+(((2-H9)^2)*E9)+(((1-H9)^2)*F9))/G9)),1)

Given all that can go wrong with one misplaced paren, I tested this by inputting some raw scores by hand and running the STDEVPA formula to verify that I get the same result.
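For anyone who would rather check the arithmetic outside of a spreadsheet, the same calculation can be sketched in Python. This is only an illustration under my own assumptions: the summarized_stats function and the counts dictionary are hypothetical names, and the counts are the example values used above (10 people answering 5 and 4 people answering 4).

```python
import math
import statistics


def summarized_stats(counts):
    """Return (n, mean, population std dev) from a {score: respondents} dict."""
    n = sum(counts.values())
    mean = sum(score * k for score, k in counts.items()) / n
    # Weight each squared deviation by the number of respondents in that category.
    variance = sum(k * (score - mean) ** 2 for score, k in counts.items()) / n
    return n, mean, math.sqrt(variance)


counts = {5: 10, 4: 4, 3: 0, 2: 0, 1: 0}
n, mean, sd = summarized_stats(counts)

# Cross-check: expand the summary back into raw scores and use the
# standard library's population standard deviation.
raw_scores = [score for score, k in counts.items() for _ in range(k)]
check = statistics.pstdev(raw_scores)

print(n, round(mean, 2), round(sd, 1), round(check, 1))  # 14 4.71 0.5 0.5
```

Expanding the counts back into raw scores mirrors the manual test described above, just without the typing.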

Relating ZIP Codes / ZCTAs to PUMAs

March 19th, 2011

Ever since I created the Google Maps finding aid for census data for NYC PUMAs and the associated PUMA – NYC neighborhood names maps, I’ve received several requests for tables or maps that relate PUMAs to ZIP Codes. These are usually from non-profits in NYC who have lists of donors, members, or constituents with addresses, and they want to relate the addresses (using the ZIP) to recent demographic data from American Community Survey (ACS) for the broader neighborhood where the ZIP is located.

The problem is that ZIP Codes are an all around pain. They actually don’t exist as areas with distinct boundaries; ZIP Codes are all address based, with ZIPs tied to addresses along street segments. The USPS doesn’t publish these tables or create maps; they contract this out for private companies to do, who turn around and sell these products for hefty fees.

Fortunately the Census Bureau has used these address tables to create approximations of ZIP Codes that they call ZCTAs or ZIP Code Tabulation Areas. ZCTAs are aggregates of census blocks that attempt to mimic ZIP Codes that exist as areas; codes associated with specific single-point firms or organizations are dropped. Since ZIPs were created by the USPS, ZCTAs do not nest or mesh with any census geography; they cross PUMA, county, and in some cases even state boundaries. They are also less stable than census geography, with frequent changes, and as statistical areas they vary widely in area and population. For this reason ZCTA data is only published every ten years in the decennial census; it’s not included in the ACS (so far).

With these caveats in mind, I used the Missouri Census Data Center’s MABLE/GEOCORR engine to correlate ZCTAs with PUMAs. While the interface looks a little retro and daunting, it’s actually pretty simple. You choose the state, the two geographies you want to relate, the weighting method for allocating one to the other, and an output format that includes CSV or HTML. I also used an option that lets you type in FIPS codes for the counties you want, so I didn’t end up with the entire state.

This method was the way to go, as they give you the option to allocate geographies based on population and not simply land area; each ZCTA was allocated to PUMAs based on where the majority of the ZCTA’s population lived using 2000 census block data. The final output contains one row for each ZCTA to PUMA combination. So you had multiple rows for ZCTAs that weren’t contained within a single PUMA, and for each of those ZCTAs you had fields that showed the percentage of the ZCTA’s population that lived in each PUMA (along with the actual population number) as well as the percentage of the PUMA’s population that lived in that ZCTA.

I took that table and cleaned it up in a spreadsheet, so that I was left with one row for each ZCTA, where the ZCTA was allocated to one PUMA based on where the majority of its population lives. I used some ZCTA and PUMA boundaries that I had originally downloaded and subsequently cleaned up from the 2009 TIGER shapefiles page, added them to QGIS, joined the ZCTA allocation table to the ZCTA geography, and mapped the result. I color-coded ZCTAs so that clusters of ZCTAs within a particular PUMA had the same color. Then I overlaid the PUMA boundaries on top to see how well they corresponded.
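The spreadsheet cleanup step, reducing the table to one row per ZCTA, can also be sketched in a few lines of Python. Everything in the sample below is invented for illustration: the column names (zcta, puma, pop, zcta_share) and the values are hypothetical and do not match the actual MABLE / GEOCORR output headers, so you would need to adjust them to the file you download.

```python
import csv
import io

# Hypothetical extract: one row per ZCTA-to-PUMA combination, with the share
# of the ZCTA's population living in that PUMA. All values are made up.
SAMPLE = """zcta,puma,pop,zcta_share
10001,03807,15000,0.88
10001,03808,2000,0.12
11201,04004,30000,0.60
11201,04006,20000,0.40
"""


def allocate_zctas(rows):
    """Keep, for each ZCTA, the PUMA holding the largest share of its population."""
    best = {}
    for row in rows:
        share = float(row["zcta_share"])
        current = best.get(row["zcta"])
        if current is None or share > current[1]:
            best[row["zcta"]] = (row["puma"], share)
    return best


allocation = allocate_zctas(csv.DictReader(io.StringIO(SAMPLE)))
for zcta, (puma, share) in sorted(allocation.items()):
    print(zcta, puma, share)
```

Keeping the share alongside the assigned PUMA is handy, since a low share flags the ZCTAs where the allocation is weakest.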

In the end, they didn’t correspond all that well. There was a fairly good relationship in Manhattan, an OK relationship in Queens and Staten Island, and a rather lousy relationship in the Bronx and Brooklyn. I overlaid greenspace and facilities (airports, shipyards, etc.) boundaries I had, and that made some difference; you could see in some areas where ZCTAs overlapped two PUMAs that the overlap coincided with parks, cemeteries, or other areas with low or no residential population in one of the PUMAs.

I’ve posted both sets of tables, maps, and some instructions on the NYC neighborhoods resource page. You can use the original MABLE / GEOCORR table to judge where allocations were good and where they were not so good based on population. For now, the engine is still based on 2000 Census geography and data. Even though the Census has started releasing 2010 TIGER files based on 2010 Census geography, ZCTAs and PUMAs are often some of the last geographies to be updated; current releases of the ACS are still based on the 2000 geographies. Stay tuned to the Census Bureau and MCDC websites for news on updates, and keep MABLE / GEOCORR in mind if you want to create lists to relate census geographies by population or land area.

Some 2010 Census Updates

February 7th, 2011

Some geography updates to pass along regarding new US Census data:

  • The Census has released a few 2010 map widgets that you can embed in web pages. One shows population change, density, and apportionment for the whole country at the state level, while the other shows population, race, and Hispanic change for states at the county level. As of this post only four states are ready (LA, MS, NJ, and VA) but they’ll be adding the rest once they’re available.
  • The 2010 TIGER Line Files are starting to be released; they’ve changed the download interface a little bit based on user feedback. Most summary levels / geographic areas are available; some (like ZIP Codes and PUMAs) will be released later this year.
  • They’re also rolling out the new interface for the American Factfinder; currently you can get 2000 Census data, some population estimates, and the 2010 Census data as it becomes available. Other datasets like the American Community Survey and Economic Census will be added over time. Some maps and gov docs librarians have expressed concern about the change – apparently when you download the data from the new interface the FIPS codes are not “ready to go” for joining to shapefiles; there’s one long geo id that has to be parsed. The other concern is that the 1990 Census won’t be carried over into the new interface at all. The original American Factfinder is slated to come down towards the end of this year.
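On the geo id issue mentioned in the last item, here is a minimal sketch of the kind of parsing involved. It assumes the long id looks like "0500000US36061", a summary level prefix followed by "US" and then the FIPS code; that format is my assumption about the download, so check it against the actual file. The split_geo_id function is a hypothetical helper.

```python
def split_geo_id(geo_id):
    """Split a long geo id at 'US' and return (prefix, fips_code)."""
    prefix, sep, fips = geo_id.partition("US")
    if not sep or not fips:
        raise ValueError("no 'US' separator found in %r" % geo_id)
    return prefix, fips


# Assumed example: a county-level id for New York County (state 36, county 061).
prefix, fips = split_geo_id("0500000US36061")
state_fips, county_fips = fips[:2], fips[2:]

print(prefix, fips, state_fips, county_fips)
```

The fips value is the piece you would join to the FIPS field in a county shapefile's attribute table; keep it as text so leading zeros survive.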


QGIS – Workshop Plans and Updates

December 3rd, 2010

I’ve been hacking away for several months now at creating the day-long GIS practicum / workshop using QGIS that I hope to offer on my campus in the spring. I’ve finally finished it and am just working out the administrative details. My hope is that after completing the workshop, participants will have enough knowledge to then go out on their own and work on their own projects (with the tutorial manual to fall back on). The workshop will consist of five parts:

  • Part 1: General introduction and overview to GIS
  • Part 2: Introduction to GIS Interface (learn how to navigate the interface: adding data, layering data, symbolization, changing zoom, viewing attributes, viewing attribute table, making basic selections, difference between data formats, organizing projects and data)
  • Part 3: GIS Analysis (using site selection example in NYC, basic geoprocessing tasks, attribute table joins, plotting coordinate data, buffers, basic statistics, advanced selection)
  • Part 4: Thematic mapping (using US states as an example, map projections, coordinate systems, data classification, symbolization, calculated fields, labeling, map layouts)
  • Part 5: Going Further with GIS (exploring and evaluating online sources for free data, exploring open source and ArcGIS software resources for learning more)

I designed the workshop around QGIS 1.5, but now that version 1.6 is out I’ll have to go back and make a few tweaks. Details about the new version and recent updates are available HERE. For my purposes, the most noteworthy changes are:

  • New operators in the field calculator (like concatenate)
  • Some improvements to the measurement tools
  • The ability to view non-spatial attribute tables
  • Support for color ramps for symbolization
  • New classification schemes (including natural breaks!)

US Census Release Schedule

October 22nd, 2010

There are some big US Census data updates coming up. Thought I’d offer a summary:

  • 2009 Annual American Community Survey (ACS): released in Sept, data for geographies that have at least 65k people.
  • 2005-2009 5 Year ACS: to be released in Dec 2010, this will be the first release of five year estimate data, which goes down to the census tract level on the American Factfinder, and to block groups via downloadable summary files. From this point forward we’ll have annually updated data for small areas like tracts.
  • First 2010 Decennial Census numbers: state population counts will be released before Dec 31st. The 2010 Census figures will be rolled out over a three year period. This is the first decennial census in many decades that will consist only of 100% count short-form questions that cover basic demographic variables. For anything else we’ll have to turn to the ACS.
  • 2007-2009 3 Year ACS: to be released in Jan 2011 for all geographies that have at least 20k people.

For more details you can check out the official release schedules for the 2010 Census and the ACS. This helpful comparison table guides you in deciding whether to use the 1 year, 3 year, or 5 year estimates for your particular needs.

Freely Available World Bank Country Data

September 11th, 2010

This actually happened a little while ago, but for various reasons I haven’t been able to keep up with posting…

Our library had been subscribing to the WDI (World Development Indicators) database from the World Bank, but we were recently informed that the product was being discontinued and all of the data from the WDI and a number of other World Bank datasets would now be freely available from their data portal at http://data.worldbank.org/.

You can download an indicator for all countries by browsing through a list of all 300, or drill down by broad topics. Select an indicator and you can view a table with the most recent data, or a graduated circle map. If you download a table you can choose between an Excel or XML format. If you download the Excel format you get all years for all countries for that particular indicator from 1960 to present; but for many indicators you end up with a lot of null values up until this decade. If you go the XML route, the nulls are omitted and only years with data are provided. Unfortunately, in neither case do you get any unique identifiers like an ISO code.

Fortunately, power users can opt to download an entire data set, such as all of the WDI Indicators, in one file via their data catalog. In this case you have the option for Excel (xlsx only) or CSV, and the records I looked at DID contain ISO codes for each country (3 letter alpha). It looks like they’re also letting people tap into an API, so you can build web applications that harness the data directly from their repository.

In addition to browsing through indicators, you also have the ability to pull up a profile for a particular country to view several indicators for one particular place. They have a snazzy dashboard with stats, charts, and a reference map.

NYC Subway and Transit GIS Layers

July 24th, 2010

I’ve started outlining a one-day, introductory GIS practicum / workshop that I hope to offer in the coming academic year. One of the primary examples I want to use in the workshop is site selection for a retail store, and I thought it would be great to use a subway layer as part of the exercise. But alas, I searched high and low for a layer late last year (for a site selection project) and couldn’t find a publicly available one. I had purchased some proprietary layers, but really don’t want to use them for this workshop because I want to be able to freely distribute all of the materials to anyone; the layer I purchased is also outdated now because the MTA cut many services (including two subway lines) last month.

But thanks to Steve Romalewski at the CUNY Mapping Service, there’s now an alternative! Steve’s work is a HUGE contribution to the GIS community in New York and fills a glaring hole in the city’s collection of freely available GIS data. The MTA does host a data feed service (based on the General Transit Feed Specification created by Google) where it provides the geography of all its transit services, among other things. Steve downloaded and processed this raw data and turned it into shapefiles. He quickly discovered that it required a fair amount of scrubbing to be usable, and he’s cleaned it up and documented the entire process in great detail in several posts on his blog (Spatiality). Links to download individual shapefiles are available at the bottom of each post, following his discussion of issues and methodology for each set of layers. The CUNY Center for Urban Research has created an index page with each post, which you can access here.

In addition, he’s created a lyr file for the subway lines in order to symbolize them correctly by color and a separate mxd file for labels. While the shapefiles represent where the lines are, there are some problems representing them as they appear cartographically on the MTA’s subway maps. Many lines, including some with different colors, share the same trunk line. For example the A and C trains (blue lines) share the same trunk with the B and D trains (orange lines) along 8th Ave from 59th St to 145th St. You’ll only see one color (and line), depending on which symbol category you have sorted on top. Steve points out two ways of solving this issue – you can edit the geography and offset one of the lines, which is tedious and creates problems as you change scale (he has some great screen shots that depict this). If you’re using ArcGIS, he shows off some cartographic tools that you can use to offset lines by prioritizing values in the attribute table. This is preferable, as it gives the illusion that the lines are side by side cartographically while keeping the geometry of the shapefile intact.

So if you’re using ArcGIS you’ll be good to go. I’ve downloaded the files to play around with, but as I’m at home and using QGIS I had some more work to do, since lyr and mxd files are proprietary ESRI formats that the open source packages can’t handle. I’ve assigned the appropriate colors to each subway line and saved them as a QGIS style file (.qml), which you can import in the symbology window to quickly and easily get the right colors (which I plucked from the MTA’s website). I’ve also saved the RGB and hex values for each line in a text file, if you’re using some other GIS software and need to input them manually. As far as I know there isn’t an easy way to circumvent the shared-line subway problem if you’re using QGIS (see screenshot below), so you’d have your work cut out for you if you want to faithfully represent the lines the way they appear on the MTA maps. But if you’re using the layers for analysis (which is what I’ll be doing) or you don’t need to emulate “the” subway map in exact detail, it shouldn’t matter.

NYC subway layers from CUNY Mapping Service in QGIS

Footnote – for anyone who is interested, the proprietary data that I purchase for the college is from a company called Halcrow. The entire NYC transportation package costs $465. It includes NYC subways and buses (lines and stations for each, along with ridership statistics from 2008 and a historical bus stops layer from 1998), LIRR and Metro North (lines and stations), but also includes the PATH train, freight lines, and truck routes.

Learning Python at PyCamp

June 10th, 2010

I got back from leave a couple of weeks ago, and spent part of it at a Python boot camp. I’ve gotten tired of hacking away at data in spreadsheets and read in several places that Python is a good language to learn for beginning programmers – it’s also open source, flexible, and is used by many in the GIS community for processing data and building plugins and software (the instructor for the camp, Chris Calloway, pointed me to this presentation on Python scripting techniques for ArcGIS).

The workshop was a three-day event hosted at Penn State by the Triangle Zope and Python Users Group (TriZPUG). It was geared towards beginners and non-programmers (although many of my fellow classmates were IT and systems people) and provided a pretty thorough review of all of the elements of the language – now it’s up to me to tie it all together! The price was extremely reasonable (only $300 for a 3 day class!) and I’d certainly recommend it if there’s a camp in your area; although I would also recommend reading a book or taking a tutorial to familiarize yourself with the basics BEFORE attending the class; I did, and as a result I think I got more out of it than I would have had going in cold.

The next PyCamp is being held in LA in a few days, and the following one will be in Toronto from Aug 30th to Sept 3rd (although this isn’t posted on the website yet); the normal workshop is a five day affair, the one I attended was a mini 3 day version which suited my needs pretty well.

There are tons of Python tutorials on the web and Python’s site is pretty definitive. If you’re looking for a book, I’d recommend Practical Programming: An Introduction to Computer Science Using Python. Unlike the “Learn Language X” books, this one introduces you to general theory and practice in programming, and the authors illustrate the applications with practical examples using Python – it’s been immensely helpful to me. Now that I’m around the initial learning curve, I’ve been relying more on Beginning Python: From Novice to Professional, which is better as a reference book and good for illustrating many of the uses for individual objects, methods, etc (which I had a hard time grasping before I covered the basics of programming).

Google Maps to Create a Census Finding Aid

May 13th, 2010

Yikes! It’s been quite a while since my last post (the past couple months have been a little tough for me), but I just finished an interesting project that I can share.

I constantly get questions from students who are interested in getting recent demographic and socio-economic profiles for neighborhoods in New York City. The problem is that neighborhoods are not officially defined, so we have to look for a surrogate. The City has created neighborhood-like areas out of census tracts called community districts and they publish profiles for them, but this data is from the decennial census and not current enough for their needs. ZIP code data is also only available from the decennial census.

We can use PUMAs (Public Use Microdata Areas) to approximate neighborhoods in large cities, and they are published as part of the 3 year estimates of the American Community Survey. The problem is, in order to look up the data from the census you need to search by PUMA number – there are no qualitative place names. The city and the census have worked together to assign names to neighborhoods as part of the NYC Housing and Vacancy Survey, but this is the only place (I’ve found) that uses these names. You need to look in several places to figure out what the PUMA number and boundaries for an area are and then navigate through the census site to find it. Too much for the average student who visits me at the reference desk or emails me looking for data.

My solution was to create a finding aid in Google maps that tied everything together:

I downloaded PUMA boundaries from the Census TIGER file site in a shapefile format. I opened them up in ArcGIS and used an excellent script that I downloaded called Export to KML. ArcGIS 9.3 does support KML exports via the toolbox, and there are a number of other scripts and stand-alone programs that can do this (I tried several) but Export to KML was best (assuming you have access to ArcGIS) in terms of the level of customization and the thoroughness of the user documentation. I symbolized the PUMAs in ArcGIS using the colors and line thickness that I wanted and fired up the tool. It allows you to automatically group and color features based on the layer’s symbology. I was able to add a “snippet” to each feature to help identify it (I used the PUMA number as the attribute name and the neighborhood name as my snippet, so both appear in the legend) and added a description that would appear in the pop up window when that feature is clicked. In that description, I added the URL from the ACS census profile page for a particular PUMA – the cool part here is that the URL is consistent and contains the PUMA number. So, I replaced the specific number and inserted the [field] name from the PUMAs attribute table that contained the number. When I did the export, the URLs for each individual feature were created with their PUMA number inserted into the link.

There were a few quirks – I discovered that you can’t automatically display labels on a Google Map without subterfuge, like creating the labels as images and not text. Google Earth (but not Maps) supports labels if you create multi-geometry where you have a point for a label and a polygon for the feature. If you select a labeling attribute on the initial options screen of the Export to KML tool, you create an icon in the middle of each polygon that has a different description pop-up (which I didn’t want, so I left it set to none and lived without labels). I made my features 75% transparent (a handy feature of Export to KML) so that you could see the underlying Google Map features through the PUMA, but this made the fill AND the lines transparent, making the features too difficult to see. After the export I opened the KML in a text editor and changed the color values for the lines / boundaries by hand, which was easy since the styles are saved by feature group (boroughs) and not by individual feature (PUMAs). I also manually changed the value of the folder open element (from 0 to 1) so that the feature and feature groups (PUMAs and boroughs) are expanded by default when someone opens the map.
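I made those two edits by hand, but they could also be scripted. The sketch below is one way to do it with Python's standard library, under some assumptions: the sample KML is a stripped-down stand-in I wrote for illustration (not the actual Export to KML output), and it assumes the file uses the KML 2.2 namespace. KML colors are written as aabbggrr in hex, so the first two characters are the alpha (opacity) value.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"  # assumed namespace of the file
ET.register_namespace("", KML_NS)  # keep the output free of ns0: prefixes

# Hypothetical, heavily simplified KML with one style and one folder.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Style id="manhattan">
      <LineStyle><color>40ffffff</color><width>2</width></LineStyle>
      <PolyStyle><color>400000ff</color></PolyStyle>
    </Style>
    <Folder>
      <name>Manhattan</name>
      <open>0</open>
    </Folder>
  </Document>
</kml>"""


def tag(name):
    """Return a namespace-qualified tag name for ElementTree lookups."""
    return "{%s}%s" % (KML_NS, name)


root = ET.fromstring(SAMPLE_KML)

# Make boundary lines fully opaque by setting the alpha byte to ff,
# leaving the polygon fill colors (PolyStyle) untouched.
for line_style in root.iter(tag("LineStyle")):
    color = line_style.find(tag("color"))
    if color is not None and color.text:
        color.text = "ff" + color.text.strip()[2:]

# Expand every folder by default.
for folder in root.iter(tag("Folder")):
    open_el = folder.find(tag("open"))
    if open_el is not None:
        open_el.text = "1"

edited = ET.tostring(root, encoding="unicode")
print(edited)
```

For a one-off map the text editor is just as quick; a script like this mostly pays off if you have to re-export and re-apply the same tweaks every year.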

After making the manual edits, I uploaded the KML to my webserver and pasted the url for it into the Google Maps search box, which overlaid my KML on the map. Then I was able to get a persistent link to the map and code for embedding it into websites via the Google Map Interface. No need to add it to Google My Maps, as I have my own space. One big quirk – it’s difficult to make changes to an existing KML once you’ve uploaded and displayed it. After I uploaded what I thought would be my final version I noticed a typo. So I fixed it locally, uploaded the KML and overwrote the old one. But – the changes I made didn’t appear. I tried reloading and clearing the cache in my browser, but no good – once the KML is uploaded and Google caches it, you won’t see any of your changes until Google re-caches. The conventional wisdom is to change the name of the file every single time – which is pretty dumb as you’ll never be able to have a persistent link to anything. There are ways to circumvent the problem, or you can just wait it out. I waited one day and by the next the file was updated; good enough for me, as I’ll only need to update it once a year.

I’m hosting the map, along with some static PDF maps and a spreadsheet of PUMA numbers and neighborhood names, from the NYC Data LibGuide I created (part of my college’s collection of research guides). If you’re looking for neighborhood names to associate with PUMA numbers for your city, you’ll have to hunt around and see if a local planning agency or non-profit has created them for a project or research study (as the Census Bureau does not create them). For example, the County of Los Angeles Department of Mental Health used PUMAs in a large study where they associated local place names with each PUMA.

If you’re interested in dabbling in some KML, there’s Google’s KML tutorial. I’d also recommend The KML Handbook by Josie Wernecke. The catch for any guide to KML is that while all KML elements are supported by Google Earth, there’s only partial support for Google Maps.


Copyright © 2012 Gothos. All Rights Reserved.