2010 Census Data Being Released

June 16th, 2011

The US Census Bureau has begin releasing data for Summary File 1, which is the primary summary data set that the Bureau tabulates. They will release data for groups of states on a weekly basis from June through September. Alabama and Hawaii were the first states released today. California, Delaware, Kansas, Pennsylvania and Wyoming are out next week.

This data is based on the 100% count of the population and is being released for geographies that nest within states: states, counties, county subdivisions, places, census tracts, ZCTAs, and congressional districts, and in some cases block groups and blocks. You can download the data table by table by building queries via the new American Factfinder, or power users can download entire datasets via the FTP site.

You’ll see how small the 2010 Census is compared to the past: we’re only going to get basic demographic variables. The extensive number of socio-economic indicators – education, income, language, employment status, etc – are no longer collected as part of the decennial census; you have to turn to the American Community Survey for this data, which is released on an annual basis.

Here’s what’s in the 2010 Census:

  • Total Population
  • Urban and Rural Population
  • Gender and Age
  • Race
  • Hispanic or Latino Origin
  • Households (Including Type and Size)
  • Group Quarters
  • Families
  • Family Relationships
  • Housing Units
  • Occupancy Status (Occupied or Vacant)
  • Tenure (Owner or Renter Occupied)

Many of these variables are cross-tabulated by age, gender, race, Hispanic or Latino Origin, Household Type, and Household Size. Once we get to the fall of 2011 we’ll start to see national level data for divsions, regions, and metropolitan areas.

Define Projection for a Batch of Shapefiles

June 12th, 2011

I was working on a project where I had downloaded 51 shapefiles (state-based census tract files) from the Census Generalized Cartographic Boundary Files. Each file lacked a projection .prj file, so I had to define each one as NAD83. Not wanting to do this one at a time, I used the GDAL / OGR tools and a bash script to process them all in a batch. I wrote a little script in a text file and then pasted it in the command line:

#!/bin/bash
for i in $(ls *.shp); do
ogr2ogr -f “ESRI Shapefile” -a_srs “EPSG:4269″ ./nad83 $i
done

It iterates through a list of all the shapefiles in a directory, uses OGR to define them as NAD83, then writes them to a new subdirectory called NAD83.

After searching through the web for some guidance on this, I later realized that there was a nice, succinct example of this in a book that I had (yeah – remember books? They’re still great!)

#!/bin/bash
# from Sherman (2008) Desktop GIS Mapping the Planet With Open Source Tools pp 243-44

for shp in *.shp
do
echo “Processing $shp”
ogr2ogr -f “ESRI Shapefile” -t_srs EPSG:4326 geo/$shp $shp
done

This does the same thing, difference here is that it prints a message to the command line for each file that’s processed and uses the -t_srs switch (transform projection) rather than the -a_srs (assign an output projection), which in this case seems to do the same thing. Of course you could tweak this a little to transform projections from one system to another as well.

This is fine and good if you’re using Linux and can use bash (go here for more info about bash). If you’re using Windows, you can do this if you’re using a Linux / UNIX terminal emulator like MSYS; otherwise you can use the DOS Command Prompt and write a batch (.bat) file to do this instead – the post on this forum is the first thing I found in my quest to figure all of this out.

Libraries Help Create Video Game Geography

June 9th, 2011

Just for fun – I stumbled on a blog post from a TV Station in California that discusses how the developers for the new L.A. Noire video game made extensive use of libraries and archives to recreate Los Angeles of 1947. They used property, city planning, and USGS maps to recreate the street grid and landscape and aerial and street-level photos to faithfully replicate everything from specific buildings to street lights and garbage cans. They also dove into newspaper archives and Raymond Chandler’s works and personal papers for dialogue and story lines. It’s a good thing that we keep libraries and archives around to organize and collect all this stuff…

How Archivists Helped Video Game Designers Recreate the City’s Dark Side for ‘L.A. Noire’

Don’t know what I’m talking about? Check out this trailer.

Copyright RockStar Games 2011

GIS Practicum and QGIS Tutorial

May 19th, 2011

I recently finished running my day-long practicum this semester, Introduction to GIS Using Open Source Software, which used QGIS to introduce new users to GIS. Each participant received a printed tutorial booklet for the class, which I’m now providing online under a Creative Commons license here:

http://www.baruch.cuny.edu/geoportal/practicum/

I plan on revising the tutorial once a year, and the online version will be one version behind what I use and distribute in class. I’ll be busy this summer tweaking the guide and the class and will offer the in-class version to members of my university again in the 2011-12 academic year.

I held the workshop three times and had 45 participants out of a possible 60. Half were graduate students and the other half were faculty or staff. The top three disciplines that were represented were public health, library and information science, and public administration; I also had a smattering of folks from across the social sciences and a few from the natural sciences. Despite my intention to introduce GIS to novices, about half of the participants had some previous experience using GIS, primarily ArcGIS. All of these folks were pleasantly surprised with how well QGIS performed.

If you have any questions or comments about the tutorial feel free to contact me:

francis DOT donnelly AT baruch DOT cuny DOT edu

There’s also the gothos email address, but to be honest I don’t check it as often as I should – frank

2010 Census Redistricting Data

April 17th, 2011

The Redistricting Summary Data [P.L. 94-171] from the 2010 Census has all been published for the nation, states, counties, and places, and is available via the new American Factfinder. The redistricting data includes basic demographic data: total population, race, Hispanic or Latino origin, and number of housing units occupied and vacant. Data is available down to census blocks and is available for most (but not all – no ZCTAs or PUMAs) geographies.

If you don’t want all the data for a state, don’t want to slog through the Factfinder, and are comfortable working with large text files, you can FTP the summary data from the Redistricting Data homepage. If you want basic summary data for states, counties, and places and don’t want to fuss with the Factfinder or text files, you can download Excel spreadsheets from the Redistricting Data Press Kit. They also have some pdf / jpg maps showing county level population and population change, plus interactive map widgets like the one below for the country and for each state. 2010 Redistricting TIGER Shapefiles have also been released for geographies included in the redistricting dataset.

The full 2010 Census for all geographies will be released throughout this summer and into the fall in Summary File 1 [SF1]. Stay tuned.

Joining CSV Files in QGIS

April 4th, 2011

DBF files are the other thorny issue that comes up when I’m teaching the Intro to GIS workshops with QGIS – specifically how do you create and edit them? Doing that has been pretty inconvenient in the Windows world since they were deprecated in Excel 2007. You can download plugins or basic stand-alone software to do the job. Since I’m a Linux user I have the Open Office suite by default, and I can easily work with DBFs in Calc. That’s an option for Windows and Mac users too, but it’s kind of a drag to download an entire office suite just for working with DBFs (assuming most folks are use MS Office).

Another possibility is to dump DBF all together; even though the join table option in the fTools menu in QGIS only presents you the option of joining shapfiles or DBF, it turns out if you choose the DBF radio button option you can actually point to DBFs OR CSV files when you’re browsing to point to your data table. The join proceeds the same way, and you get a new shapefile with the data from the CSV joined to it. CSVs can be easily created from any spreadsheet, text editor, or database program in any operating system, so you can prep your attribute data in your program of choice before exporting to CSV and importing to GIS.

The problem with this approach is that all of the fields from the CSV file are automatically saved as text or strings when they’re appended to the shapefile. This means that any numeric data you have can’t be treated numerically; you can’t perform calculations or classify data as value ranges for mapping. You can go into the attribute table, enter the edit mode, and use the field calculator to create new fields where you convert the values to integers, but this adds a bunch of duplicate fields and is rather messy. You can’t delete columns from shapefiles within QGIS; you have to edit the attributes outside the software to remove extra columns.

Here’s an easy work around: you can create a CSVT file to go along with your CSV file. Open a text editor like Notepad or gedit and create a one line file where you specify the data type for each of the fields in your CSV file. Save the CSVT with the SAME NAME as your CSV file. Now when you go to join your CSV to your shapefile in QGIS, it will read the CSVT and save all of your fields properly in the new shapefile.

So for a CSV file like this with six fields:

CODE, ST, ID_NAICS51, EMP_51, ID_TOTAL, TOTAL_EMP
01, AL, ENU0100010551, 27013, ENU0100010510, 1570188
02, AK, ENU0200010551, 6988, ENU0200010510, 237708
04, AZ, ENU 0400010551, 41833, ENU0400010510, 2174919 …

You’d create a CSVT file like this:

“String”, “String”, “String”, “Integer”, “String”, “Integer”

So the 4th and 6th fields that have numeric values are saved as integers. There are a few other field types you can use, like Real for decimal numbers and some Date / Time options, and you can specify length and precision – see this post for details.

Using shapefiles, CSVs, and CSVTs are fine for small to medium projects and for the introductory workshops I’m teaching; geodatabases are another option and are certainly better for medium to large projects, but introducing them in my intro workshop is a little too much.

(NOTE – in QGIS 1.7 the join tool has been dropped from the ftools menu. To join a csv or dbf in 1.7+, you have to add your data table as a vector to your map, and then use the join tab within the properties menu of the feature you want to join it to. See this post for details.)

Common Map Projection Definitions

April 3rd, 2011

Just finished teaching my second intro to GIS workshop using open source software (QGIS). Coordinate systems and map projections are always one of the toughest hurdles for novices. It’s hard enough just teaching the basic concepts using the existing CRS libraries in QGIS; having to custom define a number of common projected coordinate systems makes the process more daunting. For example, when we’re producing a thematic map of the US I want to use Lambert Conformal Conic for North America, but I have to give the students a proj4 definition in a text file and explain how you have to comb through the Spatial Reference site to find it.

For reference purposes and to make things a bit simpler, I’m providing some codes and definitions for some common coordinate reference systems (common for the participants in class) in this post. You can look up projection definitions at Spatial Reference and use the map projection resources at Radical Cartography and the USGS to see depictions and explanations of different systems. I created the projection images in this post using NASA’s G.Projector tool; a lightweight cross-platform tool for experimenting with projections.

The following CRS are pretty common and are included in the EPSG library used by QGIS – no need to custom define them, just search by name and code (the EPSG codes are ID codes for each CRS):

Geographic Coordinate Systems:

WGS84 (EPSG 4326): World Geodetic System of 1984, commonly used by organizations that provide GIS data for the entire globe or many countries and used by most web-based mapping engines (Google Maps)

NAD83 (EPSG 4269): North American Datum of 1983, commonly used by most US and Canadian federal government agencies (the US Census Bureau in particular) that provide GIS data

Since WGS84, NAD83, and all geographic coordinate systems are unprojected they will all look like Equirectangular or “Plate Caree” projections, which preserve distances:

Local Projected Coordinate Systems:

NAD 83 / New York Long Island (ft US) (EPSG 2263): The State Plane zone that covers Long Island and New York City is used by all NYC agencies that produce GIS data. Many city and state agencies produce data in their specific state plane zone. An alternate projection, EPSG 32118, represents the same zone but uses meters instead of feet.

NAD 83 / UTM Zone 18N (EPSG 26918): An alternative to State Plane that is better for larger regions; satellite or ortho imagery is often provided based on the UTM zone where the tile is. UTM Zone 18N covers much of the east coast of the US. An alternate projection, EPSG 32618, uses WGS 84 as a datum instead of NAD 83.

The following CRS are common continental and global projected coordinate systems that are NOT included in the EPSG library that is part of QGIS; you have to custom define them using the proj4 definitions.

North America Lambert Conformal Conic: Perhaps the most common map projection for North America, a conformal map preserves angles. LCC can be modified for optimally displaying specific countries (i.e. USA and Canada) or other continents (i.e. South America, Asia, etc.)

+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs

North America Albers Equal Area Conic: an alternative to LCC, all areas in an AEAC map are proportional to the same areas on the Earth. Can also be modified for specific countries or other continents. Visually it look more “compact” east to west versus LCC.

+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs

Robinson: a global map projection used by National Geographic for many decades. The Robinson map is a compromise projection; it doesn’t preserve any aspect of the earth precisely but makes the earth “look right” visually based on our common perceptions.

+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

Mollweide: a global map projection that preserves areas, often used in the sciences for depicting global distributions on small maps.

+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

Calculating Standard Deviation for Summarized Data

March 29th, 2011

This isn’t a geospatial issue per say, but I thought it would be useful to share. I have a spreadsheet where I’m tracking course evaluation responses for the GIS workshops I’m teaching. I have to report the total number of responses, the mean, and standard deviation for each question. The worksheet I designed tracks aggregate responses; the total number of people who responded to each question in each category, on a scale of 5 (strongly agree) to 1 (strongly disagree). For example:

The problem I had was that Excel’s standard deviation formula doesn’t work for summaries – you need to give the formula individual responses or raw scores for arguments. In other words:

So I was fixated on trying to find a formula, through the help and by searching the forums, where somehow I could calculate standard deviation using summaries or aggregates. It finally dawned on me (duh) that I could plug in the standard deviation formula myself and modify it.

To calculate the standard deviation for an entire population you compute the difference of each data point from the mean and square each result. Then you calculate the average of all these values and take the square root.

So for each question I subtract the mean score for that question from the score category for that question, square it, and then multiply the result by the number of people who answered in that category. So if 10 people strongly agreed with the question and strongly agreed is associated with a score of 5, I subtract the average score (4.71) from 5, square the result, and multiply it by 10 (since ten people responded that they strongly agreed).

((score value – mean score)^2)*respondents

I perform the same operation for each category. So if 4 people said they agreed with a question and agreed is a value of 4, subtract 4.71 from 4, square it, and multiply by 4. After I do this this for each category, I sum the values for each one and take the square root of the whole thing.

SQRT ((((score5 – mean score)^2)*respondents)+(((score4 – mean score)^2)*respondents))

SQRT ((((5-4.71)^2)*10)+(((4-4.71)^2)*4))

For my spreadsheet the formula is repeated for each of the 5 possible scores, references are used to pull in the mean and respondent values from other cells, and I round the entire result to 1 decimal place. The number of parens makes it a little confusing; I’ve inserted a color-coded image below so it’s a little clearer.

=ROUND((SQRT(((((5-H9)^2)*B9)+(((4-H9)^2)*C9)+(((3-H9)^2)*D9)+(((2-H9)^2)*E9)+(((1-H9)^2)*F9))/G9)),1)

Given all that can go wrong with one misplaced parens, I tested this by inputting some raw scores by hand and running the STDEVPA formula to verify that I get the same result.

Relating ZIP Codes / ZCTAs to PUMAs

March 19th, 2011

Ever since I created the Google Maps finding aid for census data for NYC PUMAs and the associated PUMA – NYC neighborhood names maps, I’ve received several requests for tables or maps that relate PUMAs to ZIP Codes. These are usually from non-profits in NYC who have lists of donors, members, or constituents with addresses, and they want to relate the addresses (using the ZIP) to recent demographic data from American Community Survey (ACS) for the broader neighborhood where the ZIP is located.

The problem is that ZIP Codes are an all around pain. They actually don’t exist as areas with distinct boundaries; ZIP Codes are all address based, with ZIPs tied to addresses along street segments. The USPS doesn’t publish these tables or create maps; they contract this out for private companies to do, who turn around and sell these products for hefty fees.

Fortunately the Census Bureau has used these address tables to create approximations of ZIP Codes that they call ZCTAs or ZIP Code Tabulation Areas. ZCTAs are aggregates of census blocks that attempt to mimic ZIP Codes that exist as areas; codes associated with specific single-point firms or organization are dropped. Since ZIPS were created by the USPS, ZCTAs do not nest or mesh with any census geography; they cross PUMA, county, and in some cases even state boundaries. They are also less stable than census geography, with frequent changes, and as statistical areas they vary widely in area and population. For this reason ZCTA data is only published every ten years in the decennial census; it’s not included in the ACS (so far).

With these caveats in mind, I used the Missouri Census Data Center’s MABLE/GEOCORR engine to correlate ZCTAs with PUMAs. While the interface looks a little retro and daunting, it’s actually pretty simple. You choose the state, the two geographies you want to relate, the weighting method for allocating one to the other, and an output format that includes CSV or HTML. I also used an option that lets you type in FIPS codes for the counties you want, so I didn’t end up with the entire state.

This method was the way to go, as they give you the option to allocate geographies based on population and not simply land area; each ZCTA was allocated to PUMAs based on where the majority of the ZCTA’s population lived using 2000 census block data. The final output contains one row for each ZCTA to PUMA combination. So you had multiple rows for ZCTAs that weren’t contained within a single PUMA, and for each of those ZCTAs you had fields that showed the percentage of the ZCTA’s population that lived in each PUMA (along with the actual population number) as well as the percentage of the PUMA’s population that lived in that ZCTA.

I took that table and cleaned it up in a spreadsheet, so that I was left with one row for each ZCTA, where the ZCTA was allocated to one PUMA based on where the majority of it’s population lives. I used some ZCTA and PUMA boundaries that I had originally downloaded and subsequently cleaned up from the 2009 TIGER shapefiles page, added them to QGIS, joined the ZCTA allocation table to the ZCTA geography, and mapped the result. I color-coded ZCTAs so that clusters of ZCTAs within a particular PUMA had the same color. Then I overlaid the PUMA boundaries on top to see how well they corresponded.

In the end, they didn’t correspond all that well. There was a fairly good relationship in Manhattan, ok relationship in Queens and Staten Island, and a rather lousy relationship in the Bronx and Brooklyn. I overlaid greenspace and facilities (airports, shipyards, etc) boundaries I had, and that made some difference; you could see in some areas where ZCTAs overlapped two PUMAs that the overlap coincided with parks, cemeteries, or other areas with low or no residential population in one of the PUMAs.

I’ve posted both sets of tables, maps, and some instructions on the NYC neighborhoods resource page. You can use the original MABLE / GEOCORR table to judge where allocations were good and were they were not so good based on population. For now, the engine is still based on 2000 Census geography and data. Even though the Census has started releasing 2010 TIGER files based on 2010 Census geography, ZCTAs and PUMAs are often some of the last geographies to be updated; current releases of the ACS are still based on the 2000 geographies. Stay tuned to the Census Bureau and MCDC websites for news on updates, and keep the MABLE / GEOCORR in mind if you want to create lists to relate census geographies by population or land area.

Some 2010 Census Updates

February 7th, 2011

Some geography updates to pass along regarding new US Census data:

  • The Census has released a few 2010 map widgets that you can embed in web pages. One shows population change, density, and apportionment for the whole country at the state level, while the other shows population, race, and Hispanic change for states at the county level. As of this post only four states are ready (LA, MS, NJ, and VA) but they’ll be adding the rest once they’re available.
  • The 2010 TIGER Line Files are starting to be released; they’ve changed the download interface a little bit based on user feedback. Most summary levels / geographic areas are available; some (like ZIP Codes and PUMAs) will be released later this year.
  • They’re also rolling out the new interface for the American Factfinder; currently you can get 2000 Census data, some population estimates, and the 2010 Census data as it becomes available. Other datasets like the American Community Survey and Economic Census will be added over time. Some maps and gov docs librarians have expressed concerned about the change – apparently when you download the data from the new interface the FIPS codes are not “ready to go” for joining to shapefiles; there’s one long geo id that has to be parsed. The other concern is that the 1990 Census won’t be carried over into the new interface at all. The original American Factfinder is slated to come down towards the end of this year.



Copyright © 2012 Gothos. All Rights Reserved.
No computers were harmed in the 0.327 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.