Posts Tagged ‘census’

NYC Geodatabase in Spatialite

Wednesday, February 6th, 2013

I spent much of the fall semester and winter interim compiling and creating the NYC geodatabase (nyc_gdb), a desktop geodatabase resource for doing basic mapping and analysis at a neighborhood level – PUMAs, ZIP Codes / ZCTAs, and census tracts. There were several motivations for doing this. First and foremost, as someone who is constantly introducing new people to GIS it’s a pain sending people to a half dozen different websites to download shapefiles and process basic features and data before actually doing a project. By creating this resource I hoped to lower the hurdles a bit for newcomers; eventually they still need to learn about the original sources and data processing, but this gives them a chance to experiment and see the possibilities of GIS before getting into nitty gritty details.

Second, for people who are already familiar with GIS and who have various projects to work on (like me) this saves a lot of duplicated effort, as the db provides a foundation to build on and saves the trouble of starting from scratch each time.

Third, it gave me something new to learn and will allow me to build a second part to my open source GIS workshops. I finally sat down and hammered away with Spatialite (went through the Spatialite Cookbook from start to finish) and learned spatial SQL, so I could offer a resource that’s open source and will compliment my QGIS workshop. I was familiar with the Access personal geodatabases in ArcGIS, but for the most part these serve as simple containers. With the ability to run all the spatial SQL operations, Spatialite expands QGIS functionality, which was something I was really looking for.

My original hope was to create a server-based PostGIS database, but at this point I’m not set up to do that on my campus. I figured Spatialite was a good alternative – the basic operations and spatial SQL commands are relatively the same, and I figured I could eventually scale up to PostGIS when the time comes.

I also created an identical, MS Access version of the database for ArcGIS users. Once I got my features in Spatialite I exported them all out as shapefiles and imported them all via ArcCatalog – not too arduous as I don’t have a ton of features. I used the SQLite ODBC driver to import all of my data tables from SQLite into Access – that went flawlessly and was a real time saver; it just took a little bit of time to figure out how to set up (but this blog post helped).

The databases are focused on NYC features and resources, since that’s what my user base is primarily interested in. I purposefully used the Census TIGER files as the base, so that if people wanted to expand the features to the broader region they easily could. I spent a good deal of time creating generalized layers, so that users would have the primary water / coastline and large parks and wildlife areas as reference features for thematic maps, without having every single pond and patch of grass to clutter things up. I took several features (schools, subway stations, etc) from the City and the MTA that were stored in tables and converted them to point features so they’re readily useable.

Given that focus, it’s primarily of interest to NYC folks, but I figured it may be useful for others who wish to experiment with Spatialite. I assumed that most people who would be interested in the database would not be familiar with this format, so I wrote a tutorial that covers the database and it’s features, how to add and map data in QGIS, how to work with the data and do SQL / spatial SQL in the Spatialite GUI, and how to map data in ArcGIS using the Access Geodb. It’s Creative Commons, Attribution, Non-Commercial, Share-alike, so feel free to give it a try.

I spent a good amount of time building a process rather than just a product, so I’ll be able to update the db twice a year, as city features (schools, libraries, hospitals, transit) change and new census data (American Community Survey, ZIP Business Patterns) is released. Many of the Census features, as well as the 2010 Census data, will be static until 2020.

ACS Trend Reports and Census Geography Guide

Sunday, February 12th, 2012

I recently received my first question from someone who wanted to compare 2005-2007 ACS data with 2008-2010. With the release of the latter, we can make historical comparisons with the three year data for the first time since we have estimates that don’t overlap. We should be able to make some interesting comparisons, since the first set covers the real estate boom years (remember those?) and the second covers the Great Recession. One resource that makes such comparisons relatively painless is over at the Missouri Census Data Center. They’ve put together a really clean and simple interface called the ACS Trends Menu, which allows you to select either two one period estimates or two three period estimates and compare them for several different census geographies – states, counties, MCDs, places, metros, Congressional Districts, PUMAs, and a few others – for the entire US (not just Missouri). The end result is a profile that groups data into the Economic, Demographic, Social, and Housing categories that the Census uses for its Demographic Profile tables. The calculations for change and percent change for the estimates and margins of error are done for you.

Downloading the data is not as straightforward – the links to extract it just brought me some error messages, so it’s still a work in progress. Until then, a simple copy and paste into your spreadsheet of choice will work fine.

ACS Trends Menu

If you like the interface, they’ve created separate ones for downloading profiles from any of the ACS periods or from the 2010 Census. The difference here is that you’re looking at one time frame; not across time periods. The interface and the output are the same, but in these menus you can compare four different geographies at once in one profile. Unlike the Trends reports, both the ACS and 2010 Census profiles have easy, clear cut ways to download the profiles as a PDF or a spreadsheet. If you’re happy with data in a profile format and want an interface that’s a little less confusing to navigate than the American Factfinder, these are all great alternatives (and if you’re building web applications these profiles are MUCH easier to work with – you can easily build permanent links or generate them on the fly).

The US Census Bureau also recently put together a great resource called the Guide to State and Local Census Geography. They provide a census geography overview of each state: 2010 population, land area, bordering states, year of entry into the union, population centroids, and a description of how local government is organized in the state – (i.e. do they have municipal civil divisions or only incorporated cities and unincorporated land, etc). You get counts for every type of geography – how many counties, tracts, ZCTAs, and so on, AND best of all you can download all of this data directly in tab delimited files. Need a list of every county subdivision in a state, with codes, land area, and coordinates? No problem – it’s all there.

2010 Census Generalized Cartographic Boundary Files

Thursday, December 22nd, 2011

I’ve had a few interesting projects that have kept me busy at the end of this year. I’ll do a post or two after New Years, once I’m back in the office and can take some screen shots to illustrate.

In the meantime I have one tidbit I can mention – the Census Bureau has released the 2010 version of the Generalized Cartographic Boundary Files. These files are generalized versions of the TIGER files, with smoothed and simplified boundaries and areas of coastal water removed. They haven’t posted them on the same page as the 2000 and 1990 boundaries; they’ve mentioned they’re creating a new interface to host all of them, which is currently a work in process at http://www.census.gov/geo/www/cob/.

However, you can get access to all the 2010 boundaries via the FTP site – you just need to know what you’re looking at. All the files are named with codes to identify the geographic coverage, summary level, and resolution / scale. There’s a README file on the FTP page that tells you how to identify each.

But in brief – The file names look like this: gz_2010_ss_lll_vv_rr.zip, where:

  • ss is the state INCITS / FIPS code which you can look up here – ‘us’ is a national level file.
  • lll is the summary level or unit of geography – the README file has a table with each code. The most common ones: 040 for state, 050 for county, 060 for county subdivisions, 140 for census tracts, 160 for places, 310 for metropolitan and micropolitan statistical areas, 860 for ZCTAs. (No PUMAs- 2010 PUMA boundaries haven’t been drawn yet, and 2000 PUMA boundaries are still being used in the latest ACS).
  • vv is a version number for the file.
  • rr is resolution – most of the files are 500k = 1:500,000, which is the least generalized and best for mapping state-level to regional areas. For national level files you also have the option of 5m = 1:5,000,000 and 20m = 1:20,000,000, which are more generalized and better for national mapping.

The Census Bureau has been doing a lot of tweaking to their website lately. The legacy version of the American Factfinder is set to disappear for good on Jan 20, 2012.

2010 American Community Survey Releases

Friday, September 23rd, 2011

The US Census Bureau released the new annual data for the 2010 American Community Survey; this dataset includes an extensive number of demographic, socio-economic, and housing estimates (with margins of error) for all geographic areas in the US that have a population of at least 65,000 people. This is the first ACS survey that is weighted based on the 2010 Census, and that is tabulated entirely on the new 2010 Census geography; exceptions include PUMAs and urban areas, which typically aren’t redrawn until a couple of years after a decennial census is taken. Data for these areas will be reported based on the 2000 Census geography. This will also be the first ACS that is distributed via the new American Factfinder. Previous ACS datasets should be moved to the new Factfinder by the end of this year.

According to the release schedule data for the three year ACS (2008-2010) for areas with at least 20,000 residents will be published in October and the five year ACS (2006-2010) for geography down to census tracts will be released in December. The three year dataset hits a milestone this year, as for the first time we’ll have datasets with mutually exclusive years that can be feasibly compared for historical change (the 2005-2007 dataset versus 2008-2010). It should prove interesting as the earlier dataset represents the end of the brief boom years while the current one depicts the depth of the great recession. There will be some challenges in making comparisons, as the base for weighting the estimates and the geography used to tabulate them is different for each dataset (2000 Census in the earlier dataset versus 2010 Census in the latest one).

Relating ZIP Codes / ZCTAs to PUMAs

Saturday, March 19th, 2011

Ever since I created the Google Maps finding aid for census data for NYC PUMAs and the associated PUMA – NYC neighborhood names maps, I’ve received several requests for tables or maps that relate PUMAs to ZIP Codes. These are usually from non-profits in NYC who have lists of donors, members, or constituents with addresses, and they want to relate the addresses (using the ZIP) to recent demographic data from American Community Survey (ACS) for the broader neighborhood where the ZIP is located.

The problem is that ZIP Codes are an all around pain. They actually don’t exist as areas with distinct boundaries; ZIP Codes are all address based, with ZIPs tied to addresses along street segments. The USPS doesn’t publish these tables or create maps; they contract this out for private companies to do, who turn around and sell these products for hefty fees.

Fortunately the Census Bureau has used these address tables to create approximations of ZIP Codes that they call ZCTAs or ZIP Code Tabulation Areas. ZCTAs are aggregates of census blocks that attempt to mimic ZIP Codes that exist as areas; codes associated with specific single-point firms or organization are dropped. Since ZIPS were created by the USPS, ZCTAs do not nest or mesh with any census geography; they cross PUMA, county, and in some cases even state boundaries. They are also less stable than census geography, with frequent changes, and as statistical areas they vary widely in area and population. For this reason ZCTA data is only published every ten years in the decennial census; it’s not included in the ACS (so far).

With these caveats in mind, I used the Missouri Census Data Center’s MABLE/GEOCORR engine to correlate ZCTAs with PUMAs. While the interface looks a little retro and daunting, it’s actually pretty simple. You choose the state, the two geographies you want to relate, the weighting method for allocating one to the other, and an output format that includes CSV or HTML. I also used an option that lets you type in FIPS codes for the counties you want, so I didn’t end up with the entire state.

This method was the way to go, as they give you the option to allocate geographies based on population and not simply land area; each ZCTA was allocated to PUMAs based on where the majority of the ZCTA’s population lived using 2000 census block data. The final output contains one row for each ZCTA to PUMA combination. So you had multiple rows for ZCTAs that weren’t contained within a single PUMA, and for each of those ZCTAs you had fields that showed the percentage of the ZCTA’s population that lived in each PUMA (along with the actual population number) as well as the percentage of the PUMA’s population that lived in that ZCTA.

I took that table and cleaned it up in a spreadsheet, so that I was left with one row for each ZCTA, where the ZCTA was allocated to one PUMA based on where the majority of it’s population lives. I used some ZCTA and PUMA boundaries that I had originally downloaded and subsequently cleaned up from the 2009 TIGER shapefiles page, added them to QGIS, joined the ZCTA allocation table to the ZCTA geography, and mapped the result. I color-coded ZCTAs so that clusters of ZCTAs within a particular PUMA had the same color. Then I overlaid the PUMA boundaries on top to see how well they corresponded.

In the end, they didn’t correspond all that well. There was a fairly good relationship in Manhattan, ok relationship in Queens and Staten Island, and a rather lousy relationship in the Bronx and Brooklyn. I overlaid greenspace and facilities (airports, shipyards, etc) boundaries I had, and that made some difference; you could see in some areas where ZCTAs overlapped two PUMAs that the overlap coincided with parks, cemeteries, or other areas with low or no residential population in one of the PUMAs.

I’ve posted both sets of tables, maps, and some instructions on the NYC neighborhoods resource page. You can use the original MABLE / GEOCORR table to judge where allocations were good and were they were not so good based on population. For now, the engine is still based on 2000 Census geography and data. Even though the Census has started releasing 2010 TIGER files based on 2010 Census geography, ZCTAs and PUMAs are often some of the last geographies to be updated; current releases of the ACS are still based on the 2000 geographies. Stay tuned to the Census Bureau and MCDC websites for news on updates, and keep the MABLE / GEOCORR in mind if you want to create lists to relate census geographies by population or land area.

Some 2010 Census Updates

Monday, February 7th, 2011

Some geography updates to pass along regarding new US Census data:

  • The Census has released a few 2010 map widgets that you can embed in web pages. One shows population change, density, and apportionment for the whole country at the state level, while the other shows population, race, and Hispanic change for states at the county level. As of this post only four states are ready (LA, MS, NJ, and VA) but they’ll be adding the rest once they’re available.
  • The 2010 TIGER Line Files are starting to be released; they’ve changed the download interface a little bit based on user feedback. Most summary levels / geographic areas are available; some (like ZIP Codes and PUMAs) will be released later this year.
  • They’re also rolling out the new interface for the American Factfinder; currently you can get 2000 Census data, some population estimates, and the 2010 Census data as it becomes available. Other datasets like the American Community Survey and Economic Census will be added over time. Some maps and gov docs librarians have expressed concerned about the change – apparently when you download the data from the new interface the FIPS codes are not “ready to go” for joining to shapefiles; there’s one long geo id that has to be parsed. The other concern is that the 1990 Census won’t be carried over into the new interface at all. The original American Factfinder is slated to come down towards the end of this year.


Track 2010 Census Participation Rates

Tuesday, March 30th, 2010

The 2010 Census is in full swing – the target date of April 1st is coming up soon. I mailed my form back last week. If you’re curious as to how many others have mailed theirs back, check out the bureau’s interactive Take 10 Map. Built on top of a Google Map interface, it allows you to track participation rates by state, county, place, reservation, and census tract. You can zoom in to change the scale and select different geography, or enter a zip code, city, or state to zoom to an area of choice. Clicking on an area will display it’s participation rate to date, compared to the state and national rates.

Data is updated daily, Monday through Friday. Once you click on a particular area, if you click the Track Participation Rate link it will create a widget that you can embed in a website to provide the updated rate. Unlike a lot of the other interactive web maps floating around these days, the bureau does give you the ability to download the actual data behind the map, if you want to do some analysis of your own.

Mapping Hard to Count Areas for Census 2010

Tuesday, February 23rd, 2010

There was an interesting article in the New York Times today about neighborhoods in New York that typically get under-counted in the Census. These include areas with high immigrant populations as well as places that have had new construction since the last census, as the buildings haven’t been added to the Census Bureau’s master address file.

What the article didn’t mention is that CUNY’s Center for Urban Research has created a great online ap called the Census 2010 Hard to Count mapping site. The site is built on the Census Bureau’s Tract Level Planning Database, which identified twelve population and housing variables, such as language isolation, recent movers, poverty, and crowded housing, that were associated with low mail response in the 2000 Census. This tool was designed to help Census reps, local government officials, and community activists identify traditionally under-counted areas to insure a more complete count this time around.

The database is national in scope, and you can easily map tracts for a particular state, county, city, metro area, or tribal area, and you can search for an area using an individual address. The map is built on a Google Maps interface, and zooming in will change the units mapped from larger units (states, counties, etc) to tracts. You can easily select one of the twelve variables color-coded in the menu to the left of the map, or a Hard to Count index of all the variables.

Formulas for Working With Census ACS Data in Excel / Calc

Friday, June 26th, 2009

After downloading US census data, you often need to reformat it before using it. It’s quite common that you download files where the population is broken down by gender and age, and you need to aggregate the data to get a total or divide a particular characteristic to get a percent total. This is pretty straightforward if you’re working with decennial census data, but data from the American Community Survey (ACS) is a little trickier to deal with since you’re working with estimates that have a margin of error. When creating new data, you also have to calculate what the margin of error is for your derived numbers. I’ll walk through some examples of how you would do this in a spreadsheet (the formulas below will work in either Excel or Calc).

Creating an Aggregate

We’ll use the following data in our example:

screenshot1

We have the total population of people three years and older who are enrolled in school, and a breakdown of this population enrolled in grades 1 through 4 and grades 5 through 8 in a few counties in New York, with margins of error for each data point. Our data is from the 3 year averaged 2005-2007 American Community Survey.

Let’s say we want to create a total for students who are enrolled in grades 1 through 8 for each county. We create a new column and sum the estimates for each county with the formula e3+g3, or sum(e3:g3).

To calculate a margin or error (MOE) for our grade 1 to 8 data, first we have to use the find and replace command to get rid of the “+/-” signs in the MOE column, so our spreadsheet will treat our values as numbers and not text (this is an issue if you downloaded the data as an Excel file – if you download a txt file the +/- is not included). Depending on the dataset you’re working with, you may also need to replace dashes, which represent data that was null or not estimated.

Once the data is cleaned up, we can insert a new column with this formula:

=SQRT((F3^2)+(H3^2))

This calculates our new margin of error by squaring the moes for each of our data points, summing the results together, and taking the square root of that sum. In other words,

=SQRT((MOE1^2)+(MOE2^2))

Once that’s done, you may want to round the new MOE to a whole number.

Creating a Percent Total

Let’s calculate the percentage of the population 3 years and older enrolled in school that are in grades 1 through 8. Based on what we have thus far (I hid the columns E,F,G, and H for grades 1-4 and 5-8 in this screenshot, as we don’t need them):

screenshot-2

We insert a new column where we divide our subgroup by the total, as you would expect – I3/C3. In the next column we insert the following formula to create a MOE for our new percent total:

=(SQRT((J3^2)-((K3^2)*(D3^2))))/C3

This one’s a little weightier than our last formula. We’re taking the square of our percent total (K3) and the square of the MOE of the total population (D3), multiplying them together, then subtracting that number from the square of the MOE of our subgroup (J3). Then we take the square root of the whole thing, then divide it by our total population (C3). If you’re saying – HUH? Maybe this is clearer:

=(SQRT((MOEsubset^2)-((PercentTotal^2)*(MOEtotalpop^2))))/TotalPop

Finally, we have something like this:

screenshot-3

Based on our data, we can say things like “There were approximately 30,556 students enrolled in 1st through 8th grade per year in Dutchess County, NY between 2005 and 2007, plus or minus 1,184 students. An estimated 37% of the population enrolled in school in the county was in the 1st through 8th grade, plus or minus 1%.” The ACS estimates have a 90% confidence interval.

Wrap Up

In this example we worked with aggregating and calculating percentages based on characteristics. We could also use these same formulas to aggregate data by geography, if we wanted to add the characteristics for all the counties together.

For the full documentation on working with ACS data, take a look at the appendix in the Census’ ACS Compass Guide, What General Data Users Need to Know. It provides you with the formulas in their proper statistical notation (for those of you more mathematically inclined than I) and includes formulas for calculating other kinds of numbers, such as ratios and percent change. It does provide you with worked-through examples, but not with spreadsheet formulas. I used their examples when I created formulas the first time around, so I could compare my formula results to their examples to insure that I was getting it right. I’d strongly recommend doing that before you start plugging away with your own data – one misplaced parentheses and you could end up with a different (and incorrect) result.

Mapping ACS Census Data for Urban Areas With PUMAs

Tuesday, December 16th, 2008

The NY Times wrote a story recently based on the new 3 year ACS data that the Census Bureau released a couple weeks ago (see my previous post for details). They created some maps for this story using geography that I would never have thought to use.

Outside of Decennial Census years, it is difficult to map demographic patterns and trends within large cities as you’ll typically get one figure for the entire city and you can’t get a break down for areas within. Data for areas like census tracts and zip codes is not available outside the ten-year census (yet), and large cities exist as single municipal divisions that aren’t subdivided. New York City is an exception, as it is the only city composed of several counties (boroughs) and thus can be subdivided. But the borough data still doesn’t reveal much about patterns within the city.

The NY Times used PUMAS – Public Use Microdata Areas – to subdivide the city into smaller areas and mapped rents and income. PUMAs are aggregations of census tracts and were designed for aggregating and mapping public microdata. Microdata consists of a selection of actual individual responses from the census or survey with the personal identifying information (name, address, etc) stripped away. Researchers can build their own indicators from scratch, aggregate them to PUMAs, and then figure out the degree to which the sample represents the entire population.

Since PUMAs have a large population, the new three-year ACS data is available at the PUMA level. The PUMAs essentially become surrogates for neighborhoods or clusters of neighborhoods, and in fact several NYC agencies have created districts or neighborhoods based on these boundaries for statistical or planning purposes. This wasn’t the original intent for creating or using PUMAs, but it’s certainly a useful application of them.

You can check out the NY Times article and maps here – Census Shows Growing Diversity in New York City (12/9/08). I tested ACS / PUMA mapping out myself by downloading some PUMA shapefiles from the Census Bureau’s Generalized Cartographic Boundaries page, grabbing some of the new annual ACS data from the American Factfinder, and creating a map of Philly. In the map below, you’re looking at 2005-2007 averaged data that shows the percentage of residents who lived in their current home last year. If you know Philly, you can see that the PUMAs do a reasonable job of approximating regions in the city – South Philly, Center City, West Philly, etc.

The problem I ran into here was that data did not exist for all of the PUMAs – in this case, South Philly and half of North Philly had values of zero. According to the footnotes on the ACS site, there were no values for these areas because “no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution”. So even though the PUMA geography is generally available, there still may be cases where data for particular variables for individual geographies is missing.

Just for the heck of it, I tried looking at the annual ACS data which is limited to more populated areas (must have 65k population where 3 year estimates are for areas with at least 20k) and even more data was missing (in this instance, all the areas in the northeast). Even though PUMAs have a minimum population of 100k people, the ACS sampling is county based. So even if the sample size for a county is ideal, they may not have a significant threshold for individual places within a county to compute an estimate. At least, that’s my guess. Regardless, it’s still worth looking at for the city and data you’re interested in.

ACS Data for Philly Pumas


Copyright © 2013 Gothos. All Rights Reserved.
No computers were harmed in the 2.062 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.