I recently went through my pages of suggested links for data and resources to update and clean them up. I’ve added many of the cool resources I’ve discovered since I started writing this blog, which ended up in individual posts but never made it onto these pages. I reworked the resources page in particular, to try to classify the reference materials, tools, and software into useful categories rather than just having one large blob of stuff.
The NY Times recently wrote a story based on the new three-year ACS data that the Census Bureau released a couple of weeks ago (see my previous post for details). They created some maps for this story using a geography that I would never have thought to use.
Outside of decennial census years, it is difficult to map demographic patterns and trends within large cities: you’ll typically get one figure for the entire city with no breakdown for the areas within it. Data for areas like census tracts and ZIP codes is not available outside the ten-year census (yet), and large cities exist as single municipal divisions that aren’t subdivided. New York City is an exception: it is the only city composed of several counties (the boroughs) and thus can be subdivided. But the borough data still doesn’t reveal much about patterns within the city.
The NY Times used PUMAs – Public Use Microdata Areas – to subdivide the city into smaller areas and mapped rents and income. PUMAs are aggregations of census tracts and were designed for aggregating and mapping public microdata. Microdata consists of a selection of actual individual responses from the census or survey, with the personal identifying information (name, address, etc.) stripped away. Researchers can build their own indicators from scratch, aggregate them to PUMAs, and then figure out the degree to which the sample represents the entire population.
Since PUMAs have a large population, the new three-year ACS data is available at the PUMA level. The PUMAs essentially become surrogates for neighborhoods or clusters of neighborhoods, and in fact several NYC agencies have created districts or neighborhoods based on these boundaries for statistical or planning purposes. This wasn’t the original intent for creating or using PUMAs, but it’s certainly a useful application of them.
You can check out the NY Times article and maps here – Census Shows Growing Diversity in New York City (12/9/08). I tested ACS / PUMA mapping out myself by downloading some PUMA shapefiles from the Census Bureau’s Generalized Cartographic Boundaries page, grabbing some of the new annual ACS data from the American FactFinder, and creating a map of Philly. In the map below, you’re looking at 2005-2007 averaged data that shows the percentage of residents who lived in their current home last year. If you know Philly, you can see that the PUMAs do a reasonable job of approximating regions in the city – South Philly, Center City, West Philly, etc.
The problem I ran into here was that data did not exist for all of the PUMAs – in this case, South Philly and half of North Philly had values of zero. According to the footnotes on the ACS site, there were no values for these areas because “no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution”. So even though the PUMA geography is generally available, there still may be cases where data for particular variables for individual geographies is missing.
Just for the heck of it, I tried the annual ACS data, which is limited to more populated areas (annual estimates require a population of at least 65k, whereas the three-year estimates cover areas with at least 20k), and even more data was missing (in this instance, all of the areas in the Northeast). Even though PUMAs have a minimum population of 100k people, the ACS sampling is county-based. So even if the sample size for a county is adequate, the sample for individual places within that county may not meet the threshold needed to compute an estimate. At least, that’s my guess. Regardless, it’s still worth looking at for the city and data you’re interested in.
The Census Bureau released its new American Community Survey data the other day. Three year averages for a variety of socio-economic variables are now available for all geographic areas that have at least 20,000 people. The ACS has been releasing annual data for most of this decade for areas with at least 65,000 people and will continue to do so. They didn’t provide data for smaller areas because the numbers were not as statistically robust. Now that they have three years of data, they can average the numbers for three years and get sound data for areas with a population of at least 20k.
Data for 2005 to 2007 is available now, and like the annual numbers, you’ll get a range of values and a confidence interval. For example, we can say with 90% confidence that the estimated population of Atlantic City, NJ between 2005 and 2007 was 35,770, plus or minus 1,749 people. The Bureau created this estimate based on a sample of 1,379 people in AC.
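The estimate and margin of error translate into an interval with simple arithmetic. A minimal sketch using the Atlantic City figures from above:

```python
# ACS figures are published as an estimate plus a 90% margin of error.
# Atlantic City, NJ, 2005-2007 population (figures from the text above):
estimate = 35770  # midrange population estimate
moe = 1749        # 90% margin of error

# The interval is simply the estimate plus or minus the margin of error.
lower, upper = estimate - moe, estimate + moe
print(f"90% confident the population is between {lower:,} and {upper:,}")
# -> 90% confident the population is between 34,021 and 37,519
```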
Next year, the census will release new annual numbers for areas with a population of at least 65k, and will update the three year averages for areas with 20k by adding the newest year of data and dropping the oldest one to calculate a new average.
All of the data is available through the American FactFinder.
If you are looking for figures for basic indicators (population, race, gender, age, and housing units) for basic geographic areas (states, counties, places, and metro areas), you’ll probably want to consider using estimates from the Bureau’s Population Estimates program instead. Its annual estimates are based on a demographic calculation that factors in births, deaths, and migration, not on a survey (according to that program, Atlantic City had 39,684 residents in 2007 – 3,914 more people than the ACS midrange estimate). If you’re looking for any other kind of data (ethnicity, immigration status, income, poverty, rent, home value, etc.), the ACS is your best bet.
By 2010 the Bureau will begin releasing ACS five-year averages for all geographic areas. Of course, we’ll also have our next decennial census in 2010. The big change here is that, since the ACS will be churning out data for all areas every year from that point forward, the Bureau is doing away with the long form (which was sent to one in six households in past censuses) and will only collect data using the basic short form, which gets distributed to everyone. For more info on this change, see the Bureau’s Census 2010 info page.
Here’s a tutorial I’ve been meaning to write: adding a table of longitude and latitude coordinates to ArcMap and turning them into features. For this example, I’ll be using place names from the GEOnet Names Server country files. The US National Geospatial-Intelligence Agency has a pretty extensive list of geographic features for each country, with coordinates in many formats, including longitude and latitude in decimal degrees. I’ll use Botswana in southern Africa as an example, as it has a small record set and because I have some admin boundaries handy that I’ve downloaded from SAHIMS.
- Download the file from the GNS and unzip it. It is a tab-delimited text file. If you like, you can open it in Excel or another spreadsheet to see what it looks like. This works fine for this example, but won’t work for larger or more populated countries, as the files will exceed the maximum number of records that a spreadsheet can handle (around 65k rows in older versions of Excel). You’ll need to import the file into a database (Access, for example) if you want to take a look in those cases. In either event, you’ll be able to add the text file directly to ArcMap, so no worries.
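If you’d rather skip Access, the import step for an oversized file can also be scripted. Here’s a minimal sketch using Python’s built-in sqlite3 module; the table layout is read from the file’s own header row, so no column names are hard-coded:

```python
import csv
import sqlite3

# Load a tab-delimited GNS names file into a SQLite table so it can be
# queried without a spreadsheet's row limit. Columns are taken directly
# from the header row of the file.
def load_names_file(path, db_path="gns.db"):
    con = sqlite3.connect(db_path)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        con.execute(f"CREATE TABLE IF NOT EXISTS names ({cols})")
        con.executemany(f"INSERT INTO names VALUES ({marks})", reader)
    con.commit()
    return con
```

From there you can inspect or subset the records with SQL before they ever reach a GIS.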
- Open ArcMap and under the Tools menu, select Add XY Data. In the dialog box, you’ll select the file that contains your XY coordinates. Choose the text file you’ve downloaded. ArcMap will then search through the fields and look for appropriate ones to add as X and Y fields. In this case, it should correctly choose LONG for X and LAT for Y. If Arc couldn’t figure it out, you would have to specify which columns have the coordinates. Longitude is ALWAYS the X coordinate, and Latitude is ALWAYS the Y. Finally, you’ll select a projection. Choose the standard geographic coordinate system WGS 1984, which is usually a safe bet when adding long/lat data from most sources.
- Hit OK, and Arc will plot the coordinates (after you click through the warning message). In this example, it looks like there is one wayward point, way to the north. When you see something like this, it often means that one of the coordinates is missing a minus sign: latitudes below the equator are negative, as are longitudes east of the international date line and west of the prime meridian. If you use the Identify tool, you’ll see that the minus sign for latitude for this wayward point is missing. The easiest thing to do is to go back into the text file, edit it, and add it to ArcMap again.
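A quick way to catch the missing-minus-sign problem before plotting is to compare each coordinate against the country’s rough bounding box. A sketch in Python, assuming LAT and LONG columns in decimal degrees (the Botswana bounding box values here are approximate):

```python
# Botswana lies entirely south of the equator, roughly between latitudes
# -27 and -17 and longitudes 19 and 30 (approximate values). A point with
# a positive latitude has almost certainly lost its minus sign.
def flag_suspect_points(rows, lat_range=(-27.0, -17.0), lon_range=(19.0, 30.0)):
    """Return the rows whose coordinates fall outside the expected box."""
    suspects = []
    for row in rows:
        lat, lon = float(row["LAT"]), float(row["LONG"])
        if not (lat_range[0] <= lat <= lat_range[1] and
                lon_range[0] <= lon <= lon_range[1]):
            suspects.append(row)
    return suspects

points = [
    {"LAT": "-24.65", "LONG": "25.91"},  # Gaborone: fine
    {"LAT": "21.17",  "LONG": "27.50"},  # missing minus sign: flagged
]
print(flag_suspect_points(points))
# -> [{'LAT': '21.17', 'LONG': '27.50'}]
```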
- Even though Arc has plotted the points, they still don’t exist as features (remember the warning message? That’s essentially what it was saying). Select the plotted points in the Table of Contents, right-click, select Data, and select Export. Export the points out as a new shapefile or a feature class in a geodatabase. Then add the new features to the map.
- At this point, it may be helpful to have a frame of reference for all of these points. Get your hands on some administrative layers, like country boundaries. I downloaded the outline of Botswana from SAHIMS. This step usually requires projecting and reprojecting, as you’ll need to get your points layer to match the projection of the other files you’re working with. I always use the ArcToolbox within ArcCatalog to fiddle with projections and then add the finished files to a new, blank map in ArcMap. In my case, the Botswana boundary was undefined – I had to consult the metadata from their website to figure out what the projection is (NAD 1927) and then define it using the ArcToolbox (Data Management Tools, Projections and Transformations, Define Projection). Then, I had to convert the Botswana points layer from WGS 1984 to match the boundary’s NAD 1927 projection (using Data Management Tools, Projections and Transformations, Feature, Project).
- Add the projected boundary and reprojected points to your map. Many of these points are point features (villages, towns, farms, mountain peaks), while others represent the geographic centers of lines (roads, rivers) or areas (administrative areas, parks, reserves). You’ll probably want to extract certain kinds of features. At this point, you’ll want to take a look at the attribute table for the points file and consult the NGA description for the names files. The description will tell you what each of the data columns represents and what all of the codes mean. The FC field will come in quite handy here, as it designates categories for each feature. So if we wanted to extract populated places, under the Selection menu in ArcMap we could do a Select by Attribute where the field FC is equal to P, which is the code for populated place features. Once they are selected, you can do a Data, Export to create a new shapefile with just those features.
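The same Select by Attribute step can also be done on the text file itself, before it ever reaches ArcMap. A sketch in Python, assuming the tab-delimited GNS file with its FC column as described above:

```python
import csv

# Keep only populated-place records (FC code "P") from a GNS tab-delimited
# names file, writing them to a new file. Returns the number of records kept.
def extract_populated_places(in_path, out_path):
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        count = 0
        for row in reader:
            if row["FC"] == "P":
                writer.writerow(row)
                count += 1
    return count
```

The output keeps the original header, so it can be added with Add XY Data just like the full file.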
- Alternatives do abound here. If you prefer, you could do a lot of the work of editing and creating feature subsets within a geodatabase. You can also follow these same, general procedures using open source tools (I believe that QGIS has a tool for adding XY data). And while we’re discussing a specific example here, the same basic steps would apply for any XY dataset.
I’ve been looking for census data for various countries, and have visited the usual suspects that aggregate this data – the CIA World Factbook and the United Nations Population Information Network. Other supra-national orgs like the IMF and World Bank also create and compile this info. These are fine sources, particularly if your goal is to look at basic data for several (or all) countries. But if you are studying or writing about one country in particular, it may seem odd to cite the UN, and even odder to cite the CIA. It would be better to go right to the source – the chief statistical agency in that particular country. In all likelihood, this agency would also have more in-depth stats than the aggregators.
But – where is the source? Rather than be left to the mercy of Google, where you’ll uncover the obvious suspects and lots of commercial sites and joe-schmoes who republished some data from last decade, visit the US Census Bureau’s list of foreign statistical agencies, which will lead you right to the source.
Assuming you can find some pages with some data (census data isn’t public domain in every country and isn’t necessarily online for free, or at all, in which case you may need to go with some of the aggregate sources), the next obstacle will be overcoming the language barrier. Many countries will publish pages in several languages, including English. Some may publish only limited info in English, or no info in English at all. If you don’t read the lingua franca, you can try a translating tool like Babelfish or the Google Language Tool to translate the page for you. The translation may not be perfect, but it should be good enough that you can figure out what you need (although if the language you are translating doesn’t use the Roman alphabet and Arabic numerals – i.e. 1, 2, 3, etc. – you may have some trouble).
The toughest obstacle to overcome may be the organizational barrier. If you are familiar with the US Census Bureau, you’ll know that it’s a large and complex organization with many subdivisions and datasets (decennial census, ACS, population estimates, etc). And despite its size, it doesn’t collect all socio-economic data (religious affiliation) and may not be the best source for all data (current labor force stats). Well – other countries are just as complicated, so be wary!
Another strategy would be to visit Wikipedia – not to cite as a source, but to find what sources they use. You’ll find many country-specific articles that cite the CIA Factbook or the UN, but some of the more detailed and well-written ones do cite reports written by the statistical agencies for the country in question, often with a link to the page or report. If you have access to some library databases, like Gale Virtual Reference, they will (usually) cite sound references as well. Happy hunting!
Surely I jest, as messing with FIPS codes is rarely fun at all. However, it’s one of those things that you constantly have to deal with when mapping US demographic data, and it’s one of those things that GIS books and tutorials rarely discuss. And since I was tangling with them yesterday, I thought I’d share some tricks you can use in Excel to manipulate and format FIPS codes.
FIPS codes were created by the federal government to uniquely identify all geographic units in the US and are widely used. In an ideal world, here are some examples of FIPS codes for four counties:
State_Name, County_Name, State_FIPS, County_FIPS, FIPS
California, Alpine, 06, 003, 06003
California, San Mateo, 06, 081, 06081
Delaware, New Castle, 10, 003, 10003
Pennsylvania, Philadelphia, 42, 101, 42101
Each state code is unique, and each county code is unique WITHIN each state. If you are working with data for every county, then you will need to use the concatenated 5 digit FIPS in order to have unique ID numbers for each county. Otherwise, in this example, you wouldn’t be able to distinguish Alpine County, CA from New Castle County, DE because they both have the same county FIPS: 003.
In your database or spreadsheet, the codes need to be saved as text strings, NOT as numbers. In the spreadsheet I opened the other day, they were mistakenly saved as numbers. Excel also has an annoying habit of making corrections that aren’t really correct at all. So instead of having the above, I had this:
State_Name, County_Name, State_FIPS, County_FIPS, FIPS
California, Alpine, 6, 3, 6003
California, San Mateo, 6, 81, 6081
Delaware, New Castle, 10, 3, 10003
Pennsylvania, Philadelphia, 42, 101, 42101
Since the codes were saved as numbers, all leading zeros were dropped. This is a problem, as I want to relate this data to data stored in other tables and shapefiles, where the FIPS codes are stored correctly. So, I needed to convert the data in this example back to strings (as join fields must have the same data type – you can’t join strings to numbers), and I had to get those zeros back. Somehow. Here’s how:
Open the data up in Excel, and convert the State_FIPS field to strings by selecting the column, opening Format Cells, and changing the format to Text.
Then insert a column to the right of State_FIPS, and type in this formula: =IF(LEN(C2)=1,CONCATENATE(0,C2),C2)
In English, this says: If the length of the value in c2 is one, then put a zero in front of the value in the new cell. Otherwise, just print the value of c2 in the cell. So in our example, the code for California would get converted from 6 to 06, while Delaware and Pennsylvania’s codes will just get reprinted, as they have the correct number of characters (two).
Then, it’s just a matter of copying and pasting this formula all the way down the rows, to create the correct two digit FIPS for each record. Lastly, select the whole column, copy it, go up to the Edit menu and select paste special for values only. This will overwrite all of the formulas in these cells and replace them with the actual output of the formula. You can follow these steps with the other two fields; the only thing that needs to change are the formulas.
Fixing the COUNTY_FIPS field is trickier, as we have three possibilities here: we’ll need to add two zeros if the code is one character long, one zero if the code is two characters long, and nothing if the code is three characters long. Here’s the formula: =IF(LEN(E2)=3,E2,IF(LEN(E2)=2,CONCATENATE(0,E2),CONCATENATE("00",E2)))
The parentheses get confusing, as the arguments within each IF statement and each LEN and CONCATENATE function need their own parentheses. In English: if the value in e2 is three characters long, just print that value in the new cell. Otherwise, if the length is two characters, tack a zero to the front of that value and then print it. Otherwise, tack two zeros to the front of it and print. In that last piece, we’re making the implicit assumption that if it isn’t three characters long, and if it isn’t two, it must be one. The two zeros must go in quotation marks so Excel reads them as a string. Without the quotes, Excel assumes you mean a number and will “correct” you by dropping a zero – which is exactly what you DON’T WANT!
So in our example above, Alpine and New Castle get two zeros added to the front, San Mateo gets one, and Philadelphia gets none.
Fixing the last FIPS field is easy – it’s just a repeat of our first formula, as there are only two options: Add a zero to the front if it’s four characters long, otherwise just print the value (which is five characters long).
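Outside of a spreadsheet, the same padding logic collapses to almost nothing. A sketch in Python, where zfill pads each field to its required width and handles all three cases at once:

```python
# Rebuild properly padded FIPS codes from values that were stored as
# numbers (leading zeros dropped). State codes are 2 digits, county
# codes 3, and the combined code 5.
def fix_fips(state_fips, county_fips):
    state = str(state_fips).zfill(2)    # e.g. 6 -> "06"
    county = str(county_fips).zfill(3)  # e.g. 3 -> "003"
    return state, county, state + county

print(fix_fips(6, 3))    # Alpine County, CA -> ('06', '003', '06003')
print(fix_fips(10, 3))   # New Castle County, DE -> ('10', '003', '10003')
```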
If you’re using a spreadsheet other than Excel, like Open Office’s Calc, you can use similar formulas with a few syntax changes. Calc uses semicolons instead of commas to separate arguments, so CONCATENATE(0,E2) becomes CONCATENATE(0;E2). You could also perform these operations in Microsoft Access using a Make Table query, with some adjustments: you would reference fields instead of cell values, and instead of the word CONCATENATE, Access uses the syntax &””& in between the fields that you want to join.
When I worked at the University of Washington Library, I created a tutorial for concatenating fields in Access, to supplement a tutorial my predecessor created for Excel. Both docs are available on WAGDA’s FAQ page.
Real fun, huh? ; )
It’s been a while since my last post – I’ve been locked away in my apartment on research leave for the past two weeks. I’m working on a database-backed web directory for finding GIS data, as I’m tired of dealing with bookmarks, html lists, and protracted web searches for keeping track of datasets. The goal is to keep it simple and standards-based. The basic architecture is in place; I’m just struggling to learn and apply PHP. Fingers crossed, I may have a prototype ready by the end of my next round of leave this summer.
I’m also still reading Georeferencing by Linda Hill, which extensively covers gazetteers and metadata standards. It’s definitely worth checking out, from a library near you.
I received an email floating around the Maps-L listserv the other day with an announcement about a website that had African GIS data. Since free, global GIS data can be hard to find, particularly for Africa, I thought I’d re-announce it here.
The Southern African Human-development Information Management Network (SAHIMS) is a UN affiliated organization that provides humanitarian and disaster relief for countries in southern Africa. They have a pretty comprehensive collection of GIS data on their site: http://www.sahims.net. The default GIS page provides layers that cover the entire region, plus some layers for Tanzania (which is technically outside the region). But you can use the menu on the left of the page to navigate to country-specific data for: Angola, Botswana, Lesotho, Madagascar, Malawi, Mozambique, Namibia, Swaziland, Zambia, and Zimbabwe. There doesn’t appear to be any data for South Africa.
The layers are in shapefile format and include administrative boundaries (including sub-national boundaries), transportation, hydrological features, climate, food, and disaster related layers. The availability and details of each layer vary by country.
I downloaded a few of the layers, and many were missing spatial reference information. However, they do provide decent metadata on their site, and you can use it to define the coordinate system and projection for each layer.