Archive for the ‘Data Sources’ Category

Natural Earth Vector and Raster Data

Tuesday, December 15th, 2009

I haven’t been posting regularly as I’ve been swamped this semester – but now that it’s coming to an end I should be able to crank out a post or two each month.

I recently saw a message on Maps-L about a new GIS data source, Natural Earth, and just got around to taking a look at it. It’s run by a volunteer organization dedicated to providing free, integrated, public domain map layers for producing high-quality maps at small scales. They have a pretty comprehensive website that includes a blog, feature list, contributor information, and details on how to volunteer.

Natural Earth provides smooth, generalized vector and raster layers at three scales: 1:10m, 1:50m, and 1:110m. See my screen shot of the Delmarva peninsula to see the distinctions (beige area is 110m, red line is 50m, and blue line is 10m).

Having a choice of scales with vector and raster data layers from the same source is a huge plus (many other country-level boundary files available on the web are detailed and suitable for large scale maps, but look messy when you zoom out to a smaller scale). Natural Earth also provides outlines for land and water (including legal water boundaries for all the Pacific islands), hydrographic features generalized to the different scales, ice shelves, urban areas, and several lat/long grid line layers.

For country boundaries they’ve gotten around the tangled issue of country definitions by providing different layers for different definitions, so you can choose the one that’s most appropriate – sovereign states (so, Greenland would be part of the Denmark polygon, Alaska and Puerto Rico part of the US, and French Guiana part of France), countries (Greenland separate from the Denmark polygon, Puerto Rico separate from the US, Alaska part of the US, and French Guiana part of France), and subunits (each place its own polygon). As you move down this hierarchy, places are linked back to their whole (so there are fields in the subunit file that list which country and sovereign state it’s part of).

At this point subdivisions (states / provinces) are only provided for the US and Canada. They do provide some descriptive metadata for each layer on the website, but the metadata doesn’t follow any standardized format for geographic data. The biggest missing link is unique identifiers – none of the countries have ISO or FIPS codes, so there aren’t any fields to join attribute data to for thematic mapping (except country name, which never works smoothly given the amount of variation with names).

Overall this looks like a great resource. Vector data is in shapefile format, raster data is in TIFF format, and everything is defined as simple WGS 84, so these files should work with almost any GIS package, ready to go.

Update on Some Data Sources

Saturday, October 31st, 2009

Here’s my last chance to squeeze in a post before the month is over. There have been a lot of changes and updates with some key data sites lately. Here’s a summary:

  • The homepage for gdata, which provides global GIS data that was created as part of UC Berkeley’s Biogeomancer project, has moved to the DIVA-GIS website. DIVA-GIS is a free GIS software project designed specifically for biology and ecology applications, with support from UC Berkeley as well as several other research institutions and independent contributors. It looks like the old download interface has been incorporated into the DIVA-GIS page.
  • The US Census Bureau has recently released its latest iteration of the TIGER shapefiles, the 2009 TIGER/Line Shapefiles. Since they seem to be making annual updates, which has involved changing the URLs around, it may be better to link to their main TIGER shapefile page where you can get to the latest and previous versions of the files.
  • The bureau has released its latest American Community Survey (ACS) data: 2008 annual estimates for geographic areas with 65,000 plus people, and three-year 2006-2008 estimates for geographic areas with 20,000 plus people. Both are available through the American Factfinder.
  • Over the summer, UM Information Studies student Clint Newsom and I created a 2005-2007 PUMA-level New York Metropolitan ACS Geodatabase (NYMAG). It’s available for download on the new Baruch Geoportal, which was re-launched as a public website this past September. It’s a personal geodatabase in Microsoft Access format, so it can only be directly used with ArcGIS. I plan on creating the 2006-2008 version sometime between January and March 2010, and hope to release an Access and SQLite version, as the latest development versions of QGIS now offer direct support for SQLite geodatabases in the SpatiaLite format (which is awesome!).
  • While it’s not a source for GIS data or attribute tables, it’s still worth mentioning that the CIA World Factbook completely revised their website this past summer. The previous web versions of the factbook took their design cues from the old paper copies of the report. The CIA revamped the entire site and apparently will be using a model of continuous rather than annual updates. It’s a great site for getting country profiles – another good option is the UN World Statistics Pocketbook, which is part of the UNdata page.

UNdata Processing, Calc Data Pilot

Sunday, September 27th, 2009

I’d downloaded some data from the UNdata website and cleaned it up so I could use it for my class, and thought I’d share some tips here. In many cases when you download data from UNdata you get multiple records for each country; one record for each year for each data point. In order to bring this data into GIS, I needed to re-arrange it to move the years from rows to columns, so that I’d have one record for each country with multiple columns for years.

You can do this in Excel using a pivot table, but since I was working off of my Linux notebook, I accomplished this using the Data Pilot tool in Open Office 3.0’s Calc spreadsheet. Here’s what I did:

  • I opened the csv file I downloaded from UNdata in Calc, and accepted the defaults on the text import screen. Once it was imported, I saved the file in a spreadsheet format – you can use Calc’s ods format, or you can save it in Excel xls format if you need to use Excel later. But you have to get out of the csv format – Calc crashed on me a couple of times when I was running the Pilot and creating multiple sheets in the csv.
  • I went up to the Data menu, selected Data Pilot, and Start, which opens the Data Pilot Menu. I clicked the More button to see the full range of options.
  • Then it was a simple matter of dragging the field names into the right places. I dragged the country code and name fields into the rows box, the year into the column box (as I wanted to move years to columns), and the actual data field into the values box. Under the options listed under More, I changed the Results drop down box to save the table in a new sheet, and I unchecked all of the boxes listed below (for adding filters, creating totals rows and columns, etc). Then clicked OK.
  • Voila! I had my newly formatted table, with one row for each country and one column for each year. But since I’ll be bringing this data into GIS (and will have to save the data in DBF format as I want my students to bring it into QGIS), I need to make sure that my data doesn’t have any funky formatting that may mess up joining my data to a shapefile. So I added a blank worksheet, copied my new pilot table, and did a Paste Special into the blank worksheet and pasted only text and numbers – with formatting, formulas, and anything else funky left out.
  • Once I had my plain, reformatted data in my new sheet, I deleted the top row (which had labels for Sum Value and Year) so I’d be left with only one header row, and I changed the field names to something more database friendly (truncating names and removing spaces). Lastly, I deleted the original data sheet and the formatted data pilot table sheet, so I was left with just the final copy.

That’s it! Sort of. Since I now have the years in columns, I could create a few calculated fields to show change over time.
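
If you would rather script the rows-to-columns step than click through Data Pilot, here is a rough sketch of the same idea in Python using only the standard library. The column names (code, country, year, value) and the sample figures are made up for illustration; a real UNdata download will have its own headers, so treat this as a starting point rather than a finished tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format sample: one row per country per year.
# Column names and values are invented for illustration only.
LONG_CSV = """code,country,year,value
4,Afghanistan,2000,10.0
4,Afghanistan,2005,12.5
8,Albania,2000,20.0
8,Albania,2005,25.0
"""

def pivot_years(rows):
    """Group long-format rows into {(code, country): {year: value}}."""
    table = defaultdict(dict)
    years = set()
    for row in rows:
        key = (row["code"], row["country"])
        table[key][row["year"]] = float(row["value"])
        years.add(row["year"])
    return sorted(years), dict(table)

def to_wide(years, table):
    """Flatten the pivot into a header row plus one row per country."""
    header = ["code", "country"] + ["y" + y for y in years]
    body = [
        [code, country] + [values.get(y) for y in years]
        for (code, country), values in sorted(table.items())
    ]
    return [header] + body

def pct_change(old, new):
    """Percent change between two values; None if it cannot be computed."""
    if old is None or new is None or old == 0:
        return None
    return (new - old) / old * 100.0

years, table = pivot_years(csv.DictReader(io.StringIO(LONG_CSV)))
wide = to_wide(years, table)
for row in wide:
    print(row)
```

The field names get a letter prefix (y2000, y2005) because column names that start with a digit tend to cause trouble once the table goes into a database or DBF file.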

But the last piece will be dealing with the country codes. To get a data table with codes from the UNdata website, you have to choose an Add Columns option from their data browser page before you download, as you don’t get the country codes by default. Then, the codes you get could be anything. Since these data tables are coming from dozens of different organizations, agencies, and bureaus within the UN, the country codes will vary based on what that agency did. In some cases I’ve downloaded data that had the ISO two-letter alpha codes, and in other cases I had three-digit numerical ISO codes (stored incorrectly as numbers, so leading zeros were dropped).
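
The dropped leading zeros are easy to repair if you treat the codes as text. Here is a small sketch, assuming the codes are the three-digit ISO numeric ones; the function name is my own invention.

```python
def restore_numeric_code(value, width=3):
    """Turn 4, '4', or '4.0' back into a zero-padded string such as '004'."""
    text = str(value).strip()
    if text.endswith(".0"):
        # Spreadsheets sometimes export the integer 4 as 4.0
        text = text[:-2]
    if not text.isdigit():
        raise ValueError(f"not a numeric code: {value!r}")
    return text.zfill(width)

for raw in (4, "76", 8.0, "840"):
    print(raw, "->", restore_numeric_code(raw))
```

Once the codes are padded, keep the column formatted as text in the spreadsheet or database, otherwise the zeros will disappear again on the next save.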

Most of the tables I’ve been downloading come from the World Health Organization (WHO), and came with no standardized country codes. Instead, the codes are sequential numbers assigned to the countries in alphabetical order from 1 to 193. Doh! Then, if a new country gets added they tack on the next available number regardless of the alphabet – so the country of Montenegro is assigned 194, after Zambia which is 193. Typically, data from countries that are not UN members or observers (like Liechtenstein and Vatican City) or are dependencies (Greenland, Falkland Islands, French Polynesia, etc) are not included in the data sets.

So, I’ll be typing in ISO alpha two codes into one of my data tables and will end up with a table that connects their sequential number system to ISO. Then I can bring this bridge and all of the other tables into a database, relate them to the bridge based on the sequential number, and create new tables out of them that have ISO codes, so I can join them to my GIS file based on ISO. Or I guess I could add the sequential number field to my countries shapefile and join each table to it based on the sequential number.
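
The bridge-table idea can be sketched in a few lines of Python. The sequential numbers below are placeholders rather than the real WHO values, and the field names (seq_code, iso_a2) are hypothetical; the point is just that every record is looked up in the bridge, and anything without a match gets set aside for review instead of being silently dropped.

```python
# Hypothetical bridge: sequential agency code -> ISO alpha-2 code.
# The sequential numbers are placeholders, not the real WHO values.
BRIDGE = {1: "AF", 2: "AL", 3: "DZ"}

def attach_iso(records, bridge, key="seq_code"):
    """Split records into those with an ISO match and those without."""
    matched, unmatched = [], []
    for rec in records:
        iso = bridge.get(rec[key])
        if iso is None:
            unmatched.append(rec)
        else:
            # Copy the record and add the ISO code as a new field
            matched.append({**rec, "iso_a2": iso})
    return matched, unmatched

records = [
    {"seq_code": 1, "value": 5.0},
    {"seq_code": 99, "value": 7.0},  # no entry in the bridge
]
matched, unmatched = attach_iso(records, BRIDGE)
print(matched)
print(unmatched)
```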

Anyway – happy Data Piloting (or Pivoting, if you prefer).

2007 Economic Census

Monday, June 8th, 2009

The US Census Bureau has begun releasing data for the 2007 Economic Census. The bureau conducts the survey of businesses every five years – all medium to large size businesses and multi-part businesses are counted, samples are taken for smaller businesses, and various administrative records are used to calculate businesses with no employees (i.e. freelancers). All businesses are categorized hierarchically by North American Industry Classification System (NAICS) codes and the data is reported by industry and geography. The number of establishments, employees, payroll, and sales are counted for the nation, states, counties, metro areas, places, and ZIP codes.
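
Because NAICS codes are hierarchical, each broader category is just a shorter prefix of the detailed code. Here is a minimal sketch of walking up that hierarchy; the function name is my own, and the code 445110 is only used as an example input. One caveat: a few sectors are published as ranges (Retail Trade is 44-45, for instance), so the two-digit prefix identifies the sector but may not match its printed label exactly.

```python
# Names of the NAICS levels by number of digits
NAICS_LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "NAICS industry",
    6: "national industry",
}

def naics_ancestors(code):
    """List (level name, prefix) pairs from the sector down to the code itself."""
    code = str(code)
    if not (code.isdigit() and 2 <= len(code) <= 6):
        raise ValueError(f"expected a 2 to 6 digit NAICS code, got {code!r}")
    return [(NAICS_LEVELS[n], code[:n]) for n in range(2, len(code) + 1)]

for level, prefix in naics_ancestors("445110"):
    print(level, prefix)
```

This is handy for rolling six-digit records up to broader categories when the detailed figures are not disclosed.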

At this point national industry totals for the broadest categories of NAICS are available, as are preliminary numbers for the most specific NAICS categories (six digit) at the national level. Data for smaller geographic areas will be released between October 2009 and August 2010.

The biggest change from the 2002 Economic Census is the delivery method for the data. There will be no more 90 page pdf files or HTML tables that drill down six levels. All of the data will be released via the American Factfinder only. Other changes include the addition of some new geography (CDPs with at least 5000 people), new metro area definitions, and the revised 2007 definitions for NAICS which include small changes to the Finance, Insurance, Real Estate, Professional Services, and Administrative Services categories.

Additional changes for 2007, the data release schedule, NAICS codes, and methodology docs are all available at the 2007 Economic Census homepage within the Census Bureau’s website.

All of the data is aggregated by industry and geography – you cannot get lists of businesses with names and addresses as this information is kept confidential. Furthermore, to maintain confidentiality, if one company controls a large share of the market for a specific sector within a specific geographic area, or if there are few businesses within a sector in a specific geographic area, much of the data (with the exception of the number of businesses) is withheld (marked with a D for disclosure). Oftentimes this means that data for industries within small areas (big box retail in a small town) and data for industries with few establishments in an area (mining establishments in New York City) are hidden. The smaller the geography, the more likely it is that the data will not be disclosed. This becomes a technical issue if you want / need to move this data into a database, as these pesky disclosure notes are stored in the same columns as the data and prevent you from designating the fields as numeric.
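
One way around the mixed columns is to split each cell into a numeric value and a separate flag before loading the table into a database. This is only a sketch under my own assumptions: the function names are invented, and I am treating any non-numeric text (such as D) as a flag.

```python
def parse_cell(raw):
    """Return (number or None, flag or None) for one cell of a data column."""
    text = raw.strip().replace(",", "")
    if text == "":
        return None, None
    try:
        return float(text), None
    except ValueError:
        # Anything that is not a number is kept as a flag, e.g. "D"
        return None, text

def split_column(cells):
    """Turn one mixed column into a numeric column and a flag column."""
    values, flags = [], []
    for cell in cells:
        value, flag = parse_cell(cell)
        values.append(value)
        flags.append(flag)
    return values, flags

values, flags = split_column(["1,234", "D", "", "56"])
print(values)
print(flags)
```

The numeric column can then be typed as a number with nulls for the withheld cells, while the flag column records why a value is missing.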

Given the delay between the time the data is collected and the time it is released, it isn’t particularly helpful for analyzing our current economic climate, but it does provide a snapshot of the way the US economy looked at that moment, and is useful in understanding how the economy is evolving. Be aware that when making comparisons to past data, you have to correct for changes in geography and NAICS definitions. The differences between 2002 and 2007 are not too great, but more adjustments are necessary as you go further back in time. The Bureau provides data back to 1997 through the American Factfinder and some data from 1992 on an older page. If you need to go back further, you’ll be entering the realm of (gasp!) CD-ROMs or the paper reports.

Social Explorer and New ACS Census Data

Thursday, January 22nd, 2009

This is kind of a follow-up to my last post – the Social Explorer, a great interactive mapping site that allows you to map US Census data, has added the 2005-2007 American Community Survey data to their site at the PUMA level. This is the smallest geographic area that is available for recent data, until we get to the 2010 Census and 2010 ACS. At this point you can look at total population, race, and Hispanic ethnicity. It looks like you can make maps, but you can’t export the data unless you subscribe to the full version.

The Social Explorer allows you to map a wide selection of decennial census data all the way back to the 1790 census (they have a partnership with NHGIS, which provides historical data and boundary files for free download with registration). Tract-level data is available back to 1940. While you can map the data, and you can generate slideshows and download static maps as image files, you can only generate reports for the 2000 census. In order to get full access for report generation and other features, you’ll have to subscribe (or find access to a library that does).

Social Explorer also works with ARDA (Association of Religion Data Archives) to create maps of county-level religious affiliation (since, by law, the US Census does not collect this data). Of all the interactive mapping sites I’ve seen, the Social Explorer is one of the slickest and easiest to use.

Mapping ACS Census Data for Urban Areas With PUMAs

Tuesday, December 16th, 2008

The NY Times wrote a story recently based on the new 3 year ACS data that the Census Bureau released a couple weeks ago (see my previous post for details). They created some maps for this story using geography that I would never have thought to use.

Outside of Decennial Census years, it is difficult to map demographic patterns and trends within large cities as you’ll typically get one figure for the entire city and you can’t get a breakdown for areas within. Data for areas like census tracts and zip codes is not available outside the ten-year census (yet), and large cities exist as single municipal divisions that aren’t subdivided. New York City is an exception, as it is the only city composed of several counties (boroughs) and thus can be subdivided. But the borough data still doesn’t reveal much about patterns within the city.

The NY Times used PUMAs – Public Use Microdata Areas – to subdivide the city into smaller areas and mapped rents and income. PUMAs are aggregations of census tracts and were designed for aggregating and mapping public microdata. Microdata consists of a selection of actual individual responses from the census or survey with the personal identifying information (name, address, etc) stripped away. Researchers can build their own indicators from scratch, aggregate them to PUMAs, and then figure out the degree to which the sample represents the entire population.

Since PUMAs have a large population, the new three-year ACS data is available at the PUMA level. The PUMAs essentially become surrogates for neighborhoods or clusters of neighborhoods, and in fact several NYC agencies have created districts or neighborhoods based on these boundaries for statistical or planning purposes. This wasn’t the original intent for creating or using PUMAs, but it’s certainly a useful application of them.

You can check out the NY Times article and maps here – Census Shows Growing Diversity in New York City (12/9/08). I tested ACS / PUMA mapping out myself by downloading some PUMA shapefiles from the Census Bureau’s Generalized Cartographic Boundaries page, grabbing some of the new annual ACS data from the American Factfinder, and creating a map of Philly. In the map below, you’re looking at 2005-2007 averaged data that shows the percentage of residents who lived in their current home last year. If you know Philly, you can see that the PUMAs do a reasonable job of approximating regions in the city – South Philly, Center City, West Philly, etc.

The problem I ran into here was that data did not exist for all of the PUMAs – in this case, South Philly and half of North Philly had values of zero. According to the footnotes on the ACS site, there were no values for these areas because “no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution”. So even though the PUMA geography is generally available, there still may be cases where data for particular variables for individual geographies is missing.

Just for the heck of it, I tried looking at the annual ACS data which is limited to more populated areas (must have 65k population where 3 year estimates are for areas with at least 20k) and even more data was missing (in this instance, all the areas in the northeast). Even though PUMAs have a minimum population of 100k people, the ACS sampling is county based. So even if the sample size for a county is ideal, they may not have a significant threshold for individual places within a county to compute an estimate. At least, that’s my guess. Regardless, it’s still worth looking at for the city and data you’re interested in.

ACS Data for Philly Pumas

Census Bureau Releases New ACS Data

Wednesday, December 10th, 2008

The Census Bureau released its new American Community Survey data the other day. Three year averages for a variety of socio-economic variables are now available for all geographic areas that have at least 20,000 people. The ACS has been releasing annual data for most of this decade for areas with at least 65,000 people and will continue to do so. They didn’t provide data for smaller areas because the numbers were not as statistically robust. Now that they have three years of data, they can average the numbers for three years and get sound data for areas with a population of at least 20k.

Data for 2005 to 2007 is available now, and like the annual numbers, you’ll get a range of values and a confidence interval. For example, we can say with 90% confidence that the estimated population of Atlantic City, NJ between 2005 and 2007 was 35,770, plus or minus 1,749 people. The Bureau created this estimate based on a sample of 1,379 people in AC.
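
As a quick worked example of reading an estimate together with its margin of error, the range for the Atlantic City figure above is simply the estimate minus and plus the margin. The function name is my own.

```python
def moe_interval(estimate, margin):
    """Lower and upper bounds implied by an estimate and its margin of error."""
    return estimate - margin, estimate + margin

low, high = moe_interval(35770, 1749)
print(low, high)  # 34021 37519
```

In other words, the 90% confidence range runs from 34,021 to 37,519 people.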

Next year, the census will release new annual numbers for areas with a population of at least 65k, and will update the three year averages for areas with 20k by adding the newest year of data and dropping the oldest one to calculate a new average.

All of the data is available through the American Factfinder.

If you are looking for population figures for basic indicators (population, race, gender, age, and housing units) for basic geographic areas (states, counties, places, and metro areas), you’ll probably want to consider using estimates from the Bureau’s Population Estimates program instead. Their annual estimates are based on a demographic calculation that factors in births, deaths, and migration, and is not based on a survey (according to that program, Atlantic City had 39,684 residents in 2007 – 3,914 more people than the ACS midrange estimate). If you’re looking for any other kind of data (ethnicity, immigration status, income, poverty, rent, home value, etc) the ACS is your best bet.

By 2010 the Bureau will begin releasing ACS 5 year averages for all geographic areas. Of course, we’ll also have our next decennial census in 2010. The big change here is that, since we’ll have the ACS churning out data for all areas for every year from that point forward, the Bureau is doing away with the long form (which was sent to one in six households) that was issued in past censuses, and will only collect data using the basic short form, which gets distributed to everyone. For more info on this change, see the Bureau’s Census 2010 info page.

ReferenceUSA for business data

Sunday, November 30th, 2008

Sorry that November has been another crummy month for posts. Here’s one that I’ve been meaning to write for quite a while.

While there is a lot of free GIS data out there, one of the black holes is business data. Specifically, if you want to plot all of the businesses in one industry or all of the branches or locations of one company, where do you get the data? I’ve found that, if you need a comprehensive resource, this is one of those datasets that you have to pay for.

At our library we subscribe to a great business directory called ReferenceUSA, which is produced by a company called InfoUSA. Their directories of American and Canadian businesses are extremely comprehensive and cover every business large and small. They also have an international directory that has mid-size to large businesses. You can generate lists of businesses using several criteria and filters.

For places, you can specify the entire country, states, counties, places, or ZIP codes. You can generate lists based on company names, keywords, or NAICS codes to grab all of the businesses in one industry. Once you have your list, you can click on each individual business to get a detailed profile. For GIS purposes, you’ll want to use the download option. Depending on your subscription, you’ll be able to download only a certain number of records at a time (we can get 25 records per download). Just download as a csv file, save, open in a spreadsheet, then start downloading subsequent batches and start copying and pasting records into a master file.
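
If the copying and pasting gets tedious, stitching the batches together can be scripted. Here is a rough sketch that assumes every batch was saved as a csv with an identical header row; the function name and the file naming pattern in the usage comment are hypothetical.

```python
import csv
import glob
import os

def merge_batches(pattern, master_path):
    """Append all csv batches matching pattern into master_path.

    The header is written once. Returns the number of data rows copied.
    """
    header = None
    count = 0
    with open(master_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            # Never read the master file back into itself
            if os.path.abspath(path) == os.path.abspath(master_path):
                continue
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                batch_header = next(reader, None)
                if batch_header is None:
                    continue  # empty file
                if header is None:
                    header = batch_header
                    writer.writerow(header)
                elif batch_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    count += 1
    return count

# Example call, assuming batches were saved as batch_01.csv, batch_02.csv, ...
# merge_batches("downloads/batch_*.csv", "downloads/master.csv")
```

The header check is there because mixing basic and detailed downloads in one master file would shift the columns out of alignment.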

When you go to download, you’ll be prompted to choose basic, detailed, or custom. Basic isn’t going to cut it, as it’s missing the key fields – latitude and longitude coordinates. Choose the detailed option to get all of the fields. The custom option has some bugs – you’ll get lat and long without decimal places and some of the data for fields will be missing. Once you have all of the detailed records, you can delete a lot of the unnecessary fields. You’ll want to, as many of the field headings are not database friendly – many are long and contain spaces, which will cause problems when you go to import the table into GIS. So be sure to delete any that you don’t need and fix the ones you do need.
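
Fixing the headings by hand works, but here is a sketch of one way to automate it. It assumes the table is headed for a shapefile or DBF file, where field names are limited to 10 characters; the function name and the sample headings are made up, not the actual ReferenceUSA field names.

```python
import re

def clean_field_names(names, max_len=10):
    """Make headings database friendly: no spaces, short, and unique."""
    cleaned, seen = [], set()
    for name in names:
        # Replace runs of anything but letters and digits with an underscore
        base = re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_").upper()
        if not base or base[0].isdigit():
            base = "F_" + base
        base = base[:max_len]
        candidate, n = base, 1
        while candidate in seen:
            # Truncation can create duplicates, so add a numeric suffix
            suffix = str(n)
            candidate = base[:max_len - len(suffix)] + suffix
            n += 1
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned

print(clean_field_names(["Company Name", "Company Name 2", "Latitude"]))
```

The uniqueness step matters because two long headings that differ only at the end will collide once they are truncated.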

Once you have your table ready, add it to your favorite GIS program. In ArcGIS you can use the Add XY Table feature to plot the points and turn them into a shapefile. Remember to specify the X coordinate as your longitude field and the Y coordinate as latitude, and define your geographic coordinate system as WGS 84. Once you plot them, right click on the feature in the Table of Contents and export them out as a shapefile so you have a permanent layer (see my previous XY post for more details). You can map the businesses as regular old points, or make some graduated symbols based on some of the attributes, like sales or total employees (ReferenceUSA doesn’t provide the exact data, but identifies a range, i.e. 1 to 10 employees, 11 to 25, etc).
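
Outside of ArcGIS, the same X equals longitude, Y equals latitude rule applies. As an illustration, here is a sketch that turns table rows into GeoJSON points, a format that stores coordinates in longitude, latitude order and assumes WGS 84. The field names (Latitude, Longitude, Company) and the sample record are hypothetical.

```python
import json

def rows_to_geojson(rows, lon_field="Longitude", lat_field="Latitude"):
    """Build a GeoJSON FeatureCollection of points from table rows."""
    features = []
    for row in rows:
        lon, lat = float(row[lon_field]), float(row[lat_field])
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            raise ValueError(f"coordinates out of range: {lon}, {lat}")
        # Everything except the coordinate fields becomes an attribute
        props = {k: v for k, v in row.items() if k not in (lon_field, lat_field)}
        features.append({
            "type": "Feature",
            # GeoJSON order is [x, y], that is [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}

sample = [{"Company": "Cafe X", "Latitude": "40.75", "Longitude": "-73.98"}]
collection = rows_to_geojson(sample)
print(json.dumps(collection, indent=2))
```

The range check will also catch the most common mistake, which is swapping the two fields for places where the longitude is beyond plus or minus 90.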

Most of the open source alternatives also have a tool or plugin that allows you to plot XY data. Of course, the data does include address fields if you wanted to geocode your points rather than plot XY (but plotting XY is a million times easier and doesn’t require downloading huge street network files).

The good news here is that if you’re not affiliated with a university, you can probably get access to this db from a large public library, as many will have a subscription to a business directory as a matter of course. If they don’t have RefUSA they may have an alternative like the D and B Million Dollar Database. It’s another business directory that allows you to download XY data for businesses, but it is not nearly as comprehensive.

GIS Data: UNSDI, gData, CEGRP, AIMS

Monday, August 4th, 2008

I’ve stumbled across a few good sites for GIS data lately. Check these out:

UNSDI-NCO: The United Nations Spatial Data Infrastructure site, maintained by the Netherlands Coordination Office. They have many global datasets as well as country-specific ones, often for developing countries where data is hard to come by. Includes boundaries, roads, infrastructure, and natural features. Click on the Datasets link under the Categories menu to see the list, then click on the feature of your choice. You’ll have to scroll through the metadata to the Distribution Info element to get to a download link. Not all of the datasets are available for public download.

gData: This site is housed at Berkeley as part of the Biogeomancer Project, whose goal is to share data on biodiversity. You can download boundaries, hydrography, infrastructure, topography, and climate data in vector and raster formats for any country in the world. The data is aggregated, and in some cases improved, from many public sources. Administrative boundaries include 1st, 2nd, and often 3rd level divisions. A great, comprehensive source.

CEGRP: China Earthquake Geospatial Research Portal, housed at Harvard. The goal of the site is to gather and distribute geospatial data in response to the earthquake that hit Sichuan, China in May 2008. Vector and raster layers for all of China and for this particular region where the earthquake hit.

AIMS: Afghanistan Information Management Services. A non-profit group located in Afghanistan that has created and maintains a geospatial infrastructure to support the government. Vector datasets for the entire country and the city of Kabul are available for download. They also offer a number of static pdf maps.

Adding Long / Lat XY Data to ArcMap

Monday, July 7th, 2008

Here’s a tutorial I’ve been meaning to write: adding a table of longitude and latitude coordinates to ArcMap and turning them into features. For this example, I’ll be using place names from the GEOnet Names Server country files. The US National Geospatial-Intelligence Agency has a pretty extensive list of geographic features for each country, with coordinates in many formats, including longitude and latitude in decimal degrees. I’ll use Botswana in southern Africa as an example, as it has a small record set and because I have some admin boundaries handy that I’ve downloaded from SAHIMS.

  • Download the file from the GNS and unzip it. It is a tab-delimited text file. If you like, you can open it in Excel or another spreadsheet to see what it looks like. This works fine for this example, but won’t work for larger or more populated countries because the files will exceed the maximum number of records that a spreadsheet can handle (65k). You’ll need to import the file into a database (Access for example) if you want to take a look in those cases. In either event, you’ll be able to add the text file directly to ArcMap, so no worries.
  • Open ArcMap and under the Tools menu, select Add XY Data. In the dialog box, you’ll select the file that contains your XY coordinates. Choose the text file you’ve downloaded. ArcMap will then search through the fields and look for appropriate ones to add as X and Y fields. In this case, it should correctly choose LONG for X and LAT for Y. If Arc couldn’t figure it out, you would have to specify which columns have the coordinates. Longitude is ALWAYS the X coordinate, and Latitude is ALWAYS the Y. Finally, you’ll select a projection. Choose the standard geographic coordinate system WGS 1984, which is usually a safe bet when adding long/lat data from most sources.
  • Hit OK, and Arc will plot the coordinates (after you click through the warning message). In this example, it looks like there is one wayward point, way to the north. When you see something like this, it often means that one of the coordinates is missing a minus sign: latitudes below the equator are negative, as are longitudes east of the international date line and west of the prime meridian. If you use the Identify tool, you’ll see that the minus sign for latitude for this wayward point is missing. The easiest thing to do would be to go back into the text file, edit it, and add it to ArcMap again.
  • Even though Arc has plotted the points, they still don’t exist as features (remember the warning message? That’s essentially what it was saying). Select the plotted points in the Table of Contents, right-click, select Data, and select Export. Export the points out as a new shapefile or a feature class in a geodatabase. Then add the new features to the map.
  • At this point, it may be helpful to have a frame of reference for all of these points. Get your hands on some administrative layers, like country boundaries. I downloaded the outline of Botswana from SAHIMS. This step usually requires projecting and reprojecting, as you’ll need to get your points layer to match the projection of the other files you’re working with. I always use the ArcToolbox within ArcCatalog to fiddle with projections and then add the finished files to a new, blank map in ArcMap. In my case, the Botswana boundary was undefined – I had to consult the metadata from their website to figure out what the projection is (NAD 1927) and then define it using the ArcToolbox (Data Management Tools, Projections and Transformations, Define Projection). Then, I had to convert the Botswana points layer from WGS 1984 to match the boundary’s NAD 1927 projection (using Data Management Tools, Projections and Transformations, Feature, Project).
  • Add the projected boundary and reprojected points to your map. Many of these points are point features (villages, towns, farms, mountain peaks), while others represent the geographic centers of lines (roads, rivers) or areas (administrative areas, parks, reserves). You’ll probably want to extract certain kinds of features. At this point, you’ll want to take a look at the attribute table for the points file and consult the GNS description for the names files. The description will tell you what each of the data columns represents and what all of the codes mean. The FC field will come in quite handy here, as it designates categories for each feature. So if we wanted to extract populated places, under the Selection Menu in ArcMap we could do a Select by Attribute where the field FC is equal to P, which is the code for populated place features. Once they are selected, you can do a Data, Export to create a new shapefile with just those features.
  • Alternatives do abound here. If you prefer, you could do a lot of the work of editing and creating feature subsets within a geodatabase. You can also follow these same, general procedures using open source tools (I believe that QGIS has a tool for adding XY data). And while we’re discussing a specific example here, the same basic steps would apply for any XY dataset.
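
Two of the steps above, spotting a wayward point and pulling out the populated places, can also be checked in a script before the file ever goes into a GIS. The sketch below uses a made-up, tab-delimited sample in the same general shape as a GNS country file: LAT, LONG, and FC are the fields mentioned in the steps, while NAME is a stand-in column. The bounding box for Botswana is a rough guess on my part, so adjust it for your own country.

```python
import csv
import io

# Made-up sample; LAT, LONG and FC follow the steps above, NAME is a stand-in.
SAMPLE = "\n".join([
    "FC\tLAT\tLONG\tNAME",
    "P\t-24.65\t25.91\tVillage A",
    "P\t21.17\t27.51\tVillage B",  # minus sign missing on the latitude
    "H\t-19.98\t23.42\tRiver C",
])

# Rough bounding box (south, north, west, east); an assumption, not official.
BOTSWANA_BOX = (-27.5, -17.0, 19.5, 30.0)

def inside(lat, lon, box):
    south, north, west, east = box
    return south <= lat <= north and west <= lon <= east

def check_points(rows, box):
    """Separate rows that fall in the box from suspect ones, with a hint."""
    ok, suspect = [], []
    for row in rows:
        lat, lon = float(row["LAT"]), float(row["LONG"])
        if inside(lat, lon, box):
            ok.append(row)
        elif inside(-lat, lon, box):
            suspect.append((row, "latitude sign"))
        elif inside(lat, -lon, box):
            suspect.append((row, "longitude sign"))
        else:
            suspect.append((row, "unknown"))
    return ok, suspect

def populated_places(rows):
    """Keep only features whose FC code is P (populated places)."""
    return [row for row in rows if row["FC"] == "P"]

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
ok, suspect = check_points(rows, BOTSWANA_BOX)
places = populated_places(ok)
print([r["NAME"] for r in places])
print([(r["NAME"], hint) for r, hint in suspect])
```

The hint only says which sign flip would put the point back inside the box; it is still worth looking at the record before editing the text file.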

Copyright © 2012 Gothos. All Rights Reserved.