Archive for the ‘Data Processing’ Category.

Working With Lat Long, Numeric Data

Sorry that October has obviously been a pretty weak month for posts. I’ve been driven to distraction lately and haven’t done much GIS related work.

I was working on a project this week that involved manipulating data tables, so I thought I’d share a couple tips here. A number of months ago I wrote a post about manipulating FIPS codes and text-based ID fields. But what if you have to manipulate numeric fields? Adding decimal places, zeros, etc? The answer is - math!

In one field, I had a population figure from the 1970 Census that had been rounded to the hundreds place, so it was listed like this:

Bronx    14718

I wanted to make this a little more explicit by adding the appropriate zeros, so in Excel (or Calc if you prefer) I created a formula to multiply this by 100 =(c2*100) to get the full number with zeros:

Bronx    1471800

I also had fields with latitude and longitude coordinates in decimal degrees, but they lacked decimal points. The longitude field also lacked the minus sign, which means if we plotted the points they would end up in Asia instead of North America (longitude east of the dateline and west of the prime meridian is notated as negative in decimal degrees, as is latitude south of the equator). I knew from the metadata that each coordinate pair was precise to four decimal places, and I knew all of my points were in North America. So I created a formula where I took the latitude and divided by ten thousand =(c3/10000) and took the longitude, divided by ten thousand and multiplied by -1 =((c4/10000)*-1). Here’s the before and after:

Bronx    408492    738800

Bronx    40.8492    -73.8800

Some of this may seem pretty obvious, but if you’re used to working with text-based ID fields all of the time (like I am), it’s easy to forget that all you need is simple math to fix number fields.

The last step I took was to check for null values. A few of my data points had 0,0 listed for lat and long, because coordinate data was missing for those particular places. The problem is that 0 IS a value! If we plotted this data, these points would show up where the equator and prime meridian meet below western Africa. You have to represent “no data” as a blank value or null, and not as a zero. I fixed those, plotted, and was good to go.

FIPS and ISO Country Code Table

I’ve created a bridge table to relate various country codes: FIPS 10, the three versions of ISO 3166 (two alpha, three alpha, and three numeric), NATO, and the internet country codes. Ok, ok, I didn’t create it, the CIA World Factbook did and has it listed in their appendix, but in HTML. I just copied it and made it database friendly: fixed column headings, removed dashes for nulls, removed trailing and leading spaces, converted numeric codes to text while preserving zeros, and saved it in a tab-delimited text file. You can download it from the Resources page.

If you open it in Excel, Excel annoyingly converts the three digit numeric ISO code back to a number and drops the leading zeros. You can fix this using a trick I illustrated in a previous post, or import it into Calc or Access instead. You’ll be able to designate that field as text during the import process. If you stash the table in a geodatabase, you will be able to relate features and tables that use different codes through this bridge table.

I was also searching for the ISO 3166-2 codes for 1st level subdivisions within countries (like states in the US or provinces in Canada), but had trouble finding anything official as I think they may be copyrighted. I eventually found one source that claims that they received permission to post the codes, so you can take a look there. I also stumbled across a good reference source called Statoids, which gives you background information, lists, and codes for subdivisions on a country by country basis. I’ve added both of these to the Links - Resources page.

Adding Long / Lat XY Data to ArcMap

Here’s a tutorial I’ve been meaning to write: adding a table of longitude and latitude coordinates to ArcMap and turning them into features. For this example, I’ll be using place names from the GEOnet Names Server country files. The US National Geospatial Intelligence Agency has a pretty extensive list of geographic features for each country, with coordinates in many formats, including longitude and latitude in decimal degrees. I’ll use Botswana in southern Africa as an example, as it has a small record set and because I have some admin boundaries handy that I’ve downloaded from SAHIMS.

  • Download the file from the GNS and unzip it. It is a tab-delimited text file. If you like, you can open it in Excel or another spreadsheet to see what it looks like. This works fine for this example, but won’t work for larger or more populated countries because the files will exceed the maximum number of records that a spreadsheet can handle (65k). You’ll need to import the file into a database (Access for example) if you want to take a look in those cases. In either event, you’ll be able to add the text file directly to ArcMap, so no worries.
  • Add XY Data Open ArcMap and under the Tools menu, select Add XY Data. In the dialog box, you’ll select the file that contains your XY coordinates. Choose the text file you’ve downloaded. ArcMap will then search through the fields and look for appropriate ones to add as X and Y fields. In this case, it should correctly choose LONG for X and LAT for Y. If Arc couldn’t figure it out, you would have to specify which columns have the coordinates. Longitude is ALWAYS the X coordinate, and Latitude is ALWAYS the Y. Finally, you’ll select a projection. Choose the standard geographic coordinate system WGS 1984, which is usually a safe bet when adding long/lat data from most sources.
  • Add XY Dialog BoxHit OK, and Arc will plot the coordinates (after you click through the warning message). In this example, it looks like there is one wayward point, way to the north. When you see something like this, it often means that one of the coordinates is missing a minus sign: latitudes below the equator are negative, as are longitudes east of the international date line and west of the prime meridian. If you use the identity tool, you’ll see that the minus sign for latitude for this wayward point is missing. The easiest thing to do would be to go back into the text file, edit it, and add it to ArcMap again.
  • Even though Arc has plotted the points, they still don’t exist as features (remember the warning message? That’s essentially what it was saying). Select the plotted points in the Table of Contents, right-click, select Data, and select Export. Export the points out as a new shapefile or a feature class in a geodatabase. Then add the new features to the map.
  • At this point, it may be helpful to have a frame of reference for all of these points. Get your hands on some administrative layers, like country boundaries. I downloaded the outline of Botswana from SAHIMS. This step usually requires projecting and reprojecting, as you’ll need to get your points layer to match the projection of the other files you’re working with. I always use the ArcToolbox within ArcCatalog to fiddle with projections and then add the finished files to a new, blank map in ArcMap. In my case, the Botswana boundary was undefined - I had to consult the metadata from their website to figure out what the projection is (NAD 1927) and then define it using the ArcToolbox (Data Management Tools, Projections and Transformations, Define Projection). Then, I had to convert the Botswana points layer from WGS 1984 to match the boundary’s NAD 1927 projection (using Data Management Tools, Projections and Transformations, Feature, Project).
  • Plotted points with boundaryAdd the projected boundary and reprojected points to your map. Many of these points are point features (villages, towns, farms, mountain peaks), while others represent the geographic centers of lines (roads, rivers) or areas (administrative areas, parks, reserves). You’ll probably want to extract certain kinds of features. At this point, you’ll want to take a look at the attribute table for the points file and consult the NGS description for the names files. The description will tell you what each of the data columns represents and what all of the codes mean. The FC field will come in quite handy here, as it designates categories for each feature. So if we wanted to extract populated places, under the Selection Menu in ArcMap we could do a Select by Attribute where the field FC is equal to P, which is the code for populated place features. Once they are selected, you can do a Data, Export to create a new shapefile with just those features.
  • Alternatives do abound here. If you prefer, you could do a lot of the work of editing and creating feature subsets within a geodatabase. You can also follow these same, general procedures using open source tools (I believe that QGIS has a tool for adding XY data). And while we’re discussing a specific example here, the same basic steps would apply for any XY dataset.

Fun With FIPS Codes

Surely I jest, as messing with FIPS codes is rarely fun at all. However, it’s one of those things that you constantly have to deal with when mapping US demographic data, and it’s one of those things that GIS books and tutorials rarely discuss. And since I was tangling with them yesterday, I thought I’d share some tricks you can use in Excel to manipulate and format FIPS codes.

FIPS codes were created by the federal government to uniquely identify all geographic units in the US and are widely used. In an ideal world, here are some examples of FIPS codes for four counties:

State_Name, County_Name, State_FIPS, County_FIPS, FIPS
California, Alpine, 06, 003, 06003
California, San Mateo, 06, 081, 06081
Delaware, New Castle, 10, 003, 10003
Pennsylvania, Philadelphia, 42, 101, 42101

Each state code is unique, and each county code is unique WITHIN each state. If you are working with data for every county, then you will need to use the concatenated 5 digit FIPS in order to have unique ID numbers for each county. Otherwise, in this example, you wouldn’t be able to distinguish Alpine County, CA from New Castle County, DE because they both have the same county FIPS: 003.

In your database or spreadsheet, the codes need to be saved as text strings, NOT as numbers. In the spreadsheet I opened the other day, they were mistakenly saved as numbers. Excel also has an annoying habit of making corrections which aren’t really correct at all. So instead of the having the above, I had this:

State_Name, County_Name, State_FIPS, County_FIPS, FIPS
California, Alpine, 6, 3, 6003
California, San Mateo, 6, 81, 6081
Delaware, New Castle, 10, 3, 10003
Pennsylvania, Philadelphia, 42, 101, 42101

Since the codes were saved as numbers, all leading zeros were dropped. This is a problem, as I want to relate this data to data stored in other tables and shapefiles, where the FIPS codes are stored correctly. So, I needed to convert the data in this example back to strings (as join fields must have the same data type - you can’t join strings to numbers), and I had to get those zeros back. Somehow. Here’s how:

Open the data up in Excel, and convert the State_FIPS field to strings by selecting the column, selecting Format, and changing the format to text.

Then insert a column to the right of State_FIPS, and type in a formula:

=IF(Len(c2)=1,concatenate(0,c2),c2)

In English, this says: If the length of the value in c2 is one, then put a zero in front of the value in the new cell. Otherwise, just print the value of c2 in the cell. So in our example, the code for California would get converted from 6 to 06, while Delaware and Pennsylvania’s codes will just get reprinted, as they have the correct number of characters (two).

Then, it’s just a matter of copying and pasting this formula all the way down the rows, to create the correct two digit FIPS for each record. Lastly, select the whole column, copy it, go up to the Edit menu and select paste special for values only. This will overwrite all of the formulas in these cells and replace them with the actual output of the formula. You can follow these steps with the other two fields; the only thing that needs to change are the formulas.

Fixing the COUNTY_FIPS field is trickier, as we have three possibilities here: we’ll need to add two zeros if the code is one one character long, one zero if the code is two characters long, and nothing if the code is three characters long. Here’s the formula:

=IF(Len(e2)=3,e2,IF(Len(e2)=2,concatenate(0,e2),concatenate(”00″,e2)))

The parentheses get confusing, as arguments within each IF statement and each Len and concatenate function need their own parentheses. In English: if the value in e2 is three characters long, just print that value in the new cell. Otherwise, if the length is two characters, tack a zero to the front of that value and then print it. Otherwise, tack two zeros to the front of it and print. In that last piece, we’re making the implicit assumption that if it isn’t three characters long, and if it isn’t two, it must be one. The two zeros must go in parentheses so Excel reads it as a string. Without the parentheses, Excel assumes you mean a number and will “correct” you by dropping a zero - which is exactly what you DON’T WANT!

So in our example above, Alpine and New Castle get two zeros added to the front, San Mateo gets one, and Philadelphia gets none.

Fixing the last FIPS field is easy - it’s just a repeat of our first formula, as there are only two options: Add a zero to the front if it’s four characters long, otherwise just print the value (which is five characters long).

If you’re using a spreadsheet other than Excel, like Open Office’s Calc, you can use similar formulas with a few syntax changes. Calc uses colons instead of commas to separate values in arguments, so concatenate(0,e2) becomes concatenate (0:e2). You could also perform these operations in Microsoft Access using a Make Table query, with some adjustments. You would reference fields instead of cell values, and instead of the word concatenate, Access uses the syntax &”"& in between fields that you want to join.

When I worked at the University of Washington Library, I created a tutorial for concatenating fields in Access, to supplement a tutorial my predecessor created for Excel. Both docs are available on WAGDA’s FAQ page.

Real fun, huh? ; )

Census Cartographic Boundary Files

I’ve worked with these files a number of times and just used them again recently, and thought I would share the process you need to go through to prepare them for use in ArcGIS, as they are not “ready to go”. If you are not using ArcGIS, you can still follow these general steps using the specific tools that your software provides.

I would opt for the Cartographic Boundary Files (CBF) over the TIGER shapefiles (that the census just released) when making a national-level thematic map, as the generalization of the CBF makes the boundaries look cleaner at this scale. Also, the generalized files show land boundaries along coasts, while the TIGER files show the legal boundaries that extend into the water. The latter are not great for thematic maps, particularly as the Great Lakes states look distorted (as their boundaries extend into the lakes).

I’ll use the state and equivalent areas as an example, as those are the files I’ve just worked with. After downloading and unzipping the national-level shapefiles, you’ll need to take the following steps in the ArcCatalog:

  • Define the projection, as the files are undefined. According to metadata on the website, the files are in simple NAD83. In the ArcToolbox, the tool is under Data Management Tools, Projections and Transformations, Define Projection. Once you launch the tool, you will need to select the North American Datum 1983 as the coordinate system, which is stored under Geographic Coordinate Systems for North America.
  •  

  • After you define the projection, the next step is to reproject the layer to another projection that is more suitable for displaying the US. If you are making a map for basic presentation, a projected coordinate system like Albers Equal Area Conic would be a good choice (most atlases and maps of the continental US use this projection). Alaska, Hawaii, and Puerto Rico will be distorted, but we will be able to give them a separate data frame in ArcMap with their own projection later on. The tool is in the ArcToolbox under Data Management Tools, Projections and Transformations, Features, Project. Note that this is a DIFFERENT tool than the one we used in the last step. Define Projection is used to tell ArcGIS what projection a file is in if it is undefined, while Feature, Project is used to reproject a vector file from one projection to another. A file MUST have a defined projection BEFORE you can reproject it.
  •  

  • The CBF’s are stored as single part features, which means that each distinct polygon will have its own record in the attribute table. For example, each of the Hawaiian Islands will have its own record in the table. This is a problem if you plan to join state-level data to your shapefile, as the data from the join will be repeated for each record. So if you have a table with population data for each of the states and you join it to the shapefile, each individual Hawaiian island will be assigned the total population of Hawaii. If you run statistics on your data, you’ll get inflated counts. To avoid this, we need to convert the CBF to a multi-part feature, where each state will have only one record in the attribute table. To do this, we use the Dissolve tool under Data Management Tools, Generalization, Dissolve. The Dissolve fields will be the basis for dissolving the individual parts of the states into one state feature. In this case, we would choose the STATE field (FIPS code) and NAME field as the dissolve field, which will give us one feature for each state (if we chose DIVISION or REGION as the field, we would aggregate the polygons to create those larger geographic areas).
  •  

  • The next step is to decide whether you want to keep your shapefile as an independent file, or bring it into a geodatabase. The geodatabase is handy if you have lots of other tables and shapefiles that you are using in your project. Right-click in the catalog tree to create a new personal or file geodatabase. Then select your shapefile and right click to export it to your new geodatabase.
  •  

  • Whether you stick with a shapefile or go with a geodb, the next step is to open ArcMap and add your file to it. Now, you’ll have to make a decision about Puerto Rico. If you have a dataset where you want to map data for it, then you need not do anything. Since I am making presidential election maps and Puerto Rico doesn’t vote in the electoral college, I needed to delete it. To do so, go into an Edit mode under the Editor toolbar, select PR in the attribute table or map, delete it, then save. You’ll be left with a file for the 50 states and DC.
  •  

  • At this point, if you are going to join table data to your features, do so. Your features have a FIPS code, so you can use that to do the join (NEVER use names for joining - stick with codes). I often will add a new column to my features and plug in the two letter postal abbreviations, since they are commonly used for identifying states.
  •  

  • National Map With Multiple Data LayersOnce you’ve joined your data and are ready to make a finished map, the last step will be adding two new data frames for Alaska and Hawaii. Since AK and HI are distant from the continental US, it is better to create separate frames for all three rather than trying to display them in one. Copy your current data layer (not the features - the layer which is indicated by the yellow rectangles layered on top of each other) in the table of contents, and paste it below. Activate that layer, and name the layer Alaska. Then right click on the properties for the data layer and go to the coordinates tab. Modify the coordinate system of the data layer by choosing Alaska Albers Equal Area Conic. This will reproject the data on the fly and will display Alaska in a more appropriate projection (as the continental projection distorts it). Then, in the Layout View, you can resize the Alaska data frame and zoom in to focus just on AK. Repeat these steps for Hawaii (and Puerto Rico if you’re mapping it), and you’ll have a good looking US map!
  •  

Excel COUNTIF Function to Clean ACS Data

I’ve been preparing a GIS workshop for the New York Census Research Data Center’s 2nd Annual Workshop series, and have dug up some useful tips as I’ve assembled my materials. Here’s one of them:

I have a data table from the Census Bureau’s 2006 Annual Community Survey (ACS) in Excel which contains some data for Metropolitan and Micropolitan Areas. Now, I have a shapefile of Metropolitan Areas that I would like to join this data table to, but I would like to get rid of the records for the Micropolitan Areas in the data table. Unfortunately, the data table does not have a field that indicates whether an area is a Metro or Micro. Instead, this information is embedded in the name field, like “Akron, OH Metro Area” which means there is no way to sort the table to weed out the Micro Areas.

COUNTIF function to the rescue! I inserted a new column and typed in the formula:

=COUNTIF(D3, “*Metro*”)

If the formula sees the word Metro anywhere in the GEO_NAME, it counts it as a one in the new column, otherwise it counts it as zero (by default, the zeros will be the Micro areas). Copied and pasted the formula all the way down, then copied and pasted the formula column over top of itself using Paste Special (to replace the formulas with the actual values), and voila! Sorted by this column, and deleted all the records with a zero in the field (the Micro areas).

Excel_COUNTIF_ACS

I’ve done something like this before in Microsoft Access using LIKE, but Excel doesn’t include this function. I knew about COUNTIF but didn’t connect the dots. I discovered I could apply it after stumbling across this useful post at Daily Dose of Excel.

Lastly, before you can bring this table into GIS, you have to delete that second header row (you can only have one column heading - the rest of the rows are assumed to be data). While the codes in the first row are cryptic, they are concise. The headings in the second row are too long and contain spaces, which will cause problems when you import the table into GIS.

NOTE - If you’re using Open Office’s Calc instead of Excel, and you have enabled regular expressions under the Tools - Options - OpenOffice.org Calc - Calculate menu, the same function would look like this:

=COUNTIF(D3;”.*Metro.*”)