UNdata Processing, Calc Data Pilot

I’d downloaded some data from the UNdata website and cleaned it up so I could use it for my class, and thought I’d share some tips here. In many cases when you download data from UNdata you get multiple records for each country; one record for each year for each data point. In order to bring this data into GIS, I needed to re-arrange it to move the years from rows to columns, so that I’d have one record for each country with multiple columns for years.

You can do this in Excel using a pivot table, but since I was working off of my Linux notebook, I accomplished this using the Data Pilot tool in Open Office 3.0’s Calc spreadsheet. Here’s what I did:

  • screenshot1I opened the csv file I downloaded from UNdata in Calc, and accepted the defaults on the text import screen. Once it was imported, I saved the file in a spreadsheet format – you can use Calc’s odt format, or you can save it in Excel xls format if you need to use Excel later. But you have to get out of the csv format – Calc crashed on me a couple of times when I was running the Pilot and creating multiple sheets in the csv.
  • I went up to the Data menu, selected Data Pilot, and Start, which opens the Data Pilot Menu. I clicked the More button to see the full range of options.
  • screenshot3Then it was a simple matter of dragging the field names into the right places. I dragged the country code and name fields into the rows box, the year into the column box (as I wanted to move years to columns), and the actual data field into the values box. Under the options listed under More, I changed the Results drop down box to save the table in a new sheet, and I unchecked all of the boxes listed below (for adding filters, creating totals rows and columns, etc). Then clicked OK.
  • screenshot4Voila! I had my newly formatted table, with one row for each country and one column for each year. But since I’ll be bringing this data into GIS (and will have to save the data in DBF format as I want my students to bring it into QGIS), I need to make sure that my data doesn’t have any funky formatting that may mess up joining my data to a shapefile. So I added a blank worksheet, copied my new pilot table, and did a Paste Special into the blank worksheet and pasted only text and numbers – with formatting, formulas, and anything else funky left out.
  • screenshot5Once I had my plain, reformatted data in my new sheet, I deleted the top row (which had labels for Sum Value and Year) so I’d be left with only one header row, and I changed the field names to something more database friendly (truncating names and removing spaces). Lastly, I deleted the original data sheet and the formatted data pilot table sheet, so I was left with just the final copy.

That’s it! Sort of. Since I now have the year’s in columns, I could create a few calculated fields to show change over time.

But the last piece will be dealing with the country codes. To get a data table with codes from the UNdata website, you have to choose an Add Columns option from their data browser page before you download, as you don’t get the country codes by default. Then, the codes you get could be anything. Since these data tables are coming from dozens of different organizations, agencies, and bureaus within the UN, the country codes will vary based on what that agency did. In some cases I’ve downloaded data that had the ISO two-digit alpha codes, and in other cases I had three digit numerical ISO codes (stored incorrectly as numbers, so leading zeros were dropped).

Most of the tables I’ve been downloading come from from the World Health Organization (WHO), and came with no standardized country codes. Instead, the codes are sequential numbers assigned to the countries in alphabetical order from 1 to 193. Doh! Then, if a new country gets added they tacked on the next available number regardless of the alphabet – so the country of Montenegro is assigned 194, after Zambia which is 193. Typically, data from countries that are not UN members or observers (like Liechtenstein and Vatican City) and are dependencies (Greenland, Falkland Islands, French Polynesia, etc) are not included in the data sets.

So, I’ll be typing in ISO alpha two codes into one of my data tables and will end up with a table that connects their sequential number system to ISO. Then I can bring this bridge and all of the other tables into a database, relate them to the bridge based on the sequential number, and create new tables out of them that have ISO numbers, so I can join them to my GIS file based on ISO. Or I guess I could add the sequential number field to my countries shapefile and join each table to it based on the sequential number.

Anyway – happy Data Piloting (or Pivoting, if you prefer).

Reading List for Geographic Information Course

The fall semester is here, and I’m about to start teaching the class I mentioned in my last post (an information studies course on geographic information). I thought I’d share my reading list and try out the Open Book plugin. I chose my readings based on: my particular audience (undergraduate students from many disciplines with little or no background in geography), relevance (materials appropriate in a hybrid information studies / geography course), cost (wanting to assign the students a single textbook that’s reasonably priced and covers all the bases, and will supplement with other readings), and copyright (staying within the bounds of fair use by not assigning too much from a single work). Here goes:

Making Maps
John Krygier, Denis Wood; The Guilford Press 2005

I decided to go with Krygier and Woods Making Maps as my assigned text book. Since cartography is a visual and technical art, I thought it made sense to use a book that relies on visuals for explanations rather than text. It’s approachable, particularly for my students who won’t be coming from a geography background, affordable, wonderfully quirky, and covers all of the essentials of the geographic framework and map interpretation and design independent of specific GIS software.

Place
Tim Cresswell; Blackwell Publishing Limited 2004

I’m using the first chapter of Cresswell’s book as a succinct introduction to how individuals define places, but would recommend the rest of the text for classes that cover geographic concepts and methods.

Georeferencing
Linda L. Hill; The MIT Press 2006

I’m assigning the second and third chapters of Hill’s book. The second chapter, which discusses how people process, store, and use geographic information is the best summary of this topic that I’ve ever seen, and the third chapter is a good overview of the different types of geographic objects. As a librarian-geo nerd, I love the chapters that deal with coordinate metadata and gazetteers, but won’t be using them in this class.

Image Of The City
Kevin Lynch; M.I.T. Press 1960

This is an urban planning / design classic, and I’ll have my students read the summary of Lynch’s city elements (based on his research, Lynch proposed that people mentally break the urban environment down into five types of elements in order to organize and navigate the city: paths, barriers, districts, nodes, and landmarks).

Realms, Regions And Concepts

This is the only traditional textbook that I’ll be borrowing from (I actually used it when I was a Freshmen, way back when). While I’m using the previous three books to discuss egocentric places, or how we as individuals conceive of place, I’m using the first chapter of this book to give the students an overview of geocentric places – the formal, defined hierarchy of places that exist in the world – and to introduce them to the concept of regions.

How To Lie With Maps
Mark S. Monmonier; University Of Chicago Press 1991

This has become a modern classic and I almost assigned it as a second textbook. I am assigning the chapter on maps for propaganda as a background to our discussion on map interpretation and communication, and will later use the chapter on census maps to talk about the effects of data classification and choice of enumeration units.

Desktop GIS
Gary Sherman; Pragmatic Bookshelf 2008

This is the only software book that I’ll be using chapters from, so the students have some formal guide for using QGIS (in addition to the QGIS documentation). I’m using the chapters on vector and raster data.

Key Concepts And Techniques In GIS
Jochen Albrecht; Sage Publications Ltd 2007

This concise, excellent book deals strictly with the concepts and principles behind GIS. I’m using the chapters on spatial search and geoprocessing, but would recommend the entire book for any GIS course, novice to advanced.

In addition to chapters from these books, I’ll also be using:

  • “Revolutions in Mapping” by John Noble Wilford, National Geographic Feb 1998 – a great overview of the history of cartography
  • USGS GIS poster – if there is such a thing, this is a “web classic” and an accessible intro to GIS
  • One article from a scholarly journal and one article from a mass market magazine to illustrate how geographic research is covered and used
  • And for shameless self-promotion, I summary I wrote about US Census data – In Three Parts

Finally, an honorable mention:

A Primer Of GIS
Francis Harvey; The Guilford Press 2008

If I was teaching an introductory GIS course in a geography or earth sciences department, this is certainly the book I would use, and for those of you in that boat I’d recommend checking it out. It does an excellent job of covering GIS principles without being software specific, contains exercises at the end of each chapter, and is well written and affordable. Since the scope of my course is broader than GIS and my audience more general and diverse, I opted to leave it out (but may still assign a chapter).

Geographic Information: Literacy and Systems

I’ve been spending a good portion of my summer working on the course that I’m going to teach this fall. The library at my college offers credit courses in Information Studies which students can take as a minor – they can choose two 3000 level courses and then a 4000 level capstone course. My course is a 3000 level special topics course which I’ve called Geographic Information: Literacy and Systems.

My situation is rather peculiar. I can’t teach this course as a pure GIS course, since it’s an information studies class and not geography or earth sciences. Beyond that, my college does not have a geography department, and earth sciences are not an individual department but are combined with other natural and physical sciences. With the exception of a regional geography class offered by the anthropology department, my college doesn’t offer geography instruction. So even if I could teach a pure GIS class, it’s unlikely that any of the students would have any foundational geographic knowledge.

I also can’t teach the course as a “library” class where I’m training people to be map or GIS librarians, because that isn’t the point of the info studies minor. The minor is meant to introduce students to the foundational principles of information – what is information, how do we search for it, organize it, what is its context in society, etc. I also could not teach the course as a basic software class, as that isn’t really appropriate for a college course. In short, I couldn’t find a model that I could follow, as what I’m doing falls outside these traditional realms.

So I decided to build the course around the concept of geographic information where I’ll cover some foundational geography,cartography, and GIS from an information science perspective that encompasses:  organization, search and retrieval, data processing, and assessment and analysis of GI. I’ve divided the class into four units that cover geographic information and fundamental geography, maps as information objects, and two units of GIS. In the first GIS unit we’ll cover the theoretical aspects and the basics of using the software with datasets that I’ll provide. In the second unit we’ll deal with the nitty gritty of actually searching for and processing freely available GIS data. In the last couple of weeks I’ll spend some time on web mapping and on geographic analysis and research.

Many of the concepts that I’ll be teaching are things that I never formally learned in a college course, such as a discussion of the kinds of administrative and statistical divisions that exist in the world, why they exist, and how data is collected for them. The second GIS unit on data processing is something that I feel is never adequately covered in GIS classes, but is essential for doing just about anything in GIS. I think this is also poignant in information studies, as it involves a discussion of the difference between data and information and how you can turn one into the other.

I’ve decided to use all open source software. Since these are undergraduate students who probably won’t be entering a geography related field, and we are a commuter campus where students have to make special trips to get to computer labs, I didn’t see any logic in using ArcGIS. With the open source software they can use it anywhere and there will be a better chance that they’ll use it after the course is over (and after they graduate). I’ve opted to go with QGIS as it covers all the bases I need. I liked gvSIG but had too many problems with the map layout – I might be able to cut my way through them, but can sophomore business and english majors? QGIS is also more thoroughly documented (in english), which is important since this is an introductory class.

I’m using Krygier and Woods Making Maps as my textbook, along with a few chapters here and there from other texts. I have looked to the pages Krygier’s created for his courses for guidance, and like the stream of consciousness style he used for writing his notes. I’ll post an annotated reading list later.

Since I’m breaking molds, I’ve also decided not to use Blackboard to organize the whole course and am using a blog and various other bits and pieces of software for creating assignments, organizing the roster, etc. If you’re interested you can follow along on my course blog – (only students can register). Classes start on August 31st…

Print Composer in QGIS – ACS Puma Maps

ny_youth_pumasI wrapped up a project recently where I created some thematic maps of 2005-2007 ACS PUMA level census data for New York State. I decided to do all the mapping in open source QGIS, and was quite happy with the result, which leads me to retract a statement from a post I made last year, where I suggested that QGIS may not be the best for map layout. The end product looked just as good as maps I’ve created in ArcGIS. There were a few tricks and quirks in using the QGIS Print Composer and I wanted to share those here. I’m using QGIS Kore 1.02, and since I was at work I was using Windows XP with SP3 (I run Ubuntu at home but haven’t experimented with all of these steps yet using Linux). Please note that the data in this map isn’t very strong – the subgroup I was mapping was so small that there were large margins of errors for many of the PUMAs, and in many cases the data was suppressed. But the map itself is a good example of what an ACS PUMA map can look like, and is a good example of what QGIS can do.

  • Inset Map – The map was of New York State, but I needed to add an inset map of New York City so the details there were not obscured. This was just a simple matter of using the Add New Map button for the first map, and doing it a second time for the inset. In the item tab for the map, I changed the preview from rectangle to cache and I had maps of NY state in each map. Changing the focus and zoom of the inset map was easy, once I realized that I could use the scroll on my mouse to zoom in and out and the Move Item Content button (hand over the globe) to re-position the extent (you can also manually type in the scale in the map item tab). Unlike other GIS software I’ve experimented with, the extent of the map layout window is not dynamically tied to the data view – which is a good thing! It means I can have these two maps with different extents based on data in one data window. Then it was just a matter of using the buttons to raise or lower one element over another.
  • Legend – Adding the legend was a snap, and editing each aspect of the legend, the data class labels, and the categories was a piece of cake. You can give your data global labels in the symbology tab for the layer, or you can simply alter them in the legend. One quirk for the legend and the inset map – if you give assign a frame outline that’s less than 1.0, and you save and exit your map, QGIS doesn’t remember this setting if when you open your map again – it sets the outline to zero.
  • Text Boxes / Labels – Adding them was straightforward, but you have to make sure that the label box is large enough to grab and move. One annoyance here is, if you accidentally select the wrong item and move your map frame instead of the label, there is no undo button or hotkey. If you have to insert a lot of labels or free text, it can be tiresome because you can’t simply copy and paste the label – you have to create a new one each time, which means you have to adjust your font size and type, change the opacity, turn the outline to zero, etc each time. Also, if the label looks “off” compared to any automatic labeling you’ve done in the data window, don’t sweat it. After you print or export the map it will look fine.
  • North Arrow – QGIS does have a plugin for north arrows, but the arrow appears in the data view and not in the print layout. To get a north arrow, I inserted a text label, went into the font menu, and chose a font called ESRI symbols, which contains tons of north arrows. I just had to make the font really large, and experiment with hitting keys to get the arrow I wanted.
  • Scale Bar – This was the biggest weakness of the print composer. The scale bar automatically takes the unit of measurement from your map, and there doesn’t seem to be an option to convert your measurement units. Which means you’re showing units in feet, meters, or decimal degrees instead of miles or kilometers, which doesn’t make a lot of sense. Since I was making a thematic map, I left the scale bar off. If anyone has some suggestions for getting around this or if I’m totally missing something, please chime in.
  • Exporting to Image – I exported my map to an image file, which was pretty simple. One quirk here – regardless of what you set as your paper size, QGIS will ignore this and export your map out as the optimal size based on the print quality (dpi) that you’ve set (this isn’t unique to QGIS – ArcGIS behaves the same way when you export a map). If you create an image that you need to insert into a report or web page, you’ll have to mess around with the dpi to get the correct size. The map I’ve linked to in this post uses the default 300 dpi in a PNG format.
  • Printing to PDF – QGIS doesn’t have a built in export function for PDF, so you have to use a PDF print driver via your print screen (if you don’t have the Adobe PDF printer or a reasonable facsimile pre-installed, there are a number  of free ones available on sourceforge – PDFcreator is a good one). I tried Adobe and PDFcreator and ran into trouble both times. For some reason when I printed to PDF it was unable to print the polygon layer I had in either the inset map or the primary map (I had a polygon layer of pumas and a point layer of puma centroids showing MOEs). It appeared that it started to draw the polygon layer but then stopped near the top of the map. I fiddled with the internal settings of both pdf drivers endlessly to no avail, and after endless tinkering found the answer. Right before I go to print to pdf, if I selected the inset map, chose the move item content button (hand with globe), used the arrow key to move the extent up one, and then back one to get it to it’s original position, then printed the map, it worked! I have no idea why, but it did the trick. After printing the map once, to print it again you have to re-do this trick. I also noticed that after hitting print, if the map blinked and I could see all the elements, I knew it would work. But, if the map blinked and I momentarily didn’t see the polygon layer, I knew it wouldn’t export correctly.

Despite a few quirks (what software doesn’t have them), I was really happy with the end result and find myself using QGIS more and more for making basic to intermediate maps at work. Not only was the print composer good, but I was also able to complete all of the pre-processing steps using QGIS or another open source tool. I’ll wrap up by giving you the details of the entire process, and links to previous posts where I discuss those particular issues.

I used 2005-2007 American Community Survey (ACS) date from the US Census Bureau, and mapped the data at the PUMA level. I had to aggregate and calculate percentages for the data I downloaded, which required using a number of spreadsheet formulas to calculate new margins of error; (MOEs). I downloaded a PUMA shapefile layer from the US Census Generalized Cartographic Boundary files page, since generalized features were appropriate at the scale I was using. The shapefile had an undefined coordinate system, so I used the Ftools add-on in QGIS I converted the shapefile from single-part to multi-part features. Then I used Ftools to join my shapefile to the ACS data table I had downloaded and cleaned-up (I had to save the data table as a DBF in order to do the join). Once they were joined, I classified the data using natural breaks (I sorted and eyeballed the data and manually created breaks based on where I thought there were gaps). I used the Color Brewer tool to choose a good color scheme, and entered the RGB values in the color / symbology screen. Once I had those colors, I saved them as custom colors so I could use them again and again. Then I used Ftools to create a polygon centroid layer out of my puma/data layer. I used this new point layer to map my margin of error values. Finally, I went into the print composer and set everything up. I exported my maps out as PNGs, since this is a good image format for preserving the quality of the maps, and as PDFs.

Formulas for Working With Census ACS Data in Excel / Calc

After downloading US census data, you often need to reformat it before using it. It’s quite common that you download files where the population is broken down by gender and age, and you need to aggregate the data to get a total or divide a particular characteristic to get a percent total. This is pretty straightforward if you’re working with decennial census data, but data from the American Community Survey (ACS) is a little trickier to deal with since you’re working with estimates that have a margin of error. When creating new data, you also have to calculate what the margin of error is for your derived numbers. I’ll walk through some examples of how you would do this in a spreadsheet (the formulas below will work in either Excel or Calc).

Creating an Aggregate

We’ll use the following data in our example:

screenshot1

We have the total population of people three years and older who are enrolled in school, and a breakdown of this population enrolled in grades 1 through 4 and grades 5 through 8 in a few counties in New York, with margins of error for each data point. Our data is from the 3 year averaged 2005-2007 American Community Survey.

Let’s say we want to create a total for students who are enrolled in grades 1 through 8 for each county. We create a new column and sum the estimates for each county with the formula e3+g3, or sum(e3:g3).

To calculate a margin or error (MOE) for our grade 1 to 8 data, first we have to use the find and replace command to get rid of the “+/-” signs in the MOE column, so our spreadsheet will treat our values as numbers and not text (this is an issue if you downloaded the data as an Excel file – if you download a txt file the +/- is not included). Depending on the dataset you’re working with, you may also need to replace dashes, which represent data that was null or not estimated.

Once the data is cleaned up, we can insert a new column with this formula:

=SQRT((F3^2)+(H3^2))

This calculates our new margin of error by squaring the moes for each of our data points, summing the results together, and taking the square root of that sum. In other words,

=SQRT((MOE1^2)+(MOE2^2))

Once that’s done, you may want to round the new MOE to a whole number.

Creating a Percent Total

Let’s calculate the percentage of the population 3 years and older enrolled in school that are in grades 1 through 8. Based on what we have thus far (I hid the columns E,F,G, and H for grades 1-4 and 5-8 in this screenshot, as we don’t need them):

screenshot-2

We insert a new column where we divide our subgroup by the total, as you would expect – I3/C3. In the next column we insert the following formula to create a MOE for our new percent total:

=(SQRT((J3^2)-((K3^2)*(D3^2))))/C3

This one’s a little weightier than our last formula. We’re taking the square of our percent total (K3) and the square of the MOE of the total population (D3), multiplying them together, then subtracting that number from the square of the MOE of our subgroup (J3). Then we take the square root of the whole thing, then divide it by our total population (C3). If you’re saying – HUH? Maybe this is clearer:

=(SQRT((MOEsubset^2)-((PercentTotal^2)*(MOEtotalpop^2))))/TotalPop

Finally, we have something like this:

screenshot-3

Based on our data, we can say things like “There were approximately 30,556 students enrolled in 1st through 8th grade per year in Dutchess County, NY between 2005 and 2007, plus or minus 1,184 students. An estimated 37% of the population enrolled in school in the county was in the 1st through 8th grade, plus or minus 1%.” The ACS estimates have a 90% confidence interval.

Wrap Up

In this example we worked with aggregating and calculating percentages based on characteristics. We could also use these same formulas to aggregate data by geography, if we wanted to add the characteristics for all the counties together.

For the full documentation on working with ACS data, take a look at the appendix in the Census’ ACS Compass Guide, What General Data Users Need to Know. It provides you with the formulas in their proper statistical notation (for those of you more mathematically inclined than I) and includes formulas for calculating other kinds of numbers, such as ratios and percent change. It does provide you with worked-through examples, but not with spreadsheet formulas. I used their examples when I created formulas the first time around, so I could compare my formula results to their examples to insure that I was getting it right. I’d strongly recommend doing that before you start plugging away with your own data – one misplaced parentheses and you could end up with a different (and incorrect) result.

2007 Economic Census

The US Census Bureau has begun releasing data for the 2007 Economic Census. The bureau conducts the survey of businesses every five years – all medium to large size businesses and multi-part businesses are counted, samples are taken for smaller businesses, and various administrative records are used to calculate businesses with no employees (i.e. freelancers). All businesses are categorized hierarchically by North American Industrial Classification System (NAICS) codes and the data is reported by industry and geography. The number of establishments, employees, payroll, and sales are counted for the nation, states, counties, metro areas, places, and zip codes.

At this point national industry totals for the broadest categories of NAICS are available, as are preliminary numbers for the most specific NAICS categories (six digit) at the national level. Data for smaller geographic areas will be released between October 2009 and August 2010.

The biggest change from the 2002 Economic Census is the delivery method for the data. There will be no more 90 page pdf files or HTML tables that drill down six levels. All of the data will be released via the American Factfinder only. Other changes include the addition of some new geography (CDPs with at least 5000 people), new metro area definitions, and the revised 2007 definitions for NAICS which include small changes to the Finance, Insurance, Real Estate, Professional Services, and Administrative Services categories.

Additional changes for 2007, the data release schedule, NAICS codes, and methodology docs are all available at the 2007 Economic Census homepage within the Census Bureau’s website.

All of the data is aggregated by industry and geography – you cannot get lists of businesses with names and addresses as this information is kept confidential. Furthermore, to maintain confidentiality, if one company controls a large share of the market for a specific sector within a specific geographic area, or if there few businesses within a sector in a specific geographic area, much of the data (with the exception of the number of businesses) remains classified (marked with a D for disclosure). Oftentimes this means that data for industries within small areas (big box retail in a small town) and data for industries with few establishments in an area (mining establishments in New York City) are hidden. The smaller the geography, the more likely it is that the data will not be disclosed. This becomes a technical issue if you want / need to move this data into a database, as these pesky disclosure notes are stored in the same columns as the data and prevent you for designating the fields as numeric.

Given the delay between the time the data is collected and the time it is released, it isn’t particularly helpful for analyzing our current economic climate, but it does provide a snapshot of the way the US economy looked at that moment, and is useful in understanding how the economy is evolving. Be aware that when making comparisons to past data, you have to correct for changes in geography and NAICS definitions. The differences between 2002 and 2007 are not too great, but more adjustments are necessary as you go further back in time. The Bureau provides data back to 1997 through the American Factfinder and some data from 1992 on an older page. If you need to go back further, you’ll be entering the realm of (gasp!) CD-ROMs or the paper reports.

Creating a New Shapefile in ArcGIS: Part II

In my previous post I gave an overview of how to create a shapefile from scratch, where we created a point layer to identify places and neighborhoods in NYC. In this post, I’ll pick up where we left off.

Whenever you create new features in a shapefile, ArcGIS automatically adds a couple of fields, including an auto-number ID field that uniquely identifies each feature. This was sufficient for our example as the 291 place names we were working with do not have a standard ID number that represents them. If we were creating features that did have a recognized ID number or code, we certainly would want to add an additional field to hold that number. This would allow us to share and relate our data to other datasets that use that conventional ID. For example, if we had a layer with the 50 states, we would want to have a FIPS number or the two digit postal code for each state in the attribute table, so we could relate our states feature to the zillions of other state-based data tables out there that also use these codes.

It’s also helpful to add other identifiers to relate our place names to some larger geographic area. Why? Let’s say we want to filter our neighborhoods by borough – perhaps we just want to label neighborhoods in Manhattan or calculate distances only between places that appear in the Bronx. It would be useful to have a borough code or some other code associated with each of our place names for running queries.

scrnshot6As it turns out, the City of New York does use a standardized system of three digit codes to identify all boroughs and community districts in the city. In our example, the code for Manhattan Community District 12, which contains Inwood and Washington Heights, is 112. The first digit identifies the borugh and the second two digits identify the district. It would be a good idea to assign each of our neighborhoods this district code, so we could filter our features by either borough or district.

When we create each feature, we could manually type in the code in it’s own field just like we added the neighborhood names, but that would be rather tedious – and unnecessary. A better choice would be to do a spatial join. Whereas a “regular” join allows us to join attribute tables based on a common ID field, a spatial join allows us to assign attributes to one layer based on their geographical relationship to another layer.

scrnshot7In the Table of Contents, right click on the neighborhoods layer and choose Joins and Relates – Joins. We’ll get the familiar Join dialog box. However, if you hit the first drop down box that says Join Attributes From a Table and choose Join Data Based on Spatial Location, we’ll get the options for doing a spatial join. Choose the community districts as the layer to join to the neighborhoods, and since we’re joining points to polygons we’ll choose parameters that are relevant for relating these two features. In this case, give each point (neighborhood / place) the attributes of the polygons (districts) that it falls inside. ArcGIS will create a new point layer with the joined fields when you hit OK. Open the attribute table of the new point layer, and you’ll see the additional fields, including the community district numbers. You’ll also get some rather useless fields from the district layer, like the length and area of each district, which you can safely delete.

So instead of tediously entering these numbers by hand for each neighborhood, we simply run the spatial join process once (after we’ve finished adding the points for all 291 neighborhoods) and the IDs are automatically added.

Creating a New Shapefile in ArcGIS: Part I

I’m working with a grad student who needs to create a new shapefile from scratch, and thought I’d turn the instructions for doing this in ArcGIS into a tutorial / post for creating new point layers. The idea in this example is to create a point layer that shows the relative center of 291 neighborhoods in New York City. Since many of these neighborhoods are place names without finite boundaries, we’ll have to use various sources (NYC Planning map and Rand McNally street maps) to pinpoint the relative center of each neighborhood.

These points will be used for labeling each neighborhood. In this case, creating a new, georeferenced layer is preferable to creating 291 text labels on a map that are not tied to geography in any way.

  • The first step is to download some layers from the NYC Department of Planning to use for reference, such as a layer for boroughs and community districts. Community districts are used by the city to approximate neighborhoods. Many of the neighborhoods that we are trying to plot are, in many cases, smaller areas or places within these boundaries.
  • scrnshot1Next, open ArcCatalog and create a folder to store the data. Then, right click on the folder in the table of contents and select New – Shapefile. In the Create New Shapefile window, we give the shapefile a name, select Point as the feature type, and hit Edit to change the
    coordinate system. In the Spatial Reference Properties menu, we’ll import a coordinate system from one of the files we downloaded from NYC Planning, which uses New York State Plane for Long Island. Click OK and OK again, and we’ll have a new shapefile.
  • scrnshot2Right now, our new shapefile isn’t very exciting because it’s empty – you can preview it in the catalog to see for yourself. If you preview the table, you’ll see that Arc created three fields – FID, Shape, and ID, which it will automatically fill in when we start creating features. Before we do that, we’ll have to add an additional column to store the name of the neighborhood. To do that, open ArcMap and add the neighborhood layer to the map. Then, right click on the layer in the Table of Contents and open the attribute table. Hit the Options button and choose Add Field. In the Add Field menu, name the new field, choose Text as the type, and change the length to 80 (in case we have some neighborhoods with long names). Hit OK, and you’ll have a new field.
  • scrnshot3Let’s add our reference layers next. Hit the Add Data button (or File – Add Data), and add the borough boundaries and community districts (if you don’t see anything after you add them, right click on one of these layers and choose Zoom to Layer). Go into the symbology tab for each layer and change their display to make the areas appear more distinctive. Make sure your neighborhood layer is on top of your other layers.
  • Now it’s time to start plotting neighborhoods. Go to the Selection menu – Set selectable Layers, and turn off all the layers except the neighborhood layer. Then, use the dropdown on the Editor Toolbar and Select Start Editing (if you don’t see the Editor Toolbar, make sure it’s activated by going to View – Toolbars and select it). scrnshot4On the Editor Toolbar, make sure the Create New Feature task is activated and that the target layer is the neighborhood layer, and not any of the reference layers. Zoom in to the top of Manhattan. With the Pencil tool selected in the toolbar, and using your sources (NYC planning map, Rand McNally street map, whatever), click on the map to approximate where the center of the Inwood neighborhood would be. A blue dot should appear on the map. Then right-click on the neighborhoods layer in the Table of Contents and open the attributes table. You’ll see a brand new record for your new dot. Click in the empty field for Name, type in the name of the neighborhood, and press enter.
  • That’s the process! Next, locate the area for Washington Heights and click on the map to create the point for that neighborhood. The new dot will appear hi-lighted, while the previous dot for Inwood will now appear as a regular point symbol. Now it’s just a matter of plugging away. Make sure to occasionally save your edits by clicking Editor and choosing Save Edits. If you make a mistake, you can delete a feature by selecting the Select Feature tool in the regular tool bar (white arrow with a blue and white feature box next to it), select the particular point, and hit the delete key. If you’re having trouble pinpointing the right location for the neighborhood, try downloading additional reference layers to guide you. The NYC DOITT also has a page with GIS layers for the city with features like parks and streets that may be helpful. When you’re finished editing, choose Stop Editing under the Editor Toolbar.

    scrnshot5

  • The ultimate goal of this exercise was to get neighborhood labels to appear without the actual point. To accomplish this, change the point symbol for the neighborhood to nothing by going into the Symbology tab for the layer and reducing the fill to no color, the outline to nothing, and the size to zero. Then open the Labels tab under the Properties menu, turn labels on using the name field as the label field, select Placement Properties and choose the setting to place the labels on top of the point, hit ok, and voila! Perfectly centered neighborhood names that are part of a georeferenced layer.

This covers the basics. In the next post, I’ll go a little further and discuss adding additional fields to the new file, without having to type them in manually.

Updated Links for Data and Resources

I recently went through my pages of suggested links for data and resources to update and clean them up. I’ve included many of the cool resources I’ve discovered since I started writing this blog, which ended up in individual posts but not in these pages. I went over the resources page in particular, to try and classify the reference materials, tools, and software into useful categories rather than just having one large blob of stuff.

Transform Projections with GDAL / OGR

The GDAL / OGR tools are an open source, cross platform, command-line toolkit that can be used for viewing GIS metadata, performing attribute queries, and converting file formats, among other things. It can also be used for transforming coordinate systems and projections for GIS files. I’ll demonstrate in this brief tutorial how to accomplish this using the OGR tools, which are for vector based GIS. The raster based GDAL tools work in a similar fashion.

Viewing basic coordinate system / projection info:

ogrinfo -al -so world_wgs.shp

Where ogrinfo is the name of the tool, -al is a switch to get detailed info about the layer, -so is a switch to display summary info, and world_wgs.ship is the name of our file. Run that command and we’ll get something that looks like this, with info about the features, coordinate system, and attribute fields of our shapefile:

INFO: Open of `world_wgs.shp’
using driver `ESRI Shapefile’ successful.

Layer name: world_wgs
Geometry: Polygon
Feature Count: 243
Extent: (-179.808664, -89.677397) – (179.808664, 83.435942)
Layer SRS WKT:
GEOGCS["GCS_WGS_1984",
DATUM["WGS_1984",
SPHEROID["WGS_1984",6378137,298.257223563]],
PRIMEM["Greenwich",0],
UNIT["Degree",0.017453292519943295]]
CNTRY_NAME: String (254.0)
FIPS_CNT_1: String (254.0)
ISO_2DIGIT: String (254.0)
ISO_3DIGIT: String (254.0)
STATUS: String (254.0)
COLORMAP: Real (18.6)
CONTINENT: String (254.0)
UN_CONTINE: String (254.0)
REGION: String (254.0)
UN_REGION: String (254.0)

Convert coordinate systems supported by EPSG

GDAL / OGR and most of the open source GIS software supports projections and coordinate systems that are part of the EPSG library. If you want to do a conversion between two coordinate systems and they are both supported by EPSG, you just have to reference the EPSG code that’s used to identity the system that you want to project to. You can look up codes using spatialreference.org.

Let’s say we want to convert our shapefile that’s in WGS 84 (common lat and long) to NAD 83 (used frequently in North America):

ogr2ogr -t_srs EPSG:4269 world_new.shp world_wgs.shp

Where ogr2ogr is the name of the tool, -t_srs is the command for transforming from one coordinate system to the other, EPSG:4269 is the code that identifies the coordinate system we want the new file to have – NAD83, world_new.shp is the name of the output file that will have the new projection that we want, and world_wgs.shp is our input file. If you run the command and get no error message, you’re in good shape. Just run the ogrinfo command on the new file to verify that it’s been re-projected.

Convert coordinate system not supported by EPSG

The EPSG library is extensive, but doesn’t contain everything, particularly some global and continental map projections. GDAL / OGR can still do the job, but you’ll have to provide the tool with the proper frame of reference since the EPSG library doesn’t have the info. Let’s say we want to project our WGS file to the Robinson Projection, which is not part of EPSG.

First, go back to spatialreference.org and search for Robinson. Its ID code is ESRI 54030 – not part of the EPSG library. Click on the link for the projection to open its window. You’ll be able to look at the projection data in a number of standard file formats. Select OGC_WKT from the list, and it will open the text in a new window, showing you the parameters of that projection. In your browser, go up to file, save as, and save the file as robinson_ogcwkt.txt in the same directory as the shapefile you want to reproject.

Now that you have the projection info stored in the text file, run the following command to make the conversion:

ogr2ogr -t_srs robisnon_ogcwkt.txt world_rob.shp world_wgs.shp

It’s the same command as our previous one, except that you’re referencing the text file with your data instead of an EPSG code.

Define an undefined coordinate system

If you run the ogrinfo command and your coordinate system is undefined, you should define it before doing anything else, and you must define an undefined projection before converting to another projection. Look at the metadata that came with you file or go back to the source to figure out what it is. For example the US Census Bureau Generalized Cartographic Boundary Files for 2000 are in NAD83 according to their metadata, but the files lack a projection definition.

To define one, use the following command:

ogr2ogr -a_srs EPSG:4269 states_nad83.shp states_unknown.shp

The only difference here is the -a_srs command is used to assign a coordinate system to a file – the rest of the parameters are the same. If you’re defining a non-EPSG projection, use the same method from the previous example – download a definition file from spatialreference.org and use the file name in place of the EPSG code.

More help and where to download:

UC Santa Barbara NCEAS and the UC Davis Soil Lab both have short tutorials and sample commands of GDAL / OGR.

If you want to thumb through the world’s map projections, the folks at radicalcartography have a nice projection reference page with visuals and brief descriptions.

Visit the GDAL / OGR page for downloading, or if you’re a Windows or Mac user, you can download QGIS and GDAL / OGR together from the QGIS download page. Linux users can get GDAL / OGR via your package handler – depending on your distro, you may have it already.