UNdata Processing, Calc Data Pilot
I’d downloaded some data from the UNdata website and cleaned it up so I could use it for my class, and thought I’d share some tips here. In many cases when you download data from UNdata you get multiple records for each country; one record for each year for each data point. In order to bring this data into GIS, I needed to re-arrange it to move the years from rows to columns, so that I’d have one record for each country with multiple columns for years.
You can do this in Excel using a pivot table, but since I was working off of my Linux notebook, I accomplished this using the Data Pilot tool in Open Office 3.0′s Calc spreadsheet. Here’s what I did:
- I opened the csv file I downloaded from UNdata in Calc, and accepted the defaults on the text import screen. Once it was imported, I saved the file in a spreadsheet format – you can use Calc’s odt format, or you can save it in Excel xls format if you need to use Excel later. But you have to get out of the csv format – Calc crashed on me a couple of times when I was running the Pilot and creating multiple sheets in the csv.
- I went up to the Data menu, selected Data Pilot, and Start, which opens the Data Pilot Menu. I clicked the More button to see the full range of options.
- Then it was a simple matter of dragging the field names into the right places. I dragged the country code and name fields into the rows box, the year into the column box (as I wanted to move years to columns), and the actual data field into the values box. Under the options listed under More, I changed the Results drop down box to save the table in a new sheet, and I unchecked all of the boxes listed below (for adding filters, creating totals rows and columns, etc). Then clicked OK.
- Voila! I had my newly formatted table, with one row for each country and one column for each year. But since I’ll be bringing this data into GIS (and will have to save the data in DBF format as I want my students to bring it into QGIS), I need to make sure that my data doesn’t have any funky formatting that may mess up joining my data to a shapefile. So I added a blank worksheet, copied my new pilot table, and did a Paste Special into the blank worksheet and pasted only text and numbers – with formatting, formulas, and anything else funky left out.
- Once I had my plain, reformatted data in my new sheet, I deleted the top row (which had labels for Sum Value and Year) so I’d be left with only one header row, and I changed the field names to something more database friendly (truncating names and removing spaces). Lastly, I deleted the original data sheet and the formatted data pilot table sheet, so I was left with just the final copy.
That’s it! Sort of. Since I now have the year’s in columns, I could create a few calculated fields to show change over time.
But the last piece will be dealing with the country codes. To get a data table with codes from the UNdata website, you have to choose an Add Columns option from their data browser page before you download, as you don’t get the country codes by default. Then, the codes you get could be anything. Since these data tables are coming from dozens of different organizations, agencies, and bureaus within the UN, the country codes will vary based on what that agency did. In some cases I’ve downloaded data that had the ISO two-digit alpha codes, and in other cases I had three digit numerical ISO codes (stored incorrectly as numbers, so leading zeros were dropped).
Most of the tables I’ve been downloading come from from the World Health Organization (WHO), and came with no standardized country codes. Instead, the codes are sequential numbers assigned to the countries in alphabetical order from 1 to 193. Doh! Then, if a new country gets added they tacked on the next available number regardless of the alphabet – so the country of Montenegro is assigned 194, after Zambia which is 193. Typically, data from countries that are not UN members or observers (like Liechtenstein and Vatican City) and are dependencies (Greenland, Falkland Islands, French Polynesia, etc) are not included in the data sets.
So, I’ll be typing in ISO alpha two codes into one of my data tables and will end up with a table that connects their sequential number system to ISO. Then I can bring this bridge and all of the other tables into a database, relate them to the bridge based on the sequential number, and create new tables out of them that have ISO numbers, so I can join them to my GIS file based on ISO. Or I guess I could add the sequential number field to my countries shapefile and join each table to it based on the sequential number.
Anyway – happy Data Piloting (or Pivoting, if you prefer).