Posts Tagged ‘data’

Mapping Domestic Migration with IRS Data

Friday, November 18th, 2011

Forbes magazine just published a neat interactive map on American migration using data NOT from the Census, but from – the IRS. Whether you fill it out virtually or the old fashioned way, everyone fills in their address at the top of the 1040, and the IRS stores this in a database. If you file from a different address from one year to the next you must have moved, and the IRS publishes a summary file of where people went (with all personal information and practically all filing data stripped away) .

The Forbes map taps into five years of this data and lets you see all domestic in-migration and out-migration from a particular county. The map is a flow or line map with lines going from the county you choose to each target – net in-migration to your county is colored in blue and net out-migration is red. You can also hover over the sending and receiving counties to see how many people moved. Click on the map to choose your county or search by name; you also have the option of searching for cities or towns, as the largest place within each county is helpfully identified and tied to the data.

It’s relatively straightforward and fun to explore. Some of the trends are pretty striking – the difference between declining cities (Wayne County – Detroit MI) and growing ones (Travis County – Austin TX) is pretty vivid, as is the change in migration during the height of the housing boom period in 2005 compared to the depth of the bust in 2009 (see Maricopa County – Phoenix AZ). More subtle is the difference in the scope of migration between urban and rural counties, with the former having more numerous and broader connections and the latter having smaller, more localized exchanges. Case in point is my home state of Delaware – urban New Castle County (Wilmington) compared to rural Sussex County (Seaford). There are many other stories to see here – the exodus from New Orleans after Katrina and the subsequent return of residents, the escape from Los Angeles to the surrounding mountain states, and the pervasiveness of Florida as a destination for everybody (click on the thumbnails below for full images of each map).

Detroit 2009

Wayne Co MI (Detroit) 2009

Austin 2009

Travis Co TX (Austin) 2009

Phoenix 2005

Mariciopa Co AZ (Phoenix) 2005

Phoenix 2009

Mariciopa Co AZ (Phoenix) 2009

Wilmington 2009

New Castle Co DE (Wilmington) 2009

Seaford 2009

Sussex Co DE (Seaford) 2009

While the map is great, the even better news is that the data is free and can be downloaded by anyone from the IRS Statistics page. They provide a lot of summary data – information for individuals is never reported. The individual tax data page with data gleaned from the 1040 has the most data that is geographic in nature. If you wanted to see how much and what kind of tax is collected by state, county, and ZIP code you could get it there. The US Population Migration data used to create the Forbes map is also there and the years from 2005 to 2009 are free (migration data from 1991 to 2004 is available for purchase).

You can download separate files for county inflow and county outflow on a state by state basis in Excel (.xls) format, or you can download the entire enormous dataset in .dat or .csv format. The data that’s reported is the number of filings and exemptions that represent a change in address by county from one year to the next, and includes the aggregated adjusted gross income of the total filers. There are some limitations – in order to protect confidentiality, if the flow from one county to another has less than 10 moves that data is lumped into an “other” category. International migration is also lumped into one interntaional category (on the Forbes map, both the other category where two counties have a flow less than 10 and the foreign migration category are not depicted).

The IRS migration data is often used when creating population estimates; when combined with vital stats on births and deaths it can serve as the migration piece of the demographic equation.

2010 Census Redistricting Data

Sunday, April 17th, 2011

The Redistricting Summary Data [P.L. 94-171] from the 2010 Census has all been published for the nation, states, counties, and places, and is available via the new American Factfinder. The redistricting data includes basic demographic data: total population, race, Hispanic or Latino origin, and number of housing units occupied and vacant. Data is available down to census blocks and is available for most (but not all – no ZCTAs or PUMAs) geographies.

If you don’t want all the data for a state, don’t want to slog through the Factfinder, and are comfortable working with large text files, you can FTP the summary data from the Redistricting Data homepage. If you want basic summary data for states, counties, and places and don’t want to fuss with the Factfinder or text files, you can download Excel spreadsheets from the Redistricting Data Press Kit. They also have some pdf / jpg maps showing county level population and population change, plus interactive map widgets like the one below for the country and for each state. 2010 Redistricting TIGER Shapefiles have also been released for geographies included in the redistricting dataset.

The full 2010 Census for all geographies will be released throughout this summer and into the fall in Summary File 1 [SF1]. Stay tuned.

Freely Available World Bank Country Data

Saturday, September 11th, 2010

This actually happend a little while ago, but for various reasons I haven’t been able to keep up with posting…

Our library had been subscribing to the WDI (World Development Indicators) database from the World Bank, but we were recently informed that the product was being discontinued and all of the data from the WDI and a number of other World Bank datasets would now be freely available from their data portal at http://data.worldbank.org/.

You can download an indicator for all countries by browsing through a list of all 300, or drill down by broad topics. Select an indicator and you can view a table with the most recent data, or a graduated circle map. If you download a table you can choose between an Excel or XML format. If you download the Excel format you get all years for all countries for that particular indicator from 1960 to present; but for many indicators you end up with a lot of null values up until this decade. If you go the XML route, the nulls are omitted and only years with data are provided. Unfortunately, in neither case do you get any unique identifiers like an ISO code.

Fortunately, power users can opt to download an entire data set, such as all of the WDI Indicators, in one file via their data catalog. In this case you have the option for Excel (xlsx only) or CSV, and the records I looked at DID contain ISO codes for each country (3 letter alpha). It looks like they’re also letting people tap into an API, so you can build web applications that harness the data directly from their repository.

In addition to browsing through indicators, you also have the ability to pull up a profile for a particular country to view several indicators for one particular place. They have a snazzy dashboard with stats, charts, and a reference map.

2007 Economic Census

Monday, June 8th, 2009

The US Census Bureau has begun releasing data for the 2007 Economic Census. The bureau conducts the survey of businesses every five years – all medium to large size businesses and multi-part businesses are counted, samples are taken for smaller businesses, and various administrative records are used to calculate businesses with no employees (i.e. freelancers). All businesses are categorized hierarchically by North American Industrial Classification System (NAICS) codes and the data is reported by industry and geography. The number of establishments, employees, payroll, and sales are counted for the nation, states, counties, metro areas, places, and zip codes.

At this point national industry totals for the broadest categories of NAICS are available, as are preliminary numbers for the most specific NAICS categories (six digit) at the national level. Data for smaller geographic areas will be released between October 2009 and August 2010.

The biggest change from the 2002 Economic Census is the delivery method for the data. There will be no more 90 page pdf files or HTML tables that drill down six levels. All of the data will be released via the American Factfinder only. Other changes include the addition of some new geography (CDPs with at least 5000 people), new metro area definitions, and the revised 2007 definitions for NAICS which include small changes to the Finance, Insurance, Real Estate, Professional Services, and Administrative Services categories.

Additional changes for 2007, the data release schedule, NAICS codes, and methodology docs are all available at the 2007 Economic Census homepage within the Census Bureau’s website.

All of the data is aggregated by industry and geography – you cannot get lists of businesses with names and addresses as this information is kept confidential. Furthermore, to maintain confidentiality, if one company controls a large share of the market for a specific sector within a specific geographic area, or if there few businesses within a sector in a specific geographic area, much of the data (with the exception of the number of businesses) remains classified (marked with a D for disclosure). Oftentimes this means that data for industries within small areas (big box retail in a small town) and data for industries with few establishments in an area (mining establishments in New York City) are hidden. The smaller the geography, the more likely it is that the data will not be disclosed. This becomes a technical issue if you want / need to move this data into a database, as these pesky disclosure notes are stored in the same columns as the data and prevent you for designating the fields as numeric.

Given the delay between the time the data is collected and the time it is released, it isn’t particularly helpful for analyzing our current economic climate, but it does provide a snapshot of the way the US economy looked at that moment, and is useful in understanding how the economy is evolving. Be aware that when making comparisons to past data, you have to correct for changes in geography and NAICS definitions. The differences between 2002 and 2007 are not too great, but more adjustments are necessary as you go further back in time. The Bureau provides data back to 1997 through the American Factfinder and some data from 1992 on an older page. If you need to go back further, you’ll be entering the realm of (gasp!) CD-ROMs or the paper reports.

Updated Links for Data and Resources

Saturday, April 25th, 2009

I recently went through my pages of suggested links for data and resources to update and clean them up. I’ve included many of the cool resources I’ve discovered since I started writing this blog, which ended up in individual posts but not in these pages. I went over the resources page in particular, to try and classify the reference materials, tools, and software into useful categories rather than just having one large blob of stuff.

Mapping ACS Census Data for Urban Areas With PUMAs

Tuesday, December 16th, 2008

The NY Times wrote a story recently based on the new 3 year ACS data that the Census Bureau released a couple weeks ago (see my previous post for details). They created some maps for this story using geography that I would never have thought to use.

Outside of Decennial Census years, it is difficult to map demographic patterns and trends within large cities as you’ll typically get one figure for the entire city and you can’t get a break down for areas within. Data for areas like census tracts and zip codes is not available outside the ten-year census (yet), and large cities exist as single municipal divisions that aren’t subdivided. New York City is an exception, as it is the only city composed of several counties (boroughs) and thus can be subdivided. But the borough data still doesn’t reveal much about patterns within the city.

The NY Times used PUMAS – Public Use Microdata Areas – to subdivide the city into smaller areas and mapped rents and income. PUMAs are aggregations of census tracts and were designed for aggregating and mapping public microdata. Microdata consists of a selection of actual individual responses from the census or survey with the personal identifying information (name, address, etc) stripped away. Researchers can build their own indicators from scratch, aggregate them to PUMAs, and then figure out the degree to which the sample represents the entire population.

Since PUMAs have a large population, the new three-year ACS data is available at the PUMA level. The PUMAs essentially become surrogates for neighborhoods or clusters of neighborhoods, and in fact several NYC agencies have created districts or neighborhoods based on these boundaries for statistical or planning purposes. This wasn’t the original intent for creating or using PUMAs, but it’s certainly a useful application of them.

You can check out the NY Times article and maps here – Census Shows Growing Diversity in New York City (12/9/08). I tested ACS / PUMA mapping out myself by downloading some PUMA shapefiles from the Census Bureau’s Generalized Cartographic Boundaries page, grabbing some of the new annual ACS data from the American Factfinder, and creating a map of Philly. In the map below, you’re looking at 2005-2007 averaged data that shows the percentage of residents who lived in their current home last year. If you know Philly, you can see that the PUMAs do a reasonable job of approximating regions in the city – South Philly, Center City, West Philly, etc.

The problem I ran into here was that data did not exist for all of the PUMAs – in this case, South Philly and half of North Philly had values of zero. According to the footnotes on the ACS site, there were no values for these areas because “no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution”. So even though the PUMA geography is generally available, there still may be cases where data for particular variables for individual geographies is missing.

Just for the heck of it, I tried looking at the annual ACS data which is limited to more populated areas (must have 65k population where 3 year estimates are for areas with at least 20k) and even more data was missing (in this instance, all the areas in the northeast). Even though PUMAs have a minimum population of 100k people, the ACS sampling is county based. So even if the sample size for a county is ideal, they may not have a significant threshold for individual places within a county to compute an estimate. At least, that’s my guess. Regardless, it’s still worth looking at for the city and data you’re interested in.

ACS Data for Philly Pumas

Census Bureau Releases New ACS Data

Wednesday, December 10th, 2008

The Census Bureau released its new American Community Survey data the other day. Three year averages for a variety of socio-economic variables are now available for all geographic areas that have at least 20,000 people. The ACS has been releasing annual data for most of this decade for areas with at least 65,000 people and will continue to do so. They didn’t provide data for smaller areas because the numbers were not as statistically robust. Now that they have three years of data, they can average the numbers for three years and get sound data for areas with a population of at least 20k.

Data for 2005 to 2007 is available now, and like the annual numbers, you’ll get a range of values and a confidence interval. For example, we can say with 90% confidence that the estimated population of Atlantic City, NJ between 2005 and 2007 was 35,770, plus or minus 1,749 people. The Bureau created this estimate based on a sample of 1,379 people in AC.

Next year, the census will release new annual numbers for areas with a population of at least 65k, and will update the three year averages for areas with 20k by adding the newest year of data and dropping the oldest one to calculate a new average.

All of the data is available through the American Factfinder.

If you are looking for population figures for basic indicators (population, race, gender, age, and housing units) for basic geographic areas (states, counties, places, and metro areas), you’ll probably want to consider using estimates from the Bureau’s Population Estimates program instead. Their annual estimates are based on a demographic calculation that factors in births, deaths, and migration, and is not based on a survey (according to that program, Atlantic City had 39,684 residents in 2007 – 4,090 more people than the ACS midrange estimate). If you’re looking for any other kind of data (ethnicity, immigration status, income, poverty, rent, home value, etc) the ACS is your best bet.

By 2010 the Bureau will begin releasing ACS 5 year avearges for all geographic areas. Of course, we’ll also have our next decennial census in 2010. The big change here is that, since we’ll have the ACS churning out data for all areas for every year from that point forward, the Bureau is doing away with the long form (which was sent to one in six households) that was issued in past censuses, and will only collect data using the basic short form, which gets distributed to everyone. For more info on this change, see the Bureau’s Census 2010 info page.

Adding Long / Lat XY Data to ArcMap

Monday, July 7th, 2008

Here’s a tutorial I’ve been meaning to write: adding a table of longitude and latitude coordinates to ArcMap and turning them into features. For this example, I’ll be using place names from the GEOnet Names Server country files. The US National Geospatial Intelligence Agency has a pretty extensive list of geographic features for each country, with coordinates in many formats, including longitude and latitude in decimal degrees. I’ll use Botswana in southern Africa as an example, as it has a small record set and because I have some admin boundaries handy that I’ve downloaded from SAHIMS.

  • Download the file from the GNS and unzip it. It is a tab-delimited text file. If you like, you can open it in Excel or another spreadsheet to see what it looks like. This works fine for this example, but won’t work for larger or more populated countries because the files will exceed the maximum number of records that a spreadsheet can handle (65k). You’ll need to import the file into a database (Access for example) if you want to take a look in those cases. In either event, you’ll be able to add the text file directly to ArcMap, so no worries.
  • Add XY Data Open ArcMap and under the Tools menu, select Add XY Data. In the dialog box, you’ll select the file that contains your XY coordinates. Choose the text file you’ve downloaded. ArcMap will then search through the fields and look for appropriate ones to add as X and Y fields. In this case, it should correctly choose LONG for X and LAT for Y. If Arc couldn’t figure it out, you would have to specify which columns have the coordinates. Longitude is ALWAYS the X coordinate, and Latitude is ALWAYS the Y. Finally, you’ll select a projection. Choose the standard geographic coordinate system WGS 1984, which is usually a safe bet when adding long/lat data from most sources.
  • Add XY Dialog BoxHit OK, and Arc will plot the coordinates (after you click through the warning message). In this example, it looks like there is one wayward point, way to the north. When you see something like this, it often means that one of the coordinates is missing a minus sign: latitudes below the equator are negative, as are longitudes east of the international date line and west of the prime meridian. If you use the identity tool, you’ll see that the minus sign for latitude for this wayward point is missing. The easiest thing to do would be to go back into the text file, edit it, and add it to ArcMap again.
  • Even though Arc has plotted the points, they still don’t exist as features (remember the warning message? That’s essentially what it was saying). Select the plotted points in the Table of Contents, right-click, select Data, and select Export. Export the points out as a new shapefile or a feature class in a geodatabase. Then add the new features to the map.
  • At this point, it may be helpful to have a frame of reference for all of these points. Get your hands on some administrative layers, like country boundaries. I downloaded the outline of Botswana from SAHIMS. This step usually requires projecting and reprojecting, as you’ll need to get your points layer to match the projection of the other files you’re working with. I always use the ArcToolbox within ArcCatalog to fiddle with projections and then add the finished files to a new, blank map in ArcMap. In my case, the Botswana boundary was undefined – I had to consult the metadata from their website to figure out what the projection is (NAD 1927) and then define it using the ArcToolbox (Data Management Tools, Projections and Transformations, Define Projection). Then, I had to convert the Botswana points layer from WGS 1984 to match the boundary’s NAD 1927 projection (using Data Management Tools, Projections and Transformations, Feature, Project).
  • Plotted points with boundaryAdd the projected boundary and reprojected points to your map. Many of these points are point features (villages, towns, farms, mountain peaks), while others represent the geographic centers of lines (roads, rivers) or areas (administrative areas, parks, reserves). You’ll probably want to extract certain kinds of features. At this point, you’ll want to take a look at the attribute table for the points file and consult the NGS description for the names files. The description will tell you what each of the data columns represents and what all of the codes mean. The FC field will come in quite handy here, as it designates categories for each feature. So if we wanted to extract populated places, under the Selection Menu in ArcMap we could do a Select by Attribute where the field FC is equal to P, which is the code for populated place features. Once they are selected, you can do a Data, Export to create a new shapefile with just those features.
  • Alternatives do abound here. If you prefer, you could do a lot of the work of editing and creating feature subsets within a geodatabase. You can also follow these same, general procedures using open source tools (I believe that QGIS has a tool for adding XY data). And while we’re discussing a specific example here, the same basic steps would apply for any XY dataset.

Searching for Foreign Census Data

Saturday, July 5th, 2008

I’ve been looking for census data for various countries, and have visited the usual suspects that aggregate this data – the CIA World Factbook and the United Nations Population Information Network. Other supra-national orgs like the IMF and World Bank also create and compile this info. These are fine sources, particularly if your goal is to look at basic data for several (or all) countries. But if you are studying or writing about one country in particular, it may seem odd to cite the UN, and even odder to cite the CIA. It would be better to go right to the source – the chief statistical agency in that particular country. In all likelihood, this agency would also have more in-depth stats than the aggregators.

But – where is the source? Rather than be left to the mercy of google, where you’ll uncover the obvious suspects and lots of commercial sites and joe-schmoes who republished some data from last decade, visit the US Census Bureau’s list of foreign statistical agencies, which will lead you right to the source.

Assuming you can find some pages with some data (census data isn’t public domain in every country and isn’t necessarily online for free, or at all, in which case you may need to go with some of the aggregate sources), the next obstacle will be overcoming the language barrier. Many countries will publish pages in several languages, including English. Some may publish only limited info in English, or no info in English at all. If you don’t read the lingua franca, you can try a translating tool like Babblefish or the Google Language Tool to translate the page for you. The translation may not be perfect, but it should be good enough where you can figure out what you need (although if the language you are translating doesn’t use the Roman alphabet and Arabic numerals – i.e. 1,2,3 etc, you may have some trouble).

The toughest obstacle to overcome may be the organizational barrier. If you are familiar with the US Census Bureau, you’ll know that it’s a large and complex organization with many subdivisions and datasets (decennial census, acs, population estimates, etc). And despite it’s enormity, it doesn’t collect all socio-economic data (religious affiliation) and may not be the best source for all data (current labor force stats). Well – other countries are just as complicated, so be wary!

Another strategy would be to visit Wikipedia – not to cite as a source, but to find what sources they use. You’ll find many country specific articles that cite the CIA Factbook or the UN, but some of the more detailed and well written ones do cite reports written by the statistical agencies for the country in question, often with a link to the page or report. If you have access to some library databases, like Gale Virtual Reference, they will (usually) cite sound references as well. Happy hunting!

Fun With FIPS Codes

Saturday, May 31st, 2008

Surely I jest, as messing with FIPS codes is rarely fun at all. However, it’s one of those things that you constantly have to deal with when mapping US demographic data, and it’s one of those things that GIS books and tutorials rarely discuss. And since I was tangling with them yesterday, I thought I’d share some tricks you can use in Excel to manipulate and format FIPS codes.

FIPS codes were created by the federal government to uniquely identify all geographic units in the US and are widely used. In an ideal world, here are some examples of FIPS codes for four counties:

State_Name, County_Name, State_FIPS, County_FIPS, FIPS
California, Alpine, 06, 003, 06003
California, San Mateo, 06, 081, 06081
Delaware, New Castle, 10, 003, 10003
Pennsylvania, Philadelphia, 42, 101, 42101

Each state code is unique, and each county code is unique WITHIN each state. If you are working with data for every county, then you will need to use the concatenated 5 digit FIPS in order to have unique ID numbers for each county. Otherwise, in this example, you wouldn’t be able to distinguish Alpine County, CA from New Castle County, DE because they both have the same county FIPS: 003.

In your database or spreadsheet, the codes need to be saved as text strings, NOT as numbers. In the spreadsheet I opened the other day, they were mistakenly saved as numbers. Excel also has an annoying habit of making corrections which aren’t really correct at all. So instead of the having the above, I had this:

State_Name, County_Name, State_FIPS, County_FIPS, FIPS
California, Alpine, 6, 3, 6003
California, San Mateo, 6, 81, 6081
Delaware, New Castle, 10, 3, 10003
Pennsylvania, Philadelphia, 42, 101, 42101

Since the codes were saved as numbers, all leading zeros were dropped. This is a problem, as I want to relate this data to data stored in other tables and shapefiles, where the FIPS codes are stored correctly. So, I needed to convert the data in this example back to strings (as join fields must have the same data type – you can’t join strings to numbers), and I had to get those zeros back. Somehow. Here’s how:

Open the data up in Excel, and convert the State_FIPS field to strings by selecting the column, selecting Format, and changing the format to text.

Then insert a column to the right of State_FIPS, and type in a formula:

=IF(Len(c2)=1,concatenate(0,c2),c2)

In English, this says: If the length of the value in c2 is one, then put a zero in front of the value in the new cell. Otherwise, just print the value of c2 in the cell. So in our example, the code for California would get converted from 6 to 06, while Delaware and Pennsylvania’s codes will just get reprinted, as they have the correct number of characters (two).

Then, it’s just a matter of copying and pasting this formula all the way down the rows, to create the correct two digit FIPS for each record. Lastly, select the whole column, copy it, go up to the Edit menu and select paste special for values only. This will overwrite all of the formulas in these cells and replace them with the actual output of the formula. You can follow these steps with the other two fields; the only thing that needs to change are the formulas.

Fixing the COUNTY_FIPS field is trickier, as we have three possibilities here: we’ll need to add two zeros if the code is one one character long, one zero if the code is two characters long, and nothing if the code is three characters long. Here’s the formula:

=IF(Len(e2)=3,e2,IF(Len(e2)=2,concatenate(0,e2),concatenate(“00″,e2)))

The parentheses get confusing, as arguments within each IF statement and each Len and concatenate function need their own parentheses. In English: if the value in e2 is three characters long, just print that value in the new cell. Otherwise, if the length is two characters, tack a zero to the front of that value and then print it. Otherwise, tack two zeros to the front of it and print. In that last piece, we’re making the implicit assumption that if it isn’t three characters long, and if it isn’t two, it must be one. The two zeros must go in parentheses so Excel reads it as a string. Without the parentheses, Excel assumes you mean a number and will “correct” you by dropping a zero – which is exactly what you DON’T WANT!

So in our example above, Alpine and New Castle get two zeros added to the front, San Mateo gets one, and Philadelphia gets none.

Fixing the last FIPS field is easy – it’s just a repeat of our first formula, as there are only two options: Add a zero to the front if it’s four characters long, otherwise just print the value (which is five characters long).

If you’re using a spreadsheet other than Excel, like Open Office’s Calc, you can use similar formulas with a few syntax changes. Calc uses colons instead of commas to separate values in arguments, so concatenate(0,e2) becomes concatenate (0:e2). You could also perform these operations in Microsoft Access using a Make Table query, with some adjustments. You would reference fields instead of cell values, and instead of the word concatenate, Access uses the syntax &”"& in between fields that you want to join.

When I worked at the University of Washington Library, I created a tutorial for concatenating fields in Access, to supplement a tutorial my predecessor created for Excel. Both docs are available on WAGDA’s FAQ page.

Real fun, huh? ; )


Copyright © 2013 Gothos. All Rights Reserved.
No computers were harmed in the 0.576 seconds it took to produce this page.

Designed/Developed by Lloyd Armbrust & hot, fresh, coffee.