Archive for June 2009

Formulas for Working With Census ACS Data in Excel / Calc

After downloading US census data, you often need to reformat it before using it. It’s quite common that you download files where the population is broken down by gender and age, and you need to aggregate the data to get a total or divide a particular characteristic to get a percent total. This is pretty straightforward if you’re working with decennial census data, but data from the American Community Survey (ACS) is a little trickier to deal with since you’re working with estimates that have a margin of error. When creating new data, you also have to calculate what the margin of error is for your derived numbers. I’ll walk through some examples of how you would do this in a spreadsheet (the formulas below will work in either Excel or Calc).

Creating an Aggregate

We’ll use the following data in our example:

screenshot1

We have the total population of people three years and older who are enrolled in school, and a breakdown of this population enrolled in grades 1 through 4 and grades 5 through 8 in a few counties in New York, with margins of error for each data point. Our data is from the 3 year averaged 2005-2007 American Community Survey.

Let’s say we want to create a total for students who are enrolled in grades 1 through 8 for each county. We create a new column and sum the estimates for each county with the formula e3+g3, or sum(e3:g3).

To calculate a margin or error (MOE) for our grade 1 to 8 data, first we have to use the find and replace command to get rid of the “+/-” signs in the MOE column, so our spreadsheet will treat our values as numbers and not text (this is an issue if you downloaded the data as an Excel file – if you download a txt file the +/- is not included). Depending on the dataset you’re working with, you may also need to replace dashes, which represent data that was null or not estimated.

Once the data is cleaned up, we can insert a new column with this formula:

=SQRT((F3^2)+(H3^2))

This calculates our new margin of error by squaring the moes for each of our data points, summing the results together, and taking the square root of that sum. In other words,

=SQRT((MOE1^2)+(MOE2^2))

Once that’s done, you may want to round the new MOE to a whole number.

Creating a Percent Total

Let’s calculate the percentage of the population 3 years and older enrolled in school that are in grades 1 through 8. Based on what we have thus far (I hid the columns E,F,G, and H for grades 1-4 and 5-8 in this screenshot, as we don’t need them):

screenshot-2

We insert a new column where we divide our subgroup by the total, as you would expect – I3/C3. In the next column we insert the following formula to create a MOE for our new percent total:

=(SQRT((J3^2)-((K3^2)*(D3^2))))/C3

This one’s a little weightier than our last formula. We’re taking the square of our percent total (K3) and the square of the MOE of the total population (D3), multiplying them together, then subtracting that number from the square of the MOE of our subgroup (J3). Then we take the square root of the whole thing, then divide it by our total population (C3). If you’re saying – HUH? Maybe this is clearer:

=(SQRT((MOEsubset^2)-((PercentTotal^2)*(MOEtotalpop^2))))/TotalPop

Finally, we have something like this:

screenshot-3

Based on our data, we can say things like “There were approximately 30,556 students enrolled in 1st through 8th grade per year in Dutchess County, NY between 2005 and 2007, plus or minus 1,184 students. An estimated 37% of the population enrolled in school in the county was in the 1st through 8th grade, plus or minus 1%.” The ACS estimates have a 90% confidence interval.

Wrap Up

In this example we worked with aggregating and calculating percentages based on characteristics. We could also use these same formulas to aggregate data by geography, if we wanted to add the characteristics for all the counties together.

For the full documentation on working with ACS data, take a look at the appendix in the Census’ ACS Compass Guide, What General Data Users Need to Know. It provides you with the formulas in their proper statistical notation (for those of you more mathematically inclined than I) and includes formulas for calculating other kinds of numbers, such as ratios and percent change. It does provide you with worked-through examples, but not with spreadsheet formulas. I used their examples when I created formulas the first time around, so I could compare my formula results to their examples to insure that I was getting it right. I’d strongly recommend doing that before you start plugging away with your own data – one misplaced parentheses and you could end up with a different (and incorrect) result.

2007 Economic Census

The US Census Bureau has begun releasing data for the 2007 Economic Census. The bureau conducts the survey of businesses every five years – all medium to large size businesses and multi-part businesses are counted, samples are taken for smaller businesses, and various administrative records are used to calculate businesses with no employees (i.e. freelancers). All businesses are categorized hierarchically by North American Industrial Classification System (NAICS) codes and the data is reported by industry and geography. The number of establishments, employees, payroll, and sales are counted for the nation, states, counties, metro areas, places, and zip codes.

At this point national industry totals for the broadest categories of NAICS are available, as are preliminary numbers for the most specific NAICS categories (six digit) at the national level. Data for smaller geographic areas will be released between October 2009 and August 2010.

The biggest change from the 2002 Economic Census is the delivery method for the data. There will be no more 90 page pdf files or HTML tables that drill down six levels. All of the data will be released via the American Factfinder only. Other changes include the addition of some new geography (CDPs with at least 5000 people), new metro area definitions, and the revised 2007 definitions for NAICS which include small changes to the Finance, Insurance, Real Estate, Professional Services, and Administrative Services categories.

Additional changes for 2007, the data release schedule, NAICS codes, and methodology docs are all available at the 2007 Economic Census homepage within the Census Bureau’s website.

All of the data is aggregated by industry and geography – you cannot get lists of businesses with names and addresses as this information is kept confidential. Furthermore, to maintain confidentiality, if one company controls a large share of the market for a specific sector within a specific geographic area, or if there few businesses within a sector in a specific geographic area, much of the data (with the exception of the number of businesses) remains classified (marked with a D for disclosure). Oftentimes this means that data for industries within small areas (big box retail in a small town) and data for industries with few establishments in an area (mining establishments in New York City) are hidden. The smaller the geography, the more likely it is that the data will not be disclosed. This becomes a technical issue if you want / need to move this data into a database, as these pesky disclosure notes are stored in the same columns as the data and prevent you for designating the fields as numeric.

Given the delay between the time the data is collected and the time it is released, it isn’t particularly helpful for analyzing our current economic climate, but it does provide a snapshot of the way the US economy looked at that moment, and is useful in understanding how the economy is evolving. Be aware that when making comparisons to past data, you have to correct for changes in geography and NAICS definitions. The differences between 2002 and 2007 are not too great, but more adjustments are necessary as you go further back in time. The Bureau provides data back to 1997 through the American Factfinder and some data from 1992 on an older page. If you need to go back further, you’ll be entering the realm of (gasp!) CD-ROMs or the paper reports.