## Posts Tagged ‘geodatabase’

Thursday, August 13th, 2015

A few weeks ago I launched a new version of our college’s GIS data repository, the Baruch Geoportal. At the back end I have a simplified process for getting data onto our server, and on the front end we did away with manually updating HTML and CSS webpages in favor of using a Confluence wiki. My college has a subscription to Confluence, and I’ve been using an internal wiki for documenting and administering all aspects of our projects. A public, external wiki for providing our data seemed like a nice way to go – we can focus more on the content and it’s easier for my team and I to collaborate.

Since it’s a new site with a new address, many of the links to projects I’ve referred to throughout the years on this blog are no longer valid. Redirects are in place, but they won’t last forever. Some notable links to update:

The new site has a dedicated blog that you can follow (via RSS) for the latest updates to the portal. The portal also has a number of relatively new and publicly accessible datasets that we’ve posted over the last year (but that I haven’t had time to post about). These include the NYC Mass Transit Spatial Layers series and population centroids for US census geographies. We’ve been creating ISO spatial metadata for all of our new layers, but we still need to create XML stylesheets to make them more human-readable. That will be one of many projects to do for this academic year.

### Writing Functions and Building a Jinja Template

Thursday, August 6th, 2015

In previous posts I demonstrated how to pull data from a sqlite / spatialite data to generate reports using Python and Jinja, where Jinja2 is used as a template engine for creating LaTeX documents and the NYC Geodatabase is used as my test case. Up until now the scripts pulled the data “as is”. In this post I’ll demonstrate how I created derived variables, and how I created the Jinja2 template for the report. Please note – instead of duplicating all of the code I’m just going to illustrate the new pieces – you should check out the earlier posts to see how all the pieces fit together.

### Aggregating Variables

Aggregating census data is a pretty common operation, and when working with American Community Survey estimates it’s also necessary to calculate a new margin of error for each derived value. I wrote two functions to accomplish this. For each function you pass in the keys for values you want to aggregate, a name which will be the name of the new variable, and a dictionary that contains all the keys and values that were taken from a database table for a specific geography.

#Functions for summing individual values and calculating margins of error
#for individual values

tosum=[]
for val in keys:
agg=sum(tosum)

sqrd=[]
for val in keys:
if item=='':
pass
else:
sqrd.append(item**2)
moe=round(math.sqrt(sum(sqrd)))


Later in the script, as we’re looping through all the geographies and gathering the necessary data into dictionaries that represent each data table, we call the function. In this example we’re combining household income brackets so that we don’t have so many categories:

for geog in geodict.keys():

name=geodict.get(geog)
filename='zzpuma_' + geog + '.tex'
folder='puma_rept'
outpath=os.path.join(folder,filename)

acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)

calc_sums(['INC03_E','INC04_E'],'INC10K_E',acs1dict)
calc_moe(['INC03_M','INC04_M'],'INC10K_M',acs1dict)
calc_sums(['INC05_E','INC06_E'],'INC25K_E',acs1dict)
calc_moe(['INC05_M','INC06_M'],'INC25K_M',acs1dict)
calc_sums(['INC09_E','INC10_E'],'INC100K_E',acs1dict)
calc_moe(['INC09_M','INC10_M'],'INC100K_M',acs1dict)


Rather than creating a new dictionary, these new values are simply appended to the existing dictionaries that contain the data taken from each of the ACS data tables in the database. They can be referenced in the template using their new column name.

### Calculating Areas

I also want to include the geographic size of the PUMA as one of the report items. Columns for the area are included in the spatial table for the PUMAs – the features originally came from the TIGER files, and all TIGER files have an ALAND and an AWATER column that has land and water area in square meters. So we don’t have to calculate the area from the geometry – we can just use this function to convert the land and water attributes to square miles, and then calculate a total area:

def calc_area(adict,land,water,total):
totalarea=landarea+waterarea


In the body of our script, we invoke our pulltab function (explained in an earlier post) to grab all the data from the PUMA spatial boundary table:

area=pulltab('c_bndy_pumas2010','geoid10',geog)


And then we can call our area function. We pass in the area dictionary, and what we want the new output column names to be – area for land, water, and total:

calc_area(area, 'LAND_SQM','WAT_SQM','TOT_SQM')


Like our previous aggregate script, this function appends our new values to the existing table-dictionary – in this case, one called area.

### Aggregating Geographies

Our last function is a little more complicated. In all of our previous examples, we pulled PUMA-level data from the American Community Survey tables. What if we wanted 2010 Census data for the PUMAs? Decennial census data is not tabulated at the PUMA level, but it is tabulated at the census tract level. Since PUMAs are created by aggregating tracts, we can aggregate the census tract data in the NYC Geodatabase into PUMAs. Here’s our function:

#Function aggregates all values in a table with a group by field from a
#joined table, then creates a dictionary consisting of column names and values
#for a specific geography

def sumtab(tabname,jointab,id1,id2,gid,geog):
query='SELECT * FROM %s LIMIT 1' %(tabname)
curs.execute(query)
col_names = [cn[0] for cn in curs.description]
tosum=[]
for var in col_names[3:]:
tosum.append("SUM("+var+") AS '0_"+var+"'")
summer=', '.join(str(command) for command in tosum)
query='SELECT %s, %s FROM %s, %s WHERE %s = %s and %s = %s GROUP BY %s' %(gid,summer,tabname,jointab,id1,id2,gid,geog,gid)
curs.execute(query)
col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
for row in rows:
thedict=dict(zip(col_names,row))
return thedict


What’s going on here? The first thing we need to do is associate the census tracts with the PUMAs they’re located in. The NYC Geodatabase does NOT have a relationship table for this, so I had to create one. We have to pass in the table name, the relationship table, the unique IDs for each, and then the ID and the geography that we’re interested in (remember our script is looping through PUMA geographies one by one). The first thing we do is a little trick – we get the names of every column in the existing data table, and we append them to a list where we create a new column name based on the existing one (in this case, append a 0 in front of the column name – in retrospect I realize this is a bad idea as column names should not begin with numbers, so this is something I will change). Then we can take the list of column names and create a giant string out of them.

With that giant string (called summer) we can now pass all of the parameters that we need into the SQL query. This selects all of our columns (using the summer string), the table names and join info, for the specific geographic area that we want and then groups the data by that geography (i.e. all tracts that have the same PUMA number). Then we zip the column names and values together in a dictionary that the function returns.

Later on in our script, we call the function:

    census10=sumtab('b_tracts_2010census','b_tracts_to_pumas','GEOID2','tractid','pumaid',geog)


Which creates a new dictionary called census10 that has all the 2010 census data for our PUMA. Like the rest of our dictionaries, census10 is passed out to the Jinja2 template and its values can be invoked using the dictionary keys (the column headings):

outfile=open(outpath,'w')
outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict, area=area,
c2010=census10))
outfile.close()


### Designing the Template

The Jinja template is going to look pretty busy compared to our earlier examples, and in both cases they’re not complete (this is still a work in progress).

I wanted to design the entire report first, to get a sense for how to balance everything I want on the page, without including any Jinja code to reference specific variables in the database. So I initially worked just in LaTeX and focused on designing the document with placeholders. Ultimately I decided to use the LaTeX minipage environment as it seemed the best approach in giving me control in balancing items on the page. The LaTeX wikibook entries on floats, figures, and captions and on boxes was invaluable for figuring this out. I used rule to draw boxes to serve as placeholders for charts and figures. Since the report is being designed as a document (ANSI A 8 1/2 by 11 inches) I had no hang-up with specifying precise dimensions (i.e. this isn’t going into a webpage that could be stretched or mushed on any number of screens). I loaded the xcolor package so I could modify the row colors of the tables, as well as a number of other packages that make it easy to balance table and figure captions on the page (caption, subscaption, and multicol).

Once I was satisfied with the look and feel, I made a copy of this template and started modifying the copy with the Jinja references. The references look awfully busy, but this is the same thing I’ve illustrated in earlier posts. We’re just getting the values from the dictionaries we created by invoking their keys, regardless of whether we’re taking new derived values that we created or simply pulling existing values that were in the original data tables. Here’s a snippet of the LaTeX with Jinja that includes both derived (2010 Census, area) and existing (ACS) variables:

%Orientation - detail map and basic background info
\begin{minipage}{\textwidth}
\begin{minipage}[h]{3in}
\centering
\rule{3in}{3in}
\captionof{figure}{Race by 2010 Census Tract}
\end{minipage}
\hfill
\begin{minipage}[h]{4in}
\centering
\captionof{table}{Geography}
\begin{tabular}{cccc}\hline
& Land & Water & Total\\ \hline
Area (sq miles) &  \num{\VAR{area.get('LAND_SQM')}} & \num {\VAR{area.get('WAT_SQM')}} &  \num {\VAR{area.get('TOT_SQM')}} \\ \hline
\vspace{10pt}
\end{tabular}
\captionof{table}{Basic Demographics}
\rowcolors{3}{SpringGreen}{white}
\begin{tabular}{cccc}\hline
& \textbf{2010 Census} & \textbf{2009-2013} & \textbf{ACS Margin}\\
& & \textbf{ACS} & \textbf{of Error}\\ \hline
Population & \num {\VAR{c2010.get('0_HD01_S001')}} & \num{\VAR{acs2.get('SXAG01_E')}} & +/- \num{\VAR{acs2.get('SXAG01_M')}}\\
Males & \num {\VAR{c2010.get('0_HD01_S026')}} & \num{\VAR{acs2.get('SXAG02_E')}} & +/- \num{\VAR{acs2.get('SXAG02_M')}}\\
Females & \num {\VAR{c2010.get('0_HD01_S051')}}& \num{\VAR{acs2.get('SXAG03_E')}} & +/- \num{\VAR{acs2.get('SXAG03_M')}}\\
Median Age (yrs) & \num {99999} & \num{\VAR{acs2.get('SXAG17_E')}} & +/- \num{\VAR{acs2.get('SXAG17_M')}}\\
Households & \num {\VAR{c2010.get('0_HD01_S150')}} & \num{\VAR{acs1.get('HSHD01_E')}} & +/- \num{\VAR{acs1.get('HSHD01_M')}}\\
Housing Units & \num {\VAR{c2010.get('0_HD01_S169')}} & \num{\VAR{acs2.get('HOC01_E')}} & +/- \num{\VAR{acs2.get('HOC01_M')}}\\ \hline
\end{tabular}
\end{minipage}
\end{minipage}


And here’s a snippet of the resulting PDF:

### What Next?

You may have noticed references to figures and charts in some of the code above. I’ll discuss my trials and tribulations with trying to use matplotlib to create charts in some future post. Ultimately I decided not to take that approach, and was experimenting with using various LaTeX packages to produce charts instead.

### Looping Through a Database to Create Reports

Tuesday, July 28th, 2015

I’ve got a lot of ground to cover, picking up where I left off several months ago. In earlier posts I presented the concept for creating reports from sqlite databases using Python, Jinja, and LaTeX, and looked at different methods for passing data from the database to the template. I’m using the NYC Geodatabase as our test case. In this entry I’ll cover how I implemented my preferred approach – creating Python dictionaries to pass to the Jinja template.

One of the primary decisions I had to make was how to loop through the database. Since the reports we’re making are profiles (lots of different data for one geographic area), we’re going to want to loop through the database by geography. So, for each geography select all the data from a specific table, pass the data out to the template where the pertinent variables are pulled, build the report and move on to the next geography. In contrast, if we were building comparison tables (one specific variable for many geographic areas) we would want to loop through the data by variable.

In the beginning of the script we import the necessary modules, set up the Jinja environment, and specify our template (not going to repeat that code here – see the previous post). Then we have our function that creates a dictionary for a specific data table for a specific geography:

def pulltab(tabname,idcol,geog):
query='SELECT * FROM %s WHERE %s = %s' %(tabname,idcol,geog)
curs.execute(query)
col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
for row in rows:
thedict=dict(zip(col_names,row))
return thedict


We connect to the database and create a dictionary of all the geographies (limited to 3 PUMAs since this is just a test):

#Connect to database and create dictionary of all geographies

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT geoid10, namelsad10 FROM a_pumas2010 ORDER BY geoid10 LIMIT 3')
rows = curs.fetchall()
geodict=dict(rows)


And then we generate reports by looping through all the geographies in that dictionary, and we pass in the ID of each geography to pull all data from a data table for that geography out of the table and into a dictionary.

#Generate reports by looping through geographies and passing out
#dictionaries of values

for geog in geodict.keys():
acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)
name=geodict.get(geog)
filename='zzpuma_' + geog + '.tex'
folder='test5'
outpath=os.path.join(folder,filename)


Lastly, we pass the dictionaries out to the template, and run LaTeX to generate the report from the template:

  outfile=open(outpath,'w')
outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict))
outfile.close()

os.system("pdflatex -output-directory=" + folder + " " + outpath)

conn.close()


The Jinja template (as a LaTeX file) is below – the example here is similar to what I covered in my previous post. We passed two dictionaries into the template, one for each data table. The key is the name of the variable (the column name in the table) and the value is the American Community Survey estimate and the margin of error. We pass in the key and get the value in return. The PDF output follows.

\documentclass{article}
\usepackage[margin=0.5in]{geometry}
\usepackage{graphicx}
\usepackage[labelformat=empty]{caption}
\usepackage[group-separator={,}]{siunitx}

\title{\VAR{acs1.get('GEOLABEL') | replace("&","\&")} \VAR{acs1.get('GEOID2')}}
\date{}

\begin{document}

\maketitle
\pagestyle{empty}
\thispagestyle{empty}

\begin{table}[h]
\centering
\caption{Commuting to Work - Workers 16 years and over}
\begin{tabular}{|c|c|c|c|c|}

\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\
\hline

Car, truck, or van alone & \num{\VAR{acs1.get('COM02_E')}} & +/- \num{\VAR{acs1.get('COM02_M')}}
& \num{\VAR{acs1.get('COM02_PC')}} & +/- \num{\VAR{acs1.get('COM02_PM')}}\\

Car, truck, or van carpooled & \num{\VAR{acs1.get('COM03_E')}} & +/- \num{\VAR{acs1.get('COM03_M')}}
& \num{\VAR{acs1.get('COM03_PC')}} & +/- \num{\VAR{acs1.get('COM03_PM')}}\\

Public transit & \num{\VAR{acs1.get('COM04_E')}} & +/- \num{\VAR{acs1.get('COM04_M')}}
& \num{\VAR{acs1.get('COM04_PC')}} & +/- \num{\VAR{acs1.get('COM04_PM')}}\\

Walked & \num{\VAR{acs1.get('COM05_E')}} & +/- \num{\VAR{acs1.get('COM05_M')}}
& \num{\VAR{acs1.get('COM05_PC')}} & +/- \num{\VAR{acs1.get('COM05_PM')}}\\

Other means & \num{\VAR{acs1.get('COM06_E')}} & +/- \num{\VAR{acs1.get('COM06_M')}}
& \num{\VAR{acs1.get('COM06_PC')}} & +/- \num{\VAR{acs1.get('COM06_PM')}}\\

Worked at home & \num{\VAR{acs1.get('COM07_E')}} & +/- \num{\VAR{acs1.get('COM07_M')}}
& \num{\VAR{acs1.get('COM07_PC')}} & +/- \num{\VAR{acs1.get('COM07_PM')}}\\
\hline

\end{tabular}
\end{table}

\begin{table}[h]
\centering
\caption{Housing Tenure}
\begin{tabular}{|c|c|c|c|c|}

\hline
& Estimate & Margin of Error & Percent Total & Margin of Error\\
\hline

Occupied housing units & \num{\VAR{acs2.get('HTEN01_E')}} & +/- \num{\VAR{acs2.get('HTEN01_M')}} &  &\\

Owner-occupied & \num{\VAR{acs2.get('HTEN02_E')}} & +/- \num{\VAR{acs2.get('HTEN02_M')}}
& \num{\VAR{acs2.get('HTEN02_PC')}} & +/- \num{\VAR{acs2.get('HTEN02_PM')}}\\

Renter-occupied & \num{\VAR{acs2.get('HTEN03_E')}} & +/- \num{\VAR{acs2.get('HTEN03_M')}}
& \num{\VAR{acs2.get('HTEN03_PC')}} & +/- \num{\VAR{acs2.get('HTEN03_PM')}}\\
\hline

\end{tabular}
\end{table}
\end{document}


In this example we took the simple approach of grabbing all the variables that were in a particular table, and then we just selected what we wanted within the template. This is fine since we’re only dealing with 55 PUMAs and a table that has 200 columns or so. If we were dealing with gigantic tables or tons of geographies, we could modify the Python script to pull just the variables we wanted to speed up the process; my inclination would be to create a list of variables in a text file, read that list into the script and modify the SQL function to just select those variables.

What if we want to modify some of the variables before we pass them into the template? I’ll cover that in the next post.

Tuesday, March 4th, 2014

This example takes over where the Loading Data Into PostGIS – Text Files left off; we’ll use the same schema usa_general. We’ll load two shapefiles from the Census Bureau’s 2010 Cartographic Boundary files from https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html.

Loading shapefiles can be done via the pgAdmin III GUI (for this example, used version 1.18 for MS Windows) from a client machine using the shp2pgsql-gui. This is an extra plugin that must be downloaded and installed. A command line version of the tool (shp2pgsql) is automatically bundled with PostGIS, but can’t be used by a client unless they either log into the server via a terminal or install PostGIS locally.

The shp2pgsql-gui can be downloaded for MS Windows from the extras folder on the PostGIS website: http://download.osgeo.org/postgis/windows/. Version 2.01 for 32 bit environment was downloaded and unzipped. Per the instructions in the README file, the postgisgui folder was dragged into the PG BIN path as designated in pgAdmin under File > Options > Binary Paths. Once pgAdmin was restarted the gui was available under the Plugins menu.

### Set Options

Under Plugins > PostGIS Shapefile and DBF Loader, Check the Options Menu. There are often problems when dealing with UTF-8 encoding in DBF files and when working in an MS Windows environment. Change encoding from UTF-8 to WINDOWS-1252 (another possibility would be LATIN-1). Keep boxes for creating spatial index and using COPY instead of INSERT checked.

Browse and select both shapefiles. Once they appear in the import list the following elements must be changed: switch schema from public to desired schema (usa_general), change table to give each one a meaningful name, and specify the SRID number for each (since these files are from the US Census they’re in NAD 83, which is EPSG 4269). Once you hit Load, the log window will display the results of each import.

Refresh the database and both shapefiles will appear as tables.

### Check Geometries

Because of the size of some geometries, they may appear blank if you preview the tables in pgAdmin – this is a well known and documented issue. The simplest way to verify is to check and make sure that none of the geometry columns are null:

SELECT COUNT(*)
FROM usa_general.states
WHERE geom IS NULL;

“0”

### Primary Keys

By default, PostGIS will automatically create a field called gid using sequential integers, and will designate this field as the primary key. If the data already has a natural key that you want to use, you would have to drop the key constraint on gid and add the constraint to the desired field. Since features from the US Census have unique ANSI / FIPS codes, we’ll designate that as the key.

ALTER TABLE usa_general.states
DROP CONSTRAINT states_pkey

Query returned successfully with no result in 125 ms.

ALTER TABLE usa_general.states
ADD CONSTRAINT states_pkey PRIMARY KEY (“state”);

Query returned successfully with no result in 125 ms.

### Constraints

An alternative to changing the primary key would be to add UNIQUE and NOT NULL constraints to additional columns; gid remains as the primary key but the other columns will automatically be indexed when constraints are added.

ALTER TABLE usa_general.metros ADD CONSTRAINT cbsa_unq UNIQUE (cbsa);
ALTER TABLE usa_general.metros ALTER COLUMN cbsa SET NOT NULL;

### Delete Records

For sake of example, we’ll delete all the records for micropolitan areas in the metros table, so we’re left with just metropolitan areas.

DELETE FROM usa_general.metros

Query returned successfully: 581 rows affected, 219 ms execution time.

## Spatial SQL

As a test we can do a query to select all of the populated places that are within the NYC metropolitan area but are not in Pennsylvania, that have an elevation greater than or equal to 500 feet.

SELECT feature_name, state_alpha, county_name, elev_in_ft
FROM usa_general.pop_places, usa_general.metros
WHERE metros.cbsa=’35620′
AND ST_WITHIN(pop_places.geom, metros.geom)
AND elev_in_ft >= 500
AND state_alpha !=’PA’
ORDER BY elev_in_ft DESC

“Kampe”;”NJ”;”Sussex”;1293
“Highland Lakes”;”NJ”;”Sussex”;1283
“Edison”;”NJ”;”Sussex”;1234″…

## Connect to QGIS

We can look at the geographic features by establishing a connection between the database and QGIS. In QGIS we can either hit the add PostGIS Layer button or use the browser to connect to a our database; we need to add the connection name, host, port, and database name. The username and password can be left blank and Q will prompt the user for it (if you don’t want it saved on the system). Once the connection is established you can add the layers to the view and change the symbolization.

Tuesday, February 25th, 2014

I’ve fallen out of the blogosphere for quite awhile. Time to catch up – lately we’ve been experimenting with deploying our own instance of Open Geoportal (see previous post) and square one is getting a functioning data repository up and running. I’ve been tinkering with PostgreSQL / PostGIS and am documenting tests in a wiki we’ve created. The wiki is private, so I thought I’d re-post some of the tests here.

I’m working with a developer who’s installed and configured the database on a server, and I’m accessing it remotely using pgAdmin. I’ve started loading data and running queries and so far I’m relieved that most of what I’m seeing seems familiar, having worked with SQLite / Spatialite for a few years now. I’m using the excellent book PostGIS in Action as my guide. Here’s my take on loading a text file with coordinates into the db and building geometry for it. Loading shapefiles, spatial SQL experiments, and connecting to Geoserver will follow.

## Verify Version

SELECT postgis_full_version();

“POSTGIS=”2.1.1 r12113″ GEOS=”3.4.2-CAPI-1.8.2 r3921″ PROJ=”Rel. 4.8.0, 6 March 2012″ GDAL=”GDAL 1.9.2, released 2012/10/08″ LIBXML=”2.7.6″ LIBJSON=”UNKNOWN” TOPOLOGY RASTER”

## Create Schema for Holding Related Objects

CREATE SCHEMA usa_general;

Query returned successfully with no result in 297 ms.

## Import Delimited Text File

Test file is the US Populated Places gazetteer file from the USGS Geographic Names Information Service (GNIS). It is a pipe-delimited text file in encoded in UTF-8 with longitude and latitude coordinates in both DMS and DEC format. http://geonames.usgs.gov/domestic/download_data.htm

### Create Table for Holding Data

CREATE TABLE usa_general.pop_places
(
feature_id int NOT NULL PRIMARY KEY,
feature_name varchar,
feature_class varchar,
state_alpha char(2),
state_numeric char(2),
county_name varchar,
county_numeric char(3),
primary_lat_dms varchar,
prim_long_dms varchar,
prim_lat_dec float4,
prim_long_dec float4,
source_lat_dms varchar,
source_long_dms varchar,
source_lat_dec float4,
source_long_dec float4,
elev_in_m int,
elev_in_ft int,
map_name varchar,
date_created date,
date_edited date
);

Query returned successfully with no result in 390 ms.

### Copy Data From File

Must be run from the PSQL Console plugin in order to load data from a local client machine. If data was stored on the server, you could use the PGAdmin GUI and use COPY in an SQL statement instead of \copy. When running the PSQL Console on MS Windows the default character encoding is WIN1252. In this example, the data contains characters unsupported by this encoding; the file is encoded in UTF-8 and the console must be set to match. COPY is not a standard SQL command but is native to PostgreSQL.

— The optional HEADER specifies that the data file has a header column that should be skipped when importing

\encoding UTF8
\copy usa_general.pop_places
FROM ‘C:\Workspace\postgis_testing\POP_PLACES_20140204.txt’

### Basic Query to Test that Load was Success

SELECT *
FROM usa_general.pop_places
WHERE state_alpha=’NY’ and county_name=’New York’
ORDER BY feature_name

### Create Geometry from Coordinates and Make Spatial Index

4269 is the EPSG code for NAD 83, the basic geographic coordinate system used by US government agencies.

“usa_general.pop_places.geom SRID:4269 TYPE:POINT DIMS:2 ”

UPDATE usa_general.pop_places
SET geom = ST_GeomFromText(‘POINT(‘|| prim_long_dec || ‘ ‘|| prim_lat_dec || ‘)’,4269);

Query returned successfully: 200359 rows affected, 18268 ms execution time.

CREATE INDEX idx_pop_places_geom
ON usa_general.pop_places USING gist(geom);

Query returned successfully with no result in 8439 ms.

## Basic SQL Query to Test Geometry

ST_AsEWKT transforms the output of geometry from binary code into the OGC Well Known Text format (WKT).

SELECT feature_name, ST_AsEWKT(geom) AS geometry
FROM usa_general.pop_places
WHERE state_alpha=’NY’ AND county_name=’New York’
ORDER BY feature_name

“Amalgamated Dwellings”;”SRID=4269;POINT(-73.9828 40.715)”
“Amsterdam Houses”;”SRID=4269;POINT(-73.9881 40.7731)”
“Battery Park City”;”SRID=4269;POINT(-74.0163 40.7115)”…

## Basic Spatial SQL Query to Test Geometry

Selects the northernmost point in the table.

SELECT feature_name, state_alpha, ST_AsEWKT(geom) AS geometry
FROM usa_general.pop_places
WHERE ST_Xmax(geom) IN(
SELECT Max(ST_Xmax(geom))
FROM usa_general.pop_places)

“Amchitka”;”AK”;”SRID=4269;POINT(178.878 51.5672)”

## Drop Columns

Drop some columns that have no data.

ALTER TABLE usa_general.pop_places
DROP COLUMN source_lat_dms,
DROP COLUMN source_long_dms,
DROP COLUMN source_lat_dec,
DROP COLUMN source_long_dec;

Query returned successfully with no result in 180 ms.

### Creating Reports with SQLite, Python, and prettytable

Friday, February 8th, 2013

In addition to providing the NYC Geodatabase as a resource, I also wanted to use it to generate reports and build applications. None of the open source SQLite GUIs that I’m familiar with have built in report generating capabilities, so I thought I could use Python to connect to the database and generate them. I have some grand ambitions here, but decided to start out small.

Python has a built-in module, sqlite3, that you can use to work with SQLite databases. This is pretty well documented – do a search and you’ll find a ton of brief tutorials. Take a look at this great post for a comprehensive intro.

For generating reports I gave prettytable a shot: it lets you create nice looking ASCII text tables that you can copy and paste from the prompt or export out to a file. The tutorial for the module was pretty clear and covers the basics quite nicely. In the examples he directly embeds the data in the script and generates the table from it, which makes the tutorial readily understandable. For my purposes I wanted to pull data out of a SQLite database and into a formatted table, so that’s what I’ll demonstrate here.

Initially I had some trouble getting the module to load, primarily (I think) because I’m using Python 3.x and the setup file for the module was written for Python 2.x; the utility you use for importing 3rd party modules has changed between versions. I’m certainly no Python expert, so instead of figuring it out I just downloaded the module, dumped it into the site-packages folder (as suggested in the prettytable installation instructions under “The Harder Way” – but it wasn’t hard at all) and unzipped it. In my script I couldn’t get the simple “import prettytable” to work without throwing an error, but when I added the name of the specific function “import PrettyTable from prettytable” it worked. Your mileage may vary.

So here was my first go at it. I created a test database and loaded a table of population estimates from the US Census Bureau into it (you can download it if you want to experiment):

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

x = PrettyTable(col_names)
x.align[col_names[1]] = "l"
x.align[col_names[2]] = "r"
for row in rows:

print (x)
tabstring = x.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()


The first piece is the standard SQLite piece – connect, activate a cursor, and execute a SQL statement. Here I’m grabbing three columns from the table for records that represent Northeastern states (Region 1). I read in the names of the columns from the first row into the col_names list, and I grab everything else and dump them into rows, a list that contains a tuple for each record:

>>> col_names
['State', 'Name', 'Est2010']
>>> rows
[('09', 'Connecticut', 3574097), ('23', 'Maine', 1328361), ('25', 'Massachusetts', 6547629),
('33', 'New Hampshire', 1316469), ('34', 'New Jersey', 8791898), ('36', 'New York', 19378104),
('42', 'Pennsylvania', 12702379), ('44', 'Rhode Island', 1052567), ('50', 'Vermont', 625741)]
>>>


The second piece will make sense after you have a quick look at the prettytable tutorial. Here I grab the list of columns names and specify how cells for the columns should be aligned (default is center) and padded (default is one space). Then I add each row from the nested list of tuples to the table, row by row. There are two outputs: print directly to the screen, and dump the whole table into a string. That string can then be dumped into a text file, along with a title. Here’s the screen output:

+-------+---------------+----------+
| State | Name          |  Est2010 |
+-------+---------------+----------+
|   09  | Connecticut   |  3574097 |
|   23  | Maine         |  1328361 |
|   25  | Massachusetts |  6547629 |
|   33  | New Hampshire |  1316469 |
|   34  | New Jersey    |  8791898 |
|   36  | New York      | 19378104 |
|   42  | Pennsylvania  | 12702379 |
|   44  | Rhode Island  |  1052567 |
|   50  | Vermont       |   625741 |
+-------+---------------+----------+


The one hangup I had was the formatting for the numbers: I really want some commas in there since the values are so large. I couldn’t figure out how to do this using the approach above – I’m writing all the rows in one swoop, and couldn’t step in and and format the last value for each row.

Unless – instead of constructing the table by rows, I construct it by columns. Here’s my second go at it:

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

y=PrettyTable()
y.align[col_names[1]]="l"
y.align[col_names[2]]="r"

print(y)
tabstring = y.get_string()

output=open("export.txt","w")
output.write("Population Data"+"\n")
output.write(tabstring)
output.close()

conn.close()


To add by column, you don’t provide any arguments to the PrettyTable function. You just add the columns one by one: here I call the appropriate values using the index, first for the column name and then for all of the values from the rows that are in the same position. For the last value (the population estimate) I use format to display the value like a decimal number (this works in Python 3.1+ – for earlier versions there’s a similar command – see this post for details). I tried this in my first example but I couldn’t get the format to stick, or got an error. Since I’m specifically calling these row values and then writing them I was able to get it to work in this second example. In this version the alignment specifications have to come last. Here’s the result:

+-------+---------------+------------+
| State | Name          |    Est2010 |
+-------+---------------+------------+
|   09  | Connecticut   |  3,574,097 |
|   23  | Maine         |  1,328,361 |
|   25  | Massachusetts |  6,547,629 |
|   33  | New Hampshire |  1,316,469 |
|   34  | New Jersey    |  8,791,898 |
|   36  | New York      | 19,378,104 |
|   42  | Pennsylvania  | 12,702,379 |
|   44  | Rhode Island  |  1,052,567 |
|   50  | Vermont       |    625,741 |
+-------+---------------+------------+


prettytable gives you a few other options, like the ability to sort records by a certain column or to return only the first “n” records from a table. In this example, since we’re pulling the data from a database we could (and did) specify sorting and other constraints in the SQL statement instead. prettytable also gives you the option of exporting the table as HTML, which can certainly come in handy.

### Altering Tables in SQLite / Spatialite

Thursday, February 7th, 2013

In building the Spatialite geodatabase (see previous post), one of the fundamental things I learned was how to manage table creation and alteration in SQLite, which was quite different from my previous experience working with MS Access and ArcGIS.

In SQLite, the ALTER TABLE statement is limited to changing the name of a table or adding new columns. If you want to make any other changes, the process is: create a new, blank table that’s structured the way you want, and then write an INSERT statement to copy the data you want from the existing table into the new table. So if you have a table that has a bunch of columns you want to drop or change, create a new table that has your desired structure, then insert what you want into that new table. The same thing goes for primary keys or other constraints. If your existing table doesn’t have a specified key you can’t alter it by specifying one: create a new table that does, then copy your data over.

Let’s say we have this table (de_data) with basic population data for Delaware’s three counties:

USPS	GEOID	NAME			POP10	HU10	ALAND_SQMI	AWATER_SQMI
DE	10001	Kent County		162310	65338	586.179		212.152
DE	10003	New Castle County	538479	217511	426.286		67.717
DE	10005	Sussex County		197145	123036	936.079		260.312


And let’s say that we want to change some of the column names, drop the field for housing units, and specify our data types and GEOID as our key. First, create the table:

CREATE TABLE de_pop (STATE TEXT, GEOID TEXT NOT NULL PRIMARY KEY,
COUNTY TEXT, POP10 INTEGER, ALAND REAL, AWATER REAL)

GEOID is the FIPS/ANSI code that uniquely identifies each state. Since these codes may have leading zeros (the codes for all states from AL through CT do), we designate it as text.

Second, insert the data we want from the existing table:

INSERT INTO de_pop (STATE, GEOID, COUNTY, POP10, ALAND, AWATER)
SELECT USPS, GEOID, NAME, POP10, ALAND_SQMI, AWATER_SQMI
FROM de_data

Order here matters – it’s going to insert columns from the original table into the new one in sequence: USPS into STATE, GEOID into GEOID, etc. Sometimes it’s possible to use an * as a shortcut to insert and copy everything, instead of listing every field, if both tables contain the same number of columns and they’re in the right order. But this is always a bit risky.

Lastly, if we don’t need that original table we could delete it:

DROP TABLE de_data

SELECT * FROM de_pop

STATE	GEOID	COUNTY			POP10	ALAND	AWATER
DE	10001	Kent County		162310	586.179	212.152
DE	10003	New Castle County	538479	426.286	67.717
DE	10005	Sussex County		197145	936.079	260.312


The process is similar if we want to take a query or view and turn it into a permanent table. SQLite does support CREATE TABLE AS, followed by a select query, so you can create a table out of a query. However – this is usually NOT the best course of action. If you do this, you won’t be able to specify a primary key for the new table (you really never want a table that lacks a key). Furthermore, if you create a new, calculated field you won’t be able to specify a type for it. This is particularly a problem if you’re working in Spatialite and want to create a spatial view or table that you want to join to some features and map in QGIS. If no type is specified, QGIS won’t know how to handle that new field, and it will be unviewable.

So, we take the same approach as before. Let’s say we want to create a table with population density as a calculated field. First, create the table:

CREATE TABLE de_popdens (STATE TEXT, GEOID TEXT NOT NULL PRIMARY KEY,
COUNTY TEXT, POP10 INTEGER, ALAND REAL, AWATER REAL, POPDENS REAL)

Then we write our insert statement, and in the insert we create the calculated field:

INSERT INTO de_popdens (STATE, GEOID, COUNTY, POP10, ALAND, AWATER, POPDENS)
SELECT STATE, GEOID, COUNTY, POP10, ALAND, AWATER, (ROUND(POP10/ALAND),1)
FROM de_pop

SELECT * FROM de_density:

STATE	GEOID	COUNTY			POP10	ALAND	AWATER	POPDENS
DE	10001	Kent County		162310	586.179	212.152	276.9
DE	10003	New Castle County	538479	426.286	67.717	1263.2
DE	10005	Sussex County		197145	936.079	260.312	210.6


This gives us a new table with density properly typed. Once again, this is only necessary if you want the calculated field to be permanent and function with other objects in the database, and particularly if you want to join this table to a geodatabase feature or shapefile to map the attributes. If you simply want the new field to answer a specific question or to export the data to output, you can just do a SELECT query and save it as a view.

I also had to do this procedure for every table and shapefile that I imported to the geodatabase. In the Spatialite GUI you don’t have the option to specify a primary key or data types for columns when you do an import; for the latter it takes its best guess. So to get it well-structured, I imported (or created a virtual link), created a new table, inserted the data over, then deleted the original table or severed the link. If it was a shapefile I went through the extra step of activating the geometry (check out the Spatialite Cookbook or the tutorial I wrote for the NYC geodatabase for details).

Since I was dealing with some enormous tables with hundreds of columns, I used some trickery to avoid typing all the statements by hand. If the data was small enough and came from a spreadsheet, I used a series of concatenate formulas to build the CREATE TABLE and INSERT statements by copying the field names and stringing them together with type names and necessary syntax, so I could just copy and paste a statement into the SQL dialog box. For larger datasets I used Python to do the processing, and had Python grab field names and write statements that I could copy and paste.

The import issue was particular to the Spatialite GUI, and not SQLite in general. If you’re dealing with just data tables and use the SQLite Manager (Firefox plugin), it asks you to specify column names, keys and constraints, and types for the columns you’re importing. It does the latter by having you select from a dropdown box for each column – this works fine if you have 10 or 20 columns, but it’s rather tedious if you have hundreds. The Manager also gives you the ability to alter additional elements of the table, like column names, by essentially performing the same operations (create table, insert records, drop table) behind the scenes while holding things temporarily in memory. It prefaces this with a warning message that this operation is not part of the standard SQLite commands, and there’s a chance something could go awry.

### NYC Geodatabase in Spatialite

Wednesday, February 6th, 2013

I spent much of the fall semester and winter interim compiling and creating the NYC geodatabase (nyc_gdb), a desktop geodatabase resource for doing basic mapping and analysis at a neighborhood level – PUMAs, ZIP Codes / ZCTAs, and census tracts. There were several motivations for doing this. First and foremost, as someone who is constantly introducing new people to GIS it’s a pain sending people to a half dozen different websites to download shapefiles and process basic features and data before actually doing a project. By creating this resource I hoped to lower the hurdles a bit for newcomers; eventually they still need to learn about the original sources and data processing, but this gives them a chance to experiment and see the possibilities of GIS before getting into nitty gritty details.

Second, for people who are already familiar with GIS and who have various projects to work on (like me) this saves a lot of duplicated effort, as the db provides a foundation to build on and saves the trouble of starting from scratch each time.

Third, it gave me something new to learn and will allow me to build a second part to my open source GIS workshops. I finally sat down and hammered away with Spatialite (went through the Spatialite Cookbook from start to finish) and learned spatial SQL, so I could offer a resource that’s open source and will compliment my QGIS workshop. I was familiar with the Access personal geodatabases in ArcGIS, but for the most part these serve as simple containers. With the ability to run all the spatial SQL operations, Spatialite expands QGIS functionality, which was something I was really looking for.

My original hope was to create a server-based PostGIS database, but at this point I’m not set up to do that on my campus. I figured Spatialite was a good alternative – the basic operations and spatial SQL commands are relatively the same, and I figured I could eventually scale up to PostGIS when the time comes.

I also created an identical, MS Access version of the database for ArcGIS users. Once I got my features in Spatialite I exported them all out as shapefiles and imported them all via ArcCatalog – not too arduous as I don’t have a ton of features. I used the SQLite ODBC driver to import all of my data tables from SQLite into Access – that went flawlessly and was a real time saver; it just took a little bit of time to figure out how to set up (but this blog post helped).

The databases are focused on NYC features and resources, since that’s what my user base is primarily interested in. I purposefully used the Census TIGER files as the base, so that if people wanted to expand the features to the broader region they easily could. I spent a good deal of time creating generalized layers, so that users would have the primary water / coastline and large parks and wildlife areas as reference features for thematic maps, without having every single pond and patch of grass to clutter things up. I took several features (schools, subway stations, etc) from the City and the MTA that were stored in tables and converted them to point features so they’re readily useable.

Given that focus, it’s primarily of interest to NYC folks, but I figured it may be useful for others who wish to experiment with Spatialite. I assumed that most people who would be interested in the database would not be familiar with this format, so I wrote a tutorial that covers the database and it’s features, how to add and map data in QGIS, how to work with the data and do SQL / spatial SQL in the Spatialite GUI, and how to map data in ArcGIS using the Access Geodb. It’s Creative Commons, Attribution, Non-Commercial, Share-alike, so feel free to give it a try.

I spent a good amount of time building a process rather than just a product, so I’ll be able to update the db twice a year, as city features (schools, libraries, hospitals, transit) change and new census data (American Community Survey, ZIP Business Patterns) is released. Many of the Census features, as well as the 2010 Census data, will be static until 2020.

### Calculated Fields in SpatiaLite / SQLite

Wednesday, February 3rd, 2010

After downloading data, it’s pretty common that you’ll want to create calculated fields, such as percent totals or change, to use for analysis and mapping. The next step in my QGIS / SpatiaLite experiment was to create a calculated field (aka derived field). I’ll run through three ways of accomplishing this, using my subway commuter data to calculate the percentage of workers in each NYC PUMA who commute to work. Just to keep everything straight:

• sub_commuters is a census data table for all PUMAs in NY State
• [SUBWAY] field that has the labor force that commutes by subway
• [WORKERS_16] field with the total labor force
• [SUB_PER] a calculated field with the % of labor force that commutes by subway
• [GEO_ID2] the primary key field, FIPS code that is the unqiue identifier
• nyc_pumas is a feature class with all PUMAs in NYC
• [PUMA5ID00] is the primary key field, FIPS code that is the unqiue identifier
• pumas_nyc_subcom is the data table that results from joining sub_commuters and nyc_pumas; it can be converted to a feature class for mapping

The first method would be to add the calculated field to the data after downloading it from the census in a spreadsheet, as part of the cleaning / preparation stage. You could then save it as a delimited text file for import to SpatiaLite. No magic there, so I’ll skip to the second method.

### SpatiaLite

The second method would be to create the calculated field in the SpatiaLite database. I’ll go through the steps I used to figure this out. The basic SQL select query:

SELECT *, (SUBWAY / WORKERS_16) AS SUB_PER FROM sub_commuters

This gives us the proper result, but there are two problems. First, the data in my SUBWAY and WORKERS_16 field are stored as integers, and when you divide the result is rounded to the nearest whole number. Not very helpful here, as my percentage results get rounded to 0 or 1. There are many ways to work around this: set the numeric fields as double, real, or float in the spreadsheet before import (didn’t work for me), specify the field types when importing (didn’t get that option with the SpatiaLite GUI, but maybe you can with the command line), add * 100 to the expression to multiply the percentage to a whole number (ok unless you need decimals in your result) or use the CAST operator. CAST converts the current data type of a field to a specified data type in the result of the expression. So:

SELECT *, (CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL)) AS SUB_PER FROM sub_commuters

This gave me the percentages with several decimal places (since we’re casting the fields as real instead of integer), which is what I needed. The second problem is that this query just produces a temporary view; in order to map this data, we need to create a new table to make the calculated field permanent and join it to a feature class. Here’s how we do that:

CREATE TABLE pumas_nyc_subcom AS
SELECT *, (CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL)) AS SUB_PER
FROM sub_commuters, nyc_pumas
WHERE nyc_pumas.PUMA5ID00=sub_commuters.geo_id2

The CREATE TABLE AS statement let’s us create a new table from the existing two tables – the data table of subway commuters and the feature class table for NYC PUMAs. We select all the fields in both while throwing in the new calculated field, and we join the data table to the feature class all in one step, and via the join we end up with just data from NYC (the data for the rest of the state gets dropped). After that, it’s just a matter of taking our new table and enabling the geometry to make it a feature class (as explained in the previous post).

This seems like it should work – but I discovered another problem. The resulting calculated field that has the percentage of subway commuters per PUMA, SUB_PER, has no data type associated with it. Looking at the schema for the table in SpatiaLite shows that the data type is blank. If I bring this into QGIS, I’m not able to map this field as a numeric value, because QGIS doesn’t know what it is. I have to define the data type for this field. SpatiaLite (SQLite really) doesn’t allow you to re-define an existing field – we have to create and define a new blank field, and the set the value of our calculated field equal to it. Here are the SQL statements to make it all happen:

ALTER TABLE sub_commuters ADD SUB_PER REAL

UPDATE sub_commuters SET SUB_PER=(CAST (SUBWAY AS REAL)/ CAST(WORKERS_16 AS REAL))

CREATE TABLE pumas_nyc_subcom AS
SELECT * FROM sub_commuters, nyc_pumas
WHERE nyc_pumas.PUMA5ID00=sub_commuters.geo_id2

So, we add a new blank field to our data table and define it as real. Then we update our data table by seting that blank field equal to our expression, thus filling the field with the result of our expression. Once we have the defined calculated field, we can create a new table from the data plus the features based on the ID they share in common. Once the table is created, then we can activate the geometry (right click on geometry field in the feature class and activate – see previous post for details) so we can map it in QGIS. Phew!

### QGIS

The third method is to create the calculated field within QGIS, using the new field calculator. It’s pretty easy to do – you select the layer in the table of contents and go into an edit mode. Open the attribute table for the features and click the last button in the row of buttons underneath the table – this is the field calculator button. Once we’re in the field calculator window, we can choose to update an existing field or create a new field. We give the output field a name and a data type, enter our expression SUBWAY / WORKERS_16, hit OK, and we have our new field. Save the edits and we should be good to go. HOWEVER – I wasn’t able to add a calculated fields to features in a SpatiaLite geodatabase without getting errors. I posted to the QGIS forum – initially it was thought that the SpatiaLite driver was read only, but it turns out that’s not the case and so and the developers are investigating a possible bug. The investigation continues – stay tuned. I have tried the field calculator with shapefiles and it works perfectly (incidentally, you can export SpatiaLite features out of the database as shapefiles).

I’m providing the database I created here for download, if anyone wants to experiment.

### Open Source GIS Wrap-up

Tuesday, September 30th, 2008

I’ve been on an open source GIS tear this month, so in this post I’ll wrap up some odds and ends:

• There is a project called Sextante, which is essentially an open source ArcToolbox for gvSIG. It adds a lot of geoprocessing and analysis functions and is pretty easy to install. There are 200 + tools in the box, but for some reason not all of them are active. I’m not sure why this is the case, but haven’t poked around much to find out.
• There are also a number of extra plugins for QGIS that are available through the QGIS wiki under PluginRepository; they include plugins that add more symbolization and that make table joins possible. Haven’t had a chance to try this yet either, but it sounds like these extras could make QGIS a lot more viable as a thematic mapping option.
• I found out about the QGIS plugins from this article, which offers a good overview of QGIS. The article also discusses one of the other shortcomings of open source GIS – the lack of a support for a simple, desktop geodatabase similar to the Microsoft Access personal geodatabases. PostGIS is certainly powerful and there has been a lot written about it, but a server based geodatabase is not always the best solution, particularly for small, stand-alone projects. There is a cool project called Spatiallite, where someone has created geographically enabled SQLite databases (which are small, stand alone dbs). You can export shapefiles to them, or simply view and edit the attributes in a shapefile via a virtual connection. Based on what I’ve looked at thus far, you can access SQlite databases directly in GRASS and when using GRASS datasets via QGIS, but I haven’t been able to connect to a SQlite db with the other software I’ve looked at – it’s just not supported yet.
• In researching open source GIS, I’ve looked at a book specifically on GRASS, Open Source GIS: A Grass Approach, as well as two books on web mapping (GIS for Web Developers: Adding ‘Where’ to Your Web Applications and Web Mapping Illustrated: Using Open Source GIS Toolkits)which cover GDAL and OGR, QGIS, GIS servers, PostGIS and PostgreSQL, and a few other tools. There is a book that’s recently been published that focusses specifically on Open Source Desktop GIS – Desktop GIS: Mapping the Planet with Open Source Tools. I pre-ordered a copy on Amazon that was supposed to ship in Mid September, but is now being delayed until late October. Based on the table of contents it looks pretty thorough and covers many of the choices I listed in my previous post, and I’m looking forward to its arrival.