DuckDB + GIS

I’ve been wanting to try DuckDB for a while. The lure of a serverless database is strong, but it always seemed intimidating. I recently read two articles that made me think there’s no time like the present to give it a try: DuckDB: The Indispensable Geospatial Tool You Didn’t Know You Were Missing by Chris Holmes and PostGEESE? Introducing The DuckDB Spatial Extension by Max Gabrielsson. I am going to share what I did to get points and polygons into a DuckDB database, then use that data to make maps.

Some things to know ahead of time

Here are the additional libraries I am using for this project.

library(sf)     # Simple features for working with spatial data in R
library(duckdb) # Lets you create and connect to a DuckDB file
library(glue)   # An updated version of paste(), part of the tidyverse
library(DBI)    # Lets us interact with the database using SQL

DuckDB uses a SQL dialect to interact with DuckDB files. In this workflow we pass spatial data in and out as Well Known Text (WKT), so we have to convert our geometries to that format when creating tables and again when we bring the data back into R for analysis.
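As a minimal sketch of that round trip (using a throwaway point rather than the project data):

# sf geometry -> WKT text (the direction we need when writing to DuckDB)
pt  <- st_sfc(st_point(c(-73.97, 43.07)))
wkt <- st_as_text(pt)   # "POINT (-73.97 43.07)"

# WKT text -> sf object (the direction we need when reading back into R)
df     <- data.frame(name = "example", geometry = wkt)
sf_obj <- st_as_sf(df, wkt = "geometry")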

I’ve split this project into two scripts: create_duckdb.R creates the database and adds the spatial data, and read_duckdb.R queries the database and creates sf objects in R. Copies of the data and scripts are available on GitHub.

Importing your data

I will be using R to demo this project, but DuckDB is compatible with a variety of languages. I downloaded two shapefiles from NYS GIS Resources: Campsite Amenities (points), County Boundaries (polygons). I also made a parquet version of the campsite amenities so you can see how to load those file types as well.
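For reference, here is one way such a parquet file could be produced from the original shapefile. This is only a sketch: the shapefile path, the EPSG code, and the assumption that the layer already carries OBJECTID, FACILITY, ASSET, and PUBLIC_USE fields are all mine, and the copy on GitHub may have been made differently.

# One possible way to build the campsite parquet (sketch, see caveats above)
library(sf)
library(dplyr)
library(arrow)

campsites <- st_read("data/campsite_amenities.shp") |>  # hypothetical path
  st_transform(4326)                                    # assuming lon/lat columns are wanted

coords <- st_coordinates(campsites)                     # matrix with X (lon) and Y (lat)

campsites |>
  st_drop_geometry() |>
  mutate(longitude = coords[, "X"],
         latitude  = coords[, "Y"]) |>
  select(OBJECTID, FACILITY, ASSET, PUBLIC_USE, longitude, latitude) |>
  write_parquet("data/campsite_amenities.parquet")

Whatever route you take, the columns used later in this post are OBJECTID, FACILITY, ASSET, PUBLIC_USE, longitude, and latitude.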

# Load required libraries
library(sf)
library(duckdb)

# Create a DuckDB connection, a new db will be created if it doesn't exist already
con <- dbConnect(duckdb(), dbdir = "spatial_db.duckdb")

# Make sure the spatial extension is installed and loaded
dbExecute(con,
          "INSTALL spatial;
          LOAD spatial;")

Once your database is ready, we can add some polygon data from a shapefile.

# Read the shapefile
polygon_data <- st_read("data/NYS_Civil_Boundaries/Counties_Shoreline.shp")

# Convert the sf object to a plain data frame with the geometry as WKT text
polygon_df <- data.frame(polygon_data)
polygon_df$geometry <- st_as_text(polygon_data$geometry)

# Create a table in DuckDB and load the data
dbWriteTable(con, "counties_polygon", polygon_df)
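Storing the geometry as WKT text keeps things simple, but if you would rather keep a native GEOMETRY column in DuckDB (so spatial functions don’t need ST_GeomFromText() every time), one option is to convert after loading. This is a sketch; the counties_geom table name and geom column name are placeholders I picked:

# Optional: build a second table with a native GEOMETRY column
dbExecute(con,
          "CREATE TABLE counties_geom AS
           SELECT * EXCLUDE (geometry),
                  ST_GeomFromText(geometry) AS geom
           FROM counties_polygon;")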

Now we can add the point data from a parquet file.

# Add point data from a parquet file
points_file <- here::here("data", "campsite_amenities.parquet")

# Create table in DuckDB and load data
points_qry <- glue::glue("CREATE TABLE points AS
            SELECT  OBJECTID,
                    FACILITY,
                    ASSET,
                    PUBLIC_USE,
                    ST_Point(longitude, latitude) AS geom
            FROM '{points_file}';")
dbExecute(con, points_qry)

You can verify your records were uploaded by querying the new tables.

# Verify the data was loaded correctly
result <- dbGetQuery(con, "SELECT * FROM counties_polygon LIMIT 5")
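The points table can be checked the same way; since its geom column is a native GEOMETRY, wrapping it in ST_AsText() makes the preview readable:

# Verify the point data, showing the geometry as WKT
dbGetQuery(con, "SELECT OBJECTID, FACILITY, ST_AsText(geom) AS wkt
                 FROM points
                 LIMIT 5")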

Lastly, don’t forget to close your connection when you are done.

# Don't forget to close the connection when you're done
dbDisconnect(con)
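If you also want the embedded DuckDB instance shut down, rather than just this connection closed, the duckdb package accepts a shutdown argument:

dbDisconnect(con, shutdown = TRUE)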

Querying your spatial data in R

Now that you’ve stored your data in the DuckDB file, it is time to query it so you can make some maps. Start by connecting to the database and making sure the spatial extension is loaded.

library(duckdb)
library(sf)
library(tidyverse)

con <- dbConnect(duckdb(), dbdir = "spatial_db.duckdb")

dbExecute(con,
          "INSTALL spatial;
           LOAD spatial;")

dbListTables(con)

You’ll need to know which tables are present in order to write your queries.

> dbListTables(con)
[1] "counties_polygon" "points"

Now let’s query some data. I want to find the five largest New York counties by area. After we’ve queried the database, the result needs to be converted back to an sf object.

# Query the top 5 largest counties in New York State by area
counties_qry <- "
SELECT *
FROM counties_polygon
ORDER BY ST_Area(ST_GeomFromText(geometry)) DESC
LIMIT 5
"
top_5_df <- dbGetQuery(con, counties_qry)

Here is the result of the query. Notice that the geometry field is a text field.

> glimpse(top_5_df)
Rows: 5
Columns: 16
$ NAME       <chr> "St Lawrence", "Essex", "Hamilton", "Franklin", "Delaware"
$ ABBREV     <chr> "STLA", "ESSE", "HAMI", "FRAN", "DELA"
$ GNIS_ID    <chr> "977309", "974114", "974119", "974115", "974111"
$ FIPS_CODE  <chr> "36089", "36031", "36041", "36033", "36025"
$ SWIS       <chr> "400000", "150000", "200000", "160000", "120000"
$ NYSP_ZONE  <chr> "East", "East", "East", "East", "East"
$ POP1990    <dbl> 111974, 37152, 5279, 46540, 47225
$ POP2000    <dbl> 111931, 38851, 5379, 51134, 48055
$ POP2010    <dbl> 111944, 39370, 4836, 51599, 47980
$ POP2020    <dbl> 108505, 37381, 5107, 47555, 44308
$ DOS_LL     <chr> NA, NA, NA, NA, NA
$ DOSLL_DATE <date> 1899-12-30, 1899-12-30, 1899-12-30, 1899-12-30, 1899-12-30
$ NYC        <chr> "N", "N", "N", "N", "N"
$ CALC_SQ_MI <dbl> 2760.789, 1831.572, 1806.212, 1694.793, 1466.338
$ DATEMOD    <date> 2018-11-19, 2018-07-24, 2018-11-01, 2018-10-04, 2020-05-19
$ geometry   <chr> "MULTIPOLYGON (((433110.4 4915970, 433108.2 4915968, 433071.4 4915993, 433094.9 4916023, …

We need to change this back to sf geometry.

# Convert back to sf object for mapping
top_5_sf <- st_as_sf(top_5_df, wkt = "geometry")

We can see how the geometry changes after we convert to sf.

> glimpse(top_5_sf)
Rows: 5
Columns: 16
$ NAME       <chr> "St Lawrence", "Essex", "Hamilton", "Fra…
$ ABBREV     <chr> "STLA", "ESSE", "HAMI", "FRAN", "DELA"
$ GNIS_ID    <chr> "977309", "974114", "974119", "974115", …
$ FIPS_CODE  <chr> "36089", "36031", "36041", "36033", "360…
$ SWIS       <chr> "400000", "150000", "200000", "160000", …
$ NYSP_ZONE  <chr> "East", "East", "East", "East", "East"
$ POP1990    <dbl> 111974, 37152, 5279, 46540, 47225
$ POP2000    <dbl> 111931, 38851, 5379, 51134, 48055
$ POP2010    <dbl> 111944, 39370, 4836, 51599, 47980
$ POP2020    <dbl> 108505, 37381, 5107, 47555, 44308
$ DOS_LL     <chr> NA, NA, NA, NA, NA
$ DOSLL_DATE <date> 1899-12-30, 1899-12-30, 1899-12-30, 1899…
$ NYC        <chr> "N", "N", "N", "N", "N"
$ CALC_SQ_MI <dbl> 2760.789, 1831.572, 1806.212, 1694.793, …
$ DATEMOD    <date> 2018-11-19, 2018-07-24, 2018-11-01, 2018…
$ geometry   <MULTIPOLYGON> MULTIPOLYGON (((433110.4 49..., MULTIPOL…

Now let’s look at some campsites. We’ll go through a similar process to read the point data.

# Query all of the campsites

campsite_query <- "
SELECT OBJECTID, FACILITY, ASSET, PUBLIC_USE, ST_AsText(geom) AS geometry
FROM points
"

campsite_df <- dbGetQuery(con, campsite_query)
# Convert back to sf object for mapping
campsite_sf <- st_as_sf(campsite_df, wkt = "geometry")

Notice that we have to specify how the geometry field is handled in the query: ST_AsText() returns the geometry as WKT text, which we then turn back into the simple features format in R.
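One related caveat: WKT carries no coordinate reference system, so both sf objects come back without a CRS. If you need one (for transformations, spatial joins, or combining layers on a map), set it explicitly. The EPSG codes below are assumptions for illustration only; check the metadata of your source files before relying on them.

# WKT drops the CRS, so assign it back before transforming or combining layers.
# These EPSG codes are assumptions; verify them against the source data.
top_5_sf    <- st_set_crs(top_5_sf, 26918)   # e.g. NAD83 / UTM zone 18N
campsite_sf <- st_set_crs(campsite_sf, 4269) # e.g. NAD83 geographic (lon/lat)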

Lastly, let’s quickly look at the data in a plot.

# Plot to check your work. This data is in NAD83.
ggplot() +
  geom_sf(data = top_5_sf, fill = "lightgreen") +
  geom_sf(data = campsite_sf, color = "darkblue") +
  theme_minimal() +
  labs(title = "Top 5 largest counties in New York State") +
  theme(plot.title = element_text(hjust = 0.5))

Here’s a simple map to illustrate the output.

Next up: projections, transformations, and spatial joins. Stay tuned!

#GIS #rstats #duckdb