Spatial Queries with DuckDB + R
Sunday, November 24, 2024
Now that I understand the basics of getting data into and out of a DuckDB file, I want to add spatial queries to the repertoire. You can find my blog post on creating and reading DuckDB data here: Duck DB + GIS. Spatial queries are a pretty common task in GIS. For example, you might want to find all of the restaurants within 5 miles of your hotel during a vacation. Spatial queries understand how locations are related to each other such as distance between two points or whether or not a point is inside of a specific area. With that in mind, lets explore how to perform spatial queries using R and DuckDB. We’ll work with the same real-world data from the first post: New York State counties and DEC campgrounds.
Some things to know ahead of time
Here are the libraries I am using in this project.
library(duckdb) # Connect and query a duckdb file
library(sf) # simple features for working with spatial data in r
library(ggplot2) # for making the map
This code can be found in the script spatial_queries.R. Copies of data and scripts are available on GitHub.
Let’s connect to the database and make sure the spatial extension is installed and loaded.
# Connect to the database
con <- dbConnect(duckdb(), dbdir = "spatial_db.duckdb")
# Install and load the spatial extension
dbExecute(con,
"INSTALL spatial;
LOAD spatial;")
Exploring the Data
Before diving into spatial queries, it’s important to understand our data structure. Let’s examine our tables:
# List all tables in the database
dbListTables(con)
# Get table schemas
dbGetQuery(con, "DESCRIBE counties_polygon")
dbGetQuery(con, "DESCRIBE points")
This database has two tables: counties_polygon
and points
. It is important to note that the counties_polygon
table geometry is stored as text, in the Well Known Text (WKT) format, while the points
table is stored as a geometry. That will change how we write our query.
> dbGetQuery(con, "DESCRIBE counties_polygon")
column_name column_type null key default extra
1 NAME VARCHAR YES <NA> <NA> <NA>
2 ABBREV VARCHAR YES <NA> <NA> <NA>
3 GNIS_ID VARCHAR YES <NA> <NA> <NA>
4 FIPS_CODE VARCHAR YES <NA> <NA> <NA>
5 SWIS VARCHAR YES <NA> <NA> <NA>
6 NYSP_ZONE VARCHAR YES <NA> <NA> <NA>
7 POP1990 DOUBLE YES <NA> <NA> <NA>
8 POP2000 DOUBLE YES <NA> <NA> <NA>
9 POP2010 DOUBLE YES <NA> <NA> <NA>
10 POP2020 DOUBLE YES <NA> <NA> <NA>
11 DOS_LL VARCHAR YES <NA> <NA> <NA>
12 DOSLL_DATE DATE YES <NA> <NA> <NA>
13 NYC VARCHAR YES <NA> <NA> <NA>
14 CALC_SQ_MI DOUBLE YES <NA> <NA> <NA>
15 DATEMOD DATE YES <NA> <NA> <NA>
16 geometry VARCHAR YES <NA> <NA> <NA>
> dbGetQuery(con, "DESCRIBE points")
column_name column_type null key default extra
1 OBJECTID INTEGER YES <NA> <NA> <NA>
2 FACILITY VARCHAR YES <NA> <NA> <NA>
3 ASSET VARCHAR YES <NA> <NA> <NA>
4 PUBLIC_USE VARCHAR YES <NA> <NA> <NA>
5 geom GEOMETRY YES <NA> <NA> <NA>
Looking at the row data for the points
table we can see that there are multiple points at each campground. We only need one point per campground for this analysis.
Let’s look at the geometry column from the counties data in a little more detail.
check_polygons <- dbGetQuery(con,
"SELECT *
FROM counties_polygon p
limit 10")
> check_polygons |> dplyr::select(NAME, geometry) |> dplyr::glimpse()
Rows: 10
Columns: 2
$ NAME <chr> "Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "Cayuga", "Chautauq…
$ geometry <chr> "MULTIPOLYGON (((608219.1 4737612, 608196.7 4737614, 608166.1 4737617, 6081…
Now let’s look at the points data. There are multiple points at each campground. Assuming that none of campgrounds cross county boundaries, we only need to use the first point. Running the spatial query on all 3,000+ points in the full dataset will needlessly slow down the process.
> check_points |> dplyr::select(-geom)
OBJECTID FACILITY ASSET PUBLIC_USE
1 4275259 KENNETH L. WILSON CAMPGROUND PARKING Yes
2 4275261 KENNETH L. WILSON CAMPGROUND RECREATION STRUCTURE Yes
3 4275262 KENNETH L. WILSON CAMPGROUND SPIGOT Yes
4 4275263 KENNETH L. WILSON CAMPGROUND SPIGOT Yes
5 4275268 KENNETH L. WILSON CAMPGROUND SPIGOT Yes
6 4275269 KENNETH L. WILSON CAMPGROUND AED Yes
7 4275270 KENNETH L. WILSON CAMPGROUND PARKING Yes
8 4275277 CAROGA LAKE CAMPGROUND PICNIC AREA Yes
9 4275278 CAROGA LAKE CAMPGROUND SHOWER BUILDING Yes
10 4275279 CAROGA LAKE CAMPGROUND AED Yes
Writing Spatial Queries
Now for the interesting part! Let’s write a spatial query to find all counties that contain campground points. This query uses an INNER JOIN
to match the county polygon to the campground point using ST_Contains
. The subquery ranked
selects the first point from each group of points based on the facility.
SELECT DISTINCT p.*
FROM counties_polygon p
INNER JOIN (
SELECT
FACILITY,
geom
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY FACILITY ORDER BY FACILITY) as rn
FROM points
) ranked
WHERE rn = 1
) pt
ON ST_Contains(
ST_GeomFromText(p.geometry),
pt.geom
)
county_df <- dbGetQuery(con, county_query)
Let’s break down what this query does:
- We start with the
counties_polygon
table - We join it to the campground
points
table with a subquery that selects unique facilities (campgrounds) ST_Contains
will returnTRUE
when the first geometry contains the second, in this case whencounties_polygon
contains thepoints
geometry.ST_GeomFromText
converts the WKT geometry to a spatial object. This was necessary because the polygon geometry was stored as text.
Converting Query Results to SF Objects
To visualize our results, we need to convert the DuckDB query results into spatial objects that R can understand. Since the counties_polygon
table isn’t in the standard sf geometry format, we need to define what format it is in - in this case WKT in the geometry
. We also need to define the coordinate reference system (CRS) of the data and change it to the reference that we want. Both datasets we are using are in NAD83 UTM Zone 18N (26918) but we want to change the CRS to be WGS84 for web mapping.
# Convert county results to sf object
county_sf <- st_as_sf(county_df, wkt = "geometry", crs = 26918)
# Transform to WGS84 for mapping
county_sf_WGS84 <- county_sf |>
st_transform(4326)
Querying Points Data
Now that we have the selected polygons ready to map, lets select the points we want to appear. Just as before, I only want to plot one point per campground. This query shows an alternative way to get that result. We are going to treat the resulting dataframe the same way that we did for the county polygons: define the CRS as NAD83 UTM Zone 18N and transform it to WGS84.
campsite_query <- "
SELECT FACILITY, ST_AsText(FIRST (geom)) AS geometry
FROM points
GROUP BY FACILITY
"
campsite_df <- dbGetQuery(con, campsite_query)
campsite_sf <- st_as_sf(campsite_df, wkt = "geometry", crs = 26918)
campsite_sf_WGS84 <- campsite_sf |>
st_transform(4326)
Creating the Visualization
There is one last item I want to add to the map - county boundaries for the whole state.
# Statewide county boundaries
all_counties_query <- "
SELECT DISTINCT p.*
FROM counties_polygon p"
all_counties_df <- dbGetQuery(con, all_counties_query)
all_counties_sf <- st_as_sf(all_counties_df, wkt = "geometry", crs = 26918)
all_counties_sf_WGS84 <- all_counties_sf |>
st_transform(4326)
I’m going to use ggplot
to put it all together.
ggplot() +
# Add base map tiles
ggspatial::annotation_map_tile(type = "cartolight", zoom = 6) +
# Add all counties as background
geom_sf(data = all_counties_sf_WGS84,
fill = "white",
alpha = 0.1,
color = "darkgrey") +
# Add counties with campgrounds
geom_sf(data = county_sf_WGS84,
fill = "lightgreen",
alpha = 0.2,
color = "darkgrey") +
# Add campground points
geom_sf(data = campsite_sf_WGS84,
fill = "darkblue",
color = "white",
size = 3,
shape = 21) +
# Add title and caption
labs(title = "Counties in NY State that contain DEC campgrounds",
caption = "Data source: DEC") +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
plot.caption = element_text(hjust = 0, size = 8, color = "grey50")
)
# Save the plot
ggsave("ny_counties_campgrounds.png", width = 8, height = 6, dpi = 300)
The resulting map highlights the results of our spatial query.