Mel Likes Maps | Spatial Queries with DuckDB + R

Spatial Queries with DuckDB + R

Sunday, November 24, 2024

Now that I understand the basics of getting data into and out of a DuckDB file, I want to add spatial queries to the repertoire. You can find my blog post on creating and reading DuckDB data here: Duck DB + GIS. Spatial queries are a pretty common task in GIS. For example, you might want to find all of the restaurants within 5 miles of your hotel during a vacation. Spatial queries understand how locations are related to each other such as distance between two points or whether or not a point is inside of a specific area. With that in mind, lets explore how to perform spatial queries using R and DuckDB. We’ll work with the same real-world data from the first post: New York State counties and DEC campgrounds.

Some things to know ahead of time

Here are the libraries I am using in this project.

library(duckdb) # Connect and query a duckdb file
library(sf) # simple features for working with spatial data in r
library(ggplot2) # for making the map

This code can be found in the script spatial_queries.R. Copies of data and scripts are available on GitHub.

Let’s connect to the database and make sure the spatial extension is installed and loaded.

# Connect to the database
con <- dbConnect(duckdb(), dbdir = "spatial_db.duckdb")

# Install and load the spatial extension
dbExecute(con,
          "INSTALL spatial;
           LOAD spatial;")

Exploring the Data

Before diving into spatial queries, it’s important to understand our data structure. Let’s examine our tables:

# List all tables in the database
dbListTables(con)

# Get table schemas
dbGetQuery(con, "DESCRIBE counties_polygon")
dbGetQuery(con, "DESCRIBE points")

This database has two tables: counties_polygon and points. It is important to note that the counties_polygon table geometry is stored as text, in the Well Known Text (WKT) format, while the points table is stored as a geometry. That will change how we write our query.

> dbGetQuery(con, "DESCRIBE counties_polygon")
   column_name column_type null  key default extra
1         NAME     VARCHAR  YES <NA>    <NA>  <NA>
2       ABBREV     VARCHAR  YES <NA>    <NA>  <NA>
3      GNIS_ID     VARCHAR  YES <NA>    <NA>  <NA>
4    FIPS_CODE     VARCHAR  YES <NA>    <NA>  <NA>
5         SWIS     VARCHAR  YES <NA>    <NA>  <NA>
6    NYSP_ZONE     VARCHAR  YES <NA>    <NA>  <NA>
7      POP1990      DOUBLE  YES <NA>    <NA>  <NA>
8      POP2000      DOUBLE  YES <NA>    <NA>  <NA>
9      POP2010      DOUBLE  YES <NA>    <NA>  <NA>
10     POP2020      DOUBLE  YES <NA>    <NA>  <NA>
11      DOS_LL     VARCHAR  YES <NA>    <NA>  <NA>
12  DOSLL_DATE        DATE  YES <NA>    <NA>  <NA>
13         NYC     VARCHAR  YES <NA>    <NA>  <NA>
14  CALC_SQ_MI      DOUBLE  YES <NA>    <NA>  <NA>
15     DATEMOD        DATE  YES <NA>    <NA>  <NA>
16    geometry     VARCHAR  YES <NA>    <NA>  <NA>

> dbGetQuery(con, "DESCRIBE points")
  column_name column_type null  key default extra
1    OBJECTID     INTEGER  YES <NA>    <NA>  <NA>
2    FACILITY     VARCHAR  YES <NA>    <NA>  <NA>
3       ASSET     VARCHAR  YES <NA>    <NA>  <NA>
4  PUBLIC_USE     VARCHAR  YES <NA>    <NA>  <NA>
5        geom    GEOMETRY  YES <NA>    <NA>  <NA>

Looking at the row data for the points table we can see that there are multiple points at each campground. We only need one point per campground for this analysis.

Let’s look at the geometry column from the counties data in a little more detail.

check_polygons <- dbGetQuery(con,
                             "SELECT *
                FROM counties_polygon p
                limit 10")
                
> check_polygons |> dplyr::select(NAME, geometry) |> dplyr::glimpse()
Rows: 10
Columns: 2
$ NAME     <chr> "Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "Cayuga", "Chautauq…
$ geometry <chr> "MULTIPOLYGON (((608219.1 4737612, 608196.7 4737614, 608166.1 4737617, 6081…

Now let’s look at the points data. There are multiple points at each campground. Assuming that none of campgrounds cross county boundaries, we only need to use the first point. Running the spatial query on all 3,000+ points in the full dataset will needlessly slow down the process.

> check_points |> dplyr::select(-geom)
   OBJECTID                     FACILITY                ASSET PUBLIC_USE
1   4275259 KENNETH L. WILSON CAMPGROUND              PARKING        Yes
2   4275261 KENNETH L. WILSON CAMPGROUND RECREATION STRUCTURE        Yes
3   4275262 KENNETH L. WILSON CAMPGROUND               SPIGOT        Yes
4   4275263 KENNETH L. WILSON CAMPGROUND               SPIGOT        Yes
5   4275268 KENNETH L. WILSON CAMPGROUND               SPIGOT        Yes
6   4275269 KENNETH L. WILSON CAMPGROUND                  AED        Yes
7   4275270 KENNETH L. WILSON CAMPGROUND              PARKING        Yes
8   4275277       CAROGA LAKE CAMPGROUND          PICNIC AREA        Yes
9   4275278       CAROGA LAKE CAMPGROUND      SHOWER BUILDING        Yes
10  4275279       CAROGA LAKE CAMPGROUND                  AED        Yes

Writing Spatial Queries

Now for the interesting part! Let’s write a spatial query to find all counties that contain campground points. This query uses an INNER JOIN to match the county polygon to the campground point using ST_Contains. The subquery ranked selects the first point from each group of points based on the facility.

SELECT DISTINCT p.*
FROM counties_polygon p
INNER JOIN (
    SELECT
        FACILITY,
        geom
    FROM (
        SELECT *,
        ROW_NUMBER() OVER (PARTITION BY FACILITY ORDER BY FACILITY) as rn
        FROM points
    ) ranked
    WHERE rn = 1
) pt
  ON ST_Contains(
    ST_GeomFromText(p.geometry),
    pt.geom
  )
county_df <- dbGetQuery(con, county_query)

Let’s break down what this query does:

We start with the counties_polygon table
We join it to the campground points table with a subquery that selects unique facilities (campgrounds)
ST_Contains will return TRUE when the first geometry contains the second, in this case when counties_polygon contains the points geometry.
ST_GeomFromText converts the WKT geometry to a spatial object. This was necessary because the polygon geometry was stored as text.

Converting Query Results to SF Objects

To visualize our results, we need to convert the DuckDB query results into spatial objects that R can understand. Since the counties_polygon table isn’t in the standard sf geometry format, we need to define what format it is in - in this case WKT in the geometry. We also need to define the coordinate reference system (CRS) of the data and change it to the reference that we want. Both datasets we are using are in NAD83 UTM Zone 18N (26918) but we want to change the CRS to be WGS84 for web mapping.

# Convert county results to sf object
county_sf <- st_as_sf(county_df, wkt = "geometry", crs = 26918)

# Transform to WGS84 for mapping
county_sf_WGS84 <- county_sf |>
  st_transform(4326)

Querying Points Data

Now that we have the selected polygons ready to map, lets select the points we want to appear. Just as before, I only want to plot one point per campground. This query shows an alternative way to get that result. We are going to treat the resulting dataframe the same way that we did for the county polygons: define the CRS as NAD83 UTM Zone 18N and transform it to WGS84.

campsite_query <- "
SELECT FACILITY, ST_AsText(FIRST (geom)) AS geometry
FROM points
GROUP BY FACILITY
"
campsite_df <- dbGetQuery(con, campsite_query)
campsite_sf <- st_as_sf(campsite_df, wkt = "geometry", crs = 26918)
campsite_sf_WGS84 <- campsite_sf |>
  st_transform(4326)

Creating the Visualization

There is one last item I want to add to the map - county boundaries for the whole state.

# Statewide county boundaries
all_counties_query <- "
SELECT DISTINCT p.*
FROM counties_polygon p"

all_counties_df <- dbGetQuery(con, all_counties_query)
all_counties_sf <- st_as_sf(all_counties_df, wkt = "geometry", crs = 26918)
all_counties_sf_WGS84 <- all_counties_sf |>
  st_transform(4326)

I’m going to use ggplot to put it all together.

ggplot() +
  # Add base map tiles
  ggspatial::annotation_map_tile(type = "cartolight", zoom = 6) +
  # Add all counties as background
  geom_sf(data = all_counties_sf_WGS84,
          fill = "white",
          alpha = 0.1,
          color = "darkgrey") +
  # Add counties with campgrounds
  geom_sf(data = county_sf_WGS84,
          fill = "lightgreen",
          alpha = 0.2,
          color = "darkgrey") +
  # Add campground points
  geom_sf(data = campsite_sf_WGS84,
          fill = "darkblue",
          color = "white",
          size = 3,
          shape = 21) +
  # Add title and caption
  labs(title = "Counties in NY State that contain DEC campgrounds",
       caption = "Data source: DEC") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    plot.caption = element_text(hjust = 0, size = 8, color = "grey50")
  )

# Save the plot
ggsave("ny_counties_campgrounds.png", width = 8, height = 6, dpi = 300)

The resulting map highlights the results of our spatial query.

#GIS #rstats #duckdb