How to make spatial joins (point in polygon)?
Various ways to make a spatial join between GIS point data and polygon data
library(R.utils)
library(rgdal)
library(tidyverse)
library(leaflet)
library(sp)
library(sf)
library(rgbif)
library(DBI)
What we want to do
In this short tutorial, we explore various options to deal with the situation where we have (1) a spatially referenced GIS file with polygons and (2) a spatially referenced set of points that overlaps with the GIS polygons.
Typically, both data sources contain information (apart from the spatial locations) that needs to be related to each other in some way. In this case study, we want to know for each point in which polygon it is located.
Get some data to work with
For the point data, we will work with data on the invasive species - Chinese mitten crab (Eriocheir sinensis) in Flanders, Belgium, from the year 2008 (GBIF.org (20th June 2019) GBIF Occurrence Download https://doi.org/10.15468/dl.decefb).
We will use convenience functions from the rgbif
package to download
the data as a zip file and to import the data as a tibble
in the R
environment.
invasive_species <- occ_download_get("0032582-190415153152247",
path = tempdir(),
overwrite = TRUE) %>%
occ_download_import() %>%
filter(!is.na(decimalLongitude), !is.na(decimalLatitude))
## Download file size: 0.01 MB
## On disk at /var/folders/c2/7_7qg3r93993m4kk3kjqgj2n76h_tn/T//RtmpYC9cBo/0032582-190415153152247.zip
We will use the European reference grid system from the European Environmental Agency as an example of a GIS vector layer (each grid cell is a polygon). The Belgian part of the grid system can be downloaded as a sqlite/spatialite database from the EEA website using the following code:
# explicitly set mode = "wb", otherwise zip file will be corrupt
download.file("https://www.eea.europa.eu/data-and-maps/data/eea-reference-grids-2/gis-files/belgium-spatialite/at_download/file", destfile = file.path(tempdir(), "Belgium_spatialite.zip"), mode = "wb")
# this will extract a file Belgium.sqlite to the temporary folder
unzip(zipfile = file.path(tempdir(), "Belgium_spatialite.zip"), exdir = tempdir())
Point in polygon with the sf
package
The spatial query can be done with the aid of the sf
package. The package has built-in
functions to read spatial data (which uses GDAL as backbone).
We will project the data to Belgian Lambert 72 (https://epsg.io/31370), because the join assumes planar coordinates.
be10grid <- read_sf(file.path(tempdir(), "Belgium.sqlite"),
layer = "be_10km") %>%
# convert to Belgian Lambert 72
st_transform(crs = 31370)
We now get a sf
object which is also a data.frame
and a tbl
:
class(be10grid)
## [1] "sf" "tbl_df" "tbl" "data.frame"
Let’s have a look at this object:
be10grid
## Simple feature collection with 580 features and 3 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -22402.56 ymin: -1449.985 xmax: 311353.3 ymax: 305932.2
## Projected CRS: Belge 1972 / Belgian Lambert 72
## # A tibble: 580 × 4
## cellcode eoforigin noforigin GEOMETRY
## * <chr> <dbl> <dbl> <POLYGON [m]>
## 1 10kmE376N318 3760000 3180000 ((-20851.02 240718.4, -21626.25 250679.5, -…
## 2 10kmE376N319 3760000 3190000 ((-21626.25 250679.5, -22402.56 260640.9, -…
## 3 10kmE377N314 3770000 3140000 ((-7781.475 201650, -8552.253 211611, 1426.…
## 4 10kmE377N315 3770000 3150000 ((-8552.253 211611, -9324.075 221572.1, 655…
## 5 10kmE377N316 3770000 3160000 ((-9324.075 221572.1, -10096.95 231533.3, -…
## 6 10kmE377N317 3770000 3170000 ((-10096.95 231533.3, -10870.87 241494.6, -…
## 7 10kmE377N318 3770000 3180000 ((-10870.87 241494.6, -11645.85 251456.1, -…
## 8 10kmE377N319 3770000 3190000 ((-11645.85 251456.1, -12421.9 261417.9, -2…
## 9 10kmE377N320 3770000 3200000 ((-12421.9 261417.9, -13199.02 271379.8, -3…
## 10 10kmE377N321 3770000 3210000 ((-13199.02 271379.8, -13977.21 281342.1, -…
## # … with 570 more rows
We can see that the spatial information resides in a GEOMETRY
list
column.
Similarly, the package has built-in functions to convert a data.frame
containing coordinates to a spatial sf
object:
invasive_spatial <- st_as_sf(invasive_species,
coords = c("decimalLongitude",
"decimalLatitude"),
crs = 4326) %>%
# convert to Lambert72
st_transform(crs = 31370)
Resulting in:
invasive_spatial
## Simple feature collection with 513 features and 43 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 35125.7 ymin: 188235.3 xmax: 153743.2 ymax: 220135
## Projected CRS: Belge 1972 / Belgian Lambert 72
## # A tibble: 513 × 44
## gbifID datasetKey occurrenceID kingdom phylum class order family genus
## * <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1146738051 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 2 1146738044 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 3 1146738029 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 4 1146738026 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 5 1146737979 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 6 1146737949 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 7 1146737902 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 8 1146737751 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 9 1146737736 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## 10 1146737717 258c9ce5-1b… INBO:VIS:007… Animal… Arthr… Mala… Deca… Pseud… Erio…
## # … with 503 more rows, and 35 more variables: species <chr>,
## # infraspecificEpithet <lgl>, taxonRank <chr>, scientificName <chr>,
## # countryCode <chr>, locality <lgl>, publishingOrgKey <chr>,
## # coordinateUncertaintyInMeters <dbl>, coordinatePrecision <lgl>,
## # elevation <lgl>, elevationAccuracy <lgl>, depth <lgl>, depthAccuracy <lgl>,
## # eventDate <dttm>, day <int>, month <int>, year <int>, taxonKey <int>,
## # speciesKey <int>, basisOfRecord <chr>, institutionCode <chr>, …
Now we are ready to make the spatial overlay. This is done with the aid
of sf::st_join
. The default join type is st_intersects
. This will
result in the same spatial overlay as sp::over
(see next section). We
join the information from the grid to the points through a left join.
See the DE-9IM topological model
for explanations about all possible spatial joins.
Note that with st_intersects
points on a polygon boundary and points
corresponding to a polygon vertex are considered to be inside the
polygon. In the case where points are joined with polygons,
st_intersects
and st_covered_by
will give the same result. The join
type st_within
can be used if we want to join only when points are
completely within (excluding polygon boundary).
invasive_be10grid_sf <- invasive_spatial %>%
st_join(be10grid,
join = st_intersects,
left = TRUE) %>%
select(-eoforigin, -noforigin) # exclude the EofOrigin NofOrigin columns
Looking at selected columns of the resulting object:
invasive_be10grid_sf %>%
select(species, eventDate, cellcode, geometry)
## Simple feature collection with 513 features and 3 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 35125.7 ymin: 188235.3 xmax: 153743.2 ymax: 220135
## Projected CRS: Belge 1972 / Belgian Lambert 72
## # A tibble: 513 × 4
## species eventDate cellcode geometry
## <chr> <dttm> <chr> <POINT [m]>
## 1 Eriocheir sinensis 2008-09-17 00:00:00 10kmE392N312 (143788.9 201487.1)
## 2 Eriocheir sinensis 2008-06-03 00:00:00 10kmE392N312 (143788.9 201487.1)
## 3 Eriocheir sinensis 2008-03-20 00:00:00 10kmE389N311 (114822.9 188235.3)
## 4 Eriocheir sinensis 2008-07-03 00:00:00 10kmE393N311 (153743.2 191634.8)
## 5 Eriocheir sinensis 2008-09-17 00:00:00 10kmE392N312 (143788.9 201487.1)
## 6 Eriocheir sinensis 2008-04-10 00:00:00 10kmE392N312 (147138.3 199035.5)
## 7 Eriocheir sinensis 2008-03-13 00:00:00 10kmE381N314 (35125.7 205808.5)
## 8 Eriocheir sinensis 2008-03-19 00:00:00 10kmE391N312 (136197.7 197300.8)
## 9 Eriocheir sinensis 2008-03-19 00:00:00 10kmE391N312 (136197.7 197300.8)
## 10 Eriocheir sinensis 2008-10-28 00:00:00 10kmE389N311 (121593.8 190203.2)
## # … with 503 more rows
Point in polygon with the sp
package
General note: migration to the more actively developed sf
package is
currently advised by the sp
maintainer. The sp
package is still
maintained in order to support the newest versions of the GDAL and PROJ
backends.
Instead of sf
objects (= data.frames
or tibbles
with a geometry
list-column), the sp
package works with Spatial
spatial data classes
(which has many derived spatial data classes for points, polygons, …).
First, we need to convert the data.frame
with point locations to a
SpatialPointsDataFrame
. We also need to ensure that the coordinate
reference system (CRS) for both the point locations and the grid is the
same. The data from GBIF are in WGS84 format.
crs_wgs84 <- CRS(SRS_string = "EPSG:4326")
coord <- invasive_species %>%
select(decimalLongitude, decimalLatitude)
invasive_spatial <- SpatialPointsDataFrame(coord,
data = invasive_species,
proj4string = crs_wgs84)
The sp
package has no native methods to read the Belgium 10 km x 10 km
grid, but we can use rgdal::readOGR
to connect with the
sqlite/spatialite database and extract the Belgium 10 km x 10 km grid as
a SpatialPolygonsDataFrame
. Apart from the 10 km x 10 km grid, the
database also contains 1 km x 1 km and 100 km x 100 km grids as raster
or vector files.
be10grid <- readOGR(dsn = file.path(tempdir(), "Belgium.sqlite"),
layer = "be_10km")
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS
## = dumpSRS, : Discarded datum European_Terrestrial_Reference_System_1989 in
## Proj4 definition: +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000
## +ellps=GRS80 +units=m +no_defs
## OGR data source with driver: SQLite
## Source: "/private/var/folders/c2/7_7qg3r93993m4kk3kjqgj2n76h_tn/T/RtmpYC9cBo/Belgium.sqlite", layer: "be_10km"
## with 580 features
## It has 3 fields
Note the warning: it is because some PROJ.4 information, i.e. the string
to represent the coordinate reference system, is not supported anymore
in the current geospatial GDAL and PROJ libraries (the background
workhorses for spatial R packages). The spatialite database from the EEA
website (with the 10 km x 10 km grid) still uses the older PROJ.4 string
. Because the rgdal
package is still backwards compatible, we should
not (yet) worry about this: rgdal
does the translation for the newer
GDAL 3 and PROJ >= 6. Do know that, instead of PROJ.4 strings, the
WKT2 string is now used in R to better represent coordinate reference
systems (so it would best be incorporated in the EEA data source). Just
compare these:
# PROJ.4 string = old; used by PROJ 4
proj4string(be10grid) # or: be10grid@proj4string
## Warning in proj4string(be10grid): CRS object has comment, which is lost in
## output
## [1] "+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80 +units=m +no_defs"
# WKT2 string = new and much better; used by PROJ >= 6
wkt(be10grid) %>% cat # or: be10grid@proj4string %>% comment %>% cat
## PROJCRS["ETRS89-extended / LAEA Europe",
## BASEGEOGCRS["ETRS89",
## DATUM["European Terrestrial Reference System 1989",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4258]],
## CONVERSION["Europe Equal Area 2001",
## METHOD["Lambert Azimuthal Equal Area",
## ID["EPSG",9820]],
## PARAMETER["Latitude of natural origin",52,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",10,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["False easting",4321000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",3210000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["northing (Y)",north,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["easting (X)",east,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Statistical analysis."],
## AREA["Europe - European Union (EU) countries and candidates. Europe - onshore and offshore: Albania; Andorra; Austria; Belgium; Bosnia and Herzegovina; Bulgaria; Croatia; Cyprus; Czechia; Denmark; Estonia; Faroe Islands; Finland; France; Germany; Gibraltar; Greece; Hungary; Iceland; Ireland; Italy; Kosovo; Latvia; Liechtenstein; Lithuania; Luxembourg; Malta; Monaco; Montenegro; Netherlands; North Macedonia; Norway including Svalbard and Jan Mayen; Poland; Portugal including Madeira and Azores; Romania; San Marino; Serbia; Slovakia; Slovenia; Spain including Canary Islands; Sweden; Switzerland; Turkey; United Kingdom (UK) including Channel Islands and Isle of Man; Vatican City State."],
## BBOX[24.6,-35.58,84.17,44.83]],
## ID["EPSG",3035]]
We transform the 10 km x 10 km grid to the same CRS system:
be10grid <- spTransform(be10grid, crs_wgs84)
Now we are ready to spatially join (overlay) the
SpatialPointsDataFrame
wth the 10 km x 10 km grid. This can be done
using sp::over
. The first two arguments of the function give,
respectively, the geometry (locations) of the queries, and the layer
from which the geometries or attributes are queried. See ?sp::over
. In
this case, when x = “SpatialPoints” and y = “SpatialPolygonsDataFrame”,
it returns a data.frame
of the second argument with row entries
corresponding to the first argument.
invasive_be10grid <- over(x = invasive_spatial, y = be10grid)
invasive_species_be10grid <- bind_cols(invasive_species,
invasive_be10grid)
To see what the result looks like, we can select the most relevant variables and print it (first ten rows).
invasive_species_be10grid %>%
select(species, starts_with("decimal"),
eventDate, cellcode) %>%
head(10)
## # A tibble: 10 × 5
## species decimalLatitude decimalLongitude eventDate cellcode
## <chr> <dbl> <dbl> <dttm> <chr>
## 1 Eriocheir si… 51.1 4.28 2008-09-17 00:00:00 10kmE392N…
## 2 Eriocheir si… 51.1 4.28 2008-06-03 00:00:00 10kmE392N…
## 3 Eriocheir si… 51.0 3.87 2008-03-20 00:00:00 10kmE389N…
## 4 Eriocheir si… 51.0 4.42 2008-07-03 00:00:00 10kmE393N…
## 5 Eriocheir si… 51.1 4.28 2008-09-17 00:00:00 10kmE392N…
## 6 Eriocheir si… 51.1 4.33 2008-04-10 00:00:00 10kmE392N…
## 7 Eriocheir si… 51.2 2.73 2008-03-13 00:00:00 10kmE381N…
## 8 Eriocheir si… 51.1 4.17 2008-03-19 00:00:00 10kmE391N…
## 9 Eriocheir si… 51.1 4.17 2008-03-19 00:00:00 10kmE391N…
## 10 Eriocheir si… 51.0 3.96 2008-10-28 00:00:00 10kmE389N…
What have we done?
To wrap this up, we make a map that shows what we have done. We will use
the results obtained with the sf
package.
First, we need to transform invasive_be10grid_sf
back to WGS84 (the
background maps in leaflet are in WGS84) (be10grid
is already in WGS84
format).
invasive_be10grid_sf <- invasive_be10grid_sf %>%
st_transform(crs = 4326)
Zooming in on the point markers and hovering over a marker will show the
reference grid identifier for the grid cell as it is joined to spatial
points object invasive_be10grid
. Clicking in a grid cell will bring up
a popup showing the reference grid identifier for the grid cell as it is
named in be10grid
.
leaflet(be10grid) %>%
addTiles() %>%
addPolygons(popup = ~cellcode) %>%
addMarkers(data = invasive_be10grid_sf,
label = ~cellcode)
Note: run the code to see the interactive map.