Read data from INBO databases (SQL Server) with R • Stijn Van Hoey

Connection to INBO database

Here we provide the approach using the package DBI, which is also used by RStudio.[1]. This package enables the link between R and the (remote) database. After installation of the needed packages (install.packages(c("DBI", "glue", "tidyverse"))), the packages can be loaded:

library(DBI)
library(glue)
library(tidyverse)

To create a database connection, different approaches are available as explained in the next section:

Use the dedicated connect_inbo_dbase function in the inbodb package
Setup the required parameters yourself
Reuse existing registered connections on your computer

Inbodb function

To support the connection to INBO databases, a dedicated function is available in the inbodb package, called connect_inbo_dbase. The function provides support to connect to INBO databases and uses the Connections pane in the Rstudio interface:

odbc-connecties

To use the connect_inbo_dbase function, make sure to install the inbodb package, following the installation instructions. After a successfull installation, load the library and create a database connection:

library(inbodb)
my_connection <- connect_inbo_dbase("D0021_00_userFlora")

Once the connection is successfully established, the database can be queried.

Remark for Linux users: When working in Linux, this setup (using Trusted connection) requires an active kerberos session. More information about the setup and functionality is to be found in the tutorial on kerberos installation.

Setting up the connection yourself

In case you want to setup the connection yourself (e.g. errors using inbodb), you can do so by providing the necessary parameters:

my_connection <- DBI::dbConnect(odbc::odbc(),
      driver = "ODBC Driver 17 for SQL Server",
      server = "<server hostname or IP>",
      port = 1433,
      database = "D0021_00_userFlora", # or your database of interest
      trusted_connection = "Yes")

The most important parameters are server and database, the others should normally be kept as such (except for maybe driver). For server, keep in mind that there is more than one server: database names starting with M, S or W use another server than those starting with D. The database name is the name of the database (if you can’t remember the name, connect with a database you do know, e.g. D0021_00_userFlora and you’ll see an overview of the existing databases on that server after connecting.) The driver version depends on your computer and can be looked up using odbc::odbcListDrivers().

Use existing MSAccess connection name

When you query data from a SQL database that is already accessible using MSAccess, such a database is also accessible from R. For Windows users, the most important element is to know the so-called DSN (i.e. a registered Data Source Name). Actually, it is just the name of the database as it is known by your computer (and MS Access). The easiest way to check the DSN is to check the registered ODBC connections in the administrator tools menu.

For Dutch-speaking Windows 7 users:

> Kies in het Configuratiescherm van Windows de optie Systeembeheer > Gegevensbronnen (ODBC). De optie Systeembeheer verschijnt in de categorie Systeem en onderhoud.

You should see a list similar to the list underneath, with the names of the available DSN names enlisted:

odbc-connecties

An alternative way to check the DSN name of a database already working on with Access, is to check the DSN inside MS Access (in dutch, check menu item Koppelingsbeheer):

access-dsn

For example, the DSN name UserFlora or Cydonia-prd can be used to query these databases and extract data from it with similar queries to the one used in MSAccess. First of all, the connection with the database need to be established, by using the odbcConnect function, providing the DSN name as argument:

For Windows users:

my_connection <- odbcConnect("UserFlora")

Once the connection is successfully established, the database can be queried.

Get a complete table from the database

The function dbReadTable can be used to load an entire table from a database. For example, to extract the tblTaxon table from the flora database:

rel_taxa <- dbReadTable(my_connection, "relTaxonTaxonGroep")
head(rel_taxa) %>% knitr::kable()

ID	TaxonGroepID	TaxonID
1	4	1
2	4	2
3	4	3
4	4	4
5	4	5
6	4	6

The connection my_connection, made earlier, is used as the first argument. The table name is the second argument.

Remark: If you have no idea about the size of the table you’re trying to load from the database, this could be rather tricky and cumbersome. Hence, it is probably better to only extract a portion of the table using a query.

Execute a query to the database

The function dbGetQuery provides more flexibilty as it can be used to try any SQL-query on the database. A complete introduction to the SQL language is out of scope here. We will focus on the application and the reusage of a query.

meting <- dbGetQuery(my_connection, paste("SELECT TOP 10 * FROM dbo.tblMeting", 
                                          "WHERE COR_X IS NOT NULL"))
head(meting) %>% knitr::kable()

ID	WaarnemingID	TaxonID	MetingStatusCode	Cor_X	Cor_Y	CommentaarTaxon	CommentaarHabitat	CREATION_DATE	CREATION_USER	UPDATE_DATE	UPDATE_USER
2	21748	3909	GDGA	109948	185379	NA	NA	NA	NA	NA	NA
14	45523	3909	GDGA	127708	179454	NA	NA	NA	NA	NA	NA
15	124394	3909	GDGA	109424	192152	NA	NA	NA	NA	NA	NA
23	38561	3909	GDGA	128290	179297	NA	NA	NA	NA	NA	NA
24	126500	3909	GDGA	98714	178373	NA	NA	NA	NA	NA	NA
173	73725	3909	GDGA	102612	189891	NA	NA	NA	NA	NA	NA

Create and use query templates

When you regularly use similar queries, with some minimal alterations, you do not want to copy/paste each time the entire query. It is prone to errors and you’re script will become verbose. It is advisable to create query templates, that can be used within the dbGetQuery function.

Consider the execution of the following query. We are interested in those records with valid X and Y coordinates for the measurement, based on a given dutch name:

subset_meting <- dbGetQuery(my_connection, 
    "SELECT meet.COR_X
         , meet.Cor_Y
         , meet.MetingStatusCode
         , tax.NaamNederlands
         , tax.NaamWetenschappelijk
         , waar.IFBLHokID
    FROM  tblMeting AS meet
        LEFT JOIN tblTaxon AS tax ON tax.ID = meet.TaxonID
        LEFT JOIN tblWaarneming AS waar ON waar.ID = meet.WaarnemingID
    WHERE meet.Cor_X IS NOT NULL
        AND meet.Cor_X != 0
        AND tax.NaamNederlands LIKE 'Wilde hyacint'")
head(subset_meting) %>% knitr::kable()

COR_X	Cor_Y	MetingStatusCode	NaamNederlands	NaamWetenschappelijk	IFBLHokID
88720	208327	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	11195
24106	199925	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	9601
103111	190915	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	6990
118123	183942	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	5672
106107	182343	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	5217
105765	180785	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	4830

If we need this query regularly, but each time using a different tax.NaamNederlands (only the name changes), it is worthwhile to invest some time in the creation of a small custom function that uses this query as a template. Let’s create a function flora_records_on_dutch_name that takes a valid database connection and a given Dutch name and returns the relevant subset of the data for this query:

flora_records_on_dutch_name <- function(dbase_connection, dutch_name) {
     dbGetQuery(dbase_connection, glue_sql(
            "SELECT meet.Cor_X
                 , meet.COr_Y
                 , meet.MetingStatusCode
                 , tax.NaamNederlands
                 , tax.NaamWetenschappelijk
                 , waar.IFBLHokID
            FROM  dbo.tblMeting meet
                LEFT JOIN dbo.tblTaxon AS tax ON tax.ID = meet.TaxonID
                LEFT JOIN dbo.tblWaarneming AS waar ON waar.ID = meet.WaarnemingID
            WHERE meet.Cor_X IS NOT NULL
                AND meet.Cor_X != 0
                AND tax.NaamNederlands LIKE {dutch_name}",
            dutch_name = dutch_name,
            .con = dbase_connection))
}

Hence, instead of copy-pasting the whole query each time (which could be error-prone), we can reuse the function for different names:

hyacint <- flora_records_on_dutch_name(my_connection, "Wilde hyacint")
head(hyacint) %>% knitr::kable()

Cor_X	COr_Y	MetingStatusCode	NaamNederlands	NaamWetenschappelijk	IFBLHokID
88720	208327	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	11195
24106	199925	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	9601
103111	190915	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	6990
118123	183942	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	5672
106107	182343	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	5217
105765	180785	GDGA	Wilde hyacint	Hyacinthoides non-scripta (L.) Chouard ex Rothm.	4830

bosanemoon <- flora_records_on_dutch_name(my_connection, "Bosanemoon")
head(bosanemoon) %>% knitr::kable()

Cor_X	COr_Y	MetingStatusCode	NaamNederlands	NaamWetenschappelijk	IFBLHokID
119247	204936	GDGA	Bosanemoon	Anemone nemorosa L.	10234
73658	199081	GDGA	Bosanemoon	Anemone nemorosa L.	8823
72752	199010	GDGA	Bosanemoon	Anemone nemorosa L.	8824
72921	198828	GDGA	Bosanemoon	Anemone nemorosa L.	8824
72874	198735	GDGA	Bosanemoon	Anemone nemorosa L.	8824
72887	198660	GDGA	Bosanemoon	Anemone nemorosa L.	8824

Remark: Do not forget to close your connection when finished.

dbDisconnect(my_connection)

The `glue_sql` function

In order to accomplish the re-usage of a query for different input names (dutch_name), the glue_sql function is used from the glue package. The glue_sql function (and the more general glue function) provides the ability to combine text and variable values in a single charactor string (i.e. the query to execute). For each variable name required in the query (any part of your query you want to have interchangeable), a representation in the query is given by the variable name you use in R, put in between curly brackets. For example, if you have the dutch_name variable in R, you can use it inside the query as {dutch_name}:

dutch_name <- 'Jan'
an_integer <- 3
a_float <- 2.8
glue('This prints a combination of a name: {dutch_name}, an integer: {an_integer} and a float value: {a_float}')

## This prints a combination of a name: Jan, an integer: 3 and a float value: 2.8

Whereas the glue function is a general function for strings, the glue_sql function is specifically created to setup queries to databases. More information is provided here and here.

[1] Formerly, connections were made using the package RODBC