Reading large data files in R
Handling large text files in R
Intro
R is known to have difficulties handling large data files. Here we explore some tips that make working with such files in R less painful.
tl;dr
- If you can comfortably work with the entire file in memory, but reading the file is rather slow, consider using the data.table package and read the file with its fread function.
- If your file does not comfortably fit in memory:
  - Use sqldf if you have to stick to csv files.
  - Use a SQLite database and query it using either SQL queries or dplyr.
  - Convert your csv file to a sqlite database in order to query it.
Downloading the example files
While you can directly test this tutorial on your own large data files,
we will use bird tracking data from the LifeWatch bird tracking
network for the examples. We
have made two versions of some tracking data available for download: a
.csv
file (text data) and a .db
file (sqlite data). Both contain
processed log files; for more information on the processing, see the
BirdTrackingEtl package.
csv.name <- "2016-04-20-processed-logs-big-file-example.csv"
db.name <- "2016-04-20-processed-logs-big-file-example.db"
The evaluation of the next code chunk is ignored by default as the
downloading and unzipping of the files results in more than 3 GB of
data. If you do want to download the files yourself and test the other
chunks, run the code and download the csv
and sqlite
examples. Make
sure you have the R.utils
package available (for unzipping the
downloaded files). If not, use the command install.packages("R.utils")
in your R console to download the package.
library("R.utils")
# download the CSV file example
csv.url <- paste("https://s3-eu-west-1.amazonaws.com/lw-birdtracking-data/",
csv.name, ".gz", sep = "")
if (!file.exists(csv.name)) {
download.file(csv.url, destfile = paste0(csv.name, ".gz"))
gunzip(paste0(csv.name, ".gz"))
}
# download the sqlite database example
db.url <- paste("https://s3-eu-west-1.amazonaws.com/lw-birdtracking-data/",
db.name, ".gz", sep = "")
if (!file.exists(db.name)) {
download.file(db.url, destfile = paste0(db.name, ".gz"))
gunzip(paste0(db.name, ".gz"))
}
Loading a large dataset: use fread() or functions from readr instead of read.xxx()
library("data.table")
library("readr")
If you really need to read an entire csv file into memory, R users by default use the read.table function or variations thereof (such as read.csv). However, fread from the data.table package is a lot faster. Furthermore, the readr package also provides more optimized reading functions (read_csv, read_delim, …). Let’s measure the time to read in the data using these three different methods.
read.table.timing <- system.time(read.table(csv.name, header = TRUE, sep = ","))
readr.timing <- system.time(read_delim(csv.name, ",", col_names = TRUE))
data.table.timing <- system.time(allData <- fread(csv.name, showProgress = FALSE))
data <- data.frame(method = c('read.table', 'readr', 'fread'),
timing = c(read.table.timing[3], readr.timing[3], data.table.timing[3]))
data
## method timing
## 1 read.table 183.732
## 2 readr 3.625
## 3 fread 12.564
fread and read_delim are indeed much faster than the default read.table. However, the result of fread is a data.table and the result of read_delim is a tibble. Neither is a plain data.frame. The data.table package describes the data.table object as a more performant replacement for the data.frame. This means that selecting, filtering and aggregating data is much faster on a data.table compared to a standard data.frame, but it requires a slightly different syntax. A tibble is very similar to a data.frame, but provides more convenience when printing or subsetting the data.
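As a quick illustration of the data.table syntax, here is a minimal sketch that filters and aggregates the allData object created with fread above (device_info_serial and altitude are columns of this dataset; device 860 is used again later in this tutorial):
# data.table syntax: dt[filter, select/aggregate, by]
# mean altitude for a single device
allData[device_info_serial == 860, .(mean_altitude = mean(altitude, na.rm = TRUE))]
# the same aggregation, grouped per device
allData[, .(mean_altitude = mean(altitude, na.rm = TRUE)), by = device_info_serial]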
You can find the data.table package on CRAN. The package vignettes are a good place to learn it; the introduction to data.table should be enough to get started. The readr package is also on CRAN. It belongs to a suite of R packages aiming to improve data manipulation in R, called the tidyverse. More examples and explanation about readr are provided on the readr website.
Data files that don’t fit in memory
If you are not able to read in the data file, because it does not fit in memory (or because R becomes too slow when you load the entire dataset), you will need to limit the amount of data that will actually be stored in memory. There are a couple of options which we will investigate:
- limit the number of lines you are trying to read for some exploratory analysis. Once you are happy with the analysis you want to run on the entire dataset, move to another machine.
- limit the number of columns you are reading to reduce the memory required to store the data.
- limit both the number of rows and the number of columns using sqldf.
- stream the data.
1. Limit the number of lines you read (fread)
Limiting the number of lines you read is easy. Just use the nrows and/or skip option (available to both read.table and fread). skip can be used to skip a number of rows, but you can also pass a string to this parameter, causing fread to only start reading lines from the first line matching that string. Let’s say we only want to start reading lines after we find a line matching the pattern 2015-06-12 15:14:39. We can do that like this:
sprintf("Number of lines in full data set: %s", nrow(allData))
## [1] "Number of lines in full data set: 3761058"
subSet <- fread(csv.name, skip = "2015-06-12 15:14:39", showProgress = FALSE)
sprintf("Number of lines in data set with skipped lines: %s", nrow(subSet))
## [1] "Number of lines in data set with skipped lines: 9998"
Skipping rows this way obviously does not give you the entire dataset, so this strategy is only useful for exploratory analysis on a subset of your data. Note that read_delim also provides an n_max argument to limit the number of lines to read. If you want to explore the whole dataset, limiting the number of columns you read can be a more useful strategy.
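For completeness, a minimal sketch of the readr equivalent, using the same csv.name as above (not timed here):
# read only the first 10000 data rows with readr
subSetReadr <- read_delim(csv.name, ",", col_names = TRUE, n_max = 10000)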
2. Limit the number of columns you read (fread)
If you only need 4 columns of the 21 columns present in the file, you can tell fread to only select those 4. This can have a major impact on the memory footprint of your data. The option you need for this is select. With this, you can specify a number of columns to keep. The opposite - specifying the columns you want to drop - can be accomplished with the drop option.
fourColumns = fread(csv.name, select = c("device_info_serial", "date_time",
"latitude", "longitude"),
showProgress = FALSE)
sprintf("Size of total data in memory: %s MB", utils::object.size(allData)/1000000)
## [1] "Size of total data in memory: 1173.480728 MB"
sprintf("Size of only four columns in memory: %s MB", utils::object.size(fourColumns)/1000000)
## [1] "Size of only four columns in memory: 105.311936 MB"
The difference might not be as large as you would expect. R objects claim more memory than needed to store the data alone, as they also keep pointers and other object attributes. Still, the difference can save you quite some memory.
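The drop variant works the same way. As a sketch, this drops two columns that do occur in this dataset (latitude and longitude) and keeps everything else:
# keep all columns except the ones listed in drop
allButCoordinates <- fread(csv.name, drop = c("latitude", "longitude"),
                           showProgress = FALSE)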
3. Limiting both the number of rows and the number of columns using sqldf
The sqldf package allows you to run SQL-like queries on a file, resulting in only a selection of the file being read. It allows you to limit both the number of rows and the number of columns at the same time. In the background, this actually creates a SQLite database on the fly to execute the query. Consider using the package when starting from a csv file, but the actual strategy boils down to making a SQLite database file of your data. See this section below to learn how to interact with those and how to create a SQLite database from a csv file.
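As an illustration, a minimal sketch using sqldf’s read.csv.sql (not evaluated in this tutorial; inside the query, the csv file is referred to as file):
library(sqldf)
# read only four columns and only the rows for a single device;
# sqldf builds a temporary SQLite database behind the scenes
oneDevice <- read.csv.sql(csv.name,
                          sql = "SELECT device_info_serial, date_time, latitude, longitude
                                 FROM file
                                 WHERE device_info_serial = 860")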
4. Streaming data
Short: streaming a file in R is a bad idea. If you are interested why, read the rest of this section.
Streaming a file means reading it line by line and only keeping the lines you need or do stuff with the lines while you read through the file. It turns out that R is really not very efficient in streaming files. The main reason is the memory allocation process that has difficulties with a constantly growing object (which can be a dataframe containing only the selected lines).
In the next code block, we will read parts of our data file once using
the fread
function, and once line by line. You’ll see the performance
issue with the streaming solution.
library(ggplot2)
library(reshape2)  # provides melt(), used below to reshape the timing results
allowedDevices = c(753, 801, 852)
minDate = strptime('1/3/2014',format = '%d/%m/%Y')
maxDate = strptime('1/10/2014',format = '%d/%m/%Y')
streamFile <- function(limit) {
con <- file(csv.name, open = "r")
selectedRecords <- list()
i <- 0
file.streaming.timing <- system.time(
while (i < limit) {
oneLine <- readLines(con, n = 1, warn = FALSE)
vec = (strsplit(oneLine, ","))
selectedRecords <- c(selectedRecords, vec)
i <- i + 1
}
)
close(con)
return(file.streaming.timing[[3]])
}
freadFile <- function(limit) {
file.fread.timing = system.time(
d <- fread(csv.name, showProgress = FALSE, nrows = limit)
)
return(file.fread.timing[[3]])
}
maxLines <- c(5000, 10000, 15000, 20000, 25000, 30000)
streamingTimes <- sapply(maxLines, streamFile)
freadTimes <- sapply(maxLines, freadFile)
data <- data.frame(n = maxLines, streaming = streamingTimes,
fread = freadTimes)
pdata <- melt(data, id = c("n"))
colnames(pdata) <- c("n", "algorithm", "execTime")
qplot(n, execTime, data = pdata, color = algorithm,
xlab = "number of lines read", ylab = "execution time (s)")
The database file strategy
Working with SQLite databases
SQLite databases are single-file databases, meaning you can simply download them, store them in a folder or share them with colleagues, similar to a csv file. They are however more powerful than csv files because of two important features:
- Support for SQL: this allows you to execute intelligent filters on your data, similar to the sqldf package or database environments you are familiar with. That way, you can reduce the amount of data that’s stored in memory by filtering out rows or columns.
- Indexes: SQLite databases contain indexes. An index is something like an ordered version of a column. When enabled on a column, you can search through the column much faster. We will demonstrate this below.
We have downloaded a second file
2016-04-20-processed-logs-big-file-example.db
that contains the same
data as the 2016-04-20-processed-logs-big-file-example.csv
file, but
as a sqlite database. Furthermore, the database file contains indexes
which will dramatically drop the time needed to perform search queries.
If you do not have a SQLite database containing your data, you can first convert your csv into a SQLite database as described further in this tutorial.
Let’s first connect to the database and list the available tables.
library(RSQLite)
db <- dbConnect(SQLite(), dbname = db.name)
# show the tables in this database
dbListTables(db)
## [1] "SpatialIndex" "geom_cols_ref_sys"
## [3] "geometry_columns" "geometry_columns_auth"
## [5] "geometry_columns_field_infos" "geometry_columns_statistics"
## [7] "geometry_columns_time" "processed_logs"
## [9] "spatial_ref_sys" "spatialite_history"
## [11] "sql_statements_log" "sqlite_sequence"
## [13] "vector_layers" "vector_layers_auth"
## [15] "vector_layers_field_infos" "vector_layers_statistics"
## [17] "views_geometry_columns" "views_geometry_columns_auth"
## [19] "views_geometry_columns_field_infos" "views_geometry_columns_statistics"
## [21] "virts_geometry_columns" "virts_geometry_columns_auth"
## [23] "virts_geometry_columns_field_infos" "virts_geometry_columns_statistics"
Let’s try to select rows where the device id matches a given value (e.g. 860), and the date time is between two given timestamps. For our analysis, we only need date_time, latitude, longitude and altitude, so we will only select those.
sqlTiming <- system.time(data <- dbGetQuery(conn = db,
"SELECT date_time, latitude, longitude, altitude
FROM processed_logs
WHERE device_info_serial = 860
AND date_time < '2014-07-01'
AND date_time > '2014-03-01'"
))
print(sqlTiming[3])
## elapsed
## 40.405
This provides a convenient and fast way to request subsets of data from our large data file. We could do the same analysis for each of the serial numbers, each time only loading that subset of the data. As an example, consider the calculation of the average altitude over the specified period for each of the bird serial identifiers in the list serial_id_list. By using a for loop, the calculation is done for each of the birds separately and the amount of data loaded into memory at the same time is lower:
serial_id_list <- c(853, 860, 783)
print("Average altitude between 2014-03-01 and 2014-07-01:")
## [1] "Average altitude between 2014-03-01 and 2014-07-01:"
for (serialid in serial_id_list) {
data <- dbGetQuery(conn = db,
sprintf("SELECT date_time, latitude, longitude, altitude
FROM processed_logs
WHERE device_info_serial = %d
AND date_time < '2014-07-01'
AND date_time > '2014-03-01'", serialid))
print(sprintf("serialID %d: %f", serialid, mean(data$altitude)))
}
## [1] "serialID 853: NA"
## [1] "serialID 860: 23.550518"
## [1] "serialID 783: 14.900030"
Remark that we use the sprintf
function to dynamically replace the
serial id in the sqlite query we will execute. For each loop, the %d
is replaced by the value of the serial id of the respective loop. Read
the
manual
of the sprintf
function for more information and options.
Interacting with SQLite databases using dplyr
If you’re not comfortable with writing queries in SQL, R has a great alternative: dplyr. dplyr can connect to a SQLite database and you can perform the same operations on it that you would do on a dataframe. However, dplyr will translate your commands to SQL, allowing you to take advantage of the indexes in the SQLite database.
library(dplyr)
my_db <- src_sqlite(db.name, create = FALSE)
bird_tracking <- tbl(my_db, "processed_logs")
results <- bird_tracking %>%
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude) %>%
filter(date_time < "2014-07-01") %>%
filter(date_time > "2014-03-01")
head(results)
## # Source: lazy query [?? x 4]
## # Database: sqlite 3.36.0
## # [/Users/peter_desmet/Coding/Repositories/inbo/tutorials/content/tutorials/r_large_data_files_handling/2016-04-20-processed-logs-big-file-example.db]
## date_time latitude longitude altitude
## <chr> <dbl> <dbl> <dbl>
## 1 2014-03-10 12:43:37 44.0 -7.59 626
## 2 2014-03-10 12:58:32 44.1 -7.67 405
## 3 2014-03-10 13:13:52 44.1 -7.69 326
## 4 2014-03-10 13:28:57 44.1 -7.71 250
## 5 2014-03-10 13:43:54 44.1 -7.72 174
## 6 2014-03-10 13:59:06 44.1 -7.73 23
Dplyr provides the ability to perform queries as above without the need
to know SQL. If you want to learn more about how to use dplyr
with a
SQLite database, head over to this
vignette.
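As an aside, if you want to see the SQL that dplyr generates for such a pipeline, you can print it (a quick check on the results object from above):
# print the SQL statement dplyr would send to the SQLite database
show_query(results)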
Create a SQLite database from a CSV file
In case you have a CSV file available and you would like to query the data using SQL queries or with dplyr as shown in the previous sections, you can decide to convert the data to a SQLite database. The conversion will require some time, but once it is done, the database provides all the query options shown above. Moreover, you can easily add additional tables with related information to combine the data with.
If you already loaded the CSV file into memory, the creation of a SQLite database is very straightforward and can be achieved in two steps:
db <- dbConnect(SQLite(), dbname = "example.sqlite")
dbWriteTable(db, "birdtracks", allData)
dbDisconnect(db)
The first command creates a new database if the file example.sqlite does not already exist. The dbWriteTable command writes the table to the database. Hence, we can rerun the query from the previous section, but now on the newly created SQLite database, with its single table birdtracks:
my_db <- src_sqlite("example.sqlite", create = FALSE)
bird_tracking <- tbl(my_db, "birdtracks")
results <- bird_tracking %>%
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude) %>%
filter(date_time < "2014-07-01") %>%
filter(date_time > "2014-03-01")
head(results)
## # Source: lazy query [?? x 4]
## # Database: sqlite 3.36.0
## # [/Users/peter_desmet/Coding/Repositories/inbo/tutorials/content/tutorials/r_large_data_files_handling/example.sqlite]
## # … with 4 variables: date_time <dbl>, latitude <dbl>, longitude <dbl>,
## # altitude <int>
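Note that a table written with dbWriteTable has no indexes yet. If you will filter repeatedly on a column such as device_info_serial, you can add an index yourself; a minimal sketch using DBI (the index name is our own choice):
library(DBI)
db <- dbConnect(SQLite(), dbname = "example.sqlite")
# create an index on the columns used in the WHERE clauses above
dbExecute(db, "CREATE INDEX IF NOT EXISTS idx_birdtracks_device_datetime
               ON birdtracks (device_info_serial, date_time)")
dbDisconnect(db)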
However, when working with really large CSV files, you do not want to load the entire file into memory first (this is the whole point of this tutorial). An alternative strategy is to load the data from the CSV file in chunks (small sections) and write them step by step to the SQLite database.
This can be implemented by reading the CSV file in small sections (let’s say 50000 lines each time) and moving all sections to a given table in a sqlite database. As this is a recurrent task, we provide the transformation in a custom-written function, called csv_to_sqlite. The function is available in the inborutils package. Check the function documentation online or by typing ?csv_to_sqlite after installing and loading the inborutils package. As SQLite does not natively support date and datetime representations, the function converts those columns to an appropriate string representation before copying the dates to SQLite. The lubridate package is used to check the date handling.
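For illustration, here is a stripped-down manual sketch of the same idea, using readr’s chunked reader and appending each chunk to a table (this is our own simplified version and skips the date handling that csv_to_sqlite takes care of):
library(readr)
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "manual-example.sqlite")
# read the csv in chunks of 50000 lines; each chunk is appended to the table
read_delim_chunked(
  csv.name, delim = ",",
  chunk_size = 50000,
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    dbWriteTable(con, "birdtracks", as.data.frame(chunk), append = TRUE)
  })
)
dbDisconnect(con)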
As an example, let’s convert the processed bird logs csv file to a sqlite database called example2.sqlite, with a table birdtracks. Using the default values for the preprocessing number of lines and the chunk size, the conversion is as follows:
library(inborutils)
sqlite_file <- "example2.sqlite"
table_name <- "birdtracks"
inborutils::csv_to_sqlite(csv_file = csv.name,
sqlite_file, table_name, pre_process_size = 1000,
chunk_size = 50000, show_progress_bar = FALSE)
Hence, this approach will work for large files as well and is an ideal first step when doing this kind of analysis. Once performed, the SQLite database is available to query, similar to the previous examples:
my_db <- src_sqlite("example2.sqlite", create = FALSE)
## Warning: `src_sqlite()` was deprecated in dplyr 1.0.0.
## Please use `tbl()` directly with a database connection
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
bird_tracking <- tbl(my_db, "birdtracks")
results <- bird_tracking %>%
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude) %>%
filter(date_time < "2014-07-01") %>%
filter(date_time > "2014-03-01")
head(results)
## # Source: lazy query [?? x 4]
## # Database: sqlite 3.36.0
## # [/Users/peter_desmet/Coding/Repositories/inbo/tutorials/content/tutorials/r_large_data_files_handling/example2.sqlite]
## date_time latitude longitude altitude
## <chr> <dbl> <dbl> <dbl>
## 1 2014-03-10 12:43:37 44.0 -7.59 626
## 2 2014-03-10 12:58:32 44.1 -7.67 405
## 3 2014-03-10 13:13:52 44.1 -7.69 326
## 4 2014-03-10 13:28:57 44.1 -7.71 250
## 5 2014-03-10 13:43:54 44.1 -7.72 174
## 6 2014-03-10 13:59:06 44.1 -7.73 23
Note that the dates are handled properly: the date representation inside SQLite is the converted string version.