dplyr (https://dplyr.tidyverse.org/) is the most popular package for data manipulation in the R ecosystem.
This notebook compares the speed of dplyr against several alternative frameworks
(packages) designed for similar tasks, all installable from CRAN (see the snippet after the list):
- Apache Arrow: https://arrow.apache.org/docs/r/
- multidplyr (dplyr spread across parallel workers): https://multidplyr.tidyverse.org/
- data.table: https://github.com/Rdatatable/data.table
- dtplyr (combines dplyr with data.table): https://dtplyr.tidyverse.org/
- tidytable (also dplyr syntax on top of data.table): https://github.com/markfairbanks/tidytable
- collapse: https://sebkrantz.github.io/collapse/
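All of these packages are available on CRAN, so a one-off installation of any that are missing could look like this:

# Install the packages used in this notebook (all available on CRAN)
install.packages(c("tidyverse", "arrow", "multidplyr", "data.table",
                   "dtplyr", "tidytable", "collapse", "microbenchmark"))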
Below, we perform two tasks, namely pivot tables (grouped aggregations) along two
different axes.
Naturally, these are very specific examples and are not sufficient to draw
definitive conclusions.
The dataset can be downloaded at https://dachxiu.chicagobooth.edu/download/datashare.zip
It comprises financial information about 31K+ firms over 768 months. We
will thus test grouping along these 2 dimensions.
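For completeness, here is a minimal way to fetch and unzip the archive directly from R (the name of the extracted .csv may differ slightly from the one used below):

# Download and extract the dataset (large file, this may take a while)
download.file("https://dachxiu.chicagobooth.edu/download/datashare.zip",
              destfile = "datashare.zip", mode = "wb")
unzip("datashare.zip")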
NOTE: we use data.table::fread() to import the .csv, as it is much faster than base read.csv():
https://rdatatable.gitlab.io/data.table/reference/fread.html
I thank Mark Fairbanks and Sebastian Krantz for helpful comments.
library(tidyverse)
library(data.table)
library(arrow)
library(dtplyr)
library(tidytable)
library(collapse)
library(multidplyr)
library(microbenchmark)
data <- fread("datashare_2021.csv") %>% data.frame() # import with fread, then convert to a plain data.frame (common starting point for all packages)
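As a quick sanity check of the two grouping dimensions, we can count the distinct dates and firms (the firm identifier is assumed to be permno, as in the original datashare file; adjust if the column is named differently):

length(unique(data$DATE))   # number of months (should be around 768)
length(unique(data$permno)) # number of firms (should be 31K+), assuming the identifier is permno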
One issue with multidplyr is its pre-processing stage (partitioning the data
across workers), which is fairly lengthy.
In all fairness, the package's authors acknowledge that multidplyr pays off
mainly on very large datasets (10M+ rows).
data_multi_n <- function(data, m){
  cluster <- new_cluster(m)                      # spawn m background R workers (multidplyr)
  data %>% group_by(DATE) %>% partition(cluster) # split the DATE groups across the workers
}
microbenchmark(
  arrow = write_parquet(data, sink = "data.parquet"), # arrow
  multidplyr = data_multi_n(data, 4),                 # multidplyr
  tidytable = data.table(data),                       # data.table & tidytable
  dtplyr = lazy_dt(data),                             # dtplyr
  times = 10,
  unit = "seconds") %>%
  summary()
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| arrow | 12.1074129 | 12.125918 | 12.293681 | 12.211368 | 12.470099 | 12.778913 | 10 |
| multidplyr | 14.4815546 | 14.494053 | 15.398404 | 15.466416 | 15.665112 | 16.505704 | 10 |
| tidytable | 2.0439909 | 2.555480 | 2.982242 | 3.049378 | 3.178886 | 3.894809 | 10 |
| dtplyr | 0.7722297 | 1.307555 | 1.536715 | 1.494366 | 1.910549 | 2.071766 | 10 |
Converting to a data.table (tidytable) or a lazy data.table (dtplyr) is much cheaper than writing the parquet file (arrow) or copying the data to the worker sessions (multidplyr). Below, we re-execute these operations so that the resulting objects are kept in the environment.
write_parquet(data, sink = "data.parquet")
cluster <- new_cluster(4) # for multidplyr
data_multi <- data %>% group_by(DATE) %>% partition(cluster)
data_dt <- data.table(data) # for data.table & tidytable
data_table <- lazy_dt(data) # for dtplyr
data_parquet <- read_parquet("data.parquet") # for arrow
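Note that read_parquet() returns an in-memory tibble by default, so the arrow entry below essentially measures dplyr verbs on that tibble plus arrow's storage format. As a sketch of an alternative (not benchmarked here, and assuming the expression is supported by arrow's dplyr translation), the data could be kept as an Arrow Table so that the aggregation runs in Arrow's own compute engine:

# Alternative sketch (assumption, not part of the benchmark below):
# keep the data in Arrow's memory format and collect only the result
data_arrow <- read_parquet("data.parquet", as_data_frame = FALSE)
data_arrow %>%
  dplyr::group_by(DATE) %>%
  dplyr::summarise(m = mean(is.finite(mom1m))) %>%
  dplyr::collect()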
bench <- microbenchmark(
  dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # dplyr
  arrow = data_parquet %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # arrow
  multidplyr = data_multi %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(), # multidplyr
  collapse = data %>% fgroup_by(DATE) %>% fsummarise(m = fmean(is.finite(mom1m))), # collapse
  tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = DATE), # tidytable
  dtplyr = data_table %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
  data.table = data_dt[, mean(is.finite(mom1m)), by = "DATE"], # data.table
  times = 60,
  unit = "seconds")
bench %>%
  summary() %>%
  ggplot(aes(y = reorder(expr, mean))) + theme_bw() +
  geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
  geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
  theme(axis.title.y = element_blank(),
        text = element_text(size = 15))