dplyr (https://dplyr.tidyverse.org/) is the most popular package for data manipulation in the R ecosystem.
This notebook compares its speed to that of alternative packages that aim to
perform similar tasks faster (a CRAN installation sketch follows the list):
- arrow (Apache Arrow for R): https://arrow.apache.org/docs/r/
- multidplyr (dplyr spread across parallel cores): https://multidplyr.tidyverse.org/
- data.table: https://github.com/Rdatatable/data.table
- dtplyr (dplyr syntax with a data.table backend): https://dtplyr.tidyverse.org/
- tidytable (also tidyverse syntax on top of data.table): https://github.com/markfairbanks/tidytable
- collapse: https://sebkrantz.github.io/collapse/
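All of the packages listed above, as well as microbenchmark (used below for timing), are published on CRAN. A minimal installation sketch, assuming a CRAN mirror is configured:
install.packages(c("tidyverse", "arrow", "multidplyr", "data.table",
                   "dtplyr", "tidytable", "collapse", "microbenchmark"))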
Below, we perform two main tasks: pivot-table-style aggregations along two
different grouping axes.
Naturally, these are very specific examples and are not sufficient to draw
definitive conclusions.
The dataset can be downloaded at https://dachxiu.chicagobooth.edu/download/datashare.zip
It comprises financial information on 31K+ firms over 768 months. We
will thus test grouping along these two dimensions (firms and dates).
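A minimal sketch to fetch and extract the archive (the extended timeout and the name of the extracted .csv, assumed here to match the file imported below, may need adjusting):
options(timeout = 600)  # the archive is large; extend the default 60s download timeout
download.file("https://dachxiu.chicagobooth.edu/download/datashare.zip",
              destfile = "datashare.zip", mode = "wb")
unzip("datashare.zip")  # assumed to yield the .csv imported below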
NOTE: we use data.table's fread() to import the .csv:
https://rdatatable.gitlab.io/data.table/reference/fread.html
I thank Mark Fairbanks and Sebastian Krantz for helpful comments.
library(tidyverse)      # dplyr, ggplot2, etc.
library(data.table)     # data.table syntax + fread()
library(arrow)          # Apache Arrow bindings
library(dtplyr)         # dplyr syntax, data.table backend
library(tidytable)      # tidyverse-style verbs on data.table (masks several dplyr verbs, hence the explicit dplyr:: calls below)
library(collapse)       # fast grouped statistics
library(multidplyr)     # dplyr spread across parallel workers
library(microbenchmark) # timing utility
data <- fread("datashare_2021.csv") %>% data.frame()
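A quick sanity check of the scale of the data (a sketch; exact figures depend on the data release):
dim(data)                                # on the order of 4M rows (firm-month observations)
format(object.size(data), units = "GB")  # rough in-memory footprint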
One issue with multidplyr is the pre-processing (partitioning) stage,
which is somewhat lengthy.
In fairness, the package's authors acknowledge that multidplyr shines
mostly on very large datasets (10M+ rows).
data_multi_n <- function(data, m){
  cluster <- new_cluster(m)                       # spawn m workers (multidplyr)
  data %>% group_by(DATE) %>% partition(cluster)  # split the data across the workers
}
microbenchmark(
arrow = write_parquet(data, sink = "data.parquet"), # arrow
multidplyr = data_multi_n(data, 4), # multidplyr
tidytable = data.table(data), # data.table & tidytable
dtplyr = lazy_dt(data), # dtplyr
times = 10,
unit = "seconds") %>%
summary()
All times are in seconds.

| expr       | min        | lq        | mean      | median    | uq        | max       | neval |
|------------|------------|-----------|-----------|-----------|-----------|-----------|-------|
| arrow      | 12.1074129 | 12.125918 | 12.293681 | 12.211368 | 12.470099 | 12.778913 | 10    |
| multidplyr | 14.4815546 | 14.494053 | 15.398404 | 15.466416 | 15.665112 | 16.505704 | 10    |
| tidytable  | 2.0439909  | 2.555480  | 2.982242  | 3.049378  | 3.178886  | 3.894809  | 10    |
| dtplyr     | 0.7722297  | 1.307555  | 1.536715  | 1.494366  | 1.910549  | 2.071766  | 10    |
The conversion overhead is clearly non-negligible for arrow and multidplyr.
Below, we run these operations once more so that the converted objects remain in the environment.
write_parquet(data, sink = "data.parquet")
cluster <- new_cluster(4) # for multidplyr
data_multi <- data %>% group_by(DATE) %>% partition(cluster)
data_dt <- data.table(data) # for data.table & tidytable
data_table <- lazy_dt(data)
data_parquet <- read_parquet("data.parquet") # for arrow
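A quick sketch to inspect what each converted object actually is:
class(data_multi)    # a partitioned data frame handled by multidplyr
class(data_dt)       # a data.table
class(data_table)    # a lazy dtplyr query (evaluated only when collected)
class(data_parquet)  # a regular tibble/data frame read back from the parquet file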
bench <- microbenchmark(
dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # dplyr
arrow = data_parquet %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # arrow
multidplyr = data_multi %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(), # multidplyr
collapse = data %>% fgroup_by(DATE) %>% fsummarise(m = fmean(is.finite(mom1m))), # collapse
tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = DATE), # tidytable
dtplyr = data_table %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
data.table = data_dt[, mean(is.finite(mom1m)), by = "DATE"], # data.table
times = 60,
unit = "seconds")
bench %>%
summary() %>%
ggplot(aes(y = reorder(expr, mean))) + theme_bw() +
geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
theme(axis.title.y = element_blank(),
text = element_text(size = 15))
Each point shows the mean execution time, and the error bars span the
interquartile range (lq to uq).
The original dplyr implementation (written in C++) is competitive in this exercise.
In the second exercise, there are many more groups along which to compute the proportion of well-defined (finite) points.
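A quick check of the group counts, assuming (as in the code above and below) that months are indexed by DATE and firms by permno:
length(unique(data$DATE))    # number of monthly groups (768 per the data description)
length(unique(data$permno))  # number of firm groups (31K+ per the data description)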
data_multi <- data %>% group_by(permno) %>% partition(cluster) # re-partition by firm for multidplyr
bench <- microbenchmark(
dplyr = data %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # dplyr
arrow = data_parquet %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # arrow
multidplyr = data_multi %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(), # multidplyr
collapse = data %>% fgroup_by(permno) %>% fsummarise(m = fmean(is.finite(mom1m))), # collapse
tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = permno), # tidytable
dtplyr = data_table %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
data.table = data_dt[, mean(is.finite(mom1m)), by = "permno"], # data.table
times = 50,
unit = "seconds")
bench %>%
summary() %>%
ggplot(aes(y = reorder(expr, mean))) + theme_bw() +
geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
theme(axis.title.y = element_blank(),
text = element_text(size = 15))
With this many groups, the original dplyr package is now less appealing compared to collapse, tidytable, and dtplyr.
Finally, we test an even lengthier exercise in which the summary is computed for every column in the dataset.
prop_fin <- function(x){mean(is.finite(x))} # proportion of finite (well-defined) values
bench <- microbenchmark(
dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(across(everything(), prop_fin)), # dplyr
dtplyr = data_table %>% dplyr::group_by(DATE) %>%
dplyr::summarise(across(everything(), prop_fin)) %>% as.data.table(), # dtplyr
collapse = data %>% ftransformv(-DATE, is.finite) %>% fgroup_by(DATE) %>% fmean(), # collapse
times = 50,
unit = "seconds")
summary(bench)
All times are in seconds.

| expr     | min      | lq       | mean     | median   | uq       | max      | neval |
|----------|----------|----------|----------|----------|----------|----------|-------|
| dplyr    | 2.644738 | 3.050272 | 3.323984 | 3.274942 | 3.521120 | 4.349464 | 50    |
| dtplyr   | 1.643432 | 1.724856 | 2.034350 | 1.805122 | 2.227978 | 3.405955 | 50    |
| collapse | 1.662085 | 1.675199 | 1.932099 | 1.688387 | 1.994454 | 3.041505 | 50    |
dtplyr and collapse remain ahead of dplyr.
It seems that some of the improved solutions, such as tidytable,
are mostly competitive when the number of groups is large.
dtplyr and collapse seem impressive in
both configurations.
When chosen with care (including syntax!), these packages are likely to
reduce computation times.
These results are contingent on the size of the dataset (~4M rows), which lies in between small data and big data.
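They also depend on hardware and package versions; when reporting such benchmarks, it can be useful to log the session details (not part of the original timings):
sessionInfo()  # R version, OS, and versions of the loaded packages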