dplyr (https://dplyr.tidyverse.org/) is the most popular package for data manipulation in the R ecosystem.
This notebook compares its speed to that of alternative packages that aim to
perform similar tasks faster (a CRAN installation sketch follows the list):
- arrow (Apache Arrow for R): https://arrow.apache.org/docs/r/
- multidplyr (dplyr spread across parallel cores): https://multidplyr.tidyverse.org/
- data.table: https://github.com/Rdatatable/data.table
- dtplyr (dplyr syntax with a data.table backend): https://dtplyr.tidyverse.org/
- tidytable (also tidyverse syntax on top of data.table): https://github.com/markfairbanks/tidytable
- collapse: https://sebkrantz.github.io/collapse/
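All of the packages listed above, as well as microbenchmark (used below for timing), are published on CRAN. A minimal installation sketch, assuming a CRAN mirror is configured:
install.packages(c("tidyverse", "arrow", "multidplyr", "data.table",
                   "dtplyr", "tidytable", "collapse", "microbenchmark"))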
Below, we perform two main tasks: pivot-table-style aggregations along two
different grouping axes.
Naturally, these are very specific examples and are not sufficient to draw
definitive conclusions.
The dataset can be downloaded at https://dachxiu.chicagobooth.edu/download/datashare.zip
It comprises financial information on 31K+ firms over 768 months. We
will thus test grouping along these two dimensions (firms and dates).
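A minimal sketch to fetch and extract the archive (the extended timeout and the name of the extracted .csv, assumed here to match the file imported below, may need adjusting):
options(timeout = 600)  # the archive is large; extend the default 60s download timeout
download.file("https://dachxiu.chicagobooth.edu/download/datashare.zip",
              destfile = "datashare.zip", mode = "wb")
unzip("datashare.zip")  # assumed to yield the .csv imported below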
NOTE: we use data.table's fread() to import the .csv:
https://rdatatable.gitlab.io/data.table/reference/fread.html
I thank Mark Fairbanks and Sebastian Krantz for helpful comments.
library(tidyverse)      # dplyr, ggplot2, etc.
library(data.table)     # data.table syntax + fread()
library(arrow)          # Apache Arrow bindings
library(dtplyr)         # dplyr syntax, data.table backend
library(tidytable)      # tidyverse-style verbs on data.table (masks several dplyr verbs, hence the explicit dplyr:: calls below)
library(collapse)       # fast grouped statistics
library(multidplyr)     # dplyr spread across parallel workers
library(microbenchmark) # timing utility
data <- fread("datashare_2021.csv") %>% data.frame()
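A quick sanity check of the scale of the data (a sketch; exact figures depend on the data release):
dim(data)                                # on the order of 4M rows (firm-month observations)
format(object.size(data), units = "GB")  # rough in-memory footprint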
One issue with multidplyr is the pre-processing (partitioning) stage,
which is somewhat lengthy.
In fairness, the package's authors acknowledge that multidplyr shines
mostly on very large datasets (10M+ rows).
data_multi_n <- function(data, m){
  cluster <- new_cluster(m)                       # spawn m workers (multidplyr)
  data %>% group_by(DATE) %>% partition(cluster)  # split the data across the workers
}
microbenchmark(
arrow = write_parquet(data, sink = "data.parquet"), # arrow
multidplyr = data_multi_n(data, 4), # multidplyr
tidytable = data.table(data), # data.table & tidytable
dtplyr = lazy_dt(data), # dtplyr
times = 10,
unit = "seconds") %>%
summary()
All times are in seconds.

| expr       | min        | lq        | mean      | median    | uq        | max       | neval |
|------------|------------|-----------|-----------|-----------|-----------|-----------|-------|
| arrow      | 12.1074129 | 12.125918 | 12.293681 | 12.211368 | 12.470099 | 12.778913 | 10    |
| multidplyr | 14.4815546 | 14.494053 | 15.398404 | 15.466416 | 15.665112 | 16.505704 | 10    |
| tidytable  | 2.0439909  | 2.555480  | 2.982242  | 3.049378  | 3.178886  | 3.894809  | 10    |
| dtplyr     | 0.7722297  | 1.307555  | 1.536715  | 1.494366  | 1.910549  | 2.071766  | 10    |
The conversion overhead is clearly non-negligible for arrow and multidplyr.
Below, we run these operations once more so that the converted objects remain in the environment.
write_parquet(data, sink = "data.parquet")
cluster <- new_cluster(4) # for multidplyr
data_multi <- data %>% group_by(DATE) %>% partition(cluster)
data_dt <- data.table(data) # for data.table & tidytable
data_table <- lazy_dt(data)
data_parquet <- read_parquet("data.parquet") # for arrow
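A quick sketch to inspect what each converted object actually is:
class(data_multi)    # a partitioned data frame handled by multidplyr
class(data_dt)       # a data.table
class(data_table)    # a lazy dtplyr query (evaluated only when collected)
class(data_parquet)  # a regular tibble/data frame read back from the parquet file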
bench <- microbenchmark(
dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # dplyr
arrow = data_parquet %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # arrow
multidplyr = data_multi %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(), # multidplyr
collapse = data %>% fgroup_by(DATE) %>% fsummarise(m = fmean(is.finite(mom1m))), # collapse
tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = DATE), # tidytable
dtplyr = data_table %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
data.table = data_dt[, mean(is.finite(mom1m)), by = "DATE"], # data.table
times = 60,
unit = "seconds")
bench %>%
summary() %>%
ggplot(aes(y = reorder(expr, mean))) + theme_bw() +
geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
theme(axis.title.y = element_blank(),
text = element_text(size = 15))
Each point shows the mean execution time, and the error bars span the
interquartile range (lq to uq).
The original dplyr implementation (written in C++) is competitive in this exercise.
In the second exercise, there are many more groups along which to compute the proportion of well-defined (finite) points.
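A quick check of the group counts, assuming (as in the code above and below) that months are indexed by DATE and firms by permno:
length(unique(data$DATE))    # number of monthly groups (768 per the data description)
length(unique(data$permno))  # number of firm groups (31K+ per the data description)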
data_multi <- data %>% group_by(permno) %>% partition(cluster) # re-partition by firm for multidplyr
bench <- microbenchmark(
dplyr = data %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # dplyr
arrow = data_parquet %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))), # arrow
multidplyr = data_multi %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(), # multidplyr
collapse = data %>% fgroup_by(permno) %>% fsummarise(m = fmean(is.finite(mom1m))), # collapse
tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = permno), # tidytable
dtplyr = data_table %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
data.table = data_dt[, mean(is.finite(mom1m)), by = "permno"], # data.table
times = 50,
unit = "seconds")
bench %>%
summary() %>%
ggplot(aes(y = reorder(expr, mean))) + theme_bw() +
geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
theme(axis.title.y = element_blank(),
text = element_text(size = 15))
With this many groups, the original dplyr package is now less appealing compared to collapse, tidytable, and dtplyr.
Finally, we test an even lengthier exercise in which the summary is computed for every column in the dataset.
prop_fin <- function(x){mean(is.finite(x))} # proportion of finite (well-defined) values
bench <- microbenchmark(
dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(across(everything(), prop_fin)), # dplyr
dtplyr = data_table %>% dplyr::group_by(DATE) %>%
dplyr::summarise(across(everything(), prop_fin)) %>% as.data.table(), # dtplyr
collapse = data %>% ftransformv(-DATE, is.finite) %>% fgroup_by(DATE) %>% fmean(), # collapse
times = 50,
unit = "seconds")
summary(bench)
All times are in seconds.

| expr     | min      | lq       | mean     | median   | uq       | max      | neval |
|----------|----------|----------|----------|----------|----------|----------|-------|
| dplyr    | 2.644738 | 3.050272 | 3.323984 | 3.274942 | 3.521120 | 4.349464 | 50    |
| dtplyr   | 1.643432 | 1.724856 | 2.034350 | 1.805122 | 2.227978 | 3.405955 | 50    |
| collapse | 1.662085 | 1.675199 | 1.932099 | 1.688387 | 1.994454 | 3.041505 | 50    |
dtplyr and collapse remain ahead of dplyr.
It seems that some of the improved solutions, such as tidytable,
are mostly competitive when the number of groups is large.
dtplyr and collapse seem impressive in
both configurations.
When chosen with care (including syntax!), these packages are likely to
reduce computation times.
These results are contingent on the size of the dataset (~4M rows), which lies in between small data and big data.
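They also depend on hardware and package versions; when reporting such benchmarks, it can be useful to log the session details (not part of the original timings):
sessionInfo()  # R version, OS, and versions of the loaded packages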