1 Introduction

dplyr (https://dplyr.tidyverse.org/) is the most popular package for data manipulation in the R ecosystem.

This notebook seeks to compare the speed of alternative frameworks (packages) that perform similar tasks:
- Apache Arrow: https://arrow.apache.org/docs/r/
- multidplyr (dplyr with parallel cores): https://multidplyr.tidyverse.org/
- data.table: https://github.com/Rdatatable/data.table
- dtplyr (combines dplyr with data.table): https://dtplyr.tidyverse.org/
- tidytable (also dplyr & data.table): https://github.com/markfairbanks/tidytable
- collapse: https://sebkrantz.github.io/collapse/

Below, we perform two tasks, namely grouped aggregations (pivot tables) along two different axes.
Naturally, these are very specific examples and are not sufficient to draw definitive conclusions.
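To make the metric concrete, here is a toy illustration (hypothetical mini data frame; it relies on the packages loaded in the next section) of the summary computed throughout: the share of finite, i.e. well-defined, values per group.

toy <- data.frame(DATE  = c(1, 1, 2, 2),
                  mom1m = c(0.5, NA, Inf, 1.2))
toy %>%
    dplyr::group_by(DATE) %>%
    dplyr::summarise(m = mean(is.finite(mom1m)))    # share of finite values: 0.5 for each DATE here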

The dataset can be downloaded at https://dachxiu.chicagobooth.edu/download/datashare.zip
It comprises financial information on 31K+ firms over 768 months. We will thus test grouping along these two dimensions.
NOTE: we use data.table's fread() to import the .csv: https://rdatatable.gitlab.io/data.table/reference/fread.html

I thank Mark Fairbanks and Sebastian Krantz for helpful comments.

library(tidyverse)
library(data.table)
library(arrow)
library(dtplyr)
library(tidytable)
library(collapse)
library(multidplyr)
library(microbenchmark)
data <- fread("datashare_2021.csv") %>% data.frame()             # fast csv import, converted to a base data.frame
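As a quick sanity check of the dimensions quoted in the introduction (firms are identified by permno and months by DATE), one can count the distinct values of both keys:

data %>% 
    dplyr::summarise(n_firms  = dplyr::n_distinct(permno),       # expected: 31K+ firms
                     n_months = dplyr::n_distinct(DATE))         # expected: 768 months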

2 Data preparation

One issue with multidplyr is the pre-processing stage (partitioning the data across workers), which is somewhat lengthy.
In all transparency, the package's authors acknowledge that it is most impressive on very large datasets (10M+ rows).

data_multi_n <- function(data, m){
    cluster <- new_cluster(m)                                     # spawn m worker processes (multidplyr)
    data %>% group_by(DATE) %>% partition(cluster)                # split the groups across the workers
}

microbenchmark(
    arrow = write_parquet(data, sink = "data.parquet"),           # arrow
    multidplyr = data_multi_n(data, 4),                           # multidplyr
    tidytable = data.table(data),                                 # data.table & tidytable
    dtplyr = lazy_dt(data),                                       # dtplyr
    times = 10,
    unit = "seconds") %>% 
    summary()
expr              min        lq      mean    median        uq       max neval
arrow      12.1074129 12.125918 12.293681 12.211368 12.470099 12.778913    10
multidplyr 14.4815546 14.494053 15.398404 15.466416 15.665112 16.505704    10
tidytable   2.0439909  2.555480  2.982242  3.049378  3.178886  3.894809    10
dtplyr      0.7722297  1.307555  1.536715  1.494366  1.910549  2.071766    10
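Part of the reason the dtplyr conversion is so cheap is that lazy_dt() converts the data once and then only records subsequent verbs as a query plan; nothing is computed until the result is collected. The translation can be inspected with show_query(), as in this sketch:

lazy_dt(data) %>%
    dplyr::group_by(DATE) %>%
    dplyr::summarise(m = mean(is.finite(mom1m))) %>%
    show_query()                                  # prints the generated data.table expression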

Below, we re-execute the operations so that the converted objects remain in the environment.

write_parquet(data, sink = "data.parquet")
cluster <- new_cluster(4)                                            # for multidplyr
data_multi <- data %>% group_by(DATE) %>% partition(cluster)
data_dt <- data.table(data)                                          # for data.table & tidytable
data_table <- lazy_dt(data)

data_parquet <- read_parquet("data.parquet")                         # for arrow 
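Note that read_parquet() returns a regular tibble by default. To keep the data in Arrow's memory format and let Arrow's query engine handle the dplyr verbs, one can pass as_data_frame = FALSE; a sketch (whether a given expression runs natively in the engine depends on the arrow version):

data_arrow <- read_parquet("data.parquet", as_data_frame = FALSE)    # an Arrow Table instead of a tibble
data_arrow %>%
    dplyr::group_by(DATE) %>%
    dplyr::summarise(m = mean(is.finite(mom1m))) %>%
    dplyr::collect()                                                 # computation happens here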

3 Pivoting along a short axis

bench <- microbenchmark(
    dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))),                            # dplyr
    arrow = data_parquet %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))),                    # arrow
    multidplyr = data_multi %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(),   # multidplyr
    collapse = data %>% fgroup_by(DATE) %>% fsummarise(m = fmean(is.finite(mom1m))),                                    # collapse
    tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = DATE),                                          # tidytable
    dtplyr = data_table %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
    data.table = data_dt[, mean(is.finite(mom1m)), by = "DATE"],                                                        # data.table
    times = 60,
    unit = "seconds") 
bench %>%
    summary() %>%
    ggplot(aes(y = reorder(expr, mean))) + theme_bw() + 
    geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
    geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
    theme(axis.title.y = element_blank(),
          text = element_text(size = 15)) 

Each point shows the mean execution time, with error bars spanning the IQR (inter-quartile range).
The original implementation of dplyr (in C++) is competitive in this exercise.
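A small data.table usage note: the timed expression above returns its aggregate as an unnamed column V1; naming it to match the other frameworks only requires the .() list syntax:

data_dt[, .(m = mean(is.finite(mom1m))), by = DATE]              # named aggregate column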

4 Pivoting along the other dimension

In the second exercise, there are many more groups (31K+ firms versus 768 months) along which to compute the proportion of well-defined points.

data_multi <- data %>% group_by(permno) %>% partition(cluster)
bench <- microbenchmark(
    dplyr = data %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))),                            # dplyr
    arrow = data_parquet %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))),                    # arrow
    multidplyr = data_multi %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(),   # multidplyr
    collapse = data %>% fgroup_by(permno) %>% fsummarise(m = fmean(is.finite(mom1m))),                                    # collapse
    tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = permno),                                          # tidytable
    dtplyr = data_table %>% dplyr::group_by(permno) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
    data.table = data_dt[, mean(is.finite(mom1m)), by = "permno"],                                                        # data.table
    times = 50,
    unit = "seconds") 
bench %>%
    summary() %>%
    ggplot(aes(y = reorder(expr, mean))) + theme_bw() + 
    geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
    geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
    theme(axis.title.y = element_blank(),
          text = element_text(size = 15)) 

The original dplyr package is now less appealing compared to collapse, tidytable, and dtplyr.

Finally, we test an even lengthier exercise in which the summary is computed for all columns of the dataset.

prop_fin <- function(x){mean(is.finite(x))}                      # proportion of finite (well-defined) values
bench <- microbenchmark(
    dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(across(everything(), prop_fin)), # dplyr
    dtplyr = data_table %>% dplyr::group_by(DATE) %>% 
        dplyr::summarise(across(everything(), prop_fin)) %>% as.data.table(),                    # dtplyr
    collapse = data %>% ftransformv(-DATE, is.finite) %>% fgroup_by(DATE) %>% fmean(),           # collapse
    times = 50,
    unit = "seconds") 
summary(bench)
expr          min       lq     mean   median       uq      max neval
dplyr    2.644738 3.050272 3.323984 3.274942 3.521120 4.349464    50
dtplyr   1.643432 1.724856 2.034350 1.805122 2.227978 3.405955    50
collapse 1.662085 1.675199 1.932099 1.688387 1.994454 3.041505    50

dtplyr and collapse remain ahead of dplyr.
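For completeness, the same all-columns summary can be written directly in data.table (not timed above): lapply() over .SD applies prop_fin to every non-grouping column.

data_dt[, lapply(.SD, prop_fin), by = DATE]                      # .SD = all columns except DATE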

5 Preliminary conclusion

It seems that some improved solutions, such as tidytable, are mostly competitive when the number of groups is large.
dtplyr and collapse seem impressive in both configurations.
When chosen with care (including syntax!), these packages are likely to reduce computation times.

These results are contingent on the size of the dataset (~4M rows), which is intermediate between small data and big data.