1 Introduction

dplyr (https://dplyr.tidyverse.org/) is the most popular package for data manipulation in the R ecosystem.

This notebook compares the speed of alternative frameworks (packages) that perform similar tasks, typically faster than dplyr:
- Apache Arrow (arrow): https://arrow.apache.org/docs/r/
- multidplyr (dplyr with parallel cores): https://multidplyr.tidyverse.org/
- data.table: https://github.com/Rdatatable/data.table
- dtplyr (combines dplyr with data.table): https://dtplyr.tidyverse.org/
- tidytable (also combines dplyr syntax with data.table): https://github.com/markfairbanks/tidytable
- collapse: https://sebkrantz.github.io/collapse/

Below, we perform two tasks, namely two pivot-style aggregations, one along each axis of the panel.
Naturally, these are very specific examples and are not sufficient to draw definitive conclusions.

The dataset can be downloaded at https://dachxiu.chicagobooth.edu/download/datashare.zip
It comprises financial information on 31K+ firms over 768 months; we will thus test grouping along each of these two dimensions.
NOTE: we use data.table's fread() to import the .csv: https://rdatatable.gitlab.io/data.table/reference/fread.html
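
For completeness, one possible way to fetch and unpack the archive (a hedged sketch; the name of the extracted .csv is assumed from the import code below):

download.file("https://dachxiu.chicagobooth.edu/download/datashare.zip",      # archive linked above
              destfile = "datashare.zip", mode = "wb")
unzip("datashare.zip")                                                        # assumed to yield the .csv read by fread() below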

I thank Mark Fairbanks and Sebastian Krantz for helpful comments.

library(tidyverse)
library(data.table)
library(arrow)
library(dtplyr)
library(tidytable)
library(collapse)
library(multidplyr)
library(microbenchmark)
data <- fread("datashare_2021.csv") %>% data.frame()          # fast import with fread(), then convert to a base data.frame

2 Data preparation

One issue with multidplyr is the pre-processing (partitioning) stage, which is somewhat lengthy.
In all fairness, the package's authors acknowledge that it shines mostly on very large datasets (10M+ rows).

data_multi_n <- function(data, m){
    cluster <- new_cluster(m)                                     # spawn m worker processes (multidplyr)
    data %>% group_by(DATE) %>% partition(cluster)                # split the DATE groups across the workers
}
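
Note that when the grouped computations rely on non-base packages or on objects defined in the main session, multidplyr requires them to be shipped to the workers first. A minimal sketch (our addition; not needed for the base-R functions benchmarked below):

cluster <- new_cluster(4)                                         # four worker processes
cluster_library(cluster, "dplyr")                                 # load a package on every worker
threshold <- 0.5                                                  # hypothetical object used by the workers
cluster_copy(cluster, "threshold")                                # copy it to every worker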

microbenchmark(
    arrow = write_parquet(data, sink = "data.parquet"),           # arrow
    multidplyr = data_multi_n(data, 4),                           # multidplyr
    tidytable = data.table(data),                                 # data.table & tidytable
    dtplyr = lazy_dt(data),                                       # dtplyr
    times = 10,
    unit = "seconds") %>% 
    summary()
       expr        min        lq      mean    median        uq       max neval
      arrow 12.1074129 12.125918 12.293681 12.211368 12.470099 12.778913    10
 multidplyr 14.4815546 14.494053 15.398404 15.466416 15.665112 16.505704    10
  tidytable  2.0439909  2.555480  2.982242  3.049378  3.178886  3.894809    10
     dtplyr  0.7722297  1.307555  1.536715  1.494366  1.910549  2.071766    10

Below, we re-execute the operations to keep the objects in the environment.

write_parquet(data, sink = "data.parquet")
cluster <- new_cluster(4)                                            # for multidplyr
data_multi <- data %>% group_by(DATE) %>% partition(cluster)
data_dt <- data.table(data)                                          # for data.table & tidytable
data_table <- lazy_dt(data)                                          # for dtplyr

data_parquet <- read_parquet("data.parquet")                         # for arrow 
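
A side note (our addition, not part of the original benchmark): read_parquet() returns an ordinary in-memory tibble by default, so the dplyr verbs applied to data_parquet below are evaluated in R. To have Arrow's own compute engine perform the aggregation, the data can be kept as an Arrow Table instead; a hedged sketch, assuming the functions used have Arrow bindings:

data_arrow <- read_parquet("data.parquet", as_data_frame = FALSE)    # Arrow Table, not a tibble
data_arrow %>%
    dplyr::group_by(DATE) %>%
    dplyr::summarise(m = mean(is.finite(mom1m))) %>%
    collect()                                                        # materialise the result in R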

3 Pivoting along a short axis

bench <- microbenchmark(
    dplyr = data %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))),                            # dplyr
    arrow = data_parquet %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))),                    # arrow
    multidplyr = data_multi %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% collect(),   # multidplyr
    collapse = data %>% fgroup_by(DATE) %>% fsummarise(m = fmean(is.finite(mom1m))),                                    # collapse
    tidytable = data_dt %>% summarise(m = mean(is.finite(mom1m)), .by = DATE),                                          # tidytable
    dtplyr = data_table %>% dplyr::group_by(DATE) %>% dplyr::summarise(m = mean(is.finite(mom1m))) %>% as.data.table(), # dtplyr
    data.table = data_dt[, mean(is.finite(mom1m)), by = "DATE"],                                                        # data.table
    times = 60,
    unit = "seconds") 
bench %>%
    summary() %>%
    ggplot(aes(y = reorder(expr, mean))) + theme_bw() + 
    geom_point(aes(x = mean), size = 2) + xlab("Seconds of execution") +
    geom_errorbar(aes(xmin = lq, xmax = uq), width = 0.3) +
    theme(axis.title.y = element_blank(),
          text = element_text(size = 15))
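
As a quick sanity check (our addition, not part of the benchmark), one can verify that the frameworks agree on the computed values before comparing their speed; a minimal sketch with dplyr and data.table:

res_dplyr <- data %>%
    dplyr::group_by(DATE) %>%
    dplyr::summarise(m = mean(is.finite(mom1m)))                     # share of finite mom1m per month
res_dt <- data_dt[, .(m = mean(is.finite(mom1m))), by = DATE][order(DATE)]   # reorder: data.table keeps appearance order
all.equal(res_dplyr$m, res_dt$m)                                     # should be TRUE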