Early illustrations

We are interested in the prices and carats of diamonds. To get a first look at their distributions, it is customary to plot histograms. The bins argument sets the number of bins (the small rectangles) along the x-axis.

library(tidyverse)
diamonds %>% filter(carat < 2.55) %>% ggplot() + geom_histogram(aes(x = carat), bins = 100)

diamonds %>% filter(carat < 1.1) %>% ggplot() + geom_histogram(aes(x = carat), bins = 200) # Higher precision for small diamonds

diamonds %>% filter(carat < 2.55) %>% ggplot() + geom_histogram(aes(x = price), bins = 30)

diamonds %>% ggplot(aes(x = cut)) + geom_bar(aes(y = after_stat(count / sum(count)))) + ylab("Proportion")
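
The proportions drawn in the last bar chart can also be computed explicitly, which is a handy sanity check on the plot. The sketch below is not part of the original code; the names cut_props and prop are chosen for illustration.

```r
library(tidyverse)

# Count the diamonds in each cut category, then divide by the total count
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
cut_props
```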

Sometimes, it is easier to summarise the distribution of a variable with a few numbers: the mean, the standard deviation, etc. These are usually called ‘descriptive statistics’.

diamonds %>% summary()                            # Canonical R function for descriptive statistics
     carat               cut        color        clarity          depth           table           price      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
                                    J: 2808   (Other): 2531                                                  
       x                y                z         
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :10.740   Max.   :58.900   Max.   :31.800  
                                                   
diamonds %>% select(carat, price) %>% apply(2,sd) # Computing the standard deviation over carats & prices
       carat        price 
   0.4740112 3989.4397381 
diamonds %>% select(carat, price) %>% filter(carat < 2.1) %>% summary() 
     carat           price      
 Min.   :0.200   Min.   :  326  
 1st Qu.:0.400   1st Qu.:  942  
 Median :0.700   Median : 2357  
 Mean   :0.775   Mean   : 3765  
 3rd Qu.:1.030   3rd Qu.: 5166  
 Max.   :2.090   Max.   :18818  

The last computation illustrates the sensitivity of the mean to extreme points. In the first batch of statistics, the average price was 3933; over the filtered data, it is 3765 (omitting large diamonds makes the average price go down, by roughly 4%). The median moves less: it goes from 2401 to 2357, a drop of about 2%. The median is more stable and less sensitive to outliers.
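
A toy example makes the contrast concrete. The vectors below are made-up numbers, not taken from the diamonds data: replacing a single value by an extreme one shifts the mean a lot and leaves the median unchanged.

```r
x     <- c(1, 2, 3, 4, 5)     # Made-up sample without outlier
x_out <- c(1, 2, 3, 4, 100)   # Same sample, largest value replaced by an extreme point

mean(x)        # 3
mean(x_out)    # 22
median(x)      # 3
median(x_out)  # 3
```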

Working with a single figure is convenient because it can easily be computed on many subsamples, whereas visually comparing many distributions is harder. Below, we compute averages for the subgroups defined by each combination of cut, clarity and color.

means <- diamonds %>% group_by(cut, clarity, color) %>%  # We build a pivot table over cut, clarity and color
  summarize(avg_carat = mean(carat), avg_price = mean(price), .groups = "drop") # Drop the grouping once the averages are computed
means %>% ggplot(aes(x = avg_carat, y = avg_price)) + geom_point(aes(color = clarity, size = cut))
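
Since the mean is sensitive to extreme points, one possible variation is to compute the median alongside the mean in each subgroup, together with the group size. This is a sketch under assumptions: the names group_stats, med_price and nb are illustrative and do not appear in the original code.

```r
library(tidyverse)

group_stats <- diamonds %>%
  group_by(cut, clarity, color) %>%        # One subgroup per combination of cut, clarity and color
  summarize(avg_price = mean(price),       # Mean price within the subgroup
            med_price = median(price),     # Median price within the subgroup
            nb = n(),                      # Number of diamonds in the subgroup
            .groups = "drop")
group_stats %>% arrange(nb) %>% head()     # Show the smallest subgroups first
```

Subgroups with very few diamonds yield noisy averages, so the group size is worth checking before interpreting the scatter plot.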