We are interested in the prices and carat weights of diamonds. A histogram is the customary way to get a first look at a distribution. The `bins` parameter sets the number of bars (intervals) along the x-axis.

```
library(tidyverse)
diamonds %>% filter(carat < 2.55) %>% ggplot() + geom_histogram(aes(x = carat), bins = 100)
```

`diamonds %>% filter(carat < 1.1) %>% ggplot() + geom_histogram(aes(x = carat), bins = 200) # Higher precision for small diamonds`

`diamonds %>% filter(carat < 2.55) %>% ggplot() + geom_histogram(aes(x = price), bins = 30)`
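The number of bins and the bin width are two ways of expressing the same choice. As a minimal sketch, `geom_histogram()` also accepts a `binwidth` argument, expressed in the units of the x variable. The width of 0.05 carat and the name `small_diamonds` below are illustrative assumptions, not values used elsewhere in the text.

```r
library(dplyr)
library(ggplot2)  # provides geom_histogram() and the diamonds data set

# Same filter as in the histograms above
small_diamonds <- diamonds %>% filter(carat < 2.55)

# Fix the width of each bar (0.05 carat, an arbitrary choice) instead of the number of bars
p <- ggplot(small_diamonds) + geom_histogram(aes(x = carat), binwidth = 0.05)
p
```

A fixed width can be easier to interpret than a bin count when the bars should line up with round values of the variable.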

`diamonds %>% ggplot(aes(x = cut)) + geom_bar(aes(y = after_stat(count / sum(count)))) + ylab("Proportion") # Share of each cut; after_stat() replaces the deprecated ..count.. notation`

It is often easier to summarise the distribution of a variable with a few numbers: the mean, the median, the standard deviation, etc. These are usually called ‘descriptive statistics’.

`diamonds %>% summary() # Canonical R function for descriptive statistics`

```
     carat               cut        color        clarity          depth           table           price
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00   Median : 2401
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823
                                    J: 2808   (Other): 2531
       x                y                z
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
 Median : 5.700   Median : 5.710   Median : 3.530
 Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
 3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
 Max.   :10.740   Max.   :58.900   Max.   :31.800
```

`diamonds %>% select(carat, price) %>% apply(2,sd) # Computing the standard deviation over carats & prices`

```
       carat        price
   0.4740112 3989.4397381
```

`diamonds %>% select(carat, price) %>% filter(carat < 2.1) %>% summary() # Same statistics without the largest diamonds`

```
     carat           price
 Min.   :0.200   Min.   :  326
 1st Qu.:0.400   1st Qu.:  942
 Median :0.700   Median : 2357
 Mean   :0.775   Mean   : 3765
 3rd Qu.:1.030   3rd Qu.: 5166
 Max.   :2.090   Max.   :18818
```

The last computation illustrates the sensitivity of the mean to extreme points. Over the full sample, the average price is 3933; once diamonds of 2.1 carats or more are removed, it falls to 3765, a drop of 168 (about 4%). The median moves less: from 2401 to 2357, a drop of 44 (about 2%). The median is therefore more stable and less sensitive to outliers.
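The effect is easier to see on a very small sample. The sketch below uses made-up prices (hypothetical values, not taken from the `diamonds` data) and adds a single very expensive item.

```r
# Toy prices: hypothetical values chosen for illustration only
prices <- c(900, 1000, 1100, 1200, 1300)
with_outlier <- c(prices, 18000)   # same sample plus one extreme value

mean(prices)          # 1100
median(prices)        # 1100
mean(with_outlier)    # 3916.667
median(with_outlier)  # 1150
```

One extreme value more than triples the mean, while the median only moves from 1100 to 1150.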

A single number is convenient because it can easily be computed over many subsamples, whereas comparing many distributions visually is much harder. Below, we compute averages for the subgroups defined by each combination of cut, clarity and color.

```
means <- diamonds %>%
  group_by(cut, clarity, color) %>%   # One group per combination of cut, clarity and color
  summarize(avg_carat = mean(carat), avg_price = mean(price), .groups = "drop") # Group averages, returned ungrouped
means %>% ggplot(aes(x = avg_carat, y = avg_price)) +
  geom_point(aes(color = clarity, size = cut)) # One point per group
```
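The same grouped summary can carry other descriptive statistics. As a sketch under stated assumptions (the names `group_stats`, `med_price` and `n` are introduced here for illustration), the code below adds the median price and the number of diamonds in each group, since an average over a handful of diamonds is less reliable than one over thousands.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Hypothetical extension of the grouped summary: median price and group size
group_stats <- diamonds %>%
  group_by(cut, clarity, color) %>%
  summarize(avg_price = mean(price),
            med_price = median(price),
            n = n(),                 # number of diamonds in the group
            .groups = "drop")

# Smallest groups first: their averages rest on the fewest observations
group_stats %>% arrange(n) %>% head()
```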