This is the companion notebook to the course on Maximum Likelihood.

if (!require("bbmle")) install.packages("bbmle") # Only installs if missing
library(bbmle)
library(tidyverse)

Moment Matching

In this section, we estimate parameters using moment matching: the parameters must be chosen such that the first moments observed in the data match the theoretical ones.

Let’s try to fit an exponential distribution to the prices of diamonds. The average of an exponential law is the inverse of its parameter (and vice-versa).

m <- mean(diamonds$price)  # Average price
ggplot(diamonds, aes(x = price)) + geom_histogram(aes(y = ..density..)) + stat_function(fun = dexp, args = list(1/m), color = "red")

A (possibly) better fit could be obtained by a family with 2 parameters: let’s try the gamma distribution.

m <- mean(diamonds$price) # Average price
v <- var(diamonds$price)  # Variance of prices
ggplot(diamonds, aes(x = price)) + geom_histogram(aes(y = ..density..)) + 
    stat_function(fun = dexp, args = list(1/m), color = "red") +
    stat_function(fun = dgamma, args = list(shape = m^2/v, rate = m/v), color = "cyan") 

The difference between the two curves is hard to distinguish.

Maximum Likelihood Estimation

Below, we perform the MLE of the price distribution. We scale the prices by a kilo factor and use a target gamma distribution. We use two different functions: mle (old) and mle2 (new). The latter is more user-friendly. Below, we show that they yield the same output.

par <- mle2(price / 1000 ~ dgamma(shape, rate), start = list(shape = 2, rate = 2), data = diamonds) # Using the new package; WARINING: prices were scaled.
par@coef # Just to see the values of the parameters 
    shape      rate 
1.1580360 0.2944673 
nLL <- function(s, r) -sum(stats::dgamma(diamonds$price / 1000, s, r, log = TRUE)) # Using the old package
fit <- mle(nLL, start = list(s = 2, r = 2))
fit@coef # Again, just having a look: they match!
        s         r 
1.1580360 0.2944673 
diamonds %>% ggplot() + geom_histogram(aes(x = price / 1000, y = ..density..), bins = 90) + stat_function(fun = dgamma, args = list(shape = fit@coef[1], rate = fit@coef[2]), color = "cyan") + stat_function(fun = dexp, args = list(1/m*1000), color = "red") # Plot

The difference with the moment-matching density is the small bump close to zero.

A more difficult task is to fit a parametric distribution to the distribution of diamond weights. Indeed, the empirical distribution is far from smooth because spikes appear around ‘round’ values of carat (1, 1.5, 2, 2.5, etc.).

par <- mle2(carat ~ dgamma(shape, rate), start = list(shape = 1, rate = 2), data = diamonds) # Using the new package
par@coef # The estimated parameters
   shape     rate 
3.111010 3.898804 
nLL <- function(s, r) -sum(stats::dgamma(diamonds$carat, shape = s, rate = r, log = TRUE)) # Using the old package
fit <- mle(nLL, start = list(s = 2, r = 2))
fit@coef # The estimated parameters
       s        r 
3.111011 3.898804 
diamonds %>% ggplot() + geom_histogram(aes(x = carat, y = ..density..), bins = 50) + stat_function(fun = dgamma, args = list(shape = fit@coef[1], rate = fit@coef[2]), color = "red")

It is hard in this case to find the fit very impressive: the gamma distribution is smooth and not fit for this task.

Finally, and for the sakes of completeness, we end with a counter-example. We define the density of a scaled Student variable and then perform MLE over the carat distribution.

student <- function(x, m, v, df, log = FALSE){
    if(log == FALSE) {return(dt(v*x+m, df = df)*v)}     # Raw value
    if(log == TRUE) {return(log(dt(v*x+m, df = df)*v))} # Log value
}
par <- mle2(carat ~ student(m, v, df), 
            start = list(m = 2, v = 1, df = 5), 
            data = diamonds) 
par@coef
        m         v        df 
-1.814839  2.407730  8.607311 
diamonds %>% ggplot() + 
    geom_histogram(aes(x = carat, y = ..density..), bins = 50) + 
    stat_function(fun = student, 
                  args = list(m = par@coef[1], v = par@coef[2], df = par@coef[3]),
                  color = "yellow")

The fit is not so different, compared to the gamma curve. But let us not forget that the underlying distribution is defined on the whole real line: negative values are possible in this case.

Exercises

Fitting!

  1. Set your working directory, load the tidyverse package & import the ANES dataset.
  2. Replace the (or create a new) age variable with its normalized value: (Age - 15) / 80. Hint: use the mutate() function.
  3. Check that the new variable lies between 0 and 1 (plot a histogram for instance, or look at the summary). BEWARE: if extreme values are reached (0 or 1), the optimizer might not like it.
  4. Fit this scaled age variable with a beta distribution.
  5. Plot the corresponding distribution over the histogram.
  6. Fit the variable to a Gaussian distribution. Plot and compare to the beta fit.

Working on subsets

Perform the same analysis on the male/female subsets (Hint: use the filter() function). Compare the two!

Getting nuts: fitting gamma with beta.

  1. Create new data with the commands below:

n_sample <- 10^5
y <- rgamma(n_sample,2,2)
y <- y/max(y+1)
x <- 1:n_sample
data <- data.frame(x, y)

  1. Again, fit a beta distribution on y. Plot the data and fitted distribution.
  2. Same question with a gamma law. Look at the graphical difference between the two.
LS0tCnRpdGxlOiAiTWF4aW11bSBMaWtlbGlob29kIEVzdGltYXRpb24iCm91dHB1dDogCiAgaHRtbF9ub3RlYm9vazoKICAgIHRvYzogdHJ1ZQogICAgdG9jX2Zsb2F0OiB0cnVlCiAgI2dpdGh1Yl9kb2N1bWVudDogCiAgICMgdG9jOiB0cnVlCi0tLQoKClRoaXMgaXMgdGhlIGNvbXBhbmlvbiBub3RlYm9vayB0byB0aGUgY291cnNlIG9uICoqTWF4aW11bSBMaWtlbGlob29kKiouCgpgYGB7ciBwYWNrYWdlLCBtZXNzYWdlID0gRkFMU0UsIHdhcm5pbmcgPSBGQUxTRX0KaWYgKCFyZXF1aXJlKCJiYm1sZSIpKSBpbnN0YWxsLnBhY2thZ2VzKCJiYm1sZSIpICMgT25seSBpbnN0YWxscyBpZiBtaXNzaW5nCmxpYnJhcnkoYmJtbGUpCmxpYnJhcnkodGlkeXZlcnNlKQpgYGAKIyMgTW9tZW50IE1hdGNoaW5nCkluIHRoaXMgc2VjdGlvbiwgd2UgZXN0aW1hdGUgcGFyYW1ldGVycyB1c2luZyAqKm1vbWVudCBtYXRjaGluZyoqOiB0aGUgcGFyYW1ldGVycyBtdXN0IGJlIGNob3NlbiBzdWNoIHRoYXQgdGhlIGZpcnN0IG1vbWVudHMgb2JzZXJ2ZWQgaW4gdGhlIGRhdGEgbWF0Y2ggdGhlIHRoZW9yZXRpY2FsIG9uZXMuICAKCkxldCdzIHRyeSB0byBmaXQgYW4gZXhwb25lbnRpYWwgZGlzdHJpYnV0aW9uIHRvIHRoZSBwcmljZXMgb2YgZGlhbW9uZHMuIFRoZSBhdmVyYWdlIG9mIGFuIGV4cG9uZW50aWFsIGxhdyBpcyB0aGUgaW52ZXJzZSBvZiBpdHMgcGFyYW1ldGVyIChhbmQgdmljZS12ZXJzYSkuCmBgYHtyIE1NXzEsIG1lc3NhZ2UgPSBGQUxTRX0KbSA8LSBtZWFuKGRpYW1vbmRzJHByaWNlKSAgIyBBdmVyYWdlIHByaWNlCmdncGxvdChkaWFtb25kcywgYWVzKHggPSBwcmljZSkpICsgCiAgICBnZW9tX2hpc3RvZ3JhbShhZXMoeSA9IC4uZGVuc2l0eS4uKSkgKyAKICAgIHN0YXRfZnVuY3Rpb24oZnVuID0gZGV4cCwgYXJncyA9IGxpc3QoMS9tKSwgY29sb3IgPSAicmVkIikKYGBgCgpBIChwb3NzaWJseSkgYmV0dGVyIGZpdCBjb3VsZCBiZSBvYnRhaW5lZCBieSBhIGZhbWlseSB3aXRoIDIgcGFyYW1ldGVyczogbGV0J3MgdHJ5IHRoZSAqKmdhbW1hIGRpc3RyaWJ1dGlvbioqLgpgYGB7ciBNTV8yLCBtZXNzYWdlID0gRkFMU0V9Cm0gPC0gbWVhbihkaWFtb25kcyRwcmljZSkgIyBBdmVyYWdlIHByaWNlCnYgPC0gdmFyKGRpYW1vbmRzJHByaWNlKSAgIyBWYXJpYW5jZSBvZiBwcmljZXMKZ2dwbG90KGRpYW1vbmRzLCBhZXMoeCA9IHByaWNlKSkgKyBnZW9tX2hpc3RvZ3JhbShhZXMoeSA9IC4uZGVuc2l0eS4uKSkgKyAKICAgIHN0YXRfZnVuY3Rpb24oZnVuID0gZGV4cCwgYXJncyA9IGxpc3QoMS9tKSwgY29sb3IgPSAicmVkIikgKwogICAgc3RhdF9mdW5jdGlvbihmdW4gPSBkZ2FtbWEsIGFyZ3MgPSBsaXN0KHNoYXBlID0gbV4yL3YsIHJhdGUgPSBtL3YpLCBjb2xvciA9ICJjeWFuIikgCmBgYAoKVGhlIGRpZmZlcmVuY2UgYmV0d2VlbiB0aGUgdHdvIGN1cnZlcyBpcyBoYXJkIHRvIGRpc3Rpbmd1aXNoLgoKCiMjIE1heGltdW0gTGlrZWxpaG9vZCBFc3RpbWF0aW9uCkJlbG93LCB3ZSBwZXJmb3JtIHRoZSAqKk1MRSoqIG9mIHRoZSBwcmljZSBkaXN0cmlidXRpb24uIFdlIHNjYWxlIHRoZSBwcmljZXMgYnkgYSBraWxvIGZhY3RvciBhbmQgdXNlIGEgdGFyZ2V0IGdhbW1hIGRpc3RyaWJ1dGlvbi4gV2UgdXNlIHR3byBkaWZmZXJlbnQgZnVuY3Rpb25zOiBtbGUgKG9sZCkgYW5kIG1sZTIgKG5ldykuIFRoZSBsYXR0ZXIgaXMgbW9yZSB1c2VyLWZyaWVuZGx5LiBCZWxvdywgd2Ugc2hvdyB0aGF0IHRoZXkgeWllbGQgdGhlIHNhbWUgb3V0cHV0LgpgYGB7ciBmaXJzdCBlc3RpbWF0aW9uLCB3YXJuaW5nID0gRkFMU0V9CnBhciA8LSBtbGUyKHByaWNlIC8gMTAwMCB+IGRnYW1tYShzaGFwZSwgcmF0ZSksIAogICAgICAgICAgICBzdGFydCA9IGxpc3Qoc2hhcGUgPSAyLCByYXRlID0gMiksIAogICAgICAgICAgICBkYXRhID0gZGlhbW9uZHMpIAojIFVzaW5nIHRoZSBuZXcgcGFja2FnZTsgV0FSSU5JTkc6IHByaWNlcyB3ZXJlIHNjYWxlZC4KcGFyQGNvZWYgIyBKdXN0IHRvIHNlZSB0aGUgdmFsdWVzIG9mIHRoZSBwYXJhbWV0ZXJzIApuTEwgPC0gZnVuY3Rpb24ocywgcikgLXN1bShzdGF0czo6ZGdhbW1hKGRpYW1vbmRzJHByaWNlIC8gMTAwMCwgcywgciwgbG9nID0gVFJVRSkpIAojIFVzaW5nIHRoZSBvbGQgcGFja2FnZQpmaXQgPC0gbWxlKG5MTCwgc3RhcnQgPSBsaXN0KHMgPSAyLCByID0gMikpCmZpdEBjb2VmICMgQWdhaW4sIGp1c3QgaGF2aW5nIGEgbG9vazogdGhleSBtYXRjaCEKZGlhbW9uZHMgJT4lIGdncGxvdCgpICsgCiAgICBnZW9tX2hpc3RvZ3JhbShhZXMoeCA9IHByaWNlIC8gMTAwMCwgeSA9IC4uZGVuc2l0eS4uKSwgYmlucyA9IDkwKSArIAogICAgc3RhdF9mdW5jdGlvbihmdW4gPSBkZ2FtbWEsIGFyZ3MgPSBsaXN0KHNoYXBlID0gZml0QGNvZWZbMV0sIHJhdGUgPSBmaXRAY29lZlsyXSksIGNvbG9yID0gImN5YW4iKSArIAogICAgc3RhdF9mdW5jdGlvbihmdW4gPSBkZXhwLCBhcmdzID0gbGlzdCgxL20qMTAwMCksIGNvbG9yID0gInJlZCIpICMgUGxvdApgYGAKClRoZSBkaWZmZXJlbmNlIHdpdGggdGhlIG1vbWVudC1tYXRjaGluZyBkZW5zaXR5IGlzIHRoZSBzbWFsbCBidW1wIGNsb3NlIHRvIHplcm8uICAKCkEgbW9yZSBkaWZmaWN1bHQgdGFzayBpcyB0byBmaXQgYSBwYXJhbWV0cmljIGRpc3RyaWJ1dGlvbiB0byB0aGUgZGlzdHJpYnV0aW9uIG9mIGRpYW1vbmQgd2VpZ2h0cy4gSW5kZWVkLCB0aGUgZW1waXJpY2FsIGRpc3RyaWJ1dGlvbiBpcyBmYXIgZnJvbSBzbW9vdGggYmVjYXVzZSBzcGlrZXMgYXBwZWFyIGFyb3VuZCAncm91bmQnIHZhbHVlcyBvZiBjYXJhdCAoMSwgMS41LCAyLCAyLjUsIGV0Yy4pLgoKYGBge3Igc2Vjb25kIGVzdGltYXRpb24sIHdhcm5pbmcgPSBGQUxTRX0KcGFyIDwtIG1sZTIoY2FyYXQgfiBkZ2FtbWEoc2hhcGUsIHJhdGUpLCAKICAgICAgICAgICAgc3RhcnQgPSBsaXN0KHNoYXBlID0gMSwgcmF0ZSA9IDIpLCAKICAgICAgICAgICAgZGF0YSA9IGRpYW1vbmRzKSAKIyBVc2luZyB0aGUgbmV3IHBhY2thZ2UKcGFyQGNvZWYgIyBUaGUgZXN0aW1hdGVkIHBhcmFtZXRlcnMKbkxMIDwtIGZ1bmN0aW9uKHMsIHIpIC1zdW0oc3RhdHM6OmRnYW1tYShkaWFtb25kcyRjYXJhdCwgc2hhcGUgPSBzLCByYXRlID0gciwgbG9nID0gVFJVRSkpIAojIFVzaW5nIHRoZSBvbGQgcGFja2FnZQpmaXQgPC0gbWxlKG5MTCwgc3RhcnQgPSBsaXN0KHMgPSAyLCByID0gMikpCmZpdEBjb2VmICMgVGhlIGVzdGltYXRlZCBwYXJhbWV0ZXJzCmRpYW1vbmRzICU+JSBnZ3Bsb3QoKSArIAogICAgZ2VvbV9oaXN0b2dyYW0oYWVzKHggPSBjYXJhdCwgeSA9IC4uZGVuc2l0eS4uKSwgYmlucyA9IDUwKSArIAogICAgc3RhdF9mdW5jdGlvbihmdW4gPSBkZ2FtbWEsIGFyZ3MgPSBsaXN0KHNoYXBlID0gZml0QGNvZWZbMV0sIHJhdGUgPSBmaXRAY29lZlsyXSksIGNvbG9yID0gInJlZCIpCmBgYAoKSXQgaXMgaGFyZCBpbiB0aGlzIGNhc2UgdG8gZmluZCB0aGUgZml0IHZlcnkgaW1wcmVzc2l2ZTogdGhlIGdhbW1hIGRpc3RyaWJ1dGlvbiBpcyBzbW9vdGggYW5kIG5vdCBmaXQgZm9yIHRoaXMgdGFzay4KCkZpbmFsbHksIGFuZCBmb3IgdGhlIHNha2VzIG9mIGNvbXBsZXRlbmVzcywgd2UgZW5kIHdpdGggYSBjb3VudGVyLWV4YW1wbGUuIFdlIGRlZmluZSB0aGUgZGVuc2l0eSBvZiBhIHNjYWxlZCBTdHVkZW50IHZhcmlhYmxlIGFuZCB0aGVuIHBlcmZvcm0gTUxFIG92ZXIgdGhlIGNhcmF0IGRpc3RyaWJ1dGlvbi4KCmBgYHtyIHN0dWRlbnRfY291bnRlciwgbWVzc2FnZSA9IEZBTFNFLCB3YXJuaW5nID0gRkFMU0V9CnN0dWRlbnQgPC0gZnVuY3Rpb24oeCwgbSwgdiwgZGYsIGxvZyA9IEZBTFNFKXsKICAgIGlmKGxvZyA9PSBGQUxTRSkge3JldHVybihkdCh2KngrbSwgZGYgPSBkZikqdil9ICAgICAjIFJhdyB2YWx1ZQogICAgaWYobG9nID09IFRSVUUpIHtyZXR1cm4obG9nKGR0KHYqeCttLCBkZiA9IGRmKSp2KSl9ICMgTG9nIHZhbHVlCn0KCnBhciA8LSBtbGUyKGNhcmF0IH4gc3R1ZGVudChtLCB2LCBkZiksICAgICAgICAgICAgICAgICAgIyBUaGUgZXN0aW1hdGlvbgogICAgICAgICAgICBzdGFydCA9IGxpc3QobSA9IDIsIHYgPSAxLCBkZiA9IDUpLCAKICAgICAgICAgICAgZGF0YSA9IGRpYW1vbmRzKSAKCnBhckBjb2VmICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIyBUaGUgcGFyYW1ldGVyIHZhbHVlcwoKZGlhbW9uZHMgJT4lIGdncGxvdCgpICsgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAjIFRoZSBwbG90IQogICAgZ2VvbV9oaXN0b2dyYW0oYWVzKHggPSBjYXJhdCwgeSA9IC4uZGVuc2l0eS4uKSwgYmlucyA9IDUwKSArIAogICAgc3RhdF9mdW5jdGlvbihmdW4gPSBzdHVkZW50LCAKICAgICAgICAgICAgICAgICAgYXJncyA9IGxpc3QobSA9IHBhckBjb2VmWzFdLCB2ID0gcGFyQGNvZWZbMl0sIGRmID0gcGFyQGNvZWZbM10pLAogICAgICAgICAgICAgICAgICBjb2xvciA9ICJ5ZWxsb3ciKQpgYGAKClRoZSBmaXQgaXMgbm90IHNvIGRpZmZlcmVudCwgY29tcGFyZWQgdG8gdGhlIGdhbW1hIGN1cnZlLiBCdXQgbGV0IHVzIG5vdCBmb3JnZXQgdGhhdCB0aGUgdW5kZXJseWluZyBkaXN0cmlidXRpb24gaXMgZGVmaW5lZCBvbiB0aGUgd2hvbGUgcmVhbCBsaW5lOiBuZWdhdGl2ZSB2YWx1ZXMgYXJlIHBvc3NpYmxlIGluIHRoaXMgY2FzZS4KCgoKIyMgRXhlcmNpc2VzCgojIyMgRml0dGluZyEKMSkgU2V0IHlvdXIgd29ya2luZyBkaXJlY3RvcnksIGxvYWQgdGhlIHRpZHl2ZXJzZSBwYWNrYWdlICYgaW1wb3J0IHRoZSBBTkVTIGRhdGFzZXQuICAKMikgUmVwbGFjZSB0aGUgKG9yIGNyZWF0ZSBhIG5ldykgYWdlIHZhcmlhYmxlIHdpdGggaXRzIG5vcm1hbGl6ZWQgdmFsdWU6IChBZ2UgLSAxNSkgLyA4MC4gSGludDogdXNlIHRoZSAqKm11dGF0ZSoqKCkgZnVuY3Rpb24uICAKMykgQ2hlY2sgdGhhdCB0aGUgbmV3IHZhcmlhYmxlIGxpZXMgYmV0d2VlbiAwIGFuZCAxIChwbG90IGEgaGlzdG9ncmFtIGZvciBpbnN0YW5jZSwgb3IgbG9vayBhdCB0aGUgc3VtbWFyeSkuICoqQkVXQVJFKio6IGlmIGV4dHJlbWUgdmFsdWVzIGFyZSByZWFjaGVkICgwIG9yIDEpLCB0aGUgb3B0aW1pemVyIG1pZ2h0IG5vdCBsaWtlIGl0LiAgCjQpIEZpdCB0aGlzIHNjYWxlZCBhZ2UgdmFyaWFibGUgd2l0aCBhIGJldGEgZGlzdHJpYnV0aW9uLiAgCjUpIFBsb3QgdGhlIGNvcnJlc3BvbmRpbmcgZGlzdHJpYnV0aW9uIG92ZXIgdGhlIGhpc3RvZ3JhbS4KNikgRml0IHRoZSB2YXJpYWJsZSB0byBhIEdhdXNzaWFuIGRpc3RyaWJ1dGlvbi4gUGxvdCBhbmQgY29tcGFyZSB0byB0aGUgYmV0YSBmaXQuCgpgYGB7ciB5b3VyIHR1cm4hLCBtZXNzYWdlID0gRkFMU0UsIHdhcm5pbmcgPSBGQUxTRX0KCiAgICAKYGBgCgoKIyMjIFdvcmtpbmcgb24gc3Vic2V0cwpQZXJmb3JtIHRoZSBzYW1lIGFuYWx5c2lzIG9uIHRoZSBtYWxlL2ZlbWFsZSBzdWJzZXRzIChIaW50OiB1c2UgdGhlICoqZmlsdGVyKiooKSBmdW5jdGlvbikuIENvbXBhcmUgdGhlIHR3byEKCgojIyMgR2V0dGluZyBudXRzOiBmaXR0aW5nIGdhbW1hIHdpdGggYmV0YS4KMSkgQ3JlYXRlIG5ldyBkYXRhIHdpdGggdGhlIGNvbW1hbmRzIGJlbG93OiAgIAogIApuX3NhbXBsZSA8LSAxMF41ICAKeSA8LSByZ2FtbWEobl9zYW1wbGUsMiwyKSAgIAp5IDwtIHkvbWF4KHkrMSkgICAgCnggPC0gMTpuX3NhbXBsZSAgCmRhdGEgPC0gZGF0YS5mcmFtZSh4LCB5KSAgICAKIAoyKSBBZ2FpbiwgZml0IGEgYmV0YSBkaXN0cmlidXRpb24gb24geS4gUGxvdCB0aGUgZGF0YSBhbmQgZml0dGVkIGRpc3RyaWJ1dGlvbi4gIAozKSBTYW1lIHF1ZXN0aW9uIHdpdGggYSBnYW1tYSBsYXcuIExvb2sgYXQgdGhlIGdyYXBoaWNhbCBkaWZmZXJlbmNlIGJldHdlZW4gdGhlIHR3by4KCg==