First steps

First, the packages…

if(!require(factoextra)){install.packages("factoextra")}
library(tidyverse)
library(factoextra)

… and the data.

load("movies.RData")

Feature engineering

Let’s then spend some time on variable engineering. To this purpose, let’s plot the distribution of variables (post-scaling).
NOTE: this is done in hindsight after encountering problems later on…

var_dist <- c("duration", "budget", "earnings", "imdb_score", "likes")
movies_km <- movies %>%
    mutate(duration = (duration-min(duration)) / (max(duration)-min(duration)),
           budget = (budget-min(budget)) / (max(budget)-min(budget)),
           earnings = (earnings-min(earnings)) / (max(earnings)-min(earnings)),
           imdb_score = (imdb_score-min(imdb_score)) / (max(imdb_score)-min(imdb_score)),
           likes = (likes-min(likes))/(max(likes)-min(likes))) 
movies_km %>% 
    select(var_dist) %>%
    pivot_longer(all_of(var_dist), names_to = "key", values_to = "value") %>%
    ggplot(aes(x = value)) + geom_histogram() + facet_grid(key ~ ., scales = "free")

There are outliers for earnings & budget (and arguably for duration.). These guys can cause problems (and ruin cluster visualizations). Let’s use a log transform to get the outliers back in the game.

movies_km <- movies %>%
    mutate(duration = (duration-min(duration)) / max(duration),
           budget = (log(budget)-min(log(budget))) / max(log(budget)),
           earnings = (log(earnings+1)-min(log(earnings+1))) / max(log(earnings+1)), # Zero earnings!
           imdb_score = (imdb_score-min(imdb_score)) / max(imdb_score),
           likes = (log(likes+1)-min(log(likes+1)))/(max(log(likes+1)))) 
movies_km %>% 
    select(var_dist) %>%
    pivot_longer(all_of(var_dist), names_to = "key", values_to = "value") %>%
    ggplot(aes(x = value)) + geom_histogram() + facet_grid(key ~ ., scales = "free")

Much better! The log transform could also be applied to duration.

Let’s move forward with the algorithm. It is random, hence we fix the random seed.

set.seed(42)               # Random seed.
km <- movies_km %>%
    select(var_dist) %>%   # Selects the variables for the computation
    kmeans(5)              # Five groups
km
K-means clustering with 5 clusters of sizes 407, 802, 1115, 561, 765

Cluster means:
   duration    budget  earnings imdb_score     likes
1 0.1951604 0.4603092 0.3851561  0.5234736 0.5600210
2 0.2179476 0.5195894 0.6809796  0.5846003 0.5531132
3 0.2123685 0.6330305 0.7945420  0.4958966 0.6684160
4 0.1819748 0.5844481 0.7213624  0.3462136 0.5960229
5 0.2825629 0.6403687 0.8316550  0.6263827 0.7239000

Clustering vector:
   [1] 5 5 5 5 3 5 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 5 5 3 5 5 5 3 5 5 5 5 5 5 5 3 3 5 5 5 3 3 5 5 5 5 5 5 3 5 3 5
  [53] 3 3 5 5 5 3 3 5 3 5 5 5 5 5 3 5 5 3 3 5 3 5 3 3 5 5 5 5 5 5 3 5 3 3 5 5 1 3 5 3 5 5 5 5 5 5 5 3 3 5 3 5
 [105] 5 3 5 5 5 5 5 3 5 5 5 5 3 3 5 3 5 5 5 5 5 5 5 3 3 3 5 3 3 5 4 5 3 4 3 3 3 5 3 3 3 5 3 3 5 5 5 3 5 5 5 5
 [157] 5 3 4 5 3 3 3 3 3 5 3 5 3 3 5 3 5 3 5 3 5 5 5 5 3 3 5 3 3 3 5 5 3 3 3 3 5 5 3 5 5 5 3 5 5 5 3 5 3 5 4 3
 [209] 5 3 3 3 5 3 5 4 5 3 3 5 3 3 4 3 5 5 5 5 5 5 5 3 3 3 2 3 4 5 3 3 5 5 5 3 5 4 3 3 3 3 3 5 4 4 3 5 5 3 5 4
 [261] 5 5 3 3 5 5 4 5 5 5 5 5 3 5 5 5 5 3 3 5 3 5 3 3 5 5 3 3 3 3 3 5 5 3 5 5 3 4 3 5 3 3 3 4 5 4 5 3 2 5 5 1
 [313] 3 3 5 3 5 4 3 3 3 3 5 5 5 3 3 3 3 5 3 3 3 5 3 3 5 5 5 3 5 3 3 5 3 5 5 5 5 5 5 3 3 5 3 3 5 4 5 3 3 3 3 5
 [365] 3 1 3 5 5 3 3 5 2 5 1 5 3 5 5 5 5 5 3 5 3 5 4 3 5 5 5 5 3 5 5 3 3 5 3 3 5 4 3 5 4 4 3 2 3 4 3 3 5 3 5 5
 [417] 5 5 3 5 3 3 3 5 3 4 3 5 5 5 3 3 5 3 5 3 5 3 5 5 3 3 3 3 3 5 5 3 4 3 3 3 5 3 3 3 3 3 3 5 3 3 3 3 5 3 4 4
 [469] 3 4 3 3 3 3 2 4 2 4 4 5 5 3 3 5 5 4 3 5 3 4 3 3 5 5 3 3 5 5 5 5 3 3 3 4 4 5 3 5 3 3 5 3 3 3 5 5 3 5 3 3
 [521] 3 3 3 3 5 3 3 3 4 3 5 3 3 3 3 3 3 5 3 5 4 5 5 3 5 4 2 3 3 3 3 5 3 3 3 3 5 5 5 3 5 3 3 4 5 5 3 3 5 3 5 3
 [573] 5 3 3 3 5 5 3 3 3 4 3 3 3 3 4 4 3 3 3 3 4 3 4 3 3 5 5 4 5 1 3 3 5 5 3 3 5 4 5 3 3 5 3 5 5 5 3 5 3 5 5 3
 [625] 5 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 5 5 3 4 4 2 4 3 3 3 2 3 3 5 5 5 1 4 3 3 4 3 5 3 3 5 3 5 5 3 5 5 5
 [677] 3 5 3 3 3 3 5 3 3 3 3 3 3 3 5 5 5 5 5 3 3 5 3 2 3 4 2 3 5 4 3 3 3 3 3 3 3 3 3 4 4 3 5 3 3 3 3 3 5 4 3 5
 [729] 4 5 3 4 3 3 3 3 3 3 3 3 3 3 5 5 4 4 3 3 3 3 3 3 4 1 3 4 4 3 3 4 3 4 3 3 5 4 5 3 3 4 3 3 4 3 3 3 5 3 3 3
 [781] 4 3 3 3 5 3 5 3 5 3 5 3 3 5 3 3 5 3 5 5 3 3 3 5 3 5 3 3 3 3 3 3 3 3 2 5 3 3 3 3 3 3 3 3 5 3 3 5 3 3 5 4
 [833] 4 3 4 3 2 5 5 3 5 3 3 3 4 3 3 3 5 3 5 5 5 5 3 3 5 5 3 3 3 4 5 3 2 5 2 4 3 3 3 2 5 5 5 5 5 5 3 3 3 5 3 5
 [885] 5 5 3 5 3 3 5 3 4 3 3 5 3 3 5 5 5 3 5 3 3 4 3 3 3 3 5 5 5 3 5 3 5 3 3 3 5 3 4 3 5 3 3 3 3 4 4 3 3 2 5 4
 [937] 4 3 3 5 3 3 3 3 3 3 3 5 5 5 5 4 3 5 3 2 3 3 3 3 4 3 3 4 3 3 3 4 4 4 3 4 3 3 5 5 5 5 5 3 3 3 5 3 3 3 5 3
 [989] 3 3 3 3 3 4 3 3 3 5 5 4
 [ reached getOption("max.print") -- omitted 2650 entries ]

Within cluster sum of squares by cluster:
[1] 23.16010 27.08870 19.79035 12.94434 19.66216
 (between_SS / total_SS =  55.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
[8] "iter"         "ifault"      

There are lots of informations. The means indicate the centers of the clusters.
Let’s have a look at the compositions of the groups.

Visualization

var_dist <- c("duration", "budget", "earnings", "imdb_score", "likes")
plot_vars <- c(3,5)
fviz_cluster(km, 
             movies_km %>% select(var_dist),    # Scaled data
             stand = FALSE,
             choose.vars = c(var_dist[plot_vars[1]],var_dist[plot_vars[2]]),
             #axes = plot_vars,                 # Chooses the variables for the plot 
             geom = "point",                    # Geom type (better than "text")
             xlab = var_dist[plot_vars[1]],
             ylab = var_dist[plot_vars[2]],
             ellipse.type = "norm")

Let’s have a look on another set of variables, but on the original data.

plot_vars <- c(1,3)
fviz_cluster(km, 
             movies %>% select(var_dist),   # Raw data
             stand = FALSE,
             choose.vars = c(var_dist[plot_vars[1]],var_dist[plot_vars[2]]),
             #axes = plot_vars,                 # Chooses the variables for the plot 
             geom = "point",                    # Geom type (better than "text")
             xlab = var_dist[plot_vars[1]],
             ylab = var_dist[plot_vars[2]],
             ellipse.type = "norm") 

One overarching cluster (red) and four more specialized groups. The scaling does have a big impact on the visualization.

LS0tCnRpdGxlOiAiUzg6IEstbWVhbnMgY2x1c3RlcmluZyIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKIyBGaXJzdCBzdGVwcwoKRmlyc3QsIHRoZSBwYWNrYWdlcy4uLgoKYGBge3IsIG1lc3NhZ2UgPSBGQUxTRSwgd2FybmluZyA9IEZBTFNFfQppZighcmVxdWlyZShmYWN0b2V4dHJhKSl7aW5zdGFsbC5wYWNrYWdlcygiZmFjdG9leHRyYSIpfQpgYGAKCmBgYHtyLCB3YXJuaW5nID0gRkFMU0UsIG1lc3NhZ2UgPSBGQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkoZmFjdG9leHRyYSkKYGBgCgouLi4gYW5kIHRoZSBkYXRhLgoKYGBge3J9CmxvYWQoIm1vdmllcy5SRGF0YSIpCmBgYAoKIyBGZWF0dXJlIGVuZ2luZWVyaW5nCgpMZXQncyB0aGVuIHNwZW5kIHNvbWUgdGltZSBvbiB2YXJpYWJsZSBlbmdpbmVlcmluZy4gVG8gdGhpcyBwdXJwb3NlLCBsZXQncyBwbG90IHRoZSBkaXN0cmlidXRpb24gb2YgdmFyaWFibGVzIChwb3N0LXNjYWxpbmcpLiAgICAKKipOT1RFKio6IHRoaXMgaXMgZG9uZSBpbiBoaW5kc2lnaHQgYWZ0ZXIgZW5jb3VudGVyaW5nIHByb2JsZW1zIGxhdGVyIG9uLi4uIAoKYGBge3IsIG1lc3NhZ2UgPSBGQUxTRSwgd2FybmluZyA9IEZBTFNFfQp2YXJfZGlzdCA8LSBjKCJkdXJhdGlvbiIsICJidWRnZXQiLCAiZWFybmluZ3MiLCAiaW1kYl9zY29yZSIsICJsaWtlcyIpCm1vdmllc19rbSA8LSBtb3ZpZXMgJT4lCiAgICBtdXRhdGUoZHVyYXRpb24gPSAoZHVyYXRpb24tbWluKGR1cmF0aW9uKSkgLyAobWF4KGR1cmF0aW9uKS1taW4oZHVyYXRpb24pKSwKICAgICAgICAgICBidWRnZXQgPSAoYnVkZ2V0LW1pbihidWRnZXQpKSAvIChtYXgoYnVkZ2V0KS1taW4oYnVkZ2V0KSksCiAgICAgICAgICAgZWFybmluZ3MgPSAoZWFybmluZ3MtbWluKGVhcm5pbmdzKSkgLyAobWF4KGVhcm5pbmdzKS1taW4oZWFybmluZ3MpKSwKICAgICAgICAgICBpbWRiX3Njb3JlID0gKGltZGJfc2NvcmUtbWluKGltZGJfc2NvcmUpKSAvIChtYXgoaW1kYl9zY29yZSktbWluKGltZGJfc2NvcmUpKSwKICAgICAgICAgICBsaWtlcyA9IChsaWtlcy1taW4obGlrZXMpKS8obWF4KGxpa2VzKS1taW4obGlrZXMpKSkgCm1vdmllc19rbSAlPiUgCiAgICBzZWxlY3QodmFyX2Rpc3QpICU+JQogICAgcGl2b3RfbG9uZ2VyKGFsbF9vZih2YXJfZGlzdCksIG5hbWVzX3RvID0gImtleSIsIHZhbHVlc190byA9ICJ2YWx1ZSIpICU+JQogICAgZ2dwbG90KGFlcyh4ID0gdmFsdWUpKSArIGdlb21faGlzdG9ncmFtKCkgKyBmYWNldF9ncmlkKGtleSB+IC4sIHNjYWxlcyA9ICJmcmVlIikKYGBgCgpUaGVyZSBhcmUgb3V0bGllcnMgZm9yICoqZWFybmluZ3MqKiAmICoqYnVkZ2V0KiogKGFuZCBhcmd1YWJseSBmb3IgKipkdXJhdGlvbioqLikuIFRoZXNlIGd1eXMgY2FuIGNhdXNlIHByb2JsZW1zIChhbmQgcnVpbiBjbHVzdGVyIHZpc3VhbGl6YXRpb25zKS4KTGV0J3MgdXNlIGEgbG9nIHRyYW5zZm9ybSB0byBnZXQgdGhlIG91dGxpZXJzIGJhY2sgaW4gdGhlIGdhbWUuCgpgYGB7ciwgbWVzc2FnZSA9IEZBTFNFLCB3YXJuaW5nID0gRkFMU0V9Cm1vdmllc19rbSA8LSBtb3ZpZXMgJT4lCiAgICBtdXRhdGUoZHVyYXRpb24gPSAoZHVyYXRpb24tbWluKGR1cmF0aW9uKSkgLyBtYXgoZHVyYXRpb24pLAogICAgICAgICAgIGJ1ZGdldCA9IChsb2coYnVkZ2V0KS1taW4obG9nKGJ1ZGdldCkpKSAvIG1heChsb2coYnVkZ2V0KSksCiAgICAgICAgICAgZWFybmluZ3MgPSAobG9nKGVhcm5pbmdzKzEpLW1pbihsb2coZWFybmluZ3MrMSkpKSAvIG1heChsb2coZWFybmluZ3MrMSkpLCAjIFplcm8gZWFybmluZ3MhCiAgICAgICAgICAgaW1kYl9zY29yZSA9IChpbWRiX3Njb3JlLW1pbihpbWRiX3Njb3JlKSkgLyBtYXgoaW1kYl9zY29yZSksCiAgICAgICAgICAgbGlrZXMgPSAobG9nKGxpa2VzKzEpLW1pbihsb2cobGlrZXMrMSkpKS8obWF4KGxvZyhsaWtlcysxKSkpKSAKbW92aWVzX2ttICU+JSAKICAgIHNlbGVjdCh2YXJfZGlzdCkgJT4lCiAgICBwaXZvdF9sb25nZXIoYWxsX29mKHZhcl9kaXN0KSwgbmFtZXNfdG8gPSAia2V5IiwgdmFsdWVzX3RvID0gInZhbHVlIikgJT4lCiAgICBnZ3Bsb3QoYWVzKHggPSB2YWx1ZSkpICsgZ2VvbV9oaXN0b2dyYW0oKSArIGZhY2V0X2dyaWQoa2V5IH4gLiwgc2NhbGVzID0gImZyZWUiKQpgYGAKCk11Y2ggYmV0dGVyISBUaGUgbG9nIHRyYW5zZm9ybSBjb3VsZCBhbHNvIGJlIGFwcGxpZWQgdG8gKipkdXJhdGlvbioqLgoKTGV0J3MgbW92ZSBmb3J3YXJkIHdpdGggdGhlIGFsZ29yaXRobS4gSXQgaXMgcmFuZG9tLCBoZW5jZSB3ZSBmaXggdGhlIHJhbmRvbSBzZWVkLgoKYGBge3J9CnNldC5zZWVkKDQyKSAgICAgICAgICAgICAgICMgUmFuZG9tIHNlZWQuCmttIDwtIG1vdmllc19rbSAlPiUKICAgIHNlbGVjdCh2YXJfZGlzdCkgJT4lICAgIyBTZWxlY3RzIHRoZSB2YXJpYWJsZXMgZm9yIHRoZSBjb21wdXRhdGlvbgogICAga21lYW5zKDUpICAgICAgICAgICAgICAjIEZpdmUgZ3JvdXBzCmttCmBgYAoKVGhlcmUgYXJlIGxvdHMgb2YgaW5mb3JtYXRpb25zLiBUaGUgbWVhbnMgaW5kaWNhdGUgdGhlIGNlbnRlcnMgb2YgdGhlIGNsdXN0ZXJzLiAgIApMZXQncyBoYXZlIGEgbG9vayBhdCB0aGUgY29tcG9zaXRpb25zIG9mIHRoZSBncm91cHMuIAoKYGBge3J9Cm1vdmllc1t3aGljaChrbSRjbHVzdGVyPT0xKSxdICMgRmlyc3QgY2x1c3Rlcgptb3ZpZXNbd2hpY2goa20kY2x1c3Rlcj09NSksXSAjIEZpZnRoIGNsdXN0ZXIKYGBgCgoKVmlzdWFsaXphdGlvbgoKYGBge3J9CnZhcl9kaXN0IDwtIGMoImR1cmF0aW9uIiwgImJ1ZGdldCIsICJlYXJuaW5ncyIsICJpbWRiX3Njb3JlIiwgImxpa2VzIikKcGxvdF92YXJzIDwtIGMoMyw1KQpmdml6X2NsdXN0ZXIoa20sIAogICAgICAgICAgICAgbW92aWVzX2ttICU+JSBzZWxlY3QodmFyX2Rpc3QpLCAgICAjIFNjYWxlZCBkYXRhCiAgICAgICAgICAgICBzdGFuZCA9IEZBTFNFLAogICAgICAgICAgICAgY2hvb3NlLnZhcnMgPSBjKHZhcl9kaXN0W3Bsb3RfdmFyc1sxXV0sdmFyX2Rpc3RbcGxvdF92YXJzWzJdXSksCiAgICAgICAgICAgICAjYXhlcyA9IHBsb3RfdmFycywgICAgICAgICAgICAgICAgICMgQ2hvb3NlcyB0aGUgdmFyaWFibGVzIGZvciB0aGUgcGxvdCAKICAgICAgICAgICAgIGdlb20gPSAicG9pbnQiLCAgICAgICAgICAgICAgICAgICAgIyBHZW9tIHR5cGUgKGJldHRlciB0aGFuICJ0ZXh0IikKICAgICAgICAgICAgIHhsYWIgPSB2YXJfZGlzdFtwbG90X3ZhcnNbMV1dLAogICAgICAgICAgICAgeWxhYiA9IHZhcl9kaXN0W3Bsb3RfdmFyc1syXV0sCiAgICAgICAgICAgICBlbGxpcHNlLnR5cGUgPSAibm9ybSIpCmBgYAoKTGV0J3MgaGF2ZSBhIGxvb2sgb24gYW5vdGhlciBzZXQgb2YgdmFyaWFibGVzLCBidXQgb24gdGhlIG9yaWdpbmFsIGRhdGEuCgpgYGB7cn0KcGxvdF92YXJzIDwtIGMoMSwzKQpmdml6X2NsdXN0ZXIoa20sIAogICAgICAgICAgICAgbW92aWVzICU+JSBzZWxlY3QodmFyX2Rpc3QpLCAgICMgUmF3IGRhdGEKICAgICAgICAgICAgIHN0YW5kID0gRkFMU0UsCiAgICAgICAgICAgICBjaG9vc2UudmFycyA9IGModmFyX2Rpc3RbcGxvdF92YXJzWzFdXSx2YXJfZGlzdFtwbG90X3ZhcnNbMl1dKSwKICAgICAgICAgICAgICNheGVzID0gcGxvdF92YXJzLCAgICAgICAgICAgICAgICAgIyBDaG9vc2VzIHRoZSB2YXJpYWJsZXMgZm9yIHRoZSBwbG90IAogICAgICAgICAgICAgZ2VvbSA9ICJwb2ludCIsICAgICAgICAgICAgICAgICAgICAjIEdlb20gdHlwZSAoYmV0dGVyIHRoYW4gInRleHQiKQogICAgICAgICAgICAgeGxhYiA9IHZhcl9kaXN0W3Bsb3RfdmFyc1sxXV0sCiAgICAgICAgICAgICB5bGFiID0gdmFyX2Rpc3RbcGxvdF92YXJzWzJdXSwKICAgICAgICAgICAgIGVsbGlwc2UudHlwZSA9ICJub3JtIikgCmBgYAoKT25lIG92ZXJhcmNoaW5nIGNsdXN0ZXIgKHJlZCkgYW5kIGZvdXIgbW9yZSBzcGVjaWFsaXplZCBncm91cHMuIFRoZSBzY2FsaW5nIGRvZXMgaGF2ZSBhIGJpZyBpbXBhY3Qgb24gdGhlIHZpc3VhbGl6YXRpb24uCgpgYGB7cn0KCmBgYAoK