First, the packages…
if(!require(factoextra)){install.packages("factoextra")}
library(tidyverse)
library(factoextra)
… and the data.
load("movies.RData")
Let’s then spend some time on variable engineering. To this purpose, let’s plot the distribution of variables (post-scaling).
NOTE: this is done in hindsight after encountering problems later on…
var_dist <- c("duration", "budget", "earnings", "imdb_score", "likes")
movies_km <- movies %>%
mutate(duration = (duration-min(duration)) / (max(duration)-min(duration)),
budget = (budget-min(budget)) / (max(budget)-min(budget)),
earnings = (earnings-min(earnings)) / (max(earnings)-min(earnings)),
imdb_score = (imdb_score-min(imdb_score)) / (max(imdb_score)-min(imdb_score)),
likes = (likes-min(likes))/(max(likes)-min(likes)))
movies_km %>%
select(var_dist) %>%
pivot_longer(all_of(var_dist), names_to = "key", values_to = "value") %>%
ggplot(aes(x = value)) + geom_histogram() + facet_grid(key ~ ., scales = "free")
There are outliers for earnings & budget (and arguably for duration.). These guys can cause problems (and ruin cluster visualizations). Let’s use a log transform to get the outliers back in the game.
movies_km <- movies %>%
mutate(duration = (duration-min(duration)) / max(duration),
budget = (log(budget)-min(log(budget))) / max(log(budget)),
earnings = (log(earnings+1)-min(log(earnings+1))) / max(log(earnings+1)), # Zero earnings!
imdb_score = (imdb_score-min(imdb_score)) / max(imdb_score),
likes = (log(likes+1)-min(log(likes+1)))/(max(log(likes+1))))
movies_km %>%
select(var_dist) %>%
pivot_longer(all_of(var_dist), names_to = "key", values_to = "value") %>%
ggplot(aes(x = value)) + geom_histogram() + facet_grid(key ~ ., scales = "free")
Much better! The log transform could also be applied to duration.
Let’s move forward with the algorithm. It is random, hence we fix the random seed.
set.seed(42) # Random seed.
km <- movies_km %>%
select(var_dist) %>% # Selects the variables for the computation
kmeans(5) # Five groups
km
K-means clustering with 5 clusters of sizes 407, 802, 1115, 561, 765
Cluster means:
duration budget earnings imdb_score likes
1 0.1951604 0.4603092 0.3851561 0.5234736 0.5600210
2 0.2179476 0.5195894 0.6809796 0.5846003 0.5531132
3 0.2123685 0.6330305 0.7945420 0.4958966 0.6684160
4 0.1819748 0.5844481 0.7213624 0.3462136 0.5960229
5 0.2825629 0.6403687 0.8316550 0.6263827 0.7239000
Clustering vector:
[1] 5 5 5 5 3 5 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 5 5 3 5 5 5 3 5 5 5 5 5 5 5 3 3 5 5 5 3 3 5 5 5 5 5 5 3 5 3 5
[53] 3 3 5 5 5 3 3 5 3 5 5 5 5 5 3 5 5 3 3 5 3 5 3 3 5 5 5 5 5 5 3 5 3 3 5 5 1 3 5 3 5 5 5 5 5 5 5 3 3 5 3 5
[105] 5 3 5 5 5 5 5 3 5 5 5 5 3 3 5 3 5 5 5 5 5 5 5 3 3 3 5 3 3 5 4 5 3 4 3 3 3 5 3 3 3 5 3 3 5 5 5 3 5 5 5 5
[157] 5 3 4 5 3 3 3 3 3 5 3 5 3 3 5 3 5 3 5 3 5 5 5 5 3 3 5 3 3 3 5 5 3 3 3 3 5 5 3 5 5 5 3 5 5 5 3 5 3 5 4 3
[209] 5 3 3 3 5 3 5 4 5 3 3 5 3 3 4 3 5 5 5 5 5 5 5 3 3 3 2 3 4 5 3 3 5 5 5 3 5 4 3 3 3 3 3 5 4 4 3 5 5 3 5 4
[261] 5 5 3 3 5 5 4 5 5 5 5 5 3 5 5 5 5 3 3 5 3 5 3 3 5 5 3 3 3 3 3 5 5 3 5 5 3 4 3 5 3 3 3 4 5 4 5 3 2 5 5 1
[313] 3 3 5 3 5 4 3 3 3 3 5 5 5 3 3 3 3 5 3 3 3 5 3 3 5 5 5 3 5 3 3 5 3 5 5 5 5 5 5 3 3 5 3 3 5 4 5 3 3 3 3 5
[365] 3 1 3 5 5 3 3 5 2 5 1 5 3 5 5 5 5 5 3 5 3 5 4 3 5 5 5 5 3 5 5 3 3 5 3 3 5 4 3 5 4 4 3 2 3 4 3 3 5 3 5 5
[417] 5 5 3 5 3 3 3 5 3 4 3 5 5 5 3 3 5 3 5 3 5 3 5 5 3 3 3 3 3 5 5 3 4 3 3 3 5 3 3 3 3 3 3 5 3 3 3 3 5 3 4 4
[469] 3 4 3 3 3 3 2 4 2 4 4 5 5 3 3 5 5 4 3 5 3 4 3 3 5 5 3 3 5 5 5 5 3 3 3 4 4 5 3 5 3 3 5 3 3 3 5 5 3 5 3 3
[521] 3 3 3 3 5 3 3 3 4 3 5 3 3 3 3 3 3 5 3 5 4 5 5 3 5 4 2 3 3 3 3 5 3 3 3 3 5 5 5 3 5 3 3 4 5 5 3 3 5 3 5 3
[573] 5 3 3 3 5 5 3 3 3 4 3 3 3 3 4 4 3 3 3 3 4 3 4 3 3 5 5 4 5 1 3 3 5 5 3 3 5 4 5 3 3 5 3 5 5 5 3 5 3 5 5 3
[625] 5 3 3 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 5 5 3 4 4 2 4 3 3 3 2 3 3 5 5 5 1 4 3 3 4 3 5 3 3 5 3 5 5 3 5 5 5
[677] 3 5 3 3 3 3 5 3 3 3 3 3 3 3 5 5 5 5 5 3 3 5 3 2 3 4 2 3 5 4 3 3 3 3 3 3 3 3 3 4 4 3 5 3 3 3 3 3 5 4 3 5
[729] 4 5 3 4 3 3 3 3 3 3 3 3 3 3 5 5 4 4 3 3 3 3 3 3 4 1 3 4 4 3 3 4 3 4 3 3 5 4 5 3 3 4 3 3 4 3 3 3 5 3 3 3
[781] 4 3 3 3 5 3 5 3 5 3 5 3 3 5 3 3 5 3 5 5 3 3 3 5 3 5 3 3 3 3 3 3 3 3 2 5 3 3 3 3 3 3 3 3 5 3 3 5 3 3 5 4
[833] 4 3 4 3 2 5 5 3 5 3 3 3 4 3 3 3 5 3 5 5 5 5 3 3 5 5 3 3 3 4 5 3 2 5 2 4 3 3 3 2 5 5 5 5 5 5 3 3 3 5 3 5
[885] 5 5 3 5 3 3 5 3 4 3 3 5 3 3 5 5 5 3 5 3 3 4 3 3 3 3 5 5 5 3 5 3 5 3 3 3 5 3 4 3 5 3 3 3 3 4 4 3 3 2 5 4
[937] 4 3 3 5 3 3 3 3 3 3 3 5 5 5 5 4 3 5 3 2 3 3 3 3 4 3 3 4 3 3 3 4 4 4 3 4 3 3 5 5 5 5 5 3 3 3 5 3 3 3 5 3
[989] 3 3 3 3 3 4 3 3 3 5 5 4
[ reached getOption("max.print") -- omitted 2650 entries ]
Within cluster sum of squares by cluster:
[1] 23.16010 27.08870 19.79035 12.94434 19.66216
(between_SS / total_SS = 55.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"
[8] "iter" "ifault"
There are lots of informations. The means indicate the centers of the clusters.
Let’s have a look at the compositions of the groups.
Visualization
var_dist <- c("duration", "budget", "earnings", "imdb_score", "likes")
plot_vars <- c(3,5)
fviz_cluster(km,
movies_km %>% select(var_dist), # Scaled data
stand = FALSE,
choose.vars = c(var_dist[plot_vars[1]],var_dist[plot_vars[2]]),
#axes = plot_vars, # Chooses the variables for the plot
geom = "point", # Geom type (better than "text")
xlab = var_dist[plot_vars[1]],
ylab = var_dist[plot_vars[2]],
ellipse.type = "norm")
Let’s have a look on another set of variables, but on the original data.
plot_vars <- c(1,3)
fviz_cluster(km,
movies %>% select(var_dist), # Raw data
stand = FALSE,
choose.vars = c(var_dist[plot_vars[1]],var_dist[plot_vars[2]]),
#axes = plot_vars, # Chooses the variables for the plot
geom = "point", # Geom type (better than "text")
xlab = var_dist[plot_vars[1]],
ylab = var_dist[plot_vars[2]],
ellipse.type = "norm")
One overarching cluster (red) and four more specialized groups. The scaling does have a big impact on the visualization.