Before we start…

  • Who has tried deploying his/her app?
  • What do you know about clustering?
  • Who likes math?

Introduction

Key notions (simple version)

In this deck, clustering will refer to techniques that, given a particular structured dataset, seek to create homogeneous groups of observations.

Key question: how do you define “homogeneous”? Answer: via distances.

Example: assume songs are characterized only by their danceability:

  • Poker Face by Lady Gaga has a danceability of 0.851.
  • Bridge Over Troubled Water by Simon and Garfunkel has a danceability of 0.149.
  • Lose Yourself by Eminem has a danceability of 0.7 (roughly, depending on the version).

\(\rightarrow\) Lose Yourself is closer to Poker Face because the distance between the two scores is smaller.
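
A quick check in R with the danceability scores quoted above:

dance <- c(poker_face = 0.851, bridge = 0.149, lose_yourself = 0.700)
abs(dance["lose_yourself"] - dance["poker_face"])  # 0.151: close
abs(dance["lose_yourself"] - dance["bridge"])      # 0.551: much farther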

Key notions (math version)

A tabular dataset is a matrix \(X_{i,j}\) such that:

  • i (= row number!) is the instance/observation (the song, the movie, the diamond, the country, etc.)
  • j (= column number!) is the variable (danceability, duration, imdb_score, cut, color, clarity, carat, lifeExp, population, etc.)

For a single variable, the distance is computed with \(j\) fixed while \(i\) changes:

\[D_{\text{dance}}(\text{Poker Face, Lose Yourself})= \left|X_{\text{Poker Face}, \text{Dance}}-X_{\text{Lose Yourself}, \text{Dance}}\right|\] In a more formal way, we can write that \[D_j(i_1,i_2)=\left|X_{i_1,j}-X_{i_2,j}\right|,\] if \(j\)=Dance and \(i_1\)=Poker Face and \(i_2\)=Lose Yourself.

Introducing a little complexity…

Ok, but what if we want to measure proximity according to other criteria (and not just danceability)? \[\begin{align} D_{\text{dance & duration}}(\text{Poker Face, Lose Yourself})=& \left|X_{\text{Poker Face}, \text{Dance}}-X_{\text{Lose Yourself}, \text{Dance}}\right| \\ &+\left|X_{\text{Poker Face}, \text{Duration}}-X_{\text{Lose Yourself}, \text{Duration}}\right| \end{align}\]


Problem: danceability is on a scale from 0 to 1 while duration is expressed in minutes!

\[\begin{align} D_{\text{dance & duration}}(\text{Poker Face, Lose Yourself})=&w_{\text{dance}} \left|X_{\text{Poker Face}, \text{Dance}}-X_{\text{Lose Yourself}, \text{Dance}}\right| \\ &+w_{\text{duration}}\left|X_{\text{Poker Face}, \text{Duration}}-X_{\text{Lose Yourself}, \text{Duration}}\right| \end{align}\]


This is very convenient: it allows us to give more or less weight to particular attributes (they may not all be equally useful)!

\(\rightarrow\) see also variable engineering & scaling (below)

So, formally

\[D(i_1,i_2)=\sum_{j=1}^J\underbrace{w_j }_{\text{weight of variable }j} \quad \times \quad \underbrace{\left| X_{i_1,j}-X_{i_2,j} \right|}_{\text{distance over variable j}}\] Note: sometimes, it is useful to use squared distances:

\[D(i_1,i_2)=\sum_{j=1}^J\underbrace{w_j }_{\text{weight of variable }j} \quad \times \quad \underbrace{\left( X_{i_1,j}-X_{i_2,j} \right)^2}_{\text{distance over variable j}}\]
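
As a sketch, the weighted (absolute) distance translates into a small R function; the duration values and the weights below are purely illustrative:

# weighted L1 distance between two observations (numeric vectors of equal length)
weighted_distance <- function(x1, x2, w) sum(w * abs(x1 - x2))

x1 <- c(dance = 0.851, duration = 3.9)    # danceability from the example, made-up duration (minutes)
x2 <- c(dance = 0.700, duration = 5.4)
weighted_distance(x1, x2, w = c(1, 0.1))  # down-weight duration to offset its larger scale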

Distances with categorical data

What if variables are not numbers but (unordered) categories? (Ordered categories can simply be encoded as numbers.)

There are many ways to proceed (see the paper Similarity Measures for Categorical Data: A Comparative Evaluation).

The simplest choice is binary:

\[d_j(i_1,i_2)=\left\{\begin{array}{clc} 0& \text{ if } X_{i_1,j}=X_{i_2,j} \\ 1& \text{ otherwise} \end{array} \right.\]

\(\rightarrow\) Either the categories are the same and hence the distance is zero or they are different and the distance is one.

Imagine there is a song genre variable: Poker Face could be Pop while Lose Yourself would be Rap, hence the distance would be one.
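
A one-line sketch of this binary distance in R, using the genre example above:

# 0 if the two categories coincide, 1 otherwise
binary_distance <- function(c1, c2) as.numeric(c1 != c2)
binary_distance("Pop", "Rap")  # 1
binary_distance("Pop", "Pop")  # 0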

Variable engineering: scaling

A very important quantity is the range of the variables. Ranges are sometimes very different across the dataset: this must be dealt with!

\(\rightarrow\) Normalization / standardization: \[x_i \leftarrow \frac{x_i-\mu}{\sigma},\] where \(\mu\) is the average of the \(x_i\) and \(\sigma\) the standard deviation. Then, the \(x_i\) have zero mean and unit variance.

\(\rightarrow\) Min-max rescaling: \[x_i \leftarrow \frac{x_i-\min(\textbf{x})}{\max(\textbf{x})-\min(\textbf{x})}\] Then, the \(x_i\) are all in the \([0,1]\) interval.
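
Both rescalings are one-liners in R; a sketch on an arbitrary numeric vector x:

x <- c(2, 5, 9, 14)               # arbitrary example values
(x - mean(x)) / sd(x)             # standardization (compare with scale(x))
(x - min(x)) / (max(x) - min(x))  # min-max rescaling into [0, 1]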

Algorithms

Overview

  • Nearest neighbors are local and instance-focused (computationally costly). For one given instance, they seek those that are the most similar. Assume you are interested in one particular client: to predict his/her behavior, you will try to find those that are similar to him/her.

  • k-means is global: the aim is to form groups whose members are as similar as possible to one another (within-group homogeneity). (unsupervised learning)

  • decision trees try to understand which variables have an impact on one particular outcome. (supervised learning)

Nearest Neighbors (NN)

Here, \(i_1\) is fixed: it is the target instance (movie, song, diamond, customer, etc.). The NN algorithm seeks to find the instances that are the closest to \(i_1\).

It’s rather simple. For a given distance \(D(\cdot, \cdot)\), you just need to compute ALL distances \(D(i_1,i)\)! This is why it’s computationally intense.

Then, keep only the (\(k\)) instances that are the closest to the target, i.e., those with the smallest distances.
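
A brute-force sketch in R; X (a numeric matrix of already-scaled variables), i1 and k are placeholders:

# k nearest neighbors of row i1 (all distances are computed, hence the cost)
nearest_neighbors <- function(X, i1, k) {
  target <- X[i1, ]
  d <- apply(X, 1, function(row) sum(abs(row - target)))  # L1 distance to every row
  d[i1] <- Inf                                            # exclude the target itself
  order(d)[1:k]                                           # indices of the k closest rows
}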

Prediction

  • Let’s say you know some attributes of a new client (=variables!): age, gender, purchase history (frequency, amount, type), etc.

  • Let’s also say that you have a large database of clients, some of them being similar to the new profile (its \(k\) nearest neighbors).

  • In this database, you have a key figure: the probability that the customer will buy a product, or will respond to a particular ad campaign, for instance.

  • Given this data, you can try to predict this probability for the new client!

  • Indeed, the assumption is that neighbors carry information and that the new customer will behave like them.

Mathematically

If the variable to predict is \(X_{\cdot,m}\) (where \(m\) does not belong to the set of variables used to compute the distances), then the simple average over the \(K\) neighbors is: \[\text{prediction}=\frac{1}{K}\sum_{k=1}^KX_{k,m}.\] In order to refine the analysis, it’s possible to give more weight to the instances that are the closest to the target: \[\text{prediction}=\sum_{k=1}^K\underbrace{\frac{e^{-\alpha D(i_1,k)}}{\sum_{k'=1}^Ke^{-\alpha D(i_1,k')}}}_{\text{weighting factor}}X_{k,m}.\] The weighting factors must sum to one, which is why the \(1/K\) term disappears!
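
A sketch of both predictions in R; y and d are placeholders for the neighbors' values of the variable to predict and for their distances to the target:

# y: values X_{k,m} of the K retained neighbors
# d: their distances to the target (optional), alpha: decay parameter
knn_predict <- function(y, d = NULL, alpha = 1) {
  if (is.null(d)) return(mean(y))   # simple average
  w <- exp(-alpha * d)
  w <- w / sum(w)                   # weighting factors sum to one
  sum(w * y)                        # distance-weighted average
}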

k-means: introduction & notation

  • k-NN is local: it seeks to determine the vicinity of one particular instance.

  • k-means seeks to partition a dataset into globally homogeneous clusters \(S_i\). For each group, we can compute the average value of the variables/attributes:

\[\textbf{m}_i=\frac{1}{\text{card}(S_i)}\sum_{\textbf{x}\in S_i}\textbf{x}\]

Inside one cluster, the dispersion over all variables is measured by \[\sum_{\textbf{x}\in S_i}||\textbf{x}-\textbf{m}_i||^2\]

where \(\|\cdot\|\) is a distance (e.g., the Euclidean norm). The smaller this quantity, the more homogeneous the cluster.

k-means: algorithm

The objective of \(k\)-means is to minimize the total dispersion over all clusters: \[\underset{\textbf{S}\in \mathcal{S}}{\min} \ \sum_{i=1}^k\sum_{\textbf{x}\in S_i}||\textbf{x}-\textbf{m}_i||^2,\] where \(\mathcal{S}\) is the set of all possible partitions. This minimization is very hard to achieve because it requires computing the dispersion of every possible partition into clusters.

That is infeasible in practice unless the sample is very small (roughly 15 instances or fewer).

Thus, we resort to approximate algorithms.

\(\rightarrow\) the choice of \(k\) is usually not easy/obvious.
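
In practice, the approximate algorithm is readily available in base R via kmeans(); a minimal sketch, assuming a numeric, already-scaled matrix X and an (arbitrary) choice of k = 3:

set.seed(1)                                 # the algorithm is randomly initialized
fit <- kmeans(X, centers = 3, nstart = 20)  # nstart: number of random restarts
fit$cluster                                 # cluster membership of each observation
fit$tot.withinss                            # total within-cluster dispersion (the minimized quantity)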

Hierarchical clustering (1/2)

Top-down versus Bottom-up.
The stars below have 3 characteristics: shape, size and color.

Here, homogeneity is defined by color (that’s what drives the distances).

Hierarchical clustering (2/2): Decision trees

Here, we split again according to color.

Let’s take a closer look at how the splits are performed. They are variable-based!

Building decision trees (1/3)

  • The aim is to explain the popularity of songs.
  • Suppose we seek to split (in two groups) all songs based on loudness (loud vs not so loud songs).
  • We try to form clusters of homogeneous popularity.
  • Mathematically, we compute the dispersion in popularity of two clusters: those with low loudness and those with high loudness.
  • The problem is that we don’t know where to split loudness (very low, low, medium, high, etc.)!
  • The best level is the one that minimizes the total dispersion of popularity over the two clusters (see the sketch after this list).
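
A sketch of this split search in R, assuming a hypothetical songs data frame with numeric loudness and popularity columns:

# total dispersion of popularity when loudness is split at a given level
split_dispersion <- function(level, songs) {
  left  <- songs$popularity[songs$loudness <= level]
  right <- songs$popularity[songs$loudness >  level]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# candidate levels: all observed loudness values except the largest
levels <- head(sort(unique(songs$loudness)), -1)
best <- levels[which.min(sapply(levels, split_dispersion, songs = songs))]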

Building decision trees (2/3)

There is a clear effect! If we split the sample at a loudness of -10, we get the smallest dispersion! The average popularity when loudness is below -10 is 46, while that of songs with loudness above -10 is 49. That’s not a big change.

Building decision trees (3/3)

The thing is: there are many more variables! So this exercise can be carried out on all of them. And the one that leads to the most homogeneous clusters wins!

This procedure can then be repeated on the two sub-groups and continued until sufficient complexity is reached. Precision increases with each split.
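
The whole recursive procedure is implemented in off-the-shelf packages such as rpart; a minimal sketch, still assuming the hypothetical songs data frame:

library(rpart)
# regression tree explaining popularity with all other variables,
# stopped at a moderate depth to keep it readable
tree <- rpart(popularity ~ ., data = songs, control = rpart.control(maxdepth = 3))
plot(tree); text(tree)   # quick display of the splits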

A new dataset: clothing store (Larose 2006)

1001 customers (1000+1) and 24 attributes.

  1. KEY = customer ID
  2. REC = recency (months since last purchase)
  3. FRE = frequency (number of purchase visits)
  4. MON = monetary (total net sales)
  5. SPT1M / SPT3M / SPT6M = amount spent over the past 1, 3, and 6 months (3 columns)
  6. CLASS = number of different product classes purchased
  7. PXXX = 15 variables providing the percentages spent by the customer on specific classes of clothing, including sweaters, knit tops, knit dresses, blouses, jackets, career pants, casual pants, shirts, dresses, suits, outerwear, jewelry, fashion, legwear, and the collectibles line
  8. RESPONSERATE = response rate to promotion

Dataset: a quick look

load("clothing.RData")
head(clothing)
KEY REC FRE MON SPT1M SPT3M SPT6M CLASS PSWEATS PKNIT_TOPS PKNIT_DRES PBLOUSES PJACKETS PCAR_PNTS PCAS_PNTS PSHIRTS PDRESSES PSUITS POUTERWEAR PJEWELRY PFASHION PLEGWEAR PCOLLSPND RESPONSERATE
1 208 2 368.46 0.00 0.00 0.00 9 0.18 0.00 0.00 0.30 0.0 0.25 0.00 0.19 0.00 0 0 0.000000 0.02 0.03 0.29 0.00
2 6 4 258.00 138.00 55.99 258.00 6 0.26 0.16 0.00 0.00 0.0 0.18 0.14 0.00 0.18 0 0 0.000000 0.00 0.02 0.37 50.00
3 327 2 77.00 0.00 0.00 0.00 1 1.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0 0 0.000000 0.00 0.00 0.00 0.00
4 66 8 846.06 104.94 0.00 373.87 15 0.38 0.00 0.05 0.06 0.2 0.17 0.00 0.05 0.00 0 0 0.005307 0.03 0.01 0.00 66.67
5 49 1 87.44 87.44 0.00 87.44 4 0.20 0.20 0.00 0.00 0.0 0.00 0.41 0.00 0.00 0 0 0.170000 0.00 0.00 0.00 0.00
6 26 2 120.00 62.00 62.00 120.00 3 0.00 0.56 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0 0 0.020000 0.40 0.00 0.00 0.00
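
As a quick, purely illustrative application (anticipating the wrap-up below): standardize the attributes and run k-means on them; dropping KEY and picking k = 4 are arbitrary choices:

X <- scale(clothing[, -1])                  # drop the customer ID, standardize the rest
set.seed(1)                                 # k-means is randomly initialized
fit <- kmeans(X, centers = 4, nstart = 20)  # arbitrary k, several random restarts
table(fit$cluster)                          # size of each cluster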

Wrap up: recalling the process

  • decide which variables you want to use to build the clusters / compute the distances;
  • perform some engineering if necessary (check distributions!);
  • accordingly, fix the distance metrics (numerical, categorical, weighted, etc.);
  • go for it:
    • compute distances for \(k\)-NN;
    • compute dispersions for \(k\)-means;
    • build the tree if you seek to explain one particular variable.

Your turn!