- Who has tried deploying his/her app?
- What do you know about clustering?
- Who likes math?
In this deck, clustering will refer to techniques that, given a particular structured dataset, seek to create homogeneous groups of observations.
Key question: how do you define “homogeneous”? Answer: via distances.
Example: assume songs are characterized only by their danceability:
\(\rightarrow\) Lose Yourself is closer to Poker Face because the distance between the two scores is smaller.
A tabular dataset is a matrix \(X_{i,j}\) such that each row \(i\) corresponds to an observation (instance) and each column \(j\) to a variable (attribute).
A distance is always computed when \(j\) is fixed and \(i\) changes:
\[D_{\text{dance}}(\text{Poker Face, Lose Yourself})= \left|X_{\text{Poker Face}, \text{Dance}}-X_{\text{Lose Yourself}, \text{Dance}}\right|\] In a more formal way, we can write that \[D_j(i_1,i_2)=\left|X_{i_1,j}-X_{i_2,j}\right|,\] if \(j\)=Dance and \(i_1\)=Poker Face and \(i_2\)=Lose Yourself.
Ok, but what if we want to measure proximity according to other criteria (and not just danceability)? \[\begin{align} D_{\text{dance & duration}}(\text{Poker Face, Lose Yourself})=& \left|X_{\text{Poker Face}, \text{Dance}}-X_{\text{Lose Yourself}, \text{Dance}}\right| \\ &+\left|X_{\text{Poker Face}, \text{Duration}}-X_{\text{Lose Yourself}, \text{Duration}}\right| \end{align}\]
Problem: danceability is on a scale from 0 to 1 while duration is expressed in minutes!
\[\begin{align} D_{\text{dance & duration}}(\text{Poker Face, Lose Yourself})=&w_{dance} \left|X_{\text{Poker Face}, \text{Dance}}-X_{\text{Lose Yourself}, \text{Dance}}\right| \\ &+w_{duration}\left|X_{\text{Poker Face}, \text{Duration}}-X_{\text{Lose Yourself}, \text{Duration}}\right| \end{align}\]
This is very convenient: it allows us to give more or less weight to particular attributes (they may not all be equally useful)!
\(\rightarrow\) see also variable engineering & scaling (below)
\[D(i_1,i_2)=\sum_{j=1}^J\underbrace{w_j }_{\text{weight of variable }j} \quad \times \quad \underbrace{\left| X_{i_1,j}-X_{i_2,j} \right|}_{\text{distance over variable j}}\] Note: sometimes, it is useful to use squared distances:
\[D(i_1,i_2)=\sum_{j=1}^J\underbrace{w_j }_{\text{weight of variable }j} \quad \times \quad \underbrace{\left( X_{i_1,j}-X_{i_2,j} \right)^2}_{\text{distance over variable j}}\]
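To fix ideas, here is a minimal R sketch of this weighted distance; the danceability/duration values and the weights below are made up for illustration.

```r
# Minimal sketch: weighted distance between two songs (hypothetical values)
x_poker <- c(dance = 0.85, duration = 3.9)   # assumed danceability & duration (minutes)
x_lose  <- c(dance = 0.70, duration = 5.4)
w       <- c(dance = 1.0,  duration = 0.2)   # weights chosen for illustration only

# Weighted absolute distance
D_abs <- sum(w * abs(x_poker - x_lose))

# Weighted squared distance
D_sq  <- sum(w * (x_poker - x_lose)^2)

D_abs
D_sq
```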
What if variables are not numbers but (unordered) categories? (ordered categories can be encoded numerically)
There are many ways to proceed (see the paper Similarity Measures for Categorical Data: A Comparative Evaluation)
The simplest choice is binary:
\[d_j(i_1,i_2)=\left\{\begin{array}{clc} 0& \text{ if } X_{i_1,j}=X_{i_2,j} \\ 1& \text{ otherwise} \end{array} \right.\]
\(\rightarrow\) Either the categories are the same and hence the distance is zero or they are different and the distance is one.
Imagine there is a song genre variable: Poker Face could be Pop while Lose Yourself would be Rap, hence the distance would be one.
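A minimal R sketch of this 0/1 distance, using the genres from the example (purely illustrative):

```r
# Binary (0/1) distance for unordered categories:
# 0 if the categories match, 1 otherwise
cat_dist <- function(a, b) ifelse(a == b, 0, 1)

cat_dist("Pop", "Rap")   # Poker Face vs Lose Yourself -> 1
cat_dist("Pop", "Pop")   # same genre -> 0
```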
A very important quantity is the range of the variables. Ranges can be very different across variables: this must be dealt with!
\(\rightarrow\) Normalization / standardization: \[x_i \leftarrow \frac{x_i-\mu}{\sigma},\] where \(\mu\) is the average of the \(x_i\) and \(\sigma\) the standard deviation. Then, the \(x_i\) have zero mean and unit variance.
\(\rightarrow\) Min-max rescaling: \[x_i \leftarrow \frac{x_i-\min(\textbf{x})}{\max(\textbf{x})-\min(\textbf{x})}\] Then, the \(x_i\) are all in the \([0,1]\) interval.
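A quick R illustration of both rescalings, on a made-up vector:

```r
# Toy numeric vector (made-up values)
x <- c(2.1, 3.5, 5.0, 7.2, 9.9)

# Standardization: zero mean, unit variance
x_std <- (x - mean(x)) / sd(x)

# Min-max rescaling: all values mapped into [0, 1]
x_mm <- (x - min(x)) / (max(x) - min(x))

round(x_std, 2)
round(x_mm, 2)
```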
Nearest neighbors are local and instance-focused (computationally costly). For one given instance, they seek those that are the most similar. Assume you are interested in one particular client: to predict his/her behavior, you will try to find those that are similar to him/her.
k-means is global: the aim is to form groups whose members are as similar as possible (within each group). (unsupervised learning)
decision trees try to understand which variables have an impact on one particular outcome. (supervised learning)
Here, \(i_1\) is fixed: it is the target instance (movie, song, diamond, customer, etc.). The NN algorithm seeks to find the instances that are the closest to \(i_1\).
It’s rather simple. For a given distance \(D(\cdot, \cdot)\), you just need to compute ALL distances \(D(i_1,i)\)! This is why it’s computationally intense.
Then, keep only the (\(k\)) instances that are the closest to the target, i.e., those with the smallest distances.
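As a rough sketch (assuming a plain numeric matrix and Euclidean distances, both chosen purely for illustration), the brute-force search looks like this in R:

```r
# Minimal sketch: find the k nearest neighbors of a target row in a numeric matrix X
set.seed(42)
X <- matrix(rnorm(100 * 3), ncol = 3)   # 100 instances, 3 variables (simulated)
target <- X[1, ]                        # i_1: the target instance
k <- 5

# Compute ALL distances D(i_1, i) ...
d <- sqrt(rowSums(sweep(X, 2, target)^2))

# ... then keep only the k smallest (excluding the target itself)
neighbors <- order(d)[2:(k + 1)]
neighbors
```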
Let’s say you know some attributes of a new client (=variables!): age, gender, purchase history (frequency, amount, type), etc.
Let’s also say that you have a large database of clients with some of them being similar to a new profile (k nearest neighbors).
In this database, you have a key figure: the probability that the customer will buy a product, or will respond to a particular ad campaign, for instance.
Given this data, you can try to predict this probability for the new client!
Indeed, the assumption is that neighbors carry information and that the new customer will behave like them.
If the variable to predict is \(X_{\cdot,m}\) (where \(m\) does not belong to the set of variables used to compute the distances), then the simple average over the neighbors is: \[\text{prediction}=\frac{1}{K}\sum_{k=1}^KX_{k,m}.\] To refine the analysis, it is possible to weight the neighbors so that the instances closest to the target count more: \[\text{prediction}=\sum_{k=1}^K\underbrace{\frac{e^{-\alpha D(i_1,k)}}{\sum_{l=1}^Ke^{-\alpha D(i_1,l)}}}_{\text{weighting factor}}X_{k,m}.\] The weighting factors must sum to one (so the extra \(1/K\) term is no longer needed)!
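A minimal R sketch of both predictions, with made-up neighbor distances and values:

```r
# d_k: distances from the target to its K neighbors; y_k: their values of X_{.,m}
# (both vectors are purely illustrative)
d_k   <- c(0.3, 0.5, 0.9, 1.4)
y_k   <- c(0.20, 0.35, 0.10, 0.60)
alpha <- 1

# Simple average over the neighbors
pred_simple <- mean(y_k)

# Distance-weighted average: weights decay with distance and sum to one
w_k           <- exp(-alpha * d_k) / sum(exp(-alpha * d_k))
pred_weighted <- sum(w_k * y_k)

pred_simple
pred_weighted
```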
k-NN are local: they seek to determine the vicinity of one particular instance.
k-means is global: it seeks to partition a dataset into homogeneous clusters \(S_i\). For each group, we can compute the average value of the variables/attributes:
\[\textbf{m}_i=\frac{1}{\text{card}(S_i)}\sum_{\textbf{x}\in S_i}\textbf{x}\]
Inside one cluster, the dispersion over all variables is measured by \[\sum_{\textbf{x}\in S_i}||\textbf{x}-\textbf{m}_i||^2\]
where \(||\cdot||\) is a norm (typically the Euclidean one). This quantifies the homogeneity of the cluster.
The objective of \(k\)-means is to minimize the total dispersion over all clusters: \[\underset{\textbf{S}\in \mathcal{S}}{\min} \ \sum_{i=1}^k\sum_{\textbf{x}\in S_i}||\textbf{x}-\textbf{m}_i||^2,\] where \(\mathcal{S}\) is the set of all possible partitions. This minimization is very hard to carry out exactly because it requires computing the dispersion of every possible partition into clusters.
That is impossible in practice whenever the sample is not very small (15 instances or less).
Thus, we resort to approximate algorithms.
\(\rightarrow\) the choice of \(k\) is usually not easy/obvious.
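In R, the built-in kmeans() function implements such an approximate algorithm. A minimal sketch on simulated data, with \(k=3\) chosen arbitrarily for illustration:

```r
# Minimal sketch: an approximate k-means fit with stats::kmeans()
set.seed(123)
X <- matrix(rnorm(200 * 2), ncol = 2)       # 200 instances, 2 variables (simulated)

fit <- kmeans(X, centers = 3, nstart = 20)  # nstart: several random initializations

fit$centers          # the cluster means m_i
fit$tot.withinss     # total within-cluster dispersion (the minimized objective)
table(fit$cluster)   # cluster sizes
```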
Top-down versus Bottom-up.
The stars below have 3 characteristics: shape, size and color.
Here, homogeneity is defined by color (that’s what drives the distances).
Here, we split again according to color.
Let’s take a closer look at how the splits are performed. They are variable-based!
There is a clear effect! If we split the sample at a loudness of -10, we get the smallest dispersion! The average popularity when loudness is smaller than -10 is 46 and that of songs with loudness larger than -10 is 49. That’s not a big change.
The thing is: there are many more variables! So this exercise can be carried out on all of them. And the one that leads to the most homogeneous clusters wins!
This procedure can then be iterated on the two sub-groups and continued until sufficient complexity is reached: precision increases with each split.
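A minimal R sketch of this single-variable split search; the column names loudness and pop below are assumptions based on the example above, and real tree implementations (e.g., rpart) repeat this search over all variables:

```r
# For one explanatory variable x and one outcome y, find the cut point
# that minimizes the total within-group dispersion
best_split <- function(x, y) {
  # candidate cut points: all observed values except the largest
  cuts <- head(sort(unique(x)), -1)
  disp <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x >  cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(disp)]   # the cut with the smallest total dispersion wins
}

# Hypothetical usage on a songs data frame:
# best_split(songs$loudness, songs$pop)
```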
1001 customers (1000+1) and 24 attributes.
load("clothing.RData") head(clothing)
KEY | REC | FRE | MON | SPT1M | SPT3M | SPT6M | CLASS | PSWEATS | PKNIT_TOPS | PKNIT_DRES | PBLOUSES | PJACKETS | PCAR_PNTS | PCAS_PNTS | PSHIRTS | PDRESSES | PSUITS | POUTERWEAR | PJEWELRY | PFASHION | PLEGWEAR | PCOLLSPND | RESPONSERATE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 208 | 2 | 368.46 | 0.00 | 0.00 | 0.00 | 9 | 0.18 | 0.00 | 0.00 | 0.30 | 0.0 | 0.25 | 0.00 | 0.19 | 0.00 | 0 | 0 | 0.000000 | 0.02 | 0.03 | 0.29 | 0.00 |
2 | 6 | 4 | 258.00 | 138.00 | 55.99 | 258.00 | 6 | 0.26 | 0.16 | 0.00 | 0.00 | 0.0 | 0.18 | 0.14 | 0.00 | 0.18 | 0 | 0 | 0.000000 | 0.00 | 0.02 | 0.37 | 50.00 |
3 | 327 | 2 | 77.00 | 0.00 | 0.00 | 0.00 | 1 | 1.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0.000000 | 0.00 | 0.00 | 0.00 | 0.00 |
4 | 66 | 8 | 846.06 | 104.94 | 0.00 | 373.87 | 15 | 0.38 | 0.00 | 0.05 | 0.06 | 0.2 | 0.17 | 0.00 | 0.05 | 0.00 | 0 | 0 | 0.005307 | 0.03 | 0.01 | 0.00 | 66.67 |
5 | 49 | 1 | 87.44 | 87.44 | 0.00 | 87.44 | 4 | 0.20 | 0.20 | 0.00 | 0.00 | 0.0 | 0.00 | 0.41 | 0.00 | 0.00 | 0 | 0 | 0.170000 | 0.00 | 0.00 | 0.00 | 0.00 |
6 | 26 | 2 | 120.00 | 62.00 | 62.00 | 120.00 | 3 | 0.00 | 0.56 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | 0 | 0.020000 | 0.40 | 0.00 | 0.00 | 0.00 |