MMDS 2
Mining of Massive Datasets
Clustering
Similarity Measures
- Sets as vectors: cosine similarity = $\frac{\overrightarrow{x_1}\cdot\overrightarrow{x_2}}{||\overrightarrow{x_1}||\,||\overrightarrow{x_2}||}\in[-1,1]$; the cosine distance is the corresponding angle
- Sets as sets: Jaccard distance
- Sets as points: Euclidean distance = $||\overrightarrow{x_1}-\overrightarrow{x_2}|| = \sqrt{\sum_i (x_{1,i}-x_{2,i})^2}$
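A quick numpy illustration of the three measures on toy 0/1 set-vectors (my own example, not from the lecture):

```python
import numpy as np

x1 = np.array([1.0, 0.0, 1.0, 1.0])  # set {a, c, d} as a 0/1 vector
x2 = np.array([1.0, 1.0, 0.0, 1.0])  # set {a, b, d} as a 0/1 vector

# Cosine similarity: dot product over product of norms, in [-1, 1]
cos_sim = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

# Jaccard distance: 1 - |intersection| / |union| of the underlying sets
jaccard_dist = 1 - np.minimum(x1, x2).sum() / np.maximum(x1, x2).sum()

# Euclidean distance: length of the difference vector
euclid_dist = np.linalg.norm(x1 - x2)

print(cos_sim, jaccard_dist, euclid_dist)
```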
Methods
Hierarchical
- Agglomerative (bottom-up)
- Repeatedly combine the two nearest clusters into one
- Divisive (top-down)
- Start with one cluster and recursively split
Point assignment
- Maintain a set of clusters; repeatedly assign each point to its nearest cluster, e.g. k-means
Clustroid: the data point in the cluster minimising the sum of squared distances to the other points
$\min\limits_{c \in C} \sum\limits_{x\in C}d(x,c)^2$
K-means
- Pick k, the number of clusters; initialise by picking one point per cluster as its starting centroid
- Place each point in the cluster with nearest current centroid
- Update centroid locations
- Reassign all points to their closest centroid
- Repeat steps 3–4 until convergence (points no longer move between clusters and centroids stabilise); see the sketch below
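A minimal numpy sketch of the loop above (my own illustration; initial points are chosen uniformly at random and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise: pick one data point per cluster as its starting centroid
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its nearest current centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged: centroids stabilise, so points stop moving between clusters
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```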
BFR(Bradley-Fayyad-Reina)
Assumes clusters are normally distributed around a centroid in a Euclidean space
- Take k random points: pick the first point at random, then pick each of the remaining k−1 points as far from the previously chosen points as possible
- Or take a small random sample and cluster it optimally
Three Classes
- Discard Set (DS)
- Points close enough to a centroid to be summarised
- Compression Set (CS)
- Points close together but not close to any existing centroid
- Retained Set (RS)
- Isolated points waiting to be assigned to a compression set
DS Summary
For each cluster, keep:
- N: number of points
- SUM: vector, ith = sum of the coordinates of the points in the ith dimension
- SUMSQ: vector, ith = sum of squares of coordinates in ith dimension
d: number of dimensions
cluster summary size: $2d+1$ values
Average in each dimension(the centroid) : $\frac{SUM_i}{N}$
Variance in dimension i: $\frac{SUMSQ_i}{N}-(\frac{SUM_i}{N})^2$
$$Var(X)=E[X^2]-(E[X])^2$$ (the identity behind the variance formula above)
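A small sketch of keeping the $2d+1$ summary values per cluster and recovering the centroid and variance from them (the class name `DSSummary` is my own):

```python
import numpy as np

class DSSummary:
    """BFR discard-set summary of one cluster: N, SUM, SUMSQ (2d+1 values)."""
    def __init__(self, d):
        self.N = 0
        self.SUM = np.zeros(d)
        self.SUMSQ = np.zeros(d)

    def add(self, point):
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N  # average in each dimension

    def variance(self):
        # Var = E[X^2] - (E[X])^2, per dimension
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2
```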
Processing
- Add points that are close enough to some centroid to that cluster (and the DS)
- Cluster the remaining points together with the old RS
- Clusters → CS; outliers → RS
- DS: adjust the statistics (N, SUM, SUMSQ) for the newly added points
- Consider merging CS clusters
- On the last round, merge all CS clusters and all RS points into their nearest cluster
Close enough
- Mahalanobis distance less than a threshold
- High probability of the point belonging to currently nearest centroid
Mahalanobis Distance(M-distance)
point $(x_1,\cdots, x_d)$, centroid $(c_1, \cdots, c_d)$
$\sigma_i$: Standard deviation of points in the cluster in ith dimension
- Normalise in each dimension $y_i=\frac{x_i-c_i}{\sigma_i}$
- $d(x,c)=\sqrt{\sum\limits_{i=1}^d(\frac{x_i-c_i}{\sigma_i})^2}$
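An illustrative M-distance helper (my own sketch); the `2 * sqrt(d)` threshold in the comment is just the usual "roughly two standard deviations per dimension" rule of thumb:

```python
import numpy as np

def mahalanobis(point, centroid, sigma):
    """M-distance of a point from a centroid, with per-dimension std devs sigma."""
    y = (point - centroid) / sigma          # normalise in each dimension
    return np.sqrt(np.sum(y ** 2))

# A point is "close enough" to join the DS if its M-distance is below a threshold,
# e.g. 2 * sqrt(d) ~ two standard deviations away in every dimension.
```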
Combine 2 CS clusters if the combined variance is below threshold
CURE(Clustering Using REpresentatives)
- Assumes a Euclidean distance
- Allows clusters to assume any shape
- Uses a collection of representative points to represent clusters
2 Pass
Pass 1
- Pick a random sample of points that fits in main memory
- Initial clusters: hierarchically cluster the sampled points
- Pick representative points for each cluster: choose dispersed points, then move each a fixed fraction (e.g. 20%) toward the cluster's centroid
Pass 2
- Rescan the dataset, visit each point p
- Place it in the cluster of its closest representative point
DimRed
SVD
Matrix Rank: Number of linearly independent columns of a matrix
Definitions
$A_{m \times n} = U_{m\times r}\,\Sigma_{r \times r}\,(V_{n \times r})^T$
A: Input data matrix; U: Left singular vectors; $\Sigma$: Singular values;
r: Rank of the matrix A; V: Right singular vectors
Properties
$U, \Sigma, V$: Unique
$U, V$: Column orthonormal
$\Sigma$: Diagonal, singular values are positive and sorted in decreasing order
$A = U \Sigma V^T = \sum\limits_{i} \sigma_i u_i v_i^T$
Goal: Minimize sum of reconstruction errors
Best Low Rank Approx Theorem
$A = U \Sigma V^T$ with $\sigma_1 \geq \sigma_2 \geq \cdots$ and $rank(A)=r$; define $B=USV^T$, where
S is the diagonal $r \times r$ matrix with $s_i = \sigma_i$ for $i = 1,\dots,k$ and $s_i=0$ otherwise
Then B is the best rank-k approximation to A: it solves $\min\limits_{B:\,rank(B)=k} ||A-B||_F$
Proof (sketch):
$\min\limits_{B,\,rank(B)=k} ||A-B||_F^2 = \min ||\Sigma - S||_F^2 = \min \sum\limits_{i}(\sigma_i - s_i)^2 = \min\Big[\sum\limits_{i=1}^{k}(\sigma_i - s_i)^2 + \sum\limits_{i=k+1}^{r}\sigma_i^2\Big] = \sum\limits_{i=k+1}^{r}\sigma_i^2$
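A quick numpy check of the theorem (my own sketch): truncating the SVD at rank k leaves a squared Frobenius error equal to the sum of the discarded $\sigma_i^2$:

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt, s sorted descending

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

err_sq = np.linalg.norm(A - A_k, 'fro') ** 2
print(err_sq, np.sum(s[k:] ** 2))                  # the two values agree
```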
CUR
- Input: matrix $A \in \mathbb{R}^{m \times n}$, desired rank k
- Select c columns from A to form matrix C
- Select r rows from A to form matrix R
- Compute $U = W^{+}$, the pseudo-inverse of W, where W is the intersection of the chosen rows and columns
- If W is nonsingular, the pseudo-inverse is the true inverse
- A≈CUR
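A simplified CUR sketch (my own; the full method samples with replacement, with probability proportional to squared norms, and rescales the chosen rows/columns):

```python
import numpy as np

def cur_decompose(A, c, r, seed=0):
    """Simplified CUR: sample columns/rows by squared norm, U = pinv of the intersection W."""
    rng = np.random.default_rng(seed)
    col_p = (A ** 2).sum(axis=0) / (A ** 2).sum()   # column sampling probabilities
    row_p = (A ** 2).sum(axis=1) / (A ** 2).sum()   # row sampling probabilities
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C = A[:, cols]
    R = A[rows, :]
    W = A[np.ix_(rows, cols)]        # intersection of the chosen rows and columns
    U = np.linalg.pinv(W)            # pseudo-inverse; the true inverse if W is nonsingular
    return C, U, R                   # A ≈ C @ U @ R
```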
Comparison
SVD: $A = U \Sigma V^T$; CUR: $A = CUR$
- A: huge but sparse (in both)
- SVD factors: U, V big and dense; $\Sigma$ small and sparse
- CUR factors: C, R big but sparse; U small and dense
RecSys
X = set of Customers
S = set of Items
R = set of ratings (a totally ordered set)
Utility function: u: $X \times S \rightarrow R$
Approaches
- Content-based
- Collaborative Filtering
- Collaborative Reasoning
- Large Foundation Models
Content-based
Recommend items similar to previous items rated highly by the customer
Item Profiles: vector of features
TF-IDF
$f_{ij} = \text{frequency of term(feature) i in doc (item) j}$
$TF_{ij}=\frac{f_{ij}}{\max_k f_{kj}}$
$n_i$: number of items (docs) that mention feature i, N: total number of items
$IDF_i=\log \frac{N}{n_i}$
$\text{TF-IDF score } w_{i,j} = TF_{ij} \times IDF_i$
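A toy numpy sketch of the TF-IDF weights above (hypothetical counts; the log base is a convention choice):

```python
import numpy as np

# f[i, j] = raw count of feature (term) i in item (doc) j -- hypothetical toy data
f = np.array([[3, 0, 1],
              [1, 2, 0],
              [0, 1, 4]], dtype=float)

TF = f / f.max(axis=0, keepdims=True)     # TF_ij = f_ij / max_k f_kj
N = f.shape[1]                            # total number of items
n = np.count_nonzero(f, axis=1)           # n_i = number of items that mention feature i
IDF = np.log(N / n)
W = TF * IDF[:, None]                     # w_ij = TF_ij * IDF_i
```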
User Profiles and prediction
User profile: weighted average of all rated item profiles
Prediction heuristic
Given user profile x and item profile i, cosine similarity
$u(x,i) = cos(x,i) = \frac{x\cdot i}{||x|| \cdot ||i||}$
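A tiny sketch of the heuristic (toy profiles; the rating-weighted user profile is one common choice, details vary):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical item profiles (one row per rated item) and the user's ratings of them
item_profiles = np.array([[1.0, 0.0, 1.0],
                          [0.0, 1.0, 1.0]])
ratings = np.array([5.0, 2.0])

# User profile: rating-weighted average of the rated item profiles
x = (ratings[:, None] * item_profiles).sum(axis=0) / ratings.sum()

# Predicted utility of an unseen item: cosine similarity with the user profile
new_item = np.array([1.0, 0.0, 0.5])
u = cosine(x, new_item)
```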
Pros
- No need for other users' data
- Works for users with unique tastes and for new or unpopular items
- Easy to explain recommendations (via content features)
Cons
- Hard to find appropriate features
- New users have no profile
- Overspecialisation (never recommends outside the user's content profile)
Collaborative Filtering
Find similar users
Jaccard similarity: ignores rating values
Cosine similarity: treats missing ratings as negative (zeros)
$r_x:\text{vector of user x’s ratings}$
Pearson correlation coefficient
$S_{xy}=\text{items rated by both users x and y}$
$sim(x,y)=\frac{\sum_{s \in S_{xy}}(r_{xs}-\overline{r_x})(r_{ys}-\overline{r_y})}{\sqrt{\sum_{s \in S_{xy}}(r_{xs}-\overline{r_x})^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys}-\overline{r_y})^2}}$
Rating Predictions
N: the set of k users most similar to x who have rated item i
e.g. predictions for user x's rating of item i
$r_{xi}=\frac{1}{k} \sum_{y \in N} r_{yi}$ (simple average)
$r_{xi} = \frac{\sum_{y \in N}s_{xy} \cdot r_{yi}}{\sum_{y \in N}s_{xy}}$, where $s_{xy}=sim(x,y)$ (similarity-weighted average)
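An illustrative user-user CF sketch (my own helper names; no handling for users with too few co-rated items):

```python
import numpy as np

def pearson_sim(rx, ry):
    """Pearson correlation over items rated by both users (NaN marks a missing rating)."""
    both = ~np.isnan(rx) & ~np.isnan(ry)
    dx = rx[both] - np.nanmean(rx)
    dy = ry[both] - np.nanmean(ry)
    return (dx @ dy) / (np.sqrt(dx @ dx) * np.sqrt(dy @ dy))

def predict(R, x, i, k=2):
    """Similarity-weighted average over the k users most similar to x who rated item i."""
    candidates = [y for y in range(R.shape[0]) if y != x and not np.isnan(R[y, i])]
    sims = np.array([pearson_sim(R[x], R[y]) for y in candidates])
    top = np.argsort(sims)[::-1][:k]              # indices (into candidates) of the k nearest
    neighbours = np.array(candidates)[top]
    return np.sum(sims[top] * R[neighbours, i]) / np.sum(sims[top])
```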
Pro
Works for any kind of item (no feature selection needed)
Cons
- User/Item cold start
- Sparsity
- Popularity bias
Metrics
Rating Prediction: MAE, RMSE
Top-n Rec:
$\text{Precision}=\frac{tp}{tp+fp}$
$\text{Recall}=\frac{tp}{tp+fn}$
$F_1=2 \cdot \frac{precision \cdot recall}{ precision + recall}$, Optimal value = 1
Normalised Discounted Cumulative Gain
DCG
p: position up to which relevance is accumulated
$rel_i$: relevance of the recommendation at position i
$DCG_p = \sum\limits_{i=1}^p \frac{2^{rel_i}-1}{\log_2(i+1)}$
Idealised DCG
Assumption that items are ordered by decreasing relevance
$IDCG_p=\sum\limits_{i=1}^{|REL|} \frac{2^{rel_i}-1}{\log_2(i+1)}$
Normalised DCG
normalised to the interval [0,1]
$nDCG_p=\frac{DCG_p}{IDCG_p}$
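A small numpy sketch of DCG/nDCG as defined above:

```python
import numpy as np

def dcg(rel):
    rel = np.asarray(rel, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    return np.sum((2 ** rel - 1) / np.log2(positions + 1))

def ndcg(rel):
    ideal = sorted(rel, reverse=True)   # same items, reordered by decreasing relevance
    return dcg(rel) / dcg(ideal)

# Relevance of the recommendations at positions 1..4
print(ndcg([3, 2, 3, 0]))               # < 1.0 because the list is not ideally ordered
```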
Mean Average Precision(MAP)
Average of precision values after each successful prediction
Latent Factor Models
CF
- Item-Item CF
- Predict ratings using similar items $s_{ij}$
- Weighted average of user x’s ratings for similar items.
- Bias Model
- Adjust for global mean $\mu$, user bias $b_u$, item bias $b_i$
- $\hat{r}_{ui} = \mu + b_u + b_i$
- Weighted CF
- Combines biases and learned weights $w_{ij}$
- $$\hat{r}_{ui} = \mu + b_u + b_i + \sum\limits_{j} w_{ij}(r_{uj} - \mu - b_u - b_j)$$
MF
- Latent Factors
- Decompose R as $P \cdot Q^T$
- Minimise error over observed ratings: $\min\limits_{P, Q} \sum\limits_{(u,i)\,\text{observed}} (r_{ui} - p_u \cdot q_i)^2 + \lambda (\|P\|^2 + \|Q\|^2)$
- Gradient descent (or SGD) to update P, Q; see the sketch below
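A minimal SGD sketch for the latent factor model (my own; hyperparameters are arbitrary and the bias terms from the models above are omitted):

```python
import numpy as np

def mf_sgd(ratings, n_users, n_items, k=8, lr=0.01, lam=0.1, epochs=50, seed=0):
    """Learn P (users x k) and Q (items x k) by SGD on the observed ratings."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:                      # (user, item, rating) triples
            err = r - P[u] @ Q[i]                    # prediction error for this rating
            pu = P[u].copy()                         # keep old user factors for Q's update
            P[u] += lr * (err * Q[i] - lam * P[u])   # gradient step with L2 regularisation
            Q[i] += lr * (err * pu - lam * Q[i])
    return P, Q

# Usage: observed ratings as (user, item, rating); prediction for (u, i) is P[u] @ Q[i]
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
P, Q = mf_sgd(ratings, n_users=3, n_items=2)
```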
Temporal Dynamics
- Add time-dependence to biases $b_u(t), b_i(t)$ and factors $p_u(t)$
Regularisation $\lambda$ prevents overfitting