Mining of Massive Datasets

Clustering

Similarity Measures

  • Sets as vectors: cosine similarity $=\frac{\overrightarrow{x_1}\cdot\overrightarrow{x_2}}{||\overrightarrow{x_1}||\,||\overrightarrow{x_2}||}\in[-1,1]$; cosine distance is the corresponding angle
  • Sets as sets: Jaccard distance
  • Sets as points: Euclidean distance $=||\overrightarrow{x_1}-\overrightarrow{x_2}||=\sqrt{\sum_i(x_{1,i}-x_{2,i})^2}$
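A minimal sketch of the three measures on toy data (assuming NumPy; the example vectors and sets are made up):

```python
import numpy as np

def cosine_similarity(x1, x2):
    # cosine of the angle between the vectors; cosine distance is the angle itself
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

def jaccard_distance(s1, s2):
    # 1 - |intersection| / |union|
    return 1.0 - len(s1 & s2) / len(s1 | s2)

def euclidean_distance(x1, x2):
    # L2 norm of the difference vector
    return np.linalg.norm(x1 - x2)

x1, x2 = np.array([1.0, 2.0, 0.0]), np.array([2.0, 1.0, 1.0])
print(cosine_similarity(x1, x2))               # ~0.73
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
print(euclidean_distance(x1, x2))              # ~1.73
```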

Methods

Hierarchical

  • Agglomerative (bottom-up)
    • Repeatedly combine the two nearest clusters into one
  • Divisive (top-down)
    • Start with one cluster and recursively split

Point assignment

  • Maintain a set of clusters, then repeatedly assign points to the nearest cluster, e.g. k-means

Clustroid

The data point $c$ in the cluster that minimises the sum of squared distances to the other points: $\min\limits_{c \in C} \sum\limits_{x\in C}d(x,c)^2$

K-means

  1. Pick k, the number of clusters; initialise by picking one point per cluster
  2. Place each point in the cluster with nearest current centroid
  3. Update centroid locations
  4. Reassign all points to their closest centroid
  5. Repeat 3-4 until convergence (points no longer change clusters and centroids stabilise); see the sketch below
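A minimal k-means (Lloyd's iteration) sketch following steps 1-5, assuming NumPy and that no cluster becomes empty during an update:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. initialise: pick one data point per cluster at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # 2./4. assign every point to its nearest current centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. update centroid locations (assumes no cluster is empty)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # 5. stop once the centroids stabilise
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(pts, k=2)
```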

BFR(Bradley-Fayyad-Reina)

Assumes clusters are normally distributed around a centroid in a Euclidean space

  • Take k initial points: pick the first at random, then pick each of the remaining k-1 to be as dispersed as possible from those already chosen
  • Or: take a small random sample and cluster it optimally

Three Classes

  • Discard Set(DS)
    • Points close enough to a centroid to be summarised
  • Compression Set(CS)
    • Points close together but not close to any existing centroid
  • Retained Set(RS)
    • Isolated points waiting to be assigned to a compression set

DS Summary

For each cluster, keep:

  • N: number of points
  • SUM: vector, ith = sum of the coordinates of the points in the ith dimension
  • SUMSQ: vector, ith = sum of squares of coordinates in ith dimension

d: number of dimensions

Summary size per cluster: $2d+1$ values ($N$ plus the $d$ entries of SUM and the $d$ entries of SUMSQ)

Average in each dimension (the centroid): $\frac{SUM_i}{N}$

Variance in dimension i: $\frac{SUMSQ_i}{N}-(\frac{SUM_i}{N})^2$

$$Var(X)=E[X^2]-(E[X])^2$$ (the identity behind the variance formula above)
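A sketch of the per-cluster $(N, SUM, SUMSQ)$ summary and the derived centroid/variance, assuming NumPy:

```python
import numpy as np

class DSSummary:
    """BFR discard-set summary: 2d+1 values per cluster (N, SUM, SUMSQ)."""
    def __init__(self, d):
        self.N = 0
        self.SUM = np.zeros(d)
        self.SUMSQ = np.zeros(d)

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.SUM += x
        self.SUMSQ += x ** 2

    def centroid(self):
        return self.SUM / self.N                                  # SUM_i / N

    def variance(self):
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2     # E[X^2] - (E[X])^2

ds = DSSummary(d=2)
for p in [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]:
    ds.add(p)
print(ds.centroid(), ds.variance())   # [2. 3.] [0.667 2.]
```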

Processing

  1. Add points that are sufficiently close to a cluster centroid to that cluster and its DS
  2. Cluster the remaining points and the old RS
    1. Clusters→ CS; Outliers → RS
  3. DS: Adjust Stats for new points
  4. Merge CS
  5. In the final round, merge all CS clusters and all RS points into their nearest cluster

Close

  1. Mahalanobis distance less than a threshold
  2. High likelihood of the point belonging to the currently nearest centroid

Mahalanobis Distance(M-distance)

point $(x_1,\cdots, x_d)$, centroid $(c_1, \cdots, c_d)$

$\sigma_i$: Standard deviation of points in the cluster in ith dimension

  • Normalise in each dimension $y_i=\frac{x_i-c_i}{\sigma_i}$
  • $d(x,c)=\sqrt{\sum\limits_{i=1}^d(\frac{x_i-c_i}{\sigma_i})^2}$
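A sketch of the normalised M-distance and a closeness test; the threshold value is a modelling choice, not fixed by the algorithm:

```python
import numpy as np

def mahalanobis(x, centroid, std):
    # per-dimension normalisation y_i = (x_i - c_i) / sigma_i, then the L2 norm
    y = (np.asarray(x) - np.asarray(centroid)) / np.asarray(std)
    return np.sqrt(np.sum(y ** 2))

def is_close(x, centroid, std, threshold):
    # threshold is a tuning choice, e.g. a small multiple of sqrt(d)
    return mahalanobis(x, centroid, std) < threshold

print(mahalanobis([3.0, 4.0], centroid=[1.0, 1.0], std=[1.0, 2.0]))  # 2.5
```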

Combine 2 CS clusters if the combined variance is below threshold

CURE(Clustering Using REpresentatives)

  • Assume a Euclidean distance
  • Allows clusters to assume any shape
  • Uses a collection of representative points to represent clusters

2 Pass

Pass 1

  1. Pick a random sample of points that fits in main memory
  2. Initial clusters: hierarchically cluster the sampled points
  3. Pick representative points: for each cluster, choose a few dispersed points and move each a fixed fraction (e.g. 20%) toward the cluster centroid

Pass 2

  1. Rescan the dataset, visit each point p
  2. Place it in the cluster whose representative point is closest
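A sketch of picking representative points for one cluster: a few dispersed points, each moved a fixed fraction toward the centroid (the `n_rep` count and 20% shrink factor below are typical choices, not mandated by CURE):

```python
import numpy as np

def representatives(cluster, n_rep=4, shrink=0.2):
    """Pick n_rep dispersed points from the cluster, then move each
    a fraction `shrink` of the way toward the cluster centroid."""
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)
    # first representative: the point farthest from the centroid
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(cluster)):
        # next: the point maximising its distance to the closest chosen representative
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])
    return [r + shrink * (centroid - r) for r in reps]
```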

DimRed

SVD

Matrix Rank: Number of linearly independent columns of a matrix

Definitions

$A_{[m \times n]} = U_{[m \times r]}\,\Sigma_{[r \times r]}\,(V_{[n \times r]})^T$

A: Input data matrix; U: Left singular vectors; $\Sigma$: Singular values;

r: Rank of the matrix A; V: Right Singular vectors

Properties

$U, \Sigma, V$: Unique

$U, V$ column orthonormal

$\Sigma$: Diagonal, singular values are positive and sorted in decreasing order

$A = U \Sigma V^T = \sum\limits_{i} \sigma_i u_i v_i^T$

Goal: Minimize sum of reconstruction errors

Best Low Rank Approx Theorem

$A = U \Sigma V^T$ ($\sigma_1 \geq \sigma_2 \geq \cdots$, $rank(A)=r$), $B=USV^T$

S: diagonal $r \times r$ matrix with $s_i = \sigma_i$ for $i = 1,\ldots,k$ and $s_i=0$ otherwise

Then $B$ is the best rank-k approximation to $A$: it solves $\min\limits_{B,rank(B)=k} ||A-B||_F$

Proof:

$\min\limits_{B,rank(B)=k} ||A-B||_F = \min ||\Sigma - S||_F = \min \sum\limits_i (\sigma_i - s_i)^2 = \min \left[ \sum\limits_{i=1}^{k}(\sigma_i - s_i)^2 + \sum\limits_{i=k+1}^{r} \sigma_i^2 \right] = \sum\limits_{i=k+1}^{r} \sigma_i^2$
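A NumPy check of the theorem: truncating to the top-k singular values gives the rank-k approximation, and the Frobenius error equals $\sqrt{\sum_{i=k+1}^{r}\sigma_i^2}$ (random test matrix assumed):

```python
import numpy as np

A = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

# Frobenius error equals sqrt of the sum of the discarded singular values squared
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
```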

CUR

  1. Input: matrix $A \in \mathbb{R}^{m \times n}$, desired rank k.
  2. Select c columns from A to form matrix C
  3. Select r rows from A to form matrix R
  4. Let W be the intersection of the chosen columns and rows; compute $U = W^+$, the Moore-Penrose pseudoinverse of W
    1. If W is nonsingular, the pseudoinverse equals the true inverse $W^{-1}$
  5. $A \approx CUR$ (see the sketch below)
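A simplified CUR sketch: columns/rows sampled with probability proportional to their squared norm, `pinv` playing the role of $W^+$; the rescaling of C and R used in the full algorithm is omitted here:

```python
import numpy as np

def cur(A, c, r, seed=0):
    """Simplified CUR sketch (no rescaling of the sampled columns/rows)."""
    rng = np.random.default_rng(seed)
    col_p = (A ** 2).sum(axis=0) / (A ** 2).sum()   # squared-norm column probabilities
    row_p = (A ** 2).sum(axis=1) / (A ** 2).sum()   # squared-norm row probabilities
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    W = A[np.ix_(rows, cols)]                       # intersection of chosen cols and rows
    U = np.linalg.pinv(W)                           # true inverse if W is nonsingular
    return C, U, R

A = np.outer(np.arange(1, 7), np.arange(1, 5)).astype(float)  # rank-1 test matrix
C, U, R = cur(A, c=2, r=2)
print(np.linalg.norm(A - C @ U @ R))   # ~0 for this low-rank example
```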

Comparison

$A = U \Sigma V^T$ vs. $A \approx CUR$

A: huge and sparse (in both)

SVD: U, V big and dense; $\Sigma$ small and sparse

CUR: C, R big but sparse; U small and dense

RecSys

X = set of Customers

S = set of Items

R = set of ratings (a totally ordered set)

Utility function: u: $X \times S \rightarrow R$

Approaches

  1. Content-based
  2. Collaborative Filtering
  3. Collaborative Reasoning
  4. Large Foundation Models

Content-based

Recommend items similar to previous items rated highly by the customer

Item Profiles: vector of features

TF-IDF

$f_{ij} = \text{frequency of term(feature) i in doc (item) j}$

$TF_{ij}=\frac{f_{ij}}{\max_k f_{kj}}$

$n_i$: number of docs that mention term i; N: total number of docs

$IDF_i=\log \frac{N}{n_i}$

$\text{TF-IDF score } w_{i,j} = TF_{ij} \times IDF_i$
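A sketch of the TF-IDF weights on a toy corpus (log base 2 is an arbitrary choice here; the notes leave the base unspecified):

```python
import math
from collections import Counter

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple", "cherry", "cherry"]]
N = len(docs)
counts = [Counter(d) for d in docs]

# document frequency n_i: number of docs mentioning term i
n = Counter(term for c in counts for term in c)

def tf_idf(term, j):
    tf = counts[j][term] / max(counts[j].values())   # f_ij / max_k f_kj
    idf = math.log(N / n[term], 2)                   # log(N / n_i)
    return tf * idf

print(tf_idf("apple", 0))   # 1.0 * log2(3/2) ≈ 0.58
```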

User Profiles and prediction

User profile: weighted average of all rated item profiles

Prediction heuristic

Given user profile x and item profile i, cosine similarity

$u(x,i) = cos(x,i) = \frac{x\cdot i}{||x|| \cdot ||i||}$
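A sketch of the profile-based prediction: the user profile is the rating-weighted average of the rated item profiles (raw ratings as weights; variants mean-centre them first), and a candidate item is scored by cosine similarity:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# item profiles (rows = items, columns = features) and the user's ratings of them
rated_profiles = np.array([[1.0, 0.0, 1.0],
                           [0.0, 1.0, 1.0]])
ratings = np.array([5.0, 2.0])

# user profile x: weighted average of the rated item profiles
x = (ratings[:, None] * rated_profiles).sum(axis=0) / ratings.sum()

new_item = np.array([1.0, 0.0, 0.0])
print(cosine(x, new_item))   # u(x, i)
```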

Pros

  1. No need for other users' data
  2. Works for users with unique tastes and for new/unpopular items
  3. Can explain recommendations via content features

Cons

  1. Hard to find the right features
  2. New users have no profile (cold start)
  3. Overspecialisation: never recommends outside the user's content profile

Collaborative Filtering

Find similar users

Jaccard similarity: ignores the rating values

Cosine similarity: treats missing ratings as zero, i.e. implicitly negative

$r_x:\text{vector of user x’s ratings}$

Pearson correlation coefficient

$S_{xy}=\text{items rated by both users x and y}$

$sim(x,y)=\frac{\sum_{s \in S_{xy}}(r_{xs}-\overline{r_x})(r_{ys}-\overline{r_y})}{\sqrt{\sum_{s \in S_{xy}}(r_{xs}-\overline{r_x})^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys}-\overline{r_y})^2}}$
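A sketch of $sim(x,y)$ with ratings stored as dicts; here $\overline{r_x}$ is taken over the co-rated items $S_{xy}$ (one common convention):

```python
import math

def pearson_sim(rx, ry):
    """rx, ry: dicts item -> rating. Correlation over the co-rated items S_xy."""
    common = set(rx) & set(ry)
    if len(common) < 2:
        return 0.0
    mean_x = sum(rx[s] for s in common) / len(common)
    mean_y = sum(ry[s] for s in common) / len(common)
    num = sum((rx[s] - mean_x) * (ry[s] - mean_y) for s in common)
    den = math.sqrt(sum((rx[s] - mean_x) ** 2 for s in common)) * \
          math.sqrt(sum((ry[s] - mean_y) ** 2 for s in common))
    return num / den if den else 0.0

print(pearson_sim({"a": 5, "b": 3, "c": 4}, {"a": 4, "b": 2, "c": 5}))
```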

Rating Predictions

N: the set of k users most similar to x who have rated item i

e.g. prediction for item i of user x:

$r_{xi}=\frac{1}{k} \sum_{y \in N} r_{yi}$

$r_{xi} = \frac{\sum_{y \in N}s_{xy} \cdot r_{yi}}{\sum_{y \in N}s_{xy}}$, where $s_{xy}=sim(x,y)$
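A sketch of both prediction rules over the neighbourhood N (toy similarity/rating values):

```python
def predict_simple(neighbour_ratings):
    """neighbour_ratings: list of r_yi for the k most similar users who rated i."""
    return sum(neighbour_ratings) / len(neighbour_ratings)

def predict_weighted(neighbours):
    """neighbours: list of (s_xy, r_yi) pairs for y in N."""
    num = sum(s * r for s, r in neighbours)
    den = sum(s for s, _ in neighbours)
    return num / den if den else 0.0

print(predict_simple([4, 5, 3]))                         # 4.0
print(predict_weighted([(0.9, 4), (0.5, 5), (0.1, 3)]))  # ~4.27
```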

Pro

Works for any kind of item; no feature selection needed

Cons

  1. User/Item cold start
  2. Sparsity
  3. Popularity bias

Metrics

Rating Prediction: MAE, RMSE

Top-n Rec:

$\text{Precision}=\frac{tp}{tp+fp}$

$\text{Recall}=\frac{tp}{tp+fn}$

$F_1=2 \cdot \frac{precision \cdot recall}{ precision + recall}$, Optimal value = 1

Normalised Discounted Cumulative Gain

DCG

p: position up to which relevance is accumulated

$rel_i:$ relevance of the recommendation at position i

$DCG_p = \sum\limits_{i=1}^p \frac{2^{rel_i}-1}{\log_2(i+1)}$

Idealised DCG

Assumption that items are ordered by decreasing relevance

$IDCG_p=\sum\limits_{i=1}^{|REL|} \frac{2^{rel_i}-1}{\log_2(i+1)}$

Normalised DCG

normalised to the interval [0,1]

$nDCG_p=\frac{DCG_p}{IDCG_p}$
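A sketch of $DCG_p$, the ideally ordered list for $IDCG_p$, and their ratio (this version truncates the ideal list at p, a common implementation choice):

```python
import math

def dcg(relevances, p):
    # i is 0-based here, so the discount is log2(i + 2) = log2(position + 1)
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:p]))

def ndcg(relevances, p):
    ideal = sorted(relevances, reverse=True)   # items ordered by decreasing relevance
    idcg = dcg(ideal, p)
    return dcg(relevances, p) / idcg if idcg else 0.0

print(ndcg([3, 2, 0, 1], p=4))   # < 1.0, since the ranking is not ideally ordered
```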

Mean Average Precision(MAP)

For each user, average the precision values measured after each successful prediction; MAP is the mean of these averages over all users
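A sketch of average precision for one ranked list, with MAP as the mean over users' lists:

```python
def average_precision(ranked_relevant):
    """ranked_relevant: list of booleans, True where the recommendation was a hit."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)   # precision after each successful prediction
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(average_precision([True, False, True, True]))   # (1/1 + 2/3 + 3/4) / 3 ≈ 0.81
```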

Latent Factor Models

CF

  1. Item-Item CF
    • Predict ratings using similar items $s_{ij}$
    • Weighted average of user x’s ratings for similar items.
  2. Bias Model
    • Adjust for global mean $\mu$, user bias $b_u$, item bias $b_i$
    • $\hat{r}_{ui} = \mu + b_u + b_i$
  3. Weighted CF
    • Combines biases and learned weights $w_{ij}$
    • $\hat{r}_{ui} = \mu + b_u + b_i + \sum\limits_j w_{ij}(r_{uj} - \mu - b_u - b_j)$
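A sketch of the bias (baseline) prediction and the weighted correction on top of it; in practice the biases and weights $w_{ij}$ are learned, here they are made-up numbers:

```python
def baseline(mu, b_u, b_i):
    # global mean + user bias + item bias
    return mu + b_u + b_i

def weighted_cf(mu, b_u, b_i, neighbours):
    """neighbours: list of (w_ij, r_uj, b_j) for items j rated by the user."""
    correction = sum(w * (r_uj - mu - b_u - b_j) for w, r_uj, b_j in neighbours)
    return baseline(mu, b_u, b_i) + correction

print(baseline(3.6, 0.2, -0.3))                                   # 3.5
print(weighted_cf(3.6, 0.2, -0.3, [(0.4, 5.0, 0.1), (0.1, 2.0, -0.2)]))
```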

MF

  1. Latent Factors
    • Decompose R as $P \cdot Q^T$
    • Minimize error: $\min\limits_{P, Q} \sum\limits_{(u,i)} (r_{ui} - p_u \cdot q_i)^2 + \lambda (\|P\|_F^2 + \|Q\|_F^2)$
  2. GD to update P,Q
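A minimal latent-factor sketch trained by stochastic gradient descent with L2 regularisation (learning rate, $\lambda$, k and the toy ratings below are arbitrary choices):

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=2, lr=0.05, lam=0.1, epochs=500, seed=0):
    """ratings: list of (u, i, r_ui) triples. Returns P (n_users x k), Q (n_items x k)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - pu @ qi                    # prediction error r_ui - p_u . q_i
            P[u] += lr * (err * qi - lam * pu)   # gradient step with L2 penalty
            Q[i] += lr * (err * pu - lam * qi)
    return P, Q

# toy data: (user, item, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
P, Q = train_mf(ratings, n_users=2, n_items=3)
print(P[0] @ Q[1])   # predicted rating of user 0 for item 1
```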

Temporal Dynamics

  • Add time-dependence to biases $b_u(t), b_i(t)$ and factors $p_u(t)$

Regularisation $\lambda$ prevents overfitting