University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences 

IDS 472

  Statistics for Information Systems and Data Mining ("SISDM")

Instructor

  Prof. Stanley L. Sclove


Notes on Cluster Analysis
These notes   Copyright   ©   2001    Stanley Louis Sclove


HyperTable of Contents

1.   Introduction

2.   What is Cluster Analysis?
3.   How Does Cluster Analysis Work?
4.   A Famous Example of Cluster Analysis 
5.   Objectives of Cluster Analysis

6.   Preparing the Data

6.1. Detecting Outliers

6.2. Similarity Measures

6.3. Standardizing the Data

7. Assumptions in Cluster Analysis

8. Deriving Clusters and Assessing Overall Fit

8.1. Clustering Algorithms

8.2. How Many Clusters Should Be Formed?

9. Interpretation of the Clusters:   Interpreting or labeling the clusters

10. Validating  and Profiling of the Clusters

10.1. Validating the Cluster Solution

10.2. Profiling the Cluster Solution

11.Summary

12. An Illustrative Example

13. Summary . Questions . References



1. Introduction

   As you know from NOVA, "We Know Where You Live," the role of cluster analysis in data mining is central.

 

What is Clustering ?    It's . . .  Groping with Grouping !!

 

The clustering problem is:    Given a set of objects, group them.

Of course, the problem has to be given more structure and constraints before anything can be done.

Data may be numerical, categorical, or a combination of both.

The data for cluster analysis may be nearly anything.  The methods and interpretation are somewhat more complicated with a combination of data types.   Sometimes the "data" are not  records-by-features but rather a matrix of pairwise similarities or differences. 
 

Naturally-occurring groups may be present in the data

Cluster analysis partitions the set of observations into mutually exclusive groupings in order to best represent distinct sets of observations within the sample.

Cluster analysis is not able, however, to confirm the validity of these groupings. The researcher must do this by

1.    ensuring that theoretical justification exists for the cluster analysis, and

2.    following up by profiling and discriminating between groups.


2. What is Cluster Analysis?

Cluster analysis identifies and classifies objects individuals or variables on the basis of the similarity of the characteristics they possess.  It seeks to minimize within-group variance and maximize between-group variance.  The result of cluster analysis is a number of heterogeneous groups with homogeneous contents:  There are substantial differences between the groups, but the individuals within a single group are similar. 

3. How Does Cluster Analysis Work?

Data may be thought of as points in a space where the axes correspond to the variables.  Cluster analysis divides the space into regions characteristic of groups that it finds in the data.

4. A Famous Example of Cluster Analysis

The discovery of white dwarfs and red giants is a famous example of cluster analysis.  Stars were plotted by astronomers Hertzsprung and Russell according to the two features, log luminosity and log temperature.     Three clusters emerged, white dwarfs, red giants and the main sequence in between them.

 

TABLE.  Profiles of the three clusters

HR Diagram

TEMPERATURE, T

LUMINOSITY

White Dwarfs

medium

low

Main Sequence

wide range

low when T is low, high when T is high

 Red Giants

medium low

high

 


5.  Objectives of Cluster Analysis

6.  Preparing the Data

The data may need to be preprocessed by outlier detection and standardization. 

6.1. Detecting Outliers

Cluster Analysis can be used for outlier detection.  Outliers may emerge as singletons or as small clusters far removed from the others.  To do  outlier detection at the same time as clustering the main body of the data, use enough clusters to represent both the main body of the data and the outliers. 

6.2.  Distance Measures

Types of Distance Measures

Euclidean distance is appropriate for variables that are uncorrelated and have equal variances.   Statistical distance adjusts for correlations and different variances.

Impact of Unstandardized Data Values

If  you don't use standardized values or statistical distance, a variable measured in small units (say inches instead of feet) can dominate.

6.3. Standardizing the Data

This is most usually done in terms of "z values."


7.  Assumptions in Cluster Analysis

Inter-object similarity is measured by distance between pairs of objects.

Often used, rightly or wrongly, is Euclidean distance, the length of the hypotenuse of a right triangle formed between the points.

Usually standarization of the data is needed; the statistical  distance (Mahalanobis distance) is preferred.  Standardization of the data is needed if the range or scale of one variable is much larger or different from the range of others.  This distance also compensates for intercorrelation among the variables.  Often one sums across the within-groups sum-of-products matrices to obtain a pooled covariance matrix for use in statistical distance.

A logical way to proceed is through model-based clustering, via the mixture model. 


8.  Deriving Clusters and Assessing Them

8.1. Clustering Algorithms

What clustering procedure?     hierarchical vs. nonhierarchical

8.1.1. Hierarchical Clustering  Procedures

Hierarchical clustering follows one of two approaches: Agglomerative methods start with each observation as a cluster and with each step combine observations to form clusters until there is only one large cluster.  Divisive methods begin with one large cluster and proceed to split into smaller clusters items that are most dissimilar.

There are five ways of defining inter-cluster distance:

1) single linkage (based on the shortest distance between objects);

2) complete linkage (based on the largest distance between objects);

3) average linkage (based on the average distance between objects);

4) Ward's method (based on the sum of squares between the two clusters, summed over all variables), and

5) centroid method (based on the distance between cluster centroids).

Here the distance between clusters is that between their centroids(mean vectors).


8.1.2. Nonhierarchical Clustering Procedures

Nonhierarchical clustering is partitioning of the sample.

Nonhierarchical clustering has three approaches:

 

1) the sequential threshold (based on one cluster seed at a time and membership in that cluster fulfilled before another seed is selected, i.e., looping through all n points before updating the seeds, as in the K-MEANS procedure),

2) parallel threshold (based on simultaneous cluster seed selection and membership threshold distance adjusted to include more or fewer objects in the clusters, i.e., updating the seeds as you go along, as in the ISODATA procedure), and

3) optimizing (same as the others except it allows for reassignment of objects to another cluster based on some optimizing criterion).

Selecting Seed Points    Let   k   denote the number of clusters to be formed.   Fix  k.   (Later you can try k-1,  k+1, etc.).  You choose   k   "seed" points to get started.   The results can depend upon the seed points, so often clustering is done several times, starting with different seed points. The  k  initial seeds can  arbitrarily be

·       the first k cases

·       a randomly chosen k cases

·       k specified cases  

·       or  chosen from a k-cluster hierarchical solution.

 

 

Programs include

Hierarchical

BMDP 2M

SAS CLUSTER

 

Non-hierarchical (Partitioning)

BMDP KM

MINITAB:  k-MEANS

          In Minitab, do   Stat > Multivariate > K-means Cluster .

SAS FASTCLUS

There is a DRIFT option.

 

 

SPSS QC (QUICK CLUSTER)   

 

MCLUST (see References at end)   

 
8.1.3. Should Hierarchical or Nonhierarchical Methods Be Used?

While there is no definite rule as to which type of clustering to use, it is suggested that both be used.  Start with hierarchical to generate and profile the clusters and then use nonhierarchical to fine tune the cluster membership with its switching ability.  In this case, the centroids from hierarchical clustering are taken as the seeds for nonhierarchical clustering.

Which Type of Clustering? -- Single vs. Complete vs. Average Linkage

A shortcoming  of single-linkage clustering:   Cluster A goes from upper left to lower right and might be fit by a bivariate normal distribution with negative correlation; Cluster B, from lower left to upper right, by one with positive correlation.   If a single link exists between Clusters A and B, they will be linked together at the next stage, even though overall they are dissimilar.   This suggests the need for model-based clustering.


Mixture Model Clustering

A conceptual model for clustering is that the sample comes from a mixture of several populations. This leads to a mathematical probability model for the data called the finite mixture-model. Suppose it is a mixture of two distributions. The model is that any given observation comes from Distribution 1 (population P1 ) with probability p1 and Distribution 2 (population P2 ) with probability p2.  Then the probability density function is

 

f(x)  =  p1 f1(x) +   p2 f2(x),

 

where  f1 and  f2  are the probability density functions of the two distributions.  If the within-cluster type of distribution is specified (such as multivariate normal), then the method of maximum likelihood can be used to estimate the parameters.  This has to be done with a step-by-step (iterative) algorithm, which can be shown to correspond to some of the algorithms discussed here.

By the definition of conditional probability, 

P(B|A)  =  P(A and B)/P(A), 

so the posterior probability of membership in Population 1 is


P(
P1|x)  =   p1 f1(x) / f(x) ,    where     f(x)  =  p1 f1(x) +   p2 f2(x) .


Analogous formulas hold for more than two component distributions.

In mixture-model clustering, all parameters are estimated.  This enables estimation of the posterior probabilities of group membership; a table of each case with all its membership probabilities can be printed out.  The rows are the case IDs; the  columns, P(P1|x), P(P2|x), . . .,  P(Pk|x) . 
 


8.2. How Many Clusters Should Be Formed?

There is no generally accepted procedure for determining the numer of clusters. This decision should be guided by theory and practicality of the results, along with use of the inter-cluster distances at successive steps. When using a criterion such as between-groups sum of squares or likelihood, this can be plotted against the number  k  of clusters in a scree diagram.   Also, the likelihood can be used in  model selection criteria such as AIC (Akaike's Information Criterion) or BIC (Bayesian Information Criterion)  to estimate k.

 

AIC  =   - 2 log likelihood  +  2*no.parms

 

BIC  =   -2 log likelihood  +   log(n)*no.parms

 

(Most statisticians who use model-selection criteria are leaning toward BIC now instead of AIC.)  In a mixture-model, the number of parameters, no.parms, is a mean vector for each group, plus the number of elements of the covariance matrix or matrices, plus the number of mixing probabilities.

9.  Interpretation of the Clusters:   Interpreting or labeling the clusters

This is  a creative process.  Examination of the cluster profiles will provide the researcher with insight as to what the clusters mean.  A cluster's profile can suggest a name for it. 


10.  Validation and Profiling of the Clusters

10.1. Validating the Cluster Solution

Statistical Tests

Suppose you've obtained three clusters with sizes n1, n2 and n3.  To test this clustering against the  hypothesis  that there is no clustering (i.e., that the population  is homogeneous):

 

Compute the mean vector and covariance matrix of the whole sample.  Draw pseudorandom samples of n1, n2 and n3 from the corresponding multinormal distribution and compute a measure of spread of the clusters.  The generates a sampling distribution for the measure of spread.  If the value for the actual sample is among the highest, you've got statistical significance.     

 

Validity in Test Cases

The sample can be split into training and test cases.  The centroids from the clustering of the training cases can be used to cluster the test cases  to see if comparable results are obtained. 

Validity for Variables not Used in the Clustering

The profile of the clusters across related variables not used in the clustering can be used in assessing validity. 

 


10.2. Profiling the Cluster Solution

A "profile" of a cluster is merely the set of mean values for that cluster.

 

These can be for the internal variables (used to form the clusters) as well as external variables.

The external variables may be demographic, psychographic, or consumption-pattern.

 

Once the clusters are formed, other techniques such as profiling or discriminant analysis can be used to see what internal variables account for the clustering. 

11. Summary of the Process

Some researchers do a preliminary hierarchical clustering to suggest how to start a non-hierarchical procedure. 

12. An Illustrative Example

See the exercises below for a small example.  We'll present examples in class, and maybe post them to the Web later.

13. Summary .  Questions .  References

13.1. Summary

There are two major ways of clustering, hierarchical and nonhierarchical.

 

Hierarchical clustering can be agglomerative or divisive.

 

A dendrogram is a way to represent the results of hierarchical clustering.

 

 The mixture model is a conceptual and statistical model for clustering.

 

13.2.  Questions

1.  Consider the following dataset.

# Dataset for N = 5 cases
# Afifi and Clark, Fig. 16.2, p. 431
#
# Case  X1     X2
# ---------------
   1     1      1
   2     1      2
   3     3      6
   4     5      7
   5     8      5

 

Cluster this dataset hierarchically:

a) Compute the matrix of squared Euclidean distances.

b) Which two cases are closest together?

c) Cluster these two together. Now continue clustering by complete linkage:-- Form a new matrix of squared distances, where the squared distance between clusters is the maximum of the squared distances, for all pairs formed with one member in one cluster and the other member in the other. 

d) In the next step, which two clusters are put together?

e) Continue the agglomeration until all cases are in a single cluster.

2. (continuation)  Cluster the above dataset non-hierarchically, by K-Means. 

Do it twice, with different starting points. 

 
3.  The coordinates of Case 1 are 4.7 and 2.4. The coordinates of Case 2 are 6.6 and 2.5. Find the coordinates of Case 3, given that

the Euclidean distance between Case 3 and Case 1 is 1.342 and

the Euclidean distance between Case 3 and Case 2 is 0.989 .

 

HINT:  There are two correct answers:  Draw a circle with radius 1.342 around Case 1 and a circle with radius 0.989 around Case 2;  the circles will intersect in two points. 

 

 


13.3. References and Bibliography

Afifi, A., & Clark, V. (1996).  Computer-Aided Multivariate Analysis.  3rd ed.     Chapman & Hall,  New York. [Ch. 16 is on cluster analysis.]

 

Aldenderfer, Mark S., & Blashfield, Roger K. (1984).  Cluster Analysis.  Sage Publications, Newbury Park, Cal.

 

Berry, Michael J.A., and Linoff, Gordon (1997).  Data Mining Techniques:  for Marketing, Sales, and Customer Support.  John Wiley & Sons, Inc., New York.  [Ch. 10 is on Automatic Cluster Detection.]

 

Fraley, C., and Raftery, A.E. (1999).  MCLUST:  Software for Model-Based Cluster Analysis.  Jnl. of Classification  16 (2) 297-306.  A Webpage with related links can be found at http://www.stat.washington.edu/fraley/mclust.

 

Hair, Anderson, Tatham and Black.   Multivariate Data Analysis.  5th ed.  Prentice Hall, 1998.  [Chapter 9 is on Cluster Analysis.]

 

Hartigan, John (1975).  Clustering Algorithms.   John Wiley & Sons, Inc., New York.

 

Johnson, R., & Wichern, D.  (1998).   Applied Multivariate Statistical Analysis.  4th  ed.  Prentice Hall.   [Part of Chapter 12 is on  cluster analysis.]

 

 

 


Created: 2 October 1998       Updated:   16 Nov 2000