IDS 472 |
Statistics for Information Systems and Data Mining ("SISDM") |
Instructor |
Prof. Stanley L. Sclove |
Notes on Cluster Analysis
These
notes Copyright
© 2001 Stanley Louis Sclove
HyperTable of
Contents
1. Introduction
2. What is Cluster Analysis?
3. How Does Cluster Analysis Work?
4. A Famous Example of Cluster Analysis
5. Objectives of Cluster Analysis
6. Preparing the Data
6.1. Detecting Outliers
6.2. Similarity Measures
7. Assumptions in Cluster Analysis
8. Deriving Clusters and Assessing Overall Fit
8.2. How Many Clusters Should Be
Formed?
9. Interpretation of the Clusters: Interpreting or
labeling the clusters
10. Validating and Profiling of the Clusters
10.1. Validating the Cluster
Solution
10.2. Profiling the Cluster Solution
11.Summary
13. Summary . Questions . References
As you know from NOVA, "We
Know Where You Live," the role of cluster analysis in data mining is
central.
What
is Clustering ? It's .
. . Groping with Grouping !!
The
clustering problem is: Given a set of objects, group them.
Of course, the
problem has to be given more structure and constraints before anything can be
done.
Data may be
numerical, categorical, or a combination of both.
The data for
cluster analysis may be nearly anything.
The methods and interpretation are somewhat more complicated with a
combination of data types.
Sometimes the "data" are not records-by-features but rather a matrix of pairwise
similarities or differences.
Naturally-occurring
groups may be present in the data
Cluster analysis
partitions the set of observations into mutually exclusive groupings in order
to best represent distinct sets of observations within the sample.
Cluster analysis is
not able, however, to confirm the validity of these groupings. The researcher
must do this by
1.
ensuring
that theoretical justification exists for the cluster analysis, and
2.
following
up by profiling and discriminating between groups.
Cluster
analysis
identifies and classifies objects individuals or variables on the basis of the
similarity of the characteristics they possess. It seeks to minimize within-group variance and maximize
between-group variance. The result
of cluster analysis is a number of heterogeneous groups with homogeneous
contents: There are substantial
differences between the groups, but the individuals within a single group are
similar.
Data
may be thought of as points in a space where the axes correspond to the
variables. Cluster analysis
divides the space into regions characteristic of groups that it finds in the
data.
The
discovery of white dwarfs and red giants is a famous example of cluster
analysis. Stars were plotted by
astronomers Hertzsprung and Russell according to the two features, log
luminosity and log temperature. Three clusters emerged,
white dwarfs, red giants and the main sequence in between them.
TABLE. Profiles of the three clusters
|
HR Diagram |
TEMPERATURE,
T |
LUMINOSITY |
|
White
Dwarfs |
medium |
low |
|
Main
Sequence |
wide
range |
low
when T is low, high when T is high |
|
Red Giants |
medium
low |
high |
The data may need to be preprocessed by outlier detection and
standardization.
Cluster
Analysis can be used for outlier detection. Outliers may emerge as singletons or as small clusters far
removed from the others. To do outlier detection at the same time as
clustering the main body of the data, use enough clusters to represent both the
main body of the data and the outliers.
Euclidean
distance is
appropriate for variables that are uncorrelated and have equal variances. Statistical distance adjusts for
correlations and different variances.
If you don't use standardized values or
statistical distance, a variable measured in small units (say inches instead of
feet) can dominate.
This
is most usually done in terms of "z values."
Inter-object
similarity is measured by distance between pairs of objects.
Often
used, rightly or wrongly, is Euclidean distance, the length of the hypotenuse
of a right triangle formed between the points.
Usually
standarization of the data is needed; the statistical distance (Mahalanobis distance) is preferred. Standardization of the data is needed
if the range or scale of one variable is much larger or different from the
range of others. This distance
also compensates for intercorrelation among the variables. Often one sums across the within-groups
sum-of-products matrices to obtain a pooled covariance matrix for use in
statistical distance.
A logical way to
proceed is through model-based clustering, via the mixture model.
What
clustering procedure?
hierarchical vs. nonhierarchical
Hierarchical
clustering follows one of two approaches: Agglomerative methods start
with each observation as a cluster and with each step combine observations to
form clusters until there is only one large cluster. Divisive methods
begin with one large cluster and proceed to split into smaller clusters items
that are most dissimilar.
There
are five ways of defining inter-cluster distance:
1) single linkage
(based on the shortest distance between objects);
2) complete linkage
(based on the largest distance between objects);
3) average linkage
(based on the average distance between objects);
4) Ward's method
(based on the sum of squares between the two clusters, summed over all
variables), and
5)
centroid method (based on the distance between cluster centroids).
Here the distance between clusters is that between
their centroids(mean vectors).
Nonhierarchical clustering is partitioning
of the sample.
Nonhierarchical
clustering has three approaches:
1)
the sequential threshold (based on one cluster seed at a time and membership in
that cluster fulfilled before another seed is selected, i.e., looping through
all n points before updating the seeds, as in the K-MEANS procedure),
2) parallel
threshold (based on simultaneous cluster seed selection and membership
threshold distance adjusted to include more or fewer objects in the clusters, i.e.,
updating the seeds as you go along, as in the ISODATA procedure), and
3) optimizing (same
as the others except it allows for reassignment of objects to another cluster
based on some optimizing criterion).
Selecting Seed
Points Let k denote the number of clusters to be
formed. Fix k.
(Later you can try k-1,
k+1, etc.). You choose
k "seed" points to get started. The results
can depend upon the seed points, so often clustering is done several times,
starting with different seed points. The
k initial seeds can arbitrarily be
·
the
first k cases
·
a
randomly chosen k cases
·
k
specified cases
·
or chosen from a k-cluster hierarchical
solution.
Programs include
Hierarchical
BMDP 2M
SAS CLUSTER
Non-hierarchical
(Partitioning)
BMDP KM
MINITAB: k-MEANS
In
Minitab, do Stat >
Multivariate > K-means Cluster .
SAS FASTCLUS
There is a DRIFT option.
SPSS QC (QUICK CLUSTER)
MCLUST
(see References at end)
While
there is no definite rule as to which type of clustering to use, it is
suggested that both be used. Start
with hierarchical to generate and profile the clusters and then use
nonhierarchical to fine tune the cluster membership with its switching
ability. In this case, the
centroids from hierarchical clustering are taken as the seeds for nonhierarchical
clustering.
A
shortcoming of single-linkage clustering: Cluster A goes from
upper left to lower right and might be fit by a bivariate normal distribution
with negative correlation; Cluster B, from lower left to upper right, by one
with positive correlation. If a single link exists between Clusters
A and B, they will be linked together at the next stage, even though overall
they are dissimilar. This suggests the need for model-based
clustering.
A
conceptual model for clustering is that the sample comes from a mixture of
several populations. This leads to a mathematical probability model for the
data called the finite mixture-model. Suppose it is a mixture of two
distributions. The model is that any given observation comes from Distribution
1 (population P1 ) with probability p1 and Distribution 2
(population P2
) with probability p2. Then the probability density function
is
f(x) = p1
f1(x) + p2 f2(x),
where f1 and f2 are the probability density functions of the two
distributions. If the within-cluster
type of distribution is specified (such as multivariate normal), then the
method of maximum likelihood can be used to estimate the parameters. This has to be done with a step-by-step
(iterative) algorithm, which can be shown to correspond to some of the
algorithms discussed here.
By the definition
of conditional probability,
P(B|A)
= P(A and B)/P(A),
so the posterior
probability of membership in Population 1 is
P(P1|x)
= p1 f1(x) / f(x)
, where f(x) = p1
f1(x) + p2 f2(x) .
Analogous formulas hold for more than two component distributions.
In mixture-model
clustering, all parameters are estimated.
This enables estimation of the posterior probabilities of group
membership; a table of each case with all its membership probabilities can be
printed out. The rows are the case
IDs; the columns, P(P1|x), P(P2|x), . . ., P(Pk|x) .
There
is no generally accepted procedure for determining the numer of clusters. This
decision should be guided by theory and practicality of the results, along with
use of the inter-cluster distances at successive steps. When using a criterion
such as between-groups sum of squares or likelihood, this can be plotted
against the number k of clusters in a scree diagram. Also, the likelihood
can be used in model
selection criteria such as AIC (Akaike's Information Criterion) or BIC
(Bayesian Information Criterion)
to estimate k.
AIC = - 2 log likelihood + 2*no.parms
BIC = -2 log likelihood +
log(n)*no.parms
(Most
statisticians who use model-selection criteria are leaning toward BIC now
instead of AIC.) In a
mixture-model, the number of parameters, no.parms, is a mean vector for each
group, plus the number of elements of the covariance matrix or matrices, plus
the number of mixing probabilities.
This
is a creative process. Examination of the cluster profiles
will provide the researcher with insight as to what the clusters mean. A cluster's profile can suggest a name
for it.
Suppose
you've obtained three clusters with sizes n1, n2 and n3. To test this clustering against
the hypothesis that there is no clustering (i.e., that
the population is homogeneous):
Compute
the mean vector and covariance matrix of the whole sample. Draw pseudorandom samples of n1, n2 and
n3 from the corresponding multinormal distribution and compute a measure of
spread of the clusters. The
generates a sampling distribution for the measure of spread. If the value for the actual sample is
among the highest, you've got statistical significance.
The
sample can be split into training and test cases. The centroids from the clustering of the training cases can
be used to cluster the test cases
to see if comparable results are obtained.
The
profile of the clusters across related variables not used in the clustering can
be used in assessing validity.
A
"profile" of a cluster is merely the set of mean values for that
cluster.
These
can be for the internal variables (used to form the clusters) as well as
external variables.
The
external variables may be demographic, psychographic, or consumption-pattern.
Once
the clusters are formed, other techniques such as profiling or discriminant
analysis can be used to see what internal variables account for the
clustering.
Some
researchers do a preliminary hierarchical clustering to suggest how to start a
non-hierarchical procedure.
See
the exercises below for a small example.
We'll present examples in class, and maybe post them to the Web later.
There
are two major ways of clustering, hierarchical and nonhierarchical.
Hierarchical
clustering can be agglomerative or divisive.
A dendrogram
is a way to represent the results of hierarchical clustering.
The mixture model is a conceptual
and statistical model for clustering.
1. Consider the following dataset.
# Dataset for N = 5 cases
# Afifi and Clark, Fig. 16.2, p. 431
#
# Case X1 X2
# ---------------
1
1 1
2
1 2
3
3 6
4
5 7
5
8 5
Cluster this dataset
hierarchically:
a) Compute the matrix of
squared Euclidean distances.
b) Which two cases are
closest together?
c) Cluster these two
together. Now continue clustering by complete linkage:-- Form a new
matrix of squared distances, where the squared distance between clusters is the
maximum of the squared distances, for all pairs formed with one member
in one cluster and the other member in the other.
d) In the next step, which
two clusters are put together?
e) Continue the
agglomeration until all cases are in a single cluster.
2. (continuation) Cluster the above dataset
non-hierarchically, by K-Means.
Do
it twice, with different starting points.
3. The coordinates of Case 1 are 4.7 and
2.4. The coordinates of Case 2 are 6.6 and 2.5. Find the coordinates of Case 3,
given that
the Euclidean distance
between Case 3 and Case 1 is 1.342 and
the Euclidean distance
between Case 3 and Case 2 is 0.989 .
HINT: There are two correct answers: Draw a circle with radius 1.342 around
Case 1 and a circle with radius 0.989 around Case 2; the circles will intersect in two points.
Afifi, A., & Clark, V.
(1996). Computer-Aided
Multivariate Analysis. 3rd ed. Chapman & Hall, New York. [Ch. 16
is on cluster analysis.]
Aldenderfer,
Mark S., & Blashfield, Roger K. (1984). Cluster Analysis.
Sage Publications, Newbury Park, Cal.
Berry,
Michael J.A., and Linoff, Gordon (1997).
Data Mining Techniques:
for Marketing, Sales, and Customer Support. John Wiley & Sons, Inc., New York. [Ch. 10 is on Automatic Cluster
Detection.]
Fraley,
C., and Raftery, A.E. (1999).
MCLUST: Software for
Model-Based Cluster Analysis. Jnl.
of Classification 16 (2)
297-306. A Webpage with related
links can be found at http://www.stat.washington.edu/fraley/mclust.
Hair,
Anderson, Tatham and Black. Multivariate
Data Analysis. 5th ed. Prentice Hall, 1998. [Chapter 9 is
on Cluster Analysis.]
Hartigan,
John (1975). Clustering
Algorithms. John Wiley
& Sons, Inc., New York.
Johnson,
R., & Wichern, D. (1998). Applied Multivariate
Statistical Analysis. 4th ed. Prentice Hall.
[Part of Chapter 12 is on cluster analysis.]