University of Illinois at Chicago
School of Public Health
Division of Epidemiology & Biostatistics
BSTT 580 Applied Multivariate Analysis
Instructor: Stan Sclove
Textbook: Johnson & Wichern, 4th ed. (JW)
Notes on the Family of Multivariate Normal Distributions
(Supplementary notes to be read with JW Chapter 4)
These notes Copyright © 1999 Stanley Louis Sclove
4. The Multivariate Normal Distribution
Reading assigned: Secs. 4.1,2,6,7,8
4.1. Introduction
Many of the techniques in the book are based on the assumption that the
data have a distribution that is close to a multivariate normal distribution
(MVND). The properties of this family of distributions are the
subject of Section 4.2. Methods of assessing normality are discussed in
Section 4.6. The problem of outlier detection is discussed in Section 4.7.
Often data which are not nearly normally distributed can be transformed
(e.g., by taking logs) to data which are approximately normal, so
that normal-based methods can be used. Such transformations
are the topic of Section 4.8.
Normal distributions are useful in practice for two reasons:
- many variables are approximately normally distributed;
- many statistics are approximately normally distributed.
4.2. The Multivariate Normal Density and its Properties
The Family of Univariate Normal Distributions
The univariate normal probability density function (p.d.f.) looks like
a cross-section of a bell. Its formula is
Const. exp[-(1/2)z^2], where z = (x - µ)/sigma
and Const. = 1/[sigma (2*pi)^{1/2}].
In connection with the univariate normal distribution we have
the 68-95-99.7% rule:
If X has a normal distribution, then
- P(µ - sigma < X < µ + sigma) is about .6827, or about .68
  (about two-thirds);
- P(µ - 2 sigma < X < µ + 2 sigma) is about .9545, or about .95;
- P(µ - 3 sigma < X < µ + 3 sigma) is about .9973, or about .997.
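As a quick numerical check of the rule, the following Python sketch evaluates the three probabilities from the error function, using the identity P(|Z| < k) = erf(k/sqrt(2)) for a standard normal Z. The function name prob_within is ours, chosen for illustration.

```python
import math

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for normal X, i.e. erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The probabilities do not depend on µ or sigma, which is why the rule can be stated for the whole family at once.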
The Family of Multivariate Normal Distributions
Parameters
In a family of distributions, the family member is specified by the values
of the parameters. For multivariate normal distributions, the parameters
are the mean vector and covariance matrix.
Example. For a bivariate normal distribution, there are five parameters:
the mean of X, the mean of Y, the standard deviation of X, the standard
deviation of Y, and the covariance of X and Y.
Remark. Since each covariance is the product of the corresponding
correlation and the two standard deviations, specifying the covariance
matrix is equivalent to specifying the correlations and standard
deviations.
Role of D-square
The multivariate normal probability density function (p.d.f.) involves
Mahalanobis D-square,
D^2(x, µ; C) = (x - µ)' C^{-1} (x - µ),
where x is the vector of values of the variables, µ
is the mean vector, and C is the covariance matrix.
Denote this by Q for short. Then the p.d.f. is
of the form
Const. e^{-Q/2},
where
Const. = 1/[(2*pi)^{p/2} det(C)^{1/2}]
and p is the number of variables.
The quantity Q is the square of the statistical (Mahalanobis) distance
between x and µ. The larger this distance, the smaller
the probability density; the density decreases exponentially with the square
of the distance.
Example 4.1 (p. 159): Bivariate normal density
Parameters
A bivariate normal distribution (joint normal distribution
of just p = 2 variables) is specified by giving the values of five parameters,
namely, the two means, the two standard deviations, and the correlation
(or the covariance).
Example
Suppose for a population of adult males, height (H) and weight
(W) are jointly normally distributed with E(H) = 68", SD(H) = 2.5", E(W)
= 165 lbs., SD(W) = 25 lbs., and Corr(H,W) = +.4. Then
Cov(H,W) = Corr(H,W)*SD(H)*SD(W) = (+.4)(2.5)(25) = +25.0.
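The following Python sketch builds the covariance matrix for this example and evaluates Mahalanobis D-square and the bivariate normal density at one point. The function names d_square and bvn_pdf and the evaluation point (70.5", 190 lbs.) are our own illustrative choices, not part of the example; the 2-by-2 inverse is written out by hand so that only the standard library is needed.

```python
import math

# Values from the height/weight example in these notes.
mu = (68.0, 165.0)              # E(H) in inches, E(W) in lbs.
sd_h, sd_w, rho = 2.5, 25.0, 0.4

cov_hw = rho * sd_h * sd_w      # Cov(H,W) = Corr(H,W)*SD(H)*SD(W)
C = ((sd_h ** 2, cov_hw), (cov_hw, sd_w ** 2))

def d_square(x, mu, C):
    """Mahalanobis D^2 = (x - mu)' C^{-1} (x - mu), written out for p = 2."""
    a, b, d = C[0][0], C[0][1], C[1][1]
    det = a * d - b * b
    u, v = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [b, d]] is [[d, -b], [-b, a]] / det.
    return (d * u * u - 2.0 * b * u * v + a * v * v) / det

def bvn_pdf(x, mu, C):
    """Bivariate normal density Const. * exp(-Q/2), Const. = 1/(2*pi*det(C)^(1/2))."""
    det = C[0][0] * C[1][1] - C[0][1] ** 2
    q = d_square(x, mu, C)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

x = (70.5, 190.0)  # hypothetical point: one SD above the mean on both variables
print(round(cov_hw, 4), round(d_square(x, mu, C), 4), bvn_pdf(x, mu, C))
```

At the mean vector D-square is 0 and the density takes its largest value, Const.; away from the mean the density falls off as exp(-Q/2).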
Shape
If f(x,y) is a bivariate normal p.d.f., then the surface z = f(x,y) looks
like a bell, since the density decreases exponentially with the square of
the distance between (x,y) and the mean vector.
Exercise. Write down Mahalanobis D-squared and the p.d.f. for the bivariate
case. (You can find the answer in the book.)
Concentration ellipsoid
A uniform distribution over the interior of the ellipsoid
(x - µ)' C^{-1} (x - µ) = p + 2
has mean vector µ and covariance matrix C. In this sense, the ellipsoid
provides a way of visualizing the multinormal distribution N(µ, C).
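The claim can be checked by simulation for p = 2. The sketch below draws points uniformly from the interior of the concentration ellipse and compares their sample mean and covariance with µ and C. The numerical values of µ and C (taken from the height/weight example), the seed, the sample size, and the function names are illustrative assumptions.

```python
import math
import random

def sample_in_concentration_ellipse(mu, C, rng):
    """One uniform draw from the interior of (x-mu)' C^{-1} (x-mu) = p + 2, p = 2."""
    radius = math.sqrt(2 + 2)
    # Uniform point in the disk z'z <= radius^2, by rejection from a square.
    while True:
        z1 = rng.uniform(-radius, radius)
        z2 = rng.uniform(-radius, radius)
        if z1 * z1 + z2 * z2 <= radius * radius:
            break
    # Cholesky factor L of C (C = L L'); x = mu + L z maps the disk onto the ellipse.
    l11 = math.sqrt(C[0][0])
    l21 = C[0][1] / l11
    l22 = math.sqrt(C[1][1] - l21 * l21)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

def sample_moments(points):
    """Sample mean vector and sample covariance matrix (divisor n - 1)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

rng = random.Random(580)                 # arbitrary seed, for reproducibility
mu = (68.0, 165.0)
C = ((6.25, 25.0), (25.0, 625.0))
points = [sample_in_concentration_ellipse(mu, C, rng) for _ in range(50000)]
mean_hat, cov_hat = sample_moments(points)
print(mean_hat, cov_hat)
```

The printed moments should be close to µ and C, up to simulation error.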
Result 4.2: Distribution of Linear Combinations of an MVN Random Vector
If X has an MVND, then every linear combination of the variables in X has
a univariate normal distribution.
Remark. The MVND can be developed with this as its defining
property.
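For a linear combination a'X, the normal distribution in question has mean a'µ and variance a'Ca. A minimal Python sketch, using the mean vector and covariance matrix of the height/weight example; the coefficient vector a = (-4, 1), i.e. the combination W - 4H, is an arbitrary illustrative choice.

```python
def linear_combination_moments(a, mu, C):
    """Mean a'mu and variance a'Ca of the linear combination a'X."""
    p = len(a)
    mean = sum(a[i] * mu[i] for i in range(p))
    var = sum(a[i] * C[i][j] * a[j] for i in range(p) for j in range(p))
    return mean, var

mu = (68.0, 165.0)                    # (E(H), E(W))
C = ((6.25, 25.0), (25.0, 625.0))     # covariance matrix of (H, W)
a = (-4.0, 1.0)                       # illustrative choice: W - 4H

mean, var = linear_combination_moments(a, mu, C)
print(mean, var)  # -107.0 525.0
```

So, under joint normality of H and W, the combination W - 4H is normal with mean -107 and variance 525.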
Example 4.7 (p. 171): Conditional Distribution of Y given X=x
Analogous to P(B|A) = P(A and B)/P(A), we have f(y|x) = f(x,y)/f(x).
The conditional p.d.f. of Y given that X=x can be found by this division
of the joint p.d.f. of X and Y by the marginal p.d.f. of X. Doing this
and simplifying by algebra shows that the conditional distribution of Y
given that X=x is normal with mean A + Bx and variance Var(Y)[1 - rho^2],
where rho is the correlation between X and Y.
Connection between MVNDs and the Ordinary Linear Regression Model
The conditional distribution of Y given X involves the mean
of Y when X = x; this function is called the "regression function." In
the example, the mean Wt for men of height h is A + Bh, where
B = Cov(H,W)/Var(H) = +25.0/6.25 = +4.0
and
A = mean Wt - B*(mean Ht) = 165 - (4)(68) = 165 - 272 = -107,
i.e., the mean Wt for men of height h is 4h - 107; e.g., if h = 70",
this is 4(70) - 107 = 280 - 107 = 173 lbs. The variance of the conditional
distribution of W given H is Var(W)[1 - rho^2] = 625(1 - .16) = 525,
which does not depend upon the value of the explanatory variable H.
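The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch of the calculation above; the variable and function names are ours.

```python
import math

# Parameters of the height/weight example.
mean_h, sd_h = 68.0, 2.5
mean_w, sd_w = 165.0, 25.0
rho = 0.4

cov_hw = rho * sd_h * sd_w               # Cov(H,W)
B = cov_hw / sd_h ** 2                   # slope: Cov(H,W)/Var(H)
A = mean_w - B * mean_h                  # intercept: mean Wt - B*(mean Ht)
cond_var = sd_w ** 2 * (1.0 - rho ** 2)  # Var(W)[1 - rho^2], the same for every h

def mean_weight_given_height(h):
    """Regression function: conditional mean of W given H = h."""
    return A + B * h

print(round(B, 4), round(A, 4), round(mean_weight_given_height(70.0), 4))
print(round(cond_var, 4), round(math.sqrt(cond_var), 2))
```

The conditional standard deviation, about 22.9 lbs., is smaller than the marginal SD(W) = 25 lbs.; knowing a man's height reduces the uncertainty about his weight.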
Thus, using a more general notation, if X and Y are jointly normally
distributed, then the conditional distribution of Y given X satisfies the
assumptions underlying ordinary linear regression:
- linearity of regression: the mean value of Y given x
  is a linear function of x;
- homoscedasticity: the variance of Y given x is a constant,
  not dependent upon the value of x.
Observational Model
The model for the observations is written
Y_j = A + B x_j + e_j,   j = 1, 2, ..., n,
where A + B x_j is the conditional mean of Y for the j-th case,
i.e., the conditional mean of Y given that X = x_j; B (beta)
is Cov(X,Y)/Var(X); A (alpha) is µ_y - B µ_x;
and the variance of each e_j is a constant, usually denoted by
sigma-squared, which is equal to Var(Y)[1 - rho^2].
Result 4.7: Distribution of D-square
If X is distributed according to N(µ,C), then D^2(X, µ; C)
is distributed according to the chi-square distribution with p degrees
of freedom (d.f.).
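Result 4.7 can be illustrated by simulation for p = 2, where the chi-square c.d.f. has the closed form 1 - exp(-q/2). The sketch below simulates bivariate normal vectors, computes D-square for each, and compares empirical proportions with that c.d.f. The parameter values (from the height/weight example), the seed, and the sample size are illustrative assumptions.

```python
import math
import random

def simulate_d_square(C, n, rng):
    """Simulate n values of D^2(X, mu; C) for X ~ N(mu, C) with p = 2."""
    a, b, d = C[0][0], C[0][1], C[1][1]
    det = a * d - b * b
    # Cholesky factor of C, used to generate X - mu from independent N(0,1) draws.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    values = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        u = l11 * z1                 # first coordinate of X - mu
        v = l21 * z1 + l22 * z2      # second coordinate of X - mu
        values.append((d * u * u - 2.0 * b * u * v + a * v * v) / det)
    return values

def chi_square_cdf_2df(q):
    """Chi-square c.d.f. with 2 d.f."""
    return 1.0 - math.exp(-q / 2.0)

rng = random.Random(47)                  # arbitrary seed
C = ((6.25, 25.0), (25.0, 625.0))
d2 = simulate_d_square(C, 20000, rng)
for q in (1.386, 5.991):                 # approx. 50th and 95th percentiles, 2 d.f.
    empirical = sum(1 for value in d2 if value <= q) / len(d2)
    print(q, round(empirical, 3), round(chi_square_cdf_2df(q), 3))
```

The distribution of D-square does not depend on µ, which is why the simulation works directly with X - µ.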
4.3. Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation
Review of estimation of the variance of a univariate normal distribution
The MLE is SSD/n, where SSD is the sum of squared deviations from the
sample mean. Let s^2 = SSD/(n-1). No matter what the parent distribution
(provided its variance sigma^2 is finite), s^2 is unbiased for sigma^2.
This is a reason for using s^2 as the basis for a more or less universal
measure of spread. Note, however, that s is biased for sigma. To see
this, note that for any variable X, E[X^2] >= {E[X]}^2.
Taking X = s, this gives E[s^2] >= {E[s]}^2. So
E[s] <= {E[s^2]}^{1/2} = {sigma^2}^{1/2} = sigma.
That is, s is biased downward for sigma.
For normal distributions it can in fact be shown that E[s]
is approximately sigma(1 - 1/(4n)). So s(1 - 1/(4n))^{-1} is
approximately unbiased for sigma. This is equal to
s[(4n-1)/(4n)]^{-1} = s[(4n)/(4n-1)] and is approximately
s(1 + 1/(4n)). See Kendall & Stuart (1973), Exercise 17.6, p. 33
(and also Exercise 10.20 in their Volume 1).
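For normal samples the exact ratio E[s]/sigma is known in closed form: sqrt(2/(n-1)) Gamma(n/2)/Gamma((n-1)/2). This formula is a standard result that we bring in here for comparison; it is not derived in these notes. The Python sketch below compares it with the approximation 1 - 1/(4n); the function name is ours.

```python
import math

def expected_s_over_sigma(n):
    """Exact E[s]/sigma for a normal sample of size n, with s^2 = SSD/(n-1)."""
    log_ratio = math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

for n in (5, 10, 30, 100):
    exact = expected_s_over_sigma(n)
    approx = 1.0 - 1.0 / (4.0 * n)
    print(n, round(exact, 4), round(approx, 4))
```

The exact ratio is below 1 for every n, in line with the downward bias shown above, and the approximation improves as n grows.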
4.4. The Sampling Distribution of the Sample Mean Vector and Covariance Matrix
4.5. Large-sample Behavior of the Sample Mean Vector and Covariance Matrix
4.6. Assessing the Assumption of Normality
Q-Q plots
Normal quantile plots
The sample quantiles are plotted against those of the standard normal
distribution. If the plot is nearly a straight line, normality is
considered plausible. A curved (bow-shaped) plot indicates skewness; an
S-shaped plot indicates tails that are heavier or lighter than normal.
Correlational methods
Related to this is the method which correlates the sample quantiles with
those of the standard normal distribution. If the correlation is high
(close to 1), normality is considered plausible. A variation, the
Shapiro-Wilk test, uses the expected
values of the order statistics from the standard normal distribution rather
than its quantiles.
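A minimal Python sketch of the correlational method follows. It assumes the probability levels (j - 1/2)/n for the standard normal quantiles, which is one common convention; the two artificial data sets are constructed only to show a straight and a curved Q-Q relationship.

```python
import math
from statistics import NormalDist

def normal_quantiles(n):
    """Standard normal quantiles at probability levels (j - 1/2)/n, j = 1..n."""
    std = NormalDist()
    return [std.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

def qq_correlation(data):
    """Correlation between the ordered data and the standard normal quantiles."""
    xs = sorted(data)
    qs = normal_quantiles(len(xs))
    mx = sum(xs) / len(xs)
    mq = sum(qs) / len(qs)
    sxq = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxq / math.sqrt(sxx * sqq)

z = normal_quantiles(50)
symmetric = [68.0 + 2.5 * q for q in z]   # artificial, exactly "normal-looking" data
skewed = [math.exp(q) for q in z]         # artificial, strongly right-skewed data
print(round(qq_correlation(symmetric), 4), round(qq_correlation(skewed), 4))
```

In practice the observed correlation is compared with tabled critical values that depend on n; those tables are not reproduced here.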
Chi-square quantile plots
This uses the approximate chi-square distribution of D-square between the
observation vectors and the sample mean vector.
4.7. Detecting Outliers and Cleaning Data
We first discuss some methods not mentioned in the book. One method of
detecting outliers is to compute the statistical distance between each
observation and the sample mean vector, in the metric of the sample covariance
matrix. If the data are approximately multinormal, the squared statistical
distance is approximately chi-square with p d.f. A difficulty with this
method is that the presence of outliers distorts the estimation of the mean
vector and covariance matrix. There are several methods to try to deal
with this. One is to recompute the sample mean and covariance matrix after
each suspected outlier is omitted from the sample, stopping when no further
outliers are suspected.
Another is to compute the sample mean and covariance matrix with each
case left out. This would be completely effective if there were only one
outlier. It still makes some sense for dealing with situations where more
than one outlier is suspected.
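The basic distance calculation can be sketched in Python for p = 2. The data set below is made up for illustration (eleven height/weight pairs, the last one deliberately discordant), and the 95% cutoff is an arbitrary choice. For 2 d.f. the chi-square quantile has the closed form -2 ln(1 - prob).

```python
import math

def mean_and_cov(data):
    """Sample means and sample (co)variances, divisor n - 1, for bivariate data."""
    n = len(data)
    mh = sum(h for h, _ in data) / n
    mw = sum(w for _, w in data) / n
    shh = sum((h - mh) ** 2 for h, _ in data) / (n - 1)
    sww = sum((w - mw) ** 2 for _, w in data) / (n - 1)
    shw = sum((h - mh) * (w - mw) for h, w in data) / (n - 1)
    return (mh, mw), (shh, shw, sww)

def d_squares(data):
    """Squared statistical distance of each case from the sample mean vector."""
    (mh, mw), (shh, shw, sww) = mean_and_cov(data)
    det = shh * sww - shw ** 2
    return [(sww * (h - mh) ** 2 - 2.0 * shw * (h - mh) * (w - mw)
             + shh * (w - mw) ** 2) / det for h, w in data]

def chi_square_quantile_2df(prob):
    """Quantile of the chi-square distribution with 2 d.f."""
    return -2.0 * math.log(1.0 - prob)

# Hypothetical (height, weight) data; the last case is the planted outlier.
data = [(66, 150), (67, 155), (68, 160), (69, 168), (70, 172), (71, 180),
        (65, 140), (68, 166), (69, 170), (67, 158), (66, 210)]
cutoff = chi_square_quantile_2df(0.95)
d2 = d_squares(data)
flagged = [j for j, q in enumerate(d2) if q > cutoff]
print(round(cutoff, 3), [round(q, 2) for q in d2], flagged)
```

Note that the outlier is included in the mean and covariance used to measure it, which is exactly the difficulty described above; the recomputation schemes in the text would rerun d_squares with suspected cases removed.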
4.8. Transformations to Near Normality
Variance-Stabilizing Transformations
Suppose the variance is a function V(µ) of the mean. We seek a transformation
T(x) such that Var[T(X)] is constant. By a first-order Taylor expansion of
T about µ, Var[T(X)] is approximately
Var(X)[T'(µ)]^2 = V(µ)[T'(µ)]^2.
Setting this equal to a constant c^2 gives the differential equation
T'(x) = c/[V(x)]^{1/2}.
The square-root transform for Poisson count data, the arcsine square-root
transform for proportions, and Fisher's z transform for correlation
coefficients are obtained as solutions in those cases.
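To see the stabilization at work for proportions, the sketch below computes the exact variance of X/n and of arcsin(sqrt(X/n)) for a binomial X by summing over the probability function. With V(µ) = µ(1-µ)/n the recipe above gives a transformed variance of about 1/(4n) whatever the value of µ. The choices n = 200 and the three values of the proportion are arbitrary.

```python
import math

def exact_variance(transform, n, p):
    """Exact Var[transform(X/n)] when X ~ Binomial(n, p)."""
    probs = [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    values = [transform(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(probs, values))
    return sum(w * (v - mean) ** 2 for w, v in zip(probs, values))

def arcsine_sqrt(phat):
    """Arcsine square-root transform of a proportion."""
    return math.asin(math.sqrt(phat))

n = 200
raw_vars, stab_vars = [], []
for p in (0.1, 0.3, 0.5):
    raw_vars.append(exact_variance(lambda t: t, n, p))
    stab_vars.append(exact_variance(arcsine_sqrt, n, p))
    print(p, raw_vars[-1], stab_vars[-1], 1.0 / (4.0 * n))
```

The variance of the raw proportion changes by a factor of nearly 3 across these cases, while the variance of the transformed proportion stays close to 1/(4n).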
Power Transformations
The Box-Cox transformations, say T(x), are the solutions to the simple
differential equation dT/dx = x^{p-1}, where the power p can be any real
number. (The power p is usually denoted by lambda.) For p different from 0
this gives T(x) = x^p/p + c. The choice c = -1/p gives T(x) = (x^p - 1)/p
for p different from 0 and, by continuity, T(x) = ln(x) for p = 0. These
transformations are denoted by x^{(p)}.
The most popular transformations are
- p = -1 (the reciprocal)
- p = 0 (the log)
- p = 1/2 (the square root)
- p = 1 (no transformation).
It is convenient to use instead the modification
y^{(p)} = (x^p - 1)/[p (g.m.)^{p-1}],
where g.m. is the geometric mean of the data. For p = 0, this is
(g.m.) ln(x). The variance of y^{(p)} is plotted against
p, say for p from -2 to +2 in steps of .1. The smallest variance
corresponds to the maximum likelihood choice of p.
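The grid search just described can be sketched in Python as follows. The function names and the artificial data set (chosen so that its logs are evenly spaced and symmetric about 0) are illustrative assumptions; in practice one would also look at the whole variance curve rather than only its minimizer.

```python
import math
from statistics import variance

def box_cox(x, p):
    """Box-Cox transform (x^p - 1)/p, with the limiting value ln(x) at p = 0."""
    return math.log(x) if p == 0 else (x ** p - 1.0) / p

def normalized_box_cox(data, p):
    """Modified transform (x^p - 1)/[p (g.m.)^(p-1)]; (g.m.) ln(x) at p = 0."""
    gm = math.exp(sum(math.log(x) for x in data) / len(data))
    return [box_cox(x, p) / gm ** (p - 1.0) for x in data]

def best_power(data, grid):
    """Power on the grid giving the smallest variance of the modified transform."""
    return min(grid, key=lambda p: variance(normalized_box_cox(data, p)))

grid = [(i - 20) / 10.0 for i in range(41)]          # p = -2(.1)+2
data = [math.exp(z / 4.0) for z in range(-8, 9)]     # artificial positive data
print(best_power(data, grid))
```

For this data set the search selects p = 0, the log transform, as one would hope for data whose logs are symmetric.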
APPENDIX
Other Multivariate Distributions
Modeling
Another way to specify a bivariate normal distribution is in terms of the
conditional distribution of Y given X and the marginal distribution of
X. Analogous to
Pr(A and B) = Pr(A)Pr(B|A),
for pdf's we have
f(x,y) = f(x)f(y|x).
It is often easy to describe these two and then multiply them to get
the joint p.d.f.
There are many continuous multivariate distributions besides the normal.
It is interesting to construct one from elementary considerations. Let
the conditional distribution of Y given X=x be exponential with parameter
x:
f(y|x) = x exp(-xy), y > 0, x > 0.
Let X have an exponential distribution with parameter k:
f(x) = k exp(-kx), x > 0.
Then
f(x,y) = f(x)f(y|x)
= k exp(-kx) x exp(-xy)
= kx exp[-(kx+xy)]
= kx exp[-x(y+k)],
x>0, y>0.
Usually in such situations we want to know the marginal p.d.f. of Y,
to know what to expect when sampling Y alone. We find f(y) by integrating
f(x,y) with respect to x. In this example we get f(y) = k/(y+k)^2, y > 0.
The cumulative distribution function is F(y) = 1 - k/(y+k), or y/(y+k).
The median is the value of y for which F(y) = 1/2; this is y = k.
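The construction lends itself to a simulation check: draw X from its marginal distribution, then Y from the conditional distribution given X, and compare the resulting sample of Y values with F(y) = y/(y+k). In the sketch below the value k = 2, the seed, and the sample size are arbitrary choices. Note that random.expovariate takes the rate as its argument, which matches the parametrization used above.

```python
import random

def draw_y(k, rng):
    """One draw of Y: X ~ exponential(rate k), then Y | X = x ~ exponential(rate x)."""
    x = 0.0
    while x == 0.0:              # guard against a zero rate (vanishingly unlikely)
        x = rng.expovariate(k)
    return rng.expovariate(x)

def marginal_cdf(y, k):
    """F(y) = y/(y + k), the marginal c.d.f. of Y derived above."""
    return y / (y + k)

k = 2.0
rng = random.Random(1997)        # arbitrary seed
ys = sorted(draw_y(k, rng) for _ in range(20000))
sample_median = 0.5 * (ys[9999] + ys[10000])
frac_below_k = sum(1 for y in ys if y <= k) / len(ys)
print(round(sample_median, 3), round(frac_below_k, 3), marginal_cdf(k, k))
```

The sample median should come out close to k. The sample mean, by contrast, is unstable from run to run, because y times k/(y+k)^2 is not integrable and so Y has no finite mean.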
References
Kendall, M. G., and Stuart, A. (1973). The Advanced Theory of
Statistics. Vol. 2: Inference and Relationship. 3rd ed.
Hafner, New York.
Created: 15 January 1997
Updated: 2 Nov 2000