University of Illinois at Chicago
School of Public Health
Division of Epidemiology & Biostatistics
BSTT 580      Applied Multivariate Analysis

Instructor      Stan Sclove
Textbook      Johnson & Wichern, 4th ed. (JW)
Notes on the Family of Multivariate Normal Distributions

(Supplementary notes to be read with JW Chapter 4)
These notes Copyright © 1999 Stanley Louis Sclove 
4.   The Multivariate Normal Distribution

Reading assigned: Secs. 4.1, 4.2, 4.6, 4.7, 4.8

4.1. Introduction

Many of the techniques in the book are based on the assumption that the data have a distribution that is close to a multivariate normal distribution (MVND).   The properties of this family of distributions are the subject of Section 4.2. Methods of assessing normality are discussed in Section 4.6. The problem of outlier detection is discussed in Section 4.7.   Often data which are not nearly normally distributed can be transformed (e.g., by taking logs) to data which are approximately normal, so that normal-based methods can be used.   Such transformation is the topic of Section 4.8.

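Normal distributions are useful in practice for two reasons (JW, Sec. 4.1): first, the normal distribution serves as a realistic population model in some instances; second, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent population, because of a central limit effect.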

4.2. The Multivariate Normal Density and its Properties

The Family of Univariate Normal Distributions

The univariate normal probability density function (p.d.f.) looks like a cross-section of a bell. Its formula is     Const.   exp[-(1/2)z^2],   where z = (x - µ)/sigma and Const. = 1/[sigma (2*pi)^(1/2)].

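 In connection with the univariate normal distribution we have the 68-95-99.7% rule:
If X has a normal distribution with mean µ and standard deviation sigma, then X falls within one sigma of µ with probability about 68%, within two sigma of µ with probability about 95%, and within three sigma of µ with probability about 99.7%.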
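
The percentages in the rule can be checked numerically. The following is a minimal sketch using the NormalDist class from Python's standard library; the helper name prob_within is made up for this illustration.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Return P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# The probabilities do not depend on mu and sigma, only on k.
for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The three probabilities come out at about 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and 99.7%.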

The Family of Multivariate Normal Distributions

Parameters

In a family of distributions, the family member is specified by the values of the parameters. For multivariate normal distributions, the parameters are the mean vector and covariance matrix.   Example. For a bivariate normal distribution, there are five parameters: the mean of X, the mean of Y, the standard deviation of X, the standard deviation of Y and the covariance of X and Y.

Remark. Since each covariance is the product of the corresponding correlation and the two standard deviations, specifying the covariance matrix is equivalent to specifying the correlations and standard deviations.

Role of D-square

The multivariate normal probability density function (p.d.f.) involves Mahalanobis D-square, D^2(x,µ;C), which is (x-µ)'C^(-1)(x-µ),   where x is the vector of values of the variables, µ is the mean vector, and C is the covariance matrix.   Denote this by   Q   for short. Then the p.d.f. is of the form
Const. e^(-Q/2),
where
Const. =   1/[(2*pi)^(p/2) det(C)^(1/2)]
and   p   is the number of variables.

The quantity Q is the square of the statistical (Mahalanobis) distance between x and µ. The larger this distance, the smaller the probability density; the density decreases exponentially with the square of the distance.

Example 4.1 (p. 159): Bivariate normal density

Parameters

A bivariate normal distribution (joint normal distribution of just p = 2 variables) is specified by giving the values of five parameters, namely, the two means, the two standard deviations, and the correlation (or the covariance).

Example

Suppose that, for a population of adult males, height (H) and weight (W) are jointly normally distributed with E(H) = 68", SD(H) = 2.5", E(W) = 165 lbs., SD(W) = 25 lbs., and Corr(H,W) = +.4. Then Cov(H,W) = Corr(H,W)*SD(H)*SD(W) = (+.4)(2.5)(25) = +25.0.
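
As an illustration, the covariance matrix, Mahalanobis D-square and the density for this height and weight example can be computed as follows. This is a sketch for the case p = 2 only, with the 2 x 2 inverse written out by hand; the function names d_square and density are made up for this illustration.

```python
import math

# Parameters of the height/weight example.
mu = (68.0, 165.0)              # E(H), E(W)
sd_h, sd_w, rho = 2.5, 25.0, 0.4

cov_hw = rho * sd_h * sd_w      # covariance = correlation * SD * SD
C = ((sd_h ** 2, cov_hw), (cov_hw, sd_w ** 2))   # covariance matrix

def d_square(x, mu, C):
    """Mahalanobis D-square (x-mu)' C^(-1) (x-mu) for p = 2."""
    (a, b), (_, d) = C
    det = a * d - b * b
    u, v = x[0] - mu[0], x[1] - mu[1]
    # For a 2 x 2 matrix, C^(-1) = (1/det) * [[d, -b], [-b, a]].
    return (d * u * u - 2 * b * u * v + a * v * v) / det

def density(x, mu, C):
    """Bivariate normal p.d.f., Const * exp(-Q/2), with p = 2."""
    (a, b), (_, d) = C
    det = a * d - b * b
    const = 1.0 / ((2 * math.pi) * math.sqrt(det))   # (2*pi)^(p/2) = 2*pi
    return const * math.exp(-d_square(x, mu, C) / 2)

print(cov_hw, d_square((70.5, 165.0), mu, C), density(mu, mu, C))
```

At the mean vector D-square is 0 and the density takes its largest value, Const.; moving away from the mean in any direction increases D-square and lowers the density.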

Shape

If f(x,y) is a bivariate normal p.d.f., then, since the density decreases exponentially with the square of the distance between (x,y) and the mean vector, the surface z = f(x,y) looks like a bell.

Exercise. Write down Mahalanobis D-squared and the p.d.f. for the bivariate case. (You can find the answer in the book.)

Concentration ellipsoid

A uniform distribution over the interior of the ellipsoid   (x - µ)'C^(-1)(x - µ) = p+2
has mean vector µ and covariance matrix C. In this sense, this ellipsoid provides a way of visualizing the multinormal distribution N(µ, C).
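
The property of the concentration ellipsoid can be checked by simulation. The sketch below, for p = 2 only, draws points uniformly inside the ellipse by rejection sampling from its bounding rectangle and then computes their sample mean and covariance matrix. The function name, the seed, the number of proposals and the numerical values of µ and C are choices made for this illustration.

```python
import random

def uniform_ellipse_moments(mu, C, n_proposals=200_000, seed=0):
    """Mean and covariance of points uniform inside (x-mu)'C^(-1)(x-mu) <= p+2, p = 2."""
    rng = random.Random(seed)
    (a, b), (_, d) = C
    det = a * d - b * b
    c = 2 + 2                                   # p + 2 with p = 2
    half_w = (c * a) ** 0.5                     # half-width of bounding rectangle
    half_h = (c * d) ** 0.5                     # half-height of bounding rectangle
    pts = []
    for _ in range(n_proposals):
        u = rng.uniform(-half_w, half_w)        # deviation from mu, first variable
        v = rng.uniform(-half_h, half_h)        # deviation from mu, second variable
        q = (d * u * u - 2 * b * u * v + a * v * v) / det
        if q <= c:                              # keep only points inside the ellipse
            pts.append((u, v))
    n = len(pts)
    mu_u = sum(p[0] for p in pts) / n
    mu_v = sum(p[1] for p in pts) / n
    s_uu = sum((p[0] - mu_u) ** 2 for p in pts) / n
    s_vv = sum((p[1] - mu_v) ** 2 for p in pts) / n
    s_uv = sum((p[0] - mu_u) * (p[1] - mu_v) for p in pts) / n
    return (mu[0] + mu_u, mu[1] + mu_v), ((s_uu, s_uv), (s_uv, s_vv))

mu = (68.0, 165.0)
C = ((6.25, 25.0), (25.0, 625.0))
est_mean, est_cov = uniform_ellipse_moments(mu, C)
print(est_mean, est_cov)
```

Up to simulation error, the estimated mean and covariance matrix should reproduce µ and C.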

Result 4.2:   Distribution of Linear Combinations of an MVN Random Vector.   If X has an MVND, then every linear combination of the variables in X has a univariate normal distribution.
Remark. The MVND can be developed with this as its defining property. 
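
For a linear combination a'X, the standard formulas give mean a'µ and variance a'Ca (these are stated in JW's Result 4.2; the notes above mention only the normality). The sketch below evaluates them for the combination W - 4H, using the parameter values of the height and weight example; the function name and the choice of coefficients are made up for this illustration.

```python
def linear_combination_params(a, mu, C):
    """Mean a'mu and variance a'Ca of the linear combination a'X."""
    p = len(a)
    mean = sum(a[i] * mu[i] for i in range(p))
    var = sum(a[i] * C[i][j] * a[j] for i in range(p) for j in range(p))
    return mean, var

mu = (68.0, 165.0)                      # E(H), E(W)
C = ((6.25, 25.0), (25.0, 625.0))       # Var(H), Cov(H,W); Cov(H,W), Var(W)

# Coefficients (-4, 1) give the combination W - 4H.
mean, var = linear_combination_params((-4.0, 1.0), mu, C)
print(mean, var)
```

Here the mean is 165 - 4(68) = -107 and the variance is 16(6.25) - 2(4)(25) + 625 = 525, so W - 4H is normal with mean -107 and standard deviation about 22.9.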


Example 4.7 (p. 171):   Conditional Distribution of Y given X=x

Analogous to P(B|A) = P(A and B)/P(A) we have f(y|x) = f(x,y)/f(x). The conditional p.d.f. of Y given that X=x can be found by this division of the joint p.d.f. of X and Y by the marginal p.d.f. of X. Doing this and simplifying by algebra shows that the conditional distribution of Y given that X=x is normal with mean A + Bx and variance Var(Y)[1-rho^2],   where rho is the correlation between X and Y.

Connection between MVNDs and the Ordinary Linear Regression Model

The conditional distribution of Y given X involves the mean of Y when X = x; this function is called the "regression function." In the example the mean Wt for men of height   h   is A + Bh, where
B = Cov(H,W)/Var(H) = +25.0/6.25 = +4.0
and
A = mean Wt - B*(mean Ht) = 165 - (4)(68) = 165-272 = -107,

i.e., the mean Wt for men of height h is 4h - 107; e.g., if h = 70", this is 4(70) - 107 = 280 - 107 = 173 lbs. The variance of the conditional distribution of W given H is Var(W)[1-rho^2] = 625(1 - .16) = 525, which does not depend upon the value of the explanatory variable H.
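
The arithmetic of this example can be reproduced in a few lines. This is a sketch only; the variable and function names are made up for this illustration.

```python
# Parameters of the height/weight example.
mean_h, sd_h = 68.0, 2.5
mean_w, sd_w = 165.0, 25.0
rho = 0.4

cov_hw = rho * sd_h * sd_w              # Cov(H,W)
B = cov_hw / sd_h ** 2                  # slope: Cov(H,W)/Var(H)
A = mean_w - B * mean_h                 # intercept: mean Wt - B*(mean Ht)
cond_var = sd_w ** 2 * (1 - rho ** 2)   # Var(W)[1 - rho^2], the same for every h

def mean_weight(h):
    """Regression function: mean weight (lbs.) for men of height h (inches)."""
    return A + B * h

print(B, A, mean_weight(70.0), cond_var, cond_var ** 0.5)
```

The conditional standard deviation is the square root of 525, about 22.9 lbs., compared with the unconditional SD(W) = 25 lbs.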

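Thus, using a more general notation, if X and Y are jointly normally distributed, then the conditional distribution of Y given X satisfies the assumptions underlying ordinary linear regression: the conditional mean of Y is a linear function of x, the conditional variance is constant, and the conditional distribution is normal.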

Observational Model

The model for the observations is written
Y_j   =   A + B x_j + e_j,   j = 1,2,...,n,
where A + B x_j is the conditional mean of Y for the j-th case, i.e., the conditional mean of Y given that X = x_j; B (beta) is Cov(X,Y)/Var(X); A (alpha) is µ_y - B µ_x; and the variance of each e_j is a constant, usually denoted by sigma-squared, which is equal to Var(Y)[1-rho^2].

Result 4.7:   Distribution of D-square.   If X is distributed according to N(µ,C), then D^2(X,µ;C) is distributed according to the chi-square distribution with p d.f.
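
Result 4.7 can be illustrated by simulation for p = 2, where the chi-square c.d.f. has the simple closed form 1 - exp(-q/2). The sketch below generates bivariate normal points from independent standard normals and compares the fraction of D-square values below the upper 5% point with 0.95. The parameter values, sample size, seed and function names are choices made for this illustration.

```python
import math
import random

def simulate_d_square(mu, C, n=50_000, seed=1):
    """Draw n points from a bivariate N(mu, C) and return their D-square values."""
    rng = random.Random(seed)
    (a, b), (_, d) = C
    det = a * d - b * b
    # Cholesky factor of C, written out for the 2 x 2 case.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    values = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mu[0] + l11 * z1
        y = mu[1] + l21 * z1 + l22 * z2
        u, v = x - mu[0], y - mu[1]
        values.append((d * u * u - 2 * b * u * v + a * v * v) / det)
    return values

def chi2_cdf_2df(q):
    """C.d.f. of the chi-square distribution with 2 d.f."""
    return 1.0 - math.exp(-q / 2.0)

mu = (68.0, 165.0)
C = ((6.25, 25.0), (25.0, 625.0))
d2 = simulate_d_square(mu, C)
cutoff = -2.0 * math.log(0.05)          # upper 5% point, about 5.99
fraction_below = sum(q <= cutoff for q in d2) / len(d2)
print(round(cutoff, 3), round(fraction_below, 3))
```

The fraction of simulated D-square values below the cutoff should be close to 0.95, and their average should be close to p = 2, the mean of the chi-square distribution with 2 d.f.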

4.3. Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation

Review of  estimation of the variance of a univariate normal distribution

The MLE of the variance is SSD/n, where SSD is the sum of squared deviations from the mean.  Let  s^2 = SSD/(n-1).  No matter what the parent distribution, s^2 is unbiased for sigma^2.  This is a reason for using  s^2  as the basis for a more or less universal measure of spread.  Note, however, that  s  is biased for sigma.   To see this, note that for any variable X,  E[X^2]  >= {E[X]}^2.   Taking X = s, this gives   E[s^2]  >= {E[s]}^2.   So   E[s]  <=   {E[s^2]}^(1/2)   =  {sigma^2}^(1/2)  =  sigma.  That is,   s   is biased downward for sigma.

For normal distributions it can in fact be shown that E[s] is approximately sigma[1 - 1/(4n)].  So   s[1 - 1/(4n)]^(-1) is approximately unbiased for sigma.   This is equal to s[(4n-1)/(4n)]^(-1)   =  s[(4n)/(4n-1)]  and is approximately s[1 + 1/(4n)].  See Kendall & Stuart (1973), Exercise 17.6, p. 33 (and also Exercise 10.20 in their Volume 1).
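
The quality of this approximation can be examined numerically. For a normal sample, the exact ratio E[s]/sigma is sqrt(2/(n-1)) Gamma(n/2)/Gamma((n-1)/2), a standard result not derived in these notes. The sketch below compares it with 1 - 1/(4n); the function name is made up for this illustration.

```python
import math

def expected_s_ratio(n):
    """Exact E[s]/sigma for a normal sample of size n."""
    log_gamma_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_gamma_ratio)

# Columns: n, exact ratio, approximation 1 - 1/(4n).
for n in (5, 10, 30, 100):
    print(n, round(expected_s_ratio(n), 5), round(1 - 1 / (4 * n), 5))
```

The exact ratio is always below 1, in agreement with the downward bias shown above, and the two columns approach each other as n grows.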



 

4.4. The Sampling Distribution of the Sample Mean Vector and Covariance Matrix

4.5. Large-sample Behavior of the Sample Mean Vector and Covariance Matrix


4.6. Assessing the Assumption of Normality

Q-Q plots

Normal quantile plots

The sample quantiles are plotted against those of the standard normal distribution. If the plot is nearly a straight line, normality is accepted. A plot that bends in one direction indicates skewness; an S-shaped plot indicates tails that are heavier or lighter than those of a normal distribution.

Correlational methods

Related to this is the method which correlates the sample quantiles with those of the standard normal distribution. If the correlation is high, normality is accepted. A variation, the Shapiro-Wilk test, uses the expected values of the order statistics from the standard normal distribution rather than its quantiles.
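
The quantile pairs and their correlation can be computed as in the following sketch. It assumes the plotting positions (j - 1/2)/n, which is one common choice and is not specified in these notes; the function names and the two artificial samples are made up for this illustration.

```python
import math
from statistics import NormalDist

def qq_pairs(sample):
    """Pairs (standard normal quantile, ordered observation) for a Q-Q plot."""
    n = len(sample)
    std_normal = NormalDist()
    ordered = sorted(sample)
    # Assumed plotting positions: (j - 1/2)/n for j = 1, ..., n.
    quantiles = [std_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def qq_correlation(sample):
    """Correlation between the normal quantiles and the ordered observations."""
    pairs = qq_pairs(sample)
    n = len(pairs)
    mq = sum(q for q, _ in pairs) / n
    mx = sum(x for _, x in pairs) / n
    sqq = sum((q - mq) ** 2 for q, _ in pairs)
    sxx = sum((x - mx) ** 2 for _, x in pairs)
    sqx = sum((q - mq) * (x - mx) for q, x in pairs)
    return sqx / math.sqrt(sqq * sxx)

# Two artificial samples of size 50: one exactly normal-shaped, one skewed.
base = [NormalDist().inv_cdf((j - 0.5) / 50) for j in range(1, 51)]
normal_like = [68.0 + 2.5 * z for z in base]
skewed = [math.exp(z) for z in base]
print(qq_correlation(normal_like), qq_correlation(skewed))
```

The normal-shaped sample gives a correlation of essentially 1, while the skewed sample gives a clearly smaller value. How small a correlation must be before normality is rejected depends on n and is read from tables such as the one in JW.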

Chi-square quantile plots

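This uses the approximate chi-square distribution (with p d.f.) of D-square between the observation vectors and the sample mean vector, computed in the metric of the sample covariance matrix. The ordered values of D-square are plotted against the corresponding quantiles of the chi-square distribution; a plot that is nearly a straight line supports multinormality.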

4.7. Detecting Outliers and Cleaning Data

We first discuss some methods not mentioned in the book. One method of detecting outliers is to compute the statistical distance between each observation and the sample mean vector, in the metric of the sample covariance matrix. If the data are approximately multinormal, the squared statistical distance is approximately chi-square with p d.f. A difficulty with this method is that the presence of outliers fouls the estimation of the mean vector and covariance matrix. There are several methods to try to deal with this. One is to recompute the sample mean and covariance matrix after each suspected outlier is omitted from the sample, stopping when no further outliers are suspected.

Another is to compute the sample mean and covariance matrix with each case left out. This would be completely effective if there were only one outlier. It still makes some sense for dealing with situations where more than one outlier is suspected. 
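
The leave-one-out method just described can be sketched as follows for p = 2. The height and weight values are hypothetical, made up so that the last case (a tall but very light man) is an obvious outlier; the function names are also made up for this illustration.

```python
def mean_and_cov(points):
    """Sample mean vector and covariance matrix (divisor n-1) of bivariate data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def d_square(pt, mu, C):
    """Squared statistical distance (pt-mu)' C^(-1) (pt-mu) for p = 2."""
    (a, b), (_, d) = C
    det = a * d - b * b
    u, v = pt[0] - mu[0], pt[1] - mu[1]
    return (d * u * u - 2 * b * u * v + a * v * v) / det

def leave_one_out_d_square(points):
    """D-square of each case from the mean and covariance of the other cases."""
    result = []
    for j, pt in enumerate(points):
        rest = points[:j] + points[j + 1:]
        mu, C = mean_and_cov(rest)
        result.append(d_square(pt, mu, C))
    return result

# Hypothetical (height in inches, weight in lbs.) data; the last case is unusual.
cases = [(66, 150), (67, 160), (68, 165), (69, 172), (70, 171),
         (71, 180), (65, 148), (68, 158), (72, 120)]
distances = leave_one_out_d_square(cases)
suspect = max(range(len(cases)), key=lambda j: distances[j])
print(suspect, cases[suspect], round(distances[suspect], 1))
```

Cases whose squared distance is far out in the upper tail of the chi-square distribution with p d.f. (for p = 2, the upper 5% point is about 5.99) are the ones to examine.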

4.8. Transformations to Near Normality

Variance-Stabilizing Transformations

Suppose the variance is a function V(µ) of the mean. We seek a transformation T(x) such that Var[T(X)] is constant. But Var[T(X)] is approximately Var(X)[T'(µ)]^2 = V(µ)[T'(µ)]^2.   Setting this equal to a constant c^2 gives the differential equation T'(x) = c/[V(x)]^(1/2). The square-root transform for Poisson count data, the arc sin square-root transform for proportions and Fisher's z transform for correlation coefficients are obtained as solutions in those cases.
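
For Poisson counts V(µ) = µ, so T'(x) = c/x^(1/2) and T(x) = 2c x^(1/2); with c = 1/2 this is the square root, whose variance should then be about c^2 = 1/4 whatever the mean. The sketch below checks this by summing the Poisson probabilities directly rather than by simulation; the function names and the cutoff used to truncate the sum are choices made for this illustration.

```python
import math

def poisson_expectation(mu, g, tail=12.0):
    """E[g(X)] for X ~ Poisson(mu), summing the p.m.f. far into the upper tail."""
    k_max = int(mu + tail * math.sqrt(mu) + 20)
    total = 0.0
    for k in range(k_max + 1):
        log_pmf = -mu + k * math.log(mu) - math.lgamma(k + 1)
        total += g(k) * math.exp(log_pmf)
    return total

def variance_of(mu, g):
    """Var[g(X)] for X ~ Poisson(mu)."""
    m1 = poisson_expectation(mu, g)
    m2 = poisson_expectation(mu, lambda k: g(k) ** 2)
    return m2 - m1 * m1

# Columns: mean, Var(X), Var(sqrt(X)).
for mu in (10, 40, 160):
    print(mu, round(variance_of(mu, float), 3), round(variance_of(mu, math.sqrt), 4))
```

The variance of X grows in proportion to the mean, while the variance of the square root stays near 1/4, more closely so for larger means.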

Power Transformations

The Box-Cox transformations, say T(x), are the solutions to the simple differential equation dT/dx = x^(p-1), where the power p can be any real number. (The power p is usually denoted by lambda.) For p different from 0 this gives T(x) = x^p/p + c.   The choice c = -1/p gives T(x) = (x^p - 1)/p for p different from 0 and, by continuity, T(x) = ln(x) for p = 0. These transformations are denoted by x^(p) (a superscript p in parentheses, not a power).

The most popular transformations are

It is convenient to use instead the modification
y^(p) = (x^p - 1)/[p (g.m.)^(p-1)],

where g.m. is the geometric mean of the data. For p = 0, this is   (g.m.) ln(x).   The variance of y^(p) is plotted against p, say for p = -2(.1)+2, i.e., from -2 to +2 in steps of .1.   The smallest variance corresponds to the maximum likelihood choice of p.
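
The search over p can be sketched as follows. The data here are artificial: they are the exponentials of a set of standard normal quantiles, so their logs are exactly normal-shaped and the chosen power should be at or near p = 0. The function names are made up for this illustration, and the value used at p = 0 is the limit (g.m.) ln(x).

```python
import math
from statistics import NormalDist, pvariance

def normalized_box_cox(x, p, gm):
    """Modified transform (x^p - 1)/[p (g.m.)^(p-1)], or (g.m.) ln(x) at p = 0."""
    if p == 0:
        return gm * math.log(x)
    return (x ** p - 1.0) / (p * gm ** (p - 1.0))

def best_power(data, grid=None):
    """Return the p on the grid with the smallest variance of the transformed data."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]     # p = -2(.1)+2
    gm = math.exp(sum(math.log(x) for x in data) / len(data))
    variances = {
        p: pvariance([normalized_box_cox(x, p, gm) for x in data])
        for p in grid
    }
    return min(variances, key=variances.get), variances

# Artificial positive data whose logs are normal-shaped.
z = [NormalDist().inv_cdf((j - 0.5) / 40) for j in range(1, 41)]
data = [math.exp(v) for v in z]
best_p, variances = best_power(data)
print(best_p, round(variances[best_p], 4))
```

Dividing by (g.m.)^(p-1) puts the transformed values for different p on a comparable scale, which is what makes their variances directly comparable across the grid.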


APPENDIX

Other Multivariate Distributions

Modeling

Another way to specify a bivariate normal distribution is in terms of the conditional distribution of Y given X and the marginal distribution of X. Analogous to
Pr(A and B) = Pr(A)Pr(B|A),
for pdf's we have
f(x,y) = f(x)f(y|x).

It is often easy to describe these two and then multiply them to get the joint p.d.f.

There are many continuous multivariate distributions besides the normal. It is interesting to construct one from elementary considerations. Let the conditional distribution of Y given X=x be exponential with parameter x:

f(y|x) = x exp(-xy), y>0, x>0.

Let X have an exponential distribution with parameter k:

f(x) = k exp(-kx), x > 0.
Then
f(x,y)   =   f(x)f(y|x)
    =   k exp(-kx) x exp(-xy)
    =   kx exp[-(kx+xy)]
    =   kx exp[-x(y+k)],
x>0, y>0.

Usually in such situations we want to know the marginal p.d.f. of Y, to know what to expect when sampling Y alone. We find f(y) by integrating f(x,y) with respect to x.   In this example we get f(y) = k/(y+k)^2, y > 0. The cumulative distribution function is F(y) = 1 - k/(y+k), or  y/(y+k).   The median is the value of y for which F(y) = 1/2; this is y = k.
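
The two-stage construction lends itself to a simulation check of the marginal distribution of Y: draw X from its exponential distribution, then draw Y from the exponential distribution with parameter x, and compare the empirical distribution of Y with F(y) = y/(y+k). In this sketch the value k = 2, the sample size, the seed and the function names are choices made for this illustration. The "parameter" of each exponential is its rate, which is the argument that random.expovariate expects.

```python
import random

def simulate_y(k, n=100_000, seed=2):
    """Draw n values of Y by first drawing X, then Y given X = x."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        x = rng.expovariate(k)          # X: density k exp(-kx)
        ys.append(rng.expovariate(x))   # Y given X = x: density x exp(-xy)
    return ys

def cdf_y(y, k):
    """Marginal c.d.f. of Y: F(y) = y/(y+k)."""
    return y / (y + k)

k = 2.0
ys = simulate_y(k)
for y in (k / 3, k, 3 * k):
    empirical = sum(v <= y for v in ys) / len(ys)
    print(round(y, 3), round(empirical, 3), round(cdf_y(y, k), 3))
```

About half of the simulated values of Y should fall below k, in agreement with the median found above.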


References

Kendall, M. G., and Stuart, A. (1973).  The Advanced Theory of Statistics.  Vol. 2:  Inference and Relationship.  3rd ed.  Hafner, New York.

Created: 15 January 1997      Updated: 2 Nov 2000