University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences

IDS 470     Multivariate Statistical Analysis
Instructor               Sclove
Text   Hair et al., 5th ed.


Notes on Chapter 5     Discriminant Analysis; Logistic Regression
Part B Section-by-Section Commentary

HyperTable of Contents

5.0.   Learning Objectives . Chapter Preview . Key Terms
5.1.   What are Discriminant Analysis and Logistic Regression?
Classification
5.2.   Analogy with Regression and MANOVA
5.3.   Hypothetical Example of Discriminant Analysis (pp. 246ff)
5.3.1. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers
5.3.2. A Geometric Representation of the Two-Group Discriminant Function
5.3.3. A Three-Group Example of Discriminant Analysis: Switching Intentions
5.4.   The Decision Process for Discriminant Analysis
5.5.  Stage 1: Objectives of Discriminant Analysis
5.6.   Stage 2: Research Design for Discriminant Analysis
5.7.   Stage 3: Assumptions of Discriminant Analysis
5.8. Stage 4: Estimation of the Discriminant Model and Assessing Overall Fit
5.8.1. Computational Method
5.8.2. Statistical Significance
5.8.3. Assessing Overall Fit
DISCRIMINANT ANALYSIS: THEORY
UNEQUAL COSTS OF MISCLASSIFICATION
5.9. Stage 5: Interpretation of the Results
5.9.1. Discriminant Weights
5.9.2. Discriminant Loadings
5.9.3. Partial F Values
5.9.4. Interpretation of Two or More Functions
5.9.5. Which Interpretative Method to Use?
5.10. Stage 6: Validation of the Results
5.10.1. Split-Sample or Cross-Validation Procedures
5.10.2. Profiling Group Differences
5.11. Logistic Regression: Regression with a Binary Dependent Variable
5.12. A Two-Group Illustrative Example
5.13. A Three-Group Illustrative Example
5.14. An Illustrative Example of Logistic Regression
5.15. Summary . Questions . References
5.15.2. Additional Questions

5.0. LEARNING OBJECTIVES   .   CHAPTER PREVIEW   .   KEY TERMS

The data for classification can be considered as consisting of measurements X on individuals whose group membership Y is known. We classify an individual having values x into the group indicated by Y=1 if P(Y=1|X=x) is sufficiently large. The vector X can contain both metric and nonmetric variables. When it contains only metric variables, and when the within-group distributions are multivariate normal, there is a special method for classification, "Discriminant Analysis".

5.1. What are Discriminant Analysis and Logistic Regression?

Discriminant Analysis is best viewed as a special case of classification, so we begin this Commentary with a discussion of Classification in general. A linear classification function takes the form

Z_{ji} = a_j + W_{j1} X_{1i} + W_{j2} X_{2i} + . . . + W_{jn} X_{ni} .

This expression corrects the one at the bottom of p. 244, in that it shows that there is a different intercept and a different set of coefficients for each j = 1, 2, . . ., K, where K is the number of groups. Here

X_{vi} = value of the v-th variable for the i-th individual
a_j = intercept in the j-th classification function
W_{jv} = coefficient of the v-th variable in the j-th classification function,   and
Z_{ji} = score of Individual i on the j-th classification function.

When the group covariance matrices differ, quadratic classification functions are used. For n = 3 variables, for example, these would take the form

Z_{ji} = a_j + W_{j1} X_{1i} + W_{j2} X_{2i} + W_{j3} X_{3i} + W_{j11} X_{1i}^2 + W_{j22} X_{2i}^2 + W_{j33} X_{3i}^2 + W_{j12} X_{1i} X_{2i} + W_{j13} X_{1i} X_{3i} + W_{j23} X_{2i} X_{3i} .

5.2. Analogy with Regression and MANOVA

In regression, a numerical dependent variable Y is regressed on several explanatory variables. In discriminant analysis, by contrast, the dependent variable Y is categorical. If ANOVA or MANOVA is written in terms of regression, it is the explanatory variables that are categorical.

5.3. Hypothetical Example of Discriminant Analysis for Two Groups

5.3.1. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers

Here is output from the discriminant analysis. Note that "Buy?" is the binary dependent variable, indicating the group.   Note also that Style wasn't included in the MDA, since its t statistic was not significant.

MTB >    info
COLUMN    NAME      COUNT
C1        Case         10
C2        Durablty     10
C3        Perform      10
C4        Style        10
C5        Buy?         10   

MTB > DISCriminant analysis for labels in C5, data in C2-C3

TABLE.  Classification Functions
      Group:       0        1
   --------------------------
   Constant   -6.170  -25.619
   Durablty    1.823    5.309
   Perform     1.479    2.466

Q. What are the values of the two classification functions for an individual giving the food mixer a Durability rating of 4 and a Performance rating of 6?

A. For j = 0,1, let C(j|d,p) denote the value of the classification function for Group j, given Durability = d and Performance = p.

For Group 0:  C(0|4,6) = -6.170 + 1.823(4) + 1.479(6) = +9.996
For Group 1:   C(1|4,6) = -25.619 + 5.309(4) + 2.466(6) = +10.413

Q. Classify this individual.

A. Since C(1|4,6) > C(0|4,6), we classify this individual as 1: that is, we predict that this individual will buy.

Q. What is this person's posterior probability of membership in the 'Buy' group?

A. For j = 0,1, let p(j|d,p) denote the posterior probability of membership in Group j, given that Durability = d and Performance = p. A little algebra shows that the classification functions are equal to the log posterior probabilities, except for a constant which doesn't depend on the group. That is,

C(j|d,p) = ln[p(j|d,p)] + k.
This gives
p(j|d,p) = k'exp[C(j|d,p)].
p(0|4,6) = k'exp(+9.996)
p(1|4,6) = k'exp(+10.413)
The sum of the two is 1.
Hence k' = 1/[exp(+9.996) + exp(+10.413)].
p(1|4,6) = exp(+10.413)/[exp(+9.996) + exp(+10.413)] = 1/[1 + exp(9.996 - 10.413)] = 1/[1 + exp(-0.417)] = .603 .
Also, p(0|4,6) = .397.
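
These computations are easy to script. Here is a minimal Python sketch (the coefficients are copied from the Minitab table above; the function names are ours, not part of any package):

import math

# Classification-function coefficients from the Minitab output above:
# (constant, Durablty coefficient, Perform coefficient) for each group.
coef = {0: (-6.170, 1.823, 1.479),
        1: (-25.619, 5.309, 2.466)}

def C(j, d, p):
    # Value of the Group-j classification function at Durability = d, Performance = p.
    a, w_d, w_p = coef[j]
    return a + w_d * d + w_p * p

c0, c1 = C(0, 4, 6), C(1, 4, 6)        # 9.996 and 10.413
p1 = 1.0 / (1.0 + math.exp(c0 - c1))   # posterior probability of 'Buy' = .603
p0 = 1.0 - p1
print(c0, c1, p0, p1)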

5.3.2.   A Geometric Representation of the Two-Group Discriminant Function

5.4. The Decision Process for Discriminant Analysis

5.5. Stage 1: Objectives of Discriminant Analysis

5.6. Stage 2: Research Design for Discriminant Analysis

5.7. Stage 3: Assumptions of Discriminant Analysis

Strictly speaking, DA requires joint normal distributions within classes. If the covariance matrices are equal, linear classification functions result. (See "Discriminant Analysis: Theory" below.) If the covariance matrices are not equal, quadratic classification functions result. A "quadratic" subcommand or option is included in the discriminant analysis command of the various statistical packages.
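
As an illustration, the scikit-learn package in Python offers both linear and quadratic discriminant analysis; here is a minimal sketch on simulated (hypothetical) data:

import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

# Hypothetical data: two groups of 50 observations on 2 metric variables.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)     # assumes equal covariance matrices
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # allows unequal covariance matrices
print(lda.predict(X[:5]))
print(qda.predict_proba(X[:5]))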

5.8. Stage 4: Estimation of the Discriminant Model and Assessing Overall Fit

5.8.1. Computational Method

5.8.2. Statistical Significance

5.8.3. Assessing Overall Fit

Specifying Probabilities of Classification

DISCRIMINANT ANALYSIS: Theory

The posterior probability classification rule is to classify an individual having pattern x into Group 1 if p(1|x) is larger than p(0|x). (This is equivalent to the condition p(1|x) > 1/2.)

Let P(Y=j | X = x) = p(j|x) for j = 0, 1; let p(j) denote the prior probability of Group j, f(x|j) the density of X within Group j, and f(x) the unconditional density of X. Then all of the following are equivalent.

p(1|x) > p(0|x)
p(1)f(x|1)/f(x) > p(0)f(x|0)/f(x)
p(1)f(x|1) > p(0)f(x|0)

At this point, note that the comparison amounts to weighting each group's probability density by that group's prior probability.

When the class-conditional (within-group) probability density functions f(x|0) and f(x|1) are multinormal (see notes on related mathematics) with equal covariance matrices, this comparison reduces further to a comparison of the so-called "classification functions". These are, except for an additive constant, the log posterior probabilities.   That is,

C(j|x) = ln p(j|x) + k,   j = 0,1,

where the constant   k   is the same for both groups (j = 0 and 1). The functions C(j|x) are linear functions of x:
C(j|x) = w_{0j} + w_j^T x .

The vector w_j^T is µ_j^T C^{-1}, where µ_j is the mean vector of Group j and C is the common covariance matrix; the intercept is w_{0j} = ln p(j) - (1/2) µ_j^T C^{-1} µ_j. The parameters are estimated from data in the usual way. If the true value of w_0 is the same as that of w_1, then the variables are of no use in discriminating between the groups. The test for this is the test of equality of the mean vectors in the two groups. Remember how, in connection with Chapter 1, it was stated that one of the uses of multivariate analysis is to search for differences in every direction, that is, for every linear combination. The present situation is a case in point. A test for equality of the group centroids can be developed by doing the two-sample t-test for every linear combination y = a'x. It turns out that the most significant linear combination is the one given by the vector w_1 - w_0.
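
Here is a sketch of these estimates in Python with numpy, on simulated (hypothetical) data; equal prior probabilities are assumed, so the ln p(j) term drops out of the comparison:

import numpy as np

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, (40, 2))   # sample from Group 0
X1 = rng.normal(1.5, 1.0, (40, 2))   # sample from Group 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Pooled estimate of the common covariance matrix C:
S = ((len(X0) - 1) * np.cov(X0, rowvar=False) +
     (len(X1) - 1) * np.cov(X1, rowvar=False)) / (len(X0) + len(X1) - 2)
Sinv = np.linalg.inv(S)

# Classification-function weights w_j = C^{-1} mu_j and intercepts
# w_0j = -(1/2) mu_j' C^{-1} mu_j (the ln p(j) term is omitted under equal priors):
w = [Sinv @ m for m in (m0, m1)]
w0 = [-0.5 * (m @ Sinv @ m) for m in (m0, m1)]

def classify(x):
    scores = [w0[j] + w[j] @ x for j in (0, 1)]
    return int(scores[1] > scores[0])

print(classify(np.array([0.2, 0.1])), classify(np.array([1.4, 1.6])))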

Unequal Costs of Misclassification

We remarked above that the posterior-probability classification rule is to "Say 1" for x such that p(1|x) is greater than 1/2. When the costs of misclassification are unequal, the rule used is a minimum-expected-cost rule. It takes the form: Say "1" iff p(1|x) > c, where   c   depends upon the costs of misclassification and so is not necessarily 1/2.
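
To see where c comes from, let c(1|0) denote the cost of saying "1" when the truth is 0, and c(0|1) the cost of saying "0" when the truth is 1. The expected cost of saying "0" is p(1|x) c(0|1), and the expected cost of saying "1" is p(0|x) c(1|0). Saying "1" is the cheaper act iff

p(0|x) c(1|0) < p(1|x) c(0|1).

Since p(0|x) = 1 - p(1|x), this is equivalent to

p(1|x) > c(1|0)/[c(1|0) + c(0|1)] = c.

When the two costs are equal, c = 1/2, and the rule reduces to the posterior-probability rule.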

5.9. Stage 5: Interpretation of the Results

5.10. Stage 6: Validation of the Results

5.11. Logistic Regression: Regression with a Binary Dependent Variable

Let P_x = P(Y=1|X=x) and Q_x = P(Y=0|X=x).   When the distribution of X for Y = 1 is multinormal with mean µ_1 and the distribution of X for Y = 0 is multinormal with mean µ_0, and the two covariance matrices are equal, then   ln(P_x/Q_x)   is linear in x.   This suggests modeling the binary dependent variable Y by taking   ln(P_x/Q_x)   to be linear in x, even when the conditional distributions of X given Y = 0 and 1 are not multinormal.   The model

ln(P_x/Q_x) = b_0 + b_1^T x

is called the logistic regression model.   The function of P   (0 < P < 1) defined by ln [P/(1-P)] is called the logit of P.

Note that if the covariance matrices in the multinormal model were unequal, there would be quadratic terms. That is, the form would be

ln(P_x/Q_x)   =   a + b'x + x'Mx .

More generally the model can be

ln(P_x/Q_x)   =   a + b'g(x) .
That is, other functions of the elements of x can be allowed.

Function        Name                              Range
P_x             prob. that Y=1, given that X=x    0 to 1
P_x/Q_x         odds                              0 to infinity
ln(P_x/Q_x)     log odds, or "logit"              negative infinity to positive infinity

A little algebra shows that if logit(P) = z, then P = e^z/(1+e^z), or   1/(1+e^{-z}).   This function is called the logistic function.
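
A quick numerical check of this inverse relationship, in Python:

import math

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(logit(0.25)))   # recovers 0.25
print(logit(logistic(1.7)))    # recovers 1.7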

Estimation

If there is more than one observation at each value x, weighted least squares estimation can be used. Example: testing the strength of wires. P = P_x = probability of breaking at weight x. Odds = P/Q, where Q = 1-P.


Weight applied (lbs.):       10     20     30     40     50
Number of wires breaking:     4      8     18     76     90
Number of wires tested
at this weight:             100    100    100    100    100
p, estimate of P:           .04    .08    .18    .76    .90
q, estimate of Q:           .96    .92    .82    .24    .10
Odds, p/q:                .0417  .0870  .2195  3.167  9.000
Logit = ln(Odds):         -3.18  -2.44  -1.52  +1.15  +2.20


The logit is regressed on x, using weighted regression. The weight is 1/Var(y), where here y = logit(p), and Var[logit(p)] can be shown to be approximately 1/(NPQ), so the weight is NPQ, which is estimated by Npq. The median breaking strength can be estimated as the value of x for which P_x = 1/2, that is, logit(P_x) = 0.
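
Here is a sketch of the weighted least squares computation in Python, using the wire data above (the normal equations are solved directly; a statistics package would also report standard errors):

import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)   # weight applied (lbs.)
r = np.array([4, 8, 18, 76, 90], dtype=float)     # number of wires breaking
n = np.full(5, 100.0)                             # number tested at each weight

p = r / n
q = 1.0 - p
y = np.log(p / q)     # observed logits
wts = n * p * q       # weights Npq, estimating 1/Var[logit(p)]

# Weighted least squares: solve (X'WX) b = X'Wy for b = (b0, b1).
X = np.column_stack([np.ones_like(x), x])
W = np.diag(wts)
b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Median breaking strength: the x at which the fitted logit is 0, i.e. x = -b0/b1.
print(b, -b[0] / b[1])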

Maximum likelihood estimation is used when there are not repeated observations at each pattern of x. The likelihood L is maximized. The higher the maximized value of L, the better the fit of the model. This is assessed on a log scale by computing -2 log L, called -2LL . (This criterion corresponds to residual mean square in normal multiple regression models.) When there are several explanatory variables, different models can be assessed using -2LL as a figure-of-merit.

If you have a sample y_1, y_2, . . ., y_N of 0-1 variables with success probability P, then the likelihood can be written

P^{y_1} Q^{1-y_1} × P^{y_2} Q^{1-y_2} × . . . × P^{y_N} Q^{1-y_N} .

The maximum likelihood estimator is the value of P which maximizes this; it turns out to be simply

(y_1 + y_2 + . . . + y_N)/N ,

the sample proportion of 1's.
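
A quick numerical illustration in Python: for a small 0-1 sample, the Bernoulli likelihood over a grid of P values peaks at the sample proportion.

import numpy as np

y = np.array([1, 0, 1, 1, 0])        # hypothetical 0-1 sample; proportion of 1's = .60
grid = np.linspace(0.01, 0.99, 99)   # candidate values of P
lik = [np.prod(P**y * (1 - P)**(1 - y)) for P in grid]
print(grid[np.argmax(lik)])          # 0.60, the sample mean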

In the logistic regression model, P_x = 1/[1 + exp(-B^T x)], where B is the vector of logistic regression parameters, to be estimated. This is done by maximizing the likelihood by numerical methods.

Testing is based on -2LL, where LL is the natural log of the maximized likelihood. It rests on the fact that (-2LL)_reduced - (-2LL)_full is, for large N, distributed approximately as chi-square with n_full - n_reduced degrees of freedom, where n_full and n_reduced are the numbers of parameters in the full and reduced models.
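
Here is a sketch of such a model comparison in Python using the statsmodels and scipy packages; the data are simulated, and the variable roles are hypothetical:

import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Simulated data: three candidate explanatory variables, of which only two matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
eta = 0.5 + 1.0 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

full = sm.Logit(y, sm.add_constant(X)).fit(disp=0)            # all three variables
reduced = sm.Logit(y, sm.add_constant(X[:, :2])).fit(disp=0)  # drops the third

lr = (-2 * reduced.llf) - (-2 * full.llf)    # (-2LL)reduced - (-2LL)full
df = full.df_model - reduced.df_model        # difference in number of parameters
print(lr, chi2.sf(lr, df))                   # chi-square test of the dropped variable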

5.12. A Two-Group Illustrative Example

The example in this section could as well be done by MANOVA, since the focus is on finding which variables account for the Buy/Not-Buy split. A credit-scoring example is more interesting, in that the interest there focuses on the variate which gives the credit score.

5.13. A Three-Group Illustrative Example

5.14. An Illustrative Example of Logistic Regression

The Food Mixer Example again

Refer again to the data of Table 5.1.   The three explanatory variables are Performance, Durability, and Style.   If Gender -- a categorical variable -- were included among the explanatory variables, the whole analysis would change:   you couldn't use MDA; you would have to use Logistic Regression, because Gender is a nonmetric variable.

The Credit Data

We consider also the Credit Data, on 113 applicants for charge accounts at a department store in a large city in the northeastern U.S.   The variables include gender and marital status, which are categorical and hence require Logistic Regression rather than MDA.

HATCO Data

The book considers the HATCO data, with X11 (spec buying vs. total value analysis) as the binary dependent variable. The explanatory variables are X1-X7 in the HATCO dataset, the seven perceptions-of-HATCO variables. These variables are metric, and MDA could be used (as was illustrated earlier in the chapter). Again, this example is not compelling from the point of view of classification; one could as easily study the effects of X1 through X7 with MANOVA.

Exercises on Logistic Regression

Longevity of Restaurants

Suppose a logistic regression analysis of some data for restaurants gave the following result.
L(x1,x2) = .3 - .2 x1 + .1 x2,
where
L(x1,x2) = ln{P(Y=1|X1=x1,X2=x2)/P(Y=0 |X1=x1,X2=x2)}
Y = 1 denotes bankruptcy within three years of startup,
Y = 0 denotes staying in business more than three years,
X1 = 1 if franchised, 0 if not,
X2 = 1 if a fast-food restaurant, 0 if not.
1.   What is the value of L(0,1) ?
(A) .1    
(B) .2    
(C) .3    
(D) .4    
(E) .5
2.   The number P(Y=1|X1=0,X2=1) is the probability that
(A) a franchised fast-food restaurant is bankrupt within three years.
(B) a franchised non-fast food restaurant is not bankrupt within three years.
(C) a non-franchised fast-food restaurant is bankrupt within three years.
(D) a non-franchised non-fast food restaurant is bankrupt within three years.
(E) a fast-food restaurant is franchised.
3.   (continuation) What is the value of this probability ?
(A) .401
(B) .450
(C) .550
(D) .599
(E) .67

Logit Function

4. If P = 1/2, what is the value of the logit function, ln(P/Q), where Q = 1-P ?
(A) -1    
(B) -1/2    
(C) 0  
(D) 1/2  
(E) 1

Exponential and Logarithmic Functions

5.   For x > 0, e^{ln(x)} = ?

(A) e^x  
(B) ln(x)    
(C) x
(D) 0  
(E) 1

6.   ln(e^x) = ?

(A) e^x  
(B) ln(x)    
(C) x  
(D) 0  
(E) 1

7.     ln(2e) =
(A) 1/2
(B) 1
(C) 2
(D) 3
(E) 1 + ln(2)
8.   Which of the following is closest to the value of the number 1/e ?
(A) 1/3  
(B) 0.368    
(C) 1/2  
(D) 3  
(E) 3.14159

Logistic Function

9.   If z = 0, what is the value of the logistic function, 1/(1 + e^{-z}) ?
(A) -1  
(B) -1/2    
(C) 0  
(D) 1/2  
(E) 1

5.15. Summary .   Questions .   References

5.15.2. Additional Questions


These notes Copyright © 2005 Stanley Louis Sclove
Created   1998 Oct 19     Updated   2005 Sept 28