University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences
IDS 470: Multivariate Statistical Analysis
Instructor: Sclove
Text: Hair et al., 5th ed.
Notes on Chapter 5: Discriminant Analysis; Logistic Regression
Part B: Section-by-Section Commentary
HyperTable of Contents
- 5.0. Learning Objectives . Chapter Preview . Key Terms
- 5.1. What are Discriminant Analysis and Logistic Regression?
- Classification
- 5.2. Analogy with Regression and MANOVA
- 5.3. Hypothetical Example of Discriminant Analysis (pp. 246ff)
- 5.3.1. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers
- 5.3.2. A Geometric Representation of the Two-Group Discriminant Function
- 5.3.3. A Three-Group Example of Discriminant Analysis: Switching Intentions
- 5.4. The Decision Process for Discriminant Analysis
- 5.5. Stage 1: Objectives of Discriminant Analysis
- 5.6. Stage 2: Research Design for Discriminant Analysis
- 5.7. Stage 3: Assumptions of Discriminant Analysis
- 5.8. Stage 4: Estimation of the Discriminant Model and Assessing Overall Fit
- 5.8.1. Computational Method
- 5.8.2. Statistical Significance
- 5.8.3. Assessing Overall Fit
- DISCRIMINANT ANALYSIS: THEORY
- UNEQUAL COSTS OF MISCLASSIFICATION
- 5.9. Stage 5: Interpretation of the Results
- 5.9.1. Discriminant Weights
- 5.9.2. Discriminant Loadings
- 5.9.3. Partial F Values
- 5.9.4. Interpretation of Two or More Functions
- 5.9.5. Which Interpretative Method to Use?
- 5.10. Stage 6: Validation of the Results
- 5.10.1. Split-Sample or Cross-Validation Procedures
- 5.10.2. Profiling Group Differences
- 5.11. Logistic Regression: Regression with a Binary Dependent Variable
- 5.12. A Two-Group Illustrative Example
- 5.13. A Three-Group Illustrative Example
- 5.14. An Illustrative Example of Logistic Regression
- 5.15. Summary . Questions . References
- 5.15.2. Additional Questions
5.0. LEARNING OBJECTIVES . CHAPTER PREVIEW . KEY TERMS
The data for classification can be considered as consisting of
- a group label variable Y, which takes the values 0 and 1 in the case of two groups, and
- explanatory variables X1, X2, . . ., Xn .
We classify an individual having values x into
the group indicated by Y=1 if P(Y=1|X=x) is
sufficiently large.
The vector X can contain both metric and nonmetric variables.
When it contains only metric variables,
and when the within-group distributions are multivariate normal,
there is a special method for classification, "Discriminant Analysis".
5.1. What are Discriminant Analysis and Logistic Regression?
Thus, Discriminant Analysis is best viewed as a special case of classification,
so we begin this Commentary with a discussion of Classification in general.
A linear classification function takes the form
$Z_{ji} = a_j + W_{j1} X_{1i} + W_{j2} X_{2i} + \cdots + W_{jn} X_{ni}$ .
This expression corrects that at the bottom of p. 244, in that it shows
that there is a different intercept and coefficients for each j = 1, 2, . . ., K,
where K is the number of groups. Here
$X_{vi}$ = value of the v-th variable for the i-th individual,
$a_j$ = intercept in the j-th classification function,
$W_{jv}$ = coefficient of the v-th variable in the j-th classification function, and
$Z_{ji}$ = score of Individual i on the j-th classification function.
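As a minimal sketch of how such functions are used, the Python below scores one individual on each of K linear classification functions and assigns the group with the largest score. The function name `classification_scores` and all coefficient values are hypothetical, chosen only for illustration; they do not come from the text or from any dataset.

```python
def classification_scores(intercepts, weights, x):
    """Return Z_j = a_j + sum over v of W_jv * x_v, one score per group j."""
    return [a + sum(w * xv for w, xv in zip(ws, x))
            for a, ws in zip(intercepts, weights)]

# Hypothetical coefficients for K = 3 groups and n = 2 variables.
intercepts = [-1.0, -2.0, -4.0]                   # a_j
weights = [[0.5, 0.2], [0.9, 0.3], [1.0, 0.8]]    # W_jv, one row per group
x = [3.0, 2.0]                                    # one individual's values

scores = classification_scores(intercepts, weights, x)
# Assign the individual to the group whose classification function is largest.
predicted_group = max(range(len(scores)), key=lambda j: scores[j])
print(scores, predicted_group)
```

Note that each group has its own intercept and its own row of coefficients, which is the point of the correction to the expression on p. 244.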
When the group covariance matrices differ, quadratic classification
functions are used. For n = 3 variables, for example, these would
take the form
$Z_{ji} = a_j + W_{j1} X_{1i} + W_{j2} X_{2i} + W_{j3} X_{3i} + W_{j11} X_{1i}^2 + W_{j22} X_{2i}^2 + W_{j33} X_{3i}^2 + W_{j12} X_{1i} X_{2i} + W_{j13} X_{1i} X_{3i} + W_{j23} X_{2i} X_{3i}$ .
5.2. Analogy with Regression and MANOVA
In regression, a numerical dependent variable Y is regressed on
several explanatory variables.
In DA, the dependent variable Y is categorical.
If ANOVA or MANOVA is written in terms of regression,
the explanatory variables are categorical.
5.3. Hypothetical Example of Discriminant Analysis for Two Groups
5.3.1. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers
Here is output from the discriminant analysis.
Note that "Buy?" is the binary dependent
variable, indicating the group.
Note also that Style wasn't included in the MDA since
its t statistic was not significant (N.S.).
MTB > info
COLUMN NAME COUNT
C1 Case 10
C2 Durablty 10
C3 Perform 10
C4 Style 10
C5 Buy? 10
MTB > DISCriminant analysis for labels in C5, data in C2-C3
TABLE. Classification Functions
Group: 0 1
--------------------------
Constant -6.170 -25.619
Durablty 1.823 5.309
Perform 1.479 2.466
Q. What are the values of the two classification functions
for an individual giving the food mixer a Durability rating of 4 and
a Performance rating of 6?
A. For j = 0,1, let C(j|d,p) denote the value of the classification
function for Group j, given Durability = d and Performance = p.
For Group 0: C(0|4,6) = -6.170 + 1.823(4) + 1.479(6) = +9.996
For Group 1: C(1|4,6) = -25.619 + 5.309(4) + 2.466(6) = +10.413
Q. Classify this individual.
A. Since C(1|4,6) > C(0|4,6), we classify this individual as 1:
that is, we predict that this individual will buy.
Q. What is this person's posterior probability of membership in the 'Buy' group?
A. For j = 0,1, let p(j|d,p) denote the posterior probability of membership in group j, given that Durability = d and Performance = p. It can be shown that the classification functions are equal to the
log posterior probabilities, except for a constant which doesn't
depend on the group. That is,
C(j|d,p) = ln[p(j|d,p)] + k.
This gives
p(j|d,p) = k'exp[C(j|d,p)], where k' = exp(-k).
p(0|4,6) = k'exp(+9.996)
p(1|4,6) = k'exp(+10.413)
The sum of the two is 1.
Hence k' = 1/[exp(+9.996) + exp(+10.413)].
p(1|4,6) = exp(+10.413)/[exp(+10.413) + exp(+9.996)]
= 1/[1 + exp(9.996-10.413)] = 1/[1 + exp(-0.417)] = .6028 .
Also, p(0|4,6) = .3972.
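The steps of this question can be sketched in Python: evaluate both classification functions from the coefficients in the Minitab table, then normalize exp(score) so that the two posterior probabilities sum to 1. The helper names `classification_function` and `posteriors` are illustrative assumptions; subtracting the largest score before exponentiating is a numerical convenience that leaves the result unchanged, because the common factor cancels just as k' does.

```python
import math

# Coefficients copied from the Minitab table of classification functions.
coef = {
    0: {"const": -6.170, "durab": 1.823, "perform": 1.479},
    1: {"const": -25.619, "durab": 5.309, "perform": 2.466},
}

def classification_function(group, d, p):
    """C(group | d, p) for Durability = d and Performance = p."""
    c = coef[group]
    return c["const"] + c["durab"] * d + c["perform"] * p

def posteriors(scores):
    """Turn classification scores into posterior probabilities."""
    top = max(scores.values())                 # common shift, cancels out
    expd = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(expd.values())
    return {j: e / total for j, e in expd.items()}

scores = {j: classification_function(j, 4, 6) for j in (0, 1)}
post = posteriors(scores)
print(scores)
print(post)
```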
5.3.2. A Geometric Representation of the Discriminant Function
5.4. The Decision Process for Discriminant Analysis
5.5. Stage 1: Objectives of Discriminant Analysis
5.6. Stage 2: Research Design for Discriminant Analysis
5.7. Stage 3: Assumptions of Discriminant Analysis
Strictly speaking, DA requires joint normal distributions within classes.
If the covariance matrices are equal, linear classification functions result. (See "Discriminant Analysis: Theory" below.) If the covariance matrices are not equal, quadratic classification functions result. A "quadratic" subcommand or option is included in the discriminant analysis command of the various statistical packages.
5.8. Stage 4: Estimation of the Discriminant Model and Assessing Overall Fit
5.8.1. Computational Method
5.8.2. Statistical Significance
5.8.3. Assessing Overall Fit
Specifying Probabilities of Classification
DISCRIMINANT ANALYSIS: Theory
The posterior probability classification rule
is to classify an individual having pattern x
into Group 1 if p(1|x) is larger than p(0|x).
(This is equivalent to the condition p(1|x) > 1/2.)
Let P(Y=j | X = x) = p(j|x), for j = 0, 1.
Then all of the following are equivalent.
p(1|x) > p(0|x)
p(1)f(x|1)/f(x) > p(0)f(x|0)/f(x)
p(1)f(x|1) > p(0)f(x|0)
At this point note that this amounts to weighting ("discounting") each group's probability density by
the corresponding prior probability.
When the class-conditional (within-group) probability density functions f(x|0) and
f(x|1) are multinormal (see notes on related mathematics) with equal covariance matrices,
this comparison reduces further to a comparison of the so-called "classification functions".
These are, except for an additive constant,
the log posterior probabilities.
That is,
C(j|x) = ln p(j|x) + k, j = 0,1,
where the constant k
is the same for both groups (j = 0 and 1).
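The functions C(j|x) are linear functions of x:
$C(j|x) = w_{0j} + w_j^T x$
The coefficient vector is $w_j^T = \mu_j^T C^{-1}$, where $\mu_j$ is the mean vector of Group j and C here denotes the common within-group covariance matrix. The parameters are estimated from data in the usual way. If the true coefficient vectors $w_0$ and $w_1$ (those of Groups 0 and 1) are equal, then the variables are of no use in discriminating between groups.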
The test for this is the test of equality of mean vectors in the two groups.
Remember how in connection with Chapter 1 it was stated that one of the uses of multivariate analysis was to search for differences in every direction, that is, for every linear combination.
The present situation is a case in point.
A test for equality of the group centroids can be developed by doing the two-sample t-test
for every linear combination y = a'x.
It turns out that the most significant linear combination is
that given by the vector
w1-w0.
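A minimal sketch of these linear classification functions for two variables follows. The coefficient vector uses the formula given above; the intercept formula w0j = -0.5 µj' C^-1 µj + ln p(j) is the standard result for the equal-covariance normal model and is an assumption here, since these notes do not state it. The group means, pooled covariance matrix, and priors are hypothetical values.

```python
import math

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def linear_classification_function(mean, cov_inv, prior):
    """Return (w0j, wj): wj = C^-1 mu_j, w0j = -0.5 mu_j' C^-1 mu_j + ln(prior)."""
    w = [sum(cov_inv[r][c] * mean[c] for c in range(2)) for r in range(2)]
    w0 = -0.5 * sum(wv * mv for wv, mv in zip(w, mean)) + math.log(prior)
    return w0, w

# Hypothetical group means, pooled covariance matrix, and equal priors.
means = {0: [3.0, 4.0], 1: [7.0, 6.0]}
pooled_cov = [[4.0, 1.0], [1.0, 3.0]]
cov_inv = inverse_2x2(pooled_cov)
funcs = {j: linear_classification_function(m, cov_inv, 0.5)
         for j, m in means.items()}

def score(j, x):
    """Value of the classification function C(j|x)."""
    w0, w = funcs[j]
    return w0 + sum(wv * xv for wv, xv in zip(w, x))

# Difference of the coefficient vectors, w1 - w0.
direction = [a - b for a, b in zip(funcs[1][1], funcs[0][1])]
midpoint = [5.0, 5.0]   # halfway between the two hypothetical means
print(direction, score(0, midpoint), score(1, midpoint))
```

With equal priors the two scores coincide at the midpoint of the group means, so the boundary between the groups passes through that point.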
Unequal Costs of Misclassification
We remarked above that the posterior-probability classification
rule is to "Say 1" for x such that p(1|x) is greater than 1/2.
When the costs of misclassification are unequal, the rule used is a
minimum-expected-cost rule.
It takes the form: say "1" if and only if p(1|x) > c, where c depends upon the costs of misclassification and so is not necessarily 1/2.
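As a hedged illustration of where c comes from: if correct classifications cost nothing, then comparing the expected cost of saying "1" with that of saying "0" gives c = cost(say 1 | truth 0) / [cost(say 1 | truth 0) + cost(say 0 | truth 1)]. This formula is a standard result and is not stated in these notes; the function and parameter names below are hypothetical.

```python
def cost_threshold(cost_false_1, cost_false_0):
    """Threshold c: saying "1" has lower expected cost iff p(1|x) > c.

    cost_false_1: cost of saying "1" when the truth is 0.
    cost_false_0: cost of saying "0" when the truth is 1.
    """
    return cost_false_1 / (cost_false_1 + cost_false_0)

def decide(p1, cost_false_1, cost_false_0):
    """Minimum-expected-cost decision for posterior probability p1 = p(1|x)."""
    return 1 if p1 > cost_threshold(cost_false_1, cost_false_0) else 0

print(cost_threshold(1.0, 1.0))   # equal costs
print(decide(0.3, 1.0, 4.0))      # missing a true "1" is four times as costly
print(decide(0.3, 1.0, 1.0))      # same posterior, equal costs
```

With equal costs the threshold is 1/2, which recovers the posterior-probability rule.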
5.9. Stage 5: Interpretation of the Results
5.10. Stage 6: Validation of the Results
5.11. Logistic Regression:
Regression with a Binary Dependent Variable
Let Px = P(Y=1|X=x) and Qx = P(Y=0|X=x). When the distribution of X for Y = 1 is multinormal with mean µ1 and the distribution of X for Y = 0 is multinormal with mean µ0, and the two covariance matrices are equal, then
ln (Px/Qx) is linear in x .
This suggests modeling
the binary dependent variable Y by taking ln Px/Qx to be linear in x,
even when the conditional distributions of X given Y = 0 and 1 are not multinormal.
The model
$\ln(P_x/Q_x) = b_0 + b_1^T x$
is called the logistic regression model.
The function of P (0 < P < 1) defined by ln [P/(1-P)]
is called the logit of P.
Note that if the covariance matrices in the multinormal model were unequal,
there would be quadratic terms.
That is, the form would be
ln (Px/Qx) = a + b'x + x'Mx.
More generally the model can be
ln (Px/Qx) =
a + b'g(x).
That is, other functions of the elements of x can be allowed.
| Function: | Name: | Range: |
| Px | prob. that Y=1, given that X=x | 0 to 1 |
| Px/Qx | Odds | 0 to infinity |
| ln(Px/Qx) | log Odds, or "logit" | negative infinity to positive infinity |
A little algebra shows that if logit(P) = z,
then $P = e^z/(1+e^z)$,
or equivalently $1/(1+e^{-z})$ .
This function is called the logistic function.
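The logit and the logistic function are inverses of each other, which a short Python sketch can confirm numerically. The function names are illustrative, and the probabilities tried are arbitrary.

```python
import math

def logit(p):
    """Log odds ln[P/(1-P)] for 0 < P < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: 1/(1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.04, 0.18, 0.9):
    z = logit(p)
    # Applying the logistic function to the logit recovers p.
    print(p, round(z, 3), round(logistic(z), 3))
```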
Estimation
If there is more than one observation at each value x, weighted least squares estimation can be used.
Example. Testing strength of wires. P = Px = probability of breaking at weight x. Odds = P/Q, where Q = 1-P.
| Weight applied (lbs.): | 10 | 20 | 30 | 40 | 50 |
| Number of wires breaking: | 4 | 8 | 18 | 76 | 90 |
| Number of wires tested at this weight: | 100 | 100 | 100 | 100 | 100 |
| p, estimate of P: | .04 | .08 | .18 | .76 | .90 |
| q, estimate of Q: | .96 | .92 | .82 | .24 | .10 |
| Odds, p/q: | .0417 | .0870 | .2195 | 3.167 | 9.000 |
| Logit = ln(Odds): | -3.18 | -2.44 | -1.52 | +1.15 | +2.20 |
The logit is regressed on x,
except that weighted regression is used.
The weight is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(NPQ),
so the weight is NPQ, which is estimated by Npq. The median breaking
strength can be estimated as the value of x for which $p_x = 1/2$, that is, $\mathrm{logit}(p_x) = 0$.
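The weighted regression just described can be sketched in plain Python using the wire data. The closed-form weighted least squares formulas for a single predictor are standard; the variable names are illustrative, and this is a sketch of the procedure rather than the output of any particular statistical package.

```python
import math

weights_lbs = [10, 20, 30, 40, 50]     # x, weight applied
broken = [4, 8, 18, 76, 90]            # number of wires breaking
tested = [100] * 5                     # N, wires tested at each weight

p = [b / n for b, n in zip(broken, tested)]
y = [math.log(pi / (1 - pi)) for pi in p]              # observed logits
w = [n * pi * (1 - pi) for n, pi in zip(tested, p)]    # regression weights Npq

# Weighted least squares for y = intercept + slope * x.
sw = sum(w)
x_bar = sum(wi * xi for wi, xi in zip(w, weights_lbs)) / sw
y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
          for wi, xi, yi in zip(w, weights_lbs, y))
sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, weights_lbs))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
# The fitted logit is 0 (so p = 1/2) at x = -intercept / slope.
median_strength = -intercept / slope
print(slope, intercept, median_strength)
```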
Maximum likelihood estimation is used when there are not repeated observations at each pattern of x. The likelihood L is
maximized. The higher the maximized value of L, the better the fit of the
model. This is assessed on a log scale by computing -2 log L, called -2LL; the smaller the value of -2LL, the better the fit. (This criterion plays the role that the residual sum of squares plays in normal
multiple regression models.) When there are several explanatory variables, different models can be compared using -2LL as a figure of merit.
If you have a sample $y_1, y_2, \ldots, y_N$ of 0,1 variables with success probability P (and Q = 1-P), then the likelihood can be written
$P^{y_1}Q^{1-y_1} \times P^{y_2}Q^{1-y_2} \times \cdots \times P^{y_N}Q^{1-y_N}$.
The maximum likelihood estimator is the value of P which maximizes this;
it turns out to be simply
$(y_1 + y_2 + \cdots + y_N)/N$ .
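A quick numerical check of this claim, as a sketch: evaluate the log of the likelihood over a fine grid of candidate values of P and compare the maximizer with the sample proportion. The 0/1 sample below is hypothetical.

```python
import math

def log_likelihood(P, ys):
    """Natural log of the product of P^y * Q^(1-y) over the sample, Q = 1 - P."""
    return sum(y * math.log(P) + (1 - y) * math.log(1 - P) for y in ys)

ys = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]            # hypothetical 0/1 sample
grid = [k / 1000 for k in range(1, 1000)]      # candidate values of P in (0, 1)

p_hat_grid = max(grid, key=lambda P: log_likelihood(P, ys))
p_hat_formula = sum(ys) / len(ys)              # sample proportion
print(p_hat_grid, p_hat_formula)
```

Maximizing the log-likelihood gives the same answer as maximizing the likelihood itself, since the logarithm is an increasing function.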
In the logistic regression model, $P_x = 1/[1+\exp(-B^T x)]$, where B is the
vector of logistic regression parameters, to be estimated. This is done
by maximizing the likelihood by numerical methods.
Testing is based on -2LL, where LL is the natural log of the maximized likelihood. It relies on the fact that, when the reduced model holds and N is large, $(-2LL)_{reduced} - (-2LL)_{full}$ is distributed approximately as chi-square with $n_{full} - n_{reduced}$ d.f., where $n_{full}$ and $n_{reduced}$ are the numbers of estimated parameters in the full and reduced models.
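The following sketch shows one way the numerical maximization and the -2LL comparison might be carried out, using Newton-Raphson on grouped binomial data. Reusing the wire data here is an assumption made purely for illustration (the notes fit those data by weighted least squares), as are the function names and the centering and scaling of the weights. Binomial coefficients are omitted from the log-likelihood because they are the same for both models and cancel in the difference.

```python
import math

xs = [(w - 30) / 10 for w in (10, 20, 30, 40, 50)]   # centered, scaled weights
broke = [4, 8, 18, 76, 90]
tested = [100] * 5

def prob(b0, b1, x):
    """Logistic model probability 1/(1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def fit(use_slope, iters=25):
    """Newton-Raphson maximization of the binomial log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, n in zip(xs, broke, tested):
            p = prob(b0, b1, x)
            r = s - n * p              # observed minus expected count
            v = n * p * (1 - p)        # binomial variance
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        if use_slope:
            det = h00 * h11 - h01 * h01
            b0, b1 = (b0 + (h11 * g0 - h01 * g1) / det,
                      b1 + (h00 * g1 - h01 * g0) / det)
        else:
            b0 += g0 / h00             # intercept-only (reduced) model
    return b0, b1

def minus_2ll(b0, b1):
    """-2 times the log-likelihood at the given parameters."""
    ll = sum(s * math.log(prob(b0, b1, x))
             + (n - s) * math.log(1 - prob(b0, b1, x))
             for x, s, n in zip(xs, broke, tested))
    return -2.0 * ll

full = fit(use_slope=True)
reduced = fit(use_slope=False)
chi_square = minus_2ll(*reduced) - minus_2ll(*full)   # 1 d.f. in this sketch
print(full, reduced, chi_square)
```

Because the full model adds one parameter (the slope), the difference is referred to a chi-square distribution with 1 d.f.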
5.12. A Two-Group Illustrative Example
The example in this section could as well be done by MANOVA,
since the focus is on
finding which variables account for the Buy/Not Buy split. A Credit Scoring example is more
interesting, in that interest focuses on the variate which gives the credit score.
5.13. A Three-Group Illustrative Example
5.14. An Illustrative Example of Logistic Regression
The Food Mixer Example again
Refer again to the data of Table 5.1. The three
explanatory variables are Performance, Durability and Style. If Gender -- a categorical variable -- were included
among the explanatory variables, the whole analysis would change. You couldn't use MDA, you would
have to use Logistic Regression, because Gender is a nonmetric variable.
The Credit Data
We consider also the Credit Data, on 113 applicants for charge accounts at a department store
in a large city in the northeastern U.S. The variables include gender and marital status, which are
categorical and hence require Logistic Regression rather than MDA.
HATCO Data
The book considers the HATCO data, with X11 (spec buying vs. total value analysis) as the binary dependent variable. The explanatory variables are X1-X7 in the HATCO dataset, the seven perceptions-of-HATCO variables. These variables are metric, and MDA could be used (as was illustrated earlier in the chapter). Again, this example is not compelling from the point of view
of classification; one could as easily study the effects of X1 through X7 with MANOVA.
Exercises on Logistic Regression
Longevity of Restaurants
Suppose a logistic regression analysis of some data for restaurants gave the following result.
L(x1,x2) = .3 - .2 x1 + .1 x2,
where
L(x1,x2) = ln{P(Y=1|X1=x1,X2=x2)/P(Y=0|X1=x1,X2=x2)},
Y = 1 denotes bankruptcy within three years of startup,
Y = 0 denotes staying in business more than three years,
X1 = 1 if franchised, 0 if not,
X2 = 1 if a fast-food restaurant, 0 if not.
1. What is the value of L(0,1)?
(A) .1 (B) .2 (C) .3 (D) .4 (E) .5
2. The number P(Y=1|X1=0,X2=1) is the probability that
(A) a franchised fast-food restaurant is bankrupt within three years.
(B) a franchised non-fast food restaurant is not bankrupt within three years.
(C) a non-franchised fast-food restaurant is bankrupt within three years.
(D) a non-franchised non-fast food restaurant is bankrupt within three years.
(E) a fast-food restaurant is franchised.
3. (continuation) What is the value of this probability ?
(A) .401 (B) .450 (C) .550 (D) .599 (E) .67
Logit Function
4. If P = 1/2, what is the value of the logit function, ln(P/Q),
where Q = 1-P ?
(A) -1 (B) -1/2 (C) 0 (D) 1/2 (E) 1
Exponential and Logarithmic Functions
5. For x > 0, $e^{\ln(x)}$ = ?
(A) $e^x$ (B) ln(x) (C) x (D) 0 (E) 1
6. $\ln(e^x)$ = ?
(A) $e^x$ (B) ln(x) (C) x (D) 0 (E) 1
7. ln(2e) =
(A) 1/2 (B) 1 (C) 2 (D) 3 (E) 1 + ln(2)
8. Which of the following is closest to the value of the number 1/e ?
(A) 1/3 (B) 0.368 (C) 1/2 (D) 3 (E) 3.14159
Logistic Function
9. If z = 0, what is the value of the logistic function, $1/(1 + e^{-z})$ ?
(A) -1 (B) -1/2 (C) 0 (D) 1/2 (E) 1
5.15. Summary . Questions . References
5.15.2. Additional Questions
These notes Copyright © 2005 Stanley Louis Sclove
Created 1998 Oct 19. Updated 2005 Sept 28.