University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences
IDS 470: Multivariate Statistical Analysis
Instructor: Sclove

Text: Hair et al., 5th ed.



Notes on Chapter 5: Discriminant Analysis; Logistic Regression

Part B: Section-by-Section Commentary

HyperTable of Contents
 5.0. Learning Objectives . Chapter Preview . Key Terms
 5.1. What are Discriminant Analysis and Logistic Regression?
 Classification
 5.2. Analogy with Regression and MANOVA
 5.3. Hypothetical Example of Discriminant Analysis (pp. 246ff)
 5.3.1. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers
 5.3.2. A Geometric Representation of the Two-Group Discriminant Function
 5.3.3. A Three-Group Example of Discriminant Analysis: Switching Intentions
 5.4. The Decision Process for Discriminant Analysis
 5.5. Stage 1: Objectives of Discriminant Analysis
 5.6. Stage 2: Research Design for Discriminant Analysis
 5.7. Stage 3: Assumptions of Discriminant Analysis
 5.8. Stage 4: Estimation of the Discriminant Model and Assessing Overall Fit
 5.8.1. Computational Method
 5.8.2. Statistical Significance
 5.8.3. Assessing Overall Fit
 DISCRIMINANT ANALYSIS: THEORY
 UNEQUAL COSTS OF MISCLASSIFICATION
 5.9. Stage 5: Interpretation of the Results
 5.9.1. Discriminant Weights
 5.9.2. Discriminant Loadings
 5.9.3. Partial F Values
 5.9.4. Interpretation of Two or More Functions
 5.9.5. Which Interpretative Method to Use?
 5.10. Stage 6: Validation of the Results
 5.10.1. Split-Sample or Cross-Validation Procedures
 5.10.2. Profiling Group Differences
 5.11. Logistic Regression: Regression with a Binary Dependent Variable
 5.12. A Two-Group Illustrative Example
 5.13. A Three-Group Illustrative Example
 5.14. An Illustrative Example of Logistic Regression
 5.15. Summary . Questions . References
 5.15.2. Additional Questions
5.0. LEARNING OBJECTIVES . CHAPTER PREVIEW . KEY TERMS
The data for classification can be considered as consisting of
 a group label variable Y, which is 0 or 1 in the case of two groups, and
 explanatory variables X_{1}, X_{2}, . . ., X_{n} .
We classify an individual having values x into
the group indicated by Y=1 if P(Y=1|X=x) is
sufficiently large.
The vector X can contain both metric and nonmetric variables.
When it contains only metric variables,
and when the within-group distributions are multivariate normal,
there is a special method for classification, "Discriminant Analysis".
5.1. What are Discriminant Analysis and Logistic Regression?
Discriminant Analysis is best viewed as a special case of classification,
so we begin this Commentary with a discussion of Classification in general.
A linear classification function takes the form
Z_{ji} =
a_{j} + W_{j1} X_{1i} +
W_{j2} X_{2i} + . . .
+ W_{jn} X_{ni} .
This expression corrects that at the bottom of p. 244, in that it shows
that there is a different intercept and coefficients for each j = 1, 2, . . ., K,
where K is the number of groups. Here
X_{vi} = value of the vth variable for the ith individual
a_{j} = intercept in the jth classification function
W_{jv} = coefficient of the vth variable in the jth
classification function, and
Z_{ji} = score of Individual i on the jth classification function.
When the group covariance matrices differ, quadratic classification
functions are used. For n = 3 variables, for example, these would
take the form
Z_{ji} =
a_{j} + W_{j1} X_{1i} +
W_{j2} X_{2i} + W_{j3} X_{3i}
+ W_{j11} X_{1i}^{2}
+ W_{j22} X_{2i}^{2}
+ W_{j33} X_{3i}^{2}
+ W_{j12} X_{1i} X_{2i}
+ W_{j13} X_{1i} X_{3i}
+ W_{j23} X_{2i} X_{3i} .
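As a sketch, the linear and quadratic classification functions above can be written directly in Python. All coefficient values in the example below are made up for illustration; they are not from the text.

```python
# Sketch of linear and quadratic classification functions for K groups.
# All coefficient values used with these functions are hypothetical.

def linear_score(a, w, x):
    """Z = a + W1*x1 + ... + Wn*xn for one group's classification function."""
    return a + sum(wv * xv for wv, xv in zip(w, x))

def quadratic_score(a, w, W2, x):
    """Adds squared and cross-product terms:
    Z = a + w'x + sum over u <= v of W2[u][v]*x_u*x_v."""
    z = linear_score(a, w, x)
    n = len(x)
    for u in range(n):
        for v in range(u, n):
            z += W2[u][v] * x[u] * x[v]
    return z

def classify(scores):
    """Assign the individual to the group with the largest score."""
    return max(range(len(scores)), key=lambda j: scores[j])
```

An individual is scored on every group's function and assigned to the group with the largest score, e.g. `classify([linear_score(a0, w0, x), linear_score(a1, w1, x)])`.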
5.2. Analogy with Regression and MANOVA
In regression, a numerical dependent variable Y is regressed on
several explanatory variables.
In DA, the dependent variable Y is categorical.
If ANOVA or MANOVA is written in terms of regression,
the explanatory variables are categorical.
5.3. Hypothetical Example of Discriminant Analysis for Two Groups
5.3.1. A Two-Group Discriminant Analysis: Purchasers versus Nonpurchasers
Here is output from the discriminant analysis.
Note that "Buy?" is the binary dependent
variable, indicating the group.
Note also that Style wasn't included in the MDA, since
its t was not significant (N.S.).
MTB > info
COLUMN NAME COUNT
C1 Case 10
C2 Durablty 10
C3 Perform 10
C4 Style 10
C5 Buy? 10
MTB > DISCriminant analysis for labels in C5, data in C2-C3
TABLE. Classification Functions
Group:        0          1
--------------------------
Constant   -6.170    -25.619
Durablty    1.823      4.309
Perform     1.479      2.466
Q. What are the values of the two classification functions
for an individual giving the food mixer a Durability rating of 4 and
a Performance rating of 6?
A. For j = 0,1, let C(j|d,p) denote the value of the classification
function for Group j, given Durability = d and Performance = p.
For Group 0: C(0|4,6) = -6.170 + 1.823(4) + 1.479(6) = +9.996
For Group 1: C(1|4,6) = -25.619 + 4.309(4) + 2.466(6) = +6.413
Q. Classify this individual.
A. Since C(0|4,6) > C(1|4,6), we classify this individual as 0:
that is, we predict that this individual will not buy.
Q. What is this person's posterior probability of membership in the 'Buy' group?
A. For j = 0,1, let p(j|d,p) denote the posterior probability of
membership in Group j, given that Durability = d and Performance = p.
It can be shown that the classification functions are equal to the
log posterior probabilities, except for a constant which doesn't
depend on the group. That is,
C(j|d,p) = ln[p(j|d,p)] + k.
This gives
p(j|d,p) = k' exp[C(j|d,p)].
p(0|4,6) = k' exp(+9.996)
p(1|4,6) = k' exp(+6.413)
The sum of the two is 1.
Hence k' = 1/[exp(+9.996) + exp(+6.413)].
p(1|4,6) = exp(+6.413)/[exp(+6.413) + exp(+9.996)]
= 1/[1 + exp(9.996 - 6.413)] = 1/[1 + exp(+3.583)] = .0270 .
Also, p(0|4,6) = .9730.
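The step from classification scores to posterior probabilities can be checked numerically. A minimal sketch, using only the two scores computed above (9.996 and 6.413); the `posteriors` helper is hypothetical, not from any package:

```python
import math

def posteriors(scores):
    """Posterior probabilities from classification scores: p_j proportional
    to exp(C_j).  Subtracting the maximum score first avoids overflow."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]

# Scores from the food-mixer example: C(0|4,6) and C(1|4,6).
p0, p1 = posteriors([9.996, 6.413])
```

The normalizing constant k' cancels out, so only score differences matter, exactly as in the hand calculation above.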
5.3.2. A Geometric Representation of the Discriminant Function
5.4. The Decision Process for Discriminant Analysis
5.5. Stage 1: Objectives of Discriminant Analysis
5.6. Stage 2: Research Design for Discriminant Analysis
5.7. Stage 3: Assumptions of Discriminant Analysis
Strictly speaking, DA requires joint normal distributions within classes.
If the covariance matrices are equal, linear classification functions result. (See "Discriminant Analysis: Theory" below.) If the covariance matrices are not equal, quadratic classification functions result. A "quadratic" subcommand or option is included in the discriminant analysis command of the various statistical packages.
5.8. Stage 4: Estimation of the Discriminant Model and Assessing Overall Fit
5.8.1. Computational Method
5.8.2. Statistical Significance
5.8.3. Assessing Overall Fit
Specifying Probabilities of Classification
DISCRIMINANT ANALYSIS: Theory
The posterior probability classification rule
is to classify an individual having pattern x
into Group 1 if p(1|x) is larger than p(0|x).
(This is equivalent to the condition p(1|x) > 1/2.)
Let P(Y=j | X=x) = p(j|x), for j = 0, 1.
Then all of the following are equivalent.
p(1|x) > p(0|x)
p(1)f(x|1)/f(x) > p(0)f(x|0)/f(x)
p(1)f(x|1) > p(0)f(x|0)
Note that this amounts to weighting the probability densities by the
prior probabilities.
When the class-conditional (within-group) probability density functions f(x|0) and
f(x|1) are multinormal
(see notes on
related mathematics) with equal covariance matrices,
this comparison reduces further to a comparison of the so-called "classification functions".
These are, except for an additive constant,
the log posterior probabilities.
That is,
C(j|x) = ln p(j|x) + k, j = 0,1,
where the constant k
is the same for both groups (j = 0 and 1).
The functions C(j|x) are linear functions of x:
C(j|x) = w_{0j} + w_{j}^{T}x .
The vector w_{j}^{T} is µ_{j}^{T}C^{-1}, where C is the common within-group covariance matrix. The parameters are estimated from data in the usual way. If the true value of w_{0} is the same as that of
w_{1}, then the variables are of no use in discriminating between groups.
The test for this is the test of equality of mean vectors in the two groups.
Remember how in connection with Chapter 1 it was stated that one of the uses of multivariate analysis was to search for differences in every direction, that is, for every linear combination.
The present situation is a case in point.
A test for equality of the group centroids can be developed by doing the two-sample t-test
for every linear combination y = a'x.
It turns out that the most significant linear combination is
that given by the vector
w_{1} - w_{0}.
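Under the equal-covariance assumption, the coefficient vectors w_{j} and the discriminant direction w_{1} - w_{0} can be estimated directly from data. A sketch with made-up two-group data (the means, sample sizes, and random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up two-group data with n = 2 variables, for illustration only.
X0 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # Group 0
X1 = rng.normal([2.0, 1.0], 1.0, size=(60, 2))   # Group 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)        # group centroids

# Pooled (common) within-group covariance matrix C.
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)
n0, n1 = len(X0), len(X1)
C = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)

# Coefficient vectors of the linear classification functions: w_j = C^{-1} mu_j.
Cinv = np.linalg.inv(C)
w0, w1 = Cinv @ m0, Cinv @ m1

# The most significant linear combination is the direction w1 - w0.
direction = w1 - w0
```

With both group means shifted in the positive direction here, both components of `direction` come out positive, mirroring the mean difference scaled by C^{-1}.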
We remarked above that the posterior-probability classification
rule is to "Say 1" for x such that p(1|x) is greater than 1/2.
When the costs of misclassification are unequal, the rule used is a
minimum-expected-cost rule.
It takes the form: Say "1" iff p(1|x) > c, where c depends upon the costs of misclassification and so is not necessarily 1/2.
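The cutoff c can be made explicit. In one standard form (the text does not give it numerically), c = (cost of a false positive) / (sum of the two misclassification costs); a sketch with hypothetical cost names:

```python
def cost_cutoff(cost_false_positive, cost_false_negative):
    """Cutoff c for the minimum-expected-cost rule: say "1" iff p(1|x) > c.
    cost_false_positive = cost of saying "1" when the truth is 0;
    cost_false_negative = cost of saying "0" when the truth is 1.
    Saying "1" has expected cost cost_fp * p(0|x); saying "0" has expected
    cost cost_fn * p(1|x); comparing the two gives the cutoff below."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def classify_with_costs(p1, cost_fp, cost_fn):
    """Apply the minimum-expected-cost rule to a posterior probability p1."""
    return 1 if p1 > cost_cutoff(cost_fp, cost_fn) else 0
```

Equal costs recover the familiar 1/2 cutoff; when a false negative is three times as costly as a false positive, the cutoff drops to 1/4.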
5.9. Stage 5: Interpretation of the Results
5.10. Stage 6: Validation of the Results
5.11. Logistic Regression:
Regression with a Binary Dependent Variable
Let P_{x} = P(Y=1|X=x) and Q_{x} = P(Y=0|X=x). When the distribution of X for Y = 1 is multinormal with mean µ_{1} and the distribution of X for Y = 0 is multinormal with mean µ_{0}, and the two covariance matrices are equal, then
ln (P_{x}/Q_{x}) is linear in x .
This suggests modeling
the binary dependent variable Y by taking ln P_{x}/Q_{x} to be linear in x,
even when the conditional distributions of X given Y = 0 and 1 are not multinormal.
The model
ln P_{x}/Q_{x} =
b_{0} + b_{1}^{T}x
is called the logistic regression model.
The function of P (0 < P < 1) defined by ln[P/(1-P)]
is called the logit of P.
Note that if the covariance matrices in the multinormal model were unequal,
there would be quadratic terms.
That is, the form would be
ln (P_{x}/Q_{x}) = a + b'x + x'Mx.
More generally the model can be
ln (P_{x}/Q_{x}) =
a + b'g(x).
That is, other functions of the elements of x can be allowed.
Function           Name                             Range
P_{x}              prob. that Y=1, given that X=x   0 to 1
P_{x}/Q_{x}        Odds                             0 to infinity
ln(P_{x}/Q_{x})    log Odds, or "logit"             -infinity to +infinity
A little algebra shows that if logit(P) = z,
then P = e^{z}/(1+e^{z}),
or equivalently 1/(1+e^{-z}) .
This function is called the logistic function.
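The logit and logistic functions are inverses of one another, which is easy to verify numerically; a minimal sketch:

```python
import math

def logit(p):
    """Log odds: ln[p/(1-p)], defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: e^z/(1+e^z) = 1/(1+e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))
```

For example, logistic(logit(p)) returns p for any p strictly between 0 and 1.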
Estimation
If there is more than one observation at each value x, weighted least squares estimation can be used.
Example. Testing strength of wires. P = P_{x} = probability of breaking at weight x. Odds = P/Q, where Q = 1-P.
Weight applied (lbs.): 10 20 30 40 50
number of wires breaking: 4 8 18 76 90
number of wires tested
at this weight: 100 100 100 100 100
p, estimate of P: .04 .08 .18 .76 .90
q, estimate of Q: .96 .92 .82 .24 .10
Odds, p/q: .0417 .0870 .2195 3.167 9.000
Logit = ln(Odds): -3.18 -2.44 -1.52 +1.15 +2.197
The logit is regressed on x,
except that weighted regression is used.
The weight is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(NPQ),
so the weight is NPQ, which is estimated by Npq. The median breaking
strength can be estimated as the value of x for which p_{x}
= 1/2, that is, logit(p_{x}) = 0.
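The weighted least squares fit just described can be carried out directly on the wire data above; a sketch in Python (the fitted slope and median below are computed from the tabled values, not quoted from the text):

```python
import math

# Wire-testing data from the table above.
x = [10, 20, 30, 40, 50]          # weight applied (lbs.)
breaks = [4, 8, 18, 76, 90]       # number of wires breaking
N = 100                           # wires tested at each weight

p = [b / N for b in breaks]
y = [math.log(pi / (1 - pi)) for pi in p]   # empirical logits
w = [N * pi * (1 - pi) for pi in p]         # weights Npq, est. of 1/Var(logit)

# Weighted simple linear regression of the logit on x.
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
b = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y)) / \
    sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
a = ybar - b * xbar

# Estimated median breaking strength: the x at which the fitted logit is 0.
median_strength = -a / b
```

Since the observed logit crosses zero between x = 30 and x = 40, the estimated median breaking strength falls in that interval.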
Maximum likelihood estimation is used when there are not repeated observations at each pattern of x. The likelihood L is
maximized. The higher the maximized value of L, the better the fit of the
model. This is assessed on a log scale by computing 2 log L, called 2LL . (This criterion corresponds to residual mean square in normal
multiple regression models.) When there are several explanatory variables, different models can be assessed using 2LL as a figureofmerit.
If you have a sample y_{1}, y_{2}, . . ., y_{N} of 0,1 variables with success probability P, then the likelihood can be written
P^{y1}Q^{1-y1} x P^{y2}Q^{1-y2} x . . . x P^{yN}Q^{1-yN}.
The maximum likelihood estimator is the value of P which maximizes this;
it turns out to be simply
(y_{1} + y_{2} + . . . + y_{N})/N .
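The claim that the sample mean maximizes this likelihood can be checked with a coarse grid search over P; a sketch with a made-up 0/1 sample:

```python
import math

def log_likelihood(P, ys):
    """Bernoulli log likelihood: sum over i of y_i ln(P) + (1 - y_i) ln(1-P)."""
    return sum(y * math.log(P) + (1 - y) * math.log(1 - P) for y in ys)

ys = [1, 0, 1, 1, 0, 1, 0, 1]     # made-up sample of 0,1 values; mean = 5/8

# Maximize the log likelihood over a grid of P values in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
P_hat = max(grid, key=lambda P: log_likelihood(P, ys))
```

The grid maximizer lands exactly on the sample proportion, 5/8 = .625, as the formula above says it should.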
In the logistic regression model, P_{x} = 1/[1+exp(-B^{T}x)], where B is the
vector of logistic regression parameters, to be estimated. This is done
by maximizing the likelihood by numerical methods.
Testing is based on -2LL, where LL is the natural log of the maximized likelihood. It rests on the fact that (-2LL)_{reduced} - (-2LL)_{full} is, for large N, distributed approximately as chi-square with n_{full} - n_{reduced} d.f.
5.12. A TwoGroup Illustrative Example
The example in this section could as well be done by MANOVA,
since the focus is on
finding which variables account for the Buy/Not Buy split. A Credit Scoring example is more
interesting, in that interest focuses on the variate which gives the credit score.
5.13. A ThreeGroup Illustrative Example
5.14. An Illustrative Example of Logistic Regression
The Food Mixer Example again
Refer again to the data of Table 5.1. The three
explanatory variables are Performance, Durability and Style. If Gender, a categorical variable, were included
among the explanatory variables, the whole analysis would change. You couldn't use MDA; you would
have to use Logistic Regression, because Gender is a nonmetric variable.
The Credit Data
We consider also the Credit Data, on 113 applicants for charge accounts at a department store
in a large city in the northeastern U.S. The variables include gender and marital status, which are
categorical and hence require Logistic Regression rather than MDA.
HATCO Data
The book considers the HATCO data, with X11 (spec buying vs. total value analysis) as the binary dependent variable. The explanatory variables are X1-X7 in the HATCO dataset, the seven perceptions-of-HATCO variables. These variables are metric, and MDA could be used (as was illustrated earlier in the chapter). Again, this example is not compelling from the point of view
of classification; one could as easily study the effects of X1 through X7 with MANOVA.
Exercises on Logistic Regression
Longevity of Restaurants
Suppose a logistic regression analysis of some data for restaurants gave the following result.
L(x_{1},x_{2}) = .3 - .2 x_{1} + .1 x_{2},
where
L(x_{1},x_{2}) =
ln{P(Y=1|X_{1}=x_{1},X_{2}=x_{2}) / P(Y=0|X_{1}=x_{1},X_{2}=x_{2})},
Y = 1 denotes bankruptcy within three years of startup,
Y = 0 denotes staying in business more than three years,
X_{1} = 1 if franchised, 0 if not,
X_{2} = 1 if a fast-food restaurant, 0 if not.
1. What is the value of L(0,1) ?
(A) .1 (B) .2 (C) .3 (D) .4
(E) .5
2. The number P(Y=1|X_{1}=0,X_{2}=1) is the
probability that
(A) a franchised fast-food restaurant is bankrupt within three years.
(B) a franchised non-fast-food restaurant is not bankrupt within three years.
(C) a non-franchised fast-food restaurant is bankrupt within three years.
(D) a non-franchised non-fast-food restaurant is bankrupt within three years.
(E) a fast-food restaurant is franchised.
3. (continuation) What is the value of this probability ?
(A) .401 (B) .450 (C) .550 (D) .599 (E) .67
Logit Function
4. If P = 1/2, what is the value of the logit function, ln(P/Q),
where Q = 1P ?
(A) -1 (B) -1/2 (C) 0 (D) 1/2 (E) 1
Exponential and Logarithmic Function
5. For x > 0, e^{ln(x)} = ?
(A) e^{x} (B) ln(x) (C) x (D) 0 (E) 1
6. ln(e^{x}) = ?
(A) e^{x} (B) ln(x) (C) x (D) 0 (E) 1
7. ln(2e) =
(A) 1/2 (B) 1 (C) 2 (D) 3 (E) 1 + ln(2)
8. Which of the following is closest to the value of the number 1/e ?
(A) 1/3 (B) 0.368 (C) 1/2 (D) 3 (E) 3.14159
Logistic Function
9. If z = 0, what is the value of the logistic function, 1/(1 + e^{-z}) ?
(A) -1 (B) -1/2 (C) 0 (D) 1/2 (E) 1
5.15. Summary . Questions . References
5.15.2. Additional Questions
These notes Copyright © 2005 Stanley Louis Sclove
Created 1998 Oct 19. Updated 2005 Sept 28.