University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences
IDS 472 Statistics for IS and Data Mining
Instructor: Prof. Stanley L. Sclove
Logistic Regression
Including Qualitative Variables in Classification Problems: Logistic Regression
A binary variable Y = 0 or 1 can be used to indicate membership in one of two groups. Explanatory variables may explain or predict this membership. When one or more of the explanatory variables is non-numerical, discriminant analysis is inappropriate. Logistic regression can be used with any combination of explanatory variables: numerical, categorical, or some of each.
Example. There were 113 applicants for charge accounts at a department store in a large city in the northeastern U.S. The variables include gender and marital status, which are categorical and hence require Logistic Regression rather than Discriminant Analysis.
Let the two groups be indicated by the binary variable Y, which is equal to 0 or 1. The probability of being in one group or the other depends on the values of some variables in the vector X. Let Px = P(Y=1|X=x) and Qx = 1 - Px = P(Y=0|X=x). When the distribution of X for Y = 1 is multinormal (multivariate normal) with mean µ1, the distribution of X for Y = 0 is multinormal with mean µ0, and the two covariance matrices are equal, then ln(Px/Qx) is linear in x. This suggests modeling the binary dependent variable Y by taking ln(Px/Qx) to be linear in x, even when the conditional distributions of X given Y = 0 and Y = 1 are not multinormal. The model
ln(Px/Qx) = b0 + b1'x
is called the logistic regression model.
Function    | Name                            | Range
Px          | prob. that Y=1, given that X=x  | 0 to 1
Px/Qx       | Odds                            | 0 to infinity
ln(Px/Qx)   | log Odds, or "logit"            | negative infinity to positive infinity
In this model, what is the mathematical expression for P itself? A little algebra shows that if logit(P) = z, then P = e^z/(1 + e^z), or equivalently 1/(1 + e^(-z)). This function is called the logistic function.
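The relationship between the logit and the logistic function can be checked numerically. The following is a minimal sketch, not part of the original notes; the function names `logit` and `logistic` are chosen here for illustration, and the probabilities in the loop are arbitrary.

```python
import math

def logit(p):
    """Log odds ln(p/q) with q = 1 - p; defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit, 1/(1 + e^(-z)).

    The two algebraically equal forms are used on different sides of zero
    so that exp() is never called with a large positive argument.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Round trip: probability -> odds -> logit -> probability (arbitrary values).
for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p:.2f}  odds = {p / (1 - p):.4f}  "
          f"logit = {z:+.4f}  logistic(logit) = {logistic(z):.2f}")
```

Applying `logistic` to the value of `logit(p)` should return p, up to rounding, which is the "little algebra" in numerical form.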
If there is more than one observation at each value x, weighted least squares estimation can be used. The next example illustrates a use of logistic regression other than classification.
Example. It involves the breaking strength of wires. The logit is regressed on x, the weight (load) applied. Weighted regression is used. The regression weight is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(NPQ), so the regression weight is NPQ, which is estimated by Npq. The fitted regression equation is
estimated logit = -11.51 + 0.156 Weight.
The median breaking strength can be estimated as the value of x for which px = 1/2, that is, for which logit(px) = 0.
TABLE. Testing strength of wires. Px = probability of breaking at weight x. Odds = P/Q, where Q = 1-P; logit(P) = ln(Odds) = ln(P/Q).
____________________________________________________________
x, Wt (lbs.)           10      20      30      40      50
number breaking         4       8      18      76      90
number tested         100     100     100     100     100
p, estimate of P      .04     .08     .18     .76     .90
q, estimate of Q      .96     .92     .82     .24     .10
Odds, p/q           .0417   .0870   .2195   3.167   9.000
Logit               -3.18   -2.44   -1.52   +1.15  +2.197
____________________________________________________________
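The weighted regression described for this example can be sketched from the tabled counts. The following is an illustrative sketch, not the original computation: it assumes a straight-line fit of the observed logits on the load with regression weights Npq, uses the closed-form weighted least squares formulas, and introduces its own variable names. The coefficients it prints are this sketch's own output and are meant for comparison with the fitted equation quoted in the text.

```python
import math

# Data from the table: load in lbs, number breaking, number tested.
x = [10, 20, 30, 40, 50]
broke = [4, 8, 18, 76, 90]
tested = [100, 100, 100, 100, 100]

p = [b / n for b, n in zip(broke, tested)]              # estimates of P
y = [math.log(pi / (1 - pi)) for pi in p]               # observed logits
w = [n * pi * (1 - pi) for n, pi in zip(tested, p)]     # regression weights Npq

# Closed-form weighted least squares for the line y = intercept + slope * x.
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
slope = sxy / sxx
intercept = ybar - slope * xbar

# Median breaking strength: the load at which the fitted logit equals 0.
median_load = -intercept / slope

print(f"fitted logit = {intercept:.3f} + {slope:.4f} * load")
print(f"estimated median breaking strength = {median_load:.1f} lbs")
```

Because the fitted logit is zero where p = 1/2, the median estimate is simply minus the intercept divided by the slope.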
Maximum likelihood estimation is used when there are not repeated observations at each value or pattern of x. The likelihood L (developed below) is maximized. The higher the maximized value of L, the better the fit of the model. This is assessed on a log scale by computing -2 log L, called -2LL; a smaller value of -2LL indicates a better fit. (This criterion plays the role that the residual sum of squares, i.e., the sum of squared errors, plays in normal multiple regression models.) When there are several explanatory variables, different models can be assessed using -2LL as a figure of merit. A penalty term can be added which increases with the number of parameters used. In AIC (Akaike's information criterion) this penalty is 2 times the number of parameters. In Schwarz's criterion (denoted by SC, SIC or BIC, for Bayesian Information Criterion), the penalty is the natural log of n, that is, ln n, times the number of parameters.
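The two penalized criteria can be written as one-line functions. The sketch below is illustrative only: the -2LL values and parameter counts for the two candidate models are made up, n = 113 is borrowed from the charge-account example purely as a sample size, and whether the constant term is counted among the parameters is a convention the sketch leaves to the user.

```python
import math

def aic(minus2ll, k):
    """-2LL plus a penalty of 2 per parameter."""
    return minus2ll + 2 * k

def bic(minus2ll, k, n):
    """-2LL plus a penalty of ln(n) per parameter."""
    return minus2ll + math.log(n) * k

# Hypothetical candidates: name -> (-2LL, number of parameters).
candidates = {"model A": (120.0, 2), "model B": (115.5, 4)}
n = 113  # sample size, assumed for illustration

for name, (m2ll, k) in candidates.items():
    print(f"{name}: -2LL = {m2ll:.1f}  AIC = {aic(m2ll, k):.2f}  "
          f"BIC = {bic(m2ll, k, n):.2f}")

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n))
print("smallest AIC:", best_by_aic)
print("smallest BIC:", best_by_bic)
```

Since ln n exceeds 2 once n is 8 or more, BIC penalizes extra parameters more heavily than AIC in samples of that size, so the two criteria need not select the same model.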
Here is an explanation of maximum likelihood estimation in logistic regression models. We begin with the simpler case of a sample of Bernoulli variables. If you have a sample y1, y2, . . ., yn of 0,1 variables with success probability P, then the log likelihood can be written
y1 ln P + (1-y1) ln Q + y2 ln P + (1-y2) ln Q + . . . + yn ln P + (1-yn) ln Q .
The maximum likelihood estimator is the value of P which maximizes this; it turns out to be simply the sample proportion of 1's,
(y1 + y2 + . . . + yn)/n .
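The claim that the sample proportion maximizes the Bernoulli log likelihood can be checked by brute force. This is a small sketch with a made-up sample; it evaluates the log likelihood over a grid of candidate values of P (with Q = 1 - P) and compares the best grid point with the sample proportion.

```python
import math

def bernoulli_loglik(p, ys):
    """Sum of y*ln(P) + (1 - y)*ln(Q) over the sample, with Q = 1 - P."""
    q = 1.0 - p
    return sum(y * math.log(p) + (1 - y) * math.log(q) for y in ys)

ys = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]   # hypothetical 0/1 sample
p_hat = sum(ys) / len(ys)             # sample proportion of 1's

# Candidate values 0.001, 0.002, ..., 0.999 (the endpoints 0 and 1 are excluded
# because the log likelihood is not finite there for this sample).
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda p: bernoulli_loglik(p, ys))

print("sample proportion:", p_hat)
print("grid maximizer:   ", best)
```

A grid search is used only to make the maximization visible; setting the derivative of the log likelihood to zero gives the same answer analytically.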
In the logistic regression model,
Px = 1/[1 + exp(-B0 - B^T x)],
where B0 is the constant and B is the vector of logistic regression parameters, to be estimated. This is done by maximizing the likelihood by numerical methods.
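The notes do not say which numerical method is used. One standard choice is Newton-Raphson iteration, sketched below for a single explanatory variable with grouped (binomial) data. The function names, the starting values of zero, and the stopping rule are assumptions of this sketch; the data are the wire-strength counts from the table earlier in these notes, used only to have something to fit.

```python
import math

def loglik(b0, b1, x, successes, trials):
    """Binomial log likelihood, omitting the constant binomial coefficients."""
    total = 0.0
    for xi, yi, ni in zip(x, successes, trials):
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += yi * math.log(pi) + (ni - yi) * math.log(1.0 - pi)
    return total

def fit_logistic_newton(x, successes, trials, iterations=25, tol=1e-10):
    """Maximize the log likelihood over (b0, b1) by Newton-Raphson steps."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - ni * pi                  # observed minus expected count
            wi = ni * pi * (1.0 - pi)         # binomial variance N*P*Q
            g0 += r                           # gradient components
            g1 += r * xi
            h00 += wi                         # negative Hessian entries
            h01 += wi * xi
            h11 += wi * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det      # solve the 2x2 system H d = g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

x = [10, 20, 30, 40, 50]
broke = [4, 8, 18, 76, 90]
tested = [100, 100, 100, 100, 100]

b0, b1 = fit_logistic_newton(x, broke, tested)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
print(f"-2LL (up to a constant) = {-2 * loglik(b0, b1, x, broke, tested):.2f}")
```

Each step weights the observations by N P Q evaluated at the current estimates, which is why this iteration is also described as iteratively reweighted least squares.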
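Testing a reduced model against a full model is based on -2LL, where LL is the natural log of the maximized likelihood. It is based on the fact that, when the reduced model is adequate, (-2LL)reduced - (-2LL)full is for large n distributed approximately as chi-square with the number of d.f. equal to kfull - kreduced, where these k's are the numbers of explanatory variables in the two models. The difference is taken in this order because the full model can never fit worse than the reduced model, so its -2LL is the smaller of the two.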
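A sketch of this test follows. It takes the statistic as the reduced model's -2LL minus the full model's, so that it is nonnegative, and refers it to a chi-square distribution. The -2LL values and variable counts are made up, and `chi2_sf` is a small helper written for this sketch (using a standard recurrence for the upper tail of the chi-square distribution with integer degrees of freedom) rather than a library routine.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df d.f. > x), integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, total = 1.0, math.exp(-z)               # exact tail for df = 2
    else:
        a, total = 0.5, math.erfc(math.sqrt(z))    # exact tail for df = 1
    # Each pass raises the degrees of freedom by 2.
    while a < df / 2.0:
        total += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return total

def likelihood_ratio_test(m2ll_reduced, m2ll_full, k_reduced, k_full):
    """Return (statistic, degrees of freedom, approximate p-value)."""
    stat = m2ll_reduced - m2ll_full
    df = k_full - k_reduced
    return stat, df, chi2_sf(stat, df)

# Hypothetical fits: reduced model with 2 explanatory variables, full with 4.
stat, df, p_value = likelihood_ratio_test(140.2, 131.0, 2, 4)
print(f"chi-square = {stat:.2f} on {df} d.f., p = {p_value:.4f}")
```

A small p-value suggests that the extra variables in the full model improve the fit by more than chance alone would explain.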
Logistic regression implicitly uses prior probability estimates obtained from the sample. Thus the estimate of B0 will contain a term ln(p/q), where p and q are the proportions of cases from Group 1 and Group 0 in the sample (the logit models the probability that Y = 1). If the appropriate prior probabilities are instead p' and q', then b0 should be adjusted:
new b0 = b0 - ln(p/q) + ln(p'/q') .
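The intercept adjustment is a one-line computation. In this sketch the function name and the numbers are hypothetical, and p is taken to be the proportion for the group whose probability the logit models (Y = 1), with q = 1 - p; that reading of p and q is an assumption of the sketch.

```python
import math

def adjust_intercept(b0, p_sample, p_prior):
    """new b0 = b0 - ln(p/q) + ln(p'/q'), with q = 1 - p and q' = 1 - p'."""
    q_sample = 1.0 - p_sample
    q_prior = 1.0 - p_prior
    return b0 - math.log(p_sample / q_sample) + math.log(p_prior / q_prior)

# Hypothetical case: the sample was balanced (p = 0.5), but the group coded
# Y = 1 is believed to make up only 20% of the population (p' = 0.2).
b0 = 0.3
new_b0 = adjust_intercept(b0, p_sample=0.5, p_prior=0.2)
print(f"b0 = {b0:.3f}, adjusted b0 = {new_b0:.3f}")
```

Only the constant changes; the coefficients of the explanatory variables are left as estimated.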
___________________________________________________________________________________________
Beyond the Basics: An Alternative to the Logistic Regression Model
An alternative is Pregibon's family of curves
[Pg+(1- P)-g]/|g| for g not equal to 0,
ln P - ln(1- P) for g = 0.
Thus this family of transformations of P includes the logistic regression model as a special case. The other members of the family are not symmetric.
Suppose a logistic regression analysis of some data for restaurants gave the following result.
L(x1,x2) = .3 - .2 x1 + .1 x2,
where
L(x1,x2) = logit[P(Y=1|X1=x1,X2=x2)] = ln{P(Y=1|X1=x1,X2=x2)/P(Y=0|X1=x1,X2=x2)},
Y = 1 denotes bankruptcy within three years of startup,
Y = 0 denotes staying in business more than three years,
X1 = 1 if franchised, 0 if not,
X2 = 1 if a fast-food restaurant, 0 if not.
1. What is the value of L(0,1) ?
(A) .1   (B) .2   (C) .3   (D) .4   (E) .5
2. The number P(Y=1|X1=0,X2=1) is the probability that
(A) a franchised fast-food restaurant is bankrupt within three years.
(B) a franchised non-fast-food restaurant is not bankrupt within three years.
(C) a non-franchised fast-food restaurant is bankrupt within three years.
(D) a non-franchised non-fast-food restaurant is bankrupt within three years.
(E) a fast-food restaurant is franchised.
3. (continuation) What is the value of this probability ?
(A) .401   (B) .450   (C) .550   (D) .599   (E) .67
4. If P = 1/2, what is the value of the logit function, ln(P/Q), where Q = 1-P ?
(A) -1   (B) -1/2   (C) 0   (D) 1/2   (E) 1
5. For x > 0, e^(ln(x)) = ?
(A) e^x   (B) ln(x)   (C) x   (D) 0   (E) 1
6. ln(e^x) = ?
(A) e^x   (B) ln(x)   (C) x   (D) 0   (E) 1
7. ln(2e) = ?
(A) 1/2   (B) 1   (C) 2   (D) 3   (E) 1 + ln(2)
8. Which of the following is closest to the value of the number 1/e ?
(A) 1/3   (B) 0.368   (C) 1/2   (D) 3   (E) 3.14159
9. If z = 0, what is the value of the logistic function, 1/(1 + e^(-z)) ?
(A) -1   (B) -1/2   (C) 0   (D) 1/2   (E) 1
_____________________________________________________________________
Pregibon, Daryl (1985). "Link Tests." Encyclopedia of Statistical Sciences, 5, John Wiley & Sons, Inc., New York.
Bibliography
Allison, Paul. Logistic Regression Using the SAS System. There is an excellent discussion of the basics, along with treatment of many advanced topics such as ordered logit, discrete choice and repeated measures. Go to http://www.sas.com/pubs and find the book in the "books by users" section.
Friedman, Jerome H. (1997). Data mining and statistics: What's the connection? 29th Symposium on the Interface of Computer Science & Statistics, Houston, TX.