**University of Illinois at Chicago**

**College of Business Administration**

**Department of Information & Decision Sciences**

**IDS 472 Statistics for IS and Data Mining**

**Instructor: Prof. Stanley L. Sclove**

**Logistic Regression**

**Including Qualitative Variables in Classification Problems: Logistic Regression**

A binary variable Y = 0 or 1 can be used to indicate membership in one of two groups. Explanatory variables may explain/predict this membership. When one or more of the explanatory variables is non-numerical, discriminant analysis is inappropriate. Logistic regression can be used with any combination of explanatory variables: numerical, categorical, or some of each.

*Example.* There were 113 applicants for charge accounts at a department store in a large city in the northeastern U.S. The variables include gender and marital status, which are categorical and hence require logistic regression rather than discriminant analysis.

Let the two groups be indicated by the binary variable Y, which is equal to 0 or 1. The probability of being in one group or the other depends on the values of some variables in the vector **X**. Let P_{x} = P(Y=1 | **X** = **x**) and Q_{x} = 1 − P_{x}. The model assumes that the log odds

ln(P_{x}/Q_{x})

is linear in **x**. This suggests modeling the binary dependent variable Y by setting ln(P_{x}/Q_{x}) equal to a linear function of **x**. The model

ln(P_{x}/Q_{x}) = B_{0} + **B**^{T}**x**

is called the **logistic regression** model.

| Function | Name | Range |
|---|---|---|
| P_{x} | prob. that Y = 1, given that **X** = **x** | 0 to 1 |
| P_{x}/Q_{x} | odds | 0 to infinity |
| ln(P_{x}/Q_{x}) | log odds, or "logit" | negative infinity to positive infinity |

In this model, what is the mathematical expression for P itself? A little algebra shows that if logit(P) = z, then P = e^{z}/(1 + e^{z}), or 1/(1 + e^{-z}). This function is called the *logistic function*.
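As a quick check of this algebra, here is a small Python sketch (the function names are illustrative, not part of the notes) showing that the logistic function inverts the logit:

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def logistic(z):
    """Inverse of the logit: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

# Round trip: logistic(logit(p)) recovers p.
for p in (0.04, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12
```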

If
there is more than one observation at each value **x**,
weighted least squares estimation can be used. The next example illustrates a use of logistic regression other than classification.

*Example.* This example involves the breaking strength of wires. The logit is regressed on x, the weight applied. Weighted regression is used. The weight is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(NPQ); so the weight is NPQ, which is estimated by Npq. The fitted regression equation is

*^logit = −5.56 + 0.157 Weight*.

The median breaking strength can be estimated as the value of x for which p_{x} = 1/2, that is, for which logit(p_{x}) = 0; here that gives x = 5.56/0.157, or about 35.4 lbs.

*TABLE.* Testing strength of wires. P_{x} = probability of breaking at weight x. Odds = P/Q, where Q = 1 − P; logit(P) = ln(Odds) = ln(P/Q).

| x, Wt (lbs.) | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| number breaking | 4 | 8 | 18 | 76 | 90 |
| number tested | 100 | 100 | 100 | 100 | 100 |
| p, estimate of P | .04 | .08 | .18 | .76 | .90 |
| q, estimate of Q | .96 | .92 | .82 | .24 | .10 |
| Odds, p/q | .0417 | .0870 | .2195 | 3.167 | 9.000 |
| Logit, ln(p/q) | −3.18 | −2.44 | −1.52 | +1.15 | +2.20 |
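The table's odds and logits, and the weighted fit, can be reproduced with a short Python sketch (a check on the arithmetic from the table's own data, not the original computation):

```python
import math

x = [10, 20, 30, 40, 50]        # weight applied (lbs.)
breaks = [4, 8, 18, 76, 90]     # number breaking out of N tested
N = 100                         # wires tested at each weight

p = [b / N for b in breaks]                  # estimate of P_x
y = [math.log(pi / (1 - pi)) for pi in p]    # empirical logit
w = [N * pi * (1 - pi) for pi in p]          # weight Npq, approx. 1/Var(logit)

# Weighted least squares for logit = b0 + b1 * x.
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

median = -b0 / b1   # weight at which logit = 0, i.e., p = 1/2
```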

**Maximum likelihood estimation** is used when there are not repeated observations at each value or pattern of **x**. The likelihood L (developed below) is maximized. The higher the maximized value of L, the better the fit of the model. This is assessed on a log scale by computing −2 log L, called *−2LL*. (This criterion corresponds to the residual sum of squares in normal multiple regression models.) When there are several explanatory variables, different models can be assessed using *−2LL* as a figure of merit. A penalty term can be added which increases with the number of parameters used. In AIC (Akaike's information criterion) this penalty is 2 times the number of parameters. In Schwarz's criterion (denoted by SC, SIC, or BIC, for Bayesian information criterion), the penalty is the natural log of *n*, that is, *ln n*, times the number of parameters.
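The two penalized criteria can be sketched in a few lines of Python (the −2LL values below are made up for illustration):

```python
import math

def aic(minus2ll, k):
    """Akaike's criterion: -2 log L plus a penalty of 2 per parameter."""
    return minus2ll + 2 * k

def bic(minus2ll, k, n):
    """Schwarz's criterion: penalty of ln(n) per parameter."""
    return minus2ll + math.log(n) * k

# A model with one extra parameter must lower -2LL by more than the
# penalty to be preferred.  (Illustrative numbers.)
m1 = aic(210.0, 3)
m2 = aic(207.0, 4)   # smaller AIC -> preferred
```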

Here is an explanation of maximum likelihood estimation in logistic regression models. We begin with the simpler case of a sample of Bernoulli variables. If you have a sample

y_{1}, y_{2}, . . . , y_{n}

of 0,1 variables with success probability P, then the log likelihood can be written

y_{1} ln P + (1 − y_{1}) ln Q + y_{2} ln P + (1 − y_{2}) ln Q + . . . + y_{n} ln P + (1 − y_{n}) ln Q.

The maximum likelihood estimator is the value of P which maximizes this; it turns out to be simply the sample proportion of 1's,

(y_{1} + y_{2} + . . . + y_{n})/n.
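A small numerical check of this fact (the sample below is illustrative): the Bernoulli log likelihood, evaluated over a grid of values of P, is largest at the sample proportion.

```python
import math

def bernoulli_loglik(p_val, y):
    """Sum of y_i ln P + (1 - y_i) ln Q over the sample, with Q = 1 - P."""
    q_val = 1 - p_val
    return sum(yi * math.log(p_val) + (1 - yi) * math.log(q_val) for yi in y)

y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # illustrative 0-1 sample
p_hat = sum(y) / len(y)              # sample proportion = 6/10

# The log likelihood over a grid of candidate P's peaks at p_hat.
grid = [k / 100 for k in range(1, 100)]
best = max(grid, key=lambda p_val: bernoulli_loglik(p_val, y))
```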

In the logistic regression model,

*P_{x} = 1/[1 + exp(−B_{0} − **B**^{T}**x**)],*

where *B_{0}* is the constant term and **B** is the vector of coefficients of the explanatory variables.

Testing a reduced model against a full model is based on −2LL, where LL is the natural log of the maximized likelihood. It is based on the fact that (−2LL)_{reduced} − (−2LL)_{full} is, for large *n*, distributed approximately as chi-square with the number of d.f. equal to k_{full} − k_{reduced}, where these *k*'s are the numbers of explanatory variables in the two models.
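A sketch of the comparison in Python (the −2LL values are hypothetical; 3.84 is the 5%-level chi-square critical value for 1 d.f.):

```python
# Hypothetical -2LL values from fitting two nested models.
minus2ll_reduced = 150.2   # model with k = 2 explanatory variables
minus2ll_full = 144.9      # model with k = 3 explanatory variables

chi_sq = minus2ll_reduced - minus2ll_full   # approx. chi-square statistic
df = 3 - 2                                  # k_full - k_reduced

# Compare with the 5% critical value for chi-square with 1 d.f.
reject_reduced = chi_sq > 3.84
```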

Logistic regression implicitly uses prior probability estimates obtained from the sample. Thus the estimate of B_{0} will contain a term *ln(p/q)*, where *p* and *q* are the proportions of cases from Group 1 and Group 0 in the sample. If the appropriate prior probabilities are instead p' and q', then b_{0} should be adjusted:

*new b_{0} = b_{0} − ln(p/q) + ln(p'/q').*
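The adjustment is a one-line computation; a Python sketch (function name and numbers are illustrative):

```python
import math

def adjust_intercept(b0, p_sample, p_prior):
    """new b0 = b0 - ln(p/q) + ln(p'/q'), with q = 1 - p in each case.

    Replaces the sample-based log prior odds in the intercept with the
    log odds from the appropriate prior probabilities.
    """
    return (b0
            - math.log(p_sample / (1 - p_sample))
            + math.log(p_prior / (1 - p_prior)))

# E.g., half the sample was from Group 1, but the true prior rate is 10%.
b0_new = adjust_intercept(0.25, 0.5, 0.10)
```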

___________________________________________________________________________________________

**Beyond the Basics: An Alternative to the Logistic Regression Model**

An alternative is Pregibon's family of curves

*[P^{g}(1 − P)^{−g} − 1]/g* for *g* ≠ 0,

*ln P − ln(1 − P)* for *g* = 0.

Thus this family of transformations of *P* includes the logistic regression model as a special case. The other members of the family are not symmetric.
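A numerical check of the limiting case, in Python (note that P^{g}(1 − P)^{−g} is just the odds P/Q raised to the power g): as g approaches 0 the transformation approaches the logit.

```python
import math

def pregibon(p_val, g):
    """Pregibon-family transformation of a probability p_val."""
    if g == 0:
        return math.log(p_val) - math.log(1 - p_val)   # the logit
    odds = p_val / (1 - p_val)
    return (odds ** g - 1) / g

# For small g the curve is numerically close to the logit (g = 0 case).
for p_val in (0.2, 0.5, 0.8):
    assert abs(pregibon(p_val, 1e-7) - pregibon(p_val, 0)) < 1e-6
```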

Suppose a logistic regression analysis of some data for restaurants gave the following result:

L(x_{1}, x_{2}) = .3 − .2 x_{1} + .1 x_{2},

where

L(x_{1}, x_{2}) = logit[P(Y=1 | X_{1}=x_{1}, X_{2}=x_{2})] = ln{P(Y=1 | X_{1}=x_{1}, X_{2}=x_{2}) / P(Y=0 | X_{1}=x_{1}, X_{2}=x_{2})},

Y = 1 denotes bankruptcy
within three years of startup,

Y = 0 denotes staying in
business more than three years,

X_{1} = 1 if
franchised, 0 if not,

X_{2} = 1 if a
fast-food restaurant, 0 if not.
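The fitted logit can be converted to a probability with the logistic function. As one worked case (a sketch; the question items below cover other cases), consider a franchised fast-food restaurant, x_{1} = 1 and x_{2} = 1:

```python
import math

def fitted_logit(x1, x2):
    """The fitted equation L(x1, x2) = .3 - .2 x1 + .1 x2."""
    return 0.3 - 0.2 * x1 + 0.1 * x2

def prob_bankrupt(x1, x2):
    """P(Y=1 | x1, x2), via the logistic function applied to the logit."""
    z = fitted_logit(x1, x2)
    return 1 / (1 + math.exp(-z))

p11 = prob_bankrupt(1, 1)   # franchised fast-food restaurant
```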

**1.** What is the value of L(0,1)?

(A) .1  (B) .2  (C) .3  (D) .4  (E) .5

**2.** The number P(Y=1 | X_{1}=0, X_{2}=1) is the probability that

(A) a franchised fast-food restaurant is bankrupt within three years.

(B) a franchised non-fast-food restaurant is not bankrupt within three years.

(C) a non-franchised fast-food restaurant is bankrupt within three years.

(D) a non-franchised non-fast-food restaurant is bankrupt within three years.

(E) a fast-food restaurant is franchised.

**3.** (continuation) What is the value of this probability?

(A) .401  (B) .450  (C) .550  (D) .599  (E) .67

**4.** If P = 1/2, what is the value of the logit function, ln(P/Q), where Q = 1 − P?

(A) −1  (B) −1/2  (C) 0  (D) 1/2  (E) 1

**5.** For x > 0, e^{ln(x)} = ?

(A) e^{x}  (B) ln(x)  (C) x  (D) 0  (E) 1

**6.** ln(e^{x}) = ?

(A) e^{x}  (B) ln(x)  (C) x  (D) 0  (E) 1

**7.** ln(2e) = ?

(A) 1/2  (B) 1  (C) 2  (D) 3  (E) 1 + ln(2)

**8.** Which of the following is closest to the value of the number 1/e?

(A) 1/3  (B) 0.368  (C) 1/2  (D) 3  (E) 3.14159

**9.** If z = 0, what is the value of the logistic function, 1/(1 + e^{-z})?

(A) −1  (B) −1/2  (C) 0  (D) 1/2  (E) 1

_____________________________________________________________________

Pregibon, Daryl (1985). "Link Tests." *Encyclopedia of Statistical Sciences, 5*. John Wiley & Sons, Inc., New York.

**Bibliography**

Allison, Paul. *Logistic Regression Using the SAS System.* There is an excellent discussion of the basics, along with treatment of many advanced topics such as ordered logit, discrete choice, and repeated measures.

Go to http://www.sas.com/pubs and find the book in the "books by users" section.

Friedman, Jerome H.
(1997). Data mining and statistics: What's the connection? *29th
Symposium on the Interface of Computer Science & Statistics*, Houston,
TX.

Created: 19 October 1998 Updated: 14 April 2001