University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences 

IDS 472               Statistics for IS and Data Mining
Instructor             Prof. Stanley L. Sclove

Logistic Regression


Including Qualitative Variables in Classification Problems:  Logistic Regression

A binary variable Y = 0 or 1 can be used to indicate membership in one of two groups.  Explanatory variables may explain/predict this membership.  When one or more of the explanatory variables is non-numerical,  discriminant analysis is inappropriate.  Logistic regression can be used with any combination of explanatory variables, numerical, categorical, or some of each. 

Example.   There were 113 applicants for charge accounts at a department store in a large city in the northeastern U.S.   The variables include gender and marital status, which are categorical and hence require Logistic Regression rather than Discriminant Analysis.

Logistic Regression:  Regression with a Binary Dependent Variable

Let the two groups be indicated by the binary variable Y, which is equal to 0 or 1. The probability of being in one group or the other depends on the values of some variables in the vector X. Let Px = P(Y=1|X=x) and Qx = 1 - Px = P(Y=0|X=x). When the distribution of X for Y = 1 is multinormal (multivariate normal) with mean vector μ1, the distribution of X for Y = 0 is multinormal with mean vector μ0, and the two covariance matrices are equal, then

ln (Px/Qx)

is linear in x.   This suggests modeling the binary dependent variable Y by taking ln(Px/Qx) to be linear in x, even when the conditional distributions of X given Y = 0 and Y = 1 are not multinormal.   The model

ln(Px/Qx) = b0 + b1'x

is called the logistic regression model.






[Diagram: Px, the probability that Y = 1 given that X = x, ranges from 0 to 1;
the odds, Px/Qx, range from 0 to infinity;
the log odds, or "logit," ln(Px/Qx), ranges from negative infinity to positive infinity.]


In this model, what is the mathematical expression for  P  itself ?  A little algebra shows that if logit(P) = z, then P = e^z/(1 + e^z), or 1/(1 + e^(-z)).   This function is called the logistic function.
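As a quick check, the logit and logistic functions are inverses of one another; a minimal sketch in Python:

```python
import math

def logistic(z):
    """Logistic function: maps the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds: maps (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

# The two functions undo each other.
for p in (0.1, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12

print(logistic(0.0))   # 0.5: even odds correspond to z = 0
```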


If there is more than one observation at each value  x,  weighted least squares estimation can be used.  The next example illustrates a use of logistic regression other than classification.


Example.  This example involves the breaking strength of wires.  The logit is regressed on x, the weight applied, using weighted regression.  The weight for each observation is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(NPQ); the weight is therefore NPQ, which is estimated by Npq.   The fitted regression equation is


estimated logit =   -11.51 + 0.156 Weight.

The median breaking strength can be estimated as the value of  x  for which Px = 1/2, that is, for which logit(Px) = 0.


TABLE.    Testing strength of wires.   Px = probability of breaking at weight x.   Odds = P/Q, where Q = 1 - P; logit(P) = ln(Odds) = ln(P/Q).

x, Wt. (lbs.)         10     20     30      40     50

number breaking        4      8     18      76     90

number tested        100    100    100     100    100

p, estimate of P     .04    .08    .18     .76    .90

q, estimate of Q     .96    .92    .82     .24    .10

Odds, p/q          .0417  .0870  .2195   3.167  9.000

logit, ln(p/q)     -3.18  -2.44  -1.52   +1.15  +2.197
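The weighted least-squares computation described above can be sketched in Python from the tabled data. (The coefficients it produces depend on exactly how the logits and weights are rounded, so they may not agree exactly with the equation printed above.)

```python
import math

# Wire-strength data from the table above
x = [10, 20, 30, 40, 50]              # weight applied (lbs.)
broke = [4, 8, 18, 76, 90]            # number breaking out of N = 100 tests
N = 100

p = [b / N for b in broke]                      # estimates of P
y = [math.log(pi / (1 - pi)) for pi in p]       # sample logits
w = [N * pi * (1 - pi) for pi in p]             # weights Npq, approx. 1/Var(logit)

# Weighted least squares for logit = b0 + b1 * x
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

median = -b0 / b1        # weight at which logit = 0, i.e. P = 1/2
print(b0, b1, median)
```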


Maximum likelihood estimation  is used when there are not repeated observations at each value or pattern of x.   The likelihood  L  (developed below) is maximized.   The higher the maximized value of L, the better the fit of the model.  Fit is assessed on a log scale by computing  -2 log L,  called  -2LL.  (This criterion corresponds to the residual sum of squares in normal multiple regression models.)   When there are several explanatory variables, different models can be compared using  -2LL  as a figure of merit.  A penalty term can be added which increases with the number of parameters used.  In AIC (Akaike's information criterion) this penalty is 2 times the number of parameters.   In Schwarz's criterion (denoted SC, SIC, or BIC, for Bayesian Information Criterion), the penalty is the natural log of n, that is, ln n, times the number of parameters.
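The two penalized criteria are simple functions of -2LL; a sketch, where the -2LL values and parameter counts are made up for illustration:

```python
import math

def aic(neg2LL, k):
    """Akaike's criterion: -2 log L plus 2 per parameter."""
    return neg2LL + 2 * k

def bic(neg2LL, k, n):
    """Schwarz's criterion: -2 log L plus ln(n) per parameter."""
    return neg2LL + k * math.log(n)

# Hypothetical comparison: a 3-parameter vs. a 5-parameter model, n = 113 cases
print(aic(130.0, 3), bic(130.0, 3, 113))
print(aic(124.0, 5), bic(124.0, 5, 113))
```

Since ln(113) > 2, BIC penalizes extra parameters more heavily than AIC here.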

Here is an explanation of maximum likelihood estimation in logistic regression models.  We begin with the simpler case of a sample of Bernoulli variables.   If you have a sample

y1, y2, . . ., yn

of 0,1 variables with success probability P, then the log likelihood can be written

y1 ln P + (1 - y1) ln Q + y2 ln P + (1 - y2) ln Q + . . . + yn ln P + (1 - yn) ln Q .

The maximum likelihood estimator is the value of  P  which maximizes this;  it turns out to be simply the sample proportion of 1's, 

(y1 + y2 + . . . + yn)/n .
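A numerical check of this fact, using a small made-up 0/1 sample and a grid search over P:

```python
import math

def bernoulli_loglik(P, ys):
    """Log likelihood of a 0/1 sample ys under success probability P."""
    Q = 1 - P
    return sum(y * math.log(P) + (1 - y) * math.log(Q) for y in ys)

ys = [1, 0, 0, 1, 1, 0, 1, 1]      # a small illustrative sample
phat = sum(ys) / len(ys)           # sample proportion of 1's = 5/8

# A grid search confirms the maximum is at the sample proportion.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda P: bernoulli_loglik(P, ys))
print(phat, best)   # both are 0.625
```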

In the logistic regression model,

Px = 1/[1 + exp(-B0 - B'x)],

where   B0   is the constant and  B  is the vector of logistic regression parameters, to be estimated.   This is done by maximizing the likelihood by numerical methods.
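A minimal sketch of the numerical maximization, using Newton-Raphson with a single explanatory variable on made-up data (not the charge-account example):

```python
import math

# Made-up data: one explanatory variable x and 0/1 outcomes y
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(25):                          # Newton-Raphson iterations
    P = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
    # Gradient of the log likelihood
    g0 = sum(y - p for y, p in zip(ys, P))
    g1 = sum((y - p) * x for y, p, x in zip(ys, P, xs))
    # Negative Hessian (the information matrix), entries h00, h01, h11
    w = [p * (1 - p) for p in P]
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, xs))
    h11 = sum(wi * x * x for wi, x in zip(w, xs))
    det = h00 * h11 - h01 * h01
    # Newton step: add (inverse information) times gradient
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(b0, b1)   # maximum likelihood estimates
```

At the maximum the gradient of the log likelihood is (numerically) zero.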

Testing a reduced model against a full model is based on -2LL, where LL is the natural log of the maximized likelihood. The test uses the fact that   (-2LL)reduced - (-2LL)full   is, for large n, distributed approximately as chi-square with the number of d.f. equal to   kfull - kreduced, where these  k's   are the numbers of explanatory variables in the two models.
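A sketch of the test with hypothetical -2LL values; for one degree of freedom the chi-square tail probability can be computed from the normal distribution via math.erfc:

```python
import math

# Hypothetical maximized -2 log L values
neg2LL_reduced = 141.9   # model without the extra variable
neg2LL_full = 135.2      # model with one additional variable

chi_sq = neg2LL_reduced - neg2LL_full    # 1 d.f. here (one extra variable)

# Chi-square with 1 d.f. is the square of a standard normal,
# so P(chi-square > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c/2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(chi_sq, p_value)   # about 6.7 and 0.0097: the extra variable helps
```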

Incorporating prior probabilities

Logistic regression implicitly uses prior probability estimates obtained from the sample. Thus the estimate of B0 will contain a term   ln(p/q),   where   p   and   q   are the proportions of cases from Group 1 and Group 0, respectively, in the sample.   If the appropriate prior probabilities are instead p' and q', then  b0  should be adjusted:

new b0 =   b0 - ln(p/q) + ln(p'/q') .
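The adjustment can be written as a one-line function; the sample and prior proportions below are hypothetical:

```python
import math

def adjust_intercept(b0, p_sample, p_prior):
    """Replace the sample-based prior odds term in b0 with the appropriate prior odds."""
    q_sample, q_prior = 1 - p_sample, 1 - p_prior
    return b0 - math.log(p_sample / q_sample) + math.log(p_prior / q_prior)

# Hypothetical: 30% of sampled cases were in Group 1, but the true prior is 10%.
print(adjust_intercept(-0.5, 0.30, 0.10))
```

When the sample proportions already match the priors, the intercept is unchanged.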


Beyond the Basics:  An Alternative to the Logistic Regression Model 


An alternative is Pregibon's family of curves

[P^g - (1-P)^g]/g   for  g  not equal to 0,

ln P - ln(1-P)   for  g = 0.

Thus this family of transformations of  P  includes the logistic regression model as a special case.   The other members of the family are not symmetric.
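The g -> 0 limit can be checked numerically; this sketch uses the one-parameter form of the family as written above:

```python
import math

def pregibon(P, g):
    """One-parameter family of transforms of P; g = 0 gives the logit."""
    if g == 0:
        return math.log(P) - math.log(1 - P)
    return (P ** g - (1 - P) ** g) / g

# As g -> 0 the transform approaches the log odds.
for P in (0.2, 0.5, 0.8):
    assert abs(pregibon(P, 1e-8) - pregibon(P, 0)) < 1e-6

print(pregibon(0.8, 0), pregibon(0.8, 0.5))
```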


Exercises on Logistic Regression

Longevity of Restaurants

Suppose a logistic regression analysis of some data for restaurants gave the following result. 

L(x1,x2) = .3 - .2 x1 + .1x2,


where L(x1,x2) = logit[P(Y=1|X1=x1,X2=x2)] = ln{P(Y=1|X1=x1,X2=x2)/P(Y=0|X1=x1,X2=x2)},

Y = 1 denotes bankruptcy within three years of startup,

Y = 0 denotes staying in business more than three years,

X1 = 1 if franchised, 0 if not,

X2 = 1 if a fast-food restaurant, 0 if not.


1. What is the value of L(0,1) ?

(A) .1                 (B) .2               (C) .3               (D) .4               (E) .5


2. The number P(Y=1|X1=0,X2=1) is the probability that


(A) a franchised fast-food restaurant is bankrupt within three years.

(B) a franchised non-fast food restaurant is not bankrupt within three years.

(C) a non-franchised fast-food restaurant is bankrupt within three years.

(D) a non-franchised non-fast food restaurant is bankrupt within three years.

(E) a fast-food restaurant is franchised.


3. (continuation) What is the value of this probability ?

(A) .401    (B) .450       (C) .550          (D) .599          (E) .67

Logit Function

4. If P = 1/2, what is the value of the logit function, ln(P/Q), where Q = 1 - P ?

(A) -1                 (B) -1/2                       (C) 0                (D) 1/2            (E) 1

Exponential and Logarithmic Function

5.   For x > 0, eln(x) = ?

(A) ex       (B) ln(x)        (C) x            (D) 0            (E) 1

6. ln(ex) = ?

(A) ex       (B) ln(x)        (C) x        (D) 0        (E) 1

7.   ln(2e) =

(A) 1/2        (B) 1        (C) 2        (D) 3        (E) 1 + ln(2)

8. Which of the following is closest to the value of the number 1/e ?

(A) 1/3        (B) 0.368        (C) 1/2        (D) 3        (E) 3.14159

Logistic Function

9. If z = 0, what is the value of the logistic function, 1/(1 + e-z) ?

(A) -1        (B) -1/2        (C) 0        (D) 1/2        (E) 1 



References

Pregibon, Daryl (1985).  "Link Tests."  Encyclopedia of Statistical Sciences, Vol. 5.  John Wiley & Sons, New York.



Allison, Paul (1999).  Logistic Regression Using the SAS System: Theory and Application.  SAS Institute, Cary, NC.   There is an excellent discussion of the basics, along with treatment of many advanced topics such as ordered logit, discrete choice, and repeated measures.
Go to  and find the book in the "books by users" section. 

Friedman, Jerome H. (1997).  "Data Mining and Statistics: What's the Connection?"  29th Symposium on the Interface of Computer Science & Statistics, Houston, TX.

Copyright 2000 Stanley Louis Sclove
Created: 19 October 1998       Updated:  14 April 2001