University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences

IDS 470:     Multivariate (Statistical) Analysis I
Instructor:   Prof. Stanley L. Sclove
Textbook:   Johnson & Wichern, 4th ed.
Section 11.8:   Addendum on Logistic Regression; Classification Trees; Neural Networks

HyperTable of Contents

Including Qualitative Variables:   Logistic Regression
Classification Trees
Neural Networks

Including Qualitative Variables: Logistic Regression

When one or more of the explanatory variables is non-numerical, discriminant analysis will be inappropriate. Logistic regression can be used with any combination of explanatory variables, numerical, categorical, or some of each.   The two groups to be discriminated are indicated by the binary dependent variable.

Logistic Regression: Regression with a Binary Dependent Variable

Let Px = P(Y=1|X=x) and Qx = P(Y=0|X=x) = 1 - Px.   When the distribution of X for Y = 1 is multinormal with mean µ1 and the distribution of X for Y = 0 is multinormal with mean µ0, and the two covariance matrices are equal, then   ln(Px/Qx)   is linear in x.   This suggests modeling the binary dependent variable Y by taking   ln(Px/Qx)   to be linear in x, even when the conditional distributions of X given Y = 0 and 1 are not multinormal.   The model   ln(Px/Qx) = b0 + b1^T x,   where b0 is a constant and b1 is a vector of coefficients, is called the logistic regression model.
Function:      Name:                               Range:
Px             prob. that Y=1, given that X=x      0 to 1
Px/Qx          Odds                                0 to infinity
ln(Px/Qx)      log Odds, or "logit"                negative infinity to positive infinity

A little algebra shows that if logit(P) = z, then P = e^z/(1+e^z), or   1/(1+e^(-z)) .   This function is called the logistic function.
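The relation between the logit and the logistic function can be checked numerically. The following is a minimal sketch; the function names logit and logistic are chosen here for illustration and are not from the text.

```python
import math

def logit(p):
    """Log odds ln(p/(1-p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Logistic function 1/(1+e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Going from p to the logit and back should return the original p.
for p in (0.04, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p:.2f}   odds = {p / (1 - p):.4f}   "
          f"logit = {z:+.3f}   logistic(logit) = {logistic(z):.2f}")
```

The two functions are inverses of one another, which is why a model that is linear on the logit scale always yields probabilities between 0 and 1.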

Estimation

If there is more than one observation at each value x, weighted least squares estimation can be used. Example. Testing strength of wires. P = Px = probability of breaking at weight x. Odds = P/Q, where Q = 1-P.
Weight applied (lbs.):      10    20     30      40     50    
number of wires breaking:    4     8     18      76     90
number of wires tested
at this weight:            100   100    100     100    100
p, estimate of P:          .04   .08    .18     .76    .90 
q, estimate of Q:          .96   .92    .82     .24    .10
Odds, p/q:                 .0417 .0870  .2195  3.167  9.000             
Logit = ln(Odds):        -3.18   -2.44  -1.52  +1.15  +2.20
The logit is regressed on x, except that weighted regression is used. The weight is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(NPQ), so the weight is NPQ, which is estimated by Npq. The fitted regression equation is

^logit =   -5.56 + 0.157 Weight.

The median breaking strength can be estimated as the value of x for which px = 1/2, that is, logit(px) = 0.
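The weighted regression can be reproduced from the table with a few lines of arithmetic. The sketch below is one way to do it, using the closed-form weighted least squares formulas for a single predictor; the function and variable names are illustrative. Run on the table as printed, it gives an intercept of about -5.56 and a slope of about 0.157, so the estimated median breaking strength is a little over 35 lbs.

```python
import math

loads = [10, 20, 30, 40, 50]        # weight applied (lbs.)
broken = [4, 8, 18, 76, 90]         # number of wires breaking
tested = [100, 100, 100, 100, 100]  # number of wires tested

def weighted_logit_fit(x, broke, n):
    """Weighted least squares of logit(p) on x, with weights N*p*q."""
    p = [b / m for b, m in zip(broke, n)]
    y = [math.log(pi / (1 - pi)) for pi in p]        # observed logits
    w = [m * pi * (1 - pi) for m, pi in zip(n, p)]   # estimated 1/Var(logit)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

b0, b1 = weighted_logit_fit(loads, broken, tested)
median_strength = -b0 / b1   # value of x at which the fitted logit is 0
print(f"fitted logit = {b0:.2f} + {b1:.3f} * Weight")
print(f"estimated median breaking strength = {median_strength:.1f} lbs.")
```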

Maximum likelihood estimation is used when there are not repeated observations at each pattern of x. The likelihood L is maximized. The higher the maximized value of L, the better the fit of the model. This is assessed on a log scale by computing -2 log L, called -2LL; smaller values of -2LL indicate a better fit. (This criterion plays the role that the residual sum of squares, i.e., the sum of squared errors, plays in normal multiple regression models.) When there are several explanatory variables, different models can be compared using -2LL as a figure-of-merit. A penalty term can be added which increases with the number of parameters used. In AIC (Akaike's information criterion) this penalty is 2 times the number of parameters. In Schwarz's criterion the penalty is the natural log of n times the number of parameters, where n is the number of observations.
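The two penalized criteria are simple to compute once -2LL is known. In the sketch below the -2LL values, parameter counts, and sample size are hypothetical numbers made up for illustration; they do not come from any data set in these notes.

```python
import math

def aic(minus2ll, k):
    """Akaike's information criterion: -2LL plus 2 per parameter."""
    return minus2ll + 2 * k

def schwarz(minus2ll, k, n):
    """Schwarz's criterion: -2LL plus ln(n) per parameter."""
    return minus2ll + math.log(n) * k

# Hypothetical candidate models: name -> (-2LL, number of parameters).
candidates = {"x1 only": (140.0, 2), "x1 and x2": (133.5, 3)}
n_obs = 100  # hypothetical sample size

for name, (m2ll, k) in candidates.items():
    print(f"{name:10s}  -2LL = {m2ll:6.1f}  AIC = {aic(m2ll, k):6.1f}  "
          f"Schwarz = {schwarz(m2ll, k, n_obs):6.1f}")
```

For either criterion the model with the smaller value is preferred. Schwarz's penalty per parameter exceeds that of AIC whenever ln(n) > 2, that is, for n of 8 or more, so it tends to favor smaller models.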

Here is an explanation of maximum likelihood estimation in logistic regression models. We begin with the simpler case of a sample of Bernoulli variables.   If you have a sample

y1, y2, . . ., yN

of 0,1 variables with success probability P, then the likelihood can be written

P^y1 Q^(1-y1) x P^y2 Q^(1-y2) x . . . x P^yN Q^(1-yN).

The maximum likelihood estimator is the value of P which maximizes this; it turns out to be simply

(y1 + y2 + . . . + yN)/N .

In the logistic regression model, Px = 1/[1+exp(-B0 - B^T x)], where B0 is the constant and B is the vector of logistic regression parameters, to be estimated.   This is done by maximizing the likelihood by numerical methods.
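The text does not say which numerical method is used. As one common possibility, the sketch below applies Newton-Raphson iteration to a model with a single explanatory variable. For illustration it treats the wire-strength counts from the Estimation example as grouped binomial data, which is an assumption made here and not something the notes do; all function names are illustrative.

```python
import math

x = [10, 20, 30, 40, 50]        # weight applied (lbs.)
r = [4, 8, 18, 76, 90]          # wires breaking
n = [100, 100, 100, 100, 100]   # wires tested

def fit_logistic_ml(x, r, n, iters=50, tol=1e-10):
    """Newton-Raphson for Px = 1/(1+exp(-B0 - B1*x)) with binomial counts."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # First derivatives of the log likelihood with respect to B0 and B1.
        g0 = sum(ri - ni * pi for ri, ni, pi in zip(r, n, p))
        g1 = sum(xi * (ri - ni * pi) for xi, ri, ni, pi in zip(x, r, n, p))
        # Negative second derivatives (a 2 x 2 matrix).
        v = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
        h00 = sum(v)
        h01 = sum(xi * vi for xi, vi in zip(x, v))
        h11 = sum(xi * xi * vi for xi, vi in zip(x, v))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 linear system by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

def minus2ll(b0, b1, x, r, n):
    """-2 log likelihood, omitting the constant binomial coefficients."""
    total = 0.0
    for xi, ri, ni in zip(x, r, n):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += ri * math.log(p) + (ni - ri) * math.log(1.0 - p)
    return -2.0 * total

b0_hat, b1_hat = fit_logistic_ml(x, r, n)
print(f"ML estimates: B0 = {b0_hat:.3f}, B1 = {b1_hat:.4f}")
print(f"-2LL at the estimates: {minus2ll(b0_hat, b1_hat, x, r, n):.2f}")
```

At the maximum the first derivatives are zero, so the fitted number of breaks summed over all weights equals the observed total.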

Testing is based on -2LL, where LL is the natural log of the maximized likelihood. It is based on the fact that (-2LL)reduced - (-2LL)full is for large N distributed approximately as chi-square with nfull - nreduced d.f. when the reduced model holds, where nfull and nreduced are the numbers of parameters in the two models.
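A test of this kind can be sketched as follows. The statistic is the -2LL of the reduced model minus that of the full model, which is never negative because the full model fits at least as well. The -2LL values used are hypothetical, and the small table holds the standard upper 5% points of chi-square for 1 to 3 d.f.

```python
import math

# Upper 5% points of the chi-square distribution (standard table values).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

def lr_test(minus2ll_reduced, k_reduced, minus2ll_full, k_full):
    """Return (statistic, d.f., reject at the 5% level?)."""
    stat = minus2ll_reduced - minus2ll_full
    df = k_full - k_reduced
    return stat, df, stat > CHI2_CRIT_05[df]

def p_value_1df(stat):
    """Upper-tail chi-square probability for 1 d.f., via the normal tail."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical fits: reduced model with 2 parameters, full model with 3.
stat, df, reject = lr_test(140.0, 2, 133.5, 3)
print(f"chi-square = {stat:.2f} on {df} d.f.; reject reduced model: {reject}")
print(f"approximate p-value: {p_value_1df(stat):.4f}")
```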

Incorporating prior probabilities

Logistic regression implicitly uses prior probability estimates obtained from the sample. Thus the estimate of B0 will contain a term ln(p1/p0), where p0 and p1 are the proportions of cases from Group 0 and Group 1 in the sample.   If the appropriate prior probabilities are instead q0 and q1, then b0 should be adjusted:  

new b0 =   b0 - ln(p1/p0) + ln(q1/q0).
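The adjustment is a one-line calculation. In this sketch the function name and the numerical inputs are hypothetical, chosen only to show the effect of moving from a balanced sample to a prior in which Group 1 is rare.

```python
import math

def adjust_intercept(b0, p1, q1):
    """new b0 = b0 - ln(p1/p0) + ln(q1/q0), with p0 = 1 - p1 and q0 = 1 - q1."""
    p0, q0 = 1.0 - p1, 1.0 - q1
    return b0 - math.log(p1 / p0) + math.log(q1 / q0)

# Hypothetical case: half the sample is from Group 1 (p1 = .5),
# but the appropriate prior probability of Group 1 is q1 = .1.
new_b0 = adjust_intercept(0.3, 0.5, 0.1)
print(f"adjusted constant: {new_b0:.4f}")
```

Only the constant changes; the coefficients of the explanatory variables are left as estimated.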

Example. There were 113 applicants for charge accounts at a department store in a large city in the northeastern U.S.   The variables include gender and marital status, which are categorical and hence call for logistic regression rather than multiple discriminant analysis (MDA).

Exercises on Logistic Regression

Longevity of Restaurants

Suppose a logistic regression analysis of some data for restaurants gave the following result.
L(x1,x2) = .3 - .2 x1 + .1 x2,
where
L(x1,x2) = ln{P(Y=1|X1=x1,X2=x2)/P(Y=0 |X1=x1,X2=x2)}
Y = 1 denotes bankruptcy within three years of startup,
Y = 0 denotes staying in business more than three years,
X1 = 1 if franchised, 0 if not,
X2 = 1 if a fast-food restaurant, 0 if not.
1.   What is the value of L(0,1) ?
(A) .1    
(B) .2    
(C) .3    
(D) .4    
(E) .5
2.   The number P(Y=1|X1=0,X2=1) is the probability that
(A) a franchised fast-food restaurant is bankrupt within three years.
(B) a franchised non-fast food restaurant is not bankrupt within three years.
(C) a non-franchised fast-food restaurant is bankrupt within three years.
(D) a non-franchised non-fast food restaurant is bankrupt within three years.
(E) a fast-food restaurant is franchised.
3.   (continuation) What is the value of this probability ?
(A) .401
(B) .450
(C) .550
(D) .599
(E) .67

Logit Function

4. If P = 1/2, what is the value of the logit function, ln(P/Q), where Q = 1-P ?
(A) -1    
(B) -1/2    
(C) 0  
(D) 1/2  
(E) 1

Exponential and Logarithmic Function

5.   For x > 0, e^ln(x) = ?

(A) e^x  
(B) ln(x)    
(C) x
(D) 0  
(E) 1

6.   ln(e^x) = ?

(A) e^x  
(B) ln(x)    
(C) x  
(D) 0  
(E) 1

7.     ln(2e) =
(A) 1/2
(B) 1
(C) 2
(D) 3
(E) 1 + ln(2)
8.   Which of the following is closest to the value of the number 1/e ?
(A) 1/3  
(B) 0.368    
(C) 1/2  
(D) 3  
(E) 3.14159

Logistic Function

9.   If z = 0, what is the value of the logistic function, 1/(1 + e^(-z)) ?
(A) -1  
(B) -1/2    
(C) 0  
(D) 1/2  
(E) 1

Classification Trees

Addendum:   CART -- Classification and Regression Trees

CART and AID

This concerns a statistical technique, AID (Automatic Interaction Detection), which was much discussed about a quarter-century ago and has since been rediscovered, further developed, and named CART (Classification & Regression Trees).

Classification. Statistical "classification" seeks a rule to predict accurately the class of each new observation. A prediction rule is constructed using information from a "training set", a sample in which the true class of each observation is known. In discriminant analysis, a goal is to derive a function of the explanatory variables which is a classification index.

AID

AID (Automatic Interaction Detection), recursive partitioning and CART (Classification and Regression Trees) are three names for similar techniques for classification. They are alternatives to discriminant analysis.

In these three techniques, the sample is subdivided successively according to the observed variables. In marketing, such subdivision is one way of "segmenting the market".

In its simplest form, recursive partitioning separates units of the initial group into two subgroups contingent upon the value of one of the variables. All possible splits of this type are considered and the one which best separates the data into groups homogeneous in class is chosen. A chi-square or F statistic is used to measure the separation. This process then continues recursively within each subgroup. It is illustrated below with a marketing example from part of the Alpha radial tires case in Green (1978).
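A single step of this search can be sketched as follows, assuming binary (0/1) explanatory variables and a binary class, and using the Pearson chi-square statistic for a 2 x 2 table without continuity correction. The variable names and the eight data rows are hypothetical; a full procedure would apply the same step again within each resulting subgroup and would need a stopping rule, which is not shown.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_split(rows, outcome, candidates):
    """Return (variable, statistic) for the split that best separates the classes."""
    best = None
    for var in candidates:
        a = sum(1 for row in rows if row[var] == 1 and row[outcome] == 1)
        b = sum(1 for row in rows if row[var] == 1 and row[outcome] == 0)
        c = sum(1 for row in rows if row[var] == 0 and row[outcome] == 1)
        d = sum(1 for row in rows if row[var] == 0 and row[outcome] == 0)
        stat = chi_square_2x2(a, b, c, d)
        if best is None or stat > best[1]:
            best = (var, stat)
    return best

# Hypothetical training set: two binary variables x1, x2 and class y.
rows = [
    {"x1": 1, "x2": 1, "y": 1}, {"x1": 1, "x2": 0, "y": 1},
    {"x1": 1, "x2": 1, "y": 1}, {"x1": 1, "x2": 0, "y": 0},
    {"x1": 0, "x2": 1, "y": 0}, {"x1": 0, "x2": 0, "y": 0},
    {"x1": 0, "x2": 1, "y": 0}, {"x1": 0, "x2": 0, "y": 1},
]
variable, statistic = best_split(rows, "y", ["x1", "x2"])
print(f"split on {variable} (chi-square = {statistic:.2f})")
```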

Efron and Tibshirani (1991) write about CART (and other computer-intensive statistical techniques).

Software for AID includes SPSS's CHAID (AID using chi-square for categorical variables).

Example. Alpha Radial Tires (Green 1978). This is a numerical example of the application of AID. The explanatory variables (see the file ALPHMSTR CODEBOOK for definitions of variables) used were:
A2: Was Alpha the brand last purchased?;
A3: pre-exposure interest rating of Alpha radials;
C1: post-exposure believability rating of the Alpha commercial;
C2: post-exposure interest in Alpha radials.
The dependent variable is D: Is Alpha the brand of choice?

The tree is shown below. The terminal nodes are underlined. These define six clusters into which the   n   respondents are grouped.

If we examine these terminal nodes, we see that:

1. The group consisting of only 15 respondents shows the highest p, .93. This group expresses high believability in the Alpha radial tire commercial and its members are all past purchasers of the Alpha brand.

2. The group of 38 respondents at the bottom right of the tree shows the lowest p, .03. This group:
a. Does not express high believability in the Alpha commercial.
b. Does not express high pre-exposure interest in Alpha.
c. Is not made up of past purchasers of Alpha.
d. Expresses low post-exposure interest in the Alpha brand.

                         -----------
                         | n = 252 |
                         | p = .31 |
                         -----------
                              |
                     Split on Variable C1
                              |
             --------------------------------------
             |                                    |
      High Believability                        Other
             |                                    |
             V                                    V
        -----------                          -----------
        | n =  53 |                          | n = 199 |
        | p = .68 |                          | p = .21 |
        -----------                          -----------
             |                                    |
     Split on Variable A2              Split on Variable A3
       |               |                 |                |
Alpha Purchaser      Other             Other      High Pre-exposure
       |               |                 |             Interest
       V               V                 V                V
  -----------     -----------       -----------      -----------
  | n =  15 |     | n =  38 |       | n = 178 |      | n =  21 |
  | p = .93 |     | p = .58 |       | p = .17 |      | p = .57 |
  -----------     -----------       -----------      -----------
  -----------     -----------            |           -----------
                                Split on Variable A2
                                  |               |
                           Alpha Purchaser      Other
                                  V               V
                             -----------     -----------
                             | n =  21 |     | n = 157 |
                             | p = .38 |     | p = .14 |
                             -----------     -----------
                             -----------          |
                                         Split on Variable C2
                                                |
                                            ----------------
                                            |              |
                                            |         Low Post-Exp.
                                          Other         Interest
                                            V              V
                                       -----------     -----------
                                       | n = 119 |     | n =  38 |
                                       | p = .18 |     | p = .03 |
                                       -----------     -----------
                                       -----------     -----------
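As a check on the tree, the six terminal nodes should account for all 252 respondents, and their proportions, weighted by group size, should reproduce the overall p at the root. The short sketch below does this arithmetic with the n and p values read from the diagram; the node descriptions are paraphrases added here.

```python
# Terminal nodes of the tree: (description, n, p).
terminal_nodes = [
    ("high believability, Alpha purchaser", 15, 0.93),
    ("high believability, not Alpha purchaser", 38, 0.58),
    ("other believability, high pre-exposure interest", 21, 0.57),
    ("other believability, other pre-exposure interest, Alpha purchaser", 21, 0.38),
    ("as above but not purchaser, other post-exposure interest", 119, 0.18),
    ("as above but not purchaser, low post-exposure interest", 38, 0.03),
]

total_n = sum(n for _, n, _ in terminal_nodes)
overall_p = sum(n * p for _, n, p in terminal_nodes) / total_n

print(f"respondents in terminal nodes: {total_n}")
print(f"size-weighted average of p:    {overall_p:.2f}")
```

The total is 252 and the weighted average rounds to .31, in agreement with the root node.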

EXERCISE ON CART

Use the classification tree to estimate     P(D=1|C1=1)     and     P(D=0|C1=0),     where D is the dependent variable defined above.

REFERENCES on CART and AID

Efron, Bradley, and Tibshirani, Robert (1991). "Statistical Data Analysis in the Computer Age,"   Science, Vol. 253, No. 5018 (26 July 1991), 390-395.

Green, Paul E. (1978). Analyzing Multivariate Data. Dryden Press, Hinsdale, IL; see esp. pp. 191-201 on AID.


Neural Networks

Basic Concepts of Neural Networks

An interesting alternative to logistic regression (indeed, an approach to many other problems) is "neural networks," a method which is in fact very similar to logistic regression.



             (I1)                              (I1)    (H1)
                \                                         \
                 \ w1                                      \    w1
                  \                                         \     
                   \                                         \
              w2    \                                    w2   \
       (I2)---------(O)                    (I2)    (H2)-------(O)
                     /                                        /
                    /                                        /
                   /w3                                      /w3
                  /                                        /  
                 /                                        /
              (I3)                             (I3)      (H3)
                      (a)                                     (b)
    
  Figure:
  (a) Neural net with 3 input nodes, I1, I2, I3, and one output, O.
  (b) Net with one hidden layer of nodes H1, H2, H3; arcs between I's and H's and their weights w(i,j) are not shown here.

Consider applying a neural net to a classification problem where the data are {x1i,x2i, ...,xni, yi, i = 1,2,...,N} and each y is either 0 or 1, denoting the classification. There are   n   input nodes. In the i-th case, the inputs to the nodes are x1i,...,xni, i = 1,2,..., N.

12.3.2. Estimating a Neural Network Model

The computation starts with initial values for the w's, which will be updated as each case i is processed. For case i (i=1,2,..., N) the linear combination of the inputs with the current weights wv, v = 1,2, ..., n   is

w1x1i + ... + wnxni = Li ,

say. The predicted value of yi is the logistic function of this linear combination,

    1/(1 + e^(-Li)) = pi,
say.

A loss function such as the square of the error yi-pi is to be minimized. In the simplest scheme, each weight is updated by moving it in the direction opposite to the rate of change of the loss with respect to wv. That is,

new wv = old wv - (lrate)(lossder),
where
lossder = rate of change of the loss with respect to wv
and
lrate = "learning rate," a constant, say 1/2.

Such a scheme can be used for machine learning (automatic learning) in any numerical context.

In logistic regression the training algorithm (scheme for updating the weights) is

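new wv = old wv + (lrate)(y-p)p(1-p)xv.
This is based on a squared-error loss function C(y-p)^2 with C = 1/2: the rate of change of this loss with respect to wv is -(y-p)p(1-p)xv, so subtracting (lrate)(lossder) adds (lrate)(y-p)p(1-p)xv.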
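The case-by-case updating scheme can be sketched for the single-output net of part (a) of the figure. The sketch assumes the loss (1/2)(y-p)^2, for which the rate of change with respect to wv is -(y-p)p(1-p)xv. The four training cases are hypothetical, the first input is fixed at 1 so that its weight acts as a constant term (an assumption made here), and the number of passes through the data is arbitrary.

```python
import math

def predict(w, x):
    """Logistic function of the linear combination w1*x1 + ... + wn*xn."""
    L = sum(wv * xv for wv, xv in zip(w, x))
    return 1.0 / (1.0 + math.exp(-L))

def train(cases, n_inputs, lrate=0.5, passes=200):
    """Update the weights one case at a time, moving against the loss derivative."""
    w = [0.0] * n_inputs   # initial values for the w's
    for _ in range(passes):
        for x, y in cases:
            p = predict(w, x)
            for v in range(n_inputs):
                # Rate of change of (1/2)(y - p)^2 with respect to w_v.
                lossder = -(y - p) * p * (1.0 - p) * x[v]
                w[v] = w[v] - lrate * lossder
    return w

# Hypothetical cases: inputs (1, x2) and class y.
cases = [([1, -2], 0), ([1, -1], 0), ([1, 1], 1), ([1, 2], 1)]
w = train(cases, n_inputs=2)
for x, y in cases:
    print(f"x = {x}   y = {y}   p = {predict(w, x):.3f}")
```

Because each case is processed as it arrives, the same loop can be run on data that arrive sequentially, without refitting from scratch.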

Modern statistical methods can be used on many of the same problems as neural networks. Standard and non-standard statistical techniques need to be compared with neural networks to understand the relative advantages and trade-offs among these different tools. The neural-nets approach may prove most useful when the data arrive sequentially.

A Neural Net is a non-interpretable black box. But maybe this could be remedied. For example, a Path Diagram could be solved with a Neural Net. This would combine the interpretability of the Path Diagram with the technology of the Neural Net.

Software for Neural Nets includes NEURAL CONNECTION from SPSS.

EXERCISES

1. Derive the training algorithm above for classification by logistic regression when the loss is squared error.

2. (continuation) Do this when the loss is normalized squared-error, C(y-p)^2/(pq), where q = 1-p .

3. Derive a neural nets training algorithm for the problem of multiple linear regression.

REFERENCES

Denning, Peter J. (1992). "Neural Networks," American Scientist, Vol. 80, No. 5, pp. 426-429. (A short, readable introduction to the subject.)

Neural Networks. Journal. Available in the UIC Math Library.


12.5. Summary . Questions . References

Additional References

Friedman, Jerome H. (1997).   Data mining and statistics: What's the connection? 29th Symposium on the Interface of Computer Science & Statistics, Houston, TX.
Copyright © 1999 Stanley Louis Sclove
Created: 19 October 1998       Updated: 27 June 1999