University of  Illinois at Chicago
College of  Business Administration
Department of Information & Decision Sciences


IDS 594

   Special Topics in IDS:  Statistical Finite Mixture Model -- Classification & Cluster Analysis

Instructor

   Prof.   Stanley L.   Sclove

Text

   McLachlan & Peel,  Finite Mixture Models (recommended only)


Classification  and  Discrimination:
Addendum on Logistic Regression,  Classification Trees,  and  Neural Networks


Outline

Including Qualitative Variables:   Logistic Regression

Classification Trees

Neural Networks


Including Qualitative Variables: Logistic Regression

Classification is a dependence model; that is,  the variables are in two sets, a vector  Y  to be explained or predicted  and  an explanatory vector  X.           The dependent variable  Y   is binary;  {Y = 1} corresponds to P1;  {Y = 0},  to P2.     The two groups to be discriminated are indicated by the binary dependent variable. 

 

When one or more of  the explanatory variables is non-numerical,  discriminant analysis is  inappropriate.     Logistic regression can be used with any combination of  explanatory variables,  numerical,  categorical,  or some of  each.    

Example.   A number of  patients are to be diagnosed into one of  several diseases.   The variables include gender  and  prior history of  a related disease (yes or no),  which are categorical.  

Logistic Regression:   Regression with a Binary Dependent Variable

Let Px = P(Y = 1|X = x) and Qx = P(Y = 0|X = x) = 1 - P(Y = 1|X = x). When the distribution of X for Y = 1 is multinormal with mean µ1, the distribution of X for Y = 0 is multinormal with mean µ2, and the two covariance matrices are equal, then ln(Px/Qx) is linear in x. This suggests modeling the binary dependent variable Y by taking ln(Px/Qx) to be linear in x, even when the conditional distributions of X given Y = 0 and Y = 1 are not multinormal. The model

ln(P/Q) = b0 + b'x

is called the logistic regression model.

Consider the case where there are m distinct patterns of x, denoted by xi, i = 1, 2, ..., m, with ni observations at xi. Then for the sample proportion pi we have

ln(pi/qi) = ln(Pi/Qi) + [ln(pi/qi) - ln(Pi/Qi)] = b0 + b'xi + ei,

where

ei = ln(pi/qi) - ln(Pi/Qi)

can be shown to be approximately normal with mean 0 and variance 1/(ni Pi Qi).
 
 
 

Function       Name                                   Range

Px             prob. that Y = 1, given that X = x     0 to 1
Px/Qx          Odds                                   0 to infinity
ln(Px/Qx)      log Odds, or "logit"                   negative infinity to positive infinity

A little algebra shows that if logit(P) = t, then

P = e^t/(1 + e^t), or 1/(1 + e^(-t)).

This function is called the logistic function. Also,

Q = 1/(1 + e^t), or e^(-t)/(1 + e^(-t)).
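The relation between the logit and the logistic function can be checked numerically. The following is a minimal Python sketch; the function names logit and logistic are chosen here for illustration and are not from any particular package.

```python
import math

def logit(p):
    """Log odds ln(p/(1-p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(t):
    """Inverse of the logit: 1/(1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

for p in (0.04, 0.5, 0.9):
    t = logit(p)
    # logistic(t) recovers P; logistic(-t) gives Q = 1 - P
    print(p, round(t, 3), round(logistic(t), 3), round(logistic(-t), 3))
```

Applying logistic to the logit returns the original P, and applying it to the negative of the logit returns Q, which is the symmetry used in the formulas above.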

Estimation

If there is more than one observation at each value x, then Px can be estimated by the sample proportion  px and  logit px  can be regressed on  x  using weighted least squares. 

Example.    (tensile strength of  wires).     P  =  Px  =  probability of  breaking at weight x.   

Odds  =  P/Q,  where Q  =  1-P.  

__________________________________________________________

Weight applied (lbs.):   10    20     30      40     50

no. of wires breaking:    4     8     18      76     90

no. of  wires tested:    100   100    100     100    100

p, estimate of  P:       .04   .08    .18     .76    .90

q, estimate of  Q:       .96   .92    .82     .24    .10

Odds, p/q:               .0417 .0870  .2195  3.167  9.000

Logit  =  ln(Odds):     -3.18  -2.44 -1.52  +1.15  +2.197

_____________________________________________

 

The logit is regressed on x using weighted regression. The weight is 1/Var(y), where here y = logit(p). Var[logit(p)] can be shown to be approximately 1/(nx Px Qx), so the weight is nx Px Qx, which is estimated by nx px qx. The fitted regression equation is

predicted value of the logit  =  -11.51 + 0.156x

and the estimated value of Px according to the model is

1 / [1 + exp(11.51 - 0.156x)].


The median breaking strength can be estimated as the value of x for which px = 1/2, that is, logit(px) = ln[(1/2)/(1/2)] = ln(1) = 0.
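The weighted regression of the logit on x can be sketched in a few lines of plain Python. The helper name weighted_line and the variable names are illustrative assumptions. The coefficients printed come from the table as transcribed in these notes and should be read as a check on the method; they need not reproduce the quoted equation.

```python
import math

weights_lbs = [10, 20, 30, 40, 50]
broken = [4, 8, 18, 76, 90]
tested = [100, 100, 100, 100, 100]

p = [b / n for b, n in zip(broken, tested)]
y = [math.log(pi / (1 - pi)) for pi in p]             # empirical logits
w = [n * pi * (1 - pi) for n, pi in zip(tested, p)]   # estimated 1/Var(logit)

def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope for y = b0 + b1*x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = weighted_line(weights_lbs, y, w)
median_strength = -b0 / b1   # x at which the fitted logit is 0, i.e. P = 1/2
print(b0, b1, median_strength)
```

The last line uses the fact that the fitted logit is zero at x = -b0/b1, which is the estimated median breaking strength.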

You can change this tensile strength example to a dose/response example.  

Maximum likelihood estimation in logistic regression models

Maximum likelihood estimation can be used even when there are repeated observations at each pattern of x, but it must be used when there are no repeated observations. The likelihood L is maximized. What is L?
 

To see what  L  is,   we begin with the simpler case of  a sample of  Bernoulli variables.     If you have a sample 

y1,  y2,  .   .   .  ,  yn

of n independent 0-1 variables with success probability P, then the likelihood can be written

L = Π P^yj Q^(1-yj),

where the product is for j = 1 to n. That is, the log likelihood is

Σ [yj log P + (1-yj) log Q],

where the sum is for j = 1 to n.

The maximum likelihood estimator is the value of P which maximizes this; it of course turns out to be simply y-bar, or

(y1 + y2 + ... + yn)/n.
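That y-bar maximizes the Bernoulli log likelihood can be checked by brute force. This is a minimal sketch on a made-up sample of eight 0-1 values; the grid search is only for illustration and is not how the estimate would be computed in practice.

```python
import math

def bernoulli_loglik(p, ys):
    """Sum of y*log(P) + (1-y)*log(Q) over the sample, with Q = 1 - P."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

ys = [1, 0, 0, 1, 1, 0, 1, 1]          # made-up sample
ybar = sum(ys) / len(ys)

# Evaluate the log likelihood on a grid of P values strictly between 0 and 1.
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda p: bernoulli_loglik(p, ys))
print(ybar, best)
```

The grid value with the largest log likelihood coincides with the sample mean of the y's.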

In the logistic regression model, P = 1/[1 + exp(-(b0 + b'x))], where b0 is the constant and b is the vector of logistic regression parameters, to be estimated. This is done by maximizing the likelihood by numerical methods, e.g. the Newton-Raphson method.
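As an illustration of the numerical maximization, here is a minimal Newton-Raphson sketch for one explanatory variable and grouped binomial data, applied to the wire data of the tensile strength example. The function name fit_logistic_newton, the zero starting values and the stopping rule are assumptions made for this sketch, not a prescribed algorithm.

```python
import math

def fit_logistic_newton(x, successes, trials, iters=25, tol=1e-10):
    """Newton-Raphson for logit(P) = b0 + b1*x with grouped binomial data."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = ri - ni * p           # contribution to the score
            wgt = ni * p * (1.0 - p)      # contribution to the information
            g0 += resid
            g1 += resid * xi
            h00 += wgt
            h01 += wgt * xi
            h11 += wgt * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # solve the 2x2 system H d = g
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

x = [10, 20, 30, 40, 50]
broken = [4, 8, 18, 76, 90]
tested = [100, 100, 100, 100, 100]
b0, b1 = fit_logistic_newton(x, broken, tested)
print(b0, b1)
```

Each pass computes the derivative of the log likelihood (the score) and the information matrix at the current estimates and then takes one Newton step; at convergence the score is essentially zero.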

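Testing is based on -2LL, where LL is the natural log of the maximized likelihood. It is based on the fact that the likelihood-ratio statistic

-2 ln λ = (-2LL)reduced - (-2LL)full,

where λ is the ratio of the maximized likelihood of the reduced model to that of the full model, is for large n, when the reduced model holds, distributed approximately as chi-square with

pfull - preduced d.f.,

where the p's are the numbers of explanatory variables in the full and reduced models.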
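A sketch of this test in plain Python follows. The -2LL values and the numbers of explanatory variables are made up for illustration, and chi2_sf is a small helper written here (using the standard recurrence for the incomplete gamma function) so that no statistics package is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # exact value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # exact value for df = 1
    # Recurrence Q(a+1, h) = Q(a, h) + h^a e^(-h) / Gamma(a+1), raising df by 2.
    while 2 * a < df:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Hypothetical fitted models: values of -2LL and numbers of explanatory variables.
minus2ll_reduced, p_reduced = 140.0, 1
minus2ll_full, p_full = 131.2, 3

stat = minus2ll_reduced - minus2ll_full
df = p_full - p_reduced
p_value = chi2_sf(stat, df)
print(stat, df, p_value)
```

With these made-up numbers the statistic is 8.8 on 2 d.f., and the upper-tail probability is exp(-4.4), about 0.012, so the two extra variables would be judged to improve the fit.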

The higher the maximized value of L, the better the fit of the model. This is assessed on a log scale by computing -2 ln L, called -2LL; lower values of -2LL indicate a better fit. (This criterion corresponds to residual sum of squares, i.e. sum of squared errors, in normal multiple regression models.) When there are several explanatory variables, different models can be assessed using -2LL as a figure-of-merit.

Model-selection criteria. Some of the model selection criteria consist of a penalty term added to -2LL. The penalty increases with the number of parameters used:

MSC(k) = -2LLk + a(n) mk,

where k indexes the alternative models, say k = 1, 2, ..., K; mk = the number of independent parameters in model k; and a(n) is the cost per parameter (cost in units of -2LL). The cost per parameter is a(n) = 2 for AIC (Akaike's information criterion) and a(n) = ln(n) for SC (Schwarz's criterion). Models giving low values of the MSC are the good ones.

Choice of a(n). Although Akaike did much innovative, ground-breaking work in this area, Schwarz's criterion is derived from a fairly compelling Bayesian argument, and many analysts, including me, prefer a(n) = ln(n). Note that for n at least 8, ln(n) exceeds 2, so for all practical purposes the cost per parameter in SC exceeds that in AIC, and so AIC will tend to choose larger models than does SC.
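The two criteria are easy to compute side by side. In the sketch below the sample size, the -2LL values and the parameter counts are all made up, chosen so that the two criteria disagree; the function name msc simply follows the MSC(k) notation above.

```python
import math

def msc(minus2ll, m, a):
    """Model-selection criterion: -2LL plus a cost of a per parameter."""
    return minus2ll + a * m

n = 200  # hypothetical sample size
# Hypothetical (-2LL, number of independent parameters) for three models.
candidates = {"k=1": (262.4, 2), "k=2": (255.1, 3), "k=3": (250.0, 5)}

aic = {k: msc(v, m, 2.0) for k, (v, m) in candidates.items()}
sc = {k: msc(v, m, math.log(n)) for k, (v, m) in candidates.items()}

best_aic = min(aic, key=aic.get)   # model with the lowest AIC
best_sc = min(sc, key=sc.get)      # model with the lowest SC
print(best_aic, best_sc)
```

With these numbers AIC picks the five-parameter model while SC picks the three-parameter model, illustrating the tendency of AIC to choose larger models.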
 

Incorporating prior probabilities

Logistic regression implicitly uses prior probability estimates obtained from the sample. Thus the estimate of b0 will contain a term ln(p1/p0), where p0 and p1 are the proportions of cases from Group 0 and Group 1 in the sample. If the appropriate prior probabilities are instead q0 and q1, then b0 should be adjusted:

new b0  =  b0 - ln(p1/p0) + ln(q1/q0).
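The adjustment is a one-line computation. In this sketch the fitted constant and the proportions are hypothetical; the function name adjust_intercept is chosen here for illustration.

```python
import math

def adjust_intercept(b0, p1, q1):
    """Replace the sample proportion p1 of Group 1 by the prior q1 in the constant."""
    p0, q0 = 1.0 - p1, 1.0 - q1
    return b0 - math.log(p1 / p0) + math.log(q1 / q0)

b0 = -1.2          # hypothetical fitted constant
p1 = 0.5           # hypothetical proportion of Group 1 cases in the sample
q1 = 0.1           # hypothetical prior probability of Group 1
new_b0 = adjust_intercept(b0, p1, q1)
print(new_b0)
```

Here the sample was balanced, so ln(p1/p0) = 0 and the constant simply shifts by ln(q1/q0) = ln(1/9); when q1 equals p1 the constant is unchanged.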

Power Analysis

(This section is from an e-mail to the UICSTATS-L listserve from Richard T. Campbell, Professor of Sociology and Director, Research Methods Core, Health Research and Policy Centers, UIC.)

A software package that does power analysis for logistic regression has just become available. The product is called PASS 2000, which is put out by the same people who do NCSS. The web site is http://www.ncss.com. The software contains the same set of elementary power analysis routines that one can find in several other packages, although the reporting formats and graphics seem to be superior to many other products. What is unique about PASS is that it contains power routines for logistic regression. One can compute power for dichotomous and continuous independent variables, specifying a correlation structure on the covariates and a distribution on the categorical X. There may be no other product that does this unless one computes "exemplary likelihood tests" in O'Brien's SAS-based Unifypow, which can be a bit difficult. The routines are based on a 1998 paper in Statistics in Medicine by Hsieh et al. Graphs corresponding to the printed output are easily available.

Here are a few lines of  output from PASS.   The column labeled R-squared refers to the correlation of  the X in question with the remaining X's.    The column labeled Pcnt X  =  1 refers to the distribution of  the dichotomous covariate.   

One can fix P0,  P1 or the odds ratio in setting up the power analysis.   

                 Pcnt                       Odds       R
Power      N     X = 1      P0      P1      Ratio    Squared   Alpha    Beta

0.524     200    40.000    0.050   0.150    3.353     0.300    0.050    0.476
0.505     200    50.000    0.050   0.150    3.353     0.300    0.050    0.495
0.594     300    20.000    0.050   0.150    3.353     0.300    0.050    0.406
0.660     300    30.000    0.050   0.150    3.353     0.300    0.050    0.340
0.685     300    40.000    0.050   0.150    3.353     0.300    0.050    0.315
0.678     300    50.000    0.050   0.150    3.353     0.300    0.050    0.322
0.694     400    20.000    0.050   0.150    3.353     0.300    0.050    0.306
0.768     400    30.000    0.050   0.150    3.353     0.300    0.050    0.232
0.799     400    40.000    0.050   0.150    3.353     0.300    0.050    0.201
0.800     400    50.000    0.050   0.150    3.353     0.300    0.050    0.200

-----------------------------------------------------------------------------------------------------------

Alternatives to the Logistic Regression Model

The function logit(P), which is the log-odds, ln(P/Q), is called a link function, because it links the expected value P of p to a linear model. Note that the logit link has a symmetry in P and Q:

logit(Q) = ln(Q/P) = - ln(P/Q) = - logit(P).

The corresponding inverse function, P = 1/[1 + exp(-t)], where t = b0 + b'x, may or may not follow the data very well. Alternative models are provided by Pregibon's functions.

Pregibon's curves are

P^g + Q^(-g)   for g different from 0

and

ln(P) - ln(Q)  =  logit(P)   for g = 0.


Exercise. Recall that the family of Box-Cox transformations is defined as

(x^λ - 1)/λ

rather than just x^λ so that by continuity at λ = 0 the family includes ln(x). Similarly, show that

lim (g -> 0) (P^g + Q^(-g) - 2)/g = ln(P) - ln(Q) = logit(P).
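Before attempting the proof, the limit can be explored numerically. The sketch below evaluates the normalized expression for shrinking values of g at an arbitrarily chosen P = 0.3; the function name normalized_curve is an assumption of this sketch. It is only a numerical check, not a substitute for the derivation asked for.

```python
import math

def normalized_curve(P, g):
    """(P^g + Q^(-g) - 2)/g for g != 0, and logit(P) at g = 0."""
    Q = 1.0 - P
    if g == 0:
        return math.log(P) - math.log(Q)
    return (P ** g + Q ** (-g) - 2.0) / g

P = 0.3
for g in (1.0, 0.1, 0.01, 0.001, 1e-6):
    # The gap to logit(P) shrinks as g approaches 0.
    print(g, normalized_curve(P, g), normalized_curve(P, g) - normalized_curve(P, 0))
```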

Exercise.    Find the delta-method approximation to the variance of   g(p),  where  p  is the sample proportion  and  g(P) is Pregibon's link function.  
 


Exercises on Logistic Regression

Longevity of  Restaurants.   Suppose a logistic regression analysis of  some data  for restaurants gave the following result. 

L(x1, x2)  =  .3 - .2 x1 + .1 x2

where

L(x1, x2)  =  ln{P(Y = 1|X1 = x1, X2 = x2)/P(Y = 0|X1 = x1, X2 = x2)}

Y  =  1 denotes bankruptcy within three years of  startup,  

Y  =  0 denotes staying in business more than three years,  

X1  =  1 if franchised,  0 if not,  

X2  =  1 if a fast-food restaurant,  0 if not.   

1.   What is the value of  L(0, 1) ? 

(A) .1     (B) .     (C) .     (D) .     (E) .

2.   The number P(Y = 1|X1 = 0, X2 = 1) is the probability that 

(A) a franchised fast-food restaurant is bankrupt within three years.   

(B) a franchised non-fast food restaurant is not bankrupt within three years.   

(C) a non-franchised fast-food restaurant is bankrupt within three years.   

(D) a non-franchised non-fast food restaurant is bankrupt within three years.   

(E) a fast-food restaurant is franchised.   

3.   (continuation) What is the value of  this probability ? 

(A) .401     (B) .450     (C) .550     (D) .599     (E) .67

Logit Function

4.   If P  =  1/2,  what is the value of  the logit function,  ln(P/Q), where Q  =  1-P ?

(A) -1              (B) -1/2                        (C) 0    (D) 1/2                         (E) 1 

Exponential  and  Logarithmic Function

5. For x > 0, e^(ln(x)) = ?

(A) e^x     (B) ln(x)     (C) x     (D) 0     (E) 1

6. ln(e^x) =

(A) e^x     (B) ln(x)     (C) x     (D) 0     (E) 1

7. ln(2e) =

(A) 1/2     (B) 1     (C) 2     (D) 3     (E) 1 + ln(2)

8. Which of the following is closest to the value of the number 1/e?

(A) 1/3     (B) 0.368     (C) 1/2     (D) 3     (E) 3.14159

Logistic Function

9. If x = 0, what is the value of the logistic function, 1/(1 + e^(-x))?

(A) -1     (B) -1/2     (C) 0     (D) 1/2     (E) 1

___________________________________________________________

References

Allison,  Paul.  SAS book on logistic regression.  

Pregibon,  Daryl (1981).  Logistic regression diagnostics.  Ann.   Statist.   9:4, 705-724.  

Pregibon,  Daryl (1984).  JASA,  79,  61-83.  

___________________________________________________________

Classification Trees

Addendum:   CART -- Classification  and  Regression Trees

CART  and  AID

This concerns a statistical technique, AID (Automatic Interaction Detection), which was much discussed for a while about a quarter-century ago and has been rediscovered, further developed and named CART (Classification & Regression Trees).

Classification. Statistical "classification" seeks a rule to predict accurately the class of each new observation. A prediction rule is constructed using information from a "training set", a sample in which the true class of each observation is known. In discriminant analysis, a goal is to derive a function of the explanatory variables which is a classification index.

AID

AID (Automatic Interaction Detection), recursive partitioning and CART (Classification and Regression Trees) are three names for similar techniques for classification. They are alternatives to discriminant analysis.

In these three techniques,  the sample is subdivided successively according to the observed variables.  In marketing,  such subdivision is one way of  "segmenting the market".   

 

In its simplest form, recursive partitioning separates units of the initial group into two subgroups contingent upon the value of one of the variables. All possible splits of this type are considered and the one which best separates the data into groups homogeneous in class is chosen. A chi-square or F statistic is used to measure the separation. This process then continues recursively. This is illustrated below with a marketing example from part of the Alpha radial tires case in Green (1978).
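One step of this procedure, choosing the best binary split by a chi-square statistic, can be sketched as follows. The data are a made-up toy sample with two binary predictors named u and v and a binary class y; the function names are assumptions of the sketch, and the Pearson chi-square for a 2x2 table is used as the separation measure.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_split(rows, target, predictors):
    """Return (statistic, predictor) for the binary predictor that best separates target."""
    best = (-1.0, None)
    for name in predictors:
        a = sum(1 for r in rows if r[name] == 1 and r[target] == 1)
        b = sum(1 for r in rows if r[name] == 1 and r[target] == 0)
        c = sum(1 for r in rows if r[name] == 0 and r[target] == 1)
        d = sum(1 for r in rows if r[name] == 0 and r[target] == 0)
        stat = chi_square_2x2(a, b, c, d)
        if stat > best[0]:
            best = (stat, name)
    return best

# Made-up sample: (u, v, y, number of cases with that pattern).
patterns = [(1, 1, 1, 4), (1, 0, 1, 4), (1, 1, 0, 1), (1, 0, 0, 1),
            (0, 1, 1, 1), (0, 0, 1, 1), (0, 1, 0, 4), (0, 0, 0, 4)]
rows = [{"u": u, "v": v, "y": y} for u, v, y, k in patterns for _ in range(k)]

stat, variable = best_split(rows, "y", ["u", "v"])
print(stat, variable)
```

In this toy sample u is strongly related to the class and v is not, so the first split is on u; applying best_split again within each resulting subgroup would continue the recursion.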

Efron  and  Tibshirani (1991) write about CART ( and  other computer-intensive statistical techniques).   

Software  for AID includes SPSS's CHAID (AID using chi-square  for categorical variables).   

Example. (Alpha Radial Tires). This is a numerical example of the application of AID (see Green 1978). The explanatory variables (see the file ALPHMSTR CODEBOOK for definitions of variables) used were:
A2: Was Alpha the brand last purchased?;
A3: pre-exposure interest rating of Alpha radials;
C1: post-exposure believability rating of the Alpha commercial;
C2: post-exposure interest in Alpha radials.
The dependent variable is D: Is Alpha the brand of choice?

The tree is shown below.    The six terminal nodes are underlined.   These define six clusters into which the   n   respondents are grouped.  If we examine these terminal nodes,  we see that: 

1. The group consisting of only 15 respondents shows the highest p, .93. This group expresses high believability in the Alpha radial tire commercial and its members are all past purchasers of the Alpha brand.

2. The group of 38 respondents shows the lowest p, .03. This group:
     a. Does not express high believability in the Alpha commercial.
     b. Does not express high pre-exposure interest in Alpha.
     c. Is not made up of past purchasers of Alpha.
     d. Expresses low post-exposure interest in the Alpha brand.

                           -----------
                           | n = 252 |   (78 said they would buy;
                           | p = .31 |    78/252 = .31)
                           -----------
                                |
                      Split on Variable C1
                                |
               -----------------------------------
               |                                 |
      High Believability                       Other
               |                                 |
               V                                 V
          -----------                       -----------
          | n =  53 |                       | n = 199 |
          | p = .68 |                       | p = .21 |
          -----------                       -----------
               |                                 |
     Split on Variable A2              Split on Variable A3
       |               |                 |               |
Alpha Purchaser      Other             Other     High Pre-exposure
       |               |                 |           Interest
       V               V                 V               V
  -----------     -----------       -----------     -----------
  | n =  15 |     | n =  38 |       | n = 178 |     | n =  21 |
  | p = .93 |     | p = .58 |       | p = .17 |     | p = .57 |
  -----------     -----------       -----------     -----------
  -----------     -----------            |          -----------
                               Split on Variable A2
                                 |               |
                          Alpha Purchaser      Other
                                 V               V
                            -----------     -----------
                            | n =  21 |     | n = 157 |
                            | p = .38 |     | p = .14 |
                            -----------     -----------
                            -----------          |
                                       Split on Variable C2
                                                 |
                                         -----------------
                                         |               |
                                         |         Low Post-Exp.
                                       Other         Interest
                                         V               V
                                    -----------     -----------
                                    | n = 119 |     | n =  38 |
                                    | p = .18 |     | p = .03 |
                                    -----------     -----------
                                    -----------     -----------
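The node counts and proportions in the tree can be checked for internal consistency: at each split the children's counts should add up to the parent's n, and their pooled proportion of buyers should agree with the parent's p up to rounding. The sketch below does this in Python; the labels in the dictionary are descriptive names chosen here, and the (n, p) pairs are read off the tree.

```python
def pooled(children):
    """Pool (n, p) pairs from child nodes into the parent's (n, p)."""
    n = sum(c[0] for c in children)
    buyers = sum(c[0] * c[1] for c in children)
    return n, buyers / n

# Each entry: label -> (parent (n, p), list of child (n, p)), as read from the tree.
splits = {
    "root, split on C1": ((252, 0.31), [(53, 0.68), (199, 0.21)]),
    "high believability, split on A2": ((53, 0.68), [(15, 0.93), (38, 0.58)]),
    "other believability, split on A3": ((199, 0.21), [(178, 0.17), (21, 0.57)]),
    "other pre-exposure interest, split on A2": ((178, 0.17), [(21, 0.38), (157, 0.14)]),
    "non-purchasers, split on C2": ((157, 0.14), [(119, 0.18), (38, 0.03)]),
}

for label, (parent, kids) in splits.items():
    n, p = pooled(kids)
    print(label, parent, (n, round(p, 2)))
```

Each pooled pair matches the parent node to two decimals, so the tree as drawn is self-consistent. The same node values are what one reads off when estimating a conditional probability such as the proportion of buyers given high believability.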

EXERCISES ON CART

1.  Use the classification tree to estimate     P(B = 1|C1 = 1)      and      P(B = 0|C1 = 0).  

2.  Show how to adjust the probabilities  P(Buy | x)  if the overall proportion of  Buyers is  .10 instead of  .31.  

REFERENCES on CART  and  AID

Efron, Bradley, and Tibshirani, Robert (1991). "Statistical Data Analysis in the Computer Age," Science, Vol. 253, No. 5018 (26 July 1991), 390-395.

Green, Paul E. (1978). Analyzing Multivariate Data. Dryden Press, Hinsdale, IL; see esp. pp. 191-201 on AID.


Neural Networks

Basic Concepts of  Neural Networks

An interesting alternative to logistic regression (indeed,  to many problems) is that of  "neural networks, " which is in fact very similar to logistic regression. 

 

   (I1)                               (I1)    (H1)
       \                                          \
        \ w1                                       \ w1
         \                                          \
       w2 \                                       w2 \
   (I2)----(O)                        (I2)    (H2)----(O)
          /                                          /
         / w3                                       / w3
        /                                          /
   (I3)                               (I3)    (H3)

       (a)                                    (b)

Figure: (a) Neural net with 3 input nodes, I1, I2, I3, and one output, O. (b) Net with one hidden layer of nodes H1, H2, H3; arcs between I's and H's and their weights w(i, j) are not shown here.
 
 

Consider applying a neural net to a classification problem where the data are {x1j, x2j, ..., xpj, yj, j = 1, 2, ..., n} and each y is either 0 or 1, denoting the classification. There are p input nodes. In the j-th case, the inputs to the nodes are x1j, ..., xpj, j = 1, 2, ..., n.

Estimating a Neural Network Model

The computation starts with initial values for the w's, which will be updated as each case j is processed. For case j (j = 1, 2, ..., n) the predicted response corresponding to given weights wv, v = 1, 2, ..., p is

w1 x1j + ... + wp xpj  =  Lj,

say. The predicted value of yj is the logistic function of this linear combination,

    1/(1 + e^(-Lj))  =  pj,
say.

A loss function such as the square of the error yj - pj is to be minimized. In the simplest scheme, the weights are updated by moving in the opposite of the direction of the rate of change of the loss with respect to wv. That is,

new wv  =  old wv - (lrate)(lossder), 

where  lossder  =  rate of  change of  the loss with respect to wv

 and  

lrate  =  "learning rate, " a constant,  say 1/2. 

Such a scheme can be used  for machine learning (automatic learning) in any numerical context.   

In logistic regression the training algorithm (scheme for updating the weights) is

new wv  =  old wv + (lrate)(y-p)p(1-p)xv.

This is based on a squared-error loss function C(y-p)^2 with C = 1/2, for which lossder = -(y-p)p(1-p)xv.
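The updating scheme can be sketched for a net with no hidden layer as follows. The four training cases are made up, the first input is held at 1 so that its weight plays the role of a constant (an assumption of this sketch), and the loss is taken as (1/2)(y-p)^2 so that lossder = -(y-p)p(1-p)xv.

```python
import math

def predict(w, x):
    """Logistic function of the linear combination w1*x1 + ... + wp*xp."""
    L = sum(wv * xv for wv, xv in zip(w, x))
    return 1.0 / (1.0 + math.exp(-L))

def train(cases, n_inputs, lrate=0.5, epochs=500):
    """Update the weights case by case, moving against the loss derivative."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, y in cases:
            p = predict(w, x)
            for v in range(n_inputs):
                lossder = -(y - p) * p * (1.0 - p) * x[v]  # d/dwv of (1/2)(y-p)^2
                w[v] = w[v] - lrate * lossder              # new wv = old wv - (lrate)(lossder)
    return w

# Made-up cases: inputs (1, x) and class y; the class changes between x = 1 and x = 2.
cases = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 2.0), 1), ((1.0, 3.0), 1)]
w = train(cases, n_inputs=2)
p_low = predict(w, (1.0, 0.0))
p_high = predict(w, (1.0, 3.0))
print(w, p_low, p_high)
```

Because the weights are revised after every case, the same loop could be run on data arriving one case at a time.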

Modern statistical methods can be used on many of the same problems as neural networks. Standard and non-standard statistical techniques need to be compared with neural networks to understand the relative advantages and trade-offs among these different tools. The neural-nets approach may prove most useful when the data arrive sequentially.

A Neural Net is a non-interpretable black box. But maybe this could be remedied. For example, a Path Diagram could be solved with a Neural Net. This would combine the interpretability of the Path Diagram with the technology of the Neural Net.

Software  for Neural Nets includes NEURAL CONNECTION from SPSS.   

EXERCISES

1.   Derive the training algorithm above  for classification by logistic regression when the loss is squared error. 

2. (continuation) Do this when the loss is normalized squared-error,

C(y-p)^2/(pq), where q = 1-p.

3.   Derive a neural nets training algorithm  for the problem of  multiple linear regression.   

REFERENCES

Denning,  Peter J.   (1992).   "Neural Networks, " American Scientist,  Vol.   80,  No.   5,  pp.   426-429.   (A short,  readable introduction to the subject.  )

 

Neural Networks.  Journal.  Available in the UIC Math Library.   


Additional References

Friedman,  Jerome H.   (1997).     Data mining  and  statistics:  What's the connection?  29th Symposium on the Interface of  Computer Science & Statistics,  Houston,  TX. 


Copyright © 2002 Stanley Louis Sclove
Created  1998:  Oct 19        latest revision   2002:  Sept 21