University
of Illinois at Chicago
College of Business
Administration
Department of Information
& Decision Sciences
IDS 594
Special Topics in IDS: Statistical Finite Mixture Model -- Classification & Cluster Analysis
Instructor: Prof. Stanley L. Sclove
Text: McLachlan & Peel, Finite Mixture Models (recommended only)
Classification and Discrimination:
Addendum on Logistic Regression,
Classification Trees,
and Neural Networks
Including Qualitative Variables: Logistic Regression
Classification is a dependence model; that is, the variables are in two sets, a vector Y to be explained or predicted and an explanatory vector X. The dependent variable Y is binary; {Y = 1} corresponds to P1; {Y = 0}, to P2. The two groups to be discriminated are indicated by the binary dependent variable.
When one or more of the explanatory variables is non-numerical, discriminant analysis is inappropriate. Logistic regression can be used with any combination of explanatory variables, numerical, categorical, or some of each.
Example. A number of patients are to be diagnosed into one of several diseases. The variables include gender and prior history of a related disease (yes or no), which are categorical.
Let Px = P(Y = 1 | X = x) and Qx = P(Y = 0 | X = x) = 1 - P(Y = 1 | X = x). When the distribution of X for Y = 1 is multinormal with mean µ1, the distribution of X for Y = 0 is multinormal with mean µ2, and the two covariance matrices are equal, ln(Px/Qx) is linear in x. This suggests modeling the binary dependent variable Y by taking ln(Px/Qx) to be linear in x, even when the conditional distributions of X given Y = 0 and Y = 1 are not multinormal. The model
ln(P/Q) = b0 + b'x
is called the logistic regression model.
Consider the case where there are m distinct patterns of x, denoted by xi, i = 1, 2, . . . , m, with ni observations at xi. Note that then for the sample proportion pi we have
ln(pi/qi) = ln(Pi/Qi) + [ln(pi/qi) - ln(Pi/Qi)] = b0 + b'xi + ei ,
where
ei = ln(pi/qi) - ln(Pi/Qi)
can be shown to be approximately normal with mean 0 and variance 1/(ni Pi Qi).
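The variance quoted above follows from a first-order (delta-method) expansion. The following is a brief sketch, assuming that ni pi is binomial with parameters ni and Pi and writing g for the logit function:

```latex
\[
g(P) = \ln\frac{P}{1-P}, \qquad
g'(P) = \frac{1}{P} + \frac{1}{1-P} = \frac{1}{PQ},
\]
\[
\operatorname{Var}\bigl[g(p_i)\bigr] \approx \bigl[g'(P_i)\bigr]^2 \operatorname{Var}(p_i)
= \frac{1}{P_i^2 Q_i^2}\cdot\frac{P_i Q_i}{n_i}
= \frac{1}{n_i P_i Q_i}.
\]
```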
Function | Name | Range |
Px | prob. that Y = 1, given that X = x | 0 to 1 |
Px/Qx | Odds | 0 to infinity |
ln(Px/Qx) | log Odds, or "logit" | negative infinity to positive infinity |
A little algebra shows that if logit(P) = t, then
P = e^t/(1 + e^t), or 1/(1 + e^(-t)).
This function is called the logistic function. Also,
Q = 1/(1 + e^t), or e^(-t)/(1 + e^(-t)).
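The logit and its inverse can be checked numerically. The following is a minimal sketch in Python; the function names logit and logistic are chosen here for illustration only.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(t):
    """Inverse of the logit: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

p = 0.76
t = logit(p)                         # log odds of p
print(round(t, 3))                   # 1.153
print(round(logistic(t), 2))         # 0.76, recovering p
print(round(logistic(t) + logistic(-t), 6))   # P + Q = 1.0
```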
If there is more than one observation at each value x, then Px can be estimated by the sample proportion px and logit px can be regressed on x using weighted least squares.
Example. (tensile strength of wires). P = Px = probability of breaking at weight x.
Odds = P/Q, where Q = 1-P.
__________________________________________________________
Weight applied (lbs.): 10 20 30 40 50
no. of wires breaking: 4 8 18 76 90
no. of wires tested: 100 100 100 100 100
p, estimate of P: .04 .08 .18 .76 .90
q, estimate of Q: .96 .92 .82 .24 .10
Odds, p/q: .0417 .0870 .2195 3.167 9.000
Logit = ln(Odds): -3.18 -2.44 -1.52 +1.15 +2.197
_____________________________________________
The logit is regressed on x using weighted regression. The weight is 1/Var(y), where here y = logit(p) and Var(y) = Var[logit(p)], which can be shown to be approximately 1/(nx Px Qx), so the weight is nx Px Qx, which is estimated by nx px qx. The fitted regression equation is
predicted value of the logit = -11.51 + 0.156x,
and the estimated value of Px according to the model is
1 / [1 + exp(11.51 - 0.156x)].
The median breaking strength can be estimated as the value of x for which px = 1/2, that is, logit(px) = ln[(1/2)/(1/2)] = ln (1) = 0.
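The weighted regression of the logit on x can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the computation used for the notes: it takes the wire data exactly as tabulated, uses the estimated weights n*p*q, and the function name wls_logit_fit is hypothetical. Run on the table as printed, the slope it returns is close to the 0.156 quoted above; the intercept depends on the data exactly as tabulated and need not reproduce the printed value.

```python
import math

weights_lbs = [10, 20, 30, 40, 50]   # weight applied, x
broken = [4, 8, 18, 76, 90]          # number of wires breaking
tested = [100] * 5                   # number of wires tested, n_x

def wls_logit_fit(x, r, n):
    """Weighted least squares of the sample logit on x, with weight n*p*q."""
    p = [ri / ni for ri, ni in zip(r, n)]
    y = [math.log(pi / (1.0 - pi)) for pi in p]        # sample logits
    w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]  # estimated 1/Var(y)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

b0, b1 = wls_logit_fit(weights_lbs, broken, tested)
median_strength = -b0 / b1           # x at which the fitted logit is 0
print(b0, b1, median_strength)
```

The sketch requires every sample proportion to lie strictly between 0 and 1, since the sample logit is undefined otherwise.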
You can change this tensile strength example to a dose/response example.
Maximum likelihood estimation in logistic regression models
Maximum likelihood estimation can be used even when there are repeated observations at each pattern of x, but it must be used when there are no repeated observations. The likelihood L is maximized.
What is L ?
To see what L is, we begin with the simpler case of a sample of Bernoulli variables. If you have a sample
y1, y2, . . . , yn
of n independent 0, 1 variables with success probability P, then the likelihood can be written
L = Π P^(yj) Q^(1-yj),
where the product is for j = 1 to n. That is, the log likelihood is
Σ [yj log P + (1 - yj) log Q],
where the sum is for j = 1 to n.
The maximum likelihood estimator is the value of P which maximizes this; it of course turns out to be simply y-bar, or
(y1 + y2 + . . . + yn)/n .
In the logistic regression model, P = 1/[1 + exp(-b0 - b'x)], where b0 is the constant and b is the vector of logistic regression parameters, to be estimated. This is done by maximizing the likelihood by numerical methods, e.g., the Newton-Raphson method.
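As an illustration of the numerical maximization, here is a minimal Newton-Raphson sketch in Python for a single explanatory variable with grouped data (r successes out of n at each x). The function name fit_logistic_newton, the starting values of zero, and the fixed iteration count are assumptions made for the sketch; statistical packages add convergence checks and handle several variables.

```python
import math

def fit_logistic_newton(x, r, n, iterations=25):
    """Newton-Raphson for logit(P) = b0 + b1*x, with r successes out of n at each x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri, ni in zip(x, r, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = ri - ni * p           # contribution to the score
            v = ni * p * (1.0 - p)        # contribution to the information
            g0 += resid
            g1 += resid * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: b <- b + (information matrix)^(-1) * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Tensile-strength data from the example above
x = [10, 20, 30, 40, 50]
r = [4, 8, 18, 76, 90]
n = [100] * 5
b0, b1 = fit_logistic_newton(x, r, n)
print(b0, b1)
```

At the maximum the fitted numbers of successes satisfy the likelihood equations: their sum equals the observed total, and so does their sum weighted by x.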
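Testing is based on -2LL, where LL is the natural log of the maximized likelihood. It is based on the fact that the likelihood-ratio statistic
-2 ln(lambda) = (-2LL)reduced - (-2LL)full ,
where lambda is the ratio of the maximized likelihood of the reduced model to that of the full model, is for large n distributed approximately as chi-square with
pfull - preduced d.f.
when the reduced model holds, where the p's are the numbers of explanatory variables in the full and reduced models.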
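A small sketch of the test in Python follows, for the special case of one degree of freedom, where the chi-square upper-tail probability has the closed form erfc(sqrt(x/2)). The statistic is taken as (-2LL)reduced minus (-2LL)full so that it is nonnegative. The two -2LL values are hypothetical numbers chosen only to show the arithmetic, and the function name chi2_sf_1df is an assumption of the sketch.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical -2LL values for nested models differing by one explanatory variable
minus2ll_reduced = 120.0
minus2ll_full = 112.5

lr_stat = minus2ll_reduced - minus2ll_full   # likelihood-ratio statistic
p_value = chi2_sf_1df(lr_stat)
print(lr_stat, p_value)
```

For more than one degree of freedom a general chi-square routine from a statistical library would be needed.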
The higher the maximized value of L, the better the fit of the model. This is assessed on a log scale by computing -2 ln L, called -2LL; a lower -2LL means a better fit. (This criterion corresponds to residual sum of squares, i.e., sum of squared errors, in normal multiple regression models.) When there are several explanatory variables, different models can be assessed using -2LL as a figure-of-merit.
Model-selection criteria. Some of the model selection criteria consist of a penalty term added to -2LL. The penalty increases with the number of parameters used:
MSC(k) = -2LLk + a(n) mk ,
where k indexes the alternative models, say k = 1, 2, . . . , K; mk = the number of independent parameters in model k; and a(n) is the cost per parameter (cost in units of -2LL). The cost per parameter is a(n) = 2 for AIC (Akaike's information criterion) and a(n) = ln(n) for SC (Schwarz's criterion). Models giving low values of the MSC are the good ones.
Choice of a(n). Although Akaike did much innovative, ground-breaking work in this area, Schwarz's criterion is derived from a fairly compelling Bayesian argument, and many analysts, including me, prefer a(n) = ln(n). Note that for n at least 8, ln(n) exceeds 2, so for all practical purposes the cost per parameter in SC exceeds that in AIC, and so AIC will tend to choose larger models than does SC.
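The effect of the cost per parameter can be seen in a short Python sketch. The three (-2LL, number of parameters) pairs and the sample size n = 100 are hypothetical, invented only to show a case where the two criteria disagree; the function name msc is likewise an assumption.

```python
import math

def msc(minus2ll, m, cost_per_parameter):
    """Model-selection criterion: -2LL plus a(n) times the number of parameters."""
    return minus2ll + cost_per_parameter * m

n = 100
# Hypothetical (-2LL, number of parameters) for three nested models
models = {
    "x1 only": (118.0, 2),
    "x1 + x2": (114.5, 3),
    "x1 + x2 + x3": (112.0, 4),
}

aic = {name: msc(d, m, 2.0) for name, (d, m) in models.items()}
sc = {name: msc(d, m, math.log(n)) for name, (d, m) in models.items()}

best_aic = min(aic, key=aic.get)   # model with the lowest AIC
best_sc = min(sc, key=sc.get)      # model with the lowest SC
print(best_aic, best_sc)
```

With these made-up numbers AIC picks the largest model and SC the smallest, because each added parameter must lower -2LL by more than ln(100), about 4.6, to pay for itself under SC but only by more than 2 under AIC.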
Logistic regression implicitly uses prior probability estimates obtained from the sample. Thus the estimate b0 will contain a term ln(p1/p0), where p0 and p1 are the proportions of cases from Group 0 and Group 1 in the sample. If the appropriate prior probabilities are instead q0 and q1, then b0 should be adjusted:
new b0 = b0 - ln(p1/p0) + ln(q1/q0).
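The adjustment is a one-line computation. Below is a minimal Python sketch; the function name adjust_intercept and the numbers in the usage line (b0 = 0.3, a balanced sample, and a prior probability of 0.1 for Group 1) are hypothetical.

```python
import math

def adjust_intercept(b0, p1, q1):
    """Shift b0 from the sample proportion p1 of Group 1 to the prior probability q1."""
    p0, q0 = 1.0 - p1, 1.0 - q1
    return b0 - math.log(p1 / p0) + math.log(q1 / q0)

# Hypothetical: sample was half Group 1, but the appropriate prior is 0.1
new_b0 = adjust_intercept(0.3, 0.5, 0.1)
print(new_b0)
```

Only the constant changes; the coefficients of the explanatory variables are left as estimated.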
Power Analysis
(This section is from an e-mail to the UICSTATS-L listserve from Richard T. Campbell, Professor of Sociology and Director, Research Methods Core, Health Research and Policy Centers, UIC.)
A software package that does power analysis for logistic regression has just become available. The product is called PASS 2000, which is put out by the same people who do NCSS. The web site is http://www.ncss.com. The software contains the same set of elementary power analysis routines that one can find in several other packages, although the reporting formats and graphics seem to be superior to many other products. What is unique about PASS is that it contains power routines for logistic regression. One can compute power for dichotomous and continuous independent variables, specifying a correlation structure on the covariates and a distribution on the categorical X. There may be no other product that does this unless one computes "exemplary likelihood tests" in O'Brien's SAS-based Unifypow, which can be a bit difficult. The routines are based on a 1998 paper in Statistics in Medicine by Hsieh et al. Graphs corresponding to the printed output are easily available.
Here are a few lines of output from PASS. The column labeled R-squared refers to the correlation of the X in question with the remaining X's. The column labeled Pcnt X = 1 refers to the distribution of the dichotomous covariate.
One can fix P0, P1 or the odds ratio in setting up the power analysis.
Power    N      Pcnt X = 1    P0       P1       Odds Ratio    R-Squared    Alpha    Beta
0.524    200    40.000        0.050    0.150    3.353         0.300        0.050    0.476
0.505    200    50.000        0.050    0.150    3.353         0.300        0.050    0.495
0.594    300    20.000        0.050    0.150    3.353         0.300        0.050    0.406
0.660    300    30.000        0.050    0.150    3.353         0.300        0.050    0.340
0.685    300    40.000        0.050    0.150    3.353         0.300        0.050    0.315
0.678    300    50.000        0.050    0.150    3.353         0.300        0.050    0.322
0.694    400    20.000        0.050    0.150    3.353         0.300        0.050    0.306
0.768    400    30.000        0.050    0.150    3.353         0.300        0.050    0.232
0.799    400    40.000        0.050    0.150    3.353         0.300        0.050    0.201
0.800    400    50.000        0.050    0.150    3.353         0.300        0.050    0.200
-----------------------------------------------------------------------------------------------------------
Alternatives to the Logistic Regression Model
The function logit(P), which is the log-odds, ln(P/Q), is called a link function, because it links the expected value P of p to a linear model. Note that the logit link has a symmetry in P and Q:
logit(Q) = ln(Q/P) = - ln(P/Q) = - logit(P).
The corresponding inverse function, P = 1/[1 + exp(-t)], where t = b0 + b'x, may or may not follow the data very well. Alternative models are provided by Pregibon's functions.
Pregibon's curves are
(P^g + Q^(-g)) for g different from 0
and
ln(P) - ln(Q) = logit(P) for g = 0.
Exercise. Recall that the family of Box-Cox transformations is defined as
(x^λ - 1)/λ
rather than just x^λ so that by continuity at λ = 0 the family includes ln(x). Similarly, show that
lim (g -> 0) (P^g + Q^(-g) - 2)/g = ln(P) - ln(Q) = logit(P).
Exercise. Find the delta-method approximation to the variance of g(p), where p is the sample proportion and g(P) is Pregibon's link function.
Exercises on Logistic Regression
L(x1, x2) = .3 - .2 x1 + .1 x2,
where
L(x1, x2) = ln{P(Y = 1|X1 = x1, X2 = x2)/P(Y = 0|X1 = x1, X2 = x2)}
Y = 1 denotes bankruptcy within three years of startup,
Y = 0 denotes staying in business more than three years,
X1 = 1 if franchised, 0 if not,
X2 = 1 if a fast-food restaurant, 0 if not.
1. What is the value of L(0, 1) ?
(A) .1 (B) .2 (C) .3 (D) .4 (E) .5
2. The number P(Y = 1|X1 = 0, X2 = 1) is the probability that
(A) a franchised fast-food restaurant is bankrupt within three years.
(B) a franchised non-fast food restaurant is not bankrupt within three years.
(C) a non-franchised fast-food restaurant is bankrupt within three years.
(D) a non-franchised non-fast food restaurant is bankrupt within three years.
(E) a fast-food restaurant is franchised.
3. (continuation) What is the value of this probability ?
(A) .401 (B) .450 (C) .550 (D) .599 (E) .67
4. If P = 1/2, what is the value of the logit function, ln(P/Q), where Q = 1-P ?
(A) -1 (B) -1/2 (C) 0 (D) 1/2 (E) 1
5. For x > 0, e^(ln(x)) = ?
(A) e^x (B) ln(x) (C) x (D) 0 (E) 1
6. ln(e^x) = ?
(A) e^x (B) ln(x) (C) x (D) 0 (E) 1
7. ln(2e) =
(A) 1/2 (B) 1 (C) 2 (D) 3 (E) 1 + ln(2)
8. Which of the following is closest to the value of the number 1/e ?
(A) 1/3 (B) 0.368 (C) 1/2 (D) 3 (E) 3.14159
9. If x = 0, what is the value of the logistic function, 1/(1 + e^(-x)) ?
(A) -1 (B) -1/2 (C) 0 (D) 1/2 (E) 1
___________________________________________________________
References
Allison, Paul. SAS book on logistic regression.
Pregibon, Daryl (1981). Logistic regression diagnostics. Ann. Statist. 9:4, 705-724.
Pregibon, Daryl (1984). JASA, 79, 61-83.
___________________________________________________________
This section concerns a statistical technique, AID (Automatic Interaction Detection), which was much discussed for a while about a quarter-century ago and has since been rediscovered, further developed, and named CART (Classification & Regression Trees).
Classification. Statistical "classification" seeks a rule to predict accurately the class of each new observation. A prediction rule is constructed using information from a "training set", a sample in which the true class of each observation is known. In discriminant analysis, a goal is to derive a function of the explanatory variables which is a classification index.
AID (Automatic Interaction Detection), recursive partitioning and CART (Classification and Regression Trees) are three names for similar techniques for classification. They are alternatives to discriminant analysis.
In these three techniques, the sample is subdivided successively according to the observed variables. In marketing, such subdivision is one way of "segmenting the market".
In its simplest form, recursive partitioning separates units of the initial group into two subgroups contingent upon the value of one of the variables. All possible splits of this type are considered and the one which best separates the data into groups homogeneous in class is chosen. A chi-square or F statistic is used to measure the separation. This process then continues recursively. This is illustrated below with a marketing example from part of the Alpha radial tires case in Green (1978).
Efron and Tibshirani (1991) write about CART (and other computer-intensive statistical techniques).
Software for AID includes SPSS's CHAID (AID using chi-square for categorical variables).
Example. (Alpha Radial Tires). This is a numerical example of the application of AID (see Green 1978). The explanatory variables (see the file ALPHMSTR CODEBOOK for definitions of variables) used were:
A2: Was Alpha the brand last purchased?
A3: pre-exposure interest rating of Alpha radials;
C1: post-exposure believability rating of the Alpha commercial;
C2: post-exposure interest in Alpha radials.
The dependent variable is D: Is Alpha the brand of choice?
The tree is shown below. The six terminal nodes are underlined. These define six clusters into which the n respondents are grouped. If we examine these terminal nodes, we see that:
1. The group consisting of only 15 respondents shows the highest p, .93. This group expresses high believability in the Alpha radial tire commercial and its members are all past purchasers of the Alpha brand.
2. The group of 38 respondents shows the lowest p, .03. This group:
a. Does not express high believability in the Alpha commercial.
b. Does not express high pre-exposure interest in Alpha.
c. Is not made up of past purchasers of Alpha.
d. Expresses low post-exposure interest in the Alpha brand.
n = 252, p = .31   (78 said they would buy; 78/252 = .31)
  Split on Variable C1
    High Believability: n = 53, p = .68
      Split on Variable A2
        Alpha Purchaser: n = 15, p = .93
        --------------------------------
        Other: n = 38, p = .58
        ----------------------
    Other: n = 199, p = .21
      Split on Variable A3
        Other: n = 178, p = .17
          Split on Variable A2
            Alpha Purchaser: n = 21, p = .38
            --------------------------------
            Other: n = 157, p = .14
              Split on Variable C2
                Other: n = 119, p = .18
                -----------------------
                Low Post-Exp. Interest: n = 38, p = .03
                ---------------------------------------
        High Pre-exposure Interest: n = 21, p = .57
        -------------------------------------------
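The separation achieved by a candidate split can be measured with the chi-square statistic for the resulting 2 x 2 table (subgroup by buy / not buy). The Python sketch below applies this to the first split on C1 in the tree, whose child nodes have n = 53, p = .68 and n = 199, p = .21. The buyer counts 36 and 42 are recovered here by rounding n times p, which is an assumption, and the function name chi_square_2x2 is chosen for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Child nodes of the C1 split; buyer counts recovered by rounding n * p
high_buy, high_n = 36, 53       # High Believability: .68 * 53
other_buy, other_n = 42, 199    # Other: .21 * 199

stat = chi_square_2x2(high_buy, high_n - high_buy,
                      other_buy, other_n - other_buy)
print(stat)
```

In a full recursive-partitioning program this statistic would be computed for every candidate split at a node, and the split with the largest value would be chosen.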
EXERCISES ON CART
1. Use the classification tree to estimate P(B = 1|C1 = 1) and P(B = 0|C1 = 0).
2. Show how to adjust the probabilities P(Buy | x) if the overall proportion of Buyers is .10 instead of .31.
REFERENCES on CART and AID
Efron, Bradley, and Tibshirani, Robert (1991). "Statistical Data Analysis in the Computer Age," Science, Vol. 253, No. 5018 (26 July 1991), 390-395.
Green, Paul E. (1978). Analyzing Multivariate Data. Dryden Press, Hinsdale, IL; see esp. pp. 191-201 on AID.
An interesting alternative to logistic regression (indeed, an approach to many problems) is that of "neural networks," which is in fact very similar to logistic regression.
(I1)                      (I1)   (H1)
    \ w1                             \ w1
     \                                \
(I2)--w2--(O)             (I2)   (H2)--w2--(O)
     /                                /
    / w3                             / w3
(I3)                      (I3)   (H3)

    (a)                          (b)
Figure: (a) Neural net with 3 input nodes, I1, I2, I3, and one output, O. (b) Net with one hidden layer of nodes H1, H2, H3; arcs between I's and H's and their weights w(i, j) are not shown here.
Consider applying a neural net to a classification problem where the data are {x1j, x2j, . . . , xpj, yj}, j = 1, 2, . . . , n, and each y is either 0 or 1, denoting the classification. There are p input nodes. In the j-th case, the inputs to the nodes are x1j, . . . , xpj.
The computation starts with initial values for the w's, which will be updated as each case j is processed. For case j (j = 1, 2, . . . , n) the predicted response corresponding to given weights wv, v = 1, 2, . . . , p, is
w1 x1j + . . . + wp xpj = Lj ,
say. The predicted value of yj is the logistic function of this linear combination,
1/(1 + e^(-Lj)) = pj,
say.
A loss function such as the square of the error yj - pj is to be minimized. In the simplest scheme, the weights are updated by moving in the opposite of the direction of the rate of change of the loss with respect to wv. That is,
new wv = old wv - (lrate)(lossder),
where lossder = rate of change of the loss with respect to wv
and
lrate = "learning rate, " a constant, say 1/2.
Such a scheme can be used for machine learning (automatic learning) in any numerical context.
In logistic regression the training algorithm (scheme for updating the weights) is
new wv = old wv + (lrate)(y-p)p(1-p)xv.
This is based on a squared-error loss function C(y-p)^2 with C = 1/2, for which lossder = -(y-p)p(1-p)xv.
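The updating scheme can be sketched in Python for a single logistic output node. Each weight is moved opposite the rate of change of the loss (1/2)(y-p)^2, which amounts to adding (lrate)(y-p)p(1-p)xv to wv. The function names, the number of passes through the data, and the four training cases are hypothetical, chosen only to make the sketch runnable; the first input is a constant 1 so that its weight acts as the intercept.

```python
import math

def train_logistic_unit(cases, lrate=0.5, epochs=200):
    """Case-by-case gradient descent on the loss (1/2)(y - p)^2 for one logistic node."""
    w = [0.0] * len(cases[0][0])          # initial weights
    for _ in range(epochs):
        for x, y in cases:
            L = sum(wv * xv for wv, xv in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-L))
            # lossder for weight v: -(y - p) * p * (1 - p) * x_v
            lossder = [-(y - p) * p * (1.0 - p) * xv for xv in x]
            # new w_v = old w_v - lrate * lossder
            w = [wv - lrate * dv for wv, dv in zip(w, lossder)]
    return w

def predict(w, x):
    """Predicted probability that y = 1 for inputs x under weights w."""
    return 1.0 / (1.0 + math.exp(-sum(wv * xv for wv, xv in zip(w, x))))

# Hypothetical toy data: (inputs, class); the first input is the constant 1
cases = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 2.0), 1), ((1.0, 3.0), 1)]
w = train_logistic_unit(cases)
print(w, predict(w, (1.0, 0.0)), predict(w, (1.0, 3.0)))
```

Because the weights are updated after every case, the same loop can be applied to data that arrive one case at a time.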
Modern statistical methods can be used on many of the same problems as neural networks. Standard and non-standard statistical techniques need to be compared with neural networks to understand the relative advantages and trade-offs among these different tools. The neural-nets approach may prove most useful when the data arrive sequentially.
A neural net is a non-interpretable black box, but maybe this could be remedied. For example, a path diagram could be solved with a neural net. This would combine the interpretability of the path diagram with the technology of the neural net.
Software for Neural Nets includes NEURAL CONNECTION from SPSS.
1. Derive the training algorithm above for classification by logistic regression when the loss is squared error.
2. (continuation) Do this when the loss is normalized squared error,
C(y-p)^2/(pq), where q = 1-p.
3. Derive a neural nets training algorithm for the problem of multiple linear regression.
Denning, Peter J. (1992). "Neural Networks," American Scientist, Vol. 80, No. 5, pp. 426-429. (A short, readable introduction to the subject.)
Neural Networks. Journal. Available in the UIC Math Library.
Friedman, Jerome H. (1997). Data mining and statistics: What's the connection? 29th Symposium on the Interface of Computer Science & Statistics, Houston, TX.