| Function | Name | Range |
| $P_x$ | probability that Y = 1, given that X = x | 0 to 1 |
| $P_x/Q_x$, where $Q_x = 1 - P_x$ | odds | 0 to infinity |
| $\ln(P_x/Q_x)$ | log odds, or "logit" | negative infinity to positive infinity |
A little algebra shows that if $\mathrm{logit}(P) = z$, then $P = e^{z}/(1+e^{z})$, or equivalently $P = 1/(1+e^{-z})$. This function is called the logistic function.
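The relationship between the logit and the logistic function can be checked numerically. The following is a minimal Python sketch; the function names `logit` and `logistic` are chosen here for illustration, and the probabilities used are arbitrary.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Inverse of the logit: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Round trip: probability -> odds -> logit -> back to probability
for p in (0.04, 0.5, 0.9):
    z = logit(p)
    print(f"p = {p:.2f}  odds = {p / (1 - p):.4f}  "
          f"logit = {z:+.3f}  logistic(logit) = {logistic(z):.2f}")
```

The logit maps probabilities below 1/2 to negative values and probabilities above 1/2 to positive values, and the logistic function undoes the mapping.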
| Weight applied (lbs.) | 10 | 20 | 30 | 40 | 50 |
| Number of wires breaking | 4 | 8 | 18 | 76 | 90 |
| Number of wires tested at this weight | 100 | 100 | 100 | 100 | 100 |
| p, estimate of P | .04 | .08 | .18 | .76 | .90 |
| q, estimate of Q | .96 | .92 | .82 | .24 | .10 |
| Odds, p/q | .0417 | .0870 | .2195 | 3.167 | 9.000 |
| Logit = ln(Odds) | -3.18 | -2.44 | -1.52 | +1.15 | +2.197 |
The logit is regressed on x (the weight applied), using weighted rather than ordinary regression. The weight is $1/\mathrm{Var}(y)$, where here $y = \mathrm{logit}(p)$. It can be shown that $\mathrm{Var}[\mathrm{logit}(p)]$ is approximately $1/(NPQ)$, so the weight is $NPQ$, which is estimated by $Npq$. The fitted regression equation is
$\widehat{\mathrm{logit}} = -11.51 + 0.156 \times \mathrm{Weight}$.
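The weighted regression of the logit on weight can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the computation used for the notes: it takes the wire-breaking counts listed above, uses Npq as the regression weight, and the function name `weighted_logit_fit` is made up here. With these counts the slope comes out near 0.157, in line with the 0.156 quoted above; the intercept is printed so that it can be checked against the quoted equation rather than assumed to match it.

```python
import math

weights_lbs = [10, 20, 30, 40, 50]   # x: weight applied (lbs.)
broken = [4, 8, 18, 76, 90]          # number of wires breaking
tested = [100] * 5                   # N: wires tested at each weight

def weighted_logit_fit(x, broken, tested):
    """Regress logit(p) on x with weights N*p*q; return (intercept, slope)."""
    y, w = [], []
    for b, n in zip(broken, tested):
        p = b / n
        q = 1.0 - p
        y.append(math.log(p / q))    # observed logit
        w.append(n * p * q)          # estimated 1 / Var(logit(p))
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

b0, b1 = weighted_logit_fit(weights_lbs, broken, tested)
print(f"fitted logit = {b0:.2f} + {b1:.3f} * Weight")
print(f"weight at which the fitted logit is 0: {-b0 / b1:.1f} lbs.")
```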
The median breaking strength can be estimated as the value of x for which $p_x = 1/2$, that is, $\mathrm{logit}(p_x) = 0$; for a fitted line $b_0 + b_1 x$ this is $x = -b_0/b_1$.
Maximum likelihood estimation is used when there are not repeated observations at each pattern of x. The likelihood L is maximized over the parameters, and the higher the maximized value of L, the better the fit of the model. Fit is assessed on a log scale by computing -2 log L, called -2LL; smaller values of -2LL indicate better fit. (This criterion plays the role that the residual sum of squares plays in normal multiple regression models.) When there are several explanatory variables, different models can be compared using -2LL as a figure of merit. A penalty term can be added which increases with the number of parameters used. In AIC (Akaike's information criterion) this penalty is 2 times the number of parameters. In Schwarz's criterion the penalty is the natural log of n times the number of parameters.
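The penalized criteria are simple to compute once the maximized log-likelihood is known. The sketch below is illustrative only: the log-likelihood values, parameter counts, and sample size are hypothetical numbers invented for the example, not results from any data set in these notes.

```python
import math

def aic(log_likelihood, n_params):
    """-2LL plus a penalty of 2 per parameter."""
    return -2.0 * log_likelihood + 2.0 * n_params

def schwarz(log_likelihood, n_params, n_obs):
    """-2LL plus a penalty of ln(n) per parameter."""
    return -2.0 * log_likelihood + math.log(n_obs) * n_params

# Hypothetical maximized log-likelihoods for two candidate models, n = 100
candidates = {"model with 2 parameters": (-70.0, 2),
              "model with 5 parameters": (-66.5, 5)}
for name, (ll, k) in candidates.items():
    print(f"{name}: -2LL = {-2 * ll:.1f}, AIC = {aic(ll, k):.1f}, "
          f"Schwarz = {schwarz(ll, k, 100):.1f}")
```

For both criteria smaller is better. Because ln(n) exceeds 2 once n is 8 or more, Schwarz's criterion penalizes extra parameters more heavily than AIC in all but very small samples.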
Here is an explanation of maximum likelihood estimation in logistic regression models. We begin with the simpler case of a sample of Bernoulli variables. If you have a sample
$y_1, y_2, \ldots, y_N$
of 0,1 variables with success probability P, then the likelihood can be written
$L = P^{y_1} Q^{1-y_1} \times P^{y_2} Q^{1-y_2} \times \cdots \times P^{y_N} Q^{1-y_N}$, where $Q = 1 - P$.
The maximum likelihood estimator is the value of P which maximizes this; it turns out to be simply
$(y_1 + y_2 + \cdots + y_N)/N$.
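The step from the likelihood to the estimator can be written out as a short derivation. The symbol S for the number of successes is shorthand introduced here; it does not appear elsewhere in these notes.

```latex
\begin{align*}
L(P) &= \prod_{i=1}^{N} P^{y_i} (1-P)^{1-y_i} = P^{S} (1-P)^{N-S},
        \qquad S = \sum_{i=1}^{N} y_i, \\
\ln L(P) &= S \ln P + (N-S) \ln(1-P), \\
\frac{d}{dP} \ln L(P) &= \frac{S}{P} - \frac{N-S}{1-P} = 0
        \;\Longrightarrow\; \hat{P} = \frac{S}{N}.
\end{align*}
```

Since $\ln L$ is concave in P, the stationary point is the maximum, so the estimator is just the sample proportion of 1's.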
In the logistic regression model, $P_x = 1/[1+\exp(-B_0 - B^{T}x)]$, where $B_0$ is the constant and $B$ is the vector of logistic regression parameters, to be estimated. This is done by maximizing the likelihood by numerical methods.
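One common numerical method for this maximization is Newton-Raphson. The sketch below is a minimal illustration, assuming a single explanatory variable and grouped binomial counts; it is not necessarily the algorithm any particular package uses, and the function name `fit_logistic` is made up for the example. The wire-breaking counts are used only as convenient input.

```python
import math

def fit_logistic(x, successes, trials, max_iter=25, tol=1e-10):
    """Newton-Raphson for P_x = 1/(1 + exp(-(b0 + b1*x))) with binomial counts."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0                # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0        # information matrix (negative Hessian)
        for xi, s, n in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = s - n * p        # observed minus expected successes
            var = n * p * (1.0 - p)
            g0 += resid
            g1 += resid * xi
            h00 += var
            h01 += var * xi
            h11 += var * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # solve the 2x2 Newton system
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

x = [10, 20, 30, 40, 50]
broke = [4, 8, 18, 76, 90]
tested = [100] * 5
b0, b1 = fit_logistic(x, broke, tested)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
```

At the maximum, the fitted numbers of successes add up to the observed number, which gives a quick check that the iteration has converged.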
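Testing is based on -2LL, where LL is the natural log of the maximized likelihood. A reduced model (one with some of the parameters omitted) is compared with the full model using the fact that, when the reduced model holds and N is large, $(-2LL)_{\mathrm{reduced}} - (-2LL)_{\mathrm{full}}$ is distributed approximately as chi-square with $n_{\mathrm{full}} - n_{\mathrm{reduced}}$ d.f., where $n_{\mathrm{full}}$ and $n_{\mathrm{reduced}}$ are the numbers of parameters in the two models.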
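A test of this kind can be sketched as follows. The -2LL values and parameter counts in the usage line are hypothetical, chosen only to show the arithmetic, and the helper names are made up here. The chi-square tail probability is computed from the standard recurrence for the incomplete gamma function, which is exact for integer d.f.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)                  # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))       # exact tail for df = 1
    while a < df / 2.0:                           # step up two d.f. at a time
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(m2ll_reduced, m2ll_full, n_reduced, n_full):
    """Return (chi-square statistic, d.f., p-value) for nested models."""
    stat = m2ll_reduced - m2ll_full               # nonnegative for nested models
    df = n_full - n_reduced
    return stat, df, chi2_sf(stat, df)

# Hypothetical values: reduced model with 2 parameters, full model with 5
stat, df, p_value = likelihood_ratio_test(140.0, 133.0, 2, 5)
print(f"chi-square = {stat:.2f} on {df} d.f., p = {p_value:.4f}")
```

A small p-value suggests that the extra parameters in the full model improve the fit by more than chance alone would explain.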
new $b_0 = b_0 - \ln(p_1/p_0) + \ln(q_1/q_0)$.
Example. There were 113 applicants for charge accounts at a department store in a large city in the northeastern U.S. The variables include gender and marital status, which are categorical and hence require Logistic Regression rather than MDA.
Classification. Statistical "classification" seeks a rule to predict accurately the class of each new observation. A prediction rule is constructed using information from a "training set", a sample in which the true class of each observation is known. In discriminant analysis, a goal is to derive a function of the explanatory variables which is a classification index.
In these three techniques, the sample is subdivided successively according to the observed variables. In marketing, such subdivision is one way of "segmenting the market".
In its simplest form, recursive partitioning separates the units of the initial group into two subgroups according to the value of one of the variables. All possible splits of this type are considered, and the one which best separates the data into groups homogeneous in class is chosen. A chi-square or F statistic is used to measure the separation. The process then continues recursively within each subgroup. This is illustrated below with a marketing example from part of the Alpha radial tires case in Green (1978).
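A single step of this search can be sketched in Python. The sketch assumes 0/1 explanatory variables and a 0/1 class, and uses the Pearson chi-square of the 2x2 table as the measure of separation; the data, the variable names `u` and `v`, and the function names are all hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def best_split(rows, target, candidates):
    """Return (variable, chi-square) for the 0/1 variable that best separates target."""
    best = (None, -1.0)
    for var in candidates:
        a = sum(1 for r in rows if r[var] == 1 and r[target] == 1)
        b = sum(1 for r in rows if r[var] == 1 and r[target] == 0)
        c = sum(1 for r in rows if r[var] == 0 and r[target] == 1)
        d = sum(1 for r in rows if r[var] == 0 and r[target] == 0)
        stat = chi_square_2x2(a, b, c, d)
        if stat > best[1]:
            best = (var, stat)
    return best

# Hypothetical toy data: two 0/1 explanatory variables and a 0/1 class
rows = ([{"u": 1, "v": 1, "cls": 1}] * 8 + [{"u": 1, "v": 0, "cls": 1}] * 6 +
        [{"u": 1, "v": 0, "cls": 0}] * 2 + [{"u": 0, "v": 1, "cls": 0}] * 7 +
        [{"u": 0, "v": 0, "cls": 0}] * 5 + [{"u": 0, "v": 1, "cls": 1}] * 2)
variable, stat = best_split(rows, "cls", ["u", "v"])
print(f"split on {variable}, chi-square = {stat:.2f}")
```

To continue recursively, the same search would be applied to the rows with the chosen variable equal to 1 and, separately, to the rows with it equal to 0, stopping when a subgroup is too small or no split gives a large enough statistic.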
Efron and Tibshirani (1991) write about CART (and other computer-intensive statistical techniques).
Software for AID includes SPSS's CHAID (AID using chi-square for categorical variables).
Example. Alpha Radial Tires (Green 1978). This is a numerical example of the application of AID. The explanatory variables used (see the file ALPHMSTR CODEBOOK for definitions of variables) were:
A2: Was Alpha the brand last purchased?;
A3: pre-exposure interest rating of Alpha radials;
C1: post-exposure believability rating of the Alpha commercial;
C2: post-exposure interest in Alpha radials.
The dependent variable is D: Is Alpha the brand of choice?
The tree is shown below. The terminal nodes are underlined. These define six clusters into which the n respondents are grouped.
If we examine these terminal nodes, we see that:
1. The group consisting of only 15 respondents shows the highest p, .93. This group expresses high believability in the Alpha radial tire commercial and its members are all past purchasers of the Alpha brand.
2. One group of 38 respondents shows the lowest p, .03. This group:
a. Does not express high believability in the Alpha commercial.
b. Does not express high pre-exposure interest in Alpha.
c. Is not made up of past purchasers of Alpha.
d. Expresses low post-exposure interest in the Alpha brand.
                            -----------
                            | n = 252 |
                            | p = .31 |
                            -----------
                                 |
                       Split on Variable C1
                                 |
              ---------------------------------------
              |                                     |
     High Believability                           Other
              |                                     |
              V                                     V
         -----------                           -----------
         | n = 53  |                           | n = 199 |
         | p = .68 |                           | p = .21 |
         -----------                           -----------
              |                                     |
    Split on Variable A2                  Split on Variable A3
       |             |                     |                 |
Alpha Purchaser    Other                 Other       High Pre-exposure
       |             |                     |             Interest
       V             V                     V                 V
  -----------   -----------           -----------       -----------
  | n = 15  |   | n = 38  |           | n = 178 |       | n = 21  |
  | p = .93 |   | p = .58 |           | p = .17 |       | p = .57 |
  -----------   -----------           -----------       -----------
  -----------   -----------                |            -----------
                                 Split on Variable A2
                                    |             |
                             Alpha Purchaser    Other
                                    V             V
                               -----------   -----------
                               | n = 21  |   | n = 157 |
                               | p = .38 |   | p = .14 |
                               -----------   -----------
                               -----------        |
                                        Split on Variable C2
                                                  |
                                           ---------------
                                           |             |
                                           |       Low Post-Exp.
                                         Other        Interest
                                           V             V
                                      -----------   -----------
                                      | n = 119 |   | n = 38  |
                                      | p = .18 |   | p = .03 |
                                      -----------   -----------
                                      -----------   -----------
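Once a tree like this has been grown, using it is a matter of following the splits from the top. The sketch below encodes the Alpha tree as nested Python tuples and looks up the terminal node for one respondent. The n and p values are those shown in the tree; the answer codes ("high", "alpha", "low", "other") are labels invented for this illustration and are not the codes in the ALPHMSTR codebook.

```python
# Internal node: (variable, {answer: subtree}); terminal node: (n, p).
ALPHA_TREE = ("C1", {
    "high": ("A2", {"alpha": (15, 0.93), "other": (38, 0.58)}),
    "other": ("A3", {
        "high": (21, 0.57),
        "other": ("A2", {
            "alpha": (21, 0.38),
            "other": ("C2", {"low": (38, 0.03), "other": (119, 0.18)}),
        }),
    }),
})

def classify(respondent, node=ALPHA_TREE):
    """Follow the splits for one respondent; return (n, p) of the terminal node."""
    while isinstance(node[1], dict):
        variable, branches = node
        node = branches[respondent[variable]]
    return node

def terminal_nodes(node=ALPHA_TREE):
    """List the (n, p) pairs of all terminal nodes, left to right."""
    if not isinstance(node[1], dict):
        return [node]
    return [leaf for child in node[1].values() for leaf in terminal_nodes(child)]

print(classify({"C1": "high", "A2": "alpha"}))
print(sum(n for n, _ in terminal_nodes()), "respondents in",
      len(terminal_nodes()), "terminal nodes")
```

The estimated probability p at the terminal node serves as the prediction for any new respondent who falls into that segment.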
Green, Paul E. (1978). Analyzing Multivariate Data. Dryden Press, Hinsdale, IL; see esp. pp. 191-201 on AID.
An interesting alternative to logistic regression (indeed, to many problems) is that of "neural networks," which is in fact very similar to logistic regression.
(I1)                      (I1)    (H1)
    \                                 \
      \ w1                              \ w1
        \                                 \
          \                                 \
      w2    \                           w2    \
(I2)---------(O)          (I2)    (H2)---------(O)
            /                                 /
          /                                 /
        / w3                              / w3
      /                                 /
    /                                 /
(I3)                      (I3)    (H3)
      (a)                           (b)
Figure:
(a) Neural net with 3 input nodes, I1, I2, I3, and one output, O.
(b) Net with one hidden layer of nodes H1, H2, H3; arcs between I's and H's and their weights w(i,j) are not shown here.
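Consider applying a neural net to a classification problem where the data are $\{x_{1i}, x_{2i}, \ldots, x_{ni}, y_i;\ i = 1, 2, \ldots, N\}$ and each $y_i$ is either 0 or 1, denoting the classification. There are n input nodes. In the i-th case, the inputs to the nodes are $x_{1i}, \ldots, x_{ni}$. With weights $w_1, \ldots, w_n$ on the arcs, as in part (a) of the figure, let $L_i = w_1 x_{1i} + \cdots + w_n x_{ni}$ denote the weighted sum of the inputs reaching the output node. The output is then
$1/(1 + e^{-L_i}) = p_i$,
say.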
A loss function such as the square of the error $y_i - p_i$ is to be minimized. In the simplest scheme, each weight $w_v$ is updated by moving it in the direction opposite to the rate of change of the loss with respect to $w_v$. That is,
new $w_v = w_v - c\,\partial(\mathrm{loss})/\partial w_v$,
where c is a small positive step size.
Such a scheme can be used for machine learning (automatic learning) in any numerical context.
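In logistic regression with squared-error loss $(y_i - p_i)^2$, where $p_i = 1/(1 + e^{-L_i})$ and $L_i$ is the weighted sum $w_1 x_{1i} + \cdots + w_n x_{ni}$, the training algorithm (scheme for updating the weights) is
new $w_v = w_v + 2c\,(y_i - p_i)\,p_i (1 - p_i)\,x_{vi}$,
where c is a positive step size.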
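The weight-updating scheme described above can be sketched for a net with a single logistic output and squared-error loss. This is a minimal illustration under stated assumptions: the cases are processed one at a time, the step size and number of passes are arbitrary choices, the data are hypothetical, and the first input is a constant 1 so that its weight acts as an intercept.

```python
import math

def train(cases, n_inputs, step=0.2, epochs=2000):
    """Case-by-case gradient descent on (y - p)^2, p = 1/(1 + exp(-L)), L = sum w_v x_v."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, y in cases:
            L = sum(wv * xv for wv, xv in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-L))
            # The derivative of (y - p)^2 with respect to w_v is
            # -2 (y - p) p (1 - p) x_v; move each weight the opposite way.
            scale = step * 2.0 * (y - p) * p * (1.0 - p)
            for v in range(n_inputs):
                w[v] += scale * x[v]
    return w

def predict(w, x):
    """Output of the net for one case."""
    return 1.0 / (1.0 + math.exp(-sum(wv * xv for wv, xv in zip(w, x))))

# Hypothetical cases: (inputs, class); inputs[0] is the constant 1
cases = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 2.0], 1), ([1.0, 3.0], 1)]
w = train(cases, n_inputs=2)
for x, y in cases:
    print(f"x = {x[1]:.0f}  y = {y}  p = {predict(w, x):.3f}")
```

Because each case updates the weights as soon as it is seen, the same loop can keep running as new cases arrive, which is the sequential setting in which this approach is most natural.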
Modern statistical methods can be used on many of the same problems as neural networks. Standard and non-standard statistical techniques need to be compared with neural networks to understand the relative advantages and trade-offs among these different tools. The neural-nets approach may prove most useful when the data arrive sequentially.
A neural net is a non-interpretable black box. But maybe this could be remedied. For example, a Path Diagram could be solved with a neural net. This would combine the interpretability of the Path Diagram with the technology of the neural net.
Software for Neural Nets includes NEURAL CONNECTION from SPSS.
1. Derive the training algorithm above for classification by logistic regression when the loss is squared error.
2. (continuation) Do this when the loss is normalized squared error, $C(y-p)^2/(pq)$, where $q = 1-p$.
3. Derive a neural nets training algorithm for the problem of multiple linear regression.
Denning, Peter J. (1992). "Neural Networks," American Scientist, Vol. 80, No. 5, pp. 426-429. (A short, readable introduction to the subject.)
Neural Networks. Journal. Available in the UIC Math Library.