| IDS 594 Special Topics in IDS: | Statistical Finite Mixture Models -- Classification & Cluster Analysis |
| Fall Semester, 2002 | Sclove |
TABLE. Joint Distribution of X and Y in P0 and P1

          Probability function     Probability function     Likelihood ratio,
          under P0, f0(x,y)        under P1, f1(x,y)        f1(x,y) / f0(x,y)
          ---------------------    ---------------------    -------------------------
                    y                        y                          y
   x   |    9   10   11   12         9   10   11   12          9     10     11     12
 ______|_____________________________________________________________________________
       |
  64   |  .04  .08  .12  .16       .01  .02  .03  .04       0.25   0.25   0.25   0.25
  66   |  .03  .06  .09  .12       .02  .02  .04  .02       0.67   0.33   0.44   0.17
  68   |  .02  .04  .06  .08       .15  .18  .04  .03       7.50   4.50   0.67   0.38
  70   |  .01  .02  .03  .04       .12  .25  .02  .01      12.00  12.50   0.67   0.25

(a) Consider the test that has rejection region {(x,y): x > 68 and y < 10}. What is the level of this test?
(b) What is the power of this test?
(c) What is the rejection region of the optimal .02 level test of H0 vs. H1?
(d) What is the power of this test?
(e) What is the rejection region of the optimal .05 level test of H0 vs. H1?
(f) What is the power of this test?
(g) What is the rejection region of the Bayes test when the prior probabilities are .8, .2?
(h) What is the level of this test?
(i) What is the power of this test?
(j) What is the rejection region of the Bayes test when the prior probabilities are equal?
(k) What is the level of this test?
(l) What is the power of this test?
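
The calculations in parts (a) through (l) all reduce to summing table cells over a rejection region. The following Python sketch shows one way to set that up. The names F0, F1, level_and_power and bayes_region are illustrative and not part of the handout, and the sketch assumes that the priors in part (g) are listed in the order (P0, P1), which the handout does not state.

```python
import numpy as np

# Joint probabilities copied from the table; rows are x = 64, 66, 68, 70
# and columns are y = 9, 10, 11, 12.
X = np.array([64, 66, 68, 70])
Y = np.array([9, 10, 11, 12])
F0 = np.array([[.04, .08, .12, .16],
               [.03, .06, .09, .12],
               [.02, .04, .06, .08],
               [.01, .02, .03, .04]])
F1 = np.array([[.01, .02, .03, .04],
               [.02, .02, .04, .02],
               [.15, .18, .04, .03],
               [.12, .25, .02, .01]])


def level_and_power(reject):
    """Return (level, power) for a boolean rejection mask over the table."""
    return F0[reject].sum(), F1[reject].sum()


def bayes_region(prior0, prior1):
    """Reject H0 where prior1*f1 > prior0*f0, i.e. where LR > prior0/prior1.

    Assumption: prior0 belongs to P0 and prior1 to P1.
    """
    return prior1 * F1 > prior0 * F0


# Region from part (a): x > 68 and y < 10.
xx, yy = np.meshgrid(X, Y, indexing="ij")
region_a = (xx > 68) & (yy < 10)
level_a, power_a = level_and_power(region_a)
```

Passing the mask returned by bayes_region to level_and_power gives the level and power of the corresponding Bayes test.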
======================================================================
Consider classification with two multinormal distributions. Assuming prior probabilities p1 = .8, p2 = .2,
the constants in the two classification functions are 0.5 and 0.6.
What will be the constants if the prior probabilities are changed to p1 = .9, p2 = .1?
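
A small sketch of the adjustment, under the assumption (not stated in the problem) that each classification function has the usual form in which the constant equals ln(prior) plus terms that do not involve the prior. Under that assumption, changing the prior shifts the constant by the difference of the log priors. The function name adjust_constant is hypothetical.

```python
import math


def adjust_constant(constant, old_prior, new_prior):
    """Shift a classification-function constant when only the prior changes.

    Assumes the constant is ln(prior) + (terms free of the prior).
    """
    return constant - math.log(old_prior) + math.log(new_prior)


new_c1 = adjust_constant(0.5, 0.8, 0.9)
new_c2 = adjust_constant(0.6, 0.2, 0.1)
```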
D2. Form of Classification Functions
The form of the classification functions in the bivariate normal case (p = 2) with equal covariance matrices is
C1(x,y) = b10 + b11 x + b12 y and C2(x,y) = b20 + b21 x + b22 y.
What is the form of the classification functions in the bivariate normal case with unequal covariance matrices?
D3. Means the Same, Covariance Matrices Different
In Population 1, X is multinormal with mean vector 0 and covariance matrix 4I;
in Population 2, it is multinormal with mean vector 0 and covariance
matrix 9I. The two populations have equal prior probabilities.
Find the discriminant function.
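
One way to check a hand-derived answer to D3 numerically is to evaluate the difference of the two log densities directly. The sketch below does this for any dimension p. The function names are illustrative, and with equal priors a positive value favors Population 1.

```python
import numpy as np


def spherical_log_density(x, variance):
    """Log density of N(0, variance * I) at the point x."""
    x = np.asarray(x, dtype=float)
    p = x.size
    return -0.5 * p * np.log(2 * np.pi * variance) - 0.5 * (x @ x) / variance


def discriminant(x):
    """ln f1(x) - ln f2(x) for covariance matrices 4I and 9I."""
    return spherical_log_density(x, 4.0) - spherical_log_density(x, 9.0)
```

The value depends on x only through x'x, so the boundary between the two classification regions is a sphere centered at the origin.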
D4. Height within Sexes. If the mean height for males is 69", the mean height for females is 64" and the standard deviations are 3" and 2.5", respectively, and the prior probabilities are .5 and .5, find the classification regions. Hint: Find the roots of the associated quadratic.
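
The hint can be followed numerically. The sketch below sets ln(prior) + ln(density) equal for the two univariate normal populations, collects the coefficients of the resulting quadratic in x, and finds its real roots. The function names log_score and boundary_points are hypothetical, and normal distributions within each sex are assumed.

```python
import numpy as np


def log_score(x, mean, sd, prior):
    """ln(prior) + ln N(x; mean, sd^2), dropping the shared -0.5*ln(2*pi)."""
    return np.log(prior) - np.log(sd) - (x - mean) ** 2 / (2 * sd ** 2)


def boundary_points(m1, s1, p1, m2, s2, p2):
    """Real roots of log_score_1(x) - log_score_2(x) = 0, sorted ascending."""
    a = 1 / (2 * s2 ** 2) - 1 / (2 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = (m2 ** 2 / (2 * s2 ** 2) - m1 ** 2 / (2 * s1 ** 2)
         + np.log(p1 / p2) + np.log(s2 / s1))
    roots = np.roots([a, b, c])
    return np.sort(roots[np.isreal(roots)].real)


# Males: mean 69, sd 3; females: mean 64, sd 2.5; equal priors.
cuts = boundary_points(69, 3, 0.5, 64, 2.5, 0.5)
```

Comparing the two scores on either side of each cut point shows which population is assigned to each interval.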
D5.Height and Weight within Sexes.If the mean height for males is 69", the mean height for females is 64" and the standard deviations are 3" and 2.5", the mean weight for males is 160 lbs. with a standard deviation of 20 lbs., the mean weight for females is 125 lbs. with a standard deviation of 15 lbs., and the correlation between height and weight is +0.6 in both groups, find the classification regions. (Probably it's enough to think about the boundary (or boundaries, as the case may be) between the classification regions.)
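
To explore the boundary in D5 numerically, one can score a point (height, weight) under each bivariate normal population and compare. The sketch below assumes equal prior probabilities, which D5 does not state, and uses the score D^2 + ln|S| - 2 ln(prior), where the smaller score wins. All function and variable names are illustrative.

```python
import numpy as np


def cov_from_sd_corr(sd_x, sd_y, rho):
    """Build a 2x2 covariance matrix from two standard deviations and a correlation."""
    off = rho * sd_x * sd_y
    return np.array([[sd_x ** 2, off], [off, sd_y ** 2]])


def map_score(x, mean, cov, prior):
    """D^2(x, mean; cov) + ln|cov| - 2 ln(prior); the smallest score wins."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    d2 = diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    return d2 + logdet - 2.0 * np.log(prior)


# Equal priors are an assumption; D5 does not give them.
groups = {
    "male": (np.array([69.0, 160.0]), cov_from_sd_corr(3.0, 20.0, 0.6), 0.5),
    "female": (np.array([64.0, 125.0]), cov_from_sd_corr(2.5, 15.0, 0.6), 0.5),
}


def classify(x):
    """Return the group label with the smallest score at the point x."""
    return min(groups, key=lambda g: map_score(x, *groups[g]))
```

Evaluating classify over a grid of heights and weights traces out the boundary between the two regions.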
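D6. Show that, in multinormal classification with different covariance matrices, the MAP procedure is to classify x into population k*, where k* is the value of k that minimizes D^2(x,mk;Sk) + ln|Sk| - 2 ln pk. Here mk, Sk and pk are the mean vector, covariance matrix and prior probability of population k, and D^2(x,m;S) = (x - m)' S^(-1) (x - m) is the squared Mahalanobis distance.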
D7. (continuation) The prior probability term can be ignored if the prior probabilities are equal. Then one is looking at D^2(x,m;S) + ln|S| for the different populations. Is this a form of minimum-distance classification; that is, is [ D^2(x,m;S) + ln|S| ]^(1/2) a distance function?
______________________________________________________________________________________________
HISTOGRAMS
H1. The dataset http://www.uic.edu/classes/ids472/data/daysill.txt gives the number of days ill in a year
for a sample of n = 50 coal miners.
Make a series of histograms with various bin widths to illustrate the
shape of the distribution, as follows.
One often reasonable rule for the number of bins is that if n is about 2^k, use k+1 bins. Here n = 50, which is between 32 and 64, the fifth and sixth powers of 2. The rule would suggest six or seven bins. Using Excel or Minitab, make histograms with 5, 7 and 9 bins. Compare them. Which seems to show the overall pattern best? Making a succession of histograms with different bin widths can help in formulating a mixture model for the data.
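
For readers working outside Excel or Minitab, the bin-count rule and the histogram counts can be sketched in Python as follows. The data here are simulated placeholders, not the coal-miner dataset, and the function name suggested_bins is hypothetical.

```python
import math

import numpy as np


def suggested_bins(n):
    """Apply the 'n about 2^k, use k+1 bins' rule; returns (low, high) bin counts."""
    k = math.log2(n)
    return math.floor(k) + 1, math.ceil(k) + 1


# Placeholder data, NOT the coal-miner dataset: 50 simulated counts.
rng = np.random.default_rng(0)
days_ill = rng.poisson(lam=6, size=50)

for bins in (5, 7, 9):
    counts, edges = np.histogram(days_ill, bins=bins)
    print(bins, "bins, width", round(edges[1] - edges[0], 2), counts.tolist())
```

Replacing days_ill with the values read from the dataset file gives the counts for the three histograms requested.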
H2. Work out the SIC model selection criterion for histograms. Hint: multinomial distribution.
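
One possible formulation, offered as a sketch rather than as the intended solution: treat a K-bin histogram as a piecewise-constant density with estimated bin probabilities n_j/n, so that the maximized log-likelihood is the sum of n_j ln[n_j/(n h_j)] over bins of width h_j, and take SIC = -2 ln L + (K - 1) ln n, counting K - 1 free multinomial parameters. Both the parameter count and the function name histogram_sic are assumptions.

```python
import numpy as np


def histogram_sic(data, bins):
    """SIC = -2 ln L + (K - 1) ln n for a K-bin equal-width histogram density."""
    data = np.asarray(data, dtype=float)
    n = data.size
    counts, edges = np.histogram(data, bins=bins)
    widths = np.diff(edges)
    filled = counts > 0  # empty bins contribute nothing to the log-likelihood
    log_lik = np.sum(counts[filled] * np.log(counts[filled] / (n * widths[filled])))
    return -2.0 * log_lik + (bins - 1) * np.log(n)
```

Computing histogram_sic for several values of bins and choosing the smallest value gives a data-based choice of bin count under these assumptions.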
H3. Stock Rates of Return: Crude Oil
3.1. Given a price series {P(t)}, the rate of return is [P(t+1) - P(t)]/P(t),
which is approximately ln[P(t+1)/P(t)], or ln P(t+1) - ln P(t). This is often
multiplied by 100 to correspond to percent. This logarithmic approximation
to the percentage RORs of the crude-oil nearest-month futures prices is
in the worksheet, in the column 100*DIFF.
Make histograms of 100*DIFF
using bin widths of, say, 1.5, 2.0 and 2.5, or whatever best exhibits the
information in the data.
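
The relation between the simple rate of return and its logarithmic approximation can be seen in a short sketch. The prices below are hypothetical, not the crude-oil futures data, and the function name percent_returns is illustrative.

```python
import numpy as np


def percent_returns(prices):
    """Return (simple, log) percentage rates of return for a price series."""
    p = np.asarray(prices, dtype=float)
    simple = 100.0 * (p[1:] - p[:-1]) / p[:-1]
    log_ror = 100.0 * np.diff(np.log(p))  # 100 * [ln P(t+1) - ln P(t)]
    return simple, log_ror


# Hypothetical prices, not the crude-oil futures data.
simple, log_ror = percent_returns([20.0, 20.4, 19.9, 20.1])
```

For small price changes the two series are close; the log version corresponds to the 100*DIFF column described above.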
3.2. Do a K-Means clustering of these data, using a number of clusters and starting points suggested by the histograms. (Use Minitab or other software.)
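
If Minitab is not available, a one-dimensional K-Means can be sketched directly. The version below is a minimal Lloyd-type iteration that starts from user-supplied centers, such as the locations of the histogram peaks. The data are hypothetical and the function name kmeans_1d is not from any package.

```python
import numpy as np


def kmeans_1d(values, start_centers, max_iter=100):
    """Lloyd's algorithm in one dimension from user-chosen starting centers."""
    x = np.asarray(values, dtype=float)
    centers = np.asarray(start_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for k in range(centers.size):
            if np.any(labels == k):  # leave an empty cluster's center where it is
                new_centers[k] = x[labels == k].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Hypothetical returns with two visible groups.
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
centers, labels = kmeans_1d(data, start_centers=[-1.0, 1.0])
```

The number of entries in start_centers sets the number of clusters, so trying the different counts suggested by the histograms only requires changing that list.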
Copyright © Stanley Louis Sclove 2002
Created 2002: May 29; latest update 2002: Nov 23