University of Illinois
College of Business Administration
Department of Information & Decision Sciences
IDS 594   Special Topics in IDS: Statistical Finite Mixture Models -- Classification & Cluster Analysis
Fall Semester, 2002 Sclove

Suggested Exercises
Classification; Histograms; Clustering 
These exercises are to be worked and saved but not collected. Those that are to be collected are repeated in the separate files Homework A and Homework B.

LINEAR ALGEBRA (added 23-Oct)

A1.  What is the inverse of a 2 x 2 matrix?
A2.  Express the covariance matrix in terms of the correlation matrix and the standard deviations.
A3.  Express the inverse covariance matrix in terms of the inverse correlation matrix and the standard deviations.

STATISTICAL DISTANCE (added 23-Nov)

D1.  For the bivariate case, express the (squared) statistical distance between the observation vector and the mean vector in terms of z scores.
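
A hand-derived answer to D1 can be checked numerically. The sketch below computes the squared statistical distance straight from its matrix definition, D^2 = (x - m)' S^{-1} (x - m), for the bivariate case, so that a z-score expression can be compared against it. The function name and the example numbers are illustrative assumptions, not part of the course materials.

```python
def stat_distance_sq_2d(x, mean, cov):
    """Squared statistical distance (x - m)' S^{-1} (x - m) for p = 2.

    cov is a 2 x 2 covariance matrix given as [[s11, s12], [s21, s22]].
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d1 = x[0] - mean[0]
    d2 = x[1] - mean[1]
    # Uses S^{-1} = (1/det) * [[d, -b], [-c, a]], expanded into a quadratic form.
    return (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det

# Hypothetical numbers: SDs 2 and 3, correlation 0.5, so covariance = 0.5*2*3 = 3.
print(stat_distance_sq_2d((12.0, 33.0), (10.0, 30.0), [[4.0, 3.0], [3.0, 9.0]]))
```

With uncorrelated variables the result should reduce to the sum of the squared z scores, which is a quick sanity check on any formula obtained by hand.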

HYPOTHESIS TESTING

T1.
TABLE.   Joint Distribution of  X  and  Y  under  P0  and  P1
    Probability function        Probability function      Likelihood ratio,
     under P0,  f0(x,y)           under P1, f1(x,y)        f1(x,y) / f0(x,y)
    -------------------           ----------------       ----------------------
                  y                       y                       y
         |   9  10  11   12         9  10  11  12            9   10    11    12   
 _______ | _________________      _______________        ______________________ 
         |                                         
     64  | .04 .08 .12  .16       .01 .02 .03 .04        0.25  0.25  0.25  0.25
 x   66  | .03 .06 .09  .12       .02 .02 .04 .02        0.67  0.33  0.44  0.17 
     68  | .02 .04 .06  .08       .15 .18 .04 .03        7.50  4.50  0.67  0.38 
     70  | .01 .02 .03  .04       .12 .25 .02 .01       12.00 12.50  0.67  0.25  
 
There will be a single observation X' = (X, Y). Consider testing
H0: X is from P0   vs.   H1: X is from P1.

(a) Consider the test that has rejection region {(x,y): x > 68 and y < 10}. What is the level of this test?

(b) What is the power of this test?

(c) What is the rejection region of the optimal .02 level test of H0 vs. H1?

(d) What is the power of this test?

(e) What is the rejection region of the optimal .05 level test of H0 vs. H1?

(f) What is the power of this test?

(g) What is the rejection region of the Bayes test when the prior probabilities are .8, .2?

(h) What is the level of this test?

(i) What is the power of this test?

(j) What is the rejection region of the Bayes test when the prior probabilities are equal?

(k) What is the level of this test?

(l) What is the power of this test?
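
For checking hand calculations in T1, the two probability tables can be entered once and summed over any candidate rejection region. The sketch below is one way to do this; the helper names are assumptions made for illustration, and the region used in the example is a made-up one (reject when x = 64), not one of the regions asked about in the exercise.

```python
X_VALUES = (64, 66, 68, 70)
Y_VALUES = (9, 10, 11, 12)

# Rows are x = 64, 66, 68, 70; columns are y = 9, 10, 11, 12 (from the table).
F0 = [[.04, .08, .12, .16],
      [.03, .06, .09, .12],
      [.02, .04, .06, .08],
      [.01, .02, .03, .04]]
F1 = [[.01, .02, .03, .04],
      [.02, .02, .04, .02],
      [.15, .18, .04, .03],
      [.12, .25, .02, .01]]

def region_probability(table, in_region):
    """Sum the probabilities of all (x, y) points for which in_region(x, y) is true."""
    return sum(table[i][j]
               for i, x in enumerate(X_VALUES)
               for j, y in enumerate(Y_VALUES)
               if in_region(x, y))

def level_and_power(in_region):
    """Level is P0(reject H0); power is P1(reject H0)."""
    return region_probability(F0, in_region), region_probability(F1, in_region)

def likelihood_ratio(x, y):
    """f1(x, y) / f0(x, y), the third panel of the table."""
    i, j = X_VALUES.index(x), Y_VALUES.index(y)
    return F1[i][j] / F0[i][j]

# Hypothetical region, not one from the exercise: reject when x = 64.
level, power = level_and_power(lambda x, y: x == 64)
print(round(level, 2), round(power, 2))
```

Passing a different predicate, for example one built from a threshold on likelihood_ratio, gives the level and power of that test in the same way.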

======================================================================

CLASSIFICATION

C1. Negative Exponential Distributions

The class-conditional densities are exponential with means 10, 20 and 30.
If there are equal prior probabilities and x = 20,
find the probability that x came from the population with mean equal to 20. 
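
The calculation in C1 is an application of Bayes' rule: the posterior probability of each population is proportional to its prior probability times its density at x. A minimal sketch follows, assuming the density with mean m is (1/m) exp(-x/m) for x >= 0; the function names are illustrative.

```python
import math

def exponential_density(x, mean):
    """Density of the negative exponential distribution with the given mean."""
    return math.exp(-x / mean) / mean if x >= 0 else 0.0

def posterior_probabilities(x, means, priors):
    """Bayes' rule: posterior_k is proportional to prior_k * f_k(x)."""
    weighted = [p * exponential_density(x, m) for m, p in zip(means, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

means = [10.0, 20.0, 30.0]
priors = [1 / 3, 1 / 3, 1 / 3]
for m, post in zip(means, posterior_probabilities(20.0, means, priors)):
    print(f"mean {m:g}: posterior {post:.4f}")
```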

 

DISCRIMINATION

D1. Adjusting Prior Probabilities

Consider classification with two multinormal distributions. Assuming prior probabilities p1 = .8, p2 = .2, the constants in the two classification functions are 0.5 and 0.6. What will be the constants if the prior probabilities are changed to p1 = .9, p2 = .1?
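
One way to check an answer to D1 numerically is sketched below. It rests on an assumption that should be verified against the course notes or the software in use: that each classification function has the form C_k(x) = ln p_k + (terms not involving p_k), so the prior enters the constant only through an additive ln p_k. If the functions are scaled differently, the adjustment changes accordingly. The function name is hypothetical.

```python
import math

def adjust_constant(old_constant, old_prior, new_prior):
    """Swap ln(old_prior) for ln(new_prior) in a classification-function constant.

    Assumes the constant has the form  ln(prior) + (terms not involving the prior).
    """
    return old_constant - math.log(old_prior) + math.log(new_prior)

old_constants = [0.5, 0.6]
old_priors = [0.8, 0.2]
new_priors = [0.9, 0.1]
new_constants = [adjust_constant(c, p_old, p_new)
                 for c, p_old, p_new in zip(old_constants, old_priors, new_priors)]
print([round(c, 4) for c in new_constants])
```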



 

D2. Form of Classification Functions

The form of the classification functions in the bivariate normal case (p = 2) with equal covariance matrices is

C1(x,y) = b10 + b11 x + b12 y   and   C2(x,y) = b20 + b21 x + b22 y.

What is the form of the classification functions in the bivariate normal case with unequal covariance matrices?

D3. Means the Same, Covariance Matrices Different


In Population 1, X is multinormal with mean vector 0 and covariance matrix 4I; in Population 2, it is multinormal with mean vector 0 and covariance matrix 9I.  The two populations have equal prior probabilities.  Find the discriminant function.

D4. Height within Sexes. If the mean height for males is 69", the mean height for females is 64", the standard deviations are 3" and 2.5", respectively, and the prior probabilities are .5 and .5, find the classification regions. Hint: Find the roots of the associated quadratic.

D5. Height and Weight within Sexes. If the mean height for males is 69", the mean height for females is 64" and the standard deviations are 3" and 2.5", the mean weight for males is 160 lbs. with a standard deviation of 20 lbs., the mean weight for females is 125 lbs. with a standard deviation of 15 lbs., and the correlation between height and weight is +0.6 in both groups, find the classification regions. (Probably it's enough to think about the boundary (or boundaries, as the case may be) between the classification regions.)

D6.   Show that, in multinormal classification with different covariance matrices, the MAP procedure is to classify  x  into population k*, where

k* = arg min_{k=1,2,...,K} { D^2(x, m_k; S_k) + ln|S_k| - 2 ln p_k } .

D7. (continuation)  The prior probability term can be ignored if the prior probabilities are equal. Then one is looking at  D^2(x,m;S) + ln|S|  for the different populations. Is this a form of minimum-distance classification; that is, is  [ D^2(x,m;S) + ln|S| ]^(1/2)  a distance function?
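
The rule in D6 can be tried out in the univariate case, where S_k is just the variance and |S_k| = sd_k^2. The sketch below does this with the height figures from D4; the function names are illustrative assumptions, and the code only evaluates the rule at given points rather than deriving the classification regions.

```python
import math

def map_score(x, mean, sd, prior):
    """Univariate case of  D^2(x, m; S) + ln|S| - 2 ln p,  with S = sd**2."""
    return ((x - mean) / sd) ** 2 + math.log(sd ** 2) - 2.0 * math.log(prior)

def classify(x, populations):
    """Return the label of the population with the smallest score."""
    return min(populations, key=lambda label: map_score(x, *populations[label]))

# (mean, sd, prior) taken from exercise D4; heights are in inches.
populations = {"male": (69.0, 3.0, 0.5), "female": (64.0, 2.5, 0.5)}
for height in (62.0, 64.0, 69.0, 72.0):
    print(height, classify(height, populations))
```

Evaluating classify over a fine grid of heights is one way to locate the boundaries numerically and compare them with the roots of the quadratic.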

______________________________________________________________________________________________

HISTOGRAMS


H1.     The number of days ill in a year is given for a sample of n = 50 coal miners in the dataset http://www.uic.edu/classes/ids472/data/daysill.txt . Make a series of histograms with various bin widths to illustrate the shape of the distribution, as follows.

One often-reasonable rule for the number of bins is that if n is about 2^k, use k+1 bins. Here n = 50, which is between 32 and 64, the fifth and sixth powers of 2. The rule would suggest six or seven bins. Using Excel or Minitab, make histograms with 5, 7 and 9 bins. Compare them. Which seems to show the overall pattern best? Making a succession of histograms with different bin widths can help in formulating a mixture model for the data.
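
The bin-count rule above can be written as a small function. This is a sketch under the reading that, when n falls between two powers of 2, both neighboring values of k are reported; the function name is an assumption.

```python
import math

def suggested_bin_counts(n):
    """Return (low, high) bin counts from the rule: n about 2**k -> k + 1 bins.

    n usually falls between two powers of 2, so both neighboring values of k are used.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    log2_n = math.log2(n)
    return math.floor(log2_n) + 1, math.ceil(log2_n) + 1

print(suggested_bin_counts(50))  # (6, 7)
```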

H2. Work out the SIC model selection criterion for histograms. Hint: multinomial distribution.

H3.  Stock Rates of Return:  Crude Oil

3.1.  Given a price series {P(t)}, the rate of return is [P(t+1) - P(t)]/P(t), which is approximately ln [P(t+1)/P(t)], or ln P(t+1) - ln P(t). This is often multiplied by 100 to correspond to percent. This logarithmic approximation to the percentage rates of return (RORs) of the crude-oil nearest-month futures prices is in the worksheet, in the column 100*DIFF. Make histograms of 100*DIFF using bin widths of, say, 1.5, 2.0 and 2.5, or whatever best exhibits the information in the data.
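
To see how close the logarithmic approximation is, the two definitions can be computed side by side. The sketch below uses made-up prices, not the crude-oil data from the worksheet, and the function names are illustrative.

```python
import math

def percent_returns(prices):
    """Exact percentage rates of return, 100 * [P(t+1) - P(t)] / P(t)."""
    return [100.0 * (p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def percent_log_returns(prices):
    """Logarithmic approximation, 100 * [ln P(t+1) - ln P(t)]."""
    return [100.0 * (math.log(p1) - math.log(p0)) for p0, p1 in zip(prices, prices[1:])]

prices = [20.0, 20.5, 20.1, 21.0]  # made-up prices, not the course data
for exact, approx in zip(percent_returns(prices), percent_log_returns(prices)):
    print(f"exact {exact:7.3f}   log approx {approx:7.3f}")
```

The two columns agree closely when the price changes are small, and the gap widens as the changes get larger.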

3.2.  Do a K-Means clustering of these data, using a number of clusters and starting points suggested by the histograms.  (Use Minitab or other software.)
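
For readers without Minitab, the idea of K-Means on a single variable can be sketched in a few lines. The code below is a plain version of Lloyd's algorithm and is not claimed to match Minitab's procedure in detail; the data are made up, and the starting points play the role of values read off a histogram.

```python
def kmeans_1d(data, start_centers, max_iter=100):
    """Lloyd's algorithm on the real line.

    Alternates two steps until the centers stop changing:
    assign each point to its nearest center, then move each center
    to the mean of the points assigned to it.
    """
    centers = list(start_centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its old center.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Made-up percentage returns with three apparent groups (not the course data).
returns = [-3.1, -2.9, -3.0, 0.1, -0.1, 0.0, 2.9, 3.1, 3.0]
centers, clusters = kmeans_1d(returns, start_centers=[-2.0, 0.0, 2.0])
print(centers)
print([len(c) for c in clusters])
```

Different starting points can lead to different final clusters, which is why the exercise suggests taking them from the histograms.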

CLUSTERING

C1. Cluster the dataset on the Hertzsprung-Russell diagram of stars.


Copyright © Stanley Louis Sclove 2002

created  2002:  May 29       latest update  2002:  Nov 23