University of Illinois at Chicago
College of Business Administration

MBA 503 (Statistics Module)
Fall, 1997/ Prof. Stan Sclove / Data Analysis for Managerial Decision Making
Textbook: Levine, Berenson & Stephan (LBS)

NOTES TO ACCOMPANY LBS CH. 9 (TWO AND c-SAMPLE TESTS: CATEGORICAL DATA)
These notes Copyright (C) 1997 Stanley Louis Sclove


9.1 INTRODUCTION

This chapter deals with frequency (count) data. Methods for dealing with several proportions are presented. Chi-square tests for goodness-of-fit of distributions to data are discussed. These notes touch on these topics and also mention regression methods for use with categorical data.

CHI-SQUARE TESTS FOR GOODNESS-OF-FIT

Testing the Fit of a Specified Distribution

Example. Test whether the distribution (.1, .2, .4, .2, .1) fits data on a 5-point Likert-scale item. Suppose n=100 and the data are like this.
 
------------------------------------
value       1     2     3    4     5
frequency   9    18    38   24    11
------------------------------------


The expected frequencies (E's) corresponding to the hypothesized distribution are 10, 20, 40, 20 and 10. The differences between the observed frequencies (O's) and these are -1, -2, -2, +4 and +1. The square of the ordinary distance between the O's and E's is (O1-E1)^2 + (O2-E2)^2 + ... + (O5-E5)^2.

However, the appropriate squared statistical distance is the chi-square test statistic

(O1-E1)^2/E1 + (O2-E2)^2/E2 + ... + (O5-E5)^2/E5.

When the hypothesized distribution is true, this has a chi-square distribution with 5-1 = 4 d.f. At the .05 level, one would reject the goodness-of-fit of the hypothesized distribution if the value exceeds the critical value of 9.488 from Table E.4.

The value of the statistic for these data is (-1)^2/10 + (-2)^2/20 + (-2)^2/40 + (+4)^2/20 + (+1)^2/10 = 0.1 + 0.2 + 0.1 + 0.8 + 0.1 = 1.3. This is a small value. In particular, it is less than 9.488, so we do not reject the hypothesized distribution; that is, we accept the fit it provides.
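
The arithmetic above can be checked with a short script. This is only a minimal sketch in Python (not part of LBS); the variable names are illustrative, and the critical value 9.488 is copied from the text rather than computed.

```python
# Goodness-of-fit check for the 5-point Likert example in these notes.
observed = [9, 18, 38, 24, 11]        # observed frequencies (O's)
probs = [0.1, 0.2, 0.4, 0.2, 0.1]     # hypothesized distribution

n = sum(observed)                     # sample size
expected = [n * p for p in probs]     # expected frequencies (E's)

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                # k - 1 degrees of freedom
critical_value = 9.488                # .05 level, 4 d.f. (value quoted in the text)
reject = chi_square > critical_value

print(f"chi-square = {chi_square:.3f}, d.f. = {df}, reject fit: {reject}")
```

Because the statistic is well below the critical value, the script reports that the fit is not rejected, in agreement with the hand calculation.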

Testing the Fit of a Family of Distributions

Examples. (i) Test whether a Poisson distribution can fit data on a discrete numerical variable. If the Poisson parameter is specified, the number of d.f. is k-1, where k is the number of categories. If the Poisson parameter is estimated, the number of d.f. is k-2. (ii) Test whether a normal distribution can fit data on a continuous numerical variable. The observed data are grouped into k categories. If the parameters are specified, the number of d.f. is k-1. If the mean is estimated, the number of d.f. is k-2. If both the mean and variance are estimated, the number of d.f. is k-3. In general, one d.f. is lost for each parameter estimated from the data.
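
Example (i) can be sketched in Python as follows. The frequencies below are hypothetical (invented for illustration, not taken from LBS), and the Poisson parameter is estimated by the sample mean of the ungrouped values, which is the usual textbook shortcut. The last category is taken to be "4 or more" so that the category probabilities sum to one.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability of the value x when the mean is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Hypothetical frequencies of the values 0, 1, ..., 5 (illustration only).
raw_counts = {0: 22, 1: 34, 2: 25, 3: 12, 4: 5, 5: 2}
n = sum(raw_counts.values())

# Estimate the Poisson parameter by the sample mean.
lam_hat = sum(x * f for x, f in raw_counts.items()) / n

# Categories: 0, 1, 2, 3, and "4 or more" (pooled).
observed = [raw_counts[0], raw_counts[1], raw_counts[2], raw_counts[3],
            raw_counts[4] + raw_counts[5]]
probs = [poisson_pmf(x, lam_hat) for x in range(4)]
probs.append(1.0 - sum(probs))        # tail probability P(X >= 4)
expected = [n * p for p in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

k = len(observed)
df = k - 1 - 1                        # k-2: one parameter was estimated

print(f"lambda-hat = {lam_hat}, chi-square = {chi_square:.3f}, d.f. = {df}")
```

The resulting statistic would be compared with the chi-square critical value for k-2 d.f. from Table E.4.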

9.2 Z TEST FOR DIFFERENCES IN TWO PROPORTIONS (INDEPENDENT SAMPLES)

AAO Pgm #23 (Inference for Proportions) can be viewed as an adjunct to this material.

The text discusses the Z test for equality of two proportions, based on samples from the two corresponding populations.
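
As a rough illustration, the usual pooled form of this Z statistic is Z = (p1 - p2) / sqrt[ p(1-p)(1/n1 + 1/n2) ], where p is the pooled sample proportion. The sketch below applies it in Python to hypothetical counts (45 successes out of 100 in one sample, 30 out of 100 in the other); the counts and variable names are assumptions made for illustration and do not come from LBS.

```python
import math

# Hypothetical data: number of successes and sample sizes (illustration only).
x1, n1 = 45, 100
x2, n2 = 30, 100

p1 = x1 / n1
p2 = x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)      # estimate of the common proportion under H0

# Standard error of p1 - p2 under the null hypothesis of equal proportions.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

At the .05 level (two-sided), equality of the two population proportions is rejected when the absolute value of Z exceeds 1.96.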

9.3 CHI-SQUARE TEST FOR DIFFERENCES IN TWO PROPORTIONS (INDEPENDENT SAMPLES)

The text presents a way of doing this using the chi-square test statistic, which compares the observed frequencies (counts) with those that would be expected if the two population proportions were equal. The chi-square test statistic is a (squared) distance between the set of observed counts fo and the set of expected counts, fe. If this distance is large, the null hypothesis (no difference between population proportions) is rejected.

The symbols O and E are often used for fo and fe, respectively.
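
A minimal Python sketch of this calculation follows, using a hypothetical 2x2 table (45 successes and 55 failures in one sample, 30 and 70 in the other). The counts are invented for illustration. The sketch also computes the pooled Z statistic to show that, for two proportions, the chi-square statistic equals the square of Z.

```python
import math

# Hypothetical 2x2 table: rows = samples, columns = (success, failure).
observed = [[45, 55],
            [30, 70]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if the two population proportions were equal.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over the four cells of (O - E)^2 / E.
chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

# Pooled Z statistic for the same data, for comparison.
p1 = observed[0][0] / row_totals[0]
p2 = observed[1][0] / row_totals[1]
p_pooled = col_totals[0] / n
z = (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled)
                          * (1 / row_totals[0] + 1 / row_totals[1]))

critical_value = 3.841                # chi-square, .05 level, 1 d.f.
reject = chi_square > critical_value

print(f"chi-square = {chi_square:.3f}, Z squared = {z ** 2:.3f}, reject: {reject}")
```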

9.4 CHI-SQUARE TEST FOR DIFFERENCES IN c PROPORTIONS (INDEPENDENT SAMPLES)

The standard deviation of the observed count O for any given cell is approximately the square root of E for that cell. Hence, the Z for the cell is (O-E)/E^(1/2). The chi-square test statistic is the sum of the squares of these Z's. It is interesting to look at the values of Z for the different cells to see where the differences between observed and expected values are large. In Table 9.9 the right-hand column contains the squares of the Z's. Those for Brands 2 and 4 are particularly large. A look back at the data in Table 9.7 indicates that Brand 2 looks particularly good, while Brand 4 looks particularly bad.
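
The cell-by-cell Z values can be computed as in the sketch below. Tables 9.7 and 9.9 of LBS are not reproduced in these notes, so the sketch instead reuses the Likert-scale frequencies from the goodness-of-fit example earlier in these notes; the same calculation would apply to the cells of any table of observed and expected counts.

```python
import math

# Observed and expected counts from the Likert-scale example in these notes.
observed = [9, 18, 38, 24, 11]
expected = [10, 20, 40, 20, 10]

# Z for each cell: (O - E) divided by the square root of E.
z_values = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# The chi-square statistic is the sum of the squared Z's.
chi_square = sum(z ** 2 for z in z_values)

# Identify the cell contributing most to the statistic.
largest = max(range(len(z_values)), key=lambda i: abs(z_values[i]))

for i, z in enumerate(z_values, start=1):
    print(f"cell {i}: Z = {z:+.3f}, Z squared = {z ** 2:.3f}")
print(f"chi-square = {chi_square:.3f}; largest |Z| is in cell {largest + 1}")
```

Here the fourth category has the largest Z, reflecting its difference of +4 between observed and expected counts.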

To study how proportions vary as a function of one or more explanatory variables, the logistic regression method mentioned in the Notes to Accompany Chapter 12 is used.

9.5 CHI-SQUARE TEST OF INDEPENDENCE

This section deals with Two-Way Frequency Tables (Contingency Tables).

AAO Pgm #24 (Inference for Two-Way Tables) can be viewed in connection with this section.

The test of "independence" is better considered as a test of Homogeneity of Row Distributions. That is, one tests whether the distributions across column categories are the same for all rows. If they are, the proportion falling in column j is estimated by Cj/n for every row, so the expected count for the cell in row i and column j is (Cj/n)Ri, which equals RiCj/n, where the R's and C's are the row and column sums, respectively. The number of d.f. for the chi-square statistic is (r-1)(c-1), where r is the number of rows and c is the number of columns.
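
The expected counts RiCj/n and the d.f. can be computed as in the following Python sketch. The 2x3 table is hypothetical (invented for illustration), and the variable names are not from LBS.

```python
# Hypothetical 2x3 contingency table (illustration only).
observed = [[20, 30, 50],
            [30, 30, 40]]

row_totals = [sum(row) for row in observed]          # R's
col_totals = [sum(col) for col in zip(*observed)]    # C's
n = sum(row_totals)

# Expected count for row i, column j: Ri * Cj / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

r = len(observed)
c = len(observed[0])
df = (r - 1) * (c - 1)

print(f"expected = {expected}")
print(f"chi-square = {chi_square:.3f}, d.f. = {df}")
```

The statistic would then be compared with the chi-square critical value for (r-1)(c-1) d.f. from Table E.4.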

Correspondence Analysis

Correspondence Analysis provides a way of viewing the relationships between the row and column categories. For example, one could see which age/gender groups are close to which political candidates or which types of cars. The plot is called a Biplot. The example shown in the link is Smoking, by Job Category (from Greenacre 1984).

Log-Linear Models

When the chi-square for independence is large, one will want to see where the departures from independence are. The log-linear methods discussed in the Notes to Accompany Chapter 12 are used for this.

Latest revision: 29-Sept-1997