____________________________________________________________________________________________________________
University of Illinois at Chicago
College of Business Administration
Department of Information & Decision Sciences
____________________________________________________________________________________________________________
IDS 470: Multivariate Statistical Analysis I
Instructor: Prof. Stanley L. Sclove
Textbook: Johnson and Wichern, 4/e (JW)
____________________________________________________________________________________________________________
Notes on JW Ch. 10: Canonical Correlation Analysis
NOTE. This chapter is not required in IDS 470.
CONTENTS
10.1. Introduction
10.2. Canonical Variates and Canonical Correlations
10.3. Interpreting the Population Canonical Variables
10.4. The Sample Canonical Variates and Sample Canonical Correlations
10.5. Additional Sample Descriptive Measures
10.6. Large Sample Inferences
References
______________________________________________________________________________________________________________________________
10.1. Introduction
Your dataset consists of two sets of variables. How can you
best represent them in terms of a few dimensions?
Canonical Correlation Analysis (CCA) deals with situations in which
the variables fall into two subsets, say q Y's and r Z's (q + r = p).
Often the Y's are to be predicted from the Z's. Used in this way, CCA is
a dependence rather than an interdependence technique.
What is Canonical Correlation?
Examples.
(i) In the dataset IQRA DAT, the q = 2 variables RA1 (Y1) and RA2 (Y2),
reading achievement before and after second grade, are to be explained
in terms of the r = 2 variables Language IQ (Z1) and Non-Language IQ (Z2)
(Anderson and Sclove 1986).
(ii) In the dataset HEART DAT (Dixon and Massey, Table 2-2a) the q = 2
variables Systolic BP (Y1) and Diastolic BP (Y2) are to be predicted
using the r = 3 variables Age (Z1), Weight (Z2) and Height (Z3).
TABLE. Dataset: Two sets of variables

                                  VARIABLE
         ------------------------------------------------------------------
         X1    X2   . . .                                              Xp
         ------------------------------------------------------------------
         Y1    Y2   ...  Yk   ...  Yq     Z1    Z2   ...  Zm   ...  Zr     (q+r = p)
         ------------------------------------------------------------------
CASE 1   y11   y12  ...  y1k  ...  y1q    z11   z12  ...  z1m  ...  z1r
     2   y21   y22  ...  y2k  ...  y2q    z21   z22  ...  z2m  ...  z2r
   . . .
     j   yj1   yj2  ...  yjk  ...  yjq    zj1   zj2  ...  zjm  ...  zjr
   . . .
     n   yn1   yn2  ...  ynk  ...  ynq    zn1   zn2  ...  znm  ...  znr
         ------------------------------------------------------------------
A question is the extent to which the r Z's can explain or predict
the q Y's, that is, whether the Z-construct can explain the Y-construct.
When there is only one Y (q=1), this is a multiple regression situation.
The multiple correlation coefficient R is the ordinary correlation
between Y and the BLP (best linear predictor) based on the Z's. The
statistic R is a measure of the strength of the relationship between Y
and the Z's.
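The statement that R is the ordinary correlation between Y and its BLP
can be checked numerically. The following is a minimal sketch for r = 2
predictors; the eight data values and all variable names are hypothetical
and serve only to illustrate the identity.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (divisor n - 1)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def corr(u, v):
    """Ordinary (Pearson) correlation."""
    return cov(u, v) / sqrt(cov(u, u) * cov(v, v))

# Hypothetical data: one Y and two Z's, n = 8 cases.
z1 = [1, 2, 3, 4, 5, 6, 7, 8]
z2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3, 4, 8, 7, 12, 11, 15, 16]

# Slopes of the BLP: solve the 2x2 normal equations S_zz b = s_zy.
s11, s12, s22 = cov(z1, z1), cov(z1, z2), cov(z2, z2)
s1y, s2y = cov(z1, y), cov(z2, y)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(y) - b1 * mean(z1) - b2 * mean(z2)

# Best linear predictor of Y for each case.
blp = [b0 + b1 * a + b2 * b for a, b in zip(z1, z2)]

# R as the ordinary correlation between Y and its BLP ...
R = corr(y, blp)
# ... and R^2 as the proportion of Var(Y) explained by the Z's.
R2 = (b1 * s1y + b2 * s2y) / cov(y, y)

print("R =", round(R, 4), " R^2 =", round(R2, 4))
print("corr(Y,Z1) =", round(corr(y, z1), 4), " corr(Y,Z2) =", round(corr(y, z2), 4))
```

In this sketch R is never smaller in absolute value than the correlation
of Y with any single Z, because the BLP is the linear combination of the
Z's that is most highly correlated with Y.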
When q > 1, we find the linear combination CVY1 of the Y's that is most
predictable from the Z's, together with the linear combination CVZ1 of
the Z's that predicts it best. The correlation between CVY1 and CVZ1 is
the first canonical correlation. If this maximum correlation is large
enough, we conclude that the Y's are significantly related to the Z's;
the size of this correlation is a measure of the strength of this
relationship. Also, a 2D plot is obtained by plotting CVY1 vs. CVZ1.
This is a plot of a Y-index vs. a Z-index.
Canonical correlation analysis proceeds further by finding successive
pairs (CVZ2,CVY2), (CVZ3,CVY3), etc., of linear combinations. Each new
pair has the largest correlation possible subject to being uncorrelated
with the preceding pairs. The number of such pairs is min{q,r}. For the
Heart Data there are three Z's and two Y's, so this would be
min{2,3} = 2.
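For reference, the construction just described can be written compactly
as follows. This is a sketch in standard notation, not taken from these
notes: the coefficient vectors a and b, the covariance blocks
\Sigma_{YY}, \Sigma_{YZ}, \Sigma_{ZZ}, and the symbol \rho_k for the k-th
canonical correlation are assumptions introduced here.

```latex
\begin{aligned}
\rho_1 &= \max_{a,\,b}\ \operatorname{Corr}(a'Y,\ b'Z),
  \qquad \mathrm{CVY1} = a_1'Y, \quad \mathrm{CVZ1} = b_1'Z, \\
\rho_k &= \max_{a,\,b}\ \operatorname{Corr}(a'Y,\ b'Z)
  \quad \text{subject to} \quad
  \operatorname{Corr}(a'Y,\ a_j'Y) = \operatorname{Corr}(b'Z,\ b_j'Z) = 0,
  \quad j < k, \\
\rho_k^2 &= \text{$k$-th largest eigenvalue of }
  \Sigma_{YY}^{-1}\,\Sigma_{YZ}\,\Sigma_{ZZ}^{-1}\,\Sigma_{ZY},
  \qquad k = 1, \dots, \min\{q, r\}.
\end{aligned}
```

With q = 1 the last line reduces to the squared multiple correlation
R^2 of the single Y on the Z's.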
TABLE. Examples of some studies with two sets of variables

Field: Medicine
Study: Los Angeles Heart Study; see Dixon & Massey (1969)
X's:   Personal physical characteristics:
       Age; Weight; Height
Y's:   Blood pressure variables:
       Systolic BP; Diastolic BP

Field: Business
Study: HATCO data of Hair et al. (see esp. Table 8.1, p. 329)
X's:   Measures of customer characteristics:
       Family size; family income
Y's:   Measures of credit usage:
       Number of credit cards held by the family;
       Ave. monthly dollar expenditures on all credit cards

Field: Public Health
Study: Chronic Depression Study, reported in Afifi & Clark (1987)
X's:   Personal social and financial variables:
       Gender; Age; Education; Income
Y's:   Health variables:
       CESD (an index of chronic depression);
       Perceived physical health

Field: Agriculture
Study: Waugh (1942)
X's:   Characteristics of wheat:
       texture measure; density; protein content;
       % kernels damaged; % foreign matter
Y's:   Characteristics of the resulting flour:
       crude protein content; wheat per barrel of flour;
       ash in flour

Field: Sociology
Study: Galle, Gove & McPherson (1972)
Cases: 75 community areas of Chicago
X's:   Population density variables:
       persons per acre; persons per room; rooms per housing unit;
       housing units per structure; structures per acre
Y's:   Social pathology variables:
       juvenile delinquency rate; public assistance rate;
       rate of admissions to mental hospitals;
       general fertility rate; standardized mortality ratio

Field: Education
Study: Anderson & Sclove (1986)
Cases: 23 second-grade school pupils
X's:   IQ variables:
       Language IQ; Non-language IQ
Y's:   Reading achievement variables:
       reading achievement score before second grade;
       reading achievement score after second grade

Field: Financial Securities Analysis
Study: Chemical companies data, reported in Afifi & Clark (1987)
Cases: 30 chemical company stocks
X's:   Measures of company financial performance:
       ROR5; D/E; SALESGR5; EPS5; NPM1
Y's:   Return or potential return to stockholders:
       P/E; PAYOUTR1 (annual dividend divided by the latest
       12-mo. earnings per share)

Field: Public Administration & Health
Study: Hopkins (1969)
X's:   Housing quality variables
Y's:   Illness variables

Field: Educational Psychology
Study: Tatsuoka (1988)
X's:   Personality scales
Y's:   Achievement tests

Field: Psychological Measurement
Study: Meredith (1964)
X's:   One set of intelligence tests
Y's:   Another set of intelligence tests
Some of these examples involve three sets of variables
and so can be considered in terms of Path Analysis/Structural Equations
Modeling (SEM) in Section 9.7.
Reduction of Dimensionality
One can restrict attention to those pairs that are significantly
correlated. The number of such pairs can be appreciably less than
min{q,r}. Also, variables which have low weights in the significant
pairs can be eliminated from further study. The linear combinations can
be rotated to produce as many low coefficients as possible, to aid in
interpretation.
Hypothetical Example of Canonical Correlation
I made up 40 additional cases for the credit card
data (Hair et al., Table 4.1), making a total N of 48.
Some of the output from BMDP6M follows.
BMDP6M - CANONICAL CORRELATION ANALYSIS
VERSION: 1990 (IBM/CMS)
DATE: MARCH 3, 1998 AT 15:46:31
PROGRAM INSTRUCTIONS

/PROBLEM    TITLE IS 'Credit Card Data (like Hair, Table 3.3)'.
/INPUT      VARIABLES ARE 4.
            FORMAT IS FREE.
/VARIABLE   NAMES ARE CrdtCrds, CCexp, FamSize, FamInc.
/CANONICAL  FIRST = FamSize, FamInc.
            SECOND = CrdtCrds, CCexp.
/PRINT      MATR=CORR, LOAD, COEF.
            LINEsize=69.
/PLOT       XVAR = CNVRS1,CNVRS2.
            YVAR = CNVRF1,CNVRF2.
/END
PROBLEM TITLE IS Credit Card Data (like Hair, Table 3.3)

NUMBER OF VARIABLES TO READ . . . . . . . . . .    4

VARIABLES TO BE USED
     1 CrdtCrds    2 CCexp    3 FamSize    4 FamInc

FIRST SET OF VARIABLES
----------------------
     3 FamSize
     4 FamInc

SECOND SET OF VARIABLES
-----------------------
     1 CrdtCrds
     2 CCexp

NUMBER OF VARIABLES IN FIRST SET. . . . . . . .    2    This is r (the number of Z's).
NUMBER OF VARIABLES IN SECOND SET . . . . . . .    2    This is q (the number of Y's).
TOTAL NUMBER OF VARIABLES USED. . . . . . . . .    4
MAXIMUM NUMBER OF CANONICAL VARIABLES . . . . .    2    This is min{q,r}.
NUMBER OF CASES READ. . . . . . . . . . . . . .   48    This is N.
UNIVARIATE SUMMARY STATISTICS
-----------------------------
                                                         SMALLEST   LARGEST
                         STANDARD   SMALLEST   LARGEST   STANDARD   STANDARD
  VARIABLE      MEAN    DEVIATION      VALUE     VALUE      SCORE      SCORE
  3 FamSize    4.2500      1.4947     2.0000    8.0000      -1.51       1.17
  4 FamInc    33.5417     15.4589    14.0000   75.0000      -1.26       2.68
  1 CrdtCrds   8.8542      1.5709     4.0000   12.0000      -1.82       3.28
  2 CCexp     14.6875      8.2642     5.0000   34.0000      -1.55       3.08
CORRELATIONS
------------
                 FamSize    FamInc  CrdtCrds     CCexp
                       3         4         1         2
  FamSize   3      1.000
  FamInc    4      0.290     1.000
  CrdtCrds  1      0.813     0.340     1.000
  CCexp     2      0.370     0.930     0.490     1.000
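The eigenvalues that BMDP6M reports for this example can be recovered,
approximately, from the correlation matrix just printed. The sketch below
uses the standard result that the squared canonical correlations are the
eigenvalues of Rzz^-1 Rzy Ryy^-1 Ryz, where Rzz, Ryy and Rzy are the
blocks of the correlation matrix for the first set (FamSize, FamInc), the
second set (CrdtCrds, CCexp) and their cross-correlations. The helper
functions and variable names are illustrative assumptions, and the 2x2
case is handled by hand so that only the standard library is needed.

```python
from math import sqrt

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(m):
    """Transpose of a 2x2 matrix."""
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

# Blocks of the printed correlation matrix (values rounded to 3 decimals).
R_zz = [[1.000, 0.290], [0.290, 1.000]]   # FamSize, FamInc
R_yy = [[1.000, 0.490], [0.490, 1.000]]   # CrdtCrds, CCexp
R_zy = [[0.813, 0.370], [0.340, 0.930]]   # rows: Z's, columns: Y's

# M = Rzz^-1 Rzy Ryy^-1 Ryz; its eigenvalues are the squared
# canonical correlations.
M = mul2(mul2(inv2(R_zz), R_zy), mul2(inv2(R_yy), transpose2(R_zy)))

# Eigenvalues of a 2x2 matrix from its trace and determinant.
trace = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
root = sqrt(trace ** 2 - 4 * det)
eigenvalues = [(trace + root) / 2, (trace - root) / 2]
canonical_corrs = [sqrt(lam) for lam in eigenvalues]

for k, (lam, rho) in enumerate(zip(eigenvalues, canonical_corrs), start=1):
    print(f"lambda{k} = {lam:.4f}   R{k} = {rho:.4f}")
```

Because the printed correlations are rounded to three decimals, the
results should agree with the eigenvalues in the BMDP6M output (0.88365
and 0.64747) only to about three decimals.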
                       CANONICAL      NUMBER OF     BARTLETT'S TEST FOR
  EIGENVALUE           CORRELATION    EIGENVALUES   REMAINING EIGENVALUES

                                                    CHI-SQUARE   D.F.   TAIL PROB.
                                                        142.12      4       0.0000
  lambda1 = 0.88365    R1 = 0.94003        1             48.40      1       0.0000
  lambda2 = 0.64747    R2 = 0.80466
BARTLETT'S TEST ABOVE INDICATES THE NUMBER OF CANONICAL VARIABLES
NECESSARY TO EXPRESS THE DEPENDENCY BETWEEN THE TWO SETS OF
VARIABLES. THE NECESSARY NUMBER OF CANONICAL VARIABLES IS THE
SMALLEST NUMBER OF EIGENVALUES SUCH THAT THE TEST OF THE REMAINING
EIGENVALUES IS NOT SIGNIFICANT. FOR EXAMPLE, IF A TEST AT THE .01
LEVEL WERE DESIRED, THEN 2 VARIABLES WOULD BE CONSIDERED
NECESSARY. HOWEVER, THE NUMBER OF CANONICAL VARIABLES OF PRACTICAL
VALUE IS LIKELY TO BE SMALLER.
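The chi-square statistics in Bartlett's test can be sketched from the
eigenvalues alone. The code below assumes the common large-sample form
of the statistic, -(N - 1 - (q + r + 1)/2) times the sum of ln(1 - lambda)
over the remaining eigenvalues, with (q - k)(r - k) degrees of freedom
after k eigenvalues have been removed. Whether BMDP6M uses exactly this
multiplier is an assumption; the function name and arguments are
hypothetical.

```python
from math import log

def bartlett_chi2(eigenvalues, n_cases, q, r, k):
    """Chi-square statistic and d.f. for testing that the canonical
    correlations after the first k are all zero (assumed form)."""
    multiplier = n_cases - 1 - (q + r + 1) / 2
    stat = -multiplier * sum(log(1 - lam) for lam in eigenvalues[k:])
    df = (q - k) * (r - k)
    return stat, df

eigs = [0.88365, 0.64747]    # eigenvalues from the output above
N, q, r = 48, 2, 2           # cases, number of Y's, number of Z's

chi2_0, df_0 = bartlett_chi2(eigs, N, q, r, k=0)   # all eigenvalues
chi2_1, df_1 = bartlett_chi2(eigs, N, q, r, k=1)   # after removing one

print(f"k=0: chi-square = {chi2_0:.2f}, d.f. = {df_0}")
print(f"k=1: chi-square = {chi2_1:.2f}, d.f. = {df_1}")
```

Under this assumed form the first statistic reproduces the printed
142.12 on 4 d.f. For the second row the same multiplier gives a value
near 46.4 on 1 d.f. rather than the printed 48.40, so BMDP6M may apply a
different correction for that row, or the printed figure may have been
altered in transcription; the output itself does not say.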
COEFFICIENTS FOR CANONICAL VARIABLES FOR FIRST SET OF VARIABLES
---------------------------------------------------------------
                    CNVRF1      CNVRF2
                         1           2
  FamSize   3    -0.026753    0.698485
  FamInc    4     0.065389   -0.017082

STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES FOR FIRST SET OF
VARIABLES (THESE ARE THE COEFFICIENTS FOR THE STANDARDIZED VARIABLES)

                    CNVRF1      CNVRF2
                         1           2
  FamSize   3       -0.040       1.044
  FamInc    4        1.011      -0.264
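The two coefficient tables for the first set are linked by the standard
deviations in the univariate summary: a coefficient for a raw variable
is the standardized coefficient divided by that variable's standard
deviation. A minimal sketch of the check, with dictionary names chosen
here only for illustration:

```python
# Standard deviations from the univariate summary statistics.
sd = {"FamSize": 1.4947, "FamInc": 15.4589}

# Standardized coefficients (CNVRF1, CNVRF2) from the output.
std_coef = {"FamSize": (-0.040, 1.044), "FamInc": (1.011, -0.264)}

# Raw coefficient = standardized coefficient / standard deviation.
raw_coef = {name: tuple(c / sd[name] for c in coefs)
            for name, coefs in std_coef.items()}

for name, coefs in raw_coef.items():
    print(f"{name:8s} CNVRF1 = {coefs[0]: .6f}   CNVRF2 = {coefs[1]: .6f}")
```

The results match the printed raw coefficients up to the rounding of
the standardized coefficients to three decimals.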
COEFFICIENTS FOR CANONICAL VARIABLES FOR SECOND SET OF VARIABLES
----------------------------------------------------------------
                    CNVRS1      CNVRS2
                         1           2
  CrdtCrds  1    -0.127497    0.719246
  CCexp     2     0.172865   -0.060589

STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES FOR SECOND SET OF
VARIABLES (THESE ARE THE COEFFICIENTS FOR THE STANDARDIZED VARIABLES)

                    CNVRS1      CNVRS2
                         1           2
  CrdtCrds  1       -0.200       1.130
  CCexp     2        1.083      -0.380
CANONICAL VARIABLE LOADINGS
---------------------------
(CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
FOR FIRST SET OF VARIABLES

                    CNVRF1      CNVRF2
                         1           2
  FamSize   3        0.253       0.968
  FamInc    4        0.999       0.038
-----------------------------

CANONICAL VARIABLE LOADINGS
---------------------------
(CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
FOR SECOND SET OF VARIABLES

                    CNVRS1      CNVRS2
                         1           2
  CrdtCrds  1        0.331       0.944
  CCexp     2        0.985       0.175
------------------------------
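The loadings can be obtained from quantities already printed: within
each set, the matrix of loadings is the within-set correlation matrix
multiplied by the matrix of standardized coefficients. The sketch below
checks this for both sets; the function name and the layout of the lists
(rows are original variables, columns are canonical variables) are
assumptions made for illustration.

```python
def loadings(R, coefs):
    """Correlations of original variables with canonical variables:
    (within-set correlation matrix) x (standardized coefficients)."""
    n_vars, n_cv = len(R), len(coefs[0])
    return [[sum(R[i][k] * coefs[k][j] for k in range(n_vars))
             for j in range(n_cv)] for i in range(n_vars)]

# First set: FamSize, FamInc (correlation 0.290).
R_first = [[1.000, 0.290], [0.290, 1.000]]
std_first = [[-0.040, 1.044], [1.011, -0.264]]

# Second set: CrdtCrds, CCexp (correlation 0.490).
R_second = [[1.000, 0.490], [0.490, 1.000]]
std_second = [[-0.200, 1.130], [1.083, -0.380]]

load_first = loadings(R_first, std_first)
load_second = loadings(R_second, std_second)

for name, row in zip(["FamSize", "FamInc"], load_first):
    print(f"{name:8s} {row[0]: .3f} {row[1]: .3f}")
for name, row in zip(["CrdtCrds", "CCexp"], load_second):
    print(f"{name:8s} {row[0]: .3f} {row[1]: .3f}")
```

The computed values agree with the printed loadings to within rounding.
They also help in reading the output: the first canonical variable in
each set is dominated by one variable (FamInc, CCexp) and the second by
the other (FamSize, CrdtCrds).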
2 PLOTS ARE TO BE MADE

[Plot 1: scatterplot of CNVRF1 (variable 5, vertical axis, ticks from -.90 to 2.7) against CNVRS1 (variable 7, horizontal axis, ticks from -1.0 to 2.5).]

[Plot 2: scatterplot of CNVRF2 (variable 6, vertical axis, ticks from -1.5 to 1.5) against CNVRS2 (variable 8, horizontal axis, ticks from -2.0 to 2.5).]
______________________________________________________________________________________________________________________________
Copyright © 1999 Stanley Louis Sclove
Created: 17 October 1998   Updated: 7 November 1999
______________________________________________________________________________________________________________________________
References
Afifi, A., & Clark, V. (1990). Computer-Aided Multivariate Analysis,
2nd ed. New York: Van Nostrand-Reinhold. (Now available in a 3rd edition.)
Anderson, T.W., & Sclove, Stanley L. (1986). Statistical Analysis of
Data, 2nd ed. Palo Alto, CA: Scientific Press.
Dixon, W.J., & Massey, F.J. (1969). Introduction to Statistical
Analysis, 3rd ed. New York: McGraw-Hill.
Galle, Omer R., Gove, Walter R., & McPherson, J. Miller (1972).
Population density and pathology: What are the relations for man?
Science 176 (7 April 1972), 23-30.
Hopkins, C.E. (1969). Statistical analysis by canonical correlation:
A computer application. Health Services Research (Winter) 4, 304-312.
Hotelling, Harold (1935). The most predictable criterion.
J. Educ. Psych. 26, 139-142.
Hotelling, Harold (1936). Relations between two sets of variates.
Biometrika 28, 321-377.
Meredith, W. (1964). Canonical correlation with fallible data.
Psychometrika 29, 55-65.
Tatsuoka, M. M. (1988). Multivariate Analysis: Techniques for
Educational and Psychological Research, 2nd ed. New York: Wiley.
Waugh, F.V. (1942). Regressions between sets of variables.
Econometrica 10, 290-310.