____________________________________________________________________________________________________________

University of Illinois at Chicago

College of Business Administration

Department of Information & Decision Sciences

____________________________________________________________________________________________________________ 
 

IDS 470:      Multivariate Statistical Analysis I

Instructor:         Prof. Stanley L. Sclove

Textbook:         Johnson and Wichern, 4/e   (JW)

____________________________________________________________________________________________________________ 
 

Notes on JW Ch. 10:  Canonical Correlation Analysis

NOTE.   This chapter is not required in IDS 470.

CONTENTS

10.1. Introduction

10.2. Canonical Variates and Canonical Correlations

10.3. Interpreting the Population Canonical Variables

10.4. The Sample Canonical Variates and Sample Canonical Correlations

10.5. Additional Sample Descriptive Measures

10.6. Large Sample Inferences

References

______________________________________________________________________________________________________________________________

 

10.1. Introduction

 

Your dataset consists of two sets of variables. How can you best represent them in terms of a few dimensions?
 

Canonical Correlation Analysis (CCA) deals with situations in which the variables fall into two subsets, say q Y's and r Z's. Often the Y's are to be predicted from the Z's; used this way, CCA is a dependence technique rather than an interdependence technique.
What is Canonical Correlation?

Examples.

     (i)   In the dataset IQRA DAT, the q = 2 variables RA1 (Y1) and RA2 (Y2), reading achievement before and after second grade, are to be explained in terms of the r = 2 variables Language IQ (Z1) and Non-Language IQ (Z2) (Anderson and Sclove 1986).
     (ii)  In the dataset HEART DAT (Dixon and Massey, Table 2-2a), the q = 2 variables Systolic BP (Y1) and Diastolic BP (Y2) are to be predicted using the r = 3 variables Age (Z1), Weight (Z2) and Height (Z3).


 
TABLE.  Dataset:  Two sets of variables
 
                                 VARIABLE
             ------------------------------------------------------
             X1   X2   . . .                                    Xp
             ------------------------------------------------------
             Y1   Y2  ... Yk  ... Yq     Z1   Z2  ... Zm  ... Zr     (q + r = p)
             ------------------------------------------------------
        1    y11  y12 ... y1k ... y1q    z11  z12 ... z1m ... z1r
        2    y21  y22 ... y2k ... y2q    z21  z22 ... z2m ... z2r
                                 . . .
CASE    j    yj1  yj2 ... yjk ... yjq    zj1  zj2 ... zjm ... zjr
                                 . . .
        n    yn1  yn2 ... ynk ... ynq    zn1  zn2 ... znm ... znr
             ------------------------------------------------------
 

The question is to what extent the r Z's can explain or predict the q Y's, that is, whether the Z-construct can explain the Y-construct.

When there is only one Y (q = 1), this is a multiple regression situation. The multiple correlation coefficient R is the ordinary correlation between Y and its BLP (best linear predictor) based on the Z's. The statistic R measures the strength of the relationship between Y and the Z's.
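In symbols, the multiple correlation can be written as follows. The notation is an assumption made here for illustration and may differ from JW's: sigma_YY is the variance of Y, sigma_ZY is the r x 1 vector of covariances between the Z's and Y, and Sigma_ZZ is the covariance matrix of the Z's.

```latex
\hat{Y}_{\mathrm{BLP}} = \mu_Y + \boldsymbol{\sigma}_{ZY}'\,\boldsymbol{\Sigma}_{ZZ}^{-1}\,(\mathbf{Z} - \boldsymbol{\mu}_Z),
\qquad
R = \operatorname{Corr}\bigl(Y,\ \hat{Y}_{\mathrm{BLP}}\bigr)
  = \sqrt{\frac{\boldsymbol{\sigma}_{ZY}'\,\boldsymbol{\Sigma}_{ZZ}^{-1}\,\boldsymbol{\sigma}_{ZY}}{\sigma_{YY}}}
```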
 
When q > 1, we find the linear combination of the Y's, CVY1, that is best predicted by a linear combination of the Z's; that best-predicting combination of the Z's is CVZ1. The correlation between CVY1 and CVZ1 is the first canonical correlation. If this maximum correlation is large enough, we conclude that the Y's are significantly related to the Z's; its size measures the strength of the relationship. Also, a 2D plot is obtained by plotting CVY1 vs. CVZ1. This is a plot of a Y-index vs. a Z-index.
 
Canonical correlation analysis proceeds further by finding successive pairs (CVZ2,CVY2), (CVZ3,CVY3), etc., of linear combinations. Each new pair is chosen to be as highly correlated as possible while being uncorrelated with all earlier pairs. The number of such pairs is min{q,r}. For the Heart Data there are two Y's and three Z's, so this is min{2,3} = 2.
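The construction just described can be summarized compactly. The symbols below are assumptions introduced for illustration (JW may use different ones): Sigma_YY and Sigma_ZZ are the covariance matrices within each set, Sigma_YZ is the matrix of covariances between the sets, and rho_k is the k-th canonical correlation.

```latex
\rho_1 = \max_{\mathbf{a},\,\mathbf{b}} \operatorname{Corr}\bigl(\mathbf{a}'\mathbf{Y},\ \mathbf{b}'\mathbf{Z}\bigr),
\qquad
CVY_k = \mathbf{a}_k'\mathbf{Y}, \quad CVZ_k = \mathbf{b}_k'\mathbf{Z},
\quad k = 1, \dots, \min\{q, r\}

\rho_1^2 \ge \rho_2^2 \ge \cdots \ge \rho_{\min\{q,r\}}^2
\ \text{ are the largest eigenvalues of }\
\boldsymbol{\Sigma}_{YY}^{-1}\,\boldsymbol{\Sigma}_{YZ}\,\boldsymbol{\Sigma}_{ZZ}^{-1}\,\boldsymbol{\Sigma}_{ZY}

\operatorname{Corr}(CVY_k, CVY_l) = \operatorname{Corr}(CVZ_k, CVZ_l) = \operatorname{Corr}(CVY_k, CVZ_l) = 0
\quad (k \ne l)
```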
 

TABLE.   Examples of some studies with two sets of variables

Field        Medicine
Study        Los Angeles Heart Study; see Dixon & Massey (1969)
Z's          Personal physical characteristics:
                Age; Weight; Height
Y's          Blood pressure variables:
                Systolic BP; Diastolic BP

Field        Business
Study        HATCO data of Hair et al. (see esp. Table 8.1, p. 329)
Z's          Measures of customer characteristics:
                Family size; family income
Y's          Measures of credit usage:
                Number of credit cards held by the family
                Average monthly dollar expenditures on all credit cards

Field        Public Health
Study        Chronic Depression Study, reported in Afifi & Clark (1987)
Z's          Personal social and financial variables:
                Gender; Age; Education; Income
Y's          Health variables:
                CESD (an index of chronic depression);
                Perceived physical health

Field/Study  Agriculture/Waugh (1942)
Z's          Characteristics of wheat:
                texture measure; density; protein content;
                % kernels damaged; % foreign matter
Y's          Characteristics of the resulting flour:
                crude protein content; wheat per barrel of flour;
                ash in flour

Field/Study  Sociology/Galle, Gove & McPherson (1972)
Cases        75 community areas of Chicago
Z's          Population density variables:
                persons per acre
                persons per room
                rooms per housing unit
                housing units per structure
                structures per acre
Y's          Social pathology variables:
                juvenile delinquency rate
                public assistance rate
                rate of admissions to mental hospitals
                general fertility rate
                standardized mortality ratio

Field/Study  Education/Anderson & Sclove (1986)
Cases        23 second-grade school pupils
Z's          IQ variables:
                Language IQ; Non-language IQ
Y's          Reading achievement variables:
                reading achievement score before second grade
                reading achievement score after second grade

Field        Financial Securities Analysis
Study        Chemical companies data, reported in Afifi & Clark (1987)
Cases        30 chemical company stocks
Z's          Measures of company financial performance:
                ROR5; D/E; SALESGR5; EPS5; NPM1
Y's          Return or potential return to stockholders:
                P/E; PAYOUTR1
                (annual dividend divided by the latest 12-mo. earnings per share)

Field/Study  Public Administration & Health Study/Hopkins (1969)
Z's          Housing quality variables
Y's          Illness variables

Field/Study  Educational Psychology/Tatsuoka (1988)
Z's          Personality scales
Y's          Achievement tests

Field/Study  Psychological Measurement/Meredith (1964)
Z's          One set of intelligence tests
Y's          Another set of intelligence tests
 
 

Some of these examples involve three sets of variables and so can be considered in terms of Path Analysis/Structural Equations Modeling (SEM); see Section 9.7.
Reduction of Dimensionality
One can restrict attention to those pairs that are significantly correlated. The number of such pairs can be appreciably less than min{q,r}. Also, variables that have low weights in the significant pairs can be eliminated from further study. The linear combinations can be rotated to produce as many low coefficients as possible, to aid in interpretation.
Hypothetical Example of Canonical Correlation
I made up 40 additional cases for the credit card data (Hair et al., Table 4.1), making a total of N = 48 cases.
Some of the output from BMDP6M follows.
 

 BMDP6M - CANONICAL CORRELATION ANALYSIS

 VERSION: 1990   (IBM/CMS)       DATE:   MARCH  3, 1998  AT 15:46:31
 PROGRAM INSTRUCTIONS
 /PROBLEM       TITLE IS 'Credit Card Data (like Hair, Table 3.3)'.
 /INPUT          VARIABLES ARE 4.
                FORMAT IS FREE.
 /VARIABLE      NAMES ARE CrdtCrds, CCexp, FamSize, FamInc.
 /CANONICAL     FIRST = FamSize, FamInc.
                SECOND = CrdtCrds, CCexp.
 /PRINT         MATR=CORR, LOAD, COEF.
                 LINEsize=69.
 /PLOT          XVAR = CNVRS1,CNVRS2.
                YVAR = CNVRF1,CNVRF2.
 /END
 PROBLEM TITLE IS Credit Card Data (like Hair, Table 3.3)
 NUMBER OF VARIABLES TO READ . . . . . . . . . .       4
 VARIABLES TO BE USED
      1 CrdtCrds    2 CCexp       3 FamSize     4 FamInc
 FIRST  SET OF VARIABLES
 ----------------------
    3 FamSize     4 FamInc
 SECOND SET OF VARIABLES
 -----------------------
    1 CrdtCrds    2 CCexp
 NUMBER OF VARIABLES IN FIRST  SET. . . . . . . .      2   This is r (the Z's).
 NUMBER OF VARIABLES IN SECOND SET . . . . . . .       2   This is q (the Y's).
 TOTAL NUMBER OF VARIABLES USED. . . . . . . . .       4
 MAXIMUM NUMBER OF CANONICAL VARIABLES . . . . .       2   This is min{q,r}.
 NUMBER OF CASES READ. . . . . . . . . . . . . .      48  This is N.
 UNIVARIATE SUMMARY STATISTICS
 -----------------------------
                                                  SMALLEST   LARGEST
                       STANDARD SMALLEST  LARGEST STANDARD  STANDARD
    VARIABLE    MEAN  DEVIATION    VALUE    VALUE    SCORE     SCORE
   3 FamSize   4.2500   1.4947    2.0000    8.0000   -1.51      1.17
   4 FamInc   33.5417  15.4589   14.0000   75.0000   -1.26      2.68
   1 CrdtCrds  8.8542   1.5709    4.0000   12.0000   -1.82      3.28
   2 CCexp    14.6875   6.2642    5.0000   34.0000   -1.55      3.08
 CORRELATIONS
 ------------
              FamSize  FamInc   CrdtCrds CCexp
                    3        4        1        2
 FamSize    3    1.000
 FamInc     4    0.290    1.000
 CrdtCrds   1    0.813    0.340    1.000
 CCexp      2    0.370    0.930    0.490    1.000
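
The squared canonical correlations are the eigenvalues of R11^-1 R12 R22^-1 R21, where R11 and R22 are the within-set correlation matrices and R12 holds the between-set correlations. The following is a minimal sketch, assuming only the correlations printed above; the helper names (inv2, mul2, canonical_correlations) are hypothetical and are not part of BMDP. Because the printed correlations are rounded to three decimals, the results should agree with the program's eigenvalues and canonical correlations only to roughly that accuracy.

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def canonical_correlations(r11, r22, r12):
    """Eigenvalues of R11^-1 R12 R22^-1 R21 and their square roots (2x2 case)."""
    r21 = [[r12[j][i] for j in range(2)] for i in range(2)]  # transpose of R12
    m = mul2(mul2(inv2(r11), r12), mul2(inv2(r22), r21))
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    half_gap = math.sqrt(trace * trace / 4 - det)
    eigenvalues = [trace / 2 + half_gap, trace / 2 - half_gap]
    return eigenvalues, [math.sqrt(e) for e in eigenvalues]

# Correlations copied from the printed matrix.
R11 = [[1.000, 0.290], [0.290, 1.000]]   # FamSize, FamInc
R22 = [[1.000, 0.490], [0.490, 1.000]]   # CrdtCrds, CCexp
R12 = [[0.813, 0.370],                   # FamSize with CrdtCrds, CCexp
       [0.340, 0.930]]                   # FamInc  with CrdtCrds, CCexp

eigenvalues, correlations = canonical_correlations(R11, R22, R12)
print([round(e, 3) for e in eigenvalues])
print([round(c, 3) for c in correlations])
```

The closed-form eigenvalue step works only because both sets have two variables here; a general routine would need a numerical eigenvalue solver.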
 

                    CANONICAL     NUMBER OF   BARTLETT'S TEST FOR
      EIGENVALUE  CORRELATION   EIGENVALUES  REMAINING EIGENVALUES
                                               CHI-           TAIL
                                             SQUARE   D.F.   PROB.
                                             142.12      4  0.0000
lambda1= 0.88365  R1= 0.94003             1   46.40      1  0.0000
lambda2= 0.64747  R2= 0.80466
 
 BARTLETT'S TEST ABOVE INDICATES THE NUMBER OF CANONICAL VARIABLES
 NECESSARY TO EXPRESS THE DEPENDENCY BETWEEN THE TWO SETS OF
 VARIABLES.  THE NECESSARY NUMBER OF CANONICAL VARIABLES IS THE
 SMALLEST NUMBER OF EIGENVALUES SUCH THAT THE TEST OF THE REMAINING
 EIGENVALUES IS NOT SIGNIFICANT.  FOR EXAMPLE, IF A TEST AT THE .01
 LEVEL WERE DESIRED, THEN    2 VARIABLES WOULD BE CONSIDERED
 NECESSARY.  HOWEVER, THE NUMBER OF CANONICAL VARIABLES OF PRACTICAL
 VALUE IS LIKELY TO BE SMALLER.
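
Bartlett's statistic can be recomputed from the eigenvalues alone. The sketch below assumes the usual form of the approximation, with multiplier N - 1 - (n_first + n_second + 1)/2 and degrees of freedom (n_first - k)(n_second - k) after removing k eigenvalues. That choice of multiplier is an assumption, adopted because it reproduces the first chi-square of 142.12; the function and argument names are hypothetical.

```python
import math

def bartlett_chi_square(eigenvalues, n_cases, n_first, n_second, skip=0):
    """Approximate chi-square for the eigenvalues remaining after the first `skip`."""
    multiplier = n_cases - 1 - (n_first + n_second + 1) / 2
    statistic = -multiplier * sum(math.log(1 - lam) for lam in eigenvalues[skip:])
    df = (n_first - skip) * (n_second - skip)
    return statistic, df

eigs = [0.88365, 0.64747]          # lambda1, lambda2 from the output
all_stat, all_df = bartlett_chi_square(eigs, 48, 2, 2)            # tests both eigenvalues
rem_stat, rem_df = bartlett_chi_square(eigs, 48, 2, 2, skip=1)    # tests lambda2 alone

print(round(all_stat, 2), all_df)
print(round(rem_stat, 2), rem_df)
```

With these inputs the first statistic comes out at about 142.12 on 4 degrees of freedom, and the statistic for the remaining eigenvalue at about 46.40 on 1 degree of freedom.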
 COEFFICIENTS FOR CANONICAL VARIABLES FOR FIRST  SET OF VARIABLES
 ---------------------------------------------------------------
                   CNVRF1        CNVRF2
                         1             2
 FamSize    3     -0.026753      0.698485
 FamInc     4      0.065389     -0.017082
 

 STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES FOR FIRST  SET OF
 VARIABLES (THESE ARE THE COEFFICIENTS FOR THE STANDARDIZED
 VARIABLES)
              CNVRF1   CNVRF2
                    1        2
 FamSize    3   -0.040    1.044
 FamInc     4    1.011   -0.264
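
Each standardized coefficient is the raw coefficient multiplied by that variable's standard deviation. A minimal check for the first set, assuming the standard deviations from the univariate summary and the raw coefficients printed above (the dictionary names are hypothetical):

```python
# Standard deviations from the univariate summary statistics.
sd = {"FamSize": 1.4947, "FamInc": 15.4589}

# Raw coefficients for (CNVRF1, CNVRF2) from the output.
raw = {
    "FamSize": (-0.026753, 0.698485),
    "FamInc": (0.065389, -0.017082),
}

# Standardized coefficient = raw coefficient * standard deviation.
standardized = {
    name: tuple(round(coef * sd[name], 3) for coef in coefs)
    for name, coefs in raw.items()
}

for name, coefs in standardized.items():
    print(name, coefs)
```

Rounded to three decimals, the products match the standardized coefficients shown for FamSize and FamInc.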
 COEFFICIENTS FOR CANONICAL VARIABLES FOR SECOND SET OF VARIABLES
 ----------------------------------------------------------------
                   CNVRS1        CNVRS2
                         1             2
 CrdtCrds   1     -0.127497      0.719246
 CCexp      2      0.172865     -0.060589
 
 

 STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES FOR SECOND SET OF
 VARIABLES (THESE ARE THE COEFFICIENTS FOR THE STANDARDIZED
 VARIABLES)
              CNVRS1   CNVRS2
                    1        2
 CrdtCrds   1   -0.200    1.130
 CCexp      2    1.083   -0.380
 CANONICAL VARIABLE LOADINGS
 ---------------------------
 (CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
 FOR FIRST  SET OF VARIABLES
              CNVRF1   CNVRF2
                    1        2
 FamSize    3    0.253    0.968
 FamInc     4    0.999    0.038
 -----------------------------
 CANONICAL VARIABLE LOADINGS
 ---------------------------
 (CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
 FOR SECOND SET OF VARIABLES
              CNVRS1   CNVRS2
                    1        2
 CrdtCrds   1    0.331    0.944
 CCexp      2    0.985    0.175
 ------------------------------
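
The loadings can be recovered from quantities already printed: for standardized variables, and canonical variables scaled to unit variance, the loading matrix is the within-set correlation matrix times the matrix of standardized coefficients. The sketch below assumes the rounded values from the output, so agreement is only to about two or three decimals; the function name loadings is hypothetical.

```python
def loadings(within_corr, std_coefs):
    """Within-set correlation matrix times the standardized coefficient matrix."""
    n_vars = len(within_corr)
    n_cvs = len(std_coefs[0])
    return [[sum(within_corr[i][j] * std_coefs[j][c] for j in range(n_vars))
             for c in range(n_cvs)]
            for i in range(n_vars)]

# First set: FamSize, FamInc (columns are CNVRF1, CNVRF2).
R11 = [[1.000, 0.290], [0.290, 1.000]]
A = [[-0.040, 1.044], [1.011, -0.264]]

# Second set: CrdtCrds, CCexp (columns are CNVRS1, CNVRS2).
R22 = [[1.000, 0.490], [0.490, 1.000]]
B = [[-0.200, 1.130], [1.083, -0.380]]

first_loadings = loadings(R11, A)
second_loadings = loadings(R22, B)

for row in first_loadings + second_loadings:
    print([round(v, 3) for v in row])
```

Because the coefficients of a canonical variable can be large while its loadings are modest (or the reverse) when the variables within a set are correlated, the two tables are usually read together.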
 

     2 PLOTS ARE TO BE MADE
   NO. NAME         NO. NAME     NUMBER
     7 CNVRS1         5 CNVRF1        9
     8 CNVRS2         6 CNVRF2       10
          ......+.....+.....+.....+.....+.....+.....+.....+....
     2.7  +                                                 1 +
          -                                                   -
          -                                     1             -
          -                                                   -
          -                                              1    -
          -                                                   -
          -                                                   -
          -                                          1        -
     1.8  +                                                   +
          -                                                   -
          -                                                   -
          -                                                   -
 C        -                                                   -
 N        -                                                   -
 V        -                      1                            -
 R        -                                21                 -
 F   .90  +                                1                  +
 1        -                                                   -
          -                                                   -
          -                 1  1       2 1                    -
          -                  1                                -
          -               1                                   -
          -                                                   -
 5        -            1 1 1                                  -
     0.0  +            1 1    2 11                            +
          -                   1                               -
          -           1     1                                 -
          -       1  111                                      -
          -                                                   -
          -           1                                       -
          -              1                                    -
          -         1                                         -
    -.90  +         1                                         +
          -   2  1                                            -
          - 1  111  2                                         -
          -1   11                                             -
          ......+.....+.....+.....+.....+.....+.....+.....+....
                    -.50         .50         1.5         2.5
              -1.0         0.0         1.0         2.0
                              CNVRS1     7
          ...+....+....+....+....+....+....+....+....+....+....
          -                                                   -
     1.5  +                                 11                +
          -                                     1       1     -
          -                                                   -
          -                              12                   -
          -                                                   -
     1.0  +                   1                               +
          -                   1     1  1                      -
          -                         1       11                -
          -                         1                         -
          -                                                   -
     .50  +                               2               1   +
 C        -                   12        1                     -
 N        -                                                   -
 V        -                    1                              -
 R        -                     1   11                        -
 F   0.0  +                1                                  +
 2        -                  2   1 1                          -
          -                      2                            -
          -                11                                 -
          -                                                   -
    -.50  +                                                   +
          -                                                   -
 6        -                                                   -
          -              1                                    -
          -                                                   -
    -1.0  +                                                   +
          -                                                   -
          -      1                                            -
          -       1           11                              -
          -                                                   -
    -1.5  +    111        12                                  +
          -                                                   -
          -  1                                                -
          -           1                                       -
          -                                                   -
          ...+....+....+....+....+....+....+....+....+....+....
                -1.5      -.50       .50       1.5       2.5
           -2.0      -1.0       0.0       1.0       2.0
                              CNVRS2     8

 
 



 
 

10.2. Canonical Variates and Canonical Correlations

10.3. Interpreting the Population Canonical Variables

10.4. The Sample Canonical Variates and Sample Canonical Correlations

10.5. Additional Sample Descriptive Measures

10.6. Large Sample Inferences

 

 

 
______________________________________________________________________________________________________________________________ 
Copyright © 1999 Stanley Louis Sclove  
Created:    17 October 1998       Updated:    7 November 1999
______________________________________________________________________________________________________________________________

 
 
 
 
 
 


References

Afifi, A., & Clark, V. (1990). Computer-Aided Multivariate Analysis, 2nd ed. New York: Van Nostrand-Reinhold. (Now available in a 3rd edition.)

Anderson, T.W., & Sclove, Stanley L. (1986). Statistical Analysis of Data, 2nd ed. Palo Alto, CA: Scientific Press.

Dixon, W.J., & Massey, F.J. (1969). Introduction to Statistical Analysis, 3rd ed. New York: McGraw-Hill.

Galle, Omer R., Gove, Walter R., & McPherson, J. Miller (1972). Population density and pathology: What are the relations for man? Science 176 (7 April 1972), 23-30.

Hopkins, C.E. (1969). Statistical analysis by canonical correlation: A computer application. Health Services Research 4 (Winter), 304-312.

Hotelling, Harold (1935). The most predictable criterion. J. Educ. Psych. 26, 139-142.

Hotelling, Harold (1936). Relations between two sets of variates. Biometrika 28, 321-377.

Meredith, W. (1964). Canonical correlation with fallible data. Psychometrika 29, 55-65.

Tatsuoka, M. M. (1988). Multivariate Analysis: Techniques for Educational and Psychological Research, 2nd ed. New York: Wiley.

Waugh, F.V. (1942). Regressions between sets of variables. Econometrica 10, 290-310.