University of Illinois at Chicago
School of Public Health
Division of Epidemiology & Biostatistics
BSTT 580 Applied Multivariate Statistical Analysis
Professor Stan Sclove
Textbook Johnson & Wichern, 4th ed.

Notes on JW Ch. 10:  Canonical Correlation Analysis

CONTENTS

10.1.  Introduction
10.2.  Canonical Variates and Canonical Correlations
10.3.  Interpreting the Population Canonical Variables
10.4.  The Sample Canonical Variates and Sample Canonical Correlations
10.5.  Additional Sample Descriptive Measures
10.6.  Large Sample Inferences
       References

10.1.     Introduction

Your dataset consists of two sets of variables. How can you best represent them in terms of a few dimensions?

Canonical Correlation Analysis (CCA) deals with situations in which the variables fall into two subsets, say q Y's and r Z's. Often the Y's are to be predicted from the Z's. Thus, CCA is a dependence rather than an interdependence technique.

What is Canonical Correlation?
Examples.
     (i)   In the dataset IQRA DAT, the q = 2 variables RA1 (Y1) and RA2 (Y2), reading achievement before and after second grade, are to be explained in terms of the r = 2 variables Language IQ (Z1) and Non-Language IQ (Z2) (Anderson and Sclove 1986).
     (ii)   In the dataset HEART DAT (Dixon and Massey, Table 2-2a), the q = 2 variables Systolic BP (Y1) and Diastolic BP (Y2) are to be predicted using the r = 3 variables Age (Z1), Weight (Z2) and Height (Z3).


 
TABLE.  Dataset:  Two sets of variables.  The X's are partitioned into two sets, Y's and Z's.
                                 VARIABLE
                  -------------------------------------------------
                  X1   X2   . . .                                Xp
                 --------------------------------------------------
                  Y1   Y2 ...  Yk  ... Yq    Z1   Z2  ... Zm ... Zr     (q+r = p)
                 --------------------------------------------------
             1    y11  y12 ... y1k ... y1q   z11 z12 ... z1m ... z1r
             2    y21  y22 ... y2k ... y2q   z21 z22 ... z2m ... z2r
                                        . . .
    CASE     j    yj1  yj2 ... yjk ... yjq   zj1 zj2 ... zjm ... zjr
                                        . . .
             n    yn1  yn2 ... ynk ... ynq   zn1 zn2 ...  znm ... znr
                  --------------------------------------------------

The question is the extent to which the r Z's can explain or predict the q Y's. Often the Y's are indicators of one hypothetical construct and the Z's of another; the question is then whether the Z-construct can explain the Y-construct.

When there is only one Y (q = 1), this is a multiple regression situation. Then the multiple correlation coefficient R is the ordinary correlation between Y and the BLP (best linear predictor) based on the Z's. The statistic R is a measure of the strength of the relationship between Y and the Z's.

When q > 1, we find the linear combination CVY1 of the Y's that is most predictable from a linear combination of the Z's; the linear combination of the Z's that predicts it best is CVZ1. The correlation between CVY1 and CVZ1 is the first canonical correlation. If this maximum correlation is large enough, we conclude that the Y's are significantly related to the Z's; the size of this correlation is a measure of the strength of the relationship. Also, a 2D plot is obtained by plotting CVY1 vs. CVZ1. This is a plot of a Y-index vs. a Z-index.
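
As a small numerical illustration of the q = 1 case, the following sketch (Python with NumPy) fits the least-squares predictor of Y from the Z's and checks that the ordinary correlation between Y and the fitted values equals the square root of the usual regression R-squared. The simulated data, the seed and the coefficient values are made up for illustration; they do not come from any dataset in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
n, r = 200, 3                             # hypothetical sample size and number of Z's
Z = rng.normal(size=(n, r))
y = 1.0 + Z @ np.array([0.8, -0.5, 0.0]) + rng.normal(size=n)

# Sample version of the BLP: least-squares fit of y on an intercept and the Z's
X = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Multiple correlation R = ordinary correlation between y and its fitted BLP
R = np.corrcoef(y, y_hat)[0, 1]

# R-squared computed the usual way, as 1 - SSE/SST
r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(R, 4), round(float(np.sqrt(r_squared)), 4))   # the two values agree
```

The agreement holds because the fit includes an intercept; without one, the correlation between Y and the fitted values need not equal the square root of 1 - SSE/SST.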


Canonical correlation analysis proceeds further by finding successive pairs (CVZ2,CVY2), (CVZ3,CVY3), etc., of linear combinations. Each new pair is uncorrelated with the pairs found before it. The number of such pairs is min{q,r}. For the Heart Data there are two Y's and three Z's, so this would be min{2,3} = 2.
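
The successive pairs can be computed in several equivalent ways. The sketch below (Python with NumPy) uses one common route, whitening each set with a Cholesky factor and taking the singular value decomposition of the whitened cross-covariance matrix; this is an illustration under that choice, not the algorithm of any particular package. The function name canonical_correlations and the simulated data are hypothetical.

```python
import numpy as np

def canonical_correlations(Y, Z):
    """Sample canonical correlations and coefficients for Y (n x q) and Z (n x r)."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    Syy = Yc.T @ Yc / (n - 1)
    Szz = Zc.T @ Zc / (n - 1)
    Syz = Yc.T @ Zc / (n - 1)
    # Inverse Cholesky factors turn each set into uncorrelated, unit-variance variables
    Ly_inv = np.linalg.inv(np.linalg.cholesky(Syy))
    Lz_inv = np.linalg.inv(np.linalg.cholesky(Szz))
    # Singular values of the whitened cross-covariance are the canonical correlations
    U, s, Vt = np.linalg.svd(Ly_inv @ Syz @ Lz_inv.T, full_matrices=False)
    A = Ly_inv.T @ U        # columns: coefficients of CVY1, CVY2, ...
    B = Lz_inv.T @ Vt.T     # columns: coefficients of CVZ1, CVZ2, ...
    return s, A, B

# Simulated data with q = 2 Y's and r = 3 Z's (made up for illustration)
rng = np.random.default_rng(1)
n, q, r = 300, 2, 3
Z = rng.normal(size=(n, r))
Y = Z[:, :2] @ np.array([[0.9, 0.2], [0.1, 0.7]]) + rng.normal(size=(n, q))

s, A, B = canonical_correlations(Y, Z)
CVY = (Y - Y.mean(axis=0)) @ A
CVZ = (Z - Z.mean(axis=0)) @ B

# Correlations among (CVY1, CVY2, CVZ1, CVZ2)
C = np.corrcoef(np.column_stack([CVY, CVZ]), rowvar=False)
print(len(s))            # number of pairs, min{q, r}
print(np.round(C, 3))
```

In the printed correlation matrix, the only nonzero off-diagonal entries are the canonical correlations, linking CVY1 with CVZ1 and CVY2 with CVZ2; all other pairs of canonical variates are uncorrelated.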


 
 
TABLE.   Examples of some studies with two sets of variables
Field        Medicine
Study        Los Angeles Heart Study; see Dixon & Massey (1969)
Z's          Personal physical characteristics:
                Age; Weight; Height
Y's          Blood pressure variables:
                Systolic BP; Diastolic BP
Field        Business
Study        HATCO data of Hair et al. (see esp. Table 8.1, p. 329)
Z's          Measures of customer characteristics:
                Family size; family income
Y's          Measures of credit usage:
                Number of credit cards held by the family
                Ave. monthly dollar expenditures on all credit cards
Field        Public Health
Study        Chronic Depression Study, reported in Afifi & Clark(1987)
Z's          personal social and financial variables
                Gender; Age; Education; Income
Y's          health variables:
                CESD (an index of chronic depression);
                Perceived physical health
Field/Study  Agriculture/Waugh (1942)
Z's          characteristics of wheat
                texture measure; density; protein content;
                % kernels damaged; % foreign matter
Y's          characteristics of the resulting flour:
                crude protein content; wheat per barrel of flour;
                ash in flour
Field/Study  Sociology/Galle, Gove & McPherson (1972)
Cases        75 community areas of Chicago
Z's          Population density variables:
                  persons per acre
                  persons per room
                  rooms per housing unit
                  housing units per structure
                  structures per acre
Y's          Social pathology variables:
                  juvenile delinquency rate
                  public assistance rate
                  rate of admissions to mental hospitals
                  general fertility rate
                  standardized mortality ratio
Field/Study  Education/Anderson & Sclove (1986)
Cases        23 second-grade school pupils
Z's          IQ variables:
                  Language IQ; Non-language IQ
Y's          Reading achievement variables:
                  reading achievement score before second grade
                  reading achievement score after second grade
Field        Financial Securities Analysis
Study        Chemical companies data, reported in Afifi & Clark (1987)
Cases        30 chemical company stocks
Z's          Measures of company financial performance:
                ROR5; D/E; SALESGR5; EPS5; NPM1
Y's          Return or potential return to stockholders:
                P/E; PAYOUTR1
                (annual dividend divided by the latest 12-mo. earnings per share)
Field/Study  Public Administration & Health Study/Hopkins (1969)
Z's          housing quality variables
Y's          illness variables
Field/Study  Educational Psychology/Tatsuoka (1988)
Z's          personality scales
Y's          achievement tests
Field/Study  Psychological Measurement/Meredith (1964)
Z's          one set of intelligence tests
Y's          another set of intelligence tests

Some of these examples involve three sets of variables and so can be considered in terms of Path Analysis/Structural Equations
Modeling (SEM) in Section 9.7.

Reduction of Dimensionality

One can restrict attention to those pairs that are significantly correlated. The number of such pairs can be appreciably less than min{q,r}. Also, variables that have low weights in the significant pairs can be eliminated from further study. The linear combinations can be rotated to produce as many low coefficients as possible, to aid in interpretation.
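
One way to decide how many pairs are significantly correlated is Bartlett's sequential chi-square test. The sketch below (plain Python) assumes the usual form of the statistic: for the hypothesis that all eigenvalues after the first k are zero, chi-square = -(N - 1 - (q + r + 1)/2) times the sum of ln(1 - lambda_i) over the remaining eigenvalues, with (q - k)(r - k) d.f. The function names chi2_sf and bartlett_sequence are hypothetical helpers, and the formula is an assumption about the test, not a description of how any particular program computes it.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi-square_df > x) for integer df.

    Uses the recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1), h = x / 2.
    """
    h = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-h)                 # Q(1, h)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(h))      # Q(1/2, h)
    while a < df / 2.0:
        tail += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return tail

def bartlett_sequence(eigenvalues, n_cases, q, r):
    """Return (k, chi-square, d.f., tail prob.) for k = 0, 1, ... eigenvalues removed."""
    mult = n_cases - 1 - (q + r + 1) / 2.0
    out = []
    for k in range(len(eigenvalues)):
        stat = -mult * sum(math.log(1.0 - lam) for lam in eigenvalues[k:])
        df = (q - k) * (r - k)
        out.append((k, stat, df, chi2_sf(stat, df)))
    return out

# Eigenvalues (squared canonical correlations) and N from the credit card example in these notes
results = bartlett_sequence([0.88365, 0.64747], n_cases=48, q=2, r=2)
for k, stat, df, p in results:
    print(f"{k} eigenvalue(s) removed: chi-square = {stat:.2f}, d.f. = {df}, tail prob. = {p:.4f}")
```

With these inputs the overall statistic agrees with the 142.12 on 4 d.f. printed in the BMDP output for the credit card example. For the remaining eigenvalue the formula gives roughly 46.4 on 1 d.f., close to but not identical with the printed 48.40, so the program's computation at that step may differ from the form assumed here. Either way, both tests are significant at the .01 level, so both pairs are retained.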

Hypothetical Example of Canonical Correlation

I made up 40 additional cases for the credit card data (Hair et al., Table 4.1), making a total N of 48.
Some of the output from  BMDP6M follows.
 BMDP6M - CANONICAL CORRELATION ANALYSIS
 VERSION: 1990   (IBM/CMS)       DATE:   MARCH  3, 1998  AT 15:46:31
 PROGRAM INSTRUCTIONS
 /PROBLEM       TITLE IS 'Credit Card Data (like Hair, Table 3.3)'.
 /INPUT          VARIABLES ARE 4.
                FORMAT IS FREE.
 /VARIABLE      NAMES ARE CrdtCrds, CCexp, FamSize, FamInc.
 /CANONICAL     FIRST = FamSize, FamInc.
                SECOND = CrdtCrds, CCexp.
 /PRINT         MATR=CORR, LOAD, COEF.
                 LINEsize=69.
 /PLOT          XVAR = CNVRS1,CNVRS2.
                YVAR = CNVRF1,CNVRF2.
 /END
 PROBLEM TITLE IS Credit Card Data (like Hair, Table 3.3)
 NUMBER OF VARIABLES TO READ . . . . . . . . . .       4
 VARIABLES TO BE USED
      1 CrdtCrds    2 CCexp       3 FamSize     4 FamInc
 FIRST  SET OF VARIABLES
 ----------------------
    3 FamSize     4 FamInc
 SECOND SET OF VARIABLES
 -----------------------
    1 CrdtCrds    2 CCexp
 NUMBER OF VARIABLES IN FIRST  SET. . . . . . . .      2   This is r (the Z's).
 NUMBER OF VARIABLES IN SECOND SET . . . . . . .       2   This is q (the Y's).
 TOTAL NUMBER OF VARIABLES USED. . . . . . . . .       4
 MAXIMUM NUMBER OF CANONICAL VARIABLES . . . . .       2   This is min{q,r}.
 NUMBER OF CASES READ. . . . . . . . . . . . . .      48  This is N.
 UNIVARIATE SUMMARY STATISTICS
 -----------------------------
                                                  SMALLEST   LARGEST
                       STANDARD SMALLEST  LARGEST STANDARD  STANDARD
    VARIABLE    MEAN  DEVIATION    VALUE    VALUE    SCORE     SCORE
   3 FamSize   4.2500   1.4947    2.0000    8.0000   -1.51      1.17
   4 FamInc   33.5417  15.4589   14.0000   75.0000   -1.26      2.68
   1 CrdtCrds  8.8542   1.5709    4.0000   12.0000   -1.82      3.28
   2 CCexp    14.6875   8.2642    5.0000   34.0000   -1.55      3.08
 CORRELATIONS
 ------------
              FamSize  FamInc   CrdtCrds CCexp
                    3        4        1        2
 FamSize    3    1.000
 FamInc     4    0.290    1.000
 CrdtCrds   1    0.813    0.340    1.000
 CCexp      2    0.370    0.930    0.490    1.000
                    CANONICAL     NUMBER OF   BARTLETT'S TEST FOR
      EIGENVALUE  CORRELATION   EIGENVALUES  REMAINING EIGENVALUES
                                               CHI-           TAIL
                                             SQUARE   D.F.   PROB.
                                             142.12      4  0.0000
lambda1= 0.88365  R1= 0.94003             1   48.40      1  0.0000
lambda2= 0.64747  R2= 0.80466
 BARTLETT'S TEST ABOVE INDICATES THE NUMBER OF CANONICAL VARIABLES
 NECESSARY TO EXPRESS THE DEPENDENCY BETWEEN THE TWO SETS OF
 VARIABLES.  THE NECESSARY NUMBER OF CANONICAL VARIABLES IS THE
 SMALLEST NUMBER OF EIGENVALUES SUCH THAT THE TEST OF THE REMAINING
 EIGENVALUES IS NOT SIGNIFICANT.  FOR EXAMPLE, IF A TEST AT THE .01
 LEVEL WERE DESIRED, THEN    2 VARIABLES WOULD BE CONSIDERED
 NECESSARY.  HOWEVER, THE NUMBER OF CANONICAL VARIABLES OF PRACTICAL
 VALUE IS LIKELY TO BE SMALLER.
 COEFFICIENTS FOR CANONICAL VARIABLES FOR FIRST  SET OF VARIABLES
 ---------------------------------------------------------------
                   CNVRF1        CNVRF2
                         1             2
 FamSize    3     -0.026753      0.698485
 FamInc     4      0.065389     -0.017082
 STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES FOR FIRST  SET OF
 VARIABLES (THESE ARE THE COEFFICIENTS FOR THE STANDARDIZED
 VARIABLES)
              CNVRF1   CNVRF2
                    1        2
 FamSize    3   -0.040    1.044
 FamInc     4    1.011   -0.264
 COEFFICIENTS FOR CANONICAL VARIABLES FOR SECOND SET OF VARIABLES
 ----------------------------------------------------------------
                   CNVRS1        CNVRS2
                         1             2
 CrdtCrds   1     -0.127497      0.719246
 CCexp      2      0.172865     -0.060589
 STANDARDIZED COEFFICIENTS FOR CANONICAL VARIABLES FOR SECOND SET OF
 VARIABLES (THESE ARE THE COEFFICIENTS FOR THE STANDARDIZED
 VARIABLES)
              CNVRS1   CNVRS2
                    1        2
 CrdtCrds   1   -0.200    1.130
 CCexp      2    1.083   -0.380
 CANONICAL VARIABLE LOADINGS
 ---------------------------
 (CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
 FOR FIRST  SET OF VARIABLES
              CNVRF1   CNVRF2
                    1        2
 FamSize    3    0.253    0.968
 FamInc     4    0.999    0.038
 -----------------------------
 CANONICAL VARIABLE LOADINGS
 ---------------------------
 (CORRELATIONS OF CANONICAL VARIABLES WITH ORIGINAL VARIABLES)
 FOR SECOND SET OF VARIABLES
              CNVRS1   CNVRS2
                    1        2
 CrdtCrds   1    0.331    0.944
 CCexp      2    0.985    0.175
 ------------------------------
     2 PLOTS ARE TO BE MADE
   NO. NAME         NO. NAME     NUMBER
     7 CNVRS1         5 CNVRF1        9
     8 CNVRS2         6 CNVRF2       10
          ......+.....+.....+.....+.....+.....+.....+.....+....
     2.7  +                                                 1 +
          -                                                   -
          -                                     1             -
          -                                                   -
          -                                              1    -
          -                                                   -
          -                                                   -
          -                                          1        -
     1.8  +                                                   +
          -                                                   -
          -                                                   -
          -                                                   -
 C        -                                                   -
 N        -                                                   -
 V        -                      1                            -
 R        -                                21                 -
 F   .90  +                                1                  +
 1        -                                                   -
          -                                                   -
          -                 1  1       2 1                    -
          -                  1                                -
          -               1                                   -
          -                                                   -
 5        -            1 1 1                                  -
     0.0  +            1 1    2 11                            +
          -                   1                               -
          -           1     1                                 -
          -       1  111                                      -
          -                                                   -
          -           1                                       -
          -              1                                    -
          -         1                                         -
    -.90  +         1                                         +
          -   2  1                                            -
          - 1  111  2                                         -
          -1   11                                             -
          ......+.....+.....+.....+.....+.....+.....+.....+....
                    -.50         .50         1.5         2.5
              -1.0         0.0         1.0         2.0
                              CNVRS1     7
          ...+....+....+....+....+....+....+....+....+....+....
          -                                                   -
     1.5  +                                 11                +
          -                                     1       1     -
          -                                                   -
          -                              12                   -
          -                                                   -
     1.0  +                   1                               +
          -                   1     1  1                      -
          -                         1       11                -
          -                         1                         -
          -                                                   -
     .50  +                               2               1   +
 C        -                   12        1                     -
 N        -                                                   -
 V        -                    1                              -
 R        -                     1   11                        -
 F   0.0  +                1                                  +
 2        -                  2   1 1                          -
          -                      2                            -
          -                11                                 -
          -                                                   -
    -.50  +                                                   +
          -                                                   -
 6        -                                                   -
          -              1                                    -
          -                                                   -
    -1.0  +                                                   +
          -                                                   -
          -      1                                            -
          -       1           11                              -
          -                                                   -
    -1.5  +    111        12                                  +
          -                                                   -
          -  1                                                -
          -           1                                       -
          -                                                   -
          ...+....+....+....+....+....+....+....+....+....+....
                -1.5      -.50       .50       1.5       2.5
           -2.0      -1.0       0.0       1.0       2.0
                              CNVRS2     8
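
The main numbers in the output above can be checked approximately from the printed correlation matrix alone. The sketch below (Python with NumPy) assumes the standard result that the squared canonical correlations are the eigenvalues of R_ZZ^{-1} R_ZY R_YY^{-1} R_YZ, and that the loadings are the within-set correlation matrix times the standardized coefficients. The variable names are hypothetical, and because the printed correlations are rounded to three decimals the results match the listing only to about two or three decimals.

```python
import numpy as np

# Correlations as printed in the BMDP output (rounded to three decimals)
R_zz = np.array([[1.000, 0.290],
                 [0.290, 1.000]])      # first set: FamSize, FamInc
R_yy = np.array([[1.000, 0.490],
                 [0.490, 1.000]])      # second set: CrdtCrds, CCexp
R_zy = np.array([[0.813, 0.370],
                 [0.340, 0.930]])      # rows: FamSize, FamInc; columns: CrdtCrds, CCexp

# Eigenvalues of R_zz^{-1} R_zy R_yy^{-1} R_yz are the squared canonical correlations
M = np.linalg.solve(R_zz, R_zy) @ np.linalg.solve(R_yy, R_zy.T)
eigenvalues = np.sort(np.linalg.eigvals(M).real)[::-1]
canonical_corr = np.sqrt(eigenvalues)
print(np.round(eigenvalues, 3), np.round(canonical_corr, 3))

# Loadings for the first set = within-set correlations times standardized coefficients
std_coef_first = np.array([[-0.040,  1.044],
                           [ 1.011, -0.264]])   # printed standardized coefficients
loadings_first = R_zz @ std_coef_first
print(np.round(loadings_first, 3))
```

The eigenvalues come out close to the printed 0.88365 and 0.64747, and the computed loadings are close to the printed values for FamSize and FamInc.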




 
 

10.2.  Canonical Variates and Canonical Correlations

 

10.3.  Interpreting the Population Canonical Variables 

 

10.4.  The Sample Canonical Variates and Sample Canonical Correlations
 
 

10.5.  Additional Sample Descriptive Measures

 

10.6.  Large Sample Inferences 

 

        References


Additional References

Afifi, A. and Clark, V. (1990).  Computer-Aided Multivariate Analysis.  2nd ed.   Van Nostrand-Reinhold,
     New York.    (Now available in a 3rd edition from Chapman & Hall.)

Anderson, T.W., & Sclove, Stanley L. (1986). Statistical Analysis of Data, 2nd ed.  Scientific Press, Palo Alto, CA.

Dixon, W.J., & Massey, F.J. (1969).  Introduction to Statistical Analysis, 3rd ed.   New York: McGraw-Hill.

Galle, Omer R., Gove, Walter R., & McPherson, J. Miller (1972).  Population density and pathology:  What are the relations for man?  Science 176, 7-April-1972, 23-30.

Hopkins, C.E. (1969).  Statistical analysis by canonical correlation:  A computer application.  Health Services Research 4 (Winter), 304-312.

Hotelling, Harold (1935).   The most predictable criterion. J. Educ. Psych. 26, 139-142.

Hotelling, Harold (1936). Relations between two sets of variates. Biometrika 28, 321-377.

Meredith, W. (1964). Canonical correlation with fallible data.   Psychometrika 29, 55-65.

Tatsuoka, M. M. (1988). Multivariate analysis: Techniques for educational and psychological research. 2nd ed.
     New York:   Wiley.

Waugh, F.V. (1942). Regressions between sets of variables.  Econometrica 10, 290-310.


Copyright © 2000 Stanley Louis Sclove 
Created    17 October 1998       Updated    22 October 2000