C C ............................................................... C C C C C CLUSTER ANALYSIS OF CASES BY MIXTURE MODEL C C C C PROGRAM MIXPDTA CLUSPAC: C C C C VARYING COVARIANCE MATRICES C C AUTOMATIC SETTING OF INITIAL PARAMETER ESTIMATES C C C C Programs are organized as C C programs within packages within libraries. C C PROGRAM: MIXPDTA PACKAGE: CLUSPAC LIBRARY: ISOPAC C C C C DOCUMENTATION (on my public disk for clustering programs): C C DIRETABL CLUSPAC, CLUSPAC DOC, MIXPDTA DOC C C C C CMS FILE: MIXPDTA CLUSPAC C C VERSION 2.91 20-Apr-91 C C C C PROGRAMMED BY: C C C C DR. STANLEY L. SCLOVE 312/996-2681 C C INFORMATION AND DECISION SCIENCES DEPT. M/C 294 C C COLLEGE OF BUSINESS ADMINISTRATION C C UNIVERSITY OF ILLINOIS AT CHICAGO C C BOX 4348, CHICAGO, IL 60680 C C C C C C COPYRIGHT (C) 1991 STANLEY L. SCLOVE. ALL RIGHTS RESERVED. C C C C C C ISOPAC consists of the CLUSPAC clustering programs, C C the TSPAC time-series segmentation programs and the C C IMPAC image-segmentation programs. C C C C CLUSPAC is a set of programs implementing clustering C C algorithms derived under the assumption of Gaussian C C class-conditional distributions. The ISDT* programs in C C CLUSPAC are based on the so-called "classification" C C likelihood. The MIX* programs are based on the mixture- C C model likelihood. C C C C Programs MIXP** in the CLUSPAC package are C C mixture-model programs for clustering multivariate data. C C (For univariate data the "MIX1**" programs may be used.) C C MIXPCM and MIXPCMA assume a common covariance matrix C C across distributions. MIXPDT and MIXPDTA allow different C C covariance matrices. C C MIXPCMA and MIXPDTA try a range of numbers of clusters, C C with automatic setting of initial values of the parameters. C C C C C C RESTRICTIONS (CAN BE MODIFIED): C C C C N, SAMPLE SIZE, AT MOST 1000; C C IP, NUMBER OF VARIABLES, AT MOST 20; C C K, NUMBER OF CLUSTERS, AT MOST 29; C C ITER, MAXIMUM NUMBER OF ITERATIONS, 20. C C C C SUBROUTINE(S) CALLED: C C MATEQ, WHICH CALLS MATDT C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C DIMENSION XIDET(29),XLGDET(29) C C C IV IS A WORK ARRAY FOR SUBROUTINE MATEQ. C C C DOUBLE PRECISION XLGDET C DOUBLE PRECISION XLIKLY C C C C ............................................................... C C CONTROL CARDS: C C DATASET TITLE C N, IN FORMAT (2X,I4) C IP, IN FORMAT (3X,I2) C INPFMT, IN FORMAT (18A4), E.G., (4F4.1,1X,A6) C DATFMT, IN FORMAT (18A4), E.G., (4F4.1) C "DATFMT" WILL ALSO BE USED FOR OUTPUT, SO ALLOW C AT LEAST ONE BLANK AT THE BEGINNING FOR CARRIAGE CONTROL. C DATA, ONE CASE AT A TIME, IN FORMAT SPECIFIED BY DATFMT C ............................................................... C C INPFMT allows for a last variable which is an alphanumeric C case label. C ............................................................... C C C WRITE PROGRAM INFORMATION. C C THE FOLLOWING ARE IMSL DATE AND TIME FUNCTIONS: C C READ SAMPLE SIZE, N. C READ NUMBER OF VARIABLES, IP. C C READ INPUT FORMAT (INPUT INCLUDES DATA AND ID.) C READ DATA FORMAT. C C READ DATA. C C C C SET CONSTANTS: C C C SET INITIAL VALUES OF MIXING PROBABILITIES:-- C C XJ(IC,K), K=2 TO 9, IC = 1 TO K C OPTIMAL CLASS PROBABILITIES - NORMAL DISTRIBUTION C JOHARI & SCLOVE (1975). "PARTITIONING A DISTRIBUTION." C COMMUNICATIONS IN STATISTICS. C C C EVALUATE MODEL-SELECTION CRITERIA FOR K=1 (NO CLUSTERING): C C COMPUTE SUM VECTOR AND SUM-OF-PRODUCTS MATRIX: C WRITE SUMMARY STATISTICS FOR WHOLE SAMPLE: C C C CALL SUBROUTINE TO COMPUTE DETERMINANT OF SAMPLE COVARIANCE C MATRIX: C C C COPY VARHAT INTO COVMX TO BE USED AS MATRIX INPUT TO MATEQ C BECAUSE MATRIX INPUT IS DESTROYED BY MATEQ AND VARHAT WILL C BE NEEDED LATER: C C GENERAL FORM OF CALL IS: C CALL MATEQ(A,M,N,JFLG,DET,IDET,IV,NRS1,P,LL) C C THE INPUT MATRIX A IS DESTROYED IN THE COMPUTATION PROCESS. C ON RETURN, P CONTAINS THE INVERSE OF A. C SEE SUBROUTINE LISTING FOR FULLER EXPLANATION. C DET(VARHAT) = DET*10.0**IDET C C C C XIDET(1) = IDET C XLGDET(1) = DLOG(DET) + XIDET(1)*DLOG(10.0D+00) C WRITE (6,62000) DET,IDET C C MAX LIKELIHOOD = C (2*PI)**(-N*P/2)*(DET OF COV MX)**(-N/2)*EXP(-N*P/2) C C COMPUTE LOG OF MAX LIKELIHOOD: C C NO. OF PARAMETERS FOR UNCLUSTERED SAMPLE IS: C 1 MEAN VECTOR + 1 COV MX. C C C C C C PERFORM CLUSTERING FOR VARIOUS NUMBERS OF CLUSTERS, K: C C C C SET CONSTANTS FOR THIS VALUE OF K: C C COUNT NUMBER OF INDEPENDENT PARAMETERS:-- C K MEAN VECTORS OF DIMENSION P AND K P-BY-P COVARIANCE MATRICES, C WHERE P IS THE NUMBER OF VARIABLES, AND K-1 INDEPENDENT C MIXING PROBABILITIES. C C C C SET INITIAL VALUES OF PARAMETER ESTIMATES FOR THIS VALUE OF K: C C SET INITIAL VALUES OF PRIOR PROBABILITIES EQUAL TO THE C OPTIMAL VALUES FOR A NORMAL DISTRIBUTION:-- C C C TO SET INITIAL VALUES OF PRIOR PROBABILITIES EQUAL TO C BINOMIAL(X=IC-1;N=K-1,P=1/2). C FOR IC = 1,2,...,K, C REMOVE THE "C" FROM CC1 OF THE NEXT LINE: C PR(IC) = GAMMA(XK)/(GAMMA(XC)*GAMMA(XK-XC+1)*(2**K)) C C FOR EQUAL INITIAL VALUES OF PRIOR PROBABILITIES, C REMOVE THE "C" FROM CC1 OF THE NEXT LINE: C PR(IC) = 1.0/K . C C INITIALIZE POSTERIOR PROBABILITIES: C C C SET INITIAL MEANS EQUAL TO K OF THE OBSERVATIONS:-- C C EXAMPLE: If N=150 and K=3, the initial means will be C cases 26, 76 AND 126. C C This initialization scheme may be good when the data are C ordered with respect to an important variable. C C ALTERNATIVELY, HERE IS ANOTHER SCHEME FOR COMPUTING INITIAL MEANS: C (IN 2 DIMENSIONS THIS SCHEME RESULTS IN MEANS ORIENTED FROM C SOUTHWEST TO NORTHEAST): C DO 601 JV = 1,IP C DO 601 IG = 1,K C W = IG/(K+1.0) C XBAR(IG,JV) = (1-W)*XMIN(JV) + W*XMAX(JV) C 601 CONTINUE C C C C SET INITIAL COVARIANCE MATRICES:-- C (HERE SOME SLIGHT POSITIVE CORRELATION IS PROGRAMMED IN. C THIS COULD BE CHANGED.) C C SCALE UP THIS CORRELATION MATRIX TO A COVARIANCE MATRIX C USING THE WHOLE-SAMPLE VARIANCES: C COMPUTE INITIAL INVERSE COVARIANCE MATRIX: C C C C COPY VARHAT INTO COVMX TO BE USED AS MATRIX INPUT TO MATEQ C BECAUSE MATRIX INPUT IS DESTROYED BY MATEQ AND VARHAT WILL C BE NEEDED LATER: C C C GENERAL FORM OF CALL IS: C CALL MATEQ(A,M,N,JFLG,DET,IDET,IV,NRS1,P,LL) C SEE SUBROUTINE LISTING FOR FULLER EXPLANATION. C DET(VARHAT) = DET*10.0**IDET C C C C SET INITIAL PRECISION MATRICES FOR THIS K: C WRITE(6,11500) DET,IDET C XIDET(IC) = IDET C XLGDET(IC) = DLOG(DET) + XIDET(IC)*DLOG(10.0D+00) C END PARAMETER INITIALIZATION FOR THIS K; C BEGIN ITERATIONS: C C C C C COMMENCE COMPUTATION OF F(I,IC), I=1,...,N, IC=1,...,K: C C COMPUTE MAHALANOBIS D-SQUARE BETWEEN THE I-TH OBSERVATION AND C THE L-TH MEAN, L=1,2,...,K, I=1,2,...,N:-- C C IF D-SQ IS INORDINATELY LARGE, SET VALUE OF PDF TO ZERO C (IT IS EXTREMELY SMALL ANYWAY, AND THIS AVOIDS UNDERFLOW): C C F(I,L) = DEXP(-ZSQ/2.0)/((2*PI)**(IP/2)*DSQRT(TRUDET(L))) C C COMPUTE LOG LIKELIHOOD AND VALUES OF MODEL-SELECTION CRITERIA: C C XLIKLY = 1.0 C C XLIKLY = XLIKLY*SUMPXF C C SUMLNF = DLOG(XLIKLY) C C C C COMPUTE MODEL SELECTION CRITERIA C C C COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP, C PP(IC,I), THE CONDITIONAL PROBABILITY OF POP'N IC, GIVEN X(I), C AS PR(IC)*F(I,IC)/DENOM(I): C C IF DENOM(I)=0, PP(IC,I) IS SET EQUAL TO OLD PP(IC,I) SO THAT C THE I-TH CASE WILL STILL HAVE WEIGHT IN SUCCEEDING C COMPUTATIONS: C C C C C C C C UPDATE PARAMETER ESTIMATES: C C C UPDATE CLUSTER PRIOR PROBABILITIES PR(IC):-- C IF MIXING PROBABILITY IS ZERO FOR ANY CLUSTER, STOP. C C UPDATE MEANS:-- C IF (N*PR(IG) .LT. 0.5) GO TO 2050 C IF EXP.NO. OF OBSERVATIONS FOR A CLUSTER IS LESS THAN 0.5, C END COMPUTATION FOR THIS NUMBER OF CLUSTERS C AND GO ON TO NEXT VALUE OF K: C C C C C C C C POOL: C C C COMPUTE VARHAT, MLE OF COMMON COVARIANCE MATRIX: C C UPDATE PRECISION MATRICES: C C C COPY VARHAT INTO COVMX TO BE USED AS MATRIX INPUT TO MATEQ C BECAUSE MATRIX INPUT IS DESTROYED BY MATEQ AND VARHAT WILL C BE NEEDED LATER: C COMPUTE INVERSE OF POOLED COVARIANCE MATRIX:-- C C C UPDATE PRECISION MATRICES: C C C C C WRITE(6,11500) DET,IDET C XIDET(IC) = IDET C XLGDET(IC) = DLOG(DET) + XIDET(IC)*DLOG(10.0D+00) C IF DET=0, REPLACE COV MX WITH COMMON COV MX: C C C (END PARAMETER-ESTIMATE UPDATE SEQUENCE) C C C STORE OLD LABELS: C C C COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY: C C C C C C C C IF NEW CLUSTERING IS SAME AS OLD, NO FURTHER ITERATION IS C NECESSARY AND RESULTS FOR THIS VALUE OF K WILL BE PRINTED. C OTHERWISE, FURTHER ITERATIONS WILL BE PERFORMED. C C C COUNT NUMBERS IN CLUSTERS: C C UNLESS ALREADY 20 ITERATIONS, GO TO 700 AND BEGIN ANOTHER: C C C C C C C LIST CLUSTER MEMBERSHIP OF EACH CASE: C C C C C C C COMPUTE VALUES OF MODEL-SELECTION CRITERIA: C C C C C C GO ON TO NEXT NUMBER OF CLUSTERS (NEXT VALUE OF K), C IF NOT FINISHED. C C C C C SCHPPH(K) = PR(H(K))* C CONST*(2*PI)**(IPARM(K)/2.0)*EXP(-SCHVEC(K)/2.0)*PRIOR PDF(K); C HERE PR(H(K)) IS TAKEN TO BE UNIFORM. C AICPPH(K) = PR(H(K))* C CONST*(2*PI)**(IPARM(K)/2.0)*EXP(-SCHVEC(K)/2.0)*PRIOR PDF(K); C HERE PR(H(K)) IS TAKEN TO BE Const*EXP(-NOPARM(K)). C APPROXIMATE PRIOR PDF: C ADD CONSTANT SCHVEC(1)/2 OR AICVEC(1)/2 TO PREVENT UNDERFLOW: C APPROXIMATE PRIOR PDF: C ADD CONSTANT SCHVEC(1)/2 OR AICVEC(1)/2 TO PREVENT UNDERFLOW: C AT THIS POINT, WE HAVE THE LOGS OF THE PPHS; WE NOW CONVERT C THEM TO THE PPHS: C C C WRITE VALUES OF MSC'S FOR VARIOUS K: C C C C C C WRITE(6,46000) K,XM2LL(K),IPARM(K),AICVEC(K),AICPPH(K), C XSCHVEC(K),SCHPPH(K) C 9 XXXX.X 143 XXXX.X .XX XXXX.X .XX C C C C C SUBROUTINE MATEQ IS DMATEQ FROM THE UICC SUBROUTINE LIBRARY. C PAGE 1 C C SUBROUTINE DMATEQ C ***************** C THIS ROUTINE WILL SOLVE A REAL*8 SYSTEM OF LINEAR EQUATIONS,COMPUTE C THE DETERMINANT, WITHOUT UNDERFLOW OR OVERFLOW, OF A REAL*8 MATRIX, C AND/OR INVERT A REAL*8 MATRIX. C CALLING SEQUENCE: C CALL DMATEQ(A,N,IA,JFLG,DET,IDET,IV,NRS,P,IP) WHERE; C A (INPUT) - IS THE REAL*8 MATRIX ON WHICH THE ROUTINE IS C TO WORK. IN THE PROCESS OF COMPUTATION THE C CONTENTS OF THIS MATRIX ARE DESTROYED. C N (INPUT) - IS AN INTEGER*4 VARIABLE WHICH SPECIFIES THE C ORDER OF THE A MATRIX. C IA (INPUT) - IS AN INTEGER*4 VARIABLE WHICH SPECIFIES THE C ACTUAL ROW DIMENSION OF A AS DIMENSIONED IN C THE CALLING PROGRAM. IA MUST BE GREATER THAN C OR EQUAL TO N. C JFLG (OUTPUT) - IS AN INTEGER*4 RETURN CODE VARIABLE. UPON C RETURN FROM DMATEQ IF; C JFLG=0, ALL WENT WELL. C JFLG=1, THE A MATRIX WAS SINGULAR OR NEAR C SINGULAR AND THE COMPUTATIONS COULD NOT BE C COMPLETED. THE CONTENTS OF THE VARIABLES C A, DET, IDET AND P ARE MEANINGLESS. C DET (OUTPUT) - IS A REAL*8 VARIABLE WHICH CONTAINS THE C DETERMINANT OF A. (SEE IDET) C IDET (INPUT) - IS AN INTEGER*4 VARIABLE. ON INPUT IF; C IDET=0, NO DETERMINANT IS CALCULATED. C IDET NOT 0, THE DETERMINANT OF A IS COMPUTED. C ON OUTPUT IDET CONTAINS THE POWER OF 10 C THAT DET SHOULD BE MULTIPLIED BY TO GIVE THE C CORRECT VALUE OF THE DETERMINANT. I.E. C DET(A)=DET*10.0D0**IDET. C IF DET(A) CAN BE COMPUTED WITHOUT UNDER OR C OVERFLOW, THEN IDET=0 OTHERWISE IDET IS SET C TO THE PROPER VALUE SO THAT NO UNDER OR OVER- C FLOW WILL OCCUR IN COMPUTING DET. C IV (INPUT) - IS AN INTEGER*4 WORK ARRAY WHICH SHOULD BE C DIMENSIONED AT LEAST IV(N). C PAGE 2 C C NRS (INPUT) - IS AN INTEGER*4 VARIABLE WITH THE FOLLOWING C INTERPRETATION: C NRS>0, SOLVE A SYSTEM OF LINEAR EQUATIONS C WITH NRS RIGHT HAND SIDES. C NRS=0, INVERT THE A MATRIX. C NRS<0, ONLY COMPUTE THE DETERMINANT OF A. C IN THIS CASE IDET MUST BE DIFFERENT FROM 0. C P (INPUT) - IS A REAL*8 ARRAY WITH THE FOLLOWING INTER- C PRETATION: C IF NRS>0, THEN P CONTAINS THE NRS RIGHT HAND C SIDES STORED BY COLUMNS. IN THIS CASE P MUST C BE DIMENSIONED AT LEAST P(N,NRS). ON RETURN C THE COLUMNS OF P ARE REPLACED BY THE RESPEC- C TIVE SOLUTIONS. C IF NRS=0, THEN P MUST BE DIMENSIONED AT LEAST C P(N,N). ON RETURN P WILL CONTAIN THE INVERSE C OF A. C IF NRS<0,THEN P NEED ONLY BE A DUMMY VARIABLE C IN THIS CASE P IS NEVER ACCESSED BY DMATEQ. C IP (INPUT) - IS AN INTEGER*4 VARIABLE WHICH CONTAINS THE C ACTUAL ROW DIMENSION OF P AS DIMENSIONED IN C THE CALLING PROGRAM. IP MUST BE GREATER THAN C OR EQUAL TO N. C NOTE: IMMEDIATELY ON RETURN FROM DMATEQ THE CONDITION CODE FLAG, C JFLG, SHOULD BE INTERROGATED. IF JFLG=1, THEN THE ROUTINE C COULD NOT COMPUTE A SOLUTION. C METHOD - THE ALGORITHM USED IS GAUSSIAN ELIMINATION WITH PARTIAL C -1 C PIVOTING. IN ESSENCE THE ROUTINE GENERATES A MATRIX L SUCH C -1 C THAT L *A = U, WHERE U IS AN UPPER TRIANGULAR MATRIX. THEN IT C SOLVES THE SYSTEM A*X = P BY MEANS OF THE EQUIVALENT SYSTEM C -1 -1 C U*X = L *A*X = L *P BY BACK SUBSTITUTION. C -1 C THE L MATRIX CAN BE WRITTEN AS A PRODUCT OF THE FORM C -1 C L = L *P *....*L *P WHERE EACH P IS A PERMUTATION C N-1 N-1 1 1 K C MATRIX OBTAINED BY INTERCHANGING AT MOST TWO ROWS OF THE C IDENTITY MATRIX. ( THIS REPRESENTS THE INTERCHANGING OF TWO C ROWS). THE L MATRICES ARE ELIMINATION MATRICES WHICH ARE C K C CHOSEN TO INTRODUCE ZEROS IN THE LAST N-K ENTRIES OF THE K-TH C COLUMN OF THE MATRIX. C PAGE 3 C C -1 -1 C THE CALCULATIONS OF L *A AND L *P ARE DONE BY PERFORMING C THE PERMUTATIONS ON A AND P RESPECTIVELY. THE ACTUAL L AND P C K K C ARE NOT COMPUTED. C SUBROUTINES CALLED: DMATDT C REFERENCE: C G. W. STEWART, INTRODUCTION TO MATRIX COMPUTATIONS, C ACADEMIC PRESS, 1973. C C JFLG IS A TROUBLE FLAG.UPON EXIT IF JFLG=0 THEN THE MATRIX WAS PROCESS C WITHOUT TROUBLE.IF JFLG=1 EITHER THE MATRIX IS SINGULAR OR TROUBLE C OCCURED.ISIGN=-ISIGN EVERY TIME A ROW IS INTERCHANGED.THIS IS USED TO C INSURE THAT THE DETERMINANT HAS THE PROPER SIGN. C C C INSTEAD OF ACTUALLY INTERCHANGING ROWS A POINTER ARRAY IS USED TO KEEP C TRACK OF THE ROW POSITIONS. C C BEGIN ELIMINATION LOOP. C C C SEARCHING FOR LARGEST ELEMENT IN ABSOLUTE VALUE IN COLUMN K. C C C IF IFLG=0 NO ROW INTERCHANGE TOOK PLACE.IF IFLG=1 A ROW INTERCHANGE C TOOK PLACE AND THE POINTER ARRAY IV MUST BE UPDATED. C C C BEGIN ELIMINATION OF ROW BELOW IV(K).DEN IS THE PIVOT ELEMENT. C C C BEFORE ACTUALLY ELIMINATING WE CHECK TO SEE IF A(IV(IM),K) HAS ALREADY C BEEN ANIHILATED. C C C CALCULATE ELIMINATION FACTOR. C C C WE NOW CALCULATE VALUE OF OTHER ELEMENTS IN ROW IV(IM). C C C CALCULATE VALUE OF DETERMINANT. C C C WE START SOLVING RIGHT HAND SIDES.THE SOLUTION REPLACES THE RIGHT HAND C VECTOR. C C C BEGIN BACK SUBSTITUTION. C C SUBROUTINE MATDT IS DMATDT FROM THE UICC SUBROUTINE LIBRARY.