CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C CLUSPAC: Computer Programs for Mixture-Model Clustering C C C C COPYRIGHT (C) 1991, 1992 STANLEY L. SCLOVE C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C CMS FILE: MIXPDT CLUSPAC C C VERSION 2.4 11-MAY-92 C C VERSION 2.3 22-MAR-92 C C VERSION 2.2 15-NOV-89 C C C C PROGRAMMED BY: C C C C PROF. STANLEY L. SCLOVE, PH.D. 312/996-2681 C C INFORMATION & DECISION SCIENCES DEPT. M/C 294 C C COLLEGE OF BUSINESS ADMINISTRATION C C UNIVERSITY OF ILLINOIS AT CHICAGO C C BOX 4348, CHICAGO, IL 60680 C C C C CLUSPAC is a set of programs implementing clustering C C algorithms derived under the assumption of Gaussian C C class-conditional distributions. The ISDT* programs in C C CLUSPAC are based on the so-called "classification" C C likelihood. The MIX* programs are based on the mixture- C C model likelihood. C C C C Program MIXPDT in CLUSPAC is one of the C C mixture-model programs for clustering multivariate data. C C (For univariate data the "MIX1" programs may be used.) C C MIXPDT allows different covariance matrices; C C MIXPCM assumes a common covariance matrix across C C distributions. C C C C C C Input: C C ----- C C Number of clusters (K), initial values of means, prior C C probabilities, and covariance matrices. If desired, C C program ISDTPCM.CLUSPAC can be used to obtain these initial C C values. Use program MIXPDTA.CLUSPAC to try a range of C C numbers of clusters (values of K), with automatic setting C C of initial values. C C C C Program restrictions (can be modified): C C -------------------------------------- C C N, sample size, at most 1000; C C IP, number of variables, at most 20; C C K, number of clusters, at most 29; C C ITER, maximum number of iterations, 20. C C C C C C Subroutines called: C C MATEQ, which calls MATDT C C C C IV is a work array for subroutine MATEQ. C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C C C CONTROL CARDS: C C C C DATASET TITLE C C N, IN FORMAT (2X,I4) C C IP, IN FORMAT (3X,I2) C C DATFMT, IN FORMAT (18A4), E.G., (4F4.1) C C "DATFMT" WILL ALSO BE USED FOR OUTPUT, SO ALLOW AT LEAST C C ONE BLANK AT THE BEGINNING FOR CARRIAGE CONTROL. C C DATA, ONE CASE AT A TIME, IN FORMAT SPECIFIED BY DATFMT C C K, NUMBER OF CLUSTERS, IN FORMAT (2X,I2) C C MEANFT, in format (18A4) C C K INITIAL MEANS, IN FORMAT SPECIFIED BY MEANFT C C C C K INITIAL VALUES OF MIXING PROBABILITIES, C C IN FORMAT (5X,F3.2). C C C C COVFMT, in format (18A4). C C Initial value of SIGMA(1,JV,JW), JV,JW=1,IP, cov mx of Pop 1 C C Initial value of SIGMA(2,JV,JW), JV,JW=1,IP, cov mx of Pop 2 C C . C C . C C . C C Initial value of SIGMA(K,JV,JW), JV,JW=1,IP, cov mx of Pop K C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C C C C FLOW OF PROGRAM: C C C READ DATA AND INITIAL PARAMETER ESTIMATES. C INVERT COVARIANCE MATRICES. C PRINT INITIAL PARAMETER ESTIMATES. C C ITERATION: C COMPUTE ESTIMATED VALUES OF PROBABILITY DENSITY FUNCTIONS. C COMPUTE LIKELIHOOD AND VALUES OF MODEL-SELECTION CRITERIA. C COMPUTE CLUSTER-MEMBERSHIP PROBABILITIES. C UPDATE PARAMETER ESTIMATES (INCLUDING C INVERSE COVARIANCE MATRICES). C CLUSTER BY MAXIMUM PROBABILITY OF CLUSTER MEMBERSHIP. C IF CLUSTERING HASN'T CHANGED, STOP AND PRINT RESULTS. C OTHERWISE DO ANOTHER ITERATION (UNLESS 20 HAVE C ALREADY BEEN DONE). C C C C C C WRITE PROGRAM INFORMATION. C C READ SAMPLE SIZE, N. C READ NUMBER OF VARIABLES, IP. C C READ DATA FORMAT. C C READ DATA. C COMPUTE MINIMA AND MAXIMA: C WRITE MINIMA AND MAXIMA: C C READ K, NUMBER OF CLUSTERS. C C READ INITIAL MEANS C READ INPUT FORMAT FOR MEANS: C C READ INITIAL VALUES OF MIXING PROBABILITIES: C C C C READ INITIAL VALUES OF COVARIANCE MATRICES: C C C C WRITE CURRENT ESTIMATES OF PARAMETERS: C WRITE CURRENT ESTIMATES OF MEAN VECTORS:-- C WRITE CURRENT ESTIMATES OF COVARIANCE MATRICES:- C C C CALL SUBROUTINE TO COMPUTE INVERSE COVARIANCE MATRIX: C SET PARAMETERS OF SUBROUTINE CALL: C C C ON RETURN, P CONTAINS THE INVERSE MATRIX. C C C GENERAL FORM OF CALL IS: C CALL MATEQ(A,M,N,JFLG,DET,IDET,IV,NRS1,P,LL) C SEE SUBROUTINE LISTING FOR FULLER EXPLANATION. C DET(VARHAT) = DET*10.0**IDET C C ON RETURN, P CONTAINS THE INVERSE MATRIX. C C C C C C C C COMMENCE COMPUTATION OF F(I,IC), I=1,...,N, IC=1,...,K: C C COMPUTE MAHALANOBIS D-SQUARE BETWEEN THE I-TH OBSERVATION AND C THE L-TH MEAN, L=1,2,...,K, I=1,2,...,N:-- C C C IF D-SQ IS INORDINATELY LARGE, SET VALUE OF PDF TO ZERO C (IT IS EXTREMELY SMALL ANYWAY, AND THIS AVOIDS UNDERFLOW): C C C COMPUTE LOG LIKELIHOOD AND VALUES OF MODEL-SELECTION CRITERIA: C C C C C C C C COMPUTE MODEL SELECTION CRITERIA C C PARAMETERS: C K MEAN VECTORS OF DIMENSION P AND K P-BY-P COVARIANCE MATRICES, C WHERE P IS THE NUMBER OF VARIABLES, AND K-1 INDEPENDENT MIXING C PROBABILITIES. C C C COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP, C PP(IC,I), THE CONDITIONAL PROBABILITY OF POP'N IC, GIVEN X(I), C AS PR(IC)*F(I,IC)/DENOM(I): C C C C UPDATE PARAMETER ESTIMATES: C C C UPDATE CLUSTER MIXING PROBABILITIES PR(IC):-- C C UPDATE MEANS:-- C C C C C C POOL: C C C COMPUTE VARHAT, MLE OF COMMON COVARIANCE MATRIX: C (END PARAMETER-ESTIMATE UPDATE SEQUENCE) C C STORE OLD LABELS: C C C COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY: C C C C C C C C C C C C COUNT NUMBERS IN CLUSTERS: C C C PRINT FINAL RESULTS C C C C C C C C C C C C C SUBROUTINE MATEQ IS DMATEQ FROM THE UICC SUBROUTINE LIBRARY. C PAGE 1 C C SUBROUTINE DMATEQ C ***************** C THIS ROUTINE WILL SOLVE A REAL*8 SYSTEM OF LINEAR EQUATIONS,COMPUTE C THE DETERMINANT, WITHOUT UNDERFLOW OR OVERFLOW, OF A REAL*8 MATRIX, C AND/OR INVERT A REAL*8 MATRIX. C CALLING SEQUENCE: C CALL DMATEQ(A,N,IA,JFLG,DET,IDET,IV,NRS,P,IP) WHERE; C A (INPUT) - IS THE REAL*8 MATRIX ON WHICH THE ROUTINE IS C TO WORK. IN THE PROCESS OF COMPUTATION THE C CONTENTS OF THIS MATRIX ARE DESTROYED. C N (INPUT) - IS AN INTEGER*4 VARIABLE WHICH SPECIFIES THE C ORDER OF THE A MATRIX. C IA (INPUT) - IS AN INTEGER*4 VARIABLE WHICH SPECIFIES THE C ACTUAL ROW DIMENSION OF A AS DIMENSIONED IN C THE CALLING PROGRAM. IA MUST BE GREATER THAN C OR EQUAL TO N. C JFLG (OUTPUT) - IS AN INTEGER*4 RETURN CODE VARIABLE. UPON C RETURN FROM DMATEQ IF; C JFLG=0, ALL WENT WELL. C JFLG=1, THE A MATRIX WAS SINGULAR OR NEAR C SINGULAR AND THE COMPUTATIONS COULD NOT BE C COMPLETED. THE CONTENTS OF THE VARIABLES C A, DET, IDET AND P ARE MEANINGLESS. C DET (OUTPUT) - IS A REAL*8 VARIABLE WHICH CONTAINS THE C DETERMINANT OF A. (SEE IDET) C IDET (INPUT) - IS AN INTEGER*4 VARIABLE. ON INPUT IF; C IDET=0, NO DETERMINANT IS CALCULATED. C IDET NOT 0, THE DETERMINANT OF A IS COMPUTED. C ON OUTPUT IDET CONTAINS THE POWER OF 10 C THAT DET SHOULD BE MULTIPLIED BY TO GIVE THE C CORRECT VALUE OF THE DETERMINANT. I.E. C DET(A)=DET*10.0D0**IDET. C IF DET(A) CAN BE COMPUTED WITHOUT UNDER OR C OVERFLOW, THEN IDET=0 OTHERWISE IDET IS SET C TO THE PROPER VALUE SO THAT NO UNDER OR OVER- C FLOW WILL OCCUR IN COMPUTING DET. C IV (INPUT) - IS AN INTEGER*4 WORK ARRAY WHICH SHOULD BE C DIMENSIONED AT LEAST IV(N). C PAGE 2 C C NRS (INPUT) - IS AN INTEGER*4 VARIABLE WITH THE FOLLOWING C INTERPRETATION: C NRS>0, SOLVE A SYSTEM OF LINEAR EQUATIONS C WITH NRS RIGHT HAND SIDES. C NRS=0, INVERT THE A MATRIX. C NRS<0, ONLY COMPUTE THE DETERMINANT OF A. C IN THIS CASE IDET MUST BE DIFFERENT FROM 0. C P (INPUT) - IS A REAL*8 ARRAY WITH THE FOLLOWING INTER- C PRETATION: C IF NRS>0, THEN P CONTAINS THE NRS RIGHT HAND C SIDES STORED BY COLUMNS. IN THIS CASE P MUST C BE DIMENSIONED AT LEAST P(N,NRS). ON RETURN C THE COLUMNS OF P ARE REPLACED BY THE RESPEC- C TIVE SOLUTIONS. C IF NRS=0, THEN P MUST BE DIMENSIONED AT LEAST C P(N,N). ON RETURN P WILL CONTAIN THE INVERSE C OF A. C IF NRS<0,THEN P NEED ONLY BE A DUMMY VARIABLE C IN THIS CASE P IS NEVER ACCESSED BY DMATEQ. C IP (INPUT) - IS AN INTEGER*4 VARIABLE WHICH CONTAINS THE C ACTUAL ROW DIMENSION OF P AS DIMENSIONED IN C THE CALLING PROGRAM. IP MUST BE GREATER THAN C OR EQUAL TO N. C NOTE: IMMEDIATELY ON RETURN FROM DMATEQ THE CONDITION CODE FLAG, C JFLG, SHOULD BE INTERROGATED. IF JFLG=1, THEN THE ROUTINE C COULD NOT COMPUTE A SOLUTION. C METHOD - THE ALGORITHM USED IS GAUSSIAN ELIMINATION WITH PARTIAL C -1 C PIVOTING. IN ESSENCE THE ROUTINE GENERATES A MATRIX L SUCH C -1 C THAT L *A = U, WHERE U IS AN UPPER TRIANGULAR MATRIX. THEN IT C SOLVES THE SYSTEM A*X = P BY MEANS OF THE EQUIVALENT SYSTEM C -1 -1 C U*X = L *A*X = L *P BY BACK SUBSTITUTION. C -1 C THE L MATRIX CAN BE WRITTEN AS A PRODUCT OF THE FORM C -1 C L = L *P *....*L *P WHERE EACH P IS A PERMUTATION C N-1 N-1 1 1 K C MATRIX OBTAINED BY INTERCHANGING AT MOST TWO ROWS OF THE C IDENTITY MATRIX. ( THIS REPRESENTS THE INTERCHANGING OF TWO C ROWS). THE L MATRICES ARE ELIMINATION MATRICES WHICH ARE C K C CHOSEN TO INTRODUCE ZEROS IN THE LAST N-K ENTRIES OF THE K-TH C COLUMN OF THE MATRIX. C PAGE 3 C C -1 -1 C THE CALCULATIONS OF L *A AND L *P ARE DONE BY PERFORMING C THE PERMUTATIONS ON A AND P RESPECTIVELY. THE ACTUAL L AND P C K K C ARE NOT COMPUTED. C SUBROUTINES CALLED: DMATDT C REFERENCE: C G. W. STEWART, INTRODUCTION TO MATRIX COMPUTATIONS, C ACADEMIC PRESS, 1973. C C JFLG IS A TROUBLE FLAG.UPON EXIT IF JFLG=0 THEN THE MATRIX WAS PROCESS C WITHOUT TROUBLE.IF JFLG=1 EITHER THE MATRIX IS SINGULAR OR TROUBLE C OCCURED.ISIGN=-ISIGN EVERY TIME A ROW IS INTERCHANGED.THIS IS USED TO C INSURE THAT THE DETERMINANT HAS THE PROPER SIGN. C C C INSTEAD OF ACTUALLY INTERCHANGING ROWS A POINTER ARRAY IS USED TO KEEP C TRACK OF THE ROW POSITIONS. C C BEGIN ELIMINATION LOOP. C C C SEARCHING FOR LARGEST ELEMENT IN ABSOLUTE VALUE IN COLUMN K. C C C IF IFLG=0 NO ROW INTERCHANGE TOOK PLACE.IF IFLG=1 A ROW INTERCHANGE C TOOK PLACE AND THE POINTER ARRAY IV MUST BE UPDATED. C C C BEGIN ELIMINATION OF ROW BELOW IV(K).DEN IS THE PIVOT ELEMENT. C C C BEFORE ACTUALLY ELIMINATING WE CHECK TO SEE IF A(IV(IM),K) HAS C ALREADY BEEN ANNIHILATED. C C C CALCULATE ELIMINATION FACTOR. C C C WE NOW CALCULATE VALUE OF OTHER ELEMENTS IN ROW IV(IM). C C C CALCULATE VALUE OF DETERMINANT. C C C WE START SOLVING RIGHT HAND SIDES.THE SOLUTION REPLACES THE RIGHT HAND C VECTOR. C C C BEGIN BACK SUBSTITUTION. C C SUBROUTINE MATDT IS DMATDT FROM THE UICC SUBROUTINE LIBRARY.