CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C CLUSPAC: Computer Programs for Mixture-Model Clustering C C C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C ISOPAC/CLUSPAC/MIX1DTA C C C C C C THE PROGRAMS MIX* IN THE SUBLIBRARY CLUSPAC OF THE C C LIBRARY CLUSPAC ARE FOR CLUSTERING DATA BY C C ITERATIVE MAXIMIZATION OF THE MIXTURE-MODEL LIKELIHOOD C C C C N K C C --- -- C C L = | | > P(C)*F(X(I)|C), C C | | -- C C I=1 C=1 C C C C WHERE C C C C N = NUMBER OF OBSERVATIONS ("SAMPLE SIZE"), C C K = NUMBER OF CLUSTERS, C C X(I) = I-TH OBSERVATION, I = 1,2,...,N, C C F(X|C) = VALUE AT X OF THE C-TH CLASS-CONDITIONAL C C DENSITY FUNCTION (C=1,2,...,K) C C AND C C P(C) = PRIOR PROBABILITY OF CLASS C. C C C C C C REFERENCE FOR CLUSTERING BY MIXTURE MODEL: C C C C Wolfe, J. H. (1970). Pattern clustering by multivariate C C mixture analysis. Multivariate Behavioral Research 5, 329-350. C C C C C C IN THE PROGRAM NAME C C C C MIX1DTA C C C C THE "1" MEANS THAT THE PROGRAM IS FOR UNIVARIATE C C (1-DIMENSIONAL) DATA; C C C C THE "DT" MEANS THE VARIANCES ARE ALLOWED TO VARY ACROSS C C CLUSTERS; C C C C AND THE "A" MEANS THAT THERE IS AUTOMATIC SETTING OF NUMBERS C C OF CLUSTERS AND INITIAL MEANS. C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C PROGRAMMED BY C C DR. STANLEY L. SCLOVE 312/996-2681 C C DEPARTMENT OF C C INFORMATION & DECISION SCIENCES M/C 294 C C COLLEGE OF BUSINESS ADMINISTRATION C C UNIVERSITY OF ILLINOIS AT CHICAGO C C 601 S. MORGAN ST. C C CHICAGO, IL 60607-7124 C C C C C C VERSION 3.0 1992: OCT 20 C C C C COPYRIGHT (C) 1991, 1992 STANLEY LOUIS SCLOVE C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C RESTRICTIONS (CAN BE MODIFIED): C C N, SAMPLE SIZE, AT MOST 9999; C C K, NUMBER OF CLUSTERS, AT MOST 29; C C ITER, MAXIMUM NUMBER OF ITERATIONS, 20. C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C CONTROL CARDS: C C C C (1) DATASET TITLE C C (2) N, IN FORMAT (2X,I4) C C (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1). C C ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED C C FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL. C C ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT, C C WHETHER OR NOT THERE IS ONE ON INPUT. C C (4) DATA, IN FORMAT SPECIFIED BY FMT C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C C DIMENSION X(9999),XMNDSQ(9999),ICLUS(9999),IOTA(9999) DIMENSION XJ(9,9) DIMENSION DSQ(29),SUM(29) DIMENSION TITLE(18) DIMENSION NC(29),XMEAN(29),XMEANOL(29) DIMENSION BPLUS(29),BMINUS(29) C DIMENSION FMT(18) DIMENSION SS(29),SSD(29) DIMENSION SD(29) DIMENSION VAR(29) DIMENSION ICLSOL(9999) DIMENSION F(9999,29) DIMENSION P(29), XNC(29) DIMENSION PP(29,9999) DIMENSION XMXPR(9999) DIMENSION DENOM(9999) DIMENSION AICVEC(29),SCHVEC(29),XKSVEC(29) DOUBLE PRECISION SUM,SS,F,P,PP,SSD,XNC C C CONTROL CARDS: C C (1) DATASET TITLE C (2) N, IN FORMAT (2X,I4) C (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1). C ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED C FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL. C ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT, WHETHER OR NOR C THERE IS ONE ON INPUT. C (4) DATA, IN FORMAT SPECIFIED BY FMT C C READ(5,15000) TITLE C C WRITE PROGRAM INFORMATION. WRITE(6,20000) WRITE(6,21000) WRITE(6,22000) C WRITE(6,16000) TITLE C C READ SAMPLE SIZE, N. READ(5,10000) N XN = N WRITE(6, 1051) N C C READ DATA FORMAT. C Remove next statement for WATFIV version: READ(5,15000) FMT C C READ DATA AND C COMPUTE STATISTICS OF WHOLE SAMPLE: TOTAL=0.0 SUMSQS=0.0 SSDEVS=0.0 READ(5,FMT) ( X(I), I = 1,N ) C For WATFIV version: C READ(5, * ) ( X(I), I = 1,N ) C Initialize XMAX and XMIN: XMAX = X(1) XMIN = X(1) C DO 100 I = 1,N TOTAL = TOTAL + X(I) SUMSQS = SUMSQS + X(I)*X(I) IF (X(I) .LT. XMIN) XMIN=X(I) IF (X(I) .GT. XMAX) XMAX=X(I) 100 CONTINUE C C IF IT IS DESIRED TO PRINT OUT THE DATA, REMOVE THE "C"S C FROM CC1 IN THE APPROPRIATE FOLLOWING STATEMENTS: C WRITE DATA: C WRITE(6, 1009) C1009 FORMAT(1X,'DATA:'/) C WRITE(6, FMT) (X(I), I=1, N) C C WRITE SUMMARY STATISTICS FOR WHOLE SAMPLE: WRITE(6, 1151) WRITE(6,FMT) XMIN C WRITE(6, * ) XMIN C C WRITE(6, 1152) WRITE(6,FMT) XMAX C WRITE(6, * ) XMAX XBAR = TOTAL/XN SSDEVS = SUMSQS- TOTAL*TOTAL/XN VARHAT1 = SSDEVS/XN WRITE(6, 1112) XBAR WRITE(6, 1113) VARHAT1 PI = 3.141593 TEMP = 2.0*PI*SSDEVS/N XMN2LL = N*(1.0 + ALOG(TEMP)) STDDEV = SQRT(VARHAT1) WRITE(6,66000) XMN2LL WRITE(6,68000) SSDEVS, STDDEV C NO. OF PARAMETERS FOR UNCLUSTERED SAMPLE IS: C 1 MEAN + 1 VARIANCE = 2 PARAMETERS NOPARM = 1 + 1 C AIC = XMN2LL + 2.0*NOPARM WRITE(6,60000) AIC SCH = XMN2LL + ALOG(XN)*NOPARM XKASH = SCH - ALOG(2*VARHAT1**3) WRITE(6,58000) SCH WRITE(6,17000) XKASH C A TABLE OF MODEL SELECTION CRITERIA-VALUES FOR VARIOUS K C WILL BE PRINTED AT THE END. THE NEXT INSTRUCTIONS C SET UP THE FIRST ENTRIES FOR THAT TABLE. AICVEC(1) = AIC SCHVEC(1) = SCH XKSVEC(1) = XKASH C C C C SET CONSTANTS. C XJ(IC,K), K=1 TO 9, IC = 1 TO K C OPTIMAL CLASS PROBABILITIES - NORMAL DISTRIBUTION C JOHARI & SCLOVE (1975). "PARTITIONING A DISTRIBUTION." C COMMUNICATIONS IN STATISTICS. XJ(1,2)=.5 XJ(2,2)=.5 XJ(1,3)=.2703 XJ(2,3)=.4594 XJ(3,3)=.2703 XJ(1,4)=.1631 XJ(2,4)=.3369 XJ(3,4)=.3369 XJ(4,4)=.1631 XJ(1,5)=.1068 XJ(2,5)=.2444 XJ(3,5)=.2976 XJ(4,5)=.2444 XJ(5,5)=.1068 XJ(1,6)=.0739 XJ(2,6)=.1810 XJ(3,6)=.2451 XJ(4,6)=.2451 XJ(5,6)=.1810 XJ(6,6)=.0739 XJ(1,7)=.0536 XJ(2,7)=.1375 XJ(3,7)=.1986 XJ(4,7)=.2106 XJ(5,7)=.1986 XJ(6,7)=.1375 XJ(7,7)=.0536 XJ(1,8)=.0402 XJ(2,8)=.1067 XJ(3,8)=.1613 XJ(4,8)=.1918 XJ(5,8)=.1918 XJ(6,8)=.1613 XJ(7,8)=.1067 XJ(8,8)=.0402 XJ(1,9)=.0310 XJ(2,9)=.0845 XJ(3,9)=.1324 XJ(4,9)=.1643 XJ(5,9)=.1756 XJ(6,9)=.1643 XJ(7,9)=.1324 XJ(8,9)=.0845 XJ(9,9)=.0310 C C PERFORM CLUSTERING FOR K = 2 TO KUP GROUPS. C KUP = 17 KUP = 4 C DO 995 K = 2,KUP WGSS = 0.0 WRITE(6,23000) K C C COMPUTE INITIAL MEANS, EQUALLY SPACED THROUGH RANGE OF DATA, C INITIAL VALUE OF VARIANCES C AND INITIAL PRIOR PROBABILITIES (EQUAL): XK = K DO 101 IC=1,K C XC = IC C C TO spread out the means to the min and max, use: C XMEAN(IC) = XMIN + (IC-1)*(XMAX-XMIN)/(K-1) C C To center the means more, instead use: XMEAN(IC) = XMIN + IC*(XMAX-XMIN)/(K+1) C C TO SET INITIAL VALUE OF VARIANCES, USE THE VARIANCE-COMPONENTS C RELATIONSHIP C MS(TOTAL) = MS(BETWEEN) + MS(WITHIN) C WITH AN (ARBITRARY) ASSUMPTION THAT MS(WITHIN) = MS(TOTAL)/K. VAR(IC) = VARHAT1/K C SET INITIAL VALUES OF PRIOR PROBABILITIES EQUAL TO THE C OPTIMAL VALUES FOR PARTITIONING A NORMAL DISTRIBUTION: IF ( KUP .GT. 9 ) GO TO 103 P(IC) = XJ(IC,K) C TO SET INITIAL VALUES OF PRIOR PROBABILITIES EQUAL TO C BINOMIAL(X=IC-1;N=K-1,P=1/2). C FOR IC = 1,2,...,K, C REMOVE THE "C" FROM CC1 OF THE NEXT LINE: C P(IC) = GAMMA(XK)/(GAMMA(XC)*GAMMA(XK-XC+1)*(2**K)) C C FOR EQUAL INITIAL VALUES OF PRIOR PROBABILITIES: C 103 CONTINUE P(IC) = 1.0/K C C 101 CONTINUE C C C WRITE INITIAL VALUES OF PRIOR PROBS: WRITE(6,25000) C WRITE(6, 1025) ( P(IC), IC = 1, K ) C C WRITE INITIAL MEANS: WRITE(6,14000) C WRITE(6,19000) ( XMEAN(IC), IC=1, K ) C WRITE INITIAL VARIANCE: WRITE(6,24000) C WRITE(6, 1017) VAR(1) DO 105 INTGER=1,N IOTA(INTGER) = INTGER 105 CONTINUE C ITER = 1 C 601 CONTINUE C IF (ITER .EQ. 1) GO TO 560 C STORE OLD CLUSTERING: DO 565 I = 1,N ICLSOL(I) = ICLUS(I) 565 CONTINUE C STORE OLD CLUSTERING RESULTS FOR TESTING CONVERGENCE: XMN2LLOL = XMN2LL DO 850 IC = 1,K XMEANOL(IC) = XMEAN(IC) 850 CONTINUE C C COMMENCE DISTANCE COMPUTATIONS. 560 CONTINUE DO 102 I = 1,N DO 102 IC = 1,K DSQ(IC) = ( XMEAN(IC) - X(I) )**2 ZSQ = DSQ(IC)/VAR(IC) IF ( ZSQ .LE. 174.673 ) GO TO 110 F(I,IC) = 0.0 GO TO 102 110 CONTINUE C Note that pi is included in the computations for comparability C even though it is just a constant. F(I,IC) = EXP(-ZSQ/2.0)/SQRT(2*PI*VAR(IC)) 102 CONTINUE C C COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP C [PP(IC,I) = CONDITIONAL PROBABILITY OF CLUSTER IC, GIVEN X(I)]: DO 405 I = 1,N DENOM(I) = 0.0 DO 405 IC=1,K DENOM(I) = DENOM(I) + P(IC)*F(I,IC) 405 CONTINUE DO 406 I = 1,N DO 406 IC=1,K C IF ( DENOM(I) .EQ. 0.0 ) DENOM(I)=0.0001 PP(IC,I)= P(IC)*F(I,IC)/DENOM(I) 406 CONTINUE C C COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY: DO 8 I = 1,N XMXPR(I) = PP(1,I) ICLUS(I) = 1 DO 8 IC = 2,K IF ( PP(IC,I) .GT. XMXPR(I) ) GO TO 9 GO TO 8 9 XMXPR(I) = PP(IC,I) ICLUS(I) = IC 8 CONTINUE C C WRITE NEW LABELS: IF (N .GE. 31) GO TO 200 WRITE(6, 11000) WRITE(6,12000) (IOTA(I), I=1, N) WRITE(6,13000) (ICLUS(I), I=1, N) 200 CONTINUE C C UPDATE CLUSTER PRIOR PROBABILITIES P(IC), MEANS XMEAN(IC) AND C VARIANCE VARHAT: WGSS = 0.0 DO 6 IC = 1,K XNC(IC) = 0.0 C XNC(IC) WILL BE THE SUM OVER ALL N OBSERVATIONS OF THEIR C POSTERIOR PROBABILITIES OF MEMBERSHIP IN CLUSTER IC. SUM(IC) = 0.0 SS(IC) = 0.0 DO 67 I = 1,N XNC(IC) = XNC(IC) + PP(IC,I) SUM(IC) = SUM(IC) + PP(IC,I)*X(I) SS(IC) = SS(IC) + PP(IC,I)*X(I)*X(I) 67 CONTINUE C C IF ONE OF THE K CLUSTERS BECOMES EMPTY, THE PROCEDURE C WILL GO ON TO THE NEXT VALUE OF K: IF ( XNC(IC) .EQ. 0.0 ) GO TO 125 XMEAN(IC) = SUM(IC)/XNC(IC) SSD(IC) = SS(IC) - SUM(IC)*SUM(IC)/XNC(IC) C VAR(IC) = SSD(IC)/XNC(IC) C SD(IC) = SQRT(VAR(IC)) C C Recall that XNC(IC) is the sum over all N observations of their C posterior probabilities of membership in cluster IC. P(IC) = XNC(IC)/XN WGSS = WGSS + SSD(IC) 6 CONTINUE C C COUNT NUMBERS IN CLUSTERS: DO 66 IC = 1,K NC(IC) = 0 66 CONTINUE DO 400 I = 1,N IGROUP = ICLUS(I) NC(IGROUP) = NC(IGROUP) + 1 400 CONTINUE C C C VARHAT = WGSS/XN C SUMLNF = 0.0 DO 161 I=1,N SUMPXF = 0.0 DO 159 IC=1,K SUMPXF = SUMPXF + P(IC)*F(I,IC) 159 CONTINUE C F(I,IC) HAD NOT PREVIOUSLY BEEN DIVIDED BY SQRT(2*PI*VARHAT): C Oh, yes it had. C SUMPXF = SUMPXF/SQRT(2.0*PI*VARHAT) SUMLNF = SUMLNF + ALOG(SUMPXF) 161 CONTINUE C XMN2LL = -2.0*SUMLNF C WGMS = WGSS/(N-K) WRITE(6,66000) XMN2LL WRITE(6,64000) WGSS, WGMS STDERR = SQRT(WGMS) WRITE(6,71000) STDERR C KM1 = K-1 DO 500 IC=1,KM1 ICP1 = IC+1 C Compute boundary between c-th and c+1-st classes. B = XMEAN(IC)/VAR(IC) - XMEAN(ICP1)/VAR(ICP1) A = 0.5*(1.0/VAR(ICP1) - 1.0/VAR(IC)) C = DLOG(P(IC)/P(ICP1)) C = ALOG(SD(ICP1)/SD(IC)) + C TEMP=(XMEAN(ICP1))**2/(2.0*VAR(ICP1))-(XMEAN(IC))**2/(2.0*VAR(IC)) C = C + TEMP DISCR = B**2 - 4.0*A*C BPLUS(IC) = (-B + SQRT(DISCR))/(2.0*A) BMINUS(IC) = (-B - SQRT(DISCR))/(2.0*A) 500 CONTINUE WRITE(6, 1036) ITER WRITE(6, 1035) (BMINUS(IC), IC=1, KM1) WRITE(6, 1020) (XMEAN(IC), IC=1, K) IF (ITER .EQ. 1) GO TO 600 DO 555 I = 1,N IF (ICLUS(I) .EQ. ICLSOL(I)) GO TO 555 GO TO 600 555 CONTINUE 600 CONTINUE C Even if no case changes cluster, another iteration will be C performed if the parameter estimates have changed much. DO 2450 IC = 1,K TEST=(XMEAN(IC)-XMEANOL(IC))/XMEANOL(IC) TOL = 0.02 IF (ABS(TEST) .LE. TOL ) GO TO 2450 GO TO 2500 2450 CONTINUE C If all changes are small, write results for this value of K. GO TO 530 C Otherwise, iterate further. 2500 CONTINUE ITER = ITER + 1 C C IF PROCEDURE HAS NOT CONVERGED IN 40 ITERATIONS, IT WILL C GO ON TO THE NEXT VALUE OF K. IF (ITER.GE.41) GO TO 570 GO TO 601 570 WRITE(6, 1160) GO TO 2900 C C IF ONE OF THE K CLUSTERS BECOMES EMPTY, THE PROCEDURE C WILL GO ON TO THE NEXT VALUE OF K. 125 CONTINUE WRITE(6, 1235) GO TO 2900 C C 530 CONTINUE C 2900 CONTINUE WRITE(6, 1040) (SUM(IC), IC=1, K) WRITE(6, 1045) (NC(IC), IC=1, K) WRITE(6,70000) (VAR(IC), IC=1, K) WRITE(6, 1055) (SD(IC), IC=1, K) WRITE(6, 1060) ( P(IC), IC=1, K) C VARHAT IS MLE OF VARIANCE. VARHAT = WGSS/N WRITE(6, 1100) VARHAT C C C C COMPUTE MODEL-SELECTION CRITERIA: C NO. PARAMETERS = K MEANS + K VARIANCES + (K-1) PROBS. NOPARM = K + K + (K-1) C WRITE(6,72000) NOPARM AIC = XMN2LL + 2.0*NOPARM SCH = XMN2LL + ALOG(XN)*NOPARM C XKASH = SCH - ALOG(2*VARHAT**3) C WRITE(6,60000) AIC WRITE(6,58000) SCH WRITE(6,17000) XKASH C C C C AICVEC(K) = AIC SCHVEC(K) = SCH XKSVEC(K) = XKASH C C GO ON TO NEXT VALUE OF K: 995 CONTINUE C C WRITE VALUES OF MSC'S FOR VARIOUS K: WRITE(6,34000) DO 605 K = 1,KUP WRITE(6,33000) K, AICVEC(K), SCHVEC(K), XKSVEC(K) 605 CONTINUE C 10000 FORMAT(2X,I4) 11000 FORMAT(1X,'CLUSTERING') 12000 FORMAT(1X,'POINT: '/, (1X,40I3)) 13000 FORMAT(1X,'CLUSTER: '/, (1X,40I3)) 33000 FORMAT(1X,'K=',I3, 3F15.2) 34000 FORMAT(1X,'MODEL SELECTION CRITERIA'/ X' AIC SCHWARZ KASHYAP '/) 14000 FORMAT(/1X,'INITIAL MEANS') 25000 FORMAT(/1X,'INITIAL VALUES OF PRIOR PROBS') 19000 FORMAT(1X, 9F13.2/) 1025 FORMAT(1X, 9F13.4/) 24000 FORMAT(/1X,'INITIAL VALUE OF VARIANCES') 1017 FORMAT(1X, F15.4/) 1100 FORMAT(1X, 'M.L. ESTIMATE OF COMMON VARIANCE = ',F14.5/) 21000 FORMAT(1X,'CMS DSN = MIX1DTA CLUSPAC') 22000 FORMAT(1X,'Copyright 1991, 1992 Stanley L. Sclove '/) 23000 FORMAT('1',1X,'K = ',I2,' CLUSTERS') C 1020 FORMAT(1X,'MEANS: ',9F13.2) 64000 FORMAT(/1X,' POOLED STATISTICS: '/' WGSS = ',F14.4, X' WGMS = ',F14.4/) 1035 FORMAT(1X,'BOUNDARIES:', 9X, 9F9.1) C 1151 FORMAT(1X,'MINIMUM OF SAMPLE: ') 1152 FORMAT(1X,'MAXIMUM OF SAMPLE: ') 1112 FORMAT(/,1X,'MEAN = ', F14.4) 1113 FORMAT(1X, 'M.L. ESTIMATE OF VARIANCE = ',F14.5/) 66000 FORMAT(/1X,' MINUS 2 LOG LIKELIHOOD = ',F14.4/) 68000 FORMAT(/1X,' SSDEVS = ',F14.4,' STDDEV = ',F14.4/) 20000 FORMAT('1','****************************************', A/,1X,'PROGRAM ISOPAC/CLUSPAC/MIX1DTA.FOR '/ B,1X,'FOR CLUSTERING UNIVARIATE DATA (DATA ON THE LINE)'/ C1X,'DEVELOPED AND PROGRAMMED BY DR. STANLEY L. SCLOVE' D/,1X,'VERSION 3.0 1992: OCT 20 '/) 1036 FORMAT(1X,'ITERATION ', I2) 1040 FORMAT(1X,'SUMS:',6X,9F13.2) 1045 FORMAT(1X,'NUMBERS:',3X,9(I10,3X)) 15000 FORMAT(18A4) 1051 FORMAT(1X,'N = ',I4/) 1055 FORMAT(1X,'STD.DEVS.: ',9F13.2) 1060 FORMAT(1X,'PROBS.: ',9F13.2) 16000 FORMAT(1X,18A4/) 70000 FORMAT(1X,'VARIANCES: ',9F13.2) 71000 FORMAT(1X,'STD.ERROR=SQRT(WGMS) = ',F13.4/) C 1160 FORMAT(1X,'THIS STEP HAS NOT CONVERGED IN 40 ITERATIONS. STOP') 60000 FORMAT(1X,'AIC = ', F14.4/) 58000 FORMAT(1X,'SCHWARZ CRITERION = ', F14.4/) 17000 FORMAT(1X,'KASHYAP CRITERION = ', F14.4) C 72000 FORMAT(/1X,'NUMBER OF PARAMETERS = ',I4/) 1235 FORMAT(1X,'NO OBSERVATIONS IN GROUP ',I3,'. STOP') C 1995 STOP END