C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C CMS DSN = MIX1CMA CLUSPAC C C C C THE PROGRAMS MIX* CLUSPAC IN ISOPAC ARE FOR CLUSTERING DATA C C BY ITERATIVE MAXIMIZATION OF THE MIXTURE-MODEL LIKELIHOOD C C C C N K C C --- -- C C L = | | > P(C)*F(X(I)|C), C C | | -- C C I=1 C=1 C C C C WHERE C C C C N = NUMBER OF OBSERVATIONS ("SAMPLE SIZE"), C C K = NUMBER OF CLUSTERS, C C X(I) = I-TH OBSERVATION, I = 1,2,...,N, C C F(X|C) = VALUE AT X OF THE C-TH CLASS-CONDITIONAL C C DENSITY FUNCTION (C=1,2,...,K) C C AND C C P(C) = PRIOR PROBABILITY OF CLASS C. C C C C C C REFERENCE FOR CLUSTERING BY MIXTURE MODEL: C C C C Wolfe, J. H. (1970). Pattern clustering by multivariate C C mixture analysis. Multivariate Behavioral Research 5, 329-350. C C C C C C THE "1" IN THE PROGRAM NAME "MIX1CMA ISOPAC" MEANS THAT C C THE PROGRAM IS FOR UNIVARIATE (1-DIMENSIONAL) DATA C C (DATA ON THE LINE); THE "CM" MEANS THAT A COMMON VARIANCE IS C C ASSUMED ACROSS CLUSTERS; AND THE "A" MEANS THAT THERE IS C C AUTOMATIC SETTING OF NUMBERS OF CLUSTERS AND INITIAL MEANS. C C C C C C PROGRAMMED BY C C DR. STANLEY L. SCLOVE 312/996-2681 C C DEPARTMENT OF C C INFORMATION & DECISION SCIENCES M/C 294 C C COLLEGE OF BUSINESS ADMINISTRATION C C UNIVERSITY OF ILLINOIS AT CHICAGO C C BOX 4348 C C CHICAGO, IL 60680-4348 C C C C C C VERSION 1.2 31-MAR-91 C C C C COPYRIGHT (C) 1991 STANLEY L. SCLOVE C C C C C C C C RESTRICTIONS (CAN BE MODIFIED): C C N, SAMPLE SIZE, AT MOST 999; C C K, NUMBER OF CLUSTERS, AT MOST 29; C C ITER, MAXIMUM NUMBER OF ITERATIONS, 20. C C C C C C C C CONTROL CARDS: C C C C (1) DATASET TITLE C C (2) N, IN FORMAT (2X,I4) C C (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1). C C ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED C C FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL. C C ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT, C C WHETHER OR NOT THERE IS ONE ON INPUT. C C (4) DATA, IN FORMAT SPECIFIED BY FMT C C C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C C C C C C C CONTROL CARDS: C C (1) DATASET TITLE C (2) N, IN FORMAT (2X,I4) C (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1). C ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED C FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL. C ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT, WHETHER OR NOR C THERE IS ONE ON INPUT. C (4) DATA, IN FORMAT SPECIFIED BY FMT C C C C WRITE PROGRAM INFORMATION. C C C READ SAMPLE SIZE, N. C C READ DATA FORMAT. C C READ DATA AND C COMPUTE STATISTICS OF WHOLE SAMPLE: C C READ(5, * ) ( X(I), I = 1,N ) C C IF IT IS DESIRED TO PRINT OUT THE DATA, REMOVE THE "C"S C FROM CC1 IN THE APPROPRIATE FOLLOWING STATEMENTS: C WRITE DATA: C WRITE(6, 1009) C1009 FORMAT(1X,'DATA:'/) C WRITE(6, FMT) (X(I), I=1, N) C WRITE SUMMARY STATISTICS FOR WHOLE SAMPLE: C NO. OF PARAMETERS FOR UNCLUSTERED SAMPLE IS: C 1 MEAN + 1 VARIANCE = 2 PARAMETERS C C A TABLE OF MODEL SELECTION CRITERIA-VALUES FOR VARIOUS K C WILL BE PRINTED AT THE END. THE NEXT INSTRUCTIONS C SET UP THE FIRST ENTRIES FOR THAT TABLE. C C C C SET CONSTANTS. C XJ(IC,K), K=1 TO 9, IC = 1 TO K C OPTIMAL CLASS PROBABILITIES - NORMAL DISTRIBUTION C JOHARI & SCLOVE (1975). "PARTITIONING A DISTRIBUTION." C COMMUNICATIONS IN STATISTICS. C C PERFORM CLUSTERING FOR K = 2 TO 9 GROUPS. C C COMPUTE INITIAL MEANS, EQUALLY SPACED THROUGH RANGE OF DATA, C INITIAL PRIOR PROBABILITIES (EQUAL) AND INITIAL VALUE C OF COMMON VARIANCE: C XC = IC C SET INITIAL VALUES OF PRIOR PROBABILITIES EQUAL TO THE C OPTIMAL VALUES FOR A NORMAL DISTRIBUTION: C TO SET INITIAL VALUES OF PRIOR PROBABILITIES EQUAL TO C BINOMIAL(X=IC-1;N=K-1,P=1/2). C FOR IC = 1,2,...,K, C REMOVE THE "C" FROM CC1 OF THE NEXT LINE: C P(IC) = GAMMA(XK)/(GAMMA(XC)*GAMMA(XK-XC+1)*(2**K)) C C FOR EQUAL INITIAL VALUES OF PRIOR PROBABILITIES, C REMOVE THE "C" FROM CC1 OF THE NEXT LINE: C P(IC) = 1.0/K . C C C C C WRITE INITIAL VALUES OF PRIOR PROBS: C C C WRITE INITIAL MEANS: C C WRITE INITIAL VARIANCE: C C C C C STORE OLD CLUSTERING: C COMMENCE DISTANCE COMPUTATIONS. C C COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP: C IF ( DENOM(I) .EQ. 0.0 ) DENOM(I)=0.0001 C C COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY: C C WRITE NEW LABELS: C C UPDATE CLUSTER PRIOR PROBABILITIES P(IC), MEANS XMEAN(IC) AND C VARIANCE VARHAT: C XNC(IC) WILL BE THE SUM OVER ALL N OBSERVATIONS OF THEIR C POSTERIOR PROBABILITIES OF MEMBERSHIP IN CLUSTER IC. C C IF ONE OF THE K CLUSTERS BECOMES EMPTY, THE PROCEDURE C WILL GO ON TO THE NEXT VALUE OF K: C IF ( VAR(IC) .LE. 0.0 ) VAR(IC) = 0.0001 C C COUNT NUMBERS IN CLUSTERS: C C C C C F(I,IC) HAD NOT PREVIOUSLY BEEN DIVIDED BY SQRT(2*PI*VARHAT): C C C C B(IC) IS BOUNDARY BETWEEN G-TH AND G+1-ST CLASSES. C C IF PROCEDURE HAS NOT CONVERGED IN 20 ITERATIONS, IT WILL C GO ON TO THE NEXT VALUE OF K. C C IF ONE OF THE K CLUSTERS BECOMES EMPTY, THE PROCEDURE C WILL GO ON TO THE NEXT VALUE OF K. C C C C VARHAT IS MLE OF VARIANCE. C C C C COMPUTE MODEL-SELECTION CRITERIA: C NO. PARAMETERS = K MEANS + 1 VARIANCE + (K-1) PROBS. C C C C C C C C C GO ON TO NEXT VALUE OF K: C C WRITE VALUES OF MSC'S FOR VARIOUS K: C C C C$DATA