CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C                                                                   C
C    CLUSPAC: Computer Programs for Mixture-Model Clustering        C
C                                                                   C
C    COPYRIGHT 1991 STANLEY LOUIS SCLOVE.                           C
C                                                                   C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C
C$STATEMENTS=100000
C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C                                                                   C
C    CMS DSN = MIX1DT ISOPAC                                        C
C                                                                   C
C    "MIX1DT ISOPAC" IS A PROGRAM FOR CLUSTERING UNIVARIATE DATA    C
C    (DATA ON THE LINE) BY ITERATIVE MAXIMIZATION OF THE            C
C    MIXTURE-MODEL LIKELIHOOD                                       C
C                                                                   C
C                 N     K                                           C
C                ---    --                                          C
C           L =  | |    >   P(C)*F(X(I)|C)                          C
C                | |    --                                          C
C                I=1   C=1                                          C
C                                                                   C
C    REFERENCE:                                                     C
C                                                                   C
C    Wolfe, J. H. (1970). Pattern clustering by multivariate        C
C    mixture analysis. Multivariate Behavioral Research 5, 329-350. C
C                                                                   C
C    MANUAL MODE: NUMBER OF CLUSTERS AND INITIAL MEANS ARE          C
C    INPUT. (USE PROGRAM MIX1DTA FOR AUTOMATIC SETTING OF           C
C    NUMBERS OF CLUSTERS AND INITIAL MEANS.)                        C
C                                                                   C
C    PROGRAMMED BY                                                  C
C    DR. STANLEY L. SCLOVE              312/996-2681                C
C    DEPARTMENT OF                                                  C
C    INFORMATION & DECISION SCIENCES    M/C 294                     C
C    COLLEGE OF BUSINESS ADMINISTRATION                             C
C    UNIVERSITY OF ILLINOIS AT CHICAGO                              C
C    BOX 4348                                                       C
C    CHICAGO, IL 60680-4348                                         C
C                                                                   C
C    VERSION 1.0 3-NOV-89                                           C
C                                                                   C
C    COPYRIGHT 1991 STANLEY LOUIS SCLOVE.                           C
C                                                                   C
C    RESTRICTIONS (CAN BE MODIFIED):                                C
C       N, SAMPLE SIZE, AT MOST 999;                                C
C       K, NUMBER OF CLUSTERS, AT MOST 29;                          C
C       ITER, MAXIMUM NUMBER OF ITERATIONS, 20.                     C
C                                                                   C
C    CONTROL CARDS:                                                 C
C                                                                   C
C    (1) DATASET TITLE                                              C
C    (2) N, IN FORMAT (2X,I4)                                       C
C    (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1).                    C
C        ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED      C
C        FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL.             C
C        ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT,                C
C        WHETHER OR NOT THERE IS ONE ON INPUT.                      C
C    (4) DATA, IN FORMAT SPECIFIED BY FMT                           C
C    (5) K, NUMBER OF CLUSTERS, IN FORMAT (2X,I1)                   C
C    (6) K INITIAL VALUES OF PRIOR PROBABILITIES AND MEANS,         C
C        IN FORMAT (5X,F3.2,2X,F8.2).                               C
C    (7) INITIAL VALUE OF THE VARIANCE.                             C
C        (THE RESULTS DO NOT DEPEND UPON THIS VALUE IF THE          C
C        INITIAL PRIOR PROBABILITIES ARE EQUAL.)                    C
C                                                                   C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C
      DIMENSION X(999),XMNDSQ(999),ICLUS(999),IOTA(999)
      DIMENSION DSQ(29),C(29),SUM(29)
      DIMENSION TITLE(18)
      DIMENSION B(29),NC(29),XMEAN(29)
      DIMENSION FMT(18)
      DIMENSION SS(29),SSD(29)
      DIMENSION SD(29)
      DIMENSION VAR(29)
      DIMENSION ICLSOL(999)
      DIMENSION F(999,29)
      DIMENSION P(29),XNC(29)
      DIMENSION PP(29,999)
      DIMENSION XMXPR(999)
      DIMENSION DENOM(999)
      DOUBLE PRECISION SUM,SS,F,P,PP
C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C
C     FLOW OF PROGRAM:
C     WRITE PROGRAM INFORMATION.
C
C     READ SAMPLE SIZE, N.
C
C     READ DATA FORMAT.
C     READ (5,64000) FMT
C
C     READ DATA AND
C     COMPUTE STATISTICS OF WHOLE SAMPLE:
C
C     WRITE DATA:
C     WRITE SUMMARY STATISTICS FOR WHOLE SAMPLE:
C
C     READ K, NUMBER OF CLUSTERS.
C
C     READ INITIAL PRIOR PROBS, MEANS AND VARIANCES:
C
C     WRITE INITIAL PRIOR PROBS, MEANS AND VARIANCES:
C
C     SET CONSTANTS.
C
C     IF THE INITIAL PRIOR PROBABILITIES ARE EQUAL, THEN THE
C     FIRST ITERATION IS EQUIVALENT TO MINIMUM DISTANCE CLUSTERING
C     TO INITIAL MEANS (I.E., "ISODATA").
C     IN GENERAL, THE CLUSTERING IS BY MAXIMUM POSTERIOR
C     PROBABILITY.
C
C     STORE OLD CLUSTERING:
C     COMMENCE DISTANCE COMPUTATIONS.
C     NOTE THAT A PROB. DENSITY FUNCTION OTHER THAN THE GAUSSIAN
C     COULD BE USED HERE:
C     XMNDSQ(I) = MIN SQ. DISTANCE FROM X(I) TO ANY MEAN
C
C     COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP:
C     IF ( DENOM(I) .EQ. 0.0 ) DENOM(I)=0.0001
C
C     COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY:
C
C     WRITE NEW LABELS:
C
C     UPDATE CLUSTER PRIOR PROBABILITIES P(IC), MEANS XMEAN(IC) AND
C     VARIANCES VAR(IC):
C     XNC(IC) WILL BE THE SUM OVER ALL N OBSERVATIONS OF THEIR
C     POSTERIOR PROBABILITIES OF MEMBERSHIP IN CLUSTER IC.
C     IF ( VAR(IC) .LE. 0.0 ) VAR(IC) = 0.0001
C
C     COUNT NUMBERS IN CLUSTERS:
C
C     B(IC) IS BOUNDARY BETWEEN IC-TH AND (IC+1)-ST CLASSES.
C
C     VARHAT IS MLE OF VARIANCE.
C
C     COMPUTE MODEL-SELECTION CRITERIA:
C     NO. PARAMETERS = K MEANS + K VARIANCES + (K-1) PROBS.
C
C     SCHWARZ' CRITERION IS FIRST-DEGREE EXPANSION OF
C     LOG POSTERIOR PROBABILITY OF THE MODEL.
C     KASHYAP'S CRITERION IS SECOND-DEGREE EXPANSION OF SAME.
C
C$DATA

Sample deck set-up:

TRYPANOSOMES DATA: LENGTHS OF 500 ROUNDWORMS
N=0500
15
15
15
15
15
15
16
16
16
16
17
17
.
.
.
35
K=2
C1   .50     15.00   25.00
C2   .50     29.00   25.00
/*

Sample output:

****************************************
PROGRAM MIX1DT ISOPAC
FOR CLUSTERING UNIVARIATE DATA (DATA ON THE LINE)
DEVELOPED AND PROGRAMMED BY DR. STANLEY L. SCLOVE
VERSION 1.0 3-NOV-89
CMS DSN = MIX1DT ISOPAC
COPYRIGHT (C) 1989 STANLEY L. SCLOVE

TRYPANOSOMES DATA: LENGTHS OF 500 ROUNDWORMS
N = 500
DATA:
MINIMUM OF SAMPLE: 15.0000000
MAXIMUM OF SAMPLE: 35.0000000
MEAN = 22.8320
M.L. ESTIMATE OF VARIANCE = 24.40787
SSDEVS = 12203.9375
MINUS 2 LOG LIKELIHOOD = 3016.3906
STDDEV = 4.9404
AIC = 3020.3906
KASHYAP CRITERION = 3018.5417

K = 2 CLUSTERS
INITIAL PRIOR PROBS, MEANS AND VARIANCES:
  1   0.50   15.00   25.00
  2   0.50   29.00   25.00
WGSS = 5733.6602
MINUS 2 LOG LIKELIHOOD = 3349.6826
WGMS = 11.5134
STD.ERROR=SQRT(WGMS) = 3.3931

ITERATION 1
BOUNDARIES: 22.90
MEANS: 19.27 26.47
WGSS = 4653.1914
MINUS 2 LOG LIKELIHOOD = 2808.9653
WGMS = 9.3438
STD.ERROR=SQRT(WGMS) = 3.0568

ITERATION 2
BOUNDARIES: 23.17
MEANS: 19.15 26.93
WGSS = 3924.5544
MINUS 2 LOG LIKELIHOOD = 2764.8540
WGMS = 7.8806
STD.ERROR=SQRT(WGMS) = 2.8072

ITERATION 3
BOUNDARIES: 23.29
MEANS:      19.07   27.22
SUMS:     5131.23 6284.77
NUMBERS:      296     204
VARIANCES:   2.26   14.36
STD.DEVS.:   1.50    3.79
M.L. ESTIMATE OF COMMON VARIANCE = 7.84911
NUMBER OF PARAMETERS = 5
AIC = 2774.8540
SCHWARZ CRITERION = 2795.9270
KASHYAP CRITERION = 2789.0525

Compile time (seconds):         0.32    Execution time (seconds):     0.94
Size of object code:           10278    Number of extensions:            0
Size of local data area(s):     2608    Number of warnings:              0
Size of global data area:     493392    Number of errors:                0
Object/Dynamic bytes free:    653262    Statements Executed:         57645
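
Illustrative sketch (not part of the original program):

The comments under COMPUTE MODEL-SELECTION CRITERIA give the parameter count as K means + K variances + (K-1) probabilities. The fragment below is a minimal sketch of how that count could feed the AIC and Schwarz values, assuming the usual definitions AIC = -2 log L + 2*M and SCHWARZ = -2 log L + M*log(N), where M is the number of parameters. Those definitions are an assumption, not taken from the source, although they appear to agree with the figures printed in the sample output. The names CRDEMO, MSCRIT, XM2LL, NPAR and SCHWRZ are hypothetical. Kashyap's criterion is not covered because its formula is not stated in the listing.

```fortran
C     ILLUSTRATIVE SKETCH ONLY, NOT PART OF MIX1DT ISOPAC.
C     ASSUMES AIC     = -2 LOG L + 2*M
C     AND     SCHWARZ = -2 LOG L + M*LOG(N),
C     WHERE M IS THE NUMBER OF FREE PARAMETERS.
      PROGRAM CRDEMO
      INTEGER NPAR
      REAL AIC, SCHWRZ
C     INPUTS TAKEN FROM THE SAMPLE OUTPUT: MINUS 2 LOG LIKELIHOOD
C     AT THE LAST ITERATION, K = 2 CLUSTERS, N = 500.
      CALL MSCRIT(2764.8540, 2, 500, NPAR, AIC, SCHWRZ)
      WRITE (6,100) NPAR, AIC, SCHWRZ
  100 FORMAT (1X,'NPAR =',I3,'  AIC =',F11.4,'  SCHWARZ =',F11.4)
      STOP
      END
C
      SUBROUTINE MSCRIT(XM2LL, K, N, NPAR, AIC, SCHWRZ)
C     XM2LL  = MINUS 2 LOG LIKELIHOOD          (INPUT)
C     K      = NUMBER OF CLUSTERS              (INPUT)
C     N      = SAMPLE SIZE                     (INPUT)
C     NPAR   = NUMBER OF FREE PARAMETERS       (OUTPUT)
C     AIC    = AKAIKE'S CRITERION (ASSUMED)    (OUTPUT)
C     SCHWRZ = SCHWARZ' CRITERION (ASSUMED)    (OUTPUT)
      INTEGER K, N, NPAR
      REAL XM2LL, AIC, SCHWRZ
C     K MEANS + K VARIANCES + (K-1) PROBS.
      NPAR = K + K + (K - 1)
      AIC = XM2LL + 2.0*FLOAT(NPAR)
      SCHWRZ = XM2LL + FLOAT(NPAR)*ALOG(FLOAT(N))
      RETURN
      END
```

With the inputs shown, NPAR is 5, matching NUMBER OF PARAMETERS in the sample output, and the two criteria should agree with the printed AIC and SCHWARZ CRITERION values up to single-precision rounding.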