CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C CLUSPAC: Computer Programs for Mixture-Model Clustering C C C C COPYRIGHT (C) 1998 Stanley Louis Sclove C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C$STATEMENTS=100000 C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C C Program MIX1BETA in CLUSPAC from ISOPAC C C C C ADAPTED FROM "MIX1DT CLUSPAC", C C A PROGRAM FOR CLUSTERING UNIVARIATE DATA C C (DATA ON THE LINE) BY ITERATIVE MAXIMIZATION OF THE MIXTURE- C C MODEL LIKELIHOOD C C C C N K C C --- -- C C L = | | > P(C)*F(X(I)|C) C C | | -- C C I=1 C=1 C C C C C C where the class-conditional probability density functions C C f(x|c) are betas. C C C C C The pdf is C r-1 s-1 C G(r + s) x (1-x) C f(x) = --------------------------------- r > 0, s > 0 C G(r) G(s) C C where G is the gamma function. C The beta function is B(r,s) = G(r)G(s)/G(r+s) . C so the constant in f(x) is 1/B(r,s) . C C C The mean is r/(r+s) = mu C C The variance is rs/(r+s)**2(r+s+1) = var . C C C C This gives r = c*mu, s = c*(1-mu), C where C c = [mu(1-mu)/var - 1] . C C C MANUAL MODE: NUMBER OF CLUSTERS AND INITIAL MEANS ARE C C INPUT. (USE PROGRAM MIX1DTA FOR AUTOMATIC SETTING OF C C NUMBERS OF CLUSTERS AND INITIAL MEANS.) C C C C C C PROGRAMMED BY C C DR. STANLEY L. SCLOVE 312/996-2681 C C DEPARTMENT OF C C INFORMATION & DECISION SCIENCES M/C 294 C C COLLEGE OF BUSINESS ADMINISTRATION C C UNIVERSITY OF ILLINOIS AT CHICAGO C C 601 S. MORGAN ST. C C CHICAGO, IL 60607-7124 C C C C C C VERSION 2.0 13-DEC-98 C C C C RESTRICTIONS (CAN BE MODIFIED): C C N, SAMPLE SIZE, AT MOST 999; C C K, NUMBER OF CLUSTERS, AT MOST 11; C C ITER, MAXIMUM NUMBER OF ITERATIONS, 4. C C (The program seems to wander off to an unwanted trivial, C global maximum as iterations go on.) C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C CONTROL STATEMENTS: C C C C (1) DATASET TITLE C C (2) N, IN FORMAT (2X,I4) C C (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1). C C ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED C C FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL. C C ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT, C C WHETHER OR NOT THERE IS ONE ON INPUT. C C (4) DATA, IN FORMAT SPECIFIED BY FMT C C (5) K, NUMBER OF CLUSTERS, IN FORMAT (2X,I1) C C (6) K INITIAL VALUES OF PRIOR PROBS, MEANS AND VARIANCES C C IN FORMAT (5X,F3.2,2X,F8.2,2X,F8.2). C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C C DIMENSION X( 999),XMNDSQ( 999),ICLUS( 999),IOTA( 999) DIMENSION DSQ(11),C(11),SUM(11),R(11),S(11),RPLUSS(11) DIMENSION TITLE(18) DIMENSION B(11),NC(11),XMEAN(11),XMEANOL(11),RECIPBET(11) DIMENSION FMT(18) DIMENSION SS(11),SSD(11) DIMENSION SD(11) DIMENSION VAR(11) DIMENSION ICLSOL( 999) DIMENSION F( 999,11) DIMENSION P(11),XNC(11) DIMENSION PP(11, 999) DIMENSION XMXPR( 999) DIMENSION DENOM( 999) C DOUBLE PRECISION SUM,SS,F,P,PP,SUMSQS,TOTAL,SSD,XNC,VAR DOUBLE PRECISION TEMP,XLL,DENOM,SUMLNF,SUMPXF DOUBLE PRECISION XMEAN,XMEANOL,XMN2LL,XMN2LLOL,WGSS,SSDEVS C READ (5,64000) TITLE C C WRITE PROGRAM INFORMATION. WRITE (6,56000) WRITE (6,32000) WRITE (6,34000) C WRITE (6,70000) TITLE C C READ SAMPLE SIZE, N. READ (5,12000) N XN = N WRITE (6,66000) N C C READ DATA FORMAT. READ (5,64000) FMT C C READ DATA AND C COMPUTE STATISTICS OF WHOLE SAMPLE: TOTAL=0.0 SUMSQS=0.0 SSDEVS=0.0 C C Initialize max and min: XMAX = 0.0 XMIN = 1.0 DO 300 I = 1,N READ (5,FMT) X(I) C Data cleaning: IF ( X(I) .LE. 0.0001 ) X(I)=0.0001 IF ( X(I) .GE. 0.9999 ) X(I)=0.9999 C TOTAL = TOTAL + X(I) SUMSQS = SUMSQS + X(I)*X(I) IF (X(I) .LT. XMIN) XMIN=X(I) IF (X(I) .GT. XMAX) XMAX=X(I) 300 CONTINUE C WRITE DATA: WRITE (6,24000) IF ( N .GE. 31) GO TO 350 WRITE (6,FMT) (X(I),I=1,N) 350 CONTINUE C WRITE SUMMARY STATISTICS FOR WHOLE SAMPLE: WRITE (6,44000) WRITE (6,FMT) XMIN WRITE (6,46000) WRITE (6,FMT) XMAX XBAR = TOTAL/XN SSDEVS = SUMSQS- TOTAL*TOTAL/XN VARHAT = SSDEVS/XN WRITE (6,50000) XBAR WRITE (6,52000) VARHAT C C Compute likelihood for a single beta. XC = XBAR*(1-XBAR)/VARHAT - 1 XR = XBAR*XC XS = (1-XBAR)*XC XLL = 0.0 DO 105 I = 1,N TEMP = (XR-1)*ALOG(X(I)) + (XS-1)*ALOG(1- X(I)) XLL = XLL + TEMP 105 CONTINUE XLL = XLL + ALOG( GAMMA(XR)*GAMMA(XS)/GAMMA(XR+XS) ) XMN2LL = -2.0*XLL STDDEV = SQRT(VARHAT) WRITE (6,54000) SSDEVS, XMN2LL, STDDEV NOPARM = 1 + 1 AIC = XMN2LL + 2.0*NOPARM WRITE (6,82000) AIC SCH = XMN2LL + ALOG(XN)*NOPARM TEMP = 2*VARHAT**3 IF ( TEMP .LE. 0.0 ) TEMP = 0.0001 XKASH = SCH - DLOG(TEMP) WRITE (6,48000) XKASH C C C READ K, NUMBER OF CLUSTERS. READ (5,10000) K WRITE (6,36000) K C C READ INITIAL PRIOR PROBS, MEANS AND VARIANCES: DO 400 IC=1,K READ (5,14000) P(IC),XMEAN(IC),VAR(IC) RPLUSS(IC) = XMEAN(IC)*(1-XMEAN(IC))/VAR(IC) - 1 R(IC) = XMEAN(IC)*RPLUSS(IC) S(IC) = (1-XMEAN(IC))*RPLUSS(IC) 400 CONTINUE C C C C WRITE INITIAL PRIOR PROBS, MEANS AND VARIANCES: WRITE (6,26000) DO 500 IC = 1,K WRITE (6,28000) IC, P(IC), XMEAN(IC), VAR(IC) WRITE (6,28100) IC, R(IC), S(IC) 500 CONTINUE C C SET CONSTANTS. DO 600 INTGER=1,N IOTA(INTGER) = INTGER 600 CONTINUE C ITER = 1 C THE CLUSTERING IS BY MAXIMUM POSTERIOR C PROBABILITY CLUSTERING. C 700 CONTINUE C IF (ITER .EQ. 1) GO TO 900 C because there are no old results to store. C C If it's not the first iteration, C STORE OLD CLUSTERING RESULTS FOR TESTING CONVERGENCE: XMN2LLOL = XMN2LL DO 800 I = 1,N ICLSOL(I) = ICLUS(I) 800 CONTINUE DO 850 IC = 1,K XMEANOL(IC) = XMEAN(IC) 850 CONTINUE C 900 CONTINUE C DO 1100 IC = 1,K C RECIPBET means 1/BETA. RECIPBET(IC) = GAMMA(R(IC))*GAMMA(S(IC))/GAMMA(R(IC)+S(IC)) C DO 1100 I = 1,N C C DSQ(IC) = ( XMEAN(IC) - X(I) )**2 C ZSQ = DSQ(IC)/VAR(IC) C IF ( ZSQ .LE. 174.673 ) GO TO 1000 C F(I,IC) = 0.0 C GO TO 1100 C1000 CONTINUE C C NOTE THAT in MIX1* the GAUSSIAN WAS USED HERE: C F(I,IC) =(EXP(-ZSQ/2.0))/(SQRT(2*PI*VAR(IC))) C F(I,IC) = (X(I))**(R(IC)-1)*(1-X(I))**(S(IC)-1) C C CONST = RECIPBET(IC) F(I,IC) = F(I,IC)*RECIPBET(IC) 1100 CONTINUE C 1200 CONTINUE C C COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP: DO 1300 I = 1,N DENOM(I) = 0.0 DO 1300 IH=1,K DENOM(I) = DENOM(I) + P(IH)*F(I,IH) 1300 CONTINUE DO 1400 I = 1,N DO 1400 IC=1,K C IF ( DENOM(I) .EQ. 0.0 ) DENOM(I)=0.0001 C PP(IC,I)= P(IC)*F(I,IC)/DENOM(I) 1400 CONTINUE C C COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY: DO 1600 I = 1,N XMXPR(I) = PP(1,I) ICLUS(I) = 1 DO 1600 IC = 2,K IF ( PP(IC,I) .GT. XMXPR(I) ) GO TO 1500 GO TO 1600 1500 XMXPR(I) = PP(IC,I) ICLUS(I) = IC 1600 CONTINUE C IF (N .GE. 31) GO TO 1650 C C WRITE NEW LABELS: WRITE (6,16000) WRITE (6,18000) (IOTA(I), I=1,N) WRITE (6,20000) (ICLUS(I), I=1,N) 1650 CONTINUE C C UPDATE CLUSTER PRIOR PROBABILITIES P(IC), MEANS XMEAN(IC) AND C VARIANCES VAR(IC): WGSS = 0.0 DO 1800 IC = 1,K XNC(IC) = 0.0 C XNC(IC) WILL BE THE SUM OVER ALL N OBSERVATIONS OF THEIR C POSTERIOR PROBABILITIES OF MEMBERSHIP IN CLUSTER IC. SUM(IC) = 0.0 SS(IC) = 0.0 DO 1700 I = 1,N XNC(IC) = XNC(IC) + PP(IC,I) SUM(IC) = SUM(IC) + PP(IC,I)*X(I) SS(IC) = SS(IC) + PP(IC,I)*X(I)*X(I) 1700 CONTINUE IF ( XNC(IC) .EQ. 0.0 ) GO TO 2700 XMEAN(IC) = SUM(IC)/XNC(IC) SSD(IC) = SS(IC) - SUM(IC)*SUM(IC)/XNC(IC) VAR(IC) = SSD(IC)/(XNC(IC)) C IF ( VAR(IC) .LE. 0.0 ) VAR(IC) = 0.0001 SD(IC) = SQRT(VAR(IC)) P(IC) = XNC(IC)/XN C P(IC) = estimate of the mixing prob of component IC. C Update the parms of the beta: RPLUSS(IC) = XMEAN(IC)*(1-XMEAN(IC))/VAR(IC) - 1 R(IC) =XMEAN(IC)*RPLUSS(IC) S(IC) = (1-XMEAN(IC) )*RPLUSS(IC) RECIPBET(IC) = GAMMA(R(IC))*GAMMA(S(IC))/GAMMA(R(IC)+S(IC)) WGSS = WGSS + SSD(IC) 1800 CONTINUE C C COUNT NUMBERS IN CLUSTERS: DO 1900 IC = 1,K NC(IC) = 0 1900 CONTINUE DO 2000 I = 1,N IGROUP = ICLUS(I) NC(IGROUP) = NC(IGROUP) + 1 2000 CONTINUE C VARHAT = WGSS/XN C SUMLNF = 0.0 DO 2200 I=1,N SUMPXF = 0.0 DO 2100 IC=1,K SUMPXF = SUMPXF + P(IC)*F(I,IC) 2100 CONTINUE SUMLNF = SUMLNF + DLOG(SUMPXF) 2200 CONTINUE C XMN2LL = -2.0*SUMLNF C C WGMS = WGSS/(N-K) WRITE (6,40000) WGSS, XMN2LL, WGMS STDERR = SQRT(WGMS) WRITE (6,74000) STDERR C KMINUS1 = K-1 DO 2300 IC=1,KMINUS1 ICPLUS1 = IC+1 C B(IC) IS approximate BOUNDARY BETWEEN classes c and c+1. W1 = P(IC) W2 = P(ICPLUS1) B(IC) = P(IC)*XMEAN(IC) + P(ICPLUS1)*XMEAN(ICPLUS1) B(IC) = B(IC)/( P(IC)+P(ICPLUS1) ) C BDYADJ = DLOG(P(ICPLUS1)) - DLOG(P(IC)) C BDYADJ = BDYADJ/(XMEAN(ICPLUS1) - XMEAN(IC)) C BDYADJ = VARHAT*BDYADJ C B(IC) = B(IC) - BDYADJ 2300 CONTINUE WRITE (6,58000) ITER WRITE (6,42000) (B(IC),IC=1,KMINUS1) WRITE (6,38000) (XMEAN(IC),IC=1,K) C WRITE (6,38100) ( P(IC),IC=1,K) C WRITE (6,38300) (NC(IC),IC=1,K) C IF (ITER .EQ. 1) GO TO 2500 DO 2400 I = 1,N C If any case changes cluster, another iteration will be performed. IF (ICLUS(I) .EQ. ICLSOL(I)) GO TO 2400 GO TO 2500 2400 CONTINUE WRITE (6,79000) C Even if no case changes cluster, another iteration will be C performed if the parameter estimates have changed much. DO 2450 IC = 1,K TEST=(XMEAN(IC)-XMEANOL(IC))/SQRT( XMEAN(IC)*(1-XMEAN(IC)) ) C TEST=(XMEAN(IC)-XMEANOL(IC))/XMEANOL(IC) C TOLERANS = 0.05 C Another iteration will be performed iff. C (1) the max number of iterations has not been reached C and (2) the new means are not close to the old means. C or if (3) some case had changed clusters. C IF (ABS(TEST) .LE. TOLERANS) GO TO 2450 GO TO 2500 2450 CONTINUE C If changes in all means are small, stop and write results. GO TO 2900 2500 CONTINUE ITER = ITER + 1 IF (ITER.GE. 5) GO TO 2600 GO TO 700 C 2600 WRITE (6,80000) 2700 CONTINUE WRITE (6,88000) C C Next the results will be written: C 2900 CONTINUE C WRITE (6,60000) ( P(IC), IC = 1,K ) WRITE (6,62000) (NC(IC),IC=1,K) WRITE (6,72000) (VAR(IC),IC=1,K) WRITE (6,68000) (SD(IC), IC=1,K) WRITE (6,68100) (R(IC), IC=1,K) WRITE (6,68200) (S(IC), IC=1,K) C VARHAT IS MLE OF VARIANCE. VARHAT = WGSS/N WRITE (6,30000) VARHAT C C IF (N .GE. 31) GO TO 3050 WRITE (6,78000) DO 3000 I=1,N WRITE (6,76000) X(I),ICLUS(I),(PP(IC,I),IC=1,K) 3000 CONTINUE C 3050 CONTINUE C C COMPUTE MODEL-SELECTION CRITERIA: C NO. PARAMETERS = K MEANS + K VARIANCES + (K-1) PROBS. NOPARM = K + K + (K-1) C WRITE (6,86000) NOPARM AIC = XMN2LL + 2.0*NOPARM SCH = XMN2LL + ALOG(XN)*NOPARM C C SCHWARZ' CRITERION IS FIRST-DEGREE EXPANSION OF C LOG POSTERIOR PROBABILITY OF THE MODEL. C KASHYAP'S CRITERION IS SECOND-DEGREE EXPANSION OF SAME. TEMP = 2*VARHAT**3 IF ( TEMP .LE. 0.0 ) TEMP = 0.0001 XKASH = SCH - DLOG(TEMP) C WRITE (6,82000) AIC WRITE (6,84000) SCH WRITE (6,48000) XKASH C C C STOP 10000 FORMAT(2X,I1) 12000 FORMAT(2X,I4) 14000 FORMAT(5X,F3.2,2X,F8.2,2X,F8.2) 16000 FORMAT(1X,'CLUSTERING') 18000 FORMAT(1X,'POINT: '/, (1X,40I3)) 20000 FORMAT(1X,'CLUSTER: '/, (1X,40I3)) 24000 FORMAT(1X,'DATA:'/) 26000 FORMAT(/1X,'INITIAL PRIOR PROBS, MEANS AND VARIANCES:') 28000 FORMAT(1X,I2,2X,3F11.2) 28100 FORMAT(3X,I2,2X,'R', F11.2, ' S', F11.2) 30000 FORMAT(1X, 'M.L. ESTIMATE OF COMMON VARIANCE = ',F14.5/) 32000 FORMAT(1X,'Program MIX1BETA CLUSPAC') 34000 FORMAT(1X,'COPYRIGHT (C) 1998 STANLEY LOUIS SCLOVE'/) 36000 FORMAT('1',1X,'K = ',I1,' CLUSTERS') 38000 FORMAT(1X,'MEANS: ',9F10.3) 38100 FORMAT(1X,'PROBS: ',9F10.3) C 38300 FORMAT(1X,'Ns: ',9F10.3) 40000 FORMAT(/1X,'WGSS = ',F14.4,' MINUS 2 LOG LIKELIHOOD = ', XF14.4, ' WGMS = ',F14.4/) 42000 FORMAT(1X,'BOUNDARIES:', 8X, 9F10.3) 44000 FORMAT(1X,'MINIMUM OF SAMPLE: ') 46000 FORMAT(1X,'MAXIMUM OF SAMPLE: ') 48000 FORMAT( 1X,'KASHYAP CRITERION = ', F14.4) 50000 FORMAT(/,1X,'MEAN = ', F14.4) 52000 FORMAT(1X, 'M.L. ESTIMATE OF VARIANCE = ',F14.5/) 54000 FORMAT(/1X,'SSDEVS = ',F14.4,' MINUS 2 LOG LIKELIHOOD = ', XF14.4, ' STDDEV = ',F14.4/) 56000 FORMAT('1','.................................................', X/,1X,'PROGRAM MIX1BETA CLUSPAC'/1X,'MIXTURE MODEL CLUSTERING'/ X,1X,'FOR UNIVARIATE DATA (DATA ON THE LINE)'/ X,1X,'WITH BETA DISTRIBUTIONS '/ X1X,'DEVELOPED AND PROGRAMMED BY DR. STANLEY L. SCLOVE' X/,1X,'VERSION 2.0 13-Dec-98 '/) 58000 FORMAT(1X,'ITERATION ', I2) 60000 FORMAT(1X,'PROBS:',5X,9F10.3) 62000 FORMAT(1X,'NUMBERS:',3X,9(I10,3X)) 64000 FORMAT(18A4) 66000 FORMAT(1X,'N = ',I3/) 68000 FORMAT(1X,'STD.DEVS.: ',9F10.3) 68100 FORMAT(1X,'Rs: ',9F10.3) 68200 FORMAT(1X,'Ss: ',9F10.3) 70000 FORMAT(1X,18A4/) 72000 FORMAT(1X,'VARIANCES: ',9F10.3) 74000 FORMAT(1X,'STD.ERROR=SQRT(WGMS) = ',F13.4/) 76000 FORMAT(1X,F8.2,2X,I2,3X,(8F6.3/)) 78000 FORMAT(1X/,1X,'DATA, LABELS AND PROBS OF CLUSTER MEMBERSHIP'/) 79000 FORMAT(/,1X,'NO CASE CHANGED CLUSTERS IN THIS ITERATION.') 80000 FORMAT(1X,'PROGRAM HAS NOT CONVERGED IN ', X' the prescribed number of ITERATIONS. STOP') 82000 FORMAT(1X,'AIC = ', F14.4 ) 84000 FORMAT(1X,'SCHWARZ CRITERION = ', F14.4 ) 86000 FORMAT(/1X,'NUMBER OF PARAMETERS = ',I4 ) 88000 FORMAT(1X,'NO OBSERVATIONS IN GROUP ',I3,'. STOP') END