Program Documentation: Program LEFTGAMM From CLUSPAC in ISOPAC CLUSPAC: Computer Programs for Mixture-Model Clustering Copyright (C) 1991 Stanley Louis Sclove Prof. Stanley L. Sclove Professor of Information & Decision Sciences; Epidemiology & Biostatistics; and Mathematics, Statistics & Computer Science Information & Decision Sciences Dept. M/C 294 ofc (312) 996-2681 College of Business Administration dept (312) 996-2676 University of Illinois at Chicago fax (312) 413-0385 601 S. Morgan Street slsclove@uic.edu Chicago, IL 60607-7124 www.uic.edu/~slsclove --------------------------------------------------------------------- Program LEFTGAMM Copyright (C) 1998 Stanley Louis Sclove OUTLINE 0. Background: ISOPAC 0.1. CLUSPAC 0.2. TSPAC 0.3. IMPAC 1. Program LEFTGAMM 2. The Gamma Distribution 3. Program Restrictions 4. Control Cards 5. Flow of Program 6. Estimation Method 7. Boundaries between Clusters 8. Model-Selection Criteria --------------------------------------------------------------------- 0. Background: ISOPAC ISOPAC is a suite of programs for cluster analysis: CLUSPAC time-series segmentation: TSPAC digital-image segmentation: IMPAC 1. Program LEFTGAMM -------------------- Program LEFTGAMM is adapted from Program MIX1DT, a program from clustering univariate data (data on the line) by iterative maximization of the mixture-model likelihood C C C N K C C --- -- C C L = | | > P(C)*F(X(I)|C) C C | | -- C C I=1 C=1 C 2. The Gamma Distribution -------------------------- In LEFTGAMM, C r-1 r C f(x|1) = x exp(-x/b)/[G(r)*b ] Here G(r) denotes the gamma function. C C Start with the gamma p.d.f. C C r-1 C f(t) = t exp(-t)/G(r) C C Make the substitution t = x/b. C Then the result is f(x|1) . C 2 C E[T] = r, Var[T] = r; X = bT: E[X] = br, Var[X] = rb C For estimation: C Var[X]/E[X] = b; r = E[X]/b . C Program LEFTGAMM runs in "manual" mode: the number of clusters and initial values of the means are user-provided. By contrast, Program MIX1DTA (the final "A" is for "automatic") provides automatic setting of initial means and tries a range of values for k, the number of clusters. However, Program MIX1DTA has not been adapted to left-gamma mode. (This can be done relatively easily.) 3. Program Restrictions ------------------------ C RESTRICTIONS (CAN BE MODIFIED): C C N, SAMPLE SIZE, AT MOST 1999; C C K, NUMBER OF CLUSTERS, AT MOST 29; C C ITER, MAXIMUM NUMBER OF ITERATIONS, 20. C C C 4. Control Cards ----------------- C CONTROL CARDS: C C C C (1) DATASET TITLE C C (2) N, IN FORMAT (2X,I4) C C (3) FMT, IN FORMAT (18A4), E.G., (1X,F4.1). C C ALLOW AT LEAST ONE BLANK IN FMT: IT WILL ALSO BE USED C C FOR OUTPUT, WHERE CC1 IS FOR CARRIAGE CONTROL. C C ALLOW A CC FOR THE DECIMAL POINT ON OUTPUT, C C WHETHER OR NOT THERE IS ONE ON INPUT. C C (4) DATA, IN FORMAT SPECIFIED BY FMT C C (5) K, NUMBER OF CLUSTERS, IN FORMAT (2X,I1) C C (6) K INITIAL VALUES OF PRIOR PROBABILITIES AND MEANS, C C IN FORMAT (5X,F3.2,2X,F8.2). C C (7) INITIAL VALUE OF THE VARIANCE. (THE RESULTS DO NOT C C DEPEND UPON THIS VALUE IF THE INITIAL PRIOR PROBABILITIES C C ARE EQUAL.) C C C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C 5. Flow of Program ------------------- C WRITE PROGRAM INFORMATION. C C READ SAMPLE SIZE, N. C C READ DATA FORMAT. C C READ DATA AND C COMPUTE STATISTICS OF WHOLE SAMPLE: C C WRITE DATA: C WRITE SUMMARY STATISTICS FOR WHOLE SAMPLE: C C READ K, NUMBER OF CLUSTERS. C C READ INITIAL PRIOR PROBS, MEANS AND VARIANCES: C C WRITE INITIAL PRIOR PROBS, MEANS AND VARIANCES: C C SET CONSTANTS. C C THE CLUSTERING IS BY MAXIMUM POSTERIOR C PROBABILITY CLUSTERING based upon the estimated parameters. 6. Estimation Method --------------------- The parameters are estimated by an E-M algorithm wherein distribution-free estimates of the means and variances provide estimates of all the distributional parameters. In the case of Gaussian distributions, these distributional parameters are the mean and variance. In the case of the gamma distribution with power parameter r and scale b, the distributional parameters are given as follows from the mean and variance. Var[X]/E[X] = b; r = E[X]/b . Continuation of flow of program: C STORE OLD CLUSTERING: C COMMENCE DISTANCE COMPUTATIONS. C NOTE THAT A PROB. DENSITY FUNCTION OTHER THAN THE GAUSSIAN C COULD BE USED HERE: C XMNDSQ(I) = MIN SQ. DISTANCE FROM X(I) TO ANY MEAN C C COMPUTE POSTERIOR PROBABILITIES OF GROUP MEMBERSHIP: C IF ( DENOM(I) .EQ. 0.0 ) DENOM(I)=0.0001 C C COMPUTE NEW LABELS BY MAX POSTERIOR PROBABILITY: C C C WRITE NEW LABELS: C C UPDATE CLUSTER PRIOR PROBABILITIES P(IC), MEANS XMEAN(IC) AND C VARIANCES VAR(IC): C XNC(IC) WILL BE THE SUM OVER ALL N OBSERVATIONS OF THEIR C POSTERIOR PROBABILITIES OF MEMBERSHIP IN CLUSTER IC. C IF ( VAR(IC) .LE. 0.0 ) VAR(IC) = 0.0001 C C COUNT NUMBERS IN CLUSTERS: C 7. Boundaries between Clusters ------------------------------- The boundary between the lefthand, gamma cluster and the second cluster (Gaussian) is the appropriate root of an equation of the form Ax**2 + Bx + C + D*logx = 0. The Newton-Raphson method is used to find this root. The initial value is an approximation to the boundary, the weighted average of the means of Clusters 1 and 2, where the weights are the probabilities of the two corresponding components: BDRY(1) = ( P(1)*XMEAN(1) + P(2)*XMEAN(2) )/(P(1)+P(2)) The boundaries between clusters 2 and 3, 3 and 4, etc., that is between Gaussian clusters is the appropriate root of a quadratic equation of the form Ax**2 + Bx + C = 0. The Newton-Raphson method is used to find this root. The initial value is an approximation to the boundary, the weighted average of the means of the two clusters, where the weights are the probabilities of the two corresponding components. By starting with this value, the Newton-Raphson iterations will converge to the appropriate one of the two roots of the quadratic. (In earlier versions of the program the boundaries between clusters 2 and 3, 3 and 4, etc. are approximated by the formula for boundaries between Gaussian clusters with equal variances.) 8. Model-Selection Criteria ---------------------------- Model-selection criteria can be used as ad hoc figures-of-merit for evaluating models with different numbers k of clusters. C NO. PARAMETERS = K MEANS + K VARIANCES + (K-1) PROBS. C C C SCHWARZ' CRITERION IS FIRST-DEGREE EXPANSION OF C LOG POSTERIOR PROBABILITY OF THE MODEL. C KASHYAP'S CRITERION IS SECOND-DEGREE EXPANSION OF SAME. C --------------------------------------------------------------------- written 17-Dec-1998 latest revision 19-Dec-1998 ---------------------------------------------------------------------