The Statistical Significance Testing Controversy:
A Critical Analysis

PSCH 548 – Graduate Seminar in Methods and Measurement

Fall 2003 | Call # 09744

Instructor: R. Chris Fraley | fraley@uic.edu

Location & Time: 2019 BSB | Monday, 4–7 p.m.

Overview and Course Objectives

During the last century, null hypothesis significance testing (NHST) emerged as the primary method of data analysis and hypothesis testing in the psychological sciences. Although significance tests have been controversial since their inception, debates about their value have become particularly heated in recent years due to the publication of Jacob Cohen’s American Psychologist article, "The Earth is Round (p < .05)." In this article, Cohen challenged the appropriateness of NHST in psychological science and sparked a lively debate among psychologists, one that has consumed entire sections of the leading journals (e.g., American Psychologist, Psychological Science, and Psychological Methods) and led to the creation of a special APA task force to investigate the way NHST is used in psychology (Wilkinson & the Task Force on Statistical Inference, 1999).

In this seminar we will critically examine the role of significance tests in psychology. I have three objectives for this seminar. First, I would like students to gain an in-depth understanding of the arguments that have been made for and against the use of NHST in psychological science. Second, I would like you to learn some basic skills in statistical computing so that you can write programs to do things (e.g., basic simulations and modeling, power analyses, effect size computations, interval estimates, permutation analyses, Bayesian analyses) that cannot be done easily in popular statistical packages (e.g., SPSS). Third, I would like you to come away from this class with a stronger appreciation of the role that mathematics can play in theoretical and empirical psychology.
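To give a concrete (and entirely hypothetical) example of the kind of program meant by the second objective, the short sketch below runs a permutation test on a two-group mean difference using invented data. It is written in R, a free language very similar to the S-PLUS environment used later in the course; all names and values here are illustrative, not part of the course materials.

    # Permutation test for a two-group mean difference (invented data)
    set.seed(1)
    group1 <- rnorm(20, mean = 0.5)   # made-up scores for group 1
    group2 <- rnorm(20, mean = 0.0)   # made-up scores for group 2
    observed <- mean(group1) - mean(group2)

    pooled <- c(group1, group2)
    n1 <- length(group1)
    perm.diffs <- replicate(5000, {
      shuffled <- sample(pooled)                       # reshuffle group labels
      mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])   # difference under the null
    })

    # Two-tailed permutation p-value
    mean(abs(perm.diffs) >= abs(observed))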

Readings and Reaction Papers

Each week I will ask you to read two to three articles and to write a reaction paper (approximately one to two pages) in response to them. In your reaction paper, please focus on one or more of the following: (a) novel questions or issues that the readings made salient for you; (b) critiques of specific arguments made by the authors; (c) extensions of specific ideas discussed by the authors; or (d) examples of how the issues raised in the readings may apply to your research or the research of your subfield. Please submit your comments to the class's on-line discussion forum no later than noon on the Monday of class. From that web site you will be able to submit your reactions to the readings and read the reactions submitted by other students. The purpose of doing this on-line is to encourage you to write comments that are thoughtful enough to share with your peers and to give everyone in the seminar a sense of which issues, questions, and concerns are in need of discussion. Your final grade will be based on the quality and thoughtfulness of these reaction papers, as well as on your level of participation in the seminar (including the computing component; see below). Please read all the required readings and be prepared to discuss them. If I get the impression that people are skimping on the readings, I will begin asking you to complete more thorough assignments. Beyond the reaction papers, there will be no other writing assignments for this course.

Computer Component

Although this seminar will consist almost exclusively of discussion, nitpicking, and name-calling, I would like to devote the last hour of each meeting to statistical and mathematical programming. It is my belief that one reason why the use of NHST has persisted, despite widespread arguments against it, is that alternative techniques are not available in popular software packages. In fact, as commercially produced statistical software continues to be developed in a user-friendly fashion (read: "You don’t need to know anything about statistics to use this software"), researchers’ freedom to do what they want with their data will become increasingly constrained.

To help solve this problem, I would like to show you how to do some rudimentary statistical programming. We will be working with the S-PLUS programming environment. I will update the seminar's web page periodically with class exercises and related materials.

Early in the semester I will begin showing you some simple programs that I have written to illustrate specific concepts and techniques that we’ll be discussing (e.g., confidence intervals, effect sizes, statistical power, probabilistic models). I’ll try to explain how the programs were written, and I will encourage you to experiment with them, modify the code, etc. as a way to gain mastery over the programming language. Later in the semester I will expect you to be able to write your own programs to solve certain problems.
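For example, one of the simplest such illustration programs might compute a standardized effect size. The sketch below is offered only as a hypothetical sample of what these programs can look like (R syntax, which closely parallels S-PLUS; the data are invented):

    # Cohen's d for two independent groups (invented data)
    cohens.d <- function(x, y) {
      nx <- length(x); ny <- length(y)
      pooled.sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
      (mean(x) - mean(y)) / pooled.sd
    }

    set.seed(2)
    treatment <- rnorm(25, mean = 10.5, sd = 2)   # made-up treatment scores
    control   <- rnorm(25, mean = 10.0, sd = 2)   # made-up control scores
    cohens.d(treatment, control)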

Instructor Bias

Although my goal is to facilitate lively, deep, and fair discussions on the issues at hand, I believe that it is necessary to make my bias explicit from the outset. Paul Meehl once stated that "Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology." I echo Meehl’s sentiment. One of my goals in this seminar is to make it clear why I believe this to be the case. Furthermore, I expect you, by the time you have completed this seminar, to be able to articulate and defend your stance on the NHST debate, regardless of what that stance is.

Readings

I will make the readings available outside my office (1050 "A" BSB) the week prior to discussion. The following is a tentative list of readings. It may need to be modified as the seminar evolves; I'll announce any changes to the readings in class.

Week 1
Introduction: What is a Null Hypothesis Significance Test? Facts, Myths, and the State of Our Science

Lykken, D. T. (1991). What’s wrong with psychology? In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology, Vol. 1: Matters of public interest. Essays in honor of Paul E. Meehl (pp. 3-39). Minneapolis, MN: University of Minnesota Press.

Week 2
Early Criticisms of NHST

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437. [optional]

Week 3
Contemporary Criticisms of NHST

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Lawrence Erlbaum Associates.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. A. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapter 2 [A Critique of Significance Tests]) [optional]

Week 4
Rebuttal: Advocates of NHST Come to Its Defense

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.

Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.

Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212-213.

Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In L. A. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 65-116). Mahwah, NJ: Lawrence Erlbaum Associates. [optional]

Week 5
Rebuttal: Advocates of NHST Come to Its Defense

Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.

Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.

Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16-17.

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175-183.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301. [optional]

Harris, R. J. (1997). Significance tests have their place. Psychological Science, 8, 8-11. [optional]

Week 6
Effect Size

Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage. [Ch. 2, Defining Research Results]

Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105-110.

Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133. [optional]

Week 7
Statistical Power

Hallahan, M., & Rosenthal, R. (1996). Statistical power: Concepts, procedures, and applications. Behaviour Research and Therapy, 34, 489-499.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153. [optional]

Maddock, J. E., & Rossi, J. S. (2001). Statistical power of articles published in three health-psychology related journals. Health Psychology, 20, 76-78. [optional]

Thomas, L. & Juanes, F. (1996). The importance of statistical power analysis: An example from Animal Behaviour. Animal Behaviour, 52, 856-859. [optional]

Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656. [optional]

Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91. [optional]


R on the web: Some simple programs for estimating power on the Internet
Note: These scripts are written for the R programming language, which is very similar to S-PLUS. To use them, you can download a free copy of R, or you can simply copy and paste the scripts into a web-based version of R: paste the code into the "Code Window" and "submit" the commands at the following web site: http://www.math.montana.edu/Rweb/Rweb.JavaScript.html. After a few moments (or minutes, depending on the complexity of the program), the results of the simulation will appear in a new window. To adjust the parameters of the programs (e.g., sample sizes, cell means, number of simulation trials), alter only the parameters shown in red. A rough sketch of this kind of power simulation appears after the list below.

Sampling distribution for a correlation

Statistical power for a one-way ANOVA

Statistical power for a two-way ANOVA

Statistical power for linear regression with two predictors and an optional interaction term
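As a rough illustration of how power can be estimated by simulation (this is not one of the linked scripts, and all parameter values are hypothetical), the following R code simulates many two-group experiments under an assumed effect size and counts how often the t test comes out significant:

    # Monte Carlo power estimate for a two-group t test (hypothetical settings)
    set.seed(3)
    n.per.group <- 30      # assumed sample size per group
    true.d      <- 0.5     # assumed population effect size (Cohen's d)
    n.trials    <- 2000    # number of simulated experiments

    p.values <- replicate(n.trials, {
      x <- rnorm(n.per.group, mean = true.d)   # "treatment" sample
      y <- rnorm(n.per.group, mean = 0)        # "control" sample
      t.test(x, y)$p.value
    })

    mean(p.values < .05)   # proportion of significant results = estimated power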


Week 8
Confidence Intervals and Significance Testing

Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. British Medical Journal, 292, 746-750.

Cumming, G., & Finch, S. (2001). A primer on understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.

Loftus, G. R., & Masson, M.E.J. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin and Review, 1, 476-490.

R on the web: Some simple programs for studying the properties of confidence intervals

A demonstration of confidence intervals for means

A demonstration of confidence intervals for correlations

Estimate the 95% confidence interval for Cohen's d
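In the same spirit, the following sketch (again not one of the linked scripts; all values are hypothetical) checks the coverage of the ordinary 95% confidence interval for a mean by drawing many samples from a known population:

    # Coverage check for the 95% CI on a mean (hypothetical population values)
    set.seed(4)
    true.mean <- 100
    n         <- 25
    n.trials  <- 2000

    covered <- replicate(n.trials, {
      x  <- rnorm(n, mean = true.mean, sd = 15)
      ci <- mean(x) + c(-1, 1) * qt(.975, df = n - 1) * sd(x) / sqrt(n)
      ci[1] <= true.mean && true.mean <= ci[2]   # did the interval capture the truth?
    })

    mean(covered)   # should be close to .95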


Week 9 [note: we are skipping this section]
Theoretical Modeling: Developing Formal Models of Natural Phenomena

Haefner, J. W. (1996). Modeling biological systems: Principles and applications. New York: International Thomson Publishing. (Chapters 1 [Models of Systems] & 2 [The Modeling Process])

Loehlin, J. C. (1992). Latent variable models: An introduction to factor, path, and structural analysis. Hillsdale, NJ: Lawrence Erlbaum Associates. (Chapter 1 [Path models in factor, path, and structural analysis], pp. 1-18)

Grant, D. A. (1962). Testing the null hypothesis and the strategy of investigating theoretical models. Psychological Review, 69, 54-61. [optional]

Binder, A. (1963). Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 70, 107-115. [optional]

Edwards, W. (1965). Tactical note on the relations between scientific and statistical hypotheses. Psychological Bulletin, 63, 400-402. [optional]

Week 10
What is the Meaning of Probability? Controversy Concerning Relative Frequency and Subjective Probability

Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York: W. H. Freeman. (Chapters 10, 11, & 12)

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapters 4, 5, & 6)

Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In L. A. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 287-318). Mahwah, NJ: Lawrence Erlbaum Associates.

Rindskopf, D. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In L. A. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 319-332). Mahwah, NJ: Lawrence Erlbaum Associates.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242. [optional]

Week 11
Theory Appraisal: Philosophy of Science and the Testing and Amending of Theories

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.

Roberts, S. & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107, 358-367.

Week 12
Theory Appraisal: Philosophy of Science and the Testing and Amending of Theories

Urbach, P. (1974). Progress and degeneration in the "IQ debate" (I). British Journal for the Philosophy of Science, 25, 99-125.

Serlin, R. C. & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40, 73-83.

Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists. American Psychologist, 42, 145-151.

Gholson, B. & Barker, P. (1985). Kuhn, Lakatos, & Laudan: Applications in the history of physics and psychology. American Psychologist, 40, 755-769. [optional]

Faust, D., & Meehl, P. E. (1992). Using scientific methods to resolve questions in the history and philosophy of science: Some illustrations. Behavior Therapy, 23, 195-211. [optional]

Urbach, P. (1974). Progress and degeneration in the "IQ debate" (II). British Journal for the Philosophy of Science, 25, 235-259. [optional]

Salmon, W. C. (1973, May). Confirmation. Scientific American, 228, 75-83. [optional]

Meehl, P. E. (1993). Philosophy of science: Help or hindrance? Psychological Reports, 72, 707-733. [optional]

Manicas, P. T., & Secord, P. F. (1983). Implications for psychology of the new philosophy of science. American Psychologist, 38, 399-413. [optional]

Week 13
Has the NHST Tradition Undermined a Non-Biased, Cumulative Knowledge Base in Psychology?

Cooper, H., DeNeve, K., & Charlton, K. (1997). Finding the missing science: The fate of studies submitted for review by a human subjects committee. Psychological Methods, 2, 447-452.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.

Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.

Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159-165.

Week 14
Replication and Scientific Integrity

Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research. American Psychologist, 25, 970-975.

Sohn, D. (1998). Statistical significance and replicability: Why the former does not presage the latter. Theory and Psychology, 8, 291-311.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

Platt, J. R. (1964). Strong Inference. Science, 146, 347-353.

Feynman, R. P. (1997). Surely you’re joking, Mr. Feynman! New York: W. W. Norton. (Chapter: Cargo Cult Science)

Rorer, L. G. (1991). Some myths of science in psychology. In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology, Vol. 1: Matters of public interest. Essays in honor of Paul E. Meehl (pp. 61-87). Minneapolis, MN: University of Minnesota Press. [optional]

Lindsay, R. M. & Ehrenberg, A. S. C. (1993). The design of replicated studies. The American Statistician, 47, 217-228. [optional]

Week 15
Quantitative Thinking: Why We Need Mathematics (and not NHST per se) in Psychological Science

Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of Ph.D. programs in North America. American Psychologist, 45, 721-734.

Meehl, P. E. (1998, May). The power of quantitative thinking. Invited address as recipient of the James McKeen Cattell Award at the annual meeting of the American Psychological Society, Washington, DC.

Fraley's End of the Semester Thoughts and Summary