Statistical Significance Testing Controversy:
PSCH 548 – Graduate Seminar in Methods and Measurement
Fall 2003 | Call # 09744
Instructor: R. Chris Fraley | email@example.com
Location & Time: 2019 BSB | Monday 4 – 7p.m.
Overview and Course Objectives
During the last century, null hypothesis significance testing (NHST) has emerged as the primary method of data analysis and hypothesis testing in the psychological sciences. Although significance tests have been controversial since their inception, debates about their value have become particularly heated in recent years due to the publication of Jacob Cohen’s American Psychologist article, "The Earth is Round (p < .05)." In this article, Cohen challenged the appropriateness of NHST in psychological science and sparked a lively debate among psychologists, one that has consumed entire sections of the leading journals (e.g., American Psychologist, Psychological Science, and Psychological Methods) and led to the creation of a special APA task force to investigate the way NHST is used in psychology (Wilkinson & the Task Force on Statistical Inference, 1999).
In this seminar we will critically examine the role of significance tests in psychology. I have three objectives for this seminar. First, I would like you to gain an in-depth understanding of the arguments that have been made for and against the use of NHST in psychological science. Second, I would like you to learn some basic skills in statistical computing so that you can write programs to do things (e.g., basic simulations and modeling, power analyses, effect size computations, interval estimates, permutation analyses, Bayesian analyses) that cannot be done easily in popular statistical packages (e.g., SPSS). Third, I would like you to come away from this class with a stronger appreciation of the role that mathematics can play in theoretical and empirical psychology.
Readings and Reaction Papers
Each week I will ask you to read two to three articles and to write a reaction paper (approximately a page or two in length) in response to these articles. In your reaction paper, please focus on one or more of the following: (a) novel questions or issues that the readings made salient for you; (b) critiques of specific arguments made by the authors; (c) extensions of specific ideas discussed by the authors; or (d) examples of how the issues raised in the readings may apply to your research or the research of your subfield. On the Monday before class, and no later than noon, please submit your comments to the class's on-line discussion forum. From that web site you will be able to submit your reactions to the readings and read the reactions submitted by other students. The purpose of doing this on-line is to encourage you to write comments that are thoughtful enough to share with your peers and to give everyone in the seminar a sense of what issues, questions, and concerns are in need of discussion. Your final grade will be based on the quality and thoughtfulness of these reaction papers, as well as your level of participation in the seminar (including the computing component; see below). Please read all the required readings and be prepared to discuss them. If I get the impression that people are skimping on the readings, I will begin to ask you to complete more thorough assignments. Beyond the reaction paper, there will be no writing assignments for this course.
Although this seminar will consist almost exclusively of discussion, nitpicking, and name-calling, I would like to devote the last hour of each meeting to statistical and mathematical programming. It is my belief that one reason why the use of NHST has persisted, despite widespread arguments against it, is that alternative techniques are not available in popular software packages. In fact, as commercially produced statistical software continues to be developed in a user-friendly fashion (read: "You don’t need to know anything about statistics to use this software"), researchers’ freedom to do what they want with their data will become increasingly constrained.
To help solve this problem, I would like to show you how to do some rudimentary statistical programming. We will be working with the S-PLUS programming environment. I will update the class webpage for our seminar periodically to give you class exercises, etc.
Early in the semester I will begin showing you some simple programs that I have written to illustrate specific concepts and techniques that we’ll be discussing (e.g., confidence intervals, effect sizes, statistical power, probabilistic models). I’ll try to explain how the programs were written, and I will encourage you to experiment with them, modify the code, etc. as a way to gain mastery over the programming language. Later in the semester I will expect you to be able to write your own programs to solve certain problems.
Although my goal is to facilitate lively, deep, and fair discussions on the issues at hand, I believe that it is necessary to make my bias explicit from the outset. Paul Meehl once stated that "Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology." I echo Meehl’s sentiment. One of my goals in this seminar is to make it clear why I believe this to be the case. Furthermore, I expect you, by the time you have completed this seminar, to be able to articulate and defend your stance on the NHST debate, regardless of what that stance is.
I will make the readings available outside my office (1050 "A" BSB) the week prior to discussion. The following is a tentative list of readings. This may need to be modified as the seminar evolves; I’ll make any announcements regarding changes in the reading in class.
Introduction: What is a Null Hypothesis Significance Test? Facts, Myths, and the State of Our Science
Lykken, D. T. (1991). What’s wrong with psychology anyway? In D. Cicchetti & W. M. Grove (Eds.), Thinking Clearly about Psychology, vol. 1: Matters of Public Interest, Essays in honor of Paul E. Meehl (pp. 3 – 39). Minneapolis, MN: University of Minnesota Press.
Early Criticisms of NHST
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437. [optional]
Contemporary Criticisms of NHST
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Lawrence Erlbaum Associates.
Schmidt, F. L. & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapter 2 [A Critique of Significance Tests]) [optional]
Rebuttal: Advocates of NHST Come to Its Defense
Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.
Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212-213.
Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 65-116). Mahwah, NJ: Lawrence Erlbaum Associates. [optional]
Rebuttal: Advocates of NHST Come to Its Defense
Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.
Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16-17.
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175-183.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301. [optional]
Harris, R. J. (1997). Significance tests have their place. Psychological Science, 8, 8-11. [optional]
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage. [Ch. 2, Defining Research Results]
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105-110.
Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 129-133. [optional]
Hallahan, M., & Rosenthal, R. (1996). Statistical power: Concepts, procedures, and applications. Behaviour Research and Therapy, 34, 489-499.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153. [optional]
Maddock, J. E., & Rossi, J. S. (2001). Statistical power of articles published in three health-psychology related journals. Health Psychology, 20, 76-78. [optional]
Thomas, L. & Juanes, F. (1996). The importance of statistical power analysis: An example from Animal Behaviour. Animal Behaviour, 52, 856-859. [optional]
Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656. [optional]
Tukey, J. W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91. [optional]
R on the web: Some simple programs for estimating power on the Internet
Sampling distribution for a correlation
Statistical power for a one-way ANOVA
Statistical power for a two-way ANOVA
Statistical power for linear regression with two predictors and an optional interaction term
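As a rough illustration of what a simulation-based power estimate involves, here is a minimal sketch in Python. It is not one of the programs listed above, and the function names (sample_correlation, simulated_power) are hypothetical. It draws repeated samples from a bivariate normal population with a known correlation, tests each sample correlation against zero, and reports the proportion of rejections. The test uses the Fisher z approximation rather than the exact t test, which is an assumption made here to keep the sketch within the standard library.

```python
import math
import random
from statistics import NormalDist


def sample_correlation(n, rho, rng):
    """Draw n pairs from a bivariate normal with correlation rho; return Pearson r."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        # y shares variance rho^2 with x; the remainder is independent noise
        y = rho * x + math.sqrt(1.0 - rho ** 2) * e
        xs.append(x)
        ys.append(y)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulated_power(n, rho, alpha=0.05, reps=2000, seed=1):
    """Proportion of simulated samples in which H0: rho = 0 is rejected.

    Assumption: two-sided test based on the Fisher z approximation,
    z = atanh(r) * sqrt(n - 3), compared with a standard normal cutoff.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(reps):
        r = sample_correlation(n, rho, rng)
        z = math.atanh(r) * math.sqrt(n - 3)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for n in (20, 50, 100):
        print(n, simulated_power(n, rho=0.3))
```

Setting rho to 0 gives the simulated Type I error rate, which should sit near alpha; raising n or rho shows how quickly power climbs.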
Confidence Intervals and Significance testing
Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: Estimation rather than hypothesis testing. British Medical Journal, 292, 746-750.
Cumming, G., & Finch, S. (2001). A primer on understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.
Loftus, G. R., & Masson, M.E.J. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin and Review, 1, 476-490.
R on the web: Some simple programs for studying the properties of confidence intervals
A demonstration of confidence intervals for means
A demonstration of confidence intervals for correlations
Estimate the 95% confidence interval for Cohen's d
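The defining property of a confidence interval can be checked by simulation. The following is a minimal Python sketch, not one of the demonstrations listed above; the function names (mean_ci, coverage) and the default population values are hypothetical. To stay within the standard library it assumes the population standard deviation is known, so the interval uses a normal critical value; with an estimated standard deviation one would use a t critical value instead.

```python
import math
import random
from statistics import NormalDist, fmean


def mean_ci(sample, sigma, level=0.95):
    """Confidence interval for a mean, assuming a KNOWN population sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    m = fmean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return m - half_width, m + half_width


def coverage(mu=100.0, sigma=15.0, n=25, level=0.95, reps=4000, seed=2):
    """Proportion of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = mean_ci(sample, sigma, level)
        if lo <= mu <= hi:
            covered += 1
    return covered / reps


if __name__ == "__main__":
    print(coverage())
```

Across many replications the proportion of intervals containing the true mean should be close to the nominal level. Any single interval either contains the mean or does not, which is the interpretive point the readings above take up.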
Week 9 [note: we are skipping this
Theoretical Modeling: Developing Formal Models of Natural Phenomena
Haefner, J. W. (1996). Modeling biological systems: Principles and applications. New York: International Thomson Publishing. (Chapters 1 [Models of Systems] & 2 [The Modeling
Loehlin, J. C. (1992). Latent variable models: An introduction to factor, path, and structural analysis. Hillsdale, NJ: Lawrence Erlbaum Associates. (Chapter 1 [Path models in factor, path and structural analysis], pp. 1-18)
Grant, D. A. (1962). Testing the null hypothesis and the strategy of investigating theoretical models. Psychological Review, 69, 54-61. [optional]
Binder, A. (1963). Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 70, 107-115. [optional]
Edwards, W. (1965). Tactical note on the relations between scientific and statistical hypotheses. Psychological Bulletin, 63, 400-402. [optional]
What is the Meaning of Probability? Controversy Concerning Relative Frequency and Subjective Probability
Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. New York: W. H. Freeman. (Chapters 10, 11, & 12)
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapters 4, 5, & 6)
Pruzek, R. M. (1997). An introduction to Bayesian inference and its applications. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 287-318). Mahwah, NJ: Lawrence Erlbaum Associates.
Rindskopf, D. M. (1997). Testing "small," not null, hypotheses: Classical and Bayesian approaches. In Lisa A. Harlow, Stanley A. Mulaik, and James H. Steiger (Eds.) What if there were no significance tests? (pp. 319-332). Mahwah, NJ: Lawrence Erlbaum Associates.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242. [optional]
Theory Appraisal: Philosophy of Science and the Testing and Amending of Theories
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.
Roberts, S. & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107, 358-367.
Theory Appraisal: Philosophy of Science and the Testing and Amending of Theories
Urbach, P. (1974). Progress and degeneration in the "IQ debate" (I). British Journal for the Philosophy of Science, 25, 99-125.
Serlin, R. C. & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40, 73-83.
Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists. American Psychologist, 42, 145-151.
Gholson, B. & Barker, P. (1985). Kuhn, Lakatos, & Laudan: Applications in the history of physics and psychology. American Psychologist, 40, 755-769. [optional]
Faust, D., & Meehl, P. E. (1992). Using scientific methods to resolve questions in the history and philosophy of science: Some illustrations. Behavior Therapy, 23, 195-211. [optional]
Urbach, P. (1974). Progress and degeneration in the "IQ debate" (II). British Journal for the Philosophy of Science, 25, 235-259. [optional]
Salmon, W. C. (1973, May). Confirmation. Scientific American, 228, 75-83. [optional]
Meehl, P. E. (1993). Philosophy of science: Help or hindrance? Psychological Reports, 72, 707-733. [optional]
Manicas, P. T., & Secord, P. F. (1983). Implications for psychology of the new philosophy of science. American Psychologist, 38, 399-413. [optional]
Has the NHST Tradition Undermined a Non-Biased, Cumulative Knowledge Base in Psychology?
Cooper, H., DeNeve, K., & Charlton, K. (1997). Finding the missing science: The fate of studies submitted for review by a human subjects committee. Psychological Methods, 2, 447-452.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.
Berger, J. O. & Berry, D. A. (1988). Statistical analysis and illusion of objectivity. American Scientist, 76, 159-165.
Replication and Scientific Integrity
Smith, N. C. (1970). Replication studies: A neglected aspect of psychological research. American Psychologist, 25, 970-975.
Sohn, D. (1998). Statistical significance and replicability: Why the former does not presage the latter. Theory and Psychology, 8, 291-311.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.
Platt, J. R. (1964). Strong Inference. Science, 146, 347-353.
Feynman, R. P. (1997). Surely you’re joking, Mr. Feynman! New York: W. W. Norton. (Chapter: Cargo-cult science).
Rorer, L. G. (1991). Some myths of science in psychology. In D. Cicchetti & W. M. Grove (Eds.), Thinking Clearly about Psychology, vol. 1: Matters of Public Interest, Essays in honor of Paul E. Meehl (pp. 61 – 87). Minneapolis, MN: University of Minnesota Press. [optional]
Lindsay, R. M. & Ehrenberg, A. S. C. (1993). The design of replicated studies. The American Statistician, 47, 217-228. [optional]
Quantitative Thinking: Why We Need Mathematics (and not NHST per se) in Psychological Science
Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of Ph.D. programs in North America. American Psychologist, 45, 721-734.
Meehl, P. E. (1998, May). The power of quantitative thinking. Invited address as recipient of the James McKeen Cattell Award at the annual meeting of the American Psychological Society, Washington,
Fraley's End of the Semester Thoughts and Summary