End of the Semester Thoughts on the Significance Testing Debate

R. Chris Fraley | November 24, 2003

A Review of the Problems with Significance Testing

a. It is commonly assumed that the p-value is indicative of the meaningfulness or importance of a finding.  The meaningfulness of a finding, however, can only be evaluated in the context of theory or application.  In most cases, the meaningfulness of a finding is reflected in the effect size or parameter estimate, and these estimates can have large or small p-values, depending on the sample size.

b. It is commonly believed that the p-value is indicative of the likelihood that the results were due to chance.  This misunderstanding results from identifying P(D|Ho) with P(Ho|D).  To evaluate the probability that the results were due to chance, one needs to know the a priori probability that the null hypothesis is true, in addition to other probabilities (the probability that the alternative hypothesis is true, the probability of the data under the alternative hypothesis).  This kind of information is almost never considered.

c. It is commonly believed that the p-value corresponds to the reliability or replicability of the result.  Assuming the null hypothesis is false, the replicability of a result is a function of statistical power, and power is independent of the p-value of any one study.  P-values cannot convey information about reliability.
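As a rough illustration of point b, the sketch below applies Bayes' rule to obtain P(Ho|D) from P(D|Ho). The function name and all of the input values are hypothetical: it assumes only two competing hypotheses, treats "D" as "the result was significant," and uses a .05 alpha level and 50% power purely for illustration.

```python
def posterior_null(p_data_given_h0, p_data_given_h1, prior_h0):
    """Return P(H0 | D) by Bayes' rule, assuming H0 and H1 are the only options."""
    prior_h1 = 1.0 - prior_h0
    numerator = p_data_given_h0 * prior_h0
    evidence = numerator + p_data_given_h1 * prior_h1
    return numerator / evidence


# Hypothetical inputs: a significant result occurs with probability .05
# under the null and .50 under the alternative (i.e., 50% power).
for prior in (0.1, 0.5, 0.9):
    print(prior, round(posterior_null(0.05, 0.50, prior), 3))
```

The same "significant" result leaves the null with very different posterior probabilities depending on the prior, which is exactly the information a p-value does not contain.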

2. Asymmetric nature of significance tests.

a. NHSTs are used to evaluate the probability of observing the data assuming that the null hypothesis is true.  No attempt is made to evaluate the probability of observing the data assuming the research hypothesis is true.  This asymmetry is unfortunate because, in many cases, non-significant results are actually more likely under the research hypothesis than the null hypothesis.
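To make the asymmetry concrete, here is a minimal sketch comparing how likely a non-significant result is under the null and under one specific research hypothesis. Everything in it is an assumption made for illustration: a two-group design, a normal approximation in which the standardized mean difference d has standard error sqrt(2/n), and a research hypothesis that predicts d = 0.5.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d_obs, n_per_group):
    """Approximate two-sided p-value for an observed standardized difference."""
    se = sqrt(2.0 / n_per_group)  # approximate standard error of d
    z = abs(d_obs) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))


def likelihood_ratio(d_obs, n_per_group, d_alt):
    """Likelihood of the observed d under H1 (d = d_alt) relative to H0 (d = 0)."""
    se = sqrt(2.0 / n_per_group)
    like_null = NormalDist(0.0, se).pdf(d_obs)
    like_alt = NormalDist(d_alt, se).pdf(d_obs)
    return like_alt / like_null


# Hypothetical study: observed d = 0.4 with 20 subjects per group.
print(round(two_sided_p(0.4, 20), 3))            # not significant at .05
print(round(likelihood_ratio(0.4, 20, 0.5), 2))  # ratio > 1 favors H1
```

Under these assumptions the result fails to reach significance, yet the observed data are roughly twice as likely under the research hypothesis as under the null.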

3. Widespread blindness to statistical power.

a. It is extremely rare for psychologists to take the statistical power of their planned analysis into account when selecting sample sizes.  In addition, conclusions are often drawn without consideration of the Type II error rate.  Methods for studying power have been around for decades; there is no excuse for not taking power into consideration in research design.

b. Research by Cohen and others has shown that the power to detect a medium-sized effect in published research is about 50%.  In other words, the power to detect a true effect in a typical study conducted in psychology is not any better than a coin toss.

c. Researchers often believe that the use of NHST (and, importantly, a conventional alpha level) provides an objective way to make decisions about data.  However, the fact that researchers do not rely upon formal guidelines for selecting N allows them to support or refute the null hypothesis simply by adjusting the sample size.  If the null is false, we can refute it by sampling a large number of subjects or corroborate it by sampling very few subjects.
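The points above can be sketched with a short power calculation. This is an approximation, not a substitute for a proper power analysis: it assumes a two-group comparison, a two-sided test, and a normal approximation to the test statistic. The function name is hypothetical, and d = 0.5 stands in for Cohen's "medium" effect.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for a true effect d."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)   # critical value, about 1.96 for alpha = .05
    shift = d / sqrt(2.0 / n_per_group)   # expected value of the test statistic
    # Probability of landing in either rejection region when the true effect is d.
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


# Same true effect, different sample sizes.
for n in (10, 32, 64, 200):
    print(n, round(power_two_group(0.5, n), 2))
```

With the effect held constant, power is about .50 near 32 subjects per group and about .80 near 64. A researcher who is free to pick N is therefore also free to pick the likely outcome of the test.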

4. The paradox of using NHST as a “hypothesis test”.

a.  Textbook authors often write about significance tests as if they are “hypothesis tests.”  This terminology is unfortunate because it leads researchers to believe that significance tests provide a way to test theoretical, as opposed to statistical, hypotheses.  The link between theoretical hypotheses, statistical hypotheses, and data, however, does not receive much attention in psychology (see Meehl, 1990).  In fact, explicit training on translating theoretical hypotheses into statistical hypotheses is absent in most graduate programs.  The only mathematical training that students typically receive is concerned with the relation between statistical hypotheses and data—the most trivial part of the equation, from a scientific and philosophical perspective.

b.  One limitation of significance tests is that the null hypothesis will always be rejected as sample size approaches infinity because no model is accurate to a large number of decimal places.  This leads to an interesting paradox (Meehl):  In cases in which the null hypothesis is the hypothesis of interest (e.g., in structural equation modeling), the theory is subjected to a more stringent test as sample size and precision increase.  In cases where the null hypothesis is the straw man (e.g., in most psychological research), the theory is subjected to a weaker test as sample size and precision increase.  There are clearly limitations of significance testing regardless of whether the theory is identified with the null hypothesis*; however, when the null hypothesis is not the theoretical hypothesis, the so-called “test” is about as flimsy as a wet noodle.  (* Ideally, a theory’s ability to account for the data should be evaluated by examining the deviations between predicted and observed values.  Sample size should not be the factor determining the success or failure of the theory.  Sample size reflects the resources available to the researcher; it does not reflect the verisimilitude of the theory.)
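A small sketch of why the null is eventually rejected as N grows: hold a trivially small correlation fixed and vary only the sample size. The function name is hypothetical, and the p-value uses the Fisher z approximation, which is an assumption of this sketch rather than anything prescribed above.

```python
from math import atanh, sqrt
from statistics import NormalDist


def p_value_for_r(r, n):
    """Approximate two-sided p-value for a correlation r observed in n cases."""
    z = atanh(r) * sqrt(n - 3)  # Fisher z transform divided by its standard error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same tiny correlation, r = .05, at increasing sample sizes.
for n in (100, 1000, 10000):
    print(n, p_value_for_r(0.05, n))
```

The deviation between prediction and observation (r = .05 versus 0) never changes; only the sample size does, and with it the verdict of the test.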

5. In non-experimental research, the null hypothesis is almost always false.

a.  Everything is correlated with everything else to some non-zero degree (Lykken, Meehl).  These correlations exist for a combination of interesting and trivial reasons.  If the null hypothesis is unlikely to be true to begin with, testing the null hypothesis is not especially useful. Moreover, it is unclear how much support a theory should gain if the null hypothesis is rejected because there are hundreds of reasons why the null hypothesis may be false that have nothing to do with the theory of interest.

6. The ritualistic emphasis on p < .05 seems to have distorted the scientific nature of our literature.

a.  Studies by Cooper and his colleagues indicate that researchers are much more likely to submit a study for publication (74%) if they obtained significant results than if they did not (4%).

b. Consequently, the published literature does not represent an unbiased selection of research studies on the issues we care about.  Any one study might be scientific, but the literature is not.  As a result, it is unlikely that reviewers of the literature will draw conclusions that are any more accurate than those of a layperson.

a.  The typical study conducted in psychology has the same power as a coin toss.  If the null hypothesis is false (which it is likely to be, especially in non-experimental research), the error rate of the research is 50%.  It should be noted that decisions based on descriptive statistics alone have a dramatically lower error rate.

b. The emphasis on correcting for multiple tests (e.g., alpha adjustments) leads to lower levels of power (Sedlmeier & Gigerenzer, 1989).  Textbooks emphasize that researchers need to control for inflated Type I error rates; textbooks do not typically consider Type II error rates.

c. The prevalence of low-power studies implies that researchers will overestimate the effects they are investigating: when power is low, only the samples that happen to exaggerate the effect cross the significance threshold, so the significant (and published) estimates are biased upward.  This might not be of concern to the researcher who is asking “Is there a difference?” as opposed to “How big is the difference?”  For others who might need the information, however, significance testing will lead to the wrong answer (Schmidt, 1996).
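The overestimation described in point c can be sketched with a small simulation. All of the specifics are assumptions chosen for illustration: a true standardized effect of 0.3, 20 subjects per group, a normal approximation for the sampling distribution of d, and a fixed random seed so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean


def simulate_studies(true_d, n_per_group, n_studies=20000, seed=1):
    """Simulate effect estimates and summarize those that reach p < .05."""
    rng = random.Random(seed)
    se = sqrt(2.0 / n_per_group)  # approximate standard error of d
    crit = 1.96 * se              # smallest |d| that is significant at .05
    estimates = [rng.gauss(true_d, se) for _ in range(n_studies)]
    significant = [d for d in estimates if abs(d) > crit]
    return mean(estimates), mean(significant), len(significant) / n_studies


all_mean, sig_mean, rejection_rate = simulate_studies(0.3, 20)
print("mean of all estimates:        ", round(all_mean, 2))
print("mean of significant estimates:", round(sig_mean, 2))
print("proportion significant:       ", round(rejection_rate, 2))
```

Averaged over all simulated studies the estimate is close to the true effect, but the subset that reaches significance is centered well above it. A literature filtered on p < .05 inherits that bias.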

What should we do about the problem?

1. Stop significance testing at once!

a.  The radical version of this statement is probably unnecessary.  Nonetheless, I strongly believe that if we removed the crutch of significance testing, researchers would be forced to think more carefully about their data, research design, theory development, and theory evaluation.  As long as the crutch is there, it is unlikely that things will change.

We would do well to take Meehl’s advice:  Always ask yourself “If there were no sampling error present (i.e., if these sample statistics were the population values), what would these data mean, and does the general method provide a strong test of my theory?”  If you feel uncomfortable confronting this question, then you are relying on significance tests for the wrong reasons.  If you can answer this question confidently, then the use of significance tests will probably do you little harm (but significance tests will probably do you no good either).

2.  Report descriptive statistics, confidence intervals, effect sizes, and parameter estimates.

a.  I don’t believe that effect sizes and confidence intervals are the solution to the problem, but they sure are helpful.  There simply is no way to build a cumulative science based on p-values. A results section should first and foremost be devoted to a careful and systematic description of data.  We construct theories to explain (and anticipate) data, not p-values.
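As one possible starting point, the sketch below reports a mean difference, a standardized effect size (Cohen's d), and an approximate 95% confidence interval for two groups. The function name and the data are made up for illustration, and the interval uses the normal critical value 1.96 rather than a t critical value, which is too narrow for samples this small.

```python
from math import sqrt
from statistics import mean, variance


def describe_difference(group_a, group_b, z_crit=1.96):
    """Return the mean difference, Cohen's d, and an approximate 95% CI."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    pooled_sd = sqrt(pooled_var)
    se = pooled_sd * sqrt(1.0 / n_a + 1.0 / n_b)  # standard error of the difference
    return {
        "mean_difference": diff,
        "cohens_d": diff / pooled_sd,
        "ci_95": (diff - z_crit * se, diff + z_crit * se),
    }


# Made-up scores for two hypothetical groups.
summary = describe_difference([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
for name, value in summary.items():
    print(name, value)
```

A summary like this tells the reader how large the difference is and how precisely it was estimated, which is what a theory has to explain.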

3. If you insist upon using significance testing, please take statistical power seriously.

a.  There is no point in spending a year of one’s life collecting data on a topic of interest if one is going to try to make sense of those data by using a method that, on average, has a 50% error rate.  If the question is important enough to ask, it is important enough to answer correctly.

b.  Keep in mind that significance tests are typically used to deal with the problem of sampling error.  Thus, the need for significance tests primarily exists in low power situations (situations in which sampling error is substantial).  If one increases power, sampling error decreases, and the need for significance tests diminishes.  The bottom line: Significance tests are dangerous in low power situations, and pretty much useless in high power situations.  Nonetheless, if you’re going to use them (e.g., because editors require them or co-authors feel naked without them), it is better to use them in a way that makes them pointless rather than detrimental.
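The link between sample size and sampling error in point b can be sketched by computing how wide a 95% confidence interval for a standardized mean difference would be at different sample sizes. The function name is hypothetical, and the formula assumes two equal-sized groups and a normal approximation.

```python
from math import sqrt


def ci_half_width(n_per_group, z_crit=1.96):
    """Approximate half-width of a 95% CI for a standardized mean difference."""
    return z_crit * sqrt(2.0 / n_per_group)


# The margin of error shrinks with the square root of the sample size.
for n in (20, 200, 2000):
    print(n, round(ci_half_width(n), 3))
```

Once the interval is narrow enough that the estimate can be read more or less directly, the significance test adds little to what the estimate already says.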

4. Spend more time learning about the philosophy and history of science.

a.  Graduate training rarely addresses the problems of theory development and appraisal.  What makes a theory a good one?  How can one develop a sophisticated theory and devise clever, rigorous ways to test it?  How can one generate novel predictions?  If you want to learn the answers to these questions, you’ll have to look outside of the typical psychology curriculum.

b. I would recommend studying the philosophy of science in order to obtain answers to these questions.  But, better yet . . .

c. . . . use cliometrics.  In the philosophy of science, our understanding of what makes theories good is based on nothing more than case studies, selected examples, and rhetoric.  It is fun, interesting, and inspiring to read selections from the history of science, but there is virtually no scientific research on what makes theories good.  À la Meehl: We should treat the history of science as a scientific problem. Theories and methods should be randomly sampled from the history of science to determine what factors differentiate progressive from degenerative theories.

5. Spend more time learning and teaching mathematics.

a.  This is not a catch-all solution, of course.  Nonetheless, if one is able to present a theory in a formal way (via mathematics, symbolic logic, physical or computational models), it is much easier to generate concrete predictions, to understand what the theory anticipates and what the theory ignores, and to avoid ad hockery.  In short, formalism promotes clear thinking.

b. It is important to keep in mind that “developed sciences” like physics do not always begin with a rigorous formalism.  The formalism is developed over time.  It might be hard to imagine how some loose social psychological theory of stereotyping may be formalized initially, but that should not discourage one from trying.  These attempts should be informed by (and inform) data.  (This is why it is important to report real data, and not just p-values.) Our path analysis examples demonstrated some ways in which an initially loose theory can eventually be used to make point predictions about phenomena yet to be observed.
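The course's path analysis examples are not reproduced here, but a generic illustration of a point prediction might look like the sketch below. It assumes a hypothetical causal chain X -> M -> Y with standardized path coefficients, no direct path from X to Y, and no common causes. Under those assumptions the model implies that the X-Y correlation equals the product of the two paths. All names and numbers are invented.

```python
def predicted_correlation(path_xm, path_my):
    """Correlation between X and Y implied by the chain X -> M -> Y."""
    return path_xm * path_my


def prediction_error(observed_r, predicted_r):
    """Deviation between the observed correlation and the model's prediction."""
    return observed_r - predicted_r


# Hypothetical standardized paths and a hypothetical observed correlation.
predicted = predicted_correlation(0.5, 0.4)
print("predicted r_xy:", round(predicted, 3))
print("deviation from observed r_xy = .26:", round(prediction_error(0.26, predicted), 3))
```

The theory is then appraised by the size of the deviation between the predicted and observed values, not by whether the observed correlation differs from zero.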

6. Adopt a Do It Yourself (DIY) attitude.

a.  As noted above, I think what we need and what we teach are very different things.  The only way to improve one’s research and thinking is to adopt a DIY attitude.  If you think you would benefit from more mathematical training, start reading more books on mathematics, start “sitting in” on courses in the mathematics department, start looking for good examples and emulate them.  This training will not come to you; you will have to hunt it down—or demand it from your graduate program.

b. Start learning how to use mathematical software packages other than SPSS.  There is an increasing trend toward “point and click” data analysis in mainstream statistical packages.  This trend will constrain, not liberate, your ability to do good science.  R, S+, Mathematica, Matlab and other packages exist that allow you to do things your own way.  I would strongly encourage you to work with these packages.  The learning curve is a bit challenging at first, but keep in mind, you probably didn’t master SPSS in a day.

c.  Nature doesn’t present itself as an ANOVA problem.  Chances are that the models you develop for psychological phenomena will not conform to those available in the SPSS tool chest.  You’ll need to be able to build, explore, and test these models via alternative means. Use mathematical models to understand and explain the phenomena of interest; do not force the phenomena to fit a readily available, generic mathematical model.

7.  Keep in mind that there is no magical solution to NHST.

a.  When I discuss the significance testing debate with colleagues, they often ask “What should we do instead of significance tests?  How do we test hypotheses without significance tests?”  In my mind, there is nothing to be replaced.  If we had a cancerous tumor growing in our lungs, we would ask a doctor to remove the growth; we would not ask the doctor to replace it with a better cancer.

b. Another way to spin the issue:  If we want a solution, we need to be clear on what problem we think we’re solving when we use NHST.  If we are trying to deal with the problem of sampling error, the solution is obvious: minimize sampling error by using large samples and quantify the standard error wherever possible (e.g., report confidence intervals).  We should place more trust in parameter estimates that have small associated standard errors.

If the problem we are dealing with is “hypothesis testing,” then the solution is to learn more about the philosophy of science, quantitative model development, risky tests, and the problem of verisimilitude.  Significance testing should never be equated with “hypothesis testing.”  Recall Meehl’s question under point 1: If we could estimate the parameters of interest without error, we would still need to develop ways to appraise the verisimilitude of the theory.  NHST does not solve this philosophical problem for us.

c.  As noted above, if you feel naked and hopeless without significance tests, that’s a good indication that you were relying on NHST for the wrong reasons.  Take a moment to feel discouraged, and then start finding ways to make your theories, and your tests of those theories, more sophisticated.  If your theory predicts that there is a positive relationship between two variables, then your theory is corroborated (to a small degree) if you find a positive correlation between those two variables.  If that seems trivial to you, that’s because the theoretical prediction was trivial to begin with.  You need to recognize this before you can change things.  Making the test “harder” or “stiffer” by giving yourself a .05 hurdle to pass is a misguided way to rigorously test a theory.