
## Testing hypotheses suggested by the data

In statistics, hypotheses suggested by the data must be tested differently from hypotheses formed independently of the data.

## The general problem

Testing a hypothesis on the same data that suggested it greatly inflates the probability of a type I error, because all but the data most favourable to the hypothesis are discarded. This is a risk not only in hypothesis testing but in all statistical inference, since it is often difficult to describe accurately the process of searching through and discarding data. It is a particular problem in statistical modelling, where many different models are rejected by trial and error before a result is published (see also overfitting). Likelihood and Bayesian approaches are no less at risk, owing to the difficulty of specifying the likelihood function without an exact description of the search-and-discard process. The error is particularly prevalent in data mining and machine learning. It also occurs commonly in academic publishing, where only reports of positive rather than negative results tend to be accepted, resulting in the effect known as publication bias.

## How to do it right

The best-known remedy, in the case of analysis of variance, is Henry Scheffé's simultaneous test of all contrasts in multiple comparisons problems. It is a method designed for testing hypotheses suggested by the data while avoiding the fallacy described above. See his "A Method for Judging All Contrasts in the Analysis of Variance", Biometrika, 40 (1953), pages 87–104.
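The inflation of type I error, and Scheffé's remedy, can be illustrated with a small simulation. The sketch below (assuming NumPy and SciPy are available; the function name and parameters are illustrative, not from the source) draws several groups from one common distribution, so every null hypothesis is true, and then tests the contrast the data themselves suggest: the largest group mean against the smallest. The naive two-sample t-test rejects far more often than its nominal level, while the same contrast judged against Scheffé's all-contrasts bound stays within it.

```python
import numpy as np
from scipy import stats

def false_positive_rates(n_sim=2000, k=10, n=20, alpha=0.05, seed=0):
    """Simulate k groups with NO real differences, then test the contrast
    the data suggest (largest mean vs smallest mean).
    Returns (naive_rate, scheffe_rate) of false rejections.
    """
    rng = np.random.default_rng(seed)
    naive = scheffe = 0
    df_error = k * (n - 1)
    # Scheffé's critical value: any contrast among k means is significant
    # only if its F statistic exceeds (k - 1) * F(alpha; k - 1, df_error)
    f_crit = (k - 1) * stats.f.ppf(1 - alpha, k - 1, df_error)
    for _ in range(n_sim):
        data = rng.normal(size=(k, n))           # null true: one common distribution
        means = data.mean(axis=1)
        hi, lo = means.argmax(), means.argmin()  # hypothesis suggested by the data
        # Naive approach: ordinary two-sample t-test at level alpha
        _, p = stats.ttest_ind(data[hi], data[lo])
        naive += p < alpha
        # Scheffé: same contrast, judged against the simultaneous bound,
        # using the error variance pooled over all k groups
        mse = data.var(axis=1, ddof=1).mean()
        f_stat = (means[hi] - means[lo]) ** 2 / (mse * (2 / n))
        scheffe += f_stat > f_crit
    return naive / n_sim, scheffe / n_sim
```

Under these settings the naive rejection rate is many times the nominal 5%, because the maximum-versus-minimum comparison was chosen precisely for looking extreme; Scheffé's bound, by covering all contrasts simultaneously, keeps the familywise error at or below alpha even for a data-suggested contrast.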

copyright © 2004 FactsAbout.com