2.4. Frequency, distribution and inferential statistics
2.4.4. Statistical significance and effect size
The concepts of statistical significance and effect size are fundamentally important in inferential statistics.
In frequentist hypothesis testing, statistical significance is expressed using a p-value. The p-value indicates the probability of observing a difference of at least the magnitude found in the samples (read: corpora) if there were in fact no difference between the populations. In more formal frequentist terms, probability here refers to the proportion of repeated samples, drawn from populations with no underlying difference, that would nevertheless show a frequency difference of at least that size.
Imagine that we are interested in the use of the modal word "ought" in 19th-century and 20th-century newspaper English. We have two corpora, A and B, each containing newspaper texts from one of the two centuries. We carry out a corpus query and establish that "ought" occurs at different frequencies in corpus A and corpus B. Reporting that the difference is statistically significant at p<0.05 means the following: if there were no real difference between the two populations, and we were to randomly draw the same number of texts from each population, compare the results, and repeat the same process another 99 times, we would expect fewer than 5 of those 100 comparisons to show a frequency difference at least as large as the one we observed. Because such a result would be unlikely under the assumption of no difference, we conclude that the difference probably holds for the populations too. In statistical terms, we would report that we can reject the null hypothesis at p<0.05 (see also 2.6.3). A p-value of 0.05 is a commonly used threshold for considering a difference statistically significant. There is no magical reason why 0.05 was chosen as the threshold, and depending on the field of research and the question asked, other thresholds such as 0.01 and 0.001 are also used.
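To make the "ought" example concrete, the following is a minimal sketch of one possible significance test, Pearson's chi-square test on a 2x2 table, written in plain Python. The counts and corpus sizes are invented for illustration, the function name chi_square_2x2 is hypothetical, and the choice of test is an assumption rather than a method prescribed here. The sketch also treats every word token as an independent observation, which is a simplification for text data.

```python
import math


def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) for the frequency of one word in two corpora.

    freq_* is the number of hits, size_* the corpus size in words.
    Returns the chi-square statistic and its p-value.
    """
    # Observed 2x2 table: [hits, all other words] for each corpus.
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - (freq_a + freq_b)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the word were equally frequent in both corpora
            # (this is the null hypothesis).
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical data: "ought" occurs 150 times in corpus A and 100 times
# in corpus B, each corpus containing one million words.
chi2, p = chi_square_2x2(freq_a=150, size_a=1_000_000,
                         freq_b=100, size_b=1_000_000)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
print("reject the null hypothesis at p<0.05" if p < 0.05
      else "cannot reject the null hypothesis at p<0.05")
```

With these invented counts the p-value falls below 0.05, so the null hypothesis of no difference would be rejected at that threshold. The same function can be rerun with a stricter threshold such as 0.01 or 0.001 by changing the comparison in the last line.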
While statistical significance tells us how reliably we can conclude that the observed frequency difference holds for the populations as well, it does not directly address the magnitude of the difference. For that, we use a separate statistic called effect size. The frequency difference itself is of course an indication of how substantial the difference is, but because every linguistic feature occurs at a different frequency (common function words have very high frequencies, rare lexical words have low frequencies), it can be difficult to evaluate how substantial a given difference is in real-world terms. Effect size statistics standardise the magnitude of differences and give us a stable point of reference.
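The point about raw differences being hard to compare across features can be illustrated with a short Python sketch. It computes three simple ratio-based measures (relative frequency ratio, its base-2 logarithm, and the odds ratio) for a very frequent word and a rare word. These measures are chosen only for illustration and are not necessarily the effect size statistics intended in this section; all counts are invented and the function names are hypothetical.

```python
import math


def per_million(freq, size):
    """Normalised frequency: hits per million words."""
    return freq / size * 1_000_000


def effect_sizes(freq_a, size_a, freq_b, size_b):
    """Ratio-based effect size measures for one word in two corpora.

    Assumes both frequencies are greater than zero; a zero count would
    need smoothing before a ratio can be computed.
    """
    rate_a = freq_a / size_a
    rate_b = freq_b / size_b
    ratio = rate_a / rate_b
    odds_a = freq_a / (size_a - freq_a)
    odds_b = freq_b / (size_b - freq_b)
    return {
        "relative_frequency_ratio": ratio,
        "log2_ratio": math.log2(ratio),
        "odds_ratio": odds_a / odds_b,
    }


# Hypothetical counts in two one-million-word corpora.
size = 1_000_000
examples = {
    "common function word": (60_000, 59_000),
    "rare lexical word": (15, 5),
}

for label, (freq_a, freq_b) in examples.items():
    raw_diff = per_million(freq_a, size) - per_million(freq_b, size)
    sizes = effect_sizes(freq_a, size, freq_b, size)
    print(f"{label}: raw difference = {raw_diff:.0f} per million, "
          f"ratio = {sizes['relative_frequency_ratio']:.3f}, "
          f"log2 ratio = {sizes['log2_ratio']:.3f}, "
          f"odds ratio = {sizes['odds_ratio']:.3f}")
```

In this invented example the function word shows the larger raw difference per million words, yet its ratio stays close to 1, whereas the rare word is three times as frequent in corpus A as in corpus B. Standardised measures of this kind make the two differences comparable on a single scale.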