1.2. Data science research perspectives

1.2.3. Sampling

So, how do we select our sample from a population (or from a sample too large to work with)?

Probability sampling is often a good choice when working with language materials and is relatively easy to use. Its defining feature is that every entry in our data has a known, non-zero chance of being sampled; in the simplest designs, that chance is the same for every entry. There are several types of probability sampling, and the one we pick will depend on what we will use the sample for and what we are looking for.

The most straightforward approach is simple random sampling. This is done by assigning a number to every entry in the population and using a random number generator to select as many entries as desired. While random sampling is an underlying assumption in many inferential statistical models, it is worth keeping in mind that, in reality, it is very rarely possible to carry out true random sampling. If, for instance, we are interested in comparing newspaper language in England and Scotland, our sampling frame will most probably not include every single issue of every newspaper published in the two countries. We may randomise the selection within our sampling frame, but that is conceptually different from randomly sampling the whole population.
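The numbering-and-drawing procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function name `simple_random_sample`, the made-up frame of 200 newspaper issues, and the seed value are all assumptions introduced for the example.

```python
import random


def simple_random_sample(entries, k, seed=None):
    """Draw k entries without replacement, each with equal probability."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(entries, k)


# Hypothetical sampling frame: 200 numbered newspaper issues.
frame = [f"issue_{i:03d}" for i in range(1, 201)]

sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```

Note that the draw is random only with respect to the list passed in. If that list is a sampling frame rather than the whole population, the limitation discussed above still applies.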

Systematic sampling is similar to random sampling but selects entries at fixed intervals rather than with a random number generator: for instance, selecting every 10th entry, starting from a chosen position in the list.
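Interval-based selection of this kind might look as follows in Python. The sketch assumes that the starting position is itself picked at random within the first interval, which is a common convention but is not specified in the text; the function name `systematic_sample` and the 100 numbered entries are hypothetical.

```python
import random


def systematic_sample(entries, step, seed=None):
    """Select every step-th entry, starting at a position within the first interval."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # assumption: the start is chosen at random
    return entries[start::step]


# Hypothetical data: 100 entries numbered 1 to 100.
numbered_entries = list(range(1, 101))

every_tenth = systematic_sample(numbered_entries, step=10, seed=1)
print(every_tenth)
```

If the data have a periodic structure that happens to coincide with the interval (for example, every 10th entry being a weekend edition), this approach can over-represent one kind of entry, so the ordering of the list deserves a quick check first.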

It is worth noting that many of the inferential statistical methods used for generalising from the sample to the population are based on the assumption that the sampling was random, that is, that when the sample was collected, every member of the population had an equal chance of being selected into the sample. If the sampling was not random, this should be taken into account when selecting statistical methods and interpreting their results.

For some datasets, selection needs to take some aspect of the data into account. For instance, if we want to sample texts from an extended period of time, we might want to have an equal number of texts for each year or decade. In this scenario, we would want to make use of stratified sampling, which means that we divide our data into groups based on the characteristics we care about and then select a number of entries from each group, so that each group is represented in our final sample.
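The grouping-then-sampling idea can be illustrated with a short Python sketch. Everything specific here is an assumption made for the example: the function name `stratified_sample`, the toy collection of 30 texts spread over three decades, and the choice to draw the same number of texts from each group (capped at the group size when a group is too small).

```python
import random
from collections import defaultdict


def stratified_sample(entries, key, per_group, seed=None):
    """Group entries by key(entry), then draw per_group entries from each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = min(per_group, len(members))  # small groups contribute all they have
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical corpus: 30 texts, 10 from each of three decades.
texts = [{"id": i, "decade": 1800 + 10 * (i % 3)} for i in range(30)]

by_decade = stratified_sample(texts, key=lambda t: t["decade"], per_group=4, seed=7)
print(by_decade)
```

Drawing an equal number per group matches the example in the text; an alternative design would draw from each group in proportion to its size, which keeps the sample's composition closer to that of the full dataset.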