1.2. Data science research perspectives

1.2.2. Sample vs. Population

Regardless of whether we are using static or streaming data, we are rarely privy to the entire group we are researching. Most of the time, we only explore a small part, a sample, of the group we are interested in. This distinction is normally discussed as the sample versus the population, that is, the whole group. It is important because it ties into the representativity of our results. Representativity, simply put, means the degree to which results based on our sample can be said to represent the population.


Representation by Jukka Tyrkkö


Population is a statistical concept that refers to all instances of the objects or phenomena we are interested in studying. It does not necessarily refer to people but could be a group of anything we collect data from, such as novels of a particular genre from a particular country or period. There are rare situations where we can perform queries on an entire population, for instance, when we look at publications from a particular author or the tweets of a small, clearly defined group. In these situations, where it is feasible to collect and process all instances, it is, of course, the best thing to do.

However, in most instances, we are interested in populations much larger than we can realistically compile, process and manage in full. In these cases, we need to be aware of what we are actually looking at and not attempt to generalize beyond what our data can be considered to represent. The concepts of representativity and generalizability always need to be explicitly discussed when we present our results.
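
The difference between a representative and an unrepresentative sample can be illustrated with a small simulation. The following is only a minimal sketch in Python: the population is made-up data (word counts for 10,000 hypothetical documents), and the function name `sample_vs_population`, the sample size of 200 and the fixed seeds are assumptions chosen purely for illustration.

```python
import random
import statistics


def sample_vs_population(population, sample_size, seed=0):
    """Return the population mean and the mean of a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = rng.sample(population, sample_size)  # drawn without replacement
    return statistics.mean(population), statistics.mean(sample)


# Hypothetical population: word counts of 10,000 made-up documents.
data_rng = random.Random(42)
population = [data_rng.randint(50, 500) for _ in range(10_000)]

pop_mean, random_mean = sample_vs_population(population, 200)

# A biased "sample": only the 200 longest documents.
# It is the same size as the random sample but does not represent the population.
biased_sample = sorted(population)[-200:]
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {pop_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
```

In this sketch the mean of the random sample should land close to the population mean, while the mean of the biased sample lies far above it. Both samples contain the same number of documents, so the difference comes from how the sample was selected, which is why generalizing from it to the whole population would be misleading.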