Relevance of Statistics in the new Data Science world
Rajeeva L Karandikar
Chennai Mathematical Institute, India
Abstract
With Big Data and Data Science becoming buzzwords, many people are questioning the relevance of statistics as compared with purely data-driven models.
In this article, I will explain my view that several statistical ideas are as relevant now as they have been in the past.
1 Introduction
For over a decade now, Big Data, Analytics and Data Science have been buzzwords. Following the current trend, we will refer to any combination of these three simply as data science. Many professionals working in the IT sector have moved to positions in data science and have picked up new tools. Often, these tools are used as black boxes. This is not surprising, because most of these professionals have little, if any, background in statistics.
One often hears them make comments such as: “With a large amount of data available, who needs statistics and statisticians? We can process the data with the various available tools and pick the tool that best serves our purpose.”
We hear many stories of wonderful outcomes arising from what can be termed purely data-driven approaches. This has led to a tendency to simply take a large chunk of available data and push it through an AIML engine to derive ‘intelligence’ from it, without giving a thought to where the data came from, how it was collected, or what connection the data has with the questions we are seeking answers to. If an analyst were to ask questions about the data – How was it collected? When was it collected? – the answer one frequently hears is: “How does it matter?”
Later in this article, we will see that it does matter. We will also see that there are situations where blind use of the tools with data may lead to poor conclusions.
As more and more data become available in various contexts, our ability to draw meaningful actionable intelligence will grow enormously. The best way forward is to marry statistical insights with ideas in AIML, and then use the vast computing power available at one’s fingertips. For this to happen, statisticians and AIML experts must work together along with domain experts.
Through some examples, we will illustrate how ignoring statistical ideas and thought processes that have evolved over the last 150 years can lead to incorrect conclusions in many critical situations.
2 Small data is still relevant
First, let us note that there is a class of problems where the statistical theory and methodology developed over the last 150 years continues to play a central role, since the data run only into hundreds or at most thousands, never millions. For example, issues related to quality control, quality measurement and quality assurance require only a few hundred data points from which to draw valid conclusions. Finance is another area where the use of data has become increasingly common, and here too we work with a relatively small number of data points; the term VaR (value-at-risk), which is essentially a statistical quantity – the 95th or 99th percentile of the potential loss – has entered the law books of several countries. There are roughly 250 trading days in a year, and there is little point in going back more than 3 or 5 years, as economic ground realities keep changing. Thus, we may have only about 1250 data points of daily closing prices to use for, say, portfolio optimisation, option pricing or risk management. One could use hourly prices (about 10,000 data points), or even tick-by-tick trading data, but for portfolio optimisation and risk management the common practice is to use daily prices. In election forecasting, psephologists usually work with just a few thousand data points from an opinion poll to predict election outcomes. Finally, policy makers, who keep tabs on various socio-economic parameters in a nation, rely on survey data, which of course do not run into millions.
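To make the VaR calculation mentioned above concrete, here is a minimal sketch using the historical (non-parametric) approach, with simulated prices standing in for real market data since no data set accompanies this article:

import numpy as np

# Simulated stand-in for ~5 years (1250 days) of daily closing prices.
rng = np.random.default_rng(0)
daily_returns = rng.normal(loc=0.0004, scale=0.01, size=1250)
prices = 100 * np.cumprod(1 + daily_returns)

# Daily losses on a portfolio of assumed value `portfolio_value`.
portfolio_value = 1_000_000
returns = np.diff(prices) / prices[:-1]
losses = -returns * portfolio_value

# Historical VaR: the 95th and 99th percentiles of the loss distribution --
# the statistical quantity referred to in the text.
var_95 = np.percentile(losses, 95)
var_99 = np.percentile(losses, 99)
print(f"1-day 95% VaR: {var_95:,.0f}  |  1-day 99% VaR: {var_99:,.0f}")

With only about 1250 observations, the 99th percentile is estimated from roughly the dozen largest losses, which is precisely why careful statistical treatment of small samples matters here.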
One of the biggest problems faced by humanity in recent times is the COVID-19 pandemic. From March 2020 until the end of that year, everyone was waiting for vaccines against COVID-19. Finally, in December 2020, the first vaccine was approved, and more have followed. Let us recall that the approval of vaccines is based on RCTs – Randomised Clinical Trials – which involve a few thousand observations, along with concepts developed in the statistical literature under the theme of Design of Experiments. Indeed, most drugs and vaccines are identified, tested and approved using these techniques.
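As a rough sketch of the kind of calculation that underlies such an approval (the participant and case counts below are made up for illustration, not taken from any actual trial), one can compare infection rates in the two arms of an RCT with a simple two-proportion test:

import numpy as np
from scipy import stats

# Hypothetical two-arm trial: a few thousand participants per arm.
n_vaccine, n_placebo = 5000, 5000
cases_vaccine, cases_placebo = 8, 45   # made-up case counts

p_v = cases_vaccine / n_vaccine
p_p = cases_placebo / n_placebo

# Point estimate of vaccine efficacy = 1 - risk ratio.
efficacy = 1 - p_v / p_p

# Two-proportion z-test of the null hypothesis of equal infection rates.
p_pool = (cases_vaccine + cases_placebo) / (n_vaccine + n_placebo)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_vaccine + 1 / n_placebo))
z = (p_p - p_v) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"Estimated efficacy: {efficacy:.1%}, z = {z:.2f}, p-value = {p_value:.2e}")

Randomisation is what allows this comparison to be read causally: it ensures the two arms differ only by chance and by the vaccine itself.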
These examples illustrate that there are several problems where we need to arrive at a decision or reach a conclusion without millions of data points; we must do our best with a few hundred or a few thousand. Statistical techniques for working with small data will therefore always remain relevant.
3 Perils of purely data driven inference
This example goes back nearly 150 years. Sir Francis Galton was a cousin of Charles Darwin and, as a follow-up to Darwin’s ideas on evolution, was studying the inheritance of genetic traits from one generation to the next. His focus was on how intelligence is passed from one generation to the next. Studying inheritance, Galton wrote: “It is some years since I made an extensive series of experiments on the produce of seeds of different size but of the same species. It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre; smaller than the parents, if the parents were large; larger than the parents, if the parents were small.” Galton firmly believed that this phenomenon would hold for humans as well, and for all traits that are passed on genetically, including intelligence.
To illustrate his point, Galton obtained data on the heights of parents and their (grown-up) offspring; he chose height because data on it were easy to obtain. His analysis of the data confirmed the hypothesis quoted above. He further argued that this phenomenon would continue over generations, and that as a result the heights of future offspring would keep moving towards the average height. He argued that the same would happen to intelligence, and thus everyone would eventually have only average intelligence. He chose the title of his paper accordingly: Regression Towards Mediocrity in Hereditary Stature.
The conclusion drawn by Galton is fallacious, as can be seen by analysing the same data with the roles of the offspring’s height and the mid-height of the parents interchanged. This leads to exactly the opposite conclusion: if the offspring is taller than average, then the average height of the parents will be less than that of the offspring, while if the offspring is shorter than average, then the average height of the parents will be more than that of the offspring. Moreover, the variation in heights (the variance) was comparable in the two generations, whereas if there were genuine regression towards the mean, the variance would have decreased. Thus, Galton’s conclusion about regression to mediocrity over generations is not correct. However, the methodology that he developed for the analysis of inheritance of heights has become a standard tool in statistics and continues to be called regression.
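A small simulation makes the symmetry explicit. It is a sketch under the simple (assumed) model that mid-parent and offspring heights are jointly normal with equal variances and correlation 0.5; the “regression towards the mean” then appears in both directions, while the spread of heights does not shrink:

import numpy as np

rng = np.random.default_rng(1)
n, mean, sd, rho = 100_000, 170.0, 6.0, 0.5

# Jointly normal mid-parent (x) and offspring (y) heights, equal variances.
cov = [[sd**2, rho * sd**2], [rho * sd**2, sd**2]]
x, y = rng.multivariate_normal([mean, mean], cov, size=n).T

tall_parents = x > mean + sd
tall_children = y > mean + sd

# Offspring of tall parents are, on average, shorter than their parents ...
print(y[tall_parents].mean(), "<", x[tall_parents].mean())
# ... but parents of tall offspring are also, on average, shorter than them.
print(x[tall_children].mean(), "<", y[tall_children].mean())
# And the spread of heights is essentially the same in both generations.
print(x.std(), "~", y.std())

The effect is a property of imperfect correlation, not of heredity driving everyone towards the average.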
Galton was so convinced of his theory that he looked at the data from only one angle and found confirmation of his belief. This phenomenon is called Confirmation Bias – a term coined by the English psychologist Peter Wason in the 1960s.
4 Is the data representative of the population?
Given data, even if it is huge, one must first ask how it was collected. Only after knowing this can one begin to determine whether the data is representative of the population.
In India, many TV news channels take a partisan position, either supporting the government or opposing it. Suppose News Channel 1 and News Channel 2 both run a poll on their websites, at the same time, on a policy announced by the government. Even if both sites attract a large number of responses, it is very likely that the conclusions will be diametrically opposite, since the people who frequent each site will largely be those whose political inclination is aligned with that of the channel. This underscores the point that just having a large set of data is not enough – it must truly represent the population in question for the inference to be valid.
If someone gives a large chunk of data on voter preferences to an analyst and wants her to analyse it and predict the outcome of the next elections, she must start by asking how the data was collected; only then can she decide whether it represents the Indian electorate. Suppose, for example, that the data consists of posts and messages on social media regarding political questions during the previous few weeks. Less educated, rural and economically weaker sections are highly underrepresented on social media, and thus conclusions drawn from the opinions of such a group (of social media users) will give little insight into how the Indian electorate will vote. The same social media data can, however, be used to quickly assess the market potential of a high-end smartphone – for its target market is precisely those who are active on social media.
5 Perils of blind use of tools without understanding them
The next example is not a single incident but a recurring theme – that of trying to evaluate the efficacy of an entrance test used for admission, such as the IIT-JEE for admission to the IITs, the CAT for admission to the IIMs, or the SAT or GRE for admission to top universities in the USA. Let us call such tests benchmark tests: they are open to all candidates, and those who perform very well in the benchmark test are shortlisted for admission to the targeted programme. The analysis consists of computing the correlation between the score on the benchmark test and the performance of the candidate in the programme. Often the correlation is found to be rather poor, and this leads to discussion about the quality of the benchmark test. What is forgotten or ignored is that performance data is available only for the candidates selected for admission. This phenomenon is known as Selection Bias – the data set consists of only a subset of the whole group under consideration, selected based on some criterion.
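A quick simulation shows the effect. It assumes, purely for illustration, that the benchmark score and later programme performance are jointly normal with a healthy correlation of 0.7; restricting attention to the admitted candidates alone shrinks the observed correlation dramatically:

import numpy as np

rng = np.random.default_rng(2)
n, rho = 200_000, 0.7

# Benchmark-test score (x) and programme performance (y) for ALL candidates.
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
print("Correlation over all candidates:", np.corrcoef(x, y)[0, 1])

# Performance data is observed only for the top 2% on the benchmark test.
cutoff = np.quantile(x, 0.98)
admitted = x > cutoff
print("Correlation among admitted only:", np.corrcoef(x[admitted], y[admitted])[0, 1])

The low correlation among the admitted says little about the quality of the test; it is largely an artefact of the selection.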
This study also illustrates the phenomenon known as the absence of tail dependence for the joint normal (Gaussian) distribution: even when two jointly normal variables are strongly correlated, they behave almost independently once we restrict attention to the extreme upper tail – which is exactly the region where the admitted candidates lie. Unfortunately, this property is inherited by many statistical models used for risk management and is considered one of the reasons for the collapse of global financial markets in 2008.
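A companion sketch (again with assumed, illustrative numbers rather than real portfolio data) makes the absence of tail dependence visible: for jointly normal variables, the chance that one variable is extreme given that the other is extreme keeps falling as the threshold rises, despite a high overall correlation.

import numpy as np

rng = np.random.default_rng(3)
n, rho = 1_000_000, 0.7
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

# P(Y exceeds its q-quantile | X exceeds its q-quantile) for increasing q.
for q in (0.90, 0.99, 0.999):
    qx, qy = np.quantile(x, q), np.quantile(y, q)
    both = np.mean((x > qx) & (y > qy))
    print(f"q = {q}:  P(Y > q | X > q) = {both / (1 - q):.3f}")

In risk management this matters because joint extreme losses are precisely the events such models end up understating.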
A similar bias occurs in studies related to health, where, for reasons beyond the control of the team undertaking the study, some patients are no longer available for observation. The bias this introduces is called Censoring Bias, and how to account for it in the analysis is a major theme in the area of statistics known as Survival Analysis.
6 Correlation does not imply causation
Most data-driven analysis can be summarised as trying to discover relationships among different variables – and this is what correlation and regression are all about. These were introduced by Galton about 150 years ago and have been a source of intense debate ever since. One of the myths of pure data analysis is that correlation implies causation. This need not be true; moreover, correlation captures only linear association, and one may need transformations of the variables to uncover more complex relationships.
One example often cited is where X is the sale of ice-cream in a coastal town in Europe and Y is the number of deaths due to drowning (while swimming in the sea, in that town) in the same month. One sees a strong correlation! While there is no reason why eating more ice-cream should lead to more deaths due to drowning, one can see that both are strongly correlated with a variable Z = the average of the daily maximum temperatures during the month: in summer months more people eat ice-cream, and more people go swimming! In such instances, the variable Z is called a Confounding Variable.
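A simulation along the lines of this (hypothetical) example shows how a confounder manufactures correlation: ice-cream sales X and drownings Y are each driven by temperature Z but have no direct link, yet their correlation is strong, and it largely disappears once Z is controlled for. The coefficients below are invented solely for illustration.

import numpy as np

rng = np.random.default_rng(4)
n = 5_000

z = rng.normal(25, 5, n)                    # average daily maximum temperature
x = 100 + 10 * z + rng.normal(0, 20, n)     # ice-cream sales, driven by z
y = 2 + 0.3 * z + rng.normal(0, 1.5, n)     # drowning deaths, also driven by z

print("corr(X, Y)     =", np.corrcoef(x, y)[0, 1])

# Partial correlation of X and Y controlling for Z: correlate the residuals
# left after regressing each of X and Y on Z.
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
print("corr(X, Y | Z) =", np.corrcoef(rx, ry)[0, 1])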
In today’s world, countrywide data are available for a large number of variables – socio-economic variables, variables related to health, nutrition, hygiene and pollution, economic variables and so on. One can easily list about 500 variables for which data on over 100 countries are available, and one is likely to observe correlations among several pairs of these 500 variables. One such recent observation: the Gross Domestic Product (GDP) of a country and the number of deaths per million population due to COVID-19 are strongly correlated! Of course, there is no causal reason why richer or more developed countries should have more deaths.
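The arithmetic behind such coincidental correlations is easy to demonstrate: 500 variables give about 125,000 pairs, and even if every variable were pure noise measured on only about 100 countries, many pairs would show apparently “strong” correlations by chance alone. A sketch with purely random data:

import numpy as np

rng = np.random.default_rng(5)
n_countries, n_vars = 100, 500

# Pure noise: no variable has any real relationship with any other.
data = rng.normal(size=(n_countries, n_vars))
corr = np.corrcoef(data, rowvar=False)

# Examine the off-diagonal correlations across all ~125,000 pairs.
pairs = corr[np.triu_indices(n_vars, k=1)]
print("number of pairs:        ", pairs.size)
print("largest |correlation|:  ", np.abs(pairs).max())
print("pairs with |corr| > 0.3:", int((np.abs(pairs) > 0.3).sum()))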
Just as linear relationships may be spurious, so may the relationships discovered by AIML algorithms. Hence the lessons of the statistical literature, going back a century, are needed to weed out spurious conclusions and find the right relationships for business intelligence.
7 Simpson’s paradox and the omitted variable bias
Simpson’s Paradox is an effect whereby ignoring an important variable may reverse the conclusion. A classic example of Simpson’s paradox is a study of gender bias in graduate admissions at the University of California, Berkeley. In 1973, it was alleged that there was gender bias in graduate admissions – the acceptance rate among men was 44% while among women it was 35%. When the statisticians at Berkeley tried to identify which departments were responsible for this, they looked at department-wise acceptance rates and found that, if anything, there was a bias against the men. The apparent bias in the pooled data arose because far more women applied to departments with lower acceptance rates. The variable department in this example is called a confounding factor. In the economics literature, the same phenomenon is also called Omitted Variable Bias.
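A toy numerical version of the Berkeley situation shows how the reversal happens when data are pooled. The department sizes and rates below are made up to reproduce the effect; they are not the actual 1973 figures.

# Hypothetical (accepted, applied) counts chosen to reproduce the reversal.
data = {
    "Dept A (easy to get into)": {"men": (800, 1000), "women": (90, 100)},
    "Dept B (hard to get into)": {"men": (20, 100),   "women": (250, 1000)},
}

def rate(accepted, applied):
    return accepted / applied

# Within each department, women are accepted at a HIGHER rate than men ...
for dept, groups in data.items():
    print(dept,
          f"men: {rate(*groups['men']):.0%}",
          f"women: {rate(*groups['women']):.0%}")

# ... yet in the pooled data women appear to fare much worse, because far
# more women applied to the department with the low acceptance rate.
for sex in ("men", "women"):
    accepted = sum(groups[sex][0] for groups in data.values())
    applied = sum(groups[sex][1] for groups in data.values())
    print(f"pooled {sex}: {rate(accepted, applied):.0%}")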
8 Selection Bias and World War II
During World War II, based on a simple analysis of data obtained from damaged planes returning to base after bombing raids, it was proposed to the British air force that armour be added to those areas that showed the most damage. Professor Abraham Wald of Columbia University, a member of the Statistical Research Group (SRG), was asked to review the findings and recommend how much extra armour should be added to the vulnerable parts.
Wald looked at the problem from a different angle. He realised that there was a selection bias in the data presented to him – only the aircraft that survived returned to the base and made it into the data. Wald assumed that the probability of being hit in any given part of the plane was proportional to its area (since the shooters could not aim at any specific part of the plane). Also, given that there was no redundancy in aircraft design at that time, the effect of hits on a given area of the aircraft was independent of the effect of hits on any other area. Once he made these two assumptions, the conclusion was obvious – armour should be added to the parts where fewer hits had been observed. Thus statistical thinking led Wald to a model that gave the right frame of reference connecting the data (hits on planes that returned) with the desired conclusion (where to add armour).
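A small simulation captures the argument, with made-up part areas and hit-lethality probabilities: hits land in proportion to area, but planes hit in a critical part (say, the engine) rarely return, so the returning planes show few engine hits precisely because such hits are fatal.

import numpy as np

rng = np.random.default_rng(6)

parts = ["engine", "fuselage", "wings", "tail"]
areas = np.array([0.15, 0.40, 0.35, 0.10])      # relative areas (assumed)
p_down_per_hit = {"engine": 0.6, "fuselage": 0.05, "wings": 0.05, "tail": 0.2}

n_planes, hits_per_plane = 10_000, 5
survivor_hits = {p: 0 for p in parts}

for _ in range(n_planes):
    # Hits land on a part with probability proportional to its area.
    hit_parts = rng.choice(parts, size=hits_per_plane, p=areas)
    # Each hit may independently bring the plane down.
    downed = any(rng.random() < p_down_per_hit[p] for p in hit_parts)
    if not downed:
        for p in hit_parts:
            survivor_hits[p] += 1

total = sum(survivor_hits.values())
for p in parts:
    print(f"{p:9s} area: {areas[parts.index(p)]:.0%}  "
          f"share of hits on RETURNING planes: {survivor_hits[p] / total:.0%}")

The naive reading of the returning-plane data would armour the wings and fuselage; the model says to armour the engine.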
9 GMGO (Garbage Model, Garbage Out): the new GIGO in the Data Science world
The phrase Garbage-In-Garbage-Out (GIGO) is often used to describe the fact that, even with the best of algorithms, if the input data is garbage then the conclusion (output) is also likely to be garbage. Our discussion adds a new phenomenon, GMGO: a garbage model will lead to garbage output even with accurate data!
10 Conclusion
We have given examples where disregarding statistical understanding, built up and digested over 150 years, can lead to wrong conclusions. While purely data-driven techniques can perform reasonably well in many situations, combining them with domain knowledge and statistical techniques can do wonders in terms of unearthing valuable business intelligence and improving business performance.
We recommend data-driven AI/ML models as a good starting point, or a good exploratory step. Using domain knowledge and statistical thinking to avoid the various pitfalls discussed in this article can then take the analysis to a much higher level.