
Domain and LLM

I am in total agreement with the quote from Morgan Zimmerman of Dassault Systèmes in TOI today.  Every industry has its own terminology, concepts, names and words, i.e., an industry language. He says even a simple-looking word like “Certification” has different meanings in aerospace and in life sciences.  He recommends using industry-specific language, and your own company-specific language, to get significant benefit out of LLMs. This will also reduce hallucinations and misunderstandings.

This is in line with @AiThoughts.Org thoughts on adding domain-specific and company-specific information on top of the general data used by all LLMs.  Like they say in real estate, the 3 most important things in any buying decision are “Location, Location and Location”.  We need 3 things to make LLMs work for the enterprise: “Domain, Domain and Domain”.   Many of us may recall a very successful Bill Clinton presidential campaign slogan, “The economy, Stupid”.   We can say “The domain, Stupid” as the slogan to make LLMs useful for enterprises.

But the million-dollar question is how much the learning updates using your domain and company data are going to cost.  EY published a cost of $1.4 billion, which very few can afford.  We need much less expensive solutions for large-scale implementation of LLMs.

Solicit your thoughts. #LLM #aiml #Aiethics #Aiforindustry

L Ravichandran

AI Legislation: Need Urgency

Let me first wish you all happy Navratri festivities.  I still fondly remember the Durga Pooja days during my Indian Statistical Institute years.  However, we also need to remember that we are in the midst of two wars, one in Ukraine and the other in the Middle East.  We hope that solutions are found and further loss of life and destruction is stopped.

I came across two articles in The Hindu newspaper regarding our topic, AI.  I have attached a scan of an editorial by M.K. Narayanan, a well-known national security and cyber expert.

A few highlights are worth mentioning for all of us to ponder.

  • There is general agreement that the latest advances in AI do pose a major threat and need to be regulated like nuclear power technologies.
  • All countries are not only “locking the gates after the horse has bolted”, but “discussing locking the gates and deciding on the make and model of the lock while the horse has bolted”.  Huge delays in enacting and implementing AI legislation are flagged as a big issue.
  • Rogue nations that willfully decide not to enforce any regulations will get a huge advantage over law-abiding nations.
  • More than 50% of large enterprises are sitting on “intangible” assets which are at huge risk of being wiped out by non-state actors using AI-powered cyber warfare.
  • Cognitive warfare using AI technologies will destabilize governments and news media and alter human cognition.
  • This is a new kind of warfare in which states and technology companies must closely collaborate.
  • Another interesting mention is over-dependence on AI and algorithms, which may have caused the major intelligence failure in the latest Middle East conflict.

All of these point to the same conclusion.  All countries, multi-lateral organizations such as the UN, EU, African Union and G20, and multi-lateral military alliances like NATO must move at lightning speed to understand and agree on measures to effectively control and use this great technology.

The old classic advertisement slogan “JUST DO IT” must be the motto of all these organizations.

Similar efforts are needed from all large enterprises, large financial institutions and regulatory agencies to get ready for the implementation of these technologies at scale.

Last but not least, large technology companies need to look at this not just as another innovation to help automation, but as a technology that affects humans and causes major disruption, and spend sufficient resources on understanding it and putting sufficient brakes in place to avoid runaway situations.

Cyber security specialists, ethical auditors and risk management auditors will have huge opportunities, and they have to start upskilling fast.

More later,

L Ravichandran.

AI Regulations: Need for urgency

A few weeks ago, I saw a news article about the risks of unregulated AI.  The article reported that in the USA, police came to the house of an eight-months-pregnant African American lady and arrested her because a facial recognition system had identified her as the suspect in a robbery. No amount of pleading from the lady, that given her advanced pregnancy at the time of the robbery she simply could not have committed the crime, was heard by the police officer.  The police officer did not have any discretion.  The system was set up such that once the AI face recognition identifies a suspect, the police are required to arrest her, bring her to the police station and book her.

In this case, she was taken to the police station, booked and released on bail. A few days later the case against her was dismissed, as the AI system had wrongly identified her.  It was also found that she was not the first case: a few more people, especially African American women, had been wrongly arrested and released later due to an incorrect facial recognition model.

The slow speed at which governments are moving on regulations, and the proliferation of AI tech companies delivering business applications such as this facial recognition model, demand urgent regulation.

Maybe citizens themselves should organize and hold the people responsible for deploying these systems accountable.  The Chief of Police, maybe the Mayor of the town, and the County officials who signed off on this AI facial recognition system should be made accountable.  Maybe the County should pay hefty fines, and not just offer a simple “oops, sorry”.

A lot of attention needs to be placed on training data.  Training data should represent all the diverse people in the country in sufficient samples.  Expected biases due to lack of sufficient diversity in training data must be anticipated and the model tweaked.  Most democratic countries have a criminal justice system with an unwritten motto: “Let 1000 criminals go free, but not a single innocent person should go to jail”.  The burden of proof of guilt is always on the state.  However, we seem to have forgotten this when deploying these law enforcement systems.  Proof with very high confidence levels, and explainable AI with human-understandable reasoning, must be the basic approval criteria for these systems to be deployed.

The proposed EU act classifies these law enforcement systems as high risk, so they will come under the act.  Hopefully the EU act becomes law soon and helps avoid such unfortunate violations of civil liberties and human rights.

More Later,

L Ravichandran

AI for Sustainability and Sustainability in AI

I will be referring to the following 3 papers on this very interesting topic.

(1)  A. van Wynsberghe: Sustainable AI: AI for sustainability and the sustainability of AI. AI and Ethics, Springer (2021). https://link.springer.com/article/10.1007/s43681-021-00043-6

(2)  Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models. https://www.researchgate.net/publication/342763375_Carbontracker_Tracking_and_Predicting_the_Carbon_Footprint_of_Training_Deep_Learning_Models/link/5f0ef0f2a6fdcc3ed7083852/download

(3)  Lacoste, A., Luccioni, A., Schmidt, V., Dandres, T.: Quantifying the Carbon Emissions of Machine Learning. (2019)

While there is a tremendous push for using new-generation generative AI based on large language models to solve business applications, there are also voices of concern from experts in the community about the dangers and ethical consequences.  A lot has been written about this but one aspect which has not picked up sufficient traction, in my opinion, is Sustainable AI.  

In (1), van Wynsberghe defines two disciplines on AI & sustainability:   AI for Sustainability and Sustainable AI.

AI for Sustainability is any business application using AIML technology to solve climate problems, i.e., the use of this new-generation technology to help with climate change and CO2 reduction.   Major applications are being developed for optimal energy distribution across renewable and fossil energy sources. Every extra percent drawn from renewable sources means less use of fossil fuels and helps with climate change.  Other applications include better climate predictions and the use of less water, pesticides, and fertilizers for food production.  Many Industry 4.0 applications to build new smart factories, smart cities, and smart buildings fall into this category.

On the other hand, Sustainable AI measures the massive energy used by GPUs and other computing, storage, and communications resources while building AI models, and suggests ways to reduce it.  While digital software development and testing can be done on a few developers’ laptops with minimal use of IT resources, the AIML software development life cycle calls for massive training data and for deep learning neural networks with many millions of nodes.   Some of the new-generation large language models use billions of parameters, beyond the imagination of all of us.  The energy use does not stop there.  Fine-tuning for specific domains, or relearning, can be as energy-consuming as the original training, or sometimes even more.   Some numbers mentioned in (1) are reproduced here to highlight the point.   Training one deep-learning NLP model produced emissions equivalent to 600,000 lbs of CO2.  Google’s AlphaGo Zero generated over 90 tonnes of CO2 over the 40 days it took for its initial training.  These numbers are large and at least call for review and discussion.   I hope I have been able to open your eyes and generate some interest in this new dimension of AI & Ethics, i.e., the impact on climate change.

I am sure many of you will ask, “Hasn’t every next-generation industrialization, from horse carriages to automobiles or from steam engines to combustion engines, always increased the use of energy, and why do we need to worry about this for AI?”.  Or “There has been so much talk about how many light bulbs one can light with the power used for a simple Google search, so why worry about this now?”.  All valid questions.

However, I will argue that

  1. The current climate change situation is already at a critical stage, and any unplanned, large-scale new usage of energy can become “the straw that broke the camel’s back!”.
  2. A fully data-driven life cycle and deep neural networks with billions of parameters are being used for the first time at an industrial scale and industry-wide, and there are too many unknowns.

What are the suggestions?

  • Energy consumption measurement and publication must become part of the AI & Ethics practice followed by all AI development organizations.   The Carbontracker tool (2) and the machine learning emissions calculator (3) are suggestions for this crucial measurement.  I strongly recommend organizations use their Quality & Metrics departments to research and agree on a measurement acceptable to all within each organization.  More research and discussion are needed on calculating the net increase in energy use compared to current IT tools, to get the right measurement. In some cases, the current IT tools may be using legacy mainframes and expensive dedicated communication lines that use up large amounts of energy, and the net difference from using AIML may not be that large.
  • Predicting the energy use at the beginning of the AIML project life cycle is also required (3).
  • The predicted CO2-equivalent emissions need to be used as another cost when approving AIML projects.
  • Emission prediction will also force AIML developers to select the right size of training data and the right models for the application. Avoid the temptation of running the model on billions of data sets just because the data is available! Use the right tools for the right job.  You don’t need a tank to kill an ant!
  • Ask whether the use of deep learning is appropriate for this business application. For example, a simple HR application for recruitment or employee loyalty prediction built with deep learning models may turn out to be too expensive in terms of CO2 emissions and need not be considered a viable project.
  • CEOs should include this data in their Climate Change Initiatives Report to the Board and shareholders, and also deduct the carbon credits used up by these AIML applications from the company’s carbon credit commitments.
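
As a rough illustration of how an emission figure could be predicted before a project is approved, here is a minimal sketch. It assumes a deliberately simple model (device power × training hours × data-centre overhead × grid carbon intensity); this is a simplification for illustration, not the actual method of the tools in (2) or (3). The function name and every number in it are hypothetical placeholders.

```python
def estimate_co2e_kg(gpu_count, gpu_watts, hours, pue, grid_kg_per_kwh):
    """Rough CO2-equivalent estimate for a training run (simplified model).

    energy (kWh) = devices * power per device (kW) * hours * PUE
    emissions    = energy * grid carbon intensity (kg CO2e per kWh)
    """
    energy_kwh = gpu_count * (gpu_watts / 1000.0) * hours * pue
    return energy_kwh * grid_kg_per_kwh


# All inputs below are hypothetical placeholders, not measurements.
full_run = estimate_co2e_kg(gpu_count=8, gpu_watts=300, hours=100,
                            pue=1.5, grid_kg_per_kwh=0.5)
small_run = estimate_co2e_kg(gpu_count=1, gpu_watts=300, hours=10,
                             pue=1.5, grid_kg_per_kwh=0.5)

print(f"full run:  {full_run:.1f} kg CO2e")
print(f"small run: {small_run:.2f} kg CO2e")
```

Even a crude estimate like this lets a right-sized model be compared against a larger one as one more cost line in the project approval.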

More Later,

L Ravichandran

Plus ça change: Is ML the new name for Statistics?

Names change, but ideas usually don’t. How is today’s ‘data science’ different from yesterday’s statistics, mathematics and probability?

 Actually, it’s not very different. If it seems changed it’s only because the ground reality has changed. Yesterday we had data scarcity, today we have a data glut (“big data”). Yesterday we had our models, and were seeking data to validate them. Today we have data, and seek models to explain what this data is telling.

 Can we find associations in our data? If there’s association, can we identify a pattern? If there are multiple patterns, can we identify which are the most likely? If we can identify the most likely pattern, can we abstract it to a universal reality? That’s essentially the data science game today.

 Correlation

 Have we wondered why the staple food in most of India is dal-chaval or dal-roti? Why does almost everyone eat the two together? Why not just dal followed by just chaval?

 The most likely reason is that the nutritive benefit when eaten together is more than the benefit when eaten separately. Or think of why doctors prescribe combination drug therapies, or think back to the film Abhimaan (1973) in which Amitabh Bachchan and Jaya Bhaduri discovered that singing together created harmony, while singing separately created discord. Being together can offer a greater benefit than being apart.

 Of course, togetherness could also harm more. Attempting a combination of two business strategies could hurt more than using any individual strategy. Or partnering Inzamam ul Haq on the cricket field could restrict two runs to a single, or, even more likely, result in a run out!

 In data science, we use the correlation coefficient to measure the degree of linear association or togetherness. A correlation coefficient of +1 indicates the best possible positive association; while a value of -1 corresponds to the negative extreme. In general, a high positive or negative value is an indicator of greater association.

 The availability of big data now allows us to use the correlation coefficient to more easily confirm suspected associations, or discover hidden associations. Typically, the data set is a spreadsheet, e.g., supermarket data with customers as rows, and every merchandise sold as a column. With today’s number crunching capability, it is possible to compute the correlation coefficient between every pair of columns in the spreadsheet. So, while we can compute the correlation coefficient to confirm that beer cans and paper napkins are positively correlated (could be a dinner party), we could also unearth a hidden correlation between beer cans and baby diapers.
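
To make the "every pair of columns" idea concrete, here is a small sketch that computes the Pearson correlation coefficient for all pairs of columns in a toy spreadsheet. The basket data and column names are made up for illustration; a real supermarket table would have far more rows and columns.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical basket data: one list per merchandise column,
# one entry per customer (units bought).
baskets = {
    "beer":    [6, 0, 4, 0, 2, 8, 0, 3],
    "napkins": [2, 0, 1, 0, 1, 3, 0, 1],
    "diapers": [1, 0, 1, 0, 0, 2, 0, 1],
    "soap":    [1, 2, 1, 2, 1, 0, 2, 1],
}

# Correlation for every pair of columns.
pairwise = {
    (a, b): pearson(baskets[a], baskets[b])
    for a, b in combinations(baskets, 2)
}

# List the pairs, strongest association (positive or negative) first.
for pair, r in sorted(pairwise.items(), key=lambda kv: -abs(kv[1])):
    print(pair, round(r, 2))
```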

 Why would beer cans and baby diapers be correlated? Perhaps there’s no valid reason, perhaps there’s some unknown common factor that we don’t know about (this has triggered off the ‘correlation-is-not-causation’ discussion). But today’s supermarket owner is unlikely to ponder over such imponderables; he’ll just direct his staff to place baby diapers next to beer cans and hope that it leads to better sales!

 Regression

 If two variables X and Y have a high correlation coefficient, it means that there is a strong degree of linear dependence between them. This opens up an interesting possibility: why not use the value of X to predict the likely value of Y? The prospect becomes even more enticing when it is easy to obtain X, but very hard (or expensive) to obtain Y.

 To illustrate, let us consider the height (X) and weight (Y) data of 150 male students in a class. The correlation coefficient between X and Y is found to be 0.88. Suppose a new student joins. We can measure his height with a tape, but we don’t have a weighing scale to obtain his weight. Is it possible to predict his weight?

 Let us first plot this data on a scatter diagram (see below); every blue dot on the plot corresponds to the height-weight of one student. The plot looks like a dense maze of blue dots. Is there some ‘togetherness’ between the dots? There is (remember the correlation is 0.88?), but it isn’t complete togetherness (because, then, all the dots would’ve aligned on a single line).

 To predict the new student’s weight, our best bet is to draw a straight line cutting right through the middle of the maze. Once we have this line, we can use it to read off the weight of the new student on the Y-axis, corresponding to his measured height plotted on the X-axis.

 How should we draw this line? The picture offers two alternatives: the blue line and the orange line. Which of the two is better? The one that is ‘middler’ through the maze is better. Let us drop down (or send up) a ‘blue perpendicular’ from every dot on to the blue line, and, likewise, an ‘orange perpendicular’ from every dot on to the orange line (note that if the dot is on the line, the corresponding perpendicular has zero length). Now sum the lengths of all the blue perpendiculars, and separately sum the lengths of all the orange perpendiculars. The line with the smaller sum is the better line!

  

[Figure: scatter plot of the students’ data with the blue and orange candidate lines. X: height; Y: weight]

 Notice that the blue and orange lines vary only in terms of their ‘slope’ and ‘shift’, and there can be an infinity of such lines. The line with the lowest sum of the corresponding perpendiculars will be the ‘best’ possible line. We call this the regression line to predict Y using X; and it will look like:

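Y = a1 X + a2, with a1 and a2 being the slope and shift values of this best line. This is the underlying idea in the famed least-squares method, which, to be precise, measures each gap vertically (along the Y-axis) rather than perpendicular to the line, and sums the squares of these gaps.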
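
The idea can be sketched in a few lines. The sketch below uses the standard closed-form least-squares solution, which minimises the sum of squared vertical gaps between the dots and the line. The height and weight values are made up for illustration, and the function names are my own.

```python
def fit_line(xs, ys):
    """Return (a1, a2) minimising the sum of squared vertical gaps
    between the points and the line y = a1 * x + a2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a1 = sxy / sxx          # slope
    a2 = my - a1 * mx       # shift (intercept)
    return a1, a2


def sum_sq(xs, ys, a1, a2):
    """Sum of squared vertical gaps for a candidate line."""
    return sum((y - (a1 * x + a2)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: heights in cm, weights in kg.
heights = [160, 165, 168, 170, 172, 175, 178, 180, 183, 188]
weights = [55, 58, 63, 62, 66, 70, 69, 75, 78, 84]

a1, a2 = fit_line(heights, weights)

# Predict the weight of a new student who is 174 cm tall.
predicted = a1 * 174 + a2
print(f"slope={a1:.3f}, shift={a2:.2f}, predicted weight={predicted:.1f} kg")
```

Any other choice of slope and shift (the ‘orange line’, say) gives a larger value of `sum_sq` on the same data.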

 Bivariate to multivariate

 Let us see how we can apply the same idea to the (harder) problem of predicting the likely marks (Y) that a student might get in his final exam. The numbers of hours studied (X1) seems to be a reasonable predictor. But if we compute the correlation coefficient between Y and X1, using sample data, we’ll probably find that it is just about 0.5. That’s not enough, so we might want to consider another predictor variable. How about the intelligence quotient (IQ) of the student (X2)? If we check, we might find that the correlation between Y and X2 too is about 0.5.

 Why not, then, consider both these predictors? Instead of looking at just the simple correlation between Y and X, why not look at the multiple correlation between Y and both X1 and X2? If we calculate this multiple correlation, we’ll find that it is about 0.8.

 And, now that we are at it, why not also add two more predictors: Quality of the teaching (X3), and the student’s emotional quotient (X4)? If we go through the exercise, we’ll find that the multiple correlation keeps increasing as we keep adding more and more predictors.

 However, there’s a price to pay for this greed. If three predictor variables yield a multiple correlation of 0.92, and the next predictor variable makes it 0.93, is it really worth it? Remember too that with every new variable we also increase the computational complexity and errors.

 And there’s another – even more troubling – question. Some of the predictor variables could be strongly correlated among themselves (this is the problem of multicollinearity). Then the extra variables might actually bring in more noise than value!

 How, then, do we decide what’s the optimal number of predictor variables? We use an elegant construct called the adjusted multiple correlation. As we keep adding more and more predictor variables to the pot (we add the most correlated predictor first, then the second most correlated predictor and so on …), we reach a point where the addition of the next predictor diminishes the adjusted multiple correlation even though the multiple correlation itself keeps rising. That’s the point to stop!

 Let us suppose that this approach determines that the optimal number of predictors is 3. Then the multiple regression line to predict Y will look like Y = a1 X1 + a2 X2 + a3 X3 + a4, where a1, a2, a3, a4 are the coefficients based on the least-squares criterion.
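
The stopping rule can be sketched with the usual adjusted R² formula, assuming that this is what ‘adjusted multiple correlation’ refers to here. The sample size and the R² values below are hypothetical; they are chosen so that the fourth predictor adds almost nothing.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n = 30  # hypothetical sample size

# Hypothetical R^2 values as predictors are added one at a time.
# Plain R^2 never decreases when a predictor is added.
r2_by_k = {1: 0.60, 2: 0.80, 3: 0.8464, 4: 0.8470}

adj = {k: adjusted_r2(r2, n, k) for k, r2 in r2_by_k.items()}
best_k = max(adj, key=adj.get)

for k in sorted(adj):
    print(k, round(r2_by_k[k], 4), round(adj[k], 4))
print("stop at", best_k, "predictors")
```

The adjusted value penalises every extra predictor, so a predictor that barely improves the fit pulls the adjusted figure down even though the unadjusted one still rises.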

 Predictions using multiple regression are getting more and more reliable because there’s so much more data these days to validate. There is this (possibly apocryphal) story of a father suing a supermarket because his teenage daughter was being bombarded with mailers to buy new baby kits. “My daughter isn’t pregnant”, the father kept explaining. “Our multiple regression model indicates a very high probability that she is”, the supermarket insisted. And she was …

 As we dive deeper into multivariate statistics we’ll find that this is the real playing arena for data science; indeed, when I look at the contents of a machine learning course today, I can’t help feeling that it is multivariate statistics re-emerging with a new disguise. As the French writer Jean-Baptiste Alphonse Karr remarked long ago: plus ça change, plus c’est la même chose!

Relevance of Statistics In the New Data Science World


Rajeeva L Karandikar

Chennai Mathematical Institute, India 

Abstract 

With Big Data and Data Science becoming buzzwords, various people are wondering about the relevance of statistics versus pure data driven models.

In this article, I will explain my view that several statistical ideas are as relevant now as they have been in the past.  

 

1 Introduction

For over a decade now, Big Data, Analytics, Data-Science have become buzzwords. As is the trend now, we will just refer to any combination of these three as data-science. Many professionals working in the IT sector have moved to positions in data science and they have picked up new tools. Often, these tools are used as black boxes.  This is not surprising because most of them have little if any background in statistics. 

We can often hear them make comments such as, “With a large amount of data available, who needs statistics and statisticians? We can process the data with various available tools and pick the tool that best serves our purpose.”

We hear many stories of wonderful outcomes coming from what can be termed pure data-driven approaches. This has led to a tendency of simply taking a large chunk of available data and pushing it through an AIML engine, to derive ‘intelligence’ out of it, without giving a thought to where the data came from, how it was collected and what connection the data has with the questions that we are seeking answers to. If an analyst were to ask questions about the data (How was it collected? When was it collected?), the answer one frequently hears is: “How does it matter?”

 Later in this article, we will see that it does matter. We will also see that there are situations where blind use of the tools with data may lead to poor conclusions.

As more and more data become available in various contexts, our ability to draw meaningful actionable intelligence will grow enormously. The best way forward is to marry statistical insights to ideas in AIML, and then use the vast computing power available at one’s fingertips. For this to happen, statisticians and AIML experts must work together along with domain experts.

 

Through some examples, we will illustrate how ignoring statistical ideas and thought processes that have evolved over the last 150 years can lead to incorrect conclusions in many critical situations. 

2 Small data is still relevant

First let us note that there is a class of problems where all the statistical theory and methodology developed over the last 150 years continues to have a role, since the data is only in hundreds or at most thousands and never in millions. For example, issues related to quality control, quality measurement, quality assurance etc. only require a few hundred data points from which to draw valid conclusions. Finance is another area where the use of data has become increasingly common: the term VaR (value-at-risk), which is essentially a statistical term (the 95th or 99th percentile of the potential loss), has entered the law books of several countries. Here too we work with a relatively small number of data points. There are roughly 250 trading days in a year and there is no point going beyond 3 or 5 years in the past as economic ground realities are constantly changing. Thus, we may have only about 1250 data points of daily closing prices to use for, say, portfolio optimisation or option pricing, or for risk management. One can use hourly prices (with 10,000 data points), or even tick-by-tick trading data, but for portfolio optimisation and risk management, the common practice is to use daily prices. In election forecasting, psephologists usually work with just a few thousand data points from an opinion poll to predict election outcomes. Finally, policy makers, who keep tabs on various socio-economic parameters in a nation, rely on survey data which of course is not in millions.
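
To show how little data a VaR figure rests on, here is a minimal sketch of a historical (empirical percentile) VaR computed from about 1250 daily returns. The returns are simulated with made-up parameters, and reading the percentile straight off the sorted losses is only one of several possible estimators.

```python
import math
import random


def value_at_risk(losses, level):
    """Historical VaR: the `level` quantile of the observed losses."""
    ordered = sorted(losses)
    index = math.ceil(level * len(ordered)) - 1
    return ordered[index]


random.seed(7)

# Hypothetical daily returns: 5 years x 250 trading days.
returns = [random.gauss(0.0005, 0.01) for _ in range(1250)]
losses = [-r for r in returns]  # a loss is a negative return

var95 = value_at_risk(losses, 0.95)
var99 = value_at_risk(losses, 0.99)

print(f"95% VaR: {var95:.4f}")
print(f"99% VaR: {var99:.4f}")
```

With 1250 observations, the 99% figure depends on only about a dozen of the worst days, which is why careful small-sample statistics matters here.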

One of the biggest problems faced by humanity in recent times is the COVID-19 virus. From March 2020 till the year end, everyone was waiting for the vaccines against COVID-19. Finally in December 2020, the first vaccine was approved and more have followed. Let us recall that the approval of vaccines is based on RCT – Randomised Clinical Trials which involve a few thousand observations, along with concepts developed in statistical literature under the theme Design of experiments. Indeed, most drugs and vaccines are identified, tested and approved using these techniques. 

These examples illustrate that there are several problems where we need to arrive at a decision or reach a conclusion where we do not have millions of data points. We must do our best with a few hundred or few thousand data points. So statistical techniques of working with small data will always remain relevant. 

3 Perils of purely data driven inference

This example goes back nearly 150 years. Sir Francis Galton was a cousin of Charles Darwin, and as a follow up to Darwin’s ideas of evolution, Galton was studying inheritance of genetic traits from one generation to the next. He had his focus on how intelligence is passed from one generation to the next.  Studying inheritance, Galton wrote “It is some years since I made an extensive series of experiments on the produce of seeds of different size but of the same species. It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre; smaller than the parents, if the parents were large; larger than the parents, if the parents were small.” Galton firmly believed that this phenomenon will be true for humans as well and for all traits that are passed on genetically, including intelligence. 

 

To illustrate his point, Galton obtained data on heights of parents and their (grown-up) offspring. He chose height as it was easy to obtain data on it. His analysis of the data confirmed his hypothesis, quoted above. He further argued that this phenomenon would continue over generations, and its effect would mean that heights of future offspring will continue to move towards the average height. He argued that the same will happen to intelligence and thus everyone will only have average intelligence. He chose the title of the paper as Regression Towards Mediocrity in Hereditary Stature.

 

The conclusion drawn by Galton is fallacious, as can be seen by analysing the same data with the roles of the heights of offspring and the mid-heights of parents interchanged. This leads to an exactly opposite conclusion: if the offspring is taller than average, then the average height of the parents will be less than that of the offspring, while if the offspring is shorter than average, then the average height of the parents will be more than that of the offspring. It could also be seen that the variation in heights (variance of heights) in the two generations was comparable, whereas if there were regression towards the mean over generations, the variance would have decreased. Thus, Galton’s conclusion about regression to mediocrity over generations is not correct. However, the methodology that he developed for the analysis of inheritance of heights has become a standard tool in statistics and continues to be called Regression.

Galton was so convinced of his theory that he just looked at the data from one angle and got confirmation of his belief. This phenomenon is called Confirmation Bias, a term coined by English psychologist Peter Wason in the 1960s.
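
The symmetry can be checked on simulated data. The sketch below generates hypothetical parent and offspring heights with the same mean and spread and a correlation of 0.5 (all made-up values, not Galton's data), then fits the regression slope in both directions.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


random.seed(1)
MEAN, SD, RHO = 170.0, 7.0, 0.5        # hypothetical values (cm)
noise_sd = SD * (1 - RHO ** 2) ** 0.5   # keeps both generations equally spread

parents = [random.gauss(MEAN, SD) for _ in range(20000)]
children = [MEAN + RHO * (p - MEAN) + random.gauss(0, noise_sd)
            for p in parents]

child_on_parent = slope(parents, children)   # predict child from parent
parent_on_child = slope(children, parents)   # predict parent from child
var_ratio = variance(children) / variance(parents)

print(round(child_on_parent, 2), round(parent_on_child, 2), round(var_ratio, 2))
```

Both slopes come out well below 1, so each generation appears to ‘regress’ towards the mean when predicted from the other, while the spread of heights stays the same across generations.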

4 Is the data representative of the population

Given data, even if it is huge, one must first ask how it was collected. Only after knowing this can one begin to determine if the data is representative of the population.

In India, many TV news channels take a single view, either supporting the government or opposing it.  Let us assume News Channel 1 and News Channel 2 both run a poll on their websites, at the same time, on a policy announced by the government. Even if both sites attract a large number of responses, it is very likely that the conclusions will be diametrically opposite, since the people who frequent each site will likely be ones with a political inclination aligned with the website.  This underscores the point that just having a large set of data is not enough: it must truly represent the population in question for the inference to be valid.
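
A small simulation makes the point. It assumes a hypothetical population that is evenly split on the policy, and a website whose visitors are mostly supporters; the visit probabilities are invented for illustration.

```python
import random

random.seed(5)

TRUE_SUPPORT = 0.5
# True = supports the policy, False = opposes it.
population = [random.random() < TRUE_SUPPORT for _ in range(100000)]

# Hypothetical assumption: supporters are far more likely than
# opponents to visit News Channel 1's website and answer its poll.
VISIT_PROB = {True: 0.30, False: 0.05}
web_poll = [v for v in population if random.random() < VISIT_PROB[v]]

# A much smaller, but properly random, sample of the same population.
random_sample = random.sample(population, 1000)


def share(votes):
    """Fraction of respondents who support the policy."""
    return sum(votes) / len(votes)


true_share = share(population)
print("true support:      ", round(true_share, 3))
print("web poll estimate: ", round(share(web_poll), 3), "from", len(web_poll))
print("random sample est.:", round(share(random_sample), 3), "from", len(random_sample))
```

The web poll has many times more responses than the random sample, yet its estimate is far from the truth, because the way the data was collected determines who is in it.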

If someone gives a large chunk of data on voter preferences to an analyst and wants her to analyse it and predict the outcome of the next elections, she must start by asking how the data was collected, and only then can she decide if it represents the Indian electorate or not. For example, the data may come from social media posts and messages on political questions during the previous few weeks. However, less educated, rural and economically weaker sections are highly underrepresented on social media, and thus conclusions drawn from the opinions of such a group (of social media users) will not give insight into how the Indian electorate will vote. However, the same social media data can be used to quickly assess the market potential of a high-end smartphone, for its target market is precisely those who are active on social media.

5 Perils of blind use of tools without understanding them

The next example is not one incident but a recurrent theme: that of trying to evaluate the efficacy of an entrance test for admission, such as IIT-JEE for admission to IITs, CAT for admission to IIMs, or SAT or GRE for admission to top universities in the USA. Let us call such tests benchmark tests; they are open to all candidates, and those who perform very well in the benchmark test are shortlisted for admission to the targeted program. The analysis consists of computing the correlation between the score on the benchmark test and the performance of the candidate in the program. Often it is found that the correlation is rather poor, and this leads to discussion on the quality of the benchmark test.  What is forgotten or ignored is that the performance data is available only for the candidates selected for admission. This phenomenon is known as Selection Bias, where the data set consists of only a subset of the whole group under consideration, selected based on some criterion.
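
The effect of looking only at admitted candidates can be simulated. The sketch assumes hypothetical test scores and program performance that are jointly normal with a true correlation of 0.7, and an admission rule that takes the top 5% of test scores; all of these numbers are invented for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(3)
RHO = 0.7        # hypothetical true correlation among ALL candidates
N = 20000

scores = [random.gauss(0, 1) for _ in range(N)]
performance = [RHO * s + math.sqrt(1 - RHO ** 2) * random.gauss(0, 1)
               for s in scores]

# Admit only the top 5% by benchmark test score.
cutoff = sorted(scores)[N - N // 20]
admitted = [(s, p) for s, p in zip(scores, performance) if s >= cutoff]

r_all = pearson(scores, performance)
r_admitted = pearson([s for s, _ in admitted], [p for _, p in admitted])

print("all candidates:", round(r_all, 2))
print("admitted only: ", round(r_admitted, 2))
```

The test is a good predictor across all candidates, but among the admitted few the scores barely differ, so the observed correlation drops sharply even though nothing is wrong with the test.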

This study also illustrates the phenomenon known as Absence of Tail Dependence for joint normal distribution. Unfortunately, this property is inherited by many statistical models used for risk management and is considered one of the reasons for the collapse of global financial markets in 2008.

Similar bias occurs in studies related to health, where for reasons beyond the control of the team undertaking the study, some patients are no longer available for observation. The bias it introduces is called Censoring Bias and how to account for it in analysis is a major theme in an area known as Survival Analysis in statistics. 

6 Correlation does not imply causation

Most data-driven analysis can be summarised as trying to discover relationships among different variables, and this is what correlation and regression are all about. These were introduced by Galton about 150 years ago and have been a source of intense debate. One of the myths in pure data analysis is that correlation implies causation. This need not be true in all cases, and one needs to use transformations to get to more complex relationships.

One example often cited is where X is the sale of ice-cream in a coastal town in Europe and Y is the number of deaths due to drowning (while swimming in the sea, in that town) in the same month. One sees a strong correlation! While there is no reason why eating more ice-cream would lead to more deaths due to drowning, one can see that both are strongly correlated with a variable Z = average of the daily maximum temperature during the month; in summer months more people eat ice-cream, and more people go to swim! In such instances, the variable Z is called a Confounding Variable.
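The ice-cream example can be sketched with simulated data. In the code below, all numbers are made up for illustration: temperature Z drives both ice-cream sales X and drownings Y, and X has no effect on Y. The partial correlation of X and Y given Z, computed with the standard first-order formula, shows what remains once Z is accounted for. The helpers `pearson` and `partial_corr` are written for this example.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(1)
temperature, ice_cream, drownings = [], [], []
for _ in range(5000):  # hypothetical observations
    z = rng.gauss(20, 5)                              # Z: temperature
    temperature.append(z)
    ice_cream.append(10 * z + rng.gauss(0, 30))       # X depends only on Z
    drownings.append(0.5 * z + rng.gauss(0, 2))       # Y depends only on Z

r_xy = pearson(ice_cream, drownings)
r_xz = pearson(ice_cream, temperature)
r_yz = pearson(drownings, temperature)
r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)

print(f"corr(X, Y):     {r_xy:.2f}")
print(f"corr(X, Y | Z): {r_xy_given_z:.2f}")
```

In this simulated setting the raw correlation between X and Y is strong, while the partial correlation given Z is close to zero, which is the signature of a confounding variable.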

In today’s world, countrywide data is available for a large number of socio-economic variables, variables related to health, nutrition, hygiene, pollution, economic variables and so on – one can, say, list about 500 variables where data on over 100 countries is available. One is likely to observe correlations among several pairs of these 500 variables – one such recent observation is: the Gross Domestic Product (GDP) of a country and the number of deaths per million population due to COVID-19 are strongly correlated! Of course, there is no reason why richer or more developed countries should have more deaths.

Just as linear relationships may be spurious, the relationships discovered by AI/ML algorithms may also be so. Hence the lessons of the statistical literature going back a century are needed to weed out spurious conclusions and find the right relationships for business intelligence.
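One reason spurious relationships turn up when many variables are scanned is chance alone. The sketch below is an illustration under stated assumptions, not an analysis of real country data: it generates 200 mutually independent random variables for 100 hypothetical countries (fewer than the 500 variables mentioned above, only to keep the run short) and counts how many pairs still show a sample correlation above 0.3 in absolute value. The helpers `standardise` and `count_strong_pairs` are written for this example.

```python
import random

def standardise(values):
    """Centre a list and scale it to unit length, so that the dot
    product of two standardised lists is their Pearson correlation."""
    n = len(values)
    mean = sum(values) / n
    centred = [v - mean for v in values]
    norm = sum(c * c for c in centred) ** 0.5
    return [c / norm for c in centred]

def count_strong_pairs(n_vars=200, n_countries=100, threshold=0.3, seed=0):
    """Count pairs of independent variables whose sample correlation
    exceeds the threshold in absolute value."""
    rng = random.Random(seed)
    data = [standardise([rng.gauss(0, 1) for _ in range(n_countries)])
            for _ in range(n_vars)]
    strong = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r = sum(a * b for a, b in zip(data[i], data[j]))
            if abs(r) > threshold:
                strong += 1
    total_pairs = n_vars * (n_vars - 1) // 2
    return strong, total_pairs

strong, total_pairs = count_strong_pairs()
print(f"{strong} of {total_pairs} pairs of unrelated variables have |r| > 0.3")
```

Every "strong" pair found here is spurious by construction, since the variables were generated independently of one another.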

7 Simpson’s paradox and the omitted variable bias

Simpson’s Paradox is an effect wherein ignoring an important variable may reverse the conclusion. One well-known example is a study of gender bias in graduate school admissions at the University of California, Berkeley. In 1973, it was alleged that there was a gender bias in graduate school admissions: the acceptance ratio among males was 44% while among females it was 35%. When the statisticians at Berkeley tried to identify which departments were responsible for this, they looked at department-wise acceptance ratios and found that, if anything, there was a bias against the males. The apparent bias in the pooled data appeared because a lot more women applied to departments which had lower acceptance rates. The variable department in this example is called a confounding factor. In the Economics literature, the same phenomenon is also called Omitted Variable Bias.
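The reversal is easy to reproduce with a small worked example. The counts below are hypothetical and chosen only to show the mechanism; they are not the actual Berkeley figures. Women have the higher acceptance rate in each of the two made-up departments, yet the lower rate once the departments are pooled, because most women applied to the department that admits fewer applicants.

```python
# Hypothetical (applicants, admitted) counts; NOT the actual Berkeley data.
applications = {
    "Dept A": {"men": (800, 500), "women": (100, 70)},   # easier to get into
    "Dept B": {"men": (200, 40), "women": (900, 225)},   # harder to get into
}

def rate(applied, admitted):
    """Acceptance rate as a fraction."""
    return admitted / applied

def pooled_rate(group):
    """Acceptance rate for a group with all departments combined."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return rate(applied, admitted)

for name, dept in applications.items():
    men = rate(*dept["men"])
    women = rate(*dept["women"])
    print(f"{name}: men {men:.1%}, women {women:.1%}")

print(f"Pooled: men {pooled_rate('men'):.1%}, women {pooled_rate('women'):.1%}")
```

With these made-up counts, the pooled rates are 54.0% for men and 29.5% for women, although women do better in both departments taken separately.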

8 Selection Bias and World War II

During World War II, based on a simple analysis of the data obtained from damaged planes returning to the base after bombing raids, it was proposed to the British air force that armour be added to those areas that showed the most damage. Professor Abraham Wald of Columbia University, a member of the Statistical Research Group (SRG), was asked to review the findings and recommend how much extra armour should be added to the vulnerable parts.

Wald looked at the problem from a different angle. He realised that there was a selection bias in the data presented to him: only the aircraft that did not crash returned to the base and made it into the data. Wald assumed that the probability of being hit in any given part of the plane was proportional to its area (since the shooters could not aim at any specific part of the plane). Also, given that there was no redundancy in aircraft at that time, the effect of hits on a given area of the aircraft was independent of the effect of hits in any other area. Once he made these two assumptions, the conclusion was obvious: armour should be added to the parts where fewer hits had been observed, since planes hit there tended not to return. So, statistical thinking led Wald to the model that gave the right frame of reference, connecting the data (hits on planes that returned) to the desired conclusion (where to add armour).
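Wald's reasoning can be illustrated with a toy simulation built on his two assumptions: hits land in proportion to area, and each hit acts independently. The section names, area shares and per-hit loss probabilities below are hypothetical, and this is a sketch of the idea rather than a reconstruction of Wald's actual calculations.

```python
import random

# Hypothetical sections:
# (share of surface area, probability that a single hit downs the plane)
SECTIONS = {
    "engine": (0.15, 0.60),
    "fuselage": (0.45, 0.10),
    "wings": (0.30, 0.10),
    "tail": (0.10, 0.20),
}

def simulate_raids(n_planes=20000, max_hits=5, seed=0):
    """Return the number of planes that came back and, per section,
    the hits seen on those planes divided by the section's area share."""
    rng = random.Random(seed)
    names = list(SECTIONS)
    areas = [SECTIONS[s][0] for s in names]
    observed = {s: 0 for s in names}
    returned = 0
    for _ in range(n_planes):
        # Assumption 1: each hit lands on a section in proportion to its area.
        hits = rng.choices(names, weights=areas, k=rng.randint(0, max_hits))
        # Assumption 2: each hit downs the plane independently of the others.
        survived = all(rng.random() > SECTIONS[s][1] for s in hits)
        if survived:
            returned += 1
            for s in hits:
                observed[s] += 1
    density = {s: observed[s] / SECTIONS[s][0] for s in names}
    return returned, density

returned, density = simulate_raids()
for section, value in sorted(density.items(), key=lambda kv: kv[1]):
    print(f"{section:9s} observed hits per unit area: {value:.0f}")
```

Among the returning planes, the section with the fewest observed hits per unit area is the one where a hit is most often fatal, which is where the armour belongs.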

9 GMGO (Garbage Model Garbage Out), the new GIGO in Data Science world

The phrase Garbage-In-Garbage-Out (GIGO) is often used to describe the fact that even with the best of algorithms, if the input data is garbage, then the conclusion (output) is also likely to be garbage. Our discussion adds a new phenomenon called GMGO i.e., a garbage model will lead to garbage output even with accurate data! 

10 Conclusion

We have given examples where disregarding statistical understanding developed over 150 years can lead to wrong conclusions. While in many situations pure data-driven techniques can do reasonably well, combining them with domain knowledge and statistical techniques can do wonders in terms of unearthing valuable business intelligence to improve business performance.

We recommend data-driven AI/ML models as a good starting point or exploratory step. Using domain knowledge to avoid the various pitfalls discussed in this paper can then take the analysis to a much higher level.