
Plus ça change: Is ML the new name for Statistics?

Names change, but ideas usually don’t. How is today’s ‘data science’ different from yesterday’s statistics, mathematics and probability?

 Actually, it’s not very different. If it seems changed, it’s only because the ground reality has changed. Yesterday we had data scarcity; today we have a data glut (“big data”). Yesterday we had our models, and were seeking data to validate them. Today we have data, and seek models to explain what this data is telling us.

 Can we find associations in our data? If there’s association, can we identify a pattern? If there are multiple patterns, can we identify which are the most likely? If we can identify the most likely pattern, can we abstract it to a universal reality? That’s essentially the data science game today.

 Correlation

 Have we wondered why the staple food in most of India is dal-chaval or dal-roti? Why does almost everyone eat the two together? Why not just dal followed by just chaval?

 The most likely reason is that the nutritive benefit when eaten together is more than the benefit when eaten separately. Or think of why doctors prescribe combination drug therapies, or think back to the film Abhimaan (1973) in which Amitabh Bachchan and Jaya Bhaduri discovered that singing together created harmony, while singing separately created discord. Being together can offer a greater benefit than being apart.

 Of course, togetherness could also harm more. Attempting a combination of two business strategies could hurt more than using any individual strategy. Or partnering Inzamam ul Haq on the cricket field could restrict two runs to a single, or, even more likely, result in a run out!

 In data science, we use the correlation coefficient to measure the degree of linear association or togetherness. A correlation coefficient of +1 indicates the best possible positive association, while a value of -1 corresponds to the negative extreme. In general, the closer the value is to +1 or -1, the stronger the linear association; a value near 0 indicates little or no linear association.

 The availability of big data now allows us to use the correlation coefficient to more easily confirm suspected associations, or discover hidden associations. Typically, the data set is a spreadsheet, e.g., supermarket data with customers as rows, and every merchandise sold as a column. With today’s number crunching capability, it is possible to compute the correlation coefficient between every pair of columns in the spreadsheet. So, while we can compute the correlation coefficient to confirm that beer cans and paper napkins are positively correlated (could be a dinner party), we could also unearth a hidden correlation between beer cans and baby diapers.
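 The pairwise computation described above can be sketched in a few lines of Python. This is only an illustration: the basket data is invented, and the helper name pearson is an assumption, not something from a particular library.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical supermarket data: one entry per customer, units bought per item.
baskets = {
    "beer":    [6, 0, 4, 0, 12, 2, 0, 8],
    "napkins": [2, 0, 1, 1, 4, 1, 0, 3],
    "diapers": [1, 0, 2, 0, 3, 0, 1, 2],
    "milk":    [1, 2, 1, 2, 1, 2, 1, 2],
}

# Correlation coefficient for every pair of columns.
pairs = {
    (a, b): pearson(baskets[a], baskets[b])
    for a, b in combinations(baskets, 2)
}

# Print the pairs, strongest association (positive or negative) first.
for (a, b), r in sorted(pairs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{a:8s} {b:8s} {r:+.2f}")
```

 With a real spreadsheet the same loop runs over thousands of columns; the sorted listing is where a ‘hidden’ pair such as beer and diapers would surface.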

 Why would beer cans and baby diapers be correlated? Perhaps there’s no valid reason; perhaps there’s some common factor that we don’t know about (this has triggered the ‘correlation-is-not-causation’ discussion). But today’s supermarket owner is unlikely to ponder over such imponderables; he’ll just direct his staff to place baby diapers next to beer cans and hope that it leads to better sales!

 Regression

 If two variables X and Y have a high correlation coefficient, it means that there is a strong degree of linear dependence between them. This opens up an interesting possibility: why not use the value of X to predict the likely value of Y? The prospect becomes even more enticing when it is easy to obtain X, but very hard (or expensive) to obtain Y.

 To illustrate, let us consider the height (X) and weight (Y) data of 150 male students in a class. The correlation coefficient between X and Y is found to be 0.88. Suppose a new student joins. We can measure his height with a tape, but we don’t have a weighing scale to obtain his weight. Is it possible to predict his weight?

 Let us first plot this data on a scatter diagram (see below); every blue dot on the plot corresponds to the height-weight of one student. The plot looks like a dense maze of blue dots. Is there some ‘togetherness’ between the dots? There is (remember the correlation is 0.88?), but it isn’t complete togetherness (because, then, all the dots would’ve aligned on a single line).

 To predict the new student’s weight, our best bet is to draw a straight line cutting right through the middle of the maze. Once we have this line, we can use it to read off the weight of the new student on the Y-axis, corresponding to his measured height plotted on the X-axis.

 How should we draw this line? The picture offers two alternatives: the blue line and the orange line. Which of the two is better? The one that is ‘middler’ through the maze is better. Let us drop down (or send up) a vertical ‘blue segment’ from every dot to the blue line, and, likewise, a vertical ‘orange segment’ from every dot to the orange line (note that if the dot is on the line, the corresponding segment has zero length). Now, separately for each line, square the lengths of its segments and add them up. The line with the smaller sum is the better line!

  

X: height; Y: weight

 Notice that the blue and orange lines vary only in terms of their ‘slope’ and ‘shift’, and there can be an infinity of such lines. The line with the lowest sum of squared segment lengths will be the ‘best’ possible line. We call this the regression line to predict Y using X; and it will look like:

Y = a1 X + a2, with a1 and a2 being the slope and shift values of this best line. This is the underlying idea in the famed least-squares method.
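 For readers who want to see the arithmetic, here is a minimal sketch of the least-squares fit. In its usual form the method measures each dot’s vertical distance to the line and sums the squares; the slope and shift then have a closed-form solution. The heights and weights below are invented for illustration, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Return (slope, shift) minimising the sum of squared vertical distances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    shift = mean_y - slope * mean_x
    return slope, shift

def sum_squared_errors(xs, ys, slope, shift):
    """Sum of squared vertical distances from the dots to a given line."""
    return sum((y - (slope * x + shift)) ** 2 for x, y in zip(xs, ys))

# Hypothetical heights (cm) and weights (kg) of seven students.
heights = [160, 165, 170, 172, 175, 180, 185]
weights = [55, 60, 63, 68, 70, 76, 82]

a1, a2 = fit_line(heights, weights)
best = sum_squared_errors(heights, weights, a1, a2)

# Any other line, for example a slightly steeper one, scores worse.
other = sum_squared_errors(heights, weights, a1 + 0.1, a2)

print(f"weight = {a1:.2f} * height + {a2:.2f}")
print(f"predicted weight at 178 cm: {a1 * 178 + a2:.1f} kg")
```

 The last line is the ‘read off the weight’ step: plug the new student’s measured height into the fitted line.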

 Bivariate to multivariate

 Let us see how we can apply the same idea to the (harder) problem of predicting the likely marks (Y) that a student might get in his final exam. The number of hours studied (X1) seems to be a reasonable predictor. But if we compute the correlation coefficient between Y and X1, using sample data, we’ll probably find that it is just about 0.5. That’s not enough, so we might want to consider another predictor variable. How about the intelligence quotient (IQ) of the student (X2)? If we check, we might find that the correlation between Y and X2 too is about 0.5.

 Why not, then, consider both these predictors? Instead of looking at just the simple correlation between Y and X, why not look at the multiple correlation between Y and both X1 and X2? If we calculate this multiple correlation, we’ll find that it is about 0.8.

 And, now that we are at it, why not also add two more predictors: Quality of the teaching (X3), and the student’s emotional quotient (X4)? If we go through the exercise, we’ll find that the multiple correlation keeps increasing as we keep adding more and more predictors.

 However, there’s a price to pay for this greed. If three predictor variables yield a multiple correlation of 0.92, and the next predictor variable makes it 0.93, is it really worth it? Remember too that with every new variable we also increase the computational complexity and errors.

 And there’s another – even more troubling – question. Some of the predictor variables could be strongly correlated among themselves (this is the problem of multicollinearity). Then the extra variables might actually bring in more noise than value!

 How, then, do we decide what’s the optimal number of predictor variables? We use an elegant construct called the adjusted multiple correlation. As we keep adding more and more predictor variables to the pot (we add the most correlated predictor first, then the second most correlated predictor and so on …), we reach a point where the addition of the next predictor diminishes the adjusted multiple correlation even though the multiple correlation itself keeps rising. That’s the point to stop!

 Let us suppose that this approach determines that the optimal number of predictors is 3. Then the multiple regression line to predict Y will look like a1 X1 + a2 X2 + a3 X3 + a4, where a1, a2, a3, a4 are the coefficients based on the least-squares criterion.
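 The stopping rule can be illustrated with the adjusted R-squared commonly reported by statistics packages, which I am assuming here is what ‘adjusted multiple correlation’ refers to (R-squared is the square of the multiple correlation). The sample size and the R-squared values below are hypothetical, chosen only to show the adjusted figure peaking while the raw figure keeps rising.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: R-squared penalised for each extra predictor."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical R-squared values for 40 students as predictors are added
# one at a time (most correlated predictor first). Raw R-squared never falls.
n = 40
r2_by_count = {1: 0.25, 2: 0.64, 3: 0.846, 4: 0.848, 5: 0.849}

adjusted = {k: adjusted_r2(r2, n, k) for k, r2 in r2_by_count.items()}
for k in sorted(adjusted):
    print(f"{k} predictors: R2 = {r2_by_count[k]:.3f}, adjusted R2 = {adjusted[k]:.3f}")

# Stop at the number of predictors where the adjusted value peaks.
best_k = max(adjusted, key=adjusted.get)
print("optimal number of predictors:", best_k)
```

 The penalty factor (n - 1) / (n - k - 1) grows with every added predictor, so a new variable has to earn its place with a real gain in fit.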

 Predictions using multiple regression are getting more and more reliable because there’s so much more data these days to validate them against. There is this (possibly apocryphal) story of a father suing a supermarket because his teenage daughter was being bombarded with mailers to buy new baby kits. “My daughter isn’t pregnant”, the father kept explaining. “Our multiple regression model indicates a very high probability that she is”, the supermarket insisted. And she was …

 As we dive deeper into multivariate statistics we’ll find that this is the real playing arena for data science; indeed, when I look at the contents of a machine learning course today, I can’t help feeling that it is multivariate statistics re-emerging in a new disguise. As the French writer Jean-Baptiste Alphonse Karr remarked long ago: plus ça change, plus c’est la même chose!

What NOT to say


Teaching chatbots to speak ‘properly’ and ‘decently’

Many of us would have heard about Microsoft’s Tay.ai chatbot, which was released in 2016 and pulled back within 24 hours because of the abusive language it had learnt. It took less than a day to corrupt an innocent AI chatbot. What went wrong? Tay.ai’s learning module was excellent, which ironically was the problem: it rapidly learnt swear words, hate language etc. from the large number of people who used abusive language in their conversations with it. Unlike people, many of whom have internal filters, Tay.ai learnt from these signals and started using the same phrases and hate language, which forced Microsoft to withdraw it from public use.

I have been observing how my son and daughter-in-law are teaching my 3-year-old granddaughter about the use of good language.  Basic things like saying ‘Please’, ‘Thank You’, ‘Good morning’, ‘Good night’, etc. In other words, decent and desirable language was taught first.  They have also given strict instructions to us (grandparents) and extended family about what to say, and what not to say, in front of the kid. The child will still hear some ‘bad words’ in school, malls, playgrounds etc. This is beyond the parents’ control. In these cases, they teach the child that only a very few bad people use ‘bad’ language and that good people never use these words, thus starting to lay down the internal filters in my granddaughter’s mind.

We should apply the same principle to these innocent but fast-learning chatbots.  Let us ‘teach’ the chatbot all the ‘good’ phrases like ‘Please’, ‘Thank you’ etc. Let us also ‘teach’ the chatbot about showing empathy, such as saying ‘Sorry that your product is not working.  We will do everything possible to fix it’, ‘Sorry to ask you to repeat as I did not understand your question’, and so on.

Finally, let us create a negative list of ‘bad’ phrases, and hate language in all possible variations.  English in the UK will have British, Scottish, and Irish variations.  Some phrases which are considered acceptable in one area may be objectionable in another. Same for Australia, New Zealand, India, New York Northern English, Southern USA English, etc.  Let us build internal filters in these chatbots to ignore or unlearn these phrases in the learning process.  By looking at the IP address of the user, the bot can identify the geographical location and apply the right language filters.
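To make the idea concrete, here is a minimal sketch of a region-aware negative list applied before the bot learns from a message. Everything in it is hypothetical: the region codes, the (deliberately mild) example words and the function names are mine, and the step that maps an IP address to a region is assumed to happen elsewhere.

```python
import re

# Hypothetical negative lists: shared defaults plus regional additions.
# Real lists would be much larger and curated by people who know each dialect.
NEGATIVE_LIST = {
    "default": {"idiot", "stupid"},
    "en-GB": {"git"},
    "en-AU": {"drongo"},
}

def blocked_terms(region):
    """Terms blocked for a region: the defaults plus that region's own list."""
    return NEGATIVE_LIST["default"] | NEGATIVE_LIST.get(region, set())

def is_acceptable(message, region):
    """True if no word in the message is on the region's negative list."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    return not (words & blocked_terms(region))

def filter_training_messages(messages, region):
    """Keep only the messages the bot is allowed to learn from."""
    return [m for m in messages if is_acceptable(m, region)]

incoming = ["Thank you for the help", "You silly git", "Please fix my order"]
print(filter_training_messages(incoming, "en-GB"))
```

A word-level match like this is only the simplest possible filter; it does show how the same word can be blocked in one region and allowed in another.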

Will this work?  As good parents we have been doing this to teach our kids and grandkids from time immemorial.  Mostly this is working; very few kids grow to become users of hate language.

Will it slow down the machine learning process?  Perhaps a little bit, but this is a price worth paying, compared to having a chatbot use foul language and upset your valuable customers.

You may be wondering if this simple approach is supported by any AI research or whether this is just a grandfather’s tale! There is lots of research in this area that supports my approach.

There are many references to articles on the ‘Seldonian Algorithm’ for AI Ethics. I want to refer to an article titled ‘Developing safer machine learning algorithms at UMass Amherst’.  The authors recommend that the burden of ensuring that ML systems are well-behaved should lie with the ML designer and not with the end user, and they suggest a 3-step Seldonian algorithm. Let us look at this.

Step one is to provide an Interface specified by the user to define undesirable or bad behaviour.  The ML algorithm will use the interface and try as much as possible to avoid these undesirable behaviours.

Step two is to use High-Probability Constraints: Seldonian algorithms guarantee with high probability that they will not cause the undesirable behaviour that the user specified via the interface.

Step three in the algorithm is No Solution Found: Seldonian algorithms must have the ability to say No Solution Found (NSF) to indicate that they were unable to achieve what they were asked.
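The three steps can be sketched together in a few lines. This is my own simplified illustration and not the algorithm from the article: I assume the interface reduces to a list of 0/1 flags (1 meaning the candidate chatbot used a phrase from the negative list in a held-out conversation), and I use Hoeffding’s inequality as one simple way to obtain a high-probability bound. The names, the 5% threshold and the confidence level are all assumptions.

```python
from math import log, sqrt

NO_SOLUTION_FOUND = "No Solution Found"

def hoeffding_upper_bound(samples, delta):
    """Upper bound on the true mean of samples lying in [0, 1], valid with
    probability at least 1 - delta (Hoeffding's inequality)."""
    n = len(samples)
    return sum(samples) / n + sqrt(log(1 / delta) / (2 * n))

def safety_test(candidate, violations, max_rate=0.05, delta=0.05):
    """Return the candidate only if its rate of undesirable behaviour is
    below max_rate with confidence 1 - delta; otherwise report NSF.

    violations: 0/1 flags from held-out conversations, where 1 means the
    candidate produced a phrase the user marked as undesirable.
    """
    if hoeffding_upper_bound(violations, delta) <= max_rate:
        return candidate
    return NO_SOLUTION_FOUND

# Plenty of clean evidence: the candidate passes.
print(safety_test("bot-v2", [0] * 2000))
# Same clean record but too few conversations to be confident: NSF.
print(safety_test("bot-v2", [0] * 100))
```

Note the second call: even with no violations observed, the test refuses to certify the bot on thin evidence, which is the point of having ‘No Solution Found’ as a legitimate answer.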

 

Let us consider two examples involving human life to illustrate the Interface definitions. Example one is a robot that controls a robotic assembly line. The robot senses that a welding operation has gone out of sync and is causing all welded cars to be defective. The robot controller wants to issue the instruction to immediately stop the assembly line and get the welding station fixed. However, the user knows that an abrupt stoppage of the assembly line may cause harm to some factory workers who may be at another station on the line.  This undesirable decision to immediately stop the assembly line needs to be defined in the interface, as harm to humans matters far more than the material loss from defective cars.

Example two is an autonomous truck carrying cargo on a hilly road with a cliff on the driving side.  A human driver is coming fast in the wrong lane (the human’s fault) and approaching the truck for a certain head-on collision. The only desirable outcome is for the truck to fall off the cliff and destroy itself with the cargo, rather than to weigh up various other ‘optimal’ decisions which may have some probability of hitting the car and harming the human.

In our chatbot good-behavior problem, the undesirable behaviors are usage of the phrases in the ‘Negative List’ for each geographical variation.  The interface will have this list and the logic to identify geographical variations.

I am in discussion with some sponsors for a research project to develop an English-language chatbot etiquette engine.  Initial reactions from the various stakeholders are positive – everyone agrees on the need for an etiquette engine as well as my approach. 

I will be delighted to receive critique and comments from all of you. 

As a closing note, I wanted to tell you that Natural Language Processing (NLP) is taking huge strides.  “NLP is eating the ML” is the talk of the town.  NLP research supported by Large Language Models, Transformers etc. is moving way ahead. Investment is going into Q&A, Language Generation, Knowledge Management, and Unsupervised/Reinforcement Learning.

In addition to desirable behavior, many other ethical issues need to be incorporated. For example:

·        Transparency: Does everyone know broadly how learning is done and how decisions are taken?

·        Explainability:  For every individual decision, if requested, can we explain how the decision was taken?

Also, a lot of current AI/ML algorithms, especially those based on neural networks, have become black boxes. We expect a shift towards simpler algorithms for enterprise usage.

 

Relevance of Statistics In the New Data Science World


Rajeeva L Karandikar

Chennai Mathematical Institute, India 

Abstract 

With Big Data and Data Science becoming buzzwords, various people are wondering about the relevance of statistics versus pure data driven models.

In this article, I will explain my view that several statistical ideas are as relevant now as they have been in the past.  

 

1 Introduction

For over a decade now, Big Data, Analytics, Data-Science have become buzzwords. As is the trend now, we will just refer to any combination of these three as data-science. Many professionals working in the IT sector have moved to positions in data science and they have picked up new tools. Often, these tools are used as black boxes.  This is not surprising because most of them have little if any background in statistics. 

We can often hear them make comments such as, “With a large amount of data available, who needs statistics and statisticians? We can process the data with various available tools and pick the tool that best serves our purpose.”

We hear many stories of wonderful outcomes coming from what can be termed pure data-driven approaches. This has led to a tendency of simply taking a large chunk of available data and pushing it through an AIML engine to derive ‘intelligence’ out of it, without giving a thought to where the data came from, how it was collected and what connection the data has with the questions that we are seeking answers to. If an analyst were to ask questions about the data (How was it collected? When was it collected?), the answer one frequently hears is: “How does it matter?”

 Later in this article, we will see that it does matter. We will also see that there are situations where blind use of the tools with data may lead to poor conclusions.

As more and more data become available in various contexts, our ability to draw meaningful actionable intelligence will grow enormously. The best way forward is to marry statistical insights to ideas in AIML, and then use the vast computing power available at one’s fingertips. For this to happen, statisticians and AIML experts must work together along with domain experts.

 

Through some examples, we will illustrate how ignoring statistical ideas and thought processes that have evolved over the last 150 years can lead to incorrect conclusions in many critical situations. 

2 Small data is still relevant

First let us note that there is a class of problems where all the statistical theory and methodology developed over the last 150 years continues to have a role, since the data is only in hundreds or at most thousands and never in millions. For example, issues related to quality control, quality measurement, quality assurance etc. only require a few hundred data points from which to draw valid conclusions. Finance is another area where the use of data has become increasingly common: the term VaR (value-at-risk), which is essentially a statistical term (the 95th or 99th percentile of the potential loss), has entered the law books of several countries. Here too we work with a relatively small number of data points. There are roughly 250 trading days in a year and there is no point going beyond 3 or 5 years in the past as economic ground realities are constantly changing. Thus, we may have only about 1250 data points of daily closing prices to use for, say, portfolio optimisation or option pricing, or for risk management. One can use hourly prices (with 10,000 data points), or even tick-by-tick trading data, but for portfolio optimisation and risk management, the common practice is to use daily prices. In election forecasting, psephologists usually work with just a few thousand data points from an opinion poll to predict election outcomes. Finally, policy makers, who keep tabs on various socio-economic parameters in a nation, rely on survey data which of course is not in millions.
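As an illustration of VaR as a percentile, here is a minimal sketch of the so-called historical method applied to about five years of daily returns. The returns are simulated, not market data, and the nearest-rank percentile rule is just one of several conventions; the function name is an assumption.

```python
import math
import random

def historical_var(returns, confidence=0.95):
    """Historical value-at-risk: the loss that was not exceeded on the given
    fraction of past days, read off as an empirical percentile of losses."""
    losses = sorted(-r for r in returns)   # a loss is a negative return
    # Nearest-rank percentile: the smallest loss such that at least
    # `confidence` of the observations are at or below it.
    rank = max(1, math.ceil(confidence * len(losses)))
    return losses[rank - 1]

# Simulated daily returns for roughly 5 years (1250 trading days).
random.seed(0)
daily_returns = [random.gauss(0.0005, 0.01) for _ in range(1250)]

var_95 = historical_var(daily_returns, 0.95)
var_99 = historical_var(daily_returns, 0.99)
print(f"95% one-day VaR: {var_95:.2%} of portfolio value")
print(f"99% one-day VaR: {var_99:.2%} of portfolio value")
```

With only 1250 observations, the 99% figure rests on about a dozen of the worst days, which is why small-sample statistical thinking matters here.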

One of the biggest problems faced by humanity in recent times is the COVID-19 virus. From March 2020 till the year end, everyone was waiting for the vaccines against COVID-19. Finally in December 2020, the first vaccine was approved and more have followed. Let us recall that the approval of vaccines is based on Randomised Controlled Trials (RCTs), which involve a few thousand observations, along with concepts developed in the statistical literature under the theme Design of Experiments. Indeed, most drugs and vaccines are identified, tested and approved using these techniques.

These examples illustrate that there are several problems where we need to arrive at a decision or reach a conclusion where we do not have millions of data points. We must do our best with a few hundred or few thousand data points. So statistical techniques of working with small data will always remain relevant. 

3 Perils of purely data driven inference

This example goes back nearly 150 years. Sir Francis Galton was a cousin of Charles Darwin, and as a follow up to Darwin’s ideas of evolution, Galton was studying inheritance of genetic traits from one generation to the next. His focus was on how intelligence is passed from one generation to the next.  Studying inheritance, Galton wrote “It is some years since I made an extensive series of experiments on the produce of seeds of different size but of the same species. It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre; smaller than the parents, if the parents were large; larger than the parents, if the parents were small.” Galton firmly believed that this phenomenon would hold for humans as well, and for all traits that are passed on genetically, including intelligence.

 

To illustrate his point, Galton obtained data on heights of parents and their (grown-up) offspring. He chose height as it was easy to obtain data on it. His analysis of the data confirmed his hypothesis, quoted above. He further argued that this phenomenon would continue over generations, and its effect would mean that heights of future offspring will continue to move towards the average height. He argued that the same will happen to intelligence and thus everyone will only have average intelligence. He chose the title of the paper as Regression Towards Mediocrity in Hereditary Stature.

 

The conclusion drawn by Galton is fallacious. This can be seen by analysing the same data with the roles of the offspring’s height and the parents’ mid-height interchanged, which leads to exactly the opposite conclusion: if the offspring is taller than average, then the average height of the parents will be less than that of the offspring, while if the offspring is shorter than average, then the average height of the parents will be more than that of the offspring. Moreover, the variation in heights (variance of heights) in the two generations was comparable, whereas if there were regression towards the mean over generations, the variance would have decreased. Thus, Galton’s conclusion about regression to mediocrity over generations is not correct. However, the methodology that he developed for the analysis of inheritance of heights has become a standard tool in statistics and continues to be called Regression.
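The argument is easy to check on simulated data. The sketch below does not use Galton’s measurements: it assumes that parent and child heights share a common component and have the same spread by construction, and the helper names are mine.

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def slope(xs, ys):
    """Least-squares slope of the line predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Simulated height deviations from the average (cm). Parent and child share
# a common component; each also has an independent part of the same size.
n = 5000
shared = [random.gauss(0, 5) for _ in range(n)]
parents = [s + random.gauss(0, 5) for s in shared]
children = [s + random.gauss(0, 5) for s in shared]

child_on_parent = slope(parents, children)    # below 1: 'regression' forwards
parent_on_child = slope(children, parents)    # below 1 as well: and backwards
variance_ratio = variance(children) / variance(parents)   # yet spread is unchanged

print(f"slope, child on parent: {child_on_parent:.2f}")
print(f"slope, parent on child: {parent_on_child:.2f}")
print(f"variance ratio (children / parents): {variance_ratio:.2f}")
```

Both slopes come out well below 1, so tall parents have less extreme children and tall children have less extreme parents, while the spread of heights stays the same from one generation to the next.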

Galton was so convinced of his theory that he just looked at the data from one angle and got confirmation of his belief. This phenomenon is called Confirmation Bias, a term coined by English psychologist Peter Wason in the 1960s.

4 Is the data representative of the population?

Given data, even if it is huge, one must first ask how it was collected. Only then can one begin to determine if the data is representative of the population.

In India, many TV news channels take a single view, either supporting the government or opposing it.  Let us assume News Channel 1 and News Channel 2 both run a poll on their websites, at the same time, on a policy announced by the government. Even if both sites attract a large number of responses, it is very likely that the conclusions will be diametrically opposite, since the people who frequent each site will likely be ones with a political inclination aligned with the website.  This underscores the point that just having a large set of data is not enough; it must truly represent the population in question for the inference to be valid.

If someone gives a large chunk of data on voter preferences to an analyst and wants her to analyse it and predict the outcome of the next elections, she must start by asking how the data was collected; only then can she decide if it represents the Indian electorate or not. For example, the data may come from social media posts and messages on political questions during the previous few weeks. However, less educated, rural, and economically weaker sections are highly underrepresented on social media, and thus conclusions drawn from the opinions of such a group (of social media users) will not give insight into how the Indian electorate will vote. On the other hand, the same social media data can be used to quickly assess the market potential of a high-end smartphone, for its target market is precisely those who are active on social media.

  5 Perils of blind use of tools without understanding them

The next example is not one incident but a recurrent theme: that of trying to evaluate the efficacy of an entrance test for admission, such as IIT-JEE for admission to IITs or CAT for admission to IIMs or SAT or GRE for admission to top universities in the USA. Let us call such tests benchmark tests; they are open to all candidates, and those who perform very well in the benchmark test are shortlisted for admission to the targeted program. The analysis consists of computing the correlation between the score on the benchmark test and the performance of the candidate in the program. Often it is found that the correlation is rather poor, and this leads to discussion on the quality of the benchmark test.  What is forgotten or ignored is that the performance data is available only for the candidates selected for admission. This phenomenon is known as Selection Bias, where the data set consists of only a subset of the whole group under consideration, selected based on some criterion.

This study also illustrates the phenomenon known as Absence of Tail Dependence in the joint normal distribution. Unfortunately, this property is inherited by many statistical models used for risk management and is considered one of the reasons for the collapse of global financial markets in 2008.
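The effect of looking only at admitted candidates can be seen in a small simulation. The model below is an assumption made for illustration: test score and later performance are both ‘ability plus independent noise’, and only the top 5% of test scorers are admitted.

```python
import random
from math import sqrt

random.seed(2)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical candidates: both scores depend on a common 'ability'.
n = 20000
ability = [random.gauss(0, 1) for _ in range(n)]
test_score = [a + random.gauss(0, 1) for a in ability]
performance = [a + random.gauss(0, 1) for a in ability]

# Admit only the top 5% by benchmark test score.
cutoff = sorted(test_score)[int(0.95 * n)]
admitted = [(t, p) for t, p in zip(test_score, performance) if t >= cutoff]

r_all = pearson(test_score, performance)
r_admitted = pearson([t for t, _ in admitted], [p for _, p in admitted])

print(f"correlation among all candidates:      {r_all:.2f}")
print(f"correlation among admitted candidates: {r_admitted:.2f}")
```

The test is equally informative in both rows; the correlation among the admitted is lower only because they all sit in a narrow band of very high scores.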

Similar bias occurs in studies related to health, where for reasons beyond the control of the team undertaking the study, some patients are no longer available for observation. The bias it introduces is called Censoring Bias and how to account for it in analysis is a major theme in an area known as Survival Analysis in statistics. 

6 Correlation does not imply causation

Most data-driven analysis can be summarised as trying to discover relationships among different variables, and this is what correlation and regression are all about. These were introduced by Galton about 150 years ago and have been a source of intense debate. One of the myths in pure data analysis is the assumption that correlation implies causation. This need not be true in all cases. Moreover, correlation captures only linear relationships, and one needs to use transformations to get to more complex relationships.

One example often cited is where X is the sale of ice-cream in a coastal town in Europe and Y is the number of deaths due to drowning (while swimming in the sea, in that town) in the same month. One sees strong correlation! While there is no reason as to why eating more ice-creams would lead to more deaths due to drowning, one can see that they are strongly correlated to a variable Z = average of the daily maximum temperature during the month; in summer months more people eat ice-cream, and more people go to swim! In such instances, the variable Z is called a Confounding Variable.
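A stylised simulation shows how a confounder produces this pattern. All numbers are invented: temperature drives both quantities, and neither influences the other. The partial correlation used below is the standard formula for the correlation between X and Y after removing the linear effect of Z; the quantities are treated as continuous for simplicity, so simulated values can occasionally dip below zero.

```python
import random
from math import sqrt

random.seed(3)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical monthly records: Z drives both X and Y; X never affects Y.
n = 1200
temperature = [random.uniform(5, 35) for _ in range(n)]                # Z
ice_cream = [20 * t + random.gauss(0, 100) for t in temperature]       # X
drownings = [0.3 * t + random.gauss(0, 2) for t in temperature]        # Y

r_xy = pearson(ice_cream, drownings)
r_xz = pearson(ice_cream, temperature)
r_yz = pearson(drownings, temperature)
r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)

print(f"correlation of X and Y:            {r_xy:+.2f}")
print(f"correlation of X and Y, given Z:   {r_xy_given_z:+.2f}")
```

Once temperature is accounted for, the association between ice-cream sales and drownings all but disappears.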

In today’s world, countrywide data would be available for a large number of socio-economic variables, variables related to health, nutrition, hygiene, pollution, economic variables and so on – one can, say, list about 500 variables where data on over 100 countries is available. One is likely to observe correlations among several pairs of these 500 variables – one such recent observation is: Gross Domestic Product (GDP) of a country and number of deaths per million population due to COVID-19 are strongly correlated! Of course, there is no reason why richer or more developed countries should have more deaths.

Just as linear relationships may be spurious, the relationships discovered by AIML algorithms may also be so. Hence learning from the statistical literature going back a century is needed to weed out spurious conclusions and find the right relationships for business intelligence.

 

7 Simpson’s paradox and the omitted variable bias

Simpson’s Paradox is an effect wherein ignoring an important variable may reverse the conclusion. One of the examples of Simpson’s paradox is a study of gender bias in graduate school admissions at the University of California, Berkeley. In 1973, it was alleged that there was a gender bias in graduate school admissions: the acceptance ratio among males was 44% while among females it was 35%. When the statisticians at Berkeley tried to identify which department was responsible for this, they looked at department-wise acceptance ratios and found that, if anything, there was a bias against the males… The apparent bias in the pooled data appeared because a lot more women applied to departments which had lower acceptance rates. The variable department in this example is called a confounding factor. In Economics literature, the same phenomenon is also called Omitted Variable Bias.
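A two-department toy example makes the reversal concrete. The numbers below are invented for illustration and are not the Berkeley figures.

```python
# Hypothetical applications: department -> group -> (admitted, applied).
applications = {
    "A": {"men": (80, 100), "women": (18, 20)},    # easy department
    "B": {"men": (4, 20),   "women": (30, 100)},   # hard department
}

def rate(admitted, applied):
    return admitted / applied

def pooled_rate(group):
    """Acceptance rate for a group when departments are lumped together."""
    admitted = sum(dept[group][0] for dept in applications.values())
    applied = sum(dept[group][1] for dept in applications.values())
    return admitted / applied

for name, dept in applications.items():
    print(f"Dept {name}: men {rate(*dept['men']):.0%}, women {rate(*dept['women']):.0%}")
print(f"Pooled: men {pooled_rate('men'):.0%}, women {pooled_rate('women'):.0%}")
```

Women have the higher acceptance rate in each department, yet the lower rate overall, because most of them applied to the department that admits few applicants of either gender.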

8 Selection Bias and World War II

During World War II, based on simple analysis of the data obtained from damaged planes returning to the base after bombing raids, it was proposed to the British air force that armour be added to those areas that showed the most damage. Professor Abraham Wald of Columbia University, a member of the Statistical Research Group (SRG), was asked to review the findings and recommend how much extra armour should be added to the vulnerable parts.

Wald looked at the problem from a different angle. He realised that there was a selection bias in the data that was presented to him: only the aircraft that did not crash returned to the base and made it into the data. Wald assumed that the probability of being hit in any given part of the plane was proportional to its area (since the shooters could not aim at any specific part of the plane). Also, given that there was no redundancy in aircraft at that time, the effect of hits on a given area of the aircraft was independent of the effect of hits in any other area. Once he made these two assumptions, the conclusion was obvious: armour should be added in the parts where fewer hits had been observed. So, statistical thinking led Wald to the model that gave the right frame of reference, connecting the data (hits on planes that returned) with the desired conclusion (where to add armour).
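Wald’s two assumptions translate directly into a simulation. The section areas and the chance that a single hit brings the plane down are invented numbers, and three hits per mission is an arbitrary choice; only the structure follows the argument above.

```python
import random
from collections import Counter

random.seed(4)

# Hypothetical sections: (share of plane area, chance one hit there is fatal).
SECTIONS = {
    "engine":   (0.2, 0.60),
    "fuselage": (0.5, 0.05),
    "wings":    (0.3, 0.10),
}
NAMES = list(SECTIONS)
AREAS = [SECTIONS[s][0] for s in NAMES]

def fly_mission(hits=3):
    """Return (sections hit, whether the plane made it back)."""
    # Assumption 1: a hit lands on a section with probability proportional to its area.
    hit = random.choices(NAMES, weights=AREAS, k=hits)
    # Assumption 2: hits act independently; the plane returns only if it survives each.
    survived = all(random.random() >= SECTIONS[s][1] for s in hit)
    return hit, survived

observed = Counter()
for _ in range(10000):
    hit, survived = fly_mission()
    if survived:                 # only returning planes make it into the data
        observed.update(hit)

# Hits per unit of area seen on the planes that returned.
density = {s: observed[s] / SECTIONS[s][0] for s in NAMES}
armour_here = min(density, key=density.get)

for s in NAMES:
    print(f"{s:9s} hits per unit area on returning planes: {density[s]:.0f}")
print("add armour to:", armour_here)
```

The section with the fewest observed hits per unit area is the one where hits were most often fatal, which is exactly where the armour belongs.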

 9 GMGO (Garbage-Model-Garbage-Out), the new GIGO in the Data Science world

The phrase Garbage-In-Garbage-Out (GIGO) is often used to describe the fact that even with the best of algorithms, if the input data is garbage, then the conclusion (output) is also likely to be garbage. Our discussion adds a new phenomenon called GMGO i.e., a garbage model will lead to garbage output even with accurate data! 

 10 Conclusion

We have given examples where disregarding statistical understanding digested over 150 years can lead to wrong conclusions. While in many situations pure data-driven techniques can do OK, combining them with domain knowledge and statistical techniques can do wonders in terms of unearthing valuable business intelligence to improve business performance.

We recommend data-driven AI/ML models as a good starting point, or a good exploratory step. In addition, using domain knowledge and statistical thinking to avoid the various pitfalls discussed in this paper can take the analysis to a much higher level.

AI Ethics Self Governance

AI Ethics:  Self-governed by Corporations and Employees

L Ravichandran, Founder – AIThoughts.Org

As more self-learning AI software and products are being used in factories, retail stores, enterprises and self-driving cars on our roads, the age-old philosophical area of Ethics has become an important current-day issue.

Who will ensure that ethics is a critical component of AI projects right from conceptualization?  Nowadays, ESG (environmental, social, and corporate governance) and sustainability considerations have become business priorities at all corporations; how do we make AIEthics a similar priority? The Board, CEO, CXOs and all employees must understand the impact of this issue and ensure compliance. In this blog, I am suggesting one of the things corporations can do in this regard.

All of us have heard of the Hippocratic Oath taken by medical doctors, affirming their professional obligations to do no harm to human beings. Another ethical oath is called the Iron Ring Oath, taken by Canadian Engineers, along with the wearing of iron rings, since 1922. There is a myth that the initial batch of iron rings was made from the beams of the first Quebec Bridge that collapsed during construction in 1907 due to poor planning and engineering design. The iron ring oath affirms engineers’ responsibility to good workmanship and NO compromise in their work regarding good design and good material, regardless of external pressures.

 

When it comes to AI & Ethics, the ethical questions become more complex. Much more complex.

 

If a self-driven car hits a human being, who is responsible? The car company, the AI product company or the AI designer/developers? Or the AI car itself?

 

Who is responsible if an AI Interviewing system is biased and selects only one set of people (based on gender, race, etc.)?

 

Who is responsible if an Industrial Robot shuts off an assembly line when sensing a fault but kills a worker in the process?  

 

Ironically, much of the literature on this topic refers to, and even suggests the use of, Isaac Asimov’s Laws of Robotics from his 1942 science fiction short story “Runaround”.

The Three Laws are:

1.    A robot may not injure a human being or, through inaction, allow a human being to come to harm.

2.    A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

3.    A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

 

In June 2016, Satya Nadella, CEO of Microsoft Corporation, in an interview with Slate magazine, talked about the following guidelines for Microsoft AI designers.

1.    “A.I. must be designed to assist humanity”, meaning that human autonomy needs to be respected.

2.    “A.I. must be transparent”, meaning that humans should know and be able to understand how it works.

3.    “A.I. must maximize efficiencies without destroying the dignity of people”.

4.    “A.I. must be designed for intelligent privacy”, meaning that it earns trust by guarding people’s information.

5.    “A.I. must have algorithmic accountability so that humans can undo unintended harm”.

6.    “A.I. must guard against bias”, so that it does not discriminate against people.

 

Lots of research is underway to address this topic. Philosophers, lawyers, government bodies and IT professionals are jointly working on defining the problem in granular detail and coming out with solutions.

I recommend the following:

 

1.    All corporate stakeholders (user corporations and tech firms) should publish an AIEthics Manifesto and report compliance to the Board on a quarterly basis. This manifesto should commit them to meeting all in-country AIEthics policies where available, and to following a minimum set of safeguards where countries are yet to publish their policies. This will ensure that the CEO and CXOs have an AIEthics item in their KPIs/BSCs and that AIEthics awareness spreads throughout the company.

 

2.    Individual developers and end-users can take an oath or pledge stating that ‘I will, to the best of my ability, develop or use only products which are ethical and protect human dignity and privacy’.

 

 

3.    The whistleblower policy should be extended to AIEthics compliance issues, to encourage employees to report issues without fear.