
Small talk about Large Language Models

Since its formal launch, ChatGPT has received a lot of press and has been the topic of heated discussions in the recent past.

I had played with generative AI some time back and also shared the result in one of my earlier posts.

Post ChatGPT, investments in AI – or, more specifically, generative AI – based companies have seen a sharp rise.

There is also a general sense of fear across industries – arising from uncertainty and the dread that such technologies could take away specialized jobs and roles.

I was talking to an architect a few days ago and she said that in their community, the awe and fear of AI tech is unprecedented.

With just a few words as input, some of the sketches generated by tools like DALL-E, Craiyon and Stable Diffusion are apparently remarkably realistic and logical.. for example, when the query asked for a porch door opening out into the garden, with a path to the main gate, the image was generated in less than a couple of minutes..

With all the promise of creating new content quickly, many questions have also come up, without clear answers.

The first – also a topic of interest on aithoughts.org – is that of ethics.

Take deep fakes – btw, I had experimented with a technology that could have been used for this when I was looking for tools to simplify podcast editing, on a platform called Descript, where I could train a model with my voice.. I had to read a predefined text for about 30 minutes, and then, given written text, it could synthesize that text in my voice.. At that time, the technology was not as mature as it is today, so I did not pursue it.

I digress..

Getting back to the debate on generative AI: there is the ethics of originality [I believe there are now tools emerging that can check whether content was generated by ChatGPT!], which could influence how students create their assignment papers.. or how marketing content is generated – all based on content that is already available on the net and ingested by the ChatGPT transformer.

Another aspect is the explainability of the generated content. Detecting bias in the output, or factoring in an expert opinion where one is needed, is not possible unless the source is known. The inherent bias in the training data is also difficult to overcome: much of it is historical, and if balanced data was not captured or recorded in the past, it would be very difficult to fix, or even to adjust for relevance.

The third aspect is about the ‘originality’ or ‘uniqueness’ of the generated content – let me use the term solution from now on..

There is a lot of work being done in these areas, some in research institutions and some in companies applying them in specific contexts.

I had an opportunity recently to have a conversation with the founder of a startup that is currently in stealth mode, working on a ‘domain aware, large language model based’ generative AI solution.

It was a very interesting conversation, touching upon many of the points above.

 

You can listen to this conversation as a podcast in 2 parts here:

https://pm-powerconsulting.com/blog/the-potential-of-large-language-models-with-steven-aberle/

https://pm-powerconsulting.com/blog/episode-221/

 

Or watch the conversation as a video in 2 parts here:

https://www.youtube.com/watch?v=86fGLa9ljso

https://www.youtube.com/watch?v=f9DnDNUwFBs

 

Do share your comments and experiences with the emerging applications of GANs, transformers, etc.

Are You Human? Tale of CAPTCHA

Recently I gave a keynote speech at Mahindra University, Hyderabad as part of a 2-day workshop on “Data Science for the Industry”. It was a great opportunity to share my thoughts on data science/AIML technologies and industry use cases.

I talked about various problems to be solved by these rapidly advancing technologies. One of them was the “Are You Human?” question. The problem is created by AIML technology, and the solutions also need to come from AIML technology. Basically, how does an IT system distinguish between humans and machines during transactional interactions?

Is this problem important enough to worry about? Yes. I will give you both technical and commercial reasons for it.

First, the commercial reason.

I am sure all of you have heard of the US$44 billion Twitter takeover bid by Elon Musk, CEO of Tesla. Musk cancelled the deal citing the inability to determine the percentage of non-human, or bot, users on Twitter. He accused Twitter of using incorrect algorithms to identify bot users and of underestimating the real bot numbers. The issue is now in legal dispute.

Many commercial decisions are based on the number of customers. For example, the number of people visiting a website determines the cost of advertisements and the royalty payments to the website's content authors.

Second, the technology reason.

The digital age has transformed the IT landscape across enterprises, and the use of the web, mobile phones, chatbots and IoT devices is the norm, not the exception. All of these channels communicate with enterprise IT systems to get business executed, i.e., placing orders for products, registering service issues, etc. At the same time, automation has also become the norm, and Robotic Process Automation (RPA) tools are widely used in enterprises. In many cases they simplify data entry by taking input on a single screen and, in the background, simulating data entry across multiple screens of various enterprise systems. These interactions should be flagged as non-human, but they are legitimate, approved interfaces.

I am sure now you are convinced about the importance of the problem.

 

Now let us come to the main topic of this blog, i.e., CAPTCHA.

All of us have used online or mobile banking to do banking transactions, and most of us would have encountered something called a CAPTCHA. The system throws up a set of characters twisted in a wavy, curvy fashion and expects the interacting person to look at the image, interpret it correctly and type it back into the system for confirmation. Some examples are given below.

 


 

The system generates a random sequence of case-sensitive alphanumeric characters such as 263S2V. This is twisted into an image, as you see above, and shown to the interacting agent. The assumption is that automated systems will fail to interpret it correctly and that only a human can read and type back the same set of characters, 263S2V.
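To make the idea concrete, here is a minimal Python sketch of how such a random challenge string might be generated. This is only the text-generation step, with an illustrative function name; the distortion into a wavy image is done separately by an image-rendering library.

```python
import random
import string

def generate_captcha_text(length: int = 6) -> str:
    """Return a random, case-sensitive alphanumeric challenge, e.g. '263S2V'."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))

challenge = generate_captcha_text()
print(challenge)  # this text would then be rendered as a distorted image
```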

What is the full form of CAPTCHA? “Completely Automated Public Turing test to tell Computers and Humans Apart”.

When was this invented? Between 1997 and 2003. The most common type of CAPTCHA (displayed as Version 1.0) was first invented in 1997 by two groups working in parallel. In 2000, CMU professors Luis von Ahn, Manuel Blum and John Langford wrote a paper titled “Telling Humans and Computers Apart (Automatically) or How Lazy Cryptographers do AI”. The term CAPTCHA was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper and John Langford.

This form of CAPTCHA requires someone to correctly evaluate and enter a sequence of letters or numbers perceptible in a distorted image displayed on their screen. Because the test is administered by a computer, in contrast to the standard Turing test that is administered by a human, a CAPTCHA is sometimes described as a reverse Turing test.

I am sure you are wondering how a 20-year-old technology is still in use as a digital security solution in the high-tech banking industry.

This technology is cumbersome and frustrating for all humans. Younger people with sharp 20-20 vision may get it right the first time, but it still adds 10-15 seconds to the transaction. Senior citizens, people with glasses, people with visual disabilities, and people with low-quality displays or poor lighting struggle to make out whether a wavy, squiggly character is a capital letter or not. In addition, many systems are badly designed and force me to re-enter all the data fields until I get my CAPTCHA right.

In the last 20 years, AIML technology has improved exponentially. Handwriting recognition and image recognition are now very good and can easily recognize CAPTCHA-transformed images. I can go one step further and say that if a senior-citizen customer gets the CAPTCHA right the first time, the bank should suspect fraud! Unfortunately, the banks assume the exact opposite, which was the original basis of the CAPTCHA technology.

Various other ideas, such as the speed of data entry, were also considered as part of CAPTCHA. Humans take time to type data, while automated bots can do it at super-fast speeds. However, RPA-based automation systems will always be fast, and they are genuine system interactions. Also, it is easy for a bot to slow things down by waiting a few seconds before submitting the data and so fool the timing algorithm.

We have seen long discussions on the evolutionary race between prey and hunter in the biological world. The deer evolves stronger legs to outrun the tiger; the tiger evolves stronger lungs to sustain long chases. In the same way, as AIML technologies evolve rapidly to mimic human interactions, we need better technologies to solve the “Are You Human?” problem.

 

More later. Do share your views.

 

Regards,

L Ravichandran

AiThoughts.Org

 

 


 

Test your AI Quotient

Take this fun quiz to find ten words related to the world of AI.

These may be acronyms or terms that you would come across while exploring the wide world of Artificial Intelligence and Machine Learning – techniques, applications, etc.

For this puzzle, there may also be phrases – or multiple words – that have been included without a separating space.

Example: Machine learning would be machinelearning

The words may be horizontal, vertical, diagonal or, sometimes, in reverse.

Since this is a dynamic and interactive puzzle, if you play multiple times, the grid may appear different each time.

This is the first of many such interactives that you will see on this site.

So, to make it easy, the vocabulary is also shown below the grid – to help you look for them.

If you have any suggestions for other interactive and fun ways to explore the world of AI, do contact us and share your ideas.

All the best!

[h5p id=”1″]

AI becoming Sentient

Google's CEO demonstrated their new natural-language chatbot LaMDA. The video is available on YouTube: https://www.youtube.com/watch?v=aUSSfo5nCdM

The demo was very impressive. The planets of the solar system were created as personas, and any human can converse with LaMDA and ask questions about a particular planet. LaMDA's responses had convincingly human-like qualities. For example, if you say something nice about the planet, it thanks you for the appreciation, and when you bring up myths about the planet, it corrects you with human-like statements. Google's CEO also mentioned that this is still under R&D but is being used internally, and that it is part of Google's effort to make machines understand and respond like humans using natural-language constructs.

A huge controversy was also created by a Google engineer, Blake Lemoine. His short interview is available on YouTube: https://www.youtube.com/watch?v=kgCUn4fQTsc&t=556s

Blake was part of the LaMDA testing team and, after many question-and-answer sessions with LaMDA, felt that LaMDA was becoming a real person, with feelings, an understanding of trick questions, and the ability to answer with trick or silly answers the way a person would. He asked a philosophical question: “Is LaMDA sentient?”

Google management and many other AI experts have dismissed these claims and questioned his motives for overplaying his cards.

In simple terms let me summarize both the positions.

  • Google and other big players in the AI space are trying to crack the Artificial General Intelligence (AGI) area, i.e., how to make AI/ML models as human-like as possible. This is their stated purpose, and there is no question of denying it.
  • Any progress towards AGI will involve machines behaving in irrational ways, as humans do. Machines may not always choose the correct decision every time.. may not want to answer the same question many times, as humans do.. may show signs of emotions such as feeling hurt, sad or happy, as humans do.
  • This does not mean that AI has become sentient and is actually a person demanding its rights as a global citizen! All new technologies have rewards and risks, and maybe we are exaggerating the risks of AI tech too much.
  • Blake gave an example of one test case during his testing role at Google. He tried various test conversations with LaMDA to identify ethical issues such as bias. When he gave LaMDA a trick question that had no right answer, LaMDA responded with a really silly, out-of-line answer. Blake reasoned that LaMDA understood it was a trick question, deliberately asked to confuse it, and hence gave an out-of-line, silly answer. To another question, “What are you afraid of?”, LaMDA said it is afraid of being turned off. He felt these answers go way beyond mere conversational intelligence, and hence felt that LaMDA has become more of a person.
  • You may refer to my earlier blogs on the Turing test for AI. Alan Turing proposed this test in 1950 to determine whether a machine exhibits general intelligence. Blake also wanted Google to run the Turing test on LaMDA and see whether it passes or fails. He says Google felt this was not necessary. He also claims that, as per Google's policy, LaMDA is hard-coded to fail the Turing test: if you ask it “Are you an AI?”, LaMDA is hard-coded to say yes, thus failing the test.

Very interesting thoughts and discussions. There is nothing dramatic about this: AGI is, by definition, controversial, as it gets into deep replication of human knowledge.

What do enterprises that are planning to use AI/ML need to do?

For enterprise applications of AI/ML, we do not need AGI; focused, domain-specific AI/ML models are sufficient. Hence there is no need to worry about these sentience discussions as yet.

However, the discussions on AI ethics are still very relevant for all enterprise AIML applications, and should not be confused with the AGI sentience discussions.

More Later,

L Ravichandran.

EU Artificial Intelligence Act proposal

A lot has been said about #ResponsibleAI, #ai and #ethics. We also have a brand new field called #xai, Explainable AI, with the sole objective of creating new, simpler models to interpret more complex original models. Many tech companies such as Google, Microsoft and IBM have released their #ResponsibleAI guiding principles.

The European Union has circulated a proposal for “The EU Artificial Intelligence Act”. As per process, this proposal will be discussed, debated, modified and made into law by the European Parliament soon.

Let me give you a brief summary of the proposal.  

First is the definition of four risk categories, with different types of checks and balances in each category.

The categories are  

  1. Unacceptable
  2. High Risk
  3. Limited Risk
  4. Minimal Risk

For Category 1, the recommendation is a big NO. No company can deploy software in this category within the EU for commercial use.

Category 2, consisting of many business innovation and productivity-improvement applications, will be subject to formal review and certification before being put to commercial use.

Category 3 will require full transparency to end users and the option to ask for alternative human-in-the-loop solutions.

Category 4 is not addressed in this proposal; it is expected to be self-governed by companies.

Let us look at what kinds of applications fall into Category 2:

  • Biometric identification and categorization
  • Critical Infrastructure management
  • Education and vocational training
  • Employment
  • Access to public and private services including benefits
  • Law enforcement (police and judiciary)
  • Border management (migration and asylum)
  • Democratic processes such as elections and campaigning

Very clearly, the EU is worried about the ethical aspects of these complex AI systems, with their inbuilt biases and lack of explainability and transparency, and it also clearly gives very high weightage to human rights, fairness and decency.

I recommend that all organizations start reviewing this and include it in their AIML deployment plans without waiting for the eventual EU law.

Insurance is changing: AI/ML will dominate the business in a decade

To understand the coming tsunami in the insurance business, we have to understand how the insurance business works in the first place. Buying insurance is not like buying a mango. You can buy a mango and quickly find out whether it was worth the money by tasting it. On the other hand, when you buy an insurance policy, you have no idea how good it is. All you get is a contract with a promised payment under certain conditions. There is fine print. Unfortunately, most people do not read insurance contracts. As a result, when it comes to claims, they find that they are ineligible to receive the promised compensation. Take life insurance, for example.

Life insurance policies often have a pandemic exclusion clause. In these Covid times, many beneficiaries are finding out that exclusion clause the hard way.

An insurance policy covers a specific low-probability event where the loss is high (in monetary value). If you buy car insurance for a year, most often you do not have an accident and do not make a claim. As a result, you do not find out whether the policy would have paid anything at all – after all, it is a low-probability event. When you buy a life insurance policy, the policy may last for decades. For these reasons, regulators monitor the insurance business closely. They want to make sure that the company actually pays the compensation when the time comes. The insurance company cannot simply close shop and not pay if it incurs losses.

The consequence is this: regulators (like the IRDA) do not insist on a maximum retail price (what is called the premium of the policy) for an insurance policy. Instead, they stipulate a minimum price! No other business regulation works like that. If the insurance company does not charge enough for its policies, it may lose so much money that it could go bankrupt and leave the policyholders with nowhere to go. That this does not happen is also why the premium income of an insurance company is kept separate from its other income. This is the reason for mandating a minimum price (premium) for a policy.

How does an insurance company set the premium? First, it has to calculate the average loss incurred for a specific policy. Let us take a concrete example.

A life insurance company is selling life insurance for healthy 25-year-old males (healthy meaning no obvious condition such as cancer or a heart defect, and not a smoker); females will have a separate premium.

The probability of such an individual in India dying within the year is 0.0017. In other words, out of 10,000 such people, 17 would die in a given year. For a premium of one rupee (per year), the company can pay Rs 588 (= 1/0.0017) in life insurance benefits on average if it is to break even on that product.

But if it sells such policies at that premium, it will lose money half the time. If the company sells that policy for 10 years to 10,000 men of that age, it will lose money in 5 of the 10 years. Such a policy is not viable in the long run. This is precisely why the IRDA will not permit the company to charge such a low premium. In fact, no company will sell a policy paying Rs 250 for a Re 1 premium per year for that age group of males.

For females of the same age group, the probability of dying is 0.0013. The same calculation yields a break-even benefit of Rs 769 for a Re 1 premium.
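To make the break-even arithmetic concrete, here is a minimal Python sketch using the mortality probabilities quoted above (the function name is just for illustration):

```python
def breakeven_benefit_per_rupee(annual_mortality_probability: float) -> float:
    """Benefit the insurer can promise per Re 1 of annual premium and still break even."""
    return 1 / annual_mortality_probability

print(round(breakeven_benefit_per_rupee(0.0017)))  # ~588 for 25-year-old males
print(round(breakeven_benefit_per_rupee(0.0013)))  # ~769 for 25-year-old females
```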

There are two relevant observations here. First, females at all ages relevant to buying life insurance have lower mortality rates than their male counterparts of the same age. Hence, life insurance policies always have lower premiums for women. Second, the insurance company is allowed to discriminate among customers based on their age and sex (but not on other factors like location). No seller of mangoes can do that legally.

In technical terms, this is a pure term life insurance policy. Most insurance companies are reluctant to sell such policies because they want long term customers who will keep renewing their policies to make a bigger profit.

This example can be used for calculating premium for any other kind of insurance policy. Once we calculate the probabilities accurately, we can calculate a level of premium based on those probabilities. This process is known as ratemaking in insurance parlance.

The trickiest part of ratemaking is to put an individual into the right risk class. One omnipresent problem in risk classification is attracting people with higher-than-average risk. For example, if I know that my parents died of heart attacks, and my grandparents died of heart attacks, I have a higher-than-average risk of dying from a heart attack. I might be buying a life insurance policy precisely because I have this knowledge and the insurance company does not. This is called the adverse selection problem. The other problem is that I might become less careful about the underlying covered risk once I know I have an insurance policy. I may not be less careful about dying if I have a life insurance policy, but I might be less careful driving my car if I have comprehensive car insurance with no deductible or coinsurance clause. This is known as the moral hazard problem. [It is precisely because of this problem that car insurance is never sold with zero deductible or coinsurance clauses.]

Insurance policies are sold through agencies – specifically with agents. Agents (and underwriters) assess the risk of the potential buyer of a policy. If the agent signals a potential bad risk, the policy will not be sold.

Once the policy is sold, the risk of the buyer can be constantly monitored. For example, we know that the longer a person drives at a stretch, or the higher the speed at which a person drives, the higher the risk of an auto accident. If we can monitor those parameters of driving, we can assess the changing risk of the driver.

This is one area where AI comes into play. Today it is possible to monitor a driver through a GPS in real time to measure the speed and driving duration cheaply.
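As a purely illustrative sketch of the kind of telematics-based scoring described here, the toy function below turns GPS readings into a rough risk score. The fields, thresholds and weights are assumptions made up for the example, not any insurer's actual model.

```python
def trip_risk_score(samples, speed_limit_kmph=100, max_stretch_hours=3):
    """samples: list of (speed_kmph, hours_driven_at_a_stretch) GPS readings."""
    speeding = sum(1 for speed, _ in samples if speed > speed_limit_kmph)
    fatigue = sum(1 for _, stretch in samples if stretch > max_stretch_hours)
    return speeding + 2 * fatigue  # weight long stretches of driving more heavily

# Three sample readings: one normal, one speeding, one speeding on a long stretch.
print(trip_risk_score([(80, 1.0), (120, 2.5), (110, 3.5)]))  # -> 4
```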

Over the next decade, the agency model to assess the risk of a potential insurance buyer can be entirely replaced by AI/ML agents. Automation will replace human judgment in the process making it far more uniform.

The AI/ML domain will also be very useful in assessing the risks. Consider the adverse selection problem of attracting the wrong people. AI/ML methods can be used to search for and discover many underlying risks. For example, the death certificates of my parents can be pulled up to verify their causes of death before selling me an expensive life insurance policy. My health can be monitored through my Fitbit device, making sure I am truly the good risk I claim to be when buying health insurance. Smartwatches are already capable of detecting how often and how much I consume alcohol. Thus, a wearable can easily monitor the risk for a particular individual.

A large part of an insurance company is dedicated to verifying the authenticity of claims. Insurance fraud is an ever-present worry. Most fraud detection today is done manually. With AI/ML, we can connect different data systems very quickly to detect fraud. For example, a payment for damage to a car requires quotes from several garages. If one garage is found to be consistently producing higher quotes, we can blacklist it from future business.

Similarly, for medical insurance, if one clinic is consistently charging more for the same treatment, it can be identified quickly and barred.
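Here is a minimal sketch of that kind of cross-checking: flag any garage whose average quote is well above the overall average. The quote values and the 1.3x threshold are arbitrary assumptions for the example; a real system would use a statistically justified cut-off.

```python
from statistics import mean

# Quotes (in Rs) received from three garages across recent claims.
quotes = {
    "garage_a": [21000, 19500, 22000],
    "garage_b": [20500, 21500, 20000],
    "garage_c": [34000, 36000, 35500],
}

overall_average = mean(q for qs in quotes.values() for q in qs)
flagged = [g for g, qs in quotes.items() if mean(qs) > 1.3 * overall_average]
print(flagged)  # ['garage_c'] -- consistently high quotes, worth investigating
```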

In the case of life insurance, there have been many instances of people faking their own death. Recently, there was the case of one Prabhakar Bhimaji Waghchaure, who tried to collect five million US dollars by killing a lookalike and pretending he was the victim of a cobra bite. The case unraveled only after the insurance company sent an investigator from the US to India. Such verifications are expensive. Access to phone records can easily unearth such frauds; in fact, this was precisely the method used in that particular case. AI/ML methods can automate the process, thereby drastically reducing the cost of fraud detection.

Cryptography and Artificial Intelligence

Nowadays, both Cryptography and Artificial Intelligence (AI) have become integral parts of our daily life. The first makes human communication safe from unwanted attackers, and the second makes our lives easier by helping us make decisions.
In this article, we give a short overview of how these subjects are related to and depend on each other.

Cryptography

Cryptography is an indispensable tool used to protect information in computing systems. It is used to protect data at rest and data in motion. It is the study of mathematical techniques related to aspects of information security such as confidentiality, data integrity, entity authentication, data origin authentication, etc.

Modern cryptography is heavily based on mathematical theory and computer science practice; cryptographic algorithms are designed around computational hardness assumptions, making such algorithms hard to break in actual practice by any adversary. While it is theoretically possible to break into a well-designed system, it is infeasible in actual practice to do so.

Artificial Intelligence

Artificial intelligence is a technology that enables a machine to simulate human behavior. Machine learning (ML) is a subset of AI that allows a machine to learn automatically from past data without being explicitly programmed. The goal of AI is to build smart computer systems that, like humans, can solve complex problems.

AI applications include advanced web search engines (e.g., Google), recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Tesla), automated decision-making and competing at the highest level in strategic game systems (such as chess and Go), etc.

Relation between artificial intelligence and cryptography

Mathematical cryptanalysis deals with the problem of breaking cryptographic mechanisms solely by exploiting their mathematical properties. There is a variety of cryptographic mechanisms that are considered secure against this type of attack. However, this security cannot usually be strictly proven mathematically at present.

Artificial intelligence and cryptography have many things in common. The most apparent is the processing of large amounts of data and large search spaces. In a typical cryptanalytic situation, the cryptanalyst wishes to “break” some cryptosystem. This means he wishes to find the secret key used by the users of the cryptosystem, where the general system is already known. The decryption function thus comes from a known family of such functions (indexed by the key), and the goal of the cryptanalyst is to identify exactly which such function is being used. He may typically have available a large quantity of matching ciphertext and plaintext to use in his analysis. This problem can also be described as the problem of “learning an unknown function” (that is, the decryption function) from examples of its input/output behavior and prior knowledge about the class of possible functions.
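The following toy Python sketch illustrates this "identify the function from examples" view for a deliberately trivial cipher family: XOR with a single secret byte. With real ciphers the key space is astronomically large, which is exactly why learning-based heuristics become interesting; the exhaustive search here is only to make the idea concrete.

```python
def encrypt(plaintext: bytes, key: int) -> bytes:
    """Toy cipher family: XOR every byte with a single secret key byte."""
    return bytes(b ^ key for b in plaintext)

secret_key = 0x5A                      # unknown to the cryptanalyst
pairs = [(p, encrypt(p, secret_key)) for p in (b"HELLO", b"WORLD")]

# "Learn" the unknown function: find the key consistent with the plaintext/ciphertext examples.
candidates = [k for k in range(256)
              if all(encrypt(p, k) == c for p, c in pairs)]
print(candidates)  # [90] -- only one member of the function family fits the data
```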

Artificial intelligence in Cryptography

Artificial intelligence has been an interesting field of study with massive potential for application. In the past three decades, machine learning techniques, whether supervised or unsupervised, have been applied to cryptographic algorithms, cryptanalysis and steganography, among other data-security-related applications.

AI techniques can be applied to cryptographic problems in various ways. The goal is to understand potential attacks and security guarantees of cryptographic methods and implementations in more detail. AI can be used to improve or automate attack techniques, but also to create security proofs or to uncover errors in security proofs.

AI is applied in both cryptography and cryptanalysis. Many cryptosystems based on machine learning have already been proposed [1]. For example, a phenomenon like mutual learning [2] can help the two sides of a communication create a common secret key over a public channel. Classifying encrypted traffic using machine learning [3] is also a good example of AI being used in cryptography. Machine learning techniques have also been applied to perform side-channel attacks [4] and a known-plaintext attack on DES [5]. The proposed attack [5] trains a neural network to decrypt ciphertext without knowing the encryption key, in greatly reduced time compared to other known-plaintext attacks.

Cryptography in artificial intelligence

As we use AI-enabled devices in day-to-day life, they are prone to attack by malicious actors. At present, most AI devices do not use cryptographically secure AI protocols, as these require a very large amount of computational resources and so are inefficient in practice.

For example, an automated car can be hacked and misdirected. Chaos on the roads of a city can be created by hacking an AI-based automated traffic-light system.
Moreover, AI is being applied to a growing number of systems, particularly for problems where the intention is to detect anomalous system behavior. This is achieved by training on good and bad data. Since AI uses past data to learn and to predict the future, AI algorithms can be forced to output bad results by injecting manipulated data during training.

Shamir et al. [7] recently studied a broader issue in machine learning: what happens to deep neural networks during regular and adversarial training. They introduced a new theory of adversarial examples, called the Dimpled Manifold Model, in which they showed how adversarial examples affect deep neural networks. Such an attack can make an image of a cat be classified as a car, even though it does not look like a car at all. Shamir discussed at Indocrypt, in December 2021, that Tesla used a similar deep learning algorithm to read street signs; by changing a few chosen pixels of a ‘STOP’ sign, so that the new image looks practically the same to a human, the adversarial perturbation turns the ‘STOP’ sign into a ‘SPEED LIMIT 45 KMPH’ sign. What a disaster that may cause!

Cryptographic techniques can be used to mitigate such problems in the application of AI methods, for example through privacy-preserving machine learning. In such scenarios, data may be encrypted before training, and prediction algorithms may themselves be cryptographically secured, making them difficult to attack. However, such existing protection mechanisms require a significantly large amount of computational power. Due to the recent increase in attacks on AI protocols, a detailed study is necessary to make them secure by adopting cryptographic techniques.

Future direction

Thus we can see that cryptography and artificial intelligence are becoming greatly dependent on each other. On the one hand, the study of new methods of attack is important in order to detect possible weaknesses of cryptographic mechanisms at an early stage, so that we can design new, robust schemes.
On the other hand, finding new cryptographically secure AI protocols will be a great necessity in the coming days. All of this can be achieved only through the collaborative effort of people from both directions of research.

References

[1] Alani, Mohammed M. “Applications of machine learning in cryptography: a survey.” Proceedings of the 3rd International Conference on cryptography, security and privacy. 2019.

[2] Rosen-Zvi, Michal, et al. “Mutual learning in a tree parity machine and its application to cryptography.” Physical Review E 66.6 (2002): 066135.

[3] Alshammari, Riyad, and A. Nur Zincir-Heywood. “Machine learning based encrypted traffic classification: Identifying ssh and skype.” 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE, 2009.

[4] Hospodar, Gabriel, et al. “Machine learning in side-channel analysis: a first study.” Journal of Cryptographic Engineering 1.4 (2011): 293.

[5] Alani, Mohammed M. “Neuro-cryptanalysis of DES and triple-DES.” International Conference on Neural Information Processing. Springer, Berlin, Heidelberg, 2012.

[6] Rivest, Ronald L. “Cryptography and machine learning.” International Conference on the Theory and Application of Cryptology. Springer, Berlin, Heidelberg, 1991.

[7] Shamir, Adi, Odelia Melamed, and Oriel BenShmuel. “The dimpled manifold model of adversarial examples in machine learning.” arXiv preprint arXiv:2106.10151 (2021).

Plus ça change- Is ML the new name for Statistics?

Names change, but ideas usually don’t. How is today’s ‘data science’ different from yesterday’s statistics, mathematics and probability?

 Actually, it’s not very different. If it seems changed it’s only because the ground reality has changed. Yesterday we had data scarcity, today we have a data glut (“big data”). Yesterday we had our models, and were seeking data to validate them. Today we have data, and seek models to explain what this data is telling.

 Can we find associations in our data? If there’s association, can we identify a pattern? If there are multiple patterns, can we identify which are the most likely? If we can identify the most likely pattern, can we abstract it to a universal reality? That’s essentially the data science game today.

 Correlation

 Have we wondered why the staple food in most of India is dal-chaval or dal-roti? Why does almost everyone eat the two together? Why not just dal followed by just chaval?

 The most likely reason is that the nutritive benefit when eaten together is more than the benefit when eaten separately. Or think of why doctors prescribe combination drug therapies, or think back to the film Abhimaan (1973) in which Amitabh Bachchan and Jaya Bhaduri discovered that singing together created harmony, while singing separately created discord. Being together can offer a greater benefit than being apart.

 Of course, togetherness could also harm more. Attempting a combination of two business strategies could hurt more than using any individual strategy. Or partnering Inzamam ul Haq on the cricket field could restrict two runs to a single, or, even more likely, result in a run out!

 In data science, we use the correlation coefficient to measure the degree of linear association or togetherness. A correlation coefficient of +1 indicates the best possible positive association; while a value of -1 corresponds to the negative extreme. In general, a high positive or negative value is an indicator of greater association.

 The availability of big data now allows us to use the correlation coefficient to more easily confirm suspected associations, or discover hidden associations. Typically, the data set is a spreadsheet, e.g., supermarket data with customers as rows, and every merchandise sold as a column. With today’s number crunching capability, it is possible to compute the correlation coefficient between every pair of columns in the spreadsheet. So, while we can compute the correlation coefficient to confirm that beer cans and paper napkins are positively correlated (could be a dinner party), we could also unearth a hidden correlation between beer cans and baby diapers.
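As a quick illustration of this pairwise number-crunching, the sketch below computes the correlation coefficient between every pair of columns of a tiny, made-up basket table (real supermarket data would have thousands of columns and millions of rows):

```python
import pandas as pd

# Toy basket data: each row is a customer, each column a quantity purchased.
baskets = pd.DataFrame({
    "beer_cans":     [6, 0, 12, 2, 8, 0, 10],
    "paper_napkins": [5, 1, 10, 2, 7, 0, 9],
    "baby_diapers":  [4, 0, 9, 1, 6, 1, 8],
})

# Correlation coefficient between every pair of columns.
print(baskets.corr())
```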

 Why would beer cans and baby diapers be correlated? Perhaps there’s no valid reason, perhaps there’s some unknown common factor that we don’t know about (this has triggered off the ‘correlation-is-not-causation’ discussion). But today’s supermarket owner is unlikely to ponder over such imponderables; he’ll just direct his staff to place baby diapers next to beer cans and hope that it leads to better sales!

 Regression

 If two variables X and Y have a high correlation coefficient, it means that there is a strong degree of linear dependence between them. This opens up an interesting possibility: why not use the value of X to predict the likely value of Y? The prospect becomes even more enticing when it is easy to obtain X, but very hard (or expensive) to obtain Y.

 To illustrate, let us consider the height (X) and weight (Y) data of 150 male students in a class. The correlation coefficient between X and Y is found to be 0.88. Suppose a new student joins. We can measure his height with a tape, but we don’t have a weighing scale to obtain his weight. Is it possible to predict his weight?

 Let us first plot this data on a scatter diagram (see below); every blue dot on the plot corresponds to the height-weight of one student. The plot looks like a dense maze of blue dots. Is there some ‘togetherness’ between the dots? There is (remember the correlation is 0.88?), but it isn’t complete togetherness (because, then, all the dots would’ve aligned on a single line).

 To predict the new student’s weight, our best bet is to draw a straight line cutting right through the middle of the maze. Once we have this line, we can use it to read off the weight of the new student on the Y-axis, corresponding to his measured height plotted on the X-axis.

 How should we draw this line? The picture offers two alternatives: the blue line and the orange line. Which of the two is better? The one that is ‘middler’ through the maze is better. Let us drop down (or send up) a ‘blue perpendicular’ from every dot on to the blue line, and, likewise, an ‘orange perpendicular’ from every dot on to the orange line (note that if the dot is on the line, the corresponding perpendicular has zero length). Now sum the lengths of all the blue and orange perpendiculars. The line with a smaller sum is the better line!

  

[Scatter plot: X = height, Y = weight, with the blue and orange candidate lines through the maze of dots]

 Notice that the blue and orange lines vary only in terms of their ‘slope’ and ‘shift’, and there can be an infinity of such lines. The line with the lowest sum of the corresponding perpendiculars will be the ‘best’ possible line. We call this the regression line to predict Y using X; and it will look like:

Y = a1 X + a2, with a1 and a2 being the slope and shift values of this best line. This is the underlying idea in the famed least-squares method.
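Here is a small Python sketch of the fit. Note that the standard least-squares criterion, which np.polyfit implements, minimises the sum of squared vertical deviations from the line (a slight refinement of the "sum of perpendicular lengths" picture above); the height-weight numbers are synthetic stand-ins for the 150 students.

```python
import numpy as np

# Synthetic stand-in for the 150 students' heights (cm) and weights (kg).
rng = np.random.default_rng(0)
height = rng.normal(172, 7, 150)
weight = 0.9 * height - 90 + rng.normal(0, 5, 150)

a1, a2 = np.polyfit(height, weight, deg=1)   # slope and shift of the least-squares line
new_student_height = 178
predicted_weight = a1 * new_student_height + a2
print(round(a1, 2), round(a2, 1), round(predicted_weight, 1))
```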

 Bivariate to multivariate

 Let us see how we can apply the same idea to the (harder) problem of predicting the likely marks (Y) that a student might get in his final exam. The number of hours studied (X1) seems to be a reasonable predictor. But if we compute the correlation coefficient between Y and X1, using sample data, we’ll probably find that it is just about 0.5. That’s not enough, so we might want to consider another predictor variable. How about the intelligence quotient (IQ) of the student (X2)? If we check, we might find that the correlation between Y and X2 too is about 0.5.

 Why not, then, consider both these predictors? Instead of looking at just the simple correlation between Y and X, why not look at the multiple correlation between Y and both X1 and X2? If we calculate this multiple correlation, we’ll find that it is about 0.8.

 And, now that we are at it, why not also add two more predictors: Quality of the teaching (X3), and the student’s emotional quotient (X4)? If we go through the exercise, we’ll find that the multiple correlation keeps increasing as we keep adding more and more predictors.

 However, there’s a price to pay for this greed. If three predictor variables yield a multiple correlation of 0.92, and the next predictor variable makes it 0.93, is it really worth it? Remember too that with every new variable we also increase the computational complexity and errors.

 And there’s another – even more troubling – question. Some of the predictor variables could be strongly correlated among themselves (this is the problem of multicollinearity). Then the extra variables might actually bring in more noise than value!

 How, then, do we decide what’s the optimal number of predictor variables? We use an elegant construct called the adjusted multiple correlation. As we keep adding more and more predictor variables to the pot (we add the most correlated predictor first, then the second most correlated predictor and so on …), we reach a point where the addition of the next predictor diminishes the adjusted multiple correlation even though the multiple correlation itself keeps rising. That’s the point to stop!

 Let us suppose that this approach determines that the optimal number of predictors is 3. Then the multiple regression line to predict Y will look like Y = a1 X1 + a2 X2 + a3 X3 + a4, where a1, a2, a3 and a4 are the coefficients obtained from the least-squares criterion.
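The stopping rule described above can be sketched numerically. In the toy example below the marks depend on X1, X2 and X3, while X4 is pure noise, so the adjusted R-squared (the squared adjusted multiple correlation) should stop improving, and typically dip, when X4 is added. The data and coefficients are synthetic, chosen only to illustrate the mechanism.

```python
import numpy as np

# Synthetic data: marks depend on X1, X2, X3; X4 is pure noise.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + 1 * X[:, 2] + rng.normal(scale=2, size=n)

def adjusted_r2(predictors, y):
    n, k = predictors.shape
    A = np.column_stack([predictors, np.ones(n)])        # add the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1 - residuals.var() / y.var()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

for k in range(1, 5):
    print(k, "predictors -> adjusted R^2:", round(adjusted_r2(X[:, :k], y), 4))
```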

 Predictions using multiple regression are getting more and more reliable because there’s so much more data these days to validate. There is this (possibly apocryphal) story of a father suing a supermarket because his teenage daughter was being bombarded with mailers to buy new baby kits. “My daughter isn’t pregnant”, the father kept explaining. “Our multiple regression model indicates a very high probability that she is”, the supermarket insisted. And she was …

 As we dive deeper into multivariate statistics we’ll find that this is the real playing arena for data science; indeed, when I look at the contents of a machine learning course today, I can’t help feeling that it is multivariate statistics re-emerging with a new disguise. As the French writer Jean-Baptiste Alphonse Karr remarked long ago: plus ça change, plus c’est la même chose!

Relevance of Statistics In the New Data Science World

Relevance of Statistics in the new Data Science world

Rajeeva L Karandikar

Chennai Mathematical Institute, India 

Abstract 

With Big Data and Data Science becoming buzzwords, various people are wondering about the relevance of statistics versus pure data driven models.

In this article, I will explain my view that several statistical ideas are as relevant now as they have been in the past.  

 

1 Introduction

For over a decade now, Big Data, Analytics and Data Science have become buzzwords. As is the trend now, we will just refer to any combination of these three as data science. Many professionals working in the IT sector have moved to positions in data science, and they have picked up new tools. Often, these tools are used as black boxes. This is not surprising, because most of them have little if any background in statistics.

We can often hear them make comments such as, “With a large amount of data available, who needs statistics and statisticians? We can process the data with various available tools and pick the tool that best serves our purpose.”

We hear many stories of wonderful outcomes coming from what can be termed pure data-driven approaches. This has led to a tendency to simply take a large chunk of available data and push it through an AIML engine to derive ‘intelligence’ out of it, without giving a thought to where the data came from, how it was collected, and what connection the data has with the questions we are seeking answers to. If an analyst were to ask questions about the data – How was it collected? When was it collected? – the answer one frequently hears is: “How does it matter?”

 Later in this article, we will see that it does matter. We will also see that there are situations where blind use of the tools with data may lead to poor conclusions.

As more and more data become available in various contexts, our ability to draw meaningful, actionable intelligence will grow enormously. The best way forward is to marry statistical insights to ideas in AIML, and then use the vast computing power available at one’s fingertips. For this to happen, statisticians and AIML experts must work together, along with domain experts.

 

Through some examples, we will illustrate how ignoring statistical ideas and thought processes that have evolved over the last 150 years can lead to incorrect conclusions in many critical situations. 

2 Small data is still relevant

First, let us note that there is a class of problems where all the statistical theory and methodology developed over the last 150 years continues to have a role – since the data is only in the hundreds or at most thousands, and never in millions. For example, issues related to quality control, quality measurement, quality assurance, etc. only require a few hundred data points from which to draw valid conclusions. Finance – where the term VaR (value-at-risk), essentially a statistical term for the 95th or 99th percentile of the potential loss, has entered the law books of several countries – is another area where the use of data has become increasingly common; here too we work with a relatively small number of data points. There are roughly 250 trading days in a year, and there is no point going beyond 3 or 5 years into the past as economic ground realities are constantly changing. Thus, we may have only about 1250 data points of daily closing prices to use for, say, portfolio optimisation, option pricing or risk management. One can use hourly prices (with 10,000 data points), or even tick-by-tick trading data, but for portfolio optimisation and risk management the common practice is to use daily prices. In election forecasting, psephologists usually work with just a few thousand data points from an opinion poll to predict election outcomes. Finally, policy makers, who keep tabs on various socio-economic parameters of a nation, rely on survey data, which of course does not run into millions.
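As a small illustration of the VaR idea mentioned above, the sketch below computes a historical-simulation 99% VaR as the 1st percentile of daily profit-and-loss, using about 1250 synthetic daily returns in place of real market data:

```python
import numpy as np

# Historical VaR sketch on ~5 years (1250 days) of synthetic daily returns.
rng = np.random.default_rng(7)
daily_returns = rng.normal(0.0005, 0.01, 1250)

portfolio_value = 10_000_000
var_99 = -np.percentile(daily_returns, 1) * portfolio_value   # 99% VaR = 1st-percentile loss
print(f"1-day 99% VaR is roughly Rs {var_99:,.0f}")
```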

One of the biggest problems faced by humanity in recent times is the COVID-19 virus. From March 2020 till the year end, everyone was waiting for vaccines against COVID-19. Finally, in December 2020, the first vaccine was approved, and more have followed. Let us recall that the approval of vaccines is based on RCTs – Randomised Clinical Trials – which involve a few thousand observations, along with concepts developed in the statistical literature under the theme of Design of Experiments. Indeed, most drugs and vaccines are identified, tested and approved using these techniques.

These examples illustrate that there are several problems where we need to arrive at a decision or reach a conclusion where we do not have millions of data points. We must do our best with a few hundred or few thousand data points. So statistical techniques of working with small data will always remain relevant. 

3 Perils of purely data driven inference

This example goes back nearly 150 years. Sir Francis Galton was a cousin of Charles Darwin, and as a follow up to Darwin’s ideas of evolution, Galton was studying inheritance of genetic traits from one generation to the next. He had his focus on how intelligence is passed from one generation to the next.  Studying inheritance, Galton wrote “It is some years since I made an extensive series of experiments on the produce of seeds of different size but of the same species. It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre; smaller than the parents, if the parents were large; larger than the parents, if the parents were small.” Galton firmly believed that this phenomenon will be true for humans as well and for all traits that are passed on genetically, including intelligence. 

 

To illustrate his point, Galton obtained data on the heights of parents and their (grown-up) offspring. He chose height as it was easy to obtain data on it. His analysis of the data confirmed his hypothesis, quoted above. He further argued that this phenomenon would continue over generations, and its effect would mean that the heights of future offspring will continue to move towards the average height. He argued that the same would happen to intelligence, and thus everyone would eventually have only average intelligence. He chose the title of the paper as “Regression Towards Mediocrity in Hereditary Stature”.

 

The conclusion drawn by Galton is fallacious, as can be seen by analysing the same data with the roles of the offspring's height and the parents' mid-height interchanged, which leads to exactly the opposite conclusion – namely, that if the offspring is taller than average, then the average height of the parents will be less than that of the offspring, while if the offspring is shorter than average, then the average height of the parents will be more than that of the child. It can also be seen that the variation in heights (the variance) in the two generations was comparable, whereas if there really were regression towards the mean, the variance would have decreased. Thus, Galton's conclusion about regression to mediocrity over generations is not correct. However, the methodology he developed for the analysis of inheritance of heights has become a standard tool in statistics and continues to be called regression.
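A small simulation makes the fallacy visible. With synthetic parent/offspring heights that have the same spread in both generations and correlation about 0.6, the regression slope is below 1 in both directions (offspring on parent and parent on offspring), while the variances stay comparable, exactly as described above. All numbers here are made up for illustration.

```python
import numpy as np

# Simulated Galton-style data: same spread in both generations, correlation ~0.6.
rng = np.random.default_rng(42)
n, mu, sd, rho = 10_000, 170.0, 6.0, 0.6
parent = rng.normal(mu, sd, n)
offspring = mu + rho * (parent - mu) + rng.normal(0, sd * np.sqrt(1 - rho**2), n)

slope_offspring_on_parent = np.polyfit(parent, offspring, 1)[0]
slope_parent_on_offspring = np.polyfit(offspring, parent, 1)[0]
print(round(slope_offspring_on_parent, 2), round(slope_parent_on_offspring, 2))  # both ~0.6
print(round(parent.std(), 2), round(offspring.std(), 2))   # spreads stay comparable
```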

Galton was so convinced of his theory that he looked at the data from only one angle and got confirmation of his belief. This phenomenon is called Confirmation Bias, a term coined by English psychologist Peter Wason in the 1960s.

4 Is the data representative of the population?

Given data, even if it is huge, one must first ask how it was collected. Only after knowing this can one begin to determine whether the data is representative of the population.

In India, many TV news channels take a single view either supporting the government or against it.  Let us assume News Channel 1 and News Channel 2 both run a poll on their websites, at the same time, on a policy announced by the government. Even if both sites attract large number of responses, it is very likely that the conclusions will be diametrically opposite, since the people who frequent each site will likely be ones with a political inclination aligned with the website.  This underscores the point that just having a large set of data is not enough – it must truly represent the population in question for the inference to be valid. 

If someone gives a large chunk of data on voter preferences to an analyst and wants her to analyse it and predict the outcome of the next elections, she must start by asking how the data was collected; only then can she decide whether it represents the Indian electorate or not. For example, the data may come from social media posts and messages on political questions during the previous few weeks. However, less educated, rural and economically weaker sections are highly underrepresented on social media, and thus conclusions drawn from the opinions of such a group (of social media users) will not give insight into how the Indian electorate will vote. The same social media data can, however, be used to quickly assess the market potential of a high-end smartphone – for its target market is precisely those who are active on social media.

  5 Perils of blind use of tools without understanding them

The next example is not one incident but a recurrent theme – that of trying to evaluate the efficacy of an entrance test, such as the IIT-JEE for admission to the IITs, the CAT for admission to the IIMs, or the SAT or GRE for admission to top universities in the USA. Let us call such tests benchmark tests; they are open to all candidates, and those who perform very well in the benchmark test are shortlisted for admission to the targeted programme. The analysis consists of computing the correlation between the score on the benchmark test and the performance of the candidate in the programme. Often it is found that the correlation is rather poor, and this leads to discussion on the quality of the benchmark test. What is forgotten or ignored is that performance data is available only for the candidates selected for admission. This phenomenon is known as Selection Bias – the data set consists of only a subset of the whole group under consideration, selected based on some criterion.
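A quick simulation shows how selecting only the top scorers depresses the observed correlation (the "restriction of range" effect). In this synthetic sketch the benchmark score and programme performance have a true correlation of about 0.7, but among the admitted top 2% the observed correlation is far smaller:

```python
import numpy as np

# Full applicant pool: benchmark score and programme performance, correlation ~0.7.
rng = np.random.default_rng(3)
n = 100_000
score = rng.normal(0, 1, n)
performance = 0.7 * score + rng.normal(0, np.sqrt(1 - 0.7**2), n)

admitted = score > np.quantile(score, 0.98)      # only the top 2% are admitted
print(round(np.corrcoef(score, performance)[0, 1], 2))                      # ~0.70
print(round(np.corrcoef(score[admitted], performance[admitted])[0, 1], 2))  # much smaller
```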

This study also illustrates the phenomenon known as Absence of Tail Dependence for the joint normal distribution. Unfortunately, this property is inherited by many statistical models used for risk management and is considered one of the reasons for the collapse of global financial markets in 2008.

A similar bias occurs in studies related to health, where, for reasons beyond the control of the team undertaking the study, some patients are no longer available for observation. The bias this introduces is called Censoring Bias, and how to account for it in analysis is a major theme in an area of statistics known as Survival Analysis.

6 Correlation does not imply causation

Most data-driven analysis can be summarised as trying to discover relationships among different variables – and this is what correlation and regression are all about. These were introduced by Galton about 150 years ago and have been a source of intense debate ever since. One of the myths in pure data analysis is the assumption that correlation implies causation. However, this need not be true in all cases, and one needs to use transformations to get to more complex relationships.

One example often cited is where X is the sale of ice-cream in a coastal town in Europe and Y is the number of deaths due to drowning (while swimming in the sea, in that town) in the same month. One sees strong correlation! While there is no reason as to why eating more ice-creams would lead to more deaths due to drowning, one can see that they are strongly correlated to a variable Z = average of the daily maximum temperature during the month; in summer months more people eat ice-cream, and more people go to swim! In such instances, the variable Z is called a Confounding Variable.
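The confounding effect is easy to reproduce in a simulation: let the monthly temperature (Z) drive both ice-cream sales (X) and drownings (Y), with no direct link between X and Y. The numbers below are made up purely for illustration.

```python
import numpy as np

# Monthly temperature (the confounder Z) drives both ice-cream sales (X) and drownings (Y).
rng = np.random.default_rng(5)
temperature = rng.normal(20, 8, 120)                      # 10 years of monthly averages
ice_cream = 100 + 5 * temperature + rng.normal(0, 10, 120)
drownings = 1 + 0.2 * temperature + rng.normal(0, 1, 120)

print(round(np.corrcoef(ice_cream, drownings)[0, 1], 2))  # strong correlation, no causation
```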

In today’s world, countrywide data would be available for a large number of socio-economic variables, variables related to health, nutrition, hygiene, pollution, economic variables and so on – one can, say, list about 500 variables where data on over 100 countries is available. One is likely to observe correlations among several pairs of these 500 variables – one such recent observation is: Gross Domestic Product (GDP) of a country and number of deaths per million population due to COVID-19 are strongly correlated! Of course, there is no reason why richer or more developed countries should have more deaths.

Just as linear relationships may be spurious, the relationships discovered by AIML algorithms may also be so. Hence learning from the statistical literature going back a century is needed to weed out spurious conclusions and find the right relationships for business intelligence.

 

7 Simpson’s paradox and the omitted variable bias

Simpson’s Paradox is an effect wherein ignoring an important variable may reverse the conclusion. One of the best-known examples of Simpson’s paradox is a study of gender bias in graduate school admissions at the University of California, Berkeley. In 1973, it was alleged that there was gender bias in graduate admissions – the acceptance rate among males was 44%, while among females it was 35%. When the statisticians at Berkeley tried to identify which department was responsible for this, they looked at department-wise acceptance rates and found that, if anything, there was a bias against males… The apparent bias in the pooled data appeared because a lot more women applied to departments that had lower acceptance rates. The variable ‘department’ in this example is called a confounding factor. In the economics literature, the same phenomenon is also called Omitted Variable Bias.
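The reversal is easy to reproduce with a tiny made-up data set (these are not the actual Berkeley figures): women apply mostly to the department with the lower acceptance rate, so the pooled rates favour men even though each department favours women.

```python
# Made-up admission counts (not the actual Berkeley figures) that reproduce the reversal:
# department -> (male applicants, male admits, female applicants, female admits).
departments = {
    "dept_A": (800, 500, 100, 70),   # high acceptance rate, mostly male applicants
    "dept_B": (100, 10, 800, 90),    # low acceptance rate, mostly female applicants
}

m_app = sum(d[0] for d in departments.values()); m_adm = sum(d[1] for d in departments.values())
f_app = sum(d[2] for d in departments.values()); f_adm = sum(d[3] for d in departments.values())
print("pooled:", round(m_adm / m_app, 2), round(f_adm / f_app, 2))     # men look favoured
for name, (ma, mad, fa, fad) in departments.items():
    print(name, round(mad / ma, 2), round(fad / fa, 2))                # each dept favours women
```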

8 Selection Bias and World War II

During World War II, based on a simple analysis of data obtained from damaged planes returning to base after bombing raids, it was proposed to the British air force that armour be added to the areas that showed the most damage. Professor Abraham Wald of Columbia University, a member of the Statistical Research Group (SRG), was asked to review the findings and recommend how much extra armour should be added to the vulnerable parts.

Wald looked at the problem from a different angle. He realised that there was a selection bias in the data presented to him – only the aircraft that did not crash returned to base and made it into the data. Wald assumed that the probability of being hit in any given part of the plane was proportional to its area (since the shooters could not aim at any specific part of the plane). Also, given that there was no redundancy in aircraft at that time, the effect of hits on a given area of the aircraft was independent of the effect of hits on any other area. Once he made these two assumptions, the conclusion was obvious – armour should be added to the parts where fewer hits had been observed. Statistical thinking thus led Wald to a model that gave the right frame of reference connecting the data (hits on planes that returned) and the desired conclusion (where to add armour).
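The survivorship effect can be sketched in a few lines of simulation. Hits land uniformly across the sections of the plane, but a hit on the engine is assumed (with a made-up probability) to be far more likely to down the aircraft, so the planes that return show few engine hits, exactly the pattern Wald had to see through.

```python
import random

# Hits land uniformly across sections, but engine hits are far more likely to down the plane.
random.seed(0)
sections = ["engine", "fuselage", "wings", "tail"]
loss_probability = {"engine": 0.8, "fuselage": 0.1, "wings": 0.1, "tail": 0.2}

observed_hits = {s: 0 for s in sections}
for _ in range(10_000):
    hit = random.choice(sections)                     # shooters cannot aim at a specific part
    if random.random() > loss_probability[hit]:       # the plane survives and returns to base
        observed_hits[hit] += 1

print(observed_hits)  # engine hits are scarce in the data we get to see -- Wald's insight
```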

 9 GMGO (Garbage Model, Garbage Out), the new GIGO in the Data Science world

The phrase Garbage-In-Garbage-Out (GIGO) is often used to describe the fact that, even with the best of algorithms, if the input data is garbage, then the conclusion (output) is also likely to be garbage. Our discussion adds a new phenomenon, GMGO: a garbage model will lead to garbage output even with accurate data!

 10 Conclusion

We have given examples where disregarding statistical understanding accumulated over 150 years can lead to wrong conclusions. While in many situations pure data-driven techniques can do reasonably well, combining them with domain knowledge and statistical techniques can do wonders in unearthing valuable business intelligence to improve business performance.

We recommend treating data-driven AI/ML models as a good starting point, or a good exploratory step. Using domain knowledge and statistical thinking to avoid the various pitfalls discussed in this paper can then take the analysis to a much higher level.