
Insights into AI Landscape – A Preface

AI Landscape and Key Areas of Interest

The AI landscape encompasses several crucial domains, and it’s imperative for any organization aiming to participate in this transformative movement to grasp these aspects. Our objective is to offer our insights and perspective into each of these critical domains through a series of articles on this platform.

We will explore key topics in each area depicted in the diagram below.

1.      Standards, Framework, Assurance: We will address the upcoming International Standards and Frameworks, as well as those currently in effect. Significant efforts in this area are being undertaken by international organizations like ISO, IEEE, BSI, DIN, and others to establish order by defining these standards. This also encompasses Assurance frameworks, Ethics frameworks, and the necessary checks and balances for the development of AI solutions. It’s important to note that many of these frameworks are still in development and are being complemented by Regulations and Laws. Certain frameworks related to Cybersecurity and Privacy Regulations (e.g., GDPR) are expected to become de facto reference points. More details will be provided in the forthcoming comprehensive write-up in Series 1.

2.      Legislations, Laws, Regulations: Virtually all countries have recognized the implications and impact of AI on both professional and personal behavior, prompting many to work on establishing fundamental but essential legislations to safeguard human interests. This initiative began a couple of years ago and has gained significant momentum, especially with the introduction of Generative AI tools and platforms. Europe is taking the lead in implementing legislation ahead of many other nations, and countries like the USA, Canada, China, India, and others are also actively engaged in this area. We will delve deeper into this topic in Series 2.

3.      AI Platforms & Tools: An array of AI platforms and tools is available, spanning various domains, including Content Creation, Software Development, Language Translation, Healthcare, Finance, Gaming, Design/Arts, and more. Generative AI tools encompass applications such as ChatGPT, Copilot, DALL-E 2, Scribe, Jasper, etc. Additionally, AI chatbots like ChatGPT, Google Bard, Microsoft Bing AI, Jasper Chat, and ChatSpot, among others, are part of this landscape. This section will provide insights into key platforms and tools, including open-source options that cater to the needs of users.

4.      Social Impact:  AI Ethics begins at the strategic planning and design of AI systems. Various frameworks are currently under discussion due to their far-reaching societal consequences, leading to extensive debates on this subject. Furthermore, it has a significant influence on the jobs of the future, particularly in terms of regional outcomes, the types of jobs that will emerge, and those that will be enhanced or automated. The frameworks, standards, and legislations mentioned earlier strongly emphasize this dimension and are under close scrutiny. Most importantly, it is intriguing to observe the global adoption of AI solutions and whether societies worldwide embrace them or remain cautious. This section aims to shed light on this perspective.

5.      Others: Use Cases and Considerations:  In this Section, we will explore several use cases and success stories of AI implementation across various domains. We will also highlight obstacles in the adoption of AI, encompassing factors such as the pace of adoption, the integration of AI with existing legacy systems, and the trade-offs between new solutions and their associated costs and benefits.  We have already published a recent paper on this subject, and we plan to share more insights as the series continues to unfold.

The Executive Order!

Close on the heels of the formation of the Frontier Model Forum and a White House announcement that it had secured “voluntary commitments” from seven leading AI companies to self-regulate the risks posed by artificial intelligence, President Joe Biden yesterday issued an executive order regulating the development, and ensuring the safe and secure deployment, of artificial intelligence models. The underlying principles of the order are summarized in the picture.

The key aspects of the order focus on what are termed “dual-use foundation models” – models that are trained on broad data, use self-supervision, and can be applied in a variety of contexts. Generative AI models like GPT typically fall into this category, although the order is aimed at the next generation of models beyond GPT-4.

Let’s look at the key aspects of what the order says.

Safe & Secure AI

  • The need for safe and secure AI through thorough testing – even sharing test results with the government for critical systems that can impact national security, the economy, or public health and safety
  • Build guidelines to conduct AI red-teaming tests that involve assessing and managing the safety, security, and trustworthiness of AI models
  • The need to establish provenance of AI-generated content
  • Ensure that compute & data are not in the hands of a few colluding companies, and that new businesses can thrive [This is probably the biggest “I don’t trust you” statement back to Big Tech!]

AI Education / Upskilling

  • Given its criticality, the need for investments in AI-related education, training, R&D and protection of IP.
  • Support for programs to provide Americans with the skills they need for the age of AI, and to attract the world’s AI talent, via investments in AI-related education, training, development, research, and capacity and IP development
  • Encouraging AI skills import into the US [probably the one that most Indian STEM students who hope to study and work in the US will find a reason to cheer]

Protection Of Rights

  • Ensuring the protection of civil rights, protection against bias & discrimination, and the rights of consumers (users)
  • Lastly, the growth of governmental capacity to regulate, govern and support responsible AI.

Development of guidelines & standards

  • Building on the Blueprint for an AI Bill of Rights & the AI Risk Management Framework, to create guidance and benchmarks for evaluating and auditing AI capabilities, particularly in areas where AI could cause harm, such as cybersecurity and biosecurity

Protecting US Interests

  • The regulations also propose that companies developing or intending to develop potential dual-use foundation models report to the government, on an ongoing basis, their activities w.r.t. training & assurance on the models, and the results of any red-team testing conducted
  • IaaS providers report on the security of their infrastructure and the usage of compute (large enough to train these dual-use foundation models), as well as its usage by foreign actors who train large AI models that could be used for mala fide purposes

Securing Critical Infrastructure

  • With respect to critical infrastructure, the order directs that, under the Secretary of Homeland Security, an AI Safety & Security Board be established, composed of AI experts from various sectors, to provide advice and recommendations to improve security, resilience, and incident response related to AI usage in critical infrastructure
  • All critical infrastructure is to be assessed for potential risks (vulnerabilities to critical failures, physical attacks, and cyberattacks) associated with the use of AI
  • An assessment is to be undertaken of the risks of AI misuse in developing threats in key areas like CBRN (chemical, biological, radiological and nuclear) & the biosciences

Data Privacy

  • One section of the document deals with mitigating the privacy risks associated with AI, including an assessment of, and standards on, the collection and use of information about individuals.
  • It also wants to ensure that the collection, use, and retention of data respects privacy and confidentiality
  • It also calls for Congress to pass data privacy legislation

Federal Government Use of AI

  • The order encourages the use of AI, particularly generative AI, with safeguards in place and appropriate training, across federal agencies, except for national security systems.
  • It also calls for an interagency council to be established to coordinate AI development and use.

Finally, the key element – keeping America’s leadership in AI strong – by driving efforts to expand engagements with international allies and establish international frameworks for managing AI risks and benefits as well as driving an AI research agenda.

In subsequent posts, we will look at reactions, and what it means for Big Tech and for the Indian IT industry which is heavily tied to the US!

Domain and LLM

I am in total agreement with the Morgan Zimmerman (Dassault Systèmes) quote in TOI today. Every industry has its own terminologies, concepts, names and words, i.e., an industry language. He says even a simple-looking word like “Certification” has different meanings in aerospace vs life sciences. He recommends the use of industry-specific language, and your own company-specific language, to get significant benefit out of LLMs. This will also reduce hallucinations and misunderstanding.

This is in line with @AiThoughts.Org’s thoughts on layering domain- and company-specific information on top of the general data used by all LLMs. Like they say in real estate, the 3 most important things in any buying decision are “Location, Location and Location”. We need 3 things to make LLMs work for the enterprise: “Domain, Domain and Domain”. Many of us may recall a very successful Bill Clinton presidential campaign slogan: “The economy, Stupid”. We can say “The domain, Stupid” is the slogan for making LLMs useful for enterprises.

But the million-dollar question is: how much will it cost to do the learning updates using your domain and company data? EY published a cost of $1.4 billion, which very few can afford. We need much less expensive solutions for large-scale implementation of LLMs.

I solicit your thoughts. #LLM #aiml #Aiethics #Aiforindustry

L Ravichandran

AI and Law

The public domain is full of initiatives by many law universities, large law firms, and various government departments on the topic of “AI and Law”. I was happy to see a news article a few days ago about the Indian consumer grievances cell thinking about using AI to clear a large number of pending cases. They have had some success in streamlining processes and making everything digital, but they felt that the sheer volume of pending cases needs an AI-type intervention. I have already talked about the huge volume of civil cases pending in lower courts in India, with some cases taking even 20 years to reach a final judgment. As the saying goes, “Justice delayed is justice denied”; it is imperative that we find solutions to this huge backlog problem.

All discussions are centred around two broad areas:

1.      Legal research and development of the client’s case by law firms – basically, the core work of both junior and senior law associates and partners.

2.      Assisting judges, or even rendering judgments on their own, by AI models, to reduce the backlog and speed up justice.

Lots of interesting discussions are happening on (1). Legal research – looking into archives, similar judgments, precedents, etc. – seems to be a no-brainer. Huge advances in automation have already been made, and this will increase multi-fold with these purpose-built legal language models. What will happen to junior law associates is an interesting question. Can they use better research to develop actual arguments and superior case briefs for their clients, and take the load off senior associates who in turn can focus more on client interactions? I found the discussions on models analysing judges’ earlier judgments and customizing argument briefs per judge fascinating.

Item (2) needs a lot of discussion. The jurisprudence of all democratic countries is based on three fundamental principles:

  1. Every citizen will have their “day in court” to present their case before an impartial judge.
  2. Every citizen will have a right to competent counsel, with public defenders provided free to citizens.
  3. Every witness can be cross-examined by the other party without any restrictions.

On the one hand, we have these great jurisprudence principles.  On the other hand, we have huge backlogs and delays. 

How much are citizens willing to give up of these basic principles to get speedy justice?

Can we give up the principle of “my day in court” and let only the written briefs submitted to the court be used for the final judgment? This would mean witness statements in briefs will not be cross-examined or questioned.

Can we give up the presence of a human judge who reads the briefs on both sides and makes a judgment, and instead let an AI model read both briefs and pronounce the judgment?

Even if citizens are willing to give up these principles, does the existing law of the land allow this? It may require changes to the law, and in some countries even changes to the constitution, to allow for this new AI jurisprudence.

Do we treat civil cases and criminal cases separately and find different solutions? Criminal cases involve human liberty issues such as imprisonment and will need a whole different set of benchmarks.

What about changes to the appeal process if you do not like a lower court judgment? I presume we will need human judges to review the judgments given by AI models. It is very difficult for us to accept a higher-court AI model reviewing and correcting a lower-court AI model’s original judgment.

The biggest hurdle is going to be us, the citizens. In any legal case involving two parties, at least one party – and in many cases both parties – will be unhappy with any judgment. No losing party in any civil case is going to be happy that they lost as per some sub-clause in some law text. In many cases, even winning parties may not be happy with the award amount. In this kind of scenario, how do you expect citizens to accept an instantaneous verdict after both parties submit their briefs? This will be a great human change management issue.

Even if we come up with solutions to these complex legal and people problems, one technical challenge still remains a big hurdle. With the release of many large language models and APIs, many projects are under way to train these LLMs on specific domains. A few days ago, we saw a press release by EY about their domain-specific model developed with an investment of US$1.4 billion. Bloomberg announced BloombergGPT, their own 50-billion-parameter language model purpose-built for finance. Who will bell the cat for the law domain? Who will invest large sums and create a legal AI model for each country? Until such a model is available for general use, many of the things we discussed will not be possible.

To conclude, there are huge opportunities to get business value out of the new AI technology in the Law and Justice Domain. However, technical, legal and people issues must be understood, addressed and resolved before any large-scale implementation.

More later. I would like to hear your thoughts.

L Ravichandran

AI Regulations : Need for urgency

A few weeks ago, I saw a news article about the risks of unregulated AI. The article reported that in the USA, police came to the house of an eight-months-pregnant African American woman and arrested her because a facial recognition system had identified her as the suspect in a robbery. No amount of pleading from the lady – that given her advanced pregnancy at the time of the robbery she simply could not have committed the crime – was heeded by the police officer. The police officer did not have any discretion. The system was set up such that once the AI face recognition identifies a suspect, police are required to arrest her, bring her to the police station and book her.

In this case, she was taken to the police station, booked and released on bail. A few days later the case against her was dismissed, as the AI system had wrongly identified her. It was also found that she was not the first case; a few more people, especially African American women, had been wrongly arrested and later released due to the incorrect facial recognition model.

The slow pace at which governments are moving on regulations, set against the proliferation of AI tech companies delivering business applications such as this facial recognition model, demands urgent regulation.

Maybe citizens themselves should organize and hold the people responsible for deploying these systems accountable. The Chief of Police, maybe the Mayor of the town, and the county officials who signed off on this AI facial recognition system should be made accountable. Maybe the county should pay hefty fines, and not just offer a simple “oops, sorry”.

Lots of attention needs to be placed on training data. Training data should represent all the diverse people of the country in sufficient samples. Expected biases due to lack of sufficient diversity in training data must be anticipated and the model tweaked. Most democratic countries have criminal justice systems with an unwritten motto: “Let 1000 criminals go free, but not a single innocent person should go to jail”. The burden of proof of guilt is always on the state. However, we seem to have forgotten this when deploying these law enforcement systems. Proof at very high confidence levels, with explainable, human-understandable AI reasoning, must be the basic approval criterion for these systems to be deployed.

The proposed EU act classifies these law enforcement systems as high risk, and they will come under the act. Hopefully the EU act becomes law soon and prevents such unfortunate violations of civil liberties and human rights.

More Later,

L Ravichandran

EU AI Regulations Update

I wrote some time back about the circulation of the draft EU AI Act. After more than 2 years, there is some more movement in making this an EU law. In June 2023, the EU Parliament adopted the draft and a set of negotiating principles, and the next step of discussions with member countries has started. EU officials are confident that this process will be completed by the end of 2023 and that this will become EU law soon. Like the old Hindi proverb “Bhagwan ke ghar mein der hai, andher nahin” – “In God’s scheme of things, there may be delays but never darkness”. The EU has taken the first step, and if this becomes law by early 2024, it will be a big achievement. I am sure the USA and other large countries will follow soon.

The draft has more or less maintained its basic principles and structure. 

The basic objective of the new law is to make sure that AI systems used in the EU are safe, transparent, traceable, non-discriminatory and environmentally friendly. In addition, there is a larger emphasis on AI systems being overseen by people, rather than by automation alone. The principle of proportionate regulation – the risk categorization of AI systems and a level of regulation appropriate to the risk – is the central theme of the proposed law. In addition, there were no generative AI or ChatGPT-like products when the original draft was developed in 2021, hence additional regulations have been added to address these large language models / generative AI models. The draft also plans to establish a technology-neutral, uniform definition of AI that could be applied to future AI systems.

Just to recall from my earlier blog, the risks are categorized into limited risk, high risk and unacceptable risk.

The draft law clearly defines systems which are categorized as “unacceptable risk” and proposes to ban them from commercial launch within EU member countries. Some examples are given below.

  • Any AI system which can change or manipulate the cognitive behaviour of humans, especially vulnerable groups such as children and the elderly.
  • Any AI system which classifies people based on personal traits such as behaviour, socio-economic status, race or other personal characteristics.
  • Any AI system which does real-time, remote biometric identification, such as facial recognition, which usually happens without the consent of the person targeted. The law also clarifies that past-data analysis for law enforcement purposes is acceptable with court orders.

The draft law is concerned about any negative impact on the fundamental rights of EU citizens and any impact on personal safety. These types of systems will be categorized as high risk.

1)  Many products such as toys, automobiles, aviation products, medical devices etc. are already covered under existing EU product safety legislation. Any AI systems used inside products regulated under this legislation will also be subject to additional regulations as per the high-risk category.


2)  Other AI systems falling into eight specific areas will be classified as high risk, requiring registration in an EU database and subject to the new regulations.

The eight areas are:

  1. Biometric identification and categorisation of natural persons
  2. Management and operation of critical infrastructure
  3. Education and vocational training
  4. Employment, worker management and access to self-employment
  5. Access to and enjoyment of essential private services and public services and benefits
  6. Law enforcement
  7. Migration, asylum and border control management
  8. Assistance in legal interpretation and application of the law.


Once these systems are registered in the EU database, they will be assessed by the appropriate agencies for functionality, safety features, transparency, grievance mechanisms for appeal, etc., and will be given approval before they are deployed in the EU market. All updates and new versions of these AI systems will be subjected to similar scrutiny.


Other AI systems not in the above two lists will be termed “limited risk” systems and subject to self-regulation. At a minimum, the law expects these systems to inform users that they are indeed interacting with an AI system, and to provide options to change to a human-operated system or discontinue use.

As I have mentioned before, the proposed law covers generative AI systems also. The law requires these systems to disclose to users that an output document or an output decision was generated or derived by a generative AI system. In addition, the system should publish the list of copyrighted training content used by the model. I am not sure how practical this is, given that ChatGPT-like systems read nearly every digital content on the web and are now moving into audio/video content. Even if a system produces this list, which is expected to be very large, I am not sure current copyright laws are sufficient to address the use of this copyrighted material in a different form inside deep learning neural networks.

The proposed law also wants to ensure that the generative AI models are self-regulated enough not to generate illegal content or provide illegal advice to users.


The Indian government is also looking at enacting AI regulations soon. In a June 9th, 2023 interview, the Indian IT minister talked about this. He emphasized the objective of “no harm” to citizen digital users. The government’s approach to any regulation of AI will be through the prism of “user harm, or derived user harm, through the use of any AI technology”. I am sure a draft will be out soon, and India will also have similar laws.

Let us discuss the implications and consequences of this regulation for the various stakeholders.

  • AI system developer companies (Tech and Enterprises)

They need to educate all their AI development teams on these laws and ensure these systems are tested for compliance prior to commercial release. Large enterprises may even ask large-scale model developers like OpenAI to indemnify them against any violations while using their APIs. The internal legal counsels of both the tech companies and the API-using enterprises need to be trained on the new laws and get ready for contract negotiations. Systems integrators and outsourcers such as Tech Mahindra, TCS, Infosys etc. also need to gear up for the challenge. Liability will be passed down from the enterprise to the systems integrators, and they need to ensure compliance is built in and tested correctly, with proper documentation.

  • Governments & Regulators

Government and regulatory bodies need to upskill their staff on the new laws and on how to verify and test compliance for commercial launch approval. The tech companies are very big and will throw in the best technical as well as legal talent to argue that their systems are compliant; if the regulatory bodies are not skilled enough to verify this, the law will become ineffective and exist only on paper. This is a huge challenge for government bodies.

  • The legal community: public prosecutors, company legal counsels and defence lawyers

Are they ready for the avalanche of legal cases, starting from regulatory approvals and appeals, through ongoing copyright violations and privacy violations, to inter-company litigation over liability sharing between tech companies, enterprises and systems integrators?

Massive upskilling and training is needed even for senior lawyers, as the issues arising from this law are very different. The law degree curriculum needs to include a course on AI regulations. For example, the essence of a comedian’s talk show is “learnt” by a deep learning model and stored deep inside its neural networks. Is that a copyright violation? The model outputs a comedy speech in a similar style by using the “essence” stored in the neural network. Is the output a copyright violation? Who is responsible and accountable for an autonomous car accident? Who is responsible for a factory accident causing injury to a worker in an autonomous robot factory? Lots of new legal challenges.

Most Indian systems integrators are investing large sums of money to reskill and to create new AI-based service offerings. I hope they are spending part of that investment on AI regulations and compliance. Otherwise, they run the risk of losing all their profits in a few tricky legal challenges.

More later

L Ravichandran

brAInWaves – Oct ’22

Welcome to brAInwaves – our first newsletter! And thank you all for signing up! Ever since we launched AiThoughts, we have expanded our core team, now comprising S Sivaguru, Anil Sane & Diwakar Menon.

We have had a couple of events with large consulting organisations & large IT services companies around AI & DevSecOps and how to package and sell AI services.

We also have about 17 posts on various topics, covering AiOps, Ethics, DevSecOps & Agile SDLC processes, and other posts including games to test your AI Quotient.

What we would like is for you to share case studies, your experiences with AI, and articles of interest you may have come across; spread the word about this community and encourage others to subscribe & contribute to this forum.


HERE’S WHAT YOU MAY HAVE MISSED

Are You Human? Tale of CAPTCHA (L Ravichandran)

Recently I gave a keynote speech at Mahindra University, Hyderabad as part of a 2-day workshop on “Data Science for the Industry”. It was a great opportunity to share my thoughts on Data Science/AIML technologies and industry use cases. I talked about various problems to be solved by these rapidly advancing technologies.

Test Your AI Quotient (S Sivaguru)

Take this fun quiz to find ten words related to the world of AI. These may be acronyms or terms that you would come across while exploring the wide world of Artificial Intelligence, Machine Learning, techniques, applications etc.


SOME RECENT NEWS


Devang Sachdev, Snorkel AI: On easing the laborious process of labelling data

Correctly labelling training data for AI models is vital to avoid serious problems, as is using sufficiently large data sets. However, manually labelling massive amounts of data is time-consuming & laborious. So what’s the middle ground?

OpenAI removes waitlist for DALL-E text-to-image generator

OpenAI has removed the waitlist for its DALL-E service and the text-to-image generator is now publicly available. The original DALL-E debuted in January 2021 to much fanfare. In April this year, DALL-E 2 was released with significant improvements.

Chess: How to spot a potential cheat

The recent controversy involving Magnus Carlsen, who resigned without comment in a game against the nineteen-year-old Niemann, has raised questions of ethics & how to identify cheats in chess.


Keep your (Ai)Thoughts flowing, and if you have an article, news, case study to submit, do send it to lravi@aithoughts.com

Plus ça change – Is ML the new name for Statistics?

Names change, but ideas usually don’t. How is today’s ‘data science’ different from yesterday’s statistics, mathematics and probability?

Actually, it’s not very different. If it seems changed it’s only because the ground reality has changed. Yesterday we had data scarcity, today we have a data glut (“big data”). Yesterday we had our models, and were seeking data to validate them. Today we have data, and seek models to explain what this data is telling us.

 Can we find associations in our data? If there’s association, can we identify a pattern? If there are multiple patterns, can we identify which are the most likely? If we can identify the most likely pattern, can we abstract it to a universal reality? That’s essentially the data science game today.

 Correlation

 Have we wondered why the staple food in most of India is dal-chaval or dal-roti? Why does almost everyone eat the two together? Why not just dal followed by just chaval?

 The most likely reason is that the nutritive benefit when eaten together is more than the benefit when eaten separately. Or think of why doctors prescribe combination drug therapies, or think back to the film Abhimaan (1973) in which Amitabh Bachchan and Jaya Bhaduri discovered that singing together created harmony, while singing separately created discord. Being together can offer a greater benefit than being apart.

 Of course, togetherness could also harm more. Attempting a combination of two business strategies could hurt more than using any individual strategy. Or partnering Inzamam ul Haq on the cricket field could restrict two runs to a single, or, even more likely, result in a run out!

 In data science, we use the correlation coefficient to measure the degree of linear association or togetherness. A correlation coefficient of +1 indicates the best possible positive association; while a value of -1 corresponds to the negative extreme. In general, a high positive or negative value is an indicator of greater association.

 The availability of big data now allows us to use the correlation coefficient to more easily confirm suspected associations, or discover hidden associations. Typically, the data set is a spreadsheet, e.g., supermarket data with customers as rows, and every merchandise sold as a column. With today’s number crunching capability, it is possible to compute the correlation coefficient between every pair of columns in the spreadsheet. So, while we can compute the correlation coefficient to confirm that beer cans and paper napkins are positively correlated (could be a dinner party), we could also unearth a hidden correlation between beer cans and baby diapers.
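To make this concrete, here is a minimal Python sketch of such a pairwise scan. The column names and numbers are invented stand-ins for a real supermarket spreadsheet (rows = customers, columns = merchandise sold):

```python
# A minimal sketch of the pairwise-correlation scan described above.
# The column names and numbers are invented for illustration.
import pandas as pd

basket = pd.DataFrame({
    "beer_cans": [2, 0, 4, 1, 3, 0, 5],
    "napkins":   [10, 1, 16, 4, 12, 2, 20],
    "diapers":   [1, 0, 2, 1, 2, 0, 3],
})

# .corr() computes the Pearson correlation coefficient (between -1 and +1)
# for every pair of columns in one call.
corr = basket.corr()
print(corr)

# Flag strongly associated pairs, e.g. |r| > 0.8.
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > 0.8:
            print(f"{a} and {b} look strongly associated: r = {corr.loc[a, b]:.2f}")
```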

 Why would beer cans and baby diapers be correlated? Perhaps there’s no valid reason, perhaps there’s some unknown common factor that we don’t know about (this has triggered off the ‘correlation-is-not-causation’ discussion). But today’s supermarket owner is unlikely to ponder over such imponderables; he’ll just direct his staff to place baby diapers next to beer cans and hope that it leads to better sales!

 Regression

 If two variables X and Y have a high correlation coefficient, it means that there is a strong degree of linear dependence between them. This opens up an interesting possibility: why not use the value of X to predict the likely value of Y? The prospect becomes even more enticing when it is easy to obtain X, but very hard (or expensive) to obtain Y.

 To illustrate, let us consider the height (X) and weight (Y) data of 150 male students in a class. The correlation coefficient between X and Y is found to be 0.88. Suppose a new student joins. We can measure his height with a tape, but we don’t have a weighing scale to obtain his weight. Is it possible to predict his weight?

 Let us first plot this data on a scatter diagram (see below); every blue dot on the plot corresponds to the height-weight of one student. The plot looks like a dense maze of blue dots. Is there some ‘togetherness’ between the dots? There is (remember the correlation is 0.88?), but it isn’t complete togetherness (because, then, all the dots would’ve aligned on a single line).

 To predict the new student’s weight, our best bet is to draw a straight line cutting right through the middle of the maze. Once we have this line, we can use it to read off the weight of the new student on the Y-axis, corresponding to his measured height plotted on the X-axis.

 How should we draw this line? The picture offers two alternatives: the blue line and the orange line. Which of the two is better? The one that is ‘middler’ through the maze is better. Let us drop down (or send up) a ‘blue perpendicular’ from every dot on to the blue line, and, likewise, an ‘orange perpendicular’ from every dot on to the orange line (note that if the dot is on the line, the corresponding perpendicular has zero length). Now sum the lengths of all the blue and orange perpendiculars. The line with a smaller sum is the better line!

  

X: height; Y: weight

Notice that the blue and orange lines vary only in terms of their ‘slope’ and ‘shift’, and there can be an infinity of such lines. The line with the lowest sum of the corresponding deviations (strictly, the lowest sum of their squares) will be the ‘best’ possible line. We call this the regression line to predict Y using X; and it will look like a1 X + a2, with a1 and a2 being the slope and shift values of this best line. This is the underlying idea in the famed least-square method.
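As an illustration, here is a small Python sketch of fitting such a line by least squares; the height-weight data below is simulated, not the actual class of 150 students described above:

```python
# A small sketch of the least-square idea: predict weight (Y) from
# height (X). The data is simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 8, 150)                    # X, in cm
weight = 0.9 * height - 85 + rng.normal(0, 5, 150)  # Y, in kg, with noise

# polyfit with deg=1 finds the slope a1 and shift a2 that minimise the
# sum of squared vertical deviations from the line.
a1, a2 = np.polyfit(height, weight, deg=1)
print(f"regression line: Y = {a1:.2f} X + {a2:.2f}")

# Read off the predicted weight of a new student of height 175 cm.
print(f"predicted weight: {a1 * 175 + a2:.1f} kg")
```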

 Bivariate to multivariate

Let us see how we can apply the same idea to the (harder) problem of predicting the likely marks (Y) that a student might get in his final exam. The number of hours studied (X1) seems to be a reasonable predictor. But if we compute the correlation coefficient between Y and X1, using sample data, we’ll probably find that it is just about 0.5. That’s not enough, so we might want to consider another predictor variable. How about the intelligence quotient (IQ) of the student (X2)? If we check, we might find that the correlation between Y and X2 too is about 0.5.

 Why not, then, consider both these predictors? Instead of looking at just the simple correlation between Y and X, why not look at the multiple correlation between Y and both X1 and X2? If we calculate this multiple correlation, we’ll find that it is about 0.8.

 And, now that we are at it, why not also add two more predictors: Quality of the teaching (X3), and the student’s emotional quotient (X4)? If we go through the exercise, we’ll find that the multiple correlation keeps increasing as we keep adding more and more predictors.

 However, there’s a price to pay for this greed. If three predictor variables yield a multiple correlation of 0.92, and the next predictor variable makes it 0.93, is it really worth it? Remember too that with every new variable we also increase the computational complexity and errors.

 And there’s another – even more troubling – question. Some of the predictor variables could be strongly correlated among themselves (this is the problem of multicollinearity). Then the extra variables might actually bring in more noise than value!

 How, then, do we decide what’s the optimal number of predictor variables? We use an elegant construct called the adjusted multiple correlation. As we keep adding more and more predictor variables to the pot (we add the most correlated predictor first, then the second most correlated predictor and so on …), we reach a point where the addition of the next predictor diminishes the adjusted multiple correlation even though the multiple correlation itself keeps rising. That’s the point to stop!

Let us suppose that this approach determines that the optimal number of predictors is 3. Then the multiple regression line to predict Y will look like a1 X1 + a2 X2 + a3 X3 + a4, where a1, a2, a3, a4 are the coefficients based on the least-square criterion.
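The stopping rule is easy to try out in code. Below is a sketch in Python with simulated data (all numbers invented); X4 is deliberately made near-collinear with X3, so it adds almost no real information:

```python
# A sketch of the stopping rule described above: add predictors one at a
# time and watch the adjusted R-squared. All data here is simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X1 = rng.normal(size=n)                          # hours studied
X2 = rng.normal(size=n)                          # IQ
X3 = rng.normal(size=n)                          # teaching quality
X4 = 0.95 * X3 + rng.normal(scale=0.3, size=n)   # EQ, highly correlated with X3
Y = 2 * X1 + 1.5 * X2 + X3 + rng.normal(size=n)  # marks

def adjusted_r2(y, predictors):
    X = np.column_stack(predictors + [np.ones(len(y))])  # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # least-square fit
    resid = y - X @ beta
    r2 = 1 - resid.var() / y.var()
    k = len(predictors)
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)

pool = [X1, X2, X3, X4]
for k in range(1, 5):
    print(f"{k} predictor(s): adjusted R^2 = {adjusted_r2(Y, pool[:k]):.4f}")
# The adjusted R^2 stops improving (and typically dips) once the
# redundant X4 enters - that is the point to stop.
```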

 Predictions using multiple regression are getting more and more reliable because there’s so much more data these days to validate. There is this (possibly apocryphal) story of a father suing a supermarket because his teenage daughter was being bombarded with mailers to buy new baby kits. “My daughter isn’t pregnant”, the father kept explaining. “Our multiple regression model indicates a very high probability that she is”, the supermarket insisted. And she was …

 As we dive deeper into multivariate statistics we’ll find that this is the real playing arena for data science; indeed, when I look at the contents of a machine learning course today, I can’t help feeling that it is multivariate statistics re-emerging with a new disguise. As the French writer Jean-Baptiste Alphonse Karr remarked long ago: plus ça change, plus c’est la même chose!

What NOT to say


Teaching chatbots to speak ‘properly’ and ‘decently’

Many of us would have heard about Microsoft’s Tay.ai chatbot, which was released and pulled back within 24 hours in 2016 due to abusive learnings. It took less than 24 hours to corrupt an innocent AI chatbot. What went wrong? Tay.ai’s learning module was excellent, which ironically was the problem – it rapidly learnt swear words, hate language etc. from the large number of people who used abusive language in conversations with it. Unlike the internal filters many of us have, Tay.ai went ahead and learnt from these signals, and started using these phrases and hate language itself. All this happened in less than 24 hours, which forced Microsoft to pull it from public use.

I have been observing how my son and daughter-in-law are teaching my 3-year-old granddaughter about the use of good language.  Basic things like saying ‘Please’, ‘Thank You’, ‘Good morning’, ‘Good night’, etc. In other words, decent and desirable language was taught first.  They have also given strict instructions to us (grandparents) and extended family about what to say – and what not to say – in front of the kid. The child will still hear some ‘bad words’ in school, malls, playgrounds etc. This is beyond the parents’ control. In these cases, they teach the child about how a very few bad people still use ‘bad’ language and good people never use these words, thus starting to lay in the internal filters in my granddaughter’s mind.

We should apply the same principle to these innocent but fast-learning chatbots.  Let us ‘teach’ the chatbot all the ‘good’ phrases like ‘Please’, ‘Thank you’ etc. Let us also ‘teach’ the chatbot about showing empathy, such as saying ‘Sorry that your product is not working.  We will do everything possible to fix it’, ‘Sorry to ask you to repeat as I did not understand your question’, and so on.

Finally, let us create a negative list of ‘bad’ phrases and hate language in all possible variations. English in the UK will have British, Scottish, and Irish variations. Some phrases which are considered acceptable in one area may be objectionable in another; the same goes for Australia, New Zealand, India, Northern USA English, Southern USA English, etc. Let us build internal filters in these chatbots to ignore or unlearn these phrases in the learning process. By looking at the IP address of the user, the bot can identify the geographical location and apply the right language filters.
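As a sketch of how such a filter might sit in front of the learning module, consider the Python outline below. The phrase lists and the IP-to-region lookup are placeholders, not a real implementation; a production system would use curated lists and a proper geolocation service:

```python
# A sketch of a region-aware negative-list filter in front of the
# learning module. Phrase lists and IP lookup are placeholders.
NEGATIVE_LIST = {
    "en-GB": {"bad_phrase_uk_1", "bad_phrase_uk_2"},
    "en-US": {"bad_phrase_us_1"},
    "en-IN": {"bad_phrase_in_1"},
}

def region_from_ip(ip_address: str) -> str:
    # Placeholder: a real bot would query a geolocation service here.
    return "en-GB"

def is_learnable(utterance: str, ip_address: str) -> bool:
    """Return False if the utterance contains a phrase banned for the
    user's region, so the learning module skips (does not learn) it."""
    banned = NEGATIVE_LIST.get(region_from_ip(ip_address), set())
    text = utterance.lower()
    return not any(phrase in text for phrase in banned)

# Learning loop: only learn from utterances that pass the filter, e.g.
#   if is_learnable(user_text, user_ip):
#       model.learn(user_text)
```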

Will this work?  As good parents we have been doing this to teach our kids and grandkids from time immemorial.  Mostly this is working; very few kids grow to become users of hate language.

Will it slow down the machine learning process?  Perhaps a little bit, but this is a price worth paying, compared to having a chatbot use foul language and upset your valuable customers.

You may be wondering if this simple approach is supported by any AI research or whether this is just a grandfather’s tale! There is lots of research in this area that supports my approach.

There are many references to articles on the ‘Seldonian Algorithm’ for AI Ethics. I want to refer to an article titled ‘Developing safer machine learning algorithms’ from UMass Amherst. The authors recommend that the burden of ensuring that ML systems are well-behaved lies with the ML designer and not with the end user, and they suggest a 3-step Seldonian algorithm. Let us look at this.

Step one is an Interface, specified by the user, to define undesirable or bad behaviour. The ML algorithm will use the interface and try as far as possible to avoid these undesirable behaviours.

Step two is High-Probability Constraints: Seldonian algorithms guarantee, with high probability, that they will not cause the undesirable behaviour that the user specified via the interface.

Step three is No Solution Found: Seldonian algorithms must have the ability to say No Solution Found (NSF), to indicate that they were unable to achieve what they were asked.
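A skeletal outline of this three-step structure might look like the Python sketch below; the candidate search and the probability estimate are stubs for illustration, not the actual algorithm from the UMass paper:

```python
# A skeleton of the three-step Seldonian structure described above.
from typing import Callable, Optional

NSF = None  # the "No Solution Found" sentinel of step three

def seldonian_search(
    candidates: list,
    p_undesirable: Callable[[object], float],  # step one: user-defined interface,
                                               # estimates P(bad behaviour) for a model
    delta: float = 0.05,                       # step two: tolerated failure probability
) -> Optional[object]:
    """Return a candidate whose estimated probability of undesirable
    behaviour is at most delta; otherwise admit failure with NSF."""
    for model in candidates:
        if p_undesirable(model) <= delta:  # high-probability constraint
            return model
    return NSF                             # step three: No Solution Found

# Usage sketch: a chatbot adopts a reply policy only if its estimated
# chance of emitting a negative-list phrase is below 5%.
```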

 

Let us consider two examples involving human life to illustrate the interface definitions. Example one is a robot that controls a robotic assembly line. The robot senses that a welding operation has gone out of sync and is causing all welded cars to be defective. The robot controller wants to issue an instruction to immediately stop the assembly line and get the welding station fixed. However, the user knows that abrupt stoppage of the assembly line may cause harm to factory workers who may be at another station on the line. This undesirable decision – immediately stopping the assembly line – needs to be defined in the interface, as it would cause harm to humans, compared with a mere material loss from defective cars.

Example two is an autonomous truck carrying cargo on a hilly road with a cliff on the driving side. A human driver is coming fast in the wrong lane (the human’s fault), approaching the truck for a certain head-on collision. The only desirable outcome for the truck is to fall off the cliff and destroy itself with the cargo, rather than evaluating other “optimal” decisions which may have some probability of hitting the car and harming the human.

In our chatbot good-behavior problem, the undesirable behaviors are usage of the phrases in the ‘Negative List’ for each geographical variation.  The interface will have this list and the logic to identify geographical variations.

I am in discussion with some sponsors for a research project to develop an English-language chatbot etiquette engine.  Initial reactions from the various stakeholders are positive – everyone agrees on the need for an etiquette engine as well as my approach. 

I will be delighted to receive critique and comments from all of you. 

As a closing note, I wanted to tell you that Natural Language Processing (NLP) is taking huge strides. “NLP is eating the ML” is the talk of the town. NLP research, supported by large language models, Transformers etc., is moving way ahead. Investment is going into Q&A, language generation, knowledge management, and unsupervised/reinforcement learning.

In addition to desirable behavior, many other ethical issues need to be incorporated. For example:

  • Transparency: Does everyone know broadly how learning is done and how decisions are taken?

  • Explainability: For every individual decision, if requested, can we explain how the decision was taken?

Also, a lot of current AI/ML algorithms, especially neural-network-based ones, have become black boxes. We expect a shift towards simpler algorithms for enterprise usage.

 

Relevance of Statistics In the New Data Science World


Rajeeva L Karandikar

Chennai Mathematical Institute, India 

Abstract 

With Big Data and Data Science becoming buzzwords, various people are wondering about the relevance of statistics versus pure data driven models.

In this article, I will explain my view that several statistical ideas are as relevant now as they have been in the past.  

 

1 Introduction

For over a decade now, Big Data, Analytics, Data-Science have become buzzwords. As is the trend now, we will just refer to any combination of these three as data-science. Many professionals working in the IT sector have moved to positions in data science and they have picked up new tools. Often, these tools are used as black boxes.  This is not surprising because most of them have little if any background in statistics. 

We can often hear them make comments such as, “With a large amount of data available, who needs statistics and statisticians? We can process the data with various available tools and pick the tool that best serves our purpose.”

We hear many stories of wonderful outcomes coming from what can be termed pure data-driven approaches. This has led to a tendency to simply take a large chunk of available data and push it through an AIML engine to derive ‘intelligence’ out of it, without giving a thought to where the data came from, how it was collected, and what connection the data has with the questions we are seeking answers to. If an analyst were to ask questions about the data – How was it collected? When was it collected? – the answer one frequently hears is: “How does it matter?”

 Later in this article, we will see that it does matter. We will also see that there are situations where blind use of the tools with data may lead to poor conclusions.

As more and more data become available in various contexts, our ability to draw meaningful actionable intelligence will grow enormously. The best way forward is to marry statistical insights to ideas in AIML, and then use the vast computing power available at one’s fingertips. For this to happen, statisticians and AIML experts must work together along with domain experts.

 

Through some examples, we will illustrate how ignoring statistical ideas and thought processes that have evolved over the last 150 years can lead to incorrect conclusions in many critical situations. 

2 Small data is still relevant

First let us note that there is a class of problems where all the statistical theory and methodology developed over the last 150 years continues to have a role – since the data is only in hundreds or at most thousands, never in millions. For example, issues related to quality control, quality measurement, quality assurance etc. only require a few hundred data points from which to draw valid conclusions. Finance – where the term VaR (value-at-risk), essentially a statistical term for the 95th or 99th percentile of the potential loss, has entered the law books of several countries – is another area where the use of data has become increasingly common; and here too we work with a relatively small number of data points. There are roughly 250 trading days in a year, and there is no point going beyond 3 or 5 years into the past, as economic ground realities are constantly changing. Thus, we may have only about 1250 data points of daily closing prices to use for, say, portfolio optimisation, option pricing, or risk management. One can use hourly prices (with 10,000 data points), or even tick-by-tick trading data, but for portfolio optimisation and risk management the common practice is to use daily prices. In election forecasting, psephologists usually work with just a few thousand data points from an opinion poll to predict election outcomes. Finally, policy makers, who keep tabs on various socio-economic parameters in a nation, rely on survey data which of course is not in millions.

One of the biggest problems faced by humanity in recent times is the COVID-19 virus. From March 2020 till the year end, everyone was waiting for the vaccines against COVID-19. Finally in December 2020, the first vaccine was approved and more have followed. Let us recall that the approval of vaccines is based on RCT – Randomised Clinical Trials which involve a few thousand observations, along with concepts developed in statistical literature under the theme Design of experiments. Indeed, most drugs and vaccines are identified, tested and approved using these techniques. 

These examples illustrate that there are several problems where we need to arrive at a decision or reach a conclusion where we do not have millions of data points. We must do our best with a few hundred or few thousand data points. So statistical techniques of working with small data will always remain relevant. 

3 Perils of purely data driven inference

This example goes back nearly 150 years. Sir Francis Galton was a cousin of Charles Darwin, and as a follow up to Darwin’s ideas of evolution, Galton was studying inheritance of genetic traits from one generation to the next. He had his focus on how intelligence is passed from one generation to the next.  Studying inheritance, Galton wrote “It is some years since I made an extensive series of experiments on the produce of seeds of different size but of the same species. It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre; smaller than the parents, if the parents were large; larger than the parents, if the parents were small.” Galton firmly believed that this phenomenon will be true for humans as well and for all traits that are passed on genetically, including intelligence. 

 

To illustrate his point, Galton obtained data on the heights of parents and their (grown-up) offspring. He chose height as it was easy to obtain data on it. His analysis of the data confirmed his hypothesis, quoted above. He further argued that this phenomenon would continue over generations, and its effect would mean that the heights of future offspring would continue to move towards the average height. He argued that the same would happen to intelligence, and thus everyone would eventually have only average intelligence. He chose the title of the paper Regression Towards Mediocrity in Hereditary Stature.

 

The conclusion drawn by Galton is fallacious, as can be seen by analysing the same data with the roles of offspring heights and parents’ mid-heights interchanged, which leads to exactly the opposite conclusion: if the offspring is taller than average, then the average height of the parents will be less than that of the offspring, while if the offspring is shorter than average, then the average height of the parents will be more than that of the child. It can also be seen that the variation in heights (variance of heights) in the two generations was comparable, whereas if there were regression towards the mean, the variance would have decreased. Thus, Galton’s conclusion about regression to mediocrity over generations is not correct. However, the methodology that he developed for the analysis of inheritance of heights has become a standard tool in statistics and continues to be called Regression.

Galton was so convinced of his theory that he looked at the data from only one angle and got confirmation of his belief. This phenomenon is called Confirmation Bias, a term coined by English psychologist Peter Wason in the 1960s.

4 Is the data representative of the population?

Given data, even if it is huge, one must first ask how it was collected. Only after knowing this can one begin to determine if the data is representative of the population.

In India, many TV news channels take a single view either supporting the government or against it.  Let us assume News Channel 1 and News Channel 2 both run a poll on their websites, at the same time, on a policy announced by the government. Even if both sites attract large number of responses, it is very likely that the conclusions will be diametrically opposite, since the people who frequent each site will likely be ones with a political inclination aligned with the website.  This underscores the point that just having a large set of data is not enough – it must truly represent the population in question for the inference to be valid. 

If someone gives a large chunk of data on voter preferences to an analyst and wants her to analyse it and predict the outcome of the next elections, she must start by asking how the data was collected; only then can she decide whether it represents the Indian electorate or not. Suppose, for example, that the data comes from social media posts and messages regarding political questions during the previous few weeks. Less educated, rural, and economically weaker sections are highly underrepresented on social media, and thus conclusions drawn from the opinions of such a group (of social media users) will not give insight into how the Indian electorate will vote. However, the same social media data can be used to quickly assess the market potential of a high-end smartphone – for its target market is precisely those who are active on social media.

  5 Perils of blind use of tools without understanding them

The next example is not one incident but a recurrent theme: trying to evaluate the efficacy of an entrance test for admission, such as the IIT-JEE for admission to the IITs, the CAT for admission to the IIMs, or the SAT or GRE for admission to top universities in the USA. Let us call these benchmark tests, which are open to all candidates; those who perform very well in the benchmark test are shortlisted for admission to the targeted program. The analysis consists of computing the correlation between the score on the benchmark test and the performance of the candidate in the program. Often it is found that the correlation is rather poor, and this leads to discussion of the quality of the benchmark test. What is forgotten or ignored is that performance data is available only for the candidates selected for admission. This phenomenon is known as Selection Bias – the data set consists of only a subset of the whole group under consideration, selected based on some criterion.

This study also illustrates the phenomenon known as Absence of Tail Dependence for the joint normal distribution. Unfortunately, this property is inherited by many statistical models used for risk management, and it is considered one of the reasons for the collapse of global financial markets in 2008.

Similar bias occurs in studies related to health, where for reasons beyond the control of the team undertaking the study, some patients are no longer available for observation. The bias it introduces is called Censoring Bias and how to account for it in analysis is a major theme in an area known as Survival Analysis in statistics. 

6 Correlation does not imply causation

Most data-driven analysis can be summarised as trying to discover relationships among different variables – and this is what correlation and regression are all about. These were introduced by Galton about 150 years ago and have been a source of intense debate ever since. One of the myths in pure data analysis is the assumption that correlation implies causation. This need not be true in all cases, and one needs to go beyond simple linear association, using transformations, to get to the more complex relationships.

One example often cited is where X is the sale of ice-cream in a coastal town in Europe and Y is the number of deaths due to drowning (while swimming in the sea, in that town) in the same month. One sees strong correlation! While there is no reason as to why eating more ice-creams would lead to more deaths due to drowning, one can see that they are strongly correlated to a variable Z = average of the daily maximum temperature during the month; in summer months more people eat ice-cream, and more people go to swim! In such instances, the variable Z is called a Confounding Variable.
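A quick simulation makes the role of the confounding variable vivid. In the Python sketch below (all numbers invented), X and Y are each driven by Z; they show a strong raw correlation, which largely vanishes once we control for Z:

```python
# A simulated version of the ice-cream/drowning example: X and Y are
# both driven by temperature Z, so they correlate strongly even though
# neither causes the other.
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(25, 5, 500)             # average daily maximum temperature
X = 3.0 * Z + rng.normal(0, 5, 500)    # ice-cream sales
Y = 0.5 * Z + rng.normal(0, 2, 500)    # drowning deaths

print("corr(X, Y) =", round(np.corrcoef(X, Y)[0, 1], 2))  # strong but spurious

# Control for the confounder: correlate the residuals of X and Y after
# regressing each on Z. The apparent association largely disappears.
rx = X - np.polyval(np.polyfit(Z, X, 1), Z)
ry = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
print("corr(X, Y | Z) =", round(np.corrcoef(rx, ry)[0, 1], 2))
```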

In today’s world, countrywide data is available for a large number of socio-economic variables, as well as variables related to health, nutrition, hygiene, pollution, the economy and so on – say, about 500 variables with data on over 100 countries. One is likely to observe correlations among several pairs of these 500 variables. One such recent observation: the Gross Domestic Product (GDP) of a country and its number of deaths per million population due to COVID-19 are strongly correlated! Of course, there is no reason why richer or more developed countries should have more deaths.

Just as linear relationships may be spurious, the relationships discovered by AIML algorithms may also be. Hence learning from the statistical literature going back a century is needed to weed out spurious conclusions and find the right relationships for business intelligence.

 

7 Simpson’s paradox and the omitted variable bias

Simpson’s Paradox is an effect wherein ignoring an important variable may reverse the conclusion. A classic example is the study of gender bias in graduate school admissions to the University of California, Berkeley. In 1973, it was alleged that there was a gender bias in graduate admissions – the acceptance ratio among males was 44% while among females it was 35%. When the statisticians at Berkeley tried to identify which department was responsible for this, they looked at department-wise acceptance ratios and found that, if anything, there was a bias against males. The apparent bias in the pooled data appeared because many more women applied to departments which had lower acceptance rates. The variable department in this example is called a confounding factor. In the economics literature, the same phenomenon is also called Omitted Variable Bias.
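The reversal is easy to reproduce with toy numbers (these are invented, not the actual 1973 Berkeley figures):

```python
# Simpson's paradox with made-up numbers: each department admits women
# at a HIGHER rate, yet the pooled rate favours men, because far more
# women applied to the harder department.
admissions = {
    # dept: (men_applied, men_admitted, women_applied, women_admitted)
    "Dept A (easy)": (800, 480, 100, 70),   # men 60%, women 70%
    "Dept B (hard)": (200, 40, 900, 225),   # men 20%, women 25%
}

tm = ta = tw = twa = 0
for dept, (ma, mad, wa, wad) in admissions.items():
    print(f"{dept}: men {mad / ma:.0%}, women {wad / wa:.0%}")
    tm += ma; ta += mad; tw += wa; twa += wad

print(f"Pooled: men {ta / tm:.0%}, women {twa / tw:.0%}")  # ordering reverses
```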

8 Selection Bias and World War II

During World War II, based on a simple analysis of the data obtained from damaged planes returning to base after bombing raids, it was proposed to the British air force that armour be added to those areas that showed the most damage. Professor Abraham Wald of Columbia University, a member of the Statistical Research Group (SRG), was asked to review the findings and recommend how much extra armour should be added to the vulnerable parts.

Wald looked at the problem from a different angle. He realised that there was a selection bias in the data presented to him – only the aircraft that did not crash returned to base and made it into the data. Wald assumed that the probability of being hit in any given part of the plane was proportional to its area (since the shooters could not aim at any specific part of the plane). Also, given that there was no redundancy in aircraft at that time, the effect of hits on a given area of the aircraft was independent of the effect of hits on any other area. Once he made these two assumptions, the conclusion was obvious: armour should be added in the parts where fewer hits had been observed. So, statistical thinking led Wald to a model that gave the right frame of reference connecting the data (hits on planes that returned) and the desired conclusion (where to add armour).
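A small simulation captures Wald's insight; the section names and probabilities below are invented for illustration:

```python
# Hits are uniform over the plane's sections, but hits to a highly
# vulnerable section bring the plane down, so returning planes show
# FEW hits there.
import random

random.seed(3)
sections = ["engine", "fuselage", "wings", "tail"]
p_down = {"engine": 0.8, "fuselage": 0.1, "wings": 0.1, "tail": 0.2}

observed = {s: 0 for s in sections}
for _ in range(10_000):
    hit = random.choice(sections)      # shooters cannot aim: uniform hits
    if random.random() > p_down[hit]:  # the plane survives and returns
        observed[hit] += 1

# The section with the FEWEST observed hits is the most vulnerable one,
# i.e. exactly where the armour should go.
for s in sections:
    print(f"{s:9s}: {observed[s]} hits seen on returning planes")
```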

9 GMGO (Garbage-Model-Garbage-Out), the new GIGO in the Data Science world

The phrase Garbage-In-Garbage-Out (GIGO) is often used to describe the fact that even with the best of algorithms, if the input data is garbage, then the conclusion (output) is also likely to be garbage. Our discussion adds a new phenomenon, GMGO: a garbage model will lead to garbage output even with accurate data!

 10 Conclusion

We have given examples where disregarding statistical understanding digested over 150 years can lead to wrong conclusions. While in many situations pure data-driven techniques can do OK, combining them with domain knowledge and statistical techniques can do wonders in terms of unearthing valuable business intelligence to improve business performance.

We recommend data-driven AI/ML models as a good starting point, or a good exploratory step. Using domain knowledge to avoid the various pitfalls discussed in this paper can then take the analysis to a much higher level.