Synthetic Data: A Double-Edged Sword?

When LLMs started making inroads into the technology space over the last few years, training relied heavily on publicly accessible internet data. This data, in various forms including audio, video, images, and text, is full of subtleties and nuances, resulting in rich training for LLMs. However, the original creators of this data are becoming aware of their rights and are starting to restrict access or set commercial terms. As a result, there is a growing inclination to train the next generation of LLMs on synthetic data generated by existing models.

In many industries, especially those with specialized or highly regulated applications such as manufacturing automation, generating sufficient real-world training data can be challenging, time-consuming, and expensive. Synthetic data offers a viable alternative by simulating scenarios and conditions that are difficult or impossible to replicate in the real world, including edge cases and anomalies. For rare events such as equipment failures, synthetic data can augment the limited real-world examples.
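As an illustration of that last point, the sketch below augments a handful of recorded failure events by jittering them with small amounts of noise. It is a minimal, assumption-laden example: the sensor features, their values, and the noise-based augmentation strategy are all hypothetical, and a real deployment would more likely rely on physics-based simulation or domain-specific generators.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A handful of real, recorded failure events (hypothetical sensor features:
# temperature in degrees C, vibration in mm/s, pressure in bar).
real_failures = np.array([
    [412.0, 9.8, 3.1],
    [405.5, 11.2, 2.9],
    [398.7, 10.4, 3.4],
])

def augment_failures(samples: np.ndarray, n_synthetic: int,
                     noise_scale: float = 0.02) -> np.ndarray:
    """Create synthetic failure examples by jittering real ones.

    Each synthetic sample is a randomly chosen real sample perturbed with
    Gaussian noise proportional to the feature magnitude; a deliberately
    simple strategy that a domain-specific simulator would improve on.
    """
    idx = rng.integers(0, len(samples), size=n_synthetic)
    base = samples[idx]
    noise = rng.normal(loc=0.0, scale=noise_scale * np.abs(base))
    return base + noise

synthetic_failures = augment_failures(real_failures, n_synthetic=200)
print(synthetic_failures.shape)  # (200, 3): 200 synthetic failure records
```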

Researchers warn (Reference 1) that increasing use of synthetic data could lead to a phenomenon called ‘Model Collapse’. The paper argues that the indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. The paper defines ‘Model collapse’ as a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation. Being trained on polluted data, they then mis-perceive reality.

Slowly but steadily, the information on the web will be flooded with content generated by current or earlier versions of LLMs, which will in turn feed back into training. The recursive nature of this process can cause models to drift further away from the real-world distribution, compromising their accuracy in representing the world. We must remember that LLM outputs are not always perfect. With such recursive degradation, the content starts losing the diversity, subtleties, and nuances that are characteristic of human-generated data. As a result, subsequent generations of LLMs produce content that is increasingly homogeneous, lacking the richness and variety of human experience and less connected to the real world.
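The tail-loss effect can be illustrated with a toy simulation of our own (a simplified sketch, not the experiment from Reference 1): repeatedly fit a simple generative model to samples drawn from the previous generation's fit, and watch the estimated spread shrink as the tails of the original distribution disappear.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N = 50            # samples available per generation (deliberately small)
GENERATIONS = 200

# Generation 0: "human" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=N)

for gen in range(1, GENERATIONS + 1):
    # Fit a trivially simple generative model (just a mean and a std) to the
    # data, then train the next generation only on samples from that model.
    mu, sigma = data.mean(), data.std()
    data = rng.normal(loc=mu, scale=sigma, size=N)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.3f}")

# With each refit-and-resample cycle, rare tail values are less likely to
# survive in the finite sample, so the estimated spread tends to shrink over
# the generations: a toy analogue of the tail loss described in Reference 1.
```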

This problem has the potential to become more acute in specialized applications like manufacturing automation, where the consequences of model collapse can be severe and costly. Examples of the fallout include:

  • The model's ability to accurately predict or classify manufacturing data (e.g., sensor readings, product quality) may deteriorate, leading to more mistakes in tasks like defect detection, predictive maintenance, or process optimization.
  • The model may become overly specialized to the synthetic data it was trained on, struggling to generalize to real-world manufacturing scenarios or to adapt to changes in processes and conditions.
  • Incorrect decisions in critical applications like quality control or safety monitoring could lead to product defects, equipment failures, or even accidents.
  • Frequent errors and breakdowns caused by model collapse can increase maintenance and repair costs.
  • As the model's performance degrades, trust in its capabilities may erode, leading to reluctance to rely on it for critical tasks.

While model collapse is a significant concern in the field of language models, experts have proposed several strategies to mitigate its risks, as listed below. But, as with many other things, the devil will be in the details: putting these strategies into practice will never be a trivial exercise.

  • Diverse and High-Quality Training Data: Ensure that the model is trained on a diverse and representative dataset that accurately reflects the real world. Prioritize human-generated content over synthetic data to help the model remain grounded in reality.
  • Regular Evaluation and Monitoring: Continuously monitor the model's performance using appropriate metrics to detect signs of degradation; a sketch of one such check appears after this list. Implement systems to identify potential issues before they escalate.
  • Data Augmentation Techniques: Use synthetic data augmentation techniques judiciously, ensuring that it complements real-world data and doesn’t introduce biases.
  • Model Architecture and Training Methods: Employ regularization and related techniques to prevent overfitting, and combine multiple models (ensembling) to improve robustness and reduce the risk of catastrophic failure.
  • Human Oversight and Intervention: Incorporate human feedback loops to guide the model’s learning and correct errors. Implement safeguards to prevent the model from generating harmful or biased content.
  • Transparency and Explainability: Develop techniques to understand the inner workings of the model and identify potential vulnerabilities.
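To make the "Regular Evaluation and Monitoring" point concrete, here is a minimal sketch of one possible check: comparing the distribution of a model quality signal (a hypothetical per-output diversity score) between a frozen reference set and the model's recent outputs, using a two-sample Kolmogorov-Smirnov test. The metric, the threshold, and the data are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=1)

# Hypothetical per-output scores (e.g., a lexical-diversity metric) computed
# on a frozen reference set and on the model's most recent outputs.
reference_scores = rng.normal(loc=0.62, scale=0.08, size=1_000)
current_scores = rng.normal(loc=0.55, scale=0.05, size=1_000)  # less diverse

def drift_detected(reference, current, alpha: float = 0.01) -> bool:
    """Flag distribution drift via a two-sample Kolmogorov-Smirnov test."""
    statistic, p_value = ks_2samp(reference, current)
    print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.2e}")
    return p_value < alpha

if drift_detected(reference_scores, current_scores):
    print("Warning: output distribution has drifted from the reference set.")
```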

While these recommendations are expected to help mitigate the risk of model collapse, can this problem be eliminated? Is this a real problem or hypothetical? Time will tell.

It is our view at AiThoughts.org that, as language models continue to evolve, new strategies and approaches will be employed by model vendors to address this ‘model collapse’ challenge.

References:

  1. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson & Yarin Gal, 'AI models collapse when trained on recursively generated data', Nature (2024). https://www.nature.com/articles/s41586-024-07566-y

Generative AI in Product Genealogy Solution in Manufacturing

The demand for guaranteed product quality through comprehensive traceability is rapidly spreading beyond the pharmaceutical industry and into other manufacturing sectors. This rising demand stems from both increased customer awareness and stricter regulations. To address this need, manufacturers are turning to Product Traceability, also known as Product Genealogy, solutions.

Efforts over the past 4-5 years, even by Micro, Small and Medium Enterprises (MSMEs), to embrace digitalization and align with Industry 4.0 principles have paved the way for the deployment of hybrid Product Genealogy solutions. These solutions combine digital technology with human interventions. However, the emergence of readily available and deployable Generative AI models presents a promising opportunity to further eliminate human intervention, ultimately boosting manufacturing profitability.

To illustrate this potential, let’s consider the Long Steel Products Industry. This industry encompasses a diverse range of products, from reinforcement bars (rebars) used in civil construction with less stringent requirements, to specialized steel rods employed in demanding applications like automobiles and aviation.

The diagram below gives a high-level view of the manufacturing process stages.

Beyond the core process automation achieved under Industry 3.0, steel manufacturers have embraced digitalization through Visualization Solutions. These solutions leverage existing sensors, supplemented by new ones and IIoT (Industrial IoT) technology, to transform data collection. They gather data from the production floor, send it to cloud-hosted Visualization platforms, and process it into meaningful textual and graphical insights presented through dashboards. This empowers data-driven decision-making by providing valuable management insights and significantly improves efficiency, accuracy, and decision speed, ultimately benefiting the bottom line.

However, human involvement remains high in decision-making, defining actions, and implementing them on the production floor. This is where Generative AI, a disruptive technology, enters the scene.

Imagine a production process equipped with a pre-existing Visualization solution, constantly collecting data from diverse sensors throughout the production cycle. Let’s explore how Generative AI adds value in such a plant, specifically focusing on long steel products where each batch run (“campaign”) typically produces rods/bars with distinct chemical compositions (e.g., 8mm with one composition, 14mm with another).

Insights and Anomalies

  • Real-time data from diverse production sensors (scrap sorting, melting, rolling, cooling) feeds into a Time-Series database. This multi-modal telemetry data (temperature, pressure, chemical composition, vibration, visual information, etc.) fuels a Visualization platform generating predefined dashboards and alerts. With training and continuous learning, Generative AI models analyse this data in real time, identifying patterns and deviations not envisaged by predefined expectations. These AI-inferred insights, alongside predefined alerts, highlight potential issues like unexpected temperature spikes, unusual pressure fluctuations, or off-spec chemical composition (a simplified sketch of such a check follows this list).
  • If trained on historical and ongoing ‘action taken’ data, the AI model can generate partial or complete configurations (“recipes”) for uploading to PLCs (Programmable Logic Controllers). These recipes, tailored for specific campaigns based on desired results, adjust equipment settings like temperature, cooling water flow, and conveyor speed. The PLCs then transmit these configs to equipment controllers, optimizing production for each unique campaign.
  • Individual bars can be identified within a campaign using QR code stickers, engraved codes, or even software-generated IDs based on sensor data. This ID allows the AI to link process and chemical data (known as ‘Heat Chemistry’) to each specific bar. This information helps identify non-conforming products early, preventing them from reaching final stages. For example, non-conforming bars can be automatically separated at the cooling bed before reaching bundling stations.
  • Customers can access detailed information about the specific processes and materials used to create their steel products, including actual chemistry and physical quality data points. This transparency builds trust in the product’s quality and origin, differentiating your brand in the market.
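To make the first bullet above more tangible, the sketch below flags off-nominal furnace temperature readings with a simple rolling z-score check. It is a deliberate simplification: the sensor name, window size, and threshold are assumptions, and in practice such rule-based flags would complement, not replace, the trained AI model's own inferences.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

# Simulated one-minute furnace temperature telemetry for a single campaign,
# with an injected spike standing in for a real process deviation.
timestamps = pd.date_range("2024-01-01 08:00", periods=240, freq="min")
temperature = rng.normal(loc=1520.0, scale=4.0, size=len(timestamps))  # deg C
temperature[180:185] += 35.0  # unexpected temperature spike

telemetry = pd.DataFrame({"timestamp": timestamps, "furnace_temp_c": temperature})

def flag_anomalies(series: pd.Series, window: int = 30,
                   threshold: float = 3.0) -> pd.Series:
    """Mark points whose rolling z-score exceeds the threshold."""
    rolling_mean = series.rolling(window, min_periods=window).mean()
    rolling_std = series.rolling(window, min_periods=window).std()
    z_score = (series - rolling_mean) / rolling_std
    return z_score.abs() > threshold

telemetry["anomaly"] = flag_anomalies(telemetry["furnace_temp_c"])
print(telemetry.loc[telemetry["anomaly"], ["timestamp", "furnace_temp_c"]])
```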

Enriched Data Records

  • The AI model's capabilities extend beyond mere interpretation of raw sensor data: it actively enriches the data with additional information. This enrichment process encompasses:
    • Derived features: AI extracts meaningful variables from sensor data, such as calculating cooling rates from temperature readings or estimating carbon content from spectral analysis (a sketch of this appears after the list).
    • Contextualization: AI seamlessly links data points to specific production stages, equipment used, and even raw material batch information, providing a holistic view of the manufacturing process.
    • Anomaly flagging: AI vigilantly marks data points that deviate from expected values, making critical events easily identifiable and facilitating prompt corrective actions. This also helps in continuous learning by the AI model.
  • This enriched data forms a comprehensive digital history for each bar, providing invaluable insights that fuel process optimization and quality control initiatives.
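As an illustration of the 'Derived features' item, the following sketch computes a mean cooling rate from a single bar's temperature trace and attaches it, with an in-range flag, to that bar's record. The column names, trace values, bar ID, and acceptable-range limits are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Temperature trace for one identified bar on the cooling bed
# (elapsed seconds since it left the rolling mill; temperature in deg C).
trace = pd.DataFrame({
    "elapsed_s": np.arange(0, 300, 30),
    "temp_c": [980, 905, 842, 788, 741, 700, 664, 632, 604, 580],
})

def enrich_bar_record(bar_id: str, trace: pd.DataFrame,
                      min_rate: float = 0.8, max_rate: float = 1.6) -> dict:
    """Derive a mean cooling rate (deg C/s) and flag it if outside the band."""
    rates = -np.gradient(trace["temp_c"], trace["elapsed_s"])  # positive = cooling
    mean_rate = float(rates.mean())
    return {
        "bar_id": bar_id,
        "mean_cooling_rate_c_per_s": round(mean_rate, 3),
        "cooling_rate_ok": min_rate <= mean_rate <= max_rate,
    }

print(enrich_bar_record("CAMPAIGN42-BAR-0173", trace))
```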

While the aforementioned functionalities showcase Generative AI’s immediate impact on traceability, its potential extends far beyond. Trained and self-learning models pave the way for advancements like predictive maintenance, product simulation, waste forecasting, and even autonomous recipe management. However, these exciting future applications lie beyond the scope of this blog.

Despite its nascent stage in long steel product genealogy, Generative AI is already attracting significant attention from various companies and research initiatives. This growing interest underscores its immense potential to revolutionize the industry.

Challenges and Considerations

  • Data Quality and Availability: The success of AI-powered traceability hinges on accurate and complete data throughout the production process. Integrating AI with existing infrastructure and ensuring data consistency across systems pose significant challenges.
  • Privacy and Security Concerns: Sensitive data about materials, processes, and customers must be protected. Secure data storage, robust access control mechanisms, and compliance with relevant regulations are paramount.
  • Scalability and Cost-Effectiveness: Implementing AI-based solutions requires investment in hardware, software, and expert skills, and scaling them to large facilities and complex supply chains adds further complexity. Careful ROI analysis and strategic planning are crucial to avoid budget overruns.

By addressing these challenges and unlocking the power of Generative AI, manufacturers can establish robust and transparent product traceability systems. This, in turn, will lead to enhanced product quality, increased customer trust, and more sustainable practices.

GenAi & LLM: Impact on Human Jobs

I met the IT Head of a leading manufacturing company at a social gathering. When he told me with conviction that current AI progress is destructive for jobs done by humans and that doomsday is coming, I realized that many people carry a similar opinion, one that I felt needed to be corrected.

A good starting point for understanding the impact of AI on jobs done by humans today is the World Economic Forum's white paper published in September 2023 (Reference 1). It gives us a fascinating glimpse into the future of work in the era of Generative AI (GenAi) and Large Language Models (LLM). The report sheds light on the intricate dance between Generative AI and the future of employment, revealing nuanced trends that are set to reshape the job market. A few key messages from the paper follow.

At the heart of the discussion is the distinction between jobs that are ripe for augmentation and those that face the prospect of automation. According to the report, jobs that involve routine, repetitive tasks are at a higher risk of automation. Tasks that can be easily defined and predicted might find themselves in the capable hands of AI. Think data entry, basic analysis, and other rule-based responsibilities. LLMs, with their ability to understand and generate human-like text, excel in scenarios where the tasks are well-defined and can be streamlined.

However, it’s not a doomsday scenario for human workers. In fact, the report emphasizes the idea of job augmentation rather than outright replacement. This means that while certain aspects of a job may be automated, there’s a simultaneous enhancement of human capabilities through collaboration with LLMs. It’s a symbiotic relationship where humans leverage the strengths of AI to become more efficient and dynamic in their roles. For instance, content creation, customer service, and decision-making processes could see a significant boost with the integration of LLMs.

Interestingly, the jobs that seem to thrive in this evolving landscape are the ones requiring a distinctly human touch. Roles demanding creativity, critical thinking, emotional intelligence, and nuanced communication are poised to flourish. LLMs, despite their impressive abilities, still grapple with the complexity of human emotions and the subtleties of creative expression. This places humans in a unique position to contribute in ways that machines currently cannot. At the same time, the ability of LLMs to understand context, generate human-like text, and assist in complex problem-solving positions them as valuable tools that support humans in these very roles.

Imagine a future where content creation becomes a collaborative effort between human creativity and AI efficiency, or where customer service benefits from the empathetic understanding of LLMs. Decision-making processes, too, could see a paradigm shift as humans harness the analytical prowess of AI to make more informed and strategic choices.

The report also points to the creation of entirely new roles, so-called emerging jobs. Ethics and Governance Specialist is one such example.

The paper also brings together a useful view of job exposure by functional area and by industry group, ranking a large number of jobs by their augmentation and automation potential, to give the reader a concrete feel for the trends described above.

In essence, the report paints a picture of a future where humans and AI are not adversaries but partners in progress. The workplace becomes a dynamic arena where humans bring creativity, intuition, and emotional intelligence to the table, while LLMs contribute efficiency, data processing power, and a unique form of problem-solving. The key takeaway is one of collaboration, where the fusion of human and machine capabilities leads to a more productive, innovative, and engaging work environment. So, as we navigate this evolving landscape, it’s not about job replacement; it’s about embracing the opportunities that arise when humans and LLMs work hand in virtual hand.

 

References:

  1. Jobs of Tomorrow: Large Language Models and Jobs, September 2023. A World Economic Forum (WEF) white paper jointly authored by WEF and Accenture. https://www3.weforum.org/docs/WEF_Jobs_of_Tomorrow_Generative_AI_2023.pdf