Preface
AI/ML solutions have become an integral part of the IT-enabled solutions provided to businesses.
We examined various life cycle models, including the conventional SDLC, which are not entirely suited to data science projects because of their broader scope, experimental nature, data dependency, creative yet often chaotic and non-linear process, and relatively intangible deliverables (in the form of knowledge and insight).
We also examined DevOps and DevSecOps practices, which promote repeatability and provide an overarching ecosystem for continuously building and deploying solutions in a systematic manner. In addition, MLOps practices cater to the requirements of building and deploying AI/ML solutions in production in a systematic and secure manner. This ecosystem supports continuous experimentation, learning, building, deployment, and monitoring at scale.
In this part, we discuss the stages of the AI/ML CI/CD life cycle, with key security considerations at every stage. The intent is to build a set of practices and processes that help an organization securely build and maintain AI/ML solutions in a consistent manner. Towards the latter half of this write-up, we touch on the overall AI/ML operations ecosystem, which is essential for building, maintaining, and monitoring AI/ML solutions.
In subsequent write-ups, we will cover each of these areas in more detail: planning, effective testing, performance management, maintenance and currency of the solutions, maturity mechanisms, and so on. This will include developing an overall ecosystem comprising legacy solutions and AI/ML solutions, and the transformation into such an ecosystem.
AI/ML CI/CD Life Cycle
Below is a high-level functional representation of the standard life cycle stages that AI/ML projects adopt to deliver an appropriate solution in a consistent and secure manner. The diagram illustrates a few key security considerations in the ML CI/CD cycle; more details are available in the table that follows.
As depicted in the diagram, the AI/ML life cycle typically has the following stages; a minimal pipeline sketch follows the list. The stages may vary for different model types (supervised or unsupervised) or other techniques such as NLP, deep learning, and so on.
- Problem Definition – define the problem, stakeholders, environment, available data, goals, and performance expectations
- Data Build – data collection, collation, and annotation; building the data pipeline
- Model Train and Build – feature engineering, model building, and testing; building the model pipeline
- Test – evaluation, validation, and quality assurance of the model before deployment
- Integrate and Release – freeze the code baseline, baseline versioning, release notes, release infrastructure readiness
- Deployment – deploying the model into production, either independently or integrated with other applications, as decided by the serving mechanism
- Model Serving – serving mechanism, serving performance, adjustments
- Monitoring – monitoring performance throughout the life cycle; fine-tuning, adjusting, or retiring the model based on its performance and changes in the overall ecosystem
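To make the flow concrete, below is a minimal sketch in Python of how these stages might be chained in an automated pipeline. The stage functions, thresholds, and return values are illustrative placeholders, not a prescribed implementation; in practice each stage would be a separate, automated pipeline job with its own security checks.

```python
# Illustrative skeleton of the life cycle stages chained as one pipeline.
# All function names and values are hypothetical placeholders.

def define_problem():
    # Problem Definition: goals, data sources, metrics, threat model (RATP)
    return {"goal": "example-classification", "metric": "accuracy", "threshold": 0.85}

def build_data(problem):
    # Data Build: ingest, cleanse, label, split into train/validation/test
    return {"train": [], "validation": [], "test": []}

def train_model(data):
    # Model Train and Build: feature engineering, training, unit tests, static analysis
    return {"name": "example-model", "version": "1.0.0"}

def evaluate_model(model, data, problem):
    # Test: evaluation on test data, bias validation, security testing
    return {"accuracy": 0.90, "bias_checks_passed": True}

def release_and_deploy(model, report, problem):
    # Integrate & Release / Deploy: gate promotion on agreed thresholds and checks
    if report["accuracy"] < problem["threshold"] or not report["bias_checks_passed"]:
        raise RuntimeError("Release gate failed; model not promoted")
    return {"deployed": model["name"], "version": model["version"]}

def run_pipeline():
    problem = define_problem()
    data = build_data(problem)
    model = train_model(data)
    report = evaluate_model(model, data, problem)
    deployment = release_and_deploy(model, report, problem)
    return deployment  # Model Serving and Monitoring then continue in production

if __name__ == "__main__":
    print(run_pipeline())
```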
The following table briefly describes, for each stage, the typical tasks performed, the expected outputs, and, more importantly, the security considerations that may be employed.
Life Cycle Stages – Key Tasks, Output and Security Considerations
| Stage | Tasks | Output | Security Considerations |
| --- | --- | --- | --- |
| Problem Definition | Brainstorm problems; define boundaries, goals, thresholds; define data sources, types, frequency; define the ideal outcome, visualization, and usage; define metrics and success/failure criteria; define resources; define the methodology and proposed models to be used; define an overall implementation project plan; evaluate threats, vulnerabilities, and remedies | Clearly laid-out problem statement with defined data needs, ideal outcome, infrastructure needs, resource competencies, measures of success, and goals; clearly defined project plan covering timelines, schedule, cost, delivery artefacts, and release schedule; threat management plan (RATP) | Identify vulnerabilities, possible attack scenarios, and probable risks to the data, model, infrastructure, and overall system; define probable mitigation actions, creating a Risk Analysis and Treatment Plan (RATP) |
| Data Build | Collect/ingest data from sources; cleanse (missing or improper values, outliers); transform (labelling, annotation, feature engineering: devise, build/extract, and select features); analyse for meaningfulness, completeness, fairness, etc.; build training, validation, and test data repositories; verify data and data-build scripts (static code analysis); check for data bias; define data coverage and prediction-accuracy rules/thresholds; study data patterns and statistical properties to decide on an appropriate model | Labelled, annotated data with explicit features identified; training, validation, and test data repositories; vulnerabilities identified in data, features, and data-build scripts | Secure databases, APIs, features, infrastructure, and data in transformation; analyse data-formation and transformation scripts with static code analysers; data privacy compliance requirements (such as GDPR, HIPAA) |
| Model Build | Select appropriate model(s); build model(s) by defining a model pipeline (code, unit tests, security tests); train model(s) with the training data (including the data pipeline); evaluate with the validation data; refine model(s) as required; containerize (build and package) everything into an image/build file; unit test; store artefacts in the artefact repository; version the model code; static code analysis; simulate run-time behaviour where possible | Trained, evaluated model(s); container package; unit test reports; training and evaluation metrics and reports; version-controlled artefacts (code, data, evaluation reports); static code analysis report | Application code scans using SAST tools to identify security risks in software, libraries, containers, and other artefacts, and to ensure code coverage and adherence to coding standards; analysis of source code, byte code, and binaries; separate secure build, staging, and production environments |
| Test | Deployment on staging; model evaluation with test data (including fraud data and data-injection tests); integration testing; UI testing; API testing; penetration testing; model bias validation; test defect remediation; model refinement | Test reports; remediation reports; tested model(s) | Black-box testing using DAST tools; checks on memory consumption, resource usage, encryption algorithms, privileges, cross-site scripting, SQL injection, third-party interfaces, cookie manipulation, etc.; test data security and anomaly testing; model bias evaluation |
| Integrate & Release | Freeze code and feature list; appropriate configuration and versioning of all artefacts; release notes creation | Code and feature list; release notes | |
| Deploy | Perform security audit and remediation; deployment on production; test on deployment (smoke testing); vulnerability testing for containers; verification of infrastructure-as-code automation scripts | Release deployed on production; security audit reports; smoke test reports; infrastructure-as-code automation run reports | Infrastructure security: scan infrastructure-as-code templates, Kubernetes application manifests, and container images; scan model algorithms in production and versions promoted from staging to production |
| Operate (Model Serving) | Monitor model performance on live data (alerts, KPIs, under/over-fitting, prediction accuracy, etc.); learn and refine models as required; remove or retire models as required; monitor the integrated functionality; security event management; monitor triggers, alarms, and error logs; remediation as per SOP (incident management, automatic event handling) | Model performance and KPI reports; incidents and events managed and addressed; change management on models; refined models and artefacts, including a list of models removed/retired | Model security, data security, API security, infrastructure security, pipeline security, output/UI security |
As is evident, it is important to plan for security checks from the very beginning and at every stage in order to build secure and unbiased AI/ML solutions.
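As a simple illustration of such embedded checks, the sketch below shows two of the gates called out in the table, in Python: a data-balance check before training (Data Build) and a live prediction-accuracy alert during serving (Operate). The thresholds and function names are hypothetical examples, not prescribed values.

```python
from collections import Counter

def check_class_balance(labels, max_ratio=3.0):
    """Data Build stage: flag gross class imbalance before training.
    `max_ratio` is an illustrative threshold, not a prescribed value."""
    counts = Counter(labels)
    most, least = max(counts.values()), min(counts.values())
    if least == 0 or most / least > max_ratio:
        raise ValueError(f"Class balance check failed: {dict(counts)}")
    return dict(counts)

def check_live_accuracy(y_true, y_pred, threshold=0.85):
    """Operate stage: alert when live prediction accuracy drops below a threshold."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / max(len(y_true), 1)
    if accuracy < threshold:
        # In a real pipeline this would raise an alert or trigger a retraining review.
        print(f"ALERT: live accuracy {accuracy:.2f} below threshold {threshold:.2f}")
    return accuracy

# Usage with illustrative values:
check_class_balance([0, 1, 1, 0, 1])
check_live_accuracy([1, 0, 1, 1], [1, 0, 0, 1], threshold=0.85)
```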
MLOps – An Ecosystem
While the section above describes the life cycle stages of AI/ML development projects, an overall MLOps ecosystem provides the environments and areas for experimentation, building, continuous training, data management, continuous evaluation, integration with other applications, and deployment (stand-alone as a microservice or integrated with other customer applications), as represented in the functional diagram below.
Typical areas covered are:
- Requirement Management – managing frequent requirements and problem definitions, especially new requirements, feedback from existing models in production, and newly available data
- Model Development – experimentation, research, development, evaluation, go/no-go decisions, and overall model security
- Data and Feature Management – overall data estate management, management of the data pipeline, metadata related to models, scripts for building data instances and features, and overall data security
- Continuous Training – an environment that provides continuous training for existing and newly built models
- Continuous Evaluation – mechanisms to evaluate models while building or refining them, testing the efficacy of the model with test and real data
- Continuous Deployment – systematic processes and pipelines for continuous integration and deployment
- Serving Management – managing production serving, verifying serving methods, capturing serving results, and tuning and refining as required
- Continuous Monitoring & Improvement – monitoring the health of models, tracking additions or changes in data behaviour, continuously improving model performance, and removing or retiring models as required
- Repository/Registry Management – managing the overall repositories for models, data, features, pipelines, etc.; this includes establishing version control and baseline traceability and encouraging reuse (a minimal registry sketch follows this list)
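As a simplified illustration of repository/registry management, the following Python sketch shows a minimal in-memory model registry that records versions, metrics, and data lineage for traceability. A real ecosystem would typically rely on a dedicated registry service; the class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RegistryEntry:
    name: str
    version: str
    metrics: dict
    data_version: str                 # traceability back to the data set used
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    status: str = "experimental"      # e.g. experimental -> staging -> production -> retired

class ModelRegistry:
    """Minimal illustrative registry: version control, traceability, and reuse."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: RegistryEntry):
        self._entries[(entry.name, entry.version)] = entry
        return entry

    def promote(self, name: str, version: str, status: str):
        self._entries[(name, version)].status = status

    def history(self, name: str):
        return [e for (n, _), e in self._entries.items() if n == name]

# Usage with illustrative values:
registry = ModelRegistry()
registry.register(RegistryEntry("churn-model", "1.0.0", {"accuracy": 0.91}, data_version="2024-01"))
registry.promote("churn-model", "1.0.0", "production")
```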
The following are examples of outcomes from these areas:
- Models experimented with, data sets used, reusable assets, model code/packages/containers
- Data pipelines, data assets, features, data connectors
- Training and testing data, training environments, ML accelerators
- Serving packages, serving logs
- Evaluation metrics, performance models, fairness/bias measurements (a sample measurement sketch follows this list)
- Overall repository and registry of models: experimented models, failed/discarded models, metadata
- Overall repository of data: reusable data sets, features, ETL scripts
- Other artefacts: code packages, containers, infrastructure as code
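As an illustration of a fairness/bias measurement, the sketch below computes the demographic parity difference, i.e. the gap in positive-prediction rates between groups. This particular metric is an illustrative choice and is not mandated by the text.

```python
def demographic_parity_difference(predictions, groups, positive=1):
    """Gap in positive-prediction rates between demographic groups (illustrative metric)."""
    rates = {}
    for pred, group in zip(predictions, groups):
        totals = rates.setdefault(group, [0, 0])   # [positives, count]
        totals[0] += int(pred == positive)
        totals[1] += 1
    positive_rates = {g: p / n for g, (p, n) in rates.items()}
    return max(positive_rates.values()) - min(positive_rates.values()), positive_rates

# Usage with illustrative values:
gap, rates = demographic_parity_difference([1, 0, 1, 1, 0, 0], ["A", "A", "A", "B", "B", "B"])
print(gap, rates)  # gap of about 0.33 between groups A and B
```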
The AI/ML CI/CD life cycle (along with the security considerations) described in the earlier part of this blog therefore becomes part of this overall ecosystem. The ecosystem plays an important role in managing the assets generated across multiple AI/ML solutions over their lifetimes.