Advanced Computing in the Age of AI | Thursday, April 25, 2024

Solving for ML Model Monitoring Challenges with Model Performance Management (MPM) 

The complete digitization of the world in the last few years has created both a unique opportunity and a unique challenge for organizations. While the boom in data presents a chance to significantly enhance decision-making, it is now even more time-consuming and expensive to analyze and leverage that information. As a result, businesses of all sizes are deploying machine learning (ML) models that can handle large volumes of data and identify patterns and correlations that often go unnoticed by human analysts or consume unreasonable amounts of their time. These models have the power to enhance decision-making and drive superior business outcomes. For example, some models can produce highly accurate predictions about the sales rate of a specific product over the next year to improve marketing and inventory planning. Others are able to recognize fraudulent transactions that could lead to millions of dollars in lost revenue.

But with an increased reliance on ML models comes an even greater need to monitor model performance and build trust into AI. Without monitoring, MLOps and data science teams struggle with:

  • Inconsistent model performance. Fluctuations can happen because ML models are trained on historical data that might look different from the real data they see in production.
  • Lack of control and debuggability. Because complex ML systems are opaque, practitioners may not understand the model well enough to know how to fix it when there are problems.
  • Instances of bias. Models can amplify hidden bias in the data they are trained on, exposing the business to legal and reputational risk and potentially leading to harmful results for consumers.
  • Difficulty improving ML performance. Because it is hard to understand and track which improvements are needed, models often receive no further investment after the initial launch.
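
To make the first challenge concrete, the sketch below compares the distribution of one numeric feature at training time with what a model sees in production, using the Population Stability Index (PSI). This is a minimal illustration under stated assumptions, not a method prescribed by this article: the function name, the equal-width binning, and the synthetic data are all hypothetical, and the thresholds mentioned in the comments (roughly 0.1 and 0.25) are a commonly quoted rule of thumb rather than a standard.

```python
import math
import random


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more shift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the feature is constant

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data (an assumption for illustration only).
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
production_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
production_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = population_stability_index(training, production_same)
psi_shifted = population_stability_index(training, production_shifted)

# Rule of thumb (assumed): below ~0.1 is stable, above ~0.25 is a significant shift.
print(f"no shift: {psi_same:.3f}  shifted: {psi_shifted:.3f}")
```

In practice a check like this would run per feature on a schedule, with the training sample acting as the fixed reference.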

Without monitoring, MLOps teams are also more likely to have low confidence in their models, which can result in more time spent on a project and more mistakes. Model monitoring gives developers the ability to debug models during both experimentation and production and to catch issues as they happen. It is the most effective way to build explainable, fair and ethical AI solutions, which is essential in today’s world. Say a bank is using ML to approve loans, and it receives a complaint from a customer asking why a particular loan was denied. That bank is now responsible for explaining why the model made that decision. Without monitoring solutions in place, tracking down an answer will be next to impossible.

Whether a model is responsible for predicting fraud, approving loans or targeting ads, small changes in models can lead to model drift, inaccurate reporting or instances of bias – all of which contribute to revenue loss and brand deterioration.
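
One simple way to put a number on the bias risk described above is to compare approval rates across groups, sometimes called a demographic parity check. The following is only a sketch: the function names, the group labels "A" and "B", and the decision data are hypothetical, and a real fairness review would look at more than one metric.

```python
def approval_rate_by_group(decisions):
    """decisions: iterable of (group, approved) pairs -> {group: approval rate}."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}


def demographic_parity_gap(decisions):
    """Difference between the highest and lowest group approval rates."""
    rates = approval_rate_by_group(decisions)
    return max(rates.values()) - min(rates.values())


# Made-up loan decisions (an assumption for illustration only).
decisions = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 60 + [("B", False)] * 40
)

rates = approval_rate_by_group(decisions)
gap = demographic_parity_gap(decisions)
print(rates, f"gap={gap:.2f}")
```

A gap near zero means the groups are approved at similar rates; how large a gap is acceptable is a policy decision, not something the code can settle.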

The Challenges with Model Monitoring Today

Unfortunately, model monitoring has become more complex as a result of the sheer variety and number of ML models that organizations rely on today. Models now serve a wide range of use cases, such as anti-money laundering, job matching, clinical diagnosis, and planetary surveillance. They also come in many different forms and modalities (tabular, time series, text, image, video, and audio). While these models can handle the large volumes of data businesses need to ingest, they are far more difficult – and costly – to keep track of.

To overcome these challenges, some companies have deployed traditional infrastructure monitoring solutions designed to support broad operational visibility. Others have attempted to create their own tools in-house. In either instance, these solutions often fail to meet the unique requirements of ML systems, whose performance, unlike that of traditional software systems, is non-deterministic and depends on factors such as seasonality, new user behavior trends, and upstream data systems that are often extremely high-dimensional. For instance, a perfectly functioning advertising model might need to be updated when a new holiday season arrives. Similarly, a model trained to show content recommendations in the U.S. may not do very well for users signing up internationally. As a result, organizations often face an inability to scale due to pipeline issues, time wasted troubleshooting production because of out-of-date models, and additional costs from internal tool maintenance.

To gain visibility and explainability in models and overcome common model monitoring challenges, organizations need solutions that let them easily monitor, explain, analyze, and improve ML models over time. Enter model performance management (MPM).

How MPM Addresses Performance and Bias

MPM is a centralized control system at the heart of the ML workflow that tracks performance at all the stages of the model lifecycle and closes the ML feedback loop. With MPM, enterprises can uncover deep actionable insights with explanations and root cause analysis, while giving immediate visibility into ML performance issues to avoid negative business impact.

The technology automatically reassesses model business value and performance on an ongoing basis, issuing alerts on model performance in production and helping developers respond proactively at the first sign of deviation. Because MPM tracks a model’s behavior from training to launch, it can also explain which factors led to a given prediction. Tying model monitoring to other pillars of ML observability – like explainability and model fairness – gives ML engineers and data scientists a comprehensive toolkit to embed into their ML workflows and provides a single pane of glass across model validation and monitoring use cases. Businesses benefit from MPM not only because it makes model monitoring more efficient, but also because it can reduce instances of bias that result in costly regulatory fines or reputational loss.
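
The alerting behavior described above can be pictured with a small sketch: track accuracy over a sliding window of recent labeled predictions and raise an alert when it falls too far below a baseline. This is an illustrative assumption about how such a check might look, not a description of any particular MPM product; the class name `AccuracyMonitor`, its parameters, and the chosen thresholds are all hypothetical.

```python
from collections import deque


class AccuracyMonitor:
    """Hypothetical monitor: alerts when windowed accuracy drops below baseline - tolerance."""

    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, prediction, label):
        """Store one labeled outcome; return True if an alert should fire."""
        self.outcomes.append(prediction == label)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait until the window is full
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.baseline - self.tolerance


# Simulated traffic (an assumption): 50 correct predictions, then a run of errors.
monitor = AccuracyMonitor(baseline=0.90, window=50, tolerance=0.05)
healthy_alerts = [monitor.record(1, 1) for _ in range(50)]

first_alert = None
for n in range(1, 21):
    if monitor.record(1, 0):  # prediction 1, true label 0: a wrong prediction
        first_alert = n
        break

print(f"alerts while healthy: {any(healthy_alerts)}, first alert after {first_alert} errors")
```

A windowed check like this depends on ground-truth labels arriving in production; where labels are delayed, teams often fall back on input or prediction drift as an earlier signal.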

ML models require continuous model monitoring and retraining throughout their entire lifecycle. MPM makes it possible for developers to not only gain confidence and greater efficiency in their models, but also understand and validate the “why” and “how” behind their AI outcomes.

About the Author

Krishnaram Kenthapadi is the Chief Scientist of Fiddler AI, an enterprise startup building a responsible AI and ML monitoring platform. Previously, he was a Principal Scientist at Amazon AWS AI, where he led the fairness, explainability, privacy, and model understanding initiatives in the Amazon AI platform. Prior to joining Amazon, he led similar efforts on the LinkedIn AI team and served as LinkedIn’s representative on Microsoft’s AI and Ethics in Engineering and Research (AETHER) Advisory Board.

About the author: Tiffany Trader

With over a decade’s experience covering the HPC space, Tiffany Trader is one of the preeminent voices reporting on advanced scale computing today.

EnterpriseAI