Advanced Computing in the Age of AI | Saturday, May 18, 2024

AIOps Platform Aims to Cut ‘Alert Fatigue’ 

AIOps seems to take delight in crying wolf – day and night, bells and whistles go off notifying IT site reliability engineers (SREs) that something might be wrong. Or not. The bigger and more complex the IT infrastructure, the more time and effort are wasted responding to false positives and “alert noise,” distracting IT teams from real problems that need fixing.

New Relic, maker of a cloud-based “observability platform” for custom software development, today announced the fruit of its acquisition last year of event-intelligence vendor SignifAI with the release of an AIOps suite for on-call DevOps, SRE and network operations center IT teams. Called New Relic AI, it applies AI and machine learning to help detect and resolve IT incidents and, according to the company, continuously improve incident management workflow.

Industry analyst firm MarketsandMarkets has reported the AIOps platform market is expected to grow to $11.02 billion by 2023, and Gartner predicts that within three years, 40 percent of DevOps teams will utilize AIOps capabilities. The technology addresses an urgent need among IT teams under constant strain to meet service level objectives and quickly identify and resolve problems in increasingly complex IT environments.

“AIOps will detect patterns a human would be unlikely to uncover, including those that reveal cause and effect,” said Padraig Byrne, senior director analyst at Gartner. “From this determination of causality, models should be created that will help decide which IT metrics should be mapped to which business objective. Observe these over time to refine each model; ensure that it is up-to-date and that any assumptions it makes remain accurate. Through its usage of machine learning algorithms, AIOps specifically offers a mathematical way to find the hidden connections, causes and opportunities in the data that make this process possible.”

Michael Olson, New Relic’s director of product marketing, told us that as IT landscapes and the software that drives them are expanded and modernized, “there's a wide surface area that needs to be managed, there's an increasing number of alerts that these teams are having to deal with, which makes it difficult to prioritize the issues that matter most, to separate signals from the noise and focus on issues that are most important to take action on.”

In short, AIOps needs to get smarter.

“That's really where we see New Relic AI being able to help,” he said, “by analyzing your data that you're able to ingest from multiple sources, by grouping and correlating alerts and events and incidents that are related to each other and, ultimately… to help enable our customers to focus on the highest priority issues.” He said early access customers report more than a 50 percent reduction in alert noise and “alert fatigue.”


The company described New Relic AI as “an open incident correlation and intelligence solution that is source and data agnostic.” It uses New Relic’s unified telemetry database, which fuels its ML models, and provides a “context-rich incident response workflow” that reduces alert noise.

A key capability of New Relic AI, Olson said, is its integration with existing management workflows through widely used tools such as Slack, PagerDuty, ServiceNow, OpsGenie and VictorOps. Customers can see a live view of ingested data and a summary of incidents, and can “tune correlations with user feedback,” the company said.

Telemetry data is continuously fed through New Relic AI for anomaly detection. The platform ingests, analyzes and takes action on multiple data types, including alerts, logs, metrics and deployment events, according to the company, giving “teams better context into incidents … and how they impact the broader environment, so they can diagnose and prioritize problems faster.”
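
The company did not detail how its anomaly detection works. For readers unfamiliar with the idea, the sketch below shows one common and very simple approach, a rolling z-score over a metric stream. It is an illustration only and not New Relic's method; the function name, the window size and the threshold are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=20, threshold=3.0):
    """Return the indices of points that sit more than `threshold`
    standard deviations from the mean of the preceding `window` points.

    Illustrative only: `window` and `threshold` are assumed defaults,
    not values taken from any vendor's product.
    """
    history = deque(maxlen=window)  # the most recent `window` observations
    anomalies = []
    for i, value in enumerate(values):
        # Only score a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A flat baseline (sigma == 0) cannot be scored this way, so skip it.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Example: a steady latency series (100-104 ms) with one spike at index 25.
latency_ms = [100 + (i % 5) for i in range(30)]
latency_ms[25] = 500
print(rolling_zscore_anomalies(latency_ms))
```

One design caveat: this version adds every point, including flagged ones, to the baseline, so a large spike temporarily widens the tolerance band for the points that follow it.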

Alert noise is reduced, the company said, by correlating related alerts, events and incidents, “while also suppressing flapping and low-priority alerts. Correlated incidents are enriched with context, automatically classified based on golden signals (i.e. errors, saturation, traffic, latency), as well as identifying related components affected and suggesting responders, to help on-call teams get closer to root cause and take action faster.”
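
To make the two noise-reduction ideas in that description concrete, here is a minimal sketch of flapping suppression and time-window correlation. It is not New Relic's correlation logic, which the company did not publish in this announcement. The `Alert` fields, the function names and every threshold are hypothetical, and the grouping rule (same entity, alerts close together in time) is just one simple choice among many.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical, chosen only for this illustration.
    timestamp: float  # seconds
    entity: str       # service or host the alert fired on
    check: str        # name of the alerting condition
    state: str        # "open" or "closed"


def flapping_keys(alerts, max_transitions=4, window=600.0):
    """Return (entity, check) pairs whose state changes more than
    `max_transitions` times within any `window`-second span.
    The first observation of a pair counts as one change."""
    change_times = defaultdict(list)
    last_state = {}
    for a in sorted(alerts, key=lambda a: a.timestamp):
        key = (a.entity, a.check)
        if last_state.get(key) != a.state:
            change_times[key].append(a.timestamp)
            last_state[key] = a.state
    flapping = set()
    for key, times in change_times.items():
        start = 0
        for end in range(len(times)):
            # Shrink the sliding window until it spans at most `window` seconds.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > max_transitions:
                flapping.add(key)
                break
    return flapping


def correlate(alerts, gap=120.0):
    """Group open, non-flapping alerts on the same entity into incidents
    when consecutive alerts are at most `gap` seconds apart."""
    noisy = flapping_keys(alerts)
    by_entity = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a.timestamp):
        if a.state == "open" and (a.entity, a.check) not in noisy:
            by_entity[a.entity].append(a)
    incidents = []
    for entity_alerts in by_entity.values():
        current = [entity_alerts[0]]
        for a in entity_alerts[1:]:
            if a.timestamp - current[-1].timestamp <= gap:
                current.append(a)
            else:
                incidents.append(current)
                current = [a]
        incidents.append(current)
    return incidents
```

In this sketch, a latency alert and an error alert that open on the same service 30 seconds apart would arrive as one incident instead of two pages, while a disk check that opens and closes every few seconds would be dropped as flapping. A production system would also need to correlate across entities and let operators adjust the rules, which is the kind of tuning the company describes.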

Olson added that New Relic AI is designed to promote transparency.

“We're giving customers a tremendous amount of transparency and flexibility and control into how incidents are correlated…,” he said, “and then secondly, we actually give our customers the flexibility and control to be able to infuse the system with their own human decisions and tune the correlation logic. And that gives our customers the ability to get better transparency into why issues are correlated and ultimately have a greater level of trust in the system.”