Advanced Computing in the Age of AI | Sunday, April 21, 2024

AIOps: Power Tool for Managing Systems Beyond Human Comprehension 

Source: Shutterstock, alphaspirit

You’re an accomplished IT manager at a financial services, retail or other data-intensive organization, and you preside over an insanely complex IT infrastructure: one that handles millions of roboticized transactions per hour, sprawls globally from on-prem to the cloud to the edge, runs containers that live and die in a heartbeat, and generates hundreds of daily events. No one doubts the smarts of you and your staff – yet you also know the system defies the comprehension of any human being, and you live in apprehension of the next downtime incident.

That’s why the application of AI to IT operations – AIOps – may be the most fertile early ground for AI to deliver significant benefit for large organizations in the broader, non-FAANG* enterprise market. This is because, complicated as they are, IT systems are ultimately composed of programmed, deterministic machines that machine learning can encompass.

While the need for AIOps is abundantly, painfully obvious to IT managers responsible for complex infrastructures and applications, AIOps itself is an emerging technology unfamiliar to many in IT. But according to Will Cappelli, former Gartner analyst and newly appointed CTO and global VP of product strategy at AIOps specialist Moogsoft, AIOps may be on the verge of break-out market traction this year or next.

Will Cappelli of Moogsoft

The core thesis of AIOps, according to Cappelli, is that there’s an alarming and widening disparity between systems complexity and human ability. At Gartner, he heard a rising chorus of distress about that gap from end users, starting four or five years ago.

“When you look at IT infrastructures…and the ways that IT and business have become much more intertwined over the last 10 years, one consequence has been the massive increase in the complexity of IT systems and in the speed with which IT systems change,” Cappelli told EnterpriseTech. “(At Gartner,) I’d talk with users at large enterprises, and they were complaining about the growing complexity and the explosion in the number of events they had to deal with, and about the mismatch between their skills and systems complexity.”

The challenge has moved beyond human scale. “It’s become literally impossible for IT operations teams and traditional data centers to even see what’s going on in their systems,” he said, “let alone manage them to ensure optimal results or to rapidly fix problems.”

The objective of AIOps is to automate the discovery of behavior patterns, both normal and anomalous, in systems using large data sets, and to detect the events that result in systems problems.

“You get an understanding of which events are important, causal events, not just events that correlate with events,” he said, “so if I want to fix things – either address a problem or improve my system – I know which events to focus on.”

AIOps, a tool within the larger DCIM (data center infrastructure management) arsenal, turns the traditional view of data centers on its head.

“When we look at IT operations and managing data centers,” Cappelli said, “we’re generally moving away from thinking of it as a world of solid things (servers, networks, storage systems) that we need to fix and keep in place, and moving toward a world of data streams, to a world of data. The topologies, the solid structures, emerge out of the streams of data, rather than the other way around.” (See related article on “autonomous centers of data.”)

AIOps, Cappelli explained, relies on “mathematical AI” (as opposed to symbolic, or rules-based, AI), which operates on statistical principles, using algorithms to find mathematical functions that reveal the statistical properties of the data. The goal of AIOps statistical analysis is to find, amidst the vast forest of system events, the relatively few events that really matter.

An AIOps workflow starts by ingesting entire streams of system data from monitoring systems, from log management systems such as Splunk, and from SNMP (Simple Network Management Protocol) systems, and then converting the data into a standard canonical form for analysis. From there, the system takes masses of events and begins placing them within groups, called “situations.”
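To make the “canonical form” step concrete, here is a minimal Python sketch of what normalizing two differently shaped feeds might look like. It is an illustration only, not Moogsoft’s implementation: the Event record, its field names, the severity scale and both input shapes are assumptions invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical canonical event record; field names are illustrative."""
    source: str          # which feed the event came from
    host: str            # component that emitted it
    timestamp: datetime  # always timezone-aware, so feeds are comparable
    severity: int        # assumed scale: 0 (info) to 5 (critical)
    message: str


def from_log_record(rec: dict) -> Event:
    # Assumed log-style input: {"host", "time" (ISO 8601), "level", "msg"}
    levels = {"INFO": 0, "WARN": 2, "ERROR": 4, "FATAL": 5}
    return Event(
        source="log",
        host=rec["host"],
        timestamp=datetime.fromisoformat(rec["time"]),
        severity=levels.get(rec["level"], 1),  # unknown levels map to 1
        message=rec["msg"],
    )


def from_trap(trap: dict) -> Event:
    # Assumed trap-style input: {"agent", "epoch" (seconds), "sev", "text"}
    return Event(
        source="snmp",
        host=trap["agent"],
        timestamp=datetime.fromtimestamp(trap["epoch"], tz=timezone.utc),
        severity=int(trap["sev"]),
        message=trap["text"],
    )
```

Once every feed is mapped to the same record type, downstream analysis can compare timestamps and severities without caring where an event originated.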

“Say it looks at 10 events and sees that three of them are related to each other, it calls them a ‘situation,’” Cappelli explained. “As for the other seven events, the platform sees they are related to each other based on mathematical measurements. So it’s gone from 10 events to two distinct situations. Then it analyzes the internal structure of these situations, each composed of events that are related to one another in various ways.

“It goes from raw event streams to situations, and then within each situation – through an understanding of where these events came from, time stamps, a number of things – it determines which of the events within each situation actually caused the events in the rest of the situation.”

Think of it as three reductions, he said. “Each time it goes through these steps, it boils down the number of events (the IT operator has) to deal with. The output, what you have to deal with, becomes richer and richer in meaning.”
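The reduction Cappelli describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the vendor’s algorithm: “related” is defined here as close in time and adjacent in a hand-written topology, and the probable cause is simply the earliest event in each group. The function names, the 60-second window and the sample hosts are all hypothetical.

```python
from itertools import combinations


def group_situations(events, links, window=60):
    """Group events into situations.

    events: list of (event_id, host, t_seconds) tuples
    links:  set of frozenset({host_a, host_b}) pairs that are connected
    Two events are treated as related when they occur within `window`
    seconds of each other on the same host or on linked hosts.
    """
    parent = {eid: eid for eid, _, _ in events}

    def find(x):
        # Union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (ida, ha, ta), (idb, hb, tb) in combinations(events, 2):
        close_in_time = abs(ta - tb) <= window
        close_in_topology = ha == hb or frozenset((ha, hb)) in links
        if close_in_time and close_in_topology:
            parent[find(ida)] = find(idb)

    situations = {}
    for ev in events:
        situations.setdefault(find(ev[0]), []).append(ev)
    return list(situations.values())


def probable_cause(situation):
    # Naive heuristic (an assumption): the earliest event is the likely trigger
    return min(situation, key=lambda ev: ev[2])


# Ten sample events, mirroring the 10-events example in the text
EVENTS = [
    (1, "web1", 0), (2, "app1", 10), (3, "web1", 20),
    (4, "db1", 500), (5, "db1", 510), (6, "db2", 520), (7, "db2", 530),
    (8, "cache1", 540), (9, "cache1", 550), (10, "db2", 560),
]
LINKS = {
    frozenset(("web1", "app1")),
    frozenset(("db1", "db2")),
    frozenset(("db2", "cache1")),
}

situations = group_situations(EVENTS, LINKS)
causes = [probable_cause(s) for s in situations]
```

With this sample data the ten events collapse into two situations, of three and seven events, and each situation is reduced to one candidate cause for the operator to inspect. A production system would rely on far richer relatedness measures and causal analysis than a time window and an earliest-event rule.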

Cappelli posed the hypothetical of an IT breakdown within a large, distributed system with end users around the world, an application that touches multiple databases on premises and in the cloud, and that executes highly complex transactions.

Moogsoft AIOps screenshot

“Now let’s say response times start to degrade,” he said. “The end user sees that transactions had been completed in a second, but suddenly it’s taking two or three minutes, something has gone dreadfully wrong. Each transaction generates revenue for the company, so the slowdown in the system means a slowdown in the revenue stream. It looks as though the whole system may keel over.”

In a system with thousands of state changes happening constantly, “you’re dealing with an incredibly complex environment that’s literally impossible for a human being to diagnose, yet you’re losing revenue until you can ascertain the root cause.”

The AIOps system goes through its data-streams-to-events-to-situations reduction process and narrows the number of possible causes of the slowdown to hundreds, rather than tens or hundreds of thousands. “There’s only a small subset that are the cause, that are responsible for all the rest.”

AIOps is a response to a world increasingly run in the blur of machine time (see related article) – that is, at the speed with which advanced systems operate. This requires generating systems diagnostics also in machine time (i.e., real time), and Cappelli said it’s one of the critical concepts underlying the technology.

“It’s like solving a crossword puzzle or other challenging game,” he said. “As humans, we run patterns through our heads, and then the event hits us and we suddenly see how it all fits together. There is an elapsed time between our ingestion of the data that we need and then the insight that helps us organize that data. What AIOps lets us do is deal with events, with puzzles, that are generated in machine time. It is generating the pattern discovery insights at similar temporal scale. It’s a relevant concept and a very interesting dimension of the whole thing.”

* FAANG: Facebook, Amazon, Apple, Netflix, Google