Advanced Computing in the Age of AI | Friday, March 29, 2024

AIOps: A Holistic Approach to Data Center Monitoring & Management 

We all know today’s IT infrastructures are incredibly complex, and the expected increase in application demands and data growth will only exacerbate that complexity. It’s also fair to say that this trend is irreversible.

This phenomenon is a natural outcome of forces such as the massive growth of e-commerce and associated customer expectations, hybrid-cloud adoption, an increasingly mobile workforce and the widespread use of virtualization. The problem is that monitoring and management tools and methodologies haven't kept pace with the rate of change of enterprise infrastructures.

Tool vendors have tried to solve the problem for their part of the stack, and that has led to a proliferation of monitoring and management tools and silos. But multiple, disparate tools and methods for sleuthing out inefficiencies and pinpointing trouble spots lack a unified view of the data center infrastructure, as well as application context.

Now practitioners are starting to adopt AIOps, Artificial Intelligence for IT Operations, to solve infrastructure problems. Gartner, in 2017, coined the term and defined it this way:

“AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.”

Applying AI to Infrastructure Issues

AIOps platforms ingest large volumes of data originating from all areas of the application environment and analyze it using AI to identify areas where optimization or remediation is required. AIOps promises to counter the prevailing, reactive nature of existing methods. AIOps is designed for both proactive use cases, such as workload placement and optimization, and reactive issues, such as finding the source of anonymous faults that take a toll on availability or performance.

AIOps is correctly defined as an approach rather than a single technology or point solution. As such, it takes an ecosystem and a consistent approach across that ecosystem to deliver on the promise of AIOps. Practitioners need to be looking at leveraging AIOps from both an application perspective and an IT operations perspective.

I’ve never met an enterprise customer who said, “I don’t have enough tools,” but I’ve known many who can’t make sense of what all the tools are telling them — there’s too much noise. Further, most monitoring tools are inefficient because none provide a holistic view of the infrastructure or operate with a shared context. So, taking a stab at the problem by layering on statistical analysis for deduplication, or a basic time-based correlation on top of them, can reduce the noise, but not necessarily generate unique insight.

Nor can you simply throw math at your data. But you can determine the right application of a given set of mathematical capabilities with a known input driving a desired or anticipated outcome. And so, what’s more enlightening is to use an informed application of AI technology, and typically, that’s a combination of approaches.

Two AIOps Use Cases

Now we’re getting down to the real challenge, which is how do we move beyond throwing math at a data lake to find actionable insights, and to a very purposeful application of AI and ML to solve very specific challenges or problems in enterprise IT operations?

Let’s look at two use cases: using AIOps to solve a “noisy neighbor” problem in a highly virtualized or private cloud estate, and using AIOps for workload placement and optimization.

Use Case 1: Remedying a noisy neighbor in a cloud:

Let’s say I have a data lake and all my monitoring alerts are coming from various monitoring tools, but also alarms are going off on a VM or the network or the storage array. My APM solution is showing that my applications are being impacted. Throwing math at a data lake establishes that these alarms are time-correlated and, therefore, must be related.

But so far, I don’t know which domain the issues originated in. I don’t know which domain is causal and which is merely correlated, and I don’t know how to address or remediate the issue. And so, I dedupe the alarms to a single alarm, and I know that they all happened at the same time. Then I resort to traditional troubleshooting: I get all the domain owners in a room, and we look for a root cause.
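To make the limits of time-based correlation concrete, here is a minimal Python sketch of that dedupe step. The `Alarm` fields, the domain labels and the 60-second window are all assumptions made for illustration, not the behavior of any particular monitoring tool. It groups alarms that arrive close together and lists the domains each group touches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str       # hypothetical domain label: "apm", "vm", "storage", "network"
    timestamp: float  # seconds on a shared clock
    message: str


def group_by_time(alarms, window_seconds=60.0):
    """Group alarms that occur within window_seconds of the previous alarm."""
    groups = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        if groups and alarm.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alarm)   # close enough in time: same incident
        else:
            groups.append([alarm])     # gap too large: start a new incident
    return groups


# Illustrative data only.
alarms = [
    Alarm("apm", 100.0, "checkout latency high"),
    Alarm("vm", 105.0, "disk wait high"),
    Alarm("storage", 110.0, "port congestion"),
    Alarm("network", 900.0, "link flap"),
]

incidents = group_by_time(alarms)
domains_per_incident = [sorted({a.source for a in group}) for group in incidents]
print(domains_per_incident)
```

The output says that the APM, VM and storage alarms belong to one incident, but nothing in it indicates which of those domains is the cause. That is the gap the rest of this use case addresses.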

Based on my experience, and an ML engine that learns from the past, I know that the issue I’m seeing at the application layer is related to a particular infrastructure contention issue, and I know that this infrastructure is shared by a number of application workloads. That means I can eliminate any part of the infrastructure that isn’t in that shared space.

Then I can look for the known causes of that given issue, and I’m going to see that my APM solution did not see it because the application causing the issue isn’t within the scope of the APM solution. My VMware layer didn’t see it because it’s on a separate cluster. But I do see that there’s a shared component in the backend storage, so I can immediately see that the congestion on this backend element rippled up through the rest of the stack and caused an issue in an unrelated application.
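The narrowing step can be sketched with a simple dependency map. Everything here is hypothetical: the workload and component names, the `dependencies` structure and the `congested` set stand in for whatever topology and telemetry a real platform would supply. The idea is only to show how shared context shrinks the search space.

```python
# Hypothetical dependency map: workload -> infrastructure components it relies on.
dependencies = {
    "tier1-orders": {"cluster-a", "fabric-1", "array-7"},
    "tier3-batch": {"cluster-b", "fabric-1", "array-7"},
    "tier2-reports": {"cluster-b", "fabric-2", "array-9"},
}


def shared_components(affected, dependencies):
    """Map each component the affected workload shares to its co-tenant workloads."""
    shared = {}
    for other, components in dependencies.items():
        if other == affected:
            continue
        for component in dependencies[affected] & components:
            shared.setdefault(component, set()).add(other)
    return shared


def suspects(affected, dependencies, congested):
    """Keep only the shared components that are currently reporting congestion."""
    shared = shared_components(affected, dependencies)
    return {c: sorted(tenants) for c, tenants in shared.items() if c in congested}


# Assumed telemetry: components currently reporting congestion.
result = suspects("tier1-orders", dependencies, congested={"array-7", "cluster-b"})
print(result)
```

In this made-up estate, `cluster-b` is congested but is not part of the affected application's path, so it drops out. Only the shared storage array remains, along with the workload on a separate cluster that also uses it.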

Now, by triaging the problem, I learn that a storage issue in the shared backend infrastructure is causing application issues that the app owners can see, so I will let the app owners know that it’s an infrastructure issue, and I’ll open a trouble ticket and assign it to the IT operations team that owns storage, with no war room needed. I’ve determined that the root cause of the issue is, let’s say, a speed mismatch in the fabric. So now I’m going to recommend remediation: upgrade an HBA and/or move workloads from a shared resource to a dedicated resource. Rationale: my tier 1 apps are more important than my tier 3 apps, and I can triage my tier 3 apps at a later time.

Use Case 2: Using AIOps-powered predictive workload placement and optimization strategy to avoid issue remediation:

Now let’s look at a more proactive application of AI and ML to ensure that the right infrastructure resources are available to meet changing and dynamic application workloads. Imagine you are running a business with multiple applications that have different levels of utilization during different business cycles, and they are all running on a highly virtualized private cloud infrastructure.

If you want to avoid the contention for resources described in the first use case, you would allocate resources to your most important applications first, but also accommodate or adapt to changing needs as usage patterns change. By intelligently monitoring all infrastructure resources supporting those applications and applying ML to understand the usage patterns and establish the recurring patterns, you can accomplish two things: identify what is “normal behavior” and alert to anything that is a set deviation from normal. This would let you know when a change in “normal” will cause resource contention, allowing you to move workloads proactively to avoid contention and ensure consistent performance.
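As a rough illustration of “normal” and “a set deviation from normal,” the sketch below uses a plain per-hour mean and standard deviation rather than a learned ML model, so it is a simplified stand-in for the approach described here. The sample values, the hour-of-day bucketing and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, stdev


def build_baseline(history):
    """history maps hour-of-day -> past utilization samples (percent)."""
    return {hour: (mean(samples), stdev(samples)) for hour, samples in history.items()}


def is_deviation(baseline, hour, value, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from that hour's norm."""
    avg, spread = baseline[hour]
    if spread == 0:
        return value != avg
    return abs(value - avg) / spread > threshold


# Illustrative history: a quiet morning hour and a busy afternoon hour.
history = {
    9: [40.0, 42.0, 41.0, 39.0, 43.0],
    14: [70.0, 72.0, 68.0, 71.0, 69.0],
}
baseline = build_baseline(history)

print(is_deviation(baseline, 9, 42.0))   # within the usual morning range
print(is_deviation(baseline, 9, 70.0))   # far above the usual morning range
print(is_deviation(baseline, 14, 70.0))  # the same value is ordinary at 14:00
```

The point of bucketing by hour is that the same reading can be routine in one business cycle and a warning sign in another, which a single static threshold cannot express.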

Now, let’s take that one step further. If you understand what normal behavior is, you can establish when one application is busy while determining when another application is not. You can also determine if one application is particularly memory-intensive, where another is more compute-hungry. By looking at these behaviors and resource requirements over time, you can use analytics to find the best combination of applications to place together on a shared set of infrastructure.
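One simple way to picture that analysis is to compare usage profiles pairwise and prefer the pair whose combined peak is lowest. The application names and the four-interval demand profiles below are invented for illustration, and a real placement engine would weigh many more dimensions (memory, I/O, priority) than this sketch does.

```python
from itertools import combinations

# Hypothetical CPU demand (percent of one host) across four time intervals.
profiles = {
    "payroll": [10, 10, 80, 80],
    "storefront": [80, 80, 10, 10],
    "analytics": [70, 75, 20, 15],
}


def combined_peak(a, b):
    """Highest total demand in any interval if both workloads share a host."""
    return max(x + y for x, y in zip(a, b))


def best_pair(profiles):
    """Return the pair of workloads with the lowest combined peak demand."""
    return min(
        combinations(sorted(profiles), 2),
        key=lambda pair: combined_peak(profiles[pair[0]], profiles[pair[1]]),
    )


pair = best_pair(profiles)
print(pair, combined_peak(profiles[pair[0]], profiles[pair[1]]))
```

Here the two workloads that are busy at opposite times fit on one host with headroom to spare, while pairing two workloads that peak together would exceed the host's capacity.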

Better yet, an advanced application of AIOps would combine all these approaches, applying the ML and then leveraging simulation analytics in the background to make proactive recommendations on workload placement, which would drive optimal resource utilization while ensuring consistent performance. And that is when an AIOps approach becomes transformative and creates competitive advantage for your business.

John Gentry is CTO of Virtana, a real-time monitoring and AIOps platform for mission-critical IT infrastructure.

EnterpriseAI