Advanced Computing in the Age of AI | Thursday, May 30, 2024

Avoiding the Nightmare of Network Downtime in the Cloud 

When I think about network downtime, many words come to mind. Among them: panic, fear, Armageddon.

Perhaps that last one is a bit dramatic, but as IT operates at advanced scale and mission-critical tasks driven by enterprise HPC technologies, such as real-time analytics, become increasingly important to business success, downtime is a critical problem. And as we move further toward cloud and hybrid IT environments, the thought of conquering downtime becomes even more daunting.

This is because merely diagnosing network downtime in the cloud brings two major challenges: ownership and control, and telling downtime apart from service unavailability. At the end of the day, regardless of what is causing the downtime and who is responsible, the network engineer is ultimately accountable for keeping the service up and running properly. That accountability only grows in the hybrid IT era, where the stakes of even brief downtime are higher than ever before.

Let’s explore these challenges in depth.

The first challenge is ownership and control. SaaS applications typically run over our networks, but they are owned by the service provider, not by you or any other IT professional employed by your organization.

Leon Adato of SolarWinds


For IT professionals, this is new territory. Diagnosing downtime for on-premises services, or for a server running a service that we, the internal IT professionals, built on Azure or AWS, is easier because there is a degree of ownership and a single source of truth. But for applications completely outside our ownership that we merely consume as a service, we haven’t traditionally had the visibility needed to even begin diagnosing the issue.

Not only that, but the carriers and service providers control which packets they prioritize and how those packets are routed. They can also move your services to any piece of hardware anywhere in the cloud at any given moment, all unbeknownst to you, and your network services could slow down as a result.

The second challenge is understanding whether the issue is downtime or service unavailability. The cloud, almost by definition, is highly redundant. There is practically no limit to the number of connections and routes on the network; there are multiple network paths to devices and multiple devices running on the network. There is also a cluster of servers providing services and balancing the load at each level of the application, from web presentation to database to storage.

All of this has created an environment of network redundancies, so you need to discern if one network port on one router going down is in fact a critical issue (spoiler alert: it’s often not). We have blurred the lines between the network being down and services being down, and it’s imperative for IT to take charge.
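To make the distinction between "something on the network is down" and "the service is down" concrete, here is a minimal sketch in Python. It assumes you can describe a service as a set of redundant paths, each a list of components, and that your monitoring tool reports an up/down flag per component. The component names, the `service_available` and `classify` functions, and the three status labels are all hypothetical and only illustrate the idea; they are not taken from any particular product.

```python
from typing import Dict, List


def service_available(paths: List[List[str]], component_up: Dict[str, bool]) -> bool:
    """A service is reachable if every component on at least one path is up.

    Components missing from the status map are treated as down (assumption).
    """
    return any(all(component_up.get(c, False) for c in path) for path in paths)


def classify(paths: List[List[str]], component_up: Dict[str, bool]) -> str:
    """Label the situation: 'healthy', 'degraded', or 'service_down'."""
    any_failed = any(not up for up in component_up.values())
    if not any_failed:
        return "healthy"
    if service_available(paths, component_up):
        # Something failed, but redundancy is absorbing it.
        return "degraded"
    return "service_down"


# Hypothetical example: two redundant paths to the same web application.
paths = [
    ["router-a:port1", "lb-1", "web-1"],
    ["router-b:port1", "lb-2", "web-2"],
]
status = {name: True for path in paths for name in path}
status["router-a:port1"] = False  # one port on one router goes down

print(classify(paths, status))  # degraded
```

In this sketch, a single failed port yields "degraded" rather than "service_down", which is the kind of triage that keeps a redundant-path failure from being treated as an outage. A real deployment would also need to account for capacity, since a surviving path may not carry the full load.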

What’s clear is that network engineers need visibility into their networks. Ultimately, this comes down to robust monitoring that is customized for the Wild West of hybrid IT.

Today, simply trusting your cloud vendor or user isn’t enough; any of several factors could be causing the problem. Is it a flaky switch, a slow spindle, a bad path through the ISP network, or something else?

It’s very easy to get into a blame game with the cloud service provider because you lack control over every layer of the stack. You should employ “healthy skepticism,” meaning it’s okay to believe the cloud service provider has everything under control, but it’s prudent to get to the bottom of the issue yourself because the proper functioning of the network falls on you.

Here are some best practices for diagnosing and minimizing network downtime in the hybrid IT era:

  • Know Your Networks and Accept Responsibility – You should keep an inventory of your networks; know where your devices are and what they’re doing. You should acknowledge you have services going out to cloud-based applications, and that these services are just as much your responsibility as they are the application team’s. This also means knowing, or knowing how to find out, what the patterns of your network usage are day by day, hour by hour, and at different points in the month. Basically, it means treating monitoring (the regular, consistent, ongoing collection of data from devices) as its own discipline and not just “the thing that creates all those tickets” or an item on your to-do list. Monitoring as a discipline differs from basic monitoring in that it is an actual role, an assigned focus of one or more individuals within an organization. I’ve seen the benefit of this role in action, and it provides value through the ability to turn disparate data points from various monitoring tools and utilities into more actionable insights. It takes all of those data sources into account, and does so from a holistic vantage point.
  • Employ Discovery and Monitoring Tools – Processes that allow you to know when devices are coming onto the network will help you understand your entire network landscape and pinpoint when and where there is an issue. If you don’t do this, you’ll end up with a network you don’t recognize, which is impossible to troubleshoot. However, it’s not as simple as installing any monitoring and alerting solution. It’s not enough to simply keep doing things the way we always have. Significant product innovation is required to meet the challenge presented by hybrid IT and offer network administrators a valuable solution to bridge the hybrid IT visibility gap. At the same time, there are some pure hybrid IT and cloud products that do very little to help assure the performance of the legacy side of the hybrid network, including certain components that will never go away (campus LAN, telepresence, VoIP, etc.). Network monitoring solutions developed specifically for the needs of the hybrid IT era will provide rich and complete monitoring for on-premises resources, as well as converged views of cloud resources.
  • Automate – Automation is your friend. Networking has already been complex for a long time and, as discussed, it’s only getting more so with hybrid IT. Tools that help by automating various network management routines can greatly alleviate your burden and free up your time to focus on tasks that absolutely can’t be automated. Take IP address management as an example. Too many IT organizations are still using archaic methods, such as manual tracking via spreadsheets. How can one even begin to think about the challenges of hybrid IT paths when still mired in the manual management of every single IP address, and the address conflicts and other issues that come along with it?
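
As a small illustration of the IP address management point above, the following Python sketch flags address conflicts automatically instead of relying on someone to spot them in a spreadsheet. It assumes your inventory can be exported as (device, IP address) pairs; the `find_ip_conflicts` function, the device names, and the sample addresses are hypothetical. Only the standard library's `ipaddress` module is used, to validate and normalize addresses.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_ip_conflicts(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return each IP address claimed by more than one device.

    `records` is an iterable of (device_name, ip_string) pairs, for example
    rows exported from an inventory spreadsheet (assumed format).
    Raises ValueError if an entry is not a valid IPv4/IPv6 address.
    """
    owners = defaultdict(set)
    for device, raw_ip in records:
        # Strip stray whitespace, then validate and normalize the address.
        ip = ipaddress.ip_address(raw_ip.strip())
        owners[str(ip)].add(device)
    return {ip: sorted(devs) for ip, devs in owners.items() if len(devs) > 1}


# Hypothetical inventory export: two devices claim the same address.
records = [
    ("web-1", "10.0.0.5"),
    ("printer-3", "10.0.0.5 "),  # trailing space, as often seen in spreadsheets
    ("db-1", "10.0.0.6"),
]

conflicts = find_ip_conflicts(records)
print(conflicts)  # {'10.0.0.5': ['printer-3', 'web-1']}
```

A check like this could run on a schedule against whatever source of truth you already have. It is a sketch of the principle, not a substitute for a dedicated IP address management tool, which would also track leases, subnets, and DNS records.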


The shift to hybrid IT environments means relinquishing control while taking on more responsibility. By truly knowing all the networks that impact the performance of your environment, even those you may not own, properly monitoring those networks, and leveraging automation in their management, you can meet the challenge.

Leon Adato is a head geek and technical evangelist at SolarWinds, a provider of IT management software.