Advanced Computing in the Age of AI | Wednesday, April 24, 2024

Human Error Offsets Greater Datacenter Reliability 

Despite the vastly improved reliability of datacenter gear and the rise of managed services designed to anticipate failures and limit their consequences, an industry survey finds that datacenter outages remain common and costly.

That’s unsettling, according to the Uptime Institute’s survey of known outages over the past year, since “the consequences are high, and possibly higher than in the past” owing to greater reliance on IT systems to deliver more goods and services.

Datacenter outages affected mission-critical platforms ranging from financial and retail systems to aviation and healthcare systems, and included day-long outages of emergency 911 services. Overall, the survey found that more than 30 percent of IT service and datacenter operators experienced downtime over the last year described as “severe degradation of service.”

Improved infrastructure reliability is offset in many cases by human and management errors. Uptime Institute estimates about 70 percent of outages are the result of operator error or poor management. Other industry assessments run as high as 75 percent.

“Perhaps there is simply a limit to what can be achieved in an industry that still relies heavily on people to perform many of the most basic and critical tasks and thus is subject to human error, which can never be completely eliminated,” Uptime added in a recent blog post.

“However, a quick survey of the issues suggests that management failure — not [just] human error — is the main reason that outages persist.”

Uptime’s annual assessment found that 60 percent of datacenter owners and operators it surveyed said “their most recent significant downtime incident could have been prevented with better management/processes or configuration.”

Source: Uptime Institute

Indeed, reliability is emerging as a key differentiator for IT services vendors. A separate survey released in August by industry tracker IHS Markit found that reliability was the top priority when investing in new datacenter technologies.

“This high priority reflects the importance enterprises are placing on ensuring their networks run seamlessly,” IHS said. “This issue is becoming more critical as workloads increase and the variety of applications expands, two factors that place greater strain on datacenter infrastructure.”

Among the conclusions of the Uptime survey is that datacenter outages are a continuing and expensive problem for infrastructure vendors. The institute’s research, along with anecdotal evidence, shows “management shortcomings play a major role in these failures.”

As failures grow in complexity, affecting services across multiple platforms, the impact of separate IT operations is magnified. “IT continues to operate in silos, a strategy that is successful because of the specialties required, but which can allow for critical vulnerabilities to be overlooked,” Uptime noted.

As more enterprises shift to hybrid cloud deployments, the survey authors also stress the need for a more holistic IT approach along with “increase[d] transparency and accountability across their hybrid infrastructures.”

About the author: George Leopold

George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).

EnterpriseAI