Advanced Computing in the Age of AI | Thursday, March 28, 2024

Google Outage Traced to Network Glitch 

Google has tentatively traced the cause of a roughly two-hour global cloud outage to an internal software issue related to its virtual network traffic routing.

The Google Compute Engine public cloud outage commenced at 10:40 p.m. Pacific time on Wednesday (Feb. 18); traffic loss was stopped and normal traffic levels resumed at about 1:20 a.m. local time on Thursday (Feb. 19), according to a company statement released later in the day.

Given the time span, Asian public cloud customers were probably affected the most by the service interruption, although reports said a European online grocer was among those affected. Regardless of time of day, however, a public cloud outage of this magnitude forces hyper-scale cloud providers like Google to scramble.

"At the time of the outage we saw a significant degradation of web performance across virtually all business sectors, an indication that organizations using Google’s cloud infrastructure encountered difficulties beyond their control to fix," noted digital performance analyst David Jones of cloud application tracker Dynatrace.

In a status report, Google said the "preliminary" root cause of the outage was traced to the Compute Engine's virtual network for VM outgoing traffic. The company said the network stopped issuing routing information. The precise cause of the interruption "is still under active investigation," Google said. "Cached route information provided a defense in depth against missing updates, but [Google Compute Engine] VM egress traffic started to be dropped as the cached routes expired."

Once its virtual network started dropping packets, Google engineers began reloading all routing information about 45 minutes after the outage was first spotted. They were able to force a reload to fix the networking issue about an hour after the issue was identified and before all routing entries had expired, the search giant said.

Once the fix was in place on early Thursday morning, Google engineers extended the expiration lifetime of the network routing entries from several hours to a week. That stopgap is intended to give engineers enough time to develop a long-term fix should the routing problem surface again.

According to cloud service status tracker CloudHarmony, Google Compute Engine was down for just under two hours. The global outage effected customers in East Asia, Western Europe and the central U.S.

Jones, the cloud performance analyst, called the Google outage "a prime example of the role third-parties play in businesses’ digital strategies, and the vulnerabilities they face if they do not have the ability to detect and respond immediately to ensure their end users are not impacted."

He added that this week's outage is not an isolated incident, noting that Amazon Web Services and especially Microsoft Azure suffered major outages last year.

Public cloud vendors are hammered when their services are down for just a few hours. In January, Verizon Cloud was down for entire weekend for what the company termed a "seamless upgrade" to a "pure virtualized infrastructure." Despite the fact that the outage was planned to upgrade the enterprise-class infrastructure to prevent future outages, the carrier took heavy fire for failing to alert customers about the planned maintenance.

EnterpriseAI