How to Avoid the Container ‘Mushroom Cloud Effect’
As application containers scale in production, some early results are coming in about what can go wrong. Among the potential downsides of a micro-service architecture is what one observer ominously refers to as the "mushroom cloud effect."
In a presentation this week at the DockerCon meetup in Seattle, a cloud and virtualization technologist appraised the state of container deployments in a talk subtitled "What Happens When Containers Fail?" Alois Mayr of application performance management specialist Dynatrace noted that micro-service architectures supporting application containers can balloon to production environments as much as 20 times larger than standard IT platforms.
In these large, connected and dynamic environments, "Container metrics will tell you about infrastructure health but not service health," Mayr stressed. Even with infrastructure service "health checks" in place, he continued, "you will see intermediary mushroom cloud effects of a large number of services being affected temporarily."
The problem for IT managers, Mayr added, is determining "what really caused the problem and how to distinguish effect [versus] cause?"
Container-based deployments are no longer static, with "ephemeral" containers being clustered for scale and running on what should be a resilient architecture. Once container-based services are scaled up to peak workloads, "failing containers may or may not have an impact [on] service performance," Mayr noted.
Still, a daisy chain of cascading container failures could lead to the "mushroom cloud effect," he warned.
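One way to picture the cause-versus-effect question in such a cascade is to walk the service dependency graph and keep only the unhealthy services whose own dependencies are all healthy. The Python sketch below is illustrative only and is not the method described in the talk; the service names, the `depends_on` map and the `find_root_causes` helper are hypothetical.

```python
from typing import Dict, Set


def find_root_causes(depends_on: Dict[str, Set[str]], unhealthy: Set[str]) -> Set[str]:
    """Return the unhealthy services that have no unhealthy dependency.

    Services that are unhealthy but call another unhealthy service are
    treated as likely effects rather than causes.
    """
    return {
        svc
        for svc in unhealthy
        if not (depends_on.get(svc, set()) & unhealthy)
    }


# Hypothetical topology: each service maps to the services it calls.
depends_on = {
    "frontend": {"checkout", "catalog"},
    "checkout": {"orders-db"},
    "catalog": {"orders-db"},
    "orders-db": set(),
}

# Hypothetical snapshot of services currently reporting failures.
unhealthy = {"frontend", "checkout", "orders-db"}

root_causes = find_root_causes(depends_on, unhealthy)
print(sorted(root_causes))
```

In this made-up snapshot, only the database is flagged as a candidate cause, while the failing front end and checkout services are treated as downstream effects. The heuristic is deliberately simple: it assumes the dependency map is complete and returns nothing useful when unhealthy services depend on each other in a cycle.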
In what Mayr referred to as the "hungry container breakdown," the evolving problem stems from a variety of infrastructure sources, including failed container health checks and orchestration tools that "kill" containers and reschedule new ones. One result is that cluster nodes are unable to run any containers.
The impact of the breakdown on services could include increased failure rates and, since application containers entail many service dependencies, consequences for dependent database and Apache Tomcat service applications.
At least some of these problems could be avoided through greater use of log management tools for application logs and more partitioning. "Buggy containers may kill your nodes," Mayr stressed.
Mayr also stressed the need for "massive load testing" in order to determine a cluster's breaking point. He recommended throwing everything but the kitchen sink into load testing, including containers, services, orchestration tools and elastic compute cloud instances.
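The idea of hunting for a cluster's breaking point can be sketched as a simple load ramp: raise the load step by step and stop at the first level where the error rate crosses a tolerance. The Python sketch below is a toy under stated assumptions, not a real load-testing tool; `find_breaking_point`, the 5 percent threshold and the `simulated_cluster` stand-in (with its invented capacity of 1,000) are all hypothetical.

```python
from typing import Callable, Optional


def find_breaking_point(
    measure_error_rate: Callable[[int], float],
    start: int,
    step: int,
    max_load: int,
    threshold: float = 0.05,
) -> Optional[int]:
    """Ramp the load upward and return the first level whose error rate
    exceeds the threshold, or None if the cluster survives up to max_load."""
    load = start
    while load <= max_load:
        if measure_error_rate(load) > threshold:
            return load
        load += step
    return None


def simulated_cluster(load: int) -> float:
    """Stand-in for a real load test run: errors appear once the load
    passes an invented capacity. Replace with real measurements."""
    capacity = 1000  # hypothetical capacity, in requests per second
    if load <= capacity:
        return 0.0
    return min(1.0, (load - capacity) / capacity)


breaking_point = find_breaking_point(
    simulated_cluster, start=200, step=200, max_load=3000
)
print(breaking_point)
```

In practice the `measure_error_rate` callback would drive a real test against containers, services and orchestration tooling together, since the point of the exercise is to see how the whole stack behaves under stress rather than any one layer.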
"Try to break your clusters early, and be prepared for Black Friday," Mayr advised.
The application management vendor said its post-mortem analysis was based on a "real-world large e-commerce production environment" in which 14 applications were tested along with more than 3,400 services. That worked out to an estimated 133,000 containers along with the analysis of more than 820 trillion dependencies.
Hence, Mayr noted, it's "all about service dependencies."
If the evolution of problems involving failed containers is graphed, the visualization does indeed resemble a mushroom cloud.