Advanced Computing in the Age of AI | Wednesday, August 17, 2022

Improving Chaos Engineering in the Cloud with AWS Fault Injection Simulator 

AWS re:Invent 2020 -- For a long time, it has been a serious challenge for IT teams to adequately stress-test cloud applications to find and fix their weaknesses, especially at scale.

But at this year’s AWS re:Invent conference, vice president and CTO Dr. Werner Vogels unveiled Amazon Web Services’ (AWS) upcoming tool, the AWS Fault Injection Simulator, which aims to make that critical task much easier for development teams.

The Fault Injection Simulator (FIS) is a fully managed, as-a-service offering that lets teams see how cloud applications and processes react to failures so that those failures can be anticipated and prevented later in production. FIS is expected to be available in early 2021.

Chaos engineering isn’t new, Vogels said in his keynote address on Tuesday (Dec. 15) at the event, which is being held virtually due to the COVID-19 pandemic. But it is invaluable in helping development teams improve and harden their code by observing its performance and resiliency, he said.

“The goal of chaos engineering is to understand how your application responds to issues, like injecting failures into your infrastructure, usually, running against production systems,” said Vogels. “These experiments can include generating a baseline traffic load against the system, adding latency to database calls, and then validating timeouts and returns. We believe that chaos engineering is for everyone, not just shops running at an Amazon or Netflix scale.”
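The kind of experiment Vogels describes, adding latency to a database call and then validating the timeout, can be illustrated with a minimal Python sketch. It does not use AWS or FIS; `query_database`, `with_injected_latency` and `call_with_timeout` are hypothetical names invented for this illustration, and the "database" is a stub.

```python
import concurrent.futures
import time


def query_database():
    """Hypothetical stand-in for a real database call."""
    return {"rows": 3}


def with_injected_latency(func, delay_seconds):
    """Wrap func so that every call is delayed by delay_seconds."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)  # the injected fault
        return func(*args, **kwargs)
    return wrapper


def call_with_timeout(func, timeout_seconds):
    """Run func in a worker thread and return (ok, result).

    ok is False and result is None if func did not finish in time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return True, future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block the caller on the slow call.
        pool.shutdown(wait=False)
```

Calling `call_with_timeout(query_database, 0.5)` succeeds, while wrapping the same function with `with_injected_latency(query_database, 0.3)` and allowing only 0.05 seconds shows whether the caller's timeout path actually triggers.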

With the introduction of the new service, AWS will simplify the process of running chaos experiments in the cloud, he said. Built to be easily set up and integrated, FIS is designed to run controlled chaos engineering experiments across a wide range of AWS services. “That includes [introducing] control plane level faults, such as API throttling and server errors. And FIS makes it easy to run safe experiments. We built it to follow the typical chaos experimental workflow where you understand your steady state, set your hypotheses, inject faults and monitor your application.”

When the experiment is over, FIS tells users whether their hypothesis was confirmed, and they can then use the collected data to decide where improvements must be made, he said.
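The workflow described above (measure the steady state, set a hypothesis, inject a fault, monitor, then report whether the hypothesis held) can be sketched in a few lines of Python. This is an illustration of the general idea only, not the FIS API; `ToyService`, `run_experiment` and the latency figures are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    baseline: float
    observed: List[float]
    hypothesis_confirmed: bool


class ToyService:
    """Hypothetical system under test with a switchable fault."""

    def __init__(self) -> None:
        self.degraded = False

    def latency_ms(self) -> float:
        return 120.0 if self.degraded else 100.0

    def degrade(self) -> None:
        self.degraded = True

    def restore(self) -> None:
        self.degraded = False


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    tolerance: float,
    samples: int = 3,
) -> ExperimentResult:
    baseline = measure()                # 1. understand the steady state
    limit = baseline * (1 + tolerance)  # 2. hypothesis: metric stays under limit
    inject_fault()                      # 3. inject the fault
    try:
        observed = [measure() for _ in range(samples)]  # 4. monitor
    finally:
        remove_fault()                  # always undo the fault
    confirmed = all(value <= limit for value in observed)
    return ExperimentResult(baseline, observed, confirmed)
```

With `svc = ToyService()`, running `run_experiment(svc.latency_ms, svc.degrade, svc.restore, tolerance=0.5)` confirms the hypothesis, while a stricter `tolerance=0.1` does not, which is the signal a team would use to decide where to improve.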

Dr. Werner Vogels, VP and CTO of AWS

“FIS removes the barriers to adopting chaos engineering,” said Vogels. “Fault Injection Simulator provides the controls and guardrails that teams need to run experiments in production, such as automatically rolling back or stopping the experiment if specific conditions are met.”
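The guardrail idea in that quote, stopping and rolling back when a condition is met, might look roughly like the following Python sketch. It is a generic illustration under stated assumptions, not how FIS implements its stop conditions; `run_with_guardrail`, `make_error_step` and the error-rate threshold are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_with_guardrail(
    steps: Iterable[Callable[[], None]],
    rollback: Callable[[], None],
    stop_condition: Callable[[], bool],
) -> Tuple[str, int]:
    """Apply fault steps in order; halt early if stop_condition() is true.

    The rollback runs in both cases. Returns a status string
    ("completed" or "stopped") and the number of steps applied.
    """
    applied = 0
    for step in steps:
        step()
        applied += 1
        if stop_condition():
            rollback()
            return "stopped", applied
    rollback()
    return "completed", applied


def make_error_step(state: Dict[str, float], amount: float) -> Callable[[], None]:
    """Hypothetical fault step that raises a simulated error rate."""
    def step() -> None:
        state["error_rate"] += amount
    return step
```

For example, with `state = {"error_rate": 0.0}`, three steps of `make_error_step(state, 0.02)` and a stop condition of `state["error_rate"] > 0.03`, the run halts after the second step and the rollback resets the state before more damage is done.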

Users don’t need to be experts to use FIS or to incorporate it into their development workflows, he said. It can be used with services such as Amazon Elastic Compute Cloud (EC2), Amazon Relational Database Service (RDS), Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS) and others.

“Out of all the things we’re announcing this year, this is the one that I’m most excited about,” said Vogels. “By offering this as a service, I believe that we can have a massive, positive impact on building more robust, more durable and more dependable systems in the cloud.”

At its core, chaos engineering stresses an application in a testing or production environment by creating disruptive events, such as server outages or API throttling. It then observes how the system responds and shows improvements that can be made. Chaos engineering essentially simulates real-world conditions that can be used to uncover hidden issues, blind spots and performance bottlenecks that are difficult to find in distributed systems, according to AWS.

FIS lets teams set up experiments using pre-built templates that generate the desired disruptions.

Other recent announcements at AWS re:Invent 2020 include new machine learning tools and services for enterprise, manufacturing and industrial customers, as well as upcoming managed services for visualizing logs and metrics and for machine data storage and monitoring.


