Chaos Engineering Tests: Gremlin GameDay Lessons
Now more than ever, it's important to stress test software. People rely heavily on these solutions for everything from company meetings to grocery delivery. The ability to keep up with turbulent conditions or increased demands is critical.
Fortunately, there’s a way to test for stability before things break. It’s called chaos engineering: the practice of deliberately injecting failures into your software and systems through controlled exercises to confirm they can handle unpredictable conditions. Chaos engineering allows you to expose weaknesses in your systems before they lead to a crisis in production. Additionally, chaos engineering tests deepen a team's understanding of software systems and how they behave. This makes them a great tool even if you’re not bracing for a surge. We practice chaos engineering internally to gain visibility into vulnerabilities before they become issues.
To facilitate chaos engineering at cloudtamer.io, we worked with Gremlin, an expert chaos engineering platform, to test the operational excellence and reliability of our software through a variety of conditions. For one full day in January 2020 (our Gremlin GameDay™), our engineers tested cloudtamer.io in support of one of our major federal customers. Then, we repeated some of the tests one month later to see how performance improved.
The results of creating a chaos engineering case study on our very own software were astounding. This is what we found.
We ran five different failure scenarios (called test cases) during our Gremlin GameDay. For this article, we are focusing on the first three: Amazon EC2 instance terminations, container instance terminations, and availability zone failures. We performed each of these tests at several different magnitudes to account for different degrees of instance termination, packet loss, or latency. Based on the results, we made tweaks to our software and processes. In February, we ran through the three test cases again to check for improvements.
Chaos Engineering Results
Failure Scenario #1: Amazon EC2 Instance Termination
First, we tested Amazon EC2 instance terminations using Gremlin’s “Shutdown” test.
- Expected Behavior: We expected single-node deployments to take 3 to 5 minutes to create the new node. Users would experience a service outage.
- Actual Performance: The node did not shut down as expected. The Auto Scaling group (ASG) took eight minutes to detect and terminate the unhealthy node. Two minutes after the new node came up, the UI was available.
- Steps Taken to Improve Process: We looked for long-running/blocking services and threads to tighten up the shutdown process.
- Behavior When Re-Tested in February: The node was flagged as unhealthy after about 1 minute, and the ASG began instance termination. A new node launched 2 minutes after the attack began, and it was in service 5 minutes after the attack began. We concluded that the changes we made after testing in January were successful, as results were now aligned with our original expectations.
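The article does not say how these timings were captured. One way to get comparable numbers, sketched here under the assumption that you poll the UI or a health endpoint at a fixed interval during the attack, is to turn the probe results into outage windows. The function name and the sample data are hypothetical and for illustration only; they are not measurements from the GameDay.

```python
from typing import Iterable, List, Tuple

# One probe result: (seconds since the attack started, did the probe succeed?)
Sample = Tuple[float, bool]


def outage_windows(samples: Iterable[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each stretch of failed probes.

    start is the time of the first failed probe in a stretch and end is
    the time of the next successful probe. A stretch that is still
    failing when the data ends is closed at the last sample's time.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_t = None
    for t, ok in sorted(samples):
        last_t = t
        if not ok and start is None:
            start = t  # outage begins
        elif ok and start is not None:
            windows.append((start, t))  # outage ends
            start = None
    if start is not None:
        windows.append((start, last_t))
    return windows


# Synthetic probe results (hypothetical, 30-second polling interval).
samples = [(0, True), (30, True), (60, False), (90, False),
           (120, False), (150, True), (180, True)]
windows = outage_windows(samples)
total_downtime = sum(end - start for start, end in windows)
print(windows, total_downtime)
```

Recording the same probe data in January and February would make the before-and-after comparison a matter of comparing two downtime totals rather than reading timestamps off a console.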
Failure Scenario #2: Container Instance Termination
Next, we tested container instance termination. We performed our tests in three different scenarios: termination of the ingress microservice, termination of the webui microservice, and termination of the webapi microservice. This was done using Gremlin’s “Shutdown” test.
- Expected Behavior: For all scenarios, it was expected that Docker would recreate the container almost immediately, with very little impact on the user. In the case of ingress termination and webapi termination, the user may get a red toast message that says, “cannot connect to cloudtamer.io.” Killing the webui microservice should cause the UI to hang in the browser, which would be resolved by a refresh.
- Actual Performance: In the ingress termination scenario, Docker relaunched the container in 24 seconds, with no impact on the UI. In the webui and webapi scenarios, the container was recreated nearly immediately, with no impact on the UI. We did, however, hit one snag with the webapi scenario: the logs did not generate when we tried sending API calls with the Postman API tool.
- Steps Taken to Improve Process: The outcome was better than expected for all three scenarios, so we focused on how to capture web API logs after sending API calls with Postman.
- Behavior When Re-Tested in February: Once again, there was no impact on the UI for ingress termination. We did not proceed with the other containers, as we did not change them.
Failure Scenario #3: Availability Zone Degradation
The third test evaluated our availability zone degradation/failure. We tested this at various levels of packet loss using Gremlin’s “Packet Loss” test.
- Expected Behavior: The load balancer should detect the unhealthy node and direct traffic to the remaining healthy nodes. There should be no impact to the user in the UI.
- Actual Performance: At 5%, 15%, 25%, and 50% packet loss, the load balancer did not flag the instance as unhealthy. The UI seemed somewhat slower to load pages at 5% packet loss. There was a very noticeable UI speed degradation for 15%, 25%, and 50% packet loss. At 100% packet loss, the load balancer flagged the affected node as unhealthy and took it out of service in about 2 minutes. The ASG health checks did not fail, so the load balancer did not rotate the nodes.
- Steps Taken to Improve Process: We changed the ASG health checks to use the load balancer health check rather than that of EC2 to ensure the node rotation is triggered in future tests.
- Behavior When Re-Tested in February: The results had improved since January, with no issues in the UI at 5% packet loss, slowness after a 45-second delay at 15% packet loss, and some slowness at 75% packet loss. 100% packet loss resulted in the node being flagged as unhealthy after about 1 minute. In the last case, the load balancer took the node out of service and rerouted to a healthy node. The timeout for the external load balancer health check is currently 4 seconds; based on this test, we decided to explore a shorter one. Otherwise, we were happy with this improvement in results.
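The article does not say which tool was used to change the ASG health check. As one possible sketch, assuming the AWS SDK for Python (boto3), the switch from EC2 status checks to the load balancer health check is a single call to `update_auto_scaling_group` with `HealthCheckType` set to `"ELB"`. The group name and grace period below are placeholders, not values from the GameDay.

```python
from typing import Any, Dict


def elb_health_check_params(asg_name: str, grace_period_s: int) -> Dict[str, Any]:
    """Build the arguments that switch an ASG to load balancer health checks.

    With HealthCheckType "ELB", the ASG also replaces instances that the
    load balancer reports as unhealthy, not only those failing EC2 checks.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",
        # Seconds to wait after launch before health checks count.
        "HealthCheckGracePeriod": grace_period_s,
    }


def apply_health_check(client: Any, params: Dict[str, Any]) -> None:
    """Send the update using a boto3 Auto Scaling client."""
    client.update_auto_scaling_group(**params)


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   client = boto3.client("autoscaling")
#   apply_health_check(client, elb_health_check_params("example-asg", 300))
```

Keeping the parameter construction separate from the API call makes the change easy to review and to exercise against a stub client before touching a real environment.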
Chaos Engineering Tips
We learned a lot about our own application and the value of well-planned chaos engineering during our Gremlin GameDays. We also learned a lot about how to run more successful chaos engineering experiments in the future.
Thinking about planning your own chaos engineering experiment? Here are our suggestions:
- Preparation is key! Preparing for a GameDay is as important as running the actual GameDay scenarios, if not more so. Our engineers worked carefully to define the targets, date/time and place, assumptions, test cases and scope, and goals. Knowing these elements made the GameDay run smoothly and efficiently, allowing us to get the most out of it.
- Make sure you have a clear understanding of the application infrastructure. Our DevOps team came to GameDay with a drafted architecture diagram, giving us a clear picture of what we were attempting to break in our scenarios. The architecture allowed us to visually depict areas worth testing, and it aligned everyone's understanding of the latest build of our system/application.
- Record your expectations before you start. Before and after we executed each test case, we asked ourselves questions to round out our expectations for the test. Anytime a scenario didn't go as expected, it became an opportunity for us to reevaluate our setup and come up with a more resilient plan and, ultimately, a more resilient application. Some of the questions we asked ourselves before and after each test case were:
  - What do we expect to happen for the test case?
  - Is the behavior what we expected?
  - What would the customer see if this were to happen?
  - What's happening to systems upstream or downstream?
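One lightweight way to keep these answers together is a small record per test case, filled in before the attack and completed afterward. The sketch below is an assumption about how such a record could look, not the template used during the GameDay; the class and field names are hypothetical. The example values are taken from Failure Scenario #1 above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GameDayCase:
    # Filled in before the test runs.
    name: str
    expected_behavior: str
    expected_customer_impact: str
    upstream_downstream_notes: str = ""
    # Filled in after the test runs.
    actual_behavior: Optional[str] = None
    matched_expectation: Optional[bool] = None
    follow_ups: List[str] = field(default_factory=list)

    def record_outcome(self, actual: str, matched: bool) -> None:
        """Store what happened and whether it matched the expectation."""
        self.actual_behavior = actual
        self.matched_expectation = matched

    def needs_review(self) -> bool:
        """True only when an outcome was recorded and it did not match."""
        return self.matched_expectation is False


case = GameDayCase(
    name="Amazon EC2 instance termination",
    expected_behavior="New node created in 3 to 5 minutes",
    expected_customer_impact="Service outage on single-node deployments",
)
case.record_outcome(
    "ASG took eight minutes to detect and terminate the unhealthy node",
    matched=False,
)
case.follow_ups.append("Look for long-running/blocking services and threads")
print(case.needs_review())
```

Writing the expectation down first keeps the comparison honest: a case that needs review is a mismatch between two recorded statements, not a recollection.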
Give It A Try!
GameDay allowed us to catch incidents in a controlled environment, conduct reviews, and implement a better system, all without disturbing customer accounts. We highly recommend that other organizations try the chaos engineering approach.
About the author: Hasnat is a senior solutions delivery manager at cloudtamer.io.