ChaosToolkit Steady-State Verification Strategies Explained
1. Overview
Defining the right steady-state hypothesis is the first step towards a successful chaos experiment. The second step is selecting the right verification strategy.
The goal of a controlled chaos experiment isn’t breaking random things in the system. After all, no business owner in their right mind would allocate a budget to “break” their infrastructure.
As chaos engineers, we’re interested in learning how the system reacts to certain real-world events. So, to avoid unwanted disruptions, when we develop an experiment it’s important to ask WHEN is the right time to bail out.
In this article, we’ll examine the different strategies for hypothesis verification available in ChaosToolkit so we can choose the right one for our experiments.
2. One-Time vs Continuous Verification
The first thing we need to decide about hypothesis verification is whether we want to run it just once or continuously. This largely depends on the type of experiment we’re running and the context we’re running it in. Here are some things to consider before making this choice:
- Can we live with the consequences of a temporary failure in the target system, or is it essential to cancel the experiment as soon as an unforeseen problem arises?
- Do the effects caused by introducing a chaotic event in the system need time to self-heal? Or should this be transparent to an outside observer? For example: if we kill a container, we will lose some requests in the process, but if we simulate a spike in traffic, we can expect the service to scale and handle all requests correctly.
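Whichever we choose, the verification strategy isn’t part of the experiment file itself: we select it at run time through the chaos run command. For quick reference, these are the five values accepted by the --hypothesis-strategy option, all of which we’ll cover below:
chaos run --hypothesis-strategy=default experiment.yaml
chaos run --hypothesis-strategy=before-method-only experiment.yaml
chaos run --hypothesis-strategy=after-method-only experiment.yaml
chaos run --hypothesis-strategy=during-method-only experiment.yaml
chaos run --hypothesis-strategy=continuously experiment.yaml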
3. Before And After Hypothesis Verification
This type of verification is best suited when the effects of the real-world events introduced by our chaos experiment cause an observable change that is part of the verification. Take this experiment, for instance:
title: "Verify if service can recover from container loss"
description: |
This experiment is designed to verify if an AWS ECS service can recover
from the loss of a percentage of its running containers timely.
configuration: [...]
steady-state-hypothesis:
title: "All container replicas should be online"
probes:
- type: probe
name: "desired-tasks-should-be-running"
tolerance: true
provider:
type: python
module: chaosaws.ecs.probes
func: are_all_desired_tasks_running
arguments:
cluster: "${cluster_arn}"
service: "${service_name}"
method:
- type: action
name: "service-loses-half-capacity"
provider:
type: python
module: chaosaws.ecs.actions
func: stop_random_tasks
arguments:
cluster: "${cluster_arn}"
service: "${service_name}"
task_percent: 50
reason: "${stop_reason_msg}"
pauses:
after: ${allowed_recovery_time_seconds}
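Note that the configuration section is elided for brevity: it’s where substitution variables like ${cluster_arn} and ${service_name} get their values. As a minimal sketch, assuming we read them from environment variables (the variable names here are hypothetical):
configuration:
  # Hypothetical environment variable names; adjust to your setup
  cluster_arn:
    type: env
    key: ECS_CLUSTER_ARN
  service_name:
    type: env
    key: ECS_SERVICE_NAME
  # Plain values can also be set inline
  stop_reason_msg: "Stopped by a controlled chaos experiment"
  allowed_recovery_time_seconds: 120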
In this experiment, we verify that a certain application can self-heal when it loses half of its containers. Unfortunately, re-spawning the lost containers takes time, so there’s no point in verifying the hypothesis during the method execution because the experiment would surely fail.
The correct verification strategy, in this case, is the default one: ChaosToolkit verifies the hypothesis before and after the experiment’s method.
chaos run experiment.yaml
# or
chaos run --hypothesis-strategy=default experiment.yaml
The before verification guarantees that the system is stable before we start experimenting with it. We don’t want to run experiments on systems that are already unstable. So when a before verification fails, the experiment is immediately stopped to guarantee we’re not piling on existing issues.
The after verification is necessary to verify the system can indeed self-heal and repair the application by spawning additional instances to match the desired container count.
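This also makes the default strategy convenient in automation: chaos run signals a failed or deviated experiment with a non-zero exit code and records the run in a journal.json file, so a CI step can react to it. A minimal sketch, assuming that behaviour:
# Fail the pipeline step when the experiment deviates
if ! chaos run experiment.yaml; then
  echo "Steady state not verified - see journal.json for details"
  exit 1
fi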
Timing in this experiment is also important: we run the second verification after the ${allowed_recovery_time_seconds} pause, as this is the maximum recovery time we accept for this application.
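As a side note, the pauses block accepts a before key in addition to after, which is useful when an action needs the system to settle before it runs. A minimal sketch:
pauses:
  # Wait before the action runs
  before: 5
  # Wait after the action, before the final verification
  after: ${allowed_recovery_time_seconds}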
4. One-Time Hypothesis Verification
There are experiments where the hypothesis should only be verified once per run. This is the case with the before-method-only and after-method-only strategies:
title: "Verify if service can update its container count quickly"
description: |
This experiment is designed to verify if an AWS ECS service can update
its containers count and spawn up new instances in the allowed start time.
configuration: [...]
steady-state-hypothesis:
title: "All container replicas should be online"
probes:
- type: probe
name: "desired-tasks-should-be-running"
tolerance: true
provider:
type: python
module: chaosaws.ecs.probes
func: are_all_desired_tasks_running
arguments:
cluster: "${cluster_arn}"
service: "${service_name}"
# Set a new value for desired containers
method:
- type: action
name: "update-service-container-count"
provider:
type: python
module: chaosaws.ecs.actions
func: update_desired_count
arguments:
cluster: "${cluster_arn}"
service: "${service_name}"
desired_count: ${new_desired_count}
pauses:
after: ${allowed_start_time_seconds}
This is a slight variation of the previous example. The experiment ensures the application can scale out instances correctly.
In this case, running the verification at the start of the experiment doesn’t help, as we need to first run the method to set the new desired container count for the service.
In these cases, we only want to check the hypothesis after the method, so the correct verification strategy is after-method-only:
chaos run --hypothesis-strategy=after-method-only experiment.yaml
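Symmetrically, when we only care that the system is healthy before we introduce the chaotic event, we can verify the hypothesis once, up front, with the before-method-only strategy:
chaos run --hypothesis-strategy=before-method-only experiment.yaml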
5. Continuous Hypothesis Verification
Being able to run continuous verifications on a system creates a lot of interesting possibilities, and in ChaosToolkit we can achieve that by using either the continuously or the during-method-only strategy. Take this experiment as an example:
title: "Verify if service can sustain increase in traffic"
description: |
This experiment is designed to verify if an AWS ECS service can sustain an
increase in traffic without failing.
configuration: [...]
steady-state-hypothesis:
title: "Service must always respond"
probes:
- type: probe
name: "service-must-respond"
tolerance:
type: "range"
range: [200, 299]
target: "status"
provider:
type: http
url: "${service_url}/health"
method: "GET"
timeout: 5
# Simulate some traffic in the environment
# using Grafana K6
method:
- type: action
name: "stress-endpoint-with-simulated-traffic"
provider:
type: python
module: chaosk6.actions
func: stress_endpoint
arguments:
endpoint: ${service_url}
vus: ${stress_users}
duration: ${stress_duration}
This experiment verifies that the target service can handle a sudden increase in traffic, and the steady-state hypothesis confirms that health checks return a success HTTP status code.
The experiment would pass with just the default before-and-after verification, but checking the service continuously is much more valuable in this case, as it ensures the additional traffic isn’t affecting the application at any point during the test.
We can run the experiment using the continuously hypothesis strategy:
chaos run --hypothesis-strategy=continuously experiment.yaml
ChaosToolkit will now run the hypothesis verification every few seconds, and if any of the verifications fail, the entire experiment will be marked as deviated.
The frequency (in seconds) at which the hypothesis verification runs can be customised using the --hypothesis-frequency option:
chaos run \
--hypothesis-strategy=continuously \
--hypothesis-frequency=2 \
experiment.yaml
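The same frequency option applies to the other continuous strategy, during-method-only, which we’ll compare with continuously next:
chaos run \
  --hypothesis-strategy=during-method-only \
  --hypothesis-frequency=2 \
  experiment.yaml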
5.1. Continuously vs During-method-only
Yes, there are two options for continuous verification, and the difference is subtle:
| | continuously | during-method-only |
|---|---|---|
| before method | Yes | No |
| during method | Yes | Yes |
| after method | Yes | No |
The during-method-only strategy runs continuous verifications only throughout the method execution. This means that to use this strategy, we must accept that the experiment’s method will run even on an unstable system.
In fact, the biggest difference between the two strategies is that with the continuously option, we still run a verification before and after the method, giving us a chance to bail out of the experiment if we’re already experiencing issues with the service. The during-method-only strategy, on the other hand, will ALWAYS run the experiment’s method.
5.2. Fail-fast option
Both continuous verification strategies support the additional --fail-fast flag. This option does exactly what it says: it stops the experiment as soon as the probe verification fails for the first time. To use it:
chaos run --hypothesis-strategy=continuously --fail-fast experiment.yaml
A word of warning, though: you may think that using continuously or during-method-only with --fail-fast will ultimately have the same effect, because if the first verification fails, both experiments will stop. It DOES NOT. ⚠️
I encourage you to try these two combinations yourself to fully understand the execution flow. Still, the key lesson is that during-method-only will always enter the experiment’s method and, at the very least, run its first step. For this reason, we can’t guarantee that the --fail-fast and during-method-only combination won’t affect an already unstable system.
In contrast, the continuously strategy will fail the experiment entirely if the first verification fails, qualifying as the “safer” option.
6. Conclusion
In this article, we discussed the different hypothesis verification strategies for ChaosToolkit experiments and when it’s appropriate to use each.
- default verification before and after the experiment’s method
- one-time verification with before-method-only and after-method-only
- continuous verification with continuously and during-method-only
The code examples used in this article are available over on GitHub.