How to Implement Safeguards and Run Chaos Experiments Safely
1. Overview
Experimenting in production has the potential to cause unnecessary customer pain. […] it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments are minimized and contained.
In this article, we’ll learn how to use safeguards in Chaos Toolkit to immediately interrupt an experiment execution before it seriously affects the target system.
Whether you’re injecting chaos events on a production system or still experimenting with a test environment, Chaos Toolkit safeguards can be incredibly useful to stop experiments as soon as possible.
2. Safeguards vs Steady-State Hypothesis
A steady-state hypothesis can also be used to bail out of an experiment early. In fact, in Chaos Toolkit, we can use a verification strategy to check the steady state of the system continuously and exit as soon as the first verification fails (more on hypothesis strategies in a previous article).
So how are safeguards different from a steady-state hypothesis?
Safeguards are experiment controls
Controls in Chaos Toolkit operate outside of the experiment context and can be used to modify its execution flow (like interrupting the experiment early in this case). Because they’re not part of the experiment, they can be defined in separate files and activated for multiple experiments using command line options.
Safeguards can keep an eye on the system as a whole
While a steady-state hypothesis is usually relevant to a single service or sub-system, safeguards can define probes to check the system as a whole. For instance, while we simulate a traffic spike on a frontend service, we could use a safeguard to ensure the underlying database CPU is still in check.
3. Using Experiment Safeguards
The safeguards control is part of the chaosaddons module, so to use it in our experiments, we first need to install it:
pip install -U chaostoolkit-addons
Then all we need to do is add our safeguards to the controls section, for example:
# ...
controls:
  # This safeguards control includes two probes:
  - name: "my-experiment-safeguards"
    provider:
      type: python
      module: chaosaddons.controls.safeguards
      arguments:
        probes:
          # The first probe verifies the container called `sampleapp-back`
          # is running before starting the experiment
          - type: probe
            name: "backend-container-must-be-running"
            tolerance:
              type: regex
              target: stdout
              pattern: sampleapp-back
            provider:
              type: process
              path: "docker"
              arguments: "container ps --filter name=sampleapp-back --format '{{lower .Names}}'"
          # The second probe verifies the backend application remains
          # responsive throughout the test
          - type: probe
            name: "backend-must-be-healthy"
            frequency: 2 # <<<< run continuously
            background: true # <<<<
            tolerance: 200
            provider:
              type: http
              url: "http://localhost:8999/health"
              method: "GET"
              timeout: 2
# ...
Safeguards are nothing more than a list of probes that verify the system’s condition. In the example above, we don’t want the backend service to be affected by chaos events, so we stop the experiment as soon as that happens.
Probes defined in safeguards can affect the experiment execution in different ways, depending on how we configure two attributes, background: bool and frequency: float:
Attribute | Execution time | Effect |
---|---|---|
None | Blocking, before the experiment begins | When no attributes are set, the probe can prevent the experiment from starting |
frequency:N | Blocking, runs continuously every N seconds | Probe will not stop experiment start but will interrupt the execution on failure |
background:bool | Non-blocking, runs as soon as possible in the background after the experiment starts | Probe will not prevent the experiment from starting but can stop execution early |
frequency:N and background:bool | Non-blocking, runs continuously in the background every N seconds | Just like frequency but running activity in the background |
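The four rows of the table can be read as a small decision rule. The sketch below only mirrors that table: describe_probe_mode is a hypothetical helper written for this article, not a function from chaosaddons, and the returned labels are my own wording.

```python
from typing import Any, Dict


def describe_probe_mode(probe: Dict[str, Any]) -> str:
    """Map a safeguard probe definition to the table row it falls under.

    Hypothetical helper for illustration only; not part of chaosaddons.
    """
    has_frequency = probe.get("frequency") is not None
    in_background = bool(probe.get("background", False))

    if has_frequency and in_background:
        return "continuous, in the background"
    if has_frequency:
        return "continuous"
    if in_background:
        return "once, in the background"
    # Neither attribute set: the probe runs once and gates the experiment start
    return "once, blocking, before the experiment"


container_probe = {"name": "backend-container-must-be-running"}
health_probe = {"name": "backend-must-be-healthy", "frequency": 2, "background": True}

print(describe_probe_mode(container_probe))
print(describe_probe_mode(health_probe))
```

With the two probes from the example, the first falls in the "no attributes" row and the second in the last row.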
3.1. Block experiments at the start
The most dangerous thing a chaos experiment can do is introduce faults to an unstable system. We can use safeguard probes to block experiments at the start by not setting the frequency and background attributes.
According to the table above, the probe verification will block the experiment execution until it’s done verifying:
controls:
  - name: "my-experiment-safeguards"
    provider:
      type: python
      module: chaosaddons.controls.safeguards
      arguments:
        probes:
          - type: probe
            name: "backend-container-must-be-running"
            tolerance:
              type: regex
              target: stdout
              pattern: sampleapp-back
            provider:
              type: process
              path: "docker"
              arguments: "container ps --filter name=sampleapp-back ...
          - [...]
As a result, if the container called sampleapp-back is not running, the experiment will immediately exit and be marked as interrupted:
chaos run --rollback-strategy=always experiment.yaml
# ...
# [CRITICAL] Safeguard 'backend-container-must-be-running' triggered the end of the experiment
# [INFO] Steady state hypothesis: Verify server is available
# [WARNING] Received the exit signal: 20
# ...
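To see why the probe uses a regex tolerance on stdout, it helps to picture the check itself. The following is a minimal sketch, not the Chaos Toolkit implementation: stdout_matches is a hypothetical function, and the shape of the result dictionary (status, stdout, stderr) is an assumption about what a process probe captures.

```python
import re
from typing import Any, Dict


def stdout_matches(result: Dict[str, Any], pattern: str) -> bool:
    """Return True when `pattern` is found anywhere in the captured stdout.

    `result` is assumed to look like {"status": int, "stdout": str, "stderr": str}.
    Hypothetical helper; the real tolerance check lives inside Chaos Toolkit.
    """
    stdout = str(result.get("stdout", ""))
    return re.search(pattern, stdout) is not None


# Container is up: the filtered listing prints its name
running = {"status": 0, "stdout": "sampleapp-back\n", "stderr": ""}
# Container is down: the listing is empty even though the command itself succeeded
stopped = {"status": 0, "stdout": "", "stderr": ""}

print(stdout_matches(running, "sampleapp-back"))
print(stdout_matches(stopped, "sampleapp-back"))
```

The second case is the interesting one: a filtered container listing that matches nothing still exits successfully, so the exit status alone would not tell us the container is missing.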
3.2. Run safeguards continuously
We can use safeguards to verify the state of the system continuously throughout the experiment execution by defining the frequency attribute for a probe:
controls:
  - name: "my-experiment-safeguards"
    provider:
      type: python
      module: chaosaddons.controls.safeguards
      arguments:
        probes:
          - [...]
          - type: probe
            name: "backend-must-be-healthy"
            frequency: 2 # <<<< run continuously
            tolerance: 200
            provider:
              type: http
              url: "http://localhost:8999/health"
              method: "GET"
              timeout: 2
In this example, frequency: 2 will run a verification every 2 seconds until the end of the experiment. As soon as a verification fails, the experiment will be immediately interrupted:
chaos run --rollback-strategy=always experiment.yaml
# ...
# [CRITICAL] Safeguard 'backend-must-be-healthy' triggered the end of the experiment
# [WARNING] Received the exit signal: 20
# ...
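Conceptually, a continuous safeguard is a watcher that repeats a check on a timer and raises a flag on the first failure. The sketch below illustrates that idea with plain Python threads. It is not how chaosaddons is implemented: watch and run_with_safeguard are hypothetical names, and unlike the real control, this simplified version only notices the flag between steps instead of interrupting a running step.

```python
import threading
from typing import Callable, List, Sequence, Tuple


def watch(check: Callable[[], bool], frequency: float,
          stop: threading.Event, interrupted: threading.Event) -> None:
    """Call `check` every `frequency` seconds until it fails or `stop` is set."""
    while not stop.is_set():
        if not check():
            interrupted.set()  # tell the main flow to bail out
            return
        stop.wait(frequency)  # interruptible sleep between two checks


def run_with_safeguard(check: Callable[[], bool], frequency: float,
                       steps: Sequence[Tuple[str, Callable[[], None]]]
                       ) -> Tuple[List[str], bool]:
    """Run `steps` in order and skip the remaining ones once the check fails.

    Returns the names of the executed steps and whether the safeguard tripped.
    """
    stop, interrupted = threading.Event(), threading.Event()
    watcher = threading.Thread(
        target=watch, args=(check, frequency, stop, interrupted), daemon=True
    )
    watcher.start()
    executed: List[str] = []
    try:
        for name, step in steps:
            if interrupted.is_set():
                break
            step()
            executed.append(name)
    finally:
        stop.set()      # make sure the watcher ends with the experiment
        watcher.join()
    return executed, interrupted.is_set()
```

Here, check stands in for the HTTP health probe and frequency for the frequency attribute. In a real experiment the control takes care of all of this; the sketch is only meant to show where the periodic check sits relative to the experiment steps.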
3.3. Background verifications
In Chaos Toolkit experiments, the background: bool option will tell the framework to run the activity in the background and allow the rest of the steps to proceed concurrently.
Using this feature is sensible if your safeguard verifications can take a long time to complete. Of course, what qualifies as “a long time” depends entirely on the use case: sometimes a 10-second wait is acceptable; in other cases, it can make or break a test.
Generally, I avoid using background steps as much as possible to keep the execution flow simple and predictable. If I can live with the wait, I’ll accept it.
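The trade-off is easier to see side by side. The sketch below compares waiting for a slow verification with running it in the background, using a thread pool from the standard library. All names are hypothetical, and slow_verification is just a stand-in for a probe that takes a while to answer; this is an illustration of the ordering, not of Chaos Toolkit internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def slow_verification(delay: float) -> bool:
    """Stand-in for a safeguard probe that takes `delay` seconds to answer."""
    time.sleep(delay)
    return True


def run_blocking(delay: float, steps: Sequence[str]) -> List[str]:
    """Wait for the verification to finish before recording any step."""
    order: List[str] = []
    if slow_verification(delay):
        order.append("verified")
    order.extend(steps)
    return order


def run_in_background(delay: float, steps: Sequence[str]) -> List[str]:
    """Start the verification, record the steps right away, then collect the result."""
    order: List[str] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_verification, delay)
        order.extend(steps)   # steps proceed while the verification runs
        if future.result():   # blocks only here, after the steps
            order.append("verified")
    return order


print(run_blocking(0.05, ["inject-fault", "measure"]))
print(run_in_background(0.05, ["inject-fault", "measure"]))
```

In the blocking version, nothing happens until the verification has answered. In the background version, the steps are already under way by the time the result comes in, which is exactly why the execution flow becomes harder to reason about.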
4. Reuse Safeguards For Multiple Experiments
Like any other control, safeguards can be defined outside the experiment file and added to existing experiments at run-time using the command line. Let’s see an example.
We create a new file called safeguards.yaml with the following content:
backend-safeguards:
  provider:
    type: python
    module: chaosaddons.controls.safeguards
    arguments:
      probes:
        - type: probe
          name: "backend-container-must-be-running"
          tolerance:
            type: regex
            target: stdout
            pattern: sampleapp-back
          provider:
            type: process
            path: "docker"
            arguments: "container ps --filter name=sampleapp-back --format '{{lower .Names}}'"
        - type: probe
          name: "backend-must-be-healthy"
          frequency: 2
          background: true
          tolerance: 200
          provider:
            type: http
            url: "http://localhost:8999/health"
            method: "GET"
            timeout: 2
This definition is the same as the example above, but stored in a control file. We can then run the experiment and apply the external control using this syntax:
chaos run \
--rollback-strategy=always \
--control-file safeguards.yaml \
experiment.yaml
For more information on how to apply controls defined outside of experiment templates, check out my previous article.
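Since the control file is independent of any single experiment, the same command can be repeated for a whole batch. Here is a minimal sketch of a wrapper script that does so, using only the flags shown above. The function names and the second experiment file name are hypothetical, and the runner parameter exists only so the loop can be tried without actually invoking chaos.

```python
import shlex
import subprocess
from typing import Callable, Iterable, List


def build_chaos_command(experiment: str,
                        control_file: str = "safeguards.yaml") -> List[str]:
    """Build the `chaos run` command for one experiment with a shared control file."""
    return [
        "chaos", "run",
        "--rollback-strategy=always",
        "--control-file", control_file,
        experiment,
    ]


def run_experiments(experiments: Iterable[str],
                    control_file: str = "safeguards.yaml",
                    runner: Callable = subprocess.run) -> List[str]:
    """Run each experiment in turn and return the commands that were issued."""
    issued: List[str] = []
    for experiment in experiments:
        command = build_chaos_command(experiment, control_file)
        # check=False: keep going even if one run fails or is interrupted
        runner(command, check=False)
        issued.append(shlex.join(command))
    return issued
```

Calling run_experiments(["experiment.yaml", "another-experiment.yaml"]) would then apply the same safeguards to both files. Whether to continue after an interrupted run is a design choice; stopping at the first interruption may be the safer default on a production system.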
5. Safeguards In Practice
If you’re interested in testing safeguards for yourself, we’ll now reproduce the experiment described in the article.
Run the demo application locally
First, we need to have an application to experiment with. You can run my sample frontend-backend application locally with Docker following the steps described in the GitHub repo.
Create the experiment file
Here’s the full example of the experiment including safeguards:
title: "ChaosToolkit experiment with Safeguards"
description: |
  This experiment is an example of how to use `safeguards` to ensure we
  bail out of an experiment early when an unwanted effect is introduced.
controls:
  - name: "backend-safeguards"
    provider:
      type: python
      module: chaosaddons.controls.safeguards
      arguments:
        probes:
          - type: probe
            name: "backend-container-must-be-running"
            tolerance:
              type: regex
              target: stdout
              pattern: sampleapp-back
            provider:
              type: process
              path: "docker"
              arguments: "container ps --filter name=sampleapp-back --format '{{lower .Names}}'"
          - type: probe
            name: "backend-must-be-healthy"
            frequency: 2
            background: true
            tolerance: 200
            provider:
              type: http
              url: "http://localhost:8999/health"
              method: "GET"
              timeout: 2
steady-state-hypothesis:
  title: "Verify server is available"
  probes:
    - type: probe
      name: "server-must-respond-200"
      tolerance: 200
      provider:
        type: http
        url: "http://localhost:8000"
        method: "GET"
        timeout: 3
method:
  # Simulate some user traffic using Grafana k6
  - type: action
    name: "stress-endpoint-with-simulated-traffic"
    provider:
      type: python
      module: chaosk6.actions
      func: stress_endpoint
      arguments:
        endpoint: "http://localhost:8000"
        vus: 1
        duration: "10s"
  # Simulate failure on backend container
  - type: action
    name: "fail-backend-container"
    provider:
      type: http
      url: "http://localhost:8999/_chaos/fail"
      method: "POST"
      timeout: 2
  # Stress the frontend application some more
  - ref: "stress-endpoint-with-simulated-traffic"
rollbacks:
  - type: action
    name: "restart-backend-container"
    provider:
      type: process
      path: docker
      arguments: "start sampleapp-back"
Run the experiment
To run the experiment:
chaos run --rollback-strategy=always experiment.yaml
6. Conclusion
In this article, we learned how to use safeguards to monitor the system’s state and interrupt an experiment before it can cause pain for users.
As always, all examples used in this post and more are available over on GitHub.