Resilience Testing in Kubernetes

Resilience testing is an important part of ensuring that a system is reliable and can recover from failures. To perform resilience testing for Redpanda in Kubernetes, you can introduce failure scenarios and observe how the system behaves under each scenario.

Prerequisites

  • Create a test environment that mimics your production environment as closely as possible. The test environment should include a Redpanda cluster with at least three replicas, and any services that your application depends on. You can find guides for deploying Redpanda in Get Started with Redpanda in Kubernetes.

  • Set up monitoring so that you can observe changes in the system behavior.

Simulate failure scenarios

This section provides the steps to simulate failure scenarios in Kubernetes. After each simulation, it’s important to monitor the behavior of the Redpanda cluster and any clients that are connected to it.

Broker going down

You can simulate a broker going down for an extended period of time by manually terminating one of them.

  1. Find out on which node each of your brokers is running:

    kubectl get pod --namespace <namespace>  \
      -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name -l \
      app.kubernetes.io/component=redpanda-statefulset
  2. Taint the node that’s running the broker that you want to terminate:

    kubectl taint nodes <node-name> isolate-broker=true:NoExecute

    Replace <node-name> with the name of the node you want to taint.

    Any Pods that do not tolerate this taint are terminated and evicted from the node.

  3. Monitor the logs and metrics of the remaining brokers to observe how they behave when a broker is unexpectedly terminated.

  4. Remove the taint when you’re ready for the broker to come back online:

    kubectl taint nodes <node-name> isolate-broker=true:NoExecute-
  5. Check whether the terminated broker can rejoin the cluster when it is rescheduled on the node and comes back online.

Suggested reading

It’s best practice to automate failure scenarios as part of your regular testing to identify any weaknesses in your deployment. You can use tools, such as Chaos Monkey and LitmusChaos.