This is documentation for Self-Managed v23.2, which is no longer supported. To view the latest available version of the docs, see v24.2.

Resilience Testing in Kubernetes

Resilience testing helps ensure that a system is reliable and can recover from failures. To perform resilience testing for Redpanda in Kubernetes, introduce failure scenarios and observe how the system behaves under each one.

Prerequisites

Create a test environment that mimics your production environment as closely as possible. The test environment should include a Redpanda cluster with at least three replicas, and any services that your application depends on. You can find guides for deploying Redpanda in Get Started with Redpanda in Kubernetes.

Set up monitoring so that you can observe changes in the system behavior.

Simulate failure scenarios

This section provides the steps to simulate failure scenarios in Kubernetes. After each simulation, monitor the behavior of the Redpanda cluster and any clients that are connected to it.

Broker going down

You can simulate a broker going down for an extended period of time by manually terminating one of your brokers.

Find out on which node each of your brokers is running:

    kubectl get pod --namespace <namespace> \
      -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name -l \
      app.kubernetes.io/component=redpanda-statefulset

Taint the node that’s running the broker that you want to terminate:

    kubectl taint nodes <node-name> isolate-broker=true:NoExecute

Replace <node-name> with the name of the node you want to taint. Any Pods that do not tolerate this taint are terminated and evicted from the node.

Monitor the logs and metrics of the remaining brokers to observe how they behave when a broker is unexpectedly terminated.
Remove the taint when you’re ready for the broker to come back online:

    kubectl taint nodes <node-name> isolate-broker=true:NoExecute-

Check whether the terminated broker can rejoin the cluster when it is rescheduled on the node and comes back online.

Suggested reading

It’s best practice to automate failure scenarios as part of your regular testing to identify any weaknesses in your deployment. You can use tools such as Chaos Monkey and LitmusChaos.
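As a starting point for automating the broker-down scenario, the taint and untaint commands from this page can be wrapped in a few shell functions. This is only a minimal sketch, not an official Redpanda tool: the function names, the KUBECTL and TAINT variables, and the default hold time of 300 seconds are illustrative assumptions. The kubectl commands themselves are the ones shown in the steps for simulating a broker going down.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the kubectl commands from the broker-down procedure.
# KUBECTL, TAINT, and all function names are hypothetical helpers.
KUBECTL="${KUBECTL:-kubectl}"
TAINT="isolate-broker=true:NoExecute"

# Show which node each Redpanda broker Pod is running on.
list_broker_nodes() {
  local namespace="$1"
  "$KUBECTL" get pod --namespace "$namespace" \
    -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name \
    -l app.kubernetes.io/component=redpanda-statefulset
}

# Taint the node so that Pods without a matching toleration are evicted.
isolate_broker_node() {
  local node="$1"
  "$KUBECTL" taint nodes "$node" "$TAINT"
}

# Remove the taint (trailing "-") so the broker can be rescheduled.
restore_broker_node() {
  local node="$1"
  "$KUBECTL" taint nodes "$node" "${TAINT}-"
}

# Taint the node, wait for the given number of seconds (assumed default: 300),
# then remove the taint again.
run_broker_outage() {
  local node="$1" hold_seconds="${2:-300}"
  isolate_broker_node "$node" || return 1
  sleep "$hold_seconds"
  restore_broker_node "$node"
}
```

After sourcing the file, you could run `list_broker_nodes <namespace>` to pick a node and then `run_broker_outage <node-name> 600` to keep the broker offline for ten minutes. The script does not check cluster health; continue to watch your logs and metrics while the taint is in place, and confirm afterward that the broker rejoins the cluster.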