Chaos Engineering
Intro
Chaos Engineering intentionally introduces failures, disruptions, or stress into a system to test its resilience and identify weaknesses. The goal is to improve the system's reliability and performance by discovering and addressing potential issues before they manifest in real-world situations, such as outages or degraded user experiences.
Why Chaos Engineering
There are several reasons why you may need Chaos Engineering:
- Complex systems Modern software systems are becoming increasingly complex and distributed, which makes it difficult to predict every potential failure scenario. Chaos Engineering helps you uncover and understand these scenarios so that you can prepare for them.
- Improve system reliability By proactively introducing failures and observing their impact, you can identify and address vulnerabilities before they lead to outages or performance degradation. This helps to improve the overall reliability of your system.
- Continuous learning Chaos Engineering encourages a constant learning and improvement culture within your team. As you run experiments, you'll gain insights into your system's behaviour under various conditions, allowing you to make informed decisions about architecture, design, and operations.
- Mitigate risk By identifying and fixing potential issues, you can mitigate the risk of unexpected outages or degraded performance, which can lead to customer dissatisfaction, loss of revenue, or even reputational damage.
- Optimise resource usage Chaos Engineering can help you identify opportunities for optimising resource usage, such as underutilised servers, inefficient code, or suboptimal configurations. This can lead to cost savings and improved performance.
- Enhance team collaboration Chaos Engineering fosters collaboration between development, operations, and other teams building and maintaining software systems. This collaboration helps ensure everyone is aligned and working together to improve the system's resilience.
In summary, Chaos Engineering is a practical approach for testing and improving the resilience of your software systems. By proactively identifying and addressing potential issues, you can enhance system reliability, optimise resources, and foster a culture of continuous learning and collaboration within your organisation.
Planning Your First Chaos Experiment
- Define the Scope Begin by identifying the system, service, or component you want to test. Start with a smaller, less critical part of your infrastructure before moving on to more complex, critical systems.
- Identify Potential Weaknesses Gather your team and brainstorm potential weaknesses in the system. Consider internal and external dependencies, data stores, and other factors impacting the system's performance or reliability.
- Prioritise Scenarios Rank the identified weaknesses according to their likelihood and potential impact, then run experiments for the most critical and probable scenarios first.
- Formulate Hypotheses For each chosen scenario, develop a hypothesis about the expected outcome when the failure is introduced. Consider the impact on customers, your service, and dependencies.
- Define Key Performance Metrics Establish key performance metrics that correlate with customer success, such as orders per minute or stream starts per second. Monitor these metrics during the experiment to ensure minimal impact on customers.
- Design the Experiment Create a detailed plan for each chaos experiment, specifying the failure to be injected, the method of injection, and the duration of the experiment.
- Prepare Rollback Plans Develop a rollback plan to revert the experiment's impact if necessary. Ensure your team is prepared to abort the experiment and return the system to normal if problems arise.
- Conduct the Experiment Run the chaos experiment, closely monitoring the key performance metrics, system behaviour, and any unintended side effects. Be prepared to abort the experiment if needed.
- Analyse Results After completing the experiment, analyse the results to determine whether your hypothesis was correct and whether there were any unexpected findings.
- Implement Improvements Based on the analysis, identify areas for improvement and implement necessary changes to enhance system resilience.
- Repeat and Iterate Chaos Engineering is an ongoing process. Continue to conduct experiments and refine your systems to improve their resilience and reliability.
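The ranking described in the Prioritise Scenarios step can be sketched in a few lines of Python. This is only an illustration: the `Scenario` class, the 1 to 5 scales, and the likelihood-times-impact score are assumptions made for the example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent); the scale is an assumption
    impact: int      # 1 (minor) to 5 (severe); the scale is an assumption

    @property
    def risk_score(self) -> int:
        # Simple likelihood x impact product; other weightings are possible.
        return self.likelihood * self.impact


def prioritise(scenarios: list[Scenario]) -> list[Scenario]:
    """Return scenarios ordered from highest to lowest risk score."""
    return sorted(scenarios, key=lambda s: s.risk_score, reverse=True)


if __name__ == "__main__":
    # Illustrative scores only; real values come from your team's brainstorm.
    backlog = [
        Scenario("Pod crash", likelihood=4, impact=2),
        Scenario("Database connection failure", likelihood=2, impact=5),
        Scenario("Ingress controller failure", likelihood=1, impact=4),
    ]
    for scenario in prioritise(backlog):
        print(f"{scenario.risk_score:>2}  {scenario.name}")
```

A plain product keeps the ranking easy to explain in a planning meeting; teams that care more about impact than likelihood could weight the two factors differently.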
Remember, Chaos Engineering is a proactive approach to improving system reliability. By regularly conducting chaos experiments, you can identify and address potential issues before they result in significant downtime or customer impact. So, embrace the process and have fun while improving your systems.
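The hypothesis, metric, rollback, and abort steps above can be tied together in a small harness. The following is a minimal sketch, assuming the failure injection, rollback, and metric lookup are supplied as plain callables. `ChaosExperiment`, `run`, and every field name are hypothetical and do not belong to any chaos tool.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    inject: Callable[[], None]        # starts the failure (placeholder)
    rollback: Callable[[], None]      # reverts the failure (placeholder)
    read_metric: Callable[[], float]  # e.g. request success rate in [0, 1]
    abort_below: float                # guardrail: abort if the metric drops below this
    duration_s: float
    poll_interval_s: float = 1.0


def run(experiment: ChaosExperiment,
        sleep: Callable[[float], None] = time.sleep) -> dict:
    """Inject the failure, poll the metric, and always roll back."""
    samples: list[float] = []
    aborted = False
    experiment.inject()
    try:
        elapsed = 0.0
        while elapsed < experiment.duration_s:
            value = experiment.read_metric()
            samples.append(value)
            if value < experiment.abort_below:
                aborted = True  # guardrail breached: stop early
                break
            sleep(experiment.poll_interval_s)
            elapsed += experiment.poll_interval_s
    finally:
        # Runs on normal completion, abort, or an unexpected exception.
        experiment.rollback()
    return {
        "name": experiment.name,
        "aborted": aborted,
        "samples": samples,
        # Simplified: the hypothesis counts as holding if the guardrail held.
        "hypothesis_held": not aborted,
    }


if __name__ == "__main__":
    readings = iter([0.99, 0.98, 0.90])  # simulated success-rate samples
    experiment = ChaosExperiment(
        name="terminate-one-instance",
        hypothesis="Success rate stays above 95% when one instance is lost",
        inject=lambda: print("injecting failure (placeholder)"),
        rollback=lambda: print("rolling back (placeholder)"),
        read_metric=lambda: next(readings),
        abort_below=0.95,
        duration_s=3,
        poll_interval_s=1,
    )
    print(run(experiment, sleep=lambda _: None))
```

Putting the rollback in a `finally` block reflects the advice to be ready to abort: the system is returned to normal whether the experiment finishes, trips the guardrail, or fails unexpectedly.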
Example of Chaos planning
| Scope | Potential Weaknesses | Hypotheses | Key Performance Metrics | Implementation Method | Metrics to Observe & Analyze | 
|---|---|---|---|---|---|
| AWS EC2 Instances | Instance failure | An instance failure will not impact the overall service due to auto-scaling and load balancing | Latency, Request success rate | Terminate a random EC2 instance using AWS CLI or SDK, and observe the system's response | EC2 instance count, Auto Scaling events, Load balancer request distribution, CPU & memory usage | 
| AWS RDS Database | Database connection failure | Connection failures will trigger auto-retry and fallback to read replicas without affecting users | DB connection errors, Latency | Introduce network issues between app and RDS using AWS Security Groups or VPC NACLs | Database connection errors, RDS replica lag, Query execution time, Application error rates | 
| Kubernetes Nodes | Node failure | Kubernetes will reschedule affected pods to other healthy nodes without impacting user experience | Pod restart count, Node CPU usage | Drain a Kubernetes node using kubectl drain, then observe the pod rescheduling process | Node status, Pod status, Pod restarts, Node resource usage (CPU, memory) | 
| Kubernetes Pod | Pod crash | Crashing pods will be automatically restarted, ensuring minimal user impact | Pod restart count, Latency | Introduce a fault within a pod, e.g., by using kubectl exec to kill a critical process | Pod status, Pod restarts, Container logs, Application error rates | 
| AWS S3 | S3 latency increase | Increased S3 latency will cause delays but not service outages, as retries and timeouts will be handled | S3 latency, Request success rate | Use a tool like AWS Fault Injection Simulator (FIS) to simulate latency increase in S3 API calls | S3 request latency, S3 error rates, Application response time | 
| Kubernetes Service | Network latency between services | Increased network latency will cause delays but not service outages due to built-in retries and timeouts | Service-to-service latency | Inject latency between services using a service mesh like Istio, or using tools like tc or iptables | Service-to-service latency, Request success rate, Application response time | 
| AWS DynamoDB | Throttling errors | Throttling errors will be handled by retries with exponential backoff, ensuring limited user impact | Throttling errors, Latency | Temporarily decrease DynamoDB provisioned capacity or use AWS FIS to simulate throttling errors | Throttling errors, Read/write capacity utilization, Latency, Application error rates | 
| Kubernetes Ingress Controller | Ingress controller failure | A failure in the ingress controller will cause temporary service disruption, which will be resolved quickly | Ingress error rate, Latency | Disable or introduce faults to the ingress controller, e.g., by modifying its configuration or scaling | Ingress error rates, Ingress latency, Ingress controller logs, Pod status | 
| AWS Lambda | Lambda function timeout | Lambda timeouts will be handled by retries and fallbacks, ensuring limited user impact | Lambda timeouts, Latency | Modify the Lambda function to include an intentional delay, or reduce the function's timeout setting | Lambda invocation count, Lambda duration, Lambda error rates, Cold start count, Application error rates | 
| AWS Kinesis Streams | Stream processing delays | Delays in processing Kinesis streams will cause temporary lag in data processing, but not service outages | Kinesis processing latency | | | 
For example code showing how to run your first experiment, see AWS Fault Injection Simulator (FIS).
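Several hypotheses in the table assume that clients handle failures through retries with exponential backoff, for example the DynamoDB throttling row. Before running such an experiment, it helps to confirm what that client-side behaviour looks like. The sketch below uses only the Python standard library and a simulated throttled operation. `ThrottledError`, `call_with_backoff`, and `make_flaky` are hypothetical names, and the "full jitter" variant is one common choice rather than something the table prescribes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider's throttling exception."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ThrottledError, backing off exponentially with jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            # The delay ceiling doubles per attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # full jitter
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], str]:
    """Return an operation that is throttled `failures` times, then succeeds."""
    remaining = failures

    def operation() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise ThrottledError("simulated throttling")
        return "ok"

    return operation


if __name__ == "__main__":
    delays: list[float] = []
    result = call_with_backoff(make_flaky(3), sleep=delays.append)
    print(result, [round(d, 3) for d in delays])
```

Passing `sleep` as a parameter lets the retry logic be exercised without real waiting, which is convenient when rehearsing the failure path before injecting real throttling.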
Chaos Engineering and Observability
Chaos Engineering and observability are closely linked concepts that complement each other in building and maintaining resilient and high-performing systems. While Chaos Engineering intentionally injects failures into a system to test its resilience, observability focuses on gathering, analysing, and visualising system data to understand its behaviour and performance. Here's how Chaos Engineering and observability are linked and what you should know about their relationship:
- Observability is crucial during chaos experiments When conducting Chaos Engineering experiments, it's vital to have good observability in place. This allows you to monitor the system's behaviour and performance during the experiments, helping you understand how the system reacts to injected failures and stressors. You can then use this information to identify and address any uncovered weaknesses or issues.
- Validate hypotheses Observability enables you to validate the hypotheses you form during Chaos Engineering experiments. By collecting and analysing data about the system's behaviour, you can determine whether your hypotheses are correct or if there are unexpected outcomes that need further investigation.
- Detect unintended side effects Chaos experiments may sometimes reveal side effects or issues that were not part of your original hypothesis. Observability helps you detect and analyse these side effects to understand and address their root causes.
- Measure the impact Observability allows you to measure the impact of chaos experiments on your system's key performance indicators (KPIs), such as latency, error rates, and resource usage. This helps you assess the overall success of the experiments and determine if any changes need to be made to improve system resilience.
- Improve system understanding Chaos Engineering and observability contribute to a deeper understanding of your system. Chaos experiments uncover potential weaknesses and stress points, while observability provides insights into the system's behaviour under normal and failure conditions. Combining both practices helps build a more comprehensive view of your system, making it easier to identify and address potential issues.
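Validating a hypothesis and measuring impact, as described in the list above, usually comes down to comparing metrics from a baseline window with metrics collected during the experiment. The sketch below shows one way to do that with the Python standard library. The `Window` class, the nearest-rank percentile, and the thresholds (a p99 latency no more than 1.5 times the baseline and an error rate of at most 1%) are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class Window:
    """Metrics collected over one observation period."""
    latencies_ms: list[float]
    errors: int
    requests: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def hypothesis_holds(
    baseline: Window,
    experiment: Window,
    max_p99_ratio: float = 1.5,    # assumed tolerance for latency growth
    max_error_rate: float = 0.01,  # assumed tolerance for failed requests
) -> bool:
    """Check whether the experiment window stayed within the agreed tolerances."""
    baseline_p99 = percentile(baseline.latencies_ms, 99)
    experiment_p99 = percentile(experiment.latencies_ms, 99)
    latency_ok = experiment_p99 <= max_p99_ratio * baseline_p99
    errors_ok = experiment.error_rate <= max_error_rate
    return latency_ok and errors_ok


if __name__ == "__main__":
    baseline = Window(latencies_ms=[100.0] * 100, errors=0, requests=100)
    during = Window(latencies_ms=[120.0] * 95 + [400.0] * 5, errors=0, requests=100)
    print("hypothesis holds:", hypothesis_holds(baseline, during))
```

In practice the windows would be filled from your monitoring system; agreeing on the tolerances before the experiment keeps the analysis from being adjusted to fit the outcome.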
Conclusion
Chaos Engineering and observability are intertwined concepts that contribute to building and maintaining resilient systems. By intentionally introducing failures and monitoring the system's behaviour, you can identify weaknesses and improve system resilience. Observability is critical to this process because it provides the data and insights needed to understand the system's behaviour and performance during chaos experiments. Implementing both practices will help you create more reliable, high-performing systems.
Reference
Several tools can help you get started with chaos engineering in an AWS and Kubernetes environment. Here are a few easy-to-use tools to consider for implementing the experiments mentioned above:
- Chaos Mesh Chaos Mesh is an open-source, cloud-native chaos engineering platform for Kubernetes. It provides a variety of chaos experiments like pod failure, network latency, and resource stress. Chaos Mesh defines its experiments as Kubernetes custom resources, so you can manage them using its web dashboard or the kubectl command-line tool.
- AWS Fault Injection Simulator (FIS) FIS is a fully managed AWS service designed to help you perform controlled chaos experiments on your AWS resources. It integrates with many AWS services, including EC2, RDS, Lambda, and DynamoDB. In addition, FIS provides pre-built templates and a simple UI to define and run your experiments.
- Gremlin Gremlin is a popular and user-friendly chaos engineering platform supporting AWS and Kubernetes environments. It offers various failure injection scenarios and a convenient web UI to manage and monitor your experiments. In addition, Gremlin's safety features allow you to minimise the risk of unintended consequences during your tests.
- Litmus Litmus is another open-source, cloud-native chaos engineering tool designed for Kubernetes. It includes a wide range of chaos experiments and a user-friendly web portal for managing and monitoring your chaos workflows. Litmus also supports GitOps-based workflows for managing your experiments as code.
- PowerfulSeal PowerfulSeal is an open-source chaos engineering tool for Kubernetes environments. It allows you to inject failures at the infrastructure and application levels. In addition, PowerfulSeal can be run in autonomous or interactive mode, giving you flexibility in how you run and manage your experiments.
Choose a tool based on your environment (AWS, Kubernetes, or both) and your preferred method of managing experiments (UI, command line, or GitOps). These tools are relatively easy to start with and provide comprehensive documentation to help you implement the chaos experiments mentioned in the matrix.