Antifragile patterns¶
Estimated time to read: 49 minutes
Antifragile patterns are design principles to build resilient and adaptable systems that withstand shocks, stresses, and volatility.
Patterns¶
- Circuit breakers Automatically halt the flow of operations when a predefined threshold is breached, preventing further damage and allowing the system to recover.
- Bulkheads Divide a system into isolated compartments so that a failure in one compartment doesn't cascade to the others.
- Isolation Separate components or subsystems to limit the impact of a failure, making it easier to diagnose and repair.
- Timeouts Set time limits for operations to prevent them from running indefinitely and consuming resources.
- Redundancy Introduce multiple instances of the same component or process so that the others can take over if one fails.
- Diversity Use different technologies, methods, or designs to perform the same function, increasing the chances of at least one succeeding in the face of failure.
- Modularity Break a system into smaller, more manageable components that can be developed, tested, and maintained independently.
- Loose coupling Minimize dependencies between components, making it easier to swap them out, upgrade, or repair without impacting the overall system.
- Self-healing Build systems that automatically detect and recover from failures without human intervention.
- Graceful degradation Design systems that can continue functioning, albeit at a reduced capacity, when one or more components fail.
- Load shedding Drop or throttle low-priority tasks during periods of high demand to maintain the performance and stability of the system.
- Adaptive capacity Build systems that can learn from and adapt to changing conditions, improving their performance and resilience.
- Rate limiting Control the frequency or quantity of requests, preventing overload and ensuring fair resource allocation.
- Monitoring and observability Implement tools and processes to continuously monitor the health and performance of a system, making it easier to detect and respond to issues.
- Feedback loops Collect data on the system's performance and use it to adjust, improving its ability to adapt and respond to changing conditions.
Circuit Breakers¶
A circuit breaker is a design pattern that helps prevent system failures by automatically stopping the flow of operations when a predefined threshold is reached. It acts as a safety mechanism that allows the system to recover from errors or high loads. Circuit breakers can be applied to system parts like network communication, database access, or other resource-intensive operations.
Implementation¶
To implement a circuit breaker, track your operation's success and failure rates. The circuit breaker "trips" and rejects new requests when the failure rate exceeds a certain threshold. After a set period, the circuit breaker enters a "half-open" state, allowing a limited number of requests to pass through. If these requests succeed, the circuit breaker resets to the "closed" state. Otherwise, it reverts to the "open" state.
In this example, we will create a Python Flask API that communicates with a third-party service. We will implement a circuit breaker to manage the state of the service and prevent further requests when the service fails. The circuit breaker will have three states: "Closed", "Open", and "Half-Open".
from flask import Flask, jsonify
import requests
import time
from functools import wraps
app = Flask(__name__)
# Circuit breaker configuration
failure_threshold = 3
reset_timeout = 60
half_open_timeout = 30
# Circuit breaker state
state = "Closed"
failures = 0
last_open_time = 0
def circuit_breaker(func):
@wraps(func)
def wrapper(*args, **kwargs):
        global state, failures, last_open_time
if state == "Open":
if time.time() - last_open_time >= reset_timeout:
state = "Half-Open"
else:
return jsonify({"error": "Circuit breaker is open"}), 503
try:
result = func(*args, **kwargs)
if state == "Half-Open":
state = "Closed"
failures = 0
return result
except requests.exceptions.RequestException as e:
failures += 1
if failures >= failure_threshold:
state = "Open"
last_open_time = time.time()
return jsonify({"error": "Request failed"}), 500
return wrapper
@circuit_breaker
def fetch_data(url):
response = requests.get(url)
response.raise_for_status()
return response.json()
@app.route('/data')
def data():
try:
data = fetch_data('https://api.example.com/data')
return jsonify(data)
except Exception as e:
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run()
In this example, the circuit_breaker decorator wraps the fetch_data function, which makes an HTTP request to the third-party service. The decorator tracks the state of the service and prevents further requests when the service is in the "Open" state. When the state transitions from "Open" to "Half-Open", the circuit breaker allows a single request to be made to the failing service. If the request succeeds, the state changes to "Closed", and further requests are allowed. If the request fails, the state reverts to "Open", and the circuit breaker waits for a specified period before transitioning back to the "Half-Open" state.
The following diagram illustrates the circuit breaker pattern described in the previous example:
stateDiagram
[*] --> Closed: Start
Closed --> Open : Failures >= failure_threshold
Open --> HalfOpen : time.time() - last_open_time >= reset_timeout
HalfOpen --> Closed : Request succeeds
HalfOpen --> Open : Request fails
Closed --> Closed : Request succeeds
This diagram shows the possible state transitions of the circuit breaker:
- Initially, the circuit breaker starts in the "Closed" state.
- When the number of failures reaches the failure_threshold, the circuit breaker transitions from "Closed" to "Open".
- After waiting for the reset_timeout duration, the circuit breaker moves from "Open" to "Half-Open".
- In the "Half-Open" state, the circuit breaker returns to the "Closed" state if a request succeeds.
- If a request fails while in the "Half-Open" state, the circuit breaker returns to the "Open" state.
- When a request succeeds in the "Closed" state, the circuit breaker remains in the "Closed" state.
Neither NGINX nor Apache2 natively supports circuit breakers out of the box. However, you can use third-party modules, plugins, or external tools to add circuit breaker functionality.
You can use the third-party Lua module lua-resty-circuit-breaker for NGINX with the ngx_http_lua_module. The lua-resty-circuit-breaker module lets you implement circuit breaker logic using Lua scripts in your NGINX configuration.
Here is an example of how to set up NGINX with lua-resty-circuit-breaker:
- Install ngx_http_lua_module and lua-resty-circuit-breaker. You can follow the installation instructions for the ngx_http_lua_module from the official documentation (https://github.com/openresty/lua-nginx-module#installation) and lua-resty-circuit-breaker from its GitHub repository (https://github.com/pintsized/lua-resty-circuit-breaker).
- Update your NGINX configuration to include the circuit breaker logic. Assuming you have an upstream service called service1, your configuration might look like this:
http {
# Import the Lua circuit breaker library
lua_package_path "/path/to/lua-resty-circuit-breaker/lib/?.lua;;";
server {
listen 80;
location /service1 {
# Define the circuit breaker
access_by_lua_block {
local circuit_breaker = require "resty.circuit-breaker"
local cb = circuit_breaker.new({
failure_threshold = 3,
recovery_timeout = 60,
half_open_timeout = 30
})
local ok, err = cb:call()
if not ok then
ngx.status = ngx.HTTP_SERVICE_UNAVAILABLE
ngx.say("Circuit breaker is open")
ngx.exit(ngx.status)
end
}
# Proxy requests to the upstream service
proxy_pass http://service1.example.com;
}
}
}
This configuration sets up a circuit breaker using the lua-resty-circuit-breaker library. It implements the necessary logic to return a "Service Unavailable" error when the circuit breaker is in the open state.
For Apache2, no built-in or third-party module directly implements circuit breakers. However, you can achieve similar functionality using a combination of retries, timeouts, and a fallback mechanism provided by the mod_proxy module. While this approach does not fully implement the circuit breaker pattern, it can help mitigate the impact of failing services:
<VirtualHost *:80>
ServerName example.com
ProxyRequests Off
ProxyPreserveHost On
<Proxy *>
Require all granted
</Proxy>
# Configure retries and timeouts
ProxyTimeout 5
ProxyPass /service1 http://service1.example.com retry=3 timeout=5
# Configure fallback mechanism
ProxyPass /fallback http://fallback.example.com
ErrorDocument 502 /fallback
ErrorDocument 503 /fallback
ErrorDocument 504 /fallback
ErrorLog ${APACHE_LOG_DIR}/error.log
CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
In this example, Apache2 proxies requests to service1.example.com with a 5-second timeout; the retry=3 parameter tells Apache to wait 3 seconds before sending traffic again to a backend worker that is in the error state. If the service is unavailable, Apache2 serves the fallback location (proxied to fallback.example.com) via the ErrorDocument directives. This approach provides some level of fault tolerance but does not fully adhere to the circuit breaker pattern.
For a complete circuit breaker implementation in Apache2, you might consider using an external tool or service mesh that natively supports the circuit breaker pattern. One such option is using Istio, a popular service mesh that offers advanced traffic management features, including circuit breakers, for microservices.
Istio¶
To implement a circuit breaker using Istio, you need a Kubernetes cluster with Istio installed. You can follow the official Istio documentation for installation instructions (https://istio.io/latest/docs/setup/getting-started/).
Once Istio is installed and your services run within the Istio service mesh, you can configure a circuit breaker using Istio's DestinationRule resource. Here's an example of a YAML configuration file for a circuit breaker in Istio:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: service1-circuit-breaker
spec:
host: service1.example.com
trafficPolicy:
connectionPool:
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 1
outlierDetection:
consecutiveErrors: 3
interval: 10s
baseEjectionTime: 30s
maxEjectionPercent: 100
This configuration creates a circuit breaker for the service service1.example.com with the following settings:
- A maximum of ten pending requests are allowed.
- Each connection can handle a maximum of one request.
- After three consecutive errors, the service instance is ejected from the load balancing pool.
- Ejection checks occur every 10 seconds.
- The base ejection time is 30 seconds, meaning an ejected instance will be unavailable for at least 30 seconds.
- The maximum ejection percentage is set to 100%, which means all instances can be ejected if they continually fail.
In summary, while neither NGINX nor Apache2 natively support circuit breakers, you can use third-party modules, plugins, or external tools like service meshes to implement the circuit breaker pattern in your infrastructure.
AWS App Mesh¶
You can use AWS App Mesh to implement circuit breakers and other resiliency patterns with AWS services. This service mesh provides application-level networking to make it easy to build, run and monitor microservices on AWS. App Mesh works with Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), and AWS Fargate.
The high-level steps to implement circuit breakers with AWS App Mesh are as follows:
- Set up AWS App Mesh: Follow the official documentation to set up AWS App Mesh with your preferred container service (EKS, ECS, or Fargate): https://docs.aws.amazon.com/app-mesh/latest/userguide/getting_started.html
- Define a Virtual Service: A Virtual Service is an abstraction of a real service and can be used to route traffic between different versions of a service or to a fallback service. Create a Virtual Service for your primary service.
- Define a Virtual Node: A Virtual Node is a logical pointer to a physical service running in your infrastructure. Create a Virtual Node for your primary service and configure the circuit breaker settings.
- (Optional) Define a Virtual Node for a fallback service: If you want to route traffic to a fallback service when the circuit breaker trips, create a Virtual Node for the fallback service.
- Update the Virtual Service to use the Virtual Nodes: Update the Virtual Service configuration to route traffic to the primary service's Virtual Node and, if applicable, the fallback service's Virtual Node.
Here is an example of how to configure circuit breakers using AWS App Mesh with a primary service and a fallback service:
{
"meshName": "my-mesh",
"virtualNodeName": "primary-service",
"spec": {
"listeners": [
{
"portMapping": {
"port": 8080,
"protocol": "http"
}
}
],
"serviceDiscovery": {
"awsCloudMap": {
"namespaceName": "my-namespace",
"serviceName": "primary-service"
}
},
"backendDefaults": {
"clientPolicy": {
"tls": {
"enforce": false
}
}
},
"backends": [
{
"virtualService": {
"virtualServiceName": "fallback-service.my-mesh"
}
}
],
"outlierDetection": {
"maxEjectionPercent": 100,
"baseEjectionDuration": "30s",
"interval": "10s",
"consecutiveErrors": 3
}
}
}
This example JSON configuration creates a Virtual Node for the primary service, configures the circuit breaker settings, and adds a backend for the fallback service.
In conclusion, AWS App Mesh allows you to implement circuit breakers and other resiliency patterns for your microservices running on AWS services. By configuring Virtual Services and Virtual Nodes, you can control traffic routing and fault tolerance settings to build robust and fault-tolerant applications on AWS.
Bulkheads¶
A bulkhead is a design pattern that isolates different parts of a system so that a failure in one part doesn't cascade to the others. Bulkheads can be physical, like the watertight compartments in a ship, or logical, like separate thread pools or resource limits for different tasks. The key is to ensure that each compartment can operate independently, minimising the impact of a single failure.
To implement bulkheads, identify your application's different components or subsystems and apply resource constraints, such as thread pools or connection limits, to each. This can be done at the infrastructure level, using separate servers or containers, or at the application level, using queues, thread pools, or rate limiting. In other words, bulkheads can be implemented at the infrastructure, application, or code level.
One approach to implementing bulkheads in Python is to use separate thread pools or worker processes for different tasks. This prevents a failure or bottleneck in one task from affecting the performance of other tasks.
Python Implementation¶
Below is an example of implementing bulkheads using Python and the concurrent.futures.ThreadPoolExecutor for isolating two tasks:
import concurrent.futures
import requests
import random
import time
# Define separate thread pools for two different tasks
task1_thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=5)
task2_thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=5)
def perform_task(task_type, duration):
print(f"{task_type} started, duration: {duration}s")
time.sleep(duration)
print(f"{task_type} completed")
return f"{task_type} task completed after {duration}s"
# Usage example
task1_futures = []
task2_futures = []
for _ in range(10):
# Simulating variable task durations
task1_duration = random.uniform(0.5, 3)
task1_future = task1_thread_pool.submit(perform_task, 'Task 1', task1_duration)
task1_futures.append(task1_future)
task2_duration = random.uniform(0.5, 3)
task2_future = task2_thread_pool.submit(perform_task, 'Task 2', task2_duration)
task2_futures.append(task2_future)
# Wait for all tasks to complete
concurrent.futures.wait(task1_futures + task2_futures)
# Clean up
task1_thread_pool.shutdown()
task2_thread_pool.shutdown()
In this example, two separate thread pools are created for Task 1 and Task 2. When tasks are submitted to the appropriate thread pool, they are isolated and can be processed independently without affecting each other's performance.
If one of the tasks encounters an issue or takes longer than expected, the other tasks will continue to execute, as they are running in separate thread pools. This isolation helps ensure that a problem in one part of the system does not cascade and impact the entire system's performance.
In addition to thread pools, bulkheads can be implemented using other techniques, such as separate processes, containers, or servers, and applying resource constraints or rate limiting to individual components.
Implementing bulkheads using NGINX or Apache2 typically involves isolating different services or applications at the infrastructure level. In this example, I will demonstrate how to implement bulkheads using NGINX by isolating two upstream services.
Let's assume you have two upstream services running on two different servers:
- Service 1: http://service1.example.com
- Service 2: http://service2.example.com
NGINX Implementation¶
You can configure NGINX to act as a reverse proxy and load balancer for these two services, ensuring that the services are isolated from each other.
Create an NGINX configuration file, such as /etc/nginx/conf.d/bulkheads.conf, with the following content (files in conf.d are included inside the http context by the default nginx.conf, so omit the outer http block in that case):
http {
# Define upstream servers for Service 1 and Service 2
upstream service1 {
server service1.example.com;
}
upstream service2 {
server service2.example.com;
}
server {
listen 80;
# Proxy requests for Service 1
location /service1 {
proxy_pass http://service1;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
# Proxy requests for Service 2
location /service2 {
proxy_pass http://service2;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
}
This configuration sets up two upstream services (service1 and service2) and configures NGINX to proxy requests for /service1 and /service2 to their respective upstream services. Doing this ensures that the services are isolated and that a failure in one service does not cascade to the other.
Apache2 Implementation¶
If you want to implement bulkheads using Apache2, you can configure it as a reverse proxy using the mod_proxy module. Here is an example Apache2 configuration:
<VirtualHost *:80>
ServerName example.com
# Enable the mod_proxy and mod_proxy_http modules
ProxyRequests Off
ProxyPreserveHost On
<Proxy *>
Require all granted
</Proxy>
# Proxy requests for Service 1
ProxyPass /service1 http://service1.example.com
ProxyPassReverse /service1 http://service1.example.com
# Proxy requests for Service 2
ProxyPass /service2 http://service2.example.com
ProxyPassReverse /service2 http://service2.example.com
# Configure logging
ErrorLog ${APACHE_LOG_DIR}/error.log
CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
This configuration achieves the same goal as the NGINX example: requests for /service1 and /service2 are proxied to their respective upstream services, ensuring that the services are isolated.
You can use Istio or AWS App Mesh to implement bulkheads as part of their traffic management and isolation features. Both Istio and AWS App Mesh allow you to configure separate connection pools for each service, effectively isolating them from each other and preventing cascading failures.
Here's how you can implement bulkheads using Istio and AWS App Mesh:
Istio
To implement bulkheads in Istio, you can combine DestinationRules and VirtualServices. First, create a DestinationRule for each service to configure the connection pool settings:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: service1-destination
spec:
host: service1
trafficPolicy:
connectionPool:
http:
http1MaxPendingRequests: 10
maxRequestsPerConnection: 1
tcp:
maxConnections: 100
connectTimeout: 10s
This DestinationRule configures the connection pool for the service1 host. You can adjust the settings for http1MaxPendingRequests, maxRequestsPerConnection, maxConnections, and connectTimeout to meet the requirements of your system.
Create similar DestinationRules for other services and apply them to your cluster.
Next, create a VirtualService to route traffic between different services or service versions:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: service1-route
spec:
hosts:
- service1
http:
- route:
- destination:
host: service1
subset: v1
This VirtualService routes traffic to the service1 host with the v1 subset. Create similar VirtualServices for other services and apply them to your cluster.
By configuring separate connection pools for each service using DestinationRules and VirtualServices, you effectively isolate services from each other, implementing bulkheads with Istio.
AWS App Mesh
To implement bulkheads using AWS App Mesh, create a VirtualNode for each service with the desired connection pool settings. An example of JSON configuration for a VirtualNode with connection pool settings:
{
"meshName": "my-mesh",
"virtualNodeName": "service1",
"spec": {
"listeners": [
{
"portMapping": {
"port": 8080,
"protocol": "http"
}
}
],
"serviceDiscovery": {
"awsCloudMap": {
"namespaceName": "my-namespace",
"serviceName": "service1"
}
},
"backendDefaults": {
"clientPolicy": {
"tls": {
"enforce": false
}
},
"timeout": {
"http": {
"idle": "10s",
"perRequest": "5s"
}
}
},
"backends": [
{
"virtualService": {
"virtualServiceName": "service2.my-mesh"
}
}
],
"connectionPool": {
"http": {
"maxConnections": 100,
"maxPendingRequests": 10
},
"tcp": {
"maxConnections": 100
}
}
}
}
This example JSON configuration creates a VirtualNode for the service1 host with connection pool settings. Adjust the maxConnections, maxPendingRequests, and other settings as needed.
Create similar VirtualNodes for other services and apply them to your AWS App Mesh.
By configuring separate connection pools for each service using VirtualNodes, you effectively isolate services from each other, implementing bulkheads with AWS App Mesh.
In conclusion, both Istio and AWS App Mesh provide features to implement bulkheads and isolate your services from each other, preventing cascading failures. You can build a resilient and fault-tolerant system by configuring separate connection pools for each service using DestinationRules and VirtualServices in Istio or VirtualNodes in AWS App Mesh. Remember to adjust the settings for connection pools, timeouts, and other parameters according to the requirements of your specific use case.
Isolation¶
Isolation is a broader concept that focuses on separating components, subsystems, or services in a way that limits the impact of a failure, making it easier to diagnose, repair, and prevent potential cascading effects. Isolation can be applied at different levels, such as process, network, or data, and can involve techniques like error handling, fault tolerance, or resource management.
In software systems, isolation can be achieved by using separate processes or containers, creating isolated network zones, or implementing well-defined interfaces and contracts between components.
While both bulkheads and isolation involve separating and protecting different parts of a system, bulkheads focus more on resource allocation and preventing a failure in one part of the system from consuming resources needed by other parts. On the other hand, isolation encompasses a broader range of techniques and principles to contain the impact of failures and maintain system stability.
In summary, bulkheads and isolation are separate concepts in system design. Bulkheads focus on partitioning resources to prevent failures from affecting other parts of the system. At the same time, isolation involves a broader set of techniques to limit the impact of failures and maintain system stability. Implementing both concepts together can help build more resilient and fault-tolerant systems.
Isolation can be implemented at various levels, such as process, network, or data. I'll provide an example of implementing isolation using microservices architecture, containers, and a message queue.
In this example, we have a system with three microservices: Order Service, Payment Service, and Shipping Service. These services communicate with each other through a message queue to process orders, payments, and shipments.
1. Microservices Architecture
Instead of a monolithic application, the system is designed as a set of microservices. Each microservice focuses on a specific domain and is developed, deployed, and scaled independently. This approach isolates each service from the others, so if one service fails or experiences performance issues, it won't directly impact the other services.
2. Containers
Deploy each microservice in a container, such as Docker. Containers provide process and resource isolation, ensuring each service runs in its own environment with a separate file system, process tree, and network stack. Containers also make managing dependencies, versioning, and deployment easy, further isolating each service from the others.
Here's an example Dockerfile for the Order Service:
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "order_service.py"]
Build and run the container for the Order Service, then repeat these steps for the Payment Service and Shipping Service.
3. Message Queue
Instead of direct HTTP calls or other synchronous communication, use a message queue, such as RabbitMQ or Apache Kafka, for communication between services. This decouples the services and allows them to operate independently, even if one is experiencing issues.
For example, the Order Service can publish a message to a queue when a new order is created. The Payment Service can consume messages from that queue, process payments, and publish another message to a different queue when payment is completed. Finally, the Shipping Service can consume messages from the second queue and handle shipping logistics.
Here's an example using RabbitMQ and Python's pika library:
Order Service (Publisher)
import pika
# Establish connection to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
# Declare a queue
channel.queue_declare(queue='order_queue')
# Publish a message to the queue
channel.basic_publish(exchange='', routing_key='order_queue', body='New order')
print(" [x] Sent 'New order'")
# Close the connection
connection.close()
Payment Service (Consumer)
import pika
def callback(ch, method, properties, body):
print(f" [x] Received {body}")
# Process payment
# Publish a message to another queue for Shipping Service
# Establish connection to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
# Declare a queue
channel.queue_declare(queue='order_queue')
# Start consuming messages from the queue
channel.basic_consume(queue='order_queue', on_message_callback=callback, auto_ack=True)
print(' [*] Waiting for messages. To exit, press CTRL+C')
channel.start_consuming()
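To complete the chain described above, the Payment Service could publish a follow-up message that the Shipping Service consumes. The sketch below is a minimal illustration only, assuming a hypothetical queue named shipping_queue and placeholder payment logic:
import pika
def handle_order(ch, method, properties, body):
    # Process the payment here (placeholder), then notify the Shipping Service
    print(f" [x] Processing payment for: {body}")
    ch.basic_publish(exchange='', routing_key='shipping_queue', body=b'Payment completed')
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
# Declare both queues so the example is self-contained
channel.queue_declare(queue='order_queue')
channel.queue_declare(queue='shipping_queue')
channel.basic_consume(queue='order_queue', on_message_callback=handle_order, auto_ack=True)
channel.start_consuming()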
In this example, the Order Service, Payment Service, and Shipping Service are isolated from each other through microservices architecture, containers, and a message queue. Implementing the isolation pattern helps minimise the impact of failures, maintain system stability, and simplify diagnostics and recovery.
Isolation and landing zones are related because both concepts aim to reduce the risk of failures and improve manageability in a system. However, landing zones specifically pertain to cloud infrastructure and management, while isolation is a more general concept applicable to various aspects of system design.
Landing Zones¶
A landing zone is a well-architected, multi-account AWS environment that serves as a foundation for building and deploying workloads in the cloud. It provides a blueprint for setting up a secure, scalable, resilient infrastructure that follows best practices and adheres to compliance and governance requirements.
Landing zones help organisations establish a consistent approach to cloud deployments, including aspects such as account structure, network design, security, and governance. They often include pre-configured components, such as AWS Organizations, AWS Control Tower, AWS Security Hub, and other AWS services.
Relation to Isolation¶
Landing zones promote isolation in cloud infrastructure by creating separate AWS accounts for different teams, environments, or workloads. This separation ensures that resources are dedicated to specific purposes, reducing the risk of accidental or unauthorised access and minimising the impact of failures or misconfigurations.
For example, a landing zone might define separate AWS accounts for development, staging, production environments or different business units within an organisation. Each account would have its resources, such as VPCs, subnets, and security groups, effectively isolating them.
In conclusion, while isolation is a general concept that can be applied to various aspects of system design, landing zones are specific to cloud infrastructure management and promote isolation at the AWS accounts and resources level. By implementing landing zones, organisations can establish a more secure, scalable, and resilient cloud infrastructure that helps reduce the risk of failures and improve overall manageability.
Timeouts¶
Timeouts are an essential part of antifragile patterns, as they help prevent operations from running indefinitely and consuming resources, which can lead to system instability or cascading failures. Implementing timeouts is a proactive measure to ensure a system can recover gracefully from failures or slowdowns.
Here are some best practices and tips for using timeouts effectively:
Identify critical operations¶
Determine which operations in your system are critical and could cause performance issues or failures if they run too long. Examples include remote API calls, database queries, or any operation that relies on external resources.
Set appropriate timeout values¶
Choose reasonable timeout values for each operation based on its expected duration and the impact on the system if it exceeds that duration. Consider the worst-case scenarios, such as network latency or resource contention, and factor them into your chosen timeout values. Be cautious not to set timeout values too low, as doing so might cause unnecessary failures or retries.
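As a concrete illustration (the endpoint and values below are placeholders to tune for your own services), the requests library accepts a (connect, read) timeout tuple so that each phase of a call gets its own budget:
import requests
# Placeholder values: derive these from observed latency, not guesses
CONNECT_TIMEOUT = 3.05  # seconds to establish the TCP connection
READ_TIMEOUT = 10       # seconds to wait for the server to send a response
def fetch_report(report_id):
    # requests accepts a (connect, read) tuple so each phase has its own limit
    response = requests.get(
        f'https://api.example.com/reports/{report_id}',
        timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
    )
    response.raise_for_status()
    return response.json()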
Implement retries with exponential backoff¶
When an operation times out, it may be appropriate to retry it. Implement retries with exponential backoff and a maximum number of attempts to avoid overwhelming the system or external resources. Exponential backoff means increasing the waiting time between retries exponentially, which helps distribute the load more evenly and reduces the risk of cascading failures.
Example: In Python, you can use the backoff library to implement retries with exponential backoff for an API call:
import requests
import backoff
@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=5)
def fetch_user_data(user_id):
response = requests.get(f'https://api.example.com/users/{user_id}', timeout=5)
response.raise_for_status()
return response.json()
Monitor and adjust timeouts¶
Continuously monitor your system's performance, latency, and error rates. Use this information to adjust timeout values to balance system stability and responsiveness. Regularly review and fine-tune timeout settings to ensure they remain appropriate as your system evolves.
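A minimal sketch of grounding those adjustments in data: record call latencies and derive a candidate timeout from a high percentile (the 1.5x headroom factor here is an assumption, not a recommendation from the original text):
import time
import statistics
import requests
latencies = []  # in a real system this would feed your metrics pipeline
def timed_fetch(url):
    # Record how long each call takes so timeout values can be tuned from data
    start = time.monotonic()
    try:
        return requests.get(url, timeout=5)
    finally:
        latencies.append(time.monotonic() - start)
def suggested_timeout(samples, headroom=1.5):
    # Suggest a timeout as the 99th-percentile latency plus some headroom
    p99 = statistics.quantiles(samples, n=100)[98]
    return p99 * headroom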
Provide fallbacks or circuit breakers¶
Consider implementing fallback strategies or circuit breakers for operations prone to timeouts. A fallback strategy might involve returning cached data or a default response when an operation times out. Circuit breakers can help prevent a system from continuously retrying a failing operation by temporarily "opening" the circuit and blocking new requests until the issue is resolved or a specified time has passed.
Example: In the case of a failing API call, you could return cached data or a default response:
import requests
import backoff
def fetch_user_data_with_fallback(user_id):
try:
return fetch_user_data(user_id)
except requests.exceptions.RequestException:
return get_cached_data(user_id) or get_default_user_data()
@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=5)
def fetch_user_data(user_id):
response = requests.get(f'https://api.example.com/users/{user_id}', timeout=5)
response.raise_for_status()
return response.json()
def get_cached_data(user_id):
# Implement your cache retrieval logic here
pass
def get_default_user_data():
# Return default user data
return {"name": "Unknown", "email": "[email protected]"}
Communicate timeouts to users¶
If an operation times out and affects the user experience, provide clear and informative error messages to let users know what happened and what they can do to resolve the issue. This might involve asking users to retry their actions, refresh the page, or contact support for assistance.
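For example, a Flask endpoint can catch the timeout raised by a timeout-protected call and return a clear, actionable message to the user; the route and wording below are illustrative only:
from flask import Flask, jsonify
import requests
app = Flask(__name__)
def fetch_user_data(user_id):
    response = requests.get(f'https://api.example.com/users/{user_id}', timeout=5)
    response.raise_for_status()
    return response.json()
@app.route('/profile/<user_id>')
def profile(user_id):
    try:
        return jsonify(fetch_user_data(user_id))
    except requests.exceptions.Timeout:
        # Explain what happened and what the user can do next
        return jsonify({
            "error": "The profile service is taking too long to respond.",
            "action": "Please retry, or contact support if the problem persists."
        }), 504
if __name__ == '__main__':
    app.run()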
Example: The test below uses httpretty and pytest to simulate a backend response that takes longer than the client timeout, so you can verify that fetch_user_data raises a Timeout that your user-facing error handling (such as an error message with a "Retry" button) can catch:
import httpretty
import pytest
import requests
def test_fetch_user_data_timeout():
httpretty.enable()
httpretty.register_uri(
httpretty.GET,
'https://api.example.com/users/123',
body='{"name": "John Doe", "email": "[email protected]"}',
status=200,
adding_headers={'Content-Type': 'application/json'},
response_delay=6 # Simulate a delay longer than the timeout value
)
with pytest.raises(requests.exceptions.Timeout):
fetch_user_data('123')
httpretty.disable()
httpretty.reset()
Test for timeouts¶
During development and testing, simulate scenarios where operations time out to ensure your system can handle such situations gracefully. Use tools like chaos engineering or fault injection to test your system's resilience and ability to recover from timeouts.
In conclusion, implementing timeouts is an essential part of building antifragile systems. By following these best practices and tips, you can ensure that your system can recover from failures or slowdowns more gracefully and maintain its stability and performance even in the face of unexpected issues.
Use deadlines or timeouts at the system level¶
In some cases, it might be helpful to set a deadline or timeout for an entire sequence of operations rather than individual operations. This approach can help ensure that a user request doesn't take longer than a specified amount of time, even if the individual operations within the request have their own timeouts.
Example: In Python, you can use the contextvars module and asyncio to set a timeout for a sequence of operations:
import asyncio
import contextvars
timeout_var = contextvars.ContextVar("timeout")
async def fetch_user_data(user_id):
# Implement the async version of the API call
pass
async def process_user_data(user_data):
# Perform some processing on the user data
pass
async def handle_request(user_id):
try:
timeout = timeout_var.get()
user_data = await asyncio.wait_for(fetch_user_data(user_id), timeout=timeout)
processed_data = await asyncio.wait_for(process_user_data(user_data), timeout=timeout)
return processed_data
except asyncio.TimeoutError:
# Handle the case when the entire request sequence times out
pass
async def main():
timeout_var.set(10) # Set a 10-second timeout for the entire request sequence
await handle_request("123")
asyncio.run(main())
Use load shedding to avoid overloading your system¶
In high-traffic scenarios, timeouts alone might not be enough to prevent your system from becoming overloaded. Combining timeouts with load-shedding techniques can help you proactively drop or delay requests when your system is under a heavy load. This can help maintain overall system stability and reduce the likelihood of cascading failures.
Example: Implement a rate limiter to limit the number of incoming requests per second:
from ratelimit import limits, sleep_and_retry
import requests
SECONDS = 1
MAX_REQUESTS_PER_SECOND = 10
@sleep_and_retry
@limits(calls=MAX_REQUESTS_PER_SECOND, period=SECONDS)
def fetch_user_data(user_id):
response = requests.get(f'https://api.example.com/users/{user_id}', timeout=5)
response.raise_for_status()
return response.json()
Following these additional best practices, tips, and examples can improve your system's ability to handle failures, slowdowns, and unexpected issues. Building an antifragile system is an ongoing process, and it's essential to continuously monitor, adapt, and refine your strategies to maintain your system's stability, performance, and resilience.
Also, you can implement timeouts in AWS services to ensure your applications remain antifragile. Different AWS services provide different ways to set timeouts. Here are a few examples:
AWS Lambda
For AWS Lambda, you can set the function timeout, which is the maximum amount of time your function is allowed to run before it's terminated. You can set the timeout value in the AWS Management Console, AWS CLI, or AWS SDKs.
Example: Set the timeout for a Lambda function in the AWS Management Console:
- Navigate to the Lambda function in the AWS Management Console.
- Scroll down to the "Basic settings" section.
- Set the "Timeout" value to an appropriate duration (e.g., 5 seconds).
- Save the changes.
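You can also set the function timeout programmatically. A minimal Boto3 sketch (the function name is a placeholder):
import boto3
lambda_client = boto3.client('lambda')
# Set the function timeout to 5 seconds; 'FUNCTION_NAME' is a placeholder
lambda_client.update_function_configuration(
    FunctionName='FUNCTION_NAME',
    Timeout=5
)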
Amazon API Gateway
In Amazon API Gateway, you can set the integration timeout for an API, which is the maximum amount of time the API waits for a response from the backend before returning an error. You can set the integration timeout value in the AWS Management Console or by using AWS CLI or SDKs.
Example: Set the integration timeout for an API in the AWS Management Console:
- Navigate to the API in the Amazon API Gateway console.
- Choose the "Resources" section.
- Select a method (e.g., GET or POST).
- Choose "Integration Request."
- Set the "Timeout" value to an appropriate duration (e.g., 5 seconds).
- Save the changes.
AWS Step Functions
In AWS Step Functions, you can set the timeout for state machine tasks, such as Lambda tasks or activity tasks. The timeout value is specified using the "TimeoutSeconds" field in the state machine definition.
Example: Set the timeout for a Lambda task in a state machine definition:
{
"StartAt": "MyLambdaTask",
"States": {
"MyLambdaTask": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"TimeoutSeconds": 5,
"Next": "NextState"
},
"NextState": {
"Type": "Pass",
"End": true
}
}
}
These are just a few examples of how to set timeouts in AWS services. Depending on the service you use and your application architecture, you may also need to configure timeouts in other services. Always refer to the documentation for the specific AWS service to understand how to set timeouts appropriately.
Additionally, you can use the AWS SDKs or Boto3 (Python) to set timeouts for API calls to various AWS services. These client-side timeouts can help prevent your application from waiting indefinitely for a response from AWS services.
Example: Set the timeout for an API call using Boto3 in Python:
import boto3
# Create a boto3 session
session = boto3.Session()
# Create a service client with the custom session and set the read timeout to 5 seconds
s3_client = session.client('s3', config=boto3.session.Config(read_timeout=5))
# Use the S3 client to make an API call
response = s3_client.list_buckets()
By setting timeouts in various AWS services, you can improve the antifragility of your application and ensure it can handle failures and slowdowns more gracefully.
Redundancy¶
By implementing redundancy, you can help ensure your systems are more resilient to failure and recover quickly from disruptions.
- Use multiple servers to host your applications. If one server fails, your applications will still be available on the other servers.
- Use multiple regions to host your servers so that an outage in one region does not take your application offline.
- Use a load balancer to distribute traffic across multiple servers. This will help to prevent any one server from becoming overloaded.
- Use DNS to balance the traffic across regions.
- Use a content delivery network (CDN) to cache static content, such as images and JavaScript files. This will help to improve the performance of your website or application by reducing the amount of traffic that needs to be sent to your servers.
- Use a backup service to store copies of your data. If your primary data store fails, you will still have a copy of your data to restore.
- Use a disaster recovery plan to document how you will recover from a disaster. This plan should include steps for restoring your data, your applications, and your infrastructure.
Redundancy refers to introducing multiple instances of the same component or process so that others can take over if one fails. This approach helps to ensure the availability and resilience of a system.
Key principles of redundancy¶
Failover Implementing failover mechanisms enables a system to continue functioning even if a component fails. This can be achieved by automatically switching to a backup or standby component when the primary component is unavailable.
Load balancing Distributing incoming network traffic or workload across multiple instances of a service can help maintain system performance, prevent overloading, and improve fault tolerance.
Replication Ensuring that multiple copies of critical data and services are available can help maintain system integrity and facilitate recovery in the event of a failure.
Examples of redundancy¶
Load balancers in front of web servers¶
Multiple web server instances might run behind a load balancer in a web application. The load balancer distributes incoming traffic evenly across the available instances, ensuring no single instance becomes a bottleneck. If one of the instances fails, the load balancer automatically redirects traffic to the remaining instances, maintaining the availability of the web application.
apiVersion: v1
kind: Service
metadata:
name: my-web-app
spec:
selector:
app: web-app
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: LoadBalancer
Database replication¶
Databases are critical components of many systems. You can improve the resilience of a database by using replication techniques, such as master-slave or multi-master replication. In master-slave replication, the master database server replicates its data to one or more slave servers. If the master server fails, a slave server can be promoted to master, ensuring the continued availability of the database.
# PostgreSQL Master-Slave replication using the Crunchy PostgreSQL Operator
apiVersion: crunchydata.com/v1
kind: Pgcluster
metadata:
name: my-database
spec:
Name: my-database
Replicas: 2
PrimaryStorage:
name: my-database-storage
storageClassName: standard
accessMode: ReadWriteOnce
size: 1G
ReplicaStorage:
name: my-database-replica-storage
storageClassName: standard
accessMode: ReadWriteOnce
size: 1G
Distributed file systems¶
Using distributed file systems, such as the Hadoop Distributed File System (HDFS) or GlusterFS, can provide redundancy for your data storage. These systems automatically replicate data across multiple nodes in the cluster, ensuring that the data remains available even if one or more nodes fail.
# GlusterFS volume definition for data redundancy
apiVersion: "v1"
kind: "ConfigMap"
metadata:
name: "my-glusterfs"
data:
vol01: |
volume vol01
type replicate
replica 3
subvolumes subvol01 subvol02 subvol03
end-volume
Redundancy in DNS resolution¶
Domain Name System (DNS) resolution is critical to internet infrastructure. To ensure the availability and resilience of your system, it's important to have redundant DNS providers or multiple DNS servers. If one DNS provider experiences downtime, your application can still resolve domain names through the other DNS provider(s).
You can configure multiple DNS providers using DNS delegation, which enables you to distribute DNS resolution across multiple providers:
; Example DNS zone file with redundant DNS providers
example.com. IN SOA ns1.primary-dns-provider.com. hostmaster.example.com. (
2022010101 ; Serial
10800 ; Refresh
3600 ; Retry
604800 ; Expire
86400 ); Minimum
example.com. IN NS ns1.primary-dns-provider.com.
example.com. IN NS ns2.primary-dns-provider.com.
example.com. IN NS ns1.secondary-dns-provider.com.
example.com. IN NS ns2.secondary-dns-provider.com.
Multiple availability zones and regions¶
When deploying applications on cloud platforms like AWS, Azure, or Google Cloud, you can leverage multiple availability zones (AZs) and regions to ensure redundancy. You can protect your system from failures in a single zone or region by distributing your application instances, databases, and other services across different AZs and regions. This approach also provides better fault tolerance and lowers the risk of downtime due to power outages, network issues, or natural disasters.
For example, when deploying an application on AWS using Amazon EC2 instances, you can create an Auto Scaling group to distribute instances across multiple AZs:
{
"AutoScalingGroupName": "my-app",
"LaunchConfigurationName": "my-launch-config",
"MinSize": 2,
"MaxSize": 10,
"DesiredCapacity": 4,
"AvailabilityZones": ["us-west-2a", "us-west-2b", "us-west-2c"],
"HealthCheckType": "EC2",
"HealthCheckGracePeriod": 300
}
Message queues and event-driven architecture¶
Implementing an event-driven architecture using message queues can help add redundancy to your system by decoupling components and enabling asynchronous communication. Services can publish events to a message queue; other services can consume these events when ready. If a service fails or becomes temporarily unavailable, the message queue can store the events until the service recovers, ensuring no data is lost.
For example, you can use Apache Kafka or Amazon SQS to implement a message queue for your system:
# Example of an Amazon SQS queue
Resources:
MyQueue:
Type: AWS::SQS::Queue
Properties:
QueueName: my-event-queue
RedrivePolicy:
deadLetterTargetArn: !GetAtt MyDeadLetterQueue.Arn
maxReceiveCount: 5
MyDeadLetterQueue:
Type: AWS::SQS::Queue
Properties:
QueueName: my-event-queue-dead-letter
Replicated caching systems¶
Caching is an essential technique for improving the performance of your applications. When using caching systems, it's important to ensure redundancy to prevent a single point of failure. You can achieve this using replicated or distributed caches like Redis, Memcached, or Hazelcast.
For instance, you can set up a Redis cluster to ensure redundancy and high availability:
# Example of a Redis cluster deployment on Kubernetes
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: redis-cluster
spec:
serviceName: redis-cluster
replicas: 6
selector:
matchLabels:
app: redis-cluster
template:
metadata:
labels:
app: redis-cluster
spec:
containers:
- name: redis
image: redis:5.0.5-alpine
command: ["/conf/update-node.sh"]
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- containerPort: 6379
name: client
- containerPort: 16379
name: gossip
volumeMounts:
- name: conf
mountPath: /conf
- name: data
mountPath: /data
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 1Gi
- metadata:
name: conf
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 1Gi
Backup and disaster recovery¶
A robust backup and disaster recovery strategy is essential to ensure redundancy for your data and services. Regularly backing up your data and being able to restore it quickly in case of a disaster can help your system recover from failures and minimise downtime.
For example, you can use Amazon RDS to create automated backups of your databases and restore them in case of a failure:
# Example of an Amazon RDS instance with automated backups
Resources:
MyDBInstance:
Type: AWS::RDS::DBInstance
Properties:
AllocatedStorage: 20
BackupRetentionPeriod: 7
DBInstanceClass: db.t2.micro
Engine: MySQL
MasterUsername: myusername
MasterUserPassword: mypassword
MultiAZ: true
Health checks and monitoring¶
Implementing health checks and monitoring for your services is crucial for maintaining redundancy. By continuously monitoring the health of your components, you can quickly detect failures and take corrective actions.
For instance, you can use Kubernetes readiness and liveness probes to ensure that your services are running correctly:
apiVersion: v1
kind: Pod
metadata:
name: my-service
spec:
containers:
- name: my-service-container
image: my-service-image
ports:
- containerPort: 80
readinessProbe:
httpGet:
path: /health
port: 80
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 80
initialDelaySeconds: 15
periodSeconds: 20
By following these best practices and incorporating redundancy into the various aspects of your system, you can build more resilient, antifragile systems that withstand failures, continue to operate in challenging conditions, and remain available and performant even when components fail.
Diversity¶
Diversity in antifragile systems refers to using different technologies, methods, or designs to perform the same function. This approach increases the chances that at least one component will succeed in the face of failure.
For example, when designing a highly available storage system, you might use different types of storage media like SSDs, HDDs, and tapes for different levels of data access. This approach ensures that if one type of storage media fails, the other types can continue functioning, providing redundancy and resilience.
Another example is implementing load balancing with multiple algorithms, such as Round Robin, Least Connections, and Random. Using a mix of these algorithms, you can distribute load across your services more efficiently and ensure better overall performance even if one of the algorithms underperforms or fails.
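The toy Python sketch below (illustrative only, with placeholder backend addresses) shows how several selection algorithms can back each other up so that traffic is still routed even if one strategy misbehaves:
import itertools
import random
class RoundRobin:
    # Cycle through backends in order
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)
    def choose(self, _connections):
        return next(self._cycle)
class LeastConnections:
    # Pick the backend with the fewest active connections
    def choose(self, connections):
        return min(connections, key=connections.get)
class RandomChoice:
    # Pick any backend at random
    def choose(self, connections):
        return random.choice(list(connections))
def pick_backend(strategies, connections):
    # Try each strategy in turn; if one fails, fall back to the next
    for strategy in strategies:
        try:
            return strategy.choose(connections)
        except Exception:
            continue  # this strategy misbehaved; try a different algorithm
    raise RuntimeError("No load-balancing strategy could select a backend")
# Placeholder data: active connection counts per backend
connections = {"10.0.0.1": 4, "10.0.0.2": 2, "10.0.0.3": 7}
strategies = [LeastConnections(), RoundRobin(list(connections)), RandomChoice()]
print(pick_backend(strategies, connections))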
AWS¶
In AWS, you can use different types of databases to store your application data, such as Amazon RDS for relational databases, Amazon DynamoDB for NoSQL databases, and Amazon S3 for object storage. This diverse set of storage options allows you to choose the most appropriate storage solution for different parts of your application, increasing the chances that at least one component will succeed in the face of failure.
Kubernetes¶
For Kubernetes, you can use different ingress controllers to manage traffic routing, such as NGINX, HAProxy, and Traefik. Each ingress controller has its unique features and benefits, and using a mix of these controllers can help you distribute the load more efficiently and ensure better overall performance even if one of the controllers underperforms or fails.
Modularity¶
Modularity in antifragile systems involves breaking a system into smaller, more manageable components that can be developed, tested and maintained independently. This design principle makes it easier to isolate faults, reduces the impact of failures, and allows for more accessible updates and maintenance.
For example, a microservices architecture is a modular approach to building applications. Each microservice is a small, independent component responsible for a specific function or set of functions. This design allows developers to work on individual microservices without impacting the entire system. It also enables more efficient scaling and fault tolerance.
In a practical example, consider an e-commerce application with separate microservices for user management, product catalogue, shopping cart, and payment processing. Each of these components can be developed, tested, and deployed independently, reducing the complexity of the overall system and improving its resilience.
AWS¶
In AWS, you can use AWS Lambda to build a serverless application with modular components. Each Lambda function represents an independent module responsible for a specific function or set of functions. These functions can be developed, tested, and deployed independently, reducing the complexity of the overall system and improving its resilience.
Kubernetes¶
In Kubernetes, you can deploy your application as a set of independent microservices, each running in its container. Using Kubernetes deployments, services, and pods, you can create a modular architecture where each component is responsible for a specific function and can be developed, tested, and deployed independently.
Loose coupling¶
Loose coupling in antifragile systems minimises dependencies between components, making it easier to swap them out, upgrade, or repair without impacting the overall system. This design principle improves flexibility and resilience, allowing systems to quickly adapt and recover from failures.
For example, when building a web application, you can use a RESTful API to facilitate communication between the front-end and back-end components. This approach ensures that the front end can interact with the back end using a standard protocol, even if the underlying implementation of the back end changes. This loose coupling allows you to update or replace components independently without affecting the overall system.
Another example is using message queues, such as Apache Kafka or RabbitMQ, to decouple components in a distributed system. By using message queues, components can communicate asynchronously, reducing the dependency on direct communication between components. If a component fails, the message queue can store messages until the component recovers, ensuring that no data is lost.
In summary, by implementing diversity, modularity, and loose coupling in your systems, you can create antifragile systems that are more resilient and adaptable in the face of failures. By combining different technologies and approaches, you can reduce the risk of cascading failures and create a system that is easier to maintain, scale, and evolve.
AWS¶
In AWS, you can use Amazon SNS and Amazon SQS to create a publish-subscribe architecture that decouples your application components. By using SNS for publishing messages and SQS for subscribing to these messages, you can ensure that your components communicate asynchronously and are not directly dependent on each other. This loose coupling allows for better fault tolerance and easier component updates.
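A minimal Boto3 sketch of this publish-subscribe wiring (the topic and queue names are placeholders, and the SQS access policy that allows SNS to deliver messages is omitted for brevity):
import boto3
sns = boto3.client('sns')
sqs = boto3.client('sqs')
# Create the topic and queue (names are placeholders)
topic_arn = sns.create_topic(Name='order-events')['TopicArn']
queue_url = sqs.create_queue(QueueName='payment-service-queue')['QueueUrl']
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=['QueueArn']
)['Attributes']['QueueArn']
# Subscribe the queue to the topic (an SQS policy allowing SNS delivery is also required)
sns.subscribe(TopicArn=topic_arn, Protocol='sqs', Endpoint=queue_arn)
# The publisher never talks to the consumer directly
sns.publish(TopicArn=topic_arn, Message='Order 123 created')
# The consumer polls its queue whenever it is ready
messages = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=5)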
Kubernetes¶
In Kubernetes, you can use Custom Resource Definitions (CRDs) and operators to extend the Kubernetes API and manage custom resources. Using CRDs, you can create a loosely coupled architecture where your custom resources can interact with built-in Kubernetes and other custom resources using standard Kubernetes APIs. This approach allows you to update or replace components independently without affecting the overall system.
For example, you can deploy a Kafka operator and create custom Kafka resources to manage your Kafka clusters. The Kafka operator will watch for changes to the Kafka custom resources and update the underlying Kafka clusters accordingly, ensuring that your Kafka components are decoupled from the rest of your Kubernetes resources.
Self-healing¶
Self-healing systems are designed to automatically detect and recover from failures without human intervention. They often incorporate monitoring, automated failover, and redundancy to ensure the system can continue operating even when components fail.
AWS¶
AWS Auto Scaling Groups can be used to create self-healing infrastructure. When instances within the group fail health checks, AWS terminates unhealthy instances and launches new ones automatically, maintaining the desired number of instances. Elastic Load Balancing (ELB) can also distribute incoming traffic across healthy instances, ensuring that requests are not sent to failed instances.
AWS Auto Scaling Group (using CloudFormation template):
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones:
        - us-west-2a
        - us-west-2b
      LaunchConfigurationName: !Ref LaunchConfiguration
      MinSize: 2
      MaxSize: 4
      DesiredCapacity: 2
      HealthCheckType: EC2
      HealthCheckGracePeriod: 300
  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-0c94855ba95b798c7 # Amazon Linux 2 AMI
      InstanceType: t2.micro
      SecurityGroups:
        - !Ref InstanceSecurityGroup
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH access and HTTP traffic
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: '80'
          ToPort: '80'
          CidrIp: 0.0.0.0/0
Kubernetes with Istio¶
In Kubernetes, Deployments and ReplicaSets can be used to create self-healing applications. When a pod fails, Kubernetes creates a new pod to replace it, maintaining the desired number of replicas. Istio enhances this by providing traffic management and failure handling features, such as retries and timeouts, which can help automatically recover from failures and maintain application availability.
Kubernetes Deployment with self-healing:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app-image:latest
          ports:
            - containerPort: 80
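Kubernetes can only restart or replace containers it knows are unhealthy, so it helps to give the kubelet an explicit signal. A minimal sketch of liveness and readiness probes that could be added under the container entry above; the /healthz endpoint is an assumption about the application:
livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint served by the application
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 5
With these in place, the kubelet restarts containers that fail the liveness probe, and Services stop routing traffic to pods that fail the readiness probe.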
Graceful degradation¶
Graceful degradation is a design principle that allows systems to function at a reduced capacity when one or more components fail. The goal is to provide a satisfactory user experience despite the failure, even if some functionality is temporarily unavailable.
For example, in a distributed database system, you can use read replicas to handle read requests in case the primary database fails. While write operations may not be available during the failure, read operations can still be served, providing users with a degraded but functional experience.
In a microservices architecture, you can use the Circuit Breaker pattern to gracefully degrade the functionality of your system when a dependent service is unavailable. Instead of continuously attempting to call the failing service, the Circuit Breaker can return a cached or default response, allowing the system to continue operating with reduced functionality.
AWS¶
In AWS, you can implement graceful degradation by using Amazon RDS read replicas. When the primary database fails, read replicas can handle read requests, ensuring that read operations continue functioning. Although write operations may be temporarily unavailable, the system can still provide users with a degraded but functional experience.
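A minimal CloudFormation sketch of a read replica, assuming an existing primary instance with the identifier my-primary-db (the instance class is illustrative):
Resources:
  ReadReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      SourceDBInstanceIdentifier: my-primary-db   # existing primary instance (assumed)
      DBInstanceClass: db.t3.micro
Read traffic can be pointed at the replica's endpoint; if the primary becomes unavailable, reads continue to be served, and the replica can be promoted to a standalone instance if a full failover is required.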
Kubernetes with Istio¶
In a microservices architecture on Kubernetes, you can use Istio to implement the Circuit Breaker pattern for graceful degradation. Istio allows you to define DestinationRules with outlier detection and traffic policies to automatically detect failing services and prevent continuous attempts to call them. When a dependent service is unavailable, the circuit breaker can return a cached or default response, allowing the system to continue operating with reduced functionality.
Istio DestinationRule with outlier detection:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Istio VirtualService with retries and timeouts:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: my-service.default.svc.cluster.local
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: gateway-error,connect-failure,refused-stream
      timeout: 10s
Load shedding¶
Load shedding involves dropping or throttling low-priority tasks during periods of high demand to maintain the performance and stability of the system. This technique can help prevent cascading failures by ensuring that critical tasks continue to be processed, even when the system is under heavy load.
For example, you can implement load shedding in a web application using rate-limiting middleware based on an algorithm such as Token Bucket or Leaky Bucket. This middleware can limit the number of incoming requests from individual users or IP addresses, preventing the system from becoming overwhelmed during periods of high demand.
AWS¶
In AWS, you can use Amazon API Gateway to implement load shedding by setting up usage plans and throttling limits for your APIs. By controlling the number of requests that can be made to your API per second or day, you ensure that your backend services are not overwhelmed by a sudden surge in traffic. Additionally, you can use AWS WAF (Web Application Firewall) to create rate-based rules that limit requests from specific IP addresses, helping to prevent DDoS attacks.
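A sketch of such a rate-based rule using AWS WAFv2 in CloudFormation; the limit of 2000 requests per five-minute window per source IP is illustrative, and the web ACL still needs to be associated with a resource such as an Application Load Balancer or API Gateway stage:
Resources:
  RateLimitWebACL:
    Type: AWS::WAFv2::WebACL
    Properties:
      Name: rate-limit-acl
      Scope: REGIONAL                  # use CLOUDFRONT for CloudFront distributions
      DefaultAction:
        Allow: {}
      VisibilityConfig:
        SampledRequestsEnabled: true
        CloudWatchMetricsEnabled: true
        MetricName: rate-limit-acl
      Rules:
        - Name: limit-per-ip
          Priority: 0
          Action:
            Block: {}
          Statement:
            RateBasedStatement:
              Limit: 2000              # requests per 5-minute window per IP
              AggregateKeyType: IP
          VisibilityConfig:
            SampledRequestsEnabled: true
            CloudWatchMetricsEnabled: true
            MetricName: limit-per-ip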
Kubernetes with Istio¶
In Kubernetes with Istio, you can use rate limiting to implement load shedding. Istio's Envoy proxies can call out to an external rate-limit service, allowing you to configure rate limits for incoming requests to your services. Limiting the rate of requests prevents services from becoming overwhelmed during periods of high demand.
Furthermore, you can use Kubernetes Horizontal Pod Autoscaler (HPA) to scale your application based on the current load. HPA can monitor the CPU or memory usage of your pods and adjust the number of replicas accordingly, helping to maintain the performance and stability of your application during periods of high demand.
Istio Rate Limiting configuration¶
First, apply the EnvoyFilter to enable the rate-limiting service:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: filter-ratelimit
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: GATEWAY
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.ratelimit
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
            domain: my_ratelimiter
            timeout: 1s
            failure_mode_deny: true
            rate_limit_service:
              grpc_service:
                envoy_grpc:
                  cluster_name: rate_limit_cluster
                timeout: 1s
    - applyTo: CLUSTER
      match:
        cluster:
          service: ratelimit.default.svc.cluster.local
      patch:
        operation: ADD
        value:
          name: rate_limit_cluster
          type: STRICT_DNS
          connect_timeout: 1s
          lb_policy: ROUND_ROBIN
          load_assignment:
            cluster_name: rate_limit_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: ratelimit.default.svc.cluster.local
                          port_value: 8080
Then, apply the QuotaSpec and QuotaSpecBinding to configure the rate-limiting rules:
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpec
metadata:
  name: request-count
spec:
  rules:
    - quotas:
        - charge: 1
          max_amount: 10
          valid_duration: 1s
          dimensions:
            source: source.labels["app"] | "unknown"
            destination: destination.labels["app"] | "unknown"
---
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpecBinding
metadata:
  name: request-count
spec:
  quota_specs:
    - name: request-count
      namespace: default
  services:
    - service: '*'
These configuration files enable rate limiting for incoming requests in your Istio-enabled Kubernetes cluster. The rules allow ten requests per second, with the source and destination application labels used as dimensions for the rate limit. Note that QuotaSpec and QuotaSpecBinding rely on Istio's Mixer component, which has been deprecated and removed in recent Istio releases; on newer versions, use Envoy's native rate-limit service (as configured with the EnvoyFilter above) instead.
Adaptive Capacity¶
Adaptive capacity is a system's ability to learn from and adjust to changing conditions. In AWS, you can use Auto Scaling Groups to adjust the number of instances in response to changing load, while Elastic Load Balancing distributes incoming traffic across multiple instances, improving the system's performance and resilience.
In Kubernetes, you can use the Horizontal Pod Autoscaler (HPA) to adjust the number of replicas based on CPU utilisation or custom metrics, such as the number of requests per second.
AWS Auto Scaling (using CloudFormation template):
Resources:
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AvailabilityZones:
        - us-west-2a
        - us-west-2b
      LaunchConfigurationName: !Ref LaunchConfiguration
      MinSize: 2
      MaxSize: 4
      DesiredCapacity: 2
      HealthCheckType: EC2
      HealthCheckGracePeriod: 300
  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: ami-0c94855ba95c798c7 # Amazon Linux 2 AMI
      InstanceType: t2.micro
      SecurityGroups:
        - !Ref InstanceSecurityGroup
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable SSH access and HTTP traffic
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: '80'
          ToPort: '80'
          CidrIp: 0.0.0.0/0
Kubernetes Horizontal Pod Autoscaler:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 50
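To scale on a request-based signal rather than CPU, the autoscaling/v2 API supports custom metrics. A sketch, assuming a metrics adapter (for example, prometheus-adapter) exposes a per-pod http_requests_per_second metric; the metric name and target value are assumptions:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # exposed via a metrics adapter (assumed)
        target:
          type: AverageValue
          averageValue: "100"              # target average requests per second per pod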
Rate Limiting¶
As shown in the load shedding section above, you can use Istio to implement rate limiting in your Kubernetes cluster; refer to the Istio configuration in that section for examples. Rate limiting can also be achieved in AWS with Amazon API Gateway, which lets you define quotas and throttling limits for your APIs; the console steps below show how to set this up, and a CloudFormation sketch follows them.
AWS API Gateway (using AWS Management Console):
- Navigate to the AWS API Gateway service in the AWS Management Console.
- Create or select an existing API.
- In the "Actions" dropdown, click "Create Usage Plan."
- Set the desired "Throttling" and "Quota" values, and click "Next."
- Associate the usage plan with an API stage, and click "Done."
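The same usage plan can be expressed in CloudFormation; a minimal sketch, assuming an existing REST API resource named MyApi and a deployed prod stage (throttle and quota values are illustrative):
Resources:
  ApiUsagePlan:
    Type: AWS::ApiGateway::UsagePlan
    Properties:
      UsagePlanName: basic-plan
      ApiStages:
        - ApiId: !Ref MyApi    # assumed AWS::ApiGateway::RestApi defined elsewhere
          Stage: prod          # assumed existing stage
      Throttle:
        RateLimit: 100         # steady-state requests per second
        BurstLimit: 200        # short-term burst allowance
      Quota:
        Limit: 10000           # total requests allowed per period
        Period: DAY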
Monitoring and Observability¶
In AWS, you can use Amazon CloudWatch to collect and analyse logs, metrics, and events from your infrastructure and applications. You can set up alarms to notify you when specific thresholds are breached, enabling you to take appropriate action.
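As an illustration, a CloudWatch alarm can also be declared in CloudFormation; a minimal sketch that watches average CPU for an Auto Scaling Group (the group reference and the scaling policy it triggers are assumptions, and the policy itself is not shown):
Resources:
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Average CPU above 80% for 10 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref AutoScalingGroup   # assumed Auto Scaling Group in the same template
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref ScaleOutPolicy            # assumed AWS::AutoScaling::ScalingPolicy (not shown)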
In Kubernetes, you can use Prometheus to collect and store metrics from your applications and Grafana for visualisation. Istio provides built-in observability features like distributed tracing (with Jaeger or Zipkin), monitoring (with Prometheus), and logging (with Elasticsearch, Fluentd, and Kibana).
Prometheus and Grafana setup in Kubernetes:
- Install Prometheus and Grafana using Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus
helm install grafana grafana/grafana
- Port-forward the Grafana service to access the dashboard:
kubectl port-forward svc/grafana 3000:80
You can now access Grafana at http://localhost:3000.
Feedback Loops¶
Feedback loops can be implemented using a combination of monitoring, logging, and alerting tools. For example, you can use AWS CloudWatch or Kubernetes monitoring tools (Prometheus and Grafana) to collect performance data and set up alerts to notify you when specific thresholds are reached.
You can then use this information to adjust your system's configuration, such as updating the desired number of instances in an AWS Auto Scaling Group or modifying the desired number of replicas in a Kubernetes Deployment using the Horizontal Pod Autoscaler.
In addition, you can use tools like AWS X-Ray or Istio's distributed tracing capabilities to analyse request paths and identify bottlenecks or errors in your system. This information can be used to improve your application's performance and resilience.
Using AWS CloudWatch Alarms:
- Navigate to the CloudWatch service in the AWS Management Console.
- Click "Alarms" in the left-hand menu, and click "Create alarm."
- Select the desired metric (e.g., CPUUtilization) and set the alarm threshold.
- Configure actions to be taken when the alarm is triggered, such as sending a notification or adjusting an Auto Scaling Group.
Using Prometheus Alertmanager in Kubernetes:
- Install Alertmanager using Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install alertmanager prometheus-community/alertmanager
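- Configure alert routing:
A minimal alertmanager.yml sketch that forwards all alerts to a single webhook receiver; the webhook URL is a placeholder for whatever automation acts on the alert (for example, a service that adjusts replica counts):
route:
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: default
    webhook_configs:
      - url: http://remediation-service.default.svc.cluster.local/alerts   # placeholder endpoint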
By continuously monitoring your system and using the data collected to make adjustments, you can create a feedback loop that helps your system learn from and adapt to changing conditions, improving its overall performance and resilience.
These examples show how adaptive capacity, rate limiting, monitoring and observability, and feedback loops can be implemented in AWS and Kubernetes.