Introduction
A circuit breaker is an electrical safety device that protects an electrical circuit from damage caused by overcurrent. Most modern houses and buildings have circuit breakers, which act as resettable fuses: they automatically cut off power when the current rises above safe levels, protecting the circuit from damage and reducing the risk of fire caused by an overload or short circuit.
The Circuit Breaker software design pattern was popularised by Michael Nygard in his book, Release It! It is one of the stability patterns described in the book, and it can be used in microservices, serverless and back-end systems.
In a serverless architecture, the circuit breaker design pattern can be used to stop cascading failures and improve fault tolerance. The circuit breaker tracks a service's health and cuts off traffic to that service as soon as it starts to fail or behave abnormally. Once the service has recovered, the circuit breaker routes traffic back to it.
This approach can stop failures from spreading throughout the system and is especially helpful in serverless systems, where functions are invoked in response to events.
This article demonstrates the basics of the circuit breaker design pattern, shows how you can use it to build a resilient and reliable application and, lastly, gives some examples of business-related use cases.
How does the data flow in the Circuit Breaker design pattern?
In the diagram below, you can see a Service Caller, which can be anything from an API consumer or a web/app user to another back-end application. The Service Caller sends a request for authentication. The request is passed to Amazon API Gateway, which calls the Circuit Breaker AWS Lambda function to check whether the Authentication Service is available. Once the circuit breaker confirms that the Authentication Service Lambda is up and running, it continues the request flow and waits for the Authentication Service's response.
In a typical request-and-response scenario where none of the services has an issue, the Service Caller sends its request to the application and the Authentication Service Lambda returns a successful response (Figure A).
If the Authentication Service Lambda breaks down or fails unexpectedly, the circuit breaker replies to the Service Caller with a failure message and saves the state of the Authentication Service to an Amazon DynamoDB table (Figure B). DynamoDB stores the state of each service; for simplicity, any service that has a record in DynamoDB is considered down or unhealthy. Each record can have a defined expiration time, after which DynamoDB Time to Live (TTL) deletes it automatically.
For all subsequent requests, the circuit breaker checks DynamoDB first to get the Authentication Service status. If the service is recorded in DynamoDB, there is no need to call the actual Authentication Service Lambda; instead, the circuit breaker immediately returns an error response to the Service Caller (Figure C).
Once the failure record expires and is removed from DynamoDB, requests from the Service Caller proceed to the Authentication Service Lambda again (Figure D). If the Authentication Service has recovered, it sends a successful response to the Service Caller.
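The request flow in Figures A to D can be sketched in Python as follows. This is a minimal, hypothetical sketch rather than the article's implementation: a plain dictionary stands in for the DynamoDB table, and the function names, the `ServiceName`/`ExpiresAt` attributes and the 600-second (10-minute) default TTL are illustrative assumptions. In a real table, the TTL attribute must be a Number holding a Unix epoch timestamp in seconds. Because DynamoDB deletes expired items in the background, typically within a few days rather than at the exact expiry time, the sketch compares `ExpiresAt` with the current time instead of relying on the item having disappeared.

```python
import time
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")

# Hypothetical in-memory stand-in for the DynamoDB table of failed services,
# keyed by service name. In DynamoDB the TTL attribute must be a Number
# holding a Unix epoch timestamp in seconds.
FailedServiceTable = Dict[str, dict]


class ServiceUnavailable(Exception):
    """Raised instead of invoking a service that is recorded as down."""


def is_service_down(table: FailedServiceTable, service_name: str, now: float) -> bool:
    """A service is down while it has an unexpired failure record."""
    item = table.get(service_name)
    # TTL deletion is not immediate, so an expired item may still be readable;
    # treat it as if it were already gone (Figure D).
    return item is not None and item["ExpiresAt"] > now


def record_failure(table: FailedServiceTable, service_name: str, now: float, ttl_seconds: int) -> None:
    """Save the failure with an expiry time so the record ages out on its own (Figure B)."""
    table[service_name] = {
        "ServiceName": service_name,
        "ExpiresAt": int(now + ttl_seconds),
    }


def call_with_breaker(
    table: FailedServiceTable,
    service_name: str,
    call: Callable[[], T],
    ttl_seconds: int = 600,  # hypothetical 10-minute expiry
    now: Optional[float] = None,
) -> T:
    now = time.time() if now is None else now
    if is_service_down(table, service_name, now):
        # Figure C: fail fast without invoking the service.
        raise ServiceUnavailable(f"{service_name} is unavailable; try again later")
    try:
        # Figure A on the happy path, Figure D after the record has expired.
        return call()
    except Exception:
        # Figure B: remember the failure, then report it to the caller.
        record_failure(table, service_name, now, ttl_seconds)
        raise
```

In a deployed version, the dictionary lookup and assignment would be replaced by DynamoDB `GetItem` and `PutItem` calls (for example through boto3), and `call` would invoke the Authentication Service Lambda.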
What are the Circuit Breaker design pattern states?
A circuit breaker acts as a proxy between the service caller and the service. It monitors the health of the service and, if it detects failures or slow responses, it can “trip” the circuit and prevent further requests from being sent to the service.
The circuit breaker design pattern is implemented as a state machine that can be in one of three states: OPEN, CLOSED or HALF-OPEN.
- The CLOSED state means that everything is operating normally and requests are passed through to the service. This state is shown in Fig. A – Circuit Breaker on a Happy Path. In a more advanced setup, the circuit breaker also monitors the error rate, or whether response times exceed a certain threshold, while it is CLOSED.
- In the OPEN state, the circuit breaker stops calling the actual service and immediately returns an error response. This is a damage-control measure that prevents further failures and keeps the system from being overloaded. In some cases, instead of returning an error to the service caller, the request is delegated to another, similar service for processing. A related idea appears in Amazon ECS deployments: the Amazon ECS deployment circuit breaker keeps the old containers until the new containers are healthy and working; otherwise, it fails the deployment and retains the old, working containers. This state is also described in Fig. B – Circuit Breaker on a sudden service failure and Fig. C – Circuit Breaker on a service failure.
- After a certain period of time, the circuit breaker allows a limited number of requests through to the service. This is called the HALF-OPEN state, shown in Fig. D – Circuit breaker retrying the service after a timeout period.
These trial requests reach the service and their results are assessed. If they succeed and return valid responses, the circuit breaker transitions to the CLOSED state; if they fail to return a valid response, it transitions back to the OPEN state. In some implementations, a separate health checker running in the background performs this assessment.
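The three states and their transitions can be illustrated with a small in-process state machine. This Python sketch is an illustration under stated assumptions, not a production library: it counts consecutive failures instead of tracking an error rate, lets the first request after the timeout through as the HALF-OPEN trial, and is not thread-safe. The `failure_threshold` and `reset_timeout` parameters are hypothetical names chosen for the example, and the clock is injectable so the timeout can be tested.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures that trip the circuit
        self.reset_timeout = reset_timeout  # seconds to stay OPEN before a trial request
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.reset_timeout:
                # Timeout elapsed: let this request through as a trial.
                self.state = State.HALF_OPEN
            else:
                # Fail fast instead of calling the unhealthy service.
                raise CircuitOpenError("circuit is OPEN; request rejected")
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self) -> None:
        # A successful call (including a HALF-OPEN trial) closes the circuit.
        self.failures = 0
        self.state = State.CLOSED

    def _record_failure(self) -> None:
        self.failures += 1
        # A failed trial, or too many failures while CLOSED, opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A single failed trial sends the breaker straight back to OPEN and restarts the timeout, which matches the HALF-OPEN behaviour described above.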
Circuit breaker in action
Imagine a new start-up that offers an online service where people can generate dog and cat art posters using artificial intelligence (AI). The service is currently in a beta phase, and only a limited number of users can try it out. The company is trialling the Circuit Breaker design pattern in its serverless architecture to test two hypotheses:
- Can they improve the front-end user experience using this design pattern?
- Can they improve the performance and back-end infrastructure using this design pattern?
The art generation process is shown in Fig. F – Circuit Breaker State Machine for AI Art Generator. The service callers are the users who send requests to the back-end. A back-end service named ArtGenerator generates the art using AI. The circuit breaker is the first Choice State, which branches on the service state read from a FailedService table in DynamoDB.
In the CLOSED state, the circuit breaker queries the FailedService table for the service name ArtGenerator. If the query returns nothing, the ArtGenerator Lambda function is invoked, and a successful art generation emails the art to the user.
If the art generation fails, the workflow saves the service name ArtGenerator in the FailedService DynamoDB table, which trips the circuit to the OPEN state. It also emails the user that the art was not generated successfully and that they should wait 10 minutes before trying again. While the record exists, new requests are rejected immediately without invoking the ArtGenerator Lambda function.
After waiting 10 minutes, the user tries to generate art again. This request falls under the HALF-OPEN state. The ArtGenerator Lambda function is invoked again, and the circuit transitions to either the CLOSED state (successful art generation) or the OPEN state (failed art generation).
Using this design pattern, the start-up can:
- Improve the user experience by immediately reporting the state of the ArtGenerator service to the front-end. If there are any issues, the circuit breaker blocks requests to the ArtGenerator service until they are resolved.
- View failed State Machine executions and how often they fail in the logs and the DynamoDB table. They can also check Amazon CloudWatch Logs for code errors to understand why the ArtGenerator Lambda function is failing. They might learn that their code can be optimised to process jobs faster, that they need a higher error threshold or a longer timeout period, or that instead of a Lambda function they need a GPU-optimised EC2 instance to process the images more efficiently. These are just some of the hypotheses they could explore to improve performance and their infrastructure.
Additional enhancement using Exponential backoff
In AWS Step Functions, a Task state that invokes a Lambda function has a configurable error-handling mechanism: its Retry field lets you set the maximum number of retries (MaxAttempts), the interval before the first retry (IntervalSeconds) and a backoff rate (BackoffRate).
Systems that depend on network communication can become more resilient by using a technique called exponential backoff: after each unsuccessful attempt, the time before the next retry increases exponentially until a maximum delay or a maximum number of attempts is reached.
In AWS Step Functions, exponential backoff can also be used to decide how long a circuit breaker should stay in the OPEN state after tripping because of an increased error rate or another problem.
For example, the circuit breaker might be configured with an interval of 1 second before the first retry. If that retry fails, the delay is multiplied by the backoff rate of 2, so the next delay is 2 seconds, then 4 seconds, then 8 seconds, and so on, until the maximum of 5 attempts is reached.
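The arithmetic in this example can be checked with a short Python sketch. The `retry_policy` dictionary mirrors the shape of a Step Functions Retry field (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) using the values from the example; matching on `States.TaskFailed` is an assumption, so adjust it to the errors you actually want to retry. The `retry_delays` helper and its optional `max_delay` cap are hypothetical and only reproduce the formula IntervalSeconds × BackoffRate^(n − 1) for the n-th retry; Step Functions computes the real waits itself.

```python
import json
from typing import List, Optional

# Shape of a Step Functions "Retry" field, using the values from the example above.
# Retrying on "States.TaskFailed" is an assumption; choose the errors you need.
retry_policy = [
    {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 1,  # wait before the first retry
        "MaxAttempts": 5,  # maximum number of retries
        "BackoffRate": 2.0,  # multiplier applied to each subsequent wait
    }
]


def retry_delays(
    interval_seconds: float,
    max_attempts: int,
    backoff_rate: float,
    max_delay: Optional[float] = None,
) -> List[float]:
    """Wait before retry n is interval_seconds * backoff_rate ** (n - 1), optionally capped."""
    delays = []
    for attempt in range(1, max_attempts + 1):
        delay = interval_seconds * backoff_rate ** (attempt - 1)
        if max_delay is not None:
            delay = min(delay, max_delay)  # hypothetical cap on a single wait
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(json.dumps({"Retry": retry_policy}, indent=2))
    rule = retry_policy[0]
    print(retry_delays(rule["IntervalSeconds"], rule["MaxAttempts"], rule["BackoffRate"]))
```

With these values the helper returns `[1.0, 2.0, 4.0, 8.0, 16.0]`: five retries whose waits double each time. As an assumption beyond what Step Functions does for you, the same formula could size the DynamoDB TTL for the n-th consecutive trip, so that a repeatedly failing service stays OPEN for progressively longer.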
Practical application of Circuit Breaker design pattern
The example above is just one of the problems that the Circuit Breaker design pattern can solve. The pattern can also be used by e-commerce websites during sales, when high traffic volumes are expected. In a fleet of Internet of Things (IoT) devices, it can detect and isolate the devices that are malfunctioning or causing issues. In healthcare, when a data source begins to have issues, the circuit breaker can be tripped to stop requests to it until the source is functioning normally again, which helps protect against data loss or data corruption. Financial institutions, such as online banks, maintain a database of suspicious information and risky activities. By comparing a user's activity against that database, the circuit breaker can be used to identify bad actors, and a match immediately stops further processing of their requests. This helps secure the system without affecting legitimate customers.
Things to consider when using Circuit Breaker design pattern
- Calibrate the error threshold carefully: if it is too low, the circuit breaker may OPEN unnecessarily; if it is too high, failures can cascade before the breaker trips.
- The pattern adds an extra layer to every request. If it is not used carefully, it adds complexity and overhead that can degrade system performance.
- Make sure system monitoring is properly configured; without adequate monitoring, faults are difficult to discover and diagnose.
Conclusion
In summary, the circuit breaker design pattern in a serverless architecture is a powerful design principle that addresses cascading failures, resource overloading and latency issues. It offers a number of benefits, such as greater availability, less downtime and an improved user experience. It helps with real-world problems in e-commerce websites, financial applications, Internet of Things apps and healthcare programs. Any distributed system that aims to ensure high availability and prevent system failures should consider using the circuit breaker pattern.