May 18, 2015 / by Indu Alagarsamy / In Design Patterns /
Don't get zapped! Protect your software
This article was originally published on the Particular.net Blog.
Life as a software developer is definitely lived in the fast lane. After weeks of cranking out code to get the features developed, and after the builds and tests are green and QA stamps its seal of approval, the code is deployed to production. And then the most dreaded thing happens: the deployed software fails in production in a bad sort of way. In a Murphy’s law sort of way. As the saying goes, “Anything that can go wrong, will go wrong.” But what if the code we write took this sort of thing into consideration?
So, how do we take a bad thing and turn it around into something good?
Electronics to the rescue
I still remember the day my brother and I had to change a fuse in our house after a power surge. Granted, I didn’t know the gravity of the situation at the time; he’s the one with the electrical savvy. The fuse was completely burnt. But it saved our TV. In electrical engineering, fuses and circuit breakers were invented for exactly this sort of thing. An overload of power can cause serious damage, from ruining electrical equipment to even setting the house on fire! A fuse contains a small filament of wire that will melt during an electrical overload, similar to a light bulb burning out, stopping the dangerous flow of current and keeping other electrical equipment and the house safe.
Fuses evolved into circuit breakers, which commonly use an electromagnet to break the circuit instead of burning it up, allowing a circuit breaker to be reset and used over and over. However, the basic premise is the same. The device detects the overload, then fails fast, without destroying everything else.
Thinking back on it, that’s a pretty powerful concept. By actually killing the component (the fuse was literally dead), you can prevent serious damage. Our TV was alive thanks to the dead fuse after that surge episode. So why can’t the same be done in software? Something bad happens, and you have a component in your software that helps you fail fast. Mimic that real-life behavior and we have the Circuit Breaker design pattern.
In dealing with distributed scenarios, some failures are transient, where quick successive retries will fix the problem. But there may be some scenarios where connectivity to a critical dependency is lost and may not be restored for a while. For example, an application may lose its connection to a persistent store hosted on the cloud. In these scenarios, by shutting down your service, you can prevent further damage to the rest of your system by avoiding bad processing of data, or even worse, data loss or cascading failures.
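For contrast with the circuit breaker that follows, here is a minimal sketch of how the transient case might be handled with a few quick retries. The `TransientRetry` class, its `Execute` method, and its parameters are hypothetical names chosen for illustration; they are not part of the article's code.

```csharp
using System;
using System.Threading;

// Hypothetical helper, shown only to illustrate retrying transient failures.
public static class TransientRetry
{
    // Runs the operation, retrying with a short pause between attempts.
    // Once maxAttempts is reached, the last exception propagates to the caller.
    public static T Execute<T>(Func<T> operation, int maxAttempts, TimeSpan pause)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assume the failure is transient and try again after a pause.
                Thread.Sleep(pause);
            }
        }
    }
}
```

A helper like this covers brief hiccups. It does nothing for an outage that lasts longer than its retry budget, which is the case the rest of this article addresses.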
Failing fast also makes it easier for Ops to monitor and respond. From their point of view, a service that keeps trying in vain to re-establish connectivity may still appear healthy, so it never triggers the alarms it should. If you instead cause the service to fail completely when appropriate, the warning lights go off, and Ops becomes aware of the problem and can respond right away.
The Circuit Breaker Design Pattern
It’s easy to create reusable infrastructure to enable the circuit breaker design pattern within your own systems. This is how it works:
- Define a reusable CircuitBreaker class with Trip and Reset methods, and provide it an action to call when the circuit breaker is tripped.
- Use the CircuitBreaker to monitor the dependency upon which your system depends. For every single failure, trip the circuit breaker, which sets it in an armed state. This is like the beginning of an electrical surge.
- If a subsequent attempt succeeds within a specified time window (the current has subsided) then reset the breaker, and all is well.
- If the circuit breaker is not reset within the specified time, and exceptions continue to occur (the current continues to rise to unsafe levels) then the breaker will invoke the action you’ve provided. You can choose to fail fast (terminate the process) when the circuit breaker is tripped, or whatever other action you choose.
Sample Usage
In this example, ExternalServiceAdapter is a class that helps connect to some external dependency. There could be a web program that makes requests executing the DoStuff operation constantly. While executing, if the GetConnection method fails with an exception, it trips the circuit breaker. It resets the circuit breaker when the connection is re-established. But if the connection exceptions continue to occur, the circuit breaker will be tripped and the specified trip action will be executed, which in this case will fail fast.
using System;

public class ExternalServiceAdapter
{
    private CircuitBreaker circuitBreaker;

    public ExternalServiceAdapter()
    {
        circuitBreaker = new CircuitBreaker("CheckConnection", /* name of your circuit breaker */
            exception => /* The action to take when the circuit breaker is tripped */
            {
                Console.WriteLine("Circuit breaker tripped! Fail fast!");
                // Terminate the process, skipping any pending try/finally blocks or finalizers
                Environment.FailFast(exception.Message);
            },
            3, /* Max threshold before tripping the circuit breaker */
            TimeSpan.FromSeconds(2)); /* Time to wait between each try before attempting to trip the circuit breaker */
    }

    public void DoStuff()
    {
        var externalService = GetConnection();
        externalService.DoStuff();
    }

    ConnectionDependency GetConnection()
    {
        try
        {
            // ConnectionDependency stands in for your real dependency;
            // its constructor is expected to throw when the connection fails.
            var newConnection = new ConnectionDependency();
            circuitBreaker.Reset();
            return newConnection;
        }
        catch (Exception exception)
        {
            circuitBreaker.Trip(exception);
            throw;
        }
    }
}
Simple Implementation of the Circuit Breaker pattern.
using System;
using System.Threading;

public class CircuitBreaker
{
    public CircuitBreaker(string name, /* name of the operation */
        Action<Exception> tripAction, /* action to invoke when the circuit breaker is tripped */
        int maxTimesToRetry, /* number of times to retry before tripping the circuit breaker */
        TimeSpan delayBetweenRetries /* time to wait between each retry */)
    {
        this.name = name;
        this.tripAction = tripAction;
        this.maxTimesToRetry = maxTimesToRetry;
        this.delayBetweenRetries = delayBetweenRetries;
        // Create the timer stopped. It is started when the user trips the circuit breaker.
        timer = new Timer(CircuitBreakerTripped, null, Timeout.Infinite, Timeout.Infinite);
    }

    public void Reset()
    {
        // Clear the failure count and stop the timer.
        Interlocked.Exchange(ref failureCount, 0);
        timer.Change(Timeout.Infinite, Timeout.Infinite);
        Console.WriteLine("The circuit breaker for {0} is now disarmed", name);
    }

    public void Trip(Exception ex)
    {
        lastException = ex;
        var newValue = Interlocked.Increment(ref failureCount);
        if (newValue == 1)
        {
            // First failure: start the retry timer as a one-shot timer.
            timer.Change(delayBetweenRetries, TimeSpan.FromMilliseconds(-1));
            // Log that the circuit breaker is triggered.
            Console.WriteLine("The circuit breaker for {0} is now in the armed state", name);
        }
    }

    // Timer callback: runs once per delayBetweenRetries while the breaker is armed.
    void CircuitBreakerTripped(object state)
    {
        Console.WriteLine("Check to see if we need to trip the circuit breaker. Retry:{0}", Interlocked.Read(ref failureCount));
        if (Interlocked.Increment(ref failureCount) > maxTimesToRetry)
        {
            Console.WriteLine("The circuit breaker for {0} is now tripped. Calling specified action", name);
            tripAction(lastException);
            return;
        }
        // Not over the threshold yet: schedule the next check.
        timer.Change(delayBetweenRetries, TimeSpan.FromMilliseconds(-1));
    }

    readonly string name;
    readonly int maxTimesToRetry;
    long failureCount;
    readonly Action<Exception> tripAction;
    Exception lastException;
    readonly TimeSpan delayBetweenRetries;
    readonly Timer timer;
}
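It can help to work out when the trip action actually fires. Reading the implementation above, the first Trip call sets the failure count to 1 and starts the timer; each timer callback then increments the count and compares it with maxTimesToRetry. The sketch below models only that counting logic. `TripTiming` and its methods are hypothetical names introduced for illustration, and the model assumes Reset is never called and ignores timer jitter.

```csharp
using System;

// Hypothetical helper that models the counting logic of the CircuitBreaker above.
public static class TripTiming
{
    // Number of timer callbacks after the first Trip call before the trip action runs.
    // extraFailuresPerInterval: further Trip calls assumed to arrive between two callbacks.
    public static int TicksUntilTrip(int maxTimesToRetry, int extraFailuresPerInterval)
    {
        if (extraFailuresPerInterval < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(extraFailuresPerInterval));
        }

        long failureCount = 1; // the first Trip call arms the breaker
        var ticks = 0;
        while (true)
        {
            failureCount += extraFailuresPerInterval; // Trip calls during the wait
            ticks++;
            failureCount++;                           // the timer callback increments too
            if (failureCount > maxTimesToRetry)
            {
                return ticks;
            }
        }
    }

    // Approximate wall-clock time from the first Trip call to the trip action.
    public static TimeSpan TimeUntilTrip(int maxTimesToRetry, TimeSpan delayBetweenRetries, int extraFailuresPerInterval)
    {
        var ticks = TicksUntilTrip(maxTimesToRetry, extraFailuresPerInterval);
        return TimeSpan.FromTicks(delayBetweenRetries.Ticks * ticks);
    }
}
```

Under this model, the sample settings (a threshold of 3 and a 2 second delay) give roughly 6 seconds from a single failure to the trip action. If callers keep failing and calling Trip in the meantime, the count climbs faster and the action can run as early as the first callback, about 2 seconds in.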
Unit Tests for the CircuitBreaker
using System;
using System.Threading;
using NUnit.Framework;

[TestFixture]
public class CircuitBreakerTests
{
    [Test]
    public void When_the_circuit_breaker_is_tripped_the_trip_action_is_called_after_reaching_max_threshold()
    {
        bool circuitBreakerTripActionCalled = false;
        var connectionException = new Exception("Something bad happened.");
        var circuitBreaker = new CircuitBreaker("CheckServiceConnection", exception =>
        {
            Console.WriteLine("Circuit breaker tripped - fail fast");
            circuitBreakerTripActionCalled = true;
            // You would normally fail fast here in the action to facilitate the process shutdown by calling:
            // Environment.FailFast(connectionException.Message);
        }, 3, TimeSpan.FromSeconds(1));

        circuitBreaker.Trip(connectionException);
        // Wait long enough for the timer to exceed the threshold.
        Thread.Sleep(5000);

        Assert.IsTrue(circuitBreakerTripActionCalled);
    }

    [Test]
    public void When_the_circuit_breaker_is_reset_the_trip_action_is_not_called()
    {
        bool circuitBreakerTripActionCalled = false;
        var connectionException = new Exception("Something bad happened.");
        var circuitBreaker = new CircuitBreaker("CheckServiceConnection", exception =>
        {
            Console.WriteLine("Circuit breaker tripped - fail fast");
            circuitBreakerTripActionCalled = true;
            // You would normally fail fast here in the action to facilitate the process shutdown by calling:
            // Environment.FailFast(connectionException.Message);
        }, 3, TimeSpan.FromSeconds(2));

        circuitBreaker.Trip(connectionException);
        // Reset before the first timer check (2 seconds) can run.
        Thread.Sleep(1000);
        circuitBreaker.Reset();

        Assert.IsFalse(circuitBreakerTripActionCalled);
    }
}
The code example above uses Console.WriteLine. Replace it with your favorite logger.
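One possible way to make that swap, sketched under assumptions: route the messages through an `Action<string>` so any logging library can be plugged in. `BreakerLog` and its method names are hypothetical and not part of the CircuitBreaker class above; the message text is copied from it.

```csharp
using System;

// Hypothetical logging seam; pass any logger's write method to the constructor.
public class BreakerLog
{
    readonly Action<string> write;

    // Falls back to the console when no logger is supplied.
    public BreakerLog(Action<string> write = null)
    {
        if (write == null)
        {
            write = Console.WriteLine;
        }
        this.write = write;
    }

    public void Armed(string name)
    {
        write(string.Format("The circuit breaker for {0} is now in the armed state", name));
    }

    public void Disarmed(string name)
    {
        write(string.Format("The circuit breaker for {0} is now disarmed", name));
    }

    public void Tripped(string name)
    {
        write(string.Format("The circuit breaker for {0} is now tripped. Calling specified action", name));
    }
}
```

A CircuitBreaker could then hold a `BreakerLog` field and call `log.Armed(name)` where it currently calls Console.WriteLine. In a test, passing the `Add` method of a `List<string>` as the delegate captures the messages for assertions.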
Closing Thoughts
Circuit breakers are an essential part of the modern world, and arguably one of the most important safety devices ever invented. There’s always a good reason behind a blown fuse or a tripped circuit breaker.
Monitor your critical resources, and fail fast when they don’t respond. Make it easy for your Ops team to take corrective action.
If you’re interested in more patterns like this, Michael T. Nygard’s Release It! is a fantastic read.
Be safe. Don’t get zapped.