High Availability vs Fault Tolerance vs Disaster Recovery




These three concepts are central to the Reliability pillar of the AWS Well-Architected Framework. I think the terms are often misunderstood when designing an architecture, and they can easily be confused with one another.

Let us start with High Availability. Whenever we hear the term, we tend to think of an architecture that is highly available, scalable and cannot fail. The first two points might be right, but the last part, "cannot fail", is where we need to pay a bit more attention. Let's see the difference.

High Availability: A highly available architecture is actually prone to failures. What the term "High Availability" implies is an architecture that aims to provide maximum uptime, meaning that when a failure happens, the system is able to recover as quickly as possible.

You will hear terms like 99.9% availability or 99.999% ("five 9's") availability.

But that still means 99.9% leaves room for about 8.77 hours of downtime per year, and 99.999% still leaves room for about 5.26 minutes of downtime per year.
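To see where these numbers come from, here is a minimal Python sketch of the arithmetic. It assumes an average year of 365.25 days (8,766 hours), which is what the figures above imply; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, assuming an average year with leap days


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.2f} minutes) per year")
```

For 99.9% this works out to roughly 8.77 hours, and for 99.999% to roughly 5.26 minutes.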

It gets very tricky to bring these HA systems back online, especially if, let's say, your SLA indicates 99.999%. That means you have a total of 5.26 minutes across the whole year to identify any issue and deploy the solution so that things are up and running again.

Let's say you have 2 servers running on site: a main server that the users are logged into, and another one as a backup. This is potentially a highly available system. But if the main server goes down, it still takes time to get the backup server up and running.

Going even further, let's say we have designed this to be an automated process where all the users are linked to both servers at all times, and the backup server comes online as soon as the main one goes out. You have to remember, it still takes time, maybe a couple of minutes, for the backup server to boot and start functioning.
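As a rough sketch of that main/backup scenario, the Python below adds up the time users would be offline during a failover. The `Server` class, the `failover_downtime` function and the timings (10 seconds to detect the failure, 120 seconds to boot) are all assumptions made up for this example, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    boot_seconds: int      # time needed before the server can take traffic
    running: bool = False  # a cold standby starts powered off


def failover_downtime(detect_seconds: int, standby: Server) -> int:
    """Seconds users are offline when the main server fails and the standby takes over."""
    downtime = detect_seconds  # time to notice the main server is gone
    if not standby.running:
        downtime += standby.boot_seconds  # a cold standby has to boot first
        standby.running = True
    return downtime


backup = Server("backup", boot_seconds=120)
print(failover_downtime(detect_seconds=10, standby=backup))  # prints 130
```

Even if the backup is already running, the detection time never drops to zero, so some downtime always remains.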

The key point is that there is still some downtime. A highly available system is not immune to failures. It is more like a spare tire on your 4x4: if you carry it, you can replace the flat and not get stuck in the middle of nowhere forever. You will still be stuck while you're replacing the tire, but you will be up and running again in a short period.

That is what a highly available architecture aims to achieve. If it ever comes to drowning, it makes sure that you are back on the surface as soon as possible. This is otherwise known as "Maximum Uptime": the main goal is to minimize user disruption as much as possible.

A highly available network or infrastructure can be redundant in nature, like multiple train tracks ready to be switched to when the main track has an issue. A backup generator is a good example. When the lights go off, it takes a few seconds or minutes, depending on the generator, for it to kick in. You still have that small amount of time where you are in a power outage.

Now let's move on to what a Fault Tolerant architecture aims to achieve.


Fault Tolerance: This is where we often get confused and mix up Fault Tolerance with High Availability. A Fault Tolerant system can be very similar to a Highly Available system, but it has one key differentiator.

Fault Tolerant systems are designed to continue to function without any disruption whatsoever, even if one or more of their components fail. This means this type of architecture cannot have any downtime. It needs to continue to function even with the failed components.

So as solutions architects, we have to be extremely careful in designing the infrastructure for the customer, and we should know exactly whether a Highly Available system or a Fault Tolerant system is needed based on their requirements. A Fault Tolerant architecture can get fairly expensive because of the complexity involved, and this additional labor and cost should be justified.

Let's take the example of an Air Traffic Control tower, where the servers manage the incoming radar data and provide it to all the screens in the control tower for the Air Traffic Controllers to access.

An ATC tower uses multiple radars, such as an Air Surveillance Radar (ASR) or a Precision Approach Radar (PAR). If these log their data points to a server, which then translates them and presents the information to all the controllers in real time, a High Availability approach would not work here because there is no room for any downtime.

Imagine the situation with an HA architecture in this case. If there are backup servers and the main server fails, the time it takes to switch from the main server to the backups could mean a life-and-death situation for the pilots in the air, because during this switchover period the systems are offline for a few seconds or minutes.

Now, let's say both these radars are connected to three servers simultaneously and all of them are continuously functioning in parallel, all the time. This could be a good example of a Fault Tolerant system. If one server goes down, the other two are already functioning. It is highly unlikely for all three servers to go down at the same time, especially if you have architected the setup so that they are not on the same premises, are located in different geographical locations, and maybe are even on different networks. Yet they function in parallel, and the failure of one causes no disruption in service at all.
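The parallel setup can be sketched in a few lines of Python. This is only an illustration under assumed names: `RadarServer`, `broadcast` and `read_latest` are hypothetical, and the servers are plain in-memory objects rather than real machines in different locations.

```python
class RadarServer:
    """A toy in-memory stand-in for one server receiving radar data."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.tracks = []

    def ingest(self, track):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.tracks.append(track)

    def latest(self):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return self.tracks[-1]


def broadcast(servers, track):
    """Send the same track to every server; return how many accepted it."""
    accepted = 0
    for server in servers:
        try:
            server.ingest(track)
            accepted += 1
        except ConnectionError:
            continue  # the healthy servers keep working
    if accepted == 0:
        raise RuntimeError("no server accepted the track")
    return accepted


def read_latest(servers):
    """Return the newest track from the first server that answers."""
    for server in servers:
        try:
            return server.latest()
        except ConnectionError:
            continue
    raise RuntimeError("all servers are down")


servers = [RadarServer("site-a"), RadarServer("site-b"), RadarServer("site-c")]
broadcast(servers, "flight 101 at 3000 ft")
servers[0].healthy = False                    # one server fails
broadcast(servers, "flight 101 at 2500 ft")   # the other two still accept it
print(read_latest(servers))                   # prints "flight 101 at 2500 ft"
```

Because every server already holds the live data, nothing has to boot or switch over when one of them fails, which is the difference from the main/backup setup described earlier.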

A very good analogy for Fault Tolerance is two or more engines on an aircraft. If one engine fails, the remaining engines are still more than capable of landing the passengers safely on the ground.

Now let's tackle the last point.

Disaster Recovery: Simply put, it is a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.

OK, that is a textbook definition. Think of disaster recovery as the last card in the deck. If the Highly Available or Fault Tolerant systems fail, then DR is the checklist and set of policies used to activate the backup systems.

Think of it as one of those "Break Glass to Push Alarm" buttons in buildings, or an unfortunate "Mayday" call from a pilot and the Mayday checklist that goes with it. These systems are designed to be used only in extreme emergencies.

DR policies are preplanned, standard procedures put in place in advance in case of a disaster. Without a proper set of instructions in place, there would be panic and confusion, and the recovery time would be much higher than it needs to be.

A good DR infrastructure is thoroughly backed up and should be spread out geographically. And just like any other drill, a DR simulation should be part of the policy and should occur at frequent intervals, so that the onsite engineers are always up to date with the recovery steps when a disaster actually takes place.

So let's sum this up:

A High Availability System is:
  • Designed for Maximum Uptime
  • Not failure-proof: it leaves room for some downtime.
  • Designed to recover as soon as possible.
  • Comparatively cost-optimized, because it needs less redundancy than a Fault Tolerant system.

A Fault Tolerant System is:
  • Designed to maintain performance even if one or more components of the infrastructure fail.
  • Highly redundant in nature.
  • Fairly complex to architect.
  • Potentially very expensive to implement.

A Disaster Recovery System is:
  • A backup system that is maintained at all times, to be activated in the event of a natural or human-induced disaster.
  • Extremely well planned and documented.
  • Simulated periodically to ensure all the systems are functioning as they should and the engineers have the knowledge and are ready if the event occurs.
  • Only to be activated in emergency situations.

WRAP UP:

I know this is mostly theory without much practical information, and it might get a bit boring, but if you have read this far you already have a much better understanding of this topic than many architects in the field.

I do intend to create some infrastructure designs outlining all three approaches sometime in the future. Just remember: if HA is the spare tire on a 4x4 and FT is the redundant engines and hydraulic systems on an aircraft, then DR is the pilot's ejection seat.

Thanks to AWS and its automated features, most of the time the remedies are applied without any human intervention. But it's still good to know all the points listed above.

Also, all three of these systems are driven by the customer's needs, and it is our job as solutions architects to design the most optimized solution according to their business needs.

Until the next one.


 
