Any good system targeting the public or the enterprise these days must be built to expect the unexpected. No system is perfect, and at some point something will happen that renders a system inoperative: a fire, a hurricane, an earthquake, human error, and so on. Because there are so many ways that systems can fail, they need to be designed with the expectation that failure will occur.
There are two related but often confused topics in system architecture that mitigate failure: high availability (HA) and disaster recovery (DR). High availability, simply put, is the elimination of single points of failure, while disaster recovery is the process of returning a system to an operational state after it has been rendered inoperative. In essence, disaster recovery picks up where high availability fails, so let's start with HA.
As mentioned, high availability is about eliminating single points of failure, so it implies redundancy. There are three kinds of redundancy implemented in most systems: hardware, software, and environmental.
Hardware redundancy was one of the first ways that high availability was introduced into computing. Before most applications were connected to the internet, they served enterprises on a LAN. These servers didn't need the scale of modern applications, which may handle thousands of simultaneous connections with 24/7 demand. They did, however, supply business-critical data, so they needed hardware that was fault tolerant. Single points of failure were eliminated by manufacturers building servers that had:
Software redundancy soon followed suit. Application designers worked to ensure that applications themselves could tolerate failures in a system, be it hardware failure, configuration errors, or any number of other causes that could take down part of the software. A few ways this has been accomplished include:
With the rise of cloud computing, cloud providers have taken high availability to a whole new level, adding large-scale environmental redundancy with:
All three domains (hardware, software, and environmental) address the same basic problem: eliminating single points of failure. The result is service level agreements (SLAs) that limit unplanned downtime to less than 10 seconds in a given 24-hour period.
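To put that downtime budget in perspective, it is easy to convert it into an availability percentage. A small illustrative calculation (not tied to any specific provider's SLA):

```python
# Illustrative only: converting an unplanned-downtime budget into an
# availability percentage ("nines"). The 10-second figure is the one
# cited above, not a quote from a particular provider's SLA.

def availability(downtime_seconds: float, period_seconds: float) -> float:
    """Return availability as a percentage for the given period."""
    return 100.0 * (1.0 - downtime_seconds / period_seconds)

DAY = 24 * 60 * 60  # 86,400 seconds

# 10 seconds of unplanned downtime in a 24-hour period:
print(f"{availability(10, DAY):.3f}%")  # about 99.988% -- roughly "four nines"
```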
Disaster recovery picks up where high availability fails. It can be as simple as restoring from a backup, but it can also be quite complex, depending on two factors: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO).
A Recovery Time Objective is the maximum amount of time that a system can be down before it must be recovered to an operational state. For some systems, the RTO can be measured in hours or even days, but for mission-critical systems it is typically measured in seconds.
A Recovery Point Objective is the amount of data loss, measured in time, that is tolerable in a disaster. For some systems, losing a day's worth of data might be acceptable; for others, the tolerance might be mere minutes or seconds. The lengths of RTOs and RPOs have profound implications for how disaster recovery plans are implemented.
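The link between RPO and backup frequency can be sketched in a few lines. The reasoning: in the worst case, a disaster strikes just before the next backup runs, so the maximum data loss equals one full backup interval. The function below is a hypothetical illustration, not part of any backup tool:

```python
# Minimal sketch (names are illustrative): worst-case data loss for a
# periodic backup scheme is one full backup interval, so a backup
# schedule satisfies an RPO only if the interval fits within it.

def meets_rpo(backup_interval_minutes: float, rpo_minutes: float) -> bool:
    """Worst case, a disaster hits just before the next backup."""
    return backup_interval_minutes <= rpo_minutes

# Daily backups cannot satisfy a 15-minute RPO...
print(meets_rpo(backup_interval_minutes=24 * 60, rpo_minutes=15))  # False
# ...but shipping database logs every 5 minutes can.
print(meets_rpo(backup_interval_minutes=5, rpo_minutes=15))        # True
```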
Short RTOs and RPOs require that a system implement active data replication between primary and recovery systems (such as database log shipping) and maintain failover systems in a ready ("hot-hot") or near-ready ("hot-warm") state to take over in the event of a disaster. Likewise, the trigger for a disaster recovery failover is typically automated.
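The automated trigger mentioned above can be sketched as a health-check loop that promotes the standby after several consecutive failed probes. Everything here is a hypothetical placeholder; real deployments use purpose-built tooling (load balancers, cluster managers, orchestrators) rather than a hand-rolled loop:

```python
# Hedged sketch of an automated failover trigger for a hot-warm pair.
# check_primary / promote_standby are injected callables so the logic
# can be simulated; the threshold and interval are arbitrary examples.
import time

FAILURE_THRESHOLD = 3       # consecutive failed checks before failing over
CHECK_INTERVAL_SECONDS = 0  # e.g. 5 in a real monitor; 0 keeps the demo fast

def monitor(check_primary, promote_standby) -> int:
    """Fail over after FAILURE_THRESHOLD consecutive failed health checks.

    Returns the number of health checks performed before failover.
    """
    failures = 0
    checks = 0
    while True:
        checks += 1
        if check_primary():
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                promote_standby()  # short RTO: no human in the loop
                return checks
        time.sleep(CHECK_INTERVAL_SECONDS)

# Simulated probe: the primary responds twice, then goes down for good.
responses = iter([True, True, False, False, False])
events = []
monitor(lambda: next(responses), lambda: events.append("failover"))
print(events)  # ['failover'] after three consecutive failures
```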
For longer RTOs and RPOs, restoring systems from daily backups might be enough. These backups might cover application servers, databases, or both, and the restore process may be manual, automated, or a mix of the two. Whenever backups are used to restore systems to an operational state, the setup is typically referred to as a "hot-cold" configuration. In any case, recovering a hot-cold configuration takes significantly longer than hot-warm or hot-hot.
One of the biggest factors preventing organizations from implementing high availability and short RTOs and RPOs is cost. Where HA is concerned, more redundancy requires more resources, which translates into higher costs. Similarly, short RTOs and RPOs require that standby capacity be available to handle a failover, which also raises costs. There is always a balancing act between cost and downtime: for some applications the expense of HA and short RTOs and RPOs is not worth it, while for others it is necessary no matter what the cost.
Fundamentally, high availability and disaster recovery are aimed at the same problem: keeping systems in an operational state. The main difference is that HA handles problems while a system is running, while DR handles problems after a system fails. Regardless of how highly available a system is, though, any production application, no matter how trivial, needs at minimum some sort of disaster recovery plan in place.