System Resiliency Planning : Part 2 — Options to Support High Availability

Peter Aten
Jan 21, 2021

High Availability (HA) is a desirable but sometimes poorly understood concept in system resiliency design. This post explores options to increase HA for your system, particularly via cloud computing infrastructure.

Other posts in this series:
Part 1 — Setting Objectives
Part 3 — Data Replication
Part 4 — Example Scenarios and Summary

High Availability (HA) vs. Disaster Recovery (DR)

Are these the same thing? If I have one, do I need the other? What separates them?

HA is a widely used, non-specific term for systems and system components that mitigate one or more failure modes that would otherwise cause a DR situation, generally by adding redundancy to eliminate single points of failure.
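
To make the effect of redundancy concrete, here is a minimal sketch of the usual back-of-the-envelope calculation. It assumes that component failures are independent, which real systems only approximate, and the 99% figure and the function names are hypothetical values chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of a group of identical components where any one suffices.

    Assumes failures are independent (an idealization): the group is down
    only when every replica is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - component_availability) ** replicas


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per (non-leap) year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.99 is a hypothetical per-component availability, not a vendor figure.
    for replicas in (1, 2, 3):
        availability = redundant_availability(0.99, replicas)
        downtime = expected_downtime_hours(availability)
        print(f"{replicas} replica(s): {availability:.6f} available, "
              f"about {downtime:.2f} hours down per year")
```

Under these assumptions, a second replica of a 99% component raises the group to 99.99%. Failure modes shared by all replicas are not covered by this arithmetic, which is why a DR plan is still needed.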

Incorporating HA architectural techniques into your system design where cost-effective is a great idea. These techniques help you avoid disasters rather than react to them.

Understand, though, that every system with HA design/components can still experience a DR scenario. Never assume that a system billed as “High Availability” mitigates all your failure modes or that it eliminates the need for a DR plan. Understand your remaining failure modes and your recovery plan.

Cloud Computing — Availability Zone vs. Region vs. Multi-Region

Hosting applications in the cloud has many benefits, and options to improve system availability are prominent among them, including automatic upgrades and security patches (potentially without scheduled downtime). Two key concepts related to cloud resiliency and avoiding a single point of failure are availability zones (AZs) and regions. Configuration options typically include single AZ, multi-AZ (i.e., region), and multi-region.

Availability Zone

Think of an availability zone (AZ) as a data center, a physical location. For many cloud services, the default option is to deploy in a single AZ. An AZ provides many benefits related to resiliency, including redundant power and network. Additionally, many cloud products offer some redundancy within an AZ, so any application taking advantage of rudimentary cloud design concepts will realize increased, cost-effective resiliency and scalability. AZ outages are more common than regional outages, and most stem from network failures or from vendor errors in managing the AZ.

Region

A cloud region is a collection of geographically proximate AZs, connected by a proprietary, low-latency network. Think of a region as a collection of physical data centers that can communicate very quickly with each other, but with a fence around the region that helps keep a failure in one region from spreading to another. The fast network inside the fence enables replication of physical assets and data across multiple AZs to avoid single points of failure.

Regions are physically isolated from each other in terms of the network in order to reduce the odds of two regions suffering an outage due to a common cause. Updates to cloud vendor products are typically rolled out region-by-region for the same reason.

This extra redundancy results in greater resiliency than a single AZ. Even so, region-wide failures can (and do) occur due to defects in regional product updates, as well as regional network failures. These outages are relatively uncommon but can still be quite impactful, as seen in the 17.5-hour AWS Kinesis outage in their US-East-1 region on November 25, 2020. (Interestingly, the overwhelming majority of significant AWS service interruptions have occurred in their US-East-1 region. If that is your default choice for AWS regions, it may be worth reconsidering.)
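
The limit that a region-wide failure places on a multi-AZ design can be sketched with a simple model. This is an illustration under stated assumptions, not a vendor formula: AZ failures are treated as independent, everything shared across the region (such as a regional product update or the regional network) is lumped into one number, and all figures are hypothetical.

```python
def multi_az_availability(
    az_availability: float, az_count: int, region_availability: float
) -> float:
    """Availability of a service spread across several AZs in one region.

    Simplified model: the service is up when the region-wide shared
    dependencies are up AND at least one AZ is up. AZ failures are
    assumed independent of each other and of the regional failure mode.
    """
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    for value in (az_availability, region_availability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("availabilities must be between 0 and 1")
    at_least_one_az_up = 1.0 - (1.0 - az_availability) ** az_count
    return region_availability * at_least_one_az_up


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    for az_count in (1, 2, 3):
        result = multi_az_availability(0.999, az_count, 0.9995)
        print(f"{az_count} AZ(s): {result:.6f}")
```

In this model, adding AZs moves the result toward the regional figure but never past it. Improving on that ceiling requires a second region.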

Multi-Region

Regions are designed to leverage isolation for the specific purpose of resiliency. However, since region-wide failures do occur, this level of resiliency is insufficient for some system requirements. A multi-region design adds yet another layer of redundancy. Keeping in mind the metaphor of a region as a collection of data centers inside a fence, in order to communicate between regions, you need to jump that fence. This takes time and effort. In system design terms, this means complexity and latency, and those have real or potential costs in user experience, cloud provider expense, and development team time that I would not underestimate. Whether the benefit of improved resiliency justifies that cost is a legitimate question, and the answer depends on the needs of your product.

Some cloud vendors support a limited selection of seamless multi-region solutions that resolve the implementation complexity but still incur the latency. It’s also possible to implement a solution manually, in which case the development team incurs both costs. In each case, the intentional boundary created around a region (low latency within, isolation without) creates challenges with state management and cache synchronization, in addition to latency concerns. Multi-region outages can still occur, depending on the nature of the issue. Examples include Microsoft Azure’s 3-hour global outage on September 28, 2020 and Google’s 45-minute global outage on December 14, 2020, both due to issues with their authentication services.
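
As a rough illustration of the manual approach, the sketch below tries each region in order and returns the first successful response. The region names, the handler functions, and the `AllRegionsFailed` exception are all hypothetical. A real implementation would also need timeouts, health checks, and a strategy for the state and cache synchronization challenges described above.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when every configured region fails to serve the request."""


def call_with_failover(
    regions: Sequence[Tuple[str, Callable[[], str]]],
) -> Tuple[str, str]:
    """Try each (region_name, handler) pair in order.

    Returns the name of the region that answered and its response.
    Raises AllRegionsFailed if no region succeeds.
    """
    errors: List[str] = []
    for name, handler in regions:
        try:
            return name, handler()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical stand-ins for real regional endpoints.
def failing_primary() -> str:
    raise ConnectionError("regional network failure")


def healthy_secondary() -> str:
    return "ok"


if __name__ == "__main__":
    served_by, response = call_with_failover(
        [("primary-region", failing_primary), ("secondary-region", healthy_secondary)]
    )
    print(f"served by {served_by}: {response}")
```

If a dependency shared by every region fails, as with a global authentication service, every handler raises and the caller gets `AllRegionsFailed`. Failover across regions does not help in that case.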

Summary

This post has explored high-level concepts related to keeping your application running. In the next post, we’ll explore options for protecting your data.

Next post: Part 3 — Data Replication
