System Resiliency Planning : Part 1 — Setting Objectives
Imagine that a system you’re responsible for has an outage. That’s bad enough, but now imagine that when the system comes back online, a bunch of data is missing, lost forever. Feeling nauseous?
System failures are sudden, infrequent, and inevitable, and without some advance preparation they can be extremely painful. Some forethought can go a long way toward avoiding, or recovering gracefully from, a potential systems disaster.
It’s common for product owners and development teams to focus first on delivering functional value, and last (if ever, in many cases) on resiliency and disaster recovery planning. Those priorities are understandable, but since business stakeholders generally have an unspoken requirement that the system stays running and the data stays safe, most projects would benefit from investing some time in system resiliency planning from the start.
The purpose of this series of articles is to share some key terms and concepts related to system resiliency planning, so that business stakeholders and technical teams can share a common language to facilitate discussion around this essential but often overlooked topic.
Failure Modes
Systems are composed of hardware and software that can fail for myriad reasons. Being familiar with failure modes helps in understanding which scenarios a specific mitigation step addresses. Examples include:
- Hardware failure, like when a hard drive or server fails
- A network failure, which prevents components of the system from communicating with each other and/or users
- A power outage in the data center
- Software defects introduced into the application or into the operating system of any of the hardware supporting it
- Defective queries, which may accidentally corrupt or delete data
- A security breach that deletes or corrupts data, or makes it inaccessible
High-level system resiliency planning includes the following:
- Prevention — Can the system design help mitigate the risk of various failure modes?
- Detection — Is the system properly instrumented to identify failure modes when they occur, allowing for faster remediation?
- Correction — What is the plan to recover system health when a failure mode occurs?
Recovery Time Objective and Recovery Point Objective
The best place to start a system resiliency planning discussion is by framing it in terms that are understood by all stakeholders. Let’s first establish the guardrails for an acceptable solution before we explore technical solutions and their associated trade-offs.
RTO
RTO stands for Recovery Time Objective, which is another way of saying, “how long can we afford for the system to be down?” The reflex answer to this may be “never,” but we have to accept that zero downtime is an unrealistic objective. In reality, depending on the nature of the system and the business costs of system unavailability, the answer can range from seconds to days. When establishing an RTO, it’s important to consider factors like:
- Customer impact — What impact will there be to customer value if we are unable to deliver? For example, does an outage present a health and safety risk? How many users are dependent on the system and will suddenly be idle if it goes down? Does this impact the ability to manufacture goods, process payroll, etc.?
- Revenue impact — Will sales be lost if the system is down?
- Reputational impact
- What other systems depend on this system? Our RTO needs to align with theirs.
- Are workarounds possible while the system is down?
- What is the worst-case scenario in terms of timing? Month end? The busiest hour of the week?
RTO covers the whole incident, not just the fix: the time it takes to notify/alert the right people, for them to respond, and to implement and complete the recovery plan.
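One way to make that concrete is to treat the RTO as a budget that the recovery phases must fit into. The sketch below is illustrative only: the phase names mirror the sentence above, and every duration, along with the one-hour objective, is a made-up placeholder rather than a recommendation.

```python
from datetime import timedelta

# Hypothetical estimates for one failure scenario (placeholder values).
phases = {
    "notify_alert": timedelta(minutes=5),
    "respond": timedelta(minutes=15),
    "implement": timedelta(minutes=30),
    "complete": timedelta(minutes=10),
}

rto = timedelta(hours=1)  # assumed objective agreed with stakeholders


def total_recovery_time(phase_estimates):
    """Sum the estimated duration of every recovery phase."""
    return sum(phase_estimates.values(), timedelta())


def meets_rto(phase_estimates, objective):
    """Return True when the end-to-end estimate fits within the RTO."""
    return total_recovery_time(phase_estimates) <= objective


print(total_recovery_time(phases))  # 1:00:00
print(meets_rto(phases, rto))       # True
```

Laying the phases out this way makes it easier to see where the budget goes; in this invented example, half of it is spent before the recovery plan even starts.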
RPO
RPO stands for Recovery Point Objective, which is another way of saying, “how much data are we willing to lose?” The technical definition of RPO relates to how old your data backup can be. For example, if you back up your data once a day, then your RPO is effectively 24 hours, since a failure just before the next backup could cost you up to 24 hours of data.
I think this concept is a little trickier than that because it’s hard to imagine scenarios where it’s perfectly acceptable to permanently lose order/sales/purchase or other data as if it never happened. Your Internal Audit department might also have trouble with that concept when you’re operating any SOX-compliant system. For the purposes of establishing a guardrail, the answer may well be zero or near-zero data loss. But as we’ll discuss next, data recovery may consist of both data restoration from backup and other, perhaps manual, measures.
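The daily-backup example can be written as a small calculation. This is a minimal sketch with hypothetical timestamps: it assumes backups complete on a fixed interval and that everything written after the last completed backup is at risk.

```python
from datetime import datetime, timedelta


def data_at_risk(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Span of time whose writes are not covered by the most recent backup."""
    if failure_time < last_backup:
        raise ValueError("failure_time must not precede last_backup")
    return failure_time - last_backup


def worst_case_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure strikes just before the next backup would run."""
    return backup_interval


# Hypothetical schedule: one backup per day at 02:00.
last_backup = datetime(2024, 1, 10, 2, 0)
failure_time = datetime(2024, 1, 10, 17, 30)

print(data_at_risk(last_backup, failure_time))  # 15:30:00
print(worst_case_loss(timedelta(hours=24)))     # 1 day, 0:00:00
```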
A more realistic way to define RPO may be, “how current do I need the backup of my system data to be, so that I can bring my system back online within the RTO?” The biggest consideration here is whether and how to recover the data processed by your system since the last backup.
- How will we determine what data is missing?
- If the source of the data was a customer order form, can those be re-entered into the system?
- If the source of the data was an upstream system, can that system replay the data to ours?
- If you have access to system logs, can you ingest and process those to rebuild the lost data?
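The first and third questions above can be sketched in a few lines, assuming every record carries a unique identifier and the upstream system can list what it sent. The identifiers and the `replay` callback are hypothetical stand-ins for whatever your systems actually expose.

```python
from typing import Callable, Iterable, List


def find_missing_ids(sent_ids: Iterable[str], restored_ids: Iterable[str]) -> List[str]:
    """Return IDs the upstream system sent that are absent after the restore.

    Upstream order is preserved so the records can be replayed in sequence.
    """
    restored = set(restored_ids)
    return [record_id for record_id in sent_ids if record_id not in restored]


def replay_missing(missing_ids: Iterable[str], replay: Callable[[str], None]) -> int:
    """Ask the upstream system (via the supplied callback) to resend each record."""
    count = 0
    for record_id in missing_ids:
        replay(record_id)
        count += 1
    return count


# Hypothetical data: the backup was taken after order 102 was processed.
sent = ["order-101", "order-102", "order-103", "order-104"]
restored = ["order-101", "order-102"]

missing = find_missing_ids(sent, restored)
replayed = []  # stands in for the upstream system's resend mechanism
replay_missing(missing, replayed.append)
print(missing)  # ['order-103', 'order-104']
```

If no upstream list exists, the same comparison could in principle be run against identifiers extracted from system logs, provided the logs record them.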
RTO and RPO are interrelated
How long will it take to restore data? This may be the factor that most influences the ability to meet the RTO.
- How long will it take to restore a backup of the data?
- How long will it take to then recover any missing data since the last backup?
- Is the recovery process constrained only by system resources, or is it a largely manual process that is constrained by human capacity?
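A back-of-the-envelope estimate can show how much the last question matters. The sketch below assumes restore time scales linearly with backup size and that missing records are re-created at a steady rate; all sizes, rates, and the four-hour RTO are invented for illustration.

```python
from datetime import timedelta


def estimate_recovery_time(
    backup_size_gb: float,
    restore_gb_per_hour: float,
    missing_records: int,
    records_per_hour: float,
) -> timedelta:
    """Estimate time to restore the backup and then recover data written since."""
    restore = timedelta(hours=backup_size_gb / restore_gb_per_hour)
    catch_up = timedelta(hours=missing_records / records_per_hour)
    return restore + catch_up


rto = timedelta(hours=4)  # assumed objective

# Same backup and same gap, recovered two different ways (hypothetical rates).
automated = estimate_recovery_time(500, 250, 6000, 12000)  # system replays records
manual = estimate_recovery_time(500, 250, 6000, 200)       # staff re-key records

print(automated, automated <= rto)  # 2:30:00 True
print(manual, manual <= rto)        # 1 day, 8:00:00 False
```

With these made-up numbers the backup restore is identical in both cases; it is the manual catch-up work that pushes recovery past the objective.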
Alternatively, can you bring the system back online before all data is restored?
- Is there any risk in processing data out of order, if recovered data is added after the system begins accepting new transactions?
- Is there any benefit to having a read-only copy of incomplete data available to consumers while the process of recovering lost data continues?
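One possible safeguard against the out-of-order risk, offered here as an assumption rather than something this article prescribes, is to let the newest version of a record win. The sketch assumes each record carries a trustworthy timestamp; the in-memory `store` and the function name are hypothetical.

```python
from typing import Dict, Tuple

Store = Dict[str, Tuple[str, int]]  # key -> (value, timestamp)


def apply_if_newer(store: Store, key: str, value: str, timestamp: int) -> bool:
    """Write the record unless the store already holds an equal or newer version."""
    current = store.get(key)
    if current is not None and current[1] >= timestamp:
        return False  # recovered data is stale; keep what is there
    store[key] = (value, timestamp)
    return True


store: Store = {}

# The system is back online and accepts a new transaction first...
apply_if_newer(store, "order-1", "shipped", 200)
# ...then an older, recovered version of the same record arrives.
accepted = apply_if_newer(store, "order-1", "pending", 100)

print(accepted)          # False
print(store["order-1"])  # ('shipped', 200)
```

Whether a rule like this is safe depends on the data. Records that must be applied in strict sequence would need a different approach, such as holding new transactions until recovery completes.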
Summary
RTO and RPO are great tools to facilitate a constructive conversation about how resilient a particular system needs to be. To better understand options for achieving that resiliency, check out the other posts in this series:
Part 2 — Options to Support High Availability
Part 3 — Data Replication
Part 4 — Example Scenarios and Summary