System Resiliency Planning : Part 4 — Example Scenarios and Summary

Peter Aten
Jan 21, 2021

Building on prior posts in this series, let’s explore how to apply what we’ve discussed using some hypothetical examples.

Part 1 — Setting Objectives
Part 2 — Options to Support High Availability
Part 3 — Data Replication

Resiliency/DR Scenarios

1) Product Catalog System

This system has the following characteristics:

  • Slow rate of data change — there are few users and data is updated ad hoc.
  • Data is useful for a long time — these products have a long lifespan.
  • Product updates are usually made via spreadsheet upload, so there is a record of changes that can be reprocessed.
  • Downstream systems cache the product catalog in an intentional effort to decouple systems.

What resiliency/DR strategy might this suggest?

  • Since data updates are relatively infrequent and downstream systems cache the catalog, a short RTO is not essential.
  • Since data changes slowly, and any data lost since the last backup can be easily restored by re-uploading one or more spreadsheets, a short RPO is not essential.
  • Here, an RTO/RPO of 24 hours or even more may be sufficient. There is probably no need to wake people up in the middle of the night when the system fails.
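
The replay step described above can be sketched as follows. This is a minimal illustration, not the catalog system's actual tooling: the `Upload` record, the `uploads_to_replay` helper, the file names, and the timestamps are all hypothetical. It assumes every upload is logged with a timestamp and that replaying uploads in their original order reproduces the lost changes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Upload:
    """One logged spreadsheet upload (hypothetical record shape)."""
    filename: str
    uploaded_at: datetime


def uploads_to_replay(uploads, last_backup_at):
    """Return the uploads made after the last backup, oldest first."""
    missed = [u for u in uploads if u.uploaded_at > last_backup_at]
    return sorted(missed, key=lambda u: u.uploaded_at)


# Hypothetical upload log and backup time.
history = [
    Upload("spring_catalog.xlsx", datetime(2021, 1, 18, 9, 0)),
    Upload("price_fix.xlsx", datetime(2021, 1, 20, 15, 30)),
    Upload("new_products.xlsx", datetime(2021, 1, 20, 11, 0)),
]
last_backup = datetime(2021, 1, 20, 0, 0)

# After restoring the backup, these files would be re-uploaded in order.
for upload in uploads_to_replay(history, last_backup):
    print(upload.filename)
```

If the list of uploads to replay stays this short, a long RPO costs little, which is the point of the scenario.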

2) Reporting System

This system has the following characteristics:

  • The effective system-of-record for enterprise customer reporting, where the report is the product the customer is paying for.
  • Rapid data change with inputs from multiple transactional systems.
  • Data is useful for a long time — analytical reporting as well as customer reporting will leverage it for years.
  • This system provides data to multiple customer-reporting channels and thus is integral to the customer value chain, supporting revenue of hundreds of millions of dollars.

What resiliency/DR strategy might this suggest?

  • The data is important enough to the customer value chain that a short RTO is likely appropriate, so restoring from a data backup would probably take too long to be feasible.
  • The data is important enough to the customer value chain that a lengthy regional outage is unacceptable, so a multi-region solution with data replication is required.
  • Given how the data is used, a two-stage RTO may be appropriate here: recover quickly with a read-only copy of the data, then commit to a failover with full functionality, but with potential data loss and subsequent data recovery, after a longer period.
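
One way to write down a two-stage RTO is as an ordered list of commitments. The sketch below is only an illustration: the `RecoveryStage` type, the `committed_level` helper, and the 30-minute and 8-hour targets are assumptions made for the example, not figures from the scenario.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryStage:
    """One stage of a staged recovery commitment (hypothetical shape)."""
    name: str
    rto: timedelta   # maximum time after the outage starts to reach this stage
    writable: bool   # whether the system accepts new data at this stage


# Hypothetical targets for the reporting system.
STAGES = [
    RecoveryStage("read-only", timedelta(minutes=30), writable=False),
    RecoveryStage("full failover", timedelta(hours=8), writable=True),
]


def committed_level(elapsed):
    """Return the service level promised `elapsed` after an outage begins."""
    level = "unavailable"
    for stage in sorted(STAGES, key=lambda s: s.rto):
        if elapsed >= stage.rto:
            level = stage.name
    return level


for elapsed in (timedelta(minutes=10), timedelta(hours=1), timedelta(hours=9)):
    print(elapsed, "->", committed_level(elapsed))
```

Writing the stages down this way makes it explicit that customers can read reports well before the system is able to accept new data again.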

3) Transactional System

This system has the following characteristics:

  • Supports diagnostic laboratory testing. End-to-end transactions on a work item take one hour to three weeks. 90% of transactions take less than 4 hours.
  • All order data arrives either on paper forms or electronically. Even electronic orders include a paper form that can be used to re-enter missing data.
  • All result data is sourced from physical laboratory instruments.
  • All data is broadcast to downstream systems as it occurs.
  • Transaction processing is largely dependent on lab personnel who work in shifts, as well as instrument capacity to process samples.
  • Lab staff have very limited capacity to reprocess orders. Since labs are staffed to handle a normal volume of orders, an extreme DR scenario has the potential to result in a backlog of orders, which could take days or even weeks to resolve.
  • Patient samples degrade with time, so processing can’t be delayed for days. Samples are also depleted after they are processed, so re-processing samples due to lost data may not be feasible.
  • The system is integral to the customer value chain, supporting revenue of hundreds of millions of dollars.
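
The backlog concern above can be made concrete with some rough arithmetic. The sketch below uses made-up numbers (100 orders arriving per hour, capacity for 105, an 8-hour outage) and a hypothetical `backlog_drain_hours` helper. It assumes orders keep arriving at the normal rate, nothing is processed during the outage, and hours are hours of lab operation.

```python
def backlog_drain_hours(outage_hours, arrivals_per_hour, capacity_per_hour):
    """Hours of lab operation needed to clear the backlog from an outage.

    During the outage the backlog grows by `arrivals_per_hour` each hour.
    Afterwards it shrinks only by the spare capacity, because new orders
    keep arriving while the lab catches up.
    """
    spare = capacity_per_hour - arrivals_per_hour
    if spare <= 0:
        raise ValueError("no spare capacity: the backlog never clears")
    backlog = outage_hours * arrivals_per_hour
    return backlog / spare


# Hypothetical figures: a lab staffed with about 5% headroom.
hours = backlog_drain_hours(outage_hours=8, arrivals_per_hour=100, capacity_per_hour=105)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
# prints "160 hours, about 6.7 days"
```

With little headroom, even a one-shift outage takes days to work off, which is why the recovery time matters so much for this system.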

What resiliency/DR strategy might this suggest?

  • The business value of the system, plus the external and internal challenges of recovering from a DR scenario, lead the business to set an RTO of under 1 hour and an RPO as close to zero as possible.
  • Data backups don’t support the RTO or RPO, so data replication is required. Additionally, backups have limited value given the short useful lifespan of transactional data.
  • Supporting the RTO requirement suggests a multi-region design.
  • The combined RTO/RPO requirement suggests using synchronous, multi-region data replication, which also supports immediate failover and minimal disruption to lab staff and customers.
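
The reasoning used across the three scenarios can be summarized as a rough decision rule. This is a simplified sketch under stated assumptions, not a definitive method: the `suggest_strategy` function, the `NEAR_ZERO_RPO` threshold, and the restore-time and backup-interval figures are hypothetical, and a real decision would also weigh cost and business input.

```python
from datetime import timedelta

# Hypothetical cutoff for treating an RPO as "as close to zero as possible".
NEAR_ZERO_RPO = timedelta(minutes=1)


def suggest_strategy(rto, rpo, restore_time, backup_interval):
    """Suggest a coarse DR approach from the stated objectives.

    restore_time:    estimated time to restore service from a backup
    backup_interval: time between backups, i.e. worst-case loss from backups
    """
    if restore_time <= rto and backup_interval <= rpo:
        return "restore from backup"
    if rpo <= NEAR_ZERO_RPO:
        return "synchronous multi-region replication"
    return "asynchronous multi-region replication"


# Assumed for all three systems: daily backups that take 4 hours to restore.
restore, interval = timedelta(hours=4), timedelta(hours=24)

catalog = suggest_strategy(timedelta(hours=24), timedelta(hours=24), restore, interval)
reporting = suggest_strategy(timedelta(hours=1), timedelta(minutes=15), restore, interval)
transactional = suggest_strategy(timedelta(hours=1), timedelta(0), restore, interval)

print(catalog)
print(reporting)
print(transactional)
```

The asynchronous result for the reporting system is an assumption of this sketch, since the scenario says only that replication is required.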

Recovery

Truly understanding your DR playbook requires executing it in a practice situation to confirm that your RTO and RPO can realistically be met. Here again, cloud computing helps by making it easier to provision an appropriate environment for that practice.

  • Use a non-production environment that is identical to production. Leveraging infrastructure as code is key to easily standing up an identical environment.
  • Put a worst-case load on the non-production test environment.
  • Execute a worst-case failure mode, like the entire regional network failing. Consider executing other failure modes as well.
  • Understand what it takes to fail over with your data intact.
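
After a practice run, the measured results can be compared against the objectives. The sketch below is one possible way to do that. The `DrillResult` fields, the `evaluate_drill` helper, and the sample timestamps and targets are hypothetical, and it assumes you can record when the failure was injected, when service came back, and the timestamp of the newest write that survived the failover.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    """Timestamps captured during a DR drill (hypothetical record shape)."""
    failure_started: datetime
    service_restored: datetime
    last_recovered_write: datetime  # newest record present after failover


def evaluate_drill(result, rto, rpo):
    """Compare measured recovery time and data loss against RTO and RPO."""
    recovery_time = result.service_restored - result.failure_started
    # Writes made between the last recovered write and the failure are lost.
    loss_window = max(result.failure_started - result.last_recovered_write,
                      timedelta(0))
    return {
        "recovery_time": recovery_time,
        "data_loss_window": loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": loss_window <= rpo,
    }


# Hypothetical drill: 45 minutes to recover, 30 seconds of writes lost.
drill = DrillResult(
    failure_started=datetime(2021, 1, 21, 10, 0, 0),
    service_restored=datetime(2021, 1, 21, 10, 45, 0),
    last_recovered_write=datetime(2021, 1, 21, 9, 59, 30),
)
report = evaluate_drill(drill, rto=timedelta(hours=1), rpo=timedelta(seconds=10))
print(report["rto_met"], report["rpo_met"])
```

In this made-up run the RTO is met but the RPO is not, which is the kind of gap a drill is meant to surface before a real failure does.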

Summary

Lighthouses don’t prevent storms, but they are tools put in place in advance to help us navigate storms when they inevitably occur. Likewise, system failures are going to happen, and we can’t know how quickly we can recover from them unless we invest in creating and practicing a plan ahead of time.

That plan needs to dig deeper than just “are we doing backups?” As I hope I’ve demonstrated, one area of significant complexity is dealing with potential data loss.

Hopefully, the material above provides useful context that will help project teams begin to engage in disaster avoidance and recovery planning. The earlier in the project you can begin planning for system resiliency, the better. It should not be a struggle to allocate time to this once the team recognizes that system resiliency is indeed a business priority.
