Celso Marciano
Building VTS
Mar 27, 2024


The lifeblood of any modern business is its data and applications. Ensuring their availability, integrity, and performance is paramount. A vital aspect of this is a disaster recovery (DR) strategy that provides a safety net for unforeseen circumstances.

At VTS, we host our commercial real estate platform on AWS, and the infrastructure is architected with high availability and resiliency in mind. VTS Lease and VTS Activate products are considered “P0” (pronounced P-zero) services.

The term “P0” is often used in the context of service-level objectives (SLOs) or incident management to designate the highest level of urgency or importance. In other words, in a system or application with multiple components, a P0 service is one considered mission-critical to the business’s operation. When a P0 service experiences an outage or severe issue, it typically triggers immediate attention, including rapid escalation procedures and swift mobilization of on-call engineers to resolve the problem.

This blog post is the first in a series in which we will review basic disaster recovery concepts and outline disaster recovery strategies in a multi-service landscape.

Architecture Overview

Before diving into the specifics of our disaster recovery strategy, it’s essential to understand some fundamentals of AWS infrastructure that play a crucial role in our approach.

Note: The terminology below is specific to AWS; however, the concepts apply to any cloud provider.

AWS Regions and Availability Zones

AWS organizes its global infrastructure into separate geographic areas called regions. Within these regions, AWS operates isolated locations known as availability zones or AZs.

  • Regions: These are separate geographic areas designed to be completely isolated from each other, reducing the likelihood of simultaneous failure. Our primary runtime region is US East, and our standby region is US West.
  • Availability Zones: Each region has multiple AZs, essentially isolated data centers within the region. These zones are connected through low-latency links and offer the ability to run a highly available application that is more resilient to failures in a single location.

Understanding regions and AZs is critical for disaster recovery because they enable us to architect our services for high availability and failover capabilities. For example, we distribute our EKS clusters and RDS instances across multiple AZs in the primary region and have a pilot light setup in a separate standby region.
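
As a minimal illustration of what multi-AZ placement looks like at the API level, the boto3 sketch below lists the Availability Zones available in a region and checks whether a database instance is deployed Multi-AZ. The region and the `example-db` identifier are placeholders, not our actual resources.

```python
import boto3

PRIMARY_REGION = "us-east-1"  # assumption: placeholder primary region
DB_IDENTIFIER = "example-db"  # hypothetical RDS instance identifier

# List the Availability Zones currently available in the primary region.
ec2 = boto3.client("ec2", region_name=PRIMARY_REGION)
zones = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)
print([az["ZoneName"] for az in zones["AvailabilityZones"]])

# Check whether the RDS instance is configured for Multi-AZ deployment.
rds = boto3.client("rds", region_name=PRIMARY_REGION)
instance = rds.describe_db_instances(DBInstanceIdentifier=DB_IDENTIFIER)["DBInstances"][0]
print("Multi-AZ enabled:", instance["MultiAZ"])
```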

Disaster Recovery Strategies Overview on AWS

The AWS whitepaper Disaster Recovery Options in the Cloud details disaster recovery strategies and their approximate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

When selecting a strategy, RTO, RPO, cost, and technical constraints should all be considered. Most importantly, the strategy needs to align with the Service Level Agreement, or SLA.

For instance, if a payment service provider is a mission-critical component, we would want to fail over to another site as quickly as possible. In this scenario, achieving minimal RTO may be worth the additional software architecture complexity and the cost of running redundant infrastructure in multiple data centers in an active-active topology.

Here are some of the most common strategies:

Backup and Restore

What it is: The simplest form of DR involves taking backups of data and restoring them when needed.

Best for: Smaller workloads and non-critical systems where some downtime is acceptable.

Pilot Light

What it is: A minimal version of the environment is always running on standby, which is our chosen DR strategy.

Best for: Applications that require quick recovery with minimized data loss.

Warm Standby

What it is: A scaled-down version of a fully functional environment is always running.

Best for: Systems that need faster failover than pilot light but at a higher cost.

Multi-Site (Active-Active)

What it is: Running multiple instances of an application in different locations (regions/AZs) simultaneously.

Best for: Mission-critical applications where high availability and resilience are paramount.

Understanding the nuances of these strategies can help businesses choose the DR architecture best suited to their needs. Each comes with its cost, complexity, and recovery speed trade-offs.

VTS Lease Disaster Recovery Strategy

VTS Lease is a Software as a Service offering for Commercial Real Estate deals, leases, and tenant relationship management. It is a cloud-native service hosted on AWS, with customers ranging from small brokerage firms to large asset management companies.

For VTS Lease, our failover strategy keeps a version of the environment running in the standby region at all times. The standby region runs a minimal set of the application runtime and data stores, so the setup falls between the pilot light and warm standby strategies.

We achieve infrastructure parity between the primary and standby regions through Infrastructure as Code with Terraform. A bespoke infrastructure repository lets us replicate and enforce that parity across regions using Terraform modules, workspaces, and per-region configuration files.
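
Terraform remains the source of truth for the desired state; the sketch below is only an illustrative parity check (not our actual tooling) that compares the Kubernetes version reported by EKS in each region using boto3. The cluster name and region identifiers are placeholders.

```python
import boto3

CLUSTER_NAME = "platform"             # hypothetical EKS cluster name
REGIONS = ("us-east-1", "us-west-2")  # assumed primary and standby regions

# Fetch the Kubernetes control-plane version reported by EKS in each region.
versions = {}
for region in REGIONS:
    eks = boto3.client("eks", region_name=region)
    cluster = eks.describe_cluster(name=CLUSTER_NAME)["cluster"]
    versions[region] = cluster["version"]

# Flag drift between the primary and standby clusters.
if len(set(versions.values())) > 1:
    print(f"Version drift detected: {versions}")
else:
    print(f"Clusters in parity on version {versions[REGIONS[0]]}")
```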

Annual failover drills ensure the engineering team can rely on an up-to-date procedure. They also allow us to improve the process by replacing manual steps with automation whenever possible.

Advantages

Rapid Activation: Since critical systems and configurations are already in place, failing over is quicker than with backup and restore.

Reduced Costs: Because only essential services run in standby mode, the cost is lower than with warmer strategies such as active-active.

Drawbacks

Longer Recovery: Although the failover procedure strikes a good balance between backup and restore and an active-active topology, it still requires a few manual steps and human input, which lengthens recovery time.

Ongoing Maintenance: Having redundant infrastructure results in an additional observability and operational burden. The infrastructure in the standby region needs to be monitored appropriately so it is ready when disaster strikes.

AWS Service-specific DR Strategies

Our production infrastructure is always deployed in multiple availability zones for high availability.

AWS EKS (Elastic Kubernetes Service)

  • Immutable infrastructure: Infrastructure as Code (IaC) with Terraform mirrors the cluster configuration between the primary and standby regions.

AWS RDS for Postgres (Relational Database Service)

  • Read Replicas: Deployed to the standby region for quick promotion during disaster scenarios.
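
As an illustration of this pattern, the boto3 sketch below creates a cross-region read replica that can later be promoted during a failover. The identifiers, ARN, and instance class are placeholders; in our setup these resources are defined in Terraform rather than ad-hoc scripts.

```python
import boto3

STANDBY_REGION = "us-west-2"  # assumed standby region
# Placeholder ARN of the primary (source) database in the primary region.
SOURCE_DB_ARN = "arn:aws:rds:us-east-1:123456789012:db:example-primary"

rds = boto3.client("rds", region_name=STANDBY_REGION)

# Create a read replica in the standby region that continuously replicates
# from the primary instance and can be promoted during a failover.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="example-standby-replica",
    SourceDBInstanceIdentifier=SOURCE_DB_ARN,
    SourceRegion="us-east-1",      # lets boto3 handle the cross-region presigned URL
    DBInstanceClass="db.r6g.large",
    MultiAZ=True,
)
```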

AWS ElastiCache for Redis

  • Snapshots: Periodic backups are taken from Redis and stored in Amazon S3.
  • Cache Warming: Pre-warming the cache from snapshots in case of a failover. The backups in Amazon S3 are automatically replicated and restored periodically onto a standby instance in the standby region (see the sketch below).
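
To make the cache-warming flow concrete, the sketch below shows the underlying API calls: export a snapshot to an S3 bucket in the primary region, then seed a standby replication group from the exported RDB file. The snapshot, bucket, and group names are placeholders, and replication of the bucket to the standby region (for example via S3 cross-region replication) is assumed to be configured separately.

```python
import boto3

# Export a Redis snapshot to an S3 bucket in the primary region.
primary = boto3.client("elasticache", region_name="us-east-1")
primary.copy_snapshot(
    SourceSnapshotName="automatic.example-redis-2024-03-27",  # placeholder snapshot
    TargetSnapshotName="example-redis-dr",
    TargetBucket="example-dr-snapshots-us-east-1",  # assumed to replicate to the standby region
)

# In the standby region, seed a replication group from the replicated RDB file.
standby = boto3.client("elasticache", region_name="us-west-2")
standby.create_replication_group(
    ReplicationGroupId="example-redis-standby",
    ReplicationGroupDescription="DR standby seeded from snapshot",
    Engine="redis",
    CacheNodeType="cache.r6g.large",
    NumCacheClusters=1,
    SnapshotArns=["arn:aws:s3:::example-dr-snapshots-us-west-2/example-redis-dr.rdb"],
)
```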

AWS OpenSearch

  • Snapshots: Regular backups to Amazon S3. The backups in Amazon S3 are automatically replicated and refreshed periodically on a standby instance.
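
OpenSearch restores go through the domain’s `_snapshot` REST API rather than boto3. The sketch below assumes a manual S3 snapshot repository named `dr-s3-repo` is already registered on the standby domain and signs requests with SigV4 via the requests-aws4auth package; the endpoint and snapshot name are placeholders.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

STANDBY_REGION = "us-west-2"
# Placeholder endpoint of the standby OpenSearch domain.
ENDPOINT = "https://search-example-standby.us-west-2.es.amazonaws.com"

# Sign requests to the domain with the caller's AWS credentials (SigV4).
credentials = boto3.Session().get_credentials()
auth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    STANDBY_REGION,
    "es",
    session_token=credentials.token,
)

# Restore all indices from a snapshot stored in the registered S3 repository.
response = requests.post(
    f"{ENDPOINT}/_snapshot/dr-s3-repo/snapshot-2024-03-27/_restore",
    auth=auth,
    json={"indices": "*", "include_global_state": False},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```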

AWS Backup

  • Backup plans: We use AWS Backup plans with retention policies appropriate for different scenarios.
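
For illustration only, a backup plan with a cross-region copy rule looks roughly like the boto3 call below; the vault names, schedule, and retention values are placeholders rather than our actual policy.

```python
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Daily backups with 35-day retention, copied to a vault in the standby region.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "example-daily-dr",
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": "example-primary-vault",
                "ScheduleExpression": "cron(0 5 * * ? *)",  # 05:00 UTC daily
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:us-west-2:123456789012:"
                            "backup-vault:example-standby-vault"
                        ),
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }
)
```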

DR Failover Procedure

The following procedure is a high-level list of steps taken during our failover drills. These are tailored to each application’s needs.

  • Initial Assessment: Confirm the type and scale of the disaster.
    - Identify the incident: Determine the nature of the incident (security, technical, natural disaster, etc.)
    - Assess Impact: Evaluate the extent of the impact on services, data, and infrastructure.
    - Engagement: Kick off the incident response and engage the teams required for service recovery.
  • Notification: Alert all stakeholders and initiate the failover procedure.
    - Stakeholder Communication: Inform all relevant stakeholders, including management, IT, and external partners.
    - Documentation: Record actions taken for post-recovery review.
  • Resource Scaling: Scale the standby version of the environment in the standby region into a full-scale production environment (see the node group scaling sketch after this list).
    - Verification: Ensure the standby environment is operational.
    - Adjust Capacity: Increase the standby region's compute, storage, and networking capacity to handle the production load.
    - Configuration: Apply necessary configuration changes to support full-scale operations.
  • Data Syncing: Promote RDS read replicas and restore ElastiCache and OpenSearch snapshots (see the promotion sketch after this list).
    - Database promotion: Promote the read replicas in the standby region to primary status, ensuring the database service can handle production workloads with the correct configuration.
    - Ephemeral storage: Restore caches and auxiliary databases, or choose other means to hydrate data.
  • DNS Switch: Update DNS records to point to the standby region (see the Route 53 sketch after this list).
    - DNS update: Modify DNS records to reroute traffic to the standby region.
    - TTL settings: To minimize the time needed for changes to propagate, consider reducing the Time to Live (TTL) settings for DNS records in advance.
    - Health checks: Implement health checks to ensure services are fully operational before rerouting traffic.
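
To make the Resource Scaling step concrete, here is a minimal boto3 sketch of growing an EKS managed node group in the standby region from pilot-light size to production capacity. The cluster name, node group name, and target sizes are placeholders, and in practice this is typically wrapped in automation rather than run by hand.

```python
import boto3

STANDBY_REGION = "us-west-2"
eks = boto3.client("eks", region_name=STANDBY_REGION)

# Grow the standby node group from pilot-light size to production capacity.
eks.update_nodegroup_config(
    clusterName="platform",           # hypothetical cluster name
    nodegroupName="general-purpose",  # hypothetical node group name
    scalingConfig={"minSize": 3, "maxSize": 12, "desiredSize": 9},
)
```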
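For the Data Syncing step, database promotion amounts to promoting the standby read replica and waiting for it to become a standalone primary, roughly as sketched below. The instance identifier and retention period are placeholders.

```python
import boto3

STANDBY_REGION = "us-west-2"
rds = boto3.client("rds", region_name=STANDBY_REGION)

REPLICA_ID = "example-standby-replica"  # hypothetical replica identifier

# Break replication and promote the replica to a standalone primary,
# enabling automated backups on the newly promoted instance.
rds.promote_read_replica(
    DBInstanceIdentifier=REPLICA_ID,
    BackupRetentionPeriod=7,
)

# Block until the promoted instance is available and ready to serve writes.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier=REPLICA_ID)
print("Promotion complete")
```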
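For the DNS Switch step, the cutover is an UPSERT against the relevant Route 53 record so that it points at the standby region's endpoint. The hosted zone ID, record name, and target hostname below are placeholders, and the record's TTL is assumed to have been lowered in advance.

```python
import boto3

route53 = boto3.client("route53")

# Repoint the application hostname at the standby region's load balancer.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "DR failover to standby region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "TTL": 60,  # kept low so the switch propagates quickly
                    "ResourceRecords": [
                        {"Value": "standby-alb-123456.us-west-2.elb.amazonaws.com"}
                    ],
                },
            }
        ],
    },
)
```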

Conclusion

Developing and maintaining a robust disaster recovery strategy is not a one-time effort but an ongoing process. Using AWS services like EKS, RDS, ElastiCache for Redis, and OpenSearch lets us leverage built-in high-availability features and reduces the complexity involved in DR planning. By investing in a disaster recovery setup and focusing on service-specific strategies, we have built a resilient system poised to handle unexpected disasters with minimal downtime and data loss.

Do you have any strategies to share or questions about our setup? Please feel free to comment below.

References

  1. Introduction — Disaster Recovery of Workloads on AWS
  2. Herrington, M. (2017). There are multiple approaches to backup and disaster recovery planning. The Enterprise, 46(38), 8.
