RPO and RTO

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively. In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure.

The RPO and RTO for Temporal Cloud fall into one of three scenarios:

  1. High Availability features enabled (same-region, multi-region, or multi-cloud replication): Sub-1-minute RPO and 20 minutes or less RTO
  2. Default (non-HA) namespace, regional failure: 8-hour RPO and RTO
  3. Default (non-HA) namespace, availability zone failure: 0 RPO and RTO

Which objectives apply to your organization depends on whether you treat the loss of a data center as a regional failure or a zonal failure. Temporal Cloud delivers different RPOs and RTOs in these scenarios because of the way our platform performs writes to our data provider.
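As a rough illustration, the scenario list above can be encoded as a small lookup table and compared against your own recovery requirements. This is a sketch only: the names `Objective`, `SCENARIOS`, and `scenarios_meeting` are hypothetical and not part of any Temporal API, and the sub-1-minute RPO is modeled as an upper bound of one minute.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rpo: timedelta  # upper bound on data loss
    rto: timedelta  # upper bound on time to restore service


# Values come from the scenario list above; the keys are illustrative labels.
SCENARIOS = {
    "ha_enabled": Objective(rpo=timedelta(minutes=1), rto=timedelta(minutes=20)),
    "non_ha_regional_failure": Objective(rpo=timedelta(hours=8), rto=timedelta(hours=8)),
    "non_ha_az_failure": Objective(rpo=timedelta(0), rto=timedelta(0)),
}


def scenarios_meeting(required_rpo, required_rto):
    """Return the labels of scenarios whose objectives fit within the requirements."""
    return sorted(
        name
        for name, obj in SCENARIOS.items()
        if obj.rpo <= required_rpo and obj.rto <= required_rto
    )


if __name__ == "__main__":
    # Example requirement: lose at most 5 minutes of data, recover within 30 minutes.
    print(scenarios_meeting(timedelta(minutes=5), timedelta(minutes=30)))
```

Note that the failure scope (regional or zonal) is not something you choose. The lookup only shows which scenarios stay within a given requirement, which can help when deciding whether to enable High Availability features.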

High Availability, Regional Failure

Temporal Cloud offers High Availability features, which are designed to keep a system operational with minimal downtime.

As Workflows progress in the active region, history events are asynchronously replicated to the standby region. In case of an incident or outage in the active region, Temporal Cloud will fail over to the standby region so that existing Workflow Executions will continue to run and new Executions can be started.

Recovery Point Objective (RPO) - sub-1-minute

Temporal Cloud is designed to limit the data loss that remains after recovery, once the incident that triggered the failover is resolved.

Temporal Cloud strives to maintain a P95 replication lag of less than 1 minute. In this context, P95 means 95% of updates are processed faster than this limit.
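To make the P95 definition concrete, here is a minimal sketch that checks a set of replication lag samples against a 60-second target. It assumes the nearest-rank definition of a percentile, and the function names and sample values are hypothetical. The source does not describe how Temporal Cloud actually measures replication lag.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_lag_target(lag_seconds, target_seconds=60.0):
    """True if the P95 of the lag samples is below the target."""
    return p95(lag_seconds) < target_seconds


if __name__ == "__main__":
    # Hypothetical samples: 19 fast updates and one slow outlier.
    lags = [5.0] * 19 + [300.0]
    print(p95(lags), meets_lag_target(lags))
```

With these samples the single 300-second outlier falls in the slowest 5% of updates, so the P95 target is still met even though one update lagged well beyond a minute.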

The recovery point objective (RPO) is near zero. During the incident there may be a short window, the replication lag, in which some data is unavailable.

Recovery Time Objective (RTO) - 20 minutes

Recovery time objective (RTO) for Temporal Cloud is 20 minutes or less per incident.

Default (non-HA) Namespace, Regional Failure

Temporal Cloud Namespace data is backed up by our data provider. For a single-region Namespace, recovering from a regional failure (i.e., logical corruption) requires restoring data from that backup.

Temporal Cloud is bound by our data provider's backup constraints, which lead to the following objectives for regional failure:

Recovery Point Objective (RPO) - 8 hours

The 8-hour RPO is the sum of two components:

  • Our data provider's “snapshot” duration, which is 4 hours.
  • The 4-hour window allocated to detecting the corruption point before we mitigate.

Recovery Time Objective (RTO) - 8 hours

The 8-hour RTO is the sum of two components:

  • The 4-hour window allocated to detecting the corruption point.
  • Our data provider's restore time, which can be up to 4 hours.
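The arithmetic behind the two 8-hour figures above can be written out as a short sketch. The constant names are illustrative. The 4-hour values are the ones stated in this section, and both results are worst-case bounds.

```python
from datetime import timedelta

# Figures stated in this section; the constant names are illustrative.
SNAPSHOT_DURATION = timedelta(hours=4)  # data provider "snapshot" duration
DETECTION_WINDOW = timedelta(hours=4)   # time allocated to detect the corruption point
MAX_RESTORE_TIME = timedelta(hours=4)   # data provider restore time (upper bound)

# Worst-case data loss: snapshot duration plus detection time.
regional_rpo = SNAPSHOT_DURATION + DETECTION_WINDOW

# Worst-case downtime: detection time plus restore time.
regional_rto = DETECTION_WINDOW + MAX_RESTORE_TIME

if __name__ == "__main__":
    print(f"RPO: {regional_rpo}, RTO: {regional_rto}")
```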

Default (non-HA) Namespace, Availability Zone Failure

Temporal Cells are deployed in three Availability Zones (AZs) in the same region. Our data provider is deployed with the same topology in three AZs in the same region.

All writes to storage are synchronously replicated across AZs, including our writes to Elasticsearch. Elasticsearch is eventually consistent, but this does not affect our RPO because no data is lost.

As a result, an AZ failure involves no logical corruption, and restoration is done from a live replicated instance. This applies to both single-region and multi-region Namespaces.

This leads to the following objectives for availability zone failure:

Recovery Point Objective (RPO) - 0

Any data committed in one zone is protected by replication to another AZ.

Recovery Time Objective (RTO) - 0

Temporal Cloud is active-active across AZs. The stated RTO is zero, meaning there should be no downtime in such scenarios.