Cyber Resilience

Cyber Resilience is the capacity of an organisation to anticipate, withstand, recover from, and adapt to adverse cyber events. It goes beyond traditional security (preventing attacks) and traditional disaster recovery (restoring from backups) to address the reality that sophisticated adversaries will eventually breach preventive controls. The question is not whether you will be compromised, but whether your architecture allows you to continue operating and recover quickly when you are.

The pattern is built on four phases. Anticipation identifies critical business services, maps their dependencies, sets impact tolerances, and designs architectures that can absorb disruption. Withstanding maintains critical operations under attack through graceful degradation, isolation, and failover -- the system bends but does not break. Recovery restores full operations from a known-good state through immutable backups, isolated recovery environments, and tested restoration procedures. Adaptation feeds lessons from each incident back into the architecture, continuously improving resilience posture.

This is fundamentally an architecture problem, not a technology problem. A resilient system is designed from the ground up with failure modes in mind: critical services have no single points of failure, data stores have immutable backups that attackers cannot reach, identity systems can be rebuilt independently, and recovery procedures are tested regularly -- not just documented. The most common failure mode in real cyber incidents is not the initial compromise but the discovery during recovery that backups are corrupted, that no one knows the restoration sequence, or that the recovery environment depends on the same compromised infrastructure.

The pattern aligns with regulatory frameworks that increasingly mandate operational resilience: the EU Digital Operational Resilience Act (DORA), the UK PRA/FCA operational resilience framework, and the US Federal Financial Institutions Examination Council (FFIEC) business continuity guidance. These regulations recognise that financial stability depends not on preventing every attack but on ensuring critical services recover within defined impact tolerances.

Release: 26.02; Authors: Aurelius, Vitruvius; Updated: 2026-02-07

ATT&CK: This pattern addresses 425 techniques across 13 tactics.

Key Control Areas

Critical Service Identification and Impact Tolerance

PM-08 PM-11 PL-02 RA-03 RA-09

Resilience starts with knowing what matters. PM-08 establishes a critical infrastructure plan that identifies the systems and services essential to the organisation's mission. PM-11 defines mission and business processes with their supporting technology dependencies, creating the map from business outcomes to infrastructure components. PL-02 documents the security architecture including trust boundaries, data flows, and single points of failure. RA-03 assesses risk to critical services: what threats are most likely, what is the impact of disruption, and where are the architectural weaknesses? RA-09 (criticality analysis) determines which components, if compromised, would cause the most damage -- these are the crown jewels that receive the highest resilience investment. The output is a tiered service catalogue with impact tolerances: Tier 1 services (e.g., payment processing) must recover within 2 hours, Tier 2 (e.g., customer portal) within 24 hours, Tier 3 (e.g., internal reporting) within 72 hours. Impact tolerances drive every subsequent architectural decision.
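
The tiered catalogue described above can be sketched as a small data structure. This is a minimal illustration, not a prescribed format: the tier tolerances come from the text, while the service names, the `Service` class, and the `within_tolerance` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

# Impact tolerances per tier, as given in the text above.
TIER_TOLERANCE = {
    1: timedelta(hours=2),
    2: timedelta(hours=24),
    3: timedelta(hours=72),
}


@dataclass(frozen=True)
class Service:
    """One entry in the (hypothetical) tiered service catalogue."""
    name: str
    tier: int

    @property
    def impact_tolerance(self) -> timedelta:
        return TIER_TOLERANCE[self.tier]


def within_tolerance(service: Service, measured_recovery: timedelta) -> bool:
    """True if a recovery time measured in a test meets the tolerance."""
    return measured_recovery <= service.impact_tolerance


# Illustrative catalogue entries, named after the examples in the text.
CATALOGUE = [
    Service("payment-processing", 1),
    Service("customer-portal", 2),
    Service("internal-reporting", 3),
]
```

Comparing measured recovery times from exercises against these tolerances gives a simple pass or fail signal per service, which is the kind of figure later sections expect to be tested and reported.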

Resilient Architecture and Graceful Degradation

SA-08 SC-36 SC-07 SC-32 SI-13 SI-17

Architecture must be designed to absorb disruption. SA-08 applies resilience engineering principles: no single points of failure for critical services, blast radius containment through isolation, fail-safe defaults that maintain safety when components fail. SC-36 provides distributed processing and storage: critical services run across multiple availability zones or regions, with data replicated to ensure availability if one location is lost. SC-07 enforces boundary protection between resilience zones: if one zone is compromised, network segmentation prevents lateral movement to recovery infrastructure. SC-32 creates system partitions that isolate critical components: the backup infrastructure is in a separate trust domain from production, with separate credentials, separate network paths, and separate administrative access. SI-13 addresses predictable failure prevention: monitoring for early warning signs (disk degradation, certificate expiry, capacity limits) that could compound during an incident. SI-17 provides fail-safe procedures when automated systems fail: manual fallback processes for critical business functions, paper-based procedures for essential operations, and clear escalation paths when technology is unavailable. Graceful degradation means designing systems with progressive feature reduction: when under attack, non-essential features are shed to preserve core functionality.
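
Progressive feature reduction can be illustrated with a priority-based shedding rule. The sketch below is one possible approach under stated assumptions: the feature names, the three degradation levels, and the `MAX_PRIORITY_AT_LEVEL` mapping are all hypothetical and would be derived from the service catalogue in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    priority: int  # 1 = core functionality; higher numbers are shed first


# Hypothetical degradation levels. Level 0 is normal operation; at each
# higher level, only features at or below the listed priority stay enabled.
MAX_PRIORITY_AT_LEVEL = {0: 3, 1: 2, 2: 1}


def enabled_features(features, level):
    """Return the names of features that remain on at a degradation level."""
    ceiling = MAX_PRIORITY_AT_LEVEL[level]
    return [f.name for f in features if f.priority <= ceiling]


# Illustrative feature set for a payment service.
FEATURES = [
    Feature("process-payment", 1),
    Feature("transaction-history", 2),
    Feature("spending-insights", 3),
]
```

Deciding the priorities ahead of time matters more than the mechanism: during an attack, operators only need to choose a level, not debate which features are essential.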

Immutable Backup and Data Protection

CP-09 CP-06 SC-28 SI-07 SC-13

Backups are the last line of defence and must be treated as critical infrastructure. CP-09 mandates system backup with specific attention to integrity: backups must be immutable (write-once-read-many or append-only storage), stored in a separate trust domain with separate credentials, and regularly tested through actual restoration. CP-06 provides alternate storage sites: backup data must be geographically separated from production and accessible through independent network paths that do not traverse potentially compromised infrastructure. SC-28 protects data at rest: backup encryption with keys managed independently of the production key management system -- if the attacker compromises production encryption keys, backup data must remain protected. SI-07 verifies backup integrity: cryptographic hashes computed at backup time and verified before restoration, detecting tampering or corruption that would make restoration fail. SC-13 mandates approved cryptographic mechanisms for backup encryption. The 3-2-1-1-0 rule provides a practical target: three copies of data, on two different media types, one offsite, one immutable, zero errors in recovery testing. Ransomware specifically targets backup systems: the architecture must assume that backup credentials, backup servers, and backup network paths will be attacked.
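
The integrity check described under SI-07 (hash at backup time, verify before restoration) can be sketched with the Python standard library. This is a minimal example assuming SHA-256 and file-based backups; the function names are hypothetical, and where the recorded digest is stored (ideally outside the production trust domain) is left to the surrounding architecture.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_before_restore(path: Path, recorded_digest: str) -> bool:
    """Compare the current hash against the digest recorded at backup time."""
    return hmac.compare_digest(sha256_of(path), recorded_digest.lower())
```

A restoration procedure would call `verify_before_restore` on every artefact and refuse to proceed on the first mismatch, rather than discovering corruption partway through a rebuild.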

Isolated Recovery Environment

CP-07 CP-10 CP-02 SC-07 AC-02

Recovery must happen from a clean, trusted environment. CP-07 provides an alternate processing site: an isolated recovery environment (sometimes called a clean room or cyber vault) that can reconstitute critical services independently of the production environment. CP-10 governs system recovery and reconstitution: the process of rebuilding systems from immutable images, verified backups, and trusted software repositories. CP-02 defines the contingency plan: step-by-step recovery procedures that have been tested and can be executed under the stress of an active incident. SC-07 enforces strict boundary protection around the recovery environment: air-gapped or network-isolated, accessible only through tightly controlled jump boxes, with no trust relationship to the production domain. AC-02 manages separate recovery accounts: recovery environment credentials are stored offline (printed, in a safe), not in the production directory service that may be compromised. The recovery environment must include: independent DNS, independent identity provider, independent certificate authority, network-isolated backup access, and pre-staged recovery tooling. Testing must validate that the recovery environment can actually reconstitute services without any dependency on the production environment.
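
The independence requirement above lends itself to an automated check. The following is a rough sketch, assuming an inventory that maps each recovery component to the trust domain hosting it; the component identifiers, the `"prod"` domain label, and the `independence_gaps` function are hypothetical.

```python
# Components the text says the recovery environment must include.
REQUIRED_COMPONENTS = {
    "dns",
    "identity-provider",
    "certificate-authority",
    "backup-access",
    "recovery-tooling",
}


def independence_gaps(inventory: dict[str, str],
                      production_domain: str = "prod") -> list[str]:
    """List required components that are missing or hosted in production.

    `inventory` maps a component name to the trust domain that hosts it.
    An empty result means no gap was found by this (simplified) check.
    """
    gaps = []
    for component in sorted(REQUIRED_COMPONENTS):
        domain = inventory.get(component)
        if domain is None:
            gaps.append(f"{component}: missing")
        elif domain == production_domain:
            gaps.append(f"{component}: hosted in production trust domain")
    return gaps
```

A check like this only validates the declared inventory. It does not replace a recovery test that actually reconstitutes services with production unreachable.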

Identity and Access Resilience

IA-02 IA-05 AC-02 CP-02 SC-12

Identity infrastructure is often the first and hardest thing to recover. IA-02 in the resilience context means maintaining the ability to authenticate users even when the primary identity provider is compromised. IA-05 addresses authenticator management during crisis: break-glass accounts with offline credentials, emergency access procedures that bypass normal authentication while maintaining auditability. AC-02 manages emergency account provisioning: predefined emergency accounts sealed in secure storage, used only during recovery, with full audit trails. CP-02 includes identity recovery in the contingency plan: how to rebuild Active Directory from backup, how to re-establish certificate trust, how to provision temporary access during rebuilding. SC-12 ensures cryptographic key recovery: backup of critical encryption keys (data encryption, TLS certificates, code signing) in offline key escrow, enabling service restoration without regenerating all cryptographic material. The most common recovery failure is circular dependency: you need Active Directory to access the backup system, but Active Directory is the thing you need to restore. Break this cycle by designing the recovery environment with independent identity.
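
Circular dependencies of this kind can be detected before an incident by modelling restoration dependencies as a graph. The sketch below uses Python's standard `graphlib` module; the system names and the two example graphs are illustrative assumptions, with `recovery-idp` standing in for an independent identity provider in the recovery environment.

```python
from graphlib import CycleError, TopologicalSorter


def restoration_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return an order in which each system is restored after its dependencies.

    `depends_on` maps a system to the systems that must be available first.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def find_cycle(depends_on: dict[str, set[str]]):
    """Return one circular dependency chain, or None if there is none."""
    try:
        restoration_order(depends_on)
    except CycleError as err:
        return err.args[1]  # the nodes forming the detected cycle
    return None


# The trap described in the text: each system needs the other to be restored.
CIRCULAR = {
    "active-directory": {"backup-system"},
    "backup-system": {"active-directory"},
}

# Breaking the cycle with an independent identity provider (hypothetical name).
FIXED = {
    "recovery-idp": set(),
    "backup-system": {"recovery-idp"},
    "active-directory": {"backup-system"},
}
```

For `FIXED`, `restoration_order` yields the independent identity provider first, then the backup system, then Active Directory, which is also a usable starting point for a documented restoration sequence.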

Testing, Exercising, and Continuous Improvement

CP-04 IR-03 PM-14 CA-02 CA-07

Resilience that is not tested is theoretical. CP-04 mandates contingency plan testing: at minimum annual tabletop exercises, quarterly technical recovery tests for Tier 1 services, and periodic full-scale simulation exercises. IR-03 provides incident response testing through red team exercises, purple team collaboration, and scenario-based simulations that test detection, response, and recovery holistically. PM-14 ensures continuous testing and monitoring of resilience capabilities. CA-02 conducts control assessments of resilience architecture: verifying that backup immutability is actually enforced, that recovery environments are truly isolated, that recovery time objectives can actually be met. CA-07 provides continuous monitoring of resilience indicators: backup completion rates, backup age, recovery environment readiness, certificate expiry timelines, and capacity headroom. Every test should produce findings, and findings should drive architectural improvements. Organisations that treat resilience testing as a compliance checkbox rather than a genuine learning exercise discover their gaps during real incidents rather than before them.
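
Two of the indicators named under CA-07, backup age and certificate expiry, can be checked with a few lines of code. This is a sketch only: the 24-hour and 30-day thresholds, the alert wording, and the `resilience_alerts` function are assumptions for illustration, not values taken from any standard.

```python
from datetime import datetime, timedelta, timezone


def resilience_alerts(last_backup: datetime,
                      cert_expiry: datetime,
                      now: datetime,
                      max_backup_age: timedelta = timedelta(hours=24),
                      min_cert_runway: timedelta = timedelta(days=30)) -> list[str]:
    """Return alerts for stale backups and certificates close to expiry.

    Thresholds are illustrative defaults and should follow the impact
    tolerances of the service being monitored.
    """
    alerts = []
    if now - last_backup > max_backup_age:
        alerts.append("backup older than threshold")
    if cert_expiry - now < min_cert_runway:
        alerts.append("certificate expires soon")
    return alerts


# Example with fixed timestamps so the result is reproducible.
NOW = datetime(2026, 2, 7, 12, 0, tzinfo=timezone.utc)
EXAMPLE = resilience_alerts(
    last_backup=NOW - timedelta(hours=30),
    cert_expiry=NOW + timedelta(days=10),
    now=NOW,
)
```

In the example, both conditions are breached, so `EXAMPLE` contains both alerts. Passing `now` explicitly keeps the function deterministic and easy to test.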

Regulatory Alignment and Governance

PM-08 PM-09 PM-30 PL-02 SA-08

Cyber resilience is increasingly regulated. PM-08 establishes critical infrastructure plans aligned with regulatory requirements: DORA mandates ICT risk management frameworks with specific resilience requirements for financial entities. PM-09 develops the risk management strategy that balances resilience investment against organisational risk appetite. PM-30 provides supply chain risk management: understanding and managing resilience dependencies on third-party providers (cloud platforms, SaaS applications, managed services) whose failure could disrupt critical services. PL-02 documents resilience architecture in system security plans that demonstrate regulatory compliance. SA-08 applies security and privacy engineering principles to resilience design: defence in depth (multiple independent resilience mechanisms), least privilege (recovery access restricted to minimum necessary), and fail-safe defaults (systems fail to a safe state rather than an insecure one). Board-level reporting on resilience posture should include: recovery time objective vs actual recovery time in tests, backup integrity verification results, and resilience gap remediation progress.

When to Use

This pattern is essential for any organisation where a prolonged technology outage would cause significant financial, reputational, or safety harm. It is particularly critical for: financial services organisations subject to DORA or operational resilience regulation, healthcare organisations where system availability affects patient safety, critical national infrastructure operators, organisations that have experienced (or closely observed peers experiencing) destructive cyber attacks, any organisation with high-value data that would be targeted by ransomware, and organisations with complex technology estates where recovery sequence and dependencies are not well understood.

When NOT to Use

Very small organisations with simple technology environments and high tolerance for downtime may find the full resilience architecture disproportionate, though basic backup and recovery capability is always appropriate. Organisations in an early startup phase, where the entire technology estate can be rebuilt from code repositories in hours, may prioritise development speed over resilience architecture. Environments that are entirely ephemeral and stateless (pure functions with no persistent data) have different resilience characteristics and may not need traditional backup and recovery.

Typical Challenges

Cost is the primary barrier: resilient architecture requires investment in redundancy, backup infrastructure, recovery environments, and testing programmes that deliver no visible benefit until an incident occurs. Recovery testing disrupts production operations and requires dedicated time from teams already under pressure. Organisations discover circular dependencies during recovery exercises: the backup system depends on DNS which depends on Active Directory which is the thing being recovered. Cloud provider resilience is often assumed but not verified: multi-region deployments may share control plane dependencies that create correlated failures. Immutable backup storage increases cost and complexity compared to traditional backup approaches. The human factor is critical and often underestimated: recovery under the stress of an active incident with exhausted teams and executive pressure is fundamentally different from a planned test. Shadow IT and undocumented systems create recovery gaps: you cannot recover what you do not know exists. Third-party dependencies (SaaS providers, cloud platforms) may have their own resilience limitations that constrain your recovery timeline. Keeping recovery environments current and tested requires ongoing operational investment that competes with feature development.

Threat Resistance

Cyber Resilience addresses the threats that preventive controls alone cannot fully mitigate. Ransomware that encrypts production data is neutralised by immutable backups in isolated storage that the attacker cannot reach or corrupt (CP-09, CP-06, SC-28). Destructive wiper malware that destroys systems is countered by the ability to reconstitute from immutable images in an isolated recovery environment (CP-07, CP-10). Supply chain attacks that compromise trusted software are contained by verified software repositories and integrity checking during recovery (SI-07, SR-10). Advanced persistent threats that compromise the domain controller are addressed by independent identity recovery capability and break-glass access (IA-02, AC-02, CP-02). Insider threats that sabotage backup infrastructure are mitigated by separation of duties, immutable storage, and multi-party authorisation for backup administration (AC-05, CP-09, AC-02). Cloud provider outages are addressed by multi-region architecture and provider-independent recovery capability (SC-36, CP-07). Coordinated attacks targeting both production and backup simultaneously are countered by air-gapped recovery environments with independent credentials (SC-07, AC-02). The fundamental architectural principle is that recovery infrastructure must exist in a separate trust domain that does not share credentials, network paths, or administrative access with the production environment.

Assumptions

The organisation has identified its critical business services and their technology dependencies. Executive sponsorship exists for resilience investment (resilience costs money upfront and pays off during incidents). IT operations teams have the skills to implement and maintain backup and recovery infrastructure. Network architecture supports the creation of isolated recovery environments. The organisation can tolerate the operational overhead of maintaining and testing recovery capabilities. Regulatory requirements for operational resilience are understood and prioritised.

Developing Areas

Immutable backup testing automation is improving but most organisations still rely on manual quarterly restoration exercises. The gap between backup completion (automated, measured) and backup recoverability (rarely tested, poorly measured) means that many organisations discover corruption or incompatibility only during real incidents. Emerging solutions like automated daily restore-and-validate pipelines are available from major vendors but adoption remains below 20% even in financial services.
Cross-border disaster recovery coordination is increasingly complex as data residency requirements multiply. GDPR, DORA, and equivalent regulations create conflicting requirements where backup data must be geographically distributed for resilience but confined to specific jurisdictions for compliance. Organisations with operations spanning EU, UK, US, and APAC face recovery architectures constrained by legal geography rather than optimal network topology.
DORA compliance measurement lacks standardised metrics. The regulation mandates operational resilience testing but does not define pass/fail criteria for recovery time objectives, backup integrity verification, or third-party resilience assessment. Regulatory supervisors are developing expectations through supervisory dialogue rather than published benchmarks, creating uncertainty for organisations attempting to demonstrate compliance.
Chaos engineering applied to security resilience -- intentionally injecting security failures to test detection and recovery -- is gaining traction but remains controversial. Netflix-style failure injection for availability is well-understood, but deliberately simulating credential compromise, backup corruption, or identity infrastructure failure in production carries risks that most organisations are not willing to accept without mature safety mechanisms.
Ransomware recovery time objectives are being tested against increasingly sophisticated adversary tactics. Modern ransomware groups target backup infrastructure specifically, and average dwell times of 10-14 days before encryption give them time for thorough reconnaissance of recovery capabilities. The emerging response -- isolated recovery environments with independent identity and network infrastructure -- is architecturally sound but operationally expensive and rarely tested at the frequency needed to maintain confidence.

Related Patterns

Patterns that operate within or alongside this one.

SP-011 Cloud Computing
SP-013 Data Security
SP-025 Advanced Monitoring and Detection
SP-028 Secure DevOps Pipeline
SP-029 Zero Trust Architecture
SP-031 Security Monitoring and Response

AC: 2, AU: 1, CA: 2, CP: 6, IA: 2, IR: 3, PL: 1, PM: 5, RA: 2, SA: 2, SC: 6, SI: 3, SR: 1

AC-02 Account Management

AC-05 Separation of Duties

AU-02 Event Logging

CA-02 Control Assessments

CA-07 Continuous Monitoring

CP-02 Contingency Plan

CP-04 Contingency Plan Testing

CP-06 Alternate Storage Site

CP-07 Alternate Processing Site

CP-08 Telecommunications Services

CP-09 System Backup

CP-10 System Recovery and Reconstitution

IA-02 Identification and Authentication (Organizational Users)

IA-05 Authenticator Management

IR-03 Incident Response Testing

IR-04 Incident Handling

IR-08 Incident Response Plan

PL-02 System Security and Privacy Plans

PM-08 Critical Infrastructure Plan

PM-09 Risk Management Strategy

PM-11 Mission and Business Process Definition

PM-14 Testing, Training, and Monitoring

PM-30 Supply Chain Risk Management Strategy

RA-03 Risk Assessment

RA-09 Criticality Analysis

SA-08 Security and Privacy Engineering Principles

SA-09 External System Services

SC-07 Boundary Protection

SC-12 Cryptographic Key Establishment and Management

SC-13 Cryptographic Protection

SC-28 Protection of Information at Rest

SC-32 System Partitioning

SC-36 Distributed Processing and Storage

SI-07 Software, Firmware, and Information Integrity

SI-13 Predictable Failure Prevention

SI-17 Fail-safe Procedures

SR-10 Inspection of Systems or Components