Cyber Resilience
Key Control Areas
- Critical Service Identification and Impact Tolerance (PM-08, PM-11, PL-02, RA-03, RA-09): Resilience starts with knowing what matters. PM-08 establishes a critical infrastructure plan that identifies the systems and services essential to the organisation's mission. PM-11 defines mission and business processes with their supporting technology dependencies, creating the map from business outcomes to infrastructure components. PL-02 documents the security architecture including trust boundaries, data flows, and single points of failure. RA-03 assesses risk to critical services: what threats are most likely, what is the impact of disruption, and where are the architectural weaknesses? RA-09 (criticality analysis) determines which components, if compromised, would cause the most damage -- these are the crown jewels that receive the highest resilience investment. The output is a tiered service catalogue with impact tolerances: Tier 1 services (e.g., payment processing) must recover within 2 hours, Tier 2 (e.g., customer portal) within 24 hours, Tier 3 (e.g., internal reporting) within 72 hours. Impact tolerances drive every subsequent architectural decision.
- Resilient Architecture and Graceful Degradation (SA-08, SC-36, SC-07, SC-32, SI-13, SI-17): Architecture must be designed to absorb disruption. SA-08 applies resilience engineering principles: no single points of failure for critical services, blast radius containment through isolation, fail-safe defaults that maintain safety when components fail. SC-36 provides distributed processing and storage: critical services run across multiple availability zones or regions, with data replicated to ensure availability if one location is lost. SC-07 enforces boundary protection between resilience zones: if one zone is compromised, network segmentation prevents lateral movement to recovery infrastructure. SC-32 creates system partitions that isolate critical components: the backup infrastructure is in a separate trust domain from production, with separate credentials, separate network paths, and separate administrative access. SI-13 addresses predictable failure prevention: monitoring for early warning signs (disk degradation, certificate expiry, capacity limits) that could compound during an incident. SI-17 provides fail-safe procedures when automated systems fail: manual fallback processes for critical business functions, paper-based procedures for essential operations, and clear escalation paths when technology is unavailable. Graceful degradation means designing systems with progressive feature reduction: when under attack, non-essential features are shed to preserve core functionality.
- Immutable Backup and Data Protection (CP-09, CP-06, SC-28, SI-07, SC-13): Backups are the last line of defence and must be treated as critical infrastructure. CP-09 mandates system backup with specific attention to integrity: backups must be immutable (write-once-read-many or append-only storage), stored in a separate trust domain with separate credentials, and regularly tested through actual restoration. CP-06 provides alternate storage sites: backup data must be geographically separated from production and accessible through independent network paths that do not traverse potentially compromised infrastructure. SC-28 protects data at rest: backup encryption with keys managed independently of the production key management system -- if the attacker compromises production encryption keys, backup data must remain protected. SI-07 verifies backup integrity: cryptographic hashes computed at backup time and verified before restoration, detecting tampering or corruption that would make restoration fail. SC-13 mandates approved cryptographic mechanisms for backup encryption. The 3-2-1-1-0 rule provides a practical target: three copies of data, on two different media types, one offsite, one immutable, zero errors in recovery testing. Ransomware specifically targets backup systems: the architecture must assume that backup credentials, backup servers, and backup network paths will be attacked.
- Isolated Recovery Environment (CP-07, CP-10, CP-02, SC-07, AC-02): Recovery must happen from a clean, trusted environment. CP-07 provides an alternate processing site: an isolated recovery environment (sometimes called a clean room or cyber vault) that can reconstitute critical services independently of the production environment. CP-10 governs system recovery and reconstitution: the process of rebuilding systems from immutable images, verified backups, and trusted software repositories. CP-02 defines the contingency plan: step-by-step recovery procedures that have been tested and can be executed under the stress of an active incident. SC-07 enforces strict boundary protection around the recovery environment: air-gapped or network-isolated, accessible only through tightly controlled jump boxes, with no trust relationship to the production domain. AC-02 manages separate recovery accounts: recovery environment credentials are stored offline (printed, in a safe), not in the production directory service that may be compromised. The recovery environment must include: independent DNS, independent identity provider, independent certificate authority, network-isolated backup access, and pre-staged recovery tooling. Testing must validate that the recovery environment can actually reconstitute services without any dependency on the production environment.
- Identity and Access Resilience (IA-02, IA-05, AC-02, CP-02, SC-12): Identity infrastructure is often the first and hardest thing to recover. IA-02 in the resilience context means maintaining the ability to authenticate users even when the primary identity provider is compromised. IA-05 addresses authenticator management during crisis: break-glass accounts with offline credentials, emergency access procedures that bypass normal authentication while maintaining auditability. AC-02 manages emergency account provisioning: predefined emergency accounts sealed in secure storage, used only during recovery, with full audit trails. CP-02 includes identity recovery in the contingency plan: how to rebuild Active Directory from backup, how to re-establish certificate trust, how to provision temporary access during rebuilding. SC-12 ensures cryptographic key recovery: backup of critical encryption keys (data encryption, TLS certificates, code signing) in offline key escrow, enabling service restoration without regenerating all cryptographic material. The most common recovery failure is circular dependency: you need Active Directory to access the backup system, but Active Directory is the thing you need to restore. Break this cycle by designing the recovery environment with independent identity.
- Testing, Exercising, and Continuous Improvement (CP-04, IR-03, PM-14, CA-02, CA-07): Resilience that is not tested is theoretical. CP-04 mandates contingency plan testing: at minimum annual tabletop exercises, quarterly technical recovery tests for Tier 1 services, and periodic full-scale simulation exercises. IR-03 provides incident response testing through red team exercises, purple team collaboration, and scenario-based simulations that test detection, response, and recovery holistically. PM-14 ensures continuous testing and monitoring of resilience capabilities. CA-02 conducts control assessments of resilience architecture: verifying that backup immutability is actually enforced, that recovery environments are truly isolated, that recovery time objectives can actually be met. CA-07 provides continuous monitoring of resilience indicators: backup completion rates, backup age, recovery environment readiness, certificate expiry timelines, and capacity headroom. Every test should produce findings, and findings should drive architectural improvements. Organisations that treat resilience testing as a compliance checkbox rather than a genuine learning exercise discover their gaps during real incidents rather than before them.
- Regulatory Alignment and Governance (PM-08, PM-09, PM-30, PL-02, SA-08): Cyber resilience is increasingly regulated. PM-08 establishes critical infrastructure plans aligned with regulatory requirements: DORA mandates ICT risk management frameworks with specific resilience requirements for financial entities. PM-09 develops the risk management strategy that balances resilience investment against organisational risk appetite. PM-30 provides supply chain risk management: understanding and managing resilience dependencies on third-party providers (cloud platforms, SaaS applications, managed services) whose failure could disrupt critical services. PL-02 documents resilience architecture in system security plans that demonstrate regulatory compliance. SA-08 applies security and privacy engineering principles to resilience design: defence in depth (multiple independent resilience mechanisms), least privilege (recovery access restricted to minimum necessary), and fail-safe defaults (systems fail to a safe state rather than an insecure one). Board-level reporting on resilience posture should include: recovery time objective vs actual recovery time in tests, backup integrity verification results, and resilience gap remediation progress.
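The tiered service catalogue and the board-level comparison of recovery time objective against actual recovery time in tests can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the tier tolerances are the example figures given above, while the service names, the measured recovery times, and the names Service, RecoveryTest, and tolerance_gaps are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

# Impact tolerance per tier, using the example figures from the text.
TIER_TOLERANCE = {
    1: timedelta(hours=2),
    2: timedelta(hours=24),
    3: timedelta(hours=72),
}


@dataclass(frozen=True)
class Service:
    name: str
    tier: int


@dataclass(frozen=True)
class RecoveryTest:
    service: Service
    actual_recovery: timedelta  # measured in the most recent recovery test


def tolerance_gaps(tests):
    """Return (service name, overrun) for each test that exceeded its tier tolerance."""
    gaps = []
    for test in tests:
        limit = TIER_TOLERANCE[test.service.tier]
        if test.actual_recovery > limit:
            gaps.append((test.service.name, test.actual_recovery - limit))
    return gaps


# Hypothetical test results, for illustration only.
tests = [
    RecoveryTest(Service("payment-processing", 1), timedelta(hours=3, minutes=30)),
    RecoveryTest(Service("customer-portal", 2), timedelta(hours=6)),
    RecoveryTest(Service("internal-reporting", 3), timedelta(hours=80)),
]

for name, overrun in tolerance_gaps(tests):
    print(f"{name}: exceeded impact tolerance by {overrun}")
```

Reporting the overrun rather than a simple pass/fail flag gives the remediation backlog a natural ordering: the largest overruns on the highest tiers come first.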
When to Use
This pattern is essential for any organisation where a prolonged technology outage would cause significant financial, reputational, or safety harm. It is particularly critical for: financial services organisations subject to DORA or operational resilience regulation, healthcare organisations where system availability affects patient safety, critical national infrastructure operators, organisations that have experienced (or closely observed peers experiencing) destructive cyber attacks, any organisation with high-value data that would be targeted by ransomware, and organisations with complex technology estates where recovery sequence and dependencies are not well understood.
When NOT to Use
Very small organisations with simple technology environments and a high tolerance for downtime may find the full resilience architecture disproportionate, though basic backup and recovery capability is always appropriate. Organisations in an early startup phase, where the entire technology estate can be rebuilt from code repositories in hours, may prioritise development speed over resilience architecture. Environments that are entirely ephemeral and stateless (pure functions with no persistent data) have different resilience characteristics and may not need traditional backup and recovery.
Typical Challenges
Cost is the primary barrier: resilient architecture requires investment in redundancy, backup infrastructure, recovery environments, and testing programmes that deliver no visible benefit until an incident occurs. Recovery testing can disrupt production operations and requires dedicated time from teams already under pressure. Organisations discover circular dependencies during recovery exercises: the backup system depends on DNS, which depends on Active Directory, which is the thing being recovered. Cloud provider resilience is often assumed but not verified: multi-region deployments may share control plane dependencies that create correlated failures. Immutable backup storage increases cost and complexity compared to traditional backup approaches. The human factor is critical and often underestimated: recovery under the stress of an active incident, with exhausted teams and executive pressure, is fundamentally different from a planned test. Shadow IT and undocumented systems create recovery gaps: you cannot recover what you do not know exists. Third-party dependencies (SaaS providers, cloud platforms) may have their own resilience limitations that constrain your recovery timeline. Keeping recovery environments current and tested requires ongoing operational investment that competes with feature development.
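Circular dependencies of this kind can be found on paper before a recovery exercise exposes them. The sketch below assumes the recovery dependencies have been written down as a simple mapping from each component to the components it needs in order to start. The component names and the find_cycle helper are hypothetical, and a depth-first search is only one of several ways to detect a cycle.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    deps maps each component to the components it needs before it can be recovered.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path: we have looped back to it.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# The cycle described in the text: restoring the directory needs the backup
# system, which needs DNS, which needs the directory.
production_only = {
    "active-directory": ["backup-system"],
    "backup-system": ["dns"],
    "dns": ["active-directory"],
}

# Same services, but the backup system resolves names through an independent
# DNS in the recovery environment (hypothetical name).
with_recovery_dns = {
    "active-directory": ["backup-system"],
    "backup-system": ["recovery-dns"],
    "recovery-dns": [],
}

print(find_cycle(production_only))
print(find_cycle(with_recovery_dns))
```

A check like this is only as good as the dependency map it is given, so undocumented systems remain a blind spot.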
Threat Resistance
Cyber Resilience addresses the threats that preventive controls alone cannot fully mitigate. Ransomware that encrypts production data is neutralised by immutable backups in isolated storage that the attacker cannot reach or corrupt (CP-09, CP-06, SC-28). Destructive wiper malware that destroys systems is countered by the ability to reconstitute from immutable images in an isolated recovery environment (CP-07, CP-10). Supply chain attacks that compromise trusted software are contained by verified software repositories and integrity checking during recovery (SI-07, SR-10). Advanced persistent threats that compromise the domain controller are addressed by independent identity recovery capability and break-glass access (IA-02, AC-02, CP-02). Insider threats that sabotage backup infrastructure are mitigated by separation of duties, immutable storage, and multi-party authorisation for backup administration (AC-05, CP-09, AC-02). Cloud provider outages are addressed by multi-region architecture and provider-independent recovery capability (SC-36, CP-07). Coordinated attacks targeting both production and backup simultaneously are countered by air-gapped recovery environments with independent credentials (SC-07, AC-02). The fundamental architectural principle is that recovery infrastructure must exist in a separate trust domain that does not share credentials, network paths, or administrative access with the production environment.
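The separate-trust-domain principle lends itself to a simple automated check: list what each environment relies on and flag any overlap. The following is a minimal sketch under that assumption. The TrustDomain structure, the shared_dependencies helper, and all inventory values are hypothetical, and a real assessment would draw these sets from configuration and identity data rather than hand-written literals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustDomain:
    name: str
    credential_stores: frozenset
    network_paths: frozenset
    admin_groups: frozenset


def shared_dependencies(a, b):
    """Return the non-empty overlaps between two trust domains, keyed by category."""
    overlaps = {
        "credential_stores": a.credential_stores & b.credential_stores,
        "network_paths": a.network_paths & b.network_paths,
        "admin_groups": a.admin_groups & b.admin_groups,
    }
    return {category: sorted(items) for category, items in overlaps.items() if items}


# Hypothetical inventories, for illustration only.
production = TrustDomain(
    name="production",
    credential_stores=frozenset({"corp-ad"}),
    network_paths=frozenset({"core-wan"}),
    admin_groups=frozenset({"infra-admins"}),
)
recovery = TrustDomain(
    name="recovery",
    credential_stores=frozenset({"vault-idp", "corp-ad"}),  # corp-ad should not be here
    network_paths=frozenset({"oob-link"}),
    admin_groups=frozenset({"recovery-admins"}),
)

violations = shared_dependencies(production, recovery)
print(violations or "no shared credentials, network paths, or admin groups")
```

An empty result is necessary but not sufficient evidence of isolation: it shows only that the declared inventories do not overlap.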
Assumptions
The organisation has identified its critical business services and their technology dependencies. Executive sponsorship exists for resilience investment (resilience costs money upfront and pays off during incidents). IT operations teams have the skills to implement and maintain backup and recovery infrastructure. Network architecture supports the creation of isolated recovery environments. The organisation can tolerate the operational overhead of maintaining and testing recovery capabilities. Regulatory requirements for operational resilience are understood and prioritised.
Developing Areas
- Immutable backup testing automation is improving but most organisations still rely on manual quarterly restoration exercises. The gap between backup completion (automated, measured) and backup recoverability (rarely tested, poorly measured) means that many organisations discover corruption or incompatibility only during real incidents. Emerging solutions like automated daily restore-and-validate pipelines are available from major vendors but adoption remains below 20% even in financial services.
- Cross-border disaster recovery coordination is increasingly complex as data residency requirements multiply. GDPR, DORA, and equivalent regulations create conflicting requirements where backup data must be geographically distributed for resilience but confined to specific jurisdictions for compliance. Organisations with operations spanning EU, UK, US, and APAC face recovery architectures constrained by legal geography rather than optimal network topology.
- DORA compliance measurement lacks standardised metrics. The regulation mandates operational resilience testing but does not define pass/fail criteria for recovery time objectives, backup integrity verification, or third-party resilience assessment. Regulatory supervisors are developing expectations through supervisory dialogue rather than published benchmarks, creating uncertainty for organisations attempting to demonstrate compliance.
- Chaos engineering applied to security resilience -- intentionally injecting security failures to test detection and recovery -- is gaining traction but remains controversial. Netflix-style failure injection for availability is well understood, but deliberately simulating credential compromise, backup corruption, or identity infrastructure failure in production carries risks that most organisations are unwilling to accept without mature safety mechanisms.
- Ransomware recovery time objectives are being tested against increasingly sophisticated adversary tactics. Modern ransomware groups target backup infrastructure specifically, and average dwell times of 10-14 days before encryption allow thorough reconnaissance of recovery capabilities. The emerging response -- isolated recovery environments with independent identity and network infrastructure -- is architecturally sound but operationally expensive and rarely tested at the frequency needed to maintain confidence.
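The integrity half of an automated restore-and-validate step (a hash computed at backup time and checked again before restoration) can be sketched with the Python standard library. This is an illustration under stated assumptions rather than any vendor's pipeline: the file name, the payload, and the helpers sha256_of and verify_before_restore are hypothetical, and SHA-256 is chosen here only as a widely available hash.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_before_restore(path, recorded_digest):
    """Return True only if the file still matches the digest recorded at backup time."""
    return hmac.compare_digest(sha256_of(path), recorded_digest)


# Demonstration with a throwaway file standing in for a backup artefact.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "tier1-db.bak"  # hypothetical file name
    backup.write_bytes(b"example backup payload")

    recorded = sha256_of(backup)  # computed at backup time
    intact = verify_before_restore(backup, recorded)

    with open(backup, "ab") as fh:  # simulate tampering or corruption
        fh.write(b"tampered")
    still_valid = verify_before_restore(backup, recorded)

print(intact, still_valid)
```

The recorded digest is only useful if an attacker who can alter the backup cannot also alter the digest, so it belongs in the same separate trust domain as the recovery infrastructure. A matching hash shows the backup is unchanged, not that it restores to a working service, which is why the restore step itself still needs to be exercised.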