It goes beyond traditional business continuity planning. Where a business continuity plan (BCP) asks "do we have a plan?", operational resilience modeling asks "does the plan actually work — and can we prove it?" The distinction matters because regulators, boards, and customers are no longer satisfied with documented intentions. They want evidence.
The shift is already happening. The EU's Digital Operational Resilience Act (DORA) took effect in January 2025. The UK's FCA and PRA required firms to have mapped their Important Business Services and set impact tolerances by March 31, 2025. Canada's OSFI Guideline E-21 mandates full adherence by September 2026. Switzerland's FINMA issued formal operational resilience guidance in November 2025. And in October 2024, the Bank of England ran SIMEX 24 — a sector-wide simulation that tested whether the entire UK financial system could survive a total infrastructure shutdown and restart.
The message from regulators is consistent and clear: resilience must be modeled, tested, and continuously demonstrated — not just described in a document.
Why Static Plans Are No Longer Enough
For most of the last two decades, operational resilience meant having a business continuity plan. A document — usually written in Word, stored in SharePoint, reviewed annually — that described what the organization would do if a disruption occurred. Most of these plans were compliance artifacts, not operational tools.
Three forces have made this approach untenable.
Infrastructure complexity has outpaced documentation. The average enterprise now runs hundreds of interconnected services across multiple cloud providers, third-party APIs, and internal systems. The web of dependencies between these components is too complex, too dynamic, and too distributed for any static document to accurately represent. By the time a Word-based BCP is reviewed and approved, the infrastructure it describes has already changed.
The cost of getting it wrong has escalated. A 2025 study by New Relic found that the median annual cost of high-impact IT outages has reached $76 million per organization. Cockroach Labs' State of Resilience 2025 report found that 100% of surveyed executives reported outage-related revenue losses in the prior year, with organizations averaging 86 outages annually. PagerDuty's research showed customer-impacting incidents increased 43% year-over-year, with each incident costing approximately $800,000. At these stakes, untested plans are unacceptable.
Regulators now demand evidence, not assertions. The regulatory shift is unmistakable. DORA doesn't just require financial institutions to have resilience plans — it mandates threat-led penetration testing (TLPT) on live production systems every three years for significant entities, and annual penetration testing for all in-scope firms. The UK's PRA expects firms to test "progressively more severe but plausible scenarios" proportionate to their size and complexity. OSFI's E-21 expects scenario testing to mature over time, moving from paper-based assessments to full simulations.
The era of "we have a plan" is over. The era of "we can prove our plan works" has begun.
The Core Concepts of Operational Resilience Modeling
Operational resilience modeling rests on three foundational concepts that work together. Understanding each one — and how they connect — is essential for building a model that's useful rather than decorative.
Important Business Services (IBS)
An Important Business Service is an external-facing service delivered to customers, clients, or market participants that, if disrupted, could cause harm to consumers, threaten the firm's safety and soundness, or risk financial stability. The UK's PRA and FCA formalized this concept in their 2021 operational resilience framework, and it has since been adopted — under various names — by regulators globally.
Identifying your Important Business Services is the starting point for operational resilience modeling. This is closely related to the Minimum Viable Company (MVC) concept popularized by PwC — the smallest set of services, systems, and processes that must remain operational for the organization to survive a major disruption. Your IBS list defines the boundary of what matters most.
For a bank, IBS might include payment processing, account access, and lending decisions. For a SaaS company, the core application, authentication, and billing. For a healthcare provider, patient-facing systems, scheduling, and records access.
Impact Tolerances
An impact tolerance is the maximum level of disruption that an Important Business Service can sustain before the harm becomes intolerable — to customers, to the firm, or to the broader market. It is not the same as a recovery time objective (RTO). RTOs describe how fast you aim to recover. Impact tolerances describe how much disruption is acceptable before recovery, measured in time, transaction volume, data loss, or customer impact.
Setting impact tolerances requires both quantitative data and qualitative judgment. PwC recommends a five-step approach: identify your services, establish baseline metrics, develop a rubric for assessing harm, set the tolerance levels, and then validate them against real data. The key question at each step is: at what point does disruption to this service cause unacceptable harm?
Impact tolerances acknowledge a reality that traditional business continuity planning often ignores: disruptions will happen. The question isn't whether you can prevent all failures, but whether you can keep the damage within survivable bounds.
Scenario Testing
Scenario testing is the mechanism that connects IBS identification and impact tolerances to real-world validation. It answers the question: given a specific disruption, do we stay within our impact tolerances — or do we breach them?
The PRA's supervisory statement (SS1/21) is explicit about what this means in practice. Firms must test against "severe but plausible" scenarios, informed by previous incidents within their organization, across their industry, and in other sectors. Testing should cover the availability and integrity of resources — people, technology, facilities, and third-party providers — and should combine both announced and unannounced exercises.
The Bank of England demonstrated what sector-wide scenario testing looks like with SIMEX 24 in October 2024. The exercise simulated a major infrastructure failure that required a complete sector shutdown and restart of the UK financial system. Over 40 financial institutions and market infrastructure providers participated, coordinated by the Cross Market Operational Resilience Group (CMORG).
How Operational Resilience Modeling Works in Practice
The modeling process follows a clear sequence. Each phase builds on the one before it, and skipping a phase undermines everything that follows.
Phase 1: Map Your Dependencies
Every operational resilience model starts with dependency mapping — identifying every service, system, third-party provider, and infrastructure component that supports your Important Business Services, and documenting how they connect. This isn't an architecture diagram. It's a comprehensive graph of every direct, transitive, and shared dependency that your critical services rely on.
Effective dependency mapping combines multiple data sources: infrastructure-as-code configurations (Terraform, CloudFormation, Kubernetes manifests), network traffic analysis, distributed tracing data, and cloud provider metadata. The goal is to capture actual dependencies, not just documented ones — because production behavior frequently diverges from documentation.
This phase also identifies single points of failure: components where a single failure would cascade across multiple services simultaneously. These are the highest-risk nodes in your dependency graph and the ones most likely to cause you to breach your impact tolerances.
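The shared-dependency idea behind SPOF detection can be sketched in a few lines. This is a minimal illustration, not a production mapper: the service names and the `DEPENDS_ON` graph are hypothetical, and the heuristic simply flags any component that more than one Important Business Service transitively relies on.

```python
from collections import defaultdict

# Hypothetical dependency edges: each key relies on the components it lists.
DEPENDS_ON = {
    "payments":     ["api-gateway", "payments-db", "auth-provider"],
    "accounts":     ["api-gateway", "accounts-db", "auth-provider"],
    "billing":      ["billing-api", "payments-db"],
    "api-gateway":  ["dns", "load-balancer"],
    "auth-provider": ["dns"],
}

def transitive_deps(node, graph, seen=None):
    """All direct and transitive dependencies of a node."""
    seen = set() if seen is None else seen
    for dep in graph.get(node, []):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, graph, seen)
    return seen

def shared_spofs(services, graph):
    """Components that more than one Important Business Service
    transitively depends on -- candidate single points of failure."""
    dependents = defaultdict(set)
    for svc in services:
        for dep in transitive_deps(svc, graph):
            dependents[dep].add(svc)
    return {c: s for c, s in dependents.items() if len(s) > 1}

ibs = ["payments", "accounts", "billing"]
for component, affected in sorted(shared_spofs(ibs, DEPENDS_ON).items()):
    print(f"{component}: shared by {sorted(affected)}")
```

Even on this toy graph, the traversal surfaces non-obvious results: `dns` never appears as a direct dependency of any Important Business Service, yet it sits underneath all three payment-adjacent services.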
Phase 2: Define Impact Tolerances
With your services and dependencies mapped, set impact tolerances for each Important Business Service. This requires cross-functional input — engineering leaders understand the technical constraints, business leaders understand the financial and reputational impact, and compliance teams understand the regulatory thresholds.
Impact tolerances typically include:
- Maximum tolerable downtime — How long can this service be unavailable before harm becomes unacceptable?
- Maximum tolerable degradation — Can the service operate in a degraded state? For how long?
- Transaction or data impact — How many transactions can be lost or delayed? What volume of data loss is acceptable?
- Customer impact threshold — How many customers can be affected before the impact is intolerable?
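A tolerance defined along those four dimensions can be captured as a simple record with a breach check. The field names, thresholds, and service name below are hypothetical placeholders, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class ImpactTolerance:
    """Hypothetical impact tolerance for one Important Business Service."""
    service: str
    max_downtime_min: int        # maximum tolerable full outage, minutes
    max_degraded_min: int        # maximum tolerable degraded operation, minutes
    max_customers_affected: int  # customer impact threshold

    def breached(self, downtime_min: int, degraded_min: int, customers: int) -> bool:
        """True if any single dimension exceeds its tolerance."""
        return (downtime_min > self.max_downtime_min
                or degraded_min > self.max_degraded_min
                or customers > self.max_customers_affected)

payments = ImpactTolerance("payment-processing", max_downtime_min=45,
                           max_degraded_min=240, max_customers_affected=10_000)
# A 60-minute outage breaches the 45-minute downtime tolerance,
# even though customer impact stayed within bounds.
print(payments.breached(downtime_min=60, degraded_min=0, customers=2_500))
```

Encoding tolerances as data rather than prose is what makes the simulation phase possible: a scenario run can check every tolerance automatically instead of relying on someone rereading a document.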
Phase 3: Simulate Failure Scenarios
This is where modeling becomes operational. Take a specific failure scenario — a cloud provider outage, a database failure, an authentication provider going down, a DDoS attack — and propagate it through your dependency graph. Track which services fail, in what order, how quickly the failure cascades, and whether the resulting disruption stays within your defined impact tolerances.
Effective scenario simulation isn't theoretical. It should produce concrete, quantifiable outputs:
- Number of services affected (directly and transitively)
- Estimated time to breach impact tolerances
- Revenue impact based on affected service criticality
- Whether backup services and manual workarounds hold
- Maximum recovery time across the affected dependency chain
The PRA specifically expects firms to test scenarios where they anticipate exceeding their impact tolerances. The point isn't to confirm that easy scenarios are survivable — it's to identify which realistic scenarios would break you, so you can invest in preventing those specific failures.
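The cascade propagation at the heart of this phase can be sketched as a breadth-first walk over a reversed dependency graph. The graph, tolerance values, and service names below are hypothetical, and real simulations would model partial degradation and failover rather than all-or-nothing failure:

```python
from collections import deque

# Hypothetical reversed edges: component -> things that depend on it.
DEPENDENTS = {
    "payments-db":   ["payment-api"],
    "payment-api":   ["checkout", "billing"],
    "auth-provider": ["checkout", "account-portal"],
}

# Hypothetical impact tolerances: maximum tolerable downtime in minutes.
TOLERANCE_MIN = {"checkout": 45, "billing": 240, "account-portal": 120}

def simulate(failed: str, outage_min: int):
    """Propagate a component failure downstream and report which
    services breach their impact tolerance during the outage."""
    affected, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for dep in DEPENDENTS.get(node, []):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    breached = {s for s in affected
                if s in TOLERANCE_MIN and outage_min > TOLERANCE_MIN[s]}
    return affected, breached

affected, breached = simulate("payments-db", outage_min=90)
print(f"affected: {sorted(affected)}")
print(f"breached: {sorted(breached)}")
```

In this toy run, a 90-minute database outage cascades through `payment-api` to `checkout` and `billing`, but only `checkout` breaches its tolerance — exactly the kind of per-service answer the output list above calls for.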
Phase 4: Quantify and Report
The outputs of scenario simulation feed directly into reporting — both for internal governance and regulatory compliance. Boards need to understand the organization's resilience posture in terms they can act on: which services are at highest risk, where the single points of failure are, what the estimated financial impact of key failure scenarios looks like, and whether investments in redundancy are reducing risk.
Regulators need evidence that testing has occurred, that impact tolerances are being monitored, and that the organization has remediation plans for identified gaps. Static Word documents don't satisfy these requirements. Continuously updated, simulation-backed resilience reports do.
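Translating a simulation's affected-service list into a board-ready dollar figure can be as simple as multiplying through per-service revenue rates. The rates and service names here are hypothetical illustrations:

```python
# Hypothetical per-service revenue rates ($/hour) used for reporting.
REVENUE_PER_HOUR = {"checkout": 180_000, "billing": 40_000, "account-portal": 5_000}

def estimated_revenue_impact(affected_services, outage_hours: float) -> float:
    """Rough revenue at risk: sum of affected services' hourly rates
    multiplied by the outage duration. Unknown services count as zero."""
    return sum(REVENUE_PER_HOUR.get(s, 0) for s in affected_services) * outage_hours

# A 4-hour outage touching checkout and billing.
print(f"${estimated_revenue_impact(['checkout', 'billing'], 4):,.0f}")
```

A linear rate-times-duration estimate is deliberately crude — real models would weight by time of day and recovery curve — but even this level of quantification turns "the outage was bad" into a number a board can compare against the cost of redundancy.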
The Regulatory Landscape: Who Requires What
Operational resilience modeling isn't a nice-to-have for organizations in regulated industries. It's either explicitly required or strongly implied by every major financial regulator in the world. Here's where things stand.
European Union: DORA
The Digital Operational Resilience Act took effect January 17, 2025, applying to all financial entities in the EU — banks, insurers, investment firms, payment institutions, and their critical ICT third-party providers. DORA requires comprehensive ICT risk management frameworks, mandatory incident reporting, regular resilience testing, third-party risk management, and information sharing on ICT threats. For approximately 200–250 "significant" entities, DORA mandates threat-led penetration testing (TLPT) on live production systems at least every three years, following the TIBER-EU methodology.
United Kingdom: FCA/PRA
The UK's operational resilience framework, enforced jointly by the FCA, PRA, and Bank of England, required firms to have identified their Important Business Services and set impact tolerances by March 31, 2025. But that deadline was a beginning, not an endpoint. By 2026, the regulatory focus has shifted from "Have you identified your services?" to "Can you prove you are staying within your tolerances today?" Supervisory reviews now expect empirical, scenario-based testing — not judgment-based assessments.
Canada: OSFI E-21
The Office of the Superintendent of Financial Institutions published Guideline E-21 in August 2024, establishing expectations for operational resilience, business continuity risk management, and crisis management for all federally regulated financial institutions. Full adherence is expected by September 2026, with scenario testing for all critical operations completed by September 2027.
Switzerland: FINMA
FINMA issued formal operational resilience guidance in November 2025, defining operational resilience as an institution's ability to identify threats, protect against them, respond to disruptions, and restore critical functions within set tolerance levels — emphasizing that disruptions in interconnected financial systems can have systemic effects.
The regulatory direction is identical across jurisdictions: identify your critical services, set measurable impact tolerances, test against realistic scenarios, and maintain continuous evidence that your resilience posture is current. Operational resilience modeling is the mechanism that makes all of this possible.
What Modeling Reveals That Static Planning Doesn't
The difference between a documented business continuity plan and an operational resilience model is the difference between a hypothesis and an experiment. Plans describe what should happen. Models test what actually happens.
Hidden cascade paths. Static plans typically map direct dependencies — "our app depends on this database." Modeling reveals transitive dependencies that span three, four, or five links in the chain. When AWS us-east-1 went down in December 2021, organizations discovered that services they thought were independent shared control planes, DNS resolvers, and certificate authorities deep in the dependency chain. A model would have revealed those hidden paths before the outage exposed them.
Tolerance breaches before they happen. Without simulation, the only way to know whether a failure scenario breaches your impact tolerances is to experience it. Modeling lets you test hundreds of scenarios — including combinations of failures that haven't happened yet — and identify exactly which ones exceed your thresholds. This transforms resilience from reactive to predictive.
Quantified risk for business decisions. "Our infrastructure is resilient" is not actionable information for a board. "A four-hour cloud outage would cascade to 14 services, impact $2.3M in revenue, and breach our impact tolerance for payment processing within 45 minutes" is. Modeling translates technical risk into financial and operational terms that drive investment decisions.
Backup and workaround validation. Plans often assume that backup services and manual workarounds will function as designed. Modeling tests those assumptions. Does the read replica actually handle production load? Does the manual invoicing process scale beyond 50 orders? Does the backup authentication provider integrate with all downstream services? Simulation answers these questions before a crisis forces you to find out.
From Annual Exercise to Continuous Capability
The most significant shift in operational resilience thinking is the move from periodic to continuous. Annual tabletop exercises and quarterly plan reviews are giving way to always-on resilience monitoring.
This shift is driven by a simple reality: infrastructure changes constantly. New services are deployed weekly. Third-party providers update their SLAs. Traffic patterns shift. Teams restructure. A resilience model from three months ago may not reflect today's dependency graph — and a model that doesn't reflect today's reality provides false confidence.
Continuous operational resilience modeling means:
- Live dependency maps that update as infrastructure changes, rather than static diagrams that decay
- Ongoing SPOF detection that flags new single points of failure as they emerge, rather than discovering them during annual reviews
- Resilience scoring that tracks your posture over time, showing whether investments in redundancy are actually reducing risk
- On-demand simulation that lets teams test "what if" scenarios before deploying changes, not just during scheduled exercises
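The "ongoing SPOF detection" capability above amounts to recomputing the flagged-component set on every graph snapshot and diffing it against the last review. A minimal sketch, using a naive more-than-one-dependent heuristic and hypothetical snapshot data:

```python
from collections import Counter

def flagged_spofs(graph: dict) -> set:
    """Components that more than one service depends on directly --
    a naive single-point-of-failure heuristic."""
    counts = Counter(dep for deps in graph.values() for dep in deps)
    return {component for component, n in counts.items() if n > 1}

def newly_emerged(last_graph: dict, current_graph: dict) -> set:
    """SPOFs present in the current snapshot but absent at the last review."""
    return flagged_spofs(current_graph) - flagged_spofs(last_graph)

# Hypothetical snapshots: a shared cache was introduced since the last review.
last = {"checkout": ["db", "auth"], "billing": ["db"]}
now  = {"checkout": ["db", "auth", "cache"], "billing": ["db", "cache"]}
print(newly_emerged(last, now))
```

Run continuously against live dependency data, this diff is what turns SPOF detection from an annual-review discovery into an alert that fires the week the risky change ships.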
The UK's FCA has been explicit about this direction. Their 2026 supervisory focus is on continuous monitoring — "Can you prove you are staying within your tolerances today?" — not periodic attestation. Organizations that treat operational resilience as an annual project will find themselves perpetually behind both regulatory expectations and actual risk.
Getting Started
Operational resilience modeling doesn't require a multi-year program or a seven-figure consulting engagement. The foundational steps can be completed in days, not months.
Four steps to your first resilience model
- Identify your Important Business Services. Start with the services your customers interact with directly. For each one, ask: if this service were unavailable for 24 hours, what would the impact be on customers, revenue, and regulatory standing? Services where the answer is "severe" belong on your IBS list.
- Map the dependencies underneath. For each Important Business Service, trace the dependency chain backward: what databases, APIs, infrastructure components, and third-party services does it rely on? Automated dependency mapping accelerates this process significantly — import your infrastructure-as-code files or use guided setup to build the graph in minutes rather than weeks.
- Set impact tolerances and simulate. For each IBS, define how much disruption is tolerable. Then simulate failure scenarios against your dependency map: cloud provider outage, database failure, authentication provider down. Track whether each scenario stays within your tolerances or breaches them.
- Generate evidence and iterate. Document your model, your tolerances, your simulation results, and your remediation plans. This evidence satisfies regulatory requirements and gives your board a clear picture of organizational resilience. Then update continuously — resilience is a capability, not a deliverable.
The Bottom Line
Operational resilience modeling is the bridge between planning and proof. It takes the concepts that regulators require — Important Business Services, impact tolerances, scenario testing — and makes them testable, quantifiable, and continuous.
The organizations that model their resilience don't just pass audits faster. They prevent cascading failures by identifying hidden dependencies before outages expose them. They deploy changes with confidence because they can simulate the blast radius before pushing to production. And they communicate infrastructure risk in terms that engineering leaders, boards, and regulators all understand.
Cockroach Labs found that only 20% of executives feel their organizations are fully prepared to prevent or respond to outages, and 79% admit inadequate readiness for new operational resilience regulations like DORA and NIS2. The gap between regulatory expectation and organizational capability is wide — but it's closeable, and it starts with building a model you can actually test.
The fault lines in your infrastructure are already there. Modeling is how you find them before your next disruption does.
Sources & Related Reading
- Cockroach Labs — The State of Resilience 2025
- New Relic — Businesses Face $76 Million Median Annual Cost from High-Impact Outages
- PagerDuty — The Cost of Downtime: Customer-Impacting Incidents Up 43%
- Bank of England — SIMEX 24: Testing UK Financial Sector Resilience
- PRA SS1/21 — Impact Tolerances for Important Business Services
- Enactia — FCA & PRA Operational Resilience 2026: From Compliance to Continuous Resilience
- What Is Infrastructure Dependency Mapping? A Complete Guide
- What Is a Minimum Viable Company — And Why Every CTO Needs to Define Theirs
- Business Continuity Reports Are Mandatory. Why Are You Still Writing Them in Word?
- What Happens When AWS Goes Down? Every Major Outage Since 2017
- How Companies Actually Maintain Business Continuity Plans (And Why Most Don't)