The Email That Couldn't Be Read

Here's a scenario that happens more often than any BCM professional wants to admit. A team sits down to define their recovery time objectives. Microsoft 365 — email, calendars, Teams — gets a one-hour RTO. Reasonable. Email is critical. One hour feels aggressive but achievable.

Then the outage happens.

The email platform itself recovers in 45 minutes. Under target. Someone should be celebrating. But nobody can access it. Why? Because authentication has a four-hour RTO. The VPN that remote workers need to reach the authentication server has a six-hour RTO. The DNS resolution that the VPN depends on is handled by a third-party provider with no defined RTO at all.

The email platform is technically "recovered." It's sitting there, running, ready to serve messages. But the chain of dependencies standing between the user and their inbox is still broken. Eleven hours later, people finally start reading email again.

The documented RTO said one hour. Reality said eleven. The email team did nothing wrong. Their system performed perfectly. The RTO was still fiction.

RTOs Are a Chain, Not a Number

The fundamental problem is how organizations set RTOs. Each team defines recovery objectives for their own systems in isolation. The email team sets theirs. The identity and access management team sets theirs. The network team sets theirs. Nobody checks whether the math works end to end.

Think of it like a relay race. It doesn't matter how fast your anchor runner is if the second runner hasn't passed the baton yet. The race time isn't set by your fastest leg; it's the sum of every leg and every handoff.

But dependencies aren't even linear. They're a graph. Email depends on authentication. Authentication depends on Active Directory. Active Directory depends on DNS. DNS depends on network infrastructure. Network infrastructure depends on firewall policies and certificate authorities. Each layer has its own RTO, its own team, its own recovery procedures — and none of them were designed with the full chain in mind.

Your real RTO — the one your customers and executives will experience — is the longest sequential path through that dependency graph, not the number written in your BCP document. That number is a fiction that survives only because nobody tests the chain.
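To make "longest sequential path" concrete, here's a toy sketch in Python. The service names mirror the email story, but every hour figure is invented for illustration — the DNS and network values in particular are assumptions, since the story's third-party DNS provider had no defined RTO at all.

```python
from functools import lru_cache

# Each service's own documented RTO in hours, once its dependencies are up.
# All figures are illustrative; the DNS and network numbers are pure assumptions.
recovery_hours = {
    "email": 1.0,
    "auth": 4.0,
    "vpn": 6.0,
    "dns": 0.5,
    "network": 0.5,
}

# depends_on[s] lists the services s cannot recover without
depends_on = {
    "email": ["auth"],
    "auth": ["vpn"],
    "vpn": ["dns"],
    "dns": ["network"],
    "network": [],
}

@lru_cache(maxsize=None)
def effective_rto(service):
    """A service's real RTO: its own recovery time plus the slowest
    of its dependencies' effective RTOs (the longest sequential path)."""
    upstream = max((effective_rto(d) for d in depends_on[service]), default=0.0)
    return recovery_hours[service] + upstream

print(f"Documented email RTO: {recovery_hours['email']:.1f}h")
print(f"Effective email RTO:  {effective_rto('email'):.1f}h")
```

With these invented numbers, the documented one-hour email RTO becomes a twelve-hour effective RTO — which is exactly how a "one-hour" system produces an eleven-hour outage like the one in the opening story.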

Why This Keeps Happening

Dependencies live in people's heads

Ask the email administrator what Microsoft 365 depends on and they'll say "Azure AD and the internet." Ask the network engineer and they'll add DNS, VPN concentrators, firewall rules, and certificate stores. Ask the security team and they'll add the MFA provider, conditional access policies, and the SIEM that gates security alerts during recovery.

Nobody has the complete picture. The dependency knowledge is scattered across teams, and each team only sees their own slice. In February 2026, Microsoft's own MFA infrastructure failed with 504 Gateway Timeout errors. Organizations using Conditional Access policies — exactly the configuration Microsoft recommends — found themselves completely locked out of email, Teams, SharePoint, and VPN. The security control designed to protect them became the single point of failure that prevented recovery.

RTOs are set in isolation

Each team sets their recovery objective based on the criticality of their own service. The email team asks "how long can the business survive without email?" and arrives at one hour. The identity team asks "how long can the business survive without authentication?" and arrives at four hours — because not everything needs authentication immediately, and their recovery procedure is complex.

Both teams are right about their own service. But nobody checks the dependency math. If email needs authentication and authentication takes four hours, then the real email RTO is four hours plus whatever time email itself needs. The one-hour email RTO was set in a vacuum.

Nobody wants to own the gap

Here's the political hot potato: who pays to upgrade the authentication infrastructure so the email team can meet their one-hour RTO? The email team doesn't own authentication. The authentication team has no budget pressure to speed up their recovery — their own four-hour RTO satisfies their stakeholders. Meanwhile, HR says they can tolerate a day of email downtime but the trading desk says they need it within an hour. Both share the same authentication infrastructure. Whose RTO wins?

This is why the gap persists. Fixing it requires cross-team coordination, budget allocation, and organizational alignment that most companies don't have. So the fiction stays in the BCP and everyone hopes the outage never tests the math.

Five Seconds, Eight Systems Deep

The email example is easy to grasp because everyone uses email. But consider what happens in those five seconds when you tap a credit card at a payment terminal.

The terminal reads the card via NFC, encrypts the data, and sends it to a payment gateway. The gateway routes the request to the acquiring bank's processor. The processor forwards it to the card network — Visa, Mastercard — where the transaction is detokenized, run through fraud detection algorithms, and forwarded to the issuing bank. The issuing bank checks available credit, runs its own fraud and velocity checks, and sends an authorization response back through the entire chain.

That's seven to eight distinct systems, each owned by a different organization, each with its own infrastructure, its own failure modes, and its own RTOs. A failure anywhere in that chain doesn't degrade payment processing — it stops it entirely.

On July 8, 2022, Canadians found out exactly how fragile that chain was. Rogers Communications suffered a nationwide network outage, and because Interac — Canada's debit payment network — had a single-provider dependency on Rogers with no redundancy, debit transactions and ATM access stopped working across the entire country. For over fourteen hours, people couldn't buy groceries, couldn't fill their cars with gas, couldn't withdraw cash. Telecom consultant Mark Goldberg called it "really, really poor network design" — a single dependency, several layers deep, that nobody had stress-tested.

The Rogers network itself wasn't even a "critical" system in most organizations' BCPs. But it was the dependency underneath the dependency underneath the payment system that millions of people relied on every day.

The Email Dependency Deep-Dive

Let's go back to email and actually map what the chain looks like, because most people dramatically underestimate it.

At first glance, email looks simple: user opens Outlook, Outlook connects to Exchange Online, messages flow. Three components. One-hour RTO. Easy.

Now add reality. The user needs to authenticate. That's Azure AD (now Entra ID). Authentication requires MFA — that's a third-party provider or Microsoft Authenticator. If the user is remote, they need VPN. The VPN needs DNS resolution to find the authentication endpoint. DNS needs the network infrastructure to be operational. The network needs firewall rules to allow the traffic. The firewall needs valid certificates to inspect TLS traffic. Certificates need a certificate authority. Active Directory replication needs to be current for group policies to resolve correctly.

Three components just became fifteen-plus systems. And each of those systems has its own recovery team, its own procedures, and its own timeline.

Now add the political layer. HR says they can tolerate email being down for a full business day. The trading desk says one hour maximum. Legal needs email for regulatory compliance and can't specify a tolerance because "it depends." All three departments share the same authentication infrastructure, the same VPN, the same DNS. If the authentication team sets a four-hour RTO because HR can tolerate a day, the trading desk's one-hour requirement is already impossible. But nobody in the room during the annual BCP review connects these dots because each team presents their recovery objectives in isolation.
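The departmental conflict is arithmetic, and a few lines of code can surface it. This is a hypothetical sketch: the tolerances and hours below are invented, and Legal's "it depends" is modeled as no stated tolerance at all.

```python
# Hypothetical figures: the chain every department shares, recovered in
# dependency order, plus email's own recovery time on top.
shared_chain_hours = {"network": 0.5, "dns": 0.5, "vpn": 6.0, "auth": 4.0}
email_itself_hours = 1.0
effective_email_rto = sum(shared_chain_hours.values()) + email_itself_hours

# Stated downtime tolerances per department; None means "it depends"
tolerances_hours = {"HR": 24.0, "Trading": 1.0, "Legal": None}

for dept, tol in tolerances_hours.items():
    if tol is None:
        print(f"{dept:8s} no stated tolerance -- cannot be validated")
    elif tol < effective_email_rto:
        print(f"{dept:8s} needs {tol:.0f}h, chain delivers "
              f"{effective_email_rto:.1f}h -- IMPOSSIBLE")
    else:
        print(f"{dept:8s} needs {tol:.0f}h, chain delivers "
              f"{effective_email_rto:.1f}h -- achievable")
```

Run against these numbers, HR's requirement passes, Trading's is flagged as impossible, and Legal's can't be checked at all — the three outcomes that never get put side by side in an annual BCP review.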

What Honest RTOs Look Like

Setting RTOs that reflect reality requires three things that most organizations skip.

Map the full dependency chain at the service level. Not at the infrastructure component level — individual S3 buckets or EC2 instances don't fail independently in ways that matter for RTO planning. Map at the service level: "email depends on authentication, which depends on VPN, which depends on DNS." This gives you the chain. The chain gives you the real RTO. Without the chain, you're setting numbers in a vacuum.

Sequence the recovery path and add up the timeline. Once you have the dependency chain, walk the recovery in order. You can't recover email until authentication is up. You can't recover authentication until Active Directory is replicating. You can't replicate Active Directory until DNS is resolving. Add up the recovery times along the longest sequential path. That sum is your actual RTO — and it will almost certainly be longer than whatever number is in the current plan.
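The sequencing step can be sketched as a small scheduling pass: topologically sort the dependency graph, then compute each service's earliest possible finish time, letting teams work in parallel wherever the graph allows. `graphlib` is in the Python standard library (3.9+); all names and hours here are invented.

```python
from graphlib import TopologicalSorter

# Illustrative hours only
recovery_hours = {"network": 0.5, "dns": 0.5, "vpn": 6.0, "auth": 4.0, "email": 1.0}
# depends_on[s] = the services that must finish before s can start
depends_on = {"dns": {"network"}, "vpn": {"dns"}, "auth": {"vpn"}, "email": {"auth"}}

# Walk the graph in dependency order; each service starts only when its
# slowest prerequisite has finished.
finish = {}
for service in TopologicalSorter(depends_on).static_order():
    start = max((finish[d] for d in depends_on.get(service, ())), default=0.0)
    finish[service] = start + recovery_hours[service]

for service, t in sorted(finish.items(), key=lambda kv: kv[1]):
    print(f"T+{t:5.1f}h  {service} recovered")
```

The printed timeline is the recovery sequence made explicit — and the last line of it, not the email team's documented number, is the RTO the business will actually experience.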

Stress-test with simulation scenarios. Don't just calculate the chain on paper — run the scenario. What if authentication recovers in two hours instead of four because you got lucky? What if it takes six because the backup also failed? What if DNS recovers first but the firewall rules weren't restored and traffic can't flow? Simulation exposes the failure modes that paper analysis misses. It's the difference between knowing the relay race time in theory and actually running the race.
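One cheap way to run the scenario many times is a Monte Carlo sketch: instead of a fixed RTO per link, give each link a range of plausible recovery times and sample the whole chain repeatedly. The triangular distributions and hour ranges below are invented for illustration, not drawn from any real incident data.

```python
import random

# (low, high, mode) hours for each sequential link in the chain;
# the mode is the documented RTO, the range is the "got lucky / backup
# also failed" spread. All figures are assumptions.
links = {
    "dns":   (0.1, 2.0, 0.5),
    "vpn":   (3.0, 10.0, 6.0),
    "auth":  (2.0, 8.0, 4.0),
    "email": (0.5, 3.0, 1.0),
}

def one_run(rng):
    # Sequential chain: each service starts only after the previous finishes
    return sum(rng.triangular(lo, hi, mode) for lo, hi, mode in links.values())

rng = random.Random(42)  # fixed seed so the sketch is reproducible
samples = sorted(one_run(rng) for _ in range(10_000))
p50 = samples[len(samples) // 2]
p95 = samples[int(len(samples) * 0.95)]
print(f"median end-to-end recovery: {p50:.1f}h, 95th percentile: {p95:.1f}h")
```

The useful output isn't the median — it's the gap between the median and the 95th percentile, which is the paper plan meeting the bad day.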

The Maintenance Problem

Even if you do all three of those things perfectly today, the map is wrong by next month. Engineering teams deploy new services. Dependencies shift. A team migrates authentication from on-premises AD to a cloud identity provider and the entire dependency chain changes. Someone adds a new third-party integration and doesn't tell the BCM coordinator because the feedback loop between deployments and continuity documentation doesn't exist.

Creating the dependency map isn't the hard part. Keeping it current as infrastructure changes — that's where it falls apart. Static diagrams decay. Manual inventories go stale. And the RTO numbers that were honest when you calculated them become fiction again within weeks.

The organizations that solve this don't solve it with better discipline. They solve it by making the dependency map a living artifact that updates when infrastructure changes, not when someone remembers to update a spreadsheet.

The Bottom Line

Your RTO is not a number. It's a chain. And unless you've mapped every link in that chain, sequenced the recovery path, and stress-tested it against realistic failure scenarios, the number in your BCP is a guess wearing a suit.

The email team's one-hour RTO was technically correct for their system. But the business didn't experience a one-hour outage. It experienced an eleven-hour outage. The difference between those two numbers is the dependency chain that nobody mapped, nobody sequenced, and nobody tested.

Every RTO in your plan right now has a dependency chain behind it. The question is whether you know what's in that chain — or whether you'll find out during the next outage.