On October 20, 2025, the cloud infrastructure that powers a huge swathe of the internet hiccupped in a big way. The outage at Amazon Web Services (AWS) is a stark reminder of what happens when a single provider suffers a fault, and it prompts a timely question: is there a more robust alternative?
What happened?
The timeline & cause
- Around 12:11 AM PDT (≈ 07:11 UTC) on October 20, AWS began experiencing increased error rates and latency across multiple services in its US-East-1 region (Northern Virginia).
- The root cause: a DNS-resolution issue affecting the API endpoints for AWS’s DynamoDB (its scalable database service) within that region. In effect, services that rely on DynamoDB could not reliably resolve the service endpoints.
- Because the issue hit DNS (the “phone directory” of the web) and DynamoDB (a key internal component) in a central region, the effects cascaded far beyond one service. Many sites and applications that “only” used AWS infrastructure in that region were affected.
- AWS worked on multiple mitigation paths in parallel, and by around 02:24 AM PDT (≈ 09:24 UTC) the underlying issue was “managed”.
- Nevertheless, services still experienced backlogs, delayed recovery, and increased latency for hours afterward.
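To make the DNS failure mode concrete, here is a minimal sketch of a client-side check that walks a list of regional endpoints and returns the first one that still resolves. The function name `first_resolvable` and the ordering of `ENDPOINTS` are illustrative assumptions, not anything AWS prescribes, and resolving a second region's endpoint only helps if your data is actually replicated there (for example via DynamoDB global tables).

```python
import socket
from typing import Callable, Iterable, Optional


def first_resolvable(
    hostnames: Iterable[str],
    resolve: Callable[..., object] = socket.getaddrinfo,
) -> Optional[str]:
    """Return the first hostname that DNS can resolve, or None if none can."""
    for host in hostnames:
        try:
            resolve(host, 443)
        except OSError:
            # socket.gaierror (a failed DNS lookup) is a subclass of OSError.
            continue
        return host
    return None


# Regional DynamoDB endpoints, preferred region first.
# The choice of fallback region here is an assumption for illustration.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.us-west-2.amazonaws.com",
]
```

Calling `first_resolvable(ENDPOINTS)` performs real DNS lookups. The `resolve` parameter exists so the same logic can be exercised with a fake resolver in tests, without touching the network.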
The scope & impact
- The outage affected thousands of apps and websites. Outage-monitoring services logged millions of user reports globally.
- Some of the impacted platforms included major names such as Snapchat, Fortnite, Signal, Duolingo, many bank apps in the UK, and even government services.
- The key takeaway from experts: this wasn’t a cyber-attack, but rather a systemic failure rooted in internal infrastructure, whether a human or operational error or a configuration fault.
- The incident reignited debate around the fragility of internet “single points of failure” - namely over-concentration of many services in one region or one provider.
Why this matters to you & your business
When your business runs workloads on a cloud provider like AWS, an incident in a major region (especially one many services default to) can lead to extended downtime, degraded performance or service loss. Even if your own service is architecturally sound, its dependencies (databases, load-balancers, DNS, support APIs) can introduce vulnerability. This incident shows that your cloud provider’s internal services matter.
The cost of downtime is not only direct lost revenue. It also brings reputational damage, service degradation, user churn and regulatory risk, and all of these are magnified when your infrastructure is “in one basket”. If you have services or users that must be globally available (UK, EU, APAC), then resilience across regions and providers is a mission-critical design factor.
Why Microsoft Azure stands out as a strong alternative
While no cloud provider is immune to failure, there are features and architectural approaches that make Azure compelling, especially if you want to build greater robustness into your cloud strategy.
- Global footprint & regional redundancy: Azure advertises one of the largest global footprints among the major clouds: many data-centre regions, enabling you to place workloads closer to users and distribute risk. So if one region suffers issues, you potentially have more (or different) regions to fall back on.
- Hybrid & multi-cloud flexibility: Azure emphasises hybrid cloud (mixing on-premises + cloud) and multi-cloud interoperability. That means you can design architectures that aren’t locked into one provider, one region, one failure domain. In other words: if you’re concerned about “all eggs in one basket”, Azure gives more options to spread them.
- Robust security & compliance: Azure provides built-in security features across compute, networking and identity (via Microsoft Entra ID, formerly Azure AD) and is certified against many global compliance regimes. Since an outage is ultimately a loss of service availability, mature security and governance are part of being resilient.
- Disaster recovery & fail-over capability: Azure provides services and tools explicitly targeting disaster recovery scenarios, automatic fail-overs, and geo-replication. For example, Azure Cosmos DB’s architecture supports geo-failover at the partition level. Designing with such tools means your business can recover more quickly when a region or endpoint fails.
- Enterprise readiness & integration: For businesses already using the Microsoft stack (Windows, SQL Server, Office 365 etc.), Azure integrates more seamlessly. This lowers friction in migrating, deploying resilient architectures and making use of cross-platform investments.
Important caveats - no provider is perfect
Azure, like any cloud platform, isn’t completely immune to outages - no provider is. The key difference lies in how each platform is architected and how effectively it mitigates and recovers from disruption.
- Complexity: Using multiple regions or providers introduces more moving parts - governance, latency, integration - but Azure’s tools and design flexibility make it easier to manage this complexity effectively. With the right architecture in place, you gain far greater control, scalability, and resilience.
- Value for money: Building in redundancy, multi-region capability, and fail-over adds value. With Azure, you’re investing in stronger continuity, better security, and superior recovery options that protect your business and your reputation when things go wrong.
- Vendor lock-in: While Azure’s hybrid and multi-cloud capabilities reduce risk, it’s always smart to design with interoperability in mind. This ensures your systems remain portable and your business stays in control of its data and infrastructure choices.
Call to Action: What should you do now?
Review your cloud footprint
- Where are your workloads hosted (which region/provider)?
- What dependencies do you have (databases, DNS, APIs, support paths)?
- What is your uptime/RTO (Recovery Time Objective) requirement?
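To put numbers on the uptime and RTO question, here is a small sketch that converts an uptime target into a downtime budget and checks an incident against an RTO. The function names and the example targets are assumptions for illustration. The incident window is taken from the timeline above (00:11 to 02:24 PDT, the point at which the underlying issue was addressed, not full recovery).

```python
from datetime import datetime, timedelta


def downtime_budget(uptime_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed within `period` for a given uptime target."""
    return period * ((100 - uptime_percent) / 100)


def exceeds_rto(start: datetime, end: datetime, rto: timedelta) -> bool:
    """True if the incident lasted longer than the recovery time objective."""
    return (end - start) > rto


# Incident window from the timeline above (local PDT times).
incident_start = datetime(2025, 10, 20, 0, 11)
incident_end = datetime(2025, 10, 20, 2, 24)
```

For example, a 99.9% target over a 30-day month leaves a budget of roughly 43 minutes, while the 133-minute window above would on its own exceed an RTO of one hour.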
Design for resilience
- Consider multi-region deployment (for example deploying to more than one region within Azure or mixing providers).
- Implement fail-over and redundancy for critical services (databases, APIs, load-balancers).
- Use provider-agnostic architecture patterns where feasible (containers, Kubernetes, IaC) so you can shift if needed.
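The fail-over idea in the list above can be sketched in a provider-agnostic way. This is a simplified illustration, not a production client: `call_with_failover`, `AllRegionsFailed` and the `operation` callback are hypothetical names, and a real implementation would add backoff, timeouts and narrower exception handling.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed."""


def call_with_failover(
    regions: Sequence[str],
    operation: Callable[[str], T],
    attempts_per_region: int = 2,
) -> T:
    """Run `operation(region)` against each region in order until one succeeds."""
    errors: List[Tuple[str, Exception]] = []
    for region in regions:
        for _ in range(attempts_per_region):
            try:
                return operation(region)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append((region, exc))
    # Every region was tried the configured number of times without success.
    raise AllRegionsFailed(errors)
```

The `operation` callback takes a region name and performs the actual request, so the same wrapper works whether the regions belong to one provider or several.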
Evaluate Azure as part of your strategy
- If you’re on AWS, now is a good time to check: could Azure (or a multi-cloud mix) provide better risk-mitigation for you?
- For green-field deployments, consider starting with Azure (or a hybrid design) to build in stronger resilience from day one.
Establish governance & cost controls
- Ensure you monitor and manage cost, because resilience often adds overhead.
- Define clear policies for fail-over, region-switching, and incident recovery.
- Run periodic tests of your architecture to ensure recovery processes work.
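A fail-over policy is easier to test when it is written down as code. The sketch below assumes one simple policy: switch to a standby site after a set number of consecutive failed health checks, and stay there until someone fails back manually. The class name, the threshold of three and the site labels are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverPolicy:
    """Switch to the standby after N consecutive failed health checks."""

    failure_threshold: int = 3
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, healthy: bool) -> str:
        """Record one health check of the primary; return the active site."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # No automatic fail-back: returning to primary is a manual step.
                self.active = "standby"
        return self.active


def run_drill(policy: FailoverPolicy, checks: List[bool]) -> List[str]:
    """Replay scripted health-check results; return the active site after each."""
    return [policy.record(ok) for ok in checks]
```

A periodic drill can then replay a scripted sequence of health-check results and confirm that the switch happens exactly where the policy says it should, before relying on it in a real incident.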
Conclusion
The AWS outage on 20 October 2025 serves as a reminder: even the biggest cloud providers can fail, and when they do, the ripple effects are massive. While AWS remains a strong platform, for businesses that demand high availability, global reach, and robust disaster resilience, adopting or incorporating Azure into your cloud strategy is a wise move.
By rethinking your architecture today and choosing multi-region, multi-cloud or hybrid setups, you can better protect your business against the next such outage.
Discover more about our Azure offerings and get in touch today if you'd like to review the options for your business.