Building for the Inevitable Next Cloud Outage – Part 1

by Daniel Bartholomew | Jul 10, 2023

Dan's innovative mindset and expertise in mission-critical application delivery have driven the development of the CloudFlow Supercloud Platform, revolutionizing the way businesses approach global application delivery and container orchestration.

The following is based on a talk by Pavel Nikolov of Section (acquired by Webscale in 2023) at the KubeCon+CloudNativeCon Europe 2022 event. This first post will discuss the challenges in building for the next cloud outage. Part Two will demonstrate how to deploy a Kubernetes application across clusters in multiple clouds and regions with built-in failover to automatically adapt to cloud outages. You can also read Pavel’s column on this topic in TechBeacon.

Every few months we read about the widespread impact of a major cloud outage. These events are unpredictable yet inevitable, and, quite frankly, they keep site reliability engineering (SRE) teams up at night. No matter what type of business you run, it is prohibitively expensive to deploy your applications everywhere in the world at once while still ensuring high availability.

Public cloud remains the most popular data center approach among the cloud native community, and multi-cloud adoption is growing. However, adopting a multi-cloud strategy isn’t as simple as hitting the “go” button. What’s more, despite best efforts at building out redundancy, cloud providers cannot guarantee 100% uptime. As such, it’s not a question of if your servers or services will go down, but when. And it will probably happen when you are either not prepared or least expect it (hello, middle-of-the-night support calls).

This is true for a number of reasons. For one, there are external factors outside the control of the public clouds, such as your Domain Name System (DNS) provider going down or upstream internet connectivity issues. There are also human factors, like deployment mistakes that can be difficult to roll back. And of course, there are natural disasters that can take down entire regions or cause significant headaches for services around the globe.

As a result, organizations spend a significant amount of time and money on disaster recovery plans to prepare for that next inevitable cloud outage.

Disaster Recovery to the Rescue (maybe)

The vast majority of organizations fall into one of four disaster recovery categories when it comes to responding to an outage:

  1. Active / active deployment strategy: If your primary server goes down, you flip the switch on your DNS and requests go to a second, already-active server. While this is the fastest and least-disruptive form of disaster recovery, you’re among the lucky few if your IT budget supports this option!
  2. Active / passive deployment strategy: This is very similar to active / active, but it’s cheaper because you’re not paying to host the passive instance or cluster when you’re not using it. However, you have to spin up the passive instance and flip the switch on your DNS before service is restored, delaying the return to service (a minimal health-check failover sketch follows this list).
  3. Periodic backup of your databases: In this case, when your service goes down you must first spin up your code, restore the backups, and only then continue serving as normal. While viable, this is not a rapid response and can extend a service outage beyond 24 hours. The only thing worse is…
  4. No disaster recovery strategy: Truth be told, far too many organizations fall into this category. It’s understandable; you’re busy building features and don’t have time to think about disaster recovery. When something happens, you’ll figure it out!
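To make the active / passive option concrete, here is a minimal sketch of the health-check loop that sits behind the “flip the switch” step. The endpoint URL, the thresholds, and the promote_passive() placeholder are all hypothetical stand-ins for whatever your DNS provider’s API and provisioning tooling actually require; this illustrates the pattern, not a production failover controller.

```python
import time
import urllib.request

# Hypothetical values for illustration only.
PRIMARY_HEALTH_URL = "https://primary.example.com/healthz"
CHECK_INTERVAL_SECONDS = 30
FAILURE_THRESHOLD = 3  # consecutive failures before failing over


def primary_is_healthy(timeout: float = 5.0) -> bool:
    """Probe the primary's health endpoint; any error or timeout counts as unhealthy."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def promote_passive() -> None:
    """Placeholder: spin up the passive site and repoint DNS.

    In practice this would call your DNS provider's API and your
    provisioning tooling -- the steps a runbook would otherwise describe.
    """
    print("Failing over: promoting passive site and updating DNS record")


def main() -> None:
    failures = 0
    while True:
        if primary_is_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                promote_passive()
                break  # hand off to the now-active passive site
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

Even in this toy form, notice how much still depends on something outside the failed environment running the loop and on DNS changes propagating quickly, which is exactly where the discipline discussed below comes in.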

The challenge with any of these disaster recovery strategies (except for the fourth one, of course) is that they require a high level of discipline. Your entire team needs to understand what will happen and know what they must do when an outage occurs, and even the best-laid plans will likely require some level of human intervention to restore service. In addition, as you add new features or components to your system, you’ll need to re-test your disaster recovery plan to account for the changes. Ideally, this should happen at least every quarter – preferably every month – but it’s easy to get caught up in day-to-day delivery deadlines and put off reviewing the disaster recovery plan until it’s too late.

Multi-Cluster Disaster Recovery

Since you’re reading this blog, let’s assume you’re running a modern Kubernetes containerized application. Let’s further assume that your application is running on multiple distributed clusters to maximize availability and performance. How does that impact disaster recovery?
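Before answering that, here is a rough sketch of what such a multi-cluster setup can look like operationally: the same manifest applied to several clusters by looping over kubectl contexts. The context names and the manifest path are hypothetical, and in practice GitOps or fleet-management tooling would usually drive this rather than a one-off script.

```python
import subprocess

# Hypothetical kubectl context names, one per cluster / cloud / region.
CLUSTER_CONTEXTS = ["aws-us-east-1", "gcp-europe-west1", "azure-southeastasia"]
MANIFEST = "app.yaml"  # hypothetical path to the application manifest


def apply_to_all_clusters() -> None:
    """Apply the same manifest to every cluster, one kubectl context at a time."""
    for context in CLUSTER_CONTEXTS:
        print(f"Applying {MANIFEST} to {context} ...")
        subprocess.run(
            ["kubectl", "--context", context, "apply", "-f", MANIFEST],
            check=True,  # stop if any cluster rejects the manifest
        )


if __name__ == "__main__":
    apply_to_all_clusters()
```

Deploying everywhere is the easy part; getting traffic to move away from a failed cluster is where things break down.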

Just because you have multiple clusters does not mean you get automatic failover during an outage. The culprit is often DNS. First off, DNS servers can (and often do) become unavailable. But even if the servers themselves don’t go down, DNS configuration can cause problems during outages. DNS records carry TTL (time-to-live) values that tell resolvers how long they may cache a response, and there is no guarantee that resolvers around the world will honor your TTL. The result is that your distributed clusters may be up and healthy, yet effectively invisible to users during an outage.
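To see what downstream resolvers are being told, you can inspect the TTL on your own records. Below is a minimal sketch using the third-party dnspython package (an assumption on our part; a tool like dig shows the same field). Keep in mind that a cached answer’s TTL counts down over time and, as noted above, some resolvers will ignore it entirely.

```python
# Requires the third-party dnspython package: pip install dnspython
import dns.resolver

# Hypothetical hostname for illustration.
HOSTNAME = "app.example.com"


def print_record_ttls(hostname: str) -> None:
    """Resolve A records for the hostname and print the TTL the resolver returned."""
    answer = dns.resolver.resolve(hostname, "A")
    for record in answer:
        print(f"{hostname} -> {record.address} (TTL: {answer.rrset.ttl}s)")


if __name__ == "__main__":
    print_record_ttls(HOSTNAME)
```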

But what if there was another approach to disaster recovery? In our next post we’ll discuss a strategy using BGP + Anycast to significantly improve availability and recovery. If you’re eager to jump ahead, feel free to watch Pavel’s KubeCon talk.

Webscale CloudFlow’s Cloud-Native Hosting Solution Addresses Reliability (and much more)

On the other hand, if you need a solution today, why not turn to Webscale CloudFlow? As we know all too well, outages will happen eventually, and it can be prohibitively expensive and labor-intensive to build and maintain disaster recovery strategies on your own. Webscale CloudFlow offers a range of Cloud-Native Hosting solutions that address the complexity of building and operating distributed networks. Routing across multi-layer edge-cloud topologies is perhaps the most daunting part of building distributed systems, which is why organizations are increasingly turning to solutions like Webscale CloudFlow that take care of it for you.

In particular, Webscale CloudFlow’s Kubernetes Edge Interface (KEI), Adaptive Edge Engine (AEE) and Composable Edge Cloud (CEC) work together to improve application availability. With KEI you set policy-based controls using familiar tools like kubectl, governing, among other things, cluster reliability and availability. AEE uses advanced artificial intelligence to interpret those policies and automatically handles configuration and routing in the background. Finally, the Composable Edge Cloud spans a heterogeneous mix of cloud providers worldwide, keeping applications available even when one provider’s network goes down.

To learn more, get in touch and we’ll show you how the Webscale CloudFlow platform can help you achieve the reliability, scalability, speed, security or other custom edge compute functionality that your applications demand.

Read Part 2 »
