Building for the Inevitable Next Cloud Outage – Part 2

by Daniel Bartholomew | Jul 17, 2023

Dan's innovative mindset and expertise in mission-critical application delivery have led the development of the CloudFlow Supercloud Platform, revolutionizing the way businesses approach global application delivery and container orchestration.

This is the second post in a two-part series based on a talk by Pavel Nikolov of Section (acquired by Webscale in 2023) at KubeCon + CloudNativeCon Europe 2022. In the first post we discussed the challenges in building for the next cloud outage. This second installment demonstrates how to deploy a Kubernetes application across clusters in multiple clouds and regions with built-in failover that automatically adapts to cloud outages. You can also read Pavel’s column on this topic in TechBeacon.

Previously, we discussed how the Domain Name System (DNS) can become a single point of failure for your system, even in a multi-cluster environment, either because DNS servers go down or because of issues with worldwide TTL (time-to-live) settings. Neither of these situations is ideal during an outage.

BGP + Anycast = A match made in heaven

But what if there was another approach to disaster recovery? Imagine, if you will, an approach that is self-healing, does not require human intervention, does not involve any single point of failure, and anticipates that anything could go down at any time, including your DNS servers. In fact, Border Gateway Protocol (BGP) and Anycast Internet Protocol (IP) addresses can be used together to provide a viable alternative for disaster recovery.

For starters, you’ll need to purchase your own IP address range, and your cloud provider must allow you to bring your own IP range, which most public clouds support. There is, of course, a learning curve that comes with implementing BGP for your organization. As we know, the internet is essentially a network of networks, with the larger networks referred to as autonomous systems. BGP ensures that autonomous systems communicate with each other in the most efficient way possible. If, for example, you have one server that needs to reach another server, BGP ensures that the Transmission Control Protocol (TCP) packets from server A find the most efficient route to their destination on the internet.

This happens by way of BGP “speakers” that announce the range of IP addresses within their autonomous system to all other autonomous systems. Within a matter of seconds, the entire internet knows where each specific IP range resides. So, when you have a packet that needs to reach a specific IP address, every system in the world knows where to send it based on the IP range it falls within. When the packet reaches the autonomous system with the correct IP range, internal routing finds the exact server with the exact IP address and sends your packet through to its destination.
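
The talk does not prescribe a particular BGP speaker, but as one concrete, hypothetical illustration, here is roughly what announcing a bring-your-own IP range from a Kubernetes cluster could look like using MetalLB in BGP mode. The AS numbers, peer address, resource names, and the 203.0.113.0/24 range below are placeholders, not values from the talk:

```yaml
# Sketch: announce a bring-your-own /24 from this cluster over BGP with MetalLB.
# All addresses, names and AS numbers are illustrative placeholders.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512            # your autonomous system number
  peerASN: 64511          # the upstream router's ASN
  peerAddress: 192.0.2.1  # the upstream BGP speaker to peer with
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: anycast-pool
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.0/24      # the IP range you own and bring to the cloud
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: anycast-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - anycast-pool        # announce this pool to the peer defined above
```

Apply the same three resources, with the same address pool, in every cluster and each one becomes a BGP speaker for the identical range, which is what makes the anycast setup described below possible.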

As a failover mechanism, BGP offers a significant time-saving benefit compared with DNS. It’s not uncommon for DNS-based failover to take five minutes or more to recover from a disaster. With BGP, however, convergence takes just seconds. When a BGP speaker announces an IP address, the whole world knows about it; similarly, when it stops announcing that address, the whole world knows about that too.

When it comes to sending packets across networks, there are several different addressing methods: Unicast (one server sends a TCP packet to exactly one destination server), Multicast (one server sends a packet that reaches many different destinations), and Anycast (many servers around the world share the exact same public IP address, and each packet is routed to the nearest one). As you might imagine, Anycast and BGP together enable a world of possibilities, offering built-in failover that automatically adapts to cloud outages.

BGP + Anycast in action

To better understand the benefits, let’s look at a very simple scenario: a small Kubernetes test application that is polled with a request every second. In this example, we’ll deploy the application to clusters in three different clouds and regions – one in New York City, another in Amsterdam, and a third in Sydney. When deployed in a healthy state, the clusters all advertise the same Anycast public IP address. If you are located in Zurich, for example, you will receive a response from Amsterdam since it is the closest location. If you are instead located in Cairo, you will also get a response from Amsterdam since it’s still the closest.
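
Continuing the MetalLB-based sketch above, the per-cluster manifests might look something like the following; the echo image, port, and the 203.0.113.10 address are illustrative placeholders rather than the exact setup used in the talk:

```yaml
# Sketch: the same echo service, applied unchanged in New York, Amsterdam and Sydney.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
        - name: echo
          image: traefik/whoami   # placeholder echo image that reports which instance answered
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: echo
  annotations:
    metallb.universe.tf/address-pool: anycast-pool  # take the VIP from the anycast range
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10   # identical anycast address in every cluster
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 80
```

A client then only needs to poll http://203.0.113.10/ once a second; BGP routing, not the client, decides which of the three clusters answers.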

Now, if you keep sending a request to the anycast address every second and stop announcing the IP range from Amsterdam (to simulate one of your regions going down), your app will start getting responses from New York – the next closest location – in less than a second. Repeat the same process and take down the New York cluster, and you’ll almost instantly start receiving responses from Sydney. You have now automatically rerouted traffic to healthy clusters in real time without having to touch a single thing – no disaster recovery strategy required! While you can’t guarantee 100% uptime, combining BGP with Anycast will bring you close to that holy grail with minimal effort. In the case of our example, the system recovered on its own in a matter of seconds.
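
In the sketch above, “stop announcing the IP range” amounts to withdrawing the BGP advertisement in the affected cluster, for example (using the placeholder resource names from earlier):

```bash
# Withdraw the anycast route from the Amsterdam cluster only;
# New York and Sydney keep announcing the same range.
kubectl --context amsterdam delete bgpadvertisement anycast-adv -n metallb-system
```

Re-applying the BGPAdvertisement resource brings Amsterdam back into rotation just as quickly, since its announcement propagates across autonomous systems in seconds.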

Webscale CloudFlow’s Cloud-Native Hosting Solution addresses reliability (and much more)

As we know all too well, outages will happen eventually, and it can be prohibitively expensive and labor-intensive to maintain disaster recovery strategies on your own. Fortunately, Webscale CloudFlow offers a wide range of Cloud-Native Hosting solutions that address the complexity of building and operating distributed networks. The complexities of routing across multi-layer edge-cloud topologies are perhaps the most daunting part of building distributed systems, which is why organizations are increasingly turning to solutions like Webscale CloudFlow that take care of this for you.

In particular, Webscale CloudFlow’s Kubernetes Edge Interface (KEI), Adaptive Edge Engine (AEE) and Composable Edge Cloud (CEC) work together to improve application availability. With KEI, you set policy-based controls using simple commands in familiar tools like kubectl, governing, among other things, cluster reliability and availability. AEE uses advanced artificial intelligence to interpret those policies and automatically handle configuration and routing in the background. Finally, the Composable Edge Cloud offers a heterogeneous mix of cloud providers worldwide, ensuring application availability even when a single provider’s network goes down.

To learn more, get in touch and we’ll show you how the Webscale CloudFlow platform can help you achieve the reliability, scalability, speed, security or other custom edge compute functionality that your applications demand.
