ryan_francis, Contributor

Disaster recovery: How is your business set up to survive an outage?

News
Mar 13, 2017 | 14 mins
Backup and Recovery | Business Continuity | Cloud Security

Can your business get by with an asynchronous backup, or must that offsite server be updated by the second to keep the business up and running at all times?


Asynchronous vs. synchronous. Dark disaster recovery vs. active architecture. Active/active vs. active/passive. No setup is objectively better or worse than another. The best one for you primarily depends on your tolerance for what happens when the server goes down.

Security experts say how individual companies choose to save their data in anticipation of an outage depends on how long they can survive before the “lights” are turned back on. What level of availability does your company need? Is the face of your company an ecommerce site where even a few minutes offline can cost an astronomical sum? Will the cost of an active-active system outweigh the potential loss of business from an outage?

“It isn’t about one being more efficient than the other. More to the point of what needs are you trying to solve for. For example, buying a Ferrari to get groceries will get the job done, but is it really fit for purpose?” says Don Foster, senior director of solutions marketing and technical alliances at Commvault.

In an active/active architecture, a cluster of offsite servers is typically kept synchronized with the onsite server, so a disaster that knocks one server offline causes no downtime, and the setup can be configured to fail over automatically. Less hardware is needed because all the systems across both sites are in use, versus only half the hardware in a dark disaster-recovery scenario. If you had 48 cores of dark disaster recovery, you'd have 96 total cores and use only 48. In active/active mode, you can scale back to 32 x 2, for 64 cores, and all 64 are active.
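That core arithmetic can be made concrete with a minimal Python sketch; the 48- and 32-core figures come from the example above, and the helper function is purely illustrative.

```python
# A minimal sketch of the core arithmetic above; the 48- and 32-core figures
# come from the example, and the helper function is purely illustrative.

def capacity(cores_per_site: int, sites: int, active_sites: int) -> dict:
    """Return provisioned cores vs. cores actually serving load."""
    return {
        "provisioned": cores_per_site * sites,
        "active": cores_per_site * active_sites,
    }

dark_dr = capacity(cores_per_site=48, sites=2, active_sites=1)        # 96 provisioned, 48 active
active_active = capacity(cores_per_site=32, sites=2, active_sites=2)  # 64 provisioned, 64 active

print("dark DR:", dark_dr)
print("active/active:", active_active)
```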

In a dark disaster-recovery scenario, the standby capacity is an entirely redundant system – all the hardware and software ready to go – but it sits completely idle. That capacity is not used at all until the primary site fails, though data is replicated to it at set intervals.

Erin Swike, senior cloud solution architect at Bluelock, explains that “active/active disaster recovery is the unicorn of the DR world. The idea of being able to sleep at night knowing that, should your production site fail, your DR site will automatically start serving up applications to users without a single packet lost or moment of downtime, is the nirvana of any CIO or system engineer.

“For most, it remains the thing of fairytales and legends. Forget about the obvious factor of data center proximity and network latency; one of the most important factors is whether your applications are written to support this type of scenario. Unless an application was written with this in mind from the beginning, odds are that it can’t support it,” she said.

The software costs are higher in active/active mode because any system running in active mode must have licensed software. In dark disaster-recovery mode, the second system does not require paid licenses for database cores, for example, because only one set is live at a time. The fact that the two systems stay in sync does not affect costs at all.

Synchronous replication requires a reliable network connection between the two servers, and there is extra labor involved in constantly managing a second site.


The downside of asynchronous replication is that any data written between the last update and the outage can be lost. It can still be configured for automatic failover, though.
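The trade-off is easier to see in a toy sketch. The following Python is a hypothetical illustration of asynchronous replication, not any vendor's implementation: the secondary only holds what was copied at the last sync, so writes made after that point disappear when it is promoted.

```python
# A toy illustration (not any vendor's implementation) of asynchronous
# replication: the secondary only holds what was copied at the last sync,
# so writes made after that point are lost when it is promoted.
import time

class Replica:
    def __init__(self):
        self.records = []
        self.last_sync = None

primary, secondary = Replica(), Replica()

def write(record: str) -> None:
    primary.records.append(record)

def replicate() -> None:
    """Periodic, asynchronous copy of the primary onto the secondary."""
    secondary.records = list(primary.records)
    secondary.last_sync = time.time()

def failover() -> Replica:
    """Automatic failover promotes the secondary; unsynced writes are gone."""
    lost = len(primary.records) - len(secondary.records)
    print(f"Failing over: {lost} unreplicated write(s) lost (the RPO gap)")
    return secondary

write("order-1001")
replicate()               # secondary now matches the primary
write("order-1002")       # written after the last sync ...
new_primary = failover()  # ... and therefore missing on the promoted secondary
```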

Anand Hariharan, vice president of product at Webscale Networks, says this is essentially the concept of cold/warm/hot backup servers. The pros and cons fall into two groups: service-level agreement and cost. Recovery point objective (RPO) and recovery time objective (RTO) define the SLA a vendor will provide: how much data could acceptably be lost in the event of an outage, and how fast services will be restored.
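As a rough illustration of how those two numbers frame a tiered SLA, here is a short Python sketch; the tier names and targets are hypothetical, not drawn from any particular vendor.

```python
# A rough sketch of RPO/RTO targets expressed per tier; the tier names and
# numbers are hypothetical, not drawn from any particular vendor's SLA.
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class RecoverySLA:
    tier: str
    rpo: timedelta  # maximum acceptable data loss
    rto: timedelta  # maximum acceptable time to restore service

TIERS = [
    RecoverySLA("hot (active/active)", rpo=timedelta(0), rto=timedelta(0)),
    RecoverySLA("warm", rpo=timedelta(seconds=30), rto=timedelta(minutes=30)),
    RecoverySLA("cold (dark DR)", rpo=timedelta(hours=4), rto=timedelta(hours=24)),
]

for sla in TIERS:
    print(f"{sla.tier}: lose at most {sla.rpo} of data, restore within {sla.rto}")
```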

“Naturally, with a hot backup, or an active/active architecture, there is zero downtime and a perfect replica of data, so from an SLA perspective, this is a very favorable path to take as it ensures that critical data isn’t lost and critical applications continue to function without interruption,” Hariharan says. “The downside here is of course cost. Maintaining two systems that are always running is essentially twice the cost, whether these costs be related to running replica architectures in a private data center, paying a managed hosting provider to perform the same task in an offsite location, or the cost of running double the instances in the cloud. In some of these scenarios, and depending on the size of the deployment, there is likely also a headcount consideration, where the additional technical staff required to manage twice the systems will also cause a steep increase in costs.”

Costs

Given an average (and increasing) rate of $7,900 per minute (Ponemon Institute), downtime creates a potentially huge cost for enterprises, both in immediate business and long-term reputation.
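Plugging that Ponemon figure into a quick back-of-the-envelope calculation shows how fast the bill grows; the outage durations below are arbitrary examples.

```python
# Back-of-the-envelope downtime cost using the Ponemon average cited above;
# the outage durations are arbitrary examples.
COST_PER_MINUTE = 7_900  # USD per minute of downtime (Ponemon Institute)

for minutes in (5, 30, 240):
    print(f"{minutes:>3}-minute outage: about ${minutes * COST_PER_MINUTE:,}")
```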

Other costs include servers at a colocation site. They have the superficial attraction of saving money by distributing infrastructure costs over many users, but a closer look reveals that those savings aren’t actualized, according to a ScaleArc white paper. The colocation company still charges for any unused resources, including dark ones that might someday be activated into full use. Yet enterprises can’t reduce the resources dedicated to the secondary site because all information from the primary server must be backed up to the co-lo secondary.

The ScaleArc report also notes that, like colocation, public cloud solutions seem attractive owing to their assumed economies of scale. Nevertheless, organizations with security concerns (banks and government agencies, for example) still shy away from the cloud because of privacy concerns. Cloud systems can also introduce latency that degrades application performance beyond acceptable levels. And again, cloud economics aren’t always what they seem: under full operation, cloud expenses typically run higher than when businesses own and run their own infrastructure.

ScaleArc believes that maintenance costs for an active/active architecture are lower because the tasks can be done during work hours rather than requiring a crew in the middle of the night. It also requires fewer staff members because organizations can keep the application running during maintenance, so developers and other application specialists don’t need to be involved.

“For only a 20 percent increase in costs, organizations will enjoy 33 percent more system capacity, along with the additional economic benefits of reduced downtime, lower operational costs, better asset utilization, and likely higher total revenue,” ScaleArc writes.
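ScaleArc's capacity claim lines up with the earlier core counts, as a quick sanity check shows; the 20 percent cost premium is ScaleArc's own estimate and is simply restated here.

```python
# A quick sanity check of the capacity claim using the earlier core counts;
# the 20 percent cost premium is ScaleArc's own estimate, simply restated here.
dark_dr_active_cores = 48   # 96 cores provisioned, half of them idle
active_active_cores = 64    # 2 x 32 cores, all serving load

capacity_gain = active_active_cores / dark_dr_active_cores - 1
print(f"Extra usable capacity: {capacity_gain:.0%}")  # roughly 33%
```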

Customers may not understand computing architectures, but they do want their apps and data to be available, all the time. Any vendor that fails to provide 100 percent uptime risks losing customers and revenue. 

Al Sargent, senior director at OneLogin, said that from a financial perspective, top-line revenue dwarfs what companies spend on IT. One study shows that companies spend between 3 and 7 percent of revenue on IT. “Shifting to active/active architectures might increase an IT budget a fraction of a percent, but will prevent outages that could erode revenues by many percentage points,” he said.

Some of these cost downsides are lessened with a cloud-based SaaS solution, where a common management environment can be automatically maintained across both sites. The cloud enables fast scale-out, so you can deploy a reduced, smaller-footprint failover infrastructure that can restore applications almost instantly during a disaster, enabling better SLAs, Hariharan said.

Foster said both scenarios are valid parts of an enterprise disaster-recovery strategy. Many applications, and even infrastructure (storage arrays in the enterprise space have created active/active grids through single namespaces that can span data centers), have developed this technology to make it easier for companies to provide business continuity plans and recovery in the case of an infrastructure outage.

“The problem is the cost of maintaining and running these infrastructures. If an application or service has requirements to truly be a ‘dial tone-like’ system (always on – never without) then a business will spend the dollars required to ensure the five nines of availability and then some,” he said. 

Most critical applications with these needs have these types of failover mechanisms built in, so that secondary or tertiary systems can take over if one fails, Foster added. Clustering has also been around for a long time for servers, and as that technology has moved down the stack into infrastructure services, the ease with which availability can be provided has greatly improved – just at a cost.

Cost is not the only downside, however. “Active-active recovery solutions do not account for user error. They are garbage in, garbage out, and in the event of this type of an outage, you need to have something that is tracking point-in-time consistency of the data to recover back to. The GitLab outage from a few weeks ago is a great example of this,” Foster said.
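Foster's point about user error can be pictured with a toy Python sketch: active/active replication copies a bad write to every node, so recovery relies on a versioned point-in-time copy taken before the mistake. All names and data below are invented.

```python
# A toy sketch of Foster's point: active/active replication copies a bad
# write to every node, so recovery relies on a versioned point-in-time copy
# taken before the mistake. All names and data here are invented.
import copy

snapshots = []                             # versioned point-in-time copies
database = {"users": ["alice", "bob", "carol"]}

snapshots.append(copy.deepcopy(database))  # consistent copy before the mistake

database["users"].clear()                  # "garbage in": an accidental delete,
                                           # instantly mirrored to every active node

database = copy.deepcopy(snapshots[-1])    # roll back to the last known-good point
print(database)                            # {'users': ['alice', 'bob', 'carol']}
```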

“There could be any number of mission-critical applications worth the protection of active/active redundancy; the trick is determining those that merit the expense,” said Steven Hill, senior storage analyst with 451 Research. “It’s important to remember that a good DR/BC plan calls for a broad assessment of a company’s key business priorities; the personnel, data and applications necessary to support them; and the cost of alternative options available to replace them — all weighed in a cost/benefit analysis against the risk of loss and likelihood of a critical business interruption.”

Dark disaster recovery is more cost effective, is typically data outage focused, and can be very complementary to the built-in active recovery services, Foster noted. The infrastructure would be highly available with data copies tracked with real-time and versioned point-in-time references to solve any outage issue that may arise.

ScaleArc’s CEO Justin Barney believes an assessment of the costs for an active/active architecture must take into account the potential losses from downtime. “Active/active operations do cost a bit of a premium – about 20 percent in hardware and software costs. But those additional costs don’t include offsets from sources such as revenue losses averted because of avoided downtime. Overall, the perspective that active/active operations are warranted only for organizations that can’t afford downtime is true,” he said.

Barney said with demand for continuous availability dominating nearly every industry, active/active operations clearly provide the best mix of advantages.

There’s new data showing that the backup systems and processes enterprises have relied on most to ensure business continuity and disaster recovery might actually be hurting, not helping, when it comes to preventing major outages, according to Barney. “This is important now because these disaster recovery systems are no longer meeting the needs of organizations that must achieve ‘continuous availability.’”

“Today’s enterprises don’t have the luxury of failing and then recovering from that failure when going offline isn’t an option – and so the ‘dark DR’ model fails them,” he adds.

Foster disagrees with that statement. “If you are still operating backup and recovery and DR like it is 2005, then yes, that statement may be correct, but the reality is that customers are modernizing how they execute on DR and backup as their infrastructures and architectures have matured and changed. When they don’t do this, outages can occur due to the non-integrated fashion in which protection and DR decisions are made.”

There is also the failover process itself: the primary server’s normal workload must be redirected to the secondary server, which becomes, at least temporarily, the new primary server. This redirection can require significant amounts of manual configuration, with two IT teams (one at each location) working overtime to enable and troubleshoot the switch. Similar reconfiguration applies to DNS, networking, replication topology, and other infrastructure elements. Testing requirements are massive, and additional IT staff must step in at the secondary facility while the original IT team remains pinned down trying to get the primary facility back online.
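A heavily simplified sketch of that redirection step might look like the Python below. The hostnames are placeholders and update_dns_record is a hypothetical stand-in for a DNS provider's API; the replication-topology, networking, and testing work described above is left out entirely.

```python
# A heavily simplified sketch of the redirection step: health-check the
# primary and, if it is down, point the service name at the secondary.
# The hostnames are placeholders and update_dns_record is a hypothetical
# stand-in for a DNS provider's API.
import socket

PRIMARY = ("primary.example.com", 443)
SECONDARY = ("secondary.example.com", 443)

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def update_dns_record(name: str, target: str) -> None:
    # Placeholder: a real setup would call the DNS provider's API here.
    print(f"Pointing {name} at {target}")

if not is_reachable(*PRIMARY):
    update_dns_record("app.example.com", SECONDARY[0])  # secondary becomes the new primary
```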

“Of course, as we’re watching the big trends around ‘software is eating the world’ and ‘every company is becoming a software company,’ there are fewer and fewer organizations for whom downtime is acceptable. DR often means at least several minutes, if not more, of downtime, and of course, because you’re bringing an idle system online all of a sudden, it may not start operations smoothly. But yes – active/active architectures are best suited for organizations that cannot tolerate downtime,” Barney said.

Joseph George, vice president of product management at Sungard AS, said he wouldn’t frame the debate between the two architectures purely in terms of efficiency, because the biggest deciding factor in which resiliency tier a business selects is often what it can afford. “Clearly, if cost was not a factor, every business would have [high-availability] systems. But they typically can only afford (and need) that level for the most mission critical systems and applications,” he said.

“It is important for enterprises to ‘tier’ their applications to help manage the economic balance between risk and the investment to mitigate. Tiering applications, as well as mapping their interdependencies, enables optimal recovery order sequencing and allows for the most cost effective availability program for the level of application downtime and data loss the organization can afford based on business impact,” he added.
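One way to picture that tiering-plus-dependency mapping is the toy Python sketch below, which sequences recovery so that dependencies come up first and the most critical tiers lead. The application names, tiers, and dependencies are invented, and the sort assumes an app only depends on apps in the same or a more critical tier.

```python
# A toy sketch of tiering applications and mapping interdependencies so that
# recovery is sequenced dependencies-first and most-critical-tier-first.
# The app names, tiers, and dependencies are invented, and the sort assumes
# an app only depends on apps in the same or a more critical tier.
from graphlib import TopologicalSorter

# app -> (tier, apps it depends on); lower tier number = more critical
apps = {
    "database":   (1, set()),
    "payments":   (1, {"database"}),
    "storefront": (1, {"payments", "database"}),
    "reporting":  (3, {"database"}),
}

deps = {name: spec[1] for name, spec in apps.items()}
order = list(TopologicalSorter(deps).static_order())  # dependencies come first
order.sort(key=lambda name: apps[name][0])            # stable sort: critical tiers lead

print("Recovery sequence:", order)
```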

Warm DR is fine

Swike said the majority of enterprises don’t really need active/active DR; warm DR meets their needs. With appropriate bandwidth between sites, an RPO of seconds and a technical RTO of minutes to hours is very achievable. “The technology is only part of the story though: there has to be discipline and time given to the process of DR. Having servers replicated is a great step, but if you don’t test it regularly how would you know it’s even going to work?”
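What that discipline can look like in practice is a routine check like the hypothetical Python below, which flags when replication lag drifts past the RPO target so the warm copy actually delivers its seconds-level RPO; the 30-second target and the timestamp are illustrative.

```python
# A hypothetical routine check: flag when replication lag drifts past the
# RPO target so the "warm" copy actually delivers its seconds-level RPO.
# The 30-second target and the sample timestamp are illustrative.
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(seconds=30)

def check_replication_lag(last_replicated_at: datetime) -> None:
    lag = datetime.now(timezone.utc) - last_replicated_at
    if lag > RPO_TARGET:
        print(f"ALERT: replication lag {lag} exceeds RPO target {RPO_TARGET}")
    else:
        print(f"OK: replication lag {lag} within RPO target")

check_replication_lag(datetime.now(timezone.utc) - timedelta(seconds=12))
```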

For many, DR is number 11 on their top 10 list of priorities, she said. “That in no way means that they don’t care about DR. It’s just that day-to-day issues and production projects always tend to be at the top of the list.”

Mike Weber, vice president of Coalfire’s labs division, said the key to a solid backup strategy fundamentally depends on the business needs and mission criticality of the system. Tiered models range from critical data with a very short RTO, measured in minutes, that requires streaming backups and/or replication to a redundant (but not high-availability) system, through to non-critical data that can absorb a recovery measured in days.

“Each of these, and various levels in between, requires different strategies to meet both business continuity and disaster recovery objectives. There are dozens of ways to proverbially skin that cat,” Weber said.

He said Coalfire often finds that backup or disaster recovery sites do not have the same security protections and controls as production sites. Penetration tests have shown that systems used in various backup or redundancy capacities often lack, because of budget constraints, the network security controls that protect the production environment.