ATMCOMIO

Backup and Recovery in the Cloud: Simplification is Actually Really Hard

Still doing all of your backup and recovery locally?  “To the cloud” will be the rallying cry that you hear from, well, pretty much everyone these days.  After all, it’s super easy to retarget your backups at the cloud and drink all of that sweet, sweet unlimited storage capacity and never have to worry about running out again.  Of course, you’ll pay for that, but it’s expected, so people are willing to do it.
Increasingly, we’re seeing companies with the capability to allow you to recover your stuff to their cloud environments.  That way, if an errant meteor decides that your data center looks like a good place to sleep, you can push a button and spin up your workloads in someone else’s data center.
Easy as pie!
Or, is it?
On paper, pushing a button to restore your workloads to the cloud seems simple, perhaps deceptively so.  In reality, it’s a whole lot harder than that.  And then you have to think about how you bring these workloads back after you’ve rebuilt your data center in a meteor-free zone.

Planning for Protection

First, no matter how many times you’re told that you can just “push a button” and something will happen, it’s rarely that easy.  And, in the rare event that it is that easy, it’s only because some really smart people have spent an inordinate amount of time doing things behind the scenes so that you can hit the easy button and make things happen. In most cases, there is no shortcut.  Someone somewhere needs to do the heavy lifting.
It turns out that simplicity is really difficult.
In the world of cloud-based workload recovery, there are a lot of things to consider:

  • Recovery Point Objective (RPO). How much data loss are you willing to withstand in the event of a failure?  Once you’ve identified this figure, it will drive all of your decisions around how often you synchronize data with your cloud-based backup repository.  The smaller your RPO, the more often you will need to synchronize and, at a certain point, you may even need to consider synchronous replication.
  • Bandwidth. The more syncs and the more data you need to sync, the more bandwidth you need to your backup provider’s cloud.  That impacts cost.
  • Networking. Let’s say you lose your data center and everything is redirected to the cloud.  You need to be able to quickly and easily redirect traffic to make this happen.  This kind of redirection itself may not be that difficult, but ensuring that all of the workloads that are spun up in the cloud can still talk with one another and with any necessary clients is harder.  This means that, on an ongoing basis, admins need to be careful about hard coding IP addresses in services and the like, particularly if you can’t just carry over your local addressing scheme to the cloud.
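
To put rough numbers on how RPO and bandwidth pull against each other, here is a minimal back-of-the-envelope sketch.  It assumes the link has to ship the largest burst of changed data you expect within a single RPO window, and that only part of the nominal link speed is usable.  The function name, the 70% efficiency default, and the 10 GB burst are all made up for illustration; plug in your own figures.

```python
def required_mbps(burst_gb: float, rpo_minutes: float,
                  link_efficiency: float = 0.7) -> float:
    """Return the sustained link speed, in megabits per second, needed to
    push `burst_gb` of changed data to the cloud within one RPO window.

    All parameters are illustrative assumptions:
      burst_gb        - largest amount of changed data (decimal GB) expected
                        to accumulate in one RPO window
      rpo_minutes     - the RPO, i.e. how long one sync is allowed to take
      link_efficiency - usable fraction of the nominal link speed
    """
    if burst_gb < 0 or rpo_minutes <= 0 or not 0 < link_efficiency <= 1:
        raise ValueError(
            "need burst_gb >= 0, rpo_minutes > 0, 0 < link_efficiency <= 1"
        )
    megabits = burst_gb * 8 * 1000       # decimal GB -> megabits
    window_seconds = rpo_minutes * 60
    return megabits / window_seconds / link_efficiency


if __name__ == "__main__":
    # Same 10 GB burst, progressively tighter RPOs.
    for rpo in (240, 60, 15):
        print(f"RPO {rpo:>3} min -> {required_mbps(10, rpo):7.1f} Mbps")
```

The relationship is linear: if the same burst has to fit into a window a quarter of the size, you need four times the link.  That is the cost conversation hiding behind a small RPO.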

I’m not going to go too deep here, since I suspect you get the point.  You have to plan ahead for simple to work and then you have to continue to plan accordingly during routine operations.
But, wait… there’s more.

Recovery

At some point, you need to bring those workloads back.  This will necessitate recovering your work from the backup provider’s cloud and then repointing networks at the original location.  There are lots of ways to achieve this, but I’m going to focus on just one here.  It came up during a discussion with StorageCraft at Storage Field Day 13 in Denver.  In these scenarios, StorageCraft helps customers revert by shipping them a set of disks overnight (sourced from the customer workloads running in StorageCraft’s cloud), and the client then restores from those.
This is where you may be wondering what happens to all of the data that changed between the time that set of disks was created and the time they were reloaded locally.  At a certain point in the restore, the StorageCraft process connects to the still-running cloud-based service and synchronizes the local copy to match what is current in the cloud.
The cloud service is then disabled and the customer repoints clients to the now-local service.  Of course, there will probably be a short period of disruption while the network connections are reconfigured, but the end result will be a restored local service that is current with what was in the cloud.
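
The general pattern described above (seed the local copy from shipped disks, then catch up on whatever changed in the cloud afterward) can be sketched in a few lines.  To be clear, this is not StorageCraft’s implementation; it is a toy model that assumes every stored item carries a write sequence number and that the shipped disks capture everything up to a known sequence.  The names `delta_sync` and `seed_seq` are hypothetical.

```python
from typing import Dict, Tuple

# Toy model: item name -> (write sequence number, content).
Store = Dict[str, Tuple[int, bytes]]


def delta_sync(local: Store, cloud: Store, seed_seq: int) -> int:
    """Bring `local` (restored from shipped disks) up to date with `cloud`.

    `seed_seq` is the highest write sequence captured on the shipped disks
    (an assumption of this sketch). Only items written after that point are
    copied, and items deleted in the cloud since then are removed locally.
    Returns the number of items changed.
    """
    changed = 0
    # Copy anything written in the cloud after the disks were cut.
    for name, (seq, data) in cloud.items():
        if seq > seed_seq:
            local[name] = (seq, data)
            changed += 1
    # Drop anything that no longer exists in the cloud.
    for name in list(local):
        if name not in cloud:
            del local[name]
            changed += 1
    return changed


if __name__ == "__main__":
    shipped = {"a": (1, b"x"), "b": (2, b"y"), "d": (2, b"old")}
    cloud = {"a": (1, b"x"), "b": (5, b"y2"), "c": (6, b"z")}

    local = dict(shipped)                 # step 1: restore from the disks
    n = delta_sync(local, cloud, seed_seq=2)   # step 2: catch up
    print(n, "items changed; in sync:", local == cloud)
```

The point of the seed is that only the delta has to cross the wire, which is why overnight shipping of disks can beat pulling the full data set back down over the link.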

Summary

This is probably not particularly mind-blowing information to many, but the more often I see things like “just push a button”, the more important I think it is for people to understand that “easy” can actually be really hard.  The best solutions out there mask their inherent complexity in a wrapper of simplicity, but, remember, it’s a lot of work and takes a whole lot of engineering and brain power.