Many cloud initiatives are sold and implemented on the basis of the improved redundancy they offer. Impressive infrastructure architectures like "geo-redundant data centres" become available and affordable to all sizes of company thanks to the huge economies of scale that cloud providers can leverage.
However, this redundancy is usually offered only at the infrastructure level. It doesn’t do much to help with system and application level redundancy. This remains the responsibility of the system owner and not the cloud infrastructure provider.
If you haven’t considered system and application level redundancy in addition to the inherent platform and network redundancy, you may have gaps in your disaster recovery solution.
System and configuration issues will instantly be replicated between your geo-redundant data centres, making failover useless for disasters above the server infrastructure level.
By ensuring that you have the appropriate system level recovery assets in place, you can reduce the risk of downtime and data loss should something untoward happen to your servers' system configuration or your applications.
What is a system level recovery asset?
We define a recovery asset as a safeguard that you have built into your architecture to assist with recovery from, and root cause analysis of, a system or application level failure.
Examples include system logs, file system backups, replicated or mirrored databases, database logs, virtual machine snapshots and dormant servers.
If you make such recovery assets available and keep them current, you will have several options for recovering your system in the event of a system level failure.
Including recovery assets in your architecture will allow you to recover within a data centre rather than fail over to a geo-redundant site.
Maintaining recovery assets does not need to cost a fortune.
You can take advantage of the fact that most cloud service providers do not charge compute fees for servers in a stopped state (attached storage is typically still billed). Keeping replica servers ready, with pre-defined IP addresses, network configuration and the correct software installed, will speed up recovery if one of your live servers has issues.
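A dormant replica only helps if it still matches the live server it is meant to replace. As a minimal sketch of how that could be checked, the Python below compares a live server's network and software details against its stopped replica. The ServerSpec fields, the replica_drift function and the sample values are hypothetical, chosen for illustration; in practice the data would come from your own inventory or configuration management tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ServerSpec:
    """Hypothetical record of the details a replica must share with its live server."""
    subnet: str
    packages: dict = field(default_factory=dict)  # package name -> version


def replica_drift(live, replica):
    """Return a list of human-readable differences between live and replica."""
    problems = []
    if live.subnet != replica.subnet:
        problems.append(f"subnet: live={live.subnet} replica={replica.subnet}")
    # Check every package on the live server against the replica.
    for name, version in sorted(live.packages.items()):
        have = replica.packages.get(name)
        if have is None:
            problems.append(f"{name}: missing on replica (live has {version})")
        elif have != version:
            problems.append(f"{name}: live={version} replica={have}")
    return problems


if __name__ == "__main__":
    live = ServerSpec("10.0.1.0/24", {"nginx": "1.24.0", "postgresql": "15.4"})
    replica = ServerSpec("10.0.1.0/24", {"nginx": "1.22.1"})
    for problem in replica_drift(live, replica):
        print(problem)
```

Running a check like this on a schedule is one way to confirm that a stopped replica is still current before you need it.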
Cloud storage is cheap, particularly if you housekeep it aggressively. Copying out your log files, periodic backups and snapshots, and syncing transactional data to storage, can be a cost-effective way to give yourself options for restoring your applications to a known point in time.
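To illustrate what aggressive housekeeping might look like, the Python sketch below applies a simple retention rule to a list of snapshots: keep the newest snapshot for each of the last few days and each of the last few weeks, and mark the rest for deletion. The Snapshot type, the expired_snapshots function and the default retention periods are assumptions made for this example, not a recommendation. The code only selects candidates; deleting them would be done with your storage provider's own tools.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Snapshot(NamedTuple):
    name: str            # hypothetical object name in cloud storage
    taken_at: datetime


def expired_snapshots(snapshots, now, keep_daily=7, keep_weekly=4):
    """Return the snapshots a simple daily/weekly retention rule would delete.

    Keeps the newest snapshot of each calendar day in the last `keep_daily`
    days and the newest snapshot of each ISO week in the last `keep_weekly`
    weeks. Everything else is returned, newest first.
    """
    newest_first = sorted(snapshots, key=lambda s: s.taken_at, reverse=True)
    daily_cutoff = now - timedelta(days=keep_daily)
    weekly_cutoff = now - timedelta(weeks=keep_weekly)

    keep = set()
    seen_days = set()
    seen_weeks = set()
    for snap in newest_first:
        day = snap.taken_at.date()
        week = tuple(snap.taken_at.isocalendar())[:2]  # (ISO year, ISO week)
        if snap.taken_at >= daily_cutoff and day not in seen_days:
            seen_days.add(day)
            keep.add(snap)
        if snap.taken_at >= weekly_cutoff and week not in seen_weeks:
            seen_weeks.add(week)
            keep.add(snap)
    return [s for s in newest_first if s not in keep]
```

Shortening keep_daily and keep_weekly reduces storage cost but also reduces the number of points in time you can restore to, so the values should follow from how far back you realistically need to recover.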
Third party solutions can also provide enhanced recoverability.
As an example, Papertrail will aggregate system logs from cloud servers so that they can still be viewed if remote access to the servers themselves is not possible.
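As a rough sketch of how an application could ship its logs off the server, the Python below uses the standard library's SysLogHandler to send log records to a remote syslog endpoint over UDP. The build_remote_logger function, the logger name, and the host and port in the usage example are placeholders. A hosted log service would give you its own destination details, and this is not presented as any particular provider's recommended setup.

```python
import logging
import logging.handlers


def build_remote_logger(host, port, name="app"):
    """Return a logger that forwards records to a remote syslog endpoint over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default when given a (host, port) address.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Placeholder destination; replace with the endpoint your log service provides.
    log = build_remote_logger("logs.example.com", 514)
    log.error("disk full on /var")
```

Because UDP delivery is not guaranteed, a setup like this suits diagnostic logs better than records you cannot afford to lose.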
If you are currently relying on your cloud service provider's high service levels, or you have built your disaster recovery plans around failover to a geo-redundant site alone, you should act to implement further recovery assets. Otherwise you may find that your only option is to restore from a backup and lose data, or worse, that you have no recovery options at all.
We would be interested to hear what low cost recovery assets others have implemented in their cloud architectures to maximise system recoverability in the event of system or application level failures. Please let us know using the comments for this post.