Our data protection team discusses Points of Recovery for your disaster recovery plan...

Introduction

Lately we have been working with many clients on data recovery strategies, and with the number of options available we thought this whitepaper might be helpful.  The large number of options for data recovery is both good news and bad news.  The good news is that there are plenty of options to choose from; the bad news is (you know the rest).  In this article, we walk through some of the options we encounter most often in our technical engagements, include definitions, and provide technical lessons learned that we hope you find helpful.

The Process

The first place to start is to understand the value of your data.  Not all data is created equal, and understanding which data has the most value and which has the least is a logical starting point.  One method used to assign value to data and to systems is a Business Impact Analysis (BIA), which helps to classify data value on a most-to-least scale. The output of this analysis is often the starting point of a comprehensive process similar to the one outlined below.
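To make the most-to-least scale concrete, here is a minimal Python sketch of how BIA output might be ranked. The data set names, the dollar figures, and the two fields used for ranking are all hypothetical; a real BIA weighs many more factors.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    hourly_downtime_cost: float  # hypothetical estimate, in dollars per hour
    tolerable_loss_hours: float  # hypothetical: how much recent data can be lost


def rank_by_value(datasets):
    """Order data sets from most to least valuable.

    Higher downtime cost ranks first; ties go to the data set
    that tolerates less data loss.
    """
    return sorted(
        datasets,
        key=lambda d: (-d.hourly_downtime_cost, d.tolerable_loss_hours),
    )


# Made-up inventory for illustration only.
inventory = [
    DataSet("file-share", 500.0, 24.0),
    DataSet("order-database", 20000.0, 0.25),
    DataSet("test-environment", 50.0, 72.0),
]

ranked = [d.name for d in rank_by_value(inventory)]
print(ranked)
```

The ranked list is the kind of output that would feed the gap analysis and design steps of the process.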

The first step is to collect information on the current environment, understand the future goals, consider the BIA, and then analyze the gap.  Once there is concurrence on the findings, you go through the plan and design phase, where the architecture is developed and refined repeatedly (rinse and repeat) until you have a hardened architecture that achieves the desired results.  Next, you implement the solution, validate and test the implementation, and work on the production cutover plan.  Once in production, you ensure solution viability through regression testing, failover testing, and periodic reviews to confirm the solution is keeping pace with ever-changing Service Levels.

Recovery Options

As you go through the process outlined above, you will come across many different recovery solutions and approaches.  We are focusing more on recovery than high availability in this piece so you will not see some of the traditional high availability approaches that we have implemented.  We will cover high availability in a future blog post.

Transactional

Transactional recovery is often overlooked as a recovery method because many application and systems administrators perform it without ever alerting the data protection team about a data loss or error.  Transactional recovery uses a software program’s inherent “recovery” mechanisms to bring data back.  Database logging is a very common form of transactional recovery, typically performed by Database Administrators or software support resources.

When it comes to application or data corruption due to code bugs, this is the most common approach and a very effective way to achieve near-zero RPOs (Recovery Point Objectives).  The biggest advantage of this type of recovery is that many applications offer this functionality as an integrated part of their transaction processing.  There is very little that can provide a better level of recovery, since this technology is the most “application aware”, allowing individual transactions to be rewound to correct an error.

The chief issue with this toolset is the limited nature of recovery.  If any one component (host/storage/connectivity) is lost, this recovery method cannot be activated.  Once the component loss is corrected, however, transactional logging can validate data and bring the application online quickly.  Often, administrators are unaware that the application performs this type of recovery at all.  Most power outages that cause hosts to suddenly power off will need transactional recovery to get everything working again.
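As a small illustration of rewinding a single transaction, the sketch below uses Python's built-in sqlite3 module as a stand-in for any application with transactional logging. The accounts table and the transfer function are hypothetical and only meant to show the idea: a failed change is rolled back and the data returns to its last committed state.

```python
import sqlite3

# In-memory database standing in for an application's data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds between accounts; roll back if the source would go negative."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.commit()
        return True
    except ValueError:
        # Rewind the partial change using the database's own transaction log.
        conn.rollback()
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts").fetchall())


ok = transfer(conn, "alice", "bob", 500)  # fails and is rolled back
print(ok, balances(conn))
```

Note that nothing outside the application was involved in the recovery, which is why the data protection team often never hears about it.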

Snapshots

Nearly all storage vendors provide snapshot technologies in their storage subsystems. Snapshots are point-in-time copies of individual LUNs or groups of LUNs. Snapshots can be very space efficient, as they typically only contain the data that has changed since the last snapshot.
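The space efficiency comes from preserving only the blocks that are overwritten after the snapshot is taken. The toy model below is a sketch of one common approach, copy-on-write; it is an assumption for illustration and not how any particular vendor implements snapshots. The Volume class and its methods are hypothetical.

```python
class Volume:
    """Toy block device with copy-on-write snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        # Each snapshot maps block index -> contents at snapshot time,
        # and only for blocks overwritten since that snapshot was taken.
        self.snapshots = []

    def snapshot(self):
        """Take a snapshot; it starts empty and costs no extra space."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        """Overwrite a block, first preserving the old contents for any
        snapshot that has not already saved this block."""
        for preserved in self.snapshots:
            preserved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        """Read a block as it looked when the snapshot was taken."""
        return self.snapshots[snap_id].get(index, self.blocks[index])


vol = Volume(["a", "b", "c", "d"])
snap = vol.snapshot()
vol.write(1, "B")

print(vol.blocks[1], vol.read_snapshot(snap, 1), len(vol.snapshots[snap]))
```

Only one of the four blocks is stored in the snapshot, yet the snapshot still presents the full point-in-time image. It also still depends on the live volume for every unchanged block.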

Clients often use snapshot technology for the backup and recovery of data at the LUN level. Snapshot creation can be either scheduled or performed on an ad hoc basis.  Recovering data from snapshots allows for a point-in-time recovery in the event there is data corruption or loss. Snapshots and replication used together can help to create on-site and/or off-site copies that are no longer dependent upon the source subsystem.

Clients also use snapshots to create both read-only and read/write copies of data sets for presentation to other hosts. It is very common for clients to use snapshots to create point-in-time database copies for use in development, testing, and reporting.

When snapshots are used as a backup or duplication method for databases, multiple LUNs will need to be copied at the same point in time. Storage vendors use consistency groups to snapshot multiple LUNs simultaneously. It is important to maintain the consistency groups as the storage landscape changes. If the database is expanded with additional LUNs, the new LUNs need to be added to the consistency group, or the recovery of the database will fail. As with any method used for data backup, snapshots should be tested for recoverability regularly, both to validate the recovery strategy and to create procedures for it.
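A simple periodic check can catch a database LUN that was added without updating the consistency group. The sketch below assumes you can export the two lists from your database and storage tooling; the function name and LUN names are hypothetical.

```python
def missing_from_consistency_group(database_luns, consistency_group):
    """Return database LUNs that are not protected by the consistency group."""
    return sorted(set(database_luns) - set(consistency_group))


# Hypothetical example: the database was expanded with lun-data-03,
# but the consistency group was never updated.
database_luns = ["lun-data-01", "lun-data-02", "lun-log-01", "lun-data-03"]
consistency_group = ["lun-data-01", "lun-data-02", "lun-log-01"]

missing = missing_from_consistency_group(database_luns, consistency_group)
if missing:
    print("Not in consistency group:", ", ".join(missing))
```

An empty result means every LUN the database uses will be captured at the same point in time.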

Local Backup

Local backup is most likely the first thing that comes to mind when discussing recovery options.  Local backup involves copying data from production server hosts to a secondary storage system.  Often, that system will be optimized for large data throughput and/or include several layers of data reduction technology (compression/deduplication).  Due to the point-in-time, air-gapped nature of backups, as well as the containerizing of the data, there is often a higher level of protection from data loss due to corruption, virus infection, or accidental deletion.  Local backup is a very resilient and tested platform for many data loss risks, and the solutions available today are versatile, flexible, and robust.  They often have a rich user base that has already solved the data protection challenges someone may be facing today.

Local backup commonly suffers from two architectural weaknesses: copying data takes time, and housing the copies requires a lot of space.  Even with copy optimization and data deduplication tools, backup can be expensive.  Additionally, the restoration process takes as much time as the backup (or longer), so RTO (Recovery Time Objective) is frequently longest when using backups as the mode of recovery.
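A quick estimate shows how both weaknesses scale with data size. The sketch below is a rough calculation only; the 10 TB data set, the throughput figures, and the 4:1 reduction ratio are made-up inputs, and real jobs add overhead for cataloging, verification, and contention.

```python
def transfer_hours(data_tb, throughput_mb_per_s):
    """Hours needed to move data_tb terabytes at a sustained rate (decimal units)."""
    megabytes = data_tb * 1_000_000
    return megabytes / throughput_mb_per_s / 3600


def stored_tb(data_tb, reduction_ratio):
    """Space consumed on backup storage after compression/deduplication."""
    return data_tb / reduction_ratio


# Hypothetical inputs: a 10 TB data set, restores running slower than backups.
backup_hours = transfer_hours(10, 500)
restore_hours = transfer_hours(10, 400)
space_tb = stored_tb(10, 4)

print(f"backup  ~{backup_hours:.1f} h")
print(f"restore ~{restore_hours:.1f} h")
print(f"space   ~{space_tb:.1f} TB per full copy")
```

Running the numbers for your own environment is a useful sanity check on whether a backup-only strategy can meet the recovery time the business expects.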

However, the number one reason backup is still the most common technology used for recovery is cost.  That is, cost compared to reduced risk.  With complete backup copies, a backup solution reduces the risk of permanent data loss the most per dollar.  Since a business’s data is often the most important thing it has, a complete secondary copy of it, stored separately, labeled, and indexed, can provide a mechanism to rebuild from the ashes of any disaster.

Remote Backup

Local backup, as discussed above, can be modified to provide additional robustness by making it a remote backup.

Geographic separation of the two copies of data (production data and backup data) lowers risk from a variety of potential sources.  Traditionally, this was done by physically transferring backup media (usually tapes) to a safe offsite storage facility.

Often, in an effort to improve performance or reduce costs, backup data will be housed in the same facility, or maybe even on the same devices, as the production data.  Local disk snapshotting is a perfect example of this.  It is commonly mistaken for a form of backup, which it is not.  Additionally, the backup data may never leave a specific data center, or may be limited to one campus.  Site protection can be achieved simply by using remote backup services from a geographically separate location, or by sending backup data from one facility to another via direct transfer or replication.

Remote backups, while very similar in architecture to local backups, are often avoided due to the high cost of connectivity between the two locations.  Data reduction technologies can help with this expense, but they do not eliminate it.  If high-speed links are set up for data protection, the bandwidth is often reserved for availability tools (e.g. synchronous disk replication or VM replication) instead of backup tools, due to their improved RTO/RPO.  Optimally, however, both should be leveraged.
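To get a feel for the connectivity cost, the sketch below estimates the link speed needed to ship one day's changed data offsite within a backup window. The 500 GB daily change, 5:1 reduction ratio, and 8-hour window are hypothetical inputs, and the calculation ignores protocol overhead and competing traffic.

```python
def required_mbps(daily_change_gb, reduction_ratio, window_hours):
    """Megabits per second needed to send one day's changes within the window."""
    reduced_gb = daily_change_gb / reduction_ratio
    megabits = reduced_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / (window_hours * 3600)


# Hypothetical inputs: 500 GB of daily change and an 8-hour window.
raw = required_mbps(500, 1, 8)      # no data reduction
reduced = required_mbps(500, 5, 8)  # 5:1 compression/deduplication

print(f"without reduction: ~{raw:.0f} Mbps")
print(f"with 5:1 reduction: ~{reduced:.0f} Mbps")
```

Data reduction shrinks the requirement considerably, but a dedicated share of the link is still needed every night, which is the cost that reduction does not remove.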

Replication

Storage vendors provide replication options to protect against the failure of a single storage subsystem.  Replication is the copying of data at the LUN level from one storage subsystem to another. This copy can be local (in the same data center) or remote (in another data center). Consistency groups are used in replication in the same way they are used in snapshots. Multiple replicated LUNs are grouped together to guarantee point-in-time consistency.

Synchronous replication provides a mirrored copy in both the source and target subsystems. Write transactions are sent to both subsystems, and the write is marked complete at both before it is acknowledged to the host. Synchronous replication requires very low latency between the source and target subsystems, as host performance can be severely impacted as latency increases. If latency is an issue, asynchronous replication is an alternative.
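To see why latency matters so much, here is a rough back-of-the-envelope sketch. It assumes the source subsystem writes locally and forwards to the target in parallel, and that light travels through fiber at roughly 200 km per millisecond. Both are simplifications, real links add switching and protocol overhead, and the 0.5 ms service times are made-up figures.

```python
def fiber_rtt_ms(distance_km):
    """Approximate round-trip time over fiber, assuming ~200 km per millisecond."""
    return 2 * distance_km / 200.0


def sync_write_latency_ms(local_ms, remote_ms, rtt_ms):
    """Latency the host sees when a write must complete on both subsystems.

    Assumes the local write and the remote write (plus the round trip)
    proceed in parallel, so the slower path determines the result.
    """
    return max(local_ms, rtt_ms + remote_ms)


# Hypothetical service time of 0.5 ms on each subsystem.
for km in (10, 100, 1000):
    latency = sync_write_latency_ms(0.5, 0.5, fiber_rtt_ms(km))
    print(f"{km:>5} km apart: ~{latency:.1f} ms per write")
```

Under these assumptions the distance between sites quickly dominates every write the host performs, which is why synchronous replication is usually kept to short distances.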

Asynchronous replication provides a delayed copy on the target subsystem. Write transactions are sent to both subsystems, and the write is acknowledged to the host as soon as the source subsystem write completes, regardless of the status of the write at the target subsystem. This method allows for some level of replication on higher latency links, but because the second write is delayed, it does not guarantee a mirror copy of the data. In the event a client needs to recover from an asynchronously replicated copy, they must rely on the ability of the application/database to roll transactions forward or backward to a consistent state.
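The toy model below sketches why the target can lag the source. It is an illustration only, not any vendor's replication engine; the AsyncReplicator class, its methods, and the record names are hypothetical.

```python
from collections import deque


class AsyncReplicator:
    """Toy model of asynchronous replication between two subsystems."""

    def __init__(self):
        self.source = []
        self.target = []
        self.pending = deque()  # written at the source, not yet at the target

    def write(self, record):
        """Acknowledge the host as soon as the source has the record."""
        self.source.append(record)
        self.pending.append(record)
        return "ack"

    def drain(self, count):
        """Ship up to `count` queued records to the target, in order."""
        for _ in range(min(count, len(self.pending))):
            self.target.append(self.pending.popleft())

    def lost_if_source_fails(self):
        """Records the host believes are safe but the target never received."""
        return list(self.pending)


replicator = AsyncReplicator()
for i in range(5):
    replicator.write(f"txn-{i}")
replicator.drain(3)  # the link only kept up with three of the five writes

lost = replicator.lost_if_source_fails()
print("on target:", replicator.target)
print("at risk:  ", lost)
```

The records still in the queue at the moment of failure are what the application or database must reconcile when it is brought up on the target copy.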

Many clients combine snapshots, synchronous replication, and asynchronous replication to achieve their recovery goals. It is very common for customers to implement snapshots to provide quick recovery and flexibility for individual databases, local synchronous replication to provide a mirrored copy of critical data, and remote replication to provide an offsite copy of critical data. It is important to note that in most cases neither snapshots nor replication replace traditional backup. As with snapshots, replication needs to be tested regularly to validate the recoverability of data.

Conclusion

Well, that simplified things!  It didn’t, but hopefully it highlighted some options to consider.  This is not an exhaustive list; it is a subset of the more popular methods deployed today.  In reality, practical implementations combine more than one of these options, and sometimes all of them, to cover data recovery requirements.  Before spending time, money and effort on data protection, first understand the value of the data to the business.  This will help frame the types of solutions to consider so that you right-size your solution and maximize your data protection spend.



Image Credit – Cisco