Disaster Recovery: A Survivor's Guide

Natural disasters, terrorist incidents, hacker attacks and other threats can expose a mainframe system to data loss and interruptions. System downtime hurts your business by limiting your ability to serve customers and manage your operations.

We all hope that disaster won't strike, but if it does...

Think Globally.

The main aim of a disaster recovery solution is to let processing resume at a different site, on different hardware. Disaster recovery is just one component of an overall Business Continuity Plan, which hinges on your ability to recover data. The plan should include:

·       Data recovery

·       Data backup or replication

·       Business continuity planning

·       Ongoing plan audits

·       Off-site data protection

Consider Site Recovery.

In the late 20th century, if an event took a system down, a team of programmers would be dispatched to a backup site, where they'd spend several hours spinning tapes, fighting JCL errors and replying "cancel" to device allocation error messages.

These days, DASD manufacturers have added capabilities for replicating data across sites. Mainframe hardware and software have changed to support redundancy and failover.

In a basic site recovery setup, two data centers are placed some distance apart on the same campus. The separation should be great enough to prevent simultaneous failure, but small enough to allow synchronous hardware communication. Processors, coupling facilities (CFs) and DASD boxes are all connected over redundant links. With synchronous data replication, a write to a primary disk is not treated as complete until it has also been replicated to an alternate volume in another box.
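
The idea behind synchronous replication can be shown with a minimal Python sketch. It is an illustration only, not any vendor's interface: the Volume and SynchronousPair classes and their methods are hypothetical names invented for this example, and the "volumes" are simple in-memory dictionaries.

```python
class Volume:
    """Toy in-memory stand-in for a disk volume (hypothetical, not a real DASD API)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, block_id, data):
        if not self.online:
            raise IOError(f"{self.name} is unreachable")
        self.blocks[block_id] = data


class SynchronousPair:
    """Acknowledges a write only after both copies hold the data."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, block_id, data):
        had_block = block_id in self.primary.blocks
        previous = self.primary.blocks.get(block_id)
        self.primary.write(block_id, data)
        try:
            self.secondary.write(block_id, data)
        except IOError:
            # The alternate copy could not be updated, so undo the primary
            # write and report failure instead of acknowledging it.
            if had_block:
                self.primary.blocks[block_id] = previous
            else:
                del self.primary.blocks[block_id]
            raise
        return True  # acknowledgement: both volumes now agree
```

Whether a failed replication should fail the write, as it does here, or suspend mirroring and carry on is a policy choice; this sketch simply picks the stricter option.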

An administrator has to decide which volumes need to be replicated, and how to organize the recovery groups. Automation also requires a human touch; someone must decide how much intervention is required, which workloads recover where, and what really constitutes a site failure.
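
Those decisions can be written down before any automation is built. The Python sketch below is one hypothetical way to record them; the group names, volume labels, site names and the rule for declaring a site failure (every redundant link down, plus a run of missed heartbeats) are all assumptions made for illustration, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass
class RecoveryGroup:
    """A set of replicated volumes that recover together (hypothetical model)."""
    name: str
    volumes: list
    recover_at: str                       # site where this workload restarts
    needs_operator_approval: bool = True  # how much human intervention is required


def site_has_failed(link_status, missed_heartbeats, heartbeat_limit=3):
    """Assumed rule: all redundant links are down AND heartbeats have lapsed."""
    all_links_down = not any(link_status.values())
    return all_links_down and missed_heartbeats >= heartbeat_limit


def plan_failover(groups, link_status, missed_heartbeats):
    """List (group, target site, mode) for each group, or nothing if the site is up."""
    if not site_has_failed(link_status, missed_heartbeats):
        return []
    return [
        (g.name, g.recover_at,
         "await operator" if g.needs_operator_approval else "automatic")
        for g in groups
    ]
```

Requiring both conditions keeps a single broken link from being mistaken for the loss of a whole site.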

Building and testing the procedures ahead of time will save a lot of headaches.

Consider People, Too.

Business processes are important, but so are the people who keep them running. Consider how they will get to where they need to be in an emergency.

One solution is to have people work from home. But this assumes that telecommunications infrastructure will still be functional, and that there'll be enough bandwidth. And there are other factors beyond your local control, like the power grid, which may also have been affected.

Your planning for recovery has to take such elements into account, together with the changing nature of the environment and the disasters which can affect it.

Some Survival Tips:

1. Back Up, and Audit. Audit your internal backup plans, and implement procedures where none exist. Ensure that your plan covers every part of your environment that is critical to a disaster response.

2. Decide on Scope. Determine whether you need wide-area disaster recovery: the ability to recover data and resume operations at locations on another coast, or even another continent. With careful planning, you can maintain application and data availability during fires, floods, and power outages without breaking your budget.

3. Enlist Senior Management Support. Ensure that disaster recovery is a priority throughout the enterprise.

4. Match Technology to Scale. If you do need wide-area recovery, make sure the system combines mirroring with IP-based networking that connects two storage-area networks.

5. Build for Redundancy. For the server and storage platforms, consider redundant hot-swap components, like disk drives, fans and power supplies, so that repairs can be made while the system stays online. External storage arrays should connect to multiple host bus adapters, and be configured with redundant management modules where applicable. You should also consider error-correcting code (ECC) memory and uninterruptible power supplies.

6. Staff Locally, but Manage Globally. Ensure that your recovery system supports remote administration. This lets staff diagnose and troubleshoot problems from off-site locations, and bring a server back online quickly.

7. Have a Backup Plan. Determine whether the recovery system supports desktop, portable and mobile computers. Then ensure that each one is backed up regularly.

8. Partner Up. Form a partnership with a hardware/services provider that can ensure full service support before, during, and after a disaster. Ideally, you should be able to turn over the IT disaster recovery functions to that partner, so you can focus on your employees, customers and core business operations. Make sure this partner can deploy, track and deliver the mission-critical elements needed to keep your business running.

9. Give Clear Mandates. Make sure your employees know exactly what's expected of them. Your IT staff must know what to focus on, during and after an event. Make sure everyone has appropriate contact information, and access to status reports on the disaster or crisis.

10. Allow for Flexibility. Ensure that your recovery system can scale to your future needs, and that it doesn't lock you into a particular technology or vendor.
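
Tips 1 and 7 both come down to knowing whether every system really has a recent backup. A minimal Python sketch of such an audit follows. The function name, the one-day threshold and the shape of the input (a mapping from system name to the time of its last backup, or None where no backup exists) are assumptions made for this example; a real audit would draw on your own backup catalog.

```python
from datetime import datetime, timedelta


def audit_backups(last_backup, now, max_age=timedelta(days=1)):
    """Return the systems whose backup is missing or older than max_age."""
    findings = {}
    for system, taken_at in last_backup.items():
        if taken_at is None:
            findings[system] = "no backup on record"
        elif now - taken_at > max_age:
            findings[system] = "backup is stale"
    return findings


# Example run with made-up systems and dates.
now = datetime(2024, 1, 10, 12, 0)
records = {
    "mainframe": datetime(2024, 1, 10, 2, 0),
    "laptop-17": datetime(2024, 1, 5, 2, 0),
    "desktop-03": None,
}
for system, problem in audit_backups(records, now).items():
    print(f"{system}: {problem}")
```

Systems that pass the check are left out of the result, so an empty report means nothing needs attention.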