Why Disaster Recovery Requires a Plan
Guest post from Casper Manes on behalf of IT Channel Insight
Whether you are a commercial pilot, an astronaut, a submarine weapons officer, or a Cylon, you know the importance of having a plan. There are certain tasks that, no matter how repetitious they may seem, are so important to get right the first time, and every time, that they have been boiled down to a checklist which any reasonably skilled and trained individual can walk through, step by step, in order, to accomplish the task. They are designed to be easy to follow, to spell out exactly what needs to be done, and the order in which it must be done, to get things going, and to require a minimum of creative thinking. Tasks are performed by rote, and verified each step of the way. That’s the perfect way to approach disaster recovery, and in this article we’ll discuss why you need a disaster recovery plan that is a little more detailed than “don’t panic!”
What is a disaster?
Let’s consider what, in business terms, can constitute a disaster. Sure, things like hurricanes and blizzards come to mind, perhaps even fires in the datacenter, but a disaster is more than just a weather phenomenon or catastrophic loss; it’s anything that significantly disrupts the normal operations of your business. If we limit ourselves to an IT perspective, that can include prolonged Internet outages, a severe flu epidemic that takes out half the staff, a virus that shuts down key servers, or a SAN failure. It can also include HVAC failures, power outages, or hardware failures on critical, but not redundant, systems. Anything that causes a significant and protracted impact to normal operations may be enough to declare a disaster situation, and require that you implement your recovery plan.
Disaster declared, now what?
In the best case disaster, you have experienced a hardware failure that will eventually be corrected by the vendor. But while systems are down, your phone is ringing off the hook, you’re getting pinged on email and IM, and someone is probably sticking their head in your cube every 30 seconds asking if it’ fixed yet. In the worse type of disasters, you and your colleagues are probably more worried about your family and your own property more so than the company’s, and that’s assuming all your team even made it into the office. Hurricanes, blizzards, and other region impacting events can leave you with only a skeleton crew, and most of them are going to be worried about more than just how to get the website back online and email working. That’s why you want to work the plan.
By the numbers
Think back to how this article opened. When failure is not an option and there are countless distractions going on, you want people to have something to anchor themselves with, and to keep the need for creative thinking to a minimum. You also need to make sure that things are done in a certain order, and that nothing is missed, because most things have dependencies. A plan is the guide that your team will use to enable them to focus on specific and discrete tasks, without having to make it up as they go along. Make use of checklist; I mean actual paper documents on clipboards with check marks that each step is complete, so that;
a) If something distracts you, it is easy to pick up where you left off without missing anything,
b) You can hand off to someone else and they know exactly where to start
c) Someone can audit that each step was done.
Paper checklists also have the distinct advantage of not relying on technology. I once saw an organization who kept all their DR procedures online; which looked great until they couldn’t get to them while the datacenter was down!
It’s a journey, not a destination
Disaster recovery planning is an ongoing process. Plans must be tested and revised as the company grows, new systems are brought into the environment, and old systems are deprecated. Real disasters don’t happen on schedule, so training must be thorough and testing must be performed to ensure that whoever is on the clock can handle the early steps of the process until more people can get online. Staffing changes will mean that this must happen frequently, and repeatedly. It’s just a part of the overall process, so accept it. And make sure that at least two people know how to perform any part of the disaster recovery plan since you have no way to know in advance whether everyone will be able to make it into the office when a disaster strikes. Redundancy of equipment is no more important that redundancy of skillsets, and a single point of failure could be the one guy who can’t get into the office because the roads are closed.
This article was written by Casper Manes on behalf of IT Channel Insight, a site for MSPs and Channel partners where you can find other related articles on how to setup a disaster recovery plan.