Tag Archives: DRP

Why Disaster Recovery Requires a Plan

Guest post from Casper Manes on behalf of IT Channel Insight

Whether you are a commercial pilot, an astronaut, a submarine weapons officer, or a Cylon, you know the importance of having a plan. There are certain tasks that, no matter how repetitious they may seem, are so important to get right the first time, and every time, that they have been boiled down to a checklist which any reasonably skilled and trained individual can walk through, step by step, in order, to accomplish the task. They are designed to be easy to follow, to spell out exactly what needs to be done, and the order in which it must be done, to get things going, and to require a minimum of creative thinking. Tasks are performed by rote, and verified each step of the way. That’s the perfect way to approach disaster recovery, and in this article we’ll discuss why you need a disaster recovery plan that is a little more detailed than “don’t panic!”

What is a disaster?

Let’s consider what, in business terms, can constitute a disaster. Sure, things like hurricanes and blizzards come to mind, perhaps even fires in the datacenter, but a disaster is more than just a weather phenomenon or catastrophic loss; it’s anything that significantly disrupts the normal operations of your business. If we limit ourselves to an IT perspective, that can include prolonged Internet outages, a severe flu epidemic that takes out half the staff, a virus that shuts down key servers, or a SAN failure. It can also include HVAC failures, power outages, or hardware failures on critical, but not redundant, systems. Anything that causes a significant and protracted impact to normal operations may be enough to declare a disaster situation, and require that you implement your recovery plan.

Disaster declared, now what?

In the best-case disaster, you have experienced a hardware failure that will eventually be corrected by the vendor. But while systems are down, your phone is ringing off the hook, you’re getting pinged on email and IM, and someone is probably sticking their head in your cube every 30 seconds asking if it’s fixed yet. In the worst type of disaster, you and your colleagues are probably more worried about your families and your own property than about the company’s, and that’s assuming your whole team even made it into the office. Hurricanes, blizzards, and other region-impacting events can leave you with only a skeleton crew, and most of them are going to be worried about more than just how to get the website back online and email working. That’s why you want to work the plan.

By the numbers

Think back to how this article opened. When failure is not an option and there are countless distractions going on, you want people to have something to anchor themselves with, and to keep the need for creative thinking to a minimum. You also need to make sure that things are done in a certain order, and that nothing is missed, because most things have dependencies. A plan is the guide that your team will use to focus on specific and discrete tasks, without having to make it up as they go along. Make use of checklists. I mean actual paper documents on clipboards, with check marks showing that each step is complete, so that:

a)     If something distracts you, it is easy to pick up where you left off without missing anything,

b)     You can hand off to someone else and they know exactly where to start, and

c)     Someone can audit that each step was done.

Paper checklists also have the distinct advantage of not relying on technology. I once saw an organization that kept all its DR procedures online, which looked great until they couldn’t get to them while the datacenter was down!
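To illustrate the point about order and dependencies, here is a minimal Python sketch that turns a set of recovery steps into a printable checklist. The step names and their dependencies are hypothetical examples, not taken from any real plan; the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical recovery steps, each mapped to the steps it depends on.
steps = {
    "Restore power and cooling": set(),
    "Bring up core network": {"Restore power and cooling"},
    "Start storage (SAN)": {"Restore power and cooling"},
    "Start database servers": {"Bring up core network", "Start storage (SAN)"},
    "Start email and website": {"Start database servers"},
}


def build_checklist(steps):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(steps).static_order())


def render_checklist(ordered):
    """Format the ordered steps as text with an empty box to tick per step."""
    return "\n".join(f"[ ] {n}. {name}" for n, name in enumerate(ordered, start=1))


checklist = build_checklist(steps)
print(render_checklist(checklist))
```

Print the result and put it on the clipboard; the script only helps produce the list, and the paper copy is what you work from when the systems are down.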

It’s a journey, not a destination

Disaster recovery planning is an ongoing process. Plans must be tested and revised as the company grows, new systems are brought into the environment, and old systems are deprecated. Real disasters don’t happen on schedule, so training must be thorough and testing must be performed to ensure that whoever is on the clock can handle the early steps of the process until more people can get online. Staffing changes will mean that this must happen frequently, and repeatedly. It’s just a part of the overall process, so accept it. And make sure that at least two people know how to perform any part of the disaster recovery plan, since you have no way to know in advance whether everyone will be able to make it into the office when a disaster strikes. Redundancy of equipment is no more important than redundancy of skillsets, and a single point of failure could be the one guy who can’t get into the office because the roads are closed.
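The "at least two people" rule is easy to check if you keep a simple skills matrix. The following Python sketch is one possible way to do it; the task names and people are made up for illustration.

```python
# Hypothetical skills matrix: recovery task -> people trained to perform it.
trained = {
    "Restore backups": {"Ana", "Raj"},
    "Fail over the SAN": {"Raj"},
    "Rebuild the mail server": {"Ana", "Lee", "Raj"},
}


def single_points_of_failure(trained, minimum=2):
    """Return the tasks that fewer than `minimum` people can perform."""
    return sorted(task for task, people in trained.items() if len(people) < minimum)


gaps = single_points_of_failure(trained)
print(gaps)
```

Any task the function reports is a candidate for cross-training before the next test of the plan.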

This article was written by Casper Manes on behalf of IT Channel Insight, a site for MSPs and Channel partners where you can find other related articles on how to set up a disaster recovery plan.

Ike: this is no time to think about disaster planning

Hurricane Ike

Thousands of businesses in Texas from Freeport to Houston are wondering, “How are we going to survive Hurricane Ike and continue business operations afterwards?”

If this is the first time this has crossed your mind, there’s precious little you can do now but kiss your systems and hope that they are still running when you see them again.  The storm surge is supposed to exceed 20 feet, which will prove disastrous to many businesses.

But when you get back to the workplace and things are back to normal (which I hope is not too long), start thinking seriously about disaster recovery planning.  A DR project does not have to be expensive or take a lot of resources, and it’s not just for large businesses.  Organizations of every size need a DR plan: the plan may be large and complex in big organizations, but in smaller ones it will be small, manageable, and inexpensive to develop.

Hurricane Ike's Path

Where do you begin?  At the beginning, of course, by identifying your most critical business processes, and all of the resources that those processes depend on.  Then you begin to figure out how you will continue those processes if one or more of those critical resources are not available.  The approach is systematic, simple, and repetitive: you go step by step through each process, identifying critical dependencies and figuring out how to mitigate those dependencies if they go “offline” at a critical time.
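As a rough illustration of that step-by-step pass, the Python sketch below records each critical process with the resources it depends on, then lists the dependencies that have no planned fallback yet. The process names, resources, and fallbacks are all hypothetical; a real inventory would come from interviews with the people who run each process.

```python
# Hypothetical inventory: each critical process and the resources it depends on.
processes = {
    "Order entry": ["web server", "order database", "internet link"],
    "Payroll": ["finance server", "internet link"],
}

# Hypothetical fallbacks planned so far for each resource (None = nothing yet).
fallbacks = {
    "web server": "standby server at second site",
    "order database": None,
    "internet link": "secondary ISP",
    "finance server": None,
}


def unmitigated(processes, fallbacks):
    """Map each process to its dependencies that have no planned fallback."""
    gaps = {}
    for process, resources in processes.items():
        missing = [r for r in resources if not fallbacks.get(r)]
        if missing:
            gaps[process] = missing
    return gaps


report = unmitigated(processes, fallbacks)
for process, missing in report.items():
    print(f"{process}: no fallback for {', '.join(missing)}")
```

The output is a to-do list: every resource it names is a dependency you still need a mitigation for.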

Order yourself a great book that will get you started.  As one reviewer said, “It would be tempting to make all sorts of snide comments about a Dummies book that wants to take a serious look at disaster recovery of your IT area. But this is a Dummies title that you’ll actually go back to a number of times if you’re responsible for making sure your organization survives a disaster… IT Disaster Recovery Planning for Dummies by Peter Gregory. It’s actually the first book on the subject that I found interesting *and* readable to an average computer professional….” Read the rest of this review here and here.

Don’t put this off; strike while the iron is hot and get a copy now.  Don’t wait for the next hurricane to catch you off-guard.

I don’t want to see any business unprepared and fail as a result of a natural disaster.  If it were up to me, disaster preparedness would be required by law, but instead it’s a free choice for most business owners.  I just wish that more would choose the path of preparation and survival, but unfortunately many do not.  I wrote IT Disaster Recovery Planning For Dummies to help more people understand the importance of advance disaster recovery planning and how easy the planning process can be.

Does your organization need a disaster recovery plan?

Many businesses, particularly those that have fewer than one thousand employees, think that disaster recovery planning is something that is too difficult or too expensive to undertake. Another response is that of the avoider: it won’t happen to me. These assumptions have been perpetuated to the detriment of many businesses that unnecessarily failed.

Disasters come in many forms. Most people think of massive earthquakes and hurricanes. However, there are hundreds of disasters that occur on a regular basis, but they’re too localized and small to make the news. And not all disasters are ‘acts of nature’: there are many man-caused disasters that occur on a regular basis that cripple businesses just like acts of nature do.

Disaster Recovery Planning need not be expensive, and most businesses can (and should!) get started right away with even a small amount of planning that could prove highly valuable, in case the unexpected occurs.

Get the book, build the plan!

On interim DR planning

Most organizations will immediately recognize the risks associated with the absence of a disaster recovery plan. Knowing that having a full DR plan in place and tested may be more than a year in the future, many organizations will have a strong desire to have something in place while waiting for the full DR plan to be completed.

Often the something that is needed is an interim DR plan. This is a plan that can be created quickly and with minimal effort. It will not, of course, be as comprehensive as a full DR plan. It is rather like tossing a tow rope in the back of a car, knowing that major engine work is needed.

Confidence in a DR plan

Disaster recovery plans aren’t much good if they don’t work. And if they don’t work, then the time devoted to their development has been pretty much wasted.

Decision makers in businesses, especially the executives, like certainty. They want to have confidence that things will go as planned. And while no one plans a disaster, they want to know that the recovery effort after a disaster will work.

The survival of the business may depend on it.

You can take your DR plan to a fortune teller, but I wouldn’t put much stock in that. Why not just try it?

– from IT Disaster Recovery Planning for Dummies

Skype restored, but executives still in hiding

Disclaimer: I’m a big fan of Skype – it’s my IM client of choice.

The Skype PR disaster continues unabated. There is no end in sight.

The Skype network service has been restored. Here is a short explanation of their problem:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days.

This and other postings were made by someone named Villu Arak, about whom practically nothing can be found.

Skype had one of the most significant outages of any online service, and none of its executives have so much as said “Hello”.

Skype’s network disaster has been solved, but Skype’s PR disaster continues. Have they ever heard the terms “goodwill”, “media relations”, or “customer service”?

Earlier story.

Skype: not one disaster, but two

Disclaimer: I’m a big fan of Skype – it’s my IM client of choice.

Update to the story.

Aug. 19, 2007 – Skype suffered a colossal outage last week, and the network is just now coming back. They have promised to tell everyone on August 20, 2007, what happened in the previous week.

Two disasters hit Skype last week:

  1. The network outage, whatever it was about.
  2. The complete absence of Skype and eBay executives throughout the crisis.

While the first disaster might not have been preventable, the second disaster was a direct result of decisions made by Skype and eBay executives who apparently chose to hide. They appear to have left the Skype technical recovery team out to flap in the wind, alone, with no visible public support. This led to rumors and speculation about what really happened, and whether Skype and eBay executives care about the community of users. Those executives committed the cardinal sin of disaster management: failing to communicate from a high level about what’s happened and what is being done about it. Instead of being told, we were left to wonder if they even noticed. Maybe they were too busy and could not be bothered about a world-wide outage. Their silence is deafening – we hear it loud and clear.

In a different part of the world, the Utah mine accident is playing out. Those executives got it right: they are right there, working and concerned, and are making frequent statements to the press. They are telling everyone what they know, what they don’t know, and what they’re doing.

Murray Energy executives get two thumbs up for being there. Skype and eBay get two thumbs down for being silent and absent.

While Skype users may breathe a collective sigh of relief that the network is running again, I wonder if Skype and eBay have even noticed that the second disaster has taken place and has yet to be addressed.

Server consolidation and disaster recovery planning

Server consolidation has been the talk of IT departments for several years, and represents a still-popular cost-cutting move. The concept is simple: rather than dedicate applications to individual servers, which can result in underutilized servers, install multiple applications onto servers in order to more efficiently utilize server hardware, thereby reducing costs.

I’m all for saving money, electricity, natural resources, and so on, and consolidating servers is a smart move to undertake, as long as you abide by this principle:

Server consolidation is something to undertake during peacetime, not solely for recovery purposes.

Let me expand on this. Consider an environment that is made up of dozens of underutilized servers dedicated to applications. The DR planning team wants to consider a DR strategy that consolidates these applications onto fewer servers as a way of providing a lower-cost recovery capability.

Well, it might work, but I’d want to test it very thoroughly and carefully. Combining applications that are used to having servers all to themselves may lead to unexpected interactions that could be difficult to troubleshoot and untangle.

If you want to undertake server consolidation, do it first in your production environment, and then take that consolidated architecture and apply it to a DR architecture.

– from IT Disaster Recovery Planning for Dummies

Aligning DR planning to the org chart in large organizations

Different segments of a large organization may push forward on DR planning at different rates. One segment’s lack of progress should not impede another’s. Instead, you might think of this as a DR plan for each cog in the organizational wheel. If this is how things get done in your organization, then perhaps the DR plan gets built in pieces, asynchronously. Progress has many faces.

– from IT Disaster Recovery Planning for Dummies

DR team selection

You can’t hand pick your recovery team members. The disaster will select them for you. It is for this reason that recovery procedures must be specific enough so that anyone with the basic relevant skills can carry them out confidently and correctly.

– from IT Disaster Recovery Planning for Dummies

DRP: the job is not done until the paperwork is done

The job is not done until the paperwork is done.

Nowhere is this pithy saying more true than in disaster recovery planning. Why? Because the paperwork in DRP is about how to jump-start the business when “the big one” hits. Depending upon where your business is located, the “big one” may be an earthquake, tornado, hurricane, flood, or a swarm of locusts.

The paperwork in DRP is simply this: the procedures and other documents that business personnel must refer to in order to get things going again after a disaster. The DRP procedures are especially important because they might be read and followed by persons who are not the foremost experts with the systems that support critical business processes. Still, those people are expected to rebuild critical systems in a short period of time in order to support critical processes that are probably going to be performed by people who likewise are not subject matter experts at the business process level.

And the business’s survival depends on the paperwork being right. There are no second chances.

You just love documentation, right? Thought so.

– from IT Disaster Recovery Planning for Dummies

Ninety percent of good disaster recovery planning is knowing what makes your environment run today.

– from IT Disaster Recovery Planning for Dummies

Building replacement workstations in a disaster

The need to build workstations on unfamiliar hardware platforms requires some out-of-the-box thinking for those who are required to build replacement workstations in a disaster. Straightaway, I recommend that workstation images be very well documented, so that they can be built from the ground up on new hardware platforms.

– from IT Disaster Recovery Planning for Dummies

Gap in PC Procedure Causes Corporate Crisis

Some years back, a colleague in another organization came to me for help. In this international organization and U.S. public company, the finance department was unable to close its quarterly financial books in time to meet an SEC filing deadline.

It had missed the deadline by several days, and the matter had reached the CEO and the boardroom as an uproar.

The cause: an overseas subsidiary was unable to close its books. The reason: one of the steps in completing the subsidiary’s month-end and quarter-end financials was a procedure wherein a financial report was downloaded to a PC’s spreadsheet program, where a spreadsheet macro would perform some calculations that would be used in the subsidiary’s financial results.

This time, there was a problem: the macro had become corrupted and would not run.

There were no backups. A contractor had created the macro and was nowhere to be found. No one in the finance department knew what the macro did or how it worked. It was an undocumented step in this critical business process, and the original software was gone.

Be certain to avoid this kind of scenario in your organization.

– from IT Disaster Recovery Planning for Dummies

Storing production data on end user workstations?

As I encounter cases where an employee’s workstation is, in fact, on the critical path for a critical business process, the first question I usually ask is:

Why?

Warnings go off in my head when I hear about an employee’s workstation in any process’s critical path.

– from IT Disaster Recovery Planning for Dummies

Now let me tell you why I think it’s a bad idea to store production data on end user workstations:

  • Workstation hard drives are not protected from failure by any RAID or mirroring technology. When the hard drive fails, the data is gone. IT servers often have RAID or mirroring, which protects the integrity and availability of the data.
  • Most users don’t back up their workstation hard drives. When the data is gone, it’s gone. IT servers are usually backed up regularly.
  • Most workstations have little or no power protection (plug strips hardly count). When sags, spikes, or brownouts occur, the workstation will take the brunt of this, possibly resulting in a crash or hardware failure. Sure, it’s unlikely, but it DOES happen. IT servers are usually protected by UPS and, sometimes, generators.
  • Users often tinker with workstations, which sometimes results in a disabled state and/or a reboot. This happens a lot less on most IT servers.
  • User workstations, particularly if they are laptops, are stolen far more frequently than IT servers, especially when those servers are locked up in server rooms.

Documenting PC tasks just as important as other platforms

When business processes include tasks that are carried out on workstations, then written procedures for those tasks must exist alongside procedures for other steps in the process that take place on other platforms. All tasks in a business process must have equal formality, regardless of whether they take place on a formal IT server platform or on a user’s workstation.

– from IT Disaster Recovery Planning for Dummies