This thread was initiated by a comment from @BTPost in another thread. I was recently promoted from Assistant VP to VP at my company, and the running joke is that I now have a key to the Executive Bathroom. In reality, the responsibilities have changed and increased, with a commensurate increase in authority... and stress. As such, one of the things I am now more involved in is our BCP (Business Continuity Plan/Program) and DR (Disaster Recovery) plan. Since my initial reply was done from my phone and done quickly, I figured I would start a new thread on what BCP and DR are, how they are implemented (and have been historically), and how they differ.

First, some definitions (no, not Webster's; real life).

EDP, or Electronic Data Processing
This is a VERY old-school term for "computers" and the jobs they perform, but it is still the term used by regulatory agencies and in BCP/DR plans.

BCP, or Business Continuity Plan (or Program)
A BCP is focused on continuing key business operations without having to rebuild anything from scratch. It typically focuses on key personnel and critical business systems. The point of BCP is to ensure that those business systems and processes, and the people required to run them, stay available in the event of some catastrophe. In general, I will focus on "non-manufacturing" BCP, as that is what I am most familiar with (I'm in IT... go figure).

Examples of steps taken as part of BCP include:

- Personnel who manage processes, systems and applications are separated by large enough distances that they are unlikely to be impacted by the same event at the same time. Many DR and BCP plans have begun calling this the Katrina Effect, because in some cases businesses had both of their primary locations within the same region affected by Katrina and its aftermath. The minimum distance now considered safe by the industry is 150-300 miles between locations. More on that later.
- Critical business systems, including/especially EDP, are housed in a facility that can survive either the worst "realistic" natural disaster for the area, or one step below it. For example, if you are in an area likely to have hurricanes and a Cat 3 is the strongest ever recorded, your datacenter/facility should be designed to withstand a Cat 4 (for giggles) or a Cat 3.
- If you are in an area that is prone to power outages... MOVE! Just kidding. The facility should have internal capacity to generate power for no less than the average outage or 72 hours, whichever is greater. That means generator or battery power for that long.
- If you are in an area prone to power outages, or where outages can be long-term (after a snow/ice storm, hurricane, tornado, tsunami, earthquake or volcanic eruption... yes, that's on the list), there are plans for fuel delivery to maintain power "indefinitely".
- "Authority" continuity is planned for in the short and long term:
  - Key personnel may be relocated temporarily for the duration of the event.
  - Plans are in place to transition authority/responsibility to others in the company should key individuals be inaccessible or unable to perform their duties for an extended period of time (up to and including loss of life).
- Contingency plans exist for remote work/work from home, for both short- and long-term outages or disasters.

DR, or Disaster Recovery Plan
A DR plan is focused on resuming service as quickly as possible after a disaster has occurred and services have been interrupted. A DR plan assumes that business continuity HAS been interrupted and must be re-established.

Examples of steps covered in a DR plan include:

- Having a secondary data center, outside the Katrina Effect zone, with either redundant systems or the ability to immediately scale up to redundant systems.
- Having the ability to transfer workloads from the primary (offline) data center to the backup/DR data center with minimal loss of data (transactions).
- The ability for customers and employees to use the DR site. For a data center, that includes re-routing traffic from the internet. For a labor-intensive trade, it includes getting people on-site to do the work. For a call center, it means getting the calls re-routed and coming into the secondary site.
- Agreements with partners, vendors and providers to make all of the above happen, as well as to get the process started to repair the failed site or stand up a new one. Two is one, one is none: if you are running out of your DR site because your primary site is a smoldering crater (or swimming pool)... you're actually running out of your BRAND NEW PRIMARY SITE... with no DR site in existence.

I have worked for 3 large companies that had both BCP and DR plans, some better than others, but all covered the same bases and none of them were perfect.

The first company (let's call them "Spectrometer"... although they had a very different name when I worked there) initially had what I call a "Hope and Pray Plan", as in "I hope and pray nothing goes wrong". Long story short, work was being done in the UPS room when someone tripped and literally hit the big red button. The data center technically ran on UPS all the time to condition the power, but the big red button cut off outside power and was designed to put us on generator. In this case the generator started as expected and was up and providing power in less than 2 minutes (3 minutes to spare on the UPS). Unfortunately, ALL 3 pieces of transfer switchgear failed... ALL 3. The batteries ran out and the entire data center just... turned off. 1500 servers, RACKS of storage, all the network gear, phones, everything. Walking in there was like walking into a scene from Doom II.
Almost pitch black, with long shadows cast from the single flickering fluorescent bulb in the back corner... and a beeeeep... beeeeep... beeeeep coming from the alarm panel. We were 50% of the way to our DR plan. Disaster... check. Plan... not so much check. We built it on the fly, and it was a "bring the site back up" plan. If the site had been a smoking crater, we'd have all been out of a job and recovery would have taken longer than the market would have accepted. We had backups, but they were all on tape, and the courier hadn't made a pickup yet that day, so we'd have been a day old on transactions. It would have taken months to get all the hardware we needed and another month to restore all the data. We would have been done, period. But we got damned lucky. This was on a Friday afternoon. We were able to get everything back online once power was restored, in about 13 hours, with zero loss of data thanks to transaction logs on the databases and caching on the storage arrays. We lost probably 100 hard drives over the next 72 hours as the bearings got VERY cold (the AC came back up 6 hours before we powered on the first storage/servers... bad call, that one), but with the redundancy built into the systems we lost no data.

Eventually we acquired another company with their own Tier IV data center (the highest possible rating) out in the middle of nowhere. It had 8 days of reserve power on-site, two different rail spurs running to the data center, and the ability to draw diesel directly from a train tanker car if necessary while refilling the underground tanks. The remote site also had sufficient spur track to hold 3 tanker cars on each spur, for a total of something like 1000 years of diesel (no, not really, but WAY longer than necessary).
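Fuel math like this is easy to sketch. Here's a back-of-the-envelope calculation for the "72 hours or the average outage, whichever is greater" rule from the BCP section above. Every figure (load, burn rate) is a made-up assumption for illustration, not a number from any real site:

```python
# Back-of-the-envelope generator fuel sizing. Every number here is an
# illustrative assumption, not a figure from a real facility.

GALLONS_PER_KWH = 0.07  # assumed rough diesel burn rate for a large genset

def required_fuel_gallons(critical_load_kw: float, avg_outage_hours: float) -> float:
    """Fuel needed to ride out the longer of 72 hours or the average local outage."""
    runtime_hours = max(72.0, avg_outage_hours)
    return critical_load_kw * runtime_hours * GALLONS_PER_KWH

# Example: a hypothetical 500 kW critical load where outages average 8 hours.
# The 72-hour floor dominates: 500 kW * 72 h * 0.07 gal/kWh = 2,520 gallons.
print(round(required_fuel_gallons(500, 8)))  # 2520
```

Eight days of on-site reserve plus rail delivery, as in the anecdote above, is what "plans for fuel delivery to maintain power indefinitely" looks like when someone takes it seriously.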
The previous company I worked for (we'll call them... "A-Holes" for short, because if I try to get cute with their name and someone figures out who they are, they will probably sue me) had a BCP that covered people in over 100 countries on every continent except Antarctica, with redundant pairs of data centers in the US and the EU (Switzerland, in fact). Great setup, right? Yeah, except that both US data centers were about 100 miles apart... in FLORIDA! And along comes hurricane season. I still wake up in a cold sweat many nights from September through late October! BCP was great, it really was; the company was a freakin' hydra. But DR was NOT cool, and if we had had to run out of Switzerland, most of the company everywhere else would have suffered.

Now to the current company. We are what I would call a really big "little" company. We grew very fast and are frankly scary big for what we do, but we're playing catch-up in a lot of areas. We have data centers on the East Coast (yes, I'm watching the storm and am in contact with the DC every couple of hours) and in Texas (north Texas, not prone to hurricanes off the Gulf). We can switch operations in about 4 hours cleanly, or about 2 hours uncleanly (with something like 3 weeks of cleanup afterwards). We have folks on both coasts who can do most things. Our CIO is in San Diego for a week (once a month) and is staying there for the duration. We have VPN into both data centers with sufficient capacity to support a couple thousand simultaneous users at each site. We have folks set up with virtual desktops to minimize the need for VPN. We can move the call center with the flip of a switch to 3 other offices. We have temporary office space "ready to occupy" in Atlanta, and it's already warm. And we still had 3 days of meetings every couple of hours to play the "what if" game.

What if the primary data center loses power?
Well, they have 72 hours' worth of reserve for the generators, with fuel truck service planned starting at 6 hours in, or as soon as the roads are safe.

What if the data center floods (it's something like 15 feet above the highest water mark ever recorded)?

We will get notice several hours before that happens, initiate a DR event, and cleanly move to the DR site. If we don't get notice, we execute DR into the DR site anyway.

What if there is another event at the DR site (and yes, this got asked)?

*hangs head* Then I submit my resignation and bring you my laptop and badge once the roads clear. Seriously... if that happens, it happens, and we just suck it up and wait for one site or the other to come back online. How many simultaneous acts of God do you think we can prepare for?

What if everyone on the East Coast loses power?

Then we use cell phones to let the West Coast folks know it's all on them... then we drive somewhere with power and charge our phones so we can walk them through anything they don't know (which is minimal). We also have an off-shore contingent that keeps the wheels on the bus overnight and can step in and work during our "day" if we have to.

And on and on. We got REALLY into the weeds and edge cases by the time we all called it quits and left last night. For us, BCP and DR really are joined at the hip, and DR is a component of BCP. For us:

- We have a remote data center where data replicates with a Recovery Point Objective (RPO) of 5 minutes or less; data in the DR site is never more than 5 minutes "old", so we would lose no more than 5 minutes of data/transactions.
- We are almost entirely virtualized, and our DR is almost entirely automated. I can literally hit a red button in a computer console (from either side) and bring up the entire virtualized infrastructure in about 40 minutes. Validation is what takes the longest.
- We have employees on both coasts and off-shore to cover all of the critical tasks and positions.
- We sent people home on Wednesday to prepare at home, or to leave and work "very" remotely (we knew on Wednesday that schools would be closed Thursday and Friday).
- Key employees have laptops (battery power is key here) and work-provided mobile phones with hot-spots enabled, for remote work without local internet.
- I have personally told my people that family comes first and to let me know if they are going to be out of the loop. Take care of yourselves; we have a lot of employees to fill in the gap.
- We have backup power at the site that was tested last weekend for this event. It powers the on-site server room, with 3 days' worth of diesel and plans for more after 24 hours.
- The local building was built a fair bit above the flood level.
- We kept our CIO out of "harm's way" for the duration.
- We have an off-site location for office space (100 seats) that is already activated and "warm".
- Our partners are ready to switch to the DR site links if necessary.
- All affected folks can work remotely via VPN or VDI out of either data center.
- Our call center can switch to one of 3 other locations within a few minutes.
- All our hardware vendors are aware of the situation, and while we haven't staged hardware, we could get up and running in a third site within 3 weeks, or dump to the cloud via our virtualization vendor in less than 2 days (it would just be CRAY CRAY expensive!).

In this case, even though we're not mature, we're actually better off than the last two places I've been, and we're getting better. Hopefully this has shed a little light on BCP and DR, and the anecdotes made reading it a little less painful.
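P.S. For the technically inclined: the 5-minute RPO I mentioned boils down to a simple check you can run against the replica. This is just a hedged sketch to show the concept; the function and timestamps are hypothetical, not our actual tooling:

```python
from datetime import datetime, timedelta, timezone

# Recovery Point Objective: data at the DR site may never be more
# than 5 minutes "old" relative to the primary.
RPO = timedelta(minutes=5)

def rpo_ok(last_replicated: datetime, now: datetime) -> bool:
    """True if the DR replica's lag is within the Recovery Point Objective."""
    return (now - last_replicated) <= RPO

# Hypothetical timestamps: a replica 3 minutes behind is within the
# objective; one 10 minutes behind should trip an alarm.
now = datetime(2018, 9, 14, 12, 0, tzinfo=timezone.utc)
print(rpo_ok(now - timedelta(minutes=3), now))   # True  -> within RPO
print(rpo_ok(now - timedelta(minutes=10), now))  # False -> alarm
```

In practice the "last replicated" timestamp would come from whatever your replication layer reports as its apply lag; the point is that RPO is a measurable promise, not a hope.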