How does a Business or Company prepare for a disaster?

Discussion in 'General Survival and Preparedness' started by DarkLight, Sep 15, 2018.

  1. DarkLight

    DarkLight Live Long and Prosper - On Hiatus

    This thread was initiated by a comment from @BTPost in another thread.

    I was recently promoted from Assistant VP to VP at my company and the running joke is that I now have a key to the Executive Bathroom. :) In reality, the responsibilities have changed/increased, with a commensurate increase in authority...and stress. As such, one of the things I am now more involved in is our BCP (Business Continuity Plan/Program) and DR (Disaster Recovery) plan.

    As the initial reply was done from my phone and done quickly, I figured I would start a new thread on what BCP/DR is, how they are implemented (and have been historically) and how they are different.

    First, some definitions (no, not Webster, real life).

    EDP or Electronic Data Processing
    This is a VERY old-school term for "computers" and the jobs they perform but is the term still used by regulatory agencies and BCP/DR plans.

    BCP or Business Continuity Plan (or Program)
    A BCP is focused on continuing key business operations without having to rebuild anything from scratch. BCPs typically focus on key personnel and critical business systems. The point behind BCP is to ensure that those business systems and processes, and the people required to run them, stay available in the event of some catastrophe. In general, I will be focused on "non-manufacturing" types of BCP as that is what I am most familiar with (I'm in IT...go figure).
    Examples of steps taken as part of BCP include:
    • Personnel to manage processes, systems and applications, separated by large enough distances that they are unlikely to be impacted by the same event at the same time. Many DR and BCP plans have begun referring to this as the Katrina Effect based on the fact that in some cases businesses had both of their primary locations within the same region that was affected by Katrina and its aftermath. Now, the minimum distance considered safe by the industry is 150-300 miles between locations. More on that later.
    • Critical business systems, including/especially EDP, are housed in a facility that can survive either the worst possible "realistic" natural disaster or one step below. For example:
      • If you are in an area that is likely to have hurricanes and Cat 3 is the most they have ever recorded, your datacenter/facility should be designed to withstand a Cat 4 (for giggles) or Cat 3.
      • If you are in an area that is prone to power outages...MOVE! Just kidding, the facility should have internal capacity to generate power for no less than the average outage or 72 hours (whichever is greater). Meaning generator or battery power for that long.
      • If you are in an area that is prone to power outages or where outages can be long-term (after a snow/ice storm, hurricane, tornado, tsunami, earthquake or volcanic eruption...yes, that's on the list), there should be plans for fuel delivery to maintain power "indefinitely".
    • "Authority" continuity is planned for in the short and long term:
      • Key personnel may be moved temporarily for the duration of the event
      • Plans are in place to transition authority/responsibility to others in the company should key individuals be inaccessible or unable to perform their duties for an extended period of time (up to and including loss of life)
    • Contingency plans exist for remote work/work from home for both short and long term outages or disasters.
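    The backup power rule of thumb in the list above (the average outage or 72 hours, whichever is greater) is simple enough to sketch in a few lines. This is only an illustration: the function names, the 30-hour average outage and the 50 gallons per hour burn rate are made-up values, not figures from any real facility.

```python
def required_runtime_hours(average_outage_hours: float, floor_hours: float = 72.0) -> float:
    """On-site runtime to plan for: the average outage or the floor, whichever is greater."""
    return max(average_outage_hours, floor_hours)


def fuel_needed_gallons(runtime_hours: float, burn_rate_gph: float) -> float:
    """Fuel to hold on site for that runtime, assuming a constant burn rate in gallons per hour."""
    return runtime_hours * burn_rate_gph


if __name__ == "__main__":
    # Hypothetical site: outages average 30 hours, generator burns 50 gallons per hour.
    hours = required_runtime_hours(30.0)
    print(hours, fuel_needed_gallons(hours, 50.0))
```

    A site whose average outage is longer than 72 hours would plan for the average instead, and anything beyond that is where the fuel delivery contracts come in.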
    Disaster Recovery Plan
    A DR Plan is focused on resuming service as quickly as possible after a disaster has occurred and services have been interrupted. A DR plan assumes that business continuity has been interrupted and must be re-established.
    Examples of steps covered in a DR Plan include:
    • Having a secondary data center, outside of the Katrina Effect zone, with either redundant systems or the ability to immediately scale to redundant systems.
    • Having the ability to transfer workloads from the primary (offline) data center to the backup/DR data center with minimal loss of data (transactions)
    • The ability for customers and employees to use the DR site. For a data center, that would include re-routing traffic from the internet. For a labor-intensive trade, it would include getting people on-site to do the work. For a call center, it would mean getting the calls re-routed and coming into the secondary site.
    • Agreements with partners, vendors and providers to make all of the above happen, as well as get the process started to repair the failed site or stand up a new one. Two is one, one is none. If you are running out of your DR site because your primary site is a smoldering crater (or swimming pool), you're actually running out of your BRAND NEW PRIMARY SITE...with no DR site in existence.
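    Whether a secondary site sits outside the "Katrina Effect zone" can be roughly checked with a great-circle distance calculation. The sketch below uses the standard haversine formula and takes the low end of the 150-300 mile range mentioned earlier as the default threshold. The function names are hypothetical, and straight-line distance is only a first pass: it says nothing about shared power grids, flood plains or storm tracks.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def far_enough(lat1: float, lon1: float, lat2: float, lon2: float,
               minimum_miles: float = 150.0) -> bool:
    """True if the two sites are at least minimum_miles apart."""
    return great_circle_miles(lat1, lon1, lat2, lon2) >= minimum_miles
```

    For scale, one degree of latitude is about 69 miles, so two sites need to be a bit more than two degrees apart north to south to clear the 150 mile mark.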
    I have worked for 3 large companies that had both BCP and DR plans, some better than others, but all covered the same bases and none of them were perfect.

    The first company (let's call them "Spectrometer"...although they had a very different name when I worked there) initially had what I call a "Hope and Pray Plan", as in "I hope and pray nothing goes wrong". Long story short, work was being done on the UPS room when someone tripped and literally hit the big red button. The data center technically ran on UPS all the time to condition the power, but the big red button cut off outside power and was designed to put us on generator. In this case the generator started as expected and was up and providing power in less than 2 minutes (3 minutes to spare on the UPS). Unfortunately, ALL 3 pieces of transfer switch gear failed...ALL 3. The batteries ran out and the entire data center just...turned off. 1500 servers, RACKS of storage, all the network gear, phones, everything. Walking in there was like walking into a scene from Doom II. Almost pitch black with long shadows being cast from the single flickering fluorescent bulb in the back corner...and a beeeeep...beeeeep...beeeeep coming from the alarm panel.

    We were 50% to our DR plan. Disaster...check. Plan...not so much check. We built it on the fly and it was a "bring the site back up" plan. If it had been a smoking crater we'd have all been out of a job and recovery would have taken longer than the market would have accepted. We had backups, but they were all on tape and they hadn't made a pickup yet that day so we'd be a day old on transactions. It would have taken months to get all the hardware we needed and another month to restore all the data. We would have been done, period. But we got damned lucky. This was on a Friday afternoon.

    We were, however, able to get everything back online once power was restored, in about 13 hours with zero loss of data due to transaction logs on the databases and caching on the storage arrays. We lost probably 100 hard drives over the next 72 hours as bearings got VERY cold (AC came back up 6 hours before we powered on the first storage/servers...bad call that one), but with the redundancy built into the systems we lost no data.

    Eventually we acquired another company with their own Tier I data center (highest possible rating) out in the middle of nowhere that had 8 days of reserve power on-site with two different rail spurs going to the data center and the ability to draw diesel directly from a train tanker car if necessary while refilling the underground tanks. The remote site also had sufficient spur track to hold 3 tanker cars on each spur for a total of something like 1000 years of diesel (no, not really, but WAY longer than necessary).

    The previous company I worked for (we'll call them "If I try to get cute with their name and someone finds out who they are they will probably sue me so I'm not even going to play that game"..."A-Holes" for short) had a BCP plan that had people in over 100 countries and every continent except for Antarctica, with redundant pairs of data centers in the US and EU (Switzerland in fact). Great setup, right? Yeah, except that both US data centers were about 100 miles apart...in FLORIDA! And along comes hurricane season. I still wake up in a cold sweat many nights from September through late October! BCP was great, it really was, the company was a freakin' hydra. But DR was NOT cool and if we had to run out of Switzerland, most of the company everywhere else would have suffered.

    Now to the current company. We are what I would call a really big "little" company. We grew very fast and are frankly scary big for what we do. But we're playing catch up in a lot of areas. Data centers on the East Coast (yes, I'm watching the storm and in contact with the DC every couple of hours), and in Texas (north, not prone to hurricanes in the gulf). We can switch operations in about 4 hours cleanly or about 2 hours uncleanly (with something like 3 weeks of cleanup afterwards). We have folks on both coasts who can do most things. Our CIO is in San Diego for a week (once a month) and is staying there for the duration. We have VPN into both data centers with sufficient capacity to support a couple thousand simultaneous users in each site. We have folks set up with virtual desktops to minimize the need for VPN. We can move the call center with the flip of a switch to 3 other offices. We have temporary office space "ready to occupy" in Atlanta and it's already warm.

    And we still had 3 days of meeting every couple of hours to play the "what if" game.

    What if the primary data center loses power? Well, they have 72 hours worth of reserve for the generators and service planned for trucks starting at 6 hours or as soon as the roads are safe.

    What if the data center floods (it's like 15 feet above the highest water mark recorded)? We will get notice several hours before that happens and will initiate a DR event and cleanly move to the DR site. If we don't get notice, we execute DR into the DR site.

    What if there is another event at the DR site (and yes, this got asked)? *hangs head* Then I submit my resignation and I bring you my laptop and badge once the roads clear. Seriously...if that happens, it happens and we just suck it up and wait for one or the other to come back online. How many simultaneous acts of God do you think we can prepare for?

    What if everyone on the east coast loses power? Then we use cell phones to let the west coast folks know it's all on them...then we drive somewhere with power and charge our phones so we can walk them through anything they don't know (which is minimal). We also have an off-shore contingent that keeps the wheels on the bus overnight that can step in and work during our "day" if we have to.

    And on and on. We got REALLY into the weeds and edge cases by the time we all just called it quits and left last night.

    For us, BCP and DR really are tied at the hip and DR is a component of BCP. For us:
    • We have a remote data center where data replicates with a Recovery Point Objective of 5 minutes or less (data in the DR site is never more than 5 minutes "old"). We would lose no more than 5 minutes of data/transactions.
    • We are almost entirely virtualized and our DR is almost entirely automated. I can literally hit a red button in a computer console (from either side) and bring up the entire virtualized infrastructure in about 40 minutes. Validation is what takes the longest.
    • We have employees on both coasts and off-shore to cover all of the critical tasks and positions.
    • We sent people home to prepare at home or leave and work "very" remotely on Wednesday (we knew on Wed that schools were closed on Thursday and Friday).
    • Key employees have laptops (battery power is key here) and work provided mobile phones with hot-spot enabled for remote work without local internet.
    • I personally have told my people that family comes first and to let me know if they are going to be out of the loop. Take care of yourself, we have a lot of employees to fill in the gap.
    • We have backup power at the site that was tested last weekend for this event. It powers the on-site server room with 3 days worth of diesel and plans for more after 24 hours.
    • The local building was built a fair bit above the flood level.
    • We kept our CIO out of "harm's way" for the duration.
    • We have an off-site location for office space (100 seats) that is already activated and "warm"
    • Our partners are ready to switch to the DR site links if necessary
    • We have the ability for all affected folks to work remote via VPN or VDI out of either data center.
    • Our call center can switch to one of 3 locations within a few minutes.
    • All our hardware vendors are aware of the situation and while we haven't staged hardware, we could get up and running in a third site within 3 weeks and dump to the cloud via our virtualization vendor in less than 2 days (it would just be CRAY CRAY expensive!)
    In this case, even though we're not mature, we're actually better off than the last two places I've been, and we're getting better. Hopefully this has shed a little light on BCP and DR and the anecdotes made reading it a little less painful.
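    The Recovery Point Objective in the list above boils down to one comparison: how old is the newest data at the DR site, and is that within 5 minutes? A minimal sketch of that check follows. The function names and the timestamps are invented for illustration; how a real replication product reports its last replicated time will differ.

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=5)  # the 5 minute objective described in the post


def replication_lag(last_replicated_at: datetime, now: datetime) -> timedelta:
    """How far behind the DR copy is, i.e. the data that would be lost if the primary died now."""
    return now - last_replicated_at


def within_rpo(last_replicated_at: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """True if the DR copy is no older than the recovery point objective."""
    return replication_lag(last_replicated_at, now) <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps: the DR copy is 3 minutes behind.
    now = datetime(2018, 9, 15, 12, 0, 0)
    print(within_rpo(now - timedelta(minutes=3), now))
```

    A monitoring job that runs a check like this every minute or so is one way to find out the objective is being missed before a disaster does it for you.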
  2. Meat

    Meat Monkey+++

    That’s a big read that unfortunately I don’t understand mostly. We have a running joke in case of disaster as well. Tie up the two supervisors and lock them in a room. The alternative for me would be walk off and quit. Easy as pie for a Lineman. Maybe nah for others. :D
  3. ghrit

    ghrit Bad company Administrator Founding Member

    :lol: Slightly (well, a LOT) off topic, but the running joke when I was in the field was that if something went off the rails, I'd fire the crew and quit.
    Back to the topic. I've always questioned the idea of central controls of any operation that could be disrupted by momma nature or more malevolent organizations. Complete, separate, ops centers with 100% connectivity is that thought process (two is one, one is none thinking.) The downside of that is duplicate staffing and overly redundant talents.
    Gator 45/70, mysterymet and Meat like this.
  4. DarkLight

    DarkLight Live Long and Prosper - On Hiatus

    From a talent standpoint, you can usually double people up in a pinch and just have people do something different/more important for the short term. I agree on the "all eggs in one basket" approach. The difficulty most companies that are reliant on IT run into is the speed of light. Unless you have systems (computers and applications) that can run in an active/active way (Amazon, Netflix, Google) where you can tolerate "lag" in data synchronization, some distances are just too far to be useful.

    Data can be replicated in a synchronous or asynchronous way. Synchronous just means it's close enough that the systems involved can keep the data completely synchronized with no visible impact to the carbon based life form using the data. Asynchronous just means that there will always be a short time where the data is out of sync.

    Go more than about 60 miles and Newton (the bastard!) and his frigging laws kick in and you just can't keep things in perfect sync without spending a TON of money and using very expensive compression. 100 miles or so and all bets are off, the speed of light prevents synchronous replication. So you spend enough money to have a full duplicate in another location, and you still have risk of data loss.
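    To put rough numbers on the distance problem: a synchronous write is not acknowledged until the remote site confirms it, so every write pays at least one round trip between sites. The sketch below computes only the physical floor on that round trip. It assumes light in fiber travels at about two-thirds of its vacuum speed and that the fiber runs in a straight line, neither of which is exact. Real routes are longer, and switches, storage arrays and protocol overhead add delay on top, so practical limits are tighter than these figures suggest. The function names are hypothetical.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 0.67  # assumption: light in fiber moves at roughly two-thirds of c
KM_PER_MILE = 1.609344


def round_trip_ms(distance_miles: float, velocity_factor: float = FIBER_VELOCITY_FACTOR) -> float:
    """Lower bound on round-trip time in milliseconds over a straight fiber run."""
    one_way_s = distance_miles * KM_PER_MILE / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor)
    return 2 * one_way_s * 1000


def max_serial_sync_writes_per_second(distance_miles: float) -> float:
    """Upper bound on back-to-back synchronous writes if each one waits a full round trip."""
    return 1000 / round_trip_ms(distance_miles)


if __name__ == "__main__":
    for miles in (60, 100, 300):
        print(miles, round(round_trip_ms(miles), 2), round(max_serial_sync_writes_per_second(miles)))
```

    Under these assumptions, 100 miles costs about 1.6 ms per round trip before any equipment gets involved, which caps a strictly serialized stream of writes at a few hundred per second. A busy transactional database can feel that even when the person at the keyboard can't.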

    And for me it's not a joke. If both DCs go dark, I'll tender my resignation by email and you'll get a package with my laptop and badge sent book rate.
    Last edited: Sep 15, 2018
    Gator 45/70, Tempstar and sec_monkey like this.
  5. SB21

    SB21 Monkey+++

    Not to be off subject here,,,but how are you faring out there ? I do know I wouldn't want to be doing electrical work while dripping wet ,,,,what county are you working in ?
    Gator 45/70, sec_monkey and DarkLight like this.
  6. Meat

    Meat Monkey+++

    I’m in a retirement type of position now. Storm work is a thing of the past mostly. As far as rain I prefer it over heat. It’s not even close. :D Oops. I derailed this thread slightly. My apologies.
    Yard Dart, SB21 and DarkLight like this.
  7. DKR

    DKR Raconteur of the first stripe

    When I worked as the Business Continuity Manager at a major utility, the hardest thing I had to accomplish was convincing the senior management that in a real-deal, wide area disaster, the majority of the employees would simply stay home and try to protect their families, homes and material possessions.

    The next battle was how to provide 'basic services' for the employees that did show up - like food, water and sanitation.

    It was a no-win situation. If an event happened - like a wide area power outage, and the company had minimal impact - we were "lucky" - lucky we had working gensets, fuel for the gensets, trained operators for the gensets, power connections pre-wired at each facility, a standing contract from more than one vendor to deliver fuel on site daily...all luck.
    The mitigation planning and background work were all but invisible to senior managers who seemed only capable of bitching about 'cost' for stuff that didn't add to their quarterly revenue/profit - and their personal bonuses.

    Of course, if there was a hiccup, you can guess who was / would be the scapegoat.....

    I work as a technical writer now - far less stress, and the pay is better. Best part, almost no meetings to try and stay awake through!
  8. Zimmy

    Zimmy Wait, I'm not ready!

    When I was in the power industry, our local power plant had 17 employees. We kept food, water, and cots for 60 people for 3 months.

    The worst case assumed survivable was a pandemic. The plan was to gather all the families into the facility and lock it down.

    Now I work at a major hospital, I'll roll with it until it becomes untenable. The official plan is everyone in facilities comes to work and can't leave without termination. That's not realistic. [finger]
    oldawg, Gator 45/70, Brokor and 2 others like this.
  9. DarkLight

    DarkLight Live Long and Prosper - On Hiatus

    If I owned a business that employed people, I would endeavor to have a large enough facility that I could house all of the employees and their families for some fixed amount of time. Being in IT, I could probably do it too since IT can be done almost anywhere. Keep the number of employees to a sustainable amount and have the ability to turn the facility into a compound for a couple of weeks.

    Where I work now? Never happen. We have roughly 2000 employees on site in two 6-story buildings full of cubes. It would be reminiscent of the SuperDome after Katrina.
  10. DKR

    DKR Raconteur of the first stripe

    Clearly senior management considers the employee base to be 'fungible'. I'd keep a bicycle at work.....
    Gator 45/70 and sec_monkey like this.
  11. Tempstar

    Tempstar Monkey+++

    Did that. They told me to stay during hurricane Matthew. I told them to sue me or fire me or get off my ass about it. I went home and called their bluff and they didn't do anything. I've since happily moved on....
    Gator 45/70 likes this.
  12. duane

    duane Monkey+++

    Business and government always assume that if they stock up for employees for x days (in my last case 120 days) with food, fuel, toilet paper, etc., we would just hunker down and do our jobs. No provisions for family or others, and we were supposed to be happy with it. I don't know if I would have stayed or gone to work if things were truly SHTF, but it was one of the reasons for a career change and a family survival plan. Except for atomic war or EMP, I figure it will be a frog in a pot situation (economic collapse, political instability, disease, etc.), and we will be well into it before we realize there is a problem. Thus I long ago decided to live in such a way that if 6 months into the situation I realized what was going on, I would still be OK.

    The real war is on data at present. All my financial records, Medicare medical records, Social Security records, bank records, tax records, etc., are on computers and either not backed up on paper or, if they are, it would be impossible to recreate the electronic data in a short time. The real danger in my mind is not the loss of data but the changing of data; identity theft is just the tip of the iceberg. If data is randomly changed, money transferred from one account to another, tax information credited to the wrong account, or credit card charges randomly changed, etc., none of the information can be trusted and the system does not work. All the backups will not handle corrupted data. I lost a computer and a week's data to a ransomware attack; my first backup was encrypted as well as my computer, and they wanted to sell me the key.
    Gator 45/70 likes this.
  13. Asia-Off-Grid

    Asia-Off-Grid RIP 11-8-2018

    I bet you have been busier than a prostitute on two-for-one night, if on the east coast, haven't ya?
    Gator 45/70 likes this.
  14. Meat

    Meat Monkey+++

    I’m in the PNW. :D
    Gator 45/70 likes this.
  15. Asia-Off-Grid

    Asia-Off-Grid RIP 11-8-2018

    You're just sayin' that because you don't wanna be a prostitute. Fine. Be that way.

    Congratulations, I think? (I can only imagine the amount of stress involved in such positions.)
    Last edited: Sep 16, 2018
    Gator 45/70 and Brokor like this.
  16. Meat

    Meat Monkey+++

    Well it’s a family forum and it’s also the truth. :D
  17. Brokor

    Brokor Live Free or Cry Moderator Site Supporter+++ Founding Member

    @Meat So, you're not from Georgia then? :D
  18. Meat

    Meat Monkey+++

    Nah. My kind don’t like the heat or humidity. :D
  19. fedorthedog

    fedorthedog Monkey+++

    My experience is in the public safety area, not data, but one very overlooked item I found in my department was a lack of any type of support for the employees in the affected site. I.e., we had no way to feed or water our people. In the end I put together a 5 day plan for feeding a group of city employees to allow people to be able to function and respond.

    It seems you have a handle on the hardware and data end, but if you can't get the employees out you will need some plan for them.
    Gator 45/70, Zimmy and DarkLight like this.
  20. DarkLight

    DarkLight Live Long and Prosper - On Hiatus

    Agreed and I haven't got any insight into the company plans for this. Which is why all of my folks were sent home two days prior to anything scheduled. Doesn't help for an unscheduled event though.
    Gator 45/70 and Zimmy like this.