I’m in our daily standup and it’s turned into exchanging fucked up sysadmin redundancy tales.

One place I worked lost a machine room. They’d fired people so fast that nobody remembered where the boxes were any more.

I knew, but they didn’t ask me. Oh well!

The cycle of IT binge and purge is eternal. Post your tales here.

  • Jo Miran
    arrow-up
    24
    ·
    edit-2
    3 months ago

    I have three stories.

    Company X: Our testbed server room was supported by redundant rooftop AC units, many yards apart. During a storm, a lightning bolt forked (split). One tip of the bolt hit AC unit one and the other hit AC unit two, killing both cooling units. To make things worse, the server manufacturer did not add a temperature safety shutdown to the units and instead configured them to spin their fans faster the hotter they got. By the time I got there the cable management was warping and melting due to heat.

    Company Y: The main datacenter was on tower 2 and the backup datacenter was on tower 1. Most IT staff was present when the planes hit.

    EDIT:
    Company Z: I started work at a company where they gave me access to a “test” BigIP (unit 3) to use as my own little playground. Prior to my joining, the company was run by devs doubling as IT. I wanted to delete the old spaghetti-code rules so that I could start from scratch. So, after verifying that no automation was running on my unit (unit 3), I deleted the old rules. Unfortunately, the devs/admins had forgotten to disengage replication on “unit 2” when they gave me “unit 3”. So production “unit 2” deleted its rules and told production “unit 1” to do the same. Poof…production down and units offline. I had to drive four hours to the datacenter and code the entire BigIP from scratch and under duress. I quit that job months after starting. Some shops are run so poorly that they end up fostering a toxic environment.

    • db0@lemmy.dbzer0.com
      arrow-up
      16
      ·
      3 months ago

      Company Y: The main datacenter was on tower 2 and the backup datacenter was on tower 1. Most IT staff was present when the planes hit.

      Well, that’s one dark “redundancy” story…

      • BearOfaTime@lemm.ee
        arrow-up
        7
        ·
        3 months ago

        I don’t understand why they had redundancy so physically close.

        Whatever affects one has a high risk of affecting the other.

        Different regions is a thing for a reason.

        • db0@lemmy.dbzer0.com
          arrow-up
          12
          ·
          edit-2
          3 months ago

          I think the OP was talking about the other “redundancy”, as in “your whole team has been made redundant”. In this context your story is very dark indeed :D

        • Christopher Wood@awful.systems
          arrow-up
          5
          ·
          3 months ago

          It’s probably good to situate in time when thinking about these things. The twin towers were how a lot of companies became examples of what location redundancy really means. These days people are keeping that lesson well in mind, but back then, not so much.

        • bitofhope@awful.systems
          English
          arrow-up
          2
          ·
          3 months ago

          There are tradeoffs to higher and higher grades of redundancy and the appropriate level depends on the situation. Across VMs you just need to know how to set up HA for the system. Across physical hosts requires procuring a second server and more precious Us on a rack. Across racks/aisles might sometimes require renting a whole second rack. Across fire door separated rooms requires a DC with such a feature. Across DCs might require more advanced networking, SDN fabrics, VPNs, BGP and the like. Across sites in different regions you might have latency issues, you might have to hire people in multiple locations or deal with multiple colo providers or ISPs, maybe even set up entire fiber lines. Across states or countries you might have to deal with regulatory compliance in multiple jurisdictions. Especially in 2001 none of this was as easy as selecting a different Availability Zone from a dropdown.

          Running a business always involves accepting some level of risk. It seems reasonable for some companies to decide that if someone does a 9/11 to them, they have bigger problems than IT redundancy.

    • froztbyte@awful.systems
      English
      arrow-up
      7
      ·
      3 months ago

      company X sounds like the sort of bad shit I remember from a DC my side of the world, which was so frequently broken in various states that occasionally you couldn’t even touch the outside doorknobs (the heat would transfer through from the inside)

      Y: oof.

      Z: lol, wow. good on ya for leaving. no point sticking around in that kind of disasterzone. it ain’t ever gonna get fixed by your efforts.

    • YourNetworkIsHaunted@awful.systems
      arrow-up
      2
      ·
      3 months ago

      Company Z:

      Oh God. I worked as an NSE with F5 for seven years and had the opportunity to join dozens of “oh God everything’s fucked” calls, and autosync was a factor in the majority of them. I’m not sure why you would want half-finished virtual servers pushed onto your active production devices, but damn if people weren’t doing that all the blasted time.