I am currently looking into High Availability for my work setup, but I am having trouble understanding how to achieve it. I have two servers: one runs libvirt and a couple of VMs, the other is mostly idle so far.

To achieve HA with keepalived, I would have to set up the exact same VMs on the second server, right? If that’s the case, how would I make sure that the “mirrors” stay in sync? For example, if the master goes down and the backup takes over, some changes are made in a DB, and the master knows nothing about these changes.

Maybe I have misunderstood keepalived so far. Can somebody provide me with an example setup or hints on how to achieve what I want to do?

Kind Regards

g7s

  • kimli
    1 year ago

    It’s been a few years since I used keepalived so my knowledge might be outdated.

    You are correct that the VMs should be on different servers. For testing you can set them up on the same host, but this shouldn’t be done in production: if you lose the host, you lose the service.

    Keepalived will make sure your service is available at an IP. Say you have two servers (it can be configured for more than two), (A) 192.168.0.2 and (B) 192.168.0.3, which both provide the service. With keepalived you configure a common IP for both of them, let’s say 192.168.0.4.

    In normal operation, server A will be available at 192.168.0.2 and 192.168.0.4, while server B will be available at 192.168.0.3. If server A fails, keepalived will “move” 192.168.0.4 to server B, so 192.168.0.2 will not be available and server B will be available at 192.168.0.3 and 192.168.0.4.

    No matter which server is up / primary, your service will always be available at 192.168.0.4.
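
    To make the address movement concrete, here is a small Python model of the example above. It is only a sketch of the idea, not keepalived itself: the `Server` class, the priority values and the function names are made up for illustration, and a real setup would be done in keepalived’s own configuration.

```python
from dataclasses import dataclass

VIRTUAL_IP = "192.168.0.4"  # the shared address from the example above


@dataclass
class Server:
    name: str
    own_ip: str
    priority: int       # hypothetical value; the highest live priority holds the shared IP
    alive: bool = True


def elect_holder(servers):
    """Return the live server with the highest priority, or None if all are down."""
    live = [s for s in servers if s.alive]
    return max(live, key=lambda s: s.priority) if live else None


def reachable_addresses(servers):
    """Map each live server's name to the addresses it currently answers on."""
    holder = elect_holder(servers)
    result = {}
    for s in servers:
        if not s.alive:
            continue
        addrs = [s.own_ip]
        if s is holder:
            addrs.append(VIRTUAL_IP)  # only the elected holder carries the shared IP
        result[s.name] = addrs
    return result


a = Server("A", "192.168.0.2", priority=150)
b = Server("B", "192.168.0.3", priority=100)

before = reachable_addresses([a, b])
a.alive = False  # simulate server A failing
after = reachable_addresses([a, b])

print(before)
print(after)
```

    Before the failure, A answers on its own address plus the shared one. Afterwards only B is listed, and it has picked up 192.168.0.4.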

    The mirroring part has to be solved in a separate step, outside of keepalived. For example, MariaDB provides multi-master replication “out of the box” with Galera (the recommendation is at least 3 nodes).

    For files, depending on your filesystem, you could use rsync, shared storage, or a distributed filesystem (Ceph), …

    • g7sOP
      1 year ago

      Thank you for the explanation. I might look into heartbeat, as suggested by @arbiter. I understand now that keepalived only works at the IP layer and does not help me with mirroring my actual VMs. For that I will look into other technologies.

  • arbiter
    1 year ago

    If you plan on using this in a production environment, I’d bring in a consultant.

    However, I’ve heard of people in the home-lab sphere using things like heartbeat and DRBD. The more nodes the merrier: if you lose the connection between the two, you’ll have a bad time.
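
    One way to see why two nodes are awkward is the majority (quorum) rule that many cluster tools apply. The Python sketch below is a generic illustration of that rule, not the behaviour of heartbeat or DRBD specifically; the function names are invented for the example.

```python
def has_quorum(visible_nodes: int, cluster_size: int) -> bool:
    """True if a partition can see a strict majority of the whole cluster."""
    return visible_nodes > cluster_size // 2


def partitions_allowed_to_run(partition_sizes):
    """Given the sizes of the groups after a network split, return the sizes that keep quorum."""
    cluster_size = sum(partition_sizes)
    return [size for size in partition_sizes if has_quorum(size, cluster_size)]


two_node_split = partitions_allowed_to_run([1, 1])    # each node only sees itself
three_node_split = partitions_allowed_to_run([2, 1])  # one node is cut off from the other two

print(two_node_split)
print(three_node_split)
```

    With two nodes, a lost link leaves neither side with a majority, so neither can safely conclude that it should be the active one. With three nodes, the pair that can still see each other keeps running.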

    • g7sOP
      1 year ago

      I’m the only IT admin for our department. Everything runs fine, and nightly downtimes for upgrades etc. are acceptable. However, I want to make it more available. Thanks for the suggestions, I will look into them :)

      • arbiter
        1 year ago

        Other data replication technologies worth looking into: GlusterFS, Ceph.

        Depending on your databases, they may offer replication out of the box.

        You can also put a load balancer, such as HAProxy or Nginx, in front to distribute incoming network traffic across multiple VMs.

  • taladar@sh.itjust.works
    1 year ago

    Whatever technology you end up using, be aware that running things in an HA way increases complexity by an order of magnitude or two. For a while (and possibly even in the long term), that is very likely to cause additional downtime instead of reducing it.

    Network block devices on clusters like Ceph, and distributed filesystems, have many more failure modes than the underlying storage hardware alone, due to their distributed nature. Clustered services are similar. You might also see new performance bottlenecks emerge (e.g. your network might be significantly slower in both latency and throughput than modern local SSD or NVMe storage), and services that become temporarily unavailable when failover happens too often.

    My advice would be to run something like that only on a dev/test system that sees real use for at least a few months, to learn what to do when things go wrong, before you even consider using it in production.

    • g7sOP
      1 year ago

      Thank you for the insight. I will think about it more and set up a test lab. We have 2.5Gbit switches, so I hope the network won’t be a bottleneck.
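
      For a rough feel of what 2.5Gbit means for replication traffic, here is a back-of-the-envelope calculation in Python. The 50 GB image size is a made-up example figure, and the result is a theoretical best case that ignores protocol overhead, latency and disk speed.

```python
LINK_GBIT_PER_S = 2.5  # switch speed mentioned above
IMAGE_SIZE_GB = 50     # hypothetical VM disk image size, for illustration only


def max_throughput_mb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to MB/s (decimal units, 8 bits per byte)."""
    return gbit_per_s * 1000 / 8


def min_copy_seconds(size_gb: float, gbit_per_s: float) -> float:
    """Best-case time to push size_gb over the link, ignoring all overhead."""
    return size_gb * 1000 / max_throughput_mb_per_s(gbit_per_s)


throughput = max_throughput_mb_per_s(LINK_GBIT_PER_S)
copy_time = min_copy_seconds(IMAGE_SIZE_GB, LINK_GBIT_PER_S)

print(f"{throughput} MB/s at most")
print(f"{copy_time} s at best to copy {IMAGE_SIZE_GB} GB")
```

      So the link tops out at about 312.5 MB/s, and a full copy of a 50 GB image would take at least 160 seconds. Whether that is a bottleneck depends on how much data actually changes and has to be replicated.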