I have a 2-bay NAS, and I was planning on using 2x 18 TB HDDs in RAID 1. I was planning on purchasing 3 of these drives so that when one fails I have a replacement on hand. (I am aware that you should purchase them at different times to reduce the risk of them all failing at the same time.)

Then I set up restic.

It makes backups so easy that I am wondering if I should even bother with raid.

Currently I have ~1 TB of backups, and with restic’s snapshots, it won’t grow to be that big anyway.

Either way, I will be storing the backups in AWS S3. So is it still worth it to use RAID? (I will also be storing backups at my parents’ place.)

  • Atemu
    1 month ago

    Note that you do not need any sort of redundancy to detect corruption.

    Redundancy only gains you the ability to have that corruption immediately and automatically repaired.

    While this sounds nice in theory, you have no use for such auto repair if you have backups handy because you can simply restore that data manually using your backups in the 2 times in your lifetime that such corruption actually occurs.
    (If you do not have backups handy, you should fix that before even thinking about RAID.)

    It’s incredibly costly to have such redundancy at a disk level and you’re almost always better off using those resources on more backups instead if data security is your primary concern.
    Downtime mitigation is another story but IMHO it’s hardly relevant for most home users.

    • beastlykings@sh.itjust.works
      1 month ago

      Can you explain this to me better?

      I need to work on my data storage solution, and I knew about bit rot but thought the only solution was something like a zfs pool.

      How do I go about manually detecting bit rot? Assuming I had perfect backups to replace the rotted files.

      Is a zfs pool really that inefficient space wise?

      • Atemu
        1 month ago

        Sure :)

        I knew about bit rot but thought the only solution was something like a zfs pool.

        Right. There are other ways of doing this, but a checksumming filesystem such as ZFS or btrfs (or bcachefs if you’re feeling adventurous) is the best way to do it generically, and it can also be combined with other methods.

        What you generally need in order to detect corruption on an abstract level is some sort of “integrity record” which can determine whether some set of data is in an expected or an unexpected state. The difficulty here is keeping that record up to date with the expected changes to the data.
        The filesystem sits at a very good place to implement this because it handles all such “expected changes”; executing them on behalf of the running processes is its purpose.

        Filesystems like ZFS and btrfs implement this integrity record in the form of hashes of smaller portions of each file’s data (“extents”). The hash for each extent is stored in the filesystem metadata. When any part of a file is read, the extents that make up that part of the file are each hashed and the results are compared with the hashes stored in the metadata. If the hashes match, all is good and the read succeeds; if they don’t, the read fails and the application reading that portion of the file gets an I/O error that it needs to handle.

        Note how there was never any second disk involved in this. You can do all of this on a single disk.
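
        As a rough illustration of the idea described above, here is a minimal Python sketch of a per-chunk integrity record. It is not how ZFS or btrfs are implemented; the chunk size, the use of SHA-256, and the names `build_checksums` and `verified_read` are all assumptions made for the example. It only shows that detection needs nothing but the data and its stored hashes.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size; real filesystems choose their own


def build_checksums(data: bytes) -> list[str]:
    """Hash each fixed-size chunk; this list plays the role of the integrity record."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def verified_read(data: bytes, checksums: list[str], chunk_index: int) -> bytes:
    """Return one chunk, or raise an I/O error if its hash no longer matches."""
    start = chunk_index * CHUNK_SIZE
    chunk = data[start:start + CHUNK_SIZE]
    if hashlib.sha256(chunk).hexdigest() != checksums[chunk_index]:
        raise OSError(f"checksum mismatch in chunk {chunk_index}")
    return chunk


if __name__ == "__main__":
    original = bytes(range(256)) * 64       # 16 KiB of sample data
    record = build_checksums(original)      # stored alongside the data

    # Simulate bit rot: flip one bit in the second chunk of the same "disk".
    rotted = bytearray(original)
    rotted[CHUNK_SIZE + 10] ^= 0x01

    # The untouched chunk still reads fine.
    print(verified_read(bytes(rotted), record, 0) == original[:CHUNK_SIZE])
    try:
        verified_read(bytes(rotted), record, 1)
    except OSError as err:
        print("read failed:", err)
```

        The sketch can tell you that the second chunk is bad, but it has no way of producing the correct bytes again. That is the part that needs either redundancy or a backup.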

        Now to your next question:

        How do I go about manually detecting bit rot?

        In order to detect whether any given file is corrupted, you simply read back that file’s content. If you get an error due to a hash mismatch, it’s bad; if you don’t, it’s good. It’s quite simple really.

        You can then simply expand that process to all the files in your filesystem to see whether any of them have gotten corrupted. You could do this manually by just reading every file in your filesystem once and reporting errors, but those filesystems usually provide a ready-made tool for that which is more tightly integrated with the filesystem code. The conventional name for this process is a “scrub”.
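
        The manual variant (read every file once and report errors) could look roughly like the Python sketch below. The function name `scrub` and the block size are made up for the example, and it assumes the files live on a checksumming filesystem that turns a hash mismatch into a read error. Data that is still in the page cache may be served without touching the disk, which is one more reason to prefer the filesystem’s own scrub when it has one.

```python
import os
import tempfile


def scrub(root: str, block_size: int = 1 << 20) -> list[tuple[str, str]]:
    """Read every regular file under root once; collect (path, error) for failed reads."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # only check real file data, not link targets
            try:
                with open(path, "rb") as handle:
                    # Discard the data; we only care whether the read succeeds.
                    while handle.read(block_size):
                        pass
            except OSError as err:
                failures.append((path, str(err)))
    return failures


if __name__ == "__main__":
    # Demo on a throwaway directory; point it at a real mount point instead.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "sample.txt"), "w") as handle:
            handle.write("hello")
        for path, error in scrub(root):
            print("unreadable:", path, error)
```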

        How do I go about manually detecting bit rot? Assuming I had perfect backups to replace the rotted files.

        You let the filesystem-specific scrub run and it will report every file that contains corrupted data.

        Now that you know which files are corrupted, you simply replace those files from your backup.

        Done; no more corrupted files.
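
        For illustration, here is a small Python sketch of that replace step. It assumes the backup is a plain directory tree that mirrors the live one, which is a simplification; with a tool like restic you would use its own restore command instead. The function name `restore_from_backup` and the paths are hypothetical.

```python
import os
import shutil
import tempfile


def restore_from_backup(corrupted: list[str], live_root: str, backup_root: str) -> list[str]:
    """Overwrite each corrupted file with its copy from a mirrored backup tree."""
    restored = []
    for path in corrupted:
        relative = os.path.relpath(path, live_root)
        source = os.path.join(backup_root, relative)
        if not os.path.isfile(source):
            print("no backup copy for", relative)
            continue
        shutil.copy2(source, path)  # copies the content and the timestamps
        restored.append(path)
    return restored


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as live, tempfile.TemporaryDirectory() as backup:
        bad_file = os.path.join(live, "photo.jpg")
        with open(bad_file, "wb") as handle:
            handle.write(b"rotted bytes")
        with open(os.path.join(backup, "photo.jpg"), "wb") as handle:
            handle.write(b"good bytes")

        print(restore_from_backup([bad_file], live, backup))
```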

        Is a zfs pool really that inefficient space wise?

        Not a ZFS pool per se, but redundant RAID in general. And by “incredibly costly” I mean costly for the purpose of immediately restoring data rather than doing it manually.

        There actually are use-cases for automatic immediate repair but, in a home lab setting, it’s usually totally acceptable for e.g. a service to be down for a few hours until you e.g. get back from work to restore some file from backup.

        It should also be noted that corruption is exceedingly rare. You will encounter it at some point, which is why you should protect yourself against it, but it’s not like this will happen every few months; it’s closer to once every few decades.

        To answer your original question directly: No, ZFS pools themselves are not inefficient as they can also be used on a single disk or in a non-redundant striping manner (similar to RAID0). They’re just the abstraction layer at which you have the choice of whether to make use of redundancy or not and it’s redundancy that can be wasteful depending on your purpose.

        • Andres Salomon@social.ridetrans.it
          1 month ago

          @Atemu @beastlykings Every few decades seems optimistic. I have an archive of photos/videos from cameras and phones spanning from the early 2000s to the mid-2010s. There’s not a lot, maybe 6 GB; a few thousand files. At some point around the end of that time period, I noticed corruption in some random photos.

          Likewise, I have a (3 TB) flac archive, which is about 15-20 years old. Nightly ‘flac -t’ checks are done on 1/60th of the archive, essentially a scrub. Bitrot has struck a dozen times so far.
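
          One way such a rotating nightly check could be arranged is sketched below in Python. This is a guess at the general shape, not the actual setup described above: the hash-based bucketing, the 60-day cycle constant, and the names `bucket_of`, `todays_slice` and `flac_ok` are assumptions for the example. The only external piece is the `flac -t` test mode itself.

```python
import hashlib
import os
import subprocess
from datetime import date

CYCLE_DAYS = 60  # the whole archive is covered once per cycle


def bucket_of(path: str, cycle: int = CYCLE_DAYS) -> int:
    """Map a path to a stable bucket in [0, cycle) using a hash of its name."""
    digest = hashlib.sha256(os.fsencode(path)).digest()
    return int.from_bytes(digest[:8], "big") % cycle


def todays_slice(paths: list[str], day: date, cycle: int = CYCLE_DAYS) -> list[str]:
    """Pick the files whose bucket matches today's position in the cycle."""
    today = day.toordinal() % cycle
    return [p for p in paths if bucket_of(p, cycle) == today]


def flac_ok(path: str) -> bool:
    """Run `flac -t` on one file; a non-zero exit status means the test failed."""
    result = subprocess.run(["flac", "-t", "--silent", path])
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical file list; a real run would walk the archive directory
    # and call flac_ok() on every path in today's slice.
    archive = [f"music/album{i:03d}/track.flac" for i in range(300)]
    for path in todays_slice(archive, date.today()):
        print("would test:", path)
```

          Hashing the path keeps each file in the same slot from night to night, so every file gets tested once per cycle even as new files are added.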

          • Atemu
            23 days ago

            Interesting. I suspect you must either have had really bad luck or be using faulty hardware.

            In my broad summarising estimate, I only accounted for relatively modern disks, like something made in the past 5 years or so. Drives from the 2000s or early 2010s could be significantly worse, and I wouldn’t be surprised if they were. It sounds to me like your experience was with drives that are well over a decade old at this point.

            • Andres Salomon@social.ridetrans.it
              23 days ago

              @Atemu Well yes, this is my experience from self-hosting for close to 25 years, with a mix of drives over those years. I have noticed much better quality drives in the past decade (helium hdds running cooler/longer, nvram, etc) with declining failure rates and less corruption.

              But especially if you’re talking about longer time scales like that (“every few decades”), it’s difficult to account for technology changes.

              • Andres Salomon@social.ridetrans.it
                23 days ago

                @Atemu Drives from the mid/late-2000s in particular were just poorly behaved for me. Recent drives (2014+) have been much better. Who knows how 2030s drives will behave? So I will continue scrubbing data as I swap out older drives for newer ones.

                • Atemu
                  23 days ago

                  Oh absolutely; I would never advocate against verifying your data’s integrity.

          • Andres Salomon@social.ridetrans.it
            1 month ago

            @Atemu @beastlykings Only once or twice it’s been severe (where, say, tens or even hundreds of files mysteriously are corrupt). The vast majority of the time though, it’s limited to a single file. I’d say once/yr a single file is corrupt in that archive and I restore from backup.

        • beastlykings@sh.itjust.works
          1 month ago

          Thanks for the write-up!

          I see now I was conflating zfs with RAID in general. It makes sense that you could have the benefits of a checksumming filesystem without the need for RAID, by simply restoring from backups.

          This is a great start for me to finally get some local backups going.

    • Count042
      1 month ago

      backups in the 2 times in your lifetime that such corruption actually occurs.

      What are you even talking about here? This line invalidates everything else you’ve said.

      • Atemu
        1 month ago

        I was wondering whether I should elaborate on this when I wrote the previous reply.

        At the scale of most home users (~dozens of TiBs), corruption is actually quite unlikely to happen. It’ll happen maybe a handful of times in your lifetime if you’re unlucky.

        Disk failure is actually also not all that likely (maybe once every decade or so, maybe) but still quite a bit more likely than corruption.

        Just because it’s rare doesn’t mean it never happens or that you shouldn’t protect yourself against it though. You don’t want to be caught with your pants down when it does actually happen.

        My primary point, however, is that backups are sufficient to protect against this hazard and also protect you against quite a few other hazards. There are many such hazards, and a hard drive failing isn’t even the most likely among them (that’d be user error).
        If you care about data security first and foremost, you should therefore prioritise more backups over downtime mitigation technologies such as RAID.