I’m in the process of starting a proper backup solution. However, over the years I’ve copy-pasted home directories from different systems as a quick and dirty solution. Now I have to pay off my technical debt and remove the duplicates. I’m looking for a deduplication tool that would:

  • accept a destination directory
  • delete the source locations after the operation
  • delete the redundant copy if the file contents are the same
  • move the file and change its name to avoid a name collision if the file contents are different

I tried doing it in Nautilus, but it only looks at the file name, not the file content. E.g. if two photos have the same content but different names, it will still create a redundant copy.
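For illustration, the requirements above could be sketched roughly like this in Python. This is not an existing tool, just a minimal sketch: the names `merge_into`, `file_digest` and `free_name` are made up here, and it assumes the source and destination are separate trees and that only regular files matter (symlinks and special files are left where they are).

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def free_name(target: Path) -> Path:
    """Return `target`, or name.1.ext, name.2.ext, ... if the name is taken."""
    candidate, counter = target, 0
    while candidate.exists():
        counter += 1
        candidate = target.with_name(f"{target.stem}.{counter}{target.suffix}")
    return candidate


def merge_into(source: Path, dest: Path) -> None:
    """Move files from `source` into `dest`, dropping copies whose content
    already exists anywhere in `dest` (regardless of file name)."""
    # Index the content that is already present in the destination.
    seen = set()
    for dirpath, _dirs, names in os.walk(dest):
        for name in names:
            path = Path(dirpath) / name
            if path.is_file() and not path.is_symlink():
                seen.add(file_digest(path))

    # Walk bottom-up so emptied source directories can be removed on the way.
    for dirpath, _dirs, names in os.walk(source, topdown=False):
        rel = Path(dirpath).relative_to(source)
        for name in names:
            src_file = Path(dirpath) / name
            if src_file.is_symlink() or not src_file.is_file():
                continue  # assumption: only regular files are handled
            digest = file_digest(src_file)
            if digest in seen:
                src_file.unlink()  # same content: delete the redundant copy
                continue
            target = free_name(dest / rel / name)  # rename on name collision
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src_file), str(target))
            seen.add(digest)
        try:
            os.rmdir(dirpath)  # only succeeds if the directory is now empty
        except OSError:
            pass  # something was skipped; leave the directory in place
```

Hashing every file is slow on large trees; comparing file sizes first would avoid most of the reads, but that is left out to keep the sketch short.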

Edit: Some comments suggested using duperemove on btrfs. This replaces identical file content with pointers to the same location. That is not what I intend; I want to remove the redundant files completely.

Edit 2: Another quite cool solution is to use hardlinks. It will replace all occurrences of the same data with a hardlink. Then the redundant directories can be traversed and whatever is a link can be deleted. The remaining files will be unique. I’m not going for this myself as I don’t trust myself to write a bug-free implementation.
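The first half of the hardlink idea (replacing duplicate data with hardlinks) might look something like the sketch below. The function names are hypothetical, it assumes all directories live on the same filesystem (hardlinks cannot cross filesystems), and it only looks at regular files. Note that the duplicates take on the metadata (timestamps, permissions) of the first copy found.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def hardlink_duplicates(roots: list[Path]) -> int:
    """Replace every later file with identical content by a hardlink to the
    first file seen with that content. Returns the number of files relinked."""
    first_seen: dict[str, Path] = {}
    linked = 0
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = Path(dirpath) / name
                if path.is_symlink() or not path.is_file():
                    continue  # assumption: only regular files are handled
                original = first_seen.setdefault(file_digest(path), path)
                if original == path or os.path.samefile(original, path):
                    continue  # first occurrence, or already the same inode
                # Create the link under a temporary name, then swap it in, so
                # the duplicate is never missing if the script is interrupted.
                tmp = path.with_name(path.name + ".linktmp")
                os.link(original, tmp)
                os.replace(tmp, path)
                linked += 1
    return linked
```

Listing the directory you want to keep first in `roots` means its files become the "originals" that everything else is linked to.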

  • lemmyvore@feddit.nl · English · 13 upvotes · 3 months ago

    Use Borg Backup. It has built-in deduplication — it works with chunks not files and will recognize identical chunks and avoid storing them multiple times. It will deduplicate your files and will find duplicated chunks even in files you didn’t know had duplicates. You can continue to keep your files duplicated or clean them out, it doesn’t matter, the borg backups will be optimized either way.
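To show what "works with chunks, not files" means, here is a toy sketch of chunk-level deduplication. It is not Borg's implementation: Borg cuts chunks at content-defined boundaries with variable sizes, whereas this example uses a tiny fixed chunk size, and the names `store` and `restore` are made up for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, so the effect is visible on short data


def store(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Split `data` into fixed-size chunks, keep each distinct chunk only once
    in `chunk_store`, and return the list of chunk ids needed to rebuild it."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(key, chunk)  # already known chunks cost nothing
        recipe.append(key)
    return recipe


def restore(recipe: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from its list of chunk ids."""
    return b"".join(chunk_store[key] for key in recipe)


chunks: dict[str, bytes] = {}
recipe = store(b"AAAABBBBAAAA", chunks)
# Three chunks are referenced, but only two distinct ones are stored.
print(len(recipe), len(chunks))
```

The same thing happens across files and across archives: a second copy of a file only adds references to chunks that are already stored.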

    • FryAndBender@lemmy.world · 3 upvotes · 2 months ago

      Here are the stats from a backup of 1 server with approx 600gig


                       Original size    Compressed size    Deduplicated size
      This archive:        592.44 GB          553.58 GB             13.79 MB
      All archives:         14.81 TB           13.94 TB            599.58 GB

                       Unique chunks       Total chunks
      Chunk index:           2760965           19590945

      13meg… nice

    • utopiah · 2 upvotes · 3 months ago

      Neat, wasn’t aware of it, thanks for sharing.

  • chtk@feddit.nl · 12 upvotes · 3 months ago

    jdupes is my go-to solution for file deduplication. It should be able to remove duplicate files. I don’t know how much control it gives you over which duplicate to remove though.

  • deadbeef79000@lemmy.nz · 3 upvotes · 3 months ago

    I have exactly the same problem.

    I got as far as using fdupes to identify duplicates and delete the extras. It was slow.

    Thinking about some of the other comments… If you use a tool to create hardlinks first, you could then traverse the entire tree and delete a file if it has more than one hardlink. The two phases could be done piecemeal and are cancelable and restartable.
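The second phase (deleting whatever has more than one hardlink) could be sketched as below. `prune_hardlinked` is a hypothetical name, and the sketch assumes the duplicates were already hardlinked by some other tool. Because the link count is re-read for every file, the last remaining link to any piece of data is never deleted. It will, however, also remove files that were hardlinked for unrelated reasons.

```python
import os
import stat
from pathlib import Path


def prune_hardlinked(redundant_root: Path, dry_run: bool = True) -> list[Path]:
    """Delete every regular file under `redundant_root` that still has another
    hardlink somewhere. Returns the affected paths.

    With dry_run=True nothing is deleted. The returned list can then be
    slightly longer than a real run, because two links that both live inside
    `redundant_root` are both reported, while a real run keeps the last one.
    """
    removed = []
    for dirpath, _dirs, names in os.walk(redundant_root):
        for name in names:
            path = Path(dirpath) / name
            info = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                if not dry_run:
                    path.unlink()
                removed.append(path)
    return removed
```

Since each deletion is independent, the run can be interrupted and restarted at any point, which matches the "cancelable and restartable" idea.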

      • deadbeef79000@lemmy.nz · 1 upvote · 2 months ago

        Backup backup backup! If you have btrfs then just take a snapshot first: it’s instant.

        One could do a non-destructive rename first. E.g. prepend deleteme. to the file name, sanity check it, then ‘rollback’ by renaming back without the prefix or commit and delete anything with the prefix.
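A rough sketch of that mark / rollback / commit flow, using the `deleteme.` prefix from the comment above. The function names are made up, and it assumes no real file already starts with that prefix and that the un-prefixed name is still free when rolling back (on POSIX, a rename silently overwrites an existing target).

```python
import os
from pathlib import Path

PREFIX = "deleteme."


def mark(path: Path) -> Path:
    """Non-destructively mark a file for deletion by prefixing its name."""
    marked = path.with_name(PREFIX + path.name)
    path.rename(marked)
    return marked


def rollback(root: Path) -> int:
    """Undo all marks under `root`. Returns the number of files restored."""
    count = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if name.startswith(PREFIX):
                marked = Path(dirpath) / name
                marked.rename(marked.with_name(name[len(PREFIX):]))
                count += 1
    return count


def commit(root: Path) -> int:
    """Really delete all marked files under `root`. Returns the number deleted."""
    count = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if name.startswith(PREFIX):
                (Path(dirpath) / name).unlink()
                count += 1
    return count
```

The sanity check between `mark` and `commit` can be as simple as listing everything that carries the prefix and eyeballing it.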

  • HumanPerson@sh.itjust.works · English · 3 upvotes · 3 months ago

    I believe ZFS has deduplication built in if you want a separate backup partition. Not sure about its reliability though. Personally I just have a script that keeps a backup and an oldbackup, and they are both fairly small. I keep a file in my home dir called excluded for things like Linux ISOs that don’t need to be backed up.

    • GenderNeutralBro@lemmy.sdf.org · English · 1 upvote · 3 months ago

      BTRFS also supports deduplication, but not automatically. duperemove will do it, and you can set it up as a cron job if you want.

  • JetpackJackson@feddit.de · 2 upvotes · 3 months ago

    Instead of trying to parse the old stuff, could you just run something like borg and then delete the old copypaste backup? Or are there other files there that you need to go through? I ask because I went through a similar thing switching my backups from rsync to borg

    • Agility0971@lemmy.world (OP) · 1 upvote · 3 months ago

      I had multiple systems which at some point were syncing with Syncthing, but over time I stopped using my desktop computer and the Syncthing service went unmaintained. I had to remove the SSD from the old desktop, so I yoinked the home directory and saved it onto my laptop. As you can probably tell, a lot of stuff got duplicated and a lot of stuff diverged over time. My idea is to merge everything into my laptop’s home directory and then just look at the diverged files manually, as that would be less work. I don’t think doing a backup with all my redundant files is a good idea, as the initial backup would include other backups and a lot of duplicated files.

  • biribiri11 · 3 upvotes · 1 downvote · 3 months ago

    As said previously, Borg is a full deduplicating incremental archiver complete with compression. You can use relative paths temporarily to build up your backups and a full backup history, then use something like pika to browse the archives to ensure a complete history.

      • biribiri11 · 3 upvotes · edited · 3 months ago

        Tbf you did start your post with “I’m in the process of starting a proper backup”.

        So you’re going to end up with at least a few people talking about how to onboard your existing backups into a proper backup solution (like borg). Your bullet points can probably be organized into a shell script with rsync, but why? A proper backup solution with a full backup history is going to be way more useful than dumping all your files into a directory and renaming in case something clobbers. I don’t see the point in doing anything other than tarring your old backups and using borg import-tar. It feels like you’re trying to go from one half-baked, odd backup solution to another, instead of just going with a full, complete solution.

      • slavanap@lemmy.world · 2 upvotes · edited · 3 months ago

        1. rsync can sync hardlinks correctly (with the -H option).

        2. ZFS has pretty fast (zfs set dedup=edonr,verify) block-level deduplication, where the block size is 1 MB (zfs set recordsize=1M).

        3. In reality, I tried to achieve a proper data structure, but it was so time-consuming that I couldn’t get any other work done. So I set up ZFS as a history I can roll back through if I accidentally delete something important, and I get all the aforementioned ZFS benefits on top.