Following up on my previous post.

I used the API at https://archive.org/developers/changes.html to enumerate all the item names in the archive. Currently there are over 256 million item names. However, I went through a sample of them and noted the following:

A great many items have been removed from the archive, far more than I expected. If you have critical data, the Internet Archive should of course never be your only backup.

I don’t know the distribution of metadata and .torrent file sizes since I have not tried downloading them yet. Storage needs could be substantial for items that have many files or very large content: if only 50% of the items remain and the average .torrent plus metadata is 20 KB, that comes to over 2.5 TB. On the other hand, the archive has a lot of random one-off uploads that are not very big. In those cases the metadata can be about 800 bytes and the torrent about 3 KB, so at a combined 5 KB per item the total would be only about 640 GB.
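
To make the arithmetic behind those two figures easy to redo with other guesses, here is a minimal sketch. The function name `estimate_storage_bytes` is made up for illustration, the 50% survival rate and the per-item sizes are the rough guesses from this post rather than measured values, and sizes use decimal units (1 KB = 1,000 bytes).

```python
def estimate_storage_bytes(total_items, surviving_fraction, avg_bytes_per_item):
    """Rough total size of .torrent + metadata for the items that still exist."""
    return round(total_items * surviving_fraction * avg_bytes_per_item)


TOTAL_ITEMS = 256_000_000   # item names enumerated so far
SURVIVING_FRACTION = 0.5    # assumption: only half of the items remain

# Two scenarios: a 20 KB average per item, and small one-off uploads at 5 KB.
for label, per_item in [("20 KB per item", 20_000), ("5 KB per item", 5_000)]:
    size = estimate_storage_bytes(TOTAL_ITEMS, SURVIVING_FRACTION, per_item)
    print(f"{label}: {size / 1e12:.2f} TB")
```

Swapping in a measured average once some metadata has actually been downloaded would give a much better estimate than either guess.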

  • BermudaHighball@lemmy.dbzer0.com (OP) · 4 days ago

    Yes, that is exactly why I wanted to start this project. It’s nice to have the Internet Archive, but we cannot trust that content won’t be taken down eventually. Even storage costs alone might become an issue in the future for data that gets maybe 30 total views over many years. But it is nice to hear that some of the data you were looking at is coming back.

    Long term, it would be nice for a community of users to create a decentralized index of Internet Archive metadata that cannot be taken down and that includes the torrent files for the content, so people can share it and help seed the content they care about. The Internet Archive might cooperate to make this easier, for example by using BitTorrent v2, which would help us detect duplicate files and remove the need for padding files, since all files are aligned to pieces in v2.

    Currently there is little incentive for people to seed Internet Archive content, but no doubt it will become more important to do so in the future.