When thousands of pages started disappearing from the Centers for Disease Control and Prevention (CDC) website late last week, public health researchers quickly moved to archive deleted public health data.

Soon, researchers discovered that the Internet Archive (IA) offers one of the most effective ways to both preserve online data and track changes on government websites. For decades, IA crawlers have collected snapshots of the public Internet, making it easier to compare current versions of websites to historic versions. And IA also allows users to upload digital materials to further expand the web archive. Both aspects of the archive immediately proved useful to researchers assessing how much data the public risked losing during a rapid purge following a pair of President Trump’s executive orders.

Part of a small group of researchers who managed to download the entire CDC website within days, virologist Angela Rasmussen helped create a public resource that combines CDC website information with deleted CDC datasets. Those datasets, many of which had been in the public domain for years, were uploaded to IA by an anonymous user, “SheWhoExists,” on January 31. Rasmussen told Ars that, moving forward, IA will likely remain a go-to tool for researchers attempting to closely monitor any unexpected changes in access to public data.

Rasmussen told Ars that the deletion of CDC datasets is “extremely alarming” and “not normal.” While some deleted pages have since been restored in altered versions, removing gender ideology from CDC guidance could put Americans at heightened risk. That’s another emerging problem that IA’s snapshots could help researchers and health professionals resolve.

On Bluesky, Rasmussen led one of many charges to compile archived links and download CDC data so that researchers can reference every available government study when advancing public health knowledge.

“These data are public and they are ours,” Rasmussen posted. “Deletion disobedience is one way to fight back.”

To help researchers quickly access the missing data, anyone can help the IA seed the datasets, the Reddit user said in another post providing seeding and mirroring instructions. Currently dozens are seeding it for a couple hundred peers.

“Thank you to everyone who requested this important data, and particularly to those who have offered to mirror it,” the Reddit user wrote.

As Rasmussen works with her group to make their archive more user-friendly, her plan is to help as many researchers as possible fight back against data deletion by continuing to reference deleted data in their research. She suggested that effort—doing science that ignores Trump’s executive orders—is perhaps a more powerful way to resist and defend public health data than joining in loud protests, which many researchers based in the US (and perhaps relying on federal funding) may not be able to afford to do.

“Just by doing things and standing up for science with your actions, rather than your words, you can really make, I think, a big difference,” Rasmussen said.

  • Bogasse

    When the Internet Archive was attacked a few months ago, we were like, “Who would be dumb and mean enough to do that?” We have new suspects! 🎉

    • Arghblarg@lemmy.ca

      archive.is or its mirrors should also be used, as archive.org has proven vulnerable to takedown requests from corporations. It wouldn’t surprise me if they could be coerced into removing their data by a US government request as well.

    • cygnus@lemmy.ca

      It wouldn’t take much; they had multiple breaches and other problems last fall, seemingly due to very avoidable reasons.

      • GreenKnight23@lemmy.world

        Although it could have been avoidable, it raises the question of who was behind the attacks.

        I think we can safely say it was Peelon Shmusk, the world’s worst spy!

      • dan@upvote.au

        “very avoidable reasons.”

        They’re understaffed for the amount of work they do, and their staff are probably even more busy fighting lawsuits at the moment. Things are going to slip through the cracks, unfortunately.

  • BassTurd@lemmy.world

    Any idea of the size of IA? Could it be packaged in some torrents and distributed to the masses for decentralized archiving? I’m guessing it’s way more than I could store.

        • 9point6@lemmy.world

          The problem is you’d need to split it down to an amount that people would be happy hosting and then host it multiple times in case any node goes offline.

          Another comment in the thread says it’s likely over 100 PB today (100,000 terabytes). I’d say 4 copies (spread over different time zones) is a relatively minimal level of redundancy (people may host on machines that aren’t powered all the time), and I reckon you’d get a network with the most participants, whilst still getting enough storage, at around the 150 GB per node mark.

          That comes to nearly 3 million participants needed just to cover today’s archive, and new people will obviously need to join every day. Also, given that I imagine it would need to be open to all, the redundancy level could do with increasing, to stop malicious actors with a lot of resources from taking on a large share of the network and forcing it all offline at once in an effort to cause data loss.

          Nothing here is insurmountable, but it’s also not remotely easy.
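
The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and that every node contributes exactly 150 GB; the function name `nodes_needed` is made up for illustration, and the 100 PB figure is the rough size quoted in the thread, not a measured value.

```python
PB = 10**15  # decimal petabyte, in bytes (assumption: decimal units)
GB = 10**9   # decimal gigabyte, in bytes


def nodes_needed(archive_bytes: int, copies: int, bytes_per_node: int) -> int:
    """Return how many nodes are needed to store `copies` full replicas
    of the archive when each node holds `bytes_per_node` bytes."""
    total_bytes = archive_bytes * copies
    # Integer ceiling division: a partly filled node still counts as a node.
    return -(-total_bytes // bytes_per_node)


# Figures from the comment: ~100 PB archive, 4 copies, 150 GB per node.
estimate = nodes_needed(100 * PB, copies=4, bytes_per_node=150 * GB)
print(f"{estimate:,} nodes")
```

With those inputs the result is about 2.67 million nodes, which matches the "nearly 3 million" ballpark. Raising `copies` to guard against hostile participants scales the node count linearly.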

      • BassTurd@lemmy.world

        That’s a bit more than my home server can handle. I could maybe take some CDC data, but definitely not the full shebang. It would be neat if someone could segment the data so we could save some more critical things.

        • xektop@lemmy.world

          A couple of years ago I read that Filecoin had teamed up with the Internet Archive to synchronize the data on the blockchain. I’m not sure how far along they are, but it’s something that could work if it doesn’t turn out to be just crypto hype in the end.