(attempt to cross-post from /c/programming)

Idea: scrape posts from a subreddit as they’re being made and “archive” them on a Lemmy instance, making it very clear that the content is rehosted and linking back to the original. It would probably have to be a “closed” Lemmy instance dedicated to this purpose. The tool would run for multiple subreddits, allowing Lemmy users to stay updated on, and discuss, any content that would otherwise get left behind.
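Roughly, the flow per post would be: grab the Reddit post, then submit it to the mirror community with a clear “mirrored” label and a backlink. A sketch of the posting side (the instance URL is made up, and the `/api/v3/post` endpoint and field names are recalled from Lemmy’s v3 HTTP API, so they may differ by version):

```python
import json
import urllib.request

LEMMY_INSTANCE = "https://lemmit.example"  # hypothetical mirror instance


def build_mirror_post(title: str, reddit_permalink: str, community_id: int) -> dict:
    """Build a Lemmy post payload that clearly labels the content as a mirror."""
    original = f"https://old.reddit.com{reddit_permalink}"
    return {
        "name": title,
        "community_id": community_id,
        "url": original,
        "body": f"*Mirrored from Reddit. Original: {original}*",
    }


def submit(payload: dict, jwt: str) -> None:
    """POST the payload to the instance (endpoint assumed from the v3 API)."""
    req = urllib.request.Request(
        f"{LEMMY_INSTANCE}/api/v3/post",
        data=json.dumps({**payload, "auth": jwt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```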

Thoughts? It’s probably iffy copyright-wise, but I think I can square my conscience with it.


Update: per the feedback, I’ve acquired a separate instance and started coding. Just tonight I managed to clone some posts from a subreddit to it. I’m intentionally being vague because I’ll probably wipe and reset the communities on there a couple of times, and that messes up federation.

The goal is for all the communities to be read-only for non-mods (the only mods will be the admin and bots), but to also have a separate request community where anyone can request subreddits to be cloned. I’ll keep updating this post - still figuring all of this out as I go along :)

  • Barbarian@sh.itjust.works · 16 points · edited · 1 year ago

    Lemmy is based on a pull model, so if nobody on a different instance subscribes, it doesn’t show up in anybody else’s feed. If an admin doesn’t want it in their “All” feed, they can block the instance.

    Just make sure it’s on its own instance with nothing else, something like that is bound to be EXTREMELY noisy, and not all admins are gonna be happy about it. I assume that’s what you meant by closed?

    • usernotfoundOP · 9 points · edited · 1 year ago

      Yeah, exactly.

      Also to reduce the chances of it colliding with an existing community. It would be an entire Lemmy instance dedicated to reddit mirroring, Lemmit ;)

      To be fair, I wasn’t particularly looking forward to hosting and maintaining my own instance, but coding the tool part should be easy.

      • Barbarian@sh.itjust.works · 7 points · 1 year ago

        Just be aware that it might not work. Reddit implemented rate limits on page loads to combat the inevitable web scraping as they turn off the API. Test out how fast you can pull pages before putting in any real coding time.
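        A quick way to probe this before sinking real time in: hit the same page a few times and back off whenever the server returns 429. Purely illustrative; the User-Agent string and the backoff numbers are made up:

```python
import time
import urllib.error
import urllib.request


def next_delay(status: int, current: float, base: float = 1.0, cap: float = 60.0) -> float:
    """Double the wait after a 429/5xx response; reset to base on success."""
    if status == 429 or status >= 500:
        return min(current * 2, cap)
    return base


def probe(url: str, attempts: int = 5) -> None:
    """Fetch `url` a few times and print how the server responds."""
    delay = 1.0
    for _ in range(attempts):
        req = urllib.request.Request(url, headers={"User-Agent": "lemmit-probe/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                status = resp.status
        except urllib.error.HTTPError as exc:
            status = exc.code  # 429 = Too Many Requests, i.e. we hit the limit
        print(f"status={status}, waiting {delay:.0f}s")
        delay = next_delay(status, delay)
        time.sleep(delay)
```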

        • borari@sh.itjust.works · 5 points · 1 year ago

          Reddit implemented rate limits on page loads to combat the inevitable web scraping

          This whole time I was wondering how the API changes made any sense when anyone disgruntled about it could just turn to scraping, putting drastically more load on Reddit’s infrastructure. It makes me feel a bit better that they aren’t that clueless.

  • piezoelectron@sopuli.xyz · 7 up / 1 down · edited · 1 year ago

    Hey I LOVE this idea! I had it myself but I can’t code for nuts, so glad to see someone else trying it out.

    Question: how can we follow your progress? Are you thinking of creating a dedicated community/website to share updates? If one already exists then do let me know, I’d love to stay connected.

    EDIT: As for copyright concerns: if the goal is to preserve information, then maybe you could pseudonymise usernames as part of the script. Or even remove usernames completely, since the focus is on the comments.

    I prefer pseudonymising: replacing real usernames with fake ones makes it still possible to follow who’s replying to whom within a comment thread.

    • usernotfoundOP · 1 point · 1 year ago

      Still in testing mode, but I’ll keep posting updates here. Once it’s ready to go live, I’ll be sure to create a new post.

    • usernotfoundOP · 4 points · edited · 1 year ago

      Ooh, that is a very good point! I had actually started something using BeautifulSoup in Python, but that would save some hassle.
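      For what it’s worth, subreddit listings can also be fetched as JSON by appending `.json` to the URL, which skips HTML parsing entirely. A minimal sketch (the User-Agent string is made up, and whether these endpoints stay reachable after the API changes is an open question):

```python
import json
import urllib.request


def extract_posts(listing: dict) -> list:
    """Pull the post objects out of a Reddit listing payload."""
    return [child["data"] for child in listing["data"]["children"]]


def fetch_new_posts(subreddit: str, limit: int = 25) -> list:
    """Fetch the newest posts of a subreddit as JSON, no BeautifulSoup needed."""
    url = f"https://old.reddit.com/r/{subreddit}/new/.json?limit={limit}"
    req = urllib.request.Request(url, headers={"User-Agent": "lemmit-clone/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return extract_posts(json.load(resp))
```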

  • KNova@links.dartboard.social · 3 points · 1 year ago

    Pretty sure this exists already. I’m not in a place where I can search and pull it up, but it’s linked in the Lemmy repo. It might just need some tweaking for easier deployment by non-technical users.

  • kilgore@feddit.de · 2 points · 1 year ago

    While these efforts to move Reddit content to Lemmy are great, wouldn’t it make more sense to focus on creating more content here instead? So many people seem to want to “leave Reddit” but somehow not leave at the same time. I’ll miss my niche communities but I hope they show up here with time.

    • usernotfoundOP · 4 points · 1 year ago

      I get that, and the whole point of a tool like this is to make itself redundant.

      I hope it will help people make the journey over to Lemmy, knowing they don’t need to leave anything behind. Once here, they can start contributing themselves.
