• DongFangHong@lemmygrad.mlOPM
      3 years ago

      Yeah, I wrote up a quick Python script that calls the Pushshift API to get a list of post IDs from the subreddit. Then, for each post, it calls the Reddit JSON API to get a JSON document with all of the information in the submission, and inserts that JSON into a MongoDB database. Here’s my code if you’re interested:

      import datetime
      import pymongo
      import requests
      
      subreddit = 'GenZedong'
      sort = 'asc'
      sort_type = 'created_utc'
      size = 100
      
      client = pymongo.MongoClient('mongodb://localhost:27017/')
      db = client.subredditArchiveDB
      collection = db[subreddit]
      
      def main():
          query_params = {
              'subreddit': subreddit,
              'sort': sort,
              'sort_type': sort_type,
              'size': size,
              'after': 1646299223, # use this to start the search after a specific timestamp
          }
      
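          while True:
              # Fetch the next page of submissions from Pushshift, oldest first.
              r = requests.get('https://api.pushshift.io/reddit/search/submission/', params=query_params)
              r.raise_for_status()
      
              posts = r.json()['data']
              if not posts:
                  break  # nothing left after the current timestamp, so stop
      
              for post in posts:
                  post_id = post['id']  # renamed from `id` to avoid shadowing the builtin
                  created_utc = post['created_utc']
                  timestamp = datetime.datetime.utcfromtimestamp(created_utc)
                  timestamp_str = timestamp.strftime("%Y-%m-%d %H:%M")
      
                  # Fetch the full submission (post plus comments) from Reddit's JSON endpoint.
                  reddit_r = requests.get(f'https://www.reddit.com/comments/{post_id}/.json', headers={'User-Agent': 'Subreddit archiver', 'Cookie': 'Paste your Reddit browser cookie here (needed to access quarantined subreddit)' })
                  reddit_r.raise_for_status()
                  reddit_json = reddit_r.json()
      
                  post_archive = {
                      'id': post_id,
                      'timestamp': timestamp,
                      'json': reddit_json
                  }
      
                  collection.insert_one(post_archive)
                  print(f'Added {post_id} from {timestamp_str} to the collection')
      
              # Continue after the last post seen. Pass the epoch integer, as in the
              # initial 'after' value, rather than the datetime object.
              query_params['after'] = created_utc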
      
      
      if __name__ == '__main__':
          main()
      
        • DongFangHong@lemmygrad.mlOPM
          3 years ago

          You can use the cookie that Reddit stores in your browser. An easy way to get it is to open your browser’s dev tools, switch to the Network tab, load Reddit, and then click on the request that was made to reddit.com. In the list of request headers you should find one named Cookie. Copy its value and paste it into the code.
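
          If you would rather not paste the cookie straight into the script, here is a rough sketch of one alternative: read it from an environment variable and build the headers from that. The variable name REDDIT_COOKIE and the helper build_headers are made up for this example; the User-Agent string is the one from the script above.

```python
import os


def build_headers(cookie=None):
    """Return request headers for Reddit, adding Cookie only when one is available."""
    # REDDIT_COOKIE is a hypothetical variable name chosen for this sketch,
    # not something Reddit or the requests library defines.
    cookie = cookie or os.environ.get("REDDIT_COOKIE")
    headers = {"User-Agent": "Subreddit archiver"}
    if cookie:
        # Strip stray whitespace or a trailing newline picked up while copying.
        headers["Cookie"] = cookie.strip()
    return headers
```

          You would then pass headers=build_headers() to requests.get in place of the hard-coded dictionary.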

          • red_red_revolution@lemmygrad.ml
            3 years ago

            I know this sounds dumb but I have no idea what you’re talking about. What tools or programs do I need to open this and explore the subreddit? Do I need to download anything? Or know code?

            • DongFangHong@lemmygrad.mlOPM
              3 years ago

              No, it’s not dumb. If your goal is just to explore the content on /r/GenZhou, that would be pretty difficult to do right now. I don’t know if you’ve taken a look at the archive file, but it’s essentially just a bunch of JavaScript code that stores the data. It’s pretty much impossible to read as-is, even for a programmer. The next step is to format the data so that it becomes human-readable, and some folks are already starting to work on that. Hopefully we’ll eventually be able to view everything that was in GenZhou, but on a Lemmy site.
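
              To give a rough idea of what that formatting step could look like, here is a minimal sketch, not what anyone is actually running. It assumes each stored document holds the raw response from Reddit’s /comments/<id>/.json endpoint, which is a two-element list: a listing with the post, then a listing with the comment tree. The function names render_submission and render_comments are made up for this example.

```python
def render_comments(children, depth=0):
    """Yield one indented text line per comment in a list of Reddit 'children'."""
    for child in children:
        if child.get("kind") != "t1":  # skip 'more' placeholders and anything else
            continue
        data = child["data"]
        yield f"{'  ' * depth}{data.get('author', '[deleted]')}: {data.get('body', '')}"
        replies = data.get("replies")
        if isinstance(replies, dict):  # Reddit uses an empty string for "no replies"
            yield from render_comments(replies["data"]["children"], depth + 1)


def render_submission(reddit_json):
    """Turn the [post listing, comment listing] pair into plain text."""
    post_listing, comment_listing = reddit_json
    post = post_listing["data"]["children"][0]["data"]
    lines = [post.get("title", ""), f"by {post.get('author', '[deleted]')}", ""]
    if post.get("selftext"):
        lines += [post["selftext"], ""]
    lines += render_comments(comment_listing["data"]["children"])
    return "\n".join(lines)
```

              With the archiver script above, you would call render_submission(doc['json']) on each document in the collection. Comments hidden behind Reddit’s "load more" placeholders are skipped in this sketch.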