• DongFangHong@lemmygrad.mlOPM
    3 years ago

    Yeah, I wrote up a quick Python script that calls the Pushshift API to get a list of post IDs from the subreddit. Then, for each post, you can use the Reddit JSON API to get a JSON document with all of the information in the submission. I then inserted that JSON into a MongoDB database. Here’s my code if you’re interested:

    import datetime
    import html
    import pymongo
    import requests
    import time
    
    subreddit = 'GenZedong'
    sort = 'asc'
    sort_type = 'created_utc'
    size = 100
    
    client = pymongo.MongoClient('mongodb://localhost:27017/')
    db = client.subredditArchiveDB
    collection = db[subreddit]
    
    def main():
        query_params = {
            'subreddit': subreddit,
            'sort': sort,
            'sort_type': sort_type,
            'size': size,
            'after': 1646299223, # use this to start the search after a specific timestamp
        }
    
        while True:
            r = requests.get('https://api.pushshift.io/reddit/search/submission/', params=query_params)
            r.raise_for_status()
    
            posts = r.json()['data']
            if not posts:
                break  # no more submissions after the current timestamp
    
            for post in posts:
                post_id = post['id']
                timestamp = datetime.datetime.utcfromtimestamp(post['created_utc'])
                timestamp_str = timestamp.strftime("%Y-%m-%d %H:%M")
    
                # Fetch the full submission (post plus comments) from Reddit's JSON endpoint
                reddit_r = requests.get(
                    f'https://www.reddit.com/comments/{post_id}/.json',
                    headers={
                        'User-Agent': 'Subreddit archiver',
                        'Cookie': 'Paste your Reddit browser cookie here (needed to access quarantined subreddit)',
                    },
                )
                reddit_r.raise_for_status()
                reddit_json = reddit_r.json()
    
                post_archive = {
                    'id': post_id,
                    'timestamp': timestamp,
                    'json': reddit_json
                }
    
                collection.insert_one(post_archive)
                print(f'Added {post_id} from {timestamp_str} to the collection')
    
            # Pushshift expects an epoch timestamp here, not a datetime object
            query_params['after'] = posts[-1]['created_utc']
    
    
    if __name__ == '__main__':
        main()
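
    The paging idea in the script (ask for posts after a timestamp, then move that timestamp forward to the last post seen) can be tried without the network or MongoDB. This is only a minimal sketch: `iter_submissions`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names made up for illustration, and the fake page function merely imitates the shape of a Pushshift response, assuming `created_utc` is an epoch integer.

```python
def iter_submissions(fetch_page, after):
    """Yield posts page by page, advancing the `after` cursor each time.

    `fetch_page(after)` is assumed to return a list of dicts sorted by
    ascending `created_utc`, containing only posts newer than `after`.
    """
    while True:
        posts = fetch_page(after)
        if not posts:
            return  # an empty page means there is nothing left to fetch
        yield from posts
        # The cursor stays an epoch integer, taken from the last post seen
        after = posts[-1]['created_utc']


# Hypothetical stand-in for the Pushshift request, for illustration only
FAKE_POSTS = [{'id': f'p{n}', 'created_utc': 1000 + n} for n in range(5)]


def fake_fetch(after, size=2):
    """Return at most `size` fake posts strictly newer than `after`."""
    newer = [p for p in FAKE_POSTS if p['created_utc'] > after]
    return newer[:size]


ids = [post['id'] for post in iter_submissions(fake_fetch, after=1000)]
print(ids)
```

    Stopping on an empty page is what keeps the loop from running forever once the end of the subreddit is reached.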
    
      • DongFangHong@lemmygrad.mlOPM
        3 years ago

        You can use the cookie that Reddit stores in your browser. An easy way to get it is to open the browser dev tools, switch to the Network tab, load Reddit, and then click on the request that was made to reddit.com. You should see a list of request headers, one of which is Cookie. Copy its value and paste it into the code.
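
        If you would rather not paste the cookie straight into the script, one option is to read it from an environment variable. This is just a sketch of that idea, not what the script above does: `build_headers` and the variable name `REDDIT_COOKIE` are hypothetical and chosen only for this example.

```python
import os


def build_headers(cookie=None):
    """Return request headers, adding a Cookie header only when one is available."""
    headers = {'User-Agent': 'Subreddit archiver'}
    # REDDIT_COOKIE is a hypothetical variable name chosen for this sketch
    cookie = cookie or os.environ.get('REDDIT_COOKIE')
    if cookie:
        headers['Cookie'] = cookie
    return headers


# Placeholder value; the real one is the Cookie header copied from the browser
print(build_headers('name=value'))
```

        The returned dictionary can be passed as `headers=` to `requests.get`, the same way the script passes its inline dictionary.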

        • red_red_revolution@lemmygrad.ml
          3 years ago

          I know this sounds dumb but I have no idea what you’re talking about. What tools or programs do I need to open this and explore the subreddit? Do I need to download anything? Or know code?

          • DongFangHong@lemmygrad.mlOPM
            3 years ago

            No, it’s not dumb. If your goal is just to explore the content on /r/GenZhou, that would be pretty difficult to do right now. I don’t know if you’ve taken a look at the archive file, but it’s essentially just a bunch of JavaScript code that stores the data. It’s pretty much impossible to read as-is, even for a programmer. The next step is to format the data so that it becomes human-readable, and some folks are already starting to work on that. Hopefully we can eventually view everything that was in GenZhou, but on a Lemmy site.
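
            To give a rough idea of what "formatting the data" could look like, here is a minimal sketch that turns one archived submission into indented plain text. It is not the code anyone is actually using for the conversion. It assumes each stored `json` value has the shape Reddit's comments endpoint normally returns: a two-element list, where the first listing holds the post and the second holds the comment tree. `render_submission`, `render_comments`, and `sample` are hypothetical names, and the sample data is made up.

```python
import html


def render_comments(children, depth=0):
    """Return indented text lines for a list of comment 'children'."""
    lines = []
    for child in children:
        if child.get('kind') != 't1':
            continue  # skip "more" placeholders and anything that is not a comment
        data = child['data']
        # Assumption: bodies arrive HTML-escaped (e.g. "&amp;"), so unescape them
        body = html.unescape(data.get('body') or '')
        lines.append(f"{'  ' * depth}{data.get('author')}: {body}")
        replies = data.get('replies')
        # An empty string is used when a comment has no replies
        if isinstance(replies, dict):
            lines.extend(render_comments(replies['data']['children'], depth + 1))
    return lines


def render_submission(reddit_json):
    """Turn the two-element list for one submission into plain text."""
    post = reddit_json[0]['data']['children'][0]['data']
    lines = [post.get('title', ''), f"by {post.get('author')}", '']
    if post.get('selftext'):
        lines += [html.unescape(post['selftext']), '']
    lines += render_comments(reddit_json[1]['data']['children'])
    return '\n'.join(lines)


# Made-up sample in the assumed shape, for illustration only
sample = [
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't3', 'data': {'title': 'Hello', 'author': 'alice', 'selftext': 'First post'}},
    ]}},
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't1', 'data': {'author': 'bob', 'body': 'Welcome &amp; enjoy', 'replies': {
            'kind': 'Listing', 'data': {'children': [
                {'kind': 't1', 'data': {'author': 'alice', 'body': 'Thanks', 'replies': ''}},
            ]},
        }}},
        {'kind': 'more', 'data': {'children': []}},
    ]}},
]

print(render_submission(sample))
```

            Replies are indented two spaces per level, so the thread structure stays visible even in a plain text file.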