Raw master archive of /r/GenZhou

DongFangHong@lemmygrad.ml · 3 years ago

DongFangHong@lemmygrad.ml · edited 3 years ago

Yeah, I wrote up a quick Python script that calls the Pushshift API to get a list of post IDs from the subreddit. Then, for each post, you can use the Reddit JSON API to get a JSON document with all of the information in the submission. Finally, I inserted that JSON into a database. Here’s my code if you’re interested:

import datetime
import html
import pymongo
import requests
import time

subreddit = 'GenZedong'
sort = 'asc'
sort_type = 'created_utc'
size = 100

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client.subredditArchiveDB
collection = db[subreddit]

def main():
    query_params = {
        'subreddit': subreddit,
        'sort': sort,
        'sort_type': sort_type,
        'size': size,
        'after': 1646299223, # use this to start the search after a specific timestamp
    }

    while True:
        r = requests.get('https://api.pushshift.io/reddit/search/submission/', params=query_params)
        r.raise_for_status()

        posts = r.json()['data']
        if not posts:
            break  # Pushshift returned nothing newer, so the archive is complete

        for post in posts:
            post_id = post['id']
            timestamp = datetime.datetime.utcfromtimestamp(post['created_utc'])
            timestamp_str = timestamp.strftime("%Y-%m-%d %H:%M")

            # Fetch the full submission (post plus comments) from Reddit's JSON API
            reddit_r = requests.get(
                f'https://www.reddit.com/comments/{post_id}/.json',
                headers={
                    'User-Agent': 'Subreddit archiver',
                    'Cookie': 'Paste your Reddit browser cookie here (needed to access quarantined subreddit)',
                },
            )
            reddit_r.raise_for_status()
            reddit_json = reddit_r.json()

            post_archive = {
                'id': post_id,
                'timestamp': timestamp,
                'json': reddit_json
            }

            collection.insert_one(post_archive)
            print(f'Added {post_id} from {timestamp_str} to the collection')
            time.sleep(1)  # brief pause between requests to go easy on the API

        # Resume after the last post seen; 'after' must be epoch seconds, not a datetime
        query_params['after'] = posts[-1]['created_utc']


if __name__ == '__main__':
    main()

TheBlurstOfGuys@lemmygrad.ml · 3 years ago

That’s just incredible man. Thanks so much!

DongFangHong@lemmygrad.ml · 3 years ago

No problem!

holdengreen@lemmygrad.ml · 3 years ago

I was using another script but this looks better.

holdengreen@lemmygrad.ml · 3 years ago

what do you feed as ‘Cookie’?

DongFangHong@lemmygrad.ml · 3 years ago

You can use the cookie that Reddit stores in your browser. An easy way to get it is to open your browser’s dev tools, switch to the Network tab, load Reddit, and then click on the request that was made to reddit.com. You should be able to find a list of request headers, one of which is Cookie. Copy its value and paste it into the code.
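
If you’d rather not paste the cookie straight into the script, here is a rough sketch of one alternative: read it from an environment variable. The variable name REDDIT_COOKIE and the helper build_headers are made up for this example and are not part of the original script; the User-Agent value is the one the script already uses.

```python
import os


def build_headers(cookie=None):
    """Build the request headers for the Reddit JSON API call.

    If no cookie is passed in, fall back to the REDDIT_COOKIE environment
    variable (a hypothetical name chosen for this sketch).
    """
    if cookie is None:
        cookie = os.environ.get('REDDIT_COOKIE', '')

    headers = {'User-Agent': 'Subreddit archiver'}
    if cookie:
        # Only send a Cookie header when there is actually a value to send
        headers['Cookie'] = cookie
    return headers
```

You would then pass headers=build_headers() to requests.get in place of the inline dictionary, and set the variable in your shell before running the script.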

red_red_revolution@lemmygrad.ml · edited 3 years ago

I know this sounds dumb but I have no idea what you’re talking about. What tools or programs do I need to open this and explore the subreddit? Do I need to download anything? Or know code?

DongFangHong@lemmygrad.ml · 3 years ago

No, it’s not dumb. If your goal is just to explore the content on /r/GenZhou, that would be pretty difficult to do right now. I don’t know if you’ve taken a look at the archive file, but it’s essentially just a bunch of JavaScript code that stores the data. It’s pretty much impossible to read easily as-is, even for a programmer. The next step is formatting the data so that it becomes human-readable. Some folks are already starting to work on that. Hopefully we can eventually view everything that was in GenZhou, but on a Lemmy site.
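
To give an idea of what that formatting step might look like, here is a minimal sketch that turns one archived submission into indented plain text. It assumes the JSON has the usual shape of Reddit’s comments endpoint as I understand it: a two-element list, where the first listing holds the post and the second holds the comment tree, with comments marked as kind 't1'. The function names render_post and iter_comments are made up for this example, and this is not the code anyone is actually using for the conversion.

```python
def iter_comments(children, depth=0):
    """Walk a comment tree, yielding (depth, author, body) for each comment."""
    for child in children:
        if child.get('kind') != 't1':
            continue  # skip non-comment entries such as "load more" placeholders
        data = child['data']
        yield depth, data.get('author', '[deleted]'), data.get('body', '')
        replies = data.get('replies')
        # Assumption: replies is a nested listing dict, or an empty string if none
        if isinstance(replies, dict):
            yield from iter_comments(replies['data']['children'], depth + 1)


def render_post(reddit_json):
    """Render one archived submission (post plus comments) as plain text."""
    post = reddit_json[0]['data']['children'][0]['data']
    lines = [post.get('title', ''), f"by {post.get('author', '[deleted]')}", '']
    if post.get('selftext'):
        lines += [post['selftext'], '']
    for depth, author, body in iter_comments(reddit_json[1]['data']['children']):
        lines.append(f"{'    ' * depth}{author}: {body}")
    return '\n'.join(lines)
```

With the script above, each document in the collection keeps this structure under its 'json' field, so something like print(render_post(doc['json'])) would print a readable version of a single thread.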