I will work on making this more accessible in the future but for now, here is a link to download all /r/GenZhou submissions up to 3/29/2022 in json format.
Yeah, I wrote up a quick Python script to call the Pushshift API, which gets a list of post IDs from the subreddit. Then, for each post, you can call the Reddit JSON API to get a JSON document with all of the information in the submission. Then I inserted the JSON into a database. Here’s my code if you’re interested:
import datetime
import html
import pymongo
import requests
import time

subreddit = 'GenZedong'
sort = 'asc'
sort_type = 'created_utc'
size = 100

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client.subredditArchiveDB
collection = db[subreddit]


def main():
    query_params = {
        'subreddit': subreddit,
        'sort': sort,
        'sort_type': sort_type,
        'size': size,
        'after': 1646299223,  # use this to start the search after a specific timestamp
    }
    while True:
        # Ask Pushshift for the next page of submissions, oldest first
        r = requests.get('https://api.pushshift.io/reddit/search/submission/', params=query_params)
        r.raise_for_status()
        posts = r.json()['data']
        if not posts:
            break  # nothing newer than 'after', so we are done
        for post in posts:
            post_id = post['id']
            timestamp = datetime.datetime.utcfromtimestamp(post['created_utc'])
            timestamp_str = timestamp.strftime("%Y-%m-%d %H:%M")
            # Fetch the full submission (post plus comments) from Reddit itself
            reddit_r = requests.get(
                f'https://www.reddit.com/comments/{post_id}/.json',
                headers={
                    'User-Agent': 'Subreddit archiver',
                    'Cookie': 'Paste your Reddit browser cookie here (needed to access quarantined subreddit)',
                })
            reddit_r.raise_for_status()
            reddit_json = reddit_r.json()
            post_archive = {
                'id': post_id,
                'timestamp': timestamp,
                'json': reddit_json,
            }
            collection.insert_one(post_archive)
            print(f'Added {post_id} from {timestamp_str} to the collection')
            # Pushshift expects an epoch timestamp here, not a datetime object
            query_params['after'] = post['created_utc']
            time.sleep(1)  # be gentle with Reddit's rate limits


if __name__ == '__main__':
    main()
That’s just incredible man. Thanks so much!
No problem!
I was using another script but this looks better.
what do you feed as ‘Cookie’?
You can use the cookie that Reddit stores in your browser. An easy way to get it is to open the browser dev tools, switch to the Network tab, load Reddit, and then click on the request that was made to reddit.com. You should see a list of request headers, one of which is Cookie. Copy its value and paste it into the code.
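If you would rather not paste the cookie straight into the script, here is a rough sketch of one way to pass it in instead. The `build_headers` helper and the `REDDIT_COOKIE` environment variable name are my own inventions for illustration, not something the script above or Reddit defines.

```python
import os


def build_headers(cookie=None):
    """Build the request headers, optionally with a Reddit session cookie.

    'REDDIT_COOKIE' is a hypothetical environment variable name chosen
    for this sketch; use whatever name you like.
    """
    if cookie is None:
        cookie = os.environ.get('REDDIT_COOKIE', '')
    headers = {'User-Agent': 'Subreddit archiver'}
    if cookie:
        # Only send the header when there is actually a cookie to send
        headers['Cookie'] = cookie
    return headers
```

You would then pass `headers=build_headers()` to `requests.get` in place of the hard-coded dictionary, and set the variable in your shell before running the script.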
I know this sounds dumb but I have no idea what you’re talking about. What tools or programs do I need to open this and explore the subreddit? Do I need to download anything? Or know code?
No, it’s not dumb. If your goal is just to be able to explore the content on /r/GenZhou, that would be pretty difficult to do right now. I don’t know if you’ve taken a look at the archive file, but it’s essentially just a big pile of JSON (JavaScript Object Notation), a text format for storing data. It’s pretty much impossible to read easily as-is, even for a programmer. The next step is going to be formatting the data so that it becomes human-readable. Some folks are already starting to work on that. Hopefully we can eventually view everything that was in GenZhou, but on a Lemmy site.
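To give an idea of what "formatting the data" could look like, here is a minimal sketch that turns one archived record into a few readable lines. It assumes each record has the same shape the script above stores (a dictionary with a 'json' key holding Reddit's response, which is a two-item list: the post listing, then the comments listing). The `summarize_post` name is hypothetical, and the actual archive file may be laid out differently.

```python
import html


def summarize_post(record):
    """Return a short plain-text summary of one archived submission."""
    # Reddit's comments endpoint returns [post_listing, comments_listing];
    # the submission itself is the only child of the first listing.
    post = record['json'][0]['data']['children'][0]['data']
    title = html.unescape(post.get('title', ''))
    author = post.get('author', '[unknown]')
    # Text posts carry their body in 'selftext'; link posts only have 'url'
    body = html.unescape(post.get('selftext', '')) or post.get('url', '')
    return f"{title}\nby u/{author}\n\n{body}"
```

Looping over every record and writing the result to a text or HTML file would already be a lot easier to browse than the raw JSON, though comments would need similar handling.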