- Created GitHub repository
- Wrote project plan
In this first leg of the process, I focused primarily on data collection. I wanted a big list of subreddits, but I don't use Reddit frequently enough to know more than a few off the top of my head, so I decided to scrape Reddit's list of popular subreddits instead. That list has a few problems, such as bizarrely omitting popular subreddits like r/AmItheAsshole, but I couldn't find any other extensive list like it outside of Reddit, and it's good enough for my purposes. I scraped the first 10 pages of the list with Beautiful Soup, gathering a total of 2500 subreddits. This is probably far more than I actually need, but I wanted a big list at the start in case I decide to scale up my project later. For the code I used to do this, see scripts/scrape_top_subreddits.py.
Next, I used this list and PRAW to scrape the comments from the top 10 posts of the past year for each subreddit. It took some trial and error to figure out what I was doing, and it was extremely slow, so I had to leave my computer on overnight twice (scraping the whole list took around 9 hours, and I forgot to save the IDs the first time, so I had to do it all over again...), but I did eventually get the data I needed. For the code I used to do this, see scripts/scrape_comments.py.
I also split the data into several versions of increasing size (10, 100, 1000, and 2500 subreddits) so that I can experiment with smaller datasets and scale up if I want. Finally, I did some basic exploration of the dataset in notebooks/data_exploration.ipynb.
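To give a rough idea of how the two scraping steps work, here are minimal sketches of each. These are not the actual scripts: the listing URL, HTML selectors, API credentials, and field names are all assumptions, and scripts/scrape_top_subreddits.py and scripts/scrape_comments.py differ in their details.

```python
import re
import time

import requests
from bs4 import BeautifulSoup

# Sketch of the subreddit-list scrape: walk the paginated "popular communities"
# listing and pull out every subreddit name. The URL pattern and the markup
# being matched are assumptions.
BASE_URL = "https://www.reddit.com/best/communities/{page}"
HEADERS = {"User-Agent": "subreddit-list-scraper (course project)"}

def scrape_popular_subreddits(num_pages: int = 10) -> list[str]:
    names: list[str] = []
    for page in range(1, num_pages + 1):
        resp = requests.get(BASE_URL.format(page=page), headers=HEADERS, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Any link whose href looks like "/r/<name>/" is taken to be a subreddit.
        for a in soup.find_all("a", href=re.compile(r"^/r/[^/]+/?$")):
            name = a["href"].strip("/").split("/")[-1]
            if name not in names:
                names.append(name)
        time.sleep(1)  # be polite between page requests
    return names

if __name__ == "__main__":
    print(len(scrape_popular_subreddits()), "subreddits collected")
```

```python
import json

import praw

# Sketch of the comment scrape: for each subreddit, take the top 10 posts of
# the past year and collect every comment. Credentials are placeholders, and
# the fields stored here are a guess at the minimum needed later.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="subreddit-comment-scraper (course project)",
)

def scrape_comments(subreddit_names, posts_per_subreddit=10):
    rows = []
    for name in subreddit_names:
        top_posts = reddit.subreddit(name).top(time_filter="year", limit=posts_per_subreddit)
        for submission in top_posts:
            submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
            for comment in submission.comments.list():
                rows.append({
                    "subreddit": name,
                    "post_id": submission.id,
                    "comment_id": comment.id,
                    "text": comment.body,
                })
    return rows

if __name__ == "__main__":
    data = scrape_comments(["AskReddit", "science"])  # tiny example list
    with open("comment_data.json", "w") as f:  # placeholder output path
        json.dump(data, f)
```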
For sharing my data, I want to do something similar to the GUM corpus: I'll distribute the data with the text column blanked out and provide a script to re-fetch the text from Reddit using the comment_id column, allowing anyone to reconstruct my exact dataset. I also provided a very small sample of the data in data_samples/comment_data_sample.json to give an idea of the structure of the data.
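The blanking step itself should be simple; here is a minimal sketch of what it could look like (the file names are placeholders, and the eventual script may handle the data differently):

```python
import json

# Minimal sketch of the redaction idea: keep every field except the comment
# text, which is blanked so the data can be shared without redistributing the
# comments themselves. File names are placeholders.
with open("comment_data.json") as f:
    comments = json.load(f)

for comment in comments:
    comment["text"] = ""  # blank the text column; comment_id and metadata stay

with open("comment_data_redacted.json", "w") as f:
    json.dump(comments, f)
```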
This second leg was slightly less productive than the previous one, but I have finished my data cleaning and my data sharing plan, and I am ready to begin my analysis. To start, I finished cleaning up my data in my data exploration notebook and saved it out for my next steps. I've also refined the linguistic question I'm after: I want to group the subreddits by topic using document clustering, and then run the same sort of analysis we did in the homework to see whether the topic of a subreddit has any significant correlation with linguistic factors such as comment length and syntactic complexity. Finally, I've implemented the data sharing plan I described in the last progress report.
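As a rough illustration of the kind of per-subreddit measure I have in mind, here is a sketch that computes mean comment length in whitespace tokens. It is only an example; the real analysis will live in the notebooks, use the cleaned data, and add more measures (the input path below is a placeholder).

```python
import json
from statistics import mean

# Sketch: group comments by subreddit and report mean comment length in
# whitespace tokens, the simplest of the linguistic measures to compare
# across topic clusters. The input path is a placeholder.
with open("comment_data.json") as f:
    comments = json.load(f)

lengths_by_subreddit: dict[str, list[int]] = {}
for comment in comments:
    tokens = comment["text"].split()
    lengths_by_subreddit.setdefault(comment["subreddit"], []).append(len(tokens))

for subreddit, lengths in sorted(lengths_by_subreddit.items()):
    print(f"{subreddit}: mean comment length = {mean(lengths):.1f} tokens")
```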
My data sharing plan can be seen in the data/ directory. These files are the same files I'm using in my data analysis, but with the text field removed using scripts/redact_data.py. The data (barring any comments that have since been deleted or removed) can be retrieved using scripts/unredact_data.py. That script is quite slow and takes a few hours to finish running due to rate limits, but it fetches all of the same comments I have in my copy of the data.
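The core of the re-fetching step looks roughly like the sketch below: look each comment up by its comment_id through PRAW and restore the text field. This is not the actual scripts/unredact_data.py, just the idea; the credentials and file names are placeholders, and the real script has to deal with errors and the rate limiting mentioned above.

```python
import json

import praw

# Sketch of the re-fetching idea: restore the blanked text field by looking
# each comment up by its comment_id. Credentials and paths are placeholders.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="comment-unredactor (course project)",
)

with open("comment_data_redacted.json") as f:
    comments = json.load(f)

for comment in comments:
    # Deleted or removed comments come back as "[deleted]" / "[removed]".
    comment["text"] = reddit.comment(comment["comment_id"]).body

with open("comment_data_unredacted.json", "w") as f:
    json.dump(comments, f)
```

Fetching one comment per request like this is a big part of why the process is slow; PRAW's reddit.info() can look up fullnames in batches, which is one way the runtime could be reduced.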
For my license, I picked GPLv3. As far as I'm aware, no copyrighted data is directly shared in my project, so this license is fine for this purpose. Beyond that, I picked this license because I like the way it ensures that any potential further work using this code stays open source.
During this third leg I ended up running out of time and didn't manage to do everything I wanted to. The first thing I did was figure out how to get the rest of my data onto GitHub using Git LFS. I also removed the empty text field, since it just increases file size without adding any information. Next, I added more processing to my data exploration notebook and made it save out processed pickles for both of the larger datasets. Finally, I created a new clustering notebook. In its current state it's pretty incomplete, but I have further plans for it. Treating the comments as individual texts has turned out not to be a great approach, so next I want to try combining all of the comments from each subreddit into one text per subreddit and clustering that; given the larger text size, this should hopefully give more interesting results. After that, I want to continue with my plan to compare the different clusters on measures like comment length. Given the nature of Reddit, some topics attract longer comments than others: an art subreddit would likely attract shorter responses than something like AskReddit, where the comments are the main attraction, and I want to know whether this can be identified from the broad topic of the subreddit.
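For the combined-document clustering, the rough shape I have in mind is something like the sketch below, using TF-IDF features and k-means. This is just one way to do it, not what the notebook currently contains; the cluster count, feature limit, and input path are placeholders, and the preprocessing will almost certainly need more thought.

```python
import json

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: build one document per subreddit by concatenating its comments,
# vectorize with TF-IDF, and cluster with k-means. Paths, feature limits, and
# the number of clusters are placeholders.
with open("comment_data.json") as f:
    comments = json.load(f)

docs: dict[str, list[str]] = {}
for comment in comments:
    docs.setdefault(comment["subreddit"], []).append(comment["text"])
subreddits = sorted(docs)
corpus = [" ".join(docs[name]) for name in subreddits]

vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
X = vectorizer.fit_transform(corpus)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

for name, label in sorted(zip(subreddits, labels), key=lambda pair: pair[1]):
    print(label, name)
```

From there, comparing clusters on measures like mean comment length is mostly a matter of grouping the per-subreddit numbers by cluster label.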