reddit is a popular news and social networking website that bills itself as the “front page of the Internet.” The site reported 174 million unique visitors last month and 3 million logged-in users. However, reddit is not just one site. It is made up of many different sub-communities called subreddits. To date, there are more than 500,000 subreddits.
To start, I used the Python Reddit API Wrapper (PRAW) to scrape data on more than 17,000 users and 32,000 subreddits. The data is limited to public activity by logged-in users, such as posts and submissions. It does not include people who only view the site, the “lurkers” who, unfortunately, make up a significant part of the reddit population.
I stored all of the data in a Neo4j graph database. In a graph database, data points are stored along with their relationships to one another, which makes it a good fit for modeling things like social networks. This was my first time working with a graph database (I’m much more familiar with structured SQL) and it was an interesting experience. The relationships in the database look something like this:
In this picture, the blue nodes are users. That’s me, bkey23, in the middle and reddit co-founder Alexis Ohanian, aka kn0thing, near the bottom. The green nodes represent different subreddits, like r/nyc and r/askreddit. Users are linked to a subreddit if they have public activity (that is, comments or posts) in that subreddit. Subreddits can also be connected to one another if the description in one subreddit's sidebar mentions another subreddit.
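To make the two kinds of links concrete, here is a minimal in-memory sketch in plain Python rather than Neo4j. The function names, the regular expression, and the input shapes are my assumptions for illustration, not the project's actual schema or scraping code.

```python
import re
from collections import defaultdict

# Assumed pattern for mentions such as "r/nyc" or "/r/AskReddit" in sidebar text.
# The 2-21 character limit on names is an assumption about reddit's naming rules.
SUBREDDIT_MENTION = re.compile(r"(?<![A-Za-z0-9_])/?r/([A-Za-z0-9_]{2,21})")


def sidebar_links(name, description):
    """Return the other subreddits mentioned in one subreddit's sidebar text."""
    mentioned = {m.lower() for m in SUBREDDIT_MENTION.findall(description)}
    mentioned.discard(name.lower())  # a subreddit does not link to itself
    return mentioned


def build_graph(activity, sidebars):
    """Build both edge types from already-scraped data.

    activity: iterable of (user, subreddit) pairs, one per public comment or post
    sidebars: dict mapping subreddit name -> sidebar description text
    """
    user_edges = defaultdict(set)
    for user, subreddit in activity:
        user_edges[user].add(subreddit.lower())
    subreddit_edges = {
        name.lower(): sidebar_links(name, text) for name, text in sidebars.items()
    }
    return dict(user_edges), subreddit_edges
```

In this sketch, `user_edges` corresponds to the blue-to-green links in the picture and `subreddit_edges` to the green-to-green links.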
To get recommendations for subreddits, I use user-based collaborative filtering. With collaborative filtering, the idea is that we can generate recommendations through users who have displayed similar tastes in subreddits. In the picture example with me and kn0thing, we share an interest in subreddits such as NYC and technology but differ on many others, such as bitcoin. I employ the k-nearest neighbors (k-NN) algorithm, using a simple Jaccard similarity to measure how close two users are. The Jaccard similarity is the number of subreddits shared by two users divided by the total number of distinct subreddits between them. In the example, kn0thing and I share two subreddits and have been active in 15 distinct subreddits combined. So our Jaccard similarity is 2/15, or 0.133: not particularly similar.
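The similarity and neighbor search described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each user is represented as a set of subreddit names, and the function names and the default `k=5` are hypothetical, not taken from the project code.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two users with no activity at all: treat as not similar
    return len(a & b) / len(union)


def nearest_neighbors(target, user_subs, k=5):
    """Return the k users most similar to `target` as (user, similarity) pairs.

    user_subs: dict mapping user name -> set of subreddit names
    """
    scored = [
        (user, jaccard(user_subs[target], subs))
        for user, subs in user_subs.items()
        if user != target
    ]
    # Highest similarity first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

For two users who share 2 subreddits out of 15 distinct ones, `jaccard` returns 2/15, matching the example above. A brute-force scan like this compares the target against every other user, which is fine for a sketch but is a design choice that gets slow as the user count grows.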
Then, in order to create recommendations, I look for subreddits to which the nearest neighbors are subscribed but the user is not. I give bonus points to candidates that are relevant to what the user already likes, i.e. those whose sidebar links to, or is linked from, a subreddit the user is already in.
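A rough sketch of that scoring step follows. The weighting is an assumption on my part: each candidate earns the similarity of every neighbor who has it, and the total is multiplied by a hypothetical `1 + bonus` factor when a sidebar link connects it to one of the user's subreddits. The actual project may weight things differently.

```python
from collections import defaultdict


def recommend(target, user_subs, neighbors, sidebar_links, bonus=0.5, top_n=10):
    """Rank subreddits the neighbors have but `target` does not.

    user_subs:     dict mapping user name -> set of subreddit names
    neighbors:     list of (user, similarity) pairs for `target`
    sidebar_links: dict mapping subreddit -> set of subreddits its sidebar mentions
    bonus:         assumed multiplier boost for sidebar-related candidates
    """
    seen = user_subs[target]
    scores = defaultdict(float)

    # Each neighbor votes for its unseen subreddits, weighted by similarity.
    for neighbor, similarity in neighbors:
        for sub in user_subs[neighbor] - seen:
            scores[sub] += similarity

    # Boost candidates that link to, or are linked from, a subreddit the user has.
    for sub in list(scores):
        links_out = bool(sidebar_links.get(sub, set()) & seen)
        linked_from = any(sub in sidebar_links.get(liked, set()) for liked in seen)
        if links_out or linked_from:
            scores[sub] *= 1 + bonus

    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

The `neighbors` argument is expected to be a list of `(user, similarity)` pairs, for example the output of a k-NN step using Jaccard similarity.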
Code related to this project can be found on GitHub.
Improve recommendations: test new models using SVD, item-based CF, and/or topic modeling
Improve web interface and speed up processing