I’m now one of the co-organizers of Data For Democracy in NYC. Check out our meetup group.
reddit is a popular news and social networking website that bills itself as the “front page of the Internet.” It reported 174 million unique visitors last month and 3 million logged-in users. However, reddit is not just one site. It is made up of many different sub-communities called subreddits. To date, there are more than 500,000 subreddits.
To start, I used the Python Reddit API Wrapper (PRAW) to scrape data on more than 17,000 users and 32,000 subreddits. This is limited to public data, such as posts and submissions, from logged-in users. It does not cover people who only view the site, the “lurkers” who, unfortunately, make up a significant part of the reddit population.
I stored all of the data in a Neo4j graph database. In a graph database, data points are stored along with their relationships to one another, which makes it a good fit for modeling things like social networks. This was my first time working with a graph database (I’m much more familiar with structured SQL) and it was an interesting experience. The relationships in the database look something like this:
In this picture, the blue nodes are users. That’s me, bkey23, in the middle and reddit co-founder Alexis Ohanian, aka kn0thing, near the bottom. The green nodes represent different subreddits, like r/nyc and r/askreddit. Users are linked to subreddits if they have public activity (again, comments or posts) in that subreddit. Subreddits can also be connected to one another if the sidebar description of one mentions another.
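As a rough illustration of the subreddit-to-subreddit links described above, here is a minimal Python sketch that pulls subreddit mentions out of a sidebar description. The regular expression, the function name `sidebar_links`, and the sample sidebar text are my own assumptions for illustration, not the project's actual code.

```python
import re

# Assumed pattern: "r/name" or "/r/name", not preceded by a word character,
# so "other/thing" is ignored while "reddit.com/r/nyc" still matches.
SUBREDDIT_MENTION = re.compile(r"(?<![A-Za-z0-9_])/?r/([A-Za-z0-9_]+)")


def sidebar_links(own_name: str, sidebar_text: str) -> set[str]:
    """Return the lowercased names of other subreddits mentioned in a sidebar."""
    mentioned = {name.lower() for name in SUBREDDIT_MENTION.findall(sidebar_text)}
    # A subreddit mentioning itself is not a link to another subreddit.
    mentioned.discard(own_name.lower())
    return mentioned


# Hypothetical sidebar text, made up for this example.
example = sidebar_links(
    "nyc",
    "Welcome to r/nyc! See also /r/AskNYC and reddit.com/r/newyorkcity.",
)
```

Each returned name would become an edge from the subreddit that owns the sidebar to the subreddit it mentions.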
To get recommendations for subreddits, I use user-based collaborative filtering. The idea behind collaborative filtering is that we can generate recommendations from users who have displayed similar tastes in subreddits. In the picture example with me and kn0thing, we share an interest in subreddits such as NYC and technology but differ on many others, such as bitcoin. I employ the k-nearest neighbors (k-NN) algorithm, using a simple Jaccard similarity to measure how close two users are. The Jaccard similarity is the number of subreddits shared by two users divided by the total number of distinct subreddits in which either of them is active. In the example, kn0thing and I share two subreddits while having been active in 15 subreddits combined. So our Jaccard similarity is 2/15, or 0.133: not particularly similar.
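The similarity and neighbor search described above can be sketched in a few lines of Python. This is a minimal sketch, not the project's code: the function names are my own, and apart from nyc and technology the subreddit names in the example are placeholders invented so that the counts match the 2-out-of-15 case.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Shared subreddits divided by all distinct subreddits of both users."""
    union = a | b
    if not union:
        return 0.0  # two users with no activity: treat as not similar
    return len(a & b) / len(union)


def nearest_neighbors(target: str, activity: dict[str, set[str]], k: int = 5):
    """Return the k users most similar to `target` as (user, similarity) pairs."""
    scores = [
        (other, jaccard(activity[target], subs))
        for other, subs in activity.items()
        if other != target
    ]
    # Highest similarity first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Two shared subreddits plus placeholder names: 2 + 6 + 7 = 15 distinct in total.
activity = {
    "bkey23": {"nyc", "technology"} | {f"only_bkey23_{i}" for i in range(6)},
    "kn0thing": {"nyc", "technology"} | {f"only_kn0thing_{i}" for i in range(7)},
}
similarity = jaccard(activity["bkey23"], activity["kn0thing"])  # 2 / 15
```

Comparing every pair of users this way is simple but scales quadratically, which is fine for a sketch and may be part of why processing speed shows up as a concern later.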
Then, to create recommendations, I look for subreddits to which the nearest neighbors are subscribed but the user is not, giving bonus points to those that link to, or are linked from, the sidebar of a subreddit the user already likes (i.e., are relevant).
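One way to turn that description into a score is sketched below in Python. The post does not say how votes are weighted or how large the bonus is, so the similarity-weighted votes, the `bonus` value of 0.5, the function name `recommend`, and the sample users are all assumptions made for illustration.

```python
from collections import defaultdict


def recommend(target, activity, neighbors, sidebar_links, bonus=0.5, top_n=10):
    """Rank subreddits the neighbors use but `target` does not.

    activity:      user -> set of subreddits
    neighbors:     list of (user, similarity) pairs for `target`
    sidebar_links: subreddit -> set of subreddits its sidebar mentions
    """
    seen = activity[target]
    scores = defaultdict(float)
    for other, similarity in neighbors:
        for sub in activity[other] - seen:
            scores[sub] += similarity  # assumed: each neighbor votes with its similarity
    for sub in list(scores):
        outgoing = sidebar_links.get(sub, set())
        # Bonus if the candidate links to, or is linked from, a subreddit the user has.
        if any(s in outgoing or sub in sidebar_links.get(s, set()) for s in seen):
            scores[sub] += bonus
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Hypothetical users and links, made up for this example.
activity = {
    "alice": {"nyc", "technology"},
    "bob": {"nyc", "technology", "asknyc", "bitcoin"},
    "carol": {"nyc", "bitcoin"},
}
neighbors = [("bob", 0.5), ("carol", 1 / 3)]
links = {"nyc": {"asknyc"}}
ranked = recommend("alice", activity, neighbors, links)
```

In this toy case bitcoin gets more neighbor votes, but asknyc ends up ranked first because the nyc sidebar links to it.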
Code related to this project can be found on GitHub.
Improve recommendations, test new models using SVD, item-based CF and/or topic modeling
Improve web interface and speed up processing
For this project, I am trying to decide which actor or actress to cast in a hypothetical film to most help that film make money. Specifically, is there a better way to choose actors than relying solely on their previous box office totals? Previous studies have found that an actor’s or actress’s star power is a predictive factor in a film’s box office. But can we quantify “star power” in a different way than past box office performance?
In the age of the Internet, we can track users’ interest in a given topic in more ways than before. Previous studies have tried to predict box office success using Twitter or web search. For celebrities, these methods could help us identify “rising stars” much more quickly, or more accurately gauge which stars in a given movie are driving that film’s performance. By previous box office, Samuel L. Jackson is the most bankable star. However, on Wikipedia, I found that Jennifer Lawrence reigns.
I felt Wikipedia was a good choice for this kind of study because it is the largest reference site on the internet and the seventh overall in Alexa rankings. It is the go-to site for information. Also, they make their page view information freely available.
I took all films from 2012-2013, except animated films and documentaries, from the list of US films on boxofficemojo.com. I then took the list of actors in each film (again, according to boxofficemojo) and cross-referenced them with Wikipedia. Additionally, I took the page view statistics for every actor in one of those movies (as listed on boxofficemojo) for the years 2012 and 2013 from http://stats.grok.se/.
I tried several predictive models using scikit-learn. The one that worked best was a linear regression using average page views for the 30 days leading up to a film’s release, budget and rating. The prediction using Wikipedia page views performed slightly worse than the one using previous box office totals, with R-squared values of 0.614 and 0.64 respectively. However, the page views are significant, and future work could be done to improve the model.
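To make the page view feature and the comparison metric concrete, here is a small Python sketch of both. It does not reproduce the regression itself. The function names are my own, and two details are assumptions the post does not settle: the window ends the day before release, and days with no recorded views count as zero.

```python
from datetime import date, timedelta


def avg_views_before_release(daily_views, release, window=30):
    """Average daily page views over the `window` days before `release`.

    daily_views maps a date to that day's view count.
    Assumptions: release day itself is excluded; missing days count as 0.
    """
    days = [release - timedelta(days=i) for i in range(1, window + 1)]
    return sum(daily_views.get(day, 0) for day in days) / window


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical data: a flat 100 views per day for 60 days before a made-up release date.
release = date(2013, 3, 22)
views = {release - timedelta(days=i): 100 for i in range(1, 61)}
feature = avg_views_before_release(views, release)
```

An R-squared of 1.0 means the predictions match the actual box office exactly, and 0.0 means they do no better than always predicting the average, which is the scale on which 0.614 and 0.64 should be read.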
My code for this project can be found here
What do Politicians Talk About on Social Media? An Analysis of All 100 U.S. Senators Before the Midterm Elections
As of January 2014, all 100 United States senators have Twitter accounts. And, as far as I can tell, all but one senator, Idaho’s Jim Risch, have a Facebook page. I found the account names and/or numbers mainly through the Sunlight Foundation and then added or corrected a handful that I discovered to be missing or wrong.
Next, I took the text of those posts, discarding any images or videos, and used latent Dirichlet allocation (LDA) to classify the topics found in the posts. After trying a few different options, I settled on twelve topic groups. Of course, there are still some posts that don’t fit easily into any category, but these seemed reasonable. I named the groups myself.
You can view my viz here. It’s made in D3 and shows the data for the period of July 1, 2014 to October 31, 2014. Hover over a senator’s bar to see their details or hover over the category names to see prominent words in the cluster and example tweets.
One of the first things that jumps out is that New Jersey Senator Cory Booker tweets/facebooks (is facebooks a verb?) more than twice as much as the next most active senator. He is also by far the most likely to reply to another user on Twitter, replying 1,784 times. This is a big reason such a high percentage of his tweets fall into the “Thank yous and Personal Appeals” category. On Facebook, however, he had only the third most replies, behind Mark Begich and Mitch McConnell.
Facebook replies turned out to be the most predictive factor in telling whether a senator was currently running for election. My analysis didn’t look at which users a senator was responding to, but it makes sense that Facebook replies would more likely go to constituents, whereas Twitter replies could go to either regular citizens or other politicians. For example, I noticed some senators replying to @BarackObama.
Among those senators running for re-election, there is an average 26% increase in the “Press and News Release” category compared to four months prior, the clearest topical indicator that a senator is actively campaigning. There is also an average 30% decrease in the “Violence and Women” category among those running in the midterm elections, compared to a 19% increase among senators not running. That seems a bit counterintuitive given the attention female voters have received in the run-up to the midterm elections.
Want to see more? Check out the source code on GitHub.