I'm now one of the co-organizers of Data For Democracy in NYC. Check out our meetup group
My DataKind Team Featured on their Blog
Strengthening Communities One YMCA at a Time
Very excited to be working with this team! Exciting things to come 🙂
A Graph Model for a Subreddit Recommender
reddit is a popular news and social networking website that bills itself as the "front page of the Internet." They reported 174 million unique visitors last month and 3 million logged-in users. However, reddit is not just one site; it is made up of many different sub-communities called subreddits. To date, there are more than 500,000 subreddits.
To start, I used the Python Reddit API Wrapper (PRAW) to scrape data from more than 17,000 users and 32,000 subreddits. This is limited to public activity, like comments and submissions from logged-in users, and does not include "lurkers" who only view content and who, unfortunately, make up a significant part of the reddit population.
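For the curious, here's a minimal sketch of what that scraping step looks like with PRAW. It uses the current PRAW interface, which differs from the version available when I built this, and the credentials are placeholders:

```python
# Sketch of collecting a user's publicly active subreddits with PRAW.
# Credentials are placeholders; the modern PRAW API shown here differs
# from the older version used for the actual project.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="subreddit-recommender by u/bkey23",
)

def user_subreddits(username, limit=200):
    """Return the set of subreddits a user has publicly commented or posted in."""
    redditor = reddit.redditor(username)
    subs = set()
    for comment in redditor.comments.new(limit=limit):
        subs.add(comment.subreddit.display_name)
    for submission in redditor.submissions.new(limit=limit):
        subs.add(submission.subreddit.display_name)
    return subs

print(user_subreddits("kn0thing"))
```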
I stored all of the data in a Neo4j graph database. In a graph database, data points are stored along with their relationships to one another, which makes them a good fit for modeling things like social networks. This was my first time working with a graph database (I'm much more familiar with relational SQL) and it was an interesting experience. The relationships in the database look something like this:
In this picture, the blue nodes are users. That's me, bkey23, in the middle, and reddit co-founder Alexis Ohanian, aka kn0thing, near the bottom. The green nodes represent different subreddits, like r/nyc and r/askreddit. Users are linked to subreddits if they have public activity (comments or posts) in that subreddit. Subreddits can also be connected to one another if one subreddit's sidebar description mentions the other.
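To give a sense of how those relationships are created, here's a rough sketch using Cypher through the official Neo4j Python driver. The node labels and relationship type are illustrative, not necessarily the exact schema I used:

```python
# Illustrative sketch of loading user->subreddit activity into Neo4j.
# The labels and the ACTIVE_IN relationship type are examples, not
# necessarily the exact schema used in the project.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    for sub in ["nyc", "technology", "askreddit"]:
        session.run(
            "MERGE (u:User {name: $user}) "
            "MERGE (s:Subreddit {name: $sub}) "
            "MERGE (u)-[:ACTIVE_IN]->(s)",
            user="bkey23", sub=sub,
        )

driver.close()
```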
Here's a zoomed-out view of a larger subset of the graph as shown in Gephi. In this image, the colors don't correspond to users or subreddits but to the results of Gephi's clustering algorithm.
To generate subreddit recommendations, I use user-based collaborative filtering. The idea behind collaborative filtering is that we can generate recommendations from users who have displayed similar tastes in subreddits. In the picture example with me and kn0thing, we share an interest in subreddits such as NYC and technology but differ on many others, such as bitcoin. I employ the k-nearest neighbors (k-NN) algorithm with a simple Jaccard similarity to measure the distance between users. The Jaccard similarity is the number of subreddits shared by two users divided by the total number of distinct subreddits they have been active in. In the example, kn0thing and I share two subreddits while having been active in 15 total combined subreddits, so our Jaccard similarity is 2/15, or 0.133, which is not particularly similar.
Then, to create recommendations, I look for subreddits to which the nearest neighbors subscribe but the user does not, giving bonus points to subreddits that link to, or are linked from, the sidebars of subreddits the user already likes (i.e., ones that are likely relevant).
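Here's a simplified sketch of that logic in plain Python. The in-memory dictionaries and the exact bonus weighting are illustrative; the real pipeline works against the Neo4j graph:

```python
# Minimal sketch of the recommendation logic: Jaccard similarity between
# users' subreddit sets, k nearest neighbors, then subreddits the neighbors
# are active in but the target user is not. The sidebar "bonus points" are
# shown as a simple additive weight; the real scoring may differ.
def jaccard(a, b):
    """|A intersect B| / |A union B| -- e.g. 2 shared out of 15 total = 0.133."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def recommend(user, user_subs, sidebar_links, k=10, bonus=0.5):
    target = user_subs[user]
    neighbors = sorted(
        (u for u in user_subs if u != user),
        key=lambda u: jaccard(target, user_subs[u]),
        reverse=True,
    )[:k]

    scores = {}
    for neighbor in neighbors:
        weight = jaccard(target, user_subs[neighbor])
        for sub in user_subs[neighbor] - target:
            scores[sub] = scores.get(sub, 0.0) + weight
            # Bonus if this subreddit appears in the sidebar of one the user
            # already likes (sidebar_links: subreddit -> set of linked subs).
            if any(sub in sidebar_links.get(liked, set()) for liked in target):
                scores[sub] += bonus

    return sorted(scores, key=scores.get, reverse=True)
```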
Code related to this project can be found on github.
Future Improvements:
Improve recommendations; test new models using SVD, item-based collaborative filtering, and/or topic modeling
Improve web interface and speed up processing
Working with US Census Data & D3
This week, I spent some time trying to learn D3, a really great JavaScript library for visualization.
For data, I worked with US Census data from IPUMS. The initial goal was to recreate the very popular Adult dataset, updated for the year 2014. IPUMS is a terrific resource for data and very easy to use: they email you your requested data in ASCII files, which I then parsed into a SQL database.
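The parsing step is pretty mechanical. Here's a sketch using pandas to read the fixed-width extract and push it into SQLite; the column positions and names below are placeholders, since the real layout comes from the codebook IPUMS ships with each extract:

```python
# Sketch of loading an IPUMS fixed-width ASCII extract into a SQL database.
# Column positions and names are placeholders; the actual layout comes from
# the codebook included with the extract.
import pandas as pd
import sqlite3

colspecs = [(0, 4), (4, 6), (6, 9)]     # hypothetical field positions
names = ["YEAR", "AGE", "EDUC"]         # hypothetical field names

df = pd.read_fwf("cps_extract.dat", colspecs=colspecs, names=names)

with sqlite3.connect("census.db") as conn:
    df.to_sql("respondents", conn, if_exists="replace", index=False)
```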
Using a variety of classification techniques, I tried to identify survey respondents who have a bachelor's degree or higher. Since we are dealing with people of college age or older, I removed everyone under the age of 21 from the dataset, which left about 93,000 respondents. I tried logistic regression, k-nearest neighbors, decision trees, random forests, and support vector machines, and then used feature selection to identify the best features. The best model, a logistic regression, takes CPS data and is able to identify college graduates with 81% accuracy.
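For anyone who wants to reproduce the comparison, here's a rough sketch of the model bake-off using cross-validation in current scikit-learn. The feature matrix here is random placeholder data standing in for the CPS extract:

```python
# Sketch of comparing classifiers with cross-validation. X and y are random
# placeholder data standing in for the CPS features and the "bachelor's
# degree or higher" label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 2, size=1000)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "svm": SVC(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print("{}: {:.3f}".format(name, scores.mean()))
```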
Here is an interactive tool in D3 which shows what features are important to this model. The size of the bubbles represents the number of respondents who answered that way and the color represents the correlation with a positive result.
I also made a map in D3 of the percentage of Census respondents who are college graduates, by state, because I really like maps.
Another StarDate Article
As an update to my previous post on the AMNH hackathon, our project was featured in both The Atlantic and Vox.com, which is super cool. I'm famous now!
I definitely learned a lesson about how non-technical people respond to tech projects. Our thing was emphatically NOT the coolest project at the hackathon. And, of course, it didn’t win anything. But it’s pretty easy to understand, mostly finished and easily available online. That makes it an attractive project to “normals.”
Predicting Domestic Box Office Success with Wikipedia Page Views
For this project, I am trying to decide which actor or actress to cast in a hypothetical film to best help that film make money. Specifically, is there a better way to choose actors than basing the decision solely on their previous box office totals? Previous studies have found that an actor's or actress's star power is a predictive factor in a film's box office. But can we quantify "star power" in a way other than past box office performance?
In the age of the Internet, we can track users' interest in a given topic in more ways than ever before. Previous studies have tried to predict box office success using Twitter or web searches. For celebrities, these methods could help us identify "rising stars" much more quickly, or more accurately gauge which stars in a given movie are driving that film's performance. By previous box office, Samuel L. Jackson is the most bankable star. On Wikipedia, however, I found that Jennifer Lawrence reigns.
I felt Wikipedia was a good choice for this kind of study because it is the largest reference site on the Internet and ranks seventh overall in Alexa's rankings. It is the go-to site for information. Also, Wikipedia makes its page view statistics freely available.
I took all films from 2012-2013, excluding animated films and documentaries, from the list of US films on boxofficemojo.com. I then took the list of actors in each film (again, according to Box Office Mojo) and cross-referenced them with Wikipedia. Additionally, I collected page view statistics for every actor in one of those movies for 2012 and 2013 from http://stats.grok.se/.
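Here's a rough sketch of the page view collection step. I'm assuming the stats.grok.se JSON endpoint format and response keys from memory, and the service has not been maintained in years, so treat this purely as an illustration:

```python
# Sketch of pulling monthly page view counts from stats.grok.se. The
# /json/<lang>/<YYYYMM>/<title> URL format and the "daily_views" key are
# assumptions from memory; the service is no longer maintained.
import requests

def monthly_views(title, yearmonth):
    url = "http://stats.grok.se/json/en/{}/{}".format(yearmonth, title)
    data = requests.get(url).json()
    return data.get("daily_views", {})   # assumed: {"YYYY-MM-DD": count, ...}

views = monthly_views("Jennifer_Lawrence", "201311")
print(sum(views.values()))
```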
I tried several predictive models using scikit-learn; the one that worked best was a linear regression on a film's budget, its rating, and the average page views for the 30 days leading up to its release. The prediction using Wikipedia page views performed slightly worse than the one using previous box office totals, with R-squared values of 0.614 and 0.64, respectively. However, the page views are a significant predictor, and future work could be done to improve the model.
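As a sketch of what that best model looks like, here's the regression in scikit-learn with made-up example rows and placeholder column names standing in for my actual feature table:

```python
# Sketch of the best-performing model: ordinary linear regression on average
# page views over the 30 days before release, plus budget and rating.
# The rows and column names below are made up for illustration.
import pandas as pd
from sklearn.linear_model import LinearRegression

films = pd.DataFrame({
    "avg_views_30d":  [52000, 8700, 150000, 31000, 4100],
    "budget":         [150e6, 20e6, 200e6, 60e6, 10e6],
    "rating":         ["PG-13", "R", "PG-13", "PG", "R"],
    "domestic_gross": [300e6, 25e6, 400e6, 90e6, 12e6],
})

# One-hot encode the rating and fit the regression.
X = pd.get_dummies(films[["avg_views_30d", "budget", "rating"]], columns=["rating"])
y = films["domestic_gross"]

model = LinearRegression().fit(X, y)
print(model.score(X, y))   # R^2 on the training data
```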
My code for this project can be found here
Night At the Museum
This weekend, I had the pleasure of taking part in the first ever Hack the Universe hackathon at the American Museum of Natural History, basically the coolest location for a hackathon EVAR. The hackathon kicks off the museum's new BridgeUp STEM initiative. The whole thing lasted exactly 24 hours, starting at 6pm on Friday night and ending with presentations of all the work at 6pm on Saturday. That's right, we spent the night in the Hall of the Universe. I had an amazing time, met some great people, and learned a ton about stars. My one complaint was how cold it was at night. Note to future participants: the museum is basically not heated at night; it feels like being outside, so DRESS WARMLY. Big thanks to the organizers and everyone involved, especially my awesome team: Will, Charlye and Jen!
You can check out our hack, alternately called Histarry or StarDate depending on who you ask, here. It attempts to put the distances of the 100 stars closest to Earth into a context humans can understand. The distance of each star from the sun is measured in light-years, which we obtained from AMNH's Digital Universe software. Because the light we see from a star N light-years away left it N years ago, we can convert each distance into a date and correlate the 100 closest stars to the sun with events in history using the New York Times Article API. It's like being in outer space, receiving the latest edition of the New York Times, and looking at history in the present. The visualization was done in D3. (Did I mention I'm trying to learn D3?)
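The core idea is just a distance-to-year conversion followed by an article lookup. Here's a minimal sketch against the NYT Article Search API; the API key is a placeholder and the exact query parameters we used may have differed:

```python
# Rough sketch of the idea: a star N light-years away is showing us light
# that left it N years ago, so look up what was in the news back then via
# the NYT Article Search API. The API key is a placeholder.
import datetime
import requests

NYT_KEY = "YOUR_API_KEY"   # hypothetical

def headlines_for_star(distance_ly, n=3):
    year = datetime.date.today().year - int(round(distance_ly))
    resp = requests.get(
        "https://api.nytimes.com/svc/search/v2/articlesearch.json",
        params={
            "begin_date": "{}0101".format(year),
            "end_date": "{}1231".format(year),
            "api-key": NYT_KEY,
        },
    )
    docs = resp.json().get("response", {}).get("docs", [])
    return [d["headline"]["main"] for d in docs[:n]]

# Proxima Centauri is about 4.2 light-years away.
print(headlines_for_star(4.2))
```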
EDIT: There’s an article about our project in the Atlantic!
What do Politicians Talk About on Social Media? An Analysis of All 100 U.S. Senators Before the Midterm Elections
As of January 2014, all 100 United States senators have Twitter accounts. And, as far as I can tell, all but one senator, Idaho's Jim Risch, have a Facebook page. I found the account names and/or numbers mainly from The Sunlight Foundation and then added a handful that I discovered to be missing or wrong.
Once I had the account info, I used the Twitter and Facebook APIs to pull all posts from the last few months. That worked out to be about 138,000 tweets and 51,000 Facebook posts.
Next, I took the text of those posts, discarding any images or videos, and used Latent Dirichlet Allocation (LDA) to identify the topics found in the posts. After trying a few different options, I settled on twelve topic groups. Of course, there are still some posts that don't fit easily into any category, but these groupings seemed reasonable. I created the names of the groups myself.
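For the curious, here's a rough sketch of that topic-modeling step, using scikit-learn's LDA implementation as an illustration (the original run may have used a different library, and the example posts below are made up):

```python
# Sketch of fitting LDA to post text. The example posts are made up, and
# n_components is set to 3 so this toy data runs; the real analysis used
# twelve topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "thank you for your support and kind words",
    "read my full statement on the new jobs bill",
    "proud to announce new funding for our veterans",
    "join me at the town hall this saturday",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)

# Print the top words in each topic -- this is what I eyeballed when
# naming the twelve groups by hand.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print("topic", i, ":", ", ".join(top))
```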
You can view my viz here. It’s made in D3 and shows the data for the period of July 1, 2014 to October 31, 2014. Hover over a senator’s bar to see their details or hover over the category names to see prominent words in the cluster and example tweets.
One of the first things that jumps out is that New Jersey Senator Cory Booker tweets/facebooks (is facebooks a verb?) more than twice as much as the next highest senator. He is also by far the most likely to reply to another user on Twitter, replying 1,784 times. This is a big reason such a high percentage of his tweets fall into the "Thank yous and Personal Appeals" category. On Facebook, however, he had only the third most replies, behind Mark Begich and Mitch McConnell.
Facebook replies turned out to be the most predictive factor for telling whether a senator is currently running for election. My analysis didn't look at who the users a senator was responding to actually were, but it seems plausible that Facebook replies are more likely to go to constituents, whereas Twitter replies could go to either regular citizens or other politicians. For example, I noticed some senators replying to @BarackObama.
Among those senators running for re-election, there is an average 26% increase in the "Press and News Release" category compared to four months prior, the clearest topical indicator that a senator is actively campaigning. There is also an average 30% decrease in the "Violence and Women" category among those running in the midterm elections, compared to a 19% increase among those senators not running, which seems a bit counter-intuitive given the attention female voters have received in the run-up to the midterm elections.
Want to see more? Check out the source code on github.
I've updated my CitiBike heatmap to use the Leaflet plugin for the WebGL heatmap library. The old version used heatmap.js, but the two ceased to get along. It's still viewable here, minus the background map. I wish Leaflet would add its own heatmap instead of making us deal with third-party plugins.
A Day in the Life of CitiBike part 2
I made a heatmap of CitiBike activity for yesterday, June 15th, using heatmap.js and Leaflet. I used the same criteria for "activity" as in the previous post. This time, the most active stations were Broadway & W 57 St and Central Park S & 6 Ave, hence the red dot near Central Park.
Edit: Here's another for Friday, June 22nd. You can see there is a lot more activity on weekdays, especially around transit hubs like Union Square, Penn Station and Grand Central Terminal.
Edit 2: I made an interactive map where you can view the current status. Click on the layer control button on the far right to view the available bikes, available docks, or total docks.
The code is on github.