For data, I worked with US Census data from the IPUMS website. The initial goal was to recreate the very popular Adults Dataset, updated for the year 2014. IPUMS is a terrific resource for data and very easy to use. They email you your requested data in ASCII files, which I then parsed into a SQL database.
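To give a feel for the parsing step, here is a minimal sketch that loads ASCII records into SQLite. It assumes the extract is a fixed-width file. The column layout, the table name `respondents`, and the sample records are all made up for illustration; a real extract would follow whatever layout documentation comes with it.

```python
import sqlite3

# Hypothetical layout: (column name, start offset, end offset) per field.
# These positions are assumptions, not the layout of any real extract.
LAYOUT = [("year", 0, 4), ("age", 4, 6), ("statefip", 6, 8), ("educ", 8, 11)]


def parse_line(line):
    """Slice one fixed-width record into a tuple of integers."""
    return tuple(int(line[start:end]) for _, start, end in LAYOUT)


def load_records(lines, conn):
    """Create the table if needed and bulk-insert every non-blank record."""
    columns = ", ".join(f"{name} INTEGER" for name, _, _ in LAYOUT)
    conn.execute(f"CREATE TABLE IF NOT EXISTS respondents ({columns})")
    placeholders = ", ".join("?" for _ in LAYOUT)
    rows = [parse_line(line) for line in lines if line.strip()]
    conn.executemany(f"INSERT INTO respondents VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


# Made-up sample records standing in for the ASCII file's lines.
sample = ["20143406111", "20141936073", "20145248125"]

conn = sqlite3.connect(":memory:")
loaded = load_records(sample, conn)
adults = conn.execute(
    "SELECT COUNT(*) FROM respondents WHERE age >= 21"
).fetchone()[0]
print(f"loaded {loaded} records, {adults} aged 21 or older")
```

Once the data is in SQL, filters like the age cutoff become one-line queries.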
Using a variety of classification techniques, I tried to identify survey respondents who have a Bachelor’s Degree or higher. Since the target population is people of college age or older, I removed everyone under the age of 21 from the data set, which left about 93,000 respondents. I tried Logistic Regression, K Nearest Neighbors, Decision Trees, Random Forests, and Support Vector Machines. Then I did feature selection to identify the best features. The best model, a Logistic Regression, takes CPS data and is able to identify college graduates with 81% accuracy.
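The workflow above (drop respondents under 21, fit a classifier, measure accuracy on held-out data) can be sketched roughly as follows. This is not the model from the project: it is a small pure-Python logistic regression trained by gradient descent on synthetic data, and the `professional` feature, the probabilities, and the 80/20 split are assumptions made up for the example. In practice a library such as scikit-learn would be the usual choice.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, xi):
    """Predicted probability of a degree; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def train_logistic(X, y, lr=1.0, epochs=300):
    """Fit weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = score(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def accuracy(w, X, y):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum((score(w, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y))
    return hits / len(y)


# Synthetic stand-in for the survey; every value here is invented.
rng = random.Random(0)
respondents = []
for _ in range(1000):
    professional = rng.random() < 0.4  # hypothetical binary feature
    degree = 1 if rng.random() < (0.7 if professional else 0.1) else 0
    respondents.append(
        {"age": rng.randint(15, 80), "professional": professional, "degree": degree}
    )

# Keep only respondents aged 21 or older, as in the write-up.
adults = [r for r in respondents if r["age"] >= 21]
X = [[1.0 if r["professional"] else 0.0, (r["age"] - 50) / 30] for r in adults]
y = [r["degree"] for r in adults]

# Hold out the last 20% of rows for evaluation.
split = int(0.8 * len(adults))
w = train_logistic(X[:split], y[:split])
test_accuracy = accuracy(w, X[split:], y[split:])
print(f"{len(adults)} respondents kept, held-out accuracy {test_accuracy:.2f}")
```

Measuring accuracy on rows the model never saw during training is what makes a figure like 81% meaningful as an estimate for new respondents.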
Here is an interactive tool in D3 that shows which features are important to this model. The size of each bubble represents the number of respondents who answered that way, and the color represents the correlation with a positive result.
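The two numbers behind each bubble can be computed along these lines. This sketch assumes that "correlation with a positive result" means the Pearson correlation between a 0/1 indicator for the answer and the 0/1 degree label; the exact measure used in the tool may differ. The `occupation` field and the sample rows are hypothetical.

```python
import math


def phi_coefficient(indicator, labels):
    """Pearson correlation between two 0/1 lists (the phi coefficient)."""
    n = len(labels)
    mean_x = sum(indicator) / n
    mean_y = sum(labels) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(indicator, labels))
    var_x = sum((x - mean_x) ** 2 for x in indicator)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def bubble_data(rows, feature, label="degree"):
    """One record per distinct answer: respondent count and correlation."""
    labels = [r[label] for r in rows]
    bubbles = []
    for answer in sorted({r[feature] for r in rows}):
        indicator = [1 if r[feature] == answer else 0 for r in rows]
        bubbles.append(
            {
                "answer": answer,
                "size": sum(indicator),  # drives the bubble radius
                "color": phi_coefficient(indicator, labels),  # drives the hue
            }
        )
    return bubbles


# Made-up rows; "occupation" is a hypothetical feature name.
rows = [
    {"occupation": "professional", "degree": 1},
    {"occupation": "professional", "degree": 1},
    {"occupation": "professional", "degree": 0},
    {"occupation": "service", "degree": 0},
    {"occupation": "service", "degree": 0},
    {"occupation": "service", "degree": 1},
]
bubbles = bubble_data(rows, "occupation")
for b in bubbles:
    print(b["answer"], b["size"], round(b["color"], 3))
```

A list of records like this can be serialized to JSON and bound directly to circles in D3.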
I also made a map in D3 of the percentage of Census respondents who are college graduates, by state, because I really like maps.
Working with US Census Data & D3