“People Who Like This Also Like … “
A while ago a friend of mine asked me how I would go about building a ‘People Who Like This Also Like …’ feature for a music startup he was working at. For each band or musician, he wanted to display a list of other artists that people might also be interested in. At the time, I think I arrogantly responded with something like ‘Thats easy! I can think of like a dozen ways of calculating this!’. Which of course was profoundly unhelpful and probably slightly infuriating. Once he calmed down, I sketched out how I would calculate the distance between any two artists – and use that distance as a ranking function to build this feature. Since then he’s been encouraging me to write a blog post about this, and after a totally unreasonable delay I finally got around to finishing it up. So here is my step by step guide for the non data scientist, using Python with Pandas and SciPy to compute the distances, and D3.js for building gratuitously interactive visualizations.
Statistics: P values are just the tip of the iceberg
There is no statistic more maligned than the P value. Hundreds of papers and blogposts have been written about what some statisticians deride as ‘null hypothesis significance testing’. NHST deems whether the results of a data analysis are important on the basis of whether a summary statistic (such as a P value) has crossed a threshold. Given the discourse, it is no surprise that some hailed as a victory the banning of NHST methods (and all of statistical inference) in the journal Basic and Applied Social Psychology in February.
L1 vs. L2 Loss function
Least absolute deviations (L1) and Least square errors (L2) are the two standard loss functions, that decides what function should be minimized while learning from a dataset.
Document Classification with scikit-learn
Document classification is a fundamental machine learning task. It is used for all kinds of applications, like filtering spam, routing support request to the right support rep, language detection, genre classification, sentiment analysis, and many more. To demonstrate text classification with scikit-learn, we’re going to build a simple spam filter. While the filters in production for services like Gmail are vastly more sophisticated, the model we’ll have by the end of this tutorial is effective, and surprisingly accurate.