Learning from the experience of others with mixed effects models
At Stitch Fix we have many problems that boil down to finding the best match between items in two sets. Our recommendation algorithms match inventory to clients with the help of the expert human judgment of our stylists. We also match these stylists to clients. This blog post is about the remarkably useful application of some classical statistical models to these and similar problems that feature repeated measurements.
1st Place Solution for Search Results Relevance Competition on Kaggle
The best single model we obtained during the competition was an XGBoost model with a linear booster, which scored 0.69322 on the Public LB and 0.70768 on the Private LB. Our final winning submission was a median ensemble of the 35 best Public LB submissions. This submission scored 0.70807 on the Public LB and 0.72189 on the Private LB.
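To make the "median ensemble" idea concrete, here is a minimal sketch in Python. It is not the winning team's code: the function name `median_ensemble` and the three score lists are hypothetical, and it only assumes that every submission scores the same samples in the same order.

```python
from statistics import median


def median_ensemble(predictions):
    """Combine per-model predictions by taking the median for each sample.

    `predictions` is a list of equally long lists, one list per model.
    """
    if not predictions:
        raise ValueError("need at least one set of predictions")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    # zip(*predictions) groups the i-th prediction of every model together.
    return [median(sample) for sample in zip(*predictions)]


# Hypothetical relevance scores from three models for four samples.
model_a = [1, 3, 4, 2]
model_b = [2, 3, 4, 4]
model_c = [1, 2, 4, 3]
blended = median_ensemble([model_a, model_b, model_c])
print(blended)
```

A median is less sensitive than a mean to a single submission that is far off on a given sample, which is one plausible reason to prefer it when blending many submissions.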
A Neural Network in 11 lines of Python
I learn best with toy code that I can play with. This tutorial teaches backpropagation via a very simple toy example, a short Python implementation.
Tracking down the Villains: Outlier Detection at Netflix
The Netflix service currently runs on tens of thousands of servers; typically less than one percent of those become unhealthy. For example, a server’s network performance might degrade and cause elevated request processing latency. The unhealthy server will respond to health checks and show normal system-level metrics but still be operating in a suboptimal state.
Python: Social network analysis with NetworkX
This post describes how to use the Python library NetworkX to deal with network data and solve interesting problems in network analysis. You can find a nice IPython Notebook with all the examples below on Domino.
50+ Data Science and Machine Learning Cheat Sheets
Get up to speed and keep Data Science & Data Mining concepts and commands handy with these cheat sheets covering R, Python, Django, MySQL, SQL, Hadoop, Apache Spark and machine learning algorithms.
Spark 1.4 for RStudio
This document contains a tutorial on how to provision a Spark cluster with RStudio. You will need a machine that can run bash scripts and a functioning account on AWS. Note that this tutorial is meant for Spark 1.4.0. Future versions will most likely be provisioned in another way, but this should be good enough to help you get started. At the end of this tutorial you will have a fully provisioned Spark cluster that allows you to handle simple dataframe operations on gigabytes of data within RStudio.
Visualizing West Nile Virus
The West Nile Virus competition gave participants weather, location, spraying, and mosquito testing data from the City of Chicago and asked them to predict when and where the virus would appear. This dataset was perfect for visual storytelling and Kagglers did not disappoint. They never do! Below are five of our favorite visualizations shared in the competition’s scripts repository. Stay tuned for a second post later this week with top benchmark code and tutorials from the competition featuring Keras, XGBoost, and Lasagne.
Easy Bayesian Bootstrap in R
A while back I wrote about how the classical non-parametric bootstrap can be seen as a special case of the Bayesian bootstrap. Well, one difference between the two methods is that, while it is straightforward to roll a classical bootstrap in R, there is no easy way to do a Bayesian bootstrap. This post, in an attempt to change that, introduces a bayes_boot function that should make it pretty easy to do the Bayesian bootstrap for any statistic in R. If you just want a function you can copy-n-paste into R go to The bayes_boot function below. Otherwise here is a quick example of how to use the function, followed by some details on the implementation.
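For readers who want the gist before opening the post, here is a rough Python sketch of the Bayesian bootstrap for a mean. It is not the post's R `bayes_boot` function: the name `bayesian_bootstrap_mean`, its parameters, and the sample data are all hypothetical. The only technique assumed is the standard one, drawing observation weights from a flat Dirichlet distribution and recomputing the weighted statistic for each draw.

```python
import random


def bayesian_bootstrap_mean(data, n_draws=4000, seed=None):
    """Return posterior draws of the mean under the Bayesian bootstrap."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Exponential(1) variables divided by their sum form one
        # Dirichlet(1, ..., 1) draw, i.e. one set of observation weights.
        raw = [rng.expovariate(1.0) for _ in data]
        total = sum(raw)
        draws.append(sum(w * x for w, x in zip(raw, data)) / total)
    return draws


# Hypothetical measurements, for illustration only.
heights = [183, 192, 182, 183, 177, 185, 188, 188, 182, 185]
posterior = sorted(bayesian_bootstrap_mean(heights, seed=42))

# A 95% credible interval read off the sorted posterior draws.
lower = posterior[int(0.025 * len(posterior))]
upper = posterior[int(0.975 * len(posterior))]
print(lower, upper)
```

The classical bootstrap corresponds to replacing the Dirichlet weights with resampling counts divided by the sample size, which is the sense in which it can be viewed as a special case.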
A Simple Intro to Bayesian Change Point Analysis
The purpose of this post is to demonstrate change point analysis by stepping through an example of the technique in R presented in Rizzo’s excellent, comprehensive, and very mathy book, Statistical Computing with R, and then showing alternative ways to process this data using the changepoint and bcp packages. Much of the commentary is simplified, and that’s on purpose: I want to make this introduction accessible if you’re just learning the method. (Most of the code is straight from Rizzo who provides a much more in-depth treatment of the technique. I’ve added comments in the code to make it easier for me to follow, and that’s about it.)
R 101 – Aggregate By Quarter
We were asked a question on how to (in R) aggregate quarterly data from what I believe was a daily time series. This is a pretty common task and there are many ways to do this in R, but we’ll focus on one method using the zoo and dplyr packages. Let’s get those imports out of the way:…
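The post's own solution is in R with zoo and dplyr and is not reproduced here. As a separate illustration of the same task, the following Python sketch rolls a daily series up to quarterly totals using only the standard library. The function name `aggregate_by_quarter` and the constant daily series are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_by_quarter(series):
    """Sum (date, value) pairs into a {(year, quarter): total} dict."""
    totals = defaultdict(float)
    for day, value in series:
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        quarter = (day.month - 1) // 3 + 1
        totals[(day.year, quarter)] += value
    return dict(totals)


# Hypothetical daily series: a value of 1.0 for every day of 2015.
start = date(2015, 1, 1)
daily = [(start + timedelta(days=i), 1.0) for i in range(365)]
quarterly = aggregate_by_quarter(daily)
print(quarterly)
```

Because every day carries the value 1.0, each quarterly total equals the number of days in that quarter, which makes the grouping easy to check by eye.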
Understanding Data Visualisations
This resource aims to help people make sense of data visualisations. It’s for the general public – people who are interested in visualisations, but are not experts in this subject. Each section tells you something different, and it attempts to build your confidence and skills in making sense of data visualisations. You can work through the sections in any order you like. Why do we need to understand data visualisations? There is more and more data around us, and data are increasingly used in decision-making, journalism, and to make sense of the world. One of the main ways that people get access to data is through visualisations, but lots of people feel like they don’t have the skills and knowledge to make sense of visualisations. This can mean that some people feel left out of conversations about data. This resource aims to overcome that problem, by helping people to develop their ability to understand – and enjoy! – data visualisations.