k-Fold Cross Validation made simple
Does your high-performing model perform poorly on an out-of-time sample? Has your Kaggle private score dropped significantly from your public score? Not sure whether your current model is overfit or the right fit? If your answer to any of these three questions is ‘yes’, you have come to the right place. Recently, I participated in a Kaggle competition called TFI, where I experimented with a lot of things to get a sense of when I overfit. I will illustrate my findings in this article.
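For readers who have not met the technique before, here is a minimal sketch of the splitting step behind k-fold cross-validation. It is not taken from the article: the function name `k_fold_indices` and its parameters are made up for illustration, and it only produces row indices, leaving model fitting and scoring to you.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs, one pair per fold.

    Hypothetical helper for illustration only.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once so folds are not tied to the original row order.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]

    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


if __name__ == "__main__":
    for train_idx, valid_idx in k_fold_indices(n_samples=10, k=3):
        # Fit on train_idx and score on valid_idx here.
        print(len(train_idx), len(valid_idx))
```

Every row lands in exactly one validation fold, so averaging the k validation scores gives an estimate that does not rest on a single lucky or unlucky split. A large gap between the training score and that average is one sign of overfitting.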
SparkR preview in Rstudio
Apache Spark is the hip new technology on the block. It lets you write scripts in a functional style, and the engine behind it runs iterative tasks very quickly on a cluster of machines. It’s benchmarked to be 10 to 100 times quicker than Hadoop for most machine learning use cases, and soon Spark will also have active support for the R language. As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4) due early summer 2015. In the meantime, you can use this tutorial to get started with the Spark API in R.
R Recipe: Aligning Axes in ggplot2
Faceted plots in ggplot2 are phenomenal. They give you a simple way to break down an array of plots according to the values of one or more categorical variables. But what if you want to stack plots of different variables? Not quite so simple. But certainly possible. I pieced this solution together from a variety of sources on Stack Overflow, notably this one and this other one. A similar issue for vertical alignment is addressed here.