Basics of Ensemble Learning Explained in Simple English

Ensemble modeling is a powerful way to improve the performance of your model. It usually pays off to apply ensemble learning on top of the various models you might be building. Time and again, people have used ensemble models in competitions like Kaggle and benefited from them. Ensemble learning is a broad topic, limited only by your own imagination. For the purpose of this article, I will cover the basic concepts and ideas of ensemble modeling. This should be enough for you to start building ensembles on your own. As usual, I have tried to keep things as simple as possible.
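
To make the basic idea concrete, here is a minimal sketch of one of the simplest ensembles, a majority vote over the class labels predicted by several models. It is not taken from the article itself: the function name `majority_vote` and the three prediction lists are hypothetical, chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine class-label predictions from several models by majority vote.

    predictions: a list with one inner list of labels per model;
    all inner lists must have the same length (one label per sample).
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")

    combined = []
    # zip(*predictions) groups the votes of all models for each sample.
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions of three binary classifiers on four samples.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

print(majority_vote([model_a, model_b, model_c]))  # prints [1, 0, 1, 1]
```

Using an odd number of models with binary labels avoids ties. With an even number, or with more than two classes, you would need an explicit tie-breaking rule.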

Pandashells: Bringing the python data stack to the shell prompt

For decades, system administrators, dev-ops engineers and data analysts have been piping textual data between Unix tools such as grep, awk and sed. Chaining these tools together provides an extremely powerful workflow. The more recent emergence of the ‘data scientist’ has increased the popularity of tools like R, Pandas and IPython. These tools have amazing power for transforming, analyzing and visualizing datasets in ways that grep, awk, sed, and even the dreaded Perl one-liner could never accomplish. Pandashells is an attempt to marry the expressive, concise workflow of the shell pipeline with the statistical and visualization tools of the Python data stack.

New Standard Methodology for Analytical Models

Traditional methods for analytical modelling, such as CRISP-DM, have several shortcomings. Here we describe these friction points in CRISP-DM and introduce a new approach, the Standard Methodology for Analytics Models, which overcomes them.

Ternary Interpolation / Smoothing

For a long time, people have been sending me requests for a suitable smoothing / contouring / interpolation geometry to be made available via ggtern, over and above the Kernel Density function. I am very pleased to say that the recent version 1.0.6 has this feature added. Let me demonstrate how it works.

Survival Analysis – 1

I was recently looking for methods to apply to time-to-event data and started exploring Survival Analysis models. In this post, I explore the basic Kaplan-Meier (KM) estimator, a nonparametric estimator of the survival function, using a real dataset (on time to death for 80 males who were diagnosed with different types of tongue cancer, from the package KMsurv) and a simulated dataset (using the package survsim). In addition, I am using OIsurv, dplyr, ggplot2 and broom for this analysis. What follows is a basic rmarkdown document illustrating the analysis.
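
For readers who have not met the KM estimator before, the following is a minimal from-scratch sketch of the calculation in Python. It is not the R code from the post. The function name `kaplan_meier` and the five toy observations are hypothetical, and it assumes the common convention that a subject censored at time t still counts as at risk at t. At each distinct event time, the running survival probability is multiplied by (1 - deaths / number at risk).

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of the survival function.

    durations: observed time for each subject (event or censoring time).
    events:    1 if the event was observed at that time, 0 if censored.
    Returns a list of (time, survival_probability) pairs, one per
    distinct time at which at least one event occurred.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []

    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every subject whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after t.
        n_at_risk -= removed
    return curve


# Hypothetical toy data: five subjects, two of them censored.
times = [1, 2, 2, 3, 4]
observed = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, observed):
    print(t, round(s, 3))
```

On the toy data the estimate drops at times 1, 2 and 3, and stays flat at time 4 because the last subject is censored. For real analyses, an established package, such as the R packages named above, is the safer choice, since it also provides confidence intervals and plotting.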