Cheatsheet: Scikit-Learn & Caret Package for Python & R respectively

For any Python or R practitioner, this article will prove to be a boon. We provide cheatsheets for the most widely used machine learning library in each of Python and R. Read on to know what’s in store for you.


Naive Bayes Classification explained with Python code

Machine Learning is a vast area of Computer Science concerned with designing algorithms that form good models of the world around us (that is, of the data coming from that world). Many Machine Learning tasks are, or can be reformulated as, classification tasks. In a classification task we try to produce a model that captures the relationship between the input data $X$ and the class $C$ each input belongs to. This model is built from the feature values of the input data. For example, a dataset may contain datapoints belonging to the classes Apples, Pears and Oranges, and we try to predict the class from the features of each datapoint (weight, color, size, etc.). We need some amount of training data to train the Classifier, i.e. to form a correct model of the data. We can then use the trained Classifier to classify new data. If the training dataset is chosen correctly, the Classifier should predict the class probabilities of new data with an accuracy similar to the one it achieves on the training examples.
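As a rough illustration of the setup described above, here is a minimal Gaussian Naive Bayes sketch in plain Python. It is not the code from the linked article: the function names (`fit_gaussian_nb`, `predict_proba`, `predict`), the choice of a Gaussian likelihood per feature, and the toy fruit measurements are all assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def predict_proba(model, x):
    """Return normalised class probabilities for one feature vector."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Naive assumption: features are independent given the class,
        # so the log-likelihoods simply add up.
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(unnorm.values())
    return {k: u / total for k, u in unnorm.items()}

def predict(model, x):
    """Return the most probable class for one feature vector."""
    probs = predict_proba(model, x)
    return max(probs, key=probs.get)

# Made-up toy data: (weight in grams, diameter in cm).
X = [(150, 7.0), (160, 7.5), (155, 7.2),   # apples
     (180, 6.0), (175, 6.2), (185, 5.9),   # pears
     (200, 8.0), (210, 8.3), (205, 8.1)]   # oranges
y = ["apple"] * 3 + ["pear"] * 3 + ["orange"] * 3

model = fit_gaussian_nb(X, y)
```

Calling `predict(model, (152, 7.1))` classifies a new fruit, and `predict_proba` returns the class probabilities the blurb refers to. In practice a library implementation would usually be preferable to a hand-rolled one.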


Machine Learning Classifier Comparison


Stacked Generative Adversarial Networks

We propose a novel generative model named Stacked Generative Adversarial Networks (SGAN). Our model consists of a top-down stack of GANs, each trained to generate plausible lower-level representations, conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, providing intermediate supervision.


Bosch Production Line Performance Competition Winners’ Interview: 3rd Place, Team Data Property Avengers | Darragh, Marios, Mathias, & Stanislav

The Bosch Production Line Performance competition ran on Kaggle from August to November 2016. Well over one thousand teams with 1602 players competed to reduce manufacturing failures using intricate data collected at every step along Bosch’s assembly lines. Team Data Property Avengers, made up of Kaggle heavyweights Darragh Hanley (Darragh), Marios Michailidis (KazAnova), Mathias Müller (Faron), and Stanislav Semenov, came in third place by relying on their experience working with grouped time-series data in previous competitions plus a whole lot of feature engineering.


Cost Weighted Logistic Loss

The problem of weighting type I and type II errors in binary classification came up in a forum I visit.
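One common way to weight the two error types is to scale the two halves of the logistic (cross-entropy) loss separately. The sketch below is an assumption about what such a loss might look like, not necessarily the formulation used in the linked post; the function name `weighted_log_loss` and the parameters `fn_cost` and `fp_cost` are hypothetical.

```python
import math

def weighted_log_loss(y_true, p_pred, fn_cost=1.0, fp_cost=1.0, eps=1e-12):
    """Mean logistic loss with separate costs for the two error types.

    fn_cost scales the penalty on positive examples (missed positives,
    i.e. type II errors); fp_cost scales the penalty on negative examples
    (false alarms, i.e. type I errors).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -fn_cost * math.log(p)
        else:
            total += -fp_cost * math.log(1 - p)
    return total / len(y_true)
```

With `fn_cost == fp_cost == 1` this reduces to the ordinary log loss; raising `fn_cost` makes under-predicting the positive class more expensive.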


Organize your data manipulation in terms of “grouped ordered apply”

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also unlikely ever to be the analyst’s end goal; rather, it is a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
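The linked article works in R; as a language-neutral illustration of the "group, order, then apply" pattern, here is a small sketch in plain Python. The helper name `grouped_rank` and the handful of rows are made up for this example, and ties receive consecutive ranks (row-number style) rather than shared ranks.

```python
from itertools import groupby
from operator import itemgetter

def grouped_rank(rows, group_key, order_key):
    """Return copies of rows with a 1-based 'rank' within each group."""
    # Sort by group, then by the ordering column, so that groupby sees
    # each group as one contiguous, already-ordered run.
    ordered = sorted(rows, key=itemgetter(group_key, order_key))
    ranked = []
    for _, members in groupby(ordered, key=itemgetter(group_key)):
        # The "apply" step: here it just numbers the rows in order.
        for rank, row in enumerate(members, start=1):
            ranked.append({**row, "rank": rank})
    return ranked

# Illustrative rows in the shape of the iris data (not the full data set).
rows = [
    {"Species": "setosa", "Sepal.Length": 5.1},
    {"Species": "setosa", "Sepal.Length": 4.9},
    {"Species": "virginica", "Sepal.Length": 6.3},
    {"Species": "setosa", "Sepal.Length": 4.7},
    {"Species": "virginica", "Sepal.Length": 5.8},
]

ranked = grouped_rank(rows, "Species", "Sepal.Length")
```

The inner loop is where any other per-group ordered computation (cumulative sums, lags, and so on) could be substituted for the rank.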


Basic Tree 2 Exercises

This is a continuation of the exercise Basic Tree 1.


Dynamically generated Shiny UI

It is not uncommon that the user interface of a Shiny application needs to be generated dynamically, based on data or program state. One typical use case that we encounter frequently is when the UI lets the user edit a variable number of records from a database. Imagine that you have an employee database, where each employee can be assigned multiple roles. Each role also has additional data, for example, the proportion of work time the employee is expected to perform that role or a comment field. In a relational database, you would store this information in a roles table, where each row corresponds to one of the role assignments of an employee. When writing a Shiny app to edit the database, it makes sense to edit all roles of an employee on the same page: add or delete roles, or modify existing ones. This requires generating the user interface (UI) of the app dynamically, based on the database.


Simultaneous intervals for smooths revisited

Eighteen months ago I wrote a post in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalised spline. It was a nice post that attracted some interest. It was also wrong. I have no idea what I was thinking when I thought the intervals described in that post were simultaneous. Here I hope to rectify that past mistake. I’ll tackle the issue of simultaneous intervals for the derivatives of a penalised spline in a follow-up post. Here, I demonstrate one way to compute a simultaneous interval for a penalised spline in a fitted GAM. As example data, I’ll use the strontium isotope data set included in the SemiPar package, which is extensively analyzed in the monograph Semiparametric Regression (Ruppert, Wand, and Carroll 2003). First, load the packages we’ll need as well as the data, which is data set fossil. If you don’t have SemiPar installed, install it using install.packages('SemiPar') before proceeding.