OneR – Establishing a New Baseline for Machine Learning Classification Models

The following story is one of the most often told in the Data Science community: some time ago the military built a system whose aim was to distinguish military vehicles from civilian ones. They chose a neural network approach and trained the system with pictures of tanks, humvees and missile launchers on the one hand, and normal cars, pickups and trucks on the other. After reaching a satisfactory accuracy they brought the system into the field (quite literally). It failed completely, performing no better than a coin toss. What had happened? No one knew, so they reverse-engineered the black box (no small feat in itself) and found that most of the military pics were taken at dusk or dawn, while most civilian pics were taken under brighter weather conditions. The neural net had learned the difference between light and dark!
(see also http://…fascinating-insights-through-simple-rules )
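The OneR baseline from the title is simple enough to sketch in a few lines: for each feature, map each of its values to the majority class, then keep the single feature whose rule makes the fewest training errors. This is a minimal illustrative Python sketch, not the post's code (which uses the R package OneR); all names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Fit a OneR classifier: for each feature, map each of its values to the
    majority class; keep the feature whose rule makes the fewest errors."""
    features = [f for f in rows[0] if f != target]
    best = None
    for feat in features:
        # count class frequencies per feature value
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[feat]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(n for v, c in by_value.items()
                     for cls, n in c.items() if cls != rule[v])
        if best is None or errors < best[2]:
            best = (feat, rule, errors)
    return best  # (feature, value -> class rule, training errors)

# toy data echoing the tank story: 'light' alone separates the classes
data = [
    {"light": "dark",   "wheels": 4, "cls": "military"},
    {"light": "dark",   "wheels": 6, "cls": "military"},
    {"light": "bright", "wheels": 4, "cls": "civilian"},
    {"light": "bright", "wheels": 6, "cls": "civilian"},
]
feat, rule, errs = one_r(data, "cls")
```

On this toy data OneR picks the `light` feature with zero training errors — exactly the shortcut the neural net in the story found.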

How to work with strings in base R – An overview of 20+ methods for daily use.

In this post in the R:case4base series we will look at string manipulation with base R and provide an overview of a wide range of functions for everyday string-processing needs.

Top R libraries for Data Science

Here, let me tell you something about some awesome libraries that R has. I consider these to be the top libraries for Data Science. They have a wide range of functions and are quite useful for Data Science operations. I’ve used them and still use them for most of my day-to-day Data Science work. Without wasting any further time, let me get you started with awesome R stuff.
1. Dplyr
2. Ggplot2
3. Esquisse
4. BioConductor
5. Shiny
6. Lubridate
7. Knitr
8. Mlr
9. Quanteda.dictionaries
10. DT
11. RCrawler
12. Caret
13. RMarkdown
14. Leaflet
15. Janitor
16. Other worth mentioning R libraries :
16.1. Ggvis
16.2. Plotly
16.3. Rcharts
16.4. Rbokeh
16.5. Broom
16.6. StringR
16.7. Magrittr
16.8. Slidify
16.9. Rvest
16.10. Future
16.11. RMySQL
16.12. RSQLite
16.13. Prophet
16.14. Glmnet
16.15. Text2Vec
16.16. SnowballC
16.17. Quantmod
16.18. Rstan
16.19. Swirl
16.20. DataScienceR

Estimating Distributions: Nonparametric

This article is part of my series that delves into how to optimize portfolio allocations. The series explains the mathematics and design principles behind my open-source portfolio optimization library OptimalPortfolio. The first article here dealt with finding market invariants. Having found them, we must now estimate the distribution of the market invariants in order to extract useful information from them.
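As a rough sketch of what nonparametric estimation looks like in practice (this standalone Python example is illustrative only and not taken from OptimalPortfolio), a Gaussian kernel density estimate places a small Gaussian bump on each observed invariant and averages them:

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps, one per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# toy "market invariant" samples, e.g. daily log-returns (made up for illustration)
samples = [-0.02, -0.01, 0.0, 0.0, 0.01, 0.015, 0.02]
f = gaussian_kde(samples, bandwidth=0.01)

# sanity check: the estimated density should integrate to roughly 1
mass = sum(f(i * 0.001) * 0.001 for i in range(-100, 100))
```

No parametric family (normal, Student-t, …) is assumed; the data alone shape the estimate, which is the point of the nonparametric approach. Bandwidth choice is the crucial tuning knob.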

Introduction to Power Analysis in Python

Nowadays many companies (Netflix, Amazon, Uber, but also smaller ones) constantly run experiments (A/B testing) in order to test new features and implement those which the users find best and which, in the end, lead to revenue growth. The data scientist’s role is to help evaluate these experiments; in other words, to verify that the results of these tests are reliable and can/should be used in the decision-making process. In this article I provide an introduction to power analysis. Briefly, power is used to report confidence in the conclusions drawn from the results of an experiment. It can also be used to estimate the sample size required for the experiment, i.e., a sample size at which, with a given level of confidence, we should be able to detect an effect. An effect can mean many things: for instance, more frequent conversion within a group, or a higher average spend among customers going through a certain signup flow in an online shop.
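To make the sample-size use of power analysis concrete, the required n per group for a two-sample comparison of means can be approximated from normal quantiles alone. This is a standard-formula sketch using only the standard library, not the article's code (which presumably uses a package such as statsmodels):

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided,
    two-sample test of means with standardized effect size d (Cohen's d):
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # ~0.84 for power = 0.80
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n = sample_size_per_group(effect_size=0.5)  # a "medium" effect
```

With a medium effect (d = 0.5), 5% significance and 80% power, this gives roughly 63 participants per group (the exact t-test answer is ~64); halving the effect size roughly quadruples the required sample.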

Planning by Dynamic Programming: Reinforcement Learning

In this blog post, I will explain how to evaluate and find optimal policies using Dynamic Programming. This series of blog posts contains a summary of the concepts explained in Introduction to Reinforcement Learning by David Silver.
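The core evaluation step is the iterative application of the Bellman expectation backup until the value function stops changing. A minimal sketch (not taken from the post or the lectures; the gridworld and names here are made up) on a 4-state chain under a uniform random policy:

```python
# Iterative policy evaluation on a 4-state chain: states 0..3, state 3 terminal.
# Actions -1 (left) and +1 (right) are deterministic, walls bounce back,
# and every step costs reward -1, so V(s) = -(expected steps to reach state 3).
GAMMA = 1.0
STATES = [0, 1, 2, 3]
TERMINAL = {3}

def step(s, a):
    return min(max(s + a, 0), 3)

def policy_evaluation(theta=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            # uniform random policy: probability 0.5 for each action
            v = sum(0.5 * (-1 + GAMMA * V[step(s, a)]) for a in (-1, +1))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:   # converged: largest update below threshold
            return V

V = policy_evaluation()
```

The fixed point satisfies the Bellman equations exactly: V = {0: -12, 1: -10, 2: -6, 3: 0}, i.e. a random walker starting at state 0 needs 12 steps on average to reach the terminal state.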

Natural Language Processing(NLP) for Machine Learning

In this article we’ll be learning about Natural Language Processing (NLP), which can help computers analyze text easily, e.g. detect spam emails or autocorrect words. We’ll see how NLP tasks are carried out for understanding human language.
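A typical first step in tasks like spam detection is turning raw text into countable features. A minimal, purely illustrative bag-of-words sketch (not the article's pipeline; the stopword list and example text are made up):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "to", "and", "you"}

def tokenize(text):
    """Lowercase, keep only word characters, drop common stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def bag_of_words(text):
    """Map a document to token counts, the input to a simple spam classifier."""
    return Counter(tokenize(text))

bow = bag_of_words("Click HERE to claim the prize! You won a FREE prize!")
```

A classifier such as naive Bayes then works on these counts; words like "free" and "prize" would end up with high spam weights.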

Correcting text input by machine translation and classification

Recently, I have been working on an Optical Character Recognition (OCR) related project. One of the challenges is that pre-trained OCR models output incorrect text. Besides the performance of the OCR model itself, image quality and layout alignment are other major sources of text errors.

Understanding Support Vector Machine: Part 1: Lagrange Multipliers

In this post, I will give an introduction to the Support Vector Machine classifier. This post is part of a series in which I will explain Support Vector Machines (SVM), including all the necessary minute details and the mathematics behind them. It will be easy, believe me! Without any delay let’s begin: suppose we’re given these two samples of blue stars and purple hearts (just a schematic representation; no real data are used here), and our job is to find the line that separates them best. What do we mean by best here?
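"Best" is usually made precise as the line maximizing the margin to the nearest points of either class. As a preview of where the Lagrange multipliers of the title come in, this is the standard hard-margin SVM formulation (general textbook form, not necessarily the post's notation):

```latex
% Maximum-margin (hard-margin) SVM primal problem for labels y_i \in \{-1,+1\}
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad\text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1,\quad i = 1,\dots,n

% Its Lagrangian, with one multiplier \alpha_i \ge 0 per constraint:
L(w, b, \alpha) \;=\; \tfrac{1}{2}\lVert w \rVert^{2}
  \;-\; \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^{\top}x_i + b) - 1 \,\right]
```

Setting the derivatives of L with respect to w and b to zero is what produces the dual problem in the α_i alone, which the series builds up to.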

How to keep your research projects organized, part 1: folder structure

All researchers are familiar with the importance of delivering a paper that is written in a clean and organized way. However, the same thing can often not be said about the way that we organize and maintain the code and data used in the backend (i.e. code and data layer) of a research project. This part of a project is usually not visible and good intentions to keep it organized tend to be one of the first things to fly out the window when a deadline is approaching. While understandable, I consider this to be an area with a lot of room for improvement. We spend the large majority of our time interacting with this part of the project and we can save ourselves a lot of time and frustration if we keep it clean and organized!


This is my renewed self-torturing attempt at learning machine learning. A couple of months ago, I earned a Machine Learning Engineer Nanodegree from the online nanodegree factory Udacity. "Nano" probably denotes that it is infinitesimally small and insignificant.

The Hundred-Page Machine Learning Book

This is the supporting wiki for the upcoming book The Hundred-Page Machine Learning Book by Andriy Burkov. The wiki contains pages that extend some book chapters with additional information: Q&A, code snippets, further reading, tools, and other relevant resources. This book is distributed on the ‘read first, buy later’ principle. I strongly believe that paying for content before consuming it is buying a pig in a poke. You can see and try a car in a dealership before you buy it. You can try on a shirt or a dress in a department store. You should be able to read a book before paying for it. The read first, buy later principle implies that you can freely download the book, read it and share it with your friends and colleagues. Only if you liked the book do you have to buy it.

A tutorial on tidy cross-validation with R

This blog post will use several packages from the {tidymodels} collection, namely {recipes}, {rsample} and {parsnip}, to train a random forest the tidy way. I will also use {mlrMBO} to tune the hyper-parameters of the random forest.

Self Learning AI-Agents IV: Stochastic Policy Gradients

Controlling Artificial Intelligence in continuous Action Spaces: From self-driving Cars to Robots.

Reinforcement Learning with Python

Reinforcement learning is a class of machine learning where an agent learns how to behave in an environment by performing actions, drawing intuitions from the results it observes. In this article, you’ll learn to understand and design a reinforcement learning problem and solve it in Python.
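To make the act-observe-learn loop concrete, here is a minimal Q-learning sketch on a toy chain environment. It is illustrative only; the article's own environment and code will differ, and everything here (environment, constants) is made up:

```python
import random

random.seed(0)

# Chain of 5 states; action 0 moves left, action 1 moves right.
# Reaching state 4 yields reward +1 and ends the episode.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s2 == GOAL else 0.0
    return s2, reward, s2 == GOAL

for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit, sometimes explore
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap off the best next action
        target = r + (0.0 if done else GAMMA * max(Q[s2]))
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

# greedy policy read off the learned Q-values
policy = [0 if q[0] > q[1] else 1 for q in Q[:GOAL]]
```

After training, the greedy policy moves right in every state, and the Q-values decay geometrically with distance from the goal (1.0, 0.9, 0.81, …), reflecting the discount factor.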

There are two very different ways to deploy ML models, here’s both

If an ML model makes a prediction in Jupyter, is anyone around to hear it? Probably not. Deploying models is the key to making them useful. This is not only true if you’re building a product, in which case deployment is a necessity; it also applies if you’re generating reports for management. Ten years ago it was unthinkable that execs wouldn’t question assumptions and plug their own numbers into an Excel sheet to see what changed. Today, a PDF of impenetrable matplotlib figures might impress junior VPs, but could well fuel ML skepticism in the eyes of experienced C-suite execs.

How I do data science

For the past 10 years, I’ve been working with businesses of all shapes and sizes. I’ve worked on problems that ran the gamut from simple to incredibly complex. All that time, I’ve been trying to extract a framework for achieving results in analytics. Of course, machine learning is an amorphous beast – it’s always growing and changing. And though my work has been resilient to systematisation, there is a methodology that I’ve adopted with recent clients that has proven itself useful, and I’m going to present it here. Before we get to that, though, I just want to highlight some guiding principles of my work which say a little more about why I do what I do, and what the usual settings are.