5 Questions which can teach you Multiple Regression (with R and Python)

A journey of a thousand miles begins with a single step. In much the same way, the journey of mastering machine learning algorithms ideally begins with regression. It is simple to understand and gets you started with predictive modeling quickly. While this ease is good for a beginner, I always advise beginners to also understand how regression works before they start using it.
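To make the idea concrete, here is a minimal sketch of multiple regression fit by ordinary least squares in Python, using only the standard library. The toy data and helper function are hypothetical illustrations, not the article's own code; the coefficients are found by solving the normal equations (X'X)β = X'y.

```python
# Minimal OLS sketch (hypothetical toy example): fit y = b0 + b1*x1 + b2*x2.

def ols_fit(X, y):
    """Fit y = X @ beta by solving the normal equations (X'X) beta = X'y."""
    n, p = len(X), len(X[0])
    # Build A = X'X and b = X'y.
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
         for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Solve A beta = b by Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, p))) / A[r][r]
    return beta

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2, so OLS recovers it.
X = [[1, x1, x2] for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]]
y = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in X]
beta = ols_fit(X, y)
```

In practice one would use `lm()` in R or scikit-learn in Python rather than hand-rolled elimination, but the fitted coefficients come from the same normal equations.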

Linear Mixed-effect Model Workflow

Linear mixed-effects models are becoming a common statistical tool for analyzing data with a multilevel structure. I will start by introducing the concept of multilevel modeling, where we will see that such models are a compromise between two extremes: complete pooling and no pooling. Then I will present a typical workflow for the analysis of multilevel data using the lme4 package in R.
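The two extremes, and the compromise between them, can be sketched numerically. This is a hypothetical Python illustration (the post itself uses lme4 in R): the shrinkage weight n / (n + k), with an arbitrary k, stands in for the variance-ratio weight a mixed model would actually estimate from the data.

```python
# Sketch of complete pooling vs. no pooling vs. partial pooling of group means.
from statistics import mean

# Hypothetical multilevel data: observations nested in three groups.
groups = {"g1": [5.0, 6.0, 7.0], "g2": [9.0, 11.0], "g3": [4.0]}

# Complete pooling: ignore groups entirely, one grand mean for everyone.
all_values = [v for vals in groups.values() for v in vals]
grand_mean = mean(all_values)

# No pooling: estimate each group separately, ignoring the other groups.
no_pool = {g: mean(vals) for g, vals in groups.items()}

# Partial pooling (what a mixed model does): shrink each group mean toward
# the grand mean, more strongly for small groups. The weight n / (n + k)
# with a made-up k = 2 stands in for the estimated variance ratio.
k = 2.0
partial = {g: (len(v) / (len(v) + k)) * mean(v)
              + (k / (len(v) + k)) * grand_mean
           for g, v in groups.items()}
```

Note how the single-observation group "g3" is pulled strongly toward the grand mean, while the larger groups barely move: that is the compromise the post describes.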

Deep learning resources

This page summarises the deep learning resources I’ve consulted in my album cover classification project.

Understanding Feature Space in Machine Learning – Data Science Pop-up Seattle

Machine learning derives mathematical models from raw data. In the model building process, raw data is first processed into ‘features,’ then the features are given to algorithms to train a model. The process of turning raw data into features is sometimes called feature engineering, and it is a crucial step in model building. Good features lead to successful models with strong predictive power; bad features lead to a lot of headaches and nowhere. This talk aims to help the audience understand what a feature space is and why it is so important. We will go through some common feature space representations of English text and discuss what tasks they are suited for and why. Expect lots of pictures, whiteboard drawings and handwaving. We will exercise our power of imagination to visualize high dimensional feature spaces in our mind’s eye. Presented by Alice Zheng, Director of Data Science at Dato.
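One of the most common feature-space representations of English text is the bag-of-words vector, which the sketch below builds with hypothetical documents: each distinct word becomes one dimension, and each document becomes a point in that space given by its word counts.

```python
# Bag-of-words sketch: turn raw text into points in a feature space.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]   # hypothetical corpus

# The vocabulary defines the feature space: one dimension per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    """Map a document to its coordinates in the feature space (word counts)."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [to_vector(d) for d in docs]
```

Even this tiny corpus yields a 6-dimensional space; a real corpus easily produces tens of thousands of dimensions, which is why visualizing feature spaces takes the imagination the talk promises.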

A New Method for Statistical Disclosure Limitation, I

The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study. In this posting, I’ll briefly explain what SDL is, and then describe a new method that Pat Tendick and I are proposing. Our paper is available as arxiv:1510.04406 and R code to implement the method is available on GitHub. See the paper for details.
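The Matloff–Tendick method itself is in the arXiv paper and its GitHub code; as a baseline illustration of the SDL trade-off the paragraph describes, here is the classic additive-noise approach, with entirely hypothetical data. Individual records are distorted (protecting privacy), while aggregate statistics on the masked data stay close to those on the original.

```python
# Classic additive-noise SDL baseline (not the paper's method): perturb each
# numeric value with zero-mean Gaussian noise.
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

def add_noise(values, scale):
    """Return a masked copy: each value plus zero-mean Gaussian noise."""
    return [v + random.gauss(0.0, scale) for v in values]

ages = [34, 45, 29, 61, 50, 38, 42, 55]   # hypothetical medical records
masked = add_noise(ages, scale=2.0)

# No masked record equals its original, yet the sample mean barely moves,
# so researchers can still do reasonable statistical analysis.
```

Choosing `scale` is the crux: larger noise means more privacy but worse analyses, which is exactly the tension SDL methods try to manage more cleverly.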

The 5th Tribe, Support Vector Machines and caret

In his new book, The Master Algorithm, Pedro Domingos takes on the heroic task of explaining machine learning to a wide audience and classifies machine learning practitioners into 5 tribes*, each with its own fundamental approach to learning problems. To the 5th tribe, the analogizers, Pedro ascribes the Support Vector Machine (SVM) as its master algorithm. Although the SVM has been a competitive and popular algorithm since its discovery in the 1990s, this might be the breakout moment for SVMs into pop culture. (What algorithm has a cooler name?) More people than ever will want to give them a try and face the challenge of tuning them. Fortunately, there is plenty of help out there and some good tools for the task.
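caret is the R tool for the tuning task; as a language-agnostic sketch of what is actually being tuned, here is a tiny linear SVM trained by sub-gradient descent on the hinge loss, with hypothetical toy data. The cost parameter C, which a caret grid search would vary, controls how heavily margin violations are penalized relative to the regularizer.

```python
# Toy linear SVM sketch: minimize 0.5*||w||^2 + C * sum of hinge losses.
def train_linear_svm(points, labels, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent for a 2-D linear SVM; labels are +1/-1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # Regularization gradient always applies; the hinge term only
            # contributes when the point is inside the margin (margin < 1).
            gw = [w[0], w[1]]
            gb = 0.0
            if margin < 1:
                gw[0] -= C * y * x1
                gw[1] -= C * y * x2
                gb -= C * y
            w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
            b -= lr * gb
    return w, b

# Hypothetical linearly separable data.
pts = [(0, 0), (1, 0), (3, 3), (4, 3)]
ys = [-1, -1, 1, 1]
w, b = train_linear_svm(pts, ys, C=1.0)
predict = lambda p: 1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1
```

Rerunning the fit over a grid of C values (and, for kernel SVMs, the kernel parameters) and cross-validating each candidate is precisely the chore that caret automates.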

Making Youden Plots in R

The data for a Youden plot is generated by providing a number of laboratories with aliquots from two separate unknown samples, which we will call A and B. Every lab analyzes both samples, and a scatter plot of the A and B results is generated: the A results on the x-axis and the B results on the y-axis. Once this is complete, limits of acceptability are plotted and outliers can be identified.
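The computation behind the plot can be sketched as follows (the post itself does this in R; the data here are hypothetical, and mean ± 3 SD is just one common choice of acceptability limits):

```python
# Youden plot limits sketch: paired (A, B) results, one per laboratory.
from statistics import mean, stdev

results = [(10.1, 20.2), (9.8, 19.9), (10.3, 20.5), (10.0, 20.1), (10.2, 20.0)]

a_vals = [a for a, _ in results]
b_vals = [b for _, b in results]

# Acceptability limits on each axis: mean +/- 3 standard deviations.
limits = {
    "A": (mean(a_vals) - 3 * stdev(a_vals), mean(a_vals) + 3 * stdev(a_vals)),
    "B": (mean(b_vals) - 3 * stdev(b_vals), mean(b_vals) + 3 * stdev(b_vals)),
}

def is_outlier(a, b):
    """Flag a lab whose A or B result falls outside the limits."""
    (a_lo, a_hi), (b_lo, b_hi) = limits["A"], limits["B"]
    return not (a_lo <= a <= a_hi and b_lo <= b <= b_hi)

outliers = [(a, b) for a, b in results if is_outlier(a, b)]
```

Plotting the points with these limit lines, as the post does in R, makes systematic bias visible too: labs whose A and B results are both high (or both low) drift along the diagonal rather than scattering around the center.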

Applied Machine Learning for the IoT – Data Science Pop-up Seattle

The Internet of Things is about data, not things. Some forecasts predict that by 2018 the number of connected things will exceed the combined number of personal computers, smartphones, and tablets. Each ‘thing’ can produce a tremendous stream of data from sensors and other sources. This presentation will discuss progress, examples, challenges, and opportunities with machine learning for the IoT, including a short segment on some recent applications of ML (using H2O) to the domains of machine prognostics / health management (PHM) and agriculture. Presented by Hank Roark, Data Scientist / Hacker at H2O.ai.