The Machine Learning Reproducibility Crisis

I was recently chatting to a friend whose startup’s machine learning models were so disorganized it was causing serious problems as his team tried to build on each other’s work and share it with clients. Even the original author sometimes couldn’t train the same model and get similar results! He was hoping that I had a solution I could recommend, but I had to admit that I struggle with the same problems in my own work. It’s hard to explain to people who haven’t worked with machine learning, but we’re still back in the dark ages when it comes to tracking changes and rebuilding models from scratch. It’s so bad it sometimes feels like stepping back in time to when we coded without source control.

Getting Value from Machine Learning Isn’t About Fancier Algorithms – It’s About Making It Easier to Use

Machine learning can drive tangible business value for a wide range of industries — but only if it is actually put to use. Despite the many machine learning discoveries being made by academics, new research papers showing what is possible, and an increasing amount of data available, companies are struggling to deploy machine learning to solve real business problems. In short, the gap for most companies isn’t that machine learning doesn’t work, but that they struggle to actually use it. How can companies close this execution gap? In a recent project we illustrated the principles of how to do it. We used machine learning to augment the power of seasoned professionals — in this case, project managers — by allowing them to make data-driven business decisions well in advance. And in doing so, we demonstrated that getting value from machine learning is less about cutting-edge models, and more about making deployment easier.

Introduction to k-Nearest-Neighbors

The k-Nearest-Neighbors (kNN) method of classification is one of the simplest methods in machine learning, and is a great way to introduce yourself to machine learning and classification in general. At its most basic level, it is essentially classification by finding the most similar data points in the training data, and making an educated guess based on their classifications. Although very simple to understand and implement, this method has seen wide application in many domains, such as in recommendation systems, semantic searching, and anomaly detection.

Default Priors for the Intercept Parameter in Logistic Regressions

In logistic regression, separation refers to the situation in which a linear combination of predictors perfectly discriminates the binary outcome. Because finite-valued maximum likelihood parameter estimates do not exist under separation, Bayesian regressions with informative shrinkage of the regression coefficients offer a suitable alternative. Little focus has been given on whether and how to shrink the intercept parameter. Based upon classical studies of separation, we argue that efficiency in estimating regression coefficients may vary with the intercept prior. We adapt alternative prior distributions for the intercept that downweight implausibly extreme regions of the parameter space rendering less sensitivity to separation. Through simulation and the analysis of exemplar datasets, we quantify differences across priors stratified by established statistics measuring the degree of separation. Relative to diffuse priors, our recommendations generally result in more efficient estimation of the regression coefficients themselves when the data are nearly separated. They are equally efficient in non-separated datasets, making them suitable for default use. Modest differences were observed with respect to out-of-sample discrimination. Our work also highlights the interplay between priors for the intercept and the regression coefficients: numerical results are more sensitive to the choice of intercept prior when using a weakly informative prior on the regression coefficients than an informative shrinkage prior.

R and Docker

If you regularly have to deal with specific versions of R, or different package combinations, or getting R set up to work with other databases or applications then, well, it can be a pain. You could dedicate a special machine for each configuration you need, I guess, but that’s expensive and impractical. You could set up virtual machines in the cloud which works well for one-off situations, but gets tedious having to re-configure a new VM each time. Or, you could use Docker containers, which were expressly designed to make it quick easy to configure and launch an independent and secure collection of software and services. If you’re new to the concept of Docker containers, here’s a docker tutorial for data scientists. But the concepts are pretty simple. At Docker hub, you can search ‘images’ – basically, bundles of software with pre-configured settings – contributed by the community and by vendors. (You’ll be referring to the images by name, for example: rocker/r-base.) You can then create a ‘container’ (a running instance of that image) on your machine with the docker application, or in the cloud using the tools offered by your provider of choice.

Regression Analysis Essentials For Machine Learning

Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x). Briefly, the goal of regression model is to build a mathematical equation that defines y as a function of the x variables. Next, this equation can be used to predict the outcome (y) on the basis of new values of the predictor variables (x).

What Comes After Deep Learning

We’re stuck. Or at least we’re plateaued. Can anyone remember the last time a year went by without a major notable advance in algorithms, chips, or data handling? It was so unusual to go to the Strata San Jose conference a few weeks ago and see no new eye catching developments. As I reported earlier, it seems we’ve hit maturity and now our major efforts are aimed at either making sure all our powerful new techniques work well together (converged platforms) or making a buck from those massive VC investments in same. I’m not the only one who noticed. Several attendees and exhibitors said very similar things to me. And just the other day I had a note from a team of well-regarded researchers who had been evaluating the relative merits of different advanced analytic platforms, and concluding there weren’t any differences worth reporting.

Automated front-end development using deep learning

SketchCode: Go from idea to HTML in 5 seconds

Engineering Data Science at Automattic

Most data scientists have to write code to analyze data or build products. While coding, data scientists act as software engineers. Adopting best practices from software engineering is key to ensuring the correctness, reproducibility, and maintainability of data science projects. This post describes some of our efforts in the area.

Multi-Class Text Classification with Scikit-Learn

There are lots of applications of text classification in the commercial world. For example, news stories are typically organized by topics; content or products are often tagged by categories; users can be classified into cohorts based on how they talk about a product or brand online.

Introducing udpipe for easy Natural Language Processing in R

Natural Language Processing (NLP) has been seen as one of the blackboxes of Data Analytics. The aim of this post is to introduce this simple-to-use but effective R package udpipe for NLP and Text Analytics. UDPipe?—?R package provides language-agnostic tokenization, tagging, lemmatization and dependency parsing of raw text, which is an essential part in natural language processing.

Learning Distributed Word Representations with Neural Network: an implementation from scratch in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by Prof. Geoffrey Hinton from the University of Toronto in 2012.

Blockchain Potential to Transform Artificial Intelligence

The research on improving Artificial Intelligence (A.I.) has been ongoing for decades. However, it wasn’t until recently that developers were finally able to create smart systems that closely resemble the A.I. capabilities of humans. The main reason for this breakthrough in technology is advancements in Big Data. Recent developments in Big Data have allowed us the capability to organize a very large amount of information into structured components that can be very quickly processed by computers. Another technology that has the potential for rapidly advancing and transforming Artificial Intelligence is the Blockchain. While some of the applications that have been developed on Blockchain are nothing more than ledger records of transactions, others are so incredibly smart that they almost appear like AI. Here, we will look more closely at the opportunities for A.I. advancement through the Blockchain protocol.