Scaling Machine Learning from 0 to millions of users, part 1

I suppose most Machine Learning (ML) models are conceived on a whiteboard or a napkin, and born on a laptop. As the fledgling creatures start babbling their first predictions, we’re filled with pride and high hopes for their future abilities. Alas, we know deep down in our heart that that not all of them will be successful, far from it. A small number fail us quickly as we build them. Others look promising, and demonstrate some level of predictive power. We are then faced with the grim challenge of deploying them in a production environment, where they’ll either prove their legendary valour or die an inglorious death.

Hard Problems in Data Science: Causality, Sequential Learning and Complex Dynamic Theories

In the second of four informal discussion sessions, Professor Maurits Kaptein from Tilburg University discussed the methodological challenges of data science. ”Before discussing the biggest challenges we face in Data Science, we need to first have some common understanding about what data science actually is.” Maurits starts his talk by attempting to define data science, and the current interest in the topic, by highlighting that over the past decades we have become better and better at learning mappings (or functions) from inputs to outputs: due to tremendous advances in AI and Machine Learning we can now learn very flexible mappings. Examples include models that take as input an image and output which person is in the image. Another example would be to take as input all the characteristics of a consumer and all the products in stock, and to output the expected monetary value of a customer when pitching a specific product. Our ability to learn very flexible input-output mappings and use these mappings in practical applications is at the core of what drives the recent interest in data science.

Ensembles of Convolutional Networks

Ensembles of models refers to the practice of combining predictions from multiple statistical models to form one final prediction. This opens up opportunity for diversity in the representational capacity of the model. This concept is analogous to anecdotes such as seeking the opinion of multiple doctors. Ensembles of models are a very common enhancement to traditional machine learning models, such as the upgrade of decision trees to random forests. In contrast to machine learning models, deep models take a long time to train and therefore it is not practical to form ensembles of deep models each trained from scratch. This article covers the novelties of a CNN ensemble design from Minetto et al. named ‘Hydra’. This design saves computation time by using a shared feature extractor network for models in the ensemble.

Manifold: A Model-Agnostic Visual Debugging Tool for Machine Learning at Uber

Machine learning (ML) is widely used across the Uber platform to support intelligent decision making and forecasting for features such as ETA prediction and fraud detection. For optimal results, we invest a lot of resources in developing accurate predictive ML models. In fact, it’s typical for practitioners to devote 20 percent of their effort into building initial working models, and 80 percent of their effort improving model performance in what is known as the 20/80 split rule of ML model development. Traditionally, when data scientists develop models, they evaluate each model candidate using summary scores like log loss, area under curve (AUC), and mean absolute error (MAE). Although these metrics offer insights into how a model is performing, they do not convey much information regarding why a model is not performing well, and from there, how to improve its performance. As such, model builders tend to rely on trial and error when determining how to improve their models. To make the model iteration process more informed and actionable, we developed Manifold, Uber’s in-house model-agnostic visualization tool for ML performance diagnosis and model debugging. Taking advantage of visual analytics techniques, Manifold allows ML practitioners to look beyond overall summary metrics to detect which subset of data a model is inaccurately predicting. Manifold also explains the potential cause of poor model performance by surfacing the feature distribution difference between better and worse-performing subsets of data. Moreover, it can display how several candidate models have different prediction accuracies for each subset of data, providing justification for advanced treatments such as model ensembling. In this article, we discuss Manifold’s algorithm and visualization design, as well as explain how Uber leverages the tool to gain insights into its models and improve model performance.

How Facebook Scales Machine Learning

Recently a consulting firm reached out to me for advice about building and scaling Artificial Intelligence (AI) and Machine Learning (ML) platforms for their customers. I have some experience in this space at the infrastructure level working with Intel and NVIDIA, and at the software and services level working with Amazon, IBM and a few others so I decided to help them out. This post covers some key focus areas when it comes to scaling AI/ML platforms for their customers. I figure it is best to approach this topic through household names with products and services that most understand and use on a daily basis so I chose Facebook and Uber. For Facebook, I’ll focus on the software and hardware considerations they made to successfully scale AI/ML infrastructure per an excellent talk given by Yangqing Jia, Facebook’s Director of AI Infrastructure, at the Scaled Machine Learning Conference. If you are interested in learning about how Uber successfully built their AI/ML platform, Michelangelo, and organizational changes they’ve made to ensure successfully AI/ML adoption then click here, to continue with Facebook’s AI/ML efforts keep reading. One last note, this discussion focuses on Facebook’s approach so it is given from a non-cloud perspective which changes some of the narrative as public clouds alter your approach, eliminating some options and making other choices significantly easier.

Fuzzy String Matching in Python

Have you ever wanted to compare strings that were referring to the same thing, but they were written slightly different, had typos or were misspelled? This is a problem that you can encounter in a variety of situations ranging from spelling check software to mapping databases that lack a common key (think of trying to join two tables by company name, and these appear differently in both tables). For this tutorial, I will focus on a case study in which the database problem mentioned above was addressed. However, before we start, it would be beneficial to show how we can fuzzy match strings.

Is Google Cloud Platform Ready to Run Your Data Analytics Pipeline?

Is Google Cloud Platform Ready to Run Your Data Analytics Pipeline? This is the focus of my latest research which published in Jan 2019. So, why did I decide to write on this topic? I am glad you asked. My journey in helping our customers with their technical queries started when I joined Gartner in late 2016. At that time, the burning topic was choosing the right database for burgeoning use cases. I spent the majority of my time helping clients decide which was the right Hadoop platform and which NoSQL / nonrelational data store to pick for specific use cases.

Statistical Thinking for Industrial Problem Solving – a free online course

This online course is available – for free – to anyone interested in building practical skills in using data to solve problems better.

A Comprehensive Survey on Graph Neural Networks

Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into different categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this fast-growing field.

Expressive motions recognition and analysis with learning and statistical methods

This paper proposes to recognize and analyze expressive gestures using a descriptive motion language, the Laban Movement Analysis (LMA) method. We extract body features based on LMA factors which describe both quantitative and qualitative aspects of human movement. In the direction of our study, a dataset of 5 gestures performed with 4 emotions is created using the motion capture Xsens. We used two different approaches for emotions analysis and recognition. The first one is based on a machine learning method, the Random Decision Forest. The second approach is based on the human’s perception. We derive the most important features for each expressed emotion using the same methods, the RDF and the human’s ratings. We compared the results obtained from the automatic learning method against human perception in the discussion section.

Perfume Recommendations using Natural Language Processing

Natural Language Processing(NLP) has many intriguing applications to Recommender Systems and Information Retrieval. As a perfume lover and a Data Scientist, the unusual and highly descriptive language used in the niche perfume community inspired me to use NLP to create a model to help me discover perfumes I might want to purchase. ‘Niche’ perfumes are rare perfumes created by small, boutique perfume houses. Similar to wine, there is a whole subculture surrounding niche perfumes that comes with its own poetic vocabulary perfect for NLP!

Using Mode definitions (and dbt) to bootstrap your data model

At Landed, we track a variety of customer journeys through a variety of lifecycle stages across a variety of products. To explore all this variety, our teams rely on a collection of analytics dashboards to help them anticipate the needs of our customers and partners. The questions our teams ask change over time, which requires a flexible and responsive practice for building and updating data models. We recently went through a process of upgrading our data infrastructure, and asked ourselves ‘how might we define an agile sprint that will take us from raw data to a BI Platform MVP in a couple of weeks?’

Introducing DoWhy

The human mind has a remarkable ability to associate causes with a specific event. From the outcome of an election to an object dropping on the floor, we are constantly associating chains of events that cause a specific effect. Neuropsychology refers to this cognitive ability as causal reasoning. Computer science and economics study a specific form of causal reasoning known as causal inference which focuses on exploring relationships between two observed variables. Over the years, machine learning has produced many methods for causal inference but they remain mostly difficult to use in mainstream applications. Recently, Microsoft Research open sourced DoWhy, a framework for causal thinking and analysis.

Error Analysis in Neural Networks

Error analysis is the analysis of error. He he! You don’t have to tell me that. In fact, the whole of error analysis is just as intuitive. But, people tend to miss some points in real projects. We can treat this as a kind of refresher that one can check out?-?when frustration makes us forget the basics. With rich libraries like Pytorch and Tensorflow, most of the machine learning algorithms are now available out of the box?-?just instantiate an object and train it with the data you have. You are ready to go!

Implementing multi-class text classification with Doc2Vec

In this post, you will learn how to classify text documents into different categories while using Doc2Vec to represent the documents. We will learn this with an easy to understand example of classifying the movie plots by genre using Doc2vec for feature representation and using logistic regression as a classification algorithm. The movie dataset contains short movie plot descriptions and the labels for them represents the genre.

International Program in Survey and Data Science (IPSDS)

We are pleased to announce the launch of the International Program in Survey and Data Science (IPSDS). Fundamental changes in the nature of data, their availability, the way in which they are collected, integrated, and disseminated are a big challenge for all those working with designed data from surveys as well as organic data. IPSDS was developed in response to the increasing demand from researchers and practitioners for the appropriate methods and right tools to face these changes. We offer a multidisciplinary curriculum, world-class faculty, and a web-based learning environment that allows you to take courses from anywhere in the world.


Explore your data with SQL. Easily create charts and dashboards, and share them with your team.


ClusterFuzz is a scalable fuzzing infrastructure which finds security and stability issues in software. It is used by Google for fuzzing the Chrome Browser, and serves as the fuzzing backend for OSS-Fuzz.