Activation Regularization for Reducing Generalization Error in Deep Learning Neural Networks

Deep learning models are capable of automatically learning a rich internal representation from raw input data. This is called feature or representation learning. Better learned representations, in turn, can lead to better insights into the domain, e.g. via visualization of learned features, and to better predictive models that make use of the learned features. A problem with learned features is that they can be too specialized to the training data, or overfit, and not generalize well to new examples. Large values in the learned representation can be a sign of the representation being overfit. Activity or representation regularization provides a technique to encourage the learned representations, the output or activation of the hidden layer or layers of the network, to stay small and sparse. In this post, you will discover activation regularization as a technique to improve the generalization of learned features in neural networks.
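As a rough illustration of the idea, the sketch below adds an L1 penalty on the hidden-layer activations to a loss value. The layer weights, the penalty coefficient, and the placeholder data loss are all made-up values for illustration, not taken from the post. (In Keras, a comparable penalty can be attached to a layer through its `activity_regularizer` argument.)

```python
def relu(x):
    """Rectified linear activation."""
    return max(0.0, x)


def hidden_activations(inputs, weights, biases):
    """Compute the ReLU activations of one dense hidden layer."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def l1_activity_penalty(activations, coeff):
    """L1 penalty: coeff times the sum of absolute activation values."""
    return coeff * sum(abs(a) for a in activations)


# Hypothetical toy layer: 2 inputs, 3 hidden units.
inputs = [1.0, 2.0]
weights = [[0.5, -0.25], [1.0, 1.0], [-1.0, 0.0]]
biases = [0.0, 0.5, 0.0]

acts = hidden_activations(inputs, weights, biases)
data_loss = 0.25  # placeholder for the usual prediction loss
penalty = l1_activity_penalty(acts, coeff=0.01)
total_loss = data_loss + penalty  # the optimizer would minimize this sum
```

Because the penalty grows with the magnitude of every activation, minimizing the total loss pushes the layer toward small, sparse outputs.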

Free ebook: An Introduction to Machine Learning Interpretability

Understanding and trusting models and their results is a hallmark of good science. This free ebook, courtesy of Dataiku, will show you how to apply machine learning interpretability for fairness, accountability, and transparency.

How to manage missing values in the longitudinal data with tidyverse

Imputation is a complex process that requires a good knowledge of your data. For example, it is crucial to know whether the data are missing at random before you impute them. I have read a nice tutorial that visualizes the missing data and helps to identify the type of missingness, and another post showing how to impute the data with the MICE package.

Advent of Data Science

Data Science related Advent Calendar

Overview of different types of data scientists

The dataset is the 2018 Kaggle data science survey. In this survey, Kagglers are asked various questions about what they do, their backgrounds, and their opinions on various data science topics. There were about 23,000 respondents. The survey data is interesting to mine, since it gives perspective on what people interested in data science are working on and what the current trends in data science are. In this analysis, I would like to group Kagglers into categories based on what they do at work, then find out more about the backgrounds of each category. I am interested in discovering how people got to where they are and what makes each category unique based on what they do at work.

Introducing dag-factory

Apache Airflow is ‘a platform to programmatically author, schedule, and monitor workflows.’ And it is currently having its moment. At DataEngConf NYC 2018, it seemed like every other talk was either about or mentioned Airflow. There have also been countless blog posts about how different companies are using the tool and it even has a podcast! A major use case for Airflow seems to be ETL or ELT or ETTL or whatever acronym we are using today for moving data in batches from production systems to data warehouses. This is a pattern that will typically be repeated in multiple pipelines. I recently gave a talk on how we are using Airflow at my day job and use YAML configs to make these repeated pipelines easy to write, update, and extend. This approach is inspired by Chris Riccomini’s seminal post on Airflow at WePay. Someone in the audience then asked ‘how are you going from YAML to Airflow DAGs?’ My response was that we had a Python file that parsed the configs and generated the DAGs. This answer didn’t seem to satisfy the audience member or myself. This pattern is very common in talks or posts about Airflow, but it seems like we are all writing the same logic independently. Thus the idea for dag-factory was born.
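The config-to-pipeline pattern described above can be sketched roughly as follows. This is not dag-factory's API and does not use Airflow: the config layout, the `depends_on` key, and the `build_pipelines` function are hypothetical, and a plain dictionary stands in for the parsed YAML file so that the example runs with the standard library alone (Python 3.9+ for `graphlib`).

```python
from graphlib import TopologicalSorter

# Stand-in for the result of parsing a YAML config file (hypothetical layout).
config = {
    "example_etl": {
        "schedule": "@daily",
        "tasks": {
            "extract": {"command": "echo extract", "depends_on": []},
            "transform": {"command": "echo transform", "depends_on": ["extract"]},
            "load": {"command": "echo load", "depends_on": ["transform"]},
        },
    }
}


def build_pipelines(config):
    """Turn each config entry into a schedule plus a valid task run order."""
    pipelines = {}
    for name, spec in config.items():
        # Map each task to the set of tasks it depends on.
        graph = {
            task: set(details.get("depends_on", []))
            for task, details in spec["tasks"].items()
        }
        # Resolve the dependency graph into an execution order.
        order = list(TopologicalSorter(graph).static_order())
        pipelines[name] = {"schedule": spec["schedule"], "order": order}
    return pipelines


pipelines = build_pipelines(config)
print(pipelines["example_etl"]["order"])
```

In a real setup, the loop body would create Airflow DAG and operator objects instead of a plain dictionary; that part is left out here.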

No time to read AI research? We summarized top 2018 papers for you

Trying to keep up with AI research papers can feel like an exercise in futility given how quickly the industry moves. If you’re buried in papers to read that you haven’t quite gotten around to, you’re in luck. To help you catch up, we’ve summarized 10 important AI research papers from 2018 to give you a broad overview of machine learning advancements this year. There are many more breakthrough papers worth reading as well, but we think this is a good list for you to start with. We’ve done our best to summarize these papers correctly, but if we’ve made any mistakes, please contact us to request a fix.

Simple Encrypted Arithmetic Library (SEAL)

Microsoft Simple Encrypted Arithmetic Library (Microsoft SEAL) is an easy-to-use homomorphic encryption library developed by researchers in the Cryptography Research group at Microsoft Research. SEAL is written in modern standard C++ and has no external dependencies, making it easy to compile and run in many different environments. For more information about the Microsoft SEAL project, see

Very Non-Standard Calling in R

Our group has done a lot of work with non-standard calling conventions in R. Our tools work hard to eliminate non-standard calling (as is the purpose of wrapr::let()), or at least make it cleaner and more controllable (as is done in the wrapr dot pipe). And even so, we still get surprised by some of the side-effects and mal-consequences of the over-use of non-standard calling conventions in R. Please read on for a recent example.

rTensor: An R Package for Multidimensional Array (Tensor) Unfolding, Multiplication, and Decomposition

rTensor is an R package designed to provide a common set of operations and decompositions for multidimensional arrays (tensors). We provide an S4 class that wraps around the base ‘array’ class and overloads operations familiar to users of ‘array’, and we provide additional functionality for tensor operations that are becoming more relevant in recent literature. We also provide a general unfolding operation, of which the k-mode unfolding and the matrix vectorization are special cases. Finally, package rTensor implements common tensor decompositions such as canonical polyadic decomposition, Tucker decomposition, multilinear principal component analysis, t-singular value decomposition, as well as related matrix-based algorithms such as generalized low rank approximation of matrices and population value decomposition.
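To make the k-mode unfolding concrete, here is a minimal pure-Python sketch. It is not rTensor code: the dictionary-based tensor storage and the `unfold` function are illustrative assumptions, and the column ordering follows the common Kolda and Bader convention (earlier remaining modes vary fastest), which should be checked against the rTensor documentation before relying on it.

```python
from itertools import product


def unfold(tensor, shape, mode):
    """Unfold a tensor (dict of index tuple -> value) along one mode.

    Rows are indexed by the chosen mode; the remaining modes are
    flattened into columns, with earlier modes varying fastest.
    """
    other_modes = [m for m in range(len(shape)) if m != mode]
    n_cols = 1
    for m in other_modes:
        n_cols *= shape[m]
    matrix = [[0] * n_cols for _ in range(shape[mode])]
    for idx in product(*(range(s) for s in shape)):
        col, stride = 0, 1
        for m in other_modes:
            col += idx[m] * stride
            stride *= shape[m]
        matrix[idx[mode]][col] = tensor[idx]
    return matrix


# A 2 x 2 x 2 tensor holding 1..8, filled with the first index fastest.
shape = (2, 2, 2)
tensor = {
    (i, j, k): 1 + i + 2 * j + 4 * k
    for i, j, k in product(range(2), repeat=3)
}

mode0 = unfold(tensor, shape, 0)
mode1 = unfold(tensor, shape, 1)
mode2 = unfold(tensor, shape, 2)
```

Each unfolding contains the same eight values; only the arrangement into rows and columns changes with the chosen mode.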

GARCH and a rudimentary application to Vol Trading

This post will review Kris Boudt’s DataCamp course and introduce some concepts from it, discuss GARCH, present an application of it to volatility trading strategies, and offer a somewhat more general review of DataCamp. So, recently, Kris Boudt, one of the highest-ranking individuals on the open-source R/Finance totem pole (contrary to popular belief, I am not the be-all end-all of coding R in finance…probably just one of the more visible individuals due to not needing to run a trading desk), taught a course on DataCamp on GARCH models.

AzureVM: managing virtual machines in Azure

This is the next article in my series on AzureR, a family of packages for working with Azure in R. I’ll give a short introduction on how to use AzureVM to manage Azure virtual machines, and in particular Data Science Virtual Machines (DSVMs).

Handling Imbalanced Datasets in Deep Learning

Not all data is perfect. In fact, you’ll be extremely lucky if you ever get a perfectly balanced real-world dataset. Most of the time, your data will have some level of class imbalance, which is when your classes have different numbers of examples.
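One common way to quantify and compensate for class imbalance is to weight each class inversely to its frequency. The sketch below is an illustration under assumptions, not necessarily the approach taken in the article: the labels are made up, and the formula `n_samples / (n_classes * count)` is the widely used "balanced" heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical imbalanced dataset: 8 examples of one class, 2 of the other.
labels = ["cat"] * 8 + ["dog"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so its examples contribute more to a weighted loss during training.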

One Recipe Step to Rule Them All

In this post I will demonstrate how my new R package customsteps can be used to create recipe steps that apply custom transformations to a data set. Note that you should already be fairly familiar with the recipes package before you continue reading this post or give customsteps a spin!

Where do Mayors Come From?: Querying Wikidata with Python and SPARQL

Querying Wikidata with Python and SPARQL: Imagine you wanted a list of all sons of painters who were also painters, or you wanted to know which mayor was born the furthest away. What would otherwise have taken a lot of tedious research is now possible with a simple SPARQL query on Wikidata. In this article, you will see how to build queries for Wikidata with Python and SPARQL by looking at where mayors in Europe were born.
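A query of this kind might be assembled as in the sketch below. It is a simplified assumption rather than the article's query: it asks for cities (`wd:Q515`), their head of government (`wdt:P6`), and that person's place of birth (`wdt:P19`), without restricting results to Europe. The code only builds the request URL for the public Wikidata endpoint and does not send it.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Cities, their head of government, and that person's place of birth.
query = """
SELECT ?cityLabel ?mayorLabel ?birthPlaceLabel WHERE {
  ?city wdt:P31 wd:Q515 ;
        wdt:P6 ?mayor .
  ?mayor wdt:P19 ?birthPlace .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

# Encode the query as URL parameters and ask for JSON results.
request_url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
print(request_url)
```

Fetching `request_url` with an HTTP client (and a descriptive User-Agent header) would return the matching rows as JSON.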

Sidewalk Labs

Sidewalk Labs is designing a district in Toronto’s Eastern Waterfront to tackle the challenges of urban growth, working in partnership with the tri-government agency Waterfront Toronto and the local community. This joint effort, called Sidewalk Toronto, aims to make Toronto the global hub for urban innovation.

Simple Deployment of Web App + ML Model + APIs – Tutorial

Learn how to deploy a web app served by a Logistic Regression Model. We will create an app to predict the probability of a Data Scientist earning more than USD 100k per year. What is the point of creating an ML model if you don’t put it into production? Calculating a model’s metrics and tuning it for better performance is NOT the end of a data scientist’s job. You can do much more than that. In fact, deploying a scalable model is much easier than it looks. And knowing how to do it will certainly add great value to your resume, because most data scientists don’t go that far. You can think of this tutorial as the first step toward becoming a data science unicorn.

How to deploy Jupyter notebooks as components of a Kubeflow ML pipeline (Part 2)

In Part 1, I showed you how to create and deploy a Kubeflow ML pipeline using Docker components. In Part 2, I will show you how to make a Jupyter notebook a component of a Kubeflow ML pipeline. Where the Docker components are for the folks operationalizing machine learning models, being able to run a Jupyter notebook on arbitrary hardware is more suitable for data scientists.

Cloud Deep Learning VM Image

Provision a VM quickly and effortlessly, with everything you need to get your deep learning project started on Google Cloud. Cloud Deep Learning VM Image Beta makes it easy and fast to instantiate a VM image containing the most popular deep learning and machine learning frameworks on a Google Compute Engine instance. You can launch Compute Engine instances pre-installed with popular ML frameworks like TensorFlow, PyTorch, or scikit-learn. You can also add Cloud TPU and GPU support with a single click. You can either instantiate the image using the Google Cloud Platform (GCP) Cloud Marketplace UI or through Cloud SDK from the command line.


Kubeflow

The Machine Learning Toolkit for Kubernetes. The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

HMTL – Multi-task Learning for solving NLP Tasks

The field of Natural Language Processing includes dozens of tasks, among them machine translation, named-entity recognition, and entity detection. While the different NLP tasks are often trained and evaluated separately, there exists a potential advantage in combining them into one model, i.e., learning one task might be helpful in learning another task and improve its results. Hierarchical Multi-Task Learning model (HMTL) provides an approach to learn different NLP tasks by training on the ‘simple’ tasks first, and using the knowledge to train on more complicated tasks. The model presents state-of-the-art performance in several tasks and an in-depth analysis of the importance of each part of the model, from different aspects of the word embeddings to the order of the tasks.

Animating regression models in R using broom and ggplot2

My first article on Towards Data Science was the result of a little exercise I set myself to keep the little grey cells ticking over. This is something of a similar exercise, albeit a bit more relevant to a project I’ve been working on. As I spend my time working in a marketing department, I have to get used to wearing [at least] two hats. Often, these hats are mutually exclusive, and sometimes they disagree with each other. In this case, the disagreement is in the form of another piece of animated data visualisation. As with the animated Scottish rugby champions graph, this example doesn’t really benefit from adding the animation as another dimension to the plot.

Training Convolutional Neural Networks to Categorize Clothing with PyTorch

Images are stored in computers as matrices of numbers where each number represents the colour of a pixel. For black and white images, there will only be one number per pixel to represent its darkness, while for coloured images, there will be 3 numbers for the RGB channels (one for red, one for green, one for blue). Each number represents the intensity of that colour and can range between 0 and 255. So, how can we teach a computer to tell us if an image is a shoe or a t-shirt if all it can ‘see’ are numbers? The process of computers analyzing and understanding images is called computer vision, a subset of machine learning, and convolutional neural networks are the best algorithms for the task.
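The representation described above can be sketched with plain nested lists. The tiny 2 x 2 images and the `to_grayscale` helper are made up for illustration, and averaging the three channels is a simplifying assumption (real libraries typically use weighted channel sums).

```python
# A 2 x 2 grayscale image: one intensity value (0-255) per pixel.
gray_image = [
    [0, 128],
    [255, 64],
]

# A 2 x 2 colour image: one (red, green, blue) triple per pixel.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def to_grayscale(image):
    """Collapse each RGB pixel to one number by averaging its channels."""
    return [[sum(pixel) // 3 for pixel in row] for row in image]


converted = to_grayscale(rgb_image)
print(converted)
```

A model such as a convolutional neural network receives exactly these grids of numbers as input, one grid per colour channel.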

Decision Tree Classification of Diabetes among the Pima Indian Community in R using mlr

In my last post I conducted EDA on the Pima Indians dataset to get it ready for a suite of Machine Learning techniques. My second post will explore just that. Taking one step at a time, each of my upcoming posts will cover one Machine Learning technique, with an in-depth (as in-depth as I can get) look at how to apply it. Let’s start with Decision Trees!

Machine Learning’s Security Layer, an Overview

This is a shallow overview of the security of machine learning systems. Within a few scrolls we’ll go through:
• Adversarial Example
• Model Theft
• Dataset Poisoning
• Dataset Protection

Deep Transfer Learning for Natural Language Processing: Text Classification with Universal Embeddings

Transfer learning is an exciting concept where we try to leverage prior knowledge from one domain and task in a different domain and task. The inspiration comes from us humans, who have an inherent ability to not learn everything from scratch. We transfer and leverage what we have learnt in the past to tackle a wide variety of tasks. With computer vision, we have excellent big datasets available to us, like ImageNet, on which we get a suite of world-class, state-of-the-art pre-trained models to leverage for transfer learning. But what about Natural Language Processing? Therein lies an inherent challenge, considering that text data is so diverse, noisy and unstructured. We’ve had some recent successes with word embeddings, including methods like Word2Vec, GloVe and FastText, all of which I have covered in my article on ‘Feature Engineering for Text Data’.