Data Science Beyond the Hype – What kind of skills you really need?

Data science is reaching the end of its hype cycle and the skills for becoming a data scientist are changing. Besides of fiddling about the best performant machine learning model, it is more important than ever to make data science with business impact. In an Analytics & Data Science Meetup by DataIku in Berlin (Germany) I discussed the current skill set that help define and build up a data science workflow delivering real value. The slides of my talk you can find here.

Exploring Neural Networking with Kotlin Metaprogramming: a Cheaper Alternative for Deep Learning?

A case study improving upon the traditional machine learning model using abstract compositional contracts.

Transfer Learning for Time Series Prediction

LSTM Recurrent Neural Networks turn out to be a good choice for time series prediction task, however the algorithm relies on the assumption that we have sufficient training and testing data coming from the same distribution. The challenge is that time-series data usually exhibit time-varying characteristic, which may lead to a wide variability between old and new data. In this blog we want to test to which extent transfer learning and general domain training of models help tackle the above-mentioned problem.

Preventing Discriminatory Outcomes in Credit Models

Machine learning is being deployed to do large-scale decision making, which can strongly impact the life of individuals. By not considering and analysing such scenarios, we may end up building models that fail to treat societies equally and even infringe anti-discrimination laws. There are several algorithmic interventions to identify unfair treatment based on what is considered to be fair. In this article, we will visit these and explain their benefits and limitations with a case study.

Visualising stocks correlations with Networkx

In a recent class of Network Analytics, we were asked to visualise correlations between stocks. As an investor, you’re interested in diversifying risk by selecting different types of them. You might therefore want visualise which stocks behave similarly (positive correlations) or very differently (negative correlations). Using a dataset with the prices of selected stocks over time, we’ll create a correlation matrix that we’ll visualise with Networkx.

Optimising your R code – a guided example

Optimising your R code is not always the priority. But when you run out of memory, or it just takes too long, you start to wonder if there are better ways to do things! In this blog post, I will show you my way of optimising my R code and the process behind it. As an example, I will use the code from my last blog post about gas prices in Germany – if you haven’t read it yet, click here! There are different steps to optimise your code. From debugging and profiling over benchmarking to rethinking the whole method. You have to repeat these steps until you are satisfied with the result.

An Overview of Python’s Datatable package

Python library for efficient multi-threaded data processing, with the support for out-of-memory datasets. If you are an R user, chances are that you have already been using the data.table package. Data.table is an extension of the data.frame package in R. It’s also the go-to package for R users when it comes to the fast aggregation of large data (including 1

The Data Science Methodology

The relevance of a Doc-string to a function is to guide future users (including future-self), by specifying the right parameters and use cases of the function… So is The Data Science Methodology to data scientists. The Data Science Methodology is an iterative system of methods that guides data scientists on the ideal approach to solving problems with data science, through a prescribed sequence of steps.

How to Get Deterministic word2vec/doc2vec/paragraph Vectors

OK, welcome to our Word Embedding Series. This post is the first story of the series. You may find this story is suitable for the intermediate or above, who has trained or at least tried once on word2vec, or doc2vec/paragraph vectors. But no worries, I will introduce background, prerequisites and knowledge and how the code implements it from papers in the following posts. I will try my best to do not redirect you to some other links that ask you to read tedious tutorials and end with giving up (trust me, I am the victim of the tremendous online tutorials 🙂 ). I want you to understand word vectors from the coding level together with me so that we can know how to design and implement our word embedding and language model.

Simplifying Google AI’s Best Paper at ICML 2019 on Unsupervised Learning

There are only a handful of machine learning conferences in the world that attract the top brains in this field. One such conference, which I am an avid follower of, is the International Conference on Machine Learning (ICML). Folks from top machine learning research companies, like Google AI, Facebook, Uber, etc. come together and present their latest research. It’s a conference any data scientist would not want to miss.

Off-Policy Classification – A New Reinforcement Learning Model Selection Method

Off-Policy Classification – A New Reinforcement Learning Model Selection Method Wednesday, June 19, 2019 Posted by Alex Irpan, Software Engineer, Robotics at Google Reinforcement learning (RL) is a framework that lets agents learn decision making from experience. One of the many variants of RL is off-policy RL, where an agent is trained using a combination of data collected by other agents (off-policy data) and data it collects itself to learn generalizable skills like robotic walking and grasping. In contrast, fully off-policy RL is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot. With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one. However, fully off-policy RL comes with a catch: while training can occur without a real robot, evaluation of the models cannot. Furthermore, ground-truth evaluation with a physical robot is too inefficient to test promising approaches that require evaluating a large number of models, such as automated architecture search with AutoML. This challenge motivates off-policy evaluation (OPE), techniques for studying the quality of new agents using data from other agents. With rankings from OPE, we can selectively test only the most promising models on real-world robots, significantly scaling experimentation with the same fixed real robot budget.

What are model governance and model operations?

Our surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration – no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts.

A Gentle Introduction to tidymodels

Recently, I had the opportunity to showcase tidymodels in workshops and talks. Because of my vantage point as a user, I figured it would be valuable to share what I have learned so far. Let’s begin by framing where tidymodels fits in our analysis projects. The diagram above is based on the R for Data Science book, by Wickham and Grolemund. The version in this article illustrates what step each package covers. Even though it is a single step, developing models can benefit from having a tidyverse-friendly interface. That is where tidymodels comes in. It is important to clarify that the group of packages that make up tidymodels do not implement statistical models themselves. Instead, they focus on making all the tasks around fitting the model much easier. Those tasks are data pre-processing and results validation.

An Intuitive Explanation to Dropout

Training neural networks is tricky. One should be careful that his model is good enough to learn from existing data, and good enough to generalize to unseen data. The lack to generalize a model is mainly because of a problem called overfitting. In simple words, overfitting means that the model achieves very high accuracy on the initial training data and very low accuracy on newly unseen data. It is like when a teacher always gives the same questions in their exams. Their students would easily get very high grades, because they simply memorize the answers. So, the high grades are not a good metric here. A dangerous consequence is that the students will not even bother to learn the subject. There are many techniques to target overfitting, and dropout is one of them. In this article, we will discover what is the intuition behind dropout, how it is used in neural networks, and finally how to implement it in Keras.

In this article I want to explore with you how you can create a containerized data science environment or whatever other systems you might want, that you can quickly deploy to any machine running Docker, may it be your laptop or a cloud computer. The tool I want to demonstrate to you for that purpose is Docker-Compose, an addition to Docker to build and run applications made from multiple containers. The example system I want to build with you in this article will be comprised of three components: a Jupyter Notebook server to conduct experiments in, an MLflow Tracking Server to record and organize experiment parameters and metrics, and a Postgres Database as the backend for the MLflow server and as a handy datastore for your structured datasets. I mostly aim to give you an idea of Docker-Compose and how to use it and will assume that you have at least a basic understanding of Docker or maybe a first idea what it is used for and how it works. If not, let’s take a quick look at why you should bother with yet another technology.

Visualizing Intermediate Activations of a CNN trained on the MNIST Dataset

In this article, we’re going to not only train a Convolutional Neural Network using Keras with Python to classify handwritten digits, but also visualize the intermediate activations of the Convolutional Neural Network, in order to gain insights about what features of the image does each layer learn. We will be using the MNIST dataset which can be found here on Kaggle. This dataset contains 42000 rows in the training set and 24000 rows in the testing set. Each row contains 784 pixel values signifying a 28 x 28 image containing handwritten singular digits from 0-9. Let’s dive into the code.