The Simple Yet Practical Data Cleaning Codes

In one of my articles, My First Data Scientist Internship, I talked about how crucial data cleaning (also called data preprocessing or data munging) is and how it can easily occupy 40%-70% of the whole data science workflow. The world is imperfect, and so is data.

4 Reasons Why Your Machine Learning Code is Probably Bad

Your current workflow probably chains several functions together like in the example below. While quick, it likely has many problems:
• it doesn’t scale well as you add complexity
• you have to manually keep track of which functions were run with which parameter
• you have to manually keep track of where data is saved
• it’s difficult for others to read
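
As a hypothetical illustration of the chained style described above (the function names, data and `scale` parameter are invented for this sketch and are not taken from the article), a typical quick workflow might look like this:

```python
import csv
import io

# Tiny in-memory stand-in for a raw data file (hypothetical data).
RAW = "x,y\n1,2\n2,\n3,6\n"

def load(text):
    # Parse CSV text into a list of dict rows.
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    # Drop rows with a missing target value.
    return [r for r in rows if r["y"] != ""]

def add_features(rows, scale):
    # Convert to floats and rescale the input column.
    return [{"x": float(r["x"]) * scale, "y": float(r["y"])} for r in rows]

def fit_slope(rows):
    # Least-squares slope of a line through the origin.
    num = sum(r["x"] * r["y"] for r in rows)
    den = sum(r["x"] ** 2 for r in rows)
    return num / den

# Everything is chained in one expression: nothing records that scale=10 was
# used, no intermediate result is saved, and each new step deepens the nesting.
slope = fit_slope(add_features(clean(load(RAW)), scale=10))
print(slope)
```

The single nested call is fast to write, but the parameter choice and the intermediate data exist only inside that one line, which is where the tracking and readability problems listed above come from.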

Reflections on the State of AI: 2018

Every day, multiple news items discussing various things ‘related to AI’ pile up in our mailboxes, and there are dozens, if not hundreds, of articles, opinions and overviews published every week that have the term ‘AI’ in their titles. However, not everything claimed as related to artificial intelligence is actually meaningful or relevant, and much of it has nothing to do with the field at all. Aiming to democratize the knowledge of machine learning, neural networks and other segments of AI, we’ve decided to launch our efforts by creating a set of focused articles that cover the latest advancements in the field, take a look at the key players, and provide some insights into the most promising technology, as well as both the opportunities and dilemmas the industry is facing today. In this first article, we provide a concise overview of the key developments we saw in 2018, segmented by key contributors, applications and challenges.

Applications of Graph Neural Networks

Graphs and their study have long received a lot of attention because of their ability to represent the real world in a way that can be analysed objectively. Indeed, graphs can be used to represent many useful real-world datasets such as social networks, web link data, molecular structures, geographical maps, etc. Apart from these cases, which have a natural structure to them, non-structured data such as images and text can also be modelled as graphs in order to perform graph analysis on them. Due to the expressiveness of graphs and a tremendous increase in available computational power in recent times, a good amount of attention has been directed towards the machine learning way of analysing graphs, i.e. Graph Neural Networks.

Features that Maximizes Mutual Information, How do they Look Like?

Thanks to recent developments, such as MINE and DeepInfoMax, we are not only able to estimate the mutual information of two high-dimensional random variables, but also able to create latent features that maximize mutual information with the input data. Personally, I think the idea of creating meaningful latent variables somewhat relates to the talk Dr. Bengio gave. Simply put, a better representation of data (a.k.a. abstract representation) is beneficial for machine learning models. This also relates to the idea that the objective function we want to minimize/maximize lives not in pixel space but in information space. For this post, I am going to focus on visualizing the latent variables created by maximizing mutual information, rather than on choosing the ‘best’ representation that could be used for any downstream task.

Simple Introduction to Convolutional Neural Networks

In this article, I will explain the concept of convolutional neural networks (CNNs) using many swan pictures and will make the case for using CNNs over regular multilayer perceptron neural networks for processing images.

Robust Regressions: Dealing with Outliers in R

It is often the case that a dataset contains significant outliers – or observations that are significantly out of range from the majority of other observations in our dataset. Let us see how we can use robust regressions to deal with this issue. I described in another tutorial how we can run a linear regression in R. However, this does not account for the outliers in our data. So, how can we solve this?

You Need to Start Branding Your Graphs. Here’s How, with ggplot!

In today’s post I want to help you incorporate your company’s branding into your ggplot graphs. Why should you care about this? I’m glad you asked!

Financial Machine Learning Part 0: Bars

Recently, I got my copy of Advances in Financial Machine Learning by Marcos Lopez de Prado. Lopez de Prado is a renowned quant researcher who has managed billions throughout his career. The book is an amazing resource to anyone interested in data science and finance, and it offers valuable insights into how advanced predictive techniques are applied to financial problems. This post is the first of a series dedicated to applying the approaches introduced by Lopez de Prado to real (and occasionally, synthetic) datasets. My hope is that by writing these posts I can solidify my understanding of the material and share some lessons learned along the way. Without further ado, let’s proceed to the main subject of this post: bars.

Automating Scientific Data Analysis Part 2

The first step in automating scientific data analysis is ensuring that your programs are able to understand the data contained in each file. There are three main challenges associated with this step:
• First, the data from the test must be structured in a way that lets the program identify the important portion of the data. Many tests will have warm-up periods, conditioning periods, and/or transient periods before getting to the part of the test with usable data. To analyze that data set, there must be a consistent signal the program can use to identify the correct period to analyze. This article will present several potential solutions you can use to overcome this challenge.
• Second, the program must have a way of identifying the conditions of each test. The conditions are vital for both performing calculations to make sense of the data, and storing the results in an appropriate location. This article will present a few different ways you can instruct your program to obtain the conditions of each test.
• Third, the data for each test must have its own unique file for the previous suggestions to work. Sometimes data files from an entire project come in a single file, making it much harder to identify the conditions of each test. This article will provide two solutions to help you split these files apart, creating a separate data file for each test.
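
The blurb does not say which solutions the article uses, so the following is only a minimal sketch of one possible approach to the three challenges above. It assumes a combined log in which a hypothetical `test_id` column names each test and a hypothetical `phase` column marks warm-up versus steady-state readings; the column names and data are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical combined log: one row per reading, all tests in one file.
COMBINED = """test_id,phase,flow_lpm,temp_c
T1,warmup,0.0,20.1
T1,steady,5.0,45.2
T1,steady,5.1,45.4
T2,warmup,0.0,20.3
T2,steady,8.0,50.0
"""

def split_by_test(text, id_column="test_id"):
    # Group rows by the column that identifies the test conditions.
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row[id_column]].append(row)
    return dict(groups)

def usable_rows(rows, phase_column="phase", keep="steady"):
    # Keep only the readings flagged as the usable part of the test.
    return [r for r in rows if r[phase_column] == keep]

def to_csv(rows):
    # Serialize one test's rows back to CSV text.
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

per_test = {tid: usable_rows(rows) for tid, rows in split_by_test(COMBINED).items()}

# One CSV text per test; writing each value to disk would give one file per test.
files = {f"{tid}.csv": to_csv(rows) for tid, rows in per_test.items()}
for name in sorted(files):
    print(name, len(per_test[name[:-4]]), "usable rows")
```

Here the explicit `phase` flag plays the role of the consistent signal, and the `test_id` value doubles as the record of each test's conditions. Real data sets often lack such columns, in which case the signal would have to be derived from the measurements themselves.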

Human-Like Machine Hearing With AI (3/3)

This is the last part of my article series on ‘Human-Like’ Machine Hearing: Modeling parts of human hearing to do audio signal processing with AI.
This last part of the series will provide:
• A concluding summary of the key ideas.
• Results from empirical testing.
• Related work and future perspectives.

Preserving Memory in Stationary Time Series

Many predictive models require a certain consistency of time series called stationarity. The usual transformation, namely integer-order differencing (in finance, e.g. modelling returns instead of absolute prices), eliminates memory in the data and hence reduces the predictive power of the model. This article outlines how fractional calculus makes it possible to retain more information and to better balance stationarity and meaningful memory. In general, we understand a given time series as a sample generated by a stochastic process whose distribution and statistics we try to infer for a predictive model. Building predictive models of stochastic processes is about finding a balance between specificity and generalisability of the samples: the model interprets a given series against the backdrop of general patterns. More specific than a general predictive regression, a time series comes with inherent ordering due to its temporal structure: any given instance reflects a history of values traversed in the past, the specific memory of its past track record.
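
To make the contrast with integer-order differencing concrete, here is a minimal sketch of fractional differencing based on the standard binomial expansion of (1 - B)^d, where B is the backshift operator. The article's own implementation is not shown in this summary, so the function names, the fixed-length window and the sample prices are assumptions made for illustration.

```python
def frac_diff_weights(d, n_weights):
    # Coefficients of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights

def frac_diff(series, d, n_weights):
    # Apply the truncated weights over a fixed-length window of past values.
    weights = frac_diff_weights(d, n_weights)
    out = []
    for t in range(n_weights - 1, len(series)):
        out.append(sum(w * series[t - k] for k, w in enumerate(weights)))
    return out

prices = [100.0, 101.0, 103.0, 102.0, 105.0, 107.0]  # hypothetical prices

# d = 1 reproduces ordinary first differences: weights are [1, -1].
first_diff = frac_diff(prices, d=1.0, n_weights=2)

# d = 0.5 keeps slowly decaying weights on older observations.
half_diff = frac_diff(prices, d=0.5, n_weights=4)

print(frac_diff_weights(1.0, 3))
print(frac_diff_weights(0.5, 4))
```

With d = 1 every weight beyond the first lag is zero, so each output depends only on the latest change. With a fractional d the older observations keep small non-zero weights, which is the sense in which memory is preserved.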

Learning Explanatory Rules from Noisy Data

Artificial Neural Networks are powerful function approximators capable of modelling solutions to a wide variety of problems, both supervised and unsupervised. As their size and expressivity increases, so too does the variance of the model, yielding a nearly ubiquitous overfitting problem. Although mitigated by a variety of model regularisation methods, the common cure is to seek large amounts of training data (which is not necessarily easily obtained) that sufficiently approximates the data distribution of the domain we wish to test on. In contrast, logic programming methods such as Inductive Logic Programming offer an extremely data-efficient process by which models can be trained to reason on symbolic domains. However, these methods are unable to deal with the variety of domains neural networks can be applied to: they are not robust to noise in or mislabelling of inputs, and perhaps more importantly, cannot be applied to non-symbolic domains where the data is ambiguous, such as operating on raw pixels. In this paper, we propose a Differentiable Inductive Logic framework, which can not only solve tasks which traditional ILP systems are suited for, but shows a robustness to noise and error in the training data which ILP cannot cope with. Furthermore, as it is trained by backpropagation against a likelihood objective, it can be hybridised by connecting it with neural networks over ambiguous data in order to be applied to domains which ILP cannot address, while providing data efficiency and generalisation beyond what neural networks on their own can achieve.

DeepMind Combines Logic and Neural Networks to Extract Rules from Noisy Data

In his book ‘The Master Algorithm’, artificial intelligence researcher Pedro Domingos explores the idea of a single algorithm that can combine the major schools of machine learning. The idea is, without a doubt, extremely ambitious, but we are already seeing some iterations of it. Last year, Google published a research paper under the catchy title of ‘One Model to Learn Them All’ that combines heterogeneous learning techniques under a single machine learning model. Last year, Alphabet’s subsidiary DeepMind took another step towards multi-model algorithms by introducing a new technique called Differentiable Inductive Logic Programming (∂ILP) that combines logic and neural networks into a single model to extract rules from noisy data.

Data Structure Evaluation to Choose the Optimal Machine Learning Method

There is no single ML method. To choose the one that suits your purposes, you as a developer need to understand the nature of the data that will be used in the project. In this post, I’ll share my experience with machine learning system development, describing the steps of choosing the optimum prediction model.

How to use Google Speech to Text API to transcribe long audio files?

Speech recognition is a fun task. A lot of API resources are available on the market today, which makes it easy for users to opt for one or another. However, when it comes to audio files, especially call center data, the task becomes a little challenging. Let’s assume that a call center conversation takes roughly 10 minutes. For this scenario, only a few API resources available on the market can handle this type of data (Google, Amazon, IBM, Microsoft, Nuance, open-source Wavenet, open-source CMU Sphinx). In this article, we will talk about the Google Speech-to-Text API in detail.