The Power of A/B Testing

A/B testing involves multiple concepts that can be difficult to wrestle with, especially in a high-pressure environment such as a job interview. So I decided to write this post to clear things up. It begins with a short technical review of power, followed by two example problems and extensive visualizations that illustrate its interaction with other variables. This is not an introductory article on statistical testing, to which I have very little to contribute. But if you’re an analyst or data scientist rusty on basic concepts, or a product manager who wants to acquire some technical savvy on the subject, this is a good review (part of the reason I’m writing the post is to make a visual cheatsheet that I can quickly refer to later on).
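
As a rough illustration of what "power" means here (this sketch is not taken from the post itself), the snippet below approximates the power of a two-sided two-proportion z-test under the normal approximation. The function name `two_proportion_power`, its parameters, and the use of an unpooled standard error are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Hypothetical helper for illustration: uses the normal approximation
    with an unpooled standard error and equal group sizes.
    """
    std_normal = NormalDist()
    # Standard error of the difference in observed conversion rates.
    se = sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    # Critical value for a two-sided test at significance level alpha.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # True effect expressed in standard-error units.
    shift = abs(p_b - p_a) / se
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (1000, 5000):
        print(n, round(two_proportion_power(0.10, 0.12, n), 3))
```

Under these assumptions, power rises with sample size and with the size of the true difference, and it falls back to `alpha` when the two rates are equal.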


Quality Control with Machine Learning

Quality Control is an important step in every production system. Many business investments aim to reinforce this process in order to guarantee higher-performance products. In recent years, Machine Learning solutions have played a key role in these investments thanks to their ability to adapt easily to every context and the great results achieved. In this article I present an AI solution for Quality Control in a standard production unit, framed as a classification problem. Following a very interesting approach, I try to achieve the best possible performance, giving a visual explanation of the results and taking into account useful human insights. I want to underline this last topic because human insights are often underestimated in Machine Learning! It’s no surprise that they make it possible to achieve the best performance and adopt the smartest solutions.


Feature Engineering in SQL and Python: A Hybrid Approach

I knew SQL long before learning about Pandas, and I was intrigued by the way Pandas faithfully emulates SQL. Stereotypically, SQL is for analysts, who crunch data into informative reports, whereas Python is for data scientists, who use data to build (and overfit) models. Although they are almost functionally equivalent, I’d argue both tools are essential for a data scientist to work efficiently.


Time Series Feature Extraction for industrial big data (IIoT) applications

Feature extraction remains one of the most preliminary steps in machine learning algorithms, used to identify strongly and weakly relevant attributes. While many feature extraction algorithms are used during Feature Engineering for standard classification and regression problems, the problem becomes increasingly difficult for time series classification and regression problems, where each label or regression target is associated with several time series and meta-information simultaneously. Trust me, such scenarios are quite common with huge datasets obtained from heavy industrial manufacturing equipment, machinery, and IoT devices, which often undergo maintenance or exhibit production line optimization, demonstrating different success and failure metrics in different time series.


Build Python for Data Science in Just a Few Clicks

There is only one Python distro that lets you add new versions of packages, remove unused packages, and rebuild in minutes. Yes, for free. Download ActiveState Python 3.6 build now.


Scalable Log Analytics with Apache Spark – A Comprehensive Case-Study

One of the most popular and effective enterprise case-studies that leverage analytics today is log analytics. Almost every organization today, small or big, has multiple systems and infrastructure running day in and day out. To effectively keep their business running, organizations need to know if their infrastructure is performing to its maximum potential. This involves analyzing system and application logs and maybe even applying predictive analytics to log data. The amount of log data is typically massive, depending on the type of organizational infrastructure and the applications running on it. Gone are the days when we were limited to analyzing a sample of data on a single machine due to compute constraints.


Review: Residual Attention Network – Attention-Aware Features (Image Classification)

In this story, Residual Attention Network, by SenseTime, Tsinghua University, Chinese University of Hong Kong (CUHK), and Beijing University of Posts and Telecommunications, is reviewed. Multiple attention modules are stacked to generate attention-aware features. Attention residual learning is used for very deep networks. This is a 2017 CVPR paper with over 200 citations. (Sik-Ho Tsang @ Medium)


How to Automate Tasks on GitHub With Machine Learning for Fun and Profit

A tutorial on how to build a GitHub App that predicts and applies issue labels using Tensorflow and public datasets.


Predict Bitcoin prices by using Signature time series modelling

First, I would like to give a short introduction to the Signature method. According to Wikipedia, a rough path is a generalization of the notion of smooth path allowing to construct a robust solution theory for controlled differential equations driven by classically irregular signals, for example, a Wiener process. The theory was developed in the 1990s by Terry Lyons. The aim of the mathematics is to describe a smooth but potentially highly oscillatory and multidimensional path X effectively. The Signature is a homomorphism from the monoid of paths (under concatenation) into the group-like elements of the free tensor algebra. It provides a graduated summary of path X. Here is a formal maths definition of the Signature transformation from A Primer on the Signature Method in Machine Learning.
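
To make the idea of a "graduated summary" a little more concrete, here is a minimal sketch (not the article's code) that computes only the first two levels of the signature of a piecewise-linear path, combining segments with Chen's identity. The function name `signature_level2` and the list-of-points input format are assumptions made for this example.

```python
def signature_level2(path):
    """Return levels 1 and 2 of the signature of a piecewise-linear path.

    Hypothetical helper for illustration. `path` is a sequence of points,
    each a sequence of `dim` floats. Level 1 is the total increment;
    level 2 is a dim x dim matrix of iterated integrals.
    """
    dim = len(path[0])
    s1 = [0.0] * dim
    s2 = [[0.0] * dim for _ in range(dim)]
    for prev, cur in zip(path, path[1:]):
        # Increment of the current linear segment.
        d = [c - p for p, c in zip(prev, cur)]
        # Chen's identity: new level 2 = old level 2 + (old level 1 (x) d)
        # + level 2 of the segment itself, which is (d (x) d) / 2.
        for i in range(dim):
            for j in range(dim):
                s2[i][j] += s1[i] * d[j] + d[i] * d[j] / 2
        # Level 1 is updated only after level 2 has used its old value.
        for i in range(dim):
            s1[i] += d[i]
    return s1, s2


if __name__ == "__main__":
    level1, level2 = signature_level2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
    print(level1, level2)
```

The antisymmetric part of the level-2 matrix is the Lévy area of the path, which is one reason these terms capture the order in which the coordinates move, something the total increment alone cannot.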


Meme Text Generation with a Deep Convolutional Network in Keras & Tensorflow

The goal of this post is to describe end-to-end how to build a deep conv net for text generation, but in greater depth than some of the existing articles I’ve read. This will be a practical guide and while I suggest many best practices, I am not an expert in deep learning theory nor have I read every single relevant research paper. I’ll cover takeaways about data cleaning, training, model design, and prediction algorithms.