You Don’t Know SVD (Singular Value Decomposition)

For it’s disappointing that almost every tutorial of SVD makes it more complicated than necessary, while the core idea is most simple. Since mathematics is just the art of assigning different names to the same concept, SVD is nothing more than decomposing vectors onto orthogonal axes?-?we just decided it may need a more deluxe name. Let’s see how this is the case.

Learning from Simulated Data

One of the biggest problems plaguing Deep Neural Networks is that training them requires massive labeled datasets. Some researchers are trying to overcome this limitation with research into areas such as Meta-Learning and few-shot learning. However, for the time being, massive datasets are needed for success in tasks such as object detection.

How To Harness The Power Of Speech With Artificial Intelligence

Have you ever heard of the Turing test? Turing test is a test designed by Alan Turing in 1950 to see whether a machine can exhibit conversational behaviours that are indistinguishable from that of a human. When Alan Turing first developed the Turing Test, he strictly subjected the experiment to be a text-only conversation rather than a full speech conversation. Why? Well, he didn’t want the crappy robotic voice of the 1950s to give away the clue behind ‘who’s the robot?’ test. But the text-only rule stayed there for many years and decades to come since the test was first conceived. Many computer Scientists believed that we were actually closer to building a fully conversational agent that acts just like humans than building a realistic text-to-speech system.

The Anatomy of K-means

A complete guide to K-means clustering algorithm. Let’s say you want to classify hundreds (or thousands) of documents based on their content and topics, or you wish to group together different images for some reason. Or what’s even more, let’s think you have that same data already classified but you want to challenge that labeling. You want to know if that data categorization makes sense or not, or can be improved. Well, my advice is that you cluster your data. Information is often darkened by noise and redundancy, and grouping data into clusters (clustering) with similar features is an efficient way to bring some light on.

Implementing QANet (Question Answering Network) with CNNs and self attentions

In this post, we will tackle one of the most challenging yet interesting problems in Natural Language Processing, aka Question Answering. We will implement Google’s QANet in Tensorflow. Just like its machine translation counterpart Transformer network, QANet doesn’t use RNNs at all which makes it faster to train / test. I’m assuming that you already have some knowledge of python and Tensorflow.

Running your R script in Docker

Since its release in 2014, Docker has become an essential tool for deploying applications. At STATWORX, R is part of our daily toolset. Clearly, many of us were thrilled to learn about RStudio’s Rocker Project, which makes containerizing R code easier than ever. Containerization is useful in a lot of different situations. To me, it is very helpful when I’m deploying R code in a cloud computing environment, where the coded workflow needs to be run on a regular schedule. Docker is a perfect fit for this task for two reasons: On the one hand, you can simply schedule a container to be started at your desired interval. On the other hand, you always know what behavior and what output to expect, because of the static nature of containers. So if you’re tasked with deploying a machine-learning model that should regularly make predictions, consider doing so with the help of Docker. This blog entry will guide you through the entire process of getting your R script to run in a Docker container one step at a time. For the sake of simplicity, we’ll be working with a local dataset.

A Minimalist Guide to Flume

Apache Flume is used to collect, aggregate and distribute large amounts of log data. It can operate in a distributed manor and has various fail-over and recovery mechanisms. I’ve found it most useful for collecting log lines from Kafka topics and grouping them together into files on HDFS. The project started in 2011 with some of the earliest commits coming from Jonathan Hsieh, Hari Shreedharan and Mike Percy, all of whom either currently, or at one point, worked for Cloudera. As of this writing the code base is made up of 95K lines of Java. The building blocks of any Flume agent’s configuration is one or more sources of data, one or more channels to transmit that data and one or more sinks to send the data to. Flume is event-driven, it’s not something you’d trigger on a scheduled basis. It runs continuously and reacts to new data being presented to it. This contrasts tools like Airflow which run scheduled batch operations. In this post I’ll walk through feeding Nginx web traffic logs into Kafka, enriching them using Python and feeding Flume those enriched records for storage on HDFS.

Democratisation of Usable Machine Learning in Computer Vision

Proposed benchmark criteria to form the basis of literacy in usable ML. We present a working set of benchmark criteria that could serve as data science literacy or certification for novice users of usable ML tools with the intention of mitigating irresponsible deployment of ML algorithms in computer vision.
1. Supervised ML
2. Accountability
3. Transparency
4. Data provenance
5. Algorithmic bias
6. Measurement bias
7. Accuracy vs. fairness
8. Automation bias
9. Class imbalance and prevalence
10. Overfitting
11. Concept drift
12. Data leakage
13. A confounder
14. ML performance metrics
15. Type 1 and type 2 errors

Data science is different now

What do you think of when you read the phrase ‘data science’? It’s probably some combination of keywords like statistics, machine learning, deep learning, and ‘sexiest job of the 21st century’. Or maybe it’s an image of a data scientist, sitting at her computer, putting together stunning visuals from well-run A/B tests. Either way, it’s glamorous, smart, and sophisticated. This is the narrative that data science has been selling since I entered the field almost ten years ago.


This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:
• Prepare Data: Preparing and loading data for each recommender algorithm
• Model: Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (ALS) or eXtreme Deep Factorization Machines (xDeepFM).
• Evaluate: Evaluating algorithms with offline metrics
• Model Select and Optimize: Tuning and optimizing hyperparameters for recommender models
• Operationalize: Operationalizing models in a production environment on Azure
Several utilities are provided in reco_utils to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.

Computing Extremely Accurate Quantiles Using t-Digests

We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1-q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy. An open-source Java implementation of this algorithm is available from the author. Independent implementations in Go and Python are also available.

The GraphTech Ecosystem 2019 – Part 1: Graph Databases

This post is part of a series of 3 articles about the GraphTech ecosystem. This article is the first part and covers the graph database landscape. The second part is about graph analytics frameworks and the third part will list the existing graph visualization tools. Back in 2014, we wrote about the graph ecosystem, which we then described as ’emerging’. In 5 years, we’ve seen the trend become stronger, the stakeholders multiply and the ecosystem take a more assertive form. Now is a good time to take look at the state of the graph ecosystem in 2019, starting with graph databases.

The GraphTech Ecosystem 2019 – Part 2: Graph Analytics

Our second layer is another back-end layer: graph analytics frameworks. They consist of a set of tools and methods developed to extract knowledge from data modeled as a graph. They are crucial for many applications because processing large datasets of complex connected data is computationally challenging.

Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters

The year 2018 marked a turning point for the field of Natural Language Processing, with a series of deep-learning models achieving state-of-the-art results on NLP tasks ranging from question answering to sentiment classification. Most recently, Google’s BERT algorithm has emerged as a sort of ‘one model to rule them all,’ based on its superior performance over a wide variety of tasks. BERT builds on two key ideas that have been responsible for many of the recent advances in NLP: (1) the transformer architecture and (2) unsupervised pre-training. The transformer is a sequence model that forgoes the sequential structure of RNN’s for a fully attention-based approach, as described in the classic Attention Is All You Need. BERT is also pre-trained; its weights are learned in advance through two unsupervised tasks: masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another). Thus BERT doesn’t need to be trained from scratch for each new task; rather, its weights are fine-tuned. For more details about BERT, check out the The Illustrated Bert.

Acoustic Noise Cancellation by Machine Learning

In this post I describe how I built an active noise cancellation system by means of neural networks on my own. I’ve just got my first results which I am sharing, but the system looks like a ravel of scripts, binaries, wires, soundcard, microphone and headphones, so I am not going to publish any sources yet. May be later.