MachinaNova: A Personal News Recommendation Engine

A few months ago, I purchased a subscription to a tool called Paper.li, an application I could add to my website. Its promise was to ‘collect relevant content and deliver it wherever you want’ – which is exactly what I need during my work morning. Let me explain… I love my morning routine. It consists of grabbing breakfast and coffee and sitting down to make my way through a jungle of digital news and content. My goal: emerge on the other side of the news jungle with a tidbit of information I didn’t already have, and hopefully with some of the coffee and bagel I plan to eat landing in my mouth. To be real, I click literally dozens of bookmarks, scan headlines and eventually give up, surrendering to clicking fatigue. Every morning, I end up on the tried-and-true Harvard Business Review… because they’re just great every time. If Paper.li does what it says it can do, it will cut time off my routine, and I’ll finally be able to enjoy my bagel instead of shoving it in my face. I NEEDED THIS SOLUTION!


Using Docker and Kubernetes to Host Machine Learning Models

Docker is a great tool for deploying ML models in the cloud. If you want to set up a production-grade deployment in the cloud, there are a number of options across AWS and GCP. In Chapter 4 of my in-progress book, I focus on ECS for serving models, but also explore Google Kubernetes Engine towards the end. This post omits the AWS section on containers for deployment, but includes the remaining sections covered in the chapter.
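As a rough illustration of the kind of service such a container typically wraps, here is a minimal sketch of a prediction endpoint built with Flask; the model file name (model.pkl) and the JSON payload layout are hypothetical, not taken from the chapter.

```python
# Minimal model-serving endpoint one might package into a Docker image.
# Assumes a scikit-learn classifier pickled to model.pkl (hypothetical name).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload like {"features": [0.1, 2.3, ...]}
    payload = request.get_json()
    score = model.predict_proba([payload["features"]])[0][1]
    return jsonify({"prediction": float(score)})

if __name__ == "__main__":
    # Bind to all interfaces so the app is reachable from outside the container.
    app.run(host="0.0.0.0", port=5000)
```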


Making Text Ascent

I’ve often found myself reading an article, say on data science, and wondering: where can I read something simpler on this topic? I realized I wasn’t the only one when a friend posted a similar question on LinkedIn. She asked how to find articles that fall within a specific range between the simplest and the most complex. I realized we don’t have an easy system for that type of search besides manually reading for a good fit.


Kalman Filter (1) – The Basics

I was trying to learn about the Kalman filter, a way to combine your guesses with some uncertain measurements to make a better estimate, and found there were no easy-to-understand introductions out there. But later on, I came across this course, which introduces the idea from the very fundamentals. So in this post, I will follow the structure of the course and give a brief introduction to the basics of self-driving car localisation, which is also the starting point of the Kalman filter.
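To make the “combine a guess with an uncertain measurement” idea concrete, here is a minimal one-dimensional sketch (my own toy example, not code from the course): each step blends the prediction and the measurement, trusting whichever has the lower variance more heavily.

```python
# Toy 1-D Kalman filter: track a position from noisy measurements.
def predict(mean, var, motion, motion_var):
    # Moving shifts the estimate and adds uncertainty.
    return mean + motion, var + motion_var

def update(mean, var, measurement, measurement_var):
    # Kalman gain: how much to trust the new measurement vs. the prediction.
    gain = var / (var + measurement_var)
    new_mean = mean + gain * (measurement - mean)
    new_var = (1 - gain) * var
    return new_mean, new_var

mean, var = 0.0, 1000.0            # start with a very uncertain guess
measurements = [5.0, 6.1, 7.2, 8.0]
for z in measurements:
    mean, var = predict(mean, var, motion=1.0, motion_var=2.0)
    mean, var = update(mean, var, z, measurement_var=4.0)
    print(f"estimate: {mean:.2f} (variance {var:.2f})")
```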


Intuition: Exploration vs Exploitation

The intuition behind the trade-off and common solutions. The exploration-exploitation trade-off is a well-known problem that occurs in scenarios where a learning system has to repeatedly make a choice with uncertain pay-offs. In essence, the dilemma for a decision-making system that has only incomplete knowledge of the world is whether to repeat decisions that have worked well so far (exploit) or to make novel decisions, hoping to gain even greater rewards (explore). This is highly relevant in reinforcement learning, but also in many other applications, such as recommendation systems and online advertising. In this article, I give an overview of three simple and proven strategies for tackling the exploration-exploitation trade-off in multi-armed bandits.
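One of the simplest strategies of this kind is epsilon-greedy (I am not assuming it is one of the three the article covers, though it commonly appears in such overviews): with probability epsilon pull a random arm (explore), otherwise pull the arm with the best average reward so far (exploit). A minimal sketch with made-up pay-offs:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]           # hidden pay-off probability of each arm
counts = np.zeros(len(true_means))     # pulls per arm
values = np.zeros(len(true_means))     # running average reward per arm
epsilon = 0.1

for step in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(len(true_means)))   # explore: random arm
    else:
        arm = int(np.argmax(values))               # exploit: best arm so far
    reward = rng.random() < true_means[arm]        # Bernoulli pay-off
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print("estimated arm values:", np.round(values, 2))
```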


Demystifying Generative Models by Generating Passwords – Part 2

Understand the differences between the Naive Bayes model and Variational Autoencoders (VAEs) in generative tasks. Hello, once again: this is the second part of the ‘Demystifying Generative Models’ posts, so if you haven’t read Part 1 yet, I really urge you to do so here. In the previous post, we discussed the differences between discriminative and generative models, took a peek into the fascinating world of probabilities and used that knowledge to develop a working Naive Bayes model that generates passwords for us. Now, we will change our methodology a little and explore how Deep Learning can help us when probabilities fail.
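For readers who skipped Part 1, here is a toy sketch of the general flavour of a Naive Bayes-style generator (my own simplification with made-up data, not the article’s code): treat the password length as the “class”, learn per-position character frequencies conditioned on it, and sample each position independently.

```python
from collections import Counter, defaultdict
import random

# Tiny illustrative corpus; a real experiment would use a large password list.
corpus = ["hunter2", "letmein", "dragon1", "monkey7", "abc1234"]

length_counts = Counter(len(p) for p in corpus)
# char_counts[(length, position)] -> Counter of characters seen at that position
char_counts = defaultdict(Counter)
for pw in corpus:
    for i, ch in enumerate(pw):
        char_counts[(len(pw), i)][ch] += 1

def sample_password():
    # Naive Bayes generative story: sample the "class" (length),
    # then each character independently given that class.
    length = random.choices(list(length_counts), weights=length_counts.values())[0]
    chars = []
    for i in range(length):
        counter = char_counts[(length, i)]
        chars.append(random.choices(list(counter), weights=counter.values())[0])
    return "".join(chars)

print(sample_password())
```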


Prediction and Inference with Boosted Trees and Shapley Values

As a data science aficionado, perhaps you’ve come across the old adage: ‘models with high accuracy are not interpretable, and models with high interpretability cannot attain high accuracy’. While that may be true in general, there are certain cases, and certain clever tricks one can pull out of one’s data science magic hat, that yield both reasonably high accuracy and interpretability of the input factors’ effect on the output. Specifically, we are going to discuss binary classification (though the principles can apply to 1-D regression as well) using an ensemble of GBDT models, along with the Shapley values associated with each model in the ensemble, to collect population statistics for both the predictions and the factor interpretations, statistics we can then aggregate using techniques such as averaging and confidence intervals. This method takes inspiration from the umpteen Kaggle competitions where the top-ranked models are pretty much always some ensemble of boosted decision tree models, and from the need to infer the effect of input factors on the output probability because your boss asked you to. But even if your boss didn’t ask, inference and interpretation of input factors give the data scientist powerful tools to analyze possibly non-linear and non-obvious trends and patterns in the dataset as they relate to some outcome.
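As a rough sketch of the recipe (my own minimal version on synthetic data, not the article’s code): train several GBDT models on bootstrap resamples, compute Shapley values for each with the shap package, then aggregate a per-feature importance and a simple spread estimate across the ensemble.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data for the illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

rng = np.random.default_rng(0)
per_model_shap = []
for seed in range(5):                        # small ensemble of GBDTs
    idx = rng.integers(0, len(X), len(X))    # bootstrap resample
    model = GradientBoostingClassifier(random_state=seed).fit(X[idx], y[idx])
    explainer = shap.TreeExplainer(model)
    per_model_shap.append(explainer.shap_values(X))   # (n_samples, n_features)

stacked = np.stack(per_model_shap)                  # (n_models, n_samples, n_features)
per_model_importance = np.abs(stacked).mean(axis=1) # mean |SHAP| per model and feature
importance = per_model_importance.mean(axis=0)      # average across the ensemble
spread = per_model_importance.std(axis=0)           # variability across the ensemble
for j, (m, s) in enumerate(zip(importance, spread)):
    print(f"feature {j}: mean |SHAP| {m:.3f} +/- {s:.3f}")
```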


Gesture Recognition Toolkit (GRT)

The Gesture Recognition Toolkit (GRT) is a cross-platform, open-source, C++ machine learning library designed for real-time gesture recognition.


Decisions

A deep dive into different frameworks and considerations when making decisions.


tidync: scientific array data from NetCDF in R

In May 2019, version 0.2.0 of tidync was approved by rOpenSci and accepted to CRAN. Here we provide a quick overview of the typical workflow, with some pseudo-code for the main functions in tidync. This overview is enough to read if you just want to try out the package on your own data. The tidync package is focussed on efficient data extraction for developing your own software, and this somewhat long post takes the time to explain the concepts in detail. There is a section about the NetCDF data model itself. Then there is a detailed illustration of a raster data set in R, including some of the challenges faced by R users. This is followed by sections on how tidync sees metadata and coordinates in NetCDF, how we can slice a dataset, and how to control the format of the output. We then discuss some limitations and future work, and finally (most importantly) reflect on the rOpenSci process of package review.