Exploring Recommendation Systems

While we commonly associate recommendation systems with e-commerce, their application extends to any decision-making problem that requires pairing two types of things together. To understand why recommenders don’t always work as well as we’d like them to, we set out to build some basic recommendation systems using publicly available data.
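To make the pairing idea concrete, here is a minimal sketch of one of the basic approaches such a project might start from: item-based collaborative filtering on a toy user–item ratings matrix. The data and names below are invented for illustration, not taken from the post.

```r
# Minimal item-based recommender sketch (illustrative toy data): score an
# unseen item for a user via cosine similarity between item rating columns.
ratings <- matrix(
  c(5, 3, 0, 1,
    4, 0, 0, 1,
    1, 1, 0, 5,
    0, 1, 5, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), paste0("item", 1:4))
)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Item-item similarity matrix (a 0 rating is treated as "unrated").
sim <- outer(
  seq_len(ncol(ratings)), seq_len(ncol(ratings)),
  Vectorize(function(i, j) cosine(ratings[, i], ratings[, j]))
)

# Predicted score for user1 on item3: similarity-weighted average of the
# items user1 has already rated.
rated <- which(ratings["user1", ] > 0)
pred  <- sum(sim[3, rated] * ratings["user1", rated]) / sum(sim[3, rated])
pred
```

The same similarity-based scoring applies whenever the two “types of things” being paired can be arranged as the rows and columns of a matrix.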


Demystifying Information Security Using Data Science

When you search for security data science on the internet, it’s difficult to find resources with crisp and clear information about the use cases, methods and limitations in Information Security (hereafter referred to as InfoSec). There’s almost always some marketing material attached. So I decided to summarise my knowledge and InfoSec experience in this article.


Using Data? Master the Science in Data Science

1. Model Thinking – Understand the role and meaning of models
2. The Hypothesis – Deploy the power of hypothesis-led learning (see the sketch after this list)
3. The Data Generating Process – Know what it is that you seek to model
4. Searching for the Mechanism – The how and why of a model’s performance
5. Replicability, Reproducibility, Generalizability – Push for enduring impact
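As a minimal illustration of item 2, here is a sketch of hypothesis-led analysis in R, using simulated data invented for this example: the hypothesis is stated before looking at the data, then tested.

```r
# Hypothesis-led sketch on simulated data (illustrative only).
# H0: a new page layout does not increase time-on-page.
set.seed(42)
control   <- rnorm(200, mean = 60, sd = 15)  # seconds on the old layout
treatment <- rnorm(200, mean = 64, sd = 15)  # seconds on the new layout

# One-sided two-sample t-test of the pre-stated hypothesis.
t.test(treatment, control, alternative = "greater")
```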


Finding Similar Names with Matrix Factorization

Applying matrix factorization to user clicks on hundreds of names on the recommender system NamesILike.com reveals hidden structure in our first names.
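The post’s click data isn’t reproduced here, but the technique can be sketched in a few lines: factor a users-by-names click matrix with a truncated SVD, so that names clicked by similar users end up with similar latent profiles. The matrix and names below are invented for illustration.

```r
# Toy users-by-names click matrix (invented data): 1 = user clicked the name.
clicks <- matrix(
  c(1, 1, 0, 0, 1,
    1, 1, 1, 0, 0,
    0, 0, 1, 1, 0,
    0, 1, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("Emma", "Olivia", "Noah", "Liam", "Ava"))
)

# Rank-2 truncated SVD: the rows of V are latent-factor embeddings of names.
s <- svd(clicks, nu = 2, nv = 2)
name_factors <- s$v
rownames(name_factors) <- colnames(clicks)

# Names whose embeddings are close together were clicked by similar users.
dist(name_factors)
```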


Getting Started With MapD: Docker Install and Loading Data

It’s been nearly five years since I wrote about Getting Started with Hadoop for big data. In those years, there have been incremental improvements in columnar file formats and dramatic computation speed improvements with Apache Spark, but I still wouldn’t call the Hadoop ecosystem convenient for actual data analysis. During the same period, thanks to NVIDIA and their CUDA library for general-purpose calculations on GPUs, graphics cards went from enabling visuals on a computer to enabling massively parallel calculations as well. Building upon CUDA is MapD, an analytics platform that allows for super-fast SQL queries and interactive visualizations. In this blog post, I’ll show how to use Docker to install MapD Community Edition and load hourly electricity demand data for analysis.


Easily Converting Strings to Times and Dates in R with flipTime

Date conversion in R can be a real pain. However, it is a very important initial step when you first get your data into R, to ensure that it has the correct type (e.g. Date). This gives you the correct functionality for working with data of that type. R provides a number of handy features for working with date-time data. However, the sheer number of options/packages available can make things seem overwhelming at first. There are more than 10 packages providing support for working with date-time data in R. In this post, I will provide an introduction to the functionality R offers for converting strings to dates. In doing so, I discuss common pitfalls and give helpful tips to make working with dates in R less painful. Finally, I introduce some code that my colleagues and I wrote to make things a bit easier (the flipTime package).
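To give a flavour of the conversions the post covers, here is a short base-R sketch. The commented lines at the end assume flipTime exposes an AsDate() convenience helper that infers the format for you; treat that part as an assumption rather than tested output.

```r
dates_chr <- c("29/07/2017", "30/07/2017")

# Base R: you must spell out the format string yourself.
as.Date(dates_chr, format = "%d/%m/%Y")

# strptime handles times too, again with an explicit format.
strptime("29/07/2017 14:30", format = "%d/%m/%Y %H:%M")

# With flipTime (assumed API: an AsDate() helper that guesses the format):
# devtools::install_github("Displayr/flipTime")
# flipTime::AsDate(dates_chr)
```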


Avoid Overfitting with Regularization

Have you ever created a machine learning model that is perfect for the training samples but gives very bad predictions on unseen samples? Have you ever wondered why this happens? This article explains overfitting, one of the reasons for poor predictions on unseen samples, and walks through a regression-based regularization technique in simple steps to make it clear how to avoid overfitting. The focus of machine learning (ML) is to train an algorithm with training data in order to create a model that is able to make correct predictions for unseen data (test data). To create a classifier, for example, a human expert starts by collecting the data required to train the ML algorithm. The human is responsible for finding the best types of features to represent each class, ones capable of discriminating between the different classes. Such features are then used to train the ML algorithm. Suppose we are to build an ML model that classifies images as containing cats or not, given a set of training images.
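The post builds this up with its cat-image example; as a stand-in, here is a minimal ridge-regression sketch on invented toy data that shows the mechanism the article describes: an L2 penalty shrinks the coefficients of an over-flexible model.

```r
# Regularization sketch on toy data (illustrative, not the post's example):
# a 10th-degree polynomial fit overfits noisy samples of a smooth curve;
# an L2 (ridge) penalty shrinks its coefficients.
set.seed(1)
x <- seq(0, 1, length.out = 20)
y <- sin(2 * pi * x) + rnorm(20, sd = 0.3)  # noisy samples of a smooth curve
X <- cbind(1, poly(x, degree = 10))         # intercept + orthogonal poly terms

# Closed-form ridge estimate: beta = (X'X + lambda * I)^-1 X'y
ridge <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

beta_ols   <- ridge(X, y, lambda = 0)  # no penalty: chases the noise
beta_ridge <- ridge(X, y, lambda = 1)  # penalized: smaller, stabler weights
sum(beta_ols^2); sum(beta_ridge^2)     # the penalty shrinks the coefficient norm
```

With lambda = 0 the fit interpolates much of the noise; increasing lambda trades a little bias for a large drop in variance, which is exactly how regularization avoids overfitting.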


Google and Uber’s Best Practices for Deep Learning

There is more to building a sustainable Deep Learning solution than what is provided by Deep Learning frameworks like TensorFlow and PyTorch. These frameworks are good enough for research, but they don’t take into account the problems that crop up with production deployment. I’ve written previously about technical debt and the need for more adaptive, biologically inspired architectures. To support a viable business using Deep Learning, you absolutely need an architecture that supports sustainable improvement in the presence of frequent and unexpected changes in the environment. Current Deep Learning frameworks provide only one part of a complete solution. Fortunately, Google and Uber have provided a glimpse of their internal architectures. The architectures of these two giants can be two excellent base camps if you need to build your own production-ready Deep Learning solution. The primary motivation for Uber’s system, named Michelangelo, was that “there were no systems in place to build reliable, uniform, and reproducible pipelines for creating and managing training and prediction data at scale.” In their paper, they describe the limitations of existing frameworks around deployment and managing technical debt. The paper has enough arguments to convince any skeptic that existing frameworks are insufficient for production.


Generalists Dominate Data Science

Analytics products and systems are best built by small teams of generalists. Large teams of specialists become dominated by communication overhead, and the “Chinese whispers” effect distorts the flow of tasks and stifles creativity. Data scientists should develop generalist skills to become more efficient members of a data science team.