Similarity and Distance Metrics for Data Science and Machine Learning

In a previous article introducing Recommendation Systems, we mentioned the concept of ‘similarity measures’ several times. Why? Because in Recommendation Systems, both Content-Based filtering and Collaborative filtering algorithms use a specific similarity measure to determine how alike two vectors of users or items are. So in the end, a similarity measure is nothing more than the distance between vectors.
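To make that concrete, here is a minimal sketch of two common similarity measures computed with NumPy; the vectors `a` and `b` are made-up user-rating vectors, purely for illustration.

```python
import numpy as np

# Two hypothetical user-rating vectors (invented data for illustration).
a = np.array([5.0, 3.0, 0.0, 1.0])
b = np.array([4.0, 0.0, 0.0, 1.0])

# Euclidean distance: smaller means the vectors are closer (more similar).
euclidean = np.linalg.norm(a - b)

# Cosine similarity: 1.0 means the vectors point in the same direction.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine similarity:  {cosine:.3f}")
```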


Data Fallacies to Avoid

• Cherry Picking
• Data Dredging
• Survivorship Bias
• Cobra Effect
• False Causality
• Gerrymandering
• Sampling Bias
• Gambler’s Fallacy
• Hawthorne Effect
• Regression Towards the Mean
• Simpson’s Paradox
• McNamara Fallacy
• Overfitting
• Publication Bias
• Danger of Summary Metrics


Observations on Observability

Over the past few years, observability has become a prominent topic in distributed computing. Observability means different things to different people and the use of the term is still evolving, but, essentially, observability refers to the practice of using instrumentation to understand software systems in order to derive better operational outcomes, like an improved customer experience, reduced operational costs, greater reliability, or product improvement. I do not like the term observability as it is currently used in the software industry. This article is my attempt to reflect on the emerging observability practices and explain why I do not like the term observability being applied to them. I was somewhat hesitant to publish this essay, as I think some people will find it pedantic – like I am splitting hairs. If that is the case, feel free to stop reading and these words can remain my personal indulgence.


Human-Centered Software Agents: Lessons from Clumsy Automation

The Cognitive Systems Engineering Laboratory (CSEL) has been studying the actual impact of capable autonomous machine agents on human performance in a variety of domains. The data shows that ‘strong, silent, difficult to direct automation is not a team player’ (Woods, 1996). The results of such studies have led to an understanding of the importance of human-centered technology development and to principles for making intelligent and automated agents team players (Billings, 1996). These results have been obtained in the crucible of complex settings such as aircraft cockpits, space mission control centers, and operating rooms. These results can be used to help developers of human-centered software agents for digital information worlds avoid common pitfalls and classic design errors.


Turn Python Scripts into Beautiful ML Tools

Introducing Streamlit, an app framework built for ML engineers. In my experience, every nontrivial machine learning project is eventually stitched together with bug-ridden and unmaintainable internal tools. These tools – often a patchwork of Jupyter Notebooks and Flask apps – are difficult to deploy, require reasoning about client-server architecture, and don’t integrate well with machine learning constructs like TensorFlow GPU sessions. I saw this first at Carnegie Mellon, then at Berkeley, Google X, and finally while building autonomous robots at Zoox. These tools were often born as little Jupyter notebooks: the sensor calibration tool, the simulation comparison app, the LIDAR alignment app, the scenario replay tool, and so on. As a tool grew in importance, project managers stepped in. Processes sprouted. Requirements flowered. These solo projects gestated into scripts and matured into gangly maintenance nightmares.
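For a sense of the contrast, here is a minimal sketch of what a Streamlit tool can look like; the app’s purpose, widget choices, and synthetic data are my own illustrative assumptions, not code from the article.

```python
# app.py -- run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Sensor noise explorer")

# An interactive widget becomes a plain Python variable.
noise = st.slider("Noise level", 0.0, 1.0, 0.1)

# Synthetic data standing in for real sensor readings.
signal = np.sin(np.linspace(0, 10, 200)) + noise * np.random.randn(200)
st.line_chart(pd.DataFrame({"signal": signal}))
```

The whole tool is one linear script: no routes, no callbacks, no client-server reasoning.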


Development practices that data scientists should use NOW.

I’ve worked with seasoned developers, expert data scientists, newcomers, and non-programmers, and all of these people had something in common: they *had* to produce code. Normally a post like this would be very general – most programmers already work this way – but due to the inherently experimental nature of some data science and data engineering practices, I’ve found that some of this advice is, err… sometimes overlooked. So here’s the absolute minimum every single data {scientist|engineer} should know about coding practice.


Towards Secure AI/Machine Learning

Enterprises are migrating AI/ML workloads to the cloud but are running into major security issues. Here are the issues and how to address them. Today, AI/ML’s need for advanced computing and data storage has driven enterprises to migrate AI/ML workloads out of the comfortable confines of their data centers and into the cloud. But recent headlines about high-profile data breaches have shown that there are valid, real, and serious cybersecurity concerns that need to be addressed. This is a particularly pressing issue for AI/ML. After all, AI/ML practitioners rely on huge volumes of data, much of it sensitive and/or proprietary, to train their models. To make matters even more complex, the data used by data scientists must largely remain un-obfuscated (or ‘clear text’), which could magnify both the opportunity for a data breach and perhaps also its impact. Hence, it is probably not surprising that enterprises are looking at AI/ML’s migration to the cloud with a degree of trepidation and caution. In fact, a recent survey by Deloitte, an international consulting firm, has shown that senior technology and security executives and professionals across all industries believe ‘cybersecurity vulnerabilities can slow or even stop AI initiatives.’


Searching Algorithms for Artificial Intelligence

An overview of the blind, informed, and optimal searching algorithms for artificial intelligence. Searching algorithms are defined as a way to identify and find a specific value within a given data structure. Not only do they allow you to find the value you are looking for, but searching algorithms are also a key element of artificial intelligence: they teach computers to ‘act rationally’ by achieving a certain goal with a certain input value. Essentially, artificial intelligence can find solutions to given problems through the use of searching algorithms. We will be discussing three different kinds of searching algorithms: blind, informed, and optimal. Which algorithm you choose depends entirely on how you want to traverse your data structure, as well as what you want to optimize for (time versus space, for example).
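As a sketch of the ‘blind’ family, here is a breadth-first search over a small hand-made graph; BFS expands the shallowest unexpanded node first and uses no domain knowledge about where the goal lies. The graph itself is an invented example.

```python
from collections import deque

def bfs(graph, start, goal):
    """Blind (uninformed) breadth-first search: returns the first path
    found from start to goal, exploring nodes in order of depth."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(path + [neighbor])
    return None  # goal unreachable

# A tiny made-up graph for illustration.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"]}
print(bfs(graph, "A", "F"))  # ['A', 'B', 'D', 'F']
```

Informed variants replace the plain queue with a priority queue ordered by a heuristic; optimal variants, such as uniform-cost search, order it by accumulated path cost.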


How to Find Feature Importances for Black-Box Models?

I grapple with many algorithms on a day-to-day basis, so I thought of listing some of the most common and most used ones in this new DS Algorithm series. How many times has it happened that you create a lot of features and then need to come up with ways to reduce their number? Last time I wrote a post titled ‘The 5 Feature Selection Algorithms every Data Scientist should know’, in which I talked about using correlation or tree-based methods and adding some structure to the process of feature selection. Recently I was introduced to another novel way of feature selection called Permutation Importance, and I really liked it. So this post explains how permutation importance works and how we can code it using ELI5.
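As a preview of the approach, the ELI5 usage looks roughly like this; the toy dataset, model choice, and train/validation split are my own placeholders, and the snippet assumes the third-party eli5 package is installed.

```python
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A stand-in "black box" model trained on a toy dataset.
data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: shuffle one feature at a time on held-out
# data and measure how much the validation score drops.
perm = PermutationImportance(model, random_state=0).fit(X_val, y_val)
print(eli5.format_as_text(
    eli5.explain_weights(perm, feature_names=list(data.feature_names))))
```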


Machine Learning in the Browser: Train and Serve a MobileNet Model for Custom Image Classification

There are several ways of fine-tuning a deep learning model, but doing it in the web browser with WebGL acceleration became possible only recently, with the introduction of TensorFlow.js. I will use TensorFlow.js together with Angular to build a web app that trains a convolutional neural network to detect malaria-infected cells, with the help of MobileNet and a Kaggle dataset containing 27,558 infected and uninfected cell images.


Negative Binomial Regression: A Step by Step Guide

In this article, we’ll cover the following topics:
• We’ll get introduced to the Negative Binomial (NB) regression model. An NB model can be incredibly useful for predicting count-based data.
• We’ll go through a step-by-step tutorial on how to create, train and test a Negative Binomial regression model in Python using the GLM class of statsmodels (a minimal sketch of this step follows below).
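As a taste of the second step, here is a minimal sketch using statsmodels’ GLM class; the synthetic over-dispersed counts and the fixed dispersion parameter alpha=1.0 are illustrative assumptions, not the tutorial’s actual data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic over-dispersed count data standing in for a real dataset.
rng = np.random.default_rng(42)
X = sm.add_constant(pd.DataFrame({"x1": rng.normal(size=500)}))
mu = np.exp(0.3 + 0.8 * X["x1"])           # true conditional mean
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

# Fit a Negative Binomial GLM; alpha is the dispersion parameter
# (fixed at 1.0 here purely for illustration).
nb_model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0))
nb_results = nb_model.fit()
print(nb_results.summary())
```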


Deploy a Keras Model for Text Classification using TensorFlow Serving (Part 1 of 2)

Almost two years ago, I used the Keras library to build a solution for Kaggle’s Toxic Comment Classification Challenge. The solution ensembled several deep learning classifiers to achieve a mean ROC AUC of 98.6%. Like most of my Kaggle submissions, this one was a jumble of code wrapped in a Jupyter notebook that served little purpose other than producing a rather arbitrary CSV file. In an effort to make my submission more useful, I selected one of the models from the ensemble solution, cleaned it up, and used TensorFlow Serving to expose an HTTP endpoint for model inference.
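The export-and-serve flow looks roughly like this; the model file, directory names, model name, and input shape are hypothetical placeholders rather than the article’s actual code.

```python
import json
import requests
import tensorflow as tf

# 1) Export the trained Keras model in the SavedModel format that
#    TensorFlow Serving expects; "1" is the model version directory.
model = tf.keras.models.load_model("toxic_classifier.h5")  # hypothetical path
tf.saved_model.save(model, "serving/toxic_classifier/1")

# 2) Start the server, e.g. with Docker:
#    docker run -p 8501:8501 \
#      -v $PWD/serving/toxic_classifier:/models/toxic_classifier \
#      -e MODEL_NAME=toxic_classifier tensorflow/serving

# 3) Hit the REST endpoint for inference.
payload = {"instances": [[0.1] * 100]}  # placeholder pre-tokenized input
resp = requests.post(
    "http://localhost:8501/v1/models/toxic_classifier:predict",
    data=json.dumps(payload))
print(resp.json())
```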


Virtual Machines for TensorFlow 2.0

Since TensorFlow 2.0 is now available, and a period of hectic business travel appears to have passed, I got to thinking about getting back to machine learning, software engineering, and doing what I would do if I wasn’t travelling. You can read the announcement in the TensorFlow Medium publication. But just because TensorFlow 2.0 has emerged does not mean you should jump straight in. You need to test the changes, review your code base, and assess any changes that might become necessary in due course. The question is how to do such an evaluation without breaking your current set-up. The answer is to use a Virtual Machine (VM): you could either run a VM on a powerful local host with Red Hat (RHEL 7/8), or just head over to the cloud. In this article we head over to the AWS cloud and spin up a VM for a road test.
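Whichever machine you end up on, a quick sanity check like the following can confirm the environment before any real evaluation begins (a generic snippet of my own, not from the article):

```python
import tensorflow as tf

# Confirm the version and the TF 2.0 defaults before porting any code.
print("TensorFlow version:", tf.__version__)
print("Eager execution enabled:", tf.executing_eagerly())
print("GPUs visible:", tf.config.experimental.list_physical_devices("GPU"))
```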


Decomposing Signals Using Empirical Mode Decomposition – An Algorithm Explanation for Dummies

What kind of ‘beast’ is Empirical Mode Decomposition (EMD)? It’s an algorithm to decompose signals, and when I say signal, what I mean is time-series data. We input a signal into the EMD and we get back some decomposed signals, a.k.a. the ‘basic ingredients’ of our input signal. It’s similar to the Fast Fourier Transform (FFT). FFT assumes our signal is periodic and its ‘basic ingredients’ are various simple sine waves. In FFT, our signal is transformed from the time domain to the frequency domain.
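To make that concrete, here is a small sketch assuming the third-party PyEMD package (installed as `EMD-signal`); the composite signal is invented purely for illustration.

```python
import numpy as np
from PyEMD import EMD  # third-party: pip install EMD-signal

# An invented composite signal: two sine waves plus a slow trend.
t = np.linspace(0, 1, 1000)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 22 * t) + t

# EMD splits the signal into Intrinsic Mode Functions (IMFs) --
# the "basic ingredients" -- without assuming periodicity.
imfs = EMD().emd(signal, t)
print(f"Extracted {len(imfs)} IMFs from the signal")
```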


DeepMind is Using This Old Technique to Evaluate Fairness in Machine Learning Models

One of the arguments regularly used in favor of machine learning systems is that they can arrive at decisions without being vulnerable to human subjectivity. However, that argument is only partially true. While machine learning systems don’t make decisions based on feelings or emotions, they do inherit a lot of human biases via their training datasets. Bias is relevant because it leads to unfairness. In the last few years, there has been a lot of progress in developing techniques that can mitigate the impact of bias and improve the fairness of machine learning systems. Recently, DeepMind published a research paper that proposes using an old statistical technique known as Causal Bayesian Networks (CBNs) to build fairer machine learning systems.


Demystifying Object Detection and Instance Segmentation for Data Scientists

Distilling the history of object detection and instance segmentation into an easy-to-digest explanation – RCNN, Fast RCNN, Faster RCNN, Mask RCNN.


When Topic Modeling is Part of the Text Pre-processing

A few months ago, we built a content-based recommender system using a relatively clean text dataset. Because I collected the hotel descriptions myself, I made sure they were useful for the goals we wanted to accomplish. However, real-world text data is never clean, and different goals call for different pre-processing steps. Topic modeling in NLP is rarely my final goal in an analysis; I often use it either to explore the data or as a tool to make my final model more accurate. Let me show you what I mean.
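For readers who have not used topic modeling this way, a minimal LDA sketch with gensim might look like the following; the toy ‘documents’ are invented placeholders standing in for real hotel descriptions.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents standing in for real hotel descriptions.
texts = [
    ["pool", "beach", "ocean", "view", "sunny"],
    ["conference", "business", "wifi", "meeting", "desk"],
    ["beach", "sunny", "pool", "bar", "cocktail"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small LDA model; inspecting the topics can reveal noise
# (e.g. boilerplate terms) worth stripping during pre-processing.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```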


tf.estimator, a TensorFlow High-Level API

Now that TensorFlow 2.0 has been officially released, it ships with two high-level deep learning APIs: the first one is tf.keras and the other is tf.estimator. Many of us are familiar with building an ML model using Keras, but tf.estimator is less familiar, especially to beginners in ML. So let us understand tf.estimator.
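As a taste of the API, here is a minimal premade-Estimator sketch; the random data, feature shape, and hyperparameters are generic placeholders of mine, not code from the article.

```python
import numpy as np
import tensorflow as tf

# Placeholder numeric data: 4 features, binary labels.
X = np.random.rand(100, 4).astype(np.float32)
y = np.random.randint(0, 2, size=100)

# Estimators describe their inputs via feature columns...
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

# ...and consume data through an input_fn returning a tf.data.Dataset.
def input_fn():
    ds = tf.data.Dataset.from_tensor_slices(({"x": X}, y))
    return ds.shuffle(100).batch(16)

# A premade Estimator: no model code to write at all.
estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns)
estimator.train(input_fn=input_fn, steps=50)
print(estimator.evaluate(input_fn=input_fn, steps=5))
```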


Feature Scaling with Python’s scikit-learn

One of the primary objectives of normalization is to bring the data close to zero. That makes the optimization problem more ‘numerically stable’. Scaling using the mean and standard deviation assumes that the data are normally distributed, that is, that most of the data are sufficiently close to the mean. Shifting the mean to zero then ensures that most components of most data points are close to 0.
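In scikit-learn this is only a couple of lines; the tiny array below is a made-up stand-in for real features with very different column scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with very different column scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# StandardScaler subtracts each column's mean and divides by its
# standard deviation, centering every feature near zero.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```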