Proper Balancing for Cross Validation

Who hasn’t needed to apply cross-validation to a dataset that is imbalanced with respect to the number of instances per target class? The question is: do we apply it properly? The purpose of this article is to show how to use balancing methods in cross-validation without forcing balancing onto the CV test folds, and thus obtain more realistic CV evaluation results.
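As a minimal sketch of the idea (the dataset, model, and oversampling scheme here are illustrative, not the article’s): resample only the training fold, and evaluate on the untouched, still-imbalanced test fold.

```python
# Sketch: oversample the minority class in the TRAINING folds only,
# so each test fold keeps the original (imbalanced) class distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Duplicate minority-class rows until the training fold is balanced.
    minority = np.where(y_tr == 1)[0]
    majority = np.where(y_tr == 0)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y_tr)), extra])
    model = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])
    # Score on the untouched, imbalanced test fold.
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
print(round(float(np.mean(scores)), 3))
```

The key design choice is that the resampling happens inside the CV loop, after the split, so no duplicated minority row can leak into a test fold.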

Line Detection: Make an Autonomous Car see Road Lines

Fully self-driving passenger cars are not ‘just around the corner’. Elon Musk claims that Teslas will have ‘full self-driving’ capability by the end of 2020. In particular, he says that Tesla’s hardware is already ready for autonomous driving, and what remains is an update to their current software, which many brilliant scientists are working on. Our first instinct as human drivers is probably to look in front of us and decide where the car should move: in which direction, between which lines, and so on. Since every autonomous vehicle comes with a front-facing camera, one very important task is to determine the boundaries within which the car should move. For humans, we draw lines on the roads. Now we will teach an autonomous vehicle to see these lines. I promise it will be fun 🙂

How to Estimate ROI and Costs for Machine Learning and Data Science Projects

Before implementing any tech-related initiative, specialists must answer many whys and hows: What might be the impact of this solution? How do we know which tech stack is optimal for solving this problem? Can we afford this experiment? What is the predicted payback period? Answers to such questions help companies decide whether building a certain solution is worth the effort. But even careful planning doesn’t always prevent waste: Netflix spent $1 million on recommendation engine improvements it never used.

Creating a Search Engine for Finding Code

GitHub and Microsoft have released a dataset of search queries for code, with annotated results, to advance the development of search engines that can locate specific code. The dataset includes 99 search queries and ten likely results per query, which experts annotated for their relevance to the query, for a total of 4,000 annotations. The search results are drawn from a corpus of six million functions of open-source code across six programming languages, including Python and Java.
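As a toy illustration of the task (this is not the dataset’s baseline model; the snippets and query are made up), a code-search engine ranks candidate functions against a natural-language query — here with simple TF-IDF similarity over function docstrings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy "corpus": one docstring per candidate function.
snippets = [
    "return the contents of a file at the given path",
    "sort a list of numbers in ascending order",
    "download the contents of a url over http",
]
query = "how to sort a list"

vec = TfidfVectorizer()
matrix = vec.fit_transform(snippets + [query])
# Similarity of the query (last row) to each snippet.
sims = cosine_similarity(matrix[len(snippets)], matrix[:len(snippets)]).ravel()
best = int(sims.argmax())
print(best)  # → 1 (the sorting snippet)
```

Real code-search systems use learned embeddings of code and queries rather than bag-of-words overlap, but the ranking structure is the same.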

Explaining Predictions: Boosted Trees Post-hoc Analysis (Xgboost)

This is the last post of this series on explaining model predictions at a global level. We started the series by explaining predictions using white-box models such as logistic regression and decision trees. Next, we performed model-specific post-hoc evaluation on black-box models, specifically random forest and XGBoost.
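As a minimal sketch of this kind of global post-hoc explanation for boosted trees (scikit-learn’s GradientBoostingClassifier stands in for XGBoost here, and the data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
# Global explanation: how much each feature contributes across all trees
# (impurity-based importances, normalized to sum to 1).
for i, imp in enumerate(model.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

XGBoost exposes the same idea through its own importance scores, and tools like SHAP refine it to per-prediction attributions.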

Developing a prescriptive recommender system through Matrix Factorization

Matrix Factorization is a very powerful algorithm with many valuable use cases in multiple industries. Well-known applications of Matrix Factorization include the Netflix 1-million-dollar prize for movie recommendation and Amazon’s (much-refined) online system for recommending books to various readers. Although it is a very popular method, it has some limitations. Our focus in the current paper is to understand latent factors and how to make them more prescriptive. The algorithm works well when the products in the set have already been rated by a few users, or every user has rated a few products. However, if we include new products that have not been rated by any user, or new users whose preferences for the products are not known, it becomes difficult to create recommendations for the unrated products and the new users. There are also challenges in correlating the underlying features of the products with users’ preferences based on key features. The paper below addresses the challenge of interpreting latent factors related to the product and user features. Once we have clarity on the product and user features, we can move towards the prescriptive journey. This will help in designing future products with more popular features, or in finding a market for a particular product.
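To make the latent-factor idea concrete, here is a minimal NumPy sketch (the ratings matrix, factor rank, and hyperparameters are illustrative): learn user factors U and item factors V so that U·Vᵀ approximates the observed ratings, with the reconstruction filling in the unrated cells:

```python
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                      # 0 marks an unrated entry
k, lr, reg = 2, 0.01, 0.02        # latent rank, learning rate, L2 penalty
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors
for _ in range(2000):
    E = (R - U @ V.T) * mask      # error on observed ratings only
    U += lr * (E @ V - reg * U)   # gradient step on user factors
    V += lr * (E.T @ U - reg * V) # gradient step on item factors
pred = U @ V.T                    # predictions, including unrated cells
```

The rows of V are the learned latent item factors; the paper’s contribution is interpreting such factors in terms of real product and user features.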

A Guide to Integrating Text Analytics into Tableau

Data is often dirty and messy. Sometimes, it doesn’t even come in the right form for quick analysis and visualization. While Tableau (and Prep) has several tools for dealing with numeric, categorical, and even spatial data, one consistently missing piece has been handling unstructured text data. Not anymore. In the latest edition of Tableau Prep (2019.3), released just a few weeks ago, Prep natively supports custom R or Python scripts. By using TabPy, one can use the entire suite of R and Python libraries on any dataset without having to leave the Tableau universe. From running machine learning models to calling geospatial APIs, the possibilities are pretty much endless. In this article, we take a crack at something new and exciting: applying natural language processing techniques to unstructured text data using Tableau Prep and Python. Before we dive into that, we start with an in-depth guide on how to set everything up.
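As a hedged sketch of the kind of function Prep can hand off to Python via TabPy (the column names are made up, and the toy word-count “sentiment” stands in for a real NLP library): a function that receives a pandas DataFrame and returns it with a new column:

```python
import pandas as pd

# Toy lexicon standing in for a real sentiment model.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def add_sentiment(df):
    """Return a copy of df with a crude lexicon-based sentiment score."""
    def score(text):
        words = str(text).lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    df = df.copy()
    df["sentiment"] = df["review"].apply(score)
    return df

reviews = pd.DataFrame({"review": ["great product, love it", "poor quality"]})
print(add_sentiment(reviews)["sentiment"].tolist())  # → [2, -1]
```

In an actual Prep flow, a script step would call a function of exactly this DataFrame-in, DataFrame-out shape.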

R for Industrial Engineers — Quality Control Charts

As an industrial engineer, it might sound strange to use programming languages and software to perform tasks and conduct analyses. ‘Leave that to computer science engineers and data analysts’, you might be thinking. However, recent trends in the industry demand a workforce able to manage, transform, understand, and interpret data for better decision making and for obtaining business insights. R, an open-source and free software environment, is an amazing tool that industrial engineers can use for multiple purposes. With this article, I would like to encourage industrial engineers to start learning and using programming languages and software, and to break the paradigm that they are only meant for computer science engineers and data analysts. You will see that it is not difficult at all, and you will experience the advantages and benefits they can bring to your professional career. Let’s take a look!
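The article itself works in R; as a language-agnostic illustration of the arithmetic behind a basic control chart (the measurements here are made up), points falling outside mean ± 3 standard deviations of an in-control baseline are flagged:

```python
import numpy as np

# Phase I: estimate the center line and control limits from an
# in-control baseline sample.
baseline = np.array([10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 9.7, 10.0])
center = baseline.mean()
sd = baseline.std(ddof=1)
ucl, lcl = center + 3 * sd, center - 3 * sd

# Phase II: flag new measurements that fall outside the limits.
new = np.array([10.1, 11.2, 9.9])
out_of_control = np.where((new > ucl) | (new < lcl))[0]
print(out_of_control)  # → [1]
```

In R, the qcc package computes the same limits (with refinements such as moving-range sigma estimates) and draws the chart directly.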

Introduction to Matrix Profiles

In time series analysis, one is typically interested in two things: anomalies and trends. For example, a physician examines an EKG (electrocardiogram – heartbeat reading) for anomalous events that indicate at-risk patients. An individual working in retail needs to understand what items sell and when they sell (seasonality) to increase profits. One method to find anomalies and trends within a time series is to perform a similarity join. Essentially, you compare snippets of the time series against itself by computing the distance between each pair of snippets. While it takes minimal effort to implement a naive algorithm using nested loops, it may take months or years to receive an answer for a moderately sized time series using this approach. Taking advantage of the Matrix Profile algorithms drastically reduces the computation time.
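The naive similarity join described above can be sketched as follows (the series and window length are illustrative; real Matrix Profile algorithms such as STOMP compute the same profile far faster):

```python
import numpy as np

series = np.array([0, 1, 2, 1, 0, 1, 2, 1, 0, 5, 6, 5], dtype=float)
m = 4  # snippet (window) length
n = len(series) - m + 1
windows = np.array([series[i:i + m] for i in range(n)])

# For each snippet, record the distance to its nearest non-trivial neighbor.
profile = np.full(n, np.inf)
for i in range(n):
    for j in range(n):
        if abs(i - j) >= m:  # skip trivial matches (overlapping windows)
            profile[i] = min(profile[i],
                             np.linalg.norm(windows[i] - windows[j]))

# A large nearest-neighbor distance marks a discord (anomaly); a small one
# marks a motif (repeated pattern).
print(int(profile.argmax()))  # → 8, the window covering the unusual tail
```

The nested loops make this O(n²) distance computations, which is exactly the cost the Matrix Profile algorithms avoid.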

Active Learning: getting the most out of limited data

Last week I saw an excellent talk by a researcher at Fast Forward Labs that got me thinking about the area of machine learning called ‘active learning’. Active learning refers to a number of strategies for dealing with incompletely labeled data, in particular for identifying which points to manually label. Most of the use cases people think of when they hear the term ‘machine learning’ involve so-called ‘supervised learning’, meaning that they require data with a labeled target variable to train on. If a bank wants to build a model that predicts whether a particular transaction is fraudulent based on certain characteristics, it needs training data containing known cases of fraudulent and non-fraudulent transactions. If a machine-vision engineer wants to teach a car’s onboard computer to recognize stop signs, they need to present the computer with clearly labeled examples of images with and without stop signs. There are unsupervised techniques that don’t reference a specifically labeled target variable – algorithms that group points together or search for anomalies, for instance – but predictive models almost always require a target variable, and therefore require that your dataset already have that variable accounted for. After all, how else would you validate the model?
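One common active-learning strategy, uncertainty sampling, can be sketched in a few lines (the data and model here are illustrative): train on the few labeled points, then pick the unlabeled point the model is least sure about as the next one to send for manual labeling:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = np.array([[0.0], [0.2], [0.8], [1.0]])  # the few labeled points
y_labeled = np.array([0, 0, 1, 1])
X_pool = rng.uniform(0, 1, size=(50, 1))            # unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)
proba = model.predict_proba(X_pool)[:, 1]
# The point with predicted probability closest to 0.5 is the most
# uncertain — the best candidate to label next.
query_idx = int(np.abs(proba - 0.5).argmin())
print(X_pool[query_idx, 0])
```

In a real loop you would label the queried point, add it to the training set, refit, and repeat until the labeling budget runs out.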

From scikit-learn to Spark ML

Taking a machine learning project from Python to Scala. In a previous post, I showed how to take a raw dataset of home sales and apply feature engineering techniques in Python with pandas. This allowed us to produce and improve predictions on home sale prices using scikit-learn machine learning models. But what happens when you want to take this sort of project to production, and instead of 10,000 data points perhaps there are tens or hundreds of gigabytes of data to train on? In this context, it is worth moving away from Python and scikit-learn toward a framework that can handle Big Data.

Understanding BERT: Is it a Game Changer in NLP?

One of the most path-breaking developments in the field of NLP was the release of BERT (considered the ImageNet moment for NLP) – a revolutionary NLP model that outperforms traditional NLP models. It has also inspired many recent NLP architectures, training approaches and language models, such as Google’s TransformerXL, OpenAI’s GPT-2, ERNIE2.0, XLNet, RoBERTa, etc. Let’s dive deep into understanding BERT and its potential to transform NLP.

6 Deep Learning models – When should you use them?

Supervised Models
• Classic Neural Networks (Multilayer Perceptrons)
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
Unsupervised Models
• Self-Organizing Maps (SOMs)
• Boltzmann Machines
• AutoEncoders
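As a minimal example of the first supervised type in the list above — a classic multilayer perceptron — here is a scikit-learn sketch (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# A small nonlinear classification problem that a linear model cannot solve.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Two hidden layers of 16 units each: a classic MLP.
clf = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

CNNs and RNNs follow the same fit/predict pattern but swap in convolutional or recurrent layers suited to images and sequences.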

Detecting SET cards using transfer learning

During a holiday in beautiful France, my family and I played a lot of SET, a simple and elegant card game. The goal is to find specific combinations of cards before the other players do. While playing, we often stared at the cards wondering if there was another SET we just didn’t see. This sparked a fun personal side project in which I apply machine learning to find SET combinations.

Understanding Optimizers

In deep learning we have the concept of loss, which tells us how poorly the model is performing at the current instant. We need to use this loss to train our network so that it performs better. Essentially, we need to take the loss and try to minimize it, because a lower loss means our model will perform better. The process of minimizing (or maximizing) any mathematical expression is called optimization, and we now need to see how we can use these optimization methods for neural networks. A neural network has many weights between the layers. Each and every weight in the network affects the output in some way, because they are all directly or indirectly connected to the output.
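The idea can be shown with the simplest possible “network” — a single weight w fitting y = 3x by plain gradient descent (the data and learning rate are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 3.0 * x                 # target: the true weight is 3
w, lr = 0.0, 0.05           # initial weight and learning rate

for _ in range(100):
    # d(loss)/dw for mean squared error loss = mean((w*x - y)**2)
    grad = np.mean(2 * (w * x - y) * x)
    w -= lr * grad          # the optimizer step: move against the gradient
print(round(w, 3))  # → 3.0
```

Optimizers like momentum, RMSProp, and Adam modify only the update line, reusing past gradients to take better-scaled steps.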

Modeling Bank’s Churn Rate with AdaNet: A Scalable, Flexible Auto-Ensemble Learning Framework

AdaNet provides a framework that could automatically produce a high-quality model given an arbitrary set of features and a model search space. In addition, it builds ensembles from productionized TensorFlow models to reduce churn, reuse domain knowledge, and conform with business and explainability requirements. The framework is capable of handling datasets containing thousands to billions of examples, in a distributed environment.

A theoretical survey on Mahalanobis-Taguchi system

The Mahalanobis-Taguchi System (MTS) is a diagnosis and forecasting method employing Mahalanobis Distance (MD) and Taguchi’s Robust Engineering in a multidimensional system. In MTS, MD is used to construct a continuous measurement scale to discriminate observations and measure the level of abnormality of abnormal observations compared to a group of normal observations. Therefore, MTS can handle the class imbalance problem. In addition, MTS is unique in its robustness: it assesses variability across all levels of observations (noise) and identifies which features contribute significantly to the multidimensional system by means of a simple yet robust technique using Orthogonal Arrays (OA) and Signal-to-Noise Ratios (SNR). However, compared with classic multivariate methods, MTS has a weaker theoretical basis. To promote the development and improvement of MTS theory, this paper reviews the literature on developing and improving MTS theory. The survey presents and analyzes research results in terms of MD, SNR, Mahalanobis Space (MS), feature selection, thresholds, multi-class MTS, and comparisons with other methods. Finally, a detailed analysis of possible future research directions is proposed to develop and improve MTS.
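The MD computation at the heart of MTS can be sketched as follows (the data here is synthetic): distances are measured relative to the mean and covariance of the normal group, so abnormal observations score high on the resulting measurement scale:

```python
import numpy as np

rng = np.random.default_rng(0)
# Reference ("normal") group defining the Mahalanobis Space.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
mu = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def mahalanobis(x):
    """Distance of x from the normal group, scaled by its covariance."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

typical = mahalanobis(np.zeros(3))              # near the normal center
abnormal = mahalanobis(np.array([5.0, 5.0, 5.0]))  # far outside it
print(typical < abnormal)  # → True
```

MTS builds on this scale with OA/SNR analysis to select which features are worth keeping in the distance computation.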