How to handle Imbalanced Classification Problems in machine learning?

If you have spent some time in machine learning and data science, you have almost certainly come across imbalanced class distribution. This is a scenario where the number of observations belonging to one class is significantly lower than those belonging to the other classes. This problem is predominant in scenarios where anomaly detection is crucial, like electricity pilferage, fraudulent transactions in banks, identification of rare diseases, etc. In this situation, a predictive model developed using conventional machine learning algorithms could be biased and inaccurate. This happens because machine learning algorithms are usually designed to improve accuracy by reducing the error. Thus, they do not take into account the class distribution, that is, the proportion or balance of classes. This guide describes various approaches for solving such class imbalance problems using various sampling techniques. We also weigh each technique for its pros and cons. Finally, I reveal an approach with which you can create a balanced class distribution and apply an ensemble learning technique designed especially for this purpose.
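To make the accuracy problem concrete, here is a minimal sketch in plain Python. The toy data (95 legitimate and 5 fraudulent transactions) and the helper name `random_oversample` are assumptions made up for illustration; they are not taken from the guide. The sketch shows that a model which always predicts the majority class scores high accuracy while detecting nothing, and how random oversampling, one of the simplest sampling techniques, can balance the classes.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Made-up toy data: 95 legitimate transactions (0) and 5 fraudulent ones (1).
labels = [0] * 95 + [1] * 5
samples = [[float(i)] for i in range(100)]

# A "model" that always predicts the majority class.
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

balanced_x, balanced_y = random_oversample(samples, labels)
balanced_counts = Counter(balanced_y)

print("accuracy:", accuracy)   # high, although no fraud is found
print("recall on fraud:", recall)
print("class counts after oversampling:", dict(balanced_counts))
```

Oversampling by duplication is only one option and can encourage overfitting to the repeated rows, which is why it helps to weigh each sampling technique for its pros and cons.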

So you built a Machine Learning model?

You’ve been working on a Machine Learning task. You collected data from various sources, built your model and got some results. You notice you’re getting about 80% accuracy on your test set, which is less than what you desire. Now what? How do you decide what will improve the model? Should you get more data? Build a more complex model? Increase or decrease regularization? Add or remove features? Run more iterations of gradient descent? Maybe try all of them? Recently I got this question from a friend, who said it seemed that improving models is just trial and error. This prompted me to write this post on how to make an informed decision about what you should work on first.

Self-Organising Maps: An Introduction

When you learn about machine learning techniques, you usually get a selection of the usual suspects. Something like: Support Vector Machines, decision trees/random forests, and logistic regression for classification, linear regression for regression, k-means for clustering and perhaps PCA for dimensionality reduction. In fact, KDNuggets has a good post about the 10 machine learning algorithms you should know. If you want to learn about machine learning techniques, you should start there. The point is, on the subject of these algorithms the internet has you covered. In this post I want to talk about a less prevalent algorithm, but one that I like and that can be useful for different purposes. It’s called a Self-Organising Map (SOM).

What does more efficient Monte Carlo mean?

In a simple question on X validated a few days ago [about simulating from x²f(x)], the remark popped up that the person asking the question wanted a direct simulation method for higher efficiency, compared with an accept-reject solution. This shows a misunderstanding of what “efficiency” means in Monte Carlo situations. If it means anything, I would think it is reflected in the average time taken to return one simulation and possibly in the worst case. But there is no reason to call an inverse cdf method more efficient than an accept-reject or a transform approach since it all depends on the time it takes to make the inversion compared with the other solutions… Since inverting the closed-form cdf in this example is much more expensive than generating a Gamma(½,½), and taking plus or minus its root, this is certainly the case here. Maybe a ziggurat method could be devised, especially since x²f(x)<f(x) when |x|<1, but I am not sure it is worth the effort!

Plotting trees from Random Forest models with ggraph

Today, I want to show how I use Thomas Lin Pedersen’s awesome ggraph package to plot decision trees from Random Forest models. I am very much a visual person, so I try to plot as much of my results as possible because it helps me get a better feel for what is going on with my data. A nice aspect of using tree-based machine learning, like Random Forest models, is that they are more easily interpreted than e.g. neural networks, as they are based on decision trees. So, when I am using such models, I like to plot final decision trees (if they aren’t too large) to get a sense of which decisions are underlying my predictions. There are a few very convenient ways to plot the outcome if you are using the randomForest package, but I like to have as much control as possible over the layout, colors, labels, etc. And because I didn’t find a solution I liked for caret models, I developed the following little function (below you may find information about how I built the model). As input, it takes part of the output from model_rf <- caret::train(… ‘rf’ …) that gives the trees of the final model: model_rf$finalModel$forest. From these trees, you can specify which one to plot by index.

Learning to Remember Rare Events

Despite recent advances, memory-augmented deep neural networks are still limited when it comes to life-long and one-shot learning, especially in remembering rare events. We present a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training. Our memory module can be easily added to any part of a supervised neural network. To show its versatility we add it to a number of networks, from simple convolutional ones tested on image classification to deep sequence-to-sequence and recurrent-convolutional models. In all cases, the enhanced network gains the ability to remember and do life-long one-shot learning. Our module remembers training examples shown many thousands of steps in the past and it can successfully generalize from them. We set new state-of-the-art for one-shot learning on the Omniglot dataset and demonstrate, for the first time, life-long one-shot learning in recurrent neural networks on a large-scale machine translation task.

The HPE Elastic Platform for Big Data Analytics

This is the fourth entry in an insideBIGDATA series that explores the intelligent use of big data on an industrial scale. This series, compiled in a complete Guide, also covers the exponential growth of data and the changing data landscape, as well as realizing a scalable data lake. The fourth entry in the series is focused on offerings from HPE for big data analytics.

R + D3, A powerful and beautiful combo

For a while now, I have been looking for ways to incorporate pretty D3 graphics into R markdown reports or shiny apps. Why? I like R for the ability to do data handling, stats and ML very easily with minimal code. But when you want to present something to clients or the public, nothing competes with front-end web tools such as d3.js. The answer: htmlwidgets.

Breaking Data Science Open

Deliver Collaboration, Self-Service and Production Deployment with Open Data Science. Data science has burst into public attention over the past few years as perhaps the hottest and most lucrative technology field. No longer just a buzzword for advanced analytics, data science is poised to change everything about an organization: its potential customers, expansion plans, engineering and manufacturing process, how it chooses and interacts with suppliers and more. The leading edge of this tsunami is a combination of innovative business and technology trends that promise a more intelligent future based on Open Data Science. Open Data Science is a movement that makes the open source tools of data science (data, analytics and computation) work together as a connected ecosystem.

Christine Doig (@ch_doig) is a senior data scientist at Continuum Analytics, where she’s worked on several projects, including MEMEX, a DARPA-funded open data science project to help stop human trafficking. She has 5+ years of experience in analytics, operations research, and machine learning in a variety of industries.

Email Spam Filtering : A python implementation with scikit-learn

Text mining (deriving information from text) is a wide field which has gained popularity with the huge amounts of text data being generated. A number of applications like sentiment analysis, document classification, topic classification, text summarization, machine translation, etc. have been automated using machine learning models. Spam filtering is a beginner’s example of a document classification task, which involves classifying an email as spam or non-spam (a.k.a. ham) mail. The spam box in your Gmail account is the best example of this. So let’s get started on building a spam filter on a publicly available mail corpus.
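To give a feel for what classifying an email as spam or ham involves, here is a minimal from-scratch sketch in plain Python. It deliberately does not use scikit-learn or the mail corpus from the post: the four training messages, the class name `NaiveBayesSpamFilter` and the whitespace tokenizer are all assumptions made up for illustration. It implements a word-count (bag-of-words) Naive Bayes classifier with add-one smoothing, which is one common choice for this kind of task.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical toy classifier: multinomial Naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior: fraction of training mails with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                # Log likelihood with add-one (Laplace) smoothing.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training mails, two per class.
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting schedule for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
print(model.predict("claim your free money"))
print(model.predict("project meeting on monday"))
```

A real filter would train on thousands of mails and use a more careful tokenizer, but the pipeline has the same shape: turn each mail into word counts, then let a classifier decide between spam and ham.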