Boosting with AdaBoost and Gradient Boosting

Have you ever entered, or even just followed, a Kaggle competition? Most of the prize winners do it by using boosting algorithms. Why are AdaBoost, GBM, and XGBoost the go-to algorithms of champions?
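To make the idea concrete, here is a minimal, from-scratch sketch of AdaBoost on one-dimensional data using decision stumps as the weak learners. The data and function names are made up for illustration; in practice you would reach for scikit-learn's AdaBoostClassifier or a GBM library such as XGBoost.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump on one feature: predicts +polarity at or above the threshold."""
    return polarity if x >= threshold else -polarity

def train_adaboost(X, y, n_rounds=5):
    """AdaBoost with stumps: reweight points each round so the next weak
    learner focuses on the examples the ensemble still gets wrong."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        # pick the stump with the lowest weighted error
        best = None
        for t in sorted(set(X)):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, t, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)          # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)        # this stump's vote weight
        ensemble.append((alpha, t, pol))
        # up-weight misclassified points, down-weight correct ones, renormalize
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def boosted_predict(ensemble, x):
    """Weighted vote of all the stumps."""
    score = sum(alpha * stump_predict(x, t, pol) for alpha, t, pol in ensemble)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 6, 7, 8]
y = [-1, -1, -1, 1, 1, 1]
ensemble = train_adaboost(X, y, n_rounds=3)
```

Gradient boosting follows the same "sequence of weak learners" recipe, but each new learner fits the residual errors of the current ensemble rather than reweighted examples.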

Ensemble Learning: When everybody takes a guess…I guess!

Remember a few months ago when everyone was taking wild guesses at the color of the dress (blue or gold) or the tennis shoe (pink or grey)? To me, Ensemble Learning looks a bit like that: a group of weak learners comes together to form a strong learner, improving the accuracy of the resulting Machine Learning model.
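The "everybody takes a guess" idea really is that simple at its core. A tiny sketch, with made-up labels: combine several weak learners' predictions by majority vote.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label that most of the weak learners guessed."""
    return Counter(predictions).most_common(1)[0][0]

# three imperfect classifiers disagree on one sample; the majority wins
majority_vote(["blue", "gold", "blue"])  # -> "blue"
```

With three independent classifiers that are each right 70% of the time, the majority vote is right about 78% of the time (0.7³ + 3 · 0.7² · 0.3 ≈ 0.784), which is the intuition behind weak learners adding up to a strong one.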

An Introduction to Random Forest using the fastai Library (Machine Learning for Programmers – Part 1)

Programming is a crucial prerequisite for anyone wanting to learn machine learning. Sure, quite a few autoML tools are out there, but most are still at a very nascent stage and well beyond an individual’s budget. The sweet spot for a data scientist lies in combining programming with machine learning algorithms. fast.ai is led by the amazing partnership of Jeremy Howard and Rachel Thomas, so when they recently released their machine learning course, I couldn’t wait to get started. What I personally liked about this course is the top-down approach to teaching: you first learn how to code an algorithm in Python, and then move on to the theory. While not a unique approach, it certainly has its advantages.

Simplifying Data Preparation and Machine Learning Tasks using RapidMiner

It’s a well-known fact that we spend too much time on data preparation and not as much time as we want on building cool machine learning models. In fact, a Harvard Business Review publication confirmed what we always knew: analytics teams spend 80% of their time preparing data. And they are typically slowed down by clunky data preparation tools coupled with a scarcity of data science experts.

Decision Tree Classifier implementation in R

The decision tree classifier is a supervised learning algorithm that can be used for both classification and regression tasks. We explained the building blocks of the decision tree algorithm in our earlier articles; now we are going to implement a decision tree classifier in R using the caret machine learning package. To get the most out of this article, it is recommended to learn about the decision tree algorithm first. If you don’t have a basic understanding of the decision tree classifier, it’s worth spending some time on how the algorithm works.
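The article implements this in R with caret; as a language-neutral illustration of the underlying algorithm (greedy splitting to minimize Gini impurity), here is a minimal pure-Python sketch with made-up data. It omits pruning, handling of categorical features, and everything else a real library does.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when a node is pure, higher when classes are mixed."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Find the (feature, threshold) pair with the lowest weighted Gini impurity."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] < t]
            right = [yi for row, yi in zip(X, y) if row[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until nodes are pure or max_depth is reached."""
    if len(set(y)) == 1 or depth == max_depth or (split := best_split(X, y)) is None:
        return Counter(y).most_common(1)[0][0]      # leaf: majority class
    _, f, t = split
    left = [(row, yi) for row, yi in zip(X, y) if row[f] < t]
    right = [(row, yi) for row, yi in zip(X, y) if row[f] >= t]
    return (f, t,
            build_tree([r for r, _ in left], [yi for _, yi in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [yi for _, yi in right], depth + 1, max_depth))

def tree_predict(tree, row):
    """Walk from the root to a leaf by answering threshold questions."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] < t else hi
    return tree

X = [[1], [2], [3], [10], [11], [12]]
y = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(X, y)
```

In caret the same fit is roughly `train(y ~ ., data = df, method = "rpart")`, with the impurity computation and recursion handled for you.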

Working with panel data in R: Fixed vs. Random Effects (plm)

Panel data, along with cross-sectional and time series data, are the main data types that we encounter when working with regression analysis.
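The heart of the fixed-effects ("within") estimator can be sketched without any package: demean each variable within its entity, then run OLS on the demeaned data. This is what plm's `model = "within"` does (plus proper standard errors); the firms and numbers below are invented for illustration.

```python
from statistics import mean

def within_transform(panel):
    """Fixed-effects 'within' transformation: demean y and x inside each entity,
    which wipes out any entity-specific intercept (the fixed effect)."""
    out = []
    for entity, rows in panel.items():
        ybar = mean(y for y, _ in rows)
        xbar = mean(x for _, x in rows)
        out.extend((y - ybar, x - xbar) for y, x in rows)
    return out

def ols_slope(pairs):
    """OLS slope of y on x through the origin (no intercept needed after demeaning)."""
    return sum(y * x for y, x in pairs) / sum(x * x for _, x in pairs)

# two entities with very different intercepts (fixed effects) but the same true slope 2
panel = {
    "firm_a": [(10 + 2 * x, x) for x in (1, 2, 3)],
    "firm_b": [(50 + 2 * x, x) for x in (11, 12, 13)],
}
beta = ols_slope(within_transform(panel))
```

Because the entity effect here is correlated with x (the high-intercept firm also has high x), pooled OLS on the raw data would badly overstate the slope; the within estimator recovers 2 exactly. A random-effects model instead treats the entity intercepts as draws from a distribution, which is more efficient when they are uncorrelated with the regressors.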

Semantic Interoperability: Are you training your AI by mixing data sources that look the same but aren’t?

Semantic interoperability is a challenge in AI systems, especially as data has become increasingly complex. Another issue is that semantic interoperability may be compromised when people use the same system differently.

R and Python: How to Integrate the Best of Both into Your Data Science Workflow

From Executive Business Leadership to Data Scientists, we all agree on one thing: a data-driven transformation is happening. Artificial Intelligence (AI), and more specifically Data Science, are redefining how organizations extract insights from their core business(es). We’re experiencing a fundamental shift in organizations, in which ‘approximately 90% of large global organizations will have a Chief Data Officer by 2019’. Why? Because when the ingredients of a ‘high-performance data science team’ are present (refer to this Case Study), organizations are able to generate massive return on investment (ROI). However, data science teams tend to get hung up on a ‘battle’ waged between the two leading programming languages for data science: R versus Python.

Partially additive (generalized) linear model trees

The PALM tree algorithm for partially additive (generalized) linear model trees is introduced along with the R package palmtree. One potential application is modeling of treatment-subgroup interactions while adjusting for global additive effects.

Mining Sent Email for Self-Knowledge

How can we use data analytics to increase our self-knowledge? Along with biofeedback from digital devices like Fitbit, less structured sources such as sent emails can provide insights. For example, it seems my communication took a sudden, more positive turn in 2013. Let’s see what else shakes out of my sent email corpus.
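A toy sketch of the kind of analysis involved, with hypothetical word lists and made-up emails; a real study would use a proper sentiment lexicon (e.g. VADER, or tidytext's lexicons in R) and your actual mailbox export.

```python
# tiny, illustrative word lists -- not a real sentiment lexicon
POSITIVE = {"thanks", "great", "happy", "glad"}
NEGATIVE = {"sorry", "problem", "unfortunately", "delay"}

def tone(text):
    """Crude sentiment score: positive minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def yearly_tone(emails):
    """Average tone per year for a list of (year, body) pairs."""
    totals = {}
    for year, body in emails:
        totals.setdefault(year, []).append(tone(body))
    return {year: sum(scores) / len(scores) for year, scores in sorted(totals.items())}

emails = [
    (2012, "sorry for the delay on this"),
    (2013, "thanks so glad to hear the great news"),
]
trend = yearly_tone(emails)
```

Plotting `trend` over enough years is exactly the kind of chart that can reveal a "sudden more positive turn".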

RStudio 1.2 Preview: Reticulated Python

One of the primary focuses of RStudio v1.2 is improved support for other languages frequently used with R. Last week on the blog we talked about new features for working with SQL and D3. Today we’re taking a look at enhancements we’ve made around the reticulate package (an R interface to Python).

Duplicate question detection using Word2Vec, XGBoost and Autoencoders

In this post, I tackle the problem of classifying question pairs based on whether or not they are duplicates. This matters for companies like Quora or Stack Overflow, where many posted questions are duplicates of questions already answered. If an algorithm spots a duplicate question, the user can be directed to it and reach the answer faster.
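The post combines Word2Vec embeddings, XGBoost and autoencoders; as a much weaker but self-contained baseline for the same task, here is bag-of-words cosine similarity with an arbitrary threshold. The example questions and the 0.6 cutoff are made up.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def looks_duplicate(q1, q2, threshold=0.6):
    """Flag a pair as duplicate when the questions share enough vocabulary."""
    return cosine(bow(q1), bow(q2)) >= threshold
```

Word2Vec improves on this by scoring "learn python" and "study python" as close even though they share only one word; a learned classifier like XGBoost then replaces the hand-picked threshold.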

Towards AI Transparency: Four Pillars Required to Build Trust in Artificial Intelligence Systems

Trust is a foundational building block of human socio-economic dynamics. In software development, over the last few decades, we steadily built mechanisms for asserting trust in specific applications. When we get on planes that fly on auto-pilot, or into cars completely driven by robots, we are implicitly expressing trust in the creators of a specific software application. In software, trust mechanisms are fundamentally based on the deterministic nature of most applications: their behavior is uniquely determined by the code workflow, which makes it intrinsically predictable. The non-deterministic nature of artificial intelligence (AI) systems breaks this pattern and introduces new dimensions to enabling trust in AI agents. Recently, researchers from IBM proposed a new methodology for establishing trust in AI systems.

Analyzing time series data in Pandas

In previous tutorials, we covered data preparation and visualization tools such as NumPy, Pandas, Matplotlib and Seaborn. In this tutorial, we are going to learn about Time Series: why it’s important, situations in which we need to apply it, and, more specifically, how to analyze Time Series data using Pandas.
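As a preview of the kind of operations such a tutorial covers, here is a minimal sketch with a made-up hourly series: the key tools are a DatetimeIndex, `resample` for changing frequency, and `rolling` for moving-window statistics.

```python
import pandas as pd

# hypothetical hourly readings over two days
idx = pd.date_range("2018-01-01", periods=48, freq="H")
ts = pd.Series(range(48), index=idx)

daily_mean = ts.resample("D").mean()    # downsample hours -> daily averages
smoothed = ts.rolling(window=3).mean()  # 3-hour moving average (first 2 values are NaN)
```

`resample` also supports `.sum()`, `.max()` and upsampling with interpolation, which is where most time series wrangling in Pandas starts.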

Deep Learning: Which Loss and Activation Functions should I use?

The purpose of this post is to provide guidance on which combination of final-layer activation function and loss function should be used in a neural network depending on the business goal.
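The standard pairings (general deep learning practice, not quoted from the post) are: sigmoid with binary cross-entropy for binary classification, softmax with categorical cross-entropy for multi-class classification, and a linear output with mean squared error for regression. A minimal sketch of each:

```python
import math

def sigmoid(z):
    """Binary classification head: squashes a logit into a probability."""
    return 1 / (1 + math.exp(-z))

def binary_cross_entropy(y, p):
    """Loss paired with sigmoid; y is 0 or 1, p the predicted probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def softmax(logits):
    """Multi-class head: logits -> probabilities that sum to 1."""
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_index, probs):
    """Loss paired with softmax: negative log-probability of the true class."""
    return -math.log(probs[true_index])

def mse(y, y_hat):
    """Regression typically pairs a linear (identity) output with squared error."""
    return (y - y_hat) ** 2
```

Mixing these up (e.g. softmax with MSE) usually still trains, but gradients are weaker and convergence slower, which is why the pairing matters for the business goal.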

Deep Learning: Overview of Neurons and Activation Functions

This post is designed to be an overview of the concepts and terminology used in deep learning. Its goal is to provide an introduction to neural networks before describing some of the mathematics behind neurons and activation functions.
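The core object such an overview describes fits in a few lines: a neuron computes a weighted sum of its inputs plus a bias, then applies a nonlinearity. Here that nonlinearity is ReLU, and the weights and inputs are illustrative numbers only.

```python
def relu(z):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """One neuron: weighted sum of inputs plus bias, through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, and relu(0.1) = 0.1
neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

Stacking layers of such neurons, with the activation providing the nonlinearity between them, is all a feed-forward network is.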