AVBytes: AI & ML Developments this week – IBM’s Library 46 Times Faster than TensorFlow, Baidu’s Massive Self-Driving Dataset, the Technology behind AWS SageMaker, etc.

In recent times, one of the more popular themes in the machine learning world has been computational power. As the amount of data collected continues to rise unabated, organizations are lagging behind because their hardware is not up to scratch. But tech giants like IBM, Google and Amazon are working on products that can handle gigantic datasets using less computational power. We, at Analytics Vidhya, are covering these developments, along with all other major stories in the ML world, on AVBytes! We provide links to official research papers so you can dive deep into the theory behind the technology. We also provide links to the source code on GitHub so you can replicate it (and even improve it!) on your own machine. In the past week, we saw the big names grabbing the headlines: Amazon unveiled the technology behind its AWS SageMaker, IBM developed a library that ran the same model on the same data 46 times faster than TensorFlow, Baidu open sourced its massive self-driving dataset, SAS developed an ML model to rank the best places to live, etc.

Weighted Linear Regression in R

If you are like me, back in engineering school you learned linear regression as a way to “fit a line to data” and probably called it “least squares”. You probably extended it to multiple variables affecting a single dependent variable. In a statistics class you had to calculate a bunch of stuff and estimate confidence intervals for those lines. And that was probably about it for a long time, unless you were focusing on math or statistics. You may have picked up along the way that there are assumptions built into the decision to use “ordinary least squares”.
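To make the contrast with ordinary least squares concrete, here is a minimal sketch of a weighted fit for a single predictor. It is written in Python rather than R, and the function name `weighted_line_fit` and the sample data are illustrative assumptions, not code from the linked article. Each point contributes to the squared-error sum in proportion to its weight; with equal weights the result is the ordinary least squares line.

```python
def weighted_line_fit(x, y, w):
    """Fit y ~ intercept + slope * x by minimizing sum(w_i * residual_i ** 2)."""
    total_w = sum(w)
    # Weighted means of x and y.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    # Weighted covariance of (x, y) and weighted variance of x (unnormalized).
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: four points on y = 1 + 2x plus one wild outlier.
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 100]

ols = weighted_line_fit(x, y, [1, 1, 1, 1, 1])        # equal weights = OLS
wls = weighted_line_fit(x, y, [1, 1, 1, 1, 0.001])    # outlier nearly ignored
```

In practice the weights are often chosen as the inverse of each observation's variance, so noisier points pull on the line less.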

Machine Learning for Diabetes with Python

About one in seven U.S. adults has diabetes now, according to the Centers for Disease Control and Prevention. But by 2050, that rate could skyrocket to as many as one in three. With this in mind, here is what we are going to do today: learn how to use machine learning to help us predict diabetes. Let’s get started!

Getting Value from Machine Learning Isn’t About Fancier Algorithms — It’s About Making It Easier to Use

Machine learning can drive tangible business value for a wide range of industries — but only if it is actually put to use. Despite the many machine learning discoveries being made by academics, new research papers showing what is possible, and an increasing amount of data available, companies are struggling to deploy machine learning to solve real business problems. In short, the gap for most companies isn’t that machine learning doesn’t work, but that they struggle to actually use it. How can companies close this execution gap? In a recent project we illustrated the principles of how to do it. We used machine learning to augment the power of seasoned professionals — in this case, project managers — by allowing them to make data-driven business decisions well in advance. And in doing so, we demonstrated that getting value from machine learning is less about cutting-edge models, and more about making deployment easier.

Moving Towards Managing AI Products

A successful product has consistent behavior, meets or exceeds user expectations, and significantly contributes to the top-line growth of the business. It is vital for a Product Manager to set and manage the expectations of users, gather quantifiable feedback regularly, communicate it rigorously to engineers, and make sure the product pragmatically evolves with business and market transitions. AI products, however, can differ significantly from traditional products. For example, in my prior experience as a Product Manager, success was measured through delivery of a ‘deterministic’ product that always delighted customers: a hardware product has the same behavior under standard conditions, and the same user actions in a software product result in the same expected response. An AI-driven product, however, may not always behave deterministically and may in fact produce counter-intuitive results: a personalized recommender system may respond differently to the same user action after learning additional preferences.

Text Data Preprocessing: A Walkthrough in Python

This post will serve as a practical walkthrough of a text data preprocessing task using some common Python tools.
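As a rough illustration of what such a preprocessing task typically involves, the sketch below lowercases text, strips punctuation, splits on whitespace, and drops stop words using only the Python standard library. The function name `preprocess` and the tiny `STOP_WORDS` set are assumptions made for this example; the linked walkthrough may use different tools and steps.

```python
import string

# A deliberately tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "this"}


def preprocess(text):
    """Return a list of normalized tokens from raw text."""
    # 1. Normalize case.
    text = text.lower()
    # 2. Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize on whitespace.
    tokens = text.split()
    # 4. Drop stop words.
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("This is a Test, of the pipeline!")
print(tokens)  # ['test', 'pipeline']
```

Real pipelines usually add steps such as stemming or lemmatization and rely on a much larger stop-word list.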

5 Things You Need to Know about Sentiment Analysis and Classification

We take a look at the important things you need to know about sentiment analysis, including social media, classification, evaluation metrics and how to visualise the results.

15 Types of Regression you should know

• Linear Regression
• Polynomial Regression
• Logistic Regression
• Quantile Regression
• Ridge Regression
• Lasso Regression
• ElasticNet Regression
• Principal Component Regression
• Partial Least Squares Regression
• Support Vector Regression
• Ordinal Regression
• Poisson Regression
• Negative Binomial Regression
• Quasi-Poisson Regression
• Cox Regression

Exploring the underlying theory of the chi-square test through simulation – part 2

In the last post, I tried to provide a little insight into the chi-square test. In particular, I used simulation to demonstrate the relationship between the Poisson distribution of counts and the chi-squared distribution. The key point in that post was the role conditioning plays in that relationship by reducing variance. To motivate some of the key issues, I talked a bit about recycling. I asked you to imagine a set of bins placed in different locations to collect glass bottles. I will stick with this scenario, but instead of just glass bottle bins, we now also have cardboard, plastic, and metal bins at each location. In this expanded scenario, we are interested in understanding the relationship between location and material. A key question that we might ask: is the distribution of materials the same across the sites? (Assume we are still just counting items and not considering volume or weight.)
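The question of whether materials are distributed the same way across sites can be sketched with a small simulation and a hand-computed Pearson chi-square statistic. This is an illustrative Python sketch, not the author's code: the function names, the four equal material probabilities, and the fixed number of items per site are all assumptions made here.

```python
import random


def simulate_counts(n_sites, probs, items_per_site, rng):
    """Simulate a site-by-material table where every site shares the same probs."""
    table = []
    for _ in range(n_sites):
        # Draw a material index for each collected item at this site.
        draws = rng.choices(range(len(probs)), weights=probs, k=items_per_site)
        table.append([draws.count(m) for m in range(len(probs))])
    return table


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if site and material were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


rng = random.Random(42)
# Three sites; glass, cardboard, plastic, metal equally likely (assumed).
table = simulate_counts(3, [0.25, 0.25, 0.25, 0.25], 200, rng)
stat, dof = chi_square_statistic(table)
```

Repeating the simulation many times and collecting `stat` gives an empirical distribution that can be compared against a chi-square distribution with `dof` degrees of freedom.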

Deep Learning from first principles in Python, R and Octave – Part 5

In this 5th part on Deep Learning from first Principles in Python, R and Octave, I solve the MNIST data set of handwritten digits (shown below), from the basics. To do this, I construct an L-layer, vectorized Deep Learning implementation in Python, R and Octave from scratch and classify the MNIST data set. The MNIST training data set contains 60000 handwritten digits from 0-9, and the test set contains 10000 digits. MNIST is a popular dataset for running Deep Learning tests, and has been rightfully termed the ‘drosophila’ of Deep Learning by none other than the venerable Prof Geoffrey Hinton.
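To give a feel for what an L-layer network computes, here is a minimal forward-pass sketch in plain Python. It is not the author's implementation and is deliberately not vectorized; the choice of ReLU hidden layers with a softmax output, along with every function name and the toy weights, is an assumption for illustration.

```python
import math


def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert scores to probabilities; subtract the max for numerical stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, biases, inputs):
    """Compute W @ inputs + b for one layer, one row of W per output unit."""
    return [
        sum(w * a for w, a in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(layers, inputs):
    """Run inputs through every (weights, biases) pair in order."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        z = affine(weights, biases, activations)
        is_output_layer = index == len(layers) - 1
        activations = softmax(z) if is_output_layer else relu(z)
    return activations


# Toy 2-layer network with made-up weights: 2 inputs -> 2 hidden -> 2 classes.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
probabilities = forward(layers, [1.0, 2.0])
```

For MNIST the input would be a 784-element pixel vector and the output layer would have 10 units, one per digit; training additionally requires a loss function and backpropagation, which this sketch omits.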

1.1 Billion Taxi Rides: EC2 versus EMR

In this blog post I wanted to take two ways of running Hadoop jobs that cost less than $3.00 / hour on Amazon Web Services (AWS) and see how they compare in terms of performance. The $3.00 price point was driven by the first method: running a single-node Hadoop installation. I wanted to make sure the dataset used in this benchmark could easily fit into memory. The price set the limit for the second method: AWS EMR. This is Amazon’s Hadoop Platform offering. It has a huge feature set, but the key feature is that it lets you set up Hadoop clusters with very little instruction. The $3.00 price limit includes the service fee for EMR.