What is TensorFrames? TensorFlow + Apache Spark

TensorFrames is an open source created by Apache Spark contributors. Its functions and parameters are named the same as in the TensorFlow framework. Under the hood, it is an Apache Spark DSL (domain-specific language) wrapper for Apache Spark DataFrames. It allows us to manipulate the DataFrames with TensorFlow functionality. And no, it is not pandas DataFrame, it is based on Apache Spark DataFrame.


Measuring Performance: AUPRC

The area under the precision-recall curve (AUPRC) is another performance metric that you can use to evaluate a classification model. If your model achieves a perfect AUPRC, it means your model can find all of the positive samples (perfect recall) without accidentally marking any negative samples as positive (perfect precision.) It’s important to consider both recall and precision together, because you could achieve perfect recall (but bad precision) using a naive classifier that marked everything positive, and you could achieve perfect precision (but bad recall) using a naive classifier that marked everything negative.


Introduction to U-Net and Res-Net for Image Segmentation

Human beings are able to see the images by capturing the reflected light rays which is a very complex task. So how can machines be programmed to perform a similar task? Computer sees the images as matrices which need to be processed to get a meaning out of it. Image segmentation is the method to partition the image into various segments with each segment having a different entity. Convolutional Neural Networks are successful for simpler images but haven’t given good results for complex images. This is where other algorithms like U-Net and Res-Net come into play.


Top 10 Neural Network Architectures

10 Neural Network Architectures
• Perceptrons
• Convolutional Neural Networks
• Recurrent Neural Networks
• Long / Short Term Memory
• Gated Recurrent Unit
• Hopfield Network
• Boltzmann Machine
• Deep Belief Networks
• Autoencoders
• Generative Adversarial Network


AI adoption is being fueled by an improved tool ecosystem

In this post, I share slides and notes from a keynote that Roger Chen and I gave at the 2019 Artificial Intelligence conference in New York City. In this short summary, I highlight results from a – survey (AI Adoption in the Enterprise) and describe recent trends in AI. Over the past decade, AI and machine learning (ML) have become extremely active research areas: the web site arxiv.org had an average daily upload of around 100 machine learning papers in 2018. With all the research that has been conducted over the past few years, it’s fair to say that we now have entered the implementation phase for many AI technologies. Companies are beginning to translate research results and developments into products and services.


The Kalman Filter and (Maximum) Likelihood

I first realized the power of the Kalman Filter during Kaggle’s Web Traffic Time Series Forecasting competition, a contest requiring prediction of future web traffic volumes for thousands of Wikipedia pages. In this contest, simple heuristics like ‘median of medians’ were hard to beat and my own modeling choices were mostly ineffective. Of course I blamed my tools, and wondered if anything in the traditional statistical toolbox was up for this task. Then I read a post by a user known only as ‘os,’ 8th place with Kalman filters.


Machine Learning Python Keras Dropout Layer Explained

Machine learning is ultimately used to predict outcomes given a set of features. Therefore, anything we can do to generalize the performance of our model is seen as a net gain. Dropout is a technique used to prevent a model from overfitting. Dropout works by randomly setting the outgoing edges of hidden units (neurons that make up hidden layers) to 0 at each update of the training phase. If you take a look at the Keras documentation for the dropout layer, you’ll see a link to a white paper written by Geoffrey Hinton and friends, which goes into the theory behind dropout.


Robotic Process Automation

A recent finding from KPMG’s Global Sourcing Advisory Pulse Survey ‘Robotic Revolution’, suggests that technology experts believe ‘The opportunities [from RPA] are many – so are the adoption challenges… For most organisations, taking advantage of higher-end RPA opportunities will be easier said than done.’


Anomaly detection in time series with Prophet library

First of all, let’s define what is an anomaly in time series. Anomaly detection problem for time series can be formulated as finding outlier data points relative to some standard or usual signal. While there are plenty of anomaly types, we’ll focus only on the most important ones from a business perspective, such as unexpected spikes, drops, trend changes and level shifts. You can solve this problem in two way: supervised and unsupervised. While the first approach needs some labeled data, second does not, you need the just raw data. On that article, we will focus on the second approach.


What you need to know: The Modern Open-Source Data Science/Machine Learning Ecosystem

As we have done before (see 2017 data science ecosystem, 2018 data science ecosystem), we examine which tools were part of the same answer – the skillset of the user. We note that this does not necessarily mean that all tools were used together on each project, but having knowledge and skills to used both tools X and Y makes it more likely that both X and Y were used together on some projects. The results we see are consistent with this assumption. The top tools show surprising stability – we see essentially the same pattern as last year.


Why correlation might tell us nothing about outliers

We often hear claims à la ‘there is a high correlation between x and y .’ This is especially true with alleged findings about human or social behaviour in psychology, the social sciences or economics. A reported Pearson correlation coefficient of 0.8 indeed seems high in many cases and escapes our critical evaluation of its real meaning. So let’s see what correlation actually means and if it really conveys the information we often believe it does. Inspired by the funny spurious correlation project as well as Nassim Taleb’s medium post and Twitter rants in which he laments psychologists’ (and not only) total ignorance and misuse of probability and statistics, I decided to reproduce his note on how much information the correlation coefficient conveys under the Gaussian distribution.


Monte Carlo Simulation and Statistical Probability Distributions in Python

One method that is very useful for data scientist/data analysts in order to validate methods or data is Monte Carlo simulation. In this article, you learn how to do a Monte Carlo simulation in Python. Furthermore, you learn how to make different Statistical probability distributions in Python.


Recommendation Systems using UV-Decomposition

There are countless examples of recommender systems in use across nearly every industry in existence today. Most people understand, at a high level, what recommender systems attempt to achieve. However, not many people understand how they work at a deeper level. This is what Leskovec, Rajaraman, and Ullman dive into in Chapter 9 of their book Mining of Massive Datasets. A great example of a recommender system in use today is Netflix. Whenever you log into Netflix there are various sections such as ‘Trending Now’ or ‘New Releases’, but there is also a section titled ‘Top Picks for You’. This section uses a complex formula that tries to estimate which movies you would enjoy the most. The formula takes into account previous movies you have enjoyed as well as other movies people like you have also enjoyed.


Graph Algorithms (Part 2)

Graphs are becoming central to machine learning these days, whether you’d like to understand the structure of a social network by predicting potential connections, detecting fraud, understand customer’s behavior of a car rental service or making real-time recommendations for example. In this article, we’ll cover :
• The main graph algorithms
• Illustrations and use-cases
• Examples in Python


Kernel Secrets in Machine Learning Pt. 2

In the first article on the topic, Kernel Secrets in Machine Learning, I explained kernels in the most basic way possible. Before reading further I would advise you to quickly go through the article to get a feel of what a kernel really is if you have not already. Hopefully, you are going to come to this conclusion: A kernel is a similarity measure between two vectors in a mapped space. Good. Now we can check out and discuss some well-known kernels and also how do we combine kernels to produce other kernels. Keep in mind, for the examples that we are going to use, the x’s are one-dimensional vectors for plotting purposes and we fix the value of x’ to 2. Without further ado, let’s start kerneling.


Kernel Secrets in Machine Learning Pt. 1

This post is not about deep learning. But it could be might as well. This is the power of kernels. They are universally applicable in any machine learning algorithm. Why you might ask? I am going to try to answer this question in this article.


How to learn the maths of Data Science using your high school maths knowledge

This post is a part of my forthcoming book on Mathematical foundations of Data Science. In this post, we use the Perceptron algorithm to bridge the gap between high school maths and deep learning.


What is Quantum Computing and How is it Useful for Artificial Intelligence?

After decades of a heavy slog with no promise of success, quantum computing is suddenly buzzing! Nearly two years ago, IBM made a quantum computer available to the world. The 5-quantum-bit (qubit) resource they now call the IBM Q experience. It was more like a toy for researchers than a way of getting any serious number crunching done. But 70,000 users worldwide have registered for it, and the qubit count in this resource has now quadrupled. With so many promises by quantum computing and data science being at the helm currently, are there any offerings by quantum computing for the AI? Let us explore that in this blog!


Machine Learning : Unsupervised – Hierarchical Clustering and Bootstrapping

This article is based on Unsupervised Learning algorithm: Hierarchical Clustering. This is the brief illustration with a practical working example of forming unsupervised hierarchical clusters and testing them to assure that you have formed the right clusters. This is a real-life data world example which can be studied and evaluated as data is provided for personal use and practice. There are variations to each topic in data science but there is a brief basic pattern that can be followed to build models. ‘The Datum’ empowers you to have access to these basic patterns for your lifetime and building upon them as you progress. Consider ‘The Datum’ blogs as your cookbook hand out which will help you learn, refer, and contribute the relevant topics. All listings and models are implemented in R Language using R Studio, and image instances of my work are embedded in this article for your reference.