RStudio 1.2 Preview: The Little Things

Today, we’re continuing our blog series on new features in RStudio 1.2. If you’d like to try these features out for yourself, you can download a preview release of RStudio 1.2. In this blog series thus far, we’ve focused on the biggest new features in RStudio 1.2. Today, we’ll take a look at some of the smaller ones.

Overview of Atom IDE

In this tutorial, you’ll learn the importance of IDEs, how to set up Atom, and how to download packages. Getting started with data science? Can’t decide what language is best for you: R, Python or both? Well, the best solution is to jump right into it and get your hands dirty with code. But wait, code where?

AdaBoost Classifier in Python

Understand the ensemble approach and how the AdaBoost algorithm works, and learn to build an AdaBoost model in Python.

9 obscure Python libraries for data science

Go beyond pandas, scikit-learn, and matplotlib and learn some new tricks for doing data science in Python.
• Wget
• Pendulum
• Imbalanced-learn
• FlashText
• FuzzyWuzzy
• PyFlux
• Ipyvolume
• Dash
• Gym

An Introduction to AI

This article is an introduction to AI key terminologies and methodologies on behalf of myself and DLS (

Microsoft Move To Blockchain Linking its Major Products to the Tech

The blockchain is actually a digital record in which the transactions made in cryptocurrencies like Bitcoin or Ethereum are entered publicly and chronologically. Microsoft was the first company to take blockchain to the cloud, with its Microsoft Azure product, about three years ago. The tech giant has quietly been establishing links between its blockchain services and other widely used platforms, like Office Outlook, Office 365, SharePoint Online, Salesforce, SAP, Twitter, and Dynamics 365 Implementation Online, as mentioned by Matt Kerner, Microsoft Azure’s general manager. The idea was to enable Microsoft clients to port their information from all these platforms into the cloud, and from there onto a blockchain.

Starting to develop in PySpark with Jupyter installed in a Big Data Cluster

It is no secret that data science tools like Jupyter, Apache Zeppelin, or the more recently launched Cloud Datalab and JupyterLab are must-knows for day-to-day work. So how can the ease of developing models be combined with the computing capacity of a Big Data cluster? In this article I will share some very simple steps to start using Jupyter notebooks for PySpark in a Dataproc cluster in GCP.

Using a Keras Long Short-Term Memory (LSTM) Model to Predict Stock Prices

In this tutorial, we’ll build a Python deep learning model that will predict the future behavior of stock prices. We assume that the reader is familiar with the concepts of deep learning in Python, especially Long Short-Term Memory. While predicting the actual price of a stock is an uphill climb, we can build a model that will predict whether the price will go up or down. The data and notebook used for this tutorial can be found here. It’s important to note that there are always other factors that affect the prices of stocks, such as the political atmosphere and the market. However, we won’t focus on those factors for this tutorial.

Autonomy – Do we have the choice?

Why is it hard for humans to make some decisions? Whenever we have to make a complex decision, we have to deal with rationality, emotions and our beliefs. Making decisions on certain issues is sometimes a cognitive load. We find the situation complex and baffling. Sometimes we avoid difficult decisions altogether and leave the situation as it is for years.

Word Morphing

How to employ word2vec’s embeddings and the A* search algorithm to morph between words.

How Important is that Machine Learning Model be Understandable? We analyze poll results

About 85% of respondents said it was always or frequently important that a Machine Learning model be understandable. This was especially important for academic researchers, and, surprisingly, more so in US/Canada than in Europe or Asia.

R > Python: a Concrete Example

I like both Python and R, and teach them both, but for data science R is the clear choice. When asked why, I always note (a) written by statisticians for statisticians, (b) built-in matrix type and matrix manipulations, (c) great graphics, both base and CRAN, (d) excellent parallelization facilities, etc. I also like to say that R is ‘more CS-ish than Python,’ just to provoke my fellow computer scientists. But one aspect that I think is huge but probably gets lost when I cite it is R’s row/column/element-name feature. I’ll give an example here.

Cognitive Services in Containers

I’ve posted several examples here of using Azure Cognitive Services for data science applications. You can upload an image or video to the service and extract information about faces and emotions, generate a caption describing a scene from a provided photo, or speak written text in a natural voice. (If you haven’t tried the Cognitive Services tools yet, you can try them out using the instructions in this notebook using only a browser.) But what if you can’t upload an image or text to the cloud? Sending data outside your network might be subject to regulatory or privacy policies. And if you could analyze the images or text locally, your application could benefit from reduced latency and bandwidth.

Many Factor Models

Today, we will return to the Fama French (FF) model of asset returns and use it as a proxy for fitting and evaluating multiple linear models. In a previous post, we reviewed how to run the FF three-factor model on the returns of a portfolio. That is, we ran one model on one set of returns. Today, we will run multiple models on multiple streams of returns, which will allow us to compare those models and hopefully build a code scaffolding that can be used when we wish to explore other factor models. Let’s get to it!

Evaluation Metrics for Recommender Systems

Recommender systems are growing progressively more popular in online retail because of their ability to offer personalized experiences to unique users. Mean Average Precision at K (MAP@K) is typically the metric of choice for evaluating the performance of a recommender system. However, the use of additional diagnostic metrics and visualizations can offer deeper and sometimes surprising insights into a model’s performance. This article explores Mean Average Recall at K (MAR@K), Coverage, Personalization, and Intra-list Similarity, and uses these metrics to compare three simple recommender systems.
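As a rough illustration of the headline metric, here is a minimal sketch of MAP@K in Python. It assumes one common convention (the sum of precision values at each hit, divided by the smaller of K and the number of relevant items); the article itself may define the metric slightly differently, and the function names are made up for this example.

```python
from typing import Hashable, Sequence


def average_precision_at_k(
    relevant: Sequence[Hashable], recommended: Sequence[Hashable], k: int
) -> float:
    """AP@K for one user, under the convention assumed in the lead-in."""
    relevant_set = set(relevant)
    if not relevant_set or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    # Walk the top-k recommendations in ranked order.
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant_set and item not in seen:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
        seen.add(item)  # duplicate recommendations are not counted twice
    return precision_sum / min(len(relevant_set), k)


def mean_average_precision_at_k(
    all_relevant: Sequence[Sequence[Hashable]],
    all_recommended: Sequence[Sequence[Hashable]],
    k: int,
) -> float:
    """MAP@K: the mean of AP@K over all users."""
    scores = [
        average_precision_at_k(rel, rec, k)
        for rel, rec in zip(all_relevant, all_recommended)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because each hit is weighted by the precision at its rank, a relevant item placed first contributes more than the same item placed fourth, which is why MAP@K rewards good ordering and not just good selection.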

Building a book Recommendation System using Keras

A recommendation system seeks to predict the rating or preference a user would give to an item, given that user’s past item ratings or preferences. Recommendation systems are used by pretty much every major company in order to enhance the quality of their services. In this article, we will take a look at how to use embeddings to create a book recommendation system.

Gaussian Mixture Model clusterization: how to select the number of components (clusters)

If you landed on this post, you probably already know what a Gaussian Mixture Model is, so I will avoid the general description of this technique. But if you are not aware of the details, you can just see the GMM as a k-means which is able to form stretched clusters, like the ones you can see in Figure 2. All the code used for this post is in this notebook. In the same repository you can find the data to fully replicate the results you see plotted.

How to create and deploy a Kubeflow Machine Learning Pipeline (Part 1)

Google Cloud recently announced an open-source project to simplify the operationalization of machine learning pipelines. In this article, I will walk you through the process of taking an existing real-world TensorFlow model and operationalizing the training, evaluation, deployment, and retraining of that model using Kubeflow Pipelines (KFP in this article).

The conceptual arithmetics of concepts

One of my favorite books I read recently is ‘Surfaces and essences: analogy as the fuel and fire of thinking’ by Douglas Hofstadter. In this book, the author’s central thesis is that categorization is central to thinking and analogy-making is the core of cognition. Hofstadter argues that concepts are not rigid but rather fluid and blurry, and can’t be strictly hierarchical. He argues that cognition takes place thanks to a constant flow of categorizations, in contrast to classification (which aims to put all things into fixed and rigid mental boxes).

Announcement: TensorFlow 2.0 is coming!

The eagerly awaited update for the popular machine learning framework TensorFlow was announced earlier in August by Martin Wicke from Google AI. The exciting news was announced on his Google Group and it has already caused a buzz around the next major version of the framework, TensorFlow 2.0. If you’re excited like me and eager to stay up to date with the details of 2.0 development, I strongly encourage you to subscribe to the Google Group! What makes this more appealing is that you can be a part of the coming public design reviews and even contribute to the features of TensorFlow 2.0 by voicing your concerns and proposing changes! This is exactly why I’m in love with open source development as the community works together and supports one another for the common goals.

How Amazon Alexa works? Your guide to Natural Language Processing (AI)

We can talk to almost all of the smart devices now, but how does it work? When you ask ‘What song is this?’, what technologies are being used?

Five Not Well-Known Machine Learning Architectures that will Help You Move from Pilot to Production

Despite the hype surrounding machine learning and artificial intelligence (AI), most efforts in the enterprise remain in a pilot stage. Part of the reason for this phenomenon is the natural experimentation associated with machine learning projects, but there is also a significant component related to the lack of maturity of machine learning architectures. This problem is particularly visible in enterprise environments in which the new application lifecycle management practices of modern machine learning solutions conflict with corporate practices and regulatory requirements. What are the key architecture building blocks that organizations should put in place when adopting machine learning solutions? The answer is not trivial, but recently we have seen some efforts from research labs and AI data science that are starting to lay down the path of what can become reference architectures for large scale machine learning solutions.

Beyond Word Embeddings Part 4 – Introducing Semantic Structure to Neural NLP

Since the advent of word2vec, neural word embeddings have become a go-to method for encapsulating distributional semantics in NLP applications. This series will review the strengths and weaknesses of using pre-trained word embeddings and demonstrate how to incorporate more complex semantic representation schemes such as Semantic Role Labeling, Abstract Meaning Representation and Semantic Dependency Parsing into your applications.

Deep Learning for the Masses (… and The Semantic Layer)

Deep learning is everywhere right now: in your watch, in your television, in your phone, and in some way in the platform you are using to read this article. Here I’ll talk about how you can start changing your business using Deep Learning in a very simple way. But first, you need to know about the Semantic Layer.

Recurrent Neural Networks for Language Understanding

Recurrent Neural Networks (RNNs) have been credited with achieving state-of-the-art performance in machine translation, sentiment analysis, speech recognition and many other machine learning tasks. Their strengths lie in their ability to process sequential data and outputs and inputs of various lengths. These miraculous networks aren’t anything new. In fact, they were developed in the 1980s, but being more computationally costly than non-recurrent neural networks, it wasn’t until (relatively) recent improvements in computational resources that RNNs took off in popularity. This article will introduce Recurrent Neural Networks in the context of NLP.

Reinforcement Learning (Q-Learning) with Decision Trees

Reinforcement learning (RL) is a paradigm in machine learning where a computer learns to perform tasks such as driving a vehicle, playing Atari games, and beating humans in the game of Go, with little to no supervision from human experts. Several RL algorithms have been named among the most interesting breakthroughs of 2017. Everyone was excited about the new possibilities. I was excited.

Utilizing Artificial Intelligence Cloud Solutions for Business

Delving into artificial intelligence solutions to streamline business processes can seem like a daunting task. Fortunately, there is a growing number of resources and cloud based services that can be leveraged to either outsource or facilitate AI endeavors for business.

Faster R-CNN (object detection) implemented by Keras for custom data from Google’s Open Images Dataset V4

After exploring CNNs for a while, I decided to try another crucial area in Computer Vision: object detection. There are several popular methods in this area, including Faster R-CNN, RetinaNet, YOLOv3, SSD, etc. I tried Faster R-CNN in this article. Here, I want to summarise what I have learned and maybe give you a little inspiration if you are interested in this topic.

Use causal graphs!

This is the second post of a series about causality in data science. You can check the first one: ‘Why do we need causality in data science?’. As we said, there are currently two principal frameworks for working with causality: potential outcomes and graphs. Here we will continue explaining why causal inference is necessary and how graphs help with it.

Why do we need causality in data science?

This is a series of posts explaining why we need causal inference in data science and machine learning. Causal inference brings a fresh set of tools and perspectives that let us deal with old problems.

Interpretable Neural Networks

Interpreting black box models is a significant challenge in machine learning, and doing so can significantly reduce barriers to adoption of the technology. In a previous post, I discussed interpreting complex machine learning models using SHAP values. To summarize, for a particular feature, the prediction of a model for a specific data point is compared when it can see the feature and when it can’t; the magnitude of this difference tells us how important that feature is to the model’s prediction.
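The "with and without the feature" comparison can be sketched in a few lines of Python. This is a deliberate simplification and not the SHAP algorithm itself: SHAP averages such differences over many combinations of features, whereas this sketch hides a single feature by replacing it with its mean over a background dataset. The function name and the toy model are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def ablation_importance(
    predict: Callable[[Sequence[float]], float],
    background: Sequence[Sequence[float]],
    point: Sequence[float],
    feature_index: int,
) -> float:
    """Absolute change in the prediction when one feature is hidden.

    "Hiding" is approximated by substituting the feature's mean value
    over the background rows (an assumption made for this sketch).
    """
    baseline_value = mean(row[feature_index] for row in background)
    masked = list(point)
    masked[feature_index] = baseline_value
    return abs(predict(point) - predict(masked))


# Toy model: the prediction depends only on the first feature.
def toy_model(x: Sequence[float]) -> float:
    return 3.0 * x[0] + 0.0 * x[1]


background_rows = [[0.0, 5.0], [2.0, 7.0], [4.0, 9.0]]
data_point = [4.0, 9.0]

importance_first = ablation_importance(toy_model, background_rows, data_point, 0)
importance_second = ablation_importance(toy_model, background_rows, data_point, 1)
```

In this toy setup the first feature produces a large difference and the second produces none, matching the intuition that the magnitude of the difference reflects how much the model relies on the feature.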

Building The AI Stack

As the use of machine learning (and specifically compute-intensive deep learning technology) is booming across research and industry, the market for building the machine learning stack has exploded. This is perfectly illustrated by technology giants such as Google, Amazon and Microsoft releasing cloud products targeted at making ML (Machine Learning) technology easier to develop, and eventually driving cloud sales. This trend has also been highlighted by a large number of startups building infrastructure and tools, from data preparation (image labelling), to training (model optimization), to deployment.

Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning – Paper Summary

In December 2017, Uber AI Labs released five papers related to the topic of neuroevolution, a practice where deep neural networks are optimised by evolutionary algorithms. This post is a summary of one of those papers, called ‘Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning’. It is intended for those with some basic familiarity with topics related to machine learning. Concepts such as ‘genetic algorithms’ and ‘gradient descent’ are prerequisite knowledge. Much of the research in the five Uber AI Labs papers, including the paper I am summarising, builds on research by OpenAI presented in the paper ‘Evolution Strategies as a Scalable Alternative to Reinforcement Learning’. OpenAI have written a blog post summarising their paper. A shorter summary (written by me) can be found here. The code used for the experiments in this paper can be found here. In April 2018, the code was optimised to run on a single personal computer. The work to achieve this is described in an Uber AI Labs blog post, and the specific code can be found here.

Understanding the scaling of L² regularization in the context of neural networks

Have you ever looked at the L² regularization term of a neural network’s cost function and wondered why it is scaled by both 2 and m?
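For readers who want to see the scaling concretely, here is a small Python sketch. It assumes the common convention in which the penalty is λ/(2m) times the sum of squared weights, with m the number of training examples; the linked article may use different notation, and the function names are invented for this example.

```python
from typing import List, Sequence


def l2_penalty(weights: Sequence[float], lam: float, m: int) -> float:
    """L2 regularization term: (lam / (2 * m)) * sum of squared weights."""
    return lam / (2 * m) * sum(w * w for w in weights)


def l2_penalty_gradient(weights: Sequence[float], lam: float, m: int) -> List[float]:
    """Gradient of the penalty with respect to each weight.

    Differentiating w**2 yields 2 * w, which cancels the 2 in the
    denominator and leaves (lam / m) * w.
    """
    return [lam / m * w for w in weights]


example_weights = [1.0, -2.0, 3.0]
penalty = l2_penalty(example_weights, lam=0.5, m=10)
gradient = l2_penalty_gradient(example_weights, lam=0.5, m=10)
```

Under this convention the 2 exists only to tidy up the derivative, while dividing by m puts the penalty on the same per-example scale as a data loss that is averaged over m examples.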

The Math of Data Science – 3

I am halfway through my journey of becoming mathematically literate enough to understand and work comfortably with Data Science books, posts, articles and journals. This is the third article in the series; here are part 1 and part 2.

The Math of Data Science – 2

Earlier, in part 1 of this series, I wrote about how I started learning Mathematics when I began my Data Science journey. Originally I wanted to do Data Science, but I was not sure what Data Science was or how it differed from Data Analysis and Machine Learning (I started back in April or May). The titles of Data Analyst, Machine Learning Engineer, Business Analyst and Data Scientist were all a source of confusion, which arose primarily from the fact that ‘titles looked so different but work given to the employees who have these titles did not have much of boundaries’. Data Scientists were being hired as Data Analysts, and Machine Learning Engineers were working like Data Scientists. Only very recently has this confusion begun to clear up, as the industry gives the titles meaning with well-defined boundaries. Throughout all that confusion, Probability, Statistics and Linear Algebra remained the three primary mathematical subjects for Data Science. Anyone interested in one of those titles could not have gone far without knowing these three; you can’t avoid them. So I started reading Statistics and Probability and, as I mentioned earlier, whatever book I picked up had integral and derivative symbols in it. Reading books became quite frustrating because I could not comprehend integration and differentiation. So I had to study integrals and derivatives, and for that I had to learn Algebra, Geometry and Trigonometry.

The Math of Data Science

Around April this year, I decided to leave Software Development behind and begin my career in Data Science. I think that to be happy, one has to find a balance between one’s interests and one’s profession. Everyone is moving from creating software to using software (MIT chose Python over Scheme, John Hopcroft started a book on Data Science, the software industry has learned its lessons, and everything is being automated and/or moved onto the cloud). That reminds me of the Bhagavad Gita, which tells us that the only thing permanent in life is change. So my decision came in May and I started using R, but then I noticed the requirements of the industry and how the industry itself was changing direction, and hence by July I shifted to Python. Then the story of Math started.

Probability Part 1: Probability for Everyone

Inspired by a course which I am taking in probability theory, this blogpost is an attempt to explain the fundamentals of the mathematical theory of probability at an intuitive level. As the title suggests, this post is pretty much intended for everyone, regardless of mathematical level or ability. There will be some mathematics, but feel free to skip through these sections. The important stuff comes in between the mathematics.

Different Approaches for an AI-Based Facial Recognition using Deep Learning

In this article, I want to take the example of facial recognition applied in a retail store. Indeed, this technology can be used in many ways:
• Improve in-store personalization
• Provide a one-to-one personalized shopping experience
• Understand visitors’ buying patterns
Back in the day, the moment you walked into your local store, the business owner would recognize you, greet you and perhaps offer you something. Today, through marketing, the goal for retailers is to recreate such shopping experiences with facial recognition.

Detecting Data Leakage before it’s too late

After reading Susan Li’s Expedia Case study article, I wanted to see if I could reproduce the results using AuDaS, Mind Foundry’s automated Machine Learning platform. The data is available on Kaggle and contains customer web analytics information for hotel bookings (true and false). The goal of this competition is to predict whether a customer will make a reservation or not. However, after cleaning the data and building my model, within 3 minutes of training I had reached a classification accuracy of 100%, which immediately triggered an alarm. I was a victim of data leakage.