Interpreting Machine Learning Models: A Myth or Reality?

Despite the predictive capabilities of supervised machine learning, can we trust the machines? As much as we want models to perform well, we also want them to be interpretable. Yet the task of interpretation often remains vague. Despite the proliferation of machine learning into our daily lives, from finance to justice, a majority of users find these models difficult to understand. The lack of a commonly agreed-upon definition of interpretability means that, rather than being a monolithic concept, interpretability embeds various related concepts.

OECD Principles on AI

The OECD Principles on Artificial Intelligence promote artificial intelligence (AI) that is innovative and trustworthy and that respects human rights and democratic values. They were adopted on 22 May 2019 by OECD member countries when they approved the OECD Council Recommendation on Artificial Intelligence. The OECD AI Principles are the first such principles signed up to by governments. Beyond OECD members, other countries including Argentina, Brazil, Colombia, Costa Rica, Peru and Romania have already adhered to the AI Principles, with further adherents welcomed. The OECD AI Principles set standards for AI that are practical and flexible enough to stand the test of time in a rapidly evolving field. They complement existing OECD standards in areas such as privacy, digital security risk management and responsible business conduct.

Distributed Artificial Intelligence

A primer on Multi-Agent Systems, Agent-Based Modeling, and Swarm Intelligence. Almost two years ago, I paused to think about the future of AI and wrote down some ‘predictions’ about where I thought the field was going. One of those forecasts concerned reaching general intelligence in several years, not through a super-powerful 100-layer deep learning algorithm, but rather through something called collective intelligence. However, except for very obvious applications (e.g., drones), I have not read or seen any big development in the field, so I thought I would dig into it to check what is currently going on. As part of the AI Knowledge Map, then, I will look here not only at Swarm Intelligence (SI) but more generally at Distributed AI, which also includes Agent-Based Modeling (ABM) and Multi-Agent Systems (MAS).

Automating and Accelerating Hyperparameter Tuning for Deep Learning

Deep learning can be tedious work. Taking Long Short-Term Memory (LSTM) networks as an example, we have many hyperparameters (learning rate, number of hidden units, batch size, and so on) waiting for us to find the best combination. Considering the size of a deep learning model, hyperparameter tuning usually takes a long time. One traditional way to automate the tuning process is grid search: trying every possible combination of hyperparameters in the search space and remembering the best one. This method, however, is typically too computationally intensive. To overcome this issue, IBM developed a black-box optimization library called RbfOpt, which is embedded in Experiment Builder.
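The grid search idea itself is simple to sketch. In this minimal Python sketch the hyperparameter names, value grids, and scoring function are illustrative stand-ins (not taken from the article or from RbfOpt); a real run would train the model and return a validation loss:

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [64, 128],
    "batch_size": [32, 64],
}

def evaluate(params):
    # Stand-in for training an LSTM and returning a validation loss.
    return params["learning_rate"] * params["hidden_units"] / params["batch_size"]

def grid_search(space, score_fn):
    """Try every combination in the grid and keep the lowest score."""
    keys = list(space)
    best_params, best_score = None, float("inf")
    for values in product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = grid_search(search_space, evaluate)
```

With 3 × 2 × 2 = 12 combinations this is cheap, but the grid grows exponentially with each added hyperparameter, which is exactly why exhaustive search becomes impractical for deep models.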

A Deep Dive Into Imbalanced Data: Over-Sampling

When implementing classification algorithms, the structure of your data is of great significance. Specifically, the balance between the number of observations for each potential output heavily influences your prediction’s performance (I intentionally avoided using the word ‘accuracy’ for reasons I will later elaborate on in greater detail).
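To make the balance issue concrete, here is a minimal random over-sampling sketch in plain Python — one common remedy for imbalance. The data and helper name are made up for illustration; libraries such as imbalanced-learn provide production implementations:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples (with replacement) until classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())  # majority class size
    X_out, y_out = [], []
    for label, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for xi in samples + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data: four class-0 samples, one class-1 sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
```

After resampling, both classes have the same number of observations, which is the point at issue when a naive model can score high accuracy by always predicting the majority class.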

Deep Learning’s Uncertainty Principle

DeepMind researchers have uncovered two ‘surprising findings’ in a new paper, described in ‘Understanding Deep Learning through Neuron Deletion’: networks that generalize well (1) do not rely on any single neuron and (2) are more robust to damage. This behavior is reminiscent of holograms, and these results are further confirmation of my conjecture that Deep Learning systems are like holographic memories.

GANs Demystified – What the hell do they learn?

This article is a summary of the research paper ‘GAN Dissection: Visualizing and Understanding Generative Adversarial Networks’. The paper provides excellent insight into the internal representation of GANs and comes close to answering the question: what the hell do GANs learn? I’m referring specifically to the generator. We’ve all seen stunning results produced by GANs, in some cases almost indistinguishable from human work. But how they represent learned knowledge is still a mystery. Do they simply learn pixel patterns and composite what they see? Or do they actually capture complex relationships from the training data?

A Beginner’s Guide to Hierarchical Clustering and How to Perform It in Python

Understanding customer behavior is crucial in any industry. I realized this last year when my chief marketing officer asked me, ‘Can you tell me which existing customers we should target for our new product?’ That was quite a learning curve for me. As a data scientist, I quickly realized how important it is to segment customers so my organization can tailor and build targeted strategies. This is where the concept of clustering came in ever so handy! Problems like customer segmentation are often deceptively tricky because we are not working with any target variable in mind. We are officially in the land of unsupervised learning, where we need to figure out patterns and structures without a set outcome in mind. It’s both challenging and thrilling as a data scientist.
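As a toy illustration of the bottom-up idea behind hierarchical clustering, here is a minimal single-linkage agglomerative sketch in plain Python on 1-D points. The data and function name are illustrative; in practice you would use a library routine such as scipy.cluster.hierarchy.linkage:

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # start with every point in its own cluster

    def dist(a, b):
        # single linkage: distance between the closest pair of members
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy 1-D "customers": two tight groups and one outlier.
clusters = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 3)
```

The merge order (closest pairs first) is exactly what a dendrogram records, and cutting the dendrogram at a chosen height yields the segments.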

Cross Validation in One Picture

Cross Validation explained in one simple picture. The method shown here is k-fold cross validation, where the data is split into k folds (in this example, 5 folds). Blue balls represent training data; 1/k (i.e., 1/5) of the balls are held back for model testing. Monte Carlo cross validation works the same way, except that the balls would be chosen with replacement; in other words, it would be possible for a ball to appear in more than one sample.
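The splitting scheme in the picture can be sketched in a few lines of Python. This is a hand-rolled index generator for illustration; scikit-learn's KFold does the same job in practice:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation."""
    # distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))            # held-back fold
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# 10 observations, 5 folds: each fold holds back 1/5 of the data.
folds = list(kfold_indices(10, 5))
```

Each observation appears in exactly one test fold, which is the key difference from Monte Carlo cross validation, where sampling with replacement lets an observation appear in more than one test sample.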

XaaS Business Model: Economics Meets Analytics

Digital capabilities leverage customer, product, and operational insights to digitally transform business models. Nowhere is this more evident than in the rush by industrial companies to transform consumption models by transitioning from selling products to selling [capabilities]-as-a-service (hence, XaaS). For example:
• The key issue for airlines is to maximize their core revenue-generating mechanisms: flight scheduling and the hours that the airplane is actually flying. So instead of looking at the features of the jet engine, GE turned its attention to helping airlines generate more revenue; GE moved from selling engines to offering Thrust (engines)-as-a-service[1].
• Kaeser Kompressoren, which manufactures large air compressors, leverages sensors on its equipment to capture product usage, performance, and condition data off the machines. Kaeser used the product and operational insights gained from these data sources to start selling air by the cubic meter through compressors that it owns and maintains … compressed Air-as-a-service[2].
But let’s be honest: anyone can create an XaaS business model. The key is not creating an XaaS business model; the key is creating a profitable one. That means organizations moving to an XaaS business model must master operational excellence (remote monitoring, sensors, predictive maintenance, first-time fix, inventory optimization, technician scheduling, asset utilization), pricing, and agreed-upon customer Service Level Agreement (SLA) requirements to ensure the business model succeeds.

Frame a problem as a machine learning problem or otherwise

It is not as simple as choosing a machine learning method and letting it loose on the data. In particular, understanding the core business problem and the objective of the outcome, and framing the task accordingly, is one of the vital factors in machine learning. A general approach is difficult to recommend without intimate knowledge of the data; it sounds like we need to formalize the aspects of the model. The following questions may help decide whether to treat a problem as a machine learning problem or otherwise:
1. What am I trying to predict? What are my outcomes?
2. What data can I use to train my model, and what are my inputs? What market factors can I train my model with to predict the outcomes?

Multiple Linear Regression in Python

One of the most in-demand machine learning skills is linear regression. In this article, you will learn how to conduct a multiple linear regression in Python.
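As a preview, a multiple linear regression can be fit in a few lines with NumPy's least-squares solver. The data below is synthetic, generated from made-up coefficients, so the fit recovers them exactly:

```python
import numpy as np

# Synthetic data: y = 1 + 2*x1 + 3*x2 (coefficients chosen for illustration).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# Add an intercept column, then solve min ||Xb @ beta - y||^2.
Xb = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
# beta holds [intercept, coefficient of x1, coefficient of x2]
```

On real data the residuals would not be zero, and you would typically reach for statsmodels or scikit-learn to get diagnostics alongside the coefficients.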

Different Ways to Manage Apache Spark Applications on Amazon EMR

Technology has been advancing rapidly in the last few years, and so has the amount of data being generated. There is a plethora of sources generating unstructured data that carries a huge amount of information if mined correctly. These varieties of voluminous data are known as Big Data, which traditional computers and storage systems are incapable of handling. To mine big data, the concept of parallel computing on clusters came into play, popularly known as Hadoop. Hadoop has several components that not only store the data across a cluster but also process it in parallel. HDFS, the Hadoop Distributed File System, stores the big data, while the MapReduce technique processes it.
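The MapReduce programming model is easy to sketch with the classic word count. This toy version runs sequentially in Python for illustration; on Hadoop the map and reduce phases would run in parallel across the cluster, with a shuffle/sort step grouping keys between them:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the values for each key (grouping by key stands in
    # for Hadoop's shuffle/sort step).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["big data big", "data lake"]))
```

The same two-phase structure is what Spark generalizes with its richer set of transformations, which is why it runs so naturally on EMR clusters.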

Two New Ways to Make DNS over HTTPS Queries in R

A fair bit of time ago the {gdns} package made its way to CRAN to give R users the ability to use Google’s (at that time) nascent support for DNS over HTTPS (DoH). A bit later on, Cloudflare also provided a global DoH endpoint, and that begat the (not-on-CRAN) {dnsflare} package. There are actually two ways to make these DoH queries: one via an HTTPS GET REST API and the other via HTTPS POST queries that use DNS wireformat queries and replies. While the POST side of DoH is pretty standardized/uniform, the GET/REST API side is kind of the Wild West. I wanted a way to have support for both wireformat and REST idioms but also not have to write a gazillion packages to support the eventual plethora of diverse DoH GET/REST API services. I ‘solved’ this by first augmenting my (not-on-CRAN) {clandnstine} package to support the POST wireformat DoH queries (since the underlying {getdns} library supports decoding wireformat responses) and creating a very small {playdoh} package which provides generic support for (hopefully) any DoH GET/REST endpoint.
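For readers curious what ‘DNS wireformat’ means, here is a minimal Python sketch that builds the raw query message a DoH POST carries in its body (RFC 8484 sends it with Content-Type application/dns-message). No network request is made, and the helper name is made up:

```python
import struct

def dns_query(name, qtype=1, qclass=1, txid=0):
    """Build a minimal DNS wireformat query (qtype 1 = A record)."""
    # 12-byte header: ID, flags (0x0100 = recursion desired), 1 question,
    # and zero answer/authority/additional records.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, qclass)

msg = dns_query("example.com")
```

The GET/REST idiom, by contrast, usually base64url-encodes this same message (or passes name/type as query parameters), which is exactly the part that varies between providers.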

RStudio in Docker – now share your R code effortlessly!

If you are a full-time data science practitioner and have passed through the stages of starting out with the Titanic dataset and working through the various exercises on Kaggle, you will know by now that we wish real-world data problems were that simple, but they are not! This post is about just one of the many challenges one could face: sharing your R code with someone who does not use R and who doesn’t have time to install every single dependency your code needs to run on their system. The simple solution: Docker!
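As a sketch of the idea, a Dockerfile along these lines is often all it takes. The rocker/rstudio base image is a real, widely used image that ships RStudio Server; the version tag, package list, and script name here are illustrative assumptions:

```dockerfile
# Base image with R and RStudio Server preinstalled (tag is illustrative).
FROM rocker/rstudio:4.3.1

# Install the packages your script depends on (list is illustrative).
RUN R -e "install.packages(c('dplyr', 'ggplot2'), repos = 'https://cloud.r-project.org')"

# Copy your analysis into the default user's home directory.
COPY analysis.R /home/rstudio/analysis.R

# RStudio Server listens on port 8787; a recipient would run:
#   docker build -t my-r-env . && docker run -p 8787:8787 my-r-env
# then open http://localhost:8787 in a browser.
```

The recipient needs only Docker installed; every R dependency travels inside the image.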

[R]eady for Production: a Joint Event with RStudio and EODA

We’re excited to team up with EODA, an RStudio Full Service Certified Partner, to host a free data science in production event in Frankfurt, Germany, on June 13. This one-day event will be geared for data science and IT teams that want to learn how to integrate their analysis solutions with the optimal IT infrastructure. This is a great chance to work in smaller groups with experts from EODA and RStudio on best-practice approaches to productive data-science architectures, and to see real-world solutions to deployment problems. With sessions in English and German, the conference will start with a high-level overview of the right interaction between data science and IT, and then focus on more hands-on solutions to engineering problems such as building APIs with Plumber and deploying to RStudio Connect, using Python and SQL in the RStudio IDE, and Shiny load testing.

Gale–Shapley algorithm simply explained

In this article, you will learn about the stable pairing (or stable marriage) problem, and how to solve it using game theory and the Gale-Shapley algorithm in particular. We will use Python to create our own solution based on the original 1962 paper.
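As a preview, the proposer-optimal version of the algorithm fits in a short Python function. The preference lists below are made-up toy data:

```python
def gale_shapley(proposer_prefs, responder_prefs):
    """Proposer-optimal stable matching (Gale & Shapley, 1962)."""
    # rank[r][p]: how responder r ranks proposer p (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in responder_prefs.items()}
    free = list(proposer_prefs)            # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}
    engaged = {}                           # responder -> proposer
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]  # best not-yet-tried option
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p                 # r accepts the first proposal
        elif rank[r][p] < rank[r][engaged[r]]:
            free.append(engaged[r])        # r trades up; old partner freed
            engaged[r] = p
        else:
            free.append(p)                 # rejected; p will propose again
    return {p: r for r, p in engaged.items()}

# Toy instance: two proposers (A, B) and two responders (X, Y).
men = {"A": ["X", "Y"], "B": ["Y", "X"]}
women = {"X": ["A", "B"], "Y": ["B", "A"]}
match = gale_shapley(men, women)
```

In this instance everyone gets their first choice, so the matching is trivially stable: no pair would prefer each other over their assigned partners.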