All About Using Jupyter Notebooks and Google Colab

Interactive notebooks are experiencing a rise in popularity. How do we know? They’re replacing PowerPoint in presentations, being shared around organizations, and even taking workload away from BI suites. Today there are many notebooks to choose from: Jupyter, R Markdown, Apache Zeppelin, Spark Notebook, and more, with kernels/backends for multiple languages such as Python, Julia, Scala, and SQL. Notebooks are typically used by data scientists for quick exploration tasks. In this blog, we are going to learn about Jupyter notebooks and Google Colab. We will learn about writing code in notebooks and will focus on their basic features. Before diving directly into writing code, let us familiarise ourselves with the notebook style of writing code!

spaCy Cheat Sheet: Advanced NLP in Python

Check out the first official spaCy cheat sheet! A handy two-page reference to the most important concepts and features.

R Packages: A Beginner’s Guide

An introduction to R packages based on 11 of the most frequently asked user questions.

DataOps: The New DevOps of Analytics

According to Gartner’s report Innovation Insight for DataOps (27 December 2018), ‘DataOps is a collaborative data management practice focused on improving the communication, integration, and automation of data flows across an organization.’ A relatively new approach, DataOps represents a change in culture that focuses on improving collaboration and accelerating service delivery by adopting lean, iterative practices. Unlike its close cousin DevOps, which focuses on development and operations teams, DataOps is geared towards data developers, data analysts, and data scientists. It also covers the data operations that stream data pipelines down to data consumers such as intelligent systems, advanced analytics models, or people. While the promise of DataOps seems strong, it’s important to understand how the two concepts are alike and how they differ. For example, DataOps isn’t just DevOps applied to data analytics. While the two methodologies share the theme of establishing new, streamlined collaboration, DevOps responds to organizational challenges in developing and continuously deploying applications. DataOps, on the other hand, responds to similar challenges around the collaborative development of data flows and the continuous use of data across the organization.

Accelerate Performance for Production AI

Learn about the HPC storage requirements to accelerate performance for production AI scenarios with distributed AI servers. This paper shows the testing results from a variety of benchmarks from 1 to 32 GPUs up to 4 server nodes using flash-based WekaIO storage. See how GPU performance compares within a single server versus a clustered configuration with the same amount of GPUs, as well as how GPU performance scales from 1 to 32 GPUs. Discover the storage bandwidth and throughput requirements for common benchmarks, such as Resnet50, VGG16, and Inceptionv4. The information in this paper can help you plan and optimize your AI resources for production AI.

Explaining Random Forest (with Python Implementation)

We provide an in-depth introduction to Random Forest, with an explanation of how it works, its advantages and disadvantages, important hyperparameters, and a full example Python implementation.
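As a flavour of the recipe (this is an illustrative sketch, not the article’s implementation): a Random Forest combines bootstrap sampling of rows, a random subset of features per tree, and a majority vote. The pure-Python version below uses depth-1 trees (stumps) as base learners for brevity, and the toy dataset is made up.

```python
import random
from collections import Counter

def fit_stump(X, y, feat_idx):
    """Best single split over the given features: (feature, threshold, left_label, right_label)."""
    best = None
    for f in feat_idx:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            errs = sum(v != lmaj for v in left) + sum(v != rmaj for v in right)
            if best is None or errs < best[0]:
                best = (errs, f, t, lmaj, rmaj)
    if best is None:                      # degenerate bootstrap: no valid split
        maj = Counter(y).most_common(1)[0][0]
        return (feat_idx[0], float("inf"), maj, maj)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Bag of stumps: bootstrap the rows and pick a random feature subset per tree."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))             # features considered per tree
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(d), k)               # random feature subset
        forest.append(fit_stump([X[i] for i in rows],
                                [y[i] for i in rows], feats))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return Counter(votes).most_common(1)[0][0]

# Tiny, clearly separable toy data: either feature distinguishes the classes
X = [[0, 5], [1, 4], [2, 6], [1, 5], [8, 1], [9, 2], [7, 0], [8, 2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [1, 5]), predict(forest, [8, 2]))
```

A real Random Forest grows deep trees and re-samples features at every split; in practice you would reach for a library implementation such as scikit-learn’s `RandomForestClassifier` rather than the sketch above.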

Common statistical tests are linear models (or: how to teach stats)

This document is summarised in the table below. It shows the linear models underlying common parametric and non-parametric tests. Formulating all the tests in the same language highlights the many similarities between them.
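To make the claim concrete, here is a small self-contained check (an illustration with made-up data, not taken from the document): the classic equal-variance two-sample t-test produces exactly the t statistic of the slope in the linear model y = b0 + b1·x, where x is a 0/1 group dummy.

```python
import math

# Made-up measurements for two groups (0 = control, 1 = treatment)
y0 = [4.1, 3.8, 5.0, 4.4, 4.7]
y1 = [5.2, 5.9, 4.8, 6.1, 5.5]

# --- Classic two-sample t-test (equal variances) ---
n0, n1 = len(y0), len(y1)
m0, m1 = sum(y0) / n0, sum(y1) / n1
ss0 = sum((v - m0) ** 2 for v in y0)
ss1 = sum((v - m1) ** 2 for v in y1)
sp2 = (ss0 + ss1) / (n0 + n1 - 2)              # pooled variance
t_classic = (m1 - m0) / math.sqrt(sp2 * (1 / n0 + 1 / n1))

# --- The same test as the linear model y = b0 + b1 * x ---
x = [0] * n0 + [1] * n1                        # group dummy
y = y0 + y1
n = len(y)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
t_lm = b1 / math.sqrt((rss / (n - 2)) / sxx)   # t statistic for the slope

print(abs(t_classic - t_lm) < 1e-9)  # prints True: identical statistics
```

The slope b1 is exactly the difference of group means, and its standard error equals the pooled-variance term of the t-test, which is why the two formulations coincide.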

The Deep Learning Toolset – An Overview

Every problem worth solving needs great tools for support. Deep learning is no exception. If anything, it is a realm in which good tooling will become ever more important over the coming years. We are still in the relatively early days of the deep learning supernova, with many deep learning engineers and enthusiasts hacking their own way into efficient processes. However, we are also observing an increasing number of great tools that help facilitate the intricate process that is deep learning, making it both more accessible and more efficient. As deep learning is steadily spreading from the work of researchers and pundits into a broader field of both DL enthusiasts looking to move into the field (accessibility), and growing engineering teams that are looking to streamline their processes and reduce complexity (efficiency), we have put together an overview of the best DL tools.

Chatting with machines: Strange things 60 billion bot logs say about human nature

Lauren Kunze discusses lessons learned from an analysis of interactions between humans and chatbots.

Using RStudio and LaTeX

This post will explain how to integrate RStudio and LaTeX, especially the inclusion of well-formatted tables and nice-looking graphs and figures produced in RStudio and imported to LaTeX. To follow along you will need RStudio, MS Excel and LaTeX.

Introduction to BigQuery ML

BigQuery ML enables users to create and execute machine learning models in BigQuery using standard SQL queries. BigQuery ML democratizes machine learning by enabling SQL practitioners to build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.
BigQuery ML currently supports the following types of models:
• Linear regression – These models can be used for predicting a numerical value.
• Binary logistic regression – These models can be used for predicting one of two classes (such as identifying whether an email is spam).
• Multiclass logistic regression for classification – These models can be used to predict more than two classes such as whether an input is ‘low-value’, ‘medium-value’, or ‘high-value’.
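The “standard SQL” workflow above can be sketched as follows: a model is simply the result of a CREATE MODEL statement over a training table. The Python helper below is a hypothetical illustration that composes such a statement (the dataset, table, and column names are made up; actually running it would also require a client such as google-cloud-bigquery). `logistic_reg` covers both binary and multiclass classification, depending on how many distinct values the label column has.

```python
# Supported model types named in the text: linear and logistic regression
SUPPORTED_MODEL_TYPES = {"linear_reg", "logistic_reg"}

def create_model_sql(model_name, model_type, label_col, source_table):
    """Compose a BigQuery ML CREATE MODEL statement for a training table."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {model_type}")
    return (
        f"CREATE MODEL `{model_name}` "
        f"OPTIONS(model_type='{model_type}', "
        f"input_label_cols=['{label_col}']) "
        f"AS SELECT * FROM `{source_table}`"
    )

# Hypothetical churn model trained directly where the data lives
print(create_model_sql("mydataset.churn_model", "logistic_reg",
                       "churned", "mydataset.customers"))
```

Because training happens inside BigQuery, no data export step is needed — which is the development-speed point the text makes.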

Predictive Maintenance: detect Faults from Sensors with CNN

In machine learning, the topic of predictive maintenance is becoming more popular over time. The challenges are not easy and are very heterogeneous: it’s useful to have good knowledge of the domain, or to be in touch with people who know how the underlying system works. For these reasons, a data scientist entering this field of battle has to follow a linear and rational approach, keeping in mind that the simplest solutions are often the best ones. In this article, we will take a look at a classification problem. We will apply a simple but very powerful model built with a CNN in Keras, and we will try to give a visual explanation of our results.
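The article’s model needs Keras, but the core operation is easy to illustrate. Below is a library-free sketch (the signal and kernel are made up) of why 1D convolutions suit sensor data: a small difference kernel slides along the signal and responds strongly exactly where the level jumps, i.e. where a fault appears. A CNN learns many such kernels instead of hand-picking them.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D convolution (cross-correlation, as in most DL libraries)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hypothetical sensor reading: steady at 1.0, then a fault shifts it to 3.0
signal = [1.0] * 10 + [3.0] * 10

# A difference kernel acts as an edge detector: it fires where the level jumps
response = conv1d(signal, [-1.0, 1.0])
fault_at = max(range(len(response)), key=lambda i: response[i])
print(fault_at)  # prints 9, the index just before the step
```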

An Executive’s Guide to Implementing AI and Machine Learning

As a Chief Analytics Officer, I’ve had to bridge the gap between business needs and data scientists. How that gap is bridged is, in my experience, the difference between how well the value and promise of artificial intelligence (AI) and machine learning is realized. Here are a few things I’ve learned.

Fraud detection with cost-sensitive machine learning

In traditional two-class classification problems, we aim to minimize misclassifications and measure model performance with metrics like accuracy, F-score, or the AUC-ROC curve. In certain problems, however, it is preferable to allow more misclassifications in exchange for lower total costs. If the costs associated with misclassifications vary among samples, we should apply an example-dependent cost-sensitive learning approach. But let’s start from the beginning…
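As a toy illustration of example-dependent costs (the numbers are hypothetical, not from the article): in fraud detection, the cost of approving a fraudulent transaction scales with its amount, while a manual review has a roughly fixed cost, so the optimal decision differs per transaction even at the same model score.

```python
def review_decision(p_fraud, amount, review_cost=5.0):
    """Flag a transaction when its expected fraud loss exceeds the review cost.

    Expected cost of approving  = p_fraud * amount (the loss if it is fraud)
    Expected cost of reviewing  = review_cost (fixed analyst effort)
    A fixed 0.5 probability threshold would treat both cases below identically.
    """
    return "review" if p_fraud * amount > review_cost else "approve"

# The same model score leads to different decisions at different amounts:
print(review_decision(0.02, 1000.0))  # prints "review"  (expected loss 20 > 5)
print(review_decision(0.02, 100.0))   # prints "approve" (expected loss 2 < 5)
```

Example-dependent cost-sensitive learning pushes this idea further by folding per-sample costs into training itself, not just into the decision rule.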

Problems before tools: An alternative perspective to the full stack data science generalist

I recently read an article from Eric Colson, Chief Algorithm Officer at Stitch Fix, where he talked about how we should avoid building data science teams like a manufacturing plant, composed of highly specialized individuals operating different parts of a manufacturing process. Instead, data science teams should be built with a full-stack approach where data scientists are considered generalists. ‘Generalist’ refers to the ability to perform diverse functions, from conception to modelling to implementation to measurement. I won’t go into a detailed summary of the article here, but you should read Eric’s article before continuing. The purpose of this article is to provide a complementary view of Eric’s philosophy. In his article, he took a very top-down approach to describing why a data science team should be built with generalists. I believe the same conclusion can be drawn through the lens of a bottom-up approach, from the perspective of a practitioner of data science and what it really means to work in data science.

Deploying Your Data Science Projects in JavaScript

For my latest project, I decided to use React for most of my exploratory data analysis (EDA) and needed a drop-dead simple JSON API for serving up the necessary data, to avoid loading a 70+ MB page. In this tutorial, I’ll walk you through the process of deploying a sample app with create-react-app, fastapi, and Nginx on DigitalOcean. You can explore the real app as it currently stands. The production deploy process for this is a bit manual, but it certainly could be automated if needed.

Connecting the Dots: Using AI & Knowledge Graphs to Identify Investment Opportunities

Knowledge graphs were recently declared to be on the rise by Gartner’s 2018 Hype Cycle for Artificial Intelligence and Emerging Technologies. They sit alongside 4D Printing and Blockchain for Data Security early in the Hype Cycle, in the Innovation Trigger phase, and as of August 2018 were only expected to reach the plateau in five to ten years.

Context Theory I: Introduction

Conversational AI has come a long way since the early days of state machines and intent-classification combinations. End-to-end training strategies and reinforcement learning research have accelerated to replace unscalable predefined intents and hard-coded states. However, from the perspective of chatbot end-users, things are far from perfect and complete. In this respect, Dialogue Management components show great potential for improvement, and the key to conversing like a real human being lies in keeping up with the conversation. Almost any system can classify some predefined intents. The bigger question is: can you classify context-dependent intents? What do you do with short answers, for instance? So many of them exist in almost all spoken languages; they carry little information, and their meaning is generally heavily context-dependent, yet they can still dramatically change the way a dialogue goes and drastically contribute to the whole context. While generating an answer, one can perhaps capture grammatical cohesion and lexical cohesion… but are grammatical and lexical choices the only things that make a text coherent?