Building a Backend System for Artificial Intelligence

Let’s explore the challenges involved in building a backend system to store and retrieve high-dimensional data vectors, typical of modern systems that use ‘artificial intelligence’: image recognition, text comprehension, document search, music recommendations, …
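To make "store and retrieve high-dimensional data vectors" concrete, here is a minimal sketch of the simplest possible approach: an in-memory store that answers queries by brute-force cosine similarity. The class name `VectorStore` and its methods are hypothetical, chosen for illustration, and are not taken from the linked article. A real backend would likely swap the linear scan for an approximate nearest-neighbour index.

```python
import math


class VectorStore:
    """Hypothetical in-memory store with brute-force cosine-similarity search."""

    def __init__(self):
        self._items = {}  # key -> unit-length vector

    @staticmethod
    def _normalize(vec):
        # Scale to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vec]

    def add(self, key, vec):
        self._items[key] = self._normalize(vec)

    def query(self, vec, k=3):
        """Return up to k (key, similarity) pairs, most similar first."""
        q = self._normalize(vec)
        scored = [
            (sum(a * b for a, b in zip(q, stored)), key)
            for key, stored in self._items.items()
        ]
        scored.sort(reverse=True)
        return [(key, score) for score, key in scored[:k]]


store = VectorStore()
store.add("cat", [1.0, 0.0, 0.0])
store.add("dog", [0.9, 0.1, 0.0])
store.add("car", [0.0, 0.0, 1.0])
nearest = store.query([1.0, 0.0, 0.0], k=2)
```

Every query here scans all stored vectors, so cost grows linearly with the collection size. That scaling is one of the challenges such a backend has to address.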

Automatic GPUs

A reproducible R / Python approach to getting up and running quickly on GCloud with GPUs in TensorFlow.

Classification algorithm for non-time series data

One of the critical problems of ‘identification’, whether in NLP (speech or text) or in solving an image puzzle from pieces like a jigsaw, is to understand the words or pieces of data and their context. The words or pieces individually carry no meaning; tying them together gives an idea of the context. The data itself has patterns, which are broadly classified as sequential or time-series data, and non-time-series data, which is largely non-sequential or arbitrary. Sentiment analysis of text reports, documents, journals, novels and classics follows a time-series pattern, in the sense that the words themselves follow an order governed by the grammar and the language dictionary. So do stock-price prediction problems, which depend on the previous time period's predictions and on socio-economic conditions.

Calendar Heatmaps in ggplot

Calendar heatmaps are a neglected, but valuable, way of representing time series data. Their chief advantage is in allowing the viewer to visually process trends in categorical or continuous data over a period of time, while relating these values to their month, week, and weekday context – something that simple line plots do not efficiently allow for. If you are displaying data on staffing levels, stock returns (as we will do here), on-time performance for transit systems, or any other one-dimensional data, a calendar heatmap can do wonders for helping your stakeholders note patterns in the interaction between those variables and their calendar context. In this post, I will use stock data in the form of daily closing prices for the SPY – SPDR S&P 500 ETF, the most popular exchange traded fund in the world. ETFs are growing in popularity, so much so that there’s even a podcast devoted entirely to them. For the purposes of this blog post, it’s not necessary to have any familiarity with ETFs or stocks in general. Some knowledge of tidyverse packages and basic R will be helpful, though.
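The core of a calendar heatmap is a mapping from each date to its month, week, and weekday position. The linked post does this in R with tidyverse packages. The Python sketch below only illustrates the layout idea, and the function name `calendar_cells` is an assumption made for this example, not something from the post.

```python
from datetime import date, timedelta


def calendar_cells(start, end):
    """Yield (month, week_of_month, weekday, day) for each date in [start, end].

    weekday follows Python's convention (Monday = 0); week_of_month is the
    zero-based row the date occupies in a Monday-first month grid.
    """
    day = start
    while day <= end:
        first_weekday = day.replace(day=1).weekday()
        week_of_month = (day.day - 1 + first_weekday) // 7
        yield (day.month, week_of_month, day.weekday(), day)
        day += timedelta(days=1)


cells = list(calendar_cells(date(2019, 1, 1), date(2019, 1, 31)))
```

Each tuple gives the panel (month), row (week of month), and column (weekday) of one tile. The value to plot, such as a daily closing price, would then set the tile's fill colour.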

Getting Machine Learning Models Ready For Production

As a Scientist, it’s incredibly satisfying to be given the freedom to experiment by applying new research and rapidly prototyping. This satisfaction can be sustained quite well in a lab environment but can diminish quickly in a corporate environment. This is because of the underlying commercial motive that drives science in a business setting: if it doesn’t add business value to employees or customers, there’s no place for it! Business value, however, goes beyond just being a nifty experiment which shows potential value to employees or customers. In the context of Machine Learning models, the only [business] valuable models are models in Production! In this blog post, I will take you through the journey which my team and I went through in taking Machine Learning models to Production and some important lessons learnt along the way.

Adversarial Examples – Rethinking the Definition

Adversarial examples are a large obstacle for a variety of machine learning systems to overcome. Their existence shows the tendency of models to rely on unreliable features to maximize performance; if those features are perturbed, the result can be misclassifications with potentially catastrophic consequences. The informal definition of an adversarial example is an input that has been modified in a way that is imperceptible to humans, but is misclassified by a machine learning system whereas the original input was correctly classified.
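The informal definition can be illustrated with a toy case. The sketch below is not from the linked article. It assumes a hand-written linear classifier and a sign-based perturbation in the spirit of the fast gradient sign method, and it uses a per-feature bound `epsilon` as a stand-in for "imperceptible". All names and numbers are hypothetical.

```python
def predict(weights, bias, x):
    """Binary linear classifier: class 1 if the score is non-negative, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def perturb(weights, x, label, epsilon):
    """Move every feature by at most epsilon in the direction that works against `label`."""
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


weights, bias = [2.0, -3.0, 1.0], 0.0
x = [0.3, 0.1, 0.1]                      # original input, classified as 1
x_adv = perturb(weights, x, label=1, epsilon=0.1)
flipped = predict(weights, bias, x) != predict(weights, bias, x_adv)
```

No feature changes by more than `epsilon`, yet the predicted class flips, because the small shifts all push the score in the same direction.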

Data Science is Boring (Part 1)

My boring days of deploying Machine Learning and how I cope.

Parsing Text for Emotion Terms: Analysis & Visualization Using R: Updated Analysis

The motivation for an updated analysis: the first publication of Parsing text for emotion terms: analysis & visualization Using R, published in May 2017, used the function get_sentiments(‘nrc’), which was made available in the tidytext package. Very recently, the nrc lexicon was dropped from the tidytext package, and hence the R code in the original publication failed to run. The NRC emotion terms are also available in the lexicon package.

R Neural Network

In the previous four posts I used multiple linear regression, decision trees, random forest, gradient boosting, and a support vector machine (SVM) to predict MPG for 2019 vehicles. It was determined that the SVM produced the best model. In this post I am going to use the neuralnet package to fit a neural network to the cars_19 dataset.

Kubernetes: A simple overview

This overview covers the basics of Kubernetes: what it is and what you need to keep in mind before applying it within your organization. The information in this piece is curated from material available on the O’Reilly online learning platform and from interviews with Kubernetes experts.

Quickly understanding process mining by analyzing event logs with Celonis Snap

‘Data is the new oil.’, ‘Our company needs to become more efficient.’, ‘Can we optimize this process?’, ‘Our processes are too complicated.’ – sentences you have heard very often and may be tired of hearing. That is understandable, but there are real-world benefits that stem from the technologies and discussions behind the super trend of (Big) Data. One of the emerging technologies in this field is, in more ways than one, directly linked to the sentences above: process mining. Maybe you have heard of it. Maybe you have not. Harvard Business Review thinks ‘[…] you should be exploring process mining’.

Fine-grained Sentiment Analysis (Part 3): Fine-tuning Transformers

Hands-on transfer learning using a pretrained transformer in PyTorch. This is Part 3 of a series on fine-grained sentiment analysis in Python. Parts 1 and 2 covered the analysis and explanation of six different classification methods on the Stanford Sentiment Treebank fine-grained (SST-5) dataset. In this post, we’ll look at how to improve on past results by building a transformer-based model and applying transfer learning, a powerful method that has been dominating NLP task leaderboards lately.

Industrializing AI & Machine Learning Applications with Kubeflow

Enable data scientists to build scalable, production-ready ML products.

Building and Labeling Image Datasets for Data Science Projects

Using standardized datasets is great for benchmarking new models/pipelines or for competitions. But for me at least, a lot of the fun of data science comes when you get to apply things to a project of your own choosing. One of the key parts of this process is building a dataset, and there are a lot of ways to build image datasets. For certain things I have legitimately just taken screenshots, like when I was sick and built a facial recognition dataset using season 4 of the Flash and annotated it with labelimg. Another route I have taken is downloading a bunch of images by hand, then displaying the images and labeling them in an Excel spreadsheet… For certain projects you might just have to take a bunch of pictures with your phone, as was the case when I made my dice counter. These days I have figured out a few more tricks which make this process a bit easier, and I am working on improving things along the way.

Introducing IceCAPS: Microsoft’s Framework for Advanced Conversation Modeling

The new open source framework that brings multi-task learning to conversational agents. Neural conversation systems and disciplines such as natural language processing (NLP) have seen significant advancements over the last few years. However, most of the current NLP stacks are designed for simple dialogs based on one or two sentences. Structuring more sophisticated conversations that factor in aspects such as personalities or context remains an open challenge. Recently, Microsoft Research unveiled IceCAPS, an open source framework for advanced conversation modeling.