Google Knows What You Are Saying With Only 80 MBs

With only 80 MBs, Google has brought AI powered speech recognition offline. It is described as an end-to-end, all neural, on device, speech recognizer. This allows a user to dictate notes, emails, text messages and voice searches faster and more reliably. The new recognizer works at the character level. As you dictate, the speech recognizer outputs words in real time, character by character, similar to someone typing out what you are saying.

PU Learning – Positive/unknown class machine learning approaches

A challenge that keeps presenting itself at work is one of not having a labelled negative class in the context of needing to train a binary classifier. Typically, the issue is paired with horribly imbalanced data sets and pressed for time, I have often taken the simplistic route of sub-sampling the unknown set and treating them as unknowns. Obviously this isn’t ideal as the unknown set is contaminated and as a result the classifiers dont train that well. Nevertheless, out in the wild, with real-life deadlines, the approach was time efficient, and the results were often surprisingly useful. Recently, I was lucky to have a few days to read around the topic a little. I found some interesting approaches and thought it would be worth taking a few notes, and they turned into this post.

A bit more understanding of Cronbach’s alpha

Cronbach’s alpha reliability coefficient is one of the most widely used indicators of the scale reliability. It is used often without concern for the data (this will be a different text) because it is simple to calculate and it requires only one implementation of a single scale. The aim of this article is to provide some more insight into the functioning of this reliability coefficient without going into heavy mathematics.

Quickly create Codeplans of your (labelled) Data

The view_df() function from the sjPlot-package creates nice ‘codeplans’ from your data sets, and also supports labelled data and tagged NA-values. This gives you a comprehensive, yet clear overview of your data set. To demonstrate this function, we use a (labelled) data set from the European Social Survey. view_df() produces a HTML-file, that is – when you use RStudio – displayed in the viewer pane, or it can be opened in your webbrowser.

Unleash the potential of Recommender Systems

Recommender systems are one of the most popular algorithms in data science today. They possess immense capability in various sectors ranging from entertainment to e-commerce. Recommender Systems have proven to be instrumental in pushing up company revenues and customer satisfaction with their implementation. Therefore, it is essential for machine learning enthusiasts to get a grasp on it and get familiar with related concepts. As the amount of available information increases, new problems arise as people are finding it hard to select the items they actually want to see or use. This is where the recommender system comes in. They help us make decisions by learning our preferences or by learning the preferences of similar users. They are used by almost every major company in some form or the other. Netflix uses it to suggest movies to customers, YouTube uses it to decide which video to play next on autoplay, and Facebook uses it to recommend pages to like and people to follow.

Multi-Task Learning in Language Model for Text Classification

Howard and Ruder propose a new method to enable robust transfer learning for any NLP task by using pre-training embedding, LM fine-tuning and classification fine-tuning. The sample 3-layer of LSTM architecture with same hyperparameters except different dropout demonstrate a outperform and robust model for 6 downstream NLPS tasks. They named it as Universal Language Model Fine-tuning (ULMFiT). This story will discuss about Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018) and the following are will be covered: Architecture, Experiment, Implementation.

Imbalanced Class Sizes and Classification Models: A Cautionary Tale Part 2

Recently, I wrote this post about imbalanced class sizes in classification models might lead to overestimation of a classification model’s performance. The post discussed a classification project I was developing using Airbnb first user booking data from Kaggle. The objective of the project was to predict whether a first-time Airbnb user would likely book a vacation home in the U.S.A./Canada or somewhere internationally. However, bookings within the U.S.A./Canada accounted for around 75 percent of the data, making it difficult to accurately estimate international bookings. To account for the fact that nearly 100 percent of observations were predicted to be in the dominant class (in my case, travelers to the U.S.A./Canada), I used the Adaptive Synthetic (ADASYN) to oversample the minority class in my test set. Not satisfied with the results of the out-of-box logistic regression with default parameters, I decided to do some model selection and brute-force parameter tuning using GridSearchCV in scikit-learn. My features were engineered, my classes were balanced, and my shiny new AWS instance was compute-optimized. What could go wrong?

Word Vectors and Lexical Semantics (Part 1)

The following are my personal notes based on the Deep NLP course by Oxford University held in 2017. The material is available at Deep Natural Language Processing Course (University of Oxford 2017).

Deep Natural Language Processing Course (University of Oxford 2017)

This is an advanced course on natural language processing. Automatically processing natural language inputs and producing language outputs is a key component of Artificial General Intelligence. The ambiguities and noise inherent in human communication render traditional symbolic AI techniques ineffective for representing and analysing language data. Recently statistical techniques based on neural networks have achieved a number of remarkable successes in natural language processing leading to a great deal of commercial and academic interest in the field This is an applied course focussing on recent advances in analysing and generating speech and text using recurrent neural networks. We introduce the mathematical definitions of the relevant machine learning models and derive their associated optimisation algorithms. The course covers a range of applications of neural networks in NLP including analysing latent dimensions in text, transcribing speech to text, translating between languages, and answering questions. These topics are organised into three high level themes forming a progression from understanding the use of neural networks for sequential language modelling, to understanding their use as conditional language models for transduction tasks, and finally to approaches employing these techniques in combination with other mechanisms for advanced applications. Throughout the course the practical implementation of such models on CPU and GPU hardware is also discussed.

Predicting “Bikeability” in U.S. Cities

According to the United Nations, more than half of the world’s population now live in cities. As the worldwide poverty rate continues to fall and people get richer, the number of private vehicles on the roads has surged. These dual phenomena mean higher traffic congestion, which, in turn, exacerbates climate-change-causing greenhouse gas emissions. Alternative ‘green’ modes of transit, namely, walking, scootering, and biking, can improve urban mobility and help cities to meet emissions-reduction targets, but not all cities are equally conducive to these modes. It is in this context that I developed a data science project to predict the ‘bikeability’ of cities across the United States and to explore which urban features determine ‘bikeability’.

Climbing the ladder of causality

Wearing shorts in November will not induce summer. If this seems obvious to you, congratulations! The neural network within your head is able to do causal reasoning, something most artificial neural networks are not capable of. A new book by Judea Pearl, ‘The Book of Why: The New Science of Cause and Effect’, has recently sparked an increased interest for causality in the machine learning community. And for good reasons! As a computer scientist, Pearl presents a useful framework for answering questions or ‘queries’ (‘Will this drug have an effect?’, ‘What would have happened if I had…?’). However, as a philosopher he argues that being capable of pondering about hypothetical situations makes us human. In this post I will attempt to organize the main ideas of The Book of Why in my own research sphere.

Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

Feature selection is an important problem for pattern classification systems. We study how to select good features according to the maximal statistical dependency criterion based on mutual information. Because of the difficulty in directly implementing the maximal dependency condition, we first derive an equivalent form, called minimal-redundancy-maximal-relevance criterion (mRMR), for first-order incremental feature selection. Then, we present a two-stage feature selection algorithm by combining mRMR and other more sophisticated feature selectors (e.g., wrappers). This allows us to select a compact set of superior features at very low cost. We perform extensive experimental comparison of our algorithm and other methods using three different classifiers (naive Bayes, support vector machine, and linear discriminate analysis) and four different data sets (handwritten digits, arrhythmia, NCI cancer cell lines, and lymphoma tissues). The results confirm that mRMR leads to promising improvement on feature selection and classification accuracy.

Nuts and Bolts of Reinforcement Learning: Introduction to Temporal Difference (TD) Learning

Q-learning became a household name in data science when DeepMind came up with an algorithm that reached superhuman levels on ATARI games. It’s one of the core components of reinforcement learning (RL). I regularly come across Q-learning whenever I’m reading up about RL. But what does Q-learning have to do with our topic of temporal difference learning? Let me take an example to give an intuition of what temporal difference learning is about. Rajesh is planning to travel to Jaipur from Delhi in his car. A quick check on Google Maps shows him an estimated 5 hour journey. Unfortunately, there is an unexpected delay due to a roadblock (anyone who has taken a long journey can relate to this!). The estimated arrival time for Rajesh now jumps up to 5 hours 30 minutes.

Ensemble Methods in One Picture

Ensemble methods take several machine learning techniques and combine them into one predictive model. It is a two step process:
1. Generate the Base Learners: Choose any combination of base learners, based on accuracy and diversity. Each base learner can produce more than one predictive model, if you change variables such as case weights, guidance parameters, or input space partitions.
2. Combine Estimates from the Base Learners. The result is a computational ‘average’ of sorts (which is much more complex than the regular arithmetic average).

Approximating RStudio Server Pro (And Shiny Server Pro, and JupyterHub) for Free

Since authentication, scaling, and serving application content are all problems that can be solved with open source software, why pay $10k/year for Rstudio Server Pro or Shiny Server Pro when you can do it for free? Well, for starters, RStudio Server Pro is seamless and really easy to use. A free and open source alternative built on top of Shinyproxy, Docker, Nginx, has a few more moving pieces and is much less elegant. I’ll detail a way to get a multi user RStudio Server system complete with logins and mounted home directories, multiple R versions, etc. This system also can serve and scale shiny (and Dash) apps like ShinyServer Pro. And ideally Jupyter Lab and Jupyter Notebooks, but that piece isn’t working yet. What’s also not working is HTTPS. The whole thing is great for local dev but not ready for prime time yet, so if you know a thing or two about Nginx, Docker, and encrypting web traffic, I would LOVE a pull request.

50 Years of Test (Un)fairness: Lessons for Machine Learning

Quantitative definitions of what is unfair and what is fair have been introduced in multiple disciplines for well over 50 years, including in education, hiring, and machine learning. We trace how the notion of fairness has been defined within the testing communities of education and hiring over the past half century, exploring the cultural and social context in which different fairness definitions have emerged. In some cases, earlier definitions of fairness are similar or identical to definitions of fairness in current machine learning research, and foreshadow current formal work. In other cases, insights into what fairness means and how to measure it have largely gone overlooked. We compare past and current notions of fairness along several dimensions, including the fairness criteria, the focus of the criteria (e.g., a test, a model, or its use), the relationship of fairness to individuals, groups, and subgroups, and the mathematical method for measuring fairness (e.g., classification, regression). This work points the way towards future research and measurement of (un)fairness that builds from our modern understanding of fairness while incorporating insights from the past.


HashiCorp Nomad is a single binary that schedules applications and services onLinux, Windows, and Mac. It is an open source scheduler that uses adeclarative job file for scheduling virtualized, containerized, andstandalone applications.
Nomad is a flexible container orchestration tool that enables an organization to easily deploy and manage any containerized or legacy application using a single, unified workflow. Nomad can run a diverse workload of Docker, non-containerized, microservice, and batch applications, and generally offers the following benefits to developers and operators:
• API-driven Automation: Workload placement, scaling, and upgrades can be automated, simplifying operations and eliminating the need for homegrown tooling.
• Self-service Deployments: Developers are empowered to service application lifecycles directly, allowing operators to focus on higher value tasks.
• Workload Reliability: Application, node, and driver failures are handled automatically, reducing the need for manual operator intervention
• Increased Efficiency and Reduced Cost: Higher application densities allow operators to reduce fleet sizes and save money.
Nomad is trusted by enterprises from a range of sectors including financial, retail, software, and others to run production workloads at scale across private infrastructure and the public cloud.

Advanced cross-validation tips for time series

n a previous post, we explained the concept of cross-validation for time series, aka backtesting, and why proper backtests matter for time series modeling. The goal here is to dig deeper and discuss a few coding tips that will help you cross-validate your predictive models correctly.

Choosing from Popular Python Web Frameworks

Python is one of the most popular and versatile programming languages. There are thousands of Python packages, and these allow you to extend Python capabilities to any domain,be it web development, Internet of Things (IoT), artificial intelligence, or machine learning, and scientific computing. We can work with many different web frameworks and packages to easily build simple and complex RESTful Web APIs with Python, and we can combine these frameworks with other Python packages.
Full-Stack Framework
1. Django
2. Pyramid
1. Flask
2. Bottle
Asynchronous Framework
1. Tornado
2. Growler