A review of BERT based models

Attention – the simple idea of focusing on salient parts of the input by taking a weighted average of them – has proven to be the key factor in a wide class of neural net models. Multihead attention in particular has proven to be the reason for the success of state-of-the-art natural language processing models such as BERT and Transformer-based machine translation models. Multihead attention – essentially multiple attention heads attending to different sections of the input in parallel – enables the expression of sophisticated functions beyond the weighted average of just one attention head. Not since word2vec has an NLP model spawned so many offshoot models, either with its name embedded in them as a prefix/suffix (doc2vec, node2vec, lda2vec, etc.) or expanding on its core idea (skip-thought vectors, fastText, AdaGram, etc.).
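
To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product attention and a simplified multi-head variant. It splits the embedding into per-head slices rather than using the learned per-head projections a real Transformer has, and the shapes and head count are illustrative assumptions, not BERT's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Attention weights are a softmax over similarity scores;
    # the output is a weighted average of the values.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def multi_head_attention(x, num_heads=4):
    # Simplification: split the embedding into per-head slices that attend
    # in parallel, then concatenate the head outputs (a real Transformer
    # uses learned projection matrices for each head instead).
    d_model = x.shape[1]
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(x[:, sl], x[:, sl], x[:, sl]))
    return np.concatenate(heads, axis=-1)

x = np.random.randn(10, 64)      # 10 tokens, 64-dim embeddings (made-up sizes)
out = multi_head_attention(x)    # shape (10, 64)
```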


Distilling BERT – How to achieve BERT performance using Logistic Regression

BERT is awesome, and it’s everywhere. It looks like any NLP task can benefit from utilizing BERT. The authors showed that this is indeed the case, and from my experience, it works like magic. It’s easy to use, works on a small amount of data and supports many different languages. It seems like there’s no single reason not to use it everywhere. But actually, there is. Unfortunately, in practice, it is not so trivial. BERT is a huge model, with more than 100 million parameters. Not only do we need a GPU to fine-tune it, but at inference time a CPU (or even many of them) is not enough. It means that if we really want to use BERT everywhere, we need to install a GPU everywhere. This is impractical in most cases. In 2015, this paper (by Hinton et al.) introduced a way to distill the knowledge of a very big neural network into a much smaller one, like a teacher and a student. The method is very simple: we use the big neural network’s predictions to train the small one. The main idea is to use raw predictions, i.e., predictions before the final activation function (usually softmax or sigmoid). The assumption is that by using raw values, the model is able to learn inner representations better than by using ‘hard’ predictions. Softmax normalizes the values so they sum to 1, keeping the maximum value high while pushing the other values very close to zero. There’s little information in zeros, so by using raw predictions, we also learn from the not-predicted classes. The authors show good results in several tasks, including MNIST and speech recognition.
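
As a rough sketch of what that looks like in practice, the student can simply be regressed onto the teacher's raw logits. This assumes PyTorch, and `teacher`, `student`, and the optimizer are placeholders, not the article's actual code:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, x, optimizer):
    # Teacher produces raw predictions (logits, before softmax/sigmoid).
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)
    # Student is trained to reproduce those raw values directly.
    student_logits = student(x)
    loss = F.mse_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```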


Introduction to Neural Networks

This article is the first in a series of articles aimed at demystifying the theory behind neural networks and how to design and implement them. The article was designed to be a detailed and comprehensive introduction to neural networks that is accessible to a wide range of individuals: people who have little to no understanding of how a neural network works as well as those who are relatively well-versed in their uses, but perhaps not experts. In this article, I will cover the motivation and basics of neural networks. Future articles will go into more detailed topics about the design and optimization of neural networks and deep learning. These tutorials are largely based on the notes and examples from multiple classes taught at Harvard and Stanford in the computer science and data science departments.


You Need to Know Edge Relaxation for Shortest Paths Problem

The purpose of the shortest paths problem is to find the shortest path from a starting vertex to a goal vertex. Algorithms for the shortest paths problem are widely used, from competitive programming to Google Maps directions search. By understanding the key notion of ‘edge relaxation’, it becomes much easier to understand the concrete algorithms, say Dijkstra’s algorithm or the Bellman-Ford algorithm. In other words, it might be difficult to make these algorithms your own without understanding edge relaxation. In this post, I focus on edge relaxation and explain the general structure used to solve the shortest paths problem. We’ll also go through a simple algorithm and its implementation for better understanding (a minimal sketch is shown after the outline below). I use Python for the implementation. This post is structured as follows:
• What is the shortest paths problem?
• What is edge relaxation?
• The order of the relaxation
• The shortest path on a DAG and its implementation
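
A minimal Python sketch of the last item, using edge relaxation over a topological order; the graph representation and the example weights are my own illustrative assumptions, not the post's code:

```python
from collections import defaultdict

def topological_order(graph, num_vertices):
    # Depth-first search based topological sort.
    visited, order = set(), []
    def dfs(u):
        visited.add(u)
        for v, _ in graph[u]:
            if v not in visited:
                dfs(v)
        order.append(u)
    for u in range(num_vertices):
        if u not in visited:
            dfs(u)
    return list(reversed(order))

def dag_shortest_paths(graph, num_vertices, source):
    dist = [float('inf')] * num_vertices
    dist[source] = 0
    for u in topological_order(graph, num_vertices):
        for v, w in graph[u]:
            # Edge relaxation: update dist[v] if going through u is shorter.
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

graph = defaultdict(list)
for u, v, w in [(0, 1, 2), (0, 2, 6), (1, 2, 3), (2, 3, 1)]:
    graph[u].append((v, w))
print(dag_shortest_paths(graph, 4, 0))   # [0, 2, 5, 6]
```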


Introduction to Actor Critic in Reinforcement Learning

Before delving into the details of the actor critic, let’s remind ourselves of the Policy Gradient. What does it mean to have policy-based reinforcement learning? To put it simply, imagine that a robot finds itself in some situation that appears similar to something it has experienced before. The policy-based method says: since I have taken action (a) in this kind of situation in the past, let’s try the same action this time too. Note: don’t confuse a similar situation with the same state; in a similar situation the robot or agent is in some new state that resembles what it has experienced before, not necessarily the exact same state it was in.
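
As a toy illustration (a hedged PyTorch sketch of my own, not from the article), a policy network maps a state to action probabilities, so states that resemble each other naturally lead to similar action choices:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        # Small network; sizes are illustrative assumptions.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        # Output a probability distribution over actions for this state.
        return torch.softmax(self.net(state), dim=-1)

policy = PolicyNet(state_dim=4, num_actions=2)
state = torch.randn(4)                       # a new state that may resemble past ones
probs = policy(state)
action = torch.multinomial(probs, 1).item()  # sample an action from the policy
```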


Flask and Heroku for online Machine Learning deployment

Thanks to libraries such as Pandas, scikit-learn and Matplotlib, it is relatively easy to start exploring datasets and making some first predictions using simple Machine Learning (ML) algorithms in Python. However, to make these trained models useful in the real world, it is necessary to share them and make them accessible on the web, ready to make predictions. Only in this way can Machine Learning be used to benefit society. I have recently been working with the Blood Transfusion Service Center dataset by Prof. I-Cheng Yeh, and I achieved a prediction accuracy of 91.1% using a Random Forest Classifier. I therefore decided to make this model available online on my personal website. To do so, I made use of Heroku and the Python library Flask. As a further development, I also decided to create an Android app for my personal website using Android Studio, making my ML model ready to run on Android devices as well.
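
For context, a minimal Flask endpoint for serving a saved scikit-learn model might look like the sketch below; the model file name, input format, and route are illustrative assumptions, not the author's actual deployment code:

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load('random_forest.joblib')   # hypothetical pre-trained classifier

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"features": [2, 50, 12500, 98]}.
    features = np.array(request.get_json()['features']).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({'prediction': int(prediction)})

if __name__ == '__main__':
    app.run()
```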


Instance segmentation using Mask R-CNN

This article discusses the instance segmentation technique, Mask R-CNN, with a brief introduction to object detection techniques such as R-CNN, Fast R-CNN, Faster R-CNN.


A fastai/Pytorch implementation of MixMatch

In this post, I will be discussing and implementing ‘MixMatch: A Holistic Approach to Semi-Supervised Learning’ by Berthelot, Carlini, Goodfellow, Oliver, Papernot and Raffel. Released in May 2019, MixMatch is a semi-supervised learning algorithm that has significantly outperformed previous approaches. This blog comes from discussions held within Dr. Ehsan Kamalinejad’s machine learning research group at Cal State East Bay. How much of an improvement is MixMatch? When trained on CIFAR10 with 250 labeled images, MixMatch outperforms the next best technique (Virtual Adversarial Training) by almost 25 percentage points of error rate (11.08% vs. 36.03%; for comparison, the fully supervised case on all 50k images has an error rate of 4.13%). These are far from incremental results, and the technique shows the potential to dramatically improve the state of semi-supervised learning.


Data Scientist is not the only Data Science role!

Concise information about the major data science roles.


The magazine that writes itself

We asked an AI to write about the future, here’s what it has to say and the steps we took to build it.


3 Quick Tips for Training ML Models

For every triumphant result in machine learning, there are hundreds of buggy models that just won’t converge. Skyrocketing loss functions. Unusual validation loss. Suspiciously low training loss. All of these indicate that something is wrong. There are some beautiful references that can help thoroughly diagnose the issue, but here are the three candidates that immediately come to mind and constantly pop up. Note: This especially applies to image-oriented tasks but still has great value for general ML tasks.


Code: the new data to look at

What do Facebook, a car and a Boeing have in common? They all run on source code with over 20M lines of code (LOC). Six years ago, Marc Andreessen said that ‘software is eating the world’, and looking at David McCandless’s visualization we can understand how true that is. We are getting overwhelmed by source code and its increasing complexity, and thus by its challenges: shadow IT, lack of documentation, language and framework heterogeneity, lack of visibility into code history, no easy way to maintain the code…


Generating meaningful Phrases from unstructured news data

A phrase is a multi-word expression or word n-gram that expresses a valid meaning and is very useful in document indexing, summarisation, recommendation, etc. In this article I will explain a high-precision parallel phrasing module I came up with, built from only 10,000 unstructured news articles, that achieves more than 95% precision. I will cover how to generate valid bi-gram and tri-gram phrases. You might be wondering about existing phrase modules such as gensim’s, which I have tried and failed to make work. The phrases generated by gensim are not up to the mark, and it may require a huge corpus (I guess around 1 million news articles) to generate phrases based on collocation.
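
For reference, the gensim baseline being compared against looks roughly like the sketch below; the toy corpus is made up and the loose `min_count`/`threshold` values are only for illustration, since real use tunes them on a much larger corpus:

```python
from gensim.models.phrases import Phrases

sentences = [
    ['the', 'prime', 'minister', 'visited', 'new', 'york'],
    ['the', 'prime', 'minister', 'spoke', 'in', 'new', 'york'],
]

# First pass merges frequent word pairs into bi-grams based on collocation scores;
# a second pass over the bi-grammed corpus yields tri-grams.
bigram = Phrases(sentences, min_count=1, threshold=1)
trigram = Phrases(bigram[sentences], min_count=1, threshold=1)

for sent in sentences:
    print(trigram[bigram[sent]])   # e.g. ['the', 'prime_minister', ..., 'new_york']
```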


Relative vs Absolute: Understanding Compositional Data with Simulations

One of the things I am constantly asked to do at my work is to compare NGS data across samples. These could be replicates or could come from completely different environments or processing steps. We all intuitively understand that these data are always constrained by the sequencing depth, and therefore conclusions need to be drawn carefully. I lacked the vocabulary and the techniques to analyze this data until I saw the pre-print of this paper by @WarrenMcG. That started a few days of digging a little deeper into compositional data, and I generated some synthetic data to understand the nuances a little better. In this post, I try to lay out what I’ve learned and hopefully help spread the word about the dangers of treating compositional data as regular unconstrained data.
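
A small hedged sketch of such a simulation (made-up numbers, not the author's actual code): only one feature truly increases, yet under a fixed sequencing depth every other feature appears to decrease in the relative view:

```python
import numpy as np

rng = np.random.default_rng(0)
absolute_a = np.array([100, 200, 300, 400])   # true abundances, sample A
absolute_b = absolute_a.copy()
absolute_b[3] *= 4                            # only feature 3 really increases

depth = 1000                                  # fixed sequencing depth for both samples
counts_a = rng.multinomial(depth, absolute_a / absolute_a.sum())
counts_b = rng.multinomial(depth, absolute_b / absolute_b.sum())

print(counts_a / depth)   # relative abundances, sample A
print(counts_b / depth)   # features 0-2 appear to decrease although they didn't change
```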


How To Become a Data Engineer

A list of useful resources to learn Data Engineering from scratch.


Incorrectly stated machine learning problem

When you are solving the wrong problem.


How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models

I love being a data scientist working in Natural Language Processing (NLP) right now. The breakthroughs and developments are occurring at an unprecedented pace. From the super-efficient ULMFiT framework to Google’s BERT, NLP is truly in the midst of a golden era. And at the heart of this revolution is the concept of the Transformer. This has transformed the way we data scientists work with text data – and you’ll soon see how in this article.