XLNet, ERNIE 2.0, and RoBERTa: What You Need to Know About New 2019 Transformer Models

Large pretrained language models are definitely the main trend of the latest research advances in natural language processing (NLP). While lots of AI experts agree with Anna Rogers’s statement that getting state-of-the-art results with just more data and computing power is not research news, other NLP opinion leaders also see some positive moments in the current trend. For example, Sebastian Ruder, a research scientist at DeepMind, points out that these big language frameworks help us see the fundamental limitations of the current paradigm. With transformers occupying the NLP leaderboards, it’s often hard to follow which modifications enabled a new big language model to set yet another state-of-the-art result. To help you stay up to date with the latest NLP breakthroughs, we’ve summarized research papers featuring the current leaders of the GLUE benchmark: XLNet from Carnegie Mellon University, ERNIE 2.0 from Baidu, and RoBERTa from Facebook AI.

MLIR: accelerating AI with open-source infrastructure

Machine learning now runs on everything from cloud infrastructure with GPUs and TPUs, to mobile phones, to even the smallest hardware, like the microcontrollers that power smart devices. The combination of advancements in hardware and open-source software frameworks like TensorFlow is making possible all of the incredible AI applications we’re seeing today, whether it’s predicting extreme weather, helping people with speech impairments communicate better, or assisting farmers in detecting plant diseases.

But with all this progress happening so quickly, the industry is struggling to keep up with making different machine learning software frameworks work with a diverse and growing set of hardware. The machine learning ecosystem depends on many different technologies, with varying levels of complexity, that often don’t work well together. The burden of managing this complexity falls on researchers, enterprises, and developers. By slowing the pace at which machine-learning-driven products can go from research to reality, this complexity ultimately affects our ability to solve challenging, real-world problems.

Earlier this year we announced MLIR, open-source machine learning compiler infrastructure that addresses the complexity caused by growing software and hardware fragmentation and makes it easier to build AI applications. It offers new infrastructure and a design philosophy that enable machine learning models to be consistently represented and executed on any type of hardware. And today we’re announcing that we’re contributing MLIR to the nonprofit LLVM Foundation, which will enable even faster adoption of MLIR by the industry as a whole.


A simple interface to extract text from (almost) any URL.
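The tool itself isn’t named here, but the underlying idea can be sketched with nothing beyond the Python standard library. The `TextExtractor` class and `extract_text` helper below are hypothetical names invented for this sketch, and a literal HTML string stands in for a fetched page:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts, self._skipping = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skipping:
            self._skipping -= 1

    def handle_data(self, data):
        if not self._skipping and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

# In practice the HTML would come from urllib.request.urlopen(url).read();
# a literal string keeps the sketch self-contained.
html = "<html><head><style>p {}</style></head><body><p>Hello <b>world</b></p></body></html>"
```

Here `extract_text(html)` yields just the visible text, with the stylesheet contents dropped.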

D3 Deconstructor

The D3 Deconstructor is a Google Chrome extension for extracting data from D3.js visualizations. D3 binds data to DOM elements when building a visualization. Our D3 Deconstructor extracts this data and the visual mark attributes (such as position, width, height, and color) for each element in a D3 visualization. In the example below, we apply the D3 Deconstructor on the visualization (left) by right clicking on it and selecting the extension from the context menu. The D3 Deconstructor then extracts the data table (right).

Create Chatbot using Rasa Part-1

Rasa is an open source machine learning framework for building AI assistants and chatbots. For the most part, you don’t need programming experience to work with Rasa. There is, however, the Rasa Action Server, for which you do write code in Python; it is mainly used to trigger external actions, such as calling a Google API or another REST API.
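To give a feel for what such a custom action does, here is an illustrative stand-in (not the Rasa SDK itself): in a real project this logic would live in a class derived from `rasa_sdk.Action` and run on the action server, and the weather API below is mocked so the sketch is self-contained:

```python
def action_check_weather(slots):
    """Illustrative stand-in for a Rasa custom action. In a real project this
    logic would live in a class derived from rasa_sdk.Action and run on the
    Rasa action server; the external weather API is mocked here."""
    def fake_weather_api(city):          # stands in for a real REST call
        return {"paris": "sunny"}.get(city.lower(), "unknown")

    city = slots.get("city", "")
    return f"The weather in {city} is {fake_weather_api(city)}."

reply = action_check_weather({"city": "Paris"})
# reply: "The weather in Paris is sunny."
```

The point is simply that the assistant’s dialogue is declarative, while anything that reaches outside the conversation (an API call, a database lookup) is a small piece of Python like this.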

Dynamic UI Elements in Shiny

At STATWORX, we regularly deploy our project results with the help of Shiny. It’s not only an easy way of letting potential users interact with your R code, but it’s also fun to design a good-looking app. One of Shiny’s biggest strengths is its inherent reactivity; after all, being reactive to user input is a web application’s prime purpose. Unfortunately, many apps seem to make use of Shiny’s responsiveness only on the server side while keeping the UI completely static. This isn’t necessarily bad; some apps wouldn’t profit from dynamic UI elements, and adding them regardless could make the app feel gimmicky. But in many cases, adding reactivity to the UI can result not only in less clutter on the screen but also in cleaner code. And we all like that, don’t we?

Monte Carlo Learning

In this article I will cover the Monte Carlo method of reinforcement learning. I briefly covered the dynamic programming methods (value iteration and policy iteration) in an earlier article. Dynamic programming requires a model (the agent knows the MDP transitions and rewards), and the agent does planning (once the model is available, the agent plans its actions in each state). There is no real learning by the agent in the dynamic programming method.
The Monte Carlo method, on the other hand, is a very simple concept: the agent learns about states and rewards by interacting with the environment. The agent generates experience samples, and the value of a state or state-action pair is then calculated as the average return. Below are the key characteristics of the Monte Carlo (MC) method:
• There is no model (the agent does not know the MDP state transitions)
• The agent learns from sampled experience
• The state value vπ(s) under policy π is learned by averaging the returns over all sampled episodes (value = average return)
• Values are updated only after a complete episode; because of this, convergence is slow
• There is no bootstrapping
• It can only be used in episodic problems
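The episode-based averaging described above can be sketched in a few lines of Python. The tiny two-state environment and the `mc_evaluate` helper are invented here for illustration:

```python
import random

def mc_evaluate(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where the reward is received on leaving that state.
    Returns state -> average sampled return."""
    returns = {}
    for episode in episodes:
        G, first_return = 0.0, {}
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G   # overwriting keeps the first visit's G
        for state, g in first_return.items():
            returns.setdefault(state, []).append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Toy episodic chain: A -> B -> terminal.
# Leaving A pays 0 or 2 with equal probability; leaving B always pays 1.
random.seed(0)
episodes = [[("A", random.choice([0, 2])), ("B", 1)] for _ in range(10_000)]
V = mc_evaluate(episodes)
# True values are V(A) = 2.0 and V(B) = 1.0.
```

Note that nothing is updated until an episode has finished, exactly as the bullets state; the value is literally the average of complete-episode returns.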

Temporal-Difference Learning

In this article I will cover Temporal-Difference learning methods. The Temporal-Difference (TD) method is a blend of the Monte Carlo (MC) method and the Dynamic Programming (DP) method. Below are the key characteristics of the TD method:
• There is no model (the agent does not know the MDP state transitions)
• The agent learns from sampled experience (similar to MC)
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)
• It can learn from incomplete episodes, so the method can also be used in continuing problems
• TD updates a guess towards a guess and revises the guess based on real experience
To understand this better, consider a real-life analogy. Monte Carlo learning is like an annual examination, where the student completes the episode at the end of the year. TD learning, by contrast, is like a weekly or monthly examination: the student can adjust their performance after every small interval based on the score (reward) received, and the final score is the accumulation of all the weekly tests (total reward).
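The “guess towards a guess” update can be sketched as a TD(0) evaluator. The toy environment and the `td0_evaluate` helper below are invented for illustration:

```python
import random

def td0_evaluate(episodes, alpha=0.05, gamma=1.0):
    """TD(0) policy evaluation: values are updated after every step,
    not only after a complete episode.

    episodes: list of episodes, each a list of (state, reward, next_state)
    transitions; next_state is None at the end of an episode."""
    V = {}
    for episode in episodes:
        for state, reward, next_state in episode:
            v_next = V.get(next_state, 0.0) if next_state is not None else 0.0
            target = reward + gamma * v_next          # a bootstrapped guess
            V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
    return V

# Same toy chain as before: A -> B -> terminal.
# Leaving A pays 0 or 2 with equal probability; leaving B always pays 1.
random.seed(1)
episodes = [[("A", random.choice([0, 2]), "B"), ("B", 1, None)]
            for _ in range(10_000)]
V = td0_evaluate(episodes)
# Estimates approach the true values V(A) = 2.0 and V(B) = 1.0.
```

The key contrast with the Monte Carlo version is the `target` line: the update for state A uses the current estimate of V(B) rather than waiting for the episode’s actual total return.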

What to expect from a causal inference business project: an executive’s guide III

This is the third part of the post ‘What to expect from a causal inference business project: an executive’s guide’. You will find the second one here. Most of these words have fuzzy meanings, at least at a popular level. Let me first define what some of them will mean in this post.

What to expect from a causal inference business project: an executive’s guide II

Causal inference models how variables affect each other. Based on this information, it uses calculation tools to answer questions like: what would have happened if, instead of doing this, I had done that? Can I get an estimate of the effect of one variable on another? Causal inference provides a broad-brush approach for getting preliminary estimates of causal effects. If you want more definitive conclusions, you should go, whenever possible, for more precise and clear measurements with A/B tests. These do not suffer from confounding, and you don’t need any modeling beyond statistical calculations.
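To make concrete how little modeling an A/B test needs, here is a minimal sketch of a difference-in-means estimate with a normal-approximation confidence interval; the `ab_effect` helper and the conversion data are made up for illustration:

```python
import math

def ab_effect(control, treatment):
    """Difference-in-means treatment-effect estimate with a rough
    95% confidence interval (normal approximation)."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    effect = mean(treatment) - mean(control)
    se = math.sqrt(var(control) / len(control) + var(treatment) / len(treatment))
    return effect, (effect - 1.96 * se, effect + 1.96 * se)

# Hypothetical per-user conversion outcomes (1 = converted).
control = [0] * 90 + [1] * 10       # 10% conversion
treatment = [0] * 85 + [1] * 15     # 15% conversion
effect, ci = ab_effect(control, treatment)
# effect is +0.05 (5 percentage points); ci gives its uncertainty band.
```

Because assignment to the two groups is randomized in an A/B test, this simple arithmetic is the whole analysis; no causal model is required.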

What to expect from a causal inference business project: an executive’s guide I

This is the fifth post in a series about causal inference and data science. The previous one was ‘Solving Simpson’s Paradox’. You will find the second part of this post here. Causal inference is a new language for modeling causality that helps us better understand causes and impacts so that we can make better decisions. Here we will explain how it can help a company or organization gain insights from its data. This post is written for those in a data-driven company, not necessarily technical staff, who want to understand the key points of a causal inference project.

A DevOps Process for Deploying R to Production

I’ve been at the EARL Conference in London this week, and as always it’s been inspiring to see so many examples of R being used in production at companies like Sainsbury’s, BMW, Austria Post, PartnerRe, Royal Free Hospital, the BBC, the Financial Times, and many others. My own talk, A DevOps Process for Deploying R to Production, presented one approach to automating the building and deployment of R-based applications using Azure Pipelines and Azure Machine Learning Service.

Survival analysis with strata, clusters, frailties and competing risks in Finalfit

In healthcare, we deal with a lot of binary outcomes. Death yes/no, disease recurrence yes/no, for instance. These outcomes are often easily analysed using binary logistic regression via finalfit(). When the time taken for the outcome to occur is important, we need a different approach. For instance, in patients with cancer, the time taken until recurrence of the cancer is often just as important as the fact it has recurred. Finalfit wraps a number of functions to make these analyses easy to perform and output into PDFs and Word documents.
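Finalfit itself is an R package, but the time-to-event idea it wraps can be illustrated with a toy Kaplan-Meier estimator in plain Python; the function and the tiny dataset below are illustrative only, not part of finalfit:

```python
def kaplan_meier(times, events):
    """Toy Kaplan-Meier survival estimator.

    times: time until the event or until censoring, per patient.
    events: 1 if the event (e.g. cancer recurrence) was observed, 0 if censored.
    Returns a list of (time, survival probability) steps."""
    data = sorted(zip(times, events))
    n_at_risk, surv, curve, i = len(data), 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at time t
        if d:
            surv *= 1 - d / n_at_risk
            curve.append((t, surv))
        n_here = sum(1 for tt, _ in data if tt == t)
        n_at_risk -= n_here   # events and censored both leave the risk set
        i += n_here
    return curve

times = [1, 2, 3, 4]      # e.g. months until recurrence or censoring
events = [1, 1, 0, 1]     # patient 3 was censored; the rest recurred
curve = kaplan_meier(times, events)
# Survival drops at months 1, 2 and 4; the censored patient causes no drop.
```

This is exactly what binary logistic regression cannot capture: the censored patient contributes information (they were recurrence-free up to month 3) without ever having the outcome.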

Hierarchical Neural Architecture Search

Many researchers and developers are interested in what Neural Architecture Search can offer their Deep Learning models, but are deterred by monstrous computational costs. Many techniques have been developed to promote more efficient search, notably Differentiable Architecture Search, parameter sharing, predictive termination, and hierarchical representations of architectures. This article will explain the idea of hierarchical representations, because it is by far the easiest way to achieve the desired balance of efficiency and a sufficiently expressive search space. This representation of neural networks is so powerful that you can achieve competitive results with random search, eliminating the need to implement Bayesian, evolutionary, reinforcement learning, or differentiable search algorithms.

Automate Data Cleaning with Unsupervised Learning

In this post, I propose my solution for improving the quality of the textual data at my disposal. I develop a workflow that aims to clean data AUTOMATICALLY and in an UNSUPERVISED way. I say ‘automatically’ because it is useless to follow an unsupervised approach if we have to manually check the data all the time to understand what the model outputs. We need certainty, and we don’t want to waste our time.
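As one small sketch of what unsupervised text cleaning can mean in practice (this is an illustration, not the author’s actual workflow), here is a near-duplicate remover based on token-set Jaccard similarity:

```python
def jaccard(a, b):
    """Token-set similarity between two strings (0 = disjoint, 1 = identical)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedupe(texts, threshold=0.8):
    """Keep the first member of every group of near-duplicate texts."""
    kept = []
    for text in texts:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept

docs = [
    "the quick brown fox",
    "The quick brown FOX",           # near-duplicate after lowercasing
    "an entirely different sentence",
]
clean = dedupe(docs)
# clean keeps the first fox sentence and the unrelated one.
```

No labels are involved: the notion of a duplicate comes entirely from the similarity structure of the data itself, which is what makes the cleaning unsupervised.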

Importance of Loss Function in Machine Learning

Assume you are given the task of filling a bag with 10 kg of sand. You fill it up until the weighing machine gives you a perfect reading of 10 kg, or you take sand out if the reading exceeds 10 kg. Just like that weighing machine, if your predictions are off, your loss function will output a higher number; if they’re pretty good, it will output a lower number. As you experiment with your algorithm to try to improve your model, your loss function will tell you whether you’re getting anywhere. ‘The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function’ – Source. At its core, a loss function is a measure of how good your prediction model is at predicting the expected outcome (or value). We convert the learning problem into an optimization problem: define a loss function, and then optimize the algorithm to minimize it.
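The sand-bag analogy can be made concrete with two common regression losses; the numbers below are made up to show how squaring magnifies a single large miss:

```python
def mse(y_true, y_pred):
    """Mean squared error: large mistakes dominate the loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: every mistake counts proportionally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 10.0, 10.0]     # the 10 kg target, three attempts
good = [9.5, 10.2, 10.1]        # small misses -> small loss
bad = [6.0, 10.2, 10.1]         # one large miss -> disproportionately large MSE
```

Comparing `mse(y_true, bad)` with `mae(y_true, bad)` shows the design choice hiding in the loss function: MSE punishes the single 4 kg miss far more heavily than MAE does, so the choice of loss encodes how much you care about outliers.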