BERT, RoBERTa, DistilBERT, XLNet – which one to use?

Google’s BERT and recent transformer-based methods have taken the NLP landscape by storm, outperforming the state of the art on several tasks. Lately, several improvements over BERT have been proposed, and here I will compare them, highlighting the main similarities and differences so you can choose which one to use in your research or application.
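A practical upshot of the comparison is that, in modern tooling, swapping architectures is mostly a matter of changing the checkpoint name. A minimal sketch, assuming the Hugging Face transformers library (the article itself is library-agnostic):

```python
# Loading each architecture through the same Auto* interface; the checkpoint
# names are standard Hugging Face hub identifiers.
from transformers import AutoTokenizer, AutoModel

for checkpoint in ["bert-base-uncased", "roberta-base",
                   "distilbert-base-uncased", "xlnet-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer("Which transformer should I use?", return_tensors="pt")
    outputs = model(**inputs)
    print(checkpoint, outputs.last_hidden_state.shape)
```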


Bayesian Linear Mixed Models: Random Intercepts, Slopes, and Missing Data

This past summer, I watched a brilliant lecture series by Richard McElreath on Bayesian statistics. It honestly changed my whole outlook on statistics, so I couldn’t recommend it more (plus, McElreath is an engaging instructor). One of the most compelling cases for Bayesian statistics is a collection of statistical tools called linear mixed models, also known as multilevel or hierarchical models. Data are commonly grouped or clustered in some way. In psychology we often have repeated observations nested within participants, so we know that data coming from the same participant will share some variance. Linear mixed models are powerful tools for dealing with such multilevel data, usually by modeling random intercepts and random slopes. In this tutorial I assume familiarity with linear regression and some background in Bayesian inference, specifically priors and posterior distributions (if not, go here or watch McElreath’s videos).
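To make the random-intercept idea concrete, here is a minimal sketch in Python using PyMC (an assumption; the tutorial’s own tooling may differ). Each participant gets an intercept drawn from a shared population distribution, which partially pools estimates across participants:

```python
import numpy as np
import pymc as pm

# Simulated repeated-measures data: 20 participants, 10 observations each
rng = np.random.default_rng(0)
n_participants, n_obs = 20, 10
participant = np.repeat(np.arange(n_participants), n_obs)
x = rng.normal(size=n_participants * n_obs)
true_intercepts = rng.normal(2.0, 1.0, size=n_participants)
y = true_intercepts[participant] + 0.5 * x + rng.normal(scale=0.3, size=x.size)

with pm.Model() as model:
    mu_a = pm.Normal("mu_a", 0.0, 5.0)        # population mean intercept
    sigma_a = pm.HalfNormal("sigma_a", 2.0)   # between-participant spread
    a = pm.Normal("a", mu_a, sigma_a, shape=n_participants)  # random intercepts
    b = pm.Normal("b", 0.0, 2.0)              # common slope
    sigma = pm.HalfNormal("sigma", 1.0)       # residual noise
    pm.Normal("y", a[participant] + b * x, sigma, observed=y)
    idata = pm.sample(1000, tune=1000)
```

A random-slopes model extends this by giving each participant its own slope drawn from a population distribution, in the same way.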


Building a data pipeline from scratch on AWS

When you start diving into the data world, you will find that there are many approaches you can take and many tools you can use. It may feel a little overwhelming at first. In this post, I will try to help you understand how to pick the appropriate tools and how to build a fully working data pipeline on the cloud using the AWS stack, based on a pipeline I recently built. The pipeline discussed here supports all data stages, from data collection to data analysis. By walking through the whole process I went through to build my first data pipeline, I intend to give you enough information that, by the end of this post, you will be able to design your own architecture and justify your choices.
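As a taste of the collection stage, here is a minimal ingestion sketch with boto3 (the bucket name and key layout are hypothetical assumptions; the post’s actual stack may use other AWS services). Landing raw events in S3 is a common first stage of an AWS data pipeline:

```python
import json
import boto3

s3 = boto3.client("s3")

def land_event(event: dict, bucket: str = "my-raw-data-bucket") -> None:
    """Write one raw event as a JSON object into the S3 landing zone."""
    key = f"raw/events/{event['id']}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))

land_event({"id": "123", "user": "alice", "action": "click"})
```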


Drawing Architecture: Building Deep Convolutional GANs in PyTorch

Feynman did not create GANs, unsupervised learning, or adversarial training, but with this quote he did demonstrate that intelligence and the ability to understand something is not merely a supervised, discriminative task. To understand something, you must do more than give it a label based on something similar that you have seen a million times: to understand what you are looking at, you must be able to recreate it. The ability to create is what sets generative adversarial networks apart from their predecessors in deep learning. GANs are generative models that generate output; this is a departure from discriminative models, which label input. This makes them a powerful, paradigm-shifting force in deep learning and artificial intelligence, worthy of the hype that Yann LeCun and the other fathers of deep learning have given them. The potential of GANs surpasses that of discriminative networks because GANs use deep learning to synthesize information and create something novel from it. As Feynman said, this is the most impactful form of understanding there is.
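The “drawing” half of the adversarial game is the generator. A minimal DCGAN-style generator sketch in PyTorch (layer sizes are illustrative assumptions, not the article’s exact architecture); it maps a random noise vector to a 64x64 image through transposed convolutions:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim: int = 100, channels: int = 3, feat: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),           # 4x4
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),           # 8x8
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),           # 16x16
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),               # 32x32
            nn.ConvTranspose2d(feat, channels, 4, 2, 1, bias=False),
            nn.Tanh(),                                         # 64x64, in [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

fake = Generator()(torch.randn(16, 100, 1, 1))  # batch of 16 generated images
print(fake.shape)                               # torch.Size([16, 3, 64, 64])
```

During training, a discriminator network labels these outputs as real or fake, and the two networks improve against each other.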


3 Ways to Manage Human Bias in the Analytics Process

Managing human bias is an important part of the analytics process. Learn about three areas to watch out for to ensure your models are as unbiased as possible.
• Evaluate the ask: What is the business decision-maker looking for?
• Carefully select & evaluate information that feeds the model
• Objectively choose the best analytics method


How to Create an Interactive Geographic Map Using Python and Bokeh

If you are looking for a powerful way to visualize geographic data, then you should learn to use interactive choropleth maps. A choropleth map represents statistical data through various shading patterns or symbols on predetermined geographic areas, such as countries, states, or counties. Static choropleth maps are useful for showing a single view of the data, but an interactive choropleth map is much more powerful, allowing the user to select the data they prefer to view.
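A minimal interactive-choropleth sketch with Bokeh (the GeoJSON here is a toy two-region example, an assumption standing in for real country or state shapes; a real map would load them from a shapefile or GeoJSON file):

```python
import json
from bokeh.models import GeoJSONDataSource, HoverTool, LinearColorMapper
from bokeh.plotting import figure, show

# Two square "regions" with a statistic ("rate") attached to each feature
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Region A", "rate": 10},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
        {"type": "Feature", "properties": {"name": "Region B", "rate": 40},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]}},
    ],
}
source = GeoJSONDataSource(geojson=json.dumps(geojson))
mapper = LinearColorMapper(palette="Viridis256", low=0, high=50)

p = figure(title="Toy choropleth", tools="pan,wheel_zoom")
p.patches("xs", "ys", source=source,
          fill_color={"field": "rate", "transform": mapper}, line_color="black")
p.add_tools(HoverTool(tooltips=[("Region", "@name"), ("Rate", "@rate")]))
show(p)
```

Hovering shows each region’s value; widgets such as sliders can then be wired in to let the user switch which statistic is displayed.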


Bootstrapping for Inferential Statistics

Data Scientist’s Toolkit – bootstrapping, sampling, confidence intervals, hypothesis testing.
Bootstrap is a powerful, computer-based method for statistical inference that does not rely on too many assumptions. It feels almost magical to form a sampling distribution from just one sample, with no formulas needed for the inference. Beyond that, it is widely applied across inferential statistics, in confidence intervals and regression models, and even in machine learning. In this article we will primarily talk about two things (a minimal code sketch of the core resampling idea follows the list):
• Building Confidence Intervals
• Hypothesis Testing
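A minimal bootstrap confidence-interval sketch with NumPy (the data here is simulated, an assumption for illustration). Resampling with replacement from the one observed sample approximates the sampling distribution of the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)   # the single observed sample

# Resample with replacement many times, recording the mean of each resample
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# 95% percentile confidence interval for the population mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

For hypothesis testing, the same resampled distribution can be compared against a null value: if the null value falls outside the interval, the null hypothesis is rejected at the corresponding significance level.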


Build your first Voice Assistant

A step-by-step tutorial on building a voice-based assistant using Python. Who doesn’t want the luxury of owning an assistant that always listens for your call, anticipates your every need, and takes action when necessary? That luxury is now available thanks to artificial-intelligence-based voice assistants. Voice assistants come in fairly small packages and can perform a variety of actions after hearing your command. They can turn on lights, answer questions, play music, place online orders, and carry out all kinds of AI-driven tasks. Voice assistants are not to be confused with virtual assistants, who are people working remotely and can therefore handle all kinds of tasks. Rather, voice assistants are technology-based. As voice assistants become more robust, their utility in both the personal and business realms will grow as well.
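A minimal listen-and-respond sketch (assuming the SpeechRecognition and pyttsx3 packages and a working microphone; the tutorial’s own assistant is more elaborate). The loop transcribes one utterance and answers a hard-coded command:

```python
import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
engine = pyttsx3.init()

def speak(text: str) -> None:
    """Read text aloud through the system's text-to-speech engine."""
    engine.say(text)
    engine.runAndWait()

with sr.Microphone() as source:
    speak("How can I help?")
    audio = recognizer.listen(source)           # block until a phrase is heard

try:
    command = recognizer.recognize_google(audio).lower()
    if "time" in command:
        import datetime
        speak(f"It is {datetime.datetime.now():%H:%M}")
    else:
        speak(f"You said: {command}")
except sr.UnknownValueError:
    speak("Sorry, I did not catch that.")
```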


Differential Privacy

This project contains a C++ library of ε-differentially private algorithms, which can be used to produce aggregate statistics over numeric data sets containing private or sensitive information. In addition, we provide a stochastic tester to check the correctness of the algorithms. Currently, we provide algorithms to compute the following:
• Count
• Sum
• Mean
• Variance
• Standard deviation
• Order statistics (including min, max, and median)
We also provide an implementation of the Laplace mechanism that can be used to perform computations that aren’t covered by our pre-built algorithms. All of these algorithms are suitable for research, experimental, or production use cases. The stochastic tester mentioned above helps catch regressions that could make the differential privacy property no longer hold.
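The library itself is C++; here is a minimal Python illustration of the Laplace mechanism it implements. Adding noise drawn from Laplace(sensitivity/ε) to a true count yields an ε-differentially private count:

```python
import numpy as np

def private_count(values, epsilon: float, rng=np.random.default_rng()) -> float:
    """ε-DP count: a count query has L1 sensitivity 1, since adding or
    removing one person changes the count by at most 1."""
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return len(values) + noise

print(private_count(range(1000), epsilon=0.5))  # true count 1000, plus noise
```

Smaller ε means stronger privacy but larger noise; the pre-built algorithms in the library bundle this kind of calibration with correct sensitivity bounds for each statistic.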


Face recognition and OCR processing of 300 million records from US yearbooks

Using AI and computer vision in genealogy research. A yearbook is a type of book published annually to record, highlight, and commemorate the past year of a school. Our team at MyHeritage took on a complex project: extracting individual pictures, names, and ages from hundreds of thousands of yearbooks, structuring the data, and creating a searchable index that covers the majority of US schools between 1890 and 1979, more than 290 million individuals. In this article I’ll describe the problems we encountered during this project and how we solved them.
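As a toy illustration of the face-extraction step (an assumption for illustration only; MyHeritage’s production pipeline is far more sophisticated, and the input file name here is hypothetical), OpenCV’s bundled Haar cascade can crop each detected face from a scanned page:

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

page = cv2.imread("yearbook_page.jpg")          # hypothetical scanned page
gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for i, (x, y, w, h) in enumerate(faces):
    cv2.imwrite(f"face_{i}.jpg", page[y:y + h, x:x + w])
print(f"extracted {len(faces)} faces")
```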


How I Understood: What features to consider while training audio files?

As with any machine learning experiment, the first requirement is to collect data. The next main task is to transform the data into features, which can then be fed into an algorithm. This post briefly walks through some of the most important features you may need to build a model for an audio classification task. Python code for extracting some of these features is also provided below.
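A minimal feature-extraction sketch with librosa (an assumption; the post may use other tooling, and the audio file name is hypothetical). MFCCs, zero-crossing rate, and spectral centroid are among the most common features for audio classification:

```python
import librosa

y, sr = librosa.load("sample.wav", sr=22050)          # waveform, sample rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, n_frames)
zcr = librosa.feature.zero_crossing_rate(y)           # (1, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
print(mfcc.shape, zcr.shape, centroid.shape)
```

Frame-level features like these are typically summarized (e.g., by their mean and variance over time) before being fed to a classifier.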


Demystifying hypothesis testing with simple Python examples

Hypothesis testing is the bread and butter of inferential statistics and a critical skill in the repertoire of a data scientist. We demonstrate the concept with very simple Python scripts.
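In that spirit, a very simple sketch with SciPy (simulated data, an assumption for illustration): a one-sample t-test of the null hypothesis that the population mean is 5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=5.4, scale=1.0, size=50)   # sample whose true mean is 5.4

t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# reject H0 at alpha = 0.05 if p_value < 0.05
```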


Marketing Analytics: Customer Engagement, Random Forest Style

We’ll go over how to build a random forest predictive model of customer marketing engagement. With better predictions of how customers will engage with certain marketing campaigns, a marketer can tailor strategies for different audiences [1]. The official marketing term we are looking for here is the ‘likelihood of engagement.’ One concrete example is isolating which type of customers will respond to which type of ads (e.g., females ages 20-39 responding to Facebook Ads vs. Google Ads – totally made that up).
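A minimal engagement-model sketch with scikit-learn (the features are synthetic assumptions for illustration). predict_proba gives the ‘likelihood of engagement’ the post refers to:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
# hypothetical features: age, is_female, saw_facebook_ad, saw_google_ad
X = np.column_stack([
    rng.integers(18, 70, 1000),
    rng.integers(0, 2, 1000),
    rng.integers(0, 2, 1000),
    rng.integers(0, 2, 1000),
])
y = rng.integers(0, 2, 1000)                      # engaged (1) or not (0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(model.predict_proba(X_test[:5])[:, 1])      # likelihood of engagement
```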