Tutorial on Text Classification (NLP) using ULMFiT and fastai Library in Python

Natural Language Processing (NLP) needs no introduction in today’s world. It’s one of the most important fields of study and research, and has seen a phenomenal rise in interest in the last decade. The basics of NLP are widely known and easy to grasp. But things start to get tricky when the text data becomes huge and unstructured. That’s where deep learning becomes so pivotal. Yes, I’m talking about deep learning for NLP tasks – still a relatively untrodden path. DL has proven its usefulness in computer vision tasks like object detection, classification and segmentation, but NLP applications like text generation and classification have long been considered the domain of traditional ML techniques.
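To give a taste of where the tutorial heads, here is a minimal ULMFiT-style sketch using the current fastai API (the original article may target an older fastai release) on the small IMDB sample that ships with the library:

```python
from fastai.text.all import *

# Download the small IMDB sample bundled with fastai
path = untar_data(URLs.IMDB_SAMPLE)
dls = TextDataLoaders.from_csv(path, 'texts.csv',
                               text_col='text', label_col='label',
                               valid_col='is_valid')

# ULMFiT idea: start from an AWD_LSTM language model pre-trained on Wikitext-103
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)   # one frozen epoch, then unfreeze: a simplified ULMFiT schedule
```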

Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data

The k-nearest neighbors algorithm is characterized as a simple yet effective data mining technique. The main drawback of this technique appears when massive amounts of data – likely to contain noise and imperfections – are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing values imputation have targeted these weaknesses. As a result, these weaknesses have been turned into strengths, and the k-nearest neighbors rule has become a core algorithm for identifying and correcting imperfect data – removing noisy and redundant samples or imputing missing values – thus transforming Big Data into Smart Data, that is, data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context is investigated. This includes a brief overview of Smart Data, current and future trends for the k-nearest neighbor algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines on how to use the k-nearest neighbor algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed, including all the Smart Data algorithms analyzed.
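One of the classic preprocessing ideas in the family the paper surveys is noise filtering via Wilson’s Edited Nearest Neighbours rule. A minimal single-machine sketch with scikit-learn (not the paper’s Spark implementation) might look like this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose k nearest neighbours mostly disagree with their label."""
    X, y = np.asarray(X), np.asarray(y)
    knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    neigh = knn.kneighbors(X, return_distance=False)[:, 1:]   # drop each point's self-match
    keep = np.array([(y[nb] == yi).mean() >= 0.5 for nb, yi in zip(neigh, y)])
    return X[keep], y[keep]

# Usage: X_clean, y_clean = edited_nearest_neighbours(X_noisy, y_noisy)
```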

A survey on online kernel selection for online kernel learning

Online kernel selection is fundamental to online kernel learning. In contrast to offline kernel selection, online kernel selection intermixes kernel selection and training at each round of online kernel learning, and requires a sublinear regret bound and low computational complexity. In this paper, we first contrast offline kernel selection with online kernel selection, then survey existing online kernel selection approaches from the perspectives of formulation, algorithms, candidate kernels, computational complexity and regret guarantees, and finally point out some future research directions in online kernel selection.

Regularization: Ridge, Lasso and Elastic Net

In this tutorial, you will get acquainted with the bias-variance trade-off problem in linear regression and how it can be solved with regularization.
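As a quick taste of the three regularizers the tutorial covers, here is a hedged scikit-learn sketch on synthetic data (the dataset and alpha values are illustrative choices, not the tutorial’s):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

# Many noisy features, few informative ones: the setting where regularization pays off
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0),                      # L2: shrinks all coefficients toward zero
              Lasso(alpha=0.1),                      # L1: sets some coefficients exactly to zero
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # a blend of L1 and L2
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean().round(3))
```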

GLMs: link vs. distribution

Usually, when I give a course on GLMs, I try to insist on the fact that the link function is probably more important than the distribution. To illustrate, consider the following dataset with 5 observations.
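To sketch the point in code: the post’s actual 5 observations aren’t reproduced here, so the data below is hypothetical, but holding the log link fixed while swapping distributions typically moves the fitted means only slightly, whereas changing the link moves them far more.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 5-observation dataset (not the post's data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 4.0, 2.0, 6.0])
X = sm.add_constant(x)

log_link = sm.families.links.Log()
for family in (sm.families.Poisson(log_link),
               sm.families.Gaussian(log_link),
               sm.families.Gamma(log_link)):
    fit = sm.GLM(y, X, family=family).fit()
    print(type(family).__name__, fit.fittedvalues.round(3))
# Compare against sm.families.Gaussian(sm.families.links.Identity())
# to see how much more the link choice matters.
```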

Amazon Textract

Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. Many companies today extract data from documents and forms through manual data entry that’s slow and expensive, or through simple OCR software that is difficult to customize. Rules and workflows for each document and form often need to be hard-coded and updated with each change to the form, or when dealing with multiple forms. If the form deviates from the rules, the output is often scrambled and unusable. Amazon Textract overcomes these challenges by using machine learning to instantly ‘read’ virtually any type of document and accurately extract text and data without the need for any manual effort or custom code. With Textract you can quickly automate document workflows, enabling you to process millions of document pages in hours. Once the information is captured, you can take action on it within your business applications to initiate next steps for a loan application or medical claims processing. Additionally, you can create smart search indexes, build automated approval workflows, and better maintain compliance with document archival rules by flagging data that may require redaction.
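A minimal boto3 sketch of the basic text-detection call (the region and file name are placeholder assumptions; AWS credentials must already be configured):

```python
import boto3

textract = boto3.client('textract', region_name='us-east-1')

with open('document.png', 'rb') as f:                 # a local scanned document
    response = textract.detect_document_text(Document={'Bytes': f.read()})

for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        print(block['Text'])

# For the forms/tables analysis described above, analyze_document is the richer call:
# textract.analyze_document(Document={'Bytes': ...}, FeatureTypes=['FORMS', 'TABLES'])
```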

R now supported in Azure SQL Database

Azure SQL Database, the database-as-a-service based on Microsoft SQL Server, now offers R integration. (The service is currently in preview; details on how to sign up for the preview are provided in that link.) While you’ve been able to run R in SQL Server in the cloud since the release of SQL Server 2016 (by running it in a virtual machine), Azure SQL Database is a fully-managed instance that doesn’t require you to set up and maintain the underlying infrastructure. You just choose the size and scale of the database you want to manage, and then connect to it like any other SQL Server instance. (If you want to learn how to set up an Azure SQL database, this Microsoft Learn module is a good place to start.)

Linking Data Science Activities to Business Initiatives Using the Hypothesis Development Canvas

The Hypothesis Development Canvas is an effective and concise tool that integrates the different elements of the ‘Thinking Like A Data Scientist’ process into a single document.

Scikit-Multiflow: A Multi-output Streaming Framework

scikit-multiflow is a framework for learning from data streams and multi-output learning in Python. Conceived to serve as a platform to encourage the democratization of stream learning research, it provides multiple state-of-the-art learning methods, data generators and evaluators for different stream learning problems, including single-output, multi-output and multi-label. scikit-multiflow builds upon popular open source frameworks including scikit-learn, MOA and MEKA. Development follows the FOSS principles. Quality is enforced by complying with PEP8 guidelines, using continuous integration and functional testing. The source code is available at https://…/scikit-multiflow.
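A brief sketch of the framework’s prequential (test-then-train) evaluation loop on a synthetic stream; the class is named HoeffdingTree rather than HoeffdingTreeClassifier in older releases:

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
from skmultiflow.evaluation import EvaluatePrequential

stream = SEAGenerator(random_state=1)        # a synthetic data stream generator
model = HoeffdingTreeClassifier()            # an incremental, stream-ready learner
evaluator = EvaluatePrequential(max_samples=20000, metrics=['accuracy'],
                                show_plot=False)
evaluator.evaluate(stream=stream, model=model, model_names=['HT'])
```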

Multivariate Bayesian Structural Time Series Model

This paper deals with inference and prediction for multiple correlated time series, where one also has the choice of using a candidate pool of contemporaneous predictors for each target series. Starting with a structural model for time series, we use Bayesian tools for model fitting, prediction and feature selection, thus extending some recent works along these lines for the univariate case. The Bayesian paradigm in this multivariate setting helps the model avoid overfitting and capture correlations among multiple target time series with various state components. The model provides needed flexibility in selecting a different set of components and available predictors for each target series. The cyclical component in the model can handle large variations in the short term, which may be caused by external shocks. Extensive simulations were run to investigate properties such as estimation accuracy and performance in forecasting. This was followed by an empirical study with one-step-ahead prediction on the max log return of a portfolio of stocks involving four leading financial institutions. Both the simulation studies and the extensive empirical study confirm that this multivariate model outperforms three other benchmark models, viz. a model that treats each target series as independent, the autoregressive integrated moving average model with regression (ARIMAX), and the multivariate ARIMAX (MARIMAX) model.

Choosing the right Hyperparameters for a simple LSTM using Keras

Building Machine Learning models has never been easier, and many articles out there give a great high-level overview of what Data Science is and the amazing things it can do, or go into depth about a very specific implementation detail. This leaves aspiring Data Scientists, like me a while ago, often looking at Notebooks out there, thinking ‘It looks great and works, but why did the author choose this type of architecture, this number of neurons, or this activation function instead of another?’ In this article, I want to give some intuition on how to make some of these decisions, like finding the right parameters while building a model, demonstrated on a very basic LSTM that predicts the gender of a given first name. Since there are many great courses on the math and general concepts behind Recurrent Neural Networks (RNNs), e.g. Andrew Ng’s deep learning specialization or here on Medium, I will not dig deeper into them and will take this knowledge as given. Instead, we will focus only on the high-level implementation using Keras. The goal is to get a more practical understanding of the decisions one has to make when building a neural network like this, especially of how to choose some of the hyperparameters.
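For orientation, a minimal Keras skeleton of such a model; the vocabulary size, padding length, embedding size and unit count below are illustrative assumptions, and tuning exactly these numbers is what the article is about:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size, max_len = 30, 20   # hypothetical: one id per letter + padding; names padded to 20 chars

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(input_dim=vocab_size, output_dim=32),  # embedding size: a tunable hyperparameter
    LSTM(64),                                        # number of units: the main knob to tune
    Dense(1, activation='sigmoid'),                  # binary output: probability of one gender
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```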

Outlier-Aware Clustering: Beyond K-Means

What we propose is a combined approach: dimensionality reduction tuned for clustering. Yes, I’m sure some of you will still stick to the ol’ PCA/K-Means after reading this article, but I hope you’ll get a new tool in your toolbox that’s just as quick. The approach pairs two techniques that are quickly gaining popularity: UMAP for dimensionality reduction, and HDBSCAN for clustering. We’ve had a lot of success with this combination across multiple projects, in human resources for behavioral archetype definitions and in recommendation systems for customer segmentation.
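The whole pipeline fits in a few lines; here is a hedged sketch on synthetic blobs (all parameter values are illustrative defaults, not the article’s tuned settings):

```python
import umap          # pip install umap-learn
import hdbscan       # pip install hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=6, n_features=30, random_state=42)

# Reduce to a handful of dimensions first, then cluster by density
embedding = umap.UMAP(n_neighbors=15, min_dist=0.0, n_components=5,
                      random_state=42).fit_transform(X)
clusterer = hdbscan.HDBSCAN(min_cluster_size=50).fit(embedding)
labels = clusterer.labels_    # label -1 marks points HDBSCAN flags as outliers/noise
```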

Light on Math Machine Learning: Intuitive Guide to Understanding Decision Trees

This article aims at introducing decision trees, a popular building block of highly praised models such as xgboost. A decision tree is simply a set of cascading questions. When you get a data point (i.e. a set of features and values), you use each attribute (i.e. the value of a given feature of the data point) to answer a question. The answer to each question decides the next question. At the end of this sequence of questions, you end up with a probability of the data point belonging to each class.
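A two-line scikit-learn sketch of that idea (the dataset and depth are arbitrary choices for the demo):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each prediction walks the cascade of questions down to a leaf; the class
# proportions in that leaf are the per-class probabilities described above.
print(tree.predict_proba(X[:1]))
```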

The Startling Power of Synthetic Data

In this article, we are going to talk about this little-known subset of anonymized data that is assisting machine learning in data-scarce environments.

An Overview of ResNet and its Variants

After the celebrated victory of AlexNet [1] at the LSVRC2012 classification contest, the deep Residual Network [2] has arguably been the most groundbreaking work in the computer vision/deep learning community in the last few years. ResNet makes it possible to train networks with up to hundreds or even thousands of layers that still achieve compelling performance. Taking advantage of its powerful representational ability, the performance of many computer vision applications other than image classification, such as object detection and face recognition, has been boosted. Since ResNet blew people’s minds in 2015, many in the research community have dived into the secrets of its success, and many refinements have been made to the architecture. This article is divided into two parts: in the first, I am going to give a little bit of background knowledge for those who are unfamiliar with ResNet; in the second, I will review some of the papers I read recently regarding different variants and interpretations of the ResNet architecture.

Residual blocks – Building blocks of ResNet

Understanding a residual block is quite easy. In traditional neural networks, each layer feeds into the next layer. In a network with residual blocks, each layer feeds into the next layer and also directly into layers about 2-3 hops away. That’s it. But understanding the intuition behind why it was required in the first place, why it is so important, and how similar it looks to some other state-of-the-art architectures is what we are going to focus on. There is more than one interpretation of why residual blocks are awesome and of how and why they are one of the key ideas that let a neural network achieve state-of-the-art performance on a wide range of tasks. Before diving into the details, here is a picture of what a residual block actually looks like.
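In code, the “feeds directly into a layer 2-3 hops away” part is a single addition. A minimal Keras sketch of a basic residual block (a simplification of the original paper’s block, which also uses batch normalization):

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """y = F(x) + x: two conv layers plus the identity shortcut."""
    shortcut = x                 # assumes x already has `filters` channels;
                                 # otherwise project it with a 1x1 convolution
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([shortcut, y])          # the skip connection, 2 hops back
    return layers.Activation('relu')(y)
```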

Separable convolutions – trading little accuracy for huge computational gains

Typically in convolutions, we use a 2D or a 3D kernel filter, where we hope that each filter extracts some kind of feature by convolving across all 2 or 3 dimensions, respectively. Specifically, in the 2D case, we try to extract simple features in the initial layers and more complex features in the later layers. However, if we want, we can factorize a 2D kernel into two 1D kernels, as shown below. Now, we can take these two 1D kernels and apply them one by one (in subsequent layers) on an image instead of applying the original 2D kernel. By doing so, we have reduced the number of parameters used for convolution and now have fewer parameters to train. Also, the order in which we apply these separable kernel filters generally does not matter. To put things into perspective, a 5×5 kernel filter has 25 parameters, whereas two kernels, a 1×5 kernel and a 5×1 kernel, have only 10 parameters in total.
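A small NumPy/SciPy demonstration of the factorization (the binomial kernel here is an illustrative example; the equivalence holds for any kernel that is an outer product of two 1D kernels):

```python
import numpy as np
from scipy.signal import convolve2d

k_col = np.array([1.0, 4.0, 6.0, 4.0, 1.0])[:, None]  # 5x1 kernel (5 parameters)
k_row = k_col.T                                        # 1x5 kernel (5 parameters)
k_2d = k_col @ k_row                                   # their outer product: a 5x5 kernel (25)

img = np.random.rand(8, 8)
out_full = convolve2d(img, k_2d, mode='same')
out_sep = convolve2d(convolve2d(img, k_row, mode='same'), k_col, mode='same')
print(np.allclose(out_full, out_sep))                  # True: two 1D passes == one 2D pass
```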

Deciding optimal filter size for CNNs

In image processing, a kernel, convolution matrix, or mask is a small matrix. It is used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between a kernel and an image.
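For instance, two classic 3×3 kernels and their application via convolution (random pixels stand in for a real grayscale image):

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(16, 16)               # stand-in for a grayscale image

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])          # amplifies the center pixel vs. its neighbours
edges = np.array([[-1, -1, -1],
                  [-1,  8, -1],
                  [-1, -1, -1]])            # responds where intensity changes sharply

sharpened = convolve2d(img, sharpen, mode='same')
outlined = convolve2d(img, edges, mode='same')
```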

Netflix – The journey towards a self-service data platform

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data technologies (e.g. Spark, Flink, Presto, and Druid), enabling services (e.g. federated metadata management and data event triggering), and machine learning support (e.g. end-to-end ML workflow infrastructure with notebook integration). But with power comes complexity. I’ll talk through how we are investing in an easier, ‘self-service’ data platform without sacrificing our enabling capabilities.
In this talk, we will dive into the philosophy, tactics, and technologies behind this transition, for example:
• How we are leveraging GraphQL
• Innovations and plans in the Jupyter / nteract notebook space
• Our philosophy and tech for machine learning infrastructure
• Our approach to and focus on education and user understanding
• Developing a comprehensible data catalog
• Creating ‘virtual teams’ with our technical partners
… and a whole lot more. Join us for lessons learned and key principles to weave into your respective data worlds.

5 Essential Neural Network Algorithms

1. The feedforward algorithm…
2. A common activation algorithm: Sigmoid…
3. The cost function…
4. The back propagation…
5. Applying the learning rate/weight updating… (all five pieces are tied together in the sketch below)
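As a hedged end-to-end illustration of the five pieces above, here is a minimal NumPy network trained on XOR; the layer sizes, seed, iteration count and learning rate are arbitrary demo choices, and a different seed may need more iterations to converge:

```python
import numpy as np

# Toy dataset: XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # hidden layer weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # output layer weights

def sigmoid(z):                                      # 2. the sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0                                             # 5. the learning rate
for step in range(5000):
    h = sigmoid(X @ W1 + b1)                         # 1. feedforward: hidden layer
    out = sigmoid(h @ W2 + b2)                       #    feedforward: output layer
    loss = np.mean((out - y) ** 2)                   # 3. cost function (mean squared error)
    # 4. backpropagation: chain rule through the sigmoids and matrix products
    d_out = 2 * (out - y) * out * (1 - out) / len(X)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # 5. gradient-descent weight update
    W2 -= lr * (h.T @ d_out);  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h);    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())                          # should approach [0, 1, 1, 0]
```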

Field Notes: Building Data Dictionaries

The scariest ghost stories I know take place when the history of data – how it’s collected, how it’s used, and what it’s meant to represent – becomes an oral history, passed down as campfire stories from one generation of analysts to another like a spooky game of telephone. These stories include eerie phrases like ‘I’m not sure where that comes from’, ‘I think that broke a few years ago and I’m not sure if it was fixed’, and the ever-ominous ‘the guy who did that left’. When hearing these stories, one can imagine that a written history of the data has never existed – or if it has, it’s overgrown with ivy and tech-debt in an isolated statuary, never to be used again.

Amazon SageMaker Ground Truth – Build Highly Accurate Datasets and Reduce Labeling Costs by up to 70%

In 1959, Arthur Samuel defined machine learning as a ‘field of study that gives computers the ability to learn without being explicitly programmed’. However, there is no deus ex machina: the learning process requires an algorithm (‘how to learn’) and a training dataset (‘what to learn from’). Today, most machine learning tasks use a technique called supervised learning: an algorithm learns patterns or behaviours from a labeled dataset. A labeled dataset contains data samples as well as the correct answer for each one of them, aka the ‘ground truth’. Depending on the problem at hand, one could use labeled images (‘this is a dog’, ‘this is a cat’), labeled text (‘this is spam’, ‘this isn’t’), etc. Fortunately, developers and data scientists can now rely on a vast collection of off-the-shelf algorithms (as illustrated by the built-in algorithms in Amazon SageMaker) and reference datasets. Deep learning has popularized image datasets such as MNIST, CIFAR-10 and ImageNet, and more are also available for tasks like machine translation or text classification. These reference datasets are extremely useful for beginners and experienced practitioners alike, but a lot of companies and organizations still need to train machine learning models on their own datasets: think about medical imaging, autonomous driving, etc.