GitHub Repo Raider and the Automation of Machine Learning

Since X never, ever marks the spot, this article raids the GitHub repos in search of quality automated machine learning resources. Read on for projects and papers to help understand and implement AutoML.

Complete Data Science Project Template with MLflow for Non-Dummies

Best practices for everyone working either locally or in the cloud, from start-up ninja to big enterprise teams.

Why data scientists still can’t code.

Common knowledge would have you think a data scientist spends the majority of their time building models and evaluating them. This is a falsehood. For many data scientists, the majority of their time is spent developing the data pipelines that are a prerequisite for machine learning. Such pipelines do not come out of thin air, and unless a third-party plug-and-play, drag-and-drop suite is used, they are coded. As such, the code of these pipelines needs to be up to scratch.

Step by step guide to creating a Text-To-Voice-To-Text email assistant

In this article, I will describe how to create your personal email secretary. This email secretary is an application that fetches your email using the Gmail API and reads it out loud using the Google Text-to-Speech API and the ‘playsound’ package. It then records your spoken response using the ‘pyaudio’ package, converts the audio into text using the Google Speech-To-Text API, and finally sends out the response, again using the Gmail API.

The Power of Ensembles in Deep Learning

Introducing DeepStack, a Python package for building Deep Learning Ensembles. Ensemble building is the leading winning strategy for machine learning competitions and often the technique used for solving real-world problems. What often happens is that while solving a problem or participating in a competition you end up with several trained models, each one slightly different from the others, and you end up picking your best model based on your best evaluation score. The truth is that your best model ‘knows less’ about the data than all the other ‘weak models’ combined. Combining several base models to create a more powerful ensemble model then arises as a natural by-product of this workflow.
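
To make the idea concrete, here is a minimal sketch of the simplest ensemble strategy, averaging the class probabilities of several models. It does not use DeepStack's API; the function names and the three sets of model outputs are hypothetical and chosen only for illustration.

```python
from statistics import mean


def average_ensemble(member_probs):
    """Average class probabilities across models.

    member_probs: one entry per model, each a list of per-sample
    probability vectors of identical shape.
    """
    n_samples = len(member_probs[0])
    n_classes = len(member_probs[0][0])
    return [
        [mean(m[i][c] for m in member_probs) for c in range(n_classes)]
        for i in range(n_samples)
    ]


def predict_labels(probs):
    """Pick the most probable class for every sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def accuracy(labels, truth):
    return sum(l == t for l, t in zip(labels, truth)) / len(truth)


# Hypothetical outputs of three weak models on four samples (two classes).
truth = [0, 1, 1, 0]
model_a = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]  # wrong on sample 1
model_b = [[0.4, 0.6], [0.3, 0.7], [0.1, 0.9], [0.8, 0.2]]  # wrong on sample 0
model_c = [[0.8, 0.2], [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]  # wrong on sample 2

ensemble = average_ensemble([model_a, model_b, model_c])
ensemble_accuracy = accuracy(predict_labels(ensemble), truth)
```

In this toy data each model makes one mistake on a different sample, so the averaged prediction recovers all four labels. Real models rarely err this independently, which is why the gain from an ensemble is usually smaller.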

GraphQL, Grafana and Dash

If you’re interested in Data Science, Data Manipulation or Data Visualization, this article is the right one for you. I’m sure you have heard of the names in the title above. In this article I will first go through each of them in some detail and then compare them.

Informational vs. Behavioral: Two Types of Harms in Machine Learning

The types of security and privacy harm enabled by ML fall into roughly two categories: informational and behavioral. Informational harms relate to the unintended or unanticipated leakage of information. Behavioral harms, on the other hand, relate to manipulating the behavior of the model itself, impacting the predictions or outcomes of the model. We describe the specific ‘attacks’ that constitute these types of harms below, viewing each such attack as a warning sign of future, more widely known and exploited vulnerabilities associated with ML.

Using Gradient Boosting for Time Series prediction tasks

Time series prediction problems are pretty frequent in the retail domain. Companies like Walmart and Target need to keep track of how much product should be shipped from Distribution Centres to stores. Even a small improvement in such a demand forecasting system can help save a lot of dollars in terms of workforce management, inventory cost and out-of-stock loss. While there are many techniques to solve this particular problem, like ARIMA, Prophet, and LSTMs, we can also treat it as a regression problem and use trees to solve it. In this post, we will try to solve the time series problem using XGBoost. The main things I am going to focus on are the sort of features such a setup takes and how to create such features.
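
As a rough sketch of what such features can look like, the snippet below turns a plain series into a regression table of lagged values plus a rolling mean. The function name, the choice of lags and the sales figures are assumptions made for illustration, not the linked post's feature set; the resulting rows could be handed to any regressor, XGBoost included.

```python
def make_lag_features(series, lags=(1, 2, 7), window=3):
    """Build (X, y) for regression from a univariate series.

    Each row of X holds the values `lag` steps before the target,
    followed by the mean of the `window` values preceding the target.
    Only past values are used, so no future information leaks in.
    """
    start = max(max(lags), window)
    X, y = [], []
    for t in range(start, len(series)):
        row = [series[t - lag] for lag in lags]
        row.append(sum(series[t - window:t]) / window)
        X.append(row)
        y.append(series[t])
    return X, y


# Hypothetical daily unit sales for one store.
sales = [10, 12, 13, 15, 14, 16, 18, 20, 19, 21]
X, y = make_lag_features(sales, lags=(1, 2), window=3)
# First row: the lag-1 value 13, the lag-2 value 12, and the mean of 10, 12, 13.
```

The first few observations are dropped because they have no complete history. With a weekly lag such as 7, that means losing the first week of data.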

Illustrated: Self-Attention

Step-by-step guide to self-attention with illustrations and code. What do BERT, RoBERTa, ALBERT, SpanBERT, DistilBERT, SesameBERT, SemBERT, MobileBERT, TinyBERT and CamemBERT all have in common? And I’m not looking for the answer ‘BERT’. Answer: self-attention. We are not only talking about architectures bearing the name ‘BERT’, but more correctly Transformer-based architectures. Transformer-based architectures, which are primarily used in modelling language understanding tasks, eschew the use of recurrence in neural networks and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. But what’s the math behind this?
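
For a taste of that math, here is a small pure-Python sketch of scaled dot-product self-attention: project the inputs into queries, keys and values, score every query against every key, normalise the scores with a softmax, and use them to mix the values. The helper names, the three toy tokens and the identity projection matrices are assumptions for illustration, not the linked guide's code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (output, attention_weights) for inputs x (one row per token)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # one score per (query, key) pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Three toy tokens with two features each; identity projections keep
# the example easy to follow by hand.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token draws from every other token. Here the first token attends equally to itself and the third token, and less to the second, because its query has no overlap with the second token's key.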

Knowledge Distillation – A technique developed for compacting and accelerating Neural Nets

Recent years have witnessed marked growth in the deep learning industry. Since AlexNet’s breakthrough in the ImageNet competition in 2012, more advanced deep neural networks (ResNet-50, VGGNet, and many others) have been developed and keep rewriting the records. However, these networks require heavy computation to generate good results. For example, AlexNet has 62.3 million parameters and takes about 6 days to train over 90 epochs using two Nvidia GeForce GTX 580 GPUs. Hence, the potential of these networks can only be fully realised with heavy computation. In contrast, the majority of the mobile devices used in our daily life have tight constraints on storage and computational resources, which prevents them from taking full advantage of deep neural networks. As a result, to maintain similar accuracy within these constraints, a technique called knowledge distillation was developed for compacting and accelerating neural networks. In this post, we will focus on knowledge distillation and how it can be used to solve this problem.
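
One common way to train the small network is to blend two objectives: match the large network's temperature-softened output distribution, and still predict the true label. The sketch below follows the widely used formulation from Hinton et al. (2015) rather than anything specific to the linked post; the function names, the default `temperature` and `alpha`, and the example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher values give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    soft: KL divergence between the softened teacher and student outputs,
          scaled by temperature**2 to keep gradient magnitudes comparable.
    hard: cross-entropy of the student against the true label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Hypothetical logits for one three-class sample.
teacher_logits = [5.0, 1.0, 0.0]
student_logits = [2.0, 1.5, 0.5]
loss = distillation_loss(student_logits, teacher_logits, true_label=0)
```

The softened teacher output carries information about how similar the wrong classes are to the right one, which a one-hot label cannot express. That extra signal is what the student is meant to pick up.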

Entropy – The Pillar of both Thermodynamics and Information Theory

Entropy is a vague yet powerful term that forms the backbone of many key ideas in Thermodynamics and Information Theory. It was first identified by physical scientists in the 19th century and acted as a guiding principle for many of the Industrial Revolution’s revolutionary technologies. However, the term also helped spark the Information Age when it appeared in mathematician Claude Shannon’s groundbreaking work A Mathematical Theory of Communication. So how can one term be responsible for two breakthroughs, about a century apart, in related yet disparate fields?
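
On the information theory side, Shannon's entropy of a source is H = -Σ p(x) log2 p(x), measured in bits. A minimal sketch that estimates it from observed symbol frequencies follows; the function name and the sample strings are illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Entropy in bits of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


certain = shannon_entropy("aaaa")     # one possible symbol: 0 bits
fair_coin = shannon_entropy("abab")   # two equally likely symbols: 1 bit
four_way = shannon_entropy("abcd")    # four equally likely symbols: 2 bits
```

The more evenly the probability is spread across outcomes, the higher the entropy, and the more bits are needed on average to encode each symbol.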

Why AutoML is An Essential New Tool For Data Scientists

Machine learning (ML) is the current paradigm for modeling statistical phenomena by harnessing algorithms that exploit computer intelligence. It is commonplace to build ML models that predict housing prices, group users by their potential marketing interests, and use image recognition techniques to identify brain tumors. However, up until now these models have required scrupulous trial and error to optimize model performance on unseen data. Automated machine learning (AutoML) aims to reduce the time and expertise required by offering well-designed pipelines that handle data preprocessing, feature selection, and model creation and evaluation. While AutoML may initially appeal only to enterprises that want to harness the power of ML without consuming precious budgets and hiring skilled data practitioners, it also holds strong promise of becoming an invaluable tool for the experienced data scientist.