NYC buses: C5.0 classification with R; more than 20 minute delay?

We are continuing on with our NYC bus breakdown problem. When we left off, we had constructed a rule-based Cubist regression model with our expanded pool of predictors; but we were still only managing to explain 37% of the data’s variance with our model. Given how ‘dirty’ the target variable ‘time_delayed’ is (because it is human reported and of dubious precision), we decided that perhaps we should rephrase the question in order to get a more sensible answer. When a bus breakdown is called into operations, perhaps the question to ask is: ‘Will this delay exceed twenty minutes?’

Interpretability is crucial for trusting AI and machine learning

As machine learning takes its place in many recent advances in science and technology, the interpretability of machine learning models grows in importance.

What hyper-parameters are, and what to do with them; an illustration with ridge regression

This blog post is an excerpt of my ebook Modern R with the tidyverse that you can read for free here. This is taken from Chapter 7, which deals with statistical models. In the text below, I explain what hyper-parameters are, and as an example I run a ridge regression using the {glmnet} package. The book is still being written, so comments are more than welcome!

Linear, Quadratic, and Regularized Discriminant Analysis

Discriminant analysis encompasses methods that can be used for both classification and dimensionality reduction. Linear discriminant analysis (LDA) is particularly popular because it is both a classifier and a dimensionality reduction technique. Quadratic discriminant analysis (QDA) is a variant of LDA that allows for non-linear separation of data. Finally, regularized discriminant analysis (RDA) is a compromise between LDA and QDA. This post focuses mostly on LDA and explores its use as a classification and visualization technique, both in theory and in practice. Since QDA and RDA are related techniques, I shortly describe their main properties and how they can be used in R.

Automatic (slides) for the people

Creating and saving multiple plots to Powerpoint

Syncing your Jupyter Notebook Charts to Microsoft Word Reports

You’re doing a big data analysis in your Jupyter Notebook. You’ve got tons of charts and you want to report on them. Ideally, you’d create your final report in the Jupyter notebook itself, with all its fancy markdown features and the ability to keep your code and reporting all in the same place. But here’s the rub: most people still want Word document reports, and don’t care about your code, reproducibility, etc. When reporting it’s important to give people the information in a format most useful to them.

Python and Slack: A Natural Match

How to send Slack messages, post plots, and monitor machine learning models programmatically with Python

Explained: GPipe – Training Giant Neural Nets using Pipeline Parallelism

In recent years the size of machine learning datasets and models has been constantly increasing, allowing for improved results on a wide range of tasks. At the same time hardware acceleration (GPUs, TPUs) has also been improving but at a significantly slower pace. The gap between model growth and hardware improvement has increased the importance of parallelism, i.e. training a single machine learning model on multiple hardware devices. Some ML architectures, especially small models, are conducive to parallelism and can be divided quite easily between hardware devices, but in large models synchronization costs lead to degraded performance, preventing them from being used.

Clean your data with unsupervised machine learning

Cleaning data does not have to be painful! This post is a quick example of how to use unsupervised machine learning to clean through a mountain of messy text data, using real-life data.

How To Structure Machine Learning Projects

This article is not to show you what machine learning algorithms to learn and explain the nitty-gritty of the models to you.

Advances in few-shot learning: a guided tour

Few-shot learning is an exciting field of machine learning right now. The ability of deep neural networks to extract complex statistics and learn high level features from vast datasets is proven. Yet current deep learning approaches suffer from poor sample efficiency in stark contrast to human perception?-?even a child could recognise a giraffe after seeing a single picture. Fine-tuning a pre-trained model is a popular strategy to achieve high sample efficiency but it is a post-hoc hack. Can machine learning do better?

Advances in few-shot learning: reproducing results in PyTorch

Few-shot learning is an exciting field of machine learning which aims to close the gap between machine and human in the challenging task of learning from few examples. In my previous post I provided a high level summary of three cutting edge papers in few-shot learning?-?I assume you’ve either read that, are already familiar with these papers or are in the process of reproducing them yourself. In this post I will guide you through my experience in reproducing the results of these papers on the Omniglot and miniImageNet datasets, including some of the pitfalls and stumbling blocks on the way. Each paper has its own section in which I provide a Github gist with PyTorch code to perform a single parameter update on the model described by the paper. To train the model just have to put that function inside a loop over the training data. Less interesting details such as dataset handling are omitted for brevity.

Hypothesis Testing: how to determine significance

Hypothesis testing can be confusing. The language used to express results is cumbersome, and correctly interpreting the p-value may be tricky. Where do you set the threshold to determine whether your results are significant, and are they even practical? In this article, I will show you how to perform a complete statistical hypothesis test using data from the Northwind database. The

Effective Data Science Presentations

If you’re new to the field of Data Science, I wanted to offer some tips on how to transition from presentations you gave in academia to creating effective presentations for industry. Unfortunately, if your background is of the math, stats, or computer science variety, no one probably prepared you for creating an awesome presentation in industry. And the truth is, it takes practice. In academia, we share tables of t-stats and p-values and talk heavily about mathematical formulas. That is basically the opposite of what you’d want to do when presenting to a non-technical audience. If your audience is full of a bunch of STEM PhD’s then have at it, but in many instances we need to adjust the way we think about presenting our technical material.

Graph Theory – On To Network Theory

Network theory is the application of graph-theoretic principles to the study of complex, dynamic interacting systems. It provides techniques for further analyzing the structure of interacting agents when additional, relevant information is provided. The applications of network theory, as stated in the articles leading up to this piece (3), are far-reaching & industry-agnotisc. From computer science, to electrical engineering, to game-theory & social media analysis, the fundamentals of network theory offer a powerful mental model to increase our understanding of modern systems.

Graph Theory – Basic Properties

Let’s take a step back in order to take a few more forward in our walk through the basics of graph theory. The previous article in this series mainly revolved around explaining & notating something labeled a simple graph. We’ll now circle back to highlight the properties of a simple graph in order to provide a familiar jump-off point for the rest of this article.

Graph Theory – Set & Matrix Notation

In this article, in contrast to the opening piece of this series, we’ll work though graph examples. The first example graph we’ll review contains specific properties that classify it as a simple graph. Simple graphs are graphs whose vertices are unweighted, undirected, & exclusive of multiple edges & self-directed loops. It’s okay to not understand the previous terms as we’ll cover graph properties extensively in the following article.

Data Pre Processing Techniques You Should Know

As you know data can be very intimidating for a data scientist. If you have a dataset in your hand, and if you are a data scientist on top of that, then you kind of start thinking of varies stuff you can do to the raw dataset you have in your hand. Well, that’s the nature of a data scientist. So I am still in the learning process of becoming a data scientist. I am trying to fill up my mind with varies data preprocessing techniques because these techniques are very essential to know if you want to play with data.

Multitask learning: teach your AI more to make it better

Hi everyone! Today I want to tell you about the topic in machine learning that is, on one hand, very research oriented and supposed to bring machine learning algorithms to more human-like reasoning and, on the other hand, is very well known to us from basics of machine learning, but very rarely interpreted as I want to show it today. It’s called multitask learning and it has (almost) nothing to do with multitasking as on the picture above. In this post, I will show what is multitask learning for humans and algorithms, how researchers today already apply this concept, how you can use it for any problem of yours to increase your models’ performance. And, the most important, I’ll provide you with source code and explanations of 4 use cases (emotion recognition from the photos, movement identification with accelerometer data, siamese network boosting and solving Numerai challenge: all using multitask learning!), that you can use as a template (and, I hope, inspiration) for your own projects!

An Overview of Multi-Task Learning in Deep Neural Networks

This post gives a general overview of the current state of multi-task learning.

Standard Performance Evaluation Corporation

The Standard Performance Evaluation Corporation (SPEC) is anon-profit corporation formed to establish, maintain and endorsestandardized benchmarks and tools to evaluate performance and energyefficiency for the newest generation of computing systems. SPEC develops benchmarksuites and also reviews and publishes submitted results from our memberorganizations and other benchmark licensees.

Performance Web Sites

1. Performance Measurement Organizations:
2. Benchmark Databases:
3. Usenet
4. Web Directories
5. SPEC Mirror Sites