How to Perform Ordinal Logistic Regression in R

In this article, we discuss the basics of ordinal logistic regression and its implementation in R. Ordinal logistic regression is a widely used classification method, with applications in variety of domains. This method is the go-to tool when there is a natural ordering in the dependent variable. For example, dependent variable with levels low, medium, high is a perfect context for application of logistic ordinal regression.

radian: a modern console for R

Whenever I’m developing R code or writing data wrangling or analysis scripts for research projects that I work on I use Emacs and its add-on package Emacs Speaks Statistics (ESS). I’ve done so for nigh on a couple of decades now, ever since I switched full time to running Linux as my daily OS. For years this has served me well, though I wouldn’t call myself an Emacs expert; not even close! With a bit of help from some R Core coding standards document I got indentation working how I like it, I learned to contort my fingers in weird and wonderful ways to execute a small set of useful shortcuts, and I even committed some of those shortcuts to memory. More recently, however, my go-to methods for configuring Emacs+ESS were failing; indentation was all over the shop, the smart _ stopped working or didn’t work as it had for over a decade, syntax highlighting of R-related files, like .Rmd was hit and miss, and polymode was just a mystery to me. Configuring Emacs+ESS was becoming much more of a chore, and rather unhelpfully, my problems coincided with my having less and less time to devote to tinkering with my computer setups. Also, fiddling with this stuff just wasn’t fun any more. So, in a fit of pique following one to many reconfiguration sessions of Emacs+ESS, I went in search of some greener grass. During that search I came across radian, a neat, attractive, simple console for working with R.

Parametric survival modeling

Survival analysis is used to analyze the time until the occurrence of an event (or multiple events). Cox models – which are often referred to as semiparametric because they do not assume any particular baseline survival distribution – are perhaps the most widely used technique; however, Cox models are not without limitations and parametric approaches can be advantageous in many contexts. For instance, parametric survival models are essential for extrapolating survival outcomes beyond the available follow-up data. For this reason they are nearly always used in health-economic evaluations where it is necessary to consider the lifetime health effects (and costs) of medical interventions. R contains a large number of packages related to biostatistics and its support for parametric survival modeling is no different. Below we will examine a range of parametric survival distributions, their specifications in R, and the hazard shapes they support. We will then show how the flexsurv package can make parametric regression modeling of survival data straightforward.

Some Things I Wish I Had Known Before Scaling Machine Learning Solutions: Part II

This is the second part of an article highlighting some key best practices in the implementation of machine learning solutions. The first part highlighted some of the key differences between the lifecycle of machine learning solutions and traditional software application which are conducive to very unique challenges. Today, I would like to include a few more relevant challenges and solutions that are constantly faced by data science teams when delivering machine learning solutions at a decent scale. To pick up where we left off, let’s continue with the challenges related to data experimentation.

Some Things I Wish I Had Known Before Scaling Machine Learning Solutions: Part I

Recently, I’ve been touring different conferences presenting a talk about best practices for implementing large scale machine learning solutions. The idea is to present a series of non-obvious ideas that result incredibly practical in the implementation of machine intelligence applications in the real world. All the lessons have been based on our experiences at Invector Labs working with large organizations and ambitious startups in the implementation of machine learning capabilities. During those exercises, we quickly realized that many of our assumptions of machine learning apps were really flawed and that there was a huge gap between the advancements in AI research and the practical viability of those ideas. In this two-part article, I would like summarize some of those ideas that hopefully will result valuable to machine learning practitioners and aspirational data scientists. There are many challenges that surface in the implementation of real world machine learning solutions. Most of them are related to the mismatch between the lifecycle of machine learning programs and traditional software applications. With some exceptions, traditional software applications follow a relatively sequential model from design to production. Machine learning models, on the other hand, follow a circular lifecycle that include aspects such as regularization or optimization that have no equivalent in the current toolset of traditional software applications.

Which statistical test should you use?

A statistical test is a tool to make a quantitative decision about sampled data. Although these tests are strongly related to statistical inference, one should keep in his mind that these tests are designed neither to prove nor to falsify the hypothesis. This is the reason there is no such an expression /conclusion in statistics as:
‘There is sufficient evidence at the alpha level of significance to reject the claim given in the alternative hypothesis, so the alternative hypothesis is accepted. ‘
Instead, there are two possible conclusions after employing statistical tests as:
• There is sufficient evidence at the alpha level of significance to reject the claim given in the null hypothesis, so the null hypothesis can be rejected.
• There is no sufficient evidence at the alpha level of significance to reject the claim given in the null hypothesis, so the null hypothesis cannot be rejected.

Quick Start to Gaussian Process Regression

A quick guide to understanding Gaussian process regression (GPR) and using scikit-learn’s GPR package.

How to bridge Machine Learning and Software Engineering

There is a growing frustration within the data science/machine learning community to see yet another PoC (Proof of Concept) create promising results but never turning into something impactful. The main reason: a large gap between development and deployment. There is an increasing number of tools and frameworks – open source as well as commercial – that promise to bridge this gap. However, introducing them into an organization is a time-consuming and expensive endeavor. Fighting for such solutions is honorable and necessary. Still, there is something else we can do to narrow the gap in the meantime: aligning our workflow with software development principles. Number one on things to learn: unit tests. Unfortunately, most introductions and documentations deal with this topic in an abstract way. In contrast, this blog walks you through a practical workflow with a specific example. Let’s get started!

Data Visualization With MatPlotLib Using Python

People are consuming and generating huge volumes of data knowingly and unknowingly on a daily basis. It is this bombardment of digital information is what current businesses are trying to tap and harness to sell and engage their customers more. All types of Industries are bringing a personal touch into their services and offerings to give awesome user experience to their customers. All these have become possible due to powerful Data science enabled AI/ML techniques which are empowering our machines, allowing them to take analytical decisions based on a sea of data accessible to them.

Difference between Local Response Normalization and Batch Normalization

A short tutorial on different normalization techniques used in Deep Neural Networks.

Intelligent, realtime and scalable video processing in Azure

In this tutorial, an end to end project is created in order to do intelligent, realtime and scalable video processing in Azure. In this, a capability is created that can detect graffiti and identify wagon numbers using videos of trains. Properties of the project are as follows:
• Intelligent algorithms to detect graffiti and identify wagon numbers
• Realtime and reliable way of processing videos from edge to cloud
• Scalable for exponential growth of number of videos
• Functional project that can be optimized to any video processing capability

Behind The Models: Dirichlet – How Does It Add To 1?

Building Blocks For Non-Parametric Bayesian Models.

Diving into DALI: How to Use NVIDIA’s GPU-Optimized Image Augmentation Library

Deep learning image augmentation pipelines typically offer speed or flexibility, but never both at the same time. Computationally efficient, production-ready computer vision pipelines tend to be written in C++ and require developers to specify all the nuts and bolts of image transform algorithms to such an extent that these pipelines end up not terribly amenable to further on-the-fly tweaking. On the other end of the spectrum, popular Python libraries like Pillow offer high-level APIs that let practitioners choose from seemingly unlimited combinations of tweaks that can be applied to a vast repository of image transform algorithms. Unfortunately, this freedom carries with it the cost of a steep drop-off in performance. The DALI Library attempts to give practitioners the best of both worlds. Its image transform algorithms are themselves written in C++ code that squeezes every last drop of performance out of NVIDIA GPU chips, making it possible to perform image transforms in parallel on a per-batch basis, across however many GPUs a user has access to. The C++ source code is wired up to a user-friendly Python API, through which practitioners can define image transform pipelines that play nice with both the PyTorch and TensorFlow frameworks. In an attempt to ascertain whether DALI indeed delivers both the speed and flexibility it advertises, I spent the better part of one week running a series of my own experiments with the library. Spoiler alert: while DALI absolutely brings the speed, flexibility is still somewhat lacking.

How to initialize a Neural Network

Training a neural net is far from being a straightforward task, as the slightest mistake leads to non-optimal results without any warning. Training depends on many factors and parameters and thus require a thoughtful approach. It is known that the beginning of training (i.e., the first few iterations) is very important. When done improperly, you get bad results – sometimes, the network won’t even learn anything at all! For this reason, the way you initialize the weights of the neural network is one of the key factors to good training. The goal of this article is to explain why initialization is impacting and present a different number of ways to implement it efficiently. We will test our approaches against practical examples. The code uses the fastai library (based on pytorch) and lessons from the last fastai MOOC (which, by the way, is really great!). All experiment notebooks are available in this github repository.

How to set up hypotheses

Some readers asked for an easy fact-based (no uncertainty or probability) example to accompany my article ‘Never start with a hypothesis’ to show you how setting up the decision context works. Your wish is my command! Let’s play through two scenarios with two different default actions to see how it works. Imagine I’ve just gotten a call from my friend: ‘Shall we go out tonight?’