What moved the median?

This post describes how I, together with some very able colleagues and our director, built a system that tries to identify automatically, with fairly high consistency, the reason for an undesirable shift in the median onload time of a web page (http://…/navigation-timing-2). The project met with great reviews and good success.

How NASA experiments with knowledge discovery

NASA is using big data to make complex knowledge more readily available. Learn how graph visualization can help turn a large corpus of documents into concrete insights.

Calculus on Computational Graphs: Backpropagation

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it just goes by different names. In fact, the algorithm has been reinvented at least dozens of times in different fields (see Griewank (2010)). The general, application independent, name is “reverse-mode differentiation.” Fundamentally, it’s a technique for calculating derivatives quickly. And it’s an essential trick to have in your bag, not only in deep learning, but in a wide variety of numerical computing situations.
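
To make the idea of reverse-mode differentiation concrete, here is a minimal sketch in Python. The `Var` class, the `backward` function, and the example expression e = (a + b) * (b + 1) are illustrative assumptions, not code from the linked post. Each node stores its local derivatives, and a single backward sweep from the output yields the derivative with respect to every input.

```python
class Var:
    """A scalar node in a computational graph (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(x + y)/dx = 1 and d(x + y)/dy = 1
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(x * y)/dx = y and d(x * y)/dy = x
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every node."""
    order, seen = [], set()

    def visit(node):
        # Depth-first search that lists parents before their children.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output towards the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


a = Var(2.0)
b = Var(1.0)
e = (a + b) * (b + 1.0)
backward(e)
print(e.value, a.grad, b.grad)
```

The backward sweep visits each edge of the graph once, so the cost of obtaining all input derivatives is proportional to the cost of evaluating the expression itself.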

Bayesian Correlation with PyMC

In this notebook, I show how to determine a correlation coefficient within the Bayesian framework in both a simple and a robust way. The correlation can be seen as a direct alternative to the traditional Pearson correlation coefficient.

Setting up a quota watcher agent in Python

This is the documentation for the QuotaWatcher utility, a small cron job developed to monitor disk usage on our servers. In this post I explain how the agent works, what steps are needed to build it, and how it can be improved. Please feel free to comment and add your input.
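
The core check of such an agent might look like the sketch below. It is not the QuotaWatcher code: the function names and the 90% threshold are assumptions made for illustration, and only the standard-library call `shutil.disk_usage` is used.

```python
import shutil


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def check_quota(path, threshold=0.9):
    """Return a warning string if usage reaches threshold, else None.

    The default threshold of 0.9 is an arbitrary, assumed value.
    """
    fraction = usage_fraction(path)
    if fraction >= threshold:
        return f"WARNING: {path} is {fraction:.0%} full"
    return None


if __name__ == "__main__":
    message = check_quota(".")
    if message is not None:
        print(message)
```

A script like this could be scheduled with cron and extended to send the warning by email instead of printing it.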

Sentiment Analysis With Word2vec

This tutorial covers sentiment analysis with Word2vec and logistic regression. It is written for programmers, but assumes knowledge of only basic mathematical concepts. Its purpose is to demonstrate how Word2vec can be used for opinion mining on text in the wild. Sentiment has obvious applications for market research, business intelligence, product development, reputation management, political campaigns and sociological studies. Broadly speaking, sentiment analysis has three stages: tokenization, feature extraction and classification. The first divides a document up into words, the second creates representations of words or documents, the third learns to bucket them by type. For this tutorial, we are interested in feature extraction and classification on the document level.
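
The three stages can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not the tutorial's code: the hand-written two-dimensional `TOY_VECTORS` stand in for trained Word2vec embeddings, the four training documents are made up, and the logistic regression is a bare stochastic gradient descent loop.

```python
import math
import re

# Hypothetical 2-D word vectors standing in for trained Word2vec embeddings.
TOY_VECTORS = {
    "good": [1.0, 0.0], "great": [0.9, 0.1],
    "bad": [-1.0, 0.0], "awful": [-0.9, 0.1],
    "movie": [0.0, 1.0], "film": [0.0, 0.9],
}


def tokenize(text):
    """Stage 1: split a document into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def featurize(tokens):
    """Stage 2: represent a document as the mean of its known word vectors."""
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, epochs=500, lr=0.5):
    """Stage 3: fit logistic regression with stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(text, weights, bias):
    """Return the estimated probability that text is positive."""
    x = featurize(tokenize(text))
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up training documents: label 1 is positive, 0 is negative.
train_docs = [("a good movie", 1), ("great film", 1),
              ("a bad movie", 0), ("awful film", 0)]
X = [featurize(tokenize(doc)) for doc, _ in train_docs]
y = [label for _, label in train_docs]
weights, bias = train(X, y)

print(predict("great movie", weights, bias))
print(predict("bad film", weights, bias))
```

With real data, the toy vectors would be replaced by embeddings learned from a large corpus, while the tokenization, averaging, and classification steps keep the same shape.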

Evaluating Logistic Regression Models in R

This post provides an overview of performing diagnostic and performance evaluation on logistic regression models in R. After training a statistical model, it’s important to understand how well that model did with regard to its accuracy and predictive power. The following content provides the background and theory needed to ensure that the right techniques are used for evaluating logistic regression models in R.

Likelihood-free inference in high-dimensional models

The recently arXived paper “Likelihood-free inference in high-dimensional models”, by Kousathanas et al. (July 2015), proposes an ABC resolution of the dimensionality curse [when the dimension of the parameter and of the corresponding summary statistics] by turning Gibbs-like and by using a component-by-component ABC-MCMC update that allows for low dimensional statistics. In the (rare) event there exists a conditional sufficient statistic for each component of the parameter vector, the approach is just as justified as when using a generic ABC-Gibbs method based on the whole data. Otherwise, that is, when using a non-sufficient estimator of the corresponding component (as, e.g., in a generalised [not general!] linear model), the approach is less coherent as there is no joint target associated with the Gibbs moves. One may therefore wonder at the convergence properties of the resulting algorithm. The only safe case [in dimension 2] is when one of the restricted conditionals does not depend on the other parameter. Note also that each Gibbs step a priori requires the simulation of a new pseudo-dataset, which may be a major imposition on computing time. And that setting the tolerance for each parameter is a delicate calibration issue because in principle the tolerance should depend on the other component values.

Revolution R Enterprise Now Available in the Cloud on Azure Marketplace

Revolution is excited to announce the availability of its latest release, Revolution R Enterprise 7.4.1 (RRE), as a technical preview on Microsoft Azure via Windows- and Linux-based virtual machines in the Azure Marketplace. Through Azure’s worldwide cloud infrastructure, customers now have on-demand access to high-performance predictive analytics to accelerate growth, optimize operations, and expedite data insight and discovery from any place and at any time. Availability in the Azure Marketplace is the first step in Microsoft’s plan to integrate Revolution’s products with Azure and, in the bigger picture, Cortana Analytics.

GEOSTAT 2015: a write-up

The week before last I attended the GEOSTAT summer school in Lancaster. GEOSTAT is an annual week-long meeting devoted to ‘geostatistics’ (or ‘spatial statistics’ – we’ll come on to the difference subsequently).

API for Prediction and Machine Learning: poll results and analysis

APIs are sets of procedures that provide easy-to-use, automated, robust solutions to recurring programming challenges. Here, we analyze the major players in the big data domain that provide machine learning APIs.

Bayesian regression models using Stan in R

It seems the summer is coming to an end in London, so I shall take a final look at the ice cream data that I have been playing around with for the last couple of weeks to predict sales based on temperature. Here I will use the new brms (GitHub, CRAN) package by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models. Additionally, I am interested in predicting how much ice cream I should hold in stock for a hot day at 35°C, such that I only run out of ice cream with a probability of 2.5%.
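
The stock question reduces to reading off the 97.5% quantile of the posterior predictive distribution of sales at 35°C. The post itself works in R with brms; the Python sketch below only illustrates that last step. The draws are simulated from an arbitrary normal distribution as placeholders and are not results from the post.

```python
import random
import statistics

random.seed(1)

# Placeholder draws standing in for posterior predictive samples of units
# sold at 35°C. The mean and spread are arbitrary assumptions.
draws = [random.gauss(500.0, 50.0) for _ in range(10_000)]

# With n=40, statistics.quantiles returns 39 cut points; the last one is
# the 97.5% quantile.
stock = statistics.quantiles(draws, n=40)[-1]

# Share of simulated days on which demand exceeds the stock level.
shortfall = sum(d > stock for d in draws) / len(draws)

print(f"stock level: {stock:.0f} units, run-out probability: {shortfall:.3f}")
```

Holding stock at this quantile means that simulated demand exceeds it in roughly 2.5% of the draws, which is the run-out probability asked for.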