Comparing Waic (or loo, or any other predictive error measure)

And all this is reminding me that we’d like to add an Anova-like feature for comparing multiple models; in that paper we present methods of computing Waic or loo for one model, or comparing two models, so we should really also present the general comparison of multiple model fits.

Elections, visual

On October 18, 2015 Swiss voters will elect a new Parliament for the next four years. There are some very useful and also beautiful visual tools that help voters to get informed about developments in the political landscape and about candidates.

Illustrating Spurious Regressions

Let’s begin by reviewing what is usually meant when we talk about a ‘spurious regression’.
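
As a rough illustration of the usual meaning (regressing one independent random walk on another and still finding a "significant" slope), here is a minimal simulation sketch. The sample size, the number of simulations, the 1.96 cut-off, and all function names are assumptions chosen for this example, not taken from the post.

```python
import math
import random


def random_walk(n, rng):
    """Cumulative sum of independent standard normal shocks."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(n, rng):
    """Independent standard normal draws (a stationary benchmark)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def slope_t_stat(x, y):
    """t-statistic of the slope in an OLS regression of y on x with intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err


def rejection_rate(make_series, n_obs=100, n_sims=500, seed=0):
    """Share of simulations in which two independent series look 'related'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = make_series(n_obs, rng)
        y = make_series(n_obs, rng)
        if abs(slope_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("random walks:", rejection_rate(random_walk))
    print("white noise: ", rejection_rate(white_noise))
```

With independent white-noise series the rejection rate should sit near the nominal 5%, while for independent random walks it is typically far higher. That gap is the spurious regression problem.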

Linear Algebra Formulas for Econometrics

The purpose of this post is to outline the linear algebra of some popular regression strategies. It is essentially an extremely short summary of parts of Jeffrey Wooldridge’s authoritative econometrics textbook, the text that I use most often.

All you need to know about Hadoop

Named after a kid’s toy elephant and initially recognized as a technical problem, today it drives a market that’s expected to be worth $50 billion by 2020. It has been the most talked-about technology since its inception, as it allows some of the world’s largest companies to store and process data sets on clusters of commodity hardware.

R, Open Source and GSoC (Google Summer of Code)

There’s a plethora of generic content regarding GSoC preparation already available on the internet. This blog post will concentrate on specific tips which should help anyone looking to get involved with The R-Project, or any R-centric Open-Source project, begin their journey.

The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.

Introduction to Statistics using Python

Inspired by Allen Downey’s books Think Stats and Think Bayes, this is an attempt to learn Statistics using an application-centric programming approach.

Mistaking a Data Library for a Data Lake: 7 best practices for developing your Hadoop data strategy

1. Have a librarian
2. Build a data catalog
3. Develop protocols for subscription to content
4. Promote sharing and reuse of content
5. Establish lineage of your digital assets
6. Develop a process to procure your data assets
7. Monitor the quality of your assets

Probabilistic bug hunting

Have you ever run into a bug that, no matter how carefully you try to reproduce it, only happens sometimes? And then you think you’ve got it, and have finally solved it, and you test a couple of times without any manifestation. How do you know that you have tested enough? Are you sure you were not ‘lucky’ in your tests? In this article we will see how to answer those questions, and the math behind the answers, without going into too much detail. This is a pragmatic guide.
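
One common way to frame the "have I tested enough?" question is sketched below. It assumes that test runs are independent and that the bug, if still present, shows up in each run with a known probability p_fail (in practice, estimated from runs made before the fix). Under those assumptions, n clean runs in a row happen by luck with probability (1 - p_fail)^n. The function names are hypothetical, and the article itself may use a different formulation.

```python
import math


def chance_of_lucky_streak(p_fail, n_runs):
    """Probability of n_runs clean runs in a row if the bug is still there."""
    return (1.0 - p_fail) ** n_runs


def runs_needed(p_fail, confidence):
    """Smallest number of consecutive clean runs such that a lucky streak
    would have probability at most 1 - confidence."""
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # A bug that used to appear in about 10% of runs:
    print(runs_needed(0.10, 0.95))
    print(chance_of_lucky_streak(0.10, 5))
```

The rarer the bug, the more clean runs are needed: a couple of passing tests says very little about a bug that used to show up only once in ten runs.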

RuleFit: When disassembled trees meet Lasso

The RuleFit algorithm from Friedman and Popescu is an interesting regression and classification approach that uses decision rules in a linear model. RuleFit is not a completely new idea, but it combines a bunch of algorithms in a clever way. RuleFit consists of two components: the first component produces ‘rules’ and the second component fits a linear model with these rules as input (hence the name ‘RuleFit’). The cool thing about the algorithm is that the produced model is highly interpretable, because the decision rules have an easily understandable format, but you still have a flexible enough approach to capture complex interactions and get a good fit.

Delta Method Confidence Bands for Gaussian Density

During one of our Department’s weekly biostatistics ‘clinics’, a visitor was interested in creating confidence bands for a Gaussian density estimate (or a Gaussian mixture density estimate). The mean, variance, and two ‘nuisance’ parameters were simultaneously estimated using least-squares. Thus, the approximate sampling variance-covariance matrix (4×4) was readily available. The two nuisance parameters do not directly affect the Gaussian density, but the client was concerned that their correlation with the mean and variance estimates would affect the variance of the density estimate. Of course, this might be the case in general, and a nonparametric bootstrap method might be used to account for this. Nevertheless, I proposed using the delta method, in which the variability of the nuisance parameter estimates does not affect that of the density estimate; a consequence of the normality assumption. This can be verified by fiddling with the parameters below.
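
The reason the nuisance parameters drop out can be sketched in a few lines. The delta method approximates the variance of the density estimate at x by g'Σg, where g is the gradient of the density with respect to all four parameters and Σ is their 4×4 covariance matrix. The density depends only on the mean and variance, so the two nuisance entries of g are zero and the corresponding rows and columns of Σ never contribute. The parameter ordering (mean, variance, nuisance 1, nuisance 2), the function names, and the normal-quantile band are assumptions made for this sketch, not the post's own code.

```python
import math


def gaussian_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def pdf_gradient(x, mu, var):
    """Gradient of the density w.r.t. (mu, var, nuisance1, nuisance2)."""
    f = gaussian_pdf(x, mu, var)
    d_mu = f * (x - mu) / var
    d_var = f * ((x - mu) ** 2 / (2.0 * var ** 2) - 1.0 / (2.0 * var))
    # The density does not depend on the nuisance parameters.
    return [d_mu, d_var, 0.0, 0.0]


def delta_band(x, mu, var, cov, z=1.96):
    """Pointwise (lower, estimate, upper) band from the delta method."""
    g = pdf_gradient(x, mu, var)
    variance = sum(g[i] * cov[i][j] * g[j] for i in range(4) for j in range(4))
    f = gaussian_pdf(x, mu, var)
    half_width = z * math.sqrt(variance)
    return f - half_width, f, f + half_width


if __name__ == "__main__":
    # Same (mean, variance) block, very different nuisance entries.
    cov_a = [[0.04, 0.01, 0.0, 0.0],
             [0.01, 0.09, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 1.0]]
    cov_b = [[0.04, 0.01, 0.1, -0.1],
             [0.01, 0.09, 0.2, 0.1],
             [0.1, 0.2, 5.0, 0.5],
             [-0.1, 0.1, 0.5, 7.0]]
    print(delta_band(0.5, 0.0, 1.0, cov_a))
    print(delta_band(0.5, 0.0, 1.0, cov_b))
```

Both calls should print the same band, since only the upper-left 2×2 block of the covariance matrix enters the calculation. Note that a symmetric band of this kind can dip below zero in the tails, where the density is small.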

A Simpler Explanation of Differential Privacy

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this article we’ll work through the definition of differential privacy and demonstrate how Dwork et al.’s recent results can be used to improve the model fitting process.
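
For orientation, the standard definition says that a randomized algorithm A is ε-differentially private if, for any two data sets D and D' that differ in a single record and any set S of outputs, P[A(D) in S] <= exp(ε) * P[A(D') in S]. A minimal sketch of the textbook Laplace mechanism for a counting query follows. This is the classic construction, not necessarily the one the article works through, and the function names and example data are made up for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw, built as the difference of two independent
    exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Epsilon-differentially private count of records matching predicate.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    ages = [23, 35, 41, 52, 67, 71, 29, 38]  # made-up example data
    print(private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng))
```

Smaller values of ε give stronger privacy but noisier answers: the noise has standard deviation sqrt(2)/ε, so the choice of ε is a trade-off between privacy and accuracy.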