Visualizing MLS Player Salaries

Recently, I came across this great visualization of MLS player salaries. I tried to do something similar with ggplot2, and while I was unable to replicate the interactivity or the tree-map nature of the original, the result still looks pretty cool.

The Connected Scatterplot for Presenting Paired Time Series

The connected scatterplot isn’t well known in visualization, but it has attracted some interest in journalism. There are a number of recent examples, like How the U.S. and OPEC Drive Oil Prices, The Death Spiral Of M. Night Shyamalan’s Career, National Indebtedness, and a number of others (there’s a list in the paper). My favorites include one of my all-time favorite news graphics, Driving Safety, in Fits and Starts, as well as Helium Supply, and The Rise of Long-Term Joblessness (aka The Scorpion Chart).

Getting started with Machine Learning in MS Excel using XLMiner

Machine learning is nothing but building a ‘machine’ that ‘learns’ from its experience and becomes better with experience, just like humans. We also learn from our experiences, right? Companies like Google, Facebook, and Microsoft use machine learning techniques at large scale. However, one common misconception is that you need to learn coding to start machine learning. While coding becomes necessary for anyone who does machine learning seriously, it is not needed to get started. You can use a GUI-driven tool like Weka, or even Excel, to start with machine learning. Here, I’ll introduce you to a simpler way to get started with machine learning.

Simpler R coding with pipes > the present and future of the magrittr package

I was first introduced to the %>% (a.k.a. pipe) operator in R thanks to Hadley Wickham’s (fascinating) dplyr tutorial (link to the workshop’s material) at useR!2014. After several discussions during the conference (including one very influential conversation with RStudio’s Joe Cheng), I became convinced that the pipe operator is one of the most important innovations (if not THE most important) introduced to the R ecosystem this year.

The hardest parts of data science

Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed.

Gradient Boosting

YouTube video

Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT

In this post we will show how to import data from other sources into the R workspace.

Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

Overview on Multivariate Distributions

In June 2016, Olivier L’Haridon and I will organize a (small) conference in Rennes on risk models in a multi-attribute framework. To help participants fully enjoy the conference (more to come on the blog), we will hold an internal workshop on that topic every month. We will start tomorrow afternoon, 13:00-14:30, and I will give a brief talk on multivariate distributions, with an emphasis on spherical / elliptical distributions, distributions on the simplex, and copulas.

Setting up an AWS instance for R, RStudio, OpenCPU, or Shiny Server

While most web developers have worked with Amazon AWS, Microsoft Azure, or similar platforms before, this is still not the case for many R number crunchers. Researchers at academic institutions in particular have less exposure to these commercial offerings. Time to change that! In this post, we explain how to set up an Ubuntu server instance on AWS and how to install R on it. In later posts, we will explain how to add RStudio Server and Shiny Server Open Source. Our main goal is to get you started quickly, so we ignore many of the useful options offered by AWS, such as its security settings.

statistically significant trends with multiple years of complex survey data

The purpose of this analysis is to make statements such as, “there was a significant linear decrease in the prevalence of high school aged Americans who have ever smoked a cigarette across the period 1999-2011” with complex sample survey data. This step-by-step walkthrough exactly reproduces the statistics presented in the Centers for Disease Control and Prevention’s (CDC) linear trend analysis, using free and open source methods rather than proprietary or restricted software. The example below displays only linearized designs (created with the svydesign function). For more detail about how to reproduce this analysis with a replicate-weighted design (created with the svrepdesign function), see the note below section #4.

How to set up a data science environment in minutes using Docker and Jupyter

Configuring a data science environment can be a pain. Dealing with inconsistent package versions, digging through obscure error messages, and waiting hours for packages to compile can be frustrating. This makes it hard to get started with data science in the first place, and is a completely arbitrary barrier to entry.

The relation between p-values and the probability H0 is true is not weak enough to ban p-values

The journal Basic and Applied Social Psychology banned the p-value in 2015, after Trafimow (2014) had explained in an editorial a year earlier that inferential statistics were no longer required. In the 2014 editorial, Trafimow notes: “The null hypothesis significance procedure has been shown to be logically invalid and to provide little information about the actual likelihood of either the null or experimental hypothesis (see Trafimow, 2003; Trafimow & Rice, 2009)”. The goal of this blog post is to explain why the arguments put forward in Trafimow & Rice (2009) are incorrect. Their simulations illustrate how meaningless questions provide meaningless answers, but they do not reveal a problem with p-values. Editors can do with their journal as they like, even ban p-values. But if the simulations upon which such a ban is based are meaningless, the ban itself becomes meaningless.
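To make the link between p-values and the probability that H0 is true concrete, here is a minimal Python sketch. It is not the simulation from Trafimow & Rice (2009), nor the one in the linked post; it is an illustration under assumptions chosen only for this example: half of all studies test a true null, the other half have a true standardized effect of 0.5, each study has 30 observations with known unit variance, and a two-sided z-test is used. The function names simulate_studies and share_h0_true are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_studies(n_studies=20000, n=30, effect=0.5, prior_h0=0.5, seed=1):
    """Return a list of (p_value, h0_is_true) pairs.

    All defaults are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        h0_true = rng.random() < prior_h0
        mu = 0.0 if h0_true else effect
        # The mean of n independent N(mu, 1) draws is N(mu, 1/n).
        sample_mean = rng.gauss(mu, 1 / math.sqrt(n))
        z = sample_mean * math.sqrt(n)
        results.append((two_sided_p(z), h0_true))
    return results


def share_h0_true(results, significant, alpha=0.05):
    """Fraction of studies with a true null among those with p < alpha
    (significant=True) or p >= alpha (significant=False)."""
    flags = [h0 for p, h0 in results if (p < alpha) == significant]
    return sum(flags) / len(flags)


results = simulate_studies()
print("Share of true nulls when p < 0.05: ", share_h0_true(results, True))
print("Share of true nulls when p >= 0.05:", share_h0_true(results, False))
```

Under these assumptions, the share of true nulls is much lower among studies with small p-values than among studies with large ones. How strong that relation is depends on the assumed prior and effect size, which is why the choice of simulation settings matters so much in this debate.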