R 3.2.3 is released (with improvements for Windows users and general bug fixes)

R 3.2.3 (codename “Wooden Christmas Tree”) was released several days ago. You can get the latest binary versions from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.


Bringing the powers of SQL into R

One of the big flaws of R is that the data you load are stored in memory (RAM) rather than on disk. When you are working on an analysis with large data, the processing time of simple and more complex functions can become very long, or can even crash your computer. This is where SQL comes in: it is a powerful language designed to work with (large) databases and to perform simple operations on them (like subsetting, sorting, …). It is particularly useful for exploring very large datasets and formatting the data for further analysis. There are many programs for doing database management using SQL. I decided to start by looking at MySQL, since it has an R package and is rather easy to set up (one could also use PostgreSQL, …). In this post I will show you, step by step, how to create a database in MySQL, how to upload data from R into it, and then how to run some queries that show off the power of SQL. Before I start, note that the data.table package was developed to perform fast operations on big data (have a look here).
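To give a flavor of the workflow described above, here is a minimal sketch using the DBI and RMySQL packages; the connection details and table name are placeholders for illustration, not values from the post.

```r
library(DBI)

# Connect to a local MySQL server (dbname, user, and password
# below are placeholders for illustration)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "test_db", host = "localhost",
                 user = "user", password = "password")

# Upload an R data frame into a database table
# (RMySQL sanitizes column names, e.g. "Sepal.Length" -> "Sepal_Length")
dbWriteTable(con, "iris_tbl", iris, overwrite = TRUE)

# Run a simple SQL query (subsetting and sorting) from R
res <- dbGetQuery(con, "SELECT * FROM iris_tbl
                        WHERE Species = 'setosa'
                        ORDER BY Sepal_Length DESC
                        LIMIT 5")

dbDisconnect(con)
```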


7 Important Ways to Summarise Data in R

People are often confused when it comes to summarizing data quickly in R, because there are several options. Which one is the best? I’ve answered this question below. Choose one first and become expert at it; that’s how you should move on to the next. People who transition from SAS or SQL are used to writing simple queries in those languages to summarize data sets, and for that audience the biggest concern is how to do the same thing in R. In this article I cover the primary ways to summarize data sets. Hopefully this will make your journey much easier than it looks.
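As a taste of one common approach, here is a minimal sketch using base R’s aggregate() and the dplyr package on the built-in mtcars data; the grouping variable is chosen purely for illustration.

```r
# Base R: mean mpg by number of cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr equivalent: group, then summarise
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())
```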


FeatureFu: Building Featureful Machine Learning Models

LinkedIn’s FeatureFu project is a new open source toolkit designed to enable creative and agile feature engineering for most machine learning tasks such as statistical modeling (classification, clustering, and regression) and rule-based decision engines. In this blog post, we will detail the design and implementation of Expr in FeatureFu, provide examples of how feature engineering is becoming more powerful with this open source toolkit, and demonstrate how this technique nicely blurs the boundaries between modeling and feature engineering. Here, we share our practices and encourage you to do the same by sharing your valuable experience with FeatureFu.


Stephen Senn: The pathetic P-value (Guest Post)

I want to make it clear that I am not suggesting that P-values alone are a good way to summarise results, nor am I suggesting that Bayesian analysis is necessarily bad. I am suggesting, however, that Bayes is hard, and that pointing the finger at P-values ducks the issue. Bayesians (quite rightly, according to the theory) have every right to disagree with each other.


Lessons from Bayesian disease diagnosis: Don’t over-interpret the Bayes factor

A primary example of Bayes’ rule is disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here’s one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is not appropriate, because it ignores the prior probability of the disease, which for a rare disease is presumably very small. Suppose the prior probability of having the disease is 1%. Then Bayes’ rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103-105 of DBDA2E), which emphatically say not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.
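To make the arithmetic explicit, here is the posterior calculation from the example above, worked out in a few lines of R:

```r
# Bayes' rule for the disease-diagnosis example
p_disease <- 0.01  # prior probability of having the disease
hit_rate  <- 0.97  # P(positive | disease): the detection rate
fa_rate   <- 0.05  # P(positive | no disease): the false alarm rate

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_positive <- hit_rate * p_disease + fa_rate * (1 - p_disease)
posterior  <- hit_rate * p_disease / p_positive
posterior  # ~0.164: only about 16%, despite the positive test
```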


Walmart Kaggle: Trip Type Classification

Walmart uses trip type classification to segment its shoppers and their store visits in order to improve the shopping experience. Walmart’s trip types are created from a combination of existing customer insights and purchase history data. The purpose of the Kaggle competition is to use only the purchase data provided to derive Walmart’s classification labels; the goal for Walmart is to refine its trip type classification process.


Lucky Numbers Part 2: Machine Learning for Understanding Lottery Players’ Preferences

In a quick unscientific poll at a recent NYC Data Science Academy meetup, most people indicated that they have played the lottery at one time or another, but of those who have played, only a few indicated that they choose their own numbers. But was my audience at the meetup a representative sample of lottery players? Probably not, given the quantitative skills one would assume for people who choose to spend an evening listening to data science presentations. The goal of this project is to understand the selection behavior of lottery players as a whole. In particular I want to answer the following questions:
• Are there certain number combinations that are selected unusually often by lottery players?
• In games where winners share a fixed pool of prize money (i.e. parimutuel games), are the expected prize amounts appreciably lower for players who choose popular combinations? In other words, are customers who loyally play “their” numbers getting smaller payouts than occasional players who play random numbers selected by the lottery terminal?


Lucky Numbers Part 1: Web Scraping and Preliminary Analysis

In a lottery game, the numbers that the lottery selects are random, but the numbers that players choose to play are not. To the best of my knowledge, data on player selections are not publicly available. However, lotteries do publish data on the numbers they draw and the amounts of the prizes they award. In games where prizes are parimutuel, that is, when a certain percentage of sales is divided equally among the winners, one can infer the popularity of the numbers drawn from the prize amounts: popular numbers result in smaller prizes because there are more winners splitting the prize money. The primary component of this project is scraping a variety of lottery websites, using a variety of techniques, in order to gather data for an analysis that relates prize amounts to the numbers drawn. Ultimately, I would like to build machine learning models that predict prize amounts as a function of the numbers drawn. However, here I simply present some visualizations and run some hypothesis tests to investigate whether there is a relationship between prize amounts and the sum of the numbers drawn.
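As a sketch of the kind of test described, here is what such a check might look like in R on simulated stand-in data (the data below are simulated purely for illustration; the real analysis would use the scraped prize data):

```r
# Simulated stand-in for scraped lottery data: one row per drawing,
# with smaller prizes built in for certain number sums
set.seed(42)
n <- 500
number_sum <- sample(50:250, n, replace = TRUE)
prize <- exp(10 + 0.003 * number_sum + rnorm(n, sd = 0.3))
draws <- data.frame(number_sum, prize)

# Nonparametric test of association between prize and number sum
cor.test(draws$number_sum, draws$prize, method = "spearman")

# Simple regression of log prize on the sum of the numbers drawn
summary(lm(log(prize) ~ number_sum, data = draws))
```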


Gradient Boosters and the RossMann (Project)

As part of a Kaggle competition, we were challenged by Rossmann, the second largest chain of German drug stores, to predict daily sales 6 weeks into the future for more than 1,000 stores. Exploratory data analysis revealed several novel features, including spikes in sales prior to, and following, store refurbishment. We also engineered several novel features through the inclusion of external data, including Google Trends, macroeconomic data, and weather data. We then used H2O, a fast, scalable, parallel-processing engine for machine learning, to build predictive models utilizing random forests, gradient boosting machines, and deep learning. Lastly, we combined these models using different ensemble methods to obtain better predictive performance. Training data were provided for 1,115 Rossmann stores from January 1st, 2013 through July 31st, 2015. The task was to forecast 6 weeks (August 1st, 2015 through September 17th, 2015) of sales for 856 of the Rossmann stores identified within the testing data.
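For a flavor of the H2O workflow mentioned above, here is a minimal sketch in R; the file path, feature columns, and parameters are illustrative assumptions, not the team’s actual setup:

```r
library(h2o)
h2o.init()  # start a local H2O cluster

# Load the training data (the path is a placeholder)
train <- h2o.importFile("rossmann_train.csv")

# Fit a gradient boosting machine to predict daily sales
# (feature list and parameters are illustrative, not the team's)
gbm_fit <- h2o.gbm(x = c("Store", "DayOfWeek", "Promo"),
                   y = "Sales",
                   training_frame = train,
                   ntrees = 200, max_depth = 6)

h2o.performance(gbm_fit)  # training-set performance metrics
```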


Implementing a CNN for Text Classification in TensorFlow

In this post we will implement a model similar to Kim Yoon’s Convolutional Neural Networks for Sentence Classification. The model presented in the paper achieves good classification performance across a range of text classification tasks (like Sentiment Analysis) and has since become a standard baseline for new text classification architectures. I’m assuming that you are already familiar with the basics of Convolutional Neural Networks applied to NLP. If not, I recommend first reading Understanding Convolutional Neural Networks for NLP to get the necessary background.


Data Science for Losers, Part 6 – Azure ML

In this article we’ll explore Microsoft’s Azure Machine Learning environment and how to combine cloud technologies with Python and Jupyter. As you may know, I’ve been using them extensively throughout this article series, so I have a strong opinion on what a Data Science-friendly environment should look like. Of course, there’s nothing against other coding environments or languages, for example R, so your opinion may differ greatly from mine, and that’s fine. AzureML also offers very good R support! So feel free to adapt everything in this article to your needs. And before we begin, a few words about how I came up with the idea of writing about Azure and Data Science.


The Quartz guide to bad data

An exhaustive reference to problems seen in real-world data, along with suggestions on how to resolve them. As a reporter, your world is full of data, and those data are full of problems. This guide presents thorough descriptions of, and possible solutions to, many of the kinds of problems that you will encounter when working with data. Most of these problems can be solved. Some of them can’t be, and that means you should not use the data. Others can’t be solved, but with precautions you can continue using the data. In order to allow for these ambiguities, this guide is organized by who is best equipped to solve the problem: you, your source, an expert, etc. In the description of each problem you may also find suggestions for what to do if that person can’t help you. You cannot possibly review every dataset you encounter for all of these problems; if you try, you will never get anything published. However, by familiarizing yourself with the kinds of issues you are likely to encounter, you will have a better chance of identifying an issue before it causes you to make a mistake.


Calculate Leave-One-Out Prediction for GLM

In model development, the “leave-one-out” prediction is a form of cross-validation, calculated as follows:
1. After a model is developed, each observation used in the model development is removed in turn, and the model is refitted with the remaining observations.
2. The out-of-sample prediction for each removed observation is then calculated with the corresponding refitted model; doing this one observation at a time assembles the LOO, i.e., leave-one-out, predicted values for the whole model development sample.
The loo_predict() function below is a general routine for calculating the LOO prediction for any GLM object, which can then be used to investigate the model’s stability and predictive power.
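The original post supplies the function itself; as a rough sketch of what such a routine might look like (this is a reconstruction for illustration, not the post’s implementation):

```r
# A minimal leave-one-out routine for a glm object; a sketch for
# illustration, not the original post's loo_predict() implementation
loo_predict <- function(obj) {
  d <- obj$data  # the data frame the model was fitted with
  yhat <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    # refit without observation i, then predict for observation i
    refit <- update(obj, data = d[-i, ])
    yhat[i] <- predict(refit, newdata = d[i, , drop = FALSE],
                       type = "response")
  }
  yhat
}

# Example usage: LOO predicted probabilities for a logistic regression
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
loo <- loo_predict(fit)
```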


Win-Vector news

In blogging, we have found that people respond really positively to articles in series. Along those lines, we have been writing more and organizing more of our posts into series. Some recent examples include: