Goals for the New R Consortium

The creation of the R Consortium offers an intriguing opportunity to expand the use of R around the world. I’ve suggested several potential goals for the Consortium, including helping people choose packages, increasing reliability testing, rating package support levels, increasing the visibility of key generic functions, adding support for Word, and making R more accessible through stronger GUI support. What else should the R Consortium consider? Let’s hear your ideas in the comments section below.

Beginners Guide To Learn Dimension Reduction Techniques

In this article, we will look at ways to identify the significant variables using the most common dimension reduction techniques.

Cross Validation done wrong

Cross validation is an essential tool in statistical learning for estimating the accuracy of your algorithm. Despite its great power, it carries some fundamental risks when done wrong, which may badly bias your accuracy estimate. In this blog post I’ll demonstrate, using the Python scikit-learn framework, how to avoid the biggest and most common pitfall of cross validation in your experiments.
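The excerpt does not name the pitfall. One commonly cited candidate, assumed here, is selecting features on the full data set before cross-validating, which leaks information from the test folds. The sketch below uses only the Python standard library rather than scikit-learn, and every name in it (make_noise_data, select_features, cv_accuracy) is hypothetical. The features are pure noise, so an honest accuracy estimate should hover around chance, while the leaky variant typically looks much better than it should.

```python
import random

def make_noise_data(n_samples=40, n_features=300, seed=0):
    """Pure-noise features with balanced labels: no feature is truly predictive."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, rows, features):
    """Per-class mean of each chosen feature, computed over the given rows only."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in features]
    return means

def select_features(X, y, rows, k):
    """Keep the k features whose class means differ most on `rows`."""
    all_features = range(len(X[0]))
    m = class_means(X, y, rows, all_features)
    gap = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_features, key=lambda j: gap[j], reverse=True)[:k]

def predict(X, means, features, i):
    """Nearest-centroid prediction for row i using the selected features."""
    def dist(label):
        return sum((X[i][j] - c) ** 2 for j, c in zip(features, means[label]))
    return 0 if dist(0) <= dist(1) else 1

def cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=True):
    """Cross-validated accuracy; the leaky variant selects features on all rows."""
    rows = list(range(len(X)))
    leaked = None if select_inside_fold else select_features(X, y, rows, k)
    correct = 0
    for f in range(n_folds):
        test = rows[f::n_folds]
        train = [i for i in rows if i not in test]
        # Honest variant: the selection step sees the training rows only.
        features = select_features(X, y, train, k) if select_inside_fold else leaked
        means = class_means(X, y, train, features)
        correct += sum(predict(X, means, features, i) == y[i] for i in test)
    return correct / len(rows)

X, y = make_noise_data()
leaky = cv_accuracy(X, y, select_inside_fold=False)
honest = cv_accuracy(X, y, select_inside_fold=True)
print("selection before CV (leaky): ", leaky)
print("selection inside each fold:  ", honest)
```

In scikit-learn the same idea is usually expressed by placing the selection step and the classifier in a single Pipeline and cross-validating the pipeline as a whole.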

Notes on the Dirichlet Distribution and Dirichlet Process

• The symmetric Dirichlet distribution (DD) can be considered a distribution of distributions.
• The Dirichlet Process can be considered a way to generalize the Dirichlet distribution.
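As a rough illustration of both points, the sketch below draws probability vectors from a symmetric Dirichlet by normalizing independent gamma draws, and then draws weights from a truncated stick-breaking construction, one standard way of representing a Dirichlet Process. The function names and parameter values are assumptions made for this example and do not come from the notes.

```python
import random

def sample_symmetric_dirichlet(k, alpha, rng):
    """One draw from Dirichlet(alpha, ..., alpha): normalize k Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def stick_breaking_weights(concentration, n_sticks, rng):
    """Truncated stick-breaking: the first n_sticks weights of a Dirichlet Process draw."""
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)   # break off a piece of the stick
        remaining *= 1.0 - fraction            # what is left for later pieces
    return weights

rng = random.Random(42)
# Each draw is itself a discrete distribution over 4 outcomes.
for alpha in (0.5, 1.0, 10.0):
    p = sample_symmetric_dirichlet(4, alpha, rng)
    print("alpha =", alpha, [round(x, 3) for x in p])

# The truncated weights sum to slightly less than 1; the rest is the unbroken stick.
w = stick_breaking_weights(concentration=2.0, n_sticks=10, rng=rng)
print("stick-breaking weights:", [round(x, 3) for x in w])
```

Smaller values of alpha tend to concentrate the mass on a few outcomes, while larger values tend to give vectors close to uniform.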

Computing AIC on a Validation Sample

This afternoon, in the training on data science, we saw that it is possible to use the AIC criterion for model selection. …
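The excerpt is cut off before it shows any computation, so the following is only a sketch of the textbook definition, AIC = 2k - 2 log L, where the log-likelihood is evaluated on held-out data. It assumes a model with Gaussian errors, and the residuals, sigma and parameter count are made-up illustration values, not figures from the post.

```python
import math

def gaussian_log_likelihood(residuals, sigma):
    """Log-likelihood of residuals under independent N(0, sigma^2) errors."""
    n = len(residuals)
    sse = sum(r * r for r in residuals)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical residuals of a fitted model on a validation sample,
# with sigma assumed to have been estimated on the training sample.
validation_residuals = [0.3, -0.8, 1.1, -0.2, 0.5]
loglik = gaussian_log_likelihood(validation_residuals, sigma=0.9)
print("validation log-likelihood:", loglik)
print("AIC:", aic(loglik, n_params=3))
```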

“Improving Segmentation” (using Lorenz curves, or sort of)

This afternoon, André sent me an interesting graph about the use of Lorenz curves in the context of insurance pricing (and modeling) …
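For readers unfamiliar with the tool, here is a minimal sketch of the classic Lorenz curve and the Gini index derived from it. The excerpt does not show how the post adapts the curve to insurance pricing, so this covers only the standard construction, and the claim amounts are hypothetical.

```python
def lorenz_curve(values):
    """Points (share of observations, share of total) after sorting ascending."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def gini(points):
    """Gini index: 1 minus twice the area under the curve (trapezoid rule)."""
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return 1 - 2 * area

# Hypothetical claim amounts for a handful of policies.
claims = [0, 0, 120, 300, 2500]
curve = lorenz_curve(claims)
for x, y in curve:
    print(round(x, 2), round(y, 3))
print("Gini index:", round(gini(curve), 3))
```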

TheWalnut.io: An Easy Way to Create Algorithm Visualizations

Google’s DeepDream project, which makes it possible to visualize deep learning neural networks, has gone viral. It highlights the need for a generalized algorithm visualization tool. In this post we introduce one such effort.

R tutorial on the Apply family of functions

In the present post we show the use of apply, its variants, and a few of its relatives, applied to different data structures. We will not exhaust all the variants (googling might be of help here) but when possible, we will illustrate the use of these functions in cooperation via a couple of slightly more beefy examples. Hope you will enjoy the read!

The complete catalog of argument variations of select() in dplyr

When I read the dplyr vignette, I found a convenient way to select sequential columns, such as select(data, year:day). Because I had only ever passed individual column names to select(), this shortcut made a strong impression on me. On closer inspection, I found that select() accepts many types of input. Here, I will enumerate the inputs that select() accepts. By the way, these column selection methods can also be used in summarise_each(), mutate_each(), and some functions in the tidyr package (e.g. gather()).

Efficient accumulation in R

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, ScaleR, data.table, and more), but when it comes to accumulating results, the beginning R user is often at sea. The R execution model is a bit exotic, so many R users are unsure which methods of accumulating results are efficient and which are inefficient.