An Update on Boosting with Splines
In my previous post, An Attempt to Understand Boosting Algorithm(s), I was puzzled by the convergence of boosting when I was using some spline functions (more specifically, continuous piecewise linear regression functions).

Variable Selection using Cross-Validation (and Other Techniques)
A natural technique to select variables in the context of generalized linear models is to use a stepwise procedure. It is natural, but controversial, as discussed by Frank Harrell in a great post, clearly worth reading. Frank lists about 10 points against a stepwise procedure.

How Much Did It Rain? Winner’s Interview: 1st place, Devin Anzelmo
An early insight into the importance of splitting the data on the number of radar scans in each row helped Devin Anzelmo take first place in the How Much Did It Rain? competition. In this blog, he gets into details on his approach and shares key visualizations (with code!) from his analysis.

The R Consortium: what does it mean for me?
Earlier this week a press release from the Linux Foundation formally unveiled The R Consortium: ‘a group of businesses organized under an open source governance and foundation model to provide support to the R community, the R Foundation and groups and individuals, using, maintaining and distributing R software’. Mango Solutions were announced as founding silver members alongside the R Foundation; Microsoft and RStudio (Platinum); TIBCO Software Inc. (Gold); and Alteryx, Google, HP, Ketchum Trading and Oracle (Silver). Clearly we think The R Consortium is a good idea. But what does it mean for ordinary R users?

Persistent data storage in Shiny apps
Shiny apps often need to save data, either to load it back into a different session or simply to log some information. However, common methods of storing data from R may not work well with Shiny. Functions like write.csv() and saveRDS() save data locally, but consider how shinyapps.io works. Shinyapps.io is a popular server for hosting Shiny apps. It is designed to distribute your Shiny app across different servers, which means that if a file is saved during one session on some server, then loading the app again later will probably direct you to a different server where the previously saved file doesn’t exist. On other occasions, you may use data that is too big to store locally with R in an efficient manner. This guide will explain seven methods for storing persistent data remotely with a Shiny app.

• Applied Spatial Data Analysis with R
• Bayesian Networks and Graphical Models with R
• Data manipulation with dplyr
• Efficient statistical consulting using R Workflow for data analysis projects
• Handling missing values with a special focus on the use of principal components methods
• RHadoop
• Rocker: Using R on Docker
• Statistical analysis of network data
• Analysis and Visualization of Large Complex Data with Tessera
• Applied Machine Learning and Efficient Model Selection with mlr
• Bioconductor for high-throughput sequence analysis
• Getting to Know Grid Graphics
• Introduction to Bayesian Data Analysis with R
• spatstat: An R package for analysing spatial point patterns
• Testing R Code
• Using Pandoc’s markdown with R

News from UseR!2015 – the RHadoop tutorial
The central example throughout is to use mapreduce() in the package rmr2 to summarize New York taxi journeys, grouped by hour of the day. This is a dataset of ~200GB of uncompressed csv files. You can read more about how this data came into the public domain at http://…/.
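
To make the "grouped by hour of the day" idea concrete, here is a minimal sketch of the map and reduce steps in plain Python. It is not the tutorial's code and does not use rmr2's API; the sample timestamps and the helper names (`map_trip`, `reduce_counts`, `map_reduce`) are all made up for illustration, and the real dataset's column layout is not described in the post.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical pickup timestamps standing in for rows of the taxi data.
PICKUPS = [
    "2013-01-01 00:15:00",
    "2013-01-01 00:50:00",
    "2013-01-01 08:05:00",
    "2013-01-02 08:40:00",
    "2013-01-02 17:30:00",
]


def map_trip(pickup):
    """Map step: emit a (key, value) pair of (hour of day, 1) for one journey."""
    hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
    return hour, 1


def reduce_counts(values):
    """Reduce step: collapse all values that share a key into one count."""
    return sum(values)


def map_reduce(records, mapper, reducer):
    """Run the mapper on every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        groups[key].append(value)
    return {key: reducer(values) for key, values in sorted(groups.items())}


journeys_per_hour = map_reduce(PICKUPS, map_trip, reduce_counts)
print(journeys_per_hour)  # {0: 2, 8: 2, 17: 1}
```

On a real cluster the grouping between the two steps is done by the framework across machines; here it is a single in-memory dictionary, which is only workable for toy inputs.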

Useful tutorials
There are some tools that I use regularly, and I would like my research students and postdocs to learn them too. Here are some great online tutorials that might help.
• ggplot2 tutorial from Winston Chang
• Writing an R package from Karl Broman
• Rmarkdown from RStudio
• Shiny from RStudio
• git/GitHub guide from Karl Broman
• minimal make tutorial from Karl Broman