ggplot2 2.0.0

I’m very pleased to announce the release of ggplot2 2.0.0. I know I promised that there wouldn’t be any more updates, but while working on the 2nd edition of the ggplot2 book, I just couldn’t stop myself from fixing some long standing problems.
On the scale of ggplot2 releases, this one is huge with over one hundred fixes and improvements. This might break some of your existing code (although I’ve tried to minimise breakage as much as possible), but I hope the new features make up for any short term hassle. This blog post documents the most important changes:
• ggplot2 now has an official extension mechanism.
• There are a handful of new geoms, and updates to existing geoms.
• The default appearance has been thoroughly tweaked so most plots should look better.
• Facets have a much richer set of labelling options.
• The documentation has been overhauled to be more helpful, and require less integration across multiple pages.
• A number of older and less used features have been deprecated.


The longer it has taken, the longer it will take

This is related to the Lindy effect. The longer a cultural artifact has been around, the longer it is expected to last into the future.


Basic Statistics

Basic statistics describe each variable numerically, for example by calculating its mean and frequency. They also explore the relationships among variables using t-tests, ANOVA, and chi-square tests.
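As a minimal sketch of the descriptive step, the snippet below computes a mean and a frequency table using only the Python standard library. The variable names and values are made up for illustration and do not come from the linked post.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: one numeric variable and one categorical variable.
heights_cm = [160.0, 172.5, 168.0, 181.5]
groups = ["a", "b", "a", "c"]

# Numeric description: the arithmetic mean of the numeric variable.
mean_height = mean(heights_cm)

# Frequency description: how often each category occurs.
group_counts = Counter(groups)

print(mean_height)
print(dict(group_counts))
```

Tests of relationships between variables (t-test, ANOVA, chi-square) are usually run with a dedicated statistics package rather than written by hand.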


Numerai – like Kaggle, but with a clean dataset, top ten in the money, and recurring payouts

Numerai is an attempt at a hedge fund that crowd-sources stock market predictions. It presents a Kaggle-like competition, but with a few welcome twists. For one thing, the dataset is very clean and tidy. As we mentioned in the article on the Rossmann competition, most Kaggle offerings have their quirks. We often got the impression that the organizers were making the competition unnecessarily convoluted, apparently against their own interests. It’s rather hard to find a contest where you can just apply whatever methods you fancy, without much data cleaning and feature engineering. In this tournament, you can do exactly that.


Rossmann Store Sales, Winner’s Interview: 1st place, Gert Jacobusse

Rossmann operates over 3,000 drug stores in 7 European countries. In their first Kaggle competition, Rossmann Store Sales, this drug store giant challenged Kagglers to forecast 6 weeks of daily sales for 1,115 stores located across Germany. The competition attracted 3,738 data scientists, making it our second most popular competition by participants ever. Gert Jacobusse, a professional sales forecast consultant, finished in first place using an ensemble of over 20 XGBoost models. Notably, most of the models individually achieve a very competitive (top 3 leaderboard) score. In this blog, Gert shares some of the tricks he’s learned for sales forecasting, as well as wisdom on the why and how of using hold out sets when competing.


What can data science do for me? You’ll know you’re ready when:

1. Your Question is Sharp
2. Your Data Measures What You Care About
3. Your Data is Accurate
4. Your Data is Connected
5. You Have a Lot of Data


Different NLP (Natural Language Processing) tasks incorporated into software programs today.

• Sentence segmentation, part-of-speech tagging, and parsing
• Deep analytics
• Machine translation
• Named entity extraction
• Co-reference resolution
• Automatic summarization


R is the fastest-growing language on StackOverflow

StackOverflow is a popular Q&A site and a go-to resource for developers of all languages looking for answers to programming problems: most of the time the question has already been asked and answered, and if not, you can always post a new question and wait for a reply. It’s an excellent resource for R users, featuring answers to nearly 100,000 R questions. In fact, R is the fastest-growing language on StackOverflow in terms of the number of questions asked:


Time series analysis with R: Testing stuff with NetAtmo data

I’ve got a NetAtmo weather station. One can download the measurements from its web interface as a CSV file. I wanted to give time series analysis with the extraction of seasonal components (‘decomposition’) a try, so I thought it would be a good opportunity to use the temperature measurements of my weather station.
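To make the idea of decomposition concrete, here is a rough sketch of a classical additive decomposition: a centered moving average estimates the trend, and the average detrended value at each position in the cycle estimates the seasonal component. It is written in plain Python rather than R, the function name `decompose_additive` is hypothetical, and the series is synthetic, not NetAtmo data.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal, and remainder parts (additive model)."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window has period + 1 points, so the two end
            # points get half weight to keep the average centered.
            total -= 0.5 * (window[0] + window[-1])
        trend[i] = total / period

    # Average the detrended values at each position within the cycle.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center the seasonal effects on zero
    seasonal = [means[i % period] - offset for i in range(n)]

    # Whatever is left after removing trend and seasonality.
    remainder = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic example: a linear trend plus a repeating pattern of length 4.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [0.5 * i + pattern[i % 4] for i in range(16)]
trend, seasonal, remainder = decompose_additive(series, period=4)
```

The trend is undefined (`None`) for the first and last `period // 2` points because the moving-average window does not fit there, and the series needs to span at least two full periods for the seasonal averages to exist.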


Post mortem

This is again a guest post, mainly written by Roberto, which I only slightly edited (and if significantly so, I am making it clear by adding text in italics and in square brackets, like [this]). By the way, the pic on the left shows my favourite pathologist examining a post-mortem.


R and Python: Gradient Descent

One problem that statistics often deals with is minimizing an objective function. Unlike linear models, models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model), have no analytical solution. In these situations we have to use mathematical programming or optimization. One popular optimization algorithm is gradient descent, which we’re going to illustrate here.
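The core loop of gradient descent can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code from the linked post: the function name `gradient_descent`, the learning rate, and the quadratic objective are all chosen here for the example. The objective has a known minimum so the result is easy to check; the same loop applies to a model such as logistic regression once its gradient is supplied.

```python
import math


def gradient_descent(grad, start, learning_rate=0.1, max_steps=1000, tol=1e-8):
    """Minimize a function given its gradient, starting from `start`."""
    point = list(start)
    for _ in range(max_steps):
        g = grad(point)
        # Stop once the gradient is numerically zero (a stationary point).
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        # Step against the gradient, scaled by the learning rate.
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


# Hypothetical objective f(x, y) = (x - 1)^2 + 2 * (y + 2)^2, minimum at (1, -2).
def example_gradient(point):
    x, y = point
    return [2 * (x - 1), 4 * (y + 2)]


minimum = gradient_descent(example_gradient, start=[0.0, 0.0])
print(minimum)
```

The learning rate matters: too large a value makes the iterates overshoot and diverge, while too small a value makes convergence slow.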