RStudio v0.99 Preview: Tools for Rcpp
Over the past several years the Rcpp package has become an indispensable tool for creating high-performance R code. Its power and ease of use have made C++ a natural second language for many R users. There are over 400 packages on CRAN and Bioconductor that depend on Rcpp and it is now the most downloaded R package. In RStudio v0.99 we have added extensive additional tools to make working with Rcpp more pleasant, productive, and robust. These include:
• Code completion
• Source diagnostics as you edit
• Code snippets
• Navigable list of compilation errors
• Code navigation (go to definition)
We think these features will go a long way to helping even more R users succeed with Rcpp. You can try the new features out now by downloading the RStudio Preview Release.
Analyzing Customer Churn – Restricted Mean Survival Time
As its name suggests, Restricted Mean Survival Time (RMST from here on out) is simply the average number of time periods a customer survives before churning… except that the highest values are ‘restricted’ to some maximum. So, we might take the survival time in days for each customer in a group, but cap the highest values at 365 before we take the average. That’s the 365-day RMST for that group. So what does it tell us, exactly? It tells us the average number of days of revenue we’ll get out of a group of customers during their first year. If the RMST comes out to, say, 335, we know that we’ll get 335 days (or about 11 months) of revenue out of the average customer. If our fee is $5 per month, that’s $55 of revenue per customer in their first year. Framed differently, we can say that churn is costing us $5 per customer out of a possible $60 in first-year revenue. Of course, you could do this with other maximum time periods… a month, two years. Whatever makes sense for your analysis. But do make sure it makes sense. Calculating a 10-year RMST when you only have 1 year of customer data would be fruitless.
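The capping-then-averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the survival times are made-up numbers, the function name is hypothetical, and every customer is assumed to have been observed either until churn or for at least the full horizon (so censoring is not handled).

```python
def restricted_mean_survival_time(survival_days, horizon):
    """Average survival time after capping each value at `horizon` days.

    Assumes no censoring: each customer was followed until churn
    or for at least `horizon` days.
    """
    if not survival_days:
        raise ValueError("need at least one customer")
    capped = [min(days, horizon) for days in survival_days]
    return sum(capped) / len(capped)


# Hypothetical survival times in days for five customers.
survival_days = [400, 365, 90, 720, 100]

# 365-day RMST: values above 365 count as 365.
rmst_365 = restricted_mean_survival_time(survival_days, 365)

# Rough first-year revenue per customer at a hypothetical $5 monthly fee,
# treating the year as 12 equal months.
monthly_fee = 5.0
first_year_revenue = rmst_365 / 365 * 12 * monthly_fee
```

With these illustrative numbers the capped values are 365, 365, 90, 365 and 100, so the 365-day RMST is 257 days.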
Driving Behaviour as a Telematic Fingerprint
The objective of my final project at Metis, from weeks 9 to 12, is to categorize drivers based on their behaviour on the roads – their driving style and the type of roads that they follow. The challenge associated with this objective is to uniquely identify a driver (and hence his own ‘driving behaviour’) based on the GPS log of a mobile phone located inside the car. My idea to solve this issue is to experiment with Topic Modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), and explain the observed trips by the unobserved behaviour of drivers.
All Aboard! The R Service Bus 6.2
In the absence of a protocol droid or Babel fish, we have the enterprise service bus. For the uninitiated, an enterprise service bus is a software architecture model designed to interface between various software applications. How does this relate to R? Let’s consider an abstract example. Imagine that you work with machines that generate data. One could establish a workflow that monitors a server folder for files, applies an R-written quality control algorithm, produces visualizations in ggplot2, then generates a Sweave document to be emailed or stored in a web application for future access.
Interactive time series with dygraphs
• Automatically plots xts time-series objects (or objects convertible to xts).
• Rich interactive features including zoom/pan and series/point highlighting.
• Highly configurable axis and series display (including optional 2nd Y-axis).
• Display upper/lower bars (e.g. prediction intervals) around series.
• Various graph overlays including shaded regions, event lines, and annotations.
• Use at the R console just like conventional R plots (via RStudio Viewer).
• Embeddable within R Markdown documents and Shiny web applications.
Visualisation with R and Google Maps
For those of you who are interested in using R alongside Google Maps with the packages geonames (www.geonames.org), RgoogleMaps, ggmap, loa and plotKML: enjoy the slides of our presentation on this topic from the last RBelgium meetup.
Plotting tables alongside charts in R
Occasionally I’d like to plot a table alongside a chart in R, e.g. to present summary statistics of the graph itself. Thanks to the gridExtra package this is quite straightforward. The function tableGrob creates a table-like plot of a data frame, while arrangeGrob allows me to arrange ggplot2, lattice and grid graphical objects (‘grobs’ for short, such as tableGrob) on a page.
Small Cell Suppression – Problem Overview
The ‘cell suppression problem’ is an overarching term for situations where a researcher must hide certain values in tabular reports in order to protect sensitive personal (or otherwise protected) information. For instance, suppose Wayout County, Alaska has only one resident with a PhD – we’ll call her ‘Jane.’ Some economist comes in to do a study of the value of higher education in rural areas, and publishes a list of average salaries disaggregated by county and level of education. Whoops! The average salary for people with PhDs in Wayout County is just Jane’s salary. That researcher has just disclosed Jane’s personal information to the world, and anybody that happens to know her now knows how much money she makes. ‘Suppressing’ or hiding the value of that cell in the report table would have saved a lot of trouble!
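As a rough illustration of the idea, the sketch below hides the average salary in any cell backed by fewer than a minimum number of people. The threshold of 5, the function name, and all figures are hypothetical and not taken from the article. It also only hides the small cells themselves and does nothing about values that could be recovered from row or column totals.

```python
def suppress_small_cells(cells, min_count=5):
    """Return a report where averages from small groups are hidden.

    `cells` maps (county, education) -> (count, mean_salary).
    Cells with fewer than `min_count` people are reported as None.
    The default threshold is an illustrative assumption.
    """
    report = {}
    for key, (count, mean_salary) in cells.items():
        report[key] = mean_salary if count >= min_count else None
    return report


# Made-up figures for the Wayout County example.
cells = {
    ("Wayout", "High school"): (120, 41000.0),
    ("Wayout", "Bachelor's"): (35, 58000.0),
    ("Wayout", "PhD"): (1, 97000.0),  # a single person: her salary is the average
}

report = suppress_small_cells(cells)
for (county, education), value in report.items():
    shown = "suppressed" if value is None else f"{value:,.0f}"
    print(f"{county:8} {education:12} {shown}")
```

Here the PhD row is printed as "suppressed" while the other two averages are shown unchanged.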
Regression Coefficients & Units of Measurement
A linear regression equation is just that – an equation. This means that when any of the variables – dependent or explanatory – have units of measurement, we also have to keep track of the units of measurement for the estimated regression coefficients.
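A small numerical sketch may make the point concrete. Assuming a simple one-variable least-squares fit and made-up height and weight data, the slope carries the units of the dependent variable per unit of the explanatory variable, so re-expressing height in centimetres instead of metres divides the slope by 100.

```python
def ols_slope(x, y):
    """Least-squares slope for a one-variable regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: height in metres, weight in kilograms.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [50.0, 58.0, 63.0, 72.0, 80.0]

# Slope in kg per metre (roughly 74 for these numbers).
slope_kg_per_m = ols_slope(height_m, weight_kg)

# Same data with height in centimetres: slope is now in kg per centimetre.
height_cm = [h * 100 for h in height_m]
slope_kg_per_cm = ols_slope(height_cm, weight_kg)

print(slope_kg_per_m, slope_kg_per_cm)
```

The two fits describe the same relationship; only the units attached to the coefficient differ.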
How Airbnb uses machine learning to detect host preferences
At Airbnb we seek to match people who are looking for accommodation – guests – with those looking to rent out their place – hosts. Guests reach out to hosts whose listings they wish to stay in; however, a match succeeds only if the host also wants to accommodate the guest.
Introducing CURRENNT: The Munich Open-Source CUDA RecurREnt Neural Network Toolkit
In this article, we introduce CURRENNT, an open-source parallel implementation of deep recurrent neural networks (RNNs) supporting graphics processing units (GPUs) through NVIDIA’s Compute Unified Device Architecture (CUDA). CURRENNT supports uni- and bidirectional RNNs with Long Short-Term Memory (LSTM) memory cells, which overcome the vanishing gradient problem. To our knowledge, CURRENNT is the first publicly available parallel implementation of deep LSTM-RNNs. Benchmarks are given on a noisy speech recognition task from the 2013 2nd CHiME Speech Separation and Recognition Challenge, where LSTM-RNNs have been shown to deliver best performance. As a result, double-digit speedups in bidirectional LSTM training are achieved with respect to a reference single-threaded CPU implementation. CURRENNT is available under the GNU General Public License from http://sourceforge.net/p/currennt.