Probabilistic algorithms are all around us. Not only are they acceptable, but some programmers actually seek out chances to use them.
Data manipulation is an inevitable phase of predictive modeling. A robust predictive model can’t be built with machine learning algorithms alone; it requires understanding the business problem and the underlying data, performing the required data manipulations, and then extracting business insights. Among these phases of model building, most of the time is usually spent understanding the underlying data and performing the required manipulations. That is also the focus of this article: packages for faster data manipulation in R.
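As a flavor of the kind of manipulation the article discusses, here is a minimal sketch using dplyr, one common R package for fast data manipulation (my choice of package and toy data; the article may cover others):

```r
library(dplyr)

# Toy sales data (illustrative only)
sales <- data.frame(
  region = c("N", "S", "N", "S"),
  amount = c(100, 250, 300, 150)
)

# Group-wise aggregation: total and average amount per region
summary_tbl <- sales %>%
  group_by(region) %>%
  summarise(total = sum(amount), avg = mean(amount))
```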
It may happen that your dataset is not complete; when information is not available, we call it a missing value. In R, missing values are coded with the symbol NA. To identify missing values in your dataset, use the function is.na().
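A quick illustration of is.na() and a couple of related base R helpers (toy data of my own, not from the post):

```r
# A small data frame with NA values in both columns
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

is.na(df$x)          # FALSE  TRUE FALSE
sum(is.na(df))       # total number of missing cells: 2
colSums(is.na(df))   # missing values per column
complete.cases(df)   # rows with no missing values
```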
Well, if Data Science and Data Scientists cannot decide what data to use to help them decide which language to use, here is an article that uses BOTH.
In the code below I demonstrate how the function “clusterApply” from the package “snow” can be used as a replacement for the regular “apply” function. Note that the “cluster” in clusterApply refers to multicore compute clusters rather than clusters in the data frame. My code sets up a simple regression problem in which the standard error of the regressor is 0.4. To demonstrate the clustering phenomenon, I duplicate the data frame of 10,000 observations 20 times. As a result, the standard error falls to 0.09 based on the naive estimate of the variance-covariance matrix.
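A minimal sketch of the clusterApply pattern described above, using the base “parallel” package, which provides the same interface snow introduced (the toy squaring task is my illustration, not the post’s regression code):

```r
library(parallel)

cl <- makeCluster(2)   # start two worker processes

# Like apply/lapply, but each element is dispatched to a cluster worker
res <- clusterApply(cl, 1:4, function(i) i^2)

stopCluster(cl)        # always shut the workers down

unlist(res)            # 1 4 9 16
```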
In case you missed it, Jonas Dalege and his colleagues at the PsychoSystems research group have recently published an article in Psychological Review detailing how attitudes can be represented as network graphs. It is all done using R and a dataset that can be downloaded by registering at the ANES data center. You will find the R code under Scripts and Code in a file called ANES 1984 Analyses. With very minor changes to the size of some labeling, I was able to reproduce the above undirected graph with two R packages: IsingFit and qgraph. As usual when downloading others’ files, most of the R code is data munging and deals with assigning labels and transforming ratings into dichotomies.
Our most recent article was a dynamic programming solution to the A/B test problem. Explicitly solving such dynamic programs is a long and tedious process, so you are well served by finding and introducing clever invariants to track (something better than just raw win rates). This idea, called ‘sequential analysis’, was introduced by Abraham Wald (whom we have written about before). If you have ever heard of a test plan such as ‘the first process to get more than 30 wins ahead of the other is the one we choose’, you have seen methods derived from Wald’s sequential analysis technique.
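The ‘first to get 30 wins ahead’ plan can be sketched as a simple random-walk simulation (my illustration of the stopping rule, not the article’s code; the win probability 0.55 is an assumed value):

```r
set.seed(1)
p_a  <- 0.55   # assumed true win probability for process A
lead <- 0      # (wins for A) minus (wins for B)
steps <- 0

# Each trial, A or B wins; stop as soon as one side leads by 30
while (abs(lead) < 30) {
  lead  <- lead + ifelse(runif(1) < p_a, 1, -1)
  steps <- steps + 1
}

winner <- if (lead > 0) "A" else "B"
```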
Via R-Bloggers, I spotted this post on Deploying Your Very Own Shiny Server. I’ve been toying with the idea of running some of my own Shiny apps, so that post provided a useful prompt, though it was way too involved for me;-) So here’s what seems to me to be an easier, rather more pointy-clicky, wiring-stuff-together way using Docker containers (though it might not seem that much easier to you the first time through!). The recipe includes: GitHub, Dockerhub, Tutum and Digital Ocean.
One of the legendary events in the history of analytics was the original Netflix Prize. The event led to a terrific example of the need to focus not only on theoretical results, but also on pragmatically achievable results, when developing analytic processes. For those who aren’t familiar with the story: not quite 10 years ago, Netflix was having trouble achieving the desired improvement in its recommendation algorithms. There were smart people working on the problem, but progress had slowed once they had used all the tricks and techniques they knew. As a result, Netflix decided to do something that was, at the time, novel and unexpected.
At the end of each month I pull together a collection of links to some of the most relevant, interesting or thought-provoking web content I’ve come across during the previous month. Here’s the latest collection from October 2015.