Big Data is Really Dead
Big Data as a concept is characterized by the 3Vs: Volume, Velocity, and Variety. Big Data implies a huge amount of data, and because of that sheer size it tends to be clumsy to work with. The dominant implementation is Hadoop, which is batch-based. More than a handful of companies in the market blindly collect lots of noisy data but don’t know how to cleanse it, let alone how to transform, store, and consume it effectively. They simply set up an HDFS cluster to dump the gathered data and then label it their ‘Big Data’ solution. Unfortunately, it is exactly this practice that marks the death of Big Data.

Understanding R6: new OO system in R
Last month, I learned about the most recent object-oriented system in R (which addresses limitations of the existing OO systems in R): R6, version 2.0.1, developed by Winston Chang of RStudio. Over the past five days, I went through the documentation and the source code to understand the system. What I discovered was an amazingly intricate network of environments that characterizes the class system. In fact, understanding this package will help any novice (like me) grasp the concepts of environments and non-standard evaluation in R. Below, I describe what I could understand of this fascinating package.
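To get a feel for what R6 provides, here is a minimal sketch of an R6 class (the class name and fields are illustrative, not taken from the post). Because an R6 object is an environment, fields are modified in place rather than copied, which is what gives the system its reference semantics.

```r
library(R6)

# Illustrative class, not from the original post
Counter <- R6Class("Counter",
  public = list(
    count = 0,
    add = function(x = 1) {
      # fields live in the object's environment, so this modifies in place
      self$count <- self$count + x
      invisible(self)   # returning the object enables method chaining
    }
  )
)

ctr <- Counter$new()
ctr$add(5)$add(2)   # chaining works because add() returns the object
ctr$count           # 7 -- no copy-on-modify as with S3/S4 values
```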

Three Things About Data Science You Won’t Find In the Books
1. Evaluation Is Key
2. It’s All In The Feature Extraction
3. Model Selection Burns Most Cycles, Not Data Set Sizes
In summary, knowing how to evaluate properly goes a long way toward reducing the risk that a method won’t perform on future data. Getting the feature extraction right is perhaps the most effective lever you can pull to get good results, and finally, you don’t always need Big Data, although distributed computation can help bring down training times. A small sketch of the first point follows below.
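As an illustration of point 1 (evaluation), here is a minimal hold-out evaluation in R; the data set (iris) and the simple logistic-regression model are placeholders, not from the original post.

```r
set.seed(42)
n         <- nrow(iris)
train_idx <- sample(n, size = 0.7 * n)   # 70/30 train/test split
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# deliberately simple classifier: is the flower 'virginica'?
fit <- glm(I(Species == "virginica") ~ Sepal.Length + Petal.Width,
           data = train, family = binomial)

# evaluate only on data the model has never seen
pred <- predict(fit, newdata = test, type = "response") > 0.5
mean(pred == (test$Species == "virginica"))   # held-out accuracy
```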

Running PageRank Hadoop job on AWS Elastic MapReduce
In a previous post I described an example of performing the PageRank calculation from the Mining Massive Datasets course with Apache Hadoop. In that post I took an existing Hadoop job written in Java and modified it somewhat (adding unit tests and making file paths configurable via a parameter). This post shows how to run that job on a real-life Hadoop cluster: an AWS EMR cluster with 1 master node and 5 core nodes, each backed by an m3.xlarge instance.
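Before scaling the job out on EMR, it can help to see the computation it distributes. Below is a single-machine sketch of the PageRank power iteration in R; the toy link matrix and the damping factor of 0.85 are illustrative, not taken from the post.

```r
beta <- 0.85                        # damping factor
M <- matrix(c(0,   0, 1,            # column-stochastic link matrix:
              0.5, 0, 0,            # M[i, j] = 1/deg(j) if page j links to page i
              0.5, 1, 0),
            nrow = 3, byrow = TRUE)
n <- ncol(M)
r <- rep(1 / n, n)                  # start from a uniform rank vector

for (iter in 1:50) {
  # one PageRank step: follow links with prob. beta, teleport with prob. 1 - beta
  r <- beta * (M %*% r) + (1 - beta) / n
}
round(r, 3)                         # converged rank vector
```

The Hadoop job performs the same matrix-vector step, but splits the link matrix and rank vector across mappers and reducers on the cluster.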

Hierarchical Two-Compartment PK Model
In this post I fit a JAGS model to the Theoph dataset from MEMSS (data sets and sample analyses from Pinheiro and Bates, ‘Mixed-effects Models in S and S-PLUS’, Springer, 2000). To quote the MEMSS manual: ‘Boeckmann, Sheiner and Beal (1994) report data from a study by Dr. Robert Upton of the kinetics of the anti-asthmatic drug theophylline. Twelve subjects were given oral doses of theophylline then serum concentrations were measured at 11 time points over the next 25 hours. These data are analyzed in Davidian and Giltinan (1995) and Pinheiro and Bates (2000) using a two-compartment open pharmacokinetic model, for which a self-starting model function, SSfol, is available.’
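This is not the JAGS model from the post, but as a point of reference, here is a sketch of the classical (non-Bayesian) hierarchical fit with nlme using the SSfol function mentioned in the quote; the starting values are rough guesses taken from the SSfol documentation example and the diagonal random-effects structure is an assumption.

```r
library(nlme)   # Theoph ships with R as a groupedData object

# classical counterpart of the hierarchical PK fit:
# subject-level random effects on the log-parameters
fm <- nlme(conc ~ SSfol(Dose, Time, lKe, lKa, lCl),
           data   = Theoph,
           fixed  = lKe + lKa + lCl ~ 1,
           random = pdDiag(lKe + lKa + lCl ~ 1),
           groups = ~ Subject,
           start  = c(lKe = -2.5, lKa = 0.5, lCl = -3))

fixef(fm)     # population log elimination rate, absorption rate, clearance
summary(fm)
```

The JAGS version expresses the same structure, with priors on the population parameters and the between-subject variances sampled by MCMC rather than estimated by maximum likelihood.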