drat Tutorial: Publishing a package
The drat package was released earlier this month, and described in a first blog post. I received some helpful feedback about what works and what doesn’t. For example, Jenny Bryan pointed out that I was not making a clear enough distinction between using drat to publish code and using drat to receive/install code. Very fair point, and somewhat tricky, as R aims to blur the line between being a user and a developer of statistical analyses, and hence of packages. Many of us are both. But the main point is well taken, and this note aims to clarify the issue a little by focusing on the former.

Market Research and Big Data: A Difficult Relationship
My advice to the market research world is to stop conceptualizing so much when it comes to Big Data and Data Science and simply apply the new techniques where appropriate.

k-means clustering and Voronoi sets
In the context of k-means, we want to partition the space of our observations into k classes. Each observation belongs to the cluster with the nearest mean. Here “nearest” is in the sense of some norm, usually the l2 (Euclidean) norm….
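As a rough illustration of the nearest-mean rule, here is a minimal sketch of one assignment step followed by one update step. The function names, the sample points and the starting means are all made up for this example, and the update assumes no cluster ends up empty.

```python
import math


def assign_to_nearest(points, means):
    """For each point, return the index of the nearest mean (Euclidean norm)."""
    return [
        min(range(len(means)), key=lambda j: math.dist(p, means[j]))
        for p in points
    ]


def update_means(points, labels, k):
    """Recompute each mean as the centroid of the points assigned to it.

    Assumption: every cluster has at least one member.
    """
    means = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        means.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return means


# Hypothetical data: two visibly separated groups in the plane.
points = [(0.0, 0.0), (1.0, 0.5), (9.0, 9.0), (10.0, 8.0)]
means = [(0.5, 0.0), (9.5, 9.0)]

labels = assign_to_nearest(points, means)
new_means = update_means(points, labels, k=2)
print(labels)
print(new_means)
```

The set of all locations that would be assigned to a given mean is that mean's Voronoi cell, which is where the connection in the title comes from.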

Redirecting All Kinds of stdout in Python
A common task in Python (especially while testing or debugging) is to redirect sys.stdout to a stream or a file while executing some piece of code. However, simply “redirecting stdout” is sometimes not as easy as one would expect; hence the slightly strange title of this post. In particular, things become interesting when you want C code running within your Python process (including, but not limited to, Python modules implemented as C extensions) to also have its stdout redirected according to your wish. This turns out to be tricky and leads us into the interesting world of file descriptors, buffers and system calls.
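The two levels of redirection can be sketched as follows. This is only an illustration under stated assumptions, not the code from the linked post: `redirect_fd_stdout` is a hypothetical helper name, and file descriptor 1 is assumed to be the process's standard output. The first part swaps `sys.stdout` with `contextlib.redirect_stdout`, which only affects Python-level writes; the second duplicates a file's descriptor onto descriptor 1 with `os.dup2`, which is the level at which C code writes.

```python
import contextlib
import io
import os
import sys
import tempfile

# Python-level redirect: only code that writes through sys.stdout is captured.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("captured at the Python level")
python_level = buffer.getvalue()


@contextlib.contextmanager
def redirect_fd_stdout(target_file):
    """Point file descriptor 1 at target_file for the duration of the block.

    Hypothetical helper; assumes descriptor 1 is the process's stdout.
    """
    sys.stdout.flush()                      # drain Python's buffer first
    saved_fd = os.dup(1)                    # remember the original stdout
    try:
        os.dup2(target_file.fileno(), 1)    # descriptor 1 now refers to the file
        yield
    finally:
        sys.stdout.flush()                  # push buffered prints into the file
        os.dup2(saved_fd, 1)                # restore the original stdout
        os.close(saved_fd)


# Descriptor-level redirect: os.write(1, ...) bypasses sys.stdout entirely,
# much like output produced by C code running in the same process.
with tempfile.TemporaryFile() as tmp:
    with redirect_fd_stdout(tmp):
        os.write(1, b"captured at the descriptor level\n")
    tmp.seek(0)
    fd_level = tmp.read()
```

The flush calls matter: without them, text sitting in Python's own buffer can end up on the wrong side of the redirection.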

Probability calibration
When performing classification you often want not only to predict the class label, but also to obtain a probability for that label. This probability gives you some kind of confidence in the prediction. Some models give poor estimates of the class probabilities, and some do not support probability prediction at all. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.
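To make the idea of calibration concrete, here is a small standard-library sketch. It is not the calibration module described above. It uses a simpler stand-in technique, histogram binning, in which each raw score is replaced by the fraction of positives observed among held-out examples with similar scores. The function names and the toy scores and labels are invented for illustration.

```python
def fit_histogram_calibrator(scores, labels, n_bins=5):
    """Learn, for each equal-width score bin, the observed fraction of positives.

    scores are assumed to lie in [0, 1]; labels are 0 or 1.
    """
    totals = [0] * n_bins
    positives = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)
        totals[b] += 1
        positives[b] += y
    # Bins with no data fall back to their midpoint (an arbitrary choice).
    return [
        positives[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score, bin_values):
    """Map a raw score to the calibrated probability of its bin."""
    n_bins = len(bin_values)
    return bin_values[min(int(score * n_bins), n_bins - 1)]


# Toy held-out data from an overconfident model: it says 0.9 where only
# 3 of 5 cases are positive, and 0.1 where 1 of 5 cases is positive.
scores = [0.9] * 5 + [0.1] * 5
labels = [1, 1, 1, 0, 0] + [0, 0, 0, 0, 1]

bin_values = fit_histogram_calibrator(scores, labels)
high = calibrate(0.9, bin_values)
low = calibrate(0.1, bin_values)
print(high, low)
```

With this toy data, the calibrated values are pulled toward the frequencies actually observed, which is the behavior a calibration step is meant to provide.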

Probabilistic PCA for Forecasting
… use a flavor of Probabilistic PCA that is robust to missing data (see Ilin and Raiko 2010). Rather than the vanilla one-shot SVD, PPCA uses an iterative EM procedure/fixed point algorithm. From an initial guess, it’ll alternately interpolate missing data and update the components until convergence.
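The alternating scheme can be sketched in a much-simplified form. The code below is not the PPCA of Ilin and Raiko: it has no noise model and no centering, keeps a single component, and fits it by power iteration. It only shows the loop of filling in missing entries and refitting the component. All names and the toy matrix are assumptions made for this example.

```python
def rank1_fit(X, n_power_iter=50):
    """Rank-1 approximation u v^T of a complete matrix via power iteration."""
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0] * n_cols
    for _ in range(n_power_iter):
        u = [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = sum(a * a for a in u) ** 0.5
        u = [a / norm for a in u]           # u has unit length
        v = [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
    return u, v                             # v carries the scale


def impute_rank1(X, n_outer=200, tol=1e-10):
    """Alternately fill missing entries (None) and refit a rank-1 model."""
    missing = [(i, j) for i, row in enumerate(X)
               for j, x in enumerate(row) if x is None]
    observed = [x for row in X for x in row if x is not None]
    guess = sum(observed) / len(observed)   # initial guess: mean of observed
    filled = [[guess if x is None else x for x in row] for row in X]
    for _ in range(n_outer):
        u, v = rank1_fit(filled)            # update the component
        change = 0.0
        for i, j in missing:                # interpolate the missing data
            new_value = u[i] * v[j]
            change = max(change, abs(new_value - filled[i][j]))
            filled[i][j] = new_value
        if change < tol:                    # stop once the fill-ins settle
            break
    return filled


# Toy matrix that is exactly rank 1 (entry = (i+1)*(j+1)); the top-left
# entry, whose true value is 1, is treated as missing.
X = [[None, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 6.0, 9.0]]
completed = impute_rank1(X)
print(completed[0][0])
```

Observed entries are never overwritten; only the missing positions are updated from the current fit on each pass.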

Big Data: 14 Requirements for Real-Time Analytics
1. Adopt In-Memory Databases
2. Embrace DRAM Memory
3. Leverage Flash
4. Go Parallel
5. Think Bigger
6. Think Ahead
7. That Makes Sense
8. Know Your Options
9. A Closer Look: Aerospike
10. A Closer Look: IBM BLU
11. A Closer Look: Microsoft
12. A Closer Look: Oracle
13. A Closer Look: SAP HANA
14. Bonus Content

Export R output to a file
Sometimes it is useful to export the output of a long-running R command. For example, you might want to run a time consuming regression just before leaving work on Friday night, but would like to get the output saved inside your Dropbox folder to take a look at the results before going back to work on Monday.