Notes on Multivariate Gaussian Quadrature (with R Code)

Statisticians often need to integrate some function with respect to the multivariate normal (Gaussian) distribution, for example, to compute the standard error of a statistic, or the likelihood function of a mixed effects model. In many (most?) useful cases, these integrals are intractable, and must be approximated using computational methods. Monte Carlo integration is one such method. It is stochastic, and its computation can be prohibitively expensive, especially when the integral must be computed many times.
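
To make the contrast concrete, here is a minimal sketch (in Python rather than the R of the original notes, and not the author's code) that approximates E[g(X, Y)] for a bivariate normal in two ways: a product rule built from the standard 3-point Gauss-Hermite rule, and plain Monte Carlo. The function names, the 2x2 restriction, and the test function g(x, y) = x*y are illustrative assumptions.

```python
import math
import random
from itertools import product

# 3-point Gauss-Hermite rule for the standard normal weight:
# nodes -sqrt(3), 0, sqrt(3) with weights 1/6, 2/3, 1/6.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def cholesky_2x2(cov):
    """Lower-triangular Cholesky factor of a 2x2 covariance matrix."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return l11, l21, l22


def transform(z1, z2, mean, chol):
    """Map independent standard normal coordinates to N(mean, cov)."""
    l11, l21, l22 = chol
    return mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2


def quadrature_expectation(g, mean, cov):
    """Deterministic product-rule approximation using 3 x 3 = 9 nodes."""
    chol = cholesky_2x2(cov)
    total = 0.0
    for (z1, w1), (z2, w2) in product(zip(NODES, WEIGHTS), repeat=2):
        total += w1 * w2 * g(*transform(z1, z2, mean, chol))
    return total


def monte_carlo_expectation(g, mean, cov, n_draws, seed=0):
    """Stochastic approximation: average g over simulated draws."""
    rng = random.Random(seed)
    chol = cholesky_2x2(cov)
    total = 0.0
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        total += g(*transform(z1, z2, mean, chol))
    return total / n_draws


# Illustrative check: E[XY] = Cov(X, Y) + E[X] * E[Y] = 0.5 + 1 * 2 = 2.5
mean = (1.0, 2.0)
cov = ((2.0, 0.5), (0.5, 1.0))
quad = quadrature_expectation(lambda x, y: x * y, mean, cov)
mc = monte_carlo_expectation(lambda x, y: x * y, mean, cov, n_draws=20000)
```

For a low-degree polynomial such as this one, the nine-node rule is exact up to rounding, whereas the Monte Carlo estimate still carries sampling error after thousands of draws. For non-polynomial integrands the quadrature result is only an approximation, and more nodes per dimension may be needed.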

Running scalable Data Science on Cloud with R & Python

The complexity in data science is increasing by the day. This complexity is driven by three fundamental factors:
1. Increased Data Generation
2. Low cost of data storage
3. Cheap computational power
So, in summary, we are generating far more data (and you are becoming a data point as you read this article!), we can store it at a low cost and can run computations and simulations on this data at a low cost!

Heuristics on bias and variance for kernel density estimators

Consider the simple case of a moving histogram (which is a very simple kernel). …
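
As a rough illustration of what a moving histogram computes (a sketch under assumed notation, not the post's own code), the estimate at a point x is the fraction of observations within a half-width h of x, divided by the window length 2h. The names `moving_histogram_density`, `sample`, and `h` are hypothetical.

```python
def moving_histogram_density(x, sample, h):
    """Density estimate at x using a uniform window of half-width h."""
    if h <= 0:
        raise ValueError("h must be positive")
    if not sample:
        raise ValueError("sample must not be empty")
    # Count the observations that fall inside [x - h, x + h].
    count = sum(1 for xi in sample if abs(xi - x) <= h)
    # Normalise by the sample size and the window length 2h.
    return count / (2.0 * h * len(sample))


# Illustrative data: three of the four points lie within 0.15 of x = 0.1.
sample = [0.0, 0.1, 0.2, 1.0]
estimate = moving_histogram_density(0.1, sample, h=0.15)
```

The half-width h carries the usual trade-off: a narrow window tracks local structure more closely (less bias) but averages over fewer points (more variance), and a wide window does the opposite.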

Google voice search: faster and more accurate

Back in 2012, we announced that Google voice search had taken a new turn by adopting Deep Neural Networks (DNNs) as the core technology used to model the sounds of a language. These replaced the 30-year-old standard in the industry: the Gaussian Mixture Model (GMM). DNNs were better able to assess which sound a user is producing at every instant in time, and with this they delivered greatly increased speech recognition accuracy. Today, we’re happy to announce we built even better neural network acoustic models using Connectionist Temporal Classification (CTC) and sequence discriminative training techniques. These models are a special extension of recurrent neural networks (RNNs) that are more accurate, especially in noisy environments, and they are blazingly fast!

Hadoop Maturity Survey: The Tipping Point

AtScale's first global Hadoop maturity survey finds that Hadoop's value increases greatly with the number of nodes deployed, and that its use for ETL is frequently a transition stage to higher-value Data Science applications.

Making it easy to use RHadoop on HDInsight Hadoop clusters

The RHadoop packages make it easy to connect R to Hadoop data (rhdfs), and write map-reduce operations in the R language (rmr2) to process that data using the power of the nodes in a Hadoop cluster. But getting the Hadoop cluster configured, with R and all the necessary packages installed on each node, hasn’t always been so easy.

R – My journey so far

The first time I heard about R was about 4 years ago…a couple of weeks after I joined SAP. At that time I read in one of our internal documents that SAP HANA was going to be able to interact with the R programming language. At first, I was totally clueless about R…I had never heard of it before…so of course I started looking for some more information, downloaded R and RStudio and started learning how to use it… After some time…I posted my first blog talking about R…that was on November 28, 2011…the blog was Dealing with R and HANA… After that I kept learning and using it whenever it was suitable…and I ended up writing my most successful blog on the SAP Community Network…that was on May 21, 2012.

Hypothesis-Driven Development Part V: Stop-Loss, Deflating Sharpes, and Out-of-Sample

This post will demonstrate a stop-loss rule inspired by Andrew Lo’s paper “When Do Stop-Loss Rules Stop Losses?” Furthermore, it will demonstrate how to deflate a Sharpe ratio to account for the total number of trials conducted, a method presented in a paper by David H. Bailey and Marcos Lopez De Prado. Lastly, the strategy will be tested on the out-of-sample ETFs, rather than the mutual funds that have been used up until now (which actually cannot be traded more than once every two months, but have been used simply for the purpose of demonstration).

Writing “Python Machine Learning”

It’s been about time. I am happy to announce that ‘Python Machine Learning’ was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing ‘Python Machine Learning’ really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.

Best of the Visualisation Web… July 2015

At the end of each month I pull together a collection of links to some of the most relevant, interesting or thought-provoking web content I’ve come across during the previous month. Here’s the latest collection from June 2015.