What’s the difference between Causality and Correlation?
Causation and correlation are loosely used words in analytics. People tend to use them interchangeably without knowing the fundamental logic behind them. Apparently, people get trapped by how similar the two words sound and end up using them in the wrong places. But let me warn you that, apart from the similar-sounding names, the two phenomena do not have much in common. Their fundamental implications are very different.

Deep Learning with h2o.ai
This post provides a brief history lesson and overview of deep learning, coupled with a quick ‘how to’ guide for dipping your toes into the water with H2O.ai. Then I describe how Domino lets us easily run H2O on scalable hardware and track the results of our deep learning experiments, to take analyses to the next level. The code and sample experiments I describe are available on Domino.

Computing Shortest Distances Incrementally with Spark
Shortest distances and paths have many uses in real-world graph applications. For instance, in a social network like the Facebook graph, shortest distance can serve as a measure of relevance. When you search for another user on Facebook, users at a smaller shortest distance from you are more relevant than users farther away. The shortest distance algorithm is also used as a subroutine for many other graph problems. For computing on large-scale graphs, this problem can easily be cast into an iterative map-reduce job. Indeed, Spark’s graph library, GraphX, ships with such an implementation.
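
To make the "iterative map-reduce" idea concrete, here is a minimal single-machine sketch in plain Python. It is not GraphX's implementation or the post's code: the function name `shortest_distances`, the toy edge list, and the choice of unweighted directed edges are all assumptions made for illustration. Each round, every reached vertex proposes a distance to its neighbours (the "map" step), and every vertex keeps the smallest proposal it received (the "reduce" step), until nothing changes.

```python
from collections import defaultdict

INF = float("inf")


def shortest_distances(edges, source):
    """Unweighted shortest distances from `source` via repeated map/reduce rounds.

    `edges` is a list of directed (src, dst) pairs. Hypothetical helper,
    written only to illustrate the iteration pattern.
    """
    # Every vertex mentioned in an edge, plus the source itself.
    vertices = {source} | {v for edge in edges for v in edge}
    dist = {v: INF for v in vertices}
    dist[source] = 0

    while True:
        # "Map": each reached vertex proposes (its distance + 1) to its out-neighbours.
        proposals = defaultdict(list)
        for src, dst in edges:
            if dist[src] != INF:
                proposals[dst].append(dist[src] + 1)

        # "Reduce": each vertex keeps the smallest proposal it received.
        changed = False
        for vertex, candidates in proposals.items():
            best = min(candidates)
            if best < dist[vertex]:
                dist[vertex] = best
                changed = True

        # No distance improved in this round, so the result is final.
        if not changed:
            return dist


# Toy directed graph (assumed data); "f" has no path from "a".
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e"), ("f", "a")]
result = shortest_distances(edges, "a")
```

Vertices that cannot be reached from the source keep a distance of infinity. On a cluster, the map and reduce steps would run in parallel across partitions of the edge list, but the loop structure stays the same.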

Shiny 0.12: Interactive Plots with ggplot2
Shiny 0.12 has been released to CRAN. Compared to version 0.11.1, the major changes are:
• Interactive plots with base graphics and ggplot2
• Switch from RJSONIO to jsonlite

In machine learning, is more data always better than better algorithms?
Probably one of the most famous quotes defending the power of data is that of Google’s Research Director Peter Norvig claiming that ‘We don’t have better algorithms. We just have more data.’ This quote is usually linked to the article on ‘The Unreasonable Effectiveness of Data’, co-authored by Norvig himself (you should probably be able to find the pdf on the web although the original is behind the IEEE paywall). The final nail in the coffin of better models comes when Norvig is misquoted as saying that ‘All models are wrong, and you don’t need them anyway’.

Pairwise-complete correlation considered dangerous
This note warns about potentially misleading results when using the use=pairwise.complete.obs and related options in R’s cor and cov functions. Pitfalls are illustrated using a very simple pathological example followed by a brief list of alternative ways to deal with missing data and some references about them.
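
One way such a pitfall can arise is sketched below in plain Python. This is not the note's own example and it does not call R: the helper `pairwise_complete_corr` and the toy data are assumptions that mimic the pairwise-complete idea, where each correlation is computed only over the rows in which both variables are observed. Because every pair ends up using a different subset of rows, the three correlations can be mutually inconsistent.

```python
import math

NA = float("nan")


def pairwise_complete_corr(a, b):
    """Pearson correlation over the rows where both a and b are observed.

    Hypothetical helper that imitates pairwise-complete behaviour.
    """
    pairs = [(p, q) for p, q in zip(a, b) if not (math.isnan(p) or math.isnan(q))]
    n = len(pairs)
    mean_a = sum(p for p, _ in pairs) / n
    mean_b = sum(q for _, q in pairs) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in pairs)
    var_a = sum((p - mean_a) ** 2 for p, _ in pairs)
    var_b = sum((q - mean_b) ** 2 for _, q in pairs)
    return cov / math.sqrt(var_a * var_b)


# Assumed toy data: each pair of variables is jointly observed on a different block of rows.
x = [1, 2, 3, NA, NA, NA, 1, 2, 3]
y = [1, 2, 3, 1, 2, 3, NA, NA, NA]
z = [NA, NA, NA, 1, 2, 3, 3, 2, 1]

r_xy = pairwise_complete_corr(x, y)  # uses rows 0-2 only
r_yz = pairwise_complete_corr(y, z)  # uses rows 3-5 only
r_xz = pairwise_complete_corr(x, z)  # uses rows 6-8 only

# Determinant of the 3x3 matrix [[1, r_xy, r_xz], [r_xy, 1, r_yz], [r_xz, r_yz, 1]].
# A genuine correlation matrix is positive semidefinite, so this cannot be negative.
det = 1 + 2 * r_xy * r_yz * r_xz - r_xy**2 - r_yz**2 - r_xz**2
```

Here x and y appear perfectly positively correlated, as do y and z, yet x and z appear perfectly negatively correlated. No complete dataset could produce all three at once, and the negative determinant shows that the assembled matrix is not a valid correlation matrix.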

Don’t Miss These Scripts: Otto Group Product Classification
Data scientists with very different backgrounds and varying levels of machine learning experience posted code in Otto’s scripts repository. We’ve selected a handful of scripts that we believe highlight important machine learning techniques, interesting packages, new approaches, and the creativity Kagglers are known for in the data science community.

Python: Improve LSTM text generation example
LSTM Text Generation with Keras (Theano wrapper)