Aggregate – A Powerful Tool for Data Frame in R

This post gives a short review of the aggregate function as used for data.frames and presents some interesting uses: from the trivial but handy to the most complicated problems I have solved with aggregate.

10 more lessons learned from building Machine Learning systems


Step-by-Step with a Terabyte of Reddit Data

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to install Jupyter on your Spark cluster and use PySpark for some ad hoc analysis of reddit comment data on Amazon S3. The tutorial installs Jupyter on your Spark cluster in standalone mode on top of Hadoop and walks through some transformations and queries on the reddit comment data. We assume you already have an AWS EC2 cluster up with Spark 1.4.1 and Hadoop 2.7 installed. If not, you can go to our previous post on how to quickly deploy your own Spark cluster.

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification

TensorFlow Disappoints – Google Deep Learning falls shallow

Google recently open-sourced its TensorFlow machine learning library, which aims to bring large-scale, distributed machine learning and deep learning to everyone. But does it deliver?

Why Submitting Ideas To The R Consortium Is A Good Idea

Announced in June 2015, the R Consortium aims to support the R ecosystem and community and help grow the adoption of R. To deliver on these aims, the R Consortium has an Infrastructure Steering Committee (ISC). The ISC are responsible for directing technical focus and overseeing projects to deliver improvements. You’ll note that in all this there’s no mention of a central vision for R beyond supporting it. That’s because the R Consortium is looking to support the projects that the community thinks will help it grow and thrive. Every six months, the R Consortium will grant awards to proposals for projects that they feel best help the community. The ISC are responsible for receiving, evaluating, and selecting projects.

Interactive Data Science with R in Apache Zeppelin Notebook

The objective of this blog post is to help you get started with Apache Zeppelin notebook for your R data science requirements. Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, Shell and more.

Partools, Recommender Systems and More

Recently I attended a talk by Stanford’s Art Owen, presenting work done with his student, Katelyn Gao. This talk touched on a number of my interests, both mathematical and computational. What particularly struck me was that Art and Katelyn are applying a very old — many would say very boring — method to a very modern, trendy application: recommender systems. (See also a recent paper by P.O. Perry.)

Wind in Netherlands

In climate change discussions, everybody talks about temperature. But weather is much more than that. There are at least rain and wind as directly experienced qualities, and air pressure as a measurable quantity. In the Netherlands, some observation stations have more than a century of daily data on these things. The data may be broken in the sense that equipment and location can have changed. To quote: ‘These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt link or the Central Netherlands Temperature link.’ Since I am not looking at temperature but wind, I will keep to this station’s data.

What do auto-correlated residuals do to your linear model?

For training purposes I wanted to illustrate the dangers of ignoring time series characteristics of the random part of a classical linear regression, and I came up with this animation to do it:
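The effect can also be checked numerically. The sketch below is not the author's animation or code; it is a small Python simulation under assumed values (a linear time trend as the regressor, AR(1) errors, a true slope of 0.5, and the hypothetical helper names `simulate_ar1_errors`, `ols_slope_and_se` and `run_experiment`). It fits ordinary least squares to many simulated series and compares the standard error that the classical formula reports with the actual spread of the slope estimates.

```python
import random
import statistics


def simulate_ar1_errors(n, rho, rng):
    """Generate AR(1) noise: e[t] = rho * e[t-1] + standard normal shock."""
    errors = []
    prev = 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, 1.0)
        errors.append(prev)
    return errors


def ols_slope_and_se(x, y):
    """Simple linear regression: return the slope and its classical standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Classical formula, which assumes independent errors.
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


def run_experiment(rho, n=100, reps=500, seed=1):
    """Return (mean slope, empirical sd of slopes, mean reported standard error)."""
    rng = random.Random(seed)
    x = [float(t) for t in range(n)]  # assumed regressor: a plain time trend
    slopes, reported = [], []
    for _ in range(reps):
        errors = simulate_ar1_errors(n, rho, rng)
        y = [1.0 + 0.5 * xi + ei for xi, ei in zip(x, errors)]
        slope, se = ols_slope_and_se(x, y)
        slopes.append(slope)
        reported.append(se)
    return statistics.mean(slopes), statistics.stdev(slopes), statistics.mean(reported)


if __name__ == "__main__":
    for rho in (0.0, 0.8):
        mean_slope, actual_sd, reported_se = run_experiment(rho)
        print(f"rho={rho}: mean slope={mean_slope:.4f}, "
              f"actual sd={actual_sd:.5f}, reported se={reported_se:.5f}")
```

With independent errors (`rho=0.0`) the reported standard error and the actual spread should roughly agree. With strongly autocorrelated errors the slope estimate stays centred on the true value, but the classical standard error understates how much it varies, so confidence intervals and p-values look better than they are.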

James Bond movies

I’m a big James Bond fan, so naturally I went to watch the new Bond movie Spectre which – spoiler alert! – is pretty bad. It also got me to reminisce about the good Bond films of the past. My personal candidate for worst Bond film is Die Another Day, but what does the “objective” opinion say on this hotly debated topic? Does my taste conform to the Internet’s taste?

Using htmlwidgets with knitr and Jekyll

A few weeks ago I gave a talk at BARUG (and wrote a post) about blogging with the excellent knitr-jekyll repo. Yihui’s system is fantastic, but it does have one drawback: None of those fancy new htmlwidgets packages seem to work… A few people have run into this. I recently figured out how to fix it for this blog (which required a bit of time reading through the rmarkdown source), so I thought I’d write it up in case it helps anyone else, or my future-self.