I like playing around with data from Eurostat, and these days the tools make it very easy. In R, the eurostat package pulls the data directly from the database. Process it a bit with dplyr and, before you know it, ggplot has made a plot.
A data set is said to be large when it exceeds 20% of the available RAM on a single machine. For a standard MacBook Pro with 8 GB of RAM, that corresponds to a meager 1.6 GB dataset, a size that is becoming more and more common these days. Of course, before you actually run out of memory, your machine will slow to a crawl and your frustration will rise accordingly. First-aid strategies for dealing with large data consist of sampling the data so that you only work with a subset, or reaching for more RAM by going to the cloud: Amazon offers boxes with plenty of RAM for pennies per hour. Other options are libraries such as Apache Spark’s MLlib, or platforms such as H2O or Dato’s GraphLab Create. R also has a streaming package. However, if scikit-learn is your weapon of choice for machine learning, you should stick with it and make the best of its out-of-core processing capabilities.
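To make the out-of-core idea concrete, here is a minimal sketch in plain Python rather than scikit-learn itself. It assumes the data arrives as a stream that can be cut into small batches. The names `is_large`, `chunks`, and `RunningMean` are hypothetical helpers invented for this illustration. The `partial_fit` method only mirrors the naming that scikit-learn's incremental estimators use, and the "model" is a simple running mean rather than a real learner.

```python
from typing import Iterable, Iterator, List


def is_large(dataset_gb: float, ram_gb: float, threshold: float = 0.20) -> bool:
    """Rule of thumb from the text: 'large' means more than 20% of RAM."""
    return dataset_gb > threshold * ram_gb


def chunks(rows: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield lists of at most `size` values, never holding the full stream."""
    batch: List[float] = []
    for value in rows:
        batch.append(value)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


class RunningMean:
    """Toy stand-in for an incremental estimator (hypothetical, not scikit-learn)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, batch: List[float]) -> "RunningMean":
        # Update the mean one value at a time; only the current batch is in memory.
        for value in batch:
            self.count += 1
            self.mean += (value - self.mean) / self.count
        return self


# A generator stands in for a file or database cursor too big to load at once.
stream = (float(i) for i in range(1, 1001))
model = RunningMean()
for batch in chunks(stream, size=100):
    model.partial_fit(batch)
```

The same loop shape applies when the estimator is a scikit-learn model that supports incremental learning: read a batch, call `partial_fit` on it, discard it, and repeat.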
Variational Recurrent Neural Network
In this paper, we explore the inclusion of latent random variables into the dynamic hidden state of a recurrent neural network (RNN) by combining elements of the variational autoencoder.
I’ve seen R users swooning over the magrittr package for a while now, but I couldn’t make heads or tails of all these scary %>% symbols. Finally I had time for a closer look, and it seems potentially handy indeed. Here’s the idea and a simple toy example.