Tensor Methods in Machine Learning

Tensors are higher-dimensional generalizations of matrices. In recent years, tensor decompositions have been used to design learning algorithms for estimating the parameters of latent variable models such as Hidden Markov Models, Mixtures of Gaussians, and Latent Dirichlet Allocation (many of these works are considered examples of “spectral learning”; read on to find out why). In this post I will briefly describe why tensors are useful in these settings.
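To make the central object concrete, here is a minimal plain-Python sketch (no external libraries) of a third-order tensor built as a weighted sum of rank-1 terms, i.e. outer products of vectors. Recovering such components from a moment tensor is the structure these decomposition methods exploit; the vectors and weights below are made-up illustrative numbers, not from any real model.

```python
def outer3(a, b, c):
    """Rank-1 third-order tensor: T[i][j][k] = a[i] * b[j] * c[k]."""
    return [[[ai * bj * ck for ck in c] for bj in b] for ai in a]

def add_tensors(t1, t2):
    """Element-wise sum of two third-order tensors of the same shape."""
    return [[[x + y for x, y in zip(r1, r2)]
             for r1, r2 in zip(m1, m2)]
            for m1, m2 in zip(t1, t2)]

# Two toy "components" (think: two latent states/topics) with mixing weights.
a1, b1, c1 = [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]
a2, b2, c2 = [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
w1, w2 = 0.7, 0.3

# T = w1 * (a1 ⊗ b1 ⊗ c1) + w2 * (a2 ⊗ b2 ⊗ c2)
T = add_tensors(
    outer3([w1 * x for x in a1], b1, c1),
    outer3([w2 * x for x in a2], b2, c2),
)

# Because the components here are orthogonal, the weights sit on the diagonal:
print(T[0][0][0], T[1][1][1])  # 0.7 0.3
```

A matrix of rank 2 built the same way would not let you recover the individual components (any rotation of the factors gives the same matrix); a key point of the tensor literature is that for third-order tensors the decomposition is essentially unique under mild conditions.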

How Much Did It Rain? II: 2nd place, Luis Andre Dutra e Silva

How Much Did It Rain? II was the second competition (of the same name) that challenged Kagglers to predict hourly rainfall measurements. Luis Andre Dutra e Silva finished in second place, and in doing so became a Kaggle Master (congrats!). In this blog post, Luis shares his approach and why using an LSTM model ‘is like reconstructing a melody with some missed notes.’

Tutorial for Developing Interactive R Problem Sets with RTutor

RTutor is an R package that allows you to develop interactive R exercises. Problem sets can be solved offline or hosted on the web with shinyapps.io. They can be written as Markdown .rmd files (to be solved directly in RStudio) or presented through a browser-based interface powered by RStudio’s Shiny. While the web interface looks nicer, I personally use problem sets in the Markdown format when teaching advanced economics classes.

Anomaly Detection in R

Inspired by this Netflix post, I decided to write a post on this topic using R. There are several nice packages to achieve this goal; the one we’re going to review is AnomalyDetection.
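For readers unfamiliar with the underlying idea, here is a simplified, package-free Python sketch: flag points that sit many robust deviations (median absolute deviation, MAD) away from the series median. This is not the S-H-ESD algorithm that the AnomalyDetection package implements, just an illustration of robust-threshold anomaly flagging on toy data.

```python
import statistics

def mad_anomalies(series, threshold=3.5):
    """Return indices of points whose modified z-score exceeds the threshold."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []  # no spread: nothing can be flagged robustly
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, x in enumerate(series)
            if abs(0.6745 * (x - med) / mad) > threshold]

data = [10, 11, 9, 10, 12, 10, 95, 11, 10, 9]  # one obvious spike
print(mad_anomalies(data))  # [6]
```

Median and MAD are used instead of mean and standard deviation precisely because the anomalies themselves would otherwise inflate the baseline they are being compared against.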

Tutorial: Data Science with SQL Server R Services

You may have heard that R and the big-data RevoScaleR package have been integrated with SQL Server 2016 as SQL Server R Services. If you’ve been wanting to try out R with SQL Server but haven’t been sure where to start, a new MSDN tutorial will take you through all the steps of creating a predictive model: from obtaining data for analysis, to building a statistical model, to creating a stored procedure to make predictions from the model. To work through the tutorial, you’ll need a suitable Windows server on which to install the SQL Server 2016 Community Technology Preview, and make sure you have SQL Server R Services installed. You’ll also need a separate Windows machine (say a desktop or laptop) where you’ll install Revolution R Open and Revolution R Enterprise. Most of the computations will be happening in SQL Server, though, so this ‘data science client machine’ doesn’t need to be as powerful. The tutorial is made up of five lessons, which together should take you about 90 minutes to run through. If you run into problems, each lesson includes troubleshooting tips at the end.

RStudio Clone for Python – Rodeo

So have you been looking for something like RStudio, but for Python? It’s been out for some time, but a recently updated release of Rodeo gives an increasingly workable RStudio-like environment for Python users. The layout resembles the RStudio layout – file editor top left, interactive console bottom left, variable inspector and history top right, charts, directory view and plugins bottom right. (For plugins, read: packages).

Workflows in Python: Getting data ready to build models

A couple of weeks ago, I had the opportunity to host a workshop at the Open Data Science Conference in San Francisco. During the workshop, I shared the process of rapid prototyping followed by iterating on the model I’d built. When I’m building a machine learning model in scikit-learn, I usually don’t know exactly what my final model will look like at the outset. Instead, I’ve developed a workflow that focuses on getting a quick-and-dirty model up and running as quickly as possible, and then going back to iterate on the weak points until the model seems to be converging on an answer. This process has three phases, which I’ll highlight in an example I created to predict failures of wells in Africa. In this blog post, I’ll show how I got the raw data machine-learning ready and built a few quick models. In subsequent posts, I’ll revisit some of the choices made in the first model, effectively cleaning up some messes that I made in the interest of moving quickly. Lastly, I’ll introduce scikit-learn Pipelines and GridSearchCV, a pair of tools for quickly attaching pieces of data science machinery and comprehensively searching for the best model.
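As a taste of what a grid search does conceptually, here is a stdlib-only sketch: try every parameter combination, score each one fold by fold, and keep the best. scikit-learn’s GridSearchCV does this for real estimators with proper train/test splits; the “model” below is just a fixed threshold rule on made-up 1-D data, so treat it as an illustration of the search loop, not of the sklearn API.

```python
from itertools import product

def cv_score(threshold, xs, ys, k=2):
    """Mean accuracy of the rule 'predict 1 if x > threshold' over k folds."""
    fold = len(xs) // k
    scores = []
    for f in range(k):
        test = range(f * fold, (f + 1) * fold)
        correct = sum((xs[i] > threshold) == bool(ys[i]) for i in test)
        scores.append(correct / len(test))
    return sum(scores) / k

# Toy data: class 1 when x is large.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0,   0,   0,   0,   1,   1,   1,   1]

# The "grid": every combination of candidate parameter values.
grid = {"threshold": [0.25, 0.5, 0.75]}
best = max(product(*grid.values()),
           key=lambda params: cv_score(*params, xs, ys))
print(best)  # (0.5,)
```

The quick-and-dirty phase corresponds to picking one grid point by hand; the cleanup phase is when a loop like this (or GridSearchCV proper, wrapped around a Pipeline) searches the space exhaustively.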

A use of gsub, reshape2 and sqldf with healthcare data

Building off other industry-specific posts, I want to use healthcare data to demonstrate the use of R packages. The data can be downloaded here. To read the .CSV file into R, see the post on how to import data in R. Packages in R are stored in libraries and are often pre-installed, but reaching the next level of skill requires knowing when to use new packages and what they contain. With that, let’s get to our example.

Buzzfeed uses R for Data Journalism

Buzzfeed isn’t just listicles and cat videos these days. Science journalist Peter Aldhous recently joined Buzzfeed’s editorial team, after stints at Nature, Science and New Scientist magazines. He brings with him his data journalism expertise and R programming skills to tell compelling stories with data on the site. His stories, like this one on the rates of terrorism incidents in the USA, often include animated maps or interactive charts created with R.