Google’s Great Gains in the Grand Game of Go

The game of Go has long stumped AI researchers, and, as such, solving it was thought to be years off. That is, until Google solved it earlier this week. Or did it?


6 Differences Between Pandas And Spark DataFrames

A post describing the key differences between Pandas and Spark DataFrames, including specifics on important data processing features, with code samples.
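To make one such difference concrete, here is a small sketch (not from the post itself) contrasting pandas’ eager, mutable DataFrames with Spark’s lazy, immutable ones; it assumes a local pyspark installation:

```python
import pandas as pd
from pyspark.sql import SparkSession

# pandas: eager and mutable -- the new column exists as soon as we assign it
pdf = pd.DataFrame({"a": [1, 2, 3]})
pdf["b"] = pdf["a"] * 2

# Spark: immutable and lazy -- withColumn returns a *new* DataFrame, and no
# work actually happens until an action such as show() is called
spark = SparkSession.builder.master("local[1]").getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf2 = sdf.withColumn("c", sdf["a"] * 3)   # a transformation, not yet executed
sdf2.show()                                # the action that triggers computation
```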


Is Deep Learning Overhyped?

With all of the success that deep learning is experiencing, the detractors and cheerleaders can be seen coming out of the woodwork. What is the real validity of deep learning, and is it simply hype?


Explore Data Science

Explore Data Science online training introduces common data science theory and techniques to help programmers, mathematicians, and other technical professionals expand their data science expertise. Participants go at their own pace during an overall 40-hour exploration of frequently used data science techniques. Explore Data Science is in the form of an interactive game, with participants advancing from one planet to another as they’re actively engaged with real datasets and interactive tasks. They earn points, awards, and badges for completing missions along the way. Python is used to complete the interactive challenges. There’s no need to fuss with environment and package setup. Explore Data Science has code execution right in the browser, meaning that users get real results, in real time.


Useful Data Science: Feature Hashing

Feature engineering plays a major role in solving data science problems. Here, we will learn about feature hashing, also known as the hashing trick: a method for turning arbitrary features into a sparse binary vector.
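For a taste of the idea (a minimal sketch, not the post’s implementation), the trick is to hash each feature name into a slot of a fixed-length vector, so no dictionary of features ever needs to be stored:

```python
import hashlib

def hash_features(tokens, n_dims=16):
    """Turn arbitrary string features into a fixed-length sparse binary vector."""
    vec = [0] * n_dims
    for tok in tokens:
        # A stable hash of the feature name, reduced modulo the vector length;
        # collisions are possible but tolerable when n_dims is large.
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % n_dims
        vec[idx] = 1
    return vec

print(hash_features(["user=alice", "browser=firefox", "country=de"]))
```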


Implementing your own k-nearest neighbour algorithm using Python

In machine learning, you may often wish to build predictors that allow you to classify things into categories based on some set of associated values. For example, it is possible to provide a diagnosis to a patient based on data from previous patients. Classification can involve constructing highly non-linear boundaries between classes, as in the case of the red, green and blue classes pictured in the post.
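For flavour, a bare-bones version of the classifier (a sketch under simple assumptions, not the tutorial’s code) fits in a few lines:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X, y, np.array([4.5, 5.0])))    # -> "blue"
```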


Google Launches Deep Learning with TensorFlow MOOC

Google and Udacity have partnered for a new self-paced course on deep learning and TensorFlow, starting immediately.


The correlation between original and replication effect sizes might be spurious

In the Reproducibility Project, original effect sizes correlated at r = 0.51 with the effect sizes of replications. Some researchers find this hopeful.


Get the fully qualified domain name for your machine

This is just a quick post to mention how you can get your computer name together with the domain it is registered in, i.e. the fully qualified domain name (FQDN), by using R.
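The post itself works in R; for comparison, the equivalent lookup in Python is a standard-library one-liner:

```python
import socket

# Returns e.g. "mymachine.example.com" if the machine is registered in a
# domain, otherwise falls back to the plain hostname.
print(socket.getfqdn())
```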


Better prediction intervals for time series forecasts

Hybrid forecasts (averages of single-model forecasts) are commonly used to produce point estimates that are better than any of the contributing forecast models. I show how prediction intervals can be constructed for a hybrid forecast that have more accurate coverage than the most commonly used prediction intervals (i.e. 80% of actual observations do indeed turn out to be within the 80% prediction interval), tested on the 3,003 series of the M3 forecasting competition.
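One simple, conservative way to combine component intervals (an illustrative sketch, not necessarily the construction the post evaluates) is to average the point forecasts but take the union of the component bounds:

```python
import numpy as np

# Toy 80% intervals from two hypothetical component models, one horizon ahead
arima_point, arima_lo, arima_hi = 100.0, 90.0, 110.0
ets_point, ets_lo, ets_hi = 104.0, 98.0, 112.0

hybrid_point = np.mean([arima_point, ets_point])   # average of point forecasts
hybrid_lo = min(arima_lo, ets_lo)                  # widest lower bound
hybrid_hi = max(arima_hi, ets_hi)                  # widest upper bound
print(hybrid_point, (hybrid_lo, hybrid_hi))        # 102.0 (90.0, 112.0)
```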


A Million Text Files And A Single Laptop

More often than I would like, I receive datasets where the data has only been partially cleaned: hundreds, thousands, even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices). The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough that using Hadoop or even a relational database seems like overkill. Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.
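The post leans on GNU Parallel from the shell; the same stream-in-parallel idea, roughly sketched in Python (the directory and per-file summary are placeholders), looks like this:

```python
import glob
from multiprocessing import Pool

def summarise(path):
    """Stream one small file and reduce it to a single number (here, a line count)."""
    with open(path) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    paths = glob.iglob("data/*.txt")       # lazy iterator over the million files
    with Pool() as pool:
        # imap_unordered streams results back, so neither the file list nor
        # the data ever has to fit in RAM all at once
        total = sum(pool.imap_unordered(summarise, paths, chunksize=256))
    print(total)
```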


love-hate Metropolis algorithm

Hyungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms for dealing with multimodal targets, called “A repulsive-attractive Metropolis algorithm for multimodality” [although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}, followed by a [forced] Metropolis-Hastings upward move whose target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work), and it is chosen as 10⁻³²³ by default, in connection with R’s smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the same random walk density in both cases, as I would have imagined relying on a more dispersed distribution for the downward move in order to reach other modes more easily. For instance, the downward move could have been based on an anti-Langevin proposal, relying on the gradient to proceed further down…
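Schematically (and with the caveat that this sketch shows only the down-up proposal; the paper adds an auxiliary-variable acceptance step so that the chain still targets π exactly, and the ε regularisation is omitted here), the move looks like:

```python
import numpy as np

def forced_move(x, log_pi, scale, rng, down=True):
    """Repeat a random-walk proposal until the (down- or up-hill) MH test accepts.
    The downhill move targets the inverse density, the uphill move the density."""
    while True:
        y = x + rng.normal(0.0, scale)
        log_ratio = log_pi(y) - log_pi(x)
        if down:
            log_ratio = -log_ratio        # targeting 1/pi pushes the chain downhill
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            return y

def ram_proposal(x, log_pi, scale, rng):
    x_low = forced_move(x, log_pi, scale, rng, down=True)      # slide off the mode
    return forced_move(x_low, log_pi, scale, rng, down=False)  # climb towards a mode
```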


In-depth analysis of Twitter activity and sentiment, with R

Astronomer and budding data scientist Julia Silge has been using R for less than a year but, based on the posts on her blog, she has already become very proficient at using R to analyze some interesting data sets. She has posted detailed analyses of water consumption data and health care indicators from the Utah Open Data Catalog, religious affiliation data from the Association of Statisticians of American Religious Bodies, and demographic data from the American Community Survey (that’s the same dataset we mentioned on Monday).


Intro to Sound Analysis with R

Some of my articles cover getting started with a particular piece of software, and some cover tips and tricks for seasoned users. This article, however, is different. It does demonstrate the usage of an R package, but the main purpose is fun. In an article in Time, Matt Peckham described how French researchers were able to use four microphones and a single snap to model a complex room to within 1mm accuracy (Peckham). I decided that I wanted to attempt this (on a smaller scale) with one microphone and an R package, and I was amazed at the results. Since the purpose of this article is not to teach anyone to write code to work with sound clips, I will not work through the code line by line; instead, I will give a general overview and present the code in full at the end for anyone who would like to recreate it for themselves.
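The heart of the echo-location trick is simply finding the lag of a reflection. In Python terms (a toy sketch under simple assumptions, not the article’s R code), that amounts to locating a late peak in the autocorrelation of the clip:

```python
import numpy as np

def echo_distance(clip, fs, min_lag=50, speed_of_sound=343.0):
    """Estimate the distance (metres) to a reflecting surface from the delay
    of the strongest echo in a mono recording sampled at fs Hz."""
    ac = np.correlate(clip, clip, mode="full")[len(clip) - 1:]  # non-negative lags
    lag = min_lag + np.argmax(ac[min_lag:])      # skip the trivial zero-lag peak
    delay = lag / fs                             # echo round-trip time in seconds
    return delay * speed_of_sound / 2            # halve for the one-way distance
```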


Pipelining R and Python in Notebooks

As a Data Scientist, I refuse to choose between R and Python, the top contenders currently fighting for the title of top Data Science programming language. I am not going to argue about which is better or pit Python and R against each other. Rather, I’m simply going to suggest playing to the strengths of each language and considering using them together in the same pipeline if you don’t want to give up the advantages of one over the other. This is not a novel concept: both languages have packages/modules which allow the other language to be used within them (rpy2 in Python and rPython in R). Even in Jupyter notebooks using the Python kernel, one can use ‘R magics’ to execute native R code (which actually relies on rpy2).
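As a minimal sketch of the rpy2 route (it assumes an R installation alongside Python, and is not the notebook from the post), a pandas DataFrame can be handed straight to an R model:

```python
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()                        # auto-convert pandas <-> R data frames

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.1, 3.9, 6.2, 8.0]})
ro.globalenv["df"] = df                     # expose the DataFrame to the R session
print(ro.r("coef(lm(y ~ x, data = df))"))  # fit a linear model on the R side
```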


Linear regression with random error giving EXACT predefined parameter estimates

When simulating linear models based on some defined slope/intercept with added Gaussian noise, the parameter estimates vary after least-squares fitting. Here is some code I developed that applies a double transform to these models so as to obtain a fitted model with EXACT predefined parameter estimates a (intercept) and b (slope).
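The post’s double transform is written in R; the same effect can be illustrated in Python (a sketch of one way to do it, not the author’s code) by stripping from the noise the part that the regression could “see”:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 0.5                          # the EXACT intercept and slope we want back
x = rng.uniform(0.0, 10.0, 100)
e = rng.normal(0.0, 1.0, 100)            # raw Gaussian noise

# Regress the noise on x and keep only the residuals, so the remaining error
# is exactly orthogonal to the design matrix and sums to zero.
X = np.column_stack([np.ones_like(x), x])
e_orth = e - X @ np.linalg.lstsq(X, e, rcond=None)[0]

y = a + b * x + e_orth                       # random-looking data ...
print(np.linalg.lstsq(X, y, rcond=None)[0])  # ... yet OLS returns [2.0, 0.5]
                                             # (up to floating-point precision)
```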