Microsoft R Open 3.2.3 now available

Microsoft R Open 3.2.3, the performance-enhanced distribution of R 3.2.3, is now available for download from

Fuzzy Matching Algorithms To Help Data Scientists Match Similar Data

A common scenario for data scientists: the marketing, operations or business group gives you two sets of similar data with different variables and asks the analytics team to normalize both data sets into a common record for modelling.

Real Time Predictive Models – Are They Possible?

At least one instance of Real Time Predictive Model development in a streaming data problem has been shown to be more accurate than its batch counterpart. Whether this can be generalized is still an open question. It does challenge the assumption that Time-to-Insight can never be real time.

Introduction to Outlier Detection Methods

One of the challenges in data analysis in general, and predictive modeling in particular, is dealing with outliers. Many modeling techniques are resistant to outliers or reduce their impact, but detecting and understanding outliers can still lead to interesting findings. We generally define outliers as samples that are exceptionally far from the mainstream of the data. There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.

Learn R From Scratch – Part 3

In the previous tutorial of the “Learn R From Scratch” series, we learned very important concepts such as lists, dataframes and how to import and export data from R. This time, we will discuss more practical aspects such as exploring built-in datasets, handling dates, writing functions and debugging. In this series, I have also laid out 4 practice exercises, so you can apply what you have learnt thus far. So, let’s get to work and crack this piece.

Learn R From Scratch – Part 2

This is a continuation of Part 1 of the “Learn R From Scratch” series. In the previous post, the videos covered the very basics of R from scratch. We first installed R, got familiar with the environment, worked through some basic math and different types of variables, got introduced to vectors and learnt how to access and manipulate them. Now, it’s time to step up the game and dive deeper into the essentials.

Learn R From Scratch – Part 1

R is an open source programming language with a lot of facilities for problem solving through statistical computing. At the time of writing this, there are more than 6K packages available in the CRAN repository.

Taking Keras to the Zoo

If you follow any of the popular blogs like Google’s research, FastML, Smola’s Adventures in Data Land, or one of the indie-pop ones like Edwin Chen’s blog, you’ve probably also used ModelZoo. Actually, if you’re like our boss, you affectionately call it ‘The Zoo’. (Actually x 2, if you have interesting blogs that you read, feel free to let us know!)

Even Further Beyond One-Hot: Feature Hashing

Feature hashing, or the hashing trick, is a method for turning arbitrary features into a sparse binary vector. It can be extremely efficient because it relies on a standalone hash function that requires no pre-built dictionary of possible categories. A simple implementation that allows the user to pick the desired output dimensionality is to hash the input value into a number, then divide it by the desired output dimensionality and take the remainder, R. With that, you can encode the feature as a vector of zeros with a one in index R.
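
The simple implementation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the original post: the function name `hash_feature`, the parameter `n_dims` and the choice of MD5 as the hash function are all hypothetical, picked here so the example gives the same result on every run.

```python
import hashlib
from typing import List


def hash_feature(value: str, n_dims: int) -> List[int]:
    """Encode one categorical value as a binary vector of length n_dims."""
    if n_dims <= 0:
        raise ValueError("n_dims must be positive")
    # Assumption: MD5 is used only as a stable, well-mixed hash. Python's
    # built-in hash() is salted per process for strings, so it would not
    # give reproducible indices between runs.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    number = int.from_bytes(digest, "big")
    # Divide by the output dimensionality and keep the remainder, R.
    r = number % n_dims
    # Vector of zeros with a single one at index R.
    vector = [0] * n_dims
    vector[r] = 1
    return vector


if __name__ == "__main__":
    for colour in ["red", "green", "blue"]:
        print(colour, hash_feature(colour, n_dims=8))
```

Two different values can land on the same index (a collision). Choosing a larger `n_dims` makes that less likely, at the cost of a longer vector.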

How we fought bad ads in 2015

When ads are good, they connect you to products or services you’re interested in and make it easier to get stuff you want. They also keep a lot of what you love about the web—like news sites or mobile apps—free. But some ads are just plain bad—like ads that carry malware, cover up content you’re trying to see, or promote fake goods. Bad ads can ruin your entire online experience, a problem we take very seriously. That’s why we have a strict set of policies for the kinds of ads businesses can run with Google—and why we’ve invested in sophisticated technology and a global team of 1,000+ people dedicated to fighting bad ads. Last year alone we disabled more than 780 million ads for violating our policies—a number that’s increased over the years thanks to new protections we’ve put in place. If you spent one second looking at each of these ads, it’d take you nearly 25 years to see them all! Here are some of the top areas we focused on in our fight against bad ads in 2015:

Introducing Kaggle Datasets

At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Shiny 0.13.0

Shiny 0.13.0 is now available on CRAN! This release has some of the most exciting features we’ve shipped since the first version of Shiny.

Big data in ranching and animal husbandry

Another big part of the food supply comes from ranches and farms that raise and slaughter various livestock. While ranching is sometimes bundled with agriculture, I discussed farming in Big Data in Agriculture, so we’ll focus on ranching this time around. Somewhat surprising is that big data usage in ranching appears more limited than in farming. That said, there are a number of novel uses of technology and data in animal husbandry.

100 “must read” R-bloggers’ posts for 2015

The site is now 6 years young. It strives to be an (unofficial) online news and tutorials website for the R community, written by over 600 bloggers who agreed to contribute their R articles to the website. In 2015, the site served almost 17.7 million pageviews to readers worldwide.