Machine Learning is all about Common Sense.

Machines are invented to solve problems and make life easier. It is widely accepted that common sense is a sense that is not so common 🙂 So, don’t you think this problem should also be addressed? If machines are supposed to solve our problems, is there any machine that can solve this one? Or could a machine ever solve it at all? I would say that machine learning is all about solving this issue. What is common sense? According to Wikipedia, ‘Common sense is a basic ability to perceive, understand, and judge things, which is shared by (“common to”) nearly all people and can reasonably be expected of nearly all people without any need for debate.’


10 Great Healthcare Data Sets

Healthcare will be one of the biggest beneficiaries of big data & analytics. Here are 10 great data sets to start playing around with as you sharpen your healthcare data analytics chops.


Hierarchical Clustering in R

Hello everyone! In this post, I will show you how to do hierarchical clustering in R. We will use the iris dataset again, as we did for K-means clustering.
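As a quick taste, here is a minimal sketch of the idea in base R (a generic illustration, not the post’s own code):

```r
# Hierarchical clustering of the iris measurements with base R.
d <- dist(iris[, 1:4])                # Euclidean distances; drop the Species label
hc <- hclust(d, method = "complete")  # complete-linkage agglomerative clustering
plot(hc, labels = FALSE, main = "Hierarchical clustering of iris")
clusters <- cutree(hc, k = 3)         # cut the dendrogram into 3 clusters
table(clusters, iris$Species)         # compare clusters against the true species
```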


Deep Learning Glossary

Deep Learning terminology can be quite overwhelming to newcomers. This glossary tries to define commonly used terms and link to original references and additional resources to help readers dive deeper into a specific topic. The boundary between what is Deep Learning vs. “general” Machine Learning terminology is quite fuzzy. I am trying to keep the glossary specific to Deep Learning, but these decisions are somewhat arbitrary. For example, I am not including “cross-validation” here because it’s a generic technique used all across Machine Learning. However, I’ve decided to include terms such as softmax or word2vec because they are often associated with Deep Learning even though they are not Deep Learning techniques.
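As an example of one such borderline term, softmax turns a vector of raw scores into a probability distribution, and fits in a few lines of R (a generic illustration, not taken from the glossary itself):

```r
# Softmax: map real-valued scores to probabilities that sum to 1.
# Subtracting max(x) before exponentiating is a standard numerical-stability trick.
softmax <- function(x) {
  e <- exp(x - max(x))
  e / sum(e)
}
softmax(c(1, 2, 3))  # approx. 0.090 0.245 0.665
```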


Rossmann Store Sales, Winner’s Interview: 3rd place, Neokami Inc.

Rossmann operates over 3,000 drug stores in 7 European countries. In their first Kaggle competition, Rossmann Store Sales, this drug store giant challenged Kagglers to forecast 6 weeks of daily sales for 1,115 stores located across Germany. The competition attracted 3,738 data scientists, making it our second most popular competition by participants ever. Cheng Guo competed as team Neokami Inc. and took third place using a method, ‘entity embedding’, that he developed during the course of the competition. In this blog, he shares more about entity embedding, why he chose to use neural networks (instead of the popular xgboost), and how a simplified version of his model still manages to perform quite well.


The ‘rsvg’ Package: High Quality Image Rendering in R

The advantage of storing your plots in svg format is that they can be rendered later into an arbitrary resolution and format without loss of quality! Each rendering function takes a width and height parameter. When neither width nor height is set, the bitmap resolution matches that of the input svg. When only one of width or height is specified, the image is scaled proportionally. When both width and height are specified, the image is stretched into the requested size.
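A short sketch of those three cases with the rsvg package; the input file "plot.svg" is a placeholder:

```r
# Render an svg to png at various sizes with the rsvg package.
library(rsvg)
rsvg_png("plot.svg", "native.png")               # neither set: native svg resolution
rsvg_png("plot.svg", "wide.png", width = 1200)   # one set: scaled proportionally
rsvg_png("plot.svg", "box.png", width = 1200, height = 400)  # both set: stretched
```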


Using webp in R: A New Format for Lossless and Lossy Image Compression

A while ago I blogged about brotli, a new general purpose compression algorithm promoted by Google as an alternative to gzip. The same company also happens to be working on a new format for images called webp, which is actually a derivative of the VP8 video format. Google claims webp provides superior compression for both lossless (png) and lossy (jpeg) bitmaps, and even though the format is currently only supported in Google Chrome, it does indeed seem promising. The webp R package allows for reading and writing webp bitmap arrays, so that we can convert between webp and other bitmap formats. For example, let’s take this photo of a delicious and nutritious feelgoodbyfood spelt-pancake with coconut sprinkles and homemade espresso (see here for 7 other healthy winter breakfasts!)
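A minimal round-trip between png and webp with the webp and png packages might look like this (file names are placeholders; the post works with its own photo):

```r
# Convert a png to webp and back, via in-memory bitmap arrays.
library(webp)
library(png)
img <- readPNG("pancake.png")                  # array: height x width x channels
write_webp(img, "pancake.webp", quality = 80)  # lossy webp encoding
img2 <- read_webp("pancake.webp")
writePNG(img2, "pancake_back.png")             # round-trip back to png
```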


Intro to Text Analysis with R

One of the most powerful aspects of using R is that you can download free packages for so many tools and types of analysis. Text analysis is still somewhat in its infancy, but it is very promising. It is estimated that as much as 80% of the world’s data is unstructured, while most types of analysis only work with structured data. In this paper, we will explore the potential of R packages to analyze unstructured text. R provides two packages for working with unstructured text: tm and sentiment. tm can be installed in the usual way. Unfortunately, sentiment was archived in 2012 and is therefore more difficult to install. However, it can still be installed using the following method, according to Frank Wang (Wang).
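The method boils down to installing the archived tarballs (and the Rstem dependency) from source. The sketch below follows that pattern; the exact archive URLs and version numbers are assumptions based on the usual CRAN archive layout, so verify them before running:

```r
# tm installs from CRAN in the usual way.
install.packages("tm")

# sentiment was archived, so install its dependency Rstem and then
# sentiment itself from source tarballs in the CRAN archive.
# NOTE: the paths and versions below are assumptions; check
# https://cran.r-project.org/src/contrib/Archive/ for the current ones.
install.packages(
  "https://cran.r-project.org/src/contrib/Archive/Rstem/Rstem_0.4-1.tar.gz",
  repos = NULL, type = "source"
)
install.packages(
  "https://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz",
  repos = NULL, type = "source"
)
```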


Some Comments on Donoho’s “50 Years of Data Science”

An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician but also a statistical philosopher. The paper should be required reading in all Stat and CS departments. But as a CS person with deep roots in statistics, I believe there are a few points Donoho should have developed more, which I will discuss here, along with other points on which his essay really shines.


So you want to be a Data Science superstar

Big house? Five cars? There’s no one universal way to do it, but get a coffee and read on through this bumper post to find your own way with the advice of real experts. Last summer, Mrs G and I were in that ridiculously long line for the cable car in San Francisco, like predictable British tourists, and got talking to the guys next to us. One of them, Jason Jackson, was just about to start studies in business, including a good dose of quantitative research and data analysis, and we’ve stayed in touch on Twitter since. Recently, he asked me what the single best resource is for getting started in data science, and I found this a surprisingly tough question.


Getting Started with Markov Chains: Part 2

In a previous post, I showed how some elementary properties of discrete-time Markov chains could be calculated, mostly with functions from the markovchain package. In this post, I would like to show a little more of the functionality available in that package by fitting a Markov chain to some data. In the first block of code, I load the gold data set from the forecast package, which contains daily morning gold prices in US dollars from January 1, 1985 through March 31, 1989. Next, since there are a few missing values in the sequence, I impute them with a simple ‘ad hoc’ process: substituting the previous day’s price for one that is missing. There are two statements in the loop because there are a number of instances where two missing values occur in a row. Note that some kind of imputation is necessary because I will want to compute the autocorrelation of the series, and, like many R functions, acf() does not like NAs (it doesn’t make sense to compute autocorrelations with NAs).
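A sketch of those steps follows. Variable names are assumptions, and where the post uses two statements in its loop, the sketch below relies on a single in-place forward fill, which also covers consecutive NAs:

```r
# Load the daily gold prices and forward-fill missing values.
library(forecast)     # provides `gold`: daily morning gold prices, 1985-1989
library(markovchain)
data(gold)
prices <- as.numeric(gold)

# Substitute the previous day's price for each NA. Because the fill is
# applied in place while scanning forward, runs of NAs are handled too.
for (i in 2:length(prices)) {
  if (is.na(prices[i])) prices[i] <- prices[i - 1]
}

acf(prices)  # autocorrelation of the imputed series; would fail on NAs

# Discretize daily moves and fit a Markov chain by maximum likelihood.
moves <- ifelse(diff(prices) > 0, "up", "down")  # flat days counted as "down"
fit <- markovchainFit(data = moves)
fit$estimate  # estimated transition matrix
```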


Microsoft releases CNTK, its open source deep learning toolkit, on GitHub

Microsoft is making the tools that its own researchers use to speed up advances in artificial intelligence available to a broader group of developers by releasing its Computational Network Toolkit on GitHub. The researchers developed the open-source toolkit, dubbed CNTK, out of necessity. Xuedong Huang, Microsoft’s chief speech scientist, said he and his team were anxious to make faster improvements to how well computers can understand speech, and the tools they had to work with were slowing them down.


Running R jobs quickly on many machines

As we demonstrated in “A gentle introduction to parallel computing in R,” one of the great things about R is how easy it is to take advantage of parallel processing capabilities to speed up calculation. In this note we will show how to move from running jobs on multiple CPUs/cores to running jobs on multiple machines (for even larger scaling and greater speedup). Using the technique on Amazon EC2 even turns your credit card into a supercomputer.
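A minimal sketch of that progression with base R’s parallel package (host names are hypothetical; each remote machine needs R installed and passwordless ssh access):

```r
library(parallel)

# Step 1: parallelism on one machine, one worker per core.
cl <- makeCluster(detectCores())
parLapply(cl, 1:8, function(x) x^2)
stopCluster(cl)

# Step 2: the same pattern across machines. "worker1" and "worker2" are
# placeholder host names; two R workers are started on each over ssh.
cl <- makePSOCKcluster(c("worker1", "worker1", "worker2", "worker2"))
parLapply(cl, 1:8, function(x) x^2)
stopCluster(cl)
```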