K Means Clustering in R

Hello everyone, hope you had a wonderful Christmas! In this post I will show you how to do k-means clustering in R. We will use the iris dataset from the datasets library.
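
The excerpt stops before any code, so here is a rough sketch of what k-means does under the hood. It is written in Python rather than R, and it uses the plain Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points), which is not the default algorithm in R's kmeans() (that one is Hartigan-Wong). The function names, the hand-picked starting centers and the six toy points are all made up for illustration; they are not the iris data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iterations=100):
    """Plain Lloyd iteration; returns (labels, centers).

    initial_centers is supplied by the caller so the run is deterministic.
    This is an illustrative choice: real implementations usually pick
    random starting centers and repeat the run several times.
    """
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centers


# Toy data (hypothetical): two well-separated groups in two dimensions.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0),
          (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)
print(centers)
```

With real data you would normally scale the columns first and try several random starts, since the result depends on where the centers begin.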

Bounds for the Pearson Correlation Coefficient

The correlation measure that students typically first encounter is actually Pearson’s product-moment correlation coefficient. This coefficient is simply a standardized version of the covariance between two random variables (say, X and Y): …
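
To make the "standardized covariance" reading concrete, here is a minimal Python sketch that computes the sample coefficient as the covariance divided by the product of the two standard deviations. The function name and the sample numbers are made up for illustration and do not come from the post. An exactly linear relationship reaches the bounds of 1 or -1, and anything else lands strictly in between.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and variances, all with the same n - 1 denominator
    # (the denominators cancel in the ratio, but this mirrors the definition).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: two exactly linear relationships and one noisy one.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0 * v + 1.0 for v in x]     # increasing line
y_down = [-3.0 * v for v in x]        # decreasing line
y_noisy = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly linear, with noise

print(pearson_r(x, y_up))
print(pearson_r(x, y_down))
print(pearson_r(x, y_noisy))
```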

Year in Review: Best of Analytics Vidhya from 2015

People say that 90% of startups fail by their second year! I would like to thank you all: we have not only made it into the remaining 10% of startups, but have come out with flying colors! I still remember my last day at my job, when my friends at work were curious: “How big a market could data scientists and business analysts be?” As a first-time entrepreneur with a 6-month-old daughter, I felt scared that I did not know the answer. I was leaving my cushy job to try out something no one had tried before. All I knew was the glaring knowledge gap I wanted to address and how passionately I felt about it. Thankfully, it all worked out. Today, we are one of the largest and fastest-growing data science communities in the world. Our traffic is 5x what it was when 2015 started, and it is still growing at a healthy pace. We started the year with the launch of our discussion portals, and added different forms of content such as infographics / cheat sheets, a salary test and a resource finder. In the second half of the year, we also launched our hackathon platform. Through 2015, our community got bigger and bigger. We felt a huge shift in workload. But this was pleasing. We wrote our hearts out to provide the best possible knowledge in the subject matter, and we hope you enjoyed it. Here are some of the best snippets of content created by our community in 2015. Read them, put the knowledge to the test and stay warm as the year comes to an end.

When Regularization Fails

Another Holiday Blog

A Visualization of World Cuisines

In a previous post, we had ‘mapped’ the culinary diversity in India through a visualization of food consumption patterns. Since then, one of the topics in my to-do list was a visualization of world cuisines. The primary question was similar to that asked of the Indian cuisine: Are cuisines of geographically and culturally closer regions also similar? I recently came across an article on the analysis of recipe ingredients that distinguish the cuisines of the world. The analysis was conducted on a publicly available dataset consisting of ingredients for more than 13,000 recipes from the recipe website Epicurious. Each recipe was also tagged with the cuisine it belonged to, and there were a total of 26 different cuisines. This dataset was initially reported in an analysis of flavor network and principles of food pairing.

R-Markdown and Knitr Tutorial (part 2)

In our last post we quickly went over how to create a new R-Markdown document and embed a Plotly graph in it. In this post we’ll get into more details around how to control code output using chunk options. As shown in our previous post, for the embedded R code to be evaluated, it’ll need to be written inside a code-chunk as shown below.

R-Markdown and Knitr Tutorial (part 1)

R-Markdown is a great way to create dynamic documents with embedded chunks of R code. The document is self-contained and fully reproducible, which makes it very easy to share. This post will be the first in a multi-part series on how to embed Plotly graphs in R-Markdown documents as well as presentations. R-Markdown is a flavor of markdown which allows R users to embed R code into a markdown document. The document is then ‘knit’ using knitr to create an HTML file.


A simple set of wrappers around gitpython for creating pandas dataframes out of git data. The project is centered around two primary objects:
A Repository object contains a single git repo, and is used to interact with it. A ProjectDirectory references a directory in your filesystem which may contain multiple git repositories. The subdirectories are all walked to find any child repos, and any analysis is aggregated up from all of those into a single output (a pandas dataframe).

Weekly Roundup – Dec. 25 2015

This year’s last weekly roundup! Check all the additions from PocketCluster Index! The Apache foundation recently announced Reef as an incubating project. Reef aims to dramatically simplify resource management for big data systems.

Getting Started With Apache Zeppelin and Airbnb Visuals

I’ve been playing around with Apache Zeppelin for a few months now (at first it was less playing than frustration at getting it working). After using it consistently for a bit, I find it incredibly useful for data visualization and business intelligence purposes. Apache Zeppelin is self-described as “a web-based notebook that enables interactive data analytics”. Imagine it as an IPython notebook for interactive visualizations, but supporting more languages than just Python to munge your data for visualization. Ultimately, after getting PySpark working on it, I find it incredibly useful for displaying business data and analytics. Right now it only has a few graph options: bar graphs, line graphs, pie charts, and scatter plots. Currently it’s also in incubation mode at Apache and open-sourced!