Which Big Data, Data Mining, and Data Science Tools go together?
We analyze the associations between the top Big Data, Data Mining, and Data Science tools based on the results of the 2015 KDnuggets Software Poll. Download the anonymized data and analyze it yourself.

Why Data Lakes Require Semantics
Data lakes are relatively new and were built largely to address the total cost of ownership (TCO) of data warehouses and the onslaught of big data. Unlike data warehouses, data lakes follow the principle of ‘pre-processing as little of the data as possible beforehand’: all the data is tossed into the lake in its native form, and what is needed is fished out later. In essence, the Extract, Transform, Load (ETL) and integration steps are deferred to the last possible moment, an approach known as late binding.
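As a rough illustration of late binding, the sketch below keeps raw records untouched and parses and converts them only when a specific question is asked. The record layout, the field names (`temp_f`, `site`), and the helper `read_temperatures_celsius` are all hypothetical and are not taken from the article.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (native form).
RAW_LAKE = [
    '{"id": 1, "temp_f": "71.6", "site": "A"}',
    '{"id": 2, "temp_f": "68.0", "site": "B", "note": "recalibrated"}',
    'not valid json',
]


def read_temperatures_celsius(raw_records):
    """Parse and transform only at read time (late binding)."""
    rows = []
    for line in raw_records:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Malformed records stay in the lake; this reader just skips them.
            continue
        if "temp_f" not in record:
            continue
        # The unit conversion is the "transform" step, applied on demand.
        celsius = (float(record["temp_f"]) - 32.0) * 5.0 / 9.0
        rows.append(
            {"id": record["id"], "site": record["site"], "temp_c": round(celsius, 1)}
        )
    return rows


rows = read_temperatures_celsius(RAW_LAKE)
```

A different reader could apply a different schema to the same raw records, which is the flexibility late binding is meant to provide.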

Wanted: A Perfect Scatterplot (with Marginals)
The graph was produced in Python, using the seaborn package. Seaborn calls it a ‘jointplot’; in MATLAB it is apparently called a ‘scatterhist’. The seaborn version also shows the strength of the linear relationship between the x and y variables. Nice.
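To make the ingredients of such a plot concrete, here is a minimal standard-library sketch of the numbers behind a scatterplot with marginals: a histogram for each axis and, assuming the "strength of the linear relationship" is measured by the Pearson correlation coefficient, that coefficient. The function names and sample data are illustrative only, and this is not how seaborn itself is implemented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def histogram_counts(values, bins):
    """Counts per equal-width bin; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r = pearson_r(xs, ys)
x_marginal = histogram_counts(xs, bins=2)
y_marginal = histogram_counts(ys, bins=2)
```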

Machine Learning basics for a newbie
There has been a renewed interest in machine learning in the last few years. This revival seems to be driven by strong fundamentals – loads of data being emitted by sensors across the globe, with cheap storage and the lowest-ever computational costs! However, not everyone understands what machine learning is.

Intraday time series analysis of the #rstats hashtag on Twitter
Twitter is renowned for spawning vibrant communities and discussion of current events. Many services exist to track hashtags for popularity, but less is known about the statistical characteristics of the timelines associated with hashtags. Time series analysis sets the stage for understanding the properties of hashtags as a discrete phenomenon. However, no two hashtags are the same, and this warrants different approaches for different hashtags. In this post, we’ll look at tweets related to R and data science using the query string ‘#rstats,#datascience,#bigdata,#machinelearning,#dataviz,#ml’. In Twitter’s API, this amounts to a disjunction of six search terms, so that a result is returned if any of the terms appears in a tweet. We’ll first collect tweet data for each timeline and transform the JSON-like tree structure into a more analysis-friendly data.frame. Then we’ll use some basic forecasting techniques to predict future activity and assess the accuracy of those predictions. At the end, I’ll pose some questions about the assumptions made in this analysis and how sound this approach is.
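The pipeline described above (flatten nested tweet records into rows, count activity per time bucket, then forecast and score) can be sketched in a few lines. The post itself works in R; this Python sketch uses a made-up, heavily simplified tweet structure and an ISO-style timestamp, and it swaps in a naive "same as last hour" forecast as a stand-in for whatever techniques the post actually uses.

```python
from collections import Counter
from datetime import datetime


def make_tweet(created_at, screen_name, tag):
    """Build a hypothetical, simplified tweet record (not the real API schema)."""
    return {
        "created_at": created_at,
        "user": {"screen_name": screen_name},
        "entities": {"hashtags": [{"text": tag}]},
    }


TWEETS = [
    make_tweet("2015-06-01T09:05:00", "a", "rstats"),
    make_tweet("2015-06-01T09:40:00", "b", "DataViz"),
    make_tweet("2015-06-01T10:15:00", "c", "ml"),
    make_tweet("2015-06-01T11:02:00", "a", "rstats"),
    make_tweet("2015-06-01T11:20:00", "d", "bigdata"),
    make_tweet("2015-06-01T11:55:00", "e", "rstats"),
]


def flatten(tweet):
    """Turn one nested record into a flat row, one column per field."""
    created = datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%S")
    return {
        "hour": created.replace(minute=0, second=0),
        "user": tweet["user"]["screen_name"],
        "hashtags": ",".join(h["text"].lower() for h in tweet["entities"]["hashtags"]),
    }


def hourly_counts(rows):
    """Tweets per hour in time order (hours with no tweets are omitted)."""
    counts = Counter(row["hour"] for row in rows)
    return [counts[hour] for hour in sorted(counts)]


def naive_forecast_mae(series):
    """Mean absolute error when each value is predicted by the previous one."""
    errors = [abs(actual - previous) for previous, actual in zip(series, series[1:])]
    return sum(errors) / len(errors)


rows = [flatten(t) for t in TWEETS]
series = hourly_counts(rows)
mae = naive_forecast_mae(series)
```

The naive forecast is mainly useful as a baseline: a more elaborate model is only worth keeping if it beats this error on held-out data.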

15 Easy Solutions To Your Data Frame Problems In R
R’s data frames regularly create something of a furor on public forums like Stack Overflow and Reddit. Beginning R users often run into problems with data frames, which do not always seem straightforward to work with. But does it really need to be so?