Why Do We Need a SQL-like Query Language for Realtime Streaming Analytics?
I was at O’Reilly Strata last week, and interest in realtime analytics was certainly at its peak.
Realtime analytics comes in two flavours.
1. Realtime Streaming Analytics (static queries, given once, that do not change; they process data as it arrives, without storing it). CEP engines, Apache Storm, Apache Samza, etc., are examples of this.
2. Realtime Interactive/Ad-hoc Analytics (users issue ad-hoc dynamic queries and the system responds). Druid, SAP HANA, VoltDB, MemSQL, and Apache Drill are examples of this.
In this post, I am focusing on realtime streaming analytics. (Ad-hoc analytics already uses a SQL-like query language anyway.)
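The appeal of a SQL-like language for streaming is easiest to see with a sketch. The continuous query below is hypothetical, written in the CQL/CEP style that engines in this space commonly use; the stream name and fields are invented for illustration:

```sql
-- Hypothetical CEP-style continuous query: registered once, it runs
-- forever over the stream, emitting the average price per symbol
-- computed over a sliding one-minute window.
SELECT symbol, AVG(price) AS avg_price
FROM StockStream [RANGE 1 MINUTE]
GROUP BY symbol;
```

Compare this to hand-writing the equivalent windowing and aggregation logic as Storm bolts or Samza tasks: the declarative form is a few lines, and the engine handles the window bookkeeping.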

Deep Learning, The Curse of Dimensionality, and Autoencoders
Autoencoders are an exciting new approach to unsupervised learning, and for many machine learning tasks they have already surpassed the decades of progress made by researchers hand-picking features.
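The core idea is small enough to sketch. Below is a minimal single-hidden-layer autoencoder in NumPy, not taken from the post: it compresses 5-dimensional toy data through a 2-unit bottleneck and learns to reconstruct it by gradient descent. The data, sizes, and learning rate are illustrative assumptions.

```python
# Minimal autoencoder sketch: encode -> bottleneck -> decode, trained to
# minimise reconstruction error with plain gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 5-D that lie near a 2-D subspace (plus noise),
# so a 2-unit bottleneck can reconstruct them well.
latent = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 5))
X = latent @ mix + 0.01 * rng.normal(size=(200, 5))

n_in, n_hidden = 5, 2
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))  # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))  # decoder weights
b2 = np.zeros(n_in)

def forward(X):
    H = np.tanh(X @ W1 + b1)   # encoding: the learned features
    Xhat = H @ W2 + b2         # reconstruction of the input
    return H, Xhat

_, Xhat = forward(X)
initial_loss = np.mean((Xhat - X) ** 2)

lr = 0.05
for _ in range(500):
    H, Xhat = forward(X)
    err = (Xhat - X) / len(X)          # d(MSE)/d(Xhat), up to a constant
    gW2 = H.T @ err                    # decoder gradients
    gb2 = err.sum(axis=0)
    dH = err @ W2.T * (1 - H ** 2)     # backprop through tanh
    gW1 = X.T @ dH                     # encoder gradients
    gb1 = dH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, Xhat = forward(X)
final_loss = np.mean((Xhat - X) ** 2)
```

The point of the exercise: nobody hand-picked the two features in `H`; the network discovered a compact representation on its own, which is what makes autoencoders interesting as learned replacements for hand-engineered features.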

Turning Ph.D.s into industrial data scientists and data engineers
I recently sat down with Angie Ma, co-founder and president of ASI, a London startup that runs a carefully structured “finishing school” for science and engineering doctorates. We talked about how Angie and her co-founders (all ex-physicists) arrived at the concept of the ASI, the structure of their training programs, and the data and startup scene in the UK.

Deep Learning, NLP, and Representations
In the last few years, deep neural networks have dominated pattern recognition. They blew the previous state of the art out of the water for many computer vision tasks. Voice recognition is also moving that way. But despite the results, we have to wonder… why do they work so well? This post reviews some remarkable results in applying deep neural networks to natural language processing (NLP). In doing so, I hope to make accessible one promising answer as to why deep neural networks work. I think it’s a very elegant perspective.

Simple template for scientific manuscripts in R Markdown
The good reasons to write scientific reports and manuscripts in LaTeX or Markdown are improved document integrity (always), simplicity (not always), and reproducibility (always). I prefer the lightweight Markdown over the rich but more complex LaTeX — I think that lightweight is good for reproducibility. I am also in love with knitr. Hence, I’ve made a really simple template for the classical manuscript format in R Markdown and knitr.
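For readers unfamiliar with the format, an R Markdown manuscript is driven by a YAML front matter block. The fragment below is a generic sketch of what such a template's header typically looks like, not the template's actual contents; the field values are illustrative:

```yaml
---
title: "Manuscript title"
author: "Author One, Author Two"
output:
  pdf_document:
    keep_tex: true    # keep the intermediate .tex for journal submission
bibliography: references.bib
---
```

The body then mixes prose with knitr code chunks, so figures and statistics are regenerated from data on every render — which is where the reproducibility benefit comes from.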

23 Great Schools with Master’s Programs in Data Science
Looking to freshen your résumé and improve your earning potential? You are in exactly the right place at exactly the right time. A 2011 McKinsey report estimates there will be 140,000 to 190,000 unfilled U.S. data-analytics positions by 2018. In response, universities are scrambling to improve their existing degree programs and create entirely new offerings. We’ve listed 23 of these programs in alphabetical order. Many of these choices can also be found in the helpful list compiled by Information Week and Data Informed’s Map of University Programs.

What is the SAP Automated Predictive Library (APL) for SAP HANA?
Predictive Analytics 2.0 has so many new things in it that my original article (Introducing SAP Predictive Analytics 2.0!) was not able to cover the SAP Automated Predictive Library (APL) in detail — a major milestone in our efforts to integrate and embed our advanced analytics services everywhere and into everything. The SAP APL is a native C++ implementation of the automated predictive capabilities of SAP InfiniteInsight (KXEN), running directly in SAP HANA. Now, for the first time, you can run our patented automated predictive algorithms on your data stored in SAP HANA without first requiring an expensive and time-consuming data extraction process. This also opens up an entirely new area of use cases, such as on-the-fly, in-database scoring for prediction, classification, and clustering scenarios.

How to Make a Histogram with ggplot2
In our previous post, you learned how to make histograms with the hist() function. You can also make a histogram with ggplot2, “a plotting system for R, based on the grammar of graphics”. This post will focus on making a histogram with ggplot2.

A Compendium of Clean Graphs in R
Every data analyst knows that a good graph is worth a thousand words, and perhaps a hundred tables. But how should one create a good, clean graph? In R, this task is anything but easy. Many users find it almost impossible to resist the siren song of adding grid lines, including grey backgrounds, using elaborate color schemes, and applying default font sizes that make the text much too small in relation to the graphical elements. As a result, many R graphs are an aesthetic disaster; they are difficult to parse and unfit for publication.

Big Data Processing in Spark
Overall, Apache Spark provides a powerful framework for big data processing. Thanks to its caching mechanism, which holds previous computation results in memory, Spark significantly outperforms Hadoop because it does not need to persist all the data to disk for each round of parallel processing. Although it is still very new, I think Spark will take off as the mainstream approach to processing big data.