A Very Short History Of Data Science

The story of how data scientists became sexy is mostly the story of the coupling of the mature discipline of statistics with a very young one: computer science. The term “Data Science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists and others for years. The following timeline traces the evolution of the term “Data Science” and its use, attempts to define it, and related terms.


SQL commands for Commonly Used Excel Operations

I’ve spent more than a decade working in Excel, yet there is still so much to learn. If you dislike coding, Excel can be your entry point into the data science world (to some extent). Once you understand Excel operations, learning SQL is very easy.
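
To make the mapping concrete, here is a minimal sketch in R (my own illustration, not code from the post) using the sqldf package, which runs SQL directly against data frames; the sales data frame and its columns are hypothetical:

```r
# Illustration only: common Excel operations expressed as SQL, run on an
# R data frame via sqldf. The 'sales' data and its columns are hypothetical.
library(sqldf)

sales <- data.frame(
  region  = c("East", "West", "East", "South"),
  product = c("A", "B", "A", "C"),
  revenue = c(100, 250, 175, 90)
)

# Excel AutoFilter: keep rows where revenue > 100
sqldf("SELECT * FROM sales WHERE revenue > 100")

# Excel SUMIF / pivot table: total revenue by region
sqldf("SELECT region, SUM(revenue) AS total FROM sales GROUP BY region")

# Excel sort: order by revenue, descending
sqldf("SELECT * FROM sales ORDER BY revenue DESC")
```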


Pitfalls for ML on Recurring Data

Many data science resources exist for the analysis of static datasets, but in the real world, solutions and systems often need to work in an automated fashion on recurring data. This presents a unique challenge: when the system is being designed and built, you know that you don’t have all of the data. You will, by design, be getting new data every day, every week, or at whatever cadence the feed runs. It turns out this new and unseen data is often pretty messy. In this document, I aim to outline some of the common challenges associated with building recurring data/ML systems, both in terms of the data itself and how it is sent and received.
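
One defence this framing suggests is validating every new batch before it reaches the model. Below is a minimal sketch of such a check in R (my own illustration, not code from the document; the expected columns and NA threshold are hypothetical):

```r
# Minimal sanity check for a recurring data feed, run before scoring.
# Illustration only; the expected schema and NA threshold are hypothetical.
validate_batch <- function(batch, expected_cols, max_na_rate = 0.05) {
  missing_cols <- setdiff(expected_cols, names(batch))
  if (length(missing_cols) > 0) {
    stop("Batch is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  na_rates <- sapply(batch[expected_cols], function(x) mean(is.na(x)))
  if (any(na_rates > max_na_rate)) {
    warning("High NA rate in: ",
            paste(names(na_rates)[na_rates > max_na_rate], collapse = ", "))
  }
  invisible(batch)
}

# Usage: run on each new delivery before it is scored
# validate_batch(new_data, expected_cols = c("id", "date", "amount"))
```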


Visualizing Data

Visualizing data is a key function of R programming. Visualizations include histograms, density plots, box plots, pie charts, heatmaps, and word clouds. We will show how to use the ggplot2 package in most of the tutorials.
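
For example, a basic histogram and a grouped density plot with ggplot2 look like this (a minimal sketch using the iris dataset shipped with R):

```r
library(ggplot2)

# Histogram of sepal length from the built-in iris dataset
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.3, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Sepal Length",
       x = "Sepal length (cm)", y = "Count")

# Density plot with one curve per species
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.4) +
  labs(title = "Sepal Length Density by Species")
```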


Evaluation of time series forecasting using Spark windowing

Evaluation metrics play a critical role in the machine learning ecosystem. For machine learning products especially, evaluation metrics are like a heartbeat: they show how healthy the model is and how well it is performing in real life, and they are often the only numbers that decision makers care about. The definition and implementation of an evaluation metric depend heavily on the application and change from one data product to another. In this post, I aim to introduce Mean Directional Accuracy (MDA) and show how we can calculate it in Spark.
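
For reference, MDA measures how often the forecast gets the direction of change right relative to the previous actual value. The post computes it with Spark windowing (lag over a time-ordered window); a plain-R sketch of the same metric, with made-up numbers, looks like this:

```r
# Mean Directional Accuracy: the fraction of time steps where the forecast
# moves in the same direction (up or down) as the actual series, relative
# to the previous actual value. Plain-R illustration of the metric; the
# post itself computes it with Spark window functions.
mda <- function(actual, forecast) {
  actual_dir   <- sign(diff(actual))
  forecast_dir <- sign(forecast[-1] - actual[-length(actual)])
  mean(actual_dir == forecast_dir)
}

# Hypothetical series
actual   <- c(100, 102, 101, 105, 107)
forecast <- c( 99, 103, 102, 104, 108)
mda(actual, forecast)  # 0.75: direction correct at 3 of 4 steps
```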


International Institute of Analytics (IIA) 5 Predictions and 5 Priorities for 2016

5 Predictions
1. Cognitive technology becomes the follow-on to automated analytics
2. Analytical Microservices facilitate embedded analytics
3. Data Science and predictive/prescriptive analytics become one and the same
4. The analytics talent crunch eases as many new university programs come online
5. Analytics are focused on data curation and management

5 Analytics Priorities
Priority 1: Align analytics and business strategies, and use analytics in strategy development.
Priority 2: Leverage existing analytics strengths broadly across the enterprise.
Priority 3: Improve discipline in analytics project intake, definition, and prioritization.
Priority 4: Retain talent with career and leadership development programs specific to data scientists.
Priority 5: Measure the comprehensive value of analytics to establish undeniable relevancy.


Analyzing “Twitter faces” in R with Microsoft Project Oxford

In this blog post I will briefly describe some of Microsoft’s Project Oxford APIs.


The Generalized Method of Moments and the gmm Package

An almost-as-famous alternative to the famous Maximum Likelihood Estimation is the Method of Moments. MM has always been a favorite of mine because it often requires fewer distributional assumptions than MLE, and also because MM is much easier to explain than MLE to students and consulting clients. CRAN has a package, gmm, that does MM (actually the Generalized Method of Moments), and in this post I’ll explain how to use it (on the elementary level, at least).
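
As a small taste of the package (my own minimal sketch, not the post’s example), here is GMM estimation of a sample’s mean and variance from two moment conditions:

```r
library(gmm)

# Moment conditions for a (mean, variance) model:
# E[x - mu] = 0 and E[(x - mu)^2 - sigma2] = 0
g <- function(theta, x) {
  mu     <- theta[1]
  sigma2 <- theta[2]
  cbind(x - mu, (x - mu)^2 - sigma2)
}

set.seed(42)
x <- rnorm(500, mean = 3, sd = 2)

# Estimate both parameters, starting from the initial guess t0
fit <- gmm(g, x, t0 = c(mu = 0, sigma2 = 1))
summary(fit)
```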


Recording and Replaying the Graphics Engine Display List

In the development version of R (to become R 3.3.0), it is now possible (again) to record the graphics engine display list in one R session and replay it in another R session. The current situation (in R 3.2.2) is demonstrated below. If we record a plot with recordPlot(), save it to disk with saveRDS(), then quit R …
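
The workflow in question, sketched from the functions the post names, looks like this:

```r
# First R session: draw a plot, record the display list, save it to disk
plot(1:10, main = "A recorded plot")
p <- recordPlot()
saveRDS(p, "recorded-plot.rds")

# A *new* R session: load the recorded display list and replay it.
# Per the post, this fails in R 3.2.2 but works again in the development
# version that will become R 3.3.0.
p <- readRDS("recorded-plot.rds")
replayPlot(p)
```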


Prediction Intervals for Poisson Regression

Unlike the confidence interval, which addresses the uncertainty about the conditional mean, the prediction interval also accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to the aforementioned uncertainties.
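
One common way to construct such an interval for a Poisson GLM is by simulation: propagate the uncertainty in the fitted linear predictor, then layer Poisson sampling noise on top. A minimal sketch (my own illustration on simulated data, not the post’s code):

```r
set.seed(1)
# Simulated data for illustration
x   <- runif(200, 0, 2)
y   <- rpois(200, lambda = exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson)

new_x <- data.frame(x = 1.5)

# Step 1: uncertainty about the conditional mean (the confidence interval's job)
pred    <- predict(fit, newdata = new_x, type = "link", se.fit = TRUE)
eta_sim <- rnorm(10000, mean = pred$fit, sd = pred$se.fit)

# Step 2: add Poisson sampling noise around each simulated mean
y_sim <- rpois(10000, lambda = exp(eta_sim))

# 95% prediction interval: wider than the confidence interval for the mean
quantile(y_sim, c(0.025, 0.975))
```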


Has there been a ‘pause’ in global warming?

As I discussed in my previous post, records of global temperatures over the last few decades figure prominently in the debate over the climate effects of CO2 emitted by burning fossil fuels. I am interested in what this data says about which of the reasonable positions in this debate is more likely to be true: the ‘warmer’ position, that CO2 from burning of fossil fuels results in a global increase in temperatures large enough to have quite substantial (though not absolutely catastrophic) harmful effects on humans and the environment, or the ‘lukewarmer’ position, that CO2 has some warming effect, but this effect is not large enough to be a major cause for worry, and does not warrant imposition of costly policies aimed at reducing fossil fuel consumption.


Integrating Python and R Part III: An Extended Example

In this post I will be sharing a longer example using these approaches, from an analysis we carried out at Mango as a proof of concept to cluster news articles. The pipeline involved the use of both R and Python at different stages, with a Python script being called from R to fetch the data, and the exploratory analysis being conducted in R.
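
Calling a Python script from R can be as simple as shelling out with system2() and reading the result back in; a hedged sketch (the script and file names here are hypothetical, and the post may use a different mechanism):

```r
# Run a Python data-fetching script from R, then load its CSV output.
# The script name and output path here are hypothetical.
system2("python", args = "fetch_articles.py")

articles <- read.csv("articles.csv", stringsAsFactors = FALSE)
str(articles)
```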


IBM DataScientistWorkBench = OpenRefine + RStudio + Jupyter Notebooks in the Cloud, Via Your Browser

One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongoDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.


R is Not So Hard! A Tutorial, Part 20: Useful Commands for Exploring Data

Sometimes when you’re learning a new stat software package, the most frustrating part is not knowing how to do very basic things. This is especially frustrating if you already know how to do them in some other software. Let’s look at some basic but very useful commands that are available in R.
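
For instance, a first pass over any new data frame usually involves commands like these (shown on R’s built-in mtcars dataset):

```r
# A quick first look at a data frame, using the built-in mtcars dataset
dim(mtcars)        # number of rows and columns
head(mtcars)       # first six rows
str(mtcars)        # structure: column types and example values
summary(mtcars)    # five-number summaries plus means
names(mtcars)      # column names
table(mtcars$cyl)  # frequency table of a single variable
```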