Data Science of IoT: Sensor fusion and Kalman filters, Part 2

The second part of this tutorial examines the use of Kalman filters to determine context in IoT systems. Kalman filters combine uncertain measurements from multiple sensors into an accurate, dynamically updated picture of the physical world.
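To make the idea of combining uncertain measurements concrete, here is a minimal sketch of the scalar Kalman measurement update applied to two sensors observing the same quantity. It is not taken from the tutorial: the function names, the readings, and the noise variances are all hypothetical, and the state is assumed to be static, so there is no prediction step.

```python
def kalman_update(estimate, variance, measurement, measurement_variance):
    """One scalar Kalman measurement update."""
    # The gain weights the new measurement by how uncertain the current estimate is.
    gain = variance / (variance + measurement_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance


def fuse(readings, prior_estimate=0.0, prior_variance=1e6):
    """Fold a sequence of (value, noise_variance) readings into one estimate."""
    estimate, variance = prior_estimate, prior_variance
    for value, noise_variance in readings:
        estimate, variance = kalman_update(estimate, variance, value, noise_variance)
    return estimate, variance


# Hypothetical readings: (temperature in degrees C, sensor noise variance).
readings = [(20.0, 4.0), (22.0, 1.0)]
estimate, variance = fuse(readings)
print(f"fused estimate: {estimate:.2f}, variance: {variance:.2f}")
```

The fused estimate lands closer to the less noisy sensor, and the fused variance is smaller than that of either sensor alone, which is the sense in which fusion reduces uncertainty.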


Recidivism and single-case probabilities

I am collaborating with a criminologist who studies recidivism. In the context of crime statistics, a recidivist is a convicted criminal who commits a new crime post conviction. Statistical studies of recidivism are used in parole hearings to assess the risks of releasing a prisoner. This application of statistics raises questions that go to the foundations of probability theory.


10 Must Watch Movies on Data Science and Machine Learning

Some members of our team (including me) live by just two passions in life: Data Science and Movies! For us, slicing and dicing movies over Monday morning coffee is part of the warm-up ritual. So we decided to run a poll among ourselves on the best movies related to data science and machine learning. We also decided to release the results in the form of an infographic.


7 Free Machine Learning Courses


Customer One View In The Times Of Big Data

There is no doubt that as an enterprise’s customer base and product portfolio grow, it becomes more complicated to manage multiple relationships with the customer across products, channels and geographies. Most organizations take a simple approach to this challenge: Customer Single View. Customer single view, as a concept, has been around for many years. However, it has been more of a static endeavor, with updates made once a fortnight or at best once a week. Adding new channels or products is a project in itself that leaves users frustrated and discontented. The lack of real-time updates and the multiple views across different channels add to the chaos. Moreover, the customer single view may or may not be driven by any KPIs.


600 websites about R

Anyone interested in categorizing them? It could be an interesting data science project: scraping these websites, extracting keywords, and categorizing them with a simple indexing or tagging algorithm. For instance, some of these blogs cover stats, Bayesian stats, R libraries, R training, visualization, or other topics.
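As a rough sketch of what a simple tagging algorithm could look like, the snippet below assigns categories to a piece of text by counting keyword hits. The categories, keyword sets, and sample text are hypothetical, chosen only for illustration, and fetching the pages themselves is left out.

```python
import re
from collections import Counter

# Hypothetical categories and keywords; a real project would derive these from the data.
CATEGORY_KEYWORDS = {
    "bayesian": {"bayesian", "prior", "posterior", "mcmc"},
    "visualization": {"plot", "chart", "ggplot2", "visualization"},
    "training": {"course", "tutorial", "training", "workshop"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag(text, min_hits=1):
    """Return the sorted categories whose keywords occur at least min_hits times."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    return sorted(category for category, score in scores.items() if score >= min_hits)


sample = "A tutorial on Bayesian posterior plots with ggplot2"
tags = tag(sample)
print(tags)
```

Exact token matching is the simplest choice here; it misses variants such as "plots" for "plot", so stemming would be a natural next step.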


MCMC sampling for dummies

When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treating it as a black box essentially. The beauty of probabilistic programming is that you actually don’t have to understand how the inference works in order to build models, but it certainly helps.
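For readers who want a peek inside the black box, here is a minimal sketch of one common MCMC scheme, a random-walk Metropolis sampler. It is an illustration under stated assumptions, not necessarily the approach the post takes: the target is a standard normal, and the step size, seed, and sample count are arbitrary.

```python
import math
import random


def log_target(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


def metropolis(n_samples, step=1.0, start=0.0, seed=0):
    """Random-walk Metropolis sampler for the density defined by log_target."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        # A rejected proposal repeats the current value in the chain.
        samples.append(x)
    return samples


samples = metropolis(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean: {mean:.3f}, sample variance: {var:.3f}")
```

The sampler only ever needs the ratio of target densities, which is why the normalizing constant can be dropped from `log_target`.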


Time Series Analysis: Building a Model on Non-stationary Time Series

In this post I will give a brief introduction to time series analysis and its applications. We will be using the R package astsa, which was developed by Professor David Stoffer at the University of Pittsburgh. The textbook it accompanies, which is a good read for anyone interested in the topic, can be found in a free eBook format here: Time Series Analysis and Its Applications


Profiling Top Kagglers: Gilberto Titericz, New #1 in the World

Kaggle has a new #1 data scientist! Gilberto Titericz overtook Owen Zhang to take the title of #1 Kaggler after his team finished 2nd in the Springleaf Marketing Response competition. As part of our series Profiling Top Kagglers, we interviewed Gilberto to learn more about his background and how he made his way to the top of the Kaggle community.


Introducing Distributed Data-structures in R

Due to R’s popularity as a data mining tool, many Big Data systems expose an R-based interface to users. However, these interfaces are custom, non-standard, and difficult to learn. Earlier in the year, we hosted a workshop on distributed computing in R. You can read about the event here. In brief: well-known R contributors from industry and academia, along with R-core members, discussed whether we can standardize the interface for distributed computing. A standard interface should encourage people to write portable distributed applications in R.


A Filter Selection Method Inspired From Statistics

This post will demonstrate a method to create an ensemble filter based on a trade-off between smoothness and responsiveness, two properties looked for in a filter. An ideal filter would be responsive to price action, so as not to hold incorrect positions, while also being smooth, so as not to incur false signals and unnecessary transaction costs.
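The trade-off itself can be seen with a small sketch that scores two simple moving averages on both properties. This does not reproduce the post's ensemble method; the two metrics, the window lengths, and the simulated random-walk prices are assumptions made purely for illustration.

```python
import random


def moving_average(prices, window):
    """Trailing simple moving average; one output per full window."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


def roughness(filtered):
    """Mean absolute step of the filter output (lower means smoother)."""
    steps = [abs(b - a) for a, b in zip(filtered, filtered[1:])]
    return sum(steps) / len(steps)


def lag_error(prices, filtered, window):
    """Mean absolute gap between price and filter (lower means more responsive)."""
    aligned = prices[window - 1 :]
    return sum(abs(p - f) for p, f in zip(aligned, filtered)) / len(filtered)


# Simulated random-walk prices; hypothetical data, not from the post.
rng = random.Random(1)
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] + rng.gauss(0.0, 1.0))

fast = moving_average(prices, 3)
slow = moving_average(prices, 30)
print("fast:", roughness(fast), lag_error(prices, fast, 3))
print("slow:", roughness(slow), lag_error(prices, slow, 30))
```

The longer window yields a smoother output but tracks the price less closely, which is the tension an ensemble of filters would try to balance.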


Interesting Datasets I Recently Learned About

Last week I spoke about my census-mapping work at two great venues: the EARL Conference in Boston and the NY Open Statistical Programming Meetup in New York. After my talks many people shared their own experiences mapping public datasets. I thought that I would pass along that information in case anyone is interested in exploring these datasets.