What’s wrong with my time series? Model validation without a hold-out set

Time series modeling sits at the core of critical business operations such as supply and demand forecasting and quick-response algorithms like fraud and anomaly detection. Small errors can be costly, so it’s important to know what to expect of different error sources. The trouble is that the usual approach of cross-validation doesn’t work for time series models. The reason is simple: time series data are autocorrelated so it’s not fair to treat all data points as independent and randomly select subsets for training and testing. In this post I’ll go through alternative strategies for understanding the sources and magnitude of error in time series.
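One widely used alternative, worth mentioning up front, is rolling-origin (or "walk-forward") evaluation: refit the model on an expanding window of past observations and score it only on the next, still-unseen point, so the temporal ordering is respected. The sketch below, using the forecast package and the built-in AirPassengers series as stand-ins, illustrates the general idea rather than the specific strategy the post settles on.

# Rolling-origin (walk-forward) evaluation: a minimal sketch in R.
# auto.arima() and AirPassengers are stand-ins, not the post's own choices.
library(forecast)

y <- AirPassengers                 # monthly series shipped with R
n <- length(y)
initial <- 100                     # size of the first training window
errors <- numeric(0)

for (i in initial:(n - 1)) {
  train <- window(y, end = time(y)[i])          # data up to the current origin
  fit   <- auto.arima(train)                    # refit at each origin
  fc    <- forecast(fit, h = 1)                 # one-step-ahead forecast
  errors <- c(errors, y[i + 1] - fc$mean[1])    # genuine out-of-sample error
}

sqrt(mean(errors^2))               # rolling one-step-ahead RMSE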


Open Source Toolkits for Speech Recognition

As members of the deep learning R&D team at SVDS, we are interested in comparing Recurrent Neural Network (RNN) and other approaches to speech recognition. Until a few years ago, the state-of-the-art for speech recognition was a phonetic-based approach including separate components for pronunciation, acoustic, and language models. Typically, this consists of n-gram language models combined with Hidden Markov models (HMM). We wanted to start with this as a baseline model, and then explore ways to combine it with newer approaches such as Baidu’s Deep Speech. While summaries exist explaining these baseline phonetic models, there do not appear to be any easily-digestible blog posts or papers that compare the tradeoffs of the different freely available tools.


How neural networks work

Far from being incomprehensible, the principles behind neural networks are surprisingly simple. Here’s a gentle walk through how to use deep learning to categorize images from a very simple camera. You have responded with overwhelmingly positive comments to my two previous videos on convolutional neural networks and deep learning. You have also made two requests: find a better example and explain backpropagation. This video does both.


Introduction to Data Visualization – Theory, R & ggplot2

The topic of data visualization is very popular in the data science community. The market for visualization products is valued at $4 billion and is projected to reach $7 billion by the end of 2022, according to Mordor Intelligence. While we have seen amazing advances in the technology to display information, the understanding of how, why, and when to use visualization techniques has not kept up. Unfortunately, people are often taught how to make a chart before even thinking about whether or not it’s appropriate. In short, are you adding value to your work, or are you adding a chart simply to make it seem less boring? Let’s take a look at some examples before going through the Stoltzmaniac Data Visualization Philosophy.


Unit Testing in R

Software testing describes several means of investigating program code with regard to its quality, and the underlying approaches provide ways to handle errors once they occur. Software testing also offers techniques to reduce the probability of errors in the first place. R is becoming an increasingly prominent programming language, not only in purely statistical settings but also in machine learning, dashboards via Shiny, and beyond. This development is simultaneously fueled by business schools teaching R to their students. While software testing is usually covered from a theoretical viewpoint, our slides teach the basics of software testing in an easy-to-understand fashion with the help of R. Our slide deck aims at bridging R programming and software testing. The slides outline the need for software testing and describe general approaches, such as the V model. In addition, we present the built-in features for error handling in R and show how to do unit testing with the help of the “testthat” package. We hope the slide deck helps practitioners unleash the power of unit testing in R and equips scholars in business schools with knowledge of software testing.
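To make the unit-testing part concrete, here is a minimal sketch of a testthat test file; the function under test is a made-up example and is not taken from the slide deck.

# A toy function and its tests with the "testthat" package.
library(testthat)

# Function under test: divide two numbers, failing loudly on division by zero.
safe_divide <- function(x, y) {
  if (y == 0) stop("division by zero")
  x / y
}

test_that("safe_divide returns the expected quotient", {
  expect_equal(safe_divide(10, 2), 5)
  expect_equal(safe_divide(-9, 3), -3)
})

test_that("safe_divide signals an error on division by zero", {
  expect_error(safe_divide(1, 0), "division by zero")
})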


How Reproducible Data Analysis Scripts Can Help You Route Around Data Sharing Blockers

For aaaagggggggeeeeeeesssssss now, I’ve been wittering on about how just publishing “open data” is okay as far as it goes, but it’s often not that helpful, or at least, not as useful as it could be. Yes, it’s a Good Thing when a dataset is published in support of a report; but have you ever tried reproducing the charts, tables, or summary figures mentioned in the report from the data supplied along with it? If a report is generated “from source” using something like Rmd (RMarkdown), which can blend text with analysis code, a means of importing the data used in the analysis, and the automatically generated outputs (such as charts, tables, or summary figures) obtained by executing the code over the loaded data, third parties can see exactly how the data was turned into reported facts. And if you need to run the analysis again with a more recent dataset, you can. (See here for an example.) But publishing details about how to do the lengthy first mile of any piece of data analysis – finding the data, loading it in, and then cleaning and shaping it enough so that you can actually start to use it – has additional benefits too.
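For readers who have not met RMarkdown before, here is a minimal sketch of what such a self-contained document can look like; the layout is generic and the data URL is a placeholder, not one from the post.

---
title: "Reproducible analysis example"
output: html_document
---

```{r load-and-clean}
# The lengthy "first mile", captured in code: fetch, load, and clean the data.
raw   <- read.csv("https://example.org/open-data.csv")   # placeholder URL
clean <- subset(raw, !is.na(value))
```

```{r summary-figure}
# The report's chart is regenerated from the cleaned data on every knit.
hist(clean$value, main = "Distribution of value")
```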


4 tricks for working with R, Leaflet and Shiny

I recently worked on a dataviz project involving Shiny and the Leaflet library. In this post I share 4 handy tricks we used to improve the app: 1) how to use Leaflet’s native widgets, 2) how to trigger an action when the user clicks on the map, 3) how to add a search bar to your map, and 4) how to offer a “geolocalize me” button. For each trick, a reproducible code snippet is provided, so you just have to copy and paste it to reproduce the result.
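As a taste of trick 2, here is a minimal sketch of reacting to a map click in a Shiny app; the names and coordinates are illustrative and not the post’s own code.

# Minimal Shiny + Leaflet app: print the coordinates of the last map click.
library(shiny)
library(leaflet)

ui <- fluidPage(
  leafletOutput("map"),
  verbatimTextOutput("click_info")
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet() %>% addTiles() %>% setView(lng = 2.35, lat = 48.85, zoom = 5)
  })

  # Leaflet exposes map clicks as input$<outputId>_click, here input$map_click.
  output$click_info <- renderPrint({
    click <- req(input$map_click)
    c(lat = click$lat, lng = click$lng)
  })
}

shinyApp(ui, server)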


How to choose a project to practice data science

Here at Sharp Sight, I’ve derided the “jump in and build something” method of learning data science for quite some time. Learning data science by “jumping in” and starting a big project is highly inefficient.


Deep Learning and AI Success Stories

The insideBIGDATA Guide to Deep Learning & Artificial Intelligence is a useful new resource directed toward enterprise thought leaders who wish to gain strategic insights into this exciting area of technology. In this guide, we take a high-level view of AI and deep learning in terms of how they’re being used and what technological advances have made them possible. We also explain the differences between AI, machine learning, and deep learning, and examine the intersection of AI and HPC. We present the results of a recent insideBIGDATA survey that reflects how well these new technologies are being received. Finally, we take a look at a number of high-profile use case examples showing the effective use of AI in a variety of problem domains. The complete insideBIGDATA Guide to Deep Learning & Artificial Intelligence is available for download from the insideBIGDATA White Paper Library.


Ideas on interpreting machine learning

You’ve probably heard by now that machine learning algorithms can use big data to predict whether a donor will give to a charity, whether an infant in a NICU will develop sepsis, whether a customer will respond to an ad, and on and on. Machine learning can even drive cars and predict elections. … Err, wait. Can it? I believe it can, but these recent high-profile hiccups should leave everyone who works with data (big or not) and machine learning algorithms asking themselves some very hard questions: do I understand my data? Do I understand the model and answers my machine learning algorithm is giving me? And do I trust these answers? Unfortunately, the complexity that bestows the extraordinary predictive abilities on machine learning algorithms also makes the answers the algorithms produce hard to understand, and maybe even hard to trust.


7 types of job profiles that make you a Data Scientist

So yes, this post might look somewhat like clickbait, but I promise you it’s not exactly that (well, somewhat). I recently got a question on Quora asking something along the lines of “What exact skills do companies look for when they are recruiting a Data Scientist, and is there a definition of a Data Scientist profile?” As is pretty obvious, there is no one profile, as every company is solving its own set of problems. But I tried to sketch a few generic job profiles that can roughly fit the JDs of different companies.


New screencast: using R and RStudio to install and experiment with Apache Spark

I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark.
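For anyone who wants to follow along, below is a minimal sketch of one common route from R to a local Spark instance, via the sparklyr package; the exact steps shown in the screencast may differ.

# Install and connect to a local Spark from R (sparklyr route, assumed here).
install.packages("sparklyr")       # one-time setup
library(sparklyr)
spark_install()                    # downloads and installs a local Spark

sc <- spark_connect(master = "local")

# Copy a small built-in data frame to Spark and run a quick dplyr query.
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))

spark_disconnect(sc)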