Perfect way to build a Predictive Model in less than 10 minutes

In the last few months, we have started conducting data science hackathons. These hackathons are contests with a well-defined data problem that has to be solved in a short time frame, typically anywhere between 2 and 7 days. If month-long competitions on Kaggle are like marathons, then these hackathons are the shorter format of the game: a 100 m sprint. They are high-energy events where data scientists bring a lot of intensity, the leaderboard changes almost every hour, and the speed of solving the data science problem matters far more than it does in Kaggle competitions.

How to become a data scientist for free

Statistical analysis and data mining were the top skills that got people hired in 2014, based on LinkedIn's analysis of 330 million member profiles. We live in an increasingly data-driven world, and businesses are aggressively hiring experts in data storage, retrieval, and analysis. Across the globe, statistics and data analysis skills are highly valued; in the US, India, and France, they are in particularly high demand.

Deep Style: Inferring the Unknown to Predict the Future of Fashion

Here at Stitch Fix, we are always looking for new ways to improve our client experience. On the algorithms side, that means helping our stylists make better fixes through a robust recommendation system. With that in mind, one path to better recommendations involves creating an automated process to understand and quantify the style of our inventory and clients at a fundamental level. Few would doubt that fashion is primarily a visual art form, so to achieve this goal we must first develop a way to interpret the style within images of clothing. In this post we’ll look specifically at how to build an automated process that uses photographs of clothing to quantify the style of some of the items in our collection. Then we will use this style model to make new computer-generated clothing like the image to the right.

Videos from Deep Learning Summer School, Montreal 2015

Deep neural networks that learn to represent data in multiple layers of increasing abstraction have dramatically improved the state of the art for speech recognition, object recognition, object detection, predicting the activity of drug molecules, and many other tasks. Deep learning discovers intricate structure in large datasets by building distributed representations, via supervised, unsupervised, or reinforcement learning.

Statistics for Hackers

The field of statistics has a reputation for being difficult to crack: it revolves around a seemingly endless jargon of distributions, test statistics, confidence intervals, p-values, and more, with each concept subject to its own subtle assumptions. But it doesn’t have to be this way: today we have access to computers that Neyman and Pearson could only dream of, and many of the conceptual challenges in the field can be overcome through judicious use of these CPU cycles. In this talk I’ll discuss how you can use your coding skills to ‘hack statistics’ – to replace some of the theory and jargon with intuitive computational approaches such as sampling, shuffling, cross-validation, and Bayesian methods – and show that with a grasp of just a few fundamental concepts, if you can write a for-loop you can do statistical analysis.
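The shuffling idea the talk mentions can be made concrete with a simple two-sample permutation test. This is my own minimal sketch of the approach, not code from the talk; the data and function name are illustrative:

```python
import random

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sample permutation test: repeatedly shuffle the pooled data
    and count how often the shuffled difference in means is at least as
    extreme as the observed one. That fraction is an empirical p-value."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b)
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_iter

# Two small samples with a clear difference in means
p = permutation_test([10, 12, 11, 13, 12], [7, 8, 9, 8, 7])
```

No distribution tables or test-statistic formulas are needed: the for-loop plays the role of the theory, which is exactly the talk's point.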

Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs

Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks. But despite their recent popularity, I’ve only found a limited number of resources that thoroughly explain how RNNs work and how to implement them. That’s what this tutorial is about. It’s a multi-part series in which I’m planning to cover the following:
1. Introduction to RNNs (this post)
2. Implementing an RNN using Python and Theano
3. Understanding the Backpropagation Through Time (BPTT) algorithm and the vanishing gradient problem
4. From RNNs to LSTM Networks
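Ahead of the series, the core recurrence is compact enough to sketch in a few lines of NumPy. This is a generic vanilla-RNN forward pass under my own assumed weight names (`W_xh`, `W_hh`, `W_hy`), not code from the tutorial:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, h0):
    """Vanilla RNN forward pass:
    h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1}),  y_t = W_hy @ h_t.
    The same weights are reused at every time step."""
    h = h0
    outputs = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        outputs.append(W_hy @ h)
    return outputs, h

rng = np.random.default_rng(0)
vocab, hidden = 4, 8
W_xh = rng.normal(0, 0.1, (hidden, vocab))
W_hh = rng.normal(0, 0.1, (hidden, hidden))
W_hy = rng.normal(0, 0.1, (vocab, hidden))
xs = [np.eye(vocab)[i] for i in [0, 1, 2]]  # a toy sequence of one-hot inputs
ys, h_last = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(hidden))
```

The repeated application of `W_hh` in that loop is also where the vanishing gradient problem of Part 3 comes from.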

Data Science Data Logic

These days there is an incredible amount of attention on the algorithms side of data science. The term ‘deep learning’ is the biggest hype of the moment, with Random Forest and Gradient Boosting coming second and third as approaches that have earned many people fantastic scores on Kaggle. One thing, however, is almost never talked about: what decisions are made to arrive at a training, testing, and validation logic? How is the target variable defined? How are the features created? Of course, you hear people speak about feature creation, but that is typically the recombination or transformation of features that were initially chosen to be part of the dataset. How do you get to that initial dataset? In this article, I’d like to discuss the many decisions that make up the logic of arriving at an analytic dataset and show common approaches.

From functional programming to MapReduce in R

The MapReduce paradigm has long been a staple of big data computational strategies. However, properly leveraging MapReduce can be a challenge, even for experienced R users. To get the most out of MapReduce, it is helpful to understand its relationship to functional programming. In this post I discuss how MapReduce relates to the underlying higher order functions map and reduce. By the end of the post, you will be able to translate R code into corresponding MapReduce jobs. You’ll also be able to understand how parallelism and apparent randomness fit into the paradigm so it is easier to reason about the result of a computation.
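The relationship between MapReduce and the higher-order functions map and reduce is language-neutral; although the post works in R, the canonical word-count example can be sketched with Python's built-in `map` and `functools.reduce` (the data here is my own toy example):

```python
from functools import reduce
from collections import Counter

docs = ["to be or not to be", "be here now"]

# Map phase: each "document" is processed independently into partial
# word counts, which is why this phase parallelizes trivially.
partials = list(map(lambda d: Counter(d.split()), docs))

# Reduce phase: partial results are combined pairwise. Because Counter
# addition is associative, the order of combination does not matter,
# so a framework is free to reduce in parallel and in any order.
total = reduce(lambda a, b: a + b, partials, Counter())
```

The "apparent randomness" the post refers to falls out of exactly this structure: the framework may schedule the map tasks and combine the partial results in any order, yet an associative reduce step guarantees the same final answer.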

R Packages: How to auto-update a Package version number

A question that often comes up on our public and private training courses, as we talk through the benefits of devtools and roxygen2, is: “Is there any way I can automatically update the version number each time I build my package?” Currently neither devtools nor roxygen2 offers this functionality, but in a few lines of R code you can write a script that does it for you.
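The post's actual solution is a few lines of R; as a language-neutral sketch of the same idea (shown here in Python, with an assumed minimal DESCRIPTION layout), the script just rewrites the `Version:` field, incrementing its last component:

```python
import re

def bump_version(description_text):
    """Increment the last component of the Version field in an
    R DESCRIPTION file, e.g. 'Version: 0.1.3' -> 'Version: 0.1.4'."""
    def inc(match):
        parts = match.group(1).split(".")
        parts[-1] = str(int(parts[-1]) + 1)
        return "Version: " + ".".join(parts)
    return re.sub(r"Version:\s*([\d.]+)", inc, description_text)

# Toy DESCRIPTION contents; in practice you would read and write the file
desc = "Package: mypkg\nVersion: 0.1.3\nLicense: GPL-3\n"
bumped = bump_version(desc)
```

Hooking such a script into your build step makes the version bump automatic, which is the effect the question is after.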

Reading Financial Time Series Data with R

In a recent post focused on plotting time series with the new dygraphs package, I did not show how easy it is to read financial data into R. However, in a thoughtful comment to the post, Achim Zeileis pointed out a number of features built into the basic R time series packages that everyone ought to know. In this post, I will just elaborate a little on what Achim sketched out.