What NOT To Do When Data Are Missing

Suppose that we want to estimate a regression model by OLS. We have a full sample of size n for the regressors, but one of the values for our dependent variable, y, isn’t available. Rather than estimate the model using just the (n – 1) available data points, we might think that it would be preferable to use all of the available data, and impute the missing value for y.
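As a rough illustration of why this kind of imputation may buy nothing, the sketch below fills in the missing y with its fitted value from the regression on the (n – 1) complete observations and then refits on all n points. The data are made up, the model has a single regressor, and the helper name `ols_fit` is hypothetical; the original post may analyse a different imputation scheme.

```python
def ols_fit(x, y):
    """Simple-regression OLS: return (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data: the regressor is known for all n = 6 points,
# but y is missing for the last one.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_observed = [2.1, 3.9, 6.2, 7.8, 10.1]

# Fit on the n - 1 complete observations.
a, b = ols_fit(x[:-1], y_observed)

# Impute the missing y with its fitted value, then refit on all n points.
y_imputed = y_observed + [a + b * x[-1]]
a_full, b_full = ols_fit(x, y_imputed)

# The imputed point lies exactly on the fitted line, so the
# coefficients agree up to floating-point rounding.
print(abs(a - a_full) < 1e-9, abs(b - b_full) < 1e-9)
```

Because the imputed point has a zero residual, the residual sum of squares is also unchanged, while the apparent sample size grows by one. Dividing by the larger degrees of freedom would therefore understate the error variance.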

Beginner’s guide to Design of Experiments (with case study on banner advertisement)

In this article, I’ve elaborated on the concept behind Design of Experiments. By now, you should have an intuition for the strategies that companies use to decide on the best mode of advertising. Companies used to struggle to earn positive returns on their marketing budgets; this technique has not only saved millions in hard cash but has also provided a prudent method to reap benefits intelligently.

5 Easy questions on Ensemble Modeling everyone should know

In this article, we have looked at 5 frequently asked questions on Ensemble models. While answering these questions, we have discussed “Ensemble Models”, “Methods of Ensemble”, “Why should we ensemble diverse models?”, “Methods to identify optimal weight for ensemble” and finally “Benefits”. I would suggest you look at the top 5 solutions of data science competitions, study their ensemble approaches to get a better understanding, and practice a lot. It will help you to understand what works and what doesn’t.
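One simple way to identify an ensemble weight is a grid search on held-out data. The sketch below is a minimal illustration for two models blended as w * A + (1 - w) * B; the function names and the validation data are made up for this example, and it is not necessarily the method the article recommends.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def best_blend_weight(pred_a, pred_b, target, steps=100):
    """Grid-search w in [0, 1] for the blend w * pred_a + (1 - w) * pred_b."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        err = mse(blend, target)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Made-up validation data: model A overshoots, model B undershoots.
target = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.2, 2.2, 3.2, 4.2]
pred_b = [0.8, 1.8, 2.8, 3.8]

w, err = best_blend_weight(pred_a, pred_b, target)
print(w, err)
```

The two models err in opposite directions here, so the blend beats either model alone. This is the usual argument for ensembling diverse models.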

Predictive Analytics for Beginners – part 1

Data Analysis in R

I have written about R in the past, and it is one of the hottest tools for data analysis today. To further demonstrate the power of R, I found click-through rate data on Kaggle. The dataset is over 6 gigabytes and has over 12 million rows, but I limited the dataset to 2 million rows for the sake of performance in R.

The wonderful world of recommender systems

I recently gave a talk about recommender systems at the Data Science Sydney meetup (the slides are available here). This post roughly follows the outline of the talk, expanding on some of the key points in non-slide form (i.e., complete sentences and paragraphs!). The first few sections give a broad overview of the field and the common recommendation paradigms, while the final part is dedicated to debunking five common myths about recommender systems.

Recurrent Neural Networks Tutorial, Part 2 – Implementing a RNN with Python, Numpy and Theano

In this part we will implement a full Recurrent Neural Network from scratch using Python and optimize our implementation using Theano, a library to perform operations on a GPU. The full code is available on GitHub. I will skip over some boilerplate code that is not essential to understanding Recurrent Neural Networks, but all of that is also on GitHub.

Scipy Lecture Notes

Tutorials on the scientific Python ecosystem: a quick introduction to central tools and techniques. Each chapter corresponds to a 1 to 2 hour course, with the level of expertise increasing from beginner to expert.

Grasp-and-Lift EEG Detection Winner’s Interview: 2nd place, daheimao

The Grasp-and-Lift EEG Detection competition asked participants to identify when a hand was grasping, lifting, and replacing an object, using EEG data taken from healthy subjects as they performed these activities. The competition was sponsored by the WAY Consortium (Wearable interfaces for hAnd function recoverY) as part of their work towards developing better prosthetic devices for patients with amputation or neurological disabilities who have lost hand function.

Learning Game of Life with a Convolutional Neural Network

In this article I’d like to discuss a way of building this kind of jewel, as a convolutional neural network, which, after having seen a bunch of iterations of game of life, can learn its underlying behaviour.
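The behaviour such a network has to learn is purely local: each cell's next state depends only on its 3×3 neighbourhood, which is what makes a convolution a natural fit. For reference, here is a minimal plain-Python sketch of one Game of Life step (no neural network involved). The function name `life_step` is hypothetical, and treating cells outside the grid as dead is an assumption.

```python
def life_step(grid):
    """Advance a Game of Life grid by one generation.

    Cells outside the grid are treated as dead (an assumption; the
    original article may handle borders differently).
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Sum the 8 neighbours: the same 3x3 window a convolution sees.
            neighbours = sum(
                grid[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            )
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            if neighbours == 3 or (grid[r][c] == 1 and neighbours == 2):
                new_grid[r][c] = 1
    return new_grid


# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
next_gen = life_step(blinker)
print(next_gen)
```

Pairs of (grid, next grid) produced this way are the kind of training examples a network could learn the rule from.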

What is TF-IDF? The 10 minute guide

I recently started reading up a bit on tf-idf, which stands for term frequency-inverse document frequency. Tf-idf is a simple, but surprisingly powerful technique which can be used to figure out what a document is ‘about’. It’s often used in the fields of information retrieval and text mining.
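A minimal sketch of the computation, using only the standard library. It assumes tf = term count / document length and idf = log(N / document frequency), which is one of several common variants; the function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenised document.

    Uses tf = count / document length and idf = log(N / document frequency),
    one of several common variants.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up toy corpus, already lower-cased and split into tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)
print(scores[2])
```

A word such as "the" appears in every document, so its idf is log(1) = 0 and it scores zero everywhere. Terms that are frequent in one document but rare across the corpus score highest, which is how tf-idf suggests what a document is about.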

Most popular Data Science keywords

A quick chart illustrating the top Data Science keywords

BigQuery Big Data Visualization With D3.js

How do you handle a large dataset with D3.js? It’s a frequently asked question. You can read several discussions on the topic here, here, and here. So far, the best solution is to reduce the data to a smaller dataset first, then use D3.js to visualize it.

Playing with Leaflet (and Radar locations)

Yesterday, my friend Fleur showed me some interesting features of the leaflet package in R. …

Top 5 arXiv Deep Learning Papers, Explained

Top deep learning papers on arXiv are presented, summarized, and explained with the help of a leading researcher in the field.

30 Can’t miss Harvard Business Review articles on Data Science, Big Data and Analytics

1. Data Scientist: the sexiest job of the 21st century
2. The Sexiest Job of the 21st Century is Tedious, and that Needs to Change
3. What Every Manager Should Know About Machine Learning
4. Data Scientists Don’t Scale
5. Get the Right Data Scientists Asking the “Wrong” Questions
6. A Data Scientist’s Real Job: Storytelling
7. What Separates a Good Data Scientist from a Great One
8. Still the Sexiest Profession Alive
9. 10 Kinds of Stories to Tell with Data
10. How to Start Thinking Like a Data Scientist
11. Stop Searching for That Elusive Data Scientist
12. How to Explore Cause and Effect Like a Data Scientist
13. You May Not Need Big Data After All
14. Big Data Hype (and Reality)
15. With Big Data Comes Big Responsibility
16. Inventory Management in the Age of Big Data
17. Why Health Care May Finally Be Ready for Big Data
18. What the Companies Winning at Big Data Do Differently
19. Stop Worrying About Whether Machines Are “Intelligent”.
20. Are You Data Driven? Take a Hard Look in the Mirror.
21. Marketers Flunk the Big Data Test
23. Making Advanced Analytics Work for You
24. A Predictive Analytics Primer
25. The Persuasiveness of a Chart Depends on the Reader, Not Just the Chart
26. Analytics 3.0
27. What People Analytics Can’t Capture
28. Gamification Can Help People Actually Use Analytics Tools
29. What Popular Baby Names Teach Us About Data Analytics
30. A Better Way to Tackle All That Data

A Few Days of Python: Automating Tasks Involving Excel Files

There are plenty of instances where analysts are regularly forwarded xls spreadsheets and tasked with summarizing the data. In many cases, these scenarios can be automated through fairly simple Python scripts. In the following code, I take an Excel spreadsheet with two sheets, summarize each sheet using a pivot table, and add those results to sheets in a new spreadsheet.

Jug: Easily Create R APIs

Jug stands for Just Unified Galloping. Okay, okay, it’s just a play on words coming from a Flask (Python) background. Jug is my attempt to create a simple small web framework that allows you to turn your (existing) R functions into an API. Having the wonderful httpuv package at my disposal made this very easy for me.

Understanding empirical Bayes estimation (using baseball statistics)

This post isn’t really about baseball; I’m just using it as an illustrative example. (I actually know very little about sabermetrics. If you want a more technical version of this post, check out this great paper.) This post is, rather, about a very useful statistical method for estimating a large number of proportions, called empirical Bayes estimation.
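The core idea can be sketched in a few lines: estimate a Beta prior from the data themselves, then shrink each raw proportion toward the prior mean. The sketch below assumes a beta-binomial model and fits the prior by the method of moments on records with at least 50 trials. Both choices are simplifications made for illustration (the original post may fit the prior differently), and the records are made up.

```python
def fit_beta_prior(proportions):
    """Method-of-moments estimate of Beta(alpha, beta) from raw proportions."""
    n = len(proportions)
    mean = sum(proportions) / n
    var = sum((p - mean) ** 2 for p in proportions) / (n - 1)
    # For a Beta distribution, var = mean * (1 - mean) / (alpha + beta + 1).
    total = mean * (1 - mean) / var - 1
    return mean * total, (1 - mean) * total


def shrink(successes, trials, alpha, beta):
    """Posterior mean of a proportion under a Beta(alpha, beta) prior."""
    return (successes + alpha) / (trials + alpha + beta)


# Made-up (hits, at-bats) records; the last one has very few trials.
records = [(30, 100), (27, 100), (25, 100), (24, 100), (33, 100), (4, 10)]

# Fit the prior only on records with enough trials to be informative
# (the threshold of 50 is an arbitrary choice for this sketch).
reliable = [h / ab for h, ab in records if ab >= 50]
alpha, beta = fit_beta_prior(reliable)

for hits, at_bats in records:
    raw = hits / at_bats
    est = shrink(hits, at_bats, alpha, beta)
    print(f"{hits}/{at_bats}: raw {raw:.3f}, shrunk {est:.3f}")
```

The 4-for-10 record has a raw proportion of 0.400, but with so few trials its estimate is pulled most of the way back toward the prior mean. Records with 100 trials move less.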

#MonthOfJulia Day 25: Interfacing with Other Languages

Julia has native support for calling C and FORTRAN functions. There are also add-on packages which provide interfaces to C++, R and Python. We’ll have a brief look at the support for C and R here. Further details on these and the other supported languages can be found on GitHub.

Combining Choropleth Maps and Reference Maps in R

Recent updates to my mapping packages now make it easy to combine choropleth maps and reference maps in R. All you have to do is pass the parameter reference_map = TRUE to the existing functions. This should “just work”, regardless of which region you zoom in on or what data you display.

purrr 0.1.0

Purrr is a new package that fills in the missing pieces in R’s functional programming tools: it’s designed to make your pure functions purrr. Like many of my recent packages, it works with magrittr to allow you to express complex operations by combining simple pieces in a standard way.