Murphy diagrams in R
At the recent International Symposium on Forecasting, held in Riverside, California, Tilmann Gneiting gave a great talk on ‘Evaluating forecasts: why proper scoring rules and consistent scoring functions matter’. It will be the subject of an IJF invited paper in due course. One of the things he talked about was the ‘Murphy diagram’ for comparing forecasts, as proposed in Ehm et al. (2015). Here’s how it works for comparing mean forecasts.
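For a feel of the construction, here is a rough Python sketch of the elementary score for the mean that a Murphy diagram averages over a grid of thresholds θ (the original post works in R; the toy forecasts, function names, and scaling here are an illustration of the idea, not code from the post).

```python
import numpy as np
import matplotlib.pyplot as plt

def elementary_score_mean(forecast, observation, theta):
    """Elementary (extremal) score for mean forecasts at threshold theta:
    |observation - theta| whenever theta separates forecast and observation,
    0 otherwise (following Ehm et al. 2015, up to a scaling constant)."""
    forecast = np.asarray(forecast, dtype=float)
    observation = np.asarray(observation, dtype=float)
    separated = ((forecast <= theta) & (theta < observation)) | \
                ((observation <= theta) & (theta < forecast))
    return np.where(separated, np.abs(observation - theta), 0.0)

# Toy example: two competing mean forecasts of the same observations.
rng = np.random.default_rng(42)
mu = rng.normal(size=500)                  # underlying signal
y = mu + rng.normal(size=500)              # observations
f1 = mu                                    # informed forecast
f2 = np.zeros(500)                         # uninformed forecast (always 0)

thetas = np.linspace(-3, 3, 121)
score1 = [elementary_score_mean(f1, y, t).mean() for t in thetas]
score2 = [elementary_score_mean(f2, y, t).mean() for t in thetas]

# The Murphy diagram plots the mean elementary score against theta;
# one forecast dominates if its curve lies below the other everywhere.
plt.plot(thetas, score1, label="informed forecast")
plt.plot(thetas, score2, label="uninformed forecast")
plt.xlabel(r"$\theta$"); plt.ylabel("mean elementary score")
plt.legend(); plt.show()
```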

Talk: How to Visualize Data
Last week, I gave one of the visualization primer talks at BioVis in Dublin. My goal was to show people some examples, but also to criticize the rather poor visualization culture in bioinformatics and to challenge people to do better. Here is a write-up of that talk.

Python: Intro to Linear Algebra for Data Scientists
It’s important to know what goes on inside a machine learning algorithm, but it’s hard: there is some pretty intense math happening, much of which is linear algebra. When I took Andrew Ng’s course on machine learning, I found the linear algebra the hardest part. I’m writing this for myself as much as for you. So here is a quick review, so that next time you look under the hood of an algorithm, you’re more confident. You can view the IPython notebook (usually easier to code along with) on my GitHub.
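The notebook itself is not reproduced here, but the flavour of that kind of review is easy to sketch with NumPy; the snippet below is an illustration of the operations that show up inside ML algorithms, not an excerpt from the notebook.

```python
import numpy as np

# A design matrix X (3 observations, 2 features) and a weight vector w.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])

# Matrix-vector product: predictions of a linear model, one per row of X.
y_hat = X @ w                      # shape (3,)

# Matrix-matrix product: the Gram matrix X^T X used in least squares.
gram = X.T @ X                     # shape (2, 2)

# Solving a linear system: the normal equations (X^T X) w = X^T y.
y = np.array([1.0, 2.0, 3.0])
w_ls = np.linalg.solve(gram, X.T @ y)

print(y_hat, w_ls)
```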

Awesome Random Forest
A curated list of resources regarding tree-based methods and more, including but not limited to random forest, bagging and boosting.
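As a minimal taste of the methods the list covers, here is a scikit-learn sketch comparing a single decision tree with a random forest; the dataset and parameters are arbitrary choices for illustration, not anything prescribed by the list.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# A single decision tree versus a bagged ensemble of 100 randomized trees.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```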

Seven Python Tools All Data Scientists Should Know How to Use
1. IPython
2. GraphLab Create
3. Pandas
4. PuLP
5. Matplotlib
6. Scikit-Learn
7. Spark

A Step by Step Backpropagation Example
Backpropagation is a common method for training a neural network. There is no shortage of papers online that attempt to explain how backpropagation works, but few include an example with actual numbers. This post is my attempt to explain how it works with a concrete example, so that folks can check their own calculations against it and make sure they understand backpropagation correctly.
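In the same spirit, here is a small Python sketch of one forward and backward pass through a toy network; the architecture, weights, and learning rate are placeholder values for illustration, not the numbers used in the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny 2-2-1 network with made-up weights (not the values from the post).
x = np.array([0.5, 0.1])          # input
t = np.array([0.7])               # target output
W1 = np.array([[0.2, -0.3],
               [0.4,  0.1]])      # input -> hidden weights
W2 = np.array([[0.5, -0.6]])      # hidden -> output weights

# Forward pass.
h = sigmoid(W1 @ x)               # hidden activations
o = sigmoid(W2 @ h)               # output
loss = 0.5 * np.sum((o - t) ** 2) # squared-error loss

# Backward pass: apply the chain rule layer by layer.
delta_o = (o - t) * o * (1 - o)           # dLoss/d(output pre-activation)
grad_W2 = np.outer(delta_o, h)            # dLoss/dW2
delta_h = (W2.T @ delta_o) * h * (1 - h)  # error propagated to hidden layer
grad_W1 = np.outer(delta_h, x)            # dLoss/dW1

# One gradient-descent step.
lr = 0.5
W2 -= lr * grad_W2
W1 -= lr * grad_W1
print(loss, grad_W1, grad_W2)
```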

Deep Learning Adversarial Examples – Clarifying Misconceptions
A Google scientist clarifies misconceptions and myths around deep learning adversarial examples, including the claims that they do not occur in practice, that deep learning is more vulnerable to them than other methods, that they can be easily solved, and that human brains make similar mistakes.

6 reasons why I like KeystoneML
As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata + Hadoop World in New York, I sat down with my co-organizer Ben Recht for the latest episode of the O’Reilly Data Show Podcast. Recht is a UC Berkeley faculty member and a member of AMPLab, and his research spans many areas of interest to data scientists, including optimization, compressed sensing, statistics, and machine learning.

Seeing Data as the Product of Underlying Structural Forms
Matrix factorization follows from the realization that nothing forces us to accept the data as given. We start with objects placed in rows and record observations on those objects in columns arrayed along the top. Neither the objects nor the measurements need to be preserved in their original form.
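As a concrete illustration of that idea, the sketch below factors a small data matrix with a truncated SVD in NumPy, re-expressing both rows and columns in terms of a few latent dimensions; the matrix is an arbitrary example, not data from the article.

```python
import numpy as np

# Ratings-style data: 5 objects (rows) measured on 4 attributes (columns).
X = np.array([[5., 4., 1., 1.],
              [4., 5., 2., 1.],
              [1., 1., 5., 4.],
              [1., 2., 4., 5.],
              [2., 1., 1., 2.]])

# Truncated SVD: approximate X as the product of two thinner matrices,
# so neither the original rows nor the original columns are kept as given.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
row_factors = U[:, :k] * s[:k]     # each object as k latent coordinates
col_factors = Vt[:k, :]            # each measurement as k latent coordinates

X_hat = row_factors @ col_factors  # rank-2 reconstruction of the data
print(np.round(X_hat, 2))
```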

10 Pains Businesses Feel When Working With Data
The 10 most common data issues facing businesses, and how to cure them.
1. Inability to compare data held in different locations and from different sources.
2. Non-conforming data, e.g. invoice discrepancies.
3. Delayed access to data, reports out of date.
4. Limited or no metrics; no information on internal and external KPIs.
5. Lack of understanding of customers: their habits, preferences, and satisfaction.
6. Extensive time and effort being spent on manual data entry, extraction and analysis.
7. Lack of ability to share insights and information in meaningful forms.
8. No factual consistency, due to multiple versions and out-of-date information.
9. Insight lost in aggregated data and summary views.
10. Inability to re-examine source material from different and changing perspectives.

Working with Sessionized Data 2: Variable Selection
In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we sessionized the data by considering all possible aggregations (window widths) of the data as features. Such naive sessionization can quickly lead to very wide data sets, with potentially more features than you have datums (and collinear features, as well). In this post, we will use the same example, but try to select our features more intelligently.
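As a rough sketch of the idea (not necessarily the method used in the post), the snippet below screens a wide, weakly informative feature matrix with a simple univariate filter; all names, data, and thresholds are illustrative.

```python
import numpy as np
from scipy import stats

# Toy "sessionized" data: many aggregate-window features, few rows,
# and only a handful of columns actually related to the outcome.
rng = np.random.default_rng(0)
n_rows, n_feats, n_signal = 200, 500, 5
X = rng.normal(size=(n_rows, n_feats))
y = (X[:, :n_signal].sum(axis=1) + rng.normal(size=n_rows)) > 0

# Simple univariate screen: keep features whose distribution differs
# between the two outcome classes at a small p-value threshold.
pvals = np.array([stats.ttest_ind(X[y, j], X[~y, j]).pvalue
                  for j in range(n_feats)])
keep = np.where(pvals < 1.0 / n_feats)[0]   # Bonferroni-style cutoff
print("selected columns:", keep)
```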