Symmetry and Skewness

After taking your first introductory course in statistics you probably agreed wholeheartedly with the following statement: ‘A statistical distribution is symmetric if and only if it is not skewed.’ After all, isn’t that how we define ‘skewness’?
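The framing of that statement suggests it deserves a second look. A minimal sketch, assuming "skewness" means the standardized third central moment and using a made-up three-point distribution (the values and probabilities below are illustrative, not taken from the article), shows one direction of the "if and only if" failing: the third central moment is exactly zero, yet the distribution is not symmetric about its mean.

```python
from fractions import Fraction

# Hypothetical three-point distribution, chosen so the arithmetic is exact.
values = [-3, -1, 2]
probs = [Fraction(1, 10), Fraction(1, 2), Fraction(2, 5)]

mean = sum(p * v for v, p in zip(values, probs))
third_central_moment = sum(p * (v - mean) ** 3 for v, p in zip(values, probs))

# Symmetric about the mean means each point v has a mirror image
# 2*mean - v carrying the same probability.
pmf = dict(zip(values, probs))
is_symmetric = all(pmf.get(2 * mean - v, 0) == p for v, p in pmf.items())

print(mean, third_central_moment, is_symmetric)
```

Since the skewness coefficient is the third central moment divided by the cube of the standard deviation, a zero third moment means zero skewness here, even though the mirror-image check fails.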

How to build a Market Basket Analysis Engine

A market basket analysis or recommendation engine is what is behind all these recommendations we get when we go shopping online or whenever we receive targeted advertising. The underlying engine collects information about people’s habits and knows that if people buy pasta and wine, they are usually also interested in pasta sauces. So, the next time you go to the supermarket and buy pasta and wine, be ready to get a recommendation for some pasta sauce!
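The excerpt does not say which technique the engine uses; one common approach to market basket analysis is association rules scored by support and confidence. The sketch below assumes that approach and uses a handful of invented transactions, with `support` and `confidence` as hypothetical helper names, to score the rule "pasta and wine, therefore pasta sauce".

```python
# Invented shopping baskets, for illustration only.
transactions = [
    {"pasta", "wine", "pasta sauce"},
    {"pasta", "wine", "pasta sauce"},
    {"pasta", "wine"},
    {"bread", "wine"},
    {"pasta", "pasta sauce"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"pasta", "wine", "pasta sauce"}, transactions)
rule_confidence = confidence({"pasta", "wine"}, {"pasta sauce"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, two of the three baskets containing both pasta and wine also contain pasta sauce, so the rule's confidence is about two thirds.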

Simple/limited/incomplete benchmark for scalability/speed and accuracy of machine learning libraries for classification

This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data. If the input matrix is n x p, n is varied over 10K, 100K, 1M and 10M, while p is about 1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular data type and size (the largest) stems from this author’s interest in credit card fraud detection at work. The algorithms studied are
• linear (logistic regression, linear SVM)
• random forest
• boosting
• deep neural network
in various commonly used open source implementations like
• R packages
• Python scikit-learn
• Vowpal Wabbit
• H2O
• xgboost
• Spark MLlib.

How a Kalman filter works, in pictures

I have to tell you about the Kalman filter, because what it does is pretty damn amazing. Surprisingly few software engineers and scientists seem to know about it, and that makes me sad because it is such a general and powerful tool for combining information in the presence of uncertainty. At times its ability to extract accurate information seems almost magical, and if it sounds like I’m talking this up too much, then take a look at this previously posted video where I demonstrate a Kalman filter figuring out the orientation of a free-floating body by looking at its velocity. Totally neat!
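To make "combining information in the presence of uncertainty" concrete, here is a minimal one-dimensional sketch. It is not the orientation filter from the video: it assumes a scalar state that stays roughly constant, and the function name `kalman_1d` and all parameter values are hypothetical.

```python
def kalman_1d(measurements, meas_var, process_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant state.

    x is the state estimate and p is its variance. Returns the list of
    (estimate, variance) pairs after each measurement is absorbed.
    """
    x, p = x0, p0
    history = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        p = p + process_var
        # Update: the gain k weights the measurement by relative certainty.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1 - k) * p
        history.append((x, p))
    return history


estimates = kalman_1d([5.0, 5.0, 5.0], meas_var=1.0, process_var=0.0)
for estimate, variance in estimates:
    print(estimate, variance)
```

Each update pulls the estimate toward the measurement and shrinks the variance, which is the sense in which two uncertain sources (prediction and measurement) combine into something more certain than either.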

Named Entity Recognition: Examining the Stanford NER Tagger

Recently I landed a job at URX through a data science fellowship program for people with quantitative PhDs called Insight Data Science. As part of a new initiative within the program, I was offered the opportunity to work with URX on a unique data science challenge that held real business value. The goal was to develop a Named Entity Recognition (NER) classifier that could be compared favorably to one of the state-of-the-art (but commercially licensed) NER classifiers developed by the CoreNLP lab at Stanford University over a number of years.

July 2015: Scripts of the Week

July brought 3 new competitions, a few fun coding challenges, and the 2013 census dataset to Kaggle for you to explore on scripts. Kagglers took their scripting to the next level, walking other data scientists through their analysis with RMarkdown and using a blog style to effectively highlight the most interesting insights.

The reusable holdout: Preserving validity in adaptive data analysis

Machine learning and statistical analysis play an important role at the forefront of scientific and technological progress. But with all data analysis, there is a danger that findings observed in a particular sample do not generalize to the underlying population from which the data were drawn. A popular XKCD cartoon illustrates that if you test sufficiently many different colors of jelly beans for correlation with acne, you will eventually find one color that correlates with acne at a p-value below the infamous 0.05 significance level.
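The jelly bean effect can be quantified with a short calculation. The sketch below assumes the tests are independent and that every null hypothesis is true (simplifying assumptions, not claims from the article); the function name and the choice of 20 tests are illustrative.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests of a
    true null hypothesis comes out significant at level `alpha`."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20):
    print(k, round(prob_at_least_one_false_positive(k), 3))
```

Under these assumptions, 20 tests at the 0.05 level give roughly a 64% chance of at least one spurious "discovery".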

How Do Auction Values Differ by the Number of Teams in Your League? A Multilevel Model

Many sites publish fantasy football players’ average auction values (AAV) for use in auction drafts. But these auction values often assume you play in a 10-team league. If you have a different number of teams in your league, the auction values will be different. This post examines how auction values differ based on the number of teams in your league.

Correlation is not a measure of reproducibility

Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, the ENCODE data standards document states

Differences in the network structure of CRAN and BioConductor

This week at JSM2015, the annual conference of the American Statistical Association, Joseph Rickert and I gave a presentation on the topic of ‘The network structure of CRAN and BioConductor’ ( http://…/AbstractDetails.cfm?abstractid=314733 ). Our work tested whether one can detect statistical differences in the network graphs formed by the dependencies between packages. In the dependency graph, each package is a vertex and each dependency is an edge connecting two vertices.
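The graph construction described above can be sketched in a few lines. The package names below are placeholders, not real CRAN or BioConductor packages, and treating edges as directed (from a package to the package it depends on) is an assumption made for this illustration.

```python
from collections import defaultdict

# Placeholder (package, dependency) pairs, for illustration only.
dependencies = [
    ("pkgA", "pkgB"),
    ("pkgA", "pkgC"),
    ("pkgB", "pkgC"),
    ("pkgD", "pkgC"),
]

graph = defaultdict(set)      # vertex -> set of packages it depends on
in_degree = defaultdict(int)  # vertex -> number of packages depending on it

for package, dependency in dependencies:
    graph[package].add(dependency)
    graph[dependency]  # touch the key so leaf packages also appear as vertices
    in_degree[dependency] += 1

vertices = set(graph)
num_edges = sum(len(targets) for targets in graph.values())
print(len(vertices), num_edges, dict(in_degree))
```

Summary statistics such as the in-degree distribution are one kind of quantity that could then be compared between two repositories.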