Large Scale Decision Forests: Lessons Learned

1. Handling sparse features
2. Dealing with missing features (see the sketch after this list)
3. Training optimizations
4. Evaluation optimizations
5. Performance optimizations
6. Visualization is key
7. No model is uniformly better
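
The list above is only an outline, so here is a minimal sketch (mine, not from the original post) of one standard answer to lesson 2: letting each split learn a "default direction" for missing feature values, in the style popularized by gradient-boosting libraries. All data and names below are hypothetical.

```python
# Minimal sketch, not the original post's code: one common way tree
# learners handle missing features is a learned "default direction"
# at each split. During training, try sending missing values left and
# right and keep whichever side gives lower impurity; at prediction
# time, follow the stored direction.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_default(samples, labels, feature, threshold):
    """Choose the direction for missing values by trying both sides."""
    best = None
    for default in ("left", "right"):
        left, right = [], []
        for x, y in zip(samples, labels):
            v = x[feature]
            side = default if v is None else ("left" if v < threshold else "right")
            (left if side == "left" else right).append(y)
        n = len(labels)
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or score < best[0]:
            best = (score, default)
    return best[1]

def route(split, x):
    """Route one sample through a split; missing values follow the default."""
    v = x[split["feature"]]
    if v is None:
        return split["default"]
    return "left" if v < split["threshold"] else "right"

# Hypothetical data: feature 0 is sometimes missing (None).
X = [{0: 1.0}, {0: None}, {0: 3.0}, {0: None}, {0: 5.0}]
y = [0, 0, 0, 1, 1]
split = {"feature": 0, "threshold": 4.0,
         "default": best_default(X, y, 0, 4.0)}
print(split["default"], [route(split, x) for x in X])
```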

Why a Mathematician, Statistician, & Machine Learner Solve the Same Problem Differently

At a glance, machine learning and statistics seem very similar, but the difference between the two disciplines is often understated. They share the same goal of modeling data, yet their methods are shaped by distinct cultures. To enable collaboration and knowledge creation, it is important to understand the fundamental differences reflected in the culture of each discipline. To gain a deeper understanding of these differences, we need to take a step back and look at their historical roots.

Avito Winner’s Interview: 1st place, Owen Zhang

It was no surprise to see Owen Zhang, currently ranked #1 on Kaggle, take first place in the Avito Context Ad Click competition. Owen used previous competition experience, domain knowledge, and a fondness for XGBoost to finish ahead of 455 other data scientists. The competition gave participants plenty of data to explore, with eight comprehensive relational tables on historical user browsing and search behavior, location, and more.
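
Owen's actual pipeline is not reproduced in the interview excerpt; purely as an illustration of the kind of model at its heart, here is a minimal XGBoost click-prediction sketch with synthetic placeholder data.

```python
# Minimal illustrative sketch, not Owen Zhang's competition pipeline:
# a basic gradient-boosted click model. Features and labels are
# synthetic placeholders for the competition's relational data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                  # e.g. user/ad/context features
y = (rng.random(1000) < 0.1).astype(int)   # clicks are rare

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 6, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=50)
print(booster.predict(dtrain)[:5])         # predicted click probabilities
```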

How Wearables, Analytics and the IoT Will Redefine the Enterprise of 2020

New technologies such as wearable computing, mobile apps, the Internet of Things (IoT), and data analytics are beginning to influence all aspects of our lives. As a consumer, you may feel that your applications are always a step ahead of you. Use the navigation app Waze at a certain time of day, and it knows you are heading home from the office, pre-populating the route.

Systematic Fraud Detection Through Automated Data Analytics in MATLAB

Systematic fraud detection presents several challenges. First, fraud detection methods require complex investigations that involve processing large amounts of heterogeneous data. The data comes from multiple sources and crosses multiple knowledge domains, including finance, economics, business, and law. Gathering and processing this data manually is prohibitively time-consuming as well as error-prone. Second, fraud is a 'needle in a haystack' problem: only a very small fraction of the data is likely to come from a fraudulent case. The vast quantity of regular data—that is, data produced from nonfraudulent sources—tends to drown out the cases of fraud. Third, fraudsters continually change their methods, which means that detection strategies are frequently several steps behind.
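
To make the 'needle in a haystack' point concrete, here is a minimal Python sketch (the article itself works in MATLAB) of one common mitigation: reweighting a classifier so the rare fraud class is not drowned out. The data below is synthetic.

```python
# Minimal sketch of handling class imbalance in fraud detection.
# Data is synthetic; in practice the features would come from the
# heterogeneous sources the article describes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)
n = 10_000
X = rng.normal(size=(n, 4))                 # hypothetical transaction features
y = (rng.random(n) < 0.01).astype(int)      # ~1% fraudulent cases
X[y == 1] += 1.5                            # fraud shifts the feature distribution

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency,
# so misclassifying a fraud case costs far more than a legitimate one.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```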

Generalized Linear Models in R, Part 7: Checking for Overdispersion in Count Regression

In my last blog post we fitted a generalized linear model to count data using a Poisson error structure. We found, however, that the data were over-dispersed: the conditional variance (residual variance) was larger than the conditional mean, whereas the Poisson model assumes the two are equal. One way to check for and deal with over-dispersion is to fit a quasi-Poisson model, which estimates an extra dispersion parameter to account for the additional variance. Now let's fit a quasi-Poisson model to the same data.
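
The post's own code is in R (glm(..., family = quasipoisson)). As a rough Python analogue, here is a minimal statsmodels sketch on simulated over-dispersed counts; scale="X2" rescales the standard errors by the Pearson-based dispersion estimate, which is what quasi-Poisson does.

```python
# Minimal sketch: check for over-dispersion, then fit a quasi-Poisson
# model. Data is simulated (negative binomial counts, so variance > mean),
# not the post's dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(size=500)
mu = np.exp(0.5 + 1.2 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # mean mu, variance mu + mu**2/2

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Rule of thumb: Pearson chi-square / residual df well above 1 suggests
# over-dispersion.
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(f"estimated dispersion: {dispersion:.2f}")

# Quasi-Poisson: same mean model, standard errors inflated by the
# estimated dispersion.
quasi_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")
print(quasi_fit.summary())
```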