Parametric Inference: Likelihood Ratio Test by Example
Hypothesis testing has been used extensively across many disciplines of science. In this post, I will discuss the basic theory behind one of its workhorses, the Likelihood Ratio Test (LRT).
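The mechanics of the LRT can be sketched in a few lines. This is a minimal illustration, not code from the post: it tests H0: mu = 0 for normal data with known variance, using the standard result that -2 log(likelihood ratio) is asymptotically chi-squared. The sample data are simulated and hypothetical.

```python
import numpy as np
from scipy import stats

# Simulated data (hypothetical): i.i.d. N(0.5, 1), so H0: mu = 0 is false.
rng = np.random.default_rng(42)
x = rng.normal(loc=0.5, scale=1.0, size=100)

def loglik(mu, x):
    """Log-likelihood of a N(mu, 1) sample."""
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=1.0))

mu_hat = x.mean()  # MLE of mu under the unrestricted alternative

# LRT statistic: -2 * (log L under H0 - log L under H1)
lrt_stat = -2 * (loglik(0.0, x) - loglik(mu_hat, x))

# Under H0, the statistic is asymptotically chi-squared with 1 degree
# of freedom (one restricted parameter), giving a p-value.
p_value = stats.chi2.sf(lrt_stat, df=1)
print(lrt_stat, p_value)
```

A small p-value means the restricted model (mu = 0) fits the data much worse than the unrestricted one, so H0 is rejected.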

Pachyderm. Let’s build a modern Hadoop
If you’ve been around the big data block, you’ve probably felt the pain of Hadoop, but we all still use it because we tell ourselves, ‘that’s just the way infrastructure software is.’ However, in the past decade, infrastructure tools ranging from NoSQL databases, to distributed deployment, to cloud computing have all advanced by orders of magnitude. Why have large-scale data analytics tools lagged behind? What makes projects like Redis, Docker and CoreOS feel modern and awesome while Hadoop feels ancient?

20 Great Ideas To Steal In 2015
1. ConocoPhillips Integrated Data Analysis Gets Results
2. FedEx Uses Texts To Max First-Try Deliveries
3. Allstate Gives Auto Insurance Telematics A Mobile Twist
4. Biogen Brings Augmented Reality To Posters
5. Toyota’s Sales Tools Let Dealers ‘Go Places’
6. 20th Century Fox Film Builds A Digital Platform
7. Royal Caribbean Brings High-Speed Internet On Board
8. La Quinta Texts Guests: Your Room Is Ready
9. Capital One Chops Policies That Scare Off Tech Talent
10. Brooklyn Library Checks Out Data Visualization
11. A Faster, Easier Way To Lease Property
12. Pfizer Builds Digital Expertise In-House
13. Safeguard Mobilizes Inspection Effort
14. MetroHealth Digitizes Cuyahoga Inmate Healthcare System
15. Rensselaer Polytech Research Creates a ‘Smart Lake’
16. Comcast Meets Customer Demand For Self-Service
17. Do It Best Corp. Automates Distribution
18. Controlled Substance Ordering Goes Mobile
19. PayPal Brings Agility To Its Data Center
20. Black & Veatch Tests Drones For Inspections

Amazon Machine Learning: use cases and a real example in Python
After using AWS Machine Learning for a few hours, I can definitely agree with this definition, although I still feel that too many developers have no idea what they could use machine learning for, as they lack the mathematical background to really grasp its concepts.

Improper applications of Principal Component Analysis on multimodal data
The odd thing here is that there are two obvious subpopulations of points, yet within each the slope of PC1 vs PC2 appears to be the same. This indicates a linear dependence between PC1 and PC2, with some other factor explaining the difference between the clusters. In the paper, this plot is only used to illustrate the relation between the samples. But if there were to be any additional analysis of the components in this PCA, this relation between PC1 and PC2 would tell us something is wrong.
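The effect is easy to reproduce on synthetic data. The sketch below (hypothetical data, not the paper's) builds two subpopulations that share the same internal variation but are offset from each other; PCA then spends its first component on the cluster offset rather than on any within-cluster structure:

```python
import numpy as np

# Hypothetical data: two subpopulations with identical within-cluster
# covariance, offset by 6 units along one feature.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 5))        # shared within-cluster variation
offset = np.zeros(5)
offset[0] = 6.0                        # between-cluster separation
labels = rng.integers(0, 2, size=n)
X = base + labels[:, None] * offset

# Plain PCA via SVD on the centred data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                 # PC1 and PC2 scores

# PC1 is dominated by the bimodal offset: the two clusters separate
# almost perfectly along it, regardless of any within-cluster signal.
sep = abs(scores[labels == 0, 0].mean() - scores[labels == 1, 0].mean())
print(sep)
```

Any downstream interpretation of PC1 as a single continuous factor would be misleading here, which is exactly the trap the post warns about.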

I Fought the (distribution) Law (and the Law did not win)
A few days ago, I was asked whether we should spend a lot of time choosing the distribution we use, in GLMs, for (actuarial) ratemaking. On that topic, I usually claim that the family is not the most important choice in the regression model.
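One way to see why the family matters less than one might expect: with a log link and a purely categorical rating factor, the score equations of the Poisson, Gamma, and Gaussian families all solve to the same fitted value per cell, namely the cell's sample mean. The family mainly changes the standard errors, not the point estimates. A minimal sketch with hypothetical claim data (the cell names and parameters are made up):

```python
import numpy as np

# Hypothetical severity data for two rating cells.
rng = np.random.default_rng(1)
claims = {
    "young": rng.gamma(shape=2.0, scale=300.0, size=500),
    "old":   rng.gamma(shape=2.0, scale=150.0, size=500),
}

for cell, y in claims.items():
    # For a saturated categorical model, the MLE of the cell mean mu
    # solves, per family:
    #   Poisson score:  sum(y - mu)     = 0  ->  mu = mean(y)
    #   Gamma score:    sum(y/mu - 1)   = 0  ->  mu = mean(y)
    #   Gaussian score: sum(y - mu)     = 0  ->  mu = mean(y)
    # i.e. identical fitted pure premiums whatever the family.
    print(cell, y.mean())
```

With continuous covariates the estimates no longer coincide exactly, but they typically stay close — which is the author's point that the family is a second-order choice.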

Awesome-R: A curated list of the best add-ons for R
One of the great things about R is that there’s so much available to use with it: there are several interfaces to choose from, thousands of add-on packages to extend its capabilities, hundreds of books and on-line tutorials — an abundance of riches to improve your R experience. But with that abundance comes a problem: how to find the best add-ons to R. Qin Wenfeng has taken the trouble to curate the best add-ons to R in their list, awesome-R: A curated list of awesome R frameworks, packages and software. The list provides several (but not too many!) recommendations for R users in the areas of IDEs, data manipulation packages, database integration frameworks, machine learning suites, R-related websites, and much more. While your favourite add-on might not be listed in any given category (I would have added ‘checkpoint’ to Reproducible Research, for example), on the whole the items listed merit their inclusion. If you find yourself overwhelmed with choices for R, this is a good place to start. And if you can’t find what you’re looking for, there’s still the Task Views.

Comparing Tree-Based Classification Methods via the Kaggle Otto Competition
In this post, I’m going to be looking at the progressive performance of different tree-based classification methods in R, using the Kaggle Otto Group Product Classification Challenge as an example. This competition challenges participants to correctly classify products into 1 of 9 classes based on 93 features. I’ll start with basic decision trees and move on to ensemble methods: bagging, random forests, and boosting.
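The same tree-to-ensemble progression can be sketched outside R. The example below uses Python with scikit-learn (not the author's R code) on simulated data standing in for the Otto set — 9 classes, 93 features — and compares a single decision tree against the three ensemble methods mentioned:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

# Hypothetical stand-in for the Otto data: 9 classes, 93 features.
X, y = make_classification(n_samples=1500, n_features=93,
                           n_informative=30, n_classes=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Held-out accuracy for each method; ensembles typically improve on
# the single tree by reducing variance (bagging, forests) or bias (boosting).
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
print(scores)
```

On the real Otto data the competition metric is multiclass log loss rather than accuracy, so the post's R comparison would use predicted class probabilities instead of hard labels.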