The Inspection Paradox is Everywhere

The inspection paradox is a common source of confusion, an occasional source of error, and an opportunity for clever experimental design. Most people are unaware of it, but like the cue marks that appear in movies to signal reel changes, once you notice it, you can’t stop seeing it. A common example is the apparent paradox of class sizes. Suppose you ask college students how big their classes are and average the responses. The result might be 56. But if you ask the school for the average class size, they might say 31. It sounds like someone is lying, but they could both be right.
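
To see how both numbers can be right, here is a minimal sketch with made-up class sizes (the three sizes below are assumptions for illustration, not the data behind 56 and 31). The school counts each class once, while a survey of students counts a class of size s a total of s times.

```python
# Hypothetical class sizes, invented for illustration only.
class_sizes = [10, 20, 90]

# The school's view: every class counts once.
school_average = sum(class_sizes) / len(class_sizes)

# The students' view: a class of size s is reported by s students,
# so each class is weighted by its own size.
student_average = sum(s * s for s in class_sizes) / sum(class_sizes)

print(f"average reported by the school: {school_average:.1f}")
print(f"average reported by students:   {student_average:.1f}")
```

The student-reported average is always at least as large as the school's figure, because big classes are oversampled in proportion to their size.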

R and Stan: introduction to Bayesian modeling

This series of posts shows how to install Stan for R, how to run it, and how to apply it to actual datasets. I hope it makes practicing Bayesian modeling easier than ever.

Data scientist hack to find the right Meetup groups (using Python)

Data Scientists are a breed of lazy animals! We detest the practice of doing any repeatable work manually. We cringe at the mere thought of doing tedious manual tasks, and when we come across one, we try to automate it so that the world becomes a better place! We have been running a few meetups across India for the last few months and wanted to see what some of the best meetups across the globe are doing. For a normal human, this would mean surfing through pages of meetups and finding out this information manually.

A Beginner’s Guide to Eigenvectors, PCA, Covariance and Entropy

This post introduces eigenvectors and their relationship to matrices in plain language and without a great deal of math. It builds on those ideas to explain covariance, principal component analysis, and information entropy. The eigen in eigenvector comes from German, and it means something like “very own.” For example, in German, “mein eigenes Auto” means “my very own car.” So eigen denotes a special relationship between two things. Something particular, characteristic and definitive. This car, or this vector, is mine and not someone else’s.

Random forest interpretation with scikit-learn

In one of my previous posts I discussed how random forests can be turned into a “white box”, such that each prediction is decomposed into a sum of contributions from each feature, i.e. prediction = bias + feature 1 contribution + … + feature n contribution.
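
The decomposition can be sketched for a single regression tree in plain Python. The tree, its node values, and the feature names below are hypothetical; the idea is that the bias is the value at the root, and each split along the decision path adds the change in node value to the feature it splits on. For a forest, one would average the biases and contributions over all trees.

```python
from collections import defaultdict

# A tiny hand-built regression tree (hypothetical values, for illustration).
# "value" is the mean training target of the samples reaching that node.
tree = {
    "value": 20.0, "feature": "rooms", "threshold": 3,
    "left": {"value": 12.0},           # rooms <= 3
    "right": {                         # rooms > 3
        "value": 28.0, "feature": "age", "threshold": 30,
        "left": {"value": 34.0},       # age <= 30
        "right": {"value": 22.0},      # age > 30
    },
}

def decompose(node, sample):
    """Return (prediction, bias, contributions) for one sample."""
    bias = node["value"]               # value at the root
    contributions = defaultdict(float)
    while "feature" in node:           # stop when a leaf is reached
        feature = node["feature"]
        branch = "left" if sample[feature] <= node["threshold"] else "right"
        child = node[branch]
        # Credit the change in node value to the feature used for the split.
        contributions[feature] += child["value"] - node["value"]
        node = child
    return node["value"], bias, dict(contributions)

prediction, bias, contributions = decompose(tree, {"rooms": 4, "age": 10})
print(prediction, bias, contributions)
```

By construction, the prediction equals the bias plus the sum of the per-feature contributions.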

Producing Error Estimates with Residual Modeling

In a previous post, we discussed the need for understanding the confidence of our housing price predictions. One strategy for doing this is building a confidence model. In this post, we will describe one possible solution and the way we use it at Opendoor. Specifically, we will describe an error model $\hat{g}$ that estimates the prediction errors from a housing valuation model $\hat{f}$ that predicts housing prices.
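
As a rough illustration of the idea (not Opendoor's actual method), the sketch below fits a model `f_hat` to synthetic data and then fits a second model `g_hat` to the absolute residuals of `f_hat`. The one-feature linear models, the synthetic data, and the choice of absolute residuals as the error target are all assumptions made for this example; in practice the error model would be fit on held-out predictions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data (invented): the noise magnitude grows with x.
xs = list(range(1, 11))
ys = [2 * x + (0.1 * x if i % 2 == 0 else -0.1 * x) for i, x in enumerate(xs)]

# Step 1: the valuation model f_hat predicts y from x.
f_slope, f_intercept = fit_line(xs, ys)

def f_hat(x):
    return f_slope * x + f_intercept

# Step 2: the error model g_hat predicts |y - f_hat(x)| from x.
abs_residuals = [abs(y - f_hat(x)) for x, y in zip(xs, ys)]
g_slope, g_intercept = fit_line(xs, abs_residuals)

def g_hat(x):
    return g_slope * x + g_intercept

for x in (1, 5, 10):
    print(f"x={x}: prediction={f_hat(x):.2f}, estimated error={g_hat(x):.2f}")
```

Because the synthetic noise grows with x, the estimated error from `g_hat` grows with x as well, which is the kind of per-prediction confidence signal the excerpt describes.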

CrowdFlower Competition Scripts: Approaching NLP

The CrowdFlower Search Results Relevance competition was a great opportunity for Kagglers to approach a tricky Natural Language Processing problem. With 1,326 teams, there was plenty of room for fierce competition and helpful collaboration. We pulled some of our favorite scripts that you’ll want to review before approaching your next NLP project or competition. Keep reading for more on:
• The instability of a quadratic weighted kappa metric
• How to use a stemmer and a lemmatizer
• Machine Learning Classification using Google Charts
• Set-based similarities (with a seaborn visualization)
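
As one example of a set-based similarity, here is a minimal sketch of the Jaccard index between the word sets of a search query and a product title. The query, the title, and the whitespace tokenization are assumptions for illustration and are not taken from the competition scripts.

```python
def jaccard(a, b):
    """Jaccard index of the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty strings
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical query/title pair.
query = "led christmas lights"
title = "Set of 10 Battery Operated LED Christmas Lights"
score = jaccard(query, title)
print(f"Jaccard similarity: {score:.3f}")
```

A score like this could serve as one feature among many in a relevance model; stemming or lemmatizing the tokens first would make the overlap less sensitive to word forms.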

Apache Drill Makes Big Data Analysis Easier for Everyone

The main advantage of Apache Drill is that it will significantly reduce the investment required for big data analysis. Enterprises no longer have a good reason to always invest in complex technology or skill sets just to access and analyse big data. With Apache Drill, big data analysis has become accessible to more people. Apache Drill seems to mark the beginning of a trend in which more tools and technologies follow suit by making big data analysis much easier. That will indeed be a defining moment in the history of big data.

R – ArcGIS Community

The R – ArcGIS Community is a community driven collection of free, open source projects making it easier and faster for R users to work with ArcGIS data, and ArcGIS users to leverage the analysis capabilities of R.

Importing Data Into R – Part Two

In this follow-up to “This R Data Import Tutorial Is Everything You Need-Part One”, DataCamp continues its comprehensive yet easy tutorial on quickly importing data into R, going from simple, flat text files to the more advanced SPSS and SAS files.

Using Azure as an R datasource, Part 4 – Pulling data from SQL Server to Linux

This post is the fourth in a series that covers pulling data from Microsoft SQL Server or MySQL/MariaDB on Azure to an R client on Windows or Linux. In the previous posts, we covered pulling data from SQL Server to Windows and from MySQL/MariaDB to both Windows and Linux. This time we’ll be pulling data from Microsoft SQL Server to an R client on Linux.

Visualising theoretical distributions of GLMs

I think Arthur’s 3D plots help to visualise what GLMs are conceptually about. They illustrate the theoretical distribution around the predictions. Let me say this again: the yellow areas in the charts above show not the 90% prediction interval, but the theoretical residual distribution if the models were true.

P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures

One of the most popular ways to correct for multiple testing is to estimate or control the false discovery rate. The false discovery rate attempts to quantify the fraction of the discoveries made that are false.
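
For context, the standard (non-adaptive) way to control the false discovery rate is the Benjamini-Hochberg step-up procedure, sketched below. This is the baseline procedure, not the adaptive variants the post discusses, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(sum(rejected), "discoveries at FDR level 0.05")
```

Note that five of these p-values fall below 0.05 on their own, but only the two smallest survive the correction.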