What is a Logit Function and Why Use Logistic Regression?
One of the big assumptions of linear models is that the residuals are normally distributed. This doesn’t mean that Y, the response variable, also has to be normally distributed, but it does have to be continuous, unbounded, and measured on an interval or ratio scale. Unfortunately, categorical response variables are none of these. No matter how many transformations you try, you’re just never going to get normal residuals from a model with a categorical response variable. There are a number of alternatives, though, and one of the most popular is logistic regression. In many ways, logistic regression is very similar to linear regression. One big difference, though, is the logit link function.
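To make the link function concrete, here is a minimal Python sketch (not from the original post) of the logit and its inverse, showing how an unbounded linear predictor gets mapped to a bounded probability; the coefficient values are made up for illustration:

```python
import numpy as np

def logit(p):
    """Logit link: maps a probability in (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit (sigmoid): maps any real number back into (0, 1)."""
    return 1 / (1 + np.exp(-x))

# The link is what lets an unbounded linear predictor model a bounded probability:
eta = 0.5 + 1.2 * 2.0   # hypothetical linear predictor b0 + b1 * x
p = inv_logit(eta)      # predicted probability, guaranteed to lie in (0, 1)
```

Because the logit is invertible, fitted coefficients live on the log-odds scale and predictions come back as valid probabilities, no matter how extreme the linear predictor gets.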

The Data Scientist: Elusive or Illusive?
A good data scientist:
1. Can design a data investigation and manage towards defined objectives…and knows an: Experienced project manager
2. Is comfortable with the fact that most data is a complete mess…and knows a: DBA familiar with the data sources
3. Can navigate the politics of turning data into meaning…and knows a: Committed senior executive project sponsor
4. Recognizes the limitation of their business understanding…and knows a: Subject matter expert from the relevant line(s) of business
5. Can code and develop algorithms, but doesn’t need to be a top notch developer…and knows a: Developer with experience in relevant technology stacks
6. Understands the statistical implications of analysis, but doesn’t need to be a statistician…and knows a: Statistician comfortable with extremely messy inputs
7. Is able to communicate why results of the analysis matter…and knows a: Traditionally trained management/strategy consultant

The Problem with Data Science
Data Science is about learning from data, often using Machine Learning and statistics. To do so, we can build statistical models that provide answers to our questions or make predictions based on data we have collected. Ideally, we build the model that most accurately describes our data, makes the best predictions, and provides the answers of interest. Once we have our dream model we just have to figure out how to fit it to data (i.e. do inference).
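As a toy illustration of that last step (not part of the original post), fitting a model to data can be as simple as maximum likelihood for a Bernoulli model, where the best-fitting probability is just the sample mean; the data values here are made up:

```python
import numpy as np

# Hypothetical collected data: 1 = event happened, 0 = it didn't.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Model: each observation is Bernoulli(p). "Fitting the model to data"
# (inference) means finding the p that maximizes the likelihood, which
# for the Bernoulli model is simply the sample mean.
p_hat = y.mean()
```

Real models rarely have a closed-form fit like this, which is exactly where the harder inference machinery the post alludes to comes in.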

Cohort Analysis That Helps You Look Ahead
For people who analyze customer behavior, the table above is a familiar one. This Mixpanel chart measures retention rates across different user cohorts. By moving down the table, you can see how retention is changing over time. This report, which shows how sticky your product is over time, has become one of the most important measures of health for many companies. It’s supported by a wide variety of tools, from out-of-the-box reporting tools like Google Analytics and KISSmetrics to data-collection services like Segment and Keen.io. Industry experts agree that this is the best tool for cohorting customers and measuring retention.
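A retention table like the one described can be built in a few lines of pandas; this is a minimal sketch with a hypothetical event log, not the Mixpanel computation itself:

```python
import pandas as pd

# Hypothetical event log: each row says a user from a signup cohort
# was active a given number of months after signing up.
events = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c", "c", "c", "d"],
    "cohort": ["Jan", "Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"],
    "months_since_signup": [0, 1, 2, 0, 1, 0, 1, 2, 0],
})

# Count distinct active users per cohort per month...
active = (events
          .groupby(["cohort", "months_since_signup"])["user"]
          .nunique()
          .unstack())

# ...and divide by each cohort's starting size to get retention rates.
retention = active.div(active[0], axis=0)
```

Each row of `retention` is one cohort, each column a month since signup, and reading down a column compares how newer cohorts retain relative to older ones.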

Microsoft Malware Winners’ Interview: 2nd place, Gert & Marios (aka KazAnova)
Marios & Gert, in a team of the same name, took 2nd place in the Microsoft Malware Classification Challenge. The two are regular teammates and previously won the Acquire Valued Shoppers Challenge together. This blog outlines their approach to the Malware competition and also gives us a closer look at how successful teams come together and collaborate on Kaggle.

Hello Stan!
In my previous post I discussed how Longley-Cook, an actuary at an insurance company in the 1950’s, used Bayesian reasoning to estimate the probability of a mid-air collision of two planes. Here I will use the same model to get started with Stan, a probabilistic programming language for Bayesian inference, via its R interface RStan.

Centering and Standardizing: Don’t Confuse Your Rows with Your Columns
R uses the generic scale( ) function to center and standardize variables in the columns of data matrices. The argument center=TRUE subtracts the column mean from each score in that column, and the argument scale=TRUE divides by the column standard deviation (both arguments default to TRUE). For instance, weight and height come in different units that can be compared more easily when transformed into standardized deviations. Since such a linear transformation does not alter the correlations among the variables, it is often recommended so that the relative effects of variables measured on different scales can be evaluated. However, this is not the case with the rows.
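The column-wise behavior of R’s scale( ) can be mimicked in a few lines of Python/NumPy; this is an illustrative sketch with made-up weight and height values, not code from the original post:

```python
import numpy as np

# Hypothetical data matrix: rows are people, columns are weight (kg) and height (cm).
X = np.array([[70.0, 170.0],
              [80.0, 180.0],
              [60.0, 160.0]])

# Column-wise centering and scaling, like R's scale(x, center=TRUE, scale=TRUE).
# ddof=1 matches R's sample standard deviation (n - 1 denominator).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```

After the transformation every column has mean 0 and standard deviation 1, so weight and height are on a common scale, while applying the same operation across rows would mix units and is exactly the mistake the post warns against.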

Scaling R clusters? AWS Spot Pricing is your new best friend
Most of us recall the notion of elasticity from Economics 101. Markets are about supply and demand, and when there is an abundance of supply, prices usually go down. Elasticity is a measure of how responsive one economic variable is to another, and in an elastic market the response is proportionately greater than the change in input. It turns out that cloud pricing, on the margin at least, is pretty elastic. Like bananas in a supermarket, CPU cycles are a perishable commodity. If capacity sits idle and doesn’t get used, it goes away. Your cloud provider are no dummies, and much like store owners that mark down the price of ripe bananas before they spoil, cloud providers would rather sell capacity for pennies on the dollar rather than have it go to waste. What does this have to do with R you ask? As datasets become large, and R jobs become more compute intensive, multi-node parallelism is the way to go. Unless you have access to a facility with large-scale computing resources, you will probably want to deploy a cloud-based cluster if you are running analysis or simulations at any scale.