Amazon Machine Learning – Make Data-Driven Decisions at Scale
Today we are introducing Amazon Machine Learning. This new AWS service helps you use all of that data you’ve been collecting to improve the quality of your decisions. You can build and fine-tune predictive models using large amounts of data, and then use Amazon Machine Learning to make predictions (in batch mode or in real time) at scale. You can benefit from machine learning even if you don’t have an advanced degree in statistics or the desire to set up, run, and maintain your own processing and storage infrastructure.

All about that “bias, bias, bias” (it’s no trouble)
At some point, everyone who fiddles around with Bayes factors with point nulls notices something that, at first blush, seems strange: small effect sizes seem ‘biased’ toward the null hypothesis. In null hypothesis significance testing, power against any fixed nonzero effect size simply increases as the sample size grows. With Bayes factors there is a non-monotonicity: for a small true effect size, increasing the sample size at first slightly increases the degree to which the data favor the null, and only beyond some point does the same small effect become evidence for the alternative. I recall puzzling over this with Jeff Rouder years ago when drafting our 2009 paper on Bayesian t tests.
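
To see the non-monotonicity concretely, here is a minimal sketch (mine, not the post’s) using the BayesFactor package: fix a small true effect size, evaluate the default JZS Bayes factor at the expected t statistic, and watch what happens as n grows.

library(BayesFactor)

d <- 0.1                          # a small, fixed true effect size
n <- c(10, 50, 100, 500, 1000, 5000, 10000)
t <- d * sqrt(n)                  # expected one-sample t statistic at each n

# ttest.tstat() returns the log of the default JZS Bayes factor (H1 over H0)
bf10 <- exp(sapply(seq_along(n), function(i) ttest.tstat(t = t[i], n1 = n[i])$bf))

data.frame(n, t, bf10)
# bf10 first drifts further below 1 (i.e. toward the null) as n grows,
# then eventually climbs above 1: the non-monotonicity described above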

What can be in an R data.frame column?
As an R programmer, have you ever wondered what can be in a data.frame column?
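
More than you might expect, it turns out. A quick illustration of my own (not the post’s code):

d <- data.frame(x = 1:2)

d$lst <- list(1:3, letters[1:5])   # a list column: each cell can hold anything
d$mat <- matrix(1:4, nrow = 2)     # a matrix column: one column of the
                                   # data.frame, two columns wide inside

str(d)       # x is an integer vector, lst a list, mat a 2-column matrix
d$lst[[2]]   # "a" "b" "c" "d" "e"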

Five Characteristics of the Big Data Bang
1. Data Type
2. Data Detail
3. Data Periodicity and Timeliness
4. Topic/Sector Focus
5. Geographic/Linguistic Focus

Visualising a Classification in High Dimension, part 2
A few weeks ago, I published a post on Visualising a Classification in High Dimension, based on the use of principal component analysis to get a projection onto the first two components. Following that post, I was wondering what could be done in the context of a classification on categorical covariates. A natural idea would be to consider a correspondence analysis, and to run a similar code.
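
A rough sketch of that idea, assuming the FactoMineR package and a data frame df of categorical covariates with a class label y (the names are placeholders, not the post’s code):

library(FactoMineR)

# df and y are placeholder names; multiple correspondence analysis
# is run on the categorical covariates only
mca <- MCA(df[, setdiff(names(df), "y")], graph = FALSE)

# project the observations on the first two MCA dimensions,
# coloured by class, mirroring the PCA-based plot of part 1
plot(mca$ind$coord[, 1:2], col = as.numeric(df$y), pch = 19,
     xlab = "Dim 1", ylab = "Dim 2")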

Classification with Categorical Variables (the fuzzy side)
The Gaussian and the (log) Poisson regressions share a very interesting property …

Ultimate guide for Data Exploration in Python using NumPy, Matplotlib and Pandas
Exploring data sets and developing a deep understanding of the data is one of the most important skills every data scientist should possess. People estimate that time spent on these activities can go as high as 80% of the project time in some cases. Python has been gaining a lot of ground as a preferred tool for data scientists lately, and for good reason: ease of learning, powerful libraries with C/C++ integration, production readiness, and integration with the web stack have all driven this move. In this guide, I will use NumPy, Matplotlib, Seaborn and Pandas to perform data exploration; these are powerful libraries for the job. The idea is to create a ready reference for some of the operations required frequently. I am using the IPython Notebook for the exploration, and would recommend the same for its natural fit with exploratory analysis.

New video series: Intro to machine learning with scikit-learn
Have you tried out a few Kaggle competitions, but you aren’t quite sure what you’re supposed to be doing? Or perhaps you’ve heard all the talk in the Kaggle forums about Python’s scikit-learn library, but you haven’t figured out how to take advantage of this powerful tool for machine learning? If so, this post is for you!

A blessing of dimensionality often observed in high-dimensional data sets
Tidy data sets have one observation per row and one variable per column. Using this definition, big data sets can be either:
1. Wide – a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.
2. Tall – a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis. (Both shapes are illustrated in the toy sketch below.)
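
A toy illustration of the two shapes (mine, not the post’s):

# wide: few observations, many measurements (e.g. genomics)
wide <- as.data.frame(matrix(rnorm(100 * 5000), nrow = 100))
dim(wide)   # 100 5000

# tall: many observations, few measurements (e.g. a large clinical trial)
tall <- data.frame(id = 1:1e6, treated = rbinom(1e6, 1, 0.5), outcome = rnorm(1e6))
dim(tall)   # 1000000 3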

Machine Learning 201: Does Balancing Classes Improve Classifier Performance?
The author investigates whether balancing classes improves performance for logistic regression, SVM, and random forests, and finds where it helps and where it does not.
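
For context, here is a generic sketch (my own, not the article’s code) of one common balancing tactic, down-sampling the majority class before fitting a logistic regression:

set.seed(1)
d <- data.frame(x = rnorm(1000))
d$y <- rbinom(1000, 1, plogis(-2 + d$x))      # imbalanced: mostly 0s

pos <- d[d$y == 1, ]
neg <- d[d$y == 0, ]
balanced <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

fit_raw <- glm(y ~ x, data = d,        family = binomial)
fit_bal <- glm(y ~ x, data = balanced, family = binomial)
coef(fit_raw); coef(fit_bal)   # balancing mostly shifts the intercept here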

Inside Deep Learning: Computer Vision With Convolutional Neural Networks
Deep Learning-powered image recognition is now performing better than human vision on many tasks. We examine how human and computer vision extract features from raw pixels, and explain why deep convolutional neural networks work so well.
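
The core operation behind those networks is easy to sketch. A toy version (mine, not the article’s code) of a single convolution layer: slide a small filter over an image, then apply a ReLU nonlinearity.

# naive 2-D convolution (valid padding, stride 1)
conv2d <- function(img, k) {
  kr <- nrow(k); kc <- ncol(k)
  out <- matrix(0, nrow(img) - kr + 1, ncol(img) - kc + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + kr - 1), j:(j + kc - 1)] * k)
  out
}

img  <- matrix(runif(64), 8, 8)                      # a tiny grayscale "image"
edge <- rbind(c(-1, -1, -1), c(0, 0, 0), c(1, 1, 1)) # horizontal-edge filter
feature_map <- pmax(conv2d(img, edge), 0)            # ReLU keeps positive responses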

Scale back or transform back multiple linear regression coefficients: Arbitrary case with ridge regression
Multiple regression is used by many practitioners. In this post we show how to scale continuous predictors and how to transform the regression coefficients back to the original scale. Scaled coefficients help us interpret the results better; the question of when to standardize the data is a different issue.
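
A minimal sketch of the back-transformation algebra (my own, using ordinary lm for clarity; the same algebra applies to ridge coefficients): if each predictor is standardized to (x - mean)/sd, divide each fitted slope by that predictor’s sd and adjust the intercept accordingly.

set.seed(42)
X <- data.frame(x1 = rnorm(100, 5, 2), x2 = rnorm(100, -3, 4))
y <- 1 + 2 * X$x1 - 0.5 * X$x2 + rnorm(100)

mu <- sapply(X, mean); s <- sapply(X, sd)
fit <- lm(y ~ scale(X))                   # fit on standardized predictors

b  <- coef(fit)[-1] / s                           # slopes, original scale
b0 <- coef(fit)[1] - sum(coef(fit)[-1] * mu / s)  # intercept, original scale

rbind(back_transformed = c(b0, b),
      fit_on_raw = coef(lm(y ~ x1 + x2, data = X)))  # the two rows agree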

Recreating the vaccination heatmaps in R
In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on Twitter and Facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as the one for the measles vaccine. How hard would it be to recreate an R version?
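
Not very, at least for a first pass. A minimal ggplot2 sketch of the idea (my guess at the approach, not the post’s code), where measles is a hypothetical long-format data frame with columns year, state and incidence:

library(ggplot2)

# measles is a hypothetical data frame: one row per state per year
ggplot(measles, aes(x = year, y = state, fill = incidence)) +
  geom_tile(colour = "white", size = 0.1) +
  scale_fill_gradientn(colours = c("#e7f0fa", "#6baed6", "#08306b"),
                       na.value = "grey90") +
  geom_vline(xintercept = 1963) +          # measles vaccine introduced in 1963
  labs(x = NULL, y = NULL, fill = "Cases per 100k")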