Finding the K in K-means by Parametric Bootstrap

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for estimating the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R. We also came upon another cool approach, in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of Gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.
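
To make the idea concrete, here is a minimal Python sketch of a parametric bootstrap test of k = 1 against k = 2 on one-dimensional data. It is not the mixtools boot.comp implementation: as a stand-in for a likelihood-ratio statistic it uses the ratio of within-cluster sums of squares, and the null model is a single Gaussian fitted to the data. The function names, the synthetic data, and the choice of 200 bootstrap replicates are all assumptions made for illustration.

```python
import random
import statistics


def best_two_split_ratio(xs):
    """Return WSS(best 2 clusters) / WSS(1 cluster) for 1-D data.

    In one dimension the optimal 2-means clusters are contiguous in
    sorted order, so trying every split point gives the exact optimum.
    """
    xs = sorted(xs)
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    wss1 = total_sq - total * total / n
    best = wss1
    left_sum = left_sq = 0.0
    for i in range(1, n):  # the left cluster holds the first i points
        left_sum += xs[i - 1]
        left_sq += xs[i - 1] ** 2
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        wss2 = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        best = min(best, wss2)
    return best / wss1


def bootstrap_p_value(xs, n_boot=200, seed=0):
    """Parametric bootstrap p-value for H0: one Gaussian cluster."""
    rng = random.Random(seed)
    # Fit the null model (k = 1): a single Gaussian.
    mu, sigma = statistics.fmean(xs), statistics.stdev(xs)
    observed = best_two_split_ratio(xs)
    hits = 0
    for _ in range(n_boot):
        # Simulate a data set of the same size from the fitted null model.
        sim = [rng.gauss(mu, sigma) for _ in xs]
        # A smaller ratio means two clusters explain the data better.
        if best_two_split_ratio(sim) <= observed:
            hits += 1
    return (hits + 1) / (n_boot + 1)


# Synthetic example: two well-separated groups.
rng = random.Random(42)
two_groups = [rng.gauss(0, 1) for _ in range(40)] + [rng.gauss(10, 1) for _ in range(40)]
p_value = bootstrap_p_value(two_groups)
print(p_value)
```

A small p-value means the observed improvement from one to two clusters is larger than what the single-Gaussian null typically produces, so k = 1 is rejected. To go on and test k = 2 against k = 3, the same loop would be repeated with a two-component null model.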

In research, especially medical research, we describe the characteristics of our study population in Table 1. Table 1 contains the mean for each continuous (scale) variable and the proportion for each categorical variable. For example, we might report that the mean systolic blood pressure in our study population is 145 mmHg, or that 30% of participants are smokers. It is called Table 1 because it is usually the first table in the manuscript.
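
As a small illustration of the two kinds of summaries, the Python sketch below computes a mean for a continuous variable and a proportion for a categorical one. The ten participants are made-up values chosen to reproduce the figures quoted above, and the field names sbp and smoker are hypothetical.

```python
import statistics

# Made-up study population (illustrative values only).
participants = [
    {"sbp": 150, "smoker": True},
    {"sbp": 140, "smoker": False},
    {"sbp": 145, "smoker": False},
    {"sbp": 160, "smoker": True},
    {"sbp": 130, "smoker": False},
    {"sbp": 155, "smoker": False},
    {"sbp": 135, "smoker": False},
    {"sbp": 150, "smoker": True},
    {"sbp": 140, "smoker": False},
    {"sbp": 145, "smoker": False},
]

# Continuous variable: report the mean.
mean_sbp = statistics.fmean([p["sbp"] for p in participants])

# Categorical variable: report the proportion.
smoker_share = sum(p["smoker"] for p in participants) / len(participants)

print(f"Systolic blood pressure, mean (mmHg): {mean_sbp:.0f}")
print(f"Smokers, %: {100 * smoker_share:.0f}")
```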

An Introduction to Time Series with JSON Data

For this post, I wanted to take the data analysis process in a different direction. Normally, an R analysis starts with data from a comma-separated file (.csv) or a tab-separated file (.txt). However, online data is often formatted in JSON, which stands for JavaScript Object Notation. JSON has different forms, but for this data, it consists of nested arrays in two main parts. One part is the meta-data header, and the other is the observations themselves.
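
A minimal Python sketch of reading such a two-part file might look like the following. The key names meta, columns and data, and the three observations, are assumptions invented for illustration; the actual layout of the post's data set is not shown here.

```python
import json
from datetime import date

# Hypothetical JSON document: a meta-data header plus nested arrays of observations.
raw = """
{
  "meta": {"columns": ["date", "value"]},
  "data": [["2016-03-01", 9.8], ["2016-01-01", 10.5], ["2016-02-01", 11.2]]
}
"""

doc = json.loads(raw)

# Use the header to name the fields of each observation row.
columns = doc["meta"]["columns"]
records = [dict(zip(columns, row)) for row in doc["data"]]

# Convert the date strings and sort chronologically to get a time series.
series = sorted((date.fromisoformat(r["date"]), r["value"]) for r in records)

for day, value in series:
    print(day.isoformat(), value)
```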

Central Limit Theorem

The central limit theorem is a fundamental theorem of statistics. It states that the sum of a sufficiently large number of independent and identically distributed random variables with finite variance approximately follows a normal distribution.
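
The statement can be checked with a short simulation. The Python sketch below sums 30 independent Uniform(0, 1) draws, repeats this 5000 times, and compares the results with what a normal distribution would predict. The choice of the uniform distribution, the sample sizes, and the seed are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(1)
n_terms = 30    # number of i.i.d. variables in each sum
n_sums = 5000   # number of simulated sums

# Each entry is a sum of n_terms independent Uniform(0, 1) draws.
sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(n_sums)]

mean = statistics.fmean(sums)
sd = statistics.stdev(sums)

# Theory: Uniform(0, 1) has mean 1/2 and variance 1/12.
expected_mean = n_terms * 0.5
expected_sd = (n_terms / 12) ** 0.5

# For a normal distribution, about 68.3% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(s - mean) <= sd for s in sums) / n_sums

print(f"mean {mean:.3f} (theory {expected_mean:.3f})")
print(f"sd   {sd:.3f} (theory {expected_sd:.3f})")
print(f"share within one sd: {within_one_sd:.3f}")
```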

Reverse Engineering with Correlated Features

In econometric modeling, I usually have a problem with correlated features. A few weeks ago, I was discussing feature selection when features are correlated. This week, I was wondering about reverse engineering when features might be correlated (not to say very correlated). The way I see reverse engineering is the following

Clusters of Texts

Another popular application of classification techniques is text mining (see e.g. an old post on French presidents' speeches). Consider the following example, inspired by Norbert Ryciak’s post, with 12 Wikipedia pages, on various topics,

Clusters of (French) Regions

For tomorrow's data science course, I just wanted to post some functions to illustrate cluster analysis. Consider the dataset of the French 2012 elections