One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample chapter) of our book Practical Data Science with R. We also came upon another cool approach in the mixtools package for mixture model analysis. As with clustering, if you want to fit a mixture model (say, a mixture of Gaussians) to your data, it helps to know how many components are in your mixture. The boot.comp function estimates the number of components (let’s call it k) by incrementally testing the hypothesis that there are k+1 components against the null hypothesis that there are k components, via parametric bootstrap.
In research, especially in medical research, we describe the characteristics of our study population in Table 1. Table 1 contains the mean for each continuous (scale) variable and the proportion for each categorical variable. For example, we might say that the mean systolic blood pressure in our study population is 145 mmHg, or that 30% of participants are smokers. It is called Table 1 because it is the first table in the manuscript.
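To make the two kinds of summary concrete, here is a minimal sketch in plain Python. The ten participants, the field names `sbp_mmhg` and `smoker`, and the helper `table_one` are all made up for illustration; the values were chosen only so that the result matches the example figures above.

```python
from statistics import mean

# Hypothetical study population: systolic blood pressure (continuous)
# and smoking status (categorical). These values are invented.
participants = [
    {"sbp_mmhg": 150, "smoker": True},
    {"sbp_mmhg": 140, "smoker": False},
    {"sbp_mmhg": 160, "smoker": True},
    {"sbp_mmhg": 135, "smoker": False},
    {"sbp_mmhg": 145, "smoker": False},
    {"sbp_mmhg": 155, "smoker": True},
    {"sbp_mmhg": 130, "smoker": False},
    {"sbp_mmhg": 150, "smoker": False},
    {"sbp_mmhg": 145, "smoker": False},
    {"sbp_mmhg": 140, "smoker": False},
]


def table_one(rows):
    """Return the mean of the continuous variable and the proportion
    of the categorical variable, as a Table 1 would report them."""
    sbp_mean = mean(row["sbp_mmhg"] for row in rows)
    smoker_share = sum(row["smoker"] for row in rows) / len(rows)
    return {"sbp_mean": sbp_mean, "smoker_share": smoker_share}


summary = table_one(participants)
print(f"Systolic blood pressure, mean (mmHg): {summary['sbp_mean']:.1f}")
print(f"Smokers: {summary['smoker_share']:.0%}")
```

A real Table 1 would usually also report a measure of spread (such as the standard deviation) and the counts behind each percentage; those are left out here to keep the sketch short.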
The central limit theorem is a fundamental theorem of statistics. It states that the sum of a sufficiently large number of independent and identically distributed random variables with finite variance approximately follows a normal distribution.
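A quick simulation is one way to see the theorem at work. The sketch below is only an illustration, using nothing beyond the Python standard library: it sums uniform random numbers, standardizes the sums, and checks how many fall within one standard deviation of zero (for a normal distribution this share is about 68%). The function name `standardized_sums`, the choice of the uniform distribution, and the sample sizes are assumptions made for this example.

```python
import random
import statistics


def standardized_sums(n_terms, n_sums, seed=0):
    """Draw n_sums sums of n_terms Uniform(0, 1) variables and
    standardize each sum to mean 0 and variance 1."""
    rng = random.Random(seed)
    mu = 0.5                    # mean of Uniform(0, 1)
    sigma = (1 / 12) ** 0.5     # standard deviation of Uniform(0, 1)
    sums = []
    for _ in range(n_sums):
        total = sum(rng.random() for _ in range(n_terms))
        sums.append((total - n_terms * mu) / (sigma * n_terms ** 0.5))
    return sums


z = standardized_sums(n_terms=30, n_sums=20_000)

# Share of standardized sums within one standard deviation of zero.
share_within_1sd = sum(abs(v) <= 1 for v in z) / len(z)

print(f"mean:  {statistics.mean(z):.3f}")
print(f"sd:    {statistics.pstdev(z):.3f}")
print(f"share within 1 sd: {share_within_1sd:.3f}")
```

Swapping `rng.random()` for another distribution with finite variance (and adjusting `mu` and `sigma` accordingly) should give a similar picture, although skewed distributions typically need more terms before the approximation looks good.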
In econometric modeling, I usually have a problem with correlated features. A few weeks ago, I was discussing feature selection when features are correlated. This week, I was wondering about reverse engineering when features might be correlated (if not highly correlated). The way I see reverse engineering is the following
Another popular application of classification techniques is text mining (see e.g. an old post on French presidents’ speeches). Consider the following example, inspired by Norbert Ryciak’s post, with 12 Wikipedia pages on various topics,
For tomorrow’s data science course, I just wanted to post some functions to illustrate cluster analysis. Consider the dataset of the French 2012 elections