Canopy Clustering Algorithm
The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000. It is often used as preprocessing step for the K-means algorithm or the Hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. The algorithm proceeds as follows, using two thresholds T_1 (the loose distance) and T_2 (the tight distance), where T_1 > T_2 .
1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new ‘canopy’.
3. For each point left in the set, assign it to the new canopy if the distance less than the loose distance T_1.
4. If the distance of the point is additionally less than the tight distance T_2, remove it from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
6. These relatively cheaply clustered canopies can be sub-clustered using a more expensive but accurate algorithm.
An important note is that individual data points may be part of several canopies. As an additional speed-up, an approximate and fast distance metric can be used for 3, where a more accurate and slow distance metric can be used for step 4.
Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability for high-dimensional data is limited by the curse of dimensionality. Only when a cheap and approximative – low-dimensional – distance function is available, the produced canopies will preserve the clusters produced by K-means. …
Grouped Merging Net (GM-Net)
Deep Convolutional Neural Networks (CNNs) are capable of learning unprecedentedly effective features from images. Some researchers have struggled to enhance the parameters’ efficiency using grouped convolution. However, the relation between the optimal number of convolutional groups and the recognition performance remains an open problem. In this paper, we propose a series of Basic Units (BUs) and a two-level merging strategy to construct deep CNNs, referred to as a joint Grouped Merging Net (GM-Net), which can produce joint grouped and reused deep features while maintaining the feature discriminability for classification tasks. Our GM-Net architectures with the proposed BU_A (dense connection) and BU_B (straight mapping) lead to significant reduction in the number of network parameters and obtain performance improvement in image classification tasks. Extensive experiments are conducted to validate the superior performance of the GM-Net than the state-of-the-arts on the benchmark datasets, e.g., MNIST, CIFAR-10, CIFAR-100 and SVHN. …
Convex Hierarchical Testing (CHT)
We consider the testing of all pairwise interactions in a two-class problem with many features. We devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect. The test is based on a convex optimization framework that seamlessly considers main effects and interactions together. …
Canopy Clustering Algorithm