Multi-Pivot Quicksort refers to variants of classical quicksort where in the partitioning step $k$ pivots are used to split the input into $k + 1$ segments. For many years, multi-pivot quicksort was regarded as impractical, but in 2009 a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the standard sorting algorithm in Sun’s Java 7. In 2014 at ALENEX, Kushagra et al. introduced an even faster algorithm that uses three pivots. This paper studies what possible advantages multi-pivot quicksort might offer in general. The contributions are as follows: Natural comparison-optimal algorithms for multi-pivot quicksort are devised and analyzed. The analysis shows that the benefits of using multiple pivots with respect to the average comparison count are marginal and these strategies are inferior to simpler strategies such as the well known median-of-$k$ approach. A substantial part of the partitioning cost is caused by rearranging elements. A rigorous analysis of an algorithm for rearranging elements in the partitioning step is carried out, observing mainly how often array cells are accessed during partitioning. The algorithm behaves best if 3 or 5 pivots are used. Experiments show that this translates into good cache behavior and is closest to predicting observed running times of multi-pivot quicksort algorithms. Finally, it is studied how choosing pivots from a sample affects sorting cost.
Functional principal component analysis is one the most commonly employed approaches in functional/longitudinal data analysis and we extend it to conduct $d$-dimensional functional/longitudinal data analysis. The computational issues emerging in the extension are fully addressed with our proposed solutions. The local linear smoothing technique is employed to perform estimation because of its capabilities of performing large-scale smoothing and of handling data with different sampling schemes (possibly on irregular domain) in addition to its nice theoretical properties. Besides taking the fast Fourier transform strategy in smoothing, the modern GPGPU (general-purpose computing on graphics processing units) architecture is applied to perform parallel computation to save computation time. To resolve the out-of-memory issue due to large-scale data, the random projection procedure is applied in the eigen-decomposition step. We show that the proposed estimators can achieve the classical nonparametric rates for longitudinal data and the optimal convergence rates for functional data if the number of observations per sample is of the order $(n/ \log n)^{d/4}$. Finally, the performance of our approach is demonstrated with simulation studies and the fine particulate matter (PM 2.5) data measured in Taiwan.
Automatic narration of events and entities is the need of the hour, especially when live reporting is critical and volume of information to be narrated is huge. This paper discusses the challenges in this context, along with the algorithms used to build such systems. From a systematic study, we can infer that most of the work done in this area is related to statistical data. It was also found that subjective evaluation or contribution of experts is also limited for narration context.
Algebraic Subspace Clustering (ASC) is a simple and elegant method based on polynomial fitting and differentiation for clustering noiseless data drawn from an arbitrary union of subspaces. In practice, however, ASC is limited to equi-dimensional subspaces because the estimation of the subspace dimension via algebraic methods is sensitive to noise. This paper proposes a new ASC algorithm that can handle noisy data drawn from subspaces of arbitrary dimensions. The key ideas are (1) to construct, at each point, a decreasing sequence of subspaces containing the subspace passing through that point; (2) to use the distances from any other point to each subspace in the sequence to construct a subspace clustering affinity, which is superior to alternative affinities both in theory and in practice. Experiments on the Hopkins 155 dataset demonstrate the superiority of the proposed method with respect to sparse and low rank subspace clustering methods.
We consider the problem of outlier rejection in single subspace learning. Classical approaches work directly with a low-dimensional representation of the subspace. Our approach works with a dual representation of the subspace and hence aims to find its orthogonal complement. We pose this problem as an $\ell_1$-minimization problem on the sphere and show that, under certain conditions on the distribution of the data, any global minimizer of this non-convex problem gives a vector orthogonal to the subspace. Moreover, we show that such a vector can still be found by relaxing the non-convex problem with a sequence of linear programs. Experiments on synthetic and real data show that the proposed approach, which we call Dual Principal Component Pursuit (DPCP), outperforms state-of-the art methods, especially in the case of high-dimensional subspaces.
Topic modeling is a very powerful technique in data analysis and data mining but it is generally slow. Many parallelization approaches have been proposed to speed up the learning process. However, they are usually not very efficient because of the many kinds of overhead, especially the load-balancing problem. We address this problem by proposing three partitioning algorithms, which either run more quickly or achieve better load balance than current partitioning algorithms. These algorithms can easily be extended to improve parallelization efficiency on other topic models similar to LDA, e.g., Bag of Timestamps, which is an extension of LDA with time information. We evaluate these algorithms on two popular datasets, NIPS and NYTimes. We also build a dataset containing over 1,000,000 scientific publications in the computer science domain from 1951 to 2010 to experiment with Bag of Timestamps parallelization, which we design to demonstrate the proposed algorithms’ extensibility. The results strongly confirm the advantages of these algorithms.