How Good is Multi-Pivot Quicksort?

Multi-Pivot Quicksort refers to variants of classical quicksort where in the partitioning step k pivots are used to split the input into k + 1 segments. For many years, multi-pivot quicksort was regarded as impractical, but in 2009 a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the standard sorting algorithm in Sun’s Java 7. In 2014 at ALENEX, Kushagra et al. introduced an even faster algorithm that uses three pivots. This paper studies what possible advantages multi-pivot quicksort might offer in general. The contributions are as follows: Natural comparison-optimal algorithms for multi-pivot quicksort are devised and analyzed. The analysis shows that the benefits of using multiple pivots with respect to the average comparison count are marginal and these strategies are inferior to simpler strategies such as the well known median-of-k approach. A substantial part of the partitioning cost is caused by rearranging elements. A rigorous analysis of an algorithm for rearranging elements in the partitioning step is carried out, observing mainly how often array cells are accessed during partitioning. The algorithm behaves best if 3 or 5 pivots are used. Experiments show that this translates into good cache behavior and is closest to predicting observed running times of multi-pivot quicksort algorithms. Finally, it is studied how choosing pivots from a sample affects sorting cost.

Multi-dimensional Functional Principal Component Analysis

Functional principal component analysis is one the most commonly employed approaches in functional/longitudinal data analysis and we extend it to conduct d-dimensional functional/longitudinal data analysis. The computational issues emerging in the extension are fully addressed with our proposed solutions. The local linear smoothing technique is employed to perform estimation because of its capabilities of performing large-scale smoothing and of handling data with different sampling schemes (possibly on irregular domain) in addition to its nice theoretical properties. Besides taking the fast Fourier transform strategy in smoothing, the modern GPGPU (general-purpose computing on graphics processing units) architecture is applied to perform parallel computation to save computation time. To resolve the out-of-memory issue due to large-scale data, the random projection procedure is applied in the eigen-decomposition step. We show that the proposed estimators can achieve the classical nonparametric rates for longitudinal data and the optimal convergence rates for functional data if the number of observations per sample is of the order (n/ \log n)^{d/4}. Finally, the performance of our approach is demonstrated with simulation studies and the fine particulate matter (PM 2.5) data measured in Taiwan.

Narrative Science Systems: A Review

Automatic narration of events and entities is the need of the hour, especially when live reporting is critical and volume of information to be narrated is huge. This paper discusses the challenges in this context, along with the algorithms used to build such systems. From a systematic study, we can infer that most of the work done in this area is related to statistical data. It was also found that subjective evaluation or contribution of experts is also limited for narration context.

Filtrated Spectral Algebraic Subspace Clustering

Algebraic Subspace Clustering (ASC) is a simple and elegant method based on polynomial fitting and differentiation for clustering noiseless data drawn from an arbitrary union of subspaces. In practice, however, ASC is limited to equi-dimensional subspaces because the estimation of the subspace dimension via algebraic methods is sensitive to noise. This paper proposes a new ASC algorithm that can handle noisy data drawn from subspaces of arbitrary dimensions. The key ideas are (1) to construct, at each point, a decreasing sequence of subspaces containing the subspace passing through that point; (2) to use the distances from any other point to each subspace in the sequence to construct a subspace clustering affinity, which is superior to alternative affinities both in theory and in practice. Experiments on the Hopkins 155 dataset demonstrate the superiority of the proposed method with respect to sparse and low rank subspace clustering methods.

Dual Principal Component Pursuit

We consider the problem of outlier rejection in single subspace learning. Classical approaches work directly with a low-dimensional representation of the subspace. Our approach works with a dual representation of the subspace and hence aims to find its orthogonal complement. We pose this problem as an \ell_1-minimization problem on the sphere and show that, under certain conditions on the distribution of the data, any global minimizer of this non-convex problem gives a vector orthogonal to the subspace. Moreover, we show that such a vector can still be found by relaxing the non-convex problem with a sequence of linear programs. Experiments on synthetic and real data show that the proposed approach, which we call Dual Principal Component Pursuit (DPCP), outperforms state-of-the art methods, especially in the case of high-dimensional subspaces.

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization

Topic modeling is a very powerful technique in data analysis and data mining but it is generally slow. Many parallelization approaches have been proposed to speed up the learning process. However, they are usually not very efficient because of the many kinds of overhead, especially the load-balancing problem. We address this problem by proposing three partitioning algorithms, which either run more quickly or achieve better load balance than current partitioning algorithms. These algorithms can easily be extended to improve parallelization efficiency on other topic models similar to LDA, e.g., Bag of Timestamps, which is an extension of LDA with time information. We evaluate these algorithms on two popular datasets, NIPS and NYTimes. We also build a dataset containing over 1,000,000 scientific publications in the computer science domain from 1951 to 2010 to experiment with Bag of Timestamps parallelization, which we design to demonstrate the proposed algorithms’ extensibility. The results strongly confirm the advantages of these algorithms.

Turán numbers of extensions

Brownian motion correlation in the peanosphere for $κ> 8$

NEMO5: Achieving High-end Internode Communication for Performance Projection Beyond Moore’s Law

A Boundedness Trichotomy for the Stochastic Heat Equation

Brownian Occupation Measures, Compactness and Large Deviations: Pair Interaction

Detecting Unspecified Structure in Low-Count Images

Spectral Partitioning with Blends of Eigenvectors

Moment Varieties of Gaussian Mixtures

Convergence Theorems for Generalized Functional Sequences of Discrete-Time Normal Martingales

Low-rank diffusion matrix estimation for high-dimensional time-changed Lévy processes

What survives of many-body localization in the presence of dissipation

Monte Carlo Dynamically Weighted Importance Sampling For Finite Element Model Updating

Subtree Isomorphism Revisited

Layer-Specific Adaptive Learning Rates for Deep Networks

On the Hamilton-Waterloo Problem with triangle factors and $C_{3x}$-factors

Telemedicine as a special case of Machine Translation

Predictive partitioning for efficient BFS traversal in social networks

New fast divide-and-conquer algorithms for the symmetric tridiagonal eigenvalue problem

An Improved Randomized Data Structure for Dynamic Graph Connectivity

Application of Stochastic Mesh Method to Efficient Approximation of CVA

A Grassmann algebra for matroids

Martin boundary of unbounded sets for purely discontinuous Feller processes

Accessibility, Martin boundary and minimal thinness for Feller processes in metric measure spaces

Scale invariant boundary Harnack principle at infinity for Feller processes

The Scaling Limit of Superreplication Prices with Small Transaction Costs in the Multivariate Case

The sign-sequence constant of the plane

Internally Perfect Matroids

On functional records and champions

Mixture Models: Building a Parameter Space

Towards Cleaning-up Open Data Portals: A Metadata Reconciliation Approach

Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

Enumeration of lozenge tilings of a hexagon with a maximal staircase and a unit triangle removed

Enumeration of lozenge tilings of halved hexagons with a boundary defect

A two-component normal mixture alternative to the Fay-Herriot model

From Microscopic Heterogeneity to Macroscopic Complexity in the Contrarian Voter Model

Rank one non-Hermitian perturbations of Hermitian $β$-ensembles

Online Markov decision processes with policy iteration

Defects, disorder and strong electron correlations in orbital degenerate, doped Mott insulators

Graph polynomials and link invariants as positive type functions on Thompson’s group F

Types of signature analysis in reliability based on Hilbert series

Correlation between local structure and dynamic heterogeneity in a metallic glass-forming liquid

A Method for Avoiding Data Disclosure While Automatically Preserving Multivariate Relations

Sketch-based Manga Retrieval using Manga109 Dataset

Total positivity for the Lagrangian Grassmannian

Robust Learning for Optimal Treatment Decision with NP-Dimensionality

Equitable Decompositions of Graphs

Extreme Value Laws for non stationary processes generated by sequential and random dynamical systems

Group-Invariant Subspace Clustering

Stopping time property of thresholds of Storey-type FDR procedures

Processing Regular Path Queries on Arbitrarily Distributed Data

Explicit solutions to a vector time series model and its induced model for business cycles

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

PRIMAL: PRofIt Maximization Avatar pLacement for Mobile Edge Computing

Priors for High-Dimensional Sparse Poisson Means

Generating functions for descents over permutations which avoid sets of consecutive patterns

A Fibonacci analogue of Stirling numbers

Asymptotic Lower Bounds for Optimal Tracking: a Linear Programming Approach

Particle-hole symmetry, many-body localization, and topological edge modes