A Sparse PCA Approach to Clustering

We discuss a clustering method for the Gaussian mixture model based on sparse principal component analysis (SPCA) and compare it with the IF-PCA method. We also discuss the dependent case, where the covariance matrix $\Sigma$ is not necessarily diagonal.


Authorship Attribution Using a Neural Network Language Model

In practice, training language models for individual authors is often expensive because of limited data resources. In such cases, Neural Network Language Models (NNLMs) generally outperform traditional non-parametric N-gram models. Here we investigate the performance of a feed-forward NNLM on an authorship attribution problem with a moderate author set size and relatively limited data. We also consider how text topics affect performance. Compared with a well-constructed N-gram baseline with Kneser-Ney smoothing, the proposed method achieves a nearly 2.5% reduction in perplexity and increases author classification accuracy by 3.43% on average, given as few as 5 test sentences. The performance is very competitive with the state of the art in terms of accuracy and demand on test data. The source code, preprocessed datasets, and a detailed description of the methodology and results are available at https://…/authorship-attribution.
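
For readers unfamiliar with the metric, the following is a minimal sketch of how perplexity and a relative perplexity reduction are conventionally computed from per-token probabilities. It is not the authors' code; the function names `perplexity` and `relative_reduction` are hypothetical and chosen only for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity of a token sequence given per-token model probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def relative_reduction(baseline_ppl, model_ppl):
    """Fractional perplexity reduction of a model relative to a baseline."""
    return (baseline_ppl - model_ppl) / baseline_ppl
```

Under this definition, a model that assigns probability 0.25 to each token of a sequence has perplexity 4, and lower values indicate a better fit to the test text.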


Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing

A/B testing is one of the most successful applications of statistical theory in the modern Internet age. One problem with Null Hypothesis Statistical Testing (NHST), the backbone of A/B testing methodology, is that experimenters are not allowed to continuously monitor the results and make decisions in real time. Many people see this restriction as a setback against the technology trend toward real-time data analytics. Recently, Bayesian hypothesis testing, which is intuitively more suitable for real-time decision making, has attracted growing interest as an alternative to NHST. While corrections of NHST for the continuous monitoring setting are well established in the literature and known in the A/B testing community, whether continuous monitoring is a proper practice in Bayesian testing is still debated among both academic researchers and general practitioners. In this paper, we formally prove the validity of Bayesian testing with continuous monitoring when proper stopping rules are used, and illustrate the theoretical results with concrete simulations. We point out common bad practices where stopping rules are not proper and also compare our methodology to NHST corrections. General guidelines for researchers and practitioners are also provided.
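
As a rough illustration of continuous monitoring in a Bayesian test, the sketch below updates the posterior odds between two point hypotheses about a conversion rate after every observation and stops once the posterior probability of either hypothesis reaches a threshold. Everything here is an assumption made for illustration: the point hypotheses `p0` and `p1`, the threshold, the prior, and the function name `monitor` are hypothetical, and this is not claimed to be the stopping rule analyzed in the paper.

```python
import math

def monitor(observations, p0=0.5, p1=0.6, threshold=0.95, prior_h1=0.5):
    """Update posterior odds of H1 vs H0 after each observation; stop at threshold.

    Returns (decision, n): decision is "H0", "H1", or None if never stopped,
    and n is the number of observations consumed.
    """
    # Work in log-odds so repeated updates stay numerically stable.
    log_odds = math.log(prior_h1 / (1.0 - prior_h1))
    bound = math.log(threshold / (1.0 - threshold))
    n = 0
    for n, x in enumerate(observations, start=1):
        if x:
            log_odds += math.log(p1 / p0)              # a conversion favours H1
        else:
            log_odds += math.log((1.0 - p1) / (1.0 - p0))  # a miss favours H0
        if log_odds >= bound:
            return "H1", n
        if log_odds <= -bound:
            return "H0", n
    return None, n
```

The stopping decision depends only on the posterior computed from the data observed so far, which is the kind of rule the monitoring question is about; a rule that peeks at other information would need separate justification.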


Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data

Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because an underlying evolutionary process repeatedly creates more complex strings from simpler substrings. We propose a framework, referred to as ‘Lexis’, that produces an optimized hierarchical representation of a given set of ‘target’ strings. The resulting hierarchy, ‘Lexis-DAG’, shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After proving its NP-hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of Lexis-DAGs. We also consider the problem of identifying the set of intermediate nodes (substrings) that collectively form the ‘core’ of a Lexis-DAG, which is important in the analysis of Lexis-DAGs. We show that the Lexis framework can be applied in diverse applications such as optimized synthesis of DNA fragments in genomic libraries, hierarchical structure discovery in protein sequences, dictionary-based text compression, and feature extraction from a set of documents.
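
To make the edge-count objective concrete, here is a small sketch of how a shared intermediate substring can reduce the number of DAG edges needed to build a set of targets. The dictionary representation, the node names `T1`, `T2`, and `N`, and the convention of counting one edge per child occurrence are all assumptions for illustration; the sketch does not implement the paper's greedy algorithm.

```python
def expand(node, dag):
    """Recursively concatenate a node's children; leaves are plain characters."""
    if node not in dag:
        return node
    return "".join(expand(child, dag) for child in dag[node])

def edge_count(dag):
    """Count one edge per child occurrence in every node's child list."""
    return sum(len(children) for children in dag.values())

# Flat representation: each target is built directly from its characters.
flat = {"T1": list("abcabc"), "T2": list("abcd")}

# Hierarchical representation: the intermediate node N = "abc" is reused.
shared = {"N": list("abc"), "T1": ["N", "N"], "T2": ["N", "d"]}
```

Both dictionaries expand to the same targets, `abcabc` and `abcd`, but the flat version uses 10 edges while the version with the shared node uses 7.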


Topological classification of interacting 1D Floquet phases

Primal-Dual Rates and Certificates

The variation and Kantorovich distances between distributions of polynomials and a fractional analog of the Hardy–Landau–Littlewood inequality

Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters

Low rank tensor recovery via iterative hard thresholding

BioSpaun: A large-scale behaving brain model with complex neurons

Patterns of Scalable Bayesian Inference

Work-Efficient Parallel and Incremental Graph Connectivity

A Unified Monte-Carlo Jackknife for Small Area Estimation after Model Selection

A phase transition in excursions from infinity of the ‘fast’ fragmentation-coalescence process

Monte Carlo Markov Chains for sampling Strongly Rayleigh distributions and Determinantal Point Processes

The Computation of Key Properties of Markov Chains via Perturbations

Peak Criterion for Choosing Gaussian Kernel Bandwidth in Support Vector Data Description

Scheduling MapReduce Jobs under Multi-Round Precedences

Anomaly Detection in Clutter using Spectrally Enhanced Ladar

Archimedes’ quadrature of the parabola and minimal covers

Multifractality and Laplace spectrum of horizontal visibility graphs constructed from fractional Brownian motions

Choice by Elimination via Deep Neural Networks

Long Range Stress Correlations in the Inherent Structures of Liquids at Rest

Density and Glass Forming Ability in Amorphous Atomic Alloys: the Role of the Particle Softness

On critical points of random polynomials and spectrum of certain products of random matrices

Simulation Study of an Energy-Efficient Time Synchronization Scheme based on Source Clock Frequency Recovery in Asymmetric Wireless Sensor Networks

Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding

Large Scale Kernel Learning using Block Coordinate Descent

Modeling Dependence Dynamics of Air Pollution: Pollution Risk Simulation and Prediction of PM$_{2.5}$ Levels

Relative Error Embeddings for the Gaussian Kernel Distance

Recommendations as Treatments: Debiasing Learning and Evaluation

Ramsey numbers of uniform loose paths and cycles

Cross-Language Domain Adaptation for Classifying Crisis-Related Short Messages

Simple average-case lower bounds for approximate near-neighbor from isoperimetric inequalities

Online optimization and regret guarantees for non-additive long-term constraints

Modeling CD4+ T cells dynamics in HIV-infected patients receiving repeated cycles of exogenous Interleukin 7

On the lengths of zigzags in thin complexes

11 x 11 Domineering is Solved: The first player wins

Diffusion of innovation in large scale graphs

Ricci curvature bounds for weakly interacting Markov chains

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

Low-Rank Factorization of Determinantal Point Processes for Recommendation

Distributed Strong Diameter Network Decomposition

Cell segmentation with random ferns and graph-cuts

Classification of quasi-symmetric 2-(64,24,46) designs of Blokhuis-Haemers type

Inverse Reinforcement Learning in Swarm Systems

On Stochastic Comparisons for Load-Sharing Series and Parallel Systems

Some Reliability Properties of Transformed-Transformer Family of Distributions

Heterogeneity Adjustment with Applications to Graphical Model Inference

New perspective on sampling-based motion planning via random geometric graphs

Siladić’s theorem: weighted words, refinement and companion

Auxiliary Deep Generative Models

Special Operations Forces: A Global Immune System?

Is comonotonicity a good property for risk measures?

Convergence of Imprecise Continuous-Time Markov Chains

Asymptotic behavior of the Anderson polymer in a fractional Brownian environment

Cluster automorphisms and the marked exchange graphs of skew-symmetrizable cluster algebras

HeSP: a simulation framework for solving the task scheduling-partitioning problem on heterogeneous architectures

Endomorphisms of Cuboidal Hamming Graphs, Latin Hypercuboids of Class $r$, and Mixed MDS Codes

Central limit theorems for functionals of large dimensional sample covariance matrix and mean vector in matrix-variate skewed model

Eigen-Epistasis for detecting Gene-Gene interactions

Extreme robustness of scaling in sample space reducing processes explains Zipf’s law in diffusion on directed networks

Doppelgängers: Bijections of Plane Partitions

Alpha-CIR Model with Branching Processes in Sovereign Interest Rate Modelling

On metric graphs with prescribed gonality

Fault and Byzantine Tolerant Self-stabilizing Mobile Robots Gathering – Feasibility Study –

Differentiated latency in data center networks with erasure coded files through traffic engineering

Equiangular tight frames from hyperovals

Lower bounds for moments of global scores of pairwise Markov chains

Robust Kernel (Cross-) Covariance Operators in Reproducing Kernel Hilbert Space toward Kernel Methods

A multivariate CLT in Wasserstein distance with near optimal convergence rate

Multi-layer Representation Learning for Medical Concepts