Application of Quantum Annealing to Training of Deep Neural Networks

In Deep Learning, a well-known approach for training a Deep Neural Network starts by training a generative Deep Belief Network model, typically using Contrastive Divergence (CD), then fine-tuning the weights using backpropagation or other discriminative techniques. However, the generative training can be time-consuming due to the slow mixing of Gibbs sampling. We investigated an alternative approach that estimates model expectations of Restricted Boltzmann Machines using samples from a D-Wave quantum annealing machine. We tested this method on a coarse-grained version of the MNIST data set. In our tests we found that the quantum sampling-based training approach achieves comparable or better accuracy with significantly fewer iterations of generative training than conventional CD-based training. Further investigation is needed to determine whether similar improvements can be achieved for other data sets, and to what extent these improvements can be attributed to quantum effects.

Submodularity in Statistics: Comparing the Success of Model Selection Methods

We demonstrate the usefulness of submodularity in statistics. Greedy algorithms such as forward stepwise regression and the lasso perform well in situations that can be characterized by submodularity. In particular, submodularity of the coefficient of determination, R^2, provides a natural way to analyze the effects of collinearity on model selection. In model selection, we encounter the search problem of identifying a subset of k covariates with predictive loss close to that of the best model of k covariates. Submodularity arises naturally in this setting due to its deep roots within combinatorial optimization. It provides structural results for discrete convexity as well as guarantees for the success of greedy algorithms. In statistics, submodularity isolates cases in which collinearity makes the choice of model features difficult from those in which this task is routine. Submodularity of R^2 is closely related to other statistical assumptions used in variable screening and proving the performance of the Lasso. This has important implications for the use of model selection procedures: the situations in which forward stepwise and Lasso are successful are closely related.

Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network

Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for tagging sequential data, e.g. speech utterances or handwritten documents. While word embedding has been demoed as a powerful representation for characterizing the statistical properties of natural language. In this study, we propose to use BLSTM-RNN with word embedding for part-of-speech (POS) tagging task. When tested on Penn Treebank WSJ test set, a state-of-the-art performance of 97.40 tagging accuracy is achieved. Without using morphological features, this approach can also achieve a good performance comparable with the Stanford POS tagger.

High Performance Latent Variable Models

Latent variable models have accumulated a considerable amount of interest from the industry and academia for their versatility in a wide range of applications. A large amount of effort has been made to develop systems that is able to extend the systems to a large scale, in the hope to make use of them on industry scale data. In this paper, we describe a system that operates at a scale orders of magnitude higher than previous works, and an order of magnitude faster than state-of-the-art system at the same scale, at the same time showing more robustness and more accurate results. Our system uses a number of advances in distributed inference: high performance in synchronization of sufficient statistics with relaxed consistency model; fast sampling, using the Metropolis-Hastings-Walker method to overcome dense generative models; statistical modeling, moving beyond Latent Dirichlet Allocation (LDA) to Pitman-Yor distributions (PDP) and Hierarchical Dirichlet Process (HDP) models; sophisticated parameter projection schemes, to resolve the conflicts within the constraint between parameters arising from the relaxed consistency model. This work significantly extends the domain of applicability of what is commonly known as the Parameter Server. We obtain results with up to hundreds billion oftokens, thousands of topics, and a vocabulary of a few million token-types, using up to 60,000 processor cores operating on a production cluster of a large Internet company. This demonstrates the feasibility to scale to problems orders of magnitude larger than any previously published work.

Dimensionality Reduction for Binary Data through the Projection of Natural Parameters

Principal component analysis (PCA) for binary data, known as logistic PCA, has become a popular alternative to dimensionality reduction of binary data. It is motivated as an extension of ordinary PCA by means of a matrix factorization, akin to the singular value decomposition, that maximizes the Bernoulli log-likelihood. We propose a new formulation of logistic PCA which extends Pearson’s formulation of a low dimensional data representation with minimum error to binary data. Our formulation does not require a matrix factorization, as previous methods do, but instead looks for projections of the natural parameters from the saturated model. Due to this difference, the number of parameters does not grow with the number of observations and the principal component scores on new data can be computed with simple matrix multiplication. We derive explicit solutions for data matrices of special structure and provide computationally efficient algorithms for solving for the principal component loadings. Through simulation experiments and an analysis of medical diagnoses data, we compare our formulation of logistic PCA to the previous formulation as well as ordinary PCA to demonstrate its benefits.

Spectral statistics of sparse Erdős-Rényi graph Laplacians

Noncrossing partitions, toggles, and homomesies

Localization of Quantum States and Landscape Functions

Scaling limits for the critical Fortuin-Kastelyn model on a random planar map III: finite volume case

Prevalence and recoverability of syntactic parameters in sparse distributed memories

Time-Sensitive Bayesian Information Aggregation for Crowdsourcing Systems

The permutation class Av(4213,2143)

Revisiting Alpha-Investing: Conditionally Valid Stepwise Regression

Efficient measurement of point-to-set correlations and overlap fluctuations in glass-forming liquids

A Risk Ratio Comparison of $l_0$ and $l_1$ Penalized Regression

Approximation of delay differential equations at the verge of instability by equations without delay

Bayesian Nonparametric Density Estimation under Length Bias

GLASSES: Relieving The Myopia Of Bayesian Optimisation

Universality in marginally relevant disordered systems

Monotone and additive Markov process duality

Propagation of chaos for the Vlasov-Poisson-Fokker-Planck system in 1D

Computing LZ77 in Run-Compressed Space

On the Erdős-Szekeres Conjecture

Likelihood Ratio Tests for a Dose-Response Effect using Multiple Nonlinear Regression Models

A generalization of Onsager’s reciprocity relations to gradient flows with nonlinear mobility

On the discontinuity of the specific heat of the Ising model on a scale-free network

On the number of latin hypercubes, pairs of orthogonal latin squares and MDS codes

A preparation theorem for the kashiwara $b(\infty)$ crystal

Functional delta-method for the bootstrap of quasi-Hadamard differentiable functionals

Quenched localisation in the Bouchaud trap model with slowly varying traps

Learning-based Compressive Subsampling

There is exactly one Z2Z4-cyclic 1-perfect code

An investigation of equilibration in small quantum systems

The Saturation Time of Graph Bootstrap Percolation

Creating Scalable and Interactive Web Applications Using High Performance Latent Variable Models

Homogenization of periodic diffusion with small jumps

Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions

Automated Synchronization of Driving Data Using Vibration and Steering Events

Improved Lower Bounds on the Classical Ramsey Numbers R(4,22) and R(4,25)

When Are Nonconvex Problems Not Scary?

The Effect of Data Swapping on Analyses of American Community Survey Data

Quantile Versions of the Lorenz Curve

The parabolic Taylor formula of the implied volatility

Regularization vs. Relaxation: A conic optimization perspective of statistical variable selection

Coagulation-fragmentation model for animal group-size statistics

Input Sparsity and Hardness for Robust Subspace Approximation

Staircase diagrams and enumeration of smooth Schubert varieties

Refined Turán numbers and Ramsey numbers for the loose 3-uniform path of length three

Scalable posterior approximations for large-scale Bayesian inverse problems via likelihood-informed parameter and state reduction

The Complexity of Pattern Matching for $321$-Avoiding and Skew-Merged Permutations

Nonlinear stochastic heat equation driven by spatially colored noise: moments and intermittency