Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

Robust Inference with Variational Bayes

In Bayesian analysis, the posterior follows from the data and a choice of a prior and a likelihood. One hopes that the posterior is robust to reasonable variation in the choice of prior and likelihood, since this choice is made by the modeler and is necessarily somewhat subjective. Despite the fundamental importance of the problem and a considerable body of literature, the tools of robust Bayes are not commonly used in practice. This is in large part due to the difficulty of calculating robustness measures from MCMC draws. Although methods for computing robustness measures from MCMC draws exist, they lack generality and often require additional coding or computation. In contrast to MCMC, variational Bayes (VB) techniques are readily amenable to robustness analysis. The derivative of a posterior expectation with respect to a prior or data perturbation is a measure of local robustness to the prior or likelihood. Because VB casts posterior inference as an optimization problem, its methodology is built on the ability to calculate derivatives of posterior quantities with respect to model parameters, even in very complex models. In the present work, we develop local prior robustness measures for mean-field variational Bayes (MFVB), a VB technique which imposes a particular factorization assumption on the variational posterior approximation. We start by outlining existing local prior measures of robustness. Next, we use these results to derive closed-form measures of the sensitivity of the mean-field variational posterior approximation to prior specification. We demonstrate our method on a meta-analysis of randomized controlled interventions in access to microcredit in developing countries.
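
As a minimal illustration of local robustness as a derivative, consider a conjugate normal model with known noise precision. This is a toy model chosen for the sketch, not the MFVB setting or the microcredit application described above. The posterior mean has a closed form, so its sensitivity to the prior mean can be written analytically and checked by finite differences. All function and parameter names are hypothetical.

```python
def posterior_mean(prior_mean, prior_precision, data, noise_precision):
    """Posterior mean of a normal mean under a conjugate normal prior."""
    n = len(data)
    total_precision = prior_precision + n * noise_precision
    return (prior_precision * prior_mean + noise_precision * sum(data)) / total_precision


def prior_mean_sensitivity(prior_precision, n, noise_precision):
    """Analytic derivative of the posterior mean with respect to the prior mean."""
    return prior_precision / (prior_precision + n * noise_precision)


def finite_difference_sensitivity(prior_mean, prior_precision, data,
                                  noise_precision, eps=1e-4):
    """Central finite-difference estimate of the same derivative."""
    upper = posterior_mean(prior_mean + eps, prior_precision, data, noise_precision)
    lower = posterior_mean(prior_mean - eps, prior_precision, data, noise_precision)
    return (upper - lower) / (2.0 * eps)
```

In this toy model the sensitivity shrinks toward zero as the number of observations grows, which matches the intuition that more data makes the posterior less dependent on the prior.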

Selective Sequential Model Selection

Many model selection algorithms produce a path of fits specifying a sequence of increasingly complex models. Given such a sequence and the data used to produce them, we consider the problem of choosing the least complex model that is not falsified by the data. Extending the selected-model tests of Fithian et al. (2014), we construct p-values for each step in the path which account for the adaptive selection of the model path using the data. In the case of linear regression, we propose two specific tests, the max-t test for forward stepwise regression (generalizing a proposal of Buja and Brown (2014)), and the next-entry test for the lasso. These tests improve on the power of the saturated-model test of Tibshirani et al. (2014), sometimes dramatically. In addition, our framework extends beyond linear regression to a much more general class of parametric and nonparametric model selection problems. To select a model, we can feed our single-step p-values as inputs into sequential stopping rules such as those proposed by G’Sell et al. (2013) and Li and Barber (2015), achieving control of the familywise error rate or false discovery rate (FDR) as desired. The FDR-controlling rules require the null p-values to be independent of each other and of the non-null p-values, a condition not satisfied by the saturated-model p-values of Tibshirani et al. (2014). We derive intuitive and general sufficient conditions for independence, and show that our proposed constructions yield independent p-values.
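
To make the idea of feeding per-step p-values into a sequential stopping rule concrete, here is a minimal sketch of a ForwardStop-type rule in the spirit of G’Sell et al. The exact form used here (keep the largest k for which the running mean of -log(1 - p_i) is at most alpha) is an assumption of this sketch and should be checked against that paper. The function name is hypothetical, and the sketch says nothing about how the p-values themselves are constructed.

```python
import math


def forward_stop(p_values, alpha):
    """Return the number of steps kept by a ForwardStop-type rule.

    Keeps the largest k such that the running mean of -log(1 - p_i)
    over the first k p-values is at most alpha; returns 0 if no k qualifies.
    """
    running_sum = 0.0
    selected = 0
    for k, p in enumerate(p_values, start=1):
        # A p-value of exactly 1 contributes an infinite penalty.
        running_sum += math.inf if p >= 1.0 else -math.log1p(-p)
        if running_sum / k <= alpha:
            selected = k
    return selected
```

Calling `forward_stop(p_values, alpha)` on the ordered p-values of a model path returns how many steps of the path to keep. Any FDR guarantee depends on independence conditions of the kind discussed in the abstract.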

Explaining NonLinear Classification Decisions with Deep Taylor Decomposition

Nonlinear methods such as Deep Neural Networks (DNNs) are the gold standard for various challenging machine learning problems, e.g., image classification, natural language processing or human action recognition. Although these methods perform impressively well, they have a significant disadvantage: a lack of transparency, which limits the interpretability of the solution and thus the scope of application in practice. DNNs in particular act as black boxes due to their multilayer nonlinear structure. In this paper we introduce a novel methodology for interpreting generic multilayer neural networks by decomposing the network classification decision into contributions of its input elements. Although our focus is on image classification, the method is applicable to a broad set of input data, learning tasks and network architectures. Our method is based on deep Taylor decomposition and efficiently utilizes the structure of the network by backpropagating the explanations from the output to the input layer. We evaluate the proposed method empirically on the MNIST and ILSVRC data sets.

Learning Discrete Bayesian Networks from Continuous Data

Real data often contains a mixture of discrete and continuous variables, but many Bayesian network structure learning and inference algorithms assume all random variables are discrete. Continuous variables are often discretized, but the choice of discretization policy has a significant impact on the accuracy, speed, and interpretability of the resulting models. This paper introduces a principled Bayesian discretization method for continuous variables in Bayesian networks with quadratic complexity instead of the cubic complexity of other standard techniques. Empirical demonstrations show that the proposed method is superior to the state of the art. In addition, this paper shows how to incorporate existing methods into the structure learning process to discretize all continuous variables and simultaneously learn Bayesian network structures.

Online Gradient Descent in Function Space

In many problems in machine learning and operations research, we need to optimize a function whose input is a random variable or a probability density function, i.e., to solve optimization problems in an infinite-dimensional space. On the other hand, online learning has the advantage of dealing with streaming examples and can better model a changing environment. In this paper, we extend the celebrated online gradient descent algorithm to Hilbert spaces (function spaces), and analyze the convergence guarantee of the algorithm. Finally, we demonstrate that our algorithms can be useful in several important problems.
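
One familiar instance of gradient descent in a function space is online learning in a reproducing kernel Hilbert space. The sketch below is an illustration under stated assumptions, not the algorithm analyzed in the paper: it assumes a Gaussian kernel, squared loss, a constant step size, and no projection or regularization. For the loss 0.5 * (f(x) - y)^2 the functional gradient is (f(x) - y) * k(x, .), so each step appends one kernel term to the current function. All names are hypothetical.

```python
import math


def gaussian_kernel(x, z, bandwidth):
    """Gaussian (RBF) kernel on the real line."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))


class FunctionalOGD:
    """Online gradient descent over functions f = sum_i coef_i * k(center_i, .)."""

    def __init__(self, step_size, bandwidth):
        self.step_size = step_size
        self.bandwidth = bandwidth
        self.centers = []
        self.coefs = []

    def predict(self, x):
        return sum(c * gaussian_kernel(x, z, self.bandwidth)
                   for c, z in zip(self.coefs, self.centers))

    def update(self, x, y):
        """Take one functional gradient step; return the loss before the step."""
        residual = self.predict(x) - y
        # Gradient of 0.5 * (f(x) - y)^2 with respect to f is residual * k(x, .).
        self.centers.append(x)
        self.coefs.append(-self.step_size * residual)
        return 0.5 * residual ** 2


def run_demo(n_passes=30):
    """Stream a small grid of (x, sin x) pairs repeatedly; return mean loss per pass."""
    learner = FunctionalOGD(step_size=0.5, bandwidth=0.4)
    grid = [0.5 * i for i in range(7)]
    pass_losses = []
    for _ in range(n_passes):
        losses = [learner.update(x, math.sin(x)) for x in grid]
        pass_losses.append(sum(losses) / len(losses))
    return pass_losses
```

The representation grows by one term per example, which is the usual price of working directly in function space. Practical variants typically truncate or compress the expansion.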

Online Crowdsourcing

With the success of modern internet-based platforms such as Amazon Mechanical Turk, it is now normal to collect a large number of hand-labeled samples from non-experts. The Dawid-Skene algorithm, which is based on Expectation-Maximization updates, has been widely used for inferring the true labels from noisy crowdsourced labels. However, the Dawid-Skene scheme requires all the data to perform each EM iteration, and can be infeasible for streaming data or large-scale data. In this paper, we provide an online version of the Dawid-Skene algorithm that only requires one data frame for each iteration. Further, we prove that under mild conditions, the online Dawid-Skene scheme with projection converges to a stationary point of the marginal log-likelihood of the observed data. Our experiments demonstrate that the online Dawid-Skene scheme achieves state-of-the-art performance compared with other methods based on the Dawid-Skene scheme.
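
For orientation, the batch Dawid-Skene scheme that the abstract contrasts with can be sketched as follows. This is a simplified illustration, not the online algorithm proposed in the paper: the initialization from vote fractions, the additive smoothing constant, and all names are assumptions of the sketch. Note that each iteration loops over the full label set, which is the cost the online variant is meant to avoid.

```python
import math


def _normalize(values):
    total = sum(values)
    return [v / total for v in values]


def dawid_skene(labels, n_items, n_workers, n_classes, n_iters=20, smoothing=0.01):
    """Batch EM for a Dawid-Skene-style model.

    labels: list of (item, worker, observed_label) triples.
    Returns (posteriors, class_prior, confusion), where confusion[w][k][l]
    estimates P(worker w reports l | true class k).
    """
    # Initialize item posteriors from smoothed vote fractions.
    post = [[smoothing] * n_classes for _ in range(n_items)]
    for item, _, label in labels:
        post[item][label] += 1.0
    post = [_normalize(row) for row in post]

    for _ in range(n_iters):
        # M-step: re-estimate the class prior and per-worker confusion matrices.
        prior = [smoothing] * n_classes
        for row in post:
            for k in range(n_classes):
                prior[k] += row[k]
        prior = _normalize(prior)
        conf = [[[smoothing] * n_classes for _ in range(n_classes)]
                for _ in range(n_workers)]
        for item, worker, label in labels:
            for k in range(n_classes):
                conf[worker][k][label] += post[item][k]
        conf = [[_normalize(row) for row in matrix] for matrix in conf]

        # E-step: recompute item posteriors in log space for stability.
        log_post = [[math.log(prior[k]) for k in range(n_classes)]
                    for _ in range(n_items)]
        for item, worker, label in labels:
            for k in range(n_classes):
                log_post[item][k] += math.log(conf[worker][k][label])
        post = []
        for row in log_post:
            peak = max(row)
            post.append(_normalize([math.exp(v - peak) for v in row]))
    return post, prior, conf
```

The inferred label for an item is the class with the largest posterior probability in the returned `posteriors` row.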

Optimal strategies for the control of autonomous vehicles in data assimilation

We propose a method to compute optimal control paths for autonomous vehicles deployed for the purpose of inferring a velocity field. In addition to being advected by the flow, the vehicles are able to effect a fixed relative speed with arbitrary control over direction. It is this direction that is used as the basis for the locally optimal control algorithm presented here, with objective formed from the variance trace of the expected posterior distribution. We present results for linear flows near hyperbolic fixed points.

Sensitivity analysis, multilinearity and beyond

Sensitivity methods for the analysis of the outputs of discrete Bayesian networks have been extensively studied and implemented in different software packages. These methods usually focus on the study of sensitivity functions and on the impact of a parameter change on the Chan-Darwiche distance. Although not fully recognized, the majority of these results heavily rely on the multilinear structure of atomic probabilities in terms of the conditional probability parameters associated with this type of network. By defining a statistical model through the polynomial expression of its associated defining conditional probabilities, we develop a unifying approach to sensitivity methods applicable to a large suite of models including extensions of Bayesian networks, for instance context-specific and dynamic ones, and chain event graphs. By then focusing on models whose defining polynomial is multilinear, our algebraic approach enables us to prove that the Chan-Darwiche distance is minimized for a certain class of multi-parameter contemporaneous variations when parameters are proportionally covaried.
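
For readers unfamiliar with the Chan-Darwiche distance, the sketch below computes it for two distributions over the same finite set of atoms, using the usual definition: the log of the largest probability ratio minus the log of the smallest. To keep the example short it assumes strictly positive probabilities and does not handle the zero-probability conventions. The function name is hypothetical.

```python
import math


def chan_darwiche_distance(p, q):
    """Chan-Darwiche distance between two strictly positive distributions.

    p and q are sequences of atomic probabilities over the same atoms,
    listed in the same order.
    """
    if len(p) != len(q) or not p:
        raise ValueError("distributions must be non-empty and of equal length")
    if any(v <= 0.0 for v in p) or any(v <= 0.0 for v in q):
        raise ValueError("this sketch assumes strictly positive probabilities")
    log_ratios = [math.log(qi) - math.log(pi) for pi, qi in zip(p, q)]
    return max(log_ratios) - min(log_ratios)
```

For example, moving from (0.5, 0.5) to (0.25, 0.75) gives ratios 0.5 and 1.5, so the distance is ln 3. The distance is symmetric in its two arguments.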

Minimal supports of eigenfunctions of Hamming graphs

Accurately Predicting Functional Connectivity from Diffusion Imaging

The modular group and words in its two generators

The minimum volume of subspace trades

Honeycomb Lattices with Defects

Hunting for Spammers: Detecting Evolved Spammers on Twitter

Distributed Adaptive LMF Algorithm for Sparse Parameter Estimation in Gaussian Mixture Noise

The combinatorial geometry of stresses in frameworks

Deep Learning for Single and Multi-Session i-Vector Speaker Recognition

Cubic Graphs, Disjoint Matchings and Some Inequalities

Gibbs-type Indian buffet processes

Projection Theorems for the Rényi Divergence on $α$-Convex Sets

A randomized polynomial kernel for Subset Feedback Vertex Set

Vertex-Coloring with Star-Defects

Refining a Tree-Decomposition which Distinguishes Tangles

On the Limiting Spectral Density of Random Matrices filled with Stochastic Processes

Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views

Light subgraphs in the graphs with average degree at most four

High-Dimensional Gaussian Copula Regression: Adaptive Estimation and Statistical Inference

Regularity of stochastic Volterra equations by functional calculus methods

Grid Intersection Graphs and Order Dimension

Mapping the Current-Current Correlation Function Near a Quantum Critical Point

Low Autocorrelation Binary Sequences

Properties for CD Inequalities with Unbounded Laplacians

Sequential Markov Chain Monte Carlo for Bayesian Filtering with Massive Data

Minimum Risk Training for Neural Machine Translation

Asymptotic entropy of random walks on Fuchsian buildings and Kac-Moody groups

Hamilton-Jacobi equations on graph and applications

A sufficient condition for a pair of sequences to be bipartite graphic

On the existence of infinitely many universal tree-based networks

Critical density of activated random walks on $\mathbb{Z}^d$ and general graphs

Nonuniformly weighted Schwarz smoothers for spectral element multigrid

Box representations of embedded graphs

Crossing Number is Hard for Kernelization

A new way to evaluate MOY graphs

Heat kernel estimates for anomalous heavy-tailed random walks

Davies’ method for anomalous diffusions

Minimal Distance to Approximating Noncontextual System as a Measure of Contextuality

Speeding up sum-of-squares for tensor decomposition and planted sparse vectors

Frequency Spirals in the 2D Kuramoto Lattice

Obliquely reflected Brownian motion in non-smooth planar domains

Money as Minimal Complexity

Nonparametric Reduced-Rank Regression for Multi-SNP, Multi-Trait Association Mapping

The $f$-Sensitivity Index

Asymptotic Normality of Quadratic Estimators

Motzkin monoids and partial Brauer monoids

On an ordering-dependent generalization of Tutte polynomial

Integer part polynomial multiple recurrence along shifted primes

Approximation-Friendly Discrepancy Rounding