Always Valid Inference: Bringing Sequential Analysis to A/B Testing

Nearly all technology platforms (e.g., web applications, mobile applications, etc.) use randomized controlled trials, or A/B tests, as a means to optimize their product offering. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals. These measures serve as a transparent, interpretable interface between the data and the user, allowing valid inference. However, these reported values cease to be valid if users make decisions based on continuous monitoring of their tests. Users try to take advantage of data as fast as it becomes available, but current testing practice prevents them from doing so while maintaining valid statistical inference. Through connections to sequential hypothesis testing, we present analogues of classical frequentist statistical measures that are always valid, regardless of when users choose to look at the test. We discuss how to optimally choose such a sequential test. We also discuss applications to bandits, and extensions to multiple hypothesis testing in the sequential setting. Finally, we discuss implementation and deployment of our approach in a large scale commercial A/B testing platform.

Strategies for Training Large Vocabulary Neural Language Models

Training neural network language models over large vocabularies is still computationally very costly compared to count-based models such as Kneser-Ney. At the same time, neural language models are gaining popularity for many applications such as speech recognition and machine translation whose success depends on scalability. We present a systematic comparison of strategies to represent and train large vocabularies, including softmax, hierarchical softmax, target sampling, noise contrastive estimation and self normalization. We further extend self normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax. We evaluate each method on three popular benchmarks, examining performance on rare words, the speed/accuracy trade-off and complementarity to Kneser-Ney.

Increasing the Action Gap: New Operators for Reinforcement Learning

This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird’s advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.

Data Driven Resource Allocation for Distributed Learning

In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points are often belonging to the same or similar classes, and more generally, classification rules of high accuracy tend to be ‘locally simple but globally complex’, we propose data dependent dispatching that takes advantage of such structures. Our main technical contribution is to provide algorithms with provable guarantees for data-dependent dispatching, that partition the data in a way that satisfies important conditions for accurate distributed learning, including fault tolerance and balancedness. We show the effectiveness of our method over the widely used random partitioning scheme in several real world image and advertising datasets.

Feature-Level Domain Adaptation

Domain adaptation is the supervised learning setting in which the training and test data originate from different domains: the so-called source and target domains. In this paper, we propose and study a domain adaption approach, called feature-level domain adaptation (flda), that models the dependence between two domains by means of a feature-level transfer distribution. The domain adapted classifier is trained by minimizing the expected loss under this transfer distribution. Our empirical evaluation of flda focuses on problems with binary and count features in which the domain adaptation can be naturally modeled via a dropout distribution, which allows the final classifier to adapt to the importance of specific features in the target data. Our experimental evaluation suggests that under certain conditions, flda converges to the classifier trained on the target distribution. Experiments with our domain adaptation approach on several real-world problems show that flda performs on par with state-of-the-art techniques in domain adaptation.

Bayesian model selection for linear regression

In this note we introduce linear regression with basis functions in order to apply Bayesian model selection. The goal is to incorporate Occam’s razor as provided by Bayes analysis in order to automatically pick the model optimally able to explain the data without overfitting.

Model comparison with missing data using MCMC and importance sampling

Selecting between competing statistical models is a challenging problem especially when the competing models are non-nested. In this paper we offer a simple solution by devising an algorithm which combines MCMC and importance sampling to obtain computationally efficient estimates of the marginal likelihood which can then be used to compare the models. The algorithm is successfully applied to longitudinal epidemic and time series data sets and shown to outperform existing methods for computing the marginal likelihood.

Efficient Algorithms for Personalized PageRank

We present new, more efficient algorithms for estimating random walk scores such as Personalized PageRank from a given source node to one or several target nodes. These scores are useful for personalized search and recommendations on networks including social networks, user-item networks, and the web. Past work has proposed using Monte Carlo or using linear algebra to estimate scores from a single source to every target, making them inefficient for a single pair. Our contribution is a new bidirectional algorithm which combines linear algebra and Monte Carlo to achieve significant speed improvements. On a diverse set of six graphs, our algorithm is 70x faster than past state-of-the-art algorithms. We also present theoretical analysis: while past algorithms require \Omega(n) time to estimate a random walk score of typical size \frac{1}{n} on an n-node graph to a given constant accuracy, our algorithm requires only O(\sqrt{m}) expected time for an average target, where m is the number of edges, and is provably accurate. In addition to our core bidirectional estimator for personalized PageRank, we present an alternative algorithm for undirected graphs, a generalization to arbitrary walk lengths and Markov Chains, an algorithm for personalized search ranking, and an algorithm for sampling random paths from a given source to a given set of targets. We expect our bidirectional methods can be extended in other ways and will be useful subroutines in other graph analysis problems.

Relative Density and Exact Recovery in Heterogeneous Stochastic Block Models

The non-equilibrium allele frequency spectrum in a Poisson random field framework

Declarative, Secure, Convergent Edge Computation

On the Total Number of Bends for Planar Octilinear Drawings

On Brownian motion, simple paths, and loops

Expressing the Indefinite Integral of the Standard Normal Probability Density Function

Energy-Efficient Classification for Anomaly Detection: The Wireless Channel as a Helper

On the Planar Split Thickness of Graphs

Stability with respect to initial conditions in V-norm for nonlinear filters with ergodic observations

Tight Bounds for Distributed Minimum-Weight Spanning Tree Verification

Coupling stochastic EM and Approximate Bayesian Computation for parameter inference in state-space models

Exending pseudo-arcs in odd characteristic

Constructing minimal blocking sets using field reduction

A linear set view on KM-arcs

A Bayesian approach to the g-formula

Causal and anti-causal learning in pattern recognition for neuroimaging

From One Point to A Manifold: Orbit Models for Knowledge Graph Embedding

Mixed eigenvalues of p-Laplacian on trees

On the Hurwitz action in finite Coxeter groups

Predator-prey model for the self-organisation of stochastic oscillators in dual populations

Learning optimal nonlinearities for iterative thresholding algorithms

A leader-election procedure using records

Cubic Graphs with Total Domatic Number at Least Two

Convex programming approach to robust estimation of a multivariate Gaussian model

Entropy Chaos and Bose-Einstein Condensation

Visible lattice points in random walks

Comment on ‘Phase transition for quenched coupled replicas in a plaquette spin model of glasses”

Probabilistic Analysis of the Dual Next-Fit Algorithm for Bin Covering

Edgeworth expansion for the pre-averaging estimator

Heath-Jarrow-Morton-Musiela equation with Lévy perturbation

Local entanglement structure across a many-body localization transition

Joint Image-Text News Topic Detection and Tracking with And-Or Graph Representation

n-type Markov Branching Processes with Immigration

Weak Local Rules for Planar Octagonal Tilings

An Improved Global Risk Bound in Concave Regression

Hyper-Heuristic Algorithm for Finding Efficient Features in Diagnose of Lung Cancer Disease

Agreement-based Joint Training for Bidirectional Attention-based Neural Machine Translation

Admissible colourings of 3-manifold triangulations for Turaev-Viro type invariants

Graphical Exchange Mechanisms

Noise-Compensated, Bias-Corrected Diffusion Weighted Endorectal Magnetic Resonance Imaging via a Stochastically Fully-Connected Joint Conditional Random Field Model

Reducing Parallel Communication in Algebraic Multigrid through Sparsification

Positivity of affine charge

On an Edge Precoloring Conjecture

Operators on compositions and generalized skew Pieri rules

Inverse subspace iteration for spectral stochastic finite element methods

KKM type theorems with boundary conditions

On the stability of a class of non-monotonic systems of parallel queues

Inference on the mode of weak directional signals: a Le Cam perspective on hypothesis testing near singularities

Partial regularity of viscosity solutions for a class of Kolmogorov equations arising from mathematical finance

$C_0$-sequentially equicontinuous semigroups on locally convex spaces

A Quantum Theory of the Glass Transition Suggests Universality Amongst Glass Formers

Relaxed Linearized Algorithms for Faster X-Ray CT Image Reconstruction

Random SU(2)-symmetric spin chains