Robust Inference for Time Series Models: a Wavelet-based Framework

We present a new framework for the robust estimation of latent time series models which is fairly general and, for example, covers models going from ARMA to state-space models. This approach provides estimators which are (i) consistent and asymptotically normally distributed, (ii) applicable to various classes of time series models, (iii) straightforward to implement and (iv) computationally efficient. The framework is based on the recently developed Generalized Method of Wavelet Moments (GMWM) and a new robust estimator of the wavelet variance. Compared to existing methods, the latter directly estimates the quantity of interest while performing better in finite samples and using milder conditions for its asymptotic properties to hold. Moreover, results are given showing the identifiability of the GMWM for various classes of time series models thereby allowing this method to consistently estimate many models (and combinations thereof) under mild conditions. Hence, not only does this paper provide an alternative estimator which allows to perform wavelet variance analysis when data are contaminated but also a general approach to robustly estimate the parameters of a variety of (latent) time series models. The simulation studies carried out confirm the better performance of the proposed estimators and the usefulness and broadness of the proposed methodology is shown using practical examples from the domains of economics and engineering with sample sizes up to 900,000.

Autoencoding beyond pixels using a learned similarity metric

We present an autoencoder that leverages the power of learned representations to better measure similarities in data space. By combining a variational autoencoder (VAE) with a generative adversarial network (GAN) we can use learned feature representations in the GAN discriminator as basis for the VAE reconstruction objective. Thereby, we replace element-wise errors with feature-wise errors that better capture the data distribution while offering invariance towards e.g. translation. We apply our method to images of faces and show that our method outperforms VAEs with element-wise similarity measures in terms of visual fidelity. Moreover, we show that our method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.

Strategies and Principles of Distributed Machine Learning on Big Data

The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics thereupon. In order to run ML algorithms at such scales, on a distributed cluster with 10s to 1000s of machines, it is often the case that significant engineering efforts are required — and one might fairly ask if such engineering truly falls within the domain of ML research or not. Taking the view that Big ML systems can benefit greatly from ML-rooted statistical and algorithmic insights — and that ML researchers should therefore not shy away from such systems design — we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions. These principles and strategies span a continuum from application, to engineering, and to theoretical research and development of Big ML systems and architectures, with the goal of understanding how to make them efficient, generally-applicable, and supported with convergence and scaling guarantees. They concern four key questions which traditionally receive little attention in ML research: How to distribute an ML program over a cluster? How to bridge ML computation with inter-machine communication? How to perform such communication? What should be communicated between machines? By exposing underlying statistical and algorithmic characteristics unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software as well as general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and grow the area that lies between ML and systems.

Computational Limits of Divide-and-Conquer Method

This theoretical note explores statistical versus computational trade-off to address a basic question in the application of divide-and-conquer method: what is the minimal computational cost in obtaining statistical optimality? In smoothing spline setup, we observe a phase transition phenomenon for the number of deployed machines that ends up being a simple proxy for computing cost. Specifically, a sharp upper bound for the number of machines is established: when the number is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible. These sharp bounds capture intrinsic computational limits of divide-and-conquer method, which turn out to be fully determined by the smoothness of the regression function. As a side remark, we argue that sample spitting may be viewed as a new form of regularization, playing a similar role as smoothing parameter.

Personalized Course Sequence Recommendations

Given the variability in student learning it is becoming increasingly important to tailor courses as well as course sequences to student needs. This paper presents a systematic methodology for offering personalized course sequence recommendations to students. First, a forward-search backward-induction algorithm is developed that can optimally select course sequences to decrease the time required for a student to graduate. The algorithm accounts for prerequisite requirements (typically present in higher level education) and course availability. Second, using the tools of multi-armed bandits, an algorithm is developed that can optimally recommend a course sequence that both reduces the time to graduate while also increasing the overall GPA of the student. The algorithm dynamically learns how students with different contextual backgrounds perform for given course sequences and then recommends an optimal course sequence for new students. Using real-world student data from the UCLA Mechanical and Aerospace Engineering department, we illustrate how the proposed algorithms outperform other methods that do not include student contextual information when making course sequence recommendations.

Statistical Query Algorithms for Stochastic Convex Optimization

Stochastic convex optimization, where the objective is the expectation of a random convex function, is an important and widely used method with numerous applications in machine learning, statistics, operations research and other areas. We study the complexity of stochastic convex optimization given only statistical query (SQ) access to the objective function. We show that well-known and popular methods, including first-order iterative methods and polynomial-time methods, can be implemented using only statistical queries. For many cases of interest we derive nearly matching upper and lower bounds on the estimation (sample) complexity including linear optimization in the most general setting. We then present several consequences for machine learning, differential privacy and proving concrete lower bounds on the power of convex optimization based methods. A new technical ingredient of our work is SQ algorithms for estimating the mean vector of a distribution over vectors in \mathbb{R}^d with optimal estimation complexity. This is a natural problem and we show that our solutions can be used to get substantially improved SQ versions of Perceptron and other online algorithms for learning halfspaces.

A Notation for Markov Decision Processes

This paper specifies a notation for Markov decision processes.

Learning Natural Language Inference with LSTM

Natural language inference (NLI) is a fundamentally important task in natural language processing that has many applications. The recently released Stanford Natural Language Inference (SNLI) corpus has made it possible to develop and evaluate learning-centered methods such as deep neural networks for the NLI task. In this paper, we propose a special long short-term memory (LSTM) architecture for NLI. Our model builds on top of a recently proposed neutral attention model for NLI but is based on a significantly different idea. Instead of deriving sentence embeddings for the premise and the hypothesis to be used for classification, our solution uses a matching-LSTM that performs word-by-word matching of the hypothesis with the premise. This LSTM is able to place more emphasis on important word-level matching results. In particular, we observe that this LSTM remembers important mismatches that are critical for predicting the contradiction or the neutral relationship label. Our experiments on the SNLI corpus show that our model outperforms the state of the art, achieving an accuracy of 86.1% on the test data.

Learning to Filter with Predictive State Inference Machines

Latent state space models are one of the most fundamental and widely used tools for modeling dynamical systems. Traditional Maximum Likelihood Estimation (MLE) based approaches aim to maximize the likelihood objective, which is non-convex due to latent states. While non-convex optimization methods like EM can learn models that locally optimize the likelihood objective, using the locally optimal model for an inference task such as Bayesian filtering usually does not have performance guarantees. In this work, we propose a method that considers the inference procedure on the dynamical system as a composition of predictors. Instead of optimizing a given parametrization of latent states, we learn predictors for inference in predictive belief space, where we can use sufficient features of observations for supervision of our learning algorithm. We further show that our algorithm, the Predictive State Inference Machine, has theoretical performance guarantees on the inference task. Empirical verification across several of dynamical system benchmarks ranging from a simulated helicopter to recorded telemetry traces from a robot showcase the abilities of training Inference Machines.

Common Variable Learning and Invariant Representation Learning using Siamese Neural Networks

We propose an Artificial Neural Network (ANN) approach for discovering common hidden variables and for learning of invariant representations. This approach uses synchronicity and concurrence to recognize desired structures in data and discard superfluous structures, in the absence of labels and a data generation model. In the common variable discovery problem, the ANN uses measurements from two distinct sensors to construct a representation of the common hidden variable that is manifested in both sensors, and discards sensor-specific variables. In the invariant representation learning problem, the network uses multiple observations of objects under transformations to construct a representation which is invariant to the transformations.

Not Every Flow is Equal: SMART Discrimination in Redundancy

Software-Defined Data Centers (SDDC) extend virtualization, software-defined networking and systems, and middleboxes to provide a better quality of service (QoS). While many network flow routing algorithms exist, most of them fail to adapt to the dynamic nature of the data center and cloud networks and their users’ and enterprise requirements. This paper presents SMART, a Software-Defined Networking (SDN) middlebox architecture for reliable transfers. As an architectural enhancement for network flows allocation, routing, and control, SMART ensures timely delivery of flows by diverting them to a less congested path dynamically in the software-defined data center networks. SMART also clones packets of higher priority flows to route in an alternative path, along with the original flow. Hence SMART offers a differentiated QoS through varying levels of redundancy in the flows.

Structured Pruning of Deep Convolutional Neural Networks

Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.

G-Learning: Taming the Noise in Reinforcement Learning via Soft Updates

Model-free reinforcement learning algorithms such as Q-learning perform poorly in the early stages of learning in noisy environments, because much effort is spent on unlearning biased estimates of the state-action function. The bias comes from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the noise in the space of optimal actions by penalizing deterministic policies at the beginning of the learning. Moreover, it enables naturally incorporating prior distributions over optimal actions when available. The stochastic nature of G-learning also makes it more cost-effective than Q-learning in noiseless but exploration-risky domains. We illustrate these ideas in several examples where G-learning results in significant improvements of the learning rate and the learning cost.

Mining Massive Hierarchical Data Using a Scalable Probabilistic Graphical Model

Probabilistic Graphical Models (PGM) are very useful in the fields of machine learning and data mining. The crucial limitation of those models,however, is the scalability. The Bayesian Network, which is one of the most common PGMs used in machine learning and data mining, demonstrates this limitation when the training data consists of random variables, each of them has a large set of possible values. In the big data era, one would expect new extensions to the existing PGMs to handle the massive amount of data produced these days by computers, sensors and other electronic devices. With hierarchical data – data that is arranged in a treelike structure with several levels – one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian Networks become infeasible for representing the probability distributions. In this paper we introduce an extension to Bayesian Networks to handle massive sets of hierarchical data in a reasonable amount of time and space. The proposed model achieves perfect precision of 1.0 and high recall of 0.93 when it is used as multi-label classifier for the annotation of mass spectrometry data. On another data set of 1.5 billion search logs provided by the model was able to predict latent semantic relationships between search keywords with accuracy up to 0.80.

Towards Energy Consumption Verification via Static Analysis

Stein’s method for steady-state diffusion approximations: an introduction through the Erlang-A and Erlang-C models

Geometric Memory Management

An (MI)LP-based Primal Heuristic for 3-Architecture Connected Facility Location in Urban Access Network Design

Faster Maximium Priority Matchings in Bipartite Graphs

Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server

Mutations of simple-minded systems in Calabi-Yau categories generated by a spherical object

Asynchronous Exclusive Selection

Quantitative uniform propagation of chaos for Maxwell molecules

Linear Convergence of Proximal Gradient Algorithm with Extrapolation for a Class of Nonconvex Nonsmooth Minimization Problems

Non-linear noise excitation for some space-time fractional stochastic equations in bounded domains

Average-case complexity without the black swans

Linear codes for an effective quantization of data

Metastability of the two-dimensional Blume-Capel model with zero chemical potential and small magnetic field

Nonlinear stochastic evolution equations of second order with damping

Evolving Non-linear Stacking Ensembles for Prediction of Go Player Attributes

Solving the G-problems in less than 500 iterations: Improved efficient constrained optimization by surrogate modeling and adaptive parameter control

Forecaster’s Dilemma: Extreme Events and Forecast Evaluation

A $4/5$ – Approximation Algorithm for the Maximum Traveling Salesman Problem

Godsil-McKay switching and twisted Grassmann graphs

A. Hurwitz and the origins of random matrix theory in mathematics

Efficient Construction of Simultaneous Deterministic Finite Automata on Multicores Using Rabin Fingerprints

Denoising and Completion of 3D Data via Multidimensional Dictionary Learning

Nonparametric mixture of Gaussian graphical models

Bayes-Optimal Effort Allocation in Crowdsourcing: Bounds and Index Policies

Efficient algorithms for topological inference on random graphs

Low rank approximation and decomposition of large matrices using error correcting codes

Dynamic Time-Dependent Route Planning in Road Networks with User Preferences

The maximum product of weights of cross-intersecting families

Critical Percolation and the Minimal Spanning Tree in Slabs

Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling

Universal non-Debye scaling in the density of states of amorphous solids

Fast Computation of Isochrones in Road Networks

The Negative Cycle Vectors of Signed Complete Graphs

Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap

The cohomology rings of regular nilpotent Hessenberg varieties in Lie type A

On spectral properties and Ihara zeta function of random long-range percolation graphs

Model-based testing for space-time interaction using point processes: An application to psychiatric hospital admissions in an urban area

Boolean functions whose Fourier transform is concentrated on pairwise disjoint subsets of the input

Drawings of Kn with the same rotation scheme are the same up to Reidemeister moves. Gioan’s Theorem

Long time behaviour of random walks on the integer lattice

Limitations on detecting row covariance in the presence of column covariance

Pairwise Markov properties for regression graphs

The Bounded Edge Coloring Problem and Offline Crossbar Scheduling

Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices

The Edge Group Coloring Problem with Applications to Multicast Switching

Technical Report: a tool for measuring Prosodic Accommodation

Free probability for purely discrete eigenvalues of random matrices

Evaluating Go Game Records for Prediction of Player Attributes

A Generalization of Brown’s Construction for the Degree/Diameter Problem

Perspectives on theory at the interface of physics and biology

Simple, Robust and Optimal Ranking from Pairwise Comparisons

Asymptotic Analysis of Multiscale Markov Chain

Reconfigurable State Machine Replication from Non-Reconfigurable Building Blocks

Cluster varieties from Legendrian knots

Resolvent Energy of Unicyclic, Bicyclic and Tricyclic Graphs

The Pedestrian’s Guide to Local Time

Online Keyword Spotting with a Character-Level Recurrent Neural Network

Modeling Variations of First-Order Horn Abduction in Answer Set Programming using On-Demand Constraints and Flexible Value Invention

A Comparative Study of Sparse Associative Memories

Subgraph statistics in subcritical graph classes

Estimation of the sample covariance matrix from compressive measurements

Binomial, Poisson and Gaussian supermodular orderings by tree-based correlations

Higher-order expansions of powered extremes of normal samples

A Graph Theoretic Proof of the Tight Cut Lemma

Radial Bargmann representation for the Fock space of type B

Sharp Computational-Statistical Phase Transitions via Oracle Computational Model

Copulas for maxmin systems

Counting curves on surfaces

Randomness: quantum versus classical

The Packing While Traveling Problem

Privacy Preservation in Distributed Subgradient Optimization Algorithms

Joint limiting laws for high-dimensional independence tests

A Pivot-Based Improvement to Sandwich-Based Confidence Intervals

Combining Fuzzy Cognitive Maps and Discrete Random Variables

Sparse group factor analysis for biclustering of multiple data sources

Utility maximization in Wiener-transformable markets

Matrix Completion Under Monotonic Single Index Models

Plane partitions with a ‘pit”: generating functions and representation theory

Royen’s proof of the Gaussian correlation inequality

Changes in U.S. temperature extremes under increased CO2 in millennial-scale climate simulations

Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems

Evolution of quasi-characteristic functions in quantum stochastic systems with Weyl quantization of energy operators

Stochastic Allen-Cahn equation with mobility

Recovery of periodicities hidden in heavy-tailed noise

A Variational EM Method for Mixed Membership Models for Rank Data: an Analysis of Public Policy Preferences

On the Foundations of the Brussels Operational-Realistic Approach to Cognition

Packing spanning graphs from separable families

Error Bounds for Compressed Sensing Algorithms With Group Sparsity: A Unified Approach

Spatial pattern and power function deviation in a cellular automata model of an ant population

An Importance Sampling Scheme for Models in a Strong External Field

Hypothesis Testing for Differences in Gaussian Graphical Models: Applications to Brain Connectivity

Conditional independence and conditioned limit laws

Improper Graceful and Odd-graceful Labellings of Graph Theory

On the boundary of the support of super-Brownian motion: with appendices

Tight Bounds for Approximate Carathéodory and Beyond

Optimal Selective Attention in Reactive Agents

Analyzing Walter Skeat’s Forty-Five Parallel Extracts of William Langland’s Piers Plowman

Spatial Bayesian hierarchical modeling of precipitation extremes over a large domain

Maximium Priority Matchings

Phase Transitions of Traveling Salesperson Problems solved with Linear Programming and Cutting Planes

Conditional probability generation methods for high reliability effects-based decision making

Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses

Modelling Anisotropic Covariance using Stochastic Development and Sub-Riemannian Frame Bundle Geometry