We present a new framework for the robust estimation of latent time series models which is fairly general and, for example, covers models going from ARMA to state-space models. This approach provides estimators which are (i) consistent and asymptotically normally distributed, (ii) applicable to various classes of time series models, (iii) straightforward to implement and (iv) computationally efficient. The framework is based on the recently developed Generalized Method of Wavelet Moments (GMWM) and a new robust estimator of the wavelet variance. Compared to existing methods, the latter directly estimates the quantity of interest while performing better in finite samples and using milder conditions for its asymptotic properties to hold. Moreover, results are given showing the identifiability of the GMWM for various classes of time series models thereby allowing this method to consistently estimate many models (and combinations thereof) under mild conditions. Hence, not only does this paper provide an alternative estimator which allows to perform wavelet variance analysis when data are contaminated but also a general approach to robustly estimate the parameters of a variety of (latent) time series models. The simulation studies carried out confirm the better performance of the proposed estimators and the usefulness and broadness of the proposed methodology is shown using practical examples from the domains of economics and engineering with sample sizes up to 900,000.
We present an autoencoder that leverages the power of learned representations to better measure similarities in data space. By combining a variational autoencoder (VAE) with a generative adversarial network (GAN) we can use learned feature representations in the GAN discriminator as basis for the VAE reconstruction objective. Thereby, we replace element-wise errors with feature-wise errors that better capture the data distribution while offering invariance towards e.g. translation. We apply our method to images of faces and show that our method outperforms VAEs with element-wise similarity measures in terms of visual fidelity. Moreover, we show that our method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.
The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics thereupon. In order to run ML algorithms at such scales, on a distributed cluster with 10s to 1000s of machines, it is often the case that significant engineering efforts are required — and one might fairly ask if such engineering truly falls within the domain of ML research or not. Taking the view that Big ML systems can benefit greatly from ML-rooted statistical and algorithmic insights — and that ML researchers should therefore not shy away from such systems design — we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions. These principles and strategies span a continuum from application, to engineering, and to theoretical research and development of Big ML systems and architectures, with the goal of understanding how to make them efficient, generally-applicable, and supported with convergence and scaling guarantees. They concern four key questions which traditionally receive little attention in ML research: How to distribute an ML program over a cluster? How to bridge ML computation with inter-machine communication? How to perform such communication? What should be communicated between machines? By exposing underlying statistical and algorithmic characteristics unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software as well as general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and grow the area that lies between ML and systems.
This theoretical note explores statistical versus computational trade-off to address a basic question in the application of divide-and-conquer method: what is the minimal computational cost in obtaining statistical optimality? In smoothing spline setup, we observe a phase transition phenomenon for the number of deployed machines that ends up being a simple proxy for computing cost. Specifically, a sharp upper bound for the number of machines is established: when the number is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible. These sharp bounds capture intrinsic computational limits of divide-and-conquer method, which turn out to be fully determined by the smoothness of the regression function. As a side remark, we argue that sample spitting may be viewed as a new form of regularization, playing a similar role as smoothing parameter.
Given the variability in student learning it is becoming increasingly important to tailor courses as well as course sequences to student needs. This paper presents a systematic methodology for offering personalized course sequence recommendations to students. First, a forward-search backward-induction algorithm is developed that can optimally select course sequences to decrease the time required for a student to graduate. The algorithm accounts for prerequisite requirements (typically present in higher level education) and course availability. Second, using the tools of multi-armed bandits, an algorithm is developed that can optimally recommend a course sequence that both reduces the time to graduate while also increasing the overall GPA of the student. The algorithm dynamically learns how students with different contextual backgrounds perform for given course sequences and then recommends an optimal course sequence for new students. Using real-world student data from the UCLA Mechanical and Aerospace Engineering department, we illustrate how the proposed algorithms outperform other methods that do not include student contextual information when making course sequence recommendations.
Stochastic convex optimization, where the objective is the expectation of a random convex function, is an important and widely used method with numerous applications in machine learning, statistics, operations research and other areas. We study the complexity of stochastic convex optimization given only statistical query (SQ) access to the objective function. We show that well-known and popular methods, including first-order iterative methods and polynomial-time methods, can be implemented using only statistical queries. For many cases of interest we derive nearly matching upper and lower bounds on the estimation (sample) complexity including linear optimization in the most general setting. We then present several consequences for machine learning, differential privacy and proving concrete lower bounds on the power of convex optimization based methods. A new technical ingredient of our work is SQ algorithms for estimating the mean vector of a distribution over vectors in with optimal estimation complexity. This is a natural problem and we show that our solutions can be used to get substantially improved SQ versions of Perceptron and other online algorithms for learning halfspaces.
Natural language inference (NLI) is a fundamentally important task in natural language processing that has many applications. The recently released Stanford Natural Language Inference (SNLI) corpus has made it possible to develop and evaluate learning-centered methods such as deep neural networks for the NLI task. In this paper, we propose a special long short-term memory (LSTM) architecture for NLI. Our model builds on top of a recently proposed neutral attention model for NLI but is based on a significantly different idea. Instead of deriving sentence embeddings for the premise and the hypothesis to be used for classification, our solution uses a matching-LSTM that performs word-by-word matching of the hypothesis with the premise. This LSTM is able to place more emphasis on important word-level matching results. In particular, we observe that this LSTM remembers important mismatches that are critical for predicting the contradiction or the neutral relationship label. Our experiments on the SNLI corpus show that our model outperforms the state of the art, achieving an accuracy of 86.1% on the test data.
Latent state space models are one of the most fundamental and widely used tools for modeling dynamical systems. Traditional Maximum Likelihood Estimation (MLE) based approaches aim to maximize the likelihood objective, which is non-convex due to latent states. While non-convex optimization methods like EM can learn models that locally optimize the likelihood objective, using the locally optimal model for an inference task such as Bayesian filtering usually does not have performance guarantees. In this work, we propose a method that considers the inference procedure on the dynamical system as a composition of predictors. Instead of optimizing a given parametrization of latent states, we learn predictors for inference in predictive belief space, where we can use sufficient features of observations for supervision of our learning algorithm. We further show that our algorithm, the Predictive State Inference Machine, has theoretical performance guarantees on the inference task. Empirical verification across several of dynamical system benchmarks ranging from a simulated helicopter to recorded telemetry traces from a robot showcase the abilities of training Inference Machines.
We propose an Artificial Neural Network (ANN) approach for discovering common hidden variables and for learning of invariant representations. This approach uses synchronicity and concurrence to recognize desired structures in data and discard superfluous structures, in the absence of labels and a data generation model. In the common variable discovery problem, the ANN uses measurements from two distinct sensors to construct a representation of the common hidden variable that is manifested in both sensors, and discards sensor-specific variables. In the invariant representation learning problem, the network uses multiple observations of objects under transformations to construct a representation which is invariant to the transformations.
Software-Defined Data Centers (SDDC) extend virtualization, software-defined networking and systems, and middleboxes to provide a better quality of service (QoS). While many network flow routing algorithms exist, most of them fail to adapt to the dynamic nature of the data center and cloud networks and their users’ and enterprise requirements. This paper presents SMART, a Software-Defined Networking (SDN) middlebox architecture for reliable transfers. As an architectural enhancement for network flows allocation, routing, and control, SMART ensures timely delivery of flows by diverting them to a less congested path dynamically in the software-defined data center networks. SMART also clones packets of higher priority flows to route in an alternative path, along with the original flow. Hence SMART offers a differentiated QoS through varying levels of redundancy in the flows.
Real time application of deep learning algorithms is often hindered by high computational complexity and frequent memory accesses. Network pruning is a promising technique to solve this problem. However, pruning usually results in irregular network connections that not only demand extra representation efforts but also do not fit well on parallel computation. We introduce structured sparsity at various scales for convolutional neural networks, which are channel wise, kernel wise and intra kernel strided sparsity. This structured sparsity is very advantageous for direct computational resource savings on embedded computers, parallel computing environments and hardware based systems. To decide the importance of network connections and paths, the proposed method uses a particle filtering approach. The importance weight of each particle is assigned by computing the misclassification rate with corresponding connectivity pattern. The pruned network is re-trained to compensate for the losses due to pruning. While implementing convolutions as matrix products, we particularly show that intra kernel strided sparsity with a simple constraint can significantly reduce the size of kernel and feature map matrices. The pruned network is finally fixed point optimized with reduced word length precision. This results in significant reduction in the total storage size providing advantages for on-chip memory based implementations of deep neural networks.
Model-free reinforcement learning algorithms such as Q-learning perform poorly in the early stages of learning in noisy environments, because much effort is spent on unlearning biased estimates of the state-action function. The bias comes from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the noise in the space of optimal actions by penalizing deterministic policies at the beginning of the learning. Moreover, it enables naturally incorporating prior distributions over optimal actions when available. The stochastic nature of G-learning also makes it more cost-effective than Q-learning in noiseless but exploration-risky domains. We illustrate these ideas in several examples where G-learning results in significant improvements of the learning rate and the learning cost.
Probabilistic Graphical Models (PGM) are very useful in the fields of machine learning and data mining. The crucial limitation of those models,however, is the scalability. The Bayesian Network, which is one of the most common PGMs used in machine learning and data mining, demonstrates this limitation when the training data consists of random variables, each of them has a large set of possible values. In the big data era, one would expect new extensions to the existing PGMs to handle the massive amount of data produced these days by computers, sensors and other electronic devices. With hierarchical data – data that is arranged in a treelike structure with several levels – one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian Networks become infeasible for representing the probability distributions. In this paper we introduce an extension to Bayesian Networks to handle massive sets of hierarchical data in a reasonable amount of time and space. The proposed model achieves perfect precision of 1.0 and high recall of 0.93 when it is used as multi-label classifier for the annotation of mass spectrometry data. On another data set of 1.5 billion search logs provided by CareerBuilder.com the model was able to predict latent semantic relationships between search keywords with accuracy up to 0.80.