Uniform Learning in a Deep Neural Network via ‘Oddball’ Stochastic Gradient Descent

When training deep neural networks, it is typically assumed that the training examples are uniformly difficult to learn. Or, to restate, it is assumed that the training error will be uniformly distributed across the training examples. Based on these assumptions, each training example is used an equal number of times. However, this assumption may not be valid in many cases. ‘Oddball SGD’ (novelty-driven stochastic gradient descent) was recently introduced to drive training probabilistically according to the error distribution – training frequency is proportional to training error magnitude. In this article, using a deep neural network to encode a video, we show that oddball SGD can be used to enforce uniform error across the training set.

Distilling Model Knowledge

Top-performing machine learning systems, such as deep neural networks, large ensembles and complex probabilistic graphical models, can be expensive to store, slow to evaluate and hard to integrate into larger systems. Ideally, we would like to replace such cumbersome models with simpler models that perform equally well. In this thesis, we study knowledge distillation, the idea of extracting the knowledge contained in a complex model and injecting it into a more convenient model. We present a general framework for knowledge distillation, whereby a convenient model of our choosing learns how to mimic a complex model, by observing the latter’s behaviour and being penalized whenever it fails to reproduce it. We develop our framework within the context of three distinct machine learning applications: (a) model compression, where we compress large discriminative models, such as ensembles of neural networks, into models of much smaller size; (b) compact predictive distributions for Bayesian inference, where we distil large bags of MCMC samples into compact predictive distributions in closed form; (c) intractable generative models, where we distil unnormalizable models such as RBMs into tractable models such as NADEs. We contribute to the state of the art with novel techniques and ideas. In model compression, we describe and implement derivative matching, which allows for better distillation when data is scarce. In compact predictive distributions, we introduce online distillation, which allows for significant savings in memory. Finally, in intractable generative models, we show how to use distilled models to robustly estimate intractable quantities of the original model, such as its intractable partition function.

Generalizing the Frailty Assumptions in Survival Analysis

This paper studies Cox’s regression hazard model with an unobservable random frailty where no specific distribution is postulated for the frailty variable, and the marginal lifetime distribution allows both parametric and non-parametric models. Laplace’s approximation method and gradient search on smooth manifolds embedded in Euclidean space are applied, and a non-iterative profile likelihood optimization method is proposed for estimating the regression coefficients. The proposed method is compared with the Expected-Maximization method developed based on a gamma frailty assumption, and also in the case when the frailty model is misspecified.

The Knowledge Gradient with Logistic Belief Models for Binary Classification

We consider sequential decision making problems for binary classification scenario in which the learner takes an active role in repeatedly selecting samples from the action pool and receives the binary label of the selected alternatives. Our problem is motivated by applications where observations are time consuming and/or expensive, resulting in small samples. The goal is to identify the best alternative with the highest response. We use Bayesian logistic regression to predict the response of each alternative. By formulating the problem as a Markov decision process, we develop a knowledge-gradient type policy to guide the experiment by maximizing the expected value of information of labeling each alternative and provide a finite-time analysis on the estimated error. Experiments on benchmark UCI datasets demonstrate the effectiveness of the proposed method.

A novel similarity index for better personalized recommendation

Recommender systems benefit us in tackling the problem of information overload by predicting our potential choices among diverse niche objects. So far, a variety of personalized recommendation algorithms have been proposed and most of them are based on similarities, such as collaborative filtering and mass diffusion. Here, we propose a novel similarity index named CosRA, which combines advantages of both the cosine index and the resource-allocation (RA) index. By applying the CosRA index to real recommender systems including MovieLens, Netflix and RYM, we show that the CosRA-based method has better performance in accuracy, diversity and novelty than the state-of-the-art methods. Moreover, the CosRA index is free of parameters, which is a significant advantage in real applications. Further experiments show that the introduction of two turnable parameters cannot remarkably improve the overall performance of CosRA.

Objective Bayes Covariate-Adjusted Sparse Graphical Model Selection

We present an objective Bayes method for covariance selection in Gaussian multivariate regression models whose error term has a covariance structure which is Markov with respect to a Directed Acyclic Graph (DAG). The scope is covariate-adjusted sparse graphical model selection, a topic of growing importance especially in the area of genetical genomics (eQTL analysis). Specifically, we provide a closed-form expression for the marginal likelihood of any DAG (with small parent sets) whose computation virtually requires no subjective elicitation by the user and involves only conjugate matrix normal Wishart distributions. This is made possible by a specific form of prior assignment, whereby only one prior under the complete DAG model need be specified, based on the notion of fractional Bayes factor. All priors under the other DAG models are derived using prior modularity, and global parameter independence, in the terminology of Geiger & Heckerman (2002). Since the marginal likelihood we obtain is constant within each class of Markov equivalent DAGs, our method naturally specializes to covariate-adjusted decomposable graphical models.

Dimensionality reduction based on an overcomplete basis

We discuss a strategy of dimensionality reduction that is based on the use of an overcomplete basis, and evaluate its performance when a random matrix is used as this basis. The performance is assessed in terms of the trade-off relation between the reconstruction error \epsilon and the reduction rate r. First, we evaluate the performance in the case that the \ell_0– and \ell_1-norm based methods are carried out ideally, using methods of statistical mechanics. Our result provides a quantitative relation between the reconstruction error and the number of usable basis vectors, and clarifies the fact that the \ell_0-based method greatly outperforms the \ell_1-based one. An interesting outcome of our analysis is that any small value of error is achievable for any fixed reduction rate r in the large-size limit of the overcomplete basis, for both the \ell_0– and \ell_1-based methods. The difference between the two methods is manifested in the size of the overcomplete basis that is required in order to achieve the desired value for the reconstruction error \hat{\epsilon}. As \hat{\epsilon} decreases, the required size grows in a polynomial and an exponential manners for the \ell_0– and \ell_1-based methods, respectively. Second, we examine the practical performances of two well-known algorithms, orthogonal matching pursuit and approximate message passing, when they are used to execute the \ell_0– and \ell_1-based methods, respectively. Our examination shows that orthogonal matching pursuit achieves a much better performance than the exact execution of the \ell_1-based method, as well as approximate message passing. However, regarding the \ell_0-based method, there is still room to design more effective greedy algorithms than orthogonal matching pursuit. Finally, we evaluate the performances of the algorithms when they are applied to image data compression.

An Efficient Data Structure for Fast Mining High Utility Itemsets

In this paper, we propose a novel data structure called PUN-list, which maintains both the utility information about an itemset and utility upper bound for facilitating the processing of mining high utility itemsets. Based on PUN-lists, we present a method, called MIP (Mining high utility Itemset using PUN-Lists), for fast mining high utility itemsets. The efficiency of MIP is achieved with three techniques. First, itemsets are represented by a highly condensed data structure, PUN-list, which avoids costly, repeatedly utility computation. Second, the utility of an itemset can be efficiently calculated by scanning the PUN-list of the itemset and the PUN-lists of long itemsets can be fast constructed by the PUN-lists of short itemsets. Third, by employing the utility upper bound lying in the PUN-lists as the pruning strategy, MIP directly discovers high utility itemsets from the search space, called set-enumeration tree, without generating numerous candidates. Extensive experiments on various synthetic and real datasets show that PUN-list is very effective since MIP is at least an order of magnitude faster than recently reported algorithms on average.

Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network

Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.

High-Dimensional Multivariate Time Series With Local Dependence

We consider high-dimensional multivariate time series with additional structure. The additional structure takes the form of a metric space endowed with local dependence, i.e., time series that are close in space may be dependent whereas distant time series are independent. Such additional structure is available in a wide range of applications, e.g., in studies of air pollution and climate change. We introduce a simple two-step estimation approach that takes advantage of local dependence. The two-step estimation approach ?first estimates the range of dependence and then exploits the estimated range of dependence to estimate local dependencies among time series. We shed light on the theoretical properties of the two-step estimation approach under high-dimensional scaling and provide non-asymptotic error bounds that hold with high probability. The usefulness of the two-step estimation approach is demonstrated by an application to air pollution in the U.S. The two-step estimation approach can be extended to other high-dimensional models, such as high-dimensional graphical models, as long as additional structure is available and consistent model selection in high dimensions is possible.

The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method

1-Sperner hypergraphs

On Modeling and Estimation for the Relative Risk and Risk Difference

Exact Inference Techniques for the Dynamic Analysis of Attack Graphs

Power domination and zero forcing

Dynamic Factor Models, Cointegration, and Error Correction Mechanisms

Index theory for partial-bijections

Additivity properties of sofic entropy and measures on model spaces

Mapping Unseen Words to Task-Trained Embedding Spaces

Duality of codes supported on regular lattices, with an application to enumerative combinatorics

Self-dual binary codes from small covers and simple polytopes

Automata networks for multi-party communication in the Naming Game

Synchronization of Pulse-Coupled Oscillators and Clocks under Minimal Connectivity Assumptions

Cliques in dense inhomogeneous random graphs

On Maximal Correlation, Mutual Information and Data Privacy

Analysing multiple types of molecular profiles simultaneously: connecting the needles in the haystack

Gravitation versus Brownian motion

On the structure of (banner, odd hole)-free graphs

Small-Area Orthogonal Drawings of 3-Connected Graphs

Information processing occurs via critical avalanches in a model of the primary visual cortex

The quadratic stochastic Euclidean bipartite matching problem

Dependent time changed processes with applications to nonlinear ocean waves

Invariant sets for QMF functions

$A$-Hypergeometric Distributions and Newton Polytopes

Reduced-Order Modeling Of Hidden Dynamics

Data Transmission with Reduced Delay for Distributed Acoustic Sensors

Empirical Analysis of Sampling Based Estimators for Evaluating RBMs

Explicit Parallel-in-time Integration of a Linear Acoustic-Advection System

An Erd\’ os-R\’ enyi law for nonconventional sums

Conditions for Posterior Contraction in the Sparse Normal Means Problem

Combining behavioural types with security analysis

Distributed Estimation of Graph 4-Profiles

Non-linear conductance in mesoscopic weakly disordered wires — Interaction and magnetic field asymmetry

The F-snapshot Problem

A characterization of linearizable instances of the quadratic minimum spanning tree problem

Structure and automorphisms of primitive coherent configurations

Moderate Deviation Principles for Weakly Interacting Particle Systems

Basic inequalities for weighted entropies

Integrability Conditions for SDEs and Semi-Linear SPDEs

Star-Replaced Networks: A Generalised Class of Dual-Port Server-Centric Data Centre Networks

The largest $H$-eigenvalue and spectral radius of Laplacian tensor of non-odd-bipartite generalized power hypergraphs

Ontology-based Secure Retrieval of Semantically Significant Visual Contents

Data-Efficient Learning of Feedback Policies from Image Pixels using Deep Dynamical Models

Hockey Player Performance via Regularized Logistic Regression

Performance Analysis of an Astrophysical Simulation Code on the Intel Xeon Phi Architecture

On the characteristic polynomial of a supertropical adjoint matrix

Discrete Envy-free Division of Necklaces and Maps

Resolving References to Objects in Photographs using the Words-As-Classifiers Model

Combinatorics of symmetric plabic graphs

Stochastic Solution of Fractional Fokker-Planck Equations with Space-Time-Dependent Coefficients

Multivariate Gaussian approximations on Markov chaoses