When training deep neural networks, it is typically assumed that the training examples are uniformly difficult to learn. Or, to restate, it is assumed that the training error will be uniformly distributed across the training examples. Based on these assumptions, each training example is used an equal number of times. However, this assumption may not be valid in many cases. ‘Oddball SGD’ (novelty-driven stochastic gradient descent) was recently introduced to drive training probabilistically according to the error distribution – training frequency is proportional to training error magnitude. In this article, using a deep neural network to encode a video, we show that oddball SGD can be used to enforce uniform error across the training set.
Top-performing machine learning systems, such as deep neural networks, large ensembles and complex probabilistic graphical models, can be expensive to store, slow to evaluate and hard to integrate into larger systems. Ideally, we would like to replace such cumbersome models with simpler models that perform equally well. In this thesis, we study knowledge distillation, the idea of extracting the knowledge contained in a complex model and injecting it into a more convenient model. We present a general framework for knowledge distillation, whereby a convenient model of our choosing learns how to mimic a complex model, by observing the latter’s behaviour and being penalized whenever it fails to reproduce it. We develop our framework within the context of three distinct machine learning applications: (a) model compression, where we compress large discriminative models, such as ensembles of neural networks, into models of much smaller size; (b) compact predictive distributions for Bayesian inference, where we distil large bags of MCMC samples into compact predictive distributions in closed form; (c) intractable generative models, where we distil unnormalizable models such as RBMs into tractable models such as NADEs. We contribute to the state of the art with novel techniques and ideas. In model compression, we describe and implement derivative matching, which allows for better distillation when data is scarce. In compact predictive distributions, we introduce online distillation, which allows for significant savings in memory. Finally, in intractable generative models, we show how to use distilled models to robustly estimate intractable quantities of the original model, such as its intractable partition function.
This paper studies Cox’s regression hazard model with an unobservable random frailty where no specific distribution is postulated for the frailty variable, and the marginal lifetime distribution allows both parametric and non-parametric models. Laplace’s approximation method and gradient search on smooth manifolds embedded in Euclidean space are applied, and a non-iterative profile likelihood optimization method is proposed for estimating the regression coefficients. The proposed method is compared with the Expected-Maximization method developed based on a gamma frailty assumption, and also in the case when the frailty model is misspecified.
We consider sequential decision making problems for binary classification scenario in which the learner takes an active role in repeatedly selecting samples from the action pool and receives the binary label of the selected alternatives. Our problem is motivated by applications where observations are time consuming and/or expensive, resulting in small samples. The goal is to identify the best alternative with the highest response. We use Bayesian logistic regression to predict the response of each alternative. By formulating the problem as a Markov decision process, we develop a knowledge-gradient type policy to guide the experiment by maximizing the expected value of information of labeling each alternative and provide a finite-time analysis on the estimated error. Experiments on benchmark UCI datasets demonstrate the effectiveness of the proposed method.
Recommender systems benefit us in tackling the problem of information overload by predicting our potential choices among diverse niche objects. So far, a variety of personalized recommendation algorithms have been proposed and most of them are based on similarities, such as collaborative filtering and mass diffusion. Here, we propose a novel similarity index named CosRA, which combines advantages of both the cosine index and the resource-allocation (RA) index. By applying the CosRA index to real recommender systems including MovieLens, Netflix and RYM, we show that the CosRA-based method has better performance in accuracy, diversity and novelty than the state-of-the-art methods. Moreover, the CosRA index is free of parameters, which is a significant advantage in real applications. Further experiments show that the introduction of two turnable parameters cannot remarkably improve the overall performance of CosRA.
We present an objective Bayes method for covariance selection in Gaussian multivariate regression models whose error term has a covariance structure which is Markov with respect to a Directed Acyclic Graph (DAG). The scope is covariate-adjusted sparse graphical model selection, a topic of growing importance especially in the area of genetical genomics (eQTL analysis). Specifically, we provide a closed-form expression for the marginal likelihood of any DAG (with small parent sets) whose computation virtually requires no subjective elicitation by the user and involves only conjugate matrix normal Wishart distributions. This is made possible by a specific form of prior assignment, whereby only one prior under the complete DAG model need be specified, based on the notion of fractional Bayes factor. All priors under the other DAG models are derived using prior modularity, and global parameter independence, in the terminology of Geiger & Heckerman (2002). Since the marginal likelihood we obtain is constant within each class of Markov equivalent DAGs, our method naturally specializes to covariate-adjusted decomposable graphical models.
We discuss a strategy of dimensionality reduction that is based on the use of an overcomplete basis, and evaluate its performance when a random matrix is used as this basis. The performance is assessed in terms of the trade-off relation between the reconstruction error $\epsilon$ and the reduction rate $r$. First, we evaluate the performance in the case that the $\ell_0$– and $\ell_1$-norm based methods are carried out ideally, using methods of statistical mechanics. Our result provides a quantitative relation between the reconstruction error and the number of usable basis vectors, and clarifies the fact that the $\ell_0$-based method greatly outperforms the $\ell_1$-based one. An interesting outcome of our analysis is that any small value of error is achievable for any fixed reduction rate $r$ in the large-size limit of the overcomplete basis, for both the $\ell_0$– and $\ell_1$-based methods. The difference between the two methods is manifested in the size of the overcomplete basis that is required in order to achieve the desired value for the reconstruction error $\hat{\epsilon}$. As $\hat{\epsilon}$ decreases, the required size grows in a polynomial and an exponential manners for the $\ell_0$– and $\ell_1$-based methods, respectively. Second, we examine the practical performances of two well-known algorithms, orthogonal matching pursuit and approximate message passing, when they are used to execute the $\ell_0$– and $\ell_1$-based methods, respectively. Our examination shows that orthogonal matching pursuit achieves a much better performance than the exact execution of the $\ell_1$-based method, as well as approximate message passing. However, regarding the $\ell_0$-based method, there is still room to design more effective greedy algorithms than orthogonal matching pursuit. Finally, we evaluate the performances of the algorithms when they are applied to image data compression.
In this paper, we propose a novel data structure called PUN-list, which maintains both the utility information about an itemset and utility upper bound for facilitating the processing of mining high utility itemsets. Based on PUN-lists, we present a method, called MIP (Mining high utility Itemset using PUN-Lists), for fast mining high utility itemsets. The efficiency of MIP is achieved with three techniques. First, itemsets are represented by a highly condensed data structure, PUN-list, which avoids costly, repeatedly utility computation. Second, the utility of an itemset can be efficiently calculated by scanning the PUN-list of the itemset and the PUN-lists of long itemsets can be fast constructed by the PUN-lists of short itemsets. Third, by employing the utility upper bound lying in the PUN-lists as the pruning strategy, MIP directly discovers high utility itemsets from the search space, called set-enumeration tree, without generating numerous candidates. Extensive experiments on various synthetic and real datasets show that PUN-list is very effective since MIP is at least an order of magnitude faster than recently reported algorithms on average.
Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.
We consider high-dimensional multivariate time series with additional structure. The additional structure takes the form of a metric space endowed with local dependence, i.e., time series that are close in space may be dependent whereas distant time series are independent. Such additional structure is available in a wide range of applications, e.g., in studies of air pollution and climate change. We introduce a simple two-step estimation approach that takes advantage of local dependence. The two-step estimation approach ?first estimates the range of dependence and then exploits the estimated range of dependence to estimate local dependencies among time series. We shed light on the theoretical properties of the two-step estimation approach under high-dimensional scaling and provide non-asymptotic error bounds that hold with high probability. The usefulness of the two-step estimation approach is demonstrated by an application to air pollution in the U.S. The two-step estimation approach can be extended to other high-dimensional models, such as high-dimensional graphical models, as long as additional structure is available and consistent model selection in high dimensions is possible.