In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks,using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% on object recognition while improving accuracy with 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains.
While model-based reinforcement learning has empirically been shown to significantly reduce the sample complexity that hinders model-free RL, the theoretical understanding of such methods has been rather limited. In this paper, we introduce a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees, and a practical algorithm Optimistic Lower Bounds Optimization (OLBO). In particular, we derive a theoretical guarantee of monotone improvement for model-based RL with our framework. We iteratively build a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and maximize it jointly over the policy and the model. Assuming the optimization in each iteration succeeds, the expected reward is guaranteed to improve. The framework also incorporates an optimism-driven perspective, and reveals the intrinsic measure for the model prediction error. Preliminary simulations demonstrate that our approach outperforms the standard baselines on continuous control benchmark tasks.
Automatic machine learning performs predictive modeling with high performing machine learning tools without human interference. This is achieved by making machine learning applications parameter-free, i.e. only a dataset is provided while the complete model selection and model building process is handled internally through (often meta) optimization. Projects like Auto-WEKA and auto-sklearn aim to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem resulting in huge configuration spaces. However, for most real-world applications, the optimization over only a few different key learning algorithms can not only be sufficient, but also potentially beneficial. The latter becomes apparent when one considers that models have to be validated, explained, deployed and maintained. Here, less complex model are often preferred, for validation or efficiency reasons, or even a strict requirement. Automatic gradient boosting simplifies this idea one step further, using only gradient boosting as a single learning algorithm in combination with model-based hyperparameter tuning, threshold optimization and encoding of categorical features. We introduce this general framework as well as a concrete implementation called autoxgboost. It is compared to current AutoML projects on 16 datasets and despite its simplicity is able to achieve comparable results on about half of the datasets as well as performing best on two.
Observed multidimensional network data can have different levels of complexity, as nodes may be characterized by heterogeneous individual-specific features. Also, such characteristics may vary across the networks. This article discusses a novel class of models for multidimensional networks, able to deal with different levels of heterogeneity within and between networks. The proposed framework is developed within the family of latent space models, in order to distinguish recurrent symmetrical relations between the nodes from node-specific features in the different views. Models parameters are estimated via a Markov Chain Monte Carlo algorithm. Simulated data and also FAO fruits import/export data are analysed to illustrate the performances of the proposed models.
Deep generative models have shown promising results in generating realistic images, but it is still non-trivial to generate images with complicated structures. The main reason is that most of the current generative models fail to explore the structures in the images including spatial layout and semantic relations between objects. To address this issue, we propose a novel deep structured generative model which boosts generative adversarial networks (GANs) with the aid of structure information. In particular, the layout or structure of the scene is encoded by a stochastic and-or graph (sAOG), in which the terminal nodes represent single objects and edges represent relations between objects. With the sAOG appropriately harnessed, our model can successfully capture the intrinsic structure in the scenes and generate images of complicated scenes accordingly. Furthermore, a detection network is introduced to infer scene structures from a image. Experimental results demonstrate the effectiveness of our proposed method on both modeling the intrinsic structures, and generating realistic images.
In this study, we address the challenge of measuring the ability of a recommender system to make surprising recommendations. Although current evaluation methods make it possible to determine if two algorithms can make recommendations with a significant difference in their average surprise measure, it could be of interest to our community to know how competent an algorithm is at embedding surprise in its recommendations, without having to resort to making a direct comparison with another algorithm. We argue that a) surprise is a finite resource in a recommender system, b) there is a limit to how much surprise any algorithm can embed in a recommendation, and c) this limit can provide us with a scale against which the performance of any algorithm can be measured. By exploring these ideas, it is possible to define the concepts of maximum and minimum potential surprise and design a surprise metric called ‘normalised surprise’ that employs these limits to potential surprise. Two experiments were conducted to test the proposed metric. The aim of the first was to validate the quality of the estimates of minimum and maximum potential surprise produced by a greedy algorithm. The purpose of the second experiment was to analyse the behaviour of the proposed metric using the MovieLens dataset. The results confirmed the behaviour that was expected, and showed that the proposed surprise metric is both effective and consistent for differing choices of recommendation algorithms, data representations and distance functions.
Multimodal machine learning is a core research area spanning the language, visual and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence to sequence (Seq2Seq) methods: a \textit{Seq2Seq Modality Translation Model} and a \textit{Hierarchical Seq2Seq Modality Translation Model}. We also explore multiple different variations on the multimodal inputs and outputs of these seq2seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, specifically in the Bimodal case where our model is able to improve F1 Score by twelve points. We also discuss future directions for multimodal Seq2Seq methods.
An analytic process is iterative between two agents, an analyst and an analytic toolbox. Each iteration comprises three main steps: preparing a dataset, running an analytic tool, and evaluating the result, where dataset preparation and result evaluation, conducted by the analyst, are largely domain-knowledge driven. In this work, the focus is on automating the result evaluation step. The underlying problem is to identify plots that are deemed interesting by an analyst. We propose a methodology to learn such analyst’s intent based on Generative Adversarial Networks (GANs) and demonstrate its applications in the context of production yield optimization using data collected from several product lines.
Deep Learning has the hierarchical network architecture to represent the complicated features of input patterns. Such architecture is well known to represent higher learning capability compared with some conventional models if the best set of parameters in the optimal network structure is found. We have been developing the adaptive learning method that can discover the optimal network structure in Deep Belief Network (DBN). The learning method can construct the network structure with the optimal number of hidden neurons in each Restricted Boltzmann Machine and with the optimal number of layers in the DBN during learning phase. The network structure of the learning method can be self-organized according to given input patterns of big data set. In this paper, we embed the adaptive learning method into the recurrent temporal RBM and the self-generated layer into DBN. In order to verify the effectiveness of our proposed method, the experimental results are higher classification capability than the conventional methods in this paper.
We propose a novel end-to-end neural network architecture that, once trained, directly outputs a probabilistic clustering of a batch of input examples in one pass. It estimates a distribution over the number of clusters $k$, and for each $1 \leq k \leq k_\mathrm{max}$, a distribution over the individual cluster assignment for each data point. The network is trained in advance in a supervised fashion on separate data to learn grouping by any perceptual similarity criterion based on pairwise labels (same/different group). It can then be applied to different data containing different groups. We demonstrate promising performance on high-dimensional data like images (COIL-100) and speech (TIMIT). We call this “learning to cluster” and show its conceptual difference to deep metric learning, semi-supervise clustering and other related approaches while having the advantage of performing learnable clustering fully end-to-end.
The data mining technique of time series clustering is well established in many fields. However, as an unsupervised learning method, it requires making choices that are nontrivially influenced by the nature of the data involved. The aim of this paper is to verify usefulness of the time series clustering method for macroeconomics research, and to develop the most suitable methodology. By extensively testing various possibilities, we arrive at a choice of a dissimilarity measure (compression-based dissimilarity measure, or CDM) which is particularly suitable for clustering macroeconomic variables. We check that the results are stable in time and reflect large-scale phenomena such as crises. We also successfully apply our findings to analysis of national economies, specifically to identyfing their structural relations.
Missing data are ubiquitous in many domains such as healthcare. Depending on how they are missing, the (conditional) independence relations in the observed data may be different from those for the complete data generated by the underlying causal process and, as a consequence, simply applying existing causal discovery methods to the observed data may lead to wrong conclusions. It is then essential to extend existing causal discovery approaches to find true underlying causal structure from such incomplete data. In this paper, we aim at solving this problem for data that are missing with different mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). With missingness mechanisms represented by missingness Graph (m-Graph), we analyze conditions under which addition correction is needed to derive conditional independence/dependence relations in the complete data. Based on our analysis, we propose missing value PC (MVPC), which combines additional corrections with traditional causal discovery algorithm, in particular, PC. Our proposed MVPC is shown in theory to give asymptotically correct results even using data that are MAR and MNAR. Experiment results illustrate that the proposed algorithm can correct the conditional independence for values MCAR, MAR and rather general cases of values MNAR both with synthetic data as well as real-life healthcare application.
Due to the iterative nature of most nonnegative matrix factorization (\textsc{NMF}) algorithms, initialization is a key aspect as it significantly influences both the convergence and the final solution obtained. Many initialization schemes have been proposed for NMF, among which one of the most popular class of methods are based on the singular value decomposition (SVD). However, these SVD-based initializations do not satisfy a rather natural condition, namely that the error should decrease as the rank of factorization increases. In this paper, we propose a novel SVD-based \textsc{NMF} initialization to specifically address this shortcoming by taking into account the SVD factors that were discarded to obtain a nonnegative initialization. This method, referred to as nonnegative SVD with low-rank correction (NNSVD-LRC), allows us to significantly reduce the initial error at a negligible additional computational cost using the low-rank structure of the discarded SVD factors. NNSVD-LRC has two other advantages compared to previous SVD-based initializations: (1) it provably generates sparse initial factors, and (2) it is faster as it only requires to compute a truncated SVD of rank $\lceil r/2 + 1 \rceil$ where $r$ is the factorization rank of the sought NMF decomposition (as opposed to a rank-$r$ truncated SVD for other methods). We show on several standard dense and sparse data sets that our new method competes favorably with state-of-the-art SVD-based initializations for NMF.
While we are usually focused on predicting future values of time series, it is often valuable to additionally predict their entire probability distributions, for example to evaluate risk or Monte Carlo simulations. On example of time series of $\approx$ 30000 Dow Jones Industrial Averages, there will be shown application of hierarchical correlation reconstruction for this purpose: mean-square fitting polynomial as joint density for (current value, context), where context is for example a few previous values. Then substituting the currently observed context and normalizing density to 1, we get predicted probability distribution for the current value. In contrast to standard machine learning approaches like neural networks, optimal coefficients here can be inexpensively directly calculated, are unique and independent, each has a specific cumulant-like interpretation, and such approximation can approach complete description of any joint distribution – providing a perfect tool to quantitatively describe and exploit statistical dependencies in time series.