• A deconvolution path for mixtures
• Data-Dependent Path Normalization in Neural Networks
• Images Don’t Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank
• Training CNNs with Low-Rank Filters for Efficient Image Classification
• Sequence Level Training with Recurrent Neural Networks
• A Bayesian hidden Markov mixture model to detect overexpressed chromosome regions
• Hand Pose Estimation through Weakly-Supervised Learning of a Rich Intermediate Representation
• Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters
• Improving Neural Machine Translation Models with Monolingual Data
• Generalizations of the Strong Arnold Property and the minimum number of distinct eigenvalues of a graph
• Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation
• The local structure of $q$-Gaussian processes
• The partial copula: Properties and associated dependence measures
• From restricted permutations to Grassmann necklaces and back again
• F-Index of Some Graph Operations
• Using Deep Learning to Predict Demographics from Mobile Phone Metadata
• Bivariate Binomial Moments and Bonferroni-type Inequalities
• Compressed and quantized correlation estimators
• Use of Eigenvector Centrality to Detect Graph Isomorphism
• Data Representation and Compression Using Linear-Programming Approximations
• Exponential Natural Particle Filter
• Polysemy in Controlled Natural Language Texts
• Crowd Behavior Analysis: A Review where Physics meets Biology
• Dueling Network Architectures for Deep Reinforcement Learning
• Hankel Matrices for the Period-Doubling Sequence
• Tree Embedding and Directed Steiner Problems
• Near-Optimal UGC-hardness of Approximating Max k-CSP_R
• Bayesian identification of bacterial strains from sequencing data
• On the weak convergence of the empirical conditional copula under a simplifying assumption
• Effects of the tempered aging and its Fokker-Planck equation
• Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
• Optimal Approximate Designs for Comparison with Control in Dose-Escalation Studies
• Integrating Deep Features for Material Recognition
• The Euler method for continuous-time nonlinear filtering and stable convergence of conditional law
• Emergence of strongly connected giant components in continuum disk-spin percolation
• Resiliency of Deep Neural Networks under Quantization
• mplrs: A scalable parallel vertex/facet enumeration code
• Trivializing The Energy Landscape Of Deep Networks
• Quasipolynomial Solutions to the Hofstadter Q-Recurrence
• Variance Reduction in SGD by Distributed Importance Sampling
• On Binary Embedding using Circulant Matrices
• Every finite set of integers is an asymptotic approximate group
• Faster Parallel Solver for Positive Linear Programs via Dynamically-Bucketed Selective Coordinate Descent
• Unitary Evolution Recurrent Neural Networks
• Joint Inverse Covariances Estimation with Mutual Linear Structure
• DOC: Deep OCclusion Recovering From A Single Image
• Task Loss Estimation for Sequence Prediction
• Variational Auto-encoded Deep Gaussian Processes
• Deep Metric Learning via Lifted Structured Feature Embedding
• Learning to decompose for object detection and instance segmentation
• Learning Representations from EEG with Deep Recurrent-Convolutional Neural Networks
• Universality in halting time and its applications in optimization
• Learning metrics by learning constrained embeddings of objects to Rn
• A parallel algorithm for the constrained shortest path problem on lattice graphs
• A convnet for non-maximum suppression
• Delving Deeper into Convolutional Networks for Learning Video Representations
• Deconstructing the Ladder Network Architecture
• A Controller Recognizer Framework: How necessary is recognition for control?
• Reasoning in Vector Space: An Exploratory Study of Question Answering
• First Step toward Model-Free, Anonymous Object Tracking with Recurrent Neural Networks
• An Information Retrieval Approach to Finding Dependent Subspaces of Multiple Views
• All you need is a good init
• Deep Manifold Traversal: Changing Labels with Convolutional Features
• Skip-Thought Memory Networks
• Binding via Reconstruction Clustering
• Modelling Spatial Compositional Data: Reconstructions of past land cover and uncertainties
• Fast Parallel SAME Gibbs Sampling on General Discrete Bayesian Networks
• QBDC: Query by dropout committee for training deep supervised architecture
• Direct Loss Minimization for Training Deep Neural Nets
• Better Computer Go Player with Neural Network and Long-term Prediction
• Learning to generate images with perceptual similarity metrics
• Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition
• Denoising Criterion for Variational Auto-Encoding Framework
• Minimum disparity estimation in controlled branching processes
• Multilingual Relation Extraction using Compositional Universal Schema
• Geodesics of learned representations
• Fixed Point Quantization of Deep Convolutional Networks
• Neural Random-Access Machines
• Order Matters: Sequence to sequence for sets
• Griffiths effects and slow dynamics in nearly many-body localized systems
• A Unified Gradient Regularization Family for Adversarial Examples
• Iterative Refinement of Approximate Posterior for Training Directed Belief Networks
• Manifold Regularized Deep Neural Networks using Adversarial Examples
• Unsupervised Learning of Visual Structure using Predictive Generative Networks
• Characterizing graphs of maximum principal ratio
• Quantum phases of interacting electrons in three-dimensional dirty Dirac semimetals
The inferential model (IM) approach, like fiducial and its generalizations, apparently depends on a particular representation of the data-generating process. Here, a generalization of the IM framework is proposed that is more flexible in that it does not require a complete specification of the data-generating process. The generalized IM is valid under mild conditions and, moreover, provides an automatic auxiliary variable dimension reduction, which is valuable from an efficiency point of view. Computation and marginalization are discussed, and two applications of the generalized IM approach are presented.
Normalized nonnegative models assign probability distributions to users and random variables to items; see [Stark, 2015]. Rating an item is regarded as sampling the random variable assigned to the item with respect to the distribution assigned to the user who rates the item. Models of that kind are highly expressive. For instance, using normalized nonnegative models we can understand users’ preferences as mixtures of interpretable user stereotypes, and we can arrange properties of users and items in a hierarchical manner. These features would not be useful if the predictive power of normalized nonnegative models were poor. Thus, we analyze here the performance of normalized nonnegative models for top-N recommendation and observe that their performance matches the performance of methods like PureSVD, which was introduced in [Cremonesi et al., 2010]. We conclude that normalized nonnegative models not only provide accurate recommendations but also deliver (for free) representations that are interpretable. We deepen the discussion of normalized nonnegative models by providing further theoretical insights. In particular, we introduce total variation distance as an operational similarity measure, we discover scenarios where normalized nonnegative models yield unique representations of users and items, we prove that the inference of optimal normalized nonnegative models is NP-hard and, finally, we discuss the relationship between normalized nonnegative models and nonnegative matrix factorization.
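The total variation distance named above has a simple closed form for discrete distributions: half the L1 distance between the two probability vectors. The sketch below assumes that each user is summarized by a probability vector over a shared finite set of outcomes. The function name and the example vectors are illustrative and are not taken from the paper.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Both arguments are probability vectors over the same outcomes
    (an assumption of this sketch).
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same outcome set")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Two hypothetical users with opposite preferences over three outcomes.
user_a = [0.7, 0.2, 0.1]
user_b = [0.1, 0.2, 0.7]
distance = total_variation(user_a, user_b)  # roughly 0.6 for these vectors
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, which is what makes it usable as a bounded similarity measure.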
Class ambiguity is typical in image classification problems with a large number of classes. When classes are difficult to discriminate, it makes sense to allow k guesses and evaluate classifiers based on the top-k error instead of the standard zero-one loss. We propose top-k multiclass SVM as a direct method to optimize for top-k performance. Our generalization of the well-known multiclass SVM is based on a tight convex upper bound of the top-k error. We propose a fast optimization scheme based on an efficient projection onto the top-k simplex, which is of independent interest. Experiments on five datasets show consistent improvements in top-k accuracy compared to various baselines.
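The top-k error used as the evaluation criterion above counts a prediction as wrong only when the true class is missing from the k highest-scoring classes. The sketch below computes that metric only. It does not implement the top-k SVM or the simplex projection, and the score matrix is made up for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k top scores."""
    errors = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this example.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Hypothetical scores for three examples and three classes.
scores = [
    [0.1, 0.5, 0.4],
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [2, 1, 0]
```

With k = 1 this reduces to the standard zero-one error; larger k can only keep the error the same or lower it.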
Multivariate classification methods using explanatory and predictive models are necessary for characterizing subgroups of patients according to their risk profiles. Popular methods include logistic regression and classification trees, with performances that vary according to the nature and the characteristics of the dataset. In the context of imported malaria, we aimed at classifying severity criteria based on a heterogeneous patient population. We investigated these approaches by implementing two different strategies: L1 logistic regression (L1LR), which models a single global solution, and classification trees, which model multiple local solutions corresponding to discriminant subregions of the feature space. For each strategy, we built a standard model and a sparser version of it. As an alternative to pruning, we explore a promising approach that first constrains the tree model with an L1LR-based feature selection, an approach we call L1LR-Tree. The objective is to decrease its vulnerability to small data variations by removing variables corresponding to unstable local phenomena. Our study is twofold: i) from a methodological perspective, comparing the performance and the stability of the three previous methods, i.e., L1LR, classification trees and L1LR-Tree, for the classification of severe forms of imported malaria, and ii) from an applied perspective, improving the actual classification of severe forms of imported malaria by identifying more personalized profiles predictive of several clinical criteria based on variables dismissed for the clinical definition of the disease. The main methodological results show that the combined method L1LR-Tree builds sparse and stable models that significantly predict the different severity criteria and outperforms all the other methods in terms of accuracy.
Recent work on sequence to sequence translation using Recurrent Neural Networks (RNNs) based on Long Short Term Memory (LSTM) architectures has shown great potential for learning useful representations of sequential data. These architectures, using one recurrent neural network to encode sequences into fixed-length representations and one or more networks to decode representations into new sequences, have the advantage of being modular, while also allowing modules to be jointly trained. A one-to-many encoder-decoder(s) scheme allows for a single encoder to provide representations serving multiple purposes. In our case, we present an LSTM encoder network able to produce representations used by two decoders: one that reconstructs, and one that classifies if the training sequence has a labelling. This allows the network to learn representations that are useful for both discriminative and generative tasks at the same time. We show how this paradigm is very well suited for semi-supervised learning with sequences. We test our proposed approach on an action recognition task using motion capture (MOCAP) sequences and show that semi-supervised feature learning can improve movement classification.
We define Recurrent Gaussian Processes (RGP) models, a general family of Bayesian nonparametric models with recurrent GP priors which are able to learn dynamical patterns from sequential data. Similar to Recurrent Neural Networks (RNNs), RGPs can have different formulations for their internal states, distinct inference methods and be extended with deep structures. In this context, we propose a novel deep RGP model whose autoregressive states are latent, thereby performing representation and dynamical learning simultaneously. To fully exploit the Bayesian nature of the RGP model we develop the Recurrent Variational Bayes (REVARB) framework, which enables efficient inference and strong regularization through coherent propagation of uncertainty across the RGP layers and states. We also introduce an RGP extension where variational parameters are greatly reduced by being reparametrized through RNN-based sequential recognition models. We apply our model to the tasks of nonlinear system identification and human motion modeling. The promising results obtained indicate that our RGP model maintains its high flexibility while being able to avoid overfitting and being applicable even when larger datasets are not available.
Representations offered by deep generative models are fundamentally tied to their inference method from data. Variational inference methods require a rich family of approximating distributions. We construct the variational Gaussian process (VGP), a Bayesian nonparametric model which adapts its shape to match complex posterior distributions. The VGP generates approximate posterior samples by generating latent inputs and warping them through random non-linear mappings; the distribution over random mappings is learned during inference, enabling the transformed outputs to adapt to varying complexity. We prove a universal approximation theorem for the VGP, demonstrating its representative power for learning any model. For inference we present a variational objective inspired by autoencoders and perform black box inference over a wide class of models. The VGP achieves new state-of-the-art results for unsupervised learning, inferring models such as the deep latent Gaussian model and the recently proposed DRAW.
Second order stationary models in time series analysis are based on the analysis of essential statistics whose computations follow a common pattern. In particular, with a map-reduce nomenclature, most of these operations can be modeled as mapping a kernel that only depends on short windows of consecutive data and reducing the results produced by each computation. This computational pattern stems from the ergodicity of the model under consideration and is often referred to as weak or short memory when it comes to data indexed with respect to time. In the following we will show how studying weak memory systems can be done in a scalable manner thanks to a framework relying on specifically designed overlapping distributed data structures that enable fragmentation and replication of the data across many machines as well as parallelism in computations. This scheme has been implemented for Apache Spark but is certainly not system specific. Indeed we prove it is also adapted to leveraging high bandwidth fragmented memory blocks on GPUs.
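The map-reduce pattern described above can be sketched on a single machine. In the example below, a kernel that depends only on a short window (here the product x[t] * x[t + lag], the building block of an uncentered autocovariance) is mapped over partitions that overlap by `lag` points, and the partial sums are then reduced. The function names and the partitioning scheme are assumptions made for illustration; this is not the paper's Spark implementation.

```python
def overlapping_partitions(series, size, lag):
    """Split a series into blocks of `size` owned points.

    Each block is padded with up to `lag` points replicated from the next
    block, so every window starting at an owned point is fully local.
    """
    parts = []
    for start in range(0, len(series), size):
        chunk = series[start:start + size + lag]
        owned = min(size, len(series) - start)
        parts.append((chunk, owned))
    return parts


def lagged_products(chunk, owned, lag):
    """Map step: sum x[t] * x[t + lag] over windows starting at owned points."""
    return sum(
        chunk[t] * chunk[t + lag]
        for t in range(owned)
        if t + lag < len(chunk)
    )


def distributed_lag_sum(series, size, lag):
    """Reduce step: add the partial sums produced by each partition."""
    parts = overlapping_partitions(series, size, lag)
    return sum(lagged_products(chunk, owned, lag) for chunk, owned in parts)


def sequential_lag_sum(series, lag):
    """Reference computation on the unpartitioned series."""
    return sum(series[t] * series[t + lag] for t in range(len(series) - lag))
```

Because each window is assigned to exactly one partition (the one that owns its first point), the reduced result matches the sequential computation regardless of the block size.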
We provide a method for approximating Bayesian inference using rejection sampling. We not only make the process efficient, but also dramatically reduce the memory required relative to conventional methods by combining rejection sampling with particle filtering. We also provide an approximate form of rejection sampling that makes rejection filtering tractable in cases where exact rejection sampling is not efficient. Finally, we present several numerical examples of rejection filtering that show its ability to track time dependent parameters in online settings and also benchmark its performance on MNIST classification problems.
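One way to picture the combination of rejection sampling and filtering is sketched below. Candidates are drawn from a Gaussian prior and accepted with probability equal to their likelihood, and the mean and standard deviation of the accepted candidates define the next prior. Only running sums are stored, never the samples themselves, which is where the memory saving comes from. The Gaussian model, the coin-bias example and all names here are assumptions made for illustration, not details taken from the abstract.

```python
import math
import random


def rejection_filter_step(mean, std, likelihood, n_samples, rng):
    """One update: return the mean and std of the accepted prior samples."""
    count, total, total_sq = 0, 0.0, 0.0
    for _ in range(n_samples):
        x = rng.gauss(mean, std)
        # Accept with probability likelihood(x), which must lie in [0, 1].
        if rng.random() < likelihood(x):
            count += 1
            total += x
            total_sq += x * x
    if count < 2:
        return mean, std  # too few acceptances: keep the prior unchanged
    new_mean = total / count
    variance = max(total_sq / count - new_mean ** 2, 1e-12)
    return new_mean, math.sqrt(variance)


def coin_likelihood(bit):
    """Likelihood of one coin flip as a function of the unknown bias p."""
    def likelihood(p):
        if not 0.0 <= p <= 1.0:
            return 0.0  # candidates outside [0, 1] are always rejected
        return p if bit == 1 else 1.0 - p
    return likelihood


def track_bias(bits, mean=0.5, std=0.3, n_samples=2000, seed=0):
    """Process observations one at a time, as in an online setting."""
    rng = random.Random(seed)
    for bit in bits:
        mean, std = rejection_filter_step(
            mean, std, coin_likelihood(bit), n_samples, rng
        )
    return mean, std
```

Calling `track_bias` on a stream of 0/1 flips returns a running estimate of the bias together with its uncertainty. The number of candidates per step trades accuracy against cost.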
Data often comes in the form of an array or matrix. Matrix factorization techniques attempt to recover missing or corrupted entries by assuming that the matrix can be written as the product of two low-rank matrices. In other words, matrix factorization approximates the entries of the matrix by a simple, fixed function (namely, the inner product) acting on the latent feature vectors for the corresponding row and column. Here we consider replacing the inner product by an arbitrary function that we learn from the data at the same time as we learn the latent feature vectors. In particular, we replace the inner product by a multi-layer feed-forward neural network, and learn by alternating between optimizing the network for fixed latent features, and optimizing the latent features for a fixed network. The resulting approach, which we call neural network matrix factorization or NNMF for short, dominates standard low-rank techniques on a suite of benchmarks but is dominated by some recent proposals that also relax the low-rank assumption. Given the vast range of architectures, activation functions, regularizers, and optimization techniques that could be used within the NNMF framework, it seems likely the true potential of the approach has yet to be reached.
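The contrast between the fixed inner product and a learned function can be sketched with a forward pass only. In the example below, the prediction for an entry is produced by a small feed-forward network applied to the concatenated row and column feature vectors. The ReLU activation, the layer sizes and the hand-picked weights are assumptions for illustration, and the alternating optimization described above is omitted.

```python
def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def inner_product_predict(u, v):
    """Standard matrix factorization: entry = <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def nnmf_predict(u, v, layers):
    """Learned function: entry = network(concatenate(u, v))."""
    x = list(u) + list(v)
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:
            x = relu(x)  # nonlinearity on hidden layers only
    return x[0]


# Hypothetical latent features and weights, chosen by hand for this example.
row_features = [1.0, 0.0]
col_features = [0.0, 1.0]
layers = [
    ([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]], [0.0, 0.0]),
    ([[0.5, -1.0]], [0.25]),
]
```

For these orthogonal feature vectors the inner product is 0, whereas the network can still produce a nonzero prediction. This illustrates how the learned function relaxes the low-rank assumption.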
General unsupervised learning is a long-standing conceptual problem in machine learning. Supervised learning is successful because it can be solved by the minimization of the training error cost function. Unsupervised learning is not as successful, because the unsupervised objective may be unrelated to the supervised task of interest. For example, density modelling and reconstruction have often been used for unsupervised learning, but they did not produce the sought-after performance gains, because they have no knowledge of the sought-after supervised tasks. In this paper, we present an unsupervised cost function which we name the Output Distribution Matching (ODM) cost, which measures a divergence between the distribution of predictions and the distribution of labels. The ODM cost is appealing because it is consistent with the supervised cost in the following sense: a perfect supervised classifier is also perfect according to the ODM cost. Therefore, by aggressively optimizing the ODM cost, we are almost guaranteed to improve our supervised performance whenever the space of possible predictions is exponentially large. We demonstrate that the ODM cost works well on a number of small and semi-artificial datasets using no (or almost no) labelled training cases. Finally, we show that the ODM cost can be used for one-shot domain adaptation, which allows the model to classify inputs that differ from the input distribution in significant ways without the need for prior exposure to the new domain.
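The idea of matching the distribution of predictions to the distribution of labels can be illustrated in a deliberately simplified setting. The sketch below compares the average predicted class distribution on unlabelled inputs with a known label marginal, using the KL divergence. Both the choice of KL divergence and the restriction to class marginals are assumptions of this sketch; the abstract does not specify the divergence and targets much larger output spaces.

```python
import math


def odm_cost(pred_probs, label_marginal, eps=1e-12):
    """KL(label marginal || mean predicted distribution).

    `pred_probs` holds one predicted class distribution per unlabelled
    input; `label_marginal` is the known class frequency of the labels.
    """
    n = len(pred_probs)
    k = len(label_marginal)
    mean_pred = [sum(p[c] for p in pred_probs) / n for c in range(k)]
    return sum(
        q * math.log(q / max(m, eps))
        for q, m in zip(label_marginal, mean_pred)
        if q > 0
    )


label_marginal = [0.5, 0.5]
# A classifier that is perfect on a balanced set predicts each class half
# of the time, so its output distribution matches the label marginal.
perfect = [[1.0, 0.0], [0.0, 1.0]]
# A collapsed classifier always predicts class 0.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
```

The perfect classifier attains zero cost, which mirrors the consistency property stated above, while the collapsed classifier is penalized. In this simplified marginal form, zero cost does not by itself imply correct predictions.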
Methods for learning word representations using large text corpora have received much attention lately due to their impressive performance in numerous natural language processing (NLP) tasks such as semantic similarity measurement and word analogy detection. Despite their success, these data-driven word representation learning methods do not consider the rich semantic relational structure between words in a co-occurring context. On the other hand, much manual effort has already gone into the construction of semantic lexicons such as WordNet that represent the meanings of words by defining the various relationships that exist among the words in a language. We consider the question: can we improve the word representations learnt from a corpus by integrating the knowledge from semantic lexicons? For this purpose, we propose a joint word representation learning method that simultaneously predicts the co-occurrences of two words in a sentence subject to the relational constraints given by the semantic lexicon. We use relations that exist between words in the lexicon to regularize the word representations learnt from the corpus. Our proposed method statistically significantly outperforms previously proposed methods for incorporating semantic lexicons into word representations on several benchmark datasets for semantic similarity and word analogy.
Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed on several types of deep learning architectures and we evaluate the performance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important during the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks supports various convolutional algorithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best performance on GPU for training and deployment of LSTM networks. Finally, Caffe is the easiest for evaluating the performance of standard deep architectures.
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks – demonstrating their applicability as general image representations.
We show that a deep convolutional network with an architecture inspired by the models used in image recognition can yield accuracy similar to a long short-term memory (LSTM) network, which achieves state-of-the-art performance on the standard Switchboard automatic speech recognition task. Moreover, we demonstrate that merging the knowledge in the CNN and LSTM models via model compression further improves the accuracy of the convolutional model.
Supervised, semi-supervised, and unsupervised learning estimate a function given input/output samples. Generalization to unseen samples requires making prior assumptions about this function. However, many priors cannot be defined by only taking the function, its input, and its output into account. In this paper, we propose contextual learning, which uses contextual data to define such priors. Contextual data are from a different space than the input and the output of the function, but include information relevant for learning the function. We can exploit this information by formulating priors about how contextual data relate to the target function. Incorporating these priors regularizes the main learning task and thereby improves generalization. This facilitates many challenging learning tasks, in particular when the acquisition of training data is costly or when effective learning would require prohibitively large amounts of data. The first contribution of this paper is a unified view on contextual learning, which subsumes a variety of related approaches, such as multi-task and multi-view learning. The second contribution is a set of ‘design patterns’ for utilizing contextual learning for novel problems. The third contribution is a systematic experimental evaluation of these patterns in two supervised learning tasks.
We present an extension of sparse Canonical Correlation Analysis (CCA) designed for finding multiple-to-multiple linear correlations within a single set of variables. Unlike CCA, which finds correlations between two sets of data where the rows are matched exactly but the columns represent separate sets of variables, the method proposed here, Canonical Autocorrelation Analysis (CAA), finds multivariate correlations within just one set of variables. This can be useful when we look for hidden parsimonious structures in data, each involving only a small subset of all features. In addition, the discovered correlations are highly interpretable as they are formed by pairs of sparse linear combinations of the original features. We show how CAA can be of use as a tool for anomaly detection when the expected structure of correlations is not followed by anomalous data. We illustrate the utility of CAA in two application domains where single-class and unsupervised learning of correlation structures are particularly relevant: breast cancer diagnosis and radiation threat detection. When applied to the Wisconsin Breast Cancer data, single-class CAA is competitive with supervised methods used in literature. On the radiation threat detection task, unsupervised CAA performs significantly better than an unsupervised alternative prevalent in the domain, while providing valuable additional insights for threat analysis.
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic. However, these vector space representations (created through large-scale text analysis) are typically stored verbatim, since their internal structure is opaque. Using word-analogy tests to monitor the level of detail stored in compressed re-representations of the same vector space, the trade-offs between the reduction in memory usage and expressiveness are investigated. A simple scheme is outlined that can reduce the memory footprint of a state-of-the-art embedding by a factor of 10, with only minimal impact on performance. Then, using the same ‘bit budget’, a binary (approximate) factorisation of the same space is also explored, with the aim of creating an equivalent representation with better interpretability.
In this paper we present a method for learning a discriminative classifier from unlabeled or partially labeled data. Our approach is based on an objective function that trades off mutual information between observed examples and their predicted categorical class distribution against robustness of the classifier to an adversarial generative model. The resulting algorithm can either be interpreted as a natural generalization of the generative adversarial networks (GAN) framework or as an extension of the regularized information maximization (RIM) framework to robust classification against an optimal adversary. We empirically evaluate our method – which we dub categorical generative adversarial networks (or CatGAN) – on synthetic data as well as on challenging image classification tasks, demonstrating the robustness of the learned classifiers. We further qualitatively assess the fidelity of samples generated by the adversarial generator that is learned alongside the discriminative classifier, and identify links between the CatGAN objective and discriminative clustering algorithms (such as RIM).
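The mutual information between examples and their predicted class distribution can be estimated from a batch of predictions as the entropy of the average prediction minus the average entropy of the individual predictions. The sketch below computes only this information term and leaves out the adversarial generator entirely. The function names and example batches are illustrative assumptions.

```python
import math


def entropy(p):
    """Shannon entropy in nats, with 0 * log(0) treated as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mutual_information(pred_probs):
    """Empirical estimate: H(mean prediction) - mean H(prediction)."""
    n = len(pred_probs)
    k = len(pred_probs[0])
    marginal = [sum(p[c] for p in pred_probs) / n for c in range(k)]
    conditional = sum(entropy(p) for p in pred_probs) / n
    return entropy(marginal) - conditional


# Confident predictions spread evenly over both classes.
confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
# Maximally uncertain predictions carry no information about the input.
uncertain = [[0.5, 0.5], [0.5, 0.5]]
```

The estimate is largest when each prediction is confident and the classes are used equally often, and it drops to zero when the predictions do not depend on the input.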
Neural word representations have proven useful in Natural Language Processing (NLP) tasks due to their ability to efficiently model complex semantic and syntactic word relationships. However, most techniques model only one representation per word, despite the fact that a single word can have multiple meanings or ‘senses’. Some techniques model words by using multiple vectors that are clustered based on context. However, recent neural approaches rarely focus on the application to a consuming NLP algorithm. Furthermore, the training process of recent word-sense models is expensive relative to single-sense embedding processes. This paper presents a novel approach which addresses these concerns by modeling multiple embeddings for each word based on supervised disambiguation, which provides a fast and accurate way for a consuming NLP model to select a sense-disambiguated embedding. We demonstrate that these embeddings can disambiguate both contrastive senses such as nominal and verbal senses as well as nuanced senses such as sarcasm. We further evaluate Part-of-Speech disambiguated embeddings on neural dependency parsing, yielding a greater than 8% average error reduction in unlabeled attachment scores across 6 languages.
Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).