Natural language processing (NLP) can be done using either top-down (theory driven) and bottom-up (data driven) approaches, which we call mechanistic and phenomenological respectively. The approaches are frequently considered to stand in opposition to each other. Examining some recent approaches in deep learning we argue that deep neural networks incorporate both perspectives and, furthermore, that leveraging this aspect of deep learning may help in solving complex problems within language technology, such as modelling language and perception in the domain of spatial cognition.
We propose a novel Convolutional Neural Network (CNN) compression algorithm based on coreset representations of filters. We exploit the redundancies extant in the space of CNN weights and neuronal activations (across samples) in order to obtain compression. Our method requires no retraining, is easy to implement, and obtains state-of-the-art compression performance across a wide variety of CNN architectures. Coupled with quantization and Huffman coding, we create networks that provide AlexNet-like accuracy, with a memory footprint that is $832\times$ smaller than the original AlexNet, while also introducing significant reductions in inference time as well. Additionally these compressed networks when fine-tuned, successfully generalize to other domains as well.
Determining significant prognostic biomarkers is of increasing importance in many areas of medicine. In order to translate a continuous biomarker into a clinical decision, it is often necessary to determine cut-points. There is so far no standard method to help evaluate how many cut-points are optimal for a given feature in a survival analysis setting. Moreover, most existing methods are univariate, hence not well suited for high-dimensional frameworks. This paper introduces a prognostic method called Binacox to deal with the problem of detecting multiple cut-points per features in a multivariate setting where a large number of continuous features are available. It is based on the Cox model and combines one-hot encodings with the binarsity penalty. This penalty uses total-variation regularization together with an extra linear constraint to avoid collinearity between the one-hot encodings and enable feature selection. A non-asymptotic oracle inequality is established. The statistical performance of the method is then examined on an extensive Monte Carlo simulation study, and finally illustrated on three publicly available genetic cancer datasets with high-dimensional features. On this datasets, our proposed methodology significantly outperforms the state-of-the-art survival models regarding risk prediction in terms of C-index, with a computing time orders of magnitude faster. In addition, it provides powerful interpretability by automatically pinpointing significant cut-points on relevant features from a clinical point of view.
There is a trend towards increased specialization of data management software for performance reasons. In this paper, we study the automatic specialization and optimization of database application programs — sequences of queries and updates, augmented with control flow constructs as they appear in database scripts, UDFs, transactional workloads and triggers in languages such as PL/SQL. We show how to build an optimizing compiler for database application programs using generative programming and state-of-the-art compiler technology. We evaluate a hand-optimized low-level implementation of TPC-C, and identify the key optimization techniques that account for its good performance. Our compiler fully automates these optimizations and, applied to this benchmark, outperforms the manually optimized baseline by a factor of two. By selectively disabling some of the optimizations in the compiler, we derive a clinical and precise way of obtaining insight into their individual performance contributions.
Compared to humans, machine learning models generally require significantly more training examples and fail to extrapolate from experience to solve previously unseen challenges. To help close this performance gap, we augment single-task neural networks with a meta-recognition model which learns a succinct model code via its autoencoder structure, using just a few informative examples. The model code is then employed by a meta-generative model to construct parameters for the task-specific model. We demonstrate that for previously unseen tasks, without additional training, this Meta-Learning Autoencoder (MeLA) framework can build models that closely match the true underlying models, with loss significantly lower than given by fine-tuned baseline networks, and performance that compares favorably with state-of-the-art meta-learning algorithms. MeLA also adds the ability to identify influential training examples and predict which additional data will be most valuable to acquire to improve model prediction.
The Minnesota Supercomputing Institute has implemented Jupyterhub and the Jupyter notebook server as a general-purpose point-of-entry to interactive high performance computing services. This mode of operation runs counter to traditional job-oriented HPC operations, but offers significant advantages for ease-of-use, data exploration, prototyping, and workflow development. From the user perspective, these features bring the computing cluster nearer to parity with emerging cloud computing options. On the other hand, retreating from fully-scheduled, job-based resource allocation poses challenges for resource availability and utilization efficiency, and can involve tools and technologies outside the typical core competencies of a supercomputing center’s operations staff. MSI has attempted to mitigate these challenges by adopting Jupyter as a common technology platform for interactive services, capable of providing command-line, graphical, and workflow-oriented access to HPC resources while still integrating with job scheduling systems and using existing compute resources. This paper will describe the mechanisms that MSI has put in place, advantages for research and instructional uses, and lessons learned.
Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.
Recent work has shown that deep neural networks are highly sensitive to tiny perturbations of input images, giving rise to adversarial examples. Though this property is usually considered a weakness of learned models, we explore whether it can be beneficial. We find that neural networks can learn to use invisible perturbations to encode a rich amount of useful information. In fact, one can exploit this capability for the task of data hiding. We jointly train encoder and decoder networks, where given an input message and cover image, the encoder produces a visually indistinguishable encoded image, from which the decoder can recover the original message. We show that these encodings are competitive with existing data hiding algorithms, and further that they can be made robust to noise: our models learn to reconstruct hidden information in an encoded image despite the presence of Gaussian blurring, pixel-wise dropout, cropping, and JPEG compression. Even though JPEG is non-differentiable, we show that a robust model can be trained using differentiable approximations. Finally, we demonstrate that adversarial training improves the visual quality of encoded images.
Introducing variability while maintaining coherence is a core task in learning to generate utterances in conversation. Standard neural encoder-decoder models and their extensions using conditional variational autoencoder often result in either trivial or digressive responses. To overcome this, we explore a novel approach that injects variability into neural encoder-decoder via the use of external memory as a mixture model, namely Variational Memory Encoder-Decoder (VMED). By associating each memory read with a mode in the latent mixture distribution at each timestep, our model can capture the variability observed in sequential data such as natural conversations. We empirically compare the proposed model against other recent approaches on various conversational datasets. The results show that VMED consistently achieves significant improvement over others in both metric-based and qualitative evaluations.
Interactive reinforcement learning (IRL) extends traditional reinforcement learning (RL) by allowing an agent to interact with parent-like trainers during a task. In this paper, we present an IRL approach using dynamic audio-visual input in terms of vocal commands and hand gestures as feedback. Our architecture integrates multi-modal information to provide robust commands from multiple sensory cues along with a confidence value indicating the trustworthiness of the feedback. The integration process also considers the case in which the two modalities convey incongruent information. Additionally, we modulate the influence of sensory-driven feedback in the IRL task using goal-oriented knowledge in terms of contextual affordances. We implement a neural network architecture to predict the effect of performed actions with different objects to avoid failed-states, i.e., states from which it is not possible to accomplish the task. In our experimental setup, we explore the interplay of multimodal feedback and task-specific affordances in a robot cleaning scenario. We compare the learning performance of the agent under four different conditions: traditional RL, multi-modal IRL, and each of these two setups with the use of contextual affordances. Our experiments show that the best performance is obtained by using audio-visual feedback with affordancemodulated IRL. The obtained results demonstrate the importance of multi-modal sensory processing integrated with goal-oriented knowledge in IRL tasks.
Understanding the intentions of drivers at intersections is a critical component for autonomous vehicles. Urban intersections that do not have traffic signals are a common epicentre of highly variable vehicle movement and interactions. We present a method for predicting driver intent at urban intersections through multi-modal trajectory prediction with uncertainty. Our method is based on recurrent neural networks combined with a mixture density network output layer. To consolidate the multi-modal nature of the output probability distribution, we introduce a clustering algorithm that extracts the set of possible paths that exist in the prediction output, and ranks them according to likelihood. To verify the method’s performance and generalizability, we present a real-world dataset that consists of over 23,000 vehicles traversing five different intersections, collected using a vehicle mounted Lidar based tracking system. An array of metrics is used to demonstrate the performance of the model against several baselines.
Due to numerous public information sources and services, many methods to combine heterogeneous data were proposed recently. However, general end-to-end solutions are still rare, especially systems taking into account different context dimensions. Therefore, the techniques often prove insufficient or are limited to a certain domain. In this paper we briefly review and rigorously evaluate a general framework for data matching and merging. The framework employs collective entity resolution and redundancy elimination using three dimensions of context types. In order to achieve domain independent results, data is enriched with semantics and trust. However, the main contribution of the paper is evaluation on five public domain-incompatible datasets. Furthermore, we introduce additional attribute, relationship, semantic and trust metrics, which allow complete framework management. Besides overall results improvement within the framework, metrics could be of independent interest.
Permutation tests are among the simplest and most widely used statistical tools. Their p-values can be computed by a straightforward sampling of permutations. However, this way of computing p-values is often so slow that it is replaced by an approximation, which is accurate only for part of the interesting range of parameters. Moreover, the accuracy of the approximation can usually not be improved by increasing the computation time. We introduce a new sampling-based algorithm which uses the fast Fourier transform to compute p-values for the permutation test based on Pearson’s correlation coefficient. The algorithm is practically and asymptotically faster than straightforward sampling. Typically, its complexity is logarithmic in the input size, while the complexity of straightforward sampling is linear. The idea behind the algorithm can also be used to accelerate the computation of p-values for many other common statistical tests. The algorithm is easy to implement, but its analysis involves results from the representation theory of the symmetric group.
The development, assessment, and comparison of randomized search algorithms heavily rely on benchmarking. Regarding the domain of constrained optimization, the number of currently available benchmark environments bears no relation to the number of distinct problem features. The present paper advances a proposal of a scalable linear constrained optimization problem that is suitable for benchmarking Evolutionary Algorithms. By comparing two recent EA variants, the linear benchmarking environment is demonstrated.
Discovering whether words are semantically related and identifying the specific semantic relation that holds between them is of crucial importance for NLP as it is essential for tasks like query expansion in IR. Within this context, different methodologies have been proposed that either exclusively focus on a single lexical relation (e.g. hypernymy vs. random) or learn specific classifiers capable of identifying multiple semantic relations (e.g. hypernymy vs. synonymy vs. random). In this paper, we propose another way to look at the problem that relies on the multi-task learning paradigm. In particular, we want to study whether the learning process of a given semantic relation (e.g. hypernymy) can be improved by the concurrent learning of another semantic relation (e.g. co-hyponymy). Within this context, we particularly examine the benefits of semi-supervised learning where the training of a prediction function is performed over few labeled data jointly with many unlabeled ones. Preliminary results based on simple learning strategies and state-of-the-art distributional feature representations show that concurrent learning can lead to improvements in a vast majority of tested situations.
Recent methods for boundary or edge detection built on Deep Convolutional Neural Networks (CNNs) typically suffer from the issue of predicted edges being thick and need post-processing to obtain crisp boundaries. Highly imbalanced categories of boundary versus background in training data is one of main reasons for the above problem. In this work, the aim is to make CNNs produce sharp boundaries without post-processing. We introduce a novel loss for boundary detection, which is very effective for classifying imbalanced data and allows CNNs to produce crisp boundaries. Moreover, we propose an end-to-end network which adopts the bottom-up/top-down architecture to tackle the task. The proposed network effectively leverages hierarchical features and produces pixel-accurate boundary mask, which is critical to reconstruct the edge map. Our experiments illustrate that directly making crisp prediction not only promotes the visual results of CNNs, but also achieves better results against the state-of-the-art on the BSDS500 dataset (ODS F-score of .815) and the NYU Depth dataset (ODS F-score of .762).
Deep convolutional neural networks (CNNs) have achieved many state-of-the-art results recently in different fields of computer vision. Different architectures are emerging almost everyday that produce fascinating results in machine vision tasks consisting million of images. Though CNN based models are able to detect and recognize large number of object classes, most of the times these models are trained with high quality images. Recently, an alternative architecture for image classification is proposed by Sabour et al. based on groups of neurons (capsules) and a novel dynamic routing protocol. The architecture has shown promising result by surpassing the performances of state-of-the-art CNN based models in some of the datasets. However, the behavior of capsule based models and CNN based models are largely unknown in presence of noise. As in most of the practical applications, it can not be guaranteed the input image is completely noise-free and does not contain any distortion, it is important to study the performance of these models under different noise conditions. In this paper, we select six widely used CNN architectures and consider their performances for image classification task for two different dataset under image quality distortions. We show that the capsule network is more robust to several image degradations than the CNN based models, especially for salt and pepper noise. To the best of our knowledge, this is the first study on the performance of CapsuleNet and other state-of-the-art CNN architectures under different types of image degradations.
Deep neural networks (DNNs) have achieved significant success in a variety of real world applications. However, tons of parameters in the networks restrict the efficiency of neural networks due to the large model size and the intensive computation. To address this issue, various compression and acceleration techniques have been investigated, among which low-rank filters and sparse filters are heavily studied. In this paper we propose a unified framework to compress the convolutional neural networks by combining these two strategies, while taking the nonlinear activation into consideration. The filer of a layer is approximated by the sum of a sparse component and a low-rank component, both of which are in favor of model compression. Especially, we constrain the sparse component to be structured sparse which facilitates acceleration. The performance of the network is retained by minimizing the reconstruction error of the feature maps after activation of each layer, using the alternating direction method of multipliers (ADMM). The experimental results show that our proposed approach can compress VGG-16 and AlexNet by over 4X. In addition, 2.2X and 1.1X speedup are achieved on VGG-16 and AlexNet, respectively, at a cost of less increase on error rate.
Algorithmic differentiation (AD) allows exact computation of derivatives given only an implementation of an objective function. Although many AD tools are available, a proper and efficient implementation of AD methods is not straightforward. The existing tools are often too different to allow for a general test suite. In this paper, we compare fifteen ways of computing derivatives including eleven automatic differentiation tools implementing various methods and written in various languages (C++, F#, MATLAB, Julia and Python), two symbolic differentiation tools, finite differences, and hand-derived computation. We look at three objective functions from computer vision and machine learning. These objectives are for the most part simple, in the sense that no iterative loops are involved, and conditional statements are encapsulated in functions such as {\tt abs} or {\tt logsumexp}. However, it is important for the success of algorithmic differentiation that such simple’ objective functions are handled efficiently, as so many problems in computer vision and machine learning are of this form. Of course, our results depend on programmer skill, and familiarity with the tools. However, we contend that this paper presents an important datapoint: a skilled programmer devoting roughly a week to each tool produced the timings we present. We have made our implementations available as open source to allow the community to replicate and update these benchmarks.
Recently, many graph matching methods that incorporate pairwise constraint and that can be formulated as a quadratic assignment problem (QAP) have been proposed. Although these methods demonstrate promising results for the graph matching problem, they have high complexity in space or time. In this paper, we introduce an adaptively transforming graph matching (ATGM) method from the perspective of functional representation. More precisely, under a transformation formulation, we aim to match two graphs by minimizing the discrepancy between the original graph and the transformed graph. With a linear representation map of the transformation, the pairwise edge attributes of graphs are explicitly represented by unary node attributes, which enables us to reduce the space and time complexity significantly. Due to an efficient Frank-Wolfe method-based optimization strategy, we can handle graphs with hundreds and thousands of nodes within an acceptable amount of time. Meanwhile, because transformation map can preserve graph structures, a domain adaptation-based strategy is proposed to remove the outliers. The experimental results demonstrate that our proposed method outperforms the state-of-the-art graph matching algorithms.
We developed a novel statistical method to identify structural differences between networks characterized by structural equation models. We propose to reparameterize the model to separate the differential structures from common structures, and then design an algorithm with calibration and construction stages to identify these differential structures. The calibration stage serves to obtain consistent prediction by building the l2 regularized regression of each endogenous variables against pre-screened exogenous variables, correcting for potential endogeneity issue. The construction stage consistently selects and estimates both common and differential effects by undertaking l1 regularized regression of each endogenous variable against the predicts of other endogenous variables as well as its anchoring exogenous variables. Our method allows easy parallel computation at each stage. Theoretical results are obtained to establish non-asymptotic error bounds of predictions and estimates at both stages, as well as the con-sistency of identified common and differential effects. Our studies on synthetic data demonstrated that our proposed method performed much better than independently constructing the networks. A real data set is analyzed to illustrate the applicability of our method.
Superpixels provide an efficient low/mid-level representation of image data, which greatly reduces the number of image primitives for subsequent vision tasks. Existing superpixel algorithms are not differentiable, making them difficult to integrate into otherwise end-to-end trainable deep neural networks. We develop a new differentiable model for superpixel sampling that leverages deep networks for learning superpixel segmentation. The resulting ‘Superpixel Sampling Network’ (SSN) is end-to-end trainable, which allows learning task-specific superpixels with flexible loss functions and has fast runtime. Extensive experimental analysis indicates that SSNs not only outperform existing superpixel algorithms on traditional segmentation benchmarks, but can also learn superpixels for other tasks. In addition, SSNs can be easily integrated into downstream deep networks resulting in performance improvements.
We establish an equivalence between information bottleneck (IB) learning and an unconventional quantization problem, IB quantization’. Under this equivalence, standard neural network models correspond to scalar IB quantizers. We prove a coding theorem for IB quantization, which implies that scalar IB quantizers are in general inferior to vector IB quantizers. This inspires us to develop a learning framework for neural networks, AgrLearn, that corresponds to vector IB quantizers. We experimentally verify that AgrLearn applied to some deep network models of current art improves upon them, while requiring less training data. With a heuristic smoothing, AgrLearn further improves its performance, resulting in new state of the art in image classification on Cifar10.
Novelty detection is the machine learning task to recognize data, which belongs to a previously unknown pattern. Complementary to supervised learning, it allows the data to be analyzed in a model-independent way. We demonstrate the potential role of novelty detection in collider analyses using an artificial neural network. Particularly, we introduce a set of density-based novelty evaluators, which can measure the clustering effect of new physics events in the feature space, and hence separate themselves from the traditional density-based ones, which measure isolation. This design enables recognizing new physics events, if any, at a reasonably efficient level. For illustrating its sensitivity performance, we apply novelty detection to the searches for fermionic di-top partner and resonant di-top productions at LHC and for exotic Higgs decays of two specific modes at future $e^+e^-$ collider.