Learning Taxonomies of Concepts and not Words using Contextualized Word Representations: A Position Paper

Taxonomies are semantic hierarchies of concepts. One limitation of current taxonomy learning systems is that they define concepts as single words. This position paper argues that contextualized word representations, which recently achieved state-of-the-art results on many competitive NLP tasks, are a promising method to address this limitation. We outline a novel approach for taxonomy learning that (1) defines concepts as synsets, (2) learns density-based approximations of contextualized word representations, and (3) can measure similarity and hypernymy among them.


Attention, please! A Critical Review of Neural Attention Models in Natural Language Processing

Attention is an increasingly popular mechanism used in a wide range of neural architectures. Because of the fast-paced advances in this domain, a systematic overview of attention is still missing. In this article, we define a unified model for attention architectures for natural language processing, with a focus on architectures designed to work with vector representation of the textual data. We discuss the dimensions along which proposals differ, the possible uses of attention, and chart the major research activities and open challenges in the area.


Non-Monotonic Sequential Text Generation

Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right. In this work, we propose a framework for training models of text generation that operate in non-monotonic orders; the model directly learns good orders, without any additional annotation. Our framework operates by generating a word at an arbitrary position, and then recursively generating words to its left and then words to its right, yielding a binary tree. Learning is framed as imitation learning, including a coaching method which moves from imitating an oracle to reinforcing the policy’s own preferences. Experimental results demonstrate that using the proposed method, it is possible to learn policies which generate text without pre-specifying a generation order, while achieving competitive performance with conventional left-to-right generation.


Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis of Key Ideas and Publications, and Bibliography for Explainable AI

This is an integrative review that address the question, ‘What makes for a good explanation?’ with reference to AI systems. Pertinent literatures are vast. Thus, this review is necessarily selective. That said, most of the key concepts and issues are expressed in this Report. The Report encapsulates the history of computer science efforts to create systems that explain and instruct (intelligent tutoring systems and expert systems). The Report expresses the explainability issues and challenges in modern AI, and presents capsule views of the leading psychological theories of explanation. Certain articles stand out by virtue of their particular relevance to XAI, and their methods, results, and key points are highlighted. It is recommended that AI/XAI researchers be encouraged to include in their research reports fuller details on their empirical or experimental methods, in the fashion of experimental psychology research reports: details on Participants, Instructions, Procedures, Tasks, Dependent Variables (operational definitions of the measures and metrics), Independent Variables (conditions), and Control Conditions.


CMS Sematrix: A Tool to Aid the Development of Clinical Quality Measures (CQMs)

As part of the effort to improve quality and to reduce national healthcare costs, the Centers for Medicare and Medicaid Services (CMS) are responsible for creating and maintaining an array of clinical quality measures (CQMs) for assessing healthcare structure, process, outcome, and patient experience across various conditions, clinical specialties, and settings. The development and maintenance of CQMs involves substantial and ongoing evaluation of the evidence on the measure’s properties: importance, reliability, validity, feasibility, and usability. As such, CMS conducts monthly environmental scans of the published clinical and health service literature. Conducting time consuming, exhaustive evaluations of the ever-changing healthcare literature presents one of the largest challenges to an evidence-based approach to healthcare quality improvement. Thus, it is imperative to leverage automated techniques to aid CMS in the identification of clinical and health services literature relevant to CQMs. Additionally, the estimated labor hours and related cost savings of using CMS Sematrix compared to a traditional literature review are roughly 818 hours and 122,000 dollars for a single monthly environmental scan.


Active Learning for High-Dimensional Binary Features

Erbium-doped fiber amplifier (EDFA) is an optical amplifier/repeater device used to boost the intensity of optical signals being carried through a fiber optic communication system. A highly accurate EDFA model is important because of its crucial role in optical network management and optimization. The input channels of an EDFA device are treated as either on or off, hence the input features are binary. Labeled training data is very expensive to collect for EDFA devices, therefore we devise an active learning strategy suitable for binary variables to overcome this issue. We propose to take advantage of sparse linear models to simplify the predictive model. This approach simultaneously improves prediction and accelerates active learning query generation. We show the performance of our proposed active learning strategies on simulated data and real EDFA data.


Meta-Amortized Variational Inference and Learning

How can we learn to do probabilistic inference in a way that generalizes between models? Amortized variational inference learns for a single model, sharing statistical strength across observations. This benefits scalability and model learning, but does not help with generalization to new models. We propose meta-amortized variational inference, a framework that amortizes the cost of inference over a family of generative models. We apply this approach to deep generative models by introducing the MetaVAE: a variational autoencoder that learns to generalize to new distributions and rapidly solve new unsupervised learning problems using only a small number of target examples. Empirically, we validate the approach by showing that the MetaVAE can: (1) capture relevant sufficient statistics for inference, (2) learn useful representations of data for downstream tasks such as clustering, and (3) perform meta-density estimation on unseen synthetic distributions and out-of-sample Omniglot alphabets.


CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning

We focus on the commonly used synchronous Gradient Descent paradigm for large-scale distributed learning, for which there has been a growing interest to develop efficient and robust gradient aggregation strategies that overcome two key bottlenecks: communication bandwidth and stragglers’ delays. In particular, Ring-AllReduce (RAR) design has been proposed to avoid bandwidth bottleneck at any particular node by allowing each worker to only communicate with its neighbors that are arranged in a logical ring. On the other hand, Gradient Coding (GC) has been recently proposed to mitigate stragglers in a master-worker topology by allowing carefully designed redundant allocation of the data set to the workers. We propose a joint communication topology design and data set allocation strategy, named CodedReduce (CR), that combines the best of both RAR and GC. That is, it parallelizes the communications over a tree topology leading to efficient bandwidth utilization, and carefully designs a redundant data set allocation and coding strategy at the nodes to make the proposed gradient aggregation scheme robust to stragglers. In particular, we quantify the communication parallelization gain and resiliency of the proposed CR scheme, and prove its optimality when the communication topology is a regular tree. Furthermore, we empirically evaluate the performance of our proposed CR design over Amazon EC2 and demonstrate that it achieves speedups of up to 18.9x and 7.9x, respectively over the benchmarks GC and RAR.


An Automated Spectral Clustering for Multi-scale Data

Spectral clustering algorithms typically require a priori selection of input parameters such as the number of clusters, a scaling parameter for the affinity measure, or ranges of these values for parameter tuning. Despite efforts for automating the process of spectral clustering, the task of grouping data in multi-scale and higher dimensional spaces is yet to be explored. This study presents a spectral clustering heuristic algorithm that obviates the need for an input by estimating the parameters from the data itself. Specifically, it introduces the heuristic of iterative eigengap search with (1) global scaling and (2) local scaling. These approaches estimate the scaling parameter and implement iterative eigengap quantification along a search tree to reveal dissimilarities at different scales of a feature space and identify clusters. The performance of these approaches has been tested on various real-world datasets of power variation with multi-scale nature and gene expression. Our findings show that iterative eigengap search with a PCA-based global scaling scheme can discover different patterns with an accuracy of higher than 90% in most cases without asking for a priori input information.


Are All Layers Created Equal?

Understanding learning and generalization of deep architectures has been a major research objective in the recent years with notable theoretical progress. A main focal point of generalization studies stems from the success of excessively large networks which defy the classical wisdom of uniform convergence and learnability. We study empirically the layer-wise functional structure of over-parameterized deep models. We provide evidence for the heterogeneous characteristic of layers. To do so, we introduce the notion of (post training) re-initialization and re-randomization robustness. We show that layers can be categorized into either ‘robust’ or ‘critical’. In contrast to critical layers, resetting the robust layers to their initial value has no negative consequence, and in many cases they barely change throughout training. Our study provides further evidence that mere parameter counting or norm accounting is too coarse in studying generalization of deep models.


Bidirectional Inference Networks: A Class of Deep Bayesian Networks for Health Profiling

We consider the problem of inferring the values of an arbitrary set of variables (e.g., risk of diseases) given other observed variables (e.g., symptoms and diagnosed diseases) and high-dimensional signals (e.g., MRI images or EEG). This is a common problem in healthcare since variables of interest often differ for different patients. Existing methods including Bayesian networks and structured prediction either do not incorporate high-dimensional signals or fail to model conditional dependencies among variables. To address these issues, we propose bidirectional inference networks (BIN), which stich together multiple probabilistic neural networks, each modeling a conditional dependency. Predictions are then made via iteratively updating variables using backpropagation (BP) to maximize corresponding posterior probability. Furthermore, we extend BIN to composite BIN (CBIN), which involves the iterative prediction process in the training stage and improves both accuracy and computational efficiency by adaptively smoothing the optimization landscape. Experiments on synthetic and real-world datasets (a sleep study and a dermatology dataset) show that CBIN is a single model that can achieve state-of-the-art performance and obtain better accuracy in most inference tasks than multiple models each specifically trained for a different task.


Regularizing Generative Models Using Knowledge of Feature Dependence

Generative modeling is a fundamental problem in machine learning with many potential applications. Efficient learning of generative models requires available prior knowledge to be exploited as much as possible. In this paper, we propose a method to exploit prior knowledge of relative dependence between features for learning generative models. Such knowledge is available, for example, when side-information on features is present. We incorporate the prior knowledge by forcing marginals of the learned generative model to follow a prescribed relative feature dependence. To this end, we formulate a regularization term using a kernel-based dependence criterion. The proposed method can be incorporated straightforwardly into many optimization-based learning schemes of generative models, including variational autoencoders and generative adversarial networks. We show the effectiveness of the proposed method in experiments with multiple types of datasets and models.


Common Mode Patterns for Supervised Tensor Subspace Learning

In this work we propose a method for reducing the dimensionality of tensor objects in a binary classification framework. The proposed Common Mode Patterns method takes into consideration the labels’ information, and ensures that tensor objects that belong to different classes do not share common features after the reduction of their dimensionality. We experimentally validate the proposed supervised subspace learning technique and compared it against Multilinear Principal Component Analysis using a publicly available hyperspectral imaging dataset. Experimental results indicate that the proposed CMP method can efficiently reduce the dimensionality of tensor objects, while, at the same time, increasing the inter-class separability.


BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling

With the introduction of the variational autoencoder (VAE), probabilistic latent variable models have received renewed attention as powerful generative models. However, their performance in terms of test likelihood and quality of generated samples has been surpassed by autoregressive models without stochastic units. Furthermore, flow-based models have recently been shown to be an attractive alternative that scales well to high-dimensional data. In this paper we close the performance gap by constructing VAE models that can effectively utilize a deep hierarchy of stochastic variables and model complex covariance structures. We introduce the Bidirectional-Inference Variational Autoencoder (BIVA), characterized by a skip-connected generative model and an inference network formed by a bidirectional stochastic inference path. We show that BIVA reaches state-of-the-art test likelihoods, generates sharp and coherent natural images, and uses the hierarchy of latent variables to capture different aspects of the data distribution. We observe that BIVA, in contrast to recent results, can be used for anomaly detection. We attribute this to the hierarchy of latent variables which is able to extract high-level semantic features. Finally, we extend BIVA to semi-supervised classification tasks and show that it performs comparably to state-of-the-art results by generative adversarial networks.


Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation

Though the convergence of major reinforcement learning algorithms has been extensively studied, the finite-sample analysis to further characterize the convergence rate in terms of the sample complexity for problems with continuous state space is still very limited. Such a type of analysis is especially challenging for algorithms with dynamically changing learning policies and under non-i.i.d.\ sampled data. In this paper, we present the first finite-sample analysis for the SARSA algorithm and its minimax variant (for zero-sum Markov games), with a single sample path and linear function approximation. To establish our results, we develop a novel technique to bound the gradient bias for dynamically changing learning policies, which can be of independent interest. We further provide finite-sample bounds for Q-learning and its minimax variant. Comparison of our result with the existing finite-sample bound indicates that linear function approximation achieves order-level lower sample complexity than the nearest neighbor approach.


A Guiding Principle for Causal Decision Problems

We define a Causal Decision Problem as a Decision Problem where the available actions, the family of uncertain events and the set of outcomes are related through the variables of a Causal Graphical Model \mathcal{G}. A solution criteria based on Pearl’s Do-Calculus and the Expected Utility criteria for rational preferences is proposed. The implementation of this criteria leads to an on-line decision making procedure that has been shown to have similar performance to classic Reinforcement Learning algorithms while allowing for a causal model of an environment to be learned. Thus, we aim to provide the theoretical guarantees of the usefulness and optimality of a decision making procedure based on causal information.


Information Flow in Computational Systems

We develop a theoretical framework for defining and identifying flows of information in computational systems. Here, a computational system is assumed to be a directed graph, with ‘clocked’ nodes that send transmissions to each other along the edges of the graph at discrete points in time. A few measures of information flow have been proposed previously in the literature, and measures of directed causal influence are currently being used as a heuristic proxy for information flow. However, there is as yet no rigorous treatment of the problem with formal definitions and clearly stated assumptions, and the process of defining information flow is often conflated with the problem of estimating it. In this work, we provide a new information-theoretic definition for information flow in a computational system, which we motivate using a series of examples. We then show that this definition satisfies intuitively desirable properties, including the existence of ‘information paths’, along which information flows from the input of the computational system to its output. Finally, we describe how information flow might be estimated in a noiseless setting, and provide an algorithm to identify information paths on the time-unrolled graph of a computational system.


Neural Network Attributions: A Causal Perspective

We propose a new attribution method for neural networks developed using first principles of causality (to the best of our knowledge, the first such). The neural network architecture is viewed as a Structural Causal Model, and a methodology to compute the causal effect of each feature on the output is presented. With reasonable assumptions on the causal structure of the input data, we propose algorithms to efficiently compute the causal effects, as well as scale the approach to data with large dimensionality. We also show how this method can be used for recurrent neural networks. We report experimental results on both simulated and real datasets showcasing the promise and usefulness of the proposed algorithm.


Decentralized Flood Forecasting Using Deep Neural Networks

Predicting flood for any location at times of extreme storms is a longstanding problem that has utmost importance in emergency management. Conventional methods that aim to predict water levels in streams use advanced hydrological models still lack of giving accurate forecasts everywhere. This study aims to explore artificial deep neural networks’ performance on flood prediction. While providing models that can be used in forecasting stream stage, this paper presents a dataset that focuses on the connectivity of data points on river networks. It also shows that neural networks can be very helpful in time-series forecasting as in flood events, and support improving existing models through data assimilation.