Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching

Deep text matching approaches have been widely studied for many applications including question answering and information retrieval systems. To deal with a domain that has insufficient labeled data, these approaches can be used in a Transfer Learning (TL) setting to leverage labeled data from a resource-rich source domain. To achieve better performance, source domain data selection is essential in this process to prevent the ‘negative transfer’ problem. However, the emerging deep transfer models do not fit well with most existing data selection methods, because the data selection policy and the transfer learning model are not jointly trained, leading to sub-optimal training efficiency. In this paper, we propose a novel reinforced data selector to select high-quality source domain data to help the TL model. Specifically, the data selector ‘acts’ on the source domain data to find a subset for optimization of the TL model, and the performance of the TL model can provide ‘rewards’ in turn to update the selector. We build the reinforced data selector based on the actor-critic framework and integrate it to a DNN based transfer learning model, resulting in a Reinforced Transfer Learning (RTL) method. We perform a thorough experimental evaluation on two major tasks for text matching, namely, paraphrase identification and natural language inference. Experimental results show the proposed RTL can significantly improve the performance of the TL model. We further investigate different settings of states, rewards, and policy optimization methods to examine the robustness of our method. Last, we conduct a case study on the selected data and find our method is able to select source domain data whose Wasserstein distance is close to the target domain data. This is reasonable and intuitive as such source domain data can provide more transferability power to the model.

A Geometric Theory of Higher-Order Automatic Differentiation

First-order automatic differentiation is a ubiquitous tool across statistics, machine learning, and computer science. Higher-order implementations of automatic differentiation, however, have yet to realize the same utility. In this paper I derive a comprehensive, differential geometric treatment of automatic differentiation that naturally identifies the higher-order differential operators amenable to automatic differentiation as well as explicit procedures that provide a scaffolding for high-performance implementations.

Predicting Aircraft Trajectories: A Deep Generative Convolutional Recurrent Neural Networks Approach

Reliable 4D aircraft trajectory prediction, whether in a real-time setting or for analysis of counterfactuals, is important to the efficiency of the aviation system. Toward this end, we first propose a highly generalizable efficient tree-based matching algorithm to construct image-like feature maps from high-fidelity meteorological datasets – wind, temperature and convective weather. We then model the track points on trajectories as conditional Gaussian mixtures with parameters to be learned from our proposed deep generative model, which is an end-to-end convolutional recurrent neural network that consists of a long short-term memory (LSTM) encoder network and a mixture density LSTM decoder network. The encoder network embeds last-filed flight plan information into fixed-size hidden state variables and feeds the decoder network, which further learns the spatiotemporal correlations from the historical flight tracks and outputs the parameters of Gaussian mixtures. Convolutional layers are integrated into the pipeline to learn representations from the high-dimension weather features. During the inference process, beam search, adaptive Kalman filter, and Rauch-Tung-Striebel smoother algorithms are used to prune the variance of generated trajectories.

Soft-Autoencoder and Its Wavelet Shrinkage Interpretation

Deep learning is a main focus of artificial intelligence and has greatly impacted other fields. However, deep learning is often criticized for its lack of interpretation. As a successful unsupervised model in deep learning, various autoencoders, especially convolutional autoencoders, are very popular and important. Since these autoencoders need improvements and insights, in this paper we shed light on the nonlinearity of a deep convolutional autoencoder in perspective of perfect signal recovery. In particular, we propose a new type of convolutional autoencoders, termed as Soft-Autoencoder (Soft-AE), in which the activations of encoding layers are implemented with adaptable soft-thresholding units while decoding layers are realized with linear units. Consequently, Soft-AE can be naturally interpreted as a learned cascaded wavelet shrinkage system. Our denoising numerical experiments on CIFAR-10, BSD-300 and Mayo Clinical Challenge Dataset demonstrate that Soft-AE gives a competitive performance relative to its counterparts.

ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers

To facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), two important categories of DNN model compression techniques: weight pruning and weight quantization are investigated. The former leverages the redundancy in the number of weights, whereas the latter leverages the redundancy in bit representation of weights. However, there lacks a systematic framework of joint weight pruning and quantization of DNNs, thereby limiting the available model compression ratio. Moreover, the computation reduction, energy efficiency improvement, and hardware performance overhead need to be accounted for besides simply model size reduction. To address these limitations, we present ADMM-NN, the first algorithm-hardware co-optimization framework of DNNs using Alternating Direction Method of Multipliers (ADMM), a powerful technique to deal with non-convex optimization problems with possibly combinatorial constraints. The first part of ADMM-NN is a systematic, joint framework of DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique with regularization target dynamically updated in each ADMM iteration, thereby resulting in higher performance in model compression than prior work. The second part is hardware-aware DNN optimizations to facilitate hardware-level implementations. Without accuracy loss, we can achieve 85\times and 24\times pruning on LeNet-5 and AlexNet models, respectively, significantly higher than prior work. The improvement becomes more significant when focusing on computation reductions. Combining weight pruning and quantization, we achieve 1,910\times and 231\times reductions in overall model size on these two benchmarks, when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50.

K-nearest Neighbor Search by Random Projection Forests

K-nearest neighbor (kNN) search has wide applications in many areas, including data mining, machine learning, statistics and many applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests (rpForests), for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees with each constructed recursively through a series of carefully chosen random projections. rpForests achieves a remarkable accuracy in terms of fast decay in the missing rate of kNNs and that of discrepancy in the kNN distances. rpForests has a very low computational complexity. The ensemble nature of rpForests makes it easily run in parallel on multicore or clustered computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights by showing the exponential decay of the probability that neighboring points would be separated by ensemble random projection trees when the ensemble size increases. Our theory can be used to refine the choice of random projections in the growth of trees, and experiments show that the effect is remarkable.

Cross-language Citation Recommendation via Hierarchical Representation Learning on Heterogeneous Graph

While the volume of scholarly publications has increased at a frenetic pace, accessing and consuming the useful candidate papers, in very large digital libraries, is becoming an essential and challenging task for scholars. Unfortunately, because of language barrier, some scientists (especially the junior ones or graduate students who do not master other languages) cannot efficiently locate the publications hosted in a foreign language repository. In this study, we propose a novel solution, cross-language citation recommendation via Hierarchical Representation Learning on Heterogeneous Graph (HRLHG), to address this new problem. HRLHG can learn a representation function by mapping the publications, from multilingual repositories, to a low-dimensional joint embedding space from various kinds of vertexes and relations on a heterogeneous graph. By leveraging both global (task specific) plus local (task independent) information as well as a novel supervised hierarchical random walk algorithm, the proposed method can optimize the publication representations by maximizing the likelihood of locating the important cross-language neighborhoods on the graph. Experiment results show that the proposed method can not only outperform state-of-the-art baseline models, but also improve the interpretability of the representation model for cross-language citation recommendation task.

Stealing Neural Networks via Timing Side Channels

Deep learning is gaining importance in many applications and Cloud infrastructures are being advocated for this computational paradigm. However, there is a security issue which is yet to be addressed. An adversary can extract the neural network architecture for commercial gain. Given the architecture, an adversary can further infer the regularization hyperparameter, input data and generate effective transferable adversarial examples to evade classifiers. We observe that neural networks are vulnerable to timing side channel attacks as the total execution time of the network is dependent on the network depth due to the sequential computation of the layers. In this paper, black box neural network extraction attack by exploiting the timing side channels to infer the depth of the network has been proposed. The proposed approach is independent of the neural network architecture and scalable. Reconstructing substitute architectures with similar functionality as the target model is a search problem. The depth inferred from exploiting the timing side channel reduces the search space. Further, reinforcement learning with knowledge distillation is used to efficiently search for the optimal substitute architecture in the complex yet reduced search space. We evaluate our attack on VGG architectures on CIFAR10 dataset and reconstruct substitute models with test accuracy close to the target models.

Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm

The high computational and parameter complexity of neural networks makes their training very slow and difficult to deploy on energy and storage-constrained computing systems. Many network complexity reduction techniques have been proposed including fixed-point implementation. However, a systematic approach for designing full fixed-point training and inference of deep neural networks remains elusive. We describe a precision assignment methodology for neural network training in which all network parameters, i.e., activations and weights in the feedforward path, gradients and weight accumulators in the feedback path, are assigned close to minimal precision. The precision assignment is derived analytically and enables tracking the convergence behavior of the full precision training, known to converge a priori. Thus, our work leads to a systematic methodology of determining suitable precision for fixed-point training. The near optimality (minimality) of the resulting precision assignment is validated empirically for four networks on the CIFAR-10, CIFAR-100, and SVHN datasets. The complexity reduction arising from our approach is compared with other fixed-point neural network designs.

A Neural Network Based Explainable Recommender System

Recommendation system could help the companies to persuade users to visit or consume at a particular place, which was based on many traditional methods such as the set of collaborative filtering algorithms. Most research discusses the model design or feature engineering methods to minimize the root mean square error (RMSE) of rating prediction, but lacks exploring the ways to generate the reasons for recommendations. This paper proposed an integrated neural network based model which integrates rating scores prediction and explainable words generation. Based on the experimental results, this model presented lower RMSE compared with traditional methods, and generate the explanation of recommendation to convince customers to visit the recommended place.

Cluster-Based Active Learning

In this work, we introduce Cluster-Based Active Learning, a novel framework that employs clustering to boost active learning by reducing the number of human interactions required to train deep neural networks. Instead of annotating single samples individually, humans can also label clusters, producing a higher number of annotated samples with the cost of a small label error. Our experiments show that the proposed framework requires 82% and 87% less human interactions for CIFAR-10 and EuroSAT datasets respectively when compared with the fully-supervised training while maintaining similar performance on the test set.

Mathematics Content Understanding for Cyberlearning via Formula Evolution Map

Although the scientific digital library is growing at a rapid pace, scholars/students often find reading Science, Technology, Engineering, and Mathematics (STEM) literature daunting, especially for the math-content/formula. In this paper, we propose a novel problem, “mathematics content understanding”, for cyberlearning and cyberreading. To address this problem, we create a Formula Evolution Map (FEM) offline and implement a novel online learning/reading environment, PDF Reader with Math-Assistant (PRMA), which incorporates innovative math-scaffolding methods. The proposed algorithm/system can auto-characterize student emerging math-information need while reading a paper and enable students to readily explore the formula evolution trajectory in FEM. Based on a math-information need, PRMA utilizes innovative joint embedding, formula evolution mining, and heterogeneous graph mining algorithms to recommend high quality Open Educational Resources (OERs), e.g., video, Wikipedia page, or slides, to help students better understand the math-content in the paper. Evaluation and exit surveys show that the PRMA system and the proposed formula understanding algorithm can effectively assist master and PhD students better understand the complex math-content in the class readings.

Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions and Applications

Reinforcement learning (RL) algorithms have been around for decades and been employed to solve various sequential decision-making problems. These algorithms however have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to drive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments. This paper addresses an important aspect of deep RL related to situations that demand multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multi-agent deep RL (MADRL) is presented, including non-stationarity, partial observability, continuous state and action spaces, multi-agent training schemes, multi-agent transfer learning. The merits and demerits of the reviewed methods will be analyzed and discussed, with their corresponding applications explored. It is envisaged that this review provides insights about various MADRL methods and can lead to future development of more robust and highly useful multi-agent learning methods for solving real-world problems.

An introduction to domain adaptation and transfer learning

In machine learning, if the training data is an unbiased sample of an underlying distribution, then the learned classification function will make accurate predictions for new samples. However, if the training data is not an unbiased sample, then there will be differences between how the training data is distributed and how the test data is distributed. Standard classifiers cannot cope with changes in data distributions between training and test phases, and will not perform well. Domain adaptation and transfer learning are sub-fields within machine learning that are concerned with accounting for these types of changes. Here, I present an introduction to these fields, guided by the question: when and how can a classifier generalize from a source to a target domain? I will start with a brief introduction into risk minimization, and how transfer learning and domain adaptation expand upon this framework. Following that, I discuss three special cases of data set shift, namely prior, covariate and concept shift. For more complex domain shifts, there are a wide variety of approaches. These are categorized into: importance-weighting, subspace mapping, domain-invariant spaces, feature augmentation, minimax estimators and robust algorithms. A number of points will arise, which I will discuss in the last section. I conclude with the remark that many open questions will have to be addressed before transfer learners and domain-adaptive classifiers become practical.

Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks

Unconstrained text recognition is an important computer vision task, featuring a wide variety of different sub-tasks, each with its own set of challenges. One of the biggest promises of deep neural networks has been the convergence and automation of feature extractors from input raw signals, allowing for the highest possible performance with minimum required domain knowledge. To this end, we propose a data-efficient, end-to-end neural network model for generic, unconstrained text recognition. In our proposed architecture we strive for simplicity and efficiency without sacrificing recognition accuracy. Our proposed architecture is a fully convolutional network without any recurrent connections trained with the CTC loss function. Thus it operates on arbitrary input sizes and produces strings of arbitrary length in a very efficient and parallelizable manner. We show the generality and superiority of our proposed text recognition architecture by achieving state of the art results on seven public benchmark datasets, covering a wide spectrum of text recognition tasks, namely: Handwriting Recognition, CAPTCHA recognition, OCR, License Plate Recognition, and Scene Text Recognition. Our proposed architecture has won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset.

Differentiable Satisfiability and Differentiable Answer Set Programming for Sampling-Based Multi-Model Optimization

We propose Differentiable Satisfiability and Differentiable Answer Set Programming (Differentiable SAT/ASP) for multi-model optimization. Models (answer sets or satisfying truth assignments) are sampled using a novel SAT/ASP solving approach which uses a gradient descent-based branching mechanism. Sampling proceeds until the value of a user-defined multi-model cost function reaches a given threshold. As major use cases for our approach we propose distribution-aware model sampling and expressive yet scalable probabilistic logic programming. As our main algorithmic approach to Differentiable SAT/ASP, we introduce an enhancement of the state-of-the-art CDNL/CDCL algorithm for SAT/ASP solving. Additionally, we present alternative algorithms which use an unmodified ASP solver (Clingo/clasp) and map the optimization task to conventional answer set optimization or use so-called propagators. We also report on the open source software DelSAT, a recent prototype implementation of our main algorithm, and on initial experimental results which indicate that DelSATs performance is, when applied to the use case of probabilistic logic inference, on par with Markov Logic Network (MLN) inference performance, despite having advantageous properties compared to MLNs, such as the ability to express inductive definitions and to work with probabilities as weights directly in all cases. Our experiments also indicate that our main algorithm is strongly superior in terms of performance compared to the presented alternative approaches which reduce a common instance of the general problem to regular SAT/ASP.

A Mathematical Theory of Interpersonal Interactions and Group Behavior

Emergent collective group processes and capabilities have been studied through analysis of transactive memory, measures of group task performance, and group intelligence, among others. In their approach to collective behaviors, these approaches transcend traditional studies of group decision making that focus on how individual preferences combine through power relationships, social choice by voting, negotiation and game theory. Understanding more generally how individuals contribute to group effectiveness is important to a broad set of social challenges. Here we formalize a dynamic theory of interpersonal communications that classifies individual acts, sequences of actions, group behavioral patterns, and individuals engaged in group decision making. Group decision making occurs through a sequence of communications that convey personal attitudes and preferences among members of the group. The resulting formalism is relevant to psychosocial behavior analysis, rules of order, organizational structures and personality types, as well as formalized systems such as social choice theory. More centrally, it provides a framework for quantifying and even anticipating the structure of informal dialog, allowing specific conversations to be coded and analyzed in relation to a quantitative model of the participating individuals and the parameters that govern their interactions.

An Analysis of Classical Multidimensional Scaling

Classical multidimensional scaling is an important tool for dimension reduction in many applications. Yet few theoretical results characterizing its statistical performance exist. In this paper, we provide a theoretical framework for analyzing the quality of embedded samples produced by classical multidimensional scaling. This lays down the foundation for various downstream statistical analysis. As an application, we study its performance in the setting of clustering noisy data. Our results provide scaling conditions on the sample size, ambient dimensionality, between-class distance and noise level under which classical multidimensional scaling followed by a clustering algorithm can recover the cluster labels of all samples with high probability. Numerical simulations confirm these scaling conditions are sharp in low, moderate, and high dimensional regimes. Applications to both human RNAseq data and natural language data lend strong support to the methodology and theory.

Gray-box Adversarial Testing for Control Systems with Machine Learning Component

Neural Networks (NN) have been proposed in the past as an effective means for both modeling and control of systems with very complex dynamics. However, despite the extensive research, NN-based controllers have not been adopted by the industry for safety critical systems. The primary reason is that systems with learning based controllers are notoriously hard to test and verify. Even harder is the analysis of such systems against system-level specifications. In this paper, we provide a gradient based method for searching the input space of a closed-loop control system in order to find adversarial samples against some system-level requirements. Our experimental results show that combined with randomized search, our method outperforms Simulated Annealing optimization.