In sequence to sequence generation tasks (e.g. machine translation and abstractive summarization), inference is generally performed in a left-to-right manner to produce the result token by token. The neural approaches, such as LSTM and self-attention networks, are now able to make full use of all the predicted history hypotheses from left side during inference, but cannot meanwhile access any future (right side) information and usually generate unbalanced outputs in which left parts are much more accurate than right ones. In this work, we propose a synchronous bidirectional inference model to generate outputs using both left-to-right and right-to-left decoding simultaneously and interactively. First, we introduce a novel beam search algorithm that facilitates synchronous bidirectional decoding. Then, we present the core approach which enables left-to-right and right-to-left decoding to interact with each other, so as to utilize both the history and future predictions simultaneously during inference. We apply the proposed model to both LSTM and self-attention networks. In addition, we propose two strategies for parameter optimization. The extensive experiments on machine translation and abstractive summarization demonstrate that our synchronous bidirectional inference model can achieve remarkable improvements over the strong baselines.
Model predictive control (MPC) is a powerful technique for solving dynamic control tasks. In this paper, we show that there exists a close connection between MPC and online learning, an abstract theoretical framework for analyzing online decision making in the optimization literature. This new perspective provides a foundation for leveraging powerful online learning algorithms to design MPC algorithms. Specifically, we propose a new algorithm based on dynamic mirror descent (DMD), an online learning algorithm that is designed for non-stationary setups. Our algorithm, Dynamic Mirror Decent Model Predictive Control (DMD-MPC), represents a general family of MPC algorithms that includes many existing techniques as special instances. DMD-MPC also provides a fresh perspective on previous heuristics used in MPC and suggests a principled way to design new MPC algorithms. In the experimental section of this paper, we demonstrate the flexibility of DMD-MPC, presenting a set of new MPC algorithms on a simple simulated cartpole and a simulated and real-world aggressive driving task.
A novel machine learning optimization process coined Restrictive Federated Model Selection (RFMS) is proposed under the scenario, for example, when data from healthcare units can not leave the site it is situated on and it is forbidden to carry out training algorithms on remote data sites due to either technical or privacy and trust concerns. To carry out a clinical research under this scenario, an analyst could train a machine learning model only on local data site, but it is still possible to execute a statistical query at a certain cost in the form of sending a machine learning model to some of the remote data sites and get the performance measures as feedback, maybe due to prediction being usually much cheaper. Compared to federated learning, which is optimizing the model parameters directly by carrying out training across all data sites, RFMS trains model parameters only on one local data site but optimizes hyper-parameters across other data sites jointly since hyper-parameters play an important role in machine learning performance. The aim is to get a Pareto optimal model with respective to both local and remote unseen prediction losses, which could generalize well across data sites. In this work, we specifically consider high dimensional data with shifted distributions over data sites. As an initial investigation, Bayesian Optimization especially multi-objective Bayesian Optimization is used to guide an adaptive hyper-parameter optimization process to select models under the RFMS scenario. Empirical results show that solely using the local data site to tune hyper-parameters generalizes poorly across data sites, compared to methods that utilize the local and remote performances. Furthermore, in terms of dominated hypervolumes, multi-objective Bayesian Optimization algorithms show increased performance across multiple data sites among other candidates.
Humans have schematic knowledge of how certain types of events unfold (e.g. coffeeshop visits) that can readily be generalized to new instances of those events. Schematic knowledge allows humans to perform role-filler binding, the task of associating schematic roles (e.g. ‘barista’) with specific fillers (e.g. ‘Bob’). Here we examined whether and how recurrent neural networks learn to do this. We procedurally generated stories from an underlying generative graph, and trained networks on role-filler binding question-answering tasks. We tested whether networks can learn to maintain filler information on their own, and whether they can generalize to fillers that they have not seen before. We studied networks by analyzing their behavior and decoding their memory states. We found that a network’s success in learning role-filler binding depends on both the breadth of roles introduced during training, and the network’s memory architecture. In our decoding analyses, we observed a close relationship between the information we could decode from various parts of network architecture, and the information the network could recall.
Complex networks are used to describe a broad range of disparate social systems and natural phenomena, from power grids to customer segmentation to human brain connectome. Challenges of parametric model specification and validation inspire a search for more data-driven and flexible nonparametric approaches for inference of complex networks. In this paper we discuss methodology and R implementation of two bootstrap procedures on random networks, that is, patchwork bootstrap of Thompson et al. (2016) and Gel et al. (2017) and vertex bootstrap of Snijders and Borgatti (1999). To our knowledge, the new R package snowboot is the first implementation of the vertex and patchwork bootstrap inference on networks in R. Our new package is accompanied with a detailed user’s manual, and is compatible with the popular R package on network studies igraph. We evaluate the patchwork bootstrap and vertex bootstrap with extensive simulation studies and illustrate their utility in application to analysis of real world networks.
To improve how neural networks function it is crucial to understand their learning process. The information bottleneck theory of deep learning proposes that neural networks achieve good generalization by compressing their representations to disregard information that is not relevant to the task. However, empirical evidence for this theory is conflicting, as compression was only observed when networks used saturating activation functions. In contrast, networks with non-saturating activation functions achieved comparable levels of task performance but did not show compression. In this paper we developed more robust mutual information estimation techniques, that adapt to hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved methods of estimation, firstly, we show that saturation of the activation function is not required for compression, and the amount of compression varies between different activation functions. We also find that there is a large amount of variation in compression between different network initializations. Secondary, we see that L2 regularization leads to significantly increased compression, while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization.
Talent Search systems aim to recommend potential candidates who are a good match to the hiring needs of a recruiter expressed in terms of the recruiter’s search query or job posting. Past work in this domain has focused on linear and nonlinear models which lack preference personalization in the user-level due to being trained only with globally collected recruiter activity data. In this paper, we propose an entity-personalized Talent Search model which utilizes a combination of generalized linear mixed (GLMix) models and gradient boosted decision tree (GBDT) models, and provides personalized talent recommendations using nonlinear tree interaction features generated by the GBDT. We also present the offline and online system architecture for the productionization of this hybrid model approach in our Talent Search systems. Finally, we provide offline and online experiment results benchmarking our entity-personalized model with tree interaction features, which demonstrate significant improvements in our precision metrics compared to globally trained non-personalized models.
Recent studies have demonstrated that reinforcement learning (RL) agents are susceptible to adversarial manipulation, similar to vulnerabilities previously demonstrated in the supervised setting. Accordingly focus has remained with computer vision, and full observability. This paper focuses on reinforcement learning in the context of autonomous defence in Software-Defined Networking (SDN). We demonstrate that causative attacks—attacks that target the training process—can poison RL agents even if the attacker only has partial observability of the environment. In addition, we propose an inversion defence method that aims to apply the opposite perturbation to that which an attacker might use to generate their adversarial samples. Our experimental results illustrate that the countermeasure can effectively reduce the impact of the causative attack, while not significantly affecting the training process in non-attack scenarios.
Generative Adversarial Networks (GANs) have become a powerful framework to learn generative models that arise across a wide variety of domains. While there has been a recent surge in the development of numerous GAN architectures with distinct optimization metrics, we are still lacking in our understanding on how far away such GANs are from optimality. In this paper, we make progress on a theoretical understanding of the GANs under a simple linear-generator Gaussian-data setting where the optimal maximum-likelihood generator is known to perform Principal Component Analysis (PCA). We find that the original GAN by Goodfellow et. al. fails to recover the optimal PCA solution. On the other hand, we show that Wasserstein GAN can perform PCA, and hence it may serve as a basis for an optimal GAN architecture that yields the optimal generator for a wide range of data settings.
This paper focuses on how to take advantage of external knowledge bases (KBs) to improve recurrent neural networks for machine reading. Traditional methods that exploit knowledge from KBs encode knowledge as discrete indicator features. Not only do these features generalize poorly, but they require task-specific feature engineering to achieve good performance. We propose KBLSTM, a novel neural model that leverages continuous representations of KBs to enhance the learning of recurrent neural networks for machine reading. To effectively integrate background knowledge with information from the currently processed text, our model employs an attention mechanism with a sentinel to adaptively decide whether to attend to background knowledge and which information from KBs is useful. Experimental results show that our model achieves accuracies that surpass the previous state-of-the-art results for both entity extraction and event extraction on the widely used ACE2005 dataset.
Transfer learning aims to solve the data sparsity for a target domain by applying information of the source domain. Given a sequence (e.g. a natural language sentence), the transfer learning, usually enabled by recurrent neural network (RNN), represents the sequential information transfer. RNN uses a chain of repeating cells to model the sequence data. However, previous studies of neural network based transfer learning simply represents the whole sentence by a single vector, which is unfeasible for seq2seq and sequence labeling. Meanwhile, such layer-wise transfer learning mechanisms lose the fine-grained cell-level information from the source domain. In this paper, we proposed the aligned recurrent transfer, ART, to achieve cell-level information transfer. ART is under the pre-training framework. Each cell attentively accepts transferred information from a set of positions in the source domain. Therefore, ART learns the cross-domain word collocations in a more flexible way. We conducted extensive experiments on both sequence labeling tasks (POS tagging, NER) and sentence classification (sentiment analysis). ART outperforms the state-of-the-arts over all experiments.
Question Answering (QA), as a research field, has primarily focused on either knowledge bases (KBs) or free text as a source of knowledge. These two sources have historically shaped the kinds of questions that are asked over these sources, and the methods developed to answer them. In this work, we look towards a practical use-case of QA over user-instructed knowledge that uniquely combines elements of both structured QA over knowledge bases, and unstructured QA over narrative, introducing the task of multi-relational QA over personal narrative. As a first step towards this goal, we make three key contributions: (i) we generate and release TextWorldsQA, a set of five diverse datasets, where each dataset contains dynamic narrative that describes entities and relations in a simulated world, paired with variably compositional questions over that knowledge, (ii) we perform a thorough evaluation and analysis of several state-of-the-art QA models and their variants at this task, and (iii) we release a lightweight Python-based framework we call TextWorlds for easily generating arbitrary additional worlds and narrative, with the goal of allowing the community to create and share a growing collection of diverse worlds as a test-bed for this task.
Recommendation systems and computing advertisements have gradually entered the field of academic research from the field of commercial applications. Click-through rate prediction is one of the core research issues because the prediction accuracy affects the user experience and the revenue of merchants and platforms. Feature engineering is very important to improve click-through rate prediction. Traditional feature engineering heavily relies on people’s experience, and is difficult to construct a feature combination that can describe the complex patterns implied in the data. This paper combines traditional feature combination methods and deep neural networks to automate feature combinations to improve the accuracy of click-through rate prediction. We propose a mechannism named ‘Field-aware Neural Factorization Machine’ (FNFM). This model can have strong second order feature interactive learning ability like Field-aware Factorization Machine, on this basis, deep neural network is used for higher-order feature combination learning. Experiments show that the model has stronger expression ability than current deep learning feature combination models like the DeepFM, DCN and NFM.
Although the fully-connected attention-based model Transformer has achieved great successes on many NLP tasks, it has heavy structure and usually requires large training data. In this paper, we present the Star-Transformer, an alternative and light-weighted model of the Transformer. To reduce the model complexity, we replace the fully-connected structure with a star-shaped structure, in which every two non-adjacent nodes are connected through a shared relay node. Thus, the Star-Transformer has lower complexity than the standard Transformer (from quadratic to linear according to the input length) and preserves the ability to handle with the long-range dependencies. The experiments on four tasks (22 datasets) show the Star-Transformer achieved significant improvements against the standard Transformer for the modestly sized datasets.
With the increasing variety of services that e-commerce platforms provide, criteria for evaluating their success become also increasingly multi-targeting. This work introduces a multi-target optimization framework with Bayesian modeling of the target events, called Deep Bayesian Multi-Target Learning (DBMTL). In this framework, target events are modeled as forming a Bayesian network, in which directed links are parameterized by hidden layers, and learned from training samples. The structure of Bayesian network is determined by model selection. We applied the framework to Taobao live-streaming recommendation, to simultaneously optimize (and strike a balance) on targets including click-through rate, user stay time in live room, purchasing behaviors and interactions. Significant improvement has been observed for the proposed method over other MTL frameworks and the non-MTL model. Our practice shows that with an integrated causality structure, we can effectively make the learning of a target benefit from other targets, creating significant synergy effects that improve all targets. The neural network construction guided by DBMTL fits in with the general probabilistic model connecting features and multiple targets, taking weaker assumption than the other methods discussed in this paper. This theoretical generality brings about practical generalization power over various targets distributions, including sparse targets and continuous-value ones.
We generalize the setting of online clustering of bandits by allowing non-uniform distribution over user frequencies. A more efficient algorithm is proposed with simple set structures to represent clusters. We prove a regret bound for the new algorithm which is free of the minimal frequency over users. The experiments on both synthetic and real datasets consistently show the advantage of the new algorithm over existing methods.
We present batch virtual adversarial training (BVAT), a novel regularization method for graph convolutional networks (GCNs). BVAT addresses the shortcoming of GCNs that do not consider the smoothness of the model’s output distribution against local perturbations around the input. We propose two algorithms, sample-based BVAT and optimization-based BVAT, which are suitable to promote the smoothness of the model for graph-structured data by either finding virtual adversarial perturbations for a subset of nodes far from each other or generating virtual adversarial perturbations for all nodes with an optimization process. Extensive experiments on three citation network datasets Cora, Citeseer and Pubmed and a knowledge graph dataset Nell validate the effectiveness of the proposed method, which establishes state-of-the-art results in the semi-supervised node classification tasks.
Inference over tails is usually performed by fitting an appropriate limiting distribution over observations that exceed a fixed threshold. However, the choice of such threshold is critical and can affect the inferential results. Extreme value mixture models have been defined to estimate the threshold using the full dataset and to give accurate tail estimates. Such models assume that the tail behavior is constant for all observations. However, the extreme behavior of financial returns often changes considerably in time and such changes occur by sudden shocks of the market. Here we extend the extreme value mixture model class to formally take into account distributional extreme changepoints, by allowing for the presence of regime-dependent parameters modelling the tail of the distribution. This extension formally uses the full dataset to both estimate the thresholds and the extreme changepoint locations, giving uncertainty measures for both quantities. Estimation of functions of interest in extreme value analyses is performed via MCMC algorithms. Our approach is evaluated through a series of simulations, applied to real data sets and assessed against competing approaches. Evidence demonstrates that the inclusion of different extreme regimes outperforms both static and dynamic competing approaches in financial applications.
Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically ‘similar’ data points and ‘negative samples,’ the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.
Machine learning algorithms have been effectively applied into various real world tasks. However, it is difficult to provide high-quality machine learning solutions to accommodate an unknown distribution of input datasets; this difficulty is called the uncertainty prediction problems. In this paper, a margin-based Pareto deep ensemble pruning (MBPEP) model is proposed. It achieves the high-quality uncertainty estimation with a small value of the prediction interval width (MPIW) and a high confidence of prediction interval coverage probability (PICP) by using deep ensemble networks. In addition to these networks, unique loss functions are proposed, and these functions make the sub-learners available for standard gradient descent learning. Furthermore, the margin criterion fine-tuning-based Pareto pruning method is introduced to optimize the ensembles. Several experiments including predicting uncertainties of classification and regression are conducted to analyze the performance of MBPEP. The experimental results show that MBPEP achieves a small interval width and a low learning error with an optimal number of ensembles. For the real-world problems, MBPEP performs well on input datasets with unknown distributions datasets incomings and improves learning performance on a multi task problem when compared to that of each single model.
Training a Neural Network (NN) with lots of parameters or intricate architectures creates undesired phenomena that complicate the optimization process. To address this issue we propose a first modular approach to NN design, wherein the NN is decomposed into a control module and several functional modules, implementing primitive operations. We illustrate the modular concept by comparing performances between a monolithic and a modular NN on a list sorting problem and show the benefits in terms of training speed, training stability and maintainability. We also discuss some questions that arise in modular NNs.
In this paper, we propose a novel pretraining-based encoder-decoder framework, which can generate the output sequence based on the input sequence in a two-stage manner. For the encoder of our model, we encode the input sequence into context representations using BERT. For the decoder, there are two stages in our model, in the first stage, we use a Transformer-based decoder to generate a draft output sequence. In the second stage, we mask each word of the draft sequence and feed it to BERT, then by combining the input sequence and the draft representation generated by BERT, we use a Transformer-based decoder to predict the refined word for each masked position. To the best of our knowledge, our approach is the first method which applies the BERT into text generation tasks. As the first step in this direction, we evaluate our proposed method on the text summarization task. Experimental results show that our model achieves new state-of-the-art on both CNN/Daily Mail and New York Times datasets.
The human mind is still an unknown process of neuroscience in many aspects. Nevertheless, for decades the scientific community has proposed computational models that try to simulate their parts, specific applications, or their behavior in different situations. The most complete model in this line is undoubtedly the LIDA model, proposed by Stan Franklin with the aim of serving as a generic computational architecture for several applications. The present project is inspired by the LIDA model to apply it to the process of movie recommendation, the model called MIRA (Movie Intelligent Recommender Agent) presented percentages of precision similar to a traditional model when submitted to the same assay conditions. Moreover, the proposed model reinforced the precision indexes when submitted to tests with volunteers, proving once again its performance as a cognitive model, when executed with small data volumes. Considering that the proposed model achieved a similar behavior to the traditional models under conditions expected to be similar for natural systems, it can be said that MIRA reinforces the applicability of LIDA as a path to be followed for the study and generation of computational agents inspired by neural behaviors.
This paper proposes a robust version of the unscented transform (UT) for one-dimensional random variables. It is assumed that the moments are not exactly known, but are known to lie in intervals. In this scenario, the moment matching equations are reformulated as a system of polynomial equations and inequalities, and it is proposed to use the Chebychev center of the solution set as a robust UT. This method yields a parametrized polynomial optimization problem, which in spite of being NP-Hard, can be relaxed by some algorithms that are proposed in this paper.
Targeted sentiment classification aims at determining the sentimental tendency towards specific targets. Most of the previous approaches model context and target words using recurrent neural networks such as LSTM in conjunction with attention mechanisms. However, LSTM networks are difficult to parallelize because of their sequential nature. Moreover, since full backpropagation over the sequence requires large amounts of memory, essentially every implementation of backpropagation through time is the truncated version, which brings difficulty in remembering long-term patterns. To address these issues, this paper propose an Attentional Encoder Network (AEN) for targeted sentiment classification. Contrary to previous LSTM based works, AEN eschews complex recurrent neural networks and employs attention based encoders for the modeling between context and target, which can excavate the rich introspective and interactive semantic information from the word embeddings without considering the distance between words. This paper also raise the label unreliability issue and introduce label smoothing regularization term to the loss function for encouraging the model to be less confident with the training labels. Experimental results on three benchmark datasets demonstrate that our model achieves comparable or superior performances with a lightweight model size.
We introduce a new methodology for analyzing serial data by quantile regression assuming that the underlying quantile function consists of constant segments. The procedure does not rely on any distributional assumption besides serial independence. It is based on a multiscale statistic, which allows to control the (finite sample) probability for selecting the correct number of segments S at a given error level, which serves as a tuning parameter. For a proper choice of this parameter, this tends exponentially fast to the true S, as sample size increases. We further show that the location and size of segments are estimated at minimax optimal rate (compared to a Gaussian setting) up to a log-factor. Thereby, our approach leads to (asymptotically) uniform confidence bands for the entire quantile regression function in a fully nonparametric setup. The procedure is efficiently implemented using dynamic programming techniques with double heap structures, and software is provided. Simulations and data examples from genetic sequencing and ion channel recordings confirm the robustness of the proposed procedure, which at the same hand reliably detects changes in quantiles from arbitrary distributions with precise statistical guarantees.
To address the challenges in learning deep generative models (e.g.,the blurriness of variational auto-encoder and the instability of training generative adversarial networks, we propose a novel deep generative model, named Wasserstein-Wasserstein auto-encoders (WWAE). We formulate WWAE as minimization of the penalized optimal transport between the target distribution and the generated distribution. By noticing that both the prior and the aggregated posterior of the latent code Z can be well captured by Gaussians, the proposed WWAE utilizes the closed-form of the squared Wasserstein-2 distance for two Gaussians in the optimization process. As a result, WWAE does not suffer from the sampling burden and it is computationally efficient by leveraging the reparameterization trick. Numerical results evaluated on multiple benchmark datasets including MNIST, fashion- MNIST and CelebA show that WWAE learns better latent structures than VAEs and generates samples of better visual quality and higher FID scores than VAEs and GANs.
We study the multi-round response generation in visual dialog systems, where a response is generated according to a visually grounded conversational history. Given a triplet: an image, Q&A history, and current question, all the prevailing methods follow a codec (ie, encoder-decoder) fashion in the supervised learning paradigm: a multimodal encoder encodes the triplet into a feature vector, which is then fed into the decoder for the current answer generation, supervised by the ground-truth answer. However, this conventional supervised learning does not take into account the impact of imperfect history in the codec training, violating the conversational nature of visual dialog and thus making the codec more inclined to learn dataset bias but not visual reasoning. To this end, inspired by the actor-critic policy gradient in reinforcement learning, we propose a novel training paradigm called Gold-Critic Sequence Training (GCST). Specifically, we intentionally impose wrong answers in the history, obtaining an adverse reward, and see how the historic error impacts the codec’s future behavior by subtracting the gold-critic baseline — reward obtained by using ground-truth history — from the adverse reward. Moreover, to make the codec more sensitive to the history, we propose a novel attention network called Recurrent Co-Attention Network (RCAN) which can be effectively trained by using GCST. Experimental results on three benchmarks: VisDial0.9&1.0 and GuessWhat?!, show that the proposed GCST strategy consistently outperforms over state-of-the-art supervised counterparts under all metrics.
Interpretability has always been a major concern for fuzzy rule-based classifiers. The usage of human-readable models allows them to explain the reasoning behind their predictions and decisions. However, when it comes to Big Data classification problems, fuzzy rule-based classifiers have not been able to maintain the good trade-off between accuracy and interpretability that has characterized these techniques in non-Big Data environments. The most accurate methods build too complex models composed of a large number of rules and fuzzy sets, while those approaches focusing on interpretability do not provide state-of-the-art discrimination capabilities. In this paper, we propose a new distributed learning algorithm named CFM-BD to construct accurate and compact fuzzy rule-based classification systems for Big Data. This method has been specifically designed from scratch for Big Data problems and does not adapt or extend any existing algorithm. The proposed learning process consists of three stages: 1) pre-processing based on the probability integral transform theorem; 2) rule induction inspired by CHI-BD and Apriori algorithms; 3) rule selection by means of a global evolutionary optimization. We conducted a complete empirical study to test the performance of our approach in terms of accuracy, complexity, and runtime. The results obtained were compared and contrasted with four state-of-the-art fuzzy classifiers for Big Data (FBDT, FMDT, Chi-Spark-RS, and CHI-BD). According to this study, CFM-BD is able to provide competitive discrimination capabilities using significantly simpler models composed of a few rules of less than 3 antecedents, employing 5 linguistic labels for all variables.
Online communities such as Facebook and Twitter are enormously popular and have become an essential part of the daily life of many of their users. Through these platforms, users can discover and create information that others will then consume. In that context, recommending relevant information to users becomes critical for viability. However, recommendation in online communities is a challenging problem: 1) users’ interests are dynamic, and 2) users are influenced by their friends. Moreover, the influencers may be context-dependent. That is, different friends may be relied upon for different topics. Modeling both signals is therefore essential for recommendations. We propose a recommender system for online communities based on a dynamic-graph-attention neural network. We model dynamic user behaviors with a recurrent neural network, and context-dependent social influence with a graph-attention neural network, which dynamically infers the influencers based on users’ current interests. The whole model can be efficiently fit on large-scale data. Experimental results on several real-world data sets demonstrate the effectiveness of our proposed approach over several competitive baselines including state-of-the-art models.
The order of the tasks a continual learning model encounters may have large impact on the performance of each task, as well as the task-average performance. This order-sensitivity may cause serious problems in real-world scenarios where fairness plays a critical role (e.g. medical diagnosis). To tackle this problem, we propose a novel order-robust continual learning method, which instead of learning a completely shared set of weights, represent the parameters for each task as a sum of task-shared parameters that captures generic representations and task-adaptive parameters capturing task-specific ones, where the latter is factorized into sparse low-rank matrices in order to minimize capacity increase. With such parameter decomposition, when training for a new task, the task-adaptive parameters for earlier tasks remain mostly unaffected, where we update them only to reflect the changes made to the task-shared parameters. This prevents catastrophic forgetting for old tasks and at the same time make the model less sensitive to the task arrival order. We validate our Order-Robust Adaptive Continual LEarning (ORACLE) method on multiple benchmark datasets against state-of-the-art continual learning methods, and the results show that it largely outperforms those strong baselines with significantly less increase in capacity and training time, as well as obtains smaller performance disparity for each task with different order sequences.
We consider the problem of building a state representation model for control, in a continual learning setting. As the environment changes, the aim is to efficiently compress the sensory state’s information without losing past knowledge, and then use Reinforcement Learning on the resulting features for efficient policy learning. To this end, we propose S-TRIGGER, a general method for Continual State Representation Learning applicable to Variational Auto-Encoders and its many variants. The method is based on Generative Replay, i.e. the use of generated samples to maintain past knowledge. It comes along with a statistically sound method for environment change detection, which self-triggers the Generative Replay. Our experiments on VAEs show that S-TRIGGER learns state representations that allows fast and high-performing Reinforcement Learning, while avoiding catastrophic forgetting. The resulting system is capable of autonomously learning new information without using past data and with a bounded system size. Code for our experiments is attached in Appendix.