Some real-world domains are best characterized as a single task, but for others this perspective is limiting. Instead, some tasks continually grow in complexity, in tandem with the agent’s competence. In continual learning, also referred to as lifelong learning, there are no explicit task boundaries or curricula. As learning agents have become more powerful, continual learning remains one of the frontiers that has resisted quick progress. To test continual learning capabilities we consider a challenging 3D domain with an implicit sequence of tasks and sparse rewards. We propose a novel agent architecture called Unicorn, which demonstrates strong continual learning and outperforms several baseline agents on the proposed domain. The agent achieves this by jointly representing and learning multiple policies efficiently, using a parallel off-policy learning setup.
We present a content-based method for recommending citations in an academic paper draft. We embed a given query document into a vector space, then use its nearest neighbors as candidates, and rerank the candidates using a discriminative model trained to distinguish between observed and unobserved citations. Unlike previous work, our method does not require metadata such as author names which can be missing, e.g., during the peer review process. Without using metadata, our method outperforms the best reported results on PubMed and DBLP datasets with relative improvements of over 18% in F1@20 and over 22% in MRR. We show empirically that, although adding metadata improves the performance on standard metrics, it favors self-citations which are less useful in a citation recommendation setup. We release an online portal (http://…/citeomatic ) for citation recommendation based on our method, and a new dataset OpenCorpus of 7 million research articles to facilitate future research on this task.
In recent years, Deep Reinforcement Learning has made impressive advances in solving several important benchmark problems for sequential decision making. Many control applications use a generic multilayer perceptron (MLP) for non-vision parts of the policy network. In this work, we propose a new neural network architecture for the policy network representation that is simple yet effective. The proposed Structured Control Net (SCN) splits the generic MLP into two separate sub-modules: a nonlinear control module and a linear control module. Intuitively, the nonlinear control is for forward-looking and global control, while the linear control stabilizes the local dynamics around the residual of global control. We hypothesize that this will bring together the benefits of both linear and nonlinear policies: improve training sample efficiency, final episodic reward, and generalization of learned policy, while requiring a smaller network and being generally applicable to different training methods. We validated our hypothesis with competitive results on simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban driving environment, with various ablation and generalization tests, trained with multiple black-box and policy gradient training methods. The proposed architecture has the potential to improve upon broader control tasks by incorporating problem specific priors into the architecture. As a case study, we demonstrate much improved performance for locomotion tasks by emulating the biological central pattern generators (CPGs) as the nonlinear part of the architecture.
A central question of crowd-sourcing is how to elicit expertise from agents. This is even more difficult when answers cannot be directly verified. A key challenge is that sophisticated agents may strategically withhold effort or information when they believe their payoff will be based upon comparison with other agents whose reports will likely omit this information due to lack of effort or expertise. Our work defines a natural model for this setting based on the assumption that \emph{more sophisticated agents know the beliefs of less sophisticated agents}. We then provide a mechanism design framework for this setting. From this framework, we design several novel mechanisms, for both the single and multiple question settings, that (1) encourage agents to invest effort and provide their information honestly; (2) output a correct ‘hierarchy’ of the information when agents are rational.
We present a deep learning model, DE-LSTM, for the simulation of a stochastic process with underlying nonlinear dynamics. The deep learning model aims to approximate the probability density function of a stochastic process via numerical discretization and the underlying nonlinear dynamics is modeled by the Long Short-Term Memory (LSTM) network. After the numerical discretization by a softmax function, the function estimation problem is solved by a multi-label classification problem. A penalized maximum log likelihood method is proposed to impose smoothness in the predicted probability distribution. It is shown that LSTM is a state space model, where the internal dynamics consists of a system of relaxation processes. A sequential Monte Carlo method is outlined to compute the time evolution of the probability distribution. The behavior of DE-LSTM is investigated by using the Ornstein-Uhlenbeck process and noisy observations of Mackey-Glass equation and forced Van der Pol oscillators. While the probability distribution computed by the conventional maximum log likelihood method makes a good prediction of the first and second moments, the Kullback-Leibler divergence shows that the penalized maximum log likelihood method results in a probability distribution closer to the ground truth. It is shown that DE-LSTM makes a good prediction of the probability distribution without assuming any distributional properties of the noise. For a multiple-step forecast, it is found that the prediction uncertainty, denoted by the 95% confidence interval, does not grow monotonically in time. For a chaotic system, Mackey-Glass time series, the 95% confidence interval first grows, then exhibits an oscillatory behavior, instead of growing indefinitely, while for the forced Van der Pol oscillator, the prediction uncertainty does not grow in time even for 3,000-step forecast.
We examine two fundamental tasks associated with graph representation learning: link prediction and semi-supervised node classification. We present a densely connected autoencoder architecture capable of learning a joint representation of both local graph structure and available external node features for the multi-task learning of link prediction and node classification. To the best of our knowledge, this is the first architecture that can be efficiently trained end-to-end in a single learning stage to simultaneously perform link prediction and node classification. We provide comprehensive empirical evaluation of our models on a range of challenging benchmark graph-structured datasets, and demonstrate significant improvement in accuracy over related methods for graph representation learning. Code implementation is available at https://…/graph-representation-learning
As architecture, system, data management, and machine learning communities pay greater attention to innovative big data and data-driven artificial intelligence (in short, AI) algorithms, architecture, and systems, the pressure of benchmarking rises. However, complexity, diversity, frequently changed workloads, and rapid evolution of big data, especially AI systems raise great challenges in benchmarking. First, for the sake of conciseness, benchmarking scalability, portability cost, reproducibility, and better interpretation of performance data, we need understand what are the abstractions of frequently-appearing units of computation, which we call dwarfs, among big data and AI workloads. Second, for the sake of fairness, the benchmarks must include diversity of data and workloads. Third, for co-design of software and hardware, the benchmarks should be consistent across different communities. Other than creating a new benchmark or proxy for every possible workload, we propose using dwarf-based benchmarks–the combination of eight dwarfs–to represent diversity of big data and AI workloads. The current version–BigDataBench 4.0 provides 13 representative real-world data sets and 47 big data and AI benchmarks, including seven workload types: online service, offline analytics, graph analytics, AI, data warehouse, NoSQL, and streaming. BigDataBench 4.0 is publicly available from http://…/BigDataBench. Also, for the first time, we comprehensively characterize the benchmarks of seven workload types in BigDataBench 4.0 in addition to traditional benchmarks like SPECCPU, PARSEC and HPCC in a hierarchical manner and drill down on five levels, using the Top-Down analysis from an architecture perspective.
The $k$-means algorithm is the most popular nonparametric clustering method in use, but cannot generally be applied to data sets with missing observations. The usual practice with such data sets is to either impute the values under an assumption of a missing-at-random mechanism or to ignore the incomplete records, and then to use the desired clustering method. We develop an efficient version of the $k$-means algorithm that allows for clustering cases where not all the features have observations recorded. Our extension is called $k_m$-means and reduces to the $k$-means algorithm when all records are complete. We also provide strategies to initialize our algorithm and to estimate the number of groups in the data set. Illustrations and simulations demonstrate the efficacy of our approach in a variety of settings and patterns of missing data. Our methods are also applied to the clustering of gamma-ray bursts and to the analysis of activation images obtained from a functional Magnetic Resonance Imaging experiment.
Because of the internal malfunction of satellite sensors and poor atmospheric conditions such as thick cloud, the acquired remote sensing data often suffer from missing information, i.e., the data usability is greatly reduced. In this paper, a novel method of missing information reconstruction in remote sensing images is proposed. The unified spatial-temporal-spectral framework based on a deep convolutional neural network (STS-CNN) employs a unified deep convolutional neural network combined with spatial-temporal-spectral supplementary information. In addition, to address the fact that most methods can only deal with a single missing information reconstruction task, the proposed approach can solve three typical missing information reconstruction tasks: 1) dead lines in Aqua MODIS band 6; 2) the Landsat ETM+ Scan Line Corrector (SLC)-off problem; and 3) thick cloud removal. It should be noted that the proposed model can use multi-source data (spatial, spectral, and temporal) as the input of the unified framework. The results of both simulated and real-data experiments demonstrate that the proposed model exhibits high effectiveness in the three missing information reconstruction tasks listed above.
Experimental design is a classical problem in statistics and has also found new applications in machine learning. In the experimental design problem, the aim is to estimate an unknown vector x in m-dimensions from linear measurements where a Gaussian noise is introduced in each measurement. The goal is to pick k out of the given n experiments so as to make the most accurate estimate of the unknown parameter x. Given a set S of chosen experiments, the most likelihood estimate x’ can be obtained by a least squares computation. One of the robust measures of error estimation is the D-optimality criterion which aims to minimize the generalized variance of the estimator. This corresponds to minimizing the volume of the standard confidence ellipsoid for the estimation error x-x’. The problem gives rise to two natural variants depending on whether repetitions are allowed or not. The latter variant, while being more general, has also found applications in the geographical location of sensors. In this work, we first show that a 1/e-approximation for the D-optimal design problem with and without repetitions giving us the first constant factor approximation for the problem. We also consider the case when the number of experiments chosen is much larger than the dimension of the measurements and provide an asymptotically optimal approximation algorithm.
Recently, some studies have utilized the Markov Decision Process for diversifying (MDP-DIV) the search results in information retrieval. Though promising performances can be delivered, MDP-DIV suffers from a very slow convergence, which hinders its usability in real applications. In this paper, we aim to promote the performance of MDP-DIV by speeding up the convergence rate without much accuracy sacrifice. The slow convergence is incurred by two main reasons: the large action space and data scarcity. On the one hand, the sequential decision making at each position needs to evaluate the query-document relevance for all the candidate set, which results in a huge searching space for MDP; on the other hand, due to the data scarcity, the agent has to proceed more ‘trial and error’ interactions with the environment. To tackle this problem, we propose MDP-DIV-kNN and MDP-DIV-NTN methods. The MDP-DIV-kNN method adopts a $k$ nearest neighbor strategy, i.e., discarding the $k$ nearest neighbors of the recently-selected action (document), to reduce the diversification searching space. The MDP-DIV-NTN employs a pre-trained diversification neural tensor network (NTN-DIV) as the evaluation model, and combines the results with MDP to produce the final ranking solution. The experiment results demonstrate that the two proposed methods indeed accelerate the convergence rate of the MDP-DIV, which is 3x faster, while the accuracies produced barely degrade, or even are better.
Recommender systems are one of the most successful applications of data mining and machine learning technology in practice. Academic research in the field is historically often based on the matrix completion problem formulation, where for each user-item-pair only one interaction (e.g., a rating) is considered. In many application domains, however, multiple user-item interactions of different types can be recorded over time. And, a number of recent works have shown that this information can be used to build richer individual user models and to discover additional behavioral patterns that can be leveraged in the recommendation process. In this work we review existing works that consider information from such sequentially-ordered user- item interaction logs in the recommendation process. Based on this review, we propose a categorization of the corresponding recommendation tasks and goals, summarize existing algorithmic solutions, discuss methodological approaches when benchmarking what we call sequence-aware recommender systems, and outline open challenges in the area.
Neural networks are commonly regarded as black boxes performing incomprehensible functions. For classification problems networks provide maps from high dimensional feature space to K-dimensional image space. Images of training vector are projected on polygon vertices, providing visualization of network function. Such visualization may show the dynamics of learning, allow for comparison of different networks, display training vectors around which potential problems may arise, show differences due to regularization and optimization procedures, investigate stability of network classification under perturbation of original vectors, and place new data sample in relation to training data, allowing for estimation of confidence in classification of a given sample. An illustrative example for the three-class Wine data and five-class Satimage data is described. The visualization method proposed here is applicable to any black box system that provides continuous outputs.
Over the last years, stream data processing has been gaining attention both in industry and in academia due to its wide range of applications. To fulfill the need for scalable and efficient stream analytics, numerous open source stream data processing systems (SDPSs) have been developed, with high throughput and low latency being their key performance targets. In this paper, we propose a framework to evaluate the performance of three SDPSs, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations. For this benchmark, we design workloads based on real-life, industrial use-cases. The main contribution of this work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we completely separate the system under test and driver, so that the measurement results are closer to actual system performance under real conditions. Third, we build the first driver to test the actual sustainable performance of a system under test. Our detailed evaluation highlights that there is no single winner, but rather, each system excels in individual use-cases.
Knowledge can be represented compactly in a multitude ways, from a set of propositional formulas, to a Kripke model, to a database. In this paper we study the aggregation of information coming from multiple sources, each source submitting a database modelled as a first-order relational structure. In the presence of an integrity constraint, we identify classes of aggregators that respect it in the aggregated database, provided all individual databases satisfy it. We also characterise languages for first-order queries on which the answer to queries on the aggregated database coincides with the aggregation of the answers to the query obtained on each individual database. This contribution is meant to be a first step on the application of techniques from rational choice theory to knowledge representation in databases.
The huge size of deep networks hinders their use in small computing devices. In this paper, we consider compressing the network by weight quantization. We extend a recently proposed loss-aware weight binarization scheme to ternarization, with possibly different scaling parameters for the positive and negative weights, and m-bit (where m > 2) quantization. Experiments on feedforward and recurrent neural networks show that the proposed scheme outperforms state-of-the-art weight quantization algorithms, and is as accurate (or even more accurate) than the full-precision network.
Single document summarization is the task of producing a shorter version of a document while preserving its principal information content. In this paper we conceptualize extractive summarization as a sentence ranking task and propose a novel training algorithm which globally optimizes the ROUGE evaluation metric through a reinforcement learning objective. We use our algorithm to train a neural summarization model on the CNN and DailyMail datasets and demonstrate experimentally that it outperforms state-of-the-art extractive and abstractive systems when evaluated automatically and by humans.
This paper introduces test and estimation procedures for abrupt and gradual changes in the entire jump behaviour of a discretely observed Ito semimartingale. In contrast to existing work we analyse jumps of arbitrary size which are not restricted to a minimum height. Our methods are based on weak convergence of a truncated sequential empirical distribution function of the jump characteristic of the underlying Ito semimartingale. Critical values for the new tests are obtained by a multiplier bootstrap approach and we investigate the performance of the tests also under local alternatives. An extensive simulation study shows the finite-sample properties of the new procedures.
Choosing optimal (or at least better) policies is an important problem in domains from medicine to education to finance and many others. One approach to this problem is through controlled experiments/trials – but controlled experiments are expensive. Hence it is important to choose the best policies on the basis of observational data. This presents two difficult challenges: (i) missing counterfactuals, and (ii) selection bias. This paper presents theoretical bounds on estimation errors of counterfactuals from observational data by making connections to domain adaptation theory. It also presents a principled way of choosing optimal policies using domain adversarial neural networks. We illustrate the effectiveness of domain adversarial training together with various features of our algorithm on a semi-synthetic breast cancer dataset and a supervised UCI dataset (Statlog).