Extractive compression is a challenging natural language processing problem. This work contributes by formulating neural extractive compression as a parse tree transduction problem, rather than a sequence transduction task. Motivated by this, we introduce a deep neural model for learning structure-to-substructure tree transductions by extending the standard Long Short-Term Memory, considering the parent-child relationships in the structural recursion. The proposed model can achieve state of the art performance on sentence compression benchmarks, both in terms of accuracy and compression rate.
We introduce operators $\mathsf{hare}$ and $\mathsf{tortoise}$, which act on words as natural generalizations of West’s stack-sorting map. We show that the heuristically slower algorithm $\mathsf{tortoise}$ can sort words arbitrarily faster than its counterpart $\mathsf{hare}$. We then generalize the combinatorial objects known as valid hook configurations in order to find a method for computing the number of preimages of any word under these two operators. We relate the question of determining which words are sortable by $\mathsf{hare}$ and $\mathsf{tortoise}$ to more classical problems in pattern avoidance, and we derive a recurrence for the number of words with a fixed number of copies of each letter (permutations of a multiset) that are sortable by each map. We conclude with several open problems and conjectures. In particular, our investigation of $\mathsf{tortoise}$-sortable words leads us to conjecture that there are $\frac{1}{\ell n+1}{(\ell+1)n\choose n}$ $\ell$-uniform (normalized) words of length $\ell n$ that avoid the patterns $231$ and $221$; we prove this conjecture in the case $\ell=2$.
This paper presents a technique that combines the occurrence of certain events, as observed by different sensors, in order to detect and classify objects. This technique explores the extent of dependence between features being observed by the sensors, and generates more informed probability distributions over the events. Provided some additional information about the features of the object, this fusion technique can outperform other existing decision level fusion approaches that may not take into account the relationship between different features.
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to a transcript, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of domains, intents, and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem [1]. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. Evaluations on a real-world task show that 1) having an intermediate text representation is crucial for the quality of the predicted semantics, especially the intent arguments and 2) jointly optimizing the full system improves overall accuracy of prediction. Compared to independently trained models, our best jointly trained model achieves similar domain and intent prediction F1 scores, but improves argument word error rate by 18% relative.
Gaussian processes constitute a very powerful and well-understood method for non-parametric regression and classification. In the classical framework, the training data consists of deterministic vector-valued inputs and the corresponding (noisy) measurements whose joint distribution is assumed to be Gaussian. In many practical applications, however, the inputs are either noisy, i.e., each input is a vector-valued sample from an unknown probability distribution, or the probability distributions are the inputs. In this paper, we address Gaussian process regression with inputs given in form of probability distributions and propose a framework that is based on distances between such inputs. To this end, we review different admissible distance measures and provide a numerical example that demonstrates our framework.
For successful deployment of deep neural networks on highly–resource-constrained devices (hearing aids, earbuds, wearables), we must simplify the types of operations and the memory/power resources used during inference. Completely avoiding inference-time floating-point operations is one of the simplest ways to design networks for these highly-constrained environments. By discretizing both our in-network non-linearities and our network weights, we can move to simple, compact networks without floating point operations, without multiplications, and avoid all non-linear function computations. Our approach allows us to explore the spectrum of possible networks, ranging from fully continuous versions down to networks with bi-level weights and activations. Our results show that discretization can be done without loss of performance and that we can train a network that will successfully operate without floating-point, without multiplication, and with less RAM on both regression tasks (auto encoding) and multi-class classification tasks (ImageNet). The memory needed to deploy our discretized networks is less than one third of the equivalent architecture that does use floating-point operations.
Data-driven algorithms play a large role in decision making across a variety of industries. Increasingly, these algorithms are being used to make decisions that have significant ramifications for people’s social and economic well-being, e.g. in sentencing, loan approval, and policing. Amid the proliferation of such systems there is a growing concern about their potential discriminatory impact. In particular, machine learning systems which are trained on biased data have the potential to learn and perpetuate those biases. A central challenge for practitioners is thus to determine whether their models display discriminatory bias. Here we present a case study in which we frame the issue of bias detection as a causal inference problem with observational data. We enumerate two main causes of bias, sampling bias and label bias, and we investigate the abilities of six different fairness metrics to detect each bias type. Based on these investigations, we propose a set of best practice guidelines to select the fairness metric that is most likely to detect bias if it is present. Additionally, we aim to identify the conditions in which certain fairness metrics may fail to detect bias and instead give practitioners a false belief that their biased model is making fair decisions.
In this work, we introduce an asynchronous decentralized accelerated stochastic gradient descent type of method for decentralized stochastic optimization, considering communication and synchronization are the major bottlenecks. We establish $\mathcal{O}(1/\epsilon)$ (resp., $\mathcal{O}(1/\sqrt{\epsilon})$) communication complexity and $\mathcal{O}(1/\epsilon^2)$ (resp., $\mathcal{O}(1/\epsilon)$) sampling complexity for solving general convex (resp., strongly convex) problems.
Robots and autonomous agents often complete goal-based tasks with limited resources, relying on imperfect models and sensor measurements. In particular, reinforcement learning (RL) and feedback control can be used to help a robot achieve a goal. Taking advantage of this body of work, this paper formulates general computation as a feedback-control problem, which allows the agent to autonomously overcome some limitations of standard procedural language programming: resilience to errors and early program termination. Our formulation considers computation to be trajectory generation in the program’s variable space. The computing then becomes a sequential decision making problem, solved with reinforcement learning (RL), and analyzed with Lyapunov stability theory to assess the agent’s resilience and progression to the goal. We do this through a case study on a quintessential computer science problem, array sorting. Evaluations show that our RL sorting agent makes steady progress to an asymptotically stable goal, is resilient to faulty components, and performs less array manipulations than traditional Quicksort and Bubble sort.
A novel approach is put forth that utilizes data similarity, quantified on a graph, to improve upon the reconstruction performance of principal component analysis. The tasks of data dimensionality reduction and reconstruction are formulated as graph filtering operations, that enable the exploitation of data node connectivity in a graph via the adjacency matrix. The unknown reducing and reconstruction filters are determined by optimizing a mean-square error cost that entails the data, as well as their graph adjacency matrix. Working in the graph spectral domain enables the derivation of simple gradient descent recursions used to update the matrix filter taps. Numerical tests in real image datasets demonstrate the better reconstruction performance of the novel method over standard principal component analysis.
Designing search algorithms for finding global optima is one of the most active research fields, recently. These algorithms consist of two main categories, i.e., classic mathematical and metaheuristic algorithms. This article proposes a meta-algorithm, Tree-Based Optimization (TBO), which uses other heuristic optimizers as its sub-algorithms in order to improve the performance of search. The proposed algorithm is based on mathematical tree subject and improves performance and speed of search by iteratively removing parts of the search space having low fitness, in order to minimize and purify the search space. The experimental results on several well-known benchmarks show the outperforming performance of TBO algorithm in finding the global solution. Experiments on high dimensional search spaces show significantly better performance when using the TBO algorithm. The proposed algorithm improves the search algorithms in both accuracy and speed aspects, especially for high dimensional searching such as in VLSI CAD tools for Integrated Circuit (IC) design.
Joint object detection and semantic segmentation can be applied to many fields, such as self-driving cars and unmanned surface vessels. An initial and important progress towards this goal has been achieved by simply sharing the deep convolutional features for the two tasks. However, this simple scheme is unable to make full use of the fact that detection and segmentation are mutually beneficial. To overcome this drawback, we propose a framework called TripleNet where triple supervisions including detection-oriented supervision, class-aware segmentation supervision, and class-agnostic segmentation supervision are imposed on each layer of the decoder network. Class-agnostic segmentation supervision provides an objectness prior knowledge for both semantic segmentation and object detection. Besides the three types of supervisions, two light-weight modules (i.e., inner-connected module and attention skip-layer fusion) are also incorporated into each layer of the decoder. In the proposed framework, detection and segmentation can sufficiently boost each other. Moreover, class-agnostic and class-aware segmentation on each decoder layer are not performed at the test stage. Therefore, no extra computational costs are introduced at the test stage. Experimental results on the VOC2007 and VOC2012 datasets demonstrate that the proposed TripleNet is able to improve both the detection and segmentation accuracies without adding extra computational costs.
With a growing number of social apps, people have become increasingly willing to share their everyday photos and events on social media platforms, such as Facebook, Instagram, and WeChat. In social media data mining, post popularity prediction has received much attention from both data scientists and psychologists. Existing research focuses more on exploring the post popularity on a population of users and including comprehensive factors such as temporal information, user connections, number of comments, and so on. However, these frameworks are not suitable for guiding a specific user to make a popular post because the attributes of this user are fixed. Therefore, previous frameworks can only answer the question ‘whether a post is popular’ rather than ‘how to become famous by popular posts’. In this paper, we aim at predicting the popularity of a post for a specific user and mining the patterns behind the popularity. To this end, we first collect data from Instagram. We then design a method to figure out the user environment, representing the content that a specific user is very likely to post. Based on the relevant data, we devise a novel dual-attention model to incorporate image, caption, and user environment. The dual-attention model basically consists of two parts, explicit attention for image-caption pairs and implicit attention for user environment. A hierarchical structure is devised to concatenate the explicit attention part and implicit attention part. We conduct a series of experiments to validate the effectiveness of our model and investigate the factors that can influence the popularity. The classification results show that our model outperforms the baselines, and a statistical analysis identifies what kind of pictures or captions can help the user achieve a relatively high ‘likes’ number.
For many cache-oblivious algorithms for dynamic programming and linear algebra, we observe that the key factor that affects the cache complexity is the number of input entries involved in each basic computation cell. In this paper, we propose a level of abstraction to capture this property, and refer to it as the $k$-d grid computation structure. We then show the computational lower bounds for this grid structure, and propose efficient and highly-parallel algorithms to compute such grid structure that optimize the number of arithmetic operations, parallel depth, and the cache complexity in both the classic setting when reads and writes have the same cost, and the asymmetric variant that considers writes to be more expensive than reads. Using the abstraction with the proposed algorithms as the implementation, we propose cache-oblivious algorithms for many fundamental problems with improved cache complexities in both the classic and asymmetric settings. The cache bounds are optimal in most applications we consider. Meanwhile, we also reduce the parallel depths of many problems. We believe that the novelty of our framework is of interests and leads to many new questions for future work.
Despite deep reinforcement learning has recently achieved great successes, however in multiagent environments, a number of challenges still remain. Multiagent reinforcement learning (MARL) is commonly considered to suffer from the problem of non-stationary environments and exponentially increasing policy space. It would be even more challenging to learn effective policies in circumstances where the rewards are sparse and delayed over long trajectories. In this paper, we study Hierarchical Deep Multiagent Reinforcement Learning (hierarchical deep MARL) in cooperative multiagent problems with sparse and delayed rewards, where efficient multiagent learning methods are desperately needed. We decompose the original MARL problem into hierarchies and investigate how effective policies can be learned hierarchically in synchronous/asynchronous hierarchical MARL frameworks. Several hierarchical deep MARL architectures, i.e., Ind-hDQN, hCom and hQmix, are introduced for different learning paradigms. Moreover, to alleviate the issues of sparse experiences in high-level learning and non-stationarity in multiagent settings, we propose a new experience replay mechanism, named as Augmented Concurrent Experience Replay (ACER). We empirically demonstrate the effects and efficiency of our approaches in several classic Multiagent Trash Collection tasks, as well as in an extremely challenging team sports game, i.e., Fever Basketball Defense.
The era of big data provides researchers with convenient access to copious data. However, people often have little knowledge about it. The increasing prevalence of big data is challenging the traditional methods of learning causality because they are developed for the cases with limited amount of data and solid prior causal knowledge. This survey aims to close the gap between big data and learning causality with a comprehensive and structured review of traditional and frontier methods and a discussion about some open problems of learning causality. We begin with preliminaries of learning causality. Then we categorize and revisit methods of learning causality for the typical problems and data types. After that, we discuss the connections between learning causality and machine learning. At the end, some open problems are presented to show the great potential of learning causality with data.
Irregularly sampling a spatially stationary random field does not yield a graph stationary signal in general. Based on this observation, we build a definition of graph stationarity based on intrinsic stationarity, a less restrictive definition of classical stationarity. We introduce the concept of graph variogram, a novel tool for measuring spatial intrinsic stationarity at local and global scales for irregularly sampled signals by selecting subgraphs of local neighborhoods. Graph variograms are extensions of variograms used for signals defined on continuous Euclidean space. Our experiments with intrinsically stationary signals sampled on a graph, demonstrate that graph variograms yield estimates with small bias of true theoretical models, while being robust to sampling variation of the space.
Regularized online learning is widely used in machine learning. In this paper we analyze a class of regularized online algorithm with both non-linearized losses and non-linearized regularizers, which we call fully implicit online learning (FIOL). It is shown that because of avoiding the error of linearization, an extra additive regret gain can be obtained for FIOL. Then we show that by exploring the structure of the loss and regularizer, each iteration of FIOL can be exactly solved with time comparable to its linearized version, even if no closed-form solution exists. Experiments validate the proposed approaches.
Accelerated coordinate descent is a widely popular optimization algorithm due to its efficiency on large-dimensional problems. It achieves state-of-the-art complexity on an important class of empirical risk minimization problems. In this paper we design and analyze an accelerated coordinate descent (ACD) method which in each iteration updates a random subset of coordinates according to an arbitrary but fixed probability law, which is a parameter of the method. If all coordinates are updated in each iteration, our method reduces to the classical accelerated gradient descent method AGD of Nesterov. If a single coordinate is updated in each iteration, and we pick probabilities proportional to the square roots of the coordinate-wise Lipschitz constants, our method reduces to the currently fastest coordinate descent method NUACDM of Allen-Zhu, Qu, Richt\'{a}rik and Yuan. While mini-batch variants of ACD are more popular and relevant in practice, there is no importance sampling for ACD that outperforms the standard uniform mini-batch sampling. Through insights enabled by our general analysis, we design new importance sampling for mini-batch ACD which significantly outperforms previous state-of-the-art minibatch ACD in practice. We prove a rate that is at most ${\cal O}(\sqrt{\tau})$ times worse than the rate of minibatch ACD with uniform sampling, but can be ${\cal O}(n/\tau)$ times better, where $\tau$ is the minibatch size. Since in modern supervised learning training systems it is standard practice to choose $\tau \ll n$, and often $\tau={\cal O}(1)$, our method can lead to dramatic speedups. Lastly, we obtain similar results for minibatch nonaccelerated CD as well, achieving improvements on previous best rates.
We present a Bayesian method for feature selection in the presence of grouping information with sparsity on the between- and within group level. Instead of using a stochastic algorithm for parameter inference, we employ expectation propagation, which is a deterministic and fast algorithm. Available methods for feature selection in the presence of grouping information have a number of short-comings: on one hand, lasso methods, while being fast, underestimate the regression coefficients and do not make good use of the grouping information, and on the other hand, Bayesian approaches, while accurate in parameter estimation, often rely on the stochastic and slow Gibbs sampling procedure to recover the parameters, rendering them infeasible e.g. for gene network reconstruction. Our approach of a Bayesian sparse-group framework with expectation propagation enables us to not only recover accurate parameter estimates in signal recovery problems, but also makes it possible to apply this Bayesian framework to large-scale network reconstruction problems. The presented method is generic but in terms of application we focus on gene regulatory networks. We show on simulated and experimental data that the method constitutes a good choice for network reconstruction regarding the number of correctly selected features, prediction on new data and reasonable computing time.
State representation learning aims at learning compact representations from raw observations in robotics and control applications. Approaches used for this objective are auto-encoders, learning forward models, inverse dynamics or learning using generic priors on the state characteristics. However, the diversity in applications and methods makes the field lack standard evaluation datasets, metrics and tasks. This paper provides a set of environments, data generators, robotic control tasks, metrics and tools to facilitate iterative state representation learning and evaluation in reinforcement learning settings.
Incorporation of a new knowledge into neural networks with simultaneous preservation of the previous one is known to be a nontrivial problem. This problem becomes even more complex when new knowledge is contained not in new training examples, but inside the parameters (connection weights) of another neural network. Here we propose and test two methods allowing combining the knowledge contained in separate networks. One method is based on a simple operation of summation of weights of constituent neural networks. Another method assumes incorporation of a new knowledge by modification of weights nonessential for the preservation of already stored information. We show that with these methods the knowledge from one network can be transferred into another one non-iteratively without requiring training sessions. The fused network operates efficiently, performing classification far better than a chance level. The efficiency of the methods is quantified on several publicly available data sets in classification tasks both for shallow and deep neural networks.
In this paper, we present a hypergraph neural networks (HGNN) framework for data representation learning, which can encode high-order data correlation in a hypergraph structure. Confronting the challenges of learning representation for complex data in real practice, we propose to incorporate such data structure in a hypergraph, which is more flexible on data modeling, especially when dealing with complex data. In this method, a hyperedge convolution operation is designed to handle the data correlation during representation learning. In this way, traditional hypergraph learning procedure can be conducted using hyperedge convolution operations efficiently. HGNN is able to learn the hidden layer representation considering the high-order data structure, which is a general framework considering the complex data correlations. We have conducted experiments on citation network classification and visual object recognition tasks and compared HGNN with graph convolutional networks and other traditional methods. Experimental results demonstrate that the proposed HGNN method outperforms recent state-of-the-art methods. We can also reveal from the results that the proposed HGNN is superior when dealing with multi-modal data compared with existing methods.
The Knowledge graph (KG) uses the triples to describe the facts in the real world. It has been widely used in intelligent analysis and understanding of big data. In constructing a KG, especially in the process of automation building, some noises and errors are inevitably introduced or much knowledges is missed. However, learning tasks based on the KG and its underlying applications both assume that the knowledge in the KG is completely correct and inevitably bring about potential errors. Therefore, in this paper, we establish a unified knowledge graph triple trustworthiness measurement framework to calculate the confidence values for the triples that quantify its semantic correctness and the true degree of the facts expressed. It can be used not only to detect and eliminate errors in the KG but also to identify new triples to improve the KG. The framework is a crisscrossing neural network structure. It synthesizes the internal semantic information in the triples and the global inference information of the KG to achieve the trustworthiness measurement and fusion in the three levels of entity-level, relationship-level, and KG-global-level. We conducted experiments on the common dataset FB15K (from Freebase) and analyzed the validity of the model’s output confidence values. We also tested the framework in the knowledge graph error detection or completion tasks. The experimental results showed that compared with other models, our model achieved significant and consistent improvements on the above tasks, further confirming the capabilities of our model.
Organisations are increasingly open to scrutiny, and need to be able to prove that they operate in a fair and ethical way. Accountability should extend to the production and use of the data and knowledge assets used in AI systems, as it would for any raw material or process used in production of physical goods. This paper considers collective intelligence, comprising data and knowledge generated by crowd-sourced workforces, which can be used as core components of AI systems. A proposal is made for the development of a supply chain model for tracking the creation and use of crowdsourced collective intelligence assets, with a blockchain based decentralised architecture identified as an appropriate means of providing validation, accountability and fairness.
The goal of this paper is to develop a measure for characterizing complex dependence between stationary time series that cannot be captured by traditional measures such as correlation and coherence. Our approach is to use copula models of functionals of the Fourier coefficients which is a generalization of coherence. Here, we use standard parametric copula models with a single parameter both from elliptical and Archimedean families. Our approach is to analyze changes in local field potentials in the rat cortex prior to and immediately following the onset of stroke. We present the necessary theoretical background, the multivariate models and an illustration of our methodology on these local field potential data. Simulations with non-linear dependent data show information that were missed by not taking into account dependence on specific frequencies. Moreover, these simulations demonstrate how our proposed method captures more complex non-linear dependence between time series. Finally, we illustrate our copula-based approach in the analysis of local field potentials of rats.
Anderson acceleration is an old and simple method for accelerating the computation of a fixed point. However, as far as we know and quite surprisingly, it has never been applied to dynamic programming or reinforcement learning. In this paper, we explain briefly what Anderson acceleration is and how it can be applied to value iteration, this being supported by preliminary experiments showing a significant speed up of convergence, that we critically discuss. We also discuss how this idea could be applied more generally to (deep) reinforcement learning.
When trying to maximize the adoption of a behavior in a population connected by a social network, it is common to strategize about where in the network to seed the behavior. Some seeding strategies require explicit knowledge of the network, which can be difficult to collect, while other strategies do not require such knowledge but instead rely on non-trivial stochastic ingredients. For example, one such stochastic seeding strategy is to select random network neighbors of random individuals, thus exploiting a version of the friendship paradox, whereby the friend of a random individual is expected to have more friends than a random individual. Empirical evaluations of these strategies have demanded large field experiments designed specifically for this purpose, but these experiments have yielded relatively imprecise estimates of the relative efficacy of these seeding strategies. Here we show both how stochastic seeding strategies can be evaluated using existing data arising from randomized experiments in networks designed for other purposes and how to design much more efficient experiments for this specific evaluation. In particular, we consider contrasts between two common stochastic seeding strategies and analyze nonparametric estimators adapted from policy evaluation or importance sampling. Using simulations on real networks, we show that the proposed estimators and designs can dramatically increase precision while yielding valid inference. We apply our proposed estimators to a field experiment that randomly assigned households to an intensive marketing intervention and a field experiment that randomly assigned students to an anti-bullying intervention.
Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated thinking of optimization and statistics leads to fruitful research findings.
We study the problem of change point detection and localization in dynamic networks. We assume that we observe a sequence of independent adjacency matrices of given size, each corresponding to one realization from an unknown inhomogeneous Bernoulli model. The underlying distribution of the adjacency matrices may change over a subset of the time points, called change points. Our task is to recover with high accuracy the unknown number and positions of the change points. Our generic model setting allows for all the model parameters to change with the total number of time points, including the network size, the minimal spacing between consecutive change points, the magnitude of the smallest change and the degree of sparsity of the networks. We first identify an impossible region in the space of the model parameters such that no change point estimator is provably consistent if the data are generated according to parameters falling in that region. We propose a computationally simple novel algorithm for network change point localization, called Network Binary Segmentation, which relies on weighted averages of the adjacency matrices. We show that Network Binary Segmentation is consistent over a range of the model parameters that nearly cover the complement of the impossibility region, thus demonstrating the existence of a phase transition for the problem at hand. Next, we devise a more sophisticated algorithm based on singular value thresholding, called Local Refinement, that delivers more accurate estimates of the change point locations. We show that, under appropriate conditions, Local Refinement guarantees a minimax optimal rate for network change point localization while remaining computationally feasible.
Changes in the structure of observed social and complex networks’ structure can indicate a significant underlying change in an organization, or reflect the response of the network to an external event. Automatic detection of change points in evolving networks is rudimentary to the research and the understanding of the effect of such events on networks. Here we present an easy-to-implement and fast framework for change point detection in temporal evolving networks. Unlike previous approaches, our method is size agnostic, and does not require either prior knowledge about the network’s size and structure, nor does it require obtaining historical information or nodal identities over time. We use both synthetic data derived from dynamic models and two real datasets: Enron email exchange and Ask-Ubuntu forum. Our framework succeeds with both precision and recall and outperforms previous solutions