Social media communications are becoming increasingly prevalent; some useful, some false, whether unwittingly or maliciously. An increasing number of rumours daily flood the social networks. Determining their veracity in an autonomous way is a very active and challenging field of research, with a variety of methods proposed. However, most of the models rely on determining the constituent messages’ stance towards the rumour, a feature known as the ‘wisdom of the crowd’. Although several supervised machine-learning approaches have been proposed to tackle the message stance classification problem, these have numerous shortcomings. In this paper we argue that semi-supervised learning is more effective than supervised models and use two graph-based methods to demonstrate it. This is not only in terms of classification accuracy, but equally important, in terms of speed and scalability. We use the Label Propagation and Label Spreading algorithms and run experiments on a dataset of 72 rumours and hundreds of thousands messages collected from Twitter. We compare our results on two available datasets to the state-of-the-art to demonstrate our algorithms’ performance regarding accuracy, speed and scalability for real-time applications.
Building software-driven systems that are easily understood becomes a challenge, with their ever-increasing complexity and autonomy. Accordingly, recent research efforts strive to aid in designing explainable systems. Nevertheless, a common notion of what it takes for a system to be explainable is still missing. To address this problem, we propose a characterization of explainable systems that consolidates existing research. By providing a unified terminology, we lay a basis for the classification of both existing and future research, and the formulation of precise requirements towards such systems.
Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. Due to it’s complexity, i For high-stakes domains such as medical, providing intuitive explanations that can be consumed by domain experts without ML expertise becomes crucial. To this demand, concept-based methods (e.g., TCAV) were introduced to provide explanations using user-chosen high-level concepts rather than individual input features. While these methods successfully leverage rich representations learned by the networks to reveal how human-defined concepts are related to the prediction, they require users to select concepts of their choice and collect labeled examples of those concepts. In this work, we introduce DTCAV (Discovery TCAV) a global concept-based interpretability method that can automatically discover concepts as image segments, along with each concept’s estimated importance for a deep neural network’s predictions. We validate that discovered concepts are as coherent to humans as hand-labeled concepts. We also show that the discovered concepts carry significant signal for prediction by analyzing a network’s performance with stitched/added/deleted concepts. DTCAV results revealed a number of undesirable correlations (e.g., a basketball player’s jersey was a more important concept for predicting the basketball class than the ball itself) and show the potential shallow reasoning of these networks.
We present a deep long short-term memory (LSTM)-based neural network for predicting asset prices, together with a successful trading strategy for generating profits based on the model’s predictions. Our work is motivated by the fact that the effectiveness of any prediction model is inherently coupled to the trading strategy it is used with, and vise versa. This highlights the difficulty in developing models and strategies which are jointly optimal, but also points to avenues of investigation which are broader than prevailing approaches. Our LSTM model is structurally simple and generates predictions based on price observations over a modest number of past trading days. The model’s architecture is tuned to promote profitability, as opposed to accuracy, under a strategy that does not trade simply based on whether the price is predicted to rise or fall, but rather takes advantage of the distribution of predicted returns, and the fact that a prediction’s position within that distribution carries useful information about the expected profitability of a trade. The proposed model and trading strategy were tested on the S&P 500, Dow Jones Industrial Average (DJIA), NASDAQ and Russel 2000 stock indices, and achieved cumulative returns of 329%, 241%, 468% and 279%, respectively, over 2010-2018, far outperforming the benchmark buy-and-hold strategy as well as other recent efforts.
Deployment of machine learning (ML) algorithms in production for extended periods of time has uncovered new challenges such as monitoring and management of real-time prediction quality of a model in the absence of labels. However, such tracking is imperative to prevent catastrophic business outcomes resulting from incorrect predictions. The scale of these deployments makes manual monitoring prohibitive, making automated techniques to track and raise alerts imperative. We present a framework, ML Health, for tracking potential drops in the predictive performance of ML models in the absence of labels. The framework employs diagnostic methods to generate alerts for further investigation. We develop one such method to monitor potential problems when production data patterns do not match training data distributions. We demonstrate that our method performs better than standard ‘distance metrics’, such as RMSE, KL-Divergence, and Wasserstein at detecting issues with mismatched data sets. Finally, we present a working system that incorporates the ML Health approach to monitor and manage ML deployments within a realistic full production ML lifecycle.
This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input belong to two different modalities, e.g., the output is an image and the input is a sketch. We solve this problem by learning two models that bear similarities to those in reinforcement learning and optimal control. One model is policy-like. It generates the output directly by a non-linear transformation of the input and a noise vector. This amounts to fast thinking because the conditional generation is accomplished by direct sampling. The other model is planner-like. It learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. This amounts to slow thinking because the sampling process is accomplished by an iterative algorithm such as Langevin dynamics. We propose to learn the two models jointly, where the fast thinking policy-like model serves to initialize the sampling of the slow thinking planner-like model, and the planner-like model refines the initial output by an iterative algorithm. The planner-like model learns from the difference between the refined output and the observed output, while the policy-like model learns from how the planner-like model refines its initial output. We demonstrate the effectiveness of the proposed method on various image generation tasks.
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
An ideal outcome of pattern mining is a small set of informative patterns, containing no redundancy or noise, that identifies the key structure of the data at hand. Standard frequent pattern miners do not achieve this goal, as due to the pattern explosion typically very large numbers of highly redundant patterns are returned. We pursue the ideal for sequential data, by employing a pattern set mining approach-an approach where, instead of ranking patterns individually, we consider results as a whole. Pattern set mining has been successfully applied to transactional data, but has been surprisingly under studied for sequential data. In this paper, we employ the MDL principle to identify the set of sequential patterns that summarises the data best. In particular, we formalise how to encode sequential data using sets of serial episodes, and use the encoded length as a quality score. As search strategy, we propose two approaches: the first algorithm selects a good pattern set from a large candidate set, while the second is a parameter-free any-time algorithm that mines pattern sets directly from the data. Experimentation on synthetic and real data demonstrates we efficiently discover small sets of informative patterns.
Can multilayer neural networks — typically constructed as highly complex structures with many nonlinearly activated neurons across layers — behave in a non-trivial way that yet simplifies away a major part of their complexities? In this work, we uncover a phenomenon in which the behavior of these complex networks — under suitable scalings and stochastic gradient descent dynamics — becomes independent of the number of neurons as this number grows sufficiently large. We develop a formalism in which this many-neurons limiting behavior is captured by a set of equations, thereby exposing a previously unknown operating regime of these networks. While the current pursuit is mathematically non-rigorous, it is complemented with several experiments that validate the existence of this behavior.
This paper motivates and develops source traces for temporal difference (TD) learning in the tabular setting. Source traces are like eligibility traces, but model potential histories rather than immediate ones. This allows TD errors to be propagated to potential causal states and leads to faster generalization. Source traces can be thought of as the model-based, backward view of successor representations (SR), and share many of the same benefits. This view, however, suggests several new ideas. First, a TD($\lambda$)-like source learning algorithm is proposed and its convergence is proven. Then, a novel algorithm for learning the source map (or SR matrix) is developed and shown to outperform the previous algorithm. Finally, various approaches to using the source/SR model are explored, and it is shown that source traces can be effectively combined with other model-based methods like Dyna and experience replay.
Items in many datasets can be arranged to a natural order. Such orders are useful since they can provide new knowledge about the data and may ease further data exploration and visualization. Our goal in this paper is to define a statistically well-founded and an objective score measuring the quality of an order. Such a measure can be used for determining whether the current order has any valuable information or can it be discarded. Intuitively, we say that the order is good if dependent attributes are close to each other. To define the order score we fit an order-sensitive model to the dataset. Our model resembles a Markov chain model, that is, the attributes depend only on the immediate neighbors. The score of the order is the BIC score of the best model. For computing the measure we introduce a fast dynamic program. The score is then compared against random orders: if it is better than the scores of the random orders, we say that the order is good. We also show the asymptotic connection between the score function and the number of free parameters of the model. In addition, we introduce a simple greedy approach for finding an order with a good score. We evaluate the score for synthetic and real datasets using different spectral orders and the orders obtained with the greedy method.
We propose an algorithm for incremental learning of classifiers. The proposed method enables an ensemble of classifiers to learn incrementally by accommodating new training data. We use an effective mechanism to overcome the stability-plasticity dilemma. In incremental learning, the general convention is to use only the knowledge acquired in the previous phase but not the previously seen data. We follow this convention by retaining the previously acquired knowledge which is relevant and using it along with the current data. The performance of each classifier is monitored to eliminate the poorly performing classifiers in the subsequent phases. Experimental results show that the proposed approach outperforms the existing incremental learning approaches.
Exploratory data analysis is a fundamental aspect of knowledge discovery that aims to find the main characteristics of a dataset. Dimensionality reduction, such as manifold learning, is often used to reduce the number of features in a dataset to a manageable level for human interpretation. Despite this, most manifold learning techniques do not explain anything about the original features nor the true characteristics of a dataset. In this paper, we propose a genetic programming approach to manifold learning called GP-MaL which evolves functional mappings from a high-dimensional space to a lower dimensional space through the use of interpretable trees. We show that GP-MaL is competitive with existing manifold learning algorithms, while producing models that can be interpreted and re-used on unseen data. A number of promising future directions of research are found in the process.
While physics conveys knowledge of nature built from an interplay between observations and theory, it has been considered less importantly in deep neural networks. Especially, there are few works leveraging physics behaviors when the knowledge is given less explicitly. In this work, we propose a novel architecture called Differentiable Physics-informed Graph Networks (DPGN) to incorporate implicit physics knowledge which is given from domain experts by informing it in latent space. Using the concept of DPGN, we demonstrate that climate prediction tasks are significantly improved. Besides the experiment results, we validate the effectiveness of the proposed module and provide further applications of DPGN, such as inductive learning and multistep predictions.
Tensor factorization has become an increasingly popular approach to knowledge graph completion(KGC), which is the task of automatically predicting missing facts in a knowledge graph. However, even with a simple model like CANDECOMP/PARAFAC(CP) tensor decomposition, KGC on existing knowledge graphs is impractical in resource-limited environments, as a large amount of memory is required to store parameters represented as 32-bit or 64-bit floating point numbers. This limitation is expected to become more stringent as existing knowledge graphs, which are already huge, keep steadily growing in scale. To reduce the memory requirement, we present a method for binarizing the parameters of the CP tensor decomposition by introducing a quantization function to the optimization problem. This method replaces floating point-valued parameters with binary ones after training, which drastically reduces the model size at run time. We investigate the trade-off between the quality and size of tensor factorization models for several KGC benchmark datasets. In our experiments, the proposed method successfully reduced the model size by more than an order of magnitude while maintaining the task performance. Moreover, a fast score computation technique can be developed with bitwise operations.
Consequential decisions are increasingly informed by sophisticated data-driven predictive models. For accurate predictive models, deterministic threshold rules have been shown to be optimal in terms of utility, even under a variety of fairness constraints. However, consistently learning accurate models requires access to ground truth data. Unfortunately, in practice, some data can only be observed if a certain decision was taken. Thus, collected data always depends on potentially imperfect historical decision policies. As a result, learned deterministic threshold rules are often suboptimal. We address the above question from the perspective of sequential policy learning. We first show that, if decisions are taken by a faulty deterministic policy, the observed outcomes under this policy are insufficient to improve it. We then describe how this undesirable behavior can be avoided using stochastic policies. Finally, we introduce a practical gradient-based algorithm to learn stochastic policies that effectively leverage the outcomes of decisions to improve over time. Experiments on both synthetic and real-world data illustrate our theoretical results and show the efficacy of our proposed algorithm.
In this paper we consider portmanteau tests for testing the adequacy of multiplicative seasonal autoregressive moving-average (SARMA) models under the assumption that the errors are uncorrelated but not necessarily independent.We relax the standard independence assumption on the error term in order to extend the range of application of the SARMA models.We study the asymptotic distributions of residual and normalized residual empirical autocovariances and autocorrelations underweak assumptions on the noise. We establish the asymptotic behaviour of the proposed statistics. A set of Monte Carlo experiments and an application to monthly mean total sunspot number are presented.
We review neural network architectures which were motivated by Fourier series and integrals and which are referred to as Fourier neural networks. These networks are empirically evaluated in synthetic and real-world tasks. Neither of them outperforms the standard neural network with sigmoid activation function in the real-world tasks. All neural networks, both Fourier and the standard one, empirically demonstrate lower approximation error than the truncated Fourier series when it comes to an approximation of a known function of multiple variables.
The increase in computational power and available data has fueled a wide deployment of deep learning in production environments. Despite their successes, deep architectures are still poorly understood and costly to train. We demonstrate in this paper how a simple recipe enables a market player to harm or delay the development of a competing product. Such a threat model is novel and has not been considered so far. We derive the corresponding attacks and show their efficacy both formally and empirically. These attacks only require access to the initial, untrained weights of a network. No knowledge of the problem domain and the data used by the victim is needed. On the initial weights, a mere permutation is sufficient to limit the achieved accuracy to for example 50% on the MNIST dataset or double the needed training time. While we can show straightforward ways to mitigate the attacks, the respective steps are not part of the standard procedure taken by developers so far.
We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment’s choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of $\Omega(d\sqrt{T})$.
Automatic prediction of emotion promises to revolutionise human-computer interaction. Recent trends involve fusion of multiple modalities – audio, visual, and physiological – to classify emotional state. However, practical considerations ‘in the wild’ limit collection of this physiological data to commoditised heartbeat sensors. Furthermore, real-world applications often require some measure of uncertainty over model output. We present here an end-to-end deep learning model for classifying emotional valence from unimodal heartbeat data. We further propose a Bayesian framework for modelling uncertainty over valence predictions, and describe a procedure for tuning output according to varying demands on confidence. We benchmarked our framework against two established datasets within the field and achieved peak classification accuracy of 90%. These results lay the foundation for applications of affective computing in real-world domains such as healthcare, where a high premium is placed on non-invasive collection of data, and predictive certainty.
Partial label learning deals with the problem where each training instance is assigned a set of candidate labels, only one of which is correct. This paper provides the first attempt to leverage the idea of self-training for dealing with partially labeled examples. Specifically, we propose a unified formulation with proper constraints to train the desired model and perform pseudo-labeling jointly. For pseudo-labeling, unlike traditional self-training that manually differentiates the ground-truth label with enough high confidence, we introduce the maximum infinity norm regularization on the modeling outputs to automatically achieve this consideratum, which results in a convex-concave optimization problem. We show that optimizing this convex-concave problem is equivalent to solving a set of quadratic programming (QP) problems. By proposing an upper-bound surrogate objective function, we turn to solving only one QP problem for improving the optimization efficiency. Extensive experiments on synthesized and real-world datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art partial label learning approaches.
We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.
It is well-known that exploiting label correlations is crucially important to multi-label learning. Most of the existing approaches take label correlations as prior knowledge, which may not correctly characterize the real relationships among labels. Besides, label correlations are normally used to regularize the hypothesis space, while the final predictions are not explicitly correlated. In this paper, we suggest that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Based on this assumption, we first propose a novel method to learn the label correlations via sparse reconstruction in the label space. Then, by seamlessly integrating the learned label correlations into model training, we propose a novel multi-label learning approach that aims to explicitly account for the correlated predictions of labels while training the desired model simultaneously. Extensive experimental results show that our approach outperforms the state-of-the-art counterparts.
We present a family of novel methods for embedding knowledge graphs into real-valued tensors. These tensor-based embeddings capture the ordered relations that are typical in the knowledge graphs represented by semantic web languages like RDF. Unlike many previous models, our methods can easily use prior background knowledge provided by users or extracted automatically from existing knowledge graphs. In addition to providing more robust methods for knowledge graph embedding, we provide a provably-convergent, linear tensor factorization algorithm. We demonstrate the efficacy of our models for the task of predicting new facts across eight different knowledge graphs, achieving between 5% and 50% relative improvement over existing state-of-the-art knowledge graph embedding techniques. Our empirical evaluation shows that all of the tensor decomposition models perform well when the average degree of an entity in a graph is high, with constraint-based models doing better on graphs with a small number of highly similar relations and regularization-based models dominating for graphs with relations of varying degrees of similarity.
In this paper, we investigate the use of global information to speed up the learning process and increase the cumulative rewards of multi-agent reinforcement learning (MARL) tasks. Within the actor-critic MARL, we introduce multiple cooperative critics from two levels of the hierarchy and propose a hierarchical critic-based multi-agent reinforcement learning algorithm. In our approach, the agent is allowed to receive information from local and global critics in a competition task. The agent not only receives low-level details but also consider coordination from high levels that receiving global information to increase operation skills. Here, we define multiple cooperative critics in the top-bottom hierarchy, called the Hierarchical Critics Assignment (HCA) framework. Our experiment, a two-player tennis competition task in the Unity environment, tested HCA multi-agent framework based on Asynchronous Advantage Actor-Critic (A3C) with Proximal Policy Optimization (PPO) algorithm. The results showed that the HCA- framework outperforms the non-hierarchical critics baseline method for MARL tasks.
Reinforcement learning (RL) problems often feature deceptive local optima, and learning methods that optimize purely for reward signal often fail to learn strategies for overcoming them. Deep neuroevolution and novelty search have been proposed as effective alternatives to gradient-based methods for learning RL policies directly from pixels. In this paper, we introduce and evaluate the use of novelty search over agent action sequences by string edit metric distance as a means for promoting innovation. We also introduce a method for stagnation detection and population resampling inspired by recent developments in the RL community that uses the same mechanisms as novelty search to promote and develop innovative policies. Our methods extend a state-of-the-art method for deep neuroevolution using a simple-yet-effective genetic algorithm (GA) designed to efficiently learn deep RL policy network weights. Experiments using four games from the Atari 2600 benchmark were conducted. Results provide further evidence that GAs are competitive with gradient-based algorithms for deep RL. Results also demonstrate that novelty search over action sequences is an effective source of selection pressure that can be integrated into existing evolutionary algorithms for deep RL.
In this paper, we introduce BINet, a neural network architecture for real-time multi-perspective anomaly detection in business process event logs. BINet is designed to handle both the control flow and the data perspective of a business process. Additionally, we propose a set of heuristics for setting the threshold of an anomaly detection algorithm automatically. We demonstrate that BINet can be used to detect anomalies in event logs not only on a case level but also on event attribute level. Finally, we demonstrate that a simple set of rules can be used to utilize the output of BINet for anomaly classification. We compare BINet to eight other state-of-the-art anomaly detection algorithms and evaluate their performance on an elaborate data corpus of 29 synthetic and 15 real-life event logs. BINet outperforms all other methods both on the synthetic as well as on the real-life datasets.
We consider the problem of streaming principal component analysis (PCA) when the observations are noisy and generated in a non-stationary environment. Given $T$, $p$-dimensional noisy observations sampled from a non-stationary variant of the spiked covariance model, our goal is to construct the best linear $k$-dimensional subspace of the terminal observations. We study the effect of non-stationarity by establishing a lower bound on the number of samples and the corresponding recovery error obtained by any algorithm. We establish the convergence behaviour of the noisy power method using a novel proof technique which maybe of independent interest. We conclude that the recovery guarantee of the noisy power method matches the fundamental limit, thereby generalizing existing results on streaming PCA to a non-stationary setting.
We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants, including extensions to deep structured prediction. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.