Naive probability

We describe a rational, but low resolution model of probability.

Decision-making with reference information

We often try to predict others’ actions by obtaining supporting information that shows a preference index of surrounding people. In order to reproduce these situations, we propose a game named ‘One-sided Preference Game with Reference Information (OSPG-R).’ We conducted experiments in which players who have similar preferences compete for objects in OSPG-R. In the experiment, we used three different types of objects: boxes, faces, and cars. Our results show that the most frequently selected object was not the most popular one. In order to gain deeper insights into the experiment’s results, we constructed a decision-making model based on two assumptions: (1) players are rational and (2) are convinced that the other players’ preference orders are equivalent to the preference index for the group. Compared to the choice behavior of the model, the experiment’s results show that there was a tendency to take risks when the objects were faces, or the priority of that particular player was low.

Soft Options Critic

The option-critic paper and several variants have successfully demonstrated the use of the options framework proposed by Barto et al to scale learning and planning in hierarchical tasks. Although most of these frameworks use entropy as a regularizer to improve exploration, they do not maximize entropy along with returns at every time step. In this paper we investigate the effect of maximizing entropy of each options and inter-option policy in options framework. We adopt the architecture of the recently introduced soft-actor critic algorithm to enable learning of robust options in continuous and discrete action spaces in a off-policy manner thus also making it sample efficient. In this paper we derive the soft options improvement theorem and propose a novel soft-options framework to incorporate maximization of entropy of actions and options in a constrained manner. Our experiments show that maximizing entropy of actions and options in a constrained manner with high learning rate does not harm the main objective of maximizing returns and hence outperforms vanilla options-critic framework in most hierarchical tasks. We also observe faster recovery when the environment is subject to perturbations.

Induction of Non-Monotonic Rules From Statistical Learning Models Using High-Utility Itemset Mining

We present a fast and scalable algorithm to induce non-monotonic logic programs from statistical learning models. We reduce the problem of search for best clauses to instances of the High-Utility Itemset Mining (HUIM) problem. In the HUIM problem, feature values and their importance are treated as transactions and utilities respectively. We make use of TreeExplainer, a fast and scalable implementation of the Explainable AI tool SHAP, to extract locally important features and their weights from ensemble tree models. Our experiments with UCI standard benchmarks suggest a significant improvement in terms of classification evaluation metrics and running time of the training algorithm compared to ALEPH, a state-of-the-art Inductive Logic Programming (ILP) system.

What if Process Predictions are notfollowed by Good Recommendations?

Process-aware Recommender systems (PAR systems) are informationsystems that aim to monitor process executions, predict their outcome, and rec-ommend effective interventions to reduce the risk of failure. This paper discussesmonitoring, predicting, and recommending using a PAR system within a finan-cial institute in the Netherlands to avoid faulty executions. While predictionswere based on the analysis of historical data, the most opportune interventionwas selected on the basis of human judgment and subjective opinions. The re-sults showed that, while the predictions of risky cases were relatively accurate,no reduction was observed in the number of faulty executions. We believe thatthis was caused by incorrect choices of interventions. While a large body of re-search exists on monitoring and predicting based on facts recorded in historicaldata, research on fact-based interventions is relatively limited. This paper reportson lessons learned from the case study in finance and proposes a new methodol-ogy to improve the performances of PAR systems. This methodology advocatesthe importance of several cycles of interactions among all actors involved so asto develop interventions that incorporate their feedback and are based on insightsfrom factual, historical data.

Perturbed Model Validation: A New Framework to Validate Model Relevance

This paper introduces PMV (Perturbed Model Validation), a new technique to validate model relevance and detect overfitting or underfitting. PMV operates by injecting noise to the training data, re-training the model against the perturbed data, then using the training accuracy decrease rate to assess model relevance. A larger decrease rate indicates better concept-hypothesis fit. We realise PMV by using label flipping to inject noise, and evaluate it on four real-world datasets (breast cancer, adult, connect-4, and MNIST) and three synthetic datasets in the binary classification setting. The results reveal that PMV selects models more precisely and in a more stable way than cross-validation, and effectively detects both overfitting and underfitting.

A score function for Bayesian cluster analysis

We propose a score function for Bayesian clustering. The function is parameter free and captures the interplay between the within cluster variance and the between cluster entropy of a clustering. It can be used to choose the number of clusters in well-established clustering methods such as hierarchical clustering or K-means algorithm.

Partially Encrypted Machine Learning using Functional Encryption

Machine learning on encrypted data has received a lot of attention thanks to recent breakthroughs in homomorphic encryption and secure multi-party computation. It allows outsourcing computation to untrusted servers without sacrificing privacy of sensitive data. We propose a practical framework to perform partially encrypted and privacy-preserving predictions which combines adversarial training and functional encryption. We first present a new functional encryption scheme to efficiently compute quadratic functions so that the data owner controls what can be computed but is not involved in the calculation: it provides a decryption key which allows one to learn a specific function evaluation of some encrypted data. We then show how to use it in machine learning to partially encrypt neural networks with quadratic activation functions at evaluation time, and we provide a thorough analysis of the information leaks based on indistinguishability of data items of the same label. Last, since most encryption schemes cannot deal with the last thresholding operation used for classification, we propose a training method to prevent selected sensitive features from leaking, which adversarially optimizes the network against an adversary trying to identify these features. This is interesting for several existing works using partially encrypted machine learning as it comes with little reduction on the model’s accuracy and significantly improves data privacy.

From Search Engines to Search Services: An End-User Driven Approach

The World Wide Web is a vast and continuously changing source of information where searching is a frequent, and sometimes critical, user task. Searching is not always the user’s primary goal but an ancillary task that is performed to find complementary information allowing to complete another task. In this paper, we explore primary and/or ancillary search tasks and propose an approach for simplifying the user interaction during search tasks. Rather than fo-cusing on dedicated search engines, our approach allows the user to abstract search engines already provided by Web applications into pervasive search services that will be available for performing searches from any other Web site. We also propose to allow users to manage the way in which searching results are displayed and the interaction with them. In order to illustrate the feasibility of this approach, we have built a support tool based on a plug-in architecture that allows users to integrate new search services (created by themselves by means of visual tools) and execute them in the context of both kinds of searches. A case study illustrates the use of these tools. We also present the results of two evaluations that demonstrate the feasibility of the approach and the benefits in its use.

Loss Surface Modality of Feed-Forward Neural Network Architectures

It has been argued in the past that high-dimensional neural networks do not exhibit local minima capable of trapping an optimisation algorithm. However, the relationship between loss surface modality and the neural architecture parameters, such as the number of hidden neurons per layer and the number of hidden layers, remains poorly understood. This study employs fitness landscape analysis to study the modality of neural network loss surfaces under various feed-forward architecture settings. An increase in the problem dimensionality is shown to yield a more searchable and more exploitable loss surface. An increase in the hidden layer width is shown to effectively reduce the number of local minima, and simplify the shape of the global attractor. An increase in the architecture depth is shown to sharpen the global attractor, thus making it more exploitable.

MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching

Text matching is the core problem in many natural language processing (NLP) tasks, such as information retrieval, question answering, and conversation. Recently, deep leaning technology has been widely adopted for text matching, making neural text matching a new and active research domain. With a large number of neural matching models emerging rapidly, it becomes more and more difficult for researchers, especially those newcomers, to learn and understand these new models. Moreover, it is usually difficult to try these models due to the tedious data pre-processing, complicated parameter configuration, and massive optimization tricks, not to mention the unavailability of public codes sometimes. Finally, for researchers who want to develop new models, it is also not an easy task to implement a neural text matching model from scratch, and to compare with a bunch of existing models. In this paper, therefore, we present a novel system, namely MatchZoo, to facilitate the learning, practicing and designing of neural text matching models. The system consists of a powerful matching library and a user-friendly and interactive studio, which can help researchers: 1) to learn state-of-the-art neural text matching models systematically, 2) to train, test and apply these models with simple configurable steps; and 3) to develop their own models with rich APIs and assistance.

Learning to learn by Self-Critique

In few-shot learning, a machine learning system learns from a small set of labelled examples relating to a specific task, such that it can generalize to new examples of the same task. Given the limited availability of labelled examples in such tasks ,we wish to make use of all the information we can. Usually a model learns task-specific information from a small training-set (support-set) to predict on an unlabelled validation set (target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods. Making use of the target-set examples via transductive learning requires approaches beyond the current methods; at inference time, the target-set contains only unlabelled input data-points, and so discriminative learning cannot be used. In this paper, we propose a framework called Self-Critique and Adaptor SCA, which learns to learn a label-free loss function, parameterized as a neural network. A base-model learns on a support-set using existing methods (e.g. stochastic gradient descent combined with the cross-entropy loss), and then is updated for the incoming target-task using the learnt loss function. This label-free loss function is itself optimized such that the learnt model achieves higher generalization performance. Experiments demonstrate that SCA offers substantially reduced error-rates compared to baselines which only adapt on the support-set, and results in state of the art benchmark performance on Mini-ImageNet and Caltech-UCSD Birds 200.

An Explicitly Relational Neural Network Architecture

With a view to bridging the gap between deep learning and symbolic AI, we present a novel end-to-end neural network architecture that learns to form propositional representations with an explicitly relational structure from raw pixel data. In order to evaluate and analyse the architecture, we introduce a family of simple visual relational reasoning tasks of varying complexity. We show that the proposed architecture, when pre-trained on a curriculum of such tasks, learns to generate reusable representations that better facilitate subsequent learning on previously unseen tasks when compared to a number of baseline architectures. The workings of a successfully trained model are visualised to shed some light on how the architecture functions.

What Can ResNet Learn Efficiently, Going Beyond Kernels?

How can neural networks such as ResNet \textit{efficiently} learn CIFAR-10 with test accuracy more than 96%, while other methods, especially kernel methods, fall far behind? Can we more provide theoretical justifications for this gap? There is an influential line of work relating neural networks to kernels in the over-parameterized regime, proving that they can learn certain concept class that is also learnable by kernels, with similar test error. Yet, can we show neural networks provably learn some concept class \textit{better} than kernels? We answer this positively in the PAC learning language. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove there are simple functions in this class that the test error obtained by neural networks can be \textit{much smaller} than \textit{any} ‘generic’ kernel method, including neural tangent kernels, conjugate kernels, etc. The main intuition is that \textit{multi-layer} neural networks can implicitly perform hierarchal learning using different layers, which reduces the sample complexity comparing to ‘one-shot’ learning algorithms such as kernel methods.

Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar

Automatic machine learning is an important problem in the forefront of machine learning. The strongest AutoML systems are based on neural networks, evolutionary algorithms, and Bayesian optimization. Recently AlphaD3M reached state-of-the-art results with an order of magnitude speedup using reinforcement learning with self-play. In this work we extend AlphaD3M by using a pipeline grammar and a pre-trained model which generalizes from many different datasets and similar tasks. Our results demonstrate improved performance compared with our earlier work and existing methods on AutoML benchmark datasets for classification and regression tasks. In the spirit of reproducible research we make our data, models, and code publicly available.

Reinforcement Leaning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However existing methods either have no theoretical guarantee or suffer a regret that is exponential in the planning horizon H. In this paper, we propose an online RL algorithm, namely the MatrixRL, that leverages ideas from linear bandit to learn a low-dimensional representation of the probability transition model while carefully balancing the exploitation-exploration tradeoff. We show that MatrixRL achieves a regret bound {O}\big(H^2d\log T\sqrt{T}\big) where d is the number of features. MatrixRL has an equivalent kernelized version, which is able to work with an arbitrary kernel Hilbert space without using explicit features. In this case, the kernelized MatrixRL satisfies a regret bound {O}\big(H^2\widetilde{d}\log T\sqrt{T}\big), where \widetilde{d} is the effective dimension of the kernel space. To our best knowledge, for RL using features or kernels, our results are the first regret bounds that are near-optimal in time T and dimension d (or \widetilde{d}) and polynomial in the planning horizon H.

Neural Jump Stochastic Differential Equations

Many time series can be effectively modeled with a combination of continuous flows along with random jumps sparked by discrete events. However, we usually do not have the equation of motion describing the flows, or how they are affected by jumps. To this end, we introduce Neural Jump Stochastic Differential Equations that provide a data-driven approach to learn continuous and discrete dynamic behavior, i.e., hybrid systems that both flow and jump. Our approach extends the framework of Neural Ordinary Differential Equations with a stochastic process term that models discrete events. We then model temporal point processes with a piecewise-continuous latent trajectory, where stochastic events cause an abrupt change in the latent variables. We demonstrate the predictive capabilities of our model on a range of synthetic and real-world marked point process datasets, including classical point processes such as Hawkes processes, medical records, awards on Stack Overflow, and earthquake monitoring.

InfoRL: Interpretable Reinforcement Learning using Information Maximization

Recent advances in reinforcement learning have proved that given an environment we can learn to perform a task in that environment if we have access to some form of a reward function (dense, sparse or derived from IRL). But most of the algorithms focus on learning a single best policy to perform a given set of tasks. In this paper, we focus on an algorithm that learns to not just perform a task but different ways to perform the same task. As we know when the environment is complex enough there always exists multiple ways to perform a task. We show that using the concept of information maximization it is possible to learn latent codes for discovering multiple ways to perform any given task in an environment.

Greedy Shallow Networks: A New Approach for Constructing and Training Neural Networks

We present a novel greedy approach to obtain a single layer neural network approximation to a target function with the use of a ReLU activation function. In our approach we construct a shallow network by utilizing a greedy algorithm where the set of possible inner weights acts as a parametrization of the prescribed dictionary. To facilitate the greedy selection we employ an integral representation of the network, based on the ridgelet transform, that significantly reduces the cardinality of the dictionary and hence promotes feasibility of the proposed method. Our approach allows for the construction of efficient architectures which can be treated either as improved initializations to be used in place of random-based alternatives, or as fully-trained networks, thus potentially nullifying the need for training and/or calibrating based on backpropagation. Numerical experiments demonstrate the tenability of the proposed concept and its advantages compared to the classical techniques for training and constructing neural networks.

Using Deep Networks and Transfer Learning to Address Disinformation

We apply an ensemble pipeline composed of a character-level convolutional neural network (CNN) and a long short-term memory (LSTM) as a general tool for addressing a range of disinformation problems. We also demonstrate the ability to use this architecture to transfer knowledge from labeled data in one domain to related (supervised and unsupervised) tasks. Character-level neural networks and transfer learning are particularly valuable tools in the disinformation space because of the messy nature of social media, lack of labeled data, and the multi-channel tactics of influence campaigns. We demonstrate their effectiveness in several tasks relevant for detecting disinformation: spam emails, review bombing, political sentiment, and conversation clustering.

Modeling Dynamic Functional Connectivity with Latent Factor Gaussian Processes

Dynamic functional connectivity, as measured by the time-varying covariance of neurological signals, is believed to play an important role in many aspects of cognition. While many methods have been proposed, reliably establishing the presence and characteristics of brain connectivity is challenging due to the high dimensionality and noisiness of neuroimaging data. We present a latent factor Gaussian process model which addresses these challenges by learning a parsimonious representation of connectivity dynamics. The proposed model naturally allows for inference and visualization of time-varying connectivity. As an illustration of the scientific utility of the model, application to a data set of rat local field potential activity recorded during a complex non-spatial memory task provides evidence of stimuli differentiation.

Learning to Identify High Betweenness Centrality Nodes from Scratch: A Novel Graph Neural Network Approach

Betweenness centrality (BC) is one of the most used centrality measures for network analysis, which seeks to describe the importance of nodes in a network in terms of the fraction of shortest paths that pass through them. It is key to many valuable applications, including community detection and network dismantling. Computing BC scores on large networks is computationally challenging due to high time complexity. Many approximation algorithms have been proposed to speed up the estimation of BC, which are mainly sampling-based. However, these methods are still prone to considerable execution time on large-scale networks, and their results are often exacerbated when small changes happen to the network structures. In this paper, we focus on identifying nodes with high BC in a graph, since many application scenarios are built upon retrieving nodes with top-k BC. Different from previous heuristic methods, we turn this task into a learning problem and design an encoder-decoder based framework to resolve the problem. More specifcally, the encoder leverages the network structure to encode each node into an embedding vector, which captures the important structural information of the node. The decoder transforms the embedding vector for each node into a scalar, which captures the relative rank of this node in terms of BC. We use the pairwise ranking loss to train the model to identify the orders of nodes regarding their BC. By training on small-scale networks, the learned model is capable of assigning relative BC scores to nodes for any unseen networks, and thus identifying the highly-ranked nodes. Comprehensive experiments on both synthetic and real-world networks demonstrate that, compared to representative baselines, our model drastically speeds up the prediction without noticeable sacrifce in accuracy, and outperforms the state-of-the-art by accuracy on several large real-world networks.

DIVA: Domain Invariant Variational Autoencoders

We consider the problem of domain generalization, namely, how to learn representations given data from a set of domains that generalize to data from a previously unseen domain. We propose the Domain Invariant Variational Autoencoder (DIVA), a generative model that tackles this problem by learning three independent latent subspaces, one for the domain, one for the class, and one for any residual variations. We highlight that due to the generative nature of our model we can also incorporate unlabeled data from known or previously unseen domains. To the best of our knowledge this has not been done before in a domain generalization setting. This property is highly desirable in fields like medical imaging where labeled data is scarce. We experimentally evaluate our model on the rotated MNIST benchmark and a malaria cell images dataset where we show that (i) the learned subspaces are indeed complementary to each other, (ii) we improve upon recent works on this task and (iii) incorporating unlabelled data can boost the performance even further.

N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

We focus on solving the univariate times series point forecasting problem using deep learning. We propose a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers. The architecture has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train. We test the proposed architecture on the well-known M4 competition dataset containing 100k time series from diverse domains. We demonstrate state-of-the-art performance for two configurations of N-BEATS, improving forecast accuracy by 11% over a statistical benchmark and by 3% over last year’s winner of the M4 competition, a domain-adjusted hand-crafted hybrid between neural network and statistical time series models. The first configuration of our model does not employ any time-series-specific components and its performance on the M4 dataset strongly suggests that, contrarily to received wisdom, deep learning primitives such as residual blocks are by themselves sufficient to solve a wide range of forecasting problems. Finally, we demonstrate how the proposed architecture can be augmented to provide outputs that are interpretable without loss in accuracy.

Compress-Store on Blockchain: A Decentralized Data Processing and Immutable Storage for Multimedia Streaming

Decentralization for data storage is a challenging problem for blockchain-based solutions as the blocksize plays the key role for scalability. In addition, specific requirements of multimedia data calls for various changes in the blockchain technology internals.Considering one of the most popular applications of secure multimedia streaming, i.e., video surveillance, it is not clear how to judiciously encode incentivisation, immutability and compression into a viable ecosystem. In this study, we provide a genuine scheme that achieves this encoding for a video surveillance application. The proposed scheme provides a novel integration of data compression, immutable off-chain data storage using a new consensus protocol namely, proof of work storage (PoWS) in order to enable fully useful work to be performed by the miner nodes of the network. The proposed idea is the first step towards achieving greener application of blockchain-based environment to the video storage business that utilizes system resources efficiently.

Decentralized Bayesian Learning over Graphs

We propose a decentralized learning algorithm over a general social network. The algorithm leaves the training data distributed on the mobile devices while utilizing a peer to peer model aggregation method. The proposed algorithm allows agents with local data to learn a shared model explaining the global training data in a decentralized fashion. The proposed algorithm can be viewed as a Bayesian and peer-to-peer variant of federated learning in which each agent keeps a ‘posterior probability distribution’ over a global model parameters. The agent update its ‘posterior’ based on 1) the local training data and 2) the asynchronous communication and model aggregation with their 1-hop neighbors. This Bayesian formulation allows for a systematic treatment of model aggregation over any arbitrary connected graph. Furthermore, it provides strong analytic guarantees on converge in the realizable case as well as a closed form characterization of the rate of convergence. We also show that our methodology can be combined with efficient Bayesian inference techniques to train Bayesian neural networks in a decentralized manner. By empirical studies we show that our theoretical analysis can guide the design of network/social interactions and data partitioning to achieve convergence.

Fully Hyperbolic Convolutional Neural Networks

Convolutional Neural Networks (CNN) have recently seen tremendous success in various computer vision tasks. However, their application to problems with high dimensional input and output has been limited by two factors. First, in the training stage, it is necessary to store network activations for back propagation. Second, in the inference stage, a few copies of the image are typically stored to be concatenated to other network states deeper in the network. In these settings, the memory requirements associated with storing activations can exceed what is feasible with current hardware. For the problem of image classification, reversible architectures have been proposed that allow one to recalculate activations in the backwards pass instead of storing them, however, such networks do not perform well for problems such as segmentation. Furthermore, currently only block reversible networks have been possible because pooling operations are not reversible. Motivated by the propagation of signals over physical networks, that are governed by the hyperbolic Telegraph equation, in this work we introduce a fully conservative hyperbolic network for problems with high dimensional input and output. We introduce a coarsening operation that allows completely reversible CNNs by using the Discrete Wavelet Transform and its inverse to both coarsen and interpolate the network state and change the number of channels. This means that during training we do not need to store the activations from the forward pass, and can train arbitrarily deep or wide networks. Furthermore, our network has a much lower memory footprint for inference. We show that we are able to achieve results comparable to the state of the art in image classification, depth estimation, and semantic segmentation, with a much lower memory footprint.