Knowledge graphs are used to represent relational information in terms of triples. To enable learning about domains, embedding models, such as tensor factorization models, can be used to make predictions of new triples. Often there is background taxonomic information (in terms of subclasses and subproperties) that should also be taken into account. We show that existing fully expressive (a.k.a. universal) models cannot provably respect subclass and subproperty information. We show that minimal modifications to an existing knowledge graph completion method enable injection of taxonomic information. Moreover, we prove that our model is fully expressive, assuming a lower bound on the size of the embeddings. Experimental results on public knowledge graphs show that despite its simplicity our approach is surprisingly effective.
This paper studies the distributed reinforcement learning (DRL) problem involving a central controller and a group of learners. Two DRL settings that find broad applications are considered: multi-agent reinforcement learning (RL) and parallel RL. In both settings, frequent information exchange between the learners and the controller is required. However, for many distributed systems, e.g., parallel machines for training deep RL algorithms, and multi-robot systems for learning the optimal coordination strategies, the overhead caused by frequent communication is not negligible and becomes the bottleneck of the overall performance. To overcome this challenge, we develop a new policy gradient method that is amenable to efficient implementation in such communication-constrained settings. By adaptively skipping the policy gradient communication, our method can reduce the communication overhead without degrading the learning accuracy. Analytically, we can establish that i) the convergence rate of our algorithm is the same as the vanilla policy gradient for the DRL tasks; and, ii) if the distributed computing units are heterogeneous in terms of their reward functions and initial state distributions, the number of communication rounds needed to achieve a targeted learning accuracy is reduced. Numerical experiments on a popular multi-agent RL benchmark corroborate the significant communication reduction of our algorithm compared to the alternatives.
Utilizing recently introduced concepts from statistics and quantitative risk management, we present a general variant of Batch Normalization (BN) that offers accelerated convergence of Neural Network training compared to conventional BN. In general, we show that mean and standard deviation are not always the most appropriate choice for the centering and scaling procedure within the BN transformation, particularly if ReLU follows the normalization step. We present a Generalized Batch Normalization (GBN) transformation, which can utilize a variety of alternative deviation measures for scaling and statistics for centering, choices which naturally arise from the theory of generalized deviation measures and risk theory in general. When used in conjunction with the ReLU non-linearity, the underlying risk theory suggests natural, arguably optimal choices for the deviation measure and statistic. Utilizing the suggested deviation measure and statistic, we show experimentally that training is accelerated more so than with conventional BN, often with improved error rate as well. Overall, we propose a more flexible BN transformation supported by a complementary theoretical framework that can potentially guide design choices.
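The centering-and-scaling idea in the abstract above can be illustrated with a minimal sketch for a single feature across one batch. This is not the authors' implementation: the function names are hypothetical, the median and mean absolute deviation are illustrative stand-ins rather than the choices the paper recommends for ReLU, and adding eps to the deviation is a simplification.

```python
import statistics

def generalized_batch_norm(batch, center_fn, deviation_fn, eps=1e-5):
    """Normalize one feature across a batch as (x - center) / (deviation + eps).

    center_fn and deviation_fn are pluggable, which is the point of the sketch;
    the learnable scale and shift of a full BN layer are omitted.
    """
    center = center_fn(batch)
    deviation = deviation_fn(batch, center)
    return [(x - center) / (deviation + eps) for x in batch]

def std_deviation(batch, center):
    # Root mean squared distance from the center (population standard
    # deviation when the center is the mean).
    return (sum((x - center) ** 2 for x in batch) / len(batch)) ** 0.5

def mean_abs_deviation(batch, center):
    # Illustrative alternative deviation measure (an assumption, not the
    # paper's recommended choice).
    return sum(abs(x - center) for x in batch) / len(batch)

batch = [1.0, 2.0, 3.0, 4.0, 10.0]

# Conventional choice: mean for centering, standard deviation for scaling.
conventional = generalized_batch_norm(batch, statistics.mean, std_deviation)

# Alternative choice: median for centering, mean absolute deviation for scaling.
alternative = generalized_batch_norm(batch, statistics.median, mean_abs_deviation)
```

With the mean as the centering statistic the normalized values sum to zero, whereas with the median the middle sample maps to zero, so the two choices place the ReLU threshold at different points of the batch distribution.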
We survey distributed deep learning models for training or inference without accessing raw data from clients. These methods aim to protect confidential patterns in data while still allowing servers to train models. The distributed deep learning methods of federated learning, split learning and large batch stochastic gradient descent are compared, in addition to the private and secure approaches of differential privacy, homomorphic encryption, oblivious transfer and garbled circuits, in the context of neural networks. We study their benefits, limitations and trade-offs with regard to computational resources, data leakage and communication efficiency, and also share our anticipated future trends.
Machine learning relies on the availability of a vast amount of data for training. However, in reality, most data are scattered across different organizations and cannot be easily integrated under many legal and practical constraints. In this paper, we introduce a new technique and framework, known as federated transfer learning (FTL), to improve statistical models under a data federation. The federation allows knowledge to be shared without compromising user privacy, and enables complementary knowledge to be transferred in the network. As a result, a target-domain party can build more flexible and powerful models by leveraging rich labels from a source-domain party. A secure transfer cross validation approach is also proposed to guard the FTL performance under the federation. The framework requires minimal modifications to the existing model structure and provides the same level of accuracy as the non-privacy-preserving approach. This framework is very flexible and can be effectively adapted to various secure multi-party machine learning tasks.
Time series data account for a major part of the data available today. Time series mining handles several tasks such as classification, clustering, query-by-content, prediction, and others. Performing data mining tasks on raw time series is inefficient as these data are high-dimensional by nature. Instead, time series are first pre-processed using several techniques before different data mining tasks can be performed on them. In general, there are two main approaches to reducing time series dimensionality. The first is what we call landmark methods. These methods are based on finding characteristic features in the target time series. The second is based on data transformations. These methods transform the time series from the original space into a reduced space, where they can be managed more efficiently. The method we present in this paper applies a third approach, as it projects a time series onto a lower-dimensional space by selecting important points in the time series. The novelty of our method is that these points are not chosen according to a geometric criterion, which is subjective in most cases, but through an optimization process. The other important characteristic of our method is that these important points are selected at the dataset level and not at the level of a single time series. The direct advantage of this strategy is that the distance defined on the low-dimensional space lower bounds the original distance applied to raw data. This enables us to apply the popular GEMINI algorithm. The promising results of our experiments on a wide variety of time series datasets, using different optimizers, and applied to the two major data mining tasks, validate our new method.
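The lower-bounding property described in the abstract above can be sketched as follows, assuming the Euclidean distance and a single index set shared by the whole dataset. The index set here is a hypothetical, hand-picked example; the optimization process that the paper uses to choose it is not shown. Because the reduced distance sums a subset of the non-negative squared differences, it can never exceed the full distance, which is what makes a GEMINI-style filter-and-refine query safe.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project(series, indices):
    """Keep only the values at the selected time indices."""
    return [series[i] for i in indices]

def range_query(query, dataset, indices, radius):
    """Filter-and-refine range query in the style of GEMINI.

    A series is pruned when its reduced distance already exceeds the radius;
    this never discards a true answer because the reduced distance lower
    bounds the full distance.
    """
    reduced_query = project(query, indices)
    hits = []
    for series in dataset:
        if euclidean(reduced_query, project(series, indices)) > radius:
            continue  # safe to prune without touching the raw series
        if euclidean(query, series) <= radius:
            hits.append(series)
    return hits

# Hypothetical index set shared by every series in the dataset.
selected = [0, 3, 5]
s = [0.0, 1.0, 2.0, 1.5, 0.5, 0.0]
t = [0.2, 0.8, 2.5, 1.0, 0.7, 0.1]

full = euclidean(s, t)
reduced = euclidean(project(s, selected), project(t, selected))
matches = range_query(s, [s, t], selected, radius=0.1)
```

Sharing one index set across the dataset is what allows the reduced representations to be compared directly with each other.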
Curriculum Learning – the idea of teaching by gradually exposing the learner to examples in a meaningful order, from easy to hard – has long been investigated in the context of machine learning. Although methods based on this concept have been empirically shown to improve performance of several learning algorithms, no theoretical analysis has been provided even for simple cases. To address this shortfall, we start by formulating an ideal definition of difficulty score – the loss of the optimal hypothesis at a given datapoint. We analyze the possible contribution of curriculum learning based on this score in two convex problems – linear regression, and binary classification by hinge loss minimization. We show that in both cases, the expected convergence rate decreases monotonically with the ideal difficulty score, in accordance with earlier empirical results. We also prove that when the ideal difficulty score is fixed, the convergence rate is monotonically increasing with respect to the loss of the current hypothesis at each point. We discuss how these results reconcile two apparently contradicting heuristics: curriculum learning on the one hand, and hard data mining on the other.
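The ideal difficulty score defined in the abstract above can be sketched for one-dimensional linear regression. As an assumption, the least-squares fit on the available data stands in for the optimal hypothesis, which is unknown in practice; the function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def difficulty_scores(xs, ys):
    """Squared loss of the fitted hypothesis at each datapoint.

    The fitted line is a stand-in for the optimal hypothesis (an assumption).
    """
    slope, intercept = fit_line(xs, ys)
    return [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.0, 3.6, 3.9]

scores = difficulty_scores(xs, ys)

# Curriculum order: indices sorted from easy (low score) to hard (high score).
curriculum = sorted(range(len(xs)), key=lambda i: scores[i])
```

A curriculum then presents the datapoints in the order given by `curriculum`, whereas a hard-example strategy would instead rank points by the loss of the current hypothesis.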
We examine an analytic variational inference scheme for the Gaussian Process State Space Model (GPSSM) – a probabilistic model for system identification and time-series modelling. Our approach performs variational inference over both the system states and the transition function. We exploit Markov structure in the true posterior, as well as an inducing point approximation to achieve linear time complexity in the length of the time series. Contrary to previous approaches, no Monte Carlo sampling is required: inference is cast as a deterministic optimisation problem. In a number of experiments, we demonstrate the ability to model non-linear dynamics in the presence of both process and observation noise as well as to impute missing information (e.g. velocities from raw positions through time), to de-noise, and to estimate the underlying dimensionality of the system. Finally, we also introduce a closed-form method for multi-step prediction, and a novel criterion for assessing the quality of our approximate posterior.
Methods proposed in the literature towards continual deep learning typically operate in a task-based sequential learning setup. A sequence of tasks is learned, one at a time, with all data of the current task available but not of previous or future tasks. Task boundaries and identities are known at all times. This setup, however, is rarely encountered in practical applications. Therefore we investigate how to transform continual learning to an online setup. We develop a system that keeps on learning over time in a streaming fashion, with data distributions gradually changing and without the notion of separate tasks. To this end, we build on the work on Memory Aware Synapses, and show how this method can be made online by providing a protocol to decide i) when to update the importance weights, ii) which data to use to update them, and iii) how to accumulate the importance weights at each update step. Experimental results show the validity of the approach in the context of two applications: (self-)supervised learning of a face recognition model by watching soap series and teaching a robot to avoid collisions.
We derive the fast convergence rates of a deep neural network (DNN) classifier with the rectified linear unit (ReLU) activation function learned using the hinge loss. We consider three cases for a true model: (1) a smooth decision boundary, (2) smooth conditional class probability, and (3) the margin condition (i.e., the probability of inputs near the decision boundary is small). We show that the DNN classifier learned using the hinge loss achieves fast convergence rates for all three cases provided that the architecture (i.e., the number of layers, number of nodes and sparsity) is carefully selected. An important implication is that DNN architectures are very flexible for use in various cases without much modification. In addition, we consider a DNN classifier learned by minimizing the cross-entropy, and show that the DNN classifier achieves a fast convergence rate under the condition that the conditional class probabilities of most data are sufficiently close to either 1 or 0. This assumption is not unusual for image recognition because human beings are extremely good at recognizing most images. To confirm our theoretical explanation, we present the results of a small numerical study conducted to compare the hinge loss and cross-entropy.
Changepoint detection methods are used in many areas of science and engineering, e.g., in the analysis of copy number variation data, to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or presence) of given changepoints, post-detection, is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods towards changepoint detection, focusing as our main scientific application on copy number variation data. As for changepoint algorithms, we study binary segmentation, and two of its most popular variants, wild and circular, and the fused lasso. We implement some of the latest developments in post-selection inference theory: we use auxiliary randomization to improve power, which requires implementations of MCMC algorithms (importance sampling and hit-and-run sampling) to carry out our tests. We also provide recommendations for improving practical usability, detailed simulations, and an example analysis on array comparative genomic hybridization (CGH) data.
Traditional plane-based clustering methods measure the within-cluster and between-cluster costs by quadratic, linear or some other unbounded functions, which may amplify the impact of cost. This letter introduces a ramp cost function into plane-based clustering to propose a new clustering method, called ramp-based twin support vector clustering (RampTWSVC). RampTWSVC is more robust because of its boundedness, and thus it finds the intrinsic clusters more easily than other plane-based clustering methods. The non-convex programming problem in RampTWSVC is solved efficiently through an alternating iteration algorithm, and its local solution can be obtained in a finite number of iterations theoretically. In addition, a nonlinear manifold-based formulation of RampTWSVC is also proposed via the kernel trick. Experimental results on several benchmark datasets show the better performance of our RampTWSVC compared with other plane-based clustering methods.
Representing the control flow of a computer program as a computation graph can bring many benefits in a broad variety of domains where performance is critical. This technique is a core component of most major numerical libraries (TensorFlow, PyTorch, Theano, MXNet,…) and is successfully used to speed up and optimise many computationally-intensive tasks. However, different design choices in each of these libraries lead to noticeable differences in efficiency and in the way an end user writes efficient code. In this report, we detail the implementation and features of the computation graph support in OCaml’s numerical library Owl, a recent entry in the world of scientific computing.
We propose a new sufficient dimension reduction approach designed deliberately for high-dimensional classification. This novel method is named maximal mean variance (MMV), inspired by the mean variance index first proposed by Cui, Li and Zhong (2015), which measures the dependence between a categorical random variable with multiple classes and a continuous random variable. Our method requires reasonably mild restrictions on the predicting variables and keeps the model-free advantage without the need to estimate the link function. The consistency of the MMV estimator is established under regularity conditions for both fixed and diverging dimension (p) cases, and the number of response classes can also be allowed to diverge with the sample size n. We also establish the asymptotic normality of the estimator when the dimension of the predicting vector is fixed. Furthermore, our method works well when n < p. The surprising classification efficiency gain of the proposed method is demonstrated by simulation studies and real data analysis.
We consider a sequence of successively more restrictive definitions of abstraction for causal models, starting with a notion introduced by Rubenstein et al. (2017) called exact transformation that applies to probabilistic causal models, moving to a notion of uniform transformation that applies to deterministic causal models and does not allow differences to be hidden by the ‘right’ choice of distribution, and then to abstraction, where the interventions of interest are determined by the map from low-level states to high-level states, and strong abstraction, which takes more seriously all potential interventions in a model, not just the allowed interventions. We show that procedures for combining micro-variables into macro-variables are instances of our notion of strong abstraction, as are all the examples considered by Rubenstein et al.
The present paper studies the so-called deep image prior (DIP) technique in the context of inverse problems. DIP networks have been introduced recently for applications in image processing, and first experimental results for applying DIP to inverse problems have also been reported. This paper aims at discussing different interpretations of DIP and at obtaining analytic results for specific network designs and linear operators. The main contribution is to introduce the idea of viewing these approaches as the optimization of Tikhonov functionals rather than optimizing networks. Besides theoretical results, we present numerical verifications for an academic example (integration operator) as well as for the inverse problem of magnetic particle imaging (MPI). The reconstructions obtained by deep prior networks are compared with state-of-the-art methods.
We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. It extends neural network libraries with layers capturing uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations (‘stochastic output layers’), and the function itself (Gaussian processes). With reversible layers, one can also propagate uncertainty from input to output such as for flow-based distributions and constant-memory backpropagation. Bayesian Layers are a drop-in replacement for other layers, maintaining core features that one typically desires for experimentation. As a demonstration, we fit a 10-billion-parameter ‘Bayesian Transformer’ on 512 TPUv2 cores, which replaces attention layers with their Bayesian counterpart.