Compressing deep neural networks (DNNs) is important for real-world applications operating on resource-constrained devices. However, it is difficult to change the model size once the training is completed, which needs re-training to configure models suitable for different devices. In this paper, we propose a novel method that enables DNNs to flexibly change their size after training. We factorize the weight matrices of the DNNs via singular value decomposition (SVD) and change their ranks according to the target size. In contrast with existing methods, we introduce simple criteria that characterize the importance of each basis and layer, which enables to effectively compress the error and complexity of models as little as possible. In experiments on multiple image-classification tasks, our method exhibits favorable performance compared with other methods.
Linear regression is an important tool across many fields that work with sensitive human-sourced data. Significant prior work has focused on producing differentially private point estimates, which provide a privacy guarantee to individuals while still allowing modelers to draw insights from data by estimating regression coefficients. We investigate the problem of Bayesian linear regression, with the goal of computing posterior distributions that correctly quantify uncertainty given privately released statistics. We show that a naive approach that ignores the noise injected by the privacy mechanism does a poor job in realistic data settings. We then develop noise-aware methods that perform inference over the privacy mechanism and produce correct posteriors across a wide range of scenarios.
Convolutional Neural Networks (CNNs) have become indispensable for solving machine learning tasks in speech recognition, computer vision, and other areas that involve high-dimensional data. A CNN filters the input feature using a network containing spatial convolution operators with compactly supported stencils. In practice, the input data and the hidden features consist of a large number of channels, which in most CNNs are fully coupled by the convolution operators. This coupling leads to immense computational cost in the training and prediction phase. In this paper, we introduce LeanConvNets that are derived by sparsifying fully-coupled operators in existing CNNs. Our goal is to improve the efficiency of CNNs by reducing the number of weights, floating point operations and latency times, with minimal loss of accuracy. Our lean convolution operators involve tuning parameters that controls the trade-off between the network’s accuracy and computational costs. These convolutions can be used in a wide range of existing networks, and we exemplify their use in residual networks (ResNets) and U-Nets. Using a range of benchmark problems from image classification and semantic segmentation, we demonstrate that the resulting LeanConvNet’s accuracy is close to state-of-the-art networks while being computationally less expensive. In our tests, the lean versions of ResNet and U-net slightly outperforms comparable reduced architectures such as MobileNets and ShuffleNets.
Conversation is the natural mode for information exchange in daily life, a spoken conversational interaction for search input and output is a logical format for information seeking. However, the conceptualisation of user-system interactions or information exchange in spoken conversational search (SCS) has not been explored. The first step in conceptualising SCS is to understand the conversational moves used in an audio-only communication channel for search. This paper explores conversational actions for the task of search. We define a qualitative methodology for creating conversational datasets, propose analysis protocols, and develop the SCSdata. Furthermore, we use the SCSdata to create the first annotation schema for SCS: the SCoSAS, enabling us to investigate interactivity in SCS. We further establish that SCS needs to incorporate interactivity and pro-activity to overcome the complexity that the information seeking process in an audio-only channel poses. In summary, this exploratory study unpacks the breadth of SCS. Our results highlight the need for integrating discourse in future SCS models and contributes the advancement in the formalisation of SCS models and the design of SCS systems.
Using neural networks in the reinforcement learning (RL) framework has achieved notable successes. Yet, neural networks tend to forget what they learned in the past, especially when they learn online and fully incrementally, a setting in which the weights are updated after each sample is received and the sample is then discarded. Under this setting, an update can lead to overly global generalization by changing too many weights. The global generalization interferes with what was previously learned and deteriorates performance, a phenomenon known as catastrophic interference. Many previous works use mechanisms such as experience replay (ER) buffers to mitigate interference by performing minibatch updates, ensuring the data distribution is approximately independent-and-identically-distributed (i.i.d.). But using ER would become infeasible in terms of memory as problem complexity increases. Thus, it is crucial to look for more memory-efficient alternatives. Interference can be averted if we replace global updates with more local ones, so only weights responsible for the observed data sample are updated. In this work, we propose the use of dynamic self-organizing map (DSOM) with neural networks to induce such locality in the updates without ER buffers. Our method learns a DSOM to produce a mask to reweigh each hidden unit’s output, modulating its degree of use. It prevents interference by replacing global updates with local ones, conditioned on the agent’s state. We validate our method on standard RL benchmarks including Mountain Car and Lunar Lander, where existing methods often fail to learn without ER. Empirically, we show that our online and fully incremental method is on par with and in some cases, better than state-of-the-art in terms of final performance and learning speed. We provide visualizations and quantitative measures to show that our method indeed mitigates interference.
Relational data mining is becoming ubiquitous in many fields of study. It offers insights into behaviour of complex, real-world systems which cannot be modeled directly using propositional learning. We propose Symbolic Graph Embedding (SGE), an algorithm aimed to learn symbolic node representations. Built on the ideas from the field of inductive logic programming, SGE first samples a given node’s neighborhood and interprets it as a transaction database, which is used for frequent pattern mining to identify logical conjuncts of items that co-occur frequently in a given context. Such patterns are in this work used as features to represent individual nodes, yielding interpretable, symbolic node embeddings. The proposed SGE approach on a venue classification task outperforms shallow node embedding methods such as DeepWalk, and performs similarly to metapath2vec, a black-box representation learner that can exploit node and edge types in a given graph. The proposed SGE approach performs especially well when small amounts of data are used for learning, scales to graphs with millions of nodes and edges, and can be run on an of-the-shelf laptop.
Digital Twin technology is an emerging concept that has recently become the centre of attention for industry and in more recent year’s academia. The advancements in industry 4.0 concepts have facilitated its growth, particularly in the manufacturing industry. The Digital Twin is defined extensively but is described as the effortless integration of data between a physical and virtual machine in either direction. The challenges, applications, and enabling technologies for Artificial Intelligence, Internet of Things and Digital Twins are presented. A review of publications relating to Digital Twins is performed, producing a categorical review of recent papers. The review has categorised them by research area; Manufacturing, Healthcare and Smart cities. Discussing a range of papers that reflect these areas and the current state of research. The paper outlines the open research opportunities and challenges.
One of the most popular approaches to understanding feature effects of modern black box machine learning models are partial dependence plots (PDP). These plots are easy to understand but only able to visualize low order dependencies. The paper is about the question ‘How much can we see?’: A framework is developed to quantify the explainability of arbitrary machine learning models, i.e. up to what degree the visualization as given by a PDP is able to explain the predictions of the model. The result allows for a judgement whether an attempt to explain a black box model is sufficient or not.
The fundamental challenge of planning for multi-step manipulation is to find effective and plausible action sequences that lead to the task goal. We present Cascaded Variational Inference (CAVIN) Planner, a model-based method that hierarchically generates plans by sampling from latent spaces. To facilitate planning over long time horizons, our method learns latent representations that decouple the prediction of high-level effects from the generation of low-level motions through cascaded variational inference. This enables us to model dynamics at two different levels of temporal resolutions for hierarchical planning. We evaluate our approach in three multi-step robotic manipulation tasks in cluttered tabletop environments given high-dimensional observations. Empirical results demonstrate that the proposed method outperforms state-of-the-art model-based methods by strategically interacting with multiple objects.
We develop techniques to quantify the degree to which a given (training or testing) example is an outlier in the underlying distribution. We evaluate five methods to score examples in a dataset by how well-represented the examples are, for different plausible definitions of ‘well-represented’, and apply these to four common datasets: MNIST, Fashion-MNIST, CIFAR-10, and ImageNet. Despite being independent approaches, we find all five are highly correlated, suggesting that the notion of being well-represented can be quantified. Among other uses, we find these methods can be combined to identify (a) prototypical examples (that match human expectations); (b) memorized training examples; and, (c) uncommon submodes of the dataset. Further, we show how we can utilize our metrics to determine an improved ordering for curriculum learning, and impact adversarial robustness. We release all metric values on training and test sets we studied.
We explore a new approach for training neural networks where all loss functions are replaced by hard constraints. The same approach is very successful in phase retrieval, where signals are reconstructed from magnitude constraints and general characteristics (sparsity, support, etc.). Instead of taking gradient steps, the optimizer in the constraint based approach, called relaxed-reflect-reflect (RRR), derives its steps from projections to local constraints. In neural networks one such projection makes the minimal modification to the inputs , the associated weights , and the pre-activation value at each neuron, to satisfy the equation . These projections, along with a host of other local projections (constraining pre- and post-activations, etc.) can be partitioned into two sets such that all the projections in each set can be applied concurrently, across the network and across all data in the training batch. This partitioning into two sets is analogous to the situation in phase retrieval and the setting for which the general purpose RRR optimizer was designed. Owing to the novelty of the method, this paper also serves as a self-contained tutorial. Starting with a single-layer network that performs non-negative matrix factorization, and concluding with a generative model comprising an autoencoder and classifier, all applications and their implementations by projections are described in complete detail. Although the new approach has the potential to extend the scope of neural networks (e.g. by defining activation not through functions but constraint sets), most of the featured models are standard to allow comparison with stochastic gradient descent.
Conventional principal component analysis (PCA) finds a principal vector that maximizes the sum of second powers of principal components. We consider a generalized PCA that aims at maximizing the sum of an arbitrary convex function of principal components. We present a gradient ascent algorithm to solve the problem. For the kernel version of generalized PCA, we show that the solutions can be obtained as fixed points of a simple single-layer recurrent neural network. We also evaluate our algorithms on different datasets.
We study a variant of decision-theoretic online learning in which the set of experts that are available to Learner can shrink over time. This is a restricted version of the well-studied sleeping experts problem, itself a generalization of the fundamental game of prediction with expert advice. Similar to many works in this direction, our benchmark is the ranking regret. Various results suggest that achieving optimal regret in the fully adversarial sleeping experts problem is computationally hard. This motivates our relaxation where any expert that goes to sleep will never again wake up. We call this setting ‘dying experts’ and study it in two different cases: the case where the learner knows the order in which the experts will die and the case where the learner does not. In both cases, we provide matching upper and lower bounds on the ranking regret in the fully adversarial setting. Furthermore, we present new, computationally efficient algorithms that obtain our optimal upper bounds.
Integro-difference equation (IDE) models describe the conditional dependence between the spatial process at a future time point and the process at the present time point through an integral operator. Nonlinearity or temporal dependence in the dynamics is often captured by allowing the operator parameters to vary temporally, or by re-fitting a model with a temporally-invariant linear operator at each time point in a sliding window. Both procedures tend to be excellent for prediction purposes over small time horizons, but are generally time-consuming and, crucially, do not provide a global prior model for the temporally-varying dynamics that is realistic. Here, we tackle these two issues by using a deep convolution neural network (CNN) in a hierarchical statistical IDE framework, where the CNN is designed to extract process dynamics from the process’ most recent behaviour. Once the CNN is fitted, probabilistic forecasting can be done extremely quickly online using an ensemble Kalman filter with no requirement for repeated parameter estimation. We conduct an experiment where we train the model using 13 years of daily sea-surface temperature data in the North Atlantic Ocean. Forecasts are seen to be accurate and calibrated. A key advantage of our approach is that the CNN provides a global prior model for the dynamics that is realistic, interpretable, and computationally efficient. We show the versatility of the approach by successfully producing 10-minute nowcasts of weather radar reflectivities in Sydney using the same model that was trained on daily sea-surface temperature data in the North Atlantic Ocean.
We introduce the Convolutional Conditional Neural Process (ConvCNP), a new member of the Neural Process family that models translation equivariance in the data. Translation equivariance is an important inductive bias for many learning problems including time series modelling, spatial data, and images. The model embeds data sets into an infinite-dimensional function space as opposed to a finite-dimensional vector space. To formalize this notion, we extend the theory of neural representations of sets to include functional representations, and demonstrate that any translation-equivariant embedding can be represented using a convolutional deep set. We evaluate ConvCNPs in several settings, demonstrating that they achieve state-of-the-art performance compared to existing NPs. We demonstrate that building in translation equivariance enables zero-shot generalization to challenging, out-of-domain tasks.
This paper studies a rarely explored but critical anomaly detection problem: weakly-supervised anomaly detection with limited labeled anomalies and a large unlabeled data set. This problem is very important because it (i) enables anomaly-informed modeling which helps identify anomalies of interests and address the notorious high false positives in unsupervised anomaly detection, and (ii) eliminates the reliance on large-scale and complete labeled anomaly data in fully-supervised settings. However, the problem is especially challenging since we have only limited labeled data for a single class, and moreover, the seen anomalies often cannot cover all types of anomalies (i.e., unseen anomalies). We address this problem by formulating the problem as a pairwise relation learning task. Particularly, our approach defines a two-stream ordinal regression network to learn the relation of randomly selected instance pairs, i.e., whether the instance pair contains labeled anomalies or just unlabeled data instances. The resulting model leverages both the labeled and unlabeled data to effectively augment the data and learn generalized representations of both normality and abnormality. Extensive empirical results show that our approach (i) significantly outperforms state-of-the-art competing methods in detecting both seen and unseen anomalies and (ii) is substantially more data-efficient.
Meta-learning methods, most notably Model-Agnostic Meta-Learning or MAML, have achieved great success in adapting to new tasks quickly, after having been trained on similar tasks. The mechanism behind their success, however, is poorly understood. We begin this work with an experimental analysis of MAML, finding that deep models are crucial for its success, even given sets of simple tasks where a linear model would suffice on any individual task. Furthermore, on image-recognition tasks, we find that the early layers of MAML-trained models learn task-invariant features, while later layers are used for adaptation, providing further evidence that these models require greater capacity than is strictly necessary for their individual tasks. Following our findings, we propose a method which enables better use of model capacity at inference time by separating the adaptation aspect of meta-learning into parameters that are only used for adaptation but are not part of the forward model. We find that our approach enables more effective meta-learning in smaller models, which are suitably sized for the individual tasks.
A fundamental challenge in artificial intelligence is to build an agent that generalizes and adapts to unseen environments. A common strategy is to build a decoder that takes the context of the unseen new environment as input and generates a policy accordingly. The current paper studies how to build a decoder for the fundamental continuous control task, linear quadratic regulator (LQR), which can model a wide range of real-world physical environments. We present a simple algorithm for this problem, which uses upper confidence bound (UCB) to refine the estimate of the decoder and balance the exploration-exploitation trade-off. Theoretically, our algorithm enjoys a regret bound in the online setting where is the number of environments the agent played. This also implies after playing environments, the agent is able to transfer the learned knowledge to obtain an -suboptimal policy for an unseen environment. To our knowledge, this is first provably efficient algorithm to build a decoder in the continuous control setting. While our main focus is theoretical, we also present experiments that demonstrate the effectiveness of our algorithm.
We consider the problem of using reinforcement learning to train adversarial agents for automatic testing and falsification of cyberphysical systems, such as autonomous vehicles, robots, and airplanes. In order to produce useful agents, however, it is useful to be able to control the degree of adversariality by specifying rules that an agent must follow. For example, when testing an autonomous vehicle, it is useful to find maximally antagonistic traffic participants that obey traffic rules. We model dynamic constraints as hierarchically ordered rules expressed in Signal Temporal Logic, and show how these can be incorporated into an agent training process. We prove that our agent-centric approach is able to find all dangerous behaviors that can be found by traditional falsification techniques while producing modular and reusable agents. We demonstrate our approach on two case studies from the automotive domain.
This paper presents an approach to analyzing two-dimensional temporal datasets focusing on identifying observations that are significant in calculating the outliers of a scatterplot. We also propose a prototype, called Outliagnostics, to guide users when interactively exploring abnormalities in large time series. Instead of focusing on detecting outliers at each time point, we monitor and display the discrepant temporal signatures of each data entry concerning the overall distributions. Our prototype is designed to handle these tasks in parallel to improve performance. To highlight the benefits and performance of our approach, we illustrate and validate the use of Outliagnostics on real-world datasets of various sizes in different parallelism configurations. This work also discusses how to extend these ideas to handle time series with a higher number of dimensions and provides a prototype for this type of datasets.
With the information explosion of news articles, personalized news recommendation has become important for users to quickly find news that they are interested in. Existing methods on news recommendation mainly include collaborative filtering methods which rely on direct user-item interactions and content based methods which characterize the content of user reading history. Although these methods have achieved good performances, they still suffer from data sparse problem, since most of them fail to extensively exploit high-order structure information (similar users tend to read similar news articles) in news recommendation systems. In this paper, we propose to build a heterogeneous graph to explicitly model the interactions among users, news and latent topics. The incorporated topic information would help indicate a user’s interest and alleviate the sparsity of user-item interactions. Then we take advantage of graph neural networks to learn user and news representations that encode high-order structure information by propagating embeddings over the graph. The learned user embeddings with complete historic user clicks capture the users’ long-term interests. We also consider a user’s short-term interest using the recent reading history with an attention based LSTM model. Experimental results on real-world datasets show that our proposed model significantly outperforms state-of-the-art methods on news recommendation.
In Interactive Machine Learning (IML), we iteratively make decisions and obtain noisy observations of an unknown function. While IML methods, e.g., Bayesian optimization and active learning, have been successful in applications, on real-world systems they must provably avoid unsafe decisions. To this end, safe IML algorithms must carefully learn about a priori unknown constraints without making unsafe decisions. Existing algorithms for this problem learn about the safety of all decisions to ensure convergence. This is sample-inefficient, as it explores decisions that are not relevant for the original IML objective. In this paper, we introduce a novel framework that renders any existing unsafe IML algorithm safe. Our method works as an add-on that takes suggested decisions as input and exploits regularity assumptions in terms of a Gaussian process prior in order to efficiently learn about their safety. As a result, we only explore the safe set when necessary for the IML problem. We apply our framework to safe Bayesian optimization and to safe exploration in deterministic Markov Decision Processes (MDP), which have been analyzed separately before. Our method outperforms other algorithms empirically.
Histograms and synthetic data are of key importance in data analysis. However, researchers have shown that even aggregated data such as histograms, containing no obvious sensitive attributes, can result in privacy leakage. To enable data analysis, a strong notion of privacy is required to avoid risking unintended privacy violations. Such a strong notion of privacy is differential privacy, a statistical notion of privacy that makes privacy leakage quantifiable. The caveat regarding differential privacy is that while it has strong guarantees for privacy, privacy comes at a cost of accuracy. Despite this trade off being a central and important issue in the adoption of differential privacy, there exists a gap in the literature for understanding the trade off and addressing it appropriately. Through a systematic literature review (SLR), we investigate the state-of-the-art within accuracy improving differentially private algorithms for histogram and synthetic data publishing. Our contribution is two-fold: 1) we provide an understanding of the problem by crystallizing the categories of accuracy improving techniques, the core problems they solve, as well as to investigate how composable the techniques are, and 2) we pave the way for future work. In order to provide an understanding, we position and visualize the ideas in relation to each other and external work, and deconstruct each algorithm to examine the building blocks separately with the aim of pinpointing which dimension of noise reduction each technique is targeting. Hence, this systematization of knowledge (SoK) provides an understanding of in which dimensions and how accuracy improvement can be pursued without sacrificing privacy.
Deep Learning has pushed the limits of what was possible in the domain of Digital Image Processing. However, that is not to say that the traditional computer vision techniques which had been undergoing progressive development in years prior to the rise of DL have become obsolete. This paper will analyse the benefits and drawbacks of each approach. The aim of this paper is to promote a discussion on whether knowledge of classical computer vision techniques should be maintained. The paper will also explore how the two sides of computer vision can be combined. Several recent hybrid methodologies are reviewed which have demonstrated the ability to improve computer vision performance and to tackle problems not suited to Deep Learning. For example, combining traditional computer vision techniques with Deep Learning has been popular in emerging domains such as Panoramic Vision and 3D vision for which Deep Learning models have not yet been fully optimised
We present an algorithm for extraction of a probabilistic deterministic finite automaton (PDFA) from a given black-box language model, such as a recurrent neural network (RNN). The algorithm is a variant of the exact-learning algorithm L*, adapted to a probabilistic setting with noise. The key insight is the use of conditional probabilities for observations, and the introduction of a local tolerance when comparing them. When applied to RNNs, our algorithm often achieves better word error rate (WER) and normalised distributed cumulative gain (NDCG) than that achieved by spectral extraction of weighted finite automata (WFA) from the same networks. PDFAs are substantially more expressive than n-grams, and are guaranteed to be stochastic and deterministic – unlike spectrally extracted WFAs.