Understanding each other is the key to success in collaboration. For humans, attributing mental states to others, the theory of mind, provides the crucial advantage. We argue for formulating human–AI interaction as a multi-agent problem, endowing AI with a computational theory of mind to understand and anticipate the user. To differentiate the approach from previous work, we introduce a categorisation of user modelling approaches based on the level of agency learnt in the interaction. We describe our recent work in using nested multi-agent modelling to formulate user models for multi-armed bandit based interactive AI systems, including a proof-of-concept user study.

Formal verification of neural networks is essential for their deployment in safety-critical areas. Many available formal verification methods have been shown to be instances of a unified Branch and Bound (BaB) formulation. We propose a novel framework for designing an effective branching strategy for BaB. Specifically, we learn a graph neural network (GNN) to imitate the strong branching heuristic behaviour. Our framework differs from previous methods for learning to branch in two main aspects. Firstly, our framework directly treats the neural network we want to verify as a graph input for the GNN. Secondly, we develop an intuitive forward and backward embedding update schedule. Empirically, our framework achieves roughly reduction in both the number of branches and the time required for verification on various convolutional networks when compared to the best available hand-designed branching strategy. In addition, we show that our GNN model enjoys both horizontal and vertical transferability. Horizontally, the model trained on easy properties performs well on properties of increased difficulty levels. Vertically, the model trained on small neural networks achieves similar performance on large neural networks.

We present Smooth Grad-CAM++, a technique which combines two recent techniques: SMOOTHGRAD and Grad-CAM++. Smooth Grad-CAM++ has the capability of either visualizing a layer, subset of feature maps, or subset of neurons within a feature map at each instance. We experimented with few images, and we discovered that Smooth Grad-CAM++ produced more visually sharp maps with larger number of salient pixels highlighted in the given input images when compared with other methods. Smooth Grad-CAM++ will give insight into what our deep CNN models (including models trained on medical scan or imagery) learn. Hence informing decisions on creating a representative training set.

In this paper, we obtain fundamental bounds in sequential prediction and recursive algorithms via an entropic analysis. Both classes of problems are examined by investigating the underlying entropic relationships of the data and/or noises involved, and the derived lower bounds may all be quantified in a conditional entropy characterization. We also study the conditions to achieve the generic bounds from an innovations’ viewpoint.

Deep neural networks (DNN) are versatile parametric models utilised successfully in a diverse number of tasks and domains. However, they have limitations—particularly from their lack of robustness and over-sensitivity to out of distribution samples. Bayesian Neural Networks, due to their formulation under the Bayesian framework, provide a principled approach to building neural networks that address these limitations. This paper describes a study that empirically evaluates and compares Bayesian Neural Networks to their equivalent point estimate Deep Neural Networks to quantify the predictive uncertainty induced by their parameters, as well as their performance in view of this uncertainty. In this study, we evaluated and compared three point estimate deep neural networks against comparable Bayesian neural network alternatives using two well-known benchmark image classification datasets (CIFAR-10 and SVHN).

In many environments, only a relatively small subset of the complete state space is necessary in order to accomplish a given task. We develop a simple technique using emergency stops (e-stops) to exploit this phenomenon. Using e-stops significantly improves sample complexity by reducing the amount of required exploration, while retaining a performance bound that efficiently trades off the rate of convergence with a small asymptotic sub-optimality gap. We analyze the regret behavior of e-stops and present empirical results in discrete and continuous settings demonstrating that our reset mechanism can provide order-of-magnitude speedups on top of existing reinforcement learning methods.

Machine learning has seen tremendous advances in the past few years which has lead to deep learning models being deployed in varied applications of day-to-day life. Attacks on such models using perturbations, particularly in real-life scenarios, pose a serious challenge to their applicability, pushing research into the direction which aims to enhance the robustness of these models. After the introduction of these perturbations by Szegedy et al., significant amount of research has focused on the reliability of such models, primarily in two aspects – white-box, where the adversary has access to the targeted model and related parameters; and the black-box, which resembles a real-life scenario with the adversary having almost no knowledge of the model to be attacked. We propose to attract attention on the latter scenario and thus, present a comprehensive comparative study among the different adversarial black-box attack approaches proposed till date. The second half of this literature survey focuses on the defense techniques. This is the first study, to the best of our knowledge, that specifically focuses on the black-box setting to motivate future work on the same.

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system.

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.

We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. Long composition rollouts with non-reversible state changes are among the phenomena we include to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like ‘Rinse off a mug and place it in the coffee maker.’ and low-level language instructions like ‘Walk to the coffee maker on the right.’ ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model designed for recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.

Federated learning opens a number of research opportunities due to its high communication efficiency in distributed training problems within a star network. In this paper, we focus on improving the communication efficiency for fully decentralized federated learning over a graph, where the algorithm performs local updates for several iterations and then enables communications among the nodes. In such a way, the communication rounds of exchanging the common interest of parameters can be saved significantly without loss of optimality of the solutions. Multiple numerical simulations based on large, real-world electronic health record databases showcase the superiority of the decentralized federated learning compared with classic methods.

Training of deep neural networks heavily depends on the data distribution. In particular, the networks easily suffer from class imbalance. The trained networks would recognize the frequent classes better than the infrequent classes. To resolve this problem, existing approaches typically propose novel loss functions to obtain better feature embedding. In this paper, we argue that drawing a better decision boundary is as important as learning better features. Inspired by observations, we investigate how the class imbalance affects the decision boundary and deteriorates the performance. We also investigate the feature distributional discrepancy between training and test time. As a result, we propose a novel, yet simple method for class imbalanced learning. Despite its simplicity, our method shows outstanding performance. In particular, the experimental results show that we can significantly improve the network by scaling the weight vectors, even without additional training process.

The ubiquity of GPS enabled devices result into the generation of an enormous amount of user movement data consisting of a sequence of locations together with activities performed at those locations. Such data, commonly known as {\it activity-trajectory data}, can be analysed to find common user movements and preferred activities, which will have tremendous business value. However, various countries have now introduced data privacy regulations that make it mandatory to anonymize any data before releasing it. This makes it difficult to look for patterns as the existing mining techniques may not be directly applicable over anonymized data. User locations in an activity-trajectory dataset are anonymized to regions of different shapes and sizes making them uncomparable; therefore, it is unclear how to define suitable patterns over those regions. In this paper, we propose a top-k pattern mining technique called TopKMintra that employs a pre-processing step to transform anonymized activity-trajectory into an intermediate representation that address the above problem. Unlike usual sequential data, activity-trajectory data is 2-dimensional that can lead to generation of duplicate patterns. To handle this, TopKMintra restricts arbitrary extensions in the two dimensions by imposing an order over the search space; this also helps in addressing the common problem of updating the threshold in top-k pattern mining algorithms during various stages. We perform extensive experiments on real datasets to demonstrate the efficiency and the effectiveness of TopKMintra. Our results show that even after anonymization, certain patterns may remain in a dataset and those could be mined efficiently.

We propose a scalable Bayesian preference learning method for jointly predicting the preferences of individuals as well as the consensus of a crowd from pairwise labels. Peoples’ opinions often differ greatly, making it difficult to predict their preferences from small amounts of personal data. Individual biases also make it harder to infer the consensus of a crowd when there are few labels per item. We address these challenges by combining matrix factorisation with Gaussian processes, using a Bayesian approach to account for uncertainty arising from noisy and sparse data. Our method exploits input features, such as text embeddings and user metadata, to predict preferences for new items and users that are not in the training set. As previous solutions based on Gaussian processes do not scale to large numbers of users, items or pairwise labels, we propose a stochastic variational inference approach that limits computational and memory costs. Our experiments on a recommendation task show that our method is competitive with previous approaches despite our scalable inference approximation. We demonstrate the method’s scalability on a natural language processing task with thousands of users and items, and show improvements over the state of the art on this task. We make our software publicly available for future work.

Neural Architecture Search (NAS) that aims to automate the procedure of architecture design has achieved promising results in many computer vision fields. In this paper, we propose an AdversarialNAS method specially tailored for Generative Adversarial Networks (GANs) to search for a superior generative model on the task of unconditional image generation. The proposed method leverages an adversarial searching mechanism to search for the architectures of generator and discriminator simultaneously in a differentiable manner. Therefore, the searching algorithm considers the relevance and balance between the two networks leading to search for a superior generative model. Besides, AdversarialNAS does not need any extra evaluation metric to evaluate the performance of the architecture in each searching iteration, which is very efficient and can take only 1 GPU day to search for an optimal network architecture in a large search space (). Experiments demonstrate the effectiveness and superiority of our method. The discovered generative model sets a new state-of-the-art FID score of and highly competitive Inception Score of on CIFAR-10. Its transferability is also proven by setting new state-of-the-art FID score of and Inception score of on STL-10. Our code will be released to facilitate the related academic and industrial study.

Attribute Oriented Induction (AOI) is a data mining algorithm used for extracting knowledge of relational data, taking into account expert knowledge. It is a clustering algorithm that works by transforming the values of the attributes and converting an instance into others that are more generic or ambiguous. In this way, it seeks similarities between elements to generate data groupings. AOI was initially conceived as an algorithm for knowledge discovery in databases, but over the years it has been applied to other areas such as spatial patterns, intrusion detection or strategy making. In this paper, AOI has been extended to the field of Predictive Maintenance. The objective is to demonstrate that combining expert knowledge and data collected from the machine can provide good results in the Predictive Maintenance of industrial assets. To this end we adapted the algorithm and used an LSTM approach to perform both the Anomaly Detection (AD) and the Remaining Useful Life (RUL). The results obtained confirm the validity of the proposal, as the methodology was able to detect anomalies, and calculate the RUL until breakage with considerable degree of accuracy.

This paper introduces scikit-hubness, a Python package for efficient nearest neighbor search in high-dimensional spaces. Hubness is an aspect of the curse of dimensionality, and is known to impair various learning tasks, including classification, clustering, and visualization. scikit-hubness provides algorithms for hubness analysis (‘Is my data affected by hubness?’), hubness reduction (‘How can we improve neighbor retrieval in high dimensions?’), and approximate neighbor search (‘Does it work for large data sets?’). It is integrated into the scikit-learn environment, enabling rapid adoption by Python-based machine learning researchers and practitioners. Users will find all functionality of the scikit-learn neighbors package, plus additional support for transparent hubness reduction and approximate nearest neighbor search. scikit-hubness is developed using several quality assessment tools and principles, such as PEP8 compliance, unit tests with high code coverage, continuous integration on all major platforms (Linux, MacOS, Windows), and additional checks by LGTM. The source code is available at https://…/scikit-hubness under the BSD 3-clause license. Install from the Python package index with $latex $

We present a robotic setup for real-world testing and evaluation of human-robot and human-human collaborative learning. Leveraging the sample-efficiency of the Soft Actor-Critic algorithm, we have implemented a robotic platform able to learn a non-trivial collaborative task with a human partner, without pre-training in simulation, and using only 30 minutes of real-world interactions. This enables us to study Human-Robot and Human-Human collaborative learning through real-world interactions. We present preliminary results, showing that state-of-the-art deep learning methods can take human-robot collaborative learning a step closer to that of humans interacting with each other.

Neural Ordinary Differential Equations (N-ODEs) are a powerful building block for learning systems, which extend residual networks to a continuous-time dynamical system. We propose a Bayesian version of N-ODEs that enables well-calibrated quantification of prediction uncertainty, while maintaining the expressive power of their deterministic counterpart. We assign Bayesian Neural Nets (BNNs) to both the drift and the diffusion terms of a Stochastic Differential Equation (SDE) that models the flow of the activation map in time. We infer the posterior on the BNN weights using a straightforward adaptation of Stochastic Gradient Langevin Dynamics (SGLD). We illustrate significantly improved stability on two synthetic time series prediction tasks and report better model fit on UCI regression benchmarks with our method when compared to its non-Bayesian counterpart.

The Delta method is a well known procedure used to quantify uncertainty in statistical models. The method has previously been applied in the context of neural networks, but has not reached much popularity in deep learning because of the sheer size of the Hessian matrix. In this paper, we propose a low cost variant of the method based on an approximate eigendecomposition of the positive curvature subspace of the Hessian matrix. The method has a computational complexity of time and space, where is the number of utilized Hessian eigenpairs, is the number of model parameters and is the number of training examples. Given that the model is -regularized with rate , we provide a bound on the uncertainty approximation error given . We show that when the smallest Hessian eigenvalue in the positive -tail of the full spectrum, and the largest Hessian eigenvalue in the negative -tail of the full spectrum are both approximately equal to , the error will be close to zero even when . We demonstrate the method by a TensorFlow implementation, and show that meaningful rankings of images based on prediction uncertainty can be obtained for a convolutional neural network based MNIST classifier. We also observe that false positives have higher prediction uncertainty than true positives. This suggests that there is supplementing information in the uncertainty measure not captured by the probability alone.

Neural Architecture Search methods are effective but often use complex algorithms to come up with the best architecture. We propose an approach with three basic steps that is conceptually much simpler. First we train N random architectures to generate N (architecture, validation accuracy) pairs and use them to train a regression model that predicts accuracy based on the architecture. Next, we use this regression model to predict the validation accuracies of a large number of random architectures. Finally, we train the top-K predicted architectures and deploy the model with the best validation result. While this approach seems simple, it is more than 20 times as sample efficient as Regularized Evolution on the NASBench-101 benchmark and can compete on ImageNet with more complex approaches based on weight sharing, such as ProxylessNAS.

Bayesian interpretations of neural network have a long history, dating back to early work in the 1990’s and have recently regained attention because of their desirable properties like uncertainty estimation, model robustness and regularisation. We want to discuss here the application of Bayesian models to knowledge sharing between neural networks. Knowledge sharing comes in different facets, such as transfer learning, model distillation and shared embeddings. All of these tasks have in common that learned ‘features’ ought to be shared across different networks. Theoretically rooted in the concepts of Bayesian neural networks this work has widespread application to general deep learning.

The promise of ultimate elasticity and operational simplicity of serverless computing has recently lead to an explosion of research in this area. In the context of data analytics, the concept sounds appealing, but due to the limitations of current offerings, there is no consensus yet on whether or not this approach is technically and economically viable. In this paper, we identify interactive data analytics on cold data as a use case where serverless computing excels. We design and implement Lambada, a system following a purely serverless architecture, in order to illustrate when and how serverless computing should be employed for data analytics. We propose several system components that overcome the previously known limitations inherent in the serverless paradigm as well as additional ones we identify in this work. We can show that, thanks to careful design, a serverless query processing system can be at the same time one order of magnitude faster and two orders of magnitude cheaper compared to commercial Query-as-a-Service systems, the only alternative with similar operational simplicity.

Training generative adversarial networks requires balancing of delicate adversarial dynamics. Even with careful tuning, training may diverge or end up in a bad equilibrium with dropped modes. In this work, we introduce a new form of latent optimisation inspired by the CS-GAN and show that it improves adversarial dynamics by enhancing interactions between the discriminator and the generator. We develop supporting theoretical analysis from the perspectives of differentiable games and stochastic approximation. Our experiments demonstrate that latent optimisation can significantly improve GAN training, obtaining state-of-the-art performance for the ImageNet (128 x 128) dataset. Our model achieves an Inception Score (IS) of 148 and an Fr\’echet Inception Distance (FID) of 3.4, an improvement of 17% and 32% in IS and FID respectively, compared with the baseline BigGAN-deep model with the same architecture and number of parameters.

In this work, we present the application of convolutional neural networks for segmenting water bodies in satellite images. We first use a variant of the U-Net model to segment rivers and lakes from very high-resolution images from Peru. To circumvent the issue of scarce labelled data, we investigate the applicability of a knowledge transfer-based model that learns the mapping from high-resolution labelled images and combines it with the very high-resolution mapping so that better segmentation can be achieved. We train this model in a single process, end-to-end. Our preliminary results show that adding the information from the available high-resolution images does not help out-of-the-box, and in fact worsen results. This leads us to infer that the high-resolution data could be from a different distribution, and its addition leads to increased variance in our results.

This paper builds the connection between graph neural networks and traditional dynamical systems. Existing graph neural networks essentially define a discrete dynamic on node representations with multiple graph convolution layers. We propose continuous graph neural networks (CGNN), which generalise existing graph neural networks into the continuous-time dynamic setting. The key idea is how to characterise the continuous dynamics of node representations, i.e. the derivatives of node representations w.r.t. time. Inspired by existing diffusion-based methods on graphs (e.g. PageRank and epidemic models on social networks), we define the derivatives as a combination of the current node representations, the representations of neighbors, and the initial values of the nodes. We propose and analyse different possible dynamics on graphs—including each dimension of node representations (a.k.a. the feature channel) change independently or interact with each other—both with theoretical justification. The proposed continuous graph neural networks are robust to over-smoothing and hence capture the long-range dependencies between nodes. Experimental results on the task of node classification prove the effectiveness of our proposed approach over competitive baselines.

Approximate nearest neighbor (ANN) search in high dimensions is an integral part of several computer vision systems and gains importance in deep learning with explicit memory representations. Since PQT and FAISS started to leverage the massive parallelism offered by GPUs, GPU-based implementations are a crucial resource for today’s state-of-the-art ANN methods. While most of these methods allow for faster queries, less emphasis is devoted to accelerate the construction of the underlying index structures. In this paper, we propose a novel search structure based on nearest neighbor graphs and information propagation on graphs. Our method is designed to take advantage of GPU architectures to accelerate the hierarchical building of the index structure and for performing the query. Empirical evaluation shows that GGNN significantly surpasses the state-of-the-art GPU- and CPU-based systems in terms of build-time, accuracy and search speed.

Machine learning algorithms have been shown to be suitable for securing platforms for IT systems. However, due to the fundamental differences between the industrial internet of things (IIoT) and regular IT networks, a special performance review needs to be considered. The vulnerabilities and security requirements of IIoT systems demand different considerations. In this paper, we study the reasons why machine learning must be integrated into the security mechanisms of the IIoT, and where it currently falls short in having a satisfactory performance. The challenges and real-world considerations associated with this matter are studied in our experimental design. We use an IIoT testbed resembling a real industrial plant to show our proof of concept.

Understanding the meaning of text often involves reasoning about entities and their relationships. This requires identifying textual mentions of entities, linking them to a canonical concept, and discerning their relationships. These tasks are nearly always viewed as separate components within a pipeline, each requiring a distinct model and training data. While relation extraction can often be trained with readily available weak or distant supervision, entity linkers typically require expensive mention-level supervision — which is not available in many domains. Instead, we propose a model which is trained to simultaneously produce entity linking and relation decisions while requiring no mention-level annotations. This approach avoids cascading errors that arise from pipelined methods and more accurately predicts entity relationships from text. We show that our model outperforms a state-of-the art entity linking and relation extraction pipeline on two biomedical datasets and can drastically improve the overall recall of the system.

Information systems (IS) are widely used in organisations to improve business performance. The steady progression in improving technologies like artificial intelligence (AI) and the need of securing future success of organisations lead to new requirements for IS. This research in progress firstly introduces the term AI-based services (AIBS) describing AI as a component enriching IS aiming at collaborating with employees and assisting in the execution of work-related tasks. The study derives requirements from ten expert interviews to successful design AIBS following Design Science Research (DSR). For a successful deployment of AIBS in organisations the D&M IS Success Model will be considered to validated requirements within three major dimensions of quality: Information Quality, System Quality, and Service Quality. Amongst others, preliminary findings propose that AIBS must be preferably authentic. Further discussion and research on AIBS is forced, thus, providing first insights on the deployment of AIBS in organisations.

With the increase of credit card usage, the volume of credit card misuse also has significantly increased. As a result, financial organizations are working hard on developing and deploying credit card fraud detection methods, in order to adapt to ever-evolving, increasingly sophisticated defrauding strategies and identifying illicit transactions as quickly as possible to protect themselves and their customers. Compounding on the complex nature of such adverse strategies, credit card fraudulent activities are rare events compared to the number of legitimate transactions. Hence, the challenge to develop fraud detection that are accurate and efficient is substantially intensified and, as a consequence, credit card fraud detection has lately become a very active area of research. In this work, we provide a survey of current techniques most relevant to the problem of credit card fraud detection. We carry out our survey in two main parts. In the first part,we focus on studies utilizing classical machine learning models, which mostly employ traditional transnational features to make fraud predictions. These models typically rely on some static physical characteristics, such as what the user knows (knowledge-based method), or what he/she has access to (object-based method). In the second part of our survey, we review more advanced techniques of user authentication, which use behavioral biometrics to identify an individual based on his/her unique behavior while he/she is interacting with his/her electronic devices. These approaches rely on how people behave (instead of what they do), which cannot be easily forged. By providing an overview of current approaches and the results reported in the literature, this survey aims to drive the future research agenda for the community in order to develop more accurate, reliable and scalable models of credit card fraud detection.

Multiple fairness constraints have been proposed in the literature, motivated by a range of concerns about how demographic groups might be treated unfairly by machine learning classifiers. In this work we consider a different motivation; learning from biased training data. We posit several ways in which training data may be biased, including having a more noisy or negatively biased labeling process on members of a disadvantaged group, or a decreased prevalence of positive or negative examples from the disadvantaged group, or both. Given such biased training data, Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution. We examine the ability of fairness-constrained ERM to correct this problem. In particular, we find that the Equal Opportunity fairness constraint (Hardt, Price, and Srebro 2016) combined with ERM will provably recover the Bayes Optimal Classifier under a range of bias models. We also consider other recovery methods including reweighting the training data, Equalized Odds, and Demographic Parity. These theoretical results provide additional motivation for considering fairness interventions even if an actor cares primarily about accuracy.

Most of the data-driven approaches applied to bearing fault diagnosis up to date are established in the supervised learning paradigm, which usually requires a large set of labeled data collected a priori. In practical applications, however, obtaining accurate labels based on real-time bearing conditions can be far more challenging than simply collecting a huge amount of unlabeled data using various sensors. In this paper, we thus propose a semi-supervised learning approach for bearing anomaly detection using variational autoencoder (VAE) based deep generative models, which allows for effective utilization of dataset when only a small subset of data have labels. Finally, a series of experiments is performed using both the Case Western Reserve University (CWRU) bearing dataset and the University of Cincinnati’s Center for Intelligent Maintenance Systems (IMS) dataset. The experimental results demonstrate that the proposed semi-supervised learning scheme greatly outperforms two mainstream semi-supervised learning approaches and a baseline supervised convolutional neural network approach, with the overall accuracy improvement ranging between 3\% to 30\% using different number of labeled samples.

t-SNE is a popular tool for embedding multi-dimensional datasets into two or three dimensions. However, it has a large computational cost, especially when the input data has many dimensions. Many use t-SNE to embed the output of a neural network, which is generally of much lower dimension than the original data. This limits the use of t-SNE in unsupervised scenarios. We propose using \textit{random} projections to embed high dimensional datasets into relatively few dimensions, and then using t-SNE to obtain a two dimensional embedding. We show that random projections preserve the desirable clustering achieved by t-SNE, while dramatically reducing the runtime of finding the embedding.

In safety-critical applications of machine learning, it is often necessary to look beyond standard metrics such as test accuracy in order to validate various qualitative properties such as monotonicity with respect to a feature or combination of features, checking for undesirable changes or oscillations in the response, and differences in outcomes (e.g. discrimination) for a protected class. To help answer this need, we propose a framework for approximately validating (or invalidating) various properties of a black box model by finding a univariate diagnostic curve in the input space whose output maximally violates a given property. These diagnostic curves show the exact value of the model along the curve and can be displayed with a simple and intuitive line graph. We demonstrate the usefulness of these diagnostic curves across multiple use-cases and datasets including selecting between two models and understanding out-of-sample behavior.

We present two approaches for computing rational approximations to multivariate functions, motivated by their effectiveness as surrogate models for high-energy physics (HEP) applications. Our first approach builds on the Stieltjes process to efficiently and robustly compute the coefficients of the rational approximation. Our second approach is based on an optimization formulation that allows us to include structural constraints on the rational approximation, resulting in a semi-infinite optimization problem that we solve using an outer approximation approach. We present results for synthetic and real-life HEP data, and we compare the approximation quality of our approaches with that of traditional polynomial approximations.

Social psychology and real experiences show that cognitive consistency plays an important role to keep human society in order: if people have a more consistent cognition about their environments, they are more likely to achieve better cooperation. Meanwhile, only cognitive consistency within a neighborhood matters because humans only interact directly with their neighbors. Inspired by these observations, we take the first step to introduce \emph{neighborhood cognitive consistency} (NCC) into multi-agent reinforcement learning (MARL). Our NCC design is quite general and can be easily combined with existing MARL methods. As examples, we propose neighborhood cognition consistent deep Q-learning and Actor-Critic to facilitate large-scale multi-agent cooperations. Extensive experiments on several challenging tasks (i.e., packet routing, wifi configuration, and Google football player control) justify the superior performance of our methods compared with state-of-the-art MARL approaches.

An intriguing phenomenon observed during training neural networks is the spectral bias, where neural networks are biased towards learning less complex functions. The priority of learning functions with low complexity might be at the core of explaining generalization ability of neural network, and certain efforts have been made to provide theoretical explanation for spectral bias. However, there is still no satisfying theoretical result justifying the underlying mechanism of spectral bias. In this paper, we give a comprehensive and rigorous explanation for spectral bias and relate it with the neural tangent kernel function proposed in recent work. We prove that the training process of neural networks can be decomposed along different directions defined by the eigenfunctions of the neural tangent kernel, where each direction has its own convergence rate and the rate is determined by the corresponding eigenvalue. We then provide a case study when the input data is uniformly distributed over the unit sphere, and show that lower degree spherical harmonics are easier to be learned by over-parameterized neural networks.

Data poisoning attacks compromise the integrity of machine-learning models by introducing malicious training samples to influence the results during test time. In this work, we investigate backdoor data poisoning attack on deep neural networks (DNNs) by inserting a backdoor pattern in the training images. The resulting attack will misclassify poisoned test samples while maintaining high accuracies for the clean test-set. We present two approaches for detection of such poisoned samples by quantifying the uncertainty estimates associated with the trained models. In the first approach, we model the outputs of the various layers (deep features) with parametric probability distributions learnt from the clean held-out dataset. At inference, the likelihoods of deep features w.r.t these distributions are calculated to derive uncertainty estimates. In the second approach, we use Bayesian deep neural networks trained with mean-field variational inference to estimate model uncertainty associated with the predictions. The uncertainty estimates from these methods are used to discriminate clean from the poisoned samples.

We define and analyze three mechanisms for getting common knowledge, a posteriori truths about the world onto a blockchain in a decentralized setting. We show that, when a reasonable economic condition is met, these mechanisms are individually rational, incentive compatible, and decide the true outcome of valid oracle queries in both the non-cooperative and cooperative settings. These mechanisms are based upon repeated games with two classes of players: queriers who desire to get common knowledge truths onto the blockchain and a pool of reporters who posses such common knowledge. Presented with a new oracle query, reporters have an opportunity to report the truth in return for a fee provided by the querier. During subsequent oracle queries, the querier has an opportunity to punish any reporters who did not report truthfully during previous rounds. While the set of reporters has the power to cause the oracle to lie, they are incentivized not to do so.

While many methods for learning vector space embeddings have been proposed in the field of Natural Language Processing, these methods typically do not distinguish between categories and individuals. Intuitively, if individuals are represented as vectors, we can think of categories as (soft) regions in the embedding space. Unfortunately, meaningful regions can be difficult to estimate, especially since we often have few examples of individuals that belong to a given category. To address this issue, we rely on the fact that different categories are often highly interdependent. In particular, categories often have conceptual neighbors, which are disjoint from but closely related to the given category (e.g.\ fruit and vegetable). Our hypothesis is that more accurate category representations can be learned by relying on the assumption that the regions representing such conceptual neighbors should be adjacent in the embedding space. We propose a simple method for identifying conceptual neighbors and then show that incorporating these conceptual neighbors indeed leads to more accurate region based representations.

Transferrable neural architecture search can be viewed as a binary optimization problem where a single optimal path should be selected among candidate paths in each edge within the repeated cell block of the directed a cyclic graph form. Recently, the field of differentiable architecture search attempts to relax the search problem continuously using a one-shot network that combines all the candidate paths in search space. However, when the one-shot network is pruned to the model in the discrete architecture space by the derivation algorithm, performance is significantly degraded to an almost random estimator. To reduce the quantization error from the heavy use of relaxation, we only sample a single edge to relax the corresponding variable and clamp variables in the other edges to zero or one.By this method, there is no performance drop after pruning the one-shot network by derivation algorithm, due to the preservation of the discrete nature of optimization variables during the search. Furthermore, the minimization of relaxation degree allows searching in a deeper network to discover better performance with remarkable search cost reduction (0.125 GPU days) compared to previous methods.By adding several regularization methods that help explore within the search space, we could obtain the network with notable performances on CIFAR-10, CIFAR-100, and ImageNet.

Background: Recently, an extensive amount of research has been focused on compressing and accelerating Deep Neural Networks (DNNs). So far, high compression rate algorithms required the entire training dataset, or its subset, for fine-tuning and low precision calibration process. However, this requirement is unacceptable when sensitive data is involved as in medical and biometric use-cases. Contributions: We present three methods for generating synthetic samples from trained models. Then, we demonstrate how these samples can be used to fine-tune or to calibrate quantized models with negligible accuracy degradation compared to the original training set — without using any real data in the process. Furthermore, we suggest that our best performing method, leveraging intrinsic batch normalization layers’ statistics of a trained model, can be used to evaluate data similarity. Our approach opens a path towards genuine data-free model compression, alleviating the need for training data during deployment.

The performance of a deep neural network is highly dependent on its training, and finding better local optimal solutions is the goal of many optimization algorithms. However, existing optimization algorithms show a preference for descent paths that converge slowly and do not seek to avoid bad local optima. In this work, we propose Learning Rate Dropout (LRD), a simple gradient descent technique for training related to coordinate descent. LRD empirically aids the optimizer to actively explore in the parameter space by randomly setting some learning rates to zero; at each iteration, only parameters whose learning rate is not 0 are updated. As the learning rate of different parameters is dropped, the optimizer will sample a new loss descent path for the current update. The uncertainty of the descent path helps the model avoid saddle points and bad local minima. Experiments show that LRD is surprisingly effective in accelerating training while preventing overfitting.

Error-correcting output codes (ECOC) is an ensemble method combining a set of binary classifiers for multi-class learning problems. However, in traditional ECOC framework, the binary classifiers are trained independently. To explore the interaction between the binary classifiers, we construct an error correction network (ECN) that jointly trains all binary classifiers while maximizing the ensemble diversity to improve its robustness against adversarial attacks. An ECN is built based on a code matrix which is generated by maximizing the error tolerance, i.e., the minimum Hamming distance between any two rows, as well as the ensemble diversity, i.e., the variation of information between any two columns. Though ECN inherently promotes the diversity between the binary classifiers as each ensemble member solves a different classification problem (specified by the corresponding column of the code matrix), we empirically show that the ensemble diversity can be further improved by forcing the weight matrices learned by ensemble members to be orthogonal. The ECN is trained in end-to-end fashion and can be complementary to other defense approaches including adversarial training. We show empirically that ECN is effective against the state-of-the-art while-box attacks while maintaining good accuracy on normal examples.

Architecture design has become a crucial component of successful deep learning. Recent progress in automatic neural architecture search (NAS) shows a lot of promise. However, discovered architectures often fail to generalize in the final evaluation. Architectures with a higher validation accuracy during the search phase may perform worse in the evaluation. Aiming to alleviate this common issue, we introduce sequential greedy architecture search (SGAS), an efficient method for neural architecture search. By dividing the search procedure into sub-problems, SGAS chooses and prunes candidate operations in a greedy fashion. We apply SGAS to search architectures for Convolutional Neural Networks (CNN) and Graph Convolutional Networks (GCN). Extensive experiments show that SGAS is able to find state-of-the-art architectures for tasks such as image classification, point cloud classification and node classification in protein-protein interaction graphs with minimal computational cost. Please visit https://…/sgas for more information about SGAS.

Processing large complex networks recently attracted considerable interest. Complex graphs are useful in a wide range of applications from technological networks to biological systems like the human brain. Sometimes these networks are composed of billions of entities that give rise to emerging properties and structures. Analyzing these structures aids us in gaining new insights about our surroundings. As huge networks become abundant, there is a need for scalable algorithms to perform analysis. A prominent example is the PageRank algorithm, which is one of the measures used by web search engines such as Google to rank web pages displayed to the user. In order to find these patterns, massive amounts of data have to be acquired and processed. Designing and evaluating scalable graph algorithms to handle these data sets is a crucial task on the road to understanding the underlying systems. This habilitation thesis is a summary a broad spectrum of scalable graph algorithms that I developed over the last six years with many coauthors. In general, this research is based on four pillars: multilevel algorithms, practical kernelization, parallelization and memetic algorithms that are highly interconnected. Experiments conducted indicate that our algorithms find better solutions and/or are much more scalable than the previous state-of-the-art.

Deep learning-based models have been very successful in achieving state-of-the-art results in many of the computer vision, speech recognition, and natural language processing tasks in the last few years. These models seem a natural fit for handling the ever-increasing scale of biometric recognition problems, from cellphone authentication to airport security systems. Deep learning-based models have increasingly been leveraged to improve the accuracy of different biometric recognition systems in recent years. In this work, we provide a comprehensive survey of more than 120 promising works on biometric recognition (including face, fingerprint, iris, palmprint, ear, voice, signature, and gait recognition), which deploy deep learning models, and show their strengths and potentials in different applications. For each biometric, we first introduce the available datasets that are widely used in the literature and their characteristics. We will then talk about several promising deep learning works developed for that biometric, and show their performance on popular public benchmarks. We will also discuss some of the main challenges while using these models for biometric recognition, and possible future directions to which research in this area is headed.

This paper studies the multi-cascade influence maximization problem, which explores strategies for launching one information cascade in a social network with multiple existing cascades. With natural extensions to the classic models, we first propose the independent multi-cascade model where the diffusion process is governed by the so-called activation function. We show that the proposed model is sufficiently flexible as it generalizes most of the existing cascade-based models. We then study the multi-cascade influence maximization problem under the designed model and provide approximation hardness under common complexity assumptions, namely Exponential Time Hypothesis and . Given the hardness results, we build a framework for designing heuristic seed selection algorithms with a testable data-dependent approximation ratio. The designed algorithm leverages upper and lower bounds, which reveal the key combinatorial structure behind the multi-cascade influence maximization problem. The performance of the framework is theoretically analyzed and practically evaluated through extensive simulations. The superiority of the proposed solution is supported by encouraging experimental results, in terms of effectiveness and efficiency.

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule facilitating neural network convergence at up to workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and the benchmark Large Movie Review Dataset~\cite{imdb}. Half-precision significantly reduces memory and network bandwidth, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving a comparable test set performance as single precision.

Even as deep neural networks have become very effective for tasks in vision and perception, it remains difficult to explain and debug their behavior. In this paper, we present a programmatic and semantic approach to explaining, understanding, and debugging the correct and incorrect behaviors of a neural network based perception system. Our approach is semantic in that it employs a high-level representation of the distribution of environment scenarios that the detector is intended to work on. It is programmatic in that the representation is a program in a domain-specific probabilistic programming language using which synthetic data can be generated to train and test the neural network. We present a framework that assesses the performance of the neural network to identify correct and incorrect detections, extracts rules from those results that semantically characterizes the correct and incorrect scenarios, and then specializes the probabilistic program with those rules in order to more precisely characterize the scenarios in which the neural network operates correctly or not, without human intervention to identify important features. We demonstrate our results using the SCENIC probabilistic programming language and a neural network-based object detector. Our experiments show that it is possible to automatically generate compact rules that significantly increase the correct detection rate (or conversely the incorrect detection rate) of the network and can thus help with debugging and understanding its behavior.

Playing an essential role in data mining, machine learning has a long history of being applied to networks on multifarious tasks and has played an essential role in data mining. However, the discrete and sparse natures of networks often render it difficult to apply machine learning directly to networks. To circumvent this difficulty, one major school of thought to approach networks using machine learning is via network embeddings. On the one hand, this network embeddings have achieved huge success on aggregated network data in recent years. On the other hand, learning network embeddings on distributively stored networks still remained understudied: To the best of our knowledge, all existing algorithms for learning network embeddings have hitherto been exclusively centralized and thus cannot be applied to these networks. To accommodate distributively stored networks, in this paper, we proposed a multi-agent model. Under this model, we developed the multi-agent network embedding learning algorithm (MANELA) for learning network embeddings. We demonstrate MANELA’s advantages over other existing centralized network embedding learning algorithms both theoretically and experimentally. Finally, we further our understanding in MANELA via visualization and exploration of its relationship to DeepWalk.

In this paper, we introduce Anomaly Contribution Explainer or ACE, a tool to explain security anomaly detection models in terms of the model features through a regression framework, and its variant, ACE-KL, which highlights the important anomaly contributors. ACE and ACE-KL provide insights in diagnosing which attributes significantly contribute to an anomaly by building a specialized linear model to locally approximate the anomaly score that a black-box model generates. We conducted experiments with these anomaly detection models to detect security anomalies on both synthetic data and real data. In particular, we evaluate performance on three public data sets: CERT insider threat, netflow logs, and Android malware. The experimental results are encouraging: our methods consistently identify the correct contributing feature in the synthetic data where ground truth is available; similarly, for real data sets, our methods point a security analyst in the direction of the underlying causes of an anomaly, including in one case leading to the discovery of previously overlooked network scanning activity. We have made our source code publicly available.

We propose a novel model for a topic-aware chatbot by combining the traditional Recurrent Neural Network (RNN) encoder-decoder model with a topic attention layer based on Nonnegative Matrix Factorization (NMF). After learning topic vectors from an auxiliary text corpus via NMF, the decoder is trained so that it is more likely to sample response words from the most correlated topic vectors. One of the main advantages in our architecture is that the user can easily switch the NMF-learned topic vectors so that the chatbot obtains desired topic-awareness. We demonstrate our model by training on a single conversational data set which is then augmented with topic matrices learned from different auxiliary data sets. We show that our topic-aware chatbot not only outperforms the non-topic counterpart, but also that each topic-aware model qualitatively and contextually gives the most relevant answer depending on the topic of question.

Notwithstanding the tremendous progress that is taking place in spoken language technology, effective speech-based human-robot interaction still raises a number of important challenges. Not only do the fields of robotics and spoken language technology present their own special problems, but their combination raises an additional set of issues. In particular, there is a large gap between the formulaic speech that typifies contemporary spoken dialogue systems and the flexible nature of human-human conversation. It is pointed out that grounded and situated speech-based human-robot interaction may lead to deeper insights into the pragmatics of language usage, thereby overcoming the current `habitability gap’.

Deep metric learning has yielded impressive results in tasks such as clustering and image retrieval by leveraging neural networks to obtain highly discriminative feature embeddings, which can be used to group samples into different classes. Much research has been devoted to the design of smart loss functions or data mining strategies for training such networks. Most methods consider only pairs or triplets of samples within a mini-batch to compute the loss function, which is commonly based on the distance between embeddings. We propose Group Loss, a loss function based on a differentiable label-propagation method that enforces embedding similarity across all samples of a group while promoting, at the same time, low-density regions amongst data points belonging to different groups. Guided by the smoothness assumption that ‘similar objects should belong to the same group’, the proposed loss trains the neural network for a classification task, enforcing a consistent labelling amongst samples within a class. We show state-of-the-art results on clustering and image retrieval on several datasets, and show the potential of our method when combined with other techniques such as ensembles

In pattern recognition or machine learning, it is a very fundamental task to find nearest neighbors of a given point. All the methods for the task work basically by comparing the given point to all the points in the data set. That is why the computational cost increases with the number of data points. However, the human visual system seems to work in a different way. When the human visual system tries to find the neighbors of one point on a map, it directly focuses on the area around the point and actively searches the neighbors by looking or zooming in and out around the point. In this paper, we propose an innovative search method for nearest neighbors, which seems very similar to how human visual system works on the task.

Pointer analysis is a foundational analysis leveraged by various static analyses. Therefore, it gathered wide attention in research for decades. Some pointer analysis frameworks are based on succinct declarative specifications. However, these tools are heterogeneous in terms of the underlying intermediate representation (IR), heap abstraction, and programming methodology. This situation complicates a fair comparison of these frameworks and thus hinders further research. Consequently, the literature lacks an evaluation of the strengths and weaknesses of these tools. In this work, we evaluate two major frameworks for pointer analysis, WALA and Doop, on the DaCapo set of benchmarks. We compare the pointer analyses available in Wala and Doop, and conclude that—even though based on a declarative specification—Doop provides a better pointer analysis than Wala in terms of precision and scalability. We also compare the two IRs used in Doop, i.e., Jimple from the Soot framework and IR from the Wala framework. Our evaluation shows that in the majority of the benchmarks Soot’s IR gives a more precise and scalable pointer analysis. Finally, we propose a micro-benchmark \emph{PointerBench}, for which we manually validate the points-to statistics to evaluate the results of these tools.

Estimating the travel time of any route is of great importance for trip planners, traffic operators, online taxi dispatching and ride-sharing platforms, and navigation provider systems. With the advance of technology, many traveling cars, including online taxi dispatch systems’ vehicles are equipped with Global Positioning System (GPS) devices that can report the location of the vehicle every few seconds. This paper uses GPS data and the Matrix Factorization techniques to estimate the travel times on all road segments and time intervals simultaneously. We aggregate GPS data into a matrix, where each cell of the original matrix contains the average vehicle speed for a segment and a specific time interval. One of the problems with this matrix is its high sparsity. We use Alternating Least Squares (ALS) method along with a regularization term to factorize the matrix. Since this approach can solve the sparsity problem that arises from the absence of cars in many road segments in a specific time interval, matrix factorization is suitable for estimating the travel time. Our comprehensive evaluation results using real data provided by one of the largest online taxi dispatching systems in Iran, shows the strength of our proposed method.

This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of for large . It also exactly matches the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for the kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the less known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.

Learning the underlying patterns in the data goes beyond instance-based generalization to some external knowledge represented in structured graphs or networks. Deep Learning (DL) has shown significant advances in probabilistically learning latent patterns in the data using a multi-layered network of computational nodes (i.e. neurons/hidden units). However, with the tremendous amount of training data, uncertainty in generalization on domain-specific tasks, and delta improvement with an increase in complexity of models seem to raise a concern on the features learned by the model. As incorporation of domain specific knowledge will aid in supervising the learning of features for the model, infusion of knowledge from knowledge graphs within hidden layers will further enhance the learning process. Although much work remains, we believe that KGs will play an increasing role in developing hybrid neuro-symbolic intelligent systems (that is bottom up deep learning with top down symbolic computing) as well as in building explainable AI systems for which KGs will provide a scaffolding for punctuating neural computing. In this position paper, we describe our motivation for such hybrid approach and a framework that combines knowledge graph and neural networks.

Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years — predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from six key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, and the exploitation of correlation filter advantages. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, and LaSOT. Finally, by conducting critical analyses of these state-of-the-art methods both quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. It may serve as a gentle use guide for practitioners to weigh on when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

In this paper, we introduce the prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer which uses multi-scale multi-head self-attention to capture features from different scales. Based on the linguistic perspective and the analysis of pre-trained Transformer (BERT) on a huge corpus, we further design a strategy to control the scale distribution for each layer. Results of three different kinds of tasks (21 datasets) show our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate size datasets.

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark dataframes and Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.

The rapid growth of deep learning applications in real life is accompanied by severe safety concerns. To mitigate this uneasy phenomenon, much research has been done providing reliable evaluations of the fragility level in different deep neural networks. Apart from devising adversarial attacks, quantifiers that certify safeguarded regions have also been designed in the past five years. The summarizing work of Salman et al. unifies a family of existing verifiers under a convex relaxation framework. We draw inspiration from such work and further demonstrate the optimality of deterministic CROWN (Zhang et al. 2018) solutions in a given linear programming problem under mild constraints. Given this theoretical result, the computationally expensive linear programming based method is shown to be unnecessary. We then propose an optimization-based approach \textit{FROWN} (\textbf{F}astened C\textbf{ROWN}): a general algorithm to tighten robustness certificates for neural networks. Extensive experiments on various networks trained individually verify the effectiveness of FROWN in safeguarding larger robust regions.

With rapid technological advancements within the domain of Internet of Things (IoT), strong trends have emerged which indicate a rapid growth in the number of smart devices connected to IoT networks and this growth cannot be supported by traditional cloud computing platforms. In response to the increased capacity of data being transferred over networks, the edge and fog computing paradigms have emerged as extremely viable frameworks that shift computational and storage resources towards the edge of the network, thereby migrating processing power from centralized cloud servers to distributed LAN resources and powerful embedded devices within the network. These computing paradigms, therefore, have the potential to support massive IoT networks of the future and have also fueled the advancement of IoT systems within industrial settings, leading to the creation of the Industrial Internet of Things (IIoT) technology that is revolutionizing industrial processes in a variety of domains. In this paper, we elaborate on the impact of edge and fog computing paradigms on IIoT. We also highlight the how edge and fog computing are poised to bring about a turnaround in several industrial applications through a use-case approach. Finally, we conclude with the current issues and challenges faced by these paradigms in IIoT and suggest some research directions that should be followed to solve these problems and accelerate the adaptation of edge and fog computing in IIoT.

Matrix completion is widely used in machine learning, engineering control, image processing, and recommendation systems. Currently, a popular algorithm for matrix completion is Singular Value Threshold (SVT). In this algorithm, the singular value threshold should be set first. However, in a recommendation system, the dimension of the preference matrix keeps changing. Therefore, it is difficult to directly apply SVT. In addition, what the users of a recommendation system need is a sequence of personalized recommended results rather than the estimation of their scores. According to the above ideas, this paper proposes a novel approach named probability completion model~(PCM). By reducing the data dimension, the transitivity of the similar matrix, and singular value decomposition, this approach quickly obtains a completion matrix with the same probability distribution as the original matrix. The approach greatly reduces the computation time based on the accuracy of the sacrifice part, and can quickly obtain a low-rank similarity matrix with data trend approximation properties. The experimental results show that PCM can quickly generate a complementary matrix with similar data trends as the original matrix. The LCS score and efficiency of PCM are both higher than SVT.

The problem of hyperparameter optimization exists widely in the real life and many common tasks can be transformed into it, such as neural architecture search and feature subset selection. Without considering various constraints, the existing hyperparameter tuning techniques can solve these problems effectively by traversing as many hyperparameter configurations as possible. However, because of the limited resources and budget, it is not feasible to evaluate so many kinds of configurations, which requires us to design effective algorithms to find a best possible hyperparameter configuration with a finite number of configuration evaluations. In this paper, we simulate human thinking processes and combine the merit of the existing techniques, and thus propose a new algorithm called ExperienceThinking, trying to solve this constrained hyperparameter optimization problem. In addition, we analyze the performances of 3 classical hyperparameter optimization algorithms with a finite number of configuration evaluations, and compare with that of ExperienceThinking. The experimental results show that our proposed algorithm provides superior results and has better performance.

Application of K-Means algorithm is restricted by the fact that the number of clusters should be known beforehand. Previously suggested methods to solve this problem are either ad hoc or require parametric assumptions and complicated calculations. The proposed method aims to solve this conundrum by considering cluster hypersphere density as the factor to determine the number of clusters in the given dataset. The density is calculated by assuming a hypersphere around the cluster centroid for n-different number of clusters. The calculated values are plotted against their corresponding number of clusters and then the optimum number of clusters is obtained after assaying the elbow region of the graph. The method is simple and easy to comprehend and provides robust and reliable results.

A low-complexity neural network based approach for channel estimation was proposed recently, where assumptions on the channel model were incorporated into the design procedure of the estimator. Instead of using data from a measurement campaign as done in previous work, we evaluate the performance of the convolutional neural network (CNN) based channel estimator by using a reproducible mmWave environment of the DeepMIMO dataset. We further propose a neural network based predictor which is derived by starting from the linear minimum mean squared error (LMMSE) predictor. We start by deriving a weighted sum of LMMSE predictors which is motivated by the structure of the optimal MMSE predictor. This predictor provides an initialization (weight matrices, biases and activation function) to a feed-forward neural network based predictor. With a properly learned neural network, we show that it is possible to easily outperform the LMMSE predictor based on the Jakes assumption of the underlying Doppler spectrum in an reproducible indoor scenario of the DeepMIMO dataset.

In this work we explore the use of latent representations obtained from multiple input sensory modalities (such as images or sounds) in allowing an agent to learn and exploit policies over different subsets of input modalities. We propose a three-stage architecture that allows a reinforcement learning agent trained over a given sensory modality, to execute its task on a different sensory modality-for example, learning a visual policy over image inputs, and then execute such policy when only sound inputs are available. We show that the generalized policies achieve better out-of-the-box performance when compared to different baselines. Moreover, we show this holds in different OpenAI gym and video game environments, even when using different multimodal generative models and reinforcement learning algorithms.

We propose semantic region-adaptive normalization (SEAN), a simple but effective building block for Generative Adversarial Networks conditioned on segmentation masks that describe the semantic regions in the desired output image. Using SEAN normalization, we can build a network architecture that can control the style of each semantic region individually, e.g., we can specify one style reference image per region. SEAN is better suited to encode, transfer, and synthesize style than the best previous method in terms of reconstruction quality, variability, and visual quality. We evaluate SEAN on multiple datasets and report better quantitative metrics (e.g. FID, PSNR) than the current state of the art. SEAN also pushes the frontier of interactive image editing. We can interactively edit images by changing segmentation masks or the style for any given region. We can also interpolate styles from two reference images per region.

Sequential modelling with self-attention has achieved cutting edge performances in natural language processing. With advantages in model flexibility, computation complexity and interpretability, self-attention is gradually becoming a key component in event sequence models. However, like most other sequence models, self-attention does not account for the time span between events and thus captures sequential signals rather than temporal patterns. Without relying on recurrent network structures, self-attention recognizes event orderings via positional encoding. To bridge the gap between modelling time-independent and time-dependent event sequence, we introduce a functional feature map that embeds time span into high-dimensional spaces. By constructing the associated translation-invariant time kernel function, we reveal the functional forms of the feature map under classic functional function analysis results, namely Bochner’s Theorem and Mercer’s Theorem. We propose several models to learn the functional time representation and the interactions with event representation. These methods are evaluated on real-world datasets under various continuous-time event sequence prediction tasks. The experiments reveal that the proposed methods compare favorably to baseline models while also capturing useful time-event interactions.

Domain adaptation (DA) and domain generalization (DG) have emerged as a solution to the domain shift problem where the distribution of the source and target data is different. The task of DG is more challenging than DA as the target data is totally unseen during the training phase in DG scenarios. The current state-of-the-art employs adversarial techniques, however, these are rarely considered for the DG problem. Furthermore, these approaches do not consider correlation alignment which has been proven highly beneficial for minimizing domain discrepancy. In this paper, we propose a correlation-aware adversarial DA and DG framework where the features of the source and target data are minimized using correlation alignment along with adversarial learning. Incorporating the correlation alignment module along with adversarial learning helps to achieve a more domain agnostic model due to the improved ability to reduce domain discrepancy with unlabeled target data more effectively. Experiments on benchmark datasets serve as evidence that our proposed method yields improved state-of-the-art performance.

Abstract argumentation has emerged as a method for non-monotonic reasoning that has gained tremendous traction in the symbolic artificial intelligence community. In the literature, the different approaches to abstract argumentation that were refined over the years are typically evaluated from a logics perspective; an analysis that is based on models of ideal, rational decision-making does not exist. In this paper, we close this gap by analyzing abstract argumentation from the perspective of the rational man paradigm in microeconomic theory. To assess under which conditions abstract argumentation-based choice functions can be considered economically rational, we define a new argumentation principle that ensures compliance with the rational man’s reference independence property, which stipulates that a rational agent’s preferences over two choice options should not be influenced by the absence or presence of additional options. We show that the argumentation semantics as proposed in Dung’s classical paper, as well as all of a range of other semantics we evaluate do not fulfill this newly created principle. Consequently, we investigate how structural properties of argumentation frameworks impact the reference independence principle, and propose a restriction to argumentation expansions that allows all of the evaluated semantics to fulfill the requirements for economically rational argumentation-based choice. For this purpose, we define the rational man’s expansion as a normal and non-cyclic expansion. Finally, we put reference independence into the context of preference-based argumentation and show that for this argumentation variant, which explicitly model preferences, the rational man’s expansion cannot ensure reference independence.

We consider a trainable point-to-point communication system, where both transmitter and receiver are implemented as neural networks (NNs), and demonstrate that training on the bit-wise mutual information (BMI) allows seamless integration with practical bit-metric decoding (BMD) receivers, as well as joint optimization of constellation shaping and labeling. Moreover, we present a fully differentiable neural iterative demapping and decoding (IDD) structure which achieves significant gains on additive white Gaussian noise (AWGN) channels using a standard 802.11n low-density parity-check (LDPC) code. The strength of this approach is that it can be applied to arbitrary channels without any modifications. Going one step further, we show that careful code design can lead to further performance improvements. Lastly, we show the viability of the proposed system through implementation on software-defined radios (SDRs) and training of the end-to-end system on the actual wireless channel. Experimental results reveal that the proposed method enables significant gains compared to conventional techniques.

Wasserstein-GANs have been introduced to address the deficiencies of generative adversarial networks (GANs) regarding the problems of vanishing gradients and mode collapse during the training, leading to improved convergence behaviour and improved image quality. However, Wasserstein-GANs require the discriminator to be Lipschitz continuous. In current state-of-the-art Wasserstein-GANs this constraint is enforced via gradient norm regularization. In this paper, we demonstrate that this regularization does not encourage a broad distribution of spectral-values in the discriminator weights, hence resulting in less fidelity in the learned distribution. We therefore investigate the possibility of substituting this Lipschitz constraint with an orthogonality constraint on the weight matrices. We compare three different weight orthogonalization techniques with regards to their convergence properties, their ability to ensure the Lipschitz condition and the achieved quality of the learned distribution. In addition, we provide a comparison to Wasserstein-GANs trained with current state-of-the-art methods, where we demonstrate the potential of solely using orthogonality-based regularization. In this context, we propose an improved training procedure for Wasserstein-GANs which utilizes orthogonalization to further increase its generalization capability. Finally, we provide a novel metric to evaluate the generalization capabilities of the discriminators of different Wasserstein-GANs.

Literature analysis facilitates researchers better understanding the development of science and technology. The conventional literature analysis focuses on the topics, authors, abstracts, keywords, references, etc., and rarely pays attention to the content of papers. In the field of machine learning, the involved methods (M) and datasets (D) are key information in papers. The extraction and mining of M and D are useful for discipline analysis and algorithm recommendation. In this paper, we propose a novel entity recognition model, called MDER, and constructe datasets from the papers of the PAKDD conferences (2009-2019). Some preliminary experiments are conducted to assess the extraction performance and the mining results are visualized.

This paper treats functional marked point processes (FMPPs), which are defined as marked point processes where the marks are random elements in some (Polish) function space. Such marks may represent e.g. spatial paths or functions of time. To be able to consider e.g. multivariate FMPPs, we also attach an additional, Euclidean, mark to each point. We indicate how FMPPs quite naturally connect the point process framework with both the functional data analysis framework and the geostatistical framework. We further show that various existing models fit well into the FMPP framework. In addition, we introduce a new family of summary statistics, weighted marked reduced moment measures, together with their non-parametric estimators, in order to study features of the functional marks. We further show how they generalise other summary statistics and we finally apply these tools to analyse population structures, such as demographic evolution and sex ratio over time, in Spanish provinces.

In this work we present ISA, a novel approach for learning and exploiting subgoals in reinforcement learning (RL). Our method relies on inducing an automaton whose transitions are subgoals expressed as propositional formulas over a set of observable events. A state-of-the-art inductive logic programming system is used to learn the automaton from observation traces perceived by the RL agent. The reinforcement learning and automaton learning processes are interleaved: a new refined automaton is learned whenever the RL agent generates a trace not recognized by the current automaton. We evaluate ISA in several gridworld problems and show that it performs similarly to a method for which automata are given in advance. We also show that the learned automata can be exploited to speed up convergence through reward shaping and transfer learning across multiple tasks. Finally, we analyze the running time and the number of traces that ISA needs to learn an automata, and the impact that the number of observable events has on the learner’s performance.

In few-shot learning, typically, the loss function which is applied at test time is the one we are ultimately interested in minimising, such as the mean-squared-error loss for a regression problem. However, given that we have few samples at test time, we argue that the loss function that we are interested in minimising is not necessarily the loss function most suitable for computing gradients in a few-shot setting. We propose VIABLE, a generic meta-learning extension that builds on existing meta-gradient-based methods by learning a differentiable loss function, replacing the pre-defined inner-loop loss function in performing task-specific updates. We show that learning a loss function capable of leveraging relational information between samples reduces underfitting, and significantly improves performance and sample efficiency on a simple regression task. Furthermore, we show VIABLE is scalable by evaluating on the Mini-Imagenet dataset.

In this paper, we propose DeepAlign, a novel approach to multi-perspective process anomaly correction, based on recurrent neural networks and bidirectional beam search. At the core of the DeepAlign algorithm are two recurrent neural networks trained to predict the next event. One is reading sequences of process executions from left to right, while the other is reading the sequences from right to left. By combining the predictive capabilities of both neural networks, we show that it is possible to calculate sequence alignments, which are used to detect and correct anomalies. DeepAlign utilizes the case-level and event-level attributes to closely model the decisions within a process. We evaluate the performance of our approach on an elaborate data corpus of 30 realistic synthetic event logs and compare it to three state-of-the-art conformance checking methods. DeepAlign produces better corrections than the rest of the field reaching an overall accuracy of 98.45% across all datasets, whereas the best comparable state-of-the-art method reaches 70.19%.

Financial time series forecasting is, without a doubt, the top choice of computational intelligence for finance researchers from both academia and financial industry due to its broad implementation areas and substantial impact. Machine Learning (ML) researchers came up with various models and a vast number of studies have been published accordingly. As such, a significant amount of surveys exist covering ML for financial time series forecasting studies. Lately, Deep Learning (DL) models started appearing within the field, with results that significantly outperform traditional ML counterparts. Even though there is a growing interest in developing models for financial time series forecasting research, there is a lack of review papers that were solely focused on DL for finance. Hence, our motivation in this paper is to provide a comprehensive literature review on DL studies for financial time series forecasting implementations. We not only categorized the studies according to their intended forecasting implementation areas, such as index, forex, commodity forecasting, but also grouped them based on their DL model choices, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Long-Short Term Memory (LSTM). We also tried to envision the future for the field by highlighting the possible setbacks and opportunities, so the interested researchers can benefit.

Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these ‘untrained subnetworks’ exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an ‘untrained subnetwork’ approaches a network with learned weights in accuracy.

The performance of deep neural networks is often attributed to their automated, task-related feature construction. It remains an open question, though, why this leads to solutions with good generalization, even in cases where the number of parameters is larger than the number of samples. Back in the 90s, Hochreiter and Schmidhuber observed that flatness of the loss surface around a local minimum correlates with low generalization error. For several flatness measures, this correlation has been empirically validated. However, it has recently been shown that existing measures of flatness cannot theoretically be related to generalization due to a lack of invariance with respect to reparameterizations. We propose a natural modification of existing flatness measures that results in invariance to reparameterization.

In this paper, we transform tag recommendation into a word-based text generation problem and introduce a sequence-to-sequence model. The model inherits the advantages of LSTM-based encoder for sequential modeling and attention-based decoder with local positional encodings for learning relations globally. Experimental results on Zhihu datasets illustrate the proposed model outperforms other state-of-the-art text classification based methods.

Despite their importance in training artificial intelligence systems, large datasets remain challenging to acquire. For example, the ImageNet dataset required fourteen million labels of basic human knowledge, such as whether an image contains a chair. Unfortunately, this knowledge is so simple that it is tedious for human annotators but also tacit enough such that they are necessary. However, human collaborative efforts for tasks like labeling massive amounts of data are costly, inconsistent, and prone to failure, and this method does not resolve the issue of the resulting dataset being static in nature. What if we asked people questions they want to answer and collected their responses as data? This would mean we could gather data at a much lower cost, and expanding a dataset would simply become a matter of asking more questions. We focus on the task of Visual Question Answering (VQA) and propose a system that uses Visual Question Generation (VQG) to produce questions, asks them to social media users, and collects their responses. We present two models that can then parse clean answers from the noisy human responses significantly better than our baselines, with the goal of eventually incorporating the answers into a Visual Question Answering (VQA) dataset. By demonstrating how our system can collect large amounts of data at little to no cost, we envision similar systems being used to improve performance on other tasks in the future.

This is an overview of the R package iprior, which implements a unified methodology for fitting parametric and nonparametric regression models, including additive models, multilevel models, and models with one or more functional covariates. Based on the principle of maximum entropy, an I-prior is an objective Gaussian process prior for the regression function with covariance kernel equal to its Fisher information. The regression function is estimated by its posterior mean under the I-prior, and hyperparameters are estimated via maximum marginal likelihood. Estimation of I-prior models is simple and inference straightforward, while small and large sample predictive performances are comparative, and often better, to similar leading state-of-the-art models. We illustrate the use of the iprior package by analysing a simulated toy data set as well as three real-data examples, in particular, a multilevel data set, a longitudinal data set, and a dataset involving a functional covariate.

Past literature has been effective in demonstrating ideological gaps in machine learning (ML) fairness definitions when considering their use in complex socio-technical systems. However, we go further to demonstrate that these definitions often misunderstand the legal concepts from which they purport to be inspired, and consequently inappropriately co-opt legal language. In this paper, we demonstrate examples of this misalignment and discuss the differences in ML terminology and their legal counterparts, as well as what both the legal and ML fairness communities can learn from these tensions. We focus this paper on U.S. anti-discrimination law since the ML fairness research community regularly references terms from this body of law.

Deep neural networks with more parameters and FLOPs have higher capacity and generalize better to diverse domains. But to be deployed on edge devices, the model’s complexity has to be constrained due to limited compute resource. In this work, we propose a method to improve the model capacity without increasing inference-time complexity. Our method is based on an assumption of data locality: for an edge device, within a short period of time, the input data to the device are sampled from a single domain with relatively low diversity. Therefore, it is possible to utilize a specialized, low-complexity model to achieve good performance in that input domain. To leverage this, we propose Domain-aware Dynamic Network (DDN), which is a high-capacity dynamic network in which each layer contains multiple weights. During inference, based on the input domain, DDN dynamically combines those weights into one single weight that specializes in the given domain. This way, DDN can keep the inference-time complexity low but still maintain a high capacity. Experiments show that without increasing the parameters, FLOPs, and actual latency, DDN achieves up to 2.6\% higher AP50 than a static network on the BDD100K object-detection benchmark.

Recent advances in artificial intelligence research have led to a profusion of studies that apply deep learning to problems in image analysis and natural language processing among others. Additionally, the availability of open-source computational frameworks has lowered the barriers to implementing state-of-the-art methods across multiple domains. Albeit leading to major performance breakthroughs in some tasks, effective dissemination of deep learning algorithms remains challenging, inhibiting reproducibility and benchmarking studies, impeding further validation, and ultimately hindering their effectiveness in the cumulative scientific progress. In developing a platform for sharing research outputs, we present ModelHub.AI (www.modelhub.ai), a community-driven container-based software engine and platform for the structured dissemination of deep learning models. For contributors, the engine controls data flow throughout the inference cycle, while the contributor-facing standard template exposes model-specific functions including inference, as well as pre- and post-processing. Python and RESTful Application programming interfaces (APIs) enable users to interact with models hosted on ModelHub.AI and allows both researchers and developers to utilize models out-of-the-box. ModelHub.AI is domain-, data-, and framework-agnostic, catering to different workflows and contributors’ preferences.

The aim of this paper is to investigate various information-theoretic measures, including entropy, mutual information, and some systematic measures that based on mutual information, for a class of structured spiking neuronal network. In order to analyze and compute these information-theoretic measures for large networks, we coarse-grained the data by ignoring the order of spikes that fall into the same small time bin. The resultant coarse-grained entropy mainly capture the information contained in the rhythm produced by a local population of the network. We first proved that these information theoretical measures are well-defined and computable by proving the stochastic stability and the law of large numbers. Then we use three neuronal network examples, from simple to complex, to investigate these information-theoretic measures. Several analytical and computational results about properties of these information-theoretic measures are given.

Supervised systems require human labels for training. But, are humans themselves always impartial during the annotation process? We examine this question in the context of automated assessment of human behavioral tasks. Specifically, we investigate whether human ratings themselves can be trusted at their face value when scoring video-based structured interviews, and whether such ratings can impact machine learning models that use them as training data. We present preliminary empirical evidence that indicates there might be biases in such annotations, most of which are visual in nature.

Recently the concept of transformative AI (TAI) has begun to receive attention in the AI policy space. TAI is often framed as an alternative formulation to notions of strong AI (e.g. artificial general intelligence or superintelligence) and reflects increasing consensus that advanced AI which does not fit these definitions may nonetheless have extreme and long-lasting impacts on society. However, the term TAI is poorly defined and often used ambiguously. Some use the notion of TAI to describe levels of societal transformation associated with previous ‘general purpose technologies’ (GPTs) such as electricity or the internal combustion engine. Others use the term to refer to more drastic levels of transformation comparable to the agricultural or industrial revolutions. The notion has also been used much more loosely, with some implying that current AI systems are already having a transformative impact on society. This paper unpacks and analyses the notion of TAI, proposing a distinction between TAI and radically transformative AI (RTAI), roughly corresponding to societal change on the level of the agricultural or industrial revolutions. We describe some relevant dimensions associated with each and discuss what kinds of advances in capabilities they might require. We further consider the relationship between TAI and RTAI and whether we should necessarily expect a period of TAI to precede the emergence of RTAI. This analysis is important as it can help guide discussions among AI policy researchers about how to allocate resources towards mitigating the most extreme impacts of AI and it can bring attention to negative TAI scenarios that are currently neglected.

We present, FT-SWRL, a fuzzy temporal extension to the Semantic Web Rule Language (SWRL), which combines fuzzy theories based on the valid-time temporal model to provide a standard approach for modeling imprecise temporal domain knowledge in OWL ontologies. The proposal introduces a fuzzy temporal model for the semantic web, which is syntactically defined as a fuzzy temporal SWRL ontology (SWRL-FTO) with a new set of fuzzy temporal SWRL built-ins for defining their semantics. The SWRL-FTO hierarchically defines the necessary linguistic terminologies and variables for the fuzzy temporal model. An example model demonstrating the usefulness of the fuzzy temporal SWRL built-ins to model imprecise temporal information is also represented. Fuzzification process of interval-based temporal logic is further discussed as a reasoning paradigm for our FT-SWRL rules, with the aim of achieving a complete OWL-based fuzzy temporal reasoning. Literature review on fuzzy temporal representation approaches, both within and without the use of ontologies, led to the conclusion that the FT-SWRL model can authoritatively serve as a formal specification for handling imprecise temporal expressions on the semantic web.

This paper introduces the -distance matching problem, in which we are given a bipartite graph with , a weight function on the edges and an integer . The goal is to find a maximum weight subset of the edges satisfying the following two conditions: i) the degree of every node of is at most one in , ii) if for some , then . The question finds applications, for example, in various scheduling problems. We show that the problem is NP-complete in general and admits a simple -approximation. We give an FPT algorithm parameterized by and also settle the case when the size of is constant. From an approximability point of view, we consider several greedy approaches. In particular, a local search algorithm is presented that achieves an approximation ratio of for any constant in the unweighted case. We show that the integrality gap of the natural integer programming model is at most , and give an LP-based approximation algorithm for the weighted case with the same guarantee. The novel approaches used in the analysis of the integrality gap and the approximation ratio of locally optimal solutions might be of independent combinatorial interest.

In this paper we present a new framework for time-series modeling that combines the best of traditional statistical models and neural networks. We focus on time-series with long-range dependencies, needed for monitoring fine granularity data (e.g. minutes, seconds, milliseconds), prevalent in operational use-cases. Traditional models, such as auto-regression fitted with least squares (Classic-AR) can model time-series with a concise and interpretable model. When dealing with long-range dependencies, Classic-AR models can become intractably slow to fit for large data. Recently, sequence-to-sequence models, such as Recurrent Neural Networks, which were originally intended for natural language processing, have become popular for time-series. However, they can be overly complex for typical time-series data and lack interpretability. A scalable and interpretable model is needed to bridge the statistical and deep learning-based approaches. As a first step towards this goal, we propose modelling AR-process dynamics using a feed-forward neural network approach, termed AR-Net. We show that AR-Net is as interpretable as Classic-AR but also scales to long-range dependencies. Our results lead to three major conclusions: First, AR-Net learns identical AR-coefficients as Classic-AR, thus being equally interpretable. Second, the computational complexity with respect to the order of the AR process, is linear for AR-Net as compared to a quadratic for Classic-AR. This makes it possible to model long-range dependencies within fine granularity data. Third, by introducing regularization, AR-Net automatically selects and learns sparse AR-coefficients. This eliminates the need to know the exact order of the AR-process and allows to learn sparse weights for a model with long-range dependencies.

Today, Android devices are able to provide various services. They support applications for different purposes such as entertainment, business, health, education, and banking services. Because of the functionality and popularity of Android devices as well as the open-source policy of Android OS, they have become a suitable target for attackers. Android Botnet is one of the most dangerous malwares because an attacker called Botmaster can control that remotely to perform destructive attacks. A number of researchers have used different well-known Machine Learning (ML) methods to recognize Android Botnets from benign applications. However, these conventional methods are not able to detect new sophisticated Android Botnets. In this paper, we propose a novel method based on Android permissions and Convolutional Neural Networks (CNNs) to classify Botnets and benign Android applications. Being the first developed method that uses CNNs for this aim, we also proposed a novel method to represent each application as an image which is constructed based on the co-occurrence of used permissions in that application. The proposed CNN is a binary classifier that is trained using these images. Evaluating the proposed method on 5450 Android applications consist of Botnet and benign samples, the obtained results show the accuracy of 97.2% and recall of 96% which is a promising result just using Android permissions.

In this paper, we propose a new product knowledge graph (PKG) embedding approach for learning the intrinsic product relations as product knowledge for e-commerce. We define the key entities and summarize the pivotal product relations that are critical for general e-commerce applications including marketing, advertisement, search ranking and recommendation. We first provide a comprehensive comparison between PKG and ordinary knowledge graph (KG) and then illustrate why KG embedding methods are not suitable for PKG learning. We construct a self-attention-enhanced distributed representation learning model for learning PKG embeddings from raw customer activity data in an end-to-end fashion. We design an effective multi-task learning schema to fully leverage the multi-modal e-commerce data. The Poincare embedding is also employed to handle complex entity structures. We use a real-world dataset from grocery.walmart.com to evaluate the performances on knowledge completion, search ranking and recommendation. The proposed approach compares favourably to baselines in knowledge completion and downstream tasks.

Graph convolutional networks (GCNs) have shown the powerful ability in text structure representation and effectively facilitate the task of text classification. However, challenges still exist in adapting GCN on learning discriminative features from texts due to the main issue of graph variants incurred by the textual complexity and diversity. In this paper, we propose a dual-attention GCN to model the structural information of various texts as well as tackle the graph-invariant problem through embedding two types of attention mechanisms, i.e. the connection-attention and hop-attention, into the classic GCN. To encode various connection patterns between neighbour words, connection-attention adaptively imposes different weights specified to neighbourhoods of each word, which captures the short-term dependencies. On the other hand, the hop-attention applies scaled coefficients to different scopes during the graph diffusion process to make the model learn more about the distribution of context, which captures long-term semantics in an adaptive way. Extensive experiments are conducted on five widely used datasets to evaluate our dual-attention GCN, and the achieved state-of-the-art performance verifies the effectiveness of dual-attention mechanisms.

Quantization and Knowledge distillation (KD) methods are widely used to reduce memory and power consumption of deep neural networks (DNNs), especially for resource-constrained edge devices. Although their combination is quite promising to meet these requirements, it may not work as desired. It is mainly because the regularization effect of KD further diminishes the already reduced representation power of a quantized model. To address this short-coming, we propose Quantization-aware Knowledge Distillation (QKD) wherein quantization and KD are care-fully coordinated in three phases. First, Self-studying (SS) phase fine-tunes a quantized low-precision student network without KD to obtain a good initialization. Second, Co-studying (CS) phase tries to train a teacher to make it more quantizaion-friendly and powerful than a fixed teacher. Finally, Tutoring (TU) phase transfers knowledge from the trained teacher to the student. We extensively evaluate our method on ImageNet and CIFAR-10/100 datasets and show an ablation study on networks with both standard and depthwise-separable convolutions. The proposed QKD outperformed existing state-of-the-art methods (e.g., 1.3% improvement on ResNet-18 with W4A4, 2.6% on MobileNetV2 with W4A4). Additionally, QKD could recover the full-precision accuracy at as low as W3A3 quantization on ResNet and W6A6 quantization on MobilenetV2.

Current approaches to deep learning are beginning to rely heavily on transfer learning as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks. Similarly, such pre-trained models are often used to create embedding representations for various types of data, such as text and images, which can then be fed as input into separate, downstream models. However, in cases where such transfer learning models perform poorly (i.e., for data outside of the training distribution), one must resort to fine-tuning such models, or even retraining them completely. Currently, no form of data augmentation has been proposed that can be applied directly to embedding inputs to improve downstream model performance. In this work, we introduce four new types of data augmentation that are generally applicable to embedding inputs, thus making them useful in both Natural Language Processing (NLP) and Computer Vision (CV) applications. For models trained on downstream tasks with such embedding inputs, these augmentation methods are shown to improve the AUC score of the models from a score of 0.9582 to 0.9812 and significantly increase the model’s ability to identify classes of data that are not seen during training.

Deep metric learning (DML) is a popular approach for images retrieval, solving verification (same or not) problems and addressing open set classification. Arguably, the most common DML approach is with triplet loss, despite significant advances in the area of DML. Triplet loss suffers from several issues such as collapse of the embeddings, high sensitivity to sampling schemes and more importantly a lack of performance when compared to more modern methods. We attribute this adoption to a lack of fair comparisons between various methods and the difficulty in adopting them for novel problem statements. In this paper, we perform an unbiased comparison of the most popular DML baseline methods under same conditions and more importantly, not obfuscating any hyper parameter tuning or adjustment needed to favor a particular method. We find, that under equal conditions several older methods perform significantly better than previously believed. In fact, our unified implementation of 12 recently introduced DML algorithms achieve state-of-the art performance on CUB200, CAR196, and Stanford Online products datasets which establishes a new set of baselines for future DML research. The codebase and all tuned hyperparameters will be open-sourced for reproducibility and to serve as a source of benchmark.

Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as ‘Obama is a _ by profession’. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as ‘Obama worked as a _’ may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts and ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 38.1%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://…/LPAQA.

Deep learning has gained tremendous success and great popularity in the past few years. However, recent research found that it is suffering several inherent weaknesses, which can threaten the security and privacy of the stackholders. Deep learning’s wide use further magnifies the caused consequences. To this end, lots of research has been conducted with the purpose of exhaustively identifying intrinsic weaknesses and subsequently proposing feasible mitigation. Yet few is clear about how these weaknesses are incurred and how effective are these attack approaches in assaulting deep learning. In order to unveil the security weaknesses and aid in the development of a robust deep learning system, we are devoted to undertaking a comprehensive investigation on attacks towards deep learning, and extensively evaluating these attacks in multiple views. In particular, we focus on four types of attacks associated with security and privacy of deep learning: model extraction attack, model inversion attack, poisoning attack and adversarial attack. For each type of attack, we construct its essential workflow as well as adversary capabilities and attack goals. Many pivot metrics are devised for evaluating the attack approaches, by which we perform a quantitative and qualitative analysis. From the analysis, we have identified significant and indispensable factors in an attack vector, \eg, how to reduce queries to target models, what distance used for measuring perturbation. We spot light on 17 findings covering these approaches’ merits and demerits, success probability, deployment complexity and prospects. Moreover, we discuss other potential security weaknesses and possible mitigation which can inspire relevant researchers in this area.

We propose a novel approach to address one aspect of the non-stationarity problem in multi-agent reinforcement learning (RL), where the other agents may alter their policies due to environment changes during execution. This violates the Markov assumption that governs most single-agent RL methods and is one of the key challenges in multi-agent RL. To tackle this, we propose to train multiple policies for each agent and postpone the selection of the best policy at execution time. Specifically, we model the environment non-stationarity with a finite set of scenarios and train policies fitting each scenario. In addition to multiple policies, each agent also learns a policy predictor to determine which policy is the best with its local information. By doing so, each agent is able to adapt its policy when the environment changes and consequentially the other agents alter their policies during execution. We empirically evaluated our method on a variety of common benchmark problems proposed for multi-agent deep RL in the literature. Our experimental results show that the agents trained by our algorithm have better adaptiveness in changing environments and outperform the state-of-the-art methods in all the tested environments.

As a metric to measure the performance of an online method, dynamic regret with switching cost has drawn much attention for online decision making problems. Although the sublinear regret has been provided in many previous researches, we still have little knowledge about the relation between the dynamic regret and the switching cost. In the paper, we investigate the relation for two classic online settings: Online Algorithms (OA) and Online Convex Optimization (OCO). We provide a new theoretical analysis framework, which shows an interesting observation, that is, the relation between the switching cost and the dynamic regret is different for settings of OA and OCO. Specifically, the switching cost has significant impact on the dynamic regret in the setting of OA. But, it does not have an impact on the dynamic regret in the setting of OCO. Furthermore, we provide a lower bound of regret for the setting of OCO, which is same with the lower bound in the case of no switching cost. It shows that the switching cost does not change the difficulty of online decision making problems in the setting of OCO.

We present a method for solving the general mixed constrained convex quadratic programming problem using an active set method on the dual problem. The approach is similar to existing active set methods, but we present a new way of solving the linear systems arising in the algorithm. There are two main contributions; we present a new way of factorizing the linear systems, and show how iterative refinement can be used to achieve good accuracy and to solve both types of sub-problems that arise from semi-definite problems.

Decentralized optimization algorithms have attracted intensive interests recently, as it has a balanced communication pattern, especially when solving large-scale machine learning problems. Stochastic Path Integrated Differential Estimator Stochastic First-Order method (SPIDER-SFO) nearly achieves the algorithmic lower bound in certain regimes for nonconvex problems. However, whether we can find a decentralized algorithm which achieves a similar convergence rate to SPIDER-SFO is still unclear. To tackle this problem, we propose a decentralized variant of SPIDER-SFO, called decentralized SPIDER-SFO (D-SPIDER-SFO). We show that D-SPIDER-SFO achieves a similar gradient computation cost—that is, for finding an -approximate first-order stationary point—to its centralized counterpart. To the best of our knowledge, D-SPIDER-SFO achieves the state-of-the-art performance for solving nonconvex optimization problems on decentralized networks in terms of the computational cost. Experiments on different network configurations demonstrate the efficiency of the proposed method.

There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naive one-to-one mapping of each word in a database to a word embedding vector is not sufficient and would lead to poor accuracies in ML tasks. Thus, we argue to additionally incorporate the information given by the database schema into the embedding, e.g. which words appear in the same column or are related to each other. In this paper, we propose RETRO (RElational reTROfitting), a novel approach to learn numerical representations of text values in databases, capturing the best of both worlds, the rich information encoded by word embeddings and the relational information encoded by database tables. We formulate relation retrofitting as a learning problem and present an efficient algorithm solving it. We investigate the impact of various hyperparameters on the learning problem and derive good settings for all of them. Our evaluation shows that the proposed embeddings are ready-to-use for many ML tasks such as classification and regression and even outperform state-of-the-art techniques in integration tasks such as null value imputation and link prediction.

Dropout has been proven to be an effective algorithm for training robust deep networks because of its ability to prevent overfitting by avoiding the co-adaptation of feature detectors. Current explanations of dropout include bagging, naive Bayes, regularization, and sex in evolution. According to the activation patterns of neurons in the human brain, when faced with different situations, the firing rates of neurons are random and continuous, not binary as current dropout does. Inspired by this phenomenon, we extend the traditional binary dropout to continuous dropout. On the one hand, continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout. On the other hand, we demonstrate that continuous dropout has the property of avoiding the co-adaptation of feature detectors, which suggests that we can extract more independent feature detectors for model averaging in the test stage. We introduce the proposed continuous dropout to a feedforward neural network and comprehensively compare it with binary dropout, adaptive dropout, and DropConnect on MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-12. Thorough experiments demonstrate that our method performs better in preventing the co-adaptation of feature detectors and improves test performance. The code is available at: https://…/dropout.

We consider a three-layer Sejnowski machine and show that features learnt via contrastive divergence have a dual representation as patterns in a dense associative memory of order P=4. The latter is known to be able to Hebbian-store an amount of patterns scaling as N^{P-1}, where N denotes the number of constituting binary neurons interacting P-wisely. We also prove that, by keeping the dense associative network far from the saturation regime (namely, allowing for a number of patterns scaling only linearly with N, while P>2) such a system is able to perform pattern recognition far below the standard signal-to-noise threshold. In particular, a network with P=4 is able to retrieve information whose intensity is O(1) even in the presence of a noise O(\sqrt{N}) in the large N limit. This striking skill stems from a redundancy representation of patterns — which is afforded given the (relatively) low-load information storage — and it contributes to explain the impressive abilities in pattern recognition exhibited by new-generation neural networks. The whole theory is developed rigorously, at the replica symmetric level of approximation, and corroborated by signal-to-noise analysis and Monte Carlo simulations.

This paper presents a new approach to the detection of discontinuities in the n-th derivative of observational data. This is achieved by performing two polynomial approximations at each interstitial point. The polynomials are coupled by constraining their coefficients to ensure continuity of the model up to the (n-1)-th derivative; while yielding an estimate for the discontinuity of the n-th derivative. The coefficients of the polynomials correspond directly to the derivatives of the approximations at the interstitial points through the prudent selection of a common coordinate system. The approximation residual and extrapolation errors are investigated as measures for detecting discontinuity. This is necessary since discrete observations of continuous systems are discontinuous at every point. It is proven, using matrix algebra, that positive extrema in the combined approximation-extrapolation error correspond exactly to extrema in the difference of the Taylor coefficients. This provides a relative measure for the severity of the discontinuity in the observational data. The matrix algebraic derivations are provided for all aspects of the methods presented here; this includes a solution for the covariance propagation through the computation. The performance of the method is verified with a Monte Carlo simulation using synthetic piecewise polynomial data with known discontinuities. It is also demonstrated that the discontinuities are suitable as knots for B-spline modelling of data. For completeness, the results of applying the method to sensor data acquired during the monitoring of heavy machinery are presented.

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.

In this paper, we consider the problem of finding the constraints in bow-free acyclic directed mixed graphs (ADMGs). ADMGs are a generalisation of directed acyclic graphs (DAGs) that allow for certain latent variables. We first show that minimal generators for the ideal containing all the constraints of a Gaussian ADMG corresponds precisely to the pairs of non-adjacent vertices in . The proof of this theorem naturally leads to an efficient algorithm that fits a bow-free Gaussian ADMG by maximum likelihood. In particular, we can test for the goodness of fit of a given data set to a bow-free ADMG.

Cyber-Physical Systems (CPS) are present in many settings addressing a myriad of purposes. Examples are Internet-of-Things (IoT) or sensing software embedded in appliances or even specialised meters that measure and respond to electricity demands in smart grids. Due to their pervasive nature, they are usually chosen as recipients for larger scope cyber-security attacks. Those promote system-wide disruptions and are directed towards one key aspect such as confidentiality, integrity, availability or a combination of those characteristics. Our paper focuses on a particular and distressing attack where coordinated malware infected IoT units are maliciously employed to synchronously turn on or off high-wattage appliances, affecting the grid’s primary control management. Our model could be extended to larger (smart) grids, Active Buildings as well as similar infrastructures. Our approach models Coordinated Load-Changing Attacks (CLCA) also referred as GridLock or BlackIoT, against a theoretical power grid, containing various types of power plants. It employs Continuous-Time Markov Chains where elements such as Power Plants and Botnets are modelled under normal or attack situations to evaluate the effect of CLCA in power reliant infrastructures. We showcase our modelling approach in the scenario of a power supplier (e.g. power plant) being targeted by a botnet. We demonstrate how our modelling approach can quantify the impact of a botnet attack and be abstracted for any CPS system involving power load management in a smart grid. Our results show that by prioritising the type of power-plants, the impact of the attack may change: in particular, we find the most impacting attack times and show how different strategies impact their success. We also find the best power generator to use depending on the current demand and strength of attack.

We consider a finite impulse response system with centered independent sub-Gaussian design covariates and noise components that are not necessarily identically distributed. We derive non-asymptotic near-optimal estimation and prediction bounds for the least-squares estimator of the parameters. Our results are based on two concentration inequalities on the norm of sums of dependent covariate vectors and on the singular values of their covariance operator that are of independent value on their own and where the dependence arises from the time shift structure of the time series. These results generalize the known bounds for the independent case.

The performance of acquisition functions for Bayesian optimisation is investigated in terms of the Pareto front between exploration and exploitation. We show that Expected Improvement and the Upper Confidence Bound always select solutions to be expensively evaluated on the Pareto front, but Probability of Improvement is never guaranteed to do so and Weighted Expected Improvement does only for a restricted range of weights. We introduce two novel -greedy acquisition functions. Extensive empirical evaluation of these together with random search, purely exploratory and purely exploitative search on 10 benchmark problems in 1 to 10 dimensions shows that -greedy algorithms are generally at least as effective as conventional acquisition functions, particularly with a limited budget. In higher dimensions -greedy approaches are shown to have improved performance over conventional approaches. These results are borne out on a real world computational fluid dynamics optimisation problem and a robotics active learning problem.

With the improvements of Los Angeles in many aspects, people in mounting numbers tend to live or travel to the city. The primary objective of this paper is to apply a set of methods for the time series analysis of traffic accidents in Los Angeles in the past few years. The number of traffic accidents, collected from 2010 to 2019 monthly reveals that the traffic accident happens seasonally and increasing with fluctuation. This paper utilizes the ensemble methods to combine several different methods to model the data from various perspectives, which can lead to better forecasting accuracy. The IMA(1, 1), ETS(A, N, A), and two models with Fourier items are failed in independence assumption checking. However, the Online Gradient Descent (OGD) model generated by the ensemble method shows the perfect fit in the data modeling, which is the state-of-the-art model among our candidate models. Therefore, it can be easier to accurately forecast future traffic accidents based on previous data through our model, which can help designers to make better plans.

The unification of low-level perception and high-level reasoning is a long-standing problem in artificial intelligence, which has the potential to not only bring the areas of logic and learning closer together but also demonstrate how abstract concepts might emerge from sensory data. Precisely because deep learning methods dominate perception-based learning, including vision, speech, and linguistic grammar, there is fast-growing literature on how to integrate symbolic reasoning and deep learning. Broadly, efforts seem to fall into three camps: those focused on defining a logic whose formulas capture deep learning, ones that integrate symbolic constraints in deep learning, and others that allow neural computations and symbolic reasoning to co-exist separately, to enjoy the strengths of both worlds. In this paper, we identify another dimension to this inquiry: what do the hidden layers really capture, and how can we reason about that logically? In particular, we consider autoencoders that are widely used for dimensionality reduction and inject a symbolic generative framework onto the feature layer. This allows us, among other things, to generate example images for a class to get a sense of what was learned. Moreover, the modular structure of the proposed model makes it possible to learn relations over multiple images at a time, as well as handle noisy labels. Our empirical evaluations show the promise of this inquiry.

We present a continuation method that entails generating a sequence of transition probability density functions from the prior to the posterior in the context of Bayesian inference for parameter estimation problems. The characterization of transition distributions, by tempering the likelihood function, results in a homogeneous nonlinear partial integro-differential equation whose existence and uniqueness of solutions are addressed. The posterior probability distribution comes as the interpretation of the final state of the path of transition distributions. A computationally stable scaling domain for the likelihood is explored for the approximation of the expected deviance, where we manage to hold back all the evaluations of the forward predictive model at the prior stage. It follows the computational tractability of the posterior distribution and opens access to the posterior distribution for direct samplings. To get a solution formulation of the expected deviance, we derive a partial differential equation governing the moments generating function of the log-likelihood. We show also that a spectral formulation of the expected deviance can be obtained for low-dimensional problems under certain conditions. The computational efficiency of the proposed method is demonstrated through three differents numerical examples that focus on analyzing the computational bias generated by the method, assessing the continuation method in the Bayesian inference with non-Gaussian noise, and evaluating its ability to invert a multimodal parameter of interest.

Join query optimization is a complex task and is central to the performance of query processing. In fact it belongs to the class of NP-hard problems. Traditional query optimizers use dynamic programming (DP) methods combined with a set of rules and restrictions to avoid exhaustive enumeration of all possible join orders. However, DP methods are very resource intensive. Moreover, given simplifying assumptions of attribute independence, traditional query optimizers rely on erroneous cost estimations, which can lead to suboptimal query plans. Recent success of deep reinforcement learning (DRL) creates new opportunities for the field of query optimization to tackle the above-mentioned problems. In this paper, we present our DRL-based Fully Observed Optimizer (FOOP) which is a generic query optimization framework that enables plugging in different machine learning algorithms. The main idea of FOOP is to use a data-adaptive learning query optimizer that avoids exhaustive enumerations of join orders and is thus significantly faster than traditional approaches based on dynamic programming. In particular, we evaluate various DRL-algorithms and show that Proximal Policy Optimization significantly outperforms Q-learning based algorithms. Finally we demonstrate how ensemble learning techniques combined with DRL can further improve the query optimizer.

Motivated by the flexibility of biological neural networks whose connectivity structure changes significantly during their lifetime, we introduce the Unstructured Recursive Network (URN) and demonstrate that it can exhibit similar flexibility during training via gradient descent. We show empirically that many of the different neural network structures commonly used in practice today (including fully connected, locally connected and residual networks of different depths and widths) can emerge dynamically from the same URN. These different structures can be derived using gradient descent on a single general loss function where the structure of the data and the relative strengths of various regulator terms determine the structure of the emergent network. We show that this loss function and the regulators arise naturally when considering the symmetries of the network as well as the geometric properties of the input data.

We propose a generalized Bayesian regression and model learning tool based on the “Bayesian Validation Metric’ (BVM) proposed in [1], called BVM model learning. This method performs Bayesian regression based on a user’s definition of model-data agreement and allows for model selection on any type of data distribution, unlike Bayesian and standard regression techniques, that fail in some cases. We show that the BVM model learning is capable of representing and combining Bayesian and standard regression techniques in a single framework and generalizing these methods. Thus, this tool offers new insights into the interpretation of the predictive envelopes in Bayesian and standard regression while giving the modeler more control over these envelopes.

The excessively increased volume of data in modern data management systems demands an improved system performance, frequently provided by data distribution, system scalability and performance optimization techniques. Optimized horizontal data partitioning has a significant influence of distributed data management systems. An optimally partitioned schema found in the early phase of logical database design without loading of real data in the system and its adaptation to changes of business environment are very important for a successful implementation, system scalability and performance improvement. In this paper we present a novel approach for finding an optimal horizontally partitioned schema that manifests a minimal total execution cost of a given database workload. Our approach is based on a formal model that enables abstraction of the predicates in the workload queries, and are subsequently used to define all relational fragments. This approach has predictive features acquired by simulation of horizontal partitioning, without loading any data into the partitions, but instead, altering the statistics in the database catalogs. We define an optimization problem and employ a genetic algorithm (GA) to find an approximately optimal horizontally partitioned schema. The solutions to the optimization problem are evaluated using PostgreSQL’s query optimizer. The initial experimental evaluation of our approach confirms its efficiency and correctness, and the numbers imply that the approach is effective in reducing the workload execution cost.

Networks are one of the most powerful structures for modeling problems in the real world. Downstream machine learning tasks defined on networks have the potential to solve a variety of problems. With link prediction, for instance, one can predict whether two persons will become friends on a social network. Many machine learning algorithms, however, require that each input example is a real vector. Network embedding encompasses various methods for unsupervised, and sometimes supervised, learning of feature representations of nodes and links in a network. Typically, embedding methods are based on the assumption that the similarity between nodes in the network should be reflected in the learned feature representations. In this paper, we review significant contributions to network embedding in the last decade. In particular, we look at four methods: Spectral Clustering, DeepWalk, Large-scale Information Network Embedding (LINE), and node2vec. We describe each method and list its advantages and shortcomings. In addition, we give examples of real-world machine learning problems on networks in which the embedding is critical in order to maximize the predictive performance of the machine learning task. Finally, we take a look at research trends and state-of-the art methods in the research on network embedding.

Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine granularity tasks make them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.

An Adversarial System to attack and an Authorship Attribution System (AAS) to defend itself against the attacks are analyzed. Defending a system against attacks from an adversarial machine learner can be done by randomly switching between models for the system, by detecting and reacting to changes in the distribution of normal inputs, or by using other methods. Adversarial machine learning is used to identify a system that is being used to map system inputs to outputs. Three types of machine learners are using for the model that is being attacked. The machine learners that are used to model the system being attacked are a Radial Basis Function Support Vector Machine, a Linear Support Vector Machine, and a Feedforward Neural Network. The feature masks are evolved using accuracy as the fitness measure. The system defends itself against adversarial machine learning attacks by identifying inputs that do not match the probability distribution of normal inputs. The system also defends itself against adversarial attacks by randomly switching between the feature masks being used to map system inputs to outputs.

Generative adversarial networks (GANs) are neural networks that learn data distributions through adversarial training. In intensive studies, recent GANs have shown promising results for reproducing training data. However, in spite of noise, they reproduce data with fidelity. As an alternative, we propose a novel family of GANs called noise-robust GANs (NR-GANs), which can learn a clean image generator even when training data are noisy. In particular, NR-GANs can solve this problem without having complete noise information (e.g., the noise distribution type, noise amount, or signal-noise relation). To achieve this, we introduce a noise generator and train it along with a clean image generator. As it is difficult to generate an image and a noise separately without constraints, we propose distribution and transformation constraints that encourage the noise generator to capture only the noise-specific components. In particular, considering such constraints under different assumptions, we devise two variants of NR-GANs for signal-independent noise and three variants of NR-GANs for signal-dependent noise. On three benchmark datasets, we demonstrate the effectiveness of NR-GANs in noise robust image generation. Furthermore, we show the applicability of NR-GANs in image denoising.

Most recent neural semi-supervised learning algorithms rely on adding small perturbation to either the input vectors or their representations. These methods have been successful on computer vision tasks as the images form a continuous manifold, but are not appropriate for discrete input such as sentence. To adapt these methods to text input, we propose to decompose a neural network into two components and so that . The layers in are then frozen and only the layers in will be updated during most time of the training. In this way, serves as a feature extractor that maps the input to high-level representation and adds systematical noise using dropout. We can then train using any state-of-the-art SSL algorithms such as -model, temporal ensembling, mean teacher, etc. Furthermore, this gradually unfreezing schedule also prevents a pretrained model from catastrophic forgetting. The experimental results demonstrate that our approach provides improvements when compared to state of the art methods especially on short texts.

This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems.

We propose ViewAL, a novel active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. Our core idea is that inconsistencies in model predictions across viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well irrespective of the viewpoint under which objects are observed. To incorporate this uncertainty measure, we introduce a new viewpoint entropy formulation, which is the basis of our active learning strategy. In addition, we propose uncertainty computations on a superpixel level, which exploits inherently localized signal in the segmentation task, directly lowering the annotation costs. This combination of viewpoint entropy and the use of superpixels allows to efficiently select samples that are highly informative for improving the network. We demonstrate that our proposed active learning strategy not only yields the best-performing models for the same amount of required labeled data, but also significantly reduces labeling effort. For instance, our method achieves 95% of maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively. On these datasets, the best state-of-the-art method achieves the same performance with 14%, 27% and 33% labeled data. Finally, we demonstrate that labeling using superpixels yields the same quality of ground-truth compared to labeling whole images, but requires 25% less time.

Capsule networks excel in understanding spatial relationships in 2D data for vision related tasks. Even though they are not designed to capture 1D temporal relationships, with TimeCaps we demonstrate that given the ability, capsule networks excel in understanding temporal relationships. To this end, we generate capsules along the temporal and channel dimensions creating two temporal feature detectors which learn contrasting relationships. TimeCaps surpasses the state-of-the-art results by achieving 96.21% accuracy on identifying 13 Electrocardiogram (ECG) signal beat categories, while achieving on-par results on identifying 30 classes of short audio commands. Further, the instantiation parameters inherently learnt by the capsule networks allow us to completely parameterize 1D signals which opens various possibilities in signal processing.

Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.

We propose a method to infer the presence and location of change-points in the distribution of a sequence of independent data taking values in a general metric space, where change-points are viewed as locations at which the distribution of the data sequence changes abruptly in terms of either its Fr\’echet mean or Fr\’echet variance or both. The proposed method is based on comparisons of Fr\’echet variances before and after putative change-point locations and does not require a tuning parameter except for the specification of cut-off intervals near the endpoints where change-points are assumed not to occur. Our results include theoretical guarantees for consistency of the test under contiguous alternatives when a change-point exists and also for consistency of the estimated location of the change-point if it exists, where under the null hypothesis of no change-point the limit distribution of the proposed scan function is the square of a standardized Brownian Bridge. These consistency results are applicable for a broad class of metric spaces under mild entropy conditions. Examples include the space of univariate probability distributions and the space of graph Laplacians for networks. Simulation studies demonstrate the effectiveness of the proposed methods, both for inferring the presence of a change-point and estimating its location. We also develop theory that justifies bootstrap-based inference and illustrate the new approach with sequences of maternal fertility distributions and communication networks.

Determining the correct number of clusters (CNC) is an important task in data clustering and has a critical effect on finalizing the partitioning results. K-means is one of the popular methods of clustering that requires CNC. Validity index methods use an additional optimization procedure to estimate the CNC for K-means. We propose an alternative validity index approach denoted by k-minimizing Average Central Error (KMACE). K-means is one of the popular methods of clustering that requires CNC. Validity ACE is the average error between the true unavailable cluster center and the estimated cluster center for each sample data. Kernel K-MACE is kernel K-means that is equipped with an efficient CNC estimator. In addition, kernel K_MACE includes the first automatically tuned procedure for choosing the Gaussian kernel parameters. Simulation results for both synthetic and read data show superiority of K_MACE and kernel K-MACE over the conventional clustering methods not only in CNC estimation but also in the partitioning procedure.

Have you ever wondered how your feature space is impacting the prediction of a specific sample in your dataset? In this paper, we introduce Single Sample Feature Importance (SSFI), which is an interpretable feature importance algorithm that allows for the identification of the most important features that contribute to the prediction of a single sample. When a dataset can be learned by a Random Forest classifier or regressor, SSFI shows how the Random Forest’s prediction path can be utilized for low-level feature importance calculation. SSFI results in a relative ranking of features, highlighting those with the greatest impact on a data point’s prediction. We demonstrate these results both numerically and visually on four different datasets.

In the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering tasks with a fixed task set, which cannot incorporate with a new spectral clustering task without accessing to previously learned tasks. In this paper, we aim to explore the problem of spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) orthogonal basis library: capturing latent cluster centers among the clusters in each pair of tasks; 2) feature embedding library: embedding the feature manifold information shared among multiple related tasks. As a new spectral clustering task arrives, L2SC firstly transfers knowledge from both basis library and feature library to obtain encoding matrix, and further redefines the library base over time to maximize performance across all the clustering tasks. Meanwhile, a general online update formulation is derived to alternatively update the basis library and feature library. Finally, the empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model can effectively improve the clustering performance when comparing with other state-of-the-art spectral clustering algorithms.

Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at https://…/CrossStagePartialNetworks.

Transfer learning is becoming the de facto solution for vision and text encoders in the front-end processing of machine learning solutions. Utilizing vast amounts of knowledge in pre-trained models and subsequent fine-tuning allows achieving better performance in domains where labeled data is limited. In this paper, we analyze the efficiency of transfer learning in visual reasoning by introducing a new model (SAMNet) and testing it on two datasets: COG and CLEVR. Our new model achieves state-of-the-art accuracy on COG and shows significantly better generalization capabilities compared to the baseline. We also formalize a taxonomy of transfer learning for visual reasoning around three axes: feature, temporal, and reasoning transfer. Based on extensive experimentation of transfer learning on each of the two datasets, we show the performance of the new model along each axis.

Conventional out-of-distribution (OOD) detection schemes based on variational autoencoder or Random Network Distillation (RND) have been observed to assign lower uncertainty to the OOD than the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to the blurred images. Based on the observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns better target distribution representation than the baseline RND algorithm. Finally, SVD-RND combined with geometric transform achieves near-perfect detection accuracy on the CelebA dataset.

Ancestral mixture model, proposed by Chen and Lindsay (2006), is an important model to build a hierarchical tree from high dimensional binary sequences. Mixture trees created from ancestral mixture models involve in the inferred evolutionary relationships among various biological species. Moreover, it contains the information of time when the species mutates. Tree comparison metric, an essential issue in bioinformatics, is to measure the similarity between trees. However, to our knowledge, the approach to the comparison between two mixture trees is still under development. In this paper, we propose a new metric, named mixture distance metric, to measure the similarity of two mixture trees. It uniquely considers the factor of evolutionary times between trees. In addition, we also further develop two algorithms to compute the mixture distance between two mixture trees. One requires O(n^2) and the other requires O(nh) computation time with O(n) preprocessing time, where n denotes the number of leaves in the two mixture trees, and h denotes the minimum height of these two trees.

Most mobile network operators generate revenues by directly charging users for data plan subscriptions. Some operators now also offer users data rewards to incentivize them to watch mobile ads, which enables the operators to collect payments from advertisers and create new revenue streams. In this work, we analyze and compare two data rewarding schemes: a Subscription-Aware Rewarding (SAR) scheme and a Subscription-Unaware Rewarding (SUR) scheme. Under the SAR scheme, only the subscribers of the operators’ data plans are eligible for the rewards; under the SUR scheme, all users are eligible for the rewards (e.g., the users who do not subscribe to the data plans can still get SIM cards and receive data rewards by watching ads). We model the interactions among an operator, users, and advertisers by a two-stage Stackelberg game, and characterize their equilibrium strategies under both the SAR and SUR schemes. We show that the SAR scheme can lead to more subscriptions and a higher operator revenue from the data market, while the SUR scheme can lead to better ad viewership and a higher operator revenue from the ad market. We further show that the operator’s optimal choice between the two schemes is sensitive to the users’ data consumption utility function and the operator’s network capacity. We provide some counter-intuitive insights. For example, when each user has a logarithmic utility function, the operator should apply the SUR scheme (i.e., reward both subscribers and non-subscribers) if and only if it has a small network capacity.

Given labeled instances on a source domain and unlabeled ones on a target domain, unsupervised domain adaptation aims to learn a task classifier that can well classify target instances. Recent advances rely on domain-adversarial training of deep networks to learn domain-invariant features. However, due to an issue of mode collapse induced by the separate design of task and domain classifiers, these methods are limited in aligning the joint distributions of feature and category across domains. To overcome it, we propose a novel adversarial learning method termed Discriminative Adversarial Domain Adaptation (DADA). Based on an integrated category and domain classifier, DADA has a novel adversarial objective that encourages a mutually inhibitory relation between category and domain predictions for any input instance. We show that under practical conditions, it defines a minimax game that can promote the joint distribution alignment. Except for the traditional closed set domain adaptation, we also extend DADA for extremely challenging problem settings of partial and open set domain adaptation. Experiments show the efficacy of our proposed methods and we achieve the new state of the art for all the three settings on benchmark datasets.

Human parsing, or human body part semantic segmentation, has been an active research topic due to its wide potential applications. In this paper, we propose a novel GRAph PYramid Mutual Learning (Grapy-ML) method to address the cross-dataset human parsing problem, where the annotations are at different granularities. Starting from the prior knowledge of the human body hierarchical structure, we devise a graph pyramid module (GPM) by stacking three levels of graph structures from coarse granularity to fine granularity subsequently. At each level, GPM utilizes the self-attention mechanism to model the correlations between context nodes. Then, it adopts a top-down mechanism to progressively refine the hierarchical features through all the levels. GPM also enables efficient mutual learning. Specifically, the network weights of the first two levels are shared to exchange the learned coarse-granularity information across different datasets. By making use of the multi-granularity labels, Grapy-ML learns a more discriminative feature representation and achieves state-of-the-art performance, which is demonstrated by extensive experiments on the three popular benchmarks, e.g. CIHP dataset. The source code is publicly available at https://…/Grapy-ML.

We develop a framework for analyzing multivariate time series using topological data analysis (TDA) methods. The proposed methodology involves converting the multivariate time series to point cloud data, calculating Wasserstein distances between the persistence diagrams and using the -nearest neighbors algorithm (-NN) for supervised machine learning. Two methods (symmetry-breaking and anchor points) are also introduced to enable TDA to better analyze data with heterogeneous features that are sensitive to translation, rotation, or choice of coordinates. We apply our methods to room occupancy detection based on 5 time-dependent variables (temperature, humidity, light, CO2 and humidity ratio). Experimental results show that topological methods are effective in predicting room occupancy during a time window.

High-level (e.g., semantic) features encoded in the latter layers of convolutional neural networks are extensively exploited for image classification, leaving low-level (e.g., color) features in the early layers underexplored. In this paper, we propose a novel Decision Propagation Module (DPM) to make an intermediate decision that could act as category-coherent guidance extracted from early layers, and then propagate it to the latter layers. Therefore, by stacking a collection of DPMs into a classification network, the generated Decision Propagation Network is explicitly formulated as to progressively encode more discriminative features guided by the decision, and then refine the decision based on the new generated features layer by layer. Comprehensive results on four publicly available datasets validate DPM could bring significant improvements for existing classification networks with minimal additional computational cost and is superior to the state-of-the-art methods.

The K-means algorithm is a widely used clustering algorithm that offers simplicity and efficiency. However, the traditional K-means algorithm uses the random method to determine the initial cluster centers, which make clustering results prone to local optima and then result in worse clustering performance. Many initialization methods have been proposed, but none of them can dynamically adapt to datasets with various characteristics. In our previous research, an initialization method for K-means based on hybrid distance was proposed, and this algorithm can adapt to datasets with different characteristics. However, it has the following drawbacks: (a) When calculating density, the threshold cannot be uniquely determined, resulting in unstable results. (b) Heavily depending on adjusting the parameter, the parameter must be adjusted five times to obtain better clustering results. (c) The time complexity of the algorithm is quadratic, which is difficult to apply to large datasets. In the current paper, we proposed an adaptive initialization method for the K-means algorithm (AIMK) to improve our previous work. AIMK can not only adapt to datasets with various characteristics but also obtain better clustering results within two interactions. In addition, we then leverage random sampling in AIMK, which is named as AIMK-RS, to reduce the time complexity. AIMK-RS is easily applied to large and high-dimensional datasets. We compared AIMK and AIMK-RS with 10 different algorithms on 16 normal and six extra-large datasets. The experimental results show that AIMK and AIMK-RS outperform the current initialization methods and several well-known clustering algorithms. Furthermore, AIMK-RS can significantly reduce the complexity of applying it to extra-large datasets with high dimensions. The time complexity of AIMK-RS is O(n).

Deep Learning is a state-of-the-art technique to make inference on extensive or complex data. As a black box model due to their multilayer nonlinear structure, Deep Neural Networks are often criticized to be non-transparent and their predictions not traceable by humans. Furthermore, the models learn from artificial datasets, often with bias or contaminated discriminating content. Through their increased distribution, decision-making algorithms can contribute promoting prejudge and unfairness which is not easy to notice due to lack of transparency. Hence, scientists developed several so-called explanators or explainers which try to point out the connection between input and output to represent in a simplified way the inner structure of machine learning black boxes. In this survey we differ the mechanisms and properties of explaining systems for Deep Neural Networks for Computer Vision tasks. We give a comprehensive overview about taxonomy of related studies and compare several survey papers that deal with explainability in general. We work out the drawbacks and gaps and summarize further research ideas.

Similarity graphs are an active research direction for the nearest neighbor search (NNS) problem. New algorithms for similarity graph construction are continuously being proposed and analyzed by both theoreticians and practitioners. However, existing construction algorithms are mostly based on heuristics and do not explicitly maximize the target performance measure, i.e., search recall. Therefore, at the moment it is not clear whether the performance of similarity graphs has plateaued or more effective graphs can be constructed with more theoretically grounded methods. In this paper, we introduce a new principled algorithm, based on adjacency matrix optimization, which explicitly maximizes search efficiency. Namely, we propose a probabilistic model of a similarity graph defined in terms of its edge probabilities and show how to learn these probabilities from data as a reinforcement learning task. As confirmed by experiments, the proposed construction method can be used to refine the state-of-the-art similarity graphs, achieving higher recall rates for the same number of distance computations. Furthermore, we analyze the learned graphs and reveal the structural properties that are responsible for more efficient search.

Differential Architecture Search (DARTS) is now a widely disseminated weight-sharing neural architecture search method. However, there are two fundamental weaknesses remain untackled. First, we observe that the well-known aggregation of skip connections during optimization is caused by an unfair advantage in an exclusive competition. Second, there is a non-negligible incongruence when discretizing continuous architectural weights to a one-hot representation. Because of these two reasons, DARTS delivers a biased solution that might not even be suboptimal. In this paper, we present a novel approach to curing both frailties. Specifically, as unfair advantages in a pure exclusive competition easily induce a monopoly, we relax the choice of operations to be collaborative, where we let each operation have an equal opportunity to develop its strength. We thus call our method Fair DARTS. Moreover, we propose a zero-one loss to directly reduce the discretization gap. Experiments are performed on two mainstream search spaces, in which we achieve new state-of-the-art networks on ImageNet. Our code is available on https://…/fairdarts.

Heuristic forward search is currently the dominant paradigm in classical planning. Forward search algorithms typically rely on a single, relatively simple variation of best-first search and remain fixed throughout the process of solving a planning problem. Existing work combining multiple search techniques usually aims at supporting best-first search with an additional exploratory mechanism, triggered using a handcrafted criterion. A notable exception is very recent work which combines various search techniques using a trainable policy. It is, however, confined to a discrete action space comprising several fixed subroutines. In this paper, we introduce a parametrized search algorithm template which combines various search techniques within a single routine. The template’s parameter space defines an infinite space of search algorithms, including, among others, BFS, local and random search. We further introduce a neural architecture for designating the values of the search parameters given the state of the search. This enables expressing neural search policies that change the values of the parameters as the search progresses. The policies can be learned automatically, with the objective of maximizing the planner’s performance on a given distribution of planning problems. We consider a training setting based on a stochastic optimization algorithm known as the cross-entropy method (CEM). Experimental evaluation of our approach shows that it is capable of finding effective distribution-specific search policies, outperforming the relevant baselines.

The instability and feature redundancy in CNNs hinders further performance improvement. Using orthogonality as a regularizer has shown success in alleviating these issues. Previous works however only considered the kernel orthogonality in the convolution layers of CNNs, which is a necessary but not sufficient condition for orthogonal convolutions in general. We propose orthogonal convolutions as regularizations in CNNs and benchmark its effect on various tasks. We observe up to 3% gain for CIFAR100 and up to 1% gain for ImageNet classification. Our experiments also demonstrate improved performance on image retrieval, inpainting and generation, which suggests orthogonal convolution improves the feature expressiveness. Empirically, we show that the uniform spectrum and reduced feature redundancy may account for the gain in performance and robustness under adversarial attacks.

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods—taking the maximum attention weight and computing the maximum spanning tree—to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets—the syntax-oriented CoLA and the semantics-oriented MNLI—to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.

A structured understanding of our world in terms of objects, relations, and hierarchies is an important component of human cognition. Learning such a structured world model from raw sensory data remains a challenge. As a step towards this goal, we introduce Contrastively-trained Structured World Models (C-SWMs). C-SWMs utilize a contrastive approach for representation learning in environments with compositional structure. We structure each state embedding as a set of object representations and their relations, modeled by a graph neural network. This allows objects to be discovered from raw pixel observations without direct supervision as part of the learning process. We evaluate C-SWMs on compositional environments involving multiple interacting objects that can be manipulated independently by an agent, simple Atari games, and a multi-object physics simulation. Our experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.

We study the design of learning architectures for behavioural planning in a dense traffic setting. Such architectures should deal with a varying number of nearby vehicles, be invariant to the ordering chosen to describe them, while staying accurate and compact. We observe that the two most popular representations in the literature do not fit these criteria, and perform badly on an complex negotiation task. We propose an attention-based architecture that satisfies all these properties and explicitly accounts for the existing interactions between the traffic participants. We show that this architecture leads to significant performance gains, and is able to capture interactions patterns that can be visualised and qualitatively interpreted. Videos and code are available at https://…/.

Living in the ‘Information Age’ means that not only access to information has become easier but also that the distribution of information is more dynamic than ever. Through a large-scale online field experiment, we provide new empirical evidence for the presence of the anchoring bias in people’s judgment due to irrational reliance on a piece of information that they are initially given. The comparison of the anchoring stimuli and respective responses across different tasks reveals a positive, yet complex relationship between the anchors and the bias in participants’ predictions of the outcomes of events in the future. Participants in the treatment group were equally susceptible to the anchors regardless of their level of engagement, previous performance, or gender. Given the strong and ubiquitous influence of anchors quantified here, we should take great care to closely monitor and regulate the distribution of information online to facilitate less biased decision making.

The recently proposed panoptic segmentation task presents a significant challenge of image understanding with computer vision by unifying semantic segmentation and instance segmentation tasks. In this paper we present an efficient and novel panoptic data augmentation (PanDA) method which operates exclusively in pixel space, requires no additional data or training, and is computationally cheap to implement. We retrain the original state-of-the-art UPSNet panoptic segmentation model on PanDA augmented Cityscapes dataset, and demonstrate all-round performance improvement upon the original model. We also show that PanDA is effective across scales from 10 to 30,000 images, as well as generalizable to Microsoft COCO panoptic segmentation task. Finally, the effectiveness of PanDA generated unrealistic-looking training images suggest that we should rethink about optimizing levels of image realism for efficient data augmentation.

We apply methods from randomized numerical linear algebra (RandNLA) to develop improved algorithms for the analysis of large-scale time series data. We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes. We show that the accuracy of approximations lies within of the true leverage scores with high probability. These theoretical results are subsequently exploited to develop an efficient algorithm, called LSAR, for fitting an appropriate AR model to big time series data. Our proposed algorithm is guaranteed, with high probability, to find the maximum likelihood estimates of the parameters of the underlying true AR model and has a worst case running time that significantly improves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale synthetic as well as real data highly support the theoretical results and reveal the efficacy of this new approach. To the best of our knowledge, this paper is the first attempt to establish a nexus between RandNLA and big time series data analysis.

As neural networks revolutionize many applications, significant privacy concerns emerge. Owners of private data wish to use remote neural network services while ensuring their data cannot be interpreted by others. Service providers wish to keep their model private to safeguard its intellectual property. Such privacy conflicts may slow down the adoption of neural networks in sensitive domains such as healthcare. Privacy issues have been addressed in the cryptography community in the context of secure computation. However, secure computation protocols have known performance issues. E.g., runtime of secure inference in deep neural networks is three orders of magnitude longer comparing to non-secure inference. Therefore, much research efforts address the optimization of cryptographic protocols for secure inference. We take a complementary approach, and provide design principles for optimizing the crypto-oriented neural network architectures to reduce the runtime of secure inference. The principles are evaluated on three state-of-the-art architectures: SqueezeNet, ShuffleNetV2, and MobileNetV2. Our novel method significantly improves the efficiency of secure inference on common evaluation metrics.

Machine learning software, deep neural networks (DNN) software in particular, discerns valuable information from a large dataset, a set of data. Outcomes of such DNN programs are dependent on the quality of both learning programs and datasets. Unfortunately, the quality of datasets is difficult to be defined, because they are just samples. The quality assurance of DNN software is difficult, because resultant trained machine learning models are unknown prior to its development, and the validation is conducted indirectly in terms of prediction performance. This paper introduces a hypothesis that faults in the learning programs manifest themselves as distortions in trained machine learning models. Relative distortion degrees measured with appropriate observer functions may indicate that there are some hidden faults. The proposal is demonstrated with example cases of the MNIST dataset.

Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due to the learning inefficiency caused by optimization requirements and the huge architecture space. To address these issues, we introduce channel sampling and operation space reduction into a differentiable NAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy used to abandon less potential operations. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves a performance comparable to NAS on both CIFAR and ImageNet databases. An accuracy of vs. is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and a faster search than the state-of-the-art PC-DARTS.

The energy system studies include a wide range of issues from short term (e.g. real-time, hourly, daily and weekly operating decisions) to long term horizons (e.g. planning or policy making). The decision making chain is fed by input parameters which are usually subject to uncertainties. The art of dealing with uncertainties has been developed in various directions and has recently become a focal point of interest. In this paper, a new standard classification of uncertainty modeling techniques for decision making process is proposed. These methods are introduced and compared along with demonstrating their strengths and weaknesses. The promising lines of future researches are explored in the shadow of a comprehensive overview of the past and present applications. The possibility of using the novel concept of Z-numbers is introduced for the first time.

Motivated by the celebrated discrete-time model of nervous activity outlined by McCulloch and Pitts in 1943, we propose a novel continuous-time model, the McCulloch-Pitts network (MPN), for sequence learning in spiking neural networks. Our model has a local learning rule, such that the synaptic weight updates depend only on the information directly accessible by the synapse. By exploiting asymmetry in the connections between binary neurons, we show that MPN can be trained to robustly memorize multiple spatiotemporal patterns of binary vectors, generalizing the ability of the symmetric Hopfield network to memorize static spatial patterns. In addition, we demonstrate that the model can efficiently learn sequences of binary pictures as well as generative models for experimental neural spike-train data. Our learning rule is consistent with spike-timing-dependent plasticity (STDP), thus providing a theoretical ground for the systematic design of biologically inspired networks with large and robust long-range sequence storage capacity.

In the last two years, more than 200 papers have been written on how machine learning (ML) systems can fail because of adversarial attacks on the algorithms and data; this number balloons if we were to incorporate papers covering non-adversarial failure modes. The spate of papers has made it difficult for ML practitioners, let alone engineers, lawyers, and policymakers, to keep up with the attacks against and defenses of ML systems. However, as these systems become more pervasive, the need to understand how they fail, whether by the hand of an adversary or due to the inherent design of a system, will only become more pressing. In order to equip software developers, security incident responders, lawyers, and policy makers with a common vernacular to talk about this problem, we developed a framework to classify failures into ‘Intentional failures’ where the failure is caused by an active adversary attempting to subvert the system to attain her goals; and ‘Unintentional failures’ where the failure is because an ML system produces an inherently unsafe outcome. After developing the initial version of the taxonomy last year, we worked with security and ML teams across Microsoft, 23 external partners, standards organization, and governments to understand how stakeholders would use our framework. Throughout the paper, we attempt to highlight how machine learning failure modes are meaningfully different from traditional software failures from a technology and policy perspective.

Envy-freeness is a widely studied notion in resource allocation, capturing some aspects of fairness. The notion of envy being inherently subjective though, it might be the case that an agent envies another agent, but that she objectively has no reason to do so. The difficulty here is to define the notion of objectivity, since no ground-truth can properly serve as a basis of this definition. A natural approach is to consider the judgement of the other agents as a proxy for objectivity. Building on previous work by Parijs (who introduced ‘unanimous envy’) we propose the notion of approval envy: an agent experiences approval envy towards if she is envious of , and sufficiently many agents agree that this should be the case, from their own perspectives. Some interesting properties of this notion are put forward. Computing the minimal threshold guaranteeing approval envy clearly inherits well-known intractable results from envy-freeness, but (i) we identify some tractable cases such as house allocation; and (ii) we provide a general method based on a mixed integer programming encoding of the problem, which proves to be efficient in practice. This allows us in particular to show experimentally that existence of such allocations, with a rather small threshold, is very often observed.

Convolutional neural network (CNN) is widely used in computer vision applications. In the networks that deal with images, CNNs are the most time-consuming layer of the networks. Usually, the solution to address the computation cost is to decrease the number of trainable parameters. This solution usually comes with the cost of dropping the accuracy. Another problem with this technique is that usually the cost of memory access is not taken into account which results in insignificant speedup gain. The number of operations and memory access in a standard convolution layer is independent of the input content, which makes it limited for certain accelerations. We propose a simple modification to a standard convolution to bridge this gap. We propose an adaptive convolution that adopts different kernel sizes (or radii) based on the content. The network can learn and select the proper radius based on the input content in a soft decision manner. Our proposed radius-adaptive convolutional neural network (RACNN) has a similar number of weights to a standard one, yet, results show it can reach higher speeds. The code has been made available at: https://…/racnn.

The recent progress in neural architectures search (NAS) has allowed scaling the automated design of neural architectures to real-world domains such as object detection and semantic segmentation. However, one prerequisite for the application of NAS are large amounts of labeled data and compute resources. This renders its application challenging in few-shot learning scenarios, where many related tasks need to be learned, each with limited amounts of data and compute time. Thus, few-shot learning is typically done with a fixed neural architecture. To improve upon this, we propose MetaNAS, the first method which fully integrates NAS with gradient-based meta-learning. MetaNAS optimizes a meta-architecture along with the meta-weights during meta-training. During meta-testing, architectures can be adapted to a novel task with a few steps of the task optimizer, that is: task adaptation becomes computationally cheap and requires only little data per task. Moreover, MetaNAS is agnostic in that it can be used with arbitrary model-agnostic meta-learning algorithms and arbitrary gradient-based NAS methods. Empirical results on standard few-shot classification benchmarks show that MetaNAS with a combination of DARTS and REPTILE yields state-of-the-art results.

Graph matching plays a central role in such fields as computer vision, pattern recognition, and bioinformatics. Graph matching problems can be cast as two types of quadratic assignment problems (QAPs): Koopmans-Beckmann’s QAP or Lawler’s QAP. In our paper, we provide a unifying view for these two problems by introducing new rules for array operations in Hilbert spaces. Consequently, Lawler’s QAP can be considered as the Koopmans-Beckmann’s alignment between two arrays in reproducing kernel Hilbert spaces (RKHS), making it possible to efficiently solve the problem without computing a huge affinity matrix. Furthermore, we develop the entropy-regularized Frank-Wolfe (EnFW) algorithm for optimizing QAPs, which has the same convergence rate as the original FW algorithm while dramatically reducing the computational burden for each outer iteration. We conduct extensive experiments to evaluate our approach, and show that our algorithm significantly outperforms the state-of-the-art in both matching accuracy and scalability.

Detecting out-of-distribution examples is important for safety-critical machine learning applications such as self-driving vehicles. However, existing research mainly focuses on small-scale images where the whole image is considered anomalous. We propose to segment only the anomalous regions within an image, and hence we introduce the Combined Anomalous Object Segmentation benchmark for the more realistic task of large-scale anomaly segmentation. Our benchmark combines two novel datasets for anomaly segmentation that incorporate both realism and anomaly diversity. Using both real images and those from a simulated driving environment, we ensure the background context and a wide variety of anomalous objects are naturally integrated, unlike before. Additionally, we improve out-of-distribution detectors on large-scale multi-class datasets and introduce detectors for the previously unexplored setting of multi-label out-of-distribution detection. These novel baselines along with our anomaly segmentation benchmark open the door to further research in large-scale out-of-distribution detection and segmentation.

We address the problem that state-of-the-art Convolution Neural Networks (CNN) classifiers are not invariant to small shifts. The problem can be solved by the removal of sub-sampling operations such as stride and max pooling, but at a cost of severely degraded training and test efficiency. We present a novel usage of Gaussian-Hermite basis to efficiently approximate arbitrary filters within the CNN framework to obtain translation invariance. This is shown to be invariant to small shifts, and preserves the efficiency of training. Further, to improve efficiency in memory usage as well as computational speed, we show that it is still possible to sub-sample with this approach and retain a weaker form of invariance that we call \emph{translation insensitivity}, which leads to stability with respect to shifts. We prove these claims analytically and empirically. Our analytic methods further provide a framework for understanding any architecture in terms of translation insensitivity, and provide guiding principles for design.

Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.

We describe an explainable AI saliency map method for use with deep convolutional neural networks (CNN) that is much more efficient than popular gradient methods. It is also quantitatively similar and better in accuracy. Our technique works by measuring information at the end of each network scale which is then combined into a single saliency map. We describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. Finally, we visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods. Since our method only requires a single forward pass through a few of the layers in a network, it is at least 97x faster than Guided Backprop and much more accurate. Using our method instead of Guided Backprop, class activation methods such as Grad-CAM, Grad-CAM++ and Smooth Grad-CAM++ will run several orders of magnitude faster, have a significantly smaller memory footprint and be more accurate. This will make such methods feasible on resource limited platforms such as robots, cell phones and low cost industrial devices. This will also significantly help them work in extremely data intensive applications such as satellite image processing. All without sacrificing accuracy. Our method is generally straight forward and should be applicable to the most commonly used CNNs. We also show examples of our method used to enhance Grad-CAM++.

In many advanced network analysis applications, like social networks, e-commerce, and network security, hotspots are generally considered as a group of vertices that are tightly connected owing to the similar characteristics, such as common habits and location proximity. In this paper, we investigate the formation of hotspots from an alternative perspective that considers the routes along the network paths as the auxiliary information, and attempt to find the route hotspots in large labeled networks. A route hotspot is a cohesive subgraph that is covered by a set of routes, and these routes correspond to the same sequential pattern consisting of vertices’ labels. To the best of our knowledge, the problem of Finding Route Hotspots in Large Labeled Networks has not been tackled in the literature. However, it is challenging as counting the number of hotspots in a network is #P-hard. Inspired by the observation that the sizes of hotspots decrease with the increasing lengths of patterns, we prove several anti-monotonicity properties of hotspots, and then develop a scalable algorithm called FastRH that can use these properties to effectively prune the patterns that cannot form any hotspots. In addition, to avoid the duplicate computation overhead, we judiciously design an effective index structure called RH-Index for storing the hotspot and pattern information collectively, which also enables incremental updating and efficient query processing. Our experimental results on real-world datasets clearly demonstrate the effectiveness and scalability of our proposed methods.

An ontology language for ontology mediated query answering (OMQA-language) is universal for a family of OMQA-languages if it is the most expressive one among this family. In this paper, we focus on three families of tractable OMQA-languages, including first-order rewritable languages and languages whose data complexity of the query answering is in AC0 or PTIME. On the negative side, we prove that there is, in general, no universal language for each of these families of languages. On the positive side, we propose a novel property, the locality, to approximate the first-order rewritability, and show that there exists a language of disjunctive embedded dependencies that is universal for the family of OMQA-languages with locality. All of these results apply to OMQA with query languages such as conjunctive queries, unions of conjunctive queries and acyclic conjunctive queries.

A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, approximation, finding a nonzero entry in a vector (for turnstile algorithms) are not pseudo-deterministic. For example, in the instance of finding a nonzero entry in a vector, for any known low-space algorithm , there exists a stream so that running twice on (using different randomness) would with high probability result in two different entries as the output. In this work, we study whether it is inherent that these algorithms output different values on different executions. That is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for norm estimation. We also exhibit problems which do have low memory pseudo-deterministic algorithms but no low memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix. We also investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems such that every low space algorithm for the problem must have inputs where there are many valid outputs, all with a significant probability of being outputted.

Learning representations of data is an important problem in statistics and machine learning. While the origin of learning representations can be traced back to factor analysis and multidimensional scaling in statistics, it has become a central theme in deep learning with important applications in computer vision and computational neuroscience. In this article, we review recent advances in learning representations from a statistical perspective. In particular, we review the following two themes: (a) unsupervised learning of vector representations and (b) learning of both vector and matrix representations.

Machine Learning as a Service (MLaaS) has become a growing trend in recent years and several such services are currently offered. MLaaS is essentially a set of services that provides machine learning tools and capabilities as part of cloud computing services. In these settings, the cloud has pre-trained models that are deployed and large computing capacity whereas the clients can use these models to make predictions without having to worry about maintaining the models and the service. However, the main concern with MLaaS is the privacy of the client’s data. Although there have been several proposed approaches in the literature to run machine learning models on encrypted data, the performance is still far from being satisfactory for practical use. In this paper, we aim to accelerate the performance of running machine learning on encrypted data using combination of Fully Homomorphic Encryption (FHE), Convolutional Neural Networks (CNNs) and Graphics Processing Units (GPUs). We use a number of optimization techniques, and efficient GPU-based implementation to achieve high performance. We evaluate a CNN whose architecture is similar to AlexNet to classify homomorphically encrypted samples from the Cars Overhead With Context (COWC) dataset. To the best of our knowledge, it is the first time such a complex network and large dataset is evaluated on encrypted data. Our approach achieved reasonable classification accuracy of 95% for the COWC dataset. In terms of performance, our results show that we could achieve several thousands times speed up when we implement GPU-accelerated FHE operations on encrypted floating point numbers.

Multi-view clustering aims at integrating complementary information from multiple heterogeneous views to improve clustering results. Existing multi-view clustering solutions can only output a single clustering of the data. Due to their multiplicity, multi-view data, can have different groupings that are reasonable and interesting from different perspectives. However, how to find multiple, meaningful, and diverse clustering results from multi-view data is still a rarely studied and challenging topic in multi-view clustering and multiple clusterings. In this paper, we introduce a deep matrix factorization based solution (DMClusts) to discover multiple clusterings. DMClusts gradually factorizes multi-view data matrices into representational subspaces layer-by-layer and generates one clustering in each layer. To enforce the diversity between generated clusterings, it minimizes a new redundancy quantification term derived from the proximity between samples in these subspaces. We further introduce an iterative optimization procedure to simultaneously seek multiple clusterings with quality and diversity. Experimental results on benchmark datasets confirm that DMClusts outperforms state-of-the-art multiple clustering solutions.

We propose an approach towards natural language generation using a bidirectional encoder-decoder which incorporates external rewards through reinforcement learning (RL). We use attention mechanism and maximum mutual information as an initial objective function using RL. Using a two-part training scheme, we train an external reward analyzer to predict the external rewards and then use the predicted rewards to maximize the expected rewards (both internal and external). We evaluate the system on two standard dialogue corpora – Cornell Movie Dialog Corpus and Yelp Restaurant Review Corpus. We report standard evaluation metrics including BLEU, ROUGE-L, and perplexity as well as human evaluation to validate our approach.

We address the problem of disentangled representation learning with independent latent factors in graph convolutional networks (GCNs). The current methods usually learn node representation by describing its neighborhood as a perceptual whole in a holistic manner while ignoring the entanglement of the latent factors. However, a real-world graph is formed by the complex interaction of many latent factors (e.g., the same hobby, education or work in social network). While little effort has been made toward exploring the disentangled representation in GCNs. In this paper, we propose a novel Independence Promoted Graph Disentangled Networks (IPGDN) to learn disentangled node representation while enhancing the independence among node representations. In particular, we firstly present disentangled representation learning by neighborhood routing mechanism, and then employ the Hilbert-Schmidt Independence Criterion (HSIC) to enforce independence between the latent representations, which is effectively integrated into a graph convolutional framework as a regularizer at the output layer. Experimental studies on real-world graphs validate our model and demonstrate that our algorithms outperform the state-of-the-arts by a wide margin in different network applications, including semi-supervised graph classification, graph clustering and graph visualization.

In this extended abstract we introduce a novel {\em control-tutored} Q-learning approach (CTQL) as part of the ongoing effort in developing model-based and safe RL for continuous state spaces. We validate our approach by applying it to a challenging multi-agent herding control problem.

In this paper we propose a statistical model for dynamically evolving networks, together with a variational inference approach. Our model, which we call Dynamic Latent Attribute Interaction Model (DLAIM), encodes edge dependencies across different time snapshots. It represents nodes via latent attributes and uses attribute interaction matrices to model the presence of edges. Both are allowed to evolve with time, thus allowing us to capture the dynamics of the network. We develop a neural network based variational inference procedure that provides a suitable way to learn the model parameters. The main strengths of DLAIM are: (i) it is flexible as it does not impose strict assumptions on network evolution unlike existing approaches, (ii) it applies to both directed as well as undirected networks, and more importantly, (iii) learned node attributes and interaction matrices may be interpretable and therefore provide insights on the mechanisms behind network evolution. Experiments done on real world networks for the task of link forecasting demonstrate the superior performance of our model as compared to existing approaches.

Neural architecture search has recently attracted lots of research efforts as it promises to automate the manual design of neural networks. However, it requires a large amount of computing resources and in order to alleviate this, a performance prediction network has been recently proposed that enables efficient architecture search by forecasting the performance of candidate architectures, instead of relying on actual model training. The performance predictor is task-aware taking as input not only the candidate architecture but also task meta-features and it has been designed to collectively learn from several tasks. In this work, we introduce a pairwise ranking loss for training a network able to rank candidate architectures for a new unseen task conditioning on its task meta-features. We present experimental results, showing that the ranking network is more effective in architecture search than the previously proposed performance predictor.

Despite the effectiveness of sequence-to-sequence framework on the task of Short-Text Conversation (STC), the issue of under-exploitation of training data (i.e., the supervision signals from query text is \textit{ignored}) still remains unresolved. Also, the adopted \textit{maximization}-based decoding strategies, inclined to generating the generic responses or responses with repetition, are unsuited to the STC task. In this paper, we propose to formulate the STC task as a language modeling problem and tailor-make a training strategy to adapt a language model for response generation. To enhance generation performance, we design a relevance-promoting transformer language model, which performs additional supervised source attention after the self-attention to increase the importance of informative query tokens in calculating the token-level representation. The model further refines the query representation with relevance clues inferred from its multiple references during training. In testing, we adopt a \textit{randomization-over-maximization} strategy to reduce the generation of generic responses. Experimental results on a large Chinese STC dataset demonstrate the superiority of the proposed model on relevance metrics and diversity metrics.\footnote{Code available at https://…/.

Embedding large and high dimensional data into low dimensional vector spaces is a necessary task to computationally cope with contemporary data sets. Superseding latent semantic analysis recent approaches like word2vec or node2vec are well established tools in this realm. In the present paper we add to this line of research by introducing fca2vec, a family of embedding techniques for formal concept analysis (FCA). Our investigation contributes to two distinct lines of research. First, we enable the application of FCA notions to large data sets. In particular, we demonstrate how the cover relation of a concept lattice can be retrieved from a computational feasible embedding. Secondly, we show an enhancement for the classical node2vec approach in low dimension. For both directions the overall constraint of FCA of explainable results is preserved. We evaluate our novel procedures by computing fca2vec on different data sets like, wiki44 (a dense part of the Wikidata knowledge graph), the Mushroom data set and a publication network derived from the FCA community.

We propose a method for efficient training of deep Reinforcement Learning (RL) agents when the reward is highly sparse and non-Markovian, but at the same time admits a high-level yet unknown sequential structure, as seen in a number of video games. This high-level sequential structure can be expressed as a computer program, which our method infers automatically as the RL agent explores the environment. Through this process, a high-level sequential task that occurs only rarely may nonetheless be encoded within the inferred program. A hybrid architecture for deep neural fitted Q-iteration is then employed to fill in low-level details and generate an optimal control policy that follows the structure of the program. Our experiments show that the agent is able to synthesise a complex program to guide the RL exploitation phase, which is otherwise difficult to achieve with state-of-the-art RL techniques.

Machine learning systems based on deep neural networks (DNNs) produce state-of-the-art results in many applications. Considering the large amount of training data and know-how required to generate the network, it is more practical to use third-party DNN intellectual property (IP) cores for many designs. No doubt to say, it is essential for DNN IP vendors to provide test cases for functional validation without leaking their parameters to IP users. To satisfy this requirement, we propose to effectively generate test cases that activate parameters as many as possible and propagate their perturbations to outputs. Then the functionality of DNN IPs can be validated by only checking their outputs. However, it is difficult considering large numbers of parameters and highly non-linearity of DNNs. In this paper, we tackle this problem by judiciously selecting samples from the DNN training set and applying a gradient-based method to generate new test cases. Experimental results demonstrate the efficacy of our proposed solution.

Learning meaningful graphs from data plays important roles in many data mining and machine learning tasks, such as data representation and analysis, dimension reduction, data clustering, and visualization, etc. In this work, for the first time, we present a highly-scalable spectral approach (GRASPEL) for learning large graphs from data. By limiting the precision matrix to be a graph Laplacian, our approach aims to estimate ultra-sparse (tree-like) weighted undirected graphs and shows a clear connection with the prior graphical Lasso method. By interleaving the latest high-performance nearly-linear time spectral methods for graph sparsification, coarsening and embedding, ultra-sparse yet spectrally-robust graphs can be learned by identifying and including the most spectrally-critical edges into the graph. Compared with prior state-of-the-art graph learning approaches, GRASPEL is more scalable and allows substantially improving computing efficiency and solution quality of a variety of data mining and machine learning applications, such as spectral clustering (SC), and t-Distributed Stochastic Neighbor Embedding (t-SNE). {For example, when comparing with graphs constructed using existing methods, GRASPEL achieved the best spectral clustering efficiency and accuracy.

We propose a means to relate properties of an interconnected system to its separate component systems in the presence of cascade-like phenomena. Building on a theory of interconnection reminiscent of the behavioral approach to systems theory, we introduce the notion of generativity, and its byproduct, generative effects. Cascade effects, enclosing contagion phenomena and cascading failures, are seen as instances of generative effects. The latter are precisely the instances where properties of interest are not preserved or behave very badly when systems interact. The goal is to overcome that obstruction. We will show how to extract mathematical objects from the systems, that encode their generativity: their potential to generate new phenomena upon interaction. Those objects may then be used to link the properties of the interconnected system to its separate systems. Such a link will be executed through the use of exact sequences from commutative algebra.

Recently, neural networks have been used as implicit representations for surface reconstruction, modelling, learning, and generation. So far, training neural networks to be implicit representations of surfaces required training data sampled from a ground-truth signed implicit functions such as signed distance or occupancy functions, which are notoriously hard to compute. In this paper we introduce Sign Agnostic Learning (SAL), a deep learning approach for learning implicit shape representations directly from raw, unsigned geometric data, such as point clouds and triangle soups. We have tested SAL on the challenging problem of surface reconstruction from an un-oriented point cloud, as well as end-to-end human shape space learning directly from raw scans dataset, and achieved state of the art reconstructions compared to current approaches. We believe SAL opens the door to many geometric deep learning applications with real-world data, alleviating the usual painstaking, often manual pre-process.

Selecting and combining the outlier scores of different base detectors used within outlier ensembles can be quite challenging in the absence of ground truth. In this paper, an unsupervised outlier detector combination framework called DCSO is proposed, demonstrated and assessed for the dynamic selection of most competent base detectors, with an emphasis on data locality. The proposed DCSO framework first defines the local region of a test instance by its k nearest neighbors and then identifies the top-performing base detectors within the local region. Experimental results on ten benchmark datasets demonstrate that DCSO provides consistent performance improvement over existing static combination approaches in mining outlying objects. To facilitate interpretability and reliability of the proposed method, DCSO is analyzed using both theoretical frameworks and visualization techniques, and presented alongside empirical parameter setting instructions that can be used to improve the overall performance.

Joint extraction of entities and relations has received significant attention due to its potential of providing higher performance for both tasks. Among existing methods, CopyRE is effective and novel, which uses a sequence-to-sequence framework and copy mechanism to directly generate the relation triplets. However, it suffers from two fatal problems. The model is extremely weak at differing the head and tail entity, resulting in inaccurate entity extraction. It also cannot predict multi-token entities (e.g. \textit{Steven Jobs}). To address these problems, we give a detailed analysis of the reasons behind the inaccurate entity extraction problem, and then propose a simple but extremely effective model structure to solve this problem. In addition, we propose a multi-task learning framework equipped with copy mechanism, called CopyMTL, to allow the model to predict multi-token entities. Experiments reveal the problems of CopyRE and show that our model achieves significant improvement over the current state-of-the-art method by 9% in NYT and 16% in WebNLG (F1 score). Our code is available at https://…/CopyMTL

Answering questions that require multi-hop reasoning at web-scale necessitates retrieving multiple evidence documents, one of which often has little lexical or semantic relationship to the question. This paper introduces a new graph-based recurrent retrieval approach that learns to retrieve reasoning paths over the Wikipedia graph to answer multi-hop open-domain questions. Our retriever model trains a recurrent neural network that learns to sequentially retrieve evidence paragraphs in the reasoning path by conditioning on the previously retrieved documents. Our reader model ranks the reasoning paths and extracts the answer span included in the best reasoning path. Experimental results show state-of-the-art results in three open-domain QA datasets, showcasing the effectiveness and robustness of our method. Notably, our method achieves significant improvement in HotpotQA, outperforming the previous best model by more than 14 points.

As deep learning techniques advance more than ever, hyper-parameter optimization is the new major workload in deep learning clusters. Although hyper-parameter optimization is crucial in training deep learning models for high model performance, effectively executing such a computation-heavy workload still remains a challenge. We observe that numerous trials issued from existing hyper-parameter optimization algorithms share common hyper-parameter sequence prefixes, which implies that there are redundant computations from training the same hyper-parameter sequence multiple times. We propose a stage-based execution strategy for efficient execution of hyper-parameter optimization algorithms. Our strategy removes redundancy in the training process by splitting the hyper-parameter sequences of trials into homogeneous stages, and generating a tree of stages by merging the common prefixes. Our preliminary experiment results show that applying stage-based execution to hyper-parameter optimization algorithms outperforms the original trial-based method, saving required GPU-hours and end-to-end training time by up to 6.60 times and 4.13 times, respectively.

Discriminative feature representation of person image is important for person re-identification (Re-ID) task. Recently, attributes have been demonstrated beneficially in guiding for learning more discriminative feature representations for Re-ID. As attributes normally co-occur in person images, it is desirable to model the attribute dependencies to improve the attribute prediction and thus Re-ID results. In this paper, we propose to model these attribute dependencies via a novel attribute knowledge graph (AttKG), and propose a novel Attribute Knowledge Graph Convolutional Network (AttKGCN) to solve Re-ID problem. AttKGCN integrates both attribute prediction and Re-ID learning together in a unified end-to-end framework which can boost their performances, respectively. AttKGCN first builds a directed attribute KG whose nodes denote attributes and edges encode the co-occurrence relationships of different attributes. Then, AttKGCN learns a set of inter-dependent attribute classifiers which are combined with person visual descriptors for attribute prediction. Finally, AttKGCN integrates attribute description and deeply visual representation together to construct a more discriminative feature representation for Re-ID task. Extensive experiments on several benchmark datasets demonstrate the effectiveness of AttKGCN on attribute prediction and Re-ID tasks.

In many real-world applications of machine learning, data are distributed across many clients and cannot leave the devices they are stored on. Furthermore, each client’s data, computational resources and communication constraints may be very different. This setting is known as federated learning, in which privacy is a key concern. Differential privacy is commonly used to provide mathematical privacy guarantees. This work, to the best of our knowledge, is the first to consider federated, differentially private, Bayesian learning. We build on Partitioned Variational Inference (PVI) which was recently developed to support approximate Bayesian inference in the federated setting. We modify the client-side optimisation of PVI to provide an (, )-DP guarantee. We show that it is possible to learn moderately private logistic regression models in the federated setting that achieve similar performance to models trained non-privately on centralised data.

We investigate an algorithm named histogram transform ensembles (HTE) density estimator whose effectiveness is supported by both solid theoretical analysis and significant experimental performance. On the theoretical side, by decomposing the error term into approximation error and estimation error, we are able to conduct the following analysis: First of all, we establish the universal consistency under -norm. Secondly, under the assumption that the underlying density function resides in the H\'{o}lder space , we prove almost optimal convergence rates for both single and ensemble density estimators under -norm and -norm for different tail distributions, whereas in contrast, for its subspace consisting of smoother functions, almost optimal convergence rates can only be established for the ensembles and the lower bound of the single estimators illustrates the benefits of ensembles over single density estimators. In the experiments, we first carry out simulations to illustrate that histogram transform ensembles surpass single histogram transforms, which offers powerful evidence to support the theoretical results in the space . Moreover, to further exert the experimental performances, we propose an adaptive version of HTE and study the parameters by generating several synthetic datasets with diversities in dimensions and distributions. Last but not least, real data experiments with other state-of-the-art density estimators demonstrate the accuracy of the adaptive HTE algorithm.

We present a facial landmark position correlation analysis as well as its applications. Although numerous facial landmark detection methods have been presented in the literature, few of them concern the intrinsic relationship among the landmarks. In order to reveal and interpret this relationship, we propose to analyze the facial landmark correlation by using Canonical Correlation Analysis (CCA). We experimentally show that dense facial landmark annotations in current benchmarks are strongly correlated, and we propose several applications based on this analysis. First, we give insights into the predictions from different facial landmark detection models (including cascaded random forests, cascaded Convolutional Neural Networks (CNNs), heatmap regression models) and interpret how CNNs progressively learn to predict facial landmarks. Second, we propose a few-shot learning method that allows to considerably reduce manual effort for dense landmark annotation. To this end, we select a portion of landmarks from the dense annotation format to form a sparse format, which is mostly correlated to the rest of them. Thanks to the strong correlation among the landmarks, the entire set of dense facial landmarks can then be inferred from the annotation in the sparse format by transfer learning. Unlike the previous methods, we mainly focus on how to find the most efficient sparse format to annotate. Overall, our correlation analysis provides new perspectives for the research on facial landmark detection.

Recent years have witnessed significant advances in reinforcement learning (RL), which has registered great success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history, and has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field on the mark, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as continuing stimulus for researchers interested in working on this exciting while challenging topic.

This paper considers online convex optimization (OCO) problems – the paramount framework for online learning algorithm design. The loss function of learning task in OCO setting is based on streaming data so that OCO is a powerful tool to model large scale applications such as online recommender systems. Meanwhile, real-world data are usually of extreme high-dimensional due to modern feature engineering techniques so that the quadratic regression is impractical. Factorization Machine as well as its variants are efficient models for capturing feature interactions with low-rank matrix model but they can’t fulfill the OCO setting due to their non-convexity. In this paper, We propose a projective quadratic regression (PQR) model. First, it can capture the import second-order feature information. Second, it is a convex model, so the requirements of OCO are fulfilled and the global optimal solution can be achieved. Moreover, existing modern online optimization methods such as Online Gradient Descent (OGD) or Follow-The-Regularized-Leader (FTRL) can be applied directly. In addition, by choosing a proper hyper-parameter, we show that it has the same order of space and time complexity as the linear model and thus can handle high-dimensional data. Experimental results demonstrate the performance of the proposed PQR model in terms of accuracy and efficiency by comparing with the state-of-the-art methods.

Mutual information is widely applied to learn latent representations of observations, whilst its implication in classification neural networks remain to be better explained. In this paper, we show that optimising the parameters of classification neural networks with softmax cross-entropy is equivalent to maximising the mutual information between inputs and labels under the balanced data assumption. Through the experiments on synthetic and real datasets, we show that softmax cross-entropy can estimate mutual information approximately. When applied to image classification, this relation helps approximate the point-wise mutual information between an input image and a label without modifying the network structure. In this end, we propose infoCAM, informative class activation map, which highlights regions of the input image that are the most relevant to a given label based on differences in information. The activation map helps localise the target object in an image. Through the experiments on the semi-supervised object localisation task with two real-world datasets, we evaluate the effectiveness of the information-theoretic approach.

Recent advances in adversarial attacks uncover the intrinsic vulnerability of modern deep neural networks. Since then, extensive efforts have been devoted to enhancing the robustness of deep networks via specialized learning algorithms and loss functions. In this work, we take an architectural perspective and investigate the patterns of network architectures that are resilient to adversarial attacks. To obtain the large number of networks needed for this study, we adopt one-shot neural architecture search, training a large network for once and then finetuning the sub-networks sampled therefrom. The sampled architectures together with the accuracies they achieve provide a rich basis for our study. Our ‘robust architecture Odyssey’ reveals several valuable observations: 1) densely connected patterns result in improved robustness; 2) under computational budget, adding convolution operations to direct connection edge is effective; 3) flow of solution procedure (FSP) matrix is a good indicator of network robustness. Based on these observations, we discover a family of robust architectures (RobNets). On various datasets, including CIFAR, SVHN, and Tiny-ImageNet, RobNets exhibit superior robustness performance to other widely used architectures. Notably, RobNets substantially improve the robust accuracy (~5% absolute gains) under both white-box and black-box attacks, even with fewer parameter numbers.

The interactions of users and items in recommender system could be naturally modeled as a user-item bipartite graph. In recent years, we have witnessed an emerging research effort in exploring user-item graph for collaborative filtering methods. Nevertheless, the formation of user-item interactions typically arises from highly complex latent purchasing motivations, such as high cost performance or eye-catching appearance, which are indistinguishably represented by the edges. The existing approaches still remain the differences between various purchasing motivations unexplored, rendering the inability to capture fine-grained user preference. Therefore, in this paper we propose a novel Multi-Component graph convolutional Collaborative Filtering (MCCF) approach to distinguish the latent purchasing motivations underneath the observed explicit user-item interactions. Specifically, there are two elaborately designed modules, decomposer and combiner, inside MCCF. The former first decomposes the edges in user-item graph to identify the latent components that may cause the purchasing relationship; the latter then recombines these latent components automatically to obtain unified embeddings for prediction. Furthermore, the sparse regularizer and weighted random sample strategy are utilized to alleviate the overfitting problem and accelerate the optimization. Empirical results on three real datasets and a synthetic dataset not only show the significant performance gains of MCCF, but also well demonstrate the necessity of considering multiple components.

We consider the problem of selecting a seed set to maximize the expected number of influenced nodes in the social network, referred to as the \textit{influence maximization} (IM) problem. We assume that the topology of the social network is prescribed while the influence probabilities among edges are unknown. In order to learn the influence probabilities and simultaneously maximize the influence spread, we consider the tradeoff between exploiting the current estimation of the influence probabilities to ensure certain influence spread and exploring more nodes to learn better about the influence probabilities. The exploitation-exploration trade-off is the core issue in the multi-armed bandit (MAB) problem. If we regard the influence spread as the reward, then the IM problem could be reduced to the combinatorial multi-armed bandits. At each round, the learner selects a limited number of seed nodes in the social network, then the influence spreads over the network according to the real influence probabilities. The learner could observe the activation status of the edge if and only if its start node is influenced, which is referred to as the edge-level semi-bandit feedback. Two classical bandit algorithms including Thompson Sampling and Epsilon Greedy are used to solve this combinatorial problem. To ensure the robustness of these two algorithms, we use an automatic ensemble learning strategy, which combines the exploration strategy with exploitation strategy. The ensemble algorithm is self-adaptive regarding that the probability of each algorithm could be adjusted based on the historical performance of the algorithm. Experimental evaluation illustrates the effectiveness of the automatically adjusted hybridization of exploration algorithm with exploitation algorithm.

Multiplex networks are convenient mathematical representations for many real-world — biological, social, and technological — systems of interacting elements, where pairwise interactions among elements have different flavors. Previous studies pointed out that real-world multiplex networks display significant inter-layer correlations — degree-degree correlation, edge overlap, node similarities — able to make them robust against random and targeted failures of their individual components. Here, we show that inter-layer correlations are important also in the characterization of their -core structure, namely the organization in shells of nodes with increasingly high degree. Understanding -core structures is important in the study of spreading processes taking place on networks, as for example in the identification of influential spreaders and the emergence of localization phenomena. We find that, if the degree distribution of the network is heterogeneous, then a strong -core structure is well predicted by significantly positive degree-degree correlations. However, if the network degree distribution is homogeneous, then strong -core structure is due to positive correlations at the level of node similarities. We reach our conclusions by analyzing different real-world multiplex networks, introducing novel techniques for controlling inter-layer correlations of networks without changing their structure, and taking advantage of synthetic network models with tunable levels of inter-layer correlations.

One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehensive set of relevant arguments, over a wide range of topics, it requires leveraging a large and diverse corpus in an appropriate manner. Here we present a first end-to-end high-precision, corpus-wide argument mining system. This is made possible by combining sentence-level queries over an appropriate indexing of a very large corpus of newspaper articles, with an iterative annotation scheme. This scheme addresses the inherent label bias in the data and pinpoints the regions of the sample space whose manual labeling is required to obtain high-precision among top-ranked candidates.

We present Neural Random Forest Imitation – a novel approach for transforming random forests into neural networks. Existing methods produce very inefficient architectures and do not scale. In this paper, we introduce a new method for generating data from a random forest and learning a neural network that imitates it. Without any additional training data, this transformation creates very efficient neural networks that learn the decision boundaries of a random forest. The generated model is fully differentiable and can be combined with the feature extraction in a single pipeline enabling further end-to-end processing. Experiments on several real-world benchmark datasets demonstrate outstanding performance in terms of scalability, accuracy, and learning with very few training examples. Compared to state-of-the-art mappings, we significantly reduce the network size while achieving the same or even improved accuracy due to better generalization.

This research proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or . Within the context of topic modeling, has the same interpretation that it does when used in a broader class of statistical models. Reporting with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. The author proposes that should be reported as a standard metric when constructing topic models.

In this chapter, we give an introduction to symbolic artificial intelligence (AI) and discuss its relation and application to multimedia. We begin by defining what symbolic AI is, what distinguishes it from non-symbolic approaches, such as machine learning, and how it can used in the construction of advanced multimedia applications. We then introduce description logic (DL) and use it to discuss symbolic representation and reasoning. DL is the logical underpinning of OWL, the most successful family of ontology languages. After discussing DL, we present OWL and related Semantic Web technologies, such as RDF and SPARQL. We conclude the chapter by discussing a hybrid model for multimedia representation, called Hyperknowledge. Throughout the text, we make references to technologies and extensions specifically designed to solve the kinds of problems that arise in multimedia representation.

I overview recent research advances in Bayesian state-space modeling of multivariate time series. A main focus is on the decouple/recouple concept that enables application of state-space models to increasingly large-scale data, applying to continuous or discrete time series outcomes. The scope includes large-scale dynamic graphical models for forecasting and multivariate volatility analysis in areas such as economics and finance, multi-scale approaches for forecasting discrete/count time series in areas such as commercial sales and demand forecasting, and dynamic network flow models for areas including internet traffic monitoring. In applications, explicit forecasting, monitoring and decision goals are paramount and should factor into model assessment and comparison, a perspective that is highlighted.

There is an increasing number of pre-trained deep neural network models. However, it is still unclear how to effectively use these models for a new task. Transfer learning, which aims to transfer knowledge from source tasks to a target task, is an effective solution to this problem. Fine-tuning is a popular transfer learning technique for deep neural networks where a few rounds of training are applied to the parameters of a pre-trained model to adapt them to a new task. Despite its popularity, in this paper, we show that fine-tuning suffers from several drawbacks. We propose an adaptive fine-tuning approach, called AdaFilter, which selects only a part of the convolutional filters in the pre-trained model to optimize on a per-example basis. We use a recurrent gated network to selectively fine-tune convolutional filters based on the activations of the previous layer. We experiment with 7 public image classification datasets and the results show that AdaFilter can reduce the average classification error of the standard fine-tuning by 2.54%.

For Bayesian computation in big data contexts, the divide-and-conquer MCMC concept splits the whole data set into batches, runs MCMC algorithms separately over each batch to produce samples of parameters, and combines them to produce an approximation of the target distribution. In this article, we embed random forests into this framework and use each subposterior/partial-posterior as a proposal distribution to implement importance sampling. Unlike the existing divide-and-conquer MCMC, our methods are based on scaled subposteriors, whose scale factors are not necessarily restricted to being equal to one or to the number of subsets. Through several experiments, we show that our methods work well with models ranging from Gaussian cases to strongly non-Gaussian cases, and include model misspecification.

Humans can learn a variety of concepts and skills incrementally over the course of their lives while exhibiting an array of desirable properties, such as non-forgetting, concept rehearsal, forward transfer and backward transfer of knowledge, and so on. Previous approaches to lifelong learning (LLL) have demonstrated subsets of these properties, often with multiple mechanisms. In this paper, we propose a simple yet powerful unified framework that demonstrates all of these desirable properties. Our novel framework utilizes a small number of weight consolidation parameters dynamically applied to groups of weights, reflecting how ‘stiff’ weights can be modified during training in deep neural networks. In addition, we are able to draw many parallels between the behaviours and mechanisms of our model and those surrounding human learning, such as memory loss or sleep deprivation. This allows our approach to serve as a conduit for two-way inspiration to further understand lifelong learning in machines and humans.

Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity – introducing attitudes via framing, presupposing truth, and casting doubt – remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view (‘neutralizing’ biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.

Historically, the pursuit of efficient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include: the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly, which we open-source for the benefit of the community as part of the XNNPACK library. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EfficientNet architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by — equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures.

We present the Global Health Monitor, an online Web-based system for detecting and mapping infectious disease outbreaks that appear in news stories. The system analyzes English news stories from news feed providers, classifies them for topical relevance and plots them onto a Google map using geo-coding information, helping public health workers to monitor the spread of diseases in a geo-temporal context. The background knowledge for the system is contained in the BioCaster ontology (BCO) (Collier et al., 2007a) which includes both information on infectious diseases as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and visualization. Evaluation of the system shows that it achieved high accuracy on a gold standard corpus. The system is now in practical use. Running on a clustercomputer, it monitors more than 1500 news feeds 24/7, updating the map every hour.

Batch Normalization (BN) is a highly successful and widely used batch dependent training method. Its use of mini-batch statistics to normalize the activations introduces dependence between samples, which can hurt the training if the mini-batch size is too small, or if the samples are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address these issues. However, they either do not match the performance of BN for large batches, or still exhibit degradation in performance for smaller batches, or introduce artificial constraints on the model architecture. In this paper we propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function, that can be used as a drop-in replacement for other normalizations and activations. Our method operates on each activation map of each batch sample independently, eliminating the dependency on other batch samples or channels of the same sample. Our method outperforms BN and all alternatives in a variety of settings for all batch sizes. FRN layer performs better on top-1 validation accuracy than BN with large mini-batch sizes on Imagenet classification on InceptionV3 and ResnetV2-50 architectures. Further, it performs better than GN on the same problem in the small mini-batch size regime. For object detection problem on COCO dataset, FRN layer outperforms all other methods by at least in all batch size regimes.

In this paper, we study normalization methods for neural networks from the perspective of elimination singularity. Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape which will slow down training and harm model performances. We show that channel-based normalizations (e.g. Layer Normalization and Group Normalization) are unable to guarantee a far distance from elimination singularities, in contrast with Batch Normalization which by design avoids models from getting too close to them. To address this issue, we propose BatchChannel Normalization (BCN), which uses batch knowledge to avoid the elimination singularities in the training of channel-normalized models. Unlike Batch Normalization, BCN is able to run in both large-batch and micro-batch training settings. The effectiveness of BCN is verified on many tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is here: https://…/Batch-Channel-Normalization.

Entity linking is the task of linking mentions of named entities in natural language text, to entities in a curated knowledge-base. This is of significant importance in the biomedical domain, where it could be used to semantically annotate a large volume of clinical records and biomedical literature, to standardized concepts described in an ontology such as Unified Medical Language System (UMLS). We observe that with precise type information, entity disambiguation becomes a straightforward task. However, fine-grained type information is usually not available in biomedical domain. Thus, we propose LATTE, a LATent Type Entity Linking model, that improves entity linking by modeling the latent fine-grained type information about mentions and entities. Unlike previous methods that perform entity linking directly between the mentions and the entities, LATTE jointly does entity disambiguation, and latent fine-grained type learning, without direct supervision. We evaluate our model on two biomedical datasets: MedMentions, a large scale public dataset annotated with UMLS concepts, and a de-identified corpus of dictated doctor’s notes that has been annotated with ICD concepts. Extensive experimental evaluation shows our model achieves significant performance improvements over several state-of-the-art techniques.

Listwise learning-to-rank methods form a powerful class of ranking algorithms that are widely adopted in applications such as information retrieval. These algorithms learn to rank a set of items by optimizing a loss that is a function of the entire set—as a surrogate to a typically non-differentiable ranking metric. Despite their empirical success, existing listwise methods are based on heuristics and remain theoretically ill-understood. In particular, none of the empirically-successful loss functions are related to ranking metrics. In this work, we propose a cross entropy-based learning-to-rank loss function that is theoretically sound and is a convex bound on NDCG, a popular ranking metric. Furthermore, empirical evaluation of an implementation of the proposed method with gradient boosting machines on benchmark learning-to-rank datasets demonstrates the superiority of our proposed formulation over existing algorithms in quality and robustness.

The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT) for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand; without encountering difficulties during training (e.g. overfitting) even on relatively low-resource setups. All the attention mechanisms within FMT have a full time-domain receptive field which allows them to asynchronously capture long-range multimodal dynamics. In our experiments we focus on datasets that contain the three commonly studied modalities of language, vision and acoustic. We perform a wide range of experiments, spanning across 3 well-studied datasets and 21 distinct labels. FMT shows superior performance over previously proposed models, setting new state of the art in the studied datasets.

Artificial Intelligence (AI) has become an integral part of modern-day security solutions for its capability of learning very complex functions and handling ‘Big Data’. However, the lack of explainability and interpretability of successful AI models is a key stumbling block when trust in a model’s prediction is critical. This leads to human intervention, which in turn results in a delayed response or decision. While there have been major advancements in the speed and performance of AI-based intrusion detection systems, the response is still at human speed when it comes to explaining and interpreting a specific prediction or decision. In this work, we infuse popular domain knowledge (i.e., CIA principles) in our model for better explainability and validate the approach on a network intrusion detection test case. Our experimental results suggest that the infusion of domain knowledge provides better explainability as well as a faster decision or response. In addition, the infused domain knowledge generalizes the model to work well with unknown attacks, as well as open the path to adapt to a large stream of network traffic from numerous IoT devices.

Scarcity of labeled data is a bottleneck for supervised learning models. A paradigm that has evolved for dealing with this problem is data programming. An existing data programming paradigm allows human supervision to be provided as a set of discrete labeling functions (LF) that output possibly noisy labels to input instances and a generative modelfor consolidating the weak labels. We enhance and generalize this paradigm by supporting functions that output a continuous score (instead of a hard label) that noisily correlates with labels. We show across five applications that continuous LFs are more natural to program and lead to improved recall. We also show that accuracy of existing generative models is unstable with respect to initialization, training epochs, and learning rates. We give control to the data programmer to guide the training process by providing intuitive quality guides with each LF. We propose an elegant method of incorporating these guides into the generative model. Our overall method, called CAGE, makes the data programming paradigm more reliable than other tricks based on initialization, sign-penalties, or soft-accuracy constraints.

The diversity of intrinsic qualities of multimedia entities tends to impede their effective retrieval. In a SelfLearning Search Engine architecture, the subtle nuances of human perceptions and deep knowledge are taught and captured through unsupervised reinforcement learning, where the degree of reinforcement may be suitably calibrated. Such architectural paradigm enables indexes to evolve naturally while accommodating the dynamic changes of user interests. It operates by continuously constructing indexes over time, while injecting progressive improvement in search performance. For search operations to be effective, convergence of index learning is of crucial importance to ensure efficiency and robustness. In this paper, we develop a Self-Learning Search Engine architecture based on reinforcement learning using a Markov Decision Process framework. The balance between exploration and exploitation is achieved through evolutionary exploration Strategies. The evolutionary index learning behavior is then studied and formulated using stochastic analysis. Experimental results are presented which corroborate the steady convergence of the index evolution mechanism. Index Term

High performance machine learning models have become highly dependent on the availability of large quantity and quality of training data. To achieve this, various central agencies such as the government have suggested for different data providers to pool their data together to learn a unified predictive model, which performs better. However, these providers are usually profit-driven and would only agree to participate inthe data sharing process if the process is deemed both profitable and fair for themselves. Due to the lack of existing literature, it is unclear whether a fair and stable outcome is possible in such data sharing processes. Hence, we wish to investigate the outcomes surrounding these scenarios and study if data providers would even agree to collaborate in the first place. Tapping on cooperative game concepts in Game Theory, we introduce the data sharing process between a group of agents as a new class of cooperative games with modified definition of stability and fairness. Using these new definitions, we then theoretically study the optimal and suboptimal outcomes of such data sharing processes and their sensitivity to perturbation.Through experiments, we present intuitive insights regarding theoretical results analysed in this paper and discuss various ways in which data can be valued reasonably.

Age-of-information (AoI) is a metric quantifying information freshness at the receiver. It has received a lot of attention as it captures the delay together with packet loss and packet generation rate. However, most of the work aim for the average or peak AoI neglecting the complete distribution. In this work, we consider a n-hop network between a source and destination pair with time-invariant packet loss probabilities on each link. We derive closed form equations for the probability mass function of AoI at the receiver. We verify our findings with simulations. Our results show that the performance indicators considered in the literature such as average AoI or peak AoI may give misleading insights into the complete AoI performance.

Loss functions play a crucial role in deep metric learning thus a variety of them have been proposed. Some supervise the learning process by pairwise or tripletwise similarity constraints while others take advantage of structured similarity information among multiple data points. In this work, we approach deep metric learning from a novel perspective. We propose instance cross entropy (ICE) which measures the difference between an estimated instance-level matching distribution and its ground-truth one. ICE has three main appealing properties. Firstly, similar to categorical cross entropy (CCE), ICE has clear probabilistic interpretation and exploits structured semantic similarity information for learning supervision. Secondly, ICE is scalable to infinite training data as it learns on mini-batches iteratively and is independent of the training set size. Thirdly, motivated by our relative weight analysis, seamless sample reweighting is incorporated. It rescales samples’ gradients to control the differentiation degree over training examples instead of truncating them by sample mining. In addition to its simplicity and intuitiveness, extensive experiments on three real-world benchmarks demonstrate the superiority of ICE.

-NN classifier is one of the most famous classification algorithms, whose performance is crucially dependent on the distance metric. When we consider the distance metric as a parameter of -NN, learning an appropriate distance metric for -NN can be seen as minimizing the empirical risk of -NN. In this paper, we design a new type of continuous decision function of the -NN classification rule which can be used to construct the continuous empirical risk function of -NN. By minimizing this continuous empirical risk function, we obtain a novel distance metric learning algorithm named as adaptive nearest neighbor (ANN). We have proved that the current algorithms such as the large margin nearest neighbor (LMNN), neighbourhood components analysis (NCA) and the pairwise constraint methods are special cases of the proposed ANN by setting the parameter different values. Compared with the LMNN, NCA, and pairwise constraint methods, our method has a broader searching space which may contain better solutions. At last, extensive experiments on various data sets are conducted to demonstrate the effectiveness and efficiency of the proposed method.

We consider the problem of reinforcing federated learning with formal privacy guarantees. We propose to employ Bayesian differential privacy, a relaxation of differential privacy for similarly distributed data, to provide sharper privacy loss bounds. We adapt the Bayesian privacy accounting method to the federated setting and suggest multiple improvements for more efficient privacy budgeting at different levels. Our experiments show significant advantage over the state-of-the-art differential privacy bounds for federated learning on image classification tasks, including a medical application, bringing the privacy budget below 1 at the client level, and below 0.1 at the instance level. Lower amounts of noise also benefit the model accuracy and reduce the number of communication rounds.

The ability to observe the effects of actions performed by others and to infer their intent, most likely goals, or course of action, is known as a plan or intention recognition cognitive capability and has long been one of the fundamental research challenges in AI. Deep learning has recently been making significant inroads on various pattern recognition problems, except for intention recognition. While extensively explored since the seventies, the problem remains unsolved for most interesting cases in various areas, ranging from natural language understanding to human behavior understanding based on video feeds. This paper compares symbolic inverse planning, one of the most investigated approaches to goal recognition, to deep learning using CNN and LTSM neural network architectures, on five synthetic benchmarks often used in the literature. The results show that the deep learning approach achieves better goal-prediction accuracy and timeliness than the symbolic cost-based plan recognizer in these domains. Although preliminary, these results point to interesting future research avenues.

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that ‘adapts’ to its current learning state and estimates the importance of each training data instance. Training such an adaptive scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance at a given time during the training, it is typically necessary to first complete the entire training process. To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). In DDS, we formulate a scorer network as a learnable function of the training data, which can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weigh the data that has a similar gradient with a dev set upon which we would finally like to perform well. Without significant computing overhead, DDS delivers strong and consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.

The process of identifying and understanding art styles to discover artistic influences is essential to the study of art history. Traditionally, trained experts review fine details of the works and compare them to other known works. To automate and scale this task, we use several state-of-the-art CNN architectures to explore how a machine may help perceive and quantify art styles. This study explores: (1) How accurately can a machine classify art styles? (2) What may be the underlying relationships among different styles and artists? To help answer the first question, our best-performing model using Inception V3 achieves a 9-class classification accuracy of 88.35%, which outperforms the model in Elgammal et al.’s study by more than 20 percent. Visualizations using Grad-CAM heat maps confirm that the model correctly focuses on the characteristic parts of paintings. To help address the second question, we conduct network analysis on the influences among styles and artists by extracting 512 features from the best-performing classification model. Through 2D and 3D T-SNE visualizations, we observe clear chronological patterns of development and separation among the art styles. The network analysis also appears to show anticipated artist level connections from an art historical perspective. This technique appears to help identify some previously unknown linkages that may shed light upon new directions for further exploration by art historians. We hope that humans and machines working in concert may bring new opportunities to the field.

Artificial Intelligence (AI) has become an integral part of domains such as security, finance, healthcare, medicine, and criminal justice. Explaining the decisions of AI systems in human terms is a key challenge–due to the high complexity of the model, as well as the potential implications on human interests, rights, and lives . While Explainable AI is an emerging field of research, there is no consensus on the definition, quantification, and formalization of explainability. In fact, the quantification of explainability is an open challenge. In our previous work, we incorporated domain knowledge for better explainability, however, we were unable to quantify the extent of explainability. In this work, we (1) briefly analyze the definitions of explainability from the perspective of different disciplines (e.g., psychology, social science), properties of explanation, explanation methods, and human-friendly explanations; and (2) propose and formulate an approach to quantify the extent of explainability. Our experimental result suggests a reasonable and model-agnostic way to quantify explainability

Multi-agent coordination is prevalent in many real-world applications. However, such coordination is challenging due to its combinatorial nature. An important observation in this regard is that agents in the real world often only directly affect a limited set of neighboring agents. Leveraging such loose couplings among agents is key to making coordination in multi-agent systems feasible. In this work, we focus on learning to coordinate. Specifically, we consider the multi-agent multi-armed bandit framework, in which fully cooperative loosely-coupled agents must learn to coordinate their decisions to optimize a common objective. As opposed to in the planning setting, for learning methods it is challenging to establish theoretical guarantees. We propose multi-agent Thompson sampling (MATS), a new Bayesian exploration-exploitation algorithm that leverages loose couplings. We provide a regret bound that is sublinear in time and low-order polynomial in the highest number of actions of a single agent for sparse coordination graphs. Finally, we empirically show that MATS outperforms the state-of-the-art algorithm, MAUCE, on two synthetic benchmarks, a realistic wind farm control task, and a novel benchmark with Poisson distributions.

In this work we have analyzed a novel concept of sequential binding based learning capable network based on the coupling of recurrent units with Bayesian prior definition. The coupling structure encodes to generate efficient tensor representations that can be decoded to generate efficient sentences and can describe certain events. These descriptions are derived from structural representations of visual features of images and media. An elaborated study of the different types of coupling recurrent structures are studied and some insights of their performance are provided. Supervised learning performance for natural language processing is judged based on statistical evaluations, however, the truth is perspective, and in this case the qualitative evaluations reveal the real capability of the different architectural strengths and variations. Bayesian prior definition of different embedding helps in better characterization of the sentences based on the natural language structure related to parts of speech and other semantic level categorization in a form which is machine interpret-able and inherits the characteristics of the Tensor Representation binding and unbinding based on the mutually orthogonality. Our approach has surpassed some of the existing basic works related to image captioning.

Traditionally, the way one evaluates the performance of an Artificial Intelligence (AI) system is via a comparison to human performance in specific tasks, treating humans as a reference for high-level cognition. However, these comparisons leave out important features of human intelligence: the capability to transfer knowledge and make complex decisions based on emotional and rational reasoning. These decisions are influenced by current inferences as well as prior experiences, making the decision process strongly subjective and apparently biased. In this context, a definition of compositional intelligence is necessary to incorporate these features in future AI tests. Here, a concrete implementation of this will be suggested, using recent developments in quantum cognition, natural language and compositional meaning of sentences, thanks to categorical compositional models of meaning.

Parametric entities appear in many contexts, be it in optimisation, control, modelling of random quantities, or uncertainty quantification. These are all fields where reduced order models (ROMs) have a place to alleviate the computational burden. Assuming that the parametric entity takes values in a linear space, we show how is is associated to a linear map or operator. This provides a general point of view on how to consider and analyse different representations of such entities. Analysis of the associated linear map in turn connects such representations with reproducing kernel Hilbert spaces and affine- / linear-representations in terms of tensor products. A generalised correlation operator is defined through the associated linear map, and its spectral analysis helps to shed light on the approximation properties of ROMs. This point of view thus unifies many such representations under a functional analytic roof, leading to a deeper understanding and making them available for appropriate analysis.

Differential privacy is a cryptographically-motivated approach to privacy that has become a very active field of research over the last decade in theoretical computer science and machine learning. In this paradigm one assumes there is a trusted curator who holds the data of individuals in a database and the goal of privacy is to simultaneously protect individual data while allowing the release of global characteristics of the database. In this setting we introduce a general framework for parametric inference with differential privacy guarantees. We first obtain differentially private estimators based on bounded influence M-estimators by leveraging their gross-error sensitivity in the calibration of a noise term added to them in order to ensure privacy. We then show how a similar construction can also be applied to construct differentially private test statistics analogous to the Wald, score and likelihood ratio tests. We provide statistical guarantees for all our proposals via an asymptotic analysis. An interesting consequence of our results is to further clarify the connection between differential privacy and robust statistics. In particular, we demonstrate that differential privacy is a weaker stability requirement than infinitesimal robustness, and show that robust M-estimators can be easily randomized in order to guarantee both differential privacy and robustness towards the presence of contaminated data. We illustrate our results both on simulated and real data.

For many NLP applications, such as question answering and summarisation, the goal is to select the best solution from a large space of candidates to meet a particular user’s needs. To address the lack of user-specific training data, we propose an interactive text ranking approach that actively selects pairs of candidates, from which the user selects the best. Unlike previous strategies, which attempt to learn a ranking across the whole candidate space, our method employs Bayesian optimisation to focus the user’s labelling effort on high quality candidates and integrates prior knowledge in a Bayesian manner to cope better with small data scenarios. We apply our method to community question answering (cQA) and extractive summarisation, finding that it significantly outperforms existing interactive approaches. We also show that the ranking function learned by our method is an effective reward function for reinforcement learning, which improves the state of the art for interactive summarisation.

Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. In this work, we present a Graph Convolutional Network (GCN) algorithm SWAG (Sample Weight and AGgregate), which combines efficient random walks and graph convolutions on weighted graphs to generate embeddings for nodes (items) that incorporate both graph structure as well as node feature information such as item-descriptions and item-images. The three important SWAG operations that enable us to efficiently generate node embeddings based on graph structures are (a) Sampling of graph to homogeneous structure, (b) Weighting the sampling, walks and convolution operations, and (c) using AGgregation functions for generating convolutions. The work is an adaptation of graphSAGE over weighted graphs. We deploy SWAG at Target and train it on a graph of more than 500K products sold online with over 50M edges. Offline and online evaluations reveal the benefit of using a graph-based approach and the benefits of weighing to produce high quality embeddings and product recommendations.

Neural networks have demonstrated unmatched performance in a range of classification tasks. Despite numerous efforts of the research community, novelty detection remains one of the significant limitations of neural networks. The ability to identify previously unseen inputs as novel is crucial for our understanding of the decisions made by neural networks. At runtime, inputs not falling into any of the categories learned during training cannot be classified correctly by the neural network. Existing approaches treat the neural network as a black box and try to detect novel inputs based on the confidence of the output predictions. However, neural networks are not trained to reduce their confidence for novel inputs, which limits the effectiveness of these approaches. We propose a framework to monitor a neural network by observing the hidden layers. We employ a common abstraction from program analysis – boxes – to identify novel behaviors in the monitored layers, i.e., inputs that cause behaviors outside the box. For each neuron, the boxes range over the values seen in training. The framework is efficient and flexible to achieve a desired trade-off between raising false warnings and detecting novel inputs. We illustrate the performance and the robustness to variability in the unknown classes on popular image-classification benchmarks.

We consider the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, we introduce natural notions of universality and we show a local equivalence among them. Our analysis is naturally expressed via information geometry, and represents a conceptually and computationally useful analysis. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-R\’enyi maximal correlation, the canonical correlation and principle component analyses of Hotelling and Pearson, Tishby’s information bottleneck, Wyner’s common information, Ky Fan -norms, and Brieman and Friedman’s alternating conditional expectations algorithm. We further illustrate how this framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and the associated neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.

Differentiable programming has emerged as a key programming paradigm empowering rapid developments of deep learning while its applications to important computational methods such as Monte Carlo remain largely unexplored. Here we present the general theory enabling infinite-order automatic differentiation on expectations computed by Monte Carlo with unnormalized probability distributions, which we call ‘automatic differentiable Monte Carlo’ (ADMC). By implementing ADMC algorithms on computational graphs, one can also leverage state-of-the-art machine learning frameworks and techniques to traditional Monte Carlo applications in statistics and physics. We illustrate the versatility of ADMC by showing some applications: fast search of phase transitions and accurately finding ground states of interacting many-body models in two dimensions. ADMC paves a promising way to innovate Monte Carlo in various aspects to achieve higher accuracy and efficiency, e.g. easing or solving the sign problem of quantum many-body models through ADMC.

Global optimization finds applications in a wide range of real world problems. The multi-start methods are a popular class of global optimization techniques, which are based on the ideas of conducting local searches at multiple starting points, and then sequentially determine the starting points according to some prescribed rules. In this work we propose a new multi-start algorithm where the starting points are determined in a Bayesian optimization framework. Specifically, the method can be understood as to construct a new function by conducting local searches of the original objective function, where the new function attains the same global optima as the original one. Bayesian optimization is then applied to find the global optima of the new local search based function.

The cost of drawing object bounding boxes (i.e. labeling) for millions of images is prohibitively high. For instance, labeling pedestrians in a regular urban image could take 35 seconds on average. Active learning aims to reduce the cost of labeling by selecting only those images that are informative to improve the detection network accuracy. In this paper, we propose a method to perform active learning of object detectors based on convolutional neural networks. We propose a new image-level scoring process to rank unlabeled images for their automatic selection, which clearly outperforms classical scores. The proposed method can be applied to videos and sets of still images. In the former case, temporal selection rules can complement our scoring process. As a relevant use case, we extensively study the performance of our method on the task of pedestrian detection. Overall, the experiments show that the proposed method performs better than random selection. Our codes are publicly available at http://www.gitlab.com/haghdam/deep_active_learning.

Deep neural networks have been successfully applied to many real-world applications. However, these successes rely heavily on large amounts of labeled data, which is expensive to obtain. Recently, Auto-Encoding Transformation (AET) and MixMatch have been proposed and achieved state-of-the-art results for unsupervised and semi-supervised learning, respectively. In this study, we train an Ensemble of Auto-Encoding Transformations (EnAET) to learn from both labeled and unlabeled data based on the embedded representations by decoding both spatial and non-spatial transformations. This distinguishes EnAET from conventional semi-supervised methods that focus on improving prediction consistency and confidence by different models on both unlabeled and labeled examples. In contrast, we propose to explore the role of self-supervised representations in semi-supervised learning under a rich family of transformations. Experiment results on CIFAR-10, CIFAR-100, SVHN and STL10 demonstrate that the proposed EnAET outperforms the state-of-the-art semi-supervised methods by significant margins. In particular, we apply the proposed method to extremely challenging scenarios with only 10 images per class, and show that EnAET can achieve an error rate of 9.35% on CIFAR-10 and 16.92% on SVHN. In addition, EnAET achieves the best result when compared with fully supervised learning using all labeled data with the same network architecture. The performance on CIFAR-10, CIFAR-100 and SVHN with a smaller network is even more competitive than the state-of-the-art of supervised learning methods based on a larger network. We also set a new performance record with an error rate of 1.99% on CIFAR-10 and 4.52% on STL10. The code and experiment records are released at https://…/EnAET.

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme and results suggest that CNNs learn to leverage lower-frequency components. In particular, we found: (1) band-limited training can effectively control the resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) requires no modification to existing training algorithms or neural network architectures to use unlike other compression schemes.

We present new algorithms for computing and approximating bisimulation metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an elegant formalism that capture behavioral equivalence between states and provide strong theoretical guarantees on differences in optimal behaviour. Unfortunately, their computation is expensive and requires a tabular representation of the states, which has thus far rendered them impractical for large problems. In this paper we present a new version of the metric that is tied to a behavior policy in an MDP, along with an analysis of its theoretical properties. We then present two new algorithms for approximating bisimulation metrics in large, deterministic MDPs. The first does so via sampling and is guaranteed to converge to the true metric. The second is a differentiable loss which allows us to learn an approximation even for continuous state MDPs, which prior to this work had not been possible.

OneClass SVM is a popular method for unsupervised anomaly detection. As many other methods, it suffers from the \textit{black box} problem: it is difficult to justify, in an intuitive and simple manner, why the decision frontier is identifying data points as anomalous or non anomalous. Such type of problem is being widely addressed for supervised models. However, it is still an uncharted area for unsupervised learning. In this paper, we describe a method to infer rules that justify why a point is labelled as an anomaly, so as to obtain intuitive explanations for models created using the OneClass SVM algorithm. We evaluate our proposal with different datasets, including real-world data coming from industry. With this, our proposal contributes to extend Explainable AI techniques to unsupervised machine learning models.

Due to the recent advances on Neural Architecture Search (NAS), it gains popularity in designing best networks for specific tasks. Although it shows promising results on many benchmarks and competitions, NAS still suffers from its demanding computation cost for searching high dimensional architectural design space, and this problem becomes even worse when we want to use a large-scale dataset. If we can make a reliable data proxy for NAS, the efficiency of NAS approaches increase accordingly. Our basic observation for making a data proxy is that each example in a specific dataset has a different impact on NAS process and most of examples are redundant from a relative accuracy ranking perspective, which we should preserve when making a data proxy. We propose a systematic approach to measure the importance of each example from this relative accuracy ranking point of view, and make a reliable data proxy based on the statistics of training and testing examples. Our experiment shows that we can preserve the almost same relative accuracy ranking between all possible network configurations even with 10-20 smaller data proxy.

Entity extraction is an important task in text mining and natural language processing. A popular method for entity extraction is by comparing substrings from free text against a dictionary of entities. In this paper, we present several techniques as a post-processing step for improving the effectiveness of the existing entity extraction technique. These techniques utilise models trained with the web-scale corpora which makes our techniques robust and versatile. Experiments show that our techniques bring a notable improvement on efficiency and effectiveness.

Improvement of statistical learning models in order to increase efficiency in solving classification or regression problems is still a goal pursued by the scientific community. In this way, the support vector machine model is one of the most successful and powerful algorithms for those tasks. However, its performance depends directly from the choice of the kernel function and their hyperparameters. The traditional choice of them, actually, can be computationally expensive to do the kernel choice and the tuning processes. In this article, it is proposed a novel framework to deal with the kernel function selection called Random Machines. The results improved accuracy and reduced computational time. The data study was performed in simulated data and over 27 real benchmarking datasets.

Contraction Clustering (RASTER) is a very fast algorithm for density-based clustering, which requires only a single pass. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. Yet, RASTER is limited to batch processing. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step in the sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. It is therefore very well suited to real-world scenarios where clustering does not happen continually but only periodically. We describe the algorithm, including a discussion of implementation details.

Although deep neural networks are highly effective, their high computational and memory costs severely challenge their applications on portable devices. As a consequence, low-bit quantization, which converts a full-precision neural network into a low-bitwidth integer version, has been an active and promising research topic. Existing methods formulate the low-bit quantization of networks as an approximation or optimization problem. Approximation-based methods confront the gradient mismatch problem, while optimization-based methods are only suitable for quantizing weights and could introduce high computational cost in the training stage. In this paper, we propose a novel perspective of interpreting and implementing neural network quantization by formulating low-bit quantization as a differentiable non-linear function (termed quantization function). The proposed quantization function can be learned in a lossless and end-to-end manner and works for any weights and activations of neural networks in a simple and uniform way. Extensive experiments on image classification and object detection tasks show that our quantization networks outperform the state-of-the-art methods. We believe that the proposed method will shed new insights on the interpretation of neural network quantization. Our code is available at https://…/alibabacloud-quantization-networks.

We survey a burgeoning and promising new research area that considers the online nature of many practical fair division problems. We identify wide variety of such online fair division problems, as well as discuss new mechanisms and normative properties that apply to this online setting. The online nature of such fair division problems provides both opportunities and challenges such as the possibility to develop new online mechanisms as well as the difficulty of dealing with an uncertain future.

Machine and deep learning-based algorithms are the emerging approaches in addressing prediction problems in time series. These techniques have been shown to produce more accurate results than conventional regression-based modeling. It has been reported that artificial Recurrent Neural Networks (RNN) with memory, such as Long Short-Term Memory (LSTM), are superior compared to Autoregressive Integrated Moving Average (ARIMA) with a large margin. The LSTM-based models incorporate additional ‘gates’ for the purpose of memorizing longer sequences of input data. The major question is that whether the gates incorporated in the LSTM architecture already offers a good prediction and whether additional training of data would be necessary to further improve the prediction. Bidirectional LSTMs (BiLSTMs) enable additional training by traversing the input data twice (i.e., 1) left-to-right, and 2) right-to-left). The research question of interest is then whether BiLSTM, with additional training capability, outperforms regular unidirectional LSTM. This paper reports a behavioral analysis and comparison of BiLSTM and LSTM models. The objective is to explore to what extend additional layers of training of data would be beneficial to tune the involved parameters. The results show that additional training of data and thus BiLSTM-based modeling offers better predictions than regular LSTM-based models. More specifically, it was observed that BiLSTM models provide better predictions compared to ARIMA and LSTM models. It was also observed that BiLSTM models reach the equilibrium much slower than LSTM-based models.

Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner. Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, separately modelling each task avoids catastrophic forgetting but it does not support transfer learning and leads to large models. On the other hand, rigidly specifying a shared component and a task-specific part enables task transfer and limits the model size, but it is vulnerable to catastrophic forgetting and restricts the form of task-transfer that can occur. Ideally, the network should adaptively identify which parts of the network to share in a data driven way. Here we introduce such an approach called Continual Learning with Adaptive Weights (CLAW), which is based on probabilistic modelling and variational inference. Experiments show that CLAW achieves state-of-the-art performance on six benchmarks in terms of overall continual learning performance, as measured by classification accuracy, and in terms of addressing catastrophic forgetting.

Deep neural networks (DNNs) can easily fit a random labeling of the training data with zero training error. What is the difference between DNNs trained with random labels and the ones trained with true labels? Our paper answers this question with two contributions. First, we study the memorization properties of DNNs. Our empirical experiments shed light on how DNNs prioritize the learning of simple input patterns. In the second part, we propose to measure the similarity between what different DNNs have learned and memorized. With the proposed approach, we analyze and compare DNNs trained on data with true labels and random labels. The analysis shows that DNNs have \textit{One way to Learn} and \textit{N ways to Memorize}. We also use gradient information to gain an understanding of the analysis results.