The performance of a deep neural network is highly dependent on its training, and finding better local optimal solutions is the goal of many optimization algorithms. However, existing optimization algorithms show a preference for descent paths that converge slowly and do not seek to avoid bad local optima. In this work, we propose Learning Rate Dropout (LRD), a simple gradient descent technique for training related to coordinate descent. LRD empirically aids the optimizer to actively explore in the parameter space by randomly setting some learning rates to zero; at each iteration, only parameters whose learning rate is not 0 are updated. As the learning rate of different parameters is dropped, the optimizer will sample a new loss descent path for the current update. The uncertainty of the descent path helps the model avoid saddle points and bad local minima. Experiments show that LRD is surprisingly effective in accelerating training while preventing overfitting.

Error-correcting output codes (ECOC) is an ensemble method combining a set of binary classifiers for multi-class learning problems. However, in traditional ECOC framework, the binary classifiers are trained independently. To explore the interaction between the binary classifiers, we construct an error correction network (ECN) that jointly trains all binary classifiers while maximizing the ensemble diversity to improve its robustness against adversarial attacks. An ECN is built based on a code matrix which is generated by maximizing the error tolerance, i.e., the minimum Hamming distance between any two rows, as well as the ensemble diversity, i.e., the variation of information between any two columns. Though ECN inherently promotes the diversity between the binary classifiers as each ensemble member solves a different classification problem (specified by the corresponding column of the code matrix), we empirically show that the ensemble diversity can be further improved by forcing the weight matrices learned by ensemble members to be orthogonal. The ECN is trained in end-to-end fashion and can be complementary to other defense approaches including adversarial training. We show empirically that ECN is effective against the state-of-the-art while-box attacks while maintaining good accuracy on normal examples.

Architecture design has become a crucial component of successful deep learning. Recent progress in automatic neural architecture search (NAS) shows a lot of promise. However, discovered architectures often fail to generalize in the final evaluation. Architectures with a higher validation accuracy during the search phase may perform worse in the evaluation. Aiming to alleviate this common issue, we introduce sequential greedy architecture search (SGAS), an efficient method for neural architecture search. By dividing the search procedure into sub-problems, SGAS chooses and prunes candidate operations in a greedy fashion. We apply SGAS to search architectures for Convolutional Neural Networks (CNN) and Graph Convolutional Networks (GCN). Extensive experiments show that SGAS is able to find state-of-the-art architectures for tasks such as image classification, point cloud classification and node classification in protein-protein interaction graphs with minimal computational cost. Please visit https://…/sgas for more information about SGAS.

Processing large complex networks recently attracted considerable interest. Complex graphs are useful in a wide range of applications from technological networks to biological systems like the human brain. Sometimes these networks are composed of billions of entities that give rise to emerging properties and structures. Analyzing these structures aids us in gaining new insights about our surroundings. As huge networks become abundant, there is a need for scalable algorithms to perform analysis. A prominent example is the PageRank algorithm, which is one of the measures used by web search engines such as Google to rank web pages displayed to the user. In order to find these patterns, massive amounts of data have to be acquired and processed. Designing and evaluating scalable graph algorithms to handle these data sets is a crucial task on the road to understanding the underlying systems. This habilitation thesis is a summary a broad spectrum of scalable graph algorithms that I developed over the last six years with many coauthors. In general, this research is based on four pillars: multilevel algorithms, practical kernelization, parallelization and memetic algorithms that are highly interconnected. Experiments conducted indicate that our algorithms find better solutions and/or are much more scalable than the previous state-of-the-art.

Deep learning-based models have been very successful in achieving state-of-the-art results in many of the computer vision, speech recognition, and natural language processing tasks in the last few years. These models seem a natural fit for handling the ever-increasing scale of biometric recognition problems, from cellphone authentication to airport security systems. Deep learning-based models have increasingly been leveraged to improve the accuracy of different biometric recognition systems in recent years. In this work, we provide a comprehensive survey of more than 120 promising works on biometric recognition (including face, fingerprint, iris, palmprint, ear, voice, signature, and gait recognition), which deploy deep learning models, and show their strengths and potentials in different applications. For each biometric, we first introduce the available datasets that are widely used in the literature and their characteristics. We will then talk about several promising deep learning works developed for that biometric, and show their performance on popular public benchmarks. We will also discuss some of the main challenges while using these models for biometric recognition, and possible future directions to which research in this area is headed.

This paper studies the multi-cascade influence maximization problem, which explores strategies for launching one information cascade in a social network with multiple existing cascades. With natural extensions to the classic models, we first propose the independent multi-cascade model where the diffusion process is governed by the so-called activation function. We show that the proposed model is sufficiently flexible as it generalizes most of the existing cascade-based models. We then study the multi-cascade influence maximization problem under the designed model and provide approximation hardness under common complexity assumptions, namely Exponential Time Hypothesis and . Given the hardness results, we build a framework for designing heuristic seed selection algorithms with a testable data-dependent approximation ratio. The designed algorithm leverages upper and lower bounds, which reveal the key combinatorial structure behind the multi-cascade influence maximization problem. The performance of the framework is theoretically analyzed and practically evaluated through extensive simulations. The superiority of the proposed solution is supported by encouraging experimental results, in terms of effectiveness and efficiency.

In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule facilitating neural network convergence at up to workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and the benchmark Large Movie Review Dataset~\cite{imdb}. Half-precision significantly reduces memory and network bandwidth, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving a comparable test set performance as single precision.

Even as deep neural networks have become very effective for tasks in vision and perception, it remains difficult to explain and debug their behavior. In this paper, we present a programmatic and semantic approach to explaining, understanding, and debugging the correct and incorrect behaviors of a neural network based perception system. Our approach is semantic in that it employs a high-level representation of the distribution of environment scenarios that the detector is intended to work on. It is programmatic in that the representation is a program in a domain-specific probabilistic programming language using which synthetic data can be generated to train and test the neural network. We present a framework that assesses the performance of the neural network to identify correct and incorrect detections, extracts rules from those results that semantically characterizes the correct and incorrect scenarios, and then specializes the probabilistic program with those rules in order to more precisely characterize the scenarios in which the neural network operates correctly or not, without human intervention to identify important features. We demonstrate our results using the SCENIC probabilistic programming language and a neural network-based object detector. Our experiments show that it is possible to automatically generate compact rules that significantly increase the correct detection rate (or conversely the incorrect detection rate) of the network and can thus help with debugging and understanding its behavior.

Playing an essential role in data mining, machine learning has a long history of being applied to networks on multifarious tasks and has played an essential role in data mining. However, the discrete and sparse natures of networks often render it difficult to apply machine learning directly to networks. To circumvent this difficulty, one major school of thought to approach networks using machine learning is via network embeddings. On the one hand, this network embeddings have achieved huge success on aggregated network data in recent years. On the other hand, learning network embeddings on distributively stored networks still remained understudied: To the best of our knowledge, all existing algorithms for learning network embeddings have hitherto been exclusively centralized and thus cannot be applied to these networks. To accommodate distributively stored networks, in this paper, we proposed a multi-agent model. Under this model, we developed the multi-agent network embedding learning algorithm (MANELA) for learning network embeddings. We demonstrate MANELA’s advantages over other existing centralized network embedding learning algorithms both theoretically and experimentally. Finally, we further our understanding in MANELA via visualization and exploration of its relationship to DeepWalk.

In this paper, we introduce Anomaly Contribution Explainer or ACE, a tool to explain security anomaly detection models in terms of the model features through a regression framework, and its variant, ACE-KL, which highlights the important anomaly contributors. ACE and ACE-KL provide insights in diagnosing which attributes significantly contribute to an anomaly by building a specialized linear model to locally approximate the anomaly score that a black-box model generates. We conducted experiments with these anomaly detection models to detect security anomalies on both synthetic data and real data. In particular, we evaluate performance on three public data sets: CERT insider threat, netflow logs, and Android malware. The experimental results are encouraging: our methods consistently identify the correct contributing feature in the synthetic data where ground truth is available; similarly, for real data sets, our methods point a security analyst in the direction of the underlying causes of an anomaly, including in one case leading to the discovery of previously overlooked network scanning activity. We have made our source code publicly available.

We propose a novel model for a topic-aware chatbot by combining the traditional Recurrent Neural Network (RNN) encoder-decoder model with a topic attention layer based on Nonnegative Matrix Factorization (NMF). After learning topic vectors from an auxiliary text corpus via NMF, the decoder is trained so that it is more likely to sample response words from the most correlated topic vectors. One of the main advantages in our architecture is that the user can easily switch the NMF-learned topic vectors so that the chatbot obtains desired topic-awareness. We demonstrate our model by training on a single conversational data set which is then augmented with topic matrices learned from different auxiliary data sets. We show that our topic-aware chatbot not only outperforms the non-topic counterpart, but also that each topic-aware model qualitatively and contextually gives the most relevant answer depending on the topic of question.

Notwithstanding the tremendous progress that is taking place in spoken language technology, effective speech-based human-robot interaction still raises a number of important challenges. Not only do the fields of robotics and spoken language technology present their own special problems, but their combination raises an additional set of issues. In particular, there is a large gap between the formulaic speech that typifies contemporary spoken dialogue systems and the flexible nature of human-human conversation. It is pointed out that grounded and situated speech-based human-robot interaction may lead to deeper insights into the pragmatics of language usage, thereby overcoming the current `habitability gap’.

Deep metric learning has yielded impressive results in tasks such as clustering and image retrieval by leveraging neural networks to obtain highly discriminative feature embeddings, which can be used to group samples into different classes. Much research has been devoted to the design of smart loss functions or data mining strategies for training such networks. Most methods consider only pairs or triplets of samples within a mini-batch to compute the loss function, which is commonly based on the distance between embeddings. We propose Group Loss, a loss function based on a differentiable label-propagation method that enforces embedding similarity across all samples of a group while promoting, at the same time, low-density regions amongst data points belonging to different groups. Guided by the smoothness assumption that ‘similar objects should belong to the same group’, the proposed loss trains the neural network for a classification task, enforcing a consistent labelling amongst samples within a class. We show state-of-the-art results on clustering and image retrieval on several datasets, and show the potential of our method when combined with other techniques such as ensembles

In pattern recognition or machine learning, it is a very fundamental task to find nearest neighbors of a given point. All the methods for the task work basically by comparing the given point to all the points in the data set. That is why the computational cost increases with the number of data points. However, the human visual system seems to work in a different way. When the human visual system tries to find the neighbors of one point on a map, it directly focuses on the area around the point and actively searches the neighbors by looking or zooming in and out around the point. In this paper, we propose an innovative search method for nearest neighbors, which seems very similar to how human visual system works on the task.

Pointer analysis is a foundational analysis leveraged by various static analyses. Therefore, it gathered wide attention in research for decades. Some pointer analysis frameworks are based on succinct declarative specifications. However, these tools are heterogeneous in terms of the underlying intermediate representation (IR), heap abstraction, and programming methodology. This situation complicates a fair comparison of these frameworks and thus hinders further research. Consequently, the literature lacks an evaluation of the strengths and weaknesses of these tools. In this work, we evaluate two major frameworks for pointer analysis, WALA and Doop, on the DaCapo set of benchmarks. We compare the pointer analyses available in Wala and Doop, and conclude that—even though based on a declarative specification—Doop provides a better pointer analysis than Wala in terms of precision and scalability. We also compare the two IRs used in Doop, i.e., Jimple from the Soot framework and IR from the Wala framework. Our evaluation shows that in the majority of the benchmarks Soot’s IR gives a more precise and scalable pointer analysis. Finally, we propose a micro-benchmark \emph{PointerBench}, for which we manually validate the points-to statistics to evaluate the results of these tools.

Estimating the travel time of any route is of great importance for trip planners, traffic operators, online taxi dispatching and ride-sharing platforms, and navigation provider systems. With the advance of technology, many traveling cars, including online taxi dispatch systems’ vehicles are equipped with Global Positioning System (GPS) devices that can report the location of the vehicle every few seconds. This paper uses GPS data and the Matrix Factorization techniques to estimate the travel times on all road segments and time intervals simultaneously. We aggregate GPS data into a matrix, where each cell of the original matrix contains the average vehicle speed for a segment and a specific time interval. One of the problems with this matrix is its high sparsity. We use Alternating Least Squares (ALS) method along with a regularization term to factorize the matrix. Since this approach can solve the sparsity problem that arises from the absence of cars in many road segments in a specific time interval, matrix factorization is suitable for estimating the travel time. Our comprehensive evaluation results using real data provided by one of the largest online taxi dispatching systems in Iran, shows the strength of our proposed method.

This paper studies the optimality of kernel methods in high-dimensional data clustering. Recent works have studied the large sample performance of kernel clustering in the high-dimensional regime, where Euclidean distance becomes less informative. However, it is unknown whether popular methods, such as kernel k-means, are optimal in this regime. We consider the problem of high-dimensional Gaussian clustering and show that, with the exponential kernel function, the sufficient conditions for partial recovery of clusters using the NP-hard kernel k-means objective matches the known information-theoretic limit up to a factor of for large . It also exactly matches the known upper bounds for the non-kernel setting. We also show that a semi-definite relaxation of the kernel k-means procedure matches up to constant factors, the spectral threshold, below which no polynomial-time algorithm is known to succeed. This is the first work that provides such optimality guarantees for the kernel k-means as well as its convex relaxation. Our proofs demonstrate the utility of the less known polynomial concentration results for random variables with exponentially decaying tails in a higher-order analysis of kernel methods.

Learning the underlying patterns in the data goes beyond instance-based generalization to some external knowledge represented in structured graphs or networks. Deep Learning (DL) has shown significant advances in probabilistically learning latent patterns in the data using a multi-layered network of computational nodes (i.e. neurons/hidden units). However, with the tremendous amount of training data, uncertainty in generalization on domain-specific tasks, and delta improvement with an increase in complexity of models seem to raise a concern on the features learned by the model. As incorporation of domain specific knowledge will aid in supervising the learning of features for the model, infusion of knowledge from knowledge graphs within hidden layers will further enhance the learning process. Although much work remains, we believe that KGs will play an increasing role in developing hybrid neuro-symbolic intelligent systems (that is bottom up deep learning with top down symbolic computing) as well as in building explainable AI systems for which KGs will provide a scaffolding for punctuating neural computing. In this position paper, we describe our motivation for such hybrid approach and a framework that combines knowledge graph and neural networks.

Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years — predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from six key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, and the exploitation of correlation filter advantages. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, and LaSOT. Finally, by conducting critical analyses of these state-of-the-art methods both quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. It may serve as a gentle use guide for practitioners to weigh on when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.

In this paper, we introduce the prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer which uses multi-scale multi-head self-attention to capture features from different scales. Based on the linguistic perspective and the analysis of pre-trained Transformer (BERT) on a huge corpus, we further design a strategy to control the scale distribution for each layer. Results of three different kinds of tasks (21 datasets) show our Multi-Scale Transformer outperforms the standard Transformer consistently and significantly on small and moderate size datasets.

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas. We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark dataframes and Scala application programming interface. We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.

The rapid growth of deep learning applications in real life is accompanied by severe safety concerns. To mitigate this uneasy phenomenon, much research has been done providing reliable evaluations of the fragility level in different deep neural networks. Apart from devising adversarial attacks, quantifiers that certify safeguarded regions have also been designed in the past five years. The summarizing work of Salman et al. unifies a family of existing verifiers under a convex relaxation framework. We draw inspiration from such work and further demonstrate the optimality of deterministic CROWN (Zhang et al. 2018) solutions in a given linear programming problem under mild constraints. Given this theoretical result, the computationally expensive linear programming based method is shown to be unnecessary. We then propose an optimization-based approach \textit{FROWN} (\textbf{F}astened C\textbf{ROWN}): a general algorithm to tighten robustness certificates for neural networks. Extensive experiments on various networks trained individually verify the effectiveness of FROWN in safeguarding larger robust regions.

With rapid technological advancements within the domain of Internet of Things (IoT), strong trends have emerged which indicate a rapid growth in the number of smart devices connected to IoT networks and this growth cannot be supported by traditional cloud computing platforms. In response to the increased capacity of data being transferred over networks, the edge and fog computing paradigms have emerged as extremely viable frameworks that shift computational and storage resources towards the edge of the network, thereby migrating processing power from centralized cloud servers to distributed LAN resources and powerful embedded devices within the network. These computing paradigms, therefore, have the potential to support massive IoT networks of the future and have also fueled the advancement of IoT systems within industrial settings, leading to the creation of the Industrial Internet of Things (IIoT) technology that is revolutionizing industrial processes in a variety of domains. In this paper, we elaborate on the impact of edge and fog computing paradigms on IIoT. We also highlight the how edge and fog computing are poised to bring about a turnaround in several industrial applications through a use-case approach. Finally, we conclude with the current issues and challenges faced by these paradigms in IIoT and suggest some research directions that should be followed to solve these problems and accelerate the adaptation of edge and fog computing in IIoT.

Matrix completion is widely used in machine learning, engineering control, image processing, and recommendation systems. Currently, a popular algorithm for matrix completion is Singular Value Threshold (SVT). In this algorithm, the singular value threshold should be set first. However, in a recommendation system, the dimension of the preference matrix keeps changing. Therefore, it is difficult to directly apply SVT. In addition, what the users of a recommendation system need is a sequence of personalized recommended results rather than the estimation of their scores. According to the above ideas, this paper proposes a novel approach named probability completion model~(PCM). By reducing the data dimension, the transitivity of the similar matrix, and singular value decomposition, this approach quickly obtains a completion matrix with the same probability distribution as the original matrix. The approach greatly reduces the computation time based on the accuracy of the sacrifice part, and can quickly obtain a low-rank similarity matrix with data trend approximation properties. The experimental results show that PCM can quickly generate a complementary matrix with similar data trends as the original matrix. The LCS score and efficiency of PCM are both higher than SVT.

The problem of hyperparameter optimization exists widely in the real life and many common tasks can be transformed into it, such as neural architecture search and feature subset selection. Without considering various constraints, the existing hyperparameter tuning techniques can solve these problems effectively by traversing as many hyperparameter configurations as possible. However, because of the limited resources and budget, it is not feasible to evaluate so many kinds of configurations, which requires us to design effective algorithms to find a best possible hyperparameter configuration with a finite number of configuration evaluations. In this paper, we simulate human thinking processes and combine the merit of the existing techniques, and thus propose a new algorithm called ExperienceThinking, trying to solve this constrained hyperparameter optimization problem. In addition, we analyze the performances of 3 classical hyperparameter optimization algorithms with a finite number of configuration evaluations, and compare with that of ExperienceThinking. The experimental results show that our proposed algorithm provides superior results and has better performance.

Application of K-Means algorithm is restricted by the fact that the number of clusters should be known beforehand. Previously suggested methods to solve this problem are either ad hoc or require parametric assumptions and complicated calculations. The proposed method aims to solve this conundrum by considering cluster hypersphere density as the factor to determine the number of clusters in the given dataset. The density is calculated by assuming a hypersphere around the cluster centroid for n-different number of clusters. The calculated values are plotted against their corresponding number of clusters and then the optimum number of clusters is obtained after assaying the elbow region of the graph. The method is simple and easy to comprehend and provides robust and reliable results.

A low-complexity neural network based approach for channel estimation was proposed recently, where assumptions on the channel model were incorporated into the design procedure of the estimator. Instead of using data from a measurement campaign as done in previous work, we evaluate the performance of the convolutional neural network (CNN) based channel estimator by using a reproducible mmWave environment of the DeepMIMO dataset. We further propose a neural network based predictor which is derived by starting from the linear minimum mean squared error (LMMSE) predictor. We start by deriving a weighted sum of LMMSE predictors which is motivated by the structure of the optimal MMSE predictor. This predictor provides an initialization (weight matrices, biases and activation function) to a feed-forward neural network based predictor. With a properly learned neural network, we show that it is possible to easily outperform the LMMSE predictor based on the Jakes assumption of the underlying Doppler spectrum in an reproducible indoor scenario of the DeepMIMO dataset.

In this work we explore the use of latent representations obtained from multiple input sensory modalities (such as images or sounds) in allowing an agent to learn and exploit policies over different subsets of input modalities. We propose a three-stage architecture that allows a reinforcement learning agent trained over a given sensory modality, to execute its task on a different sensory modality-for example, learning a visual policy over image inputs, and then execute such policy when only sound inputs are available. We show that the generalized policies achieve better out-of-the-box performance when compared to different baselines. Moreover, we show this holds in different OpenAI gym and video game environments, even when using different multimodal generative models and reinforcement learning algorithms.

We propose semantic region-adaptive normalization (SEAN), a simple but effective building block for Generative Adversarial Networks conditioned on segmentation masks that describe the semantic regions in the desired output image. Using SEAN normalization, we can build a network architecture that can control the style of each semantic region individually, e.g., we can specify one style reference image per region. SEAN is better suited to encode, transfer, and synthesize style than the best previous method in terms of reconstruction quality, variability, and visual quality. We evaluate SEAN on multiple datasets and report better quantitative metrics (e.g. FID, PSNR) than the current state of the art. SEAN also pushes the frontier of interactive image editing. We can interactively edit images by changing segmentation masks or the style for any given region. We can also interpolate styles from two reference images per region.

Sequential modelling with self-attention has achieved cutting edge performances in natural language processing. With advantages in model flexibility, computation complexity and interpretability, self-attention is gradually becoming a key component in event sequence models. However, like most other sequence models, self-attention does not account for the time span between events and thus captures sequential signals rather than temporal patterns. Without relying on recurrent network structures, self-attention recognizes event orderings via positional encoding. To bridge the gap between modelling time-independent and time-dependent event sequence, we introduce a functional feature map that embeds time span into high-dimensional spaces. By constructing the associated translation-invariant time kernel function, we reveal the functional forms of the feature map under classic functional function analysis results, namely Bochner’s Theorem and Mercer’s Theorem. We propose several models to learn the functional time representation and the interactions with event representation. These methods are evaluated on real-world datasets under various continuous-time event sequence prediction tasks. The experiments reveal that the proposed methods compare favorably to baseline models while also capturing useful time-event interactions.

Domain adaptation (DA) and domain generalization (DG) have emerged as a solution to the domain shift problem where the distribution of the source and target data is different. The task of DG is more challenging than DA as the target data is totally unseen during the training phase in DG scenarios. The current state-of-the-art employs adversarial techniques, however, these are rarely considered for the DG problem. Furthermore, these approaches do not consider correlation alignment which has been proven highly beneficial for minimizing domain discrepancy. In this paper, we propose a correlation-aware adversarial DA and DG framework where the features of the source and target data are minimized using correlation alignment along with adversarial learning. Incorporating the correlation alignment module along with adversarial learning helps to achieve a more domain agnostic model due to the improved ability to reduce domain discrepancy with unlabeled target data more effectively. Experiments on benchmark datasets serve as evidence that our proposed method yields improved state-of-the-art performance.

Abstract argumentation has emerged as a method for non-monotonic reasoning that has gained tremendous traction in the symbolic artificial intelligence community. In the literature, the different approaches to abstract argumentation that were refined over the years are typically evaluated from a logics perspective; an analysis that is based on models of ideal, rational decision-making does not exist. In this paper, we close this gap by analyzing abstract argumentation from the perspective of the rational man paradigm in microeconomic theory. To assess under which conditions abstract argumentation-based choice functions can be considered economically rational, we define a new argumentation principle that ensures compliance with the rational man’s reference independence property, which stipulates that a rational agent’s preferences over two choice options should not be influenced by the absence or presence of additional options. We show that the argumentation semantics as proposed in Dung’s classical paper, as well as all of a range of other semantics we evaluate do not fulfill this newly created principle. Consequently, we investigate how structural properties of argumentation frameworks impact the reference independence principle, and propose a restriction to argumentation expansions that allows all of the evaluated semantics to fulfill the requirements for economically rational argumentation-based choice. For this purpose, we define the rational man’s expansion as a normal and non-cyclic expansion. Finally, we put reference independence into the context of preference-based argumentation and show that for this argumentation variant, which explicitly model preferences, the rational man’s expansion cannot ensure reference independence.

We consider a trainable point-to-point communication system, where both transmitter and receiver are implemented as neural networks (NNs), and demonstrate that training on the bit-wise mutual information (BMI) allows seamless integration with practical bit-metric decoding (BMD) receivers, as well as joint optimization of constellation shaping and labeling. Moreover, we present a fully differentiable neural iterative demapping and decoding (IDD) structure which achieves significant gains on additive white Gaussian noise (AWGN) channels using a standard 802.11n low-density parity-check (LDPC) code. The strength of this approach is that it can be applied to arbitrary channels without any modifications. Going one step further, we show that careful code design can lead to further performance improvements. Lastly, we show the viability of the proposed system through implementation on software-defined radios (SDRs) and training of the end-to-end system on the actual wireless channel. Experimental results reveal that the proposed method enables significant gains compared to conventional techniques.

Wasserstein-GANs have been introduced to address the deficiencies of generative adversarial networks (GANs) regarding the problems of vanishing gradients and mode collapse during the training, leading to improved convergence behaviour and improved image quality. However, Wasserstein-GANs require the discriminator to be Lipschitz continuous. In current state-of-the-art Wasserstein-GANs this constraint is enforced via gradient norm regularization. In this paper, we demonstrate that this regularization does not encourage a broad distribution of spectral-values in the discriminator weights, hence resulting in less fidelity in the learned distribution. We therefore investigate the possibility of substituting this Lipschitz constraint with an orthogonality constraint on the weight matrices. We compare three different weight orthogonalization techniques with regards to their convergence properties, their ability to ensure the Lipschitz condition and the achieved quality of the learned distribution. In addition, we provide a comparison to Wasserstein-GANs trained with current state-of-the-art methods, where we demonstrate the potential of solely using orthogonality-based regularization. In this context, we propose an improved training procedure for Wasserstein-GANs which utilizes orthogonalization to further increase its generalization capability. Finally, we provide a novel metric to evaluate the generalization capabilities of the discriminators of different Wasserstein-GANs.

Literature analysis facilitates researchers better understanding the development of science and technology. The conventional literature analysis focuses on the topics, authors, abstracts, keywords, references, etc., and rarely pays attention to the content of papers. In the field of machine learning, the involved methods (M) and datasets (D) are key information in papers. The extraction and mining of M and D are useful for discipline analysis and algorithm recommendation. In this paper, we propose a novel entity recognition model, called MDER, and constructe datasets from the papers of the PAKDD conferences (2009-2019). Some preliminary experiments are conducted to assess the extraction performance and the mining results are visualized.

This paper treats functional marked point processes (FMPPs), which are defined as marked point processes where the marks are random elements in some (Polish) function space. Such marks may represent e.g. spatial paths or functions of time. To be able to consider e.g. multivariate FMPPs, we also attach an additional, Euclidean, mark to each point. We indicate how FMPPs quite naturally connect the point process framework with both the functional data analysis framework and the geostatistical framework. We further show that various existing models fit well into the FMPP framework. In addition, we introduce a new family of summary statistics, weighted marked reduced moment measures, together with their non-parametric estimators, in order to study features of the functional marks. We further show how they generalise other summary statistics and we finally apply these tools to analyse population structures, such as demographic evolution and sex ratio over time, in Spanish provinces.

In this work we present ISA, a novel approach for learning and exploiting subgoals in reinforcement learning (RL). Our method relies on inducing an automaton whose transitions are subgoals expressed as propositional formulas over a set of observable events. A state-of-the-art inductive logic programming system is used to learn the automaton from observation traces perceived by the RL agent. The reinforcement learning and automaton learning processes are interleaved: a new refined automaton is learned whenever the RL agent generates a trace not recognized by the current automaton. We evaluate ISA in several gridworld problems and show that it performs similarly to a method for which automata are given in advance. We also show that the learned automata can be exploited to speed up convergence through reward shaping and transfer learning across multiple tasks. Finally, we analyze the running time and the number of traces that ISA needs to learn an automata, and the impact that the number of observable events has on the learner’s performance.

In few-shot learning, typically, the loss function which is applied at test time is the one we are ultimately interested in minimising, such as the mean-squared-error loss for a regression problem. However, given that we have few samples at test time, we argue that the loss function that we are interested in minimising is not necessarily the loss function most suitable for computing gradients in a few-shot setting. We propose VIABLE, a generic meta-learning extension that builds on existing meta-gradient-based methods by learning a differentiable loss function, replacing the pre-defined inner-loop loss function in performing task-specific updates. We show that learning a loss function capable of leveraging relational information between samples reduces underfitting, and significantly improves performance and sample efficiency on a simple regression task. Furthermore, we show VIABLE is scalable by evaluating on the Mini-Imagenet dataset.

In this paper, we propose DeepAlign, a novel approach to multi-perspective process anomaly correction, based on recurrent neural networks and bidirectional beam search. At the core of the DeepAlign algorithm are two recurrent neural networks trained to predict the next event. One is reading sequences of process executions from left to right, while the other is reading the sequences from right to left. By combining the predictive capabilities of both neural networks, we show that it is possible to calculate sequence alignments, which are used to detect and correct anomalies. DeepAlign utilizes the case-level and event-level attributes to closely model the decisions within a process. We evaluate the performance of our approach on an elaborate data corpus of 30 realistic synthetic event logs and compare it to three state-of-the-art conformance checking methods. DeepAlign produces better corrections than the rest of the field reaching an overall accuracy of 98.45% across all datasets, whereas the best comparable state-of-the-art method reaches 70.19%.

Financial time series forecasting is, without a doubt, the top choice of computational intelligence for finance researchers from both academia and financial industry due to its broad implementation areas and substantial impact. Machine Learning (ML) researchers came up with various models and a vast number of studies have been published accordingly. As such, a significant amount of surveys exist covering ML for financial time series forecasting studies. Lately, Deep Learning (DL) models started appearing within the field, with results that significantly outperform traditional ML counterparts. Even though there is a growing interest in developing models for financial time series forecasting research, there is a lack of review papers that were solely focused on DL for finance. Hence, our motivation in this paper is to provide a comprehensive literature review on DL studies for financial time series forecasting implementations. We not only categorized the studies according to their intended forecasting implementation areas, such as index, forex, commodity forecasting, but also grouped them based on their DL model choices, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Long-Short Term Memory (LSTM). We also tried to envision the future for the field by highlighting the possible setbacks and opportunities, so the interested researchers can benefit.

Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these ‘untrained subnetworks’ exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an ‘untrained subnetwork’ approaches a network with learned weights in accuracy.

The performance of deep neural networks is often attributed to their automated, task-related feature construction. It remains an open question, though, why this leads to solutions with good generalization, even in cases where the number of parameters is larger than the number of samples. Back in the 90s, Hochreiter and Schmidhuber observed that flatness of the loss surface around a local minimum correlates with low generalization error. For several flatness measures, this correlation has been empirically validated. However, it has recently been shown that existing measures of flatness cannot theoretically be related to generalization due to a lack of invariance with respect to reparameterizations. We propose a natural modification of existing flatness measures that results in invariance to reparameterization.

In this paper, we transform tag recommendation into a word-based text generation problem and introduce a sequence-to-sequence model. The model inherits the advantages of LSTM-based encoder for sequential modeling and attention-based decoder with local positional encodings for learning relations globally. Experimental results on Zhihu datasets illustrate the proposed model outperforms other state-of-the-art text classification based methods.

Despite their importance in training artificial intelligence systems, large datasets remain challenging to acquire. For example, the ImageNet dataset required fourteen million labels of basic human knowledge, such as whether an image contains a chair. Unfortunately, this knowledge is so simple that it is tedious for human annotators but also tacit enough such that they are necessary. However, human collaborative efforts for tasks like labeling massive amounts of data are costly, inconsistent, and prone to failure, and this method does not resolve the issue of the resulting dataset being static in nature. What if we asked people questions they want to answer and collected their responses as data? This would mean we could gather data at a much lower cost, and expanding a dataset would simply become a matter of asking more questions. We focus on the task of Visual Question Answering (VQA) and propose a system that uses Visual Question Generation (VQG) to produce questions, asks them to social media users, and collects their responses. We present two models that can then parse clean answers from the noisy human responses significantly better than our baselines, with the goal of eventually incorporating the answers into a Visual Question Answering (VQA) dataset. By demonstrating how our system can collect large amounts of data at little to no cost, we envision similar systems being used to improve performance on other tasks in the future.

This is an overview of the R package iprior, which implements a unified methodology for fitting parametric and nonparametric regression models, including additive models, multilevel models, and models with one or more functional covariates. Based on the principle of maximum entropy, an I-prior is an objective Gaussian process prior for the regression function with covariance kernel equal to its Fisher information. The regression function is estimated by its posterior mean under the I-prior, and hyperparameters are estimated via maximum marginal likelihood. Estimation of I-prior models is simple and inference straightforward, while small and large sample predictive performances are comparative, and often better, to similar leading state-of-the-art models. We illustrate the use of the iprior package by analysing a simulated toy data set as well as three real-data examples, in particular, a multilevel data set, a longitudinal data set, and a dataset involving a functional covariate.

Past literature has been effective in demonstrating ideological gaps in machine learning (ML) fairness definitions when considering their use in complex socio-technical systems. However, we go further to demonstrate that these definitions often misunderstand the legal concepts from which they purport to be inspired, and consequently inappropriately co-opt legal language. In this paper, we demonstrate examples of this misalignment and discuss the differences in ML terminology and their legal counterparts, as well as what both the legal and ML fairness communities can learn from these tensions. We focus this paper on U.S. anti-discrimination law since the ML fairness research community regularly references terms from this body of law.

Deep neural networks with more parameters and FLOPs have higher capacity and generalize better to diverse domains. But to be deployed on edge devices, the model’s complexity has to be constrained due to limited compute resource. In this work, we propose a method to improve the model capacity without increasing inference-time complexity. Our method is based on an assumption of data locality: for an edge device, within a short period of time, the input data to the device are sampled from a single domain with relatively low diversity. Therefore, it is possible to utilize a specialized, low-complexity model to achieve good performance in that input domain. To leverage this, we propose Domain-aware Dynamic Network (DDN), which is a high-capacity dynamic network in which each layer contains multiple weights. During inference, based on the input domain, DDN dynamically combines those weights into one single weight that specializes in the given domain. This way, DDN can keep the inference-time complexity low but still maintain a high capacity. Experiments show that without increasing the parameters, FLOPs, and actual latency, DDN achieves up to 2.6\% higher AP50 than a static network on the BDD100K object-detection benchmark.

Recent advances in artificial intelligence research have led to a profusion of studies that apply deep learning to problems in image analysis and natural language processing among others. Additionally, the availability of open-source computational frameworks has lowered the barriers to implementing state-of-the-art methods across multiple domains. Albeit leading to major performance breakthroughs in some tasks, effective dissemination of deep learning algorithms remains challenging, inhibiting reproducibility and benchmarking studies, impeding further validation, and ultimately hindering their effectiveness in the cumulative scientific progress. In developing a platform for sharing research outputs, we present ModelHub.AI (www.modelhub.ai), a community-driven container-based software engine and platform for the structured dissemination of deep learning models. For contributors, the engine controls data flow throughout the inference cycle, while the contributor-facing standard template exposes model-specific functions including inference, as well as pre- and post-processing. Python and RESTful Application programming interfaces (APIs) enable users to interact with models hosted on ModelHub.AI and allows both researchers and developers to utilize models out-of-the-box. ModelHub.AI is domain-, data-, and framework-agnostic, catering to different workflows and contributors’ preferences.

The aim of this paper is to investigate various information-theoretic measures, including entropy, mutual information, and some systematic measures that based on mutual information, for a class of structured spiking neuronal network. In order to analyze and compute these information-theoretic measures for large networks, we coarse-grained the data by ignoring the order of spikes that fall into the same small time bin. The resultant coarse-grained entropy mainly capture the information contained in the rhythm produced by a local population of the network. We first proved that these information theoretical measures are well-defined and computable by proving the stochastic stability and the law of large numbers. Then we use three neuronal network examples, from simple to complex, to investigate these information-theoretic measures. Several analytical and computational results about properties of these information-theoretic measures are given.

Supervised systems require human labels for training. But, are humans themselves always impartial during the annotation process? We examine this question in the context of automated assessment of human behavioral tasks. Specifically, we investigate whether human ratings themselves can be trusted at their face value when scoring video-based structured interviews, and whether such ratings can impact machine learning models that use them as training data. We present preliminary empirical evidence that indicates there might be biases in such annotations, most of which are visual in nature.

Recently the concept of transformative AI (TAI) has begun to receive attention in the AI policy space. TAI is often framed as an alternative formulation to notions of strong AI (e.g. artificial general intelligence or superintelligence) and reflects increasing consensus that advanced AI which does not fit these definitions may nonetheless have extreme and long-lasting impacts on society. However, the term TAI is poorly defined and often used ambiguously. Some use the notion of TAI to describe levels of societal transformation associated with previous ‘general purpose technologies’ (GPTs) such as electricity or the internal combustion engine. Others use the term to refer to more drastic levels of transformation comparable to the agricultural or industrial revolutions. The notion has also been used much more loosely, with some implying that current AI systems are already having a transformative impact on society. This paper unpacks and analyses the notion of TAI, proposing a distinction between TAI and radically transformative AI (RTAI), roughly corresponding to societal change on the level of the agricultural or industrial revolutions. We describe some relevant dimensions associated with each and discuss what kinds of advances in capabilities they might require. We further consider the relationship between TAI and RTAI and whether we should necessarily expect a period of TAI to precede the emergence of RTAI. This analysis is important as it can help guide discussions among AI policy researchers about how to allocate resources towards mitigating the most extreme impacts of AI and it can bring attention to negative TAI scenarios that are currently neglected.

We present, FT-SWRL, a fuzzy temporal extension to the Semantic Web Rule Language (SWRL), which combines fuzzy theories based on the valid-time temporal model to provide a standard approach for modeling imprecise temporal domain knowledge in OWL ontologies. The proposal introduces a fuzzy temporal model for the semantic web, which is syntactically defined as a fuzzy temporal SWRL ontology (SWRL-FTO) with a new set of fuzzy temporal SWRL built-ins for defining their semantics. The SWRL-FTO hierarchically defines the necessary linguistic terminologies and variables for the fuzzy temporal model. An example model demonstrating the usefulness of the fuzzy temporal SWRL built-ins to model imprecise temporal information is also represented. Fuzzification process of interval-based temporal logic is further discussed as a reasoning paradigm for our FT-SWRL rules, with the aim of achieving a complete OWL-based fuzzy temporal reasoning. Literature review on fuzzy temporal representation approaches, both within and without the use of ontologies, led to the conclusion that the FT-SWRL model can authoritatively serve as a formal specification for handling imprecise temporal expressions on the semantic web.

This paper introduces the -distance matching problem, in which we are given a bipartite graph with , a weight function on the edges and an integer . The goal is to find a maximum weight subset of the edges satisfying the following two conditions: i) the degree of every node of is at most one in , ii) if for some , then . The question finds applications, for example, in various scheduling problems. We show that the problem is NP-complete in general and admits a simple -approximation. We give an FPT algorithm parameterized by and also settle the case when the size of is constant. From an approximability point of view, we consider several greedy approaches. In particular, a local search algorithm is presented that achieves an approximation ratio of for any constant in the unweighted case. We show that the integrality gap of the natural integer programming model is at most , and give an LP-based approximation algorithm for the weighted case with the same guarantee. The novel approaches used in the analysis of the integrality gap and the approximation ratio of locally optimal solutions might be of independent combinatorial interest.

In this paper we present a new framework for time-series modeling that combines the best of traditional statistical models and neural networks. We focus on time-series with long-range dependencies, needed for monitoring fine granularity data (e.g. minutes, seconds, milliseconds), prevalent in operational use-cases. Traditional models, such as auto-regression fitted with least squares (Classic-AR) can model time-series with a concise and interpretable model. When dealing with long-range dependencies, Classic-AR models can become intractably slow to fit for large data. Recently, sequence-to-sequence models, such as Recurrent Neural Networks, which were originally intended for natural language processing, have become popular for time-series. However, they can be overly complex for typical time-series data and lack interpretability. A scalable and interpretable model is needed to bridge the statistical and deep learning-based approaches. As a first step towards this goal, we propose modelling AR-process dynamics using a feed-forward neural network approach, termed AR-Net. We show that AR-Net is as interpretable as Classic-AR but also scales to long-range dependencies. Our results lead to three major conclusions: First, AR-Net learns identical AR-coefficients as Classic-AR, thus being equally interpretable. Second, the computational complexity with respect to the order of the AR process, is linear for AR-Net as compared to a quadratic for Classic-AR. This makes it possible to model long-range dependencies within fine granularity data. Third, by introducing regularization, AR-Net automatically selects and learns sparse AR-coefficients. This eliminates the need to know the exact order of the AR-process and allows to learn sparse weights for a model with long-range dependencies.

Today, Android devices are able to provide various services. They support applications for different purposes such as entertainment, business, health, education, and banking services. Because of the functionality and popularity of Android devices as well as the open-source policy of Android OS, they have become a suitable target for attackers. Android Botnet is one of the most dangerous malwares because an attacker called Botmaster can control that remotely to perform destructive attacks. A number of researchers have used different well-known Machine Learning (ML) methods to recognize Android Botnets from benign applications. However, these conventional methods are not able to detect new sophisticated Android Botnets. In this paper, we propose a novel method based on Android permissions and Convolutional Neural Networks (CNNs) to classify Botnets and benign Android applications. Being the first developed method that uses CNNs for this aim, we also proposed a novel method to represent each application as an image which is constructed based on the co-occurrence of used permissions in that application. The proposed CNN is a binary classifier that is trained using these images. Evaluating the proposed method on 5450 Android applications consist of Botnet and benign samples, the obtained results show the accuracy of 97.2% and recall of 96% which is a promising result just using Android permissions.

In this paper, we propose a new product knowledge graph (PKG) embedding approach for learning the intrinsic product relations as product knowledge for e-commerce. We define the key entities and summarize the pivotal product relations that are critical for general e-commerce applications including marketing, advertisement, search ranking and recommendation. We first provide a comprehensive comparison between PKG and ordinary knowledge graph (KG) and then illustrate why KG embedding methods are not suitable for PKG learning. We construct a self-attention-enhanced distributed representation learning model for learning PKG embeddings from raw customer activity data in an end-to-end fashion. We design an effective multi-task learning schema to fully leverage the multi-modal e-commerce data. The Poincare embedding is also employed to handle complex entity structures. We use a real-world dataset from grocery.walmart.com to evaluate the performances on knowledge completion, search ranking and recommendation. The proposed approach compares favourably to baselines in knowledge completion and downstream tasks.

Graph convolutional networks (GCNs) have shown the powerful ability in text structure representation and effectively facilitate the task of text classification. However, challenges still exist in adapting GCN on learning discriminative features from texts due to the main issue of graph variants incurred by the textual complexity and diversity. In this paper, we propose a dual-attention GCN to model the structural information of various texts as well as tackle the graph-invariant problem through embedding two types of attention mechanisms, i.e. the connection-attention and hop-attention, into the classic GCN. To encode various connection patterns between neighbour words, connection-attention adaptively imposes different weights specified to neighbourhoods of each word, which captures the short-term dependencies. On the other hand, the hop-attention applies scaled coefficients to different scopes during the graph diffusion process to make the model learn more about the distribution of context, which captures long-term semantics in an adaptive way. Extensive experiments are conducted on five widely used datasets to evaluate our dual-attention GCN, and the achieved state-of-the-art performance verifies the effectiveness of dual-attention mechanisms.

Quantization and Knowledge distillation (KD) methods are widely used to reduce memory and power consumption of deep neural networks (DNNs), especially for resource-constrained edge devices. Although their combination is quite promising to meet these requirements, it may not work as desired. It is mainly because the regularization effect of KD further diminishes the already reduced representation power of a quantized model. To address this short-coming, we propose Quantization-aware Knowledge Distillation (QKD) wherein quantization and KD are care-fully coordinated in three phases. First, Self-studying (SS) phase fine-tunes a quantized low-precision student network without KD to obtain a good initialization. Second, Co-studying (CS) phase tries to train a teacher to make it more quantizaion-friendly and powerful than a fixed teacher. Finally, Tutoring (TU) phase transfers knowledge from the trained teacher to the student. We extensively evaluate our method on ImageNet and CIFAR-10/100 datasets and show an ablation study on networks with both standard and depthwise-separable convolutions. The proposed QKD outperformed existing state-of-the-art methods (e.g., 1.3% improvement on ResNet-18 with W4A4, 2.6% on MobileNetV2 with W4A4). Additionally, QKD could recover the full-precision accuracy at as low as W3A3 quantization on ResNet and W6A6 quantization on MobilenetV2.

Current approaches to deep learning are beginning to rely heavily on transfer learning as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks. Similarly, such pre-trained models are often used to create embedding representations for various types of data, such as text and images, which can then be fed as input into separate, downstream models. However, in cases where such transfer learning models perform poorly (i.e., for data outside of the training distribution), one must resort to fine-tuning such models, or even retraining them completely. Currently, no form of data augmentation has been proposed that can be applied directly to embedding inputs to improve downstream model performance. In this work, we introduce four new types of data augmentation that are generally applicable to embedding inputs, thus making them useful in both Natural Language Processing (NLP) and Computer Vision (CV) applications. For models trained on downstream tasks with such embedding inputs, these augmentation methods are shown to improve the AUC score of the models from a score of 0.9582 to 0.9812 and significantly increase the model’s ability to identify classes of data that are not seen during training.

Deep metric learning (DML) is a popular approach for images retrieval, solving verification (same or not) problems and addressing open set classification. Arguably, the most common DML approach is with triplet loss, despite significant advances in the area of DML. Triplet loss suffers from several issues such as collapse of the embeddings, high sensitivity to sampling schemes and more importantly a lack of performance when compared to more modern methods. We attribute this adoption to a lack of fair comparisons between various methods and the difficulty in adopting them for novel problem statements. In this paper, we perform an unbiased comparison of the most popular DML baseline methods under same conditions and more importantly, not obfuscating any hyper parameter tuning or adjustment needed to favor a particular method. We find, that under equal conditions several older methods perform significantly better than previously believed. In fact, our unified implementation of 12 recently introduced DML algorithms achieve state-of-the art performance on CUB200, CAR196, and Stanford Online products datasets which establishes a new set of baselines for future DML research. The codebase and all tuned hyperparameters will be open-sourced for reproducibility and to serve as a source of benchmark.

Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as ‘Obama is a _ by profession’. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as ‘Obama worked as a _’ may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts and ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 38.1%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://…/LPAQA.

Deep learning has gained tremendous success and great popularity in the past few years. However, recent research found that it is suffering several inherent weaknesses, which can threaten the security and privacy of the stackholders. Deep learning’s wide use further magnifies the caused consequences. To this end, lots of research has been conducted with the purpose of exhaustively identifying intrinsic weaknesses and subsequently proposing feasible mitigation. Yet few is clear about how these weaknesses are incurred and how effective are these attack approaches in assaulting deep learning. In order to unveil the security weaknesses and aid in the development of a robust deep learning system, we are devoted to undertaking a comprehensive investigation on attacks towards deep learning, and extensively evaluating these attacks in multiple views. In particular, we focus on four types of attacks associated with security and privacy of deep learning: model extraction attack, model inversion attack, poisoning attack and adversarial attack. For each type of attack, we construct its essential workflow as well as adversary capabilities and attack goals. Many pivot metrics are devised for evaluating the attack approaches, by which we perform a quantitative and qualitative analysis. From the analysis, we have identified significant and indispensable factors in an attack vector, \eg, how to reduce queries to target models, what distance used for measuring perturbation. We spot light on 17 findings covering these approaches’ merits and demerits, success probability, deployment complexity and prospects. Moreover, we discuss other potential security weaknesses and possible mitigation which can inspire relevant researchers in this area.

We propose a novel approach to address one aspect of the non-stationarity problem in multi-agent reinforcement learning (RL), where the other agents may alter their policies due to environment changes during execution. This violates the Markov assumption that governs most single-agent RL methods and is one of the key challenges in multi-agent RL. To tackle this, we propose to train multiple policies for each agent and postpone the selection of the best policy at execution time. Specifically, we model the environment non-stationarity with a finite set of scenarios and train policies fitting each scenario. In addition to multiple policies, each agent also learns a policy predictor to determine which policy is the best with its local information. By doing so, each agent is able to adapt its policy when the environment changes and consequentially the other agents alter their policies during execution. We empirically evaluated our method on a variety of common benchmark problems proposed for multi-agent deep RL in the literature. Our experimental results show that the agents trained by our algorithm have better adaptiveness in changing environments and outperform the state-of-the-art methods in all the tested environments.

As a metric to measure the performance of an online method, dynamic regret with switching cost has drawn much attention for online decision making problems. Although the sublinear regret has been provided in many previous researches, we still have little knowledge about the relation between the dynamic regret and the switching cost. In the paper, we investigate the relation for two classic online settings: Online Algorithms (OA) and Online Convex Optimization (OCO). We provide a new theoretical analysis framework, which shows an interesting observation, that is, the relation between the switching cost and the dynamic regret is different for settings of OA and OCO. Specifically, the switching cost has significant impact on the dynamic regret in the setting of OA. But, it does not have an impact on the dynamic regret in the setting of OCO. Furthermore, we provide a lower bound of regret for the setting of OCO, which is same with the lower bound in the case of no switching cost. It shows that the switching cost does not change the difficulty of online decision making problems in the setting of OCO.

We present a method for solving the general mixed constrained convex quadratic programming problem using an active set method on the dual problem. The approach is similar to existing active set methods, but we present a new way of solving the linear systems arising in the algorithm. There are two main contributions; we present a new way of factorizing the linear systems, and show how iterative refinement can be used to achieve good accuracy and to solve both types of sub-problems that arise from semi-definite problems.

Decentralized optimization algorithms have attracted intensive interests recently, as it has a balanced communication pattern, especially when solving large-scale machine learning problems. Stochastic Path Integrated Differential Estimator Stochastic First-Order method (SPIDER-SFO) nearly achieves the algorithmic lower bound in certain regimes for nonconvex problems. However, whether we can find a decentralized algorithm which achieves a similar convergence rate to SPIDER-SFO is still unclear. To tackle this problem, we propose a decentralized variant of SPIDER-SFO, called decentralized SPIDER-SFO (D-SPIDER-SFO). We show that D-SPIDER-SFO achieves a similar gradient computation cost—that is, for finding an -approximate first-order stationary point—to its centralized counterpart. To the best of our knowledge, D-SPIDER-SFO achieves the state-of-the-art performance for solving nonconvex optimization problems on decentralized networks in terms of the computational cost. Experiments on different network configurations demonstrate the efficiency of the proposed method.

There are massive amounts of textual data residing in databases, valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, word embeddings are increasingly utilized to convert symbolic representations such as text into meaningful numbers. However, a naive one-to-one mapping of each word in a database to a word embedding vector is not sufficient and would lead to poor accuracies in ML tasks. Thus, we argue to additionally incorporate the information given by the database schema into the embedding, e.g. which words appear in the same column or are related to each other. In this paper, we propose RETRO (RElational reTROfitting), a novel approach to learn numerical representations of text values in databases, capturing the best of both worlds, the rich information encoded by word embeddings and the relational information encoded by database tables. We formulate relation retrofitting as a learning problem and present an efficient algorithm solving it. We investigate the impact of various hyperparameters on the learning problem and derive good settings for all of them. Our evaluation shows that the proposed embeddings are ready-to-use for many ML tasks such as classification and regression and even outperform state-of-the-art techniques in integration tasks such as null value imputation and link prediction.

Dropout has been proven to be an effective algorithm for training robust deep networks because of its ability to prevent overfitting by avoiding the co-adaptation of feature detectors. Current explanations of dropout include bagging, naive Bayes, regularization, and sex in evolution. According to the activation patterns of neurons in the human brain, when faced with different situations, the firing rates of neurons are random and continuous, not binary as current dropout does. Inspired by this phenomenon, we extend the traditional binary dropout to continuous dropout. On the one hand, continuous dropout is considerably closer to the activation characteristics of neurons in the human brain than traditional binary dropout. On the other hand, we demonstrate that continuous dropout has the property of avoiding the co-adaptation of feature detectors, which suggests that we can extract more independent feature detectors for model averaging in the test stage. We introduce the proposed continuous dropout to a feedforward neural network and comprehensively compare it with binary dropout, adaptive dropout, and DropConnect on MNIST, CIFAR-10, SVHN, NORB, and ILSVRC-12. Thorough experiments demonstrate that our method performs better in preventing the co-adaptation of feature detectors and improves test performance. The code is available at: https://…/dropout.

We consider a three-layer Sejnowski machine and show that features learnt via contrastive divergence have a dual representation as patterns in a dense associative memory of order P=4. The latter is known to be able to Hebbian-store an amount of patterns scaling as N^{P-1}, where N denotes the number of constituting binary neurons interacting P-wisely. We also prove that, by keeping the dense associative network far from the saturation regime (namely, allowing for a number of patterns scaling only linearly with N, while P>2) such a system is able to perform pattern recognition far below the standard signal-to-noise threshold. In particular, a network with P=4 is able to retrieve information whose intensity is O(1) even in the presence of a noise O(\sqrt{N}) in the large N limit. This striking skill stems from a redundancy representation of patterns — which is afforded given the (relatively) low-load information storage — and it contributes to explain the impressive abilities in pattern recognition exhibited by new-generation neural networks. The whole theory is developed rigorously, at the replica symmetric level of approximation, and corroborated by signal-to-noise analysis and Monte Carlo simulations.

This paper presents a new approach to the detection of discontinuities in the n-th derivative of observational data. This is achieved by performing two polynomial approximations at each interstitial point. The polynomials are coupled by constraining their coefficients to ensure continuity of the model up to the (n-1)-th derivative; while yielding an estimate for the discontinuity of the n-th derivative. The coefficients of the polynomials correspond directly to the derivatives of the approximations at the interstitial points through the prudent selection of a common coordinate system. The approximation residual and extrapolation errors are investigated as measures for detecting discontinuity. This is necessary since discrete observations of continuous systems are discontinuous at every point. It is proven, using matrix algebra, that positive extrema in the combined approximation-extrapolation error correspond exactly to extrema in the difference of the Taylor coefficients. This provides a relative measure for the severity of the discontinuity in the observational data. The matrix algebraic derivations are provided for all aspects of the methods presented here; this includes a solution for the covariance propagation through the computation. The performance of the method is verified with a Monte Carlo simulation using synthetic piecewise polynomial data with known discontinuities. It is also demonstrated that the discontinuities are suitable as knots for B-spline modelling of data. For completeness, the results of applying the method to sensor data acquired during the monitoring of heavy machinery are presented.

One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured by standard word embeddings. To explore this question, we propose a methodology for distilling relational knowledge from a pre-trained language model. Starting from a few seed instances of a given relation, we first use a large text corpus to find sentences that are likely to express this relation. We then use a subset of these extracted sentences as templates. Finally, we fine-tune a language model to predict whether a given word pair is likely to be an instance of some relation, when given an instantiated template for that relation as input.

In this paper, we consider the problem of finding the constraints in bow-free acyclic directed mixed graphs (ADMGs). ADMGs are a generalisation of directed acyclic graphs (DAGs) that allow for certain latent variables. We first show that minimal generators for the ideal containing all the constraints of a Gaussian ADMG corresponds precisely to the pairs of non-adjacent vertices in . The proof of this theorem naturally leads to an efficient algorithm that fits a bow-free Gaussian ADMG by maximum likelihood. In particular, we can test for the goodness of fit of a given data set to a bow-free ADMG.

Cyber-Physical Systems (CPS) are present in many settings addressing a myriad of purposes. Examples are Internet-of-Things (IoT) or sensing software embedded in appliances or even specialised meters that measure and respond to electricity demands in smart grids. Due to their pervasive nature, they are usually chosen as recipients for larger scope cyber-security attacks. Those promote system-wide disruptions and are directed towards one key aspect such as confidentiality, integrity, availability or a combination of those characteristics. Our paper focuses on a particular and distressing attack where coordinated malware infected IoT units are maliciously employed to synchronously turn on or off high-wattage appliances, affecting the grid’s primary control management. Our model could be extended to larger (smart) grids, Active Buildings as well as similar infrastructures. Our approach models Coordinated Load-Changing Attacks (CLCA) also referred as GridLock or BlackIoT, against a theoretical power grid, containing various types of power plants. It employs Continuous-Time Markov Chains where elements such as Power Plants and Botnets are modelled under normal or attack situations to evaluate the effect of CLCA in power reliant infrastructures. We showcase our modelling approach in the scenario of a power supplier (e.g. power plant) being targeted by a botnet. We demonstrate how our modelling approach can quantify the impact of a botnet attack and be abstracted for any CPS system involving power load management in a smart grid. Our results show that by prioritising the type of power-plants, the impact of the attack may change: in particular, we find the most impacting attack times and show how different strategies impact their success. We also find the best power generator to use depending on the current demand and strength of attack.

We consider a finite impulse response system with centered independent sub-Gaussian design covariates and noise components that are not necessarily identically distributed. We derive non-asymptotic near-optimal estimation and prediction bounds for the least-squares estimator of the parameters. Our results are based on two concentration inequalities on the norm of sums of dependent covariate vectors and on the singular values of their covariance operator that are of independent value on their own and where the dependence arises from the time shift structure of the time series. These results generalize the known bounds for the independent case.

The performance of acquisition functions for Bayesian optimisation is investigated in terms of the Pareto front between exploration and exploitation. We show that Expected Improvement and the Upper Confidence Bound always select solutions to be expensively evaluated on the Pareto front, but Probability of Improvement is never guaranteed to do so and Weighted Expected Improvement does only for a restricted range of weights. We introduce two novel -greedy acquisition functions. Extensive empirical evaluation of these together with random search, purely exploratory and purely exploitative search on 10 benchmark problems in 1 to 10 dimensions shows that -greedy algorithms are generally at least as effective as conventional acquisition functions, particularly with a limited budget. In higher dimensions -greedy approaches are shown to have improved performance over conventional approaches. These results are borne out on a real world computational fluid dynamics optimisation problem and a robotics active learning problem.

With the improvements of Los Angeles in many aspects, people in mounting numbers tend to live or travel to the city. The primary objective of this paper is to apply a set of methods for the time series analysis of traffic accidents in Los Angeles in the past few years. The number of traffic accidents, collected from 2010 to 2019 monthly reveals that the traffic accident happens seasonally and increasing with fluctuation. This paper utilizes the ensemble methods to combine several different methods to model the data from various perspectives, which can lead to better forecasting accuracy. The IMA(1, 1), ETS(A, N, A), and two models with Fourier items are failed in independence assumption checking. However, the Online Gradient Descent (OGD) model generated by the ensemble method shows the perfect fit in the data modeling, which is the state-of-the-art model among our candidate models. Therefore, it can be easier to accurately forecast future traffic accidents based on previous data through our model, which can help designers to make better plans.

The unification of low-level perception and high-level reasoning is a long-standing problem in artificial intelligence, which has the potential to not only bring the areas of logic and learning closer together but also demonstrate how abstract concepts might emerge from sensory data. Precisely because deep learning methods dominate perception-based learning, including vision, speech, and linguistic grammar, there is fast-growing literature on how to integrate symbolic reasoning and deep learning. Broadly, efforts seem to fall into three camps: those focused on defining a logic whose formulas capture deep learning, ones that integrate symbolic constraints in deep learning, and others that allow neural computations and symbolic reasoning to co-exist separately, to enjoy the strengths of both worlds. In this paper, we identify another dimension to this inquiry: what do the hidden layers really capture, and how can we reason about that logically? In particular, we consider autoencoders that are widely used for dimensionality reduction and inject a symbolic generative framework onto the feature layer. This allows us, among other things, to generate example images for a class to get a sense of what was learned. Moreover, the modular structure of the proposed model makes it possible to learn relations over multiple images at a time, as well as handle noisy labels. Our empirical evaluations show the promise of this inquiry.

We present a continuation method that entails generating a sequence of transition probability density functions from the prior to the posterior in the context of Bayesian inference for parameter estimation problems. The characterization of transition distributions, by tempering the likelihood function, results in a homogeneous nonlinear partial integro-differential equation whose existence and uniqueness of solutions are addressed. The posterior probability distribution comes as the interpretation of the final state of the path of transition distributions. A computationally stable scaling domain for the likelihood is explored for the approximation of the expected deviance, where we manage to hold back all the evaluations of the forward predictive model at the prior stage. It follows the computational tractability of the posterior distribution and opens access to the posterior distribution for direct samplings. To get a solution formulation of the expected deviance, we derive a partial differential equation governing the moments generating function of the log-likelihood. We show also that a spectral formulation of the expected deviance can be obtained for low-dimensional problems under certain conditions. The computational efficiency of the proposed method is demonstrated through three differents numerical examples that focus on analyzing the computational bias generated by the method, assessing the continuation method in the Bayesian inference with non-Gaussian noise, and evaluating its ability to invert a multimodal parameter of interest.

Join query optimization is a complex task and is central to the performance of query processing. In fact it belongs to the class of NP-hard problems. Traditional query optimizers use dynamic programming (DP) methods combined with a set of rules and restrictions to avoid exhaustive enumeration of all possible join orders. However, DP methods are very resource intensive. Moreover, given simplifying assumptions of attribute independence, traditional query optimizers rely on erroneous cost estimations, which can lead to suboptimal query plans. Recent success of deep reinforcement learning (DRL) creates new opportunities for the field of query optimization to tackle the above-mentioned problems. In this paper, we present our DRL-based Fully Observed Optimizer (FOOP) which is a generic query optimization framework that enables plugging in different machine learning algorithms. The main idea of FOOP is to use a data-adaptive learning query optimizer that avoids exhaustive enumerations of join orders and is thus significantly faster than traditional approaches based on dynamic programming. In particular, we evaluate various DRL-algorithms and show that Proximal Policy Optimization significantly outperforms Q-learning based algorithms. Finally we demonstrate how ensemble learning techniques combined with DRL can further improve the query optimizer.

Motivated by the flexibility of biological neural networks whose connectivity structure changes significantly during their lifetime, we introduce the Unstructured Recursive Network (URN) and demonstrate that it can exhibit similar flexibility during training via gradient descent. We show empirically that many of the different neural network structures commonly used in practice today (including fully connected, locally connected and residual networks of different depths and widths) can emerge dynamically from the same URN. These different structures can be derived using gradient descent on a single general loss function where the structure of the data and the relative strengths of various regulator terms determine the structure of the emergent network. We show that this loss function and the regulators arise naturally when considering the symmetries of the network as well as the geometric properties of the input data.

We propose a generalized Bayesian regression and model learning tool based on the “Bayesian Validation Metric’ (BVM) proposed in [1], called BVM model learning. This method performs Bayesian regression based on a user’s definition of model-data agreement and allows for model selection on any type of data distribution, unlike Bayesian and standard regression techniques, that fail in some cases. We show that the BVM model learning is capable of representing and combining Bayesian and standard regression techniques in a single framework and generalizing these methods. Thus, this tool offers new insights into the interpretation of the predictive envelopes in Bayesian and standard regression while giving the modeler more control over these envelopes.

The excessively increased volume of data in modern data management systems demands an improved system performance, frequently provided by data distribution, system scalability and performance optimization techniques. Optimized horizontal data partitioning has a significant influence of distributed data management systems. An optimally partitioned schema found in the early phase of logical database design without loading of real data in the system and its adaptation to changes of business environment are very important for a successful implementation, system scalability and performance improvement. In this paper we present a novel approach for finding an optimal horizontally partitioned schema that manifests a minimal total execution cost of a given database workload. Our approach is based on a formal model that enables abstraction of the predicates in the workload queries, and are subsequently used to define all relational fragments. This approach has predictive features acquired by simulation of horizontal partitioning, without loading any data into the partitions, but instead, altering the statistics in the database catalogs. We define an optimization problem and employ a genetic algorithm (GA) to find an approximately optimal horizontally partitioned schema. The solutions to the optimization problem are evaluated using PostgreSQL’s query optimizer. The initial experimental evaluation of our approach confirms its efficiency and correctness, and the numbers imply that the approach is effective in reducing the workload execution cost.

Networks are one of the most powerful structures for modeling problems in the real world. Downstream machine learning tasks defined on networks have the potential to solve a variety of problems. With link prediction, for instance, one can predict whether two persons will become friends on a social network. Many machine learning algorithms, however, require that each input example is a real vector. Network embedding encompasses various methods for unsupervised, and sometimes supervised, learning of feature representations of nodes and links in a network. Typically, embedding methods are based on the assumption that the similarity between nodes in the network should be reflected in the learned feature representations. In this paper, we review significant contributions to network embedding in the last decade. In particular, we look at four methods: Spectral Clustering, DeepWalk, Large-scale Information Network Embedding (LINE), and node2vec. We describe each method and list its advantages and shortcomings. In addition, we give examples of real-world machine learning problems on networks in which the embedding is critical in order to maximize the predictive performance of the machine learning task. Finally, we take a look at research trends and state-of-the art methods in the research on network embedding.

Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine granularity tasks make them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.

An Adversarial System to attack and an Authorship Attribution System (AAS) to defend itself against the attacks are analyzed. Defending a system against attacks from an adversarial machine learner can be done by randomly switching between models for the system, by detecting and reacting to changes in the distribution of normal inputs, or by using other methods. Adversarial machine learning is used to identify a system that is being used to map system inputs to outputs. Three types of machine learners are using for the model that is being attacked. The machine learners that are used to model the system being attacked are a Radial Basis Function Support Vector Machine, a Linear Support Vector Machine, and a Feedforward Neural Network. The feature masks are evolved using accuracy as the fitness measure. The system defends itself against adversarial machine learning attacks by identifying inputs that do not match the probability distribution of normal inputs. The system also defends itself against adversarial attacks by randomly switching between the feature masks being used to map system inputs to outputs.

Generative adversarial networks (GANs) are neural networks that learn data distributions through adversarial training. In intensive studies, recent GANs have shown promising results for reproducing training data. However, in spite of noise, they reproduce data with fidelity. As an alternative, we propose a novel family of GANs called noise-robust GANs (NR-GANs), which can learn a clean image generator even when training data are noisy. In particular, NR-GANs can solve this problem without having complete noise information (e.g., the noise distribution type, noise amount, or signal-noise relation). To achieve this, we introduce a noise generator and train it along with a clean image generator. As it is difficult to generate an image and a noise separately without constraints, we propose distribution and transformation constraints that encourage the noise generator to capture only the noise-specific components. In particular, considering such constraints under different assumptions, we devise two variants of NR-GANs for signal-independent noise and three variants of NR-GANs for signal-dependent noise. On three benchmark datasets, we demonstrate the effectiveness of NR-GANs in noise robust image generation. Furthermore, we show the applicability of NR-GANs in image denoising.

Most recent neural semi-supervised learning algorithms rely on adding small perturbation to either the input vectors or their representations. These methods have been successful on computer vision tasks as the images form a continuous manifold, but are not appropriate for discrete input such as sentence. To adapt these methods to text input, we propose to decompose a neural network into two components and so that . The layers in are then frozen and only the layers in will be updated during most time of the training. In this way, serves as a feature extractor that maps the input to high-level representation and adds systematical noise using dropout. We can then train using any state-of-the-art SSL algorithms such as -model, temporal ensembling, mean teacher, etc. Furthermore, this gradually unfreezing schedule also prevents a pretrained model from catastrophic forgetting. The experimental results demonstrate that our approach provides improvements when compared to state of the art methods especially on short texts.

This paper introduces SuperGlue, a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points. Assignments are estimated by solving a differentiable optimal transport problem, whose costs are predicted by a graph neural network. We introduce a flexible context aggregation mechanism based on attention, enabling SuperGlue to reason about the underlying 3D scene and feature assignments jointly. Compared to traditional, hand-designed heuristics, our technique learns priors over geometric transformations and regularities of the 3D world through end-to-end training from image pairs. SuperGlue outperforms other learned approaches and achieves state-of-the-art results on the task of pose estimation in challenging real-world indoor and outdoor environments. The proposed method performs matching in real-time on a modern GPU and can be readily integrated into modern SfM or SLAM systems.

We propose ViewAL, a novel active learning strategy for semantic segmentation that exploits viewpoint consistency in multi-view datasets. Our core idea is that inconsistencies in model predictions across viewpoints provide a very reliable measure of uncertainty and encourage the model to perform well irrespective of the viewpoint under which objects are observed. To incorporate this uncertainty measure, we introduce a new viewpoint entropy formulation, which is the basis of our active learning strategy. In addition, we propose uncertainty computations on a superpixel level, which exploits inherently localized signal in the segmentation task, directly lowering the annotation costs. This combination of viewpoint entropy and the use of superpixels allows to efficiently select samples that are highly informative for improving the network. We demonstrate that our proposed active learning strategy not only yields the best-performing models for the same amount of required labeled data, but also significantly reduces labeling effort. For instance, our method achieves 95% of maximum achievable network performance using only 7%, 17%, and 24% labeled data on SceneNet-RGBD, ScanNet, and Matterport3D, respectively. On these datasets, the best state-of-the-art method achieves the same performance with 14%, 27% and 33% labeled data. Finally, we demonstrate that labeling using superpixels yields the same quality of ground-truth compared to labeling whole images, but requires 25% less time.

Capsule networks excel in understanding spatial relationships in 2D data for vision related tasks. Even though they are not designed to capture 1D temporal relationships, with TimeCaps we demonstrate that given the ability, capsule networks excel in understanding temporal relationships. To this end, we generate capsules along the temporal and channel dimensions creating two temporal feature detectors which learn contrasting relationships. TimeCaps surpasses the state-of-the-art results by achieving 96.21% accuracy on identifying 13 Electrocardiogram (ECG) signal beat categories, while achieving on-par results on identifying 30 classes of short audio commands. Further, the instantiation parameters inherently learnt by the capsule networks allow us to completely parameterize 1D signals which opens various possibilities in signal processing.

Computer vision models learn to perform a task by capturing relevant statistics from training data. It has been shown that models learn spurious age, gender, and race correlations when trained for seemingly unrelated tasks like activity recognition or image captioning. Various mitigation techniques have been presented to prevent models from utilizing or learning such biases. However, there has been little systematic comparison between these techniques. We design a simple but surprisingly effective visual recognition benchmark for studying bias mitigation. Using this benchmark, we provide a thorough analysis of a wide range of techniques. We highlight the shortcomings of popular adversarial training approaches for bias mitigation, propose a simple but similarly effective alternative to the inference-time Reducing Bias Amplification method of Zhao et al., and design a domain-independent training technique that outperforms all other methods. Finally, we validate our findings on the attribute classification task in the CelebA dataset, where attribute presence is known to be correlated with the gender of people in the image, and demonstrate that the proposed technique is effective at mitigating real-world gender bias.

We propose a method to infer the presence and location of change-points in the distribution of a sequence of independent data taking values in a general metric space, where change-points are viewed as locations at which the distribution of the data sequence changes abruptly in terms of either its Fr\’echet mean or Fr\’echet variance or both. The proposed method is based on comparisons of Fr\’echet variances before and after putative change-point locations and does not require a tuning parameter except for the specification of cut-off intervals near the endpoints where change-points are assumed not to occur. Our results include theoretical guarantees for consistency of the test under contiguous alternatives when a change-point exists and also for consistency of the estimated location of the change-point if it exists, where under the null hypothesis of no change-point the limit distribution of the proposed scan function is the square of a standardized Brownian Bridge. These consistency results are applicable for a broad class of metric spaces under mild entropy conditions. Examples include the space of univariate probability distributions and the space of graph Laplacians for networks. Simulation studies demonstrate the effectiveness of the proposed methods, both for inferring the presence of a change-point and estimating its location. We also develop theory that justifies bootstrap-based inference and illustrate the new approach with sequences of maternal fertility distributions and communication networks.

Determining the correct number of clusters (CNC) is an important task in data clustering and has a critical effect on finalizing the partitioning results. K-means is one of the popular methods of clustering that requires CNC. Validity index methods use an additional optimization procedure to estimate the CNC for K-means. We propose an alternative validity index approach denoted by k-minimizing Average Central Error (KMACE). K-means is one of the popular methods of clustering that requires CNC. Validity ACE is the average error between the true unavailable cluster center and the estimated cluster center for each sample data. Kernel K-MACE is kernel K-means that is equipped with an efficient CNC estimator. In addition, kernel K_MACE includes the first automatically tuned procedure for choosing the Gaussian kernel parameters. Simulation results for both synthetic and read data show superiority of K_MACE and kernel K-MACE over the conventional clustering methods not only in CNC estimation but also in the partitioning procedure.

Have you ever wondered how your feature space is impacting the prediction of a specific sample in your dataset? In this paper, we introduce Single Sample Feature Importance (SSFI), which is an interpretable feature importance algorithm that allows for the identification of the most important features that contribute to the prediction of a single sample. When a dataset can be learned by a Random Forest classifier or regressor, SSFI shows how the Random Forest’s prediction path can be utilized for low-level feature importance calculation. SSFI results in a relative ranking of features, highlighting those with the greatest impact on a data point’s prediction. We demonstrate these results both numerically and visually on four different datasets.

In the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering tasks with a fixed task set, which cannot incorporate with a new spectral clustering task without accessing to previously learned tasks. In this paper, we aim to explore the problem of spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) orthogonal basis library: capturing latent cluster centers among the clusters in each pair of tasks; 2) feature embedding library: embedding the feature manifold information shared among multiple related tasks. As a new spectral clustering task arrives, L2SC firstly transfers knowledge from both basis library and feature library to obtain encoding matrix, and further redefines the library base over time to maximize performance across all the clustering tasks. Meanwhile, a general online update formulation is derived to alternatively update the basis library and feature library. Finally, the empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model can effectively improve the clustering performance when comparing with other state-of-the-art spectral clustering algorithms.

Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at https://…/CrossStagePartialNetworks.

Transfer learning is becoming the de facto solution for vision and text encoders in the front-end processing of machine learning solutions. Utilizing vast amounts of knowledge in pre-trained models and subsequent fine-tuning allows achieving better performance in domains where labeled data is limited. In this paper, we analyze the efficiency of transfer learning in visual reasoning by introducing a new model (SAMNet) and testing it on two datasets: COG and CLEVR. Our new model achieves state-of-the-art accuracy on COG and shows significantly better generalization capabilities compared to the baseline. We also formalize a taxonomy of transfer learning for visual reasoning around three axes: feature, temporal, and reasoning transfer. Based on extensive experimentation of transfer learning on each of the two datasets, we show the performance of the new model along each axis.

Conventional out-of-distribution (OOD) detection schemes based on variational autoencoder or Random Network Distillation (RND) have been observed to assign lower uncertainty to the OOD than the target distribution. In this work, we discover that such conventional novelty detection schemes are also vulnerable to the blurred images. Based on the observation, we construct a novel RND-based OOD detector, SVD-RND, that utilizes blurred images during training. Our detector is simple, efficient at test time, and outperforms baseline OOD detectors in various domains. Further results show that SVD-RND learns better target distribution representation than the baseline RND algorithm. Finally, SVD-RND combined with geometric transform achieves near-perfect detection accuracy on the CelebA dataset.

Ancestral mixture model, proposed by Chen and Lindsay (2006), is an important model to build a hierarchical tree from high dimensional binary sequences. Mixture trees created from ancestral mixture models involve in the inferred evolutionary relationships among various biological species. Moreover, it contains the information of time when the species mutates. Tree comparison metric, an essential issue in bioinformatics, is to measure the similarity between trees. However, to our knowledge, the approach to the comparison between two mixture trees is still under development. In this paper, we propose a new metric, named mixture distance metric, to measure the similarity of two mixture trees. It uniquely considers the factor of evolutionary times between trees. In addition, we also further develop two algorithms to compute the mixture distance between two mixture trees. One requires O(n^2) and the other requires O(nh) computation time with O(n) preprocessing time, where n denotes the number of leaves in the two mixture trees, and h denotes the minimum height of these two trees.

Most mobile network operators generate revenues by directly charging users for data plan subscriptions. Some operators now also offer users data rewards to incentivize them to watch mobile ads, which enables the operators to collect payments from advertisers and create new revenue streams. In this work, we analyze and compare two data rewarding schemes: a Subscription-Aware Rewarding (SAR) scheme and a Subscription-Unaware Rewarding (SUR) scheme. Under the SAR scheme, only the subscribers of the operators’ data plans are eligible for the rewards; under the SUR scheme, all users are eligible for the rewards (e.g., the users who do not subscribe to the data plans can still get SIM cards and receive data rewards by watching ads). We model the interactions among an operator, users, and advertisers by a two-stage Stackelberg game, and characterize their equilibrium strategies under both the SAR and SUR schemes. We show that the SAR scheme can lead to more subscriptions and a higher operator revenue from the data market, while the SUR scheme can lead to better ad viewership and a higher operator revenue from the ad market. We further show that the operator’s optimal choice between the two schemes is sensitive to the users’ data consumption utility function and the operator’s network capacity. We provide some counter-intuitive insights. For example, when each user has a logarithmic utility function, the operator should apply the SUR scheme (i.e., reward both subscribers and non-subscribers) if and only if it has a small network capacity.

Given labeled instances on a source domain and unlabeled ones on a target domain, unsupervised domain adaptation aims to learn a task classifier that can well classify target instances. Recent advances rely on domain-adversarial training of deep networks to learn domain-invariant features. However, due to an issue of mode collapse induced by the separate design of task and domain classifiers, these methods are limited in aligning the joint distributions of feature and category across domains. To overcome it, we propose a novel adversarial learning method termed Discriminative Adversarial Domain Adaptation (DADA). Based on an integrated category and domain classifier, DADA has a novel adversarial objective that encourages a mutually inhibitory relation between category and domain predictions for any input instance. We show that under practical conditions, it defines a minimax game that can promote the joint distribution alignment. Except for the traditional closed set domain adaptation, we also extend DADA for extremely challenging problem settings of partial and open set domain adaptation. Experiments show the efficacy of our proposed methods and we achieve the new state of the art for all the three settings on benchmark datasets.

Human parsing, or human body part semantic segmentation, has been an active research topic due to its wide potential applications. In this paper, we propose a novel GRAph PYramid Mutual Learning (Grapy-ML) method to address the cross-dataset human parsing problem, where the annotations are at different granularities. Starting from the prior knowledge of the human body hierarchical structure, we devise a graph pyramid module (GPM) by stacking three levels of graph structures from coarse granularity to fine granularity subsequently. At each level, GPM utilizes the self-attention mechanism to model the correlations between context nodes. Then, it adopts a top-down mechanism to progressively refine the hierarchical features through all the levels. GPM also enables efficient mutual learning. Specifically, the network weights of the first two levels are shared to exchange the learned coarse-granularity information across different datasets. By making use of the multi-granularity labels, Grapy-ML learns a more discriminative feature representation and achieves state-of-the-art performance, which is demonstrated by extensive experiments on the three popular benchmarks, e.g. CIHP dataset. The source code is publicly available at https://…/Grapy-ML.

We develop a framework for analyzing multivariate time series using topological data analysis (TDA) methods. The proposed methodology involves converting the multivariate time series to point cloud data, calculating Wasserstein distances between the persistence diagrams and using the -nearest neighbors algorithm (-NN) for supervised machine learning. Two methods (symmetry-breaking and anchor points) are also introduced to enable TDA to better analyze data with heterogeneous features that are sensitive to translation, rotation, or choice of coordinates. We apply our methods to room occupancy detection based on 5 time-dependent variables (temperature, humidity, light, CO2 and humidity ratio). Experimental results show that topological methods are effective in predicting room occupancy during a time window.

High-level (e.g., semantic) features encoded in the latter layers of convolutional neural networks are extensively exploited for image classification, leaving low-level (e.g., color) features in the early layers underexplored. In this paper, we propose a novel Decision Propagation Module (DPM) to make an intermediate decision that could act as category-coherent guidance extracted from early layers, and then propagate it to the latter layers. Therefore, by stacking a collection of DPMs into a classification network, the generated Decision Propagation Network is explicitly formulated as to progressively encode more discriminative features guided by the decision, and then refine the decision based on the new generated features layer by layer. Comprehensive results on four publicly available datasets validate DPM could bring significant improvements for existing classification networks with minimal additional computational cost and is superior to the state-of-the-art methods.

The K-means algorithm is a widely used clustering algorithm that offers simplicity and efficiency. However, the traditional K-means algorithm uses the random method to determine the initial cluster centers, which make clustering results prone to local optima and then result in worse clustering performance. Many initialization methods have been proposed, but none of them can dynamically adapt to datasets with various characteristics. In our previous research, an initialization method for K-means based on hybrid distance was proposed, and this algorithm can adapt to datasets with different characteristics. However, it has the following drawbacks: (a) When calculating density, the threshold cannot be uniquely determined, resulting in unstable results. (b) Heavily depending on adjusting the parameter, the parameter must be adjusted five times to obtain better clustering results. (c) The time complexity of the algorithm is quadratic, which is difficult to apply to large datasets. In the current paper, we proposed an adaptive initialization method for the K-means algorithm (AIMK) to improve our previous work. AIMK can not only adapt to datasets with various characteristics but also obtain better clustering results within two interactions. In addition, we then leverage random sampling in AIMK, which is named as AIMK-RS, to reduce the time complexity. AIMK-RS is easily applied to large and high-dimensional datasets. We compared AIMK and AIMK-RS with 10 different algorithms on 16 normal and six extra-large datasets. The experimental results show that AIMK and AIMK-RS outperform the current initialization methods and several well-known clustering algorithms. Furthermore, AIMK-RS can significantly reduce the complexity of applying it to extra-large datasets with high dimensions. The time complexity of AIMK-RS is O(n).

Deep Learning is a state-of-the-art technique to make inference on extensive or complex data. As a black box model due to their multilayer nonlinear structure, Deep Neural Networks are often criticized to be non-transparent and their predictions not traceable by humans. Furthermore, the models learn from artificial datasets, often with bias or contaminated discriminating content. Through their increased distribution, decision-making algorithms can contribute promoting prejudge and unfairness which is not easy to notice due to lack of transparency. Hence, scientists developed several so-called explanators or explainers which try to point out the connection between input and output to represent in a simplified way the inner structure of machine learning black boxes. In this survey we differ the mechanisms and properties of explaining systems for Deep Neural Networks for Computer Vision tasks. We give a comprehensive overview about taxonomy of related studies and compare several survey papers that deal with explainability in general. We work out the drawbacks and gaps and summarize further research ideas.

Similarity graphs are an active research direction for the nearest neighbor search (NNS) problem. New algorithms for similarity graph construction are continuously being proposed and analyzed by both theoreticians and practitioners. However, existing construction algorithms are mostly based on heuristics and do not explicitly maximize the target performance measure, i.e., search recall. Therefore, at the moment it is not clear whether the performance of similarity graphs has plateaued or more effective graphs can be constructed with more theoretically grounded methods. In this paper, we introduce a new principled algorithm, based on adjacency matrix optimization, which explicitly maximizes search efficiency. Namely, we propose a probabilistic model of a similarity graph defined in terms of its edge probabilities and show how to learn these probabilities from data as a reinforcement learning task. As confirmed by experiments, the proposed construction method can be used to refine the state-of-the-art similarity graphs, achieving higher recall rates for the same number of distance computations. Furthermore, we analyze the learned graphs and reveal the structural properties that are responsible for more efficient search.

Differential Architecture Search (DARTS) is now a widely disseminated weight-sharing neural architecture search method. However, there are two fundamental weaknesses remain untackled. First, we observe that the well-known aggregation of skip connections during optimization is caused by an unfair advantage in an exclusive competition. Second, there is a non-negligible incongruence when discretizing continuous architectural weights to a one-hot representation. Because of these two reasons, DARTS delivers a biased solution that might not even be suboptimal. In this paper, we present a novel approach to curing both frailties. Specifically, as unfair advantages in a pure exclusive competition easily induce a monopoly, we relax the choice of operations to be collaborative, where we let each operation have an equal opportunity to develop its strength. We thus call our method Fair DARTS. Moreover, we propose a zero-one loss to directly reduce the discretization gap. Experiments are performed on two mainstream search spaces, in which we achieve new state-of-the-art networks on ImageNet. Our code is available on https://…/fairdarts.

Heuristic forward search is currently the dominant paradigm in classical planning. Forward search algorithms typically rely on a single, relatively simple variation of best-first search and remain fixed throughout the process of solving a planning problem. Existing work combining multiple search techniques usually aims at supporting best-first search with an additional exploratory mechanism, triggered using a handcrafted criterion. A notable exception is very recent work which combines various search techniques using a trainable policy. It is, however, confined to a discrete action space comprising several fixed subroutines. In this paper, we introduce a parametrized search algorithm template which combines various search techniques within a single routine. The template’s parameter space defines an infinite space of search algorithms, including, among others, BFS, local and random search. We further introduce a neural architecture for designating the values of the search parameters given the state of the search. This enables expressing neural search policies that change the values of the parameters as the search progresses. The policies can be learned automatically, with the objective of maximizing the planner’s performance on a given distribution of planning problems. We consider a training setting based on a stochastic optimization algorithm known as the cross-entropy method (CEM). Experimental evaluation of our approach shows that it is capable of finding effective distribution-specific search policies, outperforming the relevant baselines.

The instability and feature redundancy in CNNs hinders further performance improvement. Using orthogonality as a regularizer has shown success in alleviating these issues. Previous works however only considered the kernel orthogonality in the convolution layers of CNNs, which is a necessary but not sufficient condition for orthogonal convolutions in general. We propose orthogonal convolutions as regularizations in CNNs and benchmark its effect on various tasks. We observe up to 3% gain for CIFAR100 and up to 1% gain for ImageNet classification. Our experiments also demonstrate improved performance on image retrieval, inpainting and generation, which suggests orthogonal convolution improves the feature expressiveness. Empirically, we show that the uniform spectrum and reduced feature redundancy may account for the gain in performance and robustness under adversarial attacks.

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods—taking the maximum attention weight and computing the maximum spanning tree—to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets—the syntax-oriented CoLA and the semantics-oriented MNLI—to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.

A structured understanding of our world in terms of objects, relations, and hierarchies is an important component of human cognition. Learning such a structured world model from raw sensory data remains a challenge. As a step towards this goal, we introduce Contrastively-trained Structured World Models (C-SWMs). C-SWMs utilize a contrastive approach for representation learning in environments with compositional structure. We structure each state embedding as a set of object representations and their relations, modeled by a graph neural network. This allows objects to be discovered from raw pixel observations without direct supervision as part of the learning process. We evaluate C-SWMs on compositional environments involving multiple interacting objects that can be manipulated independently by an agent, simple Atari games, and a multi-object physics simulation. Our experiments demonstrate that C-SWMs can overcome limitations of models based on pixel reconstruction and outperform typical representatives of this model class in highly structured environments, while learning interpretable object-based representations.

We study the design of learning architectures for behavioural planning in a dense traffic setting. Such architectures should deal with a varying number of nearby vehicles, be invariant to the ordering chosen to describe them, while staying accurate and compact. We observe that the two most popular representations in the literature do not fit these criteria, and perform badly on an complex negotiation task. We propose an attention-based architecture that satisfies all these properties and explicitly accounts for the existing interactions between the traffic participants. We show that this architecture leads to significant performance gains, and is able to capture interactions patterns that can be visualised and qualitatively interpreted. Videos and code are available at https://…/.

Living in the ‘Information Age’ means that not only access to information has become easier but also that the distribution of information is more dynamic than ever. Through a large-scale online field experiment, we provide new empirical evidence for the presence of the anchoring bias in people’s judgment due to irrational reliance on a piece of information that they are initially given. The comparison of the anchoring stimuli and respective responses across different tasks reveals a positive, yet complex relationship between the anchors and the bias in participants’ predictions of the outcomes of events in the future. Participants in the treatment group were equally susceptible to the anchors regardless of their level of engagement, previous performance, or gender. Given the strong and ubiquitous influence of anchors quantified here, we should take great care to closely monitor and regulate the distribution of information online to facilitate less biased decision making.

The recently proposed panoptic segmentation task presents a significant challenge of image understanding with computer vision by unifying semantic segmentation and instance segmentation tasks. In this paper we present an efficient and novel panoptic data augmentation (PanDA) method which operates exclusively in pixel space, requires no additional data or training, and is computationally cheap to implement. We retrain the original state-of-the-art UPSNet panoptic segmentation model on PanDA augmented Cityscapes dataset, and demonstrate all-round performance improvement upon the original model. We also show that PanDA is effective across scales from 10 to 30,000 images, as well as generalizable to Microsoft COCO panoptic segmentation task. Finally, the effectiveness of PanDA generated unrealistic-looking training images suggest that we should rethink about optimizing levels of image realism for efficient data augmentation.

We apply methods from randomized numerical linear algebra (RandNLA) to develop improved algorithms for the analysis of large-scale time series data. We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes. We show that the accuracy of approximations lies within of the true leverage scores with high probability. These theoretical results are subsequently exploited to develop an efficient algorithm, called LSAR, for fitting an appropriate AR model to big time series data. Our proposed algorithm is guaranteed, with high probability, to find the maximum likelihood estimates of the parameters of the underlying true AR model and has a worst case running time that significantly improves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale synthetic as well as real data highly support the theoretical results and reveal the efficacy of this new approach. To the best of our knowledge, this paper is the first attempt to establish a nexus between RandNLA and big time series data analysis.

As neural networks revolutionize many applications, significant privacy concerns emerge. Owners of private data wish to use remote neural network services while ensuring their data cannot be interpreted by others. Service providers wish to keep their model private to safeguard its intellectual property. Such privacy conflicts may slow down the adoption of neural networks in sensitive domains such as healthcare. Privacy issues have been addressed in the cryptography community in the context of secure computation. However, secure computation protocols have known performance issues. E.g., runtime of secure inference in deep neural networks is three orders of magnitude longer comparing to non-secure inference. Therefore, much research efforts address the optimization of cryptographic protocols for secure inference. We take a complementary approach, and provide design principles for optimizing the crypto-oriented neural network architectures to reduce the runtime of secure inference. The principles are evaluated on three state-of-the-art architectures: SqueezeNet, ShuffleNetV2, and MobileNetV2. Our novel method significantly improves the efficiency of secure inference on common evaluation metrics.

Machine learning software, deep neural networks (DNN) software in particular, discerns valuable information from a large dataset, a set of data. Outcomes of such DNN programs are dependent on the quality of both learning programs and datasets. Unfortunately, the quality of datasets is difficult to be defined, because they are just samples. The quality assurance of DNN software is difficult, because resultant trained machine learning models are unknown prior to its development, and the validation is conducted indirectly in terms of prediction performance. This paper introduces a hypothesis that faults in the learning programs manifest themselves as distortions in trained machine learning models. Relative distortion degrees measured with appropriate observer functions may indicate that there are some hidden faults. The proposal is demonstrated with example cases of the MNIST dataset.

Neural architecture search (NAS) can have a significant impact in computer vision by automatically designing optimal neural network architectures for various tasks. A variant, binarized neural architecture search (BNAS), with a search space of binarized convolutions, can produce extremely compressed models. Unfortunately, this area remains largely unexplored. BNAS is more challenging than NAS due to the learning inefficiency caused by optimization requirements and the huge architecture space. To address these issues, we introduce channel sampling and operation space reduction into a differentiable NAS to significantly reduce the cost of searching. This is accomplished through a performance-based strategy used to abandon less potential operations. Two optimization methods for binarized neural networks are used to validate the effectiveness of our BNAS. Extensive experiments demonstrate that the proposed BNAS achieves a performance comparable to NAS on both CIFAR and ImageNet databases. An accuracy of vs. is achieved on the CIFAR-10 dataset, but with a significantly compressed model, and a faster search than the state-of-the-art PC-DARTS.

The energy system studies include a wide range of issues from short term (e.g. real-time, hourly, daily and weekly operating decisions) to long term horizons (e.g. planning or policy making). The decision making chain is fed by input parameters which are usually subject to uncertainties. The art of dealing with uncertainties has been developed in various directions and has recently become a focal point of interest. In this paper, a new standard classification of uncertainty modeling techniques for decision making process is proposed. These methods are introduced and compared along with demonstrating their strengths and weaknesses. The promising lines of future researches are explored in the shadow of a comprehensive overview of the past and present applications. The possibility of using the novel concept of Z-numbers is introduced for the first time.

Motivated by the celebrated discrete-time model of nervous activity outlined by McCulloch and Pitts in 1943, we propose a novel continuous-time model, the McCulloch-Pitts network (MPN), for sequence learning in spiking neural networks. Our model has a local learning rule, such that the synaptic weight updates depend only on the information directly accessible by the synapse. By exploiting asymmetry in the connections between binary neurons, we show that MPN can be trained to robustly memorize multiple spatiotemporal patterns of binary vectors, generalizing the ability of the symmetric Hopfield network to memorize static spatial patterns. In addition, we demonstrate that the model can efficiently learn sequences of binary pictures as well as generative models for experimental neural spike-train data. Our learning rule is consistent with spike-timing-dependent plasticity (STDP), thus providing a theoretical ground for the systematic design of biologically inspired networks with large and robust long-range sequence storage capacity.

In the last two years, more than 200 papers have been written on how machine learning (ML) systems can fail because of adversarial attacks on the algorithms and data; this number balloons if we were to incorporate papers covering non-adversarial failure modes. The spate of papers has made it difficult for ML practitioners, let alone engineers, lawyers, and policymakers, to keep up with the attacks against and defenses of ML systems. However, as these systems become more pervasive, the need to understand how they fail, whether by the hand of an adversary or due to the inherent design of a system, will only become more pressing. In order to equip software developers, security incident responders, lawyers, and policy makers with a common vernacular to talk about this problem, we developed a framework to classify failures into ‘Intentional failures’ where the failure is caused by an active adversary attempting to subvert the system to attain her goals; and ‘Unintentional failures’ where the failure is because an ML system produces an inherently unsafe outcome. After developing the initial version of the taxonomy last year, we worked with security and ML teams across Microsoft, 23 external partners, standards organization, and governments to understand how stakeholders would use our framework. Throughout the paper, we attempt to highlight how machine learning failure modes are meaningfully different from traditional software failures from a technology and policy perspective.

Envy-freeness is a widely studied notion in resource allocation, capturing some aspects of fairness. The notion of envy being inherently subjective though, it might be the case that an agent envies another agent, but that she objectively has no reason to do so. The difficulty here is to define the notion of objectivity, since no ground-truth can properly serve as a basis of this definition. A natural approach is to consider the judgement of the other agents as a proxy for objectivity. Building on previous work by Parijs (who introduced ‘unanimous envy’) we propose the notion of approval envy: an agent experiences approval envy towards if she is envious of , and sufficiently many agents agree that this should be the case, from their own perspectives. Some interesting properties of this notion are put forward. Computing the minimal threshold guaranteeing approval envy clearly inherits well-known intractable results from envy-freeness, but (i) we identify some tractable cases such as house allocation; and (ii) we provide a general method based on a mixed integer programming encoding of the problem, which proves to be efficient in practice. This allows us in particular to show experimentally that existence of such allocations, with a rather small threshold, is very often observed.

Convolutional neural network (CNN) is widely used in computer vision applications. In the networks that deal with images, CNNs are the most time-consuming layer of the networks. Usually, the solution to address the computation cost is to decrease the number of trainable parameters. This solution usually comes with the cost of dropping the accuracy. Another problem with this technique is that usually the cost of memory access is not taken into account which results in insignificant speedup gain. The number of operations and memory access in a standard convolution layer is independent of the input content, which makes it limited for certain accelerations. We propose a simple modification to a standard convolution to bridge this gap. We propose an adaptive convolution that adopts different kernel sizes (or radii) based on the content. The network can learn and select the proper radius based on the input content in a soft decision manner. Our proposed radius-adaptive convolutional neural network (RACNN) has a similar number of weights to a standard one, yet, results show it can reach higher speeds. The code has been made available at: https://…/racnn.

The recent progress in neural architectures search (NAS) has allowed scaling the automated design of neural architectures to real-world domains such as object detection and semantic segmentation. However, one prerequisite for the application of NAS are large amounts of labeled data and compute resources. This renders its application challenging in few-shot learning scenarios, where many related tasks need to be learned, each with limited amounts of data and compute time. Thus, few-shot learning is typically done with a fixed neural architecture. To improve upon this, we propose MetaNAS, the first method which fully integrates NAS with gradient-based meta-learning. MetaNAS optimizes a meta-architecture along with the meta-weights during meta-training. During meta-testing, architectures can be adapted to a novel task with a few steps of the task optimizer, that is: task adaptation becomes computationally cheap and requires only little data per task. Moreover, MetaNAS is agnostic in that it can be used with arbitrary model-agnostic meta-learning algorithms and arbitrary gradient-based NAS methods. Empirical results on standard few-shot classification benchmarks show that MetaNAS with a combination of DARTS and REPTILE yields state-of-the-art results.

Graph matching plays a central role in such fields as computer vision, pattern recognition, and bioinformatics. Graph matching problems can be cast as two types of quadratic assignment problems (QAPs): Koopmans-Beckmann’s QAP or Lawler’s QAP. In our paper, we provide a unifying view for these two problems by introducing new rules for array operations in Hilbert spaces. Consequently, Lawler’s QAP can be considered as the Koopmans-Beckmann’s alignment between two arrays in reproducing kernel Hilbert spaces (RKHS), making it possible to efficiently solve the problem without computing a huge affinity matrix. Furthermore, we develop the entropy-regularized Frank-Wolfe (EnFW) algorithm for optimizing QAPs, which has the same convergence rate as the original FW algorithm while dramatically reducing the computational burden for each outer iteration. We conduct extensive experiments to evaluate our approach, and show that our algorithm significantly outperforms the state-of-the-art in both matching accuracy and scalability.

Detecting out-of-distribution examples is important for safety-critical machine learning applications such as self-driving vehicles. However, existing research mainly focuses on small-scale images where the whole image is considered anomalous. We propose to segment only the anomalous regions within an image, and hence we introduce the Combined Anomalous Object Segmentation benchmark for the more realistic task of large-scale anomaly segmentation. Our benchmark combines two novel datasets for anomaly segmentation that incorporate both realism and anomaly diversity. Using both real images and those from a simulated driving environment, we ensure the background context and a wide variety of anomalous objects are naturally integrated, unlike before. Additionally, we improve out-of-distribution detectors on large-scale multi-class datasets and introduce detectors for the previously unexplored setting of multi-label out-of-distribution detection. These novel baselines along with our anomaly segmentation benchmark open the door to further research in large-scale out-of-distribution detection and segmentation.

We address the problem that state-of-the-art Convolution Neural Networks (CNN) classifiers are not invariant to small shifts. The problem can be solved by the removal of sub-sampling operations such as stride and max pooling, but at a cost of severely degraded training and test efficiency. We present a novel usage of Gaussian-Hermite basis to efficiently approximate arbitrary filters within the CNN framework to obtain translation invariance. This is shown to be invariant to small shifts, and preserves the efficiency of training. Further, to improve efficiency in memory usage as well as computational speed, we show that it is still possible to sub-sample with this approach and retain a weaker form of invariance that we call \emph{translation insensitivity}, which leads to stability with respect to shifts. We prove these claims analytically and empirically. Our analytic methods further provide a framework for understanding any architecture in terms of translation insensitivity, and provide guiding principles for design.

Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.

We describe an explainable AI saliency map method for use with deep convolutional neural networks (CNN) that is much more efficient than popular gradient methods. It is also quantitatively similar and better in accuracy. Our technique works by measuring information at the end of each network scale which is then combined into a single saliency map. We describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. Finally, we visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods. Since our method only requires a single forward pass through a few of the layers in a network, it is at least 97x faster than Guided Backprop and much more accurate. Using our method instead of Guided Backprop, class activation methods such as Grad-CAM, Grad-CAM++ and Smooth Grad-CAM++ will run several orders of magnitude faster, have a significantly smaller memory footprint and be more accurate. This will make such methods feasible on resource limited platforms such as robots, cell phones and low cost industrial devices. This will also significantly help them work in extremely data intensive applications such as satellite image processing. All without sacrificing accuracy. Our method is generally straight forward and should be applicable to the most commonly used CNNs. We also show examples of our method used to enhance Grad-CAM++.

In many advanced network analysis applications, like social networks, e-commerce, and network security, hotspots are generally considered as a group of vertices that are tightly connected owing to the similar characteristics, such as common habits and location proximity. In this paper, we investigate the formation of hotspots from an alternative perspective that considers the routes along the network paths as the auxiliary information, and attempt to find the route hotspots in large labeled networks. A route hotspot is a cohesive subgraph that is covered by a set of routes, and these routes correspond to the same sequential pattern consisting of vertices’ labels. To the best of our knowledge, the problem of Finding Route Hotspots in Large Labeled Networks has not been tackled in the literature. However, it is challenging as counting the number of hotspots in a network is #P-hard. Inspired by the observation that the sizes of hotspots decrease with the increasing lengths of patterns, we prove several anti-monotonicity properties of hotspots, and then develop a scalable algorithm called FastRH that can use these properties to effectively prune the patterns that cannot form any hotspots. In addition, to avoid the duplicate computation overhead, we judiciously design an effective index structure called RH-Index for storing the hotspot and pattern information collectively, which also enables incremental updating and efficient query processing. Our experimental results on real-world datasets clearly demonstrate the effectiveness and scalability of our proposed methods.

An ontology language for ontology mediated query answering (OMQA-language) is universal for a family of OMQA-languages if it is the most expressive one among this family. In this paper, we focus on three families of tractable OMQA-languages, including first-order rewritable languages and languages whose data complexity of the query answering is in AC0 or PTIME. On the negative side, we prove that there is, in general, no universal language for each of these families of languages. On the positive side, we propose a novel property, the locality, to approximate the first-order rewritability, and show that there exists a language of disjunctive embedded dependencies that is universal for the family of OMQA-languages with locality. All of these results apply to OMQA with query languages such as conjunctive queries, unions of conjunctive queries and acyclic conjunctive queries.

A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, approximation, finding a nonzero entry in a vector (for turnstile algorithms) are not pseudo-deterministic. For example, in the instance of finding a nonzero entry in a vector, for any known low-space algorithm , there exists a stream so that running twice on (using different randomness) would with high probability result in two different entries as the output. In this work, we study whether it is inherent that these algorithms output different values on different executions. That is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for norm estimation. We also exhibit problems which do have low memory pseudo-deterministic algorithms but no low memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix. We also investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems such that every low space algorithm for the problem must have inputs where there are many valid outputs, all with a significant probability of being outputted.

Learning representations of data is an important problem in statistics and machine learning. While the origin of learning representations can be traced back to factor analysis and multidimensional scaling in statistics, it has become a central theme in deep learning with important applications in computer vision and computational neuroscience. In this article, we review recent advances in learning representations from a statistical perspective. In particular, we review the following two themes: (a) unsupervised learning of vector representations and (b) learning of both vector and matrix representations.

Machine Learning as a Service (MLaaS) has become a growing trend in recent years and several such services are currently offered. MLaaS is essentially a set of services that provides machine learning tools and capabilities as part of cloud computing services. In these settings, the cloud has pre-trained models that are deployed and large computing capacity whereas the clients can use these models to make predictions without having to worry about maintaining the models and the service. However, the main concern with MLaaS is the privacy of the client’s data. Although there have been several proposed approaches in the literature to run machine learning models on encrypted data, the performance is still far from being satisfactory for practical use. In this paper, we aim to accelerate the performance of running machine learning on encrypted data using combination of Fully Homomorphic Encryption (FHE), Convolutional Neural Networks (CNNs) and Graphics Processing Units (GPUs). We use a number of optimization techniques, and efficient GPU-based implementation to achieve high performance. We evaluate a CNN whose architecture is similar to AlexNet to classify homomorphically encrypted samples from the Cars Overhead With Context (COWC) dataset. To the best of our knowledge, it is the first time such a complex network and large dataset is evaluated on encrypted data. Our approach achieved reasonable classification accuracy of 95% for the COWC dataset. In terms of performance, our results show that we could achieve several thousands times speed up when we implement GPU-accelerated FHE operations on encrypted floating point numbers.

Multi-view clustering aims at integrating complementary information from multiple heterogeneous views to improve clustering results. Existing multi-view clustering solutions can only output a single clustering of the data. Due to their multiplicity, multi-view data, can have different groupings that are reasonable and interesting from different perspectives. However, how to find multiple, meaningful, and diverse clustering results from multi-view data is still a rarely studied and challenging topic in multi-view clustering and multiple clusterings. In this paper, we introduce a deep matrix factorization based solution (DMClusts) to discover multiple clusterings. DMClusts gradually factorizes multi-view data matrices into representational subspaces layer-by-layer and generates one clustering in each layer. To enforce the diversity between generated clusterings, it minimizes a new redundancy quantification term derived from the proximity between samples in these subspaces. We further introduce an iterative optimization procedure to simultaneously seek multiple clusterings with quality and diversity. Experimental results on benchmark datasets confirm that DMClusts outperforms state-of-the-art multiple clustering solutions.

We propose an approach towards natural language generation using a bidirectional encoder-decoder which incorporates external rewards through reinforcement learning (RL). We use attention mechanism and maximum mutual information as an initial objective function using RL. Using a two-part training scheme, we train an external reward analyzer to predict the external rewards and then use the predicted rewards to maximize the expected rewards (both internal and external). We evaluate the system on two standard dialogue corpora – Cornell Movie Dialog Corpus and Yelp Restaurant Review Corpus. We report standard evaluation metrics including BLEU, ROUGE-L, and perplexity as well as human evaluation to validate our approach.

We address the problem of disentangled representation learning with independent latent factors in graph convolutional networks (GCNs). The current methods usually learn node representation by describing its neighborhood as a perceptual whole in a holistic manner while ignoring the entanglement of the latent factors. However, a real-world graph is formed by the complex interaction of many latent factors (e.g., the same hobby, education or work in social network). While little effort has been made toward exploring the disentangled representation in GCNs. In this paper, we propose a novel Independence Promoted Graph Disentangled Networks (IPGDN) to learn disentangled node representation while enhancing the independence among node representations. In particular, we firstly present disentangled representation learning by neighborhood routing mechanism, and then employ the Hilbert-Schmidt Independence Criterion (HSIC) to enforce independence between the latent representations, which is effectively integrated into a graph convolutional framework as a regularizer at the output layer. Experimental studies on real-world graphs validate our model and demonstrate that our algorithms outperform the state-of-the-arts by a wide margin in different network applications, including semi-supervised graph classification, graph clustering and graph visualization.

In this extended abstract we introduce a novel {\em control-tutored} Q-learning approach (CTQL) as part of the ongoing effort in developing model-based and safe RL for continuous state spaces. We validate our approach by applying it to a challenging multi-agent herding control problem.

In this paper we propose a statistical model for dynamically evolving networks, together with a variational inference approach. Our model, which we call Dynamic Latent Attribute Interaction Model (DLAIM), encodes edge dependencies across different time snapshots. It represents nodes via latent attributes and uses attribute interaction matrices to model the presence of edges. Both are allowed to evolve with time, thus allowing us to capture the dynamics of the network. We develop a neural network based variational inference procedure that provides a suitable way to learn the model parameters. The main strengths of DLAIM are: (i) it is flexible as it does not impose strict assumptions on network evolution unlike existing approaches, (ii) it applies to both directed as well as undirected networks, and more importantly, (iii) learned node attributes and interaction matrices may be interpretable and therefore provide insights on the mechanisms behind network evolution. Experiments done on real world networks for the task of link forecasting demonstrate the superior performance of our model as compared to existing approaches.

Neural architecture search has recently attracted lots of research efforts as it promises to automate the manual design of neural networks. However, it requires a large amount of computing resources and in order to alleviate this, a performance prediction network has been recently proposed that enables efficient architecture search by forecasting the performance of candidate architectures, instead of relying on actual model training. The performance predictor is task-aware taking as input not only the candidate architecture but also task meta-features and it has been designed to collectively learn from several tasks. In this work, we introduce a pairwise ranking loss for training a network able to rank candidate architectures for a new unseen task conditioning on its task meta-features. We present experimental results, showing that the ranking network is more effective in architecture search than the previously proposed performance predictor.

Despite the effectiveness of sequence-to-sequence framework on the task of Short-Text Conversation (STC), the issue of under-exploitation of training data (i.e., the supervision signals from query text is \textit{ignored}) still remains unresolved. Also, the adopted \textit{maximization}-based decoding strategies, inclined to generating the generic responses or responses with repetition, are unsuited to the STC task. In this paper, we propose to formulate the STC task as a language modeling problem and tailor-make a training strategy to adapt a language model for response generation. To enhance generation performance, we design a relevance-promoting transformer language model, which performs additional supervised source attention after the self-attention to increase the importance of informative query tokens in calculating the token-level representation. The model further refines the query representation with relevance clues inferred from its multiple references during training. In testing, we adopt a \textit{randomization-over-maximization} strategy to reduce the generation of generic responses. Experimental results on a large Chinese STC dataset demonstrate the superiority of the proposed model on relevance metrics and diversity metrics.\footnote{Code available at https://…/.

Embedding large and high dimensional data into low dimensional vector spaces is a necessary task to computationally cope with contemporary data sets. Superseding latent semantic analysis recent approaches like word2vec or node2vec are well established tools in this realm. In the present paper we add to this line of research by introducing fca2vec, a family of embedding techniques for formal concept analysis (FCA). Our investigation contributes to two distinct lines of research. First, we enable the application of FCA notions to large data sets. In particular, we demonstrate how the cover relation of a concept lattice can be retrieved from a computational feasible embedding. Secondly, we show an enhancement for the classical node2vec approach in low dimension. For both directions the overall constraint of FCA of explainable results is preserved. We evaluate our novel procedures by computing fca2vec on different data sets like, wiki44 (a dense part of the Wikidata knowledge graph), the Mushroom data set and a publication network derived from the FCA community.

We propose a method for efficient training of deep Reinforcement Learning (RL) agents when the reward is highly sparse and non-Markovian, but at the same time admits a high-level yet unknown sequential structure, as seen in a number of video games. This high-level sequential structure can be expressed as a computer program, which our method infers automatically as the RL agent explores the environment. Through this process, a high-level sequential task that occurs only rarely may nonetheless be encoded within the inferred program. A hybrid architecture for deep neural fitted Q-iteration is then employed to fill in low-level details and generate an optimal control policy that follows the structure of the program. Our experiments show that the agent is able to synthesise a complex program to guide the RL exploitation phase, which is otherwise difficult to achieve with state-of-the-art RL techniques.

Machine learning systems based on deep neural networks (DNNs) produce state-of-the-art results in many applications. Considering the large amount of training data and know-how required to generate the network, it is more practical to use third-party DNN intellectual property (IP) cores for many designs. No doubt to say, it is essential for DNN IP vendors to provide test cases for functional validation without leaking their parameters to IP users. To satisfy this requirement, we propose to effectively generate test cases that activate parameters as many as possible and propagate their perturbations to outputs. Then the functionality of DNN IPs can be validated by only checking their outputs. However, it is difficult considering large numbers of parameters and highly non-linearity of DNNs. In this paper, we tackle this problem by judiciously selecting samples from the DNN training set and applying a gradient-based method to generate new test cases. Experimental results demonstrate the efficacy of our proposed solution.

Learning meaningful graphs from data plays important roles in many data mining and machine learning tasks, such as data representation and analysis, dimension reduction, data clustering, and visualization, etc. In this work, for the first time, we present a highly-scalable spectral approach (GRASPEL) for learning large graphs from data. By limiting the precision matrix to be a graph Laplacian, our approach aims to estimate ultra-sparse (tree-like) weighted undirected graphs and shows a clear connection with the prior graphical Lasso method. By interleaving the latest high-performance nearly-linear time spectral methods for graph sparsification, coarsening and embedding, ultra-sparse yet spectrally-robust graphs can be learned by identifying and including the most spectrally-critical edges into the graph. Compared with prior state-of-the-art graph learning approaches, GRASPEL is more scalable and allows substantially improving computing efficiency and solution quality of a variety of data mining and machine learning applications, such as spectral clustering (SC), and t-Distributed Stochastic Neighbor Embedding (t-SNE). {For example, when comparing with graphs constructed using existing methods, GRASPEL achieved the best spectral clustering efficiency and accuracy.

We propose a means to relate properties of an interconnected system to its separate component systems in the presence of cascade-like phenomena. Building on a theory of interconnection reminiscent of the behavioral approach to systems theory, we introduce the notion of generativity, and its byproduct, generative effects. Cascade effects, enclosing contagion phenomena and cascading failures, are seen as instances of generative effects. The latter are precisely the instances where properties of interest are not preserved or behave very badly when systems interact. The goal is to overcome that obstruction. We will show how to extract mathematical objects from the systems, that encode their generativity: their potential to generate new phenomena upon interaction. Those objects may then be used to link the properties of the interconnected system to its separate systems. Such a link will be executed through the use of exact sequences from commutative algebra.

Recently, neural networks have been used as implicit representations for surface reconstruction, modelling, learning, and generation. So far, training neural networks to be implicit representations of surfaces required training data sampled from a ground-truth signed implicit functions such as signed distance or occupancy functions, which are notoriously hard to compute. In this paper we introduce Sign Agnostic Learning (SAL), a deep learning approach for learning implicit shape representations directly from raw, unsigned geometric data, such as point clouds and triangle soups. We have tested SAL on the challenging problem of surface reconstruction from an un-oriented point cloud, as well as end-to-end human shape space learning directly from raw scans dataset, and achieved state of the art reconstructions compared to current approaches. We believe SAL opens the door to many geometric deep learning applications with real-world data, alleviating the usual painstaking, often manual pre-process.

Selecting and combining the outlier scores of different base detectors used within outlier ensembles can be quite challenging in the absence of ground truth. In this paper, an unsupervised outlier detector combination framework called DCSO is proposed, demonstrated and assessed for the dynamic selection of most competent base detectors, with an emphasis on data locality. The proposed DCSO framework first defines the local region of a test instance by its k nearest neighbors and then identifies the top-performing base detectors within the local region. Experimental results on ten benchmark datasets demonstrate that DCSO provides consistent performance improvement over existing static combination approaches in mining outlying objects. To facilitate interpretability and reliability of the proposed method, DCSO is analyzed using both theoretical frameworks and visualization techniques, and presented alongside empirical parameter setting instructions that can be used to improve the overall performance.

Joint extraction of entities and relations has received significant attention due to its potential of providing higher performance for both tasks. Among existing methods, CopyRE is effective and novel, which uses a sequence-to-sequence framework and copy mechanism to directly generate the relation triplets. However, it suffers from two fatal problems. The model is extremely weak at differing the head and tail entity, resulting in inaccurate entity extraction. It also cannot predict multi-token entities (e.g. \textit{Steven Jobs}). To address these problems, we give a detailed analysis of the reasons behind the inaccurate entity extraction problem, and then propose a simple but extremely effective model structure to solve this problem. In addition, we propose a multi-task learning framework equipped with copy mechanism, called CopyMTL, to allow the model to predict multi-token entities. Experiments reveal the problems of CopyRE and show that our model achieves significant improvement over the current state-of-the-art method by 9% in NYT and 16% in WebNLG (F1 score). Our code is available at https://…/CopyMTL

Answering questions that require multi-hop reasoning at web-scale necessitates retrieving multiple evidence documents, one of which often has little lexical or semantic relationship to the question. This paper introduces a new graph-based recurrent retrieval approach that learns to retrieve reasoning paths over the Wikipedia graph to answer multi-hop open-domain questions. Our retriever model trains a recurrent neural network that learns to sequentially retrieve evidence paragraphs in the reasoning path by conditioning on the previously retrieved documents. Our reader model ranks the reasoning paths and extracts the answer span included in the best reasoning path. Experimental results show state-of-the-art results in three open-domain QA datasets, showcasing the effectiveness and robustness of our method. Notably, our method achieves significant improvement in HotpotQA, outperforming the previous best model by more than 14 points.

As deep learning techniques advance more than ever, hyper-parameter optimization is the new major workload in deep learning clusters. Although hyper-parameter optimization is crucial in training deep learning models for high model performance, effectively executing such a computation-heavy workload still remains a challenge. We observe that numerous trials issued from existing hyper-parameter optimization algorithms share common hyper-parameter sequence prefixes, which implies that there are redundant computations from training the same hyper-parameter sequence multiple times. We propose a stage-based execution strategy for efficient execution of hyper-parameter optimization algorithms. Our strategy removes redundancy in the training process by splitting the hyper-parameter sequences of trials into homogeneous stages, and generating a tree of stages by merging the common prefixes. Our preliminary experiment results show that applying stage-based execution to hyper-parameter optimization algorithms outperforms the original trial-based method, saving required GPU-hours and end-to-end training time by up to 6.60 times and 4.13 times, respectively.

Discriminative feature representation of person image is important for person re-identification (Re-ID) task. Recently, attributes have been demonstrated beneficially in guiding for learning more discriminative feature representations for Re-ID. As attributes normally co-occur in person images, it is desirable to model the attribute dependencies to improve the attribute prediction and thus Re-ID results. In this paper, we propose to model these attribute dependencies via a novel attribute knowledge graph (AttKG), and propose a novel Attribute Knowledge Graph Convolutional Network (AttKGCN) to solve Re-ID problem. AttKGCN integrates both attribute prediction and Re-ID learning together in a unified end-to-end framework which can boost their performances, respectively. AttKGCN first builds a directed attribute KG whose nodes denote attributes and edges encode the co-occurrence relationships of different attributes. Then, AttKGCN learns a set of inter-dependent attribute classifiers which are combined with person visual descriptors for attribute prediction. Finally, AttKGCN integrates attribute description and deeply visual representation together to construct a more discriminative feature representation for Re-ID task. Extensive experiments on several benchmark datasets demonstrate the effectiveness of AttKGCN on attribute prediction and Re-ID tasks.

In many real-world applications of machine learning, data are distributed across many clients and cannot leave the devices they are stored on. Furthermore, each client’s data, computational resources and communication constraints may be very different. This setting is known as federated learning, in which privacy is a key concern. Differential privacy is commonly used to provide mathematical privacy guarantees. This work, to the best of our knowledge, is the first to consider federated, differentially private, Bayesian learning. We build on Partitioned Variational Inference (PVI) which was recently developed to support approximate Bayesian inference in the federated setting. We modify the client-side optimisation of PVI to provide an (, )-DP guarantee. We show that it is possible to learn moderately private logistic regression models in the federated setting that achieve similar performance to models trained non-privately on centralised data.

We investigate an algorithm named histogram transform ensembles (HTE) density estimator whose effectiveness is supported by both solid theoretical analysis and significant experimental performance. On the theoretical side, by decomposing the error term into approximation error and estimation error, we are able to conduct the following analysis: First of all, we establish the universal consistency under -norm. Secondly, under the assumption that the underlying density function resides in the H\'{o}lder space , we prove almost optimal convergence rates for both single and ensemble density estimators under -norm and -norm for different tail distributions, whereas in contrast, for its subspace consisting of smoother functions, almost optimal convergence rates can only be established for the ensembles and the lower bound of the single estimators illustrates the benefits of ensembles over single density estimators. In the experiments, we first carry out simulations to illustrate that histogram transform ensembles surpass single histogram transforms, which offers powerful evidence to support the theoretical results in the space . Moreover, to further exert the experimental performances, we propose an adaptive version of HTE and study the parameters by generating several synthetic datasets with diversities in dimensions and distributions. Last but not least, real data experiments with other state-of-the-art density estimators demonstrate the accuracy of the adaptive HTE algorithm.

We present a facial landmark position correlation analysis as well as its applications. Although numerous facial landmark detection methods have been presented in the literature, few of them concern the intrinsic relationship among the landmarks. In order to reveal and interpret this relationship, we propose to analyze the facial landmark correlation by using Canonical Correlation Analysis (CCA). We experimentally show that dense facial landmark annotations in current benchmarks are strongly correlated, and we propose several applications based on this analysis. First, we give insights into the predictions from different facial landmark detection models (including cascaded random forests, cascaded Convolutional Neural Networks (CNNs), heatmap regression models) and interpret how CNNs progressively learn to predict facial landmarks. Second, we propose a few-shot learning method that allows to considerably reduce manual effort for dense landmark annotation. To this end, we select a portion of landmarks from the dense annotation format to form a sparse format, which is mostly correlated to the rest of them. Thanks to the strong correlation among the landmarks, the entire set of dense facial landmarks can then be inferred from the annotation in the sparse format by transfer learning. Unlike the previous methods, we mainly focus on how to find the most efficient sparse format to annotate. Overall, our correlation analysis provides new perspectives for the research on facial landmark detection.

Recent years have witnessed significant advances in reinforcement learning (RL), which has registered great success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history, and has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field on the mark, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as continuing stimulus for researchers interested in working on this exciting while challenging topic.

This paper considers online convex optimization (OCO) problems – the paramount framework for online learning algorithm design. The loss function of learning task in OCO setting is based on streaming data so that OCO is a powerful tool to model large scale applications such as online recommender systems. Meanwhile, real-world data are usually of extreme high-dimensional due to modern feature engineering techniques so that the quadratic regression is impractical. Factorization Machine as well as its variants are efficient models for capturing feature interactions with low-rank matrix model but they can’t fulfill the OCO setting due to their non-convexity. In this paper, We propose a projective quadratic regression (PQR) model. First, it can capture the import second-order feature information. Second, it is a convex model, so the requirements of OCO are fulfilled and the global optimal solution can be achieved. Moreover, existing modern online optimization methods such as Online Gradient Descent (OGD) or Follow-The-Regularized-Leader (FTRL) can be applied directly. In addition, by choosing a proper hyper-parameter, we show that it has the same order of space and time complexity as the linear model and thus can handle high-dimensional data. Experimental results demonstrate the performance of the proposed PQR model in terms of accuracy and efficiency by comparing with the state-of-the-art methods.

Mutual information is widely applied to learn latent representations of observations, whilst its implication in classification neural networks remain to be better explained. In this paper, we show that optimising the parameters of classification neural networks with softmax cross-entropy is equivalent to maximising the mutual information between inputs and labels under the balanced data assumption. Through the experiments on synthetic and real datasets, we show that softmax cross-entropy can estimate mutual information approximately. When applied to image classification, this relation helps approximate the point-wise mutual information between an input image and a label without modifying the network structure. In this end, we propose infoCAM, informative class activation map, which highlights regions of the input image that are the most relevant to a given label based on differences in information. The activation map helps localise the target object in an image. Through the experiments on the semi-supervised object localisation task with two real-world datasets, we evaluate the effectiveness of the information-theoretic approach.

Recent advances in adversarial attacks uncover the intrinsic vulnerability of modern deep neural networks. Since then, extensive efforts have been devoted to enhancing the robustness of deep networks via specialized learning algorithms and loss functions. In this work, we take an architectural perspective and investigate the patterns of network architectures that are resilient to adversarial attacks. To obtain the large number of networks needed for this study, we adopt one-shot neural architecture search, training a large network for once and then finetuning the sub-networks sampled therefrom. The sampled architectures together with the accuracies they achieve provide a rich basis for our study. Our ‘robust architecture Odyssey’ reveals several valuable observations: 1) densely connected patterns result in improved robustness; 2) under computational budget, adding convolution operations to direct connection edge is effective; 3) flow of solution procedure (FSP) matrix is a good indicator of network robustness. Based on these observations, we discover a family of robust architectures (RobNets). On various datasets, including CIFAR, SVHN, and Tiny-ImageNet, RobNets exhibit superior robustness performance to other widely used architectures. Notably, RobNets substantially improve the robust accuracy (~5% absolute gains) under both white-box and black-box attacks, even with fewer parameter numbers.

The interactions of users and items in recommender system could be naturally modeled as a user-item bipartite graph. In recent years, we have witnessed an emerging research effort in exploring user-item graph for collaborative filtering methods. Nevertheless, the formation of user-item interactions typically arises from highly complex latent purchasing motivations, such as high cost performance or eye-catching appearance, which are indistinguishably represented by the edges. The existing approaches still remain the differences between various purchasing motivations unexplored, rendering the inability to capture fine-grained user preference. Therefore, in this paper we propose a novel Multi-Component graph convolutional Collaborative Filtering (MCCF) approach to distinguish the latent purchasing motivations underneath the observed explicit user-item interactions. Specifically, there are two elaborately designed modules, decomposer and combiner, inside MCCF. The former first decomposes the edges in user-item graph to identify the latent components that may cause the purchasing relationship; the latter then recombines these latent components automatically to obtain unified embeddings for prediction. Furthermore, the sparse regularizer and weighted random sample strategy are utilized to alleviate the overfitting problem and accelerate the optimization. Empirical results on three real datasets and a synthetic dataset not only show the significant performance gains of MCCF, but also well demonstrate the necessity of considering multiple components.

We consider the problem of selecting a seed set to maximize the expected number of influenced nodes in the social network, referred to as the \textit{influence maximization} (IM) problem. We assume that the topology of the social network is prescribed while the influence probabilities among edges are unknown. In order to learn the influence probabilities and simultaneously maximize the influence spread, we consider the tradeoff between exploiting the current estimation of the influence probabilities to ensure certain influence spread and exploring more nodes to learn better about the influence probabilities. The exploitation-exploration trade-off is the core issue in the multi-armed bandit (MAB) problem. If we regard the influence spread as the reward, then the IM problem could be reduced to the combinatorial multi-armed bandits. At each round, the learner selects a limited number of seed nodes in the social network, then the influence spreads over the network according to the real influence probabilities. The learner could observe the activation status of the edge if and only if its start node is influenced, which is referred to as the edge-level semi-bandit feedback. Two classical bandit algorithms including Thompson Sampling and Epsilon Greedy are used to solve this combinatorial problem. To ensure the robustness of these two algorithms, we use an automatic ensemble learning strategy, which combines the exploration strategy with exploitation strategy. The ensemble algorithm is self-adaptive regarding that the probability of each algorithm could be adjusted based on the historical performance of the algorithm. Experimental evaluation illustrates the effectiveness of the automatically adjusted hybridization of exploration algorithm with exploitation algorithm.

Multiplex networks are convenient mathematical representations for many real-world — biological, social, and technological — systems of interacting elements, where pairwise interactions among elements have different flavors. Previous studies pointed out that real-world multiplex networks display significant inter-layer correlations — degree-degree correlation, edge overlap, node similarities — able to make them robust against random and targeted failures of their individual components. Here, we show that inter-layer correlations are important also in the characterization of their -core structure, namely the organization in shells of nodes with increasingly high degree. Understanding -core structures is important in the study of spreading processes taking place on networks, as for example in the identification of influential spreaders and the emergence of localization phenomena. We find that, if the degree distribution of the network is heterogeneous, then a strong -core structure is well predicted by significantly positive degree-degree correlations. However, if the network degree distribution is homogeneous, then strong -core structure is due to positive correlations at the level of node similarities. We reach our conclusions by analyzing different real-world multiplex networks, introducing novel techniques for controlling inter-layer correlations of networks without changing their structure, and taking advantage of synthetic network models with tunable levels of inter-layer correlations.

One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehensive set of relevant arguments, over a wide range of topics, it requires leveraging a large and diverse corpus in an appropriate manner. Here we present a first end-to-end high-precision, corpus-wide argument mining system. This is made possible by combining sentence-level queries over an appropriate indexing of a very large corpus of newspaper articles, with an iterative annotation scheme. This scheme addresses the inherent label bias in the data and pinpoints the regions of the sample space whose manual labeling is required to obtain high-precision among top-ranked candidates.

We present Neural Random Forest Imitation – a novel approach for transforming random forests into neural networks. Existing methods produce very inefficient architectures and do not scale. In this paper, we introduce a new method for generating data from a random forest and learning a neural network that imitates it. Without any additional training data, this transformation creates very efficient neural networks that learn the decision boundaries of a random forest. The generated model is fully differentiable and can be combined with the feature extraction in a single pipeline enabling further end-to-end processing. Experiments on several real-world benchmark datasets demonstrate outstanding performance in terms of scalability, accuracy, and learning with very few training examples. Compared to state-of-the-art mappings, we significantly reduce the network size while achieving the same or even improved accuracy due to better generalization.

This research proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or . Within the context of topic modeling, has the same interpretation that it does when used in a broader class of statistical models. Reporting with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. The author proposes that should be reported as a standard metric when constructing topic models.

In this chapter, we give an introduction to symbolic artificial intelligence (AI) and discuss its relation and application to multimedia. We begin by defining what symbolic AI is, what distinguishes it from non-symbolic approaches, such as machine learning, and how it can used in the construction of advanced multimedia applications. We then introduce description logic (DL) and use it to discuss symbolic representation and reasoning. DL is the logical underpinning of OWL, the most successful family of ontology languages. After discussing DL, we present OWL and related Semantic Web technologies, such as RDF and SPARQL. We conclude the chapter by discussing a hybrid model for multimedia representation, called Hyperknowledge. Throughout the text, we make references to technologies and extensions specifically designed to solve the kinds of problems that arise in multimedia representation.

I overview recent research advances in Bayesian state-space modeling of multivariate time series. A main focus is on the decouple/recouple concept that enables application of state-space models to increasingly large-scale data, applying to continuous or discrete time series outcomes. The scope includes large-scale dynamic graphical models for forecasting and multivariate volatility analysis in areas such as economics and finance, multi-scale approaches for forecasting discrete/count time series in areas such as commercial sales and demand forecasting, and dynamic network flow models for areas including internet traffic monitoring. In applications, explicit forecasting, monitoring and decision goals are paramount and should factor into model assessment and comparison, a perspective that is highlighted.

There is an increasing number of pre-trained deep neural network models. However, it is still unclear how to effectively use these models for a new task. Transfer learning, which aims to transfer knowledge from source tasks to a target task, is an effective solution to this problem. Fine-tuning is a popular transfer learning technique for deep neural networks where a few rounds of training are applied to the parameters of a pre-trained model to adapt them to a new task. Despite its popularity, in this paper, we show that fine-tuning suffers from several drawbacks. We propose an adaptive fine-tuning approach, called AdaFilter, which selects only a part of the convolutional filters in the pre-trained model to optimize on a per-example basis. We use a recurrent gated network to selectively fine-tune convolutional filters based on the activations of the previous layer. We experiment with 7 public image classification datasets and the results show that AdaFilter can reduce the average classification error of the standard fine-tuning by 2.54%.

For Bayesian computation in big data contexts, the divide-and-conquer MCMC concept splits the whole data set into batches, runs MCMC algorithms separately over each batch to produce samples of parameters, and combines them to produce an approximation of the target distribution. In this article, we embed random forests into this framework and use each subposterior/partial-posterior as a proposal distribution to implement importance sampling. Unlike the existing divide-and-conquer MCMC, our methods are based on scaled subposteriors, whose scale factors are not necessarily restricted to being equal to one or to the number of subsets. Through several experiments, we show that our methods work well with models ranging from Gaussian cases to strongly non-Gaussian cases, and include model misspecification.

Humans can learn a variety of concepts and skills incrementally over the course of their lives while exhibiting an array of desirable properties, such as non-forgetting, concept rehearsal, forward transfer and backward transfer of knowledge, and so on. Previous approaches to lifelong learning (LLL) have demonstrated subsets of these properties, often with multiple mechanisms. In this paper, we propose a simple yet powerful unified framework that demonstrates all of these desirable properties. Our novel framework utilizes a small number of weight consolidation parameters dynamically applied to groups of weights, reflecting how ‘stiff’ weights can be modified during training in deep neural networks. In addition, we are able to draw many parallels between the behaviours and mechanisms of our model and those surrounding human learning, such as memory loss or sleep deprivation. This allows our approach to serve as a conduit for two-way inspiration to further understand lifelong learning in machines and humans.

Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity – introducing attitudes via framing, presupposing truth, and casting doubt – remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view (‘neutralizing’ biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.

Historically, the pursuit of efficient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include: the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly, which we open-source for the benefit of the community as part of the XNNPACK library. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EfficientNet architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by — equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures.

We present the Global Health Monitor, an online Web-based system for detecting and mapping infectious disease outbreaks that appear in news stories. The system analyzes English news stories from news feed providers, classifies them for topical relevance and plots them onto a Google map using geo-coding information, helping public health workers to monitor the spread of diseases in a geo-temporal context. The background knowledge for the system is contained in the BioCaster ontology (BCO) (Collier et al., 2007a) which includes both information on infectious diseases as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and visualization. Evaluation of the system shows that it achieved high accuracy on a gold standard corpus. The system is now in practical use. Running on a clustercomputer, it monitors more than 1500 news feeds 24/7, updating the map every hour.

Batch Normalization (BN) is a highly successful and widely used batch dependent training method. Its use of mini-batch statistics to normalize the activations introduces dependence between samples, which can hurt the training if the mini-batch size is too small, or if the samples are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address these issues. However, they either do not match the performance of BN for large batches, or still exhibit degradation in performance for smaller batches, or introduce artificial constraints on the model architecture. In this paper we propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function, that can be used as a drop-in replacement for other normalizations and activations. Our method operates on each activation map of each batch sample independently, eliminating the dependency on other batch samples or channels of the same sample. Our method outperforms BN and all alternatives in a variety of settings for all batch sizes. FRN layer performs better on top-1 validation accuracy than BN with large mini-batch sizes on Imagenet classification on InceptionV3 and ResnetV2-50 architectures. Further, it performs better than GN on the same problem in the small mini-batch size regime. For object detection problem on COCO dataset, FRN layer outperforms all other methods by at least in all batch size regimes.

In this paper, we study normalization methods for neural networks from the perspective of elimination singularity. Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape which will slow down training and harm model performances. We show that channel-based normalizations (e.g. Layer Normalization and Group Normalization) are unable to guarantee a far distance from elimination singularities, in contrast with Batch Normalization which by design avoids models from getting too close to them. To address this issue, we propose BatchChannel Normalization (BCN), which uses batch knowledge to avoid the elimination singularities in the training of channel-normalized models. Unlike Batch Normalization, BCN is able to run in both large-batch and micro-batch training settings. The effectiveness of BCN is verified on many tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is here: https://…/Batch-Channel-Normalization.

Entity linking is the task of linking mentions of named entities in natural language text, to entities in a curated knowledge-base. This is of significant importance in the biomedical domain, where it could be used to semantically annotate a large volume of clinical records and biomedical literature, to standardized concepts described in an ontology such as Unified Medical Language System (UMLS). We observe that with precise type information, entity disambiguation becomes a straightforward task. However, fine-grained type information is usually not available in biomedical domain. Thus, we propose LATTE, a LATent Type Entity Linking model, that improves entity linking by modeling the latent fine-grained type information about mentions and entities. Unlike previous methods that perform entity linking directly between the mentions and the entities, LATTE jointly does entity disambiguation, and latent fine-grained type learning, without direct supervision. We evaluate our model on two biomedical datasets: MedMentions, a large scale public dataset annotated with UMLS concepts, and a de-identified corpus of dictated doctor’s notes that has been annotated with ICD concepts. Extensive experimental evaluation shows our model achieves significant performance improvements over several state-of-the-art techniques.

Listwise learning-to-rank methods form a powerful class of ranking algorithms that are widely adopted in applications such as information retrieval. These algorithms learn to rank a set of items by optimizing a loss that is a function of the entire set—as a surrogate to a typically non-differentiable ranking metric. Despite their empirical success, existing listwise methods are based on heuristics and remain theoretically ill-understood. In particular, none of the empirically-successful loss functions are related to ranking metrics. In this work, we propose a cross entropy-based learning-to-rank loss function that is theoretically sound and is a convex bound on NDCG, a popular ranking metric. Furthermore, empirical evaluation of an implementation of the proposed method with gradient boosting machines on benchmark learning-to-rank datasets demonstrates the superiority of our proposed formulation over existing algorithms in quality and robustness.

The complex world around us is inherently multimodal and sequential (continuous). Information is scattered across different modalities and requires multiple continuous sensors to be captured. As machine learning leaps towards better generalization to real world, multimodal sequential learning becomes a fundamental research area. Arguably, modeling arbitrarily distributed spatio-temporal dynamics within and across modalities is the biggest challenge in this research area. In this paper, we present a new transformer model, called the Factorized Multimodal Transformer (FMT) for multimodal sequential learning. FMT inherently models the intramodal and intermodal (involving two or more modalities) dynamics within its multimodal input in a factorized manner. The proposed factorization allows for increasing the number of self-attentions to better model the multimodal phenomena at hand; without encountering difficulties during training (e.g. overfitting) even on relatively low-resource setups. All the attention mechanisms within FMT have a full time-domain receptive field which allows them to asynchronously capture long-range multimodal dynamics. In our experiments we focus on datasets that contain the three commonly studied modalities of language, vision and acoustic. We perform a wide range of experiments, spanning across 3 well-studied datasets and 21 distinct labels. FMT shows superior performance over previously proposed models, setting new state of the art in the studied datasets.

Artificial Intelligence (AI) has become an integral part of modern-day security solutions for its capability of learning very complex functions and handling ‘Big Data’. However, the lack of explainability and interpretability of successful AI models is a key stumbling block when trust in a model’s prediction is critical. This leads to human intervention, which in turn results in a delayed response or decision. While there have been major advancements in the speed and performance of AI-based intrusion detection systems, the response is still at human speed when it comes to explaining and interpreting a specific prediction or decision. In this work, we infuse popular domain knowledge (i.e., CIA principles) in our model for better explainability and validate the approach on a network intrusion detection test case. Our experimental results suggest that the infusion of domain knowledge provides better explainability as well as a faster decision or response. In addition, the infused domain knowledge generalizes the model to work well with unknown attacks, as well as open the path to adapt to a large stream of network traffic from numerous IoT devices.

Scarcity of labeled data is a bottleneck for supervised learning models. A paradigm that has evolved for dealing with this problem is data programming. An existing data programming paradigm allows human supervision to be provided as a set of discrete labeling functions (LF) that output possibly noisy labels to input instances and a generative modelfor consolidating the weak labels. We enhance and generalize this paradigm by supporting functions that output a continuous score (instead of a hard label) that noisily correlates with labels. We show across five applications that continuous LFs are more natural to program and lead to improved recall. We also show that accuracy of existing generative models is unstable with respect to initialization, training epochs, and learning rates. We give control to the data programmer to guide the training process by providing intuitive quality guides with each LF. We propose an elegant method of incorporating these guides into the generative model. Our overall method, called CAGE, makes the data programming paradigm more reliable than other tricks based on initialization, sign-penalties, or soft-accuracy constraints.

The diversity of intrinsic qualities of multimedia entities tends to impede their effective retrieval. In a SelfLearning Search Engine architecture, the subtle nuances of human perceptions and deep knowledge are taught and captured through unsupervised reinforcement learning, where the degree of reinforcement may be suitably calibrated. Such architectural paradigm enables indexes to evolve naturally while accommodating the dynamic changes of user interests. It operates by continuously constructing indexes over time, while injecting progressive improvement in search performance. For search operations to be effective, convergence of index learning is of crucial importance to ensure efficiency and robustness. In this paper, we develop a Self-Learning Search Engine architecture based on reinforcement learning using a Markov Decision Process framework. The balance between exploration and exploitation is achieved through evolutionary exploration Strategies. The evolutionary index learning behavior is then studied and formulated using stochastic analysis. Experimental results are presented which corroborate the steady convergence of the index evolution mechanism. Index Term

High performance machine learning models have become highly dependent on the availability of large quantity and quality of training data. To achieve this, various central agencies such as the government have suggested for different data providers to pool their data together to learn a unified predictive model, which performs better. However, these providers are usually profit-driven and would only agree to participate inthe data sharing process if the process is deemed both profitable and fair for themselves. Due to the lack of existing literature, it is unclear whether a fair and stable outcome is possible in such data sharing processes. Hence, we wish to investigate the outcomes surrounding these scenarios and study if data providers would even agree to collaborate in the first place. Tapping on cooperative game concepts in Game Theory, we introduce the data sharing process between a group of agents as a new class of cooperative games with modified definition of stability and fairness. Using these new definitions, we then theoretically study the optimal and suboptimal outcomes of such data sharing processes and their sensitivity to perturbation.Through experiments, we present intuitive insights regarding theoretical results analysed in this paper and discuss various ways in which data can be valued reasonably.

Age-of-information (AoI) is a metric quantifying information freshness at the receiver. It has received a lot of attention as it captures the delay together with packet loss and packet generation rate. However, most of the work aim for the average or peak AoI neglecting the complete distribution. In this work, we consider a n-hop network between a source and destination pair with time-invariant packet loss probabilities on each link. We derive closed form equations for the probability mass function of AoI at the receiver. We verify our findings with simulations. Our results show that the performance indicators considered in the literature such as average AoI or peak AoI may give misleading insights into the complete AoI performance.

Loss functions play a crucial role in deep metric learning thus a variety of them have been proposed. Some supervise the learning process by pairwise or tripletwise similarity constraints while others take advantage of structured similarity information among multiple data points. In this work, we approach deep metric learning from a novel perspective. We propose instance cross entropy (ICE) which measures the difference between an estimated instance-level matching distribution and its ground-truth one. ICE has three main appealing properties. Firstly, similar to categorical cross entropy (CCE), ICE has clear probabilistic interpretation and exploits structured semantic similarity information for learning supervision. Secondly, ICE is scalable to infinite training data as it learns on mini-batches iteratively and is independent of the training set size. Thirdly, motivated by our relative weight analysis, seamless sample reweighting is incorporated. It rescales samples’ gradients to control the differentiation degree over training examples instead of truncating them by sample mining. In addition to its simplicity and intuitiveness, extensive experiments on three real-world benchmarks demonstrate the superiority of ICE.

-NN classifier is one of the most famous classification algorithms, whose performance is crucially dependent on the distance metric. When we consider the distance metric as a parameter of -NN, learning an appropriate distance metric for -NN can be seen as minimizing the empirical risk of -NN. In this paper, we design a new type of continuous decision function of the -NN classification rule which can be used to construct the continuous empirical risk function of -NN. By minimizing this continuous empirical risk function, we obtain a novel distance metric learning algorithm named as adaptive nearest neighbor (ANN). We have proved that the current algorithms such as the large margin nearest neighbor (LMNN), neighbourhood components analysis (NCA) and the pairwise constraint methods are special cases of the proposed ANN by setting the parameter different values. Compared with the LMNN, NCA, and pairwise constraint methods, our method has a broader searching space which may contain better solutions. At last, extensive experiments on various data sets are conducted to demonstrate the effectiveness and efficiency of the proposed method.

We consider the problem of reinforcing federated learning with formal privacy guarantees. We propose to employ Bayesian differential privacy, a relaxation of differential privacy for similarly distributed data, to provide sharper privacy loss bounds. We adapt the Bayesian privacy accounting method to the federated setting and suggest multiple improvements for more efficient privacy budgeting at different levels. Our experiments show significant advantage over the state-of-the-art differential privacy bounds for federated learning on image classification tasks, including a medical application, bringing the privacy budget below 1 at the client level, and below 0.1 at the instance level. Lower amounts of noise also benefit the model accuracy and reduce the number of communication rounds.

The ability to observe the effects of actions performed by others and to infer their intent, most likely goals, or course of action, is known as a plan or intention recognition cognitive capability and has long been one of the fundamental research challenges in AI. Deep learning has recently been making significant inroads on various pattern recognition problems, except for intention recognition. While extensively explored since the seventies, the problem remains unsolved for most interesting cases in various areas, ranging from natural language understanding to human behavior understanding based on video feeds. This paper compares symbolic inverse planning, one of the most investigated approaches to goal recognition, to deep learning using CNN and LTSM neural network architectures, on five synthetic benchmarks often used in the literature. The results show that the deep learning approach achieves better goal-prediction accuracy and timeliness than the symbolic cost-based plan recognizer in these domains. Although preliminary, these results point to interesting future research avenues.

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that ‘adapts’ to its current learning state and estimates the importance of each training data instance. Training such an adaptive scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance at a given time during the training, it is typically necessary to first complete the entire training process. To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). In DDS, we formulate a scorer network as a learnable function of the training data, which can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weigh the data that has a similar gradient with a dev set upon which we would finally like to perform well. Without significant computing overhead, DDS delivers strong and consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.

The process of identifying and understanding art styles to discover artistic influences is essential to the study of art history. Traditionally, trained experts review fine details of the works and compare them to other known works. To automate and scale this task, we use several state-of-the-art CNN architectures to explore how a machine may help perceive and quantify art styles. This study explores: (1) How accurately can a machine classify art styles? (2) What may be the underlying relationships among different styles and artists? To help answer the first question, our best-performing model using Inception V3 achieves a 9-class classification accuracy of 88.35%, which outperforms the model in Elgammal et al.’s study by more than 20 percent. Visualizations using Grad-CAM heat maps confirm that the model correctly focuses on the characteristic parts of paintings. To help address the second question, we conduct network analysis on the influences among styles and artists by extracting 512 features from the best-performing classification model. Through 2D and 3D T-SNE visualizations, we observe clear chronological patterns of development and separation among the art styles. The network analysis also appears to show anticipated artist level connections from an art historical perspective. This technique appears to help identify some previously unknown linkages that may shed light upon new directions for further exploration by art historians. We hope that humans and machines working in concert may bring new opportunities to the field.

Artificial Intelligence (AI) has become an integral part of domains such as security, finance, healthcare, medicine, and criminal justice. Explaining the decisions of AI systems in human terms is a key challenge–due to the high complexity of the model, as well as the potential implications on human interests, rights, and lives . While Explainable AI is an emerging field of research, there is no consensus on the definition, quantification, and formalization of explainability. In fact, the quantification of explainability is an open challenge. In our previous work, we incorporated domain knowledge for better explainability, however, we were unable to quantify the extent of explainability. In this work, we (1) briefly analyze the definitions of explainability from the perspective of different disciplines (e.g., psychology, social science), properties of explanation, explanation methods, and human-friendly explanations; and (2) propose and formulate an approach to quantify the extent of explainability. Our experimental result suggests a reasonable and model-agnostic way to quantify explainability

Multi-agent coordination is prevalent in many real-world applications. However, such coordination is challenging due to its combinatorial nature. An important observation in this regard is that agents in the real world often only directly affect a limited set of neighboring agents. Leveraging such loose couplings among agents is key to making coordination in multi-agent systems feasible. In this work, we focus on learning to coordinate. Specifically, we consider the multi-agent multi-armed bandit framework, in which fully cooperative loosely-coupled agents must learn to coordinate their decisions to optimize a common objective. As opposed to in the planning setting, for learning methods it is challenging to establish theoretical guarantees. We propose multi-agent Thompson sampling (MATS), a new Bayesian exploration-exploitation algorithm that leverages loose couplings. We provide a regret bound that is sublinear in time and low-order polynomial in the highest number of actions of a single agent for sparse coordination graphs. Finally, we empirically show that MATS outperforms the state-of-the-art algorithm, MAUCE, on two synthetic benchmarks, a realistic wind farm control task, and a novel benchmark with Poisson distributions.

In this work we have analyzed a novel concept of sequential binding based learning capable network based on the coupling of recurrent units with Bayesian prior definition. The coupling structure encodes to generate efficient tensor representations that can be decoded to generate efficient sentences and can describe certain events. These descriptions are derived from structural representations of visual features of images and media. An elaborated study of the different types of coupling recurrent structures are studied and some insights of their performance are provided. Supervised learning performance for natural language processing is judged based on statistical evaluations, however, the truth is perspective, and in this case the qualitative evaluations reveal the real capability of the different architectural strengths and variations. Bayesian prior definition of different embedding helps in better characterization of the sentences based on the natural language structure related to parts of speech and other semantic level categorization in a form which is machine interpret-able and inherits the characteristics of the Tensor Representation binding and unbinding based on the mutually orthogonality. Our approach has surpassed some of the existing basic works related to image captioning.

Traditionally, the way one evaluates the performance of an Artificial Intelligence (AI) system is via a comparison to human performance in specific tasks, treating humans as a reference for high-level cognition. However, these comparisons leave out important features of human intelligence: the capability to transfer knowledge and make complex decisions based on emotional and rational reasoning. These decisions are influenced by current inferences as well as prior experiences, making the decision process strongly subjective and apparently biased. In this context, a definition of compositional intelligence is necessary to incorporate these features in future AI tests. Here, a concrete implementation of this will be suggested, using recent developments in quantum cognition, natural language and compositional meaning of sentences, thanks to categorical compositional models of meaning.

Parametric entities appear in many contexts, be it in optimisation, control, modelling of random quantities, or uncertainty quantification. These are all fields where reduced order models (ROMs) have a place to alleviate the computational burden. Assuming that the parametric entity takes values in a linear space, we show how is is associated to a linear map or operator. This provides a general point of view on how to consider and analyse different representations of such entities. Analysis of the associated linear map in turn connects such representations with reproducing kernel Hilbert spaces and affine- / linear-representations in terms of tensor products. A generalised correlation operator is defined through the associated linear map, and its spectral analysis helps to shed light on the approximation properties of ROMs. This point of view thus unifies many such representations under a functional analytic roof, leading to a deeper understanding and making them available for appropriate analysis.

Differential privacy is a cryptographically-motivated approach to privacy that has become a very active field of research over the last decade in theoretical computer science and machine learning. In this paradigm one assumes there is a trusted curator who holds the data of individuals in a database and the goal of privacy is to simultaneously protect individual data while allowing the release of global characteristics of the database. In this setting we introduce a general framework for parametric inference with differential privacy guarantees. We first obtain differentially private estimators based on bounded influence M-estimators by leveraging their gross-error sensitivity in the calibration of a noise term added to them in order to ensure privacy. We then show how a similar construction can also be applied to construct differentially private test statistics analogous to the Wald, score and likelihood ratio tests. We provide statistical guarantees for all our proposals via an asymptotic analysis. An interesting consequence of our results is to further clarify the connection between differential privacy and robust statistics. In particular, we demonstrate that differential privacy is a weaker stability requirement than infinitesimal robustness, and show that robust M-estimators can be easily randomized in order to guarantee both differential privacy and robustness towards the presence of contaminated data. We illustrate our results both on simulated and real data.

For many NLP applications, such as question answering and summarisation, the goal is to select the best solution from a large space of candidates to meet a particular user’s needs. To address the lack of user-specific training data, we propose an interactive text ranking approach that actively selects pairs of candidates, from which the user selects the best. Unlike previous strategies, which attempt to learn a ranking across the whole candidate space, our method employs Bayesian optimisation to focus the user’s labelling effort on high quality candidates and integrates prior knowledge in a Bayesian manner to cope better with small data scenarios. We apply our method to community question answering (cQA) and extractive summarisation, finding that it significantly outperforms existing interactive approaches. We also show that the ranking function learned by our method is an effective reward function for reinforcement learning, which improves the state of the art for interactive summarisation.

Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. In this work, we present a Graph Convolutional Network (GCN) algorithm SWAG (Sample Weight and AGgregate), which combines efficient random walks and graph convolutions on weighted graphs to generate embeddings for nodes (items) that incorporate both graph structure as well as node feature information such as item-descriptions and item-images. The three important SWAG operations that enable us to efficiently generate node embeddings based on graph structures are (a) Sampling of graph to homogeneous structure, (b) Weighting the sampling, walks and convolution operations, and (c) using AGgregation functions for generating convolutions. The work is an adaptation of graphSAGE over weighted graphs. We deploy SWAG at Target and train it on a graph of more than 500K products sold online with over 50M edges. Offline and online evaluations reveal the benefit of using a graph-based approach and the benefits of weighing to produce high quality embeddings and product recommendations.

Neural networks have demonstrated unmatched performance in a range of classification tasks. Despite numerous efforts of the research community, novelty detection remains one of the significant limitations of neural networks. The ability to identify previously unseen inputs as novel is crucial for our understanding of the decisions made by neural networks. At runtime, inputs not falling into any of the categories learned during training cannot be classified correctly by the neural network. Existing approaches treat the neural network as a black box and try to detect novel inputs based on the confidence of the output predictions. However, neural networks are not trained to reduce their confidence for novel inputs, which limits the effectiveness of these approaches. We propose a framework to monitor a neural network by observing the hidden layers. We employ a common abstraction from program analysis – boxes – to identify novel behaviors in the monitored layers, i.e., inputs that cause behaviors outside the box. For each neuron, the boxes range over the values seen in training. The framework is efficient and flexible to achieve a desired trade-off between raising false warnings and detecting novel inputs. We illustrate the performance and the robustness to variability in the unknown classes on popular image-classification benchmarks.

We consider the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, we introduce natural notions of universality and we show a local equivalence among them. Our analysis is naturally expressed via information geometry, and represents a conceptually and computationally useful analysis. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-R\’enyi maximal correlation, the canonical correlation and principle component analyses of Hotelling and Pearson, Tishby’s information bottleneck, Wyner’s common information, Ky Fan -norms, and Brieman and Friedman’s alternating conditional expectations algorithm. We further illustrate how this framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and the associated neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.

Differentiable programming has emerged as a key programming paradigm empowering rapid developments of deep learning while its applications to important computational methods such as Monte Carlo remain largely unexplored. Here we present the general theory enabling infinite-order automatic differentiation on expectations computed by Monte Carlo with unnormalized probability distributions, which we call ‘automatic differentiable Monte Carlo’ (ADMC). By implementing ADMC algorithms on computational graphs, one can also leverage state-of-the-art machine learning frameworks and techniques to traditional Monte Carlo applications in statistics and physics. We illustrate the versatility of ADMC by showing some applications: fast search of phase transitions and accurately finding ground states of interacting many-body models in two dimensions. ADMC paves a promising way to innovate Monte Carlo in various aspects to achieve higher accuracy and efficiency, e.g. easing or solving the sign problem of quantum many-body models through ADMC.

Global optimization finds applications in a wide range of real world problems. The multi-start methods are a popular class of global optimization techniques, which are based on the ideas of conducting local searches at multiple starting points, and then sequentially determine the starting points according to some prescribed rules. In this work we propose a new multi-start algorithm where the starting points are determined in a Bayesian optimization framework. Specifically, the method can be understood as to construct a new function by conducting local searches of the original objective function, where the new function attains the same global optima as the original one. Bayesian optimization is then applied to find the global optima of the new local search based function.

The cost of drawing object bounding boxes (i.e. labeling) for millions of images is prohibitively high. For instance, labeling pedestrians in a regular urban image could take 35 seconds on average. Active learning aims to reduce the cost of labeling by selecting only those images that are informative to improve the detection network accuracy. In this paper, we propose a method to perform active learning of object detectors based on convolutional neural networks. We propose a new image-level scoring process to rank unlabeled images for their automatic selection, which clearly outperforms classical scores. The proposed method can be applied to videos and sets of still images. In the former case, temporal selection rules can complement our scoring process. As a relevant use case, we extensively study the performance of our method on the task of pedestrian detection. Overall, the experiments show that the proposed method performs better than random selection. Our codes are publicly available at www.gitlab.com/haghdam/deep_active_learning.

Deep neural networks have been successfully applied to many real-world applications. However, these successes rely heavily on large amounts of labeled data, which is expensive to obtain. Recently, Auto-Encoding Transformation (AET) and MixMatch have been proposed and achieved state-of-the-art results for unsupervised and semi-supervised learning, respectively. In this study, we train an Ensemble of Auto-Encoding Transformations (EnAET) to learn from both labeled and unlabeled data based on the embedded representations by decoding both spatial and non-spatial transformations. This distinguishes EnAET from conventional semi-supervised methods that focus on improving prediction consistency and confidence by different models on both unlabeled and labeled examples. In contrast, we propose to explore the role of self-supervised representations in semi-supervised learning under a rich family of transformations. Experiment results on CIFAR-10, CIFAR-100, SVHN and STL10 demonstrate that the proposed EnAET outperforms the state-of-the-art semi-supervised methods by significant margins. In particular, we apply the proposed method to extremely challenging scenarios with only 10 images per class, and show that EnAET can achieve an error rate of 9.35% on CIFAR-10 and 16.92% on SVHN. In addition, EnAET achieves the best result when compared with fully supervised learning using all labeled data with the same network architecture. The performance on CIFAR-10, CIFAR-100 and SVHN with a smaller network is even more competitive than the state-of-the-art of supervised learning methods based on a larger network. We also set a new performance record with an error rate of 1.99% on CIFAR-10 and 4.52% on STL10. The code and experiment records are released at https://…/EnAET.

The convolutional layers are core building blocks of neural network architectures. In general, a convolutional filter applies to the entire frequency spectrum of the input data. We explore artificially constraining the frequency spectra of these filters and data, called band-limiting, during training. The frequency domain constraints apply to both the feed-forward and back-propagation steps. Experimentally, we observe that Convolutional Neural Networks (CNNs) are resilient to this compression scheme and results suggest that CNNs learn to leverage lower-frequency components. In particular, we found: (1) band-limited training can effectively control the resource usage (GPU and memory); (2) models trained with band-limited layers retain high prediction accuracy; and (3) requires no modification to existing training algorithms or neural network architectures to use unlike other compression schemes.

We present new algorithms for computing and approximating bisimulation metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an elegant formalism that capture behavioral equivalence between states and provide strong theoretical guarantees on differences in optimal behaviour. Unfortunately, their computation is expensive and requires a tabular representation of the states, which has thus far rendered them impractical for large problems. In this paper we present a new version of the metric that is tied to a behavior policy in an MDP, along with an analysis of its theoretical properties. We then present two new algorithms for approximating bisimulation metrics in large, deterministic MDPs. The first does so via sampling and is guaranteed to converge to the true metric. The second is a differentiable loss which allows us to learn an approximation even for continuous state MDPs, which prior to this work had not been possible.

OneClass SVM is a popular method for unsupervised anomaly detection. As many other methods, it suffers from the \textit{black box} problem: it is difficult to justify, in an intuitive and simple manner, why the decision frontier is identifying data points as anomalous or non anomalous. Such type of problem is being widely addressed for supervised models. However, it is still an uncharted area for unsupervised learning. In this paper, we describe a method to infer rules that justify why a point is labelled as an anomaly, so as to obtain intuitive explanations for models created using the OneClass SVM algorithm. We evaluate our proposal with different datasets, including real-world data coming from industry. With this, our proposal contributes to extend Explainable AI techniques to unsupervised machine learning models.

Due to the recent advances on Neural Architecture Search (NAS), it gains popularity in designing best networks for specific tasks. Although it shows promising results on many benchmarks and competitions, NAS still suffers from its demanding computation cost for searching high dimensional architectural design space, and this problem becomes even worse when we want to use a large-scale dataset. If we can make a reliable data proxy for NAS, the efficiency of NAS approaches increase accordingly. Our basic observation for making a data proxy is that each example in a specific dataset has a different impact on NAS process and most of examples are redundant from a relative accuracy ranking perspective, which we should preserve when making a data proxy. We propose a systematic approach to measure the importance of each example from this relative accuracy ranking point of view, and make a reliable data proxy based on the statistics of training and testing examples. Our experiment shows that we can preserve the almost same relative accuracy ranking between all possible network configurations even with 10-20 smaller data proxy.

Entity extraction is an important task in text mining and natural language processing. A popular method for entity extraction is by comparing substrings from free text against a dictionary of entities. In this paper, we present several techniques as a post-processing step for improving the effectiveness of the existing entity extraction technique. These techniques utilise models trained with the web-scale corpora which makes our techniques robust and versatile. Experiments show that our techniques bring a notable improvement on efficiency and effectiveness.

Improvement of statistical learning models in order to increase efficiency in solving classification or regression problems is still a goal pursued by the scientific community. In this way, the support vector machine model is one of the most successful and powerful algorithms for those tasks. However, its performance depends directly from the choice of the kernel function and their hyperparameters. The traditional choice of them, actually, can be computationally expensive to do the kernel choice and the tuning processes. In this article, it is proposed a novel framework to deal with the kernel function selection called Random Machines. The results improved accuracy and reduced computational time. The data study was performed in simulated data and over 27 real benchmarking datasets.

Contraction Clustering (RASTER) is a very fast algorithm for density-based clustering, which requires only a single pass. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. Yet, RASTER is limited to batch processing. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step in the sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. It is therefore very well suited to real-world scenarios where clustering does not happen continually but only periodically. We describe the algorithm, including a discussion of implementation details.

Although deep neural networks are highly effective, their high computational and memory costs severely challenge their applications on portable devices. As a consequence, low-bit quantization, which converts a full-precision neural network into a low-bitwidth integer version, has been an active and promising research topic. Existing methods formulate the low-bit quantization of networks as an approximation or optimization problem. Approximation-based methods confront the gradient mismatch problem, while optimization-based methods are only suitable for quantizing weights and could introduce high computational cost in the training stage. In this paper, we propose a novel perspective of interpreting and implementing neural network quantization by formulating low-bit quantization as a differentiable non-linear function (termed quantization function). The proposed quantization function can be learned in a lossless and end-to-end manner and works for any weights and activations of neural networks in a simple and uniform way. Extensive experiments on image classification and object detection tasks show that our quantization networks outperform the state-of-the-art methods. We believe that the proposed method will shed new insights on the interpretation of neural network quantization. Our code is available at https://…/alibabacloud-quantization-networks.

We survey a burgeoning and promising new research area that considers the online nature of many practical fair division problems. We identify wide variety of such online fair division problems, as well as discuss new mechanisms and normative properties that apply to this online setting. The online nature of such fair division problems provides both opportunities and challenges such as the possibility to develop new online mechanisms as well as the difficulty of dealing with an uncertain future.

Machine and deep learning-based algorithms are the emerging approaches in addressing prediction problems in time series. These techniques have been shown to produce more accurate results than conventional regression-based modeling. It has been reported that artificial Recurrent Neural Networks (RNN) with memory, such as Long Short-Term Memory (LSTM), are superior compared to Autoregressive Integrated Moving Average (ARIMA) with a large margin. The LSTM-based models incorporate additional ‘gates’ for the purpose of memorizing longer sequences of input data. The major question is that whether the gates incorporated in the LSTM architecture already offers a good prediction and whether additional training of data would be necessary to further improve the prediction. Bidirectional LSTMs (BiLSTMs) enable additional training by traversing the input data twice (i.e., 1) left-to-right, and 2) right-to-left). The research question of interest is then whether BiLSTM, with additional training capability, outperforms regular unidirectional LSTM. This paper reports a behavioral analysis and comparison of BiLSTM and LSTM models. The objective is to explore to what extend additional layers of training of data would be beneficial to tune the involved parameters. The results show that additional training of data and thus BiLSTM-based modeling offers better predictions than regular LSTM-based models. More specifically, it was observed that BiLSTM models provide better predictions compared to ARIMA and LSTM models. It was also observed that BiLSTM models reach the equilibrium much slower than LSTM-based models.

Approaches to continual learning aim to successfully learn a set of related tasks that arrive in an online manner. Recently, several frameworks have been developed which enable deep learning to be deployed in this learning scenario. A key modelling decision is to what extent the architecture should be shared across tasks. On the one hand, separately modelling each task avoids catastrophic forgetting but it does not support transfer learning and leads to large models. On the other hand, rigidly specifying a shared component and a task-specific part enables task transfer and limits the model size, but it is vulnerable to catastrophic forgetting and restricts the form of task-transfer that can occur. Ideally, the network should adaptively identify which parts of the network to share in a data driven way. Here we introduce such an approach called Continual Learning with Adaptive Weights (CLAW), which is based on probabilistic modelling and variational inference. Experiments show that CLAW achieves state-of-the-art performance on six benchmarks in terms of overall continual learning performance, as measured by classification accuracy, and in terms of addressing catastrophic forgetting.

Deep neural networks (DNNs) can easily fit a random labeling of the training data with zero training error. What is the difference between DNNs trained with random labels and the ones trained with true labels? Our paper answers this question with two contributions. First, we study the memorization properties of DNNs. Our empirical experiments shed light on how DNNs prioritize the learning of simple input patterns. In the second part, we propose to measure the similarity between what different DNNs have learned and memorized. With the proposed approach, we analyze and compare DNNs trained on data with true labels and random labels. The analysis shows that DNNs have \textit{One way to Learn} and \textit{N ways to Memorize}. We also use gradient information to gain an understanding of the analysis results.

Multiscale, agent-based mathematical models of biological systems are often associated with model uncertainty and sensitivity to parameter perturbations. Here, three uncertainty and sensitivity analyses methods, that are suitable to use when working with agent-based models, are discussed. These methods are namely Consistency Analysis, Robustness Analysis and Latin Hypercube Analysis. This introductory review discusses origins, conventions, implementation and result interpretation of the aforementioned methods. Information on how to implement the discussed methods in MATLAB is included.

Healthcare data continues to flourish yet a relatively small portion, mostly structured, is being utilized effectively for predicting clinical outcomes. The rich subjective information available in unstructured clinical notes can possibly facilitate higher discrimination but tends to be under-utilized in mortality prediction. This work attempts to assess the gain in performance when multiple notes that have been minimally preprocessed are used as an input for prediction. A hierarchical architecture consisting of both convolutional and recurrent layers is used to concurrently model the different notes compiled in an individual hospital stay. This approach is evaluated on predicting in-hospital mortality on the MIMIC-III dataset. On comparison to approaches utilizing structured data, it achieved higher metrics despite requiring less cleaning and preprocessing. This demonstrates the potential of unstructured data in enhancing mortality prediction and signifies the need to incorporate more raw unstructured data into current clinical prediction methods.

In this paper we target the problem of transferring policies across multiple environments with different dynamics parameters and motor noise variations, by introducing a framework that decouples the processes of policy learning and system identification. Efficiently transferring learned policies to an unknown environment with changes in dynamics configurations in the presence of motor noise is very important for operating robots in the real world, and our work is a novel attempt in that direction. We introduce MANGA: Method Agnostic Neural-policy Generalization and Adaptation, that trains dynamics conditioned policies and efficiently learns to estimate the dynamics parameters of the environment given off-policy state-transition rollouts in the environment. Our scheme is agnostic to the type of training method used – both reinforcement learning (RL) and imitation learning (IL) strategies can be used. We demonstrate the effectiveness of our approach by experimenting with four different MuJoCo agents and comparing against previously proposed transfer baselines.

This paper addresses the problem of unsupervised clustering which remains one of the most fundamental challenges in machine learning and artificial intelligence. We propose the clustered generator model for clustering which contains both continuous and discrete latent variables. Discrete latent variables model the cluster label while the continuous ones model variations within each cluster. The learning of the model proceeds in a unified probabilistic framework and incorporates the unsupervised clustering as an inner step without the need for an extra inference model as in existing variational-based models. The latent variables learned serve as both observed data embedding or latent representation for data distribution. Our experiments show that the proposed model can achieve competitive unsupervised clustering accuracy and can learn disentangled latent representations to generate realistic samples. In addition, the model can be naturally extended to per-pixel unsupervised clustering which remains largely unexplored.

We propose a new nonlinear factorization model for graphs that are with topological structures, and optionally, node attributes. This model is based on a pseudometric called Gromov-Wasserstein (GW) discrepancy, which compares graphs in a relational way. It estimates observed graphs as GW barycenters constructed by a set of atoms with different weights. By minimizing the GW discrepancy between each observed graph and its GW barycenter-based estimation, we learn the atoms and their weights associated with the observed graphs. The model achieves a novel and flexible factorization mechanism under GW discrepancy, in which both the observed graphs and the learnable atoms can be unaligned and with different sizes. We design an effective approximate algorithm for learning this Gromov-Wasserstein factorization (GWF) model, unrolling loopy computations as stacked modules and computing gradients with backpropagation. The stacked modules can be with two different architectures, which correspond to the proximal point algorithm (PPA) and Bregman alternating direction method of multipliers (BADMM), respectively. Experiments show that our model obtains encouraging results on clustering graphs.

Let be the lowest distance to which a -symbol distribution can be estimated from batches of samples each, when up to batches may be adversarial. For , Qiao and Valiant (2017) showed that and requires batches. For , they provided a and order-optimal algorithm that runs in time exponential in . For , we propose an algorithm with comparably optimal and , but run-time polynomial in and all other parameters.

Graph representation learning is to learn universal node representations that preserve both node attributes and structural information. The derived node representations can be used to serve various downstream tasks, such as node classification and node clustering. When a graph is heterogeneous, the problem becomes more challenging than the homogeneous graph node learning problem. Inspired by the emerging information theoretic-based learning algorithm, in this paper we propose an unsupervised graph neural network Heterogeneous Deep Graph Infomax (HDGI) for heterogeneous graph representation learning. We use the meta-path structure to analyze the connections involving semantics in heterogeneous graphs and utilize graph convolution module and semantic-level attention mechanism to capture local representations. By maximizing local-global mutual information, HDGI effectively learns high-level node representations that can be utilized in downstream graph-related tasks. Experiment results show that HDGI remarkably outperforms state-of-the-art unsupervised graph representation learning methods on both classification and clustering tasks. By feeding the learned representations into a parametric model, such as logistic regression, we even achieve comparable performance in node classification tasks when comparing with state-of-the-art supervised end-to-end GNN models.

We study the problem of learning permutation invariant representations that can capture ‘flexible’ notions of containment. We formalize this problem via a measure theoretic definition of multisets, and obtain a theoretically-motivated learning model. We propose training this model on a novel task: predicting the size of the symmetric difference (or intersection) between pairs of multisets. We demonstrate that our model not only performs very well on predicting containment relations (and more effectively predicts the sizes of symmetric differences and intersections than DeepSets-based approaches with unconstrained object representations), but that it also learns meaningful representations.

We present a novel machine learning architecture for classification suggested by experiments on the insect olfactory system. The network separates odors via a winnerless competition network, then classifies objects by projection into a high dimensional space where a support vector machine provides more precision in classification. We build this network using biophysical models of neurons with our results showing high discrimination among inputs and exceptional robustness to noise. The same circuitry accurately identifies the amplitudes of mixtures of the odors on which it has been trained.

Anomaly detection and localization is a popular computer vision problem involving detecting anomalous images and localizing anomalies within them. However, this task is challenging due to the small sample size and pixel coverage of the anomaly in real-world scenarios. Prior works need to use anomalous training images to compute a threshold to detect and localize anomalies. To remove this need, we propose Convolutional Adversarial Variational autoencoder with Guided Attention (CAVGA), which localizes the anomaly with a convolutional latent variable to preserve the spatial information. In the unsupervised setting, we propose an attention expansion loss, where we encourage CAVGA to focus on all normal regions in the image without using any anomalous training image. Furthermore, using only 2% anomalous images in the weakly supervised setting we propose a complementary guided attention loss, where we encourage the normal attention to focus on all normal regions while minimizing the regions covered by the anomalous attention in the normal image. CAVGA outperforms the state-of-the-art (SOTA) anomaly detection methods on the MNIST, CIFAR-10, Fashion-MNIST, MVTec Anomaly Detection (MVTAD), and modified ShanghaiTech Campus (mSTC) datasets. CAVGA also outperforms the SOTA anomaly localization methods on the MVTAD and mSTC datasets.

Although deep learning has been applied to successfully address many data mining problems, relatively limited work has been done on deep learning for anomaly detection. Existing deep anomaly detection methods, which focus on learning new feature representations to enable downstream anomaly detection methods, perform indirect optimization of anomaly scores, leading to data-inefficient learning and suboptimal anomaly scoring. Also, they are typically designed as unsupervised learning due to the lack of large-scale labeled anomaly data. As a result, they are difficult to leverage prior knowledge (e.g., a few labeled anomalies) when such information is available as in many real-world anomaly detection applications. This paper introduces a novel anomaly detection framework and its instantiation to address these problems. Instead of representation learning, our method fulfills an end-to-end learning of anomaly scores by a neural deviation learning, in which we leverage a few (e.g., multiple to dozens) labeled anomalies and a prior probability to enforce statistically significant deviations of the anomaly scores of anomalies from that of normal data objects in the upper tail. Extensive results show that our method can be trained substantially more data-efficiently and achieves significantly better anomaly scoring than state-of-the-art competing methods.

A statistical model is a mathematical representation of an often simplified or idealised data-generating process. In this paper, we focus on a particular type of statistical model, called linear mixed models (LMMs), that is widely used in many disciplines e.g.~agriculture, ecology, econometrics, psychology. Mixed models, also commonly known as multi-level, nested, hierarchical or panel data models, incorporate a combination of fixed and random effects, with LMMs being a special case. The inclusion of random effects in particular gives LMMs considerable flexibility in accounting for many types of complex correlated structures often found in data. This flexibility, however, has given rise to a number of ways by which an end-user can specify the precise form of the LMM that they wish to fit in statistical software. In this paper, we review the software design for specification of the LMM (and its special case, the linear model), focusing in particular on the use of high-level symbolic model formulae and two popular but contrasting R-packages in lme4 and asreml.

Deep neural networks (DNNs) are incredibly brittle due to adversarial examples. To robustify DNNs, adversarial training was proposed, which requires large-scale but well-labeled data. However, it is quite expensive to annotate large-scale data well. To compensate for this shortage, several seminal works are utilizing large-scale unlabeled data. In this paper, we observe that seminal works do not perform well, since the quality of pseudo labels on unlabeled data is quite poor, especially when the amount of unlabeled data is significantly larger than that of labeled data. We believe that the quality of pseudo labels is the bottleneck of adversarial learning with unlabeled data. To tackle this bottleneck, we leverage deep co-training, which trains two deep networks and encourages two networks diverged by exploiting peer’s adversarial examples. Based on deep co-training, we propose robust co-training (RCT) for adversarial learning with unlabeled data. We conduct comprehensive experiments on CIFAR-10 and SVHN datasets. Empirical results demonstrate that our RCT can significantly outperform baselines (e.g., robust self-training (RST)) in both standard test accuracy and robust test accuracy w.r.t. different datasets, different network structures, and different types of adversarial training.

Balancing exploration and exploitation is a fundamental part of reinforcement learning, yet most state-of-the-art algorithms use a naive exploration protocol like -greedy. This contributes to the problem of high sample complexity, as the algorithm wastes effort by repeatedly visiting parts of the state space that have already been explored. We introduce a novel method based on Bayesian linear regression and latent space embedding to generate an intrinsic reward signal that encourages the learning agent to seek out unexplored parts of the state space. This method is computationally efficient, simple to implement, and can extend any state-of-the-art reinforcement learning algorithm. We evaluate the method on a range of algorithms and challenging control tasks, on both simulated and physical robots, demonstrating how the proposed method can significantly improve sample complexity.

We propose a novel graph-driven generative model, that unifies multiple heterogeneous learning tasks into the same framework. The proposed model is based on the fact that heterogeneous learning tasks, which correspond to different generative processes, often rely on data with a shared graph structure. Accordingly, our model combines a graph convolutional network (GCN) with multiple variational autoencoders, thus embedding the nodes of the graph i.e., samples for the tasks) in a uniform manner while specializing their organization and usage to different tasks. With a focus on healthcare applications (tasks), including clinical topic modeling, procedure recommendation and admission-type prediction, we demonstrate that our method successfully leverages information across different tasks, boosting performance in all tasks and outperforming existing state-of-the-art approaches.

Deep neural networks enjoy a powerful representation and have proven effective in a number of applications. However, recent advances show that deep neural networks are vulnerable to adversarial attacks incurred by the so-called adversarial examples. Although the adversarial example is only slightly different from the input sample, the neural network classifies it as the wrong class. In order to alleviate this problem, we propose the Deep Minimax Probability Machine (DeepMPM), which applies MPM to deep neural networks in an end-to-end fashion. In a worst-case scenario, MPM tries to minimize an upper bound of misclassification probabilities, considering the global information (i.e., mean and covariance information of each class). DeepMPM can be more robust since it learns the worst-case bound on the probability of misclassification of future data. Experiments on two real-world datasets can achieve comparable classification performance with CNN, while can be more robust on adversarial attacks.

We focus on a class of real-world domains, where gathering hierarchical knowledge is required to accomplish a task. Many problems can be represented in this manner, such as network penetration testing, targeted advertising or medical diagnosis. In our formalization, the task is to sequentially request pieces of information about a sample to build the knowledge hierarchy and terminate when suitable. Any of the learned pieces of information can be further analyzed, resulting in a complex and variable action space. We present a combination of techniques in which the knowledge hierarchy is explicitly represented and given to a deep reinforcement learning algorithm as its input. To process the hierarchical input, we employ Hierarchical Multiple-Instance Learning and to cope with the complex action space, we factor it with hierarchical softmax. Our end-to-end differentiable model is trained with A2C, a standard deep reinforcement learning algorithm. We demonstrate the method in a set of seven classification domains, where the task is to achieve the best accuracy with a set budget on the amount of information retrieved. Compared to baseline algorithms, our method achieves not only better results, but also better generalization.

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top- sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top- operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top- operator. Finally, we exploit the property of gradient distribution to propose an approximate top- selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Codes are available at: \url{https://…/GaussianK-SGD}.

It is essential for the advancement of science that scientists and researchers share, reuse and reproduce workflows and protocols used by others. The FAIR principles are a set of guidelines that aim to maximize the value and usefulness of research data, and emphasize a number of important points regarding the means by which digital objects are found and reused by others. The question of how to apply these principles not just to the static input and output data but also to the dynamic workflows and protocols that consume and produce them is still under debate and poses a number of challenges. In this paper we describe our inclusive and overarching approach to apply the FAIR principles to workflows and protocols and demonstrate its benefits. We apply and evaluate our approach on a case study that consists of making the PREDICT workflow, a highly cited drug repurposing workflow, open and FAIR. This includes FAIRification of the involved datasets, as well as applying semantic technologies to represent and store data about the detailed versions of the general protocol, of the concrete workflow instructions, and of their execution traces. A semantic model was proposed to better address these specific requirements and were evaluated by answering competency questions. This semantic model consists of classes and relations from a number of existing ontologies, including Workflow4ever, PROV, EDAM, and BPMN. This allowed us then to formulate and answer new kinds of competency questions. Our evaluation shows the high degree to which our FAIRified OpenPREDICT workflow now adheres to the FAIR principles and the practicality and usefulness of being able to answer our new competency questions.

Towards a future where machine learning systems will integrate into every aspect of people’s lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking/finance applications use state-of-the-art machine learning techniques to predict certain aspects of new instances. Tree ensembles, like random forests, are widely acceptable solutions on these tasks, while at the same time they are avoided due to their black-box uninterpretable nature, creating an unreasonable paradox. In this paper, we provide a sequence of actions for shedding light on the predictions of the misjudged family of tree ensemble algorithms. Using classic unsupervised learning techniques and an enhanced similarity metric, to wander among transparent trees inside a forest following breadcrumbs, the interpretable essence of tree ensembles arises. An explanation provided by these systems using our approach, which we call ‘LionForests’, can be a simple, comprehensive rule.

Good quality explanations of artificial intelligence (XAI) reasoning must be written (and evaluated) for an explanatory purpose, targeted towards their readers, have a good narrative and causal structure, and highlight where uncertainty and data quality affect the AI output. I discuss these challenges from a Natural Language Generation (NLG) perspective, and highlight four specific NLG for XAI research challenges.

Graph neural network (GNN) is a deep model for graph representation learning. One advantage of graph neural network is its ability to incorporate node features into the learning process. However, this prevents graph neural network from being applied into featureless graphs. In this paper, we first analyze the effects of node features on the performance of graph neural network. We show that GNNs work well if there is a strong correlation between node features and node labels. Based on these results, we propose new feature initialization methods that allows to apply graph neural network to non-attributed graphs. Our experimental results show that the artificial features are highly competitive with real features.

When a black-box optimization objective can only be evaluated with costly or noisy measurements, most standard optimization algorithms are unsuited to find the optimal solution. Specialized algorithms that deal with exactly this situation make use of surrogate models. These models are usually continuous and smooth, which is beneficial for continuous optimization problems, but not necessarily for combinatorial problems. However, by choosing the basis functions of the surrogate model in a certain way, we show that it can be guaranteed that the optimal solution of the surrogate model is integer. This approach outperforms random search, simulated annealing and one Bayesian optimization algorithm on the problem of finding robust routes for a noise-perturbed traveling salesman benchmark problem, with similar performance as another Bayesian optimization algorithm, and outperforms all compared algorithms on a convex binary optimization problem with a large number of variables.

We consider a zero-shot semantic parsing task: parsing instructions into compositional logical forms, in domains that were not seen during training. We present a new dataset with 1,390 examples from 7 application domains (e.g. a calendar or a file manager), each example consisting of a triplet: (a) the application’s initial state, (b) an instruction, to be carried out in the context of that state, and (c) the state of the application after carrying out the instruction. We introduce a new training algorithm that aims to train a semantic parser on examples from a set of source domains, so that it can effectively parse instructions from an unknown target domain. We integrate our algorithm into the floating parser of Pasupat and Liang (2015), and further augment the parser with features and a logical form candidate filtering logic, to support zero-shot adaptation. Our experiments with various zero-shot adaptation setups demonstrate substantial performance gains over a non-adapted parser.

A new statistical procedure, based on a modified spline basis, is proposed to identify the linear components in the panel data model with fixed effects. Under some mild assumptions, the proposed procedure is shown to consistently estimate the underlying regression function, correctly select the linear components, and effectively conduct the statistical inference. When compared to existing methods for detection of linearity in the panel model, our approach is demonstrated to be theoretically justified as well as practically convenient. We provide a computational algorithm that implements the proposed procedure along with a path-based solution method for linearity detection, which avoids the burden of selecting the tuning parameter for the penalty term. Monte Carlo simulations are conducted to examine the finite sample performance of our proposed procedure with detailed findings that confirm our theoretical results in the paper. Applications to Aggregate Production and Environmental Kuznets Curve data also illustrate the necessity for detecting linearity in the partially linear panel model.

Detecting frequent items is a fundamental problem in data streaming research. However, in many applications, besides the frequent items themselves, meta data such as the timestamps of when the frequent items appeared or other application-specific data that ‘arrives’ with the frequent items needs to be reported too. To this end, we introduce the Neighborhood Detection problem in graph streams, which both accurately models situations such as those stated above, and addresses the fundamental problem of detecting large neighborhoods or stars in graph streams. In Neighborhood Detection, an algorithm receives the edges of a bipartite graph with and in arbitrary order and is given a threshold parameter . Provided that there is at least one -node of degree at least , the objective is to output a node together with at least of its neighbors, where is the approximation factor. We show that in insertion-only streams, there is a one-pass space -approximation streaming algorithm, for integral values of . We complement this result with a lower bound, showing that computing a -approximation requires space , for any integral , which renders our algorithm optimal for a large range of settings (up to logarithmic factors). In insertion-deletion (turnstile) streams, we give a one-pass -approximation algorithm with space (if ). We also prove that this is best possible up to logarithmic factors. Our lower bounds are obtained by defining new multi-party and two-party communication problems, respectively, and proving lower bounds on their communication complexities using information theoretic arguments.

Parameterized complexity attempts to give a more fine-grained analysis of the complexity of problems: instead of measuring the running time as a function of only the input size, we analyze the running time with respect to additional parameters. This approach has proven to be highly successful in delineating our understanding of \NP-hard problems. Given this success with the TIME resource, it seems but natural to use this approach for dealing with the SPACE resource. First attempts in this direction have considered a few individual problems, with some success: Fafianie and Kratsch [MFCS’14] and Chitnis et al. [SODA’15] introduced the notions of streaming kernels and parameterized streaming algorithms respectively. For example, the latter shows how to refine the bit lower bound for finding a minimum Vertex Cover (VC) in the streaming setting by designing an algorithm for the parameterized -VC problem which uses bits. In this paper, we initiate a systematic study of graph problems from the paradigm of parameterized streaming algorithms. We first define a natural hierarchy of space complexity classes of FPS, SubPS, SemiPS, SupPS and BrutePS, and then obtain tight classifications for several well-studied graph problems such as Longest Path, Feedback Vertex Set, Dominating Set, Girth, Treewidth, etc. into this hierarchy. (see paper for full abstract)

This paper considers the multi-group multicast beamforming optimization problem, for which the optimal solution has been unknown due to its non-convex and NP-hard nature. By utilizing the successive convex approximation numerical method and Lagrangian duality, we obtain the optimal multicast beamforming solution in a semi-closed form for both the quality-of-service (QoS) problem and the max-min fair (MMF) problem. From the optimal beamforming structure obtained, we show that the notion of uplink-downlink duality can be generalized to the multicast beamforming problem. The optimal multicast beamformer is a weighted MMSE filter based on a group-channel direction — a generalized version of the optimal downlink multi-user unicast beamformer. We also show that there is an inherent low-dimensional structure in the optimal beamforming solution independent of the number of transmit antennas, leading to efficient numerical algorithm design, especially for systems with large antenna arrays. We propose efficient algorithms to compute the multicast beamformer based on the optimal beamforming structure. We characterize the asymptotic behavior of the beamformers through asymptotic analysis, and in turn, provide simple closed-form approximate multicast beamformers for both the QoS and MMF problems. The approximation offers practical multicast beamforming solutions with a near-optimal performance at very low computational complexity for large-scale antenna systems.

Representation learning on a knowledge graph (KG) is to embed entities and relations of a KG into low-dimensional continuous vector spaces. Early KG embedding methods only pay attention to structured information encoded in triples, which would cause limited performance due to the structure sparseness of KGs. Some recent attempts consider paths information to expand the structure of KGs but lack explainability in the process of obtaining the path representations. In this paper, we propose a novel Rule and Path-based Joint Embedding (RPJE) scheme, which takes full advantage of the explainability and accuracy of logic rules, the generalization of KG embedding as well as the supplementary semantic structure of paths. Specifically, logic rules of different lengths (the number of relations in rule body) in the form of Horn clauses are first mined from the KG and elaborately encoded for representation learning. Then, the rules of length 2 are applied to compose paths accurately while the rules of length 1 are explicitly employed to create semantic associations among relations and constrain relation embeddings. Besides, the confidence level of each rule is also considered in optimization to guarantee the availability of applying the rule to representation learning. Extensive experimental results illustrate that RPJE outperforms other state-of-the-art baselines on KG completion task, which also demonstrate the superiority of utilizing logic rules as well as paths for improving the accuracy and explainability of representation learning.

We address the efficiency issue for the construction of a deep graph neural network (GNN). The approach exploits the idea of representing each input graph as a fixed point of a dynamical system (implemented through a recurrent neural network), and leverages a deep architectural organization of the recurrent units. Efficiency is gained by many aspects, including the use of small and very sparse networks, where the weights of the recurrent units are left untrained under the stability condition introduced in this work. This can be viewed as a way to study the intrinsic power of the architecture of a deep GNN, and also to provide insights for the set-up of more complex fully-trained models. Through experimental results, we show that even without training of the recurrent connections, the architecture of small deep GNN is surprisingly able to achieve or improve the state-of-the-art performance on a significant set of tasks in the field of graphs classification.

This paper presents the application of a newly developed nature-inspired metaheuristic optimization method, namely the Adaptive Wind Driven Optimization (AWDO), to the training of feedforward artificial neural networks (NN) and presents a discussion into the future research of AWDO implementation in Deep Learning (DL). Application example of digit classification with MNIST dataset reveals interesting behavior of the derivative-free AWDO method compared to steepest descent method where results and future work on the implementation of AWDO in deep neural networks are discussed.

The transfer learning toolkit wraps the codes of 17 transfer learning models and provides integrated interfaces, allowing users to use those models by calling a simple function. It is easy for primary researchers to use this toolkit and to choose proper models for real-world applications. The toolkit is written in Python and distributed under MIT open source license. In this paper, the current state of this toolkit is described and the necessary environment setting and usage are introduced.

Representing networks in a low dimensional latent space is a crucial task with many interesting applications in graph learning problems, such as link prediction and node classification. A widely applied network representation learning paradigm is based on the combination of random walks for sampling context nodes and the traditional \textit{Skip-Gram} model to capture center-context node relationships. In this paper, we emphasize on exponential family distributions to capture rich interaction patterns between nodes in random walk sequences. We introduce the generic \textit{exponential family graph embedding} model, that generalizes random walk-based network representation learning techniques to exponential family conditional distributions. We study three particular instances of this model, analyzing their properties and showing their relationship to existing unsupervised learning models. Our experimental evaluation on real-world datasets demonstrates that the proposed techniques outperform well-known baseline methods in two downstream machine learning tasks.

Clustering methods group a set of data points into a few coherent groups or clusters of similar data points. As an example, consider clustering pixels in an image (or video) if they belong to the same object. Different clustering methods are obtained by using different notions of similarity and different representations of data points.

Knowledge Graph (KG) embedding has attracted more attention in recent years. Most of KG embedding models learn from time-unaware triples. However, the inclusion of temporal information beside triples would further improve the performance of a KGE model. In this regard, we propose ATiSE, a temporal KG embedding model which incorporates time information into entity/relation representations by using Additive Time Series decomposition. Moreover, considering the temporal uncertainty during the evolution of entity/relation representations over time, we map the representations of temporal KGs into the space of multi-dimensional Gaussian distributions. The mean of each entity/relation embedding at a time step shows the current expected position, whereas its covariance (which is temporally stationary) represents its temporal uncertainty. Experimental results show that ATiSE not only achieves the state-of-the-art on link prediction over temporal KGs, but also can predict the occurrence time of facts with missing time annotations, as well as the existence of future events. To the best of our knowledge, no other model is capable to perform all these tasks.

Agreement protocols have been typically deployed at small scale, e.g., using three to five machines. This is because these protocols seem to suffer from a sharp performance decay. More specifically, as the size of a deployment—i.e., degree of replication—increases, the protocol performance greatly decreases. There is not much experimental evidence for this decay in practice, however, notably for larger system sizes, e.g., beyond a handful of machines. In this paper we execute agreement protocols on up to 100 machines and observe on their performance decay. We consider well-known agreement protocols part of mature systems, such as Apache ZooKeeper, etcd, and BFT-Smart, as well as a chain and a novel ring-based agreement protocol which we implement ourselves. We provide empirical evidence that current agreement protocols execute gracefully on 100 machines. We observe that throughput decay is initially sharp (consistent with previous observations); but intriguingly—as each system grows beyond a few tens of replicas—the decay dampens. For chain- and ring-based replication, this decay is slower than for the other systems. The positive takeaway from our evaluation is that mature agreement protocol implementations can sustain out-of-the-box 300 to 500 requests per second when executing on 100 replicas on a wide-area public cloud platform. Chain- and ring-based replication can reach between 4K and 11K (up to 20x improvements) depending on the fault assumptions.

With the ever-increasing reliance on data for data-driven applications in power grids, such as event cause analysis, the authenticity of data streams has become crucially important. The data can be prone to adversarial stealthy attacks aiming to manipulate the data such that residual-based bad data detectors cannot detect them, and the perception of system operators or event classifiers changes about the actual event. This paper investigates the impact of adversarial attacks on convolutional neural network-based event cause analysis frameworks. We have successfully verified the ability of adversaries to maliciously misclassify events through stealthy data manipulations. The vulnerability assessment is studied with respect to the number of compromised measurements. Furthermore, a defense mechanism to robustify the performance of the event cause analysis is proposed. The effectiveness of adversarial attacks on changing the output of the framework is studied using the data generated by real-time digital simulator (RTDS) under different scenarios such as type of attacks and level of access to data.

Outlier based Robust Principal Component Analysis (RPCA) requires centering of the non-outliers. We show a ‘bias trick’ that automatically centers these non-outliers. Using this bias trick we obtain the first RPCA algorithm that is optimal with respect to centering.

Despite advancements in vehicle security systems, over the last decade, auto-theft rates have increased, and cyber-security attacks on internet-connected and autonomous vehicles are becoming a new threat. In this paper, a deep learning model is proposed, which can identify drivers from their driving behaviors based on vehicle telematics data. The proposed Long-Short-Term-Memory (LSTM) model predicts the identity of the driver based on the individual’s unique driving patterns learned from the vehicle telematics data. Given the telematics is time-series data, the problem is formulated as a time series prediction task to exploit the embedded sequential information. The performance of the proposed approach is evaluated on three naturalistic driving datasets, which gives high accuracy prediction results. The robustness of the model on noisy and anomalous data that is usually caused by sensor defects or environmental factors is also investigated. Results show that the proposed model prediction accuracy remains satisfactory and outperforms the other approaches despite the extent of anomalies and noise-induced in the data.

The current Deep Learning (DL) landscape is fast-paced and is rife with non-uniform models, hardware/software (HW/SW) stacks, but lacks a DL benchmarking platform to facilitate evaluation and comparison of DL innovations, be it models, frameworks, libraries, or hardware. Due to the lack of a benchmarking platform, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone – stifling the adoption of the innovations. In this work, we first identify design features which are desirable within a DL benchmarking platform. These features include: performing the evaluation in a consistent, reproducible, and scalable manner, being framework and hardware agnostic, supporting real-world benchmarking workloads, providing in-depth model execution inspection across the HW/SW stack levels, etc. We then propose MLModelScope, a DL benchmarking platform design that realizes the objectives. MLModelScope proposes a specification to define DL model evaluations and techniques to provision the evaluation workflow using the user-specified HW/SW stack. MLModelScope defines abstractions for frameworks and supports board range of DL models and evaluation scenarios. We implement MLModelScope as an open-source project with support for all major frameworks and hardware architectures. Through MLModelScope’s evaluation and automated analysis workflows, we performed case-study analyses of models across systems and show how model, hardware, and framework selection affects model accuracy and performance under different benchmarking scenarios. We further demonstrated how MLModelScope’s tracing capability gives a holistic view of model execution and helps pinpoint bottlenecks.

We study high-dimensional sparse estimation tasks in a robust setting where a constant fraction of the dataset is adversarially corrupted. Specifically, we focus on the fundamental problems of robust sparse mean estimation and robust sparse PCA. We give the first practically viable robust estimators for these problems. In more detail, our algorithms are sample and computationally efficient and achieve near-optimal robustness guarantees. In contrast to prior provable algorithms which relied on the ellipsoid method, our algorithms use spectral techniques to iteratively remove outliers from the dataset. Our experimental evaluation on synthetic data shows that our algorithms are scalable and significantly outperform a range of previous approaches, nearly matching the best error rate without corruptions.

Capsule network is the most recent exciting advancement in the deep learning field and represents positional information by stacking features into vectors. The dynamic routing algorithm is used in the capsule network, however, there are some disadvantages such as the inability to stack multiple layers and a large amount of computation. In this paper, we propose an adaptive routing algorithm that can solve the problems mentioned above. First, the low-layer capsules adaptively adjust their direction and length in the routing algorithm and removing the influence of the coupling coefficient on the gradient propagation, so that the network can work when stacked in multiple layers. Then, the iterative process of routing is simplified to reduce the amount of computation and we introduce the gradient coefficient . Further, we tested the performance of our proposed adaptive routing algorithm on CIFAR10, Fashion-MNIST, SVHN and MNIST, while achieving better results than the dynamic routing algorithm.

Recently the Generative Adversarial Network has become a hot topic. Considering the application of GAN in multi-user environment, we propose Distributed-GAN. It enables multiple users to train with their own data locally and generates more diverse samples. Users don’t need to share data with each other to avoid the leakage of privacy. In recent years, commercial companies have launched cloud platforms based on artificial intelligence to provide model for users who lack computing power. We hope our work can inspire these companies to provide more powerful AI services.

Motivated by duality in traditional signal processing, we investigate the concept of duality for graph Fourier transforms. Given two graphs, we define their dualness, a measure of how close the graphs are to being (signal processing) duals of each other. We show that this definition satisfies some desirable properties, and develop an algorithm based on co-ordinate descent and perfect matching to compute the dualness when the graphs have distinct eigen values.

Active learning typically focuses on training a model on few labeled examples alone, while unlabeled ones are only used for acquisition. In this work we depart from this setting by using both labeled and unlabeled data during model training across active learning cycles. We do so by using unsupervised feature learning at the beginning of the active learning pipeline and semi-supervised learning at every active learning cycle, on all available data. The former has not been investigated before in active learning, while the study of latter in the context of deep learning is scarce and recent findings are not conclusive with respect to its benefit. Our idea is orthogonal to acquisition strategies by using more data, much like ensemble methods use more models. By systematically evaluating on a number of popular acquisition strategies and datasets, we find that the use of unlabeled data during model training brings a surprising accuracy improvement in image classification, compared to the differences between acquisition strategies. We thus explore smaller label budgets, even one label per class.

Recent advances in deep learning theory have evoked the study of generalizability across different local minima of deep neural networks (DNNs). While current work focused on either discovering properties of good local minima or developing regularization techniques to induce good local minima, no approach exists that can tackle both problems. We achieve these two goals successfully in a unified manner. Specifically, based on the Fisher information we propose a metric both strongly indicative of generalizability of local minima and effectively applied as a practical regularizer. We provide theoretical analysis including a generalization bound and empirically demonstrate the success of our approach in both capturing and improving the generalizability of DNNs. Experiments are performed on CIFAR-10 and CIFAR-100 for various network architectures.

This paper presents a novel partially distributed outer approximation algorithm, named PaDOA, for solving a class of structured mixed integer convex programming (MICP) problems to global optimality. The proposed scheme uses an iterative outer approximation method for coupled mixed integer optimization problems with separable convex objective functions, affine coupling constraints, and compact domain. PaDOA proceeds by alternating between solving large-scale structured mixed-integer linear programming problems and partially decoupled mixed-integer nonlinear programming subproblems that comprise much fewer integer variables. We establish conditions under which PaDOA converges to global minimizers after a finite number of iterations and verify these properties with an application to thermostatically controlled loads.

This paper focuses on the problem of query by example spoken term detection (QbE-STD) in zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual bottleneck features, and show that multilingual features perform increasingly better with more training languages. Previously, it has been shown that the DTW based matching can be replaced with a CNN based matching while using posterior features. Here, we show that the CNN based matching outperforms DTW based matching using bottleneck features as well. In this case, the feature extraction and pattern matching stages of our QbE-STD system are optimized independently of each other. We propose to integrate these two stages in a fully neural network based end-to-end learning framework to enable joint optimization of those two stages simultaneously. The proposed approaches are evaluated on two challenging multilingual datasets: Spoken Web Search 2013 and Query by Example Search on Speech Task 2014, demonstrating in each case significant improvements.

In this work, we focus on the problem of entity alignment in Knowledge Graphs (KG) and we report on our experiences when applying a Graph Convolutional Network (GCN) based model for this task. Variants of GCN are used in multiple state-of-the-art approaches and therefore it is important to understand the specifics and limitations of GCN-based models. Despite serious efforts, we were not able to fully reproduce the results from the original paper and after a thorough audit of the code provided by authors, we concluded, that their implementation is different from the architecture described in the paper. In addition, several tricks are required to make the model work and some of them are not very intuitive. We provide an extensive ablation study to quantify the effects these tricks and changes of architecture have on final performance. Furthermore, we examine current evaluation approaches and systematize available benchmark datasets. We believe that people interested in KG matching might profit from our work, as well as novices entering the field

Three important issues are often encountered in Supervised and Semi-Supervised Classification: class-memberships are unreliable for some training units (label noise), a proportion of observations might depart from the main structure of the data (outliers) and new groups in the test set may have not been encountered earlier in the learning phase (unobserved classes). The present work introduces a robust and adaptive Discriminant Analysis rule, capable of handling situations in which one or more of the afore-mentioned problems occur. Two EM-based classifiers are proposed: the first one that jointly exploits the training and test sets (transductive approach), and the second one that expands the parameter estimate using the test set, to complete the group structure learned from the training set (inductive approach). Experiments on synthetic and real data, artificially adulterated, are provided to underline the benefits of the proposed method.