The ability to rank models by its real strength is the key to Neural Architecture Search. Traditional approaches adopt an incomplete training for such purpose which is still very costly. One-shot methods are thus devised to cut the expense by reusing the same set of weights. However, it is uncertain whether shared weights are truly effective. It is also unclear if a picked model is better because of its vigorous representational power or simply because it is overtrained. In order to remove the suspicion, we propose a novel idea called Fair Neural Architecture Search (FairNAS), in which a strict fairness constraint is enforced for fair inheritance and training. In this way, our supernet exhibits nice convergence and very high training accuracy. The performance of any sampled model loaded with shared weights from the supernet strongly correlates with that of stand-alone counterpart when trained fully. This result dramatically improves the searching efficiency, with a multi-objective reinforced evolutionary search backend, our pipeline generated a new set of state-of-the-art architectures on ImageNet: FairNAS-A attains 75.34% top-1 validation accuracy on ImageNet, FairNAS-B 75.10%, FairNAS-C 74.69%, even with lower multi-adds and/or fewer number of parameters compared with others. The models and their evaluation code are made publicly available online http://…/FairNAS.

Dynamical systems are capable of performing computation in a reservoir computing paradigm. This paper presents a general representation of these systems as an artificial neural network (ANN). Initially, we implement the simplest dynamical system, a cellular automaton. The mathematical fundamentals behind an ANN are maintained, but the weights of the connections and the activation function are adjusted to work as an update rule in the context of cellular automata. The advantages of such implementation are its usage on specialized and optimized deep learning libraries, the capabilities to generalize it to other types of networks and the possibility to evolve cellular automata and other dynamical systems in terms of connectivity, update and learning rules. Our implementation of cellular automata constitutes an initial step towards a general framework for dynamical systems. It aims to evolve such systems to optimize their usage in reservoir computing and to model physical computing substrates.

As the availability and the inter-connectivity of RDF datasets grow, so does the necessity to understand the structure of the data. Understanding the topology of RDF graphs can guide and inform the development of, e.g. synthetic dataset generators, sampling methods, index structures, or query optimizers. In this work, we propose two resources: (i) a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and (ii) results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework. We present a preliminary analysis based on the proposed resources and point out implications for synthetic dataset generators. Finally, we identify a set of measures, that can be used to characterize graphs in the Semantic Web.

The ability to design complex neural network architectures which enable effective training by stochastic gradient descent has been the key for many achievements in the field of deep learning. However, developing such architectures remains a challenging and resourceintensive process full of trial-and-error iterations. All in all, the relation between the network topology and its ability to model the data remains poorly understood. We propose to encode neural networks with a differentiable variant of Cartesian Genetic Programming (dCGPANN) and present a memetic algorithm for architecture design: local searches with gradient descent learn the network parameters while evolutionary operators act on the dCGPANN genes shaping the network architecture towards faster learning. Studying a particular instance of such a learning scheme, we are able to improve the starting feed forward topology by learning how to rewire and prune links, adapt activation functions and introduce skip connections for chosen regression tasks. The evolved network architectures require less space for network parameters and reach, given the same amount of time, a significantly lower error on average.

Nonlinear dimensionality reduction methods are a popular tool for data scientists and researchers to visualize complex, high dimensional data. However, while these methods continue to improve and grow in number, it is often difficult to evaluate the quality of a visualization due to a variety of factors such as lack of information about the intrinsic dimension of the data and additional tuning required for many evaluation metrics. In this paper, we seek to provide a systematic comparison of dimensionality reduction quality metrics using datasets where we know the ground truth manifold. We utilize each metric for hyperparameter optimization in popular dimensionality reduction methods used for visualization and provide quantitative metrics to objectively compare visualizations to their original manifold. In our results, we find a few methods that appear to consistently do well and propose the best performer as a benchmark for evaluating dimensionality reduction based visualizations.

The focus of this paper is on intrinsic methods to detect overfitting. These rely only on the model and the training data, as opposed to traditional extrinsic methods that rely on performance on a test set or on bounds from model complexity. We propose a family of intrinsic methods called Counterfactual Simulation (CFS) which analyze the flow of training examples through the model by identifying and perturbing rare patterns. By applying CFS to logic circuits we get a method that has no hyper-parameters and works uniformly across different types of models such as neural networks, random forests and lookup tables. Experimentally, CFS can separate models with different levels of overfit using only their logic circuit representations without any access to the high level structure. By comparing lookup tables, neural networks, and random forests using CFS, we get insight into why neural networks generalize. In particular, we find that stochastic gradient descent in neural nets does not lead to ‘brute force’ memorization, but finds common patterns (whether we train with actual or randomized labels), and neural networks are not unlike forests in this regard. Finally, we identify a limitation with our proposal that makes it unsuitable in an adversarial setting, but points the way to future work on robust intrinsic methods.

Even though it is well known that for most relevant computational problems different algorithms may perform better on different classes of problem instances, most computational experiments still focus on determining a single best algorithm configuration based on aggregate results such as the average. In this paper, we propose Integer Programming based approaches to build decision trees for the Algorithm Selection Problem. These techniques allow to automatically: (i) find the most important problem features to determine problem classes; (ii) group the problems into classes and (iii) select the best algorithm configuration for each class. To evaluate this new approach, extensive computational experiments were executed using the linear programming algorithms implemented in the COIN-OR Branch & Cut solver in a comprehensive set of instances, including all MIPLIB benchmark instances. The results exceeded our initial expectations. While the single best parameter setting discovered decreased the total running time by 22%, our approach decreased the total running time by 40% in average in 10-fold cross validation experiments. These results indicate that our method generalizes quite well and does not overfit.

The growing prosperity of social networks has brought great challenges to the sentimental tendency mining of users. As more and more researchers pay attention to the sentimental tendency of online users, rich research results have been obtained based on the sentiment classification of explicit texts. However, research on the implicit sentiment of users is still in its infancy. Aiming at the difficulty of implicit sentiment classification, a research on implicit sentiment classification model based on deep neural network is carried out. Classification models based on DNN, LSTM, Bi-LSTM and CNN were established to judge the tendency of the user’s implicit sentiment text. Based on the Bi-LSTM model, the classification model of word-level attention mechanism is studied. The experimental results on the public dataset show that the established LSTM series classification model and CNN classification model can achieve good sentiment classification effect, and the classification effect is significantly better than the DNN model. The Bi-LSTM based attention mechanism classification model obtained the optimal R value in the positive category identification.

Learning to use tools to solve a variety of tasks is an innate ability of humans and has been observed of animals in the wild. However, the underlying mechanisms that are required to learn to use tools are abstract and widely contested in the literature. In this paper, we study tool use in the context of reinforcement learning and propose a framework for analyzing generalization inspired by a classic study of tool using behavior, the trap-tube task. Recently, it has become common in reinforcement learning to measure generalization performance on a single test set of environments. We instead propose transfers that produce multiple test sets that are used to measure specified types of generalization, inspired by abilities demonstrated by animal and human tool users. The source code to reproduce our experiments is publicly available at https://…/gym_tool_use.

We introduce a method for learning landmark detectors from unlabelled video frames and unpaired labels. This allows us to learn a detector from a large collection of raw videos given only a few example annotations harvested from existing data or motion capture. We achieve this by formulating the landmark detection task as one of image translation, learning to map an image of the object to an image of its landmarks, represented as a skeleton. The advantage is that this translation problem can then be tackled by CycleGAN. However, we show that a naive application of CycleGAN confounds appearance and pose information, with suboptimal keypoint detection performance. We solve this problem by introducing an analytical and differentiable renderer for the skeleton image so that no appearance information can be leaked in the skeleton. Then, since cycle consistency requires to reconstruct the input image from the skeleton, we supply the appearance information thus removed by conditioning the generator with a second image of the same object (e.g. another frame from a video). Furthermore, while CycleGAN uses two cycle consistency constraints, we show that the second one is detrimental in this application and we discard it, significantly simplifying the model. We show that these modifications improve the quality of the learned detector leading to state-of-the-art unsupervised landmark detection performance in a number of challenging human pose and facial landmark detection benchmarks.

With the increase in the amount of data in many fields, a method to consistently and efficiently decipher relationships within high dimensional data sets is important. Because many modern datasets are high-dimensional, univariate independence tests are not applicable. While many multivariate independence tests have R packages available, the interfaces are inconsistent, most are not available in Python. mgcpy is an extensive Python library that includes many state of the art high-dimensional independence testing procedures using a common interface. The package is easy-to-use and is flexible enough to enable future extensions. This manuscript provides details for each of the tests as well as extensive power and run-time benchmarks on a suite of high-dimensional simulations previously used in different publications. The appendix includes demonstrations of how the user can interact with the package, as well as links and documentation.

Quick detection of common changes is critical in sequential monitoring of multi-stream data where a common change is referred as a change that only occurs in a portion of panels. After a common change is detected by using a combined CUSUM-SR procedure, we first study the joint distribution for values of the CUSUM process and the estimated delay detection time for the unchanged panels. The BH method by using the asymptotic exponential property for the CUSUM process is developed to isolate the changed panels with the control on FDR. The common change point is then estimated based on the isolated changed panels. Simulation results show that the proposed method can also control the FNR by properly selecting FDR.

Pinterest is a popular Web application that has over 250 million active users. It is a visual discovery engine for finding ideas for recipes, fashion, weddings, home decoration, and much more. In the last year, the company adopted Semantic Web technologies to create a knowledge graph that aims to represent the vast amount of content and users on Pinterest, to help both content recommendation and ads targeting. In this paper, we present the engineering of an OWL ontology—the Pinterest Taxonomy—that forms the core of Pinterest’s knowledge graph, the Pinterest Taste Graph. We describe modeling choices and enhancements to WebProt\’eg\’e that we used for the creation of the ontology. In two months, eight Pinterest engineers, without prior experience of OWL and WebProt\’eg\’e, revamped an existing taxonomy of noisy terms into an OWL ontology. We share our experience and present the key aspects of our work that we believe will be useful for others working in this area.

Multi-label charge prediction is a task to predict the corresponding accusations for legal cases, and recently becomes a hot topic. However, current studies use rough methods to deal with the label number. These methods manually set parameters to select label numbers, which has an effect in final prediction quality. We propose an external knowledge enhanced multi-label charge prediction approach that has two phases. One is charge label prediction phase with external knowledge from law provisions, the other one is number learning phase with a number learning network (NLN) designed. Our approach enhanced by external knowledge can automatically adjust the threshold to get label number of law cases. It combines the output probabilities of samples and their corresponding label numbers to get final prediction results. In experiments, our approach is connected to some state of-the art deep learning models. By testing on the biggest published Chinese law dataset, we find that our approach has improvements on these models. We future conduct experiments on multi-label samples from the dataset. In items of macro-F1, the improvement of baselines with our approach is 3%-5%; In items of micro-F1, the significant improvement of our approach is 5%-15%. The experiment results show the effectiveness our approach for multi-label charge prediction.

Knowledge distillation (KD) is a technique to derive optimal performance from a small student network (SN) by distilling knowledge of a large teacher network (TN) and transferring the distilled knowledge to the small SN. Since a role of convolutional neural network (CNN) in KD is to embed a dataset so as to perform a given task well, it is very important to acquire knowledge that considers intra-data relations. Conventional KD methods have concentrated on distilling knowledge in data units. To our knowledge, any KD methods for distilling information in dataset units have not yet been proposed. Therefore, this paper proposes a novel method that enables distillation of dataset-based knowledge from the TN using an attention network. The knowledge of the embedding procedure of the TN is distilled to graph by multi-head attention (MHA), and multi-task learning is performed to give relational inductive bias to the SN. The MHA can provide clear information about the source dataset, which can greatly improves the performance of the SN. Experimental results show that the proposed method is 7.05% higher than the SN alone for CIFAR100, which is 2.46% higher than the state-of-the-art.

Argumentation mining is a rising subject in the computational linguistics domain focusing on extracting structured arguments from natural text, often from unstructured or noisy text. The initial approaches on modeling arguments was aiming to identify a flawless argument on specific fields (Law, Scientific Papers) serving specific needs (completeness, effectiveness). With the emerge of Web 2.0 and the explosion in the use of social media both the diffusion of the data and the argument structure have changed. In this survey article, we bridge the gap between theoretical approaches of argumentation mining and pragmatic schemes that satisfy the needs of social media generated data, recognizing the need for adapting more flexible and expandable schemes, capable to adjust to the argumentation conditions that exist in social media. We review, compare, and classify existing approaches, techniques and tools, identifying the positive outcome of combining tasks and features, and eventually propose a conceptual architecture framework. The proposed theoretical framework is an argumentation mining scheme able to identify the distinct sub-tasks and capture the needs of social media text, revealing the need for adopting more flexible and extensible frameworks.

Feature construction can substantially improve the accuracy of Machine Learning (ML) algorithms. Genetic Programming (GP) has been proven to be effective at this task by evolving non-linear combinations of input features. GP additionally has the potential to improve ML explainability since explicit expressions are evolved. Yet, in most GP works the complexity of evolved features is not explicitly bound or minimized though this is arguably key for explainability. In this article, we assess to what extent GP still performs favorably at feature construction when constructing features that are (1) Of small-enough number, to enable visualization of the behavior of the ML model; (2) Of small-enough size, to enable interpretability of the features themselves; (3) Of sufficient informative power, to retain or even improve the performance of the ML algorithm. We consider a simple feature construction scheme using three different GP algorithms, as well as random search, to evolve features for four ML algorithms, including support vector machines and random forest. Our results on 20 datasets pertaining to classification and regression problems show that constructing only two compact features can be sufficient to rival the use of the entire original feature set. We further find that a modern GP algorithm, GP-GOMEA, performs best overall. These results, combined with examples that we provide of readable constructed features and of 2D visualizations of ML behavior, lead us to positively conclude that GP-based feature construction still works well when explicitly searching for compact features, making it extremely helpful to explain ML models.

We present a phenomenon-oriented comparative analysis of the two dominant approaches in task-independent semantic parsing: classic, knowledge-intensive and neural, data-intensive models. To reflect state-of-the-art neural NLP technologies, we introduce a new target structure-centric parser that can produce semantic graphs much more accurately than previous data-driven parsers. We then show that, in spite of comparable performance overall, knowledge- and data-intensive models produce different types of errors, in a way that can be explained by their theoretical properties. This analysis leads to new directions for parser development.

This paper introduces \texttt{infotheory}: a package that implements multivariate information theoretic measures for discrete and continuous data. This package implements widely used measures such as entropy and mutual information, as well as more recent measures that arise from multivariate extensions to information theory, specifically Partial Information Decomposition. It provides an easy-to-use and flexible tool for performing information theoretic analyses on any multivariate dataset consisting of discrete or continuous data.

Canonical Correlation Analysis (CCA) is a classic technique for multi-view data analysis. To overcome the deficiency of linear correlation in practical multi-view learning tasks, various CCA variants were proposed to capture nonlinear dependency. However, it is non-trivial to have an in-principle understanding of these variants due to their inherent restrictive assumption on the data and latent code distributions. Although some works have studied probabilistic interpretation for CCA, these models still require the explicit form of the distributions to achieve a tractable solution for the inference. In this work, we study probabilistic interpretation for CCA based on implicit distributions. We present Conditional Mutual Information (CMI) as a new criterion for CCA to consider both linear and nonlinear dependency for arbitrarily distributed data. To eliminate direct estimation for CMI, in which explicit form of the distributions is still required, we derive an objective which can provide an estimation for CMI with efficient inference methods. To facilitate Bayesian inference of multi-view analysis, we propose Adversarial CCA (ACCA), which achieves consistent encoding for multi-view data with the consistent constraint imposed on the marginalization of the implicit posteriors. Such a model would achieve superiority in the alignment of the multi-view data with implicit distributions. It is interesting to note that most of the existing CCA variants can be connected with our proposed CCA model by assigning specific form for the posterior and likelihood distributions. Extensive experiments on nonlinear correlation analysis and cross-view generation on benchmark and real-world datasets demonstrate the superiority of our model.

Graphical models are commonly used to represent conditional dependence relationships between variables. There are multiple methods available for exploring them from high-dimensional data, but almost all of them rely on the assumption that the observations are independent and identically distributed. At the same time, observations connected by a network are becoming increasingly common, and tend to violate these assumptions. Here we develop a Gaussian graphical model for observations connected by a network with potentially different mean vectors, varying smoothly over the network. We propose an efficient estimation algorithm and demonstrate its effectiveness on both simulated and real data, obtaining meaningful interpretable results on a statistics coauthorship network. We also prove that our method estimates both the inverse covariance matrix and the corresponding graph structure correctly under the assumption of network ‘cohesion’, which refers to the empirically observed phenomenon of network neighbors sharing similar traits.

We design a new algorithm for the Euclidean -means problem that operates in the local model of differential privacy. Unlike in the non-private literature, differentially private algorithms for the -means incur both additive and multiplicative errors. Our algorithm significantly reduces the additive error while keeping the multiplicative error the same as in previous state-of-the-art results. Specifically, on a database of size , our algorithm guarantees multiplicative error and additive error for an arbitrarily small constant , whereas all previous algorithms in the local model on had additive error . We give a simple lower bound showing that additive error of is necessary for -means algorithms in the local model (at least for algorithms with a constant number of interaction rounds, which is the setting we consider in this paper).

Exploring the effects a chemical compound has on a species takes a considerable experimental effort. Appropriate methods for estimating and suggesting new effects can dramatically reduce the work needed to be done by a laboratory. In this paper we explore the suitability of using a knowledge graph embedding approach for ecotoxicological effect prediction. A knowledge graph has been constructed from publicly available data sets, including a species taxonomy and chemical classification and similarity. The publicly available effect data is integrated to the knowledge graph using ontology alignment techniques. Our experimental results show that the knowledge graph based approach improves the selected baselines.

The introduction of deep learning and transfer learning techniques in fields such as computer vision allowed a leap forward in the accuracy of image classification tasks. Currently there is only limited use of such techniques in neuroscience. The challenge of using deep learning methods to successfully train models in neuroscience, lies in the complexity of the information that is processed, the availability of data and the cost of producing sufficient high quality annotations. Inspired by its application in computer vision, we introduce transfer learning on electrophysiological data to enable training a model with limited amounts of data. Our method was tested on the dataset of the BCI competition IV 2a and compared to the top results that were obtained using traditional machine learning techniques. Using our DL model we outperform the top result of the competition by 33%. We also explore transferability of knowledge between trained models over different experiments, called inter-experimental transfer learning. This reduces the amount of required data even further and is especially useful when few subjects are available. This method is able to outperform the standard deep learning methods used in the BCI competition IV 2b approaches by 18%. In this project we propose a method that can produce reliable electroencephalography (EEG) signal classification, based on modest amounts of training data through the use of transfer learning.

Novelty is an important factor of creativity in product design. Acceptance of novelty, however, depends on one’s emotions. Yanagisawa, the last author, and his colleagues previously developed a mathematical model of emotional dimensions associated with novelty such as arousal (surprise) and valence. The model formalized arousal as Bayesian information gain and valence as a function of arousal based on Berlyne’s arousal potential theory. One becomes accustomed to novelty by repeated exposure. This so-called habituation to novelty is important in the design of long-term product experience. We herein propose a mathematical model of habituation to novelty based on the emotional dimension model. We formalized the habituation as a decrement in information gain from a novel event through Bayesian update. We derived the information gained from the repeated exposure of a novel stimulus as a function of three parameters: initial prediction error, initial uncertainty, and noise of sensory stimulus. With the proposed model, we discovered an interaction effect of the initial prediction error and initial uncertainty on habituation. Furthermore, we demonstrate that a range of positive emotions on prediction errors shift toward becoming more novel by repeated exposure.

The recent success of single-agent reinforcement learning (RL) encourages the exploration of multi-agent reinforcement learning (MARL), which is more challenging due to the interactions among different agents. In this paper, we consider a voting-based MARL problem, in which the agents vote to make group decisions and the goal is to maximize the globally averaged returns. To this end, we formulate the MARL problem based on the linear programming form of the policy optimization problem and propose a distributed primal-dual algorithm to obtain the optimal solution. We also propose a voting mechanism through which the distributed learning achieves the same sub-linear convergence rate as centralized learning. In other words, the distributed decision making does not slow down the global consensus to optimal. We also verify the convergence of our proposed algorithm with numerical simulations and conduct case studies in practical multi-agent systems.

We revisit the notion of individual fairness proposed by Dwork et al. A central challenge in operationalizing their approach is the difficulty in eliciting a human specification of a similarity metric. In this paper, we propose an operationalization of individual fairness that does not rely on a human specification of a distance metric. Instead, we propose novel approaches to elicit and leverage side-information on equally deserving individuals to counter subordination between social groups. We model this knowledge as a fairness graph, and learn a unified Pairwise Fair Representation(PFR) of the data that captures both data-driven similarity between individuals and the pairwise side-information in fairness graph. We elicit fairness judgments from a variety of sources, including humans judgments for two real-world datasets on recidivism prediction (COMPAS) and violent neighborhood prediction (Crime & Communities). Our experiments show that the PFR model for operationalizing individual fairness is practically viable.

A common approach for knowledge-base entity search is to consider an entity as a document with multiple fields. Models that focus on matching query terms in different fields are popular choices for searching such entity representations. An instance of such a model is FSDM (Fielded Sequential Dependence Model). We propose to integrate field-level semantic features into FSDM. We use FSDM to retrieve a pool of documents, and then to use semantic field-level features to re-rank those documents. We propose to represent queries as bags of terms as well as bags of entities, and eventually, use their dense vector representation to compute semantic features based on query document similarity. Our proposed re-ranking approach achieves significant improvement in entity retrieval on the DBpedia-Entity (v2) dataset over existing FSDM model. Specifically, for all queries we achieve 2.5% and 1.2% significant improvement in NDCG@10 and NDCG@100, respectively.

In this article we describe our experiences with computational text analysis. We hope to achieve three primary goals. First, we aim to shed light on thorny issues not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of best practices for working with thick social and cultural concepts. Our guidance is based on our own experiences and is therefore inherently imperfect. Still, given our diversity of disciplinary backgrounds and research practices, we hope to capture a range of ideas and identify commonalities that will resonate for many. And this leads to our final goal: to help promote interdisciplinary collaborations. Interdisciplinary insights and partnerships are essential for realizing the full potential of any computational text analysis that involves social and cultural concepts, and the more we are able to bridge these divides, the more fruitful we believe our work will be.

Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair share of resources are a poor fit because of ML workloads’ unique attributes. ML jobs are typically long running, have coarse grained tasks that need to be gang-scheduled, and their performance is sensitive to tasks’ relative placement. These properties cannot be captured by existing fair sharing schemes. We propose Themis, a new scheduling framework for ML training workloads. It’s GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term but compensating for finish-time fairness in the long term. Our evaluation on a number of machine learning models shows that Themis can ensure greater fairness while providing more efficient allocations compared to state-of-the-art schedulers.

We propose and investigate feasibility of a novel task that consists in finding e-sports talent using multimodal Twitch chat and video stream data. In that, we focus on predicting the ranks of Counter-Strike: Global Offensive (CS:GO) gamers who broadcast their games on Twitch. During January 2019-April 2019, we have built two Twitch stream collections: One for 425 publicly ranked CS:GO gamers and one for 9,928 unranked CS:GO gamers. We extract neural features from video, audio and text chat data and estimate modality-specific probabilities for a gamer to be top-ranked during the data collection time-frame. A hierarchical Bayesian model is then used to pool the evidence across modalities and generate estimates of intrinsic skill for each gamer. Our modeling is validated through correlating the intrinsic skill predictions with May 2019 ranks of the publicly profiled gamers.

Algorithms now permeate multiple aspects of human lives and multiple recent results have reported that these algorithms may have biases pertaining to gender, race, and other demographic characteristics. The metrics used to quantify such biases have still focused on a static notion of algorithms. However, algorithms evolve over time. For instance, Tay (a conversational bot launched by Microsoft) was arguably not biased at its launch but quickly became biased, sexist, and racist over time. We suggest a set of intuitive metrics to study the variations in biases over time and present the results for a case study for genders represented in images resulting from a Twitter image search for #Nurse and #Doctor over a period of 21 days. Results indicate that biases vary significantly over time and the direction of bias could appear to be different on different days. Hence, one-shot measurements may not suffice for understanding algorithmic bias, thus motivating further work on studying biases in algorithms over time.

Multi-view learning (MVL) is a strategy for fusing data from different sources or subsets. Canonical correlation analysis (CCA) is very important in MVL, whose main idea is to map data from different views onto a common space with the maximum correlation. The traditional CCA can only be used to calculate the linear correlation between two views. Moreover, it is unsupervised, and the label information is wasted in supervised learning tasks. Many nonlinear, supervised, or generalized extensions have been proposed to overcome these limitations. However, to our knowledge, there is no up-to-date overview of these approaches. This paper fills this gap, by providing a comprehensive overview of many classical and latest CCA approaches, and describing their typical applications in pattern recognition, multi-modal retrieval and classification, and multi-view embedding.

Many Machine Learning algorithms, such as deep neural networks, have long been criticized for being ‘black-boxes’-a kind of models unable to provide how it arrive at a decision without further efforts to interpret. This problem has raised concerns on model applications’ trust, safety, nondiscrimination, and other ethical issues. In this paper, we discuss the machine learning interpretability of a real-world application, eXtreme Multi-label Learning (XML), which involves learning models from annotated data with many pre-defined labels. We propose a two-step XML approach that combines deep non-negative autoencoder with other multi-label classifiers to tackle different data applications with a large number of labels. Our experimental result shows that the proposed approach is able to cope with many-label problems as well as to provide interpretable label hierarchies and dependencies that helps us understand how the model recognizes the existences of objects in an image.

The naturalistic driving data are employed to study the accelerating behavior of the driver. Firstly, the question that whether the database is big enough to achieve a convergent accelerating behavior of the driver is studied. The kernel density estimation is applied to estimate the distributions of the accelerations. The Kullback-Liebler divergence is employed to evaluate the distinction between datasets composed of different quantity of data. The results show that a convergent accelerating behavior of the driver can be obtained by using the database in this study. Secondly, the bivariate accelerating behavior is proposed. It is shown that the bivariate distribution between longitudinal acceleration and lateral acceleration follows the dual triangle distribution pattern. Two bivariate distribution models are proposed to explain this phenomenon, i.e. the bivariate Normal distribution model (BNDM) and the bivariate Pareto distribution model (BPDM). The univariate acceleration behavior is presented to examine which model is better. It is identified that the marginal distribution and conditional distribution of the accelerations approximately follow the univariate Pareto distribution. Hence, the BPDM is a more appropriate one to describe the bivariate accelerating behavior of the driver. This reveals that the bivariate distribution pattern will never reach a circle-shaped region.

In this paper, we propose a new capsule network architecture called Attention Routing CapsuleNet (AR CapsNet). We replace the dynamic routing and squash activation function of the capsule network with dynamic routing (CapsuleNet) with the attention routing and capsule activation. The attention routing is a routing between capsules through an attention module. The attention routing is a fast forward-pass while keeping spatial information. On the other hand, the intuitive interpretation of the dynamic routing is finding a centroid of the prediction capsules. Thus, the squash activation function and its variant focus on preserving a vector orientation. However, the capsule activation focuses on performing a capsule-scale activation function. We evaluate our proposed model on the MNIST, affNIST, and CIFAR-10 classification tasks. The proposed model achieves higher accuracy with fewer parameters (x0.41 parameters in the MNIST, x0.45 parameters in the CIFAR-10) and less training time than CapsuleNet (x0.40 training time in the MNIST, x0.30 training time in the CIFAR-10). These results validate that designing a capsule-scale operation is a key factor to implement the capsule concept. Also, our experiment shows that our proposed model is transformation equivariant as CapsuleNet. As we perturb each element of the output capsule, the decoder attached to the output capsules shows global variations. Further experiments show that the difference in the capsule features caused by applying affine transformations on an input image is significantly aligned in one direction.

We present methods for multi-task learning that take advantage of natural groupings of related tasks. Task groups may be defined along known properties of the tasks, such as task domain or language. Such task groups represent supervised information at the inter-task level and can be encoded into the model. We investigate two variants of neural network architectures that accomplish this, learning different feature spaces at the levels of individual tasks, task groups, as well as the universe of all tasks: (1) parallel architectures encode each input simultaneously into feature spaces at different levels; (2) serial architectures encode each input successively into feature spaces at different levels in the task hierarchy. We demonstrate the methods on natural language understanding (NLU) tasks, where a grouping of tasks into different task domains leads to improved performance on ATIS, Snips, and a large inhouse dataset.

In recent years, an active field of research has developed around automated machine learning (AutoML). Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.

Motivation: Elastic net regression is a form of penalized regression that lies between ridge and least absolute shrinkage and selection operator (LASSO) regression. The elastic net penalty is a powerful tool controlling the impact of correlated predictors and the overall complexity of generalized linear regression models. The elastic net penalty has two tuning parameters: for the complexity and for the compromise between LASSO and ridge. The R package glmnet provides efficient tools for fitting elastic net models and selecting for a given However, glmnet does not simultaneously search the space for the optional elastic net model. Results: We built the R package ensr, elastic net searcher. enser extends the functionality of glment to search the space and identify an optimal pair. Availability: ensr is available from the Comprehensive R Archive Network at https://…/package=ensr

Fast and invariant feature extraction is crucial in certain computer vision applications where the computation time is constrained in both training and testing phases of the classifier. In this paper, we propose a nature-inspired dimensionality reduction technique for fast and invariant visual feature extraction. The human brain can exchange the spatial and spectral resolution to reconstruct missing colors in visual perception. The phenomenon is widely used in the printing industry to reduce the number of colors used to print, through a technique, called color dithering. In this work, we adopt a fast error-diffusion color dithering algorithm to reduce the spectral resolution and extract salient features by employing novel Hessian matrix analysis technique, which is then described by a spatial-chromatic histogram. The computation time, descriptor dimensionality and classification performance of the proposed feature are assessed under drastic variances in orientation, viewing angle and illumination of objects comparing with several different state-of-the-art handcrafted and deep-learned features. Extensive experiments on two publicly available object datasets, coil-100 and ALOI carried on both a desktop PC and a Raspberry Pi device show multiple advantages of using the proposed approach, such as the lower computation time, high robustness, and comparable classification accuracy under weakly supervised environment. Further, it showed the capability of operating solely inside a conventional SoC device utilizing a small fraction of the available hardware resources.

We study the problem of semantic matching in product search, that is, given a customer query, retrieve all semantically related products from the catalog. Pure lexical matching via an inverted index falls short in this respect due to several factors: a) lack of understanding of hypernyms, synonyms, and antonyms, b) fragility to morphological variants (e.g. ‘woman’ vs. ‘women’), and c) sensitivity to spelling errors. To address these issues, we train a deep learning model for semantic matching using customer behavior data. Much of the recent work on large-scale semantic search using deep learning focuses on ranking for web search. In contrast, semantic matching for product search presents several novel challenges, which we elucidate in this paper. We address these challenges by a) developing a new loss function that has an inbuilt threshold to differentiate between random negative examples, impressed but not purchased examples, and positive examples (purchased items), b) using average pooling in conjunction with n-grams to capture short-range linguistic patterns, c) using hashing to handle out of vocabulary tokens, and d) using a model parallel training architecture to scale across 8 GPUs. We present compelling offline results that demonstrate at least 4.7% improvement in Recall@100 and 14.5% improvement in mean average precision (MAP) over baseline state-of-the-art semantic search methods using the same tokenization method. Moreover, we present results and discuss learnings from online A/B tests which demonstrate the efficacy of our method.

Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these kinds of observation spaces present a number of challenges in practice, since the policy must now solve two problems: a representation learning problem, and a task learning problem. In this paper, we aim to explicitly learn representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC learns a compact latent representation space using a stochastic sequential latent variable model, and then learns a critic model within this latent space. By learning a critic within a compact state space, SLAC can learn much more efficiently than standard RL methods. The proposed model improves performance substantially over alternative representations as well, such as variational autoencoders. In fact, our experimental evaluation demonstrates that the sample efficiency of our resulting method is comparable to that of model-based RL methods that directly use a similar type of model for control. Furthermore, our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks.

Can we reduce the search cost of Neural Architecture Search (NAS) from days down to only few hours? NAS methods automate the design of Convolutional Networks (ConvNets) under hardware constraints and they have emerged as key components of AutoML frameworks. However, the NAS problem remains challenging due to the combinatorially large design space and the significant search time (at least 200 GPU-hours). In this work, we alleviate the NAS search cost down to less than 3 hours, while achieving state-of-the-art image classification results under mobile latency constraints. We propose a novel differentiable NAS formulation, namely Single-Path NAS, that uses one single-path over-parameterized ConvNet to encode all architectural decisions based on shared convolutional kernel parameters, hence drastically decreasing the search overhead. Single-Path NAS achieves state-of-the-art top-1 ImageNet accuracy (75.62%), hence outperforming existing mobile NAS methods in similar latency settings (~80ms). In particular, we enhance the accuracy-runtime trade-off in differentiable NAS by treating the Squeeze-and-Excitation path as a fully searchable operation with our novel single-path encoding. Our method has an overall cost of only 8 epochs (24 TPU-hours), which is up to 5,000x faster compared to prior work. Moreover, we study how different NAS formulation choices affect the performance of the designed ConvNets. Furthermore, we exploit the efficiency of our method to answer an interesting question: instead of empirically tuning the hyperparameters of the NAS solver (as in prior work), can we automatically find the hyperparameter values that yield the desired accuracy-runtime trade-off? We open-source our entire codebase at: https://…/single-path-nas.

Given a set X of finite strings, one interesting question to ask is whether there exists a member of X which is simple conditional to all other members of X. Conditional simplicity is measured by low conditional Kolmogorov complexity. We prove the affirmative to this question for sets that have low mutual information with the halting sequence. There are two results with respect to this question. One is dependent on the maximum conditional complexity between two elements of X, the other is dependent on the maximum expected value of the conditional complexity of a member of X relative to each member of X.

A major obstacle to the development of Natural Language Processing (NLP) methods in the biomedical domain is data accessibility. This problem can be addressed by generating medical data artificially. Most previous studies have focused on the generation of short clinical text, and evaluation of the data utility has been limited. We propose a generic methodology to guide the generation of clinical text with key phrases. We use the artificial data as additional training data in two key biomedical NLP tasks: text classification and temporal relation extraction. We show that artificially generated training data used in conjunction with real training data can lead to performance boosts for data-greedy neural network algorithms. We also demonstrate the usefulness of the generated data for NLP setups where it fully replaces real training data.

Knowledge graph embeddings rank among the most successful methods for link prediction in knowledge graphs, i.e., the task of completing an incomplete collection of relational facts. A downside of these models is their strong sensitivity to model hyperparameters, in particular regularizers, which have to be extensively tuned to reach good performance [Kadlec et al., 2017]. We propose an efficient method for large scale hyperparameter tuning by interpreting these models in a probabilistic framework. After a model augmentation that introduces per-entity hyperparameters, we use a variational expectation-maximization approach to tune thousands of such hyperparameters with minimal additional cost. Our approach is agnostic to details of the model and results in a new state of the art in link prediction on standard benchmark data.

Machine Reading Comprehension (MRC), which requires the machine to answer questions based on the given context, has gained increasingly wide attention with the appearance of deep learning over the past few years. Although the research of MRC based on deep learning is flourishing, there is a lack of a comprehensive survey article to summarize the proposed approaches and the recent trends. As a result, we conduct a thorough overview of recent research efforts on this promising field. To be concrete, we compare MRC tasks in different dimensions and introduce the general architecture. We further provide a taxonomy of state-of-the-art approaches utilized in prevalent models. Finally, we discuss some new trends and conclude by describing some open issues in the field.

Grammatical error correction can be viewed as a low-resource sequence-to-sequence task, because publicly available parallel corpora are limited. To tackle this challenge, we first generate erroneous versions of large unannotated corpora using a realistic noising function. The resulting parallel corpora are subsequently used to pre-train Transformer models. Then, by sequentially applying transfer learning, we adapt these models to the domain and style of the test set. Combined with a context-aware neural spellchecker, our system achieves competitive results in both restricted and low resource tracks in ACL 2019 BEA Shared Task. We release all of our code and materials for reproducibility.

The application of deep learning (DL) models to the decoding of cognitive states from whole-brain functional Magnetic Resonance Imaging (fMRI) data is often hindered by the small sample size and high dimensionality of these datasets. Especially, in clinical settings, where patient data are scarce. In this work, we demonstrate that transfer learning represents a solution to this problem. Particularly, we show that a DL model, which has been previously trained on a large openly available fMRI dataset of the Human Connectome Project, outperforms a model variant with the same architecture, but which is trained from scratch, when both are applied to the data of a new, unrelated fMRI task. Even further, the pre-trained DL model variant is already able to correctly decode 67.51% of the cognitive states from a test dataset with 100 individuals, when fine-tuned on a dataset of the size of only three subjects.

Data-driven models for audio source separation such as U-Net or Wave-U-Net are usually models dedicated to and specifically trained for a single task, e.g. a particular instrument isolation. Training them for various tasks at once commonly results in worse performances than training them for a single specialized task. In this work, we introduce the Conditioned-U-Net (C-U-Net) which adds a control mechanism to the standard U-Net. The control mechanism allows us to train a unique and generic U-Net to perform the separation of various instruments. The C-U-Net decides the instrument to isolate according to a one-hot-encoding input vector. The input vector is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. FiLM layers modify the U-Net feature maps in order to separate the desired instrument via affine transformations. The C-U-Net performs different instrument separations, all with a single model achieving the same performances as the dedicated ones at a lower cost.

We humans seem to have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we address the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture meaningful information about the environment, which in turn can be used to measure reachability, detect side-effects and to obtain an intrinsic reward signal. We show empirical results on a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a known notion of an arrow of time given by the celebrated Jordan-Kinderlehrer-Otto result.

In this study, we investigated multi-modal approaches using images, descriptions, and title to categorize e-commerce products on Amazon.com. Specifically, we examined late fusion models, where the modalities are fused at the decision level. Products were each assigned multiple labels, and the hierarchy in the labels were flattened and filtered. For our individual baseline models, we modified a CNN architecture to classify the description and title, and then modified Keras’ ResNet-50 to classify the images, achieving F1 scores of 77.0%, 82.7%, and 61.0%, respectively. In comparison, our tri-modal late fusion model can classify products more accurately than single modal models can, improving the F1 score to 88.2%. Each modality complemented the shortcomings of the other modalities, demonstrating that increasing the number of modalities can be an effective method for improving the accuracy of multi-label classification problems.

The sequential order of utterances is often meaningful in coherent dialogues, and the order changes of utterances could lead to low-quality and incoherent conversations. We consider the order information as a crucial supervised signal for dialogue learning, which, however, has been neglected by many previous dialogue systems. Therefore, in this paper, we introduce a self-supervised learning task, inconsistent order detection, to explicitly capture the flow of conversation in dialogues. Given a sampled utterance pair triple, the task is to predict whether it is ordered or misordered. Then we propose a sampling-based self-supervised network SSN to perform the prediction with sampled triple references from previous dialogue history. Furthermore, we design a joint learning framework where SSN can guide the dialogue systems towards more coherent and relevant dialogue learning through adversarial training. We demonstrate that the proposed methods can be applied to both open-domain and task-oriented dialogue scenarios, and achieve the new state-of-the-art performance on the OpenSubtitiles and Movie-Ticket Booking datasets.

Recently, there has been interest in multiplicative recurrent neural networks for language modeling. Indeed, simple Recurrent Neural Networks (RNNs) encounter difficulties recovering from past mistakes when generating sequences due to high correlation between hidden states. These challenges can be mitigated by integrating second-order terms in the hidden-state update. One such model, multiplicative Long Short-Term Memory (mLSTM) is particularly interesting in its original formulation because of the sharing of its second-order term, referred to as the intermediate state. We explore these architectural improvements by introducing new models and testing them on character-level language modeling tasks. This allows us to establish the relevance of shared parametrization in recurrent language modeling.

Named entity recognition (NER) is one of the best studied tasks in natural language processing. However, most approaches are not capable of handling nested structures which are common in many applications. In this paper we introduce a novel neural network architecture that first merges tokens and/or entities into entities forming nested structures, and then labels each of them independently. Unlike previous work, our merge and label approach predicts real-valued instead of discrete segmentation structures, which allow it to combine word and nested entity embeddings while maintaining differentiability. %which smoothly groups entities into single vectors across multiple levels. We evaluate our approach using the ACE 2005 Corpus, where it achieves state-of-the-art F1 of 74.6, further improved with contextual embeddings (BERT) to 82.4, an overall improvement of close to 8 F1 points over previous approaches trained on the same data. Additionally we compare it against BiLSTM-CRFs, the dominant approach for flat NER structures, demonstrating that its ability to predict nested structures does not impact performance in simpler cases.

The advance of node pooling operations in a Graph Neural Network (GNN) has lagged behind the feverish design of new graph convolution techniques, and pooling remains an important and challenging endeavor for the design of deep architectures. In this paper, we propose a pooling operation for GNNs that implements a differentiable unsupervised loss based on the mincut optimization objective. First, we validate the effectiveness of the proposed loss function by clustering nodes in citation networks and through visualization examples, such as image segmentation. Then, we show how the proposed pooling layer can be used to build a deep GNN architecture for graph classification.

Clustering is an important part of many modern data analysis pipelines, including network analysis and data retrieval. There are many different clustering algorithms developed by various communities, and it is often not clear which algorithm will give the best performance on a specific clustering task. Similarly, we often have multiple ways to measure distances between data points, and the best clustering performance might require a non-trivial combination of those metrics. In this work, we study data-driven algorithm selection and metric learning for clustering problems, where the goal is to simultaneously learn the best algorithm and metric for a specific application. The family of clustering algorithms we consider is parameterized linkage based procedures that includes single and complete linkage. The family of distance functions we learn over are convex combinations of base distance functions. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and simultaneously learn both a near-optimal distance and clustering algorithm from these classes. We also carry out a comprehensive empirical evaluation of our techniques showing that they can lead to significantly improved clustering performance.

Graph alignment, also known as network alignment, is a fundamental task in social network analysis. Many recent works have relied on partially labeled cross-graph node correspondences, i.e., anchor links. However, due to the privacy and security issue, the manual labeling of anchor links for diverse scenarios may be prohibitive. Aligning two graphs without any anchor links is a crucial and challenging task. In this paper, we propose an Unsupervised Adversarial Graph Alignment (UAGA) framework to learn a cross-graph alignment between two embedding spaces of different graphs in a fully unsupervised fashion (\emph{i.e.,} no existing anchor links and no users’ personal profile or attribute information is available). The proposed framework learns the embedding spaces of each graph, and then attempts to align the two spaces via adversarial training, followed by a refinement procedure. We further extend our UAGA method to incremental UAGA (iUAGA) that iteratively reveals the unobserved user links based on the pseudo anchor links. This can be used to further improve both the embedding quality and the alignment accuracy. Moreover, the proposed methods will benefit some real-world applications, \emph{e.g.,} link prediction in social networks. Comprehensive experiments on real-world data demonstrate the effectiveness of our proposed approaches UAGA and iUAGA for unsupervised graph alignment.

This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.

Writing review for a purchased item is a unique channel to express a user’s opinion in E-Commerce. Recently, many deep learning based solutions have been proposed by exploiting user reviews for rating prediction. In contrast, there has been few attempt to enlist the semantic signals covered by user reviews for the task of collaborative filtering. In this paper, we propose a novel review-driven neural sequential recommendation model (named RNS) by considering users’ intrinsic preference (long-term) and sequential patterns (short-term). In detail, RNS is devised to encode each user or item with the aspect-aware representations extracted from the reviews. Given a sequence of historical purchased items for a user, we devise a novel hierarchical attention over attention mechanism to capture sequential patterns at both union-level and individual-level. Extensive experiments on three real-world datasets of different domains demonstrate that RNS obtains significant performance improvement over uptodate state-of-the-art sequential recommendation models.

We present a simple and novel way to do the task of text-to-SQL problem with weak supervision. We call it Rule-SQL. Given the question and the answer from the database table without the SQL logic form, Rule-SQL use the database rules for the SQL exploration first and then use the explored SQL for supervised training. We design several rules for reducing the exploration search space. For the deep model, we leverage BERT for the representation layer and separate the model to SELECT, AGG and WHERE parts. The experiment result on WikiSQL outperforms the strong baseline of full supervision and is comparable to the start-of-the-art weak supervised mothods.

Machines learning techniques plays a preponderant role in dealing with massive amount of data and are employed in almost every possible domain. Building a high quality machine learning model to be deployed in production is a challenging task, from both, the subject matter experts and the machine learning practitioners. For a broader adoption and scalability of machine learning systems, the construction and configuration of machine learning workflow need to gain in automation. In the last few years, several techniques have been developed in this direction, known as autoML. In this paper, we present a two-stage optimization process to build data pipelines and configure machine learning algorithms. First, we study the impact of data pipelines compared to algorithm configuration in order to show the importance of data preprocessing over hyperparameter tuning. The second part presents policies to efficiently allocate search time between data pipeline construction and algorithm configuration. Those policies are agnostic from the metaoptimizer. Last, we present a metric to determine if a data pipeline is specific or independent from the algorithm, enabling fine-grain pipeline pruning and meta-learning for the coldstart problem.

When it comes to clustering nonconvex shapes, two paradigms are used to find the most suitable clustering: minimum cut and maximum density. The most popular algorithms incorporating these paradigms are Spectral Clustering and DBSCAN. Both paradigms have their pros and cons. While minimum cut clusterings are sensitive to noise, density-based clusterings have trouble handling clusters with varying densities. In this paper, we propose \textsc{SpectACl}: a method combining the advantages of both approaches, while solving the two mentioned drawbacks. Our method is easy to implement, such as spectral clustering, and theoretically founded to optimize a proposed density criterion of clusterings. Through experiments on synthetic and real-world data, we demonstrate that our approach provides robust and reliable clusterings.

How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc ‘traditional’ approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. To define a state-of-the-art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge many outstanding issues in multivariable modelling remain. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics. We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables, and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling. Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research.

Recent work has shown that memory modules are crucial for the generalization ability of neural networks on learning simple algorithms. However, we still have little understanding of the working mechanism of memory modules. To alleviate this problem, we apply a two-step analysis pipeline consisting of first inferring hypothesis about what strategy the model has learned according to visualization and then verify it by a novel proposed qualitative analysis method based on dimension reduction. Using this method, we have analyzed two popular memory-augmented neural networks, neural Turing machine and stack-augmented neural network on two simple algorithm tasks including reversing a random sequence and evaluation of arithmetic expressions. Results have shown that on the former task both models can learn to generalize and on the latter task only the stack-augmented model can do so. We show that different strategies are learned by the models, in which specific categories of input are monitored and different policies are made based on that to change the memory.

This paper investigates the specific experience of following a suggestion by an intelligent machine that has a wrong outcome and the emotions people feel. By adopting a typical task employed in studies on decision-making, we presented participants with two scenarios in which they follow a suggestion and have a wrong outcome by either an expert human being or an intelligent machine. We found a significant decrease in the perceived responsibility on the wrong choice when the machine offers the suggestion. At present, few studies have investigated the negative emotions that could arise from a bad outcome after following the suggestion given by an intelligent system, and how to cope with the potential distrust that could affect the long-term use of the system and the cooperation. This preliminary research has implications in the study of cooperation and decision making with intelligent machines. Further research may address how to offer the suggestion in order to better cope with user’s self-blame.

Transfer learning across different reinforcement learning (RL) tasks is becoming an increasingly valuable area of research. We consider a goal-based multi-task RL framework and mechanisms by which previously solved tasks can reduce sample complexity and regret when the agent is faced with a new task. Specifically, we introduce two metrics on the state space that encode notions of traversibility of the state space for an agent. Using these metrics a topological covering is constructed by way of a set of landmark states in a fully self-supervised manner. We show that these landmark coverings confer theoretical advantages for transfer learning within the goal-based multi-task RL setting. Specifically, we demonstrate three mechanisms by which landmark coverings can be used for successful transfer learning. First, we extend the Landmark Options Via Reflection (LOVR) framework to this new topological covering; second, we use the landmark-centric value functions themselves as features and define a greedy zombie policy that achieves near oracle performance on a sequence of zero-shot transfer tasks; finally, motivated by the second transfer mechanism, we introduce a learned reward function that provides a more dense reward signal for goal-based RL. Our novel topological landmark covering confers beneficial theoretical results, bounding the Q values at each state-action pair. In doing so, we introduce a mechanism that performs action-pruning at infeasible actions which cannot possibly be part of an optimal policy for the current goal.

We live in an era of big data and rich data visualization. As data sets increase in size, browser-based interactive visualizations eventually hit limits in storage and processing capacity. In order to provide interactivity over large datasets, visualization applications typically need to be extensively rewritten to make use of powerful back-end services. It would be far preferable if front-end developers could write visualizations once in a natural way, and have a framework take responsibility for transparently scaling up the visualization to use back-end services as needed. Achieving this goal requires rethinking how communication and state are managed by the framework: the mapping of interaction logic to server APIs or database queries, handling of results arriving asynchronously over the network, as well as basic cross-layer performance optimizations like caching. In this paper, we present DIEL, a framework that achieves this cross-layer autoscaling transparently under a simple, declarative interface. DIEL treats UI events as a stream of data that is captured in an event history for reuse. Developers declare what the state of the interface should be after the arrival of events. DIEL compiles these declarative specifications into relational queries over both event history and the data to be visualized. In doing so, DIEL makes it easier to develop visualizations that are robust against changes to the size and location of data. To evaluate the DIEL framework, we developed a prototype implementation and confirmed that DIEL supports a range of visualization and interaction designs. Visualizations written using DIEL can transparently and seamlessly scale to use back-end services with little intervention from the developer.

Modern interactive visualizations are akin to distributed systems, where user interactions, background data processing, remote requests, and streaming data read and modify the interface at the same time. This concurrency is crucial to provide an interactive user experience—forbidding it can cripple responsiveness. However, it is notoriously challenging to program distributed systems, and concurrency can easily lead to ambiguous or confusing interface behaviors. In this paper, we present DIEL, a declarative programming model to help developers reason about and reconcile concurrency-related issues. Using DIEL, developers no longer need to procedurally describe how the interface should update based on different input events, but rather declaratively specify what the state of the interface should be as queries over event history. We show that resolving conflicts from concurrent processes in real-world interactive visualizations can be done in a few lines of DIEL code.

We propose a new end-to-end method for extending a Knowledge Graph (KG) from tables. Existing techniques tend to interpret tables by focusing on information that is already in the KG, and therefore tend to extract many redundant facts. Our method aims to find more novel facts. We introduce a new technique for table interpretation based on a scalable graphical model using entity similarities. Our method further disambiguates cell values using KG embeddings as additional ranking method. Other distinctive features are the lack of assumptions about the underlying KG and the enabling of a fine-grained tuning of the precision/recall trade-off of extracted facts. Our experiments show that our approach has a higher recall during the interpretation process than the state-of-the-art, and is more resistant against the bias observed in extracting mostly redundant facts since it produces more novel extractions.

The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, e.g. if the signal is clustered in many small blocks, as is the case in some relevant applications. We derive the detection boundary in such a situation where we allow both the number of blocks and the block length to grow polynomially with sample size. We derive these results both for the univariate and the multivariate settings as well as for the problem of detecting clusters in a network. These results recover as special cases the heterogeneous mixture detection problem [1] where there is no structure in the signal, as well as scan problem [2] where the signal comprises a single interval. We develop methodology that allows optimal adaptive detection in the general setting, thus exploiting the structure if it is present without incurring a relevant penalty in the case where there is no structure. The advantage of this methodology can be considerable, as in the case of no structure the means need to increase at the rate to ensure detection, while the presence of structure allows detection even if the means \emph{decrease} at a polynomial rate.

This paper addresses the ability of generative adversarial networks (GANs) to model complex distributions of data in high-dimensional spaces. Our proposition is that the more effective the adversary is in discriminating the output of the generator, the more effective the generator will be at modeling (or generating) the distribution represented by the training data. The most extreme failure of GANs in this context is mode collapse, and there are several proposed methods to address that problem. However, mode collapse is merely a symptom of a more general problem of GANs, where the generator fools the adversary while failing to faithfully model the distribution of the training data. Here, we address the challenge of constructing and evaluating GANs that more effectively represent the input distribution. We introduce an adversarial architecture that processes sets of generated and real samples, and discriminates between the origins of these sets (i.e., training versus generated data) in a flexible, permutation invariant manner. We present quantitative and qualitative results that demonstrate the effectiveness of this approach relative to state-of-the-art methods for avoiding mode collapse.

Point-to-Point Shortest Distance (PPSD) query is a crucial primitive in graph database applications. Hub labeling algorithms compute a labeling that converts a PPSD query into a list intersection problem (over a pre-computed indexing) enabling swift query response. However, constructing hub labeling is computationally challenging. Even state-of-the-art parallel algorithms based on Pruned Landmark Labeling (PLL) [3], are plagued by large label size, violation of given network hierarchy, poor scalability and inability to process large graphs. In this paper, we develop novel parallel shared-memory and distributed-memory algorithms for constructing the Canonical Hub Labeling (CHL) that is minimal in size for a given network hierarchy. To the best of our knowledge, none of the existing parallel algorithms guarantee canonical labeling. Our key contribution, the PLaNT algorithm, scales well beyond the limits of current practice by completely avoiding inter-node communication. PLaNT also enables the design of a collaborative label partitioning scheme across multiple nodes for completely in-memory processing of massive graphs whose labels cannot fit on a single machine. Compared to the sequential PLL, we empirically demonstrate upto 47.4x speedup on a 72 thread shared-memory platform. On a 64-node cluster, PLaNT achieves an average 42x speedup over single node execution. Finally, we show how our approach demonstrates superior scalability – we can process 14x larger graphs (in terms of label size) and construct hub labeling orders of magnitude faster compared to state-of-the-art distributed paraPLL algorithm.

Data scientists are constantly creating methods to efficiently and accurately populate big data sets for use in large-scale applications. Many recent efforts utilize crowd-sourcing and textual interfaces. In this paper, we propose a new method of curating data; namely, creating a multi-device Amazon Alexa Skill in the form of a research trivia game. Users experience a synchronized gaming experience with other Amazon Echo users, competing against one another while filling in gaps of a connected knowledge base. This allows for full exploitation of the speed improvement offered by voice interface technology in a game-based format.

Can we trust black-box machine learning with its decisions? Can we trust algorithms to train machine learning models on sensitive data? Transparency and privacy are two fundamental elements of trust for adopting machine learning. In this paper, we investigate the relation between interpretability and privacy. In particular we analyze if an adversary can exploit transparent machine learning to infer sensitive information about its training set. To this end, we perform membership inference as well as reconstruction attacks on two popular classes of algorithms for explaining machine learning models: feature-based and record-based influence measures. We empirically show that an attacker, that only observes the feature-based explanations, has the same power as the state of the art membership inference attacks on model predictions. We also demonstrate that record-based explanations can be effectively exploited to reconstruct significant parts of the training set. Finally, our results indicate that minorities and special cases are more vulnerable to these type of attacks than majority groups.

Continual learning (CL) is a particular machine learning paradigm where the data distribution and learning objective changes through time, or where all the training data and objective criteria are never available at once. The evolution of the learning process is modeled by a sequence of learning experiences where the goal is to be able to learn new skills all along the sequence without forgetting what has been previously learned. Continual learning also aims at the same time at optimizing the memory, the computation power and the speed during the learning process. An important challenge for machine learning is not necessarily finding solutions that work in the real world but rather finding stable algorithms that can learn in real world. Hence, the ideal approach would be tackling the real world in a embodied platform: an autonomous agent. Continual learning would then be effective in an autonomous agent or robot, which would learn autonomously through time about the external world, and incrementally develop a set of complex skills and knowledge. Robotic agents have to learn to adapt and interact with their environment using a continuous stream of observations. Some recent approaches aim at tackling continual learning for robotics, but most recent papers on continual learning only experiment approaches in simulation or with static datasets. Unfortunately, the evaluation of those algorithms does not provide insights on whether their solutions may help continual learning in the context of robotics. This paper aims at reviewing the existing state of the art of continual learning, summarizing existing benchmarks and metrics, and proposing a framework for presenting and evaluating both robotics and non robotics approaches in a way that makes transfer between both fields easier.

We deal with the \textit{selective classification} problem (supervised-learning problem with a rejection option), where we want to achieve the best performance at a certain level of coverage of the data. We transform the original -class classification problem to -class where the -th class represents the model abstaining from making a prediction due to uncertainty. Inspired by portfolio theory, we propose a loss function for the selective classification problem based on the doubling rate of gambling. We show that minimizing this loss function has a natural interpretation as maximizing the return of a \textit{horse race}, where a player aims to balance between betting on an outcome (making a prediction) when confident and reserving one’s winnings (abstaining) when not confident. This loss function allows us to train neural networks and characterize the uncertainty of prediction in an end-to-end fashion. In comparison with previous methods, our method requires almost no modification to the model inference algorithm or neural architecture. Experimentally, we show that our method can identify both uncertain and outlier data points, and achieves strong results on SVHN and CIFAR10 at various coverages of the data.

Nowadays, users open multiple accounts on social media platforms and e-commerce sites, expressing their personal preferences on different domains. However, users’ behaviors change across domains, depending on the content that users interact with, such as movies, music, clothing and retail products. In this paper, we propose an adaptive deep learning strategy for cross-domain recommendation, referred to as ADC. We design a neural architecture and formulate a cross-domain loss function, to compute the non-linearity in user preferences across domains and transfer the knowledge of users’ multiple behaviors, accordingly. In addition, we introduce an efficient algorithm for cross-domain loss balancing which directly tunes gradient magnitudes and adapts the learning rates based on the domains’ complexities/scales when training the model via backpropagation. In doing so, ADC controls and adjusts the contribution of each domain when optimizing the model parameters. Our experiments on six publicly available cross-domain recommendation tasks demonstrate the effectiveness of the proposed ADC model over other state-of-the-art methods. Furthermore, we study the effect of the proposed adaptive deep learning strategy and show that ADC can well balance the impact of the domains with different complexities.

Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length , making modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only memory cost, improving the time series forecasting in finer granularity under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.

Multidimensional scaling (MDS) is a popular technique for mapping a finite metric space into a low-dimensional Euclidean space in a way that best preserves pairwise distances. We overview the theory of classical MDS, along with its optimality properties and goodness of fit. Further, we present a notion of MDS on infinite metric measure spaces that generalizes these optimality properties. As a consequence we can study the MDS embeddings of the geodesic circle into for all , and ask questions about the MDS embeddings of the geodesic -spheres into . Finally, we address questions on convergence of MDS. For instance, if a sequence of metric measure spaces converges to a fixed metric measure space , then in what sense do the MDS embeddings of these spaces converge to the MDS embedding of ?

This paper presents a novel and formal interpretation of the original vision of hypertext: infrastructure-agnostic hypertext is independent from specific standards such as data formats and network protocols. Its model is illustrated with examples and references to existing technologies that allow for implementation and integration in current information infrastructures such as the Internet.

Real-world applications of object recognition often require the solution of multiple tasks in a single platform. Under the standard paradigm of network fine-tuning, an entirely new CNN is learned per task, and the final network size is independent of task complexity. This is wasteful, since simple tasks require smaller networks than more complex tasks, and limits the number of tasks that can be solved simultaneously. To address these problems, we propose a transfer learning procedure, denoted NetTailor, in which layers of a pre-trained CNN are used as universal blocks that can be combined with small task-specific layers to generate new networks. Besides minimizing classification error, the new network is trained to mimic the internal activations of a strong unconstrained CNN, and minimize its complexity by the combination of 1) a soft-attention mechanism over blocks and 2) complexity regularization constraints. In this way, NetTailor can adapt the network architecture, not just its weights, to the target task. Experiments show that networks adapted to simple tasks, such as character or traffic sign recognition, become significantly smaller than those adapted to hard tasks, such as fine-grained recognition. More importantly, due to the modular nature of the procedure, this reduction in network complexity is achieved without compromise of either parameter sharing across tasks, or classification accuracy.

Machine reading comprehension aims to teach machines to understand a text like a human and is a new challenging direction in Artificial Intelligence. This article summarizes recent advances in MRC, mainly focusing on two aspects (i.e., corpus and techniques). The specific characteristics of various MRC corpus are listed and compared. The main ideas of some typical MRC techniques are also described.

Developing a structured method for analyzing various aspects of a system requires a novel methodology. This study is aimed at developing such as methodology through combining two major matrix methods, namely, Design Structure Matrix (DSM) and Interface Structure Matrix (ISM). Through this paper, a business process modeling method is applied to turn a real work project to a process model. Then that process model is written in two various matrix forms of DSM and ISM. These two matrices are analyzed by two types of algorithm for extracting activity levels and sub-processes. In the end, a Mixed Matrix Model (MMM) is built upon these activity levels and sub-processes, which can be used as a framework for the engineering of real-world systems.

In this paper, we propose a deep learning framework based on randomized neural network. In particular, inspired by the principles of Random Vector Functional Link (RVFL) network, we present a deep RVFL network (dRVFL) with stacked layers. The parameters of the hidden layers of the dRVFL are randomly generated within a suitable range and kept fixed while the output weights are computed using the closed form solution as in a standard RVFL network. We also propose an ensemble deep network (edRVFL) that can be regarded as a marriage of ensemble learning with deep learning. Unlike traditional ensembling approaches that require training several models independently from scratch, edRVFL is obtained by training a single dRVFL network once. Both dRVFL and edRVFL frameworks are generic and can be used with any RVFL variant. To illustrate this, we integrate the deep learning networks with a recently proposed sparse-pretrained RVFL (SP-RVFL). Extensive experiments on benchmark datasets from diverse domains show the superior performance of our proposed deep RVFL networks.

When translating natural language questions into SQL queries to answer questions from a database, we would like our methods to generalize to domains and database schemas outside of the training set. To handle complex questions and database schemas with a neural encoder-decoder paradigm, it is critical to properly encode the schema as part of the input with the question. In this paper, we use relation-aware self-attention within the encoder so that it can reason about how the tables and columns in the provided schema relate to each other and use this information in interpreting the question. We achieve significant gains on the recently-released Spider dataset with 42.94% exact match accuracy, compared to the 18.96% reported in published work.

Membership inference (MI) attacks exploit a learned model’s lack of generalization to infer whether a given sample was in the model’s training set. Known MI attacks generally work by casting the attacker’s goal as a supervised learning problem, training an attack model from predictions generated by the target model, or by others like it. However, we find that these attacks do not often provide a meaningful basis for confidently inferring training set membership, as the attack models are not well-calibrated. Moreover, these attacks do not significantly outperform a trivial attack that predicts that a point is a member if and only if the model correctly predicts its label. In this work we present well-calibrated MI attacks that allow the attacker to accurately control the minimum confidence with which positive membership inferences are made. Our attacks take advantage of white-box information about the target model and leverage new insights about how overfitting occurs in deep neural networks; namely, we show how a model’s idiosyncratic use of features can provide evidence for membership. Experiments on seven real-world datasets show that our attacks support calibration for high-confidence inferences, while outperforming previous MI attacks in terms of accuracy. Finally, we show that our attacks achieve non-trivial advantage on some models with low generalization error, including those trained with small-epsilon-differential privacy; for large-epsilon (epsilon=16, as reported in some industrial settings), the attack performs comparably to unprotected models.

Fair representations are a powerful tool for establishing criteria like statistical parity, proxy non-discrimination, and equality of opportunity in learned models. Existing techniques for learning these representations are typically model-agnostic, as they preprocess the original data such that the output satisfies some fairness criterion, and can be used with arbitrary learning methods. In contrast, we demonstrate the promise of learning a model-aware fair representation, focusing on kernel-based models. We leverage the classical Sufficient Dimension Reduction (SDR) framework to construct representations as subspaces of the reproducing kernel Hilbert space (RKHS), whose member functions are guaranteed to satisfy fairness. Our method supports several fairness criteria, continuous and discrete data, and multiple protected attributes. We further show how to calibrate the accuracy tradeoff by characterizing it in terms of the principal angles between subspaces of the RKHS. Finally, we apply our approach to obtain the first Fair Gaussian Process (FGP) prior for fair Bayesian learning, and show that it is competitive with, and in some cases outperforms, state-of-the-art methods on real data.

What is the relationship between sentence representations learned by deep recurrent models against those encoded by the brain? Is there any correspondence between hidden layers of these recurrent models and brain regions when processing sentences? Can these deep models be used to synthesize brain data which can then be utilized in other extrinsic tasks? We investigate these questions using sentences with simple syntax and semantics (e.g., The bone was eaten by the dog.). We consider multiple neural network architectures, including recently proposed ELMo and BERT. We use magnetoencephalography (MEG) brain recording data collected from human subjects when they were reading these simple sentences. Overall, we find that BERT’s activations correlate the best with MEG brain data. We also find that the deep network representation can be used to generate brain data from new sentences to augment existing brain data. To the best of our knowledge, this is the first work showing that the MEG brain recording when reading a word in a sentence can be used to distinguish earlier words in the sentence. Our exploration is also the first to use deep neural network representations to generate synthetic brain data and to show that it helps in improving subsequent stimuli decoding task accuracy.

Policy gradient based reinforcement learning algorithms coupled with neural networks have shown success in learning complex policies in the model free continuous action space control setting. However, explicitly parameterized policies are limited by the scope of the chosen parametric probability distribution. We show that alternatively to the likelihood based policy gradient, a related objective can be optimized through advantage weighted quantile regression. Our approach models the policy implicitly in the network, which gives the agent the freedom to approximate any distribution in each action dimension, not limiting its capabilities to the commonly used unimodal Gaussian parameterization. This broader spectrum of policies makes our algorithm suitable for problems where Gaussian policies cannot fit the optimal policy. Moreover, our results on the MuJoCo physics simulator benchmarks are comparable or superior to state-of-the-art on-policy methods.

Self-supervised methods, wherein an agent learns representations solely by observing the results of its actions, become crucial in environments which do not provide a dense reward signal or have labels. In most cases, such methods are used for pretraining or auxiliary tasks for ‘downstream’ tasks, such as control, exploration, or imitation learning. However, it is not clear which method’s representations best capture meaningful features of the environment, and which are best suited for which types of environments. We present a small-scale study of self-supervised methods on two visual environments: Flappy Bird and Sonic The Hedgehog. In particular, we quantitatively evaluate the representations learned from these tasks in two contexts: a) the extent to which the representations capture true state information of the agent and b) how generalizable these representations are to novel situations, like new levels and textures. Lastly, we evaluate these self-supervised features by visualizing which parts of the environment they focus on. Our results show that the utility of the representations is highly dependent on the visuals and dynamics of the environment.

The research literature on cybersecurity incident detection & response is very rich in automatic detection methodologies, in particular those based on the anomaly detection paradigm. However, very little attention has been devoted to the diagnosis ability of the methods, aimed to provide useful information on the causes of a given detected anomaly. This information is of utmost importance for the security team to reduce the time from detection to response. In this paper, we present Multivariate Big Data Analysis (MBDA), a complete intrusion detection approach based on 5 steps to effectively handle massive amounts of disparate data sources. The approach has been designed to deal with the main characteristics of Big Data, that is, the high volume, velocity and variety. The core of the approach is the Multivariate Statistical Network Monitoring (MSNM) technique proposed in a recent paper. Unlike in state of the art machine learning methodologies applied to the intrusion detection problem, when an anomaly is identified in MBDA the output of the system includes the detail of the logs of raw information associated to this anomaly, so that the security team can use this information to elucidate its root causes. MBDA is based in two open software packages available in Github: the MEDA Toolbox and the FCParser. We illustrate our approach with two case studies. The first one demonstrates the application of MBDA to semistructured sources of information, using the data from the VAST 2012 mini challenge 2. This complete case study is supplied in a virtual machine available for download. In the second case study we show the Big Data capabilities of the approach in data collected from a real network with labeled attacks.

Machine learning algorithms have been increasingly deployed in critical automated decision-making systems that directly affect human lives. When these algorithms are only trained to minimize the training/test error, they could suffer from systematic discrimination against individuals based on their sensitive attributes such as gender or race. Recently, there has been a surge in machine learning society to develop algorithms for fair machine learning. In particular, many adversarial learning procedures have been proposed to impose fairness. Unfortunately, these algorithms either can only impose fairness up to first-order dependence between the variables, or they lack computational convergence guarantees. In this paper, we use R\’enyi correlation as a measure of fairness of machine learning models and develop a general training framework to impose fairness. In particular, we propose a min-max formulation which balances the accuracy and fairness when solved to optimality. For the case of discrete sensitive attributes, we suggest an iterative algorithm with theoretical convergence guarantee for solving the proposed min-max problem. Our algorithm and analysis are then specialized to fair classification and the fair clustering problem under disparate impact doctrine. Finally, the performance of the proposed R\’enyi fair inference framework is evaluated on Adult and Bank datasets.

Reverse engineering of binary executables is a critical problem in the computer security domain. On the one hand, malicious parties may recover interpretable source codes from the software products to gain commercial advantages. On the other hand, binary decompilation can be leveraged for code vulnerability analysis and malware detection. However, efficient binary decompilation is challenging. Conventional decompilers have the following major limitations: (i) they are only applicable to specific source-target language pair, hence incurs undesired development cost for new language tasks; (ii) their output high-level code cannot effectively preserve the correct functionality of the input binary; (iii) their output program does not capture the semantics of the input and the reversed program is hard to interpret. To address the above problems, we propose Coda, the first end-to-end neural-based framework for code decompilation. Coda decomposes the decompilation task into two key phases: First, Coda employs an instruction type-aware encoder and a tree decoder for generating an abstract syntax tree (AST) with attention feeding during the code sketch generation stage. Second, Coda then updates the code sketch using an iterative error correction machine guided by an ensembled neural error predictor. By finding a good approximate candidate and then fixing it towards perfect, Coda achieves superior performance compared to baseline approaches. We assess Coda’s performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82\% program recovery accuracy on unseen binary samples, where the state-of-the-art decompilers yield 0\% accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with attention by a margin of 70\% program accuracy.

Corner cases are the main bottlenecks when applying Artificial Intelligence (AI) systems to safety-critical applications. An AI system should be intelligent enough to detect such situations so that system developers can prepare for subsequent planning. In this paper, we propose semi-supervised anomaly detection considering the imbalance of normal situations. In particular, driving data consists of multiple positive/normal situations (e.g., right turn, going straight), some of which (e.g., U-turn) could be as rare as anomalous situations. Existing machine learning based anomaly detection approaches do not fare sufficiently well when applied to such imbalanced data. In this paper, we present a novel multi-task learning based approach that leverages domain-knowledge (maneuver labels) for anomaly detection in driving data. We evaluate the proposed approach both quantitatively and qualitatively on 150 hours of real-world driving data and show improved performance over baseline approaches.

With the arrival of the European Union’s General Data Protection Regulation (GDPR), several companies are making significant changes to their systems to achieve compliance. The changes range from modifying privacy policies to redesigning systems which process personal data. This work analyzes the privacy policies of large-scaled cloud services which seek to be GDPR compliant. The privacy policy is the main medium of information dissemination between the data controller and the users. We show that many services that claim compliance today do not have clear and concise privacy policies. We identify several points in the privacy policies which potentially indicate non-compliance; we term these GDPR vulnerabilities. We identify GDPR vulnerabilities in ten cloud services. Based on our analysis, we propose seven best practices for crafting GDPR privacy policies.

Pre-trained word embeddings are the primary method for transfer learning in several Natural Language Processing (NLP) tasks. Recent works have focused on using unsupervised techniques such as language modeling to obtain these embeddings. In contrast, this work focuses on extracting representations from multiple pre-trained supervised models, which enriches word embeddings with task and domain specific knowledge. Experiments performed in cross-task, cross-domain and cross-lingual settings indicate that such supervised embeddings are helpful, especially in the low-resource setting, but the extent of gains is dependent on the nature of the task and domain. We make our code publicly available.

Anomaly subsequence detection is to detect inconsistent data, which always contains important information, among time series. Due to the high dimensionality of the time series, traditional anomaly detection often requires a large time overhead; furthermore, even if the dimensionality reduction techniques can improve the efficiency, they will lose some information and suffer from time drift and parameter tuning. In this paper, we propose a new anomaly subsequence detection with Dynamic Local Density Estimation (DLDE) to improve the detection effect without losing the trend information by dynamically dividing the time series using Time Split Tree. In order to avoid the impact of the hash function and the randomness of dynamic time segments, ensemble learning is used. Experimental results on different types of data sets verify that the proposed model outperforms the state-of-art methods, and the accuracy has big improvement.

The security of Deep Reinforcement Learning (Deep RL) algorithms deployed in real life applications are of a primary concern. In particular, the robustness of RL agents in cyber-physical systems against adversarial attacks are especially vital since the cost of a malevolent intrusions can be extremely high. Studies have shown Deep Neural Networks (DNN), which forms the core decision-making unit in most modern RL algorithms, are easily subjected to adversarial attacks. Hence, it is imperative that RL agents deployed in real-life applications have the capability to detect and mitigate adversarial attacks in an online fashion. An example of such a framework is the Meta-Learned Advantage Hierarchy (MLAH) agent that utilizes a meta-learning framework to learn policies robustly online. Since the mechanism of this framework are still not fully explored, we conducted multiple experiments to better understand the framework’s capabilities and limitations. Our results shows that the MLAH agent exhibits interesting coping behaviors when subjected to different adversarial attacks to maintain a nominal reward. Additionally, the framework exhibits a hierarchical coping capability, based on the adaptability of the Master policy and sub-policies themselves. From empirical results, we also observed that as the interval of adversarial attacks increase, the MLAH agent can maintain a higher distribution of rewards, though at the cost of higher instabilities.

We present an open source math-aware Question Answering System based on Ask Platypus. Our system returns as a single mathematical formula for a natural language question in English or Hindi. This formulae originate from the knowledge-base Wikidata. We translate these formulae to computable data by integrating the calculation engine sympy into our system. This way, users can enter numeric values for the variables occurring in the formula. Moreover, the system loads numeric values for constants occurring in the formula from Wikidata. In a user study, our system outperformed a commercial computational mathematical knowledge engine by 13%. However, the performance of our system heavily depends on the size and quality of the formula data available in Wikidata. Since only a few items in Wikidata contained formulae when we started the project, we facilitated the import process by suggesting formula edits to Wikidata editors. With the simple heuristic that the first formula is significant for the article, 80% of the suggestions were correct.

The Wikipedia category graph serves as the taxonomic backbone for large-scale knowledge graphs like YAGO or Probase, and has been used extensively for tasks like entity disambiguation or semantic similarity estimation. Wikipedia’s categories are a rich source of taxonomic as well as non-taxonomic information. The category ‘German science fiction writers’, for example, encodes the type of its resources (Writer), as well as their nationality (German) and genre (Science Fiction). Several approaches in the literature make use of fractions of this encoded information without exploiting its full potential. In this paper, we introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia’s categories and populate the DBpedia knowledge graph with additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.

Interaction function (IFC), which captures interactions among items and users, is of great importance in collaborative filtering (CF). The inner product is the most popular IFC due to its success in low-rank matrix factorization. However, interactions in real-world applications can be highly complex. Many other operations (such as plus and concatenation) have also been proposed, and can possibly offer better performance than the inner product. In this paper, motivated by the success of automated machine learning, we propose to search for proper interaction functions (SIF) for CF tasks. We first design an expressive search space for SIF by reviewing and generalizing existing CF approaches. We then propose to represent the search space as a structured multi-layer perceptron, and design a stochastic gradient descent algorithm which can simultaneously update both architectures and learning parameters. Experimental results demonstrate that the proposed method can be much more efficient than popular AutoML approaches, and also obtain much better prediction performance than state-of-the-art CF approaches.

The R package stochvol provides a fully Bayesian implementation of heteroskedasticity modeling within the framework of stochastic volatility. It utilizes Markov chain Monte Carlo (MCMC) samplers to conduct inference by obtaining draws from the posterior distribution of parameters and latent variables which can then be used for predicting future volatilities. The package can straightforwardly be employed as a stand-alone tool; moreover, it allows for easy incorporation into other MCMC samplers. The main focus of this paper is to show the functionality of stochvol. In addition, it provides a brief mathematical description of the model, an overview of the sampling schemes used, and several illustrative examples using exchange rate data.

Local differential privacy (LDP) is a recently proposed privacy standard for collecting and analyzing data, which has been used, e.g., in the Chrome browser, iOS and macOS. In LDP, each user perturbs her information locally, and only sends the randomized version to an aggregator who performs analyses, which protects both the users and the aggregator against private information leaks. Although LDP has attracted much research attention in recent years, the majority of existing work focuses on applying LDP to complex data and/or analysis tasks. In this paper, we point out that the fundamental problem of collecting multidimensional data under LDP has not been addressed sufficiently, and there remains much room for improvement even for basic tasks such as computing the mean value over a single numeric attribute under LDP. Motivated by this, we first propose novel LDP mechanisms for collecting a numeric attribute, whose accuracy is at least no worse (and usually better) than existing solutions in terms of worst-case noise variance. Then, we extend these mechanisms to multidimensional data that can contain both numeric and categorical attributes, where our mechanisms always outperform existing solutions regarding worst-case noise variance. As a case study, we apply our solutions to build an LDP-compliant stochastic gradient descent algorithm (SGD), which powers many important machine learning tasks. Experiments using real datasets confirm the effectiveness of our methods, and their advantages over existing solutions.

Given a labeled dataset that contains a rare (or minority) class of of-interest instances, as well as a large class of instances that are not of interest, how can we learn to recognize future of-interest instances over a continuous stream? We introduce RaRecognize, which (i) estimates a general decision boundary between the rare and the majority class, (ii) learns to recognize individual rare subclasses that exist within the training data, as well as (iii) flags instances from previously unseen rare subclasses as newly emerging. The learner in (i) is general in the sense that by construction it is dissimilar to the specialized learners in (ii), thus distinguishes minority from the majority without overly tuning to what is seen in the training data. Thanks to this generality, RaRecognize ignores all future instances that it labels as majority and recognizes the recurrent as well as emerging rare subclasses only. This saves effort at test time as well as ensures that the model size grows moderately over time as it only maintains specialized minority learners. Through extensive experiments, we show that RaRecognize outperforms state-of-the art baselines on three real-world datasets that contain corporate-risk and disaster documents as rare classes.

In the context of deep learning, the costliest phase from a computational point of view is the full training of the learning algorithm. However, this process is to be used a significant number of times during the design of a new artificial neural network, leading therefore to extremely expensive operations. Here, we propose a low-cost strategy to predict the accuracy of the algorithm, based only on its initial behaviour. To do so, we train the network of interest up to convergence several times, modifying its characteristics at each training. The initial and final accuracies observed during this beforehand process are stored in a database. We then make use of both curve fitting and Support Vector Machines techniques, the latter being trained on the created database, to predict the accuracy of the network, given its accuracy on the primary iterations of its learning. This approach can be of particular interest when the space of the characteristics of the network is notably large or when its full training is highly time-consuming. The results we obtained are promising and encouraged us to apply this strategy to a topical issue: hyper-parameter optimisation (HO). In particular, we focused on the HO of a convolutional neural network for the classification of the databases MNIST and CIFAR-10. By using our method of prediction, and an algorithm implemented by us for a probabilistic exploration of the hyper-parameter space, we were able to find the hyper-parameter settings corresponding to the optimal accuracies already known in literature, at a quite low-cost.

Anticipatory thinking is a complex cognitive process for assessing and managing risk in many contexts. Humans use anticipatory thinking to identify potential future issues and proactively take actions to manage their risks. In this paper we define a cognitive systems approach to anticipatory thinking as a metacognitive goal reasoning mechanism. The contributions of this paper include (1) defining anticipatory thinking in the MIDCA cognitive architecture, (2) operationalizing anticipatory thinking as a three step process for managing risk in plans, and (3) a numeric risk assessment calculating an expected cost-benefit ratio for modifying a plan with anticipatory actions.

In complex tasks, such as those with large combinatorial action spaces, random exploration may be too inefficient to achieve meaningful learning progress. In this work, we use a curriculum of progressively growing action spaces to accelerate learning. We assume the environment is out of our control, but that the agent may set an internal curriculum by initially restricting its action space. Our approach uses off-policy reinforcement learning to estimate optimal value functions for multiple action spaces simultaneously and efficiently transfers data, value estimates, and state representations from restricted action spaces to the full task. We show the efficacy of our approach in proof-of-concept control tasks and on challenging large-scale StarCraft micromanagement tasks with large, multi-agent action spaces.

Recent works show that Graph Neural Networks (GNNs) are highly non-robust with respect to adversarial attacks on both the graph structure and the node attributes, making their outcomes unreliable. We propose the first method for certifiable (non-)robustness of graph convolutional networks with respect to perturbations of the node attributes. We consider the case of binary node attributes (e.g. bag-of-words) and perturbations that are L_0-bounded. If a node has been certified with our method, it is guaranteed to be robust under any possible perturbation given the attack model. Likewise, we can certify non-robustness. Finally, we propose a robust semi-supervised training procedure that treats the labeled and unlabeled nodes jointly. As shown in our experimental evaluation, our method significantly improves the robustness of the GNN with only minimal effect on the predictive accuracy.

With the deluge of digitized information in the Big Data era, massive datasets are becoming increasingly available for learning predictive models. However, in many situations, the poor control of data acquisition processes may naturally jeopardize the outputs of machine-learning algorithms and selection bias issues are now the subject of much attention in the literature. It is precisely the purpose of the present article to investigate how to extend Empirical Risk Minimization (ERM), the main paradigm of statistical learning, when the training observations are generated from biased models, i.e. from distributions that are different from that of the data in the test/prediction stage. Precisely, we show how to build a ‘nearly debiased’ training statistical population from biased samples and the related biasing functions following in the footsteps of the approach originally proposed in Vardi et al. (1985) and study, from a non asymptotic perspective, the performance of minimizers of an empirical version of the risk computed from the statistical population thus constructed. Remarkably, the learning rate achieved by this procedure is of the same order as that attained in absence of any selection bias phenomenon. Beyond these theoretical guarantees, illustrative experimental results supporting the relevance of the algorithmic approach promoted in this paper are also displayed.

Self-supervision provides effective representations for downstream tasks without requiring labels. However, existing approaches lag behind fully supervised training and are often not thought beneficial beyond obviating the need for annotations. We find that self-supervision can benefit robustness in a variety of ways, including robustness to adversarial examples, label corruption, and common input corruptions. Additionally, self-supervision greatly benefits out-of-distribution detection on difficult, near-distribution outliers, so much so that it exceeds the performance of fully supervised methods. These results demonstrate the promise of self-supervision for improving robustness and uncertainty estimation and establish these tasks as new axes of evaluation for future self-supervised learning research.

We provide a discussion of several recent results which have overcome a key barrier in distributed optimization for machine learning. Our focus is the so-called network independence property, which is achieved whenever a distributed method executed over a network of nodes achieves comparable performance to a centralized method with the same computational power as the entire network. We explain this property through an example involving of training ML models and sketch a short mathematical analysis.

Most automation in machine learning focuses on model selection and hyper parameter tuning, and many overlook the challenge of automatically defining predictive tasks. We still heavily rely on human experts to define prediction tasks, and generate labels by aggregating raw data. In this paper, we tackle the challenge of defining useful prediction problems on event-driven time-series data. We introduce MLFriend to address this challenge. MLFriend first generates all possible prediction tasks under a predefined space, then interacts with a data scientist to learn the context of the data and recommend good prediction tasks from all the tasks in the space. We evaluate our system on three different datasets and generate a total of 2885 prediction tasks and solve them. Out of these 722 were deemed useful by expert data scientists. We also show that an automatic prediction task discovery system is able to identify top 10 tasks that a user may like within a batch of 100 tasks.

Matrix factorization methods are extensively employed to understand complex data. In this paper, we introduce the cross-product penalized component analysis (XCAN), a sparse matrix factorization based on the optimization of a loss function that allows a trade-off between variance maximization and structural preservation. The approach is based on previous developments, notably (i) the Sparse Principal Component Analysis (SPCA) framework based on the LASSO, (ii) extensions of SPCA to constrain both modes of the factorization, like co-clustering or the Penalized Matrix Decomposition (PMD), and (iii) the Group-wise Principal Component Analysis (GPCA) method. The result is a flexible modeling approach that can be used for data exploration in a large variety of problems. We demonstrate its use with applications from different disciplines.

We tested in a live setting the use of active learning for selecting text sentences for human annotations used in training a Thai segmentation machine learning model. In our study, two concurrent annotated samples were constructed, one through random sampling of sentences from a text corpus, and the other through model-based scoring and ranking of sentences from the same corpus. In the course of the experiment, we observed the effect of significant changes to the learning environment which are likely to occur in real-world learning tasks. We describe how our active learning strategy interacted with these events and discuss other practical challenges encountered in using active learning in the live setting.

Drawing an inspiration from behavioral studies of human decision making, we propose here a general parametric framework for a reinforcement learning problem, which extends the standard Q-learning approach to incorporate a two-stream framework of reward processing with biases biologically associated with several neurological and psychiatric conditions, including Parkinson’s and Alzheimer’s diseases, attention-deficit/hyperactivity disorder (ADHD), addiction, and chronic pain. For AI community, the development of agents that react differently to different types of rewards can enable us to understand a wide spectrum of multi-agent interactions in complex real-world socioeconomic systems. Moreover, from the behavioral modeling perspective, our parametric framework can be viewed as a first step towards a unifying computational model capturing reward processing abnormalities across multiple mental conditions and user preferences in long-term recommendation systems.

In this work, we present graph star net (GraphStar), a novel and unified graph neural net architecture which utilizes message-passing relay and attention mechanism for multiple prediction tasks – node classification, graph classification and link prediction. GraphStar addresses many earlier challenges facing graph neural nets and achieves non-local representation without increasing the model depth or bearing heavy computational costs. We also propose a new method to tackle topic-specific sentiment analysis based on node classification and text classification as graph classification. Our work shows that ‘star nodes’ can learn effective graph-data representation and improve on current methods for the three tasks. Specifically, for graph classification and link prediction, GraphStar outperforms the current state-of-the-art models by 2-5% on several key benchmarks.

The subjective experience of consciousness is at once familiar and yet deeply mysterious. Strategies exploring the top-down mechanisms of conscious thought within the human brain have been unable to produce a generalized explanatory theory that scales through evolution and can be applied to artificial systems. Information Flow Theory (IFT) provides a novel framework for understanding both the development and nature of consciousness in any system capable of processing information. In prioritizing the direction of information flow over information computation, IFT produces a range of unexpected predictions. The purpose of this manuscript is to introduce the basic concepts of IFT and explore the manifold implications regarding artificial intelligence, superhuman consciousness, and our basic perception of reality.

This is lecture notes on the course “Stochastic Processes”. In this format, the course was taught in the spring semesters 2017 and 2018 for third-year bachelor students of the Department of Control and Applied Mathematics, School of Applied Mathematics and Informatics AT Moscow Institute of Physics and Technology. The base of this course was formed and taught for decades by professors from the Department of Mathematical Foundations of Control A.A. Natan, S.A. Guz, and O.G. Gorbachev. Besides standard chapters of stochastic processes theory (correlation theory, Markov processes) in this book (and lectures) the following chapters are included: von Neumann–Birkhoff–Khinchin ergodic theorem, macrosystem equilibrium concept, Markov Chain Monte Carlo, Markov decision processes and the secretary problem.

Genetic algorithms, which mimic evolutionary processes to solve optimization problems, can be enhanced by using powerful semi-local search algorithms as mutation operators. Here, we introduce reverse quantum annealing, a class of quantum evolutions that can be used for performing families of quasi-local or quasi-nonlocal search starting from a classical state, as novel sources of mutations. Reverse annealing enables the development of genetic algorithms that use quantum fluctuation for mutations and classical mechanisms for the crossovers — we refer to these as Quantum-Assisted Genetic Algorithms (QAGAs). We describe a QAGA and present experimental results using a D-Wave 2000Q quantum annealing processor. On a set of spin-glass inputs, standard (forward) quantum annealing finds good solutions very quickly but struggles to find global optima. In contrast, our QAGA proves effective at finding global optima for these inputs. This successful interplay of non-local classical and quantum fluctuations could provide a promising step toward practical applications of Noisy Intermediate-Scale Quantum (NISQ) devices for heuristic discrete optimization.

The work presented in this master thesis consists of extracting a set of events from texts written in natural language. For this purpose, we have based ourselves on the basic notions of the information extraction as well as the open information extraction. First, we applied an open information extraction(OIE) system for the relationship extraction, to highlight the importance of OIEs in event extraction, and we used the ontology to the event modeling. We tested the results of our approach with test metrics. As a result, the two-level event extraction approach has shown good performance results but requires a lot of expert intervention in the construction of classifiers and this will take time. In this context we have proposed an approach that reduces the expert intervention in the relation extraction, the recognition of entities and the reasoning which are automatic and based on techniques of adaptation and correspondence. Finally, to prove the relevance of the extracted results, we conducted a set of experiments using different test metrics as well as a comparative study.

Automated representation learning is behind many recent success stories in machine learning. It is often used to transfer knowledge learned from a large dataset (e.g., raw text) to tasks for which only a small number of training examples are available. In this paper, we review recent advance in learning to represent social media users in low-dimensional embeddings. The technology is critical for creating high performance social media-based human traits and behavior models since the ground truth for assessing latent human traits and behavior is often expensive to acquire at a large scale. In this survey, we review typical methods for learning a unified user embeddings from heterogeneous user data (e.g., combines social media texts with images to learn a unified user representation). Finally we point out some current issues and future directions.

Conditions are essential in the statements of biological literature. Without the conditions (e.g., environment, equipment) that were precisely specified, the facts (e.g., observations) in the statements may no longer be valid. One biological statement has one or multiple fact(s) and/or condition(s). Their subject and object can be either a concept or a concept’s attribute. Existing information extraction methods do not consider the role of condition in the biological statement nor the role of attribute in the subject/object. In this work, we design a new tag schema and propose a deep sequence tagging framework to structure conditional statement into fact and condition tuples from biological text. Experiments demonstrate that our method yields a information-lossless structure of the literature.

Data selection methods such as active learning and core-set selection are useful tools for machine learning on large datasets, but they can be prohibitively expensive to apply in deep learning. Unlike in other areas of machine learning, the feature representations that these techniques depend on are learned in deep learning rather than given, which takes a substantial amount of training time. In this work, we show that we can significantly improve the computational efficiency of data selection in deep learning by using a much smaller proxy model to perform data selection for tasks that will eventually require a large target model (e.g., selecting data points to label for active learning). In deep learning, we can scale down models by removing hidden layers or reducing their dimension to create proxies that are an order of magnitude faster. Although these small proxy models have significantly higher error, we find that they empirically provide useful rankings for data selection that have a high correlation with those of larger models. We evaluate this ‘selection via proxy’ (SVP) approach on several data selection tasks. For active learning, applying SVP to Sener and Savarese [2018]’s recent method for active learning in deep learning gives a 4x improvement in execution time while yielding the same model accuracy. For core-set selection, we show that a proxy model that trains 10x faster than a target ResNet164 model on CIFAR10 can be used to remove 50% of the training data without compromising the accuracy of the target model, making end-to-end training time improvements via core-set selection possible.

Dempster-Shafer evidence theory has been widely used in various fields of applications, because of the flexibility and effectiveness in modeling uncertainties without prior information. However, the existing evidence theory is insufficient to consider the situations where it has no capability to express the fluctuations of data at a given phase of time during their execution, and the uncertainty and imprecision which are inevitably involved in the data occur concurrently with changes to the phase or periodicity of the data. In this paper, therefore, a generalized Dempster-Shafer evidence theory is proposed. To be specific, a mass function in the generalized Dempster-Shafer evidence theory is modeled by a complex number, called as a complex basic belief assignment, which has more powerful ability to express uncertain information. Based on that, a generalized Dempster’s combination rule is exploited. In contrast to the classical Dempster’s combination rule, the condition in terms of the conflict coefficient between the evidences K<1 is released in the generalized Dempster’s combination rule. Hence, it is more general and applicable than the classical Dempster’s combination rule. When the complex mass function is degenerated from complex numbers to real numbers, the generalized Dempster’s combination rule degenerates to the classical evidence theory under the condition that the conflict coefficient between the evidences K is less than 1. In a word, this generalized Dempster-Shafer evidence theory provides a promising way to model and handle more uncertain information.

Paper proposes a hierarchical learning strategy for generation of sparse representations which capture the information content in large datasets and act as a model. The hierarchy arises from the approximation spaces considered at successively finer data dependent scales. Paper presents a detailed analysis of stability, convergence and behavior of error functionals associated with the approximations and well chosen set of applications. Results show the performance of the approach as a data reduction mechanism on both synthetic (univariate and multivariate) and real datasets (geo-spatial, computer vision and numerical model outcomes). The sparse model generated is shown to efficiently reconstruct data and minimize error in prediction.

Though it has been recognized that recommending serendipitous (i.e., surprising and relevant) items can be helpful for increasing users’ satisfaction and behavioral intention, how to measure serendipity in the offline environment is still an open issue. In recent years, a number of metrics have been proposed, but most of them were based on researchers’ assumptions due to the serendipity’s subjective nature. In order to validate these metrics’ actual performance, we collected over 10,000 users’ real feedback data and compared with the metrics’ results. It turns out the user profile based metrics, especially content-based ones, perform better than those based on item popularity, in terms of estimating the unexpectedness facet of recommendations. Moreover, the full metrics, which involve the unexpectedness component, relevance, timeliness, and user curiosity, can more accurately indicate the recommendation’s serendipity degree, relative to those that just involve some of them. The application of these metrics to several recommender algorithms further consolidates their practical usage, because the comparison results are consistent with those from user evaluation. Thus, this work is constructive for filling the gap between offline measurement and user study on recommendation serendipity.

Blockchain is a promising technology for establishing trust in IoT networks, where network nodes do not necessarily trust each other. Cryptographic hash links and distributed consensus mechanisms ensure that the data stored on an immutable blockchain can not be altered or deleted. However, blockchain mechanisms do not guarantee the trustworthiness of data at the origin. We propose a layered architecture for improving the end-to-end trust that can be applied to a diverse range of blockchain-based IoT applications. Our architecture evaluates the trustworthiness of sensor observations at the data layer and adapts block verification at the blockchain layer through the proposed data trust and gateway reputation modules. We present the performance evaluation of the data trust module using a simulated indoor target localization and the gateway reputation module using an end-to-end blockchain implementation, together with a qualitative security analysis for the architecture.

With the recent advances in Reinforcement Learning (RL), there have been tremendous interests in employing RL for recommender systems. RL-based recommender systems have two key advantages: (i) they can continuously update their recommendation strategies according to users’ real-time feedback, and (ii) the optimal strategy maximizes the long-term reward from users, such as the total revenue of a recommendation session. However, directly training and evaluating a new RL-based recommendation algorithm needs to collect users’ real-time feedback in the real system, which is time and efforts consuming and could negatively impact on users’ experiences. Thus, it calls for a user simulator that can mimic real users’ behaviors where we can pre-train and evaluate new recommendation algorithms. Simulating users’ behaviors in a dynamic system faces immense challenges — (i) the underlining item distribution is complex, and (ii) historical logs for each user are limited. In this paper, we develop a user simulator base on Generative Adversarial Network (GAN). To be specific, we design the generator to capture the underlining distribution of users’ historical logs and generate realistic logs that can be considered as augmentations of real logs; while the discriminator is developed to not only distinguish real and fake logs but also predict users’ behaviors. The experimental results based on real-world e-commerce data demonstrate the effectiveness of the proposed simulator. Further experiments have been conducted to understand the importance of each component in the simulator.

Model selection is treated as a standard performance boosting step in many machine learning applications. Once all other properties of a learning problem are fixed, the model is selected by grid search on a held-out validation set. This is strictly inapplicable to active learning. Within the standardized workflow, the acquisition function is chosen among available heuristics a priori, and its success is observed only after the labeling budget is already exhausted. More importantly, none of the earlier studies report a unique consistently successful acquisition heuristic to the extent to stand out as the unique best choice. We present a method to break this vicious circle by defining the acquisition function as a learning predictor and training it by reinforcement feedback collected from each labeling round. As active learning is a scarce data regime, we bootstrap from a well-known heuristic that filters the bulk of data points on which all heuristics would agree, and learn a policy to warp the top portion of this ranking in the most beneficial way for the character of a specific data distribution. Our system consists of a Bayesian neural net, the predictor, a bootstrap acquisition function, a probabilistic state definition, and another Bayesian policy network that can effectively incorporate this input distribution. We observe on three benchmark data sets that our method always manages to either invent a new superior acquisition function or to adapt itself to the a priori unknown best performing heuristic for each specific data set.

Recently there emerge many distributed algorithms that aim at solving subgraph matching at scale. Existing algorithm-level comparisons failed to provide a systematic view to the pros and cons of each algorithm mainly due to the intertwining of strategy and optimization. In this paper, we identify four strategies and three general-purpose optimizations from representative state-of-the-art works. We implement the four strategies with the optimizations based on the common Timely dataflow system for systematic strategy-level comparison. Our implementation covers all representation algorithms. We conduct extensive experiments for both unlabelled matching and labelled matching to analyze the performance of distributed subgraph matching under various settings, which is finally summarized as a practical guide.

Hyperparameter tuning is an omnipresent problem in machine learning as it is an integral aspect of obtaining the state-of-the-art performance for any model. Most often, hyperparameters are optimized just by training a model on a grid of possible hyperparameter values and taking the one that performs best on a validation sample (grid search). More recently, methods have been introduced that build a so-called surrogate model that predicts the validation loss for a specific hyperparameter setting, model and dataset and then sequentially select the next hyperparameter to test, based on a heuristic function of the expected value and the uncertainty of the surrogate model called acquisition function (sequential model-based Bayesian optimization, SMBO). In this paper we model the hyperparameter optimization problem as a sequential decision problem, which hyperparameter to test next, and address it with reinforcement learning. This way our model does not have to rely on a heuristic acquisition function like SMBO, but can learn which hyperparameters to test next based on the subsequent reduction in validation loss they will eventually lead to, either because they yield good models themselves or because they allow the hyperparameter selection policy to build a better surrogate model that is able to choose better hyperparameters later on. Experiments on a large battery of 50 data sets demonstrate that our method outperforms the state-of-the-art approaches for hyperparameter learning.

We describe a limitation in the expressiveness of the predictive uncertainty estimate given by mean-field variational inference (MFVI), a popular approximate inference method for Bayesian neural networks. In particular, MFVI fails to give calibrated uncertainty estimates in between separated regions of observations. This can lead to catastrophically overconfident predictions when testing on out-of-distribution data. Avoiding such overconfidence is critical for active learning, Bayesian optimisation and out-of-distribution robustness. We instead find that a classical technique, the linearised Laplace approximation, can handle ‘in-between’ uncertainty much better for small network architectures.

We unify the Bayesian and Frequentist justifications for model selection based upon maximizing the evidence, using a precise definition of model complexity which we call ‘flexibility’. In the Gaussian linear model, flexibility is asymptotically equal to the Bayesian Information Criterion (BIC) penalty. But we argue against replacing flexibility with the BIC penalty. Instead, we advocate estimating the evidence directly, for which there now exists a wide range of approaches in the literature.

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans across sentences and is consequently better at recognizing long conversations. Specifically, we propose to use the text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate with better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

We propose an approach for transfer learning with GAN architectures. In general, transfer learning enables deep networks for classification tasks to be trained with limited computing and data resources. However a similar approach is missing in the specific context of generative tasks. This is partly due to the fact that the extremal layers of the two networks of a GAN, which should be learned during transfer, are on two opposite sides. This requires back-propagating information through both networks, which is computationally expensive. We develop a method to directly train these extremal layers against each other, by-passing all the intermediate layers. We also prove rigorously, for Wasserstein GANs, a theorem ensuring the convergence of the learning of the transferred GAN. Finally, we compare our method to state-of-the-art methods and show that our method converges much faster and requires less data.

Anomaly detection is a significant problem faced in several research areas. Detecting and correctly classifying something unseen as anomalous is a challenging problem that has been tackled in many different manners over the years. Generative Adversarial Networks (GANs) and the adversarial training process have been recently employed to face this task yielding remarkable results. In this paper we survey the principal GAN-based anomaly detection methods, highlighting their pros and cons. Our contributions are the empirical validation of the main GAN models for anomaly detection, the increase of the experimental results on different datasets and the public release of a complete Open Source toolbox for Anomaly Detection using GANs.

We identify two major steps in data analysis, data exploration for understanding and observing patterns/relationships in data; and construction, design and assessment of various models to formalize these relationships. For each step, there exists a large set of tools and software. For the first step, many visualization tools exist, such as, GGobi, Parallax, and Crystal Vision, and most recently tableau and plottly. For the second step, many Scientific Computing Environments (SCEs) exist, such as, Matlab, Mathematica, R and Python. However, there does not exist a tool which allows for seamless two-way interaction between visualization tools and SCEs. We have designed and implemented a data visualization platform (DVP) with an architecture and design that attempts to bridge this gap. DVP connects seamlessly to SCEs to bring the computational capabilities to the visualization methods in a single coherent platform. DVP is designed with two interfaces, the desktop stand alone version and the online interface. DVP with its structure and planned features is a unique software that serves a great deal of parties, including university research, governmental decision support and country’s economy modeling, traffic analysis and control, financial sector and companies, and any other party interested in data analysis and interpretation. A free demo for the online interface of DVP is available \citep{DVP}. Since DVP was launched, circa 2012, the present manuscript was not published since today for commercialization and patent considerations.

Most semantic parsers that map sentences to graph-based meaning representations are hand-designed for specific graphbanks. We present a compositional neural semantic parser which achieves, for the first time, competitive accuracies across a diverse range of graphbanks. Incorporating BERT embeddings and multi-task learning improves the accuracy further, setting new states of the art on DM, PAS, PSD, AMR 2015 and EDS.

Singular Value Decomposition (SVD) constitutes a bridge between the linear algebra concepts and multi-layer neural networks—it is their linear analogy. Besides of this insight, it can be used as a good initial guess for the network parameters, leading to substantially better optimization results.

In this work we present a novel approach for transfer-guided exploration in reinforcement learning that is inspired by the human tendency to leverage experiences from similar encounters in the past while navigating a new task. Given an optimal policy in a related task-environment, we show that its bisimulation distance from the current task-environment gives a lower bound on the optimal advantage of state-action pairs in the current task-environment. Transfer-guided Exploration (ExTra) samples actions from a Softmax distribution over these lower bounds. In this way, actions with potentially higher optimum advantage are sampled more frequently. In our experiments on gridworld environments, we demonstrate that given access to an optimal policy in a related task-environment, ExTra can outperform popular domain-specific exploration strategies viz. epsilon greedy, Model-Based Interval Estimation – Exploration Based (MBIE-EB), Pursuit and Boltzmann in terms of sample complexity and rate of convergence. We further show that ExTra is robust to choices of source task and shows a graceful degradation of performance as the dissimilarity of the source task increases. We also demonstrate that ExTra, when used alongside traditional exploration algorithms, improves their rate of convergence. Thus it is capable of complimenting the efficacy of traditional exploration algorithms.

Graph neural networks have become increasingly popular in recent years due to their ability to naturally encode relational input data and their ability to scale to large graphs by operating on a sparse representation of graph adjacency matrices. As we look to scale up these models using custom hardware, a natural assumption would be that we need hardware tailored to sparse operations and/or dynamic control flow. In this work, we question this assumption by scaling up sparse graph neural networks using a platform targeted at dense computation on fixed-size data. Drawing inspiration from optimization of numerical algorithms on sparse matrices, we develop techniques that enable training the sparse graph neural network model from Allamanis et al. [2018] in 13 minutes using a 512-core TPUv2 Pod, whereas the original training takes almost a day.

Generally, anomaly detection has a great importance particularly in applied statistical signal processing. Here we provide a general framework in order to detect anomaly through the statistical modeling. In this paper, it is assumed that a signal is corrupted by noise whose variance follows an ARMA model. The assumption on the signal is further compromised to encompass the inherent nonstationarity associated with natural phenomenon, hence, the signal of interest is assumed to follow an ARIMA model and the noise to denote an anomaly, however, unknown. Anomaly is assumed to possess heteroskedastic properties, therefore, ARCH/GARCH modeling could extract the anomaly pattern given an additive model for signal of interest and anomaly.

Gradient boosted models are a fundamental machine learning technique. Robustness to small perturbations of the input is an important quality measure for machine learning models, but the literature lacks a method to prove the robustness of gradient boosted models. This work introduces VeriGB, a tool for quantifying the robustness of gradient boosted models. VeriGB encodes the model and the robustness property as an SMT formula, which enables state of the art verification tools to prove the model’s robustness. We extensively evaluate VeriGB on publicly available datasets and demonstrate a capability for verifying large models. Finally, we show that some model configurations tend to be inherently more robust than others.

In our connected world, services are expected to be delivered at speed through multiple means with seamless communication. To put it in day to day conversational terms, ‘there is an app for it’ attitude prevails. Several technologies are needed to meet this growing demand and indeed these technologies are being developed. The first noteworthy is Internet of Things (IoT), which is in itself coupled technologies to deliver seamless communication with ‘anywhere, anytime’ as an underlying objective. The ‘anywhere, anytime’ service delivery paradigm requires a new type of smart systems in developing these services with better capabilities to interact with the human user, such as personalisation, affect state recognition, etc. Here enter cognitive systems, where AI meets cognitive sciences (e.g. cognitive psychology, linguistics, social cognition, etc.). In this paper we will examine the requirements imposed by smart cities development, e.g. intelligent logistics, sensor networks and domestic appliances connectivity, data streams and media delivery, to mention but few. Then we will explore how cognitive systems can meet the challenges these requirements present to the development of new systems. Throughout our discussion here, examples from our recent and current projects will be given supplemented by examples from the literature.

The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite sample uniform guarantees to the reconstruction error pertaining to the estimated projection sub-space. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.

Data augmentation is a popular technique largely used to enhance the training of convolutional neural networks. Although many of its benefits are well known by deep learning researchers and practitioners, its implicit regularization effects, as compared to popular explicit regularization techniques, such as weight decay and dropout, remain largely unstudied. As a matter of fact, convolutional neural networks for image object classification are typically trained with both data augmentation and explicit regularization, assuming the benefits of all techniques are complementary. In this paper, we systematically analyze these techniques through ablation studies of different network architectures trained with different amounts of training data. Our results unveil a largely ignored advantage of data augmentation: networks trained with just data augmentation more easily adapt to different architectures and amount of training data, as opposed to weight decay and dropout, which require specific fine-tuning of their hyperparameters.

This paper aims to question the suitability of the Turing Test, for testing machine intelligence, in the light of advances made in the last 60 years in science, medicine, and philosophy of mind. While the main concept of the test may seem sound and valid, a detailed analysis of what is required to pass the test highlights a significant flow. Once the analysis of the test is presented, a systematic approach is followed in analysing what is needed to devise a test or tests for intelligent machines. The paper presents a plausible generic framework based on categories of factors implied by subjective perception of intelligence. An evaluative discussion concludes the paper highlighting some of the unaddressed issues within this generic framework.

Blockchains are tamper evident and tamper resistant digital ledgers implemented in a distributed fashion (i.e., without a central repository) and usually without a central authority (i.e., a bank, company, or government). At their basic level, they enable a community of users to record transactions in a shared ledger within that community, such that under normal operation of the blockchain network no transaction can be changed once published. This document provides a high-level technical overview of blockchain technology. The purpose is to help readers understand how blockchain technology works.

A learning control system is presented suitable for control affine nonlinear plants based on discrete-time Chen-Fliess series and capable of incorporating knowledge of a given physical model. The underlying noncommutative algebraic and combinatorial structures needed to realize the multivariable case are also described. The method is demonstrated using a two-input, two-output Lotka-Volterra system.

We present a versatile formulation of the convolution operation that we term a ‘mapped convolution.’ The standard convolution operation implicitly samples the pixel grid and computes a weighted sum. Our mapped convolution decouples these two components, freeing the operation from the confines of the image grid and allowing the kernel to process any type of structured data. As a test case, we demonstrate its use by applying it to dense inference on spherical data. We perform an in-depth study of existing spherical image convolution methods and propose an improved sampling method for equirectangular images. Then, we discuss the impact of data discretization when deriving a sampling function, highlighting drawbacks of the cube map representation for spherical data. Finally, we illustrate how mapped convolutions enable us to convolve directly on a mesh by projecting the spherical image onto a geodesic grid and training on the textured mesh. This method exceeds the state of the art for spherical depth estimation by nearly 17%. Our findings suggest that mapped convolutions can be instrumental in expanding the application scope of convolutional neural networks.

We derive generalization and excess risk bounds for neural nets using a family of complexity measures based on a multilevel relative entropy. The bounds are obtained by introducing the notion of generated hierarchical coverings of neural nets and by using the technique of chaining mutual information introduced in Asadi et al. NeurIPS’18. The resulting bounds are algorithm-dependent and exploit the multilevel structure of neural nets. This, in turn, leads to an empirical risk minimization problem with a multilevel entropic regularization. The minimization problem is resolved by introducing a multi-scale generalization of the celebrated Gibbs posterior distribution, proving that the derived distribution achieves the unique minimum. This leads to a new training procedure for neural nets with performance guarantees, which exploits the chain rule of relative entropy rather than the chain rule of derivatives (as in backpropagation). To obtain an efficient implementation of the latter, we further develop a multilevel Metropolis algorithm simulating the multi-scale Gibbs distribution, with an experiment for a two-layer neural net on the MNIST data set.

We present an approach to Bayesian Optimization that allows for robust search strategies over a large class of challenging functions. Our method is motivated by the belief that the trends useful to exploit in search of the optimum typically are a subset of the characteristics of the true objective function. At the core of our approach is the use of a Latent Gaussian Process Regression model that allows us to modulate the input domain with an orthogonal latent space. Using this latent space we can encapsulate local information about each observed data point that can be used to guide the search problem. We show experimentally that our method can be used to significantly improve performance on challenging benchmarks.

Most time-series models assume that the data come from observations that are equally spaced in time. However, this assumption does not hold in many diverse scientific fields, such as astronomy, finance, and climatology, among others. There are some techniques that fit unequally spaced time series, such as the continuous-time autoregressive moving average (CARMA) processes. These models are defined as the solution of a stochastic differential equation. It is not uncommon in astronomical time series, that the time gaps between observations are large. Therefore, an alternative suitable approach to modeling astronomical time series with large gaps between observations should be based on the solution of a difference equation of a discrete process. In this work we propose a novel model to fit irregular time series called the complex irregular autoregressive (CIAR) model that is represented directly as a discrete-time process. We show that the model is weakly stationary and that it can be represented as a state-space system, allowing efficient maximum likelihood estimation based on the Kalman recursions. Furthermore, we show via Monte Carlo simulations that the finite sample performance of the parameter estimation is accurate. The proposed methodology is applied to light curves from periodic variable stars, illustrating how the model can be implemented to detect poor adjustment of the harmonic model. This can occur when the period has not been accurately estimated or when the variable stars are multiperiodic. Last, we show how the CIAR model, through its state space representation, allows unobserved measurements to be forecast.

As the core of recommender system, collaborative filtering (CF) models the affinity between a user and an item from historical user-item interactions, such as clicks, purchases, and so on. Benefited from the strong representation power, neural networks have recently revolutionized the recommendation research, setting up a new standard for CF. However, existing neural recommender models do not explicitly consider the correlations among embedding dimensions, making them less effective in modeling the interaction function between users and items. In this work, we emphasize on modeling the correlations among embedding dimensions in neural networks to pursue higher effectiveness for CF. We propose a novel and general neural collaborative filtering framework, namely ConvNCF, which is featured with two designs: 1) applying outer product on user embedding and item embedding to explicitly model the pairwise correlations between embedding dimensions, and 2) employing convolutional neural network above the outer product to learn the high-order correlations among embedding dimensions. To justify our proposal, we present three instantiations of ConvNCF by using different inputs to represent a user and conduct experiments on two real-world datasets. Extensive results verify the utility of modeling embedding dimension correlations with ConvNCF, which outperforms several competitive CF methods.

Data augmentation is a critical component of training deep learning models. Although data augmentation has been shown to significantly improve image classification, its potential has not been thoroughly investigated for object detection. Given the additional cost for annotating images for object detection, data augmentation may be of even greater importance for this computer vision task. In this work, we study the impact of data augmentation on object detection. We first demonstrate that data augmentation operations borrowed from image classification may be helpful for training detection models, but the improvement is limited. Thus, we investigate how learned, specialized data augmentation policies improve generalization performance for detection models. Importantly, these augmentation policies only affect training and leave a trained model unchanged during evaluation. Experiments on the COCO dataset indicate that an optimized data augmentation policy improves detection accuracy by more than +2.3 mAP, and allow a single inference model to achieve a state-of-the-art accuracy of 50.7 mAP. Importantly, the best policy found on COCO may be transferred unchanged to other detection datasets and models to improve predictive accuracy. For example, the best augmentation policy identified with COCO improves a strong baseline on PASCAL-VOC by +2.7 mAP. Our results also reveal that a learned augmentation policy is superior to state-of-the-art architecture regularization methods for object detection, even when considering strong baselines. Code for training with the learned policy is available online at https://…/detection

Ontology-based knowledge bases (KBs) like DBpedia are very valuable resources, but their usefulness and usability is limited by various quality issues. One such issue is the use of string literals instead of semantically typed entities. In this paper we study the automated canonicalization of such literals, i.e., replacing the literal with an existing entity from the KB or with a new entity that is typed using classes from the KB. We propose a framework that combines both reasoning and machine learning in order to predict the relevant entities and types, and we evaluate this framework against state-of-the-art baselines for both semantic typing and entity matching.

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of gaussian linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.

To solve tasks in new environments involving objects unseen during training, agents must reason over prior information about those objects and their relations. We introduce the Prior Knowledge Graph network, an architecture for combining prior information, structured as a knowledge graph, with a symbolic parsing of the visual scene, and demonstrate that this approach is able to apply learned relations to novel objects whereas the baseline algorithms fail. Ablation experiments show that the agents ground the knowledge graph relations to semantically-relevant behaviors. In both a Sokoban game and the more complex Pacman environment, our network is also more sample efficient than the baselines, reaching the same performance in 5-10x fewer episodes. Once the agents are trained with our approach, we can manipulate agent behavior by modifying the knowledge graph in semantically meaningful ways. These results suggest that our network provides a framework for agents to reason over structured knowledge graphs while still leveraging gradient based learning approaches.

We address the problem of signal denoising and pattern recognition in processing batch-mode time-series data by combining linear time-invariant filters, orthogonal multiresolution representations, and sparsity-based methods. We propose a novel approach to designing higher-order zero-phase low-pass, high-pass, and band-pass infinite impulse response filters as matrices, using spectral transformation of the state-space representation of digital filters. We also propose a proximal gradient-based technique to factorize a special class of zero-phase high-pass and band-pass digital filters so that the factorization product preserves the zero-phase property of the filter and also incorporates a sparse-derivative component of the input in the signal model. To demonstrate applications of our novel filter designs, we validate and propose new signal models to simultaneously denoise and identify patterns of interest. We begin by using our proposed filter design to test an existing signal model that simultaneously combines linear time invariant (LTI) filters and sparsity-based methods. We develop a new signal model called sparsity-assisted signal denoising (SASD) by combining our proposed filter designs with the existing signal model. Thereafter, we propose and derive a new signal model called sparsity-assisted pattern recognition (SAPR). In SAPR, we combine LTI band-pass filters and sparsity-based methods with orthogonal multiresolution representations, such as wavelets, to detect specific patterns in the input signal. Finally, we combine the signal denoising and pattern recognition tasks, and derive a new signal model called the sparsity-assisted signal denoising and pattern recognition (SASDPR). We illustrate the capabilities of the SAPR and SASDPR frameworks using sleep-electroencephalography data to detect K-complexes and sleep spindles, respectively.

A substantial portion of the literature on fairness in algorithms proposes, analyzes, and operationalizes simple formulaic criteria for assessing fairness. Two of these criteria, Equalized Odds and Calibration by Group, have gained significant attention for their simplicity and intuitive appeal, but also for their incompatibility. This chapter provides a perspective on the meaning and consequences of these and other fairness criteria using graphical models which reveals Equalized Odds and related criteria to be ultimately misleading. An assessment of various graphical models suggests that fairness criteria should ultimately be case-specific and sensitive to the nature of the information the algorithm processes.

Recommender systems are a critical tool to match listings and travelers in two-sided vacation rental marketplaces. Such systems require high capacity to extract user preferences for items from implicit signals at scale. To learn those preferences, we propose a Simple Deep Personalized Recommendation System to compute travelers’ conditional embeddings. Our method combines listing embeddings in a supervised structure to build short-term historical context to personalize recommendations for travelers. This approach is computationally efficient and scalable, and allows us to capture non-linear dependencies. Our offline evaluation indicates that traveler embeddings created using a Deep Average Network can improve the precision of a downstream conversion prediction model by seven percent, outperforming more complex benchmark methods for shopping experience personalization.

We study two problems in high-dimensional robust statistics: \emph{robust mean estimation} and \emph{outlier detection}. In robust mean estimation the goal is to estimate the mean of a distribution on given independent samples, an -fraction of which have been corrupted by a malicious adversary. In outlier detection the goal is to assign an \emph{outlier score} to each element of a data set such that elements more likely to be outliers are assigned higher scores. Our algorithms for both problems are based on a new outlier scoring method we call QUE-scoring based on \emph{quantum entropy regularization}. For robust mean estimation, this yields the first algorithm with optimal error rates and nearly-linear running time in all parameters, improving on the previous fastest running time . For outlier detection, we evaluate the performance of QUE-scoring via extensive experiments on synthetic and real data, and demonstrate that it often performs better than previously proposed algorithms. Code for these experiments is available at https://…/que-outlier-detection .

Cognitive task analysis (CTA) is a type of analysis in applied psychology aimed at eliciting and representing the knowledge and thought processes of domain experts. In CTA, often heavy human labor is involved to parse the interview transcript into structured knowledge (e.g., flowchart for different actions). To reduce human efforts and scale the process, automated CTA transcript parsing is desirable. However, this task has unique challenges as (1) it requires the understanding of long-range context information in conversational text; and (2) the amount of labeled data is limited and indirect—i.e., context-aware, noisy, and low-resource. In this paper, we propose a weakly-supervised information extraction framework for automated CTA transcript parsing. We partition the parsing process into a sequence labeling task and a text span-pair relation extraction task, with distant supervision from human-curated protocol files. To model long-range context information for extracting sentence relations, neighbor sentences are involved as a part of input. Different types of models for capturing context dependency are then applied. We manually annotate real-world CTA transcripts to facilitate the evaluation of the parsing tasks

This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 128 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.

We propose design guidelines for a probabilistic programming facility suitable for deployment as a part of a production software system. As a reference implementation, we introduce Infergo, a probabilistic programming facility for Go, a modern programming language of choice for server-side software development. We argue that a similar probabilistic programming facility can be added to most modern general-purpose programming languages. Probabilistic programming enables automatic tuning of program parameters and algorithmic decision making through probabilistic inference based on the data. To facilitate addition of probabilistic programming capabilities to other programming languages, we share implementation choices and techniques employed in development of Infergo. We illustrate applicability of Infergo to various use cases on case studies, and evaluate Infergo’s performance on several benchmarks, comparing Infergo to dedicated inference-centric probabilistic programming frameworks.

In Reinforcement Learning we look for meaning in the flow of input/output information. If we do not find meaning, the information flow is not more than noise to us. Before we are able to find meaning, we should first learn how to discover and identify objects. What is an object? In this article we will demonstrate that an object is an event-driven model. These models are a generalization of action-driven models. In Markov Decision Process we have an action-driven model which changes its state at each step. The advantage of event-driven models is their greater sustainability as they change their states only upon the occurrence of particular events. These events may occur very rarely, therefore the state of the event-driven model is much more predictable.

Reinforcement Learning, a machine learning framework for training an autonomous agent based on rewards, has shown outstanding results in various domains. However, it is known that learning a good policy is difficult in a domain where rewards are rare. We propose a method, optimistic proximal policy optimization (OPPO) to alleviate this difficulty. OPPO considers the uncertainty of the estimated total return and optimistically evaluates the policy based on that amount. We show that OPPO outperforms the existing methods in a tabular task.

Humans do not make inferences over texts, but over models of what texts are about. When annotators are asked to annotate coreferent spans of text, it is therefore a somewhat unnatural task. This paper presents an alternative in which we preprocess documents, linking entities to a knowledge base, and turn the coreference annotation task — in our case limited to pronouns — into an annotation task where annotators are asked to assign pronouns to entities. Model-based annotation is shown to lead to faster annotation and higher inter-annotator agreement, and we argue that it also opens up for an alternative approach to coreference resolution. We present two new coreference benchmark datasets, for English Wikipedia and English teacher-student dialogues, and evaluate state-of-the-art coreference resolvers on them.

Manifold learning techniques for dynamical systems and time series have shown their utility for a broad spectrum of applications in recent years. While these methods are effective at learning a low-dimensional representation, they are often insufficient for visualizing the global and local structure of the data. In this paper, we present DIG (Dynamical Information Geometry), a visualization method for multivariate time series data that extracts an information geometry from a diffusion framework. Specifically, we implement a novel group of distances in the context of diffusion operators, which may be useful to reveal structure in the data that may not be accessible by the commonly used diffusion distances. Finally, we present a case study applying our visualization tool to EEG data to visualize sleep stages.

We investigate the difficulties of training sparse neural networks and make new observations about optimization dynamics and the energy landscape within the sparse regime. Recent work of \citep{Gale2019, Liu2018} has shown that sparse ResNet-50 architectures trained on ImageNet-2012 dataset converge to solutions that are significantly worse than those found by pruning. We show that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the ‘good’ solution. Additionally, our attempts to find a decreasing objective path from ‘bad’ solutions to the ‘good’ ones in the sparse subspace fail. However, if we allow the path to traverse the dense subspace, then we consistently find a path between two solutions. These findings suggest traversing extra dimensions may be needed to escape stationary points found in the sparse subspace.

Structural pruning of neural network parameters reduces computation, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method using the first and second-order Taylor expansions to approximate a filter’s contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we measured experimentally a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods leads to an improvement over state-of-the-art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPS reduction by removing 30% of the parameters, with a loss of 0.02% in the top-1 accuracy on ImageNet. Code is available at https://…/Taylor_pruning.

We study black-box reductions from mechanism design to algorithm design for welfare maximization in settings of incomplete information. Given oracle access to an algorithm for an underlying optimization problem, the goal is to simulate an incentive compatible mechanism. The mechanism will be evaluated on its expected welfare, relative to the algorithm provided, and its complexity is measured by the time (and queries) needed to simulate the mechanism on any input. While it is known that black-box reductions are not possible in many prior-free settings, settings with priors appear more promising: there are known reductions for Bayesian incentive compatible (BIC) mechanism design for general classes of welfare maximization problems. This dichotomy begs the question: which mechanism design problems admit black-box reductions, and which do not? Our main result is that black-box mechanism design is impossible under two of the simplest settings not captured by known positive results. First, for the problem of allocating goods to a single buyer whose valuation is additive and independent across the goods, subject to a downward-closed constraint on feasible allocations, we show that there is no polytime (in ) BIC black-box reduction for expected welfare maximization. Second, for the setting of multiple single-parameter agents—where polytime BIC reductions are known—we show that no polytime reductions exist when the incentive requirement is tightened to Max-In-Distributional-Range. In each case, we show that achieving a sub-polynomial approximation to the expected welfare requires exponentially many queries, even when the set of feasible allocations is known to be downward-closed.

Deep learning has been used as a powerful tool for various tasks in computer vision, such as image segmentation, object recognition and data generation. A key part of end-to-end training is designing the appropriate encoder to extract specific features from the input data. However, few encoders maintain the topological properties of data, such as connection structures and global contours. In this paper, we introduce a Voronoi Diagram encoder based on convex set distance (CSVD) and apply it in edge encoding. The boundaries of Voronoi cells is related to detected edges of structures and contours. The CSVD model improves contour extraction in CNN and structure generation in GAN. We also show the experimental results and demonstrate that the proposed model has great potentiality in different visual problems where topology information should be involved.

It is well known that a speech recognition system that combines multiple acoustic models trained on the same data significantly outperforms a single-model system. Unfortunately, real time speech recognition using a whole ensemble of models is too computationally expensive. In this paper, we propose to distill the knowledge of essence in an ensemble of models (i.e. the teacher model) to a single model (i.e. the student model) that needs much less computation to deploy. Previously, all the soften outputs of the teacher model are used to optimize the student model. We argue that not all the outputs of the ensemble are necessary to be distilled. Some of the outputs may even contain noisy information that is useless or even harmful to the training of the student model. In addition, we propose to train the student model with a multitask learning approach by utilizing both the soften outputs of the teacher model and the correct hard labels. The proposed method achieves some surprising results on the Switchboard data set. When the student model is trained together with the correct labels and the essence knowledge from the teacher model, it not only significantly outperforms another single model with the same architecture that is trained only with the correct labels, but also consistently outperforms the teacher model that is used to generate the soft labels.

As demand for Real-Time applications rises among the general public, the importance of enabling large-scale, unbound algorithms to solve conventional problems with low to no latency is critical for product viability. Timer algorithms are prevalent in the core mechanisms behind operating systems, network protocol implementation, stream processing, and several database capabilities. This paper presents a field-tested algorithm for low latency, unbound range timer structure, based upon the well excepted Timing Wheel algorithm. Using a set of queues hashed by TTL, the algorithm allows for a simpler implementation, minimal overhead no overflow and no performance degradation in comparison to the current state of the algorithms under typical use cases.

In this paper, we improved the performance of the contrast source inversion (CSI) method by incorporating a so-called cross-correlated cost functional, which interrelates the state error and the data error in the measurement domain. The proposed method is referred to as the cross-correlated CSI. It enables better robustness and higher inversion accuracy than both the classical CSI and multiplicative regularized CSI (MR-CSI). In addition, we show how the gradient of the modified cost functional can be calculated without significantly increasing the computational burden. The advantages of the proposed algorithms are demonstrated using a 2-D benchmark problem excited by a transverse magnetic wave as well as a transverse electric wave, respectively, in comparison to classical CSI and MR-CSI.

This paper introduces a novel deep learning based method, named bridge neural network (BNN) to dig the potential relationship between two given data sources task by task. The proposed approach employs two convolutional neural networks that project the two data sources into a feature space to learn the desired common representation required by the specific task. The training objective with artificial negative samples is introduced with the ability of mini-batch training and it’s asymptotically equivalent to maximizing the total correlation of the two data sources, which is verified by the theoretical analysis. The experiments on the tasks, including pair matching, canonical correlation analysis, transfer learning, and reconstruction demonstrate the state-of-the-art performance of BNN, which may provide new insights into the aspect of common representation learning.

The problem of time-series clustering is considered in the case where each data-point is a sample generated by a piecewise stationary ergodic process. Stationary processes are perhaps the most general class of processes considered in non-parametric statistics and allow for arbitrary long-range dependence between variables. Piecewise stationary processes studied here for the first time in the context of clustering, relax the last remaining assumption in this model: stationarity. A natural formulation is proposed for this problem and a notion of consistency is introduced which requires the samples to be placed in the same cluster if and only if the piecewise stationary distributions that generate them have the same set of stationary distributions. Simple, computationally efficient algorithms are proposed and are shown to be consistent without any additional assumptions beyond piecewise stationarity.

Interpretability of machine learning (ML) models becomes more relevant with their increasing adoption. In this work, we address the interpretability of ML based question answering (QA) models on a combination of knowledge bases (KB) and text documents. We adapt post hoc explanation methods such as LIME and input perturbation (IP) and compare them with the self-explanatory attention mechanism of the model. For this purpose, we propose an automatic evaluation paradigm for explanation methods in the context of QA. We also conduct a study with human annotators to evaluate whether explanations help them identify better QA models. Our results suggest that IP provides better explanations than LIME or attention, according to both automatic and human evaluation. We obtain the same ranking of methods in both experiments, which supports the validity of our automatic evaluation paradigm.

In this paper we consider the two-armed bandit problem, which often naturally appears per se or as a subproblem in some multi-armed generalizations, and serves as a starting point for introducing additional problem features. The consideration of binary responses is motivated by its widespread applicability and by being one of the most studied settings. We focus on the undiscounted finite-horizon objective, which is the most relevant in many applications. We make an attempt to unify the terminology as this is different across disciplines that have considered this problem, and present a unified model cast in the Markov decision process framework, with subject responses modelled using the Bernoulli distribution, and the corresponding Beta distribution for Bayesian updating. We give an extensive account of the history and state of the art of approaches from several disciplines, including design of experiments, Bayesian decision theory, naive designs, reinforcement learning, biostatistics, and combination designs. We evaluate these designs, together with a few newly proposed, accurately computationally (using a newly written package in Julia programming language by the author) in order to compare their performance. We show that conclusions are different for moderate horizons (typical in practice) than for small horizons (typical in academic literature reporting computational results). We further list and clarify a number of myths about this problem, e.g., we show that, computationally, much larger problems can be designed to Bayes-optimality than what is commonly believed.

Emerging Big Data analytics and machine learning applications require a significant amount of computational power. While there exists a plethora of large-scale data processing frameworks which thrive in handling the various complexities of data-intensive workloads, the ever-increasing demand of applications have made us reconsider the traditional ways of scaling (e.g., scale-out) and seek new opportunities for improving the performance. In order to prepare for an era where data collection and processing occur on a wide range of devices, from powerful HPC machines to small embedded devices, it is crucial to investigate and eliminate the potential sources of inefficiency in the current state of the art platforms. In this paper, we address the current and upcoming challenges of pervasive data processing and present directions for designing the next generation of large-scale data processing systems.

Sequential learning of multiple tasks in artificial neural networks using gradient descent leads to catastrophic forgetting, whereby previously learned knowledge is erased during learning of new, disjoint knowledge. Here, we propose a fundamentally new type of method – Beneficial Perturbation Network (BPN). We add task-dependent memory (biasing) units to allow the network to operate in different regimes for different tasks. We compute the most beneficial directions to train these units, in a manner inspired by recent work on adversarial examples. At test time, beneficial perturbations for a given task bias the network toward that task to overcome catastrophic forgetting. BPN is not only more parameter-efficient than network expansion methods, but also does not need to store any data from previous tasks, in contrast with episodic memory methods. Experiments on variants of the MNIST, CIFAR-10, CIFAR-100 datasets demonstrate strong performance of BPN when compared to the state-of-the-art.

Three types of regression models researchers need to be familiar with and know the requirements of each: parametric, semiparametric and nonparametric regression models. The type of modeling used is based on how much information are available about the form of the relationship between response variable and explanatory variables, and the random error distribution. In this article, differences between models, common methods of estimation, robust estimation, and applications are introduced. The R code for all the graphs and analyses presented here, in this article, is available in the Appendix.

Sequential decision making in the presence of uncertainty and stochastic dynamics gives rise to distributions over state/action trajectories in reinforcement learning (RL) and optimal control problems. This observation has led to a variety of connections between RL and inference in probabilistic graphical models (PGMs). Here we explore a different dimension to this relationship, examining reinforcement learning using the tools and abstractions of statistical physics. The central object in the statistical physics abstraction is the idea of a partition function , and here we construct a partition function from the ensemble of possible trajectories that an agent might take in a Markov decision process. Although value functions and -functions can be derived from this partition function and interpreted via average energies, the -function provides an object with its own Bellman equation that can form the basis of alternative dynamic programming approaches. Moreover, when the MDP dynamics are deterministic, the Bellman equation for is linear, allowing direct solutions that are unavailable for the nonlinear equations associated with traditional value functions. The policies learned via these -based Bellman updates are tightly linked to Boltzmann-like policy parameterizations. In addition to sampling actions proportionally to the exponential of the expected cumulative reward as Boltzmann policies would, these policies take entropy into account favoring states from which many outcomes are possible.

Financial decisions impact our lives, and thus everyone from the regulator to the consumer is interested in fair, sound, and explainable decisions. There is increasing competitive desire and regulatory incentive to deploy AI mindfully within financial services. An important mechanism towards that end is to explain AI decisions to various stakeholders. State-of-the-art explainable AI systems mostly serve AI engineers and offer little to no value to business decision makers, customers, and other stakeholders. Towards addressing this gap, in this work we consider the scenario of explaining loan denials. We build the first-of-its-kind dataset that is representative of loan-applicant friendly explanations. We design a novel Generative Adversarial Network (GAN) that can accommodate smaller datasets, to generate user-friendly textual explanations. We demonstrate how our system can also generate explanations serving different purposes: those that help educate the loan applicants, or help them take appropriate action towards a future approval.

Several centralised RDF systems support datalog reasoning by precomputing and storing all logically implied triples using the wellknown seminaive algorithm. Large RDF datasets often exceed the capacity of centralised RDF systems, and a common solution is to distribute the datasets in a cluster of shared-nothing servers. While numerous distributed query answering techniques are known, distributed seminaive evaluation of arbitrary datalog rules is less understood. In fact, most distributed RDF stores either support no reasoning or can handle only limited datalog fragments. In this paper we extend the dynamic data exchange approach for distributed query answering by Potter et al. [12] to a reasoning algorithm that can handle arbitrary rules while preserving important properties such as nonrepetition of inferences. We also show empirically that our algorithm scales well to very large RDF datasets

Local Interpretable Model-Agnostic Explanations (LIME) is a popular technique used to increase the interpretability and explainability of black box Machine Learning (ML) algorithms. LIME typically generates an explanation for a single prediction by any ML model by learning a simpler interpretable model (e.g. linear classifier) around the prediction through generating simulated data around the instance by random perturbation, and obtaining feature importance through applying some form of feature selection. While LIME and similar local algorithms have gained popularity due to their simplicity, the random perturbation and feature selection methods result in ‘instability’ in the generated explanations, where for the same prediction, different explanations can be generated. This is a critical issue that can prevent deployment of LIME in a Computer-Aided Diagnosis (CAD) system, where stability is of utmost importance to earn the trust of medical professionals. In this paper, we propose a deterministic version of LIME. Instead of random perturbation, we utilize agglomerative Hierarchical Clustering (HC) to group the training data together and K-Nearest Neighbour (KNN) to select the relevant cluster of the new instance that is being explained. After finding the relevant cluster, a linear model is trained over the selected cluster to generate the explanations. Experimental results on three different medical datasets show the superiority for Deterministic Local Interpretable Model-Agnostic Explanations (DLIME), where we quantitatively determine the stability of DLIME compared to LIME utilizing the Jaccard similarity among multiple generated explanations.

Neural processes combine the strengths of neural networks and Gaussian processes to achieve both flexible learning and fast prediction of stochastic processes. However, neural processes do not consider the temporal dependency structure of underlying processes and thus are limited in modeling a large class of problems with temporal structure. In this paper, we propose Sequential Neural Processes (SNP). By incorporating temporal state-transition model into neural processes, the proposed model extends the potential of neural processes to modeling dynamic stochastic processes. In applying SNP to dynamic 3D scene modeling, we also introduce the Temporal Generative Query Networks. To our knowledge, this is the first 4D model that can deal with temporal dynamics of 3D scenes. In experiments, we evaluate the proposed methods in dynamic (non-stationary) regression and 4D scene inference and rendering.

Traditional relational data interfaces require precise structured queries over potentially complex schemas. These rigid data retrieval mechanisms pose hurdles for non-expert users, who typically lack language expertise and are unfamiliar with the details of the schema. Query by Example (QBE) methods offer an alternative mechanism: users provide examples of their intended query output and the QBE system needs to infer the intended query. However, these approaches focus on the structural similarity of the examples and ignore the richer context present in the data. As a result, they typically produce queries that are too general, and fail to capture the user’s intent effectively. In this paper, we present SQuID, a system that performs semantic similarity-aware query intent discovery. Our work makes the following contributions: (1) We design an end-to-end system that automatically formulates select-project-join queries in an open-world setting, with optional group-by aggregation and intersection operators; a much larger class than prior QBE techniques. (2) We express the problem of query intent discovery using a probabilistic abduction model, that infers a query as the most likely explanation of the provided examples. (3) We introduce the notion of an abduction-ready database, which precomputes semantic properties and related statistics, allowing SQuID to achieve real-time performance. (4) We present an extensive empirical evaluation on three real-world datasets, including user-intent case studies, demonstrating that SQuID is efficient and effective, and outperforms machine learning methods, as well as the state-of-the-art in the related query reverse engineering problem.

Time Series Classification (TSC) has seen enormous progress over the last two decades. HIVE-COTE (Hierarchical Vote Collective of Transformation-based Ensembles) is the current state of the art in terms of classification accuracy. HIVE-COTE recognizes that time series are a specific data type for which the traditional attribute-value representation, used predominantly in machine learning, fails to provide a relevant representation. HIVE-COTE combines multiple types of classifiers: each extracting information about a specific aspect of a time series, be it in the time domain, frequency domain or summarization of intervals within the series. However, HIVE-COTE (and its predecessor, FLAT-COTE) is often infeasible to run on even modest amounts of data. For instance, training HIVE-COTE on a dataset with only 1,500 time series can require 8 days of CPU time. It has polynomial runtime w.r.t training set size, so this problem compounds as data quantity increases. We propose a novel TSC algorithm, TS-CHIEF, which is highly competitive to HIVE-COTE in accuracy, but requires only a fraction of the runtime. TS-CHIEF constructs an ensemble classifier that integrates the most effective embeddings of time series that research has developed in the last decade. It uses tree-structured classifiers to do so efficiently. We assess TS-CHIEF on 85 datasets of the UCR archive, where it achieves state-of-the-art accuracy with scalability and efficiency. We demonstrate that TS-CHIEF can be trained on 130k time series in 2 days, a data quantity that is beyond the reach of any TSC algorithm with comparable accuracy.

Recent advances in semi-supervised learning have shown tremendous potential in overcoming a major barrier to the success of modern machine learning algorithms: access to vast amounts of human-labeled training data. Algorithms based on self-ensemble learning and virtual adversarial training can harness the abundance of unlabeled data to produce impressive state-of-the-art results on a number of semi-supervised benchmarks, approaching the performance of strong supervised baselines using only a fraction of the available labeled data. However, these methods often require careful tuning of many hyper-parameters and are usually not easy to implement in practice. In this work, we present a conceptually simple yet effective semi-supervised algorithm based on self-supervised learning to combine semantic feature representations from unlabeled data. Our models are efficiently trained end-to-end for the joint, multi-task learning of labeled and unlabeled data in a single stage. Striving for simplicity and practicality, our approach requires no additional hyper-parameters to tune for optimal performance beyond the standard set for training convolutional neural networks. We conduct a comprehensive empirical evaluation of our models for semi-supervised image classification on SVHN, CIFAR-10 and CIFAR-100, and demonstrate results competitive with, and in some cases exceeding, prior state of the art. Reference code and data are available at https://…/sesemi

This note outlines a method for clustering time series based on a statistical model in which volatility shifts at unobserved change-points. The model accommodates some classical stylized features of returns and its relation to GARCH is discussed. Clustering is performed using a probability metric evaluated between posterior distributions of the most recent change-point associated with each series. This implies series are grouped together at a given time if there is evidence the most recent shifts in their respective volatilities were coincident or closely timed. The clustering method is dynamic, in that groupings may be updated in an online manner as data arrive. Numerical results are given analyzing daily returns of constituents of the S&P 500.

We see a trend where computing becomes a metered utility similar to how the electric grid evolved. Initially electricity was generated locally but economies of scale (and standardization) made it more efficient and economical to have utility companies managing the electric grid. Similar developments can be seen in computing where scientific grids paved the way for commercial cloud computing offerings. However, in our opinion, that evolution is far from finished and in this paper we bring forward the remaining challenges and propose a vision for the future of computing. In particular we focus on changes in cost of computing and high cost of human time in comparison that indicates that saving developer time is the most important for future of computing.

Authoring, developing, monitoring, and analyzing business processes has requires both domain and IT expertise since Business Process Management tools and practices have focused on enterprise applications and not end users. There are trends, however, that can greatly lower the bar for users to author and analyze their own processes. One emerging trend is the attention on blockchains as a shared ledger for parties collaborating on a process. Transaction logs recorded in a standard schema and stored in the open significantly reduces the effort to monitor and apply advanced process analytics. A second trend is the rapid maturity of machine learning algorithms, in particular deep learning models, and their increasing use in enterprise applications. These cognitive technologies can be used to generate views and processes customized for an end user so they can modify them and incorporate best practices learned from other users’ processes. Keywords: BPM, cognitive computing, blockchain, privacy, machine learning

An ontology is a formal representation of domain knowledge, which can be interpreted by machines. In recent years, ontologies have become a major tool for domain knowledge representation and a core component of many knowledge management systems, decision support systems and other intelligent systems, inter alia, in the context of agriculture. A review of the existing literature on agricultural ontologies, however, reveals that most of the studies, which propose agricultural ontologies, are lacking an explicit evaluation procedure. This is undesired because without well-structured evaluation processes, it is difficult to consider the value of ontologies to research and practice. Moreover, it is difficult to rely on such ontologies and share them on the Semantic Web or between semantic aware applications. With the growing number of ontology-based agricultural systems and the increasing popularity of the Semantic Web, it becomes essential that such development and evaluation methods are put forward to guide future efforts of ontology development. Our work contributes to the literature on agricultural ontologies, by presenting a method for evaluating agricultural ontologies, which seems to be missing from most existing studies on agricultural ontologies. The framework supports the matching of appropriate evaluation methods for a given ontology based on the ontology’s purpose.

Notions of depth in regression have been introduced and studied in the literature. Regression depth (RD) of Rousseeuw and Hubert (1999), the most famous, exemplifies a direct extension of Tukey location depth (Tukey (1975)) to regression. The extension of another prevailing location depth, the projection depth (Liu (1992), and Zuo and Serfling (2000)), to regression is called the projection regression depth (PRD) (Zuo (2018a)(Z18a)). These two represent the most promising depth notions in regression (Z18a). Carrizosa depth (Carrizosa (1996)) is another notion of depth in regression. The most remarkable advantage of the notion of depth in regression is to introduce directly, the median-type estimator, the maximum (or deepest) regression depth estimator (regression median) for regression parameters in a multi-dimensional setting. The maximum (deepest) regression depth estimators (regression medians) serve as robust alternatives to the classical least squares or least absolute deviations estimator of the unknown parameters in a general linear regression model. The uniqueness of regression medians is central in the discussion of the asymptotics of sample regression medians and a desirable property for the convergence of approximate algorithm in the computation of sample regression medians. Are the regression medians induced from RD, PRD, and unique? Answering this question is the goal of this article. It is found that the regression median induced from PRD possesses the uniqueness property, unlike its leading competitors. This and other findings on the performance of PRD and its induced median indicate that the PRD and its induced median are highly favorable among its leading competitors.

Probabilistic models are proposed for bounding the forward error in the numerically computed inner product (dot product, scalar product) between of two real -vectors. We derive probabilistic perturbation bounds, as well as probabilistic roundoff error bounds for the sequential accumulation of the inner product. These bounds are non-asymptotic, explicit, and make minimal assumptions on perturbations and roundoffs. The perturbations are represented as independent, bounded, zero-mean random variables, and the probabilistic perturbation bound is based on Azuma’s inequality. The roundoffs are also represented as bounded, zero-mean random variables. The first probabilistic bound assumes that the roundoffs are independent, while the second one does not. For the latter, we construct a Martingale that mirrors the sequential order of computations. Numerical experiments confirm that our bounds are more informative, often by several orders of magnitude, than traditional deterministic bounds — even for small vector dimensions~ and very stringent success probabilities. In particular the probabilistic roundoff error bounds are functions of rather than~, thus giving a quantitative confirmation of Wilkinson’s intuition. The paper concludes with a critical assessment of the probabilistic approach.

Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. The most commonly used RF variants however are ‘offline’ algorithms, which require the availability of the whole dataset at once. In this paper, we introduce AMF, an online random forest algorithm based on Mondrian Forests. Using a variant of the Context Tree Weighting algorithm, we show that it is possible to efficiently perform an exact aggregation over all prunings of the trees; in particular, this enables to obtain a truly online parameter-free algorithm which is competitive with the optimal pruning of the Mondrian tree, and thus adaptive to the unknown regularity of the regression function. Numerical experiments show that AMF is competitive with respect to several strong baselines on a large number of datasets for multi-class classification.

When forecasting time series with a hierarchical structure, the existing state of the art is to forecast each time series independently, and, in a post-treatment step, to reconcile the time series in a way that respects the hierarchy (Hyndman et al., 2011; Wickramasuriya et al., 2018). We propose a new loss function that can be incorporated into any maximum likelihood objective with hierarchical data, resulting in reconciled estimates with confidence intervals that correctly account for additional uncertainty due to imperfect reconciliation. We evaluate our method using a non-linear model and synthetic data on a counterfactual forecasting problem, where we have access to the ground truth and contemporaneous covariates, and show that we largely improve over the existing state-of-the-art method.

Similarity search is a core component in various applications such as image matching, product recommendation and low-shot classification. However, single machine solutions are usually insufficient due to the large cardinality of modern datasets and stringent latency requirement of on-line query processing. We present Pyramid, a general and efficient framework for distributed similarity search. Pyramid supports search with popular similarity functions including Euclidean distance, angular distance and inner product. Different from existing distributed solutions that are based on KD-tree or locality sensitive hashing (LSH), Pyramid is based on Hierarchical Navigable Small World graph (HNSW), which is the state of the art similarity search algorithm on a single machine. To achieve high query processing throughput, Pyramid partitions a dataset into sub-datasets containing similar items for index building and assigns a query to only some of the sub-datasets for query processing. To provide the robustness required by production deployment, Pyramid also supports failure recovery and straggler mitigation. Pyramid offers a set of concise API such that users can easily use Pyramid without knowing the details of distributed execution. Experiments on large-scale datasets show that Pyramid produces quality results for similarity search, achieves high query processing throughput and is robust under node failure and straggler.

This paper is a broad and accessible survey of the methods we have at our disposal for Monte Carlo gradient estimation in machine learning and across the statistical sciences: the problem of computing the gradient of an expectation of a function with respect to parameters defining the distribution that is integrated; the problem of sensitivity analysis. In machine learning research, this gradient problem lies at the core of many learning problems, in supervised, unsupervised and reinforcement learning. We will generally seek to rewrite such gradients in a form that allows for Monte Carlo estimation, allowing them to be easily and efficiently used and analysed. We explore three strategies–the pathwise, score function, and measure-valued gradient estimators–exploring their historical developments, derivation, and underlying assumptions. We describe their use in other fields, show how they are related and can be combined, and expand on their possible generalisations. Wherever Monte Carlo gradient estimators have been derived and deployed in the past, important advances have followed. A deeper and more widely-held understanding of this problem will lead to further advances, and it is these advances that we wish to support.

Applying neural networks as controllers in dynamical systems has shown great promises. However, it is critical yet challenging to verify the safety of such control systems with neural-network controllers in the loop. Previous methods for verifying neural network controlled systems are limited to a few specific activation functions. In this work, we propose a new reachability analysis approach based on Bernstein polynomials that can verify neural-network controlled systems with a more general form of activation functions, i.e., as long as they ensure that the neural networks are Lipschitz continuous. Specifically, we consider abstracting feedforward neural networks with Bernstein polynomials for a small subset of inputs. To quantify the error introduced by abstraction, we provide both theoretical error bound estimation based on the theory of Bernstein polynomials and more practical sampling based error bound estimation, following a tight Lipschitz constant estimation approach based on forward reachability analysis. Compared with previous methods, our approach addresses a much broader set of neural networks, including heterogeneous neural networks that contain multiple types of activation functions. Experiment results on a variety of benchmarks show the effectiveness of our approach.

Machine learning algorithms generally suffer from a problem of explainability. Given a classification result from a model, it is typically hard to determine what caused the decision to be made, and to give an informative explanation. We explore a new method of generating counterfactual explanations, which instead of explaining why a particular classification was made explain how a different outcome can be achieved. This gives the recipients of the explanation a better way to understand the outcome, and provides an actionable suggestion. We show that the introduced method of Constrained Adversarial Examples (CADEX) can be used in real world applications, and yields explanations which incorporate business or domain constraints such as handling categorical attributes and range constraints.

Developing learning methods which do not discriminate subgroups in the population is a central goal of algorithmic fairness. One way to reach this goal is by modifying the data representation in order to meet certain fairness constraints. In this work we measure fairness according to demographic parity. This requires the probability of the possible model decisions to be independent of the sensitive information. We argue that the goal of imposing demographic parity can be substantially facilitated within a multitask learning setting. We leverage task similarities by encouraging a shared fair representation across the tasks via low rank matrix factorization. We derive learning bounds establishing that the learned representation transfers well to novel tasks both in terms of prediction performance and fairness metrics. We present experiments on three real world datasets, showing that the proposed method outperforms state-of-the-art approaches by a significant margin.