In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics. We also showcase the shortcomings of current vision and language models by performing an error analysis on our system’s output.
Dialogue systems and conversational agents are becoming increasingly popular in the modern society but building an agent capable of holding intelligent conversation with its users is a challenging problem for artificial intelligence. In this demo, we demonstrate a deep learning based conversational social agent called ‘Ruuh’ (facebook.com/Ruuh) designed by a team at Microsoft India to converse on a wide range of topics. Ruuh needs to think beyond the utilitarian notion of merely generating ‘relevant’ responses and meet a wider range of user social needs, like expressing happiness when user’s favorite team wins, sharing a cute comment on showing the pictures of the user’s pet and so on. The agent also needs to detect and respond to abusive language, sensitive topics and trolling behavior of the users. Many of these problems pose significant research challenges which will be demonstrated in our demo. Our agent has interacted with over 2 million real world users till date which has generated over 150 million user conversations.
This paper continues the discussion of the representation of ontologies in the first-order logical environment FOLE. According to Gruber, an ontology defines the primitives with which to model the knowledge resources for a community of discourse. These primitives, consisting of classes, relationships and properties, are represented by the entity-relationship-attribute ERA data model of Chen. An ontology uses formal axioms to constrain the interpretation of these primitives. In short, an ontology specifies a logical theory. A series of three papers by the author provide a rigorous mathematical representation for the ERA data model in particular, and ontologies in general, within FOLE. The first two papers, which provide a foundation and superstructure for FOLE, represent the formalism and semantics of (many-sorted) first-order logic in a classification form corresponding to ideas discussed in the Information Flow Framework (IFF). The third paper will define an interpretation of FOLE in terms of the transformational passage, first described in (Kent, 2013), from the classification form of first-order logic to an equivalent interpretation form, thereby defining the formalism and semantics of first-order logical/relational database systems. Two papers will provide a precise mathematical basis for FOLE interpretation: the current paper develops the notion of a FOLE relational table following the relational model of Codd, and a follow-up paper will develop the notion of a FOLE relational database. Both of these papers expand on material found in the paper (Kent, 2011). Although the classification form follows the entity-relationship-attribute data model of Chen, the interpretation form follows the relational data model of Codd. In general, the FOLE representation uses a conceptual structures approach, that is completely compatible with formal concept analysis and information flow.
Lifelong learning can be viewed as a continuous transfer learning procedure over consecutive tasks, where learning a given task depends on accumulated knowledge — the so-called knowledge base. Most published work on lifelong learning makes a batch processing of each task, implying that a data collection step is required beforehand. We are proposing a new framework, lifelong online learning, in which the learning procedure for each task is interactive. This is done through a computationally efficient algorithm where the predicted result for a given task is made by combining two intermediate predictions: by using only the information from the current task and by relying on the accumulated knowledge. In this work, two challenges are tackled: making no assumption on the task generation distribution, and processing with a possibly unknown number of instances for each task. We are providing a theoretical analysis of this algorithm, with a cumulative error upper bound for each task. We find that under some mild conditions, the algorithm can still benefit from a small cumulative error even when facing few interactions. Moreover, we provide experimental results on both synthetic and real datasets that validate the correct behavior and practical usefulness of the proposed algorithm.
Experience shows that on today’s high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores are expected to increase and memory hierarchies are expected to become deeper. One big aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster while fully using all the available CPU resources and the integration of the GPU work into the overall programming model. For the integration of CUDA code we extended HPX, a general purpose C++ run time system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX which allows to seamlessly overlap any GPU operation with work on the main cores. Any user defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations for the data transfers and kernel launches for CUDA code as part of a HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities. Overhead measurements show, that the integration of the asynchronous operations (data transfer + launches of the kernels) as part of the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated and concurrent work on the main cores and the used GPU devices.
We study a distributed learning problem in which Alice sends a compressed distillation of a set of training data to Bob, who uses the distilled version to best solve an associated learning problem. We formalize this as a rate-distortion problem in which the training set is the source and Bob’s cross-entropy loss is the distortion measure. We consider this problem for unsupervised learning for batch and sequential data. In the batch data, this problem is equivalent to the information bottleneck (IB), and we show that reduced-complexity versions of standard IB methods solve the associated rate-distortion problem. For the streaming data, we present a new algorithm, which may be of independent interest, that solves the rate-distortion problem for Gaussian sources. Furthermore, to improve the results of the iterative algorithm for sequential data we introduce a two-pass version of this algorithm. Finally, we show the dependency of the rate on the number of samples $k$ required for Gaussian sources to ensure cross-entropy loss that scales optimally with the growth of the training set.
We introduce an extension of the multi-instance learning problem where examples are organized as nested bags of instances (e.g., a document could be represented as a bag of sentences, which in turn are bags of words). This framework can be useful in various scenarios, such as text and image classification, but also supervised learning over graphs. As a further advantage, multi-multi instance learning enables a particular way of interpreting predictions and the decision function. Our approach is based on a special neural network layer, called bag-layer, whose units aggregate bags of inputs of arbitrary size. We prove theoretically that the associated class of functions contains all Boolean functions over sets of sets of instances and we provide empirical evidence that functions of this kind can be actually learned on semi-synthetic datasets. We finally present experiments on text classification and on citation graphs and social graph data, showing that our model obtains competitive results with respect to other approaches such as convolutional networks on graphs.
In the empirical study of evolutionary algorithms, the solution quality is evaluated by either the fitness value or approximation error. The latter measures the fitness difference between an approximation solution and the optimal solution. Since the approximation error analysis is more convenient than the direct estimation of the fitness value, this paper focuses on approximation error analysis. However, it is straightforward to extend all related results from the approximation error to the fitness value. Although the evaluation of solution quality plays an essential role in practice, few rigorous analyses have been conducted on this topic. This paper aims at establishing a novel theoretical framework of approximation error analysis of evolutionary algorithms for discrete optimization. This framework is divided into two parts. The first part is about exact expressions of the approximation error. Two methods, Jordan form and Schur’s triangularization, are presented to obtain an exact expression. The second part is about upper bounds on approximation error. Two methods, convergence rate and auxiliary matrix iteration, are proposed to estimate the upper bound. The applicability of this framework is demonstrated through several examples.
Data from motion sensors such as accelerometers and gyroscopes embedded in our devices can reveal secondary undesired, private information about our activities. This information can be used for malicious purposes such as user identification by application developers. To address this problem, we propose a data transformation mechanism that enables a device to share data for specific applications (e.g.~monitoring their daily activities) without revealing private user information (e.g.~ user identity). We formulate this anonymization process based on an information theoretic approach and propose a new multi-objective loss function for training convolutional auto-encoders~(CAEs) to provide a practical approximation to our anonymization problem. This effective loss function forces the transformed data to minimize the information about the user’s identity, as well as the data distortion to preserve application-specific utility. Our training process regulates the encoder to disregard user-identifiable patterns and tunes the decoder to shape the final output independently of users in the training set. Then, a trained CAE can be deployed on a user’s mobile device to anonymize sensor data before sharing with an app, even for users who are not included in the training dataset. The results, on a dataset of 24 users for activity recognition, show a promising trade-off on transformed data between utility and privacy, with an accuracy for activity recognition over 92%, while reducing the chance of identifying a user to less than 7%.
Information theoretic quantities play an important role in various settings in machine learning, including causality testing, structure inference in graphical models, time-series problems, feature selection as well as in providing privacy guarantees. A key quantity of interest is the mutual information and generalizations thereof, including conditional mutual information, multivariate mutual information, total correlation and directed information. While the aforementioned information quantities are well defined in arbitrary probability spaces, existing estimators add or subtract entropies (we term them $\Sigma H$ methods). These methods work only in purely discrete space or purely continuous case since entropy (or differential entropy) is well defined only in that regime. In this paper, we define a general graph divergence measure ($\mathbb{GDM}$), as a measure of incompatibility between the observed distribution and a given graphical model structure. This generalizes the aforementioned information measures and we construct a novel estimator via a coupling trick that directly estimates these multivariate information measures using the Radon-Nikodym derivative. These estimators are proven to be consistent in a general setting which includes several cases where the existing estimators fail, thus providing the only known estimators for the following settings: (1) the data has some discrete and some continuous-valued components (2) some (or all) of the components themselves are discrete-continuous mixtures (3) the data is real-valued but does not have a joint density on the entire space, rather is supported on a low-dimensional manifold. We show that our proposed estimators significantly outperform known estimators on synthetic and real datasets.
In this article, we show that solving the system of linear equations by manipulating the kernel and the range space is equivalent to solving the problem of least squares error approximation. This establishes the ground for a gradient-free learning search when the system can be expressed in the form of a linear matrix equation. When the nonlinear activation function is invertible, the learning problem of a fully-connected multilayer feedforward neural network can be easily adapted for this novel learning framework. By a series of kernel and range space manipulations, it turns out that such a network learning boils down to solving a set of cross-coupling equations. By having the weights randomly initialized, the equations can be decoupled and the network solution shows relatively good learning capability for real world data sets of small to moderate dimensions. Based on the structural information of the matrix equation, the network representation is found to be dependent on the number of data samples and the output dimension.
Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently in (Bacon et al., 2017) was a policy gradient theorem derived for online learning of general purpose options in an end to end fashion. In this work, we extend previous work on this topic that only focuses on learning a two-level hierarchy including options and primitive actions to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options where high level temporally extended options are composed of lower level options with finer resolutions in time. We extend results from (Bacon et al., 2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficiency of our framework.
However Deep neural networks recently have achieved impressive results for different tasks, they suffer from poor uncertainty prediction. Temperature Scaling(TS) is an efficient post-processing method for calibrating DNNs toward to have more accurate uncertainty prediction. TS relies on a single parameter T which softens the logit layer of a DNN and the optimal value of it is found by minimizing on Negative Log Likelihood (NLL) loss function. In this paper, we discuss about weakness of NLL loss function, especially for DNNs with high accuracy and propose a new loss function called Attended-NLL which can improve TS calibration ability significantly
Time series clustering is the process of grouping time series with respect to their similarity or characteristics. Previous approaches usually combine a specific distance measure for time series and a standard clustering method. However, these approaches do not take the similarity of the different subsequences of each time series into account, which can be used to better compare the time series objects of the dataset. In this paper, we propose a novel technique of time series clustering based on two clustering stages. In a first step, a least squares polynomial segmentation procedure is applied to each time series, which is based on a growing window technique that returns different-length segments. Then, all the segments are projected into same dimensional space, based on the coefficients of the model that approximates the segment and a set of statistical features. After mapping, a first hierarchical clustering phase is applied to all mapped segments, returning groups of segments for each time series. These clusters are used to represent all time series in the same dimensional space, after defining another specific mapping process. In a second and final clustering stage, all the time series objects are grouped. We consider internal clustering quality to automatically adjust the main parameter of the algorithm, which is an error threshold for the segmenta- tion. The results obtained on 84 datasets from the UCR Time Series Classification Archive have been compared against two state-of-the-art methods, showing that the performance of this methodology is very promising.
Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.
We present Fabrik, an online neural network editor that provides tools to visualize, edit, and share neural networks from within a browser. Fabrik provides a simple and intuitive GUI to import neural networks written in popular deep learning frameworks such as Caffe, Keras, and TensorFlow, and allows users to interact with, build, and edit models via simple drag and drop. Fabrik is designed to be framework agnostic and support high interoperability, and can be used to export models back to any supported framework. Finally, it provides powerful collaborative features to enable users to iterate over model design remotely and at scale.
It is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations do deep neural networks learn. In this work, we move a tiny step towards a theory and better understanding of the representations. Specifically, we study a simpler problem: How similar are the representations learned by two networks with identical architecture but trained from different initializations. We develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are maximum match and simple match which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, we conduct extensive experiments using our algorithms. Experimental results suggest that, surprisingly, representations learned by the same convolutional layers of networks trained from different initializations are not as similar as prevalently expected, at least in terms of subspace match.
Recently, there have been great interests in Monte Carlo Tree Search (MCTS) in AI research. Although the sequential version of MCTS has been studied widely, its parallel counterpart still lacks systematic study. This leads us to the following questions: \emph{how to design efficient parallel MCTS (or more general cases) algorithms with rigorous theoretical guarantee? Is it possible to achieve linear speedup?} In this paper, we consider the search problem on a more general acyclic one-root graph (namely, Monte Carlo Graph Search (MCGS)), which generalizes MCTS. We develop a parallel algorithm (P-MCGS) to assign multiple workers to investigate appropriate leaf nodes simultaneously. Our analysis shows that P-MCGS algorithm achieves linear speedup and that the sample complexity is comparable to its sequential counterpart.
Complex networks are ubiquitous to several Computer Science domains. Centrality measures are an important analysis mechanism to uncover vital elements of complex networks. However, these metrics have high computational costs and requirements that hinder their applications in large real-world networks. In this tutorial, we explain how the use of neural network learning algorithms can render the application of the metrics in complex networks of arbitrary size. Moreover, the tutorial describes how to identify the best configuration for neural network training and learning such for tasks, besides presenting an easy way to generate and acquire training data. We do so by means of a general methodology, using complex network models adaptable to any application. We show that a regression model generated by the neural network successfully approximates the metric values and therefore are a robust, effective alternative in real-world applications. The methodology and proposed machine learning model use only a fraction of time with respect to other approximation algorithms, which is crucial in complex network applications.
To understand and predict the performance of parallel and distributed programs, several analytical and machine learning approaches have been proposed, each having its advantages and disadvantages. In this paper, we propose and validate a hybrid approach exploiting both analytical and machine learning models. The hybrid model is able to learn and correct the analytical models to better match the actual performance. Furthermore, the proposed hybrid model improves the prediction accuracy in comparison to pure machine learning techniques while using small training datasets, thus making it suitable for hardware and workload changes.
In the natural sciences, differential equations are widely used to describe dynamical systems. The discovery and verification of such models from data has become a fundamental challenge of science today. From a statistical point of view, we distinguish two problems: parameter estimation and structure search. In parameter estimation, we start from a given differential equation and estimate the parameters from noisy data that are observed at discrete time points. The estimate depends nonlinearly on the parameters. This poses both statistical and computational challenges and makes the task of structure search even more ambitious. Existing methods use either standard model selection techniques or various types of sparsity enforcing regularization, hence focusing on predictive performance. In this work, we develop novel methodology for structure search in ordinary differential equation models. Exploiting ideas from causal inference, we propose to rank models not only by their predictive performance, but also by taking into account stability, i.e., their ability to predict well in different experimental settings. Based on this model ranking we also construct a ranking of individual variables reflecting causal importance. It provides researchers with a list of promising candidate variables that may be investigated further in interventional experiments. Our ranking methodology (both for models and variables) comes with theoretical asymptotic guarantees and is shown to outperform current state-of-the art methods based on extensive experimental evaluation on simulated data. Practical applicability of the procedure is illustrated on a not yet published biological data set. Our methodology is fully implemented. Code will be provided online and will also be made available as an R package.
The problem of anomaly detection among multiple processes is considered within the framework of sequential design of experiments. The objective is an active inference strategy consisting of a selection rule governing which process to probe at each time, a stopping rule on when to terminate the detection, and a decision rule on the final detection outcome. The performance measure is the Bayes risk that takes into account of not only sample complexity and detection errors, but also costs associated with switching across processes. While the problem is a partially observable Markov decision process to which optimal solutions are generally intractable, a low-complexity deterministic policy is shown to be asymptotically optimal and offer significant performance improvement over existing methods in the finite regime.