The FacT: Taming Latent Factor Models for Explainability with Factorization Trees

Latent factor models have achieved great success in personalized recommendations, but they are also notoriously difficult to explain. In this work, we integrate regression trees to guide the learning of latent factor models for recommendation, and use the learnt tree structure to explain the resulting latent factors. Specifically, we build regression trees on users and items respectively with user-generated reviews, and associate a latent profile to each node on the trees to represent users and items. With the growth of regression tree, the latent factors are gradually refined under the regularization imposed by the tree structure. As a result, we are able to track the creation of latent profiles by looking into the path of each factor on regression trees, which thus serves as an explanation for the resulting recommendations. Extensive experiments on two large collections of Amazon and Yelp reviews demonstrate the advantage of our model over several competitive baseline algorithms. Besides, our extensive user study also confirms the practical value of explainable recommendations generated by our model.


Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

Automatic generation of summaries from multiple news articles is a valuable tool as the number of online publications grows rapidly. Single document summarization (SDS) systems have benefited from advances in neural encoder-decoder model thanks to the availability of large datasets. However, multi-document summarization (MDS) of news articles has been limited to datasets of a couple of hundred examples. In this paper, we introduce Multi-News, the first large-scale MDS news dataset. Additionally, we propose an end-to-end model which incorporates a traditional extractive summarization model with a standard SDS model and achieves competitive results on MDS datasets. We benchmark several methods on Multi-News and release our data and code in hope that this work will promote advances in summarization in the multi-document setting.


An Introduction to Deep Morphological Networks

The recent impressive results of deep learning-based methods on computer vision applications brought fresh air to the research and industrial community. This success is mainly due to the process that allows those methods to learn data-driven features, generally based upon linear operations. However, in some scenarios, such operations do not have a good performance because of their inherited process that blurs edges, losing notions of corners, borders, and geometry of objects. Overcoming this, non-linear operations, such as morphological ones, may preserve such properties of the objects, being preferable and even state-of-the-art in some applications. Encouraged by this, in this work, we propose a novel network, called Deep Morphological Network (DeepMorphNet), capable of doing non-linear morphological operations while performing the feature learning process by optimizing the structuring elements. The DeepMorphNets can be trained and optimized end-to-end using traditional existing techniques commonly employed in the training of deep learning approaches. A systematic evaluation of the proposed algorithm is conducted using two synthetic and two traditional image classification datasets. Results show that the proposed DeepMorphNets is a promising technique that can learn distinct features when compared to the ones learned by current deep learning methods.


Reinforcement Learning When All Actions are Not Always Available

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs suffer from divergence issues, and present new algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches using several tasks inspired by real-life use cases wherein the action set is stochastic.


Foundations of Constructive Probability Theory

We provide a systematic, thorough treatment of the foundations of probability theory and stochastic processes along the lines of E. Bishop’s constructive analysis. Every existence result presented shall be a construction; and the input data, the construction procedure, and the output objects shall be regarded as integral parts of the theorem. A brief description of this approach is in Part I of this book. Part II develops basic topics in probability theory in this constructive framework, expanding on [Bishop and Bridges 1985, Springer], and in terms familiar to probabilists. Part III, the main part of the book, builds on Part II to provide a new constructive treatment of stochastic processes, in the spirit and style of Kolmogorov’s constructive methods for Brownian motion. Topics include a Daniell-Kolmogorov-Skorokhod construction of random fields, measurable random fields, a.u. continuous processes, a.u. c\`adl\`ag processes, martingales, and a.u. c\`adl\`ag and strongly Markov processes with Feller semigroups. This text also contains some new theorems in classical probability theory. Each construction theorem is accompanied by a metrical continuity theorem. For example, the construction of Markov processes from Feller semigroups is shown to be metrically continuous, which strengthens the sequential weak convergence in the classical approach. Another new result is a maximal inequality for L_p-martingales for p \ge 1. In addition to providing explicit rates of convergence, this maximal inequality also provides a unified proof of a.u. convergence of martingales, which previously required separate proofs for the cases p > 1 and p = 1. A third new result is a proof that a familiar condition on the triple-joint distributions implies that a process is not only a.u. c\`adl\`ag, but also right Hoelder, in a sense made precise in the text.


Risks from Learned Optimization in Advanced Machine Learning Systems

We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer – a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be – how will it differ from the loss function it was trained under – and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.


Data Sketching for Faster Training of Machine Learning Models

Many machine learning problems reduce to the problem of minimizing an expected risk, defined as the sum of a large number of, often convex, component functions. Iterative gradient methods are popular techniques for the above problems. However, they are in general slow to converge, in particular for large data sets. In this work, we develop analysis for selecting a subset (or sketch) of training data points with their corresponding learning rates in order to provide faster convergence to a close neighbordhood of the optimal solution. We show that subsets that minimize the upper-bound on the estimation error of the full gradient, maximize a submodular facility location function. As a result, by greedily maximizing the facility location function we obtain subsets that yield faster convergence to a close neighborhood of the optimum solution. We demonstrate the real-world effectiveness of our algorithm, SIG, confirming our analysis, through an extensive set of experiments on several applications, including logistic regression and training neural networks. We also include a method that provides a deliberate deterministic ordering of the data subset that is quite effective in practice. We observe that our method, while achieving practically the same loss, speeds up gradient methods by up to 10x for convex and 3x for non-convex (deep) functions.


On Testing Marginal versus Conditional Independence

We consider testing marginal independence versus conditional independence in a trivariate Gaussian setting. The two models are non-nested and their intersection is a union of two marginal independences. We consider two sequences of such models, one from each type of independence, that are closest to each other in the Kullback-Leibler sense as they approach the intersection. They become indistinguishable if the signal strength, as measured by the product of two correlation parameters, decreases faster than the standard parametric rate. Under local alternatives at such rate, we show that the asymptotic distribution of the likelihood ratio depends on where and how the local alternatives approach the intersection. To deal with this non-uniformity, we study a class of ‘envelope’ distributions by taking pointwise suprema over asymptotic cumulative distribution functions. We show that these envelope distributions are well-behaved and lead to model selection procedures with uniform error guarantees and near-optimal power. To control the error even when the two models are indistinguishable, rather than insist on a dichotomous choice, the proposed procedure will choose either or both models.


Invariant Tensor Feature Coding

We propose a novel feature coding method that exploits invariance. We consider the setting where the transformations that preserve the image contents compose a finite group of orthogonal matrices. This is the case in many image transformations such as image rotations and image flipping. We prove that the group-invariant feature vector contains sufficient discriminative information when we learn a linear classifier using convex loss minimization. From this result, we propose a novel feature modeling for principal component analysis, and k-means clustering, which are used for most feature coding methods, and global feature functions that explicitly consider the group action. Although the global feature functions are complex nonlinear functions in general, we can calculate the group action on this space easily by constructing the functions as the tensor product representations of basic representations, resulting in the explicit form of invariant feature functions. We demonstrate the effectiveness of our methods on several image datasets.


Fair Near Neighbor Search: Independent Range Sampling in High Dimensions

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r>0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. Locality sensitive hashing (LSH), the most common approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product, which requires nearly-linear space and exploits locality sensitive filters.


Enumeration of Distinct Support Vectors for Interactive Decision Making

In conventional prediction tasks, a machine learning algorithm outputs a single best model that globally optimizes its objective function, which typically is accuracy. Therefore, users cannot access the other models explicitly. In contrast to this, multiple model enumeration attracts increasing interests in non-standard machine learning applications where other criteria, e.g., interpretability or fairness, than accuracy are main concern and a user may want to access more than one non-optimal, but suitable models. In this paper, we propose a K-best model enumeration algorithm for Support Vector Machines (SVM) that given a dataset S and an integer K>0, enumerates the K-best models on S with distinct support vectors in the descending order of the objective function values in the dual SVM problem. Based on analysis of the lattice structure of support vectors, our algorithm efficiently finds the next best model with small latency. This is useful in supporting users’s interactive examination of their requirements on enumerated models. By experiments on real datasets, we evaluated the efficiency and usefulness of our algorithm.


Baby steps towards few-shot learning with multiple semantics

Learning from one or few visual examples is one of the key capabilities of humans since early infancy, but is still a significant challenge for modern AI systems. While considerable progress has been achieved in few-shot learning from a few image examples, much less attention has been given to the verbal descriptions that are usually provided to infants when they are presented with a new object. In this paper, we focus on the role of additional semantics that can significantly facilitate few-shot visual learning. Building upon recent advances in few-shot learning with additional semantic information, we demonstrate that further improvements are possible using richer semantics and multiple semantic sources. Using these ideas, we offer the community a new result on the one-shot test of the popular miniImageNet benchmark, comparing favorably to the previous state-of-the-art results for both visual only and visual plus semantics-based approaches. We also performed an ablation study investigating the components and design choices of our approach.


Evaluation and Improvement of Chatbot Text Classification Data Quality Using Plausible Negative Examples

We describe and validate a metric for estimating multi-class classifier performance based on cross-validation and adapted for improvement of small, unbalanced natural-language datasets used in chatbot design. Our experiences draw upon building recruitment chatbots that mediate communication between job-seekers and recruiters by exposing the ML/NLP dataset to the recruiting team. Evaluation approaches must be understandable to various stakeholders, and useful for improving chatbot performance. The metric, nex-cv, uses negative examples in the evaluation of text classification, and fulfils three requirements. First, it is actionable: it can be used by non-developer staff. Second, it is not overly optimistic compared to human ratings, making it a fast method for comparing classifiers. Third, it allows model-agnostic comparison, making it useful for comparing systems despite implementation differences. We validate the metric based on seven recruitment-domain datasets in English and German over the course of one year.


Detecting linear trend changes and point anomalies in data sequences

We propose TrendSegment, a methodology for detecting multiple change-points corresponding to linear trend changes or point anomalies in one dimensional data. A core ingredient of TrendSegment is a new Tail-Greedy Unbalanced Wavelet transform: a conditionally orthonormal, bottom-up transformation of the data through an adaptively constructed unbalanced wavelet basis, which results in a sparse representation of the data. The bottom-up nature of this multiscale decomposition enables the detection of point anomalies and linear trend changes at once as the decomposition focuses on local features in its early stages and on global features next. To reduce the computational complexity, the proposed method merges multiple regions in a single pass over the data. We show the consistency of the estimated number and locations of change-points. The practicality of our approach is demonstrated through simulations and two real data examples, involving Iceland temperature data and sea ice extent of the Arctic and the Antarctic. Our methodology is implemented in the R package trendsegmentR, available from CRAN.


Impact of Prior Knowledge and Data Correlation on Privacy Leakage: A Unified Analysis

It has been widely understood that differential privacy (DP) can guarantee rigorous privacy against adversaries with arbitrary prior knowledge. However, recent studies demonstrate that this may not be true for correlated data, and indicate that three factors could influence privacy leakage: the data correlation pattern, prior knowledge of adversaries, and sensitivity of the query function. This poses a fundamental problem: what is the mathematical relationship between the three factors and privacy leakage? In this paper, we present a unified analysis of this problem. A new privacy definition, named \textit{prior differential privacy (PDP)}, is proposed to evaluate privacy leakage considering the exact prior knowledge possessed by the adversary. We use two models, the weighted hierarchical graph (WHG) and the multivariate Gaussian model to analyze discrete and continuous data, respectively. We demonstrate that positive, negative, and hybrid correlations have distinct impacts on privacy leakage. Considering general correlations, a closed-form expression of privacy leakage is derived for continuous data, and a chain rule is presented for discrete data. Our results are valid for general linear queries, including count, sum, mean, and histogram. Numerical experiments are presented to verify our theoretical analysis.


Enhancing interoperable datasets with virtual links

To achieve semantic interoperability, numerous data standards, ontologies, and controlled vocabularies have been developed and adopted by the industry and scientific communities. Yet, semantic heterogeneity remains a problem when interoperating data from sources of different scopes and knowledge domains. Causes for this challenge are context-specific requirements (i.e. no ‘one model fits all’), different modelling decisions, domain purpose, and technical constraints. Moreover, even if the problem of semantic heterogeneity among different RDF publishers and knowledge domains is mitigated or solved, querying and accessing the data of multiple distributed RDF datasets on the web is not straightforward. This is because of the complex and fastidious process needed to understand how the data are structured and how these datasets can be related or linked, and consequently, queried. To address these issues, we propose to extend the existing Vocabulary of Interlinked Datasets (VoID). We introduce the concept of virtual links. A virtual link is a connection between resources such as literals and IRIs (Internationalized Resource Identifier) with some commonality where each of these resources is from a different RDF dataset. Semantic relaxation is also considered when defining commonality between resources. The links are required in order to understand how to semantically relate the datasets. In addition, we argue several benefits of using virtual links to improve semantic interoperability between unlike datasets. Finally, we applied them to multiple worldwide used RDF datasets in life sciences.


Interpretable and Differentially Private Predictions

Interpretable predictions, where it is clear why a machine learning model has made a particular decision, can compromise privacy by revealing the characteristics of individual data points. This raises the central question addressed in this paper: Can models be interpretable without compromising privacy? For complex big data fit by correspondingly rich models, balancing privacy and explainability is particularly challenging, such that this question has remained largely unexplored. In this paper, we propose a family of simple models in the aim of approximating complex models using several locally linear maps per class to provide high classification accuracy, as well as differentially private explanations on the classification. We illustrate the usefulness of our approach on several image benchmark datasets as well as a medical dataset.


Ensemble MCMC: Accelerating Pseudo-Marginal MCMC for State Space Models using the Ensemble Kalman Filter

Particle Markov chain Monte Carlo (pMCMC) is now a popular method for performing Bayesian statistical inference on challenging state space models (SSMs) with unknown static parameters. It uses a particle filter (PF) at each iteration of an MCMC algorithm to unbiasedly estimate the likelihood for a given static parameter value. However, pMCMC can be computationally intensive when a large number of particles in the PF is required, such as when the data is highly informative, the model is misspecified and/or the time series is long. In this paper we exploit the ensemble Kalman filter (EnKF) developed in the data assimilation literature to speed up pMCMC. We replace the unbiased PF likelihood with the biased EnKF likelihood estimate within MCMC to sample over the space of the static parameter. On a wide class of different non-linear SSM models, we demonstrate that our new ensemble MCMC (eMCMC) method can significantly reduce the computational cost whilst maintaining reasonable accuracy. We also propose several extensions of the vanilla eMCMC algorithm to further improve computational efficiency and allow for approximate posterior hidden state inference.