Interference arises when an individual’s potential outcome depends not only on the individual’s own treatment level but also on the treatment level of others. A common assumption in the causal inference literature in the presence of interference is partial interference, implying that the population can be partitioned into clusters of individuals whose potential outcomes only depend on the treatment of units within the same cluster. Previous literature has defined average potential outcomes under counterfactual scenarios where treatments are randomly allocated to units within a cluster. However, within clusters there may be units that are more or less likely to receive treatment based on covariates or neighbors’ treatment. We define estimands that describe average potential outcomes for realistic counterfactual treatment allocation programs, taking into consideration the units’ covariates as well as dependence between units’ treatment assignment. We discuss these estimands, propose unbiased estimators and derive asymptotic results as the number of clusters grows. Finally, we estimate effects in a comparative effectiveness study of power plant emission reduction technologies on ambient ozone pollution.
Markov Logic Networks join probabilistic modeling with first-order logic and have been shown to integrate well with the Semantic Web foundations. While several approaches have been devised to tackle the subproblems of rule mining, grounding, and inference, no comprehensive workflow has been proposed so far. In this paper, we fill this gap by introducing a framework called Mandolin, which implements a workflow for knowledge discovery specifically on RDF datasets. Our framework imports knowledge from referenced graphs, creates similarity relationships among similar literals, and relies on state-of-the-art techniques for rule mining, grounding, and inference computation. We show that our best configuration scales well and achieves at least comparable results with respect to other statistical-relational-learning algorithms on link prediction.
We interpret HyperNetworks within the framework of variational inference with implicit distributions. Our method, Bayes by Hypernet, is able to model a richer variational distribution than previous methods. Experiments show that it achieves comparable predictive performance on the MNIST classification task while providing higher predictive uncertainties compared to MC-Dropout and regular maximum likelihood training.
Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and as new data stream in at prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies is domain value violations, which occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains. Our evaluation on a collection of 12 datasets from Kaggle, the UCI repository, real-world data analyses, and production datasets shows that BoostClean can increase absolute prediction accuracy by up to 9% over the best non-ensembled alternatives. Our optimizations, including parallelism, materialization, and indexing techniques, show a 22.2x end-to-end speedup on a 16-core machine.
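A domain value violation and one possible repair can be illustrated with a minimal sketch. The function names, the toy `states` column, and the mode-imputation repair below are hypothetical and chosen only for illustration; they are not BoostClean's library, and the boosting-based selection of detector and repair combinations is not shown.

```python
from collections import Counter


def detect_domain_violations(values, allowed):
    """Return the indices whose value falls outside the allowed domain."""
    return [i for i, v in enumerate(values) if v not in allowed]


def repair_with_mode(values, allowed):
    """Replace each out-of-domain value with the most common in-domain value."""
    valid = [v for v in values if v in allowed]
    if not valid:
        raise ValueError("no in-domain values to impute from")
    mode = Counter(valid).most_common(1)[0][0]
    return [v if v in allowed else mode for v in values]


# Hypothetical categorical attribute with two out-of-domain entries.
states = ["CA", "NY", "CA", "N/A", "TX", "??", "CA"]
allowed_states = {"CA", "NY", "TX"}

violations = detect_domain_violations(states, allowed_states)
repaired = repair_with_mode(states, allowed_states)
print(violations)
print(repaired)
```

Whether a given repair actually helps depends on the downstream model, which is why the abstract evaluates each detection and repair combination against clean test labels rather than fixing one in advance.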
Modern society generates an incredible amount of data about individuals, and releasing summary statistics about this data in a manner that provably protects individual privacy would offer a valuable resource for researchers in many fields. We present the first algorithm for analysis of variance (ANOVA) that preserves differential privacy, allowing this important statistical test to be conducted (and the results released) on databases of sensitive information. In addition to our private algorithm for the F test statistic, we show a rigorous way to compute p-values that accounts for the added noise needed to preserve privacy. Finally, we present experimental results quantifying the statistical power of this differentially private version of the test, finding that a sample of several thousand observations is frequently enough to detect variation between groups. The differentially private ANOVA algorithm is a promising approach for releasing a common test statistic that is valuable in fields in the sciences and social sciences.
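For readers unfamiliar with the statistic being privatized, the sketch below computes the standard one-way ANOVA F statistic and optionally perturbs its two sums of squares with Laplace noise. The `noise_scale` parameter and the choice to add noise to the sums of squares are assumptions made for illustration; the calibration required for a formal differential privacy guarantee is not derived here, so this code alone provides no such guarantee.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def f_statistic(groups, noise_scale=0.0, rng=None):
    """One-way ANOVA F statistic; noise_scale > 0 perturbs the sums of squares."""
    rng = rng or random.Random(0)
    n = sum(len(g) for g in groups)   # total number of observations
    k = len(groups)                   # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group (SSA) and within-group (SSE) sums of squares.
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if noise_scale > 0:
        ssa += laplace(noise_scale, rng)  # illustrative noise, not calibrated
        sse += laplace(noise_scale, rng)
    return (ssa / (k - 1)) / (sse / (n - k))


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_exact = f_statistic(groups)
f_noisy = f_statistic(groups, noise_scale=0.5)
print(f_exact, f_noisy)
```

Because the noisy statistic no longer follows the usual F distribution, p-values must account for the added noise, which is the point the abstract makes about computing them rigorously.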
Monitoring of networks for anomaly detection has attracted a lot of attention in recent years, especially with the rise of connected devices and social networks. This is of importance as anomaly detection could span a wide range of applications, from detecting terrorist cells in counter-terrorism efforts to phishing attacks in social network circles. For this reason, numerous techniques for anomaly detection have been introduced. However, application of these techniques to more complex network models is hindered by various challenges such as the size of the network being investigated, how much a priori information is needed, and the size of the anomalous graph, among others. A recent technique introduced by Miller et al., which relies on a spectral framework for anomaly detection, has the potential to address many of these challenges. In their discussion of the spectral framework, three algorithms were proposed that relied on the eigenvalues and eigenvectors of the residual matrix of a binary network. The authors demonstrated the ability to detect anomalous subgraphs that were less than 1% of the network size. However, to date, little work has been done to evaluate the statistical performance of these algorithms. This study investigates the statistical properties of the spectral methods, specifically the Chi-square and norm algorithms proposed by Miller et al. We will analyze the performance of the algorithms using simulated networks and also extend the method’s application to count networks. Finally, we will make some methodological improvements and recommendations to both algorithms.
Machine learning is usually defined in behaviourist terms, where external validation is the primary mechanism of learning. In this paper, I argue for a more holistic interpretation in which finding more probable, efficient and abstract representations is as central to learning as performance. In other words, machine learning should be extended with strategies to reason over its own learning process, leading to so-called meta-cognitive machine learning. As such, the de facto definition of machine learning should be reformulated in these intrinsically multi-objective terms, taking into account not only the task performance but also internal learning objectives. To this end, we suggest a ‘model entropy function’ to be defined that quantifies the efficiency of the internal learning processes. It is conjectured that the minimization of this model entropy leads to concept formation. Besides philosophical aspects, some initial illustrations are included to support the claims.
We propose the Wasserstein Auto-Encoder (WAE)—a new algorithm for building a generative model of the data distribution. WAE minimizes a penalized form of the Wasserstein distance between the model distribution and the target distribution, which leads to a different regularizer than the one used by the Variational Auto-Encoder (VAE). This regularizer encourages the encoded training distribution to match the prior. We compare our algorithm with several other techniques and show that it is a generalization of adversarial auto-encoders (AAE). Our experiments show that WAE shares many of the properties of VAEs (stable training, encoder-decoder architecture, nice latent manifold structure) while generating samples of better quality, as measured by the FID score.
Manually labeling documents is tedious and expensive, but it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing works mainly center on single-label classification problems, that is, each document is restricted to belonging to a single category. In this paper, we propose a novel Seed-guided Multi-label Topic Model, named SMTM. With a few seed words relevant to each category, SMTM conducts multi-label classification for a collection of documents without any labeled document. In SMTM, each category is associated with a single category-topic which covers the meaning of the category. To accommodate multi-labeled documents, we explicitly model the category sparsity in SMTM by using a spike and slab prior and a weak smoothing prior. That is, without using any threshold tuning, SMTM automatically selects the relevant categories for each document. To incorporate the supervision of the seed words, we propose a seed-guided biased GPU (i.e., generalized Polya urn) sampling procedure to guide the topic inference of SMTM. Experiments on two public datasets show that SMTM achieves better classification accuracy than state-of-the-art alternatives and even outperforms supervised solutions in some scenarios.
Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the Tensorized LSTM in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.
Recommender systems have been widely adopted by electronic commerce and entertainment industries for individualized prediction and recommendation, which benefit consumers and improve business intelligence. In this article, we propose an innovative method, namely the recommendation engine of multilayers (REM), for tensor recommender systems. The proposed method utilizes the structure of a tensor response to integrate information from multiple modes, and creates an additional layer of nested latent factors to accommodate between-subjects dependency. One major advantage is that the proposed method is able to address the ‘cold-start’ issue in the absence of information from new customers, new products or new contexts. Specifically, it provides more effective recommendations through sub-group information. To achieve scalable computation, we develop a new algorithm for the proposed method, which incorporates a maximum block improvement strategy into the cyclic blockwise-coordinate-descent algorithm. In theory, we investigate both algorithmic properties for global and local convergence, along with the asymptotic consistency of estimated parameters. Finally, the proposed method is applied in simulations and IRI marketing data with 116 million observations of product sales. Numerical studies demonstrate that the proposed method outperforms existing competitors in the literature.
A remarkable feature of human beings is their capacity for creative behaviour, referring to their ability to react to problems in ways that are novel, surprising, and useful. Transformational creativity is a form of creativity where the creative behaviour is induced by a transformation of the actor’s conceptual space, that is, the representational system with which the actor interprets its environment. In this report, we focus on ways of adapting systems of learned representations as they switch from performing one task to performing another. We describe an experimental comparison of multiple strategies for adaptation of learned features, and evaluate how effectively each of these strategies realizes the adaptation, in terms of the amount of training, and in terms of their ability to cope with restricted availability of training data. We show, among other things, that across handwritten digits, natural images, and classical music, adaptive strategies are systematically more effective than a baseline method that starts learning from scratch.
Profiling techniques are used extensively at different parts of the computing stack to achieve many goals. One major goal is to make a piece of software execute more efficiently on a specific hardware platform, where efficiency spans criteria such as power, performance, resource requirements, etc. Researchers, both in academia and industry, have introduced many techniques to gather, and make use of, profiling data. However, one thing remains unchanged: making application A run more efficiently on machine 1. In this paper, we extend this criterion by asking: can profiling information of application A on machine 1 be used to make application B run more efficiently on machine 1? If so, then this means that as machine 1 continues to execute more applications, it becomes better and more efficient. We present a generalized method for using profiling information gathered from the execution of programs from a limited corpus of applications to improve the performance of software from outside our corpus. As a proof of concept, we apply our technique to the specific problem of selecting the most efficient last-level cache with which to execute an application. We were able to turn off an average of 19% of last-level cache blocks for selected programs from the PARSEC benchmark suite and only saw an average 2.8% increase in the rate of last-level cache misses.
Dialogue systems have attracted more and more attention. Recent advances in dialogue systems are overwhelmingly contributed by deep learning techniques, which have been employed to enhance a wide range of big data applications such as computer vision, natural language processing, and recommender systems. For dialogue systems, deep learning can leverage a massive amount of data to learn meaningful feature representations and response generation strategies, while requiring a minimum amount of hand-crafting. In this article, we give an overview of these recent advances in dialogue systems from various perspectives and discuss some possible research directions. In particular, we broadly divide existing dialogue systems into task-oriented and non-task-oriented models, then detail how deep learning techniques help them with representative algorithms, and finally discuss some appealing research directions that can bring dialogue system research to a new frontier.
We present an empirical study of active learning for Visual Question Answering, where a deep VQA model selects informative question-image pairs from a pool and queries an oracle for answers to maximally improve its performance under a limited query budget. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a fast and effective goal-driven active learning scoring function to pick question-image pairs for deep VQA models under the Bayesian Neural Network framework. We find that deep VQA models need large amounts of training data before they can start asking informative questions. But once they do, all three approaches outperform the random selection baseline and achieve significant query savings. For the scenario where the model is allowed to ask generic questions about images but is evaluated only on specific questions (e.g., questions whose answer is either yes or no), our proposed goal-driven scoring function performs the best.
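The entropy-based ("cramming") strategy mentioned above can be sketched as follows. This is a minimal illustration, assuming the model exposes a predictive distribution over answers for each unlabeled question-image pair; the function names and the toy `pool` are hypothetical, and the expected-model-change and expected-error-reduction scores are not shown.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_by_entropy(pool_probs, budget):
    """Return indices of the `budget` pool items with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical predicted answer distributions for four pool items.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform, highest entropy
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]
chosen = select_by_entropy(pool, budget=2)
print(chosen)
```

The selected items would then be sent to the oracle for answers and added to the training set before the model is retrained.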
Kernel PCA is a widely used nonlinear dimension reduction technique in machine learning, but storing the kernel matrix is notoriously challenging when the sample size is large. Inspired by [YPCC16], where the idea of partial matrix sampling followed by nonconvex optimization is proposed for matrix completion and robust PCA, we apply a similar approach to memory-efficient Kernel PCA. In theory, with no assumptions on the kernel matrix in terms of eigenvalues or eigenvectors, we establish a model-free theory for the low-rank approximation based on any local minimum of the proposed objective function. As interesting byproducts, when the underlying positive semidefinite matrix is assumed to be low-rank and highly structured, corollaries of our main theorem improve the state-of-the-art results [GLM16, GJZ17] for nonconvex matrix completion with no spurious local minima. Numerical experiments also show that our approach is competitive in terms of approximation accuracy compared to the well-known Nyström algorithm for Kernel PCA.
It is expected that progress toward true artificial intelligence will be achieved through the emergence of a system that integrates representation learning and complex reasoning (LeCun et al. 2015). In response to this prediction, research has been conducted on implementing the symbolic reasoning of a von Neumann computer in an artificial neural network (Graves et al. 2016; Graves et al. 2014; Reed et al. 2015). However, these studies have many limitations in realizing neural-symbolic integration (Jaeger 2016). Here, we present a new learning paradigm: a learning solving procedure (LSP) that learns the procedure for solving complex problems. This is not accomplished merely by learning input-output data, but by learning algorithms through a solving procedure that obtains the output as a sequence of tasks for a given input problem. The LSP neural network system learns not only simple problems of addition and multiplication, but also the algorithms of complicated problems, such as complex arithmetic expressions, sorting, and the Tower of Hanoi. To realize this, the LSP neural network structure consists of a deep neural network and long short-term memory, which are recursively combined. Through experimentation, we demonstrate the efficiency and scalability of LSP and its validity as a mechanism of complex reasoning.
Many deployed learned models are black boxes: given an input, they return an output. Internal information about the model, such as the architecture, optimisation procedure, or training data, is not disclosed explicitly as it might contain proprietary information or make the system more vulnerable. This work shows that such attributes of neural networks can be exposed from a sequence of queries. This has multiple implications. On the one hand, our work exposes the vulnerability of black-box neural networks to different types of attacks: we show that the revealed internal information helps generate more effective adversarial examples against the black-box model. On the other hand, this technique can be used for better protection of private content from automatic recognition models using adversarial examples. Our paper suggests that it is actually hard to draw a line between white-box and black-box models.
Model-free decentralized optimization and learning are receiving increasing attention from theoretical and practical perspectives. In particular, two fully decentralized learning algorithms, namely Trial and Error (TEL) and Optimal Dynamical Learning (ODL), are very appealing for a broad class of games. ODL spends a high proportion of time in an optimum state that maximizes the sum of the utilities of all players. TEL does the same if there is a Pure Nash Equilibrium (PNE); otherwise, it spends a high proportion of time in an optimum state that maximizes a tradeoff between the sum of the utilities of all players and a predefined stability function. However, estimating the mean fraction of time spent in the optimum state (as well as the mean time needed to reach it) is challenging due to the high complexity and dimension of the underlying Markov chains. In this paper, under a specific system model, we evaluate these performance metrics by proposing an approximation of the considered Markov chains, which overcomes the problem of high dimensionality. We then compare the two algorithms, which gives a better understanding of their performance.
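For a small ergodic Markov chain, the long-run fraction of time spent in a state equals its stationary probability, which can be computed directly. The sketch below does this by power iteration on a hypothetical two-state chain; the transition matrix is invented for illustration and is not the chain induced by TEL or ODL, whose dimension is what makes the direct approach impractical.

```python
def stationary_distribution(P, iters=1000):
    """Approximate the stationary distribution of row-stochastic P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of pi <- pi P.
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical chain: state 0 plays the role of the "optimum" state.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi = stationary_distribution(P)
print(pi)  # pi[0] is the long-run fraction of time spent in state 0
```

Solving the balance equation 0.1 * pi[0] = 0.5 * pi[1] by hand gives pi = (5/6, 1/6), which the iteration reproduces. The cost of each step grows with the square of the number of states, which motivates the approximation of the chain discussed in the abstract.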
Sparse regularization such as ℓ1 regularization is a powerful and widely used strategy for high-dimensional learning problems. The effectiveness of sparse regularization has been supported practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. Ordinary ℓ1 regularization often selects variables correlated with each other, which results in deterioration of not only its generalization error but also its interpretability. In this paper, we propose a new regularization method, ‘Independently Interpretable Lasso’ (IILasso for short). Our proposed regularizer suppresses the selection of correlated variables, and thus each active variable independently affects the objective variable in the model. Hence, we can interpret regression coefficients intuitively and also improve the performance by avoiding overfitting. We analyze the theoretical properties of IILasso and show that the proposed method is advantageous for sign recovery and achieves an almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of IILasso.
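As background, the ordinary Lasso estimator for a response $y \in \mathbb{R}^n$ and design matrix $X \in \mathbb{R}^{n \times p}$ is written out below. The second display is only one illustrative way a penalty could discourage the joint selection of correlated variables; the weights $R_{jk}$, the parameter $\alpha$, and the exact form are assumptions made here for illustration and are not stated in the abstract.

```latex
% Ordinary Lasso (standard definition)
\hat{\beta}_{\mathrm{lasso}}
  = \arg\min_{\beta \in \mathbb{R}^p}
    \frac{1}{2n} \lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1

% Illustrative correlation-aware penalty (assumed form):
% R_{jk} \ge 0 grows with the absolute correlation between features j and k.
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^p}
    \frac{1}{2n} \lVert y - X\beta \rVert_2^2
    + \lambda \Bigl( \lVert \beta \rVert_1
    + \frac{\alpha}{2} \sum_{j \neq k} R_{jk}\, \lvert \beta_j \rvert\, \lvert \beta_k \rvert \Bigr)
```

Under a penalty of this shape, two strongly correlated features incur an extra cost only when both coefficients are nonzero, so the estimator tends to keep at most one of them active.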
This paper presents an automated approach for interpretable feature recommendation for solving signal data analytics problems. The method has been tested by performing experiments on datasets in the domain of prognostics, where interpretation of features is considered very important. The proposed approach is based on the Wide Learning architecture and provides means for interpretation of the recommended features. It is to be noted that such an interpretation is not available with feature learning approaches like Deep Learning (such as Convolutional Neural Networks) or feature transformation approaches like Principal Component Analysis. Results show that the feature recommendation and interpretation techniques are quite effective for the problems at hand in terms of performance and a drastic reduction in the time needed to develop a solution. It is further shown by an example how this human-in-the-loop interpretation system can be used as a prescriptive system.
Neural networks (NNs) have begun to have a pervasive impact on various applications of machine learning. However, the problem of finding an optimal NN architecture for large applications has remained open for several decades. Conventional approaches search for the optimal NN architecture through extensive trial-and-error. Such a procedure is quite inefficient. In addition, the generated NN architectures incur substantial redundancy. To address these problems, we propose an NN synthesis tool (NeST) that automatically generates very compact architectures for a given dataset. NeST starts with a seed NN architecture. It iteratively tunes the architecture with gradient-based growth and magnitude-based pruning of neurons and connections. Our experimental results show that NeST yields accurate yet very compact NNs with a wide range of seed architecture selection. For example, for the LeNet-300-100 (LeNet-5) NN architecture derived from the MNIST dataset, we reduce network parameters by 34.1x (74.3x) and floating-point operations (FLOPs) by 35.8x (43.7x). For the AlexNet NN architecture derived from the ImageNet dataset, we reduce network parameters by 15.7x and FLOPs by 4.6x. All these results are the current state-of-the-art for these architectures.
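The magnitude-based pruning step can be illustrated with a minimal sketch, assuming a single dense weight matrix stored as nested lists and a fixed target sparsity. The function name, the `sparsity` parameter, and the one-shot thresholding are hypothetical simplifications; the gradient-based growth phase and the iterative schedule described above are not shown.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with smallest magnitude.

    Weights tied with the threshold magnitude are also removed, so slightly
    more than the requested fraction may be pruned when ties occur.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to remove
    if k == 0:
        return [list(row) for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Hypothetical 2x3 weight matrix.
W = [
    [0.5, -0.05, 0.3],
    [-0.01, 0.8, -0.2],
]
W_pruned = magnitude_prune(W, sparsity=0.5)
print(W_pruned)
```

In an iterative scheme, the surviving weights would typically be retrained after each pruning round so that the network can recover accuracy before further connections are removed.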
Nonnegative matrix factorization (NMF) is a powerful tool for data mining. However, the emergence of ‘big data’ has severely challenged our ability to compute this fundamental decomposition using deterministic algorithms. This paper presents a randomized hierarchical alternating least squares (HALS) algorithm to compute the NMF. By deriving a smaller matrix from the nonnegative input data, a more efficient nonnegative decomposition can be computed. Our algorithm scales to big data applications while attaining a near-optimal factorization, i.e., the algorithm scales with the target rank of the data rather than the ambient dimension of measurement space. The proposed algorithm is evaluated using synthetic and real world data and shows substantial speedups compared to deterministic HALS.