Detecting novelty of an entire document is an Artificial Intelligence (AI) frontier problem that has widespread NLP applications, such as extractive document summarization, tracking development of news events, predicting impact of scholarly articles, etc. Important though the problem is, we are unaware of any benchmark document level data that correctly addresses the evaluation of automatic novelty detection techniques in a classification framework. To bridge this gap, we present here a resource for benchmarking the techniques for document level novelty detection. We create the resource via event-specific crawling of news documents across several domains in a periodic manner. We release the annotated corpus with necessary statistics and show its use with a developed system for the problem in concern.
Most methods for learning causal structures from non-experimental data rely on some assumptions of simplicity, the most famous of which is known as the Faithfulness condition. Without assuming such conditions to begin with, we develop a learning theory for inferring the structure of a causal Bayesian network, and we use the theory to provide a novel justification of a certain assumption of simplicity that is closely related to Faithfulness. Here is the idea. With only the Markov and IID assumptions, causal learning is notoriously too hard to achieve statistical consistency but we show that it can still achieve a quite desirable ‘combined’ mode of stochastic convergence to the truth: having almost sure convergence to the true causal hypothesis with respect to almost all causal Bayesian networks, together with a certain kind of locally uniform convergence. Furthermore, every learning algorithm achieving at least that joint mode of convergence has this property: having stochastic convergence to the truth with respect to a causal Bayesian network $N$ only if $N$ satisfies a certain variant of Faithfulness, known as Pearl’s Minimality condition—as if the learning algorithm were designed by assuming that condition. This explains, for the first time, why it is not merely optional but mandatory to assume the Minimality condition—or to proceed as if we assumed it.
We consider the forecast aggregation problem in repeated settings, where the forecasts are done on a binary event. At each period multiple experts provide forecasts about an event. The goal of the aggregator is to aggregate those forecasts into a subjective accurate forecast. We assume that experts are Bayesian; namely they share a common prior, each expert is exposed to some evidence, and each expert applies Bayes rule to deduce his forecast. The aggregator is ignorant with respect to the information structure (i.e., distribution over evidence) according to which experts make their prediction. The aggregator observes the experts’ forecasts only. At the end of each period the actual state is realized. We focus on the question whether the aggregator can learn to aggregate optimally the forecasts of the experts, where the optimal aggregation is the Bayesian aggregation that takes into account all the information (evidence) in the system. We consider the class of partial evidence information structures, where each expert is exposed to a different subset of conditionally independent signals. Our main results are positive; We show that optimal aggregation can be learned in polynomial time in a quite wide range of instances of the partial evidence environments. We provide a tight characterization of the instances where learning is possible and impossible.
Deep neural networks, although shown to be a successful class of machine learning algorithms, are known to be extremely unstable to adversarial perturbations. Improving the robustness of neural networks against these attacks is important, especially for security-critical applications. To defend against such attacks, we propose dividing the input image into multiple patches, denoising each patch independently, and reconstructing the image, without losing significant image content. This proposed defense mechanism is non-differentiable which makes it non-trivial for an adversary to apply gradient-based attacks. Moreover, we do not fine-tune the network with adversarial examples, making it more robust against unknown attacks. We present a thorough analysis of the tradeoff between accuracy and robustness against adversarial attacks. We evaluate our method under black-box, grey-box, and white-box settings. The proposed method outperforms the state-of-the-art by a significant margin on the ImageNet dataset under grey-box attacks while maintaining good accuracy on clean images. We also establish a strong baseline for a novel white-box attack.
As the industry deploys increasingly large and complex neural networks to mobile devices, more pressure is put on the memory and compute resources of those devices. Deep compression, or compression of deep neural network weight matrices, is a technique to stretch resources for such scenarios. Existing compression methods cannot effectively compress models smaller than 1-2% of their original size. We develop a new compression technique, DeepThin, building on existing research in the area of low rank factorization. We identify and break artificial constraints imposed by low rank approximations by combining rank factorization with a reshaping process that adds nonlinearity to the approximation function. We deploy DeepThin as a plug-gable library integrated with TensorFlow that enables users to seamlessly compress models at different granularities. We evaluate DeepThin on two state-of-the-art acoustic models, TFKaldi and DeepSpeech, comparing it to previous compression work (Pruning, HashNet, and Rank Factorization), empirical limit study approaches, and hand-tuned models. For TFKaldi, our DeepThin networks show better word error rates (WER) than competing methods at practically all tested compression rates, achieving an average of 60% relative improvement over rank factorization, 57% over pruning, 23% over hand-tuned same-size networks, and 6% over the computationally expensive HashedNets. For DeepSpeech, DeepThin-compressed networks achieve better test loss than all other compression methods, reaching a 28% better result than rank factorization, 27% better than pruning, 20% better than hand-tuned same-size networks, and 12% better than HashedNets. DeepThin also provide inference performance benefits ranging from 2X to 14X speedups, depending on the compression ratio and platform cache sizes.
Availability of high performance computing infrastructures such as clusters of GPUs and CPUs have fueled the growth of distributed learning systems. Deep Learning frameworks express neural nets as DAGs and execute these DAGs on computation resources such as GPUs. In this paper, we propose efficient designs of embedding MPI collective operations into data parallel DAGs. Incorrect designs can easily lead to deadlocks or program crashes. In particular, we demonstrate three designs: Funneled, Concurrent communication and Dependency chaining of using MPI collectives with DAGs. These designs automatically enable overlap of computation with communication by allowing for concurrent execution with the other tasks. We directly implement these designs into the KVStore API of the MXNET. This allows us to directly leverage the rest of the infrastructure. Using ImageNet and CIFAR data sets, we show the potential of our designs. In particular, our designs scale to 256 GPUs with as low as 50 seconds of epoch times for ImageNet 1K datasets.
There are now several large scale deployments of differential privacy used to track statistical information about users. However, these systems periodically recollect the data and recompute the statistics using algorithms designed for a single use and as a result do not provide meaningful privacy guarantees over long time scales. Moreover, existing techniques to mitigate this effect do not apply in the ‘local’ model of differential privacy that these systems use. In this paper, we introduce a new local differential privacy technique to maintain persistently up-to-date statistics over time, with privacy guarantees scaling only with the number of changes in the underlying distribution rather than the number of collection periods. The key ideas include batching time into epochs — varying the epoch size allows us to trade off accuracy against frequency of updates — and a protocol for users to ‘vote’ to update out-of-date statistics while losing very little privacy. We prove our main results for the setting where users hold a single bit, redrawn at every time period, from a common (but changing) distribution; however, our framework is quite general and we give an application to frequency and heavy-hitter estimation.
This report surveys the landscape of potential security threats from malicious uses of AI, and proposes ways to better forecast, prevent, and mitigate these threats. After analyzing the ways in which AI may influence the threat landscape in the digital, physical, and political domains, we make four high-level recommendations for AI researchers and other stakeholders. We also suggest several promising areas for further research that could expand the portfolio of defenses, or make attacks less effective or harder to execute. Finally, we discuss, but do not conclusively resolve, the long-term equilibrium of attackers and defenders.
The article presents an overview of current specialized ontology engineering tools, as well as texts’ annotation tools based on ontologies. The main functions and features of these tools, their advantages and disadvantages are discussed. A systematic comparative analysis of means for engineering ontologies is presented.
We present a neural model for question generation from knowledge base triples in a ‘Zero-Shot’ setup, that is generating questions for triples containing predicates, subject types or object types that were not seen at training time. Our model leverages triples occurrences in the natural language corpus in an encoder-decoder architecture, paired with an original part-of-speech copy action mechanism to generate questions. Benchmark and human evaluation show that our model sets a new state-of-the-art for zero-shot QG.
The difficulties in matching the latent posterior to the prior, balancing powerful posteriors with computational efficiency, and the reduced flexibility of data likelihoods are the biggest challenges in the advancement of Variational Autoencoders. We show that these issues arise due to struggles in marginal divergence minimization, and explore an alternative to using conditional distributions that is inspired by Generative Adversarial Networks. The class probability estimation that GANs offer for marginal divergence minimization uncovers a family of VAE-GAN hybrids, which offer the promise of addressing these major challenges in variational inference. We systematically explore the solutions available for distribution matching, but show that these hybrid methods do not fulfill this promise, and the trade-off between generation and inference that they give rise to remains an ongoing research topic.
Matrix factorization methods – including Factor analysis (FA), and Principal Components Analysis (PCA) – are widely used for inferring and summarizing structure in multivariate data. Many matrix factorization methods exist, corresponding to different assumptions on the elements of the underlying matrix factors. For example, many recent methods use a penalty or prior distribution to achieve sparse representations (‘Sparse FA/PCA’). Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it uses the observed data to estimate prior distributions on matrix elements. We derive a correspondingly-general variational fitting algorithm, which reduces fitting EBMF to solving a simpler problem – the so-called ‘normal means’ problem. We implement this general algorithm, but focus particular attention on the use of sparsity-inducing priors that are uni-modal at 0. This yields a sparse EBMF approach – essentially a version of sparse FA/PCA – that automatically adapts the amount of sparsity to the data. We demonstrate the benefits of our approach through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that concords with known relationships among human tissues. Software implementing our approach is available at https://…/flashr.
Factorial designs are frequently used in different fields of science, e.g. psychological, medical or biometric studies. Standard approaches, as the ANOVA $F$-test, make different assumptions on the distribution of the error terms, the variances or the sample sizes in the different groups. Because of time constraints or a lack of statistical background, many users do not check these assumptions; enhancing the risk of potentially inflated type-$I$ error rates or a substantial loss of power. It is the aim of the present paper, to give an overview of different methods without such restrictive assumptions and to identify situations in which one method is superior compared to others. In particular, after summarizing their underlying assumptions, the different approaches are compared within extensive simulations. To also address the current discussion about redefining the statistical significance level, we also included simulations for the 0.5\% level.
It is common knowledge that there is no single best strategy for graph clustering, which justifies a plethora of existing approaches. In this paper, we present a general memetic algorithm, VieClus, to tackle the graph clustering problem. This algorithm can be adapted to optimize different objective functions. A key component of our contribution are natural recombine operators that employ ensemble clusterings as well as multi-level techniques. Lastly, we combine these techniques with a scalable communication protocol, producing a system that is able to compute high-quality solutions in a short amount of time. We instantiate our scheme with local search for modularity and show that our algorithm successfully improves or reproduces all entries of the 10th DIMACS implementation~challenge under consideration using a small amount of time.
Deep learning models often have more parameters than observations, and still perform well. This is sometimes described as a paradox. In this work, we show experimentally that despite their huge number of parameters, deep neural networks can compress the data losslessly even when taking the cost of encoding the parameters into account. Such a compression viewpoint originally motivated the use of variational methods in neural networks. However, we show that these variational methods provide surprisingly poor compression bounds, despite being explicitly built to minimize such bounds. This might explain the relatively poor practical performance of variational methods in deep learning. Better encoding methods, imported from the Minimum Description Length (MDL) toolbox, yield much better compression values on deep networks, corroborating the hypothesis that good compression on the training set correlates with good test performance.
This paper is the first work to propose a network to predict a structured uncertainty distribution for a reconstructed image. Our novel model learns to predict a full Gaussian covariance matrix for each reconstruction, which permits efficient sampling and likelihood evaluation. We demonstrate that our model can accurately reconstruct ground truth correlated residual distributions for synthetic datasets and generate plausible high frequency samples for real face images. We also illustrate the use of these predicted covariances for structure preserving image denoising.
Relative worst-order analysis is a technique for assessing the relative quality of online algorithms. We survey the most important results obtained with this technique and compare it with other quality measures.
It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.
Genetic Programming (GP) is an evolutionary algorithm commonly used for machine learning tasks. In this paper we present a method that allows GP to transform the representation of a large-scale machine learning dataset into a more compact representation, by means of processing features from the original representation at individual level. We develop as a proof of concept of this method an autoencoder. We tested a preliminary version of our approach in a variety of well-known machine learning image datasets. We speculate that this method, used in an iterative manner, can produce results competitive with state-of-art deep neural networks.
This survey article gives an elementary introduction to the algebraic approach to Markov process duality, as opposed to the pathwise approach. In the algebraic approach, a Markov generator is written as the sum of products of simpler operators, which each have a dual with respect to some duality function. We discuss at length the recent suggestion by Giardin\`a, Redig, and others, that it may be a good idea to choose these simpler operators in such a way that they form an irreducible representation of some known Lie algebra. In particular, we collect the necessary background on representations of Lie algebras that is crucial for this approach. We also discuss older work by Lloyd and Sudbury on duality functions of product form and the relation between intertwining and duality.
Multi-output regression models must exploit dependencies between outputs to maximise predictive performance. The application of Gaussian processes (GPs) to this setting typically yields models that are computationally demanding and have limited representational power. We present the Gaussian Process Autoregressive Regression (GPAR) model, a scalable multi-output GP model that is able to capture nonlinear, possibly input-varying, dependencies between outputs in a simple and tractable way: the product rule is used to decompose the joint distribution over the outputs into a set of conditionals, each of which is modelled by a standard GP. GPAR’s efficacy is demonstrated on a variety of synthetic and real-world problems, outperforming existing GP models and achieving state-of-the-art performance on the tasks with existing benchmarks.
A generative model may generate utter nonsense when it is fit to maximize the likelihood of observed data. This happens due to ‘model error,’ i.e., when the true data generating distribution does not fit within the class of generative models being learned. To address this, we propose a model of active distribution learning using a binary invalidity oracle that identifies some examples as clearly invalid, together with random positive examples sampled from the true distribution. The goal is to maximize the likelihood of the positive examples subject to the constraint of (almost) never generating examples labeled invalid by the oracle. Guarantees are agnostic compared to a class of probability distributions. We show that, while proper learning often requires exponentially many queries to the invalidity oracle, improper distribution learning can be done using polynomially many queries.