Deep neural networks have demonstrated impressive performance in various machine learning tasks. However, they are notoriously sensitive to changes in data distribution. Often, even a slight change in the distribution can lead to drastic performance reduction. Artificially augmenting the data may help to some extent, but in most cases, fails to achieve model invariance to the data distribution. Some examples where this sub-class of domain adaptation can be valuable are various imaging modalities such as thermal imaging, X-ray, ultrasound, and MRI, where changes in acquisition parameters or acquisition device manufacturer will result in different representation of the same input. Our work shows that standard finetuning fails to adapt the model in certain important cases. We propose a novel method of adapting to a new data source, and demonstrate near perfect adaptation on a customized ImageNet benchmark.
Despite their impressive performance, deep convolutional neural networks (CNNs) have been shown to be sensitive to small adversarial perturbations. These nuisances, which one can barely notice, are powerful enough to fool sophisticated and well performing classifiers, leading to ridiculous misclassification results. In this paper we analyze the stability of state-of-the-art classification machines to adversarial perturbations, where we assume that the signals belong to the (possibly multi-layer) sparse representation model. We start with convolutional sparsity and then proceed to its multi-layered version, which is tightly connected to CNNs. Our analysis links between the stability of the classification to noise and the underlying structure of the signal, quantified by the sparsity of its representation under a fixed dictionary. Our claims can be translated to a practical regularization term that provides a new interpretation to the robustness of Parseval Networks. Also, the proposed theory justifies the increased stability of the recently emerging layered basis pursuit architectures, when compared to the classic forward-pass.
We often desire our models to be interpretable as well as accurate. Prior work on optimizing models for interpretability has relied on easy-to-quantify proxies for interpretability, such as sparsity or the number of operations required. In this work, we optimize for interpretability by directly including humans in the optimization loop. We develop an algorithm that minimizes the number of user studies to find models that are both predictive and interpretable and demonstrate our approach on several data sets. Our human subjects results show trends towards different proxy notions of interpretability on different datasets, which suggests that different proxies are preferred on different tasks.
Statistical analysis is often used to evaluate the evidence for or against scientific hypotheses, and various statistics (e.g., p-values, likelihood ratios, Bayes factors) are interpreted as measures of evidence strength. Here I consider evidence measurement from the point of view of representational measurement theory, and argue that familiar evidence statistics do not conform to any legitimate measurement scale type. I then consider the notion of an absolute scale for evidence measurement, in a sense to be defined, focusing particularly on the notion of absolute 0 evidence, which turns out to be something other than what one might have expected.
We study the effect of imperfect training data labels on the performance of classification methods. In a general setting, where the probability that an observation in the training dataset is mislabelled may depend on both the feature vector and the true label, we bound the excess risk of an arbitrary classifier trained with imperfect labels in terms of its excess risk for predicting a noisy label. This reveals conditions under which a classifier trained with imperfect labels remains consistent for classifying uncorrupted test data points. Furthermore, under stronger conditions, we derive detailed asymptotic properties for the popular $k$-nearest neighbour ($k$nn), Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) classifiers. One consequence of these results is that the $k$nn and SVM classifiers are robust to imperfect training labels, in the sense that the rate of convergence of the excess risks of these classifiers remains unchanged; in fact, it even turns out that in some cases, imperfect labels may improve the performance of these methods. On the other hand, the LDA classifier is shown to be typically inconsistent in the presence of label noise unless the prior probabilities of each class are equal. Our theoretical results are supported by a simulation study.
We study regression using functional predictors in situations where these functions contain both phase and amplitude variability. In other words, the functions are misaligned due to errors in time measurements, and these errors can significantly degrade both model estimation and prediction performance. The current techniques either ignore the phase variability, or handle it via pre-processing, i.e., use an off-the-shelf technique for functional alignment and phase removal. We develop a functional principal component regression model which has comprehensive approach in handling phase and amplitude variability. The model utilizes a mathematical representation of the data known as the square-root slope function. These functions preserve the $\mathbf{L}^2$ norm under warping and are ideally suited for simultaneous estimation of regression and warping parameters. Using both simulated and real-world data sets, we demonstrate our approach and evaluate its prediction performance relative to current models. In addition, we propose an extension to functional logistic and multinomial logistic regression
There has been a large amount of interest, both in the past and particularly recently, into the power of different families of universal approximators, e.g. ReLU networks, polynomials, rational functions. However, current research has focused almost exclusively on understanding this problem in a worst-case setting, e.g. bounding the error of the best infinity-norm approximation in a box. In this setting a high-degree polynomial is required to even approximate a single ReLU. However, in real applications with high dimensional data we expect it is only important to approximate the desired function well on certain relevant parts of its domain. With this motivation, we analyze the ability of neural networks and polynomial kernels of bounded degree to achieve good statistical performance on a simple, natural inference problem with sparse latent structure. We give almost-tight bounds on the performance of both neural networks and low degree polynomials for this problem. Our bounds for polynomials involve new techniques which may be of independent interest and show major qualitative differences with what is known in the worst-case setting.
This work studies the problem of learning under both large data and large feature space scenarios. The feature information is assumed to be spread across agents in a network, where each agent observes some of the features. Through local cooperation, the agents are supposed to interact with each other to solve the inference problem and converge towards the global minimizer of the empirical risk. We study this problem exclusively in the primal domain, and propose new and effective distributed solutions with guaranteed convergence to the minimizer. This is achieved by combining a dynamic diffusion construction, a pipeline strategy, and variance-reduced techniques. Simulation results illustrate the conclusions.
In this work, a novel sequential Monte Carlo filter is introduced which aims at efficient sampling of high-dimensional state spaces with a limited number of particles. Particles are pushed forward from the prior to the posterior density using a sequence of mappings that minimizes the Kullback-Leibler divergence between the posterior and the sequence of intermediate densities. The sequence of mappings represents a gradient flow. A key ingredient of the mappings is that they are embedded in a reproducing kernel Hilbert space, which allows for a practical and efficient algorithm. The embedding provides a direct means to calculate the gradient of the Kullback-Leibler divergence leading to quick convergence using well-known gradient-based stochastic optimization algorithms. Evaluation of the method is conducted in the chaotic Lorenz-63 system, the Lorenz-96 system, which is a coarse prototype of atmospheric dynamics, and an epidemic model that describes cholera dynamics. No resampling is required in the mapping particle filter even for long recursive sequences. The number of effective particles remains close to the total number of particles in all the experiments.
Semi-supervised learning on graph structured data has received significant attention with the recent introduction of graph convolution networks (GCN). While traditional methods have focused on optimizing a loss augmented with Laplacian regularization framework, GCNs perform an implicit Laplacian type regularization to capture local graph structure. In this work, we propose Lovasz convolutional network (LCNs) which are capable of incorporating global graph properties. LCNs achieve this by utilizing Lovasz’s orthonormal embeddings of the nodes. We analyse local and global properties of graphs and demonstrate settings where LCNs tend to work better than GCNs. We validate the proposed method on standard random graph models such as stochastic block models (SBM) and certain community structure based graphs where LCNs outperform GCNs and learn more intuitive embeddings. We also perform extensive binary and multi-class classification experiments on real world datasets to demonstrate LCN’s effectiveness. In addition to simple graphs, we also demonstrate the use of LCNs on hypergraphs by identifying settings where they are expected to work better than GCNs.
Variational Auto-Encoders (VAEs) have become very popular techniques to perform inference and learning in latent variable models as they allow us to leverage the rich representational power of neural networks to obtain flexible approximations of the posterior of latent variables as well as tight evidence lower bounds (ELBOs). Combined with stochastic variational inference, this provides a methodology scaling to large datasets. However, for this methodology to be practically efficient, it is necessary to obtain low-variance unbiased estimators of the ELBO and its gradients with respect to the parameters of interest. While the use of Markov chain Monte Carlo (MCMC) techniques such as Hamiltonian Monte Carlo (HMC) has been previously suggested to achieve this [23, 26], the proposed methods require specifying reverse kernels which have a large impact on performance. Additionally, the resulting unbiased estimator of the ELBO for most MCMC kernels is typically not amenable to the reparameterization trick. We show here how to optimally select reverse kernels in this setting and, by building upon Hamiltonian Importance Sampling (HIS) [17], we obtain a scheme that provides low-variance unbiased estimators of the ELBO and its gradients using the reparameterization trick. This allows us to develop a Hamiltonian Variational Auto-Encoder (HVAE). This method can be reinterpreted as a target-informed normalizing flow [20] which, within our context, only requires a few evaluations of the gradient of the sampled likelihood and trivial Jacobian calculations at each iteration.
Even though probabilistic treatments of neural networks have a long history, they have not found widespread use in practice. Sampling approaches are often too slow already for simple networks. The size of the inputs and the depth of typical CNN architectures in computer vision only compound this problem. Uncertainty in neural networks has thus been largely ignored in practice, despite the fact that it may provide important information about the reliability of predictions and the inner workings of the network. In this paper, we introduce two lightweight approaches to making supervised learning with probabilistic deep networks practical: First, we suggest probabilistic output layers for classification and regression that require only minimal changes to existing networks. Second, we employ assumed density filtering and show that activation uncertainties can be propagated in a practical fashion through the entire network, again with minor changes. Both probabilistic networks retain the predictive power of the deterministic counterpart, but yield uncertainties that correlate well with the empirical error induced by their predictions. Moreover, the robustness to adversarial examples is significantly increased.
The use of ensembles of neural networks (NNs) for the quantification of predictive uncertainty is widespread. However, the current justification is intuitive rather than analytical. This work proposes one minor modification to the normal ensembling methodology, which we prove allows the ensemble to perform Bayesian inference, hence converging to the corresponding Gaussian Process as both the total number of NNs, and the size of each, tend to infinity. This working paper provides early-stage results in a reinforcement learning setting, analysing the practicality of the technique for an ensemble of small, finite number. Using the uncertainty estimates they produce to govern the exploration-exploitation process results in steadier, more stable learning.
This paper introduces Wasserstein variational inference, a new form of approximate Bayesian inference based on optimal transport theory. Wasserstein variational inference uses a new family of divergences that includes both f-divergences and the Wasserstein distance as special cases. The gradients of the Wasserstein variational loss are obtained by backpropagating through the Sinkhorn iterations. This technique results in a very stable likelihood-free training method that can be used with implicit distributions and probabilistic programs. Using the Wasserstein variational inference framework, we introduce several new forms of autoencoders and test their robustness and performance against existing variational autoencoding techniques.
Model compression has gained a lot of attention due to its ability to reduce hardware resource requirements significantly while maintaining accuracy of DNNs. Model compression is especially useful for memory-intensive recurrent neural networks because smaller memory footprint is crucial not only for reducing storage requirement but also for fast inference operations. Quantization is known to be an effective model compression method and researchers are interested in minimizing the number of bits to represent parameters. In this work, we introduce an iterative technique to apply quantization, presenting high compression ratio without any modifications to the training algorithm. In the proposed technique, weight quantization is followed by retraining the model with full precision weights. We show that iterative retraining generates new sets of weights which can be quantized with decreasing quantization loss at each iteration. We also show that quantization is efficiently able to leverage pruning, another effective model compression method. Implementation issues on combining the two methods are also addressed. Our experimental results demonstrate that an LSTM model using 1-bit quantized weights is sufficient for PTB dataset without any accuracy degradation while previous methods demand at least 2-4 bits for quantized weights.
Area under the receiver operating characteristics curve (AUC) is an important metric for a wide range of signal processing and machine learning problems, and scalable methods for optimizing AUC have recently been proposed. However, handling very large datasets remains an open challenge for this problem. This paper proposes a novel approach to AUC maximization, based on sampling mini-batches of positive/negative instance pairs and computing U-statistics to approximate a global risk minimization problem. The resulting algorithm is simple, fast, and learning-rate free. We show that the number of samples required for good performance is independent of the number of pairs available, which is a quadratic function of the positive and negative instances. Extensive experiments show the practical utility of the proposed method.
This paper considers distributed statistical inference for general symmetric statistics %that encompasses the U-statistics and the M-estimators in the context of massive data where the data can be stored at multiple platforms in different locations. In order to facilitate effective computation and to avoid expensive communication among different platforms, we formulate distributed statistics which can be conducted over smaller data blocks. The statistical properties of the distributed statistics are investigated in terms of the mean square error of estimation and asymptotic distributions with respect to the number of data blocks. In addition, we propose two distributed bootstrap algorithms which are computationally effective and are able to capture the underlying distribution of the distributed statistics. Numerical simulation and real data applications of the proposed approaches are provided to demonstrate the empirical performance.
Fairness-aware learning is increasingly important in data mining. Discrimination prevention aims to prevent discrimination in the training data before it is used to conduct predictive analysis. In this paper, we focus on fair data generation that ensures the generated data is discrimination free. Inspired by generative adversarial networks (GAN), we present fairness-aware generative adversarial networks, called FairGAN, which are able to learn a generator producing fair data and also preserving good data utility. Compared with the naive fair data generation models, FairGAN further ensures the classifiers which are trained on generated data can achieve fair classification on real data. Experiments on a real dataset show the effectiveness of FairGAN.
We present Value Propagation (VProp), a parameter-efficient differentiable planning module built on Value Iteration which can successfully be trained using reinforcement learning to solve unseen tasks, has the capability to generalize to larger map sizes, and can learn to navigate in dynamic environments. Furthermore, we show that the module enables learning to plan when the environment also includes stochastic elements, providing a cost-efficient learning system to build low-level size-invariant planners for a variety of interactive navigation problems. We evaluate on static and dynamic configurations of MazeBase grid-worlds, with randomly generated environments of several different sizes, and on a StarCraft navigation scenario, with more complex dynamics, and pixels as input.
Supervised machine learning based state-of-the-art computer vision techniques are in general data hungry and pose the challenges of not having adequate computing resources and of high costs involved in human labeling efforts. Training data subset selection and active learning techniques have been proposed as possible solutions to these challenges respectively. A special class of subset selection functions naturally model notions of diversity, coverage and representation and they can be used to eliminate redundancy and thus lend themselves well for training data subset selection. They can also help improve the efficiency of active learning in further reducing human labeling efforts by selecting a subset of the examples obtained using the conventional uncertainty sampling based techniques. In this work we empirically demonstrate the effectiveness of two diversity models, namely the Facility-Location and Disparity-Min models for training-data subset selection and reducing labeling effort. We do this for a variety of computer vision tasks including Gender Recognition, Scene Recognition and Object Recognition. Our results show that subset selection done in the right way can add 2-3% in accuracy on existing baselines, particularly in the case of less training data. This allows the training of complex machine learning models (like Convolutional Neural Networks) with much less training data while incurring minimal performance loss.
Semi-implicit variational inference (SIVI) is introduced to expand the commonly used analytic variational distribution family, by mixing the variational parameter with a flexible distribution. This mixing distribution can assume any density function, explicit or not, as long as independent random samples can be generated via reparameterization. Not only does SIVI expand the variational family to incorporate highly flexible variational distributions, including implicit ones that have no analytic density functions, but also sandwiches the evidence lower bound (ELBO) between a lower bound and an upper bound, and further derives an asymptotically exact surrogate ELBO that is amenable to optimization via stochastic gradient ascent. With a substantially expanded variational family and a novel optimization algorithm, SIVI is shown to closely match the accuracy of MCMC in inferring the posterior in a variety of Bayesian inference tasks.
Following detailed presentation of the Core Conflictual Relationship Theme (CCRT), there is the objective of relevant methods for what has been described as verbalization and visualization of data. Such is also termed data mining and text mining, and knowledge discovery in data. The Correspondence Analysis methodology, also termed Geometric Data Analysis, is shown in a case study to be comprehensive and revealing. Computational efficiency depends on how the analysis process is structured. For both illustrative and revealing aspects of the case study here, relatively extensive dream reports are used. This Geometric Data Analysis confirms the validity of CCRT method.
We present differentiable particle filters (DPFs): a differentiable implementation of the particle filter algorithm with learnable motion and measurement models. Since DPFs are end-to-end differentiable, we can efficiently train their models by optimizing end-to-end state estimation performance, rather than proxy objectives such as model accuracy. DPFs encode the structure of recursive state estimation with prediction and measurement update that operate on a probability distribution over states. This structure represents an algorithmic prior that improves learning performance in state estimation problems while enabling explainability of the learned model. Our experiments on simulated and real data show substantial benefits from end-to- end learning with algorithmic priors, e.g. reducing error rates by ~80%. Our experiments also show that, unlike long short-term memory networks, DPFs learn localization in a policy-agnostic way and thus greatly improve generalization. Source code is available at https://…/differentiable-particle-filters.
OpenNMT is an open-source toolkit for neural machine translation (NMT). The system prioritizes efficiency, modularity, and extensibility with the goal of supporting NMT research into model architectures, feature representations, and source modalities, while maintaining competitive performance and reasonable training requirements. The toolkit consists of modeling and translation support, as well as detailed pedagogical documentation about the underlying techniques. OpenNMT has been used in several production MT systems, modified for numerous research papers, and is implemented across several deep learning frameworks.
Apache SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software License version 2.0.
Data analytics using machine learning (ML) has become ubiquitous in science, business intelligence, journalism and many other domains. While a lot of work focuses on reducing the training cost, inference runtime and storage cost of ML models, little work studies how to reduce the cost of data acquisition, which potentially leads to a loss of sellers’ revenue and buyers’ affordability and efficiency. In this paper, we propose a model-based pricing (MBP) framework, which instead of pricing the data, directly prices ML model instances. We first formally describe the desired properties of the MBP framework, with a focus on avoiding arbitrage. Next, we show a concrete realization of the MBP framework via a noise injection approach, which provably satisfies the desired formal properties. Based on the proposed framework, we then provide algorithmic solutions on how the seller can assign prices to models under different market scenarios (such as to maximize revenue). Finally, we conduct extensive experiments, which validate that the MBP framework can provide high revenue to the seller, high affordability to the buyer, and also operate on low runtime cost.