We show that the internal representations of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image manipulation designed to produce erroneous class labels, while here we look at internal layers of the representation. Despite the similarities in the generation process, our class of adversarial images differs qualitatively from previous ones. Specifically, while the input image is perceptually similar to an image of one class, its internal representation appear remarkably like that of natural images in a different class. This phenomenon raises questions about DNN representations, as well as the properties of natural images themselves.
Kalman Filters are one of the most influential models of time-varying phenomena. They admit an intuitive probabilistic interpretation, have a simple functional form, and enjoy widespread adoption in a variety of disciplines. Motivated by recent variational methods for learning deep generative models, we introduce a unified algorithm to efficiently learn a broad spectrum of Kalman filters. Of particular interest is the use of temporal generative models for counterfactual inference. We investigate the efficacy of such models for counterfactual inference, and to that end we introduce the ‘Healing MNIST’ dataset where long-term structure, noise and actions are applied to sequences of digits. We show the efficacy of our method for modeling this dataset. We further show how our model can be used for counterfactual inference for patients, based on electronic health record data of 8,000 patients over 4.5 years.
Modern applications and progress in deep learning research have created renewed interest for generative models of text and of images. However, even today it is unclear what objective functions one should use to train and evaluate these models. In this paper we present two contributions. Firstly, we present a critique of scheduled sampling, a state-of-the-art training method that contributed to the winning entry to the MSCOCO image captioning benchmark in 2015. Here we show that despite this impressive empirical performance, the objective function underlying scheduled sampling is improper and leads to an inconsistent learning algorithm. Secondly, we revisit the problems that scheduled sampling was meant to address, and present an alternative interpretation. We argue that maximum likelihood is an inappropriate training objective when the end-goal is to generate natural-looking samples. We go on to derive an ideal objective function to use in this situation instead. We introduce a generalisation of adversarial training, and show how such method can interpolate between maximum likelihood training and our ideal training objective. To our knowledge this is the first theoretical analysis that explains why adversarial training tends to produce samples with higher perceived quality.
The field of Movement Ecology, like so many other fields, is experiencing a period of rapid growth in availability of data. As the volume rises, traditional methods are giving way to machine learning and data science, which are playing an increasingly large part it turning this data into science-driving insights. One rich and interesting source is the bio-logger. These small electronic wearable devices are attached to animals free to roam in their natural habitats, and report back readings from multiple sensors, including GPS and accelerometer bursts. A common use of accelerometer data is for supervised learning of behavioral modes. However, we need unsupervised analysis tools as well, in order to overcome the inherent difficulties of obtaining a labeled dataset, which in some cases is either infeasible or does not successfully encompass the full repertoire of behavioral modes of interest. Here we present a matrix factorization based topic-model method for accelerometer bursts, derived using a linear mixture property of patch features. Our method is validated via comparison to a labeled dataset, and is further compared to standard clustering algorithms.
This paper presents a new method for the discovery of latent domains in diverse speech data, for the use of adaptation of Deep Neural Networks (DNNs) for Automatic Speech Recognition. Our work focuses on transcription of multi-genre broadcast media, which is often only categorised broadly in terms of high level genres such as sports, news, documentary, etc. However, in terms of acoustic modelling these categories are coarse. Instead, it is expected that a mixture of latent domains can better represent the complex and diverse behaviours within a TV show, and therefore lead to better and more robust performance. We propose a new method, whereby these latent domains are discovered with Latent Dirichlet Allocation, in an unsupervised manner. These are used to adapt DNNs using the Unique Binary Code (UBIC) representation for the LDA domains. Experiments conducted on a set of BBC TV broadcasts, with more than 2,000 shows for training and 47 shows for testing, show that the use of LDA-UBIC DNNs reduces the error up to 13% relative compared to the baseline hybrid DNN models.
It is hard to detect important articles in a specific context. Information retrieval techniques based on full text search can be inaccurate to identify main topics and they are not able to provide an indication about the importance of the article. Generating a citation network is a good way to find most popular articles but this approach is not context aware. The text around a citation mark is generally a good summary of the referred article. So citation context analysis presents an opportunity to use the wisdom of crowd for detecting important articles in a context sensitive way. In this work, we analyze citation contexts to rank articles properly for a given topic. The model proposed uses citation contexts in order to create a directed and weighted citation network based on the target topic. We create a directed and weighted edge between two articles if citation context contains terms related with the target topic. Then we apply common ranking algorithms in order to find important articles in this newly created network. We showed that this method successfully detects a good subset of most prominent articles in a given topic. The biggest contribution of this approach is that we are able to identify important articles for a given search term even though these articles do not contain this search term. This technique can be used in other linked documents including web pages, legal documents, and patents.
Canonical correlation analysis (CCA) is a fundamental technique in multi-view data analysis and representation learning. Several nonlinear extensions of the classical linear CCA method have been proposed, including kernel and deep neural network methods. These approaches restrict attention to certain families of nonlinear projections, which the user must specify (by choosing a kernel or a neural network architecture), and are computationally demanding. Interestingly, the theory of nonlinear CCA without any functional restrictions, has been studied in the population setting by Lancaster already in the 50’s. However, these results, have not inspired practical algorithms. In this paper, we revisit Lancaster’s theory, and use it to devise a practical algorithm for nonparametric CCA (NCCA). Specifically, we show that the most correlated nonlinear projections of two random vectors can be expressed in terms of the singular value decomposition of a certain operator associated with their joint density. Thus, by estimating the population density from data, NCCA reduces to solving an eigenvalue system, superficially like kernel CCA but, importantly, without having to compute the inverse of any kernel matrix. We also derive a partially linear CCA (PLCCA) variant in which one of the views undergoes a linear projection while the other is nonparametric. PLCCA turns out to have a similar form to the classical linear CCA, but with a nonparametric regression term replacing the linear regression in CCA. Using a kernel density estimate based on a small number of nearest neighbors, our NCCA and PLCCA algorithms are memory-efficient, often run much faster, and achieve better performance than kernel CCA and comparable performance to deep CCA.
Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.
Online learning with multiple kernels has gained increasing interests in recent years and found many applications. For classification tasks, Online Multiple Kernel Classification (OMKC), which learns a kernel based classifier by seeking the optimal linear combination of a pool of single kernel classifiers in an online fashion, achieves superior accuracy and enjoys great flexibility compared with traditional single-kernel classifiers. Despite being studied extensively, existing OMKC algorithms suffer from high computational cost due to their unbounded numbers of support vectors. To overcome this drawback, we present a novel framework of Budget Online Multiple Kernel Learning (BOMKL) and propose a new Sparse Passive Aggressive learning to perform effective budget online learning. Specifically, we adopt a simple yet effective Bernoulli sampling to decide if an incoming instance should be added to the current set of support vectors. By limiting the number of support vectors, our method can significantly accelerate OMKC while maintaining satisfactory accuracy that is comparable to that of the existing OMKC algorithms. We theoretically prove that our new method achieves an optimal regret bound in expectation, and empirically found that the proposed algorithm outperforms various OMKC algorithms and can easily scale up to large-scale datasets.
Classification is a common statistical task in many areas. In order to ameliorate the performance of the existing methods, there are always some new classification procedures proposed. These procedures, especially those raised in the machine learning and data-mining literature, are usually complicated, and therefore extra effort is required to understand them and the impacts of individual variables in these procedures. However, in some applications, for example, pharmaceutical and medical related research, future developments and/or research plans will rely on the interpretation of the classification rule, such as the role of individual variables in a diagnostic rule/model. Hence, in these kinds of research, despite the optimal performance of the complicated models, the model with the balanced ease of interpretability and satisfactory performance is preferred. The complication of a classification rule might diminish its advantage in performance and become an obstacle to be used in those applications. In this paper, we study how to improve the classification performance, in terms of area under the receiver operating characteristic curve of a conventional logistic model, while retaining its ease of interpretation. The proposed method increases the sensitivity at the whole range of specificity and hence is especially useful when the performance in the high-specificity range of a receiver operating characteristic curve is of interest. Theoretical justification is presented, and numerical results using both simulated data and two real data sets are reported.
There has been a lot of recent progress in high-dimensional distribution estimation using autoregressive networks. The conditional distributions in these networks are typically regularized through early stopping or weight decay. We consider here the use of the L1-penalty to learn sparse linear (and logistic) models for the conditionals and form large mixtures of such networks. Moreover, we introduce a new distributed representation which permits exact likelihood evaluations since the latent variables are interleaved with the observable variables and can be easily integrated out. Our model achieves excellent generalization performance and scales well to extremely high dimensions.
We introduce normalized nonnegative models (NNM) for explorative data analysis. NNMs are partial convexifications of models from probability theory. We demonstrate their value at the example of item recommendation. We show that NNM-based recommender systems satisfy three criteria that all recommender systems should ideally satisfy: high predictive power, computational tractability, and expressive representations of users and items. Expressive user and item representations are important in practice to succinctly summarize the pool of customers and the pool of items. In NNMs, user representations are expressive because each user’s preference can be regarded as normalized mixture of preferences of stereotypical users. The interpretability of item and user representations allow us to arrange properties of items (e.g., genres of movies or topics of documents) or users (e.g., personality traits) hierarchically.
Kernel Canonical correlation analysis (KCCA) is a fundamental method with broad applicability in statistics and machine learning. Although there exist closed-form solution to the KCCA objective by solving an $N\times N$ eigenvalue system where $N$ is the training set size, the computational requirements of this approach in both memory and time prohibit its usage in the large scale. Various approximation techniques have been developed for KCCA. A recently proposed approach is to first transform original inputs to a $M$-dimensional feature space using random kitchen sinks so that inner product in the feature space approximates the kernel function, and then apply linear CCA to the transformed inputs. In challenging applications, however, the dimensionality $M$ of the feature space may need to be very large in order to reveal the nonlinear correlations, and then it becomes non-trivial to solve linear CCA for data matrices of very high dimensionality. We propose to use the recently proposed stochastic optimization algorithm for linear CCA and its neural-network extension to further alleviate the computation requirements of approximate KCCA. This approach allows us to run approximate KCCA on a speech dataset with $1.4$ million training samples and random feature space of dimensionality $M=100000$ on a normal workstation.
Most traditional visualization tools and methods operate on an offline way, limited on accessing static (preprocessed) sets of data. They also restrict themselves on dealing with small dataset sizes, which can be easily visually analysed with conventional visualization techniques. However, the Web of Data has realized the availability of a great amount and variety of big datasets that are dynamic in nature; most of them offer query or API endpoints for online access and analysis. Modern visualization techniques must address the challenge of on-the-fly visualizations over large dynamic sets of data, offering efficient exploration techniques, as well as mechanisms for information abstraction and summarization. Moreover, they must take into account different user-defined exploration scenarios and user’s preferences. In this work, we present a generic model for personalized multilevel exploration and analysis over large dynamic sets of numeric and temporal data. Our model is built on top of a lightweight tree-based structure which can be efficiently constructed on-the-fly for a given set of data. This tree structure aggregates input objects into a hierarchical multiscale model. In the proposed structure, statistical computations can be efficiently performed on-the-fly. Considering different exploration scenarios over large datasets, the proposed model enables efficient multilevel exploration, offering incremental construction via user interaction, and dynamic adaptation of the hierarchies based on user’s preferences. A thorough theoretical analysis is presented, illustrating the efficiency of the proposed model. The proposed model is realized in a Web-based prototype tool, called rdf:SynopsViz that offers multilevel visual exploration and analysis over Linked Data datasets.
We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on all three datasets and sets a new state of the art on STL-10 with a test set accuracy of 81.4%.
We propose a robust elastic net (REN) model for high-dimensional sparse regression and give its performance guarantees (both the statistical error bound and the optimization bound). A simple idea of trimming the inner product is applied to the elastic net model. Specifically, we robustify the covariance matrix by trimming the inner product based on the intuition that the trimmed inner product can not be significant affected by a bounded number of arbitrarily corrupted points (outliers). The REN model can also derive two interesting special cases: robust Lasso and robust soft thresholding. Comprehensive experimental results show that the robustness of the proposed model consistently outperforms the original elastic net and matches the performance guarantees nicely.
We propose to learn latent graphical models when data have mixed variables and missing values. This model could be used for further data analysis, including regression, classification, ranking etc. It also could be used for imputing missing values. We specify a latent Gaussian model for the data, where the categorical variables are generated by discretizing an unobserved variable and the latent variables are multivariate Gaussian. The observed data consists of two parts: observed Gaussian variables and observed categorical variables, where the latter part is considered as partially missing Gaussian variables. We use the Expectation-Maximization algorithm to fit the model. To prevent overfitting we use sparse inverse covariance estimation to obtain sparse estimate of the latent covariance matrix, equivalently, the graphical model. The fitted model then could be used for problems including re- gression, classification and ranking. Such an approach is applied to a medical data set where our method outperforms the state-of-the-art methods. Simulation studies and real data results suggest that our proposed model performs better than random forest in terms of prediction error when the model is correctly specified, and is a better imputation method than hot deck imputation even if the model is not correctly specified.
We introduce a neural machine translation model that views the input and output sentences as sequences of characters rather than words. Since word-level information provides a crucial source of bias, our input model composes representations of character sequences into representations of words (as determined by whitespace boundaries), and then these are translated using a joint attention/translation model. In the target language, the translation is modeled as a sequence of word vectors, but each word is generated one character at a time, conditional on the previous character generations in each word. As the representation and generation of words is performed at the character level, our model is capable of interpreting and generating unseen word forms. A secondary benefit of this approach is that it alleviates much of the challenges associated with preprocessing/tokenization of the source and target languages. We show that our model can achieve translation results that are on par with conventional word-based models.