Adversarial Manipulation of Deep Representations

We show that the internal representations of an image in a deep neural network (DNN) can be manipulated to mimic those of other natural images with only minor, imperceptible perturbations to the original image. Previous methods for generating adversarial images focused on image manipulation designed to produce erroneous class labels, while here we look at internal layers of the representation. Despite the similarities in the generation process, our class of adversarial images differs qualitatively from previous ones. Specifically, while the input image is perceptually similar to an image of one class, its internal representation appear remarkably like that of natural images in a different class. This phenomenon raises questions about DNN representations, as well as the properties of natural images themselves.


Deep Kalman Filters

Kalman Filters are one of the most influential models of time-varying phenomena. They admit an intuitive probabilistic interpretation, have a simple functional form, and enjoy widespread adoption in a variety of disciplines. Motivated by recent variational methods for learning deep generative models, we introduce a unified algorithm to efficiently learn a broad spectrum of Kalman filters. Of particular interest is the use of temporal generative models for counterfactual inference. We investigate the efficacy of such models for counterfactual inference, and to that end we introduce the ‘Healing MNIST’ dataset where long-term structure, noise and actions are applied to sequences of digits. We show the efficacy of our method for modeling this dataset. We further show how our model can be used for counterfactual inference for patients, based on electronic health record data of 8,000 patients over 4.5 years.


How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?

Modern applications and progress in deep learning research have created renewed interest for generative models of text and of images. However, even today it is unclear what objective functions one should use to train and evaluate these models. In this paper we present two contributions. Firstly, we present a critique of scheduled sampling, a state-of-the-art training method that contributed to the winning entry to the MSCOCO image captioning benchmark in 2015. Here we show that despite this impressive empirical performance, the objective function underlying scheduled sampling is improper and leads to an inconsistent learning algorithm. Secondly, we revisit the problems that scheduled sampling was meant to address, and present an alternative interpretation. We argue that maximum likelihood is an inappropriate training objective when the end-goal is to generate natural-looking samples. We go on to derive an ideal objective function to use in this situation instead. We introduce a generalisation of adversarial training, and show how such method can interpolate between maximum likelihood training and our ideal training objective. To our knowledge this is the first theoretical analysis that explains why adversarial training tends to produce samples with higher perceived quality.


Topic Modeling of Behavioral Modes Using Sensor Data

The field of Movement Ecology, like so many other fields, is experiencing a period of rapid growth in availability of data. As the volume rises, traditional methods are giving way to machine learning and data science, which are playing an increasingly large part it turning this data into science-driving insights. One rich and interesting source is the bio-logger. These small electronic wearable devices are attached to animals free to roam in their natural habitats, and report back readings from multiple sensors, including GPS and accelerometer bursts. A common use of accelerometer data is for supervised learning of behavioral modes. However, we need unsupervised analysis tools as well, in order to overcome the inherent difficulties of obtaining a labeled dataset, which in some cases is either infeasible or does not successfully encompass the full repertoire of behavioral modes of interest. Here we present a matrix factorization based topic-model method for accelerometer bursts, derived using a linear mixture property of patch features. Our method is validated via comparison to a labeled dataset, and is further compared to standard clustering algorithms.


Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

This paper presents a new method for the discovery of latent domains in diverse speech data, for the use of adaptation of Deep Neural Networks (DNNs) for Automatic Speech Recognition. Our work focuses on transcription of multi-genre broadcast media, which is often only categorised broadly in terms of high level genres such as sports, news, documentary, etc. However, in terms of acoustic modelling these categories are coarse. Instead, it is expected that a mixture of latent domains can better represent the complex and diverse behaviours within a TV show, and therefore lead to better and more robust performance. We propose a new method, whereby these latent domains are discovered with Latent Dirichlet Allocation, in an unsupervised manner. These are used to adapt DNNs using the Unique Binary Code (UBIC) representation for the LDA domains. Experiments conducted on a set of BBC TV broadcasts, with more than 2,000 shows for training and 47 shows for testing, show that the use of LDA-UBIC DNNs reduces the error up to 13% relative compared to the baseline hybrid DNN models.


Context Sensitive Article Ranking with Citation Context Analysis

It is hard to detect important articles in a specific context. Information retrieval techniques based on full text search can be inaccurate to identify main topics and they are not able to provide an indication about the importance of the article. Generating a citation network is a good way to find most popular articles but this approach is not context aware. The text around a citation mark is generally a good summary of the referred article. So citation context analysis presents an opportunity to use the wisdom of crowd for detecting important articles in a context sensitive way. In this work, we analyze citation contexts to rank articles properly for a given topic. The model proposed uses citation contexts in order to create a directed and weighted citation network based on the target topic. We create a directed and weighted edge between two articles if citation context contains terms related with the target topic. Then we apply common ranking algorithms in order to find important articles in this newly created network. We showed that this method successfully detects a good subset of most prominent articles in a given topic. The biggest contribution of this approach is that we are able to identify important articles for a given search term even though these articles do not contain this search term. This technique can be used in other linked documents including web pages, legal documents, and patents.


Nonparametric Canonical Correlation Analysis

Canonical correlation analysis (CCA) is a fundamental technique in multi-view data analysis and representation learning. Several nonlinear extensions of the classical linear CCA method have been proposed, including kernel and deep neural network methods. These approaches restrict attention to certain families of nonlinear projections, which the user must specify (by choosing a kernel or a neural network architecture), and are computationally demanding. Interestingly, the theory of nonlinear CCA without any functional restrictions, has been studied in the population setting by Lancaster already in the 50’s. However, these results, have not inspired practical algorithms. In this paper, we revisit Lancaster’s theory, and use it to devise a practical algorithm for nonparametric CCA (NCCA). Specifically, we show that the most correlated nonlinear projections of two random vectors can be expressed in terms of the singular value decomposition of a certain operator associated with their joint density. Thus, by estimating the population density from data, NCCA reduces to solving an eigenvalue system, superficially like kernel CCA but, importantly, without having to compute the inverse of any kernel matrix. We also derive a partially linear CCA (PLCCA) variant in which one of the views undergoes a linear projection while the other is nonparametric. PLCCA turns out to have a similar form to the classical linear CCA, but with a nonparametric regression term replacing the linear regression in CCA. Using a kernel density estimate based on a small number of nearest neighbors, our NCCA and PLCCA algorithms are memory-efficient, often run much faster, and achieve better performance than kernel CCA and comparable performance to deep CCA.


Neural Programmer: Inducing Latent Programs with Gradient Descent

Deep neural networks have achieved impressive supervised classification performance in many tasks including image recognition, speech recognition, and sequence to sequence learning. However, this success has not been translated to applications like question answering that may involve complex arithmetic and logic reasoning. A major limitation of these models is in their inability to learn even simple arithmetic and logic operations. For example, it has been shown that neural networks fail to learn to add two binary numbers reliably. In this work, we propose Neural Programmer, an end-to-end differentiable neural network augmented with a small set of basic arithmetic and logic operations. Neural Programmer can call these augmented operations over several steps, thereby inducing compositional programs that are more complex than the built-in operations. The model learns from a weak supervision signal which is the result of execution of the correct program, hence it does not require expensive annotation of the correct program itself. The decisions of what operations to call, and what data segments to apply to are inferred by Neural Programmer. Such decisions, during training, are done in a differentiable fashion so that the entire network can be trained jointly by gradient descent. We find that training the model is difficult, but it can be greatly improved by adding random noise to the gradient. On a fairly complex synthetic table-comprehension dataset, traditional recurrent networks and attentional models perform poorly while Neural Programmer typically obtains nearly perfect accuracy.


Budget Online Multiple Kernel Learning

Online learning with multiple kernels has gained increasing interests in recent years and found many applications. For classification tasks, Online Multiple Kernel Classification (OMKC), which learns a kernel based classifier by seeking the optimal linear combination of a pool of single kernel classifiers in an online fashion, achieves superior accuracy and enjoys great flexibility compared with traditional single-kernel classifiers. Despite being studied extensively, existing OMKC algorithms suffer from high computational cost due to their unbounded numbers of support vectors. To overcome this drawback, we present a novel framework of Budget Online Multiple Kernel Learning (BOMKL) and propose a new Sparse Passive Aggressive learning to perform effective budget online learning. Specifically, we adopt a simple yet effective Bernoulli sampling to decide if an incoming instance should be added to the current set of support vectors. By limiting the number of support vectors, our method can significantly accelerate OMKC while maintaining satisfactory accuracy that is comparable to that of the existing OMKC algorithms. We theoretically prove that our new method achieves an optimal regret bound in expectation, and empirically found that the proposed algorithm outperforms various OMKC algorithms and can easily scale up to large-scale datasets.


Performance and Interpretation of Classification Models

Classification is a common statistical task in many areas. In order to ameliorate the performance of the existing methods, there are always some new classification procedures proposed. These procedures, especially those raised in the machine learning and data-mining literature, are usually complicated, and therefore extra effort is required to understand them and the impacts of individual variables in these procedures. However, in some applications, for example, pharmaceutical and medical related research, future developments and/or research plans will rely on the interpretation of the classification rule, such as the role of individual variables in a diagnostic rule/model. Hence, in these kinds of research, despite the optimal performance of the complicated models, the model with the balanced ease of interpretability and satisfactory performance is preferred. The complication of a classification rule might diminish its advantage in performance and become an obstacle to be used in those applications. In this paper, we study how to improve the classification performance, in terms of area under the receiver operating characteristic curve of a conventional logistic model, while retaining its ease of interpretation. The proposed method increases the sensitivity at the whole range of specificity and hence is especially useful when the performance in the high-specificity range of a receiver operating characteristic curve is of interest. Theoretical justification is presented, and numerical results using both simulated data and two real data sets are reported.


Sparse Autoregressive Networks

There has been a lot of recent progress in high-dimensional distribution estimation using autoregressive networks. The conditional distributions in these networks are typically regularized through early stopping or weight decay. We consider here the use of the L1-penalty to learn sparse linear (and logistic) models for the conditionals and form large mixtures of such networks. Moreover, we introduce a new distributed representation which permits exact likelihood evaluations since the latent variables are interleaved with the observable variables and can be easily integrated out. Our model achieves excellent generalization performance and scales well to extremely high dimensions.


Expressive recommender systems through normalized nonnegative models

We introduce normalized nonnegative models (NNM) for explorative data analysis. NNMs are partial convexifications of models from probability theory. We demonstrate their value at the example of item recommendation. We show that NNM-based recommender systems satisfy three criteria that all recommender systems should ideally satisfy: high predictive power, computational tractability, and expressive representations of users and items. Expressive user and item representations are important in practice to succinctly summarize the pool of customers and the pool of items. In NNMs, user representations are expressive because each user’s preference can be regarded as normalized mixture of preferences of stereotypical users. The interpretability of item and user representations allow us to arrange properties of items (e.g., genres of movies or topics of documents) or users (e.g., personality traits) hierarchically.


Large-Scale Approximate Kernel Canonical Correlation Analysis

Kernel Canonical correlation analysis (KCCA) is a fundamental method with broad applicability in statistics and machine learning. Although there exist closed-form solution to the KCCA objective by solving an N\times N eigenvalue system where N is the training set size, the computational requirements of this approach in both memory and time prohibit its usage in the large scale. Various approximation techniques have been developed for KCCA. A recently proposed approach is to first transform original inputs to a M-dimensional feature space using random kitchen sinks so that inner product in the feature space approximates the kernel function, and then apply linear CCA to the transformed inputs. In challenging applications, however, the dimensionality M of the feature space may need to be very large in order to reveal the nonlinear correlations, and then it becomes non-trivial to solve linear CCA for data matrices of very high dimensionality. We propose to use the recently proposed stochastic optimization algorithm for linear CCA and its neural-network extension to further alleviate the computation requirements of approximate KCCA. This approach allows us to run approximate KCCA on a speech dataset with 1.4 million training samples and random feature space of dimensionality M=100000 on a normal workstation.


A Hierarchical Framework for Efficient Multilevel Visual Exploration and Analysis

Most traditional visualization tools and methods operate on an offline way, limited on accessing static (preprocessed) sets of data. They also restrict themselves on dealing with small dataset sizes, which can be easily visually analysed with conventional visualization techniques. However, the Web of Data has realized the availability of a great amount and variety of big datasets that are dynamic in nature; most of them offer query or API endpoints for online access and analysis. Modern visualization techniques must address the challenge of on-the-fly visualizations over large dynamic sets of data, offering efficient exploration techniques, as well as mechanisms for information abstraction and summarization. Moreover, they must take into account different user-defined exploration scenarios and user’s preferences. In this work, we present a generic model for personalized multilevel exploration and analysis over large dynamic sets of numeric and temporal data. Our model is built on top of a lightweight tree-based structure which can be efficiently constructed on-the-fly for a given set of data. This tree structure aggregates input objects into a hierarchical multiscale model. In the proposed structure, statistical computations can be efficiently performed on-the-fly. Considering different exploration scenarios over large datasets, the proposed model enables efficient multilevel exploration, offering incremental construction via user interaction, and dynamic adaptation of the hierarchies based on user’s preferences. A thorough theoretical analysis is presented, illustrating the efficiency of the proposed model. The proposed model is realized in a Web-based prototype tool, called rdf:SynopsViz that offers multilevel visual exploration and analysis over Linked Data datasets.


Deep Linear Discriminant Analysis

We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on all three datasets and sets a new state of the art on STL-10 with a test set accuracy of 81.4%.


Robust Elastic Net Regression

We propose a robust elastic net (REN) model for high-dimensional sparse regression and give its performance guarantees (both the statistical error bound and the optimization bound). A simple idea of trimming the inner product is applied to the elastic net model. Specifically, we robustify the covariance matrix by trimming the inner product based on the intuition that the trimmed inner product can not be significant affected by a bounded number of arbitrarily corrupted points (outliers). The REN model can also derive two interesting special cases: robust Lasso and robust soft thresholding. Comprehensive experimental results show that the robustness of the proposed model consistently outperforms the original elastic net and matches the performance guarantees nicely.


Mixed and missing data: a unified treatment with latent graphical models

We propose to learn latent graphical models when data have mixed variables and missing values. This model could be used for further data analysis, including regression, classification, ranking etc. It also could be used for imputing missing values. We specify a latent Gaussian model for the data, where the categorical variables are generated by discretizing an unobserved variable and the latent variables are multivariate Gaussian. The observed data consists of two parts: observed Gaussian variables and observed categorical variables, where the latter part is considered as partially missing Gaussian variables. We use the Expectation-Maximization algorithm to fit the model. To prevent overfitting we use sparse inverse covariance estimation to obtain sparse estimate of the latent covariance matrix, equivalently, the graphical model. The fitted model then could be used for problems including re- gression, classification and ranking. Such an approach is applied to a medical data set where our method outperforms the state-of-the-art methods. Simulation studies and real data results suggest that our proposed model performs better than random forest in terms of prediction error when the model is correctly specified, and is a better imputation method than hot deck imputation even if the model is not correctly specified.


Character-based Neural Machine Translation

We introduce a neural machine translation model that views the input and output sentences as sequences of characters rather than words. Since word-level information provides a crucial source of bias, our input model composes representations of character sequences into representations of words (as determined by whitespace boundaries), and then these are translated using a joint attention/translation model. In the target language, the translation is modeled as a sequence of word vectors, but each word is generated one character at a time, conditional on the previous character generations in each word. As the representation and generation of words is performed at the character level, our model is capable of interpreting and generating unseen word forms. A secondary benefit of this approach is that it alleviates much of the challenges associated with preprocessing/tokenization of the source and target languages. We show that our model can achieve translation results that are on par with conventional word-based models.


Loop-Erased Random Surfaces

Random sampling of bandlimited signals on graphs

Resolving the Geometric Locus Dilemma for Support Vector Learning Machines

Yin and Yang: Balancing and Answering Binary Visual Questions

A Unified Approach to an Alternative-Choice Selection Problem

On a conjecture of Godsil concerning controllable random graphs

Diversity Networks

A Spatiotemporal Dynamic Solution to the MEG Inverse Problem: An Empirical Bayes Approach

A Polynomial Lower Bound for Testing Monotonicity

Local symmetries and defect detection in static and periodically driven lattices

An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family

The Kelmans-Seymour conjecture I: special separations

On the nonexplosion and explosion for nonhomogeneous Markov pure jump processes in Borel spaces

Eventually Consistent Register Revisited

What Graphs are 2-Dot Product Graphs?

N>=2 symmetric superpolynomials

Torus Principal Component Analysis with an Application to RNA Structures

The Correlated Pseudo-Marginal Method

Corners in tree-like tableaux

A genetic algorithm to discover flexible motifs with support

Paxos Made Switch-y

Distance integral complete multipartite graphs with $s=5,6$

A Perron theorem for matrices with negative entries and applications to Coxeter groups

Learning Spanish dialects through Twitter

Critical points of multidimensional random Fourier series: central limits

Irrelevant Vertices for the Planar Disjoint Paths Problem

PageRank in undirected random graphs

A stroll along the gamma

Performing Highly Accurate Predictions Through Convolutional Networks for Actual Telecommunication Challenges

The tail empirical process of regularly varying functions of geometrically ergodic Markov chains

Fast clustering for scalable statistical analysis on structured images

Sherlock: Modeling Structured Knowledge in Images

On the global offensive alliance in unicycle graphs

A non-Golod ring with a trivial product on its Koszul homology

Isogeometric Boundary Element Analysis with elasto-plastic inclusions. Part 1: Plane problems

An Online Sequence-to-Sequence Model Using Partial Conditioning

Slicings of parallelogram polyominoes, or how Baxter and Schröder can be reconciled

Hadoop Mapreduce Performance Enhancement Using In-node Combiners

Deep Learning for steganalysis is better than a Rich Model with an Ensemble Classifier, and is natively robust to the cover source-mismatch

Freeness of $ ψ$-graphical arrangements and chordality of vertex-weighted graphs

A family of self-avoiding random walks interpolating the loop-erased random walk and a self-avoiding walk on the Sierpinski gasket

Probabilistic Segmentation via Total Variation Regularization

Infrastructure and methods for real-time predictions of the 2014 dengue fever season in Thailand

Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization

Synchronization of coupled stick-slip oscillators

Causal interpretation rules for encoding and decoding models in neuroimaging

Complete Dictionary Recovery over the Sphere II: Recovery by Riemannian Trust-region Method

Optical transmission matrix as a probe of the photonic interaction strength

Bin Packing with Multiple Colors

Learning Representations of Affect from Speech

Model Space Priors for Objective Sparse Bayesian Regression

The monodromy of real Bethe vectors for the Gaudin model

Counting dense connected hypergraphs via the probabilistic method

Self-propelled Chimeras

Conditional Lower Bound for RNA Folding Problem

Random field disorder at an absorbing state transition in one and two dimensions

Estimates related to Shirshov height theorem (PhD Thesis)

Bipartite algebraic graphs without quadrilaterals

Applying Semantic Web Technologies for Improving the Visibility of Tourism Data

Evolutionary Games on the Torus with Weak Selection

Gaussianity of normal vectors in random non-Hermitian matrices

The maximum degree resistance distance of cacti

An Iterative Reweighted Method for Tucker Decomposition of Incomplete Multiway Tensors

Constructing functions with prescribed pathwise quadratic variation

Using Text Mining To Analyze Real Estate Classifieds

Continuous-time block-monotone Markov chains and their block-augmented truncations

Deep Activity Recognition Models with Triaxial Accelerometers

A System for Extracting Sentiment from Large-Scale Arabic Social Data

Cliques in Graphs Excluding a Complete Graph Minor

Investigations into Elasticity in Cloud Computing

Word Embedding based Correlation Model for Question/Answer Matching

Statistical Properties of Car Following: Theory and Driving Simulator Experiments

Deep Reinforcement Learning with an Unbounded Action Space

Composite Empirical Likelihood

Shortest Paths and Distances with Differential Privacy

A Smoothed P-Value Test When There is a Nuisance Parameter under the Alternative

Learning to Represent Words in Context with Multilingual Supervision

Computation of Two and Three Dimensional Confidence Regions with the Likelihood Ratio

Matroids: A Macaulay2 package

Strengthening and plastic events in crystals: The role of dislocation surface sources

On the Correlation of Increasing Families

DeepFool: a simple and accurate method to fool deep neural networks

Trainable performance upper bounds for image and video captioning

A Test of Relative Similarity

Combinatorial Batch Codes with Redundancy

The Pascal Rhombus and the Generalized Grand Motzkin Paths

The 1-2-3 conjecture for uniform hypergraphs

8-Bit Approximations for Parallelism in Deep Learning

The poset of proper divisibility

Minimax wavelet estimation for multisample heteroscedastic non-parametric regression

Distances in the directed configuration model

Network geometry with flavor: from complexity to quantum geometry

Large deviations for stochastic differential delay equations with Lévy noises

Engineering sensorial delay to control phototaxis and emergent collective behaviors

Supervised Hashing with Deep Neural Networks

MATEX: A Distributed Framework for Transient Simulation of Power Distribution Networks

An Algorithmic Framework for Efficient Large-Scale Circuit Simulation Using Exponential Integrators

Sparse Nonlinear Regression: Parameter Estimation and Asymptotic Inference

Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks

A construction principle for tight and minimal triangulations of manifolds

Constructing Permutation Arrays from Groups

Modeling Trends in Distributions

A Prediction Divergence Criterion for Model Selection

Unsupervised Learning in Synaptic Sampling Machines

Efron’s coins and the Linial arrangement

Multi-Clique-Width

A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method

Detection of multiple and overlapping bidirectional communities within large, directed and weighted networks of neurons

Beyond Convex Optimization: Star-Convex Functions

Priority-Flood: An Optimal Depression-Filling and Watershed-Labeling Algorithm for Digital Elevation Models