Commonly used in computer vision and other applications, robust PCA represents an algorithmic attempt to reduce the sensitivity of classical PCA to outliers. The basic idea is to learn a decomposition of some data matrix of interest into low rank and sparse components, the latter representing unwanted outliers. Although the resulting optimization problem is typically NP-hard, convex relaxations provide a computationally-expedient alternative with theoretical support. However, in practical regimes performance guarantees break down and a variety of non-convex alternatives, including Bayesian-inspired models, have been proposed to boost estimation quality. Unfortunately though, without additional a priori knowledge none of these methods can significantly expand the critical operational range such that exact principal subspace recovery is possible. Into this mix we propose a novel pseudo-Bayesian algorithm that explicitly compensates for design weaknesses in many existing non-convex approaches leading to state-of-the-art performance with a sound analytical foundation.
Teaching dimension is a learning theoretic quantity that specifies the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a specific hypothesis via optimization. This paper presents the first known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners.
Most density-based clustering methods largely rely on how well the underlying density is estimated. However, density estimation itself is also a challenging problem, especially the determination of the kernel bandwidth. A large bandwidth could lead to the over-smoothed density estimation in which the number of density peaks could be less than the true clusters, while a small bandwidth could lead to the under-smoothed density estimation in which spurious density peaks, or called the ‘ripple noise’, would be generated in the estimated density. In this paper, we propose a density-based hierarchical clustering method, called the Deep Nearest Neighbor Descent (D-NND), which could learn the underlying density structure layer by layer and capture the cluster structure at the same time. The over-smoothed density estimation could be largely avoided and the negative effect of the under-estimated cases could be also largely reduced. Overall, D-NND presents not only the strong capability of discovering the underlying cluster structure but also the remarkable reliability due to its insensitivity to parameters.
We present a discriminative nonparametric latent feature relational model (LFRM) for link prediction to automatically infer the dimensionality of latent features. Under the generic RegBayes (regularized Bayesian inference) framework, we handily incorporate the prediction loss with probabilistic inference of a Bayesian model; set distinct regularization parameters for different types of links to handle the imbalance issue in real networks; and unify the analysis of both the smooth logistic log-loss and the piecewise linear hinge loss. For the nonconjugate posterior inference, we present a simple Gibbs sampler via data augmentation, without making restricting assumptions as done in variational methods. We further develop an approximate sampler using stochastic gradient Langevin dynamics to handle large networks with hundreds of thousands of entities and millions of links, orders of magnitude larger than what existing LFRM models can process. Extensive studies on various real networks show promising performance.
Using deep neural nets as function approximator for reinforcement learning tasks have recently been shown to be very powerful for solving problems approaching real-world complexity. Using these results as a benchmark, we discuss the role that the discount factor may play in the quality of the learning process of a deep Q-network (DQN). When the discount factor progressively increases up to its final value, we empirically show that it is possible to significantly reduce the number of learning steps. When used in conjunction with a varying learning rate, we empirically show that it outperforms original DQN on several experiments. We relate this phenomenon with the instabilities of neural networks when they are used in an approximate Dynamic Programming setting. We also describe the possibility to fall within a local optimum during the learning process, thus connecting our discussion with the exploration/exploitation dilemma.
Modeling document structure is of great importance for discourse analysis and related applications. The goal of this research is to capture the document intent structure by modeling documents as a mixture of topic words and rhetorical words. While the topics are relatively unchanged through one document, the rhetorical functions of sentences usually change following certain orders in discourse. We propose GMM-LDA, a topic modeling based Bayesian unsupervised model, to analyze the document intent structure cooperated with order information. Our model is flexible that has the ability to combine the annotations and do supervised learning. Additionally, entropic regularization can be introduced to model the significant divergence between topics and intents. We perform experiments in both unsupervised and supervised settings, results show the superiority of our model over several state-of-the-art baselines.
With data sizes constantly expanding, and with classical machine learning algorithms that analyze such data requiring larger and larger amounts of computation time and storage space, the need to distribute computation and memory requirements among several computers has become apparent. Although substantial work has been done in developing distributed binary SVM algorithms and multi-class SVM algorithms individually, the field of multi-class distributed SVMs remains largely unexplored. This research proposes a novel algorithm that implements the Support Vector Machine over a multi-class dataset and is efficient in a distributed environment (here, Hadoop). The idea is to divide the dataset into half recursively and thus compute the optimal Support Vector Machine for this half during the training phase, much like a divide and conquer approach. While testing, this structure has been effectively exploited to significantly reduce the prediction time. Our algorithm has shown better computation time during the prediction phase than the traditional sequential SVM methods (One vs. One, One vs. Rest) and out-performs them as the size of the dataset grows. This approach also classifies the data with higher accuracy than the traditional multi-class algorithms.
There exists a theory of a single general-purpose learning algorithm which could explain the principles its operation. It assumes the initial rough architecture, a small library of simple innate circuits which are prewired at birth. and proposes that all significant mental algorithms are learned. Given current understanding and observations, this paper reviews and lists the ingredients of such an algorithm from architectural and functional perspectives.
We derive non-asymptotic bounds for the minimax risk of variable selection under expected Hamming loss in the Gaussian mean model in $\mathbb{R}^d$ for classes of $s$-sparse vectors separated from 0 by a constant $a>0$. In some cases, we get exact expressions for the non-asymptotic minimax risk as a function of $d,s,a$ and find explicitly the minimax selectors. Analogous results are obtained for the probability of wrong recovery of the sparsity pattern. As corollaries, we derive necessary and sufficient conditions for such asymptotic properties as almost full recovery and exact recovery. Moreover, we propose data-driven selectors that provide almost full and exact recovery adaptive to the parameters of the classes.
In the last few years thousands of scientific papers have explored sentiment analysis, several startups that measures opinions on real data have emerged, and a number of innovative products related to this theme have been developed. There are multiple methods for measuring sentiments, including lexical-based approaches and supervised machine learning methods. Despite the vast interest on the theme and wide popularity of some methods, it is unclear which method is better for identifying the polarity (i.e., positive or negative) of a message. Thus, there is a strong need to conduct a thorough apple-to-apple comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originated from different data sources. Such a comparison is key for understanding the potential limitations, advantages, and disadvantages of popular methods. This study aims at filling this gap by presenting a benchmark comparison of twenty one popular sentiment analysis methods (which we call the state-of-the-practice methods). Our evaluation is based on a benchmark of twenty labeled datasets, covering messages posted on social networks, movie and product reviews, as well as opinions and comments in news articles. Our results highlight the extent to which the prediction performance of these methods varies widely across datasets. Aiming at boosting the development of this research area, we open the methods’ codes and datasets used in this paper and we deploy a benchmark system, which provides an open API for accessing and comparing sentence-level sentiment analysis methods.
Traditional graph-based semi-supervised learning (SSL) approaches, even though widely applied, are not suited for massive data and large label scenarios since they scale linearly with the number of edges $|E|$ and distinct labels $m$. To deal with the large label size problem, recent works propose sketch-based methods to approximate the distribution on labels per node thereby achieving a space reduction from $O(m)$ to $O(\log m)$, under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that captures the sparsity of the label distribution and ensures the algorithm propagates labels accurately, and further reduces the space complexity per node to $O(1)$. We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. We also study different graph construction mechanisms for natural language applications and propose a robust graph augmentation strategy trained using state-of-the-art unsupervised deep learning architectures that yields further significant quality gains.
We describe an application of an encoder-decoder recurrent neural network with LSTM units and attention to generating headlines from the text of news articles. We find that the model is quite effective at concisely paraphrasing news articles. Furthermore, we study how the neural network decides which input words to pay attention to, and specifically we identify the function of the different neurons in a simplified attention mechanism. Interestingly, our simplified attention mechanism performs better that the more complex attention mechanism on a held out set of articles.
Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or require a full pass over the (large) data set at the end of each epoch to calculate the exact gradient of the objective function. This makes current VR methods impractical in distributed or parallel settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants with both high and low communication latency. We find that our distributed algorithms scale linearly with the number of local workers and remain stable even with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms consistently converge faster than other stochastic methods.
A deep learning approach to reinforcement learning led to a general learner able to train on visual input to play a variety of arcade games at the human and superhuman levels. Its creators at the Google DeepMind’s team called the approach: Deep Q-Network (DQN). We present an extension of DQN by ‘soft’ and ‘hard’ attention mechanisms. Tests of the proposed Deep Attention Recurrent Q-Network (DARQN) algorithm on multiple Atari 2600 games show level of performance superior to that of DQN. Moreover, built-in attention mechanisms allow a direct online monitoring of the training process by highlighting the regions of the game screen the agent is focusing on when making decisions.
This proposal presents a graph computing framework intending to support both online and offline computing on large dynamic graphs efficiently. The framework proposes a new data model to support rich evolving vertex and edge data types. It employs a replica-coherence protocol to improve data locality thus can adapt to data access patterns of different algorithms. A new computing model called protocol dataflow is proposed to implement and integrate various programming models for both online and offline computing on large dynamic graphs. A central topic of the proposal is also the analysis of large real dynamic graphs using our proposed framework. Our goal is to calculate the temporal patterns and properties which emerge when the large graphs keep evolving. Thus we can evaluate the capability of the proposed framework. Key words: Large dynamic graph, programming model, distributed computing.
Demanding sparsity in estimated models has become a routine practice in statistics. In many situations, we wish to demand that the sparsity patterns attained honor certain problem-specific constraints. Hierarchical sparse modeling (HSM) refers to situations in which these constraints specify that one set of parameters be set to zero whenever another is set to zero. In recent years, numerous papers have developed convex regularizers for this form of sparsity structure arising in areas including interaction modeling, time series, and covariance estimation. In this paper, we observe that these methods fall into two frameworks, which have not been systematically compared in the context of HSM. The purpose of this paper is to provide a side-by-side comparison of these two frameworks for HSM in terms of their statistical properties and computational efficiency. We call attention to a problem with the more commonly used framework and provide new insights into the other, which can greatly improve its computational performance. Finally, we compare the two methods in the context of covariance estimation, where we introduce a new sparsely-banded estimator, which we show achieves the statistical advantages of an existing method but is simpler to compute.