New Design Criteria for Robust PCA and a Compliant Bayesian-Inspired Algorithm

Commonly used in computer vision and other applications, robust PCA represents an algorithmic attempt to reduce the sensitivity of classical PCA to outliers. The basic idea is to learn a decomposition of some data matrix of interest into low rank and sparse components, the latter representing unwanted outliers. Although the resulting optimization problem is typically NP-hard, convex relaxations provide a computationally-expedient alternative with theoretical support. However, in practical regimes performance guarantees break down and a variety of non-convex alternatives, including Bayesian-inspired models, have been proposed to boost estimation quality. Unfortunately though, without additional a priori knowledge none of these methods can significantly expand the critical operational range such that exact principal subspace recovery is possible. Into this mix we propose a novel pseudo-Bayesian algorithm that explicitly compensates for design weaknesses in many existing non-convex approaches leading to state-of-the-art performance with a sound analytical foundation.


The Teaching Dimension of Linear Learners

Teaching dimension is a learning theoretic quantity that specifies the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a specific hypothesis via optimization. This paper presents the first known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners.


Clustering by Deep Nearest Neighbor Descent (D-NND): A Density-based Parameter-Insensitive Clustering Method

Most density-based clustering methods largely rely on how well the underlying density is estimated. However, density estimation itself is also a challenging problem, especially the determination of the kernel bandwidth. A large bandwidth could lead to the over-smoothed density estimation in which the number of density peaks could be less than the true clusters, while a small bandwidth could lead to the under-smoothed density estimation in which spurious density peaks, or called the ‘ripple noise’, would be generated in the estimated density. In this paper, we propose a density-based hierarchical clustering method, called the Deep Nearest Neighbor Descent (D-NND), which could learn the underlying density structure layer by layer and capture the cluster structure at the same time. The over-smoothed density estimation could be largely avoided and the negative effect of the under-estimated cases could be also largely reduced. Overall, D-NND presents not only the strong capability of discovering the underlying cluster structure but also the remarkable reliability due to its insensitivity to parameters.


Discriminative Nonparametric Latent Feature Relational Models with Data Augmentation

We present a discriminative nonparametric latent feature relational model (LFRM) for link prediction to automatically infer the dimensionality of latent features. Under the generic RegBayes (regularized Bayesian inference) framework, we handily incorporate the prediction loss with probabilistic inference of a Bayesian model; set distinct regularization parameters for different types of links to handle the imbalance issue in real networks; and unify the analysis of both the smooth logistic log-loss and the piecewise linear hinge loss. For the nonconjugate posterior inference, we present a simple Gibbs sampler via data augmentation, without making restricting assumptions as done in variational methods. We further develop an approximate sampler using stochastic gradient Langevin dynamics to handle large networks with hundreds of thousands of entities and millions of links, orders of magnitude larger than what existing LFRM models can process. Extensive studies on various real networks show promising performance.


How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies

Using deep neural nets as function approximator for reinforcement learning tasks have recently been shown to be very powerful for solving problems approaching real-world complexity. Using these results as a benchmark, we discuss the role that the discount factor may play in the quality of the learning process of a deep Q-network (DQN). When the discount factor progressively increases up to its final value, we empirically show that it is possible to significantly reduce the number of learning steps. When used in conjunction with a varying learning rate, we empirically show that it outperforms original DQN on several experiments. We relate this phenomenon with the instabilities of neural networks when they are used in an approximate Dynamic Programming setting. We also describe the possibility to fall within a local optimum during the learning process, thus connecting our discussion with the exploration/exploitation dilemma.


Jointly Modeling Topics and Intents with Global Order Structure

Modeling document structure is of great importance for discourse analysis and related applications. The goal of this research is to capture the document intent structure by modeling documents as a mixture of topic words and rhetorical words. While the topics are relatively unchanged through one document, the rhetorical functions of sentences usually change following certain orders in discourse. We propose GMM-LDA, a topic modeling based Bayesian unsupervised model, to analyze the document intent structure cooperated with order information. Our model is flexible that has the ability to combine the annotations and do supervised learning. Additionally, entropic regularization can be introduced to model the significant divergence between topics and intents. We perform experiments in both unsupervised and supervised settings, results show the superiority of our model over several state-of-the-art baselines.


A Novel Approach to Distributed Multi-Class SVM

With data sizes constantly expanding, and with classical machine learning algorithms that analyze such data requiring larger and larger amounts of computation time and storage space, the need to distribute computation and memory requirements among several computers has become apparent. Although substantial work has been done in developing distributed binary SVM algorithms and multi-class SVM algorithms individually, the field of multi-class distributed SVMs remains largely unexplored. This research proposes a novel algorithm that implements the Support Vector Machine over a multi-class dataset and is efficient in a distributed environment (here, Hadoop). The idea is to divide the dataset into half recursively and thus compute the optimal Support Vector Machine for this half during the training phase, much like a divide and conquer approach. While testing, this structure has been effectively exploited to significantly reduce the prediction time. Our algorithm has shown better computation time during the prediction phase than the traditional sequential SVM methods (One vs. One, One vs. Rest) and out-performs them as the size of the dataset grows. This approach also classifies the data with higher accuracy than the traditional multi-class algorithms.


Thinking Required

There exists a theory of a single general-purpose learning algorithm which could explain the principles its operation. It assumes the initial rough architecture, a small library of simple innate circuits which are prewired at birth. and proposes that all significant mental algorithms are learned. Given current understanding and observations, this paper reviews and lists the ingredients of such an algorithm from architectural and functional perspectives.


Variable selection with Hamming loss

We derive non-asymptotic bounds for the minimax risk of variable selection under expected Hamming loss in the Gaussian mean model in \mathbb{R}^d for classes of s-sparse vectors separated from 0 by a constant a>0. In some cases, we get exact expressions for the non-asymptotic minimax risk as a function of d,s,a and find explicitly the minimax selectors. Analogous results are obtained for the probability of wrong recovery of the sparsity pattern. As corollaries, we derive necessary and sufficient conditions for such asymptotic properties as almost full recovery and exact recovery. Moreover, we propose data-driven selectors that provide almost full and exact recovery adaptive to the parameters of the classes.


A Benchmark Comparison of State-of-the-Practice Sentiment Analysis Methods

In the last few years thousands of scientific papers have explored sentiment analysis, several startups that measures opinions on real data have emerged, and a number of innovative products related to this theme have been developed. There are multiple methods for measuring sentiments, including lexical-based approaches and supervised machine learning methods. Despite the vast interest on the theme and wide popularity of some methods, it is unclear which method is better for identifying the polarity (i.e., positive or negative) of a message. Thus, there is a strong need to conduct a thorough apple-to-apple comparison of sentiment analysis methods, as they are used in practice, across multiple datasets originated from different data sources. Such a comparison is key for understanding the potential limitations, advantages, and disadvantages of popular methods. This study aims at filling this gap by presenting a benchmark comparison of twenty one popular sentiment analysis methods (which we call the state-of-the-practice methods). Our evaluation is based on a benchmark of twenty labeled datasets, covering messages posted on social networks, movie and product reviews, as well as opinions and comments in news articles. Our results highlight the extent to which the prediction performance of these methods varies widely across datasets. Aiming at boosting the development of this research area, we open the methods’ codes and datasets used in this paper and we deploy a benchmark system, which provides an open API for accessing and comparing sentence-level sentiment analysis methods.


Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Traditional graph-based semi-supervised learning (SSL) approaches, even though widely applied, are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the distribution on labels per node thereby achieving a space reduction from O(m) to O(\log m), under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that captures the sparsity of the label distribution and ensures the algorithm propagates labels accurately, and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. We also study different graph construction mechanisms for natural language applications and propose a robust graph augmentation strategy trained using state-of-the-art unsupervised deep learning architectures that yields further significant quality gains.


Generating News Headlines with Recurrent Neural Networks

We describe an application of an encoder-decoder recurrent neural network with LSTM units and attention to generating headlines from the text of news articles. We find that the model is quite effective at concisely paraphrasing news articles. Furthermore, we study how the neural network decides which input words to pay attention to, and specifically we identify the function of the different neurons in a simplified attention mechanism. Interestingly, our simplified attention mechanism performs better that the more complex attention mechanism on a held out set of articles.


Variance Reduction for Distributed Stochastic Gradient Descent

Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or require a full pass over the (large) data set at the end of each epoch to calculate the exact gradient of the objective function. This makes current VR methods impractical in distributed or parallel settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants with both high and low communication latency. We find that our distributed algorithms scale linearly with the number of local workers and remain stable even with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms consistently converge faster than other stochastic methods.


Deep Attention Recurrent Q-Network

A deep learning approach to reinforcement learning led to a general learner able to train on visual input to play a variety of arcade games at the human and superhuman levels. Its creators at the Google DeepMind’s team called the approach: Deep Q-Network (DQN). We present an extension of DQN by ‘soft’ and ‘hard’ attention mechanisms. Tests of the proposed Deep Attention Recurrent Q-Network (DARQN) algorithm on multiple Atari 2600 games show level of performance superior to that of DQN. Moreover, built-in attention mechanisms allow a direct online monitoring of the training process by highlighting the regions of the game screen the agent is focusing on when making decisions.


A Framework for Computing on Large Dynamic Graphs

This proposal presents a graph computing framework intending to support both online and offline computing on large dynamic graphs efficiently. The framework proposes a new data model to support rich evolving vertex and edge data types. It employs a replica-coherence protocol to improve data locality thus can adapt to data access patterns of different algorithms. A new computing model called protocol dataflow is proposed to implement and integrate various programming models for both online and offline computing on large dynamic graphs. A central topic of the proposal is also the analysis of large real dynamic graphs using our proposed framework. Our goal is to calculate the temporal patterns and properties which emerge when the large graphs keep evolving. Thus we can evaluate the capability of the proposed framework. Key words: Large dynamic graph, programming model, distributed computing.


Hierarchical Sparse Modeling: A Choice of Two Regularizers

Demanding sparsity in estimated models has become a routine practice in statistics. In many situations, we wish to demand that the sparsity patterns attained honor certain problem-specific constraints. Hierarchical sparse modeling (HSM) refers to situations in which these constraints specify that one set of parameters be set to zero whenever another is set to zero. In recent years, numerous papers have developed convex regularizers for this form of sparsity structure arising in areas including interaction modeling, time series, and covariance estimation. In this paper, we observe that these methods fall into two frameworks, which have not been systematically compared in the context of HSM. The purpose of this paper is to provide a side-by-side comparison of these two frameworks for HSM in terms of their statistical properties and computational efficiency. We call attention to a problem with the more commonly used framework and provide new insights into the other, which can greatly improve its computational performance. Finally, we compare the two methods in the context of covariance estimation, where we introduce a new sparsely-banded estimator, which we show achieves the statistical advantages of an existing method but is simpler to compute.


A bijection for nonorientable general maps

FabSim: facilitating computational research through automation on large-scale and distributed e-infrastructures

Statics and dynamics of infinite-dimensional liquids and glasses: a parallel, compact derivation

Wavelet Based Load Models from AMI Data

Faster Information Gathering in Ad-Hoc Radio Tree Networks

The Random Division of the Unit Interval and the Approximate -1 Exponent in the Monkey-at-the-Typewriter Model of Zipf’s Law

Arcs in $\Z^2_{2p}$

Higher Order Estimating Equations for High-dimensional Models

Right-jumps and pattern avoiding permutations

Simple Baseline for Visual Question Answering

On star-forest ascending subgraph decomposition

Finding $k$ Simple Shortest Paths and Cycles

Limit theorems for Markovian Hawkes processes with a large initial intensity

Multi-body quenched disordered $XY$ and $p$-clock models on random graphs

A Time-varying Parameter Based Seasonally-adjusted Bayesian State-space Model for Forecasting

A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

Extremal k-apex Trees for Randic Index

Digital Genesis: Computers, Evolution and Artificial Life

Piecewise deterministic Markov processes in biological models

The multi-orientable random tensor model, a review

From rules to runs: A dynamic epistemic take on imperfect information games

Minimum Cut of Directed Planar Graphs in O(nloglogn) Time

An Explicit Rate Bound for the Over-Relaxed ADMM

Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference

Computing the vertex Folkman numbers $F_v(a_1, …, a_s; m – 1)$ when $\max\{a_1, …, a_s\} = 6$

Level-Based Analysis of Genetic Algorithms for Combinatorial Optimization

Construction of Wannier Functions in Disordered Systems

Risk Minimization in Structured Prediction using Orbit Loss

Discrete Hilbert Transform a la Gundy-Varopoulos

A bloated FM-index reducing the number of cache misses during the search

On Konig-Egervary Collections of Maximum Critical Independent Sets

Uniform contractivity in Wasserstein metric for the original 1D Kac’s model

Filter Based Methods For Statistical Linear Inverse Problems

Coloring points with respect to squares

Learning population and subject-specific brain connectivity networks via Mixed Neighborhood Selection

Subsets of $\mathbb{F}_q[x]$ free of 3-term geometric progressions

Fast Optimization Algorithm on Riemannian Manifolds and Its Application in Low-Rank Representation

Controllable Synchronization of Hierarchically Networked Oscillators

Knowledge Sharing in Coalitions

Rademacher Complexity of the Restricted Boltzmann Machine

Entropy and Grand Lebesgue Spaces approach for Prokhorov-Skorokhod continuity of random processes, with tail estimates

Bounds on bilinear inverse forms via Gaussian quadrature with applications

Statistical Inference for Ergodic Point Processes and Limit Order Book

Sparsified Cholesky and Multigrid Solvers for Connection Laplacians

Probabilistic Structural Controllability in Causal Bayesian Networks

Einstein relation and steady states for the random conductance model

THCHS-30 : A Free Chinese Speech Corpus

Driverseat: Crowdstrapping Learning Tasks for Autonomous Driving

Statistical Signatures of Structural Organization: The case of long memory in renewal processes

Explaining reviews and ratings with PACO: Poisson Additive Co-Clustering

The functional AR(1) process with a unit root

Designing Applications in a Hybrid Cloud

Classification of Manifolds by Single-Layer Neural Networks

Block Stanley decompositions I. Elementary and gnomon decompositions

On Routing Disjoint Paths in Bounded Treewidth Graphs

Fourientation activities and the Tutte polynomial

Conformal Ward and BPZ Identities for Liouville quantum field theory

The cycle descent statistic on permutations

Distributed protocols for spanning tree construction and leader election

On closedness under convolution roots related to an infinitely divisible distribution in the distribution class L(γ)

Combinatorics of a fractal tiling family

k-Trails: Recognition, Complexity, and Approximations

Bayesian inference and model comparison for metallic fatigue data

Want Answers? A Reddit Inspired Study on How to Pose Questions

Fast Algorithms for Game-Theoretic Centrality Measures

Restricted Low-Rank Approximation via ADMM

On the Data Augmentation Algorithm for Bayesian Multivariate Linear Regression with Non-Gaussian Errors

Absence of Localization in Disordered Two Dimensional Electron Gas at Weak Magnetic Field and Strong Spin-Orbit Coupling

The Propus Construction for Symmetric Hadamard Matrices

Similarity Learning via Adaptive Regression and Its Application to Image Retrieval

Solving the Subset Sum Problem with Heap-Ordered Subset Trees

Relative t-designs in binary Hamming association scheme H(n,2)

Elliptic rook and file numbers

Twisted patterns in large subsets of $\mathbb{Z}^N$

A Restricted Visual Turing Test for Deep Scene and Event Understanding

Approximate Range Counting Revisited

Locating recombination hot spots in genomic sequences through the singular value decomposition

Restricted Product Sets under Unique Representability

Stochastic Collapsed Variational Inference for Sequential Data

Stochastic Collapsed Variational Inference for Hidden Markov Models

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora

The mechanics of shuffle products and their siblings

Robust Estimation of the Generalized Loggamma Model. The R Package robustloggamma

Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

Exact Algorithms via Monotone Local Search

Quasi likelihood analysis of point processes for ultra high frequency data

Guarding from Spurious Discoveries in High Dimension

A Bayesian test to identify variance effects

A Novel Paradigm for Calculating Ramsey Number via Artificial Bee Colony Algorithm

The transition probability of the $q$-TAZRP ($q$-Bosons) with inhomogeneous jump rates

Data Center Server Provision: Distributed Asynchronous Control for Coupled Renewal Systems

Polynomial bounds for decoupling, with applications

Pruned double Hurwitz numbers

Creation of a Deep Convolutional Auto-Encoder in Caffe

Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text

Stochastic epidemic models featuring contact tracing with delays

Negative local feedbacks in Boolean networks

State of the Art Control of Atari Games Using Shallow Reinforcement Learning