Methods for Projection Pursuit aim to facilitate the visual exploration of high-dimensional data by identifying interesting low-dimensional projections. A major challenge is the design of a suitable quality metric of projections, commonly referred to as the projection index, to be maximized by the Projection Pursuit algorithm. In this paper, we introduce a new information-theoretic strategy for tackling this problem, based on quantifying the amount of information the projection conveys to a user given their prior beliefs about the data. The resulting projection index is a subjective quantity, explicitly dependent on the intended user. As a useful illustration, we developed this idea for two particular kinds of prior beliefs. The first kind leads to PCA (Principal Component Analysis), shining new light on when PCA is (not) appropriate. The second kind leads to a novel projection index, the maximization of which can be regarded as a robust variant of PCA. We show how this projection index, though non-convex, can be effectively maximized using a modified power method as well as using a semidefinite programming relaxation. The usefulness of this new projection index is demonstrated in comparative empirical experiments against PCA and a popular Projection Pursuit method.
Reinforcement learning is a formal framework for modeling agents that learn to solve tasks. For example, one important task for animals is to navigate in an environment to find food or to return to their nest. In such tasks, the agent has to learn a path through the environment from start states to goal states, by visiting a sequence of intermediate states. The agent receives reward on a goal state. Concretely, we need to learn a policy that maps each encountered state to an immediate action that leads to a next state, eventually leading to a goal state. We say that a learning process has converged if eventually the policy will no longer change, i.e., the policy stabilizes. The intuition of paths and navigation policies can be applied generally. Indeed, in this article, we study navigation tasks formalized as a graph structure that, for each application of an action to a state, describes the possible successor states that could result from that application. In contrast to standard reinforcement learning, we essentially simplify numeric reward signals to boolean flags on the transitions in the graph. The resulting framework enables a clear theoretical study of how properties of the graph structure can cause convergence of the learning process. In particular, we formally study a learning process that detects revisits to states in the graph, i.e., we detect cycles, and the process keeps adjusting the policy until no more cycles are made. So, eventually, the agent goes straight to reward from each start state. We identify reducibility of the task graph as a sufficient condition for this learning process to converge. We also syntactically characterize the form of the final policy, which can be used to detect convergence in a simulation.
Neural network models have been demonstrated to be capable of achieving remarkable performance in sentence and document modeling. Convolutional neural network (CNN) and recurrent neural network (RNN) are two mainstream architectures for such modeling tasks, which adopt totally different ways of understanding natural languages. In this work, we combine the strengths of both architectures and propose a novel and unified model called C-LSTM for sentence representation and text classification. C-LSTM utilizes CNN to extract a sequence of higher-level phrase representations, and are fed into a long short-term memory recurrent neural network (LSTM) to obtain the sentence representation. C-LSTM is able to capture both local features of phrases as well as global and temporal sentence semantics. We evaluate the proposed architecture on sentiment classification and question classification tasks. The experimental results show that the C-LSTM outperforms both CNN and LSTM and can achieve excellent performance on these tasks.
Among classical search algorithms with the same heuristic information, with sufficient memory A* is essentially as fast as possible in finding a proven optimal solution. However, in many situations optimal solutions are simply infeasible, and thus search algorithms that trade solution quality for speed are desirable. In this paper, we formalize the process of classical search as a metalevel decision problem, the Abstract Search MDP. For any given optimization criterion, this establishes a well-defined notion of the best possible behaviour for a search algorithm and offers a theoretical approach to the design of algorithms for that criterion. We proceed to approximately solve a version of the Abstract Search MDP for anytime algorithms and thus derive a novel search algorithm, Search by Maximizing the Incremental Rate of Improvement (SMIRI). SMIRI is shown to outperform current state-of-the-art anytime search algorithms on a parametrized stochastic tree model for most of the tested parameter values.
Latent variable models are a fundamental modeling tool in machine learning applications, but they present significant computational and analytical challenges. The popular EM algorithm and its variants, is a much used algorithmic tool; yet our rigorous understanding of its performance is highly incomplete. Recently, work in Balakrishnan et al. (2014) has demonstrated that for an important class of problems, EM exhibits linear local convergence. In the high-dimensional setting, however, the M-step may not be well defined. We address precisely this setting through a unified treatment using regularization. While regularization for high-dimensional problems is by now well understood, the iterative EM algorithm requires a careful balancing of making progress towards the solution while identifying the right structure (e.g., sparsity or low-rank). In particular, regularizing the M-step using the state-of-the-art high-dimensional prescriptions (e.g., Wainwright (2014)) is not guaranteed to provide this balance. Our algorithm and analysis are linked in a way that reveals the balance between optimization and statistical errors. We specialize our general framework to sparse gaussian mixture models, high-dimensional mixed regression, and regression with missing variables, obtaining statistical guarantees for each of these examples.
In the causal adjustment setting, variable selection techniques based on one of either the outcome or treatment allocation model can result in the omission of confounders, which leads to bias, or the inclusion of spurious variables, which leads to variance inflation, in the propensity score. We propose a variable selection method based on a penalized objective function which considers the outcome and treatment assignment models simultaneously. The proposed method facilitates confounder selection in high-dimensional settings. We show that under regularity conditions our method attains the oracle property. The selected variables are used to form a doubly robust regression estimator of the treatment effect. We show that under some conditions our method attains the oracle property. Simulation results are presented and economic growth data are analyzed. Specifically, we study the effect of life expectancy as a measure of population health on the average growth rate of gross domestic product per capita.
Matrix-parametrized models, including multiclass logistic regression and sparse coding, are used in machine learning (ML) applications ranging from computer vision to computational biology. When these models are applied to large-scale ML problems starting at millions of samples and tens of thousands of classes, their parameter matrix can grow at an unexpected rate, resulting in high parameter synchronization costs that greatly slow down distributed learning. To address this issue, we propose a Sufficient Factor Broadcasting (SFB) computation model for efficient distributed learning of a large family of matrix-parameterized models, which share the following property: the parameter update computed on each data sample is a rank-1 matrix, i.e., the outer product of two ‘sufficient factors’ (SFs). By broadcasting the SFs among worker machines and reconstructing the update matrices locally at each worker, SFB improves communication efficiency — communication costs are linear in the parameter matrix’s dimensions, rather than quadratic — without affecting computational correctness. We present a theoretical convergence analysis of SFB, and empirically corroborate its efficiency on four different matrix-parametrized ML models.
This document provides a brief introduction to Convolutional Neural Networks (CNNs), discussing recently published papers and newly form techniques in developing these brilliantly fantastic image recognition models. This introduction assumes you are familiar with the workings of neural networks and have some background in Artificial Intelligence.
Text segmentation (TS) aims at dividing long text into coherent segments which reflect the subtopic structure of the text. It is beneficial to many natural language processing tasks, such as Information Retrieval (IR) and document summarisation. Current approaches to text segmentation are similar in that they all use word-frequency metrics to measure the similarity between two regions of text, so that a document is segmented based on the lexical cohesion between its words. Various NLP tasks are now moving towards the semantic web and ontologies, such as ontology-based IR systems, to capture the conceptualizations associated with user needs and contents. Text segmentation based on lexical cohesion between words is hence not sufficient anymore for such tasks. This paper proposes OntoSeg, a novel approach to text segmentation based on the ontological similarity between text blocks. The proposed method uses ontological similarity to explore conceptual relations between text segments and a Hierarchical Agglomerative Clustering (HAC) algorithm to represent the text as a tree-like hierarchy that is conceptually structured. The rich structure of the created tree further allows the segmentation of text in a linear fashion at various levels of granularity. The proposed method was evaluated on a wellknown dataset, and the results show that using ontological similarity in text segmentation is very promising. Also we enhance the proposed method by combining ontological similarity with lexical similarity and the results show an enhancement of the segmentation quality.
Gaussian processes (GPs) provide a general and analytically tractable way of capturing complex time-varying, nonparametric functions. The time varying parameters of GPs can be explained as a composition of base kernels such as linear, smoothness or periodicity in that covariance kernels are closed under addition and multiplication. The Automatic Bayesian Covariance Discovery (ABCD) system constructs natural-language description of time-series data by treating unknown time-series data nonparametrically using GPs. Unfortunately, learning a composite covariance kernel with a single time-series dataset often results in less informative kernels instead of finding qualitative distinct descriptions. We address this issue by proposing a relational kernel learning which can model relationship between sets of data and find shared structure among the time series datasets. We show the shared structure can help learning more accurate models for sets of regression problems with some synthetic data, US top market capitalization stock data and US house sales index data.
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data but they also often include data streams and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method allowing to consider in a single and versatile framework regression problems, as well as two-class and multi-class classification problems. Focusing on classification problems, this paper reviews available proposals about random forests in parallel environments as well as about online random forests. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment three variants involving subsampling, Big Data-bootstrap and MapReduce respectively, on two massive datasets (15 and 120 millions of observations), a simulated one as well as real world data.
Instance segmentation is the problem of detecting and delineating each object of interest appearing in an image. Current instance segmentation approaches consist of ensembles of modules that are trained independently of each other, thus missing learning opportunities. Here we propose a new instance segmentation paradigm consisting in an end-to-end method that learns how to segment instances sequentially. The model is based on a recurrent neural network that sequentially finds objects and their segmentations one at a time. This net is provided with a spatial memory that keeps track of what pixels have been explained and allows handling occlusion. In order to train the model we designed a new principled loss function that accurately represents the properties of the instance segmentation problem. In the experiments carried out we found that our method outperforms all $5$ state of the art approaches on the Plant Phenotyping dataset.
Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal NTMs have a weakness that is caused by their sequential nature: they cannot be parallelized and are are hard to train due to their large depth when unfolded. We present a neural network architecture to address this problem: the Neural GPU. It is based on a type of convolutional gated recurrent unit and, like the NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly parallel which makes it easier to train and efficient to run. An essential property of algorithms is their ability to handle inputs of arbitrary size. We show that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances. We verified it on a number of tasks including long addition and long multiplication of numbers represented in binary. We train the Neural GPU on numbers with upto 20 bits and observe no errors whatsoever while testing it, even on much longer numbers. To achieve these results we introduce a technique for training deep recurrent networks: parameter sharing relaxation. We also found a small amount of dropout and gradient noise to have a large positive effect on learning and generalization.