We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed in Byrd et al. (2014) as well as a recent approach to variance reduction for stochastic gradient descent from Johnson and Zhang (2013). We demonstrate experimentally that our algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving the optimization problems to high levels of precision. Furthermore, we show that our algorithm performs well for a wide-range of step sizes, often differing by several orders of magnitude.
This paper establishes the consistency of spectral approaches to data clustering. We consider clustering of point clouds obtained as samples of a ground-truth measure. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points they connect. We investigate the spectral convergence of both unnormalized and normalized graph Laplacians towards the appropriate operators in the continuum domain. We obtain sharp conditions on how the connectivity radius can be scaled with respect to the number of sample points for the spectral convergence to hold. We also show that the discrete clusters obtained via spectral clustering converge towards a continuum partition of the ground truth measure. Such continuum partition minimizes a functional describing the continuum analogue of the graph-based spectral partitioning. Our approach, based on variational convergence, is general and flexible.
Data-intensive computing has become one of the major workloads on traditional high-performance computing (HPC) clusters. Currently, deploying data-intensive computing software framework on HPC clusters still faces performance and scalability issues. In this paper, we developed a new two-level storage system by integrating Tachyon, an in-memory file system with OrangeFS, a parallel file system. We modeled the I/O throughputs of four storage structures: HDFS, OrangeFS, Tachyon and two-level storage. We conducted computational experiments to characterize I/O throughput behavior of two-level storage and compared its performance to that of HDFS and OrangeFS, using the TeraSort benchmark application. Theoretical models and experimental tests both showed that the two-level storage system can increase the aggregate I/O throughputs. Meanwhile, the two-level storage also has low data fault tolerance cost and high storage capacity. The two-level storage may provide a better solution for I/O intensive big data analytics on traditional HPC clusters.
Inspired by empirical data on real world complex networks, the last few years have seen an explosion in proposed generative models to understand and explain observed properties of real world networks, including power law degree distribution and ‘small world’ distance scaling. In this context, a natural question is the phenomenon of {\it change point}, understanding how abrupt changes in parameters driving the network model change structural properties of the network. We study this phenomenon in one popular class of dynamically evolving networks: preferential attachment models. We derive asymptotic properties of various functionals of the network including the degree distribution as well as maximal degree asymptotics, in essence showing that the change point does effect the degree distribution but does {\bf not} change the degree exponent. This provides further evidence for long range dependence and sensitive dependence of the evolution of the process on the initial evolution of the process in such self-reinforced systems. We then propose an estimator for the change point and prove consistency properties of this estimator. The methodology developed highlights the effect of the non-ergodic nature of the evolution of the network on classical change point estimators.
In this paper, we address the problem of identifying linear structural equation models. We first extend the edge set half-trek criterion to cover a broader class of models. We then show that any semi-Markovian linear model can be recursively decomposed into simpler sub-models, resulting in improved identification power. Finally, we show that, unlike the existing methods developed for linear models, the resulting method subsumes the identification algorithm of non-parametric models.
This work investigates how the traditional image classification pipelines can be extended into a deep architecture, inspired by the recent successes of applying deep neural networks. We propose a hierarchical image representation based on layer-by-layer joint feature boosting and dictionary learning. In each layer, we construct a dictionary of filters by combining the filters from the lower layer, and iteratively optimize the image representation with a joint discriminative-generative formulation, i.e., minimization of empirical classification error plus regularization of analysis image generation over training images. For optimization, we perform two iterating steps: i) to minimize the classification error, select the most discriminative features using the gentle adaboost algorithm; ii) according to the feature selection, update the filters to minimize the regularization on analysis image representation using the gradient descent method. Once the optimization is converged, we learn the higher layer representation in the same way. Our model delivers several distinct advantages. First, our layer-wise optimization provides the potential to build very deep architectures. Second, the generated image representation is compact and meaningful by jointly considering image classification and generation. In several challenging visual recognition tasks, we demonstrate superior or comparable performances of our framework over other state-of-the-art approaches.
Dropout and other feature noising schemes have shown promising results in controlling over-fitting by artificially corrupting the training data. Though extensive theoretical and empirical studies have been performed for generalized linear models, little work has been done for support vector machines (SVMs), one of the most successful approaches for supervised learning. This paper presents dropout training for both linear SVMs and the nonlinear extension with latent representation learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least square (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least square problem, where the re-weights are analytically updated. For nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and extend the data augmentation technique in conjunction with first-order Taylor-expansion to deal with the intractable expected non-smooth hinge loss and the nonlinearity of latent representations. Finally, we apply the similar data augmentation ideas to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions, and we further develop a non-linear extension of logistic regression by incorporating one layer of latent representations. Our algorithms offer insights on the connection and difference between the hinge loss and logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training on significantly boosting the classification accuracy of both linear and nonlinear SVMs. In addition, the nonlinear SVMs further improve the prediction performance on several image datasets.
Decision analytics commonly focuses on the text mining of financial news sources in order to provide managerial decision support and to predict stock market movements. Existing predictive frameworks almost exclusively apply traditional machine learning methods, whereas recent research indicates that traditional machine learning methods are not sufficiently capable of extracting suitable features and capturing the non-linear nature of complex tasks. As a remedy, novel deep learning models aim to overcome this issue by extending traditional neural network models with additional hidden layers. Indeed, deep learning has been shown to outperform traditional methods in terms of predictive performance. In this paper, we adapt the novel deep learning technique to financial decision support. In this instance, we aim to predict the direction of stock movements following financial disclosures. As a result, we show how deep learning can outperform the accuracy of random forests as a benchmark for machine learning by 5.66%.
Structural kernels are a flexible learning paradigm that has been widely used in Natural Language Processing. However, the problem of model selection in kernel-based methods is usually overlooked. Previous approaches mostly rely on setting default values for kernel hyperparameters or using grid search, which is slow and coarse-grained. In contrast, Bayesian methods allow efficient model selection by maximizing the evidence on the training data through gradient-based methods. In this paper we show how to perform this in the context of structural kernels by using Gaussian Processes. Experimental results on tree kernels show that this procedure results in better prediction performance compared to hyperparameter optimization via grid search. The framework proposed in this paper can be adapted to other structures besides trees, e.g., strings and graphs, thereby extending the utility of kernel-based methods.
We focus on spectral clustering of unlabeled graphs and review some results on clustering methods which achieve weak or strong consistent identification in data generated by such models. We also present a new algorithm which appears to perform optimally both theoretically using asymptotic theory and empirically.
Deep compositional models of meaning acting on distributional representations of words in order to produce vectors of larger text constituents are evolving to a popular area of NLP research. We detail a compositional distributional framework based on a rich form of word embeddings that aims at facilitating the interactions between words in the context of a sentence. Embeddings and composition layers are jointly learned against a generic objective that enhances the vectors with syntactic information from the surrounding context. Furthermore, each word is associated with a number of senses, the most plausible of which is selected dynamically during the composition process. We evaluate the produced vectors qualitatively and quantitatively with positive results. At the sentence level, the effectiveness of the framework is demonstrated on the MSRPar task, for which we report results within the state-of-the-art range.
We propose a new high-dimensional linear regression estimator: the Discrete Dantzig Selector, which minimizes the number of nonzero regression coefficients, subject to a budget on the maximal absolute correlation between the features and the residuals. We show that the estimator can be expressed as a solution to a Mixed Integer Linear Optimization (MILO) problem—a computationally tractable framework that enables the computation of provably optimal global solutions. Our approach has the appealing characteristic that even if we terminate the optimization problem at an early stage, it exits with a certificate of sub-optimality on the quality of the solution. We develop new discrete first order methods, motivated by recent algorithmic developments in first order continuous convex optimization, to obtain high quality feasible solutions for the Discrete Dantzig Selector problem. Our proposal leads to advantages over the off-the-shelf state-of-the-art integer programming algorithms, which include superior upper bounds obtained for a given computational budget. When a solution obtained from the discrete first order methods is passed as a warm-start to a MILO solver, the performance of the latter improves significantly. Exploiting problem specific information, we propose enhanced MILO formulations that further improve the algorithmic performance of the MILO solvers. We demonstrate, both theoretically and empirically, that, in a wide range of regimes, the statistical properties of the Discrete Dantzig Selector are superior to those of popular -based approaches. For problem instances with features and observations, our computational framework delivers optimal solutions in a few minutes and certifies optimality within an hour.
We propose a novel parameter estimation procedure that works efficiently for conditional random fields (CRF). This algorithm is an extension to the maximum likelihood estimation (MLE), using loss functions defined by Bregman divergences which measure the proximity between the model expectation and the empirical mean of the feature vectors. This leads to a flexible training framework from which multiple update strategies can be derived using natural gradient descent (NGD). We carefully choose the convex function inducing the Bregman divergence so that the types of updates are reduced, while making the optimization procedure more effective by transforming the gradients of the log-likelihood loss function. The derived algorithms are very simple and can be easily implemented on top of the existing stochastic gradient descent (SGD) optimization procedure, yet it is very effective as illustrated by experimental results.