We provide a general mathematical framework for selective inference with supervised model selection procedures characterized by quadratic forms in the outcome variable. Forward stepwise with groups of variables is an important special case as it allows models with categorical variables or factors. Models can be chosen by AIC, BIC, or a fixed number of steps. We provide an exact significance test for each group of variables in the selected model based on an appropriately truncated $\chi$ or $F$ distribution for the cases of known and unknown $\sigma^2$ respectively. An efficient software implementation is available as a package in the R statistical programming language.
We present two approaches that use unlabeled data to improve sequence learning with recurrent networks. The first approach is to predict what comes next in a sequence, which is a conventional language model in natural language processing. The second approach is to use a sequence autoencoder, which reads the input sequence into a vector and predicts the input sequence again. These two algorithms can be used as a ‘pretraining’ step for a later supervised sequence learning algorithm. In other words, the parameters obtained from the unsupervised step can be used as a starting point for other supervised training models. In our experiments, we find that long short term memory recurrent networks after being pretrained with the two approaches are more stable and generalize better. With pretraining, we are able to train long short term memory recurrent networks up to a few hundred timesteps, thereby achieving strong performance in many text classification tasks, such as IMDB, DBpedia and 20 Newsgroups.
The problem of convex optimization is studied. Usually in convex optimization the minimization is over a d-dimensional domain. Very often the convergence rate of an optimization algorithm depends on the dimension d. The algorithms studied in this paper utilize dictionaries instead of a canonical basis used in the coordinate descent algorithms. We show how this approach allows us to reduce dimensionality of the problem. Also, we investigate which properties of a dictionary are beneficial for the convergence rate of typical greedy-type algorithms.
Recommendation systems often rely on point-wise loss metrics such as the mean squared error. However, in real recommendation settings only few items are presented to a user. This observation has recently encouraged the use of rank-based metrics. LambdaMART is the state-of-the-art algorithm in learning to rank which relies on such a metric. Despite its success it does not have a principled regularization mechanism relying in empirical approaches to control model complexity leaving it thus prone to overfitting. Motivated by the fact that very often the users’ and items’ descriptions as well as the preference behavior can be well summarized by a small number of hidden factors, we propose a novel algorithm, LambdaMART Matrix Factorization (LambdaMART-MF), that learns a low rank latent representation of users and items using gradient boosted trees. The algorithm factorizes lambdaMART by defining relevance scores as the inner product of the learned representations of the users and items. The low rank is essentially a model complexity controller; on top of it we propose additional regularizers to constraint the learned latent representations that reflect the user and item manifolds as these are defined by their original feature based descriptors and the preference behavior. Finally we also propose to use a weighted variant of NDCG to reduce the penalty for similar items with large rating discrepancy. We experiment on two very different recommendation datasets, meta-mining and movies-users, and evaluate the performance of LambdaMART-MF, with and without regularization, in the cold start setting as well as in the simpler matrix completion setting. In both cases it outperforms in a significant manner current state of the art algorithms.
Recommendation systems have been integrated into the majority of large online systems to filter and rank information according to user profiles. It thus influences the way users interact with the system and, as a consequence, bias the evaluation of the performance of a recommendation algorithm computed using historical data (via offline evaluation). This paper describes this bias and discuss the relevance of a weighted offline evaluation to reduce this bias for different classes of recommendation algorithms.
Finding experts for a given problem is recognized as a difficult task. Even when a taxonomy of subject expertise exists, and is associated with a group of experts, it can be hard to exploit by users who have not internalized the taxonomy. Here we present a method for both attaching experts to a domain ontology, and hiding this fact from the end user looking for an expert. By linking Wikipedia to this same pivot ontology, we describe how a user can browse Wikipedia, as they normally do to search for information, and use this browsing behavior to find experts. Experts are characterized by their textual productions (webpages, publications, reports), and these textual productions are attached to concepts in the pivot ontology. When the user finds the Wikipedia page characterizing their need, a list of experts is displayed. In this way we transform Wikipedia into a search engine for experts.
We propose novel model transfer-learning methods that refine a decision forest model M learned within a ‘source’ domain using a training set sampled from a ‘target’ domain, assumed to be a variation of the source. We present two random forest transfer algorithms. The first algorithm searches greedily for locally optimal modifications of each tree structure by trying to locally expand or reduce the tree around individual nodes. The second algorithm does not modify structure, but only the parameter (thresholds) associated with decision nodes. We also propose to combine both methods by considering an ensemble that contains the union of the two forests. The proposed methods exhibit impressive experimental results over a range of problems.
Recurrent Neural Networks (RNNs) are powerful models that achieve unparalleled performance on several pattern recognition problems. However, training of RNNs is a computationally difficult task owing to the well-known ‘vanishing/exploding’ gradient problems. In recent years, several algorithms have been proposed for training RNNs. These algorithms either: exploit no (or limited) curvature information and have cheap per-iteration complexity; or attempt to gain significant curvature information at the cost of increased per-iteration cost. The former set includes diagonally-scaled first-order methods such as ADAM and ADAGRAD while the latter consists of second-order algorithms like Hessian-Free Newton and K-FAC. In this paper, we present an novel stochastic quasi-Newton algorithm (adaQN) for training RNNs. Our approach retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme. The method is judicious in storing and retaining L-BFGS curvature pairs which is indirectly used as a means of controlling the quality of the steps. We present numerical experiments on two language modeling tasks and show that adaQN performs at par, if not better, than popular RNN training algorithms. These results suggest that quasi-Newton algorithms have the potential to be a viable alternative to first- and second-order methods for training RNNs.
This paper is an empirical study of the distributed deep learning for a question answering subtask: answer selection. Comparison studies of SGD, MSGD, DOWNPOUR and EASGD/EAMSGD algorithms have been presented. Experimental results show that the message passing interface based distributed framework can accelerate the convergence speed at a sublinear scale. This paper demonstrates the importance of distributed training: with 120 workers, an 83x speedup is achievable and running time is decreased from 107.9 hours to 1.3 hours, which will benefit the productivity significantly.
Two popular variable screening methods under the ultra-high dimensional setting with the desirable sure screening property are the sure independence screening (SIS) and the forward regression (FR). Both are classical variable screening methods and recently have attracted greater attention under the new light of high-dimensional data analysis. We consider a new and simple screening method that incorporates multiple predictors in each step of forward regression, with decision on which variables to incorporate based on the same criterion. If only one step is carried out, it actually reduces to the SIS. Thus it can be regarded as a generalization and unification of the FR and the SIS. More importantly, it preserves the sure screening property and has similar computational complexity as FR in each step, yet it can discover the relevant covariates in fewer steps. Thus, it reduces the computational burden of FR drastically while retaining advantages of the latter over SIS. Furthermore, we show that it can find all the true variables if the number of steps taken is the same as the correct model size, even when using the original FR. An extensive simulation study and application to two real data examples demonstrate excellent performance of the proposed method.