Deep neural networks with millions of parameters are at the heart of many state of the art machine learning models today. However, recent works have shown that models with much smaller number of parameters can also perform just as well. In this work, we introduce the problem of architecture-learning, i.e; learning the architecture of a neural network along with weights. We introduce a new trainable parameter called tri-state ReLU, which helps in eliminating unnecessary neurons. We also propose a smooth regularizer which encourages the total number of neurons after elimination to be small. The resulting objective is differentiable and simple to optimize. We experimentally validate our method on both small and large networks, and show that it can learn models with a considerably small number of parameters without affecting prediction accuracy.
Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.
The dimension reduction approaches commonly found in applications involving ordinal predictors are either extensions of unsupervised techniques such as principal component analysis or variable selection while modeling the regression under consideration. We introduce here a supervised model-based sufficient dimension reduction method targeting ordered categorical predictors. It is developed under the inverse regression framework and assumes the existence of a latent Gaussian variable that generates the discrete data after thresholding. The reduction is chosen before modelling the response as a function of the predictors and does not impose any distributional assumption on the response or the response given the predictors. A maximum likelihood estimator of the reduction is derived and an iterative EM-type algorithm is proposed to alleviate the computational load and thus make the method more practical. A regularized estimator which achieves variable selection and dimension reduction simultaneously is also presented. We illustrate the good performance of the proposed method through simulations and real data examples, including movie recommendation and socioeconomic index construction.
Often in real-world datasets, especially in high dimensional data, some feature values are missing. Since most data analysis and statistical methods do not handle gracefully missing values, the ?rst step in the analysis requires the imputation of missing values. Indeed, there has been a long standing interest in methods for the imputation of missing values as a pre-processing step. One recent and e?ective approach, the IRMI stepwise regression imputation method, uses a linear regression model for each real-valued feature on the basis of all other features in the dataset. However, the proposed iterative formulation lacks convergence guarantee. Here we propose a closely related method, stated as a single optimization problem and a block coordinate-descent solution which is guaranteed to converge to a local minimum. Experiments show results on both synthetic and benchmark datasets, which are comparable to the results of the IRMI method whenever it converges. However, while in the set of experiments described here IRMI often does not converge, the performance of our methods is shown to be markedly superior in comparison with other methods.
Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. That is while many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular flexible tool for imposing such high-level intuitions in the formulation of real world problems. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs and sequence learning success of Recurrent Neural Networks~(RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled as it can be used for transforming any spatio-temporal graph through employing a certain set of well defined steps. The evaluations of the proposed approach on a diverse set of problems, ranging from modeling human motion to object interactions, shows improvement over the state-of-the-art with a large margin. We expect this method to empower a new convenient approach to problem formulation through high-level spatio-temporal graphs and Recurrent Neural Networks, and be of broad interest to the community.
Existing collaborative ranking based recommender systems tend to perform best when there is enough observed ratings for each user and the observation is made completely at random. Under this setting recommender systems can properly suggest a list of recommendations according to the user interests. However, when the observed ratings are extremely sparse (e.g. in the case of cold-start users where no rating data is available), and are not sampled uniformly at random, existing ranking methods fail to effectively leverage side information to transduct the knowledge from existing ratings to unobserved ones. We propose a semi-supervised collaborative ranking model, dubbed \texttt{SCOR}, to improve the quality of cold-start recommendation. \texttt{SCOR} mitigates the sparsity issue by leveraging side information about both observed and missing ratings by collaboratively learning the ranking model. This enables it to deal with the case of missing data not at random, but to also effectively incorporate the available side information in transduction. We experimentally evaluated our proposed algorithm on a number of challenging real-world datasets and compared against state-of-the-art models for cold-start recommendation. We report significantly higher quality recommendations with our algorithm compared to the state-of-the-art.
Recommender systems use algorithms to provide users product recommendations. Recently, these systems started using machine learning algorithms because of the progress and popularity of the artificial intelligence research field. However, choosing the suitable machine learning algorithm is difficult because of the sheer number of algorithms available in the literature. Researchers and practitioners are left with little information about the best approaches or the trends in algorithms usage. Moreover, the development of a recommender system featuring a machine learning algorithm has problems and open questions that must be evaluated, so software engineers know where to focus research efforts. This work presents a systematic review of the literature that analyzes the use of machine learning algorithms in recommender systems and identifies research opportunities for the software engineering research field. The study concluded that Bayesian and decision tree algorithms are widely used in recommender systems because of their low complexity, and that requirements and design phases of recommender system development must be investigated for research opportunities.
In requirements engineering for recommender systems, software engineers must identify the data that recommendations will be based on. This is a manual and labor-intensive task, which is error-prone and expensive. One solution to this problem is the adoption of an automatic recommender system development based on a general framework. One step towards the creation of such a framework is to determine what data is used in recommender systems. In this paper, a systematic review has been done to identify what data from the user and from a recommendation item a general recommender system needs. A user and an item model is proposed and described, and some considerations about algorithm specific parameters are explained. A further goal is to study the impact of the fields of big data and Internet of things into recommendation systems.
Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of analysis to be performed next depends on the results of the previous analyses on the same data. It’s widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while adaptivity renders standard statistical theory invalid, folklore and experience suggest that not all types of adaptive analysis are equally at risk for false discoveries. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias and other statistics of an arbitrary adaptive analysis process. We prove that our mutual information based bound is tight in natural models, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. We first consider several popular feature selection protocols, like rank selection or variance-based selection. We then consider the practice of adding random noise to the observations or to the reported statistics, which is advocated by related ideas from differential privacy and blinded data analysis. We discuss the connections between these techniques and our framework, and supplement our results with illustrative simulations.
Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. Our experiments on structured output prediction and discrete latent variable modeling demonstrate that MuProp yields consistently good performance across a range of difficult tasks.