In this paper we develop a statistical theory and an implementation of deep learning models. We show that an elegant variable splitting scheme for the alternating direction method of multipliers optimises a deep learning objective. We allow for non-smooth non-convex regularisation penalties to induce sparsity in parameter weights. We provide a link between traditional shallow layer statistical models such as principal component and sliced inverse regression and deep layer models. We also define the degrees of freedom of a deep learning predictor and a predictive MSE criteria to perform model selection for comparing architecture designs. We focus on deep multiclass logistic learning although our methods apply more generally. Our results suggest an interesting and previously under-exploited relationship between deep learning and proximal splitting techniques. To illustrate our methodology, we provide a multi-class logit classification analysis of Fisher’s Iris data where we illustrate the convergence of our algorithm. Finally, we conclude with directions for future research.
Generalized Word Length Pattern (GWLP) is an important and widely-used tool for comparing fractional factorial designs. We consider qualitative factors, and we code their levels using the roots of the unity. We write the GWLP of a fraction ${\mathcal F}$ using the polynomial indicator function, whose coefficients encode many properties of the fraction. We show that the coefficient of a simple or interaction term can be written using the counts of its levels. This apparently simple remark leads to major consequence, including a convolution formula for the counts. We also show that the mean aberration of a term over the permutation of its levels provides a connection with the variance of the level counts. Moreover, using mean aberrations for symmetric $s^m$ designs with $s$ prime, we derive a new formula for computing the GWLP of ${\mathcal F}$. It is computationally easy, does not use complex numbers and also provides a clear way to interpret the GWLP. As case studies, we consider non-isomorphic orthogonal arrays that have the same GWLP. The different distributions of the mean aberrations suggest that they could be used as a further tool to discriminate between fractions.
We propose a privacy-enhanced matrix factorization recommender that exploits the fact that users can often be grouped together by interest. This allows a form of ‘hiding in the crowd’ privacy. We introduce a novel matrix factorization approach suited to making recommendations in a shared group (or nym) setting and the BLC algorithm for carrying out this matrix factorization in a privacy-enhanced manner. We demonstrate that the increased privacy does not come at the cost of reduced recommendation accuracy.
This paper examines a general class of inferential problems in semiparametric and nonparametric models defined by conditional moment restrictions. We construct tests for the hypothesis that at least one element of the identified set satisfies a conjectured (Banach space) ‘equality’ and/or (a Banach lattice) ‘inequality’ constraint. Our procedure is applicable to identified and partially identified models, and is shown to control the level, and under some conditions the size, asymptotically uniformly in an appropriate class of distributions. The critical values are obtained by building a strong approximation to the statistic and then bootstrapping a (conservatively) relaxed form of the statistic. Sufficient conditions are provided, including strong approximations using Koltchinskii’s coupling. Leading important special cases encompassed by the framework we study include: (i) Tests of shape restrictions for infinite dimensional parameters; (ii) Confidence regions for functionals that impose shape restrictions on the underlying parameter; (iii) Inference for functionals in semiparametric and nonparametric models defined by conditional moment (in)equalities; and (iv) Uniform inference in possibly nonlinear and severely ill-posed problems.
Many studies record replicated time series epochs from different groups with the goal of using frequency domain properties to discriminate between the groups. In many applications, there exists variation in cyclical patterns from time series in the same group. Although a number of frequency domain methods for the discriminant analysis of time series have been explored, there is a dearth of models and methods that account for within-group spectral variability. This article proposes a model for groups of time series in which transfer functions are modeled as stochastic variables that can account for both between-group and within-group differences in spectra that are identified from individual replicates. An ensuing discriminant analysis of stochastic cepstra under this model is developed to obtain parsimonious measures of relative power that optimally separate groups in the presence of within-group spectral variability. The approach possess favorable properties in classifying new observations and can be consistently estimated through a simple discriminant analysis of a finite number of estimated cepstral coefficients. Benefits in accounting for within-group spectral variability are empirically illustrated in a simulation study and through an analysis of gait variability.
Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention at all, despite its relevance and applicability. One of such scenarios is early text classification, where one needs to know the category of a document by using partial information only. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions as fast as possible. The importance of this variant of the text classification problem is evident in domains like sexual predator detection, where one wants to identify an offender as early as possible. This paper analyzes the suitability of the standard naive Bayes classifier for approaching this problem. Specifically, we assess its performance when classifying documents after seeing an increasingly number of terms. A simple modification to the standard naive Bayes implementation allows us to make predictions with partial information. To the best of our knowledge naive Bayes has not been used for this purpose before. Throughout an extensive experimental evaluation we show the effectiveness of the classifier for early text classification. What is more, we show that this simple solution is very competitive when compared with state of the art methodologies that are more elaborated. We foresee our work will pave the way for the development of more effective early text classification techniques based in the naive Bayes formulation.
Classification and clustering are both important topics in statistical learning. A natural question herein is whether predefined classes are really different from one another, or whether clusters are really there. Specifically, we may be interested in knowing whether the two classes defined by some class labels (when they are provided), or the two clusters tagged by a clustering algorithm (where class labels are not provided), are from the same underlying distribution. Although both are challenging questions for the high-dimensional, low-sample size data, there has been some recent development for both. However, when it is costly to manually place labels on observations, it is often that only a small portion of the class labels is available. In this article, we propose a significance analysis approach for such type of data, namely partially labeled data. Our method makes use of the whole data and tries to test the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method provides a greater power, meanwhile, maintaining the size, illustrated by a comprehensive simulation study. Theoretical properties of the proposed method are studied with emphasis on the high-dimensional, low-sample size setting. Our simulated examples help to understand when and how the information extracted from the labeled data can be effective. A real data example further illustrates the usefulness of the proposed method.
We explore the utility of clustering in reducing error in various prediction tasks. Previous work has hinted at the improvement in prediction accuracy attributed to clustering algorithms if used to pre-process the data. In this work we more deeply investigate the direct utility of using clustering to improve prediction accuracy and provide explanations for why this may be so. We look at a number of datasets, run k-means at different scales and for each scale we train predictors. This produces k sets of predictions. These predictions are then combined by a na\’ive ensemble. We observed that this use of a predictor in conjunction with clustering improved the prediction accuracy in most datasets. We believe this indicates the predictive utility of exploiting structure in the data and the data compression handed over by clustering. We also found that using this method improves upon the prediction of even a Random Forests predictor which suggests this method is providing a novel, and useful source of variance in the prediction process.