We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. Matching methods are heavily used in the social sciences due to their interpretability, but most past matching methods do not pass basic sanity checks in that they fail when irrelevant variables are introduced. Also, past methods tend to be either computationally slow or produce poor matches. The method proposed in this work aims to match units on a weighted Hamming distance, taking into account the relative importance of the covariates; the algorithm aims to match units on as many relevant variables as possible. To do this, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of the optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality matches, versatility in handling different data distributions that may have irrelevant variables, and ability to handle missing data by matching on as many available covariates as possible.
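As a rough illustration of the distance named in the abstract above, the following minimal sketch computes a weighted Hamming distance between categorical units and finds the closest controls by brute force. It is not the dynamic program the abstract describes; the function names, weights, and data are hypothetical and chosen only for illustration.

```python
def weighted_hamming(u, v, weights):
    """Sum the weights of the covariates on which units u and v disagree."""
    if not (len(u) == len(v) == len(weights)):
        raise ValueError("u, v and weights must have equal length")
    return sum(w for a, b, w in zip(u, v, weights) if a != b)


def best_matches(unit, pool, weights):
    """Return the units in pool at minimum weighted Hamming distance from unit."""
    dists = [weighted_hamming(unit, other, weights) for other in pool]
    d_min = min(dists)
    return [other for other, d in zip(pool, dists) if d == d_min]


# Hypothetical data: three categorical covariates; the third is treated
# as irrelevant by giving it weight 0 (an assumption for this example).
weights = [2.0, 1.0, 0.0]
treated = ("a", "x", "p")
controls = [("a", "x", "q"), ("a", "y", "p"), ("b", "x", "p")]
matches = best_matches(treated, controls, weights)
```

With a zero weight on the irrelevant covariate, a control that differs only on that covariate is still an exact match on everything that matters, which is the behavior the abstract asks of a sane matching method.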
Many estimators of the average causal effect of an intervention require estimation of the propensity score, the outcome regression, or both. For these estimators, we must carefully consider how to estimate the relevant regressions. It is often beneficial to utilize flexible techniques such as semiparametric regression or machine learning. However, optimal estimation of the regression function does not necessarily lead to optimal estimation of the average causal effect. Therefore, it is important to consider criteria for evaluating regression estimators and selecting hyper-parameters. A recent proposal addressed these issues via the outcome-adaptive lasso, a penalized regression technique for estimating the propensity score. We build on this proposal and offer a method that is simultaneously more flexible and more efficient than the previous proposal. We propose the outcome-highly-adaptive LASSO, a semi-parametric regression estimator designed to down-weight regions of the confounder space that do not contribute variation to the outcome regression. We show that tuning this method using collaborative targeted learning leads to superior finite-sample performance relative to competing estimators.
We present a framework for testing independence between two random vectors that is scalable to massive data. Taking a ‘divide-and-conquer’ approach, we break down the nonparametric multivariate test of independence into simple univariate independence tests on a collection of $2\times 2$ contingency tables, constructed by sequentially discretizing the original sample space at a cascade of scales from coarse to fine. This transforms a complex nonparametric testing problem—that traditionally requires quadratic computational complexity with respect to the sample size—into a multiple testing problem that can be addressed with a computational complexity that scales almost linearly with the sample size. We further consider the scenario when the dimensionality of the two random vectors also grows large, in which case the curse of dimensionality arises in the proposed framework through an explosion in the number of univariate tests to be completed. To overcome this difficulty, we propose a data-adaptive version of our method that completes a fraction of the univariate tests, judged to be more likely to contain evidence for dependency based on exploiting the spatial characteristics of the dependency structure in the data. We provide an inference recipe based on multiple testing adjustment that guarantees the inferential validity in terms of properly controlling the family-wise error rate. We demonstrate the tremendous computational advantage of the algorithm in comparison to existing approaches while achieving desirable statistical power through an extensive simulation study. In addition, we illustrate how our method can be used for learning the nature of the underlying dependency in addition to hypothesis testing. We demonstrate the use of our method through analyzing a data set from flow cytometry.
Outlier detection methods have become increasingly relevant in recent years due to increased security concerns and because of their wide application in different fields. Recently, Pauwels and Lasserre (2016) noticed that the sublevel sets of the inverse Christoffel function accurately depict the shape of a cloud of data using a sum-of-squares polynomial and can be used to perform outlier detection. In this work, we propose a kernelized variant of the inverse Christoffel function that makes it computationally tractable for data sets with a large number of features. We compare our approach to current methods on 15 different data sets and achieve the best average area under the precision recall curve (AUPRC) score, the best average rank and the lowest root mean square deviation.
Incremental learning from non-stationary data poses special challenges to the field of machine learning. Although new algorithms have been developed for this, assessment of results and comparison of behaviors are still open problems, mainly because evaluation metrics, adapted from more traditional tasks, can be ineffective in this context. Overall, there is a lack of common testing practices. This paper thus presents a testbed for incremental non-stationary learning algorithms, based on specially designed synthetic datasets. Also, test results are reported for some well-known algorithms to show that the proposed methodology is effective at characterizing their strengths and weaknesses. It is expected that this methodology will provide a common basis for evaluating future contributions in the field.
We consider the task of estimating a high-dimensional directed acyclic graph, given observations from a linear structural equation model with arbitrary noise distribution. By exploiting properties of common random graphs, we develop a new algorithm that requires conditioning only on small sets of variables. The proposed algorithm, which is essentially a modified version of the PC-Algorithm, offers significant gains in both computational complexity and estimation accuracy. In particular, it results in more efficient and accurate estimation in large networks containing hub nodes, which are common in biological systems. We prove the consistency of the proposed algorithm, and show that it also requires a less stringent faithfulness assumption than the PC-Algorithm. Simulations in low and high-dimensional settings are used to illustrate these findings. An application to gene expression data suggests that the proposed algorithm can identify a greater number of clinically relevant genes than current methods.
In failure-time settings, a competing risk event is any event that makes it impossible for the event of interest to occur. Different analytical methods are available for estimating the effect of a treatment on a failure event of interest that is subject to competing events. The choice of method depends on whether or not competing events are defined as censoring events. Though such a definition has key implications for the causal interpretation of a given estimate, explicit consideration of those implications has been rare in the statistical literature. As a result, confusion exists as to how to choose amongst available methods for analyzing data with competing events and how to interpret effect estimates. This confusion can be alleviated by understanding that the choice to define a competing event as a censoring event or not corresponds to a choice between different causal estimands. In this paper, we describe the assumptions required to identify those causal estimands and provide a mapping between such estimands and standard terminology from the statistical literature, in particular the terms ‘subdistribution function’, ‘subdistribution hazard’ and ‘cause-specific hazard’. We show that when the censoring process depends on measured time-varying risk factors, conventional statistical methods for competing events are not valid and alternative methods derived from Robins’s g-formula may recover the causal estimand of interest.
Causal mediation analysis aims to quantify the intermediate effect of a mediator on the causal pathway from treatment to outcome. With multiple mediators, which are potentially causally dependent, the number of possible decompositions of pathway effects grows exponentially with the number of mediators. Huang and Pan (2016) introduced a principal component analysis (PCA) based approach to address this challenge, in which the transformed mediators are conditionally independent given the orthogonality of the PCs. However, the transformed mediator PCs, which are linear combinations of original mediators, are difficult to interpret. In this study, we propose a sparse high-dimensional mediation analysis approach by adapting the sparse PCA method introduced by Zou and others (2006) to the mediation setting. We apply the approach to a task-based functional magnetic resonance imaging study, and show that our proposed method is able to detect biologically meaningful results related to the identified mediator.
In recent years Cloud Computing service providers have been adding Data Mining (DM) services to their catalog. Several syntactic and semantic proposals have been presented to address the problem of the definition and description of services in Cloud Computing in a comprehensive way. Considering that each provider defines its own service logic for DM, we find that using semantic languages and following the linked data proposal it is possible to design a specification for the exchange of data mining services, achieving a high degree of interoperability. In this paper we propose a schema for the complete definition of DM Cloud Computing services, considering key aspects such as pricing, interfaces, experimentation work-flow, among others. Our proposal leverages the power of Linked Data for validating its usefulness with the definition of various DM services to define a complete Cloud Computing service.
This paper introduces a high-performance implementation of the \texttt{Zolo-SVD} algorithm on distributed memory systems, which is based on the polar decomposition (PD) algorithm via Zolotarev’s function (\texttt{Zolo-PD}), originally proposed by Nakatsukasa and Freund [SIAM Review, 2016]. Our implementation relies heavily on the routines of ScaLAPACK and is therefore portable. Compared with other PD algorithms such as the QR-based dynamically weighted Halley method (\texttt{QDWH-PD}), \texttt{Zolo-PD} is naturally parallelizable and has better scalability, though it performs more floating-point operations. When using many processes, \texttt{Zolo-PD} is usually 1.20 times faster than the \texttt{QDWH-PD} algorithm, and \texttt{Zolo-SVD} can be about two times faster than the ScaLAPACK routine \texttt{PDGESVD}. These numerical experiments are performed on the Tianhe-2 supercomputer, one of the fastest supercomputers in the world, and the tested matrices include some sparse matrices from particular applications and some randomly generated dense matrices with different dimensions. Our \texttt{QDWH-SVD} and \texttt{Zolo-SVD} implementations are freely available at https://…/Zolo-SVD.
In recent years machine learning methods that nearly interpolate the data have achieved remarkable success. In many settings achieving near-zero training error leads to excellent test results. In this work we show how the mathematical and conceptual simplicity of interpolation can be harnessed to construct a framework for very efficient, scalable and accurate kernel machines. Our main innovation is in constructing kernel machines that output solutions mathematically equivalent to those obtained using standard kernels, yet are capable of fully utilizing the available computing power of a parallel computational resource, such as a GPU. Such utilization is key to strong performance since much of the computational resource capability is wasted by the standard iterative methods. The computational resource and data adaptivity of our learned kernels is based on theoretical convergence bounds. The resulting algorithm, which we call EigenPro 2.0, is accurate, principled and very fast. For example, using a single GPU, training on ImageNet with $1.3\times 10^6$ data points and $1000$ labels takes under an hour, while smaller datasets, such as MNIST, take seconds. Moreover, as the parameters are chosen analytically, based on the theory, little tuning beyond selecting the kernel and kernel parameter is needed, further facilitating the practical use of these methods.
In this paper, we propose a novel regularization method for Generative Adversarial Networks, which allows the model to learn discriminative yet compact binary representations of image patches (image descriptors). We employ the dimensionality reduction that takes place in the intermediate layers of the discriminator network and train a binarized low-dimensional representation of the penultimate layer to mimic the distribution of the higher-dimensional preceding layers. To achieve this, we introduce two loss terms that aim at: (i) reducing the correlation between the dimensions of the binarized low-dimensional representation of the penultimate layer (i.e., maximizing joint entropy) and (ii) propagating the relations between the dimensions in the high-dimensional space to the low-dimensional space. We evaluate the resulting binary image descriptors on two challenging applications, image matching and retrieval, and achieve state-of-the-art results.
To quickly approximate the maximum likelihood estimator with massive data, Wang et al. (JASA, 2017) proposed an Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for logistic regression. This paper extends the scope of the OSMAC framework to include generalized linear models with canonical link functions. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using a Frobenius norm matrix concentration inequality, finite sample properties of the subsample estimator based on optimal subsampling probabilities are derived. Since the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. Asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
Assume we are given a set of items from a general metric space, but we neither have access to the representation of the data nor to the distances between data points. Instead, suppose that we can actively choose a triplet of items (A,B,C) and ask an oracle whether item A is closer to item B or to item C. In this paper, we propose a novel random forest algorithm for regression and classification that relies only on such triplet comparisons. In the theory part of this paper, we establish sufficient conditions for the consistency of such a forest. In a set of comprehensive experiments, we then demonstrate that the proposed random forest is efficient both for classification and regression. In particular, it is even competitive with other methods that have direct access to the metric representation of the data.
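To make the triplet-comparison setting above concrete, here is a minimal sketch of how a single tree split could be built from oracle answers alone: two pivot items are chosen, and every item goes to the side of the pivot it is reported to be closer to. This splitting rule is an assumption for illustration, since the abstract does not spell out the construction; the oracle, item names, and hidden positions are hypothetical.

```python
def make_oracle(points):
    """Build a hypothetical oracle backed by hidden 1-D positions.

    The learner never reads `points`; it only sees the boolean answers.
    """
    def closer_to_first(a, b, c):
        # True if item a is at least as close to item b as to item c.
        return abs(points[a] - points[b]) <= abs(points[a] - points[c])
    return closer_to_first


def split_by_pivots(items, pivot_b, pivot_c, oracle):
    """Partition items by which of the two pivots each is reported closer to."""
    left = [a for a in items if oracle(a, pivot_b, pivot_c)]
    right = [a for a in items if not oracle(a, pivot_b, pivot_c)]
    return left, right


# Hidden positions exist only to simulate the oracle in this example.
hidden = {"u": 0.0, "v": 1.0, "w": 4.0, "x": 5.0}
oracle = make_oracle(hidden)
left, right = split_by_pivots(list(hidden), "u", "x", oracle)
```

Applying such splits recursively with randomly chosen pivots would give one tree; a forest would aggregate many of them.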
This work focuses on combining nonparametric topic models with Auto-Encoding Variational Bayes (AEVB). Specifically, we first propose iTM-VAE, where the topics are treated as trainable parameters and the document-specific topic proportions are obtained by a stick-breaking construction. The inference of iTM-VAE is modeled by neural networks such that it can be computed in a simple feed-forward manner. We also describe how to introduce a hyper-prior into iTM-VAE so as to model the uncertainty of the prior parameter. Actually, the hyper-prior technique is quite general and we show that it can be applied to other AEVB based models to alleviate the {\it collapse-to-prior} problem elegantly. Moreover, we also propose HiTM-VAE, where the document-specific topic distributions are generated in a hierarchical manner. HiTM-VAE is even more flexible and can generate topic distributions with better variability. Experimental results on 20News and Reuters RCV1-V2 datasets show that the proposed models outperform the state-of-the-art baselines significantly. The advantages of the hyper-prior technique and the hierarchical model construction are also confirmed by experiments.
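The stick-breaking construction mentioned in the abstract above can be sketched in a few lines: each break fraction takes a share of whatever stick remains, so the resulting proportions are non-negative and sum to at most one. The Beta(1, 5) draws below are an assumption for illustration only; the abstract does not state which distribution generates the fractions, and the truncation at 20 topics is likewise hypothetical.

```python
import random


def stick_breaking(fractions):
    """Map break fractions v_k in [0, 1] to weights pi_k = v_k * prod_{j<k} (1 - v_j).

    Returns the weights and the length of the stick left unassigned.
    """
    weights, remaining = [], 1.0
    for v in fractions:
        weights.append(v * remaining)   # take a share of what is left
        remaining *= 1.0 - v            # shrink the remaining stick
    return weights, remaining


rng = random.Random(0)
# Assumed prior for the example: v_k ~ Beta(1, 5), truncated at 20 sticks.
fractions = [rng.betavariate(1.0, 5.0) for _ in range(20)]
proportions, leftover = stick_breaking(fractions)
```

The weights plus the leftover always sum to one, so a truncated construction can either renormalize or assign the leftover mass to the final topic.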
Ordinal Regression (OR) aims to model the ordering information between different data categories, which is a crucial topic in multi-label learning. An important class of approaches to OR models the problem as a linear combination of basis functions that map features to a high dimensional non-linear space. However, most of the basis function-based algorithms are time consuming. We propose an incremental sparse Bayesian approach to OR tasks and introduce an algorithm to sequentially learn the relevant basis functions in the ordinal scenario. Our method, called Incremental Sparse Bayesian Ordinal Regression (ISBOR), automatically optimizes the hyper-parameters via the type-II maximum likelihood method. By exploiting fast marginal likelihood optimization, ISBOR can avoid inverting large matrices, which is the main bottleneck in applying basis function-based algorithms to OR tasks on large-scale datasets. We show that ISBOR can make accurate predictions with parsimonious basis functions while offering automatic estimates of the prediction uncertainty. Extensive experiments on synthetic and real-world datasets demonstrate the efficiency and effectiveness of ISBOR compared to other basis function-based OR approaches.
Neural networks designed for the task of classification have become a commodity in recent years. Many works target the development of better networks, which results in increasingly complex architectures with more layers, multiple sub-networks, or even the combination of multiple classifiers. In this paper, we show how to redesign a simple network to reach excellent performance, better than the results reproduced with CapsNet on several datasets, by replacing a layer with a Hit-or-Miss layer. This layer contains activated vectors, called capsules, that we train to hit or miss a central capsule by tailoring a specific centripetal loss function. We also show how our network, named HitNet, is capable of synthesizing a representative sample of the images of a given class by including a reconstruction network. This makes it possible to develop a data augmentation step combining information from the data space and the feature space, resulting in a hybrid data augmentation process. In addition, we introduce the possibility for HitNet to adopt an alternative to the true target when needed by using the new concept of ghost capsules, which is used here to detect potentially mislabeled images in the training data.
The dominant, state-of-the-art collaborative filtering (CF) methods today mainly comprise neural models. In these models, deep neural networks, e.g., multi-layered perceptrons (MLP), are often used to model nonlinear relationships between user and item representations. As opposed to shallow models (e.g., factorization-based models), deep models generally provide a greater extent of expressiveness, albeit at the expense of impaired/restricted information flow. Consequently, the performance of most neural CF models plateaus at 3-4 layers, with performance stagnating or even degrading when increasing the model depth. As such, the question of how to train really deep networks in the context of CF remains open. To this end, this paper proposes a new technique that enables training neural CF models all the way up to 20 layers and beyond. Our proposed approach utilizes a new hierarchical self-attention mechanism that learns introspective intra-feature similarity across all the hidden layers of a standard MLP model. All in all, our proposed architecture, SA-NCF (Self-Attentive Neural Collaborative Filtering), is a densely connected self-matching model that can be trained up to 24 layers without plateauing, achieving wide performance margins against its competitors. On several popular benchmark datasets, our proposed architecture achieves up to an absolute improvement of 23%-58% and a 1.3x to 2.8x improvement in terms of nDCG@10 and Hit Ratio (HR@10) scores over several strong neural CF baselines.
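For readers unfamiliar with the two metrics reported above, the following minimal sketch computes HR@10 and nDCG@10 for one user under a leave-one-out protocol, in which a single held-out item is ranked among candidate items. That protocol is an assumption made here for illustration, as the abstract does not describe the evaluation setup; the function names and item identifiers are hypothetical.

```python
import math


def hit_ratio_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target appears in the top-k ranking, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """nDCG with a single relevant item: 1 / log2(position + 2), 0-based position."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    position = top_k.index(target)
    return 1.0 / math.log2(position + 2)


# Hypothetical ranking produced by some recommender for one user.
ranked = ["i3", "i7", "i1"]
hr = hit_ratio_at_k(ranked, "i7")
ndcg = ndcg_at_k(ranked, "i7")
```

Averaging both quantities over all test users gives the dataset-level scores; HR only checks presence in the top ten, while nDCG also rewards placing the held-out item nearer the top.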
Among the most popular variable selection procedures in high-dimensional regression, Lasso provides a solution path to rank the variables and determines a cut-off position on the path to select variables and estimate coefficients. In this paper, we consider variable selection from a new perspective motivated by the frequently occurring phenomenon that relevant variables are not completely distinguishable from noise variables on the solution path. We propose to characterize the positions of the first noise variable and the last relevant variable on the path. We then develop a new variable selection procedure to control over-selection of the noise variables ranking after the last relevant variable, and, at the same time, retain a high proportion of relevant variables ranking before the first noise variable. Our procedure utilizes the recently developed covariance test statistic and Q statistic in post-selection inference. In numerical examples, our method compares favorably with other existing methods in selection accuracy and the ability to interpret its results.
To survive in the dynamically-evolving world, we accumulate knowledge and improve our skills based on experience. In the process, gaining new knowledge does not disrupt our vigilance to external stimuli. In other words, our learning process is ‘accumulative’ and ‘online’ without interruption. However, despite the recent success, artificial neural networks (ANNs) must be trained offline, and they suffer catastrophic interference between old and new learning, indicating that ANNs’ conventional learning algorithms may not be suitable for building intelligent agents comparable to our brain. In this study, we propose a novel neural network architecture (DynMat) consisting of dual learning systems, inspired by the complementary learning system (CLS) theory suggesting that the brain relies on short- and long-term learning systems to learn continuously. Our experiments show that 1) DynMat can learn a new class without catastrophic interference and 2) it does not strictly require offline training.
A major challenge in recommender systems is handling new users, who are also called $\textit{cold-start}$ users. In this paper, we propose a novel approach for learning an optimal series of questions with which to interview cold-start users for movie recommender systems. We propose learning interview questions using Deep Q Networks to create user profiles to make better recommendations to cold-start users. While our proposed system is trained using a movie recommender system, our Deep Q Network model should generalize across various types of recommender systems.
Centroid-based methods including k-means and fuzzy c-means (FCM) are known as effective and easy-to-implement approaches to clustering in many areas of application. However, these algorithms cannot be directly applied to supervised tasks. We propose a generative model extending centroid-based clustering approaches to be applicable to classification and regression problems. Given an arbitrary loss function, our approach, termed supervised fuzzy partitioning (SFP), incorporates label information into its objective function through a surrogate term penalizing the risk. We also fuzzify the partition and assign weights to features alongside entropy-based regularization terms, enabling the method to capture more complex data structure, to identify significant features, and to yield better performance on high-dimensional data. An iterative algorithm based on a block coordinate descent (BCD) scheme was formulated to efficiently find a local optimizer. The results show that the SFP performance in classification and supervised dimensionality reduction on synthetic and real-world datasets is competitive with state-of-the-art algorithms such as random forest and SVM. Our method has a major advantage over such methods in that it not only leads to a flexible model but also uses the loss function in the training phase without compromising computational efficiency.
Deep Neural Networks are highly over-parameterized and the size of the neural networks can be reduced significantly after training without any decrease in performance. One can clearly see this phenomenon in a wide range of architectures trained for various problems. Weight/channel pruning, distillation, quantization, and matrix factorization are some of the main methods one can use to remove the redundancy and obtain smaller and faster models. This work starts with a short informative chapter, where we motivate the pruning idea and provide the necessary notation. In the second chapter, we compare various saliency scores in the context of parameter pruning. Using the insights obtained from this comparison and stating the problems it brings, we motivate why pruning units instead of the individual parameters might be a better idea. We propose a set of definitions to quantify and analyze units that neither learn nor create any useful information. We propose an efficient way of detecting dead units and use it to select which units to prune. We get a 5x model size reduction through unit-wise pruning on MNIST.
In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, \emph{i.e.}, estimating the value function of a model-free Markov reward process using the linear function approximation architecture and with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ the multi-timescale stochastic approximation variant of the very popular cross entropy (CE) optimization method, which is a model-based search method to find the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy and stability.
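As background for the cross entropy (CE) method referred to above, here is a minimal sketch of the plain batch version for maximizing a one-dimensional function: sample from a Gaussian, keep the best-scoring fraction, and refit the Gaussian to those elite samples. This is not the multi-timescale stochastic approximation variant the abstract uses; the function name, hyper-parameters, and test objective are all assumptions chosen for illustration.

```python
import random
import statistics


def cross_entropy_maximize(f, mean=0.0, std=5.0, n_samples=200,
                           elite_frac=0.1, n_iters=30, seed=0):
    """Batch CE search over a 1-D Gaussian sampling distribution."""
    rng = random.Random(seed)
    n_elite = max(2, int(n_samples * elite_frac))
    for _ in range(n_iters):
        # Draw candidates from the current sampling distribution.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Keep the candidates with the highest objective values.
        elite = sorted(samples, key=f, reverse=True)[:n_elite]
        # Refit the Gaussian to the elite set; the small constant keeps std > 0.
        mean = statistics.fmean(elite)
        std = statistics.stdev(elite) + 1e-6
    return mean


# Hypothetical objective with its maximum at x = 3.
best = cross_entropy_maximize(lambda x: -(x - 3.0) ** 2)
```

The sampling distribution is the "model" in this model-based search: each iteration concentrates it further on regions where the objective is high.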
Despite the success of neural networks (NNs), there is still a concern among many over their ‘black box’ nature. Why do they work? Here we present a simple analytic argument that NNs are in fact essentially polynomial regression models. This view will have various implications for NNs, e.g. providing an explanation for why convergence problems arise in NNs, and it gives rough guidance on avoiding overfitting. In addition, we use this phenomenon to predict and confirm a multicollinearity property of NNs not previously reported in the literature. Most importantly, given this loose correspondence, one may choose to routinely use polynomial models instead of NNs, thus avoiding some major problems of the latter, such as having to set many tuning parameters and dealing with convergence issues. We present a number of empirical results; in each case, the accuracy of the polynomial approach matches or exceeds that of NN approaches. A many-featured, open-source software package, polyreg, is available.
We introduce Implicit Policy, a general class of expressive policies that can flexibly represent complex action distributions in reinforcement learning, with efficient algorithms to compute entropy regularized policy gradients. We empirically show that, despite its simplicity in implementation, entropy regularization combined with a rich policy class can attain desirable properties displayed under the maximum entropy reinforcement learning framework, such as robustness and multi-modality.