Predictive modelling relies on the assumption that observations used for training are representative of the data that will be encountered in future samples. In a variety of applications, this assumption is severely violated, since observational training data are often collected under sampling processes which are systematically biased with respect to group membership. Without explicit adjustment, machine learning algorithms can produce predictions that generalize poorly, with performance that varies widely by group. We propose a method to pre-process the training data, producing an adjusted dataset that is independent of the group variable with minimum information loss. We develop a conceptually simple approach for creating such a set of features in high dimensional settings based on a constrained form of principal components analysis. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be independent of the group variable. We develop a scalable algorithm for implementing the method, along with theoretical support in the form of independence guarantees and optimality. The method is illustrated on some simulation examples and applied to two real examples: removing machine-specific correlations from brain scan data, and removing race and ethnicity information from a dataset used to predict recidivism.
We systematically investigate issues due to misspecification that arise in estimating causal effects when (treatment) interference is informed by a network available pre-intervention, i.e., in situations where the outcome of a unit may depend on the treatment assigned to other units. We develop theory for several forms of interference through the concept of exposure neighborhood, and develop the corresponding semi-parametric representation for potential outcomes as a function of the exposure neighborhood. Using this representation, we extend the definition of two popular classes of causal estimands, marginal and average causal effects, to the case of network interference. We characterize the bias and variance one incurs when combining classical randomization strategies (namely, Bernoulli, Completely Randomized, and Cluster Randomized designs) and estimators (namely, difference-in-means and Horvitz-Thompson) to estimate the average treatment effect and the total treatment effect, under misspecification due to interference. We illustrate how difference-in-means estimators can have arbitrarily large bias when estimating average causal effects, depending on the form and strength of interference, which is unknown at the design stage. Horvitz-Thompson (HT) estimators are unbiased when the correct weights are specified. Here, we derive the HT weights for unbiased estimation of different estimands, and illustrate how they depend on the design, the form of interference, which is unknown at the design stage, and the estimand. More importantly, we show that HT estimators are inadmissible for a large class of randomization strategies, in the presence of interference. We develop new model-assisted and model-dependent strategies to improve HT estimators, and we develop new randomization strategies for estimating the average treatment effect and total treatment effect.
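To make the two estimators named above concrete, the following minimal Python sketch computes difference-in-means and Horvitz-Thompson estimates of the average treatment effect under a Bernoulli design. It assumes no interference, so each unit's weight is simply the inverse of its own assignment probability; the interference-aware weights derived in the work are not reproduced here. The function names and the toy data are illustrative assumptions.

```python
def difference_in_means(y, z):
    """Mean observed outcome of treated units minus that of control units."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def horvitz_thompson(y, z, p):
    """HT estimate of the average treatment effect under a Bernoulli(p) design.

    Assumption: no interference, so unit i is weighted by 1/p if treated
    and by 1/(1 - p) if in control.
    """
    n = len(y)
    total = sum(yi * zi / p - yi * (1 - zi) / (1 - p) for yi, zi in zip(y, z))
    return total / n


# Toy data (hypothetical): observed outcomes and treatment indicators.
y = [3, 1, 4, 2]
z = [1, 0, 1, 0]
print(difference_in_means(y, z))    # treated mean minus control mean
print(horvitz_thompson(y, z, 0.5))  # inverse-probability-weighted estimate
```

With a balanced assignment and p = 0.5 the two estimates coincide; they generally differ when the realized group sizes depart from their expected values.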
Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts. Here, we introduce the BabyAI research platform to support investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. The levels gradually lead the agent towards acquiring a combinatorially rich synthetic language which is a proper subset of English. The platform also provides a heuristic expert agent for the purpose of simulating a human teacher. We report baseline results and estimate the amount of human involvement that would be required to train a neural network-based agent on some of the BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.
Owing to the surge in data storage techniques, there is growing interest in many scientific areas in techniques for identifying patterns in, and extracting knowledge from, the resulting enormous data sets, which can be viewed as collections of dependent functional data. We develop a similarity measure for spectral density operators of a collection of functional time series, which is based on the aggregation of Hilbert-Schmidt differences of the individual time-varying spectral density operators. Under fairly general conditions, the asymptotic properties of the corresponding estimator are derived and asymptotic normality is established. The introduced statistic lends itself naturally to quantifying (dis)-similarity between functional time series, which we subsequently exploit in order to build a spectral clustering algorithm. Our algorithm is the first of its kind in the analysis of non-stationary (functional) time series and makes it possible to discover particular patterns by grouping together 'similar' series into clusters, thereby reducing the complexity of the analysis considerably. The algorithm is simple to implement and computationally feasible. As a further application we provide a simple test for the hypothesis that the second order properties of two non-stationary functional time series coincide.
In this work, we present a new model-free and off-policy reinforcement learning (RL) algorithm that is capable of finding a near-optimal policy with state-action observations from arbitrary behavior policies. Our algorithm, called the stochastic primal-dual Q-learning (SPD Q-learning), hinges upon a new linear programming formulation and a dual perspective of the standard Q-learning. In contrast to previous primal-dual RL algorithms, the SPD Q-learning includes a Q-function estimation step, thus making it possible to recover an approximate policy from the primal solution as well as the dual solution. We prove a first-of-its-kind result that the SPD Q-learning guarantees a certain convergence rate, even when the state-action distribution is time-varying but sub-linearly converges to a stationary distribution. Numerical experiments are provided to demonstrate the off-policy learning abilities of the proposed algorithm in comparison to the standard Q-learning.
Computer algorithms are written with the intent that when run they perform a useful function. Typically any information obtained is unknown until the algorithm is run. However, if the behavior of an algorithm can be fully described by precomputing just once how this algorithm will respond when executed on any input, this precomputed result provides a complete specification for all solutions in the problem domain. We apply this idea to a previous anomaly detection algorithm, and in doing so transform it from one that merely detects individual anomalies when asked to discover potentially anomalous values, into an algorithm also capable of generating a complete specification for those values it would deem to be anomalous. This specification is derived by examining no more than a small training data set, can be obtained in very small constant time, and is inherently far more useful than results obtained by repeated execution of this tool. For example, armed with such a specification one can ask how close an anomaly is to being deemed normal, and can validate this answer not by exhaustively testing the algorithm but by examining whether the specification so generated is indeed correct. This powerful idea can be applied to any algorithm whose runtime behavior can be recovered from its construction and so has wide applicability.
Principal component analysis (PCA) and singular value decomposition (SVD) are widely used in statistics, machine learning, and applied mathematics. They have been well studied in the case of homoskedastic noise, where the noise levels of the contamination are homogeneous. In this paper, we consider PCA and SVD in the presence of heteroskedastic noise, which arises naturally in a range of applications. We introduce a general framework for heteroskedastic PCA and propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries to remove the bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis on the singular subspace, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of applications, including heteroskedastic low-rank matrix denoising, Poisson PCA, and SVD based on heteroskedastic and incomplete data.
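The diagonal-imputation idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the input is a symmetric Gram or sample covariance matrix whose diagonal is inflated by unequal noise variances, and it uses a fixed iteration count. The function name `hetero_pca`, the parameter `n_iter`, and the toy matrix are hypothetical choices made for this example.

```python
import numpy as np


def hetero_pca(gram, rank, n_iter=50):
    """Estimate a rank-`rank` principal subspace by diagonal imputation.

    The noise-contaminated diagonal is discarded, then repeatedly
    re-imputed from the best rank-`rank` approximation of the current
    matrix; off-diagonal entries are never modified.
    """
    n_mat = np.array(gram, dtype=float)
    np.fill_diagonal(n_mat, 0.0)  # drop the biased diagonal
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(n_mat)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        np.fill_diagonal(n_mat, np.diag(low_rank))  # impute diagonal only
    u, _, _ = np.linalg.svd(n_mat)
    return u[:, :rank]


# Toy example (hypothetical): rank-one signal plus very unequal noise variances.
u_true = np.ones(5) / np.sqrt(5)
gram = np.ones((5, 5)) + np.diag([0.0, 10.0, 0.0, 0.0, 20.0])

u_hat = hetero_pca(gram, rank=1)
u_plain = np.linalg.eigh(gram)[1][:, -1]  # ordinary leading eigenvector

print(abs(u_hat[:, 0] @ u_true))  # alignment of the imputed estimate
print(abs(u_plain @ u_true))      # alignment of plain PCA on the same matrix
```

In this toy setting the ordinary leading eigenvector is pulled towards the coordinate with the largest noise variance, whereas the imputed estimate depends only on the off-diagonal entries.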
We propose sequenced-replacement sampling (SRS) for training deep neural networks. The basic idea is to assign a fixed sequence index to each sample in the dataset. Once a mini-batch is randomly drawn in each training iteration, we refill the original dataset by successively adding samples according to their sequence index. Thus we carry out replacement sampling but in a batched and sequenced way. In a sense, SRS could be viewed as a way of performing ‘mini-batch augmentation’. It is particularly useful for tasks with relatively few images per class, such as CIFAR-100. Together with a longer period of initial large learning rate, it significantly improves the classification accuracy on CIFAR-100 over the current state-of-the-art results. Our experiments indicate that training deeper networks with SRS is less prone to over-fitting. In the best case, we achieve an error rate as low as 10.10%.
Learned data models based on sparsity are widely used in signal processing and imaging applications. A variety of methods for learning synthesis dictionaries, sparsifying transforms, etc., have been proposed in recent years, often imposing useful structures or properties on the models. In this work, we focus on sparsifying transform learning, which enjoys a number of advantages. We consider multi-layer or nested extensions of the transform model, and propose efficient learning algorithms. Numerical experiments with image data illustrate the behavior of the multi-layer transform learning algorithm and its usefulness for image denoising. Multi-layer models provide better denoising quality than single layer schemes.
We extend the data augmentation technique (PANDA) by Li et al. (2018) for regularizing single graph model estimations to jointly learning the structures of multiple graphs. Our proposed approach provides a unified framework to effectively jointly train multiple graphical models, regardless of the types of nodes. We design and introduce two types of noise to augment the observed data. The first type of noise regularizes the estimation of each graph, while the second type promotes either structural similarities, referred to as the joint group lasso (JGL) regularization, or numerical similarities, referred to as the joint fused ridge (JFR) regularization, among the edges in the same position across multiple graphs. The computation in PANDA is straightforward and only involves obtaining the maximum likelihood estimator in generalized linear models (GLMs) in an iterative manner. We also extend the JGL and JFR regularization beyond the graphical model settings to variable selection and estimation in GLMs. The multiple graph version of PANDA enjoys the theoretical properties established for single graphs, including the almost sure (a.s.) convergence of the noise-augmented loss function to its expectation and the a.s. convergence of the minimizer of the former to the minimizer of the latter. The simulation studies suggest PANDA is non-inferior to existing joint estimation approaches in constructing multiple Gaussian graphical models (GGMs), and significantly improves over the differencing approach over separately estimated graphs in multiple Poisson graphical models (PGMs). We also applied PANDA to a real-life lung cancer microarray dataset to simultaneously construct four protein networks.
We introduce the concept of time series motifs for time series analysis. Time series motifs consider not only the spatial information of mutual visibility but also the temporal information of relative magnitude between the data points. We study the profiles of the six triadic time series motifs. The six motif occurrence frequencies are derived for uncorrelated time series, which are approximately linear functions of the length of the time series. The corresponding motif profile thus converges to a constant vector $(0.2,0.2,0.1,0.2,0.1,0.2)$. These analytical results have been verified by numerical simulations. For fractional Gaussian noises, numerical simulations unveil the nonlinear dependence of motif occurrence frequencies on the Hurst exponent. Applications of the time series motif analysis show that the motif occurrence frequency distributions are able to capture the different dynamics in the heartbeat rates of healthy subjects, congestive heart failure (CHF) subjects, and atrial fibrillation (AF) subjects and in the price fluctuations of bullish and bearish markets. Our method shows its potential power to classify different types of time series and test the time irreversibility of time series.
The increasing presence of robots in industries has not gone unnoticed. Large industrial players have incorporated them into their production lines, but smaller companies hesitate due to high initial costs and the lack of programming expertise. In this work we introduce a framework that combines two disciplines, Programming by Demonstration and Automated Planning, to allow users without any programming knowledge to program a robot. The user teaches the robot atomic actions together with their semantic meaning and represents them in terms of preconditions and effects. Using these atomic actions the robot can generate action sequences autonomously to reach any goal given by the user. We evaluated the usability of our framework in terms of user experiments with a Baxter Research Robot and showed that it is well-adapted to users without any programming experience.
Probabilistic matrix factorization (PMF) plays a crucial role in recommendation systems. It requires a large amount of user data (such as user shopping records and movie ratings) to predict personal preferences, and thereby provides users with high-quality recommendation services, which exposes them to the risk of privacy leakage. Differential privacy, as a provable privacy protection framework, has been applied widely to recommendation systems. It is common that different individuals have different levels of privacy requirements on items. However, traditional differential privacy can only provide a uniform level of privacy protection for all users. In this paper, we propose a probabilistic matrix factorization recommendation scheme with personalized differential privacy (PDP-PMF). It aims to meet users’ privacy requirements specified at the item level instead of giving the same level of privacy guarantees for all. We then develop a modified sampling mechanism (with bounded differential privacy) for achieving PDP. We also perform a theoretical analysis of the PDP-PMF scheme and demonstrate the privacy of the PDP-PMF scheme. In addition, we implement the probabilistic matrix factorization schemes both with traditional and with personalized differential privacy (DP-PMF, PDP-PMF) and compare them through a series of experiments. The results show that the PDP-PMF scheme performs well in protecting the privacy of each user and its recommendation quality is much better than that of the DP-PMF scheme.
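As a rough illustration of how a sampling mechanism can turn per-item privacy budgets into a personalized guarantee, the sketch below implements the basic sampling step from the personalized differential privacy literature: a record with budget ε below a global threshold t is kept with probability (e^ε − 1)/(e^t − 1), and records with ε ≥ t are always kept. This is an assumption about the general technique; the modified mechanism developed in the work may differ. All names (`inclusion_probability`, `sample_ratings`, the toy ratings and budgets) are hypothetical.

```python
import math
import random


def inclusion_probability(eps, threshold):
    """Probability of keeping a record with privacy budget `eps`."""
    if eps >= threshold:
        return 1.0
    return (math.exp(eps) - 1.0) / (math.exp(threshold) - 1.0)


def sample_ratings(ratings, budgets, threshold, rng):
    """Keep each (user, item) rating independently according to its budget.

    `ratings` and `budgets` are dicts keyed by (user, item). A mechanism
    satisfying `threshold`-differential privacy would then be run on the
    returned subsample (that step is not shown here).
    """
    return {
        key: value
        for key, value in ratings.items()
        if rng.random() < inclusion_probability(budgets[key], threshold)
    }


# Toy data (hypothetical): three ratings with different item-level budgets.
ratings = {("u1", "i1"): 4.0, ("u1", "i2"): 2.0, ("u2", "i1"): 5.0}
budgets = {("u1", "i1"): 0.1, ("u1", "i2"): 1.0, ("u2", "i1"): 0.5}
kept = sample_ratings(ratings, budgets, threshold=1.0, rng=random.Random(0))
print(sorted(kept))
```

Records with stricter (smaller) budgets are less likely to enter the subsample, which is what allows a single downstream mechanism to deliver different protection levels.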
Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data. Keywords: Distance-based clustering; Mixture model; Model-based clustering; Model misspecification; Pairwise distance matrix; Partial likelihood; Robustness.
Graph-based techniques have emerged as a choice to deal with the dimensionality issues in modeling multivariate time series. However, there is as yet no complete understanding of how the underlying structure could be exploited to ease this task. This work provides contributions in this direction by considering the forecasting of a process evolving over a graph. We make use of the (approximate) time-vertex stationarity assumption, i.e., time-varying graph signals whose first and second order statistical moments are invariant over time and correlated to a known graph topology. The latter is combined with VAR and VARMA models to tackle the dimensionality issues present in predicting the temporal evolution of multivariate time series. We find that by projecting the data to the graph spectral domain: (i) the multivariate model estimation reduces to that of fitting a number of uncorrelated univariate ARMA models and (ii) an optimal low-rank data representation can be exploited so as to further reduce the estimation costs. In the case that the multivariate process can be observed at a subset of nodes, the proposed models extend naturally to Kalman filtering on graphs allowing for optimal tracking. Numerical experiments with both synthetic and real data validate the proposed approach and highlight its benefits over state-of-the-art alternatives.
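The decoupling in point (i) can be illustrated with a small NumPy sketch. It assumes a first-order vector autoregression whose coefficient matrix is a polynomial of the graph Laplacian, so that the matrix is diagonalized by the Laplacian eigenvectors; projecting onto those eigenvectors then leaves one scalar AR(1) recursion per graph frequency. The path graph, the filter 0.9 − 0.2λ, the function names, and the least-squares AR(1) fit are all illustrative assumptions, not the models used in the work.

```python
import numpy as np


def path_laplacian(n):
    """Combinatorial Laplacian of an undirected path graph on n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj


def simulate_graph_var1(eigvecs, coeffs, n_steps, rng):
    """Simulate x_t = A x_{t-1} + e_t with A = U diag(coeffs) U^T."""
    a_mat = (eigvecs * coeffs) @ eigvecs.T
    x = np.zeros((n_steps, eigvecs.shape[0]))
    for t in range(1, n_steps):
        x[t] = a_mat @ x[t - 1] + rng.standard_normal(eigvecs.shape[0])
    return x


def fit_ar1_per_frequency(x, eigvecs):
    """Fit one scalar AR(1) coefficient per graph frequency by least squares."""
    x_hat = x @ eigvecs  # row t holds the graph Fourier transform U^T x_t
    prev, curr = x_hat[:-1], x_hat[1:]
    return (prev * curr).sum(axis=0) / (prev ** 2).sum(axis=0)


rng = np.random.default_rng(0)
eigvals, eigvecs = np.linalg.eigh(path_laplacian(6))
true_coeffs = 0.9 - 0.2 * eigvals  # hypothetical graph filter, stable here
x = simulate_graph_var1(eigvecs, true_coeffs, n_steps=5000, rng=rng)
print(np.round(fit_ar1_per_frequency(x, eigvecs), 2))
print(np.round(true_coeffs, 2))
```

Instead of estimating a full N × N coefficient matrix, only N scalar coefficients are fitted, which is the source of the reduction in estimation cost under these assumptions.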