AnalytiXon

~ Broaden your Horizon

Category Archives: Causality

Causal Inference, Counterfactuals, Confounding, Structural Causal Models, etc.

Finding out why

13 Friday Dec 2019

Posted by Michael Laux in Causality

Paper: Dis-entangling Mixture of Interventions on a Causal Bayesian Network Using Aggregate Observations

We study the problem of separating a mixture of distributions, all of which come from interventions on a known causal Bayesian network. Given oracle access to marginals of all distributions resulting from interventions on the network, and estimates of marginals from the mixture distribution, we want to recover the mixing proportions of different mixture components. We show that in the worst case, mixing proportions cannot be identified using marginals only. If exact marginals of the mixture distribution were known, under a simple assumption of excluding a few distributions from the mixture, we show that the mixing proportions become identifiable. Our identifiability proof is constructive and gives an efficient algorithm recovering the mixing proportions exactly. When exact marginals are not available, we design an optimization framework to estimate the mixing proportions. Our problem is motivated by a real-world scenario of an e-commerce business, where multiple interventions occur at a given time, leading to deviations in expected metrics. We conduct experiments on the well-known publicly available ALARM network and on a proprietary dataset from a large e-commerce company validating the performance of our method.


Paper: Efficient adjustment sets for population average treatment effect estimation in non-parametric causal graphical models

The method of covariate adjustment is often used for estimation of population average treatment effects in observational studies. Graphical rules for determining all valid covariate adjustment sets from an assumed causal graphical model are well known. Restricting attention to causal linear models, a recent article derived two novel graphical criteria: one to compare the asymptotic variance of linear regression treatment effect estimators that control for certain distinct adjustment sets and another to identify the optimal adjustment set that yields the least squares treatment effect estimator with the smallest asymptotic variance among consistent adjusted least squares estimators. In this paper we show that the same graphical criteria can be used in non-parametric causal graphical models when treatment effects are estimated by contrasts involving non-parametrically adjusted estimators of the interventional means. We also provide a graphical criterion for determining the optimal adjustment set among the minimal adjustment sets, which is valid for both linear and non-parametric estimators. We provide a new graphical criterion for comparing time dependent adjustment sets, that is, sets composed of covariates that adjust for future treatments and that are themselves affected by earlier treatments. We show by example that uniformly optimal time dependent adjustment sets do not always exist. In addition, for point interventions, we provide a sound and complete graphical criterion for determining when a non-parametric optimally adjusted estimator of an interventional mean, or of a contrast of interventional means, is as efficient as an efficient estimator of the same parameter that exploits the information in the conditional independencies encoded in the non-parametric causal graphical model.
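
As a concrete reference point for what covariate adjustment computes, here is a minimal standardization (g-formula) sketch in Python; the data-generating process, variable names, and effect sizes are invented for illustration, not taken from the paper:

    import numpy as np
    import pandas as pd

    # Hypothetical observational data: confounder Z, binary treatment T, outcome Y.
    rng = np.random.default_rng(0)
    n = 10_000
    Z = rng.binomial(1, 0.4, n)                     # confounder
    T = rng.binomial(1, 0.2 + 0.5 * Z)              # treatment depends on Z
    Y = 1.0 * T + 2.0 * Z + rng.normal(size=n)      # outcome depends on T and Z
    df = pd.DataFrame({"Z": Z, "T": T, "Y": Y})

    # Standardization over the adjustment set {Z}:
    # E[Y | do(T=t)] = sum_z P(Z=z) * E[Y | T=t, Z=z]
    p_z = df["Z"].value_counts(normalize=True)
    means = df.groupby(["T", "Z"])["Y"].mean()
    ate = sum(p_z[z] * (means[1, z] - means[0, z]) for z in p_z.index)
    print(f"adjusted ATE estimate: {ate:.3f}")      # close to the true effect of 1.0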


Paper: Algorithms of Data Development For Deep Learning and Feedback Design

Recent research reveals that deep learning is an effective way of solving high dimensional Hamilton-Jacobi-Bellman equations. The resulting feedback control law in the form of a neural network is computationally efficient for real-time applications of optimal control. A critical part of this design method is to generate data for training the neural network and validating its accuracy. In this paper, we provide a survey of existing algorithms that can be used to generate data. All the algorithms surveyed in this paper are causality-free, i.e., the solution at a point is computed without using the value of the function at any other points. At the end of the paper, an illustrative example of optimal feedback design using deep learning is given.


Article: Features correlations: data leakage, confounded features and other things that can make your Deep Learning model fail

As the plot the boss is showing seems to suggest, the more employees shave their heads, the more the company's sales increase. If you were that boss, would you impose the same measure on your employees? Probably not. In fact, you recognize that there is no causality between the two sets of events; their behaviour is similar just by chance. More plainly: the shaved heads do not cause the sales. So we have just spotted at least two possible categories of correlations: without and with causality. We also agreed that only the second one is interesting, while the other is useless, when not outright misleading. But let's dive deeper.
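
A minimal sketch of how easily such correlations without causality arise: two independent random walks routinely show sizeable correlation through shared trend alone (all numbers synthetic):

    import numpy as np

    rng = np.random.default_rng(42)
    # Two independent random walks, e.g. 'shaved heads' and 'sales' over time.
    heads = np.cumsum(rng.normal(size=500))
    sales = np.cumsum(rng.normal(size=500))

    r = np.corrcoef(heads, sales)[0, 1]
    print(f"correlation between unrelated series: {r:.2f}")  # often far from 0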


Article: How A.I. Will Help Address the Causation Problem in Economics

For decades, economists have done their research based on data sets only as large as their research assistants could handle, severely limiting the scope and precision of their work. However, AI will enable economists to dramatically enlarge these data sets and analyze them at unprecedented speeds. As both computing power and the ability to collect data about the economy increase, AI algorithms will become better and better at detecting patterns and trends in the economy. The issue of causality is one economists have been grappling with for aeons. As you have probably heard a thousand times, correlation does not imply causation.


Paper: Simpson’s Paradox and the implications for medical trials

This paper describes Simpson's paradox and explains its serious implications for randomised controlled trials. In particular, we show that for any number of variables we can simulate the result of a controlled trial which uniformly points to one conclusion (such as 'drug is effective') for every possible combination of the variable states, but when a previously unobserved confounding variable is included, every possible combination of the variable states points to the opposite conclusion ('drug is not effective'). In other words, no matter how many variables are considered, and no matter how 'conclusive' the result, one cannot conclude the result is truly 'valid', since there is theoretically an unobserved confounding variable that could completely reverse the result.
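
A compact numerical illustration of the reversal, using the counts from the classic kidney-stone example:

    # Hypothetical trial counts: (recovered, total) per (treatment, severity) stratum.
    # Within every stratum the drug looks better ...
    drug    = {"mild": (81, 87),  "severe": (192, 263)}
    control = {"mild": (234, 270), "severe": (55, 80)}

    for s in ("mild", "severe"):
        print(s, drug[s][0] / drug[s][1], ">", control[s][0] / control[s][1])

    # ... yet pooled over strata the comparison reverses, because severity
    # (the confounder) is distributed very differently across the two arms.
    pool = lambda d: sum(r for r, _ in d.values()) / sum(n for _, n in d.values())
    print("pooled:", pool(drug), "<", pool(control))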


Paper: Confounding Adjustment Methods for Multi-level Treatment Comparisons Under Lack of Positivity and Unknown Model Specification

Imbalances in covariates between treatment groups are frequent in observational studies and can lead to biased comparisons. Various adjustment methods can be employed to correct these biases in the context of multi-level treatments (>2). However, analytical challenges, such as positivity violations and incorrect model specification, may affect their ability to yield unbiased estimates. Adjustment methods that present the best potential to deal with those challenges were identified: the overlap weights, augmented overlap weights, bias-corrected matching and targeted maximum likelihood. A simple variance estimator for the overlap weight estimators that can naturally be combined with machine learning algorithms is proposed. In a simulation study, we investigated the empirical performance of these methods as well as those of simpler alternatives, standardization, inverse probability weighting and matching. Our proposed variance estimator performed well, even at a sample size of 500. Adjustment methods that included an outcome modeling component performed better than those that only modeled the treatment mechanism. Additionally, a machine learning implementation was observed to efficiently compensate for the unknown model specification for the former methods, but not the latter. Based on these results, the wildfire data were analyzed using the augmented overlap weight estimator. With respect to effectiveness of alternate fire-suppression interventions, the results were counter-intuitive, indeed the opposite of what would be expected on subject-matter grounds. This suggests the presence in the data of unmeasured confounding bias.


Paper: Counterfactual Explanation Algorithms for Behavioral and Textual Data

We study the interpretability of predictive systems that use high-dimensional behavioral and textual data. Examples include predicting product interest based on online browsing data and detecting spam emails or objectionable web content. Recently, counterfactual explanations have been proposed for generating insight into model predictions, which focus on what is relevant to a particular instance. Conducting a complete search to compute counterfactuals is very time-consuming because of the huge dimensionality. To our knowledge, for behavioral and text data, only one model-agnostic heuristic algorithm (SEDC) for finding counterfactual explanations has been proposed in the literature. However, there may be better algorithms for finding counterfactuals quickly. This study aligns the recently proposed Linear Interpretable Model-agnostic Explainer (LIME) and Shapley Additive Explanations (SHAP) with the notion of counterfactual explanations, and empirically benchmarks their effectiveness and efficiency against SEDC using a collection of 13 data sets. Results show that LIME-Counterfactual (LIME-C) and SHAP-Counterfactual (SHAP-C) have low and stable computation times, but mostly, they are less efficient than SEDC. However, for certain instances on certain data sets, SEDC's run time is comparably large. With regard to effectiveness, LIME-C and SHAP-C find reasonable, if not always optimal, counterfactual explanations. SHAP-C, however, seems to have difficulties with highly unbalanced data. Because of its good overall performance, LIME-C seems to be a favorable alternative to SEDC, which failed for some nonlinear models to find counterfactuals because of the particular heuristic search algorithm it uses. A main upshot of this paper is that there is a good deal of room for further research. For example, we propose algorithmic adjustments that are direct upshots of the paper's findings.

Finding out why

08 Sunday Dec 2019

Posted by Michael Laux in Causality

Paper: Confounding and Regression Adjustment in Difference-in-Differences

Difference-in-differences (diff-in-diff) is a study design that compares outcomes of two groups (treated and comparison) at two time points (pre- and post-treatment) and is widely used in evaluating new policy implementations. For instance, diff-in-diff has been used to estimate the effect that increasing minimum wage has on employment rates and to assess the Affordable Care Act’s effect on health outcomes. Although diff-in-diff appears simple, potential pitfalls lurk. In this paper, we discuss one such complication: time-varying confounding. We provide rigorous definitions for confounders in diff-in-diff studies and explore regression strategies to adjust for confounding. In simulations, we show how and when regression adjustment can ameliorate confounding for both time-invariant and time-varying covariates. We compare our regression approach to those models commonly fit in applied literature, which often fail to address the time-varying nature of confounding in diff-in-diff.
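
The basic two-group, two-period design described above reduces to the coefficient on a group-by-period interaction; a minimal sketch with synthetic data (statsmodels assumed available):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 2_000
    g = rng.binomial(1, 0.5, n)            # 1 = treated group
    t = rng.binomial(1, 0.5, n)            # 1 = post-treatment period
    # Outcome: group effect + common time trend + true treatment effect of 2.0.
    y = 1.0 * g + 0.5 * t + 2.0 * g * t + rng.normal(size=n)
    df = pd.DataFrame({"y": y, "g": g, "t": t})

    # The diff-in-diff estimate is the coefficient on the interaction g:t.
    fit = smf.ols("y ~ g + t + g:t", data=df).fit()
    print(fit.params["g:t"])               # close to 2.0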


Paper: Actionable Interpretability through Optimizable Counterfactual Explanations for Tree Ensembles

Counterfactual explanations help users understand why machine learned models make certain decisions, and more specifically, how these decisions can be changed. In this work, we frame the problem of finding counterfactual explanations — the minimal perturbation to an input such that the prediction changes — as an optimization task. Previously, optimization techniques for generating counterfactual examples could only be applied to differentiable models, or alternatively via query access to the model by estimating gradients from randomly sampled perturbations. In order to accommodate non-differentiable models such as tree ensembles, we propose using probabilistic model approximations in the optimization framework. We introduce a novel approximation technique that is effective for finding counterfactual explanations while also closely approximating the original model. Our results show that our method is able to produce counterfactual examples that are closer to the original instance in terms of Euclidean, Cosine, and Manhattan distance compared to other methods specifically designed for tree ensembles.


Article: An introduction to Causal inference

Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewall Wright, Judea Pearl, and others. We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the 'ladder of causal inference', from association (seeing) to intervention (doing) to counterfactuals (imagining). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the do-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let's dive in!


Python Library: causalnex
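
A minimal structure-learning sketch, assuming the NOTEARS-based from_pandas entry point described in the causalnex documentation (data here is synthetic, and the API should be checked against the installed version):

    import numpy as np
    import pandas as pd
    from causalnex.structure.notears import from_pandas

    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    y = 2.0 * x + rng.normal(size=500)       # y is caused by x
    df = pd.DataFrame({"x": x, "y": y})

    sm = from_pandas(df)                     # learn a structure model via NOTEARS
    sm.remove_edges_below_threshold(0.1)     # prune weak edges
    print(sm.edges)                          # expect an x -> y edge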


Paper: Causal inference of hazard ratio based on propensity score matching

Propensity score matching is commonly used to draw causal inference from observational survival data. However, there is no gold standard approach to analyze survival data after propensity score matching, and variance estimation after matching is open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching.


Paper: Improving Model Robustness Using Causal Knowledge

For decades, researchers in fields such as the natural and social sciences have been verifying causal relationships and investigating hypotheses that are now well-established or understood as truth. These causal mechanisms are properties of the natural world, and thus are invariant conditions regardless of the collection domain or environment. We show in this paper how prior knowledge in the form of a causal graph can be utilized to guide model selection, i.e., to identify from a set of trained networks the models that are the most robust and invariant to unseen domains. Our method incorporates prior knowledge (which can be incomplete) as a Structural Causal Model (SCM) and calculates a score based on the likelihood of the SCM given the target predictions of a candidate model and the provided input variables. We show on both publicly available and synthetic datasets that our method is able to identify more robust models in terms of generalizability to unseen out-of-distribution test examples and domains where covariates have shifted.


Paper: A review and evaluation of standard methods to handle missing data on time-varying confounders in marginal structural models

Marginal structural models (MSMs) are commonly used to estimate causal intervention effects in longitudinal non-randomised studies. A common issue when analysing data from observational studies is the presence of incomplete confounder data, which might lead to bias in the intervention effect estimates if they are not handled properly in the statistical analysis. However, there is currently no recommendation on how to address missing data on covariates in MSMs under a variety of missingness mechanisms encountered in practice. We reviewed existing methods for handling missing data in MSMs and performed a simulation study to compare the performance of complete case (CC) analysis, last observation carried forward (LOCF), the missingness pattern approach (MPA), multiple imputation (MI) and inverse-probability-of-missingness weighting (IPMW). We considered three mechanisms for non-monotone missing data which are common in observational studies using electronic health record data. Whereas CC analysis led to biased estimates of the intervention effect in almost all scenarios, the performance of the other approaches varied across scenarios. The LOCF approach led to unbiased estimates only under a specific non-random mechanism in which confounder values were missing when their values remained unchanged since the previous measurement. In this scenario, MI, the MPA and IPMW were biased. MI and IPMW led to the estimation of unbiased effects when data were missing at random, given the covariates or the treatment, but only MI was unbiased when the outcome was a predictor of missingness. Furthermore, IPMW generally led to very large standard errors. Lastly, regardless of the missingness mechanism, the MPA led to unbiased estimates only when the failure to record a confounder at a given time-point modified the subsequent relationships between the partially observed covariate and the outcome.
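
For readers unfamiliar with LOCF, it is simply a within-subject forward fill of the confounder history; a pandas sketch on hypothetical data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "id":   [1, 1, 1, 2, 2, 2],
        "time": [0, 1, 2, 0, 1, 2],
        "bmi":  [27.0, np.nan, np.nan, 31.0, 30.2, np.nan],  # confounder with gaps
    })

    # Last observation carried forward, within subject.
    df["bmi_locf"] = df.groupby("id")["bmi"].ffill()
    print(df)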


Paper: Entropy, mutual information, and systematic measures of structured spiking neural networks

The aim of this paper is to investigate various information-theoretic measures, including entropy, mutual information, and some systematic measures based on mutual information, for a class of structured spiking neuronal networks. In order to analyze and compute these information-theoretic measures for large networks, we coarse-grained the data by ignoring the order of spikes that fall into the same small time bin. The resultant coarse-grained entropy mainly captures the information contained in the rhythm produced by a local population of the network. We first proved that these information-theoretic measures are well-defined and computable by proving stochastic stability and the law of large numbers. Then we use three neuronal network examples, from simple to complex, to investigate these information-theoretic measures. Several analytical and computational results about the properties of these information-theoretic measures are given.

Finding out why

30 Saturday Nov 2019

Posted by Michael Laux in Causality

Paper: Algorithmic Bias in Recidivism Prediction: A Causal Perspective

ProPublica's analysis of recidivism predictions produced by the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software tool has shown that the predictions were racially biased against African American defendants. We analyze the COMPAS data using a causal reformulation of the underlying algorithmic fairness problem. Specifically, we assess whether COMPAS exhibits racial bias against African American defendants using FACT, a recently introduced causality-grounded measure of algorithmic fairness. We use the Neyman-Rubin potential outcomes framework for causal inference from observational data to estimate FACT from COMPAS data. Our analysis offers strong evidence that COMPAS exhibits racial bias against African American defendants. We further show that the FACT estimates from COMPAS data are robust in the presence of unmeasured confounding.


Paper: A Causal Inference Method for Reducing Gender Bias in Word Embedding Relations

Word embedding has become essential for natural language processing as it boosts empirical performances of various tasks. However, recent research discovers that gender bias is incorporated in neural word embeddings, and downstream tasks that rely on these biased word vectors also produce gender-biased results. While some word-embedding gender-debiasing methods have been developed, these methods mainly focus on reducing gender bias associated with gender direction and fail to reduce the gender bias presented in word embedding relations. In this paper, we design a causal and simple approach for mitigating gender bias in word vector relation by utilizing the statistical dependency between gender-definition word embeddings and gender-biased word embeddings. Our method attains state-of-the-art results on gender-debiasing tasks, lexical- and sentence-level evaluation tasks, and downstream coreference resolution tasks.


Paper: Rethinking Softmax with Cross-Entropy: Neural Network Classifier as Mutual Information Estimator

Mutual information is widely applied to learn latent representations of observations, whilst its implication in classification neural networks remains to be better explained. In this paper, we show that optimising the parameters of classification neural networks with softmax cross-entropy is equivalent to maximising the mutual information between inputs and labels under the balanced data assumption. Through experiments on synthetic and real datasets, we show that softmax cross-entropy can estimate mutual information approximately. When applied to image classification, this relation helps approximate the point-wise mutual information between an input image and a label without modifying the network structure. To this end, we propose infoCAM, an informative class activation map, which highlights regions of the input image that are the most relevant to a given label based on differences in information. The activation map helps localise the target object in an image. Through experiments on the semi-supervised object localisation task with two real-world datasets, we evaluate the effectiveness of the information-theoretic approach.
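
The identity behind this equivalence is compact enough to state here. Softmax cross-entropy upper-bounds the conditional entropy, $\mathbb{E}[-\log q_\theta(y \mid x)] = H(Y \mid X) + \mathbb{E}_x[\mathrm{KL}(p(\cdot \mid x)\,\|\,q_\theta(\cdot \mid x))] \ge H(Y \mid X)$, and since $I(X;Y) = H(Y) - H(Y \mid X)$ with $H(Y)$ fixed under the balanced-data assumption, driving the cross-entropy down pushes a lower bound on the mutual information up.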


Article: CausalImpact – A statistics library for estimating the causal effect of a designed intervention on a time series

The CausalImpact R package implements an approach to estimating the causal effect of a designed intervention on a time series. For example, how many additional daily clicks were generated by an advertising campaign? Answering a question like this can be difficult when a randomized experiment is not available. The package aims to address this difficulty using a structural Bayesian time-series model to estimate how the response metric might have evolved after the intervention if the intervention had not occurred.
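
The package itself is in R, but the underlying recipe (fit a time-series model to the pre-intervention period, forecast the counterfactual, take the difference) is easy to sketch in Python. The sketch below uses the structural time-series model from statsmodels as a stand-in, not the CausalImpact API, and all numbers are synthetic:

    import numpy as np
    from statsmodels.tsa.statespace.structural import UnobservedComponents

    rng = np.random.default_rng(3)
    y = 100 + np.cumsum(rng.normal(size=120))    # daily clicks (synthetic)
    y[90:] += 15                                 # intervention lifts the series at t=90

    # Fit a local-level model on the pre-period only, then forecast the post-period.
    model = UnobservedComponents(y[:90], level="local level")
    fit = model.fit(disp=False)
    counterfactual = fit.forecast(steps=30)

    effect = y[90:] - counterfactual             # pointwise causal effect estimate
    print(f"estimated average lift: {effect.mean():.1f}")   # near the true +15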


Article: Causal ML: A Python Package for Uplift Modeling and Causal Inference with ML

Causal ML is a Python package that provides a suite of uplift modeling and causal inference methods using machine learning algorithms based on recent research. It provides a standard interface that allows users to estimate the Conditional Average Treatment Effect (CATE) or Individual Treatment Effect (ITE) from experimental or observational data. Essentially, it estimates the causal impact of intervention T on outcome Y for users with observed features X, without strong assumptions on the model form.
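
A minimal sketch following the meta-learner interface shown in the causalml documentation (synthetic data; treat the exact import path and return signature as assumptions to check against the installed version):

    import numpy as np
    from causalml.inference.meta import LRSRegressor

    rng = np.random.default_rng(7)
    n = 5_000
    X = rng.normal(size=(n, 5))
    treatment = rng.binomial(1, 0.5, n)
    y = X[:, 0] + 1.5 * treatment + rng.normal(size=n)   # true ATE = 1.5

    # S-learner with a linear base model; returns the ATE with confidence bounds.
    ate, lb, ub = LRSRegressor().estimate_ate(X, treatment, y)
    print(ate, lb, ub)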


Paper: Predicting bubble bursts in oil prices using mixed causal-noncausal models

This paper investigates oil price series using mixed causal-noncausal autoregressive (MAR) models, namely dynamic processes that depend not only on their lags but also on their leads. MAR models have been successfully applied to commodity prices, as they can generate nonlinear features such as speculative bubbles. We estimate the probabilities that bubbles in oil price series burst once the series enter an explosive phase. To do so, we first evaluate how to adequately detrend nonstationary oil price series while preserving the bubble patterns observed in the raw data. The impact of different filters on the identification of MAR models as well as on forecasting bubble events is investigated using Monte Carlo simulations. We illustrate our findings on WTI and Brent monthly series.


Paper: Theory-based Causal Transfer: Integrating Instance-level Induction and Abstract-level Structure Learning

Learning transferable knowledge across similar but different settings is a fundamental component of generalized intelligence. In this paper, we approach the transfer learning challenge from a causal theory perspective. Our agent is endowed with two basic yet general theories for transfer learning: (i) a task shares a common abstract structure that is invariant across domains, and (ii) the behavior of specific features of the environment remains constant across domains. We adopt a Bayesian perspective of causal theory induction and use these theories to transfer knowledge between environments. Given these general theories, the goal is to train an agent by interactively exploring the problem space to (i) discover, form, and transfer useful abstract and structural knowledge, and (ii) induce useful knowledge from the instance-level attributes observed in the environment. A hierarchy of Bayesian structures is used to model abstract-level structural causal knowledge, and an instance-level associative learning scheme learns which specific objects can be used to induce state changes through interaction. This model-learning scheme is then integrated with a model-based planner to achieve a task in the OpenLock environment, a virtual 'escape room' with a complex hierarchy that requires agents to reason about an abstract, generalized causal structure. We compare performances against a set of predominant model-free reinforcement learning (RL) algorithms. RL agents showed poor ability to transfer learned knowledge across different trials, whereas the proposed model revealed performance trends similar to those of human learners and, more importantly, demonstrated transfer behavior across trials and learning situations.


Paper: High Dimensional M-Estimation with Missing Outcomes: A Semi-Parametric Framework

We consider high dimensional $M$-estimation in settings where the response $Y$ is possibly missing at random and the covariates $\mathbf{X} \in \mathbb{R}^p$ can be high dimensional compared to the sample size $n$. The parameter of interest $\boldsymbol{\theta}_0 \in \mathbb{R}^d$ is defined as the minimizer of the risk of a convex loss, under a fully non-parametric model, and $\boldsymbol{\theta}_0$ itself is high dimensional, which is a key distinction from existing works. Standard high dimensional regression and series estimation with possibly misspecified models and missing $Y$ are included as special cases, as well as their counterparts in causal inference using 'potential outcomes'. Assuming $\boldsymbol{\theta}_0$ is $s$-sparse ($s \ll n$), we propose an $L_1$-regularized debiased and doubly robust (DDR) estimator of $\boldsymbol{\theta}_0$ based on a high dimensional adaptation of the traditional double robust (DR) estimator's construction. Under mild tail assumptions and arbitrarily chosen (working) models for the propensity score (PS) and the outcome regression (OR) estimators, satisfying only some high-level conditions, we establish finite sample performance bounds for the DDR estimator showing its (optimal) $L_2$ error rate to be $\sqrt{s (\log d)/ n}$ when both models are correct, and its consistency and DR properties when only one of them is correct. Further, when both the models are correct, we propose a desparsified version of our DDR estimator that satisfies an asymptotic linear expansion and facilitates inference on low dimensional components of $\boldsymbol{\theta}_0$. Finally, we discuss various choices of high dimensional parametric/semi-parametric working models for the PS and OR estimators. All results are validated via detailed simulations.

Finding out why

28 Thursday Nov 2019

Posted by Michael Laux in Causality

Paper: Sampling distribution for single-regression Granger causality estimators

We show for the first time that, under the null hypothesis of vanishing Granger causality, the single-regression Granger-Geweke estimator converges to a generalised $\chi^2$ distribution, which may be well approximated by a $\Gamma$ distribution. We show that this holds too for Geweke’s spectral causality averaged over a given frequency band, and derive explicit expressions for the generalised $\chi^2$ and $\Gamma$-approximation parameters in both cases. We present an asymptotically valid Neyman-Pearson test based on the single-regression estimators, and discuss in detail how it may be usefully employed in realistic scenarios where autoregressive model order is unknown or infinite. We outline how our analysis may be extended to the conditional case, point-frequency spectral Granger causality, state-space Granger causality, and the Granger causality $F$-test statistic. Finally, we discuss approaches to approximating the distribution of the single-regression estimator under the alternative hypothesis.


Paper: An Innovative Approach to Addressing Childhood Obesity: A Knowledge-Based Infrastructure for Supporting Multi-Stakeholder Partnership Decision-Making in Quebec, Canada

The purpose of this paper is to describe and analyze the development of a knowledge-based infrastructure to support multi-stakeholder partnership (MSP) decision-making processes. The paper emerged from a study to define specifications for a knowledge-based infrastructure to provide decision support for community-level MSPs in the Canadian province of Quebec. As part of the study, a process assessment was conducted to understand the needs of communities as they collect, organize, and analyze data to make decisions about their priorities. The result of this process is a portrait, which is an epidemiological profile of health and nutrition in their community. Portraits inform strategic planning and development of interventions and are used to assess the impact of interventions. Our key findings indicate ambiguities and disagreement among MSP decision-makers regarding causal relationships between actions and outcomes, and the relevant data needed for making decisions. MSP decision-makers expressed a desire for easy-to-use tools that facilitate the collection, organization, synthesis, and analysis of data, to enable decision-making in a timely manner. Findings inform conceptual modeling and ontological analysis to capture the domain knowledge and specify relationships between actions and outcomes. This modeling and analysis provide the foundation for an ontology, encoded using OWL 2 Web Ontology Language. The ontology is developed to provide semantic support for the MSP process, defining objectives, strategies, actions, indicators, and data sources. In the future, software interacting with the ontology can facilitate interactive browsing by decision-makers in the MSP in the form of concepts, instances, relationships, and axioms. Our ontology also facilitates the integration and interpretation of community data and can help in managing semantic interoperability between different knowledge sources.


Paper: Economy Statistical Recurrent Units For Inferring Nonlinear Granger Causality

Granger causality is a widely-used criterion for analyzing interactions in large-scale networks. As most physical interactions are inherently nonlinear, we consider the problem of inferring the existence of pairwise Granger causality between nonlinearly interacting stochastic processes from their time series measurements. Our proposed approach relies on modeling the embedded nonlinearities in the measurements using a component-wise time series prediction model based on Statistical Recurrent Units (SRUs). We make a case that the network topology of Granger causal relations is directly inferrable from a structured sparse estimate of the internal parameters of the SRU networks trained to predict the processes' time series measurements. We propose a variant of SRU, called economy-SRU, which, by design has considerably fewer trainable parameters, and therefore less prone to overfitting. The economy-SRU computes a low-dimensional sketch of its high-dimensional hidden state in the form of random projections to generate the feedback for its recurrent processing. Additionally, the internal weight parameters of the economy-SRU are strategically regularized in a group-wise manner to facilitate the proposed network in extracting meaningful predictive features that are highly time-localized to mimic real-world causal events. Extensive experiments are carried out to demonstrate that the proposed economy-SRU based time series prediction model outperforms the MLP, LSTM and attention-gated CNN-based time series models considered previously for inferring Granger causality.


Paper: Interventional Markov Equivalence for Mixed Graph Models

We study the problem of characterizing Markov equivalence of graphical models under general interventions. For DAGs, this problem is solved using data from an interventional setting to refine MECs of DAGs into smaller, interventional MECs. A recent graphical characterization of interventional MECs of DAGs relates to their global Markov property. Motivated by this, we generalize interventional MECs to all loopless mixed graphs via their global Markov property and generalize the graphical characterization given for DAGs to ancestral graphs. We also extend the notion of interventional Markov equivalence probabilistically: via invariance properties of distributions Markov to acyclic directed mixed graphs (ADMGs). We show that this generalization aligns with the standard causal interpretation of ADMGs. Finally, we show the two generalizations coincide at their intersection, thereby completely generalizing the characterization for DAGs to directed ancestral graphs.


Paper: Instance Cross Entropy for Deep Metric Learning

Loss functions play a crucial role in deep metric learning, and thus a variety of them have been proposed. Some supervise the learning process by pairwise or tripletwise similarity constraints, while others take advantage of structured similarity information among multiple data points. In this work, we approach deep metric learning from a novel perspective. We propose instance cross entropy (ICE), which measures the difference between an estimated instance-level matching distribution and its ground-truth one. ICE has three main appealing properties. Firstly, similar to categorical cross entropy (CCE), ICE has a clear probabilistic interpretation and exploits structured semantic similarity information for learning supervision. Secondly, ICE is scalable to infinite training data as it learns on mini-batches iteratively and is independent of the training set size. Thirdly, motivated by our relative weight analysis, seamless sample reweighting is incorporated. It rescales samples' gradients to control the differentiation degree over training examples instead of truncating them by sample mining. In addition to its simplicity and intuitiveness, extensive experiments on three real-world benchmarks demonstrate the superiority of ICE.


Paper: Two Causal Principles for Improving Visual Dialog

This paper is a winner report from team MReaL-BDAI for Visual Dialog Challenge 2019. We present two causal principles for improving Visual Dialog (VisDial). By ‘improving’, we mean that they can promote almost every existing VisDial model to the state-of-the-art performance on Visual Dialog 2019 Challenge leader-board. Such a major improvement is only due to our careful inspection on the causality behind the model and data, finding that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests: we should remove the direct input of the dialog history to the answer model, otherwise the harmful shortcut bias will be introduced; Principle 2 says: there is an unobserved confounder for history, question, and answer, leading to spurious correlations from training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from the traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable in any VisDial model.


Paper: Causality for Machine Learning

Graphical causal inference as pioneered by Judea Pearl arose from research on artificial intelligence (AI), and for a long time had little connection to the field of machine learning. This article discusses where links have been and should be established, introducing key concepts along the way. It argues that the hard open problems of machine learning and AI are intrinsically related to causality, and explains how the field is beginning to understand them.


Paper: Causally Denoise Word Embeddings Using Half-Sibling Regression

Distributional representations of words, also known as word vectors, have become crucial for modern natural language processing tasks due to their wide applications. Recently, a growing body of word vector postprocessing algorithms has emerged, aiming to render off-the-shelf word vectors even stronger. In line with these investigations, we introduce a novel word vector postprocessing scheme under a causal inference framework. Concretely, the postprocessing pipeline is realized by Half-Sibling Regression (HSR), which allows us to identify and remove confounding noise contained in word vectors. Compared to previous work, our proposed method has the advantages of interpretability and transparency due to its causal inference grounding. Evaluated on a battery of standard lexical-level evaluation tasks and downstream sentiment analysis tasks, our method reaches state-of-the-art performance.

Finding out why

25 Monday Nov 2019

Posted by Michael Laux in Causality

Paper: Causality-based Feature Selection: Methods and Evaluations

Feature selection is a crucial preprocessing step in data analytics and machine learning. Classical feature selection algorithms select features based on the correlations between predictive features and the class variable and do not attempt to capture causal relationships between them. It has been shown that knowledge about the causal relationships between features and the class variable has potential benefits for building interpretable and robust prediction models, since causal relationships imply the underlying mechanism of a system. Consequently, causality-based feature selection has gradually attracted greater attention and many algorithms have been proposed. In this paper, we present a comprehensive review of recent advances in causality-based feature selection. To facilitate the development of new algorithms in the research area and to make it easy to compare new methods with existing ones, we develop the first open-source package, called CausalFS, which consists of most of the representative causality-based feature selection algorithms (available at https://github.com/kuiy/CausalFS). Using CausalFS, we conduct extensive experiments to compare the representative algorithms with both synthetic and real-world data sets. Finally, we discuss some challenging problems to be tackled in future causality-based feature selection research.


Python Library: causeinfer

Causal inference/uplift in Python
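
causeinfer's own API is not documented here, so as a stand-in the sketch below shows the classic 'two-model' uplift approach that such packages implement, using scikit-learn alone (synthetic data):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(5)
    n = 4_000
    X = rng.normal(size=(n, 4))
    t = rng.binomial(1, 0.5, n)
    y = X[:, 0] + (1.0 + X[:, 1]) * t + rng.normal(size=n)  # heterogeneous effect

    # Two-model approach: fit separate outcome models for treated and control,
    # then score uplift as the difference of their predictions.
    m1 = RandomForestRegressor().fit(X[t == 1], y[t == 1])
    m0 = RandomForestRegressor().fit(X[t == 0], y[t == 0])
    uplift = m1.predict(X) - m0.predict(X)
    print(uplift.mean())   # close to the average effect of 1.0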


Article: What Is The Strongest Quasi-Experimental Method? Interrupted Time Series. Period!

Randomized Controlled Trials (RCTs) are considered the gold standard for causal inference. However, it is not always feasible to run RCTs for various reasons. Interrupted Time Series (ITS) design comes in handy when RCTs are not available (see Netflix for examples). As a quasi-experimental method, ITS carries strong inferential power and has wide applications in epidemiology, medication research, and macro program evaluations in general. Arguably, ITS is the strongest quasi-experimental method in causal inference (Penfold and Zhang, 2013), and we shall elaborate in the following parts.
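
A minimal segmented-regression sketch of ITS (synthetic data, statsmodels assumed): regress the outcome on time, a post-intervention indicator, and time since intervention, and read the level and slope changes off the coefficients.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    T, t0 = 100, 60                              # series length and intervention time
    time = np.arange(T)
    post = (time >= t0).astype(int)
    since = np.clip(time - t0, 0, None)
    # Baseline trend, then a level jump of 5 and a slope change of 0.3.
    y = 10 + 0.2 * time + 5 * post + 0.3 * since + rng.normal(size=T)
    df = pd.DataFrame({"y": y, "time": time, "post": post, "since": since})

    fit = smf.ols("y ~ time + post + since", data=df).fit()
    print(fit.params[["post", "since"]])         # level and slope changes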


Paper: Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling

Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal by grounding natural language instructions to the visual surroundings. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. In this paper, we explore the use of counterfactual thinking as a human-inspired data augmentation method that results in robust models. Counterfactual thinking is a concept that describes the human propensity to create possible alternatives to life events that have already occurred. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data. In particular, we present a model-agnostic adversarial path sampler (APS) that learns to sample challenging paths that force the navigator to improve based on the navigation performance. APS also serves to do pre-exploration of unseen environments to strengthen the model's ability to generalize. We evaluate the influence of APS on the performance of different VLN baseline models using the room-to-room dataset (R2R). The results show that the adversarial training process with our proposed APS benefits VLN models under both seen and unseen environments, and the pre-exploration process can further improve performance in unseen environments.


Paper: A Graph Autoencoder Approach to Causal Structure Learning

Causal structure learning has been a challenging task in the past decades, and several mainstream approaches such as constraint- and score-based methods have been studied with theoretical guarantees. Recently, a new approach has transformed the combinatorial structure learning problem into a continuous one and then solved it using gradient-based optimization methods. Following the recent state of the art, we propose a new gradient-based method to learn causal structures from observational data. The proposed method generalizes recent gradient-based methods to a graph autoencoder framework that allows nonlinear structural equation models and is easily applicable to vector-valued variables. We demonstrate that on synthetic datasets, our proposed method significantly outperforms other gradient-based methods, especially on large causal graphs. We further investigate the scalability and efficiency of our method, and observe a near linear training time when scaling up the graph size.


Paper: PRINCE: Provider-side Interpretability with Counterfactual Explanations in Recommender Systems

Interpretable explanations for recommender systems and other machine learning models are crucial to gain user trust. Prior works that have focused on paths connecting users and items in a heterogeneous network have several limitations, such as discovering relationships rather than true explanations, or disregarding other users’ privacy. In this work, we take a fresh perspective, and present PRINCE: a provider-side mechanism to produce tangible explanations for end-users, where an explanation is defined to be a set of minimal actions performed by the user that, if removed, changes the recommendation to a different item. Given a recommendation, PRINCE uses a polynomial-time optimal algorithm for finding this minimal set of a user’s actions from an exponential search space, based on random walks over dynamic graphs. Experiments on two real-world datasets show that PRINCE provides more compact explanations than intuitive baselines, and insights from a crowdsourced user-study demonstrate the viability of such action-based explanations. We thus posit that PRINCE produces scrutable, actionable, and concise explanations, owing to its use of counterfactual evidence, a user’s own actions, and minimal sets, respectively.


Python Library: PyIF

An open source implementation to compute bi-variate Transfer Entropy.
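
As a back-of-the-envelope illustration of the quantity such a library computes, the sketch below estimates transfer entropy from X to Y with a plain histogram estimator on discretized series; this is illustrative only, not PyIF's implementation:

    import numpy as np

    def entropy(*cols):
        """Joint Shannon entropy of discrete columns, in bits."""
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    rng = np.random.default_rng(4)
    x = rng.binomial(1, 0.5, 10_000)
    y = np.roll(x, 1) ^ rng.binomial(1, 0.1, 10_000)   # y is mostly lagged x

    yt, y1, x1 = y[1:], y[:-1], x[:-1]
    # TE(X -> Y) = H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1})
    te = (entropy(yt, y1) - entropy(y1)) - (entropy(yt, y1, x1) - entropy(y1, x1))
    print(f"TE(X->Y) ~ {te:.2f} bits")                 # clearly positive here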


Paper: Instrumental Variables: to Strengthen or not to Strengthen?

Instrumental variables (IV) are extensively used to estimate treatment effects in the presence of unmeasured confounding; however, weak IVs are often encountered in empirical studies and may cause problems. Many studies have considered building a stronger IV from the original, possibly weak, IV in the design stage of a matched study at the cost of not using some of the samples in the analysis. It is widely accepted that strengthening an IV may render nonparametric tests more powerful and typically increases the power of sensitivity analyses. In this article, we argue that contrary to this conventional wisdom, although a strong IV is good, strengthening an IV may not be. We consider matched observational studies from three perspectives. First, we evaluate the trade-off between IV strength and sample size on nonparametric tests assuming the IV is valid and show that there are circumstances in which strengthening an IV increases power but other circumstances in which it decreases power. Second, we derive a necessary condition for a valid sensitivity analysis model with continuous doses. We show that the $\Gamma$ sensitivity analysis model, which has been previously used to come to the conclusion that strengthening an IV makes studies less sensitive to bias, does not apply to the continuous IV setting and thus this previously reached conclusion may be invalid. Third, we quantify the bias of the Wald estimator with a possibly invalid IV under an oracle called the asymptotic oracle bias and leverage it to develop a valid sensitivity analysis framework; under this framework, we show that strengthening an IV may amplify or mitigate the bias of the estimator, and may or may not increase the power of sensitivity analyses. We also discuss how to better adjust for the observed covariates when building an IV.

Finding out why

22 Friday Nov 2019

Posted by Michael Laux in Causality

Paper: Using natural language processing to extract health-related causality from Twitter messages

Twitter messages (tweets) contain various types of information, including health-related information. Analysis of health-related tweets would help us understand health conditions and concerns encountered in our daily life. In this work, we evaluated an approach to extracting causal relations from tweets using natural language processing (NLP) techniques. We focused on three health-related topics: 'stress', 'insomnia', and 'headache'. We proposed a set of lexico-syntactic patterns based on dependency parser outputs to extract causal information. A large dataset consisting of 24 million tweets was used. The results show that our approach achieved an average precision between 74.59% and 92.27%. Analysis of the extracted relations revealed interesting findings about health-related information in Twitter.
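
A toy version of one such lexico-syntactic pattern, using spaCy's dependency parse to pull the subject and object of an explicit 'cause' verb (illustrative only, not the authors' pattern set; assumes the small English model is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Too much screen time causes insomnia and headaches.")

    for tok in doc:
        if tok.lemma_ == "cause" and tok.pos_ == "VERB":
            subj = [w for w in tok.children if w.dep_ == "nsubj"]
            obj = [w for w in tok.children if w.dep_ == "dobj"]
            if subj and obj:
                cause = " ".join(t.text for t in subj[0].subtree)
                effect = " ".join(t.text for t in obj[0].subtree)
                print(f"cause: {cause!r} -> effect: {effect!r}")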


Paper: Causal inference using Bayesian non-parametric quasi-experimental design

The de facto standard for causal inference is the randomized controlled trial, where one compares a manipulated group with a control group in order to determine the effect of an intervention. However, this research design is not always realistically possible due to pragmatic or ethical concerns. In these situations, quasi-experimental designs may provide a solution, as these allow for causal conclusions at the cost of additional design assumptions. In this paper, we provide a generic framework for quasi-experimental design using Bayesian model comparison, and we show how it can be used as an alternative to several common research designs. We provide a theoretical motivation for a Gaussian process based approach and demonstrate its convenient use in a number of simulations. Finally, we apply the framework to determine the effect of population-based thresholds for municipality funding in France, of the 2005 smoking ban in Sicily on the number of acute coronary events, and of an alleged historical phantom border in the Netherlands on Dutch voting behaviour.


Python Library: statinf

A library for statistics and causal inference


Paper: On the computation of counterfactual explanations — A survey

Due to the increasing use of machine learning in practice, it becomes more and more important to be able to explain the predictions and behavior of machine learning models. One instance of such explanations are counterfactual explanations, which provide intuitive and useful explanations of machine learning models. In this survey we review model-specific methods for efficiently computing counterfactual explanations of many different machine learning models, and propose methods for models that have not been considered in the literature so far.


Paper: Causal inference with recurrent data via inverse probability treatment weighting method (IPTW)

Propensity score methods are increasingly being used to reduce estimation bias of treatment effects for observational studies. Previous research has shown that propensity score methods consistently estimate the marginal hazard ratio for time to event data. However, recurrent data frequently arise in the biomedical literature and there is a paucity of research into the use of propensity score methods when data are recurrent in nature. The objective of this paper is to extend the existing propensity score methods to the recurrent data setting. We illustrate our methods through a series of Monte Carlo simulations. The simulation results indicate that without the presence of censoring, the IPTW estimators allow us to consistently estimate the marginal hazard ratio for each event. Under administrative censoring regime, the stabilized IPTW estimator yields biased estimate of the marginal hazard ratio, and the degree of bias depends on the proportion of subjects being censored. For variance estimation, the naïve variance estimator often tends to substantially underestimate the variance of the IPTW estimator, while the robust variance estimator significantly reduces the estimation bias of the variance.
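
For reference, the stabilized weights discussed here are straightforward to construct from a fitted propensity model; a scikit-learn sketch on synthetic (single-event, uncensored) data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(6)
    n = 5_000
    X = rng.normal(size=(n, 3))
    t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))      # confounded treatment

    ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    p_t = t.mean()

    # Stabilized weights: marginal treatment probability over the propensity score.
    w = np.where(t == 1, p_t / ps, (1 - p_t) / (1 - ps))
    print(w.mean())   # stabilized weights average close to 1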


Paper: Generalized Maximum Causal Entropy for Inverse Reinforcement Learning

We consider the problem of learning from demonstrated trajectories with inverse reinforcement learning (IRL). Motivated by a limitation of the classical maximum entropy model in capturing the structure of the network of states, we propose an IRL model based on a generalized version of the causal entropy maximization problem, which allows us to generate a class of maximum entropy IRL models. Our generalized model has an advantage of being able to recover, in addition to a reward function, another expert’s function that would (partially) capture the impact of the connecting structure of the states on experts’ decisions. Empirical evaluation on a real-world dataset and a grid-world dataset shows that our generalized model outperforms the classical ones, in terms of recovering reward functions and demonstrated trajectories.


Paper: Causal Inference Under Approximate Neighborhood Interference

This paper studies causal inference in randomized experiments under network interference. Most existing models of interference posit that treatments assigned to alters only affect the ego’s response through a low-dimensional exposure mapping, which only depends on units within some known network radius around the ego. We propose a substantially weaker ‘approximate neighborhood interference’ (ANI) assumption, which allows treatments assigned to alters far from the ego to have a small, but potentially nonzero, impact on the ego’s response. Unlike the exposure mapping model, we can show that ANI is satisfied in well-known models of social interactions. Despite its generality, inference in a single-network setting is still possible under ANI, as we prove that standard inverse-probability weighting estimators can consistently estimate treatment and spillover effects and are asymptotically normal. For practical inference, we propose a new conservative variance estimator based on a network bootstrap and suggest a data-dependent bandwidth using the network diameter. Finally, we illustrate our results in a simulation study and empirical application.


Paper: Graph Topological Aspects of Granger Causal Network Learning

We study Granger causality in the context of wide-sense stationary time series, where our focus is on the topological aspects of the underlying causality graph. We establish sufficient conditions (in particular, we develop the notion of a ‘strongly causal’ graph topology) under which the true causality graph can be recovered via pairwise causality testing alone, and provide examples from the gene regulatory network literature suggesting that our concept of a strongly causal graph may be applicable to this field. We implement and detail finite-sample heuristics derived from our theory, and establish through simulation the efficiency gains (both statistical and computational) which can be obtained (in comparison to LASSO-type algorithms) when structural assumptions are met.
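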
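
Pairwise Granger testing of the kind analyzed above is available off the shelf; a minimal statsmodels sketch with a synthetic causal pair:

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(8)
    n = 500
    x = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.6 * x[t - 1] + rng.normal()     # y is Granger-caused by x

    # Column order: test whether the 2nd column Granger-causes the 1st.
    data = np.column_stack([y, x])
    res = grangercausalitytests(data, maxlag=2, verbose=False)
    print(res[1][0]["ssr_ftest"])                # (F statistic, p-value, df_denom, df_num)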

Finding out why

19 Tuesday Nov 2019

Posted by Michael Laux in Causality

Paper: Compositional Hierarchical Tensor Factorization: Representing Hierarchical Intrinsic and Extrinsic Causal Factors

Visual objects are composed of a recursive hierarchy of perceptual wholes and parts, whose properties, such as shape, reflectance, and color, constitute a hierarchy of intrinsic causal factors of object appearance. However, object appearance is the compositional consequence of both an object's intrinsic and extrinsic causal factors, where the extrinsic causal factors are related to illumination, and imaging conditions. Therefore, this paper proposes a unified tensor model of wholes and parts, and introduces a compositional hierarchical tensor factorization that disentangles the hierarchical causal structure of object image formation, and subsumes multilinear block tensor decomposition as a special case. The resulting object representation is an interpretable combinatorial choice of wholes' and parts' representations that renders object recognition robust to occlusion and reduces training data requirements. We demonstrate our approach in the context of face recognition by training on an extremely reduced dataset of synthetic images, and report encouraging face verification results on two datasets – the Freiburg dataset, and the Labeled Face in the Wild (LFW) dataset consisting of real world images, thus substantiating the suitability of our approach for data-starved domains.


Paper: A Unified Framework for Causal Inference with Multiple Imputation Using Martingale

Multiple imputation is widely used to handle confounders missing at random in causal inference. Although Rubin's combining rule is simple, it is not clear whether or not the standard multiple imputation inference is consistent when coupled with the commonly-used average causal effect (ACE) estimators. This article establishes a unified martingale representation for ACE estimators after multiple imputation. This representation invokes wild bootstrap inference to provide consistent variance estimation. Our framework applies to asymptotically normal ACE estimators, including the regression imputation, weighting, and matching estimators. We extend to the scenarios when both outcome and confounders are subject to missingness and when the data are missing not at random.


Paper: Causality-based tests to detect the influence of confounders on mobile health diagnostic applications: a comparison with restricted permutations

Machine learning practice is often impacted by confounders. Confounding can be particularly severe in remote digital health studies, where the participants self-select to enter the study. While many different confounding adjustment approaches have been proposed in the literature, most of these methods rely on modeling assumptions, and it is unclear how robust they are to violations of these assumptions. This realization has recently motivated the development of restricted permutation methods to quantify the influence of observed confounders on the predictive performance of a machine learning model and to evaluate whether confounding adjustment methods are working as expected. In this paper we show, nonetheless, that restricted permutations can generate biased estimates of the contribution of the confounders to the predictive performance of a learner, and we propose an alternative approach to tackle this problem. By viewing a classification task from a causality perspective, we are able to leverage conditional independence tests between predictions, test set labels, and confounders in order to detect confounding of the predictive performance of a classifier. We illustrate the application of our causality-based approach to data collected from an mHealth study in Parkinson's disease.


Paper: Long-range Event-level Prediction and Response Simulation for Urban Crime and Global Terrorism with Granger Networks

Large-scale trends in urban crime and global terrorism are well-predicted by socio-economic drivers, but focused, event-level predictions have had limited success. Standard machine learning approaches are promising, but lack interpretability, are generally interpolative, and are ineffective for precise future interventions with costly and wasteful false positives. Here, we introduce Granger Network inference as a new forecasting approach for individual infractions, with demonstrated performance far surpassing past results, yet transparent enough to validate and extend social theory. Considering the problem of predicting crime in the City of Chicago, we achieve an average AUC of ~90% for events predicted a week in advance within spatial tiles approximately 1000 ft across. Instead of pre-supposing that crimes unfold across contiguous spaces akin to diffusive systems, we learn the local transport rules from data. As our key insights, we uncover indications of suburban bias: law-enforcement response is modulated by socio-economic contexts, with disproportionately negative impacts in the inner city, and the dynamics of violent and property crimes co-evolve and constrain each other, lending quantitative support to controversial pro-active policing policies. To demonstrate broad applicability to spatio-temporal phenomena, we analyze terror attacks in the Middle East in the recent past, and achieve an AUC of ~80% for predictions made a week in advance within spatial tiles measuring approximately 120 miles across. We conclude that while crime operates near an equilibrium quickly dissipating perturbations, terrorism does not. Indeed terrorism aims to destabilize social order, as shown by its dynamics being susceptible to run-away increases in event rates under small perturbations.
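
For readers unfamiliar with the underlying primitive, here is a toy Granger causality test on two synthetic series using statsmodels. The paper's Granger Networks operate on spatial event streams; this merely illustrates the pairwise building block.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Toy example: does series x Granger-cause series y?
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.6 * x[t - 1] + 0.2 * y[t - 1] + rng.normal(scale=0.5)

# Column order matters: the test asks whether the SECOND column
# helps predict the first beyond its own past.
res = grangercausalitytests(np.column_stack([y, x]), maxlag=2)
```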


Paper: Guidelines for estimating causal effects in pragmatic randomized trials

Pragmatic randomized trials are designed to provide evidence for clinical decision-making rather than regulatory approval. Common features of these trials include the inclusion of heterogeneous or diverse patient populations in a wide range of care settings, the use of active treatment strategies as comparators, unblinded treatment assignment, and the study of long-term, clinically relevant outcomes. These features can greatly increase the usefulness of the trial results for patients, clinicians, and other stakeholders. However, these features also introduce an increased risk of non-adherence, which reduces the value of the intention-to-treat effect as a patient-centered measure of causal effect. In these settings, the per-protocol effect provides useful complementary information for decision making. Unfortunately, there is little guidance for valid estimation of the per-protocol effect. Here, we present our full guidelines for analyses of pragmatic trials that will result in more informative causal inferences for both the intention-to-treat effect and the per-protocol effect.


Python Library: cause-ml

Causal ML benchmarking and development tools


Article: The 10 Bias and Causality Techniques That Everyone Needs to Master.

In the end, what does causality have to do with machine learning? Machine learning is about prediction, and causality is about real effects; do these two themes have something in common? Yes, a great deal, and this series of posts tries to bridge these two sub-areas of data science. I like to think of machine learning as a data grinder: if you put in good-quality data, you get good-quality predictions, but if you put in garbage, it will keep grinding. Don't expect good predictions to come out; it's just ground garbage, and that's what we'll talk about in this post.


Paper: Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling

Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time- and memory-intensive, prohibiting the use of longer receptive fields in practice. To increase efficiency, we make use of the 'slow feature' hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model ('Seq-U-Net') to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance in all tasks.
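
The "making convolutions causal" step is easy to sketch: left-pad the input so each output depends only on past samples. A minimal PyTorch version follows; it is a generic causal convolution, not the authors' Seq-U-Net code.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution made causal by left-padding, the adaptation used to
    make a U-Net auto-regressive (a generic sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad only on the left
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # no access to future samples
        return self.conv(x)

layer = CausalConv1d(1, 8, kernel_size=3, dilation=2)
out = layer(torch.randn(4, 1, 100))
print(out.shape)  # torch.Size([4, 8, 100])
```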

Finding out why

13 Wednesday Nov 2019

Posted by Michael Laux in Causality

≈ Leave a comment

Paper: Bayesian Matrix Completion Approach to Causal Inference with Panel Data

This study proposes a new Bayesian approach to infer the average treatment effect. The approach treats counterfactual untreated outcomes as missing observations and infers them by completing a matrix composed of realized and potential untreated outcomes using a data augmentation technique. We also develop a tailored prior that helps in the identification of parameters and induces the matrix of the untreated outcomes to be approximately low rank. While the proposed approach is similar to synthetic control methods and other relevant methods, it has several notable advantages. Unlike synthetic control methods, the proposed approach does not require stringent assumptions. Whereas synthetic control methods do not have a statistically grounded method to quantify uncertainty about inference, the proposed approach can estimate credible sets in a straightforward and consistent manner. Our proposed approach has better finite sample performance than existing Bayesian and non-Bayesian approaches, as we show through a series of simulation studies.
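
To convey the low-rank completion intuition (though not the paper's Bayesian machinery), here is a frequentist soft-impute sketch: treated cells of a units-by-time panel are treated as missing untreated outcomes and filled in by iterative SVD soft-thresholding. The λ value and data are illustrative assumptions.

```python
import numpy as np

def soft_impute(Y, mask, lam=1.0, n_iters=100):
    """Fill missing entries (mask == False) with a low-rank estimate via
    iterative SVD soft-thresholding; a non-Bayesian stand-in for the
    paper's low-rank prior on untreated outcomes."""
    Z = np.where(mask, Y, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)             # shrink singular values
        L = (U * s) @ Vt
        Z = np.where(mask, Y, L)                 # keep observed, impute missing
    return Z

rng = np.random.default_rng(2)
truth = np.outer(rng.normal(size=20), rng.normal(size=30))  # rank-1 panel
mask = rng.random((20, 30)) > 0.2                # ~20% of cells "treated"
Y = truth + rng.normal(scale=0.1, size=truth.shape)
Y_hat = soft_impute(Y, mask, lam=0.5)
```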


Paper: Stabilizing Variable Selection and Regression

We consider regression in which one predicts a response $Y$ with a set of predictors $X$ across different experiments or environments. This is a common setup in many data-driven scientific fields and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, i.e., predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of arguments for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to graphically characterize stable versus unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error given that the resulting regression generalizes to unseen new environments.
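
A crude way to feel out stability is to fit a regression per environment and check how much each coefficient moves. The heuristic below is only a sketch of the stable-versus-unstable distinction, not the paper's stabilized-regression estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def coefficient_stability(X_envs, y_envs):
    """Fit a regression per environment and report the across-environment
    standard deviation of each coefficient (high std => unstable)."""
    coefs = np.array([LinearRegression().fit(X, y).coef_
                      for X, y in zip(X_envs, y_envs)])
    return coefs.std(axis=0)

rng = np.random.default_rng(3)
X_envs, y_envs = [], []
for shift in [0.0, 1.0, 2.0]:                     # three environments
    X = rng.normal(size=(200, 2))
    X[:, 1] += shift
    y = 1.5 * X[:, 0] + (1.0 + shift) * X[:, 1] + rng.normal(size=200)
    X_envs.append(X)
    y_envs.append(y)
print(coefficient_stability(X_envs, y_envs))      # second coefficient is unstable
```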


Article: Three Essays on Statistical Inference and Estimation for Heterogeneous Causal Effects in Economics

Learning about causal relationships on the basis of empirical observation is not a trivial task. The classical scientific approach of running repeated independent controlled randomized experiments to measure effects of causes is limited in practice due to ethical, financial, political, temporal or other constraints. In particular, it can be infeasible to replicate controlled conditions to answer real-world questions regarding the impact of a policy or a program currently in place, i.e. to measure the effect of a treatment on a set of outcomes of interest. However, for empirically informed policy or general decision making, knowing about the fundamental causal relationships is crucial.


Paper: Why X rather than Y? Explaining Neural Models' Predictions by Generating Intervention Counterfactual Samples

Even though the topic of explainable AI/ML is very popular in the text and computer vision domains, most of the previous literature is not suitable for explaining black-box models' predictions on general data mining datasets. This is because these datasets are usually in a high-dimensional vectorized feature format that is not as friendly and comprehensible to end users as text and images. In this paper, we combine the best of both worlds: 'explanations by intervention' from causality and 'explanations are contrastive' from philosophy and social science to explain neural models' predictions for tabular datasets. Specifically, given a model's prediction of label X, we propose a novel idea to intervene and generate a minimally modified contrastive sample classified as Y, which then yields a simple natural-language answer to the question 'Why X rather than Y?'. We carry out experiments with several datasets of different scales and compare our approach with other baselines along three dimensions: fidelity, reasonableness, and explainability.
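
One generic way to search for a minimally modified contrastive sample is gradient descent on the input with a proximity penalty. This is a sketch of the general idea only; the model, the L1 penalty weight, and the optimizer are our assumptions, not the authors' generation procedure.

```python
import torch

def contrastive_counterfactual(model, x, target_class, lam=0.1, steps=200):
    """Search for a minimally modified input classified as the contrast
    class Y, answering 'Why X rather than Y?'. x has shape (1, d)."""
    x_cf = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=0.05)
    target = torch.tensor([target_class])
    for _ in range(steps):
        opt.zero_grad()
        loss = (torch.nn.functional.cross_entropy(model(x_cf), target)
                + lam * (x_cf - x).abs().sum())   # stay close to the original
        loss.backward()
        opt.step()
    return x_cf.detach()
```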


Paper: Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems

This manuscript contributes a general and practical framework for casting a Markov process model of a system at equilibrium as a structural causal model, and carrying out counterfactual inference. Markov processes mathematically describe the mechanisms in the system, and predict the system's equilibrium behavior upon intervention, but do not support counterfactual inference. In contrast, structural causal models support counterfactual inference, but do not identify the mechanisms. This manuscript leverages the benefits of both approaches. We define the structural causal models in terms of the parameters and the equilibrium dynamics of the Markov process models, and counterfactual inference flows from these settings. The proposed approach alleviates the identifiability drawback of the structural causal models, in that the counterfactual inference is consistent with the counterfactual trajectories simulated from the Markov process model. We showcase the benefits of this framework in case studies of complex biomolecular systems with nonlinear dynamics. We illustrate that, in the presence of Markov process model misspecification, counterfactual inference leverages prior data, and therefore estimates the outcome of an intervention more accurately than a direct simulation.


Paper: MBCAL: A Simple and Efficient Reinforcement Learning Method for Recommendation Systems

It has been widely regarded that only considering the immediate user feedback is not sufficient for modern industrial recommendation systems. Many previous works attempt to maximize the long term rewards with Reinforcement Learning (RL). However, model-free RL suffers from problems including significant variance in gradients, long convergence periods, and the requirement of sophisticated online infrastructure. While model-based RL provides a sample-efficient choice, the cost of planning in an online system is unacceptable. To achieve high sample efficiency in practical situations, we propose a novel model-based reinforcement learning method, namely model-based counterfactual advantage learning (MBCAL). In the proposed method, a masking item is introduced in the environment model learning. With the masking item and the environment model, we introduce the counterfactual future advantage, which eliminates most of the noise in long term rewards. The proposed method makes selections by approximating the immediate reward and the future advantage separately. It is easy to implement, yet requires reasonable cost in both training and inference. In the experiments, we compare our methods with several baselines, including supervised learning, model-free RL, and other model-based RL methods in carefully designed experiments. Results show that our method transcends all the baselines in both sample efficiency and asymptotic performance.


Paper: Group Average Treatment Effects for Observational Studies

The paper proposes an estimator for making inference on key features of heterogeneous treatment effects sorted by impact groups (GATES) in non-randomised experiments. Observational studies are standard in policy evaluation, from labour markets to educational surveys and other empirical settings. To control for potential selection bias we implement a doubly-robust estimator in the first stage. Keeping the flexibility to use any machine learning method to learn the conditional mean functions as well as the propensity score, we also use machine learning methods to learn a function for the conditional average treatment effect. The group average treatment effect is then estimated via a parametric linear model to provide p-values and confidence intervals. The result is a best linear predictor for effect heterogeneity based on impact groups. Cross-splitting and averaging for each observation is a further extension to avoid biases introduced through sample splitting. The advantage of the proposed method is a robust estimation of heterogeneous group treatment effects under mild assumptions, which is comparable with other models and thus keeps its flexibility in the choice of machine learning methods. At the same time, its ability to deliver interpretable results is ensured.
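
The doubly-robust first stage can be sketched with AIPW pseudo-outcomes, which can then be averaged within impact groups for a GATES-style estimate. This simplified sketch omits the cross-fitting and the parametric second-stage inference the paper relies on; the data-generating process is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def aipw_scores(X, t, y):
    """Doubly-robust (AIPW) pseudo-outcomes. In practice the nuisance
    models should be cross-fitted, as the paper emphasizes."""
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]   # propensity
    mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)
    mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))           # confounded treatment
y = X @ np.array([1.0, 0.5, 0.0]) + t * (1 + X[:, 1]) + rng.normal(size=1000)
scores = aipw_scores(X, t, y)
groups = np.digitize(X[:, 1], np.quantile(X[:, 1], [0.33, 0.66]))
for g in range(3):
    print(f"group {g}: GATE ~ {scores[groups == g].mean():.2f}")
```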


Paper: Deriving pairwise transfer entropy from network structure and motifs

Transfer entropy is an established method for quantifying directed statistical dependencies in neuroimaging and complex systems datasets. The pairwise (or bivariate) transfer entropy from a source to a target node in a network does not depend solely on the local source-target link weight, but on the wider network structure that the link is embedded in. This relationship is studied using a discrete-time linearly-coupled Gaussian model, which allows us to derive the transfer entropy for each link from the network topology. It is shown analytically that the dependence on the directed link weight is only a first approximation, valid for weak coupling. More generally, the transfer entropy increases with the in-degree of the source and decreases with the in-degree of the target, indicating an asymmetry of information transfer between hubs and low-degree nodes. In addition, the transfer entropy is directly proportional to weighted motif counts involving common parents or multiple walks from the source to the target, which are more abundant in networks with a high clustering coefficient than in random networks. Our findings also apply to Granger causality, which is equivalent to transfer entropy for Gaussian variables. Moreover, similar empirical results on random Boolean networks suggest that the dependence of the transfer entropy on the in-degree extends to nonlinear dynamics.
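
Since, as the paper notes, transfer entropy equals Granger causality for Gaussian variables, pairwise TE can be computed from the residual variances of two nested autoregressions. A minimal one-lag sketch:

```python
import numpy as np

def gaussian_transfer_entropy(x, y, lag=1):
    """Transfer entropy x -> y for Gaussian variables, from residual
    variances of nested linear regressions (one-lag sketch, in nats)."""
    yt, yp, xp = y[lag:], y[:-lag], x[:-lag]
    def resid_var(target, regressors):
        A = np.column_stack([np.ones(len(target))] + regressors)
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.var(target - A @ beta)
    v_reduced = resid_var(yt, [yp])          # predict y from its own past
    v_full = resid_var(yt, [yp, xp])         # ... plus the source's past
    return 0.5 * np.log(v_reduced / v_full)

rng = np.random.default_rng(5)
x = rng.normal(size=5000)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal(scale=0.5)
print(gaussian_transfer_entropy(x, y))       # clearly positive
print(gaussian_transfer_entropy(y, x))       # near zero
```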

Finding out why

11 Monday Nov 2019

Posted by Michael Laux in Causality

≈ Leave a comment

Paper: Robust contrastive learning and nonlinear ICA in the presence of outliers

Nonlinear independent component analysis (ICA) is a general framework for unsupervised representation learning aimed at recovering the latent variables in data. Recent practical methods perform nonlinear ICA by solving a series of classification problems based on logistic regression. However, logistic regression is well known to be vulnerable to outliers, and its performance can thus be strongly degraded by them. In this paper, we first theoretically analyze nonlinear ICA models in the presence of outliers. Our analysis implies that estimation in nonlinear ICA can be seriously hampered when outliers exist on the tails of the (non-contaminated) target density, which happens in a typical case of contamination by outliers. We develop two robust nonlinear ICA methods based on the γ-divergence, which is a robust alternative to the KL-divergence in logistic regression. The proposed methods are shown to have desired robustness properties in the context of nonlinear ICA. We also experimentally demonstrate that the proposed methods are very robust and outperform existing methods in the presence of outliers. Finally, the proposed method is applied to ICA-based causal discovery and shown to find a plausible causal relationship on fMRI data.


Paper: Learning Hawkes Processes from a Handful of Events

Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.
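
For orientation, here is the objective that the maximum-likelihood baselines optimize: the log-likelihood of a univariate Hawkes process with an exponential kernel, evaluated with the standard O(n) recursion. The parameter names and event times are generic; this is not the authors' variational EM algorithm.

```python
import numpy as np

def hawkes_loglik(times, T, mu, alpha, beta):
    """Log-likelihood of a univariate Hawkes process with kernel
    alpha * beta * exp(-beta * dt) on [0, T]."""
    ll, A, prev = 0.0, 0.0, None
    for t in times:
        if prev is not None:
            A = np.exp(-beta * (t - prev)) * (1.0 + A)   # recursive excitation
        ll += np.log(mu + alpha * beta * A)
        prev = t
    # compensator: integral of the intensity over [0, T]
    ll -= mu * T + alpha * np.sum(1.0 - np.exp(-beta * (T - np.asarray(times))))
    return ll

times = [0.5, 1.2, 1.3, 4.0, 4.1]   # toy event times
print(hawkes_loglik(times, T=5.0, mu=0.5, alpha=0.4, beta=2.0))
```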


Paper: Space Objects Maneuvering Prediction via Maximum Causal Entropy Inverse Reinforcement Learning

This paper uses inverse Reinforcement Learning (RL) to determine the behavior of Space Objects (SOs) by estimating the reward function that an SO is using for control. The approach discussed in this work can be used to analyze maneuvering of SOs from observational data. The inverse RL problem is solved using maximum causal entropy. This approach determines the optimal reward function that an SO is using while maneuvering with random disturbances by assuming that the observed trajectories are optimal with respect to the SO's own reward function. Lastly, this paper develops results for scenarios involving Low Earth Orbit (LEO) station-keeping and Geostationary Orbit (GEO) station-keeping.


Paper: Modeling National Latent Socioeconomic Health and Examination of Policy Effects via Causal Inference

This research develops a socioeconomic health index for nations through a model-based approach which incorporates spatial dependence and examines the impact of a policy through a causal modeling framework. As the gross domestic product (GDP) has been regarded as a dated measure and tool for benchmarking a nation's economic performance, there has been a growing consensus for an alternative measure, such as a composite 'wellbeing' index, to holistically capture a country's socioeconomic health performance. Many conventional ways of constructing wellbeing/health indices involve combining different observable metrics, such as life expectancy and education level, to form an index. However, health is inherently latent with metrics actually being observable indicators of health. In contrast to the GDP or other conventional health indices, our approach provides a holistic quantification of the overall 'health' of a nation. We build upon the latent health factor index (LHFI) approach that has been used to assess the unobservable ecological/ecosystem health. This framework integratively models the relationship between metrics, the latent health, and the covariates that drive the notion of health. In this paper, the LHFI structure is integrated with spatial modeling and statistical causal modeling, so as to evaluate the impact of a policy variable (mandatory maternity leave days) on a nation's socioeconomic health, while formally accounting for spatial dependency among the nations. We apply our model to countries around the world using data on various metrics and potential covariates pertaining to different aspects of societal health. The approach is structured in a Bayesian hierarchical framework and results are obtained by Markov chain Monte Carlo techniques.


Paper: Fast Dimensional Analysis for Root Cause Investigation in Large-Scale Service Environment

Root cause analysis in a large-scale production environment is challenging due to the complexity of services running across global data centers. Due to the distributed nature of a large-scale system, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly for detecting issues. Another challenge in reviewing the logs for identifying issues is the scale – there could easily be millions of entities, each with hundreds of features. In this paper we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e., groups of feature values, that identify groups of samples with sufficient support for the target failures, using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. With the use of a large-scale real-time database, we propose pre- and post-processing techniques and parallelism to further speed up the analysis. We have successfully rolled out this approach for root cause investigation purposes in a large-scale infrastructure. We also present the setup and results from multiple production use-cases in this paper.
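
The frequent item-set step is easy to prototype on a one-hot encoding of structured logs. The sketch below uses mlxtend's FP-Growth and association-rule utilities; the feature values (dc=east, kernel=v2, rack=r7) are hypothetical, and this is only a toy illustration of the mining step, not the paper's full framework.

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# One-hot structured log: rows are failure samples, columns are feature values
logs = pd.DataFrame({
    "dc=east":   [1, 1, 1, 0, 1],
    "kernel=v2": [1, 1, 1, 0, 0],
    "rack=r7":   [0, 1, 0, 1, 0],
}).astype(bool)

itemsets = fpgrowth(logs, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "lift"]])
```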


Paper: A two-dimensional propensity score matching method for longitudinal quasi-experimental studies: A focus on Travel Behavior and the built environment

The lack of longitudinal studies of the relationship between the built environment and travel behavior has been widely discussed in the literature. This paper discusses how standard propensity score matching estimators can be extended to enable such studies by pairing observations across two dimensions: longitudinal and cross-sectional. Researchers mimic randomized controlled trials (RCTs) and match observations in both dimensions, to find synthetic control groups that are similar to the treatment group and to match subjects synthetically across before-treatment and after-treatment time periods. We call this two-dimensional propensity score matching (2DPSM). This method demonstrates superior performance in estimating treatment effects, based on Monte Carlo evidence. A near-term opportunity for such matching is identifying the impact of transportation infrastructure on travel behavior.
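
For reference, here is the standard one-dimensional PSM building block (nearest-neighbour matching on the estimated propensity score, yielding an ATT estimate). 2DPSM's second, longitudinal matching dimension is not shown; the covariates and effect size are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_att(X, t, y):
    """1-NN matching on the propensity score; returns the ATT estimate."""
    ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    nn = NearestNeighbors(n_neighbors=1).fit(ps[t == 0].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[t == 1].reshape(-1, 1))
    return np.mean(y[t == 1] - y[t == 0][idx.ravel()])

rng = np.random.default_rng(6)
X = rng.normal(size=(800, 2))                      # e.g. built-environment covariates
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))    # self-selected "treatment"
y = X[:, 0] + 0.8 * t + rng.normal(size=800)       # true effect 0.8
print(psm_att(X, t, y))
```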


Paper: Estimating quantities conserved by virtue of scale invariance in timeseries

In contrast to the symmetries of translation in space, rotation in space, and translation in time, the known laws of physics are not universally invariant under transformation of scale. However, the action can be invariant under change of scale in the special case of a scale free dynamical system that can be described in terms of a Lagrangian, that itself scales inversely with time. Crucially, this means symmetries under change of scale can exist in dynamical systems under certain constraints. Our contribution lies in the derivation of a generalised scale invariant Lagrangian – in the form of a power series expansion – that satisfies these constraints. This generalised Lagrangian furnishes a normal form for dynamic causal models (i.e., state space models based upon differential equations) that can be used to distinguish scale invariance (scale symmetry) from scale freeness in empirical data. We establish face validity with an analysis of simulated data and then show how scale invariance can be identified – and how the associated conserved quantities can be estimated – in neuronal timeseries.


Paper: Optimal two-stage testing of multiple mediators

Mediation analysis in high-dimensional settings often involves identifying potential mediators among a large number of measured variables. For this purpose, a two-step familywise error rate (FWER) procedure called ScreenMin has recently been proposed (Djordjilović et al., 2019). In ScreenMin, variables are first screened, and only those that pass the screening are tested. The proposed threshold for selection has been shown to guarantee asymptotic FWER control. In this work, we investigate the impact of the selection threshold on the finite sample FWER. We derive the power-maximizing selection threshold and show that it is well approximated by an adaptive threshold of Wang et al. (2016). We study the performance of the proposed procedures in a simulation study, and apply them to a case-control study examining the effect of fish intake on the risk of colorectal adenoma.
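
The screen-then-test logic is simple to sketch: screen each mediator on the minimum of its two p-values (exposure-to-mediator and mediator-to-outcome), then Bonferroni-test the maximum among the selected. This is a simplified rendering of the general ScreenMin recipe; the threshold c is the tuning knob whose choice the paper analyzes.

```python
import numpy as np

def screenmin(p1, p2, c, alpha=0.05):
    """Two-stage mediator testing sketch: screen on min(p1, p2) <= c, then
    Bonferroni-test max(p1, p2) among the S selected mediators."""
    p_min = np.minimum(p1, p2)
    p_max = np.maximum(p1, p2)
    selected = p_min <= c
    s = max(selected.sum(), 1)
    rejected = selected & (p_max <= alpha / s)
    return np.flatnonzero(rejected)

rng = np.random.default_rng(7)
m = 1000
p1, p2 = rng.uniform(size=m), rng.uniform(size=m)  # null mediators
p1[:5] = p2[:5] = 1e-6                             # five true mediators
print(screenmin(p1, p2, c=0.05 / m))               # should recover indices 0..4
```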

Finding out why

09 Saturday Nov 2019

Posted by Michael Laux in Causality

≈ Leave a comment

Paper: Feature relevance quantification in explainable AI: A causality problem

We discuss promising recent contributions on quantifying feature relevance using Shapley values, where we observed some confusion on which probability distribution is the right one for dropped features. We argue that the confusion is based on not carefully distinguishing between *observational* and *interventional* conditional probabilities and attempt a clarification based on Pearl's seminal work on causality. We conclude that *unconditional* rather than *conditional* expectations provide the right notion of *dropping* features, in contradiction to the theoretical justification of the software package SHAP. Parts of SHAP are unaffected because unconditional expectations (which we argue to be conceptually right) are used as an *approximation* for the conditional ones, which encouraged others to 'improve' SHAP in a way that we believe to be flawed. Further, our criticism concerns TreeExplainer in SHAP, which really uses conditional expectations (without approximating them by unconditional ones).
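
The distinction is easy to see numerically. Below, f depends only on x1, and x2 is almost a copy of x1: when x1 is "dropped", the unconditional (interventional) expectation gives x2 no credit, while the conditional (observational) expectation lets x2 inherit x1's influence through correlation. A toy illustration, not SHAP's actual code path.

```python
import numpy as np

rng = np.random.default_rng(8)
x1 = rng.normal(size=100_000)
x2 = x1 + 0.1 * rng.normal(size=100_000)          # nearly a copy of x1
f = lambda a, b: a                                 # f ignores x2 entirely

point = (2.0, 2.0)                                 # instance being explained
# Drop x1, keep x2 fixed at the instance value:
unconditional = f(x1, point[1]).mean()             # average over x1's marginal
conditional = f(x1[np.abs(x2 - point[1]) < 0.05], point[1]).mean()
print(unconditional)  # ~0: x2 gets no credit, matching the causal reading
print(conditional)    # ~2: x2 'inherits' credit for x1 via correlation
```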


Paper: Efficient Identification in Linear Structural Causal Models with Instrumental Cutsets

One of the most common mistakes made when performing data analysis is attributing causal meaning to regression coefficients. Formally, a causal effect can only be computed if it is identifiable from a combination of observational data and structural knowledge about the domain under investigation (Pearl, 2000, Ch. 5). Building on the literature of instrumental variables (IVs), a plethora of methods has been developed to identify causal effects in linear systems. Almost invariably, however, the most powerful such methods rely on exponential-time procedures. In this paper, we investigate graphical conditions to allow efficient identification in arbitrary linear structural causal models (SCMs). In particular, we develop a method to efficiently find unconditioned instrumental subsets, which are generalizations of IVs that can be used to tame the complexity of many canonical algorithms found in the literature. Further, we prove that determining whether an effect can be identified with TSID (Weihs et al., 2017), a method more powerful than unconditioned instrumental sets and other efficient identification algorithms, is NP-Complete. Finally, building on the idea of flow constraints, we introduce a new and efficient criterion called Instrumental Cutsets (IC), which is able to solve for parameters missed by all other existing polynomial-time algorithms.
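
The primitive that instrumental sets and cutsets generalize is the classical single-instrument two-stage least squares estimator, sketched here on toy confounded data (the variable names and coefficients are made up).

```python
import numpy as np

def two_sls(z, x, y):
    """Two-stage least squares with a single instrument z for treatment x."""
    Z = np.column_stack([np.ones_like(z), z])
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # stage 1: project x on z
    Xh = np.column_stack([np.ones_like(x_hat), x_hat])
    return np.linalg.lstsq(Xh, y, rcond=None)[0][1]    # stage 2: slope on x_hat

rng = np.random.default_rng(9)
u = rng.normal(size=10_000)                  # unobserved confounder
z = rng.normal(size=10_000)                  # instrument: affects y only via x
x = z + u + rng.normal(size=10_000)
y = 2.0 * x + 3.0 * u + rng.normal(size=10_000)
print(np.polyfit(x, y, 1)[0])   # naive regression: biased by the confounder
print(two_sls(z, x, y))         # ~2.0, the true causal coefficient
```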


Paper: Mathematical decisions and non-causal elements of explainable AI

Recent conceptual discussion on the nature of the explainability of Artificial Intelligence (AI) has largely been limited to data-driven investigations. This paper identifies some shortcomings of this approach to help strengthen the debate on this subject. Building on recent philosophical work on the nature of explanations, I demonstrate the significance of two non-data driven, non-causal explanatory elements: (1) mathematical structures that are the grounds for capturing the decision-making situation; (2) statistical and optimality facts in terms of which the algorithm is designed and implemented. I argue that these elements feature directly in important aspects of AI explainability. I then propose a hierarchical framework that acknowledges the existence of various types of explanation, each of which reveals an aspect of explanation, and answers to a different kind of why-question. The usefulness of this framework will be illustrated by bringing it to bear on some salient normative concerns about the use of AI decision-making systems in society.


Paper: Bayesian causal inference via probabilistic program synthesis

Causal inference can be formalized as Bayesian inference that combines a prior distribution over causal models and likelihoods that account for both observations and interventions. We show that it is possible to implement this approach using a sufficiently expressive probabilistic programming language. Priors are represented using probabilistic programs that generate source code in a domain specific language. Interventions are represented using probabilistic programs that edit this source code to modify the original generative process. This approach makes it straightforward to incorporate data from atomic interventions, as well as shift interventions, variance-scaling interventions, and other interventions that modify causal structure. This approach also enables the use of general-purpose inference machinery for probabilistic programs to infer probable causal structures and parameters from data. This abstract describes a prototype of this approach in the Gen probabilistic programming language.


Paper: Causal Inference via Conditional Kolmogorov Complexity using MDL Binning

Recent developments have linked causal inference with Algorithmic Information Theory, and methods have been developed that utilize Conditional Kolmogorov Complexity to determine causation between two random variables. However, past methods that utilize Algorithmic Information Theory do not provide off-the-shelf support for continuous data types and thus lose the essence of the shape of data. We present a method for inferring causal direction between continuous variables by using an MDL Binning technique for data discretization and complexity calculation. Our method captures the shape of the data and uses it to determine which variable has more information about the other. Its high predictive performance and robustness are shown on several real-world use cases.


Paper: Towards A Logical Account of Epistemic Causality

Reasoning about observed effects and their causes is important in multi-agent contexts. While there has been much work on causality from an objective standpoint, causality from the point of view of some particular agent has received much less attention. In this paper, we address this issue by incorporating an epistemic dimension to an existing formal model of causality. We define what it means for an agent to know the causes of an effect. Then using a counterexample, we prove that epistemic causality is a different notion from its objective counterpart.


Paper: Evaluation of Granger causality measures for constructing networks from multivariate time series

Granger causality and variants of this concept allow the study of complex dynamical systems as networks constructed from multivariate time series. In this work, a large number of Granger causality measures used to form causality networks from multivariate time series are assessed. These measures span the time domain (model-based and information-theoretic measures), the frequency domain, and the phase domain. The study also aims to compare bivariate and multivariate measures, linear and nonlinear measures, as well as the use of dimension reduction in linear model-based measures and information measures. The latter is particularly relevant in the study of high-dimensional time series. To assess the performance of the multivariate causality measures, low- and high-dimensional coupled dynamical systems are considered, in discrete and continuous time, both deterministic and stochastic. The measures are evaluated and ranked according to their ability to provide causality networks that match the original coupling structure. The simulation study concludes that the Granger causality measures using dimension reduction are superior and should be preferred, particularly in studies involving many observed variables, such as multi-channel electroencephalograms and financial markets.


Paper: Scaling structural learning with NO-BEARS to infer causal transcriptome networks

Constructing gene regulatory networks is a critical step in revealing disease mechanisms from transcriptomic data. In this work, we present NO-BEARS, a novel algorithm for estimating gene regulatory networks. The NO-BEARS algorithm is built on the basis of the NOTEARS algorithm with two improvements. First, we propose a new constraint and its fast approximation to reduce the computational cost of the NOTEARS algorithm. Next, we introduce a polynomial regression loss to handle non-linearity in gene expressions. Our implementation utilizes modern GPU computation to reduce hours of CPU computation to seconds. Using synthetic data, we demonstrate improved performance, both in processing time and accuracy, on inferring gene regulatory networks from gene expression data.
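
For context, the constraint NO-BEARS speeds up is NOTEARS's smooth acyclicity measure h(W) = tr(e^{W∘W}) − d, which is zero exactly when the weighted adjacency matrix W encodes a DAG. A minimal sketch of that baseline (the paper's cheaper approximation is not reproduced here):

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """NOTEARS acyclicity measure h(W) = tr(exp(W * W)) - d, zero iff the
    weighted adjacency matrix W is a DAG (elementwise square, then expm)."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

dag = np.array([[0.0, 1.0], [0.0, 0.0]])     # 0 -> 1, acyclic
cyc = np.array([[0.0, 1.0], [1.0, 0.0]])     # 0 <-> 1, cyclic
print(notears_acyclicity(dag))   # ~0.0
print(notears_acyclicity(cyc))   # > 0
```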