Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality

We present a simple graphical theory unifying causal directed acyclic graphs (DAGs) and potential (aka counterfactual) outcomes via a node-splitting transformation. We introduce a new graph, the Single-World Intervention Graph (SWIG). The SWIG encodes the counterfactual independences associated with a specific hypothetical intervention on the set of treatment variables. The nodes on the SWIG are the corresponding counterfactual random variables. We illustrate the theory with a number of examples. Our graphical theory of SWIGs may be used to infer the counterfactual independence relations implied by the counterfactual models developed in Robins (1986, 1987). Moreover, in the absence of hidden variables, the joint distribution of the counterfactuals is identified; the identifying formula is the extended g-computation formula introduced in (Robins et al., 2004). Although Robins (1986, 1987) did not use DAGs, we translate his algebraic results to facilitate understanding of this prior work. An attractive feature of Robins’ approach is that it largely avoids making counterfactual independence assumptions that are experimentally untestable. As an important illustration we revisit the critique of Robins’ g-computation given in (Pearl, 2009, Ch. 11.3.7); we use SWIGs to show that all of Pearl’s claims are either erroneous or based on misconceptions. We also show that simple extensions of the formalism may be used to accommodate dynamic regimes, and to formulate non-parametric structural equation models in which assumptions relating to the absence of direct effects are formulated at the population level. Finally, we show that our graphical theory also naturally arises in the context of an expanded causal Bayesian network in which we are able to observe the natural state of a variable prior to intervention.


Introduction to Causal Inference

This is a lecture/lab within the M.Sc. program Sociology and Empirical Social Research at the University of Cologne, Germany. It's located in the core module Sociology III, which aims to familiarize students with specific quantitative methods of data analysis.


13 Common Mistakes Amateur Data Scientists Make and How to Avoid Them

In this article, I discuss the top mistakes amateur data scientists make (I have made some of them myself). I have also provided resources wherever applicable with the aim of helping you avoid these pitfalls on your data science journey.
1. Learning Theoretical Concepts without Applying Them
2. Heading Straight for Machine Learning Techniques without Learning the Prerequisites
3. Relying Solely on Certifications and Degrees
4. Assuming that what you see in ML Competitions is what Real-Life Jobs are Like
5. Focusing on Model Accuracy over Applicability and Interpretability in the Domain
6. Using too many Data Science Terms in your Resume
7. Giving Tools and Libraries Precedence over the Business Problem
8. Not Spending Enough Time on Exploring and Visualizing the Data (Curiosity)
9. Not Having a Structured Approach to Problem Solving
10. Trying to Learn Multiple Tools at Once
11. Not Studying in a Consistent Manner
12. Shying Away from Discussions and Competitions
13. Not working on Communication Skills


Support Vector Machines with Scikit-learn

In this tutorial, you'll learn about Support Vector Machines, one of the most popular and widely used supervised machine learning algorithms.
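The tutorial's own examples are not reproduced here, but the basic scikit-learn workflow it covers can be sketched in a few lines; the iris dataset is used purely as stand-in data.

```python
# A minimal sketch of training an SVM classifier with scikit-learn.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a small bundled dataset and hold out 30% for testing.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# RBF kernel is the scikit-learn default; C controls regularization.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping `kernel="rbf"` for `"linear"` or `"poly"` changes the decision boundary the SVM can represent, which is typically the first hyperparameter to explore.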


New Course: Machine Learning with Tree-Based Models in Python

Learn about our new Python course. You'll learn how to train decision trees and other tree-based models with the user-friendly scikit-learn machine learning library.
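The course material itself is not shown here, but the kind of workflow it teaches can be sketched with scikit-learn's `DecisionTreeClassifier`; the iris dataset again serves as stand-in data.

```python
# A minimal sketch of fitting a decision tree with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# A shallow tree (max_depth=3) limits overfitting on small datasets.
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)
tree_accuracy = tree.score(X_test, y_test)
print(f"test accuracy: {tree_accuracy:.2f}")
```

Ensembles of such trees (random forests, gradient boosting) follow the same `fit`/`score` interface, which is what makes scikit-learn convenient for comparing tree-based models.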


New Course: Python for R Users

Python and R have seen immense growth in popularity in the ‘Machine Learning Age’. Both are high-level languages that are easy to learn and write. The language you use will depend on your background and field of study and work. R is a language made by and for statisticians, whereas Python is a more general-purpose programming language. Regardless of your background, there will be times when a particular algorithm is implemented in one language and not the other, a feature is better documented, or, simply, the tutorial you found online uses Python instead of R. In any of these cases, the R user needs either to work in Python to get the job done, or to understand how something is implemented in Python so it can be translated into R. This course helps you cross the R–Python language barrier.
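The course content is not reproduced here, but the kind of translation it targets can be illustrated with one small, invented example: a group-wise mean, written with pandas instead of R's dplyr.

```python
# In R (dplyr):  df %>% group_by(species) %>% summarise(mean_size = mean(size))
# The pandas equivalent uses groupby + an aggregation method.
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "size": [1.0, 3.0, 2.0, 4.0],
})
means = df.groupby("species")["size"].mean()
print(means)  # a -> 2.0, b -> 3.0
```

Much of the R-to-Python transition follows this pattern: data-frame verbs map one-to-one onto pandas methods, while the surrounding syntax (pipes vs. method chaining) differs.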


AI Solutionism

Although media headlines imply we are already living in a future where AI has infiltrated every aspect of society, this actually sets unrealistic expectations about what AI can really do for humanity. Governments around the world are racing to pledge support to AI initiatives, but they tend to understate the complexity of deploying advanced machine learning systems in the real world. This article reflects on the risks of ‘AI solutionism’: the increasingly popular belief that, given enough data, machine learning algorithms can solve all of humanity's problems. There is no AI solution for everything. All solutions come at a cost, and not everything that can be automated should be.


GDPR after 2 months – What does it mean for Machine Learning?

Almost two months on from the introduction of the GDPR, how has machine learning been affected? And what does the future hold?


Claudette – automated CLAUse DETectEr

Machine Learning Powered Analysis of Consumer Contracts and Privacy Policies. CLAUDETTE – ‘automated CLAUse DETectEr’ – is an interdisciplinary research project hosted at the Law Department of the European University Institute, led by professors Giovanni Sartor and Hans-W. Micklitz, in cooperation with engineers from the University of Bologna and the University of Modena and Reggio Emilia. The research objective is to test to what extent it is possible to automate the reading and legal assessment of online consumer contracts and privacy policies, evaluating their compliance with the EU's unfair contractual terms law and personal data protection law (GDPR) using machine learning and grammar-based approaches. The idea arose out of bewilderment. Having read dozens of terms of service and privacy policies of online platforms, we came to the conclusion that, despite the substantive law in place and despite enforcers' competence for abstract control, providers of online services still tend to use unfair and unlawful clauses in these documents. Hence the idea to automate parts of the enforcement process by delegating certain tasks to machines. On the one hand, we believe that relying on automation can increase the quality and effectiveness of enforcers' legal work. On the other, we want to empower consumers themselves by giving them tools to quickly assess whether what they agree to online is fair and/or lawful.
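The project's actual models and corpus are not public here; the following is only a toy sketch of the underlying idea — treating clause fairness as a text-classification problem. The clauses and labels below are invented for illustration.

```python
# Toy sketch: bag-of-words classification of contract clauses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples: 1 = potentially unfair, 0 = fair.
clauses = [
    "We may terminate your account at any time without notice.",
    "We may change these terms at our sole discretion at any time.",
    "You may cancel your subscription at any time.",
    "We will notify you 30 days before any change to these terms.",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(clauses, labels)

new_clause = "We reserve the right to suspend the service without notice."
pred = model.predict([new_clause])[0]
print(f"predicted label: {pred}")
```

A real system would need far more labeled data, legally grounded annotation guidelines, and (as the project notes) grammar-based rules alongside statistical models.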


Early Detection of Students at Risk – Predicting Student Dropouts Using Administrative Student Data and Machine Learning Methods

High rates of student attrition in tertiary education are a major concern for universities and public policy, as dropout is not only costly for the students but also wastes public funds. To successfully reduce student attrition, it is imperative to understand which students are at risk of dropping out and what the underlying determinants of dropout are. We develop an early detection system (EDS) that uses machine learning and classic regression techniques to predict student success in tertiary education as a basis for targeted intervention. The method developed in this paper is highly standardized and can be easily implemented in every German institution of higher education, as it uses student performance and demographic data collected, stored, and maintained by legal mandate at all German universities; it therefore self-adjusts to the university where it is employed. The EDS uses regression analysis and machine learning methods, such as neural networks, decision trees, and the AdaBoost algorithm, to identify the student characteristics that distinguish potential dropouts from graduates. The EDS we present is tested and applied at a medium-sized state university with 23,000 students and a medium-sized private university of applied sciences with 6,700 students. The two institutions differ considerably in their organization, tuition fees, and student–teacher ratios. Our results indicate a prediction accuracy at the end of the first semester of 79% for the state university and 85% for the private university of applied sciences. Furthermore, the accuracy of the EDS increases with each completed semester as new performance data become available. After the fourth semester, the accuracy improves to 90% for the state university and 95% for the private university of applied sciences. On the day of enrollment, relying only on demographic data, the accuracy is 68% for the state university and 67% for the private university.
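The administrative data behind the EDS is of course not available, but one of the classifiers the abstract names — AdaBoost over decision trees — can be sketched on synthetic stand-in features. The feature names and the dropout rule below are entirely invented.

```python
# Toy sketch: AdaBoost dropout prediction on synthetic "student" data.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
grades = rng.normal(2.5, 0.8, n)    # invented: average grade (lower = better in Germany)
credits = rng.normal(25.0, 8.0, n)  # invented: credits earned per semester
X = np.column_stack([grades, credits])

# Invented rule: weak grades combined with few credits => dropout risk.
dropout = ((grades > 3.0) & (credits < 22.0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, dropout, random_state=0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
eds_accuracy = clf.score(X_te, y_te)
print(f"test accuracy: {eds_accuracy:.2f}")
```

Note that with a rare dropout class, raw accuracy can look high even for a trivial classifier, which is why systems like the EDS are usually also evaluated on how well they rank at-risk students.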


Seaborn Categorical Plots in Python

Seaborn is a Python visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive statistical graphics, with support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
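A minimal sketch of one seaborn categorical plot, `boxplot`, drawn from a small invented DataFrame (others in the same family include `stripplot`, `violinplot`, and `barplot`):

```python
# Render off-screen so the script also runs without a display.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented stand-in data: restaurant bills grouped by day.
data = pd.DataFrame({
    "day": ["Thu", "Thu", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 14.0, 12.0, 18.0, 20.0, 25.0],
})

# One box per category on the x-axis; seaborn handles the grouping.
ax = sns.boxplot(x="day", y="total_bill", data=data)
ax.set_title("Total bill by day")
plt.savefig("tips_boxplot.png")
```

Because seaborn returns ordinary matplotlib axes, the plot can be customized further with any matplotlib call before saving.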


What is a Minimum Viable (Data) Product?

A personal view on what MVP means for machine learning products…