The Grammar of Data Science: Python vs R
In this post, I will elaborate on my experience switching teams by comparing and contrasting R and Python solutions to some simple data exploration exercises.
Makefiles and RMarkdown
Quite some time ago (October 2013, according to Amazon), I bought a copy of ‘Reproducible Research with R and RStudio’ by Christopher Gandrud. And it was awesome. Since then, I’ve been using knitr and RMarkdown quite a lot. However, until recently, I never bothered with a makefile. At the time, I had assumed that it was something only available to people on *nix systems and back then I was developing exclusively on PC. I even wrote some R scripts that were more or less makefiles; reading the contents of a directory, checking for output and running render or knit or whatever. My workflow continued to evolve and get standardised, I moved to Linux and so I picked up Gandrud’s book again to review the bits about makefiles. I’m not sure when it was that I realized that RTools includes a make program for Windows, but I wish someone had told me that a couple years ago. So, enough preamble. What’s the benefit and how does it work?
Using the R MatchIt package for propensity score analysis
Descriptive analysis between treatment and control groups can reveal interesting patterns or relationships, but we cannot always take descriptive statistics at face value. Regression and matching methods allow us to make controlled comparisons to reduce selection bias in observational studies. For a couple good references that I am basically tracking in this post see here and here. These are links to the pages of the package authors and a nice paper (A Step by Step Guide to Propensity Score Matching in R) from higher education evaluation research respectively.
R style default plot for Pandas DataFrame
The default plot method for dataframes in R is to show each numeric variable in a pair-wise scatter plot. I find this to be a really useful first look at a dataset, both to see correlations and joint distributions between variables, but also to quickly diagnose potential strangeness like bands of repeating values or outliers. From what I can tell, there are no builtins in the python data ecosystem (numpy, pandas, matplotlib) for this so I coded up a function to emulate the R behaviour. You can get it in this gist (feedback welcomed). Here’s an example of it in action showing derived time-series features (12 hour rates of change) for some clinical variables.
Space Launch Sites over Time
Display Space Launches by site and time