Why Topological Data Analysis Works
Topological data analysis has been very successful in discovering information in many large and complex data sets. In this post, I would like to discuss the reasons why it is an effective methodology. One of the key messages around topological data analysis is that data has shape and the shape matters. Although it may appear to be a new message, in fact it describes something very familiar.
Working with “large” datasets, with dplyr and data.table
A few months ago, I was doing some training on data science for actuaries, and I started to get interesting, puzzling questions. For instance, Fleur was working on telematics data, and she’s been challenging my (rudimentary) knowledge of R. As claimed by Donald Knuth, ‘we should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil’. So usually, in my courses and my training, the code is very basic and easy to understand, but usually not very efficient. Since I was challenged to work on very large datasets, we’ve been working on R functions to manipulate those possibly (very) large datasets, and to run some simple functions as fast as possible (with simple filter and aggregation functions). In order to illustrate, let us generate our ‘large’ telematics dataset. Assume that we have 10,000 drivers, each of them drives about 200 times, and each time, we have, say, 80 locations. That means around 160 million observations. It is ‘large’, but not huge.
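A scaled-down version of such a dataset can be sketched as follows. This is only an illustration under stated assumptions: the sizes are cut to 100 drivers and 20 trips so that it runs in seconds, and the column names (`driver`, `trip`, `location`, `speed`) and the 50 km/h filter are hypothetical, not taken from the original post.

```r
library(data.table)
library(dplyr)

set.seed(1)

# Assumed, reduced sizes: 100 drivers x 20 trips x 80 locations = 160,000 rows
# (the text describes 10,000 x 200 x 80 = 160 million).
n_drivers   <- 100
n_trips     <- 20
n_locations <- 80

# Cross join of all driver / trip / location combinations.
trips_dt <- CJ(driver   = seq_len(n_drivers),
               trip     = seq_len(n_trips),
               location = seq_len(n_locations))

# Hypothetical measurement: a simulated speed at each location.
trips_dt[, speed := runif(.N, min = 0, max = 130)]

# data.table: filter and aggregate in a single call.
mean_dt <- trips_dt[speed > 50, .(mean_speed = mean(speed)), by = driver]

# dplyr: the same filter and aggregation written as a pipeline.
mean_dplyr <- as.data.frame(trips_dt) %>%
  filter(speed > 50) %>%
  group_by(driver) %>%
  summarise(mean_speed = mean(speed))
```

Both objects hold one mean speed per driver; on the full-size data, wrapping each call in `system.time()` would be one way to compare the two approaches.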
Call R and Python from base SAS
Since 2009, it has been possible to call R from SAS programs. However, this integration requires IML, an add-on matrix-object language for SAS which isn’t available with all SAS installations and is separate from the standard SAS PROC execution model. Now, engineers at SAS have shared a method of calling R, Python and other open-source tools using the Java connectivity provided in base SAS. The first step is to install a Java class (shared on GitHub under an Apache license), SASJavaExec.jar. Then, you can use the SAS Java Object in the DATA step to call out to a separately authored R or Python script. You should write the script to generate its output in CSV format (using, say, write.table in R), so that the output can then be used in a subsequent SAS PROC.
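The R side of this workflow might look like the sketch below. It covers only the script that SAS would launch, not the SAS or Java call itself; the output file name and the toy data frame are assumptions made for illustration.

```r
# Hypothetical output location; in practice this would be a path that the
# subsequent SAS PROC is set up to read.
out_file <- file.path(tempdir(), "r_output.csv")

# Stand-in for whatever the script actually computes.
result <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))

# Write comma-separated output with a header row and no row names.
write.table(result,
            file      = out_file,
            sep       = ",",
            row.names = FALSE,
            col.names = TRUE,
            quote     = FALSE)
```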
R: 10^3 vs. 1e3
This is presumably obvious to most if not all R programmers, but I became aware today of a hugely (?) delaying tactic in my R code. I was working with Jean-Michel and Natesh [who are visiting at the moment] and when coding an MCMC run I was telling them that I usually preferred to code Nsim=1000 as Nsim=10^3 for readability reasons. Suddenly, I became worried that this representation involved a computation, as opposed to Nsim=1e3, and ran a little experiment:
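The excerpt stops before the experiment itself. One way such a comparison could be set up is sketched below; it is not the author's code, the repetition count is an arbitrary assumption, and the timings depend on the machine and the R version, so no result is claimed here.

```r
# Both notations denote the same numeric value.
same_value <- isTRUE(all.equal(10^3, 1e3))

# Arbitrary number of repetitions, chosen only so the loop takes measurable time.
n_rep <- 1e6

# Time repeated assignment using the power notation ...
time_power <- system.time(for (i in seq_len(n_rep)) x <- 10^3)

# ... and using the literal notation.
time_literal <- system.time(for (i in seq_len(n_rep)) x <- 1e3)

# Elapsed seconds for each variant.
print(c(power   = time_power[["elapsed"]],
        literal = time_literal[["elapsed"]]))
```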
Predicting events, when they haven’t happened yet
Suppose you have to predict the probabilities of events which haven’t happened yet. How do you do this? Here is an example from the 1950s when Longley-Cook, an actuary at an insurance company, was asked to price the risk for a mid-air collision of two planes, an event which as far as he knew hadn’t happened before. The civilian airline industry was still very young, but rapidly growing, and all Longley-Cook knew was that there had been no collisions in the previous 5 years.
Clusters May Be Categorical but Cluster Membership Is Not All-or-None
Very early in the study of statistics and R, we learn that random variables can be either categorical or continuous. Regrettably, we are forced to relearn this distinction over and over again as we debug error messages produced by our code (e.g., ‘x must be numeric’). R will remind us that if the function expects an argument to be a factor, our input ought to be a factor (although sometimes the function will do the conversion for us). Dichotomous variables do give us some flexibility, for sex can be entered as a factor with values ‘male’ and ‘female’ or coded as numeric with values of 0 and 1 indicating degree of ‘maleness’ or ‘femaleness’ depending on whether male or female is assigned the value of 1. Similarly, when the categorical variable has many levels, there is no reason not to select one of the levels as the basis for comparison. Then, the dummy coding remains 0 and 1 with the base level coded as 0s for all the comparisons (e.g., Catholic vs Protestant, Jewish vs Protestant, and so on).
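The two codings described above can be illustrated with a few lines of base R. The small vectors below are made-up examples, not data from the post.

```r
# A dichotomous variable entered as a factor ...
sex <- factor(c("male", "female", "female", "male"))

# ... or as a 0/1 numeric, here with 'female' assigned the value 1.
female <- as.numeric(sex == "female")

# A categorical variable with several levels (hypothetical values).
religion <- factor(c("Protestant", "Catholic", "Jewish",
                     "Catholic", "Protestant"))

# Choose 'Protestant' as the base level for all comparisons.
religion <- relevel(religion, ref = "Protestant")

# Dummy coding: one 0/1 column per non-base level; the base level
# is the row pattern with 0 in every dummy column.
dummies <- model.matrix(~ religion)
print(dummies)
```

The columns `religionCatholic` and `religionJewish` correspond to the Catholic vs Protestant and Jewish vs Protestant comparisons.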