Causal Structure Learning

Graphical models can represent a multivariate distribution in a convenient and accessible form as a graph. Causal models can be viewed as a special class of graphical models that represent not only the distribution of the observed system but also the distributions under external interventions. They hence enable predictions under hypothetical interventions, which is important for decision making. The challenging task of learning causal models from data always relies on some underlying assumptions. We discuss several recently proposed structure learning algorithms and their assumptions, and we compare their empirical performance under various scenarios.

Enterprise Data Lake Architecture

The diagram below shows an Enterprise Data Lake that ingests data from typical systems such as CRM, ERP and other transactional systems. In addition, it is fed unstructured data from web logs, social media, IoT devices and third-party sites (such as DMP, D&B), creating a single data repository. This rich data ecosystem can now support combining multiple sources of data for more accurate analytics and previously impossible insights into business operations.

Some Tools for Writing Shiny Apps

Recently, I have written and hosted several multi-user Shiny apps. One example is Taddle, which runs on my own server. Taddle is a multi-user Shiny app that offers a free and simple way to allocate seminar topics. (Next time you need to assign seminar topics, why not have a look at Taddle?)

Shiny slider examples with the intrval R package

The intrval R package is lightweight (~11K), standalone (apart from importing from graphics, it has exactly 0 non-base dependencies), and it has a very narrow scope: it implements relational operators for intervals, which is very well aligned with the tiny manifesto. In this post we will explore the use of the package in two Shiny apps with sliders. The first example uses a regular slider that returns a single value. To make that an interval, we will use the standard deviation (SD, sigma) in a quality control chart (QCC). The code is based on the pistonrings data set from the qcc package. The Shewhart chart sets 3-sigma limits to indicate a state of control. The slider is used to adjust the sigma limit, and the GIF below plays it as an animation.

Is it time to ditch the Comparison of Means (T) Test?

For over a century, academics have been teaching the Student’s T-test and practitioners have been running it to determine if the mean values of a variable for two groups were statistically different. It is time to ditch the Comparison of Means (T) Test and rely instead on the ordinary least squares (OLS) Regression. My motivation for this suggestion is to reduce the learning burden on non-statisticians whose goal is to find a reliable answer to their research question. The current practice is to devote a considerable amount of teaching and learning effort to statistical tests that are redundant in the presence of disaggregate data sets and readily available tools to estimate Regression models. Before I proceed any further, I must confess that I remain a huge fan of William Sealy Gosset, who introduced the t-statistic in 1908. He excelled in intellect and academic generosity. Mr. Gosset published the very paper that introduced the t-statistic under a pseudonym, the Student. To this day, the T-test is known as the Student’s T-test. My plea is to replace the Comparison of Means (T-test) with OLS Regression, which of course relies on the T-test to assess its coefficients. So, I am not necessarily asking to ditch the T-test but instead asking to replace the Comparison of Means Test with OLS Regression.
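The redundancy claimed above can be checked numerically. The following is a minimal sketch in plain Python, not taken from the original article: the data values and the function names pooled_t and ols_slope_t are made up for illustration. It assumes the classical equal-variance (pooled) Student's t-test, not Welch's variant, and compares its statistic with the t statistic of the slope in an OLS regression of the outcome on a 0/1 group indicator.

```python
import math
from statistics import mean


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    # Pooled sum of squares around each group's own mean.
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    s2 = ss / (na + nb - 2)
    return (mb - ma) / math.sqrt(s2 * (1 / na + 1 / nb))


def ols_slope_t(x, y):
    """t statistic of the slope b1 in the simple regression y = b0 + b1 * x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    # Residual sum of squares and the standard error of the slope.
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1 / se_b1


# Hypothetical measurements for two groups (illustrative values only).
group_a = [4.1, 5.0, 3.8, 4.6, 5.2]
group_b = [5.9, 6.3, 5.1, 6.8, 6.0, 5.5]

# Stack the outcome and code group membership as a 0/1 dummy variable.
y = group_a + group_b
x = [0] * len(group_a) + [1] * len(group_b)

t_test = pooled_t(group_a, group_b)
t_ols = ols_slope_t(x, y)
print(t_test, t_ols)
```

With a 0/1 indicator, the OLS slope equals the difference in group means and its standard error reduces to the pooled standard error, so the two statistics agree up to floating-point rounding.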

Understanding rolling calculations in R

In R, we often need to get values or perform calculations from information not on the same row. We need to either retrieve specific values or we need to produce some sort of aggregation. This post explores some of the options and explains the weird (to me at least!) behaviours around rolling calculations and alignments. We can retrieve earlier values by using the lag() function from dplyr.
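The two ideas in this summary, fetching an earlier row's value and aligning a rolling aggregate, can be sketched outside R as well. The block below is a plain-Python analogue written for illustration: the functions lag and rolling_mean and the align argument are hypothetical helpers defined here, and they are not the dplyr implementation or any R package's API. None stands in for R's NA.

```python
def lag(values, n=1, default=None):
    """Shift values down by n positions, padding the start with `default`."""
    values = list(values)
    if n < 0:
        raise ValueError("n must be non-negative")
    pad = min(n, len(values))
    return [default] * pad + values[:len(values) - pad]


def rolling_mean(values, window, align="right"):
    """Rolling mean; `align` decides which row of the window gets the result."""
    values = list(values)
    if window < 1:
        raise ValueError("window must be at least 1")
    offsets = {"left": 0, "center": (window - 1) // 2, "right": window - 1}
    if align not in offsets:
        raise ValueError("align must be 'left', 'center' or 'right'")
    out = [None] * len(values)
    for start in range(len(values) - window + 1):
        # The same window mean lands on a different row depending on `align`.
        out[start + offsets[align]] = sum(values[start:start + window]) / window
    return out


data = [1, 2, 3, 4, 5]
lagged = lag(data)
right = rolling_mean(data, 3, align="right")
center = rolling_mean(data, 3, align="center")
print(lagged, right, center)
```

The window means are identical in every case; only the row they are attached to changes, which is usually what makes rolling output look surprising at first.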

Platform Deprecation Strategy

In an effort to streamline product development, maintenance, and support to ensure the best experience for our users, we have created a strategy for operating system and browser deprecation. This will allow us to focus our work on modern platforms, and to encourage best practices in R development. This policy applies to all of our products and packages, and has been posted to the website here. This strategy is included in our Support Agreement, the full text of which can be found here. The current support end dates have been chosen based on the OS or browser end-of-life dates, and are generally aligned with the dates on which the latest version of R can no longer be installed on them. Note that the first deprecation takes place on April 2, 2018, when Internet Explorer 10 and Ubuntu 12.04 will no longer be supported on new releases of our software.

Reptile: A Scalable Meta-Learning Algorithm

We’ve developed a simple meta-learning algorithm called Reptile which works by repeatedly sampling a task, performing stochastic gradient descent on it, and updating the initial parameters towards the final parameters learned on that task. This method performs as well as MAML, a broadly applicable meta-learning algorithm, while being simpler to implement and more computationally efficient.
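The update described above (sample a task, run SGD on it, move the initial parameters towards the result) can be sketched on a toy problem. This is a minimal illustration in plain Python, not the authors' implementation: the task family (one-dimensional regressions y = slope * x with slopes drawn from [1.5, 2.5]), the step sizes, and the function names inner_sgd and reptile are all assumptions made for the example.

```python
import random


def inner_sgd(w, slope, rng, steps=5, lr=0.1):
    """Run a few SGD steps on one task: fit y = slope * x with the model w * x."""
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        # Gradient of the squared error (w*x - slope*x)^2 with respect to w.
        grad = 2.0 * (w * x - slope * x) * x
        w -= lr * grad
    return w


def reptile(meta_iters=1000, meta_lr=0.1, seed=0):
    """Reptile-style outer loop on a scalar parameter (toy setting)."""
    rng = random.Random(seed)
    theta = 0.0  # initial parameters shared across tasks
    for _ in range(meta_iters):
        slope = rng.uniform(1.5, 2.5)        # sample a task
        phi = inner_sgd(theta, slope, rng)   # adapt to it with SGD
        theta += meta_lr * (phi - theta)     # move the initialisation towards phi
    return theta


theta = reptile()
print(theta)
```

In this toy setting the learned initialisation drifts into the range of task slopes, so a few SGD steps on any new task from the same family start close to its solution.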

From Local Machine to Dask Cluster with Terraform

Learn how you can take local code that does grid search with the Scikit-Learn package to a cluster of AWS (EC2) nodes with Terraform.

Great Data Scientists Don’t Just Think Outside the Box, They Redefine the Box

One of a data scientist’s most important characteristics is that they refuse to take “it can’t be done” as an answer. They are willing to try different variables and metrics, and different types of advanced analytic algorithms, to see if there is another way to predict performance. By the way, I included this image just because I thought it was cool. This graphic measures the activity between different IT systems. Just like with data science, this image shows there’s no lack of variables to consider when building your Machine Learning and Deep Learning models!