Automatically Storing Results from Analyzed Data Sets

This is the fifth article in a series teaching you how to write programs that automatically analyze scientific data. The first presented the concept and motivation, then laid out the high-level steps. The second taught you how to structure data sets to make automated data analysis possible, and automatically identify the conditions of each test. The third article discussed creating a for loop that automatically performs calculations on each test result and saves the results. The fourth post covered what is likely the most important part: automatically checking the data and analysis for errors. This fifth post will teach you how to store data in a logical folder structure, enabling easy access to the data for regression development and validation.
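
As a rough illustration of what a "logical folder structure" could look like, here is a minimal Python sketch. The condition names (test_type, flow_rate_lpm, inlet_temp_c) and the folder layout are hypothetical and are not taken from the article; the idea is only that each test's conditions determine where its results are saved, so they can be found again without searching.

```python
import csv
from pathlib import Path


def result_path(root, test_type, flow_rate_lpm, inlet_temp_c):
    """Build a path that encodes the (hypothetical) test conditions."""
    folder = Path(root) / test_type / f"flow_{flow_rate_lpm}lpm"
    return folder / f"inlet_{inlet_temp_c}C.csv"


def save_result(root, test_type, flow_rate_lpm, inlet_temp_c, header, rows):
    """Write one test's results to the folder matching its conditions."""
    path = result_path(root, test_type, flow_rate_lpm, inlet_temp_c)
    # Create any missing parent folders before writing the file.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

With this layout, every result for a given test type and flow rate sits in one folder, so a later regression script can gather them with a single glob pattern such as `Path(root).glob("steady_state/*/*.csv")`.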

Understanding Understanding: how anything has meaning

Our understanding is how we realize our world. It is our ‘mind’s eye’, the place where we think, and the only thing that gives meaning to anything. Every other experience outside of understanding is perceived through feeling. What you’re about to read describes the process of non-conscious understanding. The mechanism that governs conscious understanding is the subject of a future post. For clarity, understanding and awareness are two perspectives of saying a similar thing. Understanding is using knowledge to recreate a reality from its fundamentals while awareness refers to the object(s) being recreated. Knowledge is a requirement of understanding. Knowledge is the ability to recognize a pattern and abstract it into a signal. A knowledgeable person is someone who can recognize a lot of patterns. When multiple signals are able to inform a higher-level pattern, an understanding is developed.

Randomisation tests comparing dependent correlations

This is about some academic work I did that never got published. But, I think it should be out there as you might find it useful. Here’s the motivation. You measure three variables, A, B, and Z. You are interested in whether the correlation between A and Z is different to that between B and Z. Of course, they are not independent sampling distributions, so it’s a little complicated. Then, maybe you want to use a more robust correlation statistic: Spearman’s or Kendall’s or whatever. What now? I’m going to tell you about how this came to my attention and what we did. Then, you can read the manuscript, use the code files, gaze at the figures…

Interpretable Machine Learning

Imagine you are a Data Scientist and in your free time you try to predict where your friends will go on vacation in the summer, based on their Facebook and Twitter data. Now, if the predictions turn out to be accurate, your friends might be impressed and could consider you to be a magician who could see the future. If the predictions are wrong, it would still bring no harm to anyone except to your reputation of being a ‘Data Scientist’. Now let’s say it wasn’t a fun project and there were investments involved. Say, you wanted to invest in properties where your friends were likely to holiday. What would happen if the model’s predictions went awry? You would lose money. As long as the model has no significant impact, its interpretability doesn’t matter so much, but when a model’s prediction has implications, be they financial or social, interpretability becomes relevant.

Data: A cultural transformation and not a quick fix

Amid stronger business competition than ever before, companies need to do more than simply embrace buzzwords or trends. It’s something we see all the time when out in the field talking to customers, or speaking at events. When it comes to the role of data, the emphasis should instead be on instilling transformation into the very DNA of an organisation. Quick fixes are not the order of the day and, while the utilisation of tools such as Artificial Intelligence (AI) and Machine Learning (ML) may reap initial rewards, focus needs to switch to a longer-term, more all-encompassing cultural shift surrounding data analytics. This is, and has been, Mango’s view over the past 16 years, and is one that’s expanded on in detail by Rich Pugh, Mango’s chief data scientist and co-founder, and CEO Matt Aldridge, in the Future of Data Report, recently published in The Times. According to Rich, the notion that ideas like AI or ML can just be plugged in and the company then watches as money pours out of their servers is dangerous. But at least it’s opened the door to having the conversation about how companies can become data driven. ‘Our organisation is focused on facilitating these conversations that we believe should have been occurring 16 years ago, so we can help companies avoid quick buzzword-led reactions and instead strive for a cultural transformation based on data. The question for all reverts to “where are you on your data-driven journey and what’s the best way forward for your company?”’

Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on?

I will use a basic example to explain each performance metric so that you really understand the difference between them, and so that in your next ML project you can choose the performance metric that best suits your project.
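
For reference, the five metrics named in the title can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name and the example counts are made up for illustration and do not come from the article, and the F-score shown is F1, the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted positive,
    one actual positive and one actual negative).
    """
    precision = tp / (tp + fp)    # of predicted positives, how many are right
    recall = tp / (tp + fn)       # of actual positives, how many were found
    specificity = tn / (tn + fp)  # of actual negatives, how many were found
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, accuracy is 0.85 and precision is 0.80, while recall is about 0.89 and specificity about 0.82. Which one to optimize depends on whether false positives or false negatives are more costly in the project at hand.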

Cost-effectiveness analysis with multi-state and partitioned survival models: hesim 0.2.0

Analyses are performed by constructing economic models that consist of (1) a disease model, (2) a utility model, and (3) cost models. A disease model simulates the progression of disease over time. In CTSTMs and PSMs this entails simulating the probability that patients are in mutually exclusive health states as a function of time. The utility model attaches utility values to the different health states and is used to compute quality-adjusted life-years (QALYs). Similarly, the cost models attach cost values to the health states for different categories of costs (e.g., drug costs, inpatient costs, outpatient costs, etc.). Analyses follow a 3-step process consisting of (1) parameterization, (2) simulation, and (3) decision analysis. In the first step, statistical models are used to estimate the parameters of the economic model. In the second step, the parameterized economic model is simulated to compute quantities of interest – such as disease progression, QALYs, and costs – extrapolated over the desired time horizon (e.g., over a lifetime). Finally, in the third step, the simulated outcomes from step 2 are used to perform cost-effectiveness analyses (CEAs) and represent decision uncertainty.

Build XGBoost / LightGBM models on large datasets – what are the possible solutions?

XGBoost and LightGBM have proven to be the best-performing ML algorithms on many tabular datasets. But when the data is huge, how do we use them?

Installing Tensorflow with CUDA, cuDNN and GPU support on Windows 10

In Part 1 of this series, I discussed how you can upgrade your PC hardware to incorporate a CUDA Toolkit compatible graphics processing card, such as an Nvidia GPU. This Part 2 covers the installation of CUDA, cuDNN and Tensorflow on Windows 10. This article assumes that you have a CUDA-compatible GPU already installed on your PC; if you haven’t got this already, Part 1 of this series will help you get that hardware set up, ready for these steps.

Introducing PyTorch BigGraph

Graphs are one of the fundamental data structures in machine learning applications. Specifically, graph-embedding methods are a form of unsupervised learning, in that they learn representations of nodes using the native graph structure. Training data in mainstream scenarios such as social media predictions, internet of things (IoT) pattern detection or drug-sequence modeling are naturally represented using graph structures. Any one of those scenarios can easily produce graphs with billions of interconnected nodes. While the richness and intrinsic navigation capabilities of graph structures are a great playground for machine learning models, their complexity poses massive scalability challenges. Not surprisingly, the support for large-scale graph data structures in modern deep learning frameworks is still quite limited. Recently, Facebook unveiled PyTorch BigGraph, a new framework that makes it much faster and easier to produce graph embeddings for extremely large graphs in PyTorch models.

What is a Permutation Test?

Let’s suppose that we want to test some hypothesis, and we have a sample of size n that we plan to use. This sample could be very small, and just how we obtained it is not that important. In particular, we’re not going to assume that it was obtained randomly. Moreover, the (null) hypothesis that we’re wanting to test doesn’t have to be expressed in terms of a statement about some parameter associated with a population. It can just be some general statement. In fact, we’re not going to be that much interested in ‘populations’ as such, so we certainly don’t have to make a whole bunch of heroic assumptions involving properties such as ‘normality’, or ‘constant variance’.
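
To make the idea concrete, here is a minimal sketch of one common permutation test: a two-sided test of whether two groups differ in their means. It is only one instance of the general approach and is not taken from the article; the function name, the choice of test statistic and the Monte Carlo sampling of permutations are all assumptions made for illustration. Under the null hypothesis the group labels are exchangeable, so the labels are shuffled repeatedly and the observed statistic is compared with the shuffled ones.

```python
import random


def permutation_test(x, y, n_permutations=10_000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    n_x = len(x)
    pooled = list(x) + list(y)

    def abs_mean_diff(values):
        first, second = values[:n_x], values[n_x:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        # Small tolerance guards against floating-point ties.
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Add-one correction: the observed labelling counts as one permutation,
    # so the estimated p-value can never be exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

Note that no assumption of normality or constant variance is needed: the reference distribution is built entirely from the data in hand. For very small samples, all possible relabellings could be enumerated exactly instead of sampled.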


m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go).

A/B testing in One Picture

A non-technical look at A/B testing, based on Dan Siroker & Pete Koomen’s book, A / B Testing, The Most Powerful Way to Turn Clicks Into Customers.

Zero to Cohort Analysis in 60 Minutes

Cohort analysis is one of the best ways to understand your company’s growth – it often reveals things that are important but not obvious from top-line measures like MRR, MAUs, etc. Historically, it’s been kind of a slog to get cohort analysis up and running, usually because you either don’t have the data you want (e.g., a complete transaction ledger) or you can’t easily put it in the right format (e.g., it’s stuck in your point of sale). But today, it’s fantastically easier – especially if you process user transactions using a modern payments platform with a good API (like Shopify, Square, or Stripe). In fact, it’s totally feasible to go from nothing to automatically-updated cohort analysis in about 60 minutes. The rest of this post shows how. To keep things concrete, we’ll focus on the specific example of pulling revenue LTV curves for a fictional Shopify company, but the ideas apply more broadly.
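
As a rough sketch of the underlying calculation (not the post's actual pipeline, and without any payments API), the Python below builds revenue LTV curves from a plain list of transactions. The input format of (customer_id, date, amount) tuples and the function name are assumptions made for illustration. Customers are grouped into cohorts by the month of their first purchase, and each curve gives cumulative revenue per cohort customer by months since that first purchase.

```python
from collections import defaultdict
from datetime import date


def ltv_curves(transactions):
    """Return {(year, month): [cumulative revenue per customer by month of age]}."""
    # A customer's cohort is the month of their first purchase.
    first_purchase = {}
    for customer, day, _ in transactions:
        if customer not in first_purchase or day < first_purchase[customer]:
            first_purchase[customer] = day

    cohort_sizes = defaultdict(int)
    for day in first_purchase.values():
        cohort_sizes[(day.year, day.month)] += 1

    # Sum revenue by cohort and by whole months elapsed since first purchase.
    revenue = defaultdict(lambda: defaultdict(float))
    for customer, day, amount in transactions:
        start = first_purchase[customer]
        age = (day.year - start.year) * 12 + (day.month - start.month)
        revenue[(start.year, start.month)][age] += amount

    curves = {}
    for cohort, by_age in revenue.items():
        running, curve = 0.0, []
        for age in range(max(by_age) + 1):
            running += by_age.get(age, 0.0)
            curve.append(running / cohort_sizes[cohort])
        curves[cohort] = curve
    return curves


# Made-up orders for three customers.
orders = [
    ("a", date(2019, 1, 5), 20.0),
    ("b", date(2019, 1, 20), 10.0),
    ("a", date(2019, 2, 3), 30.0),
    ("c", date(2019, 2, 10), 40.0),
    ("b", date(2019, 3, 1), 10.0),
]
curves = ltv_curves(orders)
```

Here the January 2019 cohort has two customers (a and b), so its curve is [15.0, 30.0, 35.0], while the February cohort has only customer c and a single point, [40.0].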

Graduating in GANs: Going from understanding generative adversarial networks to running your own

Generative Adversarial Networks (GANs) have taken over the public imagination – permeating pop culture with AI-generated celebrities and creating art that is selling for thousands of dollars at high-brow art auctions. In this post, we’ll explore:
• Brief primer on GANs
• Understanding and Evaluating GANs
• Running your own GAN

Identifying Duplicate Questions: A Machine Learning Case Study

Quora and Stack Exchange are knowledge-sharing platforms where people can ask questions in the hopes of attracting high-quality answers. Often, questions that people submit have previously been asked. Companies like Quora can improve user experience by identifying these duplicate entries. This would enable users to find questions that have already been answered and prevent community members from answering the same question multiple times. Consider the following pair of questions: 1. Is talent nurture or nature? 2. Are people talented by birth or can it be developed? These are duplicates; they are worded differently, but they have the same intent. This blog post focuses on solving the problem of duplicate question identification.

Detecting Personal Data within API Communication Using Deep Learning

Ever wonder how much personal data is scattered around various organizations? Undefined and untethered, a massive amount of personal data – yours, mine, everyone’s – from the borderline personal to the incredibly specific, is just floating around. Mostly, it’s undetected. Almost always, it’s used by the organization that you gave it to, for good reasons. Sometimes, for questionable reasons. Often (very, very often) it’s sold and used, or misused, by a third party (and a fourth, and a fifth...). Too often, that personal data is hijacked and used for nefarious purposes. Given the exponential chances for use and misuse of personal data online, it’s little wonder that the European Union recently passed the General Data Protection Regulation (GDPR), which came into effect on May 25th, 2018. California followed suit a month later, with the California Consumer Privacy Act (CCPA). If you wonder how many organizations have hold of your personal data, you’re not alone. But what about those organizations themselves? It’s so difficult and time-consuming to monitor that data. The scariest part of this personal-data nightmare might be that the organizations themselves often have no idea where your personal data is stored, what applications it might be flowing through, or where it might end up. The trend toward regulating the flow and uses of personal data is rising, but how can you regulate what you cannot detect?