Supervised Dimension Reduction for Large-scale “Omics” Data with Censored Survival Outcomes Under Possible Non-proportional Hazards

The past two decades have witnessed significant advances in high-throughput “omics” technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled the simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes such as overall and recurrence-free survival with the goal of developing a predictive “omics” profile. In this paper, we propose a supervised dimension reduction method for feature selection and survival prediction. Our approach utilizes continuum power regression – a framework that includes ordinary least squares, principal components regression and partial least squares – in conjunction with the parametric or semi-parametric accelerated failure time model, and enables feature selection under possible non-proportional hazards. The proposed approach can handle censored observations using robust Buckley-James estimation in this high-dimensional setting and the parametric version employs the flexible generalized F model that encompasses a wide spectrum of well-known survival models. We evaluate the predictive performance of our methods via extensive simulation studies and compare it to existing methods using publicly available data sets in cancer genomics.

iBreakDown: Model Agnostic Explainers for Individual Predictions

The iBreakDown package is a model-agnostic tool for explaining predictions from black-box ML models. The Break Down Table shows the contribution of every variable to a final prediction. The Break Down Plot presents variable contributions in a concise graphical way. This package works for binary classifiers as well as regression models. iBreakDown is a successor of the breakDown package. It is faster (complexity O(p) instead of O(p^2)). It supports interactions and interactive explainers with D3.js plots.
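To make the idea of per-variable contributions concrete, here is a minimal Python sketch of a sequential attribution in the spirit of a Break Down table. It is not the iBreakDown API (the package itself is written in R), and it assumes a fixed variable order and ignores interactions. The names `break_down` and `toy_model`, and the toy data, are hypothetical and exist only for illustration.

```python
import numpy as np


def break_down(predict, background, x, order=None):
    """Attribute predict(x) to individual variables, one at a time.

    Variables are fixed to their values in x in the given order; each
    contribution is the resulting change in the mean prediction over
    the background data. Simplified sketch: fixed order, no interactions.
    """
    background = np.asarray(background, dtype=float)
    x = np.asarray(x, dtype=float)
    if order is None:
        order = range(background.shape[1])

    data = background.copy()  # work on a copy so the caller's data is untouched
    baseline = float(np.mean(predict(data)))  # average prediction ("intercept")
    contributions = {}
    previous = baseline
    for j in order:
        data[:, j] = x[j]  # fix variable j to the value of the explained observation
        current = float(np.mean(predict(data)))
        contributions[j] = current - previous
        previous = current
    return baseline, contributions


# Hypothetical toy model standing in for a black-box predictor.
def toy_model(X):
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]


rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
x = np.array([1.0, 2.0, -1.0])

baseline, contributions = break_down(toy_model, background, x)
# The baseline plus all contributions adds up to the model's prediction at x.
total = baseline + sum(contributions.values())
```

Because every variable ends up fixed to its value in `x`, the contributions always sum to the difference between the prediction for `x` and the average prediction. For models with interactions, the individual contributions depend on the chosen order, which this sketch does not try to handle.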


DrWhy is a collection of tools for Explainable AI (XAI). It’s based on shared principles and a simple grammar for exploration, explanation and visualisation of predictive models. Please note that DrWhy is under rapid development and is still maturing. If you are looking for a stable solution, please use the mature DALEX package.

Is your algorithm confident enough? How to measure uncertainty in neural networks.

When machine learning techniques are used in ‘mission critical’ applications, the acceptable margin of error becomes significantly lower. Imagine that your model is driving a car, assisting a doctor or even just interacting directly with a (perhaps easily annoyed) end user. In these cases, you’ll want to ensure that you can be confident in the predictions your model makes before acting on them. Here’s the good news: there are several techniques for measuring uncertainty in neural networks and some of them are very easy to implement! First, let’s get a feel for what we’re about to measure.

Subspace clustering

This post addresses the following questions:
1. What are the challenges of working with high-dimensional data?
2. What is subspace clustering?
3. How to implement a subspace clustering algorithm in Python?

Tiny Dataset Hypothesis Testing by Projecting Pretrained-Embedding-Space Onto KDE-Mixed Space

Text classification tasks usually demand a high sample count and fine semantic variability for reliable modeling. In many cases, the data at hand falls short on several counts: too few samples, heavily skewed categories, and low variability, i.e., limited vocabulary variety and semantic meaning. In this post, I will present a novel method to overcome these common hurdles. The method is mainly intended to aid quick prototyping, guided topic modeling, hypothesis testing, proof-of-concept (POC) work, or even the creation of a minimum viable product (MVP).

Tidy Anomaly Detection using R

Imagine you run an online business and you want to plan server resources for the next year. It is imperative that you know when your load is going to spike (or at least when it spiked in the past, so you can expect it to repeat), and that is where Time Series Anomaly Detection comes in. While there are some packages like Twitter’s AnomalyDetection that have been doing this job, there is another good candidate, anomalize, that does something no other anomaly detection package was doing: Tidy Anomaly Detection.

Review: RefineNet – Multi-path Refinement Network (Semantic Segmentation)

In this story, RefineNet, by The University of Adelaide and the Australian Centre for Robotic Vision, is reviewed. It is a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. The deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions. A chained residual pooling is also introduced, which captures rich background context in an efficient manner. This is a 2017 CVPR paper with more than 400 citations. (Sik-Ho Tsang @ Medium)

Time Series Machine Learning Regression Framework

Building a time series forecasting pipeline to predict weekly sales transactions. Time! One of the most challenging concepts in the Universe. I have studied Physics for many years and seen how the most brilliant scientists have struggled to deal with the concept of time. In machine learning, even though we are far from those complicated Physics theories, in which the common understanding of time changes, the existence of time and the sequence of observations in an ML problem can make the problem much more complicated.

Reinforcement Learning for Combinatorial Optimization

From as early as humankind’s beginning, millions of years ago, every innovation in technology and every invention that improved our lives and our ability to survive and thrive on earth, has been devised by the cunning minds of intelligent humans. From the fire to the wheel, and from electricity to quantum mechanics, our understanding of the world and the complexity of things around us have increased to the point that we often have difficulty grasping them intuitively.

Why Documentation is Important in Data Science?

Whenever we talk about data science, chances are we’ll first think of fancy stuff like AI, deep learning, machine learning, etc. But nobody talks about documentation, for a reason we all know: it doesn’t sound sexy (or it sounds so boring). And I completely agree that documentation is often not one of the most interesting things for data scientists. Still, its importance is no less than that of any other part of the data science workflow, especially in terms of data science project management. In fact, documentation is no longer just a task done by programmers or developers. It’s something that we as data scientists should know and be able to perform on a regular basis.

How to Do Text Binary Classification with BERT?

We all know BERT is a compelling language model which has already been applied to various kinds of downstream tasks, such as sentiment analysis and QA. Have you ever tried it on text binary classification? Honestly, until the beginning of this week, my answer was still NO. Why? Because the example code on BERT’s official GitHub repo was not very friendly.

Architecting a Machine Learning Pipeline. How to build scalable Machine Learning systems – Part 2/2

When developing a model, data scientists work in some development environment tailored for Statistics and Machine Learning (Python, R etc) and are able to train and test models all in one ‘sandboxed’ environment while writing relatively little code. This is great for building interactive prototypes with fast time to market – they are not productionised, low latency systems though! This is the 2nd in a series of articles, namely ‘Being a Data Scientist does not make you a Software Engineer!’, which covers how you can architect an end-to-end scalable Machine Learning (ML) pipeline.

Solving the AI Accountability Gap – Hold developers responsible for their creations

Yesterday, a leaked white paper from the United Kingdom government suggested that social media executives could be held legally responsible for harmful content proliferating on their platform’s algorithms. This proposal aims to address one of the single biggest problems brought about by autonomous decision-making: who should be blamed when an AI causes harm?

Understanding and Reducing Bias in Machine Learning

Mireille Hildebrandt, a lawyer and philosopher working at the intersection of law and technology, has written and spoken extensively on the issue of bias and fairness in machine learning algorithms. In an upcoming paper on agnostic and bias free machine learning (Hildebrandt, 2019), she argues that bias free machine learning doesn’t exist and that a productive bias is necessary for an algorithm to be able to model the data and make relevant predictions. The three major types of bias that can occur in a predictive system can be laid out as:
• Bias inherent in any action perception system (productive bias)
• Bias that some would qualify as unfair
• Bias that discriminates on the basis of prohibited legal grounds

An End-to-End Tutorial Running Convolution Neural Network on MCU with uTensor

My attempt to showcase a CNN on uTensor at COSCUP Taiwan in 2018 did not succeed. The problem was that I accidentally fed the same image to all the input classes, so the results were as good as random guesses. On CIFAR10, that yields about 10% accuracy, despite the fact that the classifier was actually working! That’s the backstory of CNN on uTensor. In this post, I’m going to guide you through how to build a CNN with uTensor. We will be using utensor-cli to seamlessly convert a simple CNN trained in TensorFlow to an equivalent uTensor implementation, then compile and run it on an MCU (STM32F767, to be specific).