Narrowest-over-threshold detection of multiple change points and change-point-like features

We propose a new, generic and flexible methodology for non-parametric function estimation, in which we first estimate the number and locations of any features that may be present in the function and then estimate the function parametrically between each pair of neighbouring detected features. Examples of features handled by our methodology include change points in the piecewise constant signal model, kinks in the piecewise linear signal model and other similar irregularities, which we also refer to as generalized change points. Our methodology works with only minor modifications across a range of generalized change point scenarios, and we achieve such a high degree of generality by proposing and using a new multiple generalized change point detection device, termed narrowest-over-threshold (NOT) detection. The key ingredient of the NOT method is its focus on the smallest local sections of the data on which the existence of a feature is suspected. For selected scenarios, we show the consistency and near optimality of the NOT algorithm in detecting the number and locations of generalized change points. The NOT estimators are easy to implement and rapid to compute. Importantly, the NOT approach is easy to extend by the user to tailor to their own needs. Our methodology is implemented in the R package not.
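
The narrowest-over-threshold idea can be sketched for the simplest case, a piecewise constant signal. What follows is a simplified illustration, not the implementation in the R package not. It assumes a CUSUM-type contrast for mean shifts, and it scans every sub-interval rather than a random sample of intervals. The function names and the threshold value are hypothetical.

```python
import math


def cusum(x, s, e, b):
    """Absolute CUSUM contrast for a mean shift after index b within x[s..e] (inclusive)."""
    n = e - s + 1
    left_n = b - s + 1
    right_n = e - b
    left_sum = sum(x[s:b + 1])
    right_sum = sum(x[b + 1:e + 1])
    return abs(math.sqrt(right_n / (n * left_n)) * left_sum
               - math.sqrt(left_n / (n * right_n)) * right_sum)


def best_split(x, s, e):
    """Return (largest contrast, split index) over all splits of x[s..e]."""
    return max(((cusum(x, s, e, b), b) for b in range(s, e)), key=lambda t: t[0])


def not_detect(x, s, e, threshold):
    """Recursively find change points in x[s..e]; b means 'the mean changes after index b'."""
    candidates = []
    # Assumption: scan all sub-intervals to keep the sketch deterministic.
    for i in range(s, e):
        for j in range(i + 1, e + 1):
            contrast, b = best_split(x, i, j)
            if contrast > threshold:
                candidates.append((j - i, i, b))
    if not candidates:
        return []
    # Key step: among intervals over the threshold, take the narrowest one.
    _, _, b = min(candidates)
    return not_detect(x, s, b, threshold) + [b] + not_detect(x, b + 1, e, threshold)


# Noise-free toy signal with mean shifts after indices 9 and 19.
signal = [0.0] * 10 + [5.0] * 10 + [0.0] * 10
change_points = not_detect(signal, 0, len(signal) - 1, threshold=1.0)
print(change_points)
```

Picking the narrowest interval makes it likely that the chosen interval contains a single feature, which is what lets the same device work across different signal models. With noisy data the threshold would need to be chosen with care, for example in relation to the noise level.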

Lack-of-fit tests for quantile regression models

The paper novelly transforms lack-of-fit tests for parametric quantile regression models into checking the equality of two conditional distributions of covariates. Accordingly, by applying some successful two-sample test statistics in the literature, two tests are constructed to check the lack of fit for low and high dimensional quantile regression models. The low dimensional test works well when the number of covariates is moderate, whereas the high dimensional test can maintain the power when the number of covariates exceeds the sample size. The null distribution of the high dimensional test has an explicit form, and the p-values or critical values can then be calculated directly. The finite sample performance of the tests proposed is examined by simulation studies, and their usefulness is further illustrated by two real examples.

Running cross_validate from cvms in parallel

The cvms package is useful for cross-validating a list of linear and logistic regression model formulas in R. To speed up the process, I’ve added the option to cross-validate the models in parallel. In this post, I will walk you through a simple example and introduce the combine_predictors() function, which generates model formulas by combining a list of fixed effects. We will be using the simple participant.scores dataset from cvms.

How DevOps Drives Analytics Operationalization and Monetization

I recently wrote a blog ‘Interweaving Design Thinking and Data Science to Unleash Economic V…’ that discussed the power of interweaving Design Thinking and Data Science to make our analytic efforts more effective. Our approach was validated by a recent McKinsey article titled ‘Fusing data and design to supercharge innovation’ that stated: ‘While many organizations are investing in data and design capabilities, only those that tightly weave these disciplines together will unlock their full benefits.’

Optimization with SciPy and application ideas to machine learning

Optimization is often the final frontier, which needs to be conquered to deliver the real value, for a large variety of business and technological processes. We show how to perform optimization with the most popular scientific analysis package in Python – SciPy and discuss unique applications in machine learning space.

Asimov’s Laws of Robotics, and why AI may not abide by them

‘But what if we end up in a Terminator scenario?’ One cannot be blamed for asking such questions since, through movies and sci-fi stories, cases where the robots take over are almost ubiquitous and therefore frame our impression of a future with Artificial Intelligence (AI). However, since human beings are able to live and cooperate through laws, why not apply laws for AI as well? Enter Asimov’s Laws of Robotics! As always, to have a better understanding of the future, let us take a dive into the past.

The Sleeping Beauty problem: a data scientist’s perspective

One of the first and particularly memorable lessons that I learned in courses on experimental physics was this: never, ever, draw a graph representing measurements without error bars. Error bars indicate the extent to which the value of a particular measurement is uncertain. There is a deeper truth to this practical rule. It implies that with any empirical evidence or data, there comes uncertainty as to its veracity. Moreover, if we are not able to gauge this uncertainty, the data become utterly useless. The abstract concept of probability allows us to reason about uncertain events such as ‘will this gauge show an electric potential of 1.5 V or rather 1.4 V?’. It is central not only to the field of data science but to practically any empirical science that leverages statistical description. Consequently, even for applied scientists, it makes sense to stop and ponder more fundamental questions about probability once in a while.

Misleading With Data & Statistics.

Statistics play a vital role in our life. We use them every day, consciously or unconsciously. Nowadays data is everywhere, and making the right decisions becomes increasingly difficult due to an information overload of our system. Statistics allow us to better process and understand the world around us if applied correctly. We should be able to make better decisions based on more complete information. But what if statistics are misleading?

Data Science-ish

A critical flaw in data science practices is beginning to surface: Decision-makers force data to justify their presumptions. So now that we’ve identified this problem, the solution is as simple as a change of perspective, right? False. Even if those in that position of power were to read this article, there’s a slim chance that this is enough to touch moral compasses that are backed by decades of undisputed ‘experience’. The presumed conclusions differ from what the data is able to illustrate, and that same data has already been through rigorous cleansing, processing and interpretation. This workflow of a data science project involves unique individuals pressured to deliver to that one gut-instinct (conclusion) at each of these stages. However, since the data set completely transforms across this practice, tracing the point(s) of data-alchemy is practically a game of Chinese whispers.

What is Wavelet and How We Use It for Data Science

Hello, this is my second post on the topic of signal processing. For now, I’m interested in learning more about signal processing to understand a certain paper. To be honest, I find wavelets harder to understand than the Fourier Transform. Once I felt I understood the topic reasonably well, I realized something: I would have understood it faster if I had learned it in the right step-by-step order. So here is the right order, in my opinion.

How Statistical Norms Improve Modeling

A regularizer is commonly used in machine learning to constrain a model’s capacity to certain bounds, based either on a statistical norm or on prior hypotheses. This adds preference for one solution over another in the model’s hypothesis space, or the set of functions that the learning algorithm is allowed to select as being the solution. The primary aim of this method is to improve the generalizability of a model, that is, to improve a model’s performance on previously unseen data. Using a regularizer improves generalizability because it reduces overfitting of the model to the training data.
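
As a rough illustration of how a norm-based penalty expresses a preference between solutions, the sketch below adds a squared L2 norm to a mean squared error. The toy data, the weight vectors and the penalty strength `lam` are all made up for the example. Both weight vectors fit the data exactly, so only the penalty separates them.

```python
def mse(weights, rows, targets):
    """Mean squared error of a linear model without intercept."""
    errors = [sum(w * x for w, x in zip(weights, row)) - t
              for row, t in zip(rows, targets)]
    return sum(err * err for err in errors) / len(errors)


def l2_penalty(weights):
    """Squared L2 norm of the weight vector."""
    return sum(w * w for w in weights)


def regularized_loss(weights, rows, targets, lam):
    """Data fit plus lam times the norm penalty."""
    return mse(weights, rows, targets) + lam * l2_penalty(weights)


# Two identical features, so many weight vectors fit the targets perfectly.
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
targets = [2.0, 4.0, 6.0]

w_a = (2.0, 0.0)  # puts all the weight on one feature
w_b = (1.0, 1.0)  # spreads the weight, smaller norm

loss_a = regularized_loss(w_a, rows, targets, lam=0.1)
loss_b = regularized_loss(w_b, rows, targets, lam=0.1)
print(loss_a, loss_b)
```

Without the penalty the two candidates are indistinguishable. With it, the smaller-norm solution `w_b` gets the lower loss.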

A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model

Word embedding is one of the most important techniques in natural language processing (NLP), where words are mapped to vectors of real numbers. Word embedding is capable of capturing the meaning of a word in a document, semantic and syntactic similarity, and relations with other words. It has also been widely used for recommender systems and text classification. This tutorial gives a brief introduction to the Gensim Word2Vec model, with an example of generating word embedding for the vehicle make model.

Classification: Sigmoid vs. Softmax

When designing a model to perform a classification task (e.g. classifying diseases in a chest x-ray or classifying handwritten digits) we want to tell our model whether it is allowed to choose many answers (e.g. both pneumonia and abscess) or only one answer (e.g. the digit ‘8’). This post will discuss how we can achieve this goal by applying either a sigmoid or a softmax function to our classifier’s raw output values.
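
The difference between the two functions can be seen in a few lines of plain Python. This is a minimal sketch with made-up raw output values, not code from the post. The sigmoid is applied to each value independently, while the softmax normalizes across all values.

```python
import math


def sigmoid(z):
    """Squash a single raw score into (0, 1), independently of other scores."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 across classes."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


# Hypothetical raw outputs of a classifier for three classes.
logits = [2.0, 1.0, -1.0]

multi_label = [sigmoid(z) for z in logits]  # each class judged on its own
single_label = softmax(logits)              # classes compete with each other

print(multi_label, sum(multi_label))
print(single_label, sum(single_label))
```

The sigmoid outputs need not sum to 1, so several classes can be 'on' at once. The softmax outputs always sum to 1, which suits problems with exactly one correct answer.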

Reinforcement Learning – Implement TicTacToe

We have implemented the grid world game by iteratively updating the Q-value function, which is the estimated value of a (state, action) pair. This time, let’s look into how to leverage reinforcement learning in an adversarial game, tic-tac-toe, where there are more states and actions and, most importantly, there is an opponent playing against our agent.
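
For readers unfamiliar with the idea, iteratively updating a value estimate for (state, action) pairs can be sketched with the standard tabular Q-learning rule. This is an assumption for illustration and may differ from the exact update used in the post. The state names, actions and hyperparameters below are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * max over next actions."""
    # A terminal state has no next actions, so its future value is 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# The final move earns a reward of 1 and ends the episode.
last = q_update(q, "s1", "right", 1.0, "terminal", [])

# The earlier move earns nothing directly but leads to the valuable state s1.
first = q_update(q, "s0", "right", 0.0, "s1", ["right"])

print(last, first)
```

The second update shows how value propagates backwards: the move from `s0` receives credit only because it leads to a state whose value has already been raised.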

How Computers See

Introduction to Convolutional Neural Networks

Introduction to Neural Networks

To quote the repository of all human knowledge, ‘artificial neural networks […] are computing systems inspired by the biological neural networks that constitute animal brains.’ Biological neurons and ‘neurons’ in artificial neural networks both take in signals from other neurons and produce some output accordingly. The power of both kinds of neural networks comes not from a single neuron acting alone, but from the cumulative effect of many neurons together. But the similarities stop there. A biological neuron contains immensely complex molecular machinery, and an artificial neural network neuron encompasses a few simple math operations. Artificial neural networks have been successfully applied to images, audio, text, and medical data. This post will introduce you to how artificial neural networks compute.
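
The 'few simple math operations' of an artificial neuron can be sketched as a weighted sum of inputs plus a bias, passed through an activation function. The ReLU activation and all numbers below are assumptions chosen for illustration, not taken from the post.

```python
def relu(z):
    """Rectified linear unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of incoming signals, then an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)


def layer(inputs, weight_rows, biases):
    """Several neurons reading the same inputs; each has its own weights and bias."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


signals = [1.0, 2.0]
outputs = layer(signals,
                weight_rows=[[0.5, 0.5], [0.5, -1.0]],
                biases=[0.0, 0.25])
print(outputs)
```

Stacking such layers, so that the outputs of one become the inputs of the next, is what gives a network its cumulative power.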