# Distilled News

There are lots of great tutorials on the web that explain how to classify chunks of text with Machine Learning. But what if, rather than just categorizing text, you want to categorize individual words?
A simple step-by-step explanation of the concept of policy gradients and how they fit into reinforcement learning. Maybe too simple.
Moravec’s paradox is the observation made by many AI researchers that high-level reasoning requires less computation than low-level unconscious cognition. This is an empirical observation that goes against the notion that greater computational capability leads to more intelligent systems. However, we have today computer systems that have super-human symbolic reasoning capabilities. Nobody is going to argue that a man with an abacus, a chess grandmaster or a champion Jeopardy player has any chance at besting a computer. Artificial symbolic reasoning is a technology that has been available for decades now, and it is without argument superior to what any human can provide. Despite this, nobody will claim that computers are conscious. Today, with the discovery of deep learning (i.e. intuition or unconscious reasoning machines), low-level unconscious cognition is within humanity’s grasp. In this article, I will explore the ramifications of a scenario where machine subjectivity or self-awareness is discovered prior to the discovery of intelligent machines. This is a scenario where self-awareness is not a higher reasoning capability. Let us ask: what if self-aware machines were discovered before intelligent machines? What would the progression of breakthroughs look like? What is the order of the milestones?
If you are guiding your company’s digital journey, to what extent should you be advising them to adopt deep learning AI methods versus traditional and mature machine learning techniques?
In this article, you learn how to build automated data-report storytelling in R for credit modelling. First, you need to install the rmarkdown package into your R library. Assuming that you have installed rmarkdown, you next create a new rmarkdown script in R.
Learn how to create your own free chatbot environment with just a few commands, as well as learning more about the benefits of customer service chatbots.
The fantasy that a machine is capable of simulating the human visual system is old. We’ve come a long way since the first university papers appeared back in the 1960s, as evidenced by the advent of modern systems trivially integrated into mobile applications. Today, computer vision is one of the hottest subfields of artificial intelligence and machine learning, given its wide variety of applications and tremendous potential. Its goal: to replicate the powerful capacities of human vision. But, what exactly is computer vision? How is it currently applied in different industries? What are some well-known business use cases? What tasks are typical to computer vision? In this guide, you’ll learn about the basic concept of computer vision and how it’s used in the real world. It’s a simple examination of a complex problem for anybody who has ever heard of computer vision but isn’t quite sure what it’s all about and how it’s applied.
Welcome to the second part of RStudio shortcuts and tips! If you have not yet read RStudio shortcuts and tips – part one, I strongly recommend doing so before proceeding further.
Previously, I wrote a blog post showing a number of R packages and functions which you could use to quickly explore your data set. Since posting that, I’ve become aware of another exciting EDA package: inspectdf by Alastair Rushworth! As is very often the case, I became aware of this package in a twitter post by none other than Mara Averick.
A loss function in Machine Learning is a measure of how accurately your ML model is able to predict the expected outcome, i.e., the ground truth. The loss function takes two items as input: the output value of our model and the ground-truth expected value. The output of the loss function is called the loss, which is a measure of how well our model did at predicting the outcome. A high value for the loss means our model performed very poorly. A low value for the loss means our model performed very well. Selection of the proper loss function is critical for training an accurate model. Certain loss functions have certain properties and help your model learn in a specific way. Some may put more weight on outliers, others on the majority. In this article we’re going to take a look at the 3 most common loss functions for Machine Learning Regression. I’ll explain how they work, their pros and cons, and how they can be most effectively applied when training regression models.
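The summary does not name the three loss functions, so the sketch below assumes they are mean squared error, mean absolute error and the Huber loss, which are common choices for regression. The function names and the `delta` parameter are illustrative, not taken from the article. The toy data has one large miss to show how differently the three losses weight an outlier.

```python
def mse(y_true, y_pred):
    """Mean squared error: squaring makes large residuals dominate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: every residual counts in proportion to its size."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for residuals up to delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]  # three perfect predictions and one large miss

print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

On this data the squared error is an order of magnitude larger than the other two, which is the "more weight on outliers" behaviour the summary alludes to.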
The multi-armed (also called k-armed) bandit is an introductory reinforcement learning problem in which an agent has to make n choices among k different options. Each option delivers a (possibly) different reward from an unknown distribution which usually doesn’t change over time (i.e. it is stationary). If the distribution changes over time (i.e. it is not stationary), the problem gets harder because previous observations (i.e. previous games) are of little use. In either case, the goal is to maximize the total reward obtained. This article reviews one (of many) simple solutions for both a stationary and a non-stationary 5-armed bandit across 1000 games. Note that only some parts of the full code are showcased here; for the fully functional notebook, please see this github repository.
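The summary does not say which solution the article uses. As a minimal sketch, the code below assumes an epsilon-greedy agent with incremental value estimates: a sample average for the stationary case and a constant step size for the non-stationary case. The Gaussian rewards, the random-walk drift and all parameter names are assumptions made for illustration.

```python
import random


def run_bandit(k=5, games=1000, epsilon=0.1, step_size=None, drift=0.0, seed=0):
    """Play `games` rounds of a k-armed bandit with an epsilon-greedy agent.

    step_size=None uses the sample average (suited to stationary rewards);
    a constant step_size weights recent rewards more (non-stationary case).
    drift > 0 makes the true means follow a random walk.
    """
    rng = random.Random(seed)
    true_means = [rng.gauss(0.0, 1.0) for _ in range(k)]
    estimates = [0.0] * k
    counts = [0] * k
    total_reward = 0.0

    for _ in range(games):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], 1.0)
        total_reward += reward
        counts[arm] += 1

        # Incremental update: estimate += alpha * (reward - estimate).
        alpha = step_size if step_size is not None else 1.0 / counts[arm]
        estimates[arm] += alpha * (reward - estimates[arm])

        # Optional non-stationarity: every arm's true mean drifts a little.
        if drift > 0.0:
            true_means = [m + rng.gauss(0.0, drift) for m in true_means]

    return total_reward, estimates, counts


stationary_total, _, _ = run_bandit()
drifting_total, _, _ = run_bandit(step_size=0.1, drift=0.05)
print(stationary_total, drifting_total)
```

The constant step size is the usual remedy when old observations lose their value, because it lets the estimates forget rewards from earlier games.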
In recent years, research related to ‘humans’ in the computer vision community has become increasingly active because of the high demand for real-life applications, among them instance segmentation. The standard approach to image instance segmentation is to perform object detection first, and then segment the object from the detection bounding box. More recently, deep learning methods like Mask R-CNN perform them jointly. However, as human-related tasks such as human recognition and tracking become more common, one might wonder why the uniqueness of the ‘human’ category is not taken into account. The uniqueness of the ‘human’ category can be well defined by the pose skeleton. Moreover, the human pose skeleton can be used to better distinguish instances with heavy occlusion than bounding boxes can. In this post, I am going to review ‘Pose2Seg: Detection Free Human Instance Segmentation’, which presents a new pose-based instance segmentation framework for humans that separates instances based on human pose.
1. Initialization
2. Data Generation
3. Train-test Splitting
4. Data Standardization
5. Neural Net Construction
6. Forward Propagation
7. Back-propagation
8. Iterative Optimization
9. Testing
This is how you can build a neural net from scratch using NumPy in 9 steps. Some of you might have already built neural nets using high-level frameworks such as TensorFlow, PyTorch, or Keras. However, building a neural net using only low-level libraries enables us to truly understand the mathematics behind the mystery.
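The nine steps listed above can be sketched in a few dozen lines of NumPy. This is not the article's code: the toy regression data, the 3-8-1 architecture, the tanh activation, the learning rate and the epoch count are all assumptions chosen to keep the example short.

```python
import numpy as np

# 1. Initialization: fix the random seed for reproducibility.
rng = np.random.default_rng(0)

# 2. Data generation: toy regression, target = sum of features plus noise.
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = X.sum(axis=1, keepdims=True) + 0.05 * rng.normal(size=(200, 1))

# 3. Train-test splitting (80/20).
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# 4. Data standardization, using training statistics only.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

# 5. Neural net construction: 3 inputs -> 8 tanh hidden units -> 1 output.
W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros((1, 1))


def forward(inputs):
    """6. Forward propagation: return hidden activations and predictions."""
    hidden = np.tanh(inputs @ W1 + b1)
    return hidden, hidden @ W2 + b2


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


initial_loss = mse(forward(X_train)[1], y_train)
learning_rate = 0.05

# 8. Iterative optimization: full-batch gradient descent.
for epoch in range(500):
    hidden, pred = forward(X_train)

    # 7. Back-propagation of the mean squared error.
    n = X_train.shape[0]
    d_pred = 2.0 * (pred - y_train) / n
    dW2 = hidden.T @ d_pred
    db2 = d_pred.sum(axis=0, keepdims=True)
    d_hidden = (d_pred @ W2.T) * (1.0 - hidden ** 2)  # derivative of tanh
    dW1 = X_train.T @ d_hidden
    db1 = d_hidden.sum(axis=0, keepdims=True)

    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

final_loss = mse(forward(X_train)[1], y_train)

# 9. Testing on the held-out split.
test_loss = mse(forward(X_test)[1], y_test)
print(initial_loss, final_loss, test_loss)
```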
This is the fourth article in the series Deep Learning for Life Sciences. In the previous posts, I showed how to use Deep Learning on Ancient DNA, Deep Learning for Single Cell Biology and Deep Learning for Data Integration. Now we are going to dive into Biomedicine and learn why and how we should use Bayesian Deep Learning for patient safety.
Like everything else, the world of cricket has gone through a lot of technological transformations in recent years. The way cricket is played and how it is viewed all around the world have both changed as a result. In this post we discuss whether neural networks are capable of generating cricket commentary just by watching the game. There has been some work in the literature (can be found here, here and here), but none of it uses neural networks. Being a believer in end-to-end deep learning, I think neural networks will seal the deal on this task in the near future. This is a hard problem to tackle because, apart from visual feature extraction, it involves very complex temporal dynamics and handling of long-term dependencies. This is because commentary is generally highly contextualized by the development of the current game, its significance in a broader perspective (friendly match vs tournament), and the histories of the teams and players involved. Decontextualized explanation of what is happening appears to be an easier problem to solve, and I can think of an architecture that can be used for modelling this.

# What’s new on arXiv

Continual learning based on data stream mining deals with ubiquitous sources of Big Data arriving at high-velocity and in real-time. Adaptive Random Forest (ARF) is a popular ensemble method used for continual learning due to its simplicity in combining adaptive leveraging bagging with fast random Hoeffding trees. While the default ARF size provides competitive accuracy, it is usually over-provisioned resulting in the use of additional classifiers that only contribute to increasing CPU and memory consumption with marginal impact in the overall accuracy. This paper presents Elastic Swap Random Forest (ESRF), a method for reducing the number of trees in the ARF ensemble while providing similar accuracy. ESRF extends ARF with two orthogonal components: 1) a swap component that splits learners into two sets based on their accuracy (only classifiers with the highest accuracy are used to make predictions); and 2) an elastic component for dynamically increasing or decreasing the number of classifiers in the ensemble. The experimental evaluation of ESRF and comparison with the original ARF shows how the two new components contribute to reducing the number of classifiers up to one third while providing almost the same accuracy, resulting in speed-ups in terms of per-sample execution time close to 3x.
Online Normalization is a new technique for normalizing the hidden activations of a neural network. Like Batch Normalization, it normalizes the sample dimension. While Online Normalization does not use batches, it is as accurate as Batch Normalization. We resolve a theoretical limitation of Batch Normalization by introducing an unbiased technique for computing the gradient of normalized activations. Online Normalization works with automatic differentiation by adding statistical normalization as a primitive. This technique can be used in cases not covered by some other normalizers, such as recurrent networks, fully connected networks, and networks with activation memory requirements prohibitive for batching. We show its applications to image classification, image segmentation, and language modeling. We present formal proofs and experimental results on ImageNet, CIFAR, and PTB datasets.
As the application of deep learning has expanded to real-world problems with insufficient volume of training data, transfer learning has recently gained much attention as a means of improving performance in such small-data regimes. However, when existing methods are applied between heterogeneous architectures and tasks, it becomes more important to manage their detailed configurations, and doing so often requires exhaustive tuning to reach the desired performance. To address the issue, we propose a novel transfer learning approach based on meta-learning that can automatically learn what knowledge to transfer from the source network to where in the target network. Given source and target networks, we propose an efficient training scheme to learn meta-networks that decide (a) which pairs of layers between the source and target networks should be matched for knowledge transfer and (b) which features and how much knowledge from each feature should be transferred. We validate our meta-transfer approach against recent transfer learning methods on various datasets and network architectures, on which our automated scheme significantly outperforms the prior baselines that find ‘what and where to transfer’ in a hand-crafted manner.
In this paper, we propose a weak supervision framework for neural ranking tasks based on the data programming paradigm (Ratner et al., 2016), which enables us to leverage multiple weak supervision signals from different sources. Empirically, we consider two sources of weak supervision signals, unsupervised ranking functions and semantic feature similarities. We train a BERT-based passage-ranking model (which achieves new state-of-the-art performances on two benchmark datasets with full supervision) in our weak supervision framework. Without using ground-truth training labels, BERT-PR models outperform BM25 baseline by a large margin on all three datasets and even beat the previous state-of-the-art results with full supervision on two of the datasets.
In this paper, we study adaptive online convex optimization, and aim to design a universal algorithm that achieves optimal regret bounds for multiple common types of loss functions. Existing universal methods are limited in the sense that they are optimal for only a subclass of loss functions. To address this limitation, we propose a novel online method, namely Maler, which enjoys the optimal $O(\sqrt{T})$, $O(d\log T)$ and $O(\log T)$ regret bounds for general convex, exponentially concave, and strongly convex functions respectively. The essential idea is to run multiple types of learning algorithms with different learning rates in parallel, and utilize a meta algorithm to track the best one on the fly. Empirical results demonstrate the effectiveness of our method.
In this work, we propose a novel technique to boost training efficiency of a neural network. Our work is based on an excellent idea that whitening the inputs of neural networks can achieve a fast convergence speed. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, whose inputs would be made more independent. However, determining independent components is a computationally intensive task. To overcome this challenge, we propose to implement an IC layer by combining two popular techniques, Batch Normalization and Dropout, in a new manner that we can rigorously prove that Dropout can quadratically reduce the mutual information and linearly reduce the correlation between any pair of neurons with respect to the dropout layer parameter $p$. As demonstrated experimentally, the IC layer consistently outperforms the baseline approaches with more stable training process, faster convergence speed and better convergence limit on CIFAR10/100 and ILSVRC2012 datasets. The implementation of our IC layer makes us rethink the common practices in the design of neural networks. For example, we should not place Batch Normalization before ReLU since the non-negative responses of ReLU will make the weight layer updated in a suboptimal way, and we can achieve better performance by combining Batch Normalization and Dropout together as an IC layer.
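As a rough illustration of the layer ordering the abstract describes, the NumPy sketch below applies Batch Normalization followed by Dropout to the inputs of a weight layer, after the ReLU rather than before it. It is a simplification under stated assumptions: the normalization has no learned scale or shift and always uses the statistics of the current batch, `drop_prob` is the probability of dropping a unit (the abstract's $p$ may be parameterized differently), and the function name is illustrative.

```python
import numpy as np


def ic_layer(x, drop_prob=0.1, training=True, rng=None, eps=1e-5):
    """Sketch of an IC layer: batch normalization followed by dropout.

    Assumptions: no learned affine parameters, batch statistics are used in
    both modes, and dropout is the usual inverted dropout.
    """
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    if not training:
        return x_hat
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(x.shape) >= drop_prob
    return x_hat * keep_mask / (1.0 - drop_prob)


rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(32, 16))   # output of a previous layer
weights = rng.normal(scale=0.1, size=(16, 8))

# Ordering suggested by the abstract: ReLU -> IC (BatchNorm + Dropout) -> weights.
hidden = np.maximum(pre_activation, 0.0)
output = ic_layer(hidden, drop_prob=0.1, rng=rng) @ weights
print(output.shape)
```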
In this paper, we introduce the algorithms of Orthogonal Deep Neural Networks (OrthDNNs) to connect with recent interest in spectrally regularized deep learning methods. OrthDNNs are theoretically motivated by generalization analysis of modern DNNs, with the aim of finding solution properties of network weights that guarantee better generalization. To this end, we first prove that DNNs are of local isometry on data distributions of practical interest; by using a new covering of the sample space and introducing the local isometry property of DNNs into generalization analysis, we establish a new generalization error bound that is both scale- and range-sensitive to the singular value spectrum of each of the networks’ weight matrices. We prove that the optimal bound w.r.t. the degree of isometry is attained when each weight matrix has a spectrum of equal singular values, among which an orthogonal weight matrix or a non-square one with orthonormal rows or columns is the most straightforward choice, suggesting the algorithms of OrthDNNs. We present algorithms for both strict and approximate OrthDNNs, and for the latter we propose a simple yet effective algorithm called Singular Value Bounding (SVB), which performs as well as strict OrthDNNs, but at a much lower computational cost. We also propose Bounded Batch Normalization (BBN) to make compatible use of batch normalization with OrthDNNs. We conduct extensive comparative studies by using modern architectures on benchmark image classification. Experiments show the efficacy of OrthDNNs.
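The abstract does not spell out how Singular Value Bounding works. One plausible reading, sketched below as an assumption rather than the authors' procedure, is to periodically project each weight matrix so that its singular values lie in a narrow band around 1, which keeps the matrix approximately orthogonal. The band $[1/(1+\epsilon),\ 1+\epsilon]$ and the function name are illustrative.

```python
import numpy as np


def bound_singular_values(weight, eps=0.05):
    """Clip the singular values of `weight` into [1 / (1 + eps), 1 + eps].

    Illustrative sketch only; the band and its use are assumptions.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    s = np.clip(s, 1.0 / (1.0 + eps), 1.0 + eps)
    return u @ np.diag(s) @ vt


rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 6))
bounded = bound_singular_values(weight)
print(np.linalg.svd(weight, compute_uv=False))
print(np.linalg.svd(bounded, compute_uv=False))
```

Because the projection only needs an occasional SVD instead of a constraint enforced at every update, a scheme like this would be cheaper than keeping the weights strictly orthogonal, which matches the cost argument in the abstract.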
Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.
Social influence plays a vital role in shaping a user’s behavior in online communities dealing with items of fine taste like movies, food, and beer. For online recommendation, this implies that users’ preferences and ratings are influenced due to other individuals. Given only time-stamped reviews of users, can we find out who-influences-whom, and characteristics of the underlying influence network? Can we use this network to improve recommendation? While prior works in social-aware recommendation have leveraged social interaction by considering the observed social network of users, many communities like Amazon, Beeradvocate, and Ratebeer do not have explicit user-user links. Therefore, we propose GhostLink, an unsupervised probabilistic graphical model, to automatically learn the latent influence network underlying a review community — given only the temporal traces (timestamps) of users’ posts and their content. Based on extensive experiments with four real-world datasets with 13 million reviews, we show that GhostLink improves item recommendation by around 23% over state-of-the-art methods that do not consider this influence. As additional use-cases, we show that GhostLink can be used to differentiate between users’ latent preferences and influenced ones, as well as to detect influential users based on the learned influence graph.
In this paper, we develop a new estimation procedure based on the non-linear conjugate gradient (NCG) algorithm for the Box-Cox transformation cure rate model. We compare the performance of the NCG algorithm with the well-known expectation maximization (EM) algorithm through a simulation study and show the advantages of the NCG algorithm over the EM algorithm. In particular, we show that the NCG algorithm allows simultaneous maximization of all model parameters when the likelihood surface is flat with respect to a Box-Cox model parameter. This is a big advantage over the EM algorithm, where a profile likelihood approach has been proposed in the literature that may not provide satisfactory results. We finally use the NCG algorithm to analyze a well-known melanoma data and show that it results in a better fit.
Neural networks and deep learning are changing the way that artificial intelligence is being done. Efficiently choosing a suitable network architecture and fine-tuning its hyper-parameters for a specific dataset is a time-consuming task given the staggering number of possible alternatives. In this paper, we address the problem of model selection by means of a fully automated framework for efficiently selecting a neural network model for a given task: classification or regression. The algorithm, named Automatic Model Selection (AMS), is a modified micro-genetic algorithm that automatically and efficiently finds the most suitable neural network model for a given dataset. The main contributions of this method are a simple list-based encoding for neural networks as genotypes in an evolutionary algorithm, new crossover and mutation operators, the introduction of a fitness function that considers both the accuracy of the model and its complexity, and a method to measure the similarity between two neural networks. AMS is evaluated on two different datasets. By comparing some models obtained with AMS to state-of-the-art models for each dataset we show that AMS can automatically find efficient neural network models. Furthermore, AMS is computationally efficient and can make use of distributed computing paradigms to further boost its performance.
Blockchain has been regarded as a promising technology for the Internet of Things (IoT), since it provides significant solutions for decentralized networks which can address trust and security concerns, high maintenance cost problems, etc. The decentralization provided by blockchain can be largely attributed to the use of a consensus mechanism, which enables peer-to-peer trading in a distributed manner without the involvement of any third party. This article starts by introducing the basic concept of blockchain and illustrating why the consensus mechanism plays an indispensable role in a blockchain-enabled IoT system. Then, we discuss the main ideas of two famous consensus mechanisms, Proof of Work (PoW) and Proof of Stake (PoS), and list their limitations in IoT. Next, two mainstream Directed Acyclic Graph (DAG) based consensus mechanisms, i.e., the Tangle and Hashgraph, are reviewed to show why DAG consensus is more suitable for IoT systems than PoW and PoS. Potential issues and challenges of DAG-based consensus mechanisms to be addressed in the future are discussed at the end.
Operating with ignorance is an important concern of the Machine Learning research, especially when the objective is to discover knowledge from the imperfect data. Data mining (driven by appropriate knowledge discovery tools) is about processing available (observed, known and understood) samples of data aiming to build a model (e.g., a classifier) to handle data samples, which are not yet observed, known or understood. These tools traditionally take samples of the available data (known facts) as an input for learning. We want to challenge the indispensability of this approach and we suggest considering the things the other way around. What if the task would be as follows: how to learn a model based on our ignorance, i.e. by processing the shape of ‘voids’ within the available data space? Can we improve traditional classification by modeling also the ignorance? In this paper, we provide some algorithms for the discovery and visualizing of the ignorance zones in two-dimensional data spaces and design two ignorance-aware smart prototype selection techniques (incremental and adversarial) to improve the performance of the nearest neighbor classifiers. We present experiments with artificial and real datasets to test the concept of the usefulness of ignorance discovery in machine learning.
Process mining, i.e., a sub-field of data science focusing on the analysis of event data generated during the execution of (business) processes, has seen a tremendous change over the past two decades. Starting off in the early 2000’s, with limited to no tool support, nowadays, several software tools, i.e., both open-source, e.g., ProM and Apromore, and commercial, e.g., Disco, Celonis, ProcessGold, etc., exist. The commercial process mining tools provide limited support for implementing custom algorithms. Moreover, both commercial and open-source process mining tools are often only accessible through a graphical user interface, which hampers their usage in large-scale experimental settings. Initiatives such as RapidProM provide process mining support in the scientific workflow-based data science suite RapidMiner. However, these offer limited to no support for algorithmic customization. In the light of the aforementioned, in this paper, we present a novel process mining library, i.e. Process Mining for Python (PM4Py) that aims to bridge this gap, providing integration with state-of-the-art data science libraries, e.g., pandas, numpy, scipy and scikit-learn. We provide a global overview of the architecture and functionality of PM4Py, accompanied by some representative examples of its usage.
Neural networks (NN) are considered as black-boxes due to the lack of explainability and transparency of their decisions. This significantly hampers their deployment in environments where explainability is essential along with the accuracy of the system. Recently, significant efforts have been made for the interpretability of these deep networks with the aim to open up the black-box. However, most of these approaches are specifically developed for visual modalities. In addition, the interpretations provided by these systems require expert knowledge and understanding for intelligibility. This indicates a vital gap between the explainability provided by the systems and the novice user. To bridge this gap, we present a novel framework i.e. Time-Series eXplanation (TSXplain) system which produces a natural language based explanation of the decision taken by a NN. It uses the extracted statistical features to describe the decision of a NN, merging the deep learning world with that of statistics. The two-level explanation provides ample description of the decision made by the network to aid an expert as well as a novice user alike. Our survey and reliability assessment test confirm that the generated explanations are meaningful and correct. We believe that generating natural language based descriptions of the network’s decisions is a big step towards opening up the black-box.
Recently, a number of learning-based optimization methods that combine data-driven architectures with classical optimization algorithms have been proposed and explored, showing superior empirical performance in solving various ill-posed inverse problems, but there is still a scarcity of rigorous analysis of the convergence behavior of learning-based optimization. In particular, most existing analyses are specific to unconstrained problems and cannot be applied to the more general cases where some variables of interest are subject to certain constraints. In this paper, we propose Differentiable Linearized ADMM (D-LADMM) for solving problems with linear constraints. Specifically, D-LADMM is a K-layer LADMM-inspired deep neural network, which is obtained by first introducing some learnable weights in the classical Linearized ADMM algorithm and then generalizing the proximal operator to some learnable activation function. Notably, we rigorously prove that there exists a set of learnable parameters for D-LADMM to generate globally converged solutions, and we show that those desired parameters can be attained by training D-LADMM in a proper way. To the best of our knowledge, we are the first to provide a convergence analysis for a learning-based optimization method on constrained problems.
Classical moment-based change point tests like the cusum test are very powerful in the case of Gaussian time series with one change point but behave poorly under heavy-tailed distributions and corrupted data. A new class of robust change point tests based on cusum statistics of robustly transformed observations is proposed. This framework is quite flexible: depending on the transformation used, one can detect, for instance, changes in the mean, scale or dependence of a possibly multivariate time series. Simulations indicate that this approach is very powerful in detecting changes in the marginal variance of ARCH processes and outperforms existing proposals for detecting structural breaks in the dependence structure of heavy-tailed multivariate time series.
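To make the idea concrete, the sketch below computes a classical cusum statistic for a change in the mean, $\max_k \left| \sum_{i \le k} (x_i - \bar{x}) \right| / (\hat\sigma \sqrt{n})$, once on the raw series and once on a robustly transformed one. The abstract does not specify which transformations the paper uses; the sign of each observation relative to the median is assumed here purely as an example, and the toy series with one corrupted value is made up.

```python
import statistics


def cusum_statistic(values):
    """Return the cusum statistic for a mean change and the arg-max index k."""
    n = len(values)
    mean = sum(values) / n
    scale = statistics.pstdev(values)
    partial, best, best_k = 0.0, 0.0, 0
    for k, value in enumerate(values[:-1], start=1):
        partial += value - mean
        if abs(partial) > best:
            best, best_k = abs(partial), k
    return best / (scale * n ** 0.5), best_k


def sign_transform(values):
    """Assumed robust transformation: sign of each value minus the median."""
    median = statistics.median(values)
    return [(v > median) - (v < median) for v in values]


# Toy series: the level shifts from about 0 to about 3 after 50 observations,
# and one observation is corrupted by a gross outlier.
before = [0.1 * ((i % 5) - 2) for i in range(50)]
after = [3.0 + 0.1 * ((i % 5) - 2) for i in range(50)]
series = before + after
series[10] = 1000.0

print(cusum_statistic(series))
print(cusum_statistic(sign_transform(series)))
```

On the raw series the single corrupted value dominates the partial sums, whereas the transformed series bounds every observation's influence, so the estimated change location depends on the level shift instead.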
Large knowledge bases (KBs) are useful for many AI tasks, but are difficult to integrate into modern gradient-based learning systems. Here we describe a framework for accessing a soft symbolic database using only differentiable operators. For example, this framework makes it easy to write neural models that adjust confidences associated with facts in a soft KB; incorporate prior knowledge in the form of hand-coded KB access rules; or learn to instantiate query templates using information extracted from text. The framework, NQL, can work well with KBs with millions of tuples and hundreds of thousands of entities on a single GPU.
This paper studies semi-supervised object classification in relational data, which is a fundamental problem in relational data modeling. The problem has been extensively studied in the literature of both statistical relational learning (e.g. relational Markov networks) and graph neural networks (e.g. graph convolutional networks). Statistical relational learning methods can effectively model the dependency of object labels through conditional random fields for collective classification, whereas graph neural networks learn effective object representations for classification through end-to-end training. In this paper, we propose the Graph Markov Neural Network (GMNN) that combines the advantages of both worlds. A GMNN models the joint distribution of object labels with a conditional random field, which can be effectively trained with the variational EM algorithm. In the E-step, one graph neural network learns effective object representations for approximating the posterior distributions of object labels. In the M-step, another graph neural network is used to model the local label dependency. Experiments on object classification, link classification, and unsupervised node representation learning show that GMNN achieves state-of-the-art results.
This paper presents a method for solving the supervised learning problem in which the output is highly nonlinear and discontinuous. It is proposed to solve this problem in three stages: (i) cluster the pairs of input-output data points, resulting in a label for each point; (ii) classify the data, where the corresponding label is the output; and finally (iii) perform one separate regression for each class, where the training data corresponds to the subset of the original input-output pairs which have that label according to the classifier. It has not yet been proposed to combine these 3 fundamental building blocks of machine learning in this simple and powerful fashion. This can be viewed as a form of deep learning, where any of the intermediate layers can itself be deep. The utility and robustness of the methodology is illustrated on some toy problems, including one example problem arising from simulation of plasma fusion in a tokamak.
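The three stages can be sketched on a one-dimensional toy problem with a jump at zero. The abstract does not say which clustering, classification or regression methods are used, so the choices below (two-means clustering of the input-output pairs, a nearest-neighbour classifier on the input, and one straight-line fit per class) are assumptions for illustration only.

```python
import numpy as np


def two_means(points, iters=20):
    """Stage (i): cluster (x, y) pairs into two groups with plain k-means."""
    lowest, highest = points[:, 1].argmin(), points[:, 1].argmax()
    centers = points[[lowest, highest]].astype(float)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(2):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


def nearest_neighbor_label(x_train, labels, x_new):
    """Stage (ii): classify new inputs by the label of the closest training input."""
    nearest = np.abs(x_train[None, :] - x_new[:, None]).argmin(axis=1)
    return labels[nearest]


# Discontinuous target: y = x for x < 0 and y = x + 10 otherwise.
x = np.linspace(-1.0, 1.0, 41)
y = np.where(x < 0, x, x + 10.0)

labels = two_means(np.column_stack([x, y]))

# Stage (iii): one separate regression (here a straight line) per class.
models = {}
for c in np.unique(labels):
    slope, intercept = np.polyfit(x[labels == c], y[labels == c], deg=1)
    models[int(c)] = (slope, intercept)


def predict(x_new):
    predicted_labels = nearest_neighbor_label(x, labels, x_new)
    coefs = np.array([models[int(c)] for c in predicted_labels])
    return coefs[:, 0] * x_new + coefs[:, 1]


print(predict(np.array([-0.52, 0.52])))
```

A single regression over all the data would have to smooth over the jump; routing each input to a per-class model lets each piece stay simple.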
Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In Spider, a recently-released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network, and this representation is later used at both encoding and decoding time. Evaluation shows that encoding the schema structure improves our parser accuracy from 33.8% to 39.4%, dramatically above the current state of the art, which is at 19.7%.
Bayesian neural network (BNN) priors are defined in parameter space, making it hard to encode prior knowledge expressed in function space. We formulate a prior that incorporates functional constraints about what the output can or cannot be in regions of the input space. Output-Constrained BNNs (OC-BNN) represent an interpretable approach of enforcing a range of constraints, fully consistent with the Bayesian framework and amenable to black-box inference. We demonstrate how OC-BNNs improve model robustness and prevent the prediction of infeasible outputs in two real-world applications of healthcare and robotics.
The interactive machine learning (IML) community aims to augment humans’ ability to learn and make decisions over time through the development of automated decision-making systems. This interaction represents a collaboration between multiple intelligent systems—humans and machines. A lack of appropriate consideration for the humans involved can lead to problematic system behaviour, and issues of fairness, accountability, and transparency. This work presents a human-centred thinking approach to applying IML methods. This guide is intended to be used by AI practitioners who incorporate human factors in their work. These practitioners are responsible for the health, safety, and well-being of interacting humans. An obligation of responsibility for public interaction means acting with integrity, honesty, fairness, and abiding by applicable legal statutes. With these values and principles in mind, we as a research community can better achieve the collective goal of augmenting human ability. This practical guide aims to support many of the responsible decisions necessary throughout iterative design, development, and dissemination of IML systems.
Autonomous optimization refers to the design of feedback controllers that steer a physical system to a steady state that solves a predefined, possibly constrained, optimization problem. As such, no exogenous control inputs such as setpoints or trajectories are required. Instead, these controllers are modeled after optimization algorithms that take the form of dynamical systems. The interconnection of this type of optimization dynamics with a physical system is however not guaranteed to be stable unless both dynamics act on sufficiently different timescales. In this paper, we quantify the required timescale separation and give prescriptions that can be directly used in the design of this type of feedback controllers. Using ideas from singular perturbation analysis we derive stability bounds for different feedback optimization schemes that are based on common continuous-time optimization schemes. In particular, we consider gradient descent and its variations, including projected gradient, and Newton gradient. We further give stability bounds for momentum methods and saddle-point flows interconnected with dynamical systems. Finally, we discuss how optimization algorithms like subgradient and accelerated gradient descent, while well-behaved in offline settings, are unsuitable for autonomous optimization due to their general lack of robustness.
PCA is often used in anomaly detection and statistical process control tasks. For bivariate data, we prove that the minor projection (the least varying projection) of the PCA-rotated data is the most sensitive to distributional changes, where sensitivity is defined by the Hellinger distance between distributions before and after a change. In particular, this is almost always the case if only one parameter of the bivariate normal distribution changes, i.e., the change is sparse. Simulations indicate that the minor projections are the most sensitive for a large range of changes and pre-change settings in higher dimensions as well. This motivates using the minor projections for detecting sparse distributional changes in high-dimensional data.
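A small numeric illustration of the bivariate claim, under stated assumptions: the covariance matrix, the size of the mean shift, and the helper names `pca_2d` and `hellinger_mean_shift` are all chosen here for illustration and do not come from the paper. Only a shift in the mean of the first variable is considered, so each projection stays normal with unchanged variance and the Hellinger distance has a closed form.

```python
import math

def pca_2d(a, b, c):
    """PCA of the covariance [[a, b], [b, c]].

    Returns [(variance, axis), (variance, axis)] with the major axis first.
    """
    theta = 0.5 * math.atan2(2 * b, a - c)  # rotation angle of the major axis
    major = (math.cos(theta), math.sin(theta))
    minor = (-math.sin(theta), math.cos(theta))

    def var_along(v):
        return a * v[0] ** 2 + 2 * b * v[0] * v[1] + c * v[1] ** 2

    return [(var_along(major), major), (var_along(minor), minor)]

def hellinger_mean_shift(shift, variance):
    """Hellinger distance between N(m, variance) and N(m + shift, variance)."""
    return math.sqrt(1 - math.exp(-shift ** 2 / (8 * variance)))

# Assumed pre-change covariance: unit variances, correlation 0.9.
components = pca_2d(1.0, 0.9, 1.0)
# Assumed sparse change: only the mean of the first variable moves.
delta = (0.5, 0.0)
sensitivity = [
    hellinger_mean_shift(v[0] * delta[0] + v[1] * delta[1], var)
    for var, v in components
]
print(sensitivity)  # [major projection, minor projection]
```

Both projections see the same absolute shift, but the minor projection has far less variance to hide it in, which is why its Hellinger distance comes out larger in this setting.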

# R Packages worth a look

Calculate Exposures, Assign Records to Intervals (expstudies)
Creation of an exposure table with rows for policy-intervals from a table with a unique policy number key and beginning and ending dates for each polic …

Prediction Model Selection and Performance Evaluation in Multiple Imputed Datasets (psfmi)
Provides functions to apply pooling or backward selection for logistic or Cox regression prediction models in multiple imputed datasets. Backward selec …

Run Reports, Models, and Scripts from a Mobile Device (MobileTrigger)
A framework for interacting with R modules such as Reports, Models, and Scripts from a mobile device. The framework allows you to list available module …

Interface to the ‘Briq’ API (briqr)
An interface to the ‘Briq’ API <https://briq.github.io>. ‘Briq’ is a tool that aims to pr …

# Distilled News

In this story, we’re going to take an aerial tour of optimization with Lagrange multipliers. When do we need them? Whenever we have an optimization problem with constraints.
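As a one-line reminder of the mechanics, here is a small worked example chosen for illustration (it is not taken from the linked story): maximise $f(x, y) = xy$ subject to $x + y = 1$.

```latex
\begin{aligned}
\mathcal{L}(x, y, \lambda) &= xy - \lambda\,(x + y - 1) \\
\frac{\partial \mathcal{L}}{\partial x} &= y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0 \\
\Rightarrow\quad x &= y = \lambda = \tfrac{1}{2}, \qquad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{4}
\end{aligned}
```

The first two conditions say the gradient of the objective is parallel to the gradient of the constraint; the third simply restates the constraint.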
Fake news has been a hot topic of discussion in the post-Brexit and Trump world. While the world’s politicians, media and internet giants argue over how to counter the spread of propaganda through fake news, a much more devious evil lurks in the shadows. Harry G. Frankfurt, Professor of Philosophy at Princeton, calls it bullshit in his essay titled ‘On Bullshit’. Unlike fake news, which can be fact-checked and called out for its outright lies, bullshit ‘falls short of lying’ but is deceptive nevertheless. In Professor Frankfurt’s words it is ‘a greater enemy of truth than lies are’. While fake news may be the domain of online trolls, the art of bullshitting is practiced by the intellectual elite of the society.
Fast Segmentation Convolutional Neural Network (Fast-SCNN) is an above real-time semantic segmentation model on high resolution image data suited to efficient computation on embedded devices with low memory. The authors of the original paper are Rudra PK Poudel, Stephan Liwicki and Roberto Cipolla. The code used in this article is not the official implementation from the authors but an attempt by me to re-construct the model as described in the paper. Since the rise of autonomous vehicles, it is highly desirable that there exist a model that can process input in real-time. There already exist some state-of-the-art offline semantic segmentation models, but these models are large in size and memory requirement and require expensive computation. Fast-SCNN can provide a solution to all these problems.
I’m an avid YouTube user. The sheer amount of content I can watch on a single platform is staggering. In fact, a lot of my data science learning has happened through YouTube videos! So, I was browsing YouTube a few weeks ago searching for a certain category to watch. That’s when my data scientist thought process kicked in. Given my love for web scraping and machine learning, could I extract data about YouTube videos and build a model to classify them into their respective categories? I was intrigued! This sounded like the perfect opportunity to combine my existing Python and data science knowledge with my curiosity to learn something new. And Analytics Vidhya’s internship challenge offered me the chance to pen down my learning in article form.
I am pleased to announce that my quantum simulator Qubiter (available at GitHub, BSD license) now has a native TensorFlow Backend-Simulator (see its class SEO_simulator_tf, the tf stands for TensorFlow). This complements Qubiter’s original numpy simulator (contained in its class SEO_simulator). A small step for Mankind, a giant leap for me! Hip Hip Hurray! This means that Qubiter can now calculate the evolution of a state vector using CPU, GPU or TPU. Plus it can do back-propagation on a quantum circuit. Here is a jupyter notebook that I wrote that uses Qubiter’s TF backend to do VQE (Variational Quantum Eigensolving). (I like to call VQE mean Hamiltonian minimization.)
Back in March, we open-sourced our implementation of ‘Fast Dense Feature Extraction with CNN’s that have Pooling or Striding Layers’. Although not broadly known, the paper, published at BMVC 2017, offers an efficient and elegant solution for avoiding computational redundancy when using patch-based convolutional neural networks. So in this post I’ll explain how the model works and show how to use it in real applications. I’ll cover two things: first, an overview of the method named ‘Fast Dense Feature Extraction with CNN’s that have Pooling or Striding Layers’, and second, how to use this approach on an existing trained patch network to speed up inference time.
… or, why I decided to attend Flatiron School’s data science immersive bootcamp.
AI is going to change the world. We all know that. Blockchain is also (like AI) disruptive, revolutionary, and earth-shakingly transformative. So what happens if we take two buzzwords like AI (data science) and blockchain and combine them? Let’s find out! But first, do you know what a blockchain is? And why it’s one of the keys to a democratic digital future? Well, I could repeat what others have already detailed on this blog in the past and finish my word count and get the job done – but I want to inspire you! Give you fresh stunning information which you can hopefully use to ignite the flame of learning within you to start this journey or adventure in technology.
Testing and debugging machine learning systems differs significantly from testing and debugging traditional software. This course describes how, starting from debugging your model all the way to monitoring your pipeline in production.
This playbook represents a first step toward defining the governing principles and best practices which will enable developers to build great conversational applications. It is the result of several years of practical experience building and deploying dozens of the most advanced conversational applications achievable. Cutting-edge research and state-of-the-art algorithms are not surveyed here; there are many other resources available for that purpose. Instead, this playbook focuses on helping developers and data scientists build real production applications. The detailed instructions, practical advice, and real-world examples provided here should empower developers to improve the quality and variety of conversational experiences of the coming months and years.
Deep Learning has revolutionized the Machine Learning scene in the last years. Can we apply it to image compression? How well can a Deep Learning algorithm reconstruct pictures of kittens? What’s an autoencoder?

# Whats new on arXiv

The Naive Bayes estimator is widely used in text classification problems. However, it does not perform well with small training datasets. We propose a new method based on the Naive Bayes estimator to solve this problem. A correlation factor is introduced to incorporate the correlation among different classes. Experimental results show that our estimator achieves better accuracy than traditional Naive Bayes on real-world data.
Currently, text document retrieval systems face many challenges in exploring the semantics of queries and documents. Each query implies information that does not appear in the query itself, yet the user also expects documents related to that information. A disadvantage of previous spreading activation algorithms is that many irrelevant concepts may be added to the query. In this paper, we propose a novel algorithm that activates and adds to the query only those named entities that are related to the original entities and explicit relations in the query.
Separable Non-negative Matrix Factorization (SNMF) is an important method for topic modeling, where ‘separable’ assumes every topic contains at least one anchor word, defined as a word that has non-zero probability only on that topic. SNMF focuses on the word co-occurrence patterns to reveal topics in two steps: anchor word selection and topic recovery. The quality of the anchor words strongly influences the quality of the extracted topics. Existing anchor word selection algorithms greedily find an approximate convex hull in a high-dimensional word co-occurrence space. In this work, we propose a new method for anchor word selection that associates the word co-occurrence probability with word similarity and assumes that the most semantically different words are potential candidates for the anchor words. Therefore, if the similarity of a word pair is very low, then the two words are very likely to be anchor words. From the statistical information of text corpora, we can get the similarity of all word pairs. We build a word similarity graph where the nodes correspond to words and the weights on edges stand for the word-pair similarity. On this basis, we design a greedy method to find a minimum edge-weight anchor clique of a given size in the graph for anchor word selection. Extensive experiments on real-world corpora demonstrate the effectiveness of the proposed anchor word selection method, which outperforms the common convex hull-based methods on the quality of the revealed topics. Meanwhile, our method is much faster than typical SNMF-based methods.
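One way a greedy search for a low edge-weight clique could look is sketched below. The abstract does not spell out the greedy rule, so the rule used here (seed with the least similar pair, then repeatedly add the word with the smallest summed similarity to the words already chosen) is an assumption, as are the function name `greedy_anchor_clique` and the hand-made toy similarity matrix.

```python
from itertools import combinations

def greedy_anchor_clique(sim, size):
    """Greedily pick `size` node indices with small total pairwise similarity."""
    n = len(sim)
    # Seed the clique with the least similar pair of words.
    chosen = list(min(combinations(range(n), 2), key=lambda p: sim[p[0]][p[1]]))
    while len(chosen) < size:
        rest = [i for i in range(n) if i not in chosen]
        # Add the word whose summed similarity to the current clique is smallest.
        chosen.append(min(rest, key=lambda i: sum(sim[i][j] for j in chosen)))
    return chosen

# Hypothetical symmetric word-pair similarities for five words.
words = ["goal", "match", "vote", "election", "gene"]
sim = [
    [1.00, 0.90, 0.10, 0.15, 0.05],
    [0.90, 1.00, 0.20, 0.10, 0.10],
    [0.10, 0.20, 1.00, 0.80, 0.10],
    [0.15, 0.10, 0.80, 1.00, 0.12],
    [0.05, 0.10, 0.10, 0.12, 1.00],
]
anchors = greedy_anchor_clique(sim, 3)
print([words[i] for i in anchors])
```

In this toy graph the selection ends up with one word from each loosely defined theme (sport, politics, biology), which is the behaviour the anchor-word assumption is after.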
Events in the world may be caused by other, unobserved events. We consider sequences of events in continuous time. Given a probability model of complete sequences, we propose particle smoothing—a form of sequential importance sampling—to impute the missing events in an incomplete sequence. We develop a trainable family of proposal distributions based on a type of bidirectional continuous-time LSTM: Bidirectionality lets the proposals condition on future observations, not just on the past as in particle filtering. Our method can sample an ensemble of possible complete sequences (particles), from which we form a single consensus prediction that has low Bayes risk under our chosen loss metric. We experiment in multiple synthetic and real domains, using different missingness mechanisms, and modeling the complete sequences in each domain with a neural Hawkes process (Mei & Eisner 2017). On held-out incomplete sequences, our method is effective at inferring the ground-truth unobserved events, with particle smoothing consistently improving upon particle filtering.
Online anomaly detection of time-series data is an important and challenging task in machine learning. Gaussian processes (GPs) are powerful and flexible models for modeling time-series data. However, the high time complexity of GPs limits their applications in online anomaly detection. Owing to internal or external changes, concept drift usually occurs in time-series data, where the characteristics of the data and the meanings of abnormal behaviors alter over time. Online anomaly detection methods should have the ability to adapt to concept drift. Motivated by the above facts, this paper proposes the method of sparse Gaussian processes with Q-function (SGP-Q). The SGP-Q employs sparse Gaussian processes (SGPs) whose time complexity is lower than that of GPs, thus significantly speeding up online anomaly detection. By using the Q-function properly, the SGP-Q can adapt to concept drift well. Moreover, the SGP-Q uses few abnormal data points in the training data thanks to its strategy for updating the training data, resulting in more accurate sparse Gaussian process regression models and better anomaly detection results. We evaluate the SGP-Q on various artificial and real-world datasets. Experimental results validate the effectiveness of the SGP-Q.
Determining the quality of the results obtained by clustering techniques is a key issue in unsupervised machine learning. Many authors have discussed the desirable features of good clustering algorithms. However, Jon Kleinberg established an impossibility theorem for clustering. As a consequence, a wealth of studies have proposed techniques to evaluate the quality of clustering results depending on the characteristics of the clustering problem and the algorithmic technique employed to cluster data.
In the last decade, artificial intelligence (AI) models inspired by the brain have made unprecedented progress in performing real-world perceptual tasks like object classification and speech recognition. Recently, researchers of natural intelligence have begun using those AI models to explore how the brain performs such tasks. These developments suggest that future progress will benefit from increased interaction between disciplines. Here we introduce the Algonauts Project as a structured and quantitative communication channel for interdisciplinary interaction between natural and artificial intelligence researchers. The project’s core is an open challenge with a quantitative benchmark whose goal is to account for brain data through computational models. This project has the potential to provide better models of natural intelligence and to gather findings that advance AI. The 2019 Algonauts Project focuses on benchmarking computational models predicting human brain activity when people look at pictures of objects. The 2019 edition of the Algonauts Project is available online: http://…/.
In this article, we tackle the issue of the limited quantity of manually sense annotated corpora for the task of word sense disambiguation, by exploiting the semantic relationships between senses such as synonymy, hypernymy and hyponymy, in order to compress the sense vocabulary of Princeton WordNet, and thus reduce the number of different sense tags that must be observed to disambiguate all words of the lexical database. We propose two different methods that greatly reduce the size of neural WSD models, with the benefit of improving their coverage without additional training data, and without impacting their precision. In addition to our methods, we present a new WSD system which relies on pre-trained BERT word vectors in order to achieve results that significantly outperform the state of the art on all WSD evaluation tasks.
Automated Planning has been one of the main research fields of Artificial Intelligence since its beginnings. Research in Automated Planning aims at developing general reasoners (i.e., planners) capable of automatically solving complex problems. Broadly speaking, planners rely on a general model characterizing the possible states of the world and the actions that can be performed in order to change the status of the world. Given a model and an initial known state, the objective of a planner is to synthesize a set of actions needed to achieve a particular goal state. The classical approach to planning roughly corresponds to the description given above. The timeline-based approach is a particular planning paradigm capable of integrating causal and temporal reasoning within a unified solving process. This approach has been successfully applied in many real-world scenarios although a common interpretation of the related planning concepts is missing. Indeed, there are significant differences among the existing frameworks that apply this technique. Each framework relies on its own interpretation of timeline-based planning and therefore it is not easy to compare these systems. Thus, the objective of this work is to investigate the timeline-based approach to planning by addressing several aspects ranging from the semantics of the related planning concepts to the modeling and solving techniques. Specifically, the main contributions of this PhD work consist of: (i) the proposal of a formal characterization of the timeline-based approach capable of dealing with temporal uncertainty; (ii) the proposal of a hierarchical modeling and solving approach; (iii) the development of a general purpose framework for planning and execution with timelines; (iv) the validation of this approach in real-world manufacturing scenarios.
Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we present ML.NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture, and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML.NET which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML.NET compared to more recent entrants, and a discussion of some lessons learned.
The options framework in reinforcement learning models the notion of a skill or a temporally extended sequence of actions. The discovery of a reusable set of skills has typically entailed building options that navigate to bottleneck states. This work adopts a complementary approach, where we attempt to discover options that navigate to landmark states. These states are prototypical representatives of well-connected regions and can hence access the associated region with relative ease. In this work, we propose Successor Options, which leverages Successor Representations to build a model of the state space. The intra-option policies are learnt using a novel pseudo-reward and the model scales to high-dimensional spaces easily. We also propose an Incremental Successor Options model that iterates between constructing Successor Representations and building options, which is useful when robust Successor Representations cannot be built solely from primitive actions. We demonstrate the efficacy of our approach on a collection of grid-worlds, and on the high-dimensional robotic control environment of Fetch.
The Poisson process is the most elementary continuous-time stochastic process that models a stream of repeating events. It is uniquely characterised by a single parameter called the rate. Instead of a single value for this rate, we here consider a rate interval and let it characterise two nested sets of stochastic processes. We call these two sets of stochastic processes imprecise Poisson processes, explain why this is justified, and study the corresponding lower and upper (conditional) expectations. Besides a general theoretical framework, we also provide practical methods to compute lower and upper (conditional) expectations of functions that depend on the number of events at a single point in time.
Stochastic blockmodels (SBM) and their variants, e.g., mixed-membership and overlapping stochastic blockmodels, are latent variable based generative models for graphs. They have proven to be successful for various tasks, such as discovering the community structure and link prediction on graph-structured data. Recently, graph neural networks, e.g., graph convolutional networks, have also emerged as a promising approach to learn powerful representations (embeddings) for the nodes in the graph, by exploiting graph properties such as locality and invariance. In this work, we unify these two directions by developing a sparse variational autoencoder for graphs, that retains the interpretability of SBMs, while also enjoying the excellent predictive performance of graph neural nets. Moreover, our framework is accompanied by a fast recognition model that enables fast inference of the node embeddings (which are of independent interest for inference in SBM and its variants). Although we develop this framework for a particular type of SBM, namely the overlapping stochastic blockmodel, the proposed framework can be adapted readily for other types of SBMs. Experimental results on several benchmarks demonstrate encouraging results on link prediction while learning an interpretable latent structure that can be used for community discovery.
We propose a novel Bayesian nonparametric method to learn translation-invariant relationships on non-Euclidean domains. The resulting graph convolutional Gaussian processes can be applied to problems in machine learning for which the input observations are functions with domains on general graphs. The structure of these models allows for high dimensional inputs while retaining expressibility, as is the case with convolutional neural networks. We present applications of graph convolutional Gaussian processes to images and triangular meshes, demonstrating their versatility and effectiveness, comparing favorably to existing methods, despite being relatively simple models.
The calibration of a reservoir model with observed transient data of fluid pressures and rates is a key task in obtaining a predictive model of the flow and transport behaviour of the earth’s subsurface. The model calibration task, commonly referred to as ‘history matching’, can be formalised as an ill-posed inverse problem where we aim to find the underlying spatial distribution of petrophysical properties that explain the observed dynamic data. We use a generative adversarial network pretrained on geostatistical object-based models to represent the distribution of rock properties for a synthetic model of a hydrocarbon reservoir. The dynamic behaviour of the reservoir fluids is modelled using a transient two-phase incompressible Darcy formulation. We invert for the underlying reservoir properties by first modeling property distributions using the pre-trained generative model then using the adjoint equations of the forward problem to perform gradient descent on the latent variables that control the output of the generative model. In addition to the dynamic observation data, we include well rock-type constraints by introducing an additional objective function. Our contribution shows that for a synthetic test case, we are able to obtain solutions to the inverse problem by optimising in the latent variable space of a deep generative model, given a set of transient observations of a non-linear forward problem.
We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy ‘debugging’ in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.

# Document worth reading: “A tutorial on Particle Swarm Optimization Clustering”

This paper proposes a tutorial on the Data Clustering technique using the Particle Swarm Optimization approach. Following the work proposed by Merwe et al., here we present an in-depth analysis of the algorithm together with a Matlab implementation and a short tutorial that explains how to modify the proposed implementation and the effect of the parameters of the original algorithm. Moreover, we provide a comparison against the results obtained using the well known K-Means approach. All the source code presented in this paper is publicly available under the GPL-v2 license. A tutorial on Particle Swarm Optimization Clustering
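For readers without Matlab, the general idea of PSO-based clustering can be sketched in plain Python. This is not the paper's code: the function names, the toy data, the particle count, and the inertia and acceleration values (`w`, `c1`, `c2`) are assumptions chosen for illustration. Each particle holds a full set of `k` candidate centroids, and the fitness is a quantization error averaged over non-empty clusters only, a simplification made here to avoid dividing by zero.

```python
import random

def quantization_error(centroids, data):
    """Mean over non-empty clusters of the mean distance to the assigned centroid."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in data:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5 for c in centroids]
        j = dists.index(min(dists))
        sums[j] += dists[j]
        counts[j] += 1
    used = [s / n for s, n in zip(sums, counts) if n]
    return sum(used) / len(used)

def pso_cluster(data, k, particles=10, iters=50, w=0.72, c1=1.49, c2=1.49, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Each particle is a candidate set of k centroids, initialised from data points.
    pos = [[list(rng.choice(data)) for _ in range(k)] for _ in range(particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    best_pos = [[c[:] for c in p] for p in pos]
    best_fit = [quantization_error(p, data) for p in pos]
    g = best_fit.index(min(best_fit))
    gbest, gbest_fit = [c[:] for c in best_pos[g]], best_fit[g]
    history = [gbest_fit]
    for _ in range(iters):
        for i in range(particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Inertia + pull towards personal best + pull towards global best.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (best_pos[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = quantization_error(pos[i], data)
            if fit < best_fit[i]:
                best_fit[i], best_pos[i] = fit, [c[:] for c in pos[i]]
                if fit < gbest_fit:
                    gbest_fit, gbest = fit, [c[:] for c in pos[i]]
        history.append(gbest_fit)
    return gbest, history

# Two small, well-separated toy blobs.
data = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
data += [(5 + x * 0.1, 5 + y * 0.1) for x in range(3) for y in range(3)]
centroids, history = pso_cluster(data, k=2)
print(centroids, history[-1])
```

The `history` list records the best fitness after every iteration, which makes it easy to see how changing `w`, `c1`, or `c2` affects convergence.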

# If you did not already know

Spatio-Temporal U-Network (ST-UNet)
Spatio-temporal graph learning is becoming an increasingly important object of graph study. Many application domains involve highly dynamic graphs where temporal information is crucial, e.g. traffic networks and financial transaction graphs. Despite the constant progress made on learning structured data, there is still a lack of effective means to extract dynamic complex features from spatio-temporal structures. In particular, conventional models such as convolutional networks or recurrent neural networks are incapable of simultaneously revealing the temporal patterns in the short or long term and exploring the spatial properties in local or global scope from spatio-temporal graphs. To tackle this problem, we design a novel multi-scale architecture, Spatio-Temporal U-Net (ST-UNet), for graph-structured time series modeling. In this U-shaped network, a paired sampling operation is proposed in the spacetime domain: the pooling (ST-Pool) coarsens the input graph spatially from its deterministic partition while abstracting multi-resolution temporal dependencies through dilated recurrent skip connections; based on previous settings in the downsampling, the unpooling (ST-Unpool) restores the original structure of spatio-temporal graphs and resumes regular intervals within graph sequences. Experiments on spatio-temporal prediction tasks demonstrate that our model effectively captures comprehensive features in multiple scales and achieves substantial improvements over mainstream methods on several real-world datasets. …

Drift Analysis
Drift analysis is among the most powerful theoretical tools available for estimating the optimisation time of meta-heuristics. Informally, it shows how the challenging problem of predicting the long-term behaviour of a meta-heuristic can be reduced to the often trivial problem of describing how the state of the heuristic changes during one iteration. Drift analysis has dramatically simplified the analysis of meta-heuristics. Many of the most important results about the optimisation time of meta-heuristics were obtained with the help of drift analysis.
Drift Analysis
Multiplicative Up-Drift

Data Calculator
Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.
The Data Calculator: Data Structure Design and Cost Synthesis From First Principles, and Learned Cost Models

Toybox
Deep convolutional neural networks (CNNs) have enjoyed tremendous success in computer vision in the past several years, particularly for visual object recognition. However, how CNNs work remains poorly understood, and the training of deep CNNs is still considered more art than science. To better characterize deep CNNs and the training process, we introduce a new video dataset called Toybox. Images in Toybox come from first-person, wearable camera recordings of common household objects and toys being manually manipulated to undergo structured transformations like rotations and translations. We also present results from initial experiments using deep CNNs that begin to examine how different distributions of training data can affect visual object recognition performance, and how visual object concepts are represented within a trained network. …

# What's new on arXiv

Recent work on the Super Characters method, which uses two-dimensional word embedding, achieved state-of-the-art results in text classification tasks, showcasing the promise of this new approach. This paper borrows the idea of the Super Characters method and two-dimensional embedding, and proposes a method of generating conversational responses for open domain dialogues. The experimental results on a public dataset show that the proposed SuperChat method generates high quality responses. An interactive demo is ready to show at the workshop.
Distance metric learning is successful in discovering intrinsic relations in data. However, most algorithms are computationally demanding when the problem size becomes large. In this paper, we propose a discriminative metric learning algorithm, and develop a distributed scheme learning metrics on moderate-sized subsets of data, and aggregating the results into a global solution. The technique leverages the power of parallel computation. The algorithm of the aggregated distance metric learning (ADML) scales well with the data size and can be controlled by the partition. We theoretically analyse and provide bounds for the error induced by the distributed treatment. We have conducted experimental evaluation of ADML, both on specially designed tests and on practical image annotation tasks. Those tests have shown that ADML achieves the state-of-the-art performance at only a fraction of the cost incurred by most existing methods.
Assemblies of modular subsystems are being pressed into service to perform sensing, reasoning, and decision making in high-stakes, time-critical tasks in such areas as transportation, healthcare, and industrial automation. We address the opportunity to maximize the utility of an overall computing system by employing reinforcement learning to guide the configuration of the set of interacting modules that comprise the system. The challenge of doing system-wide optimization is a combinatorial problem. Local attempts to boost the performance of a specific module by modifying its configuration often lead to losses in overall utility of the system’s performance as the distribution of inputs to downstream modules changes drastically. We present metareasoning techniques which consider a rich representation of the input, monitor the state of the entire pipeline, and adjust the configuration of modules on-the-fly so as to maximize the utility of a system’s operation. We show significant improvement in both real-world and synthetic pipelines across a variety of reinforcement learning techniques.
Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics. Moreover, based on MetricGAN, the metric scores of the generated data can also be arbitrarily specified by users. We tested the proposed MetricGAN on a speech enhancement task, which is particularly suitable to verify the proposed approach because there are multiple metrics measuring different aspects of speech signals. Moreover, these metrics are generally complex and could not be fully optimized by Lp or conventional adversarial losses.
A number of optimization approaches have been proposed for optimizing nonconvex objectives (e.g. deep learning models), such as batch gradient descent, stochastic gradient descent and stochastic variance reduced gradient descent. Theory shows these optimization methods can converge by using an unbiased gradient estimator. However, in practice biased gradient estimation can allow more efficient convergence to the vicinity since an unbiased approach is computationally more expensive. To produce fast convergence there are two trade-offs of these optimization strategies which are between stochastic/batch, and between biased/unbiased. This paper proposes an integrated approach which can control the nature of the stochastic element in the optimizer and can balance the trade-off of estimator between the biased and unbiased by using a hyper-parameter. It is shown theoretically and experimentally that this hyper-parameter can be configured to provide an effective balance to improve the convergence rate.
We establish the first nonasymptotic error bounds for Kaplan-Meier-based nearest neighbor and kernel survival probability estimators where feature vectors reside in metric spaces. Our bounds imply rates of strong consistency for these nonparametric estimators and, up to a log factor, match an existing lower bound for conditional CDF estimation. Our proof strategy also yields nonasymptotic guarantees for nearest neighbor and kernel variants of the Nelson-Aalen cumulative hazards estimator. We experimentally compare these methods on four datasets. We find that for the kernel survival estimator, a good choice of kernel is one learned using random survival forests.
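To make the estimator in the abstract above concrete, here is a minimal sketch of a k-nearest-neighbor Kaplan-Meier survival estimate. It assumes Euclidean distance on numeric feature vectors (the paper allows general metric spaces), and the function names `kaplan_meier` and `knn_survival` are illustrative, not taken from the paper.

```python
import math

def kaplan_meier(times, events):
    """Kaplan-Meier curve as [(t, S(t))] at each distinct observed death time.

    times: observed times; events: 1 if the death was observed, 0 if censored.
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def knn_survival(query, features, times, events, k):
    """Kaplan-Meier curve restricted to the k training points nearest to query.

    Assumption: Euclidean distance stands in for the general metric.
    """
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(query, features[i]))
    nearest = order[:k]
    return kaplan_meier([times[i] for i in nearest],
                        [events[i] for i in nearest])
```

A kernel variant would weight each training point by a kernel of its distance to the query instead of using a hard cutoff at k neighbors.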
In this study, we propose the Affine Variational Autoencoder (AVAE), a variant of Variational Autoencoder (VAE) designed to improve robustness by overcoming the inability of VAEs to generalize to distributional shifts in the form of affine perturbations. By optimizing an affine transform to maximize ELBO, the proposed AVAE transforms an input to the training distribution without the need to increase model complexity to model the full distribution of affine transforms. In addition, we introduce a training procedure to create an efficient model by learning a subset of the training distribution, and using the AVAE to improve generalization and robustness to distributional shift at test time. Experiments on affine perturbations demonstrate that the proposed AVAE significantly improves generalization and robustness to distributional shift in the form of affine perturbations without an increase in model complexity.
In Generalized Linear Estimation (GLE) problems, we seek to estimate a signal that is observed through a linear transform followed by a component-wise, possibly nonlinear and noisy, channel. In the Bayesian optimal setting, Generalized Approximate Message Passing (GAMP) is known to achieve optimal performance for GLE. However, its performance can significantly degrade whenever there is a mismatch between the assumed and the true generative model, a situation frequently encountered in practice. In this paper, we propose a new algorithm, named Generalized Approximate Survey Propagation (GASP), for solving GLE in the presence of prior or model mis-specifications. As a prototypical example, we consider the phase retrieval problem, where we show that GASP outperforms the corresponding GAMP, reducing the reconstruction threshold and, for certain choices of its parameters, approaching Bayesian optimal performance. Furthermore, we present a set of State Evolution equations that exactly characterize the dynamics of GASP in the high-dimensional limit.
Graph convolutional neural networks (GCNNs) have been attracting increasing research attention due to their great potential in inference over graph structures. However, insufficient effort has been devoted to the aggregation methods between different convolution graph layers. In this paper, we introduce a graph attribute aggregation network (GAAN) architecture. Different from the conventional pooling operations, a graph-transformation-based aggregation strategy, progressive margin folding (PMF), is proposed for integrating graph features. By distinguishing internal and margin elements, we provide an approach for implementing the folding iteratively. A mechanism is also devised for preserving the local structures during progressive folding. In addition, a hypergraph-based representation is introduced for transferring the aggregated information between different layers. Our experiments applied to the public molecule datasets demonstrate that the proposed GAAN outperforms the existing GCNN models with significant effectiveness.
Given a graphical model (GM), computing its partition function is the most essential inference task, but it is computationally intractable in general. To address the issue, iterative approximation algorithms exploring certain local structure/consistency of GM have been investigated as popular choices in practice. However, due to their local/iterative nature, they often output poor approximations or even do not converge, e.g., in low-temperature regimes (hard instances of large parameters). To overcome the limitation, we propose a novel approach utilizing the global spectral feature of GM. Our contribution is two-fold: (a) we first propose a fully polynomial-time approximation scheme (FPTAS) for approximating the partition function of GM associating with a low-rank coupling matrix; (b) for general high-rank GMs, we design a spectral mean-field scheme utilizing (a) as a subroutine, where it approximates a high-rank GM into a product of rank-1 GMs for an efficient approximation of the partition function. The proposed algorithm is more robust in its running time and accuracy than prior methods, i.e., neither suffers from the convergence issue nor depends on hard local structures, as demonstrated in our experiments.
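To illustrate why the partition function in the abstract above is intractable in general, the sketch below computes it by brute force for a small pairwise binary model. The Ising-style parameterization (spins in {-1, +1}, a bias vector `theta`, a symmetric `coupling` matrix) is an assumption made for the example. The loop visits all 2^n configurations, which is exactly the cost that approximation schemes try to avoid.

```python
import itertools
import math

def partition_function(theta, coupling):
    """Brute-force Z = sum over x in {-1, +1}^n of exp(theta.x + sum_{i<j} J_ij x_i x_j).

    Assumed parameterization: theta is a list of n biases, coupling an n x n
    symmetric matrix whose upper triangle holds the pairwise weights.
    """
    n = len(theta)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        score = sum(theta[i] * x[i] for i in range(n))
        score += sum(coupling[i][j] * x[i] * x[j]
                     for i in range(n) for j in range(i + 1, n))
        total += math.exp(score)
    return total
```

For two spins with zero bias and coupling 1, the four configurations contribute 2e + 2/e, i.e. 4 cosh(1).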
Dealing with high variance is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run using different initializations/seeds. Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL. In particular, we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional regularization yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off. When the policy prior has control-theoretic stability guarantees, we further show that this regularization approximately preserves those stability guarantees throughout learning. We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.
A key challenge in leveraging data augmentation for neural network training is choosing an effective augmentation policy from a large search space of candidate operations. Properly chosen augmentation policies can lead to significant generalization improvements; however, state-of-the-art approaches such as AutoAugment are computationally infeasible to run for the ordinary user. In this paper, we introduce a new data augmentation algorithm, Population Based Augmentation (PBA), which generates nonstationary augmentation policy schedules instead of a fixed augmentation policy. We show that PBA can match the performance of AutoAugment on CIFAR-10, CIFAR-100, and SVHN, with three orders of magnitude less overall compute. On CIFAR-10 we achieve a mean test error of 1.46%, which is a slight improvement upon the current state-of-the-art. The code for PBA is open source and is available at https://…/pba.
For text analysis, one often resorts to a lossy representation that either completely ignores word order or embeds each word as a low-dimensional dense feature vector. In this paper, we propose convolutional Poisson factor analysis (CPFA) that directly operates on a lossless representation that processes the words in each document as a sequence of high-dimensional one-hot vectors. To boost its performance, we further propose the convolutional Poisson gamma belief network (CPGBN) that couples CPFA with the gamma belief network via a novel probabilistic pooling layer. CPFA forms words into phrases and captures very specific phrase-level topics, and CPGBN further builds a hierarchy of increasingly more general phrase-level topics. For efficient inference, we develop both a Gibbs sampler and a Weibull distribution based convolutional variational auto-encoder. Experimental results demonstrate that CPGBN can extract high-quality text latent representations that capture the word order information, and hence can be leveraged as a building block to enrich a wide variety of existing latent variable models that ignore word order.
In this article we revisit the definition of Precision-Recall (PR) curves for generative models proposed by Sajjadi et al. (arXiv:1806.00035). Rather than providing a scalar for generative quality, PR curves distinguish mode-collapse (poor recall) and bad quality (poor precision). We first generalize their formulation to arbitrary measures, hence removing any restriction to finite support. We also expose a bridge between PR curves and type I and type II error rates of likelihood ratio classifiers on the task of discriminating between samples of the two distributions. Building upon this new perspective, we propose a novel algorithm to approximate precision-recall curves, that shares some interesting methodological properties with the hypothesis testing technique from Lopez-Paz et al (arXiv:1610.06545). We demonstrate the interest of the proposed formulation over the original approach on controlled multi-modal datasets.
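As a point of reference for the abstract above, here is a minimal sketch of the original finite-support PR curve of Sajjadi et al., which the article generalizes. For each slope λ, precision is Σ min(λ·p, q) and recall is that value divided by λ, where `p` is the reference distribution and `q` the model distribution. The function name and the angle-based sweep over λ are choices made for this example.

```python
import math

def prd_curve(p, q, num_angles=201):
    """Return [(precision, recall)] pairs for distributions on a shared finite support.

    p: reference (data) probabilities, q: model probabilities, same ordering.
    Slopes lambda = tan(angle) are swept over the open interval (0, pi/2).
    """
    curve = []
    for i in range(1, num_angles + 1):
        angle = (math.pi / 2) * i / (num_angles + 1)
        lam = math.tan(angle)
        precision = sum(min(lam * pi, qi) for pi, qi in zip(p, q))
        recall = precision / lam
        curve.append((precision, recall))
    return curve
```

With `p = [0.5, 0.5]` and `q = [1.0, 0.0]` (a model that has collapsed onto one mode), precision can reach 1 while recall never exceeds 0.5, which is the distinction a single scalar score would hide.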
Directly learning features from the point cloud has become an active research direction in 3D understanding. Existing learning-based methods usually construct local regions from the point cloud and extract the corresponding features using shared Multi-Layer Perceptron (MLP) and max pooling. However, most of these processes do not adequately take the spatial distribution of the point cloud into account, limiting the ability to perceive fine-grained patterns. We design a novel Local Spatial Attention (LSA) module to adaptively generate attention maps according to the spatial distribution of local regions. The feature learning process which integrates with these attention maps can effectively capture the local geometric structure. We further propose the Spatial Feature Extractor (SFE), which constructs a branch architecture, to aggregate the spatial information with associated features in each layer of the network better. The experiments show that our network, named LSANet, can achieve on par or better performance than the state-of-the-art methods when evaluating on the challenging benchmark datasets. The source code is available at https://…/LSANet.
This paper deals with estimating model parameters in graphical models. We reformulate it as an information geometric optimization problem and introduce a natural gradient descent strategy that incorporates additional meta parameters. We show that our approach is a strong alternative to the celebrated EM approach for learning in graphical models. Actually, our natural gradient based strategy leads to learning optimal parameters for the final objective function without artificially trying to fit a distribution that may not correspond to the real one. We support our theoretical findings with the question of trend detection in financial markets and show that the learned model performs better than traditional practitioner methods and is less prone to overfitting.
We propose to improve unconditional Generative Adversarial Networks (GAN) by training the self-supervised learning with the adversarial process. In particular, we apply self-supervised learning via the geometric transformation on input images and assign the pseudo-labels to these transformed images. (i) In addition to the GAN task, which distinguishes data (real) versus generated (fake) samples, we train the discriminator to predict the correct pseudo-labels of real transformed samples (classification task). Importantly, we find that simultaneously training the discriminator to classify the fake class from the pseudo-classes of real samples for the classification task improves the discriminator and subsequently leads to better guidance for training the generator. (ii) The generator is trained by attempting to confuse the discriminator for not only the GAN task but also the classification task. For the classification task, the generator tries to confuse the discriminator into recognizing the transformation of its output as one of the real transformed classes. In particular, we exploit the fact that when the generator creates samples that result in a similar loss (via cross-entropy) as that of the real ones, the training is more stable and the generator distribution tends to better match the data distribution. When integrating our techniques into a state-of-the-art Auto-Encoder (AE) based-GAN model, they help to significantly boost the model’s performance and also establish new state-of-the-art Fréchet Inception Distance (FID) scores in the literature of unconditional GAN for CIFAR-10 and STL-10 datasets.
We consider a requester who acquires a set of data (e.g. images) that is not owned by one party. In order to collect all the data, crowdsourcing mechanisms have been widely used to seek help from the crowd. However, existing mechanisms rely on third-party platforms, and the workers from these platforms are not necessarily helpful and redundant data are also not properly handled. To combat this problem, we propose a novel crowdsourcing mechanism based on social networks, where the rewards of the workers are calculated by information entropy and a modified Shapley value. This mechanism incentivizes the workers from the network to not only provide all data they have, but also further invite their neighbours to offer more data. Eventually, the mechanism is able to acquire all data from all workers on the network with a constrained reward spending.
Many tasks, including language generation, benefit from learning the structure of the output space, particularly when the space of output labels is large and the data is sparse. State-of-the-art neural language models indirectly capture the output space structure in their classifier weights since they lack parameter sharing across output labels. Learning shared output label mappings helps, but existing methods have limited expressivity and are prone to overfitting. In this paper, we investigate the usefulness of more powerful shared mappings for output labels, and propose a deep residual output mapping with dropout between layers to better capture the structure of the output space and avoid overfitting. Evaluations on three language generation tasks show that our output label mapping can match or improve state-of-the-art recurrent and self-attention architectures, and suggest that the classifier does not necessarily need to be high-rank to better model natural language if it is better at capturing the structure of the output space.
This paper presents a novel time series clustering method, the self-organising eigenspace map (SOEM), based on a generalisation of the well-known self-organising feature map (SOFM). The SOEM operates on the eigenspaces of the embedded covariance structures of time series which are related directly to modes in those time series. Approximate joint diagonalisation acts as a pseudo-metric across these spaces allowing us to generalise the SOFM to a neural network with matrix input. The technique is empirically validated against three sets of experiments: univariate and multivariate time series clustering, and application to (clustered) multi-variate time series forecasting. Results indicate that the technique performs a valid topologically ordered clustering of the time series. The clustering is superior in comparison to standard benchmarks when the data is non-aligned, gives the best clustering stage when used in forecasting, can be used with partial/non-overlapping time series and multivariate clustering, and produces a topological representation of the time series objects.
The Hessian matrix has a number of important applications in a variety of different fields, such as optimization, image processing and statistics. In this paper we focus on the practical aspects of efficiently computing Hessian matrices in the context of deep learning using the Python scripting language and the TensorFlow library.
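For readers who want a library-free reference when checking a Hessian computed by automatic differentiation, the sketch below approximates it with central finite differences in plain Python. This is a swapped-in technique for illustration, not the TensorFlow approach the paper describes. The function name and the step size `eps` are assumptions.

```python
def numerical_hessian(f, x, eps=1e-4):
    """Approximate the Hessian of a scalar function f at point x (list of floats).

    Uses the central difference
    (f(x+ei+ej) - f(x+ei-ej) - f(x-ei+ej) + f(x-ei-ej)) / (4 eps^2),
    where ei and ej are steps of size eps along coordinates i and j.
    """
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di * eps
                y[j] += dj * eps
                return f(y)
            value = (shifted(1, 1) - shifted(1, -1)
                     - shifted(-1, 1) + shifted(-1, -1)) / (4 * eps * eps)
            # The Hessian of a twice continuously differentiable f is symmetric.
            hess[i][j] = hess[j][i] = value
    return hess
```

This costs on the order of n² function evaluations, so it is only practical as a sanity check on small problems. That cost is part of why efficient automatic-differentiation approaches matter for deep learning models.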