CARL: Aggregated Search with Context-Aware Module Embedding Learning

Aggregated search aims to construct search result pages (SERPs) from blue-links and heterogeneous modules (such as news, images, and videos). Existing studies have largely ignored the correlations between blue-links and heterogeneous modules when selecting the heterogeneous modules to be presented. We observe that the top ranked blue-links, which we refer to as the \emph{context}, can provide important information about query intent and helps identify the relevant heterogeneous modules. For example, informative terms like ‘streamed’ and ‘recorded’ in the context imply that a video module may better satisfy the query. To model and utilize the context information for aggregated search, we propose a model with context attention and representation learning (CARL). Our model applies a recurrent neural network with an attention mechanism to encode the context, and incorporates the encoded context information into module embeddings. The context-aware module embeddings together with the ranking policy are jointly optimized under the Markov decision process (MDP) formulation. To achieve a more effective joint learning, we further propose an optimization function with self-supervision loss to provide auxiliary supervision signals. Experimental results based on two public datasets demonstrate the superiority of CARL over multiple baseline approaches, and confirm the effectiveness of the proposed optimization function in boosting the joint learning process.

Viterbi Extraction tutorial with Hidden Markov Toolkit

An algorithm used to extract HMM parameters is revisited. Most parts of the extraction process are taken from implemented Hidden Markov Toolkit (HTK) program under name HInit. The algorithm itself shows a few variations compared to another domain of implementations. The HMM model is introduced briefly based on the theory of Discrete Time Markov Chain. We schematically outline the Viterbi method implemented in HTK. Iterative definition of the method which is ready to be implemented in computer programs is reviewed. We also illustrate the method calculation precisely using manual calculation and extensive graphical illustration. The distribution of observation probability used is simply independent Gaussians r.v.s. The purpose of the content is not to justify the performance or accuracy of the method applied in a specific area. This writing merely to describe how the algorithm is performed. The whole content should enlighten the audience the insight of the Viterbi Extraction method used by HTK.

The Hitchhiker’s Guide to LDA

Latent Dirichlet Allocation (LDA) model is a famous model in the topic model field, it has been studied for years due to its extensive application value in industry and academia. However, the mathematical derivation of LDA model is challenging and difficult, which makes it difficult for the beginners to learn. To help the beginners in learning LDA, this book analyzes the mathematical derivation of LDA in detail, and it also introduces all the knowledge background to make it easy for beginners to understand. Thus, this book contains the author’s unique insights. It should be noted that this book is written in Chinese.

Flood Prediction Using Machine Learning Models: Literature Review

Floods are among the most destructive natural disasters, which are highly complex to model. The research on the advancement of flood prediction models contributed to risk reduction, policy suggestion, minimization of the loss of human life, and reduction the property damage associated with floods. To mimic the complex mathematical expressions of physical processes of floods, during the past two decades, machine learning (ML) methods contributed highly in the advancement of prediction systems providing better performance and cost-effective solutions. Due to the vast benefits and potential of ML, its popularity dramatically increased among hydrologists. Researchers through introducing novel ML methods and hybridizing of the existing ones aim at discovering more accurate and efficient prediction models. The main contribution of this paper is to demonstrate the state of the art of ML models in flood prediction and to give insight into the most suitable models. In this paper, the literature where ML models were benchmarked through a qualitative analysis of robustness, accuracy, effectiveness, and speed are particularly investigated to provide an extensive overview on the various ML algorithms used in the field. The performance comparison of ML models presents an in-depth understanding of the different techniques within the framework of a comprehensive evaluation and discussion. As a result, this paper introduces the most promising prediction methods for both long-term and short-term floods. Furthermore, the major trends in improving the quality of the flood prediction models are investigated. Among them, hybridization, data decomposition, algorithm ensemble, and model optimization are reported as the most effective strategies for the improvement of ML methods.

Investigating Decision Boundaries of Trained Neural Networks

Deep learning models have been the subject of study from various perspectives, for example, their training process, interpretation, generalization error, robustness to adversarial attacks, etc. A trained model is defined by its decision boundaries, and therefore, many of the studies about deep learning models speculate about the decision boundaries, and sometimes make simplifying assumptions about them. So far, finding exact points on the decision boundaries of trained deep models has been considered an intractable problem. Here, we compute exact points on the decision boundaries of these models and provide mathematical tools to investigate the surfaces that define the decision boundaries. Through numerical results, we confirm that some of the speculations about the decision boundaries are accurate, some of the computational methods can be improved, and some of the simplifying assumptions may be unreliable, for models with nonlinear activation functions. We advocate for verification of simplifying assumptions and approximation methods, wherever they are used. Finally, we demonstrate that the computational practices used for finding adversarial examples can be improved and computing the closest point on the decision boundary reveals the weakest vulnerability of a model against adversarial attack.

Self-Organizing Maps with Variable Input Length for Motif Discovery and Word Segmentation

Time Series Motif Discovery (TSMD) is defined as searching for patterns that are previously unknown and appear with a given frequency in time series. Another problem strongly related with TSMD is Word Segmentation. This problem has received much attention from the community that studies early language acquisition in babies and toddlers. The development of biologically plausible models for word segmentation could greatly advance this field. Therefore, in this article, we propose the Variable Input Length Map (VILMAP) for Motif Discovery and Word Segmentation. The model is based on the Self-Organizing Maps and can identify Motifs with different lengths in time series. In our experiments, we show that VILMAP presents good results in finding Motifs in a standard Motif discovery dataset and can avoid catastrophic forgetting when trained with datasets with increasing values of input size. We also show that VILMAP achieves results similar or superior to other methods in the literature developed for the task of word segmentation.

Visualizing the PHATE of Neural Networks

Understanding why and how certain neural networks outperform others is key to guiding future development of network architectures and optimization methods. To this end, we introduce a novel visualization algorithm that reveals the internal geometry of such networks: Multislice PHATE (M-PHATE), the first method designed explicitly to visualize how a neural network’s hidden representations of data evolve throughout the course of training. We demonstrate that our visualization provides intuitive, detailed summaries of the learning dynamics beyond simple global measures (i.e., validation loss and accuracy), without the need to access validation data. Furthermore, M-PHATE better captures both the dynamics and community structure of the hidden units as compared to visualization based on standard dimensionality reduction methods (e.g., ISOMAP, t-SNE). We demonstrate M-PHATE with two vignettes: continual learning and generalization. In the former, the M-PHATE visualizations display the mechanism of ‘catastrophic forgetting’ which is a major challenge for learning in task-switching contexts. In the latter, our visualizations reveal how increased heterogeneity among hidden units correlates with improved generalization performance. An implementation of M-PHATE, along with scripts to reproduce the figures in this paper, is available at https://…/M-PHATE.

HyperStream: a Workflow Engine for Streaming Data

This paper describes HyperStream, a large-scale, flexible and robust software package, written in the Python language, for processing streaming data with workflow creation capabilities. HyperStream overcomes the limitations of other computational engines and provides high-level interfaces to execute complex nesting, fusion, and prediction both in online and offline forms in streaming environments. HyperStream is a general purpose tool that is well-suited for the design, development, and deployment of Machine Learning algorithms and predictive models in a wide space of sequential predictive problems. Source code, installation instructions, examples, and documentation can be found at: https://…/HyperStream.

Self-supervised Attention Model for Weakly Labeled Audio Event Classification

We describe a novel weakly labeled Audio Event Classification approach based on a self-supervised attention model. The weakly labeled framework is used to eliminate the need for expensive data labeling procedure and self-supervised attention is deployed to help a model distinguish between relevant and irrelevant parts of a weakly labeled audio clip in a more effective manner compared to prior attention models. We also propose a highly effective strongly supervised attention model when strong labels are available. This model also serves as an upper bound for the self-supervised model. The performances of the model with self-supervised attention training are comparable to the strongly supervised one which is trained using strong labels. We show that our self-supervised attention method is especially beneficial for short audio events. We achieve 8.8% and 17.6% relative mean average precision improvements over the current state-of-the-art systems for SL-DCASE-17 and balanced AudioSet.

Unsupervised Feature Learning in Remote Sensing

The need for labeled data is among the most common and well-known practical obstacles to deploying deep learning algorithms to solve real-world problems. The current generation of learning algorithms requires a large volume of data labeled according to a static and pre-defined schema. Conversely, humans can quickly learn generalizations based on large quantities of unlabeled data, and turn these generalizations into classifications using spontaneous labels, often including labels not seen before. We apply a state-of-the-art unsupervised learning algorithm to the noisy and extremely imbalanced xView data set to train a feature extractor that adapts to several tasks: visual similarity search that performs well on both common and rare classes; identifying outliers within a labeled data set; and learning a natural class hierarchy automatically.

Benchmarking Surrogate-Assisted Genetic Recommender Systems

We propose a new approach for building recommender systems by adapting surrogate-assisted interactive genetic algorithms. A pool of user-evaluated items is used to construct an approximative model which serves as a surrogate fitness function in a genetic algorithm for optimizing new suggestions. The surrogate is used to recommend new items to the user, which are then evaluated according to the user’s liking and subsequently removed from the search space. By updating the surrogate model after new recommendations have been evaluated by the user, we enable the model itself to evolve towards the user’s preferences. In order to precisely evaluate the performance of that approach, the human’s subjective evaluation is replaced by common continuous objective benchmark functions for evolutionary algorithms. The system’s performance is compared to a conventional genetic algorithm and random search. We show that given a very limited amount of allowed evaluations on the true objective, our approach outperforms these baseline methods.

Que será será? The uncertainty estimation of feature-based time series forecasts

Interval forecasts have significant advantages in providing uncertainty estimation to point forecasts, leading to the importance of providing prediction intervals (PIs) as well as point forecasts. In this paper, we propose a general feature-based time series forecasting framework, which is divided into ‘offline’ and ‘online’ parts. In the ‘offline’ part, we explore how time series features connect with the prediction interval forecasting accuracy of different forecasting methods by exploring generalized additive models (GAMs), which makes our proposed framework interpretable in the effects of features on the interval forecasting accuracy. Our proposed framework is in essence a model averaging process and we introduce a threshold ratio for the selection of individual forecasting methods in this process. In the ‘online’ part, we calculate the point forecasts and PIs of new series by pre-trained GAMs and the corresponding optimal threshold ratio. We illustrate that our feature-based forecasting framework outperforms all individual benchmark forecasting methods on M3 competition data, with an improved computational efficiency.

How much data is sufficient to learn high-performing algorithms?

Algorithms for scientific analysis typically have tunable parameters that significantly influence computational efficiency and solution quality. If a parameter setting leads to strong algorithmic performance on average over a set of typical problem instances, that parameter setting—ideally—will perform well in the future. However, if the set of typical problem instances is small, average performance will not generalize to future performance. This raises the question: how large should this set be? We answer this question for any algorithm satisfying an easy-to-describe, ubiquitous property: its performance is a piecewise-structured function of its parameters. We are the first to provide a unified sample complexity framework for algorithm parameter configuration; prior research followed case-by-case analyses. We present applications from diverse domains, including biology, political science, and economics.

Do Neural Language Representations Learn Physical Commonsense?

Humans understand language based on the rich background knowledge about how the physical world works, which in turn allows us to reason about the physical world through language. In addition to the properties of objects (e.g., boats require fuel) and their affordances, i.e., the actions that are applicable to them (e.g., boats can be driven), we can also reason about if-then inferences between what properties of objects imply the kind of actions that are applicable to them (e.g., that if we can drive something then it likely requires fuel). In this paper, we investigate the extent to which state-of-the-art neural language representations, trained on a vast amount of natural language text, demonstrate physical commonsense reasoning. While recent advancements of neural language models have demonstrated strong performance on various types of natural language inference tasks, our study based on a dataset of over 200k newly collected annotations suggests that neural language representations still only learn associations that are explicitly written down.

Contributed Discussion of ‘A Bayesian Conjugate Gradient Method’

We would like to congratulate the authors of ‘A Bayesian Conjugate Gradient Method’ on their insightful paper, and welcome this publication which we firmly believe will become a fundamental contribution to the growing field of probabilistic numerical methods and in particular the sub-field of Bayesian numerical methods. In this short piece, which will be published as a comment alongside the main paper, we first initiate a discussion on the choice of priors for solving linear systems, then propose an extension of the Bayesian conjugate gradient (BayesCG) algorithm for solving several related linear systems simultaneously.

Incremental Reinforcement Learning — a New Continuous Reinforcement Learning Frame Based on Stochastic Differential Equation methods

Continuous reinforcement learning such as DDPG and A3C are widely used in robot control and autonomous driving. However, both methods have theoretical weaknesses. While DDPG cannot control noises in the control process, A3C does not satisfy the continuity conditions under the Gaussian policy. To address these concerns, we propose a new continues reinforcement learning method based on stochastic differential equations and we call it Incremental Reinforcement Learning (IRL). This method not only guarantees the continuity of actions within any time interval, but controls the variance of actions in the training process. In addition, our method does not assume Markov control in agents’ action control and allows agents to predict scene changes for action selection. With our method, agents no longer passively adapt to the environment. Instead, they positively interact with the environment for maximum rewards.

Local Differential Privacy for Deep Learning

Deep learning (DL) is a promising area of machine learning which is becoming popular due to its remarkable accuracy when trained with a massive amount of data. Often, these datasets are highly sensitive crowd-sourced data such as medical data, financial data, or image data, and the DL models trained on these data tend to leak privacy. We propose a new local differentially private (LDP) algorithm (named LATENT) which redesigns the training process in a way that a data owner can add a randomization layer before data leave data owners’ devices and reach to a potentially untrusted machine learning service. This way LATENT prevents privacy leaks of DL models, e.g., due to membership inference and memorizing model attacks, while providing excellent accuracy. By not requiring a trusted party, LATENT can be more practical for cloud-based machine learning services in comparison to existing differentially private approaches. Our experimental evaluation of LATENT on convolutional deep neural networks demonstrates excellent accuracy (e.g. 91\%- 96\%) with high model quality even under very low privacy budgets (e.g. \epsilon=0.5), outperforming existing differentially private approaches for deep learning.

Feature selection of neural networks is skewed towards the less abstract cue

Artificial neural networks (ANNs) have become an important tool for image classification with many applications in research and industry. However, it remains largely unknown how relevant image features are selected and how data properties affect this process. In particular, we are interested whether the abstraction level of image cues correlating with class membership influences feature selection. We perform experiments with binary images that contain a combination of cues, representing two different levels of abstractions: one is a pattern drawn from a random distribution where class membership correlates with the statistics of the pattern, the other a combination of symbol-like entities, where the symbolic code correlates with class membership. When the network is trained with data in which both cues are equally significant, we observe that the cues at the lower abstraction level, i.e., the pattern, is learned, while the symbolic information is largely ignored, even in networks with many layers. Symbol-like entities are only learned if the importance of low-level cues is reduced compared to the high-level ones. These findings raise important questions about the relevance of features that are learned by deep ANNs and how learning could be shifted towards symbolic features.

One Model To Rule Them All

We present a new flavor of Variational Autoencoder (VAE) that interpolates seamlessly between unsupervised, semi-supervised and fully supervised learning domains. We show that unlabeled datapoints not only boost unsupervised tasks, but also the classification performance. Vice versa, every label not only improves classification, but also unsupervised tasks. The proposed architecture is simple: A classification layer is connected to the topmost encoder layer, and then combined with the resampled latent layer for the decoder. The usual evidence lower bound (ELBO) loss is supplemented with a supervised loss target on this classification layer that is only applied for labeled datapoints. This simplicity allows for extending any existing VAE model to our proposed semi-supervised framework with minimal effort. In the context of classification, we found that this approach even outperforms a direct supervised setup.

TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

Recent studies from several hyperscalars pinpoint to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today’s datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhanced with near-data processing cores tailored for DL tensor operations. These custom DIMMs are populated inside a GPU-centric system interconnect as a remote memory pool, allowing GPUs to utilize for scalable memory bandwidth and capacity expansion. A prototype implementation of our proposal on real DL systems shows an average 6.0-15.7x performance improvement on state-of-the-art recommender systems.

Variational Bayes on Manifolds

Variational Bayes (VB) has become a versatile tool for Bayesian inference in statistics. Nonetheless, the development of the existing VB algorithms is so far generally restricted to the case where the variational parameter space is Euclidean, which hinders the potential broad application of VB methods. This paper extends the scope of VB to the case where the variational parameter space is a Riemannian manifold. We develop, for the first time in the literature, an efficient manifold-based VB algorithm that exploits both the geometric structure of the constraint parameter space and the information geometry of the manifold of VB approximating probability distributions. Our algorithm is provably convergent and achieves a convergence rate of order \mathcal O(1/\sqrt{T}) and \mathcal O(1/T^{2-2\epsilon}) for a non-convex evidence lower bound function and a strongly retraction-convex evidence lower bound function, respectively. We develop in particular two manifold VB algorithms, Manifold Gaussian VB and Manifold Neural Net VB, and demonstrate through numerical experiments that the proposed algorithms are stable, less sensitive to initialization and compares favourably to existing VB methods.

‘Conservatives Overfit, Liberals Underfit’: The Social-Psychological Control of Affect and Uncertainty

The presence of artificial agents in human social networks is growing. From chatbots to robots, human experience in the developed world is moving towards a socio-technical system in which agents can be technological or biological, with increasingly blurred distinctions between. Given that emotion is a key element of human interaction, enabling artificial agents with the ability to reason about affect is a key stepping stone towards a future in which technological agents and humans can work together. This paper presents work on building intelligent computational agents that integrate both emotion and cognition. These agents are grounded in the well-established social-psychological Bayesian Affect Control Theory (BayesAct). The core idea of BayesAct is that humans are motivated in their social interactions by affective alignment: they strive for their social experiences to be coherent at a deep, emotional level with their sense of identity and general world views as constructed through culturally shared symbols. This affective alignment creates cohesive bonds between group members, and is instrumental for collaborations to solidify as relational group commitments. BayesAct agents are motivated in their social interactions by a combination of affective alignment and decision theoretic reasoning, trading the two off as a function of the uncertainty or unpredictability of the situation. This paper provides a high-level view of dual process theories and advances BayesAct as a plausible, computationally tractable model based in social-psychological and sociological theory. We introduce a revised BayesAct model that more deeply integrates social-psychological theorising, and we demonstrate a key component of the model as being sufficient to account for cognitive biases about fairness, dissonance and conformity. We close with ethical and philosophical discussion.

FAIRY: A Framework for Understanding Relationships between Users’ Actions and their Social Feeds

Users increasingly rely on social media feeds for consuming daily information. The items in a feed, such as news, questions, songs, etc., usually result from the complex interplay of a user’s social contacts, her interests and her actions on the platform. The relationship of the user’s own behavior and the received feed is often puzzling, and many users would like to have a clear explanation on why certain items were shown to them. Transparency and explainability are key concerns in the modern world of cognitive overload, filter bubbles, user tracking, and privacy risks. This paper presents FAIRY, a framework that systematically discovers, ranks, and explains relationships between users’ actions and items in their social media feeds. We model the user’s local neighborhood on the platform as an interaction graph, a form of heterogeneous information network constructed solely from information that is easily accessible to the concerned user. We posit that paths in this interaction graph connecting the user and her feed items can act as pertinent explanations for the user. These paths are scored with a learning-to-rank model that captures relevance and surprisal. User studies on two social platforms demonstrate the practical viability and user benefits of the FAIRY method.

Optimal multiclass overfitting by sequence reconstruction from Hamming queries

A primary concern of excessive reuse of test datasets in machine learning is that it can lead to overfitting. Multiclass classification was recently shown to be more resistant to overfitting than binary classification. In an open problem of COLT 2019, Feldman, Frostig, and Hardt ask to characterize the dependence of the amount of overfitting bias with the number of classes m, the number of accuracy queries k, and the number of examples in the dataset n. We resolve this problem and determine the amount of overfitting possible in multi-class classification. We provide computationally efficient algorithms that achieve overfitting bias of \tilde{\Theta}(\max\{\sqrt{{k}/{(mn)}}, k/n\}), matching the known upper bounds.

Completing and Debugging Ontologies: state of the art and challenges

As semantically-enabled applications require high-quality ontologies, developing and maintaining as correct and complete as possible ontologies is an important, although difficult task in ontology engineering. A key step is ontology debugging and completion. In general, there are two steps: detecting defects and repairing defects. In this paper we formalize the repairing step as an abduction problem and situate the state of the art with respect to this framework. We show that there still are many open research problems and show opportunities for further work and advancing the field.


A k-universal permutation, or k-superpermutation, is a permutation that contains all permutations of length k as patterns. The problem of finding the minimum length of a k-superpermutation has recently received significant attention in the field of permutation patterns. One can ask analogous questions for other classes of objects. In this paper, we study k-supertrees. For each d\geq 2, we focus on two types of rooted plane trees called d-ary plane trees and [d]-trees. Motivated by recent developments in the literature, we consider ‘contiguous’ and ‘noncontiguous’ notions of pattern containment for each type of tree. We obtain both upper and lower bounds on the minimum possible size of a k-supertree in three cases; in the fourth, we determine the minimum size exactly. One of our lower bounds makes use of a recent result of Albert, Engen, Pantone, and Vatter on k-universal layered permutations.

Rigid Graph Alignment

Graph databases have been the subject of significant research and development. Problems such as modularity, centrality, alignment, and clustering have been formalized and solved in various application contexts. In this paper, we focus on databases for applications in which graphs have a spatial basis, which we refer to as rigid graphs. Nodes in such graphs have preferred positions relative to their graph neighbors. Examples of such graphs include abstractions of large biomolecules, functional connectomes of the human brain, and mobile device/ sensor communication logs. When analyzing such networks it is important to consider edge lengths; e.g., when identifying conserved patterns through graph alignment, it is important for conserved edges to have correlated lengths, in addition to topological similarity. In contrast to a large body of work on topological graph alignment, rigid graph alignment simultaneously aligns the network, as well as the underlying structure as characterized by edge lengths. We formulate the problem and present a meta-algorithm based on expectation-maximization that alternately aligns the network and the structure. We demonstrate that our meta-algorithm significantly improves the quality of alignments in target applications, compared to topological or structural aligners alone. We apply rigid graph alignment to functional brain networks derived from 20 subjects drawn from the Human Connectome Project (HCP) database, and show over a two-fold increase in quality of alignment over state of the art topological aligners. We evaluate the impact of various parameters associated with input datasets through a study on synthetic graphs, where we fully characterize the performance of our method. Our results are broadly applicable to other applications and abstracted networks that can be embedded in metric spaces — e.g., through spectral embeddings.