autoAx: An Automatic Design Space Exploration and Circuit Building Methodology utilizing Libraries of Approximate Components

Approximate computing is an emerging paradigm for developing highly energy-efficient computing systems such as various accelerators. In the literature, many libraries of elementary approximate circuits have already been proposed to simplify the design process of approximate accelerators. Because these libraries contain from tens to thousands of approximate implementations for a single arithmetic operation it is intractable to find an optimal combination of approximate circuits in the library even for an application consisting of a few operations. An open problem is ‘how to effectively combine circuits from these libraries to construct complex approximate accelerators’. This paper proposes a novel methodology for searching, selecting and combining the most suitable approximate circuits from a set of available libraries to generate an approximate accelerator for a given application. To enable fast design space generation and exploration, the methodology utilizes machine learning techniques to create computational models estimating the overall quality of processing and hardware cost without performing full synthesis at the accelerator level. Using the methodology, we construct hundreds of approximate accelerators (for a Sobel edge detector) showing different but relevant tradeoffs between the quality of processing and hardware cost and identify a corresponding Pareto-frontier. Furthermore, when searching for approximate implementations of a generic Gaussian filter consisting of 17 arithmetic operations, the proposed approach allows us to identify approximately 10^3 highly important implementations from 10^{23} possible solutions in a few hours, while the exhaustive search would take four months on a high-end processor.

Rucio – Scientific Data Management

Rucio is an open source software framework that provides scientific collaborations with the functionality to organize, manage, and access their volumes of data. The data can be distributed across heterogeneous data centers at widely distributed locations. Rucio has been originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously extended to support the LHC experiments and other diverse scientific communities. In this article we detail the fundamental concepts of Rucio, describe the architecture along with implementation details, and give operational experience from production usage.

Context Vectors are Reflections of Word Vectors in Half the Dimensions

This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assumptions the widely-used word-word PMI matrix is approximately a random symmetric Gaussian ensemble. This, in turn, implies that context vectors are reflections of word vectors in approximately half the dimensions. As a direct application of our result, we suggest a theoretically grounded way of tying weights in the SGNS model.

An Abstract View on the De-anonymization Process

Over the recent years, the availability of datasets containing personal, but anonymized information has been continuously increasing. Extensive research has revealed that such datasets are vulnerable to privacy breaches: being able to reveal sensitive information about individuals through deanonymization methods. Here, we provide a taxonomy of the research in de-anonymization.

Introduction to ‘RORPack’: A Python Software Library for Robust Output Regulation

This document contains the mathematical introduction to RORPack – a Python software library for robust output tracking and disturbance rejection for linear PDE systems. The RORPack library is open-source and freely available at https://…/rorpack The package contains functionality for automated construction of robust internal model based controllers, simulation of the controlled systems, visualisation of the results, as well as a collection of example cases on robust output regulation of controlled heat and wave equations.

Transfer Learning for Performance Modeling of Configurable Systems: A Causal Analysis

Modern systems (e.g., deep neural networks, big data analytics, and compilers) are highly configurable, which means they expose different performance behavior under different configurations. The fundamental challenge is that one cannot simply measure all configurations due to the sheer size of the configuration space. Transfer learning has been used to reduce the measurement efforts by transferring knowledge about performance behavior of systems across environments. Previously, research has shown that statistical models are indeed transferable across environments. In this work, we investigate identifiability and transportability of causal effects and statistical relations in highly-configurable systems. Our causal analysis agrees with previous exploratory analysis \cite{Jamshidi17} and confirms that the causal effects of configuration options can be carried over across environments with high confidence. We expect that the ability to carry over causal relations will enable effective performance analysis of highly-configurable systems.

FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning

The computational demands of computer vision tasks based on state-of-the-art Convolutional Neural Network (CNN) image classification far exceed the energy budgets of mobile devices. This paper proposes FixyNN, which consists of a fixed-weight feature extractor that generates ubiquitous CNN features, and a conventional programmable CNN accelerator which processes a dataset-specific CNN. Image classification models for FixyNN are trained end-to-end via transfer learning, with the common feature extractor representing the transfered part, and the programmable part being learnt on the target dataset. Experimental results demonstrate FixyNN hardware can achieve very high energy efficiencies up to 26.6 TOPS/W (4.81 \times better than iso-area programmable accelerator). Over a suite of six datasets we trained models via transfer learning with an accuracy loss of <1\% resulting in up to 11.2 TOPS/W – nearly 2 \times more efficient than a conventional programmable CNN accelerator of the same area.

Continual Learning with Tiny Episodic Memories

Learning with less supervision is a major challenge in artificial intelligence. One sensible approach to decrease the amount of supervision is to leverage prior experience and transfer knowledge from tasks seen in the past. However, a necessary condition for a successful transfer is the ability to remember how to perform previous tasks. The Continual Learning (CL) setting, whereby an agent learns from a stream of tasks without seeing any example twice, is an ideal framework to investigate how to accrue such knowledge. In this work, we consider supervised learning tasks and methods that leverage a very small episodic memory for continual learning. Through an extensive empirical analysis across four benchmark datasets adapted to CL, we observe that a very simple baseline, which jointly trains on both examples from the current task as well as examples stored in the memory, outperforms state-of-the-art CL approaches with and without episodic memory. Surprisingly, repeated learning over tiny episodic memories does not harm generalization on past tasks, as joint training on data from subsequent tasks acts like a data dependent regularizer. We discuss and evaluate different approaches to write into the memory. Most notably, reservoir sampling works remarkably well across the board, except when the memory size is extremely small. In this case, writing strategies that guarantee an equal representation of all classes work better. Overall, these methods should be considered as a strong baseline candidate when benchmarking new CL approaches

DiscoFuse: A Large-Scale Dataset for Discourse-based Sentence Fusion

Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically-generating fusion examples from raw text and present DiscoFuse, a large scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text, and decomposing the text into two independent sentences. We apply our approach on two document collections: Wikipedia and Sports articles, yielding 60 million fusion examples annotated with discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic as well as human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.

Attributes-aided Part Detection and Refinement for Person Re-identification

Person attributes are often exploited as mid-level human semantic information to help promote the performance of person re-identification task. In this paper, unlike most existing methods simply taking attribute learning as a classification problem, we perform it in a different way with the motivation that attributes are related to specific local regions, which refers to the perceptual ability of attributes. We utilize the process of attribute detection to generate corresponding attribute-part detectors, whose invariance to many influences like poses and camera views can be guaranteed. With detected local part regions, our model extracts local features to handle the body part misalignment problem, which is another major challenge for person re-identification. The local descriptors are further refined by fused attribute information to eliminate interferences caused by detection deviation. Extensive experiments on two popular benchmarks with attribute annotations demonstrate the effectiveness of our model and competitive performance compared with state-of-the-art algorithms.

An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models

A growing number of state-of-the-art transfer learning methods employ language models pretrained on large generic corpora. In this paper we present a conceptually simple and effective transfer learning approach that addresses the problem of catastrophic forgetting. Specifically, we combine the task-specific optimization function with an auxiliary language model objective, which is adjusted during the training process. This preserves language regularities captured by language models, while enabling sufficient adaptation for solving the target task. Our method does not require pretraining or finetuning separate components of the network and we train our models end-to-end in a single step. We present results on a variety of challenging affective and text classification tasks, surpassing well established transfer learning methods with greater level of complexity.

Accelerating Self-Play Learning in Go

By introducing several new Go-specific and non-Go-specific techniques along with other tuning, we accelerate self-play learning in Go. Like AlphaZero and Leela Zero, a popular open-source distributed project based on AlphaZero, our bot KataGo only learns from neural net Monte-Carlo tree-search self-play. With our techniques, in only a week with several dozen GPUs it achieves a likely strong pro or perhaps just-super-human level of strength. Compared to Leela Zero, we estimate a roughly 5x reduction in self-play computation required to achieve that level of strength, as well as a 30x to 100x reduction for reaching moderate to strong amateur levels. Although we so far have not tested in longer runs, we believe that our techniques hold promise for future research.

TrIK-SVM : an alternative decomposition for kernel methods in Krein spaces

The proposed work aims at proposing a alternative kernel decomposition in the context of kernel machines with indefinite kernels. The original paper of KSVM (SVM in Kre\v{i}n spaces) uses the eigen-decomposition, our proposition avoids this decompostion. We explain how it can help in designing an algorithm that won’t require to compute the full kernel matrix. Finally we illustrate the good behavior of the proposed method compared to KSVM.

Multiresolution Graph Attention Networks for Relevance Matching

A large number of deep learning models have been proposed for the text matching problem, which is at the core of various typical natural language processing (NLP) tasks. However, existing deep models are mainly designed for the semantic matching between a pair of short texts, such as paraphrase identification and question answering, and do not perform well on the task of relevance matching between short-long text pairs. This is partially due to the fact that the essential characteristics of short-long text matching have not been well considered in these deep models. More specifically, these methods fail to handle extreme length discrepancy between text pieces and neither can they fully characterize the underlying structural information in long text documents. In this paper, we are especially interested in relevance matching between a piece of short text and a long document, which is critical to problems like query-document matching in information retrieval and web searching. To extract the structural information of documents, an undirected graph is constructed, with each vertex representing a keyword and the weight of an edge indicating the degree of interaction between keywords. Based on the keyword graph, we further propose a Multiresolution Graph Attention Network to learn multi-layered representations of vertices through a Graph Convolutional Network (GCN), and then match the short text snippet with the graphical representation of the document with the attention mechanisms applied over each layer of the GCN. Experimental results on two datasets demonstrate that our graph approach outperforms other state-of-the-art deep matching models.

Bayesian data fusion for unmeasured confounding

Bayesian causal inference offers a principled approach to policy evaluation of proposed interventions on mediators or time-varying exposures. We outline a general approach to the estimation of causal quantities for settings with time-varying confounding, such as exposure-induced mediator-outcome confounders. We further extend this approach to propose two Bayesian data fusion (BDF) methods for unmeasured confounding. Using informative priors on quantities relating to the confounding bias parameters, our methods incorporate data from an external source where the confounder is measured in order to make inferences about causal estimands in the main study population. We present results from a simulation study comparing our data fusion methods to two common frequentist correction methods for unmeasured confounding bias in the mediation setting. We also demonstrate our method with an investigation of the role of stage at cancer diagnosis in contributing to Black-White colorectal cancer survival disparities.

Provable Guarantees for Gradient-Based Meta-Learning

We study the problem of meta-learning through the lens of online convex optimization, developing a meta-algorithm bridging the gap between popular gradient-based meta-learning and classical regularization-based multi-task transfer methods. Our method is the first to simultaneously satisfy good sample efficiency guarantees in the convex setting, with generalization bounds that improve with task-similarity, while also being computationally scalable to modern deep learning architectures and the many-task setting. Despite its simplicity, the algorithm matches, up to a constant factor, a lower bound on the performance of any such parameter-transfer method under natural task similarity assumptions. We use experiments in both convex and deep learning settings to verify and demonstrate the applicability of our theory.

Unifying Ensemble Methods for Q-learning via Social Choice Theory

Ensemble methods have been widely applied in Reinforcement Learning (RL) in order to enhance stability, increase convergence speed, and improve exploration. These methods typically work by employing an aggregation mechanism over actions of different RL algorithms. We show that a variety of these methods can be unified by drawing parallels from committee voting rules in Social Choice Theory. We map the problem of designing an action aggregation mechanism in an ensemble method to a voting problem which, under different voting rules, yield popular ensemble-based RL algorithms like Majority Voting Q-learning or Bootstrapped Q-learning. Our unification framework, in turn, allows us to design new ensemble-RL algorithms with better performance. For instance, we map two diversity-centered committee voting rules, namely Single Non-Transferable Voting Rule and Chamberlin-Courant Rule, into new RL algorithms that demonstrate excellent exploratory behavior in our experiments.

Improving Missing Data Imputation with Deep Generative Models

Datasets with missing values are very common on industry applications, and they can have a negative impact on machine learning models. Recent studies introduced solutions to the problem of imputing missing values based on deep generative models. Previous experiments with Generative Adversarial Networks and Variational Autoencoders showed interesting results in this domain, but it is not clear which method is preferable for different use cases. The goal of this work is twofold: we present a comparison between missing data imputation solutions based on deep generative models, and we propose improvements over those methodologies. We run our experiments using known real life datasets with different characteristics, removing values at random and reconstructing them with several imputation techniques. Our results show that the presence or absence of categorical variables can alter the selection of the best model, and that some models are more stable than others after similar runs with different random number generator seeds.

Bridging the Gap: Attending to Discontinuity in Identification of Multiword Expressions

We introduce a new method to tag Multiword Expressions (MWEs) using a linguistically interpretable language-independent deep learning architecture. We specifically target discontinuity, an under-explored aspect that poses a significant challenge to computational treatment of MWEs. Two neural architectures are explored: Graph Convolutional Network (GCN) and multi-head self-attention. GCN leverages dependency parse information, and self-attention attends to long-range relations. We finally propose a combined model that integrates complementary information from both through a gating mechanism. The experiments on a standard multilingual dataset for verbal MWEs show that our model outperforms the baselines not only in the case of discontinuous MWEs but also in overall F-score.

High-Dimensional Bayesian Optimization with Manifold Gaussian Processes

Bayesian optimization (BO) is a powerful approach for seeking the global optimum of expensive black-box functions and has proven successful for fine tuning hyper-parameters of machine learning models. The Bayesian optimization routine involves learning a response surface and maximizing a score to select the most valuable inputs to be queried at the next iteration. These key steps are subject to the curse of dimensionality so that Bayesian optimization does not scale beyond 10–20 parameters. In this work, we address this issue and propose a high-dimensional BO method that learns a nonlinear low-dimensional manifold of the input space. We achieve this with a multi-layer neural network embedded in the covariance function of a Gaussian process. This approach applies unsupervised dimensionality reduction as a byproduct of a supervised regression solution. This also allows exploiting data efficiency of Gaussian process models in a Bayesian framework. We also introduce a nonlinear mapping from the manifold to the high-dimensional space based on multi-output Gaussian processes and jointly train it end-to-end via marginal likelihood maximization. We show this intrinsically low-dimensional optimization outperforms recent baselines in high-dimensional BO literature on a set of benchmark functions in 60 dimensions.

On Constrained Open-World Probabilistic Databases

Increasing amounts of available data have led to a heightened need for representing large-scale probabilistic knowledge bases. One approach is to use a probabilistic database, a model with strong assumptions that allow for efficiently answering many interesting queries. Recent work on open-world probabilistic databases strengthens the semantics of these probabilistic databases by discarding the assumption that any information not present in the data must be false. While intuitive, these semantics are not sufficiently precise to give reasonable answers to queries. We propose overcoming these issues by using constraints to restrict this open world. We provide an algorithm for one class of queries, and establish a basic hardness result for another. Finally, we propose an efficient and tight approximation for a large class of queries.

Degenerate Feedback Loops in Recommender Systems

Machine learning is used extensively in recommender systems deployed in products. The decisions made by these systems can influence user beliefs and preferences which in turn affect the feedback the learning system receives – thus creating a feedback loop. This phenomenon can give rise to the so-called ‘echo chambers’ or ‘filter bubbles’ that have user and societal implications. In this paper, we provide a novel theoretical analysis that examines both the role of user dynamics and the behavior of recommender systems, disentangling the echo chamber from the filter bubble effect. In addition, we offer practical solutions to slow down system degeneracy. Our study contributes toward understanding and developing solutions to commonly cited issues in the complex temporal scenario, an area that is still largely unexplored.

Introspection Learning

Traditional reinforcement learning agents learn from experience, past or present, gained through interaction with their environment. Our approach synthesizes experience, without requiring an agent to interact with their environment, by asking the policy directly ‘Are there situations X, Y, and Z, such that in these situations you would select actions A, B, and C?’ In this paper we present Introspection Learning, an algorithm that allows for the asking of these types of questions of neural network policies. Introspection Learning is reinforcement learning algorithm agnostic and the states returned may be used as an indicator of the health of the policy or to shape the policy in a myriad of ways. We demonstrate the usefulness of this algorithm both in the context of speeding up training and improving robustness with respect to safety constraints.

Adversarial Attacks on Time Series

Time series classification models have been garnering significant importance in the research community. However, not much research has been done on generating adversarial samples for these models. These adversarial samples can become a security concern. In this paper, we propose utilizing an adversarial transformation network (ATN) on a distilled model to attack various time series classification models. The proposed attack on the classification model utilizes a distilled model as a surrogate that mimics the behavior of the attacked classical time series classification models. Our proposed methodology is applied onto 1-Nearest Neighbor Dynamic Time Warping (1-NN ) DTW, a Fully Connected Network and a Fully Convolutional Network (FCN), all of which are trained on 43 University of California Riverside (UCR) datasets. In this paper, we show both models were susceptible to attacks on all 43 datasets. To the best of our knowledge, such an attack on time series classification models has never been done before. Finally, we recommend future researchers that develop time series classification models to incorporating adversarial data samples into their training data sets to improve resilience on adversarial samples and to consider model robustness as an evaluative metric.

Insights into LSTM Fully Convolutional Networks for Time Series Classification

Long Short Term Memory Fully Convolutional Neural Networks (LSTM-FCN) and Attention LSTM-FCN (ALSTM-FCN) have shown to achieve state-of-the-art performance on the task of classifying time series signals on the old University of California-Riverside (UCR) time series repository. However, there has been no study on why LSTM-FCN and ALSTM-FCN perform well. In this paper, we perform a series of ablation tests (3627 experiments) on LSTM-FCN and ALSTM-FCN to provide a better understanding of the model and each of its sub-module. Results from the ablation tests on ALSTM-FCN and LSTM-FCN show that the these blocks perform better when applied in a conjoined manner. Two z-normalizing techniques, z-normalizing each sample independently and z-normalizing the whole dataset, are compared using a Wilcoxson signed-rank test to show a statistical difference in performance. In addition, we provide an understanding of the impact dimension shuffle has on LSTM-FCN by comparing its performance with LSTM-FCN when no dimension shuffle is applied. Finally, we demonstrate the performance of the LSTM-FCN when the LSTM block is replaced by a GRU, basic RNN, and Dense Block.

Stochastically Rank-Regularized Tensor Regression Networks

Over-parametrization of deep neural networks has recently been shown to be key to their successful training. However, it also renders them prone to overfitting and makes them expensive to store and train. Tensor regression networks significantly reduce the number of effective parameters in deep neural networks while retaining accuracy and the ease of training. They replace the flattening and fully-connected layers with a tensor regression layer, where the regression weights are expressed through the factors of a low-rank tensor decomposition. In this paper, to further improve tensor regression networks, we propose a novel stochastic rank-regularization. It consists of a novel randomized tensor sketching method to approximate the weights of tensor regression layers. We theoretically and empirically establish the link between our proposed stochastic rank-regularization and the dropout on low-rank tensor regression. Extensive experimental results with both synthetic data and real world datasets (i.e., CIFAR-100 and the UK Biobank brain MRI dataset) support that the proposed approach i) improves performance in both classification and regression tasks, ii) decreases overfitting, iii) leads to more stable training and iv) improves robustness to adversarial attacks and random noise.

A Simple Introduction to Free Probability Theory and its Application to Random Matrices

Free probability theory started in the 1980s has attracted much attention lately in signal processing and communications areas due to its applications in large size random matrices. However, it involves with massive mathematical concepts and notations, and is really hard for a general reader to comprehend. The main goal of this paper is to briefly describe this theory and its application in random matrices as simple as possible so that it is easy to follow. Applying free probability theory, one is able to calculate the distributions of the eigenvalues/singular-values of large size random matrices using only the second order statistics of the matrix entries. One of such applications is the mutual information calculation of a massive MIMO system.

Semi-supervised GANs to Infer Travel Modes in GPS Trajectories

Semi-supervised Generative Adversarial Networks (GANs) are developed in the context of travel mode inference with uni-dimensional smartphone trajectory data. We use data from a large-scale smartphone travel survey in Montreal, Canada. We convert GPS trajectories into fixed-sized segments with five channels (variables). We develop different GANs architectures and compare their prediction results with Convolutional Neural Networks (CNNs). The best semi-supervised GANs model led to a prediction accuracy of 83.4%, while the best CNN model was able to achieve the prediction accuracy of 81.3%. The results compare favorably with previous studies, especially when taking the large-scale real-world nature of the dataset into account.

Learning Task Knowledge and its Scope of Applicability in Experience-Based Planning Domains

Experience-based planning domains (EBPDs) have been recently proposed to improve problem solving by learning from experience. EBPDs provide important concepts for long-term learning and planning in robotics. They rely on acquiring and using task knowledge, i.e., activity schemata, for generating concrete solutions to problem instances in a class of tasks. Using Three-Valued Logic Analysis (TVLA), we extend previous work to generate a set of conditions as the scope of applicability for an activity schema. The inferred scope is a bounded representation of a set of problems of potentially unbounded size, in the form of a 3-valued logical structure, which allows an EBPD system to automatically find an applicable activity schema for solving task problems. We demonstrate the utility of our approach in a set of classes of problems in a simulated domain and a class of real world tasks in a fully physically simulated PR2 robot in Gazebo.