If you did not already know

Bottleneck Simulator google
Deep reinforcement learning has recently shown many impressive successes. However, one major obstacle towards applying such methods to real-world problems is their lack of data-efficiency. To this end, we propose the Bottleneck Simulator: a model-based reinforcement learning method which combines a learned, factorized transition model of the environment with rollout simulations to learn an effective policy from few examples. The learned transition model employs an abstract, discrete (bottleneck) state, which increases sample efficiency by reducing the number of model parameters and by exploiting structural properties of the environment. We provide a mathematical analysis of the Bottleneck Simulator in terms of fixed points of the learned policy, which reveals how performance is affected by four distinct sources of error: an error related to the abstract space structure, an error related to the transition model estimation variance, an error related to the transition model estimation bias, and an error related to the transition model class bias. Finally, we evaluate the Bottleneck Simulator on two natural language processing tasks: a text adventure game and a real-world, complex dialogue response selection task. On both tasks, the Bottleneck Simulator yields excellent performance beating competing approaches. …

Condensed Representation google
Correlated pattern mining has increasingly become an important task in data mining since these patterns allow conveying knowledge about meaningful and surprising relations among data. Frequent correlated patterns were thoroughly studied in the literature. In this thesis, we propose to benefit from both frequent correlated as well as rare correlated patterns according to the bond correlation measure. We propose to extract a subset without information loss of the sets of frequent correlated and of rare correlated patterns, this subset is called “Condensed Representation“. In this regard, we are based on the notions derived from the Formal Concept Analysis FCA, specifically the equivalence classes associated to a closure operator fbond dedicated to the bond measure, to introduce new concise representations of both frequent correlated and rare correlated patterns. …

Graph Feature Network (GFN) google
Graph Neural Nets (GNNs) have received increasing attentions, partially due to their superior performance in many node and graph classification tasks. However, there is a lack of understanding on what they are learning and how sophisticated the learned graph functions are. In this work, we first propose Graph Feature Network (GFN), a simple lightweight neural net defined on a set of graph augmented features. We then propose a dissection of GNNs on graph classification into two parts: 1) the graph filtering, where graph-based neighbor aggregations are performed, and 2) the set function, where a set of hidden node features are composed for prediction. To test the importance of these two parts separately, we prove and leverage the connection that GFN can be derived by linearizing graph filtering part of GNN. Empirically we perform evaluations on common graph classification benchmarks. To our surprise, we find that, despite the simplification, GFN could match or exceed the best accuracies produced by recently proposed GNNs, with a fraction of computation cost. Our results provide new perspectives on both the functions that GNNs learned and the current benchmarks for evaluating them. …

Quantum Annealing (QA) google
Quantum annealing (QA) is a metaheuristic for finding the global minimum of a given objective function over a given set of candidate solutions (candidate states), by a process using quantum fluctuations. Quantum annealing is used mainly for problems where the search space is discrete (combinatorial optimization problems) with many local minima; such as finding the ground state of a spin glass. It was formulated in its present form by T. Kadowaki and H. Nishimori (ja) in ‘Quantum annealing in the transverse Ising model’ though a proposal in a different form had been made by A. B. Finilla, M. A. Gomez, C. Sebenik and J. D. Doll, in ‘Quantum annealing: A new method for minimizing multidimensional functions’. Quantum annealing starts from a quantum-mechanical superposition of all possible states (candidate states) with equal weights. Then the system evolves following the time-dependent Schrödinger equation, a natural quantum-mechanical evolution of physical systems. The amplitudes of all candidate states keep changing, realizing a quantum parallelism, according to the time-dependent strength of the transverse field, which causes quantum tunneling between states. If the rate of change of the transverse field is slow enough, the system stays close to the ground state of the instantaneous Hamiltonian, i.e., adiabatic quantum computation. If the rate of change of the transverse field is accelerated, the system may leave the ground state temporarily but produce a higher likelihood of concluding in the ground state of the final problem Hamiltonian, i.e., diabatic quantum computation. The transverse field is finally switched off, and the system is expected to have reached the ground state of the classical Ising model that corresponds to the solution to the original optimization problem. An experimental demonstration of the success of quantum annealing for random magnets was reported immediately after the initial theoretical proposal. An introduction to combinatorial optimization (NP-hard) problems, the general structure of quantum annealing-based algorithms and two examples of this kind of algorithms for solving instances of the max-SAT and Minimum Multicut problems together with an overview of the quantum annealing systems manufactured by D-Wave Systems are presented in. …

If you did not already know

Robust Optimization google
Robust optimization is a field of optimization theory that deals with optimization problems in which a certain measure of robustness is sought against uncertainty that can be represented as deterministic variability in the value of the parameters of the problem itself and/or its solution. …

Deadline-Aware Task rEplication for Vehicular Cloud (DATE-V) google
Vehicular Cloud Computing (VCC) is a new technological shift which exploits the computation and storage resources on vehicles for computational service provisioning. Spare on-board resources are pooled by a VCC operator, e.g. a roadside unit, to complete task requests using the vehicle-as-a-resource framework. In this paper, we investigate timely service provisioning for deadline-constrained tasks in VCC systems by leveraging the task replication technique (i.e., allowing one task to be executed by several server vehicles). A learning-based algorithm, called DATE-V (Deadline-Aware Task rEplication for Vehicular Cloud), is proposed to address the special issues in VCC systems including uncertainty of vehicle movements, volatile vehicle members, and large vehicle population. The proposed algorithm is developed based on a novel Contextual-Combinatorial Multi-Armed Bandit (CC-MAB) learning framework. DATE-V is `contextual’ because it utilizes side information (context) of vehicles and tasks to infer the completion probability of a task replication under random vehicle movements. DATE-V is `combinatorial’ because it aims to replicate the received task and send the task replications to multiple server vehicles to guarantee the service timeliness. We rigorously prove that our learning algorithm achieves a sublinear regret bound compared to an oracle algorithm that knows the exact completion probability of any task replications. Simulations are carried out based on real-world vehicle movement traces and the results show that DATE-V significantly outperforms benchmark solutions. …

Language-Conditioned Graph Network (LCGN) google
Solving grounded language tasks often requires reasoning about relationships between objects in the context of a given task. For example, to answer the question “What color is the mug on the plate?” we must check the color of the specific mug that satisfies the “on” relationship with respect to the plate. Recent work has proposed various methods capable of complex relational reasoning. However, most of their power is in the inference structure, while the scene is represented with simple local appearance features. In this paper, we take an alternate approach and build contextualized representations for objects in a visual scene to support relational reasoning. We propose a general framework of Language-Conditioned Graph Networks (LCGN), where each node represents an object, and is described by a context-aware representation from related objects through iterative message passing conditioned on the textual input. E.g., conditioning on the “on” relationship to the plate, the object “mug” gathers messages from the object “plate” to update its representation to “mug on the plate”, which can be easily consumed by a simple classifier for answer prediction. We experimentally show that our LCGN approach effectively supports relational reasoning and improves performance across several tasks and datasets. …

DeepFirearm google
There are great demands for automatically regulating inappropriate appearance of shocking firearm images in social media or identifying firearm types in forensics. Image retrieval techniques have great potential to solve these problems. To facilitate research in this area, we introduce Firearm 14k, a large dataset consisting of over 14,000 images in 167 categories. It can be used for both fine-grained recognition and retrieval of firearm images. Recent advances in image retrieval are mainly driven by fine-tuning state-of-the-art convolutional neural networks for retrieval task. The conventional single margin contrastive loss, known for its simplicity and good performance, has been widely used. We find that it performs poorly on the Firearm 14k dataset due to: (1) Loss contributed by positive and negative image pairs is unbalanced during training process. (2) A huge domain gap exists between this dataset and ImageNet. We propose to deal with the unbalanced loss by employing a double margin contrastive loss. We tackle the domain gap issue with a two-stage training strategy, where we first fine-tune the network for classification, and then fine-tune it for retrieval. Experimental results show that our approach outperforms the conventional single margin approach by a large margin (up to 88.5% relative improvement) and even surpasses the strong triplet-loss-based approach. …

If you did not already know

Easy Positive Triplet Mining google
Deep metric learning seeks to define an embedding where semantically similar images are embedded to nearby locations, and semantically dissimilar images are embedded to distant locations. Substantial work has focused on loss functions and strategies to learn these embeddings by pushing images from the same class as close together in the embedding space as possible. In this paper, we propose an alternative, loosened embedding strategy that requires the embedding function only map each training image to the most similar examples from the same class, an approach we call ‘Easy Positive’ mining. We provide a collection of experiments and visualizations that highlight that this Easy Positive mining leads to embeddings that are more flexible and generalize better to new unseen data. This simple mining strategy yields recall performance that exceeds state of the art approaches (including those with complicated loss functions and ensemble methods) on image retrieval datasets including CUB, Stanford Online Products, In-Shop Clothes and Hotels-50K. …

Generalized Dual Averaging google
We present a new class of algorithms for solving regularized optimization and saddle point problems. We analyse this class of methods for convex optimization and convex-concave saddle point problems and expect that they will be useful for solving non-convex problems as well. For convex and convex-concave problems, our algorithms form a novel class of primal dual subgradient methods. This new class of methods extends existing methods by utilizing a more general bound on the objective error and duality gap. This leads to methods for which we can control the step size of the proximal update, which is of interest for problems where the sparsity of the iterates is important. We prove that our class of methods is optimal from the point of view of worst-case black-box complexity for convex optimization problems, and derive a version for convex-concave saddle point problems. We also analyse our methods in the stochastic and online settings. Finally, we exhibit a variety of special cases and discuss their usefulness for non-convex optimization. …

SPFlow google
We introduce SPFlow, an open-source Python library providing a simple interface to inference, learning and manipulation routines for deep and tractable probabilistic models called Sum-Product Networks (SPNs). The library allows one to quickly create SPNs both from data and through a domain specific language (DSL). It efficiently implements several probabilistic inference routines like computing marginals, conditionals and (approximate) most probable explanations (MPEs) along with sampling as well as utilities for serializing, plotting and structure statistics on an SPN. Moreover, many of the algorithms proposed in the literature to learn the structure and parameters of SPNs are readily available in SPFlow. Furthermore, SPFlow is extremely extensible and customizable, allowing users to promptly distill new inference and learning routines by injecting custom code into a lightweight functional-oriented API framework. This is achieved in SPFlow by keeping an internal Python representation of the graph structure that also enables practical compilation of an SPN into a TensorFlow graph, C, CUDA or FPGA custom code, significantly speeding-up computations. …

Generative Adversarial Imputation Net (GAIN) google
We propose a novel method for imputing missing data by adapting the well-known Generative Adversarial Nets (GAN) framework. Accordingly, we call our method Generative Adversarial Imputation Nets (GAIN). The generator (G) observes some components of a real data vector, imputes the missing components conditioned on what is actually observed, and outputs a completed vector. The discriminator (D) then takes a completed vector and attempts to determine which components were actually observed and which were imputed. To ensure that D forces G to learn the desired distribution, we provide D with some additional information in the form of a hint vector. The hint reveals to D partial information about the missingness of the original sample, which is used by D to focus its attention on the imputation quality of particular components. This hint ensures that G does in fact learn to generate according to the true data distribution. We tested our method on various datasets and found that GAIN significantly outperforms state-of-the-art imputation methods. …

If you did not already know

Multivariate Bernoulli Autoregressive Process (BAR) google
Multivariate Bernoulli autoregressive (BAR) processes model time series of events in which the likelihood of current events is determined by the times and locations of past events. These processes can be used to model nonlinear dynamical systems corresponding to criminal activity, responses of patients to different medical treatment plans, opinion dynamics across social networks, epidemic spread, and more. Past work examines this problem under the assumption that the event data is complete, but in many cases only a fraction of events are observed. Incomplete observations pose a significant challenge in this setting because the unobserved events still govern the underlying dynamical system. In this work, we develop a novel approach to estimating the parameters of a BAR process in the presence of unobserved events via an unbiased estimator of the complete data log-likelihood function. We propose a computationally efficient estimation algorithm which approximates this estimator via Taylor series truncation and establish theoretical results for both the statistical error and optimization error of our algorithm. We further justify our approach by testing our method on both simulated data and a real data set consisting of crimes recorded by the city of Chicago. …

Attention-Aware Generative Adversarial Network (ATA-GAN) google
In this work, we present a novel approach for training Generative Adversarial Networks (GANs). Using the attention maps produced by a Teacher- Network we are able to improve the quality of the generated images as well as perform weakly object localization on the generated images. To this end, we generate images of HEp-2 cells captured with Indirect Imunofluoresence (IIF) and study the ability of our network to perform a weakly localization of the cell. Firstly, we demonstrate that whilst GANs can learn the mapping between the input domain and the target distribution efficiently, the discriminator network is not able to detect the regions of interest. Secondly, we present a novel attention transfer mechanism which allows us to enforce the discriminator to put emphasis on the regions of interest via transfer learning. Thirdly, we show that this leads to more realistic images, as the discriminator learns to put emphasis on the area of interest. Fourthly, the proposed method allows one to generate both images as well as attention maps which can be useful for data annotation e.g in object detection. …

Attentive Adversarial Domain-Invariant Training (AADIT) google
Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR). In ADIT, an auxiliary domain classifier takes in equally-weighted deep features from a deep neural network (DNN) acoustic model and is trained to improve their domain-invariance by optimizing an adversarial loss function. In this work, we propose an attentive Adversarial domain-invariant training (AADIT) in which we advance the domain classifier with an attention mechanism to automatically weight the input deep features according to their importance in domain classification. With this attentive re-weighting, AADIT can focus on the domain normalization of phonetic components that are more susceptible to domain variability and generates deep features with improved domain-invariance and senone-discriminativity over ADIT. Most importantly, the attention block serves only as an external component to the DNN acoustic model and is not involved in ASR, so AADIT can be used to improve the acoustic modeling with any DNN architectures. More generally, the same methodology can improve any adversarial learning system with an auxiliary discriminator. Evaluated on CHiME-3 dataset, the AADIT achieves 13.6% and 9.3% relative WER improvements, respectively, over a multi-conditional model and a strong ADIT baseline. …

Markov Modulated Hawkes Process (MMHP) google
Modeling event dynamics is central to many disciplines. Patterns in observed event arrival times are commonly modeled using point processes. Such event arrival data often exhibits self-exciting, heterogeneous and sporadic trends, which is challenging for conventional models. It is reasonable to assume that there exists a hidden state process that drives different event dynamics at different states. In this paper, we propose a Markov Modulated Hawkes Process (MMHP) model for learning such a mixture of event dynamics and develop corresponding inference algorithms. Numerical experiments using synthetic data and data from an animal behavior study demonstrate that MMHP with the proposed estimation algorithms consistently recover the true hidden state process in simulations, and separately captures distinct event dynamics that reveal interesting social structures in the real data. …

If you did not already know

AttentionRNN google
Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among attention variables, making it difficult to predict consistent attention masks. In this paper we develop a novel structured spatial attention mechanism which is end-to-end trainable and can be integrated with any feed-forward convolutional neural network. This proposed AttentionRNN layer explicitly enforces structure over the spatial attention variables by sequentially predicting attention values in the spatial mask in a bi-directional raster-scan and inverse raster-scan order. As a result, each attention value depends not only on local image or contextual information, but also on the previously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks and datasets; including image categorization, question answering and image generation. …

Structured Sparsity Regularization google
The success of convolutional neural networks (CNNs) in computer vision applications has been accompanied by a significant increase of computation and memory costs, which prohibits its usage on resource-limited environments such as mobile or embedded devices. To this end, the research of CNN compression has recently become emerging. In this paper, we propose a novel filter pruning scheme, termed structured sparsity regularization (SSR), to simultaneously speedup the computation and reduce the memory overhead of CNNs, which can be well supported by various off-the-shelf deep learning libraries. Concretely, the proposed scheme incorporates two different regularizers of structured sparsity into the original objective function of filter pruning, which fully coordinates the global outputs and local pruning operations to adaptively prune filters. We further propose an Alternative Updating with Lagrange Multipliers (AULM) scheme to efficiently solve its optimization. AULM follows the principle of ADMM and alternates between promoting the structured sparsity of CNNs and optimizing the recognition loss, which leads to a very efficient solver (2.5x to the most recent work that directly solves the group sparsity-based regularization). Moreover, by imposing the structured sparsity, the online inference is extremely memory-light, since the number of filters and the output feature maps are simultaneously reduced. The proposed scheme has been deployed to a variety of state-of-the-art CNN structures including LeNet, AlexNet, VGG, ResNet and GoogLeNet over different datasets. Quantitative results demonstrate that the proposed scheme achieves superior performance over the state-of-the-art methods. We further demonstrate the proposed compression scheme for the task of transfer learning, including domain adaptation and object detection, which also show exciting performance gains over the state-of-the-arts. …

Adv-BNN google
We present a new algorithm to train a robust neural network against adversarial attacks. Our algorithm is motivated by the following two ideas. First, although recent work has demonstrated that fusing randomness can improve the robustness of neural networks (Liu 2017), we noticed that adding noise blindly to all the layers is not the optimal way to incorporate randomness. Instead, we model randomness under the framework of Bayesian Neural Network (BNN) to formally learn the posterior distribution of models in a scalable way. Second, we formulate the mini-max problem in BNN to learn the best model distribution under adversarial attacks, leading to an adversarial-trained Bayesian neural net. Experiment results demonstrate that the proposed algorithm achieves state-of-the-art performance under strong attacks. On CIFAR-10 with VGG network, our model leads to 14\% accuracy improvement compared with adversarial training (Madry 2017) and random self-ensemble (Liu 2017) under PGD attack with $0.035$ distortion, and the gap becomes even larger on a subset of ImageNet. …

Loss Data Analytics google
Loss Data Analytics is an interactive, online, freely available text. The idea behind the name Loss Data Analytics is to integrate classical loss data models from applied probability with modern analytic tools. In particular, we seek to recognize that big data (including social media and usage based insurance) are here and high speed computation is readily available. The online version contains many interactive objects (quizzes, computer demonstrations, interactive graphs, video, and the like) to promote deeper learning. A subset of the book is available for offline reading in pdf and EPUB formats. The online text will be available in multiple languages to promote access to a worldwide audience. …

If you did not already know

Universal Transformer google
Self-attentive feed-forward sequence models have been shown to achieve impressive results on sequence modeling tasks, thereby presenting a compelling alternative to recurrent neural networks (RNNs) which has remained the de-facto standard architecture for many sequence modeling problems to date. Despite these successes, however, feed-forward sequence models like the Transformer fail to generalize in many tasks that recurrent models handle with ease (e.g. copying when the string lengths exceed those observed at training time). Moreover, and in contrast to RNNs, the Transformer model is not computationally universal, limiting its theoretical expressivity. In this paper we propose the Universal Transformer which addresses these practical and theoretical shortcomings and we show that it leads to improved performance on several tasks. Instead of recurring over the individual symbols of sequences like RNNs, the Universal Transformer repeatedly revises its representations of all symbols in the sequence with each recurrent step. In order to combine information from different parts of a sequence, it employs a self-attention mechanism in every recurrent step. Assuming sufficient memory, its recurrence makes the Universal Transformer computationally universal. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised. Beyond saving computation, we show that ACT can improve the accuracy of the model. Our experiments show that on various algorithmic tasks and a diverse set of large-scale language understanding tasks the Universal Transformer generalizes significantly better and outperforms both a vanilla Transformer and an LSTM in machine translation, and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task. …

Pixel-Adaptive Convolutional Neural Network google
Convolutions are the fundamental building block of CNNs. The fact that their weights are spatially shared is one of the main reasons for their widespread use, but it also is a major limitation, as it makes convolutions content agnostic. We propose a pixel-adaptive convolution (PAC) operation, a simple yet effective modification of standard convolutions, in which the filter weights are multiplied with a spatially-varying kernel that depends on learnable, local pixel features. PAC is a generalization of several popular filtering techniques and thus can be used for a wide range of use cases. Specifically, we demonstrate state-of-the-art performance when PAC is used for deep joint image upsampling. PAC also offers an effective alternative to fully-connected CRF (Full-CRF), called PAC-CRF, which performs competitively, while being considerably faster. In addition, we also demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements. …

Monte Carlo Dependency Estimation (MCDE) google
Estimating the dependency of variables is a fundamental task in data analysis. Identifying the relevant attributes in databases leads to better data understanding and also improves the performance of learning algorithms, both in terms of runtime and quality. In data streams, dependency monitoring provides key insights into the underlying process, but is challenging. In this paper, we propose Monte Carlo Dependency Estimation (MCDE), a theoretical framework to estimate multivariate dependency in static and dynamic data. MCDE quantifies dependency as the average discrepancy between marginal and conditional distributions via Monte Carlo simulations. Based on this framework, we present Mann-Whitney P (MWP), a novel dependency estimator. We show that MWP satisfies a number of desirable properties and can accommodate any kind of numerical data. We demonstrate the superiority of our estimator by comparing it to the state-of-the-art multivariate dependency measures. …

Nonlinear Collaborative Scheme google
Conventional research attributes the improvements of generalization ability of deep neural networks either to powerful optimizers or the new network design. Different from them, in this paper, we aim to link the generalization ability of a deep network to optimizing a new objective function. To this end, we propose a \textit{nonlinear collaborative scheme} for deep network training, with the key technique as combining different loss functions in a nonlinear manner. We find that after adaptively tuning the weights of different loss functions, the proposed objective function can efficiently guide the optimization process. What is more, we demonstrate that, from the mathematical perspective, the nonlinear collaborative scheme can lead to (i) smaller KL divergence with respect to optimal solutions; (ii) data-driven stochastic gradient descent; (iii) tighter PAC-Bayes bound. We also prove that its advantage can be strengthened by nonlinearity increasing. To some extent, we bridge the gap between learning (i.e., minimizing the new objective function) and generalization (i.e., minimizing a PAC-Bayes bound) in the new scheme. We also interpret our findings through the experiments on Residual Networks and DenseNet, showing that our new scheme performs superior to single-loss and multi-loss schemes no matter with randomization or not. …

If you did not already know

Source-Guided Discrepancy (S-disc) google
Unsupervised domain adaptation is the problem setting where data generating distributions in the source and target domains are different, and labels in the target domain are unavailable. One important question in unsupervised domain adaptation is how to measure the difference between the source and target domains. A previously proposed discrepancy that does not use the source domain labels requires high computational cost to estimate and may lead to a loose generalization error bound in the target domain.To mitigate these problems, we propose a novel discrepancy called source-guided discrepancy ($S$-disc), which exploits labels in the source domain. As a consequence, $S$-disc can be computed efficiently with a finite sample convergence guarantee. In addition, we show that $S$-disc can provide a tighter generalization error bound than the one based on an existing discrepancy. Finally, we report experimental results that demonstrate the advantages of $S$-disc over the existing discrepancies. …

Spontaneous Clustering google
We propose a new method for clustering based on the local minimization of the \gamma-divergence, which we call the spontaneous clustering. The greatest advantage of the proposed method is that it automatically detects the number of clusters that adequately reflect the data structure. In contrast, exiting methods such as K-means, fuzzy c-means, and model based clustering need to prescribe the number of clusters. We detect all the local minimum points of the \gamma-divergence, which are defined as the centers of clusters. A necessary and sufficient condition for the \gamma-divergence to have the local minimum points is also derived in a simple setting. A simulation study and a real data analysis are performed to compare our proposal with existing methods. …

MetaPruning google
In this paper, we propose a novel meta learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for good-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and we do not need any finetuning. With a single PruningNet trained for the target network, we can search for various Pruned Networks under different constraints with little human participation. We have demonstrated competitive performances on MobileNet V1/V2 networks, up to 9.0/9.9 higher ImageNet accuracy than V1/V2. Compared to the previous state-of-the-art AutoML-based pruning methods, like AMC and NetAdapt, we achieve higher or comparable accuracy under various conditions. …

MatRox google
We present MatRox, a novel model-based algorithm and implementation of Hierarchically Semi-Separable (HSS) matrix computations on parallel architectures. MatRox uses a novel storage format to improve data locality and scalability of HSS matrix-matrix multiplications on shared memory multicore processors. We build a performance model for HSS matrix-matrix multiplications. Based on the performance model, a mixed-rank heuristic is introduced to find an optimal HSS-tree depth for a faster HSS matrix evaluation. Uniform sampling is used to improve the performance of HSS compression. MatRox outperforms state-of-the-art HSS matrix multiplication codes, GOFMM and STRUMPACK, with average speedups of 2.8x and 6.1x respectively on target multicore processors. …

If you did not already know

Bayesian Optimized Continual Learning (BOCL) google
Though neural networks have achieved much progress in various applications, it is still highly challenging for them to learn from a continuous stream of tasks without forgetting. Continual learning, a new learning paradigm, aims to solve this issue. In this work, we propose a new model for continual learning, called Bayesian Optimized Continual Learning with Attention Mechanism (BOCL) that dynamically expands the network capacity upon the arrival of new tasks by Bayesian optimization and selectively utilizes previous knowledge (e.g. feature maps of previous tasks) via attention mechanism. Our experiments on variants of MNIST and CIFAR-100 demonstrate that our methods outperform the state-of-the-art in preventing catastrophic forgetting and fitting new tasks better. …

Collaborative Memory Network (CMN) google
Recommendation systems play a vital role to keep users engaged with personalized content in modern online platforms. Deep learning has revolutionized many research fields and there is a recent surge of interest in applying it to collaborative filtering (CF). However, existing methods compose deep learning architectures with the latent factor model ignoring a major class of CF models, neighborhood or memory-based approaches. We propose Collaborative Memory Networks (CMN), a deep architecture to unify the two classes of CF models capitalizing on the strengths of the global structure of latent factor model and local neighborhood-based structure in a nonlinear fashion. Motivated by the success of Memory Networks, we fuse a memory component and neural attention mechanism as the neighborhood component. The associative addressing scheme with the user and item memories in the memory module encodes complex user-item relations coupled with the neural attention mechanism to learn a user-item specific neighborhood. Finally, the output module jointly exploits the neighborhood with the user and item memories to produce the ranking score. Stacking multiple memory modules together yield deeper architectures capturing increasingly complex user-item relations. Furthermore, we show strong connections between CMN components, memory networks and the three classes of CF models. Comprehensive experimental results demonstrate the effectiveness of CMN on three public datasets outperforming competitive baselines. Qualitative visualization of the attention weights provide insight into the model’s recommendation process and suggest the presence of higher order interactions. …

SigOpt Orchestrate google
Two key factors dominate the development of effective production grade machine learning models. First, it requires a local software implementation and iteration process. Second, it requires distributed infrastructure to efficiently conduct training and hyperparameter optimization. While modern machine learning frameworks are very effective at the former, practitioners are often left building ad hoc frameworks for the latter. We present SigOpt Orchestrate, a library for such simultaneous training in a cloud environment. We describe the motivating factors and resulting design of this library, feedback from initial testing, and future goals. …

GossipGraD google
In this paper, we present GossipGraD – a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from {\Theta}(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners for facilitating direct diffusion of gradients, 4) asynchronous distributed shuffle of samples during the feedforward phase in SGD to prevent over-fitting, 5) asynchronous communication of gradients for further reducing the communication cost of SGD and GossipGraD. We implement GossipGraD for GPU and CPU clusters and use NVIDIA GPUs (Pascal P100) connected with InfiniBand, and Intel Knights Landing (KNL) connected with Aries network. We evaluate GossipGraD using well-studied dataset ImageNet-1K (~250GB), and widely studied neural network topologies such as GoogLeNet and ResNet50 (current winner of ImageNet Large Scale Visualization Research Challenge (ILSVRC)). Our performance evaluation using both KNL and Pascal GPUs indicates that GossipGraD can achieve perfect efficiency for these datasets and their associated neural network topologies. Specifically, for ResNet50, GossipGraD is able to achieve ~100% compute efficiency using 128 NVIDIA Pascal P100 GPUs – while matching the top-1 classification accuracy published in literature. …

If you did not already know

Prediction with Unpredictable Feature Evolution (PUFE) google
Feature space can change or evolve when learning with streaming data. Several recent works have studied feature evolvable learning. They usually assume that features would not vanish or appear in an arbitrary way. For example, when knowing the battery lifespan, old features and new features represented by data gathered by sensors will disappear and emerge at the same time along with the sensors exchanging simultaneously. However, different sensors would have different lifespans, and thus the feature evolution can be unpredictable. In this paper, we propose a novel paradigm: Prediction with Unpredictable Feature Evolution (PUFE). We first complete the unpredictable overlapping period into an organized matrix and give a theoretical bound on the least number of observed entries. Then we learn the mapping from the completed matrix to recover the data from old feature space when observing the data from new feature space. With predictions on the recovered data, our model can make use of the advantage of old feature space and is always comparable with any combinations of the predictions on the current instance. Experiments on the synthetic and real datasets validate the effectiveness of our method. …

Rank-1 Convolutional Neural Network google
In this paper, we propose a convolutional neural network(CNN) with 3-D rank-1 filters which are composed by the outer product of 1-D filters. After being trained, the 3-D rank-1 filters can be decomposed into 1-D filters in the test time for fast inference. The reason that we train 3-D rank-1 filters in the training stage instead of consecutive 1-D filters is that a better gradient flow can be obtained with this setting, which makes the training possible even in the case where the network with consecutive 1-D filters cannot be trained. The 3-D rank-1 filters are updated by both the gradient flow and the outer product of the 1-D filters in every epoch, where the gradient flow tries to obtain a solution which minimizes the loss function, while the outer product operation tries to make the parameters of the filter to live on a rank-1 sub-space. Furthermore, we show that the convolution with the rank-1 filters results in low rank outputs, constraining the final output of the CNN also to live on a low dimensional subspace. …

Optuna google
The package was, and still is, developed by a Japanese AI company Preferred Networks. In many ways, Optuna is similar to Hyperopt. So why should you bother? There are a few reasons:
• It’s possible to specify how long the optimization process should last
• Integration with Pandas DataFrame
• The algorithm uses pruning to discard low-quality trials early
• It’s a relatively new project, and developers continue to work on it
• It was easier to use than Hyperopt (at least for me)
How to make your model awesome with Optuna


Text-Driven Graph Embedding With Pairs Sampling (TGE-PS) google
In graphs with rich text information, constructing expressive graph representations requires incorporating textual information with structural information. Graph embedding models are becoming more and more popular in representing graphs, yet they are faced with two issues: sampling efficiency and text utilization. Through analyzing existing models, we find their training objectives are composed of pairwise proximities, and there are large amounts of redundant node pairs in Random Walk-based methods. Besides, inferring graph structures directly from texts (also known as zero-shot scenario) is a problem that requires higher text utilization. To solve these problems, we propose a novel Text-driven Graph Embedding with Pairs Sampling (TGE-PS) framework. TGE-PS uses Pairs Sampling (PS) to generate training samples which reduces ~99% training samples and is competitive compared to Random Walk. TGE-PS uses Text-driven Graph Embedding (TGE) which adopts word- and character-level embeddings to generate node embeddings. We evaluate TGE-PS on several real-world datasets, and experimental results demonstrate that TGE-PS produces state-of-the-art results in traditional and zero-shot link prediction tasks. …

If you did not already know

Super Resolution Network (SRNet) google
Adversarially trained deep neural networks have significantly improved performance of single image super resolution, by hallucinating photorealistic local textures, thereby greatly reducing the perception difference between a real high resolution image and its super resolved (SR) counterpart. However, application to medical imaging requires preservation of diagnostically relevant features while refraining from introducing any diagnostically confusing artifacts. We propose using a deep convolutional super resolution network (SRNet) trained for (i) minimising reconstruction loss between the real and SR images, and (ii) maximally confusing learned relativistic visual Turing test (rVTT) networks to discriminate between (a) pair of real and SR images (T1) and (b) pair of patches in real and SR selected from region of interest (T2). The adversarial loss of T1 and T2 while backpropagated through SRNet helps it learn to reconstruct pathorealism in the regions of interest such as white blood cells (WBC) in peripheral blood smears or epithelial cells in histopathology of cancerous biopsy tissues, which are experimentally demonstrated here. Experiments performed for measuring signal distortion loss using peak signal to noise ratio (pSNR) and structural similarity (SSIM) with variation of SR scale factors, impact of rVTT adversarial losses, and impact on reporting using SR on a commercially available artificial intelligence (AI) digital pathology system substantiate our claims. …

RHEEMix google
In pursuit of efficient and scalable data analytics, the insight that ‘one size does not fit all’ has given rise to a plethora of specialized data processing platforms and today’s complex data analytics are moving beyond the limits of a single platform. To cope with these new requirements, we present a cross-platform optimizer that allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i)~a mechanism based on graph transformations to explore alternative execution strategies; (ii)~a novel graph-based approach to efficiently plan data movement among subtasks and platforms; and (iii)~an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. The results show that our optimizer is capable of selecting the most efficient platform combination for a given task, freeing data analysts from the need to choose and orchestrate platforms. In particular, our optimizer allows certain tasks to run more than one order of magnitude faster than on state-of-the-art platforms, such as Spark. …

Conditional Teacher-Student Learning google
The teacher-student (T/S) learning has been shown to be effective for a variety of problems such as domain adaptation and model compression. One shortcoming of the T/S learning is that a teacher model, not always perfect, sporadically produces wrong guidance in form of posterior probabilities that misleads the student model towards a suboptimal performance. To overcome this problem, we propose a conditional T/S learning scheme, in which a ‘smart’ student model selectively chooses to learn from either the teacher model or the ground truth labels conditioned on whether the teacher can correctly predict the ground truth. Unlike a naive linear combination of the two knowledge sources, the conditional learning is exclusively engaged with the teacher model when the teacher model’s prediction is correct, and otherwise backs off to the ground truth. Thus, the student model is able to learn effectively from the teacher and even potentially surpass the teacher. We examine the proposed learning scheme on two tasks: domain adaptation on CHiME-3 dataset and speaker adaptation on Microsoft short message dictation dataset. The proposed method achieves 9.8% and 12.8% relative word error rate reductions, respectively, over T/S learning for environment adaptation and speaker-independent model for speaker adaptation. …

Bayes-CPACE google
We present the first PAC optimal algorithm for Bayes-Adaptive Markov Decision Processes (BAMDPs) in continuous state and action spaces, to the best of our knowledge. The BAMDP framework elegantly addresses model uncertainty by incorporating Bayesian belief updates into long-term expected return. However, computing an exact optimal Bayesian policy is intractable. Our key insight is to compute a near-optimal value function by covering the continuous state-belief-action space with a finite set of representative samples and exploiting the Lipschitz continuity of the value function. We prove the near-optimality of our algorithm and analyze a number of schemes that boost the algorithm’s efficiency. Finally, we empirically validate our approach on a number of discrete and continuous BAMDPs and show that the learned policy has consistently competitive performance against baseline approaches. …