If you did not already know

Micro-Objective Reinforcement Learning google
The standard reinforcement learning (RL) formulation considers the expectation of the (discounted) cumulative reward. This is limiting in applications where we are concerned not only with the expected performance, but also with the distribution of the performance. In this paper, we introduce micro-objective reinforcement learning, an alternative RL formalism that overcomes this issue. In this new formulation, an RL task is specified by a set of micro-objectives, which are constructs that specify the desirability or undesirability of events. In addition, micro-objectives allow prior knowledge in the form of temporal abstraction to be incorporated into the global RL objective. The generality of this formalism and its relations to single/multi-objective RL and hierarchical RL are discussed. …

Deep Autoencoder MIxture Clustering (DAMIC) google
In this paper we propose a Deep Autoencoder MIxture Clustering (DAMIC) algorithm based on a mixture of deep autoencoders, where each cluster is represented by an autoencoder. A clustering network transforms the data into another space and then selects one of the clusters. Next, the autoencoder associated with this cluster is used to reconstruct the data point. The clustering algorithm jointly learns the nonlinear data representation and the set of autoencoders. The optimal clustering is found by minimizing the reconstruction loss of the mixture-of-autoencoders network. Unlike other deep clustering algorithms, no regularization term is needed to avoid the data collapsing to a single point. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods. …
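
To make the setup concrete, here is a minimal sketch of the DAMIC objective, assuming small fully connected autoencoders and a softmax clustering network; the layer sizes and the exact form of the joint loss are illustrative assumptions, not the authors' precise architecture.

import torch
import torch.nn as nn

class DAMIC(nn.Module):
    def __init__(self, dim, k, hidden=32):
        super().__init__()
        # One autoencoder per cluster.
        self.autoencoders = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(k))
        # Clustering network: soft assignment of each point to a cluster.
        self.assign = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, k), nn.Softmax(dim=1))

    def loss(self, x):
        p = self.assign(x)                                   # (n, k) soft assignments
        recon = torch.stack([ae(x) for ae in self.autoencoders], dim=1)  # (n, k, dim)
        err = ((recon - x.unsqueeze(1)) ** 2).sum(dim=2)     # (n, k) per-cluster error
        return (p * err).sum(dim=1).mean()                   # expected reconstruction loss

After training by minimizing this loss, each point can be assigned to the cluster selected by the clustering network, e.g. model.assign(x).argmax(dim=1).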

Implicit Stochastic Gradient Descent (ISGD) google
Arguably the biggest challenge in applying neural networks is tuning the hyperparameters, in particular the learning rate. The sensitivity to the learning rate is due to the reliance on backpropagation to train the network. In this paper we present the first application of Implicit Stochastic Gradient Descent (ISGD) to train neural networks, a method known in convex optimization to be unconditionally stable and robust to the learning rate. Our key contribution is a novel layer-wise approximation of ISGD which makes its updates tractable for neural networks. Experiments show that our method is more robust to high learning rates and generally outperforms standard backpropagation on a variety of tasks. …
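
As a concrete illustration of the implicit update, here is a numpy sketch for least squares, where the implicit step has a closed form; it demonstrates the robustness to large learning rates, not the paper's layer-wise approximation for neural networks.

import numpy as np

rng = np.random.default_rng(0)
d, steps, lr = 5, 1000, 5.0                 # deliberately large learning rate
theta_true = rng.normal(size=d)
theta_sgd = np.zeros(d)
theta_isgd = np.zeros(d)

for _ in range(steps):
    x = rng.normal(size=d)
    y = x @ theta_true + 0.1 * rng.normal()
    # Explicit SGD: theta <- theta - lr * grad(theta); may diverge at this lr.
    theta_sgd -= lr * x * (x @ theta_sgd - y)
    # Implicit SGD: theta <- theta - lr * grad evaluated at the *new* theta.
    # For squared loss this has a closed form: the residual is damped by
    # 1 / (1 + lr * ||x||^2), so the step cannot blow up.
    r = (x @ theta_isgd - y) / (1.0 + lr * (x @ x))
    theta_isgd -= lr * x * r

print("explicit SGD error :", np.linalg.norm(theta_sgd - theta_true))
print("implicit SGD error :", np.linalg.norm(theta_isgd - theta_true))

The damping factor 1 / (1 + lr * ||x||^2) is exactly what makes the implicit scheme unconditionally stable in this convex example.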

Randomized Hierarchical Alternating Least Squares google
Nonnegative matrix factorization (NMF) is a powerful tool for data mining. However, the emergence of 'big data' has severely challenged our ability to compute this fundamental decomposition using deterministic algorithms. This paper presents a randomized hierarchical alternating least squares (HALS) algorithm to compute the NMF. By deriving a smaller matrix from the nonnegative input data, a more efficient nonnegative decomposition can be computed. Our algorithm scales to big data applications while attaining a near-optimal factorization, i.e., the algorithm scales with the target rank of the data rather than the ambient dimension of the measurement space. The proposed algorithm is evaluated using synthetic and real-world data and shows substantial speedups compared to deterministic HALS. …
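
The following numpy sketch shows plain HALS and a randomized variant in the spirit described above: compress the input with a randomized range finder, factor the small matrix, and lift the left factor back. Enforcing nonnegativity directly on the compressed factor and the final projection step are simplifying assumptions, not the paper's exact algorithm.

import numpy as np

def hals_nmf(A, k, iters=200, seed=0):
    # Deterministic HALS: cyclic nonnegative rank-one updates.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = np.abs(rng.normal(size=(m, k)))
    H = np.abs(rng.normal(size=(k, n)))
    for _ in range(iters):
        R = A - W @ H                               # residual of the full model
        for j in range(k):
            R += np.outer(W[:, j], H[j])            # exclude component j
            W[:, j] = np.maximum(0, R @ H[j] / (H[j] @ H[j] + 1e-12))
            H[j] = np.maximum(0, W[:, j] @ R / (W[:, j] @ W[:, j] + 1e-12))
            R -= np.outer(W[:, j], H[j])            # include updated component j
    return W, H

def randomized_hals(A, k, oversample=10, seed=0):
    # Compress A to a small matrix B = Q.T @ A via a randomized range finder,
    # factor B, then lift the left factor back with Q.
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(A @ rng.normal(size=(A.shape[1], k + oversample)))
    W_small, H = hals_nmf(Q.T @ A, k, seed=seed)
    return np.maximum(0, Q @ W_small), H            # crude nonnegative lift

The HALS inner loop now runs on a (k + oversample) x n matrix instead of an m x n one, which is the source of the claimed scaling with the target rank.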

If you did not already know

Regret With Rolling Window google
Online learning is an appealing learning paradigm of great practical interest, owing to the recent emergence of large-scale applications such as online advertising placement and online web ranking. Standard online learning assumes a finite number of samples, while in practice data is streamed infinitely. In such a setting, gradient descent with a diminishing learning rate does not work. We first introduce regret with rolling window, a new performance metric for online streaming learning, which measures the performance of an algorithm on every fixed number of contiguous samples. At the same time, we propose a family of algorithms based on gradient descent with a constant or adaptive learning rate, and provide technical analyses establishing regret-bound properties of the algorithms. We cover the convex setting, showing regret of the order of the square root of the window size in both the constant and the dynamic learning rate scenarios. Our proof also applies to the standard online setting, where we provide the first analysis of the same regret order (the previous proofs have flaws). We also study a two-layer neural network setting with ReLU activation. In this case we establish that if the initial weights are close to a stationary point, the same square-root regret bound is attainable. We conduct computational experiments demonstrating the superior performance of the proposed algorithms. …
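
For intuition, here is a hedged sketch of the metric itself, assuming a finite comparator set with a precomputed per-round loss matrix; the function and argument names are hypothetical.

import numpy as np

def rolling_window_regret(alg_loss, action_loss, w):
    """alg_loss: (T,) per-round losses of the algorithm.
    action_loss: (T, A) per-round losses of each fixed comparator action.
    Returns the largest regret over all windows of w contiguous rounds."""
    T = len(alg_loss)
    worst = -np.inf
    for t in range(T - w + 1):
        window_alg = alg_loss[t:t + w].sum()
        best_fixed = action_loss[t:t + w].sum(axis=0).min()  # best action in hindsight
        worst = max(worst, window_alg - best_fixed)
    return worst

Unlike cumulative regret, the comparator is re-chosen per window, so an algorithm must keep performing well forever on the infinite stream rather than only on average.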

Degradation Data Analysis google
Given that products are more frequently being designed with higher reliability and developed in a shorter amount of time, it is often not possible to test new designs to failure under normal operating conditions. In some cases, it is possible to infer the reliability behavior of unfailed test samples with only the accumulated test time information and assumptions about the distribution. However, this generally leads to a great deal of uncertainty in the results. Another option in this situation is the use of degradation analysis. Degradation analysis involves the measurement of performance data that can be directly related to the presumed failure of the product in question. Many failure mechanisms can be directly linked to the degradation of part of the product, and degradation analysis allows the analyst to extrapolate to an assumed failure time based on the measurements of degradation over time. …

Online FAult Detection (FADO) google
This paper proposes and studies a detection technique for adversarial scenarios (dubbed deterministic detection). This technique provides an alternative detection methodology in case the usual stochastic methods are not applicable: this can be because the studied phenomenon does not follow a stochastic sampling scheme, samples are high-dimensional and subsequent multiple-testing corrections render results overly conservative, sample sizes are too low for asymptotic results (e.g., the central limit theorem) to kick in, or one cannot allow for the small probability of failure inherent to stochastic approaches. This paper instead designs a method based on insights from machine learning and online learning theory: this detection algorithm, named Online FAult Detection (FADO), comes with theoretical guarantees of its detection capabilities. A version of the margin is found to regulate the detection performance of FADO. A precise expression is derived for bounding the performance, and experimental results are presented assessing the influence of the quantities involved. A case study of scene detection is used to illustrate the approach. The technology is closely related to the linear perceptron rule, and inherits its computational attractiveness and flexibility towards various extensions. …

CURIOUS google
In open-ended and changing environments, agents face a wide range of potential tasks that may or may not come with associated reward functions. Such autonomous learning agents must be able to generate their own tasks through a process of intrinsically motivated exploration, some of which might prove easy, others impossible. For this reason, they should be able to actively select which task to practice at any given moment, to maximize their overall mastery on the set of learnable tasks. This paper proposes CURIOUS, an extension of Universal Value Function Approximators that enables intrinsically motivated agents to learn to achieve both multiple tasks and multiple goals within a unique policy, leveraging hindsight learning. Agents focus on achievable tasks first, using an automated curriculum learning mechanism that biases their attention towards tasks maximizing the absolute learning progress. This mechanism provides robustness to catastrophic forgetting (by refocusing on tasks where performance decreases) and distracting tasks (by avoiding tasks with no absolute learning progress). Furthermore, we show that having two levels of parameterization (tasks and goals within tasks) enables more efficient learning of skills in an environment with a modular physical structure (e.g. multiple objects) as compared to flat, goal-parameterized RL with hindsight experience replay. …

If you did not already know

Cloud SELENE (cSELENE) google
When working in collaborative teams, federated (often huge) data may come from heterogeneous cloud vendors. The concern is not only data privacy but also how such federated data can be queried from the cloud directly, quickly, and securely. One previous solution offered a hybrid cloud between a public and a trusted private cloud; another applied encryption on the MapReduce framework. The challenge here, however, is that we are working across heterogeneous clouds. In this paper, we present a novel technique for privacy-aware querying. Since we take execution time into account, our basic idea is to use a data mining model that partitions the federated databases in order to reduce search and query time. Using a model of the database means using only a summary, or the most characteristic patterns, of the database. Modeling is Privacy-Preserving Stage I, since by modeling the data is symbolized; we implement encryption on the database as Privacy-Preserving Stage II. Our system, called 'cSELENE' (for 'cloud SELENE'), is designed to handle federated data on heterogeneous clouds (AWS, Microsoft Azure, and Google Cloud Platform) with the MapReduce technique. In this paper we discuss the privacy-preserving system and threat model, the format of the federated data, the parallel programming involved (GPU programming and shared-memory systems), the parallel and secure algorithm for data mining models in distributed clouds, the cloud infrastructure/architecture, and the UIX design of the cSELENE system. Other issues, such as incremental methods and the secure design of the cloud architecture (cross-platform virtual machine design), remain open for discussion. Our experiments should demonstrate the validity and practicality of the proposed high-performance computing scheme. …

Hidden-Layer LSTM (H-LSTM) google
Long short-term memory (LSTM) has been widely used for sequential data modeling. Researchers have increased LSTM depth by stacking LSTM cells to improve performance. This incurs model redundancy, increases run-time delay, and makes the LSTMs more prone to overfitting. To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to the LSTM's original one-level non-linear control gates. H-LSTM increases accuracy while employing fewer external stacked layers, thus significantly reducing the number of parameters and the run-time latency. We employ grow-and-prune (GP) training to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections. This learns both the weights and the compact architecture of the H-LSTM control gates. We have GP-trained H-LSTMs for image captioning and speech recognition applications. For the NeuralTalk architecture on the MSCOCO dataset, our three models reduce the number of parameters by 38.7x (floating-point operations (FLOPs) by 45.5x) and run-time latency by 4.5x, and improve the CIDEr score by 2.6. For the DeepSpeech2 architecture on the AN4 dataset, our two models reduce the number of parameters by 19.4x (FLOPs by 23.5x), run-time latency by 15.7%, and the word error rate from 12.9% to 8.7%. Thus, GP-trained H-LSTMs are compact, fast, and accurate. …

CPDist google
Preferences are central to decision making by both machines and humans. Representing, learning, and reasoning with preferences is an important area of study both within computer science and across the sciences. When working with preferences it is necessary to understand and compute the distance between sets of objects, e.g., the preferences of a user and the descriptions of objects to be recommended. We present CPDist, a novel neural network to address the problem of learning to measure the distance between structured preference representations. We use the popular CP-net formalism to represent preferences and then leverage deep neural networks to learn a recently proposed metric function that is computationally hard to compute directly. CPDist is a novel metric learning approach based on the use of deep siamese networks which learn the Kendall tau distance between partial orders that are induced by compact preference representations. We find that CPDist is able to learn the distance function with high accuracy and outperform existing approximation algorithms on both the regression and classification tasks, using less computation time. Performance remains good even when CPDist is trained with only a small number of samples compared to the dimension of the solution space, indicating that the network generalizes well. …

Relational Reasoning Network (RRN) google
Accurately identifying anatomical landmarks is a crucial step in deformation analysis and surgical planning for craniomaxillofacial (CMF) bones. Available methods require segmentation of the object of interest for precise landmarking. Unlike those, our purpose in this study is to perform anatomical landmarking using the inherent relations of CMF bones without explicitly segmenting them. We propose a new deep network architecture, called relational reasoning network (RRN), to accurately learn the local and global relations of the landmarks. Specifically, we are interested in learning landmarks in the CMF region: the mandible, maxilla, and nasal bones. The proposed RRN works in an end-to-end manner, utilizing learned relations of the landmarks based on dense-block units and without the need for segmentation. Given a few landmarks as input, the proposed system accurately and efficiently localizes the remaining landmarks on the aforementioned bones. For a comprehensive evaluation of RRN, we used cone-beam computed tomography (CBCT) scans of 250 patients. The proposed system identifies the landmark locations very accurately even when there are severe pathologies or deformations in the bones. The proposed RRN has also revealed unique relationships among the landmarks that help us draw inferences about the informativeness of the landmark points. RRN is invariant to the order of the landmarks, and it allowed us to discover the optimal configurations (number and location) of landmarks to be localized within the object of interest (mandible) or nearby objects (maxilla and nasal bones). To the best of our knowledge, this is the first algorithm of its kind to find anatomical relations of objects using deep learning. …

If you did not already know

Extreme Gradient Boosting google
Extreme Gradient Boosting (XGBoost) is an efficient implementation of the gradient boosting framework. …
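
A minimal usage sketch with the xgboost Python package's scikit-learn interface (assuming xgboost and scikit-learn are installed; the parameter values are illustrative, not recommendations).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Gradient-boosted trees: n_estimators rounds of depth-limited trees,
# each fit to the gradient of the loss at the current ensemble.
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))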

Temporal Walk google
Networks evolve continuously over time with the addition, deletion, and changing of links and nodes. Such temporal networks (or edge streams) consist of a sequence of timestamped edges and are seemingly ubiquitous. Despite the importance of accurately modeling the temporal information, most embedding methods ignore it entirely or approximate the temporal network using a sequence of static snapshot graphs. In this work, we introduce the notion of temporal walks for learning dynamic embeddings from temporal networks. Temporal walks capture the temporally valid interactions (e.g., flow of information, spread of disease) in the dynamic network in a lossless fashion. Based on the notion of temporal walks, we describe a general class of embeddings called continuous-time dynamic network embeddings (CTDNEs) that completely avoid the issues and problems that arise when approximating the temporal network as a sequence of static snapshot graphs. Unlike previous work, CTDNEs learn dynamic node embeddings directly from the temporal network at the finest temporal granularity and thus use only temporally valid information. As such, CTDNEs naturally support online learning of the node embeddings in a streaming, real-time fashion. The experiments demonstrate the effectiveness of this class of embedding methods for prediction in temporal networks. …
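
A minimal sketch of sampling one temporal walk: each step may only traverse an edge whose timestamp is no earlier than that of the previous step, so the walk is a temporally valid path. The edge-list format and the uniform choice among valid continuations are illustrative assumptions.

import random
from collections import defaultdict

def temporal_walk(edges, start, length, seed=0):
    """edges: list of (u, v, t) timestamped edges; start: starting node."""
    rng = random.Random(seed)
    out = defaultdict(list)                  # node -> [(timestamp, neighbor)]
    for u, v, t in edges:
        out[u].append((t, v))
    walk, node, t_prev = [start], start, float("-inf")
    for _ in range(length):
        # Only edges with non-decreasing timestamps are temporally valid.
        valid = [(t, v) for t, v in out[node] if t >= t_prev]
        if not valid:                        # no temporally valid continuation
            break
        t_prev, node = rng.choice(valid)
        walk.append(node)
    return walk

Feeding such walks to a skip-gram style embedding model, in place of static random walks, is what distinguishes CTDNEs from snapshot-based approaches.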

Knowledge Based end-to-end Memory Network (KOBE) google
End-to-end dialog systems have become very popular because they hold the promise of learning directly from human-to-human dialog interaction. Retrieval and generative methods have been explored in this area with mixed results. A key element that is missing so far is the incorporation of a-priori knowledge about the task at hand. This knowledge may exist in the form of structured or unstructured information. As a first step in this direction, we present a novel approach, Knowledge based end-to-end memory networks (KB-memN2N), which allows special handling of named entities for goal-oriented dialog tasks. We present results on two datasets: the DSTC6 challenge dataset and the dialog bAbI tasks. …

Social Explorative Attention Network (SEAN) google
Effective content recommendation in modern social media platforms should benefit both creators, by bringing them genuine rewards, and consumers, by helping them find genuinely interesting content. In this paper, we propose a model called Social Explorative Attention Network (SEAN) for content recommendation. SEAN uses a personalized content recommendation model to encourage recommendation driven by personal interests. Moreover, SEAN allows the personalization factors to attend to users' higher-order friends on the social network to improve the accuracy and diversity of recommendation results. Constructing two datasets from a popular decentralized content distribution platform, Steemit, we compare SEAN with state-of-the-art collaborative filtering (CF) and content-based recommendation approaches. Experimental results demonstrate the effectiveness of SEAN in terms of both Gini coefficients for recommendation equality and F1 scores for recommendation performance. …

If you did not already know

Syntactic Scaffold google
We introduce the syntactic scaffold, an approach to incorporating syntactic information into semantic tasks. Syntactic scaffolds avoid expensive syntactic processing at runtime, only making use of a treebank during training, through a multitask objective. We improve over strong baselines on PropBank semantics, frame semantics, and coreference resolution, achieving competitive performance on all three tasks. …

dypro google
Neural program embedding has shown potential in aiding the analysis of large-scale, complicated software. Newly proposed deep neural architectures pride themselves on learning program semantics rather than superficial syntactic features. However, by considering the source code only, the vast majority of neural networks do not capture a deep, precise representation of program semantics. In this paper, we present dypro, a novel deep neural network that learns from program execution traces. Compared to prior dynamic models, not only is dypro capable of generalizing across multiple executions for learning a program's dynamic semantics in its entirety, but dypro is also more efficient when dealing with programs yielding long execution traces. For evaluation, we task dypro with semantic classification (i.e., categorizing programs based on their semantics) and compare it against two prominent static models: Gated Graph Neural Network and TreeLSTM. We find that dypro achieves the highest prediction accuracy among all models. To further reveal the capacity of all the aforementioned deep neural architectures, we examine whether the models can learn to detect deeper semantic properties of a program. In particular, given the task of recognizing loop invariants, we show that dypro beats all static models by a wide margin. …

Batch-Mode Active Learning google
Recently, Convolutional Neural Networks (CNNs) have shown unprecedented success in the field of computer vision, especially on challenging image classification tasks, by relying on a universal approach: training a deep model on a massive dataset of supervised examples. While unlabeled data are often an abundant resource, collecting a large set of labeled data, on the other hand, is very expensive and often requires considerable human effort. One way to ease this is to effectively select and label highly informative instances from a pool of unlabeled data (i.e., active learning). This paper proposes a new method of batch-mode active learning, Dual Active Sampling (DAS), which is based on a simple assumption: if two deep neural networks (DNNs) of the same structure, trained on the same dataset, give significantly different outputs for a given sample, then that particular sample should be picked for additional training. While other state-of-the-art methods in this field usually require intensive computational power or rely on a complicated structure, DAS is simpler to implement and manages to obtain improved results on CIFAR-10 with preferable computational time compared to the core-set method. …
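
A short sketch of the selection step under this assumption; the two networks, the unlabeled pool tensor, and the L1 gap between softmax outputs as the disagreement measure are illustrative stand-ins (the paper's exact criterion may differ).

import torch

def das_select(net_a, net_b, unlabeled, batch_size):
    """Return indices of the unlabeled samples on which the two
    same-architecture networks disagree most."""
    with torch.no_grad():
        p_a = torch.softmax(net_a(unlabeled), dim=1)
        p_b = torch.softmax(net_b(unlabeled), dim=1)
        disagreement = (p_a - p_b).abs().sum(dim=1)   # L1 gap between outputs
    return torch.topk(disagreement, batch_size).indices

The selected batch is then labeled and added to the training set of both networks before the next acquisition round.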

Projection Weighted Canonical Correlation Analysis (projection weighted CCA) google
Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA, a recently proposed method. We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks with identical topology but different learning rates converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations. …
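
A hedged numpy sketch of the idea: compute canonical correlations between two activation matrices and average them with weights reflecting how much of the original representation each canonical direction accounts for. The exact weighting and normalization below are assumptions, not necessarily the paper's formulas.

import numpy as np

def pwcca(X, Y):
    """X, Y: (n, neurons) activation matrices with n larger than either width."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)                    # orthonormal bases of each
    Qy, _ = np.linalg.qr(Y)                    # representation's column space
    U, rho, _ = np.linalg.svd(Qx.T @ Qy)       # rho: canonical correlations
    H = Qx @ U                                  # canonical vectors in X-space
    alpha = np.abs(H.T @ X).sum(axis=1)         # weight = signal accounted for
    k = min(X.shape[1], Y.shape[1])
    alpha = alpha[:k] / alpha[:k].sum()
    return float(alpha @ rho[:k])               # weighted mean correlation

Weighting by accounted signal, rather than averaging all correlations equally, is what separates signal directions from noise directions when comparing networks.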

If you did not already know

RStudio Connect google
RStudio Connect is a new publishing platform for the work your teams create in R. Share Shiny applications, R Markdown reports, dashboards, plots, and more in one convenient place. Use push-button publishing from the RStudio IDE, scheduled execution of reports, and flexible security policies to bring the power of data science to your entire enterprise. …

Fuzzy Rule Interpolation (FRI) google
A way to perform fuzzy inference by interpolating among the existing fuzzy rules, based on various distance and similarity measures of fuzzy sets. It is a suitable method for handling sparse fuzzy rule bases, since FRI methods can provide reasonable (interpolated/extrapolated) conclusions even if none of the existing rules fires under the current observation.
Fuzzy Rule Interpolation Methods and FRI Toolbox


SLIM google
Model interpretability is an increasingly important component of practical machine learning. Some of the most common forms of interpretability systems are example-based, local, and global explanations. One of the main challenges in interpretability is designing explanation systems that can capture aspects of each of these explanation types, in order to develop a more thorough understanding of the model. We address this challenge in a novel model called SLIM that uses local linear modeling techniques along with a dual interpretation of random forests (both as a supervised neighborhood approach and as a feature selection method). SLIM has two fundamental advantages over existing interpretability systems. First, while it is effective as a black-box explanation system, SLIM itself is a highly accurate predictive model that provides faithful self explanations, and thus sidesteps the typical accuracy-interpretability trade-off. Second, SLIM provides both example-based and local explanations and can detect global patterns, which allows it to diagnose limitations in its local explanations. …

Perturbative GAN google
Perturbative GAN, which replaces the convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that add a fixed noise mask, is proposed. Compared with the convolutional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the inception score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Perturbative GAN is evaluated using conventional datasets (CIFAR10, LSUN, ImageNet), both when a perturbation layer is adopted only for the generator and when it is introduced to both the generator and the discriminator. …

If you did not already know

SAF-Pooling google
Major winning Convolutional Neural Networks (CNNs), such as VGGNet, ResNet, and DenseNet, include tens to hundreds of millions of parameters, which impose considerable computation and memory overheads. This limits their practical use in training and optimizing for real-world applications. On the contrary, light-weight architectures, such as SqueezeNet, are being proposed to address this issue. However, they mainly suffer from low accuracy, as they have compromised between processing power and efficiency. These inefficiencies mostly stem from following an ad-hoc design procedure. In this work, we discuss and propose several crucial design principles for an efficient architecture design and elaborate intuitions concerning different aspects of the design procedure. Furthermore, we introduce a new layer called SAF-pooling to improve the generalization power of the network while keeping it simple by choosing the best features. Based on such principles, we propose a simple architecture called SimpNet. We empirically show that SimpNet provides a good trade-off between computation/memory efficiency and accuracy solely based on these primitive but crucial principles. SimpNet outperforms deeper and more complex architectures such as VGGNet, ResNet, and WideResidualNet on several well-known benchmarks, while having 2 to 25 times fewer parameters and operations. We obtain state-of-the-art results (in terms of the balance between accuracy and the number of parameters involved) on standard datasets, such as CIFAR10, CIFAR100, MNIST and SVHN. The implementations are available at https://…/SimpNet.

Hierarchical Semantic Embedding (HSE) google
Object categories inherently form a hierarchy with different levels of concept abstraction, especially for fine-grained categories. For example, birds (Aves) can be categorized according to a four-level hierarchy of order, family, genus, and species. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make prediction less ambiguous. However, previous studies of fine-grained image recognition primarily focus on categories of one certain level and usually overlook this correlation information. In this work, we investigate simultaneously predicting categories of different levels in the hierarchy and integrating this structured correlation information into the deep neural network by developing a novel Hierarchical Semantic Embedding (HSE) framework. Specifically, the HSE framework sequentially predicts the category score vector of each level in the hierarchy, from highest to lowest. At each level, it incorporates the predicted score vector of the higher level as prior knowledge to learn finer-grained feature representation. During training, the predicted score vector of the higher level is also employed to regularize label prediction by using it as soft targets of corresponding sub-categories. To evaluate the proposed framework, we organize the 200 bird species of the Caltech-UCSD birds dataset with the four-level category hierarchy and construct a large-scale butterfly dataset that also covers four level categories. Extensive experiments on these two and the newly-released VegFru datasets demonstrate the superiority of our HSE framework over the baseline methods and existing competitors. …

Pigeonring google
The pigeonhole principle states that if n items are contained in m boxes, then at least one box has no fewer than n/m items. It is utilized to solve many data management problems, especially for thresholded similarity searches. Despite the many pigeonhole principle-based solutions proposed in the last few decades, the condition stated by the principle is weak: it only constrains the number of items in a single box. By organizing the boxes in a ring, we observe that the number of items in multiple boxes is also constrained. We propose a new principle called the pigeonring principle, which formally captures such constraints and yields stronger conditions. To utilize the pigeonring principle, we focus on problems defined in the form of identifying data objects whose similarities or distances to the query are constrained by a threshold. Many solutions to these problems utilize the pigeonhole principle to find candidates that satisfy a filtering condition. By the pigeonring principle, stronger filtering conditions can be established. We show that the pigeonhole principle is a special case of the pigeonring principle. This suggests that all solutions based on the pigeonhole principle can potentially be accelerated by the pigeonring principle. A universal filtering framework is introduced to encompass the solutions to these problems based on the pigeonring principle. Besides, we discuss how to quickly find candidates specified by the pigeonring principle with minor modifications on top of existing algorithms. Experimental results on real datasets demonstrate the applicability of the pigeonring principle as well as the superior performance of algorithms based on the principle. …
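
A small script makes the strengthening concrete: summing any k consecutive boxes over all m ring positions counts every item exactly k times, so some window of k consecutive boxes must contain at least ceil(k*n/m) items, and k = 1 recovers the pigeonhole principle.

import math
import random

random.seed(0)
n, m = 100, 7
boxes = [0] * m
for _ in range(n):                                 # drop n items into m boxes
    boxes[random.randrange(m)] += 1

assert max(boxes) >= math.ceil(n / m)              # pigeonhole: one box
for k in range(1, m + 1):                          # pigeonring: every window size
    windows = [sum(boxes[(i + j) % m] for j in range(k)) for i in range(m)]
    assert max(windows) >= math.ceil(k * n / m)
print("pigeonhole and pigeonring conditions hold")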

Non-Interfering Comparison-Exchange (NICE) google
In studying the statistical frequency of exchange in comparison-exchange (CE) networks, we discover a new elementary form of comparison-exchange, which we name the '2-op'. The operation supports concurrent and non-interfering operation of two traditional CEs upon one shared element. More than merely improving overall statistical performance, the introduction of NICE (non-interfering CE) networks lowers long-held bounds on the number of stages required for sorting tasks. Code-based CEs also benefit from improved average/worst-case run-time costs. …

If you did not already know

Deep Residual Hashing google
In this paper, we define an extension of the supersymmetric hyperbolic nonlinear sigma model introduced by Zirnbauer. We show that it arises as a weak joint limit of a time-changed version, introduced by Sabot and Tarrès, of the vertex-reinforced jump process. It describes the asymptotics of rescaled crossing numbers, rescaled fluctuations of local times, asymptotic local times on a logarithmic scale, endpoints of paths, and last exit trees. …

EILearn google
We propose an algorithm for incremental learning of classifiers. The proposed method enables an ensemble of classifiers to learn incrementally by accommodating new training data. We use an effective mechanism to overcome the stability-plasticity dilemma. In incremental learning, the general convention is to use only the knowledge acquired in the previous phase but not the previously seen data. We follow this convention by retaining the previously acquired knowledge which is relevant and using it along with the current data. The performance of each classifier is monitored to eliminate the poorly performing classifiers in the subsequent phases. Experimental results show that the proposed approach outperforms the existing incremental learning approaches. …

BiasedWalk google
Network embedding algorithms are able to learn latent feature representations of nodes, transforming networks into lower-dimensional vector representations. Typical key applications, which have effectively been addressed using network embeddings, include link prediction, multilabel classification and community detection. In this paper, we propose BiasedWalk, a scalable, unsupervised feature learning algorithm that is based on biased random walks to sample context information about each node in the network. Our random-walk based sampling can behave like Breadth-First Search (BFS) and Depth-First Search (DFS) sampling, with the goal of capturing homophily and role equivalence between the nodes in the network. We have performed a detailed experimental evaluation comparing the performance of the proposed algorithm against various baseline methods, on several datasets and learning tasks. The experimental results show that the proposed method outperforms the baselines on most of the tasks and datasets. …
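
A sketch of one way such a bias can interpolate between BFS-like sampling (stay near the source, capturing homophily) and DFS-like sampling (move away, capturing structural roles), weighting the next step by distance from the source node; the exponential form of the bias is an illustrative choice, not BiasedWalk's exact rule.

import random
import networkx as nx

def biased_walk(G, source, length, beta=0.5, seed=0):
    """beta < 1 favors nodes close to the source (BFS-like);
    beta > 1 favors nodes far from it (DFS-like)."""
    rng = random.Random(seed)
    dist = nx.single_source_shortest_path_length(G, source)
    walk, node = [source], source
    for _ in range(length):
        nbrs = list(G.neighbors(node))
        if not nbrs:
            break
        weights = [beta ** dist[v] for v in nbrs]   # bias by distance to source
        node = rng.choices(nbrs, weights=weights)[0]
        walk.append(node)
    return walk

As with other random-walk methods, the sampled walks are then fed to a skip-gram model to produce the node embeddings.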

Cyclically Annealed Learning Rate (CALR) google
Deep learning models have become state of the art for natural language processing (NLP) tasks; however, deploying these models in production systems poses significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that leverages low-rank matrix factorization during training, to compress the word embedding layer, which represents the size bottleneck for most NLP models. Our models are trained, compressed, and then further re-trained on the downstream task to recover accuracy while maintaining the reduced size. Empirically, we show that the proposed method can achieve 90% compression with minimal impact on accuracy for sentence classification tasks, and outperforms alternative methods like fixed-point quantization or offline word embedding compression. We also analyze the inference time and storage space for our method through FLOP calculations, showing that we can compress DNN models by a configurable ratio and regain accuracy loss without introducing additional latency compared to fixed-point quantization. Finally, we introduce a novel learning rate schedule, the Cyclically Annealed Learning Rate (CALR), which we empirically demonstrate to outperform other popular adaptive learning rate algorithms on a sentence classification benchmark. …
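
The embedding compression idea can be sketched compactly: replace the V x d embedding matrix with two low-rank factors, so the parameter count drops from V*d to r*(V+d). The module below is an illustrative assumption, not the paper's exact architecture.

import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    def __init__(self, vocab, dim, rank):
        super().__init__()
        self.low = nn.Embedding(vocab, rank)            # A: vocab x rank lookup
        self.up = nn.Linear(rank, dim, bias=False)      # B: rank x dim projection

    def forward(self, token_ids):
        # Equivalent to looking up rows of the rank-r product A @ B.
        return self.up(self.low(token_ids))

For example, with vocab = 50000, dim = 300, and rank = 30, the layer shrinks from 15M to about 1.5M parameters, roughly the 90% compression regime reported above.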

If you did not already know

Linear Centralization Classifier (LCC) google
A classification algorithm, called the Linear Centralization Classifier (LCC), is introduced. The algorithm seeks to find a transformation that best maps instances from the feature space to a space where they concentrate towards the center of their own classes, while maximizing the distance between class centers. We formulate the classifier as a quadratic program with quadratic constraints. We then simplify this formulation to a linear program that can be solved effectively using a linear programming solver (e.g., simplex-dual). We extend the formulation for LCC to enable the use of kernel functions for non-linear classification applications. We compare our method with two standard classification methods (support vector machine and linear discriminant analysis) and four state-of-the-art classification methods when they are applied to eight standard classification datasets. Our experimental results show that LCC is able to classify instances more accurately (based on the area under the receiver operating characteristic) in comparison to other tested methods on the chosen datasets. We also report results for LCC with a particular kernel for solving synthetic non-linear classification problems. …

Mutual Information Neural Entropic Estimation (MI-NEE) google
We point out a limitation of the mutual information neural estimation (MINE) where the network fails to learn at the initial training phase, leading to slow convergence in the number of training iterations. To solve this problem, we propose a faster method called the mutual information neural entropic estimation (MI-NEE). Our solution first generalizes MINE to estimate the entropy using a custom reference distribution. The entropy estimate can then be used to estimate the mutual information. We argue that the seemingly redundant intermediate step of entropy estimation allows one to improve the convergence by an appropriate reference distribution. In particular, we show that MI-NEE reduces to MINE in the special case when the reference distribution is the product of marginal distributions, but faster convergence is possible by choosing the uniform distribution as the reference distribution instead. Compared to the product of marginals, the uniform distribution introduces more samples in low-density regions and fewer samples in high-density regions, which appear to lead to an overall larger gradient for faster convergence. …
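
The reasoning can be summarized with a standard identity (a sketch, assuming a product reference distribution Q_X Q_Y under the usual absolute-continuity conditions):

I(X;Y) = D(P_{XY} \| Q_X Q_Y) - D(P_X \| Q_X) - D(P_Y \| Q_Y),

where each divergence is estimated with the Donsker-Varadhan bound that MINE optimizes over a neural critic T:

D(P \| Q) = \sup_{T} \; \mathbb{E}_P[T] - \log \mathbb{E}_Q[e^{T}].

With a uniform reference on a bounded support of volume V, D(P \| Q) = \log V - H(P), so each divergence estimate is an entropy estimate up to a constant, which is the entropic route described above; choosing Q_X Q_Y as the product of marginals instead collapses the identity back to plain MINE.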

CascadeCNN google
This work presents CascadeCNN, an automated toolflow that pushes the quantisation limits of any given CNN model, aiming to perform high-throughput inference. A two-stage architecture tailored for any given CNN-FPGA pair is generated, consisting of a low- and high-precision unit in a cascade. A confidence evaluation unit is employed to identify misclassified cases from the excessively low-precision unit and forward them to the high-precision unit for re-processing. Experiments demonstrate that the proposed toolflow can achieve a performance boost up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy, without the need of retraining the model or accessing the training data. …
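
The run-time logic of the cascade can be sketched as follows, using max softmax probability against a threshold as an illustrative stand-in for the toolflow's confidence evaluation unit (the actual unit and its tuning are part of the generated design).

import torch

def cascade_predict(low_model, high_model, x, threshold=0.9):
    with torch.no_grad():
        probs = torch.softmax(low_model(x), dim=1)     # cheap low-precision pass
        conf, pred = probs.max(dim=1)
        uncertain = conf < threshold                   # suspected misclassifications
        if uncertain.any():                            # re-process only those samples
            pred[uncertain] = high_model(x[uncertain]).argmax(dim=1)
    return pred

Throughput improves because the expensive high-precision unit only sees the fraction of inputs the low-precision unit is unsure about.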

ReCode google
In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce ReCode, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model. First, we retrieve sentences that are similar to input sentences using a dynamic-programming-based sentence similarity scoring method. Next, we extract n-grams of action sequences that build the associated abstract syntax tree. Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code. We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU. …

If you did not already know

Radial Prediction Layer (RPL) google
For a broad variety of critical applications, it is essential to know how confident a classification prediction is. In this paper, we discuss the drawbacks of softmax for calculating class probabilities and handling uncertainty in Bayesian neural networks. We introduce a new kind of prediction layer called the radial prediction layer (RPL) to overcome these issues. In contrast to softmax classification, RPL is based on the open-world assumption. Therefore, the class prediction probabilities are much more meaningful for assessing the uncertainty concerning the novelty of the input. We show that neural networks with RPLs can be learned in the same way as neural networks using softmax. On a 2D toy data set (spiral data), we demonstrate the fundamental principles and advantages. On the real-world ImageNet data set, we show that the open-world properties are beneficially fulfilled. Additionally, we show that RPLs are less sensitive to adversarial attacks on the MNIST data set. Due to its features, we expect RPL to be beneficial in a broad variety of applications, especially in critical environments, such as medicine or autonomous driving. …
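
A minimal sketch of what such a layer can look like: per-class scores computed from distances to learned class prototypes, so that an input far from every prototype receives low scores for all classes (the open-world behavior described above, in contrast to softmax, which always sums to one). The Gaussian kernel and the single prototype per class are illustrative assumptions.

import torch
import torch.nn as nn

class RadialPredictionLayer(nn.Module):
    def __init__(self, feat_dim, n_classes, gamma=1.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.gamma = gamma

    def forward(self, z):                          # z: (batch, feat_dim) features
        d2 = torch.cdist(z, self.prototypes) ** 2  # squared distance to each prototype
        return torch.exp(-self.gamma * d2)         # per-class scores in (0, 1]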

Stripe google
Hardware architectures and machine learning (ML) libraries evolve rapidly. Traditional compilers often fail to generate high-performance code across the spectrum of new hardware offerings. To mitigate, engineers develop hand-tuned kernels for each ML library update and hardware upgrade. Unfortunately, this approach requires excessive engineering effort to scale or maintain with any degree of state-of-the-art performance. Here we present a Nested Polyhedral Model for representing highly parallelizable computations with limited dependencies between iterations. This model provides an underlying framework for an intermediate representation (IR) called Stripe, amenable to standard compiler techniques while naturally modeling key aspects of modern ML computing. Stripe represents parallelism, efficient memory layout, and multiple compute units at a level of abstraction amenable to automatic optimization. We describe how Stripe enables a compiler for ML in the style of LLVM that allows independent development of algorithms, optimizations, and hardware accelerators. We also discuss the design exploration advantages of Stripe over kernel libraries and schedule-based or schedule-space-based code generation. …

Graph Augmented Memory Network (GAMENet) google
Recent progress in deep learning is revolutionizing the healthcare domain, including providing solutions to medication recommendation, especially recommending medication combinations for patients with complex health conditions. Existing approaches either do not customize based on patient health history, or ignore existing knowledge on drug-drug interactions (DDI) that might lead to adverse outcomes. To fill this gap, we propose the Graph Augmented Memory Networks (GAMENet), which integrates the drug-drug interaction knowledge graph via a memory module implemented as graph convolutional networks, and models longitudinal patient records as the query. It is trained end-to-end to provide safe and personalized recommendations of medication combinations. We demonstrate the effectiveness and safety of GAMENet by comparing it with several state-of-the-art methods on real EHR data. GAMENet outperformed all baselines in all effectiveness measures, and also achieved a 3.60% DDI rate reduction from existing EHR data. …

Semi-Supervised Novelty Detection (SSND) google
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well-known problems in hypothesis testing. First, our results provide a general solution to the two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions. …
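
The reduction is simple enough to sketch end to end: label the nominal sample as one class and the unlabeled, possibly contaminated sample as the other, train any binary classifier, and set the decision threshold on nominal data to meet the target false positive rate. The classifier choice and thresholding details below are illustrative, not the paper's prescribed procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression

def ssnd_detector(X_nominal, X_unlabeled, alpha=0.05):
    """Return a detector flagging novelties with false positive rate <= alpha
    (approximately, as calibrated on the nominal training sample)."""
    X = np.vstack([X_nominal, X_unlabeled])
    y = np.r_[np.zeros(len(X_nominal)), np.ones(len(X_unlabeled))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)   # nominal vs. unlabeled
    scores = clf.predict_proba(X_nominal)[:, 1]
    tau = np.quantile(scores, 1 - alpha)                # threshold for target FPR
    return lambda X_new: clf.predict_proba(X_new)[:, 1] > tau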