Representing Attitudes Towards Ambiguity in Managerial Decisions

We provide here a general mathematical framework to model attitudes towards ambiguity which uses the formalism of quantum theory as a ‘purely mathematical formalism, detached from any physical interpretation’. We show that the quantum-theoretic framework enables modelling of the ‘Ellsberg paradox’, but it also successfully applies to more concrete human decision-making (DM) tests involving financial, managerial and medical decisions. In particular, we provide a faithful mathematical representation of various empirical studies which reveal that attitudes of managers towards uncertainty shift from ‘ambiguity seeking’ to ‘ambiguity aversion’, and viceversa, thus exhibiting ‘hope effects’ and ‘fear effects’ in management decisions. The present framework provides a new bold and promising direction towards the development of a unified theory of decisions in the presence of uncertainty.


Reinforcement Learning with Chromatic Networks

We present a new algorithm for finding compact neural networks encoding reinforcement learning (RL) policies. To do it, we leverage in the novel RL setting the theory of pointer networks and ENAS-type algorithms for combinatorial optimization of RL policies as well as recent evolution strategies (ES) optimization methods, and propose to define the combinatorial search space to be the the set of different edge-partitionings (colorings) into same-weight classes. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as 17 weight parameters, providing 6x compression over state-of-the-art compact policies based on Toeplitz matrices. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.


Efficient Uncertainty Modeling for System Design via Mixed Integer Programming

The post-Moore era casts a shadow of uncertainty on many aspects of computer system design. Managing that uncertainty requires new algorithmic tools to make quantitative assessments. While prior uncertainty quantification methods, such as generalized polynomial chaos (gPC), show how to work precisely under the uncertainty inherent to physical devices, these approaches focus solely on variables from a continuous domain. However, as one moves up the system stack to the architecture level many parameters are constrained to a discrete (integer) domain. This paper proposes an efficient and accurate uncertainty modeling technique, named mixed generalized polynomial chaos (M-gPC), for architectural uncertainty analysis. The M-gPC technique extends the generalized polynomial chaos (gPC) theory originally developed in the uncertainty quantification community, such that it can efficiently handle the mixed-type (i.e., both continuous and discrete) uncertainties in computer architecture design. Specifically, we employ some stochastic basis functions to capture the architecture-level impact caused by uncertain parameters in a simulator. We also develop a novel mixed-integer programming method to select a small number of uncertain parameter samples for detailed simulations. With a few highly informative simulation samples, an accurate surrogate model is constructed in place of cycle-level simulators for various architectural uncertainty analysis. In the chip-multiprocessor (CMP) model, we are able to estimate the propagated uncertainties with only 95 samples whereas Monte Carlo requires 5*10^4 samples to achieve the similar accuracy. We also demonstrate the efficiency and effectiveness of our method on a detailed DRAM subsystem.


Freeze and Chaos for DNNs: an NTK view of Batch Normalization, Checkerboard and Boundary Effects

In this paper, we analyze a number of architectural features of Deep Neural Networks (DNNs), using the so-called Neural Tangent Kernel (NTK). The NTK describes the training trajectory and generalization of DNNs in the infinite-width limit. In this limit, we show that for (fully-connected) DNNs, as the depth grows, two regimes appear: ‘freeze’ (also known as ‘order’), where the (scaled) NTK converges to a constant (slowing convergence), and ‘chaos’, where it converges to a Kronecker delta (limiting generalization). We show that when using the scaled ReLU as a nonlinearity, we naturally end up in the ‘freeze’. We show that Batch Normalization (BN) avoids the freeze regime by reducing the importance of the constant mode in the NTK. A similar effect is obtained by normalizing the nonlinearity which moves the network to the chaotic regime. We uncover the same ‘freeze’ and ‘chaos’ modes in Deep Deconvolutional Networks (DC-NNs). The ‘freeze’ regime is characterized by checkerboard patterns in the image space in addition to the constant modes in input space. Finally, we introduce a new NTK-based parametrization to eliminate border artifacts and we propose a layer-dependent learning rate to improve the convergence of DC-NNs. We illustrate our findings by training DCGANs using our setup. When trained in the ‘freeze’ regime, we see that the generator collapses to a checkerboard mode. We also demonstrate numerically that the generator collapse can be avoided and that good quality samples can be obtained, by tuning the nonlinearity to reach the ‘chaos’ regime (without using batch normalization).


Knowledge-incorporating ESIM models for Response Selection in Retrieval-based Dialog Systems

Goal-oriented dialog systems, which can be trained end-to-end without manually encoding domain-specific features, show tremendous promise in the customer support use-case e.g. flight booking, hotel reservation, technical support, student advising etc. These dialog systems must learn to interact with external domain knowledge to achieve the desired goal e.g. recommending courses to a student, booking a table at a restaurant etc. This paper presents extended Enhanced Sequential Inference Model (ESIM) models: a) K-ESIM (Knowledge-ESIM), which incorporates the external domain knowledge and b) T-ESIM (Targeted-ESIM), which leverages information from similar conversations to improve the prediction accuracy. Our proposed models and the baseline ESIM model are evaluated on the Ubuntu and Advising datasets in the Sentence Selection track of the latest Dialog System Technology Challenge (DSTC7), where the goal is to find the correct next utterance, given a partial conversation, from a set of candidates. Our preliminary results suggest that incorporating external knowledge sources and leveraging information from similar dialogs leads to performance improvements for predicting the next utterance.


Imitation-Projected Policy Gradient for Programmatic Reinforcement Learning

We present Imitation-Projected Policy Gradient (IPPG), an algorithmic framework for learning policies that are parsimoniously represented in a structured programming language. Such programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for programmatic policies remains a challenge. IPPG, our response to this challenge, is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a ‘lift-and-project’ perspective that takes a gradient step into the unconstrained policy space and then projects back onto the constrained space. Second, we view the unconstrained policy space as mixing neural and programmatic representations, which enables employing state-of-the-art deep policy gradient approaches. Third, we cast the projection step as program synthesis via imitation learning, and exploit contemporary combinatorial methods for this task. We present theoretical convergence results for IPPG, as well as an empirical evaluation in three continuous control domains. The experiments show that IPPG can significantly outperform state-of-the-art approaches for learning programmatic policies.


On the Optimality of Trees Generated by ID3

Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we analyze the heuristic of growing a decision tree with ID3 for a limited number of iterations t and given that nodes are split as in the case of exact information gain and probability computations. In several settings, we provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), produces trees with optimal or near-optimal test error among all trees with t internal nodes. We prove optimality in the case of learning conjunctions under product distributions and learning read-once DNFs with 2 terms under the uniform distribition. Using efficient dynamic programming algorithms, we empirically show that TopDown generates trees that are near-optimal (\sim \%1 difference from optimal test error) in a large number of settings for learning read-once DNFs under product distributions.


Analysis and Design of First-Order Distributed Optimization Algorithms over Time-Varying Graphs

This work concerns the analysis and design of distributed first-order optimization algorithms over time-varying graphs. The goal of such algorithms is to optimize a global function that is the average of local functions using only local computations and communications. Several different algorithms have been proposed that achieve linear convergence to the global optimum when the local functions are strongly convex. We provide a unified analysis that yields a worst-case linear convergence rate as a function of the condition number of the local functions, the spectral gap of the graph, and the parameters of the algorithm. The framework requires solving a small semidefinite program whose size is fixed; it does not depend on the number of local functions or the dimension of the domain. The result is a computationally efficient method for distributed algorithm analysis that enables the rapid comparison, selection, and tuning of algorithms. Finally, we propose a new algorithm, which we call SVL, that is easily implementable and achieves the fastest possible worst-case convergence rate among all algorithms in the family we considered. We support our theoretical analysis with numerical experiments that generate worst-case examples demonstrating the effectiveness of SVL.


Model based Level Shift Detection in Autocorrelated Data Streams using a moving window

Standard Control Chart techniques to detect level shift in data streams assume independence between observations. As data today is collected with high frequency, this assumption is seldom valid. To overcome this, we propose to adapt the off-line test procedure for detection of outliers based on one-step prediction errors proposed by Tsay (1988) into an on-line framework by considering a moving window. Further, we present two algorithms, that in combination, estimate an appropriate test value for our control chart. We test our method on AR(1) processes exposed to level shifts of different sizes and compare it to CUSUM applied to one-step prediction errors. We find that, even though both methods perform comparable when tuned correctly, our method has higher probability of identifying the correct change point of the process. Furthermore, for more complicated processes our method is easier to tune, as the range of window size to be tested is independent of the process.


Randomized Functional Sparse Tucker Tensor for Compression and Fast Visualization of Scientific Data

We propose a strategy to compress and store large volumes of scientific data represented on unstructured grids. Approaches utilizing tensor decompositions for data compression have already been proposed. Here, data on a structured grid is stored as a tensor which is then subjected to appropriate decomposition in suitable tensor formats. Such decompositions are based on generalization of singular value decomposition to tensors and capture essential features in the data with storage cost lower by orders of magnitude. However, tensor based data compression is limited by the fact that one can only consider scientific data represented on structured grids. In case of data on unstructured meshes, we propose to consider data as realizations of a function that is based on functional view of the tensor thus avoiding such limitations. The key is to efficiently estimate the parameters of the function whose complexity is small compared to the cardinality of the dataset (otherwise there is no compression). Here, we introduce the set of functional sparse Tucker tensors and propose a method to construct approximation in this set such that the resulting compact functional tensor can be rapidly evaluated to recover the original data. The compression procedure consists of three steps. In the first step, we consider a fraction of the original dataset for interpolation on a structured grid followed by sequentially truncated higher order singular value decomposition to get a compressed version of the interpolated data.We then fit singular vectors on a set of functional basis using sparse approximation to obtain corresponding functional sparse Tucker tensor representation. Finally, we re-evaluate the coefficients of this functional tensor using randomized least squares at a reduced computational complexity. This strategy leads to compression ratio of orders of magnitude on combustion simulation datasets.


Generalized Mutual Information

Mutual information is one of the essential building blocks of information theory. Yet, it is only finitely defined for distributions with fast decaying tails on a countable joint alphabet of two random elements. The unboundedness of mutual information over the general class of all distributions on a joint alphabet prevents its potential utility to be fully realized. This is in fact a void in the foundation of information theory that needs to be filled. This article proposes a family of generalized mutual information all of whose members 1) are finitely defined for each and every distribution of two random elements on a joint countable alphabet, except the one by Shannon, and 2) enjoy all utilities of a finite Shannon’s mutual information.


Artificial Intelligence as a Services (AI-aaS) on Software-Defined Infrastructure

This paper investigates a paradigm for offering artificial intelligence as a service (AI-aaS) on software-defined infrastructures (SDIs). The increasing complexity of networking and computing infrastructures is already driving the introduction of automation in networking and cloud computing management systems. Here we consider how these automation mechanisms can be leveraged to offer AI-aaS. Use cases for AI-aaS are easily found in addressing smart applications in sectors such as transportation, manufacturing, energy, water, air quality, and emissions. We propose an architectural scheme based on SDIs where each AI-aaS application is comprised of a monitoring, analysis, policy, execution plus knowledge (MAPE-K) loop (MKL). Each application is composed as one or more specific service chains embedded in SDI, some of which will include a Machine Learning (ML) pipeline. Our model includes a new training plane and an AI-aaS plane to deal with the model-development and operational phases of AI applications. We also consider the role of an ML/MKL sandbox in ensuring coherency and consistency in the operation of multiple parallel MKL loops. We present experimental measurement results for three AI-aaS applications deployed on the SAVI testbed: 1. Compressing monitored data in SDI using autoencoders; 2. Traffic monitoring to allocate CPUs resources to VNFs; and 3. Highway segment classification in smart transportation.


The Dynamic Embedded Topic Model

Topic modeling analyzes documents to learn meaningful patterns of words. Dynamic topic models capture how these patterns vary over time for a set of documents that were collected over a large time span. We develop the dynamic embedded topic model (D-ETM), a generative model of documents that combines dynamic latent Dirichlet allocation (D-LDA) and word embeddings. The D-ETM models each word with a categorical distribution whose parameter is given by the inner product between the word embedding and an embedding representation of its assigned topic at a particular time step. The word embeddings allow the D-ETM to generalize to rare words. The D-ETM learns smooth topic trajectories by defining a random walk prior over the embeddings of the topics. We fit the D-ETM using structured amortized variational inference. On a collection of United Nations debates, we find that the D-ETM learns interpretable topics and outperforms D-LDA in terms of both topic quality and predictive performance.


Faster Neural Network Training with Data Echoing

In the twilight of Moore’s law, GPUs and other specialized hardware accelerators have dramatically sped up neural network training. However, earlier stages of the training pipeline, such as disk I/O and data preprocessing, do not run on accelerators. As accelerators continue to improve, these earlier stages will increasingly become the bottleneck. In this paper, we introduce ‘data echoing,’ which reduces the total computation used by earlier pipeline stages and speeds up training whenever computation upstream from accelerators dominates the training time. Data echoing reuses (or ‘echoes’) intermediate outputs from earlier pipeline stages in order to reclaim idle capacity. We investigate the behavior of different data echoing algorithms on various workloads, for various amounts of echoing, and for various batch sizes. We find that in all settings, at least one data echoing algorithm can match the baseline’s predictive performance using less upstream computation. In some cases, data echoing can even compensate for a 4x slower input pipeline.


ScenarioSA: A Large Scale Conversational Database for Interactive Sentiment Analysis

Interactive sentiment analysis is an emerging, yet challenging, subtask of the sentiment analysis problem. It aims to discover the affective state and sentimental change of each person in a conversation. Existing sentiment analysis approaches are insufficient in modelling the interactions among people. However, the development of new approaches are critically limited by the lack of labelled interactive sentiment datasets. In this paper, we present a new conversational emotion database that we have created and made publically available, namely ScenarioSA. We manually label 2,214 multi-turn English conversations collected from natural contexts. In comparison with existing sentiment datasets, ScenarioSA (1) covers a wide range of scenarios; (2) describes the interactions between two speakers; and (3) reflects the sentimental evolution of each speaker over the course of a conversation. Finally, we evaluate various state-of-the-art algorithms on ScenarioSA, demonstrating the need of novel interactive sentiment analysis models and the potential of ScenarioSA to facilitate the development of such models.


A Quantum-inspired Classical Algorithm for Separable Non-negative Matrix Factorization

Non-negative Matrix Factorization (NMF) asks to decompose a (entry-wise) non-negative matrix into the product of two smaller-sized nonnegative matrices, which has been shown intractable in general. In order to overcome this issue, the separability assumption is introduced which assumes all data points are in a conical hull. This assumption makes NMF tractable and is widely used in text analysis and image processing, but still impractical for huge-scale datasets. In this paper, inspired by recent development on dequantizing techniques, we propose a new classical algorithm for separable NMF problem. Our new algorithm runs in polynomial time in the rank and logarithmic in the size of input matrices, which achieves an exponential speedup in the low-rank setting.


R-Transformer: Recurrent Neural Network Enhanced Transformer

Recurrent Neural Networks have long been the dominating choice for sequence modeling. However, it severely suffers from two issues: impotent in capturing very long-term dependencies and unable to parallelize the sequential computation procedure. Therefore, many non-recurrent sequence models that are built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack necessary components to model local structures in sequences and heavily rely on position embeddings that have limited effects and require a considerable amount of design efforts. In this paper, we propose the R-Transformer which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoids their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code publicly available at \url{https://…/R-transformer}.


Neural News Recommendation with Attentive Multi-View Learning

Personalized news recommendation is very important for online news platforms to help users find interested news and improve user experience. News and user representation learning is critical for news recommendation. Existing news recommendation methods usually learn these representations based on single news information, e.g., title, which may be insufficient. In this paper we propose a neural news recommendation approach which can learn informative representations of users and news by exploiting different kinds of news information. The core of our approach is a news encoder and a user encoder. In the news encoder we propose an attentive multi-view learning model to learn unified news representations from titles, bodies and topic categories by regarding them as different views of news. In addition, we apply both word-level and view-level attention mechanism to news encoder to select important words and views for learning informative news representations. In the user encoder we learn the representations of users based on their browsed news and apply attention mechanism to select informative news for user representation learning. Extensive experiments on a real-world dataset show our approach can effectively improve the performance of news recommendation.


Can Bayes Factors ‘Prove’ the Null Hypothesis?

It is possible to obtain a large Bayes Factor (BF) favoring the null hypothesis when both the null and alternative hypotheses have low likelihoods, and there are other hypotheses being ignored that are much more strongly supported by the data. As sample sizes become large it becomes increasingly probable that a strong BF favouring a point null against a conventional Bayesian vague alternative co-occurs with a BF favouring various specific alternatives against the null. For any BF threshold q and sample mean, there is a value n such that sample sizes larger than n guarantee that although the BF comparing H0 against a conventional (vague) alternative exceeds q, nevertheless for some range of hypothetical {\mu}, a BF comparing H0 against {\mu} in that range falls below 1/q. This paper discusses the conditions under which this conundrum occurs and investigates methods for resolving it.


A Versatile Queuing System For Sharing Economy Platform Operations

The paper deals with a sharing economy system with various management factors by using a bulk input G/M/1 type queuing model. The effective management of operating costs is vital for controlling the sharing economy platform and this research builds the theoretical background to understand the sharing economy business model. Analytically, the techniques include a classical Markov process of the single channel queueing system, semi-Markov process and semi-regenerative process. It uses the stochastic congruent properties to find the probability distribution of the number of contractors in the sharing economy platform. The obtained explicit formulas demonstrate the usage of functional for the main stochastic characteristics including sharing expenses due to over contracted resources and optimization of their objective function.


Sparsely Activated Networks

Previous literature on unsupervised learning focused on designing structural priors and optimization functions with the aim of learning meaningful features, but without considering the description length of the representations. Here we present Sparsely Activated Networks (SANs), which decompose their input as a sum of sparsely reoccurring patterns of varying amplitude, and combined with a newly proposed metric \varphi they learn representations with minimal description lengths. SANs consist of kernels with shared weights that during encoding are convolved with the input and then passed through a ReLU and a sparse activation function. During decoding, the same weights are convolved with the sparse activation map and the individual reconstructions from each weight are summed to reconstruct the input. We also propose a metric \varphi for model selection that favors models which combine high compression ratio and low reconstruction error and we justify its definition by exploring the hyperparameter space of SANs. We compare four sparse activation functions (Identity, Max-Activations, Max-Pool indices, Peaks) on a variety of datasets and show that SANs learn interpretable kernels that combined with \varphi, they minimize the description length of the representations.


Towards Probabilistic Generative Models Harnessing Graph Neural Networks for Disease-Gene Prediction

Disease-gene prediction (DGP) refers to the computational challenge of predicting associations between genes and diseases. Effective solutions to the DGP problem have the potential to accelerate the therapeutic development pipeline at early stages via efficient prioritization of candidate genes for various diseases. In this work, we introduce the variational graph auto-encoder (VGAE) as a promising unsupervised approach for learning powerful latent embeddings in disease-gene networks that can be used for the DGP problem, the first approach using a generative model involving graph neural networks (GNNs). In addition to introducing the VGAE as a promising approach to the DGP problem, we further propose an extension (constrained-VGAE or C-VGAE) which adapts the learning algorithm for link prediction between two distinct node types in heterogeneous graphs. We evaluate and demonstrate the effectiveness of the VGAE on general link prediction in a disease-gene association network and the C-VGAE on disease-gene prediction in the same network, using popular random walk driven methods as baselines. While the methodology presented demonstrates potential solely based on utilizing the topology of a disease-gene association network, it can be further enhanced and explored through the integration of additional biological networks such as gene/protein interaction networks and additional biological features pertaining to the diseases and genes represented in the disease-gene association network.


Automatic Generation of Atomic Consistency Preserving Search Operators for Search-Based Model Engineering

Recently there has been increased interest in combining the fields of Model-Driven Engineering (MDE) and Search-Based Software Engineering (SBSE). Such approaches use meta-heuristic search guided by search operators (model mutators and sometimes breeders) implemented as model transformations. The design of these operators can substantially impact the effectiveness and efficiency of the meta-heuristic search. Currently, designing search operators is left to the person specifying the optimisation problem. However, developing consistent and efficient search-operator rules requires not only domain expertise but also in-depth knowledge about optimisation, which makes the use of model-based meta-heuristic search challenging and expensive. In this paper, we propose a generalised approach to automatically generate atomic consistency preserving search operators (aCPSOs) for a given optimisation problem. This reduces the effort required to specify an optimisation problem and shields optimisation users from the complexity of implementing efficient meta-heuristic search mutation operators. We evaluate our approach with a set of case studies, and show that the automatically generated rules are comparable to, and in some cases better than, manually created rules at guiding evolutionary search towards near-optimal solutions. This paper is an extended version of the paper with the same title published in the proceedings of the 22nd International Conference on Model Driven Engineering Languages and Systems (MODELS ’19).


Saliency Maps Generation for Automatic Text Summarization

Saliency map generation techniques are at the forefront of explainable AI literature for a broad range of machine learning applications. Our goal is to question the limits of these approaches on more complex tasks. In this paper we apply Layer-Wise Relevance Propagation (LRP) to a sequence-to-sequence attention model trained on a text summarization dataset. We obtain unexpected saliency maps and discuss the rightfulness of these ‘explanations’. We argue that we need a quantitative way of testing the counterfactual case to judge the truthfulness of the saliency maps. We suggest a protocol to check the validity of the importance attributed to the input and show that the saliency maps obtained sometimes capture the real use of the input features by the network, and sometimes do not. We use this example to discuss how careful we need to be when accepting them as explanation.


Justifying Diagnosis Decisions by Deep Neural Networks

An integrated approach is proposed across visual and textual data to both determine and justify a medical diagnosis by a neural network. As deep learning techniques improve, interest grows to apply them in medical applications. To enable a transition to workflows in a medical context that are aided by machine learning, the need exists for such algorithms to help justify the obtained outcome so human clinicians can judge their validity. In this work, deep learning methods are used to map a frontal X-Ray image to a continuous textual representation. This textual representation is decoded into a diagnosis and the associated textual justification that will help a clinician evaluate the outcome. Additionally, more explanatory data is provided for the diagnosis by generating a realistic X-Ray that belongs to the nearest alternative diagnosis. With a clinical expert opinion study on a subset of the X-Ray data set from the Indiana University hospital network, we demonstrate that our justification mechanism significantly outperforms existing methods that use saliency maps. While performing multi-task training with multiple loss functions, our method achieves excellent diagnosis accuracy and captioning quality when compared to current state-of-the-art single-task methods.


Virtual Adversarial Lipschitz Regularization

Generative adversarial networks (GANs) are one of the most popular approaches when it comes to training generative models, among which variants of Wasserstein GANs are considered superior to the standard GAN formulation in terms of learning stability and sample quality. However, Wasserstein GANs require the critic to be K-Lipschitz, which is often enforced implicitly by penalizing the norm of its gradient, or by globally restricting its Lipschitz constant via weight normalization techniques. Training with a regularization term penalizing the violation of the Lipschitz constraint explicitly, instead of through the norm of the gradient, was found to be practically infeasible in most situations. With a novel generalization of Virtual Adversarial Training, called Virtual Adversarial Lipschitz Regularization, we show that using an explicit Lipschitz penalty is indeed viable and leads to state-of-the-art performance in terms of Inception Score and Fr\’echet Inception Distance when applied to Wasserstein GANs trained on CIFAR-10.


Deep network as memory space: complexity, generalization, disentangled representation and interpretability

By bridging deep networks and physics, the programme of geometrization of deep networks was proposed as a framework for the interpretability of deep learning systems. Following this programme we can apply two key ideas of physics, the geometrization of physics and the least action principle, on deep networks and deliver a new picture of deep networks: deep networks as memory space of information, where the capacity, robustness and efficiency of the memory are closely related with the complexity, generalization and disentanglement of deep networks. The key components of this understanding include:(1) a Fisher metric based formulation of the network complexity; (2)the least action (complexity=action) principle on deep networks and (3)the geometry built on deep network configurations. We will show how this picture will bring us a new understanding of the interpretability of deep learning systems.


PC-DARTS: Partial Channel Connections for Memory-Efficient Differentiable Architecture Search

Differentiable architecture search (DARTS) provided a fast solution in finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-net and search for an optimal architecture. In this paper, we present a novel approach, namely Partially-Connected DARTS, by sampling a small part of super-net to reduce the redundancy in network space, thereby performing a more efficient search without comprising the performance. In particular, we perform operation search in a subset of channels and leave the held out part unchanged. This strategy may suffer from an undesired inconsistency on selecting the edges of super-net caused by the sampling of different channels. We solve it by introducing edge normalization, which adds a new set of edge-level hyper-parameters during search to reduce uncertainty in search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2:57% on CIFAR10 within merely 0:1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24:2% on ImageNet (under the mobile setting) within 3.8 GPU-days for search. We have made our code available: https://…/PC-DARTS.


Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

Current state-of-the-art methods for image segmentation form a dense image representation where the color, shape and texture information are all processed together inside a deep CNN. This however may not be ideal as they contain very different type of information relevant for recognition. Here, we propose a new two-stream CNN architecture for semantic segmentation that explicitly wires shape information as a separate processing branch, i.e. shape stream, that processes information in parallel to the classical stream. Key to this architecture is a new type of gates that connect the intermediate layers of the two streams. Specifically, we use the higher-level activations in the classical stream to gate the lower-level activations in the shape stream, effectively removing noise and helping the shape stream to only focus on processing the relevant boundary-related information. This enables us to use a very shallow architecture for the shape stream that operates on the image-level resolution. Our experiments show that this leads to a highly effective architecture that produces sharper predictions around object boundaries and significantly boosts performance on thinner and smaller objects. Our method achieves state-of-the-art performance on the Cityscapes benchmark, in terms of both mask (mIoU) and boundary (F-score) quality, improving by 2% and 4% over strong baselines.


Semi-Supervised Graph Embedding for Multi-Label Graph Node Classification

The graph convolution network (GCN) is a widely-used facility to realize graph-based semi-supervised learning, which usually integrates node features and graph topologic information to build learning models. However, as for multi-label learning tasks, the supervision part of GCN simply minimizes the cross-entropy loss between the last layer outputs and the ground-truth label distribution, which tends to lose some useful information such as label correlations, so that prevents from obtaining high performance. In this paper, we pro-pose a novel GCN-based semi-supervised learning approach for multi-label classification, namely ML-GCN. ML-GCN first uses a GCN to embed the node features and graph topologic information. Then, it randomly generates a label matrix, where each row (i.e., label vector) represents a kind of labels. The dimension of the label vector is the same as that of the node vector before the last convolution operation of GCN. That is, all labels and nodes are embedded in a uniform vector space. Finally, during the ML-GCN model training, label vectors and node vectors are concatenated to serve as the inputs of the relaxed skip-gram model to detect the node-label correlation as well as the label-label correlation. Experimental results on several graph classification datasets show that the proposed ML-GCN outperforms four state-of-the-art methods.