• On Consensus-Optimality Trade-offs in Collaborative Deep Learning
• Critical and minimal connectivity of power graphs of finite groups
• Amnestic Forgery: an Ontology of Conceptual Metaphors
• Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
• Comparative analysis of the structures and outcomes of geophysical flow models and modeling assumptions using uncertainty quantification
• Context-aware Cascade Attention-based RNN for Video Emotion Recognition
• Social Signals in the Ethereum Trading Network
• Automorphism groups of designs with $λ=1$
• PID2018 Benchmark Challenge: learning feedforward control
• CuisineNet: Food Attributes Classification using Multi-scale Convolution Network
• Automatic, fast and robust characterization of noise distributions for diffusion MRI
• Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning
• A Robust and Effective Approach Towards Accurate Metastasis Detection and pN-stage Classification in Breast Cancer
• A four vertex theorem for frieze patterns
• Stochastic Deep Compressive Sensing for the Reconstruction of Diffusion Tensor Cardiac MRI
• Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition
• On Theorem 6 in ‘Relative Entropy and the Multivariable Multidimensional Moment Problem’ [Mar 2006 1052-1066]
• Connectedness of the Cross-Join Graph of de Bruijn Sequences
• The Aldous chain on cladograms in the diffusion limit
• Unwinding the model manifold: choosing similarity measures to remove local minima in sloppy dynamical systems
• Graph Sparsification, Spectral Sketches, and Faster Resistance Computation, via Short Cycle Decompositions
• Robust Place Categorization with Deep Domain Generalization
• End-to-end named entity extraction from speech
• Predicting County Level Corn Yields Using Deep Long Short Term Memory Models
• Polynomial Factorization Is Simple and Helpful — More So Than It Seems to Be
• BUNDLEP: Prioritizing Conflict Free Regions in Multi-Threaded Programs to Improve Cache Reuse — Extended Results and Technical Report
• Optimal dividends with partial information and stopping of a degenerate reflecting diffusion
• Identifying and Understanding User Reactions to Deceptive and Trusted Social News Sources
• On short expressions for cosets of permutation subgroups
• Privacy Aware Offloading of Deep Neural Networks
• On $q$-ratio CMSV for sparse recovery
• Generalizing to Unseen Domains via Adversarial Data Augmentation
• Optimal Placement of Baseband Functions for Energy Harvesting Virtual Small Cells
• Beam Discovery Using Linear Block Codes for Millimeter Wave Communication Networks
• Why Is My Classifier Discriminatory?
• Reference-free Calibration in Sensor Networks
• l0-norm Based Centers Selection for Failure Tolerant RBF Networks
• Automatic generation of object shapes with desired functionalities
• MolGAN: An implicit generative model for small molecular graphs
• Two-stage Method for Millimeter Wave Channel Estimation
• Automatic Large-Scale Data Acquisition via Crowdsourcing for Crosswalk Classification: A Deep Learning Approach
• Short-term Load Forecasting with Deep Residual Networks
• Adjacency and Tensor Representation in General Hypergraphs. Part 2: Multisets, Hb-graphs and Related e-adjacency Tensors
• Fast L1-Minimization Algorithm for Sparse Approximation Based on an Improved LPNN-LCA framework
• Two-stage Method for the Reconstruction of a Low-Rank Matrix
• A Lagrangian Dual Based Approach to Sparse Linear Programming
• Well-posedness of Stochastic 3D Leray-$α$ Model with Fractional Dissipation
• Character-Level Models versus Morphology in Semantic Role Labeling
• Matrix-free multigrid block-preconditioners for higher order Discontinuous Galerkin discretisations
• Learning to Generate Facial Depth Maps
• Square-free Groebner degenerations
• Anonymous Walk Embeddings
• Multiple Manifolds Metric Learning with Application to Image Set Classification
• On the Spectrum of Random Features Maps of High Dimensional Data
• Iterative Antenna Selection for Secrecy Enhancement in Massive MIMO Wiretap Channels
• Propagating Confidences through CNNs for Sparse Data Regression
• Needle Tip Force Estimation using an OCT Fiber and a Fused convGRU-CNN Architecture
• Quantitative approach to multifractality induced by correlations and broad distribution of data
• Who Learns Better Bayesian Network Structures: Constraint-Based, Score-based or Hybrid Algorithms?
• Estimation of seasonal long-memory parameters
• Q-Graph: Preserving Query Locality in Multi-Query Graph Processing
• Capacity bounds for bandlimited Gaussian channels with peak-to-average-power-ratio constraint
• Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance
• RLS Recovery with Asymmetric Penalty: Fundamental Limits and Algorithmic Approaches
• Theoretical Bounds on MAP Estimation in Distributed Sensing Networks
• Multi-Message Private Information Retrieval with Private Side Information
• Orientable arithmetic matroids
• DATA:SEARCH’18 — Searching Data on the Web
• Energy-Efficient Caching for Scalable Videos in Heterogeneous Networks
• The One-Shot Crowdfunding Game
• Multidimensional free-mobility equilibrium: Tiebout revisited
• A Corpus of English-Hindi Code-Mixed Tweets for Sarcasm Detection
• An English-Hindi Code-Mixed Corpus: Stance Annotation and Baseline System
• Using Inter-Sentence Diverse Beam Search to Reduce Redundancy in Visual Storytelling
• Space-Efficient DFS and Applications: Simpler, Leaner, Faster
• Foresee: Attentive Future Projections of Chaotic Road Environments with Online Training
• Resilience Control of DC Shipboard Power Systems
• Invariance pressure of control sets
• RUN: Residual U-Net for Computer-Aided Detection of Pulmonary Nodules without Candidate Selection
• ADAGIO: Interactive Experimentation with Adversarial Attack and Defense for Audio
• Neural Joking Machine: Humorous image captioning
• An Information-Theoretic Analysis of Thompson Sampling for Large Action Spaces
• Learning multiple non-mutually-exclusive tasks for improved classification of inherently ordered labels
• A Radial Basis Function based Optimization Algorithm with Regular Simplex set geometry in Ellipsoidal Trust-Regions
• Anaphora and Coreference Resolution: A Review
• Generic CP-Supported CMSA for Binary Integer Linear Programs
• Visual Referring Expression Recognition: What Do Systems Actually Learn?
• Enabling Pedestrian Safety using Computer Vision Techniques: A Case Study of the 2018 Uber Inc. Self-driving Car Crash
• The VIREO KIS at VBS 2018
• Stochastic Zeroth-order Optimization via Variance Reduction method
• A Markov Chain Model for the Cure Rate of Non-Performing Loans
• New Bounds for the Signless Laplacian Spread
• CRRN: Multi-Scale Guided Concurrent Reflection Removal Network
• Long short-term memory networks in memristor crossbars
• Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist
• Automated proof synthesis for propositional logic with deep neural networks
• Quantum correlations and entanglement in a Kitaev-type spin chain
• Infinite Arms Bandit: Optimality via Confidence Bounds
• Tight Regret Bounds for Bayesian Optimization in One Dimension
• A Fine-to-Coarse Convolutional Neural Network for 3D Human Action Recognition
• Multi-function Convolutional Neural Networks for Improving Image Classification Performance
• Hyperspectral Imaging Technology and Transfer Learning Utilized in Identification Haploid Maize Seeds
• Critical Exponent of the Anderson Transition using Massively Parallel Supercomputing
• Detecting Data Leakage from Databases on Android Apps with Concept Drift
• Cellular Controlled Cooperative Unmanned Aerial Vehicle Networks with Sense-and-Send Protocol
• Object Detection using Domain Randomization and Generative Adversarial Refinement of Synthetic Images
• Efficient Sequential and Parallel Algorithms for Estimating Higher Order Spectra
• Planning, Inference and Pragmatics in Sequential Language Games
• Autonomous Vehicles that Interact with Pedestrians: A Survey of Theory and Practice
• Critical point for infinite cycles in a random loop model on trees
• AutoZOOM: Autoencoder-based Zeroth Order Optimization Method for Attacking Black-box Neural Networks
• Fast Incremental von Neumann Graph Entropy Computation: Theory, Algorithm, and Applications
• ‘Press Space to Fire’: Automatic Video Game Tutorial Generation
• A Geometric Property of Relative Entropy and the Universal Threshold Phenomenon for Binary-Input Channels with Noisy State Information at the Encoder
• Adversarial Learning of Task-Oriented Neural Dialog Models
• Optimal Testing in the Experiment-rich Regime
• Multi-turn Dialogue Response Generation in an Adversarial Learning Framework
• On seeking efficient Pareto optimal points in multi-player minimum cost flow problems with application to transportation systems
• Unsupervised Text Style Transfer using Language Models as Discriminators
• Sublinear decoding schemes for non-adaptive group testing with inhibitors
• Bayesian Estimations for Diagonalizable Bilinear SPDEs
• Semantic Road Layout Understanding by Generative Adversarial Inpainting
• Superpixel-enhanced Pairwise Conditional Random Field for Semantic Segmentation
• A study on prefixes of $c_2$ invariants
• Inexact Stochastic Mirror Descent for two-stage nonlinear stochastic programs
• HeadOn: Real-time Reenactment of Human Portrait Videos
• Characterizing Energy Efficiency of Wireless Transmission for Green Internet of Things: A Data-Oriented Approach
• Fairness and Sum-Rate Maximization via Joint Channel and Power Allocation in Uplink SCMA Networks
• The Age of Updates in a Simple Relay Network
• Bottom-up approach to torus bifurcation in neuron models
• Deep Mesh Projectors for Inverse Problems
• Deep Video Portraits
• Depth and nonlinearity induce implicit exploration for RL
• Active and Adaptive Sequential learning
• Deep Semantic Architecture with discriminative feature visualization for neuroimage analysis
• Regularization of time-varying covariance matrices using linear stochastic systems
• Regularization of covariance matrices on Riemannian manifolds using linear systems
• Coded Computation Against Distributed Straggling Decoders for Gaussian Channels in C-RAN
• A projected primal-dual splitting for solving constrained monotone inclusions
• Can DNNs Learn to Lipread Full Sentences?
• On Visibility Problems with an Infinite Discrete, set of Obstacles
• Simulation of particle systems interacting through hitting times
• Why Botnets Work: Distributed Brute-Force Attacks Need No Synchronization
• A Unified Particle-Optimization Framework for Scalable Bayesian Sampling
• A doubly stochastic enhancement of the Failure Forecast Method using a noisy mean-reverting process
• Sign matrix polytopes from Young tableaux
• Optimal Bidding, Allocation and Budget Spending for a Demand Side Platform Under Many Auction Types
• Classifying Rotationally-Closed Languages Having Greedy Universal Cycles
• K-Beam Subgradient Descent for Minimax Optimization
• Diagnosing Glaucoma Progression with Visual Field Data Using a Spatiotemporal Boundary Detection Method
• Continuity Of Pontryagin Extremals With Respect To Delays In Nonlinear Optimal Control
• Entropy-controlled Last-Passage Percolation
• A law of large numbers for the range of rotor walks on periodic trees
• Long Short-Term Memory Networks for CSI300 Volatility Prediction with Baidu Search Volume
• On a sufficient condition for a Fano manifold to be covered by rational $N$-folds
• Duopoly Investment Problems with Minimally Bounded Adjustment Costs
• Algebraic Expression of Spatial and Temporal Pattern
• Deep Learning for Topological Invariants
• Biologically Motivated Algorithms for Propagating Local Target Representations
• Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network
• Dynamic Advisor-Based Ensemble (dynABE): Case Study in Stock Trend Prediction of a Major Critical Metal Producer
Future 5G wireless networks will rely on agile and automated network management, where the usage of diverse resources must be jointly optimized with surgical accuracy. A number of key wireless network functionalities (e.g., traffic steering, energy savings) give rise to hard optimization problems. What is more, high spatio-temporal traffic variability, coupled with the need to satisfy strict per slice/service SLAs in modern networks, suggests that these problems must be constantly (re-)solved to maintain close-to-optimal performance. To this end, in this paper we propose the framework of Online Network Optimization (ONO), which seeks to maintain both agile and efficient control over time, using an arsenal of data-driven, adaptive, and AI-based techniques. Since the mathematical tools and the studied regimes vary widely among these methodologies, a theoretical comparison is often out of reach. Therefore, the important question ‘what is the right ONO technique?’ remains open to date. In this paper, we discuss the pros and cons of each technique and further attempt a direct quantitative comparison for a specific use case, using real data. Our results suggest that carefully combining the insights of problem modeling with state-of-the-art AI techniques provides significant advantages at reasonable complexity.
Modern information systems often collect raw data in the form of text, images, video, and sensor readings. Such data needs to be further interpreted/enriched prior to being analyzed. Enrichment is often a result of automated machine learning and/or signal processing techniques that associate appropriate but uncertain tags with the data. Traditionally, with the notable exception of a few systems, enrichment is considered to be a separate pre-processing step performed independently prior to data analysis. Such an approach is becoming increasingly infeasible since modern data capture technologies enable the creation of very large data collections for which it is computationally difficult/impossible and ultimately not beneficial to derive all tags as a preprocessing step. Hence, approaches that perform tagging at query/analysis time on the data of interest need to be considered. This paper explores the problem of joint tagging and query processing. In particular, the paper considers a scenario where tagging can be performed using several techniques that differ in cost and accuracy and develops a progressive approach to answering queries (SPJ queries with a restricted version of join) that enriches the right data to the right degree so as to maximize the quality of the query results. The experimental results show that the proposed approach performs significantly better than baseline approaches.
Ensuring that all supposedly valid configurations of a software product line (SPL) lead to well-formed and acceptable products is challenging since it is most of the time impractical to enumerate and test all individual products of an SPL. Machine learning classifiers have been recently used to predict the acceptability of products associated with unseen configurations. For some configurations, a tiny change in their feature values can make them pass from acceptable to non-acceptable regarding users’ requirements and vice-versa. In this paper, we introduce the idea of leveraging these specific configurations and their positions in the feature space to improve the classifier and therefore the engineering of an SPL. Starting from a variability model, we propose to use Adversarial Machine Learning techniques to create new, adversarial configurations out of already known configurations by modifying their feature values. Using an industrial video generator we show how adversarial configurations can improve not only the classifier, but also the variability model, the variability implementation, and the testing oracle.
With the emergence of Web 2.0, tag recommenders have become important tools, which aim to support users in finding descriptive tags for their bookmarked resources. Although current algorithms provide good results in terms of tag prediction accuracy, they are often designed in a data-driven way and thus, lack a thorough understanding of the cognitive processes that play a role when people assign tags to resources. This thesis aims at modeling these cognitive dynamics in social tagging in order to improve tag recommendations and to better understand the underlying processes. As a first attempt in this direction, we have implemented an interplay between individual micro-level (e.g., categorizing resources or temporal dynamics) and collective macro-level (e.g., imitating other users’ tags) processes in the form of a novel tag recommender algorithm. The preliminary results for datasets gathered from BibSonomy, CiteULike and Delicious show that our proposed approach can outperform current state-of-the-art algorithms, such as Collaborative Filtering, FolkRank or Pairwise Interaction Tensor Factorization. We conclude that recommender systems can be improved by incorporating related principles of human cognition.
Control of complex systems involves both system identification and controller design. Deep neural networks have proven to be successful in many identification tasks, such as classification, prediction, and end-to-end system modeling. However, from the controller design perspective, these networks are difficult to work with because they are typically nonlinear and nonconvex. Therefore many systems are still optimized and controlled based on simple linear models despite their poor identification performance. In this paper we address this problem by explicitly constructing deep neural networks that are convex with respect to their inputs. We show that these input convex networks can be trained to obtain accurate models of complex physical systems. In particular, we design input convex recurrent neural networks to capture temporal behavior of dynamical systems. Then optimal controllers based on these networks can be designed by solving convex optimization problems. Results on both toy models and real-world image denoising and building energy optimization problems demonstrate the modeling accuracy and control efficiency of the proposed approach.
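As an illustration of what "convex with respect to their inputs" can mean in practice, here is a minimal sketch of a tiny feed-forward network whose output is convex in its input. It relies on a standard construction (ReLU hidden units combined with non-negative output weights, plus an unconstrained linear passthrough of the input); the abstract does not spell out the authors' architecture, so the layer layout and the names `make_icnn` and `icnn` are assumptions made for this example only.

```python
import random


def relu(values):
    # ReLU is convex and non-decreasing, which is what the construction needs.
    return [max(0.0, v) for v in values]


def affine(weights, x, bias):
    # Plain matrix-vector product plus bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_icnn(n_in, n_hidden, seed=0):
    # Hypothetical parameter layout for this sketch.
    rng = random.Random(seed)
    w0 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b0 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    # Weights applied to hidden units are forced to be non-negative.
    wz = [abs(rng.uniform(-1, 1)) for _ in range(n_hidden)]
    # Passthrough weights on the raw input may have any sign (affine term).
    wx = [rng.uniform(-1, 1) for _ in range(n_in)]
    b1 = rng.uniform(-1, 1)
    return w0, b0, wz, wx, b1


def icnn(x, params):
    w0, b0, wz, wx, b1 = params
    hidden = relu(affine(w0, x, b0))           # each unit is convex in x
    convex_part = sum(w * h for w, h in zip(wz, hidden))
    affine_part = sum(w * xi for w, xi in zip(wx, x)) + b1
    return convex_part + affine_part           # convex + affine = convex
```

Because each hidden unit is a maximum of an affine function and zero, and the output is a non-negative combination of those units plus an affine term, the scalar output satisfies the convexity inequality for any pair of inputs. That property is what would let a downstream controller be posed as a convex optimization problem over the inputs.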
In many domains, the previous decade was characterized by increasing data volumes and growing complexity of computational workloads, creating new demands for highly data-parallel computing in distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g., for scheduling and resource allocation. We survey predictive performance modeling (PPM) approaches to estimate performance metrics such as execution duration, required memory or wait times of future jobs and tasks based on past performance observations. We focus on non-intrusive methods, i.e., methods that can be applied to any workload without modification, since the workload is usually a black-box from the perspective of the systems managing the computational infrastructure. We classify and compare sources of performance variation, predicted performance metrics, required training data, use cases, and the underlying prediction techniques. We conclude by identifying several open problems and pressing research needs in the field.
Today, organizations typically perform tedious and costly tasks to juggle their code and data across different data processing platforms. Addressing this pain and achieving automatic cross-platform data processing is challenging because it requires considerable expertise in all the available data processing platforms. In this report, we present Rheem, a general-purpose cross-platform data processing system that relieves users of the pain of finding the most efficient data processing platform for a given task. It also splits a task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). To offer cross-platform functionality, it features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Rheem is released under an open source license.
The adoption of machine learning in high-stakes applications such as healthcare and law has lagged in part because predictions are not accompanied by explanations comprehensible to the domain user, who often holds ultimate responsibility for decisions and outcomes. In this paper, we propose an approach to generate such explanations in which training data is augmented to include, in addition to features and labels, explanations elicited from domain users. A joint model is then learned to produce both labels and explanations from the input features. This simple idea ensures that explanations are tailored to the complexity expectations and domain knowledge of the consumer. Evaluation spans multiple modeling techniques on a simple game dataset, an image dataset, and a chemical odor dataset, showing that our approach is generalizable across domains and algorithms. Results demonstrate that meaningful explanations can be reliably taught to machine learning algorithms, and in some cases, improve modeling accuracy.
In this proposal we present the ideas of a ‘macro recommender system’ and a ‘micro recommender system’. Both can be considered recommender systems for recommendation algorithms. A macro recommender system recommends the best performing recommendation algorithm to an organization that wants to build a recommender system. This way, an organization does not need to test many algorithms over long periods to find the best one for their particular platform. A micro recommender system recommends the best performing recommendation algorithm for each individual recommendation request. This proposal is based on the premise that there is no single-best algorithm for all users, items, and contexts. For instance, a micro recommender system might recommend one algorithm when recommendations for an elderly male user in the evening should be created. When recommendations for a young female user in the morning should be given, the micro recommender system might recommend a different algorithm.
This paper describes the submissions of the ‘Marian’ team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we create a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier for this shared task.
Deep neural networks (DNNs) have become the state-of-the-art technique for machine learning tasks in various applications. However, due to their size and the computational complexity, large DNNs are not readily deployable on edge devices in real-time. To manage complexity and accelerate computation, network compression techniques based on pruning and quantization have been proposed and shown to be effective in reducing network size. However, such network compression can result in irregular matrix structures that are mismatched with modern hardware-accelerated platforms, such as graphics processing units (GPUs) designed to perform the DNN matrix multiplications in a structured (block-based) way. We propose MPDCompress, a DNN compression algorithm based on matrix permutation decomposition via random mask generation. In-training application of the masks molds the synaptic weight connection matrix to a sub-graph separation format. Aided by the random permutations, a hardware-desirable block matrix is generated, allowing for a more efficient implementation and compression of the network. To show versatility, we empirically verify MPDCompress on several network models, compression rates, and image datasets. On the LeNet 300-100 model (MNIST dataset), Deep MNIST, and CIFAR10, we achieve 10 X network compression with less than 1% accuracy loss compared to non-compressed accuracy performance. On AlexNet for the full ImageNet ILSVRC-2012 dataset, we achieve 8 X network compression with less than 1% accuracy loss, with top-5 and top-1 accuracies of 79.6% and 56.4%, respectively. Finally, we observe that the algorithm can offer inference speedups across various hardware platforms, with 4 X faster operation achieved on several mobile GPUs.
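The idea of a randomly permuted mask that hides a block-diagonal structure can be sketched as follows. This is only an illustrative reading of the abstract, not the MPDCompress algorithm itself: the function names, the requirement that both dimensions divide evenly by the number of blocks, and the way the permutations are applied are all assumptions.

```python
import random


def permuted_block_mask(n_rows, n_cols, n_blocks, seed=0):
    """Binary mask that is block-diagonal up to row/column permutations.

    Assumes n_rows and n_cols are both divisible by n_blocks.
    """
    rng = random.Random(seed)
    row_perm = list(range(n_rows))
    col_perm = list(range(n_cols))
    rng.shuffle(row_perm)
    rng.shuffle(col_perm)
    rows_per_block = n_rows // n_blocks
    cols_per_block = n_cols // n_blocks
    # Entry (i, j) is kept when its permuted row and column fall in the
    # same diagonal block.
    mask = [[1 if row_perm[i] // rows_per_block == col_perm[j] // cols_per_block
             else 0
             for j in range(n_cols)]
            for i in range(n_rows)]
    return mask, row_perm, col_perm


def unpermute(mask, row_perm, col_perm):
    """Undo the permutations to expose the dense diagonal blocks."""
    n_rows, n_cols = len(mask), len(mask[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            out[row_perm[i]][col_perm[j]] = mask[i][j]
    return out
```

With `n_blocks` blocks, only a fraction `1 / n_blocks` of the entries survive, so a mask with 10 blocks would correspond to roughly tenfold compression of that layer. The masked weights look scattered during training, yet they can be rearranged into dense blocks of the kind that block-oriented hardware handles well.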
Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.
We introduce Regularized Kernel and Neural Sobolev Descent for transporting a source distribution to a target distribution along smooth paths of minimum kinetic energy (defined by the Sobolev discrepancy), related to dynamic optimal transport. In the kernel version, we give a simple algorithm to perform the descent along gradients of the Sobolev critic, and show that it converges asymptotically to the target distribution in the MMD sense. In the neural version, we parametrize the Sobolev critic with a neural network with input gradient norm constrained in expectation. We show in theory and experiments that regularization has an important role in favoring smooth transitions between distributions, avoiding large discrete jumps. Our analysis could provide a new perspective on the impact of critic updates (early stopping) on the paths to equilibrium in the GAN setting.
Adversarial attacks on deep learning models have been demonstrated to be imperceptible to a human, while decreasing the model performance considerably. Attempts to provide invariance against such attacks have denoised adversarial samples to only send cleaned samples to the classifier. In a similar spirit this paper proposes a novel effective strategy that relaxes adversarial samples onto the underlying manifold of the (unknown) target class distribution. Specifically, given an off-manifold adversarial example, our Metropolis-adjusted Langevin algorithm (Mala), guided through a supervised denoising autoencoder network (sDAE), drives the adversarial samples towards high density regions of the data generating distribution. So, in a nutshell, the adversarial example is transformed back from off-manifold onto the data manifold for which the learning model was originally trained and where it can perform well and robustly. Experiments on various benchmark datasets show that our novel Malade method exhibits a high robustness against blackbox and whitebox attacks and outperforms state-of-the-art defense algorithms.
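For readers unfamiliar with the sampler named above, the following is a minimal sketch of a generic Metropolis-adjusted Langevin step. It uses a standard Gaussian target whose gradient is known in closed form; in the paper the guidance comes from the sDAE, which is not reproduced here. The helper names (`mala_step`, `run_chain`) and the step size are assumptions for illustration.

```python
import math
import random


def mala_step(x, log_p, grad_log_p, step, rng):
    """One MALA update: Langevin proposal followed by an accept/reject test."""
    def drift(point):
        # Mean of the Gaussian proposal started from `point`.
        return [p + step * g for p, g in zip(point, grad_log_p(point))]

    def log_q(target, mean):
        # Log proposal density up to a constant (variance is 2 * step).
        return -sum((a - b) ** 2 for a, b in zip(target, mean)) / (4.0 * step)

    mean_fwd = drift(x)
    noise_scale = math.sqrt(2.0 * step)
    proposal = [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean_fwd]
    mean_bwd = drift(proposal)
    log_alpha = (log_p(proposal) - log_p(x)
                 + log_q(x, mean_bwd) - log_q(proposal, mean_fwd))
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal, True
    return x, False


def log_p(x):
    # Toy target for this sketch: standard Gaussian, up to a constant.
    return -0.5 * sum(v * v for v in x)


def grad_log_p(x):
    return [-v for v in x]


def run_chain(start, n_steps, step=0.2, seed=0):
    rng = random.Random(seed)
    x, samples = list(start), []
    for _ in range(n_steps):
        x, _accepted = mala_step(x, log_p, grad_log_p, step, rng)
        samples.append(x)
    return samples
```

Started far from the mode, for example at `[5.0, 5.0]`, the chain drifts toward the high-density region around the origin. This is the same qualitative behaviour the abstract describes for pulling an off-manifold sample back toward the data distribution.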
Understanding the learning dynamics of neural networks is one of the key issues for the improvement of optimization algorithms as well as for the theoretical comprehension of why deep neural nets work so well today. In this paper, we introduce a random matrix-based framework to analyze the learning dynamics of a single-layer linear network on a binary classification problem, for data of simultaneously large dimension and size, trained by gradient descent. Our results provide rich insights into common questions in neural nets, such as overfitting, early stopping and the initialization of training, thereby opening the door for future studies of more elaborate structures and models appearing in today’s neural networks.
Long short-term memory (LSTM) has been widely used for sequential data modeling. Researchers have increased LSTM depth by stacking LSTM cells to improve performance. This incurs model redundancy, increases run-time delay, and makes the LSTMs more prone to overfitting. To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to LSTM’s original one level non-linear control gates. H-LSTM increases accuracy while employing fewer external stacked layers, thus reducing the number of parameters and run-time latency significantly. We employ grow-and-prune (GP) training to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections. This learns both the weights and the compact architecture of H-LSTM control gates. We have GP-trained H-LSTMs for image captioning and speech recognition applications. For the NeuralTalk architecture on the MSCOCO dataset, our three models reduce the number of parameters by 38.7x [floating-point operations (FLOPs) by 45.5x], run-time latency by 4.5x, and improve the CIDEr score by 2.6. For the DeepSpeech2 architecture on the AN4 dataset, our two models reduce the number of parameters by 19.4x (FLOPs by 23.5x), run-time latency by 15.7%, and the word error rate from 12.9% to 8.7%. Thus, GP-trained H-LSTMs can be seen to be compact, fast, and accurate.
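The two operations that grow-and-prune training alternates between can be sketched on a flat list of connections. This is a simplified illustration under assumptions: the function names, the flat-list representation, and the selection rules (smallest absolute weight for pruning, largest absolute gradient for growth) are choices made for this example, and the abstract does not give the authors' exact schedule or thresholds.

```python
def prune_by_magnitude(weights, mask, fraction):
    """Deactivate the given fraction of active connections with smallest |w|."""
    active = [i for i, m in enumerate(mask) if m == 1]
    n_drop = int(len(active) * fraction)
    # Sort active connections from smallest to largest absolute weight.
    to_drop = sorted(active, key=lambda i: abs(weights[i]))[:n_drop]
    new_mask = list(mask)
    for i in to_drop:
        new_mask[i] = 0
    return new_mask


def grow_by_gradient(grads, mask, n_new):
    """Activate the n_new dormant connections with largest |gradient|."""
    dormant = [i for i, m in enumerate(mask) if m == 0]
    to_add = sorted(dormant, key=lambda i: abs(grads[i]), reverse=True)[:n_new]
    new_mask = list(mask)
    for i in to_add:
        new_mask[i] = 1
    return new_mask


# Usage: one prune phase followed by one growth phase on six connections.
weights = [0.5, -0.01, 0.2, -0.9, 0.03, 0.0]
mask = [1, 1, 1, 1, 1, 0]
pruned = prune_by_magnitude(weights, mask, fraction=0.4)
grads = [0.1, 0.7, 0.0, 0.2, -0.05, -0.9]
regrown = grow_by_gradient(grads, pruned, n_new=1)
```

In a real training loop the mask would be applied to the weight tensors between gradient updates, so the architecture and the weights are learned together.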
Knowing when a classifier’s prediction can be trusted is useful in many applications and critical for safely using AI. While the bulk of the effort in machine learning research has been towards improving classifier performance, understanding when a classifier’s predictions should and should not be trusted has received far less attention. The standard approach is to use the classifier’s discriminant or confidence score; however, we show there exists a considerably more effective alternative. We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier’s confidence score as well as many other baselines. Further, under some mild distributional assumptions, we show that if the trust score for an example is high (low), the classifier will likely agree (disagree) with the Bayes-optimal classifier. Our guarantees consist of non-asymptotic rates of statistical consistency under various nonparametric settings and build on recent developments in topological data analysis.
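One way to turn "agreement with a nearest-neighbor classifier" into a number is a distance ratio: how far the test point is from the closest training point of any other class, divided by how far it is from the closest training point of the predicted class. The sketch below uses that ratio as an assumption. It omits the "modified" part of the nearest-neighbor classifier mentioned in the abstract (no filtering of training points is done), and the function name and signature are hypothetical.

```python
import math


def trust_score(x, predicted_label, train_points, train_labels):
    """Distance-ratio trust score (simplified, no density filtering).

    Assumes the training set contains at least one point of the predicted
    class and at least one point of some other class.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    d_pred = min(dist(x, p) for p, y in zip(train_points, train_labels)
                 if y == predicted_label)
    d_other = min(dist(x, p) for p, y in zip(train_points, train_labels)
                  if y != predicted_label)
    # Small constant avoids division by zero when x coincides with a
    # training point of the predicted class.
    return d_other / (d_pred + 1e-12)


# Usage: two well-separated classes in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
query = (0.5, 0.5)
score_if_pred_0 = trust_score(query, 0, points, labels)
score_if_pred_1 = trust_score(query, 1, points, labels)
```

A score well above 1 means the prediction agrees with the nearest-neighbor view of the data, while a score well below 1 flags a prediction that the neighbors contradict.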
We introduce collaborative learning in which multiple classifier heads of the same network are simultaneously trained on the same training data to improve generalization and robustness to label noise with no extra inference cost. It acquires the strengths from auxiliary training, multi-task learning and knowledge distillation. There are two important mechanisms involved in collaborative learning. First, the consensus of multiple views from different classifier heads on the same example provides supplementary information as well as regularization to each classifier, thereby improving generalization. Second, intermediate-level representation (ILR) sharing with backpropagation rescaling aggregates the gradient flows from all heads, which not only reduces training computational complexity, but also facilitates supervision to the shared layers. The empirical results on CIFAR and ImageNet datasets demonstrate that deep neural networks learned as a group in a collaborative way significantly reduce the generalization error and increase the robustness to label noise.
Combining complementary information from multiple modalities is intuitively appealing for improving the performance of learning-based approaches. However, it is difficult to fully leverage different modalities because of practical issues such as varying levels of noise and conflicts between modalities. Existing methods do not adopt a joint approach to capturing synergies between the modalities while simultaneously filtering noise and resolving conflicts on a per-sample basis. In this work we propose a novel deep neural network based technique that multiplicatively combines information from different source modalities. As a result, the training process automatically focuses on information from more reliable modalities while reducing emphasis on the less reliable ones. Furthermore, we propose an extension that multiplicatively combines not only the single-source modalities but also a set of mixed source modalities, to better capture cross-modal signal correlations. We demonstrate the effectiveness of our proposed technique by presenting empirical results on three multimodal classification tasks from different domains. The results show consistent accuracy improvements on all three tasks.
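As a generic illustration of multiplicative combination, the sketch below applies a plain product rule to per-modality class probabilities and renormalizes. This is not the paper's training objective, which the abstract does not specify. The function name `product_fusion` and the example probabilities are assumptions chosen only to show how an uninformative modality has little influence on a multiplicative fusion.

```python
def product_fusion(probs_by_modality, eps=1e-12):
    """Multiply per-modality class probabilities and renormalize.

    probs_by_modality: list of probability vectors, one per modality,
    all over the same classes. eps guards against an all-zero product.
    """
    n_classes = len(probs_by_modality[0])
    fused = [1.0] * n_classes
    for probs in probs_by_modality:
        for c in range(n_classes):
            fused[c] *= probs[c] + eps
    total = sum(fused)
    return [value / total for value in fused]

# Hypothetical example: a noisy modality that is undecided (uniform)
# and a reliable modality that is confident in class 0.
noisy_modality = [0.5, 0.5]
reliable_modality = [0.9, 0.1]
fused = product_fusion([noisy_modality, reliable_modality])
```

Because the uniform vector scales every class equally, the fused distribution stays essentially equal to the confident modality's distribution.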
RDF data in the linked open data (LOD) cloud is very valuable for many different applications. In order to unlock the full value of this data, users should be able to issue complex queries on the RDF datasets in the LOD cloud. SPARQL can express such complex queries, but constructing SPARQL queries can be a challenge to users since it requires knowing the structure and vocabulary of the datasets being queried. In this paper, we introduce Sapphire, a tool that helps users write syntactically and semantically correct SPARQL queries without prior knowledge of the queried datasets. Sapphire interactively helps the user while typing the query by providing auto-complete suggestions based on the queried data. After a query is issued, Sapphire provides suggestions on ways to change the query to better match the needs of the user. We evaluated Sapphire based on performance experiments and a user study and showed it to be superior to competing approaches.
The potential of graph convolutional neural networks for the task of zero-shot learning has been demonstrated recently. These models are highly sample efficient as related concepts in the graph structure share statistical strength allowing generalization to new classes when faced with a lack of data. However, knowledge from distant nodes can get diluted when propagating through intermediate nodes, because current approaches to zero-shot learning use graph propagation schemes that perform Laplacian smoothing at each layer. We show that extensive smoothing does not help the task of regressing classifier weights in zero-shot learning. In order to still incorporate information from distant nodes and utilize the graph structure, we propose an Attentive Dense Graph Propagation Module (ADGPM). ADGPM allows us to exploit the hierarchical graph structure of the knowledge graph through additional connections. These connections are added based on a node’s relationship to its ancestors and descendants and an attention scheme is further used to weigh their contribution depending on the distance to the node. Finally, we illustrate that finetuning of the feature representation after training the ADGPM leads to considerable improvements. Our method achieves competitive results, outperforming previous zero-shot learning approaches.
Bagging and boosting have been shown to be the best methods of building multiple classifiers in classification combination problems. For ‘flat clustering’ problems, it is also recognized that multi-clustering methods based on boosting produce clusterings of improved quality. In this paper, we introduce a novel multi-clustering method for ‘hierarchical clusterings’ based on boosting theory, which creates a more stable hierarchical clustering of a dataset. The proposed algorithm includes a boosting iteration in which a bootstrap sample is created by weighted random sampling of elements from the original dataset. A hierarchical clustering algorithm is then applied to the selected subsample to build a dendrogram that describes the hierarchy. Finally, the dissimilarity description matrices of the resulting dendrograms are combined into a consensus matrix using a hierarchical-clustering-combination approach. Experiments on popular real-world datasets show that the boosted method provides higher-quality solutions than standard hierarchical clustering methods.
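The weighted sampling step of the boosting iteration can be sketched as follows. Only the sampling is shown, because the abstract does not specify the weight update rule or the combination approach. The function name `weighted_bootstrap`, the weights, and the seed are illustrative assumptions.

```python
import random

def weighted_bootstrap(n_items, weights, rng):
    """Draw n_items indices with replacement, with probability
    proportional to each element's current weight."""
    return rng.choices(range(n_items), weights=weights, k=n_items)

# Hypothetical weights: element 2 currently has zero weight,
# so it can never appear in the bootstrap sample.
rng = random.Random(0)
sample_indices = weighted_bootstrap(5, [1.0, 1.0, 0.0, 1.0, 1.0], rng)
```

The sampled indices would then select the rows passed to the hierarchical clustering algorithm in that iteration.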
We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
The design of a reward function often poses a major practical challenge to real-world applications of reinforcement learning. Approaches such as inverse reinforcement learning attempt to overcome this challenge, but require expert demonstrations, which can be difficult or expensive to obtain in practice. We propose variational inverse control with events (VICE), which generalizes inverse reinforcement learning methods to cases where full demonstrations are not needed, such as when only samples of desired goal states are available. Our method is grounded in an alternative perspective on control and reinforcement learning, where an agent’s goal is to maximize the probability that one or more events will happen at some point in the future, rather than maximizing cumulative rewards. We demonstrate the effectiveness of our methods on continuous control tasks, with a focus on high-dimensional observations like images where rewards are hard or even impossible to specify.
While recurrent neural networks have found success in a variety of natural language processing applications, they are general models of sequential data. We investigate how the properties of natural language data affect an LSTM’s ability to learn a nonlinguistic task: recalling elements from its input. We find that models trained on natural language data are able to recall tokens from much longer sequences than models trained on non-language sequential data. Furthermore, we show that the LSTM learns to solve the memorization task by explicitly using a subset of its neurons to count timesteps in the input. We hypothesize that the patterns and structure in natural language data enable LSTMs to learn by providing approximate ways of reducing loss, but understanding the effect of different training data on the learnability of LSTMs remains an open question.
We provide a novel (and, to the best of our knowledge, the first) algorithm for high dimensional sparse regression with corruptions in explanatory and/or response variables. Our algorithm recovers the true sparse parameters in the presence of a constant fraction of arbitrary corruptions. Our main contribution is a robust variant of Iterative Hard Thresholding. Using this, we provide accurate estimators with sub-linear sample complexity. Our algorithm includes a novel randomized outlier removal technique for robust sparse mean estimation that may be of interest in its own right: it is orderwise more efficient computationally than existing algorithms, and succeeds with high probability, thus making it suitable for general use in iterative algorithms. We demonstrate its effectiveness on large-scale sparse regression problems with arbitrary corruptions.
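For orientation, the sketch below shows standard (non-robust) Iterative Hard Thresholding on a small, uncorrupted least-squares problem. It is the baseline that the abstract's robust variant builds on, not the robust algorithm itself, and it contains no outlier removal. The function names, the step size, and the toy matrix are assumptions made for illustration.

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def iterative_hard_thresholding(A, y, k, step, iters=100):
    """Plain IHT: a gradient step on 0.5 * ||y - A x||^2, then projection
    onto k-sparse vectors. No robustness to corrupted rows is included."""
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(iters):
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(d)) for i in range(n)]
        # A^T residual is the negative gradient of the least-squares loss.
        descent = [sum(A[i][j] * residual[i] for i in range(n)) for j in range(d)]
        x = hard_threshold([x[j] + step * descent[j] for j in range(d)], k)
    return x

# Toy problem: A = I + 0.1 * ones, true parameter is 2-sparse, no noise.
A = [[1.1 if i == j else 0.1 for j in range(4)] for i in range(4)]
x_true = [0.0, 3.0, 0.0, -2.0]
y = [sum(A[i][j] * x_true[j] for j in range(4)) for i in range(4)]
x_hat = iterative_hard_thresholding(A, y, k=2, step=0.5)
```

On this well-conditioned, noiseless instance the iterates settle on the true support and converge to `x_true`. With corrupted rows the plain gradient step can be arbitrarily biased, which is the gap a robust variant addresses.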