Structured Set Matching Networks for One-Shot Part Labeling

Diagrams often depict complex phenomena and serve as a good test bed for visual and textual reasoning. However, understanding diagrams using natural image understanding approaches requires large training datasets of diagrams, which are very hard to obtain. Instead, this can be addressed as a matching problem either between labeled diagrams, images or both. This problem is very challenging since the absence of significant color and texture renders local cues ambiguous and requires global reasoning. We consider the problem of one-shot part labeling: labeling multiple parts of an object in a target image given only a single source image of that category. For this set-to-set matching problem, we introduce the Structured Set Matching Network (SSMN), a structured prediction model that incorporates convolutional neural networks. The SSMN is trained using global normalization to maximize local match scores between corresponding elements and a global consistency score among all matched elements, while also enforcing a matching constraint between the two sets. The SSMN significantly outperforms several strong baselines on three label transfer scenarios: diagram-to-diagram, evaluated on a new diagram dataset of over 200 categories; image-to-image, evaluated on a dataset built on top of the Pascal Part Dataset; and image-to-diagram, evaluated on transferring labels across these datasets.

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
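As a rough illustration of why most of a gradient exchange can be skipped, the sketch below sends only the largest-magnitude gradient entries and keeps the rest in a local residual that is added back on the next step. This is an assumption about the general sparsification idea, not the paper's implementation: the function name `sparsify_with_residual` and the `keep_ratio` parameter are hypothetical, and momentum correction, local gradient clipping, momentum factor masking, and warm-up training are not modeled.

```python
def sparsify_with_residual(gradient, residual, keep_ratio=0.001):
    """Return (entries to send, new local residual).

    Hypothetical helper: keeps the top `keep_ratio` fraction of entries by
    magnitude and accumulates everything else locally for later steps.
    """
    # Add the locally accumulated residual to the fresh gradient.
    acc = [g + r for g, r in zip(gradient, residual)]
    # Always send at least one entry.
    k = max(1, int(len(acc) * keep_ratio))
    # Indices of the k largest-magnitude accumulated values.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sent = {i: acc[i] for i in top}
    # Entries that were sent are cleared; the rest stay in the residual.
    new_residual = [0.0 if i in sent else v for i, v in enumerate(acc)]
    return sent, new_residual


if __name__ == "__main__":
    residual = [0.0] * 4
    sent, residual = sparsify_with_residual([0.1, -2.0, 0.3, 0.05], residual, keep_ratio=0.25)
    print(sent, residual)
```

Small entries are delayed rather than dropped: they keep accumulating in the residual until they are large enough to be selected.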

Online Learning with Gated Linear Networks

This paper describes a family of probabilistic architectures designed for online learning under the logarithmic loss. Rather than relying on non-linear transfer functions, our method gains representational power through the use of data conditioning. We state under general conditions a learnable capacity theorem that shows this approach can in principle learn any bounded Borel-measurable function on a compact subset of Euclidean space; the result is stronger than many universality results for connectionist architectures because we provide both the model and the learning procedure for which convergence is guaranteed.

Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks

We examine the problem of question answering over knowledge graphs, focusing on simple questions that can be answered by the lookup of a single fact. Adopting a straightforward decomposition of the problem into entity detection, entity linking, relation prediction, and evidence combination, we explore simple yet strong baselines. On the SimpleQuestions dataset, we find that baseline LSTMs and GRUs plus a few heuristics yield accuracies that approach the state of the art, and techniques that do not use neural networks also perform reasonably well. These results show that gains from sophisticated deep learning techniques proposed in the literature are quite modest and that some previous models exhibit unnecessary complexity.

Sparsity Regularization for classification of large dimensional data

Feature selection has evolved to be a very important step in several machine learning paradigms. Especially in the domains of bioinformatics and text classification, which involve high-dimensional data, feature selection can help drastically reduce the feature space. In cases where it is difficult or infeasible to obtain sufficient training examples, feature selection helps overcome the curse of dimensionality, which in turn helps improve the performance of the classification algorithm. The focus of our research is five embedded feature selection methods: those that use ridge regression, those that use Lasso regression, and those that combine the two with the goal of simultaneously performing variable selection and grouping correlated variables.

Dual Attention Network for Product Compatibility and Function Satisfiability Analysis

Product compatibility and functionality are of utmost importance to customers when they purchase products, and to sellers and manufacturers when they sell products. Due to the huge number of products available online, it is infeasible to enumerate and test the compatibility and functionality of every product. In this paper, we address two closely related problems: product compatibility analysis and function satisfiability analysis, where the second problem is a generalization of the first problem (e.g., whether a product works with another product can be considered as a special function). We first identify a novel question-answering corpus that is up-to-date regarding product compatibility and functionality information. To allow automatic discovery of product compatibility and functionality, we then propose a deep learning model called Dual Attention Network (DAN). Given a QA pair for a to-be-purchased product, DAN learns to 1) discover complementary products (or functions), and 2) accurately predict the actual compatibility (or satisfiability) of the discovered products (or functions). The challenges addressed by the model include the briefness of QAs, linguistic patterns indicating compatibility, and the appropriate fusion of questions and answers. We conduct experiments to quantitatively and qualitatively show that the identified products and functions have both high coverage and accuracy, compared with a wide spectrum of baselines.

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer more parallelism and hence better computational efficiency. We have developed a new training approach that, rather than statically choosing a single batch size for all epochs, adaptively increases the batch size during the training process. Our method delivers the convergence rate of small batch sizes while achieving performance similar to large batch sizes. We analyse our approach using the standard AlexNet, ResNet, and VGG networks operating on the popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate that learning with adaptive batch sizes can improve performance by factors of up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1% relative to training with fixed batch sizes.
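The idea of growing the batch size during training can be sketched as a simple schedule. The version below is only an illustration under assumed settings: the function name `adaptive_batch_size`, the doubling factor, the 20-epoch interval, and the cap are all hypothetical and are not taken from the paper.

```python
def adaptive_batch_size(epoch, base_batch=128, interval=20, factor=2, max_batch=2048):
    """Batch size for a given epoch under a hypothetical step schedule.

    The batch size starts at `base_batch`, is multiplied by `factor` every
    `interval` epochs, and is capped at `max_batch`.
    """
    steps = epoch // interval  # number of increases applied so far
    return min(base_batch * factor ** steps, max_batch)


if __name__ == "__main__":
    for epoch in (0, 19, 20, 40, 100):
        print(epoch, adaptive_batch_size(epoch))
```

Early epochs use small batches, which the abstract associates with faster convergence per epoch, while later epochs use larger batches that expose more parallelism.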

Learning General Latent-Variable Graphical Models with Predictive Belief Propagation and Hilbert Space Embeddings

In this paper, we propose a new algorithm for learning general latent-variable probabilistic graphical models using the techniques of predictive state representation, instrumental variable regression, and reproducing-kernel Hilbert space embeddings of distributions. Under this new learning framework, we first convert latent-variable graphical models into corresponding latent-variable junction trees, and then reduce the hard parameter learning problem into a pipeline of supervised learning problems, whose results will then be used to perform predictive belief propagation over the latent junction tree during the actual inference procedure. We then give proofs of our algorithm’s correctness, and demonstrate its good performance in experiments on one synthetic dataset and two real-world tasks from computational biology and computer vision – classifying DNA splice junctions and recognizing human actions in videos.

Multi-channel Encoder for Neural Machine Translation

The attention-based encoder-decoder is an effective architecture for neural machine translation (NMT), which typically relies on recurrent neural networks (RNN) to build the blocks that will later be called by the attentive reader during the decoding process. This encoder design yields a relatively uniform composition of the source sentence, despite the gating mechanism employed in the encoding RNN. On the other hand, we often want the decoder to take pieces of the source sentence at varying levels suiting its own linguistic structure: for example, we may want to take an entity name in its raw form while taking an idiom as a perfectly composed unit. Motivated by this demand, we propose Multi-channel Encoder (MCE), which enhances encoding components with different levels of composition. More specifically, in addition to the hidden state of the encoding RNN, MCE takes 1) the original word embedding for raw encoding with no composition, and 2) a particular design of external memory in Neural Turing Machine (NTM) for more complex composition, while all three encoding strategies are properly blended during decoding. An empirical study on Chinese-English translation shows that our model can improve by 6.52 BLEU points upon a strong open source NMT system: DL4MT. On the WMT14 English-French task, our single shallow system achieves BLEU=38.8, comparable with the state-of-the-art deep models.

A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network

We introduce a novel embedding method for the knowledge base completion task. Our approach advances the state of the art (SOTA) by employing a convolutional neural network (CNN) for the task, which can capture global relationships and transitional characteristics. We represent each triple (head entity, relation, tail entity) as a 3-column matrix, which is the input to the convolution layer. Different filters with the same shape of 1×3 are applied over the input matrix to produce different feature maps, which are then concatenated into a single feature vector. This vector is used to return a score for the triple via a dot product. The returned score is used to predict whether the triple is valid or not. Experiments show that our model, ConvKB, achieves better link prediction results than previous SOTA models on two current benchmark datasets, WN18RR and FB15k-237.
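The scoring pipeline described above (3-column matrix, 1×3 filters, concatenation, dot product) can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the function name `convkb_score` is hypothetical, the ReLU activation is an assumption (the abstract names no activation), and bias terms and training are omitted.

```python
def convkb_score(head, rel, tail, filters, weights):
    """Score a triple from k-dimensional embeddings.

    `filters` is a list of 1x3 kernels given as 3-tuples; `weights` is the
    final weight vector of length len(filters) * k. All names here are
    illustrative assumptions.
    """
    k = len(head)
    features = []
    for f in filters:
        # Slide the 1x3 filter down the k rows of the k x 3 matrix [head, rel, tail].
        for i in range(k):
            value = f[0] * head[i] + f[1] * rel[i] + f[2] * tail[i]
            features.append(max(0.0, value))  # ReLU activation (assumed)
    # The concatenated feature maps are scored with a dot product.
    return sum(w * x for w, x in zip(weights, features))


if __name__ == "__main__":
    head, rel, tail = [1.0, 2.0], [0.5, -1.0], [1.5, 1.0]
    filters = [(1.0, 1.0, -1.0), (1.0, 0.0, 0.0)]
    weights = [1.0, 1.0, 1.0, 1.0]
    print(convkb_score(head, rel, tail, filters, weights))
```

With the filter (1, 1, -1), each row computes head + relation - tail, so the feature map is zero exactly when the triple satisfies a translation-style relationship. This is one way to read the "transitional characteristics" mentioned in the abstract.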

Generalized Probability Smoothing

In this work we consider a generalized version of Probability Smoothing, the core elementary model for sequential prediction in the state-of-the-art PAQ family of data compression algorithms. Our main contribution is a code length analysis that considers the redundancy of Probability Smoothing with respect to a Piecewise Stationary Source. The analysis holds for a finite alphabet and expresses redundancy in terms of the total variation in probability mass of the stationary distributions of a Piecewise Stationary Source. By choosing parameters appropriately, Probability Smoothing has redundancy $O(S\cdot\sqrt{T\log T})$ for sequences of length $T$ with respect to a Piecewise Stationary Source with $S$ segments.
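For readers unfamiliar with the model, the sketch below shows a basic probability-smoothing predictor and the code length it assigns to a sequence. It assumes the common update in which the current distribution is mixed with a point mass on the observed symbol at a fixed rate. The function name `smoothing_code_length` and the fixed `alpha` are illustrative assumptions; the paper's generalized version and its parameter choices are not reproduced here.

```python
import math


def smoothing_code_length(sequence, alphabet_size, alpha=0.9):
    """Ideal code length in bits of `sequence` under a smoothing predictor.

    Symbols are integers in range(alphabet_size). After each symbol x, the
    estimate becomes alpha * p + (1 - alpha) * indicator(x). A fixed alpha
    is an assumption made for illustration.
    """
    p = [1.0 / alphabet_size] * alphabet_size  # start from the uniform distribution
    bits = 0.0
    for x in sequence:
        bits += -math.log2(p[x])  # cost of coding x with the current estimate
        p = [alpha * q + (1.0 - alpha) * (1.0 if s == x else 0.0)
             for s, q in enumerate(p)]
    return bits


if __name__ == "__main__":
    print(smoothing_code_length([0] * 20, 2))
    print(smoothing_code_length([0, 1] * 10, 2))
```

A constant sequence becomes cheap to code as the estimate adapts, whereas an alternating sequence stays expensive because the predictor always leans toward the symbol it saw last.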

Guided Labeling using Convolutional Neural Networks

Over the last couple of years, deep learning and especially convolutional neural networks have become one of the workhorses of computer vision. One limiting factor for the applicability of supervised deep learning to more areas is the need for large, manually labeled datasets. In this paper we propose an easy-to-implement method we call guided labeling, which automatically determines which samples from an unlabeled dataset should be labeled. We show that using this procedure, the number of samples that need to be labeled is reduced considerably in comparison to labeling images arbitrarily.

Distribution-Based Categorization of Classifier Transfer Learning

Transfer Learning (TL) aims to transfer knowledge acquired in one problem, the source problem, onto another problem, the target problem, dispensing with the bottom-up construction of the target model. Due to its relevance, TL has gained significant interest in the Machine Learning community since it paves the way to devise intelligent learning models that can easily be tailored to many different applications. As is natural in a fast-evolving area, a wide variety of TL methods, settings and nomenclature have been proposed so far. However, many works report different names for the same concepts. This mixture of concepts and terminology obscures the TL field and hinders its proper consideration. In this paper we present a review of the literature on the majority of classification TL methods, and also a distribution-based categorization of TL with a common nomenclature suitable to classification problems. Under this perspective, three main TL categories are presented, discussed and illustrated with examples.

An Efficient Algorithm for Non-Negative Matrix Factorization with Random Projections

Non-negative matrix factorization (NMF) is one of the most popular decomposition techniques for multivariate data. NMF is a core method for many machine-learning related computational problems, such as data compression, feature extraction, word embedding, and recommender systems. In practice, however, its application is challenging for large datasets. The efficiency of NMF is constrained by long data loading times, by large memory requirements and by limited parallelization capabilities. Here we present a novel and efficient compressed NMF algorithm. Our algorithm applies a random compression scheme to drastically reduce the dimensionality of the problem, preserving well the pairwise distances between data points and inherently limiting the memory and communication load. Our algorithm surpasses existing methods in speed while matching the best non-compressed algorithms in reconstruction precision.

Why Do Neural Dialog Systems Generate Short and Meaningless Replies? A Comparison between Dialog and Translation

This paper addresses the question: Why do neural dialog systems generate short and meaningless replies? We conjecture that, in a dialog system, an utterance may have multiple equally plausible replies, causing the deficiency of neural networks in the dialog application. We propose a systematic way to mimic the dialog scenario in a machine translation system, and manage to reproduce the phenomenon of generating short and less meaningful sentences in the translation setting, showing evidence of our conjecture.

Stretching Domain Adaptation: How far is too far?

While deep learning has led to significant advances in visual recognition over the past few years, such advances often require a lot of annotated data. While unsupervised domain adaptation has emerged as an alternative approach that doesn’t require as much annotated data, prior evaluations of domain adaptation have been limited to relatively simple datasets. This work pushes the state of the art in unsupervised domain adaptation through an in-depth evaluation of AlexNet, DenseNet and Residual Transfer Networks (RTN) on multimodal benchmark datasets that identifies which layers more effectively transfer features across different domains. We also modify the existing RTN architecture and propose a novel domain adaptation architecture called ‘Deep MagNet’ that combines Deep Convolutional Blocks with multiple Maximum Mean Discrepancy losses. Our experiments show quantitative and qualitative improvements in the performance of our method on benchmark datasets for complex data domains.

Named Entity Sequence Classification

Named Entity Recognition (NER) aims at locating and classifying named entities in text. In some use cases of NER, including cases where detected named entities are used in creating content recommendations, it is crucial to have a reliable confidence level for the detected named entities. In this work we study the problem of finding confidence levels for detected named entities. We refer to this problem as Named Entity Sequence Classification (NESC). We frame NESC as a binary classification problem, and we use NER as well as recurrent neural networks to find the probability that a candidate named entity is a real named entity. We apply this approach to Tweet texts and show how we can find named entities with high confidence levels from Tweets.

Burst Denoising with Kernel Prediction Networks

We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data.

SGAN: An Alternative Training of Generative Adversarial Networks

Generative Adversarial Networks (GANs) have demonstrated impressive performance for data synthesis and are now used in a wide range of computer vision tasks. In spite of this success, they have gained a reputation for being difficult to train, which results in a time-consuming and human-involved development process. We consider an alternative training process, named SGAN, in which several adversarial ‘local’ pairs of networks are trained independently so that a ‘global’ supervising pair of networks can be trained against them. The goal is to train the global pair with the corresponding ensemble opponent for improved performance in terms of mode coverage. This approach aims at increasing the chances that learning will not stop for the global pair, preventing both networks from being trapped in an unsatisfactory local minimum or facing the oscillations often observed in practice. To guarantee the latter, the global pair never affects the local ones. The rules of SGAN training are thus as follows: the global generator and discriminator are trained using the local discriminators and generators, respectively, whereas the local networks are trained with their fixed local opponent. Experimental results on both toy and real-world problems demonstrate that this approach outperforms standard training in mitigating mode collapse and in stability while converging, and that, surprisingly, it increases the convergence speed as well.

Optimizing Human Learning
No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models
Large MIMO Detection Schemes Based on Channel Puncturing: Performance and Complexity Analysis
Sum of previous inpatient serum creatinine measurements predicts acute kidney injury in rehospitalized patients
A solution theory for quasilinear singular SPDEs
Robust and scalable methods for the dynamic mode decomposition
Grounding Referring Expressions in Images by Variational Context
Population-based Respiratory 4D Motion Atlas Construction and its Application for VR Simulations of Liver Punctures
Integrated Facility Location and Production Scheduling in Multi-Generation Energy Systems
The Role of Data Analysis in Uncertainty Quantification: Case Studies for Materials Modeling
On the linear convergence of the projected stochastic gradient method with constant step-size
Co-domain Embedding using Deep Quadruplet Networks for Unseen Traffic Sign Recognition
Approaching the Ad Placement Problem with Online Linear Classification: The winning solution to the NIPS’17 Ad Placement Challenge
State spaces of convolutional codes, codings and encoders
Optimal Sample Complexity for Stable Matrix Recovery
Parity Factors I: General Kotzig-Lovász Decomposition for Grafts
On the regularity of orientable matroids
Many body localization proximity effects in platforms of coupled spins and bosons
The Best of Both Worlds: Learning Geometry-based 6D Object Pose Estimation
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network
Combinatorial interpretations of the Kreweras triangle in terms of subset tuples
Predicting Demographics, Moral Foundations, and Human Values from Digital Behaviors
Circuit Walks in Integral Polyhedra
Concentration of weakly dependent Banach-valued sums and applications to kernel learning methods
How to Learn a Model Checker
Blind Image Deblurring Using Row-Column Sparse Representations
Learning Latent Super-Events to Detect Multiple Activities in Videos
Special numbers, special quaternions and special symbol elements
Leaf realization problem, caterpillar graphs and prefix normal words
Recognizing Plans by Learning Embeddings from Observed Action Distributions
Learning to Forecast Videos of Human Activity with Multi-granularity Models and Adaptive Rendering
An isomorphism between branched and geometric rough paths
What’s in my closet?: Image classification using fuzzy logic
Deterministic Heavy Hitters with Sublinear Query Time
Reconstruction of rational polytopes from the real-parameter Ehrhart function of its translates
Single-trial P300 Classification using PCA with LDA, QDA and Neural Networks
A Scalable Deep Neural Network Architecture for Multi-Building and Multi-Floor Indoor Localization Based on Wi-Fi Fingerprinting
A Multi-Resolution Spatial Model for Large Datasets Based on the Skew-t Distribution
A High-resolution DOA Estimation Method with a Family of Nonconvex Penalties
Short-Term Prediction of Signal Cycle in Actuated-Controlled Corridor Using Sparse Time Series Models
An analysis of incorporating an external language model into a sequence-to-sequence model
Predicting Short-Term Uber Demand Using Spatio-Temporal Modeling: A New York City Case Study
On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising
ADC Bit Optimization for Spectrum- and Energy-Efficient Millimeter Wave Communications
The vertex-isoperimetric number of the incidence and non-incidence graphs of unitals
Towards Recovery of Conditional Vectors from Conditional Generative Adversarial Networks
Evolutionary Game for Mining Pool Selection in Blockchain Networks
SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties
Learning Semantic Concepts and Order for Image and Sentence Matching
Bayesian Policy Gradients via Alpha Divergence Dropout Inference
Dynamic adaptive procedures for false discovery rate estimation and control
Operators on random hypergraphs and random simplicial complexes
Distance-based Self-Attention Network for Natural Language Inference
Saliency Preservation in Low-Resolution Grayscale Images
Hydrodynamic Limit of Multiple SLE
Unsupervised Multi-Domain Image Translation with Domain-Specific Encoders/Decoders
Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning
On Path Memory in List Successive Cancellation Decoder of Polar Codes
On the Reliability of LTE Random Access: Performance Bounds for Machine-to-Machine Burst Resolution Time
Automatic Segmentation and Overall Survival Prediction in Gliomas using Fully Convolutional Neural Network and Texture Analysis
Improved lower bound on generalized Erdos-Ginzburg-Ziv constants
Inverse modeling of hydrologic systems with adaptive multi-fidelity simulations
Oblivious Routing via Random Walks
A family of constacyclic codes over $\mathbb{F}_{2^{m}}+u\mathbb{F}_{2^{m}}$ and application to quantum codes
A Local Analysis of Block Coordinate Descent for Gaussian Phase Retrieval
Low-Complexity and High-Resolution DOA Estimation for Hybrid Analog and Digital Massive MIMO Receive Array
Strong Disorder Real-Space Renormalization for the Many-Body-Localized phase of random Majorana models
On binomial coefficients modulo squares of primes
Separating Reflection and Transmission Images in the Wild
Exact Algorithms With Worst-case Guarantee For Scheduling: From Theory to Practice
Adaptive Robust Null-Space Projection Beamforming Scheme for Secure Wireless Transmission with AN-aided Directional Modulation
On the arithmetic Kakeya conjecture of Katz and Tao
Enabling Early Audio Event Detection with Neural Networks
Uniform generation of infinite concurrent runs: the case of trace monoids
CNN training with graph-based sample preselection: application to handwritten character recognition
Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend Prediction
Arrangements of Pseudocircles: On Circularizability
A Kalman Filter Approach for Biomolecular Systems with Noise Covariance Updating
Constructions of block designs with block sizes larger than 2
A trans-disciplinary review of deep learning research for water resources scientists
On the Singular Control of Exchange Rates
Cryptanalysis of a public key encryption scheme based on QC-LDPC and QC-MDPC codes
Large Deviation Principles of Obstacle Problems for Quasilinear Stochastic PDEs
Detecting Curve Text in the Wild: New Dataset and New Solution
Lifting Linear Extension Complexity Bounds to the Mixed-Integer Setting
Origins of the Poynting effect in sheared elastic networks
Fitting a Hurdle Generalized Lambda Distribution to Health Care Expenses
Product Function Need Recognition via Semi-supervised Attention Network
Beyond the Pixel-Wise Loss for Topology-Aware Delineation
Fast spatial inference in the homogeneous Ising model
Discourse-Aware Rumour Stance Classification in Social Media Using Sequential Classifiers
Pose-Normalized Image Generation for Person Re-identification
Stochastic Geometry Analysis of Ultra-Dense Networks: Impact of Antenna Height and Performance Limits
An innovative solution for breast cancer textual big data analysis
Disorder and critical phenomena: the $α=0$ copolymer model
Attention based convolutional neural network for predicting RNA-protein binding sites
Functional equations as an important analytic method in stochastic modelling and in combinatorics
Joint 3D Proposal Generation and Object Detection from View Aggregation
Which groups are amenable to proving exponent two for matrix multiplication?
On the double random current nesting field
Cooperative Data Exchange based on MDS Codes
From Lifestyle Vlogs to Everyday Interactions
A posteriori noise estimation in variable data sets
Exchangeable modelling of relational data: checking sparsity, train-test splitting, and sparse exchangeable Poisson matrix factorization
For every quantum walk there is a (classical) lifted Markov chain with the same mixing time
Evolutionary dynamics of cooperation in neutral populations
Stochastic Volatility Models using Hamiltonian Monte Carlo Methods and Stan
Generative Adversarial Perturbations