Compositional Attention Networks for Machine Reasoning

We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning. Drawing inspiration from first principles of computer organization, MAC moves away from monolithic black-box neural architectures towards a design that encourages both transparency and versatility. The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory. By stringing the cells together and imposing structural constraints that regulate their interaction, MAC effectively learns to perform iterative reasoning processes that are directly inferred from the data in an end-to-end approach. We demonstrate the model’s strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, we show that the model is computationally-efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.

Efficient Algorithms for Outlier-Robust Regression

We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from distribution D and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is certifiably hypercontractive. This natural property is satisfied by many well-studied distributions, such as Gaussian distributions, strongly log-concave distributions and the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method for parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the input corrupted sample such that the least squares loss of the function over the subset is minimized over all possible large subsets.
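
The non-convex problem described in the last sentence can be illustrated by brute force on a tiny sample. The sketch below is not the SoS relaxation from the abstract; it simply enumerates every subset of a fixed size and keeps the least squares fit with the smallest loss. The function names, the one-dimensional setting and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def fit_line(points):
    """Ordinary least squares fit y = a*x + b; returns (a, b, mean squared error)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0
    b = mean_y - a * mean_x
    mse = sum((y - (a * x + b)) ** 2 for x, y in points) / n
    return a, b, mse

def best_subset_fit(points, keep):
    """Try every subset of size `keep`; return the fit with the smallest loss.

    Exponential in the sample size, so only usable on toy inputs.
    """
    best = None
    for subset in combinations(points, keep):
        a, b, mse = fit_line(subset)
        if best is None or mse < best[2]:
            best = (a, b, mse)
    return best

# Hypothetical data: eight clean points on y = 2x + 1 plus two corrupted points.
clean = [(float(x), 2.0 * x + 1.0) for x in range(8)]
corrupted = [(8.0, -40.0), (9.0, 60.0)]
sample = clean + corrupted

naive_a, naive_b, naive_mse = fit_line(sample)
robust_a, robust_b, robust_mse = best_subset_fit(sample, keep=8)
```

On this toy sample the plain fit is pulled away from the clean line by the two corrupted points, while the subset search recovers it. The enumeration cost grows exponentially with the sample size, which is why a tractable relaxation is of interest.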

Generating Differentially Private Datasets Using GANs

In this paper, we present a technique for generating artificial datasets that retain statistical properties of the real data while providing differential privacy guarantees with respect to this data. We include a Gaussian noise layer in the discriminator of a generative adversarial network to make the output and the gradients differentially private with respect to the training data, and then use the generator component to synthesise a privacy-preserving artificial dataset. Our experiments show that under a reasonably small privacy budget we are able to generate data of high quality and successfully train machine learning models on this artificial data.
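
A minimal sketch of what a Gaussian noise layer might look like, assuming the usual Gaussian-mechanism recipe of bounding the L2 norm of a vector and then adding zero-mean Gaussian noise scaled to that bound. The abstract does not say where in the discriminator the layer sits, whether clipping is used, or how the noise scale is chosen, so `clip_norm`, `sigma` and the function name are hypothetical, and no privacy budget is computed here.

```python
import math
import random

def gaussian_noise_layer(activations, clip_norm, sigma, rng):
    """Clip a vector to L2 norm `clip_norm`, then add Gaussian noise.

    Assumed form: noise standard deviation is sigma * clip_norm.
    """
    norm = math.sqrt(sum(a * a for a in activations))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [a * scale for a in activations]
    return [a + rng.gauss(0.0, sigma * clip_norm) for a in clipped]

rng = random.Random(0)
hidden = [3.0, 4.0]  # hypothetical activations with L2 norm 5
noisy = gaussian_noise_layer(hidden, clip_norm=1.0, sigma=0.5, rng=rng)
```

Clipping bounds how much any single input can change the layer's output, which is what lets the added noise be calibrated to a fixed sensitivity.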

WNGrad: Learn the Learning Rate in Gradient Descent

Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice. If certain parameters of the loss function such as smoothness or strong convexity constants are known, theoretical learning rate schedules can be applied. However, in practice, such parameters are not known, and the loss function of interest is not convex in any case. The recently proposed batch normalization reparametrization is widely adopted in neural network architectures today because, among other advantages, it is robust to the choice of Lipschitz constant of the gradient of the loss function, allowing one to set a large learning rate without worry. Inspired by batch normalization, we propose a general nonlinear update rule for the learning rate in batch and stochastic gradient descent so that the learning rate can be initialized at a high value, and is subsequently decreased according to gradient observations along the way. The proposed method is shown to achieve robustness to the relationship between the learning rate and the Lipschitz constant, and near-optimal convergence rates in both the batch and stochastic settings (O(1/T) for smooth loss in the batch setting, and O(1/\sqrt{T}) for convex loss in the stochastic setting). We also show through numerical evidence that such robustness of the proposed method extends to highly nonconvex and possibly non-smooth loss functions in deep learning problems. Our analysis establishes a first theoretical understanding of the observed robustness of batch normalization and weight normalization.
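
The abstract does not state the update rule itself, so the sketch below shows one rule of the kind it describes, under the assumption that the learning rate is 1/b and that b grows by the squared gradient norm divided by b at each step. The function names and the quadratic test problem are hypothetical. It is compared with plain gradient descent started from the same overly large learning rate.

```python
def adaptive_gd(grad, x, b, steps):
    """Gradient descent with learning rate 1/b, where b is updated from gradients.

    Assumed rule: b <- b + g^2 / b, then x <- x - g / b.
    """
    for _ in range(steps):
        g = grad(x)
        b = b + g * g / b  # b never decreases, so the rate 1/b never increases
        x = x - g / b
    return x, b

def fixed_gd(grad, x, lr, steps):
    """Plain gradient descent with a constant learning rate."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def grad_quadratic(x):
    """Gradient of f(x) = 5 x^2, whose gradient has Lipschitz constant 10."""
    return 10.0 * x

# Both runs start from learning rate 1.0, far above the stable range for this f.
x_adaptive, b_final = adaptive_gd(grad_quadratic, x=1.0, b=1.0, steps=200)
x_fixed = fixed_gd(grad_quadratic, x=1.0, lr=1.0, steps=200)
```

With the constant rate the iterates grow without bound, while the adaptive run shrinks its rate after the first large gradient and then converges towards the minimizer at zero.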

An efficient framework for learning sentence representations

In this work we propose a simple and efficient framework for learning sentence representations from unlabelled data. Drawing inspiration from the distributional hypothesis and recent work on learning sentence representations, we reformulate the problem of predicting the context in which a sentence appears as a classification problem. Given a sentence and its context, a classifier distinguishes context sentences from other contrastive sentences based on their vector representations. This allows us to efficiently learn different types of encoding functions, and we show that the model learns high-quality sentence representations. We demonstrate that our sentence representations outperform state-of-the-art unsupervised and supervised representation learning methods on several downstream NLP tasks that involve understanding sentence semantics while achieving an order of magnitude speedup in training time.
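
The classification view of context prediction can be sketched as follows. The scoring function (a dot product between sentence vectors) and the softmax over candidates are assumptions chosen for illustration, as are the toy vectors; the abstract only says that a classifier separates context sentences from contrastive ones using their vector representations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_probabilities(sentence_vec, candidate_vecs):
    """Softmax over similarity scores between a sentence and each candidate."""
    scores = [dot(sentence_vec, c) for c in candidate_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_loss(sentence_vec, candidate_vecs, true_index):
    """Negative log-probability assigned to the true context sentence."""
    probs = context_probabilities(sentence_vec, candidate_vecs)
    return -math.log(probs[true_index])

# Hypothetical encoded sentences; in practice these come from a learned encoder.
sentence = [1.0, 0.0, 1.0]
candidates = [
    [0.9, 0.1, 0.8],   # true context sentence
    [-0.5, 1.0, 0.0],  # contrastive sentence
    [0.0, -1.0, 0.2],  # contrastive sentence
]
probs = context_probabilities(sentence, candidates)
loss = context_loss(sentence, candidates, true_index=0)
```

Because the model only has to pick the right sentence among candidates rather than generate it word by word, each training step reduces to a handful of vector comparisons.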

A Multi-Objective Deep Reinforcement Learning Framework

This paper presents a new multi-objective deep reinforcement learning (MODRL) framework based on deep Q-networks. We propose linear and non-linear methods to develop the MODRL framework, which includes both single-policy and multi-policy strategies. The experimental results on a deep sea treasure environment indicate that the proposed approach is able to converge to the optimal Pareto solutions. The proposed framework is generic, which allows implementation of different deep reinforcement learning algorithms in various complex environments. Details of the framework implementation can be found at http://…/drl.htm.

Learning Deep Models: Critical Points and Local Openness

With the increasing interest in a deeper understanding of the loss surface of many non-convex deep models, this paper presents a unifying framework to establish the local/global optima equivalence of the optimization problems arising from training such non-convex models. Using the local openness property of the underlying training models, we provide simple sufficient conditions under which any local optimum of the resulting optimization problem is globally optimal. We first completely characterize the local openness of the symmetric and non-symmetric matrix multiplication mapping in its range. Then we use our characterization to: 1) Provide a simple proof for the classical result of Burer-Monteiro and extend it to non-continuous loss functions. 2) Show that every local optimum of two-layer linear networks is globally optimal. Unlike many existing results in the literature, our result requires no assumption on the target data matrix Y or the input data matrix X. 3) Develop an almost complete characterization of the local/global optima equivalence of multi-layer linear neural networks. We provide various counterexamples to show the necessity of each of our assumptions. 4) Show global/local optima equivalence of non-linear deep models having a certain pyramidal structure. Unlike some existing works, our result requires no assumption on the differentiability of the activation functions and can go beyond ‘full-rank’ cases.

Reptile: a Scalable Metalearning Algorithm

This paper considers metalearning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution. We present a remarkably simple metalearning algorithm called Reptile, which learns a parameter initialization that can be fine-tuned quickly on a new task. Reptile works by repeatedly sampling a task, training on it, and moving the initialization towards the trained weights on that task. Unlike MAML, which also learns an initialization, Reptile doesn’t require differentiating through the optimization process, making it more suitable for optimization problems where many update steps are required. We show that Reptile performs well on some well-established benchmarks for few-shot classification. We provide some theoretical analysis aimed at understanding why Reptile works.
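
The loop described above (sample a task, train on it, move the initialization towards the trained weights) can be sketched on a toy problem. The scalar tasks, the quadratic loss and all hyperparameter values below are assumptions made for illustration, not settings from the paper.

```python
import random

def inner_train(w, target, steps, lr):
    """A few gradient steps on the task loss (w - target)^2."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return w

def reptile(task_targets, meta_steps, inner_steps, inner_lr, meta_lr, rng):
    """Learn an initialization by repeatedly interpolating towards trained weights."""
    init = 0.0
    for _ in range(meta_steps):
        target = rng.choice(task_targets)                            # sample a task
        trained = inner_train(init, target, inner_steps, inner_lr)   # train on it
        init += meta_lr * (trained - init)                           # move init towards trained weights
    return init

rng = random.Random(0)
tasks = [4.0, 5.0, 6.0]  # hypothetical tasks: each asks w to match one target value
learned_init = reptile(tasks, meta_steps=2000, inner_steps=5,
                       inner_lr=0.1, meta_lr=0.05, rng=rng)
```

Only the start and end points of the inner training run are used in the outer update, so nothing has to be differentiated through the inner optimization steps. In this toy setting the learned initialization settles among the task targets, from which any single task is reached in a few steps.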

Learning Effective Binary Visual Representations with Deep Networks

Although binary visual representations have traditionally been designed mainly to reduce computational and storage costs in image retrieval research, this paper argues that they can be applied to large-scale recognition and detection problems in addition to hashing in retrieval. Furthermore, their binary nature may make them generalize better than their real-valued counterparts. Existing binary hashing methods are either two-stage or hinge on loss term regularization or saturated functions, and hence converge slowly and only emit soft binary values. This paper proposes Approximately Binary Clamping (ABC), which is non-saturating, end-to-end trainable, converges quickly and can output true binary visual representations. ABC achieves accuracy comparable to that of its real-valued counterpart in ImageNet classification, and even generalizes better in object detection. On benchmark image retrieval datasets, ABC also outperforms existing hashing methods.

Cross-domain Recommendation via Deep Domain Adaptation

The behavior of users in certain services could be a clue that can be used to infer their preferences and may be used to make recommendations for other services they have never used. However, the cross-domain relationships between items and user consumption patterns are not simple, especially when there are few or no common users and items across domains. To address this problem, we propose a content-based cross-domain recommendation method for cold-start users that does not require user or item overlap. We formulate recommendation as extreme multi-class classification where labels (items) corresponding to the users are predicted. With this formulation, the problem is reduced to a domain adaptation setting, in which a classifier trained in the source domain is adapted to the target domain. For this, we construct a neural network that combines an architecture for domain adaptation, Domain Separation Network, with a denoising autoencoder for item representation. We assess the performance of our approach in experiments on a pair of data sets collected from movie and news services of Yahoo! JAPAN and show that our approach outperforms several baseline methods including a cross-domain collaborative filtering method.

Learning with Rules

Complex classifiers may exhibit 'embarrassing' failures in cases that would be easily classified and justified by a human. Avoiding such failures is obviously paramount, particularly in domains where we cannot accept this unexplained behavior. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, and otherwise, it is predictable by a linear classifier. We define a related hypothesis class and determine its sample complexity. We also give evidence that efficient algorithms cannot, unfortunately, enjoy this sample complexity. We then derive a simple and efficient algorithm, and also give evidence that its sample complexity is optimal among efficient algorithms. Experiments on sentiment analysis demonstrate the efficacy of the method, both in terms of accuracy and interpretability.
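
A predictor of the shape described above (a rule fires if a decisive feature is present, otherwise a linear classifier decides) might look like the sketch below. The abstract does not define the hypothesis class precisely, so the representation of rules as a feature-to-label mapping, the example words and the weights are all assumptions for illustration.

```python
def predict(features, rules, weights, bias):
    """Return (label, explanation) for a feature dictionary.

    `rules` maps a feature name to the label it forces when the feature is present.
    If no rule fires, the sign of a linear score decides.
    """
    for name, label in rules.items():
        if features.get(name, 0.0) != 0.0:
            return label, "rule: " + name
    score = bias + sum(weights.get(n, 0.0) * v for n, v in features.items())
    return (1 if score >= 0 else -1), "linear"

# Hypothetical sentiment setup: word counts as features, labels in {-1, +1}.
rules = {"awful": -1, "masterpiece": 1}
weights = {"good": 1.0, "slow": -0.8, "plot": 0.1}

by_rule = predict({"awful": 1.0, "good": 3.0}, rules, weights, bias=0.0)
by_linear = predict({"good": 1.0, "slow": 1.0}, rules, weights, bias=0.0)
```

Returning the explanation alongside the label reflects the interpretability angle: when a rule fires, the justification is the feature itself rather than a weighted sum.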

A Bayesian and Machine Learning approach to estimating Influence Model parameters for IM-RO

The rise of Online Social Networks (OSNs) has generated an enormous amount of interest from advertisers and researchers seeking to capitalize on their features. Researchers aim to develop strategies for determining how information propagates among users within an OSN, a process captured by diffusion or influence models. We consider the influence models for the IM-RO problem, a novel formulation of the Influence Maximization (IM) problem based on Stochastic Dynamic Programming (SDP). In contrast to existing approaches involving influence spread and the theory of submodular functions, the SDP method focuses on optimizing clicks and ultimately revenue to advertisers in OSNs. Influence maximization has been actively researched over the past decade, with applications to multiple fields; our approach, however, is a more practical variant of the original IM problem. In this paper, we provide an analysis of the influence models of the IM-RO problem by conducting experiments on synthetic and real-world datasets. We propose a Bayesian and Machine Learning approach for estimating the parameters of the influence models for the Influence Maximization-Revenue Optimization (IM-RO) problem. We present a Bayesian hierarchical model and implement the well-known Naive Bayes classifier (NBC), Decision Trees classifier (DTC) and Random Forest classifier (RFC) on three real-world datasets. Compared to previous approaches to estimating influence model parameters, our strategy has the great advantage of being directly implementable in standard software packages such as WinBUGS/OpenBUGS/JAGS and Apache Spark. We demonstrate the efficiency and usability of our methods in terms of spreading information and generating revenue for advertisers in the context of OSNs.

Beyond many-body localized states in a spin-disordered Hubbard model with pseudo-spin symmetry
The emergent algebraic structure of RNNs and embeddings in NLP
Flip procedure in geometric approximation of multiple-component shapes — Application to multiple-inclusion detection
The Randomized Kaczmarz Method with Mismatched Adjoint
Good Distance Lattices from High Dimensional Expanders
Value Alignment, Fair Play, and the Rights of Service Robots
Satisficing in Time-Sensitive Bandit Learning
Approximation algorithms for two-machine flow-shop scheduling with a conflict graph
Phase transitions for a model with uncountable spin space on the Cayley tree: the general case
Entanglement in a dephasing model and many-body localization
Optimizing cluster-based randomized experiments under a monotonicity assumption
Distributed Base Station: A Concept System for Long-Range Broadband Wireless Access
Deep Models of Interactions Across Sets
Algorithms and diagnostics for the analysis of preference rankings with the Extended Plackett-Luce model
Graph extensions, edit number and regular graphs
A Robustness Measure of Transient Stability under Operational Constraints in Power Systems
Simultaneous Task Allocation and Planning Under Uncertainty
On the Improved Nonlinear Tracking Differentiator based Nonlinear PID Controller Design
A Brandom-ian view of Reinforcement Learning towards strong-AI
Translating Questions into Answers using DBPedia n-triples
A Bayesian framework for molecular strain identification from mixed diagnostic samples
Quasi-patterns produced by a Mexican Hat coupling of quasi-cycles
Mixed Voltage Angle and Frequency Droop Control for Transient Stability of Interconnected Microgrids
Proximal Activation of Smooth Functions in Splitting Algorithms for Convex Minimization
Analysis of Decimation on Finite Frames with Sigma-Delta Quantization
Fast Convergence for Stochastic and Distributed Gradient Descent in the Interpolation Limit
A Newton-CG Algorithm with Complexity Guarantees for Smooth Unconstrained Optimization
Stochastic Games for Fuel Followers Problem: N vs MFG
Distributed Computation of Wasserstein Barycenters over Networks
The Advantage of Doubling: A Deep Reinforcement Learning Approach to Studying the Double Team in the NBA
Multi-objective evolution for 3D RTS Micro
Comparison of Noisy Channels and Reverse Data-Processing Theorems
Nearly orthogonal vectors and small antipodal spherical codes
Strong Convex Nonlinear Relaxations of the Pooling Problem
Some Approximation Bounds for Deep Networks
Modulus p^2 congruences involving harmonic numbers
Decomposition of Nonlinear Dynamical Networks via Comparison Systems
On properties of a class of strong limits for supercritical superprocesses
Accelerating a fluvial incision and landscape evolution model with parallelism
A framework with updateable joint images re-ranking for Person Re-identification
Instance Similarity Deep Hashing for Multi-Label Image Retrieval
Rethinking Feature Distribution for Loss Functions in Image Classification
A Deep Generative Model for Disentangled Representations of Sequential Data
Pointing consensus for rooted out-branching graphs
How Images Inspire Poems: Generating Classical Chinese Poetry from Images with Memory Networks
DeepCAS: A Deep Reinforcement Learning Algorithm for Control-Aware Scheduling
Universal Transport Dynamics of Complex Fluids: Effects of Intrinsic and Extrinsic Disorder
Prime lattice points in ovals
An FPGA-based Massively Parallel Neuromorphic Cortex Simulator
Infinite Reduced Words, Lattice Property And Braid Graph of Affine Weyl Groups
Generalized Linear Models for Geometrical Current predictors. An application to predict garment fit
SA-IGA: A Multiagent Reinforcement Learning Method Towards Socially Optimal Outcomes
Robustness of control point configurations for homography and planar pose estimation
Sample Complexity of Total Variation Minimization
Redundancy in Distributed Proofs
Input-to-State Safety with Control Barrier Functions
Design of a nickel-base superalloy using a neural network
Self-healing Routing and Other Problems in Compact Memory
Renormalisation of parabolic stochastic PDEs
Preserving Semantic Relations for Zero-Shot Learning
Log Gaussian Cox processes on the sphere
A frequency-constrained geometric Pontryagin maximum principle on matrix Lie groups
Hierarchical Heuristic Learning towards Efficient Norm Emergence
A note on two-colorability of nonuniform hypergraphs
Reflection length in the general linear and affine groups
Chomp on Kneser graphs and graphs with only one odd cycle
Equivariant Euler characteristics of the symplectic building
Generalized partially linear models on Riemannian manifolds
Distributed virtual machine consolidation: A systematic mapping study
Leveraging Unlabeled Data for Crowd Counting by Learning to Rank
Applicability and interpretation of the deterministic weighted cepstral distance
Concise Fuzzy Representation of Big Graphs: a Dimensionality Reduction Approach
Physical Layer Communications System Design Over-the-Air Using Adversarial Networks
SentRNA: Improving computational RNA design by incorporating a prior of human design strategies
Some properties of $\{k\}$-packing function problem in graphs
The Whitney Duals of a Graded Poset
Two Distinct Seasonally Fractionally Differenced Periodic Processes
Aggregation using input-output trade-off
An Enabling Waveform for 5G – QAM-FBMC: Initial Analysis
Modeling Activation Processes in Human Memory to Improve Tag Recommendations
Fact Checking in Community Forums
Drug Recommendation toward Safe Polypharmacy
RAN Enablers for 5G Radio Resource Management
Dynamic IoT Choreographies — Managing Discovery, Distribution, Failure and Reconfiguration
Universal Dielectric Response across a Continuous Metal-Insulator Transition
Not all phylogenetic networks are leaf-reconstructible
From coalescing random walks on a torus to Kingman’s coalescent
Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio — Episode 1: Machine Transcription of the Manuscripts
The open dihypergraph dichotomy and the second level of the Borel hierarchy
Toward a probability theory for product logic: states, integral representation and reasoning
PhaseNet: A Deep-Neural-Network-Based Seismic Arrival Time Picking Method
Distributed Fault Detection and Accommodation in Dynamic Average Consensus
Multilevel Illumination Coding for Fourier Transform Interferometry in Fluorescence Spectroscopy
Distributed Maximum Likelihood using Dynamic Average Consensus
Feudal Reinforcement Learning for Dialogue Management in Large Domains
Improving Optimization in Models With Continuous Symmetry Breaking
Length of the longest common subsequence between overlapping words
Fairness Through Computationally-Bounded Awareness
The maximum number of $P_\ell$ copies in $P_k$-free graphs
Probably Approximately Metric-Fair Learning
Domain Adaptive Faster R-CNN for Object Detection in the Wild
Improved Distributed $Δ$-Coloring
Computing the Nucleolus of Weighted Cooperative Matching Games in Polynomial Time
Dynamic Spike Super-resolution and Applications to Ultrafast Ultrasound Imaging
GONet: A Semi-Supervised Deep Learning Approach For Traversability Estimation
Global well-posedness for the mass-critical stochastic nonlinear Schrödinger equation on $\mathbb{R}$: small initial data