Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks

Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of 32-bit floating point format training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize available dynamic range. We validate Flexpoint by training AlexNet, a deep residual network and a generative adversarial network, using a simulator implemented with the neon deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.
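The shared-exponent idea can be illustrated with a minimal sketch. It assumes 16-bit signed integer mantissas and picks the exponent from the largest magnitude currently in the tensor. That selection rule is an assumption made for illustration, not the exponent management scheme of the paper, and the function names `to_shared_exponent` and `from_shared_exponent` are hypothetical.

```python
import math


def to_shared_exponent(values, mantissa_bits=16):
    """Quantize a list of floats to integer mantissas plus one shared exponent.

    Assumption: the exponent is chosen from the current maximum magnitude so
    that no element overflows the signed mantissa range.
    """
    max_int = 2 ** (mantissa_bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Smallest exponent e such that max_abs / 2**e still fits in the mantissa.
    exponent = math.ceil(math.log2(max_abs / max_int))
    scale = 2.0 ** exponent
    mantissas = [max(-max_int, min(max_int, round(v / scale))) for v in values]
    return mantissas, exponent


def from_shared_exponent(mantissas, exponent):
    """Reconstruct floats from integer mantissas and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


values = [0.5, -1.0, 0.25, 0.001]
mantissas, exponent = to_shared_exponent(values)
restored = from_shared_exponent(mantissas, exponent)
```

Because every element shares one exponent, the rounding error per element is at most half of `2 ** exponent`, so small entries lose relative precision when a single large entry dominates the tensor.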

An Unsupervised Deep Learning Approach for Scenario Forecasts

In this paper, we propose a novel scenario forecasting approach which can be applied to a broad range of power system operations (e.g., wind, solar, load) over various forecast horizons and prediction intervals. This approach is model-free and data-driven, producing a set of scenarios that represent possible future behaviors based only on historical observations and point forecasts. It first applies a recently developed unsupervised deep learning framework, generative adversarial networks, to learn the intrinsic patterns in historical renewable generation data. Then, by solving an optimization problem, we are able to quickly generate a large number of realistic future scenarios. The proposed method has been applied to a wind power generation and forecasting dataset from the National Renewable Energy Laboratory. Simulation results indicate that our method is able to generate scenarios that capture spatial and temporal correlations. Our code and simulation datasets are freely available online.

Convolutional Normalizing Flows

Bayesian posterior inference is prevalent in various machine learning problems. Variational inference provides one way to approximate the posterior distribution; however, its expressive power is limited, and so is the accuracy of the resulting approximation. Recently, there has been a trend of using neural networks to approximate the variational posterior distribution due to the flexibility of neural network architectures. One way to construct a flexible variational distribution is to warp a simple density into a complex one by normalizing flows, where the resulting density can be analytically evaluated. However, there is a trade-off between the flexibility of the normalizing flow and the computational cost of efficient transformation. In this paper, we propose a simple yet effective architecture of normalizing flows, ConvFlow, based on convolution over the dimensions of the random input vector. Experiments on synthetic and real-world posterior inference problems demonstrate the effectiveness and efficiency of the proposed method.

GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks

Deep multitask networks, in which one neural network produces multiple predictive outputs, are more scalable and often better regularized than their single-task counterparts. Such advantages can potentially lead to gains in both speed and performance, but multitask networks are also difficult to train without finding the right balance between tasks. We present a novel gradient normalization (GradNorm) technique which automatically balances the multitask loss function by directly tuning the gradients to equalize task training rates. We show that for various network architectures, for both regression and classification tasks, and on both synthetic and real datasets, GradNorm improves accuracy and reduces overfitting over single networks, static baselines, and other adaptive multitask loss balancing techniques. GradNorm also matches or surpasses the performance of exhaustive grid search methods, despite only involving a single asymmetry hyperparameter \alpha. Thus, what was once a tedious search process which incurred exponentially more compute for each task added can now be accomplished within a few training runs, irrespective of the number of tasks. Ultimately, we hope to demonstrate that direct gradient manipulation affords us great control over the training dynamics of multitask networks and may be one of the keys to unlocking the potential of multitask learning.

Finding Heavily-Weighted Features in Data Streams

We introduce a new sub-linear space data structure, the Weight-Median Sketch, that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.
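As a rough illustration of sketching gradient updates instead of counts, the following is a minimal Count-Sketch-style table that accumulates signed real-valued updates and recovers a per-feature weight as the median across rows. It is a simplified sketch under stated assumptions, not the Weight-Median Sketch as specified in the paper: the hashing scheme, the logistic-loss `sgd_step`, the learning rate, and the way predictions are formed from median estimates are all hypothetical choices made for the example.

```python
import hashlib
import math
import statistics


class CountSketch:
    """Count-Sketch table that stores signed real-valued updates."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket_and_sign(self, row, feature):
        # Deterministic per-row hash; the lowest bit gives the sign,
        # the remaining bits give the bucket index.
        digest = hashlib.blake2b(f"{row}:{feature}".encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        return (h >> 1) % self.width, (1.0 if h & 1 else -1.0)

    def update(self, feature, delta):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, feature)
            self.table[row][bucket] += sign * delta

    def estimate(self, feature):
        votes = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(row, feature)
            votes.append(sign * self.table[row][bucket])
        return statistics.median(votes)


def sgd_step(sketch, example, label, lr=0.1):
    """One logistic-regression SGD step whose weight update goes into the sketch.

    `example` maps feature names to values; `label` is 0 or 1.
    """
    margin = sum(sketch.estimate(f) * v for f, v in example.items())
    prob = 1.0 / (1.0 + math.exp(-margin))
    grad = prob - label
    for f, v in example.items():
        sketch.update(f, -lr * grad * v)


sketch = CountSketch()
sgd_step(sketch, {"a": 1.0}, label=1)
weight_a = sketch.estimate("a")
```

Memory is fixed at `depth * width` floats regardless of how many distinct features appear in the stream; hash collisions add noise to the estimates, which the median across rows is meant to suppress.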

FADO: A Deterministic Detection/Learning Algorithm

This paper proposes and studies a detection technique for adversarial scenarios (dubbed deterministic detection). This technique provides an alternative detection methodology in case the usual stochastic methods are not applicable: this can be because the studied phenomenon does not follow a stochastic sampling scheme, samples are high-dimensional and subsequent multiple-testing corrections render results overly conservative, sample sizes are too low for asymptotic results (such as the central limit theorem) to kick in, or one cannot allow for the small probability of failure inherent to stochastic approaches. This paper instead designs a method based on insights from machine learning and online learning theory: this detection algorithm – named Online FAult Detection (FADO) – comes with theoretical guarantees of its detection capabilities. A version of the margin is found to regulate the detection performance of FADO. A precise expression is derived for bounding the performance, and experimental results are presented assessing the influence of the quantities involved. A case study of scene detection is used to illustrate the approach. The technology is closely related to the linear perceptron rule, and inherits its computational attractiveness and flexibility towards various extensions.

A Tutorial on Canonical Correlation Methods

Canonical correlation analysis is a family of multivariate statistical methods for the analysis of paired sets of variables. Since its proposition, canonical correlation analysis has for instance been extended to extract relations between two sets of variables when the sample size is insufficient in relation to the data dimensionality, when the relations have been considered to be non-linear, and when the dimensionality is too large for human interpretation. This tutorial explains the theory of canonical correlation analysis including its regularised, kernel, and sparse variants. Additionally, the deep and Bayesian CCA extensions are briefly reviewed. Together with the numerical examples, this overview provides a coherent compendium on the applicability of the variants of canonical correlation analysis. By bringing together techniques for solving the optimisation problems, evaluating the statistical significance and generalisability of the canonical correlation model, and interpreting the relations, we hope that this article can serve as a hands-on tool for applying canonical correlation methods in data analysis.

SWOOP: Top-k Similarity Joins over Set Streams

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams of sets. A prototypical example setting is that of tweets. A tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution makes it possible to efficiently maintain the top-k most similar tweets from a pair of rapid Twitter streams, e.g., to discover similar trends in two cities if the streams concern cities. Using a sliding window model, the top-k result changes as new sets in the stream enter the window or existing ones leave the window. Maintaining the top-k result under rapid streams is challenging. First, when a set arrives, it may form a new pair for the top-k result with any set already in the window. Second, when a set leaves the window, all its pairings in the top-k are invalidated and must be replaced. It is not enough to maintain the k most similar pairs, as less similar pairs may eventually be promoted to the top-k result. A straightforward solution that pairs every new set with all sets in the window and keeps all pairs for maintaining the top-k result is memory intensive and too slow. We propose SWOOP, a highly scalable stream join algorithm that solves these issues. Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs to update the top-k result at any time, and the stock is shown to be minimal. Our experiments confirm that SWOOP can deal with stream rates that are orders of magnitude faster than the rates of existing approaches.
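The straightforward baseline described above, which pairs every new set with all sets in the other stream's window and keeps every pair, can be sketched as follows. This illustrates the naive approach only, not SWOOP itself. Jaccard similarity, a count-based window per stream, and the class name `NaiveTopKJoin` are assumptions made for the example.

```python
import heapq
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveTopKJoin:
    """Keeps every cross-stream pair in the window; memory grows quadratically."""

    def __init__(self, k, window_size):
        self.k = k
        self.window_size = window_size
        self.windows = (deque(), deque())  # one count-based window per stream
        self.pairs = {}  # (id in stream 0, id in stream 1) -> similarity
        self.next_id = 0

    def insert(self, stream, items):
        set_id = self.next_id
        self.next_id += 1
        new_set = frozenset(items)
        own, other = self.windows[stream], self.windows[1 - stream]
        if len(own) == self.window_size:
            # The oldest set leaves the window: all of its pairs are invalidated.
            old_id, _ = own.popleft()
            self.pairs = {p: s for p, s in self.pairs.items() if old_id not in p}
        for other_id, other_set in other:
            key = (set_id, other_id) if stream == 0 else (other_id, set_id)
            self.pairs[key] = jaccard(new_set, other_set)
        own.append((set_id, new_set))
        return set_id

    def top_k(self):
        return heapq.nlargest(self.k, self.pairs.items(), key=lambda kv: kv[1])


join = NaiveTopKJoin(k=1, window_size=2)
join.insert(0, {"x", "y"})       # id 0
join.insert(1, {"x", "y", "z"})  # id 1
join.insert(1, {"x", "y"})       # id 2, identical to set 0
best = join.top_k()
```

Keeping all pairs is what allows a less similar pair to be promoted once a better one expires, and it is also what makes this baseline memory intensive.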

Deep density networks and uncertainty in recommender systems

Building robust online content recommendation systems requires learning complex interactions between user preferences and content features. The field has evolved rapidly in recent years from traditional multi-armed bandit and collaborative filtering techniques, with new methods integrating Deep Learning models that can capture non-linear feature interactions. Despite progress, the dynamic nature of online recommendations still poses great challenges, such as finding the delicate balance between exploration and exploitation. In this paper we provide a novel method, Deep Density Networks (DDN), which deconvolves measurement and data uncertainties and predicts the probability density of CTR (Click Through Rate), enabling us to perform more efficient exploration of the feature space. We show the usefulness of using DDN online in a real-world content recommendation system that serves billions of recommendations per day, and present online and offline results to evaluate the benefit of using DDN.

Moonshine: Distilling with Cheap Convolutions

Model distillation compresses a trained machine learning model, such as a neural network, into a smaller alternative such that it could be easily deployed in a resource limited setting. Unfortunately, this requires engineering two architectures: a student architecture smaller than the first teacher architecture but trained to emulate it. In this paper, we present a distillation strategy that produces a student architecture that is a simple transformation of the teacher architecture. Recent model distillation methods allow us to preserve most of the performance of the trained model after replacing convolutional blocks with a cheap alternative. In addition, distillation by attention transfer provides student network performance that is better than training that student architecture directly on data.

Bounding and Counting Linear Regions of Deep Neural Networks
Small Resolution Proofs for QBF using Dependency Treewidth
Consistency of Maximum Likelihood for Continuous-Space Network Models
Projection Theorems Using Effective Dimension
Sequential Multi-Class Labeling in Crowdsourcing
Moduli of regularity and rates of convergence for Fejér monotone sequences
Characterizing Sparse Connectivity Patterns in Neural Networks
Weighted Transformer Network for Machine Translation
Asymptotic properties of maximum likelihood estimator for the growth rate of a stable CIR process based on continuous time observations
Optimal rates of entropy estimation over Lipschitz balls
A Joint 3D-2D based Method for Free Space Detection on Roads
Conditioned Functional Limits and Applications to Queues
On Convergence of the Alternating Projection Method for Matrix Completion and Sparse Recovery Problems
Adaptive Bayesian Sampling with Monte Carlo EM
TAMU at KBP 2017: Event Nugget Detection and Coreference Resolution
The menu complexity of ‘one-and-a-half-dimensional’ mechanism design
On the Monetary Loss Due to Passive and Active Attacks on MIMO Smart Grid Communications
Synthetic and Natural Noise Both Break Neural Machine Translation
Semiparametric Estimation of Structural Functions in Nonseparable Triangular Models
Quickest Change Detection under Transient Dynamics: Theory and Asymptotic Analysis
A note on dispersing particles on a line
On Derandomizing Local Distributed Algorithms
Alpha-expansion is Exact on Stable Instances
Regret Bounds and Regimes of Optimality for User-User and Item-Item Collaborative Filtering
Distribution Systems Hardening against Natural Disasters
Towards Language-Universal End-to-End Speech Recognition
Unsupervised Learning of Semantic Audio Representations
Social Welfare and Profit Maximization from Revealed Preferences
Improved training for online end-to-end speech recognition systems
On ${\cal Z}_p$-norms of random vectors
Image Segmentation of Multi-Shaped Overlapping Objects
Combining Trajectory Optimization, Supervised Machine Learning, and Model Structure for Mitigating the Curse of Dimensionality in the Control of Bipedal Robots
Unsupervised Transformation Learning via Convex Relaxations
Sequence Pairs with Lowest Combined Autocorrelation and Crosscorrelation
Visually-Aware Fashion Recommendation and Design with Generative Image Models
Rudin-Shapiro-Like Polynomials with Maximum Asymptotic Merit Factor
Challenges in Disentangling Independent Factors of Variation
Ergodicity and Lyapunov functions for Langevin dynamics with singular potentials
Doppler-Radar Based Hand Gesture Recognition System Using Convolutional Neural Networks
Optimal control of a nonconvex perturbed sweeping process
High-order Tensor Completion for Data Recovery via Sparse Tensor-train Optimization
Quantum Lightning Never Strikes the Same State Twice
Modeling and Optimization of Complex Building Energy Systems with Deep Neural Networks
Non-Autoregressive Neural Machine Translation
Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net
Large-Scale Optimal Transport and Mapping Estimation
Performance analysis of carrier aggregation for various mobile network implementations scenario based on spectrum allocated
Estimation of Treatment Effects for Heterogeneous Matched Pairs Data with Probit Models
Explicit Deep Holes of Reed-Solomon Codes
Quality-Efficiency Trade-offs in Machine Learning for Text Processing
Regular Incidence Complexes, Polytopes, and C-Groups
Dynamical, structural and chemical heterogeneities in a binary metallic glass-forming liquid
Can Deep Reinforcement Learning Solve Erdos-Selfridge-Spencer Games?
Iterative Computation of Security Strategies of Matrix Games with Growing Action Set
Security Strategies of Both Players in Asymmetric Information Zero-Sum Stochastic Games with an Informed Controller
Learning Overcomplete HMMs
Signless Laplacian spectral radius and fractional (perfect) matchings in graphs
DeepRain: ConvLSTM Network for Precipitation Prediction using Multichannel Radar Data
Multi-Player Bandits Models Revisited
Can Maxout Units Downsize Restoration Networks? – Single Image Super-Resolution Using Lightweight CNN with Maxout Units
Performance Analysis of a Non-Orthogonal Multiple Access based Cooperative Relaying System over Rician Fading Channels
Congruences modulo powers of 5 for $k$-colored partitions
Sparse Attentive Backtracking: Long-Range Credit Assignment in Recurrent Networks
Interpreting Convolutional Neural Networks Through Compression
Reverse plane partitions of skew staircase shapes and $q$-Euler numbers
Universality for the random-cluster model on isoradial graphs
Emergent explosive synchronization in adaptive complex networks
Joint Altitude and Beamwidth Optimization for UAV-Enabled Multiuser Communications
Multi-mode Tracking of a Group of Mobile Sensors
Robust Truss Topology Optimization via Semidefinite A Difference-of-Convex Programming Approach
Critical scaling of the mutual information in two-dimensional disordered Ising models
Variance Reduction Result for a Projected Adaptive Biasing Force Method
Distributed Bayesian Piecewise Sparse Linear Models
The convergence guarantee of the iterative thresholding algorithm with suboptimal feedbacks for large systems
An EEG-based Image Annotation System
Reversible DNA codes over a family of non-chain rings
On extremal cacti with respect to the edge Szeged index and edge-vertex Szeged index
Beetle Antennae Search without Parameter Tuning (BAS-WPT) for Multi-objective Optimization
Unconstrained Scene Text and Video Text Recognition for Arabic Script
Whirling injections, surjections, and other functions between finite sets
ZipNet-GAN: Inferring Fine-grained Mobile Traffic Patterns via a Generative Adversarial Neural Network
Gaussian Lower Bound for the Information Bottleneck Limit
Strong convergence rates for explicit space-time discrete numerical approximations of stochastic Allen-Cahn equations
Hybrid stochastic kinetic description of two-dimensional traffic dynamics
Tits arrangements on cubic curves
Spatial Characteristics of Distortion Radiated from Antenna Arrays with Transceiver Nonlinearities
A Survey on Hardware Implementations of Visual Object Trackers
Cortical microcircuits as gated-recurrent neural networks
Revisionist Simulations: A New Approach to Proving Space Lower Bounds
Self-referential basis of undecidable dynamics: from The Liar Paradox and The Halting Problem to The Edge of Chaos
Snake graphs and continued fractions
Bridges with random length: Gaussian-Markovian case
Integrating Queuing Regime into Cognitive Radio Channel Aggregation Policies: A Performance Evaluation
A Distributed Power Routing Method between Regional Markets based on Bellman-Ford Algorithm
Exposing and exploiting structure: optimal code generation for high-order finite element methods
Bayesian model and dimension reduction for uncertainty propagation: applications in random media
Grafting for Combinatorial Boolean Model using Frequent Itemset Mining
Cache-Enabled Physical Layer Security for Video Streaming in Backhaul-Limited Cellular Networks
MSR-net: Low-light Image Enhancement Using Deep Convolutional Network
A feasibility approach for constructing combinatorial designs of circulant type
Detecting the direction of a signal on high-dimensional spheres: Non-null and Le Cam optimality results
A shape optimization algorithm for interface identification allowing topological changes
Remote Sensing Image Fusion Based on Two-stream Fusion Network
Detecting Symmetry in Designing Heat Exchanger Networks
A review of the deterministic and diffusion approximations for stochastic chemical reaction networks
Robust Optimization of PDEs with Random Coefficients Using a Multilevel Monte Carlo Method
Image Captioning and Classification of Dangerous Situations
Overlap in Observational Studies with High-Dimensional Covariates
On the Non-Polyhedricity of Sets with Upper and Lower Bounds in Dual Spaces
Gene flow across geographical barriers – scaling limits of random walks with obstacles
Non-uniqueness and mean-field criticality for percolation on nonunimodular transitive graphs
On the Strong Convergence of Forward-Backward Splitting in Reconstructing Jointly Sparse Signals
Streaming Robust Submodular Maximization: A Partitioned Thresholding Approach
Combinatorial Assortment Optimization
Unbounded cache model for online language modeling with open vocabulary
Extractive Multi-document Summarization Using Multilayer Networks
Canonical measures on metric graphs and a Kazhdan’s theorem
A product form and a sub-additive theorem for the general stochastic matching model
Convex Optimization with Nonconvex Oracles
Loglinear model selection and human mobility
Internalising Interaction Protocols as First-Class Programming Elements in Multi Agent Systems
Safe Adaptive Importance Sampling
Compression-aware Training of Deep Networks
Theoretical limitations of Encoder-Decoder GAN architectures
Latent hypernet: Exploring all Layers from Convolutional Neural Networks
Neural system identification for large populations separating ‘what’ and ‘where’
Optimizing ROOT IO For Analysis