A Deep Neural Network Compression Pipeline: Pruning, Quantization, Huffman Encoding

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, We introduce a three stage pipeline: pruning, quantization and Huffman encoding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman encoding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory, which has 180x less access energy.

A Generative Model of Words and Relationships from Multiple Sources

Neural Language Models are a powerful tool to meaningfully embed words into semantic vector spaces. However, learning vector space models of language generally relies on the availability of abundant and diverse training examples. In highly specialized domains this requirement may not be met due to difficulties in obtaining a large corpus, or the limited range of expression in average usage. Prior knowledge about entities in the language often exists in a knowledge base or ontology. We propose a generative model which allows for modeling and transfering semantic information in vector spaces by combining diverse data sources. We generalize the concept of co-occurrence from distributional semantics to include other types of relations between entities, evidence for which can come from a knowledge base (such as WordNet or UMLS). Our model defines a probability distribution over triplets consisting of word pairs with relations. Through stochastic maximum likelihood we learn a representation of these words as elements of a vector space and model the relations as affine transformations. We demonstrate the effectiveness of our generative approach by outperforming recent models on a knowledge-base completion task and demonstrating its ability to profit from the use of partially observed or fully unobserved data entries. Our model is capable of operating semi-supervised, where word pairs with no known relation are used as training data. We further demonstrate the usefulness of learning from different data sources with overlapping vocabularies.

A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A \approx W H. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel software to solve the problem for big datasets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices. We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.

Accelerated Discrete Distribution Clustering under Wasserstein Distance

In a variety of research areas, the bag of weighted vectors and the histogram are widely used descriptors for complex objects. Both can be expressed as discrete distributions. D2-clustering pursues the minimum total within-cluster variation for a set of discrete distributions subject to the Kantorovich-Wasserstein metric. D2-clustering has a severe scalability issue, the bottleneck being the computation of a centroid distribution that minimizes its sum of squared distances to the cluster members. In this paper, we develop three scalable optimization techniques, specifically, the subgradient descent method, ADMM, and modified Bregman ADMM, for computing the centroids of large clusters without compromising the objective function. The strengths and weaknesses of these techniques are examined through experiments; and scenarios for their respective usage are recommended. Moreover, we develop both serial and parallelized versions of the algorithms, collectively named the AD2-clustering. By experimenting with large-scale data, we demonstrate the computational efficiency of the new methods and investigate their convergence properties and numerical stability. The clustering results obtained on several datasets in different domains are highly competitive in comparison with some widely used methods’ in the corresponding areas.

Asynchronous Distributed Gibbs Sampling

Gibbs sampling is a widely used Markov Chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. It is widely believed that MCMC methods do not extend easily to parallel implementations, as their inherently sequential nature incurs a large synchronization cost. This means that new solutions are needed to bring Bayesian analysis fully into the era of large-scale computation. In this paper, we present a novel scheme – Asynchronous Distributed Gibbs (ADG) sampling – that allows us to perform MCMC in a parallel fashion with no synchronization or locking, avoiding the typical performance bottlenecks of parallel algorithms. Our method is especially attractive in settings, such as hierarchical random-effects modeling in which each observation has its own random effect, where the problem dimension grows with the sample size. We prove convergence under some basic regularity conditions, and discuss the proof for similar parallelization schemes for other iterative algorithms. We provide three examples that illustrate some of the algorithm’s properties with respect to scaling. Because our hardware resources are bounded, we have not yet found a limit to the algorithm’s scaling, and thus its true capabilities remain unknown.

Deep Haar Scattering Networks

An orthogonal Haar scattering transform is a deep network, computed with a hierarchy of additions, subtractions and absolute values, over pairs of coefficients. It provides a simple mathematical model for unsupervised deep network learning. It implements non-linear contractions, which are optimized for classification, with an unsupervised pair matching algorithm, of polynomial complexity. A structured Haar scattering over graph data computes permutation invariant representations of groups of connected points in the graph. If the graph connectivity is unknown, unsupervised Haar pair learning can provide a consistent estimation of connected dyadic groups of points. Classification results are given on image data bases, defined on regular grids or graphs, with a connectivity which may be known or unknown.

Distributed Weighted Parameter Averaging for SVM Training on Big Data

Two popular approaches for distributed training of SVMs on big data are parameter averaging and ADMM. Parameter averaging is efficient but suffers from loss of accuracy with increase in number of partitions, while ADMM in the feature space is accurate but suffers from slow convergence. In this paper, we report a hybrid approach called weighted parameter averaging (WPA), which optimizes the regularized hinge loss with respect to weights on parameters. The problem is shown to be same as solving SVM in a projected space. We also demonstrate an O(\frac{1}{N}) stability bound on final hypothesis given by WPA, using novel proof techniques. Experimental results on a variety of toy and real world datasets show that our approach is significantly more accurate than parameter averaging for high number of partitions. It is also seen the proposed method enjoys much faster convergence compared to ADMM in features space.

Distributionally Robust Logistic Regression

This paper proposes a distributionally robust approach to logistic regression. We use the Wasserstein distance to construct a ball in the space of probability distributions centered at the uniform distribution on the training samples. If the radius of this ball is chosen judiciously, we can guarantee that it contains the unknown data-generating distribution with high confidence. We then formulate a distributionally robust logistic regression model that minimizes a worst-case expected logloss function, where the worst case is taken over all distributions in the Wasserstein ball. We prove that this optimization problem admits a tractable reformulation and encapsulates the classical as well as the popular regularized logistic regression problems as special cases. We further propose a distributionally robust approach based on Wasserstein balls to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier. These bounds are given by the optimal values of two highly tractable linear programs. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.

Fast Algorithms for Convolutional Neural Networks

We derive a new class of fast algorithms for convolutional neural networks using Winograd’s minimal filtering algorithms. Specifically we derive algorithms for network layers with 3×3 kernels, which are the preferred kernel size for image recognition tasks. The best of our algorithms reduces arithmetic complexity up to 4X compared with direct convolution, while using small block sizes with limited transform overhead and high computational intensity. By comparison, FFT based convolution requires larger block sizes and significantly greater transform overhead to achieve an equal complexity reduction. We measure the accuracy of our algorithms to be sufficient for deep learning and inference with fp32 or fp16 data. Also, we demonstrate the practical application of our approach with a simple CPU implementation of our slowest algorithm using the Intel Math Kernel Library, and report VGG network inference results that are 2.6X as fast as Caffe with an effective utilization of 109%. We believe these are the highest utilization convnet inference results to date, and that they can be improved significantly with more implementation effort. We also believe that the new algorithms lend themselves equally well to GPU and FPGA implementations for both training and inference.

Fast Algorithms for Exact String Matching

Given a pattern string P of length n and a query string T of length m, where the characters of P and T are drawn from an alphabet of size \Delta, the {\em exact string matching} problem consists of finding all occurrences of P in T. For this problem, we present algorithms that in O(n\Delta^2) time pre-process P to essentially identify sparse(P), a rarely occurring substring of P, and then use it to find occurrences of P in T efficiently. Our algorithms require a worst case search time of O(m), and expected search time of O(m/min(|sparse(P)|, \Delta)), where |sparse(P)| is at least \delta (i.e. the number of distinct characters in P), and for most pattern strings it is observed to be \Omega(n^{1/2}).

Knowledge-based system for collaborative process specification

This paper presents an ontology-based approach for the design of a collaborative business process model (CBP). This CBP is considered as a specification of needs in order to build a collaboration information system (CIS) for a network of organisations. The study is a part of a model driven engineering approach of the CIS in a specific enterprise interoperability framework that will be summarised. An adaptation of the Business Process Modeling Notation (BPMN) is used to represent the CBP model. We develop a knowledge-based system (KbS) which is composed of three main parts: knowledge gathering, knowledge representation and reasoning, and collaborative business process modelling. The first part starts from a high abstraction level where knowledge from business partners is captured. A collaboration ontology is defined in order to provide a structure to store and use the knowledge captured. In parallel, we try to reuse generic existing knowledge about business processes from the MIT Process Handbook repository. This results in a collaboration process ontology that is also described. A set of rules is defined in order to extract knowledge about fragments of the CBP model from the two previous ontologies. These fragments are finally assembled in the third part of the KbS. A prototype of the KbS has been developed in order to implement and support this approach. The prototype is a computer-aided design tool of the CBP. In this paper, we will present the theoretical aspects of each part of this KbS as well as the tools that we developed and used in order to support its functionalities.

Lecture notes on ridge regression

The linear regression model cannot be fitted to high-dimensional data, as the high-dimensionality brings about empirical non-identifiability. Penalized regression overcomes this non-identifiability by augmentation of the loss function by a penalty (i.e. a function of regression coefficients). The ridge penalty is the sum of squared regression coefficients, giving rise to ridge regression. Here many aspect of ridge regression are reviewed e.g. moments, mean squared error, its equivalence to constrained estimation, and its relation to Bayesian regression. Finally, its behaviour and use are illustrated in simulation and on omics data.

RDF Knowledge Graph Visualization From a Knowledge Extraction System

In this paper, we present a system to visualize RDF knowledge graphs. These graphs are obtained from a knowledge extraction system designed by GEOLSemantics. This extraction is performed using natural language processing and trigger detection. The user can visualize subgraphs by selecting some ontology features like concepts or individuals. The system is also multilingual, with the use of the annotated ontology in English, French, Arabic and Chinese.

A glass half full interpretation of the replicability of psychological science

A New Method for tackling Asymmetric Decision Problems

A note on corollaries to Tokuyama’s Identity for symplectic Schur $Q$-Functions

A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

A tight bound on the size of certain separating hash families

A transference approach to a Roth-type theorem in the squares

A Wait-Free Stack

A weighted cellular matrix-tree theorem, with applications to complete colorful and cubical complexes

Accelerating MCMC with active subspaces

Active Perception for Multimodal Object Category Recognition Using Information Gain

Almost partitioning a 3-edge-coloured $K_{n,n}$ into 5 monochromatic cycles

An Algebraic Geometric Approach to Nivat’s Conjecture

An Introduction to Twisted Particle Filters and Parameter Estimation in Non-linear State-space Models

Assigning polymorphic endogenous retrovirus integration sites using a mixture model

Asymptotic confidence bands for copulas based on the local linear kernel estimator

Asymptotic Normality of In- and Out-Degree Counts in a Preferential Attachment Model

Automated Discovery and Proof of Congruence Theorems for Partial Sums of Combinatorial Sequences

Chebyshev polynomials, quadratic surds and a variation of Pascal’s triangle

Chimera-like states in modular neural networks

Clamping Improves TRW and Mean Field Approximations

Coloring Square-free Berge Graphs

Combinatorial Auctions with Conflict-Based Externalities

Community detection for interaction networks

Complexity of the Game Domination Problem

Convergence of Stochastic Gradient Descent for PCA

Convolutional Networks on Graphs for Learning Molecular Fingerprints

Correlation functions of real zeros of random polynomials

Data-based stochastic model reduction for the Kuramoto–Sivashinsky equation

Determination of the Internet Anonymity Influence on the Level of Aggression and Usage of Obscene Lexis

Disk storage management for LHCb based on Data Popularity estimator

Distances in Dense Subsets of $\mathbb{Z}^d$

Distributed Inference for Relay-Assisted Sensor Networks With Intermittent Measurements Over Fading Channels

Divisionally free Restrictions of Reflection Arrangements

Dynamics of multivariate default system in random environment

Effects of Anderson localization physics on quench dynamics

Efficiently Finding All Maximal $α$-gapped Repeats

EM estimation of a Structural Equation Model

Embedding simplices into sets of positive upper density in $\mathbb{R}^d$

Enhanced Bilingual Evaluation Understudy

Experimental Study of Synchronization of Coupled Electrical Self-Oscillators and Comparison to the Sakaguchi-Kuramoto model

Fault Tolerance in Distributed Neural Computing

Feynman identity for planar graphs

Finding Desirable Objects under Group Categorical Preferences

Fluctuations of the total number of critical points of random spherical harmonics

Four variants of the Fourier-analytic transference principle

FPT Approximation Schemes for Maximizing Submodular Functions

Fractional diffusion equation with distributed-order material derivative. Stochastic foundations

Functional central limit theorems in survey sampling

Fundamental Results for a Generic Implementation of Barriers using Optical Interconnects

Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree

Generating Permutations with Restricted Containers

Generative Adversarial Networks in Estimation of Distribution Algorithms for Combinatorial Optimization

GRAPLEr: A Distributed Collaborative Environment for Lake Ecosystem Modeling that Integrates Overlay Networks, High-throughput Computing, and Web Services

Higher-order asymptotics for the parametric complexity

Homoscedasticity tests valid in both low and high-dimensional regressions

Improved bridge constructs for stochastic differential equations

iotools: High-Performance I/O Tools for R

Large deviations for the empirical measure of random polynomials: revisit of the Zeitouni-Zelditch theorem

Learning From Missing Data Using Selection Bias in Movie Recommendation

Learning without Recall: A Case for Log-Linear Learning

Maximum Likelihood Learning With Arbitrary Treewidth via Fast-Mixing Parameter Sets

Modeling extreme values by the residual coefficient of variation

Multidimensional two-component Gaussian mixtures detection

Multi-objective Differential Evolution with Helper Functions for Constrained Optimization

Necessary conditions for the existence of 3-designs over finite fields with nontrivial automorphism groups

Notes on Various Methods for Constructing Directed Strongly Regular Graphs

On Classification Issues within Ensemble-Based Complex System Simulation Tasks

On Mixing Properties of Some INAR Models

On the Complexity of Robust PCA and $\ell_1$-norm Low-Rank Matrix Approximation

On the Hausdorff Continuity of Free Lèvy Processes and Free Convolution Semigroups

On the Maximal Displacement of Subcritical Branching Random Walks

On the policy improvement algorithm in continuous time

On the similarity of symbol frequency distributions with heavy tails

On the stability of some Erdős–Ko–Rado type results

On trees with the same restricted U-polynomial and the Prouhet-Tarry-Escott problem

Optimal quantum algorithm for polynomial interpolation

Parallel Metric Tree Embedding based on an Algebraic View on Moore-Bellman-Ford

Partitioning random graphs into monochromatic components

Pattern Formation Problem for Synchronous Mobile Robots in the Three Dimensional Euclidean Space

Pinsker bound under measurement budget constrain: optimal allocation

Poisson point process convergence and extreme values in stochastic geometry

Polar decomposition of scale-homogeneous measures with application to Lévy measures of strictly stable laws

Polish – English Speech Statistical Machine Translation Systems for the IWSLT 2013

Polish to English Statistical Machine Translation

Pure threshold strategies for a two-node tandem network under partial information

QUDA: A Direct Approach for Sparse Quadratic Discriminant Analysis

Real-time Bidding based Vehicle Sharing

Real-Time Statistical Speech Translation

Regret Lower Bound and Optimal Algorithm in Finite Stochastic Partial Monitoring

Roads and cities of $18^{th}$ century France

Single-Seed Cascades on Clustered Networks

Special elements in the quaternion algebras over finite fields

Spectra of nearly Hermitian random matrices

Stochastic differential equation based on a Gaussian potential field to model fishing vessels trajectories

Stochastic representation of fractional subdiffusion equation. The case of infinitely divisible waiting times, Levy noise and space-time-dependent coefficients

Supporting interoperability of collaborative networks through engineering of a service-based Mediation Information System (MISE 2.0)

Supporting Regularized Logistic Regression Privately and Efficiently

Supratransmission in a Disordered Nonlinear Periodic Structure

Symbol Emergence in Robotics: A Survey

Temporal and structural heterogeneities emerging in adaptive temporal networks

The automorphism group of the $s$-stable Kneser graphs

The ‘handedness’ of language: Directional symmetry breaking of sign usage in words

The Many-Body Localized Phase of the Quantum Random Energy Model

The maximizing set of the asymptotic normalized log-likelihood for partially observed Markov chains

The Maximum Likelihood Data Singular Locus

Transient one-dimensional diffusions conditioned to converge to a different limit point

Two-color balanced affine urn models with multiple drawings II: large-index and triangular urns

Using consumer behavior data to reduce energy consumption in smart homes

Variable Selection for Additive Partial Linear Quantile Regression with Missing Covariates

Very Deep Multilingual Convolutional Neural Networks for LVCSR

VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing

Weighted least squares estimator for the squared radial Ornstein-Uhlenbeck process