Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline of pruning, quantization, and Huffman encoding, whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing. Finally, we apply Huffman encoding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache, which has 180x lower access energy than off-chip DRAM memory.
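The three stages named above can be illustrated with a toy sketch. This is not the authors' implementation: the threshold, the number of clusters `k`, the linear centroid initialization, and all function names are assumptions made for illustration, and the retraining steps are omitted entirely.

```python
import heapq
from collections import Counter

def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def share_weights(values, k, iters=20):
    """1-D k-means (Lloyd's algorithm) with linearly spaced initial centroids.

    Requires k >= 2. Returns (centroids, centroid index assigned to each value).
    """
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest centroid for every value.
        assign = [min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assign

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (count, unique tie-breaker, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge adds one bit to the symbols below it
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths

# Toy layer: prune small weights, cluster the survivors, entropy-code the indices.
weights = [0.9, -0.05, 0.02, -1.1, 1.0, 0.01, -0.95, 0.03]
pruned = prune(weights, threshold=0.5)
kept = [w for w in pruned if w != 0.0]
centroids, indices = share_weights(kept, k=2)
code_lengths = huffman_code_lengths(indices)
total_bits = sum(code_lengths[i] for i in indices)
print(centroids, indices, total_bits)
```

After weight sharing, only the small codebook of centroids and a short index per surviving connection need to be stored, which is where the reduction in bits per connection comes from; Huffman coding then shortens the indices that occur most often.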
Neural language models are a powerful tool to meaningfully embed words into semantic vector spaces. However, learning vector space models of language generally relies on the availability of abundant and diverse training examples. In highly specialized domains this requirement may not be met due to difficulties in obtaining a large corpus, or the limited range of expression in average usage. Prior knowledge about entities in the language often exists in a knowledge base or ontology. We propose a generative model which allows for modeling and transferring semantic information in vector spaces by combining diverse data sources. We generalize the concept of co-occurrence from distributional semantics to include other types of relations between entities, evidence for which can come from a knowledge base (such as WordNet or UMLS). Our model defines a probability distribution over triplets consisting of word pairs with relations. Through stochastic maximum likelihood we learn a representation of these words as elements of a vector space and model the relations as affine transformations. We demonstrate the effectiveness of our generative approach by outperforming recent models on a knowledge-base completion task and demonstrating its ability to profit from the use of partially observed or fully unobserved data entries. Our model is capable of operating in a semi-supervised setting, where word pairs with no known relation are used as training data. We further demonstrate the usefulness of learning from different data sources with overlapping vocabularies.
Non-negative matrix factorization (NMF) is the problem of determining two non-negative low-rank factors $W$ and $H$ for a given input matrix $A$ such that $A \approx W H$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel software to solve the problem for big datasets. Existing distributed-memory algorithms are limited in terms of performance and applicability, as they are implemented using Hadoop and are designed only for sparse matrices. We propose a distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for $W$ and $H$. To our knowledge, our algorithm is the first high-performance parallel algorithm for NMF. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for dense and sparse matrices, and (2) it allows the user to choose from among multiple algorithms for solving local NLS subproblems within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.
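The alternating structure described above can be sketched serially as follows. This is a single-process illustration only, with no MPI or data distribution; the choice of projected gradient descent as the NLS solver, the function names, and the iteration counts are assumptions, not the solvers or settings of the paper.

```python
import numpy as np

def nls_projected_gradient(B, C, X, steps=50):
    """Approximately solve min_{X >= 0} 0.5 * ||C - B X||_F^2 from a warm start X.

    Projected gradient descent with step 1/L, where L is the largest
    eigenvalue of B^T B (the Lipschitz constant of the gradient).
    """
    BtB = B.T @ B
    BtC = B.T @ C
    L = np.linalg.norm(BtB, 2)
    if L == 0.0:
        return X
    for _ in range(steps):
        grad = BtB @ X - BtC
        X = np.maximum(X - grad / L, 0.0)  # gradient step, then project onto X >= 0
    return X

def nmf_anls(A, k, iters=100, seed=0):
    """Alternate between the NLS subproblems for H (W fixed) and W (H fixed)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H = nls_projected_gradient(W, A, H)            # min over H of ||A - W H||
        W = nls_projected_gradient(H.T, A.T, W.T).T    # min over W via the transposed problem
    return W, H

# Small synthetic example: an exactly rank-2 non-negative matrix.
rng = np.random.default_rng(1)
A = rng.random((6, 2)) @ rng.random((2, 5))
W, H = nmf_anls(A, k=2)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```

Swapping `nls_projected_gradient` for another non-negative least squares routine leaves `nmf_anls` unchanged, which mirrors the flexibility the abstract claims for the local subproblem solver.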
In a variety of research areas, the bag of weighted vectors and the histogram are widely used descriptors for complex objects. Both can be expressed as discrete distributions. D2-clustering pursues the minimum total within-cluster variation for a set of discrete distributions subject to the Kantorovich-Wasserstein metric. D2-clustering has a severe scalability issue, the bottleneck being the computation of a centroid distribution that minimizes its sum of squared distances to the cluster members. In this paper, we develop three scalable optimization techniques, specifically the subgradient descent method, ADMM, and modified Bregman ADMM, for computing the centroids of large clusters without compromising the objective function. The strengths and weaknesses of these techniques are examined through experiments, and scenarios for their respective usage are recommended. Moreover, we develop both serial and parallelized versions of the algorithms, collectively named AD2-clustering. By experimenting with large-scale data, we demonstrate the computational efficiency of the new methods and investigate their convergence properties and numerical stability. The clustering results obtained on several datasets in different domains are highly competitive with those of some widely used methods in the corresponding areas.
Gibbs sampling is a widely used Markov Chain Monte Carlo (MCMC) method for numerically approximating integrals of interest in Bayesian statistics and other mathematical sciences. It is widely believed that MCMC methods do not extend easily to parallel implementations, as their inherently sequential nature incurs a large synchronization cost. This means that new solutions are needed to bring Bayesian analysis fully into the era of large-scale computation. In this paper, we present a novel scheme, Asynchronous Distributed Gibbs (ADG) sampling, that allows us to perform MCMC in a parallel fashion with no synchronization or locking, avoiding the typical performance bottlenecks of parallel algorithms. Our method is especially attractive in settings where the problem dimension grows with the sample size, such as hierarchical random-effects modeling in which each observation has its own random effect. We prove convergence under some basic regularity conditions, and discuss the proof for similar parallelization schemes for other iterative algorithms. We provide three examples that illustrate some of the algorithm’s properties with respect to scaling. Because our hardware resources are bounded, we have not yet found a limit to the algorithm’s scaling, and thus its true capabilities remain unknown.
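To make the "inherently sequential nature" concrete, here is a minimal sketch of an ordinary, single-threaded Gibbs sampler. It is not the ADG scheme of the abstract; the bivariate normal target, the function name, and the burn-in length are assumptions chosen only because the full conditionals are known in closed form.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sequential Gibbs sampler for a standard bivariate normal with correlation rho.

    Each coordinate is redrawn from its full conditional given the other:
    x | y ~ N(rho * y, 1 - rho**2), and symmetrically for y | x.
    """
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)  # needs the most recent y
        y = rng.gauss(rho * x, sd)  # needs the x drawn on the previous line
        if t >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.5, n_samples=20000)
mean_x = sum(x for x, _ in samples) / len(samples)
mean_xy = sum(x * y for x, y in samples) / len(samples)
print(mean_x, mean_xy)  # Monte Carlo estimates of E[x] = 0 and E[xy] = rho
```

Each update reads the value written by the previous one, so a naive parallel version would have to synchronize after every draw; this is the cost that an asynchronous scheme aims to avoid.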
An orthogonal Haar scattering transform is a deep network computed with a hierarchy of additions, subtractions and absolute values over pairs of coefficients. It provides a simple mathematical model for unsupervised deep network learning. It implements non-linear contractions, which are optimized for classification with an unsupervised pair matching algorithm of polynomial complexity. A structured Haar scattering over graph data computes permutation-invariant representations of groups of connected points in the graph. If the graph connectivity is unknown, unsupervised Haar pair learning can provide a consistent estimation of connected dyadic groups of points. Classification results are given on image databases defined on regular grids or graphs, with a connectivity which may be known or unknown.
Two popular approaches for distributed training of SVMs on big data are parameter averaging and ADMM. Parameter averaging is efficient but loses accuracy as the number of partitions increases, while ADMM in the feature space is accurate but suffers from slow convergence. In this paper, we report a hybrid approach called weighted parameter averaging (WPA), which optimizes the regularized hinge loss with respect to weights on parameters. The problem is shown to be the same as solving an SVM in a projected space. We also demonstrate an $O(\frac{1}{N})$ stability bound on the final hypothesis given by WPA, using novel proof techniques. Experimental results on a variety of toy and real-world datasets show that our approach is significantly more accurate than parameter averaging for a high number of partitions. The proposed method also converges much faster than ADMM in the feature space.
This paper proposes a distributionally robust approach to logistic regression. We use the Wasserstein distance to construct a ball in the space of probability distributions centered at the uniform distribution on the training samples. If the radius of this ball is chosen judiciously, we can guarantee that it contains the unknown data-generating distribution with high confidence. We then formulate a distributionally robust logistic regression model that minimizes a worst-case expected logloss function, where the worst case is taken over all distributions in the Wasserstein ball. We prove that this optimization problem admits a tractable reformulation and encapsulates the classical as well as the popular regularized logistic regression problems as special cases. We further propose a distributionally robust approach based on Wasserstein balls to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier. These bounds are given by the optimal values of two highly tractable linear programs. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.
We derive a new class of fast algorithms for convolutional neural networks using Winograd’s minimal filtering algorithms. Specifically, we derive algorithms for network layers with 3×3 kernels, which are the preferred kernel size for image recognition tasks. The best of our algorithms reduces arithmetic complexity by up to 4X compared with direct convolution, while using small block sizes with limited transform overhead and high computational intensity. By comparison, FFT-based convolution requires larger block sizes and significantly greater transform overhead to achieve an equal complexity reduction. We measure the accuracy of our algorithms to be sufficient for deep learning and inference with fp32 or fp16 data. Also, we demonstrate the practical application of our approach with a simple CPU implementation of our slowest algorithm using the Intel Math Kernel Library, and report VGG network inference results that are 2.6X as fast as Caffe with an effective utilization of 109%. We believe these are the highest utilization convnet inference results to date, and that they can be improved significantly with more implementation effort. We also believe that the new algorithms lend themselves equally well to GPU and FPGA implementations for both training and inference.
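The idea behind minimal filtering can be shown with the smallest one-dimensional case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The sketch below is only this 1-D building block under illustrative function names, not the 3×3 layer algorithms of the paper, which nest such transforms in two dimensions.

```python
def direct_f23(d, g):
    """Direct computation: 2 outputs of a 3-tap filter g over 4 inputs d (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]

def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 data-by-filter multiplications."""
    # Filter transform: depends only on g, so it can be computed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Data transform (additions only) followed by 4 elementwise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
print(direct_f23(d, g), winograd_f23(d, g))
```

The saving comes from moving work out of the multiplications and into additions and a filter transform that is amortized over many input tiles. Larger tiles reduce the multiplication count further at the price of more transform work, which is the trade-off the abstract refers to when it mentions block size and transform overhead.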
Given a pattern string $P$ of length $n$ and a query string $T$ of length $m$, where the characters of $P$ and $T$ are drawn from an alphabet of size $\Delta$, the {\em exact string matching} problem consists of finding all occurrences of $P$ in $T$. For this problem, we present algorithms that pre-process $P$ in $O(n\Delta^2)$ time to identify $sparse(P)$, a rarely occurring substring of $P$, and then use it to find occurrences of $P$ in $T$ efficiently. Our algorithms require a worst-case search time of $O(m)$ and an expected search time of $O(m/\min(|sparse(P)|, \Delta))$, where $|sparse(P)|$ is at least $\delta$ (i.e. the number of distinct characters in $P$), and for most pattern strings it is observed to be $\Omega(n^{1/2})$.
This paper presents an ontology-based approach for the design of a collaborative business process model (CBP). This CBP is considered as a specification of needs in order to build a collaboration information system (CIS) for a network of organisations. The study is a part of a model driven engineering approach of the CIS in a specific enterprise interoperability framework that will be summarised. An adaptation of the Business Process Modeling Notation (BPMN) is used to represent the CBP model. We develop a knowledge-based system (KbS) which is composed of three main parts: knowledge gathering, knowledge representation and reasoning, and collaborative business process modelling. The first part starts from a high abstraction level where knowledge from business partners is captured. A collaboration ontology is defined in order to provide a structure to store and use the knowledge captured. In parallel, we try to reuse generic existing knowledge about business processes from the MIT Process Handbook repository. This results in a collaboration process ontology that is also described. A set of rules is defined in order to extract knowledge about fragments of the CBP model from the two previous ontologies. These fragments are finally assembled in the third part of the KbS. A prototype of the KbS has been developed in order to implement and support this approach. The prototype is a computer-aided design tool of the CBP. In this paper, we will present the theoretical aspects of each part of this KbS as well as the tools that we developed and used in order to support its functionalities.
The linear regression model cannot be fitted to high-dimensional data, as the high dimensionality brings about empirical non-identifiability. Penalized regression overcomes this non-identifiability by augmenting the loss function with a penalty (i.e. a function of the regression coefficients). The ridge penalty is the sum of squared regression coefficients, giving rise to ridge regression. Here many aspects of ridge regression are reviewed, e.g. moments, mean squared error, its equivalence to constrained estimation, and its relation to Bayesian regression. Finally, its behaviour and use are illustrated in simulation and on omics data.
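A minimal sketch of why the penalty restores identifiability, assuming the standard closed-form ridge estimator $(X^\top X + \lambda I)^{-1} X^\top y$ with no intercept; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator: solve (X^T X + lam * I) beta = X^T y for beta (requires lam > 0)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# One observation, two covariates: more parameters than samples.
X = np.array([[1.0, 1.0]])
y = np.array([2.0])

# X^T X has rank 1 < 2, so ordinary least squares has no unique solution.
print(np.linalg.matrix_rank(X.T @ X))

# Adding lam * I makes the system invertible; larger lam shrinks the estimate.
for lam in (0.1, 1.0, 10.0):
    print(lam, ridge(X, y, lam))
```

In this toy case every $\beta$ with $\beta_1 + \beta_2 = 2$ fits the single observation exactly, and the ridge penalty selects one shrunken estimate with equal coefficients, here $2/(2+\lambda)$ each.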
In this paper, we present a system to visualize RDF knowledge graphs. These graphs are obtained from a knowledge extraction system designed by GEOLSemantics. This extraction is performed using natural language processing and trigger detection. The user can visualize subgraphs by selecting some ontology features like concepts or individuals. The system is also multilingual, with the use of the annotated ontology in English, French, Arabic and Chinese.