ZNN – A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-Core and Many-Core Shared Memory Machines

Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent’s theorem to the task dependency graph implies that linear speedup with the number of processors is attainable within the PRAM model of parallel computation, for wide network architectures. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. We implement the algorithm with a publicly available software package called ZNN. Benchmarking with multi-core CPUs shows that ZNN can attain speedup roughly equal to the number of physical cores. We also show that ZNN can attain over 90x speedup on a many-core CPU (Xeon Phi Knights Corner). These speedups are achieved for network architectures with widths that are in common use. The task parallelism of the ZNN algorithm is suited to CPUs, while the SIMD parallelism of previous algorithms is compatible with GPUs. Through examples, we show that ZNN can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and density and size of the output patch. ZNN may be less costly to develop and maintain, due to the relative ease of general-purpose CPU programming.

A ‘Gibbs-Newton’ Technique for Enhanced Inference of Multivariate Polya Parameters and Topic Models

Hyper-parameters play a major role in the learning and inference process of latent Dirichlet allocation (LDA). In order to begin the LDA latent variables learning process, these hyper-parameters values need to be pre-determined. We propose an extension for LDA that we call ‘Latent Dirichlet allocation Gibbs Newton’ (LDA-GN), which places non-informative priors over these hyper-parameters and uses Gibbs sampling to learn appropriate values for them. At the heart of LDA-GN is our proposed ‘Gibbs-Newton’ algorithm, which is a new technique for learning the parameters of multivariate Polya distributions. We report Gibbs-Newton performance results compared with two prominent existing approaches to the latter task: Minka’s fixed-point iteration method and the Moments method. We then evaluate LDA-GN in two ways: (i) by comparing it with standard LDA in terms of the ability of the resulting topic models to generalize to unseen documents; (ii) by comparing it with standard LDA in its performance on a binary classification task.

Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling

There is an explosion of data, documents, and other content, and people require tools to analyze and interpret these, tools to turn the content into information and knowledge. Topic modeling have been developed to solve these problems. Topic models such as LDA [Blei et. al. 2003] allow salient patterns in data to be extracted automatically. When analyzing texts, these patterns are called topics. Among numerous extensions of LDA, few of them can reliably analyze multiple groups of documents and extract topic similarities. Recently, the introduction of differential topic modeling (SPDP) [Chen et. al. 2012] performs uniformly better than many topic models in a discriminative setting. There is also a need to improve the sampling speed for topic models. While some effort has been made for distributed algorithms, there is no work currently done using graphical processing units (GPU). Note the GPU framework has already become the most cost-efficient platform for many problems. In this thesis, I propose and implement a scalable multi-GPU distributed parallel framework which approximates SPDP. Through experiments, I have shown my algorithms have a gain in speed of about 50 times while being almost as accurate, with only one single cheap laptop GPU. Furthermore, I have shown the speed improvement is sublinearly scalable when multiple GPUs are used, while fairly maintaining the accuracy. Therefore on a medium-sized GPU cluster, the speed improvement could potentially reach a factor of a thousand. Note SPDP is just a representative of other extensions of LDA. Although my algorithm is implemented to work with SPDP, it is designed to be a general enough to work with other topic models. The speed-up on smaller collections (i.e., 1000s of documents), means that these more complex LDA extensions could now be done in real-time, thus opening up a new way of using these LDA models in industry.

Hollow Heaps

We introduce the hollow heap, a very simple data structure with the same amortized efficiency as the classical Fibonacci heap. All heap operations except delete and delete-min take O(1) time, worst case as well as amortized; delete and delete-min take O(\log n) amortized time on a heap of n items. Hollow heaps are by far the simplest structure to achieve this. Hollow heaps combine two novel ideas: the use of lazy deletion and re-insertion to do decrease-key operations, and the use of a dag (directed acyclic graph) instead of a tree or set of trees to represent a heap. Lazy deletion produces hollow nodes (nodes without items), giving the data structure its name.

Big Data Analytics-Enhanced Cloud Computing: Challenges, Architectural Elements, and Future Directions

The emergence of cloud computing has made dynamic provisioning of elastic capacity to applications on-demand. Cloud data centers contain thousands of physical servers hosting orders of magnitude more virtual machines that can be allocated on demand to users in a pay-as-you-go model. However, not all systems are able to scale up by just adding more virtual machines. Therefore, it is essential, even for scalable systems, to project workloads in advance rather than using a purely reactive approach. Given the scale of modern cloud infrastructures generating real time monitoring information, along with all the information generated by operating systems and applications, this data poses the issues of volume, velocity, and variety that are addressed by Big Data approaches. In this paper, we investigate how utilization of Big Data analytics helps in enhancing the operation of cloud computing environments. We discuss diverse applications of Big Data analytics in clouds, open issues for enhancing cloud operations via Big Data analytics, and architecture for anomaly detection and prevention in clouds along with future research directions.

Reduced Palm Intensity for Track Extraction

On the shadow moments of apparently infinite-mean phenomena

On Hypoelliptic Bridge

On the Decomposition of GF(4)-Representable Signed-Graphic Matroids

On the one–dimensional spectral Heat content for stable processes

On the Greatest Common Divisor of $\binom{qn}{q}, \binom{qn}{2q},\dots, \binom{qn}{qn-q}$

And/or trees: A local limit point of view

Parallel Tensor Compression for Large-Scale Scientific Data

Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?

Dual Free SDCA for Empirical Risk Minimization with Adaptive Probabilities

Disjoint cycles of different lengths in graphs and digraphs

Random Projections through multiple optical scattering: Approximating kernels at the speed of light

Typical index distribution of Gaussian random matrices: replica approach and finite-size corrections

Hitting probabilities of random covering sets in torus and metric spaces

Determinantal Sampling Designs

A Bramble like Witness for Large Branch-Width

Rectangular Young tableaux and the Jacobi ensemble

Spacings – An Example for Universality in Random Matrix Theory

Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments

Collective Prediction of Individual Mobility Traces with Exponential Weights

Generalized conditional gradient: analysis of convergence and applications

Some advances on Sidorenko’s conjecture

Generalized Shortest Path Kernel on Graphs

Inventory Control Involving Unknown Demand of Discrete Nonperishable Items – Analysis of a Newsvendor-based Policy

Subquadratic Algorithms for Succinct Stable Matching

Thin tails of fixed points of the nonhomogeneous smoothing transform

Spectral Density for Random Matrices with Independent Skew-Diagonals

Optimization as Estimation with Gaussian Processes in Bandit Settings

A note on the best invariant estimation of continuous probability distributions under mean square loss