The PARAFAC tensor decomposition has enjoyed an increasing success in exploratory multi-aspect data mining scenarios. A major challenge remains the estimation of the number of latent factors (i.e., the rank) of the decomposition, which yields high-quality, interpretable results. Previously, we have proposed an automated tensor mining method which leverages a well-known quality heuristic from the field of Chemometrics, the Core Consistency Diagnostic (CORCONDIA), in order to automatically determine the rank for the PARAFAC decomposition. In this work we set out to explore the trade-off between 1) the interpretability/quality of the results (as expressed by CORCONDIA), and 2) the predictive accuracy of the results, in order to further improve the rank estimation quality. Our preliminary results indicate that striking a good balance in that trade-off benefits rank estimation.
The pairing heap is a classical heap data structure introduced in 1986 by Fredman, Sedgewick, Sleator, and Tarjan. It is remarkable both for its simplicity and for its excellent performance in practice. The ‘magic’ of pairing heaps lies in the restructuring that happens after the deletion of the smallest element. The resulting collection of trees is consolidated in two rounds: a left-to-right pairing round, followed by a right-to-left accumulation round. Fredman et al. showed, via an elegant correspondence to splay trees, that all operations take logarithmic amortized time. They also proposed an arguably more natural variant, where both pairing and accumulation are performed in a combined left-to-right round (we call this the forward variant). The analogy to splaying breaks down for this variant, and its analysis was left open. Since then, no progress was made on this question. In this paper we show that the amortized costs of inserting an element and of deleting the minimum in the forward variant of pairing heaps are both $O(\log{n} \cdot 4^{\sqrt{\log{n}}} )$. Our analysis relies on a new potential function that captures parent-child rank-differences in the heap, and may have further applications.
Normalizing flows have been developed recently as a method for drawing samples from an arbitrary distribution. This method is attractive due to its intrinsic ability to approximate a target distribution arbitrarily well. In practice, however, normalizing flows only consist of a finite number of deterministic transformations, and thus there is no guarantees on the approximation accuracy. In this paper we study the problem of learning deep generative models with {\em continuous-time} flows (CTFs), a family of diffusion-based methods that are able to asymptotically approach a target distribution. We discretize the CTF to make training feasible, and develop theory on the approximation error. A framework is then adopted to distill knowledge from a CTF to an efficient inference network. We apply the technique to deep generative models, including a CTF-based variational autoencoder and an adversarial-network-like density estimator. Experiments on various tasks demonstrate the superiority of the proposed CTF framework compared to existing techniques.
To harness the full benefit of new computing platforms, it is necessary to develop software with parallel computing capabilities. This is no less true for statisticians than for astrophysicists. The R programming language, which is perhaps the most popular software environment for statisticians today, has many packages available for parallel computing. Their diversity in approach can be difficult to navigate. Some have attempted to alleviate this problem by designing common interfaces. However, these approaches offer limited flexibility to the user; additionally, they often serve as poor abstractions to the reality of modern hardware, leading to poor performance. We give a short introduction to two basic parallel computing approaches that closely align with hardware reality, allow the user to understand its performance, and provide sufficient capability to fully utilize multicore and multinode environments. We illustrate both approaches by working through a simple example fitting a random forest model. Beginning with a serial algorithm, we derive two parallel versions. Our objective is to illustrate the use of multiple cores on a single processor and the use of multiple processors in a cluster computer. We discuss the differences between the two versions and how the underlying hardware is used in each case.
We investigate the non-identifiability issues associated with bidirectional adversarial training for joint distribution matching. Within a framework of conditional entropy, we propose both adversarial and non-adversarial approaches to learn desirable matched joint distributions for unsupervised and supervised tasks. We unify a broad family of adversarial models as joint distribution matching problems. Our approach stabilizes learning of unsupervised bidirectional adversarial learning methods. Further, we introduce an extension for semi-supervised learning tasks. Theoretical results are validated in synthetic data and real-world applications.
Similarity-based clustering and semi-supervised learning methods separate the data into clusters or classes according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose a novel discriminative similarity learning framework which learns discriminative similarity for either data clustering or semi-supervised learning. The proposed framework learns classifier from each hypothetical labeling, and searches for the optimal labeling by minimizing the generalization error of the learned classifiers associated with the hypothetical labeling. Kernel classifier is employed in our framework. By generalization analysis via Rademacher complexity, the generalization error bound for the kernel classifier learned from hypothetical labeling is expressed as the sum of pairwise similarity between the data from different classes, parameterized by the weights of the kernel classifier. Such pairwise similarity serves as the discriminative similarity for the purpose of clustering and semi-supervised learning, and discriminative similarity with similar form can also be induced by the integrated squared error bound for kernel density classification. Based on the discriminative similarity induced by the kernel classifier, we propose new clustering and semi-supervised learning methods.
In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for a large scale corpus and implemented in Apache Spark. We demonstrate that our system can more precisely detect revisions than state-of-the-art methods by utilizing the Wikipedia revision dumps https://…/wiki-meta.html and simulated data sets.
Hamiltonian Flow Monte Carlo(HFMC) methods have been implemented in engineering, biology and chemistry. HFMC makes large gradient based steps to rapidly explore the state space. The application of the Hamiltonian dynamics allows to estimate rare events and sample from target distributions defined as the change of measures. The estimates demonstrated a variance reduction of the presented algorithm and its efficiency with respect to a standard Monte Carlo and interacting particle based system(IPS). We tested the algorithm on the case of the barrier option pricing.
This paper proposes a novel deep reinforcement learning (RL) method integrating the neural-network-based RL and the classical RL based on dynamic programming. In comparison to the conventional deep RL methods, our method enhances the convergence speed and the performance by delving into the following two characteristic features in the training of conventional RL: (1) Having many credible experiences is important in training RL algorithms, (2) Input states can be semantically clustered into a relatively small number of core clusters, and the states belonging to the same cluster tend to share similar Q-values given an action. By following the two observations, we propose a dictionary-type memory that accumulates the Q-value for each cluster of states as well as the corresponding action, in terms of priority. Then, we iteratively update each Q-value in the memory from the Q-value acquired from the network trained by the experiences stored in the memory. We demonstrate the effectiveness of our method through training RL algorithms on widely used game environments from OpenAI.
Scattering Transforms (or ScatterNets) introduced by Mallat are a promising start into creating a well-defined feature extractor to use for pattern recognition and image classification tasks. They are of particular interest due to their architectural similarity to Convolutional Neural Networks (CNNs), while requiring no parameter learning and still performing very well (particularly in constrained classification tasks). In this paper we visualize what the deeper layers of a ScatterNet are sensitive to using a ‘DeScatterNet’. We show that the higher orders of ScatterNets are sensitive to complex, edge-like patterns (checker-boards and rippled edges). These complex patterns may be useful for texture classification, but are quite dissimilar from the patterns visualized in second and third layers of Convolutional Neural Networks (CNNs) – the current state of the art Image Classifiers. We propose that this may be the source of the current gaps in performance between ScatterNets and CNNs (83% vs 93% on CIFAR-10 for ScatterNet+SVM vs ResNet). We then use these visualization tools to propose possible enhancements to the ScatterNet design, which show they have the power to extract features more closely resembling CNNs, while still being well-defined and having the invariance properties fundamental to ScatterNets.
This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute \emph{Word2VisualVec}, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec’s properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructures, and Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. This paper surveys state of the art on stream processing engines and mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows for an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges on achieving elastic systems that can make efficient resource management decisions based on current load. This work examines some of these challenges and discusses solutions proposed in the literature to address them.
This note presents in a technical though hopefully pedagogical way the three most common forms of neural network architectures: Feedforward, Convolutional and Recurrent. For each network, their fundamental building blocks are detailed. The forward pass and the update rules for the backpropagation algorithm are then derived in full.
M-estimation, or estimating equation, methods are widely applicable for point estimation and asymptotic inference. In this paper, we present an R package that can find roots and compute the empirical sandwich variance estimator for any set of user- specified, unbiased estimating equations. Examples from the M-estimation primer by Stefanski and Boos (2002) demonstrate use of the software. The package also includes a framework for finite sample variance corrections and a website with an extensive collection of tutorials.
Probabilistic mixture models have been widely used for different machine learning and pattern recognition tasks such as clustering, dimensionality reduction, and classification. In this paper, we focus on trying to solve the most common challenges related to supervised learning algorithms by using mixture probability distribution functions. With this modeling strategy, we identify sub-labels and generate synthetic data in order to reach better classification accuracy. It means we focus on increasing the training data synthetically to increase the classification accuracy.
Bayesian data analysis is not only about computing a posterior distribution, and Bayesian visualization is about more than trace plots of Markov chains. Rather, practical Bayesian data analysis is an iterative process of model building, inference, model checking and evaluation, and model expansion. Visualization is not only helpful in each of these stages of the Bayesian workflow, it is indispensable for making inferences from the intricate, high-dimensional models of interest to applied researchers, and essential for understanding and diagnosing the increasingly complex algorithms required to fit such models.
In this note, we point out a basic link between generative adversarial (GA) training and binary classification — any powerful discriminator essentially computes an (f-)divergence between real and generated samples. The result, repeatedly re-derived in decision theory, has implications for GA Networks (GANs), providing an alternative perspective on training f-GANs by designing the discriminator loss function.