Enabling adaptive scientific workflows via trigger detection

Next generation architectures necessitate a shift away from traditional workflows in which the simulation state is saved at prescribed frequencies for post-processing analysis. While the need to shift to in situ workflows has been acknowledged for some time, much of the current research is focused on static workflows, where the analysis that would have been done as a post-process is performed concurrently with the simulation at user-prescribed frequencies. Recent research efforts strive to enable adaptive workflows, in which the frequency, composition, and execution of computational and data manipulation steps dynamically depend on the state of the simulation. Adapting the workflow to the state of the simulation in such a data-driven fashion puts extremely strict efficiency requirements on the analysis capabilities that are used to identify the transitions in the workflow. In this paper we build upon earlier work on trigger detection using sublinear techniques to drive adaptive workflows. Here we propose a methodology to detect the time when sudden heat release occurs in simulations of turbulent combustion. Our proposed method provides an alternative metric that can be used along with our former metric to increase the robustness of trigger detection. We show the effectiveness of our metric empirically for predicting heat release for two use cases.

Fast, Flexible Models for Discovering Topic Correlation across Weakly-Related Collections

Weak topic correlation across document collections with different numbers of topics in individual collections presents challenges for existing cross-collection topic models. This paper introduces two probabilistic topic models, Correlated LDA (C-LDA) and Correlated HDP (C-HDP). These address problems that can arise when analyzing large, asymmetric, and potentially weakly-related collections. Topic correlations in weakly-related collections typically lie in the tail of the topic distribution, where they would be overlooked by models unable to fit large numbers of topics. To efficiently model this long tail for large-scale analysis, our models implement a parallel sampling algorithm based on the Metropolis-Hastings and alias methods (Yuan et al., 2015). The models are first evaluated on synthetic data, generated to simulate various collection-level asymmetries. We then present a case study of modeling over 300k documents in collections of sciences and humanities research from JSTOR.

Fishing out Winners from Vote Streams

We investigate the problem of winner determination from computational social choice theory in the data stream model. Specifically, we consider the task of summarizing an arbitrarily ordered stream of n votes on m candidates into a small space data structure so as to be able to obtain the winner determined by popular voting rules. As we show, finding the exact winner requires storing essentially all the votes. So, we focus on the problem of finding an ε-winner, a candidate who could win by a change of at most an ε fraction of the votes. We show non-trivial upper and lower bounds on the space complexity of ε-winner determination for several voting rules, including k-approval, k-veto, scoring rules, approval, maximin, Bucklin, Copeland, and plurality with runoff.
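As an illustration of the small-space idea for the plurality rule only, the following is a minimal sketch that uses the classical Misra-Gries frequent-items summary. This is a standard streaming technique swapped in for illustration, not necessarily the algorithm analyzed in the paper; the function names and the choice k = floor(1/ε) + 1 are assumptions made for this example.

```python
def misra_gries(votes, k):
    """Summarize a stream of plurality votes using at most k - 1 counters.

    Each stored count underestimates the true count by at most n / k,
    where n is the number of votes seen.
    """
    counters = {}
    for candidate in votes:
        if candidate in counters:
            counters[candidate] += 1
        elif len(counters) < k - 1:
            counters[candidate] = 1
        else:
            # No free counter: decrement all counters, drop the empty ones.
            for c in list(counters):
                counters[c] -= 1
                if counters[c] == 0:
                    del counters[c]
    return counters


def approximate_plurality_winner(votes, eps):
    """Return a candidate whose true count is within eps * n of the top count.

    Hypothetical helper: k is chosen so that n / k < eps * n.
    """
    k = int(1 / eps) + 1
    summary = misra_gries(votes, k)
    if not summary:
        return None
    return max(summary, key=summary.get)


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "b", "a", "a"]
    print(approximate_plurality_winner(stream, eps=0.5))
```

The summary stores at most k - 1 counters regardless of the stream length, which is the kind of space saving the ε-winner relaxation makes possible; the other voting rules listed in the abstract need different summaries.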

Introduction to Cross-Entropy Clustering: The R Package CEC

The R Package CEC performs clustering based on the cross-entropy clustering (CEC) method, which was recently developed with the use of information theory. The main advantage of CEC is that it combines the speed and simplicity of k-means with the ability to use various Gaussian mixture models and reduce unnecessary clusters. In this work we present a practical tutorial on CEC based on the R Package CEC. Functions are provided to encompass the whole process of clustering.

Learning to Predict Independent of Span

We consider how to learn multi-step predictions efficiently. Conventional algorithms wait until observing actual outcomes before performing the computations to update their predictions. If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed. We show that the exact same predictions can be learned in a much more computationally congenial way, with uniform per-step computation that does not depend on the span of the predictions. We apply this idea to various settings of increasing generality, repeatedly adding desired properties and each time deriving an equivalent span-independent algorithm for the conventional algorithm that satisfies these desiderata. Interestingly, along the way several known algorithmic constructs emerge spontaneously from our derivations, including dutch eligibility traces, temporal difference errors, and averaging. This allows us to link these constructs one-to-one to the corresponding desiderata, unambiguously connecting the 'how' to the 'why'. At each step, we make sure that the derived algorithm subsumes the previous algorithms, thereby retaining their properties. Ultimately we arrive at a single general temporal-difference algorithm that is applicable to the full setting of reinforcement learning.
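The general idea of replacing a wait-for-the-outcome update with a per-step one can be sketched with a classical special case: for undiscounted linear predictions whose weights are held fixed during an episode, the update toward the observed returns equals the sum of temporal-difference errors weighted by an accumulating trace. This is a textbook identity used here for illustration, not the paper's derivation; all function and variable names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def forward_update(features, rewards, w, alpha):
    """Conventional view: keep every feature vector until the episode ends,
    then move each prediction toward the return that followed it."""
    delta_w = [0.0] * len(w)
    for t in range(len(features)):
        ret = sum(rewards[t:])  # outcome observed after step t
        err = ret - dot(w, features[t])
        delta_w = [d + alpha * err * x for d, x in zip(delta_w, features[t])]
    return delta_w


def backward_update(features, rewards, w, alpha):
    """Span-independent view: one trace vector and one TD error per step.
    Each step only needs the current and next feature vectors."""
    trace = [0.0] * len(w)
    delta_w = [0.0] * len(w)
    steps = len(features)
    for t in range(steps):
        # Accumulating trace: running sum of the features seen so far.
        trace = [e + x for e, x in zip(trace, features[t])]
        # The prediction after the final step is defined to be zero.
        v_next = dot(w, features[t + 1]) if t + 1 < steps else 0.0
        td_error = rewards[t] + v_next - dot(w, features[t])
        delta_w = [d + alpha * td_error * e for d, e in zip(delta_w, trace)]
    return delta_w


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    rews = [0.0, 1.0, 2.0]
    weights = [0.2, -0.1]
    print(forward_update(feats, rews, weights, alpha=0.1))
    print(backward_update(feats, rews, weights, alpha=0.1))
```

Both functions produce the same total weight change up to floating-point rounding, but the second needs memory proportional to the number of features rather than to the length of the episode.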

On Designing and Testing Distributed Virtual Environments

Distributed Real-Time (DRT) systems are among the most complex software systems to design, test, maintain and evolve. The existence of components distributed over a network often conflicts with real-time requirements, leading to design strategies that depend on domain- and even application-specific knowledge. Distributed Virtual Environment (DVE) systems are DRT systems that connect multiple users instantly with each other and with a shared virtual space over a network. DVE systems deviate from traditional DRT systems in the importance of the quality of the end user experience. We present an analysis of important, but challenging, issues in the design, testing and evaluation of DVE systems through the lens of experiments with a concrete DVE, OpenSimulator. We frame our observations within six dimensions of well-known design concerns: correctness, fault tolerance/prevention, scalability, time sensitivity, consistency, and overhead of distribution. Furthermore, we place our experimental work in a broader historical context, showing that these challenges are intrinsic to DVEs and suggesting lines of future research.

Personalized QoS Prediction of Cloud Services via Learning Neighborhood-based Model

The explosion of cloud services on the Internet brings new challenges in service discovery and selection. In particular, the demand for efficient quality-of-service (QoS) evaluation is becoming increasingly urgent. To address this issue, this paper proposes a neighborhood-based approach for QoS prediction of cloud services by taking advantage of collaborative intelligence. Unlike heuristic collaborative filtering and matrix factorization, we define a formal neighborhood-based prediction framework which allows an efficient global optimization scheme, and then exploit different baseline estimate components to improve predictive performance. To validate the proposed methods, we use a large-scale QoS-specific dataset which consists of invocation records from 339 service users on 5,825 web services on a world-scale distributed network. Experimental results demonstrate that the learned neighborhood-based models can overcome existing difficulties of heuristic collaborative filtering methods and outperform state-of-the-art prediction methods.

Time Series Clustering via Community Detection in Networks

In this paper, we propose a technique for time series clustering using community detection in complex networks. First, we present a method to transform a set of time series into a network using different distance functions, where each time series is represented by a vertex and the most similar ones are connected. Then, we apply community detection algorithms to identify groups of strongly connected vertices (called communities) and, consequently, identify time series clusters. We also present a comprehensive analysis of the influence of various combinations of time series distance functions, network generation methods and community detection techniques on clustering results. An experimental study shows that the proposed network-based approach achieves better results than various classic or up-to-date clustering techniques under consideration. Statistical tests confirm that the proposed method outperforms some classic clustering algorithms, such as k-medoids, diana, median-linkage and centroid-linkage in various data sets. Interestingly, the proposed method can effectively detect shape patterns present in time series due to the topological structure of the underlying network constructed in the clustering process, whereas other techniques fail to identify such patterns. Moreover, the proposed method is robust enough to group time series presenting similar patterns but with time shifts and/or amplitude variations. In summary, the main point of the proposed method is the transformation of time series from the time-space domain to the topological domain. Therefore, we hope that our approach contributes not only to time series clustering, but also to general time series analysis tasks.
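The two-stage pipeline described above (series to network, then network to communities) can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, a k-nearest-neighbor network, and a simple deterministic label propagation stand in for the distance functions, network generation methods and community detection algorithms actually compared in the paper. All names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_network(series, k, dist=euclidean):
    """One vertex per time series; connect each vertex to its k most similar series."""
    n = len(series)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(series[i], series[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)  # undirected edge
    return adj


def label_propagation(adj, max_iter=100):
    """Each vertex repeatedly adopts the most common label among its neighbors.
    Ties are broken by the smallest label so the result is deterministic."""
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    return labels


if __name__ == "__main__":
    data = [
        [0.0, 1.0, 0.0, 1.0], [0.0, 1.1, 0.0, 0.9], [0.1, 1.0, 0.1, 1.0],
        [5.0, 5.0, 6.0, 6.0], [5.0, 5.1, 6.0, 6.1], [5.1, 5.0, 6.1, 6.0],
    ]
    network = knn_network(data, k=2)
    print(label_propagation(network))
```

Swapping `euclidean` for a shape-aware distance, or the label propagation step for another community detection method, changes only one function each, which mirrors the combinations the abstract says were evaluated.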

A Dictionary Learning Approach for Factorial Gaussian Models

A short note on estimation of WCRE and WCE

Automorphism Groups of Generic Structures: Extreme Amenability and Amenability

Convex integral functionals of regular processes

Design and Implementation of Distributed Resource Management for Time Sensitive Applications

Discrimination of time-dependent inflow properties with a cooperative dynamical system

Drawing and Analyzing Causal DAGs with DAGitty

Dynamical localization and the nonequilibrium interplay between disorder and interactions

Efficient simulation of many-body localized systems

Exploring chance in NCAA basketball

Exploring Metaphorical Senses and Word Representations for Identifying Metonyms

Fault Diagnosis of Helical Gear Box using Large Margin K-Nearest Neighbors Classifier using Sound Signals

Generating functions and triangulations for lecture hall cones

Independent Sets, Matchings, and Occupancy Fractions

Linear rank-width of distance-hereditary graphs II. Vertex-minor obstructions

Lower Bound on the Rate of Adaptation in an Asexual Population

Mining Brain Networks using Multiple Side Views for Neurological Disorder Identification

Near-Optimal Distributed Maximum Flow

Nilpotent dessins: Decomposition theorem and classification of the abelian dessins

Nonlinear filtering with correlated Lévy noise characterized by copulas

Nowhere-zero 9-flows in 3-edge-connected signed graphs

On jump-diffusion processes with regime switching: martingale approach

On photon statistics parametrized by a non-central Wishart random matrix

On the Sprague-Grundy function of Exact $k$-Nim

Proposal for the creation of a research facility for the development of the SP machine

Quantum Max-flow/Min-cut

Quickest Detection for Changes in Maximal kNN Coherence of Random Matrices

Rapidly Computing Sparse Legendre Expansions via Sparse Fourier Transforms

Recognizing Extended Spatiotemporal Expressions by Actively Trained Average Perceptron Ensembles

Robust Subspace Clustering via Smoothed Rank Approximation

Semiparametric estimation of mutual information and related criteria: optimal test of independence

Short-range dependent processes subordinated to the Gaussian may not be strong mixing

Simulated Tempering and Swapping on Mean-Field Models

Spatio-temporal Spike and Slab Priors for Multiple Measurement Vector Problems

Strong convergence of the symmetrized Milstein scheme for some CEV-like SDEs

The extremal function for Petersen minors

The multiplicative coalescent, inhomogeneous continuum random trees, and new universality classes for critical random graphs

The Smith Normal Form of a Specialized Jacobi-Trudi Matrix

Toric $g$-polynomials of hook shape lattice Path Matroid Polytopes and product of simplices

Translation invariant extensions of finite volume measures

Upper triangular matrices and Billiard Arrays

Viscosity solutions of second order integral-partial differential equations: A new result

Weak convergence analysis of the symmetrized Euler scheme for one dimensional SDEs with diffusion coefficient |x|^a, a in [1/2,1)