Next generation architectures necessitate a shift away from traditional workflows in which the simulation state is saved at prescribed frequencies for post-processing analysis. While the need to shift to in~situ workflows has been acknowledged for some time, much of the current research is focused on static workflows, where the analysis that would have been done as a post-process is performed concurrently with the simulation at user-prescribed frequencies. More recent research efforts strive to enable adaptive workflows, in which the frequency, composition, and execution of computational and data manipulation steps dynamically depend on the state of the simulation. Adapting the workflow to the state of the simulation in such a data-driven fashion puts extremely strict efficiency requirements on the analysis capabilities that are used to identify the transitions in the workflow. In this paper we build upon earlier work on trigger detection using sublinear techniques to drive adaptive workflows. Here we propose a methodology to detect the time when sudden heat release occurs in simulations of turbulent combustion. Our proposed method provides an alternative metric that can be used along with our former metric to increase the robustness of trigger detection. We show the effectiveness of our metric empirically for predicting heat release for two use cases.
Weak topic correlation across document collections with different numbers of topics in individual collections presents challenges for existing cross-collection topic models. This paper introduces two probabilistic topic models, Correlated LDA (C-LDA) and Correlated HDP (C-HDP). These address problems that can arise when analyzing large, asymmetric, and potentially weakly-related collections. Topic correlations in weakly-related collections typically lie in the tail of the topic distribution, where they would be overlooked by models unable to fit large numbers of topics. To efficiently model this long tail for large-scale analysis, our models implement a parallel sampling algorithm based on the Metropolis-Hastings and alias methods (Yuan et al., 2015). The models are first evaluated on synthetic data, generated to simulate various collection-level asymmetries. We then present a case study of modeling over 300k documents in collections of sciences and humanities research from JSTOR.
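The alias method mentioned above draws from a fixed discrete distribution in constant time after a linear-time setup, which is what makes it attractive when the number of topics is large. The following is a minimal sketch of the alias-table building block only (Vose's variant), not of the C-LDA or C-HDP samplers themselves; the function names `build_alias_table` and `alias_sample` are hypothetical and chosen for illustration.

```python
import random


def build_alias_table(weights):
    """Build alias tables for non-negative weights in O(n) time (Vose's variant)."""
    n = len(weights)
    total = sum(weights)
    # Rescale so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]      # keep outcome s with this probability
        alias[s] = l             # otherwise fall through to outcome l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftovers are 1 up to rounding error.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_sample(prob, alias, rng):
    """Draw one index in O(1): pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


if __name__ == "__main__":
    prob, alias = build_alias_table([1, 2, 3, 4])
    rng = random.Random(0)
    draws = [alias_sample(prob, alias, rng) for _ in range(10)]
    print(draws)
```

In a Metropolis-Hastings sampler, a table like this would typically serve as a cheap proposal distribution that is rebuilt only occasionally; how the cited algorithm schedules those rebuilds is not stated in the text above.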
We investigate the problem of winner determination from computational social choice theory in the data stream model. Specifically, we consider the task of summarizing an arbitrarily ordered stream of votes on candidates into a small space data structure so as to be able to obtain the winner determined by popular voting rules. As we show, finding the exact winner requires storing essentially all the votes. So, we focus on the problem of finding an {\em $\varepsilon$-winner}, a candidate who could win by a change of at most an $\varepsilon$ fraction of the votes. We show non-trivial upper and lower bounds on the space complexity of $\varepsilon$-winner determination for several voting rules, including $k$-approval, $k$-veto, scoring rules, approval, maximin, Bucklin, Copeland, and plurality with runoff.
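To make the small-space setting concrete, one natural way to look for an approximate winner under the plurality rule is to keep a uniform random sample of the votes and report the most frequent candidate in the sample. The sketch below uses standard reservoir sampling for that purpose. It is an illustration under stated assumptions, not necessarily the algorithm analyzed in the paper: the function names are hypothetical, each vote is assumed to be a single candidate label, and the sample size needed for a given approximation guarantee is left as a caller-supplied parameter.

```python
import random
from collections import Counter


def reservoir_sample(stream, sample_size, rng):
    """Keep a uniform random sample of fixed size from a stream of unknown length."""
    reservoir = []
    for seen, vote in enumerate(stream, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(vote)
        else:
            # Replace a stored vote with probability sample_size / seen.
            j = rng.randrange(seen)
            if j < sample_size:
                reservoir[j] = vote
    return reservoir


def approximate_plurality_winner(stream, sample_size, seed=0):
    """Return the most frequent candidate among the sampled votes."""
    rng = random.Random(seed)
    sample = reservoir_sample(stream, sample_size, rng)
    if not sample:
        raise ValueError("the vote stream is empty")
    return Counter(sample).most_common(1)[0][0]


if __name__ == "__main__":
    votes = ["a"] * 7000 + ["b"] * 2000 + ["c"] * 1000
    random.Random(42).shuffle(votes)
    print(approximate_plurality_winner(iter(votes), sample_size=500))
```

The memory used depends only on `sample_size`, not on the number of votes in the stream.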
The R Package CEC performs clustering based on the cross-entropy clustering (CEC) method, which was recently developed with the use of information theory. The main advantage of CEC is that it combines the speed and simplicity of $k$-means with the ability to use various Gaussian mixture models and reduce unnecessary clusters. In this work we present a practical tutorial to CEC based on the R Package CEC. Functions are provided to encompass the whole process of clustering.
We consider how to learn multi-step predictions efficiently. Conventional algorithms wait until observing actual outcomes before performing the computations to update their predictions. If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed. We show that the exact same predictions can be learned in a much more computationally congenial way, with uniform per-step computation that does not depend on the span of the predictions. We apply this idea to various settings of increasing generality, repeatedly adding desired properties and each time deriving an equivalent span-independent algorithm for the conventional algorithm that satisfies these desiderata. Interestingly, along the way several known algorithmic constructs emerge spontaneously from our derivations, including dutch eligibility traces, temporal difference errors, and averaging. This allows us to link these constructs one-to-one to the corresponding desiderata, unambiguously connecting the `how' to the `why'. At each step, we make sure that the derived algorithm subsumes the previous algorithms, thereby retaining their properties. Ultimately we arrive at a single general temporal-difference algorithm that is applicable to the full setting of reinforcement learning.
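The idea of replacing a wait-for-the-outcome update with a per-step computation can be illustrated with a classical special case: for linear predictions, no discounting, and weights held fixed within an episode, the Monte Carlo update equals the sum of temporal-difference errors weighted by an accumulating eligibility trace. The sketch below checks this numerically. It is not one of the paper's derivations (which involve dutch traces and more general settings); the function names and the toy episode are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def monte_carlo_update(w, features, rewards, alpha):
    """Conventional update: store every feature vector, wait for all rewards.

    features[t] is the feature vector at step t and rewards[t] is the reward
    received after step t. Returns the total weight change for the episode.
    """
    delta_w = [0.0] * len(w)
    for t, x in enumerate(features):
        outcome = sum(rewards[t:])          # undiscounted return from step t
        error = outcome - dot(w, x)
        for i, xi in enumerate(x):
            delta_w[i] += alpha * error * xi
    return delta_w


def td1_trace_update(w, features, rewards, alpha):
    """Per-step update: only a trace and a running weight change are kept.

    Each step touches only the current and next feature vectors, so the work
    per step does not grow with the length of the episode.
    """
    n = len(w)
    trace = [0.0] * n
    delta_w = [0.0] * n
    for t, x in enumerate(features):
        trace = [e + xi for e, xi in zip(trace, x)]   # accumulating trace
        last = t + 1 == len(features)
        next_value = 0.0 if last else dot(w, features[t + 1])
        td_error = rewards[t] + next_value - dot(w, x)
        for i in range(n):
            delta_w[i] += alpha * td_error * trace[i]
    return delta_w


if __name__ == "__main__":
    w = [0.5, -0.2]
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    rewards = [1.0, 0.0, 2.0]
    print(monte_carlo_update(w, features, rewards, alpha=0.1))
    print(td1_trace_update(w, features, rewards, alpha=0.1))
```

Both functions produce the same weight change for the episode, but the second never needs to look back at earlier observations.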
Distributed Real-Time (DRT) systems are among the most complex software systems to design, test, maintain and evolve. The existence of components distributed over a network often conflicts with real-time requirements, leading to design strategies that depend on domain- and even application-specific knowledge. Distributed Virtual Environment (DVE) systems are DRT systems that connect multiple users instantly with each other and with a shared virtual space over a network. DVE systems deviate from traditional DRT systems in the importance of the quality of the end user experience. We present an analysis of important, but challenging, issues in the design, testing and evaluation of DVE systems through the lens of experiments with a concrete DVE, OpenSimulator. We frame our observations within six dimensions of well-known design concerns: correctness, fault tolerance/prevention, scalability, time sensitivity, consistency, and overhead of distribution. Furthermore, we place our experimental work in a broader historical context, showing that these challenges are intrinsic to DVEs and suggesting lines of future research.
The explosion of cloud services on the Internet brings new challenges in service discovery and selection. In particular, the demand for efficient quality-of-service (QoS) evaluation is becoming increasingly urgent. To address this issue, this paper proposes a neighborhood-based approach for QoS prediction of cloud services that takes advantage of collaborative intelligence. Unlike heuristic collaborative filtering and matrix factorization, we define a formal neighborhood-based prediction framework that allows an efficient global optimization scheme, and then exploit different baseline estimate components to improve predictive performance. To validate the proposed methods, we use a large-scale QoS-specific dataset consisting of invocation records from 339 service users on 5,825 web services on a world-scale distributed network. Experimental results demonstrate that the learned neighborhood-based models can overcome existing difficulties of heuristic collaborative filtering methods and achieve better performance than state-of-the-art prediction methods.
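A baseline estimate in collaborative filtering is commonly taken to be a global mean plus a per-user and a per-item offset, with neighborhood terms then modeling what the baseline leaves unexplained. The sketch below computes such a baseline from QoS records using simple averages. This is an assumption made for illustration: the text above does not specify which baseline components the paper uses, and the function name, record layout, and toy values are hypothetical.

```python
from collections import defaultdict


def fit_baseline(records):
    """Fit mu + b_user + b_service from (user, service, qos_value) records.

    Offsets are plain averages of residuals. Returns a predict(user, service)
    function that falls back to the global mean for unseen users or services.
    """
    mu = sum(value for _, _, value in records) / len(records)

    # Service offset: average deviation of the service's values from the mean.
    service_residuals = defaultdict(list)
    for _, service, value in records:
        service_residuals[service].append(value - mu)
    b_service = {s: sum(r) / len(r) for s, r in service_residuals.items()}

    # User offset: average of what remains after removing the service offset.
    user_residuals = defaultdict(list)
    for user, service, value in records:
        user_residuals[user].append(value - mu - b_service[service])
    b_user = {u: sum(r) / len(r) for u, r in user_residuals.items()}

    def predict(user, service):
        return mu + b_user.get(user, 0.0) + b_service.get(service, 0.0)

    return predict


if __name__ == "__main__":
    # Toy response times in seconds (made-up values for illustration only).
    records = [("u1", "s1", 1.0), ("u1", "s2", 3.0), ("u2", "s1", 2.0)]
    predict = fit_baseline(records)
    print(predict("u2", "s2"))
```

A learned neighborhood model would replace these averages with parameters fitted by optimizing a global objective, which is the direction the abstract describes.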
In this paper, we propose a technique for time series clustering using community detection in complex networks. First, we present a method to transform a set of time series into a network using different distance functions, where each time series is represented by a vertex and the most similar ones are connected. Then, we apply community detection algorithms to identify groups of strongly connected vertices (called communities) and, consequently, identify time series clusters. We also make a comprehensive analysis of the influence of various combinations of time series distance functions, network generation methods and community detection techniques on clustering results. Our experimental study shows that the proposed network-based approach achieves better results than the various classic and up-to-date clustering techniques under consideration. Statistical tests confirm that the proposed method outperforms some classic clustering algorithms, such as $k$-medoids, diana, median-linkage and centroid-linkage, on various data sets. Interestingly, the proposed method can effectively detect shape patterns present in time series, owing to the topological structure of the underlying network constructed in the clustering process, whereas other techniques fail to identify such patterns. Moreover, the proposed method is robust enough to group time series that present similar patterns but with time shifts and/or amplitude variations. In summary, the main point of the proposed method is the transformation of time series from the time-space domain to the topological domain. Therefore, we hope that our approach contributes not only to time series clustering, but also to general time series analysis tasks.
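The two-stage pipeline described above (build a similarity network, then group its vertices) can be sketched in a few lines. The example below connects each series to its nearest neighbours under Euclidean distance on z-normalized values, then labels the connected components of the resulting network. Connected components are used here only as a crude stand-in for a real community detection algorithm, and the distance function, the value of `k`, and all function names are assumptions made for illustration rather than the paper's choices.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit variance (removes amplitude)."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / std if std > 0 else 0.0 for x in series]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_network(series_list, k):
    """Undirected network: each vertex is linked to its k most similar series."""
    normalized = [z_normalize(s) for s in series_list]
    n = len(normalized)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(normalized[i], normalized[j]),
        )
        for j in neighbours[:k]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def component_labels(adjacency):
    """Label vertices by connected component (stand-in for community detection)."""
    labels = {}
    next_label = 0
    for start in adjacency:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in labels:
                    labels[neighbour] = next_label
                    stack.append(neighbour)
        next_label += 1
    return labels


if __name__ == "__main__":
    rising = [[1, 2, 3, 4], [10, 20, 30, 40], [2, 4, 6, 8]]
    falling = [[4, 3, 2, 1], [40, 30, 20, 10], [8, 6, 4, 2]]
    print(component_labels(knn_network(rising + falling, k=1)))
```

Because the series are z-normalized before comparison, the toy example groups series by shape regardless of amplitude; handling time shifts would need a different distance function, which this sketch does not attempt.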