We develop a set of scalable Bayesian inference procedures for a general class of nonparametric regression models based on embarrassingly parallel MCMC. Specifically, we first perform independent nonparametric Bayesian inference on each subset split from a massive dataset, and then aggregate those results into global counterparts. By partitioning the dataset carefully, we show that our aggregated inference results obtain the oracle rule in the sense that they are equivalent to those obtained directly from the massive data (which are computationally prohibitive in practice, though). For example, the aggregated credible sets achieve desirable credibility level and frequentist coverage possessed by the oracle counterparts (with similar radius). The oracle matching phenomenon occurs due to the nice geometric structures of the infinite-dimensional parameter space. A technical by-product is a new version of uniformly consistent test that applies to a general regression model under Sobolev norm.
This paper offers a comprehensive analysis of the Pilot-Job abstraction assessing its evolution, properties, and implementation as multiple Pilot-Job software systems. Pilot-Job systems play an important role in supporting distributed scientific computing. They are used to consume more than 700 million CPU hours a year by the Open Science Grid communities, and by processing up to 5 million jobs a week for the ATLAS experiment on the Worldwide LHC Computing Grid. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also witnessing an adoption beyond traditional domains. Notwithstanding the growing impact on scientific research, there is no agreement upon a definition of Pilot-Job system and no clear understanding of the underlying Pilot abstraction and paradigm. This lack of foundational understanding has lead to a proliferation of unsustainable Pilot-Job implementations with no shared best practices or interoperability, ultimately hindering a realization of the full impact of Pilot-Jobs. This paper offers the conceptual tools to promote this fundamental understanding while critically reviewing the state of the art of Pilot-Job implementations. The five main contributions of this paper are: (i) an analysis of the motivations and evolution of the Pilot-Job abstraction; (ii) an outline of the minimal set of distinguishing functionalities; (iii) the definition of a core vocabulary to reason consistently about Pilot-Jobs; (iv) the description of core and auxiliary properties of Pilot-Job systems; and (v) a critical review of the current state of the art of their implementations. These contributions are brought together to illustrate the generality of the Pilot-Job paradigm, to discuss some challenges in distributed computing that it addresses and future opportunities.
Cascaded AdaBoost classifier is a well-known efficient object detection algorithm. The cascade structure has many parameters to be determined. Most of existing cascade learning algorithms are designed by assigning detection rate and false positive rate to each stage either dynamically or statically. Their objective functions are not directly related to minimum computation cost. These algorithms are not guaranteed to have optimal solution in the sense of minimizing computation cost. On the assumption that a strong classifier is given, in this paper we propose an optimal cascade learning algorithm (we call it iCascade) which iteratively partitions the strong classifiers into two parts until predefined number of stages are generated. iCascade searches the optimal number ri of weak classifiers of each stage i by directly minimizing the computation cost of the cascade. Theorems are provided to guarantee the existence of the unique optimal solution. Theorems are also given for the proposed efficient algorithm of searching optimal parameters ri. Once a new stage is added, the parameter ri for each stage decreases gradually as iteration proceeds, which we call decreasing phenomenon. Moreover, with the goal of minimizing computation cost, we develop an effective algorithm for setting the optimal threshold of each stage classifier. In addition, we prove in theory why more new weak classifiers are required compared to the last stage. Experimental results on face detection demonstrate the effectiveness and efficiency of the proposed algorithm.
Recently ensemble selection for consensus clustering has emerged as a research problem in Machine Intelligence. Normally consensus clustering algorithms take into account the entire ensemble of clustering, where there is a tendency of generating a very large size ensemble before computing its consensus. One can avoid considering the entire ensemble and can judiciously select few partitions in the ensemble without compromising on the quality of the consensus. This may result in an efficient consensus computation technique and may save unnecessary computational overheads. The ensemble selection problem addresses this issue of consensus clustering. In this paper, we propose an efficient method of ensemble selection for a large ensemble. We prioritize the partitions in the ensemble based on diversity and frequency. Our method selects top K of the partitions in order of priority, where K is decided by the user. We observe that considering jointly the diversity and frequency helps in identifying few representative partitions whose consensus is qualitatively better than the consensus of the entire ensemble. Experimental analysis on a large number of datasets shows our method gives better results than earlier ensemble selection methods.
Word embeddings — distributed representations for words — in deep learning are beneficial for many tasks in Natural Language Processing (NLP). However, different embedding sets vary greatly in quality and characteristics of the captured semantics. Instead of relying on a more advanced algorithm for embedding learning, this paper proposes an ensemble approach of combining different public embedding sets with the aim of learning meta-embeddings. Experiments on word similarity and analogy tasks and on part-of-speech (POS) tagging show better performance of meta-embeddings compared to individual embedding sets. One advantage of meta-embeddings is that they have increased coverage of the vocabulary. We will release our meta-embeddings publicly.
We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high dimensional data. Ensembles of classification, regression and survival trees are supported. We describe the implementation, provide examples, validate the package with a reference implementation, and compare runtime and memory usage with other implementations. The new software proves to scale best with the number of features, samples, trees, and features tried for splitting. Finally, we show that ranger is the fastest and most memory efficient implementation of random forests to analyze data on the scale of a genome-wide association study.
SelectScript is an extendable, adaptable, and declarative domain-specific language aimed at information retrieval from simulation environments and robotic world models in an SQL-like manner. In this work we have extended the language in two directions. First, we have implemented hierarchical queries; second, we improve efficiency enabling manual design space exploration on different ‘search’ strategies. As depicted in the teaser above, we demonstrate the applicability of such extensions in two application problems; the basic language concepts are explained by solving the classical problem of the Towers of Hanoi and then a common path planning problem in a complex 3D environment is implemented.
Context of data points, which is usually defined as the other data points in a data set, has been found to play important roles in data representation and classification. In this paper, we study the problem of using context of a data point for its classification problem. Our work is inspired by the observation that actually only very few data points are critical in the context of a data point for its representation and classification. We propose to represent a data point as the sparse linear combination of its context, and learn the sparse context in a supervised way to increase its discriminative ability. To this end, we proposed a novel formulation for context learning, by modeling the learning of context parameter and classifier in a unified objective, and optimizing it with an alternative strategy in an iterative algorithm. Experiments on three benchmark data set show its advantage over state-of-the-art context-based data representation and classification methods.