Pangea google
Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non- shared execution data in separate systems such as distributed file system like HDFS, in-memory file system like Alluxio and computation framework like Spark. Such layering introduces significant performance and management costs for copying data across layers redundantly and deciding proper resource allocation for all layers. In this paper we propose a single system called Pangea that can manage all data—both intermediate and long-lived data, and their buffer/caching, data placement optimization, and failure recovery—all in one monolithic storage system, without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark. …

Neyman-Pearson Criterion (NPC) google
We propose a new model selection criterion, the Neyman-Pearson criterion (NPC), for asymmetric binary classification problems such as cancer diagnosis, where the two types of classification errors have vastly different priorities. The NPC is a general prediction-based criterion that works for most classification methods including logistic regression, support vector machines, and random forests. We study the theoretical model selection properties of the NPC for nonparametric plug-in methods. Simulation studies show that the NPC outperforms the classical prediction-based criterion that minimizes the overall classification error under various asymmetric classification scenarios. A real data case study of breast cancer suggests that the NPC is a practical criterion that leads to the discovery of novel gene markers with both high sensitivity and specificity for breast cancer diagnosis. The NPC is available in an R package NPcriterion. …

Harmonia google
Distributed storage employs replication to mask failures and improve availability. However, these systems typically exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write conflicts in the network, which enables any replica to serve reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of Harmonia on a cluster of commodity servers connected by a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a variety of replication protocols, including primary-backup, chain replication, Viewstamped Replication, and NOPaxos. Experimental results show that Harmonia improves the throughput of these protocols by up to 10X for a replication factor of 10, providing near-linear scalability up to the limit of our testbed. …

Stochastic Distance Transform (SDT) google
The distance transform (DT) and its many variations are ubiquitous tools for image processing and analysis. In many imaging scenarios, the images of interest are corrupted by noise. This has a strong negative impact on the accuracy of the DT, which is highly sensitive to spurious noise points. In this study, we consider images represented as discrete random sets and observe statistics of DT computed on such representations. We, thus, define a stochastic distance transform (SDT), which has an adjustable robustness to noise. Both a stochastic Monte Carlo method and a deterministic method for computing the SDT are proposed and compared. Through a series of empirical tests, we demonstrate that the SDT is effective not only in improving the accuracy of the computed distances in the presence of noise, but also in improving the performance of template matching and watershed segmentation of partially overlapping objects, which are examples of typical applications where DTs are utilized. …