A new non-parametric detector of univariate outliers for distributions with positive unbounded support

The purpose of this paper is the construction and the asymptotic property study of a new non-parametric detector of univariate outliers. This detector, based on a Hill’s type statistics, is valid for a large set of probability distributions with positive unbounded support, for instance for the absolute value of Gaussian, Gamma, Weibull, Student or regular variations distributions. We illustrate our results by numerical simulations which show the accuracy of this detector with respect to other usual univariate outlier detectors (Tukey, MADE or Local Outlier Factor detectors). An application to real-life data allows to detect outliers in a database providing the prices of used cars.

A Variational Bayesian State-Space Approach to Online Passive-Aggressive Regression

Online Passive-Aggressive (PA) learning is a class of online margin-based algorithms suitable for a wide range of real-time prediction tasks, including classification and regression. PA algorithms are formulated in terms of deterministic point-estimation problems governed by a set of user-defined hyperparameters: the approach fails to capture model/prediction uncertainty and makes their performance highly sensitive to hyperparameter configurations. In this paper, we introduce a novel PA learning framework for regression that overcomes the above limitations. We contribute a Bayesian state-space interpretation of PA regression, along with a novel online variational inference scheme, that not only produces probabilistic predictions, but also offers the benefit of automatic hyperparameter tuning. Experiments with various real-world data sets show that our approach performs significantly better than a more standard, linear Gaussian state-space model.

C3: Lightweight Incrementalized MCMC for Probabilistic Programs using Continuations and Callsite Caching

Lightweight, source-to-source transformation approaches to implementing MCMC for probabilistic programming languages are popular for their simplicity, support of existing deterministic code, and ability to execute on existing fast runtimes. However, they are also slow, requiring a complete re-execution of the program on every Metropolis Hastings proposal. We present a new extension to the lightweight approach, C3, which enables efficient, incrementalized re-execution of MH proposals. C3 is based on two core ideas: transforming probabilistic programs into continuation passing style (CPS), and caching the results of function calls. We show that on several common models, C3 reduces proposal runtime by 20-100x, in some cases reducing runtime complexity from linear in model size to constant. We also demonstrate nearly an order of magnitude speedup on a complex inverse procedural modeling application.

Characterizing and Adapting the Consistency-Latency Tradeoff in Distributed Key-value Stores

The CAP theorem is a fundamental result that applies to distributed storage systems. In this paper, we first present and prove a probabilistic variation of the CAP theorem. We present probabilistic models to characterize the three important elements of the CAP theorem: consistency (C), availability or latency (A), and partition-tolerance (P). Then, we provide \textit{quantitative} characterization of the tradeoff among these three elements. Next, we leverage this result to present a new system, called PCAP, which allows applications running on a single data-center to specify either a latency SLA or a consistency SLA. The PCAP system automatically adapts, in real-time and under changing network conditions, to meet the SLA while optimizing the other C/A metric. We incorporate PCAP into two popular key-value stores — Apache Cassandra and Riak. Our experiments with these two deployments, under realistic workloads, reveal that the PCAP system satisfactorily meets SLAs, and performs close to the bounds dictated by our tradeoff analysis. We also extend PCAP from a single data-center to multiple geo-distributed data-centers.

Deep Attributes from Context-Aware Regional Neural Codes

Recently, many researches employ middle-layer output of convolutional neural network models (CNN) as features for different visual recognition tasks. Although promising results have been achieved in some empirical studies, such type of representations still suffer from the well-known issue of semantic gap. This paper proposes so-called deep attribute framework to alleviate this issue from three aspects. First, we introduce object region proposals as intermedia to represent target images, and extract features from region proposals. Second, we study aggregating features from different CNN layers for all region proposals. The aggregation yields a holistic yet compact representation of input images. Results show that cross-region max-pooling of soft-max layer output outperform all other layers. As soft-max layer directly corresponds to semantic concepts, this representation is named ‘deep attributes’. Third, we observe that only a small portion of generated regions by object proposals algorithm are correlated to classification target. Therefore, we introduce context-aware region refining algorithm to pick out contextual regions and build context-aware classifiers. We apply the proposed deep attributes framework for various vision tasks. Extensive experiments are conducted on standard benchmarks for three visual recognition tasks, i.e., image classification, fine-grained recognition and visual instance retrieval. Results show that deep attribute approaches achieve state-of-the-art results, and outperforms existing peer methods with a significant margin, even though some benchmarks have little overlap of concepts with the pre-trained CNN models.

Evolving TSP heuristics using Multi Expression Programming

Multi Expression Programming (MEP) is an evolutionary technique that may be used for solving computationally difficult problems. MEP uses a linear solution representation. Each MEP individual is a string encoding complex expressions (computer programs). A MEP individual may encode multiple solutions of the current problem. In this paper MEP is used for evolving a Traveling Salesman Problem (TSP) heuristic for graphs satisfying triangle inequality. Evolved MEP heuristic is compared with Nearest Neighbor Heuristic (NN) and Minimum Spanning Tree Heuristic (MST) on some difficult problems in TSPLIB. For most of the considered problems the evolved MEP heuristic outperforms NN and MST. The obtained algorithm was tested against some problems in TSPLIB. The results emphasizes that evolved MEP heuristic is a powerful tool for solving difficult TSP instances.

Improved Twitter Sentiment Prediction through Cluster-then-Predict Model

Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Facebook, Twitter, YouTube and also mobile applications such as WhatsApp, Line, etc. Many companies have identified these resources as a rich mine of marketing knowledge. This knowledge provides valuable feedback which allows them to further develop the next generation of their product. In this paper, sentiment analysis of a product is performed by extracting tweets about that product and classifying the tweets showing it as positive and negative sentiment. The authors propose a hybrid approach which combines unsupervised learning in the form of K-means clustering to cluster the tweets and then performing supervised learning methods such as Decision Trees and Support Vector Machines for classification.

linalg: Matrix Computations in Apache Spark

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark comes with the mllib.linalg library, which provides abstractions and implementations for distributed matrices. Using these abstractions, we highlight the computations that were more challenging to distribute. When translating single-node algorithms to run on a distributed cluster, we observe that often a simple idea is enough: separating matrix operations from vector operations and shipping the matrix operations to be ran on the cluster, while keeping vector operations local to the driver. In the case of the Singular Value Decomposition, by taking this idea to an extreme, we are able to exploit the computational power of a cluster, while running code written decades ago for a single core. We conclude with a comprehensive set of benchmarks for hardware accelerated matrix computations from the JVM, which is interesting in its own right, as many cluster programming frameworks use the JVM.

Optimizing Static and Adaptive Probing Schedules for Rapid Event Detection

We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failure on large systems with many individual machines. We consider a large system consists of many nodes, where each node has its own rate of generating new events, or items. A monitoring application can probe a small number of nodes at each step, and our goal is to compute a probing schedule that minimizes the expected number of undiscovered items at the system, or equivalently, minimizes the expected time to discover a new item in the system. We study the Schedule Optimization problem both for deterministic and randomized memoryless algorithms. We provide lower bounds on the cost of an optimal schedule and construct close to optimal schedules with rigorous mathematical guarantees. Finally, we present an adaptive algorithm that starts with no prior information on the system and converges to the optimal memoryless algorithms by adapting to observed data.

Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

The goal of entity linking is to map spans of text to canonical entity representations such as Freebase entries or Wikipedia articles. It provides a foundation for various natural language processing tasks, including text understanding, summarization and machine translation. Name ambiguity, word polysemy, context dependencies, and a heavy-tailed distribution of entities contribute to the complexity of this problem. We propose a simple, yet effective, probabilistic graphical model for collective entity linking, which resolves entity links jointly across an entire document. Our model captures local information from linkable token spans (i.e., mentions) and their surrounding context and combines it with a document-level prior of entity co-occurrences. The model is acquired automatically from entity-linked text repositories with a lightweight computational step for parameter adaptation. Loopy belief propagation is used as an efficient approximate inference algorithm. In contrast to state-of-the-art methods, our model is conceptually simple and easy to reproduce. It comes with a small memory footprint and is sufficiently fast for real-time usage. We demonstrate its benefits on a wide range of well-known entity linking benchmark datasets. Our empirical results show the merits of the proposed approach and its competitiveness in comparison to state-of-the-art methods.

A Behavior Analysis-Based Game Bot Detection Approach Considering Various Play Styles

A general non-existence result for linear BSDEs driven by Gaussian processes

A representation theorem on a filtering model with first-passage-type stopping time

A review on symmetry properties of birth-death processes

A unified heuristic and an annotated bibliography for a large class of earliness-tardiness scheduling problems

Annealed scaling for a charged polymer

Central Pattern Generators for the control of robotic systems

Cost, $\ell^2$-Betti numbers and the Sofic entropy of some algebraic actions

Coupling and an application to level-set percolation of the Gaussian free field

Data-selective Transfer Learning for Multi-Domain Speech Recognition

DeepCough: A Deep Convolutional Neural Network in A Wearable Cough Detection System

Diffusion tensor imaging with deterministic error bounds

Dissecting GPU Memory Hierarchy through Microbenchmarking

Edge-enhancing Filters with Negative Weights

Empirical risk minimization is consistent with the mean absolute percentage error

Enhancing Automatically Discovered Multi-level Acoustic Patterns Considering Context Consistency With Applications in Spoken Term Detection

Extended inequalities for weighted Renyi entropy involving generalized Gaussian densities

Fuzzy Jets

Guided Signal Reconstruction with Application to Image Magnification

Improved Second Order Estimation in the Singular Multivariate Normal Model

Intermittency for branching walks with heavy tails

Ising model in clustered scale-free networks

Kriging Metamodels for Bermudan Option Pricing

Large deviations for some corner growth models with inhomogeneity

Level lines of the Gaussian Free Field with general boundary data

Local limits of Galton-Watson trees conditioned on the number of protected nodes

Misspecification in mixed-model based association analysis

Mixed Ehrhart polynomials

Modelling time evolving interactions in networks through a non stationary extension of stochastic block models

More on hypergeometric Levy processes

On the complexity of piecewise affine system identification

On the convergence of the extremal eigenvalues of empirical covariance matrices with dependence

On the enumeration of restricted words over a finite alphabet

On the interval of fluctuation of the singular values of random matrices

On the size of planarly connected crossing graphs

On Wasserstein Two Sample Testing and Related Families of Nonparametric Tests

Per-bucket concurrent rehashing algorithms

Personalized Search

Polychromatic X-ray CT Image Reconstruction and Mass-Attenuation Spectrum Estimation

Properties of the Affine Invariant Ensemble Sampler in high dimensions

Quantum walk speedup of backtracking algorithms

SEP-QN: Scalable and Extensible Proximal Quasi-Newton Method for Dirty Statistical Models

Some variance reduction methods for numerical stochastic homogenization

Space-time transformations and copulae for Diffusion Processes

TDOA denoising for acoustic source localization

Thermal conductivity in harmonic lattices with random collisions

Towards extending the Ahlswede-Khachatrian theorem to cross t-intersecting families

Tropical curves in sandpiles

Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization

Unsupervised Domain Discovery using Latent Dirichlet Allocation for Acoustic Modelling in Speech Recognition

Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity

When the sieve works II

Wikipedia Page View Reflects Web Search Trend

Zero-divisor graphs of lower dismantlable lattices-I