AUC Optimisation and Collaborative Filtering

In recommendation systems, one is interested in the ranking of the predicted items as opposed to other losses such as the mean squared error. Although a variety of ways to evaluate rankings exist in the literature, here we focus on the Area Under the ROC Curve (AUC) as it widely used and has a strong theoretical underpinning. In practical recommendation, only items at the top of the ranked list are presented to the users. With this in mind, we propose a class of objective functions over matrix factorisations which primarily represent a smooth surrogate for the real AUC, and in a special case we show how to prioritise the top of the list. The objectives are differentiable and optimised through a carefully designed stochastic gradient-descent-based algorithm which scales linearly with the size of the data. In the special case of square loss we show how to improve computational complexity by leveraging previously computed measures. To understand theoretically the underlying matrix factorisation approaches we study both the consistency of the loss functions with respect to AUC, and generalisation using Rademacher theory. The resulting generalisation analysis gives strong motivation for the optimisation under study. Finally, we provide computation results as to the efficacy of the proposed method using synthetic and real data.

Bayesian Intent Prediction in Object Tracking Using Bridging Distributions

In several application areas, such as human computer interaction, surveillance and defence, determining the intent of a tracked object enables systems to aid the user/operator and facilitate effective, possibly automated, decision making. In this paper, we propose a probabilistic inference approach that permits the prediction, well in advance, of the intended destination of a tracked object and its future trajectory. Within the framework introduced here, the observed partial track of the object is modeled as being part of a Markov bridge terminating at its destination, since the target path, albeit random, must end at the intended endpoint. This captures the underlying long term dependencies in the trajectory, as dictated by the object intent. By determining the likelihood of the partial track being drawn from a particular constructed bridge, the probability of each of a number of possible destinations is evaluated. These bridges can also be employed to produce refined estimates of the latent system state (e.g. object position, velocity, etc.), predict its future values (up until reaching the designated endpoint) and estimate the time of arrival. This is shown to lead to a low complexity Kalman-filter-based implementation of the inference routine, where any linear Gaussian motion model, including proposed destination reverting ones, can be applied. Free hand pointing gestures data collected in an instrumented vehicle and synthetic trajectories of a vessel heading towards multiple possible harbours are utilised to demonstrate the effectiveness of the proposed approach.

Cardinality Estimation Meets Good-Turing

Cardinality estimation algorithms receive a stream of elements whose order might be arbitrary, with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage and processing at the price of inaccuracy in their output. Real-world applications of these algorithms are required to process large volumes of monitored data, making it impractical to collect and analyze the entire input stream. In such cases, it is common practice to sample and process only a small part of the stream elements. This paper presents and analyzes a generic algorithm for combining every cardinality estimation algorithm with a sampling process. We show that the proposed sampling algorithm does not affect the estimator’s asymptotic unbiasedness, and we analyze the sampling effect on the estimator’s variance.

Clustering With Side Information: From a Probabilistic Model to a Deterministic Algorithm

In this paper, we propose a model-based clustering method (TVClust) that robustly incorporates noisy side information as soft-constraints and aims to seek a consensus between side information and the observed data. Our method is based on a nonparametric Bayesian hierarchical model that combines the probabilistic model for the data instance and the one for the side-information. An efficient Gibbs sampling algorithm is proposed for posterior inference. Using the small-variance asymptotics of our probabilistic model, we then derive a new deterministic clustering algorithm (RDP-means). It can be viewed as an extension of K-means that allows for the inclusion of side information and has the additional property that the number of clusters does not need to be specified a priori. Empirical studies have been carried out to compare our work with many constrained clustering algorithms from the literature on both a variety of data sets and under a variety of conditions such as using noisy side information and erroneous k values. The results of our experiments show strong results for our probabilistic and deterministic approaches under these conditions when compared to other algorithms in the literature.

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) Classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques, (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for data processing, and the specification and enforcement of MDs.

Formulations and algorithms for the multiple depot, fuel-constrained, multiple vehicle routing problem

We consider a multiple depot, multiple vehicle routing problem with fuel constraints. We are given a set of targets, a set of depots and a set of homogeneous vehicles, one for each depot. The depots are also allowed to act as refueling stations. The vehicles are allowed to refuel at any depot, and our objective is to determine a route for each vehicle with a minimum total cost such that each target is visited at least once by some vehicle, and the vehicles never run out fuel as it traverses its route. We refer this problem as Multiple Depot, Fuel-Constrained, Multiple Vehicle Routing Problem (FCMVRP). This paper presents four new mixed integer linear programming formulations to compute an optimal solution for the problem. Extensive computational results for a large set of instances are also presented.

Visualizing NLP annotations for Crowdsourcing

Visualizing NLP annotation is useful for the collection of training data for the statistical NLP approaches. Existing toolkits either provide limited visual aid, or introduce comprehensive operators to realize sophisticated linguistic rules. Workers must be well trained to use them. Their audience thus can hardly be scaled to large amounts of non-expert crowdsourced workers. In this paper, we present CROWDANNO, a visualization toolkit to allow crowd-sourced workers to annotate two general categories of NLP problems: clustering and parsing. Workers can finish the tasks with simplified operators in an interactive interface, and fix errors conveniently. User studies show our toolkit is very friendly to NLP non-experts, and allow them to produce high quality labels for several sophisticated problems. We release our source code and toolkit to spur future research.

A method for calculating quantile function and its further use for data fitting

A Neuro-Fuzzy Method to Improving Backfiring Conversion Ratios

A new family of posets generalizing the weak order on some Coxeter groups

A Review of Nonparametric Hypothesis Tests of Isotropy Properties in Spatial Data

Adaptive Multi-GPU Exchange Monte Carlo for the 3D Random Field Ising Model

All-thermal switching of amorphous Gd-Fe alloys: analysis of structural properties and magnetization dynamics

An analysis of numerical issues in neural training by pseudoinversion

An extension of MacMahon’s Equidistribution Theorem to ordered multiset partitions

Better Summarization Evaluation with Word Embeddings for ROUGE

Bijections on m-level Rook Placements

Binomial partial Steiner triple systems with complete graphs: structural problems

Common ancestor type distribution: a Moran model and its deterministic limit

Dense Subset Sum may be the hardest

Deterministic Abelian Sandpile and square-triangle tilings

Extremes of the two-dimensional Gaussian free field with scale-dependent variance

Flexible Expectile Regression in Reproducing Kernel Hilbert Space

Fractional White-Noise Limit and Paraxial Approximation for Waves in Random Media

Improved estimation in a general multivariate elliptical model

Multiple kernel multivariate performance learning using cutting plane algorithm

Nucleation and growth in two dimensions

OCReP: An Optimally Conditioned Regularization for Pseudoinversion Based Neural Training

On distributions of diffusion processes in Hilbert spaces at a fixed moment in time

On the neighbour sum distinguishing index of graphs with bounded maximum average degree

On Zeros of a Polynomial in a Finite Grid

Pathwise Stochastic Calculus with Local Times

Persistent chimera states in nonlocally coupled phase oscillators

Power flow tracing in a simplified highly renewable European electricity network

Prediction of Cyberbullying Incidents on the Instagram Social Network

Recent advances on Dirac-type problems for hypergraphs

Reverse Monte Carlo investigations concerning recent isotopic substitution neutron diffraction data on liquid water

Robot Language Learning, Generation, and Comprehension

Self-intersections of Two-Dimensional Equilateral Random Walks and Polygons

Strong Feller processes with measure-valued drifts

The Open Service Compendium. Business-pertinent Cloud Service Discovery, Assessment, and Selection

The structure of the consecutive pattern poset

Two-level space-time domain decomposition methods for unsteady inverse problems

Unsatisfiable Cores and Lower Bounding for Constraint Programming

Useful bounds on the extreme eigenvalues and vectors of matrices for Harper’s operators