In recommendation systems, one is interested in the ranking of the predicted items as opposed to other losses such as the mean squared error. Although a variety of ways to evaluate rankings exist in the literature, here we focus on the Area Under the ROC Curve (AUC) as it widely used and has a strong theoretical underpinning. In practical recommendation, only items at the top of the ranked list are presented to the users. With this in mind, we propose a class of objective functions over matrix factorisations which primarily represent a smooth surrogate for the real AUC, and in a special case we show how to prioritise the top of the list. The objectives are differentiable and optimised through a carefully designed stochastic gradient-descent-based algorithm which scales linearly with the size of the data. In the special case of square loss we show how to improve computational complexity by leveraging previously computed measures. To understand theoretically the underlying matrix factorisation approaches we study both the consistency of the loss functions with respect to AUC, and generalisation using Rademacher theory. The resulting generalisation analysis gives strong motivation for the optimisation under study. Finally, we provide computation results as to the efficacy of the proposed method using synthetic and real data.
In several application areas, such as human computer interaction, surveillance and defence, determining the intent of a tracked object enables systems to aid the user/operator and facilitate effective, possibly automated, decision making. In this paper, we propose a probabilistic inference approach that permits the prediction, well in advance, of the intended destination of a tracked object and its future trajectory. Within the framework introduced here, the observed partial track of the object is modeled as being part of a Markov bridge terminating at its destination, since the target path, albeit random, must end at the intended endpoint. This captures the underlying long term dependencies in the trajectory, as dictated by the object intent. By determining the likelihood of the partial track being drawn from a particular constructed bridge, the probability of each of a number of possible destinations is evaluated. These bridges can also be employed to produce refined estimates of the latent system state (e.g. object position, velocity, etc.), predict its future values (up until reaching the designated endpoint) and estimate the time of arrival. This is shown to lead to a low complexity Kalman-filter-based implementation of the inference routine, where any linear Gaussian motion model, including proposed destination reverting ones, can be applied. Free hand pointing gestures data collected in an instrumented vehicle and synthetic trajectories of a vessel heading towards multiple possible harbours are utilised to demonstrate the effectiveness of the proposed approach.
Cardinality estimation algorithms receive a stream of elements whose order might be arbitrary, with possible repetitions, and return the number of distinct elements. Such algorithms usually seek to minimize the required storage and processing at the price of inaccuracy in their output. Real-world applications of these algorithms are required to process large volumes of monitored data, making it impractical to collect and analyze the entire input stream. In such cases, it is common practice to sample and process only a small part of the stream elements. This paper presents and analyzes a generic algorithm for combining every cardinality estimation algorithm with a sampling process. We show that the proposed sampling algorithm does not affect the estimator’s asymptotic unbiasedness, and we analyze the sampling effect on the estimator’s variance.
In this paper, we propose a model-based clustering method (TVClust) that robustly incorporates noisy side information as soft-constraints and aims to seek a consensus between side information and the observed data. Our method is based on a nonparametric Bayesian hierarchical model that combines the probabilistic model for the data instance and the one for the side-information. An efficient Gibbs sampling algorithm is proposed for posterior inference. Using the small-variance asymptotics of our probabilistic model, we then derive a new deterministic clustering algorithm (RDP-means). It can be viewed as an extension of K-means that allows for the inclusion of side information and has the additional property that the number of clusters does not need to be specified a priori. Empirical studies have been carried out to compare our work with many constrained clustering algorithms from the literature on both a variety of data sets and under a variety of conditions such as using noisy side information and erroneous k values. The results of our experiments show strong results for our probabilistic and deterministic approaches under these conditions when compared to other algorithms in the literature.
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) Classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques, (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for data processing, and the specification and enforcement of MDs.
We consider a multiple depot, multiple vehicle routing problem with fuel constraints. We are given a set of targets, a set of depots and a set of homogeneous vehicles, one for each depot. The depots are also allowed to act as refueling stations. The vehicles are allowed to refuel at any depot, and our objective is to determine a route for each vehicle with a minimum total cost such that each target is visited at least once by some vehicle, and the vehicles never run out fuel as it traverses its route. We refer this problem as Multiple Depot, Fuel-Constrained, Multiple Vehicle Routing Problem (FCMVRP). This paper presents four new mixed integer linear programming formulations to compute an optimal solution for the problem. Extensive computational results for a large set of instances are also presented.
Visualizing NLP annotation is useful for the collection of training data for the statistical NLP approaches. Existing toolkits either provide limited visual aid, or introduce comprehensive operators to realize sophisticated linguistic rules. Workers must be well trained to use them. Their audience thus can hardly be scaled to large amounts of non-expert crowdsourced workers. In this paper, we present CROWDANNO, a visualization toolkit to allow crowd-sourced workers to annotate two general categories of NLP problems: clustering and parsing. Workers can finish the tasks with simplified operators in an interactive interface, and fix errors conveniently. User studies show our toolkit is very friendly to NLP non-experts, and allow them to produce high quality labels for several sophisticated problems. We release our source code and toolkit to spur future research.