Multi-label active learning is a hot topic in reducing the label cost by optimally choosing the most valuable instance to query its label from an oracle. In this paper, we consider the poolbased multi-label active learning under the crowdsourcing setting, where during the active query process, instead of resorting to a high cost oracle for the ground-truth, multiple low cost imperfect annotators with various expertise are available for labeling. To deal with this problem, we propose the MAC (Multi-label Active learning from Crowds) approach which incorporate the local influence of label correlations to build a probabilistic model over the multi-label classifier and annotators. Based on this model, we can estimate the labels for instances as well as the expertise of each annotator. Then we propose the instance selection and annotator selection criteria that consider the uncertainty/diversity of instances and the reliability of annotators, such that the most reliable annotator will be queried for the most valuable instances. Experimental results demonstrate the effectiveness of the proposed approach.
We propose a new data-centric synchronization framework for carrying out of machine learning (ML) tasks in a distributed environment. Our framework exploits the iterative nature of ML algorithms and relaxes the application agnostic bulk synchronization parallel (BSP) paradigm that has previously been used for distributed machine learning. Data-centric synchronization complements function-centric synchronization based on using stale updates to increase the throughput of distributed ML computations. Experiments to validate our framework suggest that we can attain substantial improvement over BSP while guaranteeing sequential correctness of ML tasks.
Perceptron is a classic online algorithm for learning a classification function. We provide the first extension of the perceptron algorithm to the learning to rank problem under popular performance measures such as NDCG and MAP. A modern perspective on perceptron in classification is to interpret it as online gradient descent (OGD), during mistake rounds, on the hinge loss function. Motivated by this interpretation, we propose a novel family of \emph{listwise} large margin ranking surrogates. Members of this family can be thought of as analogues of the hinge loss. Using the proposed family, we provide a guaranteed upper bound on the cumulative NDCG (or MAP) induced loss incurred by our perceptron like algorithm. In a guarantee reminiscent of Novikoff’s convergence theorem for the classic perceptron, we show that if there exists an oracle ranker which can correctly rank each instance in an online sequence of ranking data, our perceptron algorithm makes no more than a finite number of mistakes on that sequence. We provide empirical results on large commercial datasets to corroborate our theoretical results.
The fundamental problem underlying all multi-criteria decision analysis (MCDA) problems is that of dominance between any two alternatives: ‘Given two alternatives A and B, each described by a set criteria, is A preferred to B with respect to a set of decision maker (DM) preferences over the criteria?’. Depending on the application in which MCDA is performed, the alternatives may represent strategies and policies for business, potential locations for setting up new facilities, designs of buildings, etc. The general objective of MCDA is to enable the DM to order all alternatives in order of the stated preferences, and choose the ones that are best, i.e., optimal with respect to the preferences over the criteria. This article presents and summarizes a recently developed MCDA framework that orders the set of alternatives when the relative importance preferences are incomplete, imprecise, or qualitative in nature.
We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output ‘No’. We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2\log{k} / m+n \text{polylog} m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2\text{polylog} {m})$ space in total. 3) Next we give a randomised $(1+\epsilon)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2\text{polylog} m / \epsilon^2)$ space and runs in $O(\text{polylog} m / \epsilon^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2\text{polylog} {m})$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k}\log{k} + \text{polylog} {m})$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).