Lurking Variable
Lurking variables represent hidden information, and preclude a full understanding of phenomena of interest. Detection is usually based on serendipity — visual detection of unexplained, systematic variation. However, these approaches are doomed to fail if the lurking variables do not vary. …

Instance Selection (IS)
In supervised learning, a training set providing previously known information is used to classify new instances. Training sets commonly store many instances, but some of them are not useful for classification, so acceptable classification rates can be obtained while ignoring the non-useful cases; this process is known as instance selection. Instance selection reduces the training set, which in turn reduces runtimes in the classification and/or training stages of classifiers. …
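
As a rough illustration of the idea above, the sketch below uses a Hart-style condensed nearest neighbour rule, which is one classical instance selection scheme and is not named in the text; it is chosen here only as an assumption for illustration. The helper names `nearest_label` and `condensed_nn` and the toy data are hypothetical.

```python
import math

def nearest_label(point, store):
    """Return the label of the stored instance closest to `point` (1-NN)."""
    closest = min(store, key=lambda item: math.dist(point, item[0]))
    return closest[1]

def condensed_nn(training_set):
    """Keep only the instances that the current store misclassifies.

    Assumption: a Hart-style condensation rule, used purely as an example
    of instance selection; other selection rules are equally possible.
    """
    store = [training_set[0]]
    changed = True
    while changed:
        changed = False
        for features, label in training_set:
            if (features, label) in store:
                continue  # already selected; also avoids looping on duplicates
            if nearest_label(features, store) != label:
                store.append((features, label))
                changed = True
    return store

# Toy training set: two well-separated classes of three instances each.
training_set = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
selected = condensed_nn(training_set)
```

On this toy data the reduced set still classifies every original training instance correctly with a 1-NN rule, while each later query only needs to be compared against the retained instances.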

ACC
We investigate learning algorithms that use similarity queries to approximately solve correlation clustering problems. The input consists of $n$ objects; each pair of objects has a hidden binary similarity score that we can learn through a query. The goal is to use as few queries as possible to partition the objects into clusters so as to achieve the optimal number OPT of disagreements with the scores. Our first set of contributions is algorithmic: we introduce ACC, a simple query-aware variant of an existing algorithm (KwikCluster, with expected error 3OPT but a vacuous $\mathcal{O}(n^2)$ worst-case bound on the number of queries) for which we prove several desirable properties. First, ACC has expected error 3OPT$+ \mathcal{O}(n^3/Q)$ when using $Q < \binom{n}{2}$ queries, and recovers KwikCluster’s bound of 3OPT for $Q=\binom{n}{2}$. Second, ACC accurately recovers every adversarially perturbed latent cluster $C$. Under stronger conditions on $C$, ACC can even be used to recover all clusters exactly with high probability. Third, we show an efficient variant, \aggress, with the same expected error as ACC but using significantly fewer queries on some graphs. We empirically test our algorithms on real-world and synthetic datasets. Our second set of contributions is a nearly complete information-theoretic characterization of the query vs. error trade-off. First, using VC theory, for all $Q = \Omega(n)$ we prove the existence of algorithms with expected error at most OPT$+ n^{5/2}/\sqrt{Q}$, and at most $\widetilde{\mathcal{O}}\big(n^3/Q\big)$ if OPT=0. We then show that any randomized algorithm, when using at most $Q$ queries, must output a clustering with expected cost OPT$+ \Omega\big(n^3/Q\big)$, which matches the upper bound for $Q=\Theta(n)$. For the special case of OPT=0 we prove a weaker lower bound of $\Omega\big(n^2/\sqrt{Q}\big)$. …
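
To make the query setting above concrete, here is a minimal sketch of pivot-based clustering in the style of KwikCluster with a hard cap on the number of similarity queries. It is not ACC itself: the rule used once the budget runs out (all unqueried objects become singletons) is a simplifying assumption, and the names `budgeted_kwik_cluster`, `disagreements`, and `same_group` are hypothetical.

```python
import random
from itertools import combinations

def budgeted_kwik_cluster(nodes, similar, budget, rng):
    """Pivot clustering that stops querying once `budget` queries are used.

    Assumption: after the budget is exhausted, every remaining object ends
    up in its own cluster. This is a simplification, not the ACC rule.
    """
    remaining = list(nodes)
    clusters, queries = [], 0
    while remaining:
        pivot = remaining.pop(rng.randrange(len(remaining)))
        cluster, rest = [pivot], []
        for v in remaining:
            if queries < budget:
                queries += 1  # one similarity query per (pivot, v) pair
                if similar(pivot, v):
                    cluster.append(v)
                    continue
            rest.append(v)
        remaining = rest
        clusters.append(cluster)
    return clusters, queries

def disagreements(clusters, nodes, similar):
    """Count pairs whose clustering status contradicts their similarity score."""
    label = {v: i for i, c in enumerate(clusters) for v in c}
    return sum((label[u] == label[v]) != similar(u, v)
               for u, v in combinations(nodes, 2))

# Toy instance with two latent clusters and consistent scores (OPT = 0).
group = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y"}

def same_group(u, v):
    return group[u] == group[v]

nodes = list(group)
full, used_full = budgeted_kwik_cluster(nodes, same_group, 10, random.Random(0))
none, used_none = budgeted_kwik_cluster(nodes, same_group, 0, random.Random(0))
```

With the full budget of $\binom{5}{2} = 10$ queries the latent clusters are recovered with no disagreements, whereas with a budget of zero every object is a singleton and each similar pair counts as a disagreement, which illustrates the trade-off between queries and error on a small scale.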

HUSP-ULL
High-utility sequential pattern mining is an emerging topic in the field of Knowledge Discovery in Databases. It consists of discovering subsequences having a high utility (importance) in sequences, referred to as high-utility sequential patterns (HUSPs). HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendation, click-stream analysis and scenic route planning. For example, in economics and targeted marketing, understanding the economic behavior of consumers, such as finding credible and reliable information on product profitability, is quite challenging. Several algorithms have been proposed to address this problem by efficiently mining utility-based useful sequential patterns. Nevertheless, the performance of these algorithms can be unsatisfactory in terms of runtime and memory usage due to the combinatorial explosion of the search space for low utility thresholds and large databases. Hence, this paper proposes a more efficient algorithm for the task of high-utility sequential pattern mining, called HUSP-ULL. It utilizes a lexicographic sequence (LS)-tree and a utility-linked (UL)-list structure to discover HUSPs quickly. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of candidate sequences, and to reduce the search space by pruning unpromising candidates early. Extensive experiments on both real-life and synthetic datasets show that the proposed algorithm can effectively and efficiently discover the complete set of HUSPs and outperforms the state-of-the-art algorithms. …
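
The notion of utility used above can be sketched on a toy database. The code below is not HUSP-ULL and uses neither the LS-tree nor the UL-list; it only computes the utility of one candidate pattern by brute force, under the assumption (common in this literature, but not stated in the text) that a pattern's utility in a sequence is the maximum over its occurrences of quantity times unit profit, summed over matched items. The names `pattern_utility`, `profit`, `database`, and `min_utility` are hypothetical.

```python
def pattern_utility(pattern, sequence, profit):
    """Maximum utility of `pattern` over its occurrences in `sequence`.

    `pattern` is a list of single items; `sequence` is a list of itemsets,
    each a dict mapping item -> purchased quantity. Returns None if the
    pattern does not occur. Assumption: utility = quantity * unit profit.
    """
    # best[k] = best utility of matching the first k pattern items so far
    best = [0] + [None] * len(pattern)
    for itemset in sequence:
        # Go through k in descending order so one itemset matches at most
        # one pattern item, keeping matched itemsets strictly ordered.
        for k in range(len(pattern), 0, -1):
            item = pattern[k - 1]
            if item in itemset and best[k - 1] is not None:
                candidate = best[k - 1] + itemset[item] * profit[item]
                if best[k] is None or candidate > best[k]:
                    best[k] = candidate
    return best[len(pattern)]

profit = {"a": 2, "b": 5, "c": 1}  # hypothetical unit profits
database = [
    [{"a": 1, "b": 2}, {"c": 3}, {"a": 4}, {"b": 1}],
    [{"b": 1}, {"a": 2, "c": 1}],
    [{"a": 3}, {"b": 2}, {"b": 4}],
]
pattern = ["a", "b"]
per_sequence = [pattern_utility(pattern, s, profit) for s in database]
total_utility = sum(u for u in per_sequence if u is not None)
min_utility = 30  # hypothetical threshold
is_high_utility = total_utility >= min_utility
```

Enumerating every candidate pattern this way is what becomes infeasible on large databases or with low thresholds, which is the cost that upper-bound-based pruning is meant to avoid.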