**Probably the Best Itemsets**

One of the main current challenges in itemset mining is to discover a small set of high-quality itemsets. In this paper we propose a new and general approach for measuring the quality of itemsets. The method is solidly founded in Bayesian statistics and decreases monotonically, allowing for efficient discovery of all interesting itemsets. The measure is defined by connecting statistical models and collections of itemsets. This allows us to score individual itemsets with the probability of them occuring in random models built on the data. As a concrete example of this framework we use exponential models. This class of models possesses many desirable properties. Most importantly, Occam’s razor in Bayesian model selection provides a defence for the pattern explosion. As general exponential models are infeasible in practice, we use decomposable models; a large sub-class for which the measure is solvable. For the actual computation of the score we sample models from the posterior distribution using an MCMC approach. Experimentation on our method demonstrates the measure works in practice and results in interpretable and insightful itemsets for both synthetic and real-world data.

**Pipeline Leak Detection Systems and Data Fusion: A Survey**

The pipeline leakage problem is very challenging and critical issue. Solving this problem will save the nation a lot of money, resources and more importantly, it will save the environment. This paper discusses the state-of-the-art of leak detection systems (LDSs) and data fusion approaches that are applicable to pipeline monitoring. A comparison of LDSs is performed based on well-defined criteria. We have classified and critically reviewed these techniques. A thorough analysis and comparison of all the recent works have been tabulated focusing on the capabilities and limitations of each technique.

**Ask Not What AI Can Do, But What AI Should Do: Towards a Framework of Task Delegability**

Although artificial intelligence holds promise for addressing societal challenges, issues of exactly which tasks to automate and the extent to do so remain understudied. We approach the problem of task delegability from a human-centered perspective by developing a framework on human perception of task delegation to artificial intelligence. We consider four high-level factors that can contribute to a delegation decision: motivation, difficulty, risk, and trust. To obtain an empirical understanding of human preferences in different tasks, we build a dataset of 100 tasks from academic papers, popular media portrayal of AI, and everyday life. For each task, we administer a survey to collect judgments of each factor and ask subjects to pick the extent to which they prefer AI involvement. We find little preference for full AI control and a strong preference for machine-in-the-loop designs, in which humans play the leading role. Our framework can effectively predict human preferences in degrees of AI assistance. Among the four factors, trust is the most predictive of human preferences of optimal human-machine delegation. This framework represents a first step towards characterizing human preferences of automation across tasks. We hope this work may encourage and aid in future efforts towards understanding such individual attitudes; our goal is to inform the public and the AI research community rather than dictating any direction in technology development.

**Insertion Transformer: Flexible Sequence Generation via Insertion Operations**

We present the Insertion Transformer, an iterative, partially autoregressive model for sequence generation based on insertion operations. Unlike typical autoregressive models which rely on a fixed, often left-to-right ordering of the output, our approach accommodates arbitrary orderings by allowing for tokens to be inserted anywhere in the sequence during decoding. This flexibility confers a number of advantages: for instance, not only can our model be trained to follow specific orderings such as left-to-right generation or a binary tree traversal, but it can also be trained to maximize entropy over all valid insertions for robustness. In addition, our model seamlessly accommodates both fully autoregressive generation (one insertion at a time) and partially autoregressive generation (simultaneous insertions at multiple locations). We validate our approach by analyzing its performance on the WMT 2014 English-German machine translation task under various settings for training and decoding. We find that the Insertion Transformer outperforms many prior non-autoregressive approaches to translation at comparable or better levels of parallelism, and successfully recovers the performance of the original Transformer while requiring only logarithmically many iterations during decoding.

**Invariant-equivariant representation learning for multi-class data**

Representations learnt through deep neural networks tend to be highly informative, but opaque in terms of what information they learn to encode. We introduce an approach to probabilistic modelling that learns to represent data with two separate deep representations: an invariant representation that encodes the information of the class from which the data belongs, and an equivariant representation that encodes the symmetry transformation defining the particular data point within the class manifold (equivariant in the sense that the representation varies naturally with symmetry transformations). This approach is based primarily on the strategic routing of data through the two latent variables, and thus is conceptually transparent, easy to implement, and in-principle generally applicable to any data comprised of discrete classes of continuous distributions (e.g. objects in images, topics in language, individuals in behavioural data). We demonstrate qualitatively compelling representation learning and competitive quantitative performance, in both supervised and semi-supervised settings, versus comparable modelling approaches in the literature with little fine tuning.

**FSNet: Compression of Deep Convolutional Neural Networks by Filter Summary**

We present a novel method of compression of deep Convolutional Neural Networks (CNNs). The proposed method reduces the number of parameters of each convolutional layer by learning a 3D tensor termed Filter Summary (FS). The convolutional filters are extracted from FS as overlapping 3D blocks, and nearby filters in FS share weights in their overlapping regions in a natural way. The resultant neural network based on such weight sharing scheme, termed Filter Summary CNNs or FSNet, has a FS in each convolution layer instead of a set of independent filters in the conventional convolution layer. FSNet has the same architecture as that of the baseline CNN to be compressed, and each convolution layer of FSNet generates the same number of filters from FS as that of the basline CNN in the forward process. Without hurting the inference speed, the parameter space of FSNet is much smaller than that of the baseline CNN. In addition, FSNet is compatible with weight quantization, leading to even higher compression ratio when combined with weight quantization. Experiments demonstrate the effectiveness of FSNet in compression of CNNs for computer vision tasks including image classification and object detection. For classification task, FSNet of 0.22M effective parameters has prediction accuracy of 93.91% on the CIFAR-10 dataset with less than 0.3% accuracy drop, using ResNet-18 of 11.18M parameters as baseline. Furthermore, FSNet version of ResNet-50 with 2.75M effective parameters achieves the top-1 and top-5 accuracy of 63.80% and 85.72% respectively on ILSVRC-12 benchmark. For object detection task, FSNet is used to compress the Single Shot MultiBox Detector (SSD300) of 26.32M parameters. FSNet of 0.45M effective parameters achieves mAP of 67.63% on the VOC2007 test data with weight quantization, and FSNet of 0.68M parameters achieves mAP of 70.00% with weight quantization on the same test data.

**Discovering Context Effects from Raw Choice Data**

Many applications in preference learning assume that decisions come from the maximization of a stable utility function. Yet a large experimental literature shows that individual choices and judgements can be affected by ‘irrelevant’ aspects of the context in which they are made. An important class of such contexts is the composition of the choice set. In this work, our goal is to discover such choice set effects from raw choice data. We introduce an extension of the Multinomial Logit (MNL) model, called the context dependent random utility model (CDM), which allows for a particular class of choice set effects. We show that the CDM can be thought of as a second-order approximation to a general choice system, can be inferred optimally using maximum likelihood and, importantly, is easily interpretable. We apply the CDM to both real and simulated choice data to perform principled exploratory analyses for the presence of choice set effects.

**Accounting for Significance and Multicollinearity in Building Linear Regression Models**

We derive explicit Mixed Integer Optimization (MIO) constraints, as opposed to iteratively imposing them in a cutting plane framework, that impose significance and avoid multicollinearity for building linear regression models. In this way we extend and improve the research program initiated in Bertsimas and King (2016) that imposes sparsity, robustness, pairwise collinearity and group sparsity explicitly and significance and avoiding multicollinearity iteratively. We present a variety of computational results on real and synthetic datasets that suggest that the proposed MIO has a significant computational edge compared to Bertsimas and King (2016) in accuracy, false detection rate and computational time in accounting for significance and multicollinearity as well as providing a holistic framework to produce regression models with desirable properties a priori.

**Learning Ontologies with Epistemic Reasoning: The EL Case**

We investigate the problem of learning description logic ontologies from entailments via queries, using epistemic reasoning. We introduce a new learning model consisting of epistemic membership and example queries and show that polynomial learnability in this model coincides with polynomial learnability in Angluin’s exact learning model with membership and equivalence queries. We then instantiate our learning framework to EL and show some complexity results for an epistemic extension of EL where epistemic operators can be applied over the axioms. Finally, we transfer known results for EL ontologies and its fragments to our learning model based on epistemic reasoning.

**Automatic dimensionality selection for principal component analysis models with the ignorance score**

Principal component analysis (PCA) is by far the most widespread tool for unsupervised learning with high-dimensional data sets. Its application is popularly studied for the purpose of exploratory data analysis and online process monitoring. Unfortunately, fine-tuning PCA models and particularly the number of components remains a challenging task. Today, this selection is often based on a combination of guiding principles, experience, and process understanding. Unlike the case of regression, where cross-validation of the prediction error is a widespread and trusted approach for model selection, there are no tools for PCA model selection which reach this level of acceptance. In this work, we address this challenge and evaluate the utility of the cross-validated ignorance score with both simulated and experimental data sets. Application of this method is based on the interpretation of PCA as a density model, as in probabilistic principal component analysis, and is shown to be a valuable tool to identify an optimal number of principal components.

**The Influence of Time Series Distance Functions on Climate Networks**

Network theory has established itself as an important tool for the analysis of complex systems such as the climate. In this context, climate networks are constructed using a spatiotemporal climate dataset and a time series distance function. It consists of representing each spatial area by a node and connecting nodes that present similar time series. One fundamental concern when building climate network is the definition of a metric that captures similarity between time series. The majority of papers in the literature use Pearson correlation with or without lag. Here we study the influence of 29 time series distance functions on climate network construction using global temperature data. We observed that the distance functions used in the literature generate similar networks in general while alternative ones generate distinct networks and exhibit different long-distance connection patterns (teleconnections). These patterns are highly important for the study of climate dynamics since they generally represent long-distance transportation of energy and can be used to forecast climatological events. Therefore, we consider that the measures here studied represent an alternative for the analysis of climate systems due to their capability of capturing different underlying dynamics, what may provide a better understanding of global climate.

**PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite**

Tensor methods have gained increasingly attention from various applications, including machine learning, quantum chemistry, healthcare analytics, social network analysis, data mining, and signal processing, to name a few. Sparse tensors and their algorithms become critical to further improve the performance of these methods and enhance the interpretability of their output. This work presents a sparse tensor algorithm benchmark suite (PASTA) for single- and multi-core CPUs. To the best of our knowledge, this is the first benchmark suite for sparse tensor world. PASTA targets on: 1) helping application users to evaluate different computer systems using its representative computational workloads; 2) providing insights to better utilize existed computer architecture and systems and inspiration for the future design. This benchmark suite is publicly released

https://…/pasta.

**Architecture Compression**

In this paper we propose a novel approach to model compression termed Architecture Compression. Instead of operating on the weight or filter space of the network like classical model compression methods, our approach operates on the architecture space. A 1-D CNN encoder-decoder is trained to learn a mapping from discrete architecture space to a continuous embedding and back. Additionally, this embedding is jointly trained to regress accuracy and parameter count in order to incorporate information about the architecture’s effectiveness on the dataset. During the compression phase, we first encode the network and then perform gradient descent in continuous space to optimize a compression objective function that maximizes accuracy and minimizes parameter count. The final continuous feature is then mapped to a discrete architecture using the decoder. We demonstrate the merits of this approach on visual recognition tasks such as CIFAR-10, CIFAR-100, Fashion-MNIST and SVHN and achieve a greater than 20x compression on CIFAR-10.

**Censored Quantile Regression Forests**

Random forests are powerful non-parametric regression method but are severely limited in their usage in the presence of randomly censored observations, and naively applied can exhibit poor predictive performance due to the incurred biases. Based on a local adaptive representation of random forests, we develop its regression adjustment for randomly censored regression quantile models. Regression adjustment is based on new estimating equations that adapt to censoring and lead to quantile score whenever the data do not exhibit censoring. The proposed procedure named censored quantile regression forest, allows us to estimate quantiles of time-to-event without any parametric modeling assumption. We establish its consistency under mild model specifications. Numerical studies showcase a clear advantage of the proposed procedure.

**Mini-batch learning of exponential family finite mixture models**

Mini-batch algorithms have become increasingly popular due to the requirement for solving optimization problems, based on large-scale data sets. Using an existing online expectation–maximization (EM) algorithm framework, we demonstrate how mini-batch (EM) algorithms may be constructed, and propose a scheme for the stochastic stabilization of the constructed mini-batch algorithms. Theoretical results regarding the convergence of these mini-batch EM algorithms are presented. We then demonstrate how the mini-batch framework may be applied to conduct maximum likelihood (ML) estimation of mixtures of exponential family distributions, with emphasis on ML estimation for mixtures of normal distributions. Via a simulation study, we demonstrate that the mini-batch algorithm for mixtures of normal distributions can outperform the standard EM algorithm. Further evidence of the performance of the mini-batch framework is provided via an application to the famous MNIST data set.

**Bayesian Nonparametric Adaptive Spectral Density Estimation for Financial Time Series**

Discrimination between non-stationarity and long-range dependency is a difficult and long-standing issue in modelling financial time series. This paper uses an adaptive spectral technique which jointly models the non-stationarity and dependency of financial time series in a non-parametric fashion assuming that the time series consists of a finite, but unknown number, of locally stationary processes, the locations of which are also unknown. The model allows a non-parametric estimate of the dependency structure by modelling the auto-covariance function in the spectral domain. All our estimates are made within a Bayesian framework where we use aReversible Jump Markov Chain Monte Carlo algorithm for inference. We study the frequentist properties of our estimates via a simulation study, and present a novel way of generating time series data from a nonparametric spectrum. Results indicate that our techniques perform well across a range of data generating processes. We apply our method to a number of real examples and our results indicate that several financial time series exhibit both long-range dependency and non-stationarity.

**Meta-Curvature**

We propose to learn curvature information for better generalization and fast model adaptation, called meta-curvature. Based on the model-agnostic meta-learner (MAML), we learn to transform the gradients in the inner optimization such that the transformed gradients achieve better generalization performance to a new task. For training large scale neural networks, we decompose the curvature matrix into smaller matrices and capture the dependencies of the model’s parameters with a series of tensor products. We demonstrate the effects of our proposed method on both few-shot image classification and few-shot reinforcement learning tasks. Experimental results show consistent improvements on classification tasks and promising results on reinforcement learning tasks. Furthermore, we observe faster convergence rates of the meta-training process. Finally, we present an analysis that explains better generalization performance with the meta-trained curvature.

**Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher**

Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too gigantic to be deployed on edge devices like smart-phones or embedded sensor nodes. There has been efforts to compress these networks, and a popular method is knowledge distillation, where a large (a.k.a. teacher) pre-trained network is used to train a smaller (a.k.a. student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation which employs an intermediate-sized network (a.k.a. teacher assistant) to bridge the gap between the student and the teacher. We study the effect of teacher assistant size and extend the framework to multi-step distillation. Moreover, empirical and theoretical analysis are conducted to analyze the teacher assistant knowledge distillation framework. Extensive experiments on CIFAR-10 and CIFAR-100 datasets and plain CNN and ResNet architectures substantiate the effectiveness of our proposed approach.

**A new simple and effective measure for bag-of-word inter-document similarity measurement**

To measure the similarity of two documents in the bag-of-words (BoW) vector representation, different term weighting schemes are used to improve the performance of cosine similarity—the most widely used inter-document similarity measure in text mining. In this paper, we identify the shortcomings of the underlying assumptions of term weighting in the inter-document similarity measurement task; and provide a more fit-to-the-purpose alternative. Based on this new assumption, we introduce a new simple but effective similarity measure which does not require explicit term weighting. The proposed measure employs a more nuanced probabilistic approach than those used in term weighting to measure the similarity of two documents w.r.t each term occurring in the two documents. Our empirical comparison with the existing similarity measures using different term weighting schemes shows that the new measure produces (i) better results in the binary BoW representation; and (ii) competitive and more consistent results in the term-frequency-based BoW representation.

**Low-pass filtering as Bayesian inference**

We propose a Bayesian nonparametric method for low-pass filtering that can naturally handle unevenly-sampled and noise-corrupted observations. The proposed model is constructed as a latent-factor model for time series, where the latent factors are Gaussian processes with non-overlapping spectra. With this construction, the low-pass version of the time series can be identified as the low-frequency latent component, and therefore it can be found by means of Bayesian inference. We show that the model admits exact training and can be implemented with minimal numerical approximations. Finally, the proposed model is validated against standard linear filters on synthetic and real-world time series.

**Contextual Recurrent Neural Networks**

There is an implicit assumption that by unfolding recurrent neural networks (RNN) in finite time, the misspecification of choosing a zero value for the initial hidden state is mitigated by later time steps. This assumption has been shown to work in practice and alternative initialization may be suggested but often overlooked. In this paper, we propose a method of parameterizing the initial hidden state of an RNN. The resulting architecture, referred to as a Contextual RNN, can be trained end-to-end. The performance on an associative retrieval task is found to improve by conditioning the RNN initial hidden state on contextual information from the input sequence. Furthermore, we propose a novel method of conditionally generating sequences using the hidden state parameterization of Contextual RNN.

**Equivalence of regression curves sharing common parameters**

In clinical trials the comparison of two different populations is a frequently addressed problem. Non-linear (parametric) regression models are commonly used to describe the relationship between covariates as the dose and a response variable in the two groups. In some situations it is reasonable to assume some model parameters to be the same, for instance the placebo effect or the maximum treatment effect. In this paper we develop a (parametric) bootstrap test to establish the similarity of two regression curves sharing some common parameters. We show by theoretical arguments and by means of a simulation study that the new test controls its level and achieves a reasonable power. Moreover, it is demonstrated that under the assumption of common parameters a considerable more powerful test can be constructed compared to the test which does not use this assumption. Finally, we illustrate potential applications of the new methodology by a clinical trial example.

**Quickest Change Detection in the Presence of a Nuisance Change**

In the quickest change detection problem in which both nuisance and critical changes may occur, the objective is to detect the critical change as quickly as possible without raising an alarm when either there is no change or a nuisance change has occurred. A window-limited sequential change detection procedure based on the generalized likelihood ratio test statistic is proposed. A recursive update scheme for the proposed test statistic is developed and is shown to be asymptotically optimal under mild technical conditions. In the scenario where the post-change distribution belongs to a parametrized family, a generalized stopping time and a lower bound on its average run length are derived. The proposed stopping rule is compared with the naive 2-stage procedure that detects the nuisance or critical change using separate CuSum stopping procedures for the nuisance and critical changes. Simulations demonstrate that the proposed rule outperforms the 2-stage procedure, and experiments on a real dataset on bearing failure verify the performance of the proposed stopping time.

**Passing Tests without Memorizing: Two Models for Fooling Discriminators**

We introduce two mathematical frameworks for foolability in the context of generative distribution learning. In a nuthsell, fooling is an algorithmic task in which the input sample is drawn from some target distribution and the goal is to output a synthetic distribution that is indistinguishable from the target w.r.t to some fixed class of tests. This framework received considerable attention in the context of Generative Adversarial Networks (GANs), a recently proposed approach which achieves impressive empirical results. From a theoretical viewpoint this problem seems difficult to model. This is due to the fact that in its basic form, the notion of foolability is susceptible to a type of overfitting called memorizing. This raises a challenge of devising notions and definitions that separate between fooling algorithms that generate new synthetic data vs. algorithms that merely memorize or copy the training set. The first model we consider is called GAM–Foolability and is inspired by GANs. Here the learner has only an indirect access to the target distribution via a discriminator. The second model, called DP–Foolability, exploits the notion of differential privacy as a candidate criterion for non-memorization. We proceed to characterize foolability within these two models and study their interrelations. We show that DP–Foolability implies GAM–Foolability and prove partial results with respect to the converse. It remains, though, an open question whether GAM–Foolability implies DP–Foolability. We also present an application in the context of differentially private PAC learning. We show that from a statistical perspective, for any class H, learnability by a private proper learner is equivalent to the existence of a private sanitizer for H. This can be seen as an analogue of the equivalence between uniform convergence and learnability in classical PAC learning.

**A combinatorial identity**

We prove an interesting combinatorial identity, which we came across in counting contributions from forest graphs to a cluster expansion for classical gas correlation functions, but may be of more general interest.

**The Perils of Detecting Measurement Faults in Environmental Monitoring Networks**

Scientists deploy environmental monitoring net-works to discover previously unobservable phenomena and quantify subtle spatial and temporal differences in the physical quantities they measure. Our experience, shared by others, has shown that measurements gathered by such networks are perturbed by sensor faults. In response, multiple fault detection techniques have been proposed in the literature. However, in this paper we argue that these techniques may misclassify events (e.g. rain events for soil moisture measurements) as faults, potentially discarding the most interesting measurements. We support this argument by applying two commonly used fault detection techniques on data collected from a soil monitoring network. Our results show that in this case, up to 45% of the event measurements are misclassified as faults. Furthermore, tuning the fault detection algorithms to avoid event misclassification, causes them to miss the majority of actual faults. In addition to exposing the tension between fault and event detection, our findings motivate the need to develop novel fault detection mechanisms which incorporate knowledge of the underlying events and are customized to the sensing modality they monitor.

**Assessing the Local Interpretability of Machine Learning Models**

The increasing adoption of machine learning tools has led to calls for accountability via model interpretability. But what does it mean for a machine learning model to be interpretable by humans, and how can this be assessed? We focus on two definitions of interpretability that have been introduced in the machine learning literature: simulatability (a user’s ability to run a model on a given input) and ‘what if’ local explainability (a user’s ability to correctly indicate the outcome to a model under local changes to the input). Through a user study with 1000 participants, we test whether humans perform well on tasks that mimic the definitions of simulatability and ‘what if’ local explainability on models that are typically considered locally interpretable. We find evidence consistent with the common intuition that decision trees and logistic regression models are interpretable and are more interpretable than neural networks. We propose a metric – the runtime operation count on the simulatability task – to indicate the relative interpretability of models and show that as the number of operations increases the users’ accuracy on the local interpretability tasks decreases.

**Multi-Domain Translation by Learning Uncoupled Autoencoders**

Multi-domain translation seeks to learn a probabilistic coupling between marginal distributions that reflects the correspondence between different domains. We assume that data from different domains are generated from a shared latent representation based on a structural equation model. Under this assumption, we show that the problem of computing a probabilistic coupling between marginals is equivalent to learning multiple uncoupled autoencoders that embed to a given shared latent distribution. In addition, we propose a new framework and algorithm for multi-domain translation based on learning the shared latent distribution and training autoencoders under distributional constraints. A key practical advantage of our framework is that new autoencoders (i.e., new domains) can be added sequentially to the model without retraining on the other domains, which we demonstrate experimentally on image as well as genomics datasets.

**Scalable Fair Clustering**

We study the fair variant of the classic

-median problem introduced by Chierichetti et al. [2017]. In the standard

-median problem, given an input pointset

, the goal is to find

centers

and assign each input point to one of the centers in

such that the average distance of points to their cluster center is minimized. In the fair variant of

-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an ‘approximately equal’ number of points of each color. Chierichetti et al. proposed a two-phase algorithm for fair

-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the

-median objective. In the second step, fairlets are merged into

clusters by one of the existing

-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time. In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. Our algorithm additionally allows for finer control over the balance of resulting clusters than the original work. We complement our theoretical bounds with empirical evaluation.

**Semantically Enhanced Time Series Databases in IoT-Edge-Cloud Infrastructure**

Many IoT systems are data intensive and are for the purpose of monitoring for fault detection and diagnosis of critical systems. A large volume of data steadily come out of a large number of sensors in the monitoring system. Thus, we need to consider how to store and manage these data. Existing time series databases (TSDBs) can be used for monitoring data storage, but they do not have good models for describing the data streams stored in the database. In this paper, we develop a semantic model for the specification of the monitoring data streams (time series data) in terms of which sensor generated the data stream, which metric of which entity the sensor is monitoring, what is the relation of the entity to other entities in the system, which measurement unit is used for the data stream, etc. We have also developed a tool suite, SE-TSDB, that can run on top of existing TSDBs to help establish semantic specifications for data streams and enable semantic-based data retrievals. With our semantic model for monitoring data and our SE-TSDB tool suite, users can retrieve non-existing data streams that can be automatically derived from the semantics. Users can also retrieve data streams without knowing where they are. Semantic based retrieval is especially important in a large-scale integrated IoT-Edge-Cloud system, because of its sheer quantity of data, its huge number of computing and IoT devices that may store the data, and the dynamics in data migration and evolution. With better data semantics, data streams can be more effectively tracked and flexibly retrieved to help with timely data analysis and control decision making anywhere and anytime.

**Feature Selection for multi-labeled variables via Dependency Maximization**

Feature selection and reducing the dimensionality of data is an essential step in data analysis. In this work, we propose a new criterion for feature selection that is formulated as conditional theoretical information between features given the labeled variable. Instead of using the standard mutual information measure based on Kullback-Leibler divergence, we use our proposed criterion to filter out redundant features. This approach results in an efficient and fast non-parametric implementation of feature selection as it can be directly estimated using a geometric measure of dependency, the global Friedman-Rafsky (FR) multivariate run test statistic constructed by a global minimal spanning tree (MST). We demonstrate the advantages of our proposed feature selection approach through simulation. In addition, the proposed feature selection method is applied to the MNIST data set and compared with [1].