We present a novel notion of outlier, called the Concentration Free Outlier Factor, or CFOF. As a main contribution, we formalize the notion of concentration of outlier scores and theoretically prove that CFOF does not concentrate in the Euclidean space for any arbitrary large dimensionality. To the best of our knowledge, there are no other proposals of data analysis measures related to the Euclidean distance for which it has been provided theoretical evidence that they are immune to the concentration effect. We determine the closed form of the distribution of CFOF scores in arbitrarily large dimensionalities and show that the CFOF score of a point depends on its squared norm standard score and on the kurtosis of the data distribution, thus providing a clear and statistically founded characterization of this notion. Moreover, we leverage this closed form to provide evidence that the definition does not suffer of the hubness problem affecting other measures. We prove that the number of CFOF outliers coming from each cluster is proportional to cluster size and kurtosis, a property that we call semi-locality. We determine that semi-locality characterizes existing reverse nearest neighbor-based outlier definitions, thus clarifying the exact nature of their observed local behavior. We also formally prove that classical distance-based and density-based outliers concentrate both for bounded and unbounded sample sizes and for fixed and variable values of the neighborhood parameter. We introduce the fast-CFOF algorithm for detecting outliers in large high-dimensional dataset. The algorithm has linear cost, supports multi-resolution analysis, and is embarrassingly parallel. Experiments highlight that the technique is able to efficiently process huge datasets and to deal even with large values of the neighborhood parameter, to avoid concentration, and to obtain excellent accuracy.
The prevalence of networked sensors and actuators in many real-world systems such as smart buildings, factories, power plants, and data centers generate substantial amounts of multivariate time series data for these systems. The rich sensor data can be continuously monitored for intrusion events through anomaly detection. However, conventional threshold-based anomaly detection methods are inadequate due to the dynamic complexities of these systems, while supervised machine learning methods are unable to exploit the large amounts of data due to the lack of labeled data. On the other hand, current unsupervised machine learning approaches have not fully exploited the spatial-temporal correlation and other dependencies amongst the multiple variables (sensors/actuators) in the system for detecting anomalies. In this work, we propose an unsupervised multivariate anomaly detection method based on Generative Adversarial Networks (GANs). Instead of treating each data stream independently, our proposed MAD-GAN framework considers the entire variable set concurrently to capture the latent interactions amongst the variables. We also fully exploit both the generator and discriminator produced by the GAN, using a novel anomaly score called DR-score to detect anomalies by discrimination and reconstruction. We have tested our proposed MAD-GAN using two recent datasets collected from real-world CPS: the Secure Water Treatment (SWaT) and the Water Distribution (WADI) datasets. Our experimental results showed that the proposed MAD-GAN is effective in reporting anomalies caused by various cyber-intrusions compared in these complex real-world systems.
Motivated by the success of using black-box predictive algorithms as subroutines for online decision-making, we develop a new framework for designing online policies given access to an oracle providing statistical information about an offline benchmark. Having access to such prediction oracles enables simple and natural Bayesian selection policies, and raises the question as to how these policies perform in different settings. Our work makes two important contributions towards tackling this question: First, we develop a general technique we call *compensated coupling* which can be used to derive bounds on the expected regret (i.e., additive loss with respect to a benchmark) for any online policy and offline benchmark; Second, using this technique, we show that the Bayes Selector has constant expected regret (i.e., independent of the number of arrivals and resource levels) in any online packing and matching problem with a finite type-space. Our results generalize and simplify many existing results for online packing and matching problems, and suggest a promising pathway for obtaining oracle-driven policies for other online decision-making settings.
Internet of Things (IoT) has gained substantial attention over the past years. And the main discussion has been how to process the amount of data that it generates which has lead to the edge computing paradigm. Wether it is called fog1, edge or mist, the principle remains that cloud services must become available closer to clients. This documents presents ongoing work on future edge systems that are built to provide steadfast IoT services to users by bringing storage and processing power closer to peripheral parts of networks. Designing such infrastructures is becoming much more challenging as the number of IoT devices keeps growing. Production grade deployments have to meet very high performance requirements, and end-to-end solutions involve significant investments. In this paper, we aim at providing a solution to extend the range of the edge model to the very farthest nodes in the network. Specifically, we focus on providing reliable storage and computation capabilities immediately on wireless IoT sensor nodes. This extended edge model will allow end users to manage their IoT ecosystem without forcibly relying on gateways or Internet provider solutions. In this document, we introduce Achlys, a prototype implementation of an edge node that is a concrete port of the Lasp programming library on the GRiSP Erlang embedded system. This way, we aim at addressing the need for a general purpose edge that is both resilient and consistent in terms of storage and network. Finally, we study example use cases that could take advantage of integrating the Achlys framework and discuss future work for the latter.
Next generation of embedded Information and Communication Technology (ICT) systems are interconnected collaborative intelligent systems able to perform autonomous tasks. Training and deployment of such systems on Edge devices however require a fine-grained integration of data and tools to achieve high accuracy and overcome functional and non-functional requirements. In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms and deployment tools together. By these means, we are able to interconnect the different entities or stages of particular systems and provide an end-to-end development of AI products. We demonstrate the effectiveness of the AI pipeline by solving an Automatic Speech Recognition challenge and we show that all the steps leading to an end-to-end development for Key-word Spotting tasks: importing, partitioning and pre-processing of speech data, training of different neural network architectures and their deployment on heterogeneous embedded platforms.
Prevailing user authentication schemes on smartphones rely on explicit user interaction, where a user types in a passcode or presents a biometric cue such as face, fingerprint, or iris. In addition to being cumbersome and obtrusive to the users, such authentication mechanisms pose security and privacy concerns. Passive authentication systems can tackle these challenges by frequently and unobtrusively monitoring the user’s interaction with the device. In this paper, we propose a Siamese Long Short-Term Memory network architecture for passive authentication, where users can be verified without requiring any explicit authentication step. We acquired a dataset comprising of measurements from 30 smartphone sensor modalities for 37 users. We evaluate our approach on 8 dominant modalities, namely, keystroke dynamics, GPS location, accelerometer, gyroscope, magnetometer, linear accelerometer, gravity, and rotation sensors. Experimental results find that, within 3 seconds, a genuine user can be correctly verified 97.15% of the time at a false accept rate of 0.1%.
For optimization of a sum of functions in a distributed computing environment, we present a novel communication efficient Newton-type algorithm that enjoys a variety of advantages over similar existing methods. Similar to Newton-MR, our algorithm, DINGO, is derived by optimization of the gradient’s norm as a surrogate function. DINGO does not impose any specific form on the underlying functions, and its application range extends far beyond convexity. In addition, the distribution of the data across the computing environment can be almost arbitrary. Further, the underlying sub-problems of DINGO are simple linear least-squares, for which a plethora of efficient algorithms exist. Lastly, DINGO involves a few hyper-parameters that are easy to tune. Moreover, we theoretically show that DINGO is not sensitive to the choice of its hyper-parameters in that a strict reduction in the gradient norm is guaranteed, regardless of the selected hyper-parameters. We demonstrate empirical evidence of the effectiveness, stability and versatility of our method compared to other relevant algorithms, on both convex and non-convex problems.
SkinnerDB is designed from the ground up for reliable join ordering. It maintains no data statistics and uses no cost or cardinality models. Instead, it uses reinforcement learning to learn optimal join orders on the fly, during the execution of the current query. To that purpose, we divide the execution of a query into many small time slices. Different join orders are tried in different time slices. We merge result tuples generated according to different join orders until a complete result is obtained. By measuring execution progress per time slice, we identify promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We compare expected execution cost against execution cost for an optimal join order. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark and TPC-H variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.
Graph matching is an important and persistent problem in computer vision and pattern recognition for finding node-to-node correspondence between graph-structured data. However, as widely used, graph matching that incorporates pairwise constraints can be formulated as a quadratic assignment problem (QAP), which is NP-complete and results in intrinsic computational difficulties. In this paper, we present a functional representation for graph matching (FRGM) that aims to provide more geometric insights on the problem and reduce the space and time complexities of corresponding algorithms. To achieve these goals, we represent a graph endowed with edge attributes by a linear function space equipped with a functional such as inner product or metric, that has an explicit geometric meaning. Consequently, the correspondence between graphs can be represented as a linear representation map of that functional. Specifically, we reformulate the linear functional representation map as a new parameterization for Euclidean graph matching, which is associative with geometric parameters for graphs under rigid or nonrigid deformations. This allows us to estimate the correspondence and geometric deformations simultaneously. The use of the representation of edge attributes rather than the affinity matrix enables us to reduce the space complexity by two orders of magnitudes. Furthermore, we propose an efficient optimization strategy with low time complexity to optimize the objective function. The experimental results on both synthetic and real-world datasets demonstrate that the proposed FRGM can achieve state-of-the-art performance.
In this paper, we deal with the problem of estimating the intervention effect in the statistical causal analysis using the structural equation model and the causal diagram. The intervention effect is defined as a causal effect on the response variable when the causal variable is fixed to a certain value by an external operation and is defined based on the causal diagram. The intervention effect is defined as a function of the probability distributions in the causal diagram, however, generally these probability distributions are unknown, so it is required to estimate them from data. In other words, the steps of the estimation of the intervention effect using the causal diagram are as follows: 1. Estimate the causal diagram from the data, 2. Estimate the probability distributions in the causal diagram from the data, 3. Calculate the intervention effect. However, if the problem of estimating the intervention effect is formulated in the statistical decision theory framework, estimation with this procedure is not necessarily optimal. In this study, we formulate the problem of estimating the intervention effect for the two cases, the case where the causal diagram is known and the case where it is unknown, in the framework of statistical decision theory and derive the optimal decision method under the Bayesian criterion. We show the effectiveness of the proposed method through numerical simulations.
Sentence embedding is a significant research topic in the field of natural language processing (NLP). Generating sentence embedding vectors reflecting the intrinsic meaning of a sentence is a key factor to achieve an enhanced performance in various NLP tasks such as sentence classification and document summarization. Therefore, various sentence embedding models based on supervised and unsupervised learning have been proposed after the advent of researches regarding the distributed representation of words. They were evaluated through semantic textual similarity (STS) tasks, which measure the degree of semantic preservation of a sentence and neural network-based supervised embedding models generally yielded state-of-the-art performance. However, these models have a limitation in that they have multiple parameters to update, thereby requiring a tremendous amount of labeled training data. In this study, we propose an efficient approach that learns a transition matrix that refines a sentence embedding vector to reflect the latent semantic meaning of a sentence. The proposed method has two practical advantages; (1) it can be applied to any sentence embedding method, and (2) it can achieve robust performance in STS tasks irrespective of the number of training examples.
This paper considers a high dimensional linear regression model with corrected variables. A variety of methods have been developed in recent years, yet it is still challenging to keep accurate estimation when there are complex correlation structures among predictors and the response. We propose an adaptive and ‘reversed’ penalty for regularization to solve this problem. This penalty doesn’t shrink variables but focuses on removing the shrinkage bias and encouraging grouping effect. Combining the l_1 penalty and the Minimax Concave Penalty (MCP), we propose two methods called Smooth Adjustment for Correlated Effects (SACE) and Generalized Smooth Adjustment for Correlated Effects (GSACE). Compared with the traditional adaptive estimator, the proposed methods have less influence from the initial estimator and can reduce the false negatives of the initial estimation. The proposed methods can be seen as linear functions of the new penalty’s tuning parameter, and are shown to estimate the coefficients accurately in both extremely highly correlated variables situation and weakly correlated variables situation. Under mild regularity conditions we prove that the methods satisfy certain oracle property. We show by simulations and applications that the proposed methods often outperforms other methods.
Sequential decision-making (SDM) plays a key role in intelligent robotics, and can be realized in very different ways, such as supervised learning, automated reasoning, and probabilistic planning. The three families of methods follow different assumptions and have different (dis)advantages. In this work, we aim at a robot SDM framework that exploits the complementary features of learning, reasoning, and planning. We utilize long short-term memory (LSTM), for passive state estimation with streaming sensor data, and commonsense reasoning and probabilistic planning (CORPP) for active information collection and task accomplishment. In experiments, a mobile robot is tasked with estimating human intentions using their motion trajectories, declarative contextual knowledge, and human-robot interaction (dialog-based and motion-based). Results suggest that our framework performs better than its no-learning and no-reasoning versions in a real-world office environment.
This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks as well new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.
Domain adaptation has become a prominent problem setting in machine learning and related fields. This review asks the questions: when and how a classifier can learn from a source domain and generalize to a target domain. As for when, we review conditions that allow for cross-domain generalization error bounds. As for how, we present a categorization of approaches, divided into, what we refer to as, sample-based, feature-based and inference-based methods. Sample-based methods focus on weighting individual observations during training based on their importance to the target domain. Feature-based methods focus on mapping, projecting and representing features such that a source classifier performs well on the target domain and inference-based methods focus on alternative estimators, such as robust, minimax or Bayesian. Our categorization highlights recurring ideas and raises a number of questions important to further research.
In this work, we study value function approximation in reinforcement learning (RL) problems with high dimensional state or action spaces via a generalized version of representation policy iteration (RPI). We consider the limitations of proto-value functions (PVFs) at accurately approximating the value function in low dimensions and we highlight the importance of features learning for an improved low-dimensional value function approximation. Then, we adopt different representation learning algorithm on graphs to learn the basis functions that best represent the value function. We empirically show that node2vec, an algorithm for scalable feature learning in networks, and the Variational Graph Auto-Encoder constantly outperform the commonly used smooth proto-value functions in low-dimensionl feature space.
Data competitions rely on real-time leaderboards to rank competitor entries and stimulate algorithm improvement. While such competitions have become quite popular and prevalent, particularly in supervised learning formats, their implementations by the host are highly variable. Without careful planning, a supervised learning competition is vulnerable to overfitting, where the winning solutions are so closely tuned to the particular set of provided data that they cannot generalize to the underlying problem of interest to the host. This paper outlines some important considerations for strategically designing relevant and informative data sets to maximize the learning outcome from hosting a competition based on our experience. It also describes a post-competition analysis that enables robust and efficient assessment of the strengths and weaknesses of solutions from different competitors, as well as greater understanding of the regions of the input space that are well-solved. The post-competition analysis, which complements the leaderboard, uses exploratory data analysis and generalized linear models (GLMs). The GLMs not only expand the range of results we can explore, they also provide more detailed analysis of individual sub-questions including similarities and differences between algorithms across different types of scenarios, universally easy or hard regions of the input space, and different learning objectives. When coupled with a strategically planned data generation approach, the methods provide richer and more informative summaries to enhance the interpretation of results beyond just the rankings on the leaderboard. The methods are illustrated with a recently completed competition to evaluate algorithms capable of detecting, identifying, and locating radioactive materials in an urban environment.
We propose a new architecture that learns to attend to different Convolutional Neural Networks (CNN) layers (i.e., different levels of abstraction) and different spatial locations (i.e., specific layers within a given feature map) in a sequential manner to perform the task at hand. Specifically, at each Recurrent Neural Network (RNN) timestep, a CNN layer is selected and its output is processed by a spatial soft-attention mechanism. We refer to this architecture as the Unified Attention Network (UAN), since it combines the ‘what’ and ‘where’ aspects of attention, i.e., ‘what’ level of abstraction to attend to, and ‘where’ should the network look at. We demonstrate the effectiveness of this approach on two computer vision tasks: (i) image-based camera pose and orientation regression and (ii) indoor scene classification. We evaluate our method on standard benchmarks for camera localization (Cambridge, 7-Scene, and TUM-LSI datasets) and for scene classification (MIT-67 indoor dataset), and show that our method improves upon the results of previous methods. Empirically, we show that combining ‘what’ and ‘where’ aspects of attention improves network performance on both tasks.
The majority of conversations a dialogue agent sees over its lifetime occur after it has already been trained and deployed, leaving a vast store of potential training signal untapped. In this work, we propose the self-feeding chatbot, a dialogue agent with the ability to extract new training examples from the conversations it participates in. As our agent engages in conversation, it also estimates user satisfaction in its responses. When the conversation appears to be going well, the user’s responses become new training examples to imitate. When the agent believes it has made a mistake, it asks for feedback; learning to predict the feedback that will be given improves the chatbot’s dialogue abilities further. On the PersonaChat chit-chat dataset with over 131k training examples, we find that learning from dialogue with a self-feeding chatbot significantly improves performance, regardless of the amount of traditional supervision.
In this paper, we introduce a new data freshness metric, relative Age of Information (rAoI), and examine it in a single server system with various packet management schemes. The (classical) AoI metric was introduced to measure the staleness of status updates at the receiving end with respect to their generation at the source. This metric addresses systems where the timings of update generation at the source are absolute and can be designed separately or jointly with the transmission schedules. In many decentralized applications, transmission schedules are blind to update generation timing, and the transmitter can know the timing of an update packet only after it arrives. As such, an update becomes stale after a new one arrives. The rAoI metric measures how fresh the data is at the receiver with respect to the data at the transmitter. It introduces a particularly explicit dependence on the arrival process in the evaluation of age. We investigate several queuing disciplines and provide closed form expressions for rAoI and numerical comparisons.
In this paper we propose a new training loop for deep reinforcement learning agents with an evolutionary generator. Evolutionary procedural content generation has been used in the creation of maps and levels for games before. Our system incorporates an evolutionary map generator to construct a training curriculum that is evolved to maximize loss within the state-of-the-art Double Dueling Deep Q Network architecture with prioritized replay. We present a case-study in which we prove the efficacy of our new method on a game with a discrete, large action space we made called Attackers and Defenders. Our results demonstrate that training on an evolutionarily-curated curriculum (directed sampling) of maps both expedites training and improves generalization when compared to a network trained on an undirected sampling of maps.
We develop a likelihood free inference procedure for conditioning a probabilistic model on a predicate. A predicate is a Boolean valued function which expresses a yes/no question about a domain. Our contribution, which we call predicate exchange, constructs a softened predicate which takes value in the unit interval [0, 1] as opposed to a simply true or false. Intuitively, 1 corresponds to true, and a high value (such as 0.999) corresponds to ‘nearly true’ as determined by a distance metric. We define Boolean algebra for soft predicates, such that they can be negated, conjoined and disjoined arbitrarily. A softened predicate can serve as a tractable proxy to a likelihood function for approximate posterior inference. However, to target exact inference, we temper the relaxation by a temperature parameter, and add a accept/reject phase use to replica exchange Markov Chain Mont Carlo, which exchanges states between a sequence of models conditioned on predicates at varying temperatures. We describe a lightweight implementation of predicate exchange that it provides a language independent layer that can be implemented on top of existingn modeling formalisms.