As the Industrial Internet of Things (IIoT) grows, systems are increasingly being monitored by arrays of sensors returning time-series data at ever-increasing ‘volume, velocity and variety’ (i.e. Industrial Big Data). An obvious use for these data is real-time systems condition monitoring and prognostic time to failure analysis (remaining useful life, RUL). (e.g. See white papers by Senseye.io, and output of the NASA Prognostics Center of Excellence (PCoE).) However, as noted by Agrawal and Choudhary ‘Our ability to collect ‘big data’ has greatly surpassed our capability to analyze it, underscoring the emergence of the fourth paradigm of science, which is data-driven discovery.’ In order to fully utilize the potential of Industrial Big Data we need data-driven techniques that operate at scales that process models cannot. Here we present a prototype technique for data-driven anomaly detection to operate at industrial scale. The method generalizes to application with almost any multivariate dataset based on independent ordinations of repeated (bootstrapped) partitions of the dataset and inspection of the joint distribution of ordinal distances.
We present efficient streaming algorithms to compute two commonly used anomaly measures: the rank-$k$ leverage scores (aka Mahalanobis distance) and the rank-$k$ projection distance, in the row-streaming model. We show that commonly used matrix sketching techniques such as the Frequent Directions sketch and random projections can be used to approximate these measures. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.
Predictive process monitoring has recently gained traction in academia and is maturing also in companies. However, with the growing body of research, it might be daunting for companies to navigate in this domain in order to find, provided certain data, what can be predicted and what methods to use. The main objective of this paper is developing a value-driven framework for classifying existing work on predictive process monitoring. This objective is achieved by systematically identifying, categorizing, and analyzing existing approaches for predictive process monitoring. The review is then used to develop a value-driven framework that can support organizations to navigate in the predictive process monitoring field and help them to find value and exploit the opportunities enabled by these analysis techniques.
We study the problem of generating interpretable and verifiable policies through reinforcement learning. Unlike the popular Deep Reinforcement Learning (DRL) paradigm, in which the policy is represented by a neural network, the aim in Programmatically Interpretable Reinforcement Learning is to find a policy that can be represented in a high-level programming language. Such programmatic policies have the benefits of being more easily interpreted than neural networks, and being amenable to verification by symbolic methods. We propose a new method, called Neurally Directed Program Search (NDPS), for solving the challenging nonsmooth optimization problem of finding a programmatic policy with maxima reward. NDPS works by first learning a neural policy network using DRL, and then performing a local search over programmatic policies that seeks to minimize a distance from this neural ‘oracle’. We evaluate NDPS on the task of learning to drive a simulated car in the TORCS car-racing environment. We demonstrate that NDPS is able to discover human-readable policies that pass some significant performance bars. We also find that a well-designed policy language can serve as a regularizer, and result in the discovery of policies that lead to smoother trajectories and are more easily transferred to environments not encountered during training.
Deep networks have achieved impressive results across a variety of important tasks. However a known weakness is a failure to perform well when evaluated on data which differ from the training distribution, even if these differences are very small, as is the case with adversarial examples. We propose Fortified Networks, a simple transformation of existing networks, which fortifies the hidden layers in a deep network by identifying when the hidden states are off of the data manifold, and maps these hidden states back to parts of the data manifold where the network performs well. Our principal contribution is to show that fortifying these hidden states improves the robustness of deep networks and our experiments (i) demonstrate improved robustness to standard adversarial attacks in both black-box and white-box threat models; (ii) suggest that our improvements are not primarily due to the gradient masking problem and (iii) show the advantage of doing this fortification in the hidden layers instead of the input space.
Deep learning models, while effective and versatile, are becoming increasingly complex, often including multiple overlapping networks of arbitrary depths, multiple objectives and non-intuitive training methodologies. This makes it increasingly difficult for researchers and practitioners to design, train and understand them. In this paper we present ANNETT-O, a much-needed, generic and computer-actionable vocabulary for researchers and practitioners to describe their deep learning configurations, training procedures and experiments. The proposed ontology focuses on topological, training and evaluation aspects of complex deep neural configurations, while keeping peripheral entities more succinct. Knowledge bases implementing ANNETT-O can support a wide variety of queries, providing relevant insights to users. In addition to a detailed description of the ontology, we demonstrate its suitability to the task via a number of hypothetical use-cases of increasing complexity.
Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where most queries are ad-hoc, not based on predefined reports, and built incrementally. As a main contribution, we present a novel benchmark called IDEBench that can be used to evaluate the performance of database systems for IDE workloads. As opposed to traditional benchmarks for analytical database systems, our goal is to provide more meaningful workloads and datasets that can be used to benchmark IDE query engines, with a particular focus on metrics that capture the trade-off between query performance and quality of the result. As a second contribution, this paper evaluates and discusses the performance results of selected IDE query engines using our benchmark. The study includes two commercial systems, as well as two research prototypes (IDEA, approXimateDB/XDB), and one traditional analytical database system (MonetDB).
Geo-distributed data analysis in a cloud-edge system is emerging as a daily demand. Out of saving time in wide area data transfer, some tasks are dispersed to the edge clusters satisfied data locality. However, execution in the edge clusters is less well, due to limited resource, overload interference and cluster-level unreachable troubles, which obstructs the guarantee on the speed and completion of jobs. Synthesizing the impact of cluster heterogeneity and costly inter-cluster data fetch, we expect to make effective copies across clusters for tasks to provide both success and efficiency of the arriving jobs. To this end, we design PingAn, an online insurance algorithm making redundance across-cluster copies for tasks, promising $(1+\varepsilon)-speed \, o(\frac{1}{\varepsilon^2+\varepsilon})-competitive$ in sum of the job flowtimes. PingAn shares resource among a part of jobs with an adjustable $\varepsilon$ fraction to fit the system load condition and insures for tasks following efficiency-first reliability-aware principle to optimize the effect of copies on jobs’ performance. Trace-driven simulations demonstrate that PingAn can reduce the average job flowtimes by at least $14\%$ more than the state-of-the-art speculation mechanisms. We also build PingAn in Spark on Yarn System to verify its practicality and generality. Experiments show that PingAn can reduce the average job completion time by up to $40\%$ comparing to the default Spark execution.
A flexible infrastructure for normative reasoning is outlined. A small-scale demonstrator version of the envisioned system has been implemented in the proof assistant Isabelle/HOL by utilising the first authors universal logical reasoning approach based on shallow semantical embeddings in meta-logic HOL. The need for such a flexible reasoning infrastructure is motivated and illustrated with a contrary-to-duty example scenario selected from the General Data Protection Regulation.
Research must be reproducible in order to make an impact on science and to contribute to the body of knowledge in our field. Yet studies have shown that 70% of research from academic labs cannot be reproduced. In software engineering, and more specifically requirements engineering (RE), reproducible research is rare, with datasets not always available or methods not fully described. This lack of reproducible research hinders progress, with researchers having to replicate an experiment from scratch. A researcher starting out in RE has to sift through conference papers, finding ones that are empirical, then must look through the data available from the empirical paper (if any) to make a preliminary determination if the paper can be reproduced. This paper addresses two parts of that problem, identifying RE papers and identifying empirical papers within the RE papers. Recent RE and empirical conference papers were used to learn features and to build an automatic classifier to identify RE and empirical papers. We introduce the Empirical Requirements Research Classifier (ERRC) method, which uses natural language processing and machine learning to perform supervised classification of conference papers. We compare our method to a baseline keyword-based approach. To evaluate our approach, we examine sets of papers from the IEEE Requirements Engineering conference and the IEEE International Symposium on Software Testing and Analysis. We found that the ERRC method performed better than the baseline method in all but a few cases.
The present work introduces the hybrid consensus alternating direction method of multipliers (H-CADMM), a novel framework for optimization over networks which unifies existing distributed optimization approaches, including the centralized and the decentralized consensus ADMM. H-CADMM provides a flexible tool that leverages the underlying graph topology in order to achieve a desirable sweet-spot between node-to-node communication overhead and rate of convergence — thereby alleviating known limitations of both C-CADMM and D-CADMM. A rigorous analysis of the novel method establishes linear convergence rate, and also guides the choice of parameters to optimize this rate. The novel hybrid update rules of H-CADMM lend themselves to ‘in-network acceleration’ that is shown to effect considerable — and essentially ‘free-of-charge’ — performance boost over the fully decentralized ADMM. Comprehensive numerical tests validate the analysis and showcase the potential of the method in tackling efficiently, widely useful learning tasks.
Traditionally, deep learning algorithms update the network weights whereas the network architecture is chosen manually, using a process of trial and error. In this work, we propose two novel approaches that automatically update the network structure while also learning its weights. The novelty of our approach lies in our parameterization where the depth, or additional complexity, is encapsulated continuously in the parameter space through control parameters that add additional complexity. We propose two methods: In tunnel networks, this selection is done at the level of a hidden unit, and in budding perceptrons, this is done at the level of a network layer; updating this control parameter introduces either another hidden unit or another hidden layer. We show the effectiveness of our methods on the synthetic two-spirals data and on two real data sets of MNIST and MIRFLICKR, where we see that our proposed methods, with the same set of hyperparameters, can correctly adjust the network complexity to the task complexity.
Principal component analysis (PCA) is often used for analysing data in the most diverse areas. In this work, we report an integrated approach to several theoretical and practical aspects of PCA. We start by providing, in an intuitive and accessible manner, the basic principles underlying PCA and its applications. Next, we present a systematic, though no exclusive, survey of some representative works illustrating the potential of PCA applications to a wide range of areas. An experimental investigation of the ability of PCA for variance explanation and dimensionality reduction is also developed, which confirms the efficacy of PCA and also shows that standardizing or not the original data can have important effects on the obtained results. Overall, we believe the several covered issues can assist researchers from the most diverse areas in using and interpreting PCA.
We generalise Spatial Transformer Networks (STN) by replacing the parametric transformation of a fixed, regular sampling grid with a deformable, statistical shape model which is itself learnt. We call this a Statistical Transformer Network (StaTN). By training a network containing a StaTN end-to-end for a particular task, the network learns the optimal nonrigid alignment of the input data for the task. Moreover, the statistical shape model is learnt with no direct supervision (such as landmarks) and can be reused for other tasks. Besides training for a specific task, we also show that a StaTN can learn a shape model using generic loss functions. This includes a loss inspired by the minimum description length principle in which an appearance model is also learnt from scratch. In this configuration, our model learns an active appearance model and a means to fit the model from scratch with no supervision at all, even identity labels.
State-of-the-art machine learning algorithms demonstrate close to absolute performance in selected challenges. We provide arguments that the reason can be in low variability of the samples and high effectiveness in learning typical patterns. Due to this fact, standard performance metrics do not reveal model capacity and new metrics are required for the better understanding of state-of-the-art.
Modern threats have emerged from the prevalence of social networks. Hostile actors, such as extremist groups or foreign governments, utilize these networks to run propaganda campaigns with different aims. For extremists, these campaigns are designed for recruiting new members or inciting violence. For foreign governments, the aim may be to create instability in rival nations. Proper social network counter-measures are needed to combat these threats. Here we present one important counter-measure: penetrating social networks. This means making target users connect with or follow agents deployed in the social network. Once such connections are established with the targets, the agents can influence them by sharing content which counters the influence campaign. In this work we study how to penetrate a social network, which we call the follow-back problem. The goal here is to find a policy that maximizes the number of targets that follow the agent. We conduct an empirical study to understand what behavioral and network features affect the probability of a target following an agent. We find that the degree of the target and the size of the mutual neighborhood of the agent and target in the network affect this probability. Based on our empirical findings, we then propose a model for targets following an agent. Using this model, we solve the follow-back problem exactly on directed acyclic graphs and derive a closed form expression for the expected number of follows an agent receives under the optimal policy. We then formulate the follow-back problem on an arbitrary graph as an integer program. To evaluate our integer program based policies, we conduct simulations on real social network topologies in Twitter. We find that our polices result in more effective network penetration, with significant increases in the expected number of targets that follow the agent.
Generative Adversarial Networks (GANs) have been promising in the field of image generation, however, they have been hard to train for language generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them which causes high levels of instability in training GANs. Consequently, past work has resorted to pre-training with maximum-likelihood or training GANs without pre-training with a WGAN objective with a gradient penalty. In this study, we present a comparison of those approaches. Furthermore, we present the results of some experiments that indicate better training and convergence of Wasserstein GANs (WGANs) when a weaker regularization term is enforcing the Lipschitz constraint.
Self Organizing Map is trained using unsupervised learning to produce a two-dimensional discretized representation of input space of the training cases. Growing Hierarchical SOM is an architecture which grows both in a hierarchical way representing the structure of data distribution and in a horizontal way representation the size of each individual maps. The control method of the growing degree of GHSOM by pruning off the redundant branch of hierarchy in SOM is proposed in this paper. Moreover, the interface tool for the proposed method called interactive GHSOM is developed. We discuss the computation results of Iris data by using the developed tool.
In the framework of convolutional neural networks that lie at the heart of deep learning, downsampling is often performed with a max-pooling operation that however completely discards the information from other activations in a pooling region. To address this issue, a novel pooling scheme, Ordinal Pooling Network (OPN), is introduced in this work. OPN rearranges all the elements of a pooling region in a sequence and assigns different weights to all the elements based upon their orders in the sequence, where the weights are learned via the gradient-based optimisation. The results of our small-scale experiments on image classification task on MNIST database demonstrate that this scheme leads to a consistent improvement in the accuracy over max-pooling operation. This improvement is expected to increase in the deep networks, where several layers of pooling become necessary.
Automated process discovery is a class of process mining methods that allow analysts to extract business process models from event logs. Traditional process discovery methods extract process models from a snapshot of an event log stored in its entirety. In some scenarios, however, events keep coming with a high arrival rate to the extent that it is impractical to store the entire event log and to continuously re-discover a process model from scratch. Such scenarios require online process discovery approaches. Given an event stream produced by the execution of a business process, the goal of an online process discovery method is to maintain a continuously updated model of the process with a bounded amount of memory while at the same time achieving similar accuracy as offline methods. However, existing online discovery approaches require relatively large amounts of memory to achieve levels of accuracy comparable to that of offline methods. Therefore, this paper proposes an approach that addresses this limitation by mapping the problem of online process discovery to that of cache memory management, and applying well-known cache replacement policies to the problem of online process discovery. The approach has been implemented in .NET, experimentally integrated with the Minit process mining tool and comparatively evaluated against an existing baseline using real-life datasets.
Given a regular expression $R$ and a string $Q$ the regular expression matching problem is to determine if $Q$ is a member of the language generated by $R$. The classic textbook algorithm by Thompson [C. ACM 1968] constructs and simulates a non-deterministic finite automaton in $O(nm)$ time and $O(m)$ space, where $n$ and $m$ are the lengths of the string and the regular expression, respectively. Assuming the strong exponential time hypothesis Backurs and Indyk [FOCS 2016] showed that this result is nearly optimal. However, for most applications determining membership is insufficient and we need to compute \emph{how we match}, i.e., to identify or replace matches or submatches in the string. Using backtracking we can extend Thompson’s algorithm to solve this problem, called regular expression parsing, in the same asymptotic time but with a blow up in space to $\Omega(nm)$. Surprisingly, all existing approaches suffer the same or a similar quadratic blow up in space and no known solutions for regular expression parsing significantly improve this gap between matching and parsing. In this paper, we overcome this gap and present a new algorithm for regular expression parsing using $O(nm)$ time and $O(n + m)$ space. To achieve our result, we develop a novel divide and conquer approach similar in spirit to the classic divide and conquer technique by Hirshberg [C. ACM 1975] for computing a longest common subsequence of two strings in quadratic time and linear space. We show how to carefully decompose the problem to handle cyclic interactions in the automaton leading to a subproblem construction of independent interest. Finally, we generalize our techniques to convert other existing state-set transition algorithms for matching to parsing using only linear space.
Recently, dense connections have attracted substantial attention in computer vision because they facilitate gradient flow and implicit deep supervision during training. Particularly, DenseNet, which connects each layer to every other layer in a feed-forward fashion, has shown impressive performances in natural image classification tasks. We propose HyperDenseNet, a 3D fully convolutional neural network that extends the definition of dense connectivity to multi-modal segmentation problems. Each imaging modality has a path, and dense connections occur not only between the pairs of layers within the same path, but also between those across different paths. This contrasts with the existing multi-modal CNN approaches, in which modeling several modalities relies entirely on a single joint layer (or level of abstraction) for fusion, typically either at the input or at the output of the network. Therefore, the proposed network has total freedom to learn more complex combinations between the modalities, within and in-between all the levels of abstraction, which increases significantly the learning representation. We report extensive evaluations over two different and highly competitive multi-modal brain tissue segmentation challenges, iSEG 2017 and MRBrainS 2013, with the former focusing on 6-month infant data and the latter on adult images. HyperDenseNet yielded significant improvements over many state-of-the-art segmentation networks, ranking at the top on both benchmarks. We further provide a comprehensive experimental analysis of features re-use, which confirms the importance of hyper-dense connections in multi-modal representation learning. Our code is publicly available at https://…/HyperDenseNet.
This paper aims to develop an optimality theory for linear discriminant analysis in the high-dimensional setting. A data-driven and tuning free classification rule, which is based on an adaptive constrained $\ell_1$ minimization approach, is proposed and analyzed. Minimax lower bounds are obtained and this classification rule is shown to be simultaneously rate optimal over a collection of parameter spaces. In addition, we consider classification with incomplete data under the missing completely at random (MCR) model. An adaptive classifier with theoretical guarantees is introduced and optimal rate of convergence for high-dimensional linear discriminant analysis under the MCR model is established. The technical analysis for the case of missing data is much more challenging than that for the complete data. We establish a large deviation result for the generalized sample covariance matrix, which serves as a key technical tool and can be of independent interest. An application to lung cancer and leukemia studies is also discussed.
Nearest neighbor search and k-nearest neighbor graph construction are two fundamental issues arise from many disciplines such as information retrieval, data-mining, machine learning and computer vision. Despite continuous efforts have been taken in the last several decades, these two issues remain challenging. They become more and more imminent given the big data emerges in various fields and has been expanded significantly over the years. In this paper, a simple but effective solution both for k-nearest neighbor search and k-nearest neighbor graph construction is presented. Namely, these two issues are addressed jointly. On one hand, the k-nearest neighbor graph construction is treated as a nearest neighbor search task. Each data sample along with its k-nearest neighbors are joined into the k-nearest neighbor graph by sequentially performing the nearest neighbor search on the graph under construction. On the other hand, the built k-nearest neighbor graph is used to support k-nearest neighbor search. Since the graph is built online, dynamic updating of the graph, which is not desirable from most of the existing solutions, is supported. Moreover, this solution is feasible for various distance measures. Its effectiveness both as a k-nearest neighbor construction and k-nearest neighbor search approach is verified across various datasets in different scales, various dimensions and under different metrics.
Data clustering is a common unsupervised learning method frequently used in exploratory data analysis. However, identifying relevant structures in unlabeled, high-dimensional data is nontrivial, requiring iterative experimentation with clustering parameters as well as data features and instances. The space of possible clusterings for a typical dataset is vast, and navigating in this vast space is also challenging. The absence of ground-truth labels makes it impossible to define an optimal solution, thus requiring user judgment to establish what can be considered a satisfiable clustering result. Data scientists need adequate interactive tools to effectively explore and navigate the large space of clusterings so as to improve the effectiveness of exploratory clustering analysis. We introduce \textit{Clustrophile 2}, a new interactive tool for guided clustering analysis. \textit{Clustrophile 2} guides users in clustering-based exploratory analysis, adapts user feedback to improve user guidance, facilitates the interpretation of clusters, and helps quickly reason about differences between clusterings. To this end, \textit{Clustrophile 2} contributes a novel feature, the clustering tour, to help users choose clustering parameters and assess the quality of different clustering results in relation to current analysis goals and user expectations. We evaluate \textit{Clustrophile 2} through a user study with 12 data scientists, who used our tool to explore and interpret sub-cohorts in a dataset of Parkinson’s disease patients. Results suggest that \textit{Clustrophile 2} improves the speed and effectiveness of exploratory clustering analysis for both experts and non-experts.
Rapidly creating effective visualizations using expressive grammars is challenging for users who have limited time and limited skills in statistics and data visualization. Even high-level, dedicated visualization tools often require users to manually select among data attributes, decide which transformations to apply, and specify mappings between visual encoding variables and raw or transformed attributes. In this paper, we introduce Data2Vis, a neural translation model, for automatically generating visualizations from given datasets. We formulate visualization generation as a sequence to sequence translation problem where data specification is mapped to a visualization specification in a declarative language (Vega-Lite). To this end, we train a multilayered Long Short-Term Memory (LSTM) model with attention on a corpus of visualization specifications. Qualitative results show that our model learns the vocabulary and syntax for a valid visualization specification, appropriate transformations (count, bins, mean) and how to use common data selection patterns that occur within data visualizations. Our model generates visualizations that are comparable to manually-created visualizations in a fraction of the time, with potential to learn more complex visualization strategies at scale.