Having a clear view of events that occurred over time is a difficult objective to achieve in digital investigations (DI). Event reconstruction, which allows investigators to understand the timeline of a crime, is one of the most important step of a DI process. This complex task requires exploration of a large amount of events due to the pervasiveness of new technologies nowadays. Any evidence produced at the end of the investigative process must also meet the requirements of the courts, such as reproducibility, verifiability, validation, etc. For this purpose, we propose a new methodology, supported by theoretical concepts, that can assist investigators through the whole process including the construction and the interpretation of the events describing the case. The proposed approach is based on a model which integrates knowledge of experts from the fields of digital forensics and software development to allow a semantically rich representation of events related to the incident. The main purpose of this model is to allow the analysis of these events in an automatic and efficient way. This paper describes the approach and then focuses on the main conceptual and formal aspects: a formal incident modelization and operators for timeline reconstruction and analysis.
Scientific workflows are powerful tools for management of scalable experiments, often composed of complex tasks running on distributed resources. Existing cyberinfrastructure provides components that can be utilized within repeatable workflows. However, data and computing advances continuously change the way scientific workflows get developed and executed, pushing the scientific activity to be more data-driven, heterogeneous and collaborative. Workflow development today depends on the effective collaboration and communication of a cross-disciplinary team, not only with humans but also with analytical systems and infrastructure. This paper presents a collaboration-centered reference architecture to extend workflow systems with dynamic, predictable and programmable interfaces to systems and infrastructure while bridging the exploratory and scalable activities in the scientific process. We also present a conceptual design towards the development of methodologies and tools for effective workflow-driven collaborations, namely the PPoDS methodology and the SmartFlows Toolkit for smart utilization of workflows in a rapidly evolving cyberinfrastructure ecosystem.
The huge amount of available data nowadays is a challenge for kernel-based machine learning algorithms like SVMs with respect to runtime and storage capacities. Local approaches might help to relieve these issues and to improve statistical accuracy. It has already been shown that these local approaches are consistent and robust in a basic sense. This article refines the analysis of robustness properties towards the so-called influence function which expresses the differentiability of the learning method: We show that there is a differentiable dependency of our locally learned predictor on the underlying distribution. The assumptions of the proven theorems can be verified without knowing anything about this distribution. This makes the results interesting also from an applied point of view.
We propose a new approach for constructing synthetic pseudo-panel data from cross-sectional data. The pseudo panel and the preferences it intends to describe is constructed at the individual level and is not affected by aggregation bias across cohorts. This is accomplished by creating a high-dimensional probabilistic model representation of the entire data set, which allows sampling from the probabilistic model in such a way that all of the intrinsic correlation properties of the original data are preserved. The key to this is the use of deep learning algorithms based on the Conditional Variational Autoencoder (CVAE) framework. From a modelling perspective, the concept of a model-based resampling creates a number of opportunities in that data can be organized and constructed to serve very specific needs of which the forming of heterogeneous pseudo panels represents one. The advantage, in that respect, is the ability to trade a serious aggregation bias (when aggregating into cohorts) for an unsystematic noise disturbance. Moreover, the approach makes it possible to explore high-dimensional sparse preference distributions and their linkage to individual specific characteristics, which is not possible if applying traditional pseudo-panel methods. We use the presented approach to reveal the dynamics of transport preferences for a fixed pseudo panel of individuals based on a large Danish cross-sectional data set covering the period from 2006 to 2016. The model is also utilized to classify individuals into ‘slow’ and ‘fast’ movers with respect to the speed at which their preferences change over time. It is found that the prototypical fast mover is a young woman who lives as a single in a large city whereas the typical slow mover is a middle-aged man with high income from a nuclear family who lives in a detached house outside a city.
Despite a growing literature on explaining neural networks, no consensus has been reached on how to explain a neural network decision or how to evaluate an explanation. In fact, most works rely on manually assessing the explanation to evaluate the quality of a method. This injects uncertainty in the explanation process along several dimensions: Which explanation method to apply? Who should we ask to evaluate it and which criteria should be used for the evaluation? Our contributions in this paper are twofold. First, we investigate schemes to combine explanation methods and reduce model uncertainty to obtain a single aggregated explanation. Our findings show that the aggregation is more robust, well-aligned with human explanations and can attribute relevance to a broader set of features (completeness). Second, we propose a novel way of evaluating explanation methods that circumvents the need for manual evaluation and is not reliant on the alignment of neural networks and humans decision processes.
Hypothesis testing is one of the most common types of data analysis and forms the backbone of scientific research in many disciplines. Analysis of variance (ANOVA) in particular is used to detect dependence between a categorical and a numerical variable. Here we show how one can carry out this hypothesis test under the restrictions of differential privacy. We show that the $F$-statistic, the optimal test statistic in the public setting, is no longer optimal in the private setting, and we develop a new test statistic $F_1$ with much higher statistical power. We show how to rigorously compute a reference distribution for the $F_1$ statistic and give an algorithm that outputs accurate $p$-values. We implement our test and experimentally optimize several parameters. We then compare our test to the only previous work on private ANOVA testing, using the same effect size as that work. We see an order of magnitude improvement, with our test requiring only 7% as much data to detect the effect.
Previous work has shown that popular trending events are important external factors which pose significant influence on user search behavior and also provided a way to computationally model this influence. However, their problem formulation was based on the strong assumption that each event poses its influence independently. This assumption is unrealistic as there are many correlated events in the real world which influence each other and thus, would pose a joint influence on the user search behavior rather than posing influence independently. In this paper, we study this novel problem of Modeling the Joint Influences posed by multiple correlated events on user search behavior. We propose a Joint Influence Model based on the Multivariate Hawkes Process which captures the inter-dependency among multiple events in terms of their influence upon user search behavior. We evaluate the proposed Joint Influence Model using two months query-log data from https://…/. Experimental results show that the model can indeed capture the temporal dynamics of the joint influence over time and also achieves superior performance over different baseline methods when applied to solve various interesting prediction problems as well as real-word application scenarios, e.g., query auto-completion.
Learning from complex demonstrations is challenging, especially when the demonstration consists of different strategies. A popular approach is to use a deep neural network to perform imitation learning. However, the structure of that deep neural network has to be “deep’ enough to capture all possible scenarios. Besides the machine learning issue, how humans learn in the sense of physiology has rarely been addressed and relevant works on spinal cord learning are rarer. In this work, we develop a novel modular learning architecture, the Generator and Responsibility Predictor (GRP) model, which automatically learns the sub-task policies from an unsegmented controller demonstration and learns to switch between the policies. We also introduce a more physiological based neural network architecture. We implemented our GRP model and our proposed neural network to form a model the transfers the swing leg control from the brain to the spinal cord. Our result suggests that by using the GRP model the brain can successfully transfer the target swing leg control to the spinal cord and the resulting model can switch between sub-control policies automatically.
Deep neural networks are widely used and exhibit excellent performance in many areas. However, they are vulnerable to adversarial attacks that compromise the network at the inference time by applying elaborately designed perturbation to input data. Although several defense methods have been proposed to address specific attacks, other attack methods can circumvent these defense mechanisms. Therefore, we propose Purifying Variational Autoencoder (PuVAE), a method to purify adversarial examples. The proposed method eliminates an adversarial perturbation by projecting an adversarial example on the manifold of each class, and determines the closest projection as a purified sample. We experimentally illustrate the robustness of PuVAE against various attack methods without any prior knowledge. In our experiments, the proposed method exhibits performances competitive with state-of-the-art defense methods, and the inference time is approximately 130 times faster than that of Defense-GAN that is the state-of-the art purifier model.
Neural networks in the real domain have been studied for a long time and achieved promising results in many vision tasks for recent years. However, the extensions of the neural network models in other number fields and their potential applications are not fully-investigated yet. Focusing on color images, which can be naturally represented as quaternion matrices, we propose a quaternion convolutional neural network (QCNN) model to obtain more representative features. In particular, we redesign the basic modules like convolution layer and fully-connected layer in the quaternion domain, which can be used to establish fully-quaternion convolutional neural networks. Moreover, these modules are compatible with almost all deep learning techniques and can be plugged into traditional CNNs easily. We test our QCNN models in both color image classification and denoising tasks. Experimental results show that they outperform the real-valued CNNs with same structures.
All knowing is material. The challenge for Information Systems (IS) research is to specify how knowing is material by drawing on theoretical characterizations of the digital. Synthetic knowing is knowing informed by theorizing digital materiality. We focus on two defining qualities: liquefaction (unhinging digital representations from physical objects, qualities, or processes) and open-endedness (extendable and generative). The Internet of Things (IoT) is crucial because sensors are vehicles of liquefaction. Their expanding scope for real-time seeing, hearing, tasting, smelling, and touching increasingly mimics phenomenologically perceived reality. Empirically, we present a longitudinal case study of IoT-rendered marine environmental monitoring by an oil and gas company operating in the politically contested Arctic. We characterize synthetic knowing into four concepts, the former three tied to liquefaction and the latter to open-endedness: (i) the objects of knowing are algorithmic phenomena; (ii) the sensors increasingly conjure up phenomenological reality; (iii) knowing is scoped (configurable); and (iv) open knowing/data is politically charged.
A syntax tree is a planar rooted tree where internal nodes are labeled on a graded set of generators. There is a natural notion of occurrence of contiguous pattern in such trees. We describe a way, given a set of generators and a set of patterns, to enumerate the trees avoiding this last. The method is built around inclusion-exclusion formulas forming a system of equations on formal power series of trees, and composition operations of trees. This does not require particular conditions on the set of patterns to avoid. We connect this result to the theory of nonsymmetric operads. Syntax trees are the elements of such free structures, so that any operad can be seen as a quotient of a free operad. Moreover, in some cases, the elements of an operad can be seen as trees avoiding some patterns. Relying on this, we use operads as devices for enumeration: given a set of combinatorial objects we want enumerate, we endow it with the structure of an operad, understand it in term of trees and pattern avoidance, and use our method to count them. Several examples are provided.
In many statistical modeling frameworks, goodness-of-fit tests are typically administered to the estimated residuals. In the time series setting, whiteness of the residuals is assessed using the sample autocorrelation function. For many time series models, especially those used for financial time series, the key assumption on the residuals is that they are in fact independent and not just uncorrelated. In this paper, we apply the auto-distance covariance function (ADCV) to evaluate the serial dependence of the estimated residuals. Distance covariance can discriminate between dependence and independence of two random vectors. The limit behavior of the test statistic based on the ADCV is derived for a general class of time series models. One of the key aspects in this theory is adjusting for the dependence that arises due to parameter estimation. This adjustment has essentially the same form regardless of the model specification. We illustrate the results in simulated examples.
Widespread applications of deep learning have led to a plethora of pre-trained neural network models for common tasks. Such models are often adapted from other models via transfer learning. The models may have varying training sets, training algorithms, network architectures, and hyper-parameters. For a given application, what isthe most suitable model in a model repository? This is a critical question for practical deployments but it has not received much attention. This paper introduces the novel problem of searching and ranking models based on suitability relative to a target dataset and proposes a ranking algorithm called \textit{neuralRank}. The key idea behind this algorithm is to base model suitability on the discriminating power of a model, using a novel metric to measure it. With experimental results on the MNIST, Fashion, and CIFAR10 datasets, we demonstrate that (1) neuralRank is independent of the domain, the training set, or the network architecture and (2) that the models ranked highly by neuralRank ranking tend to have higher model accuracy in practice.
2LS is a framework for analysis of sequential C programs that can verify and refute program assertions and termination. The 2LS framework is built upon the CPROVER infrastructure and implements template-based synthesis techniques, e.g. to find invariants and ranking functions, and incremental loop unwinding techniques to find counterexamples and k-induction proofs. The main improvements in this year’s version are the ability of 2LS to analyse programs requiring combined reasoning about shape and content of dynamic data structures, and an instrumentation for memory safety properties.
Resource balancing within complex transportation networks is one of the most important problems in real logistics domain. Traditional solutions on these problems leverage combinatorial optimization with demand and supply forecasting. However, the high complexity of transportation routes, severe uncertainty of future demand and supply, together with non-convex business constraints make it extremely challenging in the traditional resource management field. In this paper, we propose a novel sophisticated multi-agent reinforcement learning approach to address these challenges. In particular, inspired by the externalities especially the interactions among resource agents, we introduce an innovative cooperative mechanism for state and reward design resulting in more effective and efficient transportation. Extensive experiments on a simulated ocean transportation service demonstrate that our new approach can stimulate cooperation among agents and lead to much better performance. Compared with traditional solutions based on combinatorial optimization, our approach can give rise to a significant improvement in terms of both performance and stability.
Most existing feature selection methods are insufficient for analytic purposes as soon as high dimensional data or redundant sensor signals are dealt with since features can be selected due to spurious effects or correlations rather than causal effects. To support the finding of causal features in biomedical experiments, we hereby present FRI, an open source Python library that can be used to identify all-relevant variables in linear classification and (ordinal) regression problems. Using the recently proposed feature relevance method, FRI is able to provide the base for further general experimentation or in specific can facilitate the search for alternative biomarkers. It can be used in an interactive context, by providing model manipulation and visualization methods, or in a batch process as a filter method.
We propose a novel way to handle out of vocabulary (OOV) words in downstream natural language processing (NLP) tasks. We implement a network that predicts useful embeddings for OOV words based on their morphology and on the context in which they appear. Our model also incorporates an attention mechanism indicating the focus allocated to the left context words, the right context words or the word’s characters, hence making the prediction more interpretable. The model is a “drop-in” module that is jointly trained with the downstream task’s neural network, thus producing embeddings specialized for the task at hand. When the task is mostly syntactical, we observe that our model aims most of its attention on surface form characters. On the other hand, for tasks more semantical, the network allocates more attention to the surrounding words. In all our tests, the module helps the network to achieve better performances in comparison to the use of simple random embeddings.
We propose and study a general framework for regularized Markov decision processes (MDPs) where the goal is to find an optimal policy that maximizes the expected discounted total reward plus a policy regularization term. The extant entropy-regularized MDPs can be cast into our framework. Moreover, under our framework there are many regularization terms can bring multi-modality and sparsity which are potentially useful in reinforcement learning. In particular, we present sufficient and necessary conditions that induce a sparse optimal policy. We also conduct a full mathematical analysis of the proposed regularized MDPs, including the optimality condition, performance error and sparseness control. We provide a generic method to devise regularization forms and propose off-policy actor-critic algorithms in complex environment settings. We empirically analyze the statistical properties of optimal policies and compare the performance of different sparse regularization forms in discrete and continuous environments.
Building a good predictive model requires an array of activities such as data imputation, feature transformations, estimator selection, hyper-parameter search and ensemble construction. Given the large, complex and heterogenous space of options, off-the-shelf optimization methods are infeasible for realistic response times. In practice, much of the predictive modeling process is conducted by experienced data scientists, who selectively make use of available tools. Over time, they develop an understanding of the behavior of operators, and perform serial decision making under uncertainty, colloquially referred to as educated guesswork. With an unprecedented demand for application of supervised machine learning, there is a call for solutions that automatically search for a good combination of parameters across these tasks to minimize the modeling error. We introduce a novel system called APRL (Autonomous Predictive modeler via Reinforcement Learning), that uses past experience through reinforcement learning to optimize such sequential decision making from within a set of diverse actions under a time constraint on a previously unseen predictive learning problem. APRL actions are taken to optimize the performance of a final ensemble. This is in contrast to other systems, which maximize individual model accuracy first and create ensembles as a disconnected post-processing step. As a result, APRL is able to reduce up to 71\% of classification error on average over a wide variety of problems.
We propose a novel {\it Equilibrated Recurrent Neural Network} (ERNN) to combat the issues of inaccuracy and instability in conventional RNNs. Drawing upon the concept of autapse in neuroscience, we propose augmenting an RNN with a time-delayed self-feedback loop. Our sole purpose is to modify the dynamics of each internal RNN state and, at any time, enforce it to evolve close to the equilibrium point associated with the input signal at that time. We show that such self-feedback helps stabilize the hidden state transitions leading to fast convergence during training while efficiently learning discriminative latent features that result in state-of-the-art results on several benchmark datasets at test-time. We propose a novel inexact Newton method to solve fixed-point conditions given model parameters for generating the latent features at each hidden state. We prove that our inexact Newton method converges locally with linear rate (under mild conditions). We leverage this result for efficient training of ERNNs based on backpropagation.
Recommender systems are one of the most pervasive applications of machine learning in industry, with many services using them to match users to products or information. As such it is important to ask: what are the possible fairness risks, how can we quantify them, and how should we address them? In this paper we offer a set of novel metrics for evaluating algorithmic fairness concerns in recommender systems. In particular we show how measuring fairness based on pairwise comparisons from randomized experiments provides a tractable means to reason about fairness in rankings from recommender systems. Building on this metric, we offer a new regularizer to encourage improving this metric during model training and thus improve fairness in the resulting rankings. We apply this pairwise regularization to a large-scale, production recommender system and show that we are able to significantly improve the system’s pairwise fairness.
The emergence of complex life on Earth is often attributed to the arms race that ensued from a huge number of organisms all competing for finite resources. We present an artificial intelligence research environment, inspired by the human game genre of MMORPGs (Massively Multiplayer Online Role-Playing Games, a.k.a. MMOs), that aims to simulate this setting in microcosm. As with MMORPGs and the real world alike, our environment is persistent and supports a large and variable number of agents. Our environment is well suited to the study of large-scale multiagent interaction: it requires that agents learn robust combat and navigation policies in the presence of large populations attempting to do the same. Baseline experiments reveal that population size magnifies and incentivizes the development of skillful behaviors and results in agents that outcompete agents trained in smaller populations. We further show that the policies of agents with unshared weights naturally diverge to fill different niches in order to avoid competition.
Decision trees and logistic regression are one of the most popular and well-known machine learning algorithms, frequently used to solve a variety of real-world problems. Stability of learning algorithms is a powerful tool to analyze their performance and sensitivity and subsequently allow researchers to draw reliable conclusions. The stability of these two algorithms has remained obscure. To that end, in this paper, we derive two stability notions for decision trees and logistic regression: hypothesis and pointwise hypothesis stability. Additionally, we derive these notions for L2-regularized logistic regression and confirm existing findings that it is uniformly stable. We show that the stability of decision trees depends on the number of leaves in the tree, i.e., its depth, while for logistic regression, it depends on the smallest eigenvalue of the Hessian matrix of the cross-entropy loss. We show that logistic regression is not a stable learning algorithm. We construct the upper bounds on the generalization error of all three algorithms. Moreover, we present a novel stability measuring framework that allows one to measure the aforementioned notions of stability. The measures are equivalent to estimates of expected loss differences at an input example and then leverage bootstrap sampling to yield statistically reliable estimates. Finally, we apply this framework to the three algorithms analyzed in this paper to confirm our theoretical findings and, in addition, we discuss the possibilities of developing new training techniques to optimize the stability of logistic regression, and hence decrease its generalization error.
We introduce the task of algorithm class prediction for programming word problems. A programming word problem is a problem written in natural language, which can be solved using an algorithm or a program. We define classes of various programming word problems which correspond to the class of algorithms required to solve the problem. We present four new datasets for this task, two multiclass datasets with 550 and 1159 problems each and two multilabel datasets having 3737 and 3960 problems each. We pose the problem as a text classification problem and train neural network and non-neural network-based models on this task. Our best performing classifier gets an accuracy of 62.7 percent for the multiclass case on the five class classification dataset, Codeforces Multiclass-5 (CFMC5). We also do some human-level analysis and compare human performance with that of our text classification models. Our best classifier has an accuracy only 9 percent lower than that of a human on this task. To the best of our knowledge, these are the first reported results on such a task. We make our code and datasets publicly available.
Regression problems that have closed-form solutions are well understood and can be easily implemented when the dataset is small enough to be all loaded into the RAM. Challenges arise when data is too big to be stored in RAM to compute the closed form solutions. Many techniques were proposed to overcome or alleviate the memory barrier problem but the solutions are often local optimal. In addition, most approaches require accessing the raw data again when updating the models. Parallel computing clusters are also expected if multiple models need to be computed simultaneously. We propose multiple learning approaches that utilize an array of sufficient statistics (SS) to address this big data challenge. This memory oblivious approach breaks the memory barrier when computing regressions with closed-form solutions, including but not limited to linear regression, weighted linear regression, linear regression with Box-Cox transformation (Box-Cox regression) and ridge regression models. The computation and update of the SS array can be handled at per row level or per mini-batch level. And updating a model is as easy as matrix addition and subtraction. Furthermore, multiple SS arrays for different models can be easily computed simultaneously to obtain multiple models at one pass through the dataset. We implemented our approaches on Spark and evaluated over the simulated datasets. Results showed our approaches can achieve closed-form solutions of multiple models at the cost of half training time of the traditional methods for a single model.
Recently many NLP-based deep learning models have been applied to model source code for source code suggestion and recommendation tasks. A major limitation of these approaches is that they take source code as simple tokens of text and ignore its contextual, syntaxtual and structural dependencies. In this work, we present CodeGRU, a Gated Recurrent Unit based source code language model that is capable of capturing contextual, syntaxtual and structural dependencies for modeling the source code. The CodeGRU introduces the following several new components. The Code Sampler is first proposed for selecting noise-free code samples and transforms obfuscate code to its proper syntax, which helps to capture syntaxtual and structural dependencies. The Code Regularize is next introduced to encode source code which helps capture the contextual dependencies of the source code. Finally, we propose a novel method which can learn variable size context for modeling source code. We evaluated CodeGRU with real-world dataset and it shows that CodeGRU can effectively capture contextual, syntaxtual and structural dependencies which previous works fails. We also discuss and visualize two use cases of CodeGRU for source code modeling tasks (1) source code suggestion, and (2) source code generation.
In this paper, we propose a test, called Flagged-1-Bit (F1B) test, to study the intrinsic capability of recurrent neural networks in sequence learning. Four different recurrent network models are studied both analytically and experimentally using this test. Our results suggest that in general there exists a conflict between feature selection and feature memorization in sequence learning. Such a conflict can be resolved either using a gating mechanism as in LSTM, or by increasing the state dimension as in Vanilla RNN. Gated models resolve this conflict by adaptively adjusting their state-update equations, whereas Vanilla RNN resolves this conflict by assigning different dimensions different tasks. Insights into feature selection and memorization in recurrent networks are given.