Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be applied on the selection of the appropriate algorithm with the consideration of data quality and the determination of the data share to clean. However, rare research has focused on exploring such relationship. Motivated by this, this paper conducts an experimental comparison for the effects of missing, inconsistent and conflicting data on classification, clustering, and regression algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.
During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly and solve big data problems using distributed algorithms. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics, by highlighting both its classic and novel applications.
The Renormalisation Group (RG) provides a framework in which it is possible to assess whether a deep-learning network is sensitive to small changes in the input data and hence prone to error, or susceptible to adversarial attack. Distinct classification outputs are associated with different RG fixed points and sensitivity to small changes in the input data is due to the presence of relevant operators at a fixed point. A numerical scheme, based on Monte Carlo RG ideas, is proposed for identifying the existence of relevant operators and the corresponding directions of greatest sensitivity in the input data. Thus, a trained deep-learning network may be tested for its robustness and, if it is vulnerable to attack, dangerous perturbations of the input data identified.
In this short paper, we consider the roles of HCI in enabling the better governance of consequential machine learning systems using the rights and obligations laid out in the recent 2016 EU General Data Protection Regulation (GDPR)—a law which involves heavy interaction with people and systems. Focussing on those areas that relate to algorithmic systems in society, we propose roles for HCI in legal contexts in relation to fairness, bias and discrimination; data protection by design; data protection impact assessments; transparency and explanations; the mitigation and understanding of automation bias; and the communication of envisaged consequences of processing.
Big data features not only large volumes of data but also data with complicated structures. Complexity imposes unique challenges in big data analytics. Meeker and Hong (2014, Quality Engineering, pp. 102-116) provided an extensive discussion of the opportunities and challenges in big data and reliability, and described engineering systems that can generate big data that can be used in reliability analysis. Meeker and Hong (2014) focused on large scale system operating and environment data (i.e., high-frequency multivariate time series data), and provided examples on how to link such data as covariates to traditional reliability responses such as time to failure, time to recurrence of events, and degradation measurements. This paper intends to extend that discussion by focusing on how to use data with complicated structures to do reliability analysis. Such data types include high-dimensional sensor data, functional curve data, and image streams. We first provide a review of recent development in those directions, and then we provide a discussion on how analytical methods can be developed to tackle the challenging aspects that arise from the complexity feature of big data in reliability applications. The use of modern statistical methods such as variable selection, functional data analysis, scalar-on-image regression, spatio-temporal data models, and machine learning techniques will also be discussed.
Working memory is a cognitive process that is responsible for temporarily holding and manipulating information. Most of the empirical neuroscience research on working memory has focused on measuring sustained activity in prefrontal cortex (PFC) and/or parietal cortex during simple delayed-response tasks, and most of the models of working memory have been based on neural integrators. But working memory means much more than just holding a piece of information online. We describe a new theory of working memory, based on a recurrent neural circuit that we call ORGaNICs (Oscillatory Recurrent GAted Neural Integrator Circuits). ORGaNICs are a variety of Long Short Term Memory units (LSTMs), imported from machine learning and artificial intelligence. ORGaNICs can be used to explain the complex dynamics of delay-period activity in prefrontal cortex (PFC) during a working memory task. The theory is analytically tractable so that we can characterize the dynamics, and the theory provides a means for reading out information from the dynamically varying responses at any point in time, in spite of the complex dynamics. ORGaNICs can be implemented with a biophysical (electrical circuit) model of pyramidal cells, combined with shunting inhibition via a thalamocortical loop. Although introduced as a computational theory of working memory, ORGaNICs are also applicable to models of sensory processing, motor preparation and motor control. ORGaNICs offer computational advantages compared to other varieties of LSTMs that are commonly used in AI applications. Consequently, ORGaNICs are a framework for canonical computation in brains and machines.
Few models have been more ubiquitous in their respective fields than Bayesian knowledge tracing and item response theory. Both of these models were developed to analyze data on learners. However, the study designs that these models are designed for differ; Bayesian knowledge tracing is designed to analyze longitudinal data while item response theory is built for cross-sectional data. This paper illustrates a fundamental connection between these two models. Specifically, the stationary distribution of the latent variable and the observed response variable in Bayesian knowledge Tracing are related to an item response theory model. Furthermore, recent advances in network psychometrics demonstrate how this relationship can be exploited and generalized to a network model.
Convolutional Neural Networks (CNNs) define an exceptionally powerful class of models for image classification, but the theoretical background and the understanding of how invariances to certain transformations are learned is limited. In a large scale screening with images modified by different affine and nonaffine transformations of varying magnitude, we analyzed the behavior of the CNN architectures AlexNet and ResNet. If the magnitude of different transformations does not exceed a class- and transformation dependent threshold, both architectures show invariant behavior. In this work we furthermore introduce a new learnable module, the Invariant Transformer Net, which enables us to learn differentiable parameters for a set of affine transformations. This allows us to extract the space of transformations to which the CNN is invariant and its class prediction robust.
Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complex scene, warranting both the recognition’ and localization’ of an unseen category. To address this limitation, we introduce a new \emph{Zero-Shot Detection’} (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues, e.g., the rarity of unseen objects. To the best of our knowledge, this is the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from recognition to detection setting. Our extensive experiments show significant performance boost over the baseline on the imperative yet difficult ZSD problem.
One of the most compelling features of Gaussian process (GP) regression is its ability to provide well calibrated posterior distributions. Recent advances in inducing point methods have drastically sped up marginal likelihood and posterior mean computations, leaving posterior covariance estimation and sampling as the remaining computational bottlenecks. In this paper we address this shortcoming by using the Lanczos decomposition algorithm to rapidly approximate the predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs Variance Estimates), substantially reduces the time and space complexity over any previous method. In practice, it can compute predictive covariances up to 2,000 times faster and draw samples 18,000 time faster than existing methods, all without sacrificing accuracy.
Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding modern data augmentation techniques. We start by showing that for kernel classifiers, data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. We connect this general approximation framework to prior work in invariant kernels, tangent propagation, and robust optimization. Next, we explicitly tackle the compositional aspect of modern data augmentation techniques, proposing a novel model of data augmentation as a Markov process. Under this model, we show that performing $k$-nearest neighbors with data augmentation is asymptotically equivalent to a kernel classifier. Finally, we illustrate ways in which our theoretical framework can be leveraged to accelerate machine learning workflows in practice, including reducing the amount of computation needed to train on augmented data, and predicting the utility of a transformation prior to training.
Structural Equation Models (SEMs) are routinely used in the analysis of empirical data by researchers spanning different scientific fields such as psychologists or econometricians. In some fields, such as in ecology, SEMs have only started recently to attract attention and thanks to dedicated software packages the use of SEMs have steadily increased. Yet, common analysis practices in such fields that might be transposed from other statistical techniques such as model acceptance or rejection based on p value screening might be poorly fitted for SEMs. In this simulation study, SEMs were fitted via two commonly used R packages: lavaan and piecewiseSEM. Datasets were simulated under different modelling scenarios to test the impact of sample size and model complexity on various global and local model fitness indices. The results showed that not one single model indices should be used to decide on model fitness but rather a combination of different model fitness indices is needed. The global chi square test for lavaan or the Fisher C statistic are, in isolation, poor indicators of model fitness. Combining the different metrics explored here provided little safeguards against model overfitting, this emphasizes the need to cautiously interpret the inferred (causal) relations from fitted SEMs. Researchers in scientific fields with little experience in SEMs, such as in ecology, should consider and accept these limitations.
For a locally Lipschitz continuous function $f:X\to\mathbb{R}$ the generalized gradient $\partial f(x)$ of Clarke is used to develop some (set-valued) gradient on a set $A\subset X$. Existence, uniqueness and some approximation are considered for optimal descent directions on set $A$. The results serve as basis for nonsmooth numerical descent algorithms that can be found in subsequent papers.
We present graph partition neural networks (GPNN), an extension of graph neural networks (GNNs) able to handle extremely large graphs. GPNNs alternate between locally propagating information between nodes in small subgraphs and globally propagating information between the subgraphs. To efficiently partition graphs, we experiment with several partitioning algorithms and also propose a novel variant for fast processing of large scale graphs. We extensively test our model on a variety of semi-supervised node classification tasks. Experimental results indicate that GPNNs are either superior or comparable to state-of-the-art methods on a wide variety of datasets for graph-based semi-supervised classification. We also show that GPNNs can achieve similar performance as standard GNNs with fewer propagation steps.
Performing stochastic inversion on a computationally expensive forward simulation model with a high-dimensional uncertain parameter space (e.g. a spatial random field) is computationally prohibitive even with gradient information provided. Moreover, the nonlinear’ mapping from parameters to observables generally gives rise to non-Gaussian posteriors even with Gaussian priors, thus hampering the use of efficient inversion algorithms designed for models with Gaussian assumptions. In this paper, we propose a novel Bayesian stochastic inversion methodology, characterized by a tight coupling between a gradient-based Langevin Markov Chain Monte Carlo (LMCMC) method and a kernel principal component analysis (KPCA). This approach addresses the `curse-of-dimensionality’ via KPCA to identify a low-dimensional feature space within the high-dimensional and nonlinearly correlated spatial random field. Moreover, non-Gaussian full posterior probability distribution functions are estimated via an efficient LMCMC method on both the projected low-dimensional feature space and the recovered high-dimensional parameter space. We demonstrate this computational framework by integrating and adapting recent developments such as data-driven statistics-on-manifolds constructions and reduction-through-projection techniques to solve inverse problems in linear elasticity.
We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, but that existing systems take approaches that lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce, and prove the correctness of, a new online nested Monte Carlo estimation method that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines.
We describe an efficient, scalable machine learning library that enables very fast training of generalized linear models. We demonstrate that our library can remove the training time as a bottleneck for machine learning workloads, opening the door to a range of new applications. For instance, it allows more agile development, faster and more fine-grained exploration of the hyper-parameter space, enables scaling to massive datasets and makes frequent re-training of models possible in order to adapt to events as they occur. Our library, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the hierarchical architecture of modern distributed systems. This allows us to effectively leverage available network, memory and heterogeneous compute resources. On a terabyte-scale publicly available dataset for click-through-rate prediction in computational advertising, we demonstrate the training of a logistic regression classifier in 1.53 minutes, a 46x improvement over the fastest reported performance.