Change point detection algorithms have numerous applications in fields of scientific and economic importance. We consider the problem of change point detection on compositional multivariate data (each sample is a probability mass function), which is a practically important sub-class of general multivariate data. While the problem of change point detection is well studied in the univariate setting, and there are a few viable implementations for general multivariate data, the existing methods do not perform well on compositional data. In this paper, we propose a parametric approach for change point detection in compositional data. Moreover, using simple transformations on the data, we extend our approach to handle general multivariate data. Experimentally, we show that our method performs significantly better on compositional data and is competitive on general data compared to the available state-of-the-art implementations.
We propose nnstreamer, a software system that handles neural networks as filters of stream pipelines, applying the stream processing paradigm to neural network applications. A new trend accompanying the spread of deep neural network applications is on-device AI; i.e., processing neural networks directly on mobile devices or edge/IoT devices instead of cloud servers. Emerging privacy issues, data transmission costs, and operational costs signify the need for on-device AI, especially when a huge number of devices with real-time data processing are deployed. Nnstreamer efficiently handles neural networks with complex data stream pipelines on devices, improving the overall performance significantly with minimal effort. In addition, nnstreamer simplifies neural network pipeline implementations and allows reusing off-the-shelf multimedia stream filters directly; thus it reduces development costs significantly. Nnstreamer is already being deployed with a product releasing soon and is open source software applicable to a wide range of hardware architectures and software platforms.
This paper proposes a novel adaptive guidance system developed using reinforcement meta-learning with a recurrent policy and value function approximator. The use of recurrent network layers allows the deployed policy to adapt in real time to environmental forces acting on the agent. We compare the performance of the DR/DV guidance law, an RL agent with a non-recurrent policy, and an RL agent with a recurrent policy in four difficult tasks with unknown but highly variable dynamics. These tasks include a safe Mars landing with random engine failure and a landing on an asteroid with unknown environmental dynamics. We also demonstrate the ability of a recurrent policy to navigate using only Doppler radar altimeter returns, thus integrating guidance and navigation.
In this study, a novel topology optimization approach based on conditional Wasserstein generative adversarial networks (CWGAN) is developed to replicate conventional topology optimization algorithms in an extremely computationally inexpensive way. A CWGAN consists of a generator and a discriminator, both of which are deep convolutional neural networks (CNN). The limited samples of data needed for training, namely quasi-optimal planar structures, are generated using the conventional topology optimization algorithms. With CWGANs, the topology optimization conditions can be set to a required value before generating samples. The CWGAN truncates the global design space through an equality constraint introduced by the designer. The results are validated by generating an optimized planar structure using the conventional algorithms with the same settings. A proof of concept is presented which, to our knowledge, is the first such illustration of the fusion of CWGANs and topology optimization.
The goal of this article is to inspire data scientists to participate in the debate on the impact that their professional work has on society, and to become active in public debates on the digital world as data science professionals. How do ethical principles (e.g., fairness, justice, beneficence, and non-maleficence) relate to our professional lives? What responsibilities do we bear as professionals by virtue of our expertise in the field? More specifically, this article makes an appeal to statisticians to join that debate, and to be part of the community that establishes data science as a proper profession in the sense of Airaksinen, a philosopher working on professional ethics. As we will argue, data science has one of its roots in statistics and extends beyond it. To shape the future of statistics, and to take responsibility for the statistical contributions to data science, statisticians should actively engage in the discussions. First, the term data science is defined, and the technical changes that have led to a strong influence of data science on society are outlined. Next, the systematic approach from CNIL is introduced. Prominent examples are given for ethical issues arising from the work of data scientists. Further, we provide reasons why data scientists should engage in shaping the morality around data science and in formulating codes of conduct and codes of practice for it. Next, we present established ethical guidelines for the related fields of statistics and computing machinery. Thereafter, necessary steps in the community to develop professional ethics for data science are described. Finally, we give our starting statement for the debate: Data science is in the focal point of current societal development. Without becoming a profession with professional ethics, data science will fail in building trust in its interaction with and its much needed contributions to society!
Recent GAN-based architectures have been able to deliver impressive performance on the general task of image-to-image translation. In particular, it was shown that a wide variety of image translation operators may be learned from two image sets, containing images from two different domains, without establishing an explicit pairing between the images. This was made possible by introducing clever regularizers to overcome the under-constrained nature of the unpaired translation problem. In this work, we introduce a novel architecture for unpaired image translation, and explore several new regularizers enabled by it. Specifically, our architecture comprises a pair of GANs, as well as a pair of translators between their respective latent spaces. These cross-translators enable us to impose several regularizing constraints on the learnt image translation operator, collectively referred to as latent cross-consistency. Our results show that our proposed architecture and latent cross-consistency constraints are able to outperform the existing state-of-the-art on a wide variety of image translation tasks.
We present a method to reconstruct networks of socialbots given minimal input. We then use Kernel Density Estimates of Botometer scores from 47,000 social networking accounts to find clusters of automated accounts, discovering over 5,000 socialbots. This statistical, data-driven approach allows for inference of thresholds for socialbot detection, as illustrated in a case study we present from Guatemala.
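To make the idea of inferring a detection threshold from a kernel density estimate concrete, here is a minimal sketch. It is not the authors' procedure: it assumes scores normalised to [0, 1], a bimodal score distribution, a fixed Gaussian bandwidth, and synthetic data, and the names `gaussian_kde`, `density_valley`, and `min_separation` are hypothetical. The threshold is taken as the lowest-density point between the two main modes of the estimate.

```python
import math
import random


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def density_valley(samples, bandwidth=0.05, grid_size=201, min_separation=0.2):
    """Return the lowest-density grid point between the two main modes on [0, 1].

    Returns None when no second mode is found at least `min_separation`
    away from the highest one.
    """
    density = gaussian_kde(samples, bandwidth)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    values = [density(x) for x in grid]

    # Interior local maxima of the estimated density.
    peaks = [
        i
        for i in range(1, grid_size - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]
    if not peaks:
        return None

    first = max(peaks, key=lambda i: values[i])
    others = [i for i in peaks if abs(grid[i] - grid[first]) >= min_separation]
    if not others:
        return None
    second = max(others, key=lambda i: values[i])

    lo, hi = sorted((first, second))
    valley = min(range(lo, hi + 1), key=lambda i: values[i])
    return grid[valley]


# Synthetic stand-in for bot-likelihood scores (assumption: not real data).
rng = random.Random(0)
human_like = [min(1.0, max(0.0, rng.gauss(0.20, 0.07))) for _ in range(800)]
bot_like = [min(1.0, max(0.0, rng.gauss(0.85, 0.05))) for _ in range(200)]
scores = human_like + bot_like

threshold = density_valley(scores)
flagged = [s for s in scores if threshold is not None and s > threshold]
print("threshold:", threshold, "flagged accounts:", len(flagged))
```

Choosing the valley between modes rather than a fixed cut-off lets the threshold follow the data; the bandwidth and the minimum separation between modes remain tuning choices that would need to be checked against real scores.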
As more researchers have become aware of and passionate about algorithmic fairness, there has been an explosion in papers laying out new metrics, suggesting algorithms to address issues, and calling attention to issues in existing applications of machine learning. This research has greatly expanded our understanding of the concerns and challenges in deploying machine learning, but there has been much less work in seeing how the rubber meets the road. In this paper we provide a case study on the application of fairness in machine learning research to a production classification system, and offer new insights into how to measure and address algorithmic fairness issues. We discuss open questions in implementing equality of opportunity and describe our fairness metric, conditional equality, which takes into account distributional differences. Further, we provide a new approach to improve on the fairness metric during model training and demonstrate its efficacy in improving performance for a real-world product.
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related, and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the Predictive, Descriptive, Relevant (PDR) framework for discussing interpretations. The PDR framework provides three overarching desiderata for evaluation: predictive accuracy, descriptive accuracy and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post-hoc categories, with sub-groups including sparsity, modularity and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often under-appreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
Interval-censored data analysis is important in biomedical statistics for any type of time-to-event response where the time of response is not known exactly, but rather only known to occur between two assessment times. Many clinical trials and longitudinal studies generate interval-censored data; one common example occurs in medical studies that entail periodic follow-up. In this paper we propose a survival forest method for interval-censored data based on the conditional inference framework. We describe how this framework can be adapted to the situation of interval-censored data. We show that the tuning parameters have a non-negligible effect on the survival forest performance and guidance is provided on how to tune the parameters in a data-dependent way to improve the overall performance of the method. Using Monte Carlo simulations we find that the proposed survival forest is at least as effective as a survival tree method when the underlying model has a tree structure, performs similarly to an interval-censored Cox proportional hazards model fit when the true relationship is linear, and outperforms the survival tree method and Cox model when the true relationship is nonlinear. We illustrate the application of the method on a tooth emergence data set.
A large amount of the public data produced by enterprises is in semi-structured PDF form. Tabular data extraction from reports and other published data in PDF format is of interest for various data consolidation purposes such as analysing and aggregating the financial reports of a company. Queries into the structured tabular data in PDF format are normally processed in an unstructured manner through means like text-match. This is mainly because the binary format of PDF documents is optimized for layout and rendering and does not have great support for automated parsing of data. Moreover, even the same table type in PDF files varies in schema and in row or column headers, which makes it difficult for a query plan to cover all relevant tables. This paper proposes a deep learning based method to enable SQL-like query and analysis of financial tables from annual reports in PDF format. This is achieved through table type classification and nearest row search. We demonstrate that using word embeddings trained on Google news for header matching clearly outperforms the text-match based approach of a traditional database. We also introduce a practical system that uses this technology to query and analyse finance tables in PDF documents from various sources.
In many security and healthcare systems, the detection and diagnosis systems use a sequence of sensors/tests. Each test outputs a prediction of the latent state and carries an inherent cost. However, the correctness of the predictions cannot be evaluated since the ground truth annotations may not be available. Our objective is to learn strategies for selecting a test that gives the best trade-off between accuracy and costs in such Unsupervised Sensor Selection (USS) problems. Clearly, learning is feasible only if ground truth can be inferred (explicitly or implicitly) from the problem structure. It is observed that this happens if the problem satisfies the ‘Weak Dominance’ (WD) property. We set up the USS problem as a stochastic partial monitoring problem and develop an algorithm with sub-linear regret under the WD property. We argue that our algorithm is optimal and evaluate its performance on problem instances generated from synthetic and real-world datasets.
Simultaneous change-point detection and estimation in a piece-wise constant model is a common task in modern statistics. If, in addition, the whole estimation can be performed automatically, in just one single step without going through any hypothesis tests for non-identifiable models or unwieldy classical a posteriori methods, it becomes an interesting, but also challenging idea. In this paper we introduce an estimation method based on the quantile LASSO approach. Unlike standard LASSO approaches, our method does not rely on typical assumptions usually required for the model errors, such as a sub-Gaussian or Normal distribution. The proposed quantile LASSO method can effectively handle heavy-tailed random error distributions, and, in general, it offers a more complex view of the data as one can obtain any conditional quantile of the target distribution, not just the conditional mean. It is proved that under some reasonable assumptions the number of change-points is not underestimated with probability tending to one, and, in addition, when the number of change-points is estimated correctly, the change-point estimates provided by the quantile LASSO are consistent. Numerical simulations are used to demonstrate these results and to illustrate the empirical performance and robustness of the proposed quantile LASSO method.
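As a reading aid, one common way to write a quantile LASSO objective for a piece-wise constant signal is sketched below. The abstract does not state the formula, so the notation is an assumption: observations $Y_1, \dots, Y_n$, level parameters $u_1, \dots, u_n$, quantile level $\tau \in (0, 1)$, and tuning parameter $\lambda_n > 0$ are illustrative symbols, and the scaling of the penalty in the paper may differ.

```latex
% Check (quantile) loss at level tau; this definition is standard.
\rho_\tau(x) = x \left( \tau - \mathbb{1}\{x < 0\} \right)

% Assumed fused-type objective: quantile loss plus an L1 penalty on
% successive differences, so that nonzero differences mark change-points.
\widehat{u} = \operatorname*{arg\,min}_{u \in \mathbb{R}^n}
  \sum_{i=1}^{n} \rho_\tau\left( Y_i - u_i \right)
  + \lambda_n \sum_{i=2}^{n} \left| u_i - u_{i-1} \right|
```

Under this formulation the estimated change-points would be the indices $i$ with $\widehat{u}_i \neq \widehat{u}_{i-1}$, and setting $\tau = 1/2$ targets the conditional median, which is what makes the loss insensitive to heavy tails.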
The Laplace approximation has been one of the workhorses of Bayesian inference. It often delivers good approximations in practice despite the fact that it does not strictly take into account where the volume of posterior density lies. Variational approaches avoid this issue by explicitly minimising the Kullback-Leibler divergence DKL between a postulated posterior and the true (unnormalised) logarithmic posterior. However, they rely on a closed-form DKL in order to update the variational parameters. To address this, stochastic versions of variational inference have been devised that approximate the intractable DKL with a Monte Carlo average. This approximation allows calculating gradients with respect to the variational parameters. However, variational methods often postulate a factorised Gaussian approximating posterior. In doing so, they sacrifice a posteriori correlations. In this work, we propose a method that combines the Laplace approximation with the variational approach. The advantages are that we maintain applicability to non-conjugate models, posterior correlations, and a reduced number of free variational parameters. Numerical experiments demonstrate improvement over the Laplace approximation and variational inference with factorised Gaussian posteriors.
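The Monte Carlo approximation of DKL mentioned above can be illustrated with a short sketch. It is not the proposed method: it assumes a one-dimensional Gaussian variational posterior and a hypothetical unnormalised target (`log_unnormalised_posterior`), and it only estimates the objective, without computing gradients. Because the target is unnormalised, the estimate equals the true DKL minus the log of the target's normalising constant.

```python
import math
import random


def log_normal_pdf(theta, mean, std):
    """Log density of a normal distribution N(mean, std^2) at theta."""
    return -0.5 * ((theta - mean) / std) ** 2 - math.log(
        std * math.sqrt(2.0 * math.pi)
    )


def log_unnormalised_posterior(theta):
    """Hypothetical target: N(1.0, 0.5^2) with its normalising constant dropped."""
    return -0.5 * ((theta - 1.0) / 0.5) ** 2


def mc_kl_objective(mean, std, num_samples=20000, seed=0):
    """Monte Carlo average of log q(theta) - log p~(theta) with theta ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(mean, std)
        total += log_normal_pdf(theta, mean, std) - log_unnormalised_posterior(theta)
    return total / num_samples


# Log normalising constant of the hypothetical target, known here in closed form.
log_z = math.log(0.5 * math.sqrt(2.0 * math.pi))

matched = mc_kl_objective(mean=1.0, std=0.5)      # q equals the normalised target
mismatched = mc_kl_objective(mean=0.0, std=1.0)   # q differs from the target

print("estimated DKL, matched q:   ", matched + log_z)
print("estimated DKL, mismatched q:", mismatched + log_z)
```

In this toy setting the normalising constant is known, so the estimate can be converted to an actual DKL value; in practice the constant is unknown, but since it does not depend on the variational parameters it does not affect the minimiser.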
In this paper, we propose a model for non-cooperative Markov games with time-consistent risk-aware players. In particular, our model characterizes the risk arising from both the stochastic state transitions and the randomized strategies of the other players. We give an appropriate equilibrium concept for our risk-aware Markov game model and we demonstrate the existence of such equilibria in stationary strategies. We then propose and analyze a simulation-based $Q$-learning type algorithm for equilibrium computation, and work through the details for some specific risk measures. Our numerical experiments on a two-player queuing game demonstrate the value and applicability of our model and the corresponding $Q$-learning algorithm.