We apply techniques in natural language processing, computational linguistics, and machine-learning to investigate papers in hep-th and four related sections of the arXiv: hep-ph, hep-lat, gr-qc, and math-ph. All of the titles of papers in each of these sections, from the inception of the arXiv until the end of 2017, are extracted and treated as a corpus which we use to train the neural network Word2Vec. A comparative study of common n-grams, linear syntactical identities, word cloud and word similarities is carried out. We find notable scientific and sociological differences between the fields. In conjunction with support vector machines, we also show that the syntactic structure of the titles in different sub-fields of high energy and mathematical physics are sufficiently different that a neural network can perform a binary classification of formal versus phenomenological sections with 87.1% accuracy, and can perform a finer five-fold classification across all sections with 65.1% accuracy.
Data shuffling of training data among different computing nodes (workers) has been identified as a core element to improve the statistical performance of modern large scale machine learning algorithms. Data shuffling is often considered one of the most significant bottlenecks in such systems due to the heavy communication load. Under a master-worker architecture (where a master has access to the entire dataset and only communications between the master and workers is allowed) coding has been recently proved to considerably reduce the communication load. In this work, we consider a different communication paradigm referred to as distributed data shuffling, where workers, connected by a shared link, are allowed to communicate with one another while no communication between the master and workers is allowed. Under the constraint of uncoded cache placement, we first propose a general coded distributed data shuffling scheme, which achieves the optimal communication load within a factor two. Then, we propose an improved scheme achieving the exact optimality for either large memory size or at most four workers in the system.
Quasi-Monte Carlo (QMC) methods for estimating integrals are attractive since the resulting estimators converge at a faster rate than pseudo-random Monte Carlo. However, they can be difficult to set up on arbitrary posterior densities within the Bayesian framework, in particular for inverse problems. We introduce a general parallel Markov chain Monte Carlo (MCMC) framework, for which we prove a law of large numbers and a central limit theorem. We further extend this approach to the use of adaptive kernels and state conditions, under which ergodicity holds. As a further extension, an importance sampling estimator is derived, for which asymptotic unbiasedness is proven. We consider the use of completely uniformly distributed (CUD) numbers and non-reversible transitions within the above stated methods, which leads to a general parallel quasi-MCMC (QMCMC) methodology. We prove consistency of the resulting estimators and demonstrate numerically that this approach scales close to $n^{-1}$ as we increase parallelisation, instead of the usual $n^{-1/2}$ that is typical of standard MCMC algorithms. In practical statistical models we observe up to 2 orders of magnitude improvement compared with pseudo-random methods.
This document provides an overview of the material covered in a course taught at Stanford in the spring quarter of 2018. The course draws upon insight from cognitive and systems neuroscience to implement hybrid connectionist and symbolic reasoning systems that leverage and extend the state of the art in machine learning by integrating human and machine intelligence. As a concrete example we focus on digital assistants that learn from continuous dialog with an expert software engineer while providing initial value as powerful analytical, computational and mathematical savants. Over time these savants learn cognitive strategies (domain-relevant problem solving skills) and develop intuitions (heuristics and the experience necessary for applying them) by learning from their expert associates. By doing so these savants elevate their innate analytical skills allowing them to partner on an equal footing as versatile collaborators – effectively serving as cognitive extensions and digital prostheses, thereby amplifying and emulating their human partner’s conceptually-flexible thinking patterns and enabling improved access to and control over powerful computing resources.
We present a learning theory for the training of a linear system operator having an input compositional variable and propose a Bayesian inversion method for inferring the unknown variable from an output of a noisy linear system. We assume that we have partial or even no knowledge of the operator but have training data of input and ouput. A compositional variable satisfies the constraints that the elements of the variable are all non-negative and sum to unity. We quantified the uncertainty in the trained operator and present the convergence rates of training in explicit forms for several interesting cases under stochastic compositional models. The trained linear operator with the covariance matrix, estimated from the training set of pairs of ground-truth input and noisy output data, is further used in evaluation of posterior uncertainty of the solution. This posterior uncertainty clearly demonstrates uncertainty propagation from noisy training data and addresses possible mismatch between the true operator and the estimated one in the final solution.
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Interpretability has arisen as a key desideratum of machine learning models alongside performance. Approaches so far have been primarily concerned with fixed dimensional inputs emphasizing feature relevance or selection. In contrast, we focus on temporal modeling and the problem of tailoring the predictor, functionally, towards an interpretable family. To this end, we propose a co-operative game between the predictor and an explainer without any a priori restrictions on the functional class of the predictor. The goal of the explainer is to highlight, locally, how well the predictor conforms to the chosen interpretable family of temporal models. Our co-operative game is setup asymmetrically in terms of information sets for efficiency reasons. We develop and illustrate the framework in the context of temporal sequence models with examples.
Functional data analysis is a fast evolving branch of modern statistics and the functional linear model has become popular in recent years. However, most estimation methods for this model rely on generalized least squares procedures and therefore are sensitive to atypical observations. To remedy this, we propose a two-step estimation procedure that combines robust functional principal components and robust linear regression. Moreover, we propose a transformation that reduces the curvature of the estimators and can be advantageous in many settings. For these estimators we prove Fisher-consistency at elliptical distributions and consistency under mild regularity conditions. The influence function of the estimators is investigated as well. Simulation experiments show that the proposed estimators have reasonable efficiency, protect against outlying observations, produce smooth estimates and perform well in comparison to existing approaches.
Machine Learning models incorporating multiple layered learning networks have been seen to provide effective models for various classification problems. The resulting optimization problem to solve for the optimal vector minimizing the empirical risk is, however, highly nonlinear. This presents a challenge to application and development of appropriate optimization algorithms for solving the problem. In this paper, we summarize the primary challenges involved and present the case for a Newton-based method incorporating directions of negative curvature, including promising numerical results on data arising from security anomally deetection.
Machine Learning models incorporating multiple layered learning networks have been seen to provide effective models for various classification problems. The resulting optimization problem to solve for the optimal vector minimizing the empirical risk is, however, highly nonconvex. This alone presents a challenge to application and development of appropriate optimization algorithms for solving the problem. However, in addition, there are a number of interesting problems for which the objective function is non- smooth and nonseparable. In this paper, we summarize the primary challenges involved, the state of the art, and present some numerical results on an interesting and representative class of problems.
Measuring similarity is a basic task in information retrieval, and now often a building-block for more complex arguments about cultural change. But do measures of textual similarity and distance really correspond to evidence about cultural proximity and differentiation To explore that question empirically, this paper compares textual and social measures of the similarities between genres of English-language fiction. Existing measures of textual similarity (cosine similarity on tf-idf vectors or topic vectors) are also compared to new strategies that use supervised learning to anchor textual measurement in a social context.
The conventional nonparametric tests in survival analysis, such as the log-rank test, assess the null hypothesis that the hazards are equal at all times. However, hazards are hard to interpret causally, and other null hypotheses are more relevant in many scenarios with survival outcomes. To allow for a wider range of null hypotheses, we present a generic approach to define test statistics. This approach utilizes the fact that a wide range of common parameters in survival analysis can be expressed as solutions of differential equations. Thereby, we can test hypotheses based on survival parameters that solve differential equations driven by hazards, and it is easy to implement the tests on a computer. We present simulations, suggesting that our generic approach performs well for several hypotheses in a range of scenarios. Finally, we extend the strategy to to allow for testing conditional on covariates.
In recent years, reinforcement learning (RL) methods have been applied to model gameplay with great success, achieving super-human performance in various environments, such as Atari, Go, and Poker. However, those studies mostly focus on winning the game and have largely ignored the rich and complex human motivations, which are essential for understanding different players’ diverse behaviors. In this paper, we present a novel method called Multi-Motivation Behavior Modeling (MMBM) that takes the multifaceted human motivations into consideration and models the underlying value structure of the players using inverse RL. Our approach does not require the access to the dynamic of the system, making it feasible to model complex interactive environments such as massively multiplayer online games. MMBM is tested on the World of Warcraft Avatar History dataset, which recorded over 70,000 users’ gameplay spanning three years period. Our model reveals the significant difference of value structures among different player groups. Using the results of motivation modeling, we also predict and explain their diverse gameplay behaviors and provide a quantitative assessment of how the redesign of the game environment impacts players’ behaviors.
Recently, many systems for graph analysis have been developed to address the growing needs of both industry and academia to study complex graphs. Insight into the practical uses of graph analysis will allow future developments of such systems to optimize for real-world usage, instead of targeting single use cases or hypothetical workloads. This insight may be derived from surveys on the applications of graph analysis. However, existing surveys are limited in the variety of application domains, datasets, and/or graph analysis techniques they study. In this work we present and apply a systematic method for identifying practical use cases of graph analysis. We identify commonly used graph features and analysis methods and use our findings to construct a taxonomy of graph analysis applications. We conclude that practical use cases of graph analysis cover a diverse set of graph features and analysis methods. Furthermore, most applications combine multiple features and methods. Our findings motivate further development of graph analysis systems to support a broader set of applications and to facilitate the combination of multiple analysis methods in an (interactive) workflow.