Hamiltonian Monte Carlo with discrete time sampling steps is limited by the factthat its transition rates must be less than or equal to 1. By deriving an HMC algorithm that is a continuous time Markov jump process, we are able to overcome this constraint. This allows us to propose new transition rates that lead to improvedmixing, as measured both by both the spectral gap and auto-correlations on several example problems. We release the algorithm as an open source python package.
Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of these metrics are borrowed from multi-class classification and clustering domains, though some key differences exist differentiating entity resolution from general clustering. Menestrina et al. empirically showed rankings from these metrics often conflict with each other, thus our primary motivation for studying them. This paper provides practitioners the basic knowledge to begin evaluating their entity resolution results.
We propose a new algorithm for envelope estimation, along with a new root n consistent method for computing starting values. The new algorithm, which does not require optimization over a Grassmannian, is shown by simulation to be much faster and typically more accurate that the best existing algorithm proposed by Cook and Zhang (2015c).
High-dimensional time-series data are becoming increasingly abundant across a wide variety of domains, spanning economics, neuroscience, particle physics, and cosmology. Fitting statistical models to such data, to enable parameter estimation and time-series prediction, is an important computational primitive. Existing methods, however, are unable to cope with the high-dimensional nature of these problems, due to both computational and statistical reasons. We mitigate both kinds of issues via proposing an M-estimator for Reduced-rank System IDentification (MR. SID). A combination of low-rank approximations, L-1 and L-2 penalties, and some numerical linear algebra tricks, yields an estimator that is computationally efficient and numerically stable. Simulations and real data examples demonstrate the utility of this approach in a variety of problems. In particular, we demonstrate that MR. SID can estimate spatial filters, connectivity graphs, and time-courses from native resolution functional magnetic resonance imaging data. Other applications and extensions are immediately available, as our approach is a generalization of the classical Kalman Filter-Smoother Expectation-Maximization algorithm.
In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the collected rewards while interacting with their environment while using some prior knowledge that is accessed beforehand. Many BRL algorithms have already been proposed, but even though a few toy examples exist in the literature, there are still no extensive or rigorous benchmarks to compare them. The paper addresses this problem, and provides a new BRL comparison methodology along with the corresponding open source library. In this methodology, a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions is defined. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms and the results are discussed.
Causal inference in observational studies is notoriously difficult, due to the fact that the experimenter is not in charge of the treatment assignment mechanism. Many potential confounding factors (PCFs) exist in such a scenario, and if one seeks to estimate the causal effect of the treatment on a response, one needs to control for such factors. Identifying all relevant PCFs may be difficult (or impossible) given a single observational study. Instead, we argue that if one can observe a sequence of similar treatments over the course of a lengthy time period, one can identify patterns of behavior in the experimental subjects that are correlated with the response of interest and control for those patterns directly. Specifically, in our case-study we find and control for an early-adopter effect: the scenario in which the magnitude of the response is highly correlated with how quickly one adopts a treatment after its release. We provide a flexible hierarchical Bayesian framework that controls for such early-adopter effects in the analysis of the effects of multiple sequential treatments. The methods are presented and evaluated in the context of a detailed case-study involving product updates (newer versions of the same product) from eBay, Inc. The users in our study upgrade (or not) to a new version of the product at their own volition and timing. Our response variable is a measure of user actions, and we study the behavior of a large set of users (n = 10.5 million) in a targeted subset of eBay categories over a period of one year. We find that (a) naive causal estimates are hugely misleading and (b) our method, which is relatively insensitive to modeling assumptions and exhibits good out-of-sample predictive validation, yields sensible causal estimates that offer eBay a stable basis for decision-making.
We analyze control of the familywise error rate (FWER) in a multiple testing scenario with a great many null hypotheses about the distribution of a high-dimensional random variable among which only a very small fraction are false, or ‘active’. In order to improve power relative to conservative Bonferroni bounds, we explore a coarse-to-fine procedure adapted to a situation in which tests are partitioned into subsets, or ‘cells’, and active hypotheses tend to cluster within cells. We develop procedures for a standard linear model with Gaussian data and a non-parametric case based on generalized permutation testing, and demonstrate considerably higher power than Bonferroni estimates at the same FWER when the active hypotheses do cluster. The main technical difficulty arises from the correlation between the test statistics at the individual and cell levels, which increases the likelihood of a hypothesis being falsely discovered when the cell that contains it is falsely discovered (survivorship bias). This requires sharp estimates of certain quadrant probabilities when a cell is inactive.
Herein this paper is presented a novel invention – called Dpush – that enables truly scalable spam resistant uncensorable automatically encrypted and inherently authenticated messaging; thus restoring our ability to exert our right to private communication, and thus a step forward in restoring an uncorrupted democracy. Using a novel combination of a distributed hash table (DHT) and a proof of work (POW), combined in a way that can only be called a synergy, the emergent property of a scalable and spam resistant unsolicited messaging protocol elegantly emerges. Notable is that the receiver does not need to be online at the time the message is sent. This invention is already implemented and operating within the package that is called MORPHiS – which is a Sybil resistant enhanced Kademlia DHT implementation combined with an already functioning implementation of Dpush, as well as a polished HTTP Dmail interface to send and receive such messages today. MORPHiS is available for free (GPLv2) at the https://morph.is website.
In this paper we examine the role of lies in human social relations by implementing some salient characteristics of deceptive interactions into an opinion formation model, so as to describe the dynamical behaviour of a social network more realistically. In this model we take into account such basic properties of social networks as the dynamics of the intensity of interactions, the influence of public opinion, and the fact that in every human interaction it might be convenient to deceive or withhold information depending on the instantaneous situation of each individual in the network. We find that lies shape the topology of social networks, especially the formation of tightly linked, small communities with loose connections between them. We also find that agents with a larger proportion of deceptive interactions are the ones that connect communities of different opinion, and in this sense they have substantial centrality in the network. We then discuss the consequences of these results for the social behaviour of humans and predict the changes that could arise due to a varying tolerance for lies in society.
The lasso (Tibshirani, 1996) is an essential tool in modern high-dimensional regression and variable selection. The Bayesian lasso of Park and Casella (2008) interprets the lasso objective function as a posterior under a Laplace prior and proposes a three-step Gibbs sampler to sample from this posterior. The Bayesian lasso yields a natural approach to quantifying the uncertainty of lasso estimates. Furthermore, the Gibbs sampler for the Bayesian lasso has been shown to be geometrically ergodic (Khare and Hobert, 2013). The geometric rate constant of this Markov chain, however, tends to 1 if the number of regression coefficients grows faster than the sample size (Rajaratnam and Sparks, 2015). Thus, convergence of the Bayesian lasso Gibbs sampler can still be quite slow in modern high-dimensional settings despite the apparent theoretical safeguard of geometric ergodicity. In order to address this challenge, we propose a new method to draw from the same posterior via a tractable two-step blocked Gibbs sampler, which we call the fast Bayesian lasso. We provide a theoretical underpinning to the new method by proving rigorously that the fast Bayesian lasso is geometrically ergodic. We then demonstrate numerically that this blocked sampler exhibits vastly superior convergence behavior in high-dimensional regimes.
Ridge regularized linear models (RRLMs), such as ridge regression and the SVM, are a popular group of methods that are used in conjunction with coefficient hypothesis testing to discover explanatory variables with a significant multivariate association to a response. However, many investigators are reluctant to draw causal interpretations of the selected variables due to the incomplete knowledge of the capabilities of RRLMs in causal inference. Under reasonable assumptions, we show that a modified form of RRLMs can get very close to identifying a subset of the Markov boundary by providing a worst-case bound on the space of possible solutions. The results hold for any convex loss, even when the underlying functional relationship is nonlinear, and the solution is not unique. Our approach combines ideas in Markov boundary and sufficient dimension reduction theory. Experimental results show that the modified RRLMs are competitive against state-of-the-art algorithms in discovering part of the Markov boundary from gene expression data.
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.
Most widely used state estimation algorithms, such as the Extended Kalman Filter and the Unscented Kalman Filter, belong to the family of Gaussian Filters (GFs). Unfortunately, GFs fail for observation models described by a fat-tailed distribution. This is a serious limitation because thin-tailed observation models, such as the Gaussian distribution, are sensitive to outliers. In this paper, we show that GFs can be enabled to work with fat-tailed models by applying a feature function to the measurement. Furthermore, we find an optimal feature function using variational inference techniques. Simulation results demonstrate the robustness of the proposed filter to outliers in the measurements.
In high-dimensional multivariate regression problems, rank reduction is a very effective way for dimension reduction to facilitate both model estimation and interpretation. However, commonly used reduced rank methods are extremely non-robust against data corruption, as the low-rank dependence structure between response variables and predictors could be easily distorted in the presence of gross outliers. We propose a robust reduced rank regression approach for joint reduced rank modeling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parametrization, and an efficient thresholding-based iterative procedure is developed for optimization. We show that the proposed approach generalizes and unifies several popular robust estimation methods. Our theoretical investigations focus on nonasymptotic robust analysis, which demonstrates that conducting rank reduction and outlier detection jointly leads to improved prediction accuracy. The performance of the proposed method is examined empirically by simulation studies and two real data examples.
Feature weighting algorithms try to solve a problem of great importance nowadays in machine learning: The search of a relevance measure for the features of a given domain. This relevance is primarily used for feature selection as feature weighting can be seen as a generalization of it, but it is also useful to better understand a problem’s domain or to guide an inductor in its learning process. Relief family of algorithms are proven to be very effective in this task. Some other feature weighting methods are reviewed in order to give some context and then the different existing extensions to the original algorithm are explained. One of Relief’s known issues is the performance degradation of its estimates when redundant features are present. A novel theoretical definition of redundancy level is given in order to guide the work towards an extension of the algorithm that is more robust against redundancy. A new extension is presented that aims for improving the algorithms performance. Some experiments were driven to test this new extension against the existing ones with a set of artificial and real datasets and denoted that in certain cases it improves the weight’s estimation accuracy.
This project addresses the problem of sentiment analysis in twitter; that is classifying tweets according to the sentiment expressed in them: positive, negative or neutral. Twitter is an online micro-blogging and social-networking platform which allows users to write short status updates of maximum length 140 characters. It is a rapidly expanding service with over 200 million registered users – out of which 100 million are active users and half of them log on twitter on a daily basis – generating nearly 250 million tweets per day. Due to this large amount of usage we hope to achieve a reflection of public sentiment by analysing the sentiments expressed in the tweets. Analysing the public sentiment is important for many applications such as firms trying to find out the response of their products in the market, predicting political elections and predicting socioeconomic phenomena like stock exchange. The aim of this project is to develop a functional classifier for accurate and automatic sentiment classification of an unknown tweet stream.