A Markov Jump Process for More Efficient Hamiltonian Monte Carlo

Hamiltonian Monte Carlo with discrete time sampling steps is limited by the factthat its transition rates must be less than or equal to 1. By deriving an HMC algorithm that is a continuous time Markov jump process, we are able to overcome this constraint. This allows us to propose new transition rates that lead to improvedmixing, as measured both by both the spectral gap and auto-correlations on several example problems. We release the algorithm as an open source python package.

A Practioner’s Guide to Evaluating Entity Resolution Results

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of these metrics are borrowed from multi-class classification and clustering domains, though some key differences exist differentiating entity resolution from general clustering. Menestrina et al. empirically showed rankings from these metrics often conflict with each other, thus our primary motivation for studying them. This paper provides practitioners the basic knowledge to begin evaluating their entity resolution results.

Algorithms for Envelope Estimation II

We propose a new algorithm for envelope estimation, along with a new root n consistent method for computing starting values. The new algorithm, which does not require optimization over a Grassmannian, is shown by simulation to be much faster and typically more accurate that the best existing algorithm proposed by Cook and Zhang (2015c).

An M-Estimator for Reduced-Rank High-Dimensional Linear Dynamical System Identification

High-dimensional time-series data are becoming increasingly abundant across a wide variety of domains, spanning economics, neuroscience, particle physics, and cosmology. Fitting statistical models to such data, to enable parameter estimation and time-series prediction, is an important computational primitive. Existing methods, however, are unable to cope with the high-dimensional nature of these problems, due to both computational and statistical reasons. We mitigate both kinds of issues via proposing an M-estimator for Reduced-rank System IDentification (MR. SID). A combination of low-rank approximations, L-1 and L-2 penalties, and some numerical linear algebra tricks, yields an estimator that is computationally efficient and numerically stable. Simulations and real data examples demonstrate the utility of this approach in a variety of problems. In particular, we demonstrate that MR. SID can estimate spatial filters, connectivity graphs, and time-courses from native resolution functional magnetic resonance imaging data. Other applications and extensions are immediately available, as our approach is a generalization of the classical Kalman Filter-Smoother Expectation-Maximization algorithm.

Benchmarking for Bayesian Reinforcement Learning

In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the collected rewards while interacting with their environment while using some prior knowledge that is accessed beforehand. Many BRL algorithms have already been proposed, but even though a few toy examples exist in the literature, there are still no extensive or rigorous benchmarks to compare them. The paper addresses this problem, and provides a new BRL comparison methodology along with the corresponding open source library. In this methodology, a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions is defined. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, our library is illustrated by comparing all the available algorithms and the results are discussed.

Causal Inference in Repeated Observational Studies: A Case Study of eBay Product Releases

Causal inference in observational studies is notoriously difficult, due to the fact that the experimenter is not in charge of the treatment assignment mechanism. Many potential confounding factors (PCFs) exist in such a scenario, and if one seeks to estimate the causal effect of the treatment on a response, one needs to control for such factors. Identifying all relevant PCFs may be difficult (or impossible) given a single observational study. Instead, we argue that if one can observe a sequence of similar treatments over the course of a lengthy time period, one can identify patterns of behavior in the experimental subjects that are correlated with the response of interest and control for those patterns directly. Specifically, in our case-study we find and control for an early-adopter effect: the scenario in which the magnitude of the response is highly correlated with how quickly one adopts a treatment after its release. We provide a flexible hierarchical Bayesian framework that controls for such early-adopter effects in the analysis of the effects of multiple sequential treatments. The methods are presented and evaluated in the context of a detailed case-study involving product updates (newer versions of the same product) from eBay, Inc. The users in our study upgrade (or not) to a new version of the product at their own volition and timing. Our response variable is a measure of user actions, and we study the behavior of a large set of users (n = 10.5 million) in a targeted subset of eBay categories over a period of one year. We find that (a) naive causal estimates are hugely misleading and (b) our method, which is relatively insensitive to modeling assumptions and exhibits good out-of-sample predictive validation, yields sensible causal estimates that offer eBay a stable basis for decision-making.

Coarse-to-fine Multiple Testing Strategies

We analyze control of the familywise error rate (FWER) in a multiple testing scenario with a great many null hypotheses about the distribution of a high-dimensional random variable among which only a very small fraction are false, or ‘active’. In order to improve power relative to conservative Bonferroni bounds, we explore a coarse-to-fine procedure adapted to a situation in which tests are partitioned into subsets, or ‘cells’, and active hypotheses tend to cluster within cells. We develop procedures for a standard linear model with Gaussian data and a non-parametric case based on generalized permutation testing, and demonstrate considerably higher power than Bonferroni estimates at the same FWER when the active hypotheses do cluster. The main technical difficulty arises from the correlation between the test statistics at the individual and cell levels, which increases the likelihood of a hypothesis being falsely discovered when the cell that contains it is falsely discovered (survivorship bias). This requires sharp estimates of certain quadrant probabilities when a cell is inactive.

Dpush: A scalable decentralized spam resistant unsolicited messaging protocol

Herein this paper is presented a novel invention – called Dpush – that enables truly scalable spam resistant uncensorable automatically encrypted and inherently authenticated messaging; thus restoring our ability to exert our right to private communication, and thus a step forward in restoring an uncorrupted democracy. Using a novel combination of a distributed hash table[1] (DHT) and a proof of work[2] (POW), combined in a way that can only be called a synergy, the emergent property of a scalable and spam resistant unsolicited messaging protocol elegantly emerges. Notable is that the receiver does not need to be online at the time the message is sent. This invention is already implemented and operating within the package that is called MORPHiS – which is a Sybil[3] resistant enhanced Kademlia[1] DHT implementation combined with an already functioning implementation of Dpush, as well as a polished HTTP Dmail interface to send and receive such messages today. MORPHiS is available for free (GPLv2) at the https://morph.is website.

Dynamics of deceptive interactions in social networks

In this paper we examine the role of lies in human social relations by implementing some salient characteristics of deceptive interactions into an opinion formation model, so as to describe the dynamical behaviour of a social network more realistically. In this model we take into account such basic properties of social networks as the dynamics of the intensity of interactions, the influence of public opinion, and the fact that in every human interaction it might be convenient to deceive or withhold information depending on the instantaneous situation of each individual in the network. We find that lies shape the topology of social networks, especially the formation of tightly linked, small communities with loose connections between them. We also find that agents with a larger proportion of deceptive interactions are the ones that connect communities of different opinion, and in this sense they have substantial centrality in the network. We then discuss the consequences of these results for the social behaviour of humans and predict the changes that could arise due to a varying tolerance for lies in society.

Fast Bayesian Lasso for High-Dimensional Regression

The lasso (Tibshirani, 1996) is an essential tool in modern high-dimensional regression and variable selection. The Bayesian lasso of Park and Casella (2008) interprets the lasso objective function as a posterior under a Laplace prior and proposes a three-step Gibbs sampler to sample from this posterior. The Bayesian lasso yields a natural approach to quantifying the uncertainty of lasso estimates. Furthermore, the Gibbs sampler for the Bayesian lasso has been shown to be geometrically ergodic (Khare and Hobert, 2013). The geometric rate constant of this Markov chain, however, tends to 1 if the number of regression coefficients grows faster than the sample size (Rajaratnam and Sparks, 2015). Thus, convergence of the Bayesian lasso Gibbs sampler can still be quite slow in modern high-dimensional settings despite the apparent theoretical safeguard of geometric ergodicity. In order to address this challenge, we propose a new method to draw from the same posterior via a tractable two-step blocked Gibbs sampler, which we call the fast Bayesian lasso. We provide a theoretical underpinning to the new method by proving rigorously that the fast Bayesian lasso is geometrically ergodic. We then demonstrate numerically that this blocked sampler exhibits vastly superior convergence behavior in high-dimensional regimes.

Markov Boundary Discovery with Ridge Regularized Linear Models

Ridge regularized linear models (RRLMs), such as ridge regression and the SVM, are a popular group of methods that are used in conjunction with coefficient hypothesis testing to discover explanatory variables with a significant multivariate association to a response. However, many investigators are reluctant to draw causal interpretations of the selected variables due to the incomplete knowledge of the capabilities of RRLMs in causal inference. Under reasonable assumptions, we show that a modified form of RRLMs can get very close to identifying a subset of the Markov boundary by providing a worst-case bound on the space of possible solutions. The results hold for any convex loss, even when the underlying functional relationship is nonlinear, and the solution is not unique. Our approach combines ideas in Markov boundary and sufficient dimension reduction theory. Experimental results show that the modified RRLMs are competitive against state-of-the-art algorithms in discovering part of the Markov boundary from gene expression data.

Model Accuracy and Runtime Tradeoff in Distributed Deep Learning

This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.

Robust Gaussian Filtering

Most widely used state estimation algorithms, such as the Extended Kalman Filter and the Unscented Kalman Filter, belong to the family of Gaussian Filters (GFs). Unfortunately, GFs fail for observation models described by a fat-tailed distribution. This is a serious limitation because thin-tailed observation models, such as the Gaussian distribution, are sensitive to outliers. In this paper, we show that GFs can be enabled to work with fat-tailed models by applying a feature function to the measurement. Furthermore, we find an optimal feature function using variational inference techniques. Simulation results demonstrate the robustness of the proposed filter to outliers in the measurements.

Robust Reduced Rank Regression

In high-dimensional multivariate regression problems, rank reduction is a very effective way for dimension reduction to facilitate both model estimation and interpretation. However, commonly used reduced rank methods are extremely non-robust against data corruption, as the low-rank dependence structure between response variables and predictors could be easily distorted in the presence of gross outliers. We propose a robust reduced rank regression approach for joint reduced rank modeling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parametrization, and an efficient thresholding-based iterative procedure is developed for optimization. We show that the proposed approach generalizes and unifies several popular robust estimation methods. Our theoretical investigations focus on nonasymptotic robust analysis, which demonstrates that conducting rank reduction and outlier detection jointly leads to improved prediction accuracy. The performance of the proposed method is examined empirically by simulation studies and two real data examples.

Toward better feature weighting algorithms: a focus on Relief

Feature weighting algorithms try to solve a problem of great importance nowadays in machine learning: The search of a relevance measure for the features of a given domain. This relevance is primarily used for feature selection as feature weighting can be seen as a generalization of it, but it is also useful to better understand a problem’s domain or to guide an inductor in its learning process. Relief family of algorithms are proven to be very effective in this task. Some other feature weighting methods are reviewed in order to give some context and then the different existing extensions to the original algorithm are explained. One of Relief’s known issues is the performance degradation of its estimates when redundant features are present. A novel theoretical definition of redundancy level is given in order to guide the work towards an extension of the algorithm that is more robust against redundancy. A new extension is presented that aims for improving the algorithms performance. Some experiments were driven to test this new extension against the existing ones with a set of artificial and real datasets and denoted that in certain cases it improves the weight’s estimation accuracy.

Twitter Sentiment Analysis

This project addresses the problem of sentiment analysis in twitter; that is classifying tweets according to the sentiment expressed in them: positive, negative or neutral. Twitter is an online micro-blogging and social-networking platform which allows users to write short status updates of maximum length 140 characters. It is a rapidly expanding service with over 200 million registered users – out of which 100 million are active users and half of them log on twitter on a daily basis – generating nearly 250 million tweets per day. Due to this large amount of usage we hope to achieve a reflection of public sentiment by analysing the sentiments expressed in the tweets. Analysing the public sentiment is important for many applications such as firms trying to find out the response of their products in the market, predicting political elections and predicting socioeconomic phenomena like stock exchange. The aim of this project is to develop a functional classifier for accurate and automatic sentiment classification of an unknown tweet stream.

2-complexes with large homological systoles

A Bayesian feature allocation model for tumor heterogeneity

A Terrible Expansion of the Determinant

Approximability of TSP on Power Law Graphs

Approximate subgroups of residually nilpotent groups

Asynchronous Networks and Event Driven Dynamics

Bandlimited Spatial Field Sampling with Mobile Sensors in the Absence of Location Information

Bayesian Epidemic Detection in Multiple Populations

Bayesian group Lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies

Bayesian optimal design for ordinary differential equation models

Biased doped silicene as a source for advanced electronics

Binary Codes and Period-$2$ Orbits of Sequential Dynamical Systems

Bio-Inspired Human Action Recognition using Hybrid Max-Product Neuro-Fuzzy Classifier and Quantum-Behaved PSO

Characterization of a class of weak transport-entropy inequalities on the line

Chromatic properties of the Euclidean plane

Constrained Eigenvalues Density of Invariant Random Matrices Ensembles

Controling contagious processes on temporal networks via adaptive rewiring

Counting tanglegrams with species

Deducing effective light transport parameters in optically thin systems

Diffusion of light in semitransparent media

Dropping Convexity for Faster Semi-definite Optimization

Ehrhart quasi-period collapse in rational polygons

Equiangular tight frames with centroidal symmetry

Estimating whole brain dynamics using spectral clustering

Extinction time for the contact process on general graphs

Failure Mitigation in Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

Fame for sale: efficient detection of fake Twitter followers

Fast Greedy Approaches for Compressive Sensing of Large-Scale Signals

Feynman-Kac Formulas for Solutions to Degenerate Elliptic and Parabolic Boundary-Value and Obstacle Problems with Dirichlet Boundary Conditions

From Pappus Theorem to moduli of some extremal line point configurations and applications

Functional generalized autoregressive conditional heteroskedasticity

Generalizing Top Trading Cycles for Housing Markets with Fractional Endowments

How transportation hierarchy shapes human mobility

Identifiability of Normal and Normal Mixture Models With Nonignorable Missing Data

Improved asymptotic estimates for the contact process with stirring

Improving distant supervision using inference learning

Integral representation with respect to fractional Brownian motion under a log-Holder assumption

Landau’s Theorem Revisited Again

Many-body Localization Transition in Rokhsar-Kivelson-type wave functions

Minimum parametric flow over time

Multi-type spatial branching models for local self-regulation I: Construction and an exponential duality

Multiversion Conflict Notion for Transactional Memory Systems

Natural scene statistics mediate the perception of image complexity

New Eulerian numbers of type D

Noetherianity and rooted trees

Non-existence of homotopical upper bounds for the chromatic number

On almost optimal universal hypergraphs

On the classification of self-dual [20,10,9] codes over GF(7)

On the probabilistic approach to the solution of generalized fractional differential equations of Caputo and Riemann-Liouville type

On transversal and $2$-packing numbers in straight line systems on $\mathbb{R}^{2}$

Optimization of anemia treatment in hemodialysis patients via reinforcement learning

Output-Polynomial Enumeration on Graphs of Bounded (Local) Linear MIM-Width

Parametric Maxflows for Structured Sparse Learning with Convex Relaxations of Submodular Functions

Parareal convergence for 2D unsteady flow around a cylinder

Persistent random walks

Practical Bayesian Tomography

Prime Power Divisibility,Periodicity and Other Properties of Some Second Order Recurrences

Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research

Raising The Bar For Vertex Cover: Fixed-parameter Tractability Above A Higher Guarantee

Randomization Improving Online Time-Sensitive Revenue Maximization for Green Data Centers

Refined dual stable Grothendieck polynomials and generalized Bender-Knuth involutions

Refined total variation bounds in the multivariate and compound Poisson approximation

Semiparametric time to event models in the presence of error-prone, self-reported outcomes – With application to the women’s health initiative

Sharp Oracle inequalities for square root regularization

Silent Self-stabilizing BFS Tree Algorithms Revised

Simultaneous flips on triangulated surfaces

Sojourn time in a single server queue with threshold service rate control

Spatial Bayesian variable selection and grouping for high-dimensional scalar-on-image regression

Stochastic Averaging and Sensitivity Analysis for Two Scale Reaction Networks

Stochastic integration with respect to cylindrical Lévy processes

Stochastic simulators based optimization by Gaussian process metamodels – Application to maintenance investments planning issues

The Spatial Context of Residential Segregation

The USFD Spoken Language Translation System for IWSLT 2014

t-Martin boundary of killed random walks in the quadrant

Transforming response values in small area prediction

Turning intractable counting into sampling: computing the granular entropy of three-dimensional jammed packings

Two closed forms for the Apostol-Bernoulli polynomials

Vectors of Locally Aggregated Centers for Compact Video Representation

Zigzag structure of thin chamber complexes