A Bayesian Generalized CAR Model for Correlated Signal Detection

Over the last decade, large-scale multiple testing has found itself at the forefront of modern data analysis. In many applications data are correlated, so that the observed test statistic used for detecting a non-null case, or signal, at each location in a dataset carries some information about the chances of a true signal at other locations. Brown, Lazar, Datta, Jang, and McDowell (2014) proposed in the neuroimaging context a Bayesian multiple testing model that accounts for the dependence of each volume element (voxel) on the behavior of its neighbors through a conditional autoregressive (CAR) model. Here, we propose more general definitions of neighborhood structures that allow for inclusion of points with no neighbors at all, something that is not possible under conventional CAR models. We also consider neighborhoods based on criteria other than physical location, such as genetic pathways in microarray data, defined from existing biological knowledge. This modification allows for the simultaneous modeling of dependent and independent cases, resulting in increased precision in the estimates of non-null signal strengths. Further, we allow for less restrictive prior assumptions on the variance components, justify the selected prior distribution, and prove that the resulting posterior distribution is proper. We illustrate the effectiveness and applicability of our proposed model by using it to analyze both simulated and real microarray data in which the genes exhibit nontrivial dependence that is determined by physical adjacency on a chromosome or predefined gene pathways.


Words are not Equal: Graded Weighting Model for building Composite Document Vectors

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models omit stop words from the composition. Instead of such a yes-no decision, we consider several graded schemes where words are weighted according to their discriminatory relevance with respect to their use in the document (e.g., idf). Some of these methods (particularly tf-idf) are seen to result in a significant improvement in performance over the prior state of the art. Further, combining such approaches into an ensemble based on alternate classifiers such as the RNN model results in a 1.6% performance improvement on the standard IMDB movie review dataset and a 7.01% improvement on Amazon product reviews. Since these are language-free models and can be obtained in an unsupervised manner, they are also of interest for under-resourced languages such as Hindi, among many others. We demonstrate the language-free aspects by showing a gain of 12% for two review datasets over earlier results, and also release a new larger dataset for future testing (Singh, 2015).
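
As a rough illustration of graded weighting, the sketch below builds a document vector as a tf-idf weighted average of word vectors, so that frequent words are down-weighted rather than dropped. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy 2-dimensional word vectors, and the particular smoothed idf formula are all chosen here for illustration.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed inverse document frequency per token (one common variant, assumed here)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def weighted_doc_vector(doc, word_vecs, idf):
    """tf-idf weighted average of word vectors; tokens without a vector are skipped."""
    tf = Counter(tok for tok in doc if tok in word_vecs)
    dim = len(next(iter(word_vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = count * idf.get(tok, 0.0)  # graded weight instead of a keep/drop decision
        weight_sum += w
        for i, x in enumerate(word_vecs[tok]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]

# Hypothetical toy corpus and word vectors, for illustration only.
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "plot"],
]
word_vecs = {"the": [0.0, 0.0], "movie": [0.0, 1.0], "great": [1.0, 0.0], "awful": [-1.0, 0.0]}

idf = idf_weights(docs)
print(weighted_doc_vector(docs[0], word_vecs, idf))
```

In this toy corpus "the" appears in every document and receives the smallest weight, while "great" appears in only one and pulls the document vector furthest in its direction. The stop word still contributes, which is the difference from a yes-no filter.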


A Fast Heuristic for Exact String Matching

Given a pattern string P of length n consisting of \delta distinct characters and a query string T of length m, where the characters of P and T are drawn from an alphabet \Sigma of size \Delta, the {\em exact string matching} problem consists of finding all occurrences of P in T. For this problem, we present a randomized heuristic that in O(n\delta) time preprocesses P to identify sparse(P), a rarely occurring substring of P, and then uses it to find all occurrences of P in T efficiently. This heuristic has an expected search time of O(\frac{m}{\min(|sparse(P)|, \Delta)}), where |sparse(P)| is at least \delta. We also show that for a pattern string P whose characters are chosen uniformly at random from an alphabet of size \Delta, E[|sparse(P)|] is \Omega(\Delta \log(\frac{2\Delta}{2\Delta-\delta})).
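
The general filter-then-verify idea behind such heuristics can be sketched as follows. This is a deliberately simplified stand-in and not the construction of sparse(P) described above: instead of a rare substring, it anchors on the single character that occurs least often in P, scans T for that character, and verifies the full pattern only at those alignments. The function name `find_all` is hypothetical, and no claim is made that this sketch attains the stated expected search time.

```python
from collections import Counter

def find_all(pattern, text):
    """Return the start indices of all occurrences of pattern in text."""
    if not pattern or len(pattern) > len(text):
        return []
    counts = Counter(pattern)
    # Simplifying assumption: use the least frequent character of the pattern
    # as the anchor, in place of a rare substring.
    anchor = min(counts, key=counts.get)
    offset = pattern.index(anchor)  # first position of the anchor inside the pattern
    hits = []
    pos = text.find(anchor, offset)
    while pos != -1:
        start = pos - offset
        # Verify the whole pattern only where the anchor lines up.
        if text.startswith(pattern, start):
            hits.append(start)
        pos = text.find(anchor, pos + 1)
    return hits

print(find_all("aba", "ababa"))
```

Every occurrence of the pattern must contain the anchor at the same offset, so checking only the anchor's positions in the text misses nothing, including overlapping matches.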


Multi-threshold Accelerate Failure Time Model

A two-stage procedure for simultaneously detecting multiple thresholds and achieving model selection in the segmented accelerated failure time (AFT) model is developed in this paper. In the first stage, we formulate the threshold problem as a group model selection problem so that a concave 2-norm group selection method can be applied. In the second stage, the thresholds are finalized via a refining method. We establish the strong consistency of the threshold estimates and regression coefficient estimates under some mild technical conditions. The proposed procedure performs satisfactorily in our extensive simulation studies. Its real-world applicability is demonstrated by analyzing a follicular lymphoma data set.


Grid: A next generation data parallel C++ QCD library

In these proceedings we discuss the motivation, implementation details, and performance of a new physics code base called Grid. It is intended to be more performant and more general than, but similar in spirit to, QDP++\cite{QDP}. Our approach is to engineer the basic type system to be consistently fast, rather than bolt on a few optimised routines, and we attempt to write all our optimised routines directly in the Grid framework. It is hoped this will deliver best known practice performance across the next generation of supercomputers, which will provide programming challenges to traditional scalar codes. We illustrate the programming patterns used to implement our goals, and advances in productivity that have been enabled by using new features in C++11.


Measuring Semantic Relatedness using Mined Semantic Analysis

Mined Semantic Analysis (MSA) is a novel distributional semantics approach which employs data mining techniques. MSA embraces knowledge-driven analysis of natural languages. It uncovers implicit relations between concepts by mining for their associations in target encyclopedic corpora. MSA exploits not only target corpus content but also its knowledge graph (e.g., ‘See also’ link graph of Wikipedia). Empirical results show competitive performance of MSA compared to prior state-of-the-art methods for measuring semantic relatedness on benchmark data sets. Additionally, we introduce the first analytical study to examine statistical significance of results reported by different semantic relatedness methods. Our study shows that top-performing results could be statistically equivalent though mathematically different. The study positions MSA as one of the state-of-the-art methods for measuring semantic relatedness.


Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

Recursive partitioning approaches producing tree-like models are a long-standing staple of predictive modeling, in the last decade mostly as “sub-learners” within state-of-the-art ensemble methods like Boosting and Random Forest. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods’ inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. Such variables can often be very informative, but current tree methods essentially leave us a choice of either not using them, or exposing our models to severe overfitting. We propose a conceptual framework for splitting that uses leave-one-out (LOO) cross validation to select the splitting variable, then performs a regular split (in our case, following CART’s approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our novel splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which under reasonable assumptions does not increase the overall computational complexity compared to CART for two-class classification. For regression tasks, our approach carries an increased computational burden, replacing an O(log(n)) factor in CART’s splitting rule search with an O(n) term.
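
The idea of choosing the splitting variable by LOO error can be sketched for regression as below. This is a naive illustration that refits a split for every left-out observation, not the efficient algorithm the abstract refers to. All function names are hypothetical; categorical values are ordered by their mean response before a binary split, and a left-out observation whose category is unseen in the remaining data is predicted by the overall mean, which is an assumption made here to keep the sketch short.

```python
from statistics import mean

def best_binary_split(xs, ys):
    """Least-squares binary split; returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[k - 1][0] + pairs[k][0]) / 2, lm, rm)
    if best is None:  # constant feature: no split possible
        m = mean(ys)
        return None, m, m
    return best[1], best[2], best[3]

def split_predict(train_x, train_y, x_new, categorical):
    """Fit one binary split on the training data and predict for x_new."""
    if categorical:
        groups = {}
        for c, y in zip(train_x, train_y):
            groups.setdefault(c, []).append(y)
        enc = {c: mean(v) for c, v in groups.items()}  # order categories by mean response
        if x_new not in enc:
            return mean(train_y)  # assumption: unseen category gets the overall mean
        train_x = [enc[c] for c in train_x]
        x_new = enc[x_new]
    thr, lm, rm = best_binary_split(train_x, train_y)
    if thr is None:
        return lm
    return lm if x_new <= thr else rm

def in_sample_error(xs, ys, categorical):
    """Mean squared error when the split is fit and evaluated on the same data."""
    return mean((y - split_predict(xs, ys, x, categorical)) ** 2 for x, y in zip(xs, ys))

def loo_error(xs, ys, categorical):
    """Mean squared error when each observation is predicted from the others."""
    errs = []
    for i in range(len(ys)):
        tx, ty = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        errs.append((ys[i] - split_predict(tx, ty, xs[i], categorical)) ** 2)
    return mean(errs)

def select_variable(features, ys):
    """features maps name -> (values, is_categorical); pick the lowest LOO error."""
    return min(features, key=lambda name: loo_error(features[name][0], ys, features[name][1]))

# Toy data: x_num carries the signal; row_id is a categorical with one category per row.
ys = [0, 1, 0, 1, 0, 1, 3, 4, 3, 4, 3, 4]
features = {
    "x_num": ([0, 1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 25], False),
    "row_id": ([f"id{i}" for i in range(12)], True),
}
for name, (xs, cat) in features.items():
    print(name, in_sample_error(xs, ys, cat), loo_error(xs, ys, cat))
print(select_variable(features, ys))
```

On this toy data the uninformative `row_id` variable fits the training data at least as well as `x_num`, so an in-sample criterion has no reason to reject it, whereas its LOO error is far larger and the LOO rule selects `x_num`.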


Rectangular Kronecker coefficients and plethysms in geometric complexity theory

(4,2)-choosability of planar graphs with forbidden structures

Tree sets

Branching Brownian motion and Selection in the Spatial Lambda-Fleming-Viot Process

Equivalence between asynchronous and delayed dynamics in coupled maps

Discovering the laws of urbanisation

The practical irrelevance of the collapse of the ensemble Kalman filter and other particle filters

Computing Affine Combinations, Distances, and Correlations for Recursive Partition Functions

A functional central limit theorem for integrals of stationary mixing random fields

Derivation and Analysis of Simplified Filters for Complex Dynamical Systems

Time-consistency of cash-subadditive risk measures

Effective interaction potential for amorphous silica from ab initio simulations

Unconventional critical activated scaling of two-dimensional quantum spin-glasses

Limits of subcritical random graphs and random graphs with excluded minors

Stability of Cramer’s Characterization of Normal Laws in Information Distances

Markov chains on graded posets: Compatibility of up-directed and down-directed transition probabilities

Regularity of the drift and entropy of random walks on groups

Graph Isomorphism in Quasipolynomial Time

Martingale Representation and Logarithmic-Sobolev Inequality for Fractional Ornstein-Uhlenbeck Measure

Near-Optimal Hardness Results for Signaling in Bayesian Games

Distilling Knowledge from Deep Networks with Applications to Healthcare Domain

A Conversation with Nan Laird

Fast Generation of Spatially Embedded Random Networks

Constructive noncommutative rank computation in deterministic polynomial time over fields of arbitrary characteristics

Pricing variable annuities with multi-layer expense strategy

A Unified Approach to Error Bounds for Structured Convex Optimization Problems

Product mixing in the alternating group

Subsumptive reflection in SNOMED CT: a large description logic-based terminology for diagnosis

A formula for the expected volume of the Wiener sausage with constant drift

Unbiasedness and Bayes Estimation

ClusPath: A Temporal-driven Clustering to Infer Typical Evolution Paths

Moment-Based Spectral Analysis of Random Graphs with Given Expected Degrees

On the combinatorial structure of 0/1-matrices representing nonobtuse simplices

Coloring graphs without fan vertex-minors and graphs without cycle pivot-minors

Optimal Adaptive Inference in Random Design Binary Regression

On the chromatic numbers of small-dimensional euclidian spaces

Computing factorized approximations of Pareto-fronts using mNM-landscapes and Boltzmann distributions

Neural Self Talk: Image Understanding via Continuous Questioning and Answering

Pointwise estimates for exceedance times of perpetuity sequences

Supercharacter theories of type $A$ unipotent radicals and unipotent polytopes

Scalable Modeling of Conversational-role based Self-presentation Characteristics in Large Online Forums