Over the last decade, large-scale multiple testing has found itself at the forefront of modern data analysis. In many applications data are correlated, so that the observed test statistic used for detecting a non-null case, or signal, at each location in a dataset carries some information about the chances of a true signal at other locations. Brown, Lazar, Datta, Jang, and McDowell (2014) proposed in the neuroimaging context a Bayesian multiple testing model that accounts for the dependence of each volume element (voxel) on the behavior of its neighbors through a conditional autoregressive (CAR) model. Here, we propose more general definitions of neighborhood structures that allow for inclusion of points with no neighbors at all, something that is not possible under conventional CAR models. We also consider neighborhoods based on criteria other than physical location, such as genetic pathways in microarray data that are defined from existing biological knowledge. This modification allows for the simultaneous modeling of dependent and independent cases, resulting in increased precision in the estimates of non-null signal strengths. Further, we allow for less restrictive prior assumptions on the variance components, justify the selected prior distribution, and prove that the resulting posterior distribution is proper. We illustrate the effectiveness and applicability of our proposed model by using it to analyze both simulated and real microarray data in which the genes exhibit nontrivial dependence that is determined by physical adjacency on a chromosome or predefined gene pathways.
Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models omit stop words from the composition. Instead of such a yes-no decision, we consider several graded schemes where words are weighted according to their discriminatory relevance with respect to their use in the document (e.g., idf). Some of these methods (particularly tf-idf) are seen to result in a significant improvement in performance over the prior state of the art. Further, combining such approaches into an ensemble based on alternate classifiers such as the RNN model results in a 1.6% performance improvement on the standard IMDB movie review dataset, and a 7.01% improvement on Amazon product reviews. Since these are language-free models and can be obtained in an unsupervised manner, they are also of interest for under-resourced languages such as Hindi, among many others. We demonstrate the language-free aspects by showing a gain of 12% for two review datasets over earlier results, and also release a new larger dataset for future testing (Singh, 2015).
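The graded weighting idea can be illustrated with a minimal sketch of a tf-idf weighted average of word vectors. The smoothed idf formula, the function names, and the toy two-dimensional vectors are assumptions made for illustration only; the abstract does not specify which idf variant or which embeddings are used.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency for every token in a tokenized corpus.

    The smoothing form log((1 + N) / (1 + df)) + 1 is an assumption, not
    necessarily the variant used in the work described above.
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_doc_vector(doc, vectors, idf):
    """tf-idf weighted average of the word vectors of one tokenized document."""
    # Term frequency, restricted to tokens that have both a vector and an idf.
    tf = Counter(tok for tok in doc if tok in vectors and tok in idf)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in tf.items():
        w = count * idf[tok]
        weight_sum += w
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical toy corpus and embeddings.
docs = [["the", "movie", "great"], ["the", "movie", "bad"], ["the", "plot"]]
vectors = {"the": [0.0, 0.0], "movie": [0.0, 0.0], "great": [1.0, 0.0],
           "bad": [-1.0, 0.0], "plot": [0.0, 1.0]}
idf = idf_weights(docs)
doc_vec = weighted_doc_vector(docs[0], vectors, idf)
```

In this toy corpus "the" occurs in every document and so receives the smallest weight, which means the rarer word "great" contributes more to the document vector than it would under a plain average. Stop words are down-weighted rather than removed outright.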
Given a pattern string $P$ of length $n$ consisting of $\delta$ distinct characters and a query string $T$ of length $m$, where the characters of $P$ and $T$ are drawn from an alphabet $\Sigma$ of size $\Delta$, the {\em exact string matching} problem consists of finding all occurrences of $P$ in $T$. For this problem, we present a randomized heuristic that in $O(n\delta)$ time preprocesses $P$ to identify $sparse(P)$, a rarely occurring substring of $P$, and then uses it to find all occurrences of $P$ in $T$ efficiently. This heuristic has an expected search time of $O(\frac{m}{\min(|sparse(P)|, \Delta)})$, where $|sparse(P)|$ is at least $\delta$. We also show that for a pattern string $P$ whose characters are chosen uniformly at random from an alphabet of size $\Delta$, $E[|sparse(P)|]$ is $\Omega(\Delta \log (\frac{2\Delta}{2\Delta-\delta}))$.
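The general idea of searching for a rare part of the pattern and then verifying the full pattern around each hit can be sketched as follows. This is not the randomized heuristic of the abstract, and `pick_anchor` is only a hypothetical stand-in for $sparse(P)$: it deterministically picks a fixed-width window of $P$ whose characters are least frequent within $P$. The function names and the `width` parameter are assumptions.

```python
from collections import Counter


def pick_anchor(pattern, width):
    """Return (offset, substring) for the window of `pattern` whose characters
    are rarest within `pattern`. Hypothetical stand-in for sparse(P)."""
    counts = Counter(pattern)
    width = max(1, min(width, len(pattern)))
    best_offset = min(
        range(len(pattern) - width + 1),
        key=lambda i: sum(counts[c] for c in pattern[i:i + width]),
    )
    return best_offset, pattern[best_offset:best_offset + width]


def find_all(pattern, text, width=2):
    """Return the start index of every occurrence of `pattern` in `text`."""
    if not pattern:
        return []
    offset, anchor = pick_anchor(pattern, width)
    hits = []
    pos = text.find(anchor)
    while pos != -1:
        start = pos - offset  # where the pattern would begin for this anchor hit
        if start >= 0 and text.startswith(pattern, start):
            hits.append(start)
        pos = text.find(anchor, pos + 1)  # step by one so overlaps are kept
    return hits


matches = find_all("abra", "abracadabra")
```

Every occurrence of the pattern contains the anchor at a fixed offset, so scanning all anchor hits and verifying each one cannot miss a match. The rarer the anchor is in the text, the fewer verifications are needed.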
A two-stage procedure for simultaneously detecting multiple thresholds and achieving model selection in the segmented accelerated failure time (AFT) model is developed in this paper. In the first stage, we formulate the threshold problem as a group model selection problem so that a concave 2-norm group selection method can be applied. In the second stage, the thresholds are finalized via a refining method. We establish the strong consistency of the threshold estimates and regression coefficient estimates under some mild technical conditions. The proposed procedure performs satisfactorily in our extensive simulation studies. Its real-world applicability is demonstrated by analyzing a follicular lymphoma data set.
In these proceedings we discuss the motivation, implementation details, and performance of a new physics code base called Grid. It is intended to be more performant and more general than, but similar in spirit to, QDP++\cite{QDP}. Our approach is to engineer the basic type system to be consistently fast, rather than bolt on a few optimised routines, and we attempt to write all our optimised routines directly in the Grid framework. It is hoped this will deliver best known practice performance across the next generation of supercomputers, which will provide programming challenges to traditional scalar codes. We illustrate the programming patterns used to implement our goals, and advances in productivity that have been enabled by using new features in C++11.
Mined Semantic Analysis (MSA) is a novel distributional semantics approach which employs data mining techniques. MSA embraces knowledge-driven analysis of natural languages. It uncovers implicit relations between concepts by mining for their associations in target encyclopedic corpora. MSA exploits not only target corpus content but also its knowledge graph (e.g., ‘See also’ link graph of Wikipedia). Empirical results show competitive performance of MSA compared to prior state-of-the-art methods for measuring semantic relatedness on benchmark data sets. Additionally, we introduce the first analytical study to examine statistical significance of results reported by different semantic relatedness methods. Our study shows that top-performing results could be statistically equivalent though mathematically different. The study positions MSA as one of the state-of-the-art methods for measuring semantic relatedness.
Recursive partitioning approaches producing tree-like models are a long-standing staple of predictive modeling, in the last decade mostly as “sub-learners” within state-of-the-art ensemble methods like Boosting and Random Forest. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods’ inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. Such variables can often be very informative, but current tree methods essentially leave us a choice of either not using them, or exposing our models to severe overfitting. We propose a conceptual framework for splitting that uses leave-one-out (LOO) cross validation to select the splitting variable, then performs a regular split (in our case, following CART’s approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are only chosen if they contribute to predictive power. We demonstrate in extensive simulation and real data analysis that our novel splitting approach significantly improves the performance of both single tree models and ensemble methods that utilize trees. Importantly, we design an algorithm for LOO splitting variable selection which under reasonable assumptions does not increase the overall computational complexity compared to CART for two-class classification. For regression tasks, our approach carries an increased computational burden, replacing an O(log(n)) factor in CART’s splitting rule search with an O(n) term.
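The following minimal sketch shows why a leave-one-out criterion can protect against high-cardinality categorical variables. It is a simplification and not the algorithm described above: instead of searching for a CART-style binary split, it treats every category as its own leaf and predicts each observation by the mean of the other observations in the same category. The singleton fallback, the function names, and the toy data are assumptions.

```python
from collections import defaultdict


def loo_sse(categories, y):
    """Leave-one-out squared error when each point is predicted by the mean
    of the other points sharing its category (requires len(y) >= 2)."""
    n = len(y)
    total = sum(y)
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for c, v in zip(categories, y):
        sums[c] += v
        sizes[c] += 1
    err = 0.0
    for c, v in zip(categories, y):
        if sizes[c] > 1:
            pred = (sums[c] - v) / (sizes[c] - 1)
        else:
            # Singleton category: no peers are left once the point is held out,
            # so fall back to the mean of all other points (an assumption).
            pred = (total - v) / (n - 1)
        err += (v - pred) ** 2
    return err


def select_split_variable(columns, y):
    """Pick the categorical column with the lowest LOO squared error."""
    scores = {name: loo_sse(col, y) for name, col in columns.items()}
    return min(scores, key=scores.get), scores


# Hypothetical data: `group` is informative, `row_id` is unique per row.
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
columns = {
    "group": ["a"] * 4 + ["b"] * 4,
    "row_id": list(range(8)),
}
best, scores = select_split_variable(columns, y)
```

An in-sample criterion would rate `row_id` as perfect, since each category holds exactly one observation and fits it with zero error. Under the leave-one-out score `row_id` predicts no better than the overall mean, so the informative `group` variable is selected instead.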