Incremental Query Processing on Big Data Streams

This paper addresses online processing for large-scale, incremental computations on a distributed stream processing engine (DSPE). Our goal is to convert any distributed batch query to an incremental DSPE program automatically. In contrast to other approaches, we derive incremental programs that return accurate results, not approximate answers, by retaining a minimal state during the query evaluation lifetime and by using incremental evaluation techniques to return an accurate snapshot answer at each time interval that depends on the current state and the latest batches of data. Our methods can handle many forms of queries, including iterative and nested queries, group-by with aggregation, and joins on one-to-many relationships. Finally, we report on a prototype implementation of our framework using MRQL running on top of Spark and we experimentally validate the effectiveness of our methods.


A general framework for estimation and inference from clusters of features

Applied statistical problems often come with pre-specified groupings to predictors. It is natural to test for the presence of simultaneous group-wide signal for groups in isolation, or for multiple groups together. Classical tests for the presence of such signals rely either on tests for the omission of the entire block of variables (the classical F-test) or on the creation of an unsupervised prototype for the group (either a group centroid or first principal component) and subsequent t-tests on these prototypes. In this paper, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes, with reference to the response, hopefully improving on the unsupervised prototypes, and then testing with likelihood ratio statistics incorporating only these prototypes. We propose a (potentially) novel model, called the ‘prototype model’, which naturally models the two-step prototype-then-test procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction. Prototype formation often relies on variable selection, which invalidates classical Gaussian test theory. We use recent advances in selective inference to account for selection in the prototyping step and retain test validity. Simulation experiments suggest that our testing procedure enjoys more power than do classical approaches.


Dynamic Capacity Networks

We introduce the Dynamic Capacity Network (DCN), a neural network that can adaptively assign its capacity across different portions of the input data. This is achieved by combining modules of two types: low-capacity sub-networks and high-capacity sub-networks. The low-capacity sub-networks are applied across most of the input, but also provide a guide to select a few portions of the input on which to apply the high-capacity sub-networks. The selection is made using a novel gradient-based attention mechanism, that efficiently identifies the modules and input features that are most likely to impact the DCN’s output and to which we’d like to devote more capacity. We focus our empirical evaluation on the cluttered MNIST and SVHN image datasets. Our findings indicate that DCNs are able to drastically reduce the number of computations, compared to traditional convolutional neural networks, while maintaining similar performance.


A novel, divergence based, regression for compositional data

In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science amongst others. The goal of this paper is to propose a new, divergence based, regression modelling technique for compositional data. To do so, a recently proved metric which is a special case of the Jensen-Shannon divergence is employed. A strong advantage of this new regression technique is that zeros are naturally handled. An example with real data and simulation studies are presented and are both compared with the log-ratio based regression suggested by Aitchison in 1986.


A Survey of Signed Network Mining in Social Media

Many real-world relations can be represented by signed networks with positive and negative links, and signed network analysis has attracted increasing attention from multiple disciplines. With the evolution of data from offline to social media networks, signed network analysis has evolved from developing and measuring theories to mining tasks. In this article, we present a review of mining signed networks in social media and discuss some promising research directions and new frontiers. We begin by giving basic concepts and unique properties and principles of signed networks. Then we classify and review tasks of signed network mining with representative algorithms. We also delineate some tasks that have not been extensively studied with formal definitions and research directions to expand the boundaries of signed network mining.


Convergent Learning: Do different neural networks learn the same representations?

Recent success in training deep neural networks have prompted active investigation into the features learned on their intermediate layers. Such research is difficult because it requires making sense of non-linear computations performed by millions of learned parameters, but valuable because it increases our ability to understand current models and training algorithms and thus create improved versions of them. In this paper we investigate the extent to which neural networks exhibit what we call convergent learning, which is when the representations learned by multiple nets converge to a set of features which are either individually similar between networks or where subsets of features span similar low-dimensional spaces. We propose a specific method of probing representations: training multiple networks and then comparing and contrasting their individual, learned representations at the level of neurons or groups of neurons. We begin research into this question using three techniques to approximately align different neural networks on a feature level: a bipartite matching approach that makes one-to-one assignments between neurons, a sparse prediction approach that finds one-to-many mappings, and a spectral clustering approach that finds many-to-many mappings. This initial investigation reveals a few previously unknown properties of neural networks, and we argue that future research into the question of convergent learning will yield many more. The insights described here include (1) that some features are learned reliably in multiple networks, yet other features are not consistently learned; (2) that units learn to span low-dimensional subspaces and, while these subspaces are common to multiple networks, the specific basis vectors learned are not; (3) that the representation codes are a mix between a local code and slightly, but not fully, distributed codes across multiple units.


Warmth and edge spaces of graphs

The L^p-norms of the Beurling-Ahlfors transform on radial functions

Super-Linear Gate and Super-Quadratic Wire Lower Bounds for Depth-Two and Depth-Three Threshold Circuits

Applications of Markov Chain Analysis to Integer Complexity

Generalized Conjugate Gradient Methods for $\ell_1$ Regularized Convex Quadratic Programming with Finite Convergence

Random cyclic dynamical systems

Lift-and-Round to Improve Weighted Completion Time on Unrelated Machines

2-Xor revisited: satisfiability and probabilities of functions

Spoken Language Translation for Polish

On the maximum number of spanning copies of an orientation in a tournament

LocNet: Improving Localization Accuracy for Object Detection

Proper two-sided exits of a Lévy process

A Note on Fault Tolerant Reachability for Directed Graphs

A generating function approach to branching random walks

Unlabeled Signed Graph Coloring

Statistical Properties of the Single Linkage Hierarchical Clustering Estimator

Searching for Objects using Structure in Indoor Scenes

Convergence of EM Scheme for Neutral Stochastic Differential Delay Equations

Acceptance sampling plans for inverse Weibull distribution based on truncated life test

A Distributed System for Storing and Processing Data from Earth-observing Satellites: System Design and Performance Evaluation of the Visualisation Tool

The Turan Number of Disjoint Copies of Paths

Transportation distances and noise sensitivity of multiplicative Lévy SDE with applications

Approximate Probabilistic Inference via Word-Level Counting

A Paradigmatic Regression Algorithm for Gene Selection Problems

Spin fluctuations of non-equilibrium electrons and excitons in semiconductors

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

Homophily and missing links in citation networks

Multivariate Complexity Analysis of Geometric {\sc Red Blue Set Cover}

Maximum likelihood estimation for the Fréchet distribution based on block maxima extracted from a time series

Modeling Heterogeneity in Networks using Uncertainty Quantification Tools

Picking a Conveyor Clean by an Autonomously Learning Robot

Fine-Grain Annotation of Cricket Videos

On the Computational Complexity of Limit Cycles in Dynamical Systems

Exploring the Distribution for the Estimator of Rosenthal’s ‘Fail-Safe’ Number of Unpublished Studies in Meta-analysis

Almost All Regular Graphs are Normal

Complex unit gain bicyclic graphs with rank 2, 3 or 4

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Cost Minimizing Online Algorithms for Energy Storage Management with Worst-case Guarantee

Splines in geometry and topology

Transductive Log Opinion Pool of Gaussian Process Experts

Ramsey numbers of trees versus odd cycles

Hoffman’s coclique bound for normal regular digraphs, and nonsymmetric association schemes

Regular sequences and the joint spectral radius

Calculating the Unrooted Subtree Prune-and-Regraft Distance

Tradeoffs for nearest neighbors on the sphere

The Limitations of Deep Learning in Adversarial Settings

Centred quadratic stochastic operators

Analysis and Optimal Targets Setup of a Multihead Weighing Machine

Constrained Structured Regression with Convolutional Neural Networks

An approximation algorithm for Uniform Capacitated k-Median problem with 1 + ε capacity violation

Low-Rank Tensor Approximations versus Polynomial Chaos Expansions for Meta-Modeling in High-Dimensional Spaces

Decoding Reed-Muller codes over product sets

FFT-Based Fast Bandwidth Selector for Multivariate Kernel Density Estimation

Weak Convergence Properties of Constrained Emphatic Temporal-difference Learning with Constant and Slowly Diminishing Stepsize

On the Poisson equation for Metropolis-Hastings chains

Estimating the number of unseen species: How far can one foresee?