Arabesque: A System for Distributed Graph Mining – Extended version

Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some ‘interestingness’ criteria desired by the user. These algorithms are very important for areas such as social net- works, semantic web, and bioinformatics. In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque’s API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.

Mathematical Foundations for Designing and Development of Intelligent Systems of Information Analysis

This article is an attempt to combine different ways of working with sets of objects and their classes for designing and development of artificial intelligent systems (AIS) of analysis information, using object-oriented programming (OOP). This paper contains analysis of basic concepts of OOP and their relation with set theory and artificial intelligence (AI). Process of sets and multisets creation from different sides, in particular mathematical set theory, OOP and AI is considered. Definition of object and its properties, homogeneous and inhomogeneous classes of objects, set of objects, multiset of objects and constructive methods of their creation and classification are proposed. In addition, necessity of some extension of existing OOP tools for the purpose of practical implementation AIS of analysis information, using proposed approach, is shown.

Column Selection via Adaptive Sampling

Selecting a good column (or row) subset of massive data matrices has found many applications in data analysis and machine learning. We propose a new adaptive sampling algorithm that can be used to improve any relative-error column selection algorithm. Our algorithm delivers a tighter theoretical bound on the approximation error which we also demonstrate empirically using two well known relative-error column subset selection algorithms. Our experimental results on synthetic and real-world data show that our algorithm outperforms non-adaptive sampling as well as prior adaptive sampling approaches.

A Bayesian Network Model for Interesting Itemsets

Mining itemsets that are the most interesting under a statistical model of the underlying data is a frequently used and well-studied technique for exploratory data analysis. The most recent models of interestingness are predominantly based on maximum entropy distributions over items or tile entries with itemset constraints, and while computationally tractable are not easily interpretable. We therefore propose the first, to the best of our knowledge, generative model over itemsets, in the form of a Bayesian network, and an associated novel measure of interestingness. Our model is able to efficiently infer interesting itemsets directly from the transaction database using structural EM, in which the E-step employs the greedy approximation to weighted set cover. Our approach is theoretically simple, straightforward to implement, trivially parallelizable and exhibits competitive performance as we demonstrate on both synthetic and real-world examples.

Functional additive regression

We suggest a new method, called Functional Additive Regression, or FAR, for efficiently performing high-dimensional functional regression. FAR extends the usual linear regression model involving a functional predictor, X(t), and a scalar response, Y, in two key respects. First, FAR uses a penalized least squares optimization approach to efficiently deal with high-dimensional problems involving a large number of functional predictors. Second, FAR extends beyond the standard linear regression setting to fit general nonlinear additive models. We demonstrate that FAR can be implemented with a wide range of penalty functions using a highly efficient coordinate descent algorithm. Theoretical results are developed which provide motivation for the FAR optimization criterion. Finally, we show through simulations and two real data sets that FAR can significantly outperform competing methods.

Comparison of different Methods for Univariate Time Series Imputation in R

Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is, most standard imputation techniques can not be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of univariate time series imputation in general and an in-detail insight into the respective implementations within R packages. Furthermore, we experimentally compare the R functions on different time series using four different ratios of missing data. Our results show that either an interpolation with seasonal kalman filter from the zoo package or a linear interpolation on seasonal loess decomposed data from the forecast package were the most effective methods for dealing with missing data in most of the scenarios assessed in this paper.

Random Irregular Block-hierarchical Networks: Algorithms for Computation of Main Properties

Equivalence of zero entropy and the Liouville property for stationary random graphs

Contrast estimation for parametric stationary determinantal point processes

Inheritance in Object-Oriented Knowledge Representation

Exploiters-Based Knowledge Extraction in Object-Oriented Knowledge Representation

An Omnibus Nonparametric Test of Equality in Distribution for Unknown Functions

Object-Oriented Dynamic Networks

Improving Back-Propagation by Adding an Adversarial Gradient

Universal and Determined Constructors of Multisets of Objects

The Small-Mass Limit for Langevin Dynamics with Unbounded Coefficients and positive friction

Vector rearrangement invariant banach spaces of random variables with exponential decreasing tails of distributions

An Entropy-Based Technique for Classifying Bacterial Chromosomes According to Synonymous Codon Usage

Embarrassingly Parallel Variational Inference in Nonconjugate Models

Density-Matching for Turbomachinery Optimization Under Uncertainty

D-vine Copula Based Quantile Regression

Benchmarking Fast Data Platforms for the Aadhaar Biometric Database

Large deviations in relay-augmented wireless networks

Asymptotically optimal control for a multiclass queueing model in the moderate deviation heavy traffic regime

Matrix Schubert varieties and Gaussian conditional independence models

Reachability problems for PAMs

One-parameter statistical model for linear stochastic differential equation with time delay

Bak-Sneppen Backwards

Canonical Paths for MCMC: from Art to Science

Convex hulls of random walks, hyperplane arrangements, and Weyl chambers

Exact Small Time Equivalent for the density of the Circular Langevin Diffusion

Sample dependence in the maximum entropy solution to the generalized moment problem

Affine representations of fractional processes with applications in mathematical finance

Ends and Tangles

Entropy of Bernoulli convolutions and uniform exponential growth for linear groups

Automatic Transcription of Flamenco Singing from Polyphonic Music Recordings

The algebraic method in tree percolation

Corpus COFLA: A research corpus for the Computational study of Flamenco Music

Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates

Phase Transitions in a kinetic flocking model of Cucker-Smale type

Efficient, fast and principled mean-field inference for strongly coupled data

On the Classes of Interval Graphs of Limited Nesting and Count of Lengths

Insights into the variability of nucleated amyloid polymerization by a minimalistic model of stochastic protein assembly

Necessary and sufficient conditions for the existence of $α$-determinantal processes

Fused kernel-spline smoothing for repeatedly measured outcomes in a generalized partially linear model with functional single index

Natural Exponential Families With Reduction Functions and Resolution of A Conjecture

Detection of multiple perturbations in multi-omics biological networks

On the Ramsey-Turán number with small $s$-independence number

An $O(\log OPT)$-approximation for covering and packing minor models of $θ_r$

Constrained Percolation on Z^2

Inverse Littlewood-Offord problems for Quasi-Norms

Structured Memory for Neural Turing Machines

On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

Asymptotic Density of Zimin Words

Detecting and Handling Flash-Crowd Events on Cloud Environments

Erd\H os-Ko-Rado theorem for $\{0,\pm 1\}$-vectors

On Operator-Valued Bi-Free Distributions

A faster subquadratic algorithm for finding outlier correlations

Nonlinear memory capacity of parallel time-delay reservoir computers in the processing of multidimensional signals

The Surface Area Deviation of the Euclidean Ball and a Polytope