• Arabesque: A System for Distributed Graph Mining – Extended version
Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some ‘interestingness’ criteria desired by the user. These algorithms are very important for areas such as social net- works, semantic web, and bioinformatics. In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque’s API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.
• Mathematical Foundations for Designing and Development of Intelligent Systems of Information Analysis
This article is an attempt to combine different ways of working with sets of objects and their classes for designing and development of artificial intelligent systems (AIS) of analysis information, using object-oriented programming (OOP). This paper contains analysis of basic concepts of OOP and their relation with set theory and artificial intelligence (AI). Process of sets and multisets creation from different sides, in particular mathematical set theory, OOP and AI is considered. Definition of object and its properties, homogeneous and inhomogeneous classes of objects, set of objects, multiset of objects and constructive methods of their creation and classification are proposed. In addition, necessity of some extension of existing OOP tools for the purpose of practical implementation AIS of analysis information, using proposed approach, is shown.
• Column Selection via Adaptive Sampling
Selecting a good column (or row) subset of massive data matrices has found many applications in data analysis and machine learning. We propose a new adaptive sampling algorithm that can be used to improve any relative-error column selection algorithm. Our algorithm delivers a tighter theoretical bound on the approximation error which we also demonstrate empirically using two well known relative-error column subset selection algorithms. Our experimental results on synthetic and real-world data show that our algorithm outperforms non-adaptive sampling as well as prior adaptive sampling approaches.
• A Bayesian Network Model for Interesting Itemsets
Mining itemsets that are the most interesting under a statistical model of the underlying data is a frequently used and well-studied technique for exploratory data analysis. The most recent models of interestingness are predominantly based on maximum entropy distributions over items or tile entries with itemset constraints, and while computationally tractable are not easily interpretable. We therefore propose the first, to the best of our knowledge, generative model over itemsets, in the form of a Bayesian network, and an associated novel measure of interestingness. Our model is able to efficiently infer interesting itemsets directly from the transaction database using structural EM, in which the E-step employs the greedy approximation to weighted set cover. Our approach is theoretically simple, straightforward to implement, trivially parallelizable and exhibits competitive performance as we demonstrate on both synthetic and real-world examples.
• Functional additive regression
We suggest a new method, called Functional Additive Regression, or FAR, for efficiently performing high-dimensional functional regression. FAR extends the usual linear regression model involving a functional predictor,

, and a scalar response,

, in two key respects. First, FAR uses a penalized least squares optimization approach to efficiently deal with high-dimensional problems involving a large number of functional predictors. Second, FAR extends beyond the standard linear regression setting to fit general nonlinear additive models. We demonstrate that FAR can be implemented with a wide range of penalty functions using a highly efficient coordinate descent algorithm. Theoretical results are developed which provide motivation for the FAR optimization criterion. Finally, we show through simulations and two real data sets that FAR can significantly outperform competing methods.
• Comparison of different Methods for Univariate Time Series Imputation in R
Missing values in datasets are a well-known problem and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is, most standard imputation techniques can not be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of univariate time series imputation in general and an in-detail insight into the respective implementations within R packages. Furthermore, we experimentally compare the R functions on different time series using four different ratios of missing data. Our results show that either an interpolation with seasonal kalman filter from the zoo package or a linear interpolation on seasonal loess decomposed data from the forecast package were the most effective methods for dealing with missing data in most of the scenarios assessed in this paper.
• Random Irregular Block-hierarchical Networks: Algorithms for Computation of Main Properties
• Equivalence of zero entropy and the Liouville property for stationary random graphs
• Contrast estimation for parametric stationary determinantal point processes
• Inheritance in Object-Oriented Knowledge Representation
• Exploiters-Based Knowledge Extraction in Object-Oriented Knowledge Representation
• An Omnibus Nonparametric Test of Equality in Distribution for Unknown Functions
• Object-Oriented Dynamic Networks
• Improving Back-Propagation by Adding an Adversarial Gradient
• Universal and Determined Constructors of Multisets of Objects
• The Small-Mass Limit for Langevin Dynamics with Unbounded Coefficients and positive friction
• Vector rearrangement invariant banach spaces of random variables with exponential decreasing tails of distributions
• An Entropy-Based Technique for Classifying Bacterial Chromosomes According to Synonymous Codon Usage
• Embarrassingly Parallel Variational Inference in Nonconjugate Models
• Density-Matching for Turbomachinery Optimization Under Uncertainty
• D-vine Copula Based Quantile Regression
• Benchmarking Fast Data Platforms for the Aadhaar Biometric Database
• Large deviations in relay-augmented wireless networks
• Asymptotically optimal control for a multiclass queueing model in the moderate deviation heavy traffic regime
• Matrix Schubert varieties and Gaussian conditional independence models
• Reachability problems for PAMs
• One-parameter statistical model for linear stochastic differential equation with time delay
• Bak-Sneppen Backwards
• Canonical Paths for MCMC: from Art to Science
• Convex hulls of random walks, hyperplane arrangements, and Weyl chambers
• Exact Small Time Equivalent for the density of the Circular Langevin Diffusion
• Sample dependence in the maximum entropy solution to the generalized moment problem
• Affine representations of fractional processes with applications in mathematical finance
• Ends and Tangles
• Entropy of Bernoulli convolutions and uniform exponential growth for linear groups
• Automatic Transcription of Flamenco Singing from Polyphonic Music Recordings
• The algebraic method in tree percolation
• Corpus COFLA: A research corpus for the Computational study of Flamenco Music
• Estimation and inference in generalized additive coefficient models for nonlinear interactions with high-dimensional covariates
• Phase Transitions in a kinetic flocking model of Cucker-Smale type
• Efficient, fast and principled mean-field inference for strongly coupled data
• On the Classes of Interval Graphs of Limited Nesting and Count of Lengths
• Insights into the variability of nucleated amyloid polymerization by a minimalistic model of stochastic protein assembly
• Necessary and sufficient conditions for the existence of $α$-determinantal processes
• Fused kernel-spline smoothing for repeatedly measured outcomes in a generalized partially linear model with functional single index
• Natural Exponential Families With Reduction Functions and Resolution of A Conjecture
• Detection of multiple perturbations in multi-omics biological networks
• On the Ramsey-Turán number with small $s$-independence number
• An $O(\log OPT)$-approximation for covering and packing minor models of $θ_r$
• Constrained Percolation on Z^2
• Inverse Littlewood-Offord problems for Quasi-Norms
• Structured Memory for Neural Turing Machines
• On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities
• Asymptotic Density of Zimin Words
• Detecting and Handling Flash-Crowd Events on Cloud Environments
• Erd\H os-Ko-Rado theorem for $\{0,\pm 1\}$-vectors
• On Operator-Valued Bi-Free Distributions
• A faster subquadratic algorithm for finding outlier correlations
• Nonlinear memory capacity of parallel time-delay reservoir computers in the processing of multidimensional signals
• The Surface Area Deviation of the Euclidean Ball and a Polytope
Like this:
Like Loading...