TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud – An Extended Technical Report

Deciding which vocabulary terms to use when modeling data as Linked Open Data (LOD) is far from trivial. Choosing vocabulary terms that are too general, or terms from vocabularies that are not used by other LOD datasets, is likely to lead to a data representation that is harder for humans to understand and for Linked Data applications to consume. In this technical report, we propose TermPicker: a novel approach for vocabulary reuse that recommends RDF types and properties by exploiting information on how other data providers on the LOD cloud use RDF types and properties to describe their data. To this end, we introduce the notion of schema-level patterns (SLPs). They capture how sets of RDF types are connected via sets of properties within some data collection, e.g., within a dataset on the LOD cloud. TermPicker uses such SLPs to generate a ranked list of vocabulary terms for reuse. The lists of recommended terms are ordered by a ranking model that is computed using the machine learning approach Learning To Rank (L2R). TermPicker is evaluated on recommendation quality, measured using the Mean Average Precision (MAP) and the Mean Reciprocal Rank at the first five positions (MRR@5). Our results show an improvement in recommendation quality of 29% to 36% when using SLPs compared to the previously investigated baselines of recommending solely popular vocabulary terms or terms from the same vocabulary. The overall best results are achieved using SLPs in conjunction with the Learning To Rank algorithm Random Forests.
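
As a rough illustration of the two evaluation measures named in the abstract, the sketch below computes MAP and MRR@5 for ranked lists of recommended terms. It assumes the common textbook definitions (average precision normalized by the number of relevant terms, reciprocal rank of the first relevant term within the top five); the report may define them slightly differently. The function names and the toy term lists are hypothetical and are not taken from TermPicker.

```python
from typing import Callable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list (assumed textbook definition)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, term in enumerate(ranked, start=1):
        if term in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """1/rank of the first relevant term within the top k, else 0."""
    for rank, term in enumerate(ranked[:k], start=1):
        if term in relevant:
            return 1.0 / rank
    return 0.0


def mean_metric(metric: Callable[[Sequence[str], Set[str]], float],
                runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average a per-list metric over all recommendation runs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical toy data: (recommended terms in rank order, terms actually used).
runs = [
    (["foaf:Person", "dc:title", "foaf:knows"], {"foaf:Person", "foaf:knows"}),
    (["dc:creator", "foaf:name", "dc:title"], {"foaf:name"}),
]

map_score = mean_metric(average_precision, runs)
mrr_at_5 = mean_metric(reciprocal_rank_at_k, runs)
print(f"MAP = {map_score:.3f}, MRR@5 = {mrr_at_5:.3f}")
```

MAP rewards lists that place all relevant terms early, while MRR@5 only looks at where the first useful term appears among the top five suggestions.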


Probabilistic Programming with Gaussian Process Memoization

Gaussian Processes (GPs) are widely used tools in statistics, machine learning, robotics, computer vision, and scientific computation. However, despite their popularity, they can be difficult to apply; all but the simplest classification or regression applications require specification and inference over complex covariance functions that do not admit simple analytical posteriors. This paper shows how to embed Gaussian processes in any higher-order probabilistic programming language, using an idiom based on memoization, and demonstrates its utility by implementing and extending classic and state-of-the-art GP applications. The interface to Gaussian processes, called gpmem, takes an arbitrary real-valued computational process as input and returns a statistical emulator that automatically improves as the original process is invoked and its input-output behavior is recorded. The flexibility of gpmem is illustrated via three applications: (i) robust GP regression with hierarchical hyper-parameter learning, (ii) discovering symbolic expressions from time-series data by fully Bayesian structure learning over kernels generated by a stochastic grammar, and (iii) a bandit formulation of Bayesian optimization with automatic inference and action selection. All applications share a single 50-line Python library and require fewer than 20 lines of probabilistic code each.
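
The memoization idiom described above can be sketched in plain Python with NumPy. This is not the gpmem interface from the paper, which lives inside a probabilistic programming language and performs inference over covariance functions. It is a minimal sketch that assumes a fixed squared-exponential kernel with hand-set hyper-parameters and scalar inputs. The names `MemoizedEmulator`, `probe`, and `emulate` are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of scalar inputs."""
    a = np.atleast_1d(np.asarray(a, dtype=float)).reshape(-1, 1)
    b = np.atleast_1d(np.asarray(b, dtype=float)).reshape(1, -1)
    return variance * np.exp(-0.5 * ((a - b) / length_scale) ** 2)


class MemoizedEmulator:
    """Caches calls to f and exposes a GP posterior conditioned on the cache."""

    def __init__(self, f, noise=1e-6):
        self.f = f
        self.noise = noise  # small jitter for numerical stability (assumed)
        self.cache = {}

    def probe(self, x):
        """Call the real process at most once per input and record the result."""
        x = float(x)
        if x not in self.cache:
            self.cache[x] = float(self.f(x))
        return self.cache[x]

    def emulate(self, xs):
        """Posterior mean and variance at xs, given everything probed so far."""
        xs = np.atleast_1d(np.asarray(xs, dtype=float))
        if not self.cache:
            # No observations yet: fall back to the zero-mean prior.
            return np.zeros_like(xs), np.diag(rbf_kernel(xs, xs)).copy()
        x_obs = np.array(list(self.cache.keys()))
        y_obs = np.array(list(self.cache.values()))
        k_oo = rbf_kernel(x_obs, x_obs) + self.noise * np.eye(len(x_obs))
        k_so = rbf_kernel(xs, x_obs)
        mean = k_so @ np.linalg.solve(k_oo, y_obs)
        cov = rbf_kernel(xs, xs) - k_so @ np.linalg.solve(k_oo, k_so.T)
        return mean, np.clip(np.diag(cov), 0.0, None)


calls = []


def expensive(x):
    """Stand-in for a costly real-valued process; logs each real invocation."""
    calls.append(x)
    return np.sin(x)


emu = MemoizedEmulator(expensive)
_, var_before = emu.emulate([1.0])
for x in [0.0, 1.0, 2.0, 1.0]:  # the repeated 1.0 is served from the cache
    emu.probe(x)
mean_after, var_after = emu.emulate([1.0])
```

The emulator's uncertainty at an input shrinks once that input has been probed, which is the sense in which the emulator "improves" as the original process is invoked.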


Unsupervised Feature Construction for Improving Data Representation and Semantics

The feature-based format is the main data representation format used by machine learning algorithms. When the features do not properly describe the initial data, performance starts to degrade. Some algorithms address this problem by internally changing the representation space, but the newly constructed features are rarely comprehensible. We seek to construct, in an unsupervised way, new features that are more appropriate for describing a given dataset and, at the same time, comprehensible for a human user. We propose two algorithms that construct the new features as conjunctions of the initial primitive features or their negations. The generated feature sets have reduced correlations between features and succeed in capturing some of the hidden relations between individuals in a dataset. For example, a feature like $sky \wedge \neg building \wedge panorama$ would be true for non-urban images and is more informative than simple features expressing the presence or the absence of an object. The notion of Pareto optimality is used to evaluate feature sets and to obtain a balance between total correlation and the complexity of the resulting feature set. Statistical hypothesis testing is used to automatically determine the values of the parameters used for constructing a data-dependent feature set. We experimentally show that our approaches construct informative feature sets for multiple datasets.
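
To make the example feature concrete, the sketch below evaluates a conjunction of primitive Boolean features and their negations over a small table. It only shows what such a constructed feature looks like once chosen. It does not implement the two construction algorithms, the Pareto-based evaluation, or the hypothesis testing described in the abstract. The `conjunction` helper and the toy image annotations are hypothetical.

```python
import numpy as np


def conjunction(data, literals):
    """Evaluate a conjunction of literals row by row.

    data:     mapping from feature name to a Boolean column.
    literals: list of (feature_name, wanted_value); wanted_value=False
              means the negated feature appears in the conjunction.
    """
    n_rows = len(next(iter(data.values())))
    result = np.ones(n_rows, dtype=bool)
    for name, wanted in literals:
        column = np.asarray(data[name], dtype=bool)
        result &= column if wanted else ~column
    return result


# Hypothetical annotations for five images (1 = object present).
images = {
    "sky":      [1, 1, 1, 0, 1],
    "building": [0, 1, 0, 0, 1],
    "panorama": [1, 1, 0, 0, 0],
}

# sky AND NOT building AND panorama
non_urban = conjunction(
    images, [("sky", True), ("building", False), ("panorama", True)]
)
print(non_urban.tolist())
```

Only the first image satisfies all three literals, so the constructed feature singles it out where no single primitive feature would.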


Continuous online sequence learning with an unsupervised neural network model

The ability to recognize and predict temporal sequences of sensory inputs is vital for survival in natural environments. Based on many known properties of cortical neurons, a recent study proposed hierarchical temporal memory (HTM) sequence memory as a theoretical framework for sequence learning in the cortex. In this paper, we analyze properties of HTM sequence memory and apply it to various sequence learning and prediction problems. We show that the model is able to continuously learn a large number of variable-order temporal sequences using an unsupervised Hebbian-like learning rule. The sparse temporal codes formed by the model can robustly handle branching temporal sequences by maintaining multiple predictions until there is sufficient disambiguating evidence. We compare the HTM sequence memory with other sequence learning algorithms, including the autoregressive integrated moving average (ARIMA) model and long short-term memory (LSTM), on sequence prediction problems with both artificial and real-world data. The HTM model not only achieves comparable or better accuracy than state-of-the-art algorithms, but also exhibits a set of properties that are critical for sequence learning. These properties include continuous online learning, the ability to handle multiple predictions and branching sequences, robustness to sensor noise and fault tolerance, and good performance without task-specific hyper-parameter tuning. Therefore, the HTM sequence memory not only advances our understanding of how the brain may solve the sequence learning problem, but is also applicable to a wide range of real-world problems such as discrete and continuous sequence prediction, anomaly detection, and sequence classification.


The Topology of Equivariant Hilbert Schemes

A Survey of Available Corpora for Building Data-Driven Dialogue Systems

Denoising Bodies to Titles: Retrieving Similar Questions with Recurrent Convolutional Models

On cluster properties of classical ferromagnets in an external magnetic field

Computing a Relevant Set of Nonbinary Maximum Acyclic Agreement Forests

Synthesis of recurrent neural networks for dynamical system simulation

A Central Limit Theorem for the Optimal Alignments Score in Multiple Random Words

Oracle inequalities for ranking and U-processes with Lasso penalty

k-connected degree sequences

3-connected graphs and their degree sequences

Towards automating the generation of derivative nouns in Sanskrit by simulating Panini

Fast computation of all maximum acyclic agreement forests for two rooted binary phylogenetic trees

Curves in $\mathbb{R}^4$ and two-rich points

The Sorted Effects Method: Discovering Heterogeneous Effects Beyond Their Averages

Summary Statistics in Approximate Bayesian Computation

Amplifiers for the Moran Process

Multivariate discrete copulas, with applications in probabilistic weather forecasting

The Intrinsic Geometry of Some Random Manifolds

Deep-Spying: Spying using Smartwatch and Deep Learning

Classification of weak multi-view signals by sharing factors in a mixture of Bayesian group factor analyzers

Congruences on the Number of Restricted $m$-ary Partitions

Collision-free speed model for pedestrian dynamics

Local universality of the number of zeros of random trigonometric polynomials with continuous coefficients

Kauffman’s adjacent possible in word order evolution

Robust heavy-traffic approximations for service systems facing overdispersed demand

Improving Latency in a Signal Processing System on the Epiphany Architecture

A thermodynamical approach towards multi-criteria decision making (MCDM)

Combining low-dimensional ensemble postprocessing with reordering methods

Boolean lattices: Ramsey properties and embeddings

Rational $q\times q$ Carathéodory Functions and Central Non-negative Hermitian Measures

Bayesian Covariance Modelling of Large Tensor-Variate Data Sets & Inverse Non-parametric Learning of the Unknown Model Parameter Vector

The squared symmetric FastICA estimator

A generalization of the Erdős-Ko-Rado Theorem

An Empirical Comparison of Neural Architectures for Reinforcement Learning in Partially Observable Environments

Spin one $p$-spin glass: the Gardner transition

Dominant poles and tail asymptotics in the critical Gaussian many-sources regime

Blind, Greedy, and Random: Ordinal Approximation Algorithms for Graph Problems

When the extension property does not hold

Deep Active Object Recognition by Joint Label and Action Prediction

Localization in non-Hermitian chains with excitatory/inhibitory connections

Inferring the Causal Direction Privately

New Partial Geometric Difference Sets and Partial Geometric Difference Families

Differential Evolution with Event-Triggered Impulsive Control Scheme

ADMM for the SDP relaxation of the QAP

Parametric inference for proportional (reverse) hazard rate models with nomination sampling

A Method of Passage-Based Document Retrieval in Question Answering System

A hierarchical kinetic theory of birth, death, and fission in age-structured interacting populations

Second quantization approaches for stochastic age-structured birth-death processes

The read/write protocol complex is collapsible

Ranking genetic factors related to age-related macular degeneration by variable selection confidence sets

Non-Local Probes Do Not Help with Graph Problems

Signal Representations on Graphs: Tools and Applications

Long-range Response in AC Electricity Grids