Fuzzy Description Logics (DLs) provide a means for representing vague knowledge about an application domain. In this paper, we study fuzzy extensions of conjunctive queries (CQs) over the DL $\mathcal{SROIQ}$ based on finite chains of degrees of truth. To answer such queries, we extend a well-known technique that reduces the fuzzy ontology to a classical one, and use classical DL reasoners as a black box. We improve the complexity of previous reduction techniques for finitely valued fuzzy DLs, which allows us to prove tight complexity results for answering certain kinds of fuzzy CQs. We conclude with an experimental evaluation of a prototype implementation, showing the feasibility of our approach.
We propose the Artificial Continuous Prediction Market (ACPM) as a means to predict a continuous real value by integrating a range of data sources and aggregating the results of different machine learning (ML) algorithms. ACPM adapts the concept of the (physical) prediction market to address the prediction of real values instead of discrete events. Each ACPM participant has a data source, an ML algorithm and a local decision-making procedure that determines how much to bid on which value. The contributions of ACPM are: (i) adaptation to changes in data quality through learning in (a) the market, which weights each participant to adjust its influence on the market prediction, and (b) the participants, which use a Q-learning based trading strategy to incorporate the market prediction into their subsequent predictions; and (ii) resilience to a changing population of low- and high-performing participants. We demonstrate the effectiveness of ACPM by applying it to an influenza-like illness data set, showing that ACPM outperforms a range of well-known regression models and is resilient to variation in data source quality.
We describe FactorBase, a new SQL-based framework that leverages a relational database management system to support multi-relational model discovery. A multi-relational statistical model provides an integrated analysis of the heterogeneous and interdependent data resources in the database. We adopt the BayesStore design philosophy: statistical models are stored and managed as first-class citizens inside a database. Whereas previous systems like BayesStore support multi-relational inference, FactorBase supports multi-relational learning. A case study on six benchmark databases evaluates how our system supports a challenging machine learning application, namely learning a first-order Bayesian network model for an entire database. Model learning in this setting has to examine a large number of potential statistical associations across data tables. Our implementation shows how the SQL constructs in FactorBase facilitate the fast, modular, and reliable development of highly scalable model learning systems.
The past century has seen a steady increase in the need to estimate and predict complex systems and to make (possibly critical) decisions with limited information. Although computers have made possible the numerical evaluation of sophisticated statistical models, these models are still designed \emph{by humans} because there is currently no known recipe or algorithm for dividing the design of a statistical model into a sequence of arithmetic operations. Indeed, enabling computers to \emph{think} as \emph{humans} do when faced with uncertainty is challenging in several major ways: (1) Finding optimal statistical models remains to be formulated as a well-posed problem when information on the system of interest is incomplete and comes in the form of a complex combination of sample data, partial knowledge of constitutive relations and a limited description of the distribution of input random variables. (2) The space of admissible scenarios, along with the space of relevant information, assumptions, and/or beliefs, tends to be infinite dimensional, whereas calculus on a computer is necessarily discrete and finite. To this end, this paper explores the foundations of a rigorous framework for the scientific computation of optimal statistical estimators/models and reviews their connections with Decision Theory, Machine Learning, Bayesian Inference, Stochastic Optimization, Robust Optimization, Optimal Uncertainty Quantification and Information Based Complexity.
Distributed representations of words as real-valued vectors in a relatively low-dimensional space aim at extracting syntactic and semantic features from large text corpora. A recently introduced neural network, named word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b), was shown to encode semantic information in the direction of the word vectors. In this brief report, it is proposed to use the length of the vectors, together with the term frequency, as a measure of word significance in a corpus. Experimental evidence using a domain-specific corpus of abstracts is presented to support this proposal. A useful visualization technique for text corpora emerges, where words are mapped onto a two-dimensional plane and automatically ranked by significance.
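The pairing of vector length with term frequency can be illustrated with a minimal sketch. It assumes the word vectors are already available as plain lists of floats; the function names and the choice to sort by length are hypothetical and are not taken from the report itself.

```python
import math
from collections import Counter


def vector_length(vec):
    """Euclidean (L2) length of a word vector."""
    return math.sqrt(sum(x * x for x in vec))


def significance_points(embeddings, tokens):
    """Return (word, term_frequency, vector_length) for each embedded word
    that occurs in the corpus.

    Each tuple is one point on a frequency-versus-length plane. Sorting by
    length (longest first) is an illustrative assumption, not the report's
    ranking rule.
    """
    tf = Counter(tokens)
    rows = [
        (word, tf[word], vector_length(vec))
        for word, vec in embeddings.items()
        if tf[word] > 0  # skip words that never occur in this corpus
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy, made-up vectors purely for illustration.
toy_embeddings = {"quark": [3.0, 4.0], "the": [1.0, 0.0]}
toy_tokens = ["the", "quark", "the"]
points = significance_points(toy_embeddings, toy_tokens)
```

Here `points` holds one (word, frequency, length) triple per word, which is the raw material for a two-dimensional plot of the kind the report mentions.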
We present improved methods of using structured SVMs in a large-scale hierarchical classification problem, that is, when labels are leaves, or sets of leaves, in a tree or a DAG. We examine the need to normalize both the regularization and the margin and show how doing so significantly improves performance, including achieving state-of-the-art results in settings where unnormalized structured SVMs do not perform better than flat models. We also describe a further extension of hierarchical SVMs that highlights the connection between hierarchical SVMs and matrix factorization models.
This paper presents a new Markov chain Monte Carlo method to sample from the posterior distribution of conjugate mixture models. This algorithm relies on a flexible split-merge procedure built using the particle Gibbs sampler. Contrary to available split-merge procedures, the resulting so-called Particle Gibbs Split-Merge sampler does not require the computation of a complex acceptance ratio, is simple to implement using existing sequential Monte Carlo libraries and can be parallelized. We investigate its performance experimentally on synthetic problems as well as on geolocation and cancer genomics data. In all these examples, the particle Gibbs split-merge sampler outperforms state-of-the-art split-merge methods by up to an order of magnitude for a fixed computational complexity.
This document is additional material to our previous study comparing several strategies for variable subset selection. Our recommended approach was to fit the full model with all the candidate variables and best possible prior information, and perform the variable selection using the projection predictive framework. Here we give an example of performing such an analysis, using Stan for fitting the model, and R for the variable selection.
Large knowledge graphs increasingly add value to various applications that require machines to recognize and understand queries and their semantics, as in search or question answering systems. Latent variable models have increasingly gained attention for the statistical modeling of knowledge graphs, showing promising results in tasks related to knowledge graph completion and cleaning. Besides storing facts about the world, schema-based knowledge graphs are backed by rich semantic descriptions of entities and relation-types that allow machines to understand the notion of things and their semantic relationships. In this work, we study how type-constraints can generally support the statistical modeling with latent variable models. More precisely, we integrated prior knowledge in the form of type-constraints into various state-of-the-art latent variable approaches. Our experimental results show that prior knowledge on relation-types significantly improves these models, by up to 77% in link-prediction tasks. The achieved improvements are especially prominent when a low model complexity is enforced, a crucial requirement when these models are applied to very large datasets. Unfortunately, type-constraints are neither always available nor always complete; e.g., they can become fuzzy when entities lack proper typing. We also show that in these cases, it can be beneficial to apply a local closed-world assumption that approximates the semantics of relation-types based on observations made in the data.
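One way to read the local closed-world idea is sketched below: when no schema is available, the admissible subjects and objects of a relation-type are approximated by the entities actually observed in those positions. This is an illustrative assumption about how such constraints could be derived and applied to candidate filtering; the function names are hypothetical, and the latent variable models themselves are not shown.

```python
def infer_type_constraints(triples):
    """Approximate domain and range of each relation-type from observed
    (subject, predicate, object) triples.

    Assumption: an entity is admissible as subject (object) of a relation
    only if it has been seen in that position at least once.
    """
    domain, range_ = {}, {}
    for subj, pred, obj in triples:
        domain.setdefault(pred, set()).add(subj)
        range_.setdefault(pred, set()).add(obj)
    return domain, range_


def candidate_objects(relation, entities, range_):
    """Keep only entities that satisfy the (approximated) range constraint.

    If nothing is known about the relation, fall back to all entities.
    """
    allowed = range_.get(relation)
    if allowed is None:
        return list(entities)
    return [e for e in entities if e in allowed]


# Toy triples, made up for illustration.
observed = [
    ("alice", "bornIn", "paris"),
    ("bob", "bornIn", "rome"),
    ("paris", "capitalOf", "france"),
]
domain, range_ = infer_type_constraints(observed)
candidates = candidate_objects(
    "bornIn", ["alice", "paris", "rome", "france"], range_
)
```

In a link-prediction setting, a model would then score only `candidates` for the `bornIn` relation instead of every entity in the graph, which is the sense in which the constraint narrows the prediction problem.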
Giving users a simple and well-organized web search result has been a topic of active Information Retrieval (IR) research. Irrespective of how short or ambiguous a query is, a user always wants the desired result on the first display of an IR system. Clustering the results of an IR system can provide a way to fulfill the actual information need of a user. In this paper, an approach to cluster the results of an IR system is presented. The approach is a combination of heuristics and the k-means technique using cosine similarity. Our heuristic approach detects the initial value of k for creating initial centroids. This eliminates the problem of external specification of the value k, which may lead to unwanted results if wrongly specified. The centroids created in this way are more specific and meaningful in the context of web search results. Another advantage of the proposed method is that it removes the objective function of k-means, which makes cluster sizes the same. The end result of the proposed approach consists of different clusters of documents having different sizes.
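The clustering step can be sketched as k-means driven by cosine similarity. The heuristic that chooses k and the initial centroids is not described in the abstract, so the sketch below simply takes `init_centroids` as a given, hypothetical input; documents are assumed to be non-zero term-weight vectors of equal length. It is a generic cosine k-means, not the paper's exact procedure.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def cosine_kmeans(docs, init_centroids, max_iter=20):
    """Assign each document to its most similar centroid, then recompute
    each centroid as the normalized mean of its members, until the
    assignment stops changing or max_iter is reached."""
    docs = [normalize(d) for d in docs]
    centroids = [normalize(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(d, centroids[j]))
            for d in docs
        ]
        if new_labels == labels:
            break  # converged: no document changed cluster
        labels = new_labels
        for j in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels, centroids


# Toy document vectors, made up for illustration.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
toy_labels, toy_centroids = cosine_kmeans(toy_docs, [[1.0, 0.0], [0.0, 1.0]])
```

With these toy vectors the two clusters end up with two and three documents respectively, so nothing in the procedure forces clusters to have equal size.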