Knowledge graphs have recently become the state-of-the-art tool for representing the diverse and complex knowledge of the world. Examples include the proprietary knowledge graphs of companies such as Google, Facebook, IBM, or Microsoft, but also freely available ones such as YAGO, DBpedia, and Wikidata. A distinguishing feature of Wikidata is that the knowledge is collaboratively edited and curated. While this greatly enhances the scope of Wikidata, it also makes it impossible for a single individual to grasp complex connections between properties or understand the global impact of edits in the graph. We apply Formal Concept Analysis to efficiently identify comprehensible implications that are implicitly present in the data. Although the complex structure of data modelling in Wikidata is not amenable to a direct approach, we overcome this limitation by extracting contextual representations of parts of Wikidata in a systematic fashion. We demonstrate the practical feasibility of our approach through several experiments and show that the results may lead to the discovery of interesting implicational knowledge. Besides providing a method for obtaining large real-world data sets for FCA, we sketch potential applications in offering semantic assistance for editing and curating Wikidata.
In this paper we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context representation suffers from sparseness, while in majority of cases the morphosyntactic features of a word bring enough information to resolve lemma ambiguities while keeping the context representation dense and more practical for machine learning systems. Additionally, we study two different data augmentation methods utilizing autoencoder training and morphological transducers especially beneficial for low resource languages. We evaluate our lemmatizer on 52 different languages and 76 different treebanks, showing that our system outperforms all latest baseline systems. Compared to the best overall baseline, UDPipe Future, our system outperforms it on 60 out of 76 treebanks reducing errors on average by 18% relative. The lemmatizer together with all trained models is made available as a part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive empirical performance of GANs has lead to the rise of many GAN-VAE hybrids, with the hopes of GAN level performance and additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the $f$-GAN and WAE models and make two main discoveries. First, we find that the $f$-GAN objective is equivalent to an autoencoder-like objective, which has close links, and is in some cases equivalent to the WAE objective – we refer to this as the $f$-WAE. This equivalence allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models (WAE and $f$-WAE), which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the $f$-WAE objective is related to other statistical quantities such as the $f$-divergence and in particular, upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) OT solvers to minimize $f$-WAE. Our findings thus recommend the $f$-WAE as a tighter alternative to WAE, comment on generalization abilities and make a step towards unifying these models.
Prior background knowledge is essential for human reading and understanding. In this work, we investigate how to leverage external knowledge to improve question answering. We primarily focus on multiple-choice question answering tasks that require external knowledge to answer questions. We investigate the effects of utilizing external in-domain multiple-choice question answering datasets and enriching the reference corpus by external out-domain corpora (i.e., Wikipedia articles). Experimental results demonstrate the effectiveness of external knowledge on two challenging multiple-choice question answering tasks: ARC and OpenBookQA.
Recent studies have shown the latency and energy consumption of deep neural networks can be significantly improved by splitting the network between the mobile device and cloud. This paper introduces a new deep learning architecture, called BottleNet, for reducing the feature size needed to be sent to the cloud. Furthermore, we propose a training method for compensating for the potential accuracy loss due to the lossy compression of features before transmitting them to the cloud. BottleNet achieves on average 30x improvement in end-to-end latency and 40x improvement in mobile energy consumption compared to the cloud-only approach with negligible accuracy loss.
Machine learning systems can often achieve high performance on a test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. Based on an analysis of the task, we hypothesize three fallible syntactic heuristics that NLI models are likely to adopt: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including the state-of-the-art model BERT, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
Recently, Graph Neural Networks (GNNs) are trending in the machine learning community as a family of architectures that specializes in capturing the features of graph-related datasets, such as those pertaining to social networks and chemical structures. Unlike for other families of the networks, the representation power of GNNs has much room for improvement, and many graph networks to date suffer from the problem of underfitting. In this paper we will introduce a Graph Warp Module, a supernode-based auxiliary network module that can be attached to a wide variety of existing GNNs in order to improve the representation power of the original networks. Through extensive experiments on molecular graph datasets, we will show that our GWM indeed alleviates the underfitting problem for various existing networks, and that it can even help create a network with the state-of-the-art generalization performance.
Causal effect identification considers whether an interventional probability distribution can be uniquely determined without parametric assumptions from measured source distributions and structural knowledge on the generating system. While complete graphical criteria and procedures exist for many identification problems, there are still challenging but important extensions that have not been considered in the literature. To tackle these new settings, we present a search algorithm directly over the rules of do-calculus. Due to generality of do-calculus, the search is capable of taking more advanced data-generating mechanisms into account along with an arbitrary type of both observational and experimental source distributions. The search is enhanced via a heuristic and search space reduction techniques. The approach, called do-search, is provably sound, and it is complete with respect to identifiability problems that have been shown to be completely characterized by do-calculus. When extended with additional rules, the search is capable of handling missing data problems as well. With the versatile search, we are able to approach new problems such as combined transportability and selection bias, or multiple sources of selection bias. We also perform a systematic analysis of bivariate missing data problems and study causal inference under case-control design.
We introduce Act2Vec, a general framework for learning context-based action representation for Reinforcement Learning. Representing actions in a vector space help reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these for augmenting state representations as well as improving function approximation of Q-values. We visualize and test action embeddings in three domains including a drawing task, a high dimensional navigation task, and the large action space domain of StarCraft II.
In this paper, we investigate temporal clusters of extremes defined as subsequent exceedances of high thresholds in a stationary time series. Two meaningful features of these clusters are the probability distribution of the cluster size and the ordinal patterns within a cluster. The latter have been introduced in order to handle data sets with several thousand data points appearing in medicine, biology, finance and computer science. Since these patterns take only the ordinal structure of consecutive data points into account the method is robust under monotone transformations and measurement errors. We verify the existence of the corresponding limit distributions in the framework of regularly varying time series, develop non-parametric estimators and show their asymptotic normality under appropriate mixing conditions. The performance of the estimators is demonstrated in a simulated example and a real data application to discharge data of the river Rhine.
Computer models, also known as simulators, can be computationally expensive to run, and it is is for this reason that statistical surrogates, also known as emulators, have been developed for such computer models. Gaussian processes are commonly used to emulate deterministic simulators, and they have also been extended to emulate stochastic simulators that have input-dependent noise. Any statistical model, including an emulator, should be validated before being used. We discuss how current methods of validating Gaussian process emulators of deterministic models are insufficient when applied to Gaussian process emulators of stochastic computer models. We develop tools for diagnosing problems in such emulators, based on independently validating the mean and variance predictions using out-of-sample, replicated, simulator runs. To illustrate these ideas, we apply them to a simple toy example, and a real example of a building performance simulator.
The area of declarative data analytics explores the application of the declarative paradigm on data science and machine learning. It proposes declarative languages for expressing data analysis tasks and develops systems which optimize programs written in those languages. The execution engine can be either centralized or distributed, as the declarative paradigm advocates independence from particular physical implementations. The survey explores a wide range of declarative data analysis frameworks by examining both the programming model and the optimization techniques used, in order to provide conclusions on the current state of the art in the area and identify open challenges.
While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014.
Cloud computing delivers value to users by facilitating their access to computing capacity in periods when their need arises. An approach is to provide both on-demand and spot services on shared servers. The former allows users to access servers on demand at a fixed price and users occupy different periods of servers. The latter allows users to bid for the remaining unoccupied periods via dynamic pricing; however, without appropriate design, such periods may be arbitrarily small since on-demand users arrive randomly. This is also the current service model adopted by Amazon Elastic Cloud Compute. In this paper, we provide the first integral framework for sharing the time of servers between on-demand and spot services while optimally pricing spot instances. It guarantees that on-demand users can get served quickly while spot users can stably utilize servers for a properly long period once accepted, which is a key feature to make both on-demand and spot services accessible. Simulation results show that, by complementing the on-demand market with a spot market, a cloud provider can improve revenue by up to 464.7%. The framework is designed under assumptions which are met in real environments. It is a new tool that cloud operators can use to quantify the advantage of a hybrid spot and on-demand service, eventually making the case for operating such service model in their own infrastructures.
The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from the data sets. The initial definition of our distance is based on geometrical notions of certain sets of distributions. We show that this distance can be computed in cubic time and that it has several intuitive properties. We also show that this distance is the unique Mahalanobis distance satisfying certain assumptions. We also demonstrate that if we are dealing with binary data sets, then the distance can be represented naturally by certain parity functions, and that it can be evaluated in linear time. Our empirical tests with real world data show that the distance works well.
As the field of recommender systems has developed, authors have used a myriad of notations for describing the mathematical workings of recommendation algorithms. These notations ap-pear in research papers, books, lecture notes, blog posts, and software documentation. The dis-ciplinary diversity of the field has not contributed to consistency in notation; scholars whose home base is in information retrieval have different habits and expectations than those in ma-chine learning or human-computer interaction. In the course of years of teaching and research on recommender systems, we have seen the val-ue in adopting a consistent notation across our work. This has been particularly highlighted in our development of the Recommender Systems MOOC on Coursera (Konstan et al. 2015), as we need to explain a wide variety of algorithms and our learners are not well-served by changing notation between algorithms. In this paper, we describe the notation we have adopted in our work, along with its justification and some discussion of considered alternatives. We present this in hope that it will be useful to others writing and teaching about recommender systems. This notation has served us well for some time now, in research, online education, and traditional classroom instruction. We feel it is ready for broad use.
Artificial Intelligence has an important place in the scientific community as a result of its successful outputs in terms of different fields. In time, the field of Artificial Intelligence has been divided into many sub-fields because of increasing number of different solution approaches, methods, and techniques. Machine Learning has the most remarkable role with its functions to learn from samples from the environment. On the other hand, intelligent optimization done by inspiring from nature and swarms had its own unique scientific literature, with effective solutions provided for optimization problems from different fields. Because intelligent optimization can be applied in different fields effectively, this study aims to provide a general discussion on multidisciplinary effects of Artificial Intelligence by considering its optimization oriented solutions. The study briefly focuses on background of the intelligent optimization briefly and then gives application examples of intelligent optimization from a multidisciplinary perspective.
Empirical studies show that gradient based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work explains in theory that gradient-based methods with proper random initialization can find the global minima of the training loss in over-parameterized DNNs, it does not explain the good generalization performance of the gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further, and prove that under certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to a non-vacuous algorithmic-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that explains the good generalization performance of over-parameterized deep neural networks learned by gradient descent.
We present a novel semantic framework for modeling temporal relations and event durations that maps pairs of events to real-valued scales for the purpose of constructing document-level event timelines. We use this framework to construct the largest temporal relations dataset to date, covering the entirety of the Universal Dependencies English Web Treebank. We use this dataset to train models for jointly predicting fine-grained temporal relations and event durations. We report strong results on our data and show the efficacy of a transfer-learning approach for predicting standard, categorical TimeML relations.