This paper describes a free/open-source implementation of the light sliding-window (LSW) part-of-speech tagger for the Apertium free/open-source machine translation platform. Firstly, the mechanism and training process of the tagger are reviewed, and a new method for incorporating linguistic rules is proposed. Secondly, experiments are conducted to compare the performances of the tagger under different window settings, with or without Apertium-style ‘forbid’ rules, with or without Constraint Grammar, and also with respect to the traditional HMM tagger in Apertium.
In semantic technologies, the shared common understanding of the structure of information among artifacts (people or software agents) can be realized by building an ontology. To do this, it is imperative for an ontology builder to answer several questions: a) what are the main components of an ontology? b) How an ontology look likes and how it works? c) Verify if it is required to consider reusing existing ontologies or not? c) What is the complexity of the ontology to be developed? d) What are the principles of ontology design and development? e) How to evaluate an ontology? This paper answers all the key questions above. The aim of this paper is to present a set of guiding principles to help ontology developers and also inexperienced users to answer such questions.
We present a powerful general framework for designing data-dependent optimization algorithms, building upon and unifying recent techniques in adaptive regularization, optimistic gradient predictions, and problem-dependent randomization. We first present a series of new regret guarantees that hold at any time and under very minimal assumptions, and then show how different relaxations recover existing algorithms, both basic as well as more recent sophisticated ones. Finally, we show how combining adaptivity, optimism, and problem-dependent randomization can guide the design of algorithms that benefit from more favorable guarantees than recent state-of-the-art methods.
Algorithmic statistics considers the following problem: given a binary string (e.g., some experimental data), find a ‘good’ explanation of this data. It uses algorithmic information theory to define formally what is a good explanation. In this paper we extend this framework in two directions. First, the explanations are not only interesting in themselves but also used for prediction: we want to know what kind of data we may reasonably expect in similar situations (repeating the same experiment). We show that some kind of hierarchy can be constructed both in terms of algorithmic statistics and using the notion of a priori probability, and these two approaches turn out to be equivalent. Second, a more realistic approach that goes back to machine learning theory, assumes that we have not a single data string but some set of ‘positive examples’ that all belong to some unknown set , a property that we want to learn. We want this set to contain all positive examples and to be as small and simple as possible. We show how algorithmic statistic can be extended to cover this situation.
This paper reports on application of bootstrap nonlinear regression method to a design of an experiment dataset with fewer experimental runs. Design with desired properties was augmented and verified using graphical techniques. The augmented design with the desired properties benefited the accuracy of the approximated function used. The computation power of R-language and SAS for computing nonlinear function and bootstrap was also compared.
This paper investigates the mining of class association rules with rough set approach. In data mining, an association occurs between two set of elements when one element set happen together with another. A class association rule set (CARs) is a subset of association rules with classes specified as their consequences. We present an efficient algorithm for mining the finest class rule set inspired form Apriori algorithm, where the support and confidence are computed based on the elementary set of lower approximation included in the property of rough set theory. Our proposed approach has been shown very effective, where the rough set approach for class association discovery is much simpler than the classic association method.
The vocabulary mismatch problem is one of the important challenges facing traditional keyword-based Information Retrieval Systems. The aim of query expansion (QE) is to reduce this query-document mismatch by adding related or synonymous words or phrases to the query. Several existing query expansion algorithms have proved their merit, but they are not uniformly beneficial for all kinds of queries. Our long-term goal is to formulate methods for applying QE techniques tailored to individual queries, rather than applying the same general QE method to all queries. As an initial step, we have proposed a taxonomy of query classes (from a QE perspective) in this report. We have discussed the properties of each query class with examples. We have also discussed some QE strategies that might be effective for each query category. In future work, we intend to test the proposed techniques using standard datasets, and to explore automatic query categorisation methods.
The problem of principle component analysis (PCA) is traditionally solved by spectral or algebraic methods. We show how PCA could be formulated as a sequence of {\it convex} optimization problems. This gives rise to a new efficient method for computing the PCA based on recent advances in stochastic methods for convex optimization. In particular, we present running times that improve over the current state-of-the-art.
The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area.
Stochastic Gradient Descent (SGD) is arguably the most popular of the machine learning methods applied to training deep neural networks (DNN) today. It has recently been demonstrated that SGD can be statistically biased so that certain elements of the training set are learned more rapidly than others. In this article, we place SGD into a feedback loop whereby the probability of selection is proportional to error magnitude. This provides a novelty-driven oddball SGD process that learns more rapidly than traditional SGD by prioritising those elements of the training set with the largest novelty (error). In our DNN example, oddball SGD trains some 50x faster than regular SGD.
In this paper, we consider an estimation problem of the regression coefficients in multiple regression models with several unknown change-points. Under some realistic assumptions, we propose a class of estimators which includes as a special cases shrinkage estimators (SEs) as well as the unrestricted estimator (UE) and the restricted estimator (RE). We also derive a more general condition for the SEs to dominate the UE. To this end, we generalize some identities for the evaluation of the bias and risk functions of shrinkage-type estimators. As illustrative example, our method is applied to the ‘gross domestic product’ data set of 10 countries whose USA, Canada, UK, France and Germany. The simulation results corroborate our theoretical findings.
Classification is an important tool with many useful applications. Among the many classification methods, Fisher’s Linear Discriminant Analysis (LDA) is a traditional model-based approach which makes use of the covariance information. However, in the high-dimensional, low-sample size setting, LDA cannot be directly deployed because the sample covariance is not invertible. While there are modern methods designed to deal with high-dimensional data, they may not fully use the covariance information as LDA does. Hence in some situations, it is still desirable to use a model-based method such as LDA for classification. This article exploits the potential of LDA in more complicated data settings. In many real applications, it is costly to manually place labels on observations; hence it is often that only a small portion of labeled data is available while a large number of observations are left without a label. It is a great challenge to obtain good classification performance through the labeled data alone, especially when the dimension is greater than the size of the labeled data. In order to overcome this issue, we propose a semi-supervised sparse LDA classifier to take advantage of the seemingly useless unlabeled data. They provide additional information which helps to boost the classification performance in some situations. A direct estimation method is used to reconstruct LDA and achieve the sparsity; meanwhile we employ the difference-convex algorithm to handle the non-convex loss function associated with the unlabeled data. Theoretical properties of the proposed classifier are studied. Our simulated examples help to understand when and how the information extracted from the unlabeled data can be useful. A real data example further illustrates the usefulness of the proposed method.
Streaming interactive proofs (SIPs) are a framework to reason about outsourced computation, where a data owner (the verifier) outsources a computation to the cloud (the prover), but wishes to verify the correctness of the solution provided by the cloud service. In this paper we present streaming interactive proofs for problems in data analysis. We present protocols for clustering and shape fitting problems, as well as an improved protocol for rectangular matrix multiplication. The latter can in turn be used to verify eigenvectors of a (streamed) matrix. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space. For special cases (when optimality certificates can be verified easily), we present constant round protocols with similar costs. For rectangular matrix multiplication and eigenvector verification, our protocols work in the more restricted annotated data streaming model, and use sublinear (but not polylogarithmic) communication.
Knowledge representation is a major topic in AI, and many studies attempt to represent entities and relations of knowledge base in a continuous vector space. Among these attempts, translation-based methods build entity and relation vectors by minimizing the translation loss from a head entity to a tail one. In spite of the success of these methods, translation-based methods also suffer from the oversimplified loss metric, and are not competitive enough to model various and complex entities/relations in knowledge bases. To address this issue, we propose \textbf{TransA}, an adaptive metric approach for embedding, utilizing the metric learning ideas to provide a more flexible embedding method. Experiments are conducted on the benchmark datasets and our proposed method makes significant and consistent improvements over the state-of-the-art baselines.
Recently, knowledge graph embedding, which projects symbolic entities and relations into continuous vector space, has become a new, hot topic in artificial intelligence. This paper addresses a new issue of \textbf{multiple relation semantics} that a relation may have multiple meanings revealed by the entity pairs associated with the corresponding triples, and proposes a novel Gaussian mixture model for embedding, \textbf{TransG}. The new model can discover latent semantics for a relation and leverage a mixture of relation component vectors for embedding a fact triple. To the best of our knowledge, this is the first generative model for knowledge graph embedding, which is able to deal with multiple relation semantics. Extensive experiments show that the proposed model achieves substantial improvements against the state-of-the-art baselines.