The current state-of-the-art Entity Linking (EL) systems are geared towards corpora that are as heterogeneous as the Web, and therefore perform sub-optimally on domain-specific corpora. A key open problem is how to construct effective EL systems for specific domains, as knowledge of the local context should in principle increase, rather than decrease, effectiveness. In this paper we propose the hybrid use of simple specialist linkers in combination with an existing generalist system to address this problem. Our main findings are the following. First, we construct a new reusable benchmark for EL on a corpus of domain-specific conversations. Second, we test the performance of a range of approaches under the same conditions, and show that specialist linkers obtain high precision in isolation, and high recall when combined with generalist linkers. Hence, we can effectively exploit local context and get the best of both worlds.
Domain adaptation problem arises in a variety of applications where the training set (\textit{source} domain) and testing set (\textit{target} domain) follow different distributions. The difficulty of such learning problem lies in how to bridge the gap between the source distribution and target distribution. In this paper, we give an formal analysis of feature learning algorithms for domain adaptation with linear classifiers. Our analysis shows that in order to achieve good adaptation performance, the second moments of source domain distribution and target domain distribution should be similar. Based on such a result, a new linear feature learning algorithm for domain adaptation is designed and proposed. Furthermore, the new algorithm is extended to have multiple layers, resulting in becoming another linear feature learning algorithm. The newly introduced method is effective for the domain adaptation tasks on Amazon review dataset and spam dataset from ECML/PKDD 2006 discovery challenge.
This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.
The downfall of many supervised learning algorithms, such as neural networks, is the inherent need for a large amount of training data. Although there is a lot of buzz about big data, there is still the problem of doing classification from a small dataset. Other methods such as support vector machines, although capable of dealing with few samples, are inherently binary classifiers, and are in need of learning strategies such as One vs All in the case of multi-classification. In the presence of a large number of classes this can become problematic. In this paper we present, a novel approach to supervised learning through the method of clustering. Unlike traditional methods such as K-Means, Gravitational Clustering does not require the initial number of clusters, and automatically builds the clusters, individual samples can be arbitrarily weighted and it requires only few samples while staying resilient to over-fitting.
We propose HAMSI, a provably convergent incremental algorithm for solving large-scale partially separable optimization problems that frequently emerge in machine learning and inferential statistics. The algorithm is based on a local quadratic approximation and hence allows incorporating a second order curvature information to speed-up the convergence. Furthermore, HAMSI needs almost no tuning, and it is scalable as well as easily parallelizable. In large-scale simulation studies with the MovieLens datasets, we illustrate that the method is superior to a state-of-the-art distributed stochastic gradient descent method in terms of convergence behavior. This performance gain comes at the expense of using memory that scales only linearly with the total size of the optimization variables. We conclude that HAMSI may be considered as a viable alternative in many scenarios, where first order methods based on variants of stochastic gradient descent are applicable.
The main aim of this paper is to establish the applicability of a broad class of random measures, that includes the gamma process, for mixed membership modelling. We use completely random measures~(CRM) and hierarchical CRM to define a prior for Poisson processes. We derive the marginal distribution of the resultant point process, when the underlying CRM is marginalized out. Using well known properties unique to Poisson processes, we were able to derive an exact approach for instantiating a Poisson process with a hierarchical CRM prior. Furthermore, we derive Gibbs sampling strategies for hierarchical CRM models based on Chinese restaurant franchise sampling scheme. As an example, we present the sum of generalized gamma process (SGGP), and show its application in topic-modelling. We show that one can determine the power-law behaviour of the topics and words in a Bayesian fashion, by defining a prior on the parameters of SGGP.
Evolution of visual object recognition architectures based on Convolutional Neural Networks & Convolutional Deep Belief Networks paradigms has revolutionized artificial Vision Science. These architectures extract & learn the real world hierarchical visual features utilizing supervised & unsupervised learning approaches respectively. Both the approaches yet cannot scale up realistically to provide recognition for a very large number of objects as high as 10K. We propose a two level hierarchical deep learning architecture inspired by divide & conquer principle that decomposes the large scale recognition architecture into root & leaf level model architectures. Each of the root & leaf level models is trained exclusively to provide superior results than possible by any 1-level deep learning architecture prevalent today. The proposed architecture classifies objects in two steps. In the first step the root level model classifies the object in a high level category. In the second step, the leaf level recognition model for the recognized high level category is selected among all the leaf models. This leaf level model is presented with the same input object image which classifies it in a specific category. Also we propose a blend of leaf level models trained with either supervised or unsupervised learning approaches. Unsupervised learning is suitable whenever labelled data is scarce for the specific leaf level models. Currently the training of leaf level models is in progress; where we have trained 25 out of the total 47 leaf level models as of now. We have trained the leaf models with the best case top-5 error rate of 3.2% on the validation data set for the particular leaf models. Also we demonstrate that the validation error of the leaf level models saturates towards the above mentioned accuracy as the number of epochs are increased to more than sixty.
This paper proposes an algorithm to improve the calculation of confidence measure for spoken term detection (STD). Given an input query term, the algorithm first calculates a measurement named document ranking weight for each document in the speech database to reflect its relevance with the query term by summing all the confidence measures of the hypothesized term occurrences in this document. The confidence measure of each term occurrence is then re-estimated through linear interpolation with the calculated document ranking weight to improve its reliability by integrating document-level information. Experiments are conducted on three standard STD tasks for Tamil, Vietnamese and English respectively. The experimental results all demonstrate that the proposed algorithm achieves consistent improvements over the state-of-the-art method for confidence measure calculation. Furthermore, this algorithm is still effective even if a high accuracy speech recognizer is not available, which makes it applicable for the languages with limited speech resources.
In this paper we present LocLinkVis (Locate-Link-Visualize); a system which supports exploratory information access to a document collection based on geo-referencing and visualization. It uses a gazetteer which contains representations of places ranging from countries to buildings, and that is used to recognize toponyms, disambiguate them into places, and to visualize the resulting spatial footprints.
This text investigates relations between two well-known family of algorithms, matrix factorisations and recursive linear filters, by describing a probabilistic model in which approximate inference corresponds to a matrix factorisation algorithm. Using the probabilistic model, we derive a matrix factorisation algorithm as a recursive linear filter. More precisely, we derive a matrix-variate recursive linear filter in order to perform efficient inference in high dimensions. We also show that it is possible to interpret our algorithm as a nontrivial stochastic gradient algorithm. Demonstrations and comparisons on an image restoration task are given.
Existing approaches for multivariate functional principal component analysis are restricted to data on a single interval. The presented approach focuses on multivariate functional data on different domains that may differ in dimension, e.g. functions and images. The theoretical basis for multivariate functional principal component analysis is given in terms of a Karhunen-Lo\`eve Theorem. For the practically relevant case of a finite Karhunen-Lo\`eve representation, a relationship between univariate and multivariate functional principal component analysis is established. This offers an estimation strategy to calculate multivariate functional principal components and scores based on their univariate counterparts. The approach can be extended to finite expansions in general, not necessarily orthonormal bases. It is also applicable for sparse data or data with measurement error. The new method is shown to be competitive to existing approaches for the case of bivariate data on the same one-dimensional domain in simulations and for the well-known gait cycle data. The motivating application is a neuroimaging study, where the goal is to explore how longitudinal trajectories of a neuropsychological test score covary with FDG-PET brain scans at baseline. Supplementary material, including detailed proofs, simulation examples and software is available online.
Parallel machine scheduling has been extensively studied in the past decades, with applications ranging from production planning to job processing in large computing clusters. In this work we study some of these fundamental optimization problems, as well as their parameterized and reoptimization variants. We first present improved bounds for job scheduling on unrelated parallel machines, with the objective of minimizing the latest completion time (makespan) of the schedule. We consider the subclass of fully-feasible instances, in which the processing time of each job, on any machine, does not exceed the minimum makespan. The problem is known to be hard to approximate within factor 4/3 already in this subclass. Although fully-feasible instances are hard to identify, we give a polynomial time algorithm that yields for such instances a schedule whose makespan is better than twice the optimal, the best known ratio for general instances. Moreover, we show that our result is robust under small violations of feasibility constraints. We further study the power of parameterization. We show that makespan minimization on unrelated machines admits a parameterized approximation scheme, where the parameter used is the number of processing times that are large relative to the latest completion time of the schedule. We also present an FPT algorithm for the graph-balancing problem, which corresponds to the instances of the restricted assignment problem where each job can be processed on at most 2 machines. Finally, motivated by practical scenarios, we initiate the study of reoptimization in job scheduling on identical and uniform machines, with the objective of minimizing the makespan. We develop reapproximation algorithms that yield in both models the best possible approximation ratio of , for any , with respect to the minimum makespan.
We introduce a model-free algorithm for learning in Markov decision processes with parameterized actions-discrete actions with continuous parameters. At each step the agent must select both which action to use and which parameters to use with this action. This models domains where there are distinct actions which can be adjusted to a particular state. We introduce the Q-PAMDP algorithm for learning in these domains. We show that Q-PAMDP converges to a local optima, and compare different approaches in a robot soccer goal-scoring domain and a platformer domain.
Paper provides a method for solving the reverse Monge-Kantorovich transport problem (TP). It allows to accumulate positive decision-taking experience made by decision-taker in situations that can be presented in the form of TP. The initial data for the solution of the inverse TP is the information on orders, inventories and effective decisions take by decision-taker. The result of solving the inverse TP contains evaluations of the TPs payoff matrix elements. It can be used in new situations to select the solution corresponding to the preferences of the decision-taker. The method allows to gain decision-taker experience, so it can be used by others. The method allows to build the model of decision-taker preferences in a specific application area. The model can be updated regularly to ensure its relevance and adequacy to the decision-taker system of preferences. This model is adaptive to the current preferences of the decision taker.
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.
Large sample size brings the computation bottleneck for modern data analysis. Subsampling is one of efficient strategies to handle this problem. In previous studies, researchers make more focus on subsampling with replacement (SSR) than on subsampling without replacement (SSWR). In this paper we investigate the SSWR strategy discussed in Drineas et al. (2008) for fast algorithm in ordinary least-square (OLS) problem. We establish non-asymptotic property, i.e, the error bound of the corresponding subsample estimator, which provide a tradeoff between computation cost and approximation efficiency. Besides the non-asymptotic result, we provide asymptotic consistency and normality of the subsample estimator. Methodologically, we propose a two-step subsampling algorithm, which is efficient with respect to a statistical objective and does not reply on the linear model assumption.. Synthetic and real data are used to empirically study our proposed subsampling strategies. We argue by these empirical studies that, (1) our proposed two-step algorithm has obvi- ous advantage when the assumed linear model does not accurate, and (2) SSWR strategy performs obviously better than SSR when the subsampling ratio increases.
Recent work on word embeddings has shown that simple vector subtraction over pre-trained embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.
We theoretically and experimentally investigate tensor-based regression and classification. Our focus is regularization with various tensor norms, including the overlapped trace norm, the latent trace norm, and the scaled latent trace norm. We first give dual optimization methods using the alternating direction method of multipliers, which is computationally efficient when the number of training samples is moderate. We then theoretically derive an excess risk bound for each tensor norm and clarify their behavior. Finally, we perform extensive experiments using simulated and real data and demonstrate the superiority of tensor-based learning methods over vector- and matrix-based learning methods.
The article describes a new data structure called neuro-index. It is an alternative to well-known file indexes. The neuro-index is fundamentally different because it stores weight coefficients in neural network. It is not a reference type like ‘keyword-position in a file’.