This paper aims to compare different regularization strategies to address a common phenomenon, severe overfitting, in embedding-based neural networks for NLP. We chose two widely studied neural models and tasks as our testbed. We tried several frequently applied or newly proposed regularization strategies, including penalizing weights (embeddings excluded), penalizing embeddings, re-embedding words, and dropout. We also emphasized on incremental hyperparameter tuning, and combining different regularizations. The results provide a picture on tuning hyperparameters for neural NLP models.
The maximal information coefficient (MIC), which measures the amount of dependence between two variables, is able to detect both linear and non-linear associations. However, computational cost grows rapidly as a function of the dataset size. In this paper, we develop a computationally efficient approximation to the MIC that replaces its dynamic programming step with a much simpler technique based on the uniform partitioning of data grid. A variety of experiments demonstrate the quality of our approximation.
Uncovering causal relationships in data is a major objective of data analytics. Causal relationships are normally discovered with designed experiments, e.g. randomised controlled trials, which, however are expensive or infeasible to be conducted in many cases. Causal relationships can also be found using some well designed observational studies, but they require domain experts’ knowledge and the process is normally time consuming. Hence there is a need for scalable and automated methods for causal relationship exploration in data. Classification methods are fast and they could be practical substitutes for finding causal signals in data. However, classification methods are not designed for causal discovery and a classification method may find false causal signals and miss the true ones. In this paper, we develop a causal decision tree where nodes have causal interpretations. Our method follows a well established causal inference framework and makes use of a classic statistical test. The method is practical for finding causal signals in large data sets.
Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an $F_1$-score of 83.7\%, higher than competing methods in the literature.
This paper deals with interactions between committee members as they rank a large list of applicants for a given position and eventually reach consensus. We will see that for a natural deterministic model the ranking can be described by solutions of a discrete quasilinear heat equation with time dependent coefficients on a graph. We show first that over time consensus emerges exponentially fast. Second, if there are clusters of members whose views are closer than those of the rest of the committee, then over time the clusters’ views become closer at a faster exponential rate than the views of the entire committee. We will also show that the variance of the rankings decays a definite amount, independent of the initial variance, when the influence of the members does not decay too quickly as opinions differ. When the influence between members is exactly a negative power, then the variance is convex and satisfies a three circles theorem.
In this short note, we present an extension of LSTM to use a depth gate to connect memory cells of adjacent layers. Doing so introduces a linear dependence between lower and upper recurrent units. Importantly, the linear dependence is gated through a gating function, which we call forget gate. This gate is a function of lower layer memory cell, its input, and its past memory. We conducted experiments and verified that this new architecture of LSTMs is able to improve machine translation and language modeling performances.
Capturing the interdependencies between real valued time series can be achieved by finding common similar patterns. The abstraction of time series makes the process of finding similarities closer to the way as humans do. Therefore, the abstraction by means of a symbolic levels and finding the common patterns attracts researchers. One particular algorithm, Longest Common Subsequence, has been used successfully as a similarity measure between two sequences including real valued time series. In this paper, we propose Fuzzy Longest Common Subsequence matching for time series.
The success of deep learning often derives from well-chosen operational building blocks. In this work, we revise the temporal convolution operation in CNNs to better adapt it to text processing. Instead of concatenating word representations, we appeal to tensor algebra and use low-rank n-gram tensors to directly exploit interactions between words already at the convolution stage. Moreover, we extend the n-gram convolution to non-consecutive words to recognize patterns with intervening words. Through a combination of low-rank tensors, and pattern weighting, we can efficiently evaluate the resulting convolution operation via dynamic programming. We test the resulting architecture on standard sentiment classification and news categorization tasks. Our model achieves state-of-the-art performance both in terms of accuracy and training speed. For instance, we obtain 51.4% accuracy on the fine-grained sentiment classification task.
We investigate an extension of continuous online learning in recurrent neural network language models. The model keeps a separate vector representation of the current unit of text being processed and adaptively adjusts it after each prediction. The initial experiments give promising results, indicating that the method is able to increase language modelling accuracy, while also decreasing the parameters needed to store the model along with the computation required at each step.
The fundamental problem of similarity studies, in the frame of data-mining, is to examine and detect similar items in articles, papers, books, with huge sizes. In this paper, we are interested in the probabilistic, and the statistical and the algorithmic aspects in studies of texts. We will be using the approach of $k$\textit{-shinglings}, a $k$\textit{-shingling} being defined as a sequence of $k$ consecutive characters that are extracted from a text ($k\geq 1$ ). The main stake in this field is to find accurate and quick algorithms to compute the similarity in short times. This will be achieved in using approximation methods. The first approximation method is statistical and, is based on the theorem of Glivenko-Cantelli. The second is the banding technique. And the third concerns a modification of the algorithm proposed by Rajaraman and al (% \cite{AnandJeffrey}), denoted here as (RUM). The Jaccard index is the one used in this paper. We finally illustrate these results of the paper on the four Gospels. The results are very conclusive.
Learning novel and interesting concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms generally leverage the properties of the database schema to find the definition of the target concept in terms of the existing relations in the database. Nevertheless, it is well established that the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, many current learning algorithms tend to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency, which complicates their off-the-shelf application. In this paper, we formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of vertical (de)composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to a user. For sample-based algorithms we consider the two main algorithm classes: top-down and bottom-up. We prove that practical top-down algorithms are generally not schema independent, while, in contrast, two bottom-up algorithms Golem and ProGolem are schema independent with some modifications. For query-based learning algorithms we show that the vertical (de)composition transformations influence their learning efficiency. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark data sets under natural vertical (de)compositions.
During the past two decades there has been a lot of interest in developing statistical depth notions that generalize the univariate concept of ranking to multivariate data. The notion of depth has also been extended to regression models and functional data. However, computing such depth functions as well as their contours and deepest points is not trivial. Techniques of computational geometry appear to be well-suited for the development of such algorithms. Both the statistical and the computational geometry communities have done much work in this direction, often in close collaboration. We give a short survey of this work, focusing mainly on depth and multivariate medians, and end by listing some other areas of statistics where computational geometry has been of great help in constructing efficient algorithms.
Given observations of a collection of covariates and responses $(Y, X) \in \R^p \times \R^q$, sufficient dimension reduction (SDR) techniques aim to identify a mapping $f: \R^q \rightarrow \R^k$ with $k \ll q$ such that $Y|f(X)$ is independent of $X$. The image $f(X)$ summarizes the relevant information in a potentially large number of covariates $X$ that influence the responses $Y$. In many contemporary settings, the number of responses $p$ is also quite large, in addition to a large number $q$ of covariates. This leads to the challenge of fitting a succinctly parameterized statistical model to $Y|f(X)$, which is a problem that is usually not addressed in a traditional SDR framework. In this paper, we present a computationally tractable convex relaxation based estimator for simultaneously (a) identifying a linear dimension reduction $f(X)$ of the covariates that is sufficient with respect to the responses, and (b) fitting several types of structured low-dimensional models — factor models, graphical models, latent-variable graphical models — to the conditional distribution of $Y|f(X)$. We analyze the consistency properties of our estimator in a high-dimensional scaling regime. We also illustrate the performance of our approach on a newsgroup dataset and on a dataset consisting of financial asset prices.
Bike-sharing systems are a means of smart transportation in urban environments with the benefit of a positive impact on urban mobility. In this paper we are interested in studying and modeling the behavior of features that permit the end user to access, with her/his web browser, the status of the Bike-Sharing system. In particular, we address features able to make a prediction on the system state. We propose to use a machine learning approach to analyze usage patterns and learn computational models of such features from logs of system usage. On the one hand, machine learning methodologies provide a powerful and general means to implement a wide choice of predictive features. On the other hand, trained machine learning models are provided with a measure of predictive performance that can be used as a metric to assess the cost-performance trade-off of the feature. This provides a principled way to assess the runtime behavior of different components before putting them into operation.