A Comparative Study on Regularization Strategies for Embedding-based Neural Networks

This paper aims to compare different regularization strategies to address a common phenomenon, severe overfitting, in embedding-based neural networks for NLP. We chose two widely studied neural models and tasks as our testbed. We tried several frequently applied or newly proposed regularization strategies, including penalizing weights (embeddings excluded), penalizing embeddings, re-embedding words, and dropout. We also emphasized on incremental hyperparameter tuning, and combining different regularizations. The results provide a picture on tuning hyperparameters for neural NLP models.

An Information-Theoretic Measure of Dependency Among Variables in Large Datasets

The maximal information coefficient (MIC), which measures the amount of dependence between two variables, is able to detect both linear and non-linear associations. However, computational cost grows rapidly as a function of the dataset size. In this paper, we develop a computationally efficient approximation to the MIC that replaces its dynamic programming step with a much simpler technique based on the uniform partitioning of data grid. A variety of experiments demonstrate the quality of our approximation.

Causal Decision Trees

Uncovering causal relationships in data is a major objective of data analytics. Causal relationships are normally discovered with designed experiments, e.g. randomised controlled trials, which, however are expensive or infeasible to be conducted in many cases. Causal relationships can also be found using some well designed observational studies, but they require domain experts’ knowledge and the process is normally time consuming. Hence there is a need for scalable and automated methods for causal relationship exploration in data. Classification methods are fast and they could be practical substitutes for finding causal signals in data. However, classification methods are not designed for causal discovery and a classification method may find false causal signals and miss the true ones. In this paper, we develop a causal decision tree where nodes have causal interpretations. Our method follows a well established causal inference framework and makes use of a classic statistical test. The method is practical for finding causal signals in large data sets.

Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Path

Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an F_1-score of 83.7\%, higher than competing methods in the literature.

Committee ranking

This paper deals with interactions between committee members as they rank a large list of applicants for a given position and eventually reach consensus. We will see that for a natural deterministic model the ranking can be described by solutions of a discrete quasilinear heat equation with time dependent coefficients on a graph. We show first that over time consensus emerges exponentially fast. Second, if there are clusters of members whose views are closer than those of the rest of the committee, then over time the clusters’ views become closer at a faster exponential rate than the views of the entire committee. We will also show that the variance of the rankings decays a definite amount, independent of the initial variance, when the influence of the members does not decay too quickly as opinions differ. When the influence between members is exactly a negative power, then the variance is convex and satisfies a three circles theorem.

Depth-Gated Recurrent Neural Networks

In this short note, we present an extension of LSTM to use a depth gate to connect memory cells of adjacent layers. Doing so introduces a linear dependence between lower and upper recurrent units. Importantly, the linear dependence is gated through a gating function, which we call forget gate. This gate is a function of lower layer memory cell, its input, and its past memory. We conducted experiments and verified that this new architecture of LSTMs is able to improve machine translation and language modeling performances.

Fuzzy Longest Common Subsequence Matching With FCM

Capturing the interdependencies between real valued time series can be achieved by finding common similar patterns. The abstraction of time series makes the process of finding similarities closer to the way as humans do. Therefore, the abstraction by means of a symbolic levels and finding the common patterns attracts researchers. One particular algorithm, Longest Common Subsequence, has been used successfully as a similarity measure between two sequences including real valued time series. In this paper, we propose Fuzzy Longest Common Subsequence matching for time series.

Molding CNNs for text: non-linear, non-consecutive convolutions

The success of deep learning often derives from well-chosen operational building blocks. In this work, we revise the temporal convolution operation in CNNs to better adapt it to text processing. Instead of concatenating word representations, we appeal to tensor algebra and use low-rank n-gram tensors to directly exploit interactions between words already at the convolution stage. Moreover, we extend the n-gram convolution to non-consecutive words to recognize patterns with intervening words. Through a combination of low-rank tensors, and pattern weighting, we can efficiently evaluate the resulting convolution operation via dynamic programming. We test the resulting architecture on standard sentiment classification and news categorization tasks. Our model achieves state-of-the-art performance both in terms of accuracy and training speed. For instance, we obtain 51.4% accuracy on the fine-grained sentiment classification task.

Online Representation Learning in Recurrent Neural Language Models

We investigate an extension of continuous online learning in recurrent neural network language models. The model keeps a separate vector representation of the current unit of text being processed and adaptively adjusts it after each prediction. The initial experiments give promising results, indicating that the method is able to increase language modelling accuracy, while also decreasing the parameters needed to store the model along with the computation required at each step.

Probabilistic, statistical and algorithmic aspects of the similarity of texts and application to Gospels comparison

The fundamental problem of similarity studies, in the frame of data-mining, is to examine and detect similar items in articles, papers, books, with huge sizes. In this paper, we are interested in the probabilistic, and the statistical and the algorithmic aspects in studies of texts. We will be using the approach of k\textit{-shinglings}, a k\textit{-shingling} being defined as a sequence of k consecutive characters that are extracted from a text (k\geq 1 ). The main stake in this field is to find accurate and quick algorithms to compute the similarity in short times. This will be achieved in using approximation methods. The first approximation method is statistical and, is based on the theorem of Glivenko-Cantelli. The second is the banding technique. And the third concerns a modification of the algorithm proposed by Rajaraman and al (% \cite{AnandJeffrey}), denoted here as (RUM). The Jaccard index is the one used in this paper. We finally illustrate these results of the paper on the four Gospels. The results are very conclusive.

Schema Independent Relational Learning

Learning novel and interesting concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms generally leverage the properties of the database schema to find the definition of the target concept in terms of the existing relations in the database. Nevertheless, it is well established that the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, many current learning algorithms tend to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency, which complicates their off-the-shelf application. In this paper, we formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of vertical (de)composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to a user. For sample-based algorithms we consider the two main algorithm classes: top-down and bottom-up. We prove that practical top-down algorithms are generally not schema independent, while, in contrast, two bottom-up algorithms Golem and ProGolem are schema independent with some modifications. For query-based learning algorithms we show that the vertical (de)composition transformations influence their learning efficiency. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark data sets under natural vertical (de)compositions.

Statistical depth meets computational geometry: a short survey

During the past two decades there has been a lot of interest in developing statistical depth notions that generalize the univariate concept of ranking to multivariate data. The notion of depth has also been extended to regression models and functional data. However, computing such depth functions as well as their contours and deepest points is not trivial. Techniques of computational geometry appear to be well-suited for the development of such algorithms. Both the statistical and the computational geometry communities have done much work in this direction, often in close collaboration. We give a short survey of this work, focusing mainly on depth and multivariate medians, and end by listing some other areas of statistics where computational geometry has been of great help in constructing efficient algorithms.

Sufficient Dimension Reduction and Modeling Responses Conditioned on Covariates: An Integrated Approach via Convex Optimization

Given observations of a collection of covariates and responses (Y, X) \in \R^p \times \R^q, sufficient dimension reduction (SDR) techniques aim to identify a mapping f: \R^q \rightarrow \R^k with k \ll q such that Y|f(X) is independent of X. The image f(X) summarizes the relevant information in a potentially large number of covariates X that influence the responses Y. In many contemporary settings, the number of responses p is also quite large, in addition to a large number q of covariates. This leads to the challenge of fitting a succinctly parameterized statistical model to Y|f(X), which is a problem that is usually not addressed in a traditional SDR framework. In this paper, we present a computationally tractable convex relaxation based estimator for simultaneously (a) identifying a linear dimension reduction f(X) of the covariates that is sufficient with respect to the responses, and (b) fitting several types of structured low-dimensional models — factor models, graphical models, latent-variable graphical models — to the conditional distribution of Y|f(X). We analyze the consistency properties of our estimator in a high-dimensional scaling regime. We also illustrate the performance of our approach on a newsgroup dataset and on a dataset consisting of financial asset prices.

Using a Machine Learning Approach to Implement and Evaluate Product Line Features

Bike-sharing systems are a means of smart transportation in urban environments with the benefit of a positive impact on urban mobility. In this paper we are interested in studying and modeling the behavior of features that permit the end user to access, with her/his web browser, the status of the Bike-Sharing system. In particular, we address features able to make a prediction on the system state. We propose to use a machine learning approach to analyze usage patterns and learn computational models of such features from logs of system usage. On the one hand, machine learning methodologies provide a powerful and general means to implement a wide choice of predictive features. On the other hand, trained machine learning models are provided with a measure of predictive performance that can be used as a metric to assess the cost-performance trade-off of the feature. This provides a principled way to assess the runtime behavior of different components before putting them into operation.

A Calculus of Mobility and Communication for Ubiquitous Computing

A data-driven method for the stochastic parametrisation of subgrid-scale tropical convective area fraction

A Deep Learning Approach to Structured Signal Recovery

A generalization of Eulerian numbers via rook placements

A Generative Model for Multi-Dialect Representation

A Generative Word Embedding Model and its Low Rank Positive Semidefinite Solution

A Model for Collective Dynamics in Ant Raids

A Note on ‘Regularity lemma for distal structures’

A one-dimensional diffusion hits points fast

A Refinement-Based Architecture for Knowledge Representation and Reasoning in Robotics

A simple sampler for the horseshoe estimator

A stochastic variational approach to the viscous Camassa-Holm and related equations

A Tale of Two Metrics: Simultaneous Bounds on Competitiveness and Regret

Adaptive estimation of planar convex sets

Algorithmic Aspects of Optimal Channel Coding

Asynchronous Pattern Formation without Chirality

Beyond-Quantum Modeling of Question Order Effects and Response Replicability in Psychological Measurements

Birational contractions of $\overline{\mathrm{M}}_{0,n}$ and combinatorics of extremal assignments

Boundaries of Planar Graphs: A Unified Approach

Categorifying the tensor product of a level 1 highest weight and perfect crystal in type A

Chimera patterns under the impact of noise

Chromatic thresholds in dense random graphs

Chromatic thresholds in sparse random graphs

Colored triangulations of arbitrary dimensions are stuffed Walsh maps

Comparison of Uncertainty of Two Precipitation Prediction Models

Computing in Additive Networks with Bounded-Information Codes

Coordination Complexity: Small Information Coordinating Large Populations

Copula based generalized additive models with non-random sample selection

Critical value for the contact process with random edge weights on regular tree

Cross-Layer Design of Wireless Multihop Networks over Stochastic Channels with Time-Varying Statistics

Decomposition of bi-colored square arrays into balanced diagonals

Differences of subgroups in subgroups

Diffusion on a Hypersphere: Application to the Wright-Fisher model

Discrete Euler integration over functions on finite categories

Discrete Route/Trajectory Decision Making Problems

Distance mean-regular graphs

Distance regularity in buildings and structure constants in Hecke algebras

Domain-specific queries and Web search personalization: some investigations

Effective Approaches to Attention-based Neural Machine Translation

Entanglement Holographic Mapping of Many-Body Localized System by Spectrum Bifurcation Renormalization Group

Evaluating Classifiers in Detecting 419 Scams in Bilingual Cybercriminal Communities

Excellence networks in science: A web-based application (www.excellence-networks.net) for the identification of institutions collaborating successfully

Existence of continuous functions that are one-to-one almost everywhere

From Observational Studies to Causal Rule Mining

In Quest of Significance: Identifying Types of Twitter Sentiment Events that Predict Spikes in Sales

Knuthian Drawings of Series-Parallel Flowcharts

KPZ reloaded

Locality-preserving allocations Problems and coloured Bin Packing

Matrix ansatz and combinatorics of the $k$-species PASEP

Max Point-Tolerance Graphs

Mixture Selection, Mechanism Design, and Signaling

Nash equilibria of threshold type for two-player nonzero-sum games of stopping

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

Nonparametric Bayesian Models With Focused Clustering for Mixed Ordinal and Nominal Data

Nonparametric Distributed Learning Framework: Algorithm and Application to Variable Selection

On Clique Convergences of Graphs

On connectivity in a general random intersection graph

On mathching property for groups and vector spaces

On the physical interpretation of a meta-analysis in the presence of heterogeneity and bias: from clinical trials to Mendelian randomization

On the spatial Markov property of soups of unoriented and oriented loops

Optimal Concentration of Information Content For Log-Concave Densities

‘Owl’ and ‘Lizard’: Patterns of Head Pose and Eye Pose in Driver Gaze Classification

Partitioning the vertex set of $G$ to make $G\,\Box\, H$ an efficient open domination graph

Planar Graphs of Girth at least Five are Square $(Δ+ 2)$-Choosable

Predicting Grades

Quantifying information transfer and mediation along causal pathways in complex systems

Ramsey equivalence of $K_n$ and $K_n+K_{n-1}$

Representing Permutations with Few Moves

Sensor Selection for Estimation with Correlated Measurement Noise

Sorting Under Restrictions

Space Bounds for Reliable Multi-Writer Data Store: Inherent Cost of Read/Write Primitives

The Bayesian Formulation of EIT: Analysis and Algorithms

The Computational Power of Beeps

The odd Hadwiger’s conjecture is ‘almost” decidable

The Ruzsa divergence for random elements in locally compact abelian groups

The SP theory of intelligence: its distinctive features and advantages

Total Variation and Separation Cutoffs are not equivalent and neither one implies the other

Toward Żak’s conjecture on graph packing

Towards an Axiomatic Approach to Hierarchical Clustering of Measures

Two-stage Cascaded Classifier for Purchase Prediction

Unbounded Bayesian Optimization via Regularization

Universality of the mean-field for the Potts model

Variable Elimination in Fourier Domain

Vertical modeling: analysis of competing risks data with a cure proportion

Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology

Web Content Extraction — a Meta-Analysis of its Past and Thoughts on its Future