Consistent Estimation for Partition-wise Regression and Classification Models

Partition-wise models offer a flexible approach to modeling complex, multidimensional data and are capable of producing interpretable results. They are based on partitioning the observed data into regions, each of which is modeled with a simple submodel. The success of this approach depends heavily on the quality of the partition, as too large a region could lead to a non-simple submodel, while too small a region could inflate estimation variance. This paper proposes an automatic procedure for choosing the partition (i.e., the number of regions and the boundaries between regions) as well as the submodels for the regions. It is shown that, under the assumption of the existence of a true partition, the proposed partition estimator is statistically consistent. The methodology is demonstrated for both regression and classification problems.


Trans-gram, Fast Cross-lingual Word-embeddings

We introduce Trans-gram, a simple and computationally efficient method to simultaneously learn and align word embeddings for a variety of languages, using only monolingual data and a smaller set of sentence-aligned data. We use our new method to compute aligned word embeddings for twenty-one languages using English as a pivot language. We show that some linguistic features are aligned across languages for which we do not have aligned data, even though those properties do not exist in the pivot language. We also achieve state-of-the-art results on standard cross-lingual text classification and word translation tasks.


Git4Voc: Git-based Versioning for Collaborative Vocabulary Development

Collaborative vocabulary development in the context of data integration is the process of finding consensus among the experts of the different systems and domains. The complexity of this process increases with the number of people involved, the variety of the systems to be integrated, and the dynamics of their domain. In this paper we argue that realizing a powerful version control system is at the heart of the problem. Driven by this idea and the success of Git in the context of software development, we investigate the applicability of Git for collaborative vocabulary development. Even though vocabulary development and software development have many more similarities than differences, important differences remain. These need to be considered in the development of a successful versioning and collaboration system for vocabulary development. Therefore, this paper starts by presenting the challenges we faced while creating vocabularies collaboratively and discusses how this process differs from software development. Based on these insights we propose Git4Voc, which comprises guidelines on how Git can be adopted for vocabulary development. Finally, we demonstrate how Git hooks can be implemented to go beyond the plain functionality of Git by realizing vocabulary-specific features such as syntactic validation and semantic diffs.


Argumentation Mining in User-Generated Web Discourse

The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people’s argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source code, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task.


A Synthetic Approach for Recommendation: Combining Ratings, Social Relations, and Reviews

Recommender systems (RSs) provide an effective way of alleviating the information overload problem by selecting personalized choices. Online social networks and user-generated content provide diverse sources for recommendation beyond ratings, which present opportunities as well as challenges for traditional RSs. Although social matrix factorization (Social MF) can integrate ratings with social relations and topic matrix factorization can integrate ratings with item reviews, both of them ignore some useful information. In this paper, we investigate effective data fusion by combining the two approaches in two steps. First, we extend Social MF to exploit the graph structure of neighbors. Second, we propose a novel framework, MR3, to jointly model these three types of information effectively for rating prediction by aligning latent factors and hidden topics. We achieve more accurate rating prediction on two real-life datasets. Furthermore, we measure the contribution of each data source to the proposed framework.


Temporal Multinomial Mixture for Instance-Oriented Evolutionary Clustering

Evolutionary clustering aims at capturing the temporal evolution of clusters. This issue is particularly important in the context of social media data, which are naturally temporally driven. In this paper, we propose a new probabilistic model-based evolutionary clustering technique. The Temporal Multinomial Mixture (TMM) is an extension of the classical mixture model that optimizes feature co-occurrences in a trade-off with temporal smoothness. Our model is evaluated on two recent case studies on opinion aggregation over time. We compare four different probabilistic clustering models and show the superiority of our proposal in the task of instance-oriented clustering.


Bayesian linear regression with skew-symmetric error distributions with applications to survival analysis

We study Bayesian linear regression models with skew-symmetric scale mixtures of normal error distributions. These kinds of models can be used to capture departures from the usual assumption of normality of the errors in terms of heavy tails and asymmetry. We propose a general non-informative prior structure for these regression models and show that the corresponding posterior distribution is proper under mild conditions. We extend these propriety results to cases where the response variables are censored. The latter scenario is of interest in the context of accelerated failure time models, which are relevant in survival analysis. We present a simulation study that demonstrates good frequentist properties of the posterior credible intervals associated with the proposed priors. This study also sheds some light on the trade-off between increased model flexibility and the risk of over-fitting. We illustrate the performance of the proposed models with real data. Although we focus on models with univariate response variables, we also present some extensions to the multivariate case in the Supporting Web Material.


A FIRM Approach to Software-Defined Service Composition

A service composition is an aggregate of services, often leveraged to automate enterprise business processes. While Service Oriented Architecture (SOA) has been at the forefront of service composition, services can be realized as efficient distributed and parallel constructs such as MapReduce, which are not typically exploited in service composition. With the advent of Software-Defined Networking (SDN), a global view and control of the entire network are made available to the network controller, which can further be leveraged at the application level. This paper presents FIRM, an approach for Software-Defined Service Composition that leverages SDN and MapReduce. FIRM comprises Find, Invoke, Return, and Manage as the core procedures in achieving a QoS-Aware Service Composition.


SENDIM for Incremental Development of Cloud Networks

Due to the limited and varying availability of cheap infrastructure and resources, cloud network systems and applications are tested in simulation and emulation environments prior to physical deployments, at different stages of development. Configuration management tools manage deployments and migrations across different cloud platforms, mitigating tedious system administration efforts. However, a cloud networking simulation currently cannot be migrated to an emulation, or vice versa, without rewriting and manually re-deploying the simulated application. This paper presents SENDIM (Sendim is a northeastern Portuguese town close to the Spanish border, where the rare Mirandese language is spoken), a Simulation, Emulation, aNd Deployment Integration Middleware for cloud networks. As an orchestration platform for incrementally building Software-Defined Cloud Networks (SDCN), SENDIM manages the development and deployment of algorithms and architectures all the way from visualization, simulation, and emulation to physical deployments. Hence, SENDIM optimizes the evaluation of cloud networks.


Bayesian Inference using the Symmetric Monoidal Closed Category Structure

Exact Relation between Singular Value and Eigenvalue Statistics

Programming Discrete Distributions with Chemical Reaction Networks

On the enumeration of lattice $3$-polytopes

On the local genus distribution of graph embeddings

On some multicolour Ramsey properties of random graphs

Bayesian subset simulation

Environmental Noise Embeddings for Robust Speech Recognition

Evaluating the Performance of a Speech Recognition based System

Investigating gated recurrent neural networks for speech synthesis

Stationary signal processing on graphs

The homotopy theory of equivariant posets

On the variations of the principal eigenvalue and the probability of survival with respect to a parameter in growth-fragmentation-death models

Multidimensional Selberg theorem and fluctuations of the zeta zeros via Malliavin calculus

How to learn a graph from smooth signals

Numerical analysis of lognormal diffusions on the sphere

An inequality for moments of log-concave functions on Gaussian random vectors

Approximation algorithms for node-weighted prize-collecting Steiner tree problems on planar graphs

Approximating the degree sequence of two random graphs

An Application-Level Dependable Technique for Farmer-Worker Parallel Programs

Modeling Multivariate Mixed-Response Functional Data

Aging in the three-dimensional Random Field Ising Model

The Effects of Age, Gender and Region on Non-standard Linguistic Variation in Online Social Networks

Autonomous Crowds Tracking with Box Particle Filtering and Convolution Particle Filtering

Extension complexity and realization spaces of hypersimplices

Subexponential time algorithms for finding small tree and path decompositions

A novel approach for Markov Random Field with intractable normalising constant on large lattices

New Integrality Gap Results for the Firefighters Problem on Trees

Bounding errors of Expectation-Propagation

Implicit Look-alike Modelling in Display Ads: Transfer Collaborative Filtering to CTR Estimation

Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction

Asymptotic results for exponential functionals of Levy processes

Cospectral lifts of graphs

Linear and Optimization Hamiltonians in Clustered Exponential Random Graph Modeling

Optimal Power Flow with Inelastic Demands for Demand Response in Radial Distribution Networks

Localisation of a source of biochemical agent dispersion using binary measurements

Eventual return probability in multidimensional random walks

On the geometry of random lemniscates

Bismut’s gradient formula for vector bundles

On the geometric properties of the semi-Lagrangian discontinuous Galerkin scheme for the Vlasov-Poisson equation

Bounded colorings of multipartite graphs and hypergraphs

Predicting the large-scale evolution of tag systems

Involution words II: braid relations and atomic structures

Improper Twin Edge Coloring of Graphs

A Sufficient Statistics Construction of Bayesian Nonparametric Exponential Family Conjugate Models

Negative interest rates: why and how?

On parallel solution of ordinary differential equations

Multivariate Regular Variation of Discrete Mass Functions with Applications to Preferential Attachment Networks

Hypo-efficient domination and hypo-unique domination

Constructions for the optimal pebbling of grids

Parallel Stroked Multi Line: a model-based method for compressing large fingerprint databases

On the lock-in probability estimate of stochastic approximation with controlled Markov noise

On Clustering Time Series Using Euclidean Distance and Pearson Correlation

Stammering tableaux – Tableaux bégayants

Heat transport in low-dimensional random harmonic networks

Random Continued fractions: Lévy constant and Chernoff-type estimate

Limit theorems related to beta-expansion and continued fraction expansion

Identifying Stable Patterns over Time for Emotion Recognition from EEG

Limit Laws for Random Matrices from Traffic-Free Probability

Optimal-order bounds on the rate of convergence to normality for maximum likelihood estimators

Empirical Gaussian priors for cross-lingual transfer learning

Coexistence of shocks and rarefaction fans: complex phase diagram of a simple hyperbolic particle system

Computing semiparametric bounds on the expected payments of insurance instruments via column generation

Fluctuations in the heterogeneous multiscale methods for fast-slow systems

Discrepancy of line segments for general lattice checkerboards

Spectra of general hypergraphs

A note on the Sobol’ indices and interactive criteria

Diffusive Propagation of Energy in a Non-Acoustic Chain

Sklar’s Theorem in an Imprecise Setting

On totally antimagic total labeling of complete bipartite graphs

Invertible binary matrix with maximum number of $2$-by-$2$ invertible submatrices

Dynamic Monopolies for Degree Proportional Thresholds in Connected Graphs of Girth at least Five and Trees

Wavelet analysis on symbolic sequences and two-fold de Bruijn sequences

Group Invariant Deep Representations for Image Instance Retrieval

On the Very-well-poised Bilateral Basic Hypergeometric $_5ψ_5$ Series

Minimax Subsampling for Estimation and Prediction in Low-Dimensional Linear Regression

Maxima of Two Random Walks: Universal Statistics of Lead Changes

A note on the sample complexity of the Er-SpUD algorithm by Spielman, Wang and Wright for exact recovery of sparsely used dictionaries

Autocorrelated errors in experimental data in the language sciences: Some solutions offered by Generalized Additive Mixed Models

It’s just a matter of perspective(s): Crowd-Powered Consensus Organization of Corpora