YEDDA
In this paper, we introduce YEDDA, a lightweight but efficient open-source tool for text span annotation. YEDDA provides a systematic solution for text span annotation, ranging from collaborative user annotation to administrator evaluation and analysis. It overcomes the low efficiency of traditional text annotation tools by supporting entity annotation through both the command line and shortcut keys, which are configurable with custom labels. YEDDA also gives intelligent recommendations by training a predictive model on the up-to-date annotated text. An administrator client is developed to evaluate the annotation quality of multiple annotators and to generate a detailed comparison report for each annotator pair. YEDDA is built on Tkinter and is compatible with all major operating systems. …
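The shortcut-key annotation described above can be sketched as follows. The key-to-label map and the `[@span#Label*]` inline markup are assumptions made for illustration, not necessarily YEDDA's exact configuration or output format:

```python
# Sketch of configurable shortcut-key span annotation in the spirit of YEDDA.
# SHORTCUT_LABELS stands in for a user-configurable key -> label mapping.

SHORTCUT_LABELS = {"a": "Person", "b": "Location", "c": "Organization"}

def annotate(text: str, start: int, end: int, key: str) -> str:
    """Wrap text[start:end] with the label bound to a shortcut key."""
    label = SHORTCUT_LABELS[key]
    return text[:start] + f"[@{text[start:end]}#{label}*]" + text[end:]

print(annotate("Alice visited Paris.", 0, 5, "a"))
# -> [@Alice#Person*] visited Paris.
```

In the actual tool, the annotator selects a span with the cursor and presses the key; the inline markup keeps annotations human-readable and easy to diff across annotators.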

Machines Talking To Machines (M2M)
We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue ‘outlines’, i.e., sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We present a new corpus of 3,000 dialogues spanning 2 domains collected with M2M, and compare it with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows. …
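The two phases can be illustrated with a toy ticket-booking task. The schema, dialogue-act names, and templates below are invented for illustration; real M2M bots generate outlines exhaustively rather than emitting one fixed sequence, and phase two is done by crowd workers, not a lookup table:

```python
# Hypothetical sketch of M2M's two phases for a made-up movie-ticket task.

task_schema = {"movie": "Inside Out", "num_tickets": 2}

def generate_outline(schema):
    """Phase 1: bots produce an outline, i.e. (semantic parse, template utterance) pairs."""
    return [
        ("request(movie)", "SYS: Which movie?"),
        (f"inform(movie={schema['movie']})", f"USR: movie is {schema['movie']}"),
        ("request(num_tickets)", "SYS: How many tickets?"),
        (f"inform(num_tickets={schema['num_tickets']})",
         f"USR: num_tickets is {schema['num_tickets']}"),
        ("confirm()", "SYS: Booking confirmed."),
    ]

def crowd_rewrite(template_utt):
    """Phase 2 stand-in: a crowd worker rewrites the template naturally while
    preserving its semantic parse; here a single rewrite is hard-coded."""
    rewrites = {"USR: movie is Inside Out": "I'd like to see Inside Out, please."}
    return rewrites.get(template_utt, template_utt)

for parse, utt in generate_outline(task_schema):
    print(parse, "|", crowd_rewrite(utt))
```

Because the parses are fixed in phase one, the rewrites change only surface form, which is what lets M2M keep semantic annotations consistent while gaining natural language variety.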

Diffusion Variational Autoencoder
Variational autoencoders (VAEs) have become one of the most popular deep learning approaches to unsupervised learning and data generation. However, traditional VAEs suffer from the constraint that the latent space must distributionally match a simple prior (e.g., normal, uniform), independent of the initial data distribution. This leads to a number of issues in modeling manifold data, as there is no function with bounded Jacobian that maps a normal distribution to certain manifolds (e.g., the sphere). Similarly, few theoretical guarantees exist for the encoder and decoder learned by a VAE. In this work, we propose a variational autoencoder that maps manifold-valued data to its diffusion map coordinates in the latent space, resamples in a neighborhood around a given point in the latent space, and learns a decoder that maps the newly resampled points back to the manifold. The framework is built on SpectralNet and is capable of learning this data-dependent latent space without explicitly computing the eigenfunctions of the Laplacian. We prove that the diffusion variational autoencoder framework is capable of learning a locally bi-Lipschitz map between the manifold and the latent space, and that our resampling method around a point $\psi(x)$ in the latent space maps points back to the manifold around the point $x$, specifically into a neighborhood on the tangent space at the point $x$ on the manifold. We also provide empirical evidence of the benefits of using a diffusion map latent space on manifold data. …
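The encode, resample, decode loop can be made concrete on a toy manifold, the unit circle in $\mathbb{R}^2$. Here `psi` and `decode` are hand-coded stand-ins for the learned diffusion-map coordinates and decoder, purely for illustration of the sampling scheme:

```python
import numpy as np

# Schematic of the diffusion-VAE sampling loop on a toy 1-D manifold.

def psi(x):
    """Toy 'diffusion map' coordinate: the angle of a point on the circle."""
    return np.arctan2(x[1], x[0])

def decode(z):
    """Toy decoder: map a latent coordinate back onto the circle."""
    return np.array([np.cos(z), np.sin(z)])

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])             # a point on the manifold
z = psi(x) + 0.1 * rng.normal()      # resample in a latent neighborhood of psi(x)
x_new = decode(z)                    # lands near x, and stays on the manifold
print(round(np.linalg.norm(x_new), 6))  # -> 1.0
```

The point of the construction is visible even in this toy: because the decoder's range is the manifold itself, perturbing in the latent space never produces off-manifold samples, unlike resampling a standard VAE prior.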

AlphaClean
The analyst effort in data cleaning is gradually shifting away from the design of hand-written scripts toward building and tuning complex pipelines of automated data cleaning libraries. Hyper-parameter tuning for data cleaning is very different from hyper-parameter tuning for machine learning, since the pipeline components and objective functions have structure that tuning algorithms can exploit. This paper proposes a framework, called AlphaClean, that rethinks parameter tuning for data cleaning pipelines. AlphaClean provides users with a rich library to define data quality measures as weighted sums of SQL aggregate queries. AlphaClean applies a generate-then-search framework in which each pipelined cleaning operator contributes candidate transformations to a shared pool. Asynchronously, in separate threads, a search algorithm sequences them into cleaning pipelines that maximize the user-defined quality measures. This architecture allows AlphaClean to apply a number of optimizations, including incremental evaluation of the quality measures and learning dynamic pruning rules to reduce the search space. Our experiments on real and synthetic benchmarks suggest that AlphaClean finds solutions of up to 9x higher quality than naively applying state-of-the-art parameter tuning methods, is significantly more robust to straggling data cleaning methods and redundancy in the data cleaning library, and can incorporate state-of-the-art cleaning systems such as HoloClean as cleaning operators. …
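The interplay between SQL-aggregate quality measures and generate-then-search can be sketched with `sqlite3`. The table, weights, and candidate transformations below are illustrative stand-ins, not AlphaClean's actual API; the search here is a single greedy step rather than the asynchronous pipeline search described above:

```python
import sqlite3

# Sketch: a quality measure as a weighted sum of SQL aggregates, used to
# score candidate cleaning transformations from a shared pool.

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit, so SAVEPOINT/ROLLBACK work as written
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Paris", 2100000), ("paris", None), ("Lyon", 500000)])

def quality(conn):
    """Weighted sum of aggregates: reward retained rows and non-NULL
    populations, penalize case-only duplicate names."""
    total = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
    non_null = conn.execute(
        "SELECT COUNT(*) FROM cities WHERE population IS NOT NULL").fetchone()[0]
    dupes = conn.execute(
        "SELECT COUNT(DISTINCT name) - COUNT(DISTINCT LOWER(name)) FROM cities"
    ).fetchone()[0]
    return 0.5 * total + 1.0 * non_null - 2.0 * dupes

def trial(conn, sql):
    """Score a candidate transformation, then roll it back."""
    conn.execute("SAVEPOINT trial")
    conn.execute(sql)
    score = quality(conn)
    conn.execute("ROLLBACK TO trial")
    conn.execute("RELEASE trial")
    return score

# Candidate transformations contributed by (hypothetical) cleaning operators.
candidates = [
    "DELETE FROM cities WHERE population IS NULL",
    "UPDATE cities SET name = 'Paris' WHERE name = 'paris'",
]
best = max(candidates, key=lambda sql: trial(conn, sql))
conn.execute(best)
print(best)  # the casing fix wins: it repairs the duplicate without dropping a row
```

The savepoint-and-rollback trial is one simple way to evaluate candidates side-effect-free; the incremental evaluation mentioned in the abstract avoids even this recomputation by updating aggregate values as transformations are applied.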