Empath: Understanding Topic Signals in Large-Scale Text

Human language is colored by a broad range of topics, but existing text analysis tools focus on only a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like ‘bleed’ and ‘punch’ to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath’s data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.

A survey of sparse representation: algorithms and applications

Sparse representation has attracted much attention from researchers in the fields of signal processing, image processing, computer vision and pattern recognition. Sparse representation also has a good reputation in both theoretical research and practical applications. Many different algorithms have been proposed for sparse representation. The main purpose of this article is to provide a comprehensive study and an updated review of sparse representation and to offer guidance for researchers. The taxonomy of sparse representation methods can be studied from various viewpoints. For example, in terms of the different norm minimizations used in sparsity constraints, the methods can be roughly categorized into five groups: sparse representation with l_0-norm minimization, sparse representation with l_p-norm (0<p<1) minimization, sparse representation with l_1-norm minimization and sparse representation with l_{2,1}-norm minimization. In this paper, a comprehensive overview of sparse representation is provided. The available sparse representation algorithms can also be empirically categorized into four groups: greedy strategy approximation, constrained optimization, proximity algorithm-based optimization, and homotopy algorithm-based sparse representation. The rationales of the different algorithms in each category are analyzed and a wide range of sparse representation applications are summarized, which helps reveal the potential of sparse representation theory. In addition, an experimental comparative study of these sparse representation algorithms is presented. The Matlab code used in this paper is available at: http://…/lunwen.html.
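
To make the l_1-norm, proximity-based family concrete, here is a minimal sketch of iterative soft-thresholding (ISTA), a standard proximal-gradient method for the l_1-regularized least-squares problem. It is an illustration under stated assumptions, not the survey's own Matlab code: the function names `soft_threshold` and `ista`, the list-of-rows dictionary layout, and the conservative step size are all choices made for this example.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(D, y, lam, n_iter=200):
    """Approximately minimize 0.5 * ||D x - y||^2 + lam * ||x||_1.

    D is a non-zero dictionary given as a list of rows (m x n);
    y has length m. Returns the coefficient vector x of length n.
    """
    m, n = len(D), len(D[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # D^T D, so 1 / L is a safe (if conservative) step size.
    L = sum(D[i][j] ** 2 for i in range(m) for j in range(n))
    step = 1.0 / L
    x = [0.0] * n
    for _ in range(n_iter):
        # Residual r = D x - y.
        r = [sum(D[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the smooth term: D^T r.
        g = [sum(D[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by soft-thresholding (the proximal step).
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x
```

With an identity dictionary the minimizer is simply the soft-thresholded signal, which gives an easy sanity check: small entries of `y` are driven exactly to zero, and this is the sparsity-inducing effect of the l_1 penalty.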

Sentence Similarity Learning by Lexical Decomposition and Composition

Most conventional sentence similarity methods focus only on the similar parts of two input sentences and simply ignore the dissimilar parts, which usually give us clues about the semantic meaning of the sentences. In this work, we propose a model that takes into account both the similarities and dissimilarities by decomposing and composing lexical semantics over sentences. The model represents each word as a vector, and calculates a semantic matching vector for each word based on all words in the other sentence. Then, each word vector is decomposed into a similar component and a dissimilar component based on the semantic matching vector. After this, a two-channel CNN model is employed to capture features by composing the similar and dissimilar components. Finally, a similarity score is estimated over the composed feature vectors. Experimental results show that our model achieves state-of-the-art performance on the answer sentence selection task and a comparable result on the paraphrase identification task.
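
The matching and decomposition steps described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the matching vector is the single most cosine-similar word in the other sentence, and that the decomposition is an orthogonal projection onto that matching vector. The abstract fixes neither choice, and the helper names (`matching_vector`, `decompose`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Assumes both vectors are non-zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def matching_vector(word, other_sentence):
    """Pick the word vector in the other sentence most similar to `word`."""
    return max(other_sentence, key=lambda w: cosine(word, w))


def decompose(word, match):
    """Split `word` into a component parallel to `match` (similar part)
    and a component perpendicular to it (dissimilar part)."""
    scale = dot(word, match) / dot(match, match)
    similar = [scale * m for m in match]
    dissimilar = [w - s for w, s in zip(word, similar)]
    return similar, dissimilar
```

By construction the two components sum back to the original word vector, so no information is discarded; a downstream model (the two-channel CNN in the abstract) would then consume the similar and dissimilar parts as separate channels.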

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, has ushered in the era of mobile big data (MBD). Collecting MBD is unprofitable unless suitable analytics and learning methods are used to extract meaningful information and hidden patterns from the data. This article presents an overview and brief tutorial of deep learning in MBD analytics and discusses a scalable learning framework over Apache Spark. Specifically, distributed deep learning is executed as an iterative MapReduce computation on many Spark workers. Each Spark worker learns a partial deep model on a partition of the overall MBD, and a master deep model is then built by averaging the parameters of all partial models. This Spark-based framework speeds up the learning of deep models consisting of many hidden layers and millions of parameters. We use a context-aware activity recognition application with a real-world dataset containing millions of samples to validate our framework and assess its speedup effectiveness.
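
The iterative scheme of training partial models and then averaging their parameters can be sketched in plain Python. This is a toy illustration only: a two-parameter linear model stands in for a deep network, ordinary lists stand in for Spark partitions, and the names `local_train`, `average`, and `distributed_fit` are hypothetical, not part of Spark or of the article's framework.

```python
def local_train(params, partition, lr=0.1, epochs=5):
    """'Map' step: refine the master parameters on one data partition
    with a few epochs of gradient descent on squared error."""
    w, b = params
    n = len(partition)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in partition:
            err = w * x + b - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return (w, b)


def average(models):
    """'Reduce' step: average each parameter across the partial models."""
    k = len(models)
    return tuple(sum(m[i] for m in models) / k for i in range(len(models[0])))


def distributed_fit(partitions, rounds=300):
    """Iterate map (local training) and reduce (parameter averaging)."""
    master = (0.0, 0.0)
    for _ in range(rounds):
        partial = [local_train(master, p) for p in partitions]
        master = average(partial)
    return master
```

In a real Spark job the list comprehension would be replaced by a distributed map over partitions, with the averaged parameters broadcast back to the workers at the start of each round; the sequential loop here only mirrors the data flow.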

SIFT: An Algorithm for Extracting Structural Information From Taxonomies

In this work we present SIFT, a 3-step algorithm for analyzing the structural information represented by a taxonomy. The major advantage of this algorithm is its capability to leverage the information inherent in the hierarchical structures of taxonomies to infer correspondences that allow the taxonomies to be merged in a later step. This method is particularly relevant in scenarios where taxonomy alignment techniques that exploit textual information from taxonomy nodes cannot operate successfully.

A Streaming Algorithm for Crowdsourced Data Classification

We propose a streaming algorithm for the binary classification of data based on crowdsourcing. The algorithm learns the competence of each labeller by comparing her labels to those of other labellers on the same tasks and uses this information to minimize the prediction error rate on each task. We provide performance guarantees of our algorithm for a fixed population of independent labellers. In particular, we show that our algorithm is optimal in the sense that the cumulative regret compared to the optimal decision with known labeller error probabilities is finite, independently of the number of tasks to label. The complexity of the algorithm is linear in the number of labellers and the number of tasks, up to some logarithmic factors. Numerical experiments illustrate the performance of our algorithm compared to existing algorithms, including simple majority voting and expectation-maximization algorithms, on both synthetic and real datasets.
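
For intuition about the benchmark in the regret statement, the following sketch contrasts simple majority voting with a decision rule that uses known labeller error probabilities. It is not the streaming algorithm of the abstract, which must estimate competences from agreement between labellers; here the error probabilities are simply handed in, labels are assumed to be +1 or -1, and the function names are hypothetical.

```python
import math


def majority_vote(labels):
    """Unweighted baseline: sign of the sum of +1/-1 labels (ties go to +1)."""
    return 1 if sum(labels) >= 0 else -1


def oracle_weighted_vote(labels, error_probs):
    """Decision when each labeller's error probability p (0 < p < 1) is known.

    Each label is weighted by the log-odds of being correct,
    log((1 - p) / p), which is the maximum-likelihood rule for
    independent labellers and a uniform prior on the true label.
    """
    score = sum(l * math.log((1.0 - p) / p)
                for l, p in zip(labels, error_probs))
    return 1 if score >= 0 else -1
```

One reliable labeller can outweigh two unreliable ones: with labels `[1, -1, -1]` and error probabilities `[0.1, 0.4, 0.4]`, the majority picks -1 while the weighted rule picks +1.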

Approximate Hamming distance in a stream

We consider the problem of computing a (1+\epsilon)-approximation of the Hamming distance between a pattern of length n and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an O(\epsilon^{-4} \log^2 n) bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an O(\epsilon^{-2}\sqrt{n}\log n) bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for (1+\epsilon)-approximate Hamming distance which give worst case running time guarantees per arriving symbol. (1) For binary input alphabets there is an O(\epsilon^{-3} \sqrt{n} \log^{2} n) space and O(\epsilon^{-2} \log{n}) time streaming (1+\epsilon)-approximate Hamming distance algorithm. (2) For general input alphabets there is an O(\epsilon^{-5} \sqrt{n} \log^{4} n) space and O(\epsilon^{-4} \log^3 {n}) time streaming (1+\epsilon)-approximate Hamming distance algorithm.
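
As a point of reference for the problem statement, here is the naive exact baseline: it keeps the last n symbols of the stream and recomputes the Hamming distance to the pattern for every arriving symbol. It uses O(n) space and O(n) time per symbol, which is exactly the cost the approximate algorithms above are designed to avoid. The function name `streaming_hamming` is an assumption made for this sketch.

```python
from collections import deque


def streaming_hamming(pattern, stream):
    """Yield the exact Hamming distance between `pattern` and each
    length-n window of `stream`, once n symbols have arrived."""
    n = len(pattern)
    window = deque(maxlen=n)  # automatically drops the oldest symbol
    for symbol in stream:
        window.append(symbol)
        if len(window) == n:
            yield sum(1 for p, s in zip(pattern, window) if p != s)
```

For example, `list(streaming_hamming("abc", "abcabd"))` compares the pattern with the windows `abc`, `bca`, `cab`, and `abd` in turn.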

A Simple Approach to Sparse Clustering

We consider the problem of sparse clustering, where it is assumed that only a subset of the features are useful for clustering purposes. In the framework of the COSA method of Friedman and Meulman (2004), subsequently improved in the form of the Sparse K-means method of Witten and Tibshirani (2010), we propose a very natural and simpler hill-climbing approach that is competitive with these two methods.

Active Learning from Positive and Unlabeled Data

In recent years, active learning has evolved into a popular paradigm for using user feedback to improve the accuracy of learning algorithms. Active learning works by selecting the most informative sample among the unlabeled data and querying the user for the label of that point. Many different methods, such as uncertainty sampling and minimum risk sampling, have been used to select the most informative sample in active learning. Although many active learning algorithms have been proposed so far, most of them work with binary or multi-class classification problems and therefore cannot be applied to problems in which only samples from one class and a set of unlabeled data are available. Such problems arise in many real-world situations and are known as the problem of learning from positive and unlabeled data. In this paper we propose an active learning algorithm that can work when only samples of one class and a set of unlabeled data are available. Our method works by separately estimating the probability density of positive and unlabeled points and then computing the expected value of informativeness, which eliminates a hyper-parameter and yields a better measure of informativeness. Experiments and empirical analysis show promising results compared to other similar methods.

A Survey on Domain-Specific Languages for Machine Learning in Big Data

The amount of data generated in modern society is increasing rapidly. New problems and novel approaches to data capture, storage, analysis and visualization are responsible for the emergence of the Big Data research field. Machine Learning algorithms can be used in Big Data to make better and more accurate inferences. However, because of the challenges Big Data imposes, these algorithms need to be adapted and optimized for specific applications. One important decision made by software engineers is the choice of the language used to implement these algorithms. Therefore, this literature survey identifies and describes domain-specific languages and frameworks used for Machine Learning in Big Data. With this information, software engineers can make more informed choices and beginners can get an overview of the main languages used in this domain.

1D Many-body localized Floquet systems II: Symmetry-Broken phases

Sachdev-Ye-Kitaev Model and Thermalization on the Boundary of Many-Body Localized Fermionic Symmetry Protected Topological States

Blind score normalization method for PLDA based speaker recognition

Modelling collinear and spatially correlated data

Augur: Mining Human Behaviors from Fiction to Power Interactive Systems

Revising a Nice Cycle Lemma and its Consequences

Recovering the number of clusters in data sets with noise features using feature rescaling factors

Comparing Graphs of Different Sizes

Hyperbolic Anderson Model with space-time homogeneous Gaussian noise

Synthesis of fast multiplication algorithms for arbitrary tensors

Dirichlet approximation of equilibrium distributions in Cannings models with mutation

Improved Bounds for Shortest Paths in Dense Distance Graphs

Moving Target Defense for Web Applications using Bayesian Stackelberg Games

Latent Skill Embedding for Personalized Lesson Sequence Recommendation

Unbounded Human Learning: Optimal Scheduling for Spaced Repetition

Auditing Black-box Models by Obscuring Features

An Improved Gap-Dependency Analysis of the Noisy Power Method

The two subset recurrent property of Markov chains

Finding Needle in a Million Metrics: Anomaly Detection in a Large-scale Computational Advertising Platform

Online Low-Rank Tensor Subspace Tracking from Incomplete Data by CP Decomposition using Recursive Least Squares

New extremal binary self-dual codes of length 68 via short kharaghani array over f_2 + uf_2

Scalable Generation of Scale-free Graphs

Variational Inference for On-line Anomaly Detection in High-Dimensional Time Series

Energetics of Synchronization in Coupled Oscillators

Submodular Learning and Covering with Response-Dependent Costs

Cartan coherent configurations

Limits of Mappings

On the Power of Advice and Randomization for Online Bipartite Matching

On two conjectures about the proper connection number of graphs

The intersection ring of matroids

Improved bounds for hypohamiltonian graphs

Les lois Zêta pour l’arithmétique

Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

Lens depth function and k-relative neighborhood graph: versatile tools for ordinal data analysis

Paging with Multiple Caches

Large deviations principle for biorthogonal ensembles and variational formulation for the Dykema-Haagerup distribution

Faster Algorithms for the Maximum Common Subtree Isomorphism Problem

Query Expansion via structural motifs in Wikipedia Graph

Variable Effects of Climate on Forest Growth in Relation to Ecosystem State

Critical behavior of the 2D Ising model with long-range correlated disorder

Fluctuations of bridges, reciprocal characteristics, and concentration of measure

Trapezoidal Diagrams, Upward Triangulations, and Prime Catalan Numbers

Petrarch 2 : Petrarcher

The swept rule for breaking the latency barrier in time advancing two-dimensional PDEs

A graph which recognizes idempotents of a commutative ring

Intermittency fronts for space-time fractional stochastic partial differential equations in $(d+1)$ dimensions

Search Improves Label for Active Learning

On the Law of Large Numbers for Discrete Fourier Transform

Local times of stochastic differential equations driven by fractional Brownian motions

Reliability estimates for three factor score predictors

The IBM 2016 Speaker Recognition System

Stuck in a What? Adventures in Weight Space

Lecture notes on Gaussian multiplicative chaos and Liouville Quantum Gravity

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Stability analysis for a class of nonlinear time-changed systems

Parsimonious modeling with Information Filtering Networks

Computing approximate PSD factorizations

Post-selection inference for L1-penalized likelihood models

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size

The Possibilities and Limitations of Private Prediction Markets

On Study of the Binarized Deep Neural Network for Image Classification

Recursive cubes of rings as models for interconnection networks

Automatic Moth Detection from Trap Images for Pest Management

Discrete Distribution Estimation under Local Privacy

The Myopia of Crowds: A Study of Collective Evaluation on Stack Exchange

Domain Specific Author Attribution Based on Feedforward Neural Network Language Models

Improved Accent Classification Combining Phonetic Vowels with Acoustic Features

Boundary value problems for statistics of diffusion in a randomly switching environment: PDE and SDE perspectives

3-regular colored graphs and classification of surfaces

Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling

Learning to Generate with Memory

The robust recoverable spanning tree problem with interval costs is polynomially solvable

TRIÈST: Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size

Max-Margin Nonparametric Latent Feature Models for Link Prediction

On symmetries in phylogenetic trees

Some results on the statistics of hull perimeters in large planar triangulations and quadrangulations

Parametric Prediction from Parametric Agents

On the Kozachenko-Leonenko entropy estimator

Automatically Proving Mathematical Theorems with Evolutionary Algorithms and Proof Assistants

Feature ranking for multi-label classification using Markov Networks

Asymptotic consistency and order specification for logistic classifier chains in multi-label learning

Fractionally integrated inverse stable subordinators

On Isomorphisms of Vertex-transitive Graphs

Enumeration and Maximum Number of Minimal Connected Vertex Covers in Graphs

A Bayesian Approach to the Data Description Problem

Differentiation of the Cholesky decomposition

The local metric dimension of the lexicographic product of graphs

The Kelmans-Seymour conjecture II: 2-vertices in $K_4^-$

Regression of ranked responses when raw responses are censored

Multilingual Twitter Sentiment Classification: The Role of Human Annotators

Stochastic Shortest Path with Energy Constraints in POMDPs

Time and Activity Sequence Prediction of Business Process Instances

A new S-type eigenvalue localization set for tensors and its applications

Bayesian Exploration: Incentivizing Exploration in Bayesian Games

Ultradense Word Embeddings by Orthogonal Transformation

Group Equivariant Convolutional Networks

A parsimonious theory of evidence-based choice

Blockmodels: A R-package for estimating in Latent Block Model and Stochastic Block Model, with various probability functions, with or without covariates

Structure of inactive states of a binary Lennard-Jones mixture

On occupation times of the first and third quadrants for planar Brownian motion

Local motifs in GeS$_2$-Ga$_2$S$_3$ glasses

Multicolour Ramsey Numbers of Odd Cycles

Noisy population recovery in polynomial time

Permutation groups and derangements of odd prime order

The Group Inverse of extended Symmetric and Periodic Jacobi Matrices

Online Dual Coordinate Ascent Learning

A Variational Algorithm for Bayesian Variable Selection

Bounds for spherical codes

Swap-invariant and exchangeable random sequences and measures

On the additive chromatic number of several families of graphs