Feature Selection for Classification under Anonymity Constraint

Over the last decade, the proliferation of online platforms and their adoption by billions of users have greatly heightened users' privacy risk. In fact, security researchers have shown that sparse microdata containing information about a user's online activities, although anonymous, can still be used to disclose the user's identity by cross-referencing the data with other data sources. To preserve user privacy, existing works propose several methods (k-anonymity, l-diversity, differential privacy) that ensure a dataset meant to be shared or published bears a small identity disclosure risk. However, the majority of these methods modify the data in isolation, without considering its utility in subsequent knowledge discovery tasks, which makes these datasets less informative. In this work, we consider labeled data that are generally used for classification, and propose two methods for feature selection with two goals: first, on the reduced feature set the data has a small disclosure risk, and second, the utility of the data is preserved for performing a classification task. Experimental results on various real-world datasets show that the methods are effective and useful in practice.
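To make the first goal concrete, the sketch below checks whether a dataset restricted to a chosen feature subset satisfies k-anonymity, one of the privacy notions the abstract names. It is a minimal illustration, not the paper's feature selection method: the function names, the toy records, and the column layout are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def min_group_size(rows: Sequence[Sequence[str]], features: Sequence[int]) -> int:
    """Size of the smallest group of rows that agree on the chosen feature columns."""
    groups = Counter(tuple(row[i] for i in features) for row in rows)
    return min(groups.values(), default=0)


def is_k_anonymous(rows: Sequence[Sequence[str]], features: Sequence[int], k: int) -> bool:
    """True if every combination of selected feature values is shared by at least k rows."""
    return min_group_size(rows, features) >= k


# Hypothetical toy records: (age band, zip prefix, sex).
rows = [
    ("30-39", "462", "F"),
    ("30-39", "462", "M"),
    ("30-39", "463", "F"),
    ("40-49", "462", "F"),
    ("40-49", "462", "M"),
    ("40-49", "463", "M"),
]

all_features = min_group_size(rows, [0, 1, 2])  # every row is unique
age_only = min_group_size(rows, [0])            # two groups of three rows
```

On this toy data, keeping all three columns leaves every record unique, while keeping only the age band places each record in a group of three. A feature selection method under an anonymity constraint would additionally have to weigh how much classification utility each candidate subset retains.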


Recent Advances in Convolutional Neural Networks

In the last few years, deep learning has led to very good performance on a variety of problems, such as object recognition, speech recognition and natural language processing. Among different types of deep neural networks, convolutional neural networks have been most extensively studied. Due to the lack of training data and computing power in the early days, it was hard to train a large high-capacity convolutional neural network without overfitting. Recently, with the rapid growth of data size and the increasing power of graphics processing units, many researchers have improved convolutional neural networks and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We also introduce some applications of convolutional neural networks in computer vision.


Transforming Javascript Event-Loop Into a Pipeline

The development of a real-time web application often starts with a feature-driven approach that allows developers to react quickly to user feedback. However, this approach scales poorly in performance. Yet the user base can increase by an order of magnitude in a matter of hours, and this first approach is unable to deal with the highest connection spikes. This leads the development team to shift to a scalable approach, often linked to a new development paradigm such as dataflow programming. This shift of technology is disruptive and continuity-threatening. To avoid it, we propose to abstract the feature-driven development into a more scalable high-level language. Indeed, reasoning on this high-level language makes it possible to cope dynamically with changes in user-base size. We propose a compilation approach that transforms a single-threaded, real-time Javascript web application into a network of small independent parts communicating by message streams. We named these parts fluxions, a contraction of flux (French for flow) and function. The independence of these parts allows them to execute in parallel, and allows an application to be organized on several processors to cope with its load, in a similar way to how network routers handle IP traffic. We test this approach by applying the compiler to a real web application. We transform this application to parallelize the execution of an independent part and present the result.
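The idea of independent parts exchanging messages can be sketched as follows. This is only an illustration written in Python rather than Javascript, with a sequential dispatcher standing in for parallel execution; it is not the compiler described in the abstract, and the names `run`, `parse`, `make_counter`, and the reserved destination `"output"` are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Message = object
Part = Callable[[Message], List[Tuple[str, Message]]]


def run(parts: Dict[str, Part], start: str, message: Message) -> List[Message]:
    """Deliver messages between named parts until none are pending."""
    pending = deque([(start, message)])
    outputs: List[Message] = []
    while pending:
        name, msg = pending.popleft()
        if name == "output":  # reserved sink that collects results
            outputs.append(msg)
            continue
        # Each part only sees its input message and returns addressed messages.
        pending.extend(parts[name](msg))
    return outputs


def parse(msg: str) -> List[Tuple[str, str]]:
    """Stateless part: split a line into one message per word."""
    return [("count", word) for word in msg.split()]


def make_counter() -> Part:
    """Stateful part: its state is private, so it could live on another worker."""
    seen: Dict[str, int] = {}

    def count(word: str) -> List[Tuple[str, Tuple[str, int]]]:
        seen[word] = seen.get(word, 0) + 1
        return [("output", (word, seen[word]))]

    return count


results = run({"parse": parse, "count": make_counter()}, "parse", "a b a")
```

Because each part touches only its own state and the messages it receives, the dispatcher could in principle place different parts on different processors without changing their code.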


Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns

We study how to obtain concise descriptions of discrete multivariate sequential data in terms of rich multivariate sequential patterns that can capture potentially highly interesting (cor)relations between sequences. To this end we allow our pattern language to span over the alphabets (domains) of all sequences, allow patterns to overlap temporally, and allow for gaps in their occurrences. We formalise our goal by the Minimum Description Length principle, by which our objective is to discover the set of patterns that provides the most succinct description of the data. To discover good pattern sets, we introduce Ditto, an efficient algorithm to approximate the ideal result. We support our claim with a set of experiments on both synthetic and real data.


The Impact of Technical Domain Expertise on Search Behavior and Task Outcome

Domain expertise is regarded as one of the key factors impacting search success: experts are known to write more effective queries, to select the right results on the result page, and to find answers satisfying their information needs. Search transaction logs play a crucial role in result ranking. Yet despite the variety in expertise levels of users, all prior interactions are treated alike, suggesting that weighting in expertise can improve the ranking for informational tasks. The main aim of this paper is to investigate the impact of high levels of technical domain expertise on both search behavior and task outcome. We conduct an online user study with searchers proficient in programming languages. We focus on Java and Javascript, yet we believe that our study and results are applicable to other expertise-sensitive search tasks. The main findings are three-fold: First, we constructed expertise tests that effectively measure technical domain expertise and correlate well with self-reported expertise. Second, we showed that there is a clear position bias, but technical domain experts were less affected by it. Third, we found that general expertise helped in finding the correct answers, but the domain experts were more successful as they managed to detect better answers. Our work uses explicit tests to determine user expertise levels, which is an important step toward fully automatic detection of expertise levels based on interaction behavior. A deeper understanding of the impact of expertise on search behavior and task outcome can enable more effective use of expert behavior in search logs, essentially making everyone search like an expert.


Beauty and Brains: Detecting Anomalous Pattern Co-Occurrences

Our world is filled with both beautiful and brainy people, but how often does a Nobel Prize winner also win a beauty pageant? Let us assume that someone who is both very beautiful and very smart is rarer than we would expect from the combination of the number of beautiful and brainy people. Of course, there will always be some individuals who defy this stereotype; these beautiful brainy people are exactly the class of anomaly we focus on in this paper. They do not possess rare qualities, but it is the unexpected combination of factors that makes them stand out. In this paper we define the above-described class of anomaly and propose a method to quickly identify such anomalies in transaction data. Further, as we take a pattern set based approach, our method readily explains why a transaction is anomalous. The effectiveness of our method is thoroughly verified with a wide range of experiments on both real-world and synthetic data.
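The notion of an "unexpected combination" can be illustrated by comparing how often two patterns occur together with how often they would co-occur if they were independent. The sketch below uses that plain ratio on made-up transactions; it is an assumption chosen for illustration and not the scoring function of the paper, and the names `support` and `cooccurrence_ratio` are hypothetical.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], pattern: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t) / len(transactions)


def cooccurrence_ratio(
    transactions: List[Transaction], a: FrozenSet[str], b: FrozenSet[str]
) -> float:
    """Observed joint frequency of a and b divided by the frequency expected under independence."""
    expected = support(transactions, a) * support(transactions, b)
    observed = support(transactions, a | b)
    return observed / expected


# Made-up data: 5 of 10 people are beautiful, 4 are brainy, only 1 is both.
transactions: List[Transaction] = (
    [frozenset({"beautiful", "brainy"})]
    + [frozenset({"beautiful"})] * 4
    + [frozenset({"brainy"})] * 3
    + [frozenset()] * 2
)

ratio = cooccurrence_ratio(
    transactions, frozenset({"beautiful"}), frozenset({"brainy"})
)
```

A ratio well below 1 means the two patterns appear together less often than independence would predict, so a transaction containing both is a candidate anomaly even though neither pattern is rare on its own.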


News Across Languages – Cross-Lingual Document Similarity and Event Tracking

In today’s world, we follow news which is distributed globally. Significant events are reported by different sources and in different languages. In this work, we address the problem of tracking events in a large multilingual stream. Within the recently developed Event Registry system, we examine two aspects of this problem: how to compare articles in different languages and how to link collections of articles in different languages which refer to the same event. Taking a multilingual stream and clusters of articles from each language, we compare different cross-lingual document similarity measures based on Wikipedia. This allows us to compute the similarity of any two articles regardless of language. Building on previous work, we show there are methods which scale well and can compute a meaningful similarity between articles from languages with little or no direct overlap in the training data. Using this capability, we then propose an approach to link clusters of articles across languages which represent the same event. We provide an extensive evaluation of the system as a whole, as well as an evaluation of the quality and robustness of the similarity measure and the linking algorithm.


S2RDF: RDF Querying with SPARQL on Spark

RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. Yet, the ever-increasing size of RDF data collections makes it more and more infeasible to store and process them on a single machine, raising the need for distributed approaches. Instead of building a standalone but closed distributed RDF store, we endorse the usage of existing infrastructures for Big Data processing, e.g. Hadoop. However, SPARQL query performance is a major challenge as these platforms are not designed for RDF processing from the ground up. Thus, existing Hadoop-based approaches often favor certain query pattern shapes while performance drops significantly for other shapes. In this paper, we describe a novel relational partitioning schema for RDF data called ExtVP that uses semi-join based preprocessing, akin to the concept of Join Indices in relational databases, to efficiently minimize query input size regardless of its pattern shape and diameter. Our prototype system S2RDF is built on top of Spark and uses its relational interface to execute SPARQL queries over ExtVP. We demonstrate its superior performance in comparison to state-of-the-art SPARQL-on-Hadoop approaches using the recent WatDiv test suite. S2RDF achieves sub-second runtimes for the majority of queries on a billion-triple RDF graph.
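The effect of semi-join based preprocessing can be sketched in a few lines. The example assumes one two-column (subject, object) table per predicate and reduces one table by keeping only rows whose object occurs as a subject in another; the abstract does not spell out the table layout, so this layout, the function names, and the toy triples are assumptions, and the code is plain Python rather than Spark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Row = Tuple[str, str]


def vertical_partition(triples: Iterable[Triple]) -> Dict[str, List[Row]]:
    """Group (subject, predicate, object) triples into one (s, o) table per predicate."""
    tables: Dict[str, List[Row]] = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def semi_join_object_subject(left: List[Row], right: List[Row]) -> List[Row]:
    """Keep rows of `left` whose object appears as a subject in `right`."""
    right_subjects = {s for s, _ in right}
    return [(s, o) for s, o in left if o in right_subjects]


# Hypothetical toy graph.
triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("carol", "follows", "dave"),
    ("bob", "likes", "item1"),
    ("dave", "likes", "item2"),
]

vp = vertical_partition(triples)
# Rows of "follows" that can take part in a join ?x follows ?y . ?y likes ?z
reduced = semi_join_object_subject(vp["follows"], vp["likes"])
```

A query joining "follows" with "likes" on the shared variable would then read the two-row reduced table instead of the three-row original. Precomputing such reductions trades storage for smaller query inputs.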


Multimodal Deep Learning Library

This is the documentation of the Multimodal Deep Learning Library, MDL, which is written in C++. It explains in detail the principles and implementations of the Restricted Boltzmann Machine, Deep Neural Network, Deep Belief Network, Denoising Autoencoder, Deep Boltzmann Machine, Deep Canonical Correlation Analysis, and modal prediction model. MDL uses OpenCV 3.0.0, which is the only dependency of this library. Most of its implementation has been tested on Mac OS. It also provides interfaces for reading various data sets such as MNIST, CIFAR, XRMB, and AVLetters. To read mat files, Matlab must be installed because the library uses the Matlab/C++ interface provided by Matlab. Multiple model options are provided, including different gradient descent methods, loss functions, annealing methods, and activation functions. These options are easy to extend given the structure of MDL, so MDL could be used as a framework for testing in deep learning.


Predicting the Co-Evolution of Event and Knowledge Graphs

Embedding learning, a.k.a. representation learning, has been shown to be able to model large-scale semantic knowledge graphs. A key concept is a mapping of the knowledge graph to a tensor representation whose entries are predicted by models using latent representations of generalized entities. Knowledge graphs are typically treated as static: a knowledge graph grows more links when more facts become available, but the ground truth values associated with links are considered time invariant. In this paper we address the issue of knowledge graphs where triple states depend on time. We assume that changes in the knowledge graph always arrive in the form of events, in the sense that the events are the gateway to the knowledge graph. We train an event prediction model which uses both knowledge graph background information and information on recent events. By predicting future events, we also predict likely changes in the knowledge graph and thus obtain a model for the evolution of the knowledge graph as well. Our experiments demonstrate that our approach performs well in a clinical application, a recommendation engine and a sensor network application.


Efficient Thresholded Correlation using Truncated Singular Value Decomposition

Two coloring problems on matrix graphs

Asymptotic properties of the derivative of self-intersection local time of fractional Brownian motion

Invariance of Qubit-Qutrit Separability Probabilities over Bloch Radii of Qubit and Qutrit Subsystems

Revealing the Mechanism of the Viscous-to-Elastic Crossover in Liquids

Perfect Matchings in Hypergraphs and the Erdős matching conjecture

Combinatorial solutions to integrable hierarchies

Disordered double Weyl node

Heuristic algorithms for finding distribution reducts in probabilistic rough set model

Computing the $L_1$ Geodesic Diameter and Center of a Polygonal Domain

Refined Error Bounds for Several Learning Algorithms

SR-Clustering: Semantic Regularized Clustering for Egocentric Photo Streams Segmentation

Weighted geometric distribution with a new characterisation of geometric distribution

Combinatorial and Probabilistic Formulae for Divided Symmetrization

Thick Points of High-Dimensional Gaussian Free Fields

Shell polynomials and dual birth-death processes

Hedging of covered options with linear market impact and gamma constraint

Linear Eigenvalue Statistics: An Indicator Ensemble Design for Situation Awareness of Power Systems

On tensor products of CSS Codes

Estimation and clustering in a semiparametric Poisson process stochastic block model for longitudinal networks

Move from Perturbed scheme to exponential weighting average

Stochastic simulators based optimization by Gaussian process metamodels: Application to maintenance investments planning issues (short title: Metamodel-based optimization of stochastic simulators)

Improved hypothesis testing in a general multivariate elliptical model

Estimating the conditional density by histogram type estimators and model selection

Implementation of deep learning algorithm for automatic detection of brain tumors using intraoperative IR-thermal mapping data

Coherence-resonance chimeras in a network of excitable elements

Ramifications of Hurwitz theory, KP integrability and quantum curves

Determinants Containing Powers of Generalized Fibonacci Numbers

The box dimension of random box-like self-affine sets

The Bi-Objective Workflow Satisfiability Problem and Workflow Resiliency

Convex Hulls of Lévy Processes

A Stochastically Evolving Non-local Search and Solutions to Inverse Problems with Sparse Data

FAASTA: A fast solver for total-variation regularization of ill-conditioned problems with application to brain imaging

On the Differential Privacy of Bayesian Inference

On the Impact of Identifiers on Local Decision

The free energy in a class of quantum spin systems and interchange processes

Two-faced processes and random number generators

The flag upper bound theorem for 3- and 5-manifolds

Proceedings 14th International Workshop on Foundations of Coordination Languages and Self-Adaptive Systems

Restricted Predicates for Hypothetical Datalog

On the Bandwidth of the Kneser Graph

Facility Deployment Decisions through Warp Optimization of Regressed Gaussian Processes

Stochastic C-stability and B-consistency of explicit and implicit Milstein-type schemes

The C-finite Ansatz Meets the Holonomic Ansatz

Existence, uniqueness, and regularity for stochastic evolution equations with irregular initial values

Stochastic Dual Ascent for Solving Linear Systems

On Distributed Cooperative Decision-Making in Multiarmed Bandits

Finite-size effects and switching times for Moran dynamics with mutation

A dynamic Bayesian Markov model for health economic evaluations of interventions against infectious diseases

Addressing Complex and Subjective Product-Related Queries with Customer Reviews