CaosDB – Research Data Management for Complex, Changing, and Automated Research Workflows

Here we present CaosDB, a Research Data Management System (RDMS) designed to ensure seamless integration of inhomogeneous data sources and repositories of legacy data. Its primary purpose is the management of data from the biomedical sciences, from both simulations and experiments, during the complete research data lifecycle. An RDMS for this domain faces particular challenges: research data arise in huge amounts, from a wide variety of sources, and traverse a highly branched path of further processing. To be accepted by its users, an RDMS must be built around the workflows and practices of the scientists and thus support changes in workflow and data structure. Nevertheless, it should encourage and support the development and observation of standards and facilitate the automation of data acquisition and processing with specialized software. The storage data model of an RDMS must reflect these complexities with appropriate semantics and ontologies while offering simple methods for finding, retrieving, and understanding relevant data. We show how CaosDB responds to these challenges and give an overview of the CaosDB Server, its data model, and its easy-to-learn CaosDB Query Language. We briefly discuss the status of the implementation, how we currently use CaosDB, and how we plan to use and extend it.

Siamese Neural Networks with Random Forest for detecting duplicate question pairs

Determining whether two given questions are semantically similar is a fairly challenging task given the different structures and forms that the questions can take. In this paper, we use Gated Recurrent Units (GRU) in combination with other widely used machine learning algorithms such as Random Forest, AdaBoost, and SVM for the similarity prediction task on a dataset released by Quora, consisting of about 400k labeled question pairs. We obtained the best result by using the Siamese adaptation of a Bidirectional GRU with a Random Forest classifier, which placed us among the top 24% in the Quora Question Pairs competition hosted on Kaggle.

Convergence of Value Aggregation for Imitation Learning

Value aggregation is a general framework for solving imitation learning problems. Based on the idea of data aggregation, it generates a policy sequence by iteratively interleaving policy optimization and evaluation in an online learning setting. While the existence of a good policy in the policy sequence can be guaranteed non-asymptotically, little is known about the convergence of the sequence or the performance of the last policy. In this paper, we debunk the common belief that value aggregation always produces a convergent policy sequence with improving performance. Moreover, we identify a critical stability condition for convergence and provide a tight non-asymptotic bound on the performance of the last policy. These new theoretical insights let us stabilize problems with regularization, which removes the inconvenient process of identifying the best policy in the policy sequence in stochastic problems.

Predictor Variable Prioritization in Nonlinear Models: A Genetic Association Case Study

The central aim of this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the ‘RelATive cEntrality’ (RATE) measure to prioritize candidate predictors that are not just marginally important, but whose associations also stem from significant covarying relationships with other variables in the data. We focus on illustrating RATE through Bayesian Gaussian process regression, although the methodological innovations apply to other, more general methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for outcomes generated by complex architectures. With detailed simulations and a botanical QTL mapping study, we show that applying RATE enables an explanation for this improved performance.

Optimistic Execution in Key-Value Store

Limitations of the CAP theorem imply that if availability is desired in the presence of network partitions, one must sacrifice sequential consistency, a consistency model that is more natural for system design. We focus on the problem of what a designer should do if she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provides a weaker (e.g., eventual or causal) consistency. We propose a detect-rollback based approach: the designer identifies a correctness predicate, say P, and continues to run the protocol while our system monitors P. If P is violated (because the underlying key-value store provides a weaker consistency), the system rolls back and resumes the computation at a state where P holds. We evaluate this approach in the Voldemort key-value store. Our experiments with a deployment of Voldemort on Amazon AWS show that using eventual consistency with monitoring can provide a 20 to 40% increase in throughput when compared with sequential consistency. We also show that the overhead of the monitor itself is small (typically less than 8%) and the latency of detecting violations is very low. For example, more than 99.9% of violations are detected in less than 1 second.
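The detect-rollback idea can be sketched in a few lines of Python. This is a minimal single-process illustration, not the authors' monitor or anything specific to Voldemort: the names `run_with_rollback` and `make_flaky_transfer`, the checkpoint-per-step policy, and the retry limit are all assumptions made for the example. A step is executed optimistically, the predicate P is checked afterwards, and on a violation the state is restored to the last checkpoint where P held.

```python
import copy
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]


def run_with_rollback(
    initial: State,
    steps: List[Callable[[State], None]],
    predicate: Callable[[State], bool],
    max_retries: int = 3,
) -> Tuple[State, int]:
    """Run steps optimistically; roll back whenever the predicate is violated.

    Returns the final state and the number of rollbacks performed.
    """
    state = copy.deepcopy(initial)
    if not predicate(state):
        raise ValueError("predicate must hold in the initial state")
    checkpoint = copy.deepcopy(state)  # last state known to satisfy P
    rollbacks = 0
    for step in steps:
        for _ in range(max_retries + 1):
            step(state)  # optimistic execution
            if predicate(state):
                checkpoint = copy.deepcopy(state)
                break
            # Violation detected: restore the last good state and retry.
            state = copy.deepcopy(checkpoint)
            rollbacks += 1
        else:
            raise RuntimeError("step kept violating the predicate")
    return state, rollbacks


def make_flaky_transfer(amount: int) -> Callable[[State], None]:
    """Hypothetical step: the first attempt loses the credit, as a stale read might."""
    calls = {"n": 0}

    def transfer(state: State) -> None:
        calls["n"] += 1
        state["a"] -= amount
        if calls["n"] > 1:
            state["b"] += amount

    return transfer


if __name__ == "__main__":
    final, n = run_with_rollback(
        {"a": 60, "b": 40},
        [make_flaky_transfer(10)],
        lambda s: s["a"] + s["b"] == 100,
    )
    print(final, n)
```

Here the predicate is the invariant that the two balances sum to 100. The first attempt breaks it, so the sketch rolls back once and the retry succeeds.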

Tracking Network Dynamics: a review of distances and similarity metrics

From longitudinal biomedical studies to social networks, graphs have emerged as a powerful framework for describing evolving interactions between agents in complex systems. In such studies, the data typically consist of a set of graphs representing a system’s state at different points in time or space. The analysis of the system’s dynamics depends on the selection of the appropriate tools. In particular, after specifying properties characterizing similarities between states, a critical step lies in the choice of a distance capable of reflecting such similarities. While the literature offers a number of distances that one could a priori choose from, their properties have been little investigated and no guidelines regarding the choice of such a distance have yet been provided. However, these distances’ sensitivity to perturbations in the network’s structure and their ability to identify important changes are crucial to the analysis, making the selection of an adequate metric a decisive, yet delicate, practical matter. In the spirit of Goldenberg, Zheng and Fienberg’s seminal 2009 review, the purpose of this article is to provide an overview of commonly used graph distances and an explicit characterization of the structural changes that they are best able to capture. To see how this translates to real-life situations, we use as a guiding thread for our discussion the application of these distances to the analysis of a longitudinal microbiome study, as well as to synthetic examples. Having unveiled some of the traditional distances’ shortcomings, we also suggest alternative similarity metrics and highlight their relative advantages in specific analysis scenarios. Above all, we provide some guidance for choosing one distance over another in certain types of applications. Finally, we show an application of these different distances to a network created from worldwide recipes.
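As a small illustration of what a graph distance measures, the sketch below computes two standard structural distances between snapshots of an undirected graph on the same node set: the Hamming distance (number of edges present in exactly one snapshot) and the Jaccard distance (one minus the ratio of shared edges to all edges). The function names and the edge-set representation are assumptions chosen for this example; it does not claim to reproduce the article's own selection of distances or its recommendations.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[int]


def edge_set(edges: Iterable[Tuple[int, int]]) -> Set[Edge]:
    """Store undirected edges as frozensets so (u, v) and (v, u) coincide."""
    return {frozenset(e) for e in edges}


def hamming_distance(g1: Set[Edge], g2: Set[Edge]) -> int:
    """Number of edges that appear in exactly one of the two graphs."""
    return len(g1 ^ g2)


def jaccard_distance(g1: Set[Edge], g2: Set[Edge]) -> float:
    """1 - |E1 & E2| / |E1 | E2|; defined as 0.0 for two empty graphs."""
    union = g1 | g2
    if not union:
        return 0.0
    return 1.0 - len(g1 & g2) / len(union)


if __name__ == "__main__":
    # Two snapshots of the same four-node system: edge (3, 4) is
    # replaced by edge (4, 1) between the first and second time point.
    g_t0 = edge_set([(1, 2), (2, 3), (3, 4)])
    g_t1 = edge_set([(1, 2), (2, 3), (4, 1)])
    print(hamming_distance(g_t0, g_t1), jaccard_distance(g_t0, g_t1))
```

Both distances look only at which edges differ, so they react identically to any rewired edge regardless of its role in the graph. That insensitivity to where a change happens is the kind of property a practitioner has to weigh when choosing a distance.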

Flexible Deep Neural Network Processing

The recent success of Deep Neural Networks (DNNs) has drastically improved the state of the art for many application domains. Although state-of-the-art DNNs achieve high accuracy, deploying them is a challenge since they typically require billions of expensive arithmetic computations. In addition, DNNs are typically deployed in ensembles to boost accuracy, which further exacerbates the system requirements. This computational overhead is an issue for many platforms, e.g. data centers and embedded systems, with tight latency and energy budgets. In this article, we introduce a flexible DNN ensemble processing technique, which achieves a large reduction in average inference latency while incurring a small to negligible accuracy drop. Our technique is flexible in that it allows for dynamic adaptation between quality of results (QoR) and execution runtime. We demonstrate the effectiveness of the technique on AlexNet and ResNet-50 using the ImageNet dataset. This technique can also easily handle other types of networks.

Learning to Prune Filters in Convolutional Neural Networks

Many state-of-the-art computer vision algorithms use large-scale convolutional neural networks (CNNs) as basic building blocks. These CNNs are known for their huge number of parameters, high redundancy in weights, and tremendous consumption of computing resources. This paper presents a learning algorithm to simplify and speed up these CNNs. Specifically, we introduce a ‘try-and-learn’ algorithm to train pruning agents that remove unnecessary CNN filters in a data-driven way. With the help of a novel reward function, our agents remove a significant number of filters in CNNs while maintaining performance at a desired level. Moreover, this method provides easy control of the tradeoff between network performance and its scale. The performance of our algorithm is validated with comprehensive pruning experiments on several popular CNNs for visual recognition and semantic segmentation tasks.

Generalized two-dimensional linear discriminant analysis with regularization

Recent advances show that two-dimensional linear discriminant analysis (2DLDA) is a successful matrix-based dimensionality reduction method. However, 2DLDA may suffer from singularity in theory and is sensitive to outliers. In this paper, a generalized Lp-norm 2DLDA framework with regularization for an arbitrary $p > 0$ is proposed, named G2DLDA. G2DLDA makes two main contributions. First, the G2DLDA model uses an arbitrary Lp-norm to measure the between-class and within-class scatter, so a proper p can be selected to achieve robustness. Second, by introducing an extra regularization term, G2DLDA achieves better generalization performance and solves the singularity problem. In addition, G2DLDA can be solved through a series of convex problems with equality constraints, each of which has a closed-form solution. Its convergence can be guaranteed theoretically when $1 \leq p \leq 2$. Preliminary experimental results on three contaminated human face databases show the effectiveness of the proposed G2DLDA.

Curiosity-driven reinforcement learning with homeostatic regulation

We propose a curiosity reward based on information theory principles and consistent with the animal instinct to maintain certain critical parameters within a bounded range. Our experimental validation shows the added value of the additional homeostatic drive to enhance the overall information gain of a reinforcement learning agent interacting with a complex environment using continuous actions. Our method builds upon two ideas: i) taking advantage of a new Bellman-like equation of information gain, and ii) simplifying the computation of the local rewards by avoiding the approximation of complex distributions over continuous states and actions.

Sliding Suffix Tree

We consider a sliding window over a stream of characters from some finite alphabet. The user wants to perform deterministic substring matching on the current sliding window content and obtain positions of the matches. We present an indexed version of the sliding window based on a suffix tree. The data structure has optimal-time queries, $\Theta(m + occ)$, and amortized constant-time updates, where $m$ is the length of the query string and $occ$ the number of occurrences.
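To make the query interface concrete, here is a naive baseline in Python. It is explicitly not the suffix-tree structure described above: it keeps the window in a `collections.deque` and rescans it on every query, so a query costs time proportional to the window length rather than $\Theta(m + occ)$. The class name `NaiveSlidingWindow` and the choice to report positions as 0-based offsets into the whole stream are assumptions made for this sketch.

```python
from collections import deque
from typing import Deque, List


class NaiveSlidingWindow:
    """Unindexed sliding window over a character stream (baseline only)."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        # A bounded deque drops the oldest character automatically.
        self._window: Deque[str] = deque(maxlen=size)
        self._consumed = 0  # total characters seen so far

    def append(self, ch: str) -> None:
        """Slide the window forward by one character."""
        self._window.append(ch)
        self._consumed += 1

    def find_all(self, pattern: str) -> List[int]:
        """Return stream positions of all (possibly overlapping) matches."""
        if not pattern:
            return []
        text = "".join(self._window)
        offset = self._consumed - len(text)  # stream index of window start
        positions: List[int] = []
        i = text.find(pattern)
        while i != -1:
            positions.append(offset + i)
            i = text.find(pattern, i + 1)
        return positions


if __name__ == "__main__":
    window = NaiveSlidingWindow(5)
    for ch in "abracadabra":
        window.append(ch)
    # The window now holds the last five characters, "dabra".
    print(window.find_all("abra"))
```

An indexed structure such as the one in the abstract answers the same `find_all` query without rescanning the window, which is what makes it attractive for long windows and frequent queries.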

Statistically Motivated Second Order Pooling

Second-order pooling, a.k.a. bilinear pooling, has proven effective for visual recognition. The recent progress in this area has focused on either designing normalization techniques for second-order models, or compressing the second-order representations. However, these two directions have typically been followed separately, and without any clear statistical motivation. Here, by contrast, we introduce a statistically-motivated framework that jointly tackles normalization and compression of second-order representations. To this end, we design a parametric vectorization layer, which maps a covariance matrix, known to follow a Wishart distribution, to a vector whose elements can be shown to follow a Chi-square distribution. We then propose to make use of a square-root normalization, which makes the distribution of the resulting representation converge to a Gaussian, thus complying with the standard machine learning assumption. As evidenced by our experiments, this lets us outperform the state-of-the-art second-order models on several benchmark recognition datasets.

Clustering with Deep Learning: Taxonomy and New Methods

Clustering is a fundamental machine learning method. The quality of its results depends on the data distribution. For this reason, deep neural networks can be used for learning better representations of the data. In this paper, we propose a systematic taxonomy for clustering with deep learning, in addition to a review of methods from the field. Based on our taxonomy, creating new methods is more straightforward. We also propose a new approach which is built on the taxonomy and overcomes some limitations of previous work. Our experimental evaluation on image datasets shows that the method approaches state-of-the-art clustering quality, and performs better in some cases.

Dynamic Optimization of Neural Network Structures Using Probabilistic Modeling

Deep neural networks (DNNs) are powerful machine learning models and have succeeded in various artificial intelligence tasks. Although various architectures and modules for DNNs have been proposed, selecting and designing the appropriate network structure for a target problem is a challenging task. In this paper, we propose a method to simultaneously optimize the network structure and weight parameters during neural network training. We consider a probability distribution that generates network structures, and optimize the parameters of the distribution instead of directly optimizing the network structure. The proposed method can be applied to various network structure optimization problems under the same framework. We apply the proposed method to several structure optimization problems such as selection of layers, selection of unit types, and selection of connections using the MNIST, CIFAR-10, and CIFAR-100 datasets. The experimental results show that the proposed method can find appropriate and competitive network structures.

Inverse reinforcement learning in continuous time and space

This paper develops a data-driven inverse reinforcement learning technique for a class of linear systems to estimate the cost function of an agent online, using input-output measurements. A simultaneous state and parameter estimator is utilized to facilitate output-feedback inverse reinforcement learning, and cost function estimation is achieved up to multiplication by a constant.

Feeding the Multitude: A Polynomial-time Algorithm to Improve Sampling

A wide variety of optimization techniques, both exact and heuristic, tend to be biased samplers. This means that when attempting to find multiple uncorrelated solutions of a degenerate Boolean optimization problem, a subset of the solution space tends to be favored while, in the worst case, some solutions can never be accessed by the algorithm used. Here we present a simple post-processing technique that improves sampling for any optimization approach, either quantum or classical. More precisely, starting from a pool of a few optimal configurations, the algorithm generates potentially new solutions via rejection-free cluster updates at zero temperature. Although the method is not ergodic and there is no guarantee that all the solutions can be found, fair sampling is typically improved. We illustrate the effectiveness of our method by improving the exponentially biased data produced by the D-Wave 2X quantum annealer [Phys. Rev. Lett. 118, 07052 (2017)], as well as data from three-dimensional Ising spin glasses. As part of the study, we also show that sampling is improved when sub-optimal states are included and discuss sampling at a finite fixed temperature.

Tractable Learning and Inference for Large-Scale Probabilistic Boolean Networks

Probabilistic Boolean Networks (PBNs) have previously been proposed to gain insights into complex dynamical systems. However, identification of large networks, and of the underlying discrete Markov chain that describes their temporal evolution, remains a challenge. In this paper, we introduce an equivalent representation for the PBN, the Stochastic Conjunctive Normal Form (SCNF), which paves the way to a scalable learning algorithm and helps predict the long-run dynamic behavior of large-scale systems. Moreover, SCNF can be sampled efficiently to statistically infer multi-step transition probabilities, which can provide knowledge on the activity levels of individual nodes in the long run.
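The sampling idea can be illustrated with a plain PBN simulator in Python. This sketch does not implement the SCNF representation; it uses the textbook PBN formulation in which every node carries a list of candidate Boolean rules with selection probabilities, and all nodes update synchronously. The names `step` and `estimate_activity`, the tuple-of-bits state encoding, and the fixed-seed generator are assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]
Rule = Callable[[State], int]
# Each node: a list of (selection probability, Boolean rule) pairs.
Network = Sequence[Sequence[Tuple[float, Rule]]]


def step(state: State, network: Network, rng: random.Random) -> State:
    """One synchronous update: every node samples one of its rules."""
    next_bits: List[int] = []
    for candidates in network:
        weights = [p for p, _ in candidates]
        rules = [rule for _, rule in candidates]
        chosen = rng.choices(rules, weights=weights, k=1)[0]
        next_bits.append(chosen(state))
    return tuple(next_bits)


def estimate_activity(
    network: Network, start: State, n_steps: int, n_runs: int, seed: int = 0
) -> List[float]:
    """Monte Carlo estimate of P(node = 1) after n_steps, from a fixed start."""
    rng = random.Random(seed)
    counts = [0] * len(start)
    for _ in range(n_runs):
        state = start
        for _ in range(n_steps):
            state = step(state, network, rng)
        for i, bit in enumerate(state):
            counts[i] += bit
    return [c / n_runs for c in counts]


if __name__ == "__main__":
    # Node 0 switches on with probability 0.7 at each step;
    # node 1 copies the previous value of node 0.
    network = [
        [(0.7, lambda s: 1), (0.3, lambda s: 0)],
        [(1.0, lambda s: s[0])],
    ]
    print(estimate_activity(network, (0, 0), n_steps=3, n_runs=5000))
```

Averaging end states over many independent runs gives an empirical estimate of multi-step transition probabilities without ever building the full transition matrix, whose size grows exponentially with the number of nodes.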

The quasi-periodic quantum Ising transition in 1D
On the estimation of variance parameters in non-standard generalised linear mixed models: Application to penalised smoothing
Scalable Secure Computation of Statistical Functions with Applications to $k$-Nearest Neighbors
Fractional DP-Colorings of Sparse Graphs
Propensity score methodology in the presence of network entanglement between treatments
Learning Class-specific Word Representations for Early Detection of Hoaxes in Social Media
The Hybrid Bootstrap: A Drop-in Replacement for Dropout
Polynomial-Time Random Oracles and Separating Complexity Classes
Topological Entropy of Formal Languages
Bounding Approaches for Generalization
Vehicle Detection in Aerial Images
Perfect simulation of the Hard Disks Model by Partial Rejection Sampling
Modeling and Performance Analysis of Full-Duplex Communications in Cache-Enabled D2D Networks
The Importance of Communities for Learning to Influence
Code-Frequency Block Group Coding for Anti-Spoofing Pilot Authentication in Multi-Antenna OFDM Systems
CHALET: Cornell House Agent Learning Environment
On an Algorithm for Comparing the Chromatic Symmetric Functions of Trees
Mean-Field Game Theoretic Edge Caching in Ultra-Dense Networks
Numerical Coordinate Regression with Convolutional Neural Networks
Secure Mobile Crowdsensing with Deep Learning
Hybrid Gradient Boosting Trees and Neural Networks for Forecasting Operating Room Data
Exploring a Delta Schur Conjecture
Learning Networks from Random Walk-Based Node Similarities
The sum of nonsingular matrices is often nonsingular
Let’s Dance: Learning From Online Dance Videos
On the complexity of convex inertial proximal algorithms
Onion Curve: A Space Filling Curve with Near-Optimal Clustering
Super-Resolution mmWave Channel Estimation using Atomic Norm Minimization
Statistical Studies of Fading in Underwater Wireless Optical Channels in the Presence of Air Bubble, Temperature, and Salinity Random Variations (Long Version)
Comparison Training for Computer Chinese Chess
Distributed Agreement on Activity Driven Networks
Assertion-based QA with Question-Aware Open Information Extraction
Optimality of Simple Layered Superposition Coding in the 3 User MISO BC with Finite Precision CSIT
Ultra-Reliable Short Message Cooperative Relaying Protocols under Nakagami-m Fading
Revisiting Video Saliency: A Large-scale Benchmark and a New Model
Leveraging Edge Caching in NOMA Systems with QoS Requirements
Double-Stage Delay Multiply and Sum Beamforming Algorithm: Application to Linear-Array Photoacoustic Imaging
Novel digital tissue phenotypic signatures of distant metastasis in colorectal cancer
A Proximal Approach for a Class of Matrix Optimization Problems
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Heuristic algorithms for the Maximum Colorful Subtree problem
Avoidable words
Stacked Filters Stationary Flow For Hardware-Oriented Acceleration Of Deep Convolutional Neural Networks
On defectivity of families of full-dimensional point configurations
Survey on Emotional Body Gesture Recognition
Protograph-based Quasi-Cyclic MDPC Codes for McEliece Cryptosystems
Type-two polynomial-time and restricted lookahead
Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding
Cyber Hate Classification: ‘Othering’ Language And Paragraph Embedding
Incidence bicomodules, Möbius inversion, and a Rota formula for infinity adjunctions
What did you Mention? A Large Scale Mention Detection Benchmark for Spoken and Written Text
Experimentally detecting a quantum change point via Bayesian inference
System-Level Modeling and Optimization of the Energy Efficiency in Cellular Networks — A Stochastic Geometry Framework
On the Smith normal form of a skew-symmetric D-optimal design of order $n\equiv 2\pmod{4}$
Connections between rank and dimension for subspaces of bilinear forms
Analyzing Language Learned by an Active Question Answering Agent
Improved Pseudo-Polynomial-Time Approximation for Strip Packing
An Efficient Primal-Dual Algorithm for Fair Combinatorial Optimization Problems
Hyper-heuristics Can Achieve Optimal Performance for Pseudo-Boolean Optimisation
Counting proper colourings in 4-regular graphs via the Potts model
Stable gonality is computable
Spectral Efficiency Optimization For Millimeter Wave Multi-User MIMO Systems
Transfer Principle for nth Order Fractional Brownian Motion with Applications to Prediction and Equivalence in Law
Towards Low-Latency and Ultra-Reliable Virtual Reality
Sharp comparison of moments and the log-concave moment problem
Conditioned point processes with application to Lévy bridges
Optimization-based Motion Planning in Virtual Driving Scenarios with Application to Communicating Autonomous Vehicles
Edge Computing Meets Millimeter-wave Enabled VR: Paving the Way to Cutting the Cord
Fast Point Spread Function Modeling with Deep Learning
Modelling and Using Response Times in Online Courses
Mistral Supercomputer Job History Analysis
Algorithms for difference families in finite abelian groups
Task-parallel Analysis of Molecular Dynamics Trajectories
High Resolution Face Completion with Multiple Controllable Attributes via Fully End-to-End Progressive Generative Adversarial Networks
Human Activity Recognition for Mobile Robot
DeepGestalt – Identifying Rare Genetic Syndromes Using Deep Learning
On the Hamilton-Waterloo Problem with cycle lengths of distinct parities
Model theory and combinatorics of banned sequences
Non-parametric sparse additive auto-regressive network models
Expectation Learning for Adaptive Crossmodal Stimuli Association
Byzantine Gathering in Polynomial Time
On all Pickands Dependence Functions whose corresponding Extreme-Value-Copulas have Spearman $ρ$ (Kendall $τ$) identical to some value $v \in [0,1]$
Puzzles in $K$-homology of Grassmannians
Pruning Techniques for Mixed Ensembles of Genetic Programming Models
Ergodic control of a class of jump diffusions with finite Lévy measures and rough kernels
Monomial ideals with tiny squares
A Classification Refinement Strategy for Semantic Segmentation
Signal Subgraph Estimation Via Vertex Screening
A combinatorial model for computing volumes of flow polytopes
Drug Selection via Joint Push and Learning to Rank
Homologous Codes for Multiple Access Channels
ArcFace: Additive Angular Margin Loss for Deep Face Recognition
On the Uniqueness of Global Multiple SLEs
A McKean–Vlasov equation with positive feedback and blow-ups
Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models