Magister Dixit

“The key to a successful big data project isn’t the bigness of the data or the slickness of the dashboard a given tool provides. It’s the quality of the selection and analysis of the data. Unfortunately, in many cases, those who use the big data tools may not even be aware of the underlying logic of the data selection and analysis.” Patrick Marshall (08.07.2014)


What’s new on arXiv – Complete List

Graph Transformer Networks
Position Paper: Towards Transparent Machine Learning
Deep Clustering for Mars Rover image datasets
Radically Compositional Cognitive Concepts
Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling
Synthetic Event Time Series Health Data Generation
Thirteen Simple Steps for Creating An R Package with an External C++ Library
Human Annotations Improve GAN Performances
Self-supervised Adversarial Training
ASCAI: Adaptive Sampling for acquiring Compact AI
‘How do I fool you?’: Manipulating User Trust via Misleading Black Box Explanations
Using natural language processing to extract health-related causality from Twitter messages
Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams
LIBRE: Learning Interpretable Boolean Rule Ensembles
Controllability of the Voter Model: an information theoretic approach
Safe Interactive Model-Based Learning
Multi-Label Learning with Deep Forest
Learning Models over Relational Data: A Brief Tutorial
Learning To Characterize Adversarial Subspaces
Bootstrapping NLU Models with Multi-task Learning
Generative Models for Effective ML on Private, Decentralized Datasets
How bettering the best? Answers via blending models and cluster formulations in density-based clustering
Unsupervised Attributed Multiplex Network Embedding
Stagewise Knowledge Distillation
On the center of mass of the elephant random walk
On the Relativized Alon Second Eigenvalue Conjecture V: Proof of the Relativized Alon Conjecture for Regular Base Graphs
Long-range Prediction of Vital Signs Using Generative Boosting via LSTM Networks
Deep learning methods in speaker recognition: a review
Smart transformer Modelling in Optimal Power Flow Analysis
Performance evaluation of deep neural networks for forecasting time-series with multiple structural breaks and high volatility
Detecting cutaneous basal cell carcinomas in ultra-high resolution and weakly labelled histopathological images
In Search of the Fastest Concurrent Union-Find Algorithm
On the Time-Based Conclusion Stability of Software Defect Prediction Models
Extremes of Vector-Valued Gaussian Processes
Shape optimisation of stirring rods in mixing binary fluids
Question-Conditioned Counterfactual Image Generation for VQA
Capturing the Production of the Innovative Ideas: An Online Social Network Experiment and ‘Idea Geography’ Visualization
Predicting Drug-Drug Interactions from Molecular Structure Images
Give me (un)certainty — An exploration of parameters that affect segmentation uncertainty
Hardness of Learning DNFs using Halfspaces
Twitter Watch: Leveraging Social Media to Monitor and Predict Collective-Efficacy of Neighborhoods
Multiple Patients Behavior Detection in Real-time using mmWave Radar and Deep CNNs
MmWave Radar Point Cloud Segmentation using GMM in Multimodal Traffic Monitoring
Arguing Ecosystem Values with Paraconsistent Logics
Automotive Radar Interference Mitigation Using Adaptive Noise Canceller
Solving Inverse Problems by Joint Posterior Maximization with a VAE Prior
On Data Enriched Logistic Regression
Unlabeled Sensing With Local Permutations
Entanglement-assisted Quantum Codes from Cyclic Codes
Estimation of dynamic networks for high-dimensional nonstationary time series
The Eighth Dialog System Technology Challenge
Contrast Phase Classification with a Generative Adversarial Network
Does Face Recognition Accuracy Get Better With Age? Deep Face Matchers Say No
Strongly uncontrollable network topologies
New Bounds on $k$-Planar Crossing Numbers
Mining News Events from Comparable News Corpora: A Multi-Attribute Proximity Network Modeling Approach
Auto-encoding a Knowledge Graph Using a Deep Belief Network: A Random Fields Perspective
Modelling EHR timeseries by restricting feature interaction
Sparse associative memory based on contextual code learning for disambiguating word senses
Assessing the uncertainty in statistical evidence with the possibility of model misspecification using a non-parametric bootstrap
Localization for Random Walks in Random Environment in Dimension two and Higher
Weighted Triangle-free 2-matching Problem with Edge-disjoint Forbidden Triangles
Atypical exit events near a repelling equilibrium
Reversible Hardware for Acoustic Communications
Weak Monotone Comparative Statics
Gated Variational AutoEncoders: Incorporating Weak Supervision to Encourage Disentanglement
Bounds to the Normal Approximation for Linear Recursions with Two Effects
CASTER: Predicting Drug Interactions with Chemical Substructure Representation
Quadratic addition rules for three $q$-integers
Structural Controllability of Networked Relative Coupling Systems under Fixed and Switching Topologies
Measurement Error Correction in Particle Tracking Microrheology
Estimating adaptive cruise control model parameters from on-board radar units
Distributed Nash equilibrium seeking for aggregative games via a small-gain approach
Optimal Mini-Batch Size Selection for Fast Gradient Descent
Resource-Competitive Sybil Defenses
Flexible Functional Split and Power Control for Energy Harvesting Cloud Radio Access Networks
Multiple Style-Transfer in Real-Time
Fourier Spectrum Discrepancies in Deep Network Generated Images
$\ell_{\infty}$ Vector Contraction for Rademacher Complexity
Explicit-Blurred Memory Network for Analyzing Patient Electronic Health Records
Interpreting chest X-rays via CNNs that exploit disease dependencies and uncertainty labels
Sequential Recommendation with Relation-Aware Kernelized Self-Attention
On Model Robustness Against Adversarial Examples
Automated Augmentation with Reinforcement Learning and GANs for Robust Identification of Traffic Signs using Front Camera Images
OpenLORIS-Object: A Dataset and Benchmark towards Lifelong Object Recognition
DNNRE: A Dynamic Neural Network for Distant Supervised Relation Extraction
What is the gradient of a scalar function of a symmetric matrix ?
Some results on the Ryser design conjecture-III
A Survey of Algorithms for Distributed Charging Control of Electric Vehicles in Smart Grid
Situation Coverage Testing for a Simulated Autonomous Car — an Initial Case Study
Simple iterative method for generating targeted universal adversarial perturbations
The TAZRP speed process
Single View Distortion Correction using Semantic Guidance
Improved algorithm for neuronal ensemble inference by Monte Carlo method
Graph Iso/Auto-morphism: A Divide-&-Conquer Approach
A Neural Network Assisted Greedy Algorithm For Sparse Electromagnetic Imaging
Likelihood Assignment for Out-of-Distribution Inputs in Deep Generative Models is Sensitive to Prior Distribution Choice
Improving PHY-Security of UAV-Enabled Transmission with Wireless Energy Harvesting: Robust Trajectory Design and Power Allocation
A Novel Content Caching and Delivery Scheme for Millimeter Wave Device-to-Device Communications
Safe Coverage of Compact Domains For Second Order Dynamical Systems
Random walks on hypergraphs
An Energy Efficient D2D Model with Guaranteed Quality of Service for Cloud Radio Access Networks
A3GAN: An Attribute-aware Attentive Generative Adversarial Network for Face Aging
Fine-grained Qualitative Spatial Reasoning about Point Positions
Optimal Sequential Tests for Detection of Changes under Finite measure space for Finite Sequences of Networks
A meshfree formulation for large deformation analysis of flexoelectric structures accounting for the surface effects
Codes Correcting All Patterns of Tandem-Duplication Errors of Maximum Length 3
Akaike’s Bayesian information criterion (ABIC) or not ABIC for geophysical inversion
Feedback Linearization based on Gaussian Processes with event-triggered Online Learning
Putting Privacy into Perspective — Comparing Technical, Legal, and Users’ View of Data Sensitivity
Independent and automatic evaluation of acoustic-to-articulatory inversion models
GET: Global envelopes in R
AdvKnn: Adversarial Attacks On K-Nearest Neighbor Classifiers With Approximate Gradients
Integrating Threat Modeling and Automated Test Case Generation into Industrialized Software Security Testing
Overcoming slowly decaying Kolmogorov n-width by transport maps: application to model order reduction of fluid dynamics and fluid–structure interaction problems
GraphX-Convolution for Point Cloud Deformation in 2D-to-3D Conversion
Data Preparation in Agriculture Through Automated Semantic Annotation — Basis for a Wide Range of Smart Services
Regularization with Metric Double Integrals for Vector Tomography
HealthFog: An Ensemble Deep Learning based Smart Healthcare System for Automatic Diagnosis of Heart Diseases in Integrated IoT and Fog Computing Environments
Single Image Reflection Removal through Cascaded Refinement
Reusable neural skill embeddings for vision-guided whole body movement and object manipulation
Pseudo-linear Convergence of an Additive Schwarz Method for Dual Total Variation Minimization
CatGAN: Category-aware Generative Adversarial Networks with Hierarchical Evolutionary Learning for Category Text Generation
Generalized rainbow Turán problems
Forgetting to learn logic programs
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
Imputing missing values with unsupervised random trees
Optimal adaptive group testing
Long cycle of random permutations with polynomially growing cycle weights
A Policy Editor for Semantic Sensor Networks
MMGAN: Generative Adversarial Networks for Multi-Modal Distributions
Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs
A System Theoretical Perspective to Gradient-Tracking Algorithms for Distributed Quadratic Optimization
CenterMask : Real-Time Anchor-Free Instance Segmentation
General Criteria for Successor Rules to Efficiently Generate Binary de Bruijn Sequences
Enforcing Deterministic Constraints on Generative Adversarial Networks for Emulating Physical Systems
A nonparametric estimator of the extremal index
Use of Power Flow Controllers to Enhance Transmission Network Utilisation on the Irish Transmission Network
Enforcing Boundary Conditions on Physical Fields in Bayesian Inversion
Fair Data Adaptation with Quantile Preservation
Deep radiomic features from MRI scans predict survival outcome of recurrent glioblastoma
Asymptotics and statistics on Fishburn matrices and their generalizations
Failed zero forcing and critical sets on directed graphs
New practical advances in polynomial root clustering
Asymptotics of Quasi-Stationary Distributions of Small Noise Stochastic Dynamical Systems in Unbounded Domains
Batch correction of high-dimensional data
Two-level Dynamic Load Balancing for High Performance Scientific Applications
On Polynomial Stability of Coupled Partial Differential Equations in 1D
A Generalized Markov Chain Model to Capture Dynamic Preferences and Choice Overload
Combinatorial Description Of The Principal Congruence Subgroup $Γ$(2) In Sl(2, Z)
In-domain representation learning for remote sensing
Causal inference using Bayesian non-parametric quasi-experimental design
A nonparametric framework for inferring orders of categorical data from category-real ordered pairs
Estimation via length-constrained generalized empirical principal curves under small noise
A positive combinatorial formula for symplectic Kostka-Foulkes polynomials I: Rows
An incremental scenario approach for building energy management with uncertain occupancy
On self-Mullineux and self-conjugate partitions
Actuation attacks on constrained linear systems: a set-theoretic analysis
On a Centrality Maximization Game
Semi-Algebraic Proofs, IPS Lower Bounds and the $τ$-Conjecture: Can a Natural Number be Negative?
Penalized k-means algorithms for finding the correct number of clusters in a dataset
Weak approximate unitary designs and applications to quantum encryption
Asymptotically Exact Variational Bayes for High-Dimensional Binary Regression Models
Towards Personalized Dialog Policies for Conversational Skill Discovery
Loss Minimization through the Allocation of DGs Considering the Stochastic Nature of Units
Local large deviation principle for Wiener process with random resetting
Clustering of solutions in the symmetric binary perceptron
Graphical One-Sided Markets
Partially normal 5-edge-colorings of cubic graphs
Stability and error analysis of a splitting method using Robin-Robin coupling applied to a fluid-structure interaction problem
Non-Orthogonal Multiple Access for Visible Light Communications with Ambient Light and User Mobility
Large deviations in a population dynamics with catastrophes
Probabilistic Foundations of the Staver-Levin Model
TinyCNN: A Tiny Modular CNN Accelerator for Embedded FPGA
Limit theorems for chains with unbounded variable length memory which satisfy Cramer condition
A Turing Test for Crowds
Learning an Optimally Reduced Formulation of OPF through Meta-optimization
Computationally Data-Independent Memory Hard Functions
Non-Monotone Submodular Maximization with Multiple Knapsacks in Static and Dynamic Settings
On differentiable local bounds preserving stabilization for Euler equations
Testing linear-invariant properties

What’s new on arXiv

Graph Transformer Networks

Graph neural networks (GNNs) have been widely used in representation learning on graphs and have achieved state-of-the-art performance in tasks such as node classification and link prediction. However, most existing GNNs are designed to learn node representations on fixed and homogeneous graphs. These limitations become especially problematic when learning representations on a misspecified graph or on a heterogeneous graph that consists of various types of nodes and edges. In this paper, we propose Graph Transformer Networks (GTNs), which are capable of generating new graph structures, identifying useful connections between unconnected nodes on the original graph, while learning effective node representations on the new graphs in an end-to-end fashion. The Graph Transformer layer, the core layer of GTNs, learns a soft selection of edge types and composite relations for generating useful multi-hop connections, so-called meta-paths. Our experiments show that GTNs learn new graph structures based on data and tasks, without domain knowledge, and yield powerful node representations via convolution on the new graphs. Without domain-specific graph preprocessing, GTNs achieved the best performance on all three benchmark node classification tasks against state-of-the-art methods that require pre-defined meta-paths from domain knowledge.
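The layer's soft edge-type selection can be illustrated with plain matrix algebra: softmax weights mix per-type adjacency matrices, and multiplying two mixtures yields a soft length-2 meta-path adjacency. A numpy sketch under these assumptions (toy graph, illustrative names, not the paper's code):

```python
import numpy as np

def soft_adjacency(adj_stack, logits):
    """Softmax over edge types, then weighted sum of their adjacency
    matrices: the 'soft selection' of a Graph Transformer layer."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return np.tensordot(w, adj_stack, axes=1)  # (K,) x (K,N,N) -> (N,N)

def meta_path_adjacency(adj_stack, logits_a, logits_b):
    """Composing two soft selections gives a length-2 meta-path."""
    return soft_adjacency(adj_stack, logits_a) @ soft_adjacency(adj_stack, logits_b)

# Toy heterogeneous graph with 3 nodes and 2 edge types.
A = np.array([[[0, 1, 0], [0, 0, 0], [0, 0, 0]],   # type 0: edge 0 -> 1
              [[0, 0, 0], [0, 0, 1], [0, 0, 0]]],  # type 1: edge 1 -> 2
             dtype=float)
# A near-hard selection of type 0 followed by type 1 recovers the
# 0 -> 2 connection even though no single edge type contains it.
M = meta_path_adjacency(A, np.array([9.0, -9.0]), np.array([-9.0, 9.0]))
```

In the actual model the logits are learned parameters, so gradient descent discovers which meta-paths matter for the task.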

Position Paper: Towards Transparent Machine Learning

Transparent machine learning is introduced as an alternative form of machine learning, where both the model and the learning system are represented in source code form. The goal of this project is to enable direct human understanding of machine learning models, giving us the ability to learn, verify, and refine them as programs. If solved, this technology could represent a best-case scenario for the safety and security of AI systems going forward.

Deep Clustering for Mars Rover image datasets

In this paper, we build autoencoders to learn a latent space from unlabeled image datasets obtained from the Mars rover. Once the latent feature space has been learnt, we use k-means to cluster the data. We test the performance of the algorithm on a smaller labeled dataset, and report good accuracy and concordance with the ground truth labels. This is the first attempt to use deep-learning-based unsupervised algorithms to cluster Mars rover images. The algorithm can be used to augment human annotations for such datasets (which are time consuming to produce) and speed up the generation of ground truth labels for Mars rover image data, and potentially other planetary and space images.
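The two-stage pipeline, learning a latent space and then running k-means on the codes, can be sketched with a linear encoder (SVD/PCA) standing in for the convolutional autoencoder; the data and dimensions below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(X, dim):
    """Linear stand-in for the convolutional autoencoder: project
    centred data onto its top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def kmeans(Z, k, iters=20):
    """Plain Lloyd's algorithm on the latent codes; deterministic
    initialization keeps the sketch reproducible."""
    centers = Z[np.linspace(0, len(Z) - 1, k).astype(int)].astype(float).copy()
    for _ in range(iters):
        d = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Two well-separated synthetic 'image' clusters in 64-D pixel space.
X = np.vstack([rng.normal(0.0, 0.1, (40, 64)), rng.normal(3.0, 0.1, (40, 64))])
labels = kmeans(encode(X, dim=2), k=2)
```

A nonlinear autoencoder replaces `encode` in the real pipeline; the clustering step is unchanged.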

Radically Compositional Cognitive Concepts

Despite ample evidence that our concepts, our cognitive architecture, and mathematics itself are all deeply compositional, few models take advantage of this structure. We therefore propose a radically compositional approach to computational neuroscience, drawing on the methods of applied category theory. We describe how these tools grant us a means to overcome complexity and improve interpretability, and supply a rigorous common language for scientific modelling, analogous to the type theories of computer science. As a case study, we sketch how to translate from compositional narrative concepts to neural circuits and back again.

Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling

Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time- and memory-intensive, prohibiting the use of longer receptive fields in practice. To increase efficiency, we make use of the ‘slow feature’ hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model (‘Seq-U-Net’) to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance in all tasks.
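The causality constraint imposed on the convolutions can be sketched in a few lines: left-padding makes each output depend only on current and past inputs. A numpy illustration (not the paper's implementation):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution whose output at time t depends only on x[:t+1]:
    the input is left-padded, so no future samples leak in."""
    k = len(kernel)
    xp = np.concatenate([np.zeros(k - 1), x])  # zeros stand in for the past
    return np.array([xp[t:t + k] @ kernel[::-1] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_conv1d(x, np.array([0.0, 1.0]))  # this kernel is a one-step delay
# y[t] = x[t-1]; y[0] sees only the zero padding.
```

Stacking such layers with downsampling between them gives the multi-scale U-Net structure, with causality preserved at every scale.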

Synthetic Event Time Series Health Data Generation

Synthetic medical data which preserves privacy while maintaining utility can be used as an alternative to real medical data, which has privacy costs and resource constraints associated with it. At present, most models focus on generating cross-sectional health data which is not necessarily representative of real data. In reality, medical data is longitudinal in nature, with a single patient having multiple health events, non-uniformly distributed throughout their lifetime. These events are influenced by patient covariates such as comorbidities, age group, gender etc. as well as external temporal effects (e.g. flu season). While there exist seminal methods to model time series data, it becomes increasingly challenging to extend these methods to medical event time series data. Due to the complexity of the real data, in which each patient visit is an event, we transform the data by using summary statistics to characterize the events for a fixed set of time intervals, to facilitate analysis and interpretability. We then train a generative adversarial network to generate synthetic data. We demonstrate this approach by generating human sleep patterns, from a publicly available dataset. We empirically evaluate the generated data and show close univariate resemblance between synthetic and real data. However, we also demonstrate how stratification by covariates is required to gain a deeper understanding of synthetic data quality.
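The interval transform described above, replacing raw event sequences with per-interval summary statistics, can be sketched as follows (the interval length and the choice of statistics are illustrative, not the paper's):

```python
from collections import defaultdict
from statistics import mean

def summarize_events(events, interval=24.0):
    """Bucket (timestamp, value) events into fixed time intervals and
    characterize each bucket by count and mean."""
    buckets = defaultdict(list)
    for t, v in events:
        buckets[int(t // interval)].append(v)
    return {b: {"count": len(vs), "mean": mean(vs)}
            for b, vs in sorted(buckets.items())}

# Sleep-like events: (hours since study start, duration in hours).
events = [(2.0, 7.5), (26.5, 6.0), (27.0, 1.0), (50.0, 8.0)]
summary = summarize_events(events)
```

The fixed-length summary vectors, rather than the irregular event stream, are what the GAN is then trained on.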

Thirteen Simple Steps for Creating An R Package with an External C++ Library

We describe how to extend R with an external C++ code library using the Rcpp package. Our working example uses the recent machine learning library and application ‘Corels’, which provides optimal yet easily interpretable rule lists <arXiv:1704.01701> and which we bring to R in the form of the ‘RcppCorels’ package. We discuss each step in the process and derive a set of simple rules and recommendations, which are illustrated with the concrete example.

Human Annotations Improve GAN Performances

Generative Adversarial Networks (GANs) have shown great success in many applications. In this work, we present a novel method that leverages human annotations to improve the quality of generated images. Unlike previous paradigms that directly ask annotators to distinguish between real and fake data in a straightforward way, we propose and annotate a set of carefully designed attributes that encode important image information at various levels, to understand the differences between fake and real images. Specifically, we have collected an annotated dataset that contains 600 fake images and 400 real images. These images are evaluated by 10 workers from the Amazon Mechanical Turk (AMT) based on eight carefully defined attributes. Statistical analyses have revealed different distributions of the proposed attributes between real and fake images. These attributes are shown to be useful in discriminating fake images from real ones, and deep neural networks are developed to automatically predict the attributes. We further utilize the information by integrating the attributes into GANs to generate better images. Experimental results evaluated by multiple metrics show performance improvement of the proposed model.

Self-supervised Adversarial Training

Recent work has demonstrated that neural networks are vulnerable to adversarial examples. To escape this predicament, many works try to harden the model in various ways, among which adversarial training is an effective one: it learns robust feature representations so as to resist adversarial attacks. Meanwhile, self-supervised learning aims to learn robust and semantic embeddings from the data itself. With these views, we introduce self-supervised learning to defend against adversarial examples in this paper. Specifically, a self-supervised representation coupled with k-Nearest Neighbour is proposed for classification. To further strengthen the defense, self-supervised adversarial training is proposed, which maximizes the mutual information between the representations of original examples and the corresponding adversarial examples. Experimental results show that the self-supervised representation outperforms its supervised version with respect to robustness, and that self-supervised adversarial training can further improve the defense ability efficiently.

ASCAI: Adaptive Sampling for acquiring Compact AI

This paper introduces ASCAI, a novel adaptive sampling methodology that can learn how to effectively compress Deep Neural Networks (DNNs) for accelerated inference on resource-constrained platforms. Modern DNN compression techniques comprise various hyperparameters that require per-layer customization to ensure high accuracy. Choosing such hyperparameters is cumbersome as the pertinent search space grows exponentially with the number of model layers. To effectively traverse this large space, we devise an intelligent sampling mechanism that adapts the sampling strategy using customized operations inspired by genetic algorithms. As a special case, we consider the space of model compression as a vector space. The adaptively selected samples enable ASCAI to automatically learn how to tune per-layer compression hyperparameters to optimize the accuracy/model-size trade-off. Our extensive evaluations show that ASCAI outperforms rule-based and reinforcement learning methods in terms of compression rate and/or accuracy.
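A genetic-algorithm-style adaptive sampler over per-layer compression rates can be sketched as below; the fitness function is a made-up accuracy proxy, not ASCAI's actual objective:

```python
import random

random.seed(0)

def fitness(rates):
    """Stand-in objective: reward overall compression while penalizing
    aggressive pruning of the first layer (an assumed accuracy proxy)."""
    return sum(rates) / len(rates) - 2.0 * max(0.0, rates[0] - 0.5)

def evolve(pop, generations=30, sigma=0.05):
    """Mutate the fittest candidate and keep the best individuals:
    a genetic-algorithm-style sampler in the spirit of ASCAI."""
    for _ in range(generations):
        parent = max(pop, key=fitness)
        child = [min(0.9, max(0.0, r + random.gauss(0.0, sigma))) for r in parent]
        pop = sorted(pop + [child], key=fitness)[-len(pop):]  # elitist selection
    return max(pop, key=fitness)

# Per-layer compression rates for a hypothetical 4-layer model.
init = [[random.uniform(0.0, 0.9) for _ in range(4)] for _ in range(6)]
best = evolve(list(init))
```

Because selection is elitist, the best candidate's fitness never decreases across generations.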

‘How do I fool you?’: Manipulating User Trust via Misleading Black Box Explanations

As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a human interpretable manner. It has recently become apparent that a high-fidelity explanation of a black box ML model may not accurately reflect the biases in the black box. As a consequence, explanations have the potential to mislead human users into trusting a problematic black box. In this work, we rigorously explore the notion of misleading explanations and how they influence user trust in black-box models. More specifically, we propose a novel theoretical framework for understanding and generating misleading explanations, and carry out a user study with domain experts to demonstrate how these explanations can be used to mislead users. Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.

Using natural language processing to extract health-related causality from Twitter messages

Twitter messages (tweets) contain various types of information, including health-related information. Analysis of health-related tweets would help us understand health conditions and concerns encountered in daily life. In this work, we evaluated an approach to extracting causal relations from tweets using natural language processing (NLP) techniques. We focused on three health-related topics: ‘stress’, ‘insomnia’, and ‘headache’. We proposed a set of lexico-syntactic patterns based on dependency parser outputs to extract causal information. A large dataset consisting of 24 million tweets was used. The results show that our approach achieved an average precision between 74.59% and 92.27%. Analysis of the extracted relations revealed interesting findings about health-related causality on Twitter.
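The paper's patterns operate on dependency-parse output; as a self-contained stand-in, the sketch below uses surface regular expressions over similar causal cue words (the patterns and example tweets are invented for illustration):

```python
import re

# Surface patterns standing in for the paper's dependency-based
# lexico-syntactic rules; cue words are illustrative.
PATTERNS = [
    re.compile(r"(?P<cause>\w[\w ]*?) (?:causes?|caused|gives me) (?P<effect>\w+)"),
    re.compile(r"(?P<effect>\w[\w ]*?) (?:because of|due to) (?P<cause>\w[\w ]*)"),
]

def extract_causal(tweet):
    """Return a (cause, effect) pair from the first matching pattern."""
    for pattern in PATTERNS:
        match = pattern.search(tweet.lower())
        if match:
            return match.group("cause").strip(), match.group("effect").strip()
    return None

pair = extract_causal("Deadlines cause stress for me")  # ('deadlines', 'stress')
```

Dependency-based patterns generalize better than these surface forms because they match grammatical relations rather than word order.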

Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams

Ever-increasing amounts of data and requirements to process them in real time lead to more and more analytics platforms and software systems being designed according to the concept of stream processing. A common area of application is the processing of continuous data streams from sensors, for example, IoT devices or performance monitoring tools. In addition to analyzing pure sensor data, analyses of data for groups of sensors often need to be performed as well. Therefore, data streams of the individual sensors have to be continuously aggregated to a data stream for a group. Motivated by a real-world application scenario, we propose that such a stream aggregation approach has to allow for aggregating sensors in hierarchical groups, support multiple such hierarchies in parallel, provide reconfiguration at runtime, and preserve the scalability and reliability qualities induced by applying stream processing techniques. We propose a stream processing architecture fulfilling these requirements, which can be integrated into existing big data architectures. We present a pilot implementation of such an extended architecture and show how it is used in industry. Furthermore, in experimental evaluations we show that our solution scales linearly with the amount of sensors and provides adequate reliability in the case of faults.
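The required behavior, hierarchical groups with multiple parallel hierarchies and runtime reconfiguration, can be illustrated with a minimal in-memory aggregator (a sketch of the semantics, not the proposed streaming architecture):

```python
from collections import defaultdict

class HierarchicalAggregator:
    """Toy sketch of multi-hierarchy stream aggregation: each reading
    fans out to every group its sensor belongs to, and membership can
    be changed at runtime (names are illustrative)."""

    def __init__(self):
        self.membership = defaultdict(set)  # sensor -> set of groups
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def assign(self, sensor, *groups):
        """Runtime reconfiguration: add the sensor to more groups."""
        self.membership[sensor].update(groups)

    def record(self, sensor, value):
        """Incrementally aggregate one reading into all its groups."""
        for group in self.membership[sensor]:
            self.totals[group] += value
            self.counts[group] += 1

    def mean(self, group):
        return self.totals[group] / self.counts[group]

agg = HierarchicalAggregator()
# Two parallel hierarchies: by location and by sensor type.
agg.assign("s1", "floor1", "building", "temperature")
agg.assign("s2", "floor2", "building", "temperature")
agg.record("s1", 20.0)
agg.record("s2", 24.0)
```

In the stream-processing setting the same fan-out happens per record across partitioned operators, which is where the scalability and fault-tolerance qualities come from.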

LIBRE: Learning Interpretable Boolean Rule Ensembles

We present a novel method – LIBRE – to learn an interpretable classifier, which materializes as a set of Boolean rules. LIBRE uses an ensemble of bottom-up weak learners operating on a random subset of features, which allows for the learning of rules that generalize well on unseen data even in imbalanced settings. Weak learners are combined with a simple union so that the final ensemble is also interpretable. Experimental results indicate that LIBRE efficiently strikes the right balance between prediction accuracy, which is competitive with black box methods, and interpretability, which is often superior to alternative methods from the literature.
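The final ensemble's structure, a union of Boolean rules each learned on a feature subset, can be illustrated with plain predicates (the rules and feature names below are invented, not learned by LIBRE):

```python
# Each 'weak learner' contributes one conjunction over its feature
# subset; the ensemble predicts positive if ANY rule fires, so the
# whole classifier stays readable as a list of if-then rules.
RULES = [
    lambda x: x["fever"] and x["cough"],         # rule from learner 1
    lambda x: x["age"] > 65 and x["fatigue"],    # rule from learner 2
]

def predict(x, rules=RULES):
    """Union of Boolean rules: 1 if any rule matches, else 0."""
    return int(any(rule(x) for rule in rules))

p = predict({"fever": True, "cough": True, "age": 30, "fatigue": False})
```

The union keeps interpretability because each positive prediction can be traced back to the specific rule that fired.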

Controllability of the Voter Model: an information theoretic approach

We address the link between the controllability or observability of a stochastic complex system and concepts of information theory. We show that the most influential degrees of freedom can be detected without acting on the system, by measuring the time-delayed multi-information. Numerical and analytical results support this claim, which is developed in the case of a simple stochastic model on a graph, the so-called voter model. The importance of the noise when controlling the system is demonstrated, leading to the concept of control length. The link with classical control theory is given, as well as the interpretation of controllability in terms of the capacity of a communication channel.

Safe Interactive Model-Based Learning

Control applications present hard operational constraints, and violating them can result in unsafe behavior. This paper introduces Safe Interactive Model-Based Learning (SiMBL), a framework to refine an existing controller and a system model while operating on the real environment. SiMBL is composed of the following trainable components: a Lyapunov function, which determines a safe set; a safe control policy; and a Bayesian RNN forward model. A min-max control framework, based on alternate minimisation and backpropagation through the forward model, is used for the offline computation of the controller and the safe set. Safety is formally verified a posteriori with a probabilistic method that utilizes the Noise Contrastive Priors (NCP) idea to build a Bayesian RNN forward model with an additive state uncertainty estimate which is large outside the training data distribution. Iterative refinement of the model and the safe set is achieved thanks to a novel loss that conditions the uncertainty estimates of the new model to be close to those of the current one. The learned safe set and model can also be used for safe exploration, i.e., to collect data within the safe invariant set, for which a simple one-step MPC is proposed. The individual components are tested on a simulation of an inverted pendulum with limited torque and stability region, showing that iteratively adding more data can improve the model, the controller and the size of the safe region.

Multi-Label Learning with Deep Forest

In multi-label learning, each instance is associated with multiple labels, and the crucial task is how to leverage label correlations when building models. Deep neural network methods usually jointly embed the feature and label information into a latent space to exploit label correlations. However, the success of these methods highly depends on the precise choice of model depth. Deep forest is a recent deep learning framework based on tree-model ensembles that does not rely on backpropagation. We consider that the advantages of deep forest models make them well suited to multi-label problems. We therefore design the Multi-Label Deep Forest (MLDF) method with two mechanisms: measure-aware feature reuse and measure-aware layer growth. The measure-aware feature reuse mechanism reuses good representations from the previous layer, guided by confidence. The measure-aware layer growth mechanism ensures that MLDF gradually increases model complexity, guided by the performance measure. MLDF handles two challenging problems at the same time: restricting model complexity to ease overfitting, and optimizing the performance measure of the user's choice, since there are many different measures in multi-label evaluation. Experiments show that our proposal not only beats the compared methods on six measures across benchmark datasets but also enjoys label correlation discovery and other desired properties in multi-label learning.

Learning Models over Relational Data: A Brief Tutorial

This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
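The batch-of-aggregates idea can be illustrated on a toy two-relation example: pre-aggregating on the join key avoids materializing the joined rows (the relations and names below are invented):

```python
from collections import defaultdict

# Two small in-memory relations standing in for database tables.
orders = [("alice", 2), ("alice", 3), ("bob", 1)]   # (customer, qty)
customers = [("alice", "paris"), ("bob", "rome")]   # (customer, city)

# Instead of materializing the join and then summing over its rows,
# aggregate the fact relation on the join key first, then combine:
qty_by_customer = defaultdict(int)
for cust, qty in orders:
    qty_by_customer[cust] += qty

qty_by_city = defaultdict(int)
for cust, city in customers:
    qty_by_city[city] += qty_by_customer[cust]
```

On real schemas the same push-down of aggregates through joins is what lets the learning task run directly inside the database, often asymptotically faster than materialize-then-learn.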

Learning To Characterize Adversarial Subspaces

Deep Neural Networks (DNNs) are known to be vulnerable to maliciously generated adversarial examples. To detect these adversarial examples, previous methods use artificially designed metrics to characterize the properties of \textit{adversarial subspaces} where adversarial examples lie. However, we find these methods do not work in practical attack-detection scenarios, because the artificially defined features lack robustness and show limited discriminative power against strong attacks. To solve this problem, we propose a novel adversarial detection method which identifies adversaries by adaptively learning reasonable metrics to characterize adversarial subspaces. As auxiliary context information, the \textit{k} nearest neighbors are used to represent the subspace surrounding the detected sample. We propose an innovative model called Neighbor Context Encoder (NCE) to learn from the context of the \textit{k} neighbors and infer whether the detected sample is normal or adversarial. We conduct thorough experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets. The results demonstrate that our approach surpasses all existing methods under three settings: \textit{attack-aware black-box detection}, \textit{attack-unaware black-box detection} and \textit{white-box detection}.
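The neighbor-context idea can be sketched with a toy, non-learned stand-in: represent each sample by its distances to its k nearest training points, which is the kind of signal an NCE-style detector would learn to encode (all names and data here are illustrative):

```python
import math

def knn_context(sample, reference_set, k=3):
    """Distances from `sample` to its k nearest reference points,
    a minimal stand-in for the neighbor context a learned detector
    would encode."""
    dists = sorted(math.dist(sample, r) for r in reference_set)
    return dists[:k]

# Points near the data manifold have a tight neighbor context; an
# off-manifold (adversarial-like) point sits far from all neighbors.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
normal = knn_context((0.5, 0.5), train)
outlier = knn_context((5.0, 5.0), train)
print(max(normal) < min(outlier))  # → True
```

The paper's point is that instead of hand-crafting a score from these distances, the context of the k neighbors is fed to a model that learns the metric itself.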

Bootstrapping NLU Models with Multi-task Learning

Bootstrapping natural language understanding (NLU) systems with minimal training data is a fundamental challenge of extending digital assistants like Alexa and Siri to a new language. A common approach adopted in digital assistants when responding to a user query is to process the input in a pipeline manner, where the first task is to predict the domain, followed by the inference of intent and slots. However, this cascaded approach causes error propagation and prevents information sharing among these tasks. Further, using words as the atomic units of meaning, as done in many studies, can lead to coverage problems for morphologically rich languages such as German and French when data is limited. We address these issues by introducing a character-level unified neural architecture for joint modeling of domain, intent, and slot classification. We compose word embeddings from characters and jointly optimize all classification tasks via multi-task learning. In our results, we show that the proposed architecture is an optimal choice for bootstrapping NLU systems in low-resource settings, thus saving time, cost and human effort.
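A rough, non-neural sketch of composing word representations from characters (feature hashing over character n-grams, standing in for the paper's learned composition; all sizes and names are made up):

```python
import zlib

def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, e.g. '<sp', 'spi', ..."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word, dim=16):
    """Compose a word representation from its character n-grams via
    feature hashing. A learned model would embed and pool the grams
    instead, but the coverage argument is the same."""
    vec = [0] * dim
    for gram in char_ngrams(word):
        vec[zlib.crc32(gram.encode()) % dim] += 1
    return vec

# Morphological variants of German "spielen" share subword features,
# so even an unseen inflection gets a meaningful representation:
shared = set(char_ngrams("spielen")) & set(char_ngrams("spielst"))
print(len(shared))  # → 4
```

This is why character-level inputs help in low-resource, morphologically rich settings: word-level units would treat the two inflections as unrelated symbols.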

Generative Models for Effective ML on Private, Decentralized Datasets

To improve real-world applications of machine learning, experienced modelers develop intuition about their datasets, their models, and how the two interact. Manual inspection of raw data – of representative samples, of outliers, of misclassifications – is an essential tool in a) identifying and fixing problems in the data, b) generating new modeling hypotheses, and c) assigning or refining human-provided labels. However, manual data inspection is problematic for privacy sensitive datasets, such as those representing the behavior of real-world individuals. Furthermore, manual data inspection is impossible in the increasingly important setting of federated learning, where raw examples are stored at the edge and the modeler may only access aggregated outputs such as metrics or model parameters. This paper demonstrates that generative models – trained using federated methods and with formal differential privacy guarantees – can be used effectively to debug many commonly occurring data issues even when the data cannot be directly inspected. We explore these methods in applications to text with differentially private federated RNNs and to images using a novel algorithm for differentially private federated GANs.
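The privacy pattern this builds on, where the modeler sees only clipped, noised aggregates rather than raw user examples, can be sketched in a few lines (a toy scalar version; real DP federated training applies this to model updates and calibrates the noise to a formal (epsilon, delta) budget):

```python
import random

def dp_federated_mean(user_values, clip=1.0, noise_std=0.1, seed=0):
    """Release only a clipped, noised average of per-user values.
    A toy scalar sketch of the Gaussian-mechanism pattern; names and
    parameters here are illustrative, not the paper's algorithm."""
    rng = random.Random(seed)
    # Bound each user's influence, then aggregate and add noise.
    clipped = [max(-clip, min(clip, v)) for v in user_values]
    return sum(clipped) / len(clipped) + rng.gauss(0, noise_std)

updates = [0.2, 0.4, 5.0, -0.3]   # one value per simulated device; 5.0 gets clipped
noisy = dp_federated_mean(updates)
print(round(noisy, 3))            # the modeler never sees the raw values
```

The paper's contribution is doing the analogous thing for entire generative models (RNNs, GANs), so the modeler can inspect synthetic samples instead of private raw data.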

How bettering the best? Answers via blending models and cluster formulations in density-based clustering

With the recent growth in data availability and complexity, and the associated outburst of elaborate modeling approaches, model selection tools have become a lifeline, providing objective criteria to deal with this increasingly challenging landscape. In fact, basing predictions and inference on a single model may be limiting if not harmful; ensemble approaches, which combine different models, have been proposed to overcome the selection step, and have proven fruitful especially in the supervised learning framework. Conversely, these approaches have been scantily explored in the unsupervised setting. In this work we focus on the model-based clustering formulation, where a plethora of mixture models, with different numbers of components and parametrizations, is typically estimated. We propose an ensemble clustering approach that circumvents the single-best-model paradigm, while improving the stability and robustness of the partitions. A new density estimator, being a convex linear combination of the density estimates in the ensemble, is introduced and exploited for group assignment. As opposed to the standard case, where clusters are associated with the components of the selected mixture model, we define partitions by borrowing the modal, or nonparametric, formulation of the clustering problem, where groups are linked with high-density regions. Staying in the density-based realm, we thus show how blending together parametric and nonparametric approaches may be beneficial from a clustering perspective.
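A minimal sketch of the blending idea, using two toy one-dimensional Gaussian fits in place of the estimated mixture models (all numbers are illustrative):

```python
import math

def norm_pdf(x, mu, sigma):
    """Standard Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def blended_density(x, fits, weights):
    """Convex combination of the ensemble's density estimates;
    convex weights keep the blend a valid density."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1) < 1e-9
    return sum(w * norm_pdf(x, mu, s) for w, (mu, s) in zip(weights, fits))

# Two candidate fits blended instead of selecting a single "best" model;
# groups are then read off the blend's high-density regions (its modes).
fits = [(0.0, 1.0), (4.0, 1.0)]
weights = [0.6, 0.4]
has_valley = all(
    blended_density(m, fits, weights) > blended_density(2.0, fits, weights)
    for m, _ in fits
)
print(has_valley)  # → True: two modes separated by a low-density valley
```

Assigning points to modes of the blend, rather than to components of one selected mixture, is the modal/nonparametric twist the abstract describes.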

Unsupervised Attributed Multiplex Network Embedding

Nodes in a multiplex network are connected by multiple types of relations. However, most existing network embedding methods assume that only a single type of relation exists between nodes. Even those that consider the multiplexity of a network overlook node attributes, resort to node labels for training, and fail to model the global properties of a graph. We present a simple yet effective unsupervised network embedding method for attributed multiplex networks, called DMGI, inspired by Deep Graph Infomax (DGI), which maximizes the mutual information between local patches of a graph and the global representation of the entire graph. We devise a systematic way to jointly integrate the node embeddings from multiple graphs by introducing 1) a consensus regularization framework that minimizes the disagreements among the relation-type-specific node embeddings, and 2) a universal discriminator that discriminates true samples regardless of relation type. We also show that the attention mechanism infers the importance of each relation type, and thus can be useful for filtering unnecessary relation types as a preprocessing step. Extensive experiments on various downstream tasks demonstrate that DMGI outperforms the state-of-the-art methods, even though DMGI is fully unsupervised.

Stagewise Knowledge Distillation

The deployment of modern Deep Learning models requires high computational power. However, many applications are targeted for embedded devices like smartphones and wearables which lack such computational abilities. This necessitates compact networks which reduce computations while preserving the performance. Knowledge Distillation is one of the methods used to achieve this. Traditional Knowledge Distillation methods transfer knowledge from teacher to student in a single stage. We propose progressive stagewise training to improve the transfer of knowledge. We also show that this method works even with a fraction of the data used for training the teacher model, without compromising on the metric. This method can complement other model compression methods and also can be viewed as a generalized model compression technique.

What’s going on on PyPI

Scanning all the newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or that might inspire you in some way.

Multi Model Server is a tool for serving neural net models for inference. Apache MXNet Model Server (MMS) is a flexible and easy to use tool for serving deep learning models exported from MXNet or the Open Neural Network Exchange. Use the MMS Server CLI, or the pre-configured Docker images, to start a service that sets up HTTP endpoints to handle model inference requests.

Manipulate OTTR Reasonable Ontology Templates in Python

A scipy-like implementation of the PERT distribution
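For reference, a sketch of the core quantities such a package would expose: the standard PERT mean and the shape parameters of the underlying Beta distribution, derived from the three-point (optimistic, most likely, pessimistic) estimate:

```python
def pert_stats(a, m, b):
    """Mean and Beta shape parameters of the (standard) PERT
    distribution on [a, b] with most-likely value m. A sketch of
    the textbook formulas, not the package's actual API."""
    mean = (a + 4 * m + b) / 6
    alpha = 1 + 4 * (m - a) / (b - a)
    beta = 1 + 4 * (b - m) / (b - a)
    return mean, alpha, beta

# Optimistic, most likely, pessimistic estimates:
print(pert_stats(1, 2, 9))  # → (3.0, 1.5, 4.5)
```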

Pipeline for eXperimentation

Preprocessing Library for Natural Language Processing

Innvariant Group Sparse Neural Network Tools

Pyplan is a graphical Integrated Development Environment for creating and sharing Data Analytics Apps.

Simulation Sandbox for the development and evaluation of stormwater control algorithms. This repo has been developed in an effort to systematize quantitative analysis of stormwater control algorithms. It is a natural extension of the Open-Storm mission to open up and ease access into the technical world of smart stormwater systems. Our initial efforts allowed us to develop open-source, free tools for anyone to deploy flood sensors, measure green infrastructure, or even control storm or sewer systems. Now we have developed a tool to test the performance of the algorithms used to coordinate these different sensing and control technologies that have been deployed throughout urban water systems.

Document worth reading: “Babel Storage: Uncoordinated Content Delivery from Multiple Coded Storage Systems”

In future content-centric networks, content is identified independently of its location. From an end-user’s perspective, individual storage systems dissolve into a seemingly omnipresent structureless `storage fog’. Content should be delivered oblivious of the network topology, using multiple storage systems simultaneously, and at minimal coordination overhead. Prior works have addressed the advantages of error correction coding for distributed storage and content delivery separately. This work takes a comprehensive approach to highlighting the tradeoff between storage overhead and transmission overhead in uncoordinated content delivery from multiple coded storage systems. Our contribution is twofold. First, we characterize the tradeoff between storage and transmission overhead when all participating storage systems employ the same code. Second, we show that the resulting stark inefficiencies can be avoided when storage systems use diverse codes. What is more, such code diversity is not just technically desirable, but presumably will be the reality in the increasingly heterogeneous networks of the future. To this end, we show that a mix of Reed-Solomon, low-density parity-check and random linear network codes achieves close-to-optimal performance at minimal coordination and operational overhead. Babel Storage: Uncoordinated Content Delivery from Multiple Coded Storage Systems
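A minimal sketch of the erasure-coding idea underlying all of the codes the paper compares: the simplest (k+1, k) systematic code, where one XOR parity block lets any single lost block be recovered:

```python
def encode_parity(blocks):
    """Systematic (k+1, k) erasure code: k data blocks plus one XOR
    parity block. The simplest instance of coded storage; the paper's
    codes (Reed-Solomon, LDPC, RLNC) generalize this idea."""
    parity = 0
    for b in blocks:
        parity ^= b
    return blocks + [parity]

def recover(coded, missing_index):
    """Any single lost block is the XOR of the surviving ones."""
    value = 0
    for i, b in enumerate(coded):
        if i != missing_index:
            value ^= b
    return value

coded = encode_parity([0b1010, 0b0110, 0b1100])
print(recover(coded, 1) == 0b0110)  # → True
```

The paper's storage-versus-transmission tradeoff appears once a client fetches blocks from multiple such coded systems at once, possibly receiving redundant blocks when the systems all use the same code.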

Distilled News

Impact of using transfer learning in NLP

We analyze the impact on movie-review sentiment classification of using a language model trained from scratch versus a model pre-trained on the wikitext-103 corpus.

Multicollinearity: Why is it a problem?

Having come from an economics background, I grew familiar with multicollinearity during my academic career. However, once I entered industry, I found that professionals from backgrounds without a mathematical focus were often unaware that multicollinearity even existed. While multicollinearity isn't the most dangerous concept to ignore, I do think it is important enough to at least understand. In this article, we will dive into what multicollinearity is, how to identify it, why it can be a problem, and what you can do to fix it.
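A quick way to spot the problem before fitting anything is to check pairwise correlations between predictors. A stdlib sketch with made-up housing features:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two predictors that carry (almost) the same information:
sqft  = [800, 1200, 1500, 2000, 2600]
rooms = [2, 3, 4, 5, 7]          # grows almost linearly with sqft
r = pearson(sqft, rooms)
print(round(r, 3))               # |r| near 1 flags multicollinearity
# For a pair of predictors, VIF = 1 / (1 - r**2); large VIF means the
# individual coefficient estimates become unstable.
```

Near-duplicate predictors like these make the individual coefficients hard to interpret even when overall predictions stay fine, which is the crux of the article.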

Inferring causality in time series data

The question of what event caused another, or what brought about a certain change in a phenomenon, is a common one. Examples include whether a drug caused an improvement in some medical condition (versus the placebo effect, additional hospital visits, etc.), tracking down the cause of a malfunction in an assembly line, or determining what caused an upsurge in a website's traffic. While a naive interpretation of the problem may suggest simple approaches, like equating causality with high correlation, or inferring the degree to which x causes y from how good a predictor x is of y, the problem turns out to be much more complex. As a result, rigorous ways to approach this question were developed in several fields of science.

How To Add Confidence Intervals to Any Model

It is the first thing your manager asks as you present your latest work. How do you answer? Do you refer to the mean squared error? the R² coefficient? How about some example results? These are all great, but I would like to add another technique to your toolkit – confidence intervals.
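One model-agnostic way to do what the article's title promises is the percentile bootstrap; a sketch with toy per-example errors (the helper name and numbers are made up):

```python
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap: resample per-example results with
    replacement, recompute the statistic each time, and read the
    interval off the sorted replicates. Works for any model because
    it only touches the model's outputs."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

errors = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0, 1.4, 1.2]  # toy per-example errors
lo, hi = bootstrap_ci(errors)
print(round(lo, 2), round(hi, 2))  # a 95% interval around the mean error
```

The same function gives an interval for the R² coefficient or any other metric by swapping in a different `stat`, which is exactly why it pairs with "any model".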

What You Need to Know About Netflix’s ‘Jupyter Killer’: Polynote

Today, Netflix open-sourced Polynote, an internal notebook tool it developed. It's not rare these days for big tech companies to open-source their internal tools or services, which then become popular and widely adopted by the industry; Amazon's AWS and Facebook's React.js are two examples. It makes sense: these companies have some of the best engineers in the industry, and more often than not they face the biggest challenges, which drive the development of great tools. Netflix's Polynote could be another one of them, and the data science/machine learning industry does need better tools for writing code, experimenting with algorithms and visualizing data. Here are several things you need to know about this new tool. I'll try to keep this succinct and to the point so you can quickly read through it and understand the pros and cons of this new choice of development/research environment.

Why Real-Time Analytics Are Essential to Data-Driven Businesses

Nearly every large organization today wants to take advantage of big data analytics to inform decisions and gain a competitive edge. For example, the 2019 NewVantage Big Data and AI Executive Survey Report found that nearly 92% of the Fortune 500 executives they surveyed are increasing their investments in big data and AI, and 55% are spending more than $50 million. But while nearly 62% of Fortune 500 executives say they are seeing measurable results, only 31% say they have a data-driven organization and just 28% have created a data culture.


A universal command-line interface for PostgreSQL, MySQL, Oracle Database, SQLite3, Microsoft SQL Server, and many other databases including NoSQL and non-relational databases!

RLCard: A Toolkit for Reinforcement Learning in Card Games

RLCard is a toolkit for Reinforcement Learning (RL) in card games. It supports multiple card environments with easy-to-use interfaces. The goal of RLCard is to bridge reinforcement learning and imperfect-information games, and to push forward research on reinforcement learning in domains with multiple agents, large state and action spaces, and sparse rewards. RLCard is developed by the DATA Lab at Texas A&M University.

Introduction to Apple’s Core ML 3 – Build Deep Learning Models for the iPhone (with code)

• Apple’s Core ML 3 is a perfect segue for developers and programmers to get into the AI ecosystem
• You can build machine learning and deep learning models for the iPhone using Core ML 3
• We’ll build a brand new application for the iPhone in this article!

Introducing the Next Generation of On-Device Vision Models: MobileNetV3 and MobileNetEdgeTPU

On-device machine learning (ML) is an essential component in enabling privacy-preserving, always-available and responsive intelligence. This need to bring on-device machine learning to compute and power-limited devices has spurred the development of algorithmically-efficient neural network models and hardware capable of performing billions of math operations per second, while consuming only a few milliwatts of power. The recently launched Google Pixel 4 exemplifies this trend, and ships with the Pixel Neural Core that contains an instantiation of the Edge TPU architecture, Google’s machine learning accelerator for edge computing devices, and powers Pixel 4 experiences such as face unlock, a faster Google Assistant and unique camera features. Similarly, algorithms, such as MobileNets, have been critical for the success of on-device ML by providing compact and efficient neural network models for mobile vision applications.
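The arithmetic behind the MobileNet family's efficiency is easy to check directly: a depthwise-separable layer replaces one k x k convolution with a per-channel k x k pass plus a 1 x 1 pointwise pass (layer sizes below are illustrative):

```python
def conv_mults(h, w, c_in, c_out, k):
    """Multiplications in a standard k x k convolution over an h x w map."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_mults(h, w, c_in, c_out, k):
    """Depthwise (per-channel k x k) plus pointwise (1 x 1) convolution,
    the factorization used by MobileNets."""
    return h * w * c_in * k * k + h * w * c_in * c_out

std = conv_mults(56, 56, 128, 128, 3)
sep = depthwise_separable_mults(56, 56, 128, 128, 3)
print(round(std / sep, 1))  # → 8.4, close to the k*k = 9x ideal
```

The ratio works out to 1 / (1/c_out + 1/k²), which is why the savings approach k²-fold once the channel count is large; this is what makes billions of operations per second feasible within a few milliwatts.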

If you did not already know

Polygon-RNN++ google
Manually labeling datasets with object masks is extremely time consuming. In this work, we follow the idea of Polygon-RNN to produce polygonal annotations of objects interactively using humans-in-the-loop. We introduce several important improvements to the model: 1) we design a new CNN encoder architecture, 2) show how to effectively train the model with Reinforcement Learning, and 3) significantly increase the output resolution using a Graph Neural Network, allowing the model to accurately annotate high-resolution objects in images. Extensive evaluation on the Cityscapes dataset shows that our model, which we refer to as Polygon-RNN++, significantly outperforms the original model in both automatic (10% absolute and 16% relative improvement in mean IoU) and interactive modes (requiring 50% fewer clicks by annotators). We further analyze the cross-domain scenario in which our model is trained on one dataset, and used out of the box on datasets from varying domains. The results show that Polygon-RNN++ exhibits powerful generalization capabilities, achieving significant improvements over existing pixel-wise methods. Using simple online fine-tuning we further achieve a high reduction in annotation time for new datasets, moving a step closer towards an interactive annotation tool to be used in practice. …

Optimal Sparse Decision Tree google
Decision tree algorithms have been among the most popular algorithms for interpretable (transparent) machine learning since the early 1980s. The problem that has plagued decision tree algorithms since their inception is their lack of optimality, or lack of guarantees of closeness to optimality: decision tree algorithms are often greedy or myopic, and sometimes produce unquestionably suboptimal models. Hardness of decision tree optimization is both a theoretical and practical obstacle, and even careful mathematical programming approaches have not been able to solve these problems efficiently. This work introduces the first practical algorithm for optimal decision trees for binary variables. The algorithm is a co-design of analytical bounds that reduce the search space and modern systems techniques, including data structures and a custom bit-vector library. We highlight possible steps to improving the scalability and speed of future generations of this algorithm based on insights from our theory and experiments. …
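A brute-force sketch of the depth-1 special case makes the optimality notion concrete: for a single split over binary features, the optimum can be found by exhaustive search (the paper's contribution is making deeper trees tractable by pruning this search with analytical bounds):

```python
def stump_errors(X, y, j):
    """Misclassifications if we split on binary feature j and predict
    the majority label on each side."""
    sides = {0: [], 1: []}
    for row, label in zip(X, y):
        sides[row[j]].append(label)
    return sum(len(s) - max(s.count(0), s.count(1)) for s in sides.values() if s)

def optimal_stump(X, y):
    """Exhaustive search over features: guaranteed-optimal for depth 1,
    unlike greedy impurity heuristics, which carry no such guarantee."""
    return min(range(len(X[0])), key=lambda j: stump_errors(X, y, j))

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]                      # label equals feature 0
print(optimal_stump(X, y))  # → 0
```

For deeper trees the search space explodes combinatorially, which is exactly the hardness the abstract refers to.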

Population-Attributable Fraction (PAF) google
The contribution of a risk factor to a disease or a death is quantified using the population attributable fraction (PAF). PAF is the proportional reduction in population disease or mortality that would occur if exposure to a risk factor were reduced to an alternative ideal exposure scenario (e.g., no tobacco use). Many diseases are caused by multiple risk factors, and individual risk factors may interact in their impact on the overall risk of disease. As a result, PAFs for individual risk factors often overlap and add up to more than 100 percent.
Causal inference with multi-state models – estimands and estimators of the population-attributable fraction
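The single-exposure case reduces to Levin's classical formula, which is easy to sandbox (the prevalence and relative risk below are made-up numbers):

```python
def paf(prevalence, relative_risk):
    """Levin's formula for the population-attributable fraction:
    the share of cases that would disappear if the exposure were
    eliminated, given exposure prevalence and relative risk."""
    excess = prevalence * (relative_risk - 1)
    return excess / (1 + excess)

# E.g. 30% of the population exposed, exposure doubles disease risk:
print(round(paf(0.30, 2.0), 3))  # → 0.231
```

Summing such per-factor PAFs over many interacting exposures is what can exceed 100 percent, motivating the multi-state causal models in the linked paper.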

Natural Language Inference (NLI) google
Natural language inference (NLI) is the task of determining the inference relationship between a pair of natural-language sentences. With the increasing popularity of NLI, many state-of-the-art predictive models have been proposed with impressive performance. However, several works have noticed statistical irregularities in the collected NLI datasets that may result in an over-estimated performance of these models, and have proposed remedies. …

What’s going on on PyPI

Scanning all the newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or that might inspire you in some way.

A package to build optimal binary decision trees classifier.

Machine Learning Lifecycle Framework. Ebonite is a machine learning lifecycle framework. It allows you to persist your models and reproduce them (as services or in general).

A search engine for Open Data.

GLaDOS – Slack Bot Framework

Control JupyterLab from Python notebooks

A toolkit for executing and analyzing machine learning models

Image processing pipeline