Magister Dixit

“Feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.” Scott Locklin (2014)


Finding out why

Paper: Compositional Hierarchical Tensor Factorization: Representing Hierarchical Intrinsic and Extrinsic Causal Factors

Visual objects are composed of a recursive hierarchy of perceptual wholes and parts, whose properties, such as shape, reflectance, and color, constitute a hierarchy of intrinsic causal factors of object appearance. However, object appearance is the compositional consequence of both an object’s intrinsic and extrinsic causal factors, where the extrinsic causal factors are related to illumination and imaging conditions. Therefore, this paper proposes a unified tensor model of wholes and parts, and introduces a compositional hierarchical tensor factorization that disentangles the hierarchical causal structure of object image formation, and subsumes multilinear block tensor decomposition as a special case. The resulting object representation is an interpretable combinatorial choice of wholes’ and parts’ representations that renders object recognition robust to occlusion and reduces training data requirements. We demonstrate our approach in the context of face recognition by training on an extremely reduced dataset of synthetic images, and report encouraging face verification results on two datasets, the Freiburg dataset and the Labeled Faces in the Wild (LFW) dataset consisting of real-world images, thus substantiating the suitability of our approach for data-starved domains.

Paper: A Unified Framework for Causal Inference with Multiple Imputation Using Martingale

Multiple imputation is widely used to handle confounders missing at random in causal inference. Although Rubin’s combining rule is simple, it is not clear whether or not the standard multiple imputation inference is consistent when coupled with the commonly used average causal effect (ACE) estimators. This article establishes a unified martingale representation for the ACE estimators after multiple imputation. This representation invokes the wild bootstrap inference to provide consistent variance estimation. Our framework applies to asymptotically normal ACE estimators, including the regression imputation, weighting, and matching estimators. We extend to the scenarios when both outcome and confounders are subject to missingness and when the data are missing not at random.
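
As a point of reference for the abstract above, Rubin’s combining rule can be sketched in a few lines of Python. This is the standard textbook rule, not the martingale representation or wild bootstrap the paper proposes; the function name rubin_pool and the five ACE estimates are hypothetical values chosen only for illustration.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool m point estimates and their within-imputation variances."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (divides by m - 1)
    t = w + (1 + 1 / m) * b      # total variance of the pooled estimate
    return q_bar, t


# Hypothetical ACE estimates and variances from m = 5 imputed datasets.
ace_hat = [1.0, 1.2, 0.8, 1.1, 0.9]
var_hat = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total_var = rubin_pool(ace_hat, var_hat)
print(pooled, total_var)
```

The paper’s question is whether the total variance computed this way remains consistent for ACE estimators; the sketch only shows what the rule computes.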

Paper: Causality-based tests to detect the influence of confounders on mobile health diagnostic applications: a comparison with restricted permutations

Machine learning practice is often impacted by confounders. Confounding can be particularly severe in remote digital health studies where the participants self-select to enter the study. While many different confounding adjustment approaches have been proposed in the literature, most of these methods rely on modeling assumptions, and it is unclear how robust they are to violations of these assumptions. This realization has recently motivated the development of restricted permutation methods to quantify the influence of observed confounders on the predictive performance of machine learning models and evaluate if confounding adjustment methods are working as expected. In this paper we show, nonetheless, that restricted permutations can generate biased estimates of the contribution of the confounders to the predictive performance of a learner, and we propose an alternative approach to tackle this problem. By viewing a classification task from a causality perspective, we are able to leverage conditional independence tests between predictions, test set labels, and confounders in order to detect confounding on the predictive performance of a classifier. We illustrate the application of our causality-based approach to data collected from an mHealth study in Parkinson’s disease.

Paper: Long-range Event-level Prediction and Response Simulation for Urban Crime and Global Terrorism with Granger Networks

Large-scale trends in urban crime and global terrorism are well-predicted by socio-economic drivers, but focused, event-level predictions have had limited success. Standard machine learning approaches are promising, but lack interpretability, are generally interpolative, and are ineffective for precise future interventions, with costly and wasteful false positives. Here, we introduce Granger Network inference as a new forecasting approach for individual infractions with demonstrated performance far surpassing past results, yet transparent enough to validate and extend social theory. Considering the problem of predicting crime in the City of Chicago, we achieve an average AUC of ~90% for events predicted a week in advance within spatial tiles approximately 1000 ft across. Instead of pre-supposing that crimes unfold across contiguous spaces akin to diffusive systems, we learn the local transport rules from data. As our key insights, we uncover indications of suburban bias (how law-enforcement response is modulated by socio-economic contexts, with disproportionately negative impacts in the inner city) and how the dynamics of violent and property crimes co-evolve and constrain each other, lending quantitative support to controversial pro-active policing policies. To demonstrate broad applicability to spatio-temporal phenomena, we analyze terror attacks in the Middle East in the recent past, and achieve an AUC of ~80% for predictions made a week in advance, within spatial tiles measuring approximately 120 miles across. We conclude that while crime operates near an equilibrium, quickly dissipating perturbations, terrorism does not. Indeed, terrorism aims to destabilize social order, as shown by its dynamics being susceptible to run-away increases in event rates under small perturbations.

Paper: Guidelines for estimating causal effects in pragmatic randomized trials

Pragmatic randomized trials are designed to provide evidence for clinical decision-making rather than regulatory approval. Common features of these trials include the inclusion of heterogeneous or diverse patient populations in a wide range of care settings, the use of active treatment strategies as comparators, unblinded treatment assignment, and the study of long-term, clinically relevant outcomes. These features can greatly increase the usefulness of the trial results for patients, clinicians, and other stakeholders. However, these features also introduce an increased risk of non-adherence, which reduces the value of the intention-to-treat effect as a patient-centered measure of causal effect. In these settings, the per-protocol effect provides useful complementary information for decision making. Unfortunately, there is little guidance for valid estimation of the per-protocol effect. Here, we present our full guidelines for analyses of pragmatic trials that will result in more informative causal inferences for both the intention-to-treat effect and the per-protocol effect.

Python Library: cause-ml

Causal ML benchmarking and development tools

Article: The 10 Bias and Causality Techniques that Everyone Needs to Master

In the end, what does causality have to do with machine learning? Machine learning is about prediction and causality is about real effects; do these two themes have something in common? Yes, a lot, and this series of posts tries to bridge these two sub-areas of data science. I like to think of machine learning as just a data grinder: if you put in good-quality data, you get good-quality predictions, but if you put in garbage, it will keep grinding. Don’t expect good predictions to come out, though; it’s just ground garbage, and that’s what we’ll talk about in this post.

Paper: Seq-U-Net: A One-Dimensional Causal U-Net for Efficient Sequence Modelling

Convolutional neural networks (CNNs) with dilated filters such as the Wavenet or the Temporal Convolutional Network (TCN) have shown good results in a variety of sequence modelling tasks. However, efficiently modelling long-term dependencies in these sequences is still challenging. Although the receptive field of these models grows exponentially with the number of layers, computing the convolutions over very long sequences of features in each layer is time and memory-intensive, prohibiting the use of longer receptive fields in practice. To increase efficiency, we make use of the ‘slow feature’ hypothesis stating that many features of interest are slowly varying over time. For this, we use a U-Net architecture that computes features at multiple time-scales and adapt it to our auto-regressive scenario by making convolutions causal. We apply our model (‘Seq-U-Net’) to a variety of tasks including language and audio generation. In comparison to TCN and Wavenet, our network consistently saves memory and computation time, with speed-ups for training and inference of over 4x in the audio generation experiment in particular, while achieving a comparable performance in all tasks.
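
The two ideas the abstract relies on, causal convolutions and a receptive field that grows with dilation, can be illustrated with a minimal pure-Python sketch. This is not the Seq-U-Net architecture itself, only the dilated causal convolution used as a building block in models such as Wavenet and TCN; the function names and toy inputs are assumptions made for illustration.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence count as zero, so y[t]
    never depends on inputs later than t (the 'causal' property).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input steps one output can see after stacking the layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv(x, weights=[1.0, 1.0], dilation=2)
print(y)                                  # each y[t] is x[t] + x[t - 2]
print(receptive_field(2, [1, 2, 4, 8]))   # doubling dilations, four layers
```

With kernel size 2 and dilations that double per layer, the receptive field doubles with each added layer, which is the exponential growth the abstract mentions; every layer still runs over the full sequence length, which is the cost Seq-U-Net aims to reduce.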

Distilled News

Machine Learning for Day Trading

In this post, I’m going to explore machine learning algorithms for time-series analysis and explain why they don’t work for day trading. If you’re a novice in this field, you might get fooled by authors with amazing results where test data match predictions almost perfectly. A common trick is to plot predicted values over a long period of data, which creates the illusion that the lag is insignificant, or hides it entirely. Lag is what makes predictions useless, and I’ll show you an example later in this post. There are other ways to make predictions look legit, some of which I’m sure are made by mistake. But don’t get discouraged, and keep in mind that a model can only be as good as your data; lack of data is the main stumbling block on your way to getting solid results.
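
The lag effect described above can be reproduced on synthetic data. The sketch below is not the post’s own experiment: it assumes a simulated random-walk price series and a naive "model" that simply repeats the previous value, and all names and parameters are hypothetical.

```python
import random

# Simulate a random-walk "price" series (synthetic, seeded for repeatability).
rng = random.Random(0)
prices = [100.0]
for _ in range(249):
    prices.append(prices[-1] + rng.gauss(0, 1))

actual = prices[1:]
naive = prices[:-1]   # "prediction" for step t is just the value at t - 1


def mae(a, b):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


level_error = mae(actual, naive)
relative_error = level_error / (sum(actual) / len(actual))

# What the naive model says about the next move: always "no change".
predicted_changes = [p - last for p, last in zip(naive, prices[:-1])]

print(f"error relative to price level: {relative_error:.2%}")
print("largest predicted move:", max(abs(c) for c in predicted_changes))
```

Plotted over the whole period, the naive curve would sit almost on top of the actual one because the error is small relative to the price level, yet the predicted change is always zero, so the forecast carries no information a trader could act on.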

AiPM – AI Performance Management

Making ML Proofs-of-Concept (POC) is easy, but maintaining them in Production is frustrating and expensive. It comes down to chaotic and blindsided monitoring and triage processes. To fix this, we want to introduce a concept called AI Performance Management (AiPM), explain how teams can adopt it with 5 tactical steps, and show you a tool to accelerate the process.

Analyzing Inverse Problems with Invertible Neural Networks

In a recent collaboration with experts from natural and medical sciences, we show how Invertible Neural Networks can help us deal with the ill-posed inverse problems that often arise in these fields. This page aims to provide an intuitive introduction to the idea.

Research Guide for Depth Estimation with Deep Learning

Depth estimation is a computer vision task designed to estimate depth from a 2D image. The task requires an input RGB image and outputs a depth image. The depth image includes information about the distance of the objects in the image from the viewpoint, which is usually the camera taking the image. Some of the applications of depth estimation include smoothing blurred parts of an image, better rendering of 3D scenes, self-driving cars, grasping in robotics, robot-assisted surgery, automatic 2D-to-3D conversion in film, and shadow mapping in 3D computer graphics, just to mention a few. In this guide, we’ll look at papers aimed at solving these problems using deep learning. The two images below provide a clear illustration of depth estimation in practice.

Introduction to Federated Learning

Any deep learning model learns from data, and that data must be collected or uploaded to a server (one machine or a data center). The most realistic and meaningful deep learning models learn from personal data. Personal data is extremely private and sensitive, and no one would like to send or upload it to a server. Federated learning is a collaborative machine learning approach in which a model is trained without centralizing data on the server, and this is the main revolution it brings.
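
One common way to train without centralizing data is federated averaging, sketched below in plain Python. This is an illustrative assumption rather than the method of the linked article: a one-parameter linear model y = w * x, three hypothetical clients whose data never leave local_update, and a server that only sees the returned weights.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by local sample count."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


# Hypothetical private datasets, all generated by y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

w_global = 0.0
for _ in range(50):                                   # communication rounds
    local = [local_update(w_global, d) for d in clients]
    w_global = federated_average(local, [len(d) for d in clients])

print(w_global)   # approaches the shared slope of 3
```

Only model weights cross the client boundary in this sketch; real deployments typically add secure aggregation or differential privacy on top, which is out of scope here.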

Introducing the Schema-Guided Dialogue Dataset for Conversational Assistants

Today’s virtual assistants help users to accomplish a wide variety of tasks, including finding flights, searching for nearby events and movies, making reservations, sourcing information from the web and more. They provide this functionality by offering a unified natural language interface to a wide variety of services across the web. Large-scale virtual assistants, like Google Assistant, need to integrate with a large and constantly increasing number of services, each with potentially overlapping functionality, over a wide variety of domains. Supporting new services with ease, without collection of additional data or retraining the model, and reducing maintenance workload are necessary to accommodate future growth. Despite tremendous progress, however, these challenges have often been overlooked in state-of-the-art models. This is due, in part, to the absence of suitable datasets that match the scale and complexity confronted by such virtual assistants.

AzureR updates: AzureStor, AzureVM, AzureGraph, AzureContainers

Some major updates to AzureR packages this week! As well as last week’s AzureRMR update, there are changes to AzureStor, AzureVM, AzureGraph and AzureContainers. All of these are live on CRAN.

Azure AI and Machine Learning talk series

At last week’s Microsoft Ignite conference in Orlando, our team delivered a series of 6 talks about AI and machine learning applications with Azure. The videos from each talk are linked below, and you can watch every talk from the conference online (no registration necessary). Each of our talks also comes with a companion Github repository, where you can find all of the code and scripts behind the demonstrations, so you can deploy and run them yourself.

Interpretability: Cracking open the black box – Part I

Explainable AI (XAI) is a sub-field of AI which has been gaining ground in the recent past. And as a machine learning practitioner dealing with customers day in and day out, I can see why. I’ve been an analytics practitioner for more than 5 years and I swear, the hardest part of a machine learning project is not creating the perfect model which beats all the benchmarks. It’s the part where you convince the customer why and how it works.

Time Series Hierarchical Clustering using Dynamic Time Warping in Python

Let us consider the following task: we have a bunch of evenly distributed time series of different lengths. The goal is to cluster the time series by identifying general patterns present in the data. Here I’d like to present one approach to solving this task. We will use hierarchical clustering with the DTW algorithm as the distance metric between time series. The solution worked well on HR data (employee historical scores). For other types of time series, the DTW function may work worse than other metrics like CID (Complexity Invariant Distance), MAE or correlation.
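
The approach described above can be sketched without third-party libraries. The code below is a minimal illustration, not the article’s implementation: it assumes a textbook DTW with absolute-difference cost, a simple average-linkage agglomerative step on the resulting distance matrix, and four made-up series of different lengths.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def agglomerate(dist, n_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[p] for j in clusters[q]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]   # merge the closest pair
        del clusters[q]
    return clusters


# Made-up series: two rising and two falling, with different lengths.
series = [
    [0, 1, 2, 3, 4],
    [0, 0, 1, 2, 3, 4, 4],
    [4, 3, 2, 1, 0],
    [4, 4, 3, 2, 1, 0],
]
dist = [[dtw_distance(s, t) for t in series] for s in series]
clusters = agglomerate(dist, n_clusters=2)
print(clusters)
```

Because DTW lets one point align with several, the two rising series are at distance zero despite their different lengths. For larger datasets, the same distance matrix could be handed to a library routine such as SciPy’s hierarchical clustering instead of the quadratic loop used here.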

How to Use Pretrained Models in Keras

This guide will be useful if you are somewhat familiar with pretrained models but want to know how to use them in Keras. Keras contains 10 pretrained models for image classification. These models are trained on ImageNet data.

OpenML: Machine Learning as a community

OpenML is an online Machine Learning (ML) experiments database accessible to everyone for free. The core idea is to have a single repository of datasets and results of ML experiments on them. Although ML has gained a lot of popularity in recent years, with a plethora of tools now available, most ML experiments still happen in silos rather than within one shared community.

Whats new on arXiv – Complete List

Probabilistic Similarity Networks
Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models
Instance-based Transfer Learning for Multilingual Deep Retrieval
Interactive Attention for Semantic Text Matching
KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation
What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
Learning internal representations
Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering
All-Spin Bayesian Neural Networks
A Reduction from Reinforcement Learning to No-Regret Online Learning
Adversarial Margin Maximization Networks
FAQ-based Question Answering via Knowledge Anchors
An Efficient Hardware-Oriented Dropout Algorithm
Hiding in Multilayer Networks
HUSE: Hierarchical Universal Semantic Embeddings
A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis
Robust Parameter-Free Season Length Detection in Time Series
Semantic Granularity Metric Learning for Visual Search
A Bayesian/Information Theoretic Model of Bias Learning
Learning Model Bias
Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems
Understanding Graph Neural Networks with Asymmetric Geometric Scattering Transforms
Sato: Contextual Semantic Type Detection in Tables
Real-time Anomaly Detection and Classification in Streaming PMU Data
Backtracking activation impacts the criticality of excitable networks
Response to NITRD, NCO, NSF Request for Information on ‘Update to the 2016 National Artificial Intelligence Research and Development Strategic Plan’
Diversity of dynamical behaviors due to initial conditions: exact results with extended Ott–Antonsen ansatz for identical Kuramoto–Sakaguchi phase oscillators
MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models
Microsoft Research Asia’s Systems for WMT19
Multi-domain Dialogue State Tracking as Dynamic Knowledge Graph Enhanced Question Answering
Predicting Indian stock market using the psycho-linguistic features of financial news
Econophysics deserves a revamping
Towards automatic extractive text summarization of A-133 Single Audit reports with machine learning
An Introduction to Artificial Intelligence and Solutions to the Problems of Algorithmic Discrimination
A Massive Collection of Cross-Lingual Web-Document Pairs
Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
RNN-Test: Adversarial Testing Framework for Recurrent Neural Network Systems
Antithetic integral feedback for the robust control of monostable and oscillatory biomolecular circuits
Emotional Voice Conversion using multitask learning with Text-to-speech
t-SS3: a text classifier with dynamic n-grams for early risk detection over text streams
Few-Features Attack to Fool Machine Learning Models through Mask-Based GAN
Learning Multi-Sense Word Distributions using Approximate Kullback-Leibler Divergence
Scientific Image Restoration Anywhere
Some properties of ergodicity coefficients with applications in spectral graph theory
Deep Encoder-decoder Adversarial Reconstruction (DEAR) Network for 3D CT from Few-view Data
Unsupervised Domain Adaptation on Reading Comprehension
Federated Learning for Healthcare Informatics
Image-Based Feature Representation for Insider Threat Classification
The phonetic bases of vocal expressed emotion: natural versus acted
Radio Resource Allocation in 5G New Radio: A Neural Networks Based Approach
On global mechanisms of synchronization in networks of coupled chaotic circuits and the role of the voltage-type coupling
Existence of local minima of a minimal 2D pose-graph SLAM problem
A classification of flag-transitive block designs
Adversarial Transformations for Semi-Supervised Learning
Condition monitoring and early diagnostics methodologies for hydropower plants
Word-level Lexical Normalisation using Context-Dependent Embeddings
Towards Supervised Extractive Text Summarization via RNN-based Sequence Classification
Unsupervised Pre-training for Natural Language Generation: A Literature Review
Quantum percolation of monopole paths and the response of quantum spin ice
Convergence analysis for autonomous adaptive learning applied to quantum architectures
Maximizing the Partial Decode-and-Forward Rate in the Gaussian MIMO Relay Channel
Constrained Bayesian ICA for Brain Connectome Inference
Machine Learning Based Network Vulnerability Analysis of Industrial Internet of Things
Factor Group-Sparse Regularization for Efficient Low-Rank Matrix Recovery
On the Relativized Alon Second Eigenvalue Conjecture VI: Sharp Bounds for Ramanujan Base Graphs
Optimization Models for Estimating Transit Network Origin-Destination Flows with AVL/APC Data
The Number of Threshold Words on $n$ Letters Grows Exponentially for Every $n\geq 27$
Pricing Multi-Interval Dispatch under Uncertainty Part I: Dispatch-Following Incentives
Visual-Inertial Localization for Skid-Steering Robots with Kinematic Constraints
Computing Equilibria in Binary Networked Public Goods Games
On the Mean Subtree Order of Graphs Under Edge Addition
AI-optimized detector design for the future Electron-Ion Collider: the dual-radiator RICH case
By the user, for the user: A user-centric approach to quantifying the privacy of websites
Revisiting IRKA: Connections with pole placement and backward stability
A new preconditioner for elliptic PDE-constrained optimization problems
Triply Robust Off-Policy Evaluation
Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning
Haar wavelets collocation on a class of Emden-Fowler equation via Newton’s quasilinearization and Newton-Raphson techniques
A Model of Double Descent for High-dimensional Binary Linear Classification
Trustworthy Misinformation Mitigation with Soft Information Nudging
A Least-Squares Finite Element Method Based on the Helmholtz Decomposition for Hyperbolic Balance Laws
Compile-time Parallelization of Subscripted Subscript Patterns
On the Age of Information in Erasure Channels with Feedback
TASTE: Temporal and Static Tensor Factorization for Phenotyping Electronic Health Records
Accelerating cardiac cine MRI beyond compressed sensing using DL-ESPIRiT
Optimization of noisy blackboxes with adaptive precision
SpiralNet++: A Fast and Highly Efficient Mesh Convolution Operator
Graph-Induced Rank Structures and their Representations
Optimal Computation-Communication Trade-offs in Processing Networks
Federated and Differentially Private Learning for Electronic Health Records
Motion Reasoning for Goal-Based Imitation Learning
Kriging: Beyond Matérn
Attraction to and repulsion from a subset of the unit sphere for isotropic stable Lévy processes
A Kolmogorov type theorem for stochastic fields
Character Keypoint-based Homography Estimation in Scanned Documents for Efficient Information Extraction
LiDAR ICPS-net: Indoor Camera Positioning based-on Generative Adversarial Network for RGB to Point-Cloud Translation
Partial-Order, Partially-Seen Observations of Fluents or Actions for Plan Recognition as Planning
Projecting Flood-Inducing Precipitation with a Bayesian Analogue Model
A ratio of many gamma functions and its properties with applications
Deception through Half-Truths
Revenue Maximization of Airbnb Marketplace using Search Results
Fast multigrid solution of high-order accurate multi-phase Stokes problems
Generating Persona Consistent Dialogues by Exploiting Natural Language Inference
Reinforcement Learning for Market Making in a Multi-agent Dealer Market
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Linear Time Subgraph Counting, Graph Degeneracy, and the Chasm at Size Six
Proceedings of the Third Workshop on Software Foundations for Data Interoperability (SFDI2019+), October 28, 2019, Fukuoka, Japan
Multi-Antenna Aided Secrecy Beamforming Optimization for Wirelessly Powered HetNets
There is Limited Correlation between Coverage and Robustness for Deep Neural Networks
Majorization-Minimization Aided Hybrid Transceivers for MIMO Interference Channels
A Dynamic Preference Logic for reasoning about Agent Programming
Tractable reasoning about Agent Programming in Dynamic Preference Logic
Explainable Ordinal Factorization Model: Deciphering the Effects of Attributes by Piece-wise Linear Approximation
Univoque bases of real numbers: local dimension, Devil’s staircase and isolated points
Recent Advances in Algorithmic High-Dimensional Robust Statistics
Latin squares with maximal partial transversals of many lengths
RWF-2000: An Open Large Scale Video Database for Violence Detection
The convex hull of random points on the boundary of a simple polytope
Optimal Server Selection for Straggler Mitigation
Understanding the Disharmony between Weight Normalization Family and Weight Decay: $ε-$shifted $L_2$ Regularizer
Programmable View Update Strategies on Relations
Atari-fying the Vehicle Routing Problem with Stochastic Service Requests
Compact-Range RCS Measurements and Modeling of Small Drones at 15 GHz and 25 GHz
Enabling Efficient Privacy-Assured Outlier Detection over Encrypted Incremental Datasets
VisionISP: Repurposing the Image Signal Processor for Computer Vision Applications
GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs
Bayesian Optimization with Uncertain Preferences over Attributes
Small Latin arrays have a near transversal
SimVODIS: Simultaneous Visual Odometry, Object Detection, and Instance Segmentation
Distributional Clustering: A distribution-preserving clustering method
Progressive Feature Polishing Network for Salient Object Detection
2L-3W: 2-Level 3-Way Hardware-Software Co-Verification for the Mapping of Deep Learning Architecture (DLA) onto FPGA Boards
A Scalable Approach for Facial Action Unit Classifier Training Using Noisy Data for Pre-Training
Online Second Price Auction with Semi-bandit Feedback Under the Non-Stationary Setting
Resistance distance in directed cactus graphs
Change-point Analysis in Financial Networks
Ethanos: Lightweight Bootstrapping for Ethereum
Hierarchical Graph Pooling with Structure Learning
Contextual Bandits Evolving Over Finite Time
Performance of Two-Way Relaying over $α$-$μ$ Fading Channels in Hybrid RF/FSO Wireless Networks
Contextual Recurrent Units for Cloze-style Reading Comprehension
Quasiparabolic sets and Stanley symmetric functions for affine fixed-point-free involutions
Empirical Bayes mean estimation with nonparametric errors via order statistic regression
A Machine-Learning Approach for Earthquake Magnitude Estimation
Towards an $O(\frac{1}{t})$ convergence rate for distributed dual averaging
Attention on Abstract Visual Reasoning
Graph Spanners in the Message-Passing Model
Privacy and Utility Preserving Sensor-Data Transformations
Cycles of many lengths in digraphs with Meyniel-like condition
An Application of Multiple-Instance Learning to Estimate Generalization Risk
Training a code-switching language model with monolingual data
An Invariant Test for Equality of Two Large Scale Covariance Matrices
ReCoDe: A Data Reduction and Compression Description for High Throughput Time-Resolved Electron Microscopy
Linear convergence of dual coordinate descent on non-polyhedral convex problems
High order linearly implicit methods for evolution equations: How to solve an ODE by inverting only linear systems
An Accelerated Nonlinear Contrast Source Inversion Scheme For Sparse Electromagnetic Imaging
SDGM: Sparse Bayesian Classifier Based on a Discriminative Gaussian Mixture Model
Guidelines for estimating causal effects in pragmatic randomized trials
A note on second order linear functional equations in random normed spaces
Self-Supervised Learning For Few-Shot Image Classification
Conjugate Gradients for Kernel Machines
Supplementary material for Uncorrected least-squares temporal difference with lambda-return
A zeta function related to the transition matrix of the discrete-time quantum walk on a graph
Algebraic Fault Detection and Identification for Rigid Robots
Space-time multilevel Monte Carlo methods and their application to cardiac electrophysiology
The Controller Generated from Noise Can Be Lyapunuov Stable: A Controller Stabilized Method
A Generalized Worst-Case Complexity Analysis for Non-Monotone Line Searches
Efficient ConvNet-based Object Detection for Unmanned Aerial Vehicles by Selective Tile Processing
A shape optimization approach for electrical impedance tomography with point measurements
The Boolean intervals of Chevalley type are strongly non group-complemented
Mean-field reflected backward stochastic differential equations
CMSN: Continuous Multi-stage Network and Variable Margin Cosine Loss for Temporal Action Proposal Generation
Coarse-graining of non-reversible stochastic differential equations: quantitative results and connections to averaging
PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module
EdgeNet: Balancing Accuracy and Performance for Edge-based Convolutional Neural Network Object Detectors
Towards Pose-invariant Lip-Reading
CartoonRenderer: An Instance-based Multi-Style Cartoon Image Translator
Localization Error Bounds For 5G mmWave Systems Under I/Q Imbalance: An Extended Version
Unveil stock correlation via a new tensor-based decomposition method
Double Circulant Self-Dual Codes From Generalized Cyclotomic Classes Modulo 2p
$\{-1,0,1\}$-APSP and (min,max)-Product Problems
Delta-Bose gas on a half-line and the KPZ equation: boundary bound states and unbinding transitions
Proper Jordan schemes exist. First examples, computer search, patterns of reasoning. An essay
Enhanced Meta-Learning for Cross-lingual Named Entity Recognition with Minimal Resources
Deep Learning for Over-the-Air Non-Orthogonal Signal Classification
Uncertainty Quantification in Ensembles of Honest Regression Trees using Generalized Fiducial Inference
Convolutional Neural Network for Convective Storm Nowcasting Using 3D Doppler Weather Radar Data
Concordance probability in a big data setting: application in non-life insurance
SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines
An Improved Tobit Kalman Filter with Adaptive Censoring Limits
Election Control in Social Networks via Edge Addition or Removal
Facets, weak facets, and extreme functions of the Gomory-Johnson infinite group problem
Estimating differential entropy using recursive copula splitting
Bayesian state-space modeling for analyzing heterogeneous network effects of US monetary policy
Analysis of the fiber laydown quality in spunbond processes with simulation experiments evaluated by blocked neural networks
Sparse Density Estimation with Measurement Errors
Dectecting Invasive Ductal Carcinoma with Semi-Supervised Conditional GANs
On Network Embedding for Machine Learning on Road Networks: A Case Study on the Danish Road Network
Thouless time analysis of Anderson and many-body localization transitions
Location estimation for symmetric log-concave densities
Beyond Pairwise Comparisons in Social Choice: A Setwise Kemeny Aggregation Problem
Convergence to scale-invariant Poisson processes and applications in Dickman approximation
On (Excessive) Transverse Coordinates for Orbital Stabilization of Periodic Motions
Coarse-graining via EDP-convergence for linear fast-slow reaction systems
Robust Beamforming Design for Intelligent Reflecting Surface Aided MISO Communication Systems
Disorder-induced two-body localised state in interacting quantum walks
Millimeter Wave Base Stations with Cameras: Vision Aided Beam and Blockage Prediction
A Comparative Study between Bayesian and Frequentist Neural Networks for Remaining Useful Life Estimation in Condition-Based Maintenance
ViWi: A Deep Learning Dataset Framework for Vision-Aided Wireless Communications
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Restricted Boltzmann Machines for galaxy morphology classification with a quantum annealer
Flexibility and movability in Cayley graphs
Multiplicative functions that are close to their mean
Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement
A regression algorithm for accelerated lattice QCD that exploits sparse inference on the D-Wave quantum annealer
High-Fidelity Large-Signal Order Reduction Approach for Composite Load Model
RLC Circuits based Distributed Mirror Descent Method
Colourings of star systems
LGN-CNN: a biologically inspired CNN architecture
Harnessing spatial MRI normalization: patch individual filter layers for CNNs
Self-Supervised Learning of State Estimation for Manipulating Deformable Linear Objects
Primal-dual block-proximal splitting for a class of non-convex problems
DomainGAN: Generating Adversarial Examples to Attack Domain Generation Algorithm Classifiers
Importance sampling for a robust and efficient multilevel Monte Carlo estimator for stochastic reaction networks
Scalable Exact Inference in Multi-Output Gaussian Processes
Refining Tournament Solutions via Margin of Victory
On Ehrhart positivity of Tesler polytopes and their deformations
Deep Reinforcement Learning for Adaptive Traffic Signal Control
Exponential Runge Kutta time semidiscretizations with low regularity initial data
Fetal Head and Abdomen Measurement Using Convolutional Neural Network, Hough Transform, and Difference of Gaussian Revolved along Elliptical Path (Dogell) Algorithm
rFIA: An R package for space-time estimation of forest attributes with the Forest Inventory and Analysis Database
Random walks on finite nilpotent groups driven by long-jump measures
Predicting sparse circle maps from their dynamics
Gradientless Descent: High-Dimensional Zeroth-Order Optimization
The Canonical Distortion Measure for Vector Quantization and Function Approximation

What’s new on arXiv

Probabilistic Similarity Networks

Normative expert systems have not become commonplace because they have been difficult to build and use. Over the past decade, however, researchers have developed the influence diagram, a graphical representation of a decision maker’s beliefs, alternatives, and preferences that serves as the knowledge base of a normative expert system. Most people who have seen the representation find it intuitive and easy to use. Consequently, the influence diagram has overcome significantly the barriers to constructing normative expert systems. Nevertheless, building influence diagrams is not practical for extremely large and complex domains. In this book, I address the difficulties associated with the construction of the probabilistic portion of an influence diagram, called a knowledge map, belief network, or Bayesian network. I introduce two representations that facilitate the generation of large knowledge maps. In particular, I introduce the similarity network, a tool for building the network structure of a knowledge map, and the partition, a tool for assessing the probabilities associated with a knowledge map. I then use these representations to build Pathfinder, a large normative expert system for the diagnosis of lymph-node diseases (the domain contains over 60 diseases and over 100 disease findings). In an early version of the system, I encoded the knowledge of the expert using an erroneous assumption that all disease findings were independent, given each disease. When the expert and I attempted to build a more accurate knowledge map for the domain that would capture the dependencies among the disease findings, we failed. Using a similarity network, however, we built the knowledge-map structure for the entire domain in approximately 40 hours. Furthermore, the partition representation reduced the number of probability assessments required by the expert from 75,000 to 14,000.

Towards Hierarchical Importance Attribution: Explaining Compositional Semantics for Neural Sequence Models

The impressive performance of neural networks on natural language processing tasks is attributed to their ability to model complicated word and phrase interactions. Existing flat, word-level explanations of predictions hardly unveil how neural networks handle compositional semantics to reach predictions. To tackle the challenge, we study hierarchical explanation of neural network predictions. We identify non-additivity and independent importance attributions within hierarchies as two desirable properties for highlighting word and phrase interactions. We show, however, that prior efforts on hierarchical explanations, e.g. contextual decomposition, do not satisfy the desired properties mathematically. In this paper, we propose a formal way to quantify the importance of each word or phrase for hierarchical explanations. Following the formulation, we propose the Sampling and Contextual Decomposition (SCD) algorithm and the Sampling and Occlusion (SOC) algorithm. Human and metrics evaluations on both LSTM models and BERT Transformer models on multiple datasets show that our algorithms outperform prior hierarchical explanation algorithms. Our algorithms apply to hierarchical visualization of compositional semantics, extraction of classification rules, and improving human trust in models.

Instance-based Transfer Learning for Multilingual Deep Retrieval

Perhaps the simplest type of multilingual transfer learning is instance-based transfer learning, in which data from the target language and the auxiliary languages are pooled, and a single model is learned from the pooled data. It is not immediately obvious when instance-based transfer learning will improve performance in this multilingual setting: for instance, a plausible conjecture is this kind of transfer learning would help only if the auxiliary languages were very similar to the target. Here we show that at large scale, this method is surprisingly effective, leading to positive transfer on all of 35 target languages we tested. We analyze this improvement and argue that the most natural explanation, namely direct vocabulary overlap between languages, only partially explains the performance gains: in fact, we demonstrate target-language improvement can occur after adding data from an auxiliary language with no vocabulary in common with the target. This surprising result is due to the effect of transitive vocabulary overlaps between pairs of auxiliary and target languages.

Interactive Attention for Semantic Text Matching

Semantic text matching, which matches a target text to a source text, is a general problem in many domains like information retrieval, question answering, and recommendation. There are several challenges for this problem, such as semantic gaps between words, implicit matching, and mismatch due to out-of-vocabulary or low-frequency words. Most existing studies have tried to overcome these challenges by learning good representations for different text pieces or operating on global matching signals to get the matching score. However, they did not learn the local fine-grained interactive information for a specific source and target pair. In this paper, we propose a novel interactive attention model for semantic text matching, which learns new representations for source and target texts through interactive attention via a global matching matrix and updates local fine-grained relevance between source and target. Our model can enrich the representations of source and target objects by adopting global relevance and learned local fine-grained relevance. The enriched representations of source and target encode the global and local relevance of each other and can therefore strengthen the semantic matching of texts. We conduct empirical evaluations of our model with three applications: biomedical literature retrieval, tweet and news linking, and factoid question answering. Experimental results on three data sets demonstrate that our model significantly outperforms competitive baseline methods.

KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation

Pre-trained language representation models (PLMs) learn effective language representations from large-scale unlabeled corpora. Knowledge embedding (KE) algorithms encode the entities and relations in knowledge graphs into informative embeddings to do knowledge graph completion and provide external knowledge for various NLP applications. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which not only better integrates factual knowledge into PLMs but also effectively learns knowledge graph embeddings. KEPLER uses a PLM to encode textual descriptions of entities as their entity embeddings, and then jointly learns the knowledge embeddings and language representations. Experimental results on various NLP tasks such as relation extraction and entity typing show that KEPLER can achieve results comparable to the state-of-the-art knowledge-enhanced PLMs without any additional inference overhead. Furthermore, we construct Wikidata5m, a new large-scale knowledge graph dataset with aligned text descriptions, to evaluate KE methods in both the traditional transductive setting and the challenging inductive setting, which requires the models to predict entity embeddings for unseen entities. Experiments demonstrate that KEPLER can achieve good results in both settings.

What do you mean, BERT? Assessing BERT as a Distributional Semantics Model

Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.

Learning internal representations

Probably the most important problem in machine learning is the preliminary biasing of a learner’s hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for automatically learning or biasing the learner’s hypothesis space is introduced. It works by first learning an appropriate internal representation for a learning environment and then using that representation to bias the learner’s hypothesis space for the learning of future tasks drawn from the same environment. An internal representation must be learnt by sampling from many similar tasks, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples m per task required to ensure good generalisation from a representation learner obeys m = O(a+b/n) where n is the number of tasks being learnt and a and b are constants. If the tasks are learnt independently (i.e. without a common representation) then m = O(a+b). It is argued that for learning environments such as speech and character recognition b \gg a and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if n = O(b) (with m = O(a+b/n)) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to O(a) (as opposed to O(a+b) if no representation is used). It is shown that gradient descent can be used to train neural network representations and experiment results are reported providing strong qualitative support for the theoretical results.
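
To make the scaling m = O(a + b/n) concrete, here is a small arithmetic sketch. It treats the bound as if the hidden constant were 1, and the values of a and b are invented for illustration only (the abstract does not give any); the function name is hypothetical.

```python
def examples_per_task(a, b, n_tasks):
    """Leading term a + b/n of the per-task bound, with the hidden constant taken as 1."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return a + b / n_tasks


# Illustrative constants only, chosen so that b >> a as in the abstract's argument.
a, b = 10, 1000

for n in (1, 10, 100):
    print(n, examples_per_task(a, b, n))
```

With a single task the term equals a + b, the cost of learning independently; as n grows the b/n part shrinks and the per-task cost approaches a, which is the reduction the abstract describes.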

Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering

The Data Clustering (DC) problem is of central importance for the area of Machine Learning (ML), given its usefulness to represent data structural similarities from input spaces. Unlike Supervised Machine Learning (SML), which relies on the theoretical frameworks of the Statistical Learning Theory (SLT) and the Algorithm Stability (AS), DC has scarce literature on general-purpose learning guarantees, affecting conclusive remarks on how those algorithms should be designed as well as on the validity of their results. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions under which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of the DC problem by relying on Topological Spaces, instead of metric ones as typically approached in the literature. From that, we show that the DC problem presents a dilemma analogous to the Bias-Variance one, which is here referred to as the Coarse-Refinement (CR) dilemma. CR is intended to clarify the contrast between: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting); consequently, the CR dilemma suggests the need for a relaxation of Kleinberg’s richness axiom. Experimental results were used to illustrate that multidimensional persistent homology supports the measurement of divergences among DC models, leading to a consistency criterion.

All-Spin Bayesian Neural Networks

Probabilistic machine learning enabled by the Bayesian formulation has recently gained significant attention in the domain of automated reasoning and decision-making. While impressive strides have been made recently to scale up the performance of deep Bayesian neural networks, they have been primarily standalone software efforts without any regard to the underlying hardware implementation. In this paper, we propose an ‘All-Spin’ Bayesian Neural Network where the underlying spintronic hardware provides a better match to the Bayesian computing models. To the best of our knowledge, this is the first exploration of a Bayesian neural hardware accelerator enabled by emerging post-CMOS technologies. We develop an experimentally calibrated device-circuit-algorithm co-simulation framework and demonstrate 23.6\times reduction in energy consumption against an iso-network CMOS baseline implementation.

A Reduction from Reinforcement Learning to No-Regret Online Learning

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which ‘any’ online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any \gamma-discounted tabular RL problem, with probability at least 1-\delta, it learns an \epsilon-optimal policy using at most \tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\log(\frac{1}{\delta})}{(1-\gamma)^4\epsilon^2}\right) samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of |\mathcal{S}|,|\mathcal{A}|, though at the cost of potential approximation bias.
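
The sample bound above can be explored numerically. The sketch below evaluates only the expression inside \tilde{O}, dropping the constants and polylogarithmic factors that the notation hides, so the numbers are relative indicators rather than actual sample counts. The function and parameter names are hypothetical.

```python
import math


def sample_bound_leading_term(n_states, n_actions, gamma, epsilon, delta):
    """Evaluate |S||A| log(1/delta) / ((1 - gamma)^4 * epsilon^2).

    Constants and polylog factors hidden by the tilde-O notation are ignored.
    """
    if not (0.0 <= gamma < 1.0 and epsilon > 0.0 and 0.0 < delta < 1.0):
        raise ValueError("need 0 <= gamma < 1, epsilon > 0 and 0 < delta < 1")
    numerator = n_states * n_actions * math.log(1.0 / delta)
    return numerator / ((1.0 - gamma) ** 4 * epsilon ** 2)


# Halving the effective horizon term (1 - gamma) multiplies the bound by 2^4 = 16.
base = sample_bound_leading_term(10, 2, gamma=0.5, epsilon=0.1, delta=0.1)
longer = sample_bound_leading_term(10, 2, gamma=0.75, epsilon=0.1, delta=0.1)
print(longer / base)
```

The ratio illustrates how strongly the fourth-power dependence on 1 - \gamma dominates the bound compared with the linear dependence on |\mathcal{S}| and |\mathcal{A}|.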

Adversarial Margin Maximization Networks

The tremendous recent success of deep neural networks (DNNs) has sparked a surge of interest in understanding their predictive ability. Unlike the human visual system, which is able to generalize robustly and learn with little supervision, DNNs normally require a massive amount of data to learn new concepts. In addition, research also shows that DNNs are vulnerable to adversarial examples: maliciously generated images that seem perceptually similar to natural ones but are actually crafted to fool learning models, which means the models have problems generalizing to unseen data with certain types of distortions. In this paper, we analyze the generalization ability of DNNs comprehensively and attempt to improve it from a geometric point of view. We propose adversarial margin maximization (AMM), a learning-based regularization which exploits an adversarial perturbation as a proxy. It encourages a large margin in the input space, just like support vector machines. With a differentiable formulation of the perturbation, we train the regularized DNNs simply through back-propagation in an end-to-end manner. Experimental results on various datasets (including MNIST, CIFAR-10/100, SVHN and ImageNet) and different DNN architectures demonstrate the superiority of our method over the previous state of the art. Code and models for reproducing our results will be made publicly available.

FAQ-based Question Answering via Knowledge Anchors

Question answering (QA) aims to understand user questions and find appropriate answers. In real-world QA systems, Frequently Asked Question (FAQ) based QA is usually a practical and effective solution, especially for some complicated questions (e.g., How and Why). Recent years have witnessed the great successes of knowledge graphs (KGs) utilized in KBQA systems, while there are still few works focusing on making full use of KGs in FAQ-based QA. In this paper, we propose a novel Knowledge Anchor based Question Answering (KAQA) framework for FAQ-based QA to better understand questions and retrieve more appropriate answers. More specifically, KAQA mainly consists of three parts: knowledge graph construction, query anchoring and query-document matching. We consider entities and triples of KGs in texts as knowledge anchors to precisely capture the core semantics, which brings higher precision and better interpretability. The multi-channel matching strategy also enables most sentence matching models to be flexibly plugged into our KAQA framework to fit different real-world computation costs. In experiments, we evaluate our models on a query-document matching task over a real-world FAQ-based QA dataset, with detailed analysis over different settings and cases. The results confirm the effectiveness and robustness of the KAQA framework in real-world FAQ-based QA.

An Efficient Hardware-Oriented Dropout Algorithm

This paper proposes a hardware-oriented dropout algorithm, which is efficient for field programmable gate array (FPGA) implementation. In deep neural networks (DNNs), overfitting occurs when networks are overtrained and adapt too well to training data. Consequently, they fail in predicting unseen data used as test data. Dropout is a common technique that is often applied in DNNs to overcome this problem. In general, implementing such training algorithms of DNNs in embedded systems is difficult due to power and memory constraints. Training DNNs is power-, time-, and memory- intensive; however, embedded systems require low power consumption and real-time processing. An FPGA is suitable for embedded systems for its parallel processing characteristic and low operating power; however, due to its limited memory and different architecture, it is difficult to apply general neural network algorithms. Therefore, we propose a hardware-oriented dropout algorithm that can effectively utilize the characteristics of an FPGA with less memory required. Software program verification demonstrates that the performance of the proposed method is identical to that of conventional dropout, and hardware synthesis demonstrates that it results in significant resource reduction.
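
For reference, the conventional dropout that the abstract uses as its baseline can be sketched as follows. This is the standard "inverted dropout" formulation, not the paper's hardware-oriented variant, whose details the abstract does not give; the function name and signature are assumptions made for illustration.

```python
import random


def dropout(activations, drop_prob, training=True, rng=None):
    """Standard inverted dropout on a list of activations.

    During training each unit is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob) so the expected value is
    unchanged. At inference time the input is returned as is.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Usage: roughly half of the units survive, each scaled from 1.0 to 2.0.
out = dropout([1.0] * 8, drop_prob=0.5, rng=random.Random(0))
print(out)
```

Note that this software version draws one random number per unit on every training pass.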

Hiding in Multilayer Networks

Multilayer networks allow for modeling complex relationships, where individuals are embedded in multiple social networks at the same time. Given the ubiquity of such relationships, these networks have been increasingly gaining attention in the literature. This paper presents the first analysis of the robustness of centrality measures against strategic manipulation in multilayer networks. More specifically, we consider an ‘evader’ who strategically chooses which connections to form in a multilayer network in order to obtain a low centrality-based ranking, thereby reducing the chance of being highlighted as a key figure in the network, while ensuring that she remains connected to a certain group of people. We prove that determining an optimal way to ‘hide’ is NP-complete and hard to approximate for most centrality measures considered in our study. Moreover, we empirically evaluate a number of heuristics that the evader can use. Our results suggest that the centrality measures that are functions of the entire network topology are more robust to such a strategic evader than their counterparts which consider each layer separately.

HUSE: Hierarchical Universal Semantic Embeddings

There is a recent surge of interest in cross-modal representation learning corresponding to images and text. The main challenge lies in mapping images and text to a shared latent space where the embeddings corresponding to a similar semantic concept lie closer to each other than the embeddings corresponding to different semantic concepts, irrespective of the modality. Ranking losses are commonly used to create such a shared latent space; however, they do not impose any constraints on inter-class relationships, so neighboring clusters can be completely unrelated. The works in the domain of visual semantic embeddings address this problem by first constructing a semantic embedding space based on some external knowledge and projecting image embeddings onto this fixed semantic embedding space. These works are confined to the image domain, and constraining the embeddings to a fixed space adds an additional burden on learning. This paper proposes a novel method, HUSE, to learn cross-modal representation with semantic information. HUSE learns a shared latent space where the distance between any two universal embeddings is similar to the distance between their corresponding class embeddings in the semantic embedding space. HUSE also uses a classification objective with a shared classification layer to make sure that the image and text embeddings are in the same shared latent space. Experiments on UPMC Food-101 show our method outperforms the previous state of the art on retrieval, hierarchical precision and classification results.

A Recurrent Probabilistic Neural Network with Dimensionality Reduction Based on Time-series Discriminant Component Analysis

This paper proposes a probabilistic neural network developed on the basis of time-series discriminant component analysis (TSDCA) that can be used to classify high-dimensional time-series patterns. TSDCA involves the compression of high-dimensional time series into a lower-dimensional space using a set of orthogonal transformations and the calculation of posterior probabilities based on a continuous-density hidden Markov model with a Gaussian mixture model expressed in the reduced-dimensional space. The analysis can be incorporated into a neural network, which is named a time-series discriminant component network (TSDCN), so that parameters of dimensionality reduction and classification can be obtained simultaneously as network coefficients according to a backpropagation through time-based learning algorithm with the Lagrange multiplier method. The TSDCN is considered to enable high-accuracy classification of high-dimensional time-series patterns and to reduce the computation time taken for network training. The validity of the TSDCN is demonstrated for high-dimensional artificial data and EEG signals in the experiments conducted during the study.

Robust Parameter-Free Season Length Detection in Time Series

The in-depth analysis of time series has gained a lot of research interest in recent years, with the identification of periodic patterns being one important aspect. Many of the methods for identifying periodic patterns require time series’ season length as input parameter. There exist only a few algorithms for automatic season length approximation. Many of these rely on simplifications such as data discretization and user defined parameters. This paper presents an algorithm for season length detection that is designed to be sufficiently reliable to be used in practical applications and does not require any input other than the time series to be analyzed. The algorithm estimates a time series’ season length by interpolating, filtering and detrending the data. This is followed by analyzing the distances between zeros in the directly corresponding autocorrelation function. Our algorithm was tested against a comparable algorithm and outperformed it by passing 122 out of 165 tests, while the existing algorithm passed 83 tests. The robustness of our method can be jointly attributed to both the algorithmic approach and also to design decisions taken at the implementational level.
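
The pipeline described above (detrend, compute the autocorrelation function, measure the distances between its zeros) can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: it skips the interpolation and filtering steps, uses a plain least-squares linear detrend, and assumes that consecutive zeros of the autocorrelation are about half a season apart, so the estimate is twice the median gap. All function names are hypothetical.

```python
import math
from statistics import median


def detrend_linear(series):
    """Subtract the least-squares straight line from the series."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return [y - y_mean - slope * (t - t_mean) for t, y in enumerate(series)]


def autocorrelation(series, max_lag):
    """Biased sample autocorrelation of a zero-mean series for lags 0..max_lag."""
    n = len(series)
    c0 = sum(v * v for v in series)
    return [sum(series[t] * series[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def zero_crossings(values):
    """Linearly interpolated positions where the sequence changes sign."""
    crossings = []
    for k in range(1, len(values)):
        prev, cur = values[k - 1], values[k]
        if (prev > 0) != (cur > 0):
            crossings.append((k - 1) + prev / (prev - cur))
    return crossings


def estimate_season_length(series):
    """Return the estimated season length, or None if too few zeros are found."""
    residual = detrend_linear(series)
    acf = autocorrelation(residual, len(series) // 2)
    zeros = zero_crossings(acf)
    if len(zeros) < 2:
        return None
    gaps = [b - a for a, b in zip(zeros, zeros[1:])]
    return round(2 * median(gaps))


# Synthetic example: a season of 12 samples on top of a linear trend.
series = [math.sin(2 * math.pi * t / 12) + 0.01 * t for t in range(240)]
print(estimate_season_length(series))
```

Using the median gap rather than the mean makes the estimate less sensitive to a few spurious crossings, which matters once real, noisy data replaces the clean synthetic series used here.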

Semantic Granularity Metric Learning for Visual Search

Deep metric learning applied to various applications has shown promising results in identification, retrieval and recognition. Existing methods often do not consider different granularity in visual similarity. However, in many domain applications, images exhibit similarity at multiple granularities with visual semantic concepts, e.g. fashion demonstrates similarity ranging from clothing of the exact same instance to similar looks/design or a common category. Therefore, training image triplets/pairs used for metric learning inherently possess different degrees of information. However, existing methods often treat them with equal importance during training. This hinders capturing the underlying granularities in feature similarity required for effective visual search. In view of this, we propose a new deep semantic granularity metric learning (SGML) method that leverages the attribute semantic space to capture different granularities of similarity and then integrates this information into deep metric learning. The proposed method simultaneously learns image attributes and embeddings using multitask CNNs. The two tasks are not only jointly optimized but are further linked by the semantic granularity similarity mappings to leverage the correlations between the tasks. To this end, we propose a new soft-binomial deviance loss that effectively integrates the degree of information in training samples, which helps to capture visual similarity at multiple granularities. Compared to recent ensemble-based methods, our framework is conceptually elegant, computationally simple and provides better performance. We perform extensive experiments on benchmark metric learning datasets and demonstrate that our method outperforms recent state-of-the-art methods, e.g., 1-4.5\% improvement in Recall@1 over the previous state of the art [1],[2] on the DeepFashion In-Shop dataset.

A Bayesian/Information Theoretic Model of Bias Learning

In this paper the problem of learning appropriate bias for an environment of related tasks is examined from a Bayesian perspective. The environment of related tasks is shown to be naturally modelled by the concept of an objective prior distribution. Sampling from the objective prior corresponds to sampling different learning tasks from the environment. It is argued that for many common machine learning problems, although we don’t know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by sampling from the objective prior. Bounds are given on the amount of information required to learn a task when it is simultaneously learnt with several other tasks. The bounds show that if the learner has little knowledge of the true prior, and the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous.

Learning Model Bias

In this paper the problem of learning appropriate domain-specific bias is addressed. It is shown that this can be achieved by learning many related tasks from the same domain, and a theorem is given bounding the number of tasks that must be learnt. A corollary of the theorem is that if the tasks are known to possess a common internal representation or preprocessing then the number of examples required per task for good generalisation when learning n tasks simultaneously scales like O(a + \frac{b}{n}), where O(a) is a bound on the minimum number of examples required to learn a single task, and O(a + b) is a bound on the number of examples required to learn each task independently. An experiment providing strong qualitative support for the theoretical results is reported.

Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems

We use a novel modification of Multi-Armed Bandits to create a new model for recommendation systems. We model the recommendation system as a bandit seeking to maximize reward by pulling on arms with unknown rewards. The catch however is that this bandit can only access these arms through an unreliable intermediate that has some level of autonomy while choosing its arms. For example, in a streaming website the user has a lot of autonomy while choosing content they want to watch. The streaming sites can use targeted advertising as a means to bias opinions of these users. Here the streaming site is the bandit aiming to maximize reward and the user is the unreliable intermediate. We model the intermediate as accessing states via a Markov chain. The bandit is allowed to perturb this Markov chain. We prove fundamental theorems for this setting after which we show a close-to-optimal Explore-Commit algorithm.
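
For readers unfamiliar with the Explore-Commit idea mentioned above, here is a sketch of the classical explore-then-commit strategy on ordinary Bernoulli arms. It does not model the paper's unreliable intermediate or its Markov chain; the arm means, the exploration budget, and the function name are all illustrative assumptions.

```python
import random


def explore_then_commit(arm_means, pulls_per_arm, horizon, seed=0):
    """Classical explore-then-commit on Bernoulli arms.

    Pull every arm pulls_per_arm times, then commit to the arm with the best
    empirical total for the rest of the horizon.
    Returns (committed_arm_index, total_reward).
    """
    rng = random.Random(seed)

    def pull(arm):
        # Reward is 1 with probability arm_means[arm], otherwise 0.
        return 1.0 if rng.random() < arm_means[arm] else 0.0

    totals = [0.0] * len(arm_means)
    total_reward = 0.0

    # Exploration phase: sample each arm the same number of times.
    for arm in range(len(arm_means)):
        for _ in range(pulls_per_arm):
            reward = pull(arm)
            totals[arm] += reward
            total_reward += reward

    # Commit phase: exploit the empirically best arm for the remaining rounds.
    best_arm = max(range(len(arm_means)), key=lambda a: totals[a])
    remaining = horizon - pulls_per_arm * len(arm_means)
    for _ in range(remaining):
        total_reward += pull(best_arm)

    return best_arm, total_reward


best, reward = explore_then_commit([0.2, 0.8], pulls_per_arm=50, horizon=1000)
print(best, reward)
```

The exploration budget trades off the cost of sampling poor arms against the risk of committing to the wrong one; in the paper's setting the bandit would additionally have to act through the intermediate rather than pull arms directly.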

Understanding Graph Neural Networks with Asymmetric Geometric Scattering Transforms

The scattering transform is a multilayered wavelet-based deep learning architecture that acts as a model of convolutional neural networks. Recently, several works have introduced generalizations of the scattering transform for non-Euclidean settings such as graphs. Our work builds upon these constructions by introducing windowed and non-windowed graph scattering transforms based upon a very general class of asymmetric wavelets. We show that these asymmetric graph scattering transforms have many of the same theoretical guarantees as their symmetric counterparts. This work helps bridge the gap between scattering and other graph neural networks by introducing a large family of networks with provable stability and invariance guarantees. This lays the groundwork for future deep learning architectures for graph-structured data that have learned filters and also provably have desirable theoretical properties.

Sato: Contextual Semantic Type Detection in Tables

Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns or rely on large sample sizes in the training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.901 and 0.973, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.

Real-time Anomaly Detection and Classification in Streaming PMU Data

Ensuring secure and reliable operations of the power grid is a primary concern of system operators. Phasor measurement units (PMUs) are rapidly being deployed in the grid to provide fast-sampled operational data that should enable quicker decision-making. This work presents a general interpretable framework for analyzing real-time PMU data, and thus enabling grid operators to understand the current state and to identify anomalies on the fly. Applying statistical learning tools on the streaming data, we first learn an effective dynamical model to describe the current behavior of the system. Next, we use the probabilistic predictions of our learned model to define in a principled way an efficient anomaly detection tool. Finally, the last module of our framework produces on-the-fly classification of the detected anomalies into common occurrence classes using features that grid operators are familiar with. We demonstrate the efficacy of our interpretable approach through extensive numerical experiments on real PMU data collected from a transmission operator in the USA.

Document worth reading: “Adversarial Examples in Modern Machine Learning: A Review”

Recent research has found that many families of machine learning models are vulnerable to adversarial examples: inputs that are specifically designed to cause the target model to produce erroneous outputs. In this survey, we focus on machine learning models in the visual domain, where methods for generating and detecting such examples have been most extensively studied. We explore a variety of adversarial attack methods that apply to image-space content, real world adversarial attacks, adversarial defenses, and the transferability property of adversarial examples. We also discuss strengths and weaknesses of various methods of adversarial attack and defense. Our aim is to provide an extensive coverage of the field, furnishing the reader with an intuitive understanding of the mechanics of adversarial attack and defense mechanisms and enlarging the community of researchers studying this fundamental set of problems.

If you did not already know

Autoencoding Binary Classifier (ABC)
We propose the Autoencoding Binary Classifiers (ABC), a novel supervised anomaly detector based on the Autoencoder (AE). There are two main approaches in anomaly detection: supervised and unsupervised. The supervised approach accurately detects the known anomalies included in training data, but it cannot detect the unknown anomalies. Meanwhile, the unsupervised approach can detect both known and unknown anomalies that are located away from normal data points. However, it does not detect known anomalies as accurately as the supervised approach. Furthermore, even if we have labeled normal data points and anomalies, the unsupervised approach cannot utilize these labels. The ABC is a probabilistic binary classifier that effectively exploits the label information, where normal data points are modeled using the AE as a component. By maximizing the likelihood, the AE in the proposed ABC is trained to minimize the reconstruction error for normal data points, and to maximize it for known anomalies. Since our approach becomes able to reconstruct the normal data points accurately and fails to reconstruct the known and unknown anomalies, it can accurately discriminate both known and unknown anomalies from normal data points. Experimental results show that the ABC achieves higher detection performance than existing supervised and unsupervised methods. …
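The training objective described above can be sketched in a few lines. The abstract does not spell out the likelihood, so this sketch assumes one simple choice: the probability that a point is normal is exp(-reconstruction error). The function names `abc_loss` and `anomaly_probability` are hypothetical, and the reconstruction error is taken as a given positive number rather than computed by a real autoencoder.

```python
import math

def anomaly_probability(recon_error):
    """Assumed model: P(anomaly | x) = 1 - exp(-reconstruction error)."""
    return -math.expm1(-recon_error)

def abc_loss(recon_error, label):
    """Negative log-likelihood for one point (label 0 = normal, 1 = known anomaly).

    recon_error must be positive. Under the assumed model, the loss for a
    normal point is the reconstruction error itself, so minimizing it pulls
    the error down; the loss for an anomaly shrinks as the error grows.
    """
    if label == 0:
        return recon_error
    return -math.log(anomaly_probability(recon_error))

# A well-reconstructed normal point is cheap; a well-reconstructed anomaly is expensive.
print(abc_loss(0.1, 0), abc_loss(0.1, 1), abc_loss(5.0, 1))
```

At test time the same quantity, `anomaly_probability`, can serve as the anomaly score, which is why points that reconstruct poorly are flagged whether or not their anomaly type was seen in training.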

Parikh Matrix
Parikh Matrices are a newly developed tool for studying numerical properties of words in terms of their (scattered) subwords. They were introduced by Mateescu et al. in 2000 and have continuously received attention from the research community ever since.
Mateescu et al. (2000) introduced an interesting new tool, called the Parikh matrix, to study the numerical properties of words over an alphabet in terms of subwords. The Parikh matrix gives more information than the well-known Parikh vector of a word, which counts only the occurrences of symbols in a word. …
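A small sketch may make the construction concrete. For an ordered alphabet of k letters, the Parikh matrix is a (k+1) x (k+1) upper triangular matrix obtained by multiplying one elementary matrix per letter of the word; the entry in row i, column j+1 counts the scattered subwords made of the consecutive letters i through j. The function name `parikh_matrix` is an illustrative choice, not taken from the source.

```python
def parikh_matrix(word, alphabet):
    """Parikh matrix of `word` over the ordered `alphabet` (a string or list)."""
    k = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Start from the (k+1) x (k+1) identity matrix.
    m = [[1 if i == j else 0 for j in range(k + 1)] for i in range(k + 1)]
    for ch in word:
        q = index[ch]
        # The letter's matrix is the identity plus a 1 at position (q, q+1).
        # Right-multiplying by it adds column q to column q+1.
        for row in m:
            row[q + 1] += row[q]
    return m

m = parikh_matrix("abab", "ab")
for row in m:
    print(row)
# The entries just above the diagonal form the Parikh vector (letter counts);
# the top-right entry counts occurrences of "ab" as a scattered subword.
```

For the word "abab" the letter counts are 2 and 2, and "ab" occurs three times as a scattered subword, which is exactly the extra information the Parikh vector alone does not carry.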

Graph Node-Feature Convolution
Graph convolutional network (GCN) is an emerging neural network approach. It learns new representation of a node by aggregating feature vectors of all neighbors in the aggregation process without considering whether the neighbors or features are useful or not. Recent methods have improved solutions by sampling a fixed size set of neighbors, or assigning different weights to different neighbors in the aggregation process, but features within a feature vector are still treated equally in the aggregation process. In this paper, we introduce a new convolution operation on regular size feature maps constructed from features of a fixed node bandwidth via sampling to get the first-level node representation, which is then passed to a standard GCN to learn the second-level node representation. Experiments show that our method outperforms competing methods in semi-supervised node classification tasks. Furthermore, our method opens new doors for exploring new GCN architectures, particularly deeper GCN models. …
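To illustrate the baseline behavior the abstract criticizes, here is a deliberately simplified sketch of neighbor aggregation: each node's new vector is the plain mean of its own and its neighbors' feature vectors. It omits the learned weight matrix and nonlinearity of a real GCN layer, and it is not the node-feature convolution proposed in the paper. The function name `mean_aggregate` and the toy graph are assumptions made for the example.

```python
def mean_aggregate(adjacency, features):
    """Average each node's features with those of all its neighbors.

    adjacency: list where adjacency[v] is the list of neighbors of node v.
    features:  list where features[v] is the feature vector of node v.
    """
    aggregated = []
    for v, neighbors in enumerate(adjacency):
        group = [v] + list(neighbors)          # include the node itself
        dim = len(features[v])
        aggregated.append(
            [sum(features[u][d] for u in group) / len(group) for d in range(dim)]
        )
    return aggregated

# Toy path graph 0 - 1 - 2 with two features per node (made-up values).
adjacency = [[1], [0, 2], [1]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_aggregate(adjacency, features))
```

Every neighbor and every feature dimension receives the same weight here, which is the equal treatment that sampling, attention, and the paper's feature-level convolution each try to relax.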

Higher-Order Kolmogorov-Smirnov Test
We present an extension of the Kolmogorov-Smirnov (KS) two-sample test, which can be more sensitive to differences in the tails. Our test statistic is an integral probability metric (IPM) defined over a higher-order total variation ball, recovering the original KS test as its simplest case. We give an exact representer result for our IPM, which generalizes the fact that the original KS test statistic can be expressed in equivalent variational and CDF forms. For small enough orders ($k \leq 5$), we develop a linear-time algorithm for computing our higher-order KS test statistic; for all others ($k \geq 6$), we give a nearly linear-time approximation. We derive the asymptotic null distribution for our test, and show that our nearly linear-time approximation shares the same asymptotic null. Lastly, we complement our theory with numerical studies. …
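For reference, the classical two-sample KS statistic, which the abstract describes as the simplest case of the new test, can be sketched as the largest gap between the two empirical CDFs. This sketch covers only that classical statistic in its CDF form; it does not implement the higher-order version or its linear-time algorithm. The function name `ks_statistic` is an illustrative choice.

```python
from bisect import bisect_right

def ks_statistic(xs, ys):
    """Two-sample KS statistic: max over t of |F_x(t) - F_y(t)|."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    largest_gap = 0.0
    # The empirical CDFs only change at observed points, so checking those suffices.
    for t in xs + ys:
        cdf_x = bisect_right(xs, t) / n   # fraction of xs that are <= t
        cdf_y = bisect_right(ys, t) / m   # fraction of ys that are <= t
        largest_gap = max(largest_gap, abs(cdf_x - cdf_y))
    return largest_gap

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Because the statistic is a single maximum gap, differences confined to the tails move it very little, which is the weakness the higher-order extension targets.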

Document worth reading: “Optimization Models for Machine Learning: A Survey”

This paper surveys the machine learning literature and presents machine learning as optimization models. Such models can benefit from the advancement of numerical optimization techniques which have already played a distinctive role in several machine learning settings. Particularly, mathematical optimization models are presented for commonly used machine learning approaches for regression, classification, clustering, and deep neural networks as well as new emerging applications in machine teaching and empirical model learning. The strengths and the shortcomings of these models are discussed and potential research directions are highlighted.

Distilled News

Multi-Label Text Classification with XLNet

Achieve state-of-the-art multi-label and multi-class text classification with XLNet. At the time of its publication on 19 June 2019, XLNet achieved state-of-the-art results on 18 tasks including text classification, question-answering, natural language inference, sentiment analysis, and document ranking. It even outperformed BERT on 20 tasks! Developed by Carnegie Mellon University and Google Brain, XLNet is a permutation-based auto-regressive language model. We will not delve too much into the inner workings of the model as there are a lot of great resources out there for this purpose. Rather, this article will focus on the application of XLNet to the problem of multi-label and multi-class text classification.
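The difference between the two tasks shows up in how a model's output scores are decoded, independently of XLNet itself. The sketch below assumes the model has already produced one raw score (logit) per label; the label names, scores, and 0.5 threshold are made up for illustration. Multi-class picks exactly one label through a softmax, while multi-label treats each label as its own yes/no decision through a sigmoid.

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_multiclass(logits, labels):
    """Exactly one label: the one with the highest softmax probability."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

def predict_multilabel(logits, labels, threshold=0.5):
    """Any number of labels: every label whose sigmoid probability clears the threshold."""
    return [lab for z, lab in zip(logits, labels) if sigmoid(z) >= threshold]

labels = ["sports", "politics", "tech"]       # hypothetical label set
logits = [2.0, 1.0, -3.0]                     # hypothetical model outputs
print(predict_multiclass(logits, labels))
print(predict_multilabel(logits, labels))
```

With these scores the multi-class decoder returns a single label, while the multi-label decoder keeps both labels whose scores are positive.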

Five Open Source Reference Architectures Designed to Build Machine Learning at Scale

From Uber to Facebook: a look at the architectures used to power the machine learning workloads of the internet giants. Despite the hype surrounding machine learning and artificial intelligence (AI), most efforts in the enterprise remain in a pilot stage. Part of the reason for this phenomenon is the natural experimentation associated with machine learning projects, but there is also a significant component related to the lack of maturity of machine learning architectures. This problem is particularly visible in enterprise environments, in which the new application lifecycle management practices of modern machine learning solutions conflict with corporate practices and regulatory requirements. What are the key architecture building blocks that organizations should put in place when adopting machine learning solutions? The answer is not trivial, but recently we have seen some efforts from research labs and AI data science that are starting to lay down the path of what can become reference architectures for large scale machine learning solutions.

Workflow Tools for Model Pipelines

Airflow is becoming the industry standard for authoring data engineering and model pipeline workflows. This chapter of my book explores the process of taking a simple pipeline that runs on a single EC2 instance to a fully-managed Kubernetes ecosystem responsible for scheduling tasks. This post omits the sections on the fully-managed solutions with GKE and Cloud Composer.

Artificial Intelligence on Edge devices: an engineering led approach

Artificial Intelligence – Cloud and Edge implementations takes an engineering-led approach for the deployment of AI to Edge devices within the framework of the cloud. We often use the word ‘engineering’ in casual conversation. However, in this context, we attach a specific meaning to Engineering. Engineering is the use of scientific principles to design and build machines, structures, and other items, including bridges, tunnels, roads, vehicles, and buildings. The American Engineers’ Council for Professional Development defines engineering as: (specific emphasis of interest highlighted)

Statistical uncertainty with R and pdqr

Statistical estimation usually has the following setup. There is a sample (an observed, usually randomly chosen, set of values of measurable quantities) from some general population (the whole set of values of the same measurable quantities). We need to draw conclusions about the general population based on the sample. This is done by computing summary values (called statistics) of the sample and making reasonable assumptions (a process usually called inference) about how close these values are to the values that could potentially be computed from the whole general population. Thus, a summary value based on a sample (sample statistic) is an estimate of the potential summary value based on the general population (true value). How can we make inferences about the quality of this estimate? This question itself describes statistical uncertainty and can be unfolded into a deep philosophical question about probability, nature, and life in general. Basically, the answer depends on assumptions about the relation between the sample, the general population, and the statistic.
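One common way to put a number on this uncertainty is the bootstrap, sketched here in Python with the standard library rather than with R and pdqr as in the article. The assumption behind it is that the sample stands in for the general population, so resampling the sample with replacement mimics drawing new samples. The function name `bootstrap_standard_error` and the data values are made up for illustration.

```python
import random
import statistics

def bootstrap_standard_error(sample, statistic, n_resamples=2000, seed=0):
    """Estimate how much `statistic` varies from sample to sample.

    Resamples `sample` with replacement, recomputes the statistic each time,
    and returns the standard deviation of those recomputed values.
    """
    rng = random.Random(seed)                 # fixed seed for reproducibility
    n = len(sample)
    estimates = [
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]   # made-up observations
print("sample mean:", statistics.mean(sample))
print("bootstrap standard error:", bootstrap_standard_error(sample, statistics.mean))
```

The sample mean is the estimate; the bootstrap standard error is one answer to how far that estimate might plausibly sit from the true value, valid only to the extent that the sample resembles the population.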

H2O Driverless AI: End-to-End Machine Learning (for anyone!)

In today’s world, being a Data Scientist is no longer limited to those with technical knowledge. While it is recommended and sometimes important to know a little bit of code, you can get by with just intuitive knowledge, especially if you’re on H2O’s Driverless AI platform. If you haven’t heard of H2O.ai, it is the company that created the open-source machine learning platform, H2O, which is used by many in the Fortune 500. H2O aims at creating efficiency-driven machine learning environments by leveraging its user-friendly interface and modular capabilities.

How to tune a Decision Tree?

How do the hyperparameters for a decision tree affect your model and how do you choose which ones to tune?

Build and Compare 3 Models – NLP Prediction

This project was created in an attempt to learn and understand how various classification algorithms work within a Natural Language Processing Model. Natural Language Processing, which I will now refer to as NLP, is a branch of machine learning that focuses on enabling computers to interpret and process human languages in both speech and text forms.

Protect your Deep Neural Network by Embedding Watermarks!

We have intellectual property (IP) protection watermarks on media content such as images and music. What about Deep Neural Networks (DNNs)?

Decision Trees: A Complete Introduction

Decision trees are one of many supervised learning algorithms available to anyone looking to make predictions of future events based on some historical data and, although there is no one generic tool optimal for all problems, decision trees are hugely popular and turn out to be very effective in many machine learning applications. To understand the intuition behind the decision tree, consider the problem of designing an algorithm to automatically differentiate between apples and pears (class labels) given only their width and height measurements (features).
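The apples-and-pears problem can be sketched with a tiny tree built from scratch. This is a minimal illustration, not the article's implementation: it assumes Gini impurity as the split criterion, the measurements are invented, and all function names are hypothetical. The two arguments `max_depth` and `min_samples_leaf` are the kind of hyperparameters one would tune.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, larger when they are mixed."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, labels, feature, threshold):
    """Split the data into (left, right) lists of (row, label) pairs."""
    pairs = list(zip(rows, labels))
    left = [(r, y) for r, y in pairs if r[feature] <= threshold]
    right = [(r, y) for r, y in pairs if r[feature] > threshold]
    return left, right

def best_split(rows, labels, min_samples_leaf):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        values = sorted(set(r[feature] for r in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left, right = partition(rows, labels, feature, threshold)
            if min(len(left), len(right)) < min_samples_leaf:
                continue
            score = (len(left) * gini([y for _, y in left])
                     + len(right) * gini([y for _, y in right])) / len(rows)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels, max_depth=2, min_samples_leaf=1):
    """Return a leaf (a class label) or a node (feature, threshold, left, right)."""
    split = best_split(rows, labels, min_samples_leaf) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    children = []
    for side in partition(rows, labels, feature, threshold):
        children.append(build_tree([r for r, _ in side], [y for _, y in side],
                                   max_depth - 1, min_samples_leaf))
    return (feature, threshold, children[0], children[1])

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree

# Invented (width, height) measurements in centimetres.
rows = [(7.5, 7.0), (8.0, 7.2), (7.8, 6.9), (6.0, 9.0), (6.5, 9.5), (5.8, 8.8)]
labels = ["apple", "apple", "apple", "pear", "pear", "pear"]
tree = build_tree(rows, labels, max_depth=2, min_samples_leaf=1)
print(tree)
print(predict(tree, (7.9, 7.1)), predict(tree, (6.2, 9.1)))
```

On this toy data a single split on width separates the two classes, so the tree stops after one level. Lowering `max_depth` or raising `min_samples_leaf` makes the tree simpler and less prone to memorizing noise, at the cost of possibly missing real structure.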

Building a Convolutional Neural Network for Image Classification with Tensorflow

A Convolutional Neural Network (CNN) is a special type of deep neural network that performs impressively in computer vision problems such as image classification, object detection, etc. In this article, we are going to create an image classifier with Tensorflow by implementing a CNN to classify cats & dogs. With traditional programming it is not possible to build scalable solutions for problems like computer vision, since it is not feasible to write an algorithm that is generalized enough to identify the nature of images. With machine learning, we can build an approximation that is sufficient for such use-cases by training a model on given examples and predicting on unseen data.