We consider the efficient use of an approximation within Markov chain Monte Carlo (MCMC), with subsequent importance sampling (IS) correction of the Markov chain inexact output, leading to asymptotically exact inference. We detail convergence and central limit theorems for the resulting MCMC-IS estimators. We also consider the case where the approximate Markov chain is pseudo-marginal, requiring unbiased estimators for its approximate marginal target. Convergence results with asymptotic variance formulae are shown for this case, and for the case where the IS weights based on unbiased estimators are only calculated for distinct output samples of the so-called jump’ chain, which, with a suitable reweighting, allows for improved efficiency. As the IS type weights may assume negative values, extended classes of unbiased estimators may be used for the IS type correction, such as those obtained from randomised multilevel Monte Carlo. Using Euler approximations and coupling of particle filters, we apply the resulting estimator using randomised weights to the problem of parameter inference for partially observed It\^{o} diffusions. Convergence of the estimator is verified to hold under regularity assumptions which do not require that the diffusion can be simulated exactly. In the context of approximate Bayesian computation (ABC), we suggest an adaptive MCMC approach to deal with the selection of a suitably large tolerance, with IS correction possible to finer tolerance, and with provided approximate confidence intervals. A prominent question is the efficiency of MCMC-IS compared to standard direct MCMC, such as pseudo-marginal, delayed acceptance, and ABC-MCMC. We provide a comparison criterion which generalises the covariance ordering to the IS setting.
Extreme Multi-label classification (XML) is an important yet challenging machine learning task, that assigns to each instance its most relevant candidate labels from an extremely large label collection, where the numbers of labels, features and instances could be thousands or millions. XML is more and more on demand in the Internet industries, accompanied with the increasing business scale / scope and data accumulation. The extremely large label collections yield challenges such as computational complexity, inter-label dependency and noisy labeling. Many methods have been proposed to tackle these challenges, based on different mathematical formulations. In this paper, we propose a deep learning XML method, with a word-vector-based self-attention, followed by a ranking-based AutoEncoder architecture. The proposed method has three major advantages: 1) the autoencoder simultaneously considers the inter-label dependencies and the feature-label dependencies, by projecting labels and features onto a common embedding space; 2) the ranking loss not only improves the training efficiency and accuracy but also can be extended to handle noisy labeled data; 3) the efficient attention mechanism improves feature representation by highlighting feature importance. Experimental results on benchmark datasets show the proposed method is competitive to state-of-the-art methods.
We propose a model selection criterion to detect purely causal from purely noncausal models in the framework of quantile autoregressions (QAR). We also present asymptotics for the i.i.d. case with regularly varying distributed innovations in QAR. This new modelling perspective is appealing for investigating the presence of bubbles in economic and financial time series, and is an alternative to approximate maximum likelihood methods. We illustrate our analysis using hyperinflation episodes in Latin American countries.
Motivated by the need of solving machine learning problems over distributed datasets, we explore the use of coreset to reduce the communication overhead. Coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, coreset has the advantage that it can be used as a proxy of the original dataset, potentially for different applications. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. Thus, to solve different machine learning problems, one has to collect coresets of different types, defeating the purpose of saving communication overhead. We resolve this dilemma by developing coreset construction algorithms based on k-means/median clustering, that give a provably good approximation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the robust performance of the proposed algorithms.
For a product of interest, we propose a search method to surface a set of reference products. The reference products can be used as candidates to support downstream modeling tasks and business applications. The search method consists of product representation learning and fingerprint-type vector searching. The product catalog information is transformed into a high-quality embedding of low dimensions via a novel attention auto-encoder neural network, and the embedding is further coupled with a binary encoding vector for fast retrieval. We conduct extensive experiments to evaluate the proposed method, and compare it with peer services to demonstrate its advantage in terms of search return rate and precision.
There is a significant conceptual gap between legal and mathematical thinking around data privacy. The effect is uncertainty as to which technical offerings adequately match expectations expressed in legal standards. The uncertainty is exacerbated by a litany of successful privacy attacks, demonstrating that traditional statistical disclosure limitation techniques often fall short of the sort of privacy envisioned by legal standards. We define predicate singling out, a new type of privacy attack intended to capture the concept of singling out appearing in the General Data Protection Regulation (GDPR). Informally, an adversary predicate singles out a dataset $X$ using the output of a data release mechanism $M(X)$ if it manages to find a predicate $p$ matching exactly one row $x \in X$ with probability much better than a statistical baseline. A data release mechanism that precludes such attacks is secure against predicate singling out (PSO secure). We argue that PSO security is a mathematical concept with legal consequences. Any data release mechanism that purports to ‘render anonymous’ personal data under the GDPR must be secure against singling out, and hence must be PSO secure. We then analyze PSO security, showing that it fails to self-compose. Namely, a combination of $\omega(\log n)$ exact counts, each individually PSO secure, enables an attacker to predicate single out. In fact, the composition of just two PSO-secure mechanisms can fail to provide PSO security. Finally, we ask whether differential privacy and $k$-anonymity are PSO secure. Leveraging a connection to statistical generalization, we show that differential privacy implies PSO security. However, $k$-anonymity does not: there exists a simple and general predicate singling out attack under mild assumptions on the $k$-anonymizer and the data distribution.
Batch normalization (BN) has been very effective for deep learning and is widely used. However, when training with small minibatches, models using BN exhibit a significant degradation in performance. In this paper we study this peculiar behavior of BN to gain a better understanding of the problem, and identify a potential cause based on a statistical insight. We propose EvalNorm’ to address the issue by estimating corrected normalization statistics to use for BN during evaluation. EvalNorm supports online estimation of the corrected statistics while the model is being trained, and it does not affect the training scheme of the model. As a result, an added advantage of EvalNorm is that it can be used with existing pre-trained models allowing them to benefit from our method. EvalNorm yields large gains for models trained with smaller batches. Our experiments show that EvalNorm performs 6.18% (absolute) better than vanilla BN for a batchsize of 2 on ImageNet validation set and from 1.5 to 7.0 points (absolute) gain on the COCO object detection benchmark across a variety of setups.
We propose a supervised anomaly detection method based on neural density estimators, where the negative log likelihood is used for the anomaly score. Density estimators have been widely used for unsupervised anomaly detection. By the recent advance of deep learning, the density estimation performance has been greatly improved. However, the neural density estimators cannot exploit anomaly label information, which would be valuable for improving the anomaly detection performance. The proposed method effectively utilizes the anomaly label information by training the neural density estimator so that the likelihood of normal instances is maximized and the likelihood of anomalous instances is lower than that of the normal instances. We employ an autoregressive model for the neural density estimator, which enables us to calculate the likelihood exactly. With the experiments using 16 datasets, we demonstrate that the proposed method improves the anomaly detection performance with a few labeled anomalous instances, and achieves better performance than existing unsupervised and supervised anomaly detection methods.
Deep Learning techniques have achieved remarkable results in many domains. Often, training deep learning models requires large datasets, which may require sensitive information to be uploaded to the cloud to accelerate training. To adequately protect sensitive information, we propose distributed layer-partitioned training with step-wise activation functions for privacy-preserving deep learning. Experimental results attest our method to be simple and effective.
In this paper, we study the problem of unifying knowledge from a set of classifiers with different architectures and target classes into a single classifier, given only a generic set of unlabelled data. We call this problem Unifying Heterogeneous Classifiers (UHC). This problem is motivated by scenarios where data is collected from multiple sources, but the sources cannot share their data, e.g., due to privacy concerns, and only privately trained models can be shared. In addition, each source may not be able to gather data to train all classes due to data availability at each source, and may not be able to train the same classification model due to different computational resources. To tackle this problem, we propose a generalisation of knowledge distillation to merge HCs. We derive a probabilistic relation between the outputs of HCs and the probability over all classes. Based on this relation, we propose two classes of methods based on cross-entropy minimisation and matrix factorisation, which allow us to estimate soft labels over all classes from unlabelled samples and use them in lieu of ground truth labels to train a unified classifier. Our extensive experiments on ImageNet, LSUN, and Places365 datasets show that our approaches significantly outperform a naive extension of distillation and can achieve almost the same accuracy as classifiers that are trained in a centralised, supervised manner.
Cycling is a green transportation mode, and is promoted by many governments to mitigate traffic congestion. However, studies concerning the traffic dynamics of bicycle flow are very limited. This study experimentally investigated bicycle flow dynamics on a wide road, modeled using a 3-m-wide track. The results showed that the bicycle flow rate remained nearly constant across a wide range of densities, in marked contrast to single-file bicycle flow, which exhibits a unimodal fundamental diagram. By studying the weight density of the radial locations of cyclists, we argue that this behavior arises from the formation of more lanes with the increase of global density. The extra lanes prevent the longitudinal density from increasing as quickly as in single-file bicycle flow. When the density is larger than 0.5 bicycles/m2, the flow rate begins to decrease, and stop-and-go traffic emerges. A cognitive-science-based model to reproduce bicycle dynamics is proposed, in which cyclists apply simple cognitive procedures to adapt their target directions and desired riding speeds. To incorporate differences in acceleration, deceleration, and turning, different relaxation times are used. The model can reproduce the experimental results acceptably well and may also provide guidance on infrastructure design.
We present a method for both cross estimation and iterated time series prediction of spatio temporal dynamics based on reconstructed local states, PCA dimension reduction, and local modelling using nearest neighbour methods. The effectiveness of this approach is shown for (noisy) data from a (cubic) Barkley model, the Bueno-Orovio-Cherry-Fenton model, and the Kuramoto-Sivashinsky model.
Social media offer an abundant source of valuable raw data, however informal writing can quickly become a bottleneck for many natural language processing (NLP) tasks. Off-the-shelf tools are usually trained on formal text and cannot explicitly handle noise found in short online posts. Moreover, the variety of frequently occurring linguistic variations presents several challenges, even for humans who might not be able to comprehend the meaning of such posts, especially when they contain slang and abbreviations. Text Normalization aims to transform online user-generated text to a canonical form. Current text normalization systems rely on string or phonetic similarity and classification models that work on a local fashion. We argue that processing contextual information is crucial for this task and introduce a social media text normalization hybrid word-character attention-based encoder-decoder model that can serve as a pre-processing step for NLP applications to adapt to noisy text in social media. Our character-based component is trained on synthetic adversarial examples that are designed to capture errors commonly found in online user-generated text. Experiments show that our model surpasses neural architectures designed for text normalization and achieves comparable performance with state-of-the-art related work.
An increasing amount of civil engineering applications are utilising data acquired from infrastructure instrumented with sensing devices. This data has an important role in monitoring the response of these structures to excitation, and evaluating structural health. In this paper we seek to monitor pedestrian-events (such as a person walking) on a footbridge using strain and acceleration data. The rate of this data acquisition and the number of sensing devices make the storage and analysis of this data a computational challenge. We introduce a streaming method to compress the sensor data, whilst preserving key patterns and features (unique to different sensor types) corresponding to pedestrian-events. Numerical demonstrations of the methodology on data obtained from strain sensors and accelerometers on the pedestrian footbridge are provided to show the trade-off between compression and accuracy during and in-between periods of pedestrian-events.
Unorganized heaps of analysis code are a growing liability as data analysis pipelines are getting longer and more complicated. This is worrying, as neuroscience papers are getting retracted due to programmer error. In this paper, some guidelines are presented that help keep analysis code well organized, easy to understand and convenient to work with:
1. Each analysis step is one script
2. A script either processes a single recording, or aggregates across recordings, never both
3. One master script to run the entire analysis
4. Save all intermediate results
5. Visualize all intermediate results
6. Each parameter and filename is defined only once
7. Distinguish files that are part of the official pipeline from other scripts
In addition to discussing the reasoning behind each guideline, an example analysis pipeline is presented as a case study to see how each guideline translates into code.
During the last fifteen years, text scaling approaches have become a central element for the text-as-data community. However, they are based on the assumption that latent positions can be captured just by modeling word-frequency information from the different documents under study. We challenge this by presenting a new semantically aware unsupervised scaling algorithm, SemScale, which relies upon distributional representations of the documents under study. We conduct an extensive quantitative analysis over a collection of speeches from the European Parliament in five different languages and from two different legislations, in order to understand whether a) an approach that is aware of semantics would better capture known underlying political dimensions compared to a frequency-based scaling method, b) such positioning correlates in particular with a specific subset of linguistic traits, compared to the use of the entire text, and c) these findings hold across different languages. To support further research on this new branch of text scaling approaches, we release the employed dataset and evaluation setting, an easy-to-use online demo, and a Python implementation of SemScale.
Algorithm configurators are automated methods to optimise the parameters of an algorithm for a class of problems. We evaluate the performance of a simple random local search configurator (ParamRLS) for tuning the neighbourhood size $k$ of the RLS$_k$ algorithm. We measure performance as the expected number of configuration evaluations required to identify the optimal value for the parameter. We analyse the impact of the cutoff time $\kappa$ (the time spent evaluating a configuration for a problem instance) on the expected number of configuration evaluations required to find the optimal parameter value, where we compare configurations using either best found fitness values (ParamRLS-F) or optimisation times (ParamRLS-T). We consider tuning RLS$_k$ for a variant of the Ridge function class (Ridge*), where the performance of each parameter value does not change during the run, and for the OneMax function class, where longer runs favour smaller $k$. We rigorously prove that ParamRLS-F efficiently tunes RLS$_k$ for Ridge* for any $\kappa$ while ParamRLS-T requires at least a quadratic one. For OneMax ParamRLS-F identifies $k=1$ as optimal with linear $\kappa$ while ParamRLS-T requires a $\kappa$ of at least $\Omega(n \log n)$. For smaller $\kappa$ ParamRLS-F identifies that $k>1$ performs better while ParamRLS-T returns $k$ chosen uniformly at random.
Zero-shot learning (ZSL) aims at recognizing unseen classes with knowledge transferred from seen classes. This is typically achieved by exploiting a semantic feature space (FS) shared by both seen and unseen classes, i.e., attributes or word vectors, as the bridge. However, due to the mutually disjoint of training (seen) and testing (unseen) data, existing ZSL methods easily and commonly suffer from the domain shift problem. To address this issue, we propose a novel model called AMS-SFE. It considers the Alignment of Manifold Structures by Semantic Feature Expansion. Specifically, we build up an autoencoder based model to expand the semantic features and joint with an alignment to an embedded manifold extracted from the visual FS of data. It is the first attempt to align these two FSs by way of expanding semantic features. Extensive experiments show the remarkable performance improvement of our model compared with other existing methods.
Reinforcement learning (RL) is about sequential decision making and is traditionally opposed to supervised learning (SL) and unsupervised learning (USL). In RL, given the current state, the agent makes a decision that may influence the next state as opposed to SL (and USL) where, the next state remains the same, regardless of the decisions taken, either in batch or online learning. Although this difference is fundamental between SL and RL, there are connections that have been overlooked. In particular, we prove in this paper that gradient policy method can be cast as a supervised learning problem where true label are replaced with discounted rewards. We provide a new proof of policy gradient methods (PGM) that emphasizes the tight link with the cross entropy and supervised learning. We provide a simple experiment where we interchange label and pseudo rewards. We conclude that other relationships with SL could be made if we modify the reward functions wisely.
With the wide deployment of machine learning (ML) based systems for a variety of applications including medical, military, automotive, genomic, as well as multimedia and social networking, there is great potential for damage from adversarial learning (AL) attacks. In this paper, we provide a contemporary survey of AL, focused particularly on defenses against attacks on statistical classifiers. After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), and reverse engineering (RE) attacks and particularly defenses against same. In so doing, we distinguish robust classification from anomaly detection (AD), unsupervised from supervised, and statistical hypothesis-based defenses from ones that do not have an explicit null (no attack) hypothesis; we identify the hyperparameters a particular method requires, its computational complexity, as well as the performance measures on which it was evaluated and the obtained quality. We then dig deeper, providing novel insights that challenge conventional AL wisdom and that target unresolved issues, including: 1) robust classification versus AD as a defense strategy; 2) the belief that attack success increases with attack strength, which ignores susceptibility to AD; 3) small perturbations for test-time evasion attacks: a fallacy or a requirement?; 4) validity of the universal assumption that a TTE attacker knows the ground-truth class for the example to be attacked; 5) black, grey, or white box attacks as the standard for defense evaluation; 6) susceptibility of query-based RE to an AD defense. We then present benchmark comparisons of several defenses against TTE, RE, and backdoor DP attacks on images. The paper concludes with a discussion of future work.
The infeasible parts of the objective space in difficult many-objective optimization problems cause trouble for evolutionary algorithms. This paper proposes a reference vector based algorithm which uses two interacting engines to adapt the reference vectors and to evolve the population towards the true Pareto Front (PF) s.t. the reference vectors are always evenly distributed within the current PF to provide appropriate guidance for selection. The current PF is tracked by maintaining an archive of undominated individuals, and adaptation of reference vectors is conducted with the help of another archive that contains layers of reference vectors corresponding to different density. Experimental results show the expected characteristics and competitive performance of the proposed algorithm TEEA.
Inferring a decision tree from a given dataset is one of the classic problems in machine learning. This problem consists of buildings, from a labelled dataset, a tree such that each node corresponds to a class and a path between the tree root and a leaf corresponds to a conjunction of features to be satisfied in this class. Following the principle of parsimony, we want to infer a minimal tree consistent with the dataset. Unfortunately, inferring an optimal decision tree is known to be NP-complete for several definitions of optimality. Hence, the majority of existing approaches relies on heuristics, and as for the few exact inference approaches, they do not work on large data sets. In this paper, we propose a novel approach for inferring a decision tree of a minimum depth based on the incremental generation of Boolean formula. The experimental results indicate that it scales sufficiently well and the time it takes to run grows slowly with the size of dataset.
Spatio-temporal graphs such as traffic networks or gene regulatory systems present challenges for the existing deep learning methods due to the complexity of structural changes over time. To address these issues, we introduce Spatio-Temporal Deep Graph Infomax (STDGI)—a fully unsupervised node representation learning approach based on mutual information maximization that exploits both the temporal and spatial dynamics of the graph. Our model tackles the challenging task of node-level regression by training embeddings to maximize the mutual information between patches of the graph, at any given time step, and between features of the central nodes of patches, in the future. We demonstrate through experiments and qualitative studies that the learned representations can successfully encode relevant information about the input graph and improve the predictive performance of spatio-temporal auto-regressive forecasting models.
We describe an expressive class of policies that can be efficiently learned from a few demonstrations. Policies are represented as logical combinations of programs drawn from a small domain-specific language (DSL). We define a prior over policies with a probabilistic grammar and derive an approximate Bayesian inference algorithm to learn policies from demonstrations. In experiments, we study five strategy games played on a 2D grid with one shared DSL. After a few demonstrations of each game, the inferred policies generalize to new game instances that differ substantially from the demonstrations. We argue that the proposed method is an apt choice for policy learning tasks that have scarce training data and feature significant, structured variation between task instances.
While the use of deep learning in drug discovery is gaining increasing attention, the lack of methods to compute reliable errors in prediction for Neural Networks prevents their application to guide decision making in domains where identifying unreliable predictions is essential, e.g. precision medicine. Here, we present a framework to compute reliable errors in prediction for Neural Networks using Test-Time Dropout and Conformal Prediction. Specifically, the algorithm consists of training a single Neural Network using dropout, and then applying it N times to both the validation and test sets, also employing dropout in this step. Therefore, for each instance in the validation and test sets an ensemble of predictions were generated. The residuals and absolute errors in prediction for the validation set were then used to compute prediction errors for test set instances using Conformal Prediction. We show using 24 bioactivity data sets from ChEMBL 23 that dropout Conformal Predictors are valid (i.e., the fraction of instances whose true value lies within the predicted interval strongly correlates with the confidence level) and efficient, as the predicted confidence intervals span a narrower set of values than those computed with Conformal Predictors generated using Random Forest (RF) models. Lastly, we show in retrospective virtual screening experiments that dropout and RF-based Conformal Predictors lead to comparable retrieval rates of active compounds. Overall, we propose a computationally efficient framework (as only N extra forward passes are required in addition to training a single network) to harness Test-Time Dropout and the Conformal Prediction framework, and to thereby generate reliable prediction errors for deep Neural Networks.