Restart strategy helps the covariance matrix adaptation evolution strategy (CMA-ES) to increase the probability of finding the global optimum in optimization, while a single run CMA-ES is easy to be trapped in local optima. In this paper, the continuous non-revisiting genetic algorithm (cNrGA) is used to help CMA-ES to achieve multiple restarts from different sub-regions of the search space. The CMA-ES with on-line search history-assisted restart strategy (HR-CMA-ES) is proposed. The entire on-line search history of cNrGA is stored in a binary space partitioning (BSP) tree, which is effective for performing local search. The frequently sampled sub-region is reflected by a deep position in the BSP tree. When leaf nodes are located deeper than a threshold, the corresponding sub-region is considered a region of interest (ROI). In HR-CMA-ES, cNrGA is responsible for global exploration and suggesting ROI for CMA-ES to perform an exploitation within or around the ROI. CMA-ES restarts independently in each suggested ROI. The non-revisiting mechanism of cNrGA avoids to suggest the same ROI for a second time. Experimental results on the CEC 2013 and 2017 benchmark suites show that HR-CMA-ES performs better than both CMA-ES and cNrGA. A positive synergy is observed by the memetic cooperation of the two algorithms.
In this paper we propose four deep recurrent architectures to tackle the task of offensive tweet detection as well as further classification into targeting and subject of said targeting. Our architectures are based on LSTMs and GRUs, we present a simple bidirectional LSTM as a baseline system and then further increase the complexity of the models by adding convolutional layers and implementing a split-process-merge architecture with LSTM and GRU as processors. Multiple pre-processing techniques were also investigated. The validation F1-score results from each model are presented for the three subtasks as well as the final F1-score performance on the private competition test set. It was found that model complexity did not necessarily yield better results. Our best-performing model was also the simplest, a bidirectional LSTM; closely followed by a two-branch bidirectional LSTM and GRU architecture.
We present some new nonparametric estimators of entropies and we establish almost sure consistency and central limit Theorems for some of the most important entropies in the discrete case. Our theorical results are validated by simulations.
A re-calibration is proposed for ‘numerical analysis’ as it arises specifically within the broader, embracing field of modern computer science (CS). This would facilitate research into theoretical and practicable models of real-number computation at the foundations of CS, and it would also advance the instructional objectives of the CS field. Our approach is premised on the key observation that the great ‘watershed’ in numerical computation is much more between finite- and infinite-dimensional numerical problems than it is between discrete and continuous numerical problems. A revitalized discipline for numerical computation within modern CS can more accurately be defined as ‘numerical algorithmic science & engineering (NAS&E), or more compactly, as ‘numerical algorithmics,’ its focus being the algorithmic solution of numerical problems that are either discrete, or continuous over a space of finite dimension, or a combination of the two. It is the counterpart within modern CS of the numerical analysis discipline, whose primary focus is the algorithmic solution of continuous, infinite-dimensional numerical problems and their finite-dimensional approximates, and whose specialists today have largely been repatriated to departments of mathematics. Our detailed overview of NAS&E from the viewpoints of rationale, foundations, and organization is preceded by a recounting of the role played by numerical analysts in the evolution of academic departments of computer science, in order to provide background for NAS&E and place the newly-emerging discipline within its larger historical context.
The study of international relations by definition deals with interdependencies among countries. One form of interdependence between countries is the diffusion of country-level features, such as policies, political regimes, or conflict. In these studies, the outcome variable tends to be categorical, and the primary concern is the clustering of the outcome variable among connected countries. Statistically, such clustering is studied with spatial econometric models. This paper instead proposes the use of a statistical network approach to model diffusion with a binary outcome variable. Using statistical network instead of spatial econometric models allows for a more natural specification of the diffusion process, assuming autocorrelation in the outcomes rather than the corresponding latent variable, and it simplifies the inclusion of temporal dynamics, higher level interdependencies and interactions between network ties and country-level features. In our simulations, the performance of the Stochastic Actor-Oriented Model (SAOM) estimator is evaluated. Our simulation results show that spatial parameters and coefficients on additional covariates in a static binary spatial autoregressive model are accurately recovered when using SAOM, albeit on a different scale. To demonstrate the use of this model, the paper applies the model to the international diffusion of same-sex marriage.
We elaborate the idea behind Markov chain Monte Carlo (MCMC) methods in a mathematically comprehensive way. Our focus is on simplicity. We give an elementary proof for the Perron-Frobenius theorem and a convergence theorem for Markov chains. Subsequently we briefly discuss the well-known Gibbs sampler and the Metropolis- Hastings algorithm. Only basic knowledge about matrix multiplication, convergence of real sequences and stochastic is required.
Deep Bayesian neural network has aroused a great attention in recent years since it combines the benefits of deep neural network and probability theory. Because of this, the network can make predictions and quantify the uncertainty of the predictions at the same time, which is important in many life-threatening areas. However, most of the recent researches are mainly focusing on making the Bayesian neural network easier to train, and proposing methods to estimate the uncertainty. I notice there are very few works that properly discuss the ways to measure the performance of the Bayesian neural network. Although accuracy and average uncertainty are commonly used for now, they are too general to provide any insight information about the model. In this paper, we would like to introduce more specific criteria and propose several metrics to measure the model performance from different perspectives, which include model calibration measurement, data rejection ability and uncertainty divergence for samples from the same and different distributions.
Gradient Boosting Machine (GBM) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in machine learning competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov’s acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome it, we design a ‘corrected pseudo residual’ and fit best weak learner to this corrected pseudo residual, in order to perform the z-update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM type of algorithm with theoretically-justified accelerated convergence rate. Finally we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.
The representation problem of finite-dimensional Markov matrices in Markov semigroups is revisited, with emphasis on concrete criteria for matrix subclasses of theoretical or practical relevance, such as equal-input, circulant, symmetric or doubly stochastic matrices. Here, we pay special attention to various algebraic properties of the embedding problem, and discuss the connection with the centraliser of a Markov matrix.
A network effect is said to take place when a new feature not only impacts the people who receive it, but also other users of the platform, like their connections or the people who follow them. This very common phenomenon violates the fundamental assumption underpinning nearly all enterprise experimentation systems, the stable unit treatment value assumption (SUTVA). When this assumption is broken, a typical experimentation platform, which relies on Bernoulli randomization for assignment and two-sample t-test for assessment of significance, will not only fail to account for the network effect, but potentially give highly biased results. This paper outlines a simple and scalable solution to measuring network effects, using ego-network randomization, where a cluster is comprised of an ‘ego’ (a focal individual), and her ‘alters’ (the individuals she is immediately connected to). Our approach aims at maintaining representativity of clusters, avoiding strong modeling assumption, and significantly increasing power compared to traditional cluster-based randomization. In particular, it does not require product-specific experiment design, or high levels of investment from engineering teams, and does not require any changes to experimentation and analysis platforms, as it only requires assigning treatment an individual level. Each user either has the feature or does not, and no complex manipulation of interactions between users is needed. It focuses on measuring the one-out network effect (i.e the effect of my immediate connection’s treatment on me), and gives reasonable estimates at a very low setup cost, allowing us to run such experiments dozens of times a year.
Whale Optimization Algorithm (WOA) is a nature-inspired meta-heuristic optimization algorithm, which was proposed by Mirjalili and Lewis in 2016. This algorithm has shown its ability to solve many problems. Comprehensive surveys have been conducted about some other nature-inspired algorithms, such as ABC, PSO, etc.Nonetheless, no survey search work has been conducted on WOA. Therefore, in this paper, a systematic and meta analysis survey of WOA is conducted to help researchers to use it in different areas or hybridize it with other common algorithms. Thus, WOA is presented in depth in terms of algorithmic backgrounds, its characteristics, limitations, modifications, hybridizations, and applications. Next, WOA performances are presented to solve different problems. Then, the statistical results of WOA modifications and hybridizations are established and compared with the most common optimization algorithms and WOA. The survey’s results indicate that WOA performs better than other common algorithms in terms of convergence speed and balancing between exploration and exploitation. WOA modifications and hybridizations also perform well compared to WOA. In addition, our investigation paves a way to present a new technique by hybridizing both WOA and BAT algorithms. The BAT algorithm is used for the exploration phase, whereas the WOA algorithm is used for the exploitation phase. Finally, statistical results obtained from WOA-BAT are very competitive and better than WOA in 16 benchmarks functions. WOA-BAT also outperforms well in 13 functions from CEC2005 and 7 functions from CEC2019.
Nonlinear acceleration algorithms improve the performance of iterative methods, such as gradient descent, using the information contained in past iterates. However, their efficiency is still not entirely understood even in the quadratic case. In this paper, we clarify the convergence analysis by giving general properties that share several classes of nonlinear acceleration: Anderson acceleration (and variants), quasi-Newton methods (such as Broyden Type-I or Type-II, SR1, DFP, and BFGS) and Krylov methods (Conjugate Gradient, MINRES, GMRES). In particular, we propose a generic family of algorithms that contains all the previous methods and prove its optimal rate of convergence when minimizing quadratic functions. We also propose multi-secants updates for the quasi-Newton methods listed above. We provide a Matlab code implementing the algorithm.
Research in Artificial Intelligence (AI) has focused mostly on two extremes: either on small improvements in narrow AI domains, or on universal theoretical frameworks which are usually uncomputable, incompatible with theories of biological intelligence, or lack practical implementations. The goal of this work is to combine the main advantages of the two: to follow a big picture view, while providing a particular theory and its implementation. In contrast with purely theoretical approaches, the resulting architecture should be usable in realistic settings, but also form the core of a framework containing all the basic mechanisms, into which it should be easier to integrate additional required functionality. In this paper, we present a novel, purposely simple, and interpretable hierarchical architecture which combines multiple different mechanisms into one system: unsupervised learning of a model of the world, learning the influence of one’s own actions on the world, model-based reinforcement learning, hierarchical planning and plan execution, and symbolic/sub-symbolic integration in general. The learned model is stored in the form of hierarchical representations with the following properties: 1) they are increasingly more abstract, but can retain details when needed, and 2) they are easy to manipulate in their local and symbolic-like form, thus also allowing one to observe the learning process at each level of abstraction. On all levels of the system, the representation of the data can be interpreted in both a symbolic and a sub-symbolic manner. This enables the architecture to learn efficiently using sub-symbolic methods and to employ symbolic inference.
Neural networks have been criticized for their lack of easy interpretation, which undermines confidence in their use for important applications. Here, we introduce a novel technique, interpreting a trained neural network by investigating its flip points. A flip point is any point that lies on the boundary between two output classes: e.g. for a neural network with a binary yes/no output, a flip point is any input that generates equal scores for ‘yes’ and ‘no’. The flip point closest to a given input is of particular importance, and this point is the solution to a well-posed optimization problem. This paper gives an overview of the uses of flip points and how they are computed. Through results on standard datasets, we demonstrate how flip points can be used to provide detailed interpretation of the output produced by a neural network. Moreover, for a given input, flip points enable us to measure confidence in the correctness of outputs much more effectively than softmax score. They also identify influential features of the inputs, identify bias, and find changes in the input that change the output of the model. We show that distance between an input and the closest flip point identifies the most influential points in the training data. Using principal component analysis (PCA) and rank-revealing QR factorization (RR-QR), the set of directions from each training input to its closest flip point provides explanations of how a trained neural network processes an entire dataset: what features are most important for classification into a given class, which features are most responsible for particular misclassifications, how an adversary might fool the network, etc. Although we investigate flip points for neural networks, their usefulness is actually model-agnostic.
Traditional machine learning algorithms use data from databases that are mutable, and therefore the data cannot be fully trusted. Also, the machine learning process is difficult to automate. This paper proposes building a trustable machine learning system by using blockchain technology, which can store data in a permanent and immutable way. In addition, smart contracts are used to automate the machine learning process. This paper makes three contributions. First, it establishes a link between machine learning technology and blockchain technology. Previously, machine learning and blockchain have been considered two independent technologies without an obvious link. Second, it proposes a unified analytical framework for trustable machine learning by using blockchain technology. This unified framework solves both the trustability and automation issues in machine learning. Third, it enables a computer to translate core machine learning implementation from a single thread on a single machine to multiple threads on multiple machines running with blockchain by using a unified approach. The paper uses association rule mining as an example to demonstrate how trustable machine learning can be implemented with blockchain, and it shows how this approach can be used to analyze opioid prescriptions to help combat the opioid crisis.
Recent progress in graph signal processing (GSP) has addressed a number of problems, including sampling and filtering. Proposed methods have focused on generic graphs and defined signals with certain characteristics, e.g., bandlimited signals, based on t he graph Fourier transform (GFT). However, the effect of GFT properties (e.g., vertex localization) on the behavior of such methods is not as well understood. In this paper, we propose novel GFT visualization tools and provide some examples to illustrate certain GFT properties and their impact on sampling or wavelet transforms.
Sorting input objects is an important step in many machine learning pipelines. However, the sorting operator is non-differentiable with respect to its inputs, which prohibits end-to-end gradient-based optimization. In this work, we propose NeuralSort, a general-purpose continuous relaxation of the output of the sorting operator from permutation matrices to the set of unimodal row-stochastic matrices, where every row sums to one and has a distinct arg max. This relaxation permits straight-through optimization of any computational graph involve a sorting operation. Further, we use this relaxation to enable gradient-based stochastic optimization over the combinatorially large space of permutations by deriving a reparameterized gradient estimator for the Plackett-Luce family of distributions over permutations. We demonstrate the usefulness of our framework on three tasks that require learning semantic orderings of high-dimensional objects, including a fully differentiable, parameterized extension of the k-nearest neighbors algorithm.
In this work, we study discrete-time Markov decision processes (MDPs) under constraints with Borel state and action spaces and where all the performance functions have the same form of the expected total reward (ETR) criterion over the infinite time horizon. One of our objective is to propose a convex programming formulation for this type of MDPs. It will be shown that the values of the constrained control problem and the associated convex program coincide and that if there exists an optimal solution to the convex program then there exists a stationary randomized policy which is optimal for the MDP. It will be also shown that in the framework of constrained control problems, the supremum of the expected total rewards over the set of randomized policies is equal to the supremum of the expected total rewards over the set of stationary randomized policies. We consider standard hypotheses such as the so-called continuity-compactness conditions and a Slater-type condition. Our assumptions are quite weak to deal with cases that have not yet been addressed in the literature. An example is presented to illustrate our results with respect to those of the literature.
Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer LM, and BERT) with a suite of sixteen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between RNNs and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.
Motivated by recent developments in serverless systems for large-scale machine learning as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale smooth and strongly-convex problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. We establish that OverSketched Newton has a linear-quadratic convergence rate, and we empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of ~50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes.
In this work, we present a method for node embedding in temporal graphs. We propose an algorithm that learns the evolution of a temporal graph’s nodes and edges over time and incorporates this dynamics in a temporal node embedding framework for different graph prediction tasks. We present a joint loss function that creates a temporal embedding of a node by learning to combine its historical temporal embeddings, such that it optimizes per given task (e.g., link prediction). The algorithm is initialized using static node embeddings, which are then aligned over the representations of a node at different time points, and eventually adapted for the given task in a joint optimization. We evaluate the effectiveness of our approach over a variety of temporal graphs for the two fundamental tasks of temporal link prediction and multi-label node classification, comparing to competitive baselines and algorithmic alternatives. Our algorithm shows performance improvements across many of the datasets and baselines and is found particularly effective for graphs that are less cohesive, with a lower clustering coefficient.
Reasoning is essential for the development of large knowledge graphs, especially for completion, which aims to infer new triples based on existing ones. Both rules and embeddings can be used for knowledge graph reasoning and they have their own advantages and difficulties. Rule-based reasoning is accurate and explainable but rule learning with searching over the graph always suffers from efficiency due to huge search space. Embedding-based reasoning is more scalable and efficient as the reasoning is conducted via computation between embeddings, but it has difficulty learning good representations for sparse entities because a good embedding relies heavily on data richness. Based on this observation, in this paper we explore how embedding and rule learning can be combined together and complement each other’s difficulties with their advantages. We propose a novel framework IterE iteratively learning embeddings and rules, in which rules are learned from embeddings with proper pruning strategy and embeddings are learned from existing triples and new triples inferred by rules. Evaluations on embedding qualities of IterE show that rules help improve the quality of sparse entity embeddings and their link prediction results. We also evaluate the efficiency of rule learning and quality of rules from IterE compared with AMIE+, showing that IterE is capable of generating high quality rules more efficiently. Experiments show that iteratively learning embeddings and rules benefit each other during learning and prediction.
Time series models such as dynamical systems are frequently fitted to a cohort of data, ignoring variation between individual entities such as patients. In this paper we show how these models can be personalised to an individual level while retaining statistical power, via use of multi-task learning (MTL). To our knowledge this is a novel development of MTL which applies to time series both with and without control inputs. The modelling framework is demonstrated on a physiological drug response problem which results in improved predictive accuracy and uncertainty estimation over existing state-of-the-art models.
High dimensional data often contain multiple facets, and several clustering patterns (views) can co-exist under different feature subspaces. While multi-view clustering algorithms were proposed, the uncertainty quantification remains difficult — a particular challenge is in the high complexity of estimating the cluster assignment probability under each view, or/and to efficiently share information across views. In this article, we propose an empirical Bayes approach — viewing the similarity matrices generated over subspaces as rough first-stage estimates for co-assignment probabilities, in its Kullback-Leibler neighborhood we obtain a refined low-rank soft cluster graph, formed by the pairwise product of simplex coordinates. Interestingly, each simplex coordinate directly encodes the cluster assignment uncertainty. For multi-view clustering, we equip each similarity matrix with a mixed membership over a small number of latent views, leading to effective dimension reduction. With a high model flexibility, the estimation can be succinctly re-parameterized as a continuous optimization problem, hence enjoys gradient-based computation. Theory establishes the connection of this model to random cluster graph under multiple views. Compared to single-view clustering approaches, substantially more interpretable results are obtained when clustering brains from human traumatic brain injury study, using high-dimensional gene expression data. KEY WORDS: Co-regularized Clustering, Consensus, PAC-Bayes, Random Cluster Graph, Variable Selection
The goal of this paper is to deal with a data scarcity scenario where deep learning techniques use to fail. We compare the use of two well established techniques, Restricted Boltzmann Machines and Variational Auto-encoders, as generative models in order to increase the training set in a classification framework. Essentially, we rely on Markov Chain Monte Carlo (MCMC) algorithms for generating new samples. We show that generalization can be improved comparing this methodology to other state-of-the-art techniques, e.g. semi-supervised learning with ladder networks. Furthermore, we show that RBM is better than VAE generating new samples for training a classifier with good generalization capabilities.
Due to its extensive use in databases, the relational model is ubiquitous in representing big-data. We propose to apply deep learning to this type of relational data by introducing an Equivariant Relational Layer (ERL), a neural network layer derived from the entity-relationship model of the database. Our layer relies on identification of exchangeabilities in the relational data(base), and their expression as a permutation group. We prove that an ERL is an optimal parameter-sharing scheme under the given exchangeability constraints, and subsumes recently introduced deep models for sets, exchangeable tensors, and graphs. The proposed model has a linear complexity in the size of the relational data, and it can be used for both inductive and transductive reasoning in databases, including the prediction of missing records, and database embedding. This opens the door to the application of deep learning to one of the most abundant forms of data.
We prove that no fully transactional system can provide fast read transactions (including read-only ones that are considered the most frequent in practice). Specifically, to achieve fast read transactions, the system has to give up support of transactions that write more than one object. We prove this impossibility result for distributed storage systems that are causally consistent, i.e., they do not require to ensure any strong form of consistency. Therefore, our result holds also for any system that ensures a consistency level stronger than causal consistency, e.g., strict serializability. The impossibility result holds even for systems that store only two objects (and support at least two servers and at least four clients). It also holds for systems that are partially replicated. Our result justifies the design choices of state-of-the-art distributed transactional systems and insists that system designers should not put more effort to design fully-functional systems that support both fast read transactions and ensure causal or any stronger form of consistency.
Multitask learning aims at solving a set of related tasks simultaneously, by exploiting the shared knowledge for improving the performance on individual tasks. Hence, an important aspect of multitask learning is to understand the similarities within a set of tasks. Previous works have incorporated this similarity information explicitly (e.g., weighted loss for each task) or implicitly (e.g., adversarial loss for feature adaptation), for achieving good empirical performances. However, the theoretical motivations for adding task similarity knowledge are often missing or incomplete. In this paper, we give a different perspective from a theoretical point of view to understand this practice. We first provide an upper bound on the generalization error of multitask learning, showing the benefit of explicit and implicit task similarity knowledge. We systematically derive the bounds based on two distinct task similarity metrics: H divergence and Wasserstein distance. From these theoretical results, we revisit the Adversarial Multi-task Neural Network, proposing a new training algorithm to learn the task relation coefficients and neural network parameters iteratively. We assess our new algorithm empirically on several benchmarks, showing not only that we find interesting and robust task relations, but that the proposed approach outperforms the baselines, reaffirming the benefits of theoretical insight in algorithm design.