Big data, data science, deep learning, artificial intelligence are the key words of intense hype related with a job market in full evolution, that impose to adapt the contents of our university professional trainings. Which artificial intelligence is mostly concerned by the job offers? Which methodologies and technologies should be favored in the training pprograms? Which objectives, tools and educational resources do we needed to put in place to meet these pressing needs? We answer these questions in describing the contents and operational ressources in the Data Science orientation of the speciality Applied Mathematics at INSA Toulouse. We focus on basic mathematics training (Optimization, Probability, Statistics), associated with the practical implementation of the most performing statistical learning algorithms, with the most appropriate technologies and on real examples. Considering the huge volatility of the technologies, it is imperative to train students in seft-training, this will be their technological watch tool when they will be in professional activity. This explains the structuring of the educational site https://…/wikistat into a set of tutorials. Finally, to motivate the thorough practice of these tutorials, a serious game is organized each year in the form of a prediction contest between students of Master degrees in Applied Mathematics for IA.
Evaluation has always been a key challenge in the development of artificial intelligence (AI) based software, due to the technical complexity of the software artifact and, often, its embedding in complex sociotechnical processes. Recent advances in machine learning (ML) enabled by deep neural networks has exacerbated the challenge of evaluating such software due to the opaque nature of these ML-based artifacts. A key related issue is the (in)ability of such systems to generate useful explanations of their outputs, and we argue that the explanation and evaluation problems are closely linked. The paper models the elements of a ML-based AI system in the context of public sector decision (PSD) applications involving both artificial and human intelligence, and maps these elements against issues in both evaluation and explanation, showing how the two are related. We consider a number of common PSD application patterns in the light of our model, and identify a set of key issues connected to explanation and evaluation in each case. Finally, we propose multiple strategies to promote wider adoption of AI/ML technologies in PSD, where each is distinguished by a focus on different elements of our model, allowing PSD policy makers to adopt an approach that best fits their context and concerns.
Salient object detection is a problem that has been considered in detail and many solutions proposed. In this paper, we argue that work to date has addressed a problem that is relatively ill-posed. Specifically, there is not universal agreement about what constitutes a salient object when multiple observers are queried. This implies that some objects are more likely to be judged salient than others, and implies a relative rank exists on salient objects. Initially, we present a novel deep learning solution based on a hierarchical representation of relative saliency and stage-wise refinement. Furthermore, we present data, analysis and benchmark baseline results towards addressing the problem of salient object ranking. Methods for deriving suitable ranked salient object instances are presented, along with metrics suitable to measuring algorithm performance. In addition, we show how a derived dataset can be successively refined to provide cleaned results that correlate well with pristine ground truth. Finally, we provide a comparison among prevailing algorithms that address salient object ranking or detection to establish initial baselines.
The least square Monte Carlo (LSM) algorithm proposed by Longstaff and Schwartz [2001] is the most widely used method for pricing options with early exercise features. The LSM estimator contains look-ahead bias, and the conventional technique of removing it necessitates an independent set of simulations. This study proposes a new approach for efficiently eliminating look-ahead bias by using the leave-one-out method, a well-known cross-validation technique for machine learning applications. The leave-one-out LSM (LOOLSM) method is illustrated with examples, including multi-asset options whose LSM price is biased high. The asymptotic behavior of look-ahead bias is also discussed with the LOOLSM approach.
Clustering is a fundamental problem in unsupervised learning. Popular methods like K-means, may suffer from poor performance as they are prone to get stuck in its local minima. Recently, the sum-of-norms (SON) model (also known as the clustering path) has been proposed in Pelckmans et al. (2005), Lindsten et al. (2011) and Hocking et al. (2011). The perfect recovery properties of the convex clustering model with uniformly weighted all pairwise-differences regularization have been proved by Zhu et al. (2014) and Panahi et al. (2017). However, no theoretical guarantee has been established for the general weighted convex clustering model, where better empirical results have been observed. In the numerical optimization aspect, although algorithms like the alternating direction method of multipliers (ADMM) and the alternating minimization algorithm (AMA) have been proposed to solve the convex clustering model (Chi and Lange, 2015), it still remains very challenging to solve large-scale problems. In this paper, we establish sufficient conditions for the perfect recovery guarantee of the general weighted convex clustering model, which include and improve existing theoretical results as special cases. In addition, we develop a semismooth Newton based augmented Lagrangian method for solving large-scale convex clustering problems. Extensive numerical experiments on both simulated and real data demonstrate that our algorithm is highly efficient and robust for solving large-scale problems. Moreover, the numerical results also show the superior performance and scalability of our algorithm comparing to the existing first-order methods. In particular, our algorithm is able to solve a convex clustering problem with 200,000 points in $\mathbb{R}^3$ in about 6 minutes.
This paper discusses predictive inference and feature selection for generalized linear models with scarce but high-dimensional data. We argue that in many cases one can benefit from a decision theoretically justified two-stage approach: first, construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize the predictions. The model built in the first step is referred to as the \emph{reference model} and the operation during the latter step as predictive \emph{projection}. The key characteristic of this approach is that it finds an excellent tradeoff between sparsity and predictive accuracy, and the gain comes from utilizing all available information including prior and that coming from the left out features. We review several methods that follow this principle and provide novel methodological contributions. We present a new projection technique that unifies two existing techniques and is both accurate and fast to compute. We also propose a way of evaluating the feature selection process using fast leave-one-out cross-validation that allows for easy and intuitive model size selection. Furthermore, we prove a theorem that helps to understand the conditions under which the projective approach could be beneficial. The benefits are illustrated via several simulated and real world examples.
Spatial relations between neurons in the network with time delays play a crucial role in determining dynamics of the system. During the development of the nervous system different types of neurons group together to enable specific functions of the network. Right spatial distances, thus right time delays between cells are crucial for an appropriate functioning of the system. To model the process of neural migration we proposed simple but effective model of network spatial evolution based on Hindmarsh-Rose neurons and Metropoli-Hastings Monte Carlo algorithm. Under the specific assumptions and using appropriate parameters of the neural evolution the network can converge to the desirable state giving the opportunity of achieving large variety of spectra. We show that there is a specific range of network size in space which allows it to generate assumed output. A network or generally speaking a system with time delays (corresponding to the arrangement in the physical space) of specific output properties has a specific spatial dimension that allows it to function properly.
Conditional Gradients (aka Frank-Wolfe algorithms) form a classical set of methods for constrained smooth convex minimization due to their simplicity, the absence of projection step, and competitive numerical performance. While the vanilla Frank-Wolfe algorithm only ensures a worst-case rate of O(1/epsilon), various recent results have shown that for strongly convex functions, the method can be slightly modified to achieve linear convergence. However, this still leaves a huge gap between sublinear O(1/epsilon) convergence and linear O(log(1/epsilon)) convergence to reach an $\epsilon$-approximate solution. Here, we present a new variant of Conditional Gradients, that can dynamically adapt to the function’s geometric properties using restarts and thus smoothly interpolates between the sublinear and linear regimes.
Abstraction is a powerful idea widely used in science, to model, reason and explain the behavior of systems in a more tractable search space, by omitting irrelevant details. While notions of abstraction have matured for deterministic systems, the case for abstracting probabilistic models is not yet fully understood. In this paper, we develop a foundational framework for abstraction in probabilistic relational models from first principles. These models borrow syntactic devices from first-order logic and are very expressive, thus naturally allowing for relational and hierarchical constructs with stochastic primitives. We motivate a definition of consistency between a high-level model and its low-level counterpart, but also treat the case when the high-level model is missing critical information present in the low-level model. We prove properties of abstractions, both at the level of the parameter as well as the structure of the models.
We study the topology of the space of learning tasks, which is critical to understanding transfer learning whereby a model such as a deep neural network is pre-trained on a task, and then used on a different one after some fine-tuning. First we show that using the Kolmogorov structure function we can define a distance between tasks, which is independent on any particular model used and, empirically, correlates with the semantic similarity between tasks. Then, using a path integral approximation, we show that this plays a central role in the learning dynamics of Deep Networks, and in particular in the reachability of one task from another. We show that the probability of paths connecting two tasks, is asymmetric and has a static component that depends on the geometry of the loss function, in particular on the curvature, and a dynamic component that is model dependent and relates to the ease of traversing such paths. Surprisingly, the static component corresponds to the distance derived from the Kolmogorov Structure Function. With the dynamic component, this gives strict lower bounds on the complexity necessary to learn a task starting from the solution to another. Our analysis also explains more complex phenomena where semantically similar tasks may be unreachable from one another, a phenomenon called Information Plasticity and observed in diverse learning systems such as animals and deep neural networks.
We focus on the convergence analysis of averaged relaxations of cutters, specifically for variants that—depending upon how parameters are chosen—resemble \emph{alternating projections}, the \emph{Douglas–Rachford method}, \emph{relaxed reflect-reflect}, or the \emph{Peaceman–Rachford} method. Such methods are frequently used to solve convex feasibility problems. The standard convergence analyses of projection algorithms are based on the \emph{firm nonexpansivity} property of the relevant operators. However if the projections onto the constraint sets are replaced by cutters (projections onto separating hyperplanes), the firm nonexpansivity is lost. We provide a proof of convergence for a family of related averaged relaxed cutter methods under reasonable assumptions, relying on a simple geometric argument. This allows us to clarify fine details related to the allowable choice of the relaxation parameters, highlighting the distinction between the exact (firmly nonexpansive) and approximate (strongly quasi-nonexpansive) settings. We provide illustrative examples and discuss practical implementations of the method.
In hierarchical planning for Markov decision processes (MDPs), temporal abstraction allows planning with macro-actions that take place at different time scale in form of sequential composition. In this paper, we propose a novel approach to compositional reasoning and hierarchical planning for MDPs under temporal logic constraints. In addition to sequential composition, we introduce a composition of policies based on generalized logic composition: Given sub-policies for sub-tasks and a new task expressed as logic compositions of subtasks, a semi-optimal policy, which is optimal in planning with only sub-policies, can be obtained by simply composing sub-polices. Thus, a synthesis algorithm is developed to compute optimal policies efficiently by planning with primitive actions, policies for sub-tasks, and the compositions of sub-policies, for maximizing the probability of satisfying temporal logic specifications. We demonstrate the correctness and efficiency of the proposed method in stochastic planning examples with a single agent and multiple task specifications.
Simulation is a useful tool in situations where training data for machine learning models is costly to annotate or even hard to acquire. In this work, we propose a reinforcement learning-based method for automatically adjusting the parameters of any (non-differentiable) simulator, thereby controlling the distribution of synthesized data in order to maximize the accuracy of a model trained on that data. In contrast to prior art that hand-crafts these simulation parameters or adjusts only parts of the available parameters, our approach fully controls the simulator with the actual underlying goal of maximizing accuracy, rather than mimicking the real data distribution or randomly generating a large volume of data. We find that our approach (i) quickly converges to the optimal simulation parameters in controlled experiments and (ii) can indeed discover good sets of parameters for an image rendering simulator in actual computer vision applications.
Recent analyses of certain gradient descent optimization methods have shown that performance can degrade in some settings – such as with stochasticity or implicit momentum. In deep reinforcement learning (Deep RL), such optimization methods are often used for training neural networks via the temporal difference error or policy gradient. As an agent improves over time, the optimization target changes and thus the loss landscape (and local optima) change. Due to the failure modes of those methods, the ideal choice of optimizer for Deep RL remains unclear. As such, we provide an empirical analysis of the effects that a wide range of gradient descent optimizers and their hyperparameters have on policy gradient methods, a subset of Deep RL algorithms, for benchmark continuous control tasks. We find that adaptive optimizers have a narrow window of effective learning rates, diverging in other cases, and that the effectiveness of momentum varies depending on the properties of the environment. Our analysis suggests that there is significant interplay between the dynamics of the environment and Deep RL algorithm properties which aren’t necessarily accounted for by traditional adaptive gradient methods. We provide suggestions for optimal settings of current methods and further lines of research based on our findings.
We map stock market interactions to spin models to recover their hierarchical structure using a simulated annealing based Super-Paramagnetic Clustering (SPC) algorithm. This is directly compared to a modified implementation of a maximum likelihood approach to fast-Super-Paramagnetic Clustering (f-SPC). The methods are first applied standard toy test-case problems, and then to a dataset of 447 stocks traded on the New York Stock Exchange (NYSE) over 1249 days. The signal to noise ratio of stock market correlation matrices is briefly considered. Our result recover approximately clusters representative of standard economic sectors and mixed clusters whose dynamics shine light on the adaptive nature of financial markets and raise concerns relating to the effectiveness of industry based static financial market classification in the world of real-time data-analytics. A key result is that we show that the standard maximum likelihood methods are confirmed to converge to solutions within a Super-Paramagnetic (SP) phase. We use insights arising from this to discuss the implications of using a Maximum Entropy Principle (MEP) as opposed to the Maximum Likelihood Principle (MLP) as an optimization device for this class of problems.
This paper is concerned with developing a novel distributed Kalman filtering algorithm over wireless sensor networks based on randomized consensus strategy. Compared with the centralized algorithm, distributed filtering techniques require less computation per sensor and lead to more robust estimation since they simply use the information from the neighboring nodes in the network. However, poor local sensor estimation caused by limited observability and network topology changes which interfere the global consensus are challenging issues. Motivated by this observation, we propose a novel randomized gossip-based distributed Kalman filtering algorithm. Information exchange and computation in the proposed algorithm can be carried out in an arbitrarily connected network of nodes. In addition, the computational burden can be distributed for a sensor which communicates with a stochastically selected neighbor at each clock step under schemes of gossip algorithm. In this case, the error covariance matrix changes stochastically at every clock step, thus the convergence is considered in a probabilistic sense. We provide the mean square convergence analysis of the proposed algorithm. Under a sufficient condition, we show that the proposed algorithm is quite appealing as it achieves better mean square error performance theoretically than the noncooperative decentralized Kalman filtering algorithm. Besides, considering the limited computation, communication, and energy resources in the wireless sensor networks, we propose an optimization problem which minimizes the average expected state estimation error based on the proposed algorithm. To solve the proposed problem efficiently, we transform it into a convex optimization problem. And a sub-optimal solution is attained. Examples and simulations are provided to illustrate the theoretical results.
Stochastic optimization techniques are standard in variational inference algorithms. These methods estimate gradients by approximating expectations with independent Monte Carlo samples. In this paper, we explore a technique that uses correlated, but more representative , samples to reduce estimator variance. Specifically, we show how to generate antithetic samples that match sample moments with the true moments of an underlying importance distribution. Combining a differentiable antithetic sampler with modern stochastic variational inference, we showcase the effectiveness of this approach for learning a deep generative model.
GPdoemd is an open-source python package for design of experiments for model discrimination that uses Gaussian process surrogate models to approximate and maximise the divergence between marginal predictive distributions of rival mechanistic models. GPdoemd uses the divergence prediction to suggest a maximally informative next experiment.
We propose a new continuous-time formulation for first-order stochastic optimization algorithms such as mini-batch gradient descent and variance reduced techniques. We exploit this continuous-time model, together with a simple Lyapunov analysis as well as tools from stochastic calculus, in order to derive convergence bounds for various types of non-convex functions. We contrast these bounds to their known equivalent in discrete-time as well as derive new bounds. Our model also includes SVRG, for which we derive a linear convergence rate for the class of weakly quasi-convex and quadratically growing functions.
We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art.
Many types of anomaly detection methods have been proposed recently, and applied to a wide variety of fields including medical screening and production quality checking. Some methods have utilized images, and, in some cases, a part of the anomaly images is known beforehand. However, this kind of information is dismissed by previous methods, because the methods can only utilize a normal pattern. Moreover, the previous methods suffer a decrease in accuracy due to negative effects from surrounding noises. In this study, we propose a spatially-weighted anomaly detection method (SPADE) that utilizes all of the known patterns and lessens the vulnerability to ambient noises by applying Grad-CAM, which is the visualization method of a CNN. We evaluated our method quantitatively using two datasets, the MNIST dataset with noise and a dataset based on a brief screening test for dementia.
We use statistical mechanics to study model-based Bayesian data clustering. In this approach, each partitioning of the data into clusters is regarded as a microscopic system state, the negative data log-likelihood gives the energy of each state, and the data set realisation acts as disorder. Optimal clustering corresponds to the ground state of the system, and is hence obtained from the free energy via a low `temperature’ limit. We assume that for large sample sizes the free energy density is self-averaging, and we use the replica method to compute the asymptotic free energy density. The main order parameter in the resulting (replica symmetric) theory, the distribution of the data over the clusters, satisfies a self-consistent equation which can be solved by a population dynamics algorithm. From this order parameter one computes the average free energy, and all relevant macroscopic characteristics of the problem. The theory describes numerical experiments perfectly, and gives a significant improvement over the mean-field theory that was used to study this model in past.
We introduce a method, KL-LIME, for explaining predictions of Bayesian predictive models by projecting the information in the predictive distribution locally to a simpler, interpretable explanation model. The proposed approach combines the recent Local Interpretable Model-agnostic Explanations (LIME) method with ideas from Bayesian projection predictive variable selection methods. The information theoretic basis helps in navigating the trade-off between explanation fidelity and complexity. We demonstrate the method in explaining MNIST digit classifications made by a Bayesian deep convolutional neural network.
Social media corpora pose unique challenges and opportunities, including typically short document lengths and rich meta-data such as author characteristics and relationships. This creates great potential for systematic analysis of the enormous body of the users and thus provides implications for industrial strategies such as targeted marketing. Here we propose a novel and statistically principled method, clust-LDA, which incorporates authorship structure into the topical modeling, thus accomplishing the task of the topical inferences across documents on the basis of authorship and, simultaneously, the identification of groupings between authors. We develop an inference procedure for clust-LDA and demonstrate its performance on simulated data, showing that clust-LDA out-performs the ‘vanilla’ LDA on the topic identification task where authors exhibit distinctive topical preference. We also showcase the empirical performance of clust-LDA based on a real-world social media dataset from Reddit.
Generating a robust representation of the environment is a crucial ability of learning agents. Deep learning based methods have greatly improved perception systems but still fail in challenging situations. These failures are often not solvable on the basis of a single image. In this work, we present a parameter-efficient temporal filtering concept which extends an existing single-frame segmentation model to work with multiple frames. The resulting recurrent architecture temporally filters representations on all abstraction levels in a hierarchical manner, while decoupling temporal dependencies from scene representation. Using a synthetic dataset, we show the ability of our model to cope with data perturbations and highlight the importance of recurrent and hierarchical filtering.
Graphs are found in a plethora of domains, including online social networks, the World Wide Web and the study of epidemics, to name a few. With the advent of greater volumes of information and the need for continuously updated results under temporal constraints, it is necessary to explore novel approaches that further enable performance improvements. In the scope of stream processing over graphs, we research the trade-offs between result accuracy and the speedup of approximate computation techniques. We believe this to be a natural path towards these performance improvements. Herein we present GraphBolt, through which we conducted our research. It is an innovative model for approximate graph processing, implemented in Apache Flink. We analyze our model and evaluate it with the case study of the PageRank algorithm, perhaps the most famous measure of vertex centrality used to rank websites in search engine results. In light of our model, we discuss the challenges driven by relations between result accuracy and potential performance gains. Our experiments show that GraphBolt can reduce computational time by over 50% while achieving result quality above 95% when compared to results of the traditional version of PageRank without any summarization or approximation techniques.
The model parameters of convolutional neural networks (CNNs) are determined by backpropagation (BP). In this work, we propose an interpretable feedforward (FF) design without any BP as a reference. The FF design adopts a data-centric approach. It derives network parameters of the current layer based on data statistics from the output of the previous layer in a one-pass manner. To construct convolutional layers, we develop a new signal transform, called the Saab (Subspace Approximation with Adjusted Bias) transform. It is a variant of the principal component analysis (PCA) with an added bias vector to annihilate activation’s nonlinearity. Multiple Saab transforms in cascade yield multiple convolutional layers. As to fully-connected (FC) layers, we construct them using a cascade of multi-stage linear least squared regressors (LSRs). The classification and robustness (against adversarial attacks) performances of BP- and FF-designed CNNs applied to the MNIST and the CIFAR-10 datasets are compared. Finally, we comment on the relationship between BP and FF designs.
We extend the existing framework of semi-implicit variational inference (SIVI) and introduce doubly semi-implicit variational inference (DSIVI), a way to perform variational inference and learning when both the approximate posterior and the prior distribution are semi-implicit. In other words, DSIVI performs inference in models where the prior and the posterior can be expressed as an intractable infinite mixture of some analytic density with a highly flexible implicit mixing distribution. We provide a sandwich bound on the evidence lower bound (ELBO) objective that can be made arbitrarily tight. Unlike discriminator-based and kernel-based approaches to implicit variational inference, DSIVI optimizes a proper lower bound on ELBO that is asymptotically exact. We evaluate DSIVI on a set of problems that benefit from implicit priors. In particular, we show that DSIVI gives rise to a simple modification of VampPrior, the current state-of-the-art prior for variational autoencoders, which improves its performance.
Many services that perform information retrieval for Points of Interest (POI) utilize a Lucene-based setup with spatial filtering. While this type of system is easy to implement it does not make use of semantics but relies on direct word matches between a query and reviews leading to a loss in both precision and recall. To study the challenging task of semantically enriching POIs from unstructured data in order to support open-domain search and question answering (QA), we introduce a new dataset POIReviewQA. It consists of 20k questions (e.g.’is this restaurant dog friendly?’) for 1022 Yelp business types. For each question we sampled 10 reviews, and annotated each sentence in the reviews whether it answers the question and what the corresponding answer is. To test a system’s ability to understand the text we adopt an information retrieval evaluation by ranking all the review sentences for a question based on the likelihood that they answer this question. We build a Lucene-based baseline model, which achieves 77.0% AUC and 48.8% MAP. A sentence embedding-based model achieves 79.2% AUC and 41.8% MAP, indicating that the dataset presents a challenging problem for future research by the GIR community. The result technology can help exploit the thematic content of web documents and social media for characterisation of locations.