# Book Memo: “Strong Stable Markov Chains”

**21**
*Thursday*
Mar 2019

Posted Books

in
Advertisements

**21**
*Thursday*
Mar 2019

Posted Books

in
Advertisements

**21**
*Thursday*
Mar 2019

Posted Documents

inAn extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep learning dynamics to give rise to these regularities. A mathematical theory of semantic development in deep neural networks

**21**
*Thursday*
Mar 2019

Posted What is ...

in**Variational Neural Network**

The choice of activation function can significantly influence the performance of neural networks. The lack of guiding principles for the selection of activation function is lamentable. We try to address this issue by introducing our variational neural networks, where the activation function is represented as a linear combination of possible candidate functions, and an optimal activation is obtained via minimization of a loss function using gradient descent method. The gradient formulae for the loss function with respect to these expansion coefficients are central for the implementation of gradient descent algorithm, and here we derive these gradient formulae. … **NSGA-Net**

This paper introduces NSGA-Net, an evolutionary approach for neural architecture search (NAS). NSGA-Net is designed with three goals in mind: (1) a NAS procedure for multiple, possibly conflicting, objectives, (2) efficient exploration and exploitation of the space of potential neural network architectures, and (3) output of a diverse set of network architectures spanning a trade-off frontier of the objectives in a single run. NSGA-Net is a population-based search algorithm that explores a space of potential neural network architectures in three steps, namely, a population initialization step that is based on prior-knowledge from hand-crafted architectures, an exploration step comprising crossover and mutation of architectures and finally an exploitation step that applies the entire history of evaluated neural architectures in the form of a Bayesian Network prior. Experimental results suggest that combining the objectives of minimizing both an error metric and computational complexity, as measured by FLOPS, allows NSGA-Net to find competitive neural architectures near the Pareto front of both objectives on two different tasks, object classification and object alignment. NSGA-Net obtains networks that achieve 3.72% (at 4.5 million FLOP) error on CIFAR-10 classification and 8.64% (at 26.6 million FLOP) error on the CMU-Car alignment task. Code available at: https://…/nsga-net … **Variational Noise-Contrastive Estimation**

Unnormalised latent variable models are a broad and flexible class of statistical models. However, learning their parameters from data is intractable, and few estimation techniques are currently available for such models. To increase the number of techniques in our arsenal, we propose variational noise-contrastive estimation (VNCE), building on NCE which is a method that only applies to unnormalised models. The core idea is to use a variational lower bound to the NCE objective function, which can be optimised in the same fashion as the evidence lower bound (ELBO) in standard variational inference (VI). We prove that VNCE can be used for both parameter estimation of unnormalised models and posterior inference of latent variables. The developed theory shows that VNCE has the same level of generality as standard VI, meaning that advances made there can be directly imported to the unnormalised setting. We validate VNCE on toy models and apply it to a realistic problem of estimating an undirected graphical model from incomplete data. …

**21**
*Thursday*
Mar 2019

Posted Books

in
**21**
*Thursday*
Mar 2019

Posted arXiv Papers

in**Machine learning meets quantum physics**

The marriage of machine learning and quantum physics may give birth to a new research frontier that could transform both.

**QuickStop: A Markov Optimal Stopping Approach for Quickest Misinformation Detection**

This paper combines data-driven and model-driven methods for real-time misinformation detection. Our algorithm, named QuickStop, is an optimal stopping algorithm based on a probabilistic information spreading model obtained from labeled data. The algorithm consists of an offline machine learning algorithm for learning the probabilistic information spreading model and an online optimal stopping algorithm to detect misinformation. The online detection algorithm has both low computational and memory complexities. Our numerical evaluations with a real-world dataset show that QuickStop outperforms existing misinformation detection algorithms in terms of both accuracy and detection time (number of observations needed for detection). Our evaluations with synthetic data further show that QuickStop is robust to (offline) learning errors.

**A Capsule-unified Framework of Deep Neural Networks for Graphical Programming**

Recently, the growth of deep learning has produced a large number of deep neural networks. How to describe these networks unifiedly is becoming an important issue. We first formalize neural networks in a mathematical definition, give their directed graph representations, and prove a generation theorem about the induced networks of connected directed acyclic graphs. Then, using the concept of capsule to extend neural networks, we set up a capsule-unified framework for deep learning, including a mathematical definition of capsules, an induced model for capsule networks and a universal backpropagation algorithm for training them. Finally, we discuss potential applications of the framework to graphical programming with standard graphical symbols of capsules, neurons, and connections.

**The Iterated Local Model for Social Networks**

On-line social networks, such as in Facebook and Twitter, are often studied from the perspective of friendship ties between agents in the network. Adversarial ties, however, also play an important role in the structure and function of social networks, but are often hidden. Underlying generative mechanisms of social networks are predicted by structural balance theory, which postulates that triads of agents, prefer to be transitive, where friends of friends are more likely friends, or anti-transitive, where adversaries of adversaries become friends. The previously proposed Iterated Local Transitivity (ILT) and Iterated Local Anti-Transitivity (ILAT) models incorporated transitivity and anti-transitivity, respectively, as evolutionary mechanisms. These models resulted in graphs with many observable properties of social networks, such as low diameter, high clustering, and densification. We propose a new, generative model, referred to as the Iterated Local Model (ILM) for social networks synthesizing both transitive and anti-transitive triads over time. In ILM, we are given a countably infinite binary sequence as input, and that sequence determines whether we apply a transitive or an anti-transitive step. The resulting model exhibits many properties of complex networks observed in the ILT and ILAT models. In particular, for any input binary sequence, we show that asymptotically the model generates finite graphs that densify, have clustering coefficient bounded away from 0, have diameter at most 3, and exhibit bad spectral expansion. We also give a thorough analysis of the chromatic number, domination number, Hamiltonicity, and isomorphism types of induced subgraphs of ILM graphs.

**GOGGLES: Automatic Training Data Generation with Affinity Coding**

Generating large labeled training data is becoming the biggest bottleneck in building and deploying supervised machine learning models. Recently, data programming has been proposed in the data management community to reduce the human cost in training data generation. Data programming expects users to write a set of labeling functions, each of which is a weak supervision source that labels a subset of data points with better-than-random accuracy. However, the success of data programming heavily depends on the quality (in terms of both accuracy and coverage) of the labeling functions that users still need to design manually. We propose affinity coding, a new paradigm for fully automatic generation of training data. In affinity coding, the similarity between the unlabeled instances and prototypes that are derived from the same unlabeled instances serve as signals (or sources of weak supervision) for determining class membership. We term this implicit similarity as the affinity score. Consequently, we can have as many sources of weak supervision as the number of unlabeled data points, without any human input. We also propose a system called GOGGLES that is an implementation of affinity coding for labeling image datasets. GOGGLES features novel techniques for deriving affinity scores from image datasets based on ‘semantic prototypes’ extracted from convolutional neural nets, as well as an expectation-maximization approach for performing class label inference based on the computed affinity scores. Compared to the state-of-the-art data programming system Snorkel, GOGGLES exhibits 14.88% average improvement in terms of the quality of labels generated for the binary labeling task. The GOGGLES system is open-sourced at https://…/.

**Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification**

Unintended bias in Machine Learning can manifest as systemic differences in performance for different demographic groups, potentially compounding existing challenges to fairness in society at large. In this paper, we introduce a suite of threshold-agnostic metrics that provide a nuanced view of this unintended bias, by considering the various ways that a classifier’s score distribution can vary across designated groups. We also introduce a large new test set of online comments with crowd-sourced annotations for identity references. We use this to show how our metrics can be used to find new and potentially subtle unintended bias in existing public models.

Even though computational intelligence techniques have been extensively utilized in financial trading systems, almost all developed models use the time series data for price prediction or identifying buy-sell points. However, in this study we decided to use 2-D stock bar chart images directly without introducing any additional time series associated with the underlying stock. We propose a novel algorithmic trading model CNN-BI (Convolutional Neural Network with Bar Images) using a 2-D Convolutional Neural Network. We generated 2-D images of sliding windows of 30-day bar charts for Dow 30 stocks and trained a deep Convolutional Neural Network (CNN) model for our algorithmic trading model. We tested our model separately between 2007-2012 and 2012-2017 for representing different market conditions. The results indicate that the model was able to outperform Buy and Hold strategy, especially in trendless or bear markets. Since this is a preliminary study and probably one of the first attempts using such an unconventional approach, there is always potential for improvement. Overall, the results are promising and the model might be integrated as part of an ensemble trading model combined with different strategies.

**Generalized Sparse Additive Models**

We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework.

**Joint Time Series and Cross-Section Limit Theory under Mixingale Assumptions**

In this paper we complement joint time series and cross-section convergence results of Hahn, Kuersteiner and Mazzocco (2016) by allowing for serial correlation in the time series sample. The implications of our analysis are limiting distributions that have a well known form of long run variances for the time series limit. We obtain these results at the cost of imposing strict stationarity for the time series model and conditional independence between the time series and cross-section samples. Our results can be applied to estimators that combine time series and cross-section data in the presence of aggregate uncertainty in models with rationally forward looking agents.

Features in machine learning problems are often time varying and may be related to outputs in an algebraic or dynamical manner. The dynamic nature of these machine learning problems renders current accelerated gradient descent methods unstable or weakens their convergence guarantees. This paper proposes algorithms for the case when time varying features are present, and demonstrates provable performance guarantees. We develop a variational perspective within a continuous time algorithm. This variational perspective includes, among other things, higher-order learning concepts and normalization, both of which stem from adaptive control, and allows stability to be established for dynamical machine learning problems. These higher-order algorithms are also examined for achieving accelerated learning in adaptive control. Simulations are provided to verify the theoretical results.

**Transfer Adaptation Learning: A Decade Survey**

The world we see is ever-changing and it always changes with people, things, and the environment. Domain is referred to as the state of the world at a certain moment. A research problem is characterized as domain transfer adaptation when it needs knowledge correspondence between different moments. Conventional machine learning aims to find a model with the minimum expected risk on test data by minimizing the regularized empirical risk on the training data, which, however, supposes that the training and test data share similar joint probability distribution. Transfer adaptation learning aims to build models that can perform tasks of target domain by learning knowledge from a semantic related but distribution different source domain. It is an energetic research filed of increasing influence and importance. This paper surveys the recent advances in transfer adaptation learning methodology and potential benchmarks. Broader challenges being faced by transfer adaptation learning researchers are identified, i.e., instance re-weighting adaptation, feature adaptation, classifier adaptation, deep network adaptation, and adversarial adaptation, which are beyond the early semi-supervised and unsupervised split. The survey provides researchers a framework for better understanding and identifying the research status, challenges and future directions of the field.

**Knowledge Adaptation for Efficient Semantic Segmentation**

Both accuracy and efficiency are of significant importance to the task of semantic segmentation. Existing deep FCNs suffer from heavy computations due to a series of high-resolution feature maps for preserving the detailed knowledge in dense estimation. Although reducing the feature map resolution (i.e., applying a large overall stride) via subsampling operations (e.g., pooling and convolution striding) can instantly increase the efficiency, it dramatically decreases the estimation accuracy. To tackle this dilemma, we propose a knowledge distillation method tailored for semantic segmentation to improve the performance of the compact FCNs with large overall stride. To handle the inconsistency between the features of the student and teacher network, we optimize the feature similarity in a transferred latent domain formulated by utilizing a pre-trained autoencoder. Moreover, an affinity distillation module is proposed to capture the long-range dependency by calculating the non-local interactions across the whole image. To validate the effectiveness of our proposed method, extensive experiments have been conducted on three popular benchmarks: Pascal VOC, Cityscapes and Pascal Context. Built upon a highly competitive baseline, our proposed method can improve the performance of a student network by 2.5\% (mIOU boosts from 70.2 to 72.7 on the cityscapes test set) and can train a better compact model with only 8\% float operations (FLOPS) of a model that achieves comparable performances.

**Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning**

Bayesian optimization is popular for optimizing time-consuming black-box objectives. Nonetheless, for hyperparameter tuning in deep neural networks, the time required to evaluate the validation error for even a few hyperparameter settings remains a bottleneck. Multi-fidelity optimization promises relief using cheaper proxies to such objectives — for example, validation error for a network trained using a subset of the training points or fewer iterations than required for convergence. We propose a highly flexible and practical approach to multi-fidelity Bayesian optimization, focused on efficiently optimizing hyperparameters for iteratively trained supervised learning models. We introduce a new acquisition function, the trace-aware knowledge-gradient, which efficiently leverages both multiple continuous fidelity controls and trace observations — values of the objective at a sequence of fidelities, available when varying fidelity using training iterations. We provide a provably convergent method for optimizing our acquisition function and show it outperforms state-of-the-art alternatives for hyperparameter tuning of deep neural networks and large-scale kernel learning.

**Interaction Embeddings for Prediction and Explanation in Knowledge Graphs**

Knowledge graph embedding aims to learn distributed representations for entities and relations, and is proven to be effective in many applications. Crossover interactions — bi-directional effects between entities and relations — help select related information when predicting a new triple, but haven’t been formally discussed before. In this paper, we propose CrossE, a novel knowledge graph embedding which explicitly simulates crossover interactions. It not only learns one general embedding for each entity and relation as most previous methods do, but also generates multiple triple specific embeddings for both of them, named interaction embeddings. We evaluate embeddings on typical link prediction tasks and find that CrossE achieves state-of-the-art results on complex and more challenging datasets. Furthermore, we evaluate embeddings from a new perspective — giving explanations for predicted triples, which is important for real applications. In this work, an explanation for a triple is regarded as a reliable closed-path between the head and the tail entity. Compared to other baselines, we show experimentally that CrossE, benefiting from interaction embeddings, is more capable of generating reliable explanations to support its predictions.

**SmartEDA: An R Package for Automated Exploratory Data Analysis**

This paper introduces SmartEDA, which is an R package for performing Exploratory data analysis (EDA). EDA is generally the first step that one needs to perform before developing any machine learning or statistical models. The goal of EDA is to help someone perform the initial investigation to know more about the data via descriptive statistics and visualizations. In other words, the objective of EDA is to summarize and explore the data. The need for EDA became one of the factors that led to the development of various statistical computing packages over the years including the R programming language that is a very popular and currently the most widely used software for statistical computing. However, EDA is a very tedious task, requires some manual effort and some of the open source packages available in R are not just upto the mark. In this paper, we propose a new open source package i.e. SmartEDA for R to address the need for automation of exploratory data analysis. We discuss the various features of SmartEDA and illustrate some of its applications for generating actionable insights using a couple of real-world datasets. We also perform a comparative study of SmartEDA with respect to other packages available for exploratory data analysis in the Comprehensive R Archive Network (CRAN).

**Paradox in Deep Neural Networks: Similar yet Different while Different yet Similar**

Machine learning is advancing towards a data-science approach, implying a necessity to a line of investigation to divulge the knowledge learnt by deep neuronal networks. Limiting the comparison among networks merely to a predefined intelligent ability, according to ground truth, does not suffice, it should be associated with innate similarity of these artificial entities. Here, we analysed multiple instances of an identical architecture trained to classify objects in static images (CIFAR and ImageNet data sets). We evaluated the performance of the networks under various distortions and compared it to the intrinsic similarity between their constituent kernels. While we expected a close correspondence between these two measures, we observed a puzzling phenomenon. Pairs of networks whose kernels’ weights are over 99.9% correlated can exhibit significantly different performances, yet other pairs with no correlation can reach quite compatible levels of performance. We show implications of this for transfer learning, and argue its importance in our general understanding of what intelligence is, whether natural or artificial.

**ROC and AUC with a Binary Predictor: a Potentially Misleading Metric**

In analysis of binary outcomes, the receiver operator characteristic (ROC) curve is heavily used to show the performance of a model or algorithm. The ROC curve is informative about the performance over a series of thresholds and can be summarized by the area under the curve (AUC), a single number. When a predictor is categorical, the ROC curve has only as many thresholds as the one less than number of categories; when the predictor is binary there is only one threshold. As the AUC may be used in decision-making processes on determining the best model, it important to discuss how it agrees with the intuition from the ROC curve. We discuss how the interpolation of the curve between thresholds with binary predictors can largely change the AUC. Overall, we believe a linear interpolation from the ROC curve with binary predictors, which is most commonly done in software, corresponding to the estimated AUC. We believe these ROC curves and AUC can lead to misleading results. We compare R, Python, Stata, and SAS software implementations.

**Discriminative Principal Component Analysis: A REVERSE THINKING**

In this paper, we propose a novel approach named by Discriminative Principal Component Analysis which is abbreviated as Discriminative PCA in order to enhance separability of PCA by Linear Discriminant Analysis (LDA). The proposed method performs feature extraction by determining a linear projection that captures the most scattered discriminative information. The most innovation of Discriminative PCA is performing PCA on discriminative matrix rather than original sample matrix. For calculating the required discriminative matrix under low complexity, we exploit LDA on a converted matrix to obtain within-class matrix and between-class matrix thereof. During the computation process, we utilise direct linear discriminant analysis (DLDA) to solve the encountered SSS problem. For evaluating the performances of Discriminative PCA in face recognition, we analytically compare it with DLAD and PCA on four well known facial databases, they are PIE, FERET, YALE and ORL respectively. Results in accuracy and running time obtained by nearest neighbour classifier are compared when different number of training images per person used. Not only the superiority and outstanding performance of Discriminative PCA showed in recognition rate, but also the comparable results of running time.

One of the central interests of animal movement ecology is relating movement characteristics to behavioural characteristics. The traditional discrete-time statistical tool for inferring unobserved behaviours from movement data is the hidden Markov model (HMM). While the HMM is an important and powerful tool, sometimes it is not flexible enough to appropriately fit the data. Data for marine animals often exhibit conditional autocorrelation, self-dependence of the step length process which cannot be explained solely by the behavioural state, which violates one of the main assumptions of the HMM. Using a grey seal track as an example, along with multiple simulation scenarios, we motivate and develop the conditionally autoregressive hidden Markov model (CarHMM), which is a generalization of the HMM designed specifically to handle conditional autocorrelation. In addition to introducing and examining the new CarHMM, we provide guidelines for all stages of an analysis using either an HMM or CarHMM. These include guidelines for pre-processing location data to obtain deflection angles and step lengths, model selection, and model checking. In addition to these practical guidelines, we link estimated model parameters to biologically meaningful quantities such as activity budget and residency time. We also provide interpretations of traditional ‘foraging’ and ‘transiting’ behaviours in the context of the new CarHMM parameters.

**Termite: A System for Tunneling Through Heterogeneous Data**

Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files that are not integrated together in a single, queryable repository. Despite 40+ years of continuous effort by the database community, data integration still remains an open challenge. In this paper, we advocate a different approach: rather than trying to infer a common schema, we aim to find another common representation for diverse, heterogeneous data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness. We present Termite, a prototype we have built to learn the best embedding from the data. Because the best representation is learned, this allows Termite to avoid much of the human effort associated with traditional data integration tasks. On top of Termite, we have implemented a Termite-Join operator, which allows people to identify related concepts, even when these are stored in databases with different schemas and in unstructured data such as text files, webpages, etc. Finally, we show preliminary evaluation results of our prototype via a user study, and describe a list of future directions we have identified.

Age of Information (AoI) is a newly appeared concept and metric to characterize the freshness of data. In this work, we study the delay and AoI in a multiple access channel (MAC) with two source nodes transmitting different types of data to a common destination. The first node is grid-connected and its data packets arrive in a bursty manner, and at each time slot it transmits one packet with some probability. Another energy harvesting (EH) sensor node generates a new status update with a certain probability whenever it is charged. We derive the delay of the grid-connected node and the AoI of the EH sensor as functions of different parameters in the system. The results show that the mutual interference has a non-trivial impact on the delay and age performance of the two nodes.

**Efficient Optimization of Echo State Networks for Time Series Datasets**

Echo State Networks (ESNs) are recurrent neural networks that only train their output layer, thereby precluding the need to backpropagate gradients through time, which leads to significant computational gains. Nevertheless, a common issue in ESNs is determining its hyperparameters, which are crucial in instantiating a well performing reservoir, but are often set manually or using heuristics. In this work we optimize the ESN hyperparameters using Bayesian optimization which, given a limited budget of function evaluations, outperforms a grid search strategy. In the context of large volumes of time series data, such as light curves in the field of astronomy, we can further reduce the optimization cost of ESNs. In particular, we wish to avoid tuning hyperparameters per individual time series as this is costly; instead, we want to find ESNs with hyperparameters that perform well not just on individual time series but rather on groups of similar time series without sacrificing predictive performance significantly. This naturally leads to a notion of clusters, where each cluster is represented by an ESN tuned to model a group of time series of similar temporal behavior. We demonstrate this approach both on synthetic datasets and real world light curves from the MACHO survey. We show that our approach results in a significant reduction in the number of ESN models required to model a whole dataset, while retaining predictive performance for the series in each cluster.

**21**
*Thursday*
Mar 2019

Posted R Packages

in* Plot Objects Moving in Orbits* (

Visualize the objects in orbits in 2D and 3D. The packages is under developing to plot the orbits of objects in polar coordinate system. See the exampl …

Monte Carol based tests for the Behrens Fisher Problem enhance the statistical power and performs better than Welch’s t-approximation, see Ullah et al. …

Create a side-by-side view of raster(image)s with an interactive slider to switch between regions of the images. This can be especially useful for imag …

Survey sampling using permanent random numbers (PRN’s). A solution to the problem of unknown overlap between survey samples, which leads to a low preci …

Provides utilities to compute, store and access distance matrices on disk as file-backed matrices provided by the ‘bigstatsr’ package. File-backed dist …

An ‘RStudio’ addin to assist with removing objects from the global environment. Features include removing objects according to name patterns and object …

**21**
*Thursday*
Mar 2019

Posted Magister Dixit

in“Big-data analysis consists of searching for buried patterns that have some kind of predictive power. But choosing which “features” of the data to analyze usually requires some human intuition. In a database containing, say, the beginning and end dates of various sales promotions and weekly profits, the crucial data may not be the dates themselves but the spans between them, or not the total profits but the averages across those spans.” Larry Hardesty ( October 16, 2015 )

**20**
*Wednesday*
Mar 2019

Posted Distilled News

in**Mapping 24 Emotions Conveyed by Brief Human Vocalization**

Emotional vocalizations are central to human social life. Recent studies have documented that people recognize at least 13 emotions in brief vocalizations. This capacity emerges early in development, is preserved in some form across cultures, and informs how people respond emotionally to music. What is poorly understood is how emotion recognition from vocalization is structured within what we call a semantic space, the study of which addresses questions critical to the field: How many distinct kinds of emotions can be expressed? Do expressions convey emotion categories or affective appraisals (e.g., valence, arousal)? Is the recognition of emotion expressions discrete or continuous? Guided by a new theoretical approach to emotion taxonomies, we apply large-scale data collection and analysis techniques to judgments of 2,032 emotional vocal bursts produced in laboratory settings (Study 1) and 48 found in the real world (Study 2) by U.S. English speakers (N = 1,105). We find that vocal bursts convey at least 24 distinct kinds of emotion. Emotion categories (sympathy, awe), more so than affective appraisals (including valence and arousal), organize emotion recognition. In contrast to discrete emotion theories, the emotion categories conveyed by vocal bursts are bridged by smooth gradients with continuously varying meaning. We visualize the complex, high-dimensional space of emotion conveyed by brief human vocalization within an online interactive map.

**5 New Generative Adversarial Network (GAN) Architectures For Image Synthesis**

AI image synthesis has made impressive progress since Generative Adversarial Networks (GANs) were introduced in 2014. GANs were originally only capable of generating small, blurry, black-and-white pictures, but now we can generate high-resolution, realistic and colorful pictures that you can hardly distinguish from real photographs. Here we have summarized for you 5 recently introduced GAN architectures that are used for image synthesis.

**Essential NLP Tools, Code, and Tips**

In a previous article, we introduced the influential impact of natural language processing (NLP) in different industries and explained the way this discipline is reshaping several fields, yet facing huge challenges on its way. The main drawbacks we face these days with NLP relate to the fact that language is very tricky. The process of understanding and manipulating language is extremely complex, and for this reason, it is common to use different techniques to handle different challenges before binding everything together. Programming languages like Python or R are highly used to perform these techniques, but before diving into code lines (that will be the topic of a different article), it’s important to understand the concepts beneath them.

**Tackling Bias in Machine Learning**

Machine learning and AI applications are used across industries, from recommendation engines to self-driving cars and more. When machine learning is used in automated decision-making, it can create issues with transparency, accountability, and equity. For example, last year it came to light that the AI tool Amazon built to automate their hiring process had to be shut down because it was discriminating against women.

**The FAIR Guiding Principles for scientific data management and stewardship**

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders – representing academia, industry, funding agencies, and scholarly publishers – have come together to design and jointly endorse a concise and measureable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

FAIR = Findability, Accessibility, Interoperability and Reusability

FAIR = Findability, Accessibility, Interoperability and Reusability

**Is Your Data FAIR? An Open Data Checklist for Success**

Many believe that data should be recognized as the ‘oil of the 21st century’, the world’s most valuable resource. This opinion is spreading, with 64% of more than 1,000 researchers who contributed to the State of Open Data Report in 2018 making their data available, compared to just 57% in 2016. There are huge social and financial benefits that businesses and economies can realize if they can successfully leverage Open Data. Despite this, there are still some hurdles for data professionals to leap. A great way to start is to consider whether your data meets the criteria for what’s known as the FAIR principles. These are Findability, Accessibility, Interoperability and Reusability.

**Overcoming distrust on the path to productive analytics**

We outline the importance of overcoming distrust in data and analytics, with tips on how to align all stakeholders, being a data optimist, streamlining the process, and more.

**Artificial Neural Networks Optimization using Genetic Algorithm with Python**

In a previous tutorial titled ‘Artificial Neural Network Implementation using NumPy and Classification of the Fruits360 Image Dataset’ available in my LinkedIn profile at this link https://…implementation-using-numpy-fruits360-gad, an artificial neural network (ANN) is created for classifying 4 classes of the Fruits360 image dataset. The source code used in this tutorial is available in my GitHub page here: https://…/NumPyANN A quick summary of this tutorial is extracting the feature vector (360 bins hue channel histogram) and reducing it to just 102 element by using a filter-based technique using the standard deviation. Later, the ANN is built from scratch using NumPy.

**The Credibility Crisis in Data Science**

There’s a tendency to focus on minutia, like what neural network architecture are we using? Are we using R or are we using Python? What method are you using? These kind of things, which are not as important to decision makers. One thing that I’ve heard just in working here at Civis is like you say from a CEO of a very large company that every one would know, if I mentioned who they were. I mean it’s basically saying if I have all these data scientists, I have hundreds of data scientists and I have no idea what the fuck they do all day. It’s like a part of a profession, part of a class of jobs. That’s not what you wanna be. You don’t want the decision makers and the people who are supposed to be benefiting from your insights, not able to discern what it is you do not understanding what your output is.

**A gentle introduction to SHAP values in R**

This novel approach allows us to dig a little bit more in the complexity of the predictive model results, while it allows us to explore the relationships between variables for predicted case.

In this month’s post, I set out to create a visual network of emotions. Emotion Dynamics tells us that different emotions are highly interconnected, such that one emotion morphs into another and so on. I’ll be using a large dataset from an original study published in PLOS ONE by Trampe, Quoidbach, and Taquet (2015). Thanks to Google Dataset Search, I was able to locate this data. The data is collected from 11,000 participants who completed daily questionnaires on the emotions they felt at a given moment. The original paper is fascinating and I highly encourage checking it out – not to mention that the author’s analysis is the inspiration for this post. The raw data can be freely accessed from the author’s OSF page (link in online article) – props to them for publishing the data! What is a network? In a sentence, a network is a complex set of interrelations between variables. Some terminology: nodes are the variables (in this case, emotions), and edges are the relationships between the variables. Networks can be directed, which means that variables are linked in a sequence (e.g, from emotion A to emotion B), or undirected, which just shows the relationships. Trampe et al. (2015) created an undirected network in their paper, but the data also allows for a directed network – and this is what I’m going to make for this post.

**Speeding Up and Perfecting Your Work Using Parallel Computing**

A detailed guide of Python multiprocessing vs. PySpark mapPartition. In science, behind every achievement is grinding, rigorous work. And success is unlikely to happen with one attempt. As a data scientist, you probably deal with huge amount of data and computations, perform repeated tests and experiments on your day-to-day work. Though you don’t want to turn your rewarding, stimulating job into tedious one by waiting the time-consuming operation repeating again and again, observation after observation.

**Creating a discord sentiment analysis bot using VADER**

Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and many more to identify and quantify the sentiment of some kind of text or audio. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER makes it easy for us to create a sentiment analysis application. In this article, we will create a discord bot that can analyze the sentiment of the written messages.

**The Complete TensorFlow Tutorial for Newbies**

TensorFlow is a robust framework for machine learning and deep learning. It makes it easier to build models and deploy them for production. It is the most popular framework among developers. This comes with no surprise, as the framework is also available for web-based machine learning (TensorFlow.js) and for on-device inference (TensorFlow Lite). Furthermore, with the recent announcement of TensorFlow 2.0, the framework will soon be easier to use, as the syntax will be simplified with fewer APIs, and it will support the Julia programming language. It is a great time to get started with TensorFlow, and mastering it is an important asset data scientists. This tutorial will get you up and running with TensorFlow. Note that TensorFlow 2.0 is not stable, so we will focus on the previous stable version. We will first install the framework in the easiest way possible, then we will write a few functions to learn the syntax and to use a few APIs. Finally, we will write a model that will recognize hand signs. Let’s get started!

**Introducing Mercury-ML: an open-source ‘messenger of the machine learning gods’**

In the ancient Roman mythology, the god Mercury was known as ‘the messenger of the gods’. Wearing winged shoes and a winged hat, he zipped between Mount Olympus and the kingdoms of men and saw to it that the will of the gods was known. He wasn’t the strongest, the wisest, the most revered, or the most feared of the gods, but he was fleet of foot and cunning and could be relied upon to steer events to their desired outcomes. Without him Perseus could not have defeated Medusa; Odysseus would have fallen to Circe’s spells; and Hercules could not have dragged Cerberus from Hades, thereby completing the final of his 12 mythical labours… With this post I would like to introduce a new initiative called Mercury-ML, and open-source ‘messenger of the machine learning gods’.

**Interactive spreadsheets in Jupyter**

ipywidgets plays an essential part in the Jupyter ecosystem; it brings interactivity between user and data. Widgets are eventful Python objects that often have a visual representation in the Jupyter Notebook or JupyterLab: a button, a slider, a text input, a checkbox… More than a library of interactive widgets, ipywidgets is a powerful framework upon which it is straightforward to create new custom widgets. Developers can quickly start their own widgets library with best practices of code structure and packaging using the widget-cookiecutter project. You can find examples of really nice widgets libraries in the blog-post: Video streaming in the Jupyter Notebook. A spreadsheet is an interactive tool for data analysis in a tabular form. It consists of cells and cell ranges. It supports value dependent cell formatting/styling and one can apply mathematical functions on cells and perform chained computations. It is the perfect user interface for statistical and financial operations. The Jupyter Notebook was lacking a spreadsheet library, that’s when ipysheet comes into play.

**20**
*Wednesday*
Mar 2019

Posted Documents

inDue to recent advances in digital technologies, and availability of credible data, an area of artificial intelligence, deep learning, has emerged, and has demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolution neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they require intensive CPU operations and memory bandwidth that make general CPUs fail to achieve desired performance levels. Consequently, hardware accelerators that use application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), and graphic processing units (GPUs) have been employed to improve the throughput of CNNs. More precisely, FPGAs have been recently adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism as well as due to their energy efficiency. In this paper, we review recent existing techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques for improving the acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNNs acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators of deep learning networks. Thus, this review is expected to direct the future advances on efficient hardware accelerators and to be useful for deep learning researchers. FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review

**20**
*Wednesday*
Mar 2019

Posted arXiv Papers

in**$β^3$-IRT: A New Item Response Model and its Applications**

Item Response Theory (IRT) aims to assess latent abilities of respondents based on the correctness of their answers in aptitude test items with different difficulty levels. In this paper, we propose the -IRT model, which models continuous responses and can generate a much enriched family of Item Characteristic Curve (ICC). In experiments we applied the proposed model to data from an online exam platform, and show our model outperforms a more standard 2PL-ND model on all datasets. Furthermore, we show how to apply \BIRT{} to assess the ability of machine learning classifiers. This novel application results in a new metric for evaluating the quality of the classifier’s probability estimates, based on the inferred difficulty and discrimination of data instances.

**Algorithms for an Efficient Tensor Biclustering**

Consider a data set collected by (individuals-features) pairs in different times. It can be represented as a tensor of three dimensions (Individuals, features and times). The tensor biclustering problem computes a subset of individuals and a subset of features whose signal trajectories over time lie in a low-dimensional subspace, modeling similarity among the signal trajectories while allowing different scalings across different individuals or different features. This approach are based on spectral decomposition in order to build the desired biclusters. We evaluate the quality of the results from each algorithms with both synthetic and real data set.

**Blameworthiness in Multi-Agent Settings**

We provide a formal definition of blameworthiness in settings where multiple agents can collaborate to avoid a negative outcome. We first provide a method for ascribing blameworthiness to groups relative to an epistemic state (a distribution over causal models that describe how the outcome might arise). We then show how we can go from an ascription of blameworthiness for groups to an ascription of blameworthiness for individuals using a standard notion from cooperative game theory, the Shapley value. We believe that getting a good notion of blameworthiness in a group setting will be critical for designing autonomous agents that behave in a moral manner.

**An Innovative Word Encoding Method For Text Classification Using Convolutional Neural Network**

Text classification plays a vital role today especially with the intensive use of social networking media. Recently, different architectures of convolutional neural networks have been used for text classification in which one-hot vector, and word embedding methods are commonly used. This paper presents a new language independent word encoding method for text classification. The proposed model converts raw text data to low-level feature dimension with minimal or no preprocessing steps by using a new approach called binary unique number of word ‘BUNOW’. BUNOW allows each unique word to have an integer ID in a dictionary that is represented as a k-dimensional vector of its binary equivalent. The output vector of this encoding is fed into a convolutional neural network (CNN) model for classification. Moreover, the proposed model reduces the neural network parameters, allows faster computation with few network layers, where a word is atomic representation the document as in word level, and decrease memory consumption for character level representation. The provided CNN model is able to work with other languages or multi-lingual text without the need for any changes in the encoding method. The model outperforms the character level and very deep character level CNNs models in terms of accuracy, network parameters, and memory consumption; the results show total classification accuracy 91.99% and error 8.01% using AG’s News dataset compared to the state of art methods that have total classification accuracy 91.45% and error 8.55%, in addition to the reduction in input feature vector and neural network parameters by 62% and 34%, respectively.

**Distributed Randomized Gradient-Free Mirror Descent Algorithm for Constrained Optimization**

This paper is concerned with multi-agent optimization problem. A distributed randomized gradient-free mirror descent (DRGFMD) method is developed by introducing a randomized gradient-free oracle in the mirror descent scheme where the non-Euclidean Bregman divergence is used. The classical gradient descent method is generalized without using subgradient information of objective functions. The proposed algorithm is the first distributed non-Euclidean zeroth-order method which achieves an convergence rate, recovering the best known optimal rate of distributed compact constrained convex optimization. Also, the DRGFMD algorithm achieves an convergence rate for the strongly convex constrained optimization case. The rate matches the best known non-compact constraint result. Moreover, a decentralized reciprocal weighted average approximating sequence is investigated and first used in distributed algorithm. A class of convergence rates are also achieved for the algorithm with weighted averaging (DRGFMD-WA). The technique on constructing the decentralized weighted average sequence provides new insight in searching for minimizers in distributed algorithms.

**Gathering Insights from Teenagers’ Hacking Experience with Authentic Cybersecurity Tools**

This Work-In-Progress Paper for the Innovative Practice Category presents a novel experiment in active learning of cybersecurity. We introduced a new workshop on hacking for an existing science-popularizing program at our university. The workshop participants, 28 teenagers, played a cybersecurity game designed for training undergraduates and professionals in penetration testing. Unlike in learning environments that are simplified for young learners, the game features a realistic virtual network infrastructure. This allows exploring security tools in an authentic scenario, which is complemented by a background story. Our research aim is to examine how young players approach using cybersecurity tools by interacting with the professional game. A preliminary analysis of the game session showed several challenges that the workshop participants faced. Nevertheless, they reported learning about security tools and exploits, and 61% of them reported wanting to learn more about cybersecurity after the workshop. Our results support the notion that young learners should be allowed more hands-on experience with security topics, both in formal education and informal extracurricular events.

**Graph Data on the Web: extend the pivot, don’t reinvent the wheel**

This article is a collective position paper from the Wimmics research team, expressing our vision of how Web graph data technologies should evolve in the future in order to ensure a high-level of interoperability between the many types of applications that produce and consume graph data. Wimmics stands for Web-Instrumented Man-Machine Interactions, Communities, and Semantics. We are a joint research team between INRIA Sophia Antipolis-M{\’e}diterran{\’e}e and I3S (CNRS and Universit{\’e} C{\^o}te d’Azur). Our challenge is to bridge formal semantics and social semantics on the web. Our research areas are graph-oriented knowledge representation, reasoning and operationalization to model and support actors, actions and interactions in web-based epistemic communities. The application of our research is supporting and fostering interactions in online communities and management of their resources. In this position paper, we emphasize the need to extend the semantic Web standard stack to address and fulfill new graph data needs, as well as the importance of remaining compatible with existing recommendations, in particular the RDF stack, to avoid the painful duplication of models, languages, frameworks, etc. The following sections group motivations for different directions of work and collect reasons for the creation of a working group on RDF 2.0 and other recommendations of the RDF family.

**Structured Knowledge Distillation for Semantic Segmentation**

In this paper, we investigate the knowledge distillation strategy for training small semantic segmentation networks by making use of large networks. We start from the straightforward scheme, pixel-wise distillation, which applies the distillation scheme adopted for image classification and performs knowledge distillation for each pixel~\emph{separately}. We further propose to distill the \emph{structured} knowledge from large networks to small networks, which is motivated by that semantic segmentation is a structured prediction problem. We study two structured distillation schemes: (i) \emph{pair-wise} distillation that distills the pairwise similarities, and (ii) \emph{holistic} distillation that uses GAN to distill holistic knowledge. The effectiveness of our knowledge distillation approaches is demonstrated by extensive experiments on three scene parsing datasets: Cityscapes, Camvid and ADE20K.

**Shapley regressions: A framework for statistical inference on machine learning models**

Machine learning models often excel in the accuracy of their predictions but are opaque due to their non-linear and non-parametric structure. This makes statistical inference challenging and disqualifies them from many applications where model interpretability is crucial. This paper proposes the Shapley regression framework as an approach for statistical inference on non-linear or non-parametric models. Inference is performed based on the Shapley value decomposition of a model, a pay-off concept from cooperative game theory. I show that universal approximators from machine learning are estimation consistent and introduce hypothesis tests for individual variable contributions, model bias and parametric functional forms. The inference properties of state-of-the-art machine learning models – like artificial neural networks, support vector machines and random forests – are investigated using numerical simulations and real-world data. The proposed framework is unique in the sense that it is identical to the conventional case of statistical inference on a linear model if the model is linear in parameters. This makes it a well-motivated extension to more general models and strengthens the case for the use of machine learning to inform decisions.

**Similarity Learning via Kernel Preserving Embedding**

Data similarity is a key concept in many data-driven applications. Many algorithms are sensitive to similarity measures. To tackle this fundamental problem, automatically learning of similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, semi-supervised learning. However, it just tries to reconstruct the original data and some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework by minimizing the reconstruction error of kernel matrices, rather than the reconstruction error of original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. Besides, our proposed kernel preserving opens up a large number of possibilities to embed high-dimensional data into low-dimensional space.

**From Hotelling to Load Balancing: Approximation and the Principle of Minimum Differentiation**

Competing firms tend to select similar locations for their stores. This phenomenon, called the principle of minimum differentiation, was captured by Hotelling with a landmark model of spatial competition but is still the object of an ongoing scientific debate. Although consistently observed in practice, many more realistic variants of Hotelling’s model fail to support minimum differentiation or do not have pure equilibria at all. In particular, it was recently proven for a generalized model which incorporates negative network externalities and which contains Hotelling’s model and classical selfish load balancing as special cases, that the unique equilibria do not adhere to minimum differentiation. Furthermore, it was shown that for a significant parameter range pure equilibria do not exist. We derive a sharp contrast to these previous results by investigating Hotelling’s model with negative network externalities from an entirely new angle: approximate pure subgame perfect equilibria. This approach allows us to prove analytically and via agent-based simulations that approximate equilibria having good approximation guarantees and that adhere to minimum differentiation exist for the full parameter range of the model. Moreover, we show that the obtained approximate equilibria have high social welfare.

**Gradient Descent based Optimization Algorithms for Deep Learning Models Training**

In this paper, we aim at providing an introduction to the gradient descent based optimization algorithms for learning deep neural network models. Deep learning models involving multiple nonlinear projection layers are very challenging to train. Nowadays, most of the deep learning model training still relies on the back propagation algorithm actually. In back propagation, the model variables will be updated iteratively until convergence with gradient descent based optimization algorithms. Besides the conventional vanilla gradient descent algorithm, many gradient descent variants have also been proposed in recent years to improve the learning performance, including Momentum, Adagrad, Adam, Gadam, etc., which will all be introduced in this paper respectively.

Deep Q-Learning has been successfully applied to a wide variety of tasks in the past several years. However, the architecture of the vanilla Deep Q-Network is not suited to deal with partially observable environments such as 3D video games. For this, recurrent layers had been added to the Deep Q-Network in order to allow it to handle past dependencies. We here use Minecraft for its customization advantages and design two very simple missions that can be frames as Partially Observable Markov Decision Process. We compare on these missions the Deep Q-Network and the Deep Recurrent Q-Network in order to see if the latter, which is trickier and longer to train, is always the best architecture when the agent has to deal with partial observability.

**Coresets for Ordered Weighted Clustering**

We design coresets for Ordered k-Median, a generalization of classical clustering problems such as k-Median and k-Center, that offers a more flexible data analysis, like easily combining multiple objectives (e.g., to increase fairness or for Pareto optimization). Its objective function is defined via the Ordered Weighted Averaging (OWA) paradigm of Yager (1988), where data points are weighted according to a predefined weight vector, but in order of their contribution to the objective (distance from the centers). A powerful data-reduction technique, called a coreset, is to summarize a point set in into a small (weighted) point set , such that for every set of potential centers, the objective value of the coreset approximates that of within factor . When there are multiple objectives (weights), the above standard coreset might have limited usefulness, whereas in a \emph{simultaneous} coreset, which was introduced recently by Bachem and Lucic and Lattanzi (2018), the above approximation holds for all weights (in addition to all centers). Our main result is a construction of a simultaneous coreset of size for Ordered k-Median. To validate the efficacy of our coreset construction we ran experiments on a real geographical data set. We find that our algorithm produces a small coreset, which translates to a massive speedup of clustering computations, while maintaining high accuracy for a range of weights.

TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for developing HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential of emerging also as HPC programming framework for heterogeneous supercomputers.

**Accuracy Booster: Performance Boosting using Feature Map Re-calibration**

Convolution Neural Networks (CNN) have been extremely successful in solving intensive computer vision tasks. The convolutional filters used in CNNs have played a major role in this success, by extracting useful features from the inputs. Recently researchers have tried to boost the performance of CNNs by re-calibrating the feature maps produced by these filters, e.g., Squeeze-and-Excitation Networks (SENets). These approaches have achieved better performance by \textit{Exciting} up the important channels or feature maps while diminishing the rest. However, in the process, architectural complexity has increased. We propose an architectural block that introduces much lower complexity than the existing methods of CNN performance boosting while performing significantly better than them. We carry out experiments on the CIFAR, ImageNet and MS-COCO datasets, and show that the proposed block can challenge the state-of-the-art results. Our method boosts the ResNet-50 architecture to perform comparably to the ResNet-152 architecture, which is a three times deeper network, on classification. We also show experimentally that our method is not limited to classification but also generalizes well to other tasks such as object detection.

**ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings**

In this paper, we introduce a comprehensive toolkit, ETNLP, which can evaluate, extract, and visualize multiple sets of pre-trained word embeddings. First, for evaluation, ETNLP analyses the quality of pre-trained embeddings based on an input word analogy list. Second, for extraction ETNLP provides a subset of the embeddings to be used in the downstream NLP tasks. Finally, ETNLP has a visualization module which is for exploring the embedded words interactively. We demonstrate the effectiveness of ETNLP on our pre-trained word embeddings in Vietnamese. Specifically, we create a large Vietnamese word analogy list to evaluate the embeddings. We then utilize the pre-trained embeddings for the name entity recognition (NER) task in Vietnamese and achieve the new state-of-the-art results on a benchmark dataset for the NER task. A video demonstration of ETNLP is available at https://…/317599106. The source code and data are available at https: //github.com/vietnlp/etnlp.

**Physics Enhanced Artificial Intelligence**

We propose that intelligently combining models from the domains of Artificial Intelligence or Machine Learning with Physical and Expert models will yield a more ‘trustworthy’ model than any one model from a single domain, given a complex and narrow enough problem. Based on mean-variance portfolio theory and bias-variance trade-off analysis, we prove combining models from various domains produces a model that has lower risk, increasing user trust. We call such combined models – physics enhanced artificial intelligence (PEAI), and suggest use cases for PEAI.

**Continual Learning via Neural Pruning**

We introduce Continual Learning via Neural Pruning (CLNP), a new method aimed at lifelong learning in fixed capacity models based on neuronal model sparsification. In this method, subsequent tasks are trained using the inactive neurons and filters of the sparsified network and cause zero deterioration to the performance of previous tasks. In order to deal with the possible compromise between model sparsity and performance, we formalize and incorporate the concept of graceful forgetting: the idea that it is preferable to suffer a small amount of forgetting in a controlled manner if it helps regain network capacity and prevents uncontrolled loss of performance during the training of future tasks. CLNP also provides simple continual learning diagnostic tools in terms of the number of free neurons left for the training of future tasks as well as the number of neurons that are being reused. In particular, we see in experiments that CLNP verifies and automatically takes advantage of the fact that the features of earlier layers are more transferable. We show empirically that CLNP leads to significantly improved results over current weight elasticity based methods.

We introduce a dynamic generative model, Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and the topic models such as the latent Dirichlet allocation. BAM is based on a Poisson process, whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a Polya urn, an integer valued self-reinforcing process. This urn processes, which we name a Polya-Bayes process, obey certain conditional independence properties that provide further insight about the nature of NTF. These insights also let us develop space efficient simulation algorithms that respect the potential sparsity of data: we propose a class of sequential importance sampling algorithms for computing NTF and approximating their marginal likelihood, which would be useful for model selection. The resulting methods can also be viewed as a model scoring method for topic models and discrete Bayesian networks with hidden variables. The new algorithms have favourable properties in the sparse data regime when contrasted with variational algorithms that become more accurate when the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behaviour of the algorithms for various data regimes.