Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness

The accuracy of deep learning, i.e., deep neural networks, can be characterized by dividing the total error into three main types: approximation error, optimization error, and generalization error. Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization. Most existing theoretical works for generalization fail to explain the performance of neural networks in practice. To derive a meaningful bound, we study the generalization error of neural networks for classification problems in terms of data distribution and neural network smoothness. We introduce the cover complexity (CC) to measure the difficulty of learning a data set and the inverse of modules of continuity to quantify neural network smoothness. A quantitative bound for expected accuracy/error is derived by considering both the CC and neural network smoothness. We validate our theoretical results by several data sets of images. The numerical results verify that the expected error of trained networks scaled with the square root of the number of classes has a linear relationship with respect to the CC. In addition, we observe a clear consistency between test loss and neural network smoothness during the training process.


Equivalent and Approximate Transformations of Deep Neural Networks

Two networks are equivalent if they produce the same output for any given input. In this paper, we study the possibility of transforming a deep neural network to another network with a different number of units or layers, which can be either equivalent, a local exact approximation, or a global linear approximation of the original network. On the practical side, we show that certain rectified linear units (ReLUs) can be safely removed from a network if they are always active or inactive for any valid input. If we only need an equivalent network for a smaller domain, then more units can be removed and some layers collapsed. On the theoretical side, we constructively show that for any feed-forward ReLU network, there exists a global linear approximation to a 2-hidden-layer shallow network with a fixed number of units. This result is a balance between the increasing number of units for arbitrary approximation with a single layer and the known upper bound of \lceil log(n_0+1)\rceil +1 layers for exact representation, where n_0 is the input dimension. While the transformed network may require an exponential number of units to capture the activation patterns of the original network, we show that it can be made substantially smaller by only accounting for the patterns that define linear regions. Based on experiments with ReLU networks on the MNIST dataset, we found that l_1-regularization and adversarial training reduces the number of linear regions significantly as the number of stable units increases due to weight sparsity. Therefore, we can also intentionally train ReLU networks to allow for effective loss-less compression and approximation.


COSET: A Benchmark for Evaluating Neural Program Embeddings

Neural program embedding can be helpful in analyzing large software, a task that is challenging for traditional logic-based program analyses due to their limited scalability. A key focus of recent machine-learning advances in this area is on modeling program semantics instead of just syntax. Unfortunately evaluating such advances is not obvious, as program semantics does not lend itself to straightforward metrics. In this paper, we introduce a benchmarking framework called COSET for standardizing the evaluation of neural program embeddings. COSET consists of a diverse dataset of programs in source-code format, labeled by human experts according to a number of program properties of interest. A point of novelty is a suite of program transformations included in COSET. These transformations when applied to the base dataset can simulate natural changes to program code due to optimization and refactoring and can serve as a ‘debugging’ tool for classification mistakes. We conducted a pilot study on four prominent models: TreeLSTM, gated graph neural network (GGNN), AST-Path neural network (APNN), and DYPRO. We found that COSET is useful in identifying the strengths and limitations of each model and in pinpointing specific syntactic and semantic characteristics of programs that pose challenges.


Probabilistic mappings and Bayesian nonparametrics

In this paper we develop a functorial language of probabilistic mappings and apply it to basic problems in Bayesian nonparametrics. First we extend and unify the Kleisli category of probabilistic mappings proposed by Lawvere and Giry with the category of statistical models proposed by Chentsov and Morse-Sacksteder. Then we introduce the notion of a Bayesian statistical model that formalizes the notion of a parameter space with a given prior distribution in Bayesian statistics. We give a formula for posterior distributions, assuming that the underlying parameter space of a Bayesian statistical model is a Souslin space and the sample space of the Bayesian statistical model is a subset in a complete connected finite dimensional Riemannian manifold. Then we give a new proof of the existence of Dirichlet measures over any measurable space using a functorial property of the Dirichlet map constructed by Sethuraman.


Forecasting Stock Market with Support Vector Regression and Butterfly Optimization Algorithm

Support Vector Regression (SVR) has achieved high performance on forecasting future behavior of random systems. However, the performance of SVR models highly depends upon the appropriate choice of SVR parameters. In this study, a novel BOA-SVR model based on Butterfly Optimization Algorithm (BOA) is presented. The performance of the proposed model is compared with eleven other meta-heuristic algorithms on a number of stocks from NASDAQ. The results indicate that the presented model here is capable to optimize the SVR parameters very well and indeed is one of the best models judged by both prediction performance accuracy and time consumption.


Infusing domain knowledge in AI-based ‘black box’ models for better explainability with application in bankruptcy prediction

Although ‘black box’ models such as Artificial Neural Networks, Support Vector Machines, and Ensemble Approaches continue to show superior performance in many disciplines, their adoption in the sensitive disciplines (e.g., finance, healthcare) is questionable due to the lack of interpretability and explainability of the model. In fact, future adoption of ‘black box’ models is difficult because of the recent rule of ‘right of explanation’ by the European Union where a user can ask for an explanation behind an algorithmic decision, and the newly proposed bill by the US government, the ‘Algorithmic Accountability Act’, which would require companies to assess their machine learning systems for bias and discrimination and take corrective measures. Top Bankruptcy Prediction Models are A.I.-based and are in need of better explainability -the extent to which the internal working mechanisms of an AI system can be explained in human terms. Although explainable artificial intelligence is an emerging field of research, infusing domain knowledge for better explainability might be a possible solution. In this work, we demonstrate a way to collect and infuse domain knowledge into a ‘black box’ model for bankruptcy prediction. Our understanding from the experiments reveals that infused domain knowledge makes the output from the black box model more interpretable and explainable.


AI Feynman: a Physics-Inspired Method for Symbolic Regression

A core challenge for both physics and artificial intellicence (AI) is symbolic regression: finding a symbolic expression that matches data from an unknown function. Although this problem is likely to be NP-hard in principle, functions of practical interest often exhibit symmetries, separability, compositionality and other simplifying properties. In this spirit, we develop a recursive multidimensional symbolic regression algorithm that combines neural network fitting with a suite of physics-inspired techniques. We apply it to 100 equations from the Feynman Lectures on Physics, and it discovers all of them, while previous publicly available software cracks only 71; for a more difficult test set, we improve the state of the art success rate from 15% to 90%.


Relational Representation Learning for Dynamic (Knowledge) Graphs: A Survey

Graphs arise naturally in many real-world applications including social networks, recommender systems, ontologies, biology, and computational finance. Traditionally, machine learning models for graphs have been mostly designed for static graphs. However, many applications involve evolving graphs. This introduces important challenges for learning and inference since nodes, attributes, and edges change over time. In this survey, we review the recent advances in representation learning for dynamic graphs, including dynamic knowledge graphs. We describe existing models from an encoder-decoder perspective, categorize these encoders and decoders based on the techniques they employ, and analyze the approaches in each category. We also review several prominent applications and widely used datasets, and highlight directions for future research.


FAN: Focused Attention Networks

Attention networks show promise for both vision and language tasks, by emphasizing relationships between constituent elements through appropriate weighting functions. Such elements could be regions in an image output by a region proposal network, or words in a sentence, represented by word embedding. Thus far, however, the learning of attention weights has been driven solely by the minimization of task specific loss functions. We here introduce a method of learning attention weights to better emphasize informative pair-wise relations between entities. The key idea is to use a novel center-mass cross entropy loss, which can be applied in conjunction with the task specific ones. We then introduce a focused attention backbone to learn these attention weights for general tasks. We demonstrate that the focused attention module leads to a new state-of-the-art for the recovery of relations in a relationship proposal task. Our experiments show that it also boosts performance for diverse vision and language tasks, including object detection, scene categorization and document classification.


Deep Neural Networks Abstract Like Humans

Deep neural networks (DNNs) have revolutionized AI due to their remarkable performance in pattern recognition, comprising of both memorizing complex training sets and demonstrating intelligence by generalizing to previously unseen data (test sets). The high generalization performance in DNNs has been explained by several mathematical tools, including optimization, information theory, and resilience analysis. In humans, it is the ability to abstract concepts from examples that facilitates generalization; this paper thus researches DNN generalization from that perspective. A recent computational neuroscience study revealed a correlation between abstraction and particular neural firing patterns. We express these brain patterns in a closed-form mathematical expression, termed the `Cognitive Neural Activation metric’ (CNA) and apply it to DNNs. Our findings reveal parallels in the mechanism underlying abstraction in DNNs and those in the human brain. Beyond simply measuring similarity to human abstraction, the CNA is able to predict and rate how well a DNN will perform on test sets, and determines the best network architectures for a given task in a manner not possible with extant tools. These results were validated on a broad range of datasets (including ImageNet and random labeled datasets) and neural architectures.


On a scalable problem transformation method for multi-label learning

Binary relevance is a simple approach to solve multi-label learning problems where an independent binary classifier is built per each label. A common challenge with this in real-world applications is that the label space can be very large, making it difficult to use binary relevance to larger scale problems. In this paper, we propose a scalable alternative to this, via transforming the multi-label problem into a single binary classification. We experiment with a few variations of our method and show that our method achieves higher precision than binary relevance and faster execution times on a top-K recommender system task.


Improved Training Speed, Accuracy, and Data Utilization Through Loss Function Optimization

As the complexity of neural network models has grown, it has become increasingly important to optimize their design automatically through metalearning. Methods for discovering hyperparameters, topologies, and learning rate schedules have lead to significant increases in performance. This paper shows that loss functions can be optimized with metalearning as well, and result in similar improvements. The method, Genetic Loss-function Optimization (GLO), discovers loss functions de novo, and optimizes them for a target task. Leveraging techniques from genetic programming, GLO builds loss functions hierarchically from a set of operators and leaf nodes. These functions are repeatedly recombined and mutated to find an optimal structure, and then a covariance-matrix adaptation evolutionary strategy (CMA-ES) is used to find optimal coefficients. Networks trained with GLO loss functions are found to outperform the standard cross-entropy loss on standard image classification tasks. Training with these new loss functions requires fewer steps, results in lower test error, and allows for smaller datasets to be used. Loss-function optimization thus provides a new dimension of metalearning, and constitutes an important step towards AutoML.


Structure Learning for Neural Module Networks

Neural Module Networks, originally proposed for the task of visual question answering, are a class of neural network architectures that involve human-specified neural modules, each designed for a specific form of reasoning. In current formulations of such networks only the parameters of the neural modules and/or the order of their execution is learned. In this work, we further expand this approach and also learn the underlying internal structure of modules in terms of the ordering and combination of simple and elementary arithmetic operators. Our results show that one is indeed able to simultaneously learn both internal module structure and module sequencing without extra supervisory signals for module execution sequencing. With this approach, we report performance comparable to models using hand-designed modules.


OrderNet: Ordering by Example

In this paper we introduce a new neural architecture for sorting unordered sequences where the correct sequence order is not easily defined but must rather be inferred from training data. We refer to this architecture as OrderNet and describe how it was constructed to be naturally permutation equivariant while still allowing for rich interactions of elements of the input set. We evaluate the capabilities of our architecture by training it to approximate solutions for the Traveling Salesman Problem and find that it outperforms previously studied supervised techniques in its ability to generalize to longer sequences than it was trained with. We further demonstrate the capability by reconstructing the order of sentences with scrambled word order.


Learning Bregman Divergences

Metric learning is the problem of learning a task-specific distance function given supervision. Classical linear methods for this problem (known as Mahalanobis metric learning approaches) are well-studied both theoretically and empirically, but are limited to Euclidean distances after learned linear transformations of the input space. In this paper, we consider learning a Bregman divergence, a rich and important class of divergences that includes Mahalanobis metrics as a special case but also includes the KL-divergence and others. We develop a formulation and algorithm for learning arbitrary Bregman divergences based on approximating their underlying convex generating function via a piecewise linear function. We show several theoretical results of our resulting model, including a PAC guarantee that the learned Bregman divergence approximates an arbitrary Bregman divergence with error O_p (m^{-1/(d+2)}), where m is the number of training points and d is the dimension of the data. We provide empirical results on using the learned divergences for classification, semi-supervised clustering, and ranking problems.


Leap-LSTM: Enhancing Long Short-Term Memory for Text Categorization

Recurrent Neural Networks (RNNs) are widely used in the field of natural language processing (NLP), ranging from text categorization to question answering and machine translation. However, RNNs generally read the whole text from beginning to end or vice versa sometimes, which makes it inefficient to process long texts. When reading a long document for a categorization task, such as topic categorization, large quantities of words are irrelevant and can be skipped. To this end, we propose Leap-LSTM, an LSTM-enhanced model which dynamically leaps between words while reading texts. At each step, we utilize several feature encoders to extract messages from preceding texts, following texts and the current word, and then determine whether to skip the current word. We evaluate Leap-LSTM on several text categorization tasks: sentiment analysis, news categorization, ontology classification and topic classification, with five benchmark data sets. The experimental results show that our model reads faster and predicts better than standard LSTM. Compared to previous models which can also skip words, our model achieves better trade-offs between performance and efficiency.


A Review of Semi Supervised Learning Theories and Recent Advances

Semi-supervised learning, which has emerged from the beginning of this century, is a new type of learning method between traditional supervised learning and unsupervised learning. The main idea of semi-supervised learning is to introduce unlabeled samples into the model training process to avoid performance (or model) degeneration due to insufficiency of labeled samples. Semi-supervised learning has been applied successfully in many fields. This paper reviews the development process and main theories of semi-supervised learning, as well as its recent advances and importance in solving real-world problems demonstrated by typical application examples.


Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network

Previous cross-lingual knowledge graph (KG) alignment studies rely on entity embeddings derived only from monolingual KG structural information, which may fail at matching entities that have different facts in two KGs. In this paper, we introduce the topic entity graph, a local sub-graph of an entity, to represent entities with their contextual information in KG. From this view, the KB-alignment task can be formulated as a graph matching problem; and we further propose a graph-attention based solution, which first matches all entities in two topic entity graphs, and then jointly model the local matching information to derive a graph-level matching vector. Experiments show that our model outperforms previous state-of-the-art methods by a large margin.


Solving NP-Hard Problems on Graphs by Reinforcement Learning without Domain Knowledge

We propose an algorithm based on reinforcement learning for solving NP-hard problems on graphs. We combine Graph Isomorphism Networks and the Monte-Carlo Tree Search, which was originally used for game searches, for solving combinatorial optimization on graphs. Similarly to AlphaGo Zero, our method does not require any problem-specific knowledge or labeled datasets (exact solutions), which are difficult to calculate in principle. We show that our method, which is trained by generated random graphs, successfully finds near-optimal solutions for the Maximum Independent Set problem on citation networks. Experiments illustrate that the performance of our method is comparable to SOTA solvers, but we do not require any problem-specific reduction rules, which is highly desirable in practice since collecting hand-crafted reduction rules is costly and not adaptive for a wide range of problems.


Crowdsourced Peer Learning Activity for Internet of Things Education: A Case Study

Computing devices such as laptops, tablets and mobile phones have become part of our daily lives. End users increasingly know more and more information about these devices. Further, more technically savvy end users know how such devices are being built and know how to choose one over the others. However, we cannot say the same about the Internet of Things (IoT) products. Due to its infancy nature of the marketplace, end users have very little idea about IoT products. To address this issue, we developed a method, a crowdsourced peer learning activity, supported by an online platform (OLYMPUS) to enable a group of learners to learn IoT products space better. We conducted two different user studies to validate that our tool enables better IoT education. Our method guide learners to think more deeply about IoT products and their design decisions. The learning platform we developed is open source and available for the community.


Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning

Without relevant human priors, neural networks may learn uninterpretable features. We propose Dynamics of Attention for Focus Transition (DAFT) as a human prior for machine reasoning. DAFT is a novel method that regularizes attention-based reasoning by modelling it as a continuous dynamical system using neural ordinary differential equations. As a proof of concept, we augment a state-of-the-art visual reasoning model with DAFT. Our experiments reveal that applying DAFT yields similar performance to the original model while using fewer reasoning steps, showing that it implicitly learns to skip unnecessary steps. We also propose a new metric, Total Length of Transition (TLT), which represents the effective reasoning step size by quantifying how much a given model’s focus drifts while reasoning about a question. We show that adding DAFT results in lower TLT, demonstrating that our method indeed obeys the human prior towards shorter reasoning paths in addition to producing more interpretable attention maps.