Challenges for an Ontology of Artificial Intelligence

Of primary importance in formulating a response to the increasing prevalence and power of artificial intelligence (AI) applications in society are questions of ontology. Questions such as: What ‘are’ these systems? How are they to be regarded? How does an algorithm come to be regarded as an agent? We discuss three factors which hinder discussion and obscure attempts to form a clear ontology of AI: (1) the various and evolving definitions of AI, (2) the tendency for pre-existing technologies to be assimilated and regarded as ‘normal,’ and (3) the tendency of human beings to anthropomorphize. This list is not intended as exhaustive, nor is it seen to preclude entirely a clear ontology; however, these challenges are a necessary set of topics for consideration. Each of these factors is seen to present a ‘moving target’ for discussion, which poses a challenge for both technical specialists and non-practitioners of AI systems development (e.g., philosophers and theologians) to speak meaningfully given that the corpus of AI structures and capabilities evolves at a rapid pace. Finally, we present avenues for moving forward, including opportunities for collaborative synthesis for scholars in philosophy and science.

Artificial Intelligence in Intelligent Tutoring Robots: A Systematic Review and Design Guidelines

This study provides a systematic review of the recent advances in designing the intelligent tutoring robot (ITR), and summarises the status quo of applying artificial intelligence (AI) techniques. We first analyse the environment of the ITR and propose a relationship model for describing interactions of the ITR with the students, the social milieu and the curriculum. Then, we transform the relationship model into the perception-planning-action model to explore which AI techniques are suitable for the ITR. This article provides insights on promoting the human-robot teaching-learning process and AI-assisted educational techniques, illustrating the design guidelines and future research perspectives in intelligent tutoring robots.

Homunculus’ Brain and Categorical Logic

The interaction between syntax (formal language) and its semantics (meanings of language) is well studied in categorical logic. Results of this study are employed to understand how the brain could create meanings. To emphasize the toy character of the proposed model, we prefer to speak of the homunculus’ brain rather than just of the brain. The homunculus’ brain consists of neurons, each of which is modeled by a category, and axons between neurons, which are modeled by functors between the corresponding neuron-categories. Each neuron (category) has its own program enabling its working, i.e. a ‘theory’ of this neuron. In analogy with what is known from categorical logic, we postulate the existence of the pair of adjoint functors, called Lang and Syn, from a category, now called BRAIN, of categories, to a category, now called MIND, of theories. Our homunculus is a kind of ‘mathematical robot’, the neuronal architecture of which is not important. Its only aim is to provide us with the opportunity to study how such a simple brain-like structure could ‘create meanings’ out of its purely syntactic program. The pair of adjoint functors Lang and Syn models mutual dependencies between the syntactical structure of a given theory of MIND and the internal logic of its semantics given by a category of BRAIN. In this way, a formal language (syntax) and its meanings (semantics) are interwoven with each other in a manner corresponding to the adjointness of the functors Lang and Syn. Categories BRAIN and MIND interact with each other with their entire structures and, at the same time, these very structures are shaped by this interaction.

SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters with enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, with the end of Moore’s law, there is a limit to such scaling. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, which drastically reduce the computation during both training and inference, with simple multi-core parallelism on a modest CPU. SLIDE is an auspicious illustration of the power of smart randomized algorithms over CPUs in outperforming the best available GPU with an optimized implementation. Our evaluations on large industry-scale datasets, with some large fully connected architectures, show that training with SLIDE on a 44-core CPU is more than 2.7 times (2 hours vs. 5.5 hours) faster than the same network trained using TensorFlow on Tesla V100 at any given accuracy level. We provide code and benchmark scripts for reproducibility.
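
The abstract does not name the randomized algorithm involved. As a minimal sketch of how randomization can cut per-input computation, the example below assumes locality-sensitive hashing with signed random projections: neurons are bucketed by the hash of their weight vectors, and only the neurons sharing a bucket with the input are evaluated. All sizes, names and the single-table design are hypothetical and are not taken from the SLIDE implementation.

```python
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def simhash(vector, planes):
    """Signed random projection: one sign bit per random hyperplane."""
    return tuple(dot(plane, vector) >= 0.0 for plane in planes)


rng = random.Random(0)
dim, num_neurons, num_bits = 16, 200, 6  # hypothetical sizes

planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_neurons)]

# Pre-processing: bucket every neuron by the hash of its weight vector.
table = {}
for neuron_id, w in enumerate(weights):
    table.setdefault(simhash(w, planes), []).append(neuron_id)

# An input pointing in the same direction as neuron 7's weights, so it is
# guaranteed to land in that neuron's bucket.
x = [2.0 * v for v in weights[7]]

# Only neurons in the input's bucket are evaluated (ReLU activations).
active = table.get(simhash(x, planes), [])
activations = {i: max(0.0, dot(weights[i], x)) for i in active}

print(f"evaluated {len(active)} of {num_neurons} neurons")
```

A single hash table keeps the sketch short; using several tables would trade extra memory for a better chance of retrieving every high-activation neuron.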

A Learnable ScatterNet: Locally Invariant Convolutional Layers

In this paper we explore tying together the ideas from Scattering Transforms and Convolutional Neural Networks (CNN) for Image Analysis by proposing a learnable ScatterNet. Previous attempts at tying them together in hybrid networks have tended to keep the two parts separate, with the ScatterNet forming a fixed front end and a CNN forming a learned backend. We instead look at adding learning between scattering orders, as well as adding learned layers before the ScatterNet. We do this by breaking down the scattering orders into single convolutional-like layers we call ‘locally invariant’ layers, and adding a learned mixing term to this layer. Our experiments show that these locally invariant layers can improve accuracy when added to either a CNN or a ScatterNet. We also discover some surprising results in that the ScatterNet may be best positioned after one or more layers of learning rather than at the front of a neural network.

Connecting Bayes factor and the Region of Practical Equivalence (ROPE) Procedure for testing interval null hypothesis

There has been strong recent interest in testing interval null hypotheses for improved scientific inference. For example, Lakens et al. (2018) and Lakens and Harms (2017) use this approach to study whether there is a pre-specified meaningful treatment effect in gerontology and clinical trials, which is different from the more traditional point null hypothesis that tests for any treatment effect. Two popular Bayesian approaches are available for interval null hypothesis testing. One is the standard Bayes factor and the other is the Region of Practical Equivalence (ROPE) procedure championed by Kruschke and others over many years. This paper establishes a formal connection between these two approaches with two benefits. First, it helps to better understand and improve the ROPE procedure. Second, it leads to a simple and effective algorithm for computing the Bayes factor in a wide range of problems using draws from posterior distributions generated by standard Bayesian programs such as BUGS, JAGS and Stan. The tedious and error-prone task of coding custom-made software specific to the Bayes factor is then avoided.
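
One way to see how posterior draws can yield an interval-null Bayes factor is a standard identity: when both hypotheses use the same overall prior, truncated to their own region, the Bayes factor equals the posterior odds of the interval divided by its prior odds. The sketch below illustrates that identity only and is not claimed to be the paper's algorithm. The normal model, the data summary and the interval limits are hypothetical, and the draws are simulated from a closed-form posterior as a stand-in for sampler output.

```python
import random
from statistics import NormalDist


def interval_bayes_factor(draws, prior, low, high):
    """Bayes factor for H0: low <= theta <= high against H1: theta outside.

    Assumes both hypotheses share one overall prior, truncated to their own
    region, so that BF = posterior odds / prior odds of the interval.
    """
    post_p = sum(low <= d <= high for d in draws) / len(draws)
    prior_p = prior.cdf(high) - prior.cdf(low)
    if not 0.0 < post_p < 1.0 or not 0.0 < prior_p < 1.0:
        raise ValueError("interval probability of 0 or 1; odds are undefined")
    return (post_p / (1.0 - post_p)) / (prior_p / (1.0 - prior_p))


# Hypothetical setup: theta ~ N(0, 1) prior, n = 20 observations with known
# unit variance and sample mean 0.05, so the posterior is normal in closed form.
n, ybar = 20, 0.05
prior = NormalDist(0.0, 1.0)
posterior = NormalDist(n * ybar / (n + 1), (1.0 / (n + 1)) ** 0.5)

# Stand-in for MCMC output; in practice these would be the sampler's draws.
rng = random.Random(0)
draws = [rng.gauss(posterior.mean, posterior.stdev) for _ in range(20000)]

bf = interval_bayes_factor(draws, prior, low=-0.1, high=0.1)
print(f"Bayes factor in favour of the interval null: {bf:.2f}")
```

The estimate degrades when very few draws fall inside (or outside) the interval, which is why the function refuses to return odds built on a proportion of exactly 0 or 1.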

Multi-Hot Compact Network Embedding

Network embedding, a promising approach to network representation learning, is capable of supporting various subsequent network mining and analysis tasks, and has recently attracted growing research interest. Traditional approaches assign each node an independent continuous vector, which causes huge memory overhead for large networks. In this paper we propose a novel multi-hot compact embedding strategy to effectively reduce memory cost by learning partially shared embeddings. The insight is that a node embedding vector is composed of several basis vectors, which can significantly reduce the number of continuous vectors while maintaining similar data representation ability. Specifically, we propose the MCNE model to learn compact embeddings from pre-learned node features. A novel component named the compressor is integrated into MCNE to tackle the challenge that popular back-propagation optimization cannot propagate through discrete samples. We further propose an end-to-end model MCNE_{t} to learn compact embeddings from the input network directly. Empirically, we evaluate the proposed models on three real network datasets, and the results demonstrate that our proposals can save about 90% of the memory cost of network embeddings without a significant performance decline.
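
The idea of composing a node embedding from shared basis vectors can be sketched as follows. This is a toy illustration under stated assumptions, not the MCNE model: the codes are random rather than learned, the basis vectors are combined by summation, and all sizes are hypothetical.

```python
import random


def compose_embedding(code, basis):
    """Sum the basis vectors whose indices appear in a node's multi-hot code."""
    dim = len(basis[0])
    return [sum(basis[k][j] for k in code) for j in range(dim)]


rng = random.Random(0)
num_nodes, num_basis, dim, hots = 10000, 64, 128, 4  # hypothetical sizes

# Shared continuous parameters: a small table of basis vectors.
basis = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_basis)]

# Per-node discrete parameters: which basis vectors the node uses.
codes = [rng.sample(range(num_basis), hots) for _ in range(num_nodes)]

embedding = compose_embedding(codes[0], basis)

# Storage comparison, counted in stored values rather than bytes.
dense_floats = num_nodes * dim    # one independent vector per node
compact_floats = num_basis * dim  # the shared basis vectors
compact_ints = num_nodes * hots   # the small integer codes

print(f"dense: {dense_floats} floats")
print(f"compact: {compact_floats} floats + {compact_ints} small integers")
```

The continuous storage no longer grows with the number of nodes; only the short integer codes do.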

Learning Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning

Heterogeneous knowledge naturally arises among different agents in cooperative multiagent reinforcement learning. As such, learning can be greatly improved if agents can effectively pass their knowledge on to other agents. Existing work has demonstrated that peer-to-peer knowledge transfer, a process referred to as action advising, improves team-wide learning. In contrast to previous frameworks that advise at the level of primitive actions, we aim to learn high-level teaching policies that decide when and what high-level action (e.g., sub-goal) to advise a teammate. We introduce a new learning-to-teach framework, called hierarchical multiagent teaching (HMAT). The proposed framework solves difficulties faced by prior work on multiagent teaching when operating in domains with long horizons, delayed rewards, and continuous states/actions by leveraging temporal abstraction and deep function approximation. Our empirical evaluations show that HMAT accelerates team-wide learning progress in difficult environments that are more complex than those explored in previous work. HMAT also learns teaching policies that can be transferred to different teammates/tasks and can even teach teammates with heterogeneous action spaces.

Deductive Optimization of Relational Data Storage

Optimizing the physical storage and retrieval of data are two key database management problems. In this paper, we propose a language that can express a wide range of physical database layouts, going well beyond the row- and column-based methods that are widely used in database management systems. We also build a compiler for this language, which is specialized for a dataset and a query workload. We conduct experiments using a popular database benchmark, which show that the performance of these specialized queries is competitive with a state-of-the-art in-memory compiled database system.

Dyna-AIL: Adversarial Imitation Learning by Planning

Adversarial methods for imitation learning have been shown to perform well on various control tasks. However, they require a large number of environment interactions for convergence. In this paper, we propose an end-to-end differentiable adversarial imitation learning algorithm in a Dyna-like framework for switching between model-based planning and model-free learning from expert data. Our results on both discrete and continuous environments show that our approach of using model-based planning along with model-free learning converges to an optimal policy with fewer environment interactions than state-of-the-art learning methods.

Ranked List Loss for Deep Metric Learning

The objective of deep metric learning (DML) is to learn embeddings that can capture semantic similarity information among data points. Existing pairwise or tripletwise loss functions used in DML are known to suffer from slow convergence due to a large proportion of trivial pairs or triplets as the model improves. To improve this, ranking-motivated structured losses have recently been proposed to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we present two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The samples are split into a positive and a negative set. Our objective is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intraclass data distribution might be dropped. In contrast, we propose to learn a hypersphere for each class in order to preserve the similarity structure inside it. Our extensive experiments show that the proposed method achieves state-of-the-art performance on three widely used benchmarks.
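
The set-based objective described above can be sketched with a pair of hinge terms. This is a simplified illustration under stated assumptions, not the paper's exact loss: positives are only penalised when they lie outside a hypersphere of radius alpha - margin around the query, negatives are penalised when they fall inside radius alpha, and each set is averaged with no example weighting. The parameter names and values are hypothetical.

```python
import math


def ranked_list_loss(query, gallery, labels, query_label, alpha=1.2, margin=0.4):
    """Hinge loss over the whole gallery, split into positive and negative sets."""
    pos_losses, neg_losses = [], []
    for point, label in zip(gallery, labels):
        d = math.dist(query, point)
        if label == query_label:
            # Positives need only be inside the class hypersphere, not at distance 0.
            pos_losses.append(max(0.0, d - (alpha - margin)))
        else:
            # Negatives are pushed beyond the larger radius alpha.
            neg_losses.append(max(0.0, alpha - d))
    pos_term = sum(pos_losses) / len(pos_losses) if pos_losses else 0.0
    neg_term = sum(neg_losses) / len(neg_losses) if neg_losses else 0.0
    return pos_term + neg_term


query = (0.0, 0.0)
gallery = [(0.5, 0.0), (1.0, 0.0), (0.9, 0.0), (2.0, 0.0)]
labels = ["cat", "cat", "dog", "dog"]

loss = ranked_list_loss(query, gallery, labels, query_label="cat")
print(f"loss = {loss:.3f}")
```

In this toy gallery the positive at distance 0.5 and the negative at distance 2.0 contribute nothing, so only the two examples that violate their radius drive the loss.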

Attribute Acquisition in Ontology based on Representation Learning of Hierarchical Classes and Attributes

Attribute acquisition for classes is a key step in ontology construction, which is often achieved by community members manually. This paper investigates an attention-based automatic paradigm called TransATT for attribute acquisition, by learning the representation of hierarchical classes and attributes in Chinese ontology. The attributes of an entity can be acquired by merely inspecting its classes, because the entity can be regarded as an instance of its classes and inherits their attributes. To describe the class of an entity explicitly and unambiguously, we propose the class-path to represent the hierarchical classes in the ontology, instead of the terminal class word of the hierarchy based on the hypernym-hyponym relation (i.e., is-a relation). The high performance of TransATT on attribute acquisition indicates the promising ability of the learned representation of class-paths and attributes. Moreover, we construct a dataset named BigCilin11k. To the best of our knowledge, this is the first Chinese dataset with abundant hierarchical classes and entities with attributes.

Should we Reload Time Series Classification Performance Evaluation? (a position paper)

Since the introduction and public availability of the UCR time series benchmark data sets, numerous Time Series Classification (TSC) methods have been designed, evaluated and compared to each other. We suggest a critical view of the TSC performance evaluation protocols put in place in recent TSC literature. The main goal of this ‘position’ paper is to stimulate discussion and reflection about performance evaluation in TSC literature.

Do we still need fuzzy classifiers for Small Data in the Era of Big Data?

The Era of Big Data has forced researchers to explore new distributed solutions for building fuzzy classifiers, which often introduce approximation errors or make strong assumptions to reduce computational and memory requirements. As a result, Big Data classifiers might be expected to be inferior to those designed for standard classification tasks (Small Data) in terms of accuracy and model complexity. To our knowledge, however, there is no empirical evidence to confirm such a conjecture yet. Here, we investigate the extent to which state-of-the-art fuzzy classifiers for Big Data sacrifice performance in favor of scalability. To this end, we carry out an empirical study that compares these classifiers with some of the best performing algorithms for Small Data. Assuming the latter were generally designed for maximizing performance without considering scalability issues, the results of this study provide some intuition around the tradeoff between performance and scalability achieved by current Big Data solutions. Our findings show that, although slightly inferior, Big Data classifiers are gradually catching up with state-of-the-art classifiers for Small Data, suggesting that a unified learning algorithm for Big and Small Data might be possible.

Large-Margin Multiple Kernel Learning for Discriminative Features Selection and Representation Learning

Multiple kernel learning (MKL) algorithms combine different base kernels to obtain a more efficient representation in the feature space. Focusing on discriminative tasks, MKL has been used successfully for feature selection and finding the significant modalities of the data. In such applications, each base kernel represents one dimension of the data or is derived from one specific descriptor. Therefore, MKL finds an optimal weighting scheme for the given kernels to increase the classification accuracy. Nevertheless, the majority of the works in this area focus on only binary classification problems or aim for linear separation of the classes in the kernel space, which are not realistic assumptions for many real-world problems. In this paper, we propose a novel multi-class MKL framework which improves the state-of-the-art by enhancing the local separation of the classes in the feature space. Besides, by using a sparsity term, our large-margin multiple kernel algorithm (LMMK) performs discriminative feature selection by aiming to employ a small subset of the base kernels. Based on our empirical evaluations on different real-world datasets, LMMK provides a competitive classification accuracy compared with the state-of-the-art algorithms in MKL. Additionally, it learns a sparse set of non-zero kernel weights which leads to a more interpretable feature selection and representation learning.
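
The role of sparse kernel weights in feature selection can be sketched with a weighted sum of base kernels, one per feature dimension. This illustrates the general MKL combination only, not the LMMK algorithm: the weights are fixed by hand rather than learned, and the one-dimensional RBF base kernel and all values are hypothetical.

```python
import math


def rbf_1d(a, b, gamma=1.0):
    """RBF base kernel on a single feature dimension."""
    return math.exp(-gamma * (a - b) ** 2)


def combined_kernel(x, y, weights):
    """Weighted sum of one base kernel per feature dimension."""
    return sum(
        w * rbf_1d(xi, yi)
        for w, xi, yi in zip(weights, x, y)
        if w != 0.0  # zero-weight kernels are never evaluated
    )


# Hypothetical sparse weights: the kernel on dimension 1 is switched off.
weights = [0.75, 0.0, 0.25]
selected = [i for i, w in enumerate(weights) if w > 0.0]

x = [1.0, 5.0, 2.0]
y = [1.0, -3.0, 2.0]  # differs from x only on the unselected dimension

k = combined_kernel(x, y, weights)
print(f"selected dimensions: {selected}, k(x, y) = {k}")
```

Because the only dimension on which x and y differ has zero weight, the combined kernel treats them as identical, which is the sense in which sparse weights act as feature selection.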

A Quantum Observation Scheme Can Universally Identify Causalities from Correlations

It has long been recognized as a difficult problem to determine whether the observed statistical correlation between two classical variables arises from causality or from common causes. Recent research has shown that in a quantum theoretical framework, the mechanisms of entanglement and quantum coherence provide an advantage in tackling this problem. In some particular cases, quantum common causes and quantum causality can be effectively distinguished using observations only. However, these solutions do not apply to all cases. There still exist a great many cases in which quantum common causes and quantum causality cannot be distinguished. In this paper, along the line of considering unitary transformation as causality in the quantum world, we formally show that quantum common causes and quantum causality are universally separable. Based on the analysis, we further provide a general method to discriminate the two.

Is Deeper Better only when Shallow is Good?

Understanding the power of depth in feed-forward neural networks is an ongoing challenge in the field of deep learning theory. While current works account for the importance of depth for the expressive power of neural networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process. In this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. We give a depth separation argument for distributions with fractal structure, showing that they can be expressed efficiently by deep networks, but not by shallow ones. These distributions have a natural coarse-to-fine structure, and we show that the balance between the coarse and fine details has a crucial effect on whether the optimization process is likely to succeed. We prove that when the distribution is concentrated on the fine details, gradient-based algorithms are likely to fail. Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general.

A Three-Player GAN: Generating Hard Samples To Improve Classification Networks

We propose a Three-Player Generative Adversarial Network to improve classification networks. In addition to the game played between the discriminator and generator, a competition is introduced between the generator and the classifier. The generator’s objective is to synthesize samples that are both realistic and hard to label for the classifier. Even though we make no assumptions on the type of augmentations to learn, we find that the model is able to synthesize realistic-looking examples that are hard for the classification model. Furthermore, the classifier becomes more robust when trained on these difficult samples. The method is evaluated on a public dataset for traffic sign recognition.