**Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses**

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis assessment techniques, especially those not commonly used in the field (such as evaluations based on Bayesian inference). Since these statistical techniques differ in the hypotheses they can support, we argue that practitioners should first decide their target hypothesis before choosing an assessment method. This is crucial because common fallacies, misconceptions, and misinterpretations surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim and what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research, as well as an easy-to-use package called ‘HyBayes’ for Bayesian assessment of hypotheses, complementing existing tools.
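As a rough illustration of what a Bayesian assessment can claim that a p-value cannot, the sketch below estimates the posterior probability that system A has a higher true accuracy than system B. It does not use the HyBayes package or its API. The function name `prob_a_better`, the independent Beta(1, 1) priors, the unpaired treatment of the two systems, and the example counts are all assumptions made for illustration.

```python
import random

def prob_a_better(correct_a, correct_b, n, draws=20000, seed=0):
    """Monte Carlo estimate of P(theta_A > theta_B | data).

    Assumes each system's accuracy has an independent Beta(1, 1) prior and
    that both systems were evaluated on n test items (hypothetical setup).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a Bernoulli rate under a Beta(1, 1) prior is
        # Beta(1 + successes, 1 + failures).
        theta_a = rng.betavariate(1 + correct_a, 1 + n - correct_a)
        theta_b = rng.betavariate(1 + correct_b, 1 + n - correct_b)
        if theta_a > theta_b:
            wins += 1
    return wins / draws

# Hypothetical counts: A answers 870 of 1000 items correctly, B answers 850.
print(prob_a_better(correct_a=870, correct_b=850, n=1000))
```

The returned value is a direct statement about the hypothesis "A is better than B" given the data and the stated priors, which is a different claim from "the observed difference would be unlikely if the two systems were equal".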

**Constructing a Data Visualization Recommender System**

Choosing a suitable visualization for data is a difficult task. Current data visualization recommender systems exist to aid in choosing a visualization, yet suffer from issues such as low accessibility and indecisiveness. In this study, we first define a step-by-step guide on how to build a data visualization recommender system. We then use this guide to create a model for a data visualization recommender system for non-experts that aims to resolve the issues of current solutions. The result is a question-based model that uses a decision tree and a data visualization classification hierarchy in order to recommend a visualization. Furthermore, it incorporates both task-driven and data characteristics-driven perspectives, whereas existing solutions seem to either conflate these or focus on one of the two exclusively. Testing against existing solutions shows that the new model reaches similar results while being simpler, clearer, more versatile, extendable and transparent. The presented guide can be used as a manual for anyone building a data visualization recommender system. The resulting model can be applied in the development of new data visualization software or as part of a learning tool.

**Rethinking Self-Attention: An Interpretable Self-Attentive Encoder-Decoder Parser**

Attention mechanisms have improved the performance of NLP tasks while providing an appearance of model interpretability. Self-attention is currently widely used in NLP models; however, it is difficult to interpret due to the numerous attention distributions. We hypothesize that model representations can benefit from label-specific information, while facilitating interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention where attention heads represent labels. We validate our hypothesis by running experiments in constituency and dependency parsing and show that our new model obtains new state-of-the-art results for both tasks on the English Penn Treebank. Our neural parser obtains 96.34 F1 score for constituency parsing, and 97.33 UAS and 96.29 LAS for dependency parsing. Additionally, our model requires fewer layers, and therefore fewer parameters, compared to existing work.

**A Re-evaluation of Knowledge Graph Completion Methods**

Knowledge Graph Completion (KGC) aims at automatically predicting missing links for large-scale knowledge graphs. A vast number of state-of-the-art KGC techniques have been published in top conferences in several research fields including data mining, machine learning, and natural language processing. However, we notice that several recent papers report very high performance, largely outperforming previous state-of-the-art methods. In this paper, we find that this can be attributed to the inappropriate evaluation protocol they use, and we propose a simple evaluation protocol to address this problem. The proposed protocol is robust to bias in the model, which can substantially affect the final results. We conduct extensive experiments and report the performance of several existing methods using our protocol.

**Improving Node Classification by Co-training Node Pair Classification: A Novel Training Framework for General Graph Neural Networks**

Semi-supervised learning is a widely used training framework for graph node classification. However, this learning method has two problems: (1) the original graph topology may not be perfectly aligned with the node classification task; (2) the supervision information in the training set has not been fully used. To tackle these two problems, we design a new task, node pair classification, to assist in training GNN models for the target node classification task. We further propose a novel training framework named Adaptive Co-training, which jointly trains the node classification and the node pair classification after the optimization of graph topology. Extensive experimental results on four representative GNN models demonstrate that our proposed training framework significantly outperforms baseline methods across three benchmark graph datasets.

**Improving BERT Fine-tuning with Embedding Normalization**

Large pre-trained sentence encoders like BERT have started a new chapter in natural language processing. A common practice for applying pre-trained BERT to sequence classification tasks (e.g., classification of sentences or sentence pairs) is to feed the embedding of the [CLS] token (in the last layer) to a task-specific classification layer, and then fine-tune the model parameters of BERT and the classifier jointly. In this paper, we conduct a systematic analysis over several sequence classification datasets to examine the embedding values of the [CLS] token before the fine-tuning phase, and present the biased embedding distribution issue, i.e., embedding values of [CLS] concentrate on a few dimensions and are not zero-centered. Such biased embeddings pose a challenge to the optimization process during fine-tuning, as gradients of the [CLS] embedding may explode and result in degraded model performance. We further propose several simple yet effective normalization methods to modify the [CLS] embedding during fine-tuning. Compared with the previous practice, a neural classification model with the normalized embedding shows improvements on several text classification tasks, demonstrating the effectiveness of our method.
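The abstract does not say which normalization methods are proposed, so the following is only a generic sketch of one plausible option: standardizing a single embedding vector to zero mean and unit variance across its dimensions. The function name `standardize`, the `eps` value, and the toy vector are assumptions, not the authors' method.

```python
import math

def standardize(embedding, eps=1e-12):
    """Shift a vector to zero mean and scale it to unit variance.

    Generic standardization, shown as an assumed example of how a
    non-zero-centered embedding could be normalized.
    """
    mean = sum(embedding) / len(embedding)
    variance = sum((x - mean) ** 2 for x in embedding) / len(embedding)
    std = math.sqrt(variance + eps)  # eps guards against division by zero
    return [(x - mean) / std for x in embedding]

# Toy "[CLS]" vector whose mass sits on one dimension and is not zero-centered.
toy_cls = [9.0, 0.5, 0.4, 0.6]
print(standardize(toy_cls))
```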

**Interpretable Multiple-Kernel Prototype Learning for Discriminative Representation and Feature Selection**

Prototype-based methods are of particular interest for domain specialists and practitioners as they summarize a dataset by a small set of representatives. Therefore, in a classification setting, interpretability of the prototypes is as significant as the prediction accuracy of the algorithm. Nevertheless, state-of-the-art methods make inefficient trade-offs between these concerns by sacrificing one in favor of the other, especially if the given data has a kernel-based representation. In this paper, we propose a novel interpretable multiple-kernel prototype learning (IMKPL) method to construct highly interpretable prototypes in the feature space, which are also efficient for the discriminative representation of the data. Our method focuses on the local discrimination of the classes in the feature space and on shaping the prototypes based on condensed class-homogeneous neighborhoods of data. In addition, IMKPL learns a combined embedding in the feature space in which the above objectives are better fulfilled. When the base kernels coincide with the data dimensions, this embedding results in a discriminative feature selection. We evaluate IMKPL on several benchmarks from different domains, which demonstrates its superiority over related state-of-the-art methods regarding both interpretability and discriminative representation.

**TSK-Streams: Learning TSK Fuzzy Systems on Data Streams**

The problem of adaptive learning from evolving and possibly non-stationary data streams has attracted a lot of interest in machine learning in the recent past, and also stimulated research in related fields, such as computational intelligence and fuzzy systems. In particular, several rule-based methods for the incremental induction of regression models have been proposed. In this paper, we develop a method that combines the strengths of two existing approaches rooted in different learning paradigms. More concretely, our method adopts basic principles of the state-of-the-art learning algorithm AMRules and enriches them by the representational advantages of fuzzy rules. In a comprehensive experimental study, TSK-Streams is shown to be highly competitive in terms of performance.

**EarthquakeGen: Earthquake Simulation Using Generative Adversarial Networks**

Detecting earthquake events from seismic time series has proved to be a challenging task. Manual detection can be expensive and tedious due to the intensive labor involved and the large scale of the data sets. In recent years, automatic detection methods based on machine learning have been developed to improve accuracy and efficiency. However, the accuracy of those methods relies on a sufficient amount of high-quality training data, which itself can be expensive to obtain due to the requirement of domain knowledge and subject matter expertise. This paper aims to resolve this dilemma by answering two questions: (1) provided with a limited number of reliable labels, can we use them to generate more synthetic labels? (2) Can we use those synthetic labels to improve the detectability? Among existing generative models, the generative adversarial network (GAN) has shown a strong capability for generating high-quality synthetic samples in multiple domains. We designed our model based on GAN. In particular, we studied several different network structures. Comparing the generated results, we find that our GAN-based generative model yields the highest quality. We further combine the dataset with synthetic samples generated by our generative model and show that the detectability of our earthquake classification model is significantly improved compared with a model trained without augmenting the training set.

**Multimodal Intelligence: Representation Learning, Information Fusion, and Applications**

Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, each involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, a technical review of the models and learning methods for multimodal intelligence is provided. The main focus is the combination of vision and natural language, which has become an important area in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles: learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies the multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embedding constructed and learned for general downstream tasks. On multimodal fusion, this review focuses on special architectures for the integration of the representation of unimodal signals for a particular task. On applications, selected areas of broad interest in current literature are covered, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.

**Grinding the Space: Learning to Classify Against Strategic Agents**

We study the problem of online learning in strategic classification settings from the perspective of the learner, who is repeatedly facing myopically rational strategic agents. We model this interplay as a repeated Stackelberg game, where at each timestep the learner deploys a high-dimensional linear classifier first and an agent, after observing the classifier, along with his real feature vector, and according to his underlying utility function, best-responds with a (potentially altered) feature vector. We measure the performance of the learner in terms of Stackelberg regret for her 0-1 loss function. Surprisingly, we prove that in strategic settings like the one considered in this paper there exist worst-case scenarios, where any sequence of actions providing sublinear external regret might result in linear Stackelberg regret and vice versa. We then provide the Grinder Algorithm, an adaptive discretization algorithm, potentially of independent interest in the online learning community, and prove its data-dependent upper bound on the Stackelberg regret given oracle access, while being computationally efficient. We also provide a nearly matching lower bound for the problem of strategic classification. We complement our theoretical analysis with simulation results, which suggest that our algorithm outperforms the benchmarks, even given access to approximation oracles. Our results advance the known state-of-the-art results in the growing literature of online learning from revealed preferences, which has so far focused on smoother utility and loss functions from the perspective of the agents and the learner respectively.

**Feedback Recurrent AutoEncoder**

In this work, we propose a new recurrent autoencoder architecture, termed Feedback Recurrent AutoEncoder (FRAE), for online compression of sequential data with temporal dependency. The recurrent structure of FRAE is designed to efficiently extract the redundancy along the time dimension and allows a compact discrete representation of the data to be learned. We demonstrate its effectiveness in speech spectrogram compression. Specifically, we show that the FRAE, paired with a powerful neural vocoder, can produce high-quality speech waveforms at a low, fixed bitrate. We further show that by adding a learned prior for the latent space and using an entropy coder, we can achieve an even lower variable bitrate.

**DRiLLS: Deep Reinforcement Learning for Logic Synthesis**

Logic synthesis requires extensive tuning of the synthesis optimization flow, where the quality of results (QoR) depends on the sequence of optimizations used. Efficient design space exploration is challenging due to the exponential number of possible optimization permutations. Therefore, automating the optimization process is necessary. In this work, we propose a novel reinforcement learning-based methodology that navigates the optimization space without human intervention. We demonstrate the training of an Advantage Actor Critic (A2C) agent that seeks to minimize area subject to a timing constraint. Using the proposed methodology, designs can be optimized autonomously with no humans in the loop. Evaluation on the comprehensive EPFL benchmark suite shows that the agent outperforms existing exploration methodologies and improves QoRs by an average of 13%.

**MAME: Model-Agnostic Meta-Exploration**

Meta-reinforcement learning approaches aim to develop learning procedures that can adapt quickly to a distribution of tasks with the help of a few examples. Developing efficient exploration strategies capable of finding the most useful samples becomes critical in such settings. Existing approaches towards finding efficient exploration strategies add auxiliary objectives to promote exploration by the pre-update policy. However, this makes adaptation using a few gradient steps difficult, as the pre-update (exploration) and post-update (exploitation) policies are often quite different. Instead, we propose to explicitly model a separate exploration policy for the task distribution. Having two different policies gives more flexibility in training the exploration policy and also makes adaptation to any specific task easier. We show that using self-supervised or supervised learning objectives for adaptation allows for more efficient inner-loop updates, and we also demonstrate the superior performance of our model compared to prior works in this domain.

**Hierarchically Robust Representation Learning**

With the tremendous success of deep learning in visual tasks, the representations extracted from intermediate layers of learned models, that is, deep features, have attracted much attention from researchers. Previous analysis shows that those features include appropriate semantic information. By training deep models on a large-scale benchmark data set (e.g., ImageNet), the features can work well on other tasks. In this work, we investigate this phenomenon and demonstrate that deep features can fail because they are learned by minimizing empirical risk. When the distribution of data is different from that of the benchmark data set, the performance of deep features can degrade. Hence, we propose a hierarchically robust optimization to learn more generic features. Considering the example-level and concept-level robustness simultaneously, we formulate the problem as a distributionally robust optimization problem with Wasserstein ambiguity set constraints. An efficient algorithm compatible with the conventional training pipeline is proposed. Experiments on benchmark data sets confirm our claim and demonstrate the effectiveness of the robust deep representations.

**Decompressing Knowledge Graph Representations for Link Prediction**

This paper studies the problem of predicting missing relationships between entities in knowledge graphs through learning their representations. Currently, the majority of existing link prediction models employ simple but intuitive scoring functions and a relatively small embedding size so that they can be applied to large-scale knowledge graphs. However, these properties also restrict the ability to learn more expressive and robust features. Therefore, diverging from most prior works, which focus on designing new objective functions, we propose DeCom, a simple but effective mechanism to boost the performance of existing link predictors such as DistMult, ComplEx, etc., by extracting more expressive features while preventing overfitting and adding just a few extra parameters. Specifically, embeddings of entities and relationships are first decompressed to a more expressive and robust space by decompressing functions, then knowledge graph embedding models are trained in this new feature space. Experimental results on several benchmark knowledge graphs and advanced link prediction systems demonstrate the generalization and effectiveness of our method. In particular, RESCAL + DeCom achieves state-of-the-art performance on the FB15k-237 benchmark across all evaluation metrics. In addition, we show that, compared with DeCom, explicitly increasing the embedding size significantly increases the number of parameters but does not achieve a promising performance improvement.

**Accurate Uncertainty Estimation and Decomposition in Ensemble Learning**

Ensemble learning is a standard approach to building machine learning systems that capture complex phenomena in real-world data. An important aspect of these systems is the complete and valid quantification of model uncertainty. We introduce a Bayesian nonparametric ensemble (BNE) approach that augments an existing ensemble model to account for different sources of model uncertainty. BNE augments a model’s prediction and distribution functions using Bayesian nonparametric machinery. It has a theoretical guarantee in that it robustly estimates the uncertainty patterns in the data distribution, and can decompose its overall predictive uncertainty into distinct components that are due to different sources of noise and error. We show that our method achieves accurate uncertainty estimates under complex observational noise, and illustrate its real-world utility in terms of uncertainty decomposition and model bias detection for an ensemble predicting air pollution exposures in Eastern Massachusetts, USA.

**LMLFM: Longitudinal Multi-Level Factorization Machines**

Selecting important variables and learning predictive models from high-dimensional longitudinal data is challenging due to the need to account for complex data correlation and expensive computation. In this work, we propose an extension of factorization machines, LMLFM, to deal with such longitudinal data. LMLFM is efficient, sparse, provably convergent and explainable. Specifically, LMLFM is the first multi-level model that can simultaneously select *fixed effects* and *random effects* while accounting for complex correlations in the data and non-linear interactions among variables. Experimental results with both simulated and real-world longitudinal data show that LMLFM outperforms the state-of-the-art longitudinal methods in terms of prediction accuracy, with significantly fewer false positives, while using substantially fewer computational resources.

**DialogAct2Vec: Towards End-to-End Dialogue Agent by Multi-Task Representation Learning**

In end-to-end dialogue modeling and agent learning, it is important to (1) effectively learn knowledge from data, and (2) fully utilize heterogeneous information, e.g., dialogue act flow and utterances. However, the majority of existing methods cannot simultaneously satisfy the two conditions. For example, rule definition and data labeling during system design take too much manual work, and sequence-to-sequence methods only model one-side utterance information. In this paper, we propose DialogAct2Vec, a novel joint end-to-end model based on multi-task representation learning, which can capture knowledge from heterogeneous information by automatically learning knowledgeable low-dimensional embeddings from data. The model requires little manual intervention in system design, and we find that multi-task learning can greatly improve the effectiveness of representation learning. Extensive experiments on a public dataset for restaurant reservation show that the proposed method leads to significant improvements over state-of-the-art baselines on both the act prediction task and the utterance prediction task.

**Context-aware Active Multi-Step Reinforcement Learning**

Reinforcement learning has attracted great attention recently, especially policy gradient algorithms, which have been demonstrated on challenging decision making and control tasks. In this paper, we propose an active multi-step TD algorithm with adaptive stepsizes to learn the actor and critic. Specifically, our model consists of two components: active stepsize learning and an adaptive multi-step TD algorithm. First, we divide the time horizon into chunks and actively select states and actions inside each chunk. Then, given the selected samples, we propose the adaptive multi-step TD, which generalizes TD($\lambda$) but adaptively switches on/off the backups from future returns of different steps. In particular, the adaptive multi-step TD introduces a context-aware mechanism, here a binary classifier, which decides whether or not to turn on its future backups based on the context changes. Thus, our model is a kind of combination of active learning and a multi-step TD algorithm, which has the capacity for learning off-policy without the need for importance sampling. We evaluate our approach on both discrete and continuous space tasks in an off-policy setting, and demonstrate competitive results compared to other reinforcement learning baselines.
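For readers unfamiliar with multi-step TD, the sketch below computes the standard n-step TD target that such methods build on: the discounted sum of the next n rewards plus a discounted bootstrap from the value estimate at the state reached after n steps. It does not reproduce the paper's adaptive on/off switching or its context classifier. The function name `n_step_target` and the example numbers are assumptions.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Standard n-step TD target.

    rewards:         the n rewards observed after the current state, in order
    bootstrap_value: value estimate of the state reached after those n steps
    gamma:           discount factor
    """
    target = bootstrap_value
    # Fold from the last reward backwards: G = r + gamma * (rest of return).
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target

# Three rewards of 1.0, bootstrap value 10.0, discount 0.5:
# 1 + 0.5 * (1 + 0.5 * (1 + 0.5 * 10)) = 3.0
print(n_step_target([1.0, 1.0, 1.0], 10.0, 0.5))  # prints 3.0
```

Choosing how many rewards to include before bootstrapping is the "step" decision; a mechanism that switches individual backups on or off can be read as choosing that horizon adaptively.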

**An empirical study of the relation between network architecture and complexity**

In this preregistration submission, we propose an empirical study of how networks handle changes in complexity of the data. We investigate the effect of network capacity on generalization performance in the face of increasing data complexity. For this, we measure the generalization error for an image classification task where the number of classes steadily increases. We compare a number of modern architectures at different scales in this setting. The methodology, setup, and hypotheses described in this proposal were evaluated by peer review before experiments were conducted.

**Higher-order Weighted Graph Convolutional Networks**

Graph Convolution Network (GCN) has been recognized as one of the most effective graph models for semi-supervised learning, but it extracts merely the first-order or few-order neighborhood information through information propagation, and suffers a performance drop-off for deeper structures. Existing approaches that deal with higher-order neighbors tend to take advantage of adjacency matrix powers. In this paper, we assume a seemingly trivial condition: that the higher-order neighborhood information may be similar to that of the first-order neighbors. Accordingly, we present an unsupervised approach to describe such similarities and learn the weight matrices of higher-order neighbors automatically through Lasso, which minimizes the feature loss between the first-order and higher-order neighbors. Based on this, we formulate a new convolutional filter for GCN to learn better node representations. Our model, called higher-order weighted GCN (HWGCN), achieves state-of-the-art results on a number of node classification tasks over the Cora, Citeseer and Pubmed datasets.
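The adjacency matrix powers mentioned above can be illustrated with a small sketch: entry (i, j) of the k-th power of the adjacency matrix counts the walks of length k from node i to node j, so its non-zero entries identify nodes reachable in k steps. This shows only that baseline idea, not the Lasso-learned weights of HWGCN. The helper names and the three-node path graph are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matrix_power(a, k):
    """Return the k-th power of a square matrix (k >= 0)."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        result = matmul(result, a)
    return result

def k_step_neighbors(adjacency, node, k):
    """Other nodes reachable from `node` by a walk of exactly k edges."""
    row = matrix_power(adjacency, k)[node]
    return {j for j, walks in enumerate(row) if walks > 0 and j != node}

# Path graph 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(matrix_power(path, 2))         # prints [[1, 0, 1], [0, 2, 0], [1, 0, 1]]
print(k_step_neighbors(path, 0, 2))  # prints {2}
```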

**Time2Graph: Revisiting Time Series Modeling with Dynamic Shapelets**

Time series modeling has attracted extensive research efforts; however, achieving both reliable efficiency and interpretability from a unified model still remains a challenging problem. Among the literature, shapelets offer interpretable and explanatory insights in the classification tasks, while most existing works ignore the differing representative power at different time slices, as well as (more importantly) the evolution pattern of shapelets. In this paper, we propose to extract time-aware shapelets by designing a two-level timing factor. Moreover, we define and construct the shapelet evolution graph, which captures how shapelets evolve over time and can be incorporated into the time series embeddings by graph embedding algorithms. To validate whether the representations obtained in this way can be applied effectively in various scenarios, we conduct experiments based on three public time series datasets, and two real-world datasets from different domains. Experimental results clearly show the improvements achieved by our approach compared with 17 state-of-the-art baselines.

**Meta Answering for Machine Reading**

We investigate a framework for machine reading, inspired by real world information-seeking problems, where a meta question answering system interacts with a black box environment. The environment encapsulates a competitive machine reader based on BERT, providing candidate answers to questions, and possibly some context. To validate the realism of our formulation, we ask humans to play the role of a meta-answerer. With just a small snippet of text around an answer, humans can outperform the machine reader, improving recall. Similarly, a simple machine meta-answerer outperforms the environment, improving both precision and recall on the Natural Questions dataset. The system relies on joint training of answer scoring and the selection of conditioning information.

**Practical Federated Gradient Boosting Decision Trees**

Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years, with many awards in machine learning and data mining competitions. There have been several recent studies on how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use. They suffer either from the inefficiency due to the usage of costly data transformations such as secure sharing and homomorphic encryption, or from the low model accuracy due to differential privacy designs. In this paper, we study a practical federated environment with relaxed privacy constraints. In this environment, a dishonest party might obtain some information about the other parties’ data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure without exposing the original record to other parties, while the computation overhead in the training process is kept low. Our experimental studies show that, compared with normal training with the local data of each owner, our approach can significantly improve the predictive accuracy, and achieve comparable accuracy to the original GBDT with the data from all parties.

**(When) Is Truth-telling Favored in AI Debate?**

For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et al. (2018) propose that in such cases, we may use a debate between two AI systems to amplify the problem-solving capabilities of a human judge. We introduce a mathematical framework that can model debates of this type and propose that the quality of debate designs should be measured by the accuracy of the most persuasive answer. We describe a simple instance of the debate framework called feature debate and analyze the degree to which such debates track the truth. We argue that despite being very simple, feature debates nonetheless capture many aspects of practical debates such as the incentives to confuse the judge or stall to prevent losing. We then outline how these models should be generalized to analyze a wider range of debate phenomena.

**Rethinking Generalisation**

In this paper, we present a new approach to computing the generalisation performance assuming that the distribution of risks for a learning scenario is known. This allows us to compute the expected error of a learning machine using empirical risk minimisation. We show that it is possible to obtain results for both classification and regression. We show that a critical quantity in determining the generalisation performance is the power-law behaviour of the distribution of risks around its minimum value. We compute this distribution for the case of all Boolean functions and for the perceptron. We start with a simplistic analysis but then do a more formal one later on. We show that the simplistic results are qualitatively correct and provide a good approximation to the actual results if we replace the true training set size with an approximate training set size.

**Kernel Dependence Regularizers and Gaussian Processes with Applications to Algorithmic Fairness**

Current adoption of machine learning in industrial, societal and economic activities has raised concerns about the fairness, equity and ethics of automated decisions. Predictive models are often developed using biased datasets and thus retain or even exacerbate biases in their decisions and recommendations. Removing the sensitive covariates, such as gender or race, is insufficient to remedy this issue since the biases may be retained due to other related covariates. We present a regularization approach to this problem that trades off predictive accuracy of the learned models (with respect to biased labels) for fairness in terms of statistical parity, i.e. independence of the decisions from the sensitive covariates. In particular, we consider a general framework of regularized empirical risk minimization over reproducing kernel Hilbert spaces and impose an additional regularizer of dependence between predictors and sensitive covariates using kernel-based measures of dependence, namely the Hilbert-Schmidt Independence Criterion (HSIC) and its normalized version. This approach leads to a closed-form solution in the case of squared loss, i.e. ridge regression. Moreover, we show that the dependence regularizer has an interpretation as modifying the corresponding Gaussian process (GP) prior. As a consequence, a GP model with a prior that encourages fairness with respect to sensitive variables can be derived, allowing principled hyperparameter selection and the study of the relative relevance of covariates under fairness constraints. Experimental results on synthetic examples and on real problems of income and crime prediction illustrate the potential of the approach to improve the fairness of automated decisions.
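To make the dependence measure concrete, here is a minimal sketch of the commonly used biased empirical HSIC estimator, trace(K H L H) / (n - 1)^2, where K and L are Gram matrices of the two variables and H is the centering matrix. It is not the regularized learning procedure described above. The Gaussian kernel, the bandwidth of 1.0, the restriction to scalar inputs, and all function names are assumptions for illustration.

```python
import math

def gaussian_gram(values, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel over scalar inputs (assumed kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
            for a in values]

def double_center(gram):
    """Return H K H for a symmetric Gram matrix K, with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    grand_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def hsic(xs, ys, bandwidth=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    k_centered = double_center(gaussian_gram(xs, bandwidth))
    l_gram = gaussian_gram(ys, bandwidth)
    # trace(K H L H) equals the elementwise sum of (H K H) * L for symmetric L.
    total = sum(k_centered[i][j] * l_gram[i][j]
                for i in range(n) for j in range(n))
    return total / (n - 1) ** 2

predictions = [0.0, 1.0, 2.0, 3.0]
print(hsic(predictions, predictions))           # strongly dependent: positive
print(hsic(predictions, [5.0, 5.0, 5.0, 5.0]))  # constant covariate: about 0
```

Used as a penalty, a value near zero for the pair (predictions, sensitive covariate) corresponds to the statistical-parity goal of decisions that do not depend on the sensitive covariate.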

**Markov chains in random environment with applications in queueing theory and machine learning**

We prove the existence of limiting distributions for a large class of Markov chains on a general state space in a random environment. We assume suitable versions of the standard drift and minorization conditions. In particular, the system dynamics should be contractive on the average with respect to the Lyapunov function and large enough small sets should exist with large enough minorization constants. We also establish that a law of large numbers holds for bounded functionals of the process. Applications to queuing systems and to machine learning algorithms are presented.

**A Biologically Plausible Benchmark for Contextual Bandit Algorithms in Precision Oncology Using in vitro Data**

Precision oncology, the genetic sequencing of tumors to identify druggable targets, has emerged as the standard of care in the treatment of many cancers. Nonetheless, due to the pace of therapy development and variability in patient information, designing effective protocols for individual treatment assignment in a sample-efficient way remains a major challenge. One promising approach to this problem is to frame precision oncology treatment as a contextual bandit problem and to apply sequential decision-making algorithms designed to minimize regret in this setting. However, a clear prerequisite for considering this methodology in high-stakes clinical decisions is careful benchmarking to understand realistic costs and benefits. Here, we propose a benchmark dataset to evaluate contextual bandit algorithms based on real in vitro drug response of approximately 900 cancer cell lines. Specifically, we curated a dataset of complete treatment responses for a subset of 7 treatments from prior in vitro studies. This allows us to compute the regret of proposed decision policies using biologically plausible counterfactuals. We ran a suite of Bayesian bandit algorithms on our benchmark, and found that the methods accumulate less regret over a sequence of treatment assignment tasks than a rule-based baseline derived from current clinical practice. This effect was more pronounced when genomic information was included as context. We expect this work to be a starting point for evaluation of both the unique structural requirements and ethical implications for real-world testing of bandit-based clinical decision support.
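
The role of complete treatment responses can be illustrated with a short sketch: when the outcome of every arm is known for every unit, the regret of any policy can be computed exactly. The code below is a simplified, non-contextual illustration on synthetic binary responses, not the benchmark or the algorithms from the paper; the Beta-Bernoulli Thompson sampler, the toy success probabilities and all names are assumptions.

```python
import random


def cumulative_regret(responses, choose, update=None):
    """Sum of (best response in the row - response of the chosen arm).

    responses[t][a] is the fully observed response of unit t to arm a,
    so the counterfactual outcome of every arm is available.
    """
    regret = 0.0
    for t, row in enumerate(responses):
        arm = choose(t)
        reward = row[arm]
        regret += max(row) - reward
        if update is not None:
            update(arm, reward)  # the policy only sees the chosen arm's reward
    return regret


class BetaThompson:
    """Non-contextual Thompson sampling with Beta(1, 1) priors on binary rewards."""

    def __init__(self, n_arms, rng):
        self.successes = [1.0] * n_arms
        self.failures = [1.0] * n_arms
        self.rng = rng

    def choose(self, t):
        # Sample one plausible success rate per arm and play the best sample.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0


def make_toy_responses(n_units, success_probs, rng):
    """Synthetic stand-in for a complete response table (not real data)."""
    return [
        [1 if rng.random() < p else 0 for p in success_probs]
        for _ in range(n_units)
    ]


rng = random.Random(0)
toy = make_toy_responses(200, [0.2, 0.5, 0.8], rng)
agent = BetaThompson(3, rng)
ts_regret = cumulative_regret(toy, agent.choose, agent.update)
fixed_regret = cumulative_regret(toy, lambda t: 0)  # fixed rule: always arm 0
```

A contextual variant would pass per-unit features (for example, genomic covariates) to `choose` and `update`; that extension is omitted here to keep the sketch short.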

**Simplifying Random Forests: On the Trade-off between Interpretability and Accuracy**

We analyze the trade-off between model complexity and accuracy for random forests by breaking the trees up into individual classification rules and selecting a subset of them. We show experimentally that just a few rules suffice to achieve an accuracy close to that of the original model. Moreover, our results indicate that in many cases this can lead to simpler models that clearly outperform the original ones.

**Driving Reinforcement Learning with Models**

Over the years, Reinforcement Learning (RL) has established itself as a convenient paradigm for learning optimal policies from data. However, most RL algorithms reach optimal policies by exploring all possible actions, which in real-world scenarios is often infeasible or impractical due to, e.g., safety constraints. Motivated by this, we propose to augment RL with Model Predictive Control (MPC), a popular model-based control algorithm that optimally controls a system while satisfying a set of constraints. The result is the MPC-augmented RL algorithm (MPCaRL), which uses MPC both to drive how RL explores the actions and to modify the corresponding rewards. We demonstrate the effectiveness of MPCaRL by letting it play the Atari game Pong. The results highlight the ability of the algorithm to learn general tasks with essentially no training.

**GraphDefense: Towards Robust Graph Convolutional Networks**

In this paper, we study the robustness of graph convolutional networks (GCNs). Despite the good performance of GCNs on graph semi-supervised learning tasks, previous works have shown that the original GCNs are very unstable under adversarial perturbations. In particular, slightly changing the graph adjacency matrix or the features of a few nodes can cause severe performance degradation, making GCNs unsuitable for security-critical applications. Inspired by previous work on adversarial defense for deep neural networks, especially adversarial training algorithms, we propose a method called GraphDefense to defend against adversarial perturbations. In addition, our defense method preserves the semi-supervised learning setting and does not require a large label rate. We also show that adversarial training in features is equivalent to adversarial training for edges with a small perturbation. Our experiments show that the proposed defense methods successfully increase the robustness of graph convolutional networks. Furthermore, we show that with careful design, our proposed algorithm can scale to large graphs, such as the Reddit dataset.

**Real-Time Reinforcement Learning**

Markov Decision Processes (MDPs), the mathematical framework underlying most algorithms in Reinforcement Learning (RL), are often used in a way that wrongfully assumes that the state of an agent’s environment does not change during action selection. As RL systems based on MDPs begin to find application in real-world safety critical situations, this mismatch between the assumptions underlying classical MDPs and the reality of real-time computation may lead to undesirable outcomes. In this paper, we introduce a new framework, in which states and actions evolve simultaneously and show how it is related to the classical MDP formulation. We analyze existing algorithms under the new real-time formulation and show why they are suboptimal when used in real-time. We then use those insights to create a new algorithm Real-Time Actor-Critic (RTAC) that outperforms the existing state-of-the-art continuous control algorithm Soft Actor-Critic both in real-time and non-real-time settings. Code and videos can be found at https://…/rtrl.
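
One way to picture states and actions evolving simultaneously is an environment wrapper in which the action chosen now only takes effect on the next step, while the environment keeps moving under the previously chosen action. The sketch below is a hedged illustration of that reading, not the paper's formal construction or code: the one-step delay, the augmented observation `(state, pending action)`, the toy environment and all class names are assumptions.

```python
class ToyChain:
    """Hypothetical 1-D environment: integer position, actions in {-1, 0, +1}."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        reward = -abs(self.pos - 3)  # closer to position 3 is better
        return self.pos, reward


class RealTimeWrapper:
    """Applies each action one step late and exposes the action in flight."""

    def __init__(self, env, default_action=0):
        self.env = env
        self.default_action = default_action
        self.pending = default_action

    def reset(self):
        self.pending = self.default_action
        state = self.env.reset()
        # The agent observes both the state and the action already committed.
        return (state, self.pending)

    def step(self, action):
        # The environment advances under the previously selected action
        # while the newly selected one is still "being computed".
        state, reward = self.env.step(self.pending)
        self.pending = action
        return (state, self.pending), reward


env = RealTimeWrapper(ToyChain())
first_obs = env.reset()
second_obs, second_reward = env.step(1)  # the move right is not applied yet
third_obs, third_reward = env.step(1)    # now the first move right takes effect
```

Under this reading, a policy that conditions only on the raw state ignores the action that is already committed, which is one intuition for why algorithms designed for the classical formulation can be suboptimal in the real-time setting.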

**Structural Pruning in Deep Neural Networks: A Small-World Approach**

Deep Neural Networks (DNNs) are usually over-parameterized, causing excessive memory and interconnection cost on the hardware platform. Existing pruning approaches remove secondary parameters at the end of training to reduce the model size; but without exploiting the intrinsic network property, they still require the full interconnection to prepare the network. Inspired by the observation that brain networks follow the Small-World model, we propose a novel structural pruning scheme, which includes (1) hierarchically trimming the network into a Small-World model before training, (2) training the network for a given dataset, and (3) optimizing the network for accuracy. The new scheme effectively reduces both the model size and the interconnection needed before training, achieving a locally clustered and globally sparse model. We demonstrate our approach on LeNet-5 for MNIST and VGG-16 for CIFAR-10, decreasing the number of parameters to 2.3% and 9.02% of the baseline model, respectively.