# If you did not already know

Pairwise Median Scaled Difference (MSD)
A new robust pairwise statistic, the pairwise median scaled difference (MSD), is proposed for the detection of anomalous location/uncertainty pairs in heteroscedastic interlaboratory study data with associated uncertainties. The distribution for the IID case is presented and approximate critical values for routine use are provided. The determination of observation-specific quantiles and p-values for heteroscedastic data, using parametric bootstrapping, is demonstrated by example. It is shown that the statistic has good power for detecting anomalies compared to a previous pairwise statistic, and offers much greater resistance to multiple outlying values. …

Instance segmentation aims to detect and segment individual objects in a scene. Most existing methods rely on precise mask annotations of every category. However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required. We introduce ShapeMask, which learns the intermediate concept of object shape to address the problem of generalization in instance segmentation to novel categories. ShapeMask starts with a bounding box detection and gradually refines it by first estimating the shape of the detected object through a collection of shape priors. Next, ShapeMask refines the coarse shape into an instance level mask by learning instance embeddings. The shape priors provide a strong cue for object-like prediction, and the instance embeddings model the instance specific appearance information. ShapeMask significantly outperforms the state-of-the-art by 6.4 and 3.8 AP when learning across categories, and obtains competitive performance in the fully supervised setting. It is also robust to inaccurate detections, decreased model capacity, and small training data. Moreover, it runs efficiently with 150ms inference time and trains within 11 hours on TPUs. With a larger backbone model, ShapeMask increases the gap with state-of-the-art to 9.4 and 6.2 AP across categories. Code will be released. …

The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for high-dimensional data. In order to allow for scalability across both the data point and feature dimensions, and reduce communication cost, we propose block-distributed Gradient Boosted Trees. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the block-distributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve multiple orders of magnitude reduction in communication cost for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for high-dimensional data, and allow more cost-effective scale-out without the need for expensive network communication. …

Semantic segmentation is a key problem for many computer vision tasks. While approaches based on convolutional neural networks constantly break new records on different benchmarks, generalizing well to diverse testing environments remains a major challenge. In numerous real world applications, there is indeed a large gap between data distributions in train and test domains, which results in severe performance loss at run-time. In this work, we address the task of unsupervised domain adaptation in semantic segmentation with losses based on the entropy of the pixel-wise predictions. To this end, we propose two novel, complementary methods using (i) entropy loss and (ii) adversarial loss respectively. We demonstrate state-of-the-art performance in semantic segmentation on two challenging ‘synthetic-2-real’ set-ups and show that the approach can also be used for detection. …

# If you did not already know

Stochastic Path-Integrated Differential EstimatoR (SPIDER)
In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. Combining SPIDER with the method of normalized gradient descent, we propose two new algorithms, namely SPIDER-SFO and SPIDER-SSO, that solve non-convex stochastic optimization problems using stochastic gradients only. We provide sharp error-bound results on their convergence rates. Specially, we prove that the SPIDER-SFO and SPIDER-SSO algorithms achieve a record-breaking $\tilde{O}(\epsilon^{-3})$ gradient computation cost to find an $\epsilon$-approximate first-order and $(\epsilon, O(\epsilon^{0.5}))$-approximate second-order stationary point, respectively. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary point under the gradient Lipschitz assumption in the finite-sum setting. …

Levy-Attack
Developing techniques for adversarial attack and defense is an important research field for establishing reliable machine learning and its applications. Many existing methods employ Gaussian random variables for exploring the data space to find the most adversarial (for attacking) or least adversarial (for defense) point. However, the Gaussian distribution is not necessarily the optimal choice when the exploration is required to follow the complicated structure that most real-world data distributions exhibit. In this paper, we investigate how statistics of random variables affect such random walk exploration. Specifically, we generalize the Boundary Attack, a state-of-the-art black-box decision based attacking strategy, and propose the L\’evy-Attack, where the random walk is driven by symmetric $\alpha$-stable random variables. Our experiments on MNIST and CIFAR10 datasets show that the L\’evy-Attack explores the image data space more efficiently, and significantly improves the performance. Our results also give an insight into the recently found fact in the whitebox attacking scenario that the choice of the norm for measuring the amplitude of the adversarial patterns is essential. …

Multi-Range Reasoning Unit (MRU)
We propose MRU (Multi-Range Reasoning Units), a new fast compositional encoder for machine comprehension (MC). Our proposed MRU encoders are characterized by multi-ranged gating, executing a series of parameterized contract-and-expand layers for learning gating vectors that benefit from long and short-term dependencies. The aims of our approach are as follows: (1) learning representations that are concurrently aware of long and short-term context, (2) modeling relationships between intra-document blocks and (3) fast and efficient sequence encoding. We show that our proposed encoder demonstrates promising results both as a standalone encoder and as well as a complementary building block. We conduct extensive experiments on three challenging MC datasets, namely RACE, SearchQA and NarrativeQA, achieving highly competitive performance on all. On the RACE benchmark, our model outperforms DFN (Dynamic Fusion Networks) by 1.5%-6% without using any recurrent or convolution layers. Similarly, we achieve competitive performance relative to AMANDA on the SearchQA benchmark and BiDAF on the NarrativeQA benchmark without using any LSTM/GRU layers. Finally, incorporating MRU encoders with standard BiLSTM architectures further improves performance, achieving state-of-the-art results. …

Luhn Algorithm
The Luhn algorithm or Luhn formula, also known as the ‘modulus 10’ or ‘mod 10’ algorithm, is a simple checksum formula used to validate a variety of identification numbers, such as credit card numbers, IMEI numbers, National Provider Identifier numbers in the United States, Canadian Social Insurance Numbers, Israel ID Numbers and Greek Social Security Numbers (????). It was created by IBM scientist Hans Peter Luhn and described in U.S. Patent No. 2,950,048, filed on January 6, 1954, and granted on August 23, 1960. The algorithm is in the public domain and is in wide use today. It is specified in ISO/IEC 7812-1. It is not intended to be a cryptographically secure hash function; it was designed to protect against accidental errors, not malicious attacks. Most credit cards and many government identification numbers use the algorithm as a simple method of distinguishing valid numbers from mistyped or otherwise incorrect numbers. …

# If you did not already know

Attention Fusion Network
Customer support is a central objective at Square as it helps us build and maintain great relationships with our sellers. In order to provide the best experience, we strive to deliver the most accurate and quasi-instantaneous responses to questions regarding our products. In this work, we introduce the Attention Fusion Network model which combines signals extracted from seller interactions on the Square product ecosystem, along with submitted email questions, to predict the most relevant solution to a seller’s inquiry. We show that the innovative combination of two very different data sources that are rarely used together, using state-of-the-art deep learning systems outperforms, candidate models that are trained only on a single source. …

McTorch
In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch. It aims to lower the barrier for users wishing to use manifold constraints in deep learning applications, i.e., when the parameters are constrained to lie on a manifold. Such constraints include the popular orthogonality and rank constraints, and have been recently used in a number of applications in deep learning. McTorch follows PyTorch’s architecture and decouples manifold definitions and optimizers, i.e., once a new manifold is added it can be used with any existing optimizer and vice-versa. McTorch is available at https://…/mctorch.

Tiramisu
This paper introduces Tiramisu, an optimization framework designed to generate efficient code for high-performance systems such as multicores, GPUs, FPGAs, distributed machines, or any combination of these. Tiramisu relies on a flexible representation based on the polyhedral model and introduces a novel four-level IR that allows full separation between algorithms, schedules, data-layouts and communication. This separation simplifies targeting multiple hardware architectures from the same algorithm. We evaluate Tiramisu by writing a set of linear algebra and DNN kernels and by integrating it as a pass in the Halide compiler. We show that Tiramisu extends Halide with many new capabilities, and that Tiramisu can generate efficient code for multicores, GPUs, FPGAs and distributed heterogeneous systems. The performance of code generated by the Tiramisu backends matches or exceeds hand-optimized reference implementations. For example, the multicore backend matches the highly optimized Intel MKL library on many kernels and shows speedups reaching 4x over the original Halide. …

Nonparanormal Graphical Model
A nonparanormal graphical model is a semiparametric generalization of a Gaussian graphical model for continuous variables in which it is assumed that the variables follow a Gaussian graphical model only after some unknown smooth monotone transformations. …

# If you did not already know

Time-Warp-Invariant
The literature postulates that the dynamic time warping (dtw) distance can cope with temporal variations but stores and processes time series in a form as if the dtw-distance cannot cope with such variations. To address this inconsistency, we first show that the dtw-distance is not warping-invariant-despite its name and contrary to its characterization in some publications. The lack of warping-invariance contributes to the inconsistency mentioned above and to a strange behavior. To eliminate these peculiarities, we convert the dtw-distance to a warping-invariant semi-metric, called time-warp-invariant (twi) distance. Empirical results suggest that the error rates of the twi and dtw nearest-neighbor classifier are practically equivalent in a Bayesian sense. However, the twi-distance requires less storage and computation time than the dtw-distance for a broad range of problems. These results challenge the current practice of applying the dtw-distance in nearest-neighbor classification and suggest the proposed twi-distance as a more efficient and consistent option. …

Neural Persistence
While many approaches to make neural networks more fathomable have been proposed, they are restricted to interrogating the network with input data. Measures for characterizing and monitoring structural properties, however, have not been developed. In this work, we propose neural persistence, a complexity measure for neural network architectures based on topological data analysis on weighted stratified graphs. To demonstrate the usefulness of our approach, we show that neural persistence reflects best practices developed in the deep learning community such as dropout and batch normalization. Moreover, we derive a neural persistence-based stopping criterion that shortens the training process while achieving comparable accuracies as early stopping based on validation loss. …

Optimal Margin Distribution Network (mdNet)
Recent research about margin theory has proved that maximizing the minimum margin like support vector machines does not necessarily lead to better performance, and instead, it is crucial to optimize the margin distribution. In the meantime, margin theory has been used to explain the empirical success of deep network in recent studies. In this paper, we present mdNet (the Optimal Margin Distribution Network), a network which embeds a loss function in regard to the optimal margin distribution. We give a theoretical analysis of our method using the PAC-Bayesian framework, which confirms the significance of the margin distribution for classification within the framework of deep networks. In addition, empirical results show that the mdNet model always outperforms the baseline cross-entropy loss model consistently across different regularization situations. And our mdNet model also outperforms the cross-entropy loss (Xent), hinge loss and soft hinge loss model in generalization task through limited training data. …

LogCanvas
In this demo paper, we introduce LogCanvas, a platform for user search history visualisation. Different from the existing visualisation tools, LogCanvas focuses on helping users re-construct the semantic relationship among their search activities. LogCanvas segments a user’s search history into different sessions and generates a knowledge graph to represent the information exploration process in each session. A knowledge graph is composed of the most important concepts or entities discovered by each search query as well as their relationships. It thus captures the semantic relationship among the queries. LogCanvas offers a session timeline viewer and a snippets viewer to enable users to re-find their previous search results efficiently. LogCanvas also provides a collaborative perspective to support a group of users in sharing search results and experience. …

# If you did not already know

Proposal Cluster Learning (PCL)
Weakly Supervised Object Detection (WSOD), using only image-level annotations to train object detectors, is of growing importance in object recognition. In this paper, we propose a novel end-to-end deep network for WSOD. Unlike previous networks that transfer the object detection problem to an image classification problem using Multiple Instance Learning (MIL), our strategy generates proposal clusters to learn refined instance classifiers by an iterative process. The proposals in the same cluster are spatially adjacent and associated with the same object. This prevents the network from concentrating too much on parts of objects instead of whole objects. We first show that instances can be assigned object or background labels directly based on proposal clusters for instance classifier refinement, and then show that treating each cluster as a small new bag yields fewer ambiguities than the directly assigning label method. The iterative instance classifier refinement is implemented online using multiple streams in convolutional neural networks, where the first is an MIL network and the others are for instance classifier refinement supervised by the preceding one. Experiments are conducted on the PASCAL VOC and ImageNet detection benchmarks for WSOD. Results show that our method outperforms the previous state of the art significantly. …

Bandit on Large Action set Graph (BLAG)
Information diffusion in social networks facilitates rapid and large-scale propagation of content. However, spontaneous diffusion behavior could also lead to the cascading of sensitive information, which is neglected in prior arts. In this paper, we present the first look into adaptive diffusion of sensitive information, which we aim to prevent from widely spreading without incurring much information loss. We undertake the investigation in networks with partially known topology, meaning that some users’ ability of forwarding information is unknown. Formulating the problem into a bandit model, we propose BLAG (Bandit on Large Action set Graph), which adaptively diffuses sensitive information towards users with weak forwarding ability that is learnt from tentative transmissions and corresponding feedbacks. BLAG enjoys a low complexity of O(n), and is provably more efficient in the sense of half regret bound compared with prior learning method. Experiments on synthetic and three real datasets further demonstrate the superiority of BLAG in terms of adaptive diffusion of sensitive information over several baselines, with at least 40 percent less information loss, at least 10 times of learning efficiency given limited learning rounds and significantly postponed cascading of sensitive information. …

Interactive Report
An “Interactive Report” provides a new paradigm to fill the gap between Static Report and BI Tool. It has the following characteristics …
1. Like a static report, “Interactive Report” is still based on “static data”, which is a fixed set of data generated in a periodic batch fashion.
2. Unlike static report, this pre-generated “static data” is much larger and wider that covers a broader scope of questions that the execs may ask.
3. Because the “static data” is large and wide, it is impossible to visualize all aspects in the report. Therefore, only one perspective of the static data (based on the exec’s pre-specified requirement) is shown in the report.
4. However, if the exec wants to ask a different question, he/she can switch to a different perspective of the same “static data”. …

# If you did not already know

Multi-Temporal-Range Mixture Model (M3)
Understanding temporal dynamics has proved to be highly valuable for accurate recommendation. Sequential recommenders have been successful in modeling the dynamics of users and items over time. However, while different model architectures excel at capturing various temporal ranges or dynamics, distinct application contexts require adapting to diverse behaviors. In this paper we examine how to build a model that can make use of different temporal ranges and dynamics depending on the request context. We begin with the analysis of an anonymized Youtube dataset comprising millions of user sequences. We quantify the degree of long-range dependence in these sequences and demonstrate that both short-term and long-term dependent behavioral patterns co-exist. We then propose a neural Multi-temporal-range Mixture Model (M3) as a tailored solution to deal with both short-term and long-term dependencies. Our approach employs a mixture of models, each with a different temporal range. These models are combined by a learned gating mechanism capable of exerting different model combinations given different contextual information. In empirical evaluations on a public dataset and our own anonymized YouTube dataset, M3 consistently outperforms state-of-the-art sequential recommendation methods. …

Supervised Tensor Embedding (STE)
Today’s densely instrumented world offers tremendous opportunities for continuous acquisition and analysis of multimodal sensor data providing temporal characterization of an individual’s behaviors. Is it possible to efficiently couple such rich sensor data with predictive modeling techniques to provide contextual, and insightful assessments of individual performance and wellbeing? Prediction of different aspects of human behavior from these noisy, incomplete, and heterogeneous bio-behavioral temporal data is a challenging problem, beyond unsupervised discovery of latent structures. We propose a Supervised Tensor Embedding (STE) algorithm for high dimension multimodal data with join decomposition of input and target variable. Furthermore, we show that features selection will help to reduce the contamination in the prediction and increase the performance. The efficiently of the methods was tested via two different real world datasets. …

In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks consisting of a single Rectified Linear Unit (ReLU). These functions are of the form $\mathbf{x}\rightarrow \max(0,\langle\mathbf{w},\mathbf{x}\rangle)$ with $\mathbf{w}\in\mathbb{R}^d$ denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where in each iteration the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or all other processors. To reduce the communication cost in this setting we utilize a Quanitzed Stochastic Gradient Scheme (QSGD) where the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our numerical findings via various experiments including distributed implementations over Amazon EC2. …

Semantic Memory
Semantic memory refers to a portion of long-term memory that processes ideas and concepts that are not drawn from personal experience. Semantic memory includes things that are common knowledge, such as the names of colors, the sounds of letters, the capitals of countries and other basic facts acquired over a lifetime. The concept of semantic memory is fairly new. It was introduced in 1972 as the result of collaboration between Endel Tulving of the University of Toronto and Wayne Donaldson of the University of New Brunswick on the impact of organization in human memory. Tulving outlined the separate systems of conceptualization of episodic and semantic memory in his book, ‘Elements of Episodic Memory.’ He noted that semantic and episodic differ in how they operate and the types of information they process. …

# If you did not already know

Krazy World
We consider the problem of exploration in meta reinforcement learning. Two new meta reinforcement learning algorithms are suggested: E-MAML and E-$\text{RL}^2$. Results are presented on a novel environment we call `Krazy World’ and a set of maze environments. We show E-MAML and E-$\text{RL}^2$ deliver better performance on tasks where exploration is important. …

Vector Autoregressive Model (VAR)
Vector autoregression (VAR) is an econometric model used to capture the linear interdependencies among multiple time series. VAR models generalize the univariate autoregression (AR) models by allowing for more than one evolving variable. All variables in a VAR are treated symmetrically in a structural sense (although the estimated quantitative response coefficients will not in general be the same); each variable has an equation explaining its evolution based on its own lags and the lags of the other model variables. VAR modeling does not require as much knowledge about the forces influencing a variable as do structural models with simultaneous equations: The only prior knowledge required is a list of variables which can be hypothesized to affect each other intertemporally. …

Complementary Conditional GAN (CCGAN)
Majority of state-of-the-art deep learning methods for vision applications are discriminative approaches, which model the conditional distribution. The success of such approaches heavily depends on high-quality labeled instances, which are not easy to obtain, especially as the number of candidate classes increases. In this paper, we study the complementary learning problem. Unlike ordinary labels, complementary labels are easy to obtain because an annotator only needs to provide a yes/no answer to a randomly chosen candidate class for each instance. We propose a generative-discriminative complementary learning method that estimates the ordinary labels by modeling both the conditional (discriminative) and instance (generative) distributions. Our method, we call Complementary Conditional GAN (CCGAN), improves the accuracy of predicting ordinary labels and is able to generate high quality instances in spite of weak supervision. In addition to the extensive empirical studies, we also theoretically show that our model can retrieve the true conditional distribution from the complementarily-labeled data. …

Auto-Correlational Neural Network (ACNN)
In recent years, the natural language processing community has moved away from task-specific feature engineering, i.e., researchers discovering ad-hoc feature representations for various tasks, in favor of general-purpose methods that learn the input representation by themselves. However, state-of-the-art approaches to disfluency detection in spontaneous speech transcripts currently still depend on an array of hand-crafted features, and other representations derived from the output of pre-existing systems such as language models or dependency parsers. As an alternative, this paper proposes a simple yet effective model for automatic disfluency detection, called an auto-correlational neural network (ACNN). The model uses a convolutional neural network (CNN) and augments it with a new auto-correlation operator at the lowest layer that can capture the kinds of ‘rough copy’ dependencies that are characteristic of repair disfluencies in speech. In experiments, the ACNN model outperforms the baseline CNN on a disfluency detection task with a 5% increase in f-score, which is close to the previous best result on this task. …

# If you did not already know

DREAM
We present DREAM, the first dialogue-based multiple-choice reading comprehension dataset. Collected from English-as-a-foreign-language examinations designed by human experts to evaluate the comprehension level of Chinese learners of English, our dataset contains 10,197 multiple-choice questions for 6,444 dialogues. In contrast to existing reading comprehension datasets, DREAM is the first to focus on in-depth multi-turn multi-party dialogue understanding. DREAM is likely to present significant challenges for existing reading comprehension systems: 84% of answers are non-extractive, 85% of questions require reasoning beyond a single sentence, and 34% of questions also involve commonsense knowledge. We apply several popular neural reading comprehension models that primarily exploit surface information within the text and find them to, at best, just barely outperform a rule-based approach. We next investigate the effects of incorporating dialogue structure and different kinds of general world knowledge into both rule-based and (neural and non-neural) machine learning-based reading comprehension models. Experimental results on the DREAM dataset show the effectiveness of dialogue structure and general world knowledge. DREAM will be available at https://…/.

A lot of advances in the processing of XML data have been proposed in the previous decade. There were many approaches focused on the efficient processing of twig pattern queries (TPQ). However, including the TPQ into an XQuery compiler is not a straightforward problem and current XML DBMSs process XQueries without any TPQ detection. In this paper, we demonstrate our prototype of a native XML DBMS called RadegastXDB that uses a TPQ detection to accelerate structural XQueries. Such a detection allows us to utilize state-of-the-art TPQ processing algorithms. Our experiments show that, for the structural queries, these algorithms and state-of-the-art XML indexing techniques make our prototype faster than all of the current XML DBMSs, especially for large data collections. We also show that using the same techniques is also efficient for the processing of queries with value predicates. …

Feature, Affinity and Multi-Dimensional Assignment Net (FAMNet)
Data association-based multiple object tracking (MOT) involves multiple separated modules processed or optimized differently, which results in complex method design and requires non-trivial tuning of parameters. In this paper, we present an end-to-end model, named FAMNet, where Feature extraction, Affinity estimation and Multi-dimensional assignment are refined in a single network. All layers in FAMNet are designed differentiable thus can be optimized jointly to learn the discriminative features and higher-order affinity model for robust MOT, which is supervised by the loss directly from the assignment ground truth. We also integrate single object tracking technique and a dedicated target management scheme into the FAMNet-based tracking system to further recover false negatives and inhibit noisy target candidates generated by the external detector. The proposed method is evaluated on a diverse set of benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC, and achieves promising performance on all of them in comparison with state-of-the-arts. …

e-SNLI
In order for machine learning to garner widespread public adoption, models must be able to provide interpretable and robust explanations for their decisions, as well as learn from human-provided explanations at train time. In this work, we extend the Stanford Natural Language Inference dataset with an additional layer of human-annotated natural language explanations of the entailment relations. We further implement models that incorporate these explanations into their training process and output them at test time. We show how our corpus of explanations, which we call e-SNLI, can be used for various goals, such as obtaining full sentence justifications of a model’s decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets. Our dataset thus opens up a range of research directions for using natural language explanations, both for improving models and for asserting their trust. …

# If you did not already know

This paper contributes by providing a definition of a data-driven business model as a business model that relies on data as a key resource. …

Surrogate Assisted Feature Extraction for Model Learning (SAFE ML)
Complex black-box predictive models may have high accuracy, but opacity causes problems like lack of trust, lack of stability, sensitivity to concept drift. On the other hand, interpretable models require more work related to feature engineering, which is very time consuming. Can we train interpretable and accurate models, without timeless feature engineering? In this article, we show a method that uses elastic black-boxes as surrogate models to create a simpler, less opaque, yet still accurate and interpretable glass-box models. New models are created on newly engineered features extracted/learned with the help of a surrogate model. We show applications of this method for model level explanations and possible extensions for instance level explanations. We also present an example implementation in Python and benchmark this method on a number of tabular data sets. …

Addressing uncertainty is critical for autonomous systems to robustly adapt to the real world. We formulate the problem of model uncertainty as a continuous Bayes-Adaptive Markov Decision Process (BAMDP), where an agent maintains a posterior distribution over the latent model parameters given a history of observations and maximizes its expected long-term reward with respect to this belief distribution. Our algorithm, Bayesian Policy Optimization, builds on recent policy optimization algorithms to learn a universal policy that navigates the exploration-exploitation trade-off to maximize the Bayesian value function. To address challenges from discretizing the continuous latent parameter space, we propose a policy network architecture that independently encodes the belief distribution from the observable state. Our method significantly outperforms algorithms that address model uncertainty without explicitly reasoning about belief distributions, and is competitive with state-of-the-art Partially Observable Markov Decision Process solvers. …

code2seq
The ability to generate natural language sequences from source code snippets can be used for code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present ${\rm {\scriptsize CODE2SEQ}}$: an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of paths in its abstract syntax tree (AST) and uses attention to select the relevant paths during decoding, much like contemporary NMT models. We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as general state-of-the-art NMT models. …

# If you did not already know

Shallow Triple Stream Three-Dimensional CNN (STSTNet)
In the recent year, the state-of-the-arts of facial micro-expression recognition task have been significantly advanced by the emergence of data-driven approaches based on deep learning. Due to the superb learning capacity of deep learning, it generates promising performance beyond the traditional handcrafted approaches. Recently, many researchers have focused on developing better networks by increasing its depth, as deep networks can effectively approximate certain function classes more efficiently than shallow ones. In this paper, we aim to design a shallow network to extract the high level features of the micro-expression details. Specifically, a two-layer neural network, namely Shallow Triple Stream Three-dimensional CNN (STSTNet) is proposed. The network is capable to learn the features from three optical flow features (i.e., optical strain, horizontal and vertical optical flow images) computed from the onset and apex frames from each video. Our experimental results demonstrate the viability of the proposed STSTNet, which exhibits the UAR recognition results of 76.05%, 70.13%, 86.86% and 68.10% in composite, SMIC, CASME II and SAMM databases, respectively. …

Hubness
The tendency of high-dimensional data to contain points (hubs) that frequently occur in k-nearest-neighbor lists of other points. Semantically Aligned Bias Reducing Zero Shot Learning
Semantically Aligned Bias Reducing Zero Shot Learning