# If you did not already know

Mean Field Reinforcement Learning (MFRL)
Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of the dimensionality and the exponential growth of user interactions. In this paper, we present Mean Field Reinforcement Learning where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforced: the learning of the individual agent’s optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution. Experiments on resource allocation, Ising model estimation, and battle game tasks verify the learning effectiveness of our mean field approaches in handling many-agent interactions in population. …

Monica
Can you remember the names of the children of all your friends? Can you remember the wedding anniversary of your brother? Can you tell the last time you called your grand mother and what you talked about? Monica lets you quickly and easily log all those information so you can be a better friend, family member or spouse. …

Probabilistic Causation
Probabilistic causation is a concept in a group of philosophical theories that aim to characterize the relationship between cause and effect using the tools of probability theory. The central idea behind these theories is that causes raise the probabilities of their effects, all else being equal. Interpreting causation as a deterministic relation means that if A causes B, then A must always be followed by B. In this sense, war does not cause deaths, nor does smoking cause cancer. As a result, many turn to a notion of probabilistic causation. Informally, A probabilistically causes B if A’s occurrence increases the probability of B. This is sometimes interpreted to reflect imperfect knowledge of a deterministic system but other times interpreted to mean that the causal system under study has an inherently indeterministic nature. (Propensity probability is an analogous idea, according to which probabilities have an objective existence and are not just limitations in a subject’s knowledge). Philosophers such as Hugh Mellor and Patrick Suppes have defined causation in terms of a cause preceding and increasing the probability of the effect. …

Self-Imitation Learning (SIL)
This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent’s past good decisions. This algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard exploration Atari games and is competitive to the state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks. …

# If you did not already know

Rooted Tree
A rooted tree is a tree in which one vertex has been designated the root. The edges of a rooted tree can be assigned a natural orientation, either away from or towards the root, in which case the structure becomes a directed rooted tree. …

POPQORN
The vulnerability to adversarial attacks has been a critical issue for deep neural networks. Addressing this issue requires a reliable way to evaluate the robustness of a network. Recently, several methods have been developed to compute $\textit{robustness quantification}$ for neural networks, namely, certified lower bounds of the minimum adversarial perturbation. Such methods, however, were devised for feed-forward networks, e.g. multi-layer perceptron or convolutional networks. It remains an open problem to quantify robustness for recurrent networks, especially LSTM and GRU. For such networks, there exist additional challenges in computing the robustness quantification, such as handling the inputs at multiple steps and the interaction between gates and states. In this work, we propose $\textit{POPQORN}$ ($\textbf{P}$ropagated-$\textbf{o}$ut$\textbf{p}$ut $\textbf{Q}$uantified R$\textbf{o}$bustness for $\textbf{RN}$Ns), a general algorithm to quantify robustness of RNNs, including vanilla RNNs, LSTMs, and GRUs. We demonstrate its effectiveness on different network architectures and show that the robustness quantification on individual steps can lead to new insights. …

Adaptive Quantile Low-Rank Matrix Factorization (AQ-LRMF)
Low-rank matrix factorization (LRMF) has received much popularity owing to its successful applications in both computer vision and data mining. By assuming the noise term to come from a Gaussian, Laplace or a mixture of Gaussian distributions, significant efforts have been made on optimizing the (weighted) $L_1$ or $L_2$-norm loss between an observed matrix and its bilinear factorization. However, the type of noise distribution is generally unknown in real applications and inappropriate assumptions will inevitably deteriorate the behavior of LRMF. On the other hand, real data are often corrupted by skew rather than symmetric noise. To tackle this problem, this paper presents a novel LRMF model called AQ-LRMF by modeling noise with a mixture of asymmetric Laplace distributions. An efficient algorithm based on the expectation-maximization (EM) algorithm is also offered to estimate the parameters involved in AQ-LRMF. The AQ-LRMF model possesses the advantage that it can approximate noise well no matter whether the real noise is symmetric or skew. The core idea of AQ-LRMF lies in solving a weighted $L_1$ problem with weights being learned from data. The experiments conducted with synthetic and real datasets show that AQ-LRMF outperforms several state-of-the-art techniques. Furthermore, AQ-LRMF also has the superiority over the other algorithms that it can capture local structural information contained in real images. …

Deep Self-Organization
Human professionals are often required to make decisions based on complex multivariate time series measurements in an online setting, e.g. in health care. Since human cognition is not optimized to work well in high-dimensional spaces, these decisions benefit from interpretable low-dimensional representations. However, many representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time. To address this problem, we propose to couple a variational autoencoder to a discrete latent space and introduce a topological structure through the use of self-organizing maps. This allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the latent space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application. In the latter experiment, our representation uncovers meaningful structure in the acute physiological state of a patient. …

# If you did not already know

We propose the cascade attribute learning network (CALNet), which can learn attributes in a control task separately and assemble them together. Our contribution is twofold: first we propose attribute learning in reinforcement learning (RL). Attributes used to be modeled using constraint functions or terms in the objective function, making it hard to transfer. Attribute learning, on the other hand, models these task properties as modules in the policy network. We also propose using novel cascading compensative networks in the CALNet to learn and assemble attributes. Using the CALNet, one can zero shoot an unseen task by separately learning all its attributes, and assembling the attribute modules. We have validated the capacity of our model on a wide variety of control problems with attributes in time, position, velocity and acceleration phases. …

Collaborative Convolutional Network (CoCoNet)
We present an end-to-end CNN architecture for fine-grained visual recognition called Collaborative Convolutional Network (CoCoNet). The network uses a collaborative filter after the convolutional layers to represent an image as an optimal weighted collaboration of features learned from training samples as a whole rather than one at a time. This gives CoCoNet more power to encode the fine-grained nature of the data with limited samples in an end-to-end fashion. We perform a detailed study of the performance with 1-stage and 2-stage transfer learning and different configurations with benchmark architectures like AlexNet and VggNet. The ablation study shows that the proposed method outperforms its constituent parts considerably and consistently. CoCoNet also outperforms the baseline popular deep learning based fine-grained recognition method, namely Bilinear-CNN (BCNN) with statistical significance. Experiments have been performed on the fine-grained species recognition problem, but the method is general enough to be applied to other similar tasks. Lastly, we also introduce a new public dataset for fine-grained species recognition, that of Indian endemic birds and have reported initial results on it. The training metadata and new dataset are available through the corresponding author. …

ISBNet
Recent years have witnessed growing interests in designing efficient neural networks and neural architecture search (NAS). Although remarkable efficiency and accuracy have been achieved, existing expert designed and NAS models neglect that input instances are of varying complexity thus different amount of computation is required. Therefore, inference with a fixed model that processes all instances through the same transformations would waste plenty of computational resources. Customizing the model capacity in an instance-aware manner is highly demanded. In this paper, we introduce a novel network ISBNet to address this issue, which supports efficient instance-level inference by selectively bypassing transformation branches of infinitesimal importance weight. We also propose lightweight hypernetworks SelectionNet to generate these importance weights instance-wisely. Extensive experiments have been conducted to evaluate the efficiency of ISBNet and the results show that ISBNet achieves extremely efficient inference comparing to existing networks. For example, ISBNet takes only 12.45% parameters and 45.79% FLOPs of the state-of-the-art efficient network ShuffleNetV2 with comparable accuracy. …

DeepTag
In many under-resourced settings, clinicians lack time and expertise to annotate patients with standard medical diagnosis codes. Veterinary medicine is an example of this and clinical encounters are largely captured in free text notes which are not labeled with diagnosis code. The lack of such standard coding makes it challenging to apply data science to improve patient care. It is also a major impediment to translational research, where, for example, we would like to leverage veterinary data to inform drug development for humans. We develop a deep learning algorithm, DeepTag, to automatically infer diagnosis codes from veterinarian free text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multi-task LSTM with an improved hierarchical objective that captures structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain in examples when it is uncertain and defer them to human experts, resulting in improved performance of the model. DeepTag accurately infers disease codes from free text even in challenging out-of-domain settings where the text comes from different clinics than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal pre-processing. The technical framework in this work can be applied in other medical domains that currently lack medical coding infrastructure. …

# If you did not already know

Poincaré Wasserstein Autoencoder
This work presents a reformulation of the recently proposed Wasserstein autoencoder framework on a non-Euclidean manifold, the Poincar\’e ball model of the hyperbolic space. By assuming the latent space to be hyperbolic, we can use its intrinsic hierarchy to impose structure on the learned latent space representations. We demonstrate the model in the visual domain to analyze some of its properties and show competitive results on a graph link prediction task. …

Kernel Machine Learning (KernelML)
I created a custom ‘particle optimizer’ and published a pip python package called kernelml. The motivation for making this algorithm was to give analysts and data scientists a generalized machine learning algorithm for complex loss functions and non-linear coefficients. The optimizer uses a combination of simple machine learning and probabilistic simulations to search for optimal parameters using a loss function, input and output matrices, and (optionally) a random sampler. I´m currently working on more features and hope to eventually make the project open source. …

Self-Paced Probabilistic Principal Component Analysis (SP-PPCA)
Principal Component Analysis (PCA) is a popular tool for dimensionality reduction and feature extraction in data analysis. There is a probabilistic version of PCA, known as Probabilistic PCA (PPCA). However, standard PCA and PPCA are not robust, as they are sensitive to outliers. To alleviate this problem, this paper introduces the Self-Paced Learning mechanism into PPCA, and proposes a novel method called Self-Paced Probabilistic Principal Component Analysis (SP-PPCA). Furthermore, we design the corresponding optimization algorithm based on the alternative search strategy and the expectation-maximization algorithm. SP-PPCA looks for optimal projection vectors and filters out outliers iteratively. Experiments on both synthetic problems and real-world datasets clearly demonstrate that SP-PPCA is able to reduce or eliminate the impact of outliers. …

The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model. …

# If you did not already know

Generative Latent Flow (GLF)
Generative Adversarial Networks (GANs) have been shown to outperform non-adversarial generative models in terms of the image generation quality by a large margin. Recently, researchers have looked into improving non-adversarial alternatives that can close the gap of generation quality while avoiding some common issues of GANs, such as unstable training and mode collapse. Examples in this direction include Two-stage VAE and Generative Latent Nearest Neighbors. However, a major drawback of these models is that they are slow to train, and in particular, they require two training stages. To address this, we propose Generative Latent Flow (GLF), which uses an auto-encoder to learn the mapping to and from the latent space, and an invertible flow to map the distribution in the latent space to simple i.i.d noise. The advantages of our method include a simple conceptual framework, single stage training and fast convergence. Quantitatively, the generation quality of our model significantly outperforms that of VAEs, and is competitive with GANs’ benchmark on commonly used datasets. …

Social influence plays a vital role in shaping a user’s behavior in online communities dealing with items of fine taste like movies, food, and beer. For online recommendation, this implies that users’ preferences and ratings are influenced due to other individuals. Given only time-stamped reviews of users, can we find out who-influences-whom, and characteristics of the underlying influence network? Can we use this network to improve recommendation? While prior works in social-aware recommendation have leveraged social interaction by considering the observed social network of users, many communities like Amazon, Beeradvocate, and Ratebeer do not have explicit user-user links. Therefore, we propose GhostLink, an unsupervised probabilistic graphical model, to automatically learn the latent influence network underlying a review community — given only the temporal traces (timestamps) of users’ posts and their content. Based on extensive experiments with four real-world datasets with 13 million reviews, we show that GhostLink improves item recommendation by around 23% over state-of-the-art methods that do not consider this influence. As additional use-cases, we show that GhostLink can be used to differentiate between users’ latent preferences and influenced ones, as well as to detect influential users based on the learned influence graph. …

Pyomo
Pyomo is a Python-based open-source software package that supports a diverse set of optimization capabilities for formulating, solving, and analyzing optimization models.
A core capability of Pyomo is modeling structured optimization applications. Pyomo can be used to define general symbolic problems, create specific problem instances, and solve these instances using commercial and open-source solvers. Pyomo’s modeling objects are embedded within a full-featured high-level programming language providing a rich set of supporting libraries, which distinguishes Pyomo from other algebraic modeling languages like AMPL, AIMMS and GAMS.
Pyomo supports a wide range of problem types, including:
· Linear programming
· Nonlinear programming
· Mixed-integer linear programming
· Mixed-integer nonlinear programming
· Stochastic programming
· Generalized disjunctive programming
· Differential algebraic equations
· Bilevel programming
· Mathematical programs with equilibrium constraints
Pyomo also supports iterative analysis and scripting capabilities within a full-featured programming language. Further, Pyomo has also proven an effective framework for developing high-level optimization and analysis tools. For example, the PySP package provides generic solvers for stochastic programming. PySP leverages the fact that Pyomo’s modeling objects are embedded within a full-featured high-level programming language, which allows for transparent parallelization of subproblems using Python parallel communication libraries. …

Perceptual Visual Interactive Learning (PVIL)
Supervised learning methods are widely used in machine learning. However, the lack of labels in existing data limits the application of these technologies. Visual interactive learning (VIL) compared with computers can avoid semantic gap, and solve the labeling problem of small label quantity (SLQ) samples in a groundbreaking way. In order to fully understand the importance of VIL to the interaction process, we re-summarize the interactive learning related algorithms (e.g. clustering, classification, retrieval etc.) from the perspective of VIL. Note that, perception and cognition are two main visual processes of VIL. On this basis, we propose a perceptual visual interactive learning (PVIL) framework, which adopts gestalt principle to design interaction strategy and multi-dimensionality reduction (MDR) to optimize the process of visualization. The advantage of PVIL framework is that it combines computer’s sensitivity of detailed features and human’s overall understanding of global tasks. Experimental results validate that the framework is superior to traditional computer labeling methods (such as label propagation) in both accuracy and efficiency, which achieves significant classification results on dense distribution and sparse classes dataset. …

# If you did not already know

DataJoint
The relational data model offers unrivaled rigor and precision in defining data structure and querying complex data. Yet the use of relational databases in scientific data pipelines is limited due to their perceived unwieldiness. We propose a simplified and conceptually refined relational data model named DataJoint. The model includes a language for schema definition, a language for data queries, and diagramming notation for visualizing entities and relationships among them. The model adheres to the principle of entity normalization, which requires that all data — both stored and derived — must be represented by well-formed entity sets. DataJoint’s data query language is an algebra on entity sets with five operators that provide matching capabilities to those of other relational query languages with greater clarity due to entity normalization. Practical implementations of DataJoint have been adopted in neuroscience labs for fluent interaction with scientific data pipelines. …

Distributionally Robust Optimization (DRO)
Topology design is a critical task for the reliability, economic operation, and resilience of distribution systems. This paper proposes a distributionally robust optimization (DRO) model for designing the topology of a new distribution system facing random contingencies (e.g., imposed by natural disasters). The proposed DRO model optimally configures the network topology and integrates distributed generation to effectively meet the loads. Moreover, we take into account the uncertainty of contingency. Using the moment information of distribution line failures, we construct an ambiguity set of the contingency probability distribution, and minimize the expected amount of load shedding with regard to the worst-case distribution within the ambiguity set. As compared with a classical robust optimization model, the DRO model explicitly considers the contingency uncertainty and so provides a less conservative configuration, yielding a better out-of-sample performance. We recast the proposed model to facilitate the column-and-constraint generation algorithm. We demonstrate the out-of-sample performance of the proposed approach in numerical case studies. …

TensorFlow.js
A WebGL accelerated, browser based JavaScript library for training and deploying ML models. …

Nowadays, an increasing number of customers are in favor of using E-commerce Apps to browse and purchase products. Since merchants are usually inclined to employ redundant and over-informative product titles to attract customers’ attention, it is of great importance to concisely display short product titles on limited screen of cell phones. Previous researchers mainly consider textual information of long product titles and lack of human-like view during training and evaluation procedure. In this paper, we propose a Multi-Modal Generative Adversarial Network (MM-GAN) for short product title generation, which innovatively incorporates image information, attribute tags from the product and the textual information from original long titles. MM-GAN treats short titles generation as a reinforcement learning process, where the generated titles are evaluated by the discriminator in a human-like view. …

# If you did not already know

Weight Standardization (WS)
In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS is targeted at the micro-batch training setting where each GPU typically has only 1-2 images for training. The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performances of BN in large-batch training. Our WS ends this problem because when used with Group Normalization and trained with 1 image/GPU, WS is able to match or outperform the performances of BN trained with large batch sizes with only 2 more lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these superior results by standardizing the weights in the convolutional layers, which we show is able to smooth the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: https://…/WeightStandardization.

Deep COACH (D-COACH)
Deep Reinforcement Learning (DRL) has become a powerful strategy to solve complex decision making problems based on Deep Neural Networks (DNNs). However, it is highly data demanding, so unfeasible in physical systems for most applications. In this work, we approach an alternative Interactive Machine Learning (IML) strategy for training DNN policies based on human corrective feedback, with a method called Deep COACH (D-COACH). This approach not only takes advantage of the knowledge and insights of human teachers as well as the power of DNNs, but also has no need of a reward function (which sometimes implies the need of external perception for computing rewards). We combine Deep Learning with the COrrective Advice Communicated by Humans (COACH) framework, in which non-expert humans shape policies by correcting the agent’s actions during execution. The D-COACH framework has the potential to solve complex problems without much data or time required. Experimental results validated the efficiency of the framework in three different problems (two simulated, one with a real robot), with state spaces of low and high dimensions, showing the capacity to successfully learn policies for continuous action spaces like in the Car Racing and Cart-Pole problems faster than with DRL. …

Language Model Based Grammatical Error Correction (LM-GEC)
Grammatical error correction (GEC) is one of the areas in natural language processing in which purely neural models have not yet superseded more traditional symbolic models. Hybrid systems combining phrase-based statistical machine translation (SMT) and neural sequence models are currently among the most effective approaches to GEC. However, both SMT and neural sequence-to-sequence models require large amounts of annotated data. Language model based Grammatical error correction (LM-GEC) is a promising alternative which does not rely on annotated training data. We show how to improve LM-GEC by applying modelling techniques based on finite state transducers. We report further gains by rescoring with neural language models. We show that our methods developed for LM-GEC can also be used with SMT systems if annotated training data is available. Our best system outperforms the best published result on the CoNLL-2014 test set, and achieves far better relative improvements over the SMT baselines than previous hybrid systems. …

Matrix Linear Discriminant Analysis
We propose a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data that commonly arises from imaging studies. Motivated by the equivalence of the conventional linear discriminant analysis and the ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties including a non-asymptotic risk bound and a rank consistency result are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over the existing approaches. …

# If you did not already know

Model-Implied Instrumental Variable Two-Stage Bayesian Model Averaging (MIIV-2SBMA)
Model-Implied Instrumental Variable Two-Stage Least Squares (MIIV-2SLS) is a limited information, equation-by-equation, non-iterative estimator for latent variable models. Associated with this estimator are equation specific tests of model misspecification. We propose an extension to the existing MIIV-2SLS estimator that utilizes Bayesian model averaging which we term Model-Implied Instrumental Variable Two-Stage Bayesian Model Averaging (MIIV-2SBMA). MIIV-2SBMA accounts for uncertainty in optimal instrument set selection, and provides powerful instrument specific tests of model misspecification and instrument strength. We evaluate the performance of MIIV-2SBMA against MIIV-2SLS in a simulation study and show that it has comparable performance in terms of parameter estimation. Additionally, our instrument specific overidentification tests developed within the MIIV-2SBMA framework show increased power to detect model misspecification over the traditional equation level tests of model misspecification. Finally, we demonstrate the use of MIIV-2SBMA using an empirical example. …

Subword Regularization
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings. …

Stanza
The parameter server architecture is prevalently used for distributed deep learning. Each worker machine in a parameter server system trains the complete model, which leads to a hefty amount of network data transfer between workers and servers. We empirically observe that the data transfer has a non-negligible impact on training time. To tackle the problem, we design a new distributed training system called Stanza. Stanza exploits the fact that in many models such as convolution neural networks, most data exchange is attributed to the fully connected layers, while most computation is carried out in convolutional layers. Thus, we propose layer separation in distributed training: the majority of the nodes just train the convolutional layers, and the rest train the fully connected layers only. Gradients and parameters of the fully connected layers no longer need to be exchanged across the cluster, thereby substantially reducing the data transfer volume. We implement Stanza on PyTorch and evaluate its performance on Azure and EC2. Results show that Stanza accelerates training significantly over current parameter server systems: on EC2 instances with Tesla V100 GPU and 10Gb bandwidth for example, Stanza is 1.34x–13.9x faster for common deep learning models. …

Noise-Contrastive Estimation (NCE)
Many parametric statistical models are not properly normalised and only specified up to an intractable partition function, which renders parameter estimation difficult. Examples of unnormalised models are Gibbs distributions, Markov random fields, and neural network models in unsupervised deep learning. In previous work, the estimation principle called noise-contrastive estimation (NCE) was introduced where unnormalised models are estimated by learning to distinguish between data and auxiliary noise. An open question is how to best choose the auxiliary noise distribution. We here propose a new method that addresses this issue. The proposed method shares with NCE the idea of formulating density estimation as a supervised learning problem but in contrast to NCE, the proposed method leverages the observed data when generating noise samples. The noise can thus be generated in a semi-automated manner. We first present the underlying theory of the new method, show that score matching emerges as a limiting case, validate the method on continuous and discrete valued synthetic data, and show that we can expect an improved performance compared to NCE when the data lie in a lower-dimensional manifold. Then we demonstrate its applicability in unsupervised deep learning by estimating a four-layer neural image model. …

# If you did not already know

Regularized Kernel
We introduce Regularized Kernel and Neural Sobolev Descent for transporting a source distribution to a target distribution along smooth paths of minimum kinetic energy (defined by the Sobolev discrepancy), related to dynamic optimal transport. In the kernel version, we give a simple algorithm to perform the descent along gradients of the Sobolev critic, and show that it converges asymptotically to the target distribution in the MMD sense. In the neural version, we parametrize the Sobolev critic with a neural network with input gradient norm constrained in expectation. We show in theory and experiments that regularization has an important role in favoring smooth transitions between distributions, avoiding large discrete jumps. Our analysis could provide a new perspective on the impact of critic updates (early stopping) on the paths to equilibrium in the GAN setting. …

Wasserstein Transform
We introduce the Wasserstein transform, a method for enhancing and denoising datasets defined on general metric spaces. The construction draws inspiration from Optimal Transportation ideas. We establish precise connections with the mean shift family of algorithms and establish the stability of both our method and mean shift under data perturbation. …

EnergyNet
We present ENERGYNET , a new framework for analyzing and building artificial neural network architectures. Our approach adaptively learns the structure of the networks in an unsupervised manner. The methodology is based upon the theoretical guarantees of the energy function of restricted Boltzmann machines (RBM) of infinite number of nodes. We present experimental results to show that the final network adapts to the complexity of a given problem. …

Long Term Memory Network (LTM)
Recurrent Neural Networks (RNN), Long Short-Term Memory Networks (LSTM), and Memory Networks which contain memory are popularly used to learn patterns in sequential data. Sequential data has long sequences that hold relationships. RNN can handle long sequences but suffers from the vanishing and exploding gradient problems. While LSTM and other memory networks address this problem, they are not capable of handling long sequences (50 or more data points long sequence patterns). Language modelling requiring learning from longer sequences are affected by the need for more information in memory. This paper introduces Long Term Memory network (LTM), which can tackle the exploding and vanishing gradient problems and handles long sequences without forgetting. LTM is designed to scale data in the memory and gives a higher weight to the input in the sequence. LTM avoid overfitting by scaling the cell state after achieving the optimal results. The LTM is tested on Penn treebank dataset, and Text8 dataset and LTM achieves test perplexities of 83 and 82 respectively. 650 LTM cells achieved a test perplexity of 67 for Penn treebank, and 600 cells achieved a test perplexity of 77 for Text8. LTM achieves state of the art results by only using ten hidden LTM cells for both datasets. …

# If you did not already know

Semantic Weight-Inverse Document Frequency (SW-IDF)
Time-sync comments reveal a new way of extracting the online video tags. However, such time-sync comments have lots of noises due to users’ diverse comments, introducing great challenges for accurate and fast video tag extractions. In this paper, we propose an unsupervised video tag extraction algorithm named Semantic Weight-Inverse Document Frequency (SW-IDF). Specifically, we first generate corresponding semantic association graph (SAG) using semantic similarities and timestamps of the time-sync comments. Second, we propose two graph cluster algorithms, i.e., dialogue-based algorithm and topic center-based algorithm, to deal with the videos with different density of comments. Third, we design a graph iteration algorithm to assign the weight to each comment based on the degrees of the clustered subgraphs, which can differentiate the meaningful comments from the noises. Finally, we gain the weight of each word by combining Semantic Weight (SW) and Inverse Document Frequency (IDF). In this way, the video tags are extracted automatically in an unsupervised way. Extensive experiments have shown that SW-IDF (dialogue-based algorithm) achieves 0.4210 F1-score and 0.4932 MAP (Mean Average Precision) in high-density comments, 0.4267 F1-score and 0.3623 MAP in low-density comments; while SW-IDF (topic center-based algorithm) achieves 0.4444 F1-score and 0.5122 MAP in high-density comments, 0.4207 F1-score and 0.3522 MAP in low-density comments. It has a better performance than the state-of-the-art unsupervised algorithms in both F1-score and MAP. …

Generalized Kalman Smoothing
State-space smoothing has found many applications in science and engineering. Under linear and Gaussian assumptions, smoothed estimates can be obtained using efficient recursions, for example Rauch-Tung-Striebel and Mayne-Fraser algorithms. Such schemes are equivalent to linear algebraic techniques that minimize a convex quadratic objective function with structure induced by the dynamic model. These classical formulations fall short in many important circumstances. For instance, smoothers obtained using quadratic penalties can fail when outliers are present in the data, and cannot track impulsive inputs and abrupt state changes. Motivated by these shortcomings, generalized Kalman smoothing formulations have been proposed in the last few years, replacing quadratic models with more suitable, often nonsmooth, convex functions. In contrast to classical models, these general estimators require use of iterated algorithms, and these have received increased attention from control, signal processing, machine learning, and optimization communities. In this survey we show that the optimization viewpoint provides the control and signal processing community great freedom in the development of novel modeling and inference frameworks for dynamical systems. We discuss general statistical models for dynamic systems, making full use of nonsmooth convex penalties and constraints, and providing links to important models in signal processing and machine learning. We also survey optimization techniques for these formulations, paying close attention to dynamic problem structure. Modeling concepts and algorithms are illustrated with numerical examples. …

TreeGAN
Generative Adversarial Networks (GANs) have shown great capacity on image generation, in which a discriminative model guides the training of a generative model to construct images that resemble real images. Recently, GANs have been extended from generating images to generating sequences (e.g., poems, music and codes). Existing GANs on sequence generation mainly focus on general sequences, which are grammar-free. In many real-world applications, however, we need to generate sequences in a formal language with the constraint of its corresponding grammar. For example, to test the performance of a database, one may want to generate a collection of SQL queries, which are not only similar to the queries of real users, but also follow the SQL syntax of the target database. Generating such sequences is highly challenging because both the generator and discriminator of GANs need to consider the structure of the sequences and the given grammar in the formal language. To address these issues, we study the problem of syntax-aware sequence generation with GANs, in which a collection of real sequences and a set of pre-defined grammatical rules are given to both discriminator and generator. We propose a novel GAN framework, namely TreeGAN, to incorporate a given Context-Free Grammar (CFG) into the sequence generation process. In TreeGAN, the generator employs a recurrent neural network (RNN) to construct a parse tree. Each generated parse tree can then be translated to a valid sequence of the given grammar. The discriminator uses a tree-structured RNN to distinguish the generated trees from real trees. We show that TreeGAN can generate sequences for any CFG and its generation fully conforms with the given syntax. Experiments on synthetic and real data sets demonstrated that TreeGAN significantly improves the quality of the sequence generation in context-free languages. …

Trigger Detection Dynamic Memory Network (TD-DMN)
The task of event detection involves identifying and categorizing event triggers. Contextual information has been shown effective on the task. However, existing methods which utilize contextual information only process the context once. We argue that the context can be better exploited by processing the context multiple times, allowing the model to perform complex reasoning and to generate better context representation, thus improving the overall performance. Meanwhile, dynamic memory network (DMN) has demonstrated promising capability in capturing contextual information and has been applied successfully to various tasks. In light of the multi-hop mechanism of the DMN to model the context, we propose the trigger detection dynamic memory network (TD-DMN) to tackle the event detection problem. We performed a five-fold cross-validation on the ACE-2005 dataset and experimental results show that the multi-hop mechanism does improve the performance and the proposed model achieves best $F_1$ score compared to the state-of-the-art methods. …