Learning to Recommend with Missing Modalities (LRMM)
Multimodal learning has shown promising performance in content-based recommendation due to the auxiliary user and item information of multiple modalities such as text and images. However, the problem of incomplete and missing modalities is rarely explored, and most existing methods fail to learn a recommendation model when modalities are missing or corrupted. In this paper, we propose LRMM, a novel framework that mitigates not only the problem of missing modalities but also, more generally, the cold-start problem of recommender systems. We propose modality dropout (m-drop) and a multimodal sequential autoencoder (m-auto) to learn multimodal representations for complementing and imputing missing modalities. Extensive experiments on real-world Amazon data show that LRMM achieves state-of-the-art performance on rating prediction tasks. More importantly, LRMM is more robust than previous methods in alleviating data sparsity and the cold-start problem. …
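The abstract does not spell out how m-drop works, so the following is only a minimal sketch of the general idea, assuming that "modality dropout" means zeroing out whole modality vectors at random during training so the model learns to cope with missing inputs. The function name `modality_dropout`, the dictionary layout, and the rule that at least one modality is always kept are illustrative assumptions, not details from the paper.

```python
import random


def modality_dropout(modalities, p_drop, rng):
    """Zero out each modality vector with probability p_drop.

    `modalities` maps a modality name to a list of floats.
    Assumption (not from the paper): at least one modality is always kept.
    Returns a new dict; the input is left unchanged.
    """
    names = list(modalities)
    dropped = [name for name in names if rng.random() < p_drop]
    if names and len(dropped) == len(names):
        # Everything was selected for dropping: spare one at random.
        dropped.remove(rng.choice(dropped))
    return {
        name: [0.0] * len(vec) if name in dropped else list(vec)
        for name, vec in modalities.items()
    }


if __name__ == "__main__":
    features = {"text": [0.2, 0.7], "image": [0.9], "rating": [4.0]}
    print(modality_dropout(features, p_drop=0.5, rng=random.Random(0)))
```

At prediction time no dropout would be applied; a genuinely missing modality would simply arrive as a zero vector, which is the situation the training-time masking is meant to imitate.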
Motion Transformation Variational Auto-Encoder (MT-VAE)
Long-term human motion can be represented as a series of motion modes (motion sequences that capture short-term temporal dynamics) with transitions between them. We leverage this structure and present a novel Motion Transformation Variational Auto-Encoder (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding for motion modes (from which the motion sequence can be reconstructed) and a feature transformation that represents the transition from one motion mode to the next. Our model is able to generate multiple diverse and plausible future motion sequences from the same input. We apply our approach to both facial and full-body motion, and demonstrate applications such as analogy-based motion transfer and video synthesis. …
Neural Basis Expansion Analysis for Interpretable Time Series Forecasting (N-BEATS)
We focus on solving the univariate time series point forecasting problem using deep learning. We propose a deep neural architecture based on backward and forward residual links and a very deep stack of fully-connected layers. The architecture has a number of desirable properties, being interpretable, applicable without modification to a wide array of target domains, and fast to train. We test the proposed architecture on the well-known M4 competition dataset containing 100k time series from diverse domains. We demonstrate state-of-the-art performance for two configurations of N-BEATS, improving forecast accuracy by 11% over a statistical benchmark and by 3% over last year’s winner of the M4 competition, a domain-adjusted hand-crafted hybrid between neural network and statistical time series models. The first configuration of our model does not employ any time-series-specific components and its performance on the M4 dataset strongly suggests that, contrary to received wisdom, deep learning primitives such as residual blocks are by themselves sufficient to solve a wide range of forecasting problems. Finally, we demonstrate how the proposed architecture can be augmented to provide outputs that are interpretable without loss in accuracy. …
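The "backward and forward residual links" can be illustrated with a small sketch. It assumes the usual reading of the architecture: each block emits a backcast (its explanation of the input window) and a partial forecast; the backcast is subtracted from the input before it reaches the next block, and the partial forecasts are summed. The helper names `run_stack` and `mean_block` are hypothetical, and `mean_block` is a toy stand-in for a learned fully-connected block.

```python
def run_stack(blocks, history):
    """Chain blocks with backward (backcast) and forward (forecast) residual links.

    Each block is a callable: window -> (backcast, partial_forecast).
    Returns the final residual window and the summed forecast
    (None if `blocks` is empty).
    """
    residual = list(history)
    forecast = None
    for block in blocks:
        backcast, partial = block(residual)
        # Backward link: remove what this block already explained.
        residual = [r - b for r, b in zip(residual, backcast)]
        # Forward link: accumulate the partial forecasts.
        if forecast is None:
            forecast = list(partial)
        else:
            forecast = [f + p for f, p in zip(forecast, partial)]
    return residual, forecast


def mean_block(horizon):
    """Toy block (assumption, not a learned layer): models the window by its mean."""
    def block(window):
        level = sum(window) / len(window)
        return [level] * len(window), [level] * horizon
    return block


if __name__ == "__main__":
    leftover, prediction = run_stack([mean_block(2), mean_block(2)], [1.0, 2.0, 3.0])
    print(leftover, prediction)
```

In this toy run the first block absorbs the level of the series, so the second block sees a zero-mean residual and contributes nothing further to the forecast.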
CoordConv
Uber uses convolutional neural networks in many domains that could potentially involve coordinate transforms, from designing self-driving vehicles to automating street sign detection to build maps to maximizing the efficiency of spatial movements in the Uber Marketplace. In deep learning, few ideas have had as much impact as convolution. Almost all state-of-the-art results in machine vision make use of stacks of convolutional layers as basic building blocks. Since such architectures are widespread, we should expect that they excel at simple tasks like painting a single pixel in a tiny image, right? Surprisingly, it turns out that convolution often has difficulty completing seemingly trivial tasks. In our paper, An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, we expose and analyze a generic inability of convolutional neural networks (CNNs) to transform spatial representations between two different types: coordinates in (i, j) Cartesian space and coordinates in one-hot pixel space. It's surprising because the task appears so simple, and it may be important because such coordinate transforms seem to be required to solve many common tasks, like detecting objects in images, training generative models of images, and training reinforcement learning (RL) agents from pixels. It turns out that these tasks may have subtly suffered from this failing of convolution all along, as suggested by performance improvements we demonstrate across several domains when using the solution we propose, a layer called CoordConv. …
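The excerpt names the CoordConv layer without describing it. As a rough sketch, assuming the commonly described formulation in which extra channels holding each pixel's i and j coordinates (scaled to [-1, 1]) are concatenated to the input before an ordinary convolution, the coordinate-augmentation step might look like this. The function name `add_coord_channels` and the nested-list image layout are illustrative choices; the convolution itself is omitted.

```python
def add_coord_channels(image):
    """Append row (i) and column (j) coordinate channels to an image.

    `image` is a list of channels, each an H x W nested list of floats.
    Coordinates are scaled linearly to [-1, 1]; a size-1 axis maps to 0.0.
    Returns a new list with two extra channels: [..., i_channel, j_channel].
    """
    height, width = len(image[0]), len(image[0][0])

    def scale(index, size):
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    i_channel = [[scale(i, height)] * width for i in range(height)]
    j_channel = [[scale(j, width) for j in range(width)] for _ in range(height)]
    copied = [[list(row) for row in channel] for channel in image]
    return copied + [i_channel, j_channel]


if __name__ == "__main__":
    one_hot = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]  # one channel, 2 x 3
    for channel in add_coord_channels(one_hot):
        print(channel)
```

With these channels in place, a convolution filter can read a pixel's position directly from its input, which is the information the coordinate-transform tasks above appear to need.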
If you did not already know
01 Wednesday Feb 2023
Posted in What is ...