Uneven Group Convolution
In this paper, we aim to boost the representation capability of convolutional neural networks that use the inverted residual structure. Building on the success of the Inverted Residual structure [Sandler et al. 2018] and Interleaved Low-Rank Group Convolutions [Sun et al. 2018], we rethink these two network structure patterns. Rather than relying on NAS (neural architecture search) methods [Zoph and Le 2017; Pham et al. 2018; Liu et al. 2018b], we introduce uneven point-wise group convolution, which provides a novel search space for designing basic blocks that achieve a better trade-off between representation capability and computational cost. We also propose two novel information flow patterns that enable cross-group information flow across multiple group convolution layers, with and without any channel permute/shuffle operation. Extensive experiments on image classification show that our proposed model, named Seesaw-Net, achieves state-of-the-art (SOTA) performance with limited computation and memory cost. Our code will be open-sourced and made available together with pre-trained models. …
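
To make the cost side of that trade-off concrete, the sketch below counts the weights of a point-wise (1x1) group convolution for a given channel split. It is only an illustration, assuming "uneven" means that the groups have unequal channel widths; the function name `pointwise_group_params` and the 16/48 split are hypothetical and are not taken from the Seesaw-Net paper.

```python
def pointwise_group_params(in_splits, out_splits):
    """Weight count of a 1x1 group convolution (bias ignored).

    in_splits[i] and out_splits[i] are the input and output channel
    counts of group i. Each group is an independent 1x1 convolution,
    so it contributes in_splits[i] * out_splits[i] weights.
    """
    if len(in_splits) != len(out_splits):
        raise ValueError("each group needs one input and one output width")
    return sum(i * o for i, o in zip(in_splits, out_splits))


# 64 input and 64 output channels, arranged three ways.
dense = pointwise_group_params([64], [64])            # one group: 4096
even = pointwise_group_params([32, 32], [32, 32])     # even split: 2048
uneven = pointwise_group_params([16, 48], [16, 48])   # hypothetical uneven split: 2560

print(dense, even, uneven)
```

For a 1x1 convolution the same numbers also give the multiply-accumulate count per spatial position, so under this reading an uneven split sits between the even split and the dense layer in cost.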

MQGrad
One of the most significant bottlenecks in training large-scale machine learning models on a parameter server (PS) is the communication overhead, because the model gradients must be exchanged frequently between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. One key issue in gradient quantization is setting the number of bits used to quantize the gradients. A small number of bits can significantly reduce the communication overhead but hurts the gradient accuracy, and vice versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy by adjusting the number of bits according to the knowledge learned from the immediately preceding training iterations. Existing methods, however, quantize the gradients either with a fixed number of bits or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP), where the MDP states record the information collected from the past optimization iterations (e.g., the sequence of loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to the changes in the loss function. Based on this information, MQGrad learns to select the optimal actions (numbers of bits) for quantizing the gradients. Experimental results on a benchmark dataset showed that MQGrad can accelerate the learning of a large-scale deep neural network while maintaining its prediction accuracy. …
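
The trade-off between bit width and gradient accuracy can be illustrated with a plain uniform quantizer. This is a minimal sketch under stated assumptions: the abstract does not say which quantizer MQGrad uses, so the min/max uniform scheme and the names `quantize_uniform`, `max_abs_error` and `payload_bits` are hypothetical, and the reinforcement learning part that chooses the bit width is not shown.

```python
def quantize_uniform(grad, num_bits):
    """Quantize a list of floats onto 2**num_bits evenly spaced levels.

    Returns the dequantized values and the spacing between levels.
    Assumed scheme for illustration only, not the MQGrad quantizer.
    """
    if num_bits < 1:
        raise ValueError("num_bits must be at least 1")
    lo, hi = min(grad), max(grad)
    if hi == lo:
        # All entries are equal, so nothing is lost by quantizing.
        return list(grad), 0.0
    step = (hi - lo) / (2 ** num_bits - 1)
    codes = [round((g - lo) / step) for g in grad]  # integers that would be sent
    return [lo + c * step for c in codes], step


def max_abs_error(original, quantized):
    """Largest absolute difference between two equal-length lists."""
    return max(abs(a - b) for a, b in zip(original, quantized))


def payload_bits(grad, num_bits):
    """Bits needed for the codes alone (range metadata ignored)."""
    return len(grad) * num_bits


grad = [-1.0, -0.3, 0.0, 0.45, 1.0]
for bits in (2, 4, 8):
    dequantized, step = quantize_uniform(grad, bits)
    print(bits, payload_bits(grad, bits), max_abs_error(grad, dequantized))
```

Each value is rounded to its nearest level, so the error is bounded by half the level spacing: fewer bits mean a smaller payload but a coarser gradient, which is the balance an adaptive method has to strike at each iteration.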

Huber Loss
In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. A variant for classification is also sometimes used. …
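
As an illustration of why the Huber loss is less sensitive to outliers, the sketch below uses its usual piecewise definition: quadratic for residuals with magnitude up to a threshold, linear beyond it. The threshold name `delta` and its default of 1.0 are conventional choices assumed here; the entry above does not specify them.

```python
def squared_loss(residual):
    """Squared error loss with the conventional 1/2 factor."""
    return 0.5 * residual * residual


def huber_loss(residual, delta=1.0):
    """Huber loss for a single residual (observed minus predicted).

    Quadratic when |residual| <= delta, linear otherwise. The two
    pieces meet with equal value and slope at |residual| == delta.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual * residual
    return delta * (abs_r - 0.5 * delta)


# A small residual is penalized identically by both losses,
# while a large residual (an outlier) grows only linearly under Huber.
for r in (0.5, 3.0, 10.0):
    print(r, squared_loss(r), huber_loss(r))
```

With `delta=1.0`, a residual of 3.0 costs 4.5 under the squared loss but 2.5 under the Huber loss, and the gap widens as the residual grows.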

Dynamic Generative Memory (DGM)
Models trained in the context of continual learning (CL) should be able to learn from a stream of data over an undefined period of time. The main challenges herein are: 1) maintaining old knowledge while simultaneously benefiting from it when learning new tasks, and 2) guaranteeing model scalability with a growing amount of data to learn from. To tackle these challenges, we introduce Dynamic Generative Memory (DGM) – a synaptic-plasticity-driven framework for continual learning. DGM relies on conditional generative adversarial networks with learnable connection plasticity realized with neural masking. Specifically, we evaluate two variants of neural masking: applied (i) to layer activations and (ii) directly to connection weights. Furthermore, we propose a dynamic network expansion mechanism that ensures sufficient model capacity to accommodate continually incoming tasks. The amount of added capacity is determined dynamically from the learned binary mask. We evaluate DGM in the continual class-incremental setup on visual classification tasks. …
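
The two masking variants can be pictured with a toy example. This is a rough sketch only: the helper names are hypothetical, the masks are hard-coded rather than learned, and `reserved_units` is merely one plausible quantity an expansion rule could read from a binary mask. The abstract does not give DGM's actual rule.

```python
def mask_activations(activations, mask):
    """Variant (i): zero out whole units by masking a layer's outputs."""
    return [a * m for a, m in zip(activations, mask)]


def mask_weights(weights, mask):
    """Variant (ii): zero out individual connections in a weight matrix."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def reserved_units(mask):
    """Number of units a binary activation mask marks as in use.

    Hypothetical input to an expansion rule, assumed for illustration.
    """
    return sum(mask)


activations = [0.5, 1.0, 2.0]
unit_mask = [1, 0, 1]                  # binary mask over units
weights = [[0.2, 0.4], [0.6, 0.8]]
connection_mask = [[1, 0], [0, 1]]     # binary mask over connections

print(mask_activations(activations, unit_mask))
print(mask_weights(weights, connection_mask))
print(reserved_units(unit_mask))
```

Masking activations switches entire units on or off, while masking weights acts on single connections and so allows a finer-grained allocation of capacity.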