Stochastic Conjugate Gradient Algorithm with Variance Reduction (CGVR)
Conjugate gradient methods are an important class of methods for solving linear systems of equations and nonlinear optimization problems. In our work, we propose a new stochastic conjugate gradient algorithm with variance reduction (CGVR) and prove its linear convergence with the Fletcher and Reeves method for strongly convex and smooth functions. We experimentally demonstrate that the CGVR algorithm converges faster than its counterparts on six large-scale optimization problems that may be convex, non-convex or non-smooth, and that its AUC (Area Under Curve) performance with $L2$-regularized $L2$-loss is comparable to that of LIBLINEAR, with a significant improvement in computational efficiency. …
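The abstract gives no implementation, but the core idea can be illustrated as an SVRG-style variance-reduced stochastic gradient combined with a Fletcher-Reeves conjugate direction. The following is a minimal NumPy sketch, not the authors' code: the function names, the single-sample gradient interface, and the fixed step size `alpha` are assumptions made purely for illustration (the paper itself analyses convergence under strong convexity and smoothness).

```python
import numpy as np

def cgvr_epoch(w, full_grad_fn, stoch_grad_fn, n_samples, m, alpha, rng):
    """One outer iteration of a CGVR-style update (hypothetical sketch).

    w             -- current parameter vector (1-D NumPy array)
    full_grad_fn  -- gradient over the full dataset: full_grad_fn(w)
    stoch_grad_fn -- gradient of a single sample: stoch_grad_fn(w, i)
    m             -- number of inner iterations
    alpha         -- fixed step size (a line search could be used instead)
    """
    w_snap = w.copy()
    mu = full_grad_fn(w_snap)          # full gradient at the snapshot (SVRG anchor)
    g_prev = mu.copy()
    d = -g_prev                        # initial direction: steepest descent
    for _ in range(m):
        i = rng.integers(n_samples)
        # variance-reduced stochastic gradient
        g = stoch_grad_fn(w, i) - stoch_grad_fn(w_snap, i) + mu
        # Fletcher-Reeves coefficient: beta_k = ||g_k||^2 / ||g_{k-1}||^2
        beta = (g @ g) / (g_prev @ g_prev + 1e-12)
        d = -g + beta * d              # conjugate direction update
        w = w + alpha * d
        g_prev = g
    return w
```

The Fletcher-Reeves coefficient $\beta_k = \|g_k\|^2 / \|g_{k-1}\|^2$ is what makes the search direction conjugate-gradient-like rather than plain variance-reduced SGD.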

Continuation Multiple Instance Learning (C-MIL)
Weakly supervised object detection (WSOD) is a challenging task in which only image-level category supervision is provided, yet object locations and object detectors must be learned simultaneously. Many WSOD approaches adopt multiple instance learning (MIL) and have non-convex loss functions that are prone to getting stuck in local minima (falsely localizing object parts) while missing the full object extent during training. In this paper, we introduce a continuation optimization method into MIL, thereby creating continuation multiple instance learning (C-MIL), with the intention of alleviating the non-convexity problem in a systematic way. We partition instances into spatially related and class-related subsets, and approximate the original loss function with a series of smoothed loss functions defined within the subsets. Optimizing the smoothed loss functions prevents the training procedure from falling prematurely into local minima and facilitates the discovery of Stable Semantic Extremal Regions (SSERs), which indicate the full object extent. On the PASCAL VOC 2007 and 2012 datasets, C-MIL improves the state of the art of weakly supervised object detection and weakly supervised object localization by large margins. …
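C-MIL's smoothed losses are defined over spatially related and class-related instance subsets, which the abstract does not describe in enough detail to reproduce. The PyTorch sketch below therefore shows only the generic continuation idea: replace the hard max over instance scores with a temperature-controlled soft aggregation and anneal the temperature during training, so the optimized loss gradually approaches the original non-convex MIL objective. The function names and the linear annealing schedule are illustrative assumptions, not the authors' method.

```python
import torch

def smoothed_bag_score(instance_scores, temperature):
    """Soft aggregation of instance scores within a bag.

    As temperature -> 0 this approaches the hard max used by standard MIL;
    larger temperatures give a smoother (easier to optimize) surrogate.
    instance_scores: (num_instances,) tensor of scores for one bag and class.
    """
    return temperature * torch.logsumexp(instance_scores / temperature, dim=0)

def continuation_schedule(epoch, num_epochs, t_start=1.0, t_end=0.05):
    """Anneal the smoothing parameter over training (linear schedule)."""
    frac = epoch / max(num_epochs - 1, 1)
    return t_start + frac * (t_end - t_start)
```

In training, the bag-level loss would be computed from `smoothed_bag_score` at the temperature returned by `continuation_schedule(epoch, num_epochs)`, so early epochs optimize a smooth surrogate and later epochs a sharper, near-max objective.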

Feedback Particle Filter (FPF)
A new formulation of the particle filter for nonlinear filtering is presented, based on concepts from optimal control and from mean-field game theory. The optimal control is chosen so that the posterior distribution of a particle matches as closely as possible the posterior distribution of the true state given the observations. This is achieved by introducing a cost function, defined by the Kullback-Leibler (K-L) divergence between the actual posterior and the posterior of any particle. The optimal control input is characterized by a certain Euler-Lagrange (E-L) equation and is shown to admit an innovation error-based feedback structure. For diffusions with continuous observations, the optimal control solution is exact: the two posteriors match exactly, provided they are initialized with identical priors. The feedback particle filter is defined by a family of stochastic systems, each evolving under this optimal control law. A numerical algorithm is introduced and implemented in two general examples and in a neuroscience application involving coupled oscillators. Some preliminary numerical comparisons between the feedback particle filter and the bootstrap particle filter are described.
Error Analysis of the Stochastic Linear Feedback Particle Filter
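To make the innovation error-based feedback structure concrete, here is a hypothetical discrete-time (Euler-Maruyama) step of a scalar feedback particle filter in NumPy, using the constant-gain approximation: the gain that in general solves the E-L/Poisson equation is replaced by a covariance ratio, which reduces to the Kalman gain in the linear-Gaussian case. The model parameters `a`, `sigma_b`, `h`, and `sigma_w` are assumptions for illustration, not taken from the papers.

```python
import numpy as np

def fpf_step(particles, dz, dt, a, sigma_b, h, sigma_w, rng):
    """One Euler-Maruyama step of a feedback particle filter (constant-gain sketch).

    Assumed scalar model: dX = a X dt + sigma_b dB,  dZ = h(X) dt + sigma_w dW.
    particles -- 1-D array of particle states
    dz        -- observation increment over the interval dt
    h         -- vectorized observation function
    """
    hx = h(particles)
    h_hat = hx.mean()
    # constant-gain approximation of the FPF gain (exact only in the
    # linear-Gaussian case, where it equals the Kalman gain)
    gain = np.cov(particles, hx)[0, 1] / sigma_w**2
    # innovation error with the characteristic (h(X_i) + h_hat)/2 term
    innovation = dz - 0.5 * (hx + h_hat) * dt
    noise = sigma_b * np.sqrt(dt) * rng.standard_normal(particles.shape)
    return particles + a * particles * dt + noise + gain * innovation
```

Unlike the bootstrap particle filter, there is no importance weighting or resampling step: every particle is steered by the same feedback control law driven by its own innovation error.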

Batch Sampling
Deep Neural Networks (DNNs) have thrived in recent years, and Batch Normalization (BN) has played an indispensable role in this success. However, it has been observed that BN is costly due to its reduction operations. In this paper, we propose alleviating this problem by sampling only a small fraction of the data for normalization at each iteration. Specifically, we model it as a statistical sampling problem and show that by sampling less correlated data, we can greatly reduce the number of data points required for statistics estimation in BN, which directly simplifies the reduction operations. Based on this conclusion, we propose two sampling strategies, ‘Batch Sampling’ (randomly select several samples from each batch) and ‘Feature Sampling’ (randomly select a small patch from each feature map of all samples), that take both computational efficiency and sample correlation into consideration. Furthermore, we introduce an extremely simple variant of BN, termed Virtual Dataset Normalization (VDN), that can normalize the activations well with a few synthetic random samples. All the proposed methods are evaluated on various datasets and networks, where an overall training speedup of up to 20% on GPU is achieved in practice without the support of any specialized libraries, and the losses in accuracy and convergence rate are negligible. Finally, we extend our work to the ‘micro-batch normalization’ problem and achieve performance comparable to existing approaches in the case of tiny batch sizes. …
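As a rough illustration of the ‘Batch Sampling’ strategy (not the authors' implementation), the PyTorch sketch below estimates the per-channel mean and variance from a random subset of the batch and then normalizes every example with those statistics. The affine parameters, running statistics, and the ‘Feature Sampling’ and VDN variants are omitted; the function name and the `sample_size` argument are assumptions.

```python
import torch

def batch_sampled_bn(x, sample_size, eps=1e-5):
    """Normalize activations using statistics from a random subset of the batch.

    x: (N, C, H, W) activations. Only `sample_size` examples contribute to the
    mean/variance estimate, which shrinks the reduction work; the normalization
    itself is still applied to every example in the batch.
    """
    n = x.size(0)
    idx = torch.randperm(n)[:min(sample_size, n)]
    sub = x[idx]
    mean = sub.mean(dim=(0, 2, 3), keepdim=True)
    var = sub.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

# Example usage: normalize a batch of 64 using statistics from 8 random samples.
y = batch_sampled_bn(torch.randn(64, 32, 16, 16), sample_size=8)
```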