Dynamic Mixed Training
There is an arms race to defend neural networks against adversarial examples. Notably, adversarially robust training and verifiably robust training are the most promising defenses. Adversarially robust training scales well but cannot provide a provable guarantee of the absence of attacks. We present an Interval Attack that reveals fundamental problems with the threat model used by adversarially robust training. In contrast, verifiably robust training achieves sound guarantees, but it is computationally expensive and sacrifices accuracy, which prevents it from being applied in practice. In this paper, we propose two novel techniques for verifiably robust training, stochastic output approximation and dynamic mixed training, to solve the aforementioned challenges. They are based on two critical insights: (1) soundness is only needed in a subset of training data; and (2) verifiable robustness and test accuracy become conflicting goals after a certain point of verifiably robust training. On both MNIST and CIFAR datasets, we are able to achieve similar test accuracy and estimated robust accuracy against PGD attacks with $14\times$ less training time compared to state-of-the-art adversarially robust training techniques. In addition, we have up to 95.2% verified robust accuracy as a bonus. Also, to achieve similar verified robust accuracy, we are able to reduce computation time by up to $5\times$ and offer a 9.2% test accuracy improvement compared to current state-of-the-art verifiably robust training techniques. …
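
To make the notion of a "sound guarantee" concrete, here is a minimal sketch of plain interval bound propagation through one linear layer followed by ReLU. It is not the method of the paper summarized above; it is a generic illustration, and the function names `linear_bounds` and `relu_bounds` are hypothetical. Given elementwise bounds on the input, it returns bounds that every reachable output is guaranteed to respect.

```python
def linear_bounds(W, b, lo, hi):
    """Sound output bounds of y = W x + b for any x with lo <= x <= hi (elementwise)."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        # A positive weight reaches its minimum at the input's lower bound,
        # a negative weight at the input's upper bound (and vice versa for the maximum).
        l = bias + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row))
        h = bias + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row))
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi


def relu_bounds(lo, hi):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


# Illustrative weights and input box (assumed values, not from any dataset).
W = [[1.0, -1.0], [2.0, 0.5]]
b = [0.0, -1.0]
pre_lo, pre_hi = linear_bounds(W, b, [0.0, 0.0], [1.0, 1.0])
post_lo, post_hi = relu_bounds(pre_lo, pre_hi)
```

If the upper bound of every wrong-class output stays below the lower bound of the correct class, no input inside the box can change the prediction. The bounds are typically loose, which is one reason verifiable training tends to cost accuracy.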

SAdam
The Adam algorithm has become extremely popular for large-scale machine learning. Under the convexity condition, it has been proved to enjoy a data-dependent $O(\sqrt{T})$ regret bound, where $T$ is the time horizon. However, whether strong convexity can be utilized to further improve the performance remains an open problem. In this paper, we give an affirmative answer by developing a variant of Adam (referred to as SAdam) which achieves a data-dependent $O(\log T)$ regret bound for strongly convex functions. The essential idea is to maintain a faster-decaying yet well-controlled step size to exploit strong convexity. In addition, under a special configuration of hyperparameters, our SAdam reduces to SC-RMSprop, a recently proposed variant of RMSprop for strongly convex functions, for which we provide the first data-dependent logarithmic regret bound. Empirical results on optimizing strongly convex functions and training deep networks demonstrate the effectiveness of our method. …
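
The abstract does not state the update rule, so the following is only a sketch of the "faster-decaying step size" idea under explicit assumptions: a step size of `alpha / t`, a second-moment decay of `1 - 1/t`, and division by the second moment itself (plus `delta / t`) rather than by its square root. The function name and all default values are hypothetical and chosen for a one-dimensional toy problem.

```python
def adam_like_decaying_step(grad, x0, steps=2000, alpha=1.0, beta1=0.9, delta=1.0):
    """Adam-style loop with an alpha/t step size (assumed form, not the paper's exact rule)."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        beta2 = 1.0 - 1.0 / t                    # assumed schedule: v becomes the running mean of g^2
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate, as in Adam
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        x -= (alpha / t) * m / (v + delta / t)   # no square root on v (assumption)
    return x


# Strongly convex toy objective f(x) = 0.5 * (x - 1)^2, whose gradient is x - 1.
x_min = adam_like_decaying_step(lambda x: x - 1.0, x0=0.0)
```

On this toy objective the iterate settles at the minimizer `x = 1`. The sketch makes no claim about regret bounds, which depend on details the abstract does not give.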

Sum-Product Network (SPN)
Sum-product networks (SPNs) represent an emerging class of neural networks with clear probabilistic semantics and superior inference speed over graphical models. This work reveals a strikingly intimate connection between SPNs and tensor networks, thus leading to a highly efficient representation that we call tensor SPNs (tSPNs). For the first time, through mapping an SPN onto a tSPN and employing novel optimization techniques, we demonstrate remarkable parameter compression with negligible loss in accuracy. …
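
For readers new to SPNs, the "clear probabilistic semantics" can be illustrated with a tiny hand-built network over two binary variables. This sketch shows an ordinary SPN only, not the tensor SPN representation mentioned above; the helper names `leaf`, `product_node`, and `sum_node` and all weights are hypothetical.

```python
from itertools import product


def leaf(var, p_true):
    """Bernoulli leaf: probability of the value assigned to one binary variable."""
    return lambda a: p_true if a[var] else 1.0 - p_true


def product_node(children):
    """Product node: children cover disjoint variables, so their values multiply."""
    def evaluate(a):
        result = 1.0
        for child in children:
            result *= child(a)
        return result
    return evaluate


def sum_node(weighted_children):
    """Sum node: a mixture of children whose non-negative weights sum to 1."""
    return lambda a: sum(w * child(a) for w, child in weighted_children)


# A two-component mixture of fully factorized distributions over x and y.
spn = sum_node([
    (0.6, product_node([leaf("x", 0.9), leaf("y", 0.2)])),
    (0.4, product_node([leaf("x", 0.3), leaf("y", 0.7)])),
])

# One bottom-up pass gives the probability of a complete assignment.
p_xy = spn({"x": True, "y": True})

# Enumerating every assignment confirms that the network is normalized.
total = sum(spn({"x": x, "y": y}) for x, y in product([False, True], repeat=2))
```

Each query costs a single pass over the network's nodes, which is the source of the inference speed that SPNs are known for.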

Multilevel Wavelet Decomposition Network (mWDN)
Recent years have witnessed an unprecedented rise of time series data from almost all kinds of academic and industrial fields. Various types of deep neural network models have been introduced to time series analysis, but the important frequency information still lacks effective modeling. In light of this, in this paper we propose a wavelet-based neural network structure called multilevel Wavelet Decomposition Network (mWDN) for building frequency-aware deep learning models for time series analysis. mWDN preserves the advantage of multilevel discrete wavelet decomposition in frequency learning while enabling the fine-tuning of all parameters under a deep neural network framework. Based on mWDN, we further propose two deep learning models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM) for time series classification and forecasting, respectively. The two models take all or some of the mWDN-decomposed sub-series in different frequencies as input, and resort to the back propagation algorithm to learn all the parameters globally, which enables seamless embedding of wavelet-based frequency analysis into deep learning frameworks. Extensive experiments on 40 UCR datasets and a real-world user volume dataset demonstrate the excellent performance of our time series models based on mWDN. In particular, we propose an importance analysis method for mWDN-based models, which successfully identifies those time-series elements and mWDN layers that are crucially important to time series analysis. This indicates the interpretability advantage of mWDN, and can be viewed as an in-depth exploration of interpretable deep learning. …
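
The multilevel discrete wavelet decomposition that mWDN builds on can be sketched with fixed Haar filters. This is only an illustration of the classical decomposition: the abstract describes the mWDN parameters as trainable, whereas the filters here are constant, and the choice of the Haar wavelet and the function names are assumptions.

```python
import math


def haar_step(x):
    """One decomposition level: low-pass and high-pass filtering, then downsampling by 2."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]   # approximation (low frequency)
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]  # detail (high frequency)
    return low, high


def multilevel_decompose(x, levels):
    """Repeatedly decompose the low-frequency sub-series.

    Assumes len(x) is divisible by 2 ** levels.
    Returns the final approximation and the detail sub-series of every level.
    """
    details = []
    low = list(x)
    for _ in range(levels):
        low, high = haar_step(low)
        details.append(high)
    return low, details


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
approx, details = multilevel_decompose(series, levels=2)
```

Each level halves the length of the sub-series, so a series of 8 points yields detail sub-series of lengths 4 and 2 plus an approximation of length 2. Sub-series of this kind, at different frequencies, are what the classification and forecasting models described above take as input.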