Capacity saturation and catastrophic forgetting are the central challenges of any parametric lifelong learning system. In this work, we study these challenges in the context of sequential supervised learning with emphasis on recurrent neural networks. To evaluate the models in a lifelong learning setting, we propose a simple and intuitive curriculum-based benchmark where the models are trained on a task with increasing levels of difficulty. As a step towards developing true lifelong learning systems, we unify Gradient Episodic Memory (a catastrophic forgetting alleviation approach) and Net2Net (a capacity expansion approach). Evaluation on the proposed benchmark shows that the unified model is more suitable than the constituent models for the lifelong learning setting.
Success in natural language inference (NLI) should require a model to understand both lexical and compositional semantics. However, through adversarial evaluation, we find that several state-of-the-art models with diverse architectures are over-relying on the former and fail to use the latter. Further, this compositionality unawareness is not reflected via standard evaluation on current datasets. We show that removing RNNs in existing models or shuffling input words during training does not induce large performance loss despite the explicit removal of compositional information. Therefore, we propose a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone (i.e., on which a bag-of-words model gives a high probability to one wrong label), hence revealing the models’ actual compositionality awareness. We show that this setup not only highlights the limited compositional ability of current NLI models, but also differentiates model performance based on design, e.g., separating shallow bag-of-words models from deeper, linguistically-grounded tree-based models. Our evaluation setup is an important analysis tool: complementing currently existing adversarial and linguistically driven diagnostic evaluations, and exposing opportunities for future work on evaluating models’ compositional understanding.
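The selection rule behind the compositionality-sensitivity setup (keep examples on which a bag-of-words model gives a high probability to one wrong label) can be sketched as a simple filter. This is a minimal illustration only: the function name, the dictionary layout, the toy probabilities, and the 0.7 threshold are all assumptions made for the example, not values taken from the work.

```python
def select_compositionality_sensitive(examples, bow_probs, threshold=0.7):
    """Keep examples where a bag-of-words model assigns at least
    `threshold` probability to a single wrong label.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    selected = []
    for example, probs in zip(examples, bow_probs):
        # Probabilities the bag-of-words model gives to every non-gold label.
        wrong = [p for label, p in probs.items() if label != example["gold"]]
        if wrong and max(wrong) >= threshold:
            selected.append(example)
    return selected


# Hypothetical toy data: gold labels and bag-of-words output distributions.
examples = [
    {"id": 1, "gold": "entailment"},
    {"id": 2, "gold": "contradiction"},
    {"id": 3, "gold": "neutral"},
]
bow_probs = [
    {"entailment": 0.85, "neutral": 0.10, "contradiction": 0.05},
    {"entailment": 0.80, "neutral": 0.15, "contradiction": 0.05},
    {"entailment": 0.40, "neutral": 0.35, "contradiction": 0.25},
]
subset = select_compositionality_sensitive(examples, bow_probs)
```

Only the second example is retained here, since it is the only one on which the lexical model is confidently wrong.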
We consider an intermediary’s problem of dynamically matching demand and supply of heterogeneous types in a periodic-review fashion. More specifically, there are two disjoint sets of demand and supply types, and a reward associated with each possible matching of a demand type and a supply type. In each period, demand and supply of various types arrive in random quantities. The platform’s problem is to decide on the optimal matching policy to maximize the total discounted rewards minus costs, given that unmatched demand and supply will incur waiting or holding costs, and will be carried over to the next period (with abandonment). For this dynamic matching problem, we provide sufficient conditions on matching rewards such that the optimal matching policy follows a priority hierarchy among possible matching pairs. We show those conditions are satisfied by vertically and unidirectionally horizontally differentiated types, for which quality and distance determine priority, respectively. As a result of the priority property, the optimal matching policy boils down to a match-down-to threshold structure when considering a specific pair of demand and supply types in the priority hierarchy.
Symmetry, a central concept in understanding the laws of nature, has been used for centuries in physics, mathematics, and chemistry, to help make mathematical models tractable. Yet, despite its power, symmetry has not been used extensively in machine learning, until rather recently. In this article we show a general way to incorporate symmetries into machine learning models. We demonstrate this with a detailed analysis on a rather simple real world machine learning system – a neural network for classifying handwritten digits, lacking bias terms for every neuron. We demonstrate that ignoring symmetries can have dire over-fitting consequences, and that incorporating symmetry into the model reduces over-fitting, while at the same time reducing complexity, ultimately requiring less training data, and taking less time and resources to train.
Although many scalable event matching algorithms have been proposed for large-scale content-based networks, content-based publish/subscribe networks (especially for large-scale real-time systems) still suffer performance deterioration as the subscription scale increases. While subscription aggregation techniques can reduce the amount of subscription dissemination traffic and the subscription table size by exploiting the similarity among subscriptions, efficient subscription aggregation is not a trivial task. Previous research has proved that it is either an NP-complete or a co-NP-complete problem. In this paper, we propose DLS (Discrete Label Set), a novel subscription representation model, and design algorithms to map the traditional Boolean predicate model to the DLS model. Based on the DLS model, we propose a subscription aggregation algorithm with O(1) time complexity in most cases, and an event matching algorithm with O(1) time complexity. This significant performance improvement comes at the cost of memory consumption and a controllable false positive rate. Our theoretical analysis shows that these algorithms are inherently scalable and can achieve real-time event matching in a large-scale content-based publish/subscribe network. We discuss the tradeoff between memory, false positive rate, and partition granules of the content space. Experimental results show that the proposed algorithms achieve the expected performance. As computer memory capacity increases and memory prices drop, more and more large-scale real-time applications can benefit from our proposed DLS model, such as stock quote distribution, earthquake monitoring, and severe weather alerts.
Few-shot deep learning is a topical challenge area for scaling visual recognition to open-ended growth in the space of categories to recognise. A promising line of work towards realising this vision is deep networks that learn to match queries with stored training images. However, methods in this paradigm usually train a deep embedding followed by a single linear classifier. Our insight is that effective general-purpose matching requires discrimination with regard to features at multiple abstraction levels. We therefore propose a new framework termed Deep Comparison Network (DCN) that decomposes embedding learning into a sequence of modules, and pairs each with a relation module. The relation modules compute a non-linear metric to score the match using the corresponding embedding module’s representation. To ensure that all embedding modules’ features are used, the relation modules are deeply supervised. Finally, generalisation is further improved by a learned noise regulariser. The resulting network achieves state-of-the-art performance on both miniImageNet and tieredImageNet, while retaining the appealing simplicity and efficiency of deep metric learning approaches.
One of the most prominent problems in machine learning in the age of deep learning is the availability of sufficiently large annotated datasets. While appropriate datasets exist for standard problem domains (ImageNet classification), for specific domains, e.g., classification of animal species, a long-tail distribution means that some classes are observed and annotated insufficiently. Challenges like iNaturalist show that there is a strong interest in species recognition. Acquiring additional labels can be prohibitively expensive: first, because domain experts need to be involved, and second, because the acquisition of new data might be costly. Although there exist methods for data augmentation, which do not always lead to better performance of the classifier, there is additional information available that is, to the best of our knowledge, not exploited accordingly. In this paper, we propose to make use of existing class hierarchies like WordNet to integrate additional domain knowledge into classification. We encode the properties of such a class hierarchy into a probabilistic model. From there, we derive a special label encoding together with a corresponding loss function. Using a convolutional neural network, our method offers relative improvements in accuracy over the baseline of 10.4% and 9.6% on the ImageNet and NABirds datasets, respectively. After less than a third of the training time, it is already able to match the baseline’s fine-grained recognition performance. Both results show that our suggested method is efficient and effective.
Object detection plays a vital role in both natural and aerial scenes and is full of challenges. Although many advanced algorithms have succeeded in natural scenes, progress in aerial scenes has been slow due to the complexity of aerial images and the large degree of freedom of remote sensing objects in scale, orientation, and density. In this paper, a novel multi-category rotation detector is proposed, which can efficiently detect small objects, arbitrarily oriented objects, and dense objects in complex remote sensing images. Specifically, the proposed model adopts a targeted feature fusion strategy called inception fusion network, which fully considers factors such as feature fusion, anchor sampling, and receptive field to improve the ability to handle small objects. Then we combine the pixel attention network and the channel attention network to suppress noise and highlight object features. Finally, the rotational object detection algorithm is realized by redefining the rotating bounding box. Experiments on public datasets including DOTA and NWPU VHR-10 demonstrate that the proposed algorithm significantly outperforms state-of-the-art methods. The code and models will be available at https://…/R2CNN-Plus-Plus_Tensorflow.
Currently, knowledge discovery in databases is an essential step to identify valid, novel and useful patterns for decision making. There are many real-world scenarios, such as bankruptcy prediction, option pricing or medical diagnosis, where the classification models to be learned need to fulfil restrictions of monotonicity (i.e. the target class label should not decrease when input attribute values increase). For instance, it is rational to assume that a higher debt ratio of a company should never result in a lower level of bankruptcy risk. Consequently, there is a growing interest from the data mining research community concerning monotonic predictive models. This paper aims to present an overview of the literature in the field, analyzing existing techniques and proposing a taxonomy of the algorithms based on the type of model generated. For each method, we review the quality metrics considered in the evaluation and the different data sets and monotonic problems used in the analysis. In this way, this paper serves as an overview of the research on monotonic classification in the specialized literature and can be used as a functional guide to the field.
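The monotonicity restriction stated above (the class label should not decrease when input attribute values increase) can be checked directly on labelled data. The following is a minimal sketch under that definition; the function names and the toy samples are hypothetical and are not drawn from any method reviewed in the survey.

```python
from itertools import combinations


def dominates(a, b):
    """True if every attribute of a is <= the matching attribute of b."""
    return all(x <= y for x, y in zip(a, b))


def monotonicity_violations(samples, labels):
    """Return index pairs (i, j) where sample i is componentwise <= sample j
    but label i is strictly greater than label j."""
    violations = []
    for i, j in combinations(range(len(samples)), 2):
        if dominates(samples[i], samples[j]) and labels[i] > labels[j]:
            violations.append((i, j))
        elif dominates(samples[j], samples[i]) and labels[j] > labels[i]:
            violations.append((j, i))
    return violations


# Hypothetical toy data: (debt ratio, leverage) -> risk level (0 = low, 2 = high).
samples = [(0.2, 0.1), (0.5, 0.3), (0.8, 0.6), (0.9, 0.7)]
labels = [0, 1, 2, 1]
violations = monotonicity_violations(samples, labels)
```

In this toy data the last sample has larger attribute values than the third but a lower risk label, so the pair `(2, 3)` is reported as a violation.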
Recently, three-dimensional (3D) convolutional neural networks (CNNs) have emerged as dominant methods to capture spatiotemporal representations, by adding to pre-existing 2D CNNs a third, temporal dimension. Such 3D CNNs, however, are anti-causal (i.e., they exploit information from both the past and the future to produce feature representations, thus preventing their use in online settings), constrain the temporal reasoning horizon to the size of the temporal convolution kernel, and are not temporal resolution-preserving for video sequence-to-sequence modelling, as, e.g., in spatiotemporal action detection. To address these serious limitations, we present a new architecture for the causal/online spatiotemporal representation of videos. Namely, we propose a recurrent convolutional network (RCN), which relies on recurrence to capture the temporal context across frames at every level of network depth. Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden state $1\times 1$ convolution applied across time. The hidden state at any time $t$ is assumed to depend on the hidden state at $t-1$ and on the current output of the spatial convolution component. As a result, the proposed network: (i) provides flexible temporal reasoning, (ii) produces causal outputs, and (iii) preserves temporal resolution. Our experiments on the large-scale ‘Kinetics’ dataset show that the proposed method achieves superior performance compared to 3D CNNs, while being causal and using fewer parameters.
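The recurrence described above, in which the hidden state at time $t$ depends on the hidden state at $t-1$ and on the current spatial convolution output, can be illustrated with a small pure-Python sketch. The abstract does not specify how the two terms are combined, so the additive form with a ReLU, the zero initial state, and all function names are assumptions made for illustration; the spatial convolution outputs are taken as given inputs.

```python
def conv1x1(weights, feature_map):
    """1x1 convolution: mix channels independently at every pixel.
    feature_map is indexed [channel][row][col]; weights is [out][in]."""
    in_channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [
            [
                sum(weights[c][k] * feature_map[k][i][j] for k in range(in_channels))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for c in range(len(weights))
    ]


def recurrent_step(spatial_out, hidden_prev, w_hidden):
    """Assumed update: h_t = relu(spatial_out_t + conv1x1(w_hidden, h_{t-1}))."""
    mixed = conv1x1(w_hidden, hidden_prev)
    return [
        [[max(0.0, s + m) for s, m in zip(s_row, m_row)]
         for s_row, m_row in zip(s_ch, m_ch)]
        for s_ch, m_ch in zip(spatial_out, mixed)
    ]


def run_sequence(spatial_outputs, w_hidden):
    """Process frames in order; each state uses only past and current frames."""
    first = spatial_outputs[0]
    hidden = [[[0.0 for _ in row] for row in channel] for channel in first]
    states = []
    for spatial_out in spatial_outputs:
        hidden = recurrent_step(spatial_out, hidden, w_hidden)
        states.append(hidden)
    return states


# Hypothetical toy input: three frames, one channel, one pixel each.
frames = [[[[1.0]]], [[[1.0]]], [[[1.0]]]]
w_hidden = [[0.5]]
states = run_sequence(frames, w_hidden)
```

The sketch returns one hidden state per input frame, and changing a later frame leaves earlier states untouched, which mirrors the causal and temporal resolution-preserving properties claimed for the architecture.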
Now that Bayesian Networks (BNs) have become widely used, an appreciation is developing of just how critical it is to be aware of the sensitivity and robustness of certain target variables to changes in the model. When time resources are limited, such issues impact directly on the chosen level of complexity of the BN as well as the quantity of missing probabilities we are able to elicit. Currently most such analyses are performed once the whole BN has been elicited and are based on Kullback-Leibler information measures. In this paper we argue that robustness methods based instead on the familiar total variation distance provide simple and more useful bounds on robustness to misspecification which are both formally justifiable and transparent. We demonstrate how such formal robustness considerations can be embedded within the process of building a BN. Here we focus on two particular choices a modeller needs to make: the choice of the parents of each node and the number of levels to choose for each variable within the system. Our analyses are illustrated throughout using two BNs drawn from the recent literature.
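To make the contrast between the two measures concrete, the sketch below computes the total variation distance and the Kullback-Leibler divergence between two discrete distributions. The example distributions are hypothetical and are not taken from the BNs analysed in the paper; the code only illustrates the standard definitions, not the paper's robustness bounds.

```python
import math


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.
    Returns infinity if q assigns zero mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue  # terms with p_i = 0 contribute nothing
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total


# Hypothetical distributions over a three-level variable.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.4, 0.1]

tv = total_variation(p, q)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
```

Here the total variation distance stays bounded in $[0, 1]$ and is symmetric, whereas `kl_qp` is infinite because `p` assigns zero probability to a level that `q` considers possible.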
A stacking-based deep neural network (S-DNN) aggregates a plurality of basic learning modules, one after another, to synthesize a deep neural network (DNN) alternative for pattern classification. In contrast to DNNs trained end to end by backpropagation (BP), each S-DNN layer, i.e., a self-learnable module, is trained decisively and independently without BP intervention. In this paper, a ridge regression-based S-DNN, dubbed deep analytic network (DAN), along with its kernelization (K-DAN), is devised for multilayer feature re-learning from the pre-extracted baseline features and the structured features. Our theoretical formulation demonstrates that DAN/K-DAN re-learn by perturbing the intra/inter-class variations, apart from diminishing the prediction errors. We scrutinize the DAN/K-DAN performance for pattern classification on datasets of varying domains, including faces, handwritten digits, and generic objects. Unlike typical BP-optimized DNNs, which are trained on gigantic datasets using GPUs, DAN/K-DAN are trainable using only a CPU, even for small-scale training sets. Our experimental results show that DAN/K-DAN outperform existing S-DNNs and also BP-trained DNNs, including the multilayer perceptron and the deep belief network, without data augmentation applied.
Approximate inference is one of the fundamental research fields in machine learning. The two dominant theoretical inference frameworks in machine learning are variational inference (VI) and Markov chain Monte Carlo (MCMC). However, because of fundamental limitations in the theory, it is very challenging to improve existing VI and MCMC methods in both computational scalability and statistical efficiency. To overcome this obstacle, we propose a new theoretical inference framework called ergodic inference, based on the fundamental property of ergodic transformations. The key contribution of this work is to establish the theoretical foundation of ergodic inference for the development of practical algorithms in future work.
We present a new approach to neural network verification based on estimating the proportion of inputs for which a property is violated. Specifically, we estimate the probability of the event that the property is violated under an input model. This permits classic verification as a special case, for which one considers only the question of whether this expectation is exactly zero or not. When the property can be violated, our approach provides an informative notion of how robust the network is, rather than just the conventional assertion that the network is not verifiable. Furthermore, it provides an ability to scale to larger networks than classical formal verification approaches. Key to achieving this is an adaptation of multi-level splitting, a Monte Carlo approach for estimating the probability of rare events, to our statistical verification framework. We demonstrate that our approach is able to emulate existing verification procedures on benchmark problems, while scaling to larger networks and providing reliable additional information in the form of accurate estimates of the violation probability.
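The quantity being estimated, the probability that a property is violated under an input model, can be illustrated with a plain Monte Carlo estimator. Note that this is deliberately the naive estimator and not the multi-level splitting adaptation the paper relies on; naive sampling becomes impractical when violations are rare. The toy "network", the property, the uniform input model, and all names are assumptions made for the example.

```python
import random


def estimate_violation_probability(model, violates, sample_input,
                                   n_samples=10_000, seed=0):
    """Naive Monte Carlo estimate of P(property violated) under the input model.

    model:        maps an input to an output
    violates:     returns True if (input, output) violates the property
    sample_input: draws one input from the input model using the given RNG
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x = sample_input(rng)
        if violates(x, model(x)):
            hits += 1
    return hits / n_samples


# Hypothetical one-dimensional "network" and property.
def toy_model(x):
    return 2.0 * x - 1.0


def output_too_large(x, y):
    # The property "output <= 0.8" is violated exactly when x > 0.9.
    return y > 0.8


def uniform_input(rng):
    return rng.random()  # input model: uniform on [0, 1)


estimate = estimate_violation_probability(toy_model, output_too_large, uniform_input)
```

For this toy setup the true violation probability is 0.1, so the estimate should land close to that value; classic verification corresponds to asking only whether the probability is exactly zero.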
Anonymity forms an integral and important part of our digital life. It enables us to express our true selves without the fear of judgment. In this paper, we investigate the different aspects of anonymity in the social Q&A site Quora. The choice of Quora is motivated by the fact that it is one of the rare social Q&A sites that allow users to explicitly post anonymous questions, and such activity in this forum has become normative rather than taboo. Through an analysis of 5.1 million questions, we observe that at a global scale almost no difference manifests between the linguistic structure of the anonymous and the non-anonymous questions. We find topical mixing at the global scale to be the primary reason for this absence. However, the differences start to emerge once we ‘deep dive’ and (topically) cluster the questions and compare the clusters that have high volumes of anonymous questions with those that have low volumes of anonymous questions. In particular, we observe that the choice to post a question anonymously depends on the user’s perception of anonymity, and users often choose to speak about depression, anxiety, social ties and personal issues under the guise of anonymity. We further perform personality trait analysis and observe that the anonymous group of users has a positive correlation with extraversion and agreeableness, and a negative correlation with openness. Subsequently, to gain further insights, we build an anonymity grid to identify the differences in the perception of anonymity between the user posting the question and the community of users answering it. We also look into the first response time of the questions and observe that it is lowest for topics that concern personal and sensitive issues, which hints toward a higher degree of community support and user engagement.
This paper makes two important contributions to understanding the mobility patterns of scientists. First, by combining two large-scale data sets covering the publications of 3.5 million scientists over 60 years, we are able to reveal the geographical ‘career paths’ of scientists. Each path contains, on the individual level, information about the cities (resolved on real geographical space) and the time (in years) spent there. A statistical analysis gives empirical insights into the geographical distance scientists move for a new affiliation and their age when moving. From the individual career paths, we further reconstruct the world network of movements of scientists, where the nodes represent cities and the links represent the in- and out-flow of scientists between cities. We analyze the topological properties of this network with respect to degree distribution, local clustering coefficients, path lengths and assortativity. The second important contribution is an agent-based model that allows us to reproduce the empirical findings, both on the level of scientists and of the network. The model assumes that agents have a fitness and consider potential new locations if these allow them to increase this fitness. Locations, on the other hand, rank agents by their fitness and consider them only if they still have capacity for them. This leads to a matching problem which is solved algorithmically. Using empirical data to calibrate our model and to determine its initial conditions, we are able to validate the model against the measured distributions. This allows us to interpret the model assumptions as micro-based decision rules that explain the observed mobility patterns of scientists.
As machine learning algorithms continue to improve, there is an increasing need for explaining why a model produces a certain prediction for a certain input. In recent years, several methods for model interpretability have been developed, aiming to explain which regions of the model input are the main reason for the model prediction. In parallel, a significant research community effort has gone in recent years into developing adversarial example generation methods for fooling models while not altering the true label of the input, as it would have been classified by a human annotator. In this paper, we bridge the gap between adversarial example generation and model interpretability, and introduce a modification to the adversarial example generation process which encourages better interpretability. We analyze the proposed method on a public medical imaging dataset, both quantitatively and qualitatively, and show that it significantly outperforms the leading known alternative method. Our suggested method is simple to implement, and can be easily plugged into most common adversarial example generation frameworks. Additionally, we propose an explanation quality metric, $APE$ (‘Adversarial Perturbative Explanation’), which measures how well an explanation describes model decisions.
We propose a novel multilinear dynamical system (MLDS) in a transform domain, named $\mathcal{L}$-MLDS, to model tensor time series. With transformations applied to tensor data, the latent multidimensional correlations among the frontal slices are built, resulting in computational independence in the transform domain. This allows the exact separability of the multi-dimensional problem into multiple smaller LDS problems. To estimate the system parameters, we utilize the expectation-maximization (EM) algorithm to determine the parameters of each LDS. Further, $\mathcal{L}$-MLDSs significantly reduce the number of model parameters and allow parallel processing. Our general $\mathcal{L}$-MLDS model is implemented based on different transforms: discrete Fourier transform, discrete cosine transform and discrete wavelet transform. Due to the nonlinearity of these transformations, $\mathcal{L}$-MLDS is able to capture the nonlinear correlations within the data, unlike the MLDS \cite{rogers2013multilinear}, which assumes multi-way linear correlations. Using four real datasets, the proposed $\mathcal{L}$-MLDS is shown to achieve much higher prediction accuracy than the state-of-the-art MLDS and LDS with an equal number of parameters under different noise models. In particular, the relative errors are reduced by $50\% \sim 99\%$. Simultaneously, $\mathcal{L}$-MLDS achieves an exponential improvement in the model’s training time over MLDS.