Many real-world problems can be represented as graph-based learning problems. In this paper, we propose a novel framework for learning spatial and attentional convolutional neural networks on arbitrary graphs. Different from previous convolutional neural networks on graphs, we first design a motif-matching guided subgraph normalization method to capture neighborhood information. Then we implement self-attentional layers to learn the different importance of different subgraphs in order to solve graph classification problems. Analogous to image-based attentional convolution networks that operate on locally connected and weighted regions of the input, we also extend graph normalization from a one-dimensional node sequence to a two-dimensional node grid by leveraging motif-matching, and design self-attentional layers that incur no cost depending on prior knowledge of the graph structure. Our results on both bioinformatics and social network datasets show that we can significantly improve graph classification benchmarks over traditional graph kernels and existing deep models.
Although image recognition has been a research topic for many years, many researchers still have a keen interest in it. In some papers, however, there is a tendency to compare models on only one or two datasets, either because of time constraints or because the model is tailored to a specific task. Accordingly, it is hard to understand how well a given model generalizes across the image recognition field. In this paper, we compare four neural networks on the MNIST dataset with different divisions of the data. Three of them are a Convolutional Neural Network (CNN), a Deep Residual Network (ResNet) and a Dense Convolutional Network (DenseNet); the fourth is our improvement on the CNN baseline, which introduces the Capsule Network (CapsNet) to the image recognition area. We show that, although the previous models do quite a good job in this area, our retrofitting can be applied to obtain better performance. CapsNet achieves an accuracy of 99.75%, which is the best result published so far. Another encouraging result is that CapsNet needs only a small amount of data to achieve excellent performance. Finally, we will apply CapsNet’s ability to generalize to other image recognition fields in the future.
Reinforcement learning (RL) techniques, while often powerful, can suffer from slow learning speeds, particularly in high-dimensional spaces. Decomposition of tasks into a hierarchical structure holds the potential to significantly speed up learning, generalization, and transfer learning. However, current task decomposition techniques often rely on high-level knowledge provided by an expert (e.g. using dynamic Bayesian networks) to extract a hierarchical task structure, and such knowledge is not necessarily available in autonomous systems. In this paper, we propose a novel method based on Sequential Association Rule Mining that can extract Hierarchical Structure of Tasks in Reinforcement Learning (SARM-HSTRL) in an autonomous manner for both Markov decision processes (MDPs) and factored MDPs. The proposed method leverages association rule mining to discover the causal and temporal relationships among states in different trajectories, and extracts a task hierarchy that captures these relationships among sub-goals as termination conditions of different sub-tasks. We prove that the extracted hierarchical policy offers a hierarchically optimal policy in MDPs and factored MDPs. It should be noted that SARM-HSTRL extracts this hierarchically optimal policy without relying on dynamic Bayesian networks, both in scenarios with a single task trajectory and in scenarios with multiple tasks’ trajectories. Furthermore, it is shown theoretically and empirically that the extracted hierarchical task structure is consistent with trajectories and provides the most efficient, reliable, and compact structure under appropriate assumptions. The numerical results compare the performance of the proposed SARM-HSTRL method with conventional HRL algorithms in terms of the accuracy in detecting the sub-goals, the validity of the extracted hierarchies, and the speed of learning in several testbeds.
Humans are the final decision makers in critical tasks that involve ethical and legal concerns, ranging from recidivism prediction, to medical diagnosis, to fighting against fake news. Although machine learning models can sometimes achieve impressive performance in these tasks, these tasks are not amenable to full automation. To realize the potential of machine learning for improving human decisions, it is important to understand how assistance from machine learning models affects human performance and human agency. In this paper, we use deception detection as a testbed and investigate how we can harness explanations and predictions of machine learning models to improve human performance while retaining human agency. We propose a spectrum between full human agency and full automation, and develop varying levels of machine assistance along the spectrum that gradually increase the influence of machine predictions. We find that without showing predicted labels, explanations alone do not statistically significantly improve human performance in the end task. In comparison, human performance is greatly improved by showing predicted labels (>20% relative improvement) and can be further improved by explicitly suggesting strong machine performance. Interestingly, when predicted labels are shown, explanations of machine predictions induce a similar level of accuracy as an explicit statement of strong machine performance. Our results demonstrate a tradeoff between human performance and human agency and show that explanations of machine predictions can moderate this tradeoff.
Ensemble learning, which combines the predictions from multiple learners, has been widely applied in pattern recognition and has been reported to be more robust and accurate than the individual learners. This ensemble logic has recently also been increasingly applied in feature selection. There are basically two strategies for ensemble feature selection, namely data perturbation and function perturbation. Data perturbation performs feature selection on data subsets sampled from the original dataset and then selects the features consistently ranked highly across those data subsets. This has been found to improve both the stability of the selector and the prediction accuracy for a classifier. Function perturbation frees the user from having to decide on the most appropriate selector for any given situation and works by aggregating multiple selectors. This has been found to maintain or improve classification performance. Here we propose a framework, EFSIS, combining these two strategies. Empirical results indicate that EFSIS gives both high prediction accuracy and stability.
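The data-perturbation strategy described in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EFSIS framework itself: the selector (absolute Pearson correlation with the label), the mean-rank aggregation, the function names and the toy data are all hypothetical choices made for the example.

```python
import random
from statistics import mean

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5

def rank_features(rows, labels):
    """Rank every feature (0 = best) with the assumed correlation selector."""
    n_features = len(rows[0])
    scores = [abs_correlation([r[j] for r in rows], labels)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: -scores[j])
    ranks = [0] * n_features
    for position, j in enumerate(order):
        ranks[j] = position
    return ranks

def ensemble_select(rows, labels, k, n_rounds=50, seed=0):
    """Data perturbation: rank on bootstrap samples, keep the k best by summed rank."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    rank_sums = [0] * n_features
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        ranks = rank_features([rows[i] for i in idx], [labels[i] for i in idx])
        for j, r in enumerate(ranks):
            rank_sums[j] += r
    return sorted(range(n_features), key=lambda j: rank_sums[j])[:k]

# Toy data (hypothetical): feature 0 copies the label, feature 1 is
# unrelated to it, feature 2 is constant.
demo_labels = [i % 2 for i in range(20)]
demo_rows = [[float(i % 2), float((i * 7) % 5), 3.0] for i in range(20)]
selected = ensemble_select(demo_rows, demo_labels, k=1)
```

Function perturbation would instead keep the data fixed and aggregate the ranks produced by several different selectors; the same rank-sum aggregation could be reused for that case.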
Recent studies have demonstrated the vulnerability of deep convolutional neural networks to adversarial examples. Inspired by the observation that the intrinsic dimension of image data is much smaller than its pixel space dimension and that the vulnerability of neural networks grows with the input dimension, we propose to embed high-dimensional input images into a low-dimensional space to perform classification. However, arbitrarily projecting the input images to a low-dimensional space without regularization will not improve the robustness of deep neural networks. Leveraging optimal transport theory, we propose a new framework, Optimal Transport Classifier (OT-Classifier), and derive an objective that minimizes the discrepancy between the distribution of the true label and the distribution of the OT-Classifier output. Experimental results on several benchmark datasets show that our proposed framework achieves state-of-the-art performance against strong adversarial attack methods.
Model change detection is studied, in which there are two sets of samples that are independently and identically distributed (i.i.d.) according to a pre-change probabilistic model with parameter $\theta$, and a post-change model with parameter $\theta'$, respectively. The goal is to detect whether the change in the model is significant, i.e., whether the difference between the pre-change parameter and the post-change parameter $\|\theta-\theta'\|_2$ is larger than a pre-determined threshold $\rho$. The problem is considered in a Neyman-Pearson setting, where the goal is to maximize the probability of detection under a false alarm constraint. Since the generalized likelihood ratio test (GLRT) is difficult to compute in this problem, we construct an empirical difference test (EDT), which approximates the GLRT and has low computational complexity. Moreover, we provide an approximation method to set the threshold of the EDT to meet the false alarm constraint. Experiments with linear regression and logistic regression are conducted to validate the proposed algorithms.
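The detection problem in the abstract above can be written as a composite hypothesis test. The following is a sketch in assumed notation: the hypothesis labels $H_0$ and $H_1$, the decision rule $\delta$, and the false alarm level $\alpha$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{align*}
H_0 &: \|\theta - \theta'\|_2 \le \rho
  && \text{(change not significant)} \\
H_1 &: \|\theta - \theta'\|_2 > \rho
  && \text{(change significant)} \\
\text{maximize } & P(\delta = 1 \mid H_1)
  \quad \text{subject to} \quad P(\delta = 1 \mid H_0) \le \alpha
\end{align*}
```

Here $\delta = 1$ means the test declares a significant change, so the objective is the probability of detection and the constraint bounds the probability of false alarm.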
We present a conditional space-time POD formulation that is tailored to the eduction of the average, rare or intermittent event from an ensemble of realizations of a fluid process. By construction, the resulting spatio-temporal modes are coherent in space and over a pre-defined finite time horizon and optimally capture the variance, or energy, of the ensemble. For the example of intermittent acoustic radiation from a turbulent jet, we introduce a conditional expectation operator that focuses on the loudest events, as measured by a pressure probe in the far-field and contained in the tail of the pressure signal’s probability distribution. Applied to high-fidelity simulation data, the method identifies a statistically significant ‘prototype’, or average acoustic burst event, that is tracked over time. Most notably, the burst event can be traced back to its precursor, which opens up the possibility of prediction of an imminent burst. We furthermore investigate the mechanism underlying the prototypical burst event using linear stability theory and find that its structure and evolution are accurately predicted by optimal transient growth theory. The jet-noise problem demonstrates that the conditional space-time POD formulation applies even for systems with probability distributions that are not heavy-tailed, i.e. for systems in which events overlap and occur in rapid succession.
Identifying trendline visualizations with desired patterns is a common and fundamental data exploration task. Existing visual analytics tools offer limited flexibility and expressiveness for such tasks, especially when the pattern of interest is under-specified and approximate, and do not scale well when the pattern searching needs are ad-hoc, as is often the case. We propose ShapeSearch, an efficient and flexible pattern-searching tool that enables the search for desired patterns via multiple mechanisms: sketch, natural-language, and visual regular expressions. We develop a novel shape querying algebra, with a minimal set of primitives and operators that can express a large number of ShapeSearch queries, and design a natural-language and regex-based parser to automatically parse and translate user queries to the algebra representation. To execute these queries within interactive response times, ShapeSearch uses a fast shape algebra-based execution engine with query-aware optimizations, and perceptually-aware scoring methodologies. We present a thorough evaluation of the system, including a general-purpose user study, a case study involving genomic data analysis, and performance experiments comparing against state-of-the-art time series shape matching approaches, which together demonstrate the usability and scalability of ShapeSearch.
Despite the recent successes of deep neural networks, the corresponding training problem remains highly non-convex and difficult to optimize. Classes of models have been proposed that introduce greater structure to the objective function at the cost of lifting the dimension of the problem. However, these lifted methods sometimes perform poorly compared to traditional neural networks. In this paper, we introduce a new class of lifted models, Fenchel lifted networks, that enjoy the same benefits as previous lifted models, without suffering a degradation in performance over classical networks. Our model represents activation functions as equivalent biconvex constraints and uses Lagrange Multipliers to arrive at a rigorous lower bound of the traditional neural network training problem. This model is efficiently trained using block-coordinate descent and is parallelizable across data points and/or layers. We compare our model against standard fully connected and convolutional networks and show that we are able to match or beat their performance.
An efficient monotone data augmentation (MDA) algorithm is proposed for missing data imputation for incomplete multivariate nonnormal data that may contain variables of different types, and are modeled by a sequence of regression models including the linear, binary logistic, multinomial logistic, proportional odds, Poisson, negative binomial, skew-normal, skew-t regressions or a mixture of these models. The MDA algorithm is applied to the sensitivity analyses of longitudinal trials with nonignorable dropout using the controlled pattern imputations that assume the treatment effect reduces or disappears after subjects in the experimental arm discontinue the treatment. We also describe a heuristic approach to implement the controlled imputation, in which the fully conditional specification method is used to impute the intermediate missing data to create a monotone missing pattern, and the missing data after dropout are then imputed according to the assumed nonignorable mechanisms. The proposed methods are illustrated by simulation and real data analyses.
In this paper, we present an approach for Recurrent Iterative Gating called RIGNet. The core elements of RIGNet involve recurrent connections that control the flow of information in neural networks in a top-down manner, and different variants on the core structure are considered. The iterative nature of this mechanism allows for gating to spread in both spatial extent and feature space. This is revealed to be a powerful mechanism with broad compatibility with common existing networks. Analysis shows how gating interacts with different network characteristics, and we also show that more shallow networks with gating may be made to perform better than much deeper networks that do not include RIGNet modules.
Incremental learning (IL) is an important task aimed at increasing the capability of a trained model in terms of the number of classes recognizable by the model. The key problem in this task is the requirement of storing data (e.g. images) associated with existing classes while training the classifier to learn new classes. However, this is impractical as it increases the memory requirement at every incremental step, which makes it impossible to implement IL algorithms on edge devices with limited memory. Hence, we propose a novel approach, called ‘Learning without Memorizing (LwM)’, to preserve the information with respect to existing (base) classes without storing any of their data, while making the classifier progressively learn the new classes. In LwM, we present an information preserving penalty, Attention Distillation Loss, and demonstrate that penalizing the changes in classifiers’ attention maps helps to retain information about the base classes as new classes are added. We show that adding Attention Distillation Loss to the distillation loss, which is an existing information preserving loss, consistently outperforms the state of the art on the iILSVRC-small and iCIFAR-100 datasets in terms of the overall accuracy of base and incrementally learned classes.
Nowadays, multivariate time series data are increasingly collected in various real-world systems, e.g., power plants and wearable devices. Anomaly detection and diagnosis in multivariate time series refer to identifying abnormal status at certain time steps and pinpointing the root causes. Building such a system, however, is challenging since it requires not only capturing the temporal dependency in each time series, but also encoding the inter-correlations between different pairs of time series. In addition, the system should be robust to noise and provide operators with different levels of anomaly scores based upon the severity of different incidents. Despite the fact that a number of unsupervised anomaly detection algorithms have been developed, few of them can jointly address these challenges. In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) to perform anomaly detection and diagnosis in multivariate time series data. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system statuses at different time steps. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time series) correlations, and an attention based Convolutional Long-Short Term Memory (ConvLSTM) network is developed to capture the temporal patterns. Finally, based upon the feature maps which encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the input signature matrices, and the residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies based on a synthetic dataset and a real power plant dataset demonstrate that MSCRED can outperform state-of-the-art baseline methods.
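To make the idea of a multi-scale signature matrix in the abstract above more concrete, here is a minimal Python sketch. The abstract does not give the construction, so the definition used here is an assumption: each entry is the inner product of two series over a trailing window, divided by the window length, and different window lengths give the different scales. Function names and the toy series are hypothetical.

```python
def signature_matrix(series, t, window):
    """Scaled pairwise inner products over the `window` steps ending at step t.

    `series` is a list of equally long lists, one per sensor (assumed layout).
    """
    start = t - window + 1
    if start < 0:
        raise ValueError("window reaches before the start of the series")
    segments = [s[start:t + 1] for s in series]
    n = len(segments)
    return [
        [sum(a * b for a, b in zip(segments[i], segments[j])) / window
         for j in range(n)]
        for i in range(n)
    ]

def multi_scale_signatures(series, t, windows=(2, 4)):
    """One signature matrix per window length (the 'scales')."""
    return {w: signature_matrix(series, t, w) for w in windows}

# Toy example with two sensors (hypothetical values).
toy_series = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
signatures = multi_scale_signatures(toy_series, t=3)
```

Under this assumed definition, a reconstruction-based detector would compare each matrix with its reconstruction and treat large residual entries as evidence of an anomaly involving the corresponding pair of sensors.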
Deep neural networks are often prone to over-fitting with their numerous parameters, so regularization plays an important role in generalization. L1 and L2 regularizers are common regularization tools in machine learning with their simplicity and effectiveness. However, we observe that imposing strong L1 or L2 regularization on deep neural networks with stochastic gradient descent easily fails, which limits the generalization ability of the underlying neural networks. To understand this phenomenon, we first investigate how and why learning fails when strong regularization is imposed on deep neural networks. We then propose a novel method, gradient-coherent strong regularization, which imposes regularization only when the gradients are kept coherent in the presence of strong regularization. Experiments are performed with multiple deep architectures on three benchmark data sets for image recognition. Experimental results show that our proposed approach indeed endures strong regularization and significantly improves both accuracy and compression, which could not be achieved otherwise.
Calculating the log-determinant of a matrix is useful for statistical computations used in machine learning, such as generative learning, which uses the log-determinant of the covariance matrix to calculate the log-likelihood of model mixtures. The log-determinant calculation becomes challenging as the number of variables becomes large. Therefore, finding a practical speedup for this computation can be useful. In this study, we present a parallel matrix condensation algorithm for calculating the log-determinant of a large matrix. We demonstrate that in a distributed environment, Parallel Matrix Condensation has several advantages over the well-known Parallel Gaussian Elimination. The advantages include high data distribution efficiency and fewer data communication operations. We test our Parallel Matrix Condensation against a self-implemented Parallel Gaussian Elimination as well as ScaLAPACK (Scalable Linear Algebra Package) on matrices from 1000×1000 to 8000×8000, using 1, 2, 4, 8, 16, 32, 64 and 128 processors. The results show that Matrix Condensation yields the best speed-up among all tested algorithms. The code is available at https://…/MatrixCondensation
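For readers unfamiliar with the computation in the abstract above, the following Python sketch computes a log-determinant with serial Gaussian elimination and partial pivoting. It illustrates the baseline technique only, not the paper's parallel matrix condensation algorithm; the function name and return convention (a sign plus the log of the absolute determinant) are assumptions made for the example.

```python
import math

def log_abs_det(matrix):
    """Return (sign, log|det|) via Gaussian elimination with partial pivoting.

    Summing logs of the pivots avoids the overflow or underflow that a
    direct product of pivots would suffer on large matrices.
    """
    a = [list(map(float, row)) for row in matrix]  # work on a copy
    n = len(a)
    sign = 1.0
    log_det = 0.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0, float("-inf")  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign  # a row swap flips the determinant's sign
        p = a[col][col]
        if p < 0:
            sign = -sign
        log_det += math.log(abs(p))
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = a[r][col] / p
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return sign, log_det
```

The elimination loop over rows below the pivot is the part a parallel implementation would distribute across processors.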
We propose a new formulation of Multiple-Instance Learning (MIL). In typical MIL settings, a unit of data is given as a set of instances called a bag, and the goal is to find a good classifier of bags based on their similarity to a single or finitely many ‘shapelets’ (or patterns), where the similarity of a bag to a shapelet is the maximum similarity over the instances in the bag. Classifiers based on a single shapelet are not sufficiently strong for certain applications. Additionally, previous work with multiple shapelets has heuristically chosen some of the instances as shapelets, with no theoretical guarantee of generalization ability. Our formulation provides a richer class of final classifiers based on infinitely many shapelets. We provide an efficient algorithm for the new formulation, together with a generalization bound. Our empirical study demonstrates that our approach is effective not only for MIL tasks but also for Shapelet Learning for time-series classification.
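The bag-to-shapelet similarity described in the abstract above (the maximum similarity over the instances in the bag) can be illustrated with a short Python sketch. The Gaussian similarity, the weighted-sum classifier over a finite list of shapelets, and all names are assumptions chosen for the example; the paper's formulation with infinitely many shapelets is not reproduced here.

```python
import math

def gaussian_similarity(x, z, gamma=1.0):
    """Assumed instance-level similarity: exp(-gamma * squared distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def bag_similarity(bag, shapelet, similarity=gaussian_similarity):
    """Similarity of a bag to a shapelet: the best match among its instances."""
    return max(similarity(instance, shapelet) for instance in bag)

def classify_bag(bag, shapelets, weights, bias=0.0):
    """Hypothetical linear classifier over finitely many shapelets."""
    score = bias + sum(w * bag_similarity(bag, s)
                       for w, s in zip(weights, shapelets))
    return 1 if score >= 0 else -1

# Toy bags: the first contains an instance equal to the shapelet.
shapelet = [1.0, 1.0]
near_bag = [[0.0, 0.0], [1.0, 1.0]]
far_bag = [[5.0, 5.0]]
```

Because of the `max`, a single well-matching instance is enough to give the whole bag a high similarity, which is what makes the representation suitable for MIL.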
Label noise may handicap the generalization of classifiers, and how to effectively learn the main pattern from samples with noisy labels is an important issue. Recent studies have observed that deep neural networks tend to prioritize learning simple patterns and only then memorize noise patterns. This suggests a way to search for the best generalization: learn the main pattern until the noise begins to be memorized. A natural idea is to use a supervised approach to find the stopping time of learning, for example by resorting to a clean verification set. In practice, however, a clean verification set is sometimes not easy to obtain. To solve this problem, we propose an unsupervised method called limited gradient descent to estimate the best stopping time. We modify the labels of a few samples in the noisy dataset so that they are almost certainly false, and use these samples as a reverse pattern. By monitoring the learning progress of the noisy samples and the reverse samples, we can determine the stopping time of learning. In this paper, we also provide some sufficient conditions on learning with noisy labels. Experimental results on CIFAR-10 demonstrate that our approach has generalization performance similar to that of the supervised methods. For uncomplicated datasets, such as MNIST, we add a relabeling strategy to further improve generalization and achieve state-of-the-art performance.
Change detection has been a challenging visual task due to the dynamic nature of real-world scenes. Good performance of existing methods depends largely on prior background images or long-term observation. These methods, however, suffer severe degradation when they are applied to the detection of instantaneously occurring changes with only a few preceding frames provided. In this paper, we exploit spatio-temporal convolutional networks to address this challenge, and propose a novel retrospective convolution, which features efficient extraction of change information between the current frame and frames from historical observation. To address the problem of foreground-specific over-fitting in learning-based methods, we further propose a data augmentation method, named static sample synthesis, to guide the network to focus on learning change-cued information rather than specific spatial features of the foreground. Trained end-to-end with complex scenarios, our framework proves to be accurate in detecting instantaneous changes and robust to diverse noise. Extensive experiments demonstrate that our proposed method significantly outperforms existing methods.
Much time in process mining projects is spent on finding and understanding data sources and extracting the event data needed. As a result, only a fraction of the time is spent actually applying techniques to discover, control and predict the business process. Moreover, there is a lack of techniques to display relationships on top of databases without the need to express a complex query to get the required information. In this paper, a novel modeling technique that works on top of databases is presented. This technique is able to show a multigraph representing activities inferred from database events, connected with edges that are annotated with frequency and performance information. The representation may be the entry point for applying advanced process mining techniques that work on classic event logs, as the model provides a simple way to retrieve a classic event log from a specified part of the model. A comparison with similar techniques and an empirical evaluation are provided.
In this paper, we analyze the effects of depth and width on the quality of local minima, without strong over-parameterization and simplification assumptions in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve towards the global minimum value as depth and width increase. Furthermore, with a locally-induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic dataset as well as MNIST, CIFAR-10 and SVHN datasets. When compared to previous studies with strong over-parameterization assumptions, the results in this paper do not require over-parameterization, and instead show the gradual effects of over-parameterization as consequences of general results.
Convolutional Neural Networks (CNNs) are among the most powerful tools in the present era of science. A lot of research has been done to improve their performance and robustness, while their internal workings have been left largely unexplored. They are often described as black boxes that can map non-linear data very effectively. This paper tries to show how a CNN has learned to look at an image. The proposed algorithm exploits the basic math of the CNN to backtrack the important pixels it considers when making a prediction. It is a simple algorithm that requires no training of its own and operates on top of a pre-trained CNN classifier.
Stochastic gradient descent updates parameters with the summed gradient computed from a random data batch. This summation leads to an unbalanced training process if the data obtained are unbalanced. To address this issue, this paper takes both the error variance and the error mean into consideration. Our algorithm also provides an adaptive approach for adjusting the trade-off between the two terms. Because this algorithm can suppress error variance, we name it Variance Suppression Gradient Descent (VSSGD). Experimental results demonstrate that VSSGD can accelerate the training process, effectively prevent overfitting, and improve the learning capacity of networks on small samples.
Item response theory (IRT) can be applied to the analysis of the evaluation of results from AI benchmarks. The two-parameter IRT model provides two indicators (difficulty and discrimination) on the side of the item (or AI problem) while only one indicator (ability) on the side of the respondent (or AI agent). In this paper we analyse how to make this set of indicators dual, by adding a fourth indicator, generality, on the side of the respondent. Generality is meant to be dual to discrimination, and it is based on difficulty. Namely, generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. With the addition of generality, we see that this set of four key indicators can give us more insight on the results of AI benchmarks. In particular, we explore two popular benchmarks in AI, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition. We provide some guidelines to estimate and interpret these indicators for other AI benchmarks and competitions.
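As background for the abstract above, the standard two-parameter logistic IRT model can be sketched in Python as follows. This shows only the well-known 2PL response function with the three indicators the abstract names (ability, difficulty, discrimination); the paper's generality indicator is not defined in the abstract and is not reproduced here. The function and parameter names are assumptions.

```python
import math

def probability_of_success(ability, difficulty, discrimination):
    """Two-parameter logistic IRT: chance that a respondent solves an item.

    ability        -- respondent (AI agent) parameter
    difficulty     -- item (AI problem) parameter: where the curve crosses 0.5
    discrimination -- item parameter: slope of the curve at that point
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# An agent whose ability equals the item difficulty succeeds half the time,
# regardless of the discrimination value.
p_equal = probability_of_success(ability=1.0, difficulty=1.0, discrimination=2.0)
```

A higher discrimination makes the success probability change more sharply as ability moves past the item's difficulty, which is why it measures how well an item separates weak from strong respondents.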
The demand for applying semantic segmentation models on mobile devices has been increasing rapidly. Current state-of-the-art networks have an enormous number of parameters and are hence unsuitable for mobile devices, while other models with a small memory footprint ignore the inherent characteristics of semantic segmentation. To tackle this problem, we propose a novel Context Guided Network (CGNet), which is a light-weight network for semantic segmentation on mobile devices. We first propose the Context Guided (CG) block, which learns the joint feature of both local feature and surrounding context, and further improves the joint feature with the global context. Based on the CG block, we develop Context Guided Network (CGNet), which captures contextual information in all stages of the network and is specially tailored for increasing segmentation accuracy. CGNet is also elaborately designed to reduce the number of parameters and the memory footprint. Under an equivalent number of parameters, the proposed CGNet significantly outperforms existing segmentation networks. Extensive experiments on Cityscapes and CamVid datasets verify the effectiveness of the proposed approach. Specifically, without any post-processing, CGNet achieves 64.8% mean IoU on Cityscapes with less than 0.5 M parameters, and has a frame-rate of 50 fps on one NVIDIA Tesla K80 card for 2048 $\times$ 1024 high-resolution images. The source code for the complete system is publicly available.
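The mean IoU metric quoted in the abstract above can be illustrated with a small Python sketch. This is a simplified version under stated assumptions: labels are given as flat lists of class indices, classes absent from both prediction and ground truth are skipped, and the score is computed on a single toy example rather than accumulated over a whole dataset as benchmark evaluations typically do. The function name and data are hypothetical.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target)
                           if p == c and t == c)
        union = sum(1 for p, t in zip(predicted, target)
                    if p == c or t == c)
        if union > 0:  # skip classes that never occur
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class occurs in prediction or target")
    return sum(ious) / len(ious)

# Toy example with four pixels and two classes (hypothetical labels).
toy_prediction = [0, 0, 1, 1]
toy_target = [0, 1, 1, 1]
toy_score = mean_iou(toy_prediction, toy_target, num_classes=2)
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.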
Stigmergy has demonstrated clear advantages in distributed control, robustness and adaptability, and is thus regarded as an ideal solution for large-scale swarm control problems. Based on new discoveries about the role of astrocytes in regulating synaptic transmission in the brain, this paper maps the stigmergy mechanism onto the interaction between synapses and investigates its characteristics and advantages. In particular, we divide the interaction between synapses that are not directly connected into three phases and propose a stigmergic learning model. In this model, the state change of a stigmergy agent expands its influence to affect the states of others. The strength of the interaction is determined by the level of neural activity as well as the distance between stigmergy agents. The morphological and functional changes in astrocytes during environmental enrichment suggest that the regulation of distance between stigmergy agents plays a critical role in the stigmergy learning process. Simulation results verify its importance and indicate that a well-regulated distance between stigmergy agents can help to obtain a stigmergy learning gain.
The automatic detection of fraud in banking transactions has recently been studied as a way to help analysts find fraudulent operations. Because human feedback is available, this task has been studied in the framework of active learning: the fraud predictor is allowed to sequentially call on an oracle. This human intervention is used to label new examples and improve the classification accuracy of the predictor. Such a setting is not well suited to fraud detection with financial data in European countries. Indeed, since human verification is mandatory before a fraud can be considered detected, it is not necessary to focus on improving the classifier. We introduce the setting of ‘Computer-assisted fraud detection’, where the goal is to minimize the number of non-fraudulent operations submitted to an oracle. We apply existing methods to this task and show that a simple meta-algorithm provides competitive results in this scenario on benchmark datasets.
When applied to high-dimensional datasets, feature selection algorithms might still leave dozens of irrelevant variables in the dataset. Therefore, even after feature selection has been applied, classifiers must be prepared for the presence of irrelevant variables. This paper investigates a new training method called Contingency Training, which increases accuracy as well as robustness against irrelevant attributes. Contingency training is classifier independent. By subsampling and removing information from each sample, it creates a set of constraints. These constraints help the method to automatically find proper importance weights for the dataset’s features. Experiments are conducted with contingency training applied to neural networks on traditional datasets as well as datasets with additional irrelevant variables. In all of the tests, contingency training surpassed unmodified training on datasets with irrelevant variables, and it even slightly outperformed unmodified training when only a few or no irrelevant variables were present.
Based on the property that solving a system of linear matrix equations via column space and row space projections boils down to an approximation in the least squares error sense, a formulation for learning the weight matrices of a multilayer network can be derived. By exploiting the vast number of feasible solutions of these interdependent weight matrices, the learning can be performed analytically, layer by layer, without gradient computation after an initialization. Possible initialization schemes include using the data matrix as initial weights and random initialization. The study is followed by an investigation into the representation capability and the output variance of the learning scheme. Extensive experimentation on synthetic and real-world data sets validates its numerical feasibility.
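The least-squares property referred to above can be made concrete with a standard result on linear matrix equations. The symbols $A$, $B$, $C$ and $X$ are notation assumed here for illustration and are not taken from the paper. For a system $AXB = C$ that may have no exact solution, the Moore-Penrose pseudoinverses give a least-squares approximation:

```latex
\hat{X} = A^{\dagger} C B^{\dagger}
\quad\Longrightarrow\quad
\lVert A \hat{X} B - C \rVert_F = \min_{X} \lVert A X B - C \rVert_F ,
\qquad
A \hat{X} B = \underbrace{A A^{\dagger}}_{\text{projector onto } \operatorname{col}(A)} \, C \, \underbrace{B^{\dagger} B}_{\text{projector onto } \operatorname{row}(B)} .
```

In this reading, the fitted output is the target $C$ projected onto the column space of $A$ and the row space of $B$. How the paper chains such solves across layers is not specified in the abstract.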
In this paper we propose a data augmentation method for time series with irregular sampling, the Time-Conditional Generative Adversarial Network (T-CGAN). Our approach is based on Conditional Generative Adversarial Networks (CGAN), where the generative step is implemented by a deconvolutional NN and the discriminative step by a convolutional NN. Both the generator and the discriminator are conditioned on the sampling timestamps, in order to learn the hidden relationship between data and timestamps and consequently to generate new time series. We evaluate our model on synthetic and real-world datasets. For the synthetic data, we compare the performance of a classifier trained on T-CGAN-generated data against the performance of the same classifier trained on the original data. Results show that classifiers trained on T-CGAN-generated data perform on par with classifiers trained on real data, even with very short time series and small training sets. For the real-world datasets, we compare our method with other data augmentation techniques for time series, such as time slicing and time warping, on a classification problem with unbalanced datasets. Results show that our method always outperforms the other approaches, for both regularly and irregularly sampled time series. We achieve particularly good performance in cases with a small training set and short, noisy, irregularly sampled time series.
State-space models (SSMs) provide a flexible framework for modelling time-series data. Consequently, SSMs are ubiquitously applied in areas such as engineering, econometrics and epidemiology. In this paper we provide a fast approach for approximate Bayesian inference in SSMs using the tools of deep learning and variational inference.
Extracting causal relationships from observed correlations is a growing area in probabilistic reasoning, originating with the seminal work of Pearl and others from the early 1990s. This paper develops a new, categorically oriented view based on a clear distinction between syntax (string diagrams) and semantics (stochastic matrices), connected via interpretations as structure-preserving functors. A key notion in the identification of causal effects is that of an intervention, whereby a variable is forcefully set to a particular value independent of any prior propensities. We represent the effect of such an intervention as an endofunctor which performs ‘string diagram surgery’ within the syntactic category of string diagrams. This diagram surgery in turn yields a new, interventional distribution via the interpretation functor. While in general there is no way to compute interventional distributions purely from observed data, we show that this is possible in certain special cases using a calculational tool called comb disintegration. We demonstrate the use of this technique on a well-known toy example, where we predict the causal effect of smoking on cancer in the presence of a confounding common cause. After developing this specific example, we show this technique provides simple sufficient conditions for computing interventions which apply to a wide variety of situations considered in the causal inference literature.
We present an ideal mixed-integer programming (MIP) formulation for a rectified linear unit (ReLU) appearing in a trained neural network. Our formulation requires a single binary variable and no additional continuous variables beyond the input and output variables of the ReLU. We contrast it with an ideal ‘extended’ formulation with a linear number of additional continuous variables, derived through standard techniques. An apparent drawback of our formulation is that it requires an exponential number of inequality constraints, but we provide a routine to separate the inequalities in linear time. We also prove that these exponentially-many constraints are facet-defining under mild conditions. Finally, we present computational results showing that dynamically separating from the exponential inequalities 1) is much more computationally efficient and scalable than the extended formulation, 2) decreases the solve time of a state-of-the-art MIP solver by a factor of 7 on smaller instances, and 3) nearly matches the dual bounds of a state-of-the-art MIP solver on harder instances, after just a few rounds of separation and in orders of magnitude less time.
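For comparison, the textbook big-M formulation of a single ReLU $y = \max(0, w^\top x + b)$ also uses one binary variable $z$ and no extra continuous variables, but its linear relaxation can be weak. The sketch below assumes known bounds $L \le w^\top x + b \le U$ with $L < 0 < U$. These symbols are introduced here for illustration, and this is not the ideal formulation proposed in the abstract.

```latex
\begin{aligned}
y &\ge w^\top x + b, \\
y &\le w^\top x + b - L\,(1 - z), \\
y &\le U z, \\
y &\ge 0, \qquad z \in \{0, 1\}.
\end{aligned}
```

Setting $z = 1$ forces $y = w^\top x + b \ge 0$, while $z = 0$ forces $y = 0$ and $w^\top x + b \le 0$, so the integer-feasible points are exactly the graph of the ReLU over the bounded domain.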
We propose a directed acyclic hypergraph framework for a probabilistic graphical model that we call Bayesian hypergraphs. The space of directed acyclic hypergraphs is much larger than the space of chain graphs. Hence Bayesian hypergraphs can model much finer factorizations than Bayesian networks or LWF chain graphs and provide simpler and more computationally efficient procedures for factorizations and interventions. Bayesian hypergraphs also allow a modeler to represent causal patterns of interaction such as Noisy-OR graphically (without additional annotations). We introduce global, local and pairwise Markov properties of Bayesian hypergraphs and prove under which conditions they are equivalent. We define a projection operator, called shadow, that maps Bayesian hypergraphs to chain graphs, and show that the Markov properties of a Bayesian hypergraph are equivalent to those of its corresponding chain graph. We extend the causal interpretation of LWF chain graphs to Bayesian hypergraphs and provide corresponding formulas and a graphical criterion for intervention.
Deep learning algorithms are often used as black boxes, and they are too complex to understand easily. The widespread use of deep learning algorithms to solve various machine learning problems demands a deep and transparent understanding of their internal representations as well as their decision making. Moreover, learning models trained on sequential data, such as audio and video data, have intricate internal reasoning processes due to their complex distribution of features. Thus, a visual simulator might help to trace the internal decision-making mechanisms in response to adversarial input data, and would help to debug and design appropriate deep learning models. However, interpreting the internal reasoning of deep learning models is not well studied in the literature. In this work, we have developed a visual interactive web application, d-DeVIS, which helps to visualize the internal reasoning of a learning model trained on audio data. The proposed system allows users to perceive the behavior of the model and to debug it by interactively generating adversarial audio data points. The web application of d-DeVIS is available at ddevis.herokuapp.com.
Parameter pruning is a promising approach to CNN compression and acceleration that eliminates redundant model parameters with tolerable performance loss. Despite their effectiveness, existing regularization-based parameter pruning methods usually drive weights towards zero with large and constant regularization factors, which neglects the fact that the expressiveness of CNNs is fragile and requires a gentler form of regularization so that the networks can adapt during pruning. To solve this problem, we propose a new regularization-based pruning method, named IncReg, which incrementally assigns different regularization factors to different weight groups based on their relative importance. Its effectiveness is demonstrated on popular CNNs in comparison with state-of-the-art methods.
In this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm, Gen-Oja, for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-time-scale stochastic approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.
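As background for the streaming setting described above, the following is a minimal sketch of the classic Oja update for the ordinary (not generalized) principal eigenvector of a covariance matrix. It is not Gen-Oja itself, whose update rule the abstract does not give. The function names, the step-size schedule and the synthetic data generator are all assumptions made for illustration.

```python
import math
import random


def oja_top_eigenvector(samples, dim, step=lambda t: 1.0 / (t + 100)):
    """Estimate the top eigenvector of E[x x^T] from a stream of samples.

    Classic Oja rule: w <- normalize(w + eta_t * x * (x^T w)).
    The default step-size schedule is an illustrative assumption.
    """
    rng = random.Random(0)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for t, x in enumerate(samples):
        proj = sum(xi * wi for xi, wi in zip(x, w))  # scalar x^T w
        eta = step(t)
        w = [wi + eta * proj * xi for wi, xi in zip(w, x)]  # stochastic power step
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]  # project back onto the unit sphere
    return w


def make_samples(n, scales, seed=1):
    """Hypothetical data: independent Gaussian coordinates with given std devs."""
    rng = random.Random(seed)
    return [[s * rng.gauss(0.0, 1.0) for s in scales] for _ in range(n)]
```

With coordinate standard deviations such as `[3.0, 1.0, 1.0, 1.0, 1.0]`, the covariance is diagonal and its top eigenvector is the first coordinate axis, so the returned unit vector should concentrate almost all of its mass on index 0 (up to sign).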
The learning objective formulation is essential for the success of convolutional neural networks. In this work, we thoroughly analyse the standard learning objective functions for multi-class classification CNNs: softmax regression (SR) for the single-label scenario and logistic regression (LR) for the multi-label scenario. Our analyses inspire us to exploit LR for single-label classification learning, and reveal the negative class distraction problem in LR. To address this problem, we develop two novel LR-based objective functions that not only generalise the conventional LR but, importantly, turn out to be competitive alternatives to SR in single-label classification. Extensive comparative evaluations demonstrate the model learning advantages of the proposed LR functions over the commonly adopted SR in single-label coarse-grained object categorisation and cross-class fine-grained person instance identification tasks. We also show the performance superiority of our method on clothing attribute classification in comparison to the vanilla LR function.
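The two standard objectives named above can be compared on a single example. The sketch below computes the usual softmax cross-entropy (SR) and the usual sum of per-class binary cross-entropies (LR) for one vector of logits and one ground-truth label. The function names are chosen here for illustration, and the paper's two new LR-based objectives are not reproduced because the abstract does not define them.

```python
import math


def softmax_regression_loss(logits, label):
    """Cross-entropy over a softmax: classes compete through normalisation."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def logistic_regression_loss(logits, label):
    """Sum of independent per-class binary cross-entropies (one-vs-rest)."""
    loss = 0.0
    for k, v in enumerate(logits):
        target = 1.0 if k == label else 0.0
        # Stable form of -[t*log(sigmoid(v)) + (1 - t)*log(1 - sigmoid(v))].
        loss += max(v, 0.0) - v * target + math.log1p(math.exp(-abs(v)))
    return loss
```

In the LR loss every negative class contributes its own term, so with a single positive label the negative terms grow in number with the class count. This may be related to the negative class distraction problem mentioned above, though that connection is an assumption here.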
Recent studies on multi-label image classification have focused on designing more complex deep neural network architectures, such as those using attention mechanisms and region proposal networks. Although performance gains have been reported in the literature, the backbone deep models of the proposed approaches and the evaluation metrics employed vary across works, making fair comparison difficult. Moreover, due to the lack of properly investigated baselines, the advantages introduced by the proposed techniques are unclear. To address these issues, we thoroughly investigate the mainstream deep convolutional neural network architectures for multi-label image classification and present a strong baseline. With only data augmentation and model ensembling, we achieve better performance than previously reported on three benchmark datasets. We hope the work presented in this paper will provide insights for future studies on multi-label image classification.
Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these two kinds of methodology, and limited understanding of relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multi-stable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.
Most of the parameters in large-vocabulary models are used in the embedding layer, to map categorical features to vectors, and in the softmax layer, for classification weights. This is a bottleneck in memory-constrained on-device training applications such as federated learning and on-device inference applications such as automatic speech recognition (ASR). One way of compressing the embedding and softmax layers is to substitute larger units such as words with smaller sub-units such as characters. However, sub-unit models often perform poorly compared to larger-unit models. We propose WEST, an algorithm for encoding categorical features and output classes with a sequence of random or domain-dependent sub-units, and demonstrate that this transduction can lead to significant compression without compromising performance. WEST bridges the gap between larger-unit and sub-unit models and can be interpreted as a MaxEnt model over sub-unit features, which may be of independent interest.
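One possible reading of "a sequence of random sub-units" is sketched below: each category receives a distinct, randomly chosen fixed-length code over a small sub-unit alphabet. This is an illustrative assumption and not the WEST algorithm itself, whose details the abstract does not give. The function name and parameters are hypothetical.

```python
import random


def random_subunit_codes(vocab_size, num_subunits, seed=0):
    """Assign every category a distinct fixed-length sequence of sub-unit ids."""
    if num_subunits < 2:
        raise ValueError("need at least two sub-units")
    # Smallest code length such that num_subunits ** length >= vocab_size.
    length = 1
    while num_subunits ** length < vocab_size:
        length += 1
    # Draw distinct integers, then write each one in base num_subunits.
    rng = random.Random(seed)
    slots = rng.sample(range(num_subunits ** length), vocab_size)
    codes = []
    for slot in slots:
        digits = []
        for _ in range(length):
            slot, digit = divmod(slot, num_subunits)
            digits.append(digit)
        codes.append(tuple(digits))
    return codes
```

Under this sketch, a vocabulary of 1000 categories with 10 sub-units yields codes of length 3, so an embedding table would need 10 rows instead of 1000, at the cost of modelling a short sequence per category.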