In the context of a motivating study of dynamic network flow data on a large-scale e-commerce web site, we develop Bayesian models for on-line/sequential analysis for monitoring and adapting to changes reflected in node-node traffic. For large-scale networks, we customize core Bayesian time series analysis methods using dynamic generalized linear models (DGLMs) integrated into the multivariate network context using the concept of decouple/recouple recently introduced in multivariate time series. This enables flexible dynamic modeling of flows on large-scale networks and exploitation of partial parallelization of analysis while critically maintaining coherence with an over-arching multivariate dynamic flow model. Development is anchored in a case-study on internet data, with flows of visitors to a commercial news web site defining a long time series of node-node counts on over 56,000 node pairs. Characterizing inherent stochasticity in traffic patterns, understanding node-node interactions, adapting to dynamic changes in flows and allowing for sensitive monitoring to flag anomalies are central questions. The methodology of dynamic network DGLMs will be of interest and utility in broad ranges of dynamic network flow studies.
The classification of anomalies or sudden changes in power networks versus normal abrupt changes or switching actions is essential to take appropriate maintenance actions that guarantee the quality of power delivery. This issue has increased in importance and has become more complicated with the proliferation of volatile resources that introduce variability, uncertainty, and intermittency in circuit behavior that can be observed as variations in voltage and current phasors. This makes diagnostics applications more challenging. This paper proposes using quotient gradient system (QGS) to train two-stage partially recurrent neural network to improve anomaly classification rate in power distribution networks using high-fidelity data from micro-phasor measurement units (PMUs). QGS is a systematic approach to finding solutions of constraint satisfaction problems. We transform the PMUs data from the power network into a constraint satisfaction problem and use QGS to train a neural network by solving the resulting optimization problem. Simulation results show that the proposed supervised classification method can reliably distinguish between different anomalies in power distribution networks. Comparison with other neural network classifiers shows that QGS trained networks provide significantly better classification. Sensitivity analysis is performed concerning the number of PMUs, reporting rates, noise level and early versus late data stream fusion frameworks.
This tutorial is based on the lecture notes for the courses ‘Machine Learning: Basic Principles’ and ‘Artificial Intelligence’, which I have taught during fall 2017 and spring 2018 at Aalto university. The aim is to provide an accessible introduction to some of the main concepts and methods within supervised machine learning. Most of the current systems which are con- sidered as (artificially) intelligent are based on some form of supervised machine learning. After discussing the main building blocks of a formal machine learning problem, some of the most popular algorithmic design patterns for machine learning methods are presented.
We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models and without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level the amount of work performed during inference for image classification models can be consistently reduced by two orders of magnitude. In the best case studied of a sparse variant of AlexNet, this approach can ideally reduce computation work by more than 500x. We present Laconic a hardware accelerator that implements this approach to improve execution time, and energy efficiency for inference with Deep Learning Networks. Laconic judiciously gives up some of the work reduction potential to yield a low-cost, simple, and energy efficient design that outperforms other state-of-the-art accelerators. For example, a Laconic configuration that uses a weight memory interface with just 128 wires outperforms a conventional accelerator with a 2K-wire weight memory interface by 2.3x on average while being 2.13x more energy efficient on average. A Laconic configuration that uses a 1K-wire weight memory interface, outperforms the 2K-wire conventional accelerator by 15.4x and is 1.95x more energy efficient. Laconic does not require but rewards advances in model design such as a reduction in precision, the use of alternate numeric representations that reduce the number of bits that are ‘1’, or an increase in weight or activation sparsity.
High volume of data, perceived as either challenge or opportunity. Deep learning architecture demands high volume of data to effectively back propagate and train the weights without bias. At the same time, large volume of data demands higher capacity of the machine where it could be executed seamlessly. Budding data scientist along with many research professionals face frequent disconnection issue with cloud computing framework (working without dedicated connection) due to free subscription to the platform. Similar issues also visible while working on local computer where computer may run out of resource or power sometimes and researcher has to start training the models all over again. In this paper, we intend to provide a way to resolve this issue and progressively training the neural network even after having frequent disconnection or resource outage without loosing much of the progress
Users in various web and mobile applications are vulnerable to attribute inference attacks, in which an attacker leverages a machine learning classifier to infer a target user’s private attributes (e.g., location, sexual orientation, political view) from its public data (e.g., rating scores, page likes). Existing defenses leverage game theory or heuristics based on correlations between the public data and attributes. These defenses are not practical. Specifically, game-theoretic defenses require solving intractable optimization problems, while correlation-based defenses incur large utility loss of users’ public data. In this paper, we present AttriGuard, a practical defense against attribute inference attacks. AttriGuard is computationally tractable and has small utility loss. Our AttriGuard works in two phases. Suppose we aim to protect a user’s private attribute. In Phase I, for each value of the attribute, we find a minimum noise such that if we add the noise to the user’s public data, then the attacker’s classifier is very likely to infer the attribute value for the user. We find the minimum noise via adapting existing evasion attacks in adversarial machine learning. In Phase II, we sample one attribute value according to a certain probability distribution and add the corresponding noise found in Phase I to the user’s public data. We formulate finding the probability distribution as solving a constrained convex optimization problem. We extensively evaluate AttriGuard and compare it with existing methods using a real-world dataset. Our results show that AttriGuard substantially outperforms existing methods. Our work is the first one that shows evasion attacks can be used as defensive techniques for privacy protection.
Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and the scale of computational tasks produced the demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach connecting the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decoupled various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, spreading from SQL queries to machine learning to graph processing. Spark-MPI further extended the Spark ecosystem with the MPI applications using the Process Management Interface. The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines.
Children speech recognition is challenging mainly due to the inherent high variability in children’s physical and articulatory characteristics and expressions. This variability manifests in both acoustic constructs and linguistic usage due to the rapidly changing developmental stage in children’s life. Part of the challenge is due to the lack of large amounts of available children speech data for efficient modeling. This work attempts to address the key challenges using transfer learning from adult’s models to children’s models in a Deep Neural Network (DNN) framework for children’s Automatic Speech Recognition (ASR) task evaluating on multiple children’s speech corpora with a large vocabulary. The paper presents a systematic and an extensive analysis of the proposed transfer learning technique considering the key factors affecting children’s speech recognition from prior literature. Evaluations are presented on (i) comparisons of earlier GMM-HMM and the newer DNN Models, (ii) effectiveness of standard adaptation techniques versus transfer learning, (iii) various adaptation configurations in tackling the variabilities present in children speech, in terms of (a) acoustic spectral variability, and (b) pronunciation variability and linguistic constraints. Our Analysis spans over (i) number of DNN model parameters (for adaptation), (ii) amount of adaptation data, (iii) ages of children, (iv) age dependent-independent adaptation. Finally, we provide Recommendations on (i) the favorable strategies over various aforementioned – analyzed parameters, and (ii) potential future research directions and relevant challenges/problems persisting in DNN based ASR for children’s speech.
Reinforcement learning (RL) has had many successes in both ‘deep’ and ‘shallow’ settings. In both cases, significant hyperparameter tuning is often required to achieve good performance. Furthermore, when nonlinear function approximation is used, non-stationarity in the state representation can lead to learning instability. A variety of techniques exist to combat this — most notably large experience replay buffers or the use of multiple parallel actors. These techniques come at the cost of moving away from the online RL problem as it is traditionally formulated (i.e., a single agent learning online without maintaining a large database of training examples). Meta-learning can potentially help with both these issues by tuning hyperparameters online and allowing the algorithm to more robustly adjust to non-stationarity in a problem. This paper applies meta-gradient descent to derive a set of step-size tuning algorithms specifically for online RL control with eligibility traces. Our novel technique, Metatrace, makes use of an eligibility trace analogous to methods like $TD(\lambda)$. We explore tuning both a single scalar step-size and a separate step-size for each learned parameter. We evaluate Metatrace first for control with linear function approximation in the classic mountain car problem and then in a noisy, non-stationary version. Finally, we apply Metatrace for control with nonlinear function approximation in 5 games in the Arcade Learning Environment where we explore how it impacts learning speed and robustness to initial step-size choice. Results show that the meta-step-size parameter of Metatrace is easy to set, Metatrace can speed learning, and Metatrace can allow an RL algorithm to deal with non-stationarity in the learning task.
In this paper, we consider the problem of minimizing the average of a large number of nonsmooth and convex functions. Such problems often arise in typical machine learning problems as empirical risk minimization, but are computationally very challenging. We develop and analyze a new algorithm that achieves robust linear convergence rate, and both its time complexity and gradient complexity are superior than state-of-art nonsmooth algorithms and subgradient-based schemes. Besides, our algorithm works without any extra error bound conditions on the objective function as well as the common strongly-convex condition. We show that our algorithm has wide applications in optimization and machine learning problems, and demonstrate experimentally that it performs well on a large-scale ranking problem.
Modern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naive adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyze our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.2 frames per second at full (1024×2048) resolution.
Human labeling of textual data can be very time-consuming and expensive, yet it is critical for the success of an automatic text classification system. In order to minimize human labeling efforts, we propose a novel active learning (AL) solution, that does not rely on existing sources of unlabeled data. It uses a small amount of labeled data as the core set for the synthesis of useful membership queries (MQs) – unlabeled instances synthesized by an algorithm for human labeling. Our solution uses modification operators, functions from the instance space to the instance space that change the input to some extent. We apply the operators on the core set, thus creating a set of new membership queries. Using this framework, we look at the instance space as a search space and apply search algorithms in order to create desirable MQs. We implement this framework in the textual domain. The implementation includes using methods such as WordNet and Word2vec, for replacing text fragments from a given sentence with semantically related ones. We test our framework on several text classification tasks and show improved classifier performance as more MQs are labeled and incorporated into the training set. To the best of our knowledge, this is the first work on membership queries in the textual domain.
The field of Natural Language Processing (NLP) is growing rapidly, with new research published daily along with an abundance of tutorials, codebases and other online resources. In order to learn this dynamic field or stay up-to-date on the latest research, students as well as educators and researchers must constantly sift through multiple sources to find valuable, relevant information. To address this situation, we introduce TutorialBank, a new, publicly available dataset which aims to facilitate NLP education and research. We have manually collected and categorized over 6,300 resources on NLP as well as the related fields of Artificial Intelligence (AI), Machine Learning (ML) and Information Retrieval (IR). Our dataset is notably the largest manually-picked corpus of resources intended for NLP education which does not include only academic papers. Additionally, we have created both a search engine and a command-line tool for the resources and have annotated the corpus to include lists of research topics, relevant resources for each topic, prerequisite relations among topics, relevant sub-parts of individual resources, among other annotations. We are releasing the dataset and present several avenues for further research.
We consider the problem of learning textual entailment models with limited supervision (5K-10K training examples), and present two complementary approaches for it. First, we propose knowledge-guided adversarial example generators for incorporating large lexical resources in entailment models via only a handful of rule templates. Second, to make the entailment model – a discriminator – more robust, we propose the first GAN-style approach for training it using a natural language example generator that iteratively adjusts based on the discriminator’s performance. We demonstrate effectiveness using two entailment datasets, where the proposed methods increase accuracy by 4.7% on SciTail and by 2.8% on a 1% training sub-sample of SNLI. Notably, even a single hand-written rule, negate, improves the accuracy on the negation examples in SNLI by 6.1%.
Active learning is a machine learning approach for reducing the data labeling effort. Given a pool of unlabeled samples, it tries to select the most useful ones to label so that a model built from them can achieve the best possible performance. This paper focuses on pool-based sequential active learning for regression (ALR). We first propose three essential criteria that an ALR approach should consider in selecting the most useful unlabeled samples: informativeness, representativeness, and diversity, and compare four existing ALR approaches against them. We then propose a new ALR approach using passive sampling, which considers both the representativeness and the diversity in both the initialization and subsequent iterations. Remarkably, this approach can also be integrated with other existing ALR approaches in the literature to further improve the performance. Extensive experiments on 11 UCI, CMU StatLib, and UFL Media Core datasets from various domains verified the effectiveness of our proposed ALR approaches.
With the increase of machine learning usage by industries and scientific communities in a variety of tasks such as text mining, image recognition and self-driving cars, automatic setting of hyper-parameter in learning algorithms is a key factor for achieving satisfactory performance regardless of user expertise in the inner workings of the techniques and methodologies. In particular, for a reinforcement learning algorithm, the efficiency of an agent learning a control policy in an uncertain environment is heavily dependent on the hyper-parameters used to balance exploration with exploitation. In this work, an autonomous learning framework that integrates Bayesian optimization with Gaussian process regression to optimize the hyper-parameters of a reinforcement learning algorithm, is proposed. Also, a bandits-based approach to achieve a balance between computational costs and decreasing uncertainty about the Q-values, is presented. A gridworld example is used to highlight how hyper-parameter configurations of a learning algorithm (SARSA) are iteratively improved based on two performance functions.
In order to reach higher degrees of flexibility, adaptability and autonomy in manufacturing systems, it is essential to develop new rescheduling methodologies which resort to cognitive capabilities, similar to those found in human beings. Artificial cognition is important for designing planning and control systems that generate and represent knowledge about heuristics for repair-based scheduling. Rescheduling knowledge in the form of decision rules is used to deal with unforeseen events and disturbances reactively in real time, and take advantage of the ability to act interactively with the user to counteract the effects of disruptions. In this work, to achieve the aforementioned goals, a novel approach to generate rescheduling knowledge in the form of dynamic first-order logical rules is proposed. The proposed approach is based on the integration of reinforcement learning with artificial cognitive capabilities involving perception and reasoning/learning skills embedded in the Soar cognitive architecture. An industrial example is discussed showing that the approach enables the scheduling system to assess its operational range in an autonomic way, and to acquire experience through intensive simulation while performing repair tasks.
Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher}) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness. %we desire a compact model with performance close to the teacher’s. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these {Born-Again Networks (BANs), outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating a role of the teacher outputs on both predicted and non-predicted classes. We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
Distributional reinforcement learning (distributional RL) has seen empirical success in complex Markov Decision Processes (MDPs) in the setting of nonlinear function approximation. However, there are many different ways in which one can leverage the distributional approach to reinforcement learning. In this paper, we propose GAN Q-learning, a novel distributional RL method based on generative adversarial networks (GANs) and analyze its performance in simple tabular environments, as well as OpenAI Gym. We empirically show that our algorithm leverages the flexibility and blackbox approach of deep learning models while providing a viable alternative to other state-of-the-art methods.
Matrix completion is one of the key problems in signal processing and machine learning, with applications ranging from image processing and data gathering to classification and recommender systems. Recently, deep neural networks have been proposed as latent factor models for matrix completion and have achieved state-of-the-art performance. Nevertheless, a major problem with existing neural-network-based models is their limited capabilities to extend to samples unavailable at the training stage. In this paper, we propose a deep two-branch neural network model for matrix completion. The proposed model not only inherits the predictive power of neural networks, but is also capable of extending to partially observed samples outside the training set, without the need of retraining or fine-tuning. Experimental studies on popular movie rating datasets prove the ef- fectiveness of our model compared to the state of the art, in terms of both accuracy and extendability.
As deep neural networks grow in size, from thousands to millions to billions of weights, the performance of those networks becomes limited by our ability to accurately train them. A common naive question arises: if we have a system with billions of degrees of freedom, don’t we also need billions of samples to train it Of course, the success of deep learning indicates that reliable models can be learned with reasonable amounts of data. Similar questions arise in protein folding, spin glasses and biological neural networks. With effectively infinite potential folding/spin/wiring configurations, how does the system find the precise arrangement that leads to useful and robust results Simple sampling of the possible configurations until an optimal one is reached is not a viable option even if one waited for the age of the universe. On the contrary, there appears to be a mechanism in the above phenomena that forces them to achieve configurations that live on a low-dimensional manifold, avoiding the curse of dimensionality. In the current work we use the concept of mutual information between successive layers of a deep neural network to elucidate this mechanism and suggest possible ways of exploiting it to accelerate training. We show that adding structure to the neural network that enforces higher mutual information between layers speeds training and leads to more accurate results. High mutual information between layers implies that the effective number of free parameters is exponentially smaller than the raw number of tunable weights.
Reinforcement learning (RL) agents performing complex tasks must be able to remember observations and actions across sizable time intervals. This is especially true during the initial learning stages, when exploratory behaviour can increase the delay between specific actions and their effects. Many new or popular approaches for learning these distant correlations employ backpropagation through time (BPTT), but this technique requires storing observation traces long enough to span the interval between cause and effect. Besides memory demands, learning dynamics like vanishing gradients and slow convergence due to infrequent weight updates can reduce BPTT’s practicality; meanwhile, although online recurrent network learning is a developing topic, most approaches are not efficient enough to use as replacements. We propose a simple, effective memory strategy that can extend the window over which BPTT can learn without requiring longer traces. We explore this approach empirically on a few tasks and discuss its implications.
Heterogeneous networks not only present a challenge of heterogeneity in the types of nodes and relations, but also the attributes and content associated with the nodes. While recent works have looked at representation learning on homogeneous and heterogeneous networks, there is no work that has collectively addressed the following challenges: (a) the heterogeneous structural information of the network consisting of multiple types of nodes and relations; (b) the unstructured semantic content (e.g., text) associated with nodes; and (c) online updates due to incoming new nodes in growing network. We address these challenges by developing a Content-Aware Representation Learning model (CARL). CARL performs joint optimization of heterogeneous SkipGram and deep semantic encoding for capturing both heterogeneous structural closeness and unstructured semantic relations among all nodes, as function of node content, that exist in the network. Furthermore, an additional online update module is proposed for efficiently learning representations of incoming nodes. Extensive experiments demonstrate that CARL outperforms state-of-the-art baselines in various heterogeneous network mining tasks, such as link prediction, document retrieval, node recommendation and relevance search. We also demonstrate the effectiveness of the CARL’s online update module through a category visualization study.
We introduce a novel criterion in clustering that seeks clusters with limited range of values associated with each cluster’s elements. In clustering or classification the objective is to partition a set of objects into subsets, called clusters or classes, consisting of similar objects so that different clusters are as dissimilar as possible. We propose a number of objective functions that employ the range of the clusters as part of the objective function. Several of the proposed objectives mimic objectives based on sums of similarities. These objective functions are motivated by image segmentation problems, where the diameter, or range of values associated with objects in each cluster, should be small. It is demonstrated that range-based problems are in general easier, in terms of their complexity, than the analogous similarity-sum problems. Several of the problems we present could therefore be viable alternatives to existing clustering problems which are NP-hard, offering the advantage of efficient algorithms.
Empirical copula functions can be used to model the dependence structure of multivariate data. This paper adapts the Greenwald and Khanna algorithm in order to provide a space-memory efficient approximation to the empirical copula function of a bivariate stream of data. A succinct space-memory efficient summary of values seen in the stream up to a certain time is maintained and can be queried at any point to return an approximation to the empirical copula function with guaranteed error bounds. This paper then gives an example of a class of higher dimensional copulas that can be computed from a product of these bivariate copula approximations. The computational benefits and the approximation error of this algorithm is theoretically and numerically assessed.