To be effective in sequential data processing, Recurrent Neural Networks (RNNs) are required to keep track of past events by creating memories. While the relation between memories and the network’s hidden state dynamics was established over the last decade, previous works in this direction were of a predominantly descriptive nature focusing mainly on locating the dynamical objects of interest. In particular, it remained unclear how dynamical observables affect the performance, how they form and whether they can be manipulated. Here, we utilize different training protocols, datasets and architectures to obtain a range of networks solving a delayed classification task with similar performance, alongside substantial differences in their ability to extrapolate for longer delays. We analyze the dynamics of the network’s hidden state, and uncover the reasons for this difference. Each memory is found to be associated with a nearly steady state of the dynamics which we refer to as a ‘slow point’. Slow point speeds predict extrapolation performance across all datasets, protocols and architectures tested. Furthermore, by tracking the formation of the slow points we are able to understand the origin of differences between training protocols. Finally, we propose a novel regularization technique that is based on the relation between hidden state speeds and memory longevity. Our technique manipulates these speeds, thereby leading to a dramatic improvement in memory robustness over time, and could pave the way for a new class of regularization methods.
We study the dynamic regret of a new class of online learning problems, in which the gradient of the loss function changes continuously across rounds with respect to the learner’s decisions. This setup is motivated by the use of online learning as a tool to analyze the performance of iterative algorithms. Our goal is to identify interpretable dynamic regret rates that explicitly consider the loss variations as consequences of the learner’s decisions as opposed to external constraints. We show that achieving sublinear dynamic regret in general is equivalent to solving certain variational inequalities, equilibrium problems, and fixed-point problems. Leveraging this identification, we present necessary and sufficient conditions for the existence of efficient algorithms that achieve sublinear dynamic regret. Furthermore, we show a reduction from dynamic regret to both static regret and convergence rate to equilibriums in the aforementioned problems, which allows us to analyze the dynamic regret of many existing learning algorithms in few steps.
Bloom filters are data structures used to determine set membership of elements, with applications from string matching to networking and security problems. These structures are favored because of their reduced memory consumption and fast wallclock and asymptotic time bounds. Generally, Bloom filters maintain constant membership query time, making them very fast in their niche. However, they are limited in their lack of a removal operation, as well as by their probabilistic nature. In this paper, we discuss various iterations of and alternatives to the generic Bloom filter that have been researched and implemented to overcome their inherent limitations. Bloom filters, especially when used in conjunction with other data structures, are still powerful and efficient data structures; we further discuss their use in industy and research to optimize resource utilization.
We introduce a new scheduling problem in distributed computing that we call the DSMS problem. We are given a set of mobile identical servers that are initially hosted by some processors of the given network. Further, we are given a set of requests that are issued by the processors possibly at any time in an online fashion, and the servers must serve them. A request must be scheduled before the request is served. The delay for scheduling a request is the time taken since the request is issued until it is scheduled. The goal is to minimize the average delay. Navigating mobile servers in a large-scale distributed system needs a scalable location service. We devise the distributed GNN protocol, a novel linked-reversal-based protocol for the DSMS problem that works on overlay trees. We prove that GNN is a starvation-free protocol that correctly integrates locating the servers and synchronizing the concurrent access to the servers despite asynchrony. Further, we analyze the GNN protocol for one-shot executions where all requests are simultaneously issued. We show that when running the GNN protocol on top of a special family of tree topologies—known as hierarchically well-separated trees (HSTs)—we obtain a randomized distributed protocol with an expected competitive ratio of on general network topologies for the DSMS problem where is the number of processors in the given network. From a technical point of view, the main result of our paper shows that the GNN protocol optimally solves the DSMS problem on HSTs for one-shot executions. Our results hold even if communication is asynchronous.
In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long-short Term Memory (CNN-BLSTM). The model is performed on the utterance level, which means the utterance-level decision can be directly obtained from the output of the neural network. To handle speech utterances with entire arbitrary and potentially long duration, we combine CNN-BLSTM model with a self-attentive pooling layer together. The front-end CNN-BLSTM module plays a role as local pattern extractor for the variable-length inputs, and the following self-attentive pooling layer is built on top to get the fixed-dimensional utterance-level representation. We conducted experiments on NIST LRE07 closed-set task, and the results reveal that the proposed attention-based CNN-BLSTM model achieves comparable error reduction with other state-of-the-art utterance-level neural network approaches for all 3 seconds, 10 seconds, 30 seconds duration tasks.
Traditional load analysis is facing challenges with the new electricity usage patterns due to demand response as well as increasing deployment of distributed generations, including photovoltaics (PV), electric vehicles (EV), and energy storage systems (ESS). At the transmission system, despite of irregular load behaviors at different areas, highly aggregated load shapes still share similar characteristics. Load clustering is to discover such intrinsic patterns and provide useful information to other load applications, such as load forecasting and load modeling. This paper proposes an efficient submodular load clustering method for transmission-level load areas. Robust principal component analysis (R-PCA) firstly decomposes the annual load profiles into low-rank components and sparse components to extract key features. A novel submodular cluster center selection technique is then applied to determine the optimal cluster centers through constructed similarity graph. Following the selection results, load areas are efficiently assigned to different clusters for further load analysis and applications. Numerical results obtained from PJM load demonstrate the effectiveness of the proposed approach.
In the past decade, sparse principal component analysis has emerged as an archetypal problem for illustrating statistical-computational tradeoffs. This trend has largely been driven by a line of research aiming to characterize the average-case complexity of sparse PCA through reductions from the planted clique (PC) conjecture – which conjectures that there is no polynomial-time algorithm to detect a planted clique of size in . All previous reductions to sparse PCA either fail to show tight computational lower bounds matching existing algorithms or show lower bounds for formulations of sparse PCA other than its canonical generative model, the spiked covariance model. Also, these lower bounds all quickly degrade with the exponent in the PC conjecture. Specifically, when only given the PC conjecture up to where , there is no sparsity level at which these lower bounds remain tight. If these reductions fail to even show the existence of a statistical-computational tradeoff at any sparsity . We give a reduction from PC that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities . We also show the surprising result that weaker forms of the PC conjecture up to clique size for any given imply tight computational lower bounds for sparse PCA at sparsities . This shows that even a mild improvement in the signal strength needed by the best known polynomial-time sparse PCA algorithms would imply that the hardness threshold for PC is subpolynomial. This is the first instance of a suboptimal hardness assumption implying optimal lower bounds for another problem in unsupervised learning.
Optimizing deep neural networks is largely thought to be an empirical process, requiring manual tuning of several parameters, such as learning rate, weight decay, and dropout rate. Arguably, the learning rate is the most important of these to tune, and this has gained more attention in recent works. In this paper, we propose a novel method to compute the learning rate for training deep neural networks. We derive a theoretical framework to compute learning rates dynamically, and then show experimental results on standard datasets and architectures to demonstrate the efficacy of our approach.
We consider the problem of inference after model selection under weak assumptions in the time series setting. Even when the data are not independent, we show that sample splitting remains asymptotically valid as long as the process satisfies appropriate weak dependence conditions and the functional of interest is suitably well-behaved. In addition, if the inference targets are appropriately defined, we demonstrate that valid statistical inference is possible without assuming stationarity. As a working example, we consider post-selection inference for regression coefficients under a random design assumption, in which the pair is assumed to be an observation from a weakly dependent triangular array. We establish (asymptotic) sample splitting validity for regression coefficients under both -mixing and -dependence assumptions. To facilitate statistical inference in the non-stationary, weakly dependent regime, we extend a central limit theorem of Doukhan and Wintenberger (2007). To extend their result, we derive some properties of the variance of a normalized sum of a weakly dependent process. In particular, we show that, under very general conditions, the variance is often well-approximated by independent blocks. Using this result, we derive the validity of the block multiplier bootstrap under -dependence and demonstrate the validity of an inference procedure that combines sample splitting with the bootstrap under weak assumptions.
The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but there is much on them left to be optimized. We propose the full-stack compiler DNNVM, which is an integration of optimizers for graphs, loops and data layouts, and an assembler, a runtime supporter and a validation environment. The DNNVM works in the context of deep learning frameworks and transforms CNN models into the directed acyclic graph: XGraph. Based on XGraph, we transform the optimization challenges for both the data layout and pipeline into graph-level problems. DNNVM enumerates all potentially profitable fusion opportunities by a heuristic subgraph isomorphism algorithm to leverage pipeline and data layout optimizations, and searches for the optimal execution strategies of the whole computing graph. On the Xilinx ZU2 @330 MHz and ZU9 @330 MHz, we achieve equivalently state-of-the-art performance on our benchmarks by naive implementations without optimizations, and the throughput is further improved up to 1.26x by leveraging heterogeneous optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38 TOPs/s for ResNet50.
Convolutional neural networks excel in a number of computer vision tasks. One of their most crucial architectural elements is the effective receptive field size, that has to be manually set to accommodate a specific task. Standard solutions involve large kernels, down/up-sampling and dilated convolutions. These require testing a variety of dilation and down/up-sampling factors and result in non-compact representations and excessive number of parameters. We address this issue by proposing a new convolution filter composed of displaced aggregation units (DAU). DAUs learn spatial displacements and adapt the receptive field sizes of individual convolution filters to a given problem, thus eliminating the need for hand-crafted modifications. DAUs provide a seamless substitution of convolutional filters in existing state-of-the-art architectures, which we demonstrate on AlexNet, ResNet50, ResNet101, DeepLab and SRN-DeblurNet. The benefits of this design are demonstrated on a variety of computer vision tasks and datasets, such as image classification (ILSVRC 2012), semantic segmentation (PASCAL VOC 2011, Cityscape) and blind image de-blurring (GOPRO). Results show that DAUs efficiently allocate parameters resulting in up to four times more compact networks at similar or better performance.
Assigning a label to each pixel in an image, namely semantic segmentation, has been an important task in computer vision, and has applications in autonomous driving, robotic navigation, localization, and scene understanding. Fully convolutional neural networks have proved to be a successful solution for the task over the years but most of the work being done focuses primarily on accuracy. In this paper, we present a computationally efficient approach to semantic segmentation, meanwhile achieving a high mIOU, on Cityscapes challenge. The network proposed is capable of running real-time on mobile devices. In addition, we make our code and model weights publicly available.
In this paper, we develop a neural attentive interpretable recommendation system, named NAIRS. A self-attention network, as a key component of the system, is designed to assign attention weights to interacted items of a user. This attention mechanism can distinguish the importance of the various interacted items in contributing to a user profile. Based on the user profiles obtained by the self-attention network, NAIRS offers personalized high-quality recommendation. Moreover, it develops visual cues to interpret recommendations. This demo application with the implementation of NAIRS enables users to interact with a recommendation system, and it persistently collects training data to improve the system. The demonstration and experimental results show the effectiveness of NAIRS.
Noisy labeled data represent a rich source of information that often are easily accessible and cheap to obtain, but label noise might also have many negative consequences if not accounted for. How to fully utilize noisy labels has been studied extensively within the framework of standard supervised machine learning over a period of several decades. However, very little research has been conducted on solving the challenge posed by noisy labels in non-standard settings. This includes situations where only a fraction of the samples are labeled (semi-supervised) and each high-dimensional sample is associated with multiple labels. In this work, we present a novel semi-supervised and multi-label dimensionality reduction method that effectively utilizes information from both noisy multi-labels and unlabeled data. With the proposed Noisy multi-label semi-supervised dimensionality reduction (NMLSDR) method, the noisy multi-labels are denoised and unlabeled data are labeled simultaneously via a specially designed label propagation algorithm. NMLSDR then learns a projection matrix for reducing the dimensionality by maximizing the dependence between the enlarged and denoised multi-label space and the features in the projected space. Extensive experiments on synthetic data, benchmark datasets, as well as a real-world case study, demonstrate the effectiveness of the proposed algorithm and show that it outperforms state-of-the-art multi-label feature extraction algorithms.
In Reinforcement Learning (RL), an agent explores the environment and collects trajectories into the memory buffer for later learning. However, the collected trajectories can easily be imbalanced with respect to the achieved goal states. The problem of learning from imbalanced data is a well-known problem in supervised learning, but has not yet been thoroughly researched in RL. To address this problem, we propose a novel Curiosity-Driven Prioritization (CDP) framework to encourage the agent to over-sample those trajectories that have rare achieved goal states. The CDP framework mimics the human learning process and focuses more on relatively uncommon events. We evaluate our methods using the robotic environment provided by OpenAI Gym. The environment contains six robot manipulation tasks. In our experiments, we combined CDP with Deep Deterministic Policy Gradient (DDPG) with or without Hindsight Experience Replay (HER). The experimental results show that CDP improves both performance and sample-efficiency of reinforcement learning agents, compared to state-of-the-art methods.
In this paper, we propose a data collaboration analysis method for distributed datasets. The proposed method is a centralized machine learning while training datasets and models remain distributed over some institutions. Recently, data became large and distributed with decreasing costs of data collection. If we can centralize these distributed datasets and analyse them as one dataset, we expect to obtain novel insight and achieve a higher prediction performance compared with individual analyses on each distributed dataset. However, it is generally difficult to centralize the original datasets due to their huge data size or regarding a privacy-preserving problem. To avoid these difficulties, we propose a data collaboration analysis method for distributed datasets without sharing the original datasets. The proposed method centralizes only intermediate representation constructed individually instead of the original dataset.
In the past few years, the research community has dedicated growing interest to the issue of false news circulating on social networks. The widespread attention on detecting and characterizing false news has been motivated by considerable backlashes of this threat against the real world. As a matter of fact, social media platforms exhibit peculiar characteristics, with respect to traditional news outlets, which have been particularly favorable to the proliferation of deceptive information. They also present unique challenges for all kind of potential interventions on the subject. As this issue becomes of global concern, it is also gaining more attention in academia. The aim of this survey is to offer a comprehensive study on the recent advances in terms of detection, characterization and mitigation of false news that propagate on social media, as well as the challenges and the open questions that await future research on the field. We use a data-driven approach, focusing on a classification of the features that are used in each study to characterize false information and on the datasets used for instructing classification methods. At the end of the survey, we highlight emerging approaches that look most promising for addressing false news.
Pre-conditioning is a well-known concept that can significantly improve the convergence of optimization algorithms. For noise-free problems, where good pre-conditioners are not known a priori, iterative linear algebra methods offer one way to efficiently construct them. For the stochastic optimization problems that dominate contemporary machine learning, however, this approach is not readily available. We propose an iterative algorithm inspired by classic iterative linear solvers that uses a probabilistic model to actively infer a pre-conditioner in situations where Hessian-projections can only be constructed with strong Gaussian noise. The algorithm is empirically demonstrated to efficiently construct effective pre-conditioners for stochastic gradient descent and its variants. Experiments on problems of comparably low dimensionality show improved convergence. In very high-dimensional problems, such as those encountered in deep learning, the pre-conditioner effectively becomes an automatic learning-rate adaptation scheme, which we also empirically show to work well.
Human decision-making deviates from the optimal solution, that maximizes cumulative rewards, in many situations. Here we approach this discrepancy from the perspective of bounded rationality and our goal is to provide a justification for such seemingly sub-optimal strategies. More specifically we investigate the hypothesis, that humans do not know optimal decision-making algorithms in advance, but instead employ a learned, resource-bounded approximation. The idea is formalized through combining a recently proposed meta-learning model based on Recurrent Neural Networks with a resource-bounded objective. The resulting approach is closely connected to variational inference and the Minimum Description Length principle. Empirical evidence is obtained from a two-armed bandit task. Here we observe patterns in our family of models that resemble differences between individual human participants.
Robust MDPs (RMDPs) can be used to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution are determined by the ambiguity set—the set of plausible transition probabilities—which is usually constructed as a multi-dimensional confidence region. Existing methods construct ambiguity sets as confidence regions using concentration inequalities which leads to overly conservative solutions. This paper proposes a new paradigm that can achieve better solutions with the same robustness guarantees without using confidence regions as ambiguity sets. To incorporate prior knowledge, our algorithms optimize the size and position of ambiguity sets using Bayesian inference. Our theoretical analysis shows the safety of the proposed method, and the empirical results demonstrate its practical promise.
advertorch is a toolbox for adversarial robustness research. It contains various implementations for attacks, defenses and robust training methods. advertorch is built on PyTorch (Paszke et al., 2017), and leverages the advantages of the dynamic computational graph to provide concise and efficient reference implementations. The code is licensed under the LGPL license and is open sourced at https://…/advertorch .
Amid historically low response rates, survey researchers seek ways to reduce respondent burden while measuring desired concepts with precision. We propose to ask fewer questions of respondents and impute missing responses via probabilistic matrix factorization. The most informative questions per respondent are chosen sequentially using active learning with variance minimization. We begin with Gaussian responses standard in matrix completion and derive a simple active strategy with closed-form posterior updates. Next we model responses more realistically as ordinal logit; posterior inference and question selection are adapted to the nonconjugate setting. We simulate our matrix sampling procedure on data from real-world surveys. Our active question selection achieves efficiency gains over baselines and can benefit from available side information about respondents.
Neural architecture search (NAS) is a promising research direction that has the potential to replace expert-designed networks with learned, task-specific architectures. In this work, in order to help ground the empirical results in this field, we propose new NAS baselines that build off the following observations: (i) NAS is a specialized hyperparameter optimization problem; and (ii) random search is a competitive baseline for hyperparameter optimization. Leveraging these observations, we evaluate both random search with early-stopping and a novel random search with weight-sharing algorithm on two standard NAS benchmarks—PTB and CIFAR-10. Our results show that random search with early-stopping is a competitive NAS baseline, e.g., it performs at least as well as ENAS, a leading NAS method, on both benchmarks. Additionally, random search with weight-sharing outperforms random search with early-stopping, achieving a state-of-the-art NAS result on PTB and a highly competitive result on CIFAR-10. Finally, we explore the existing reproducibility issues of published NAS results. We note the lack of source material needed to exactly reproduce these results, and further discuss the robustness of published results given the various sources of variability in NAS experimental setups. Relatedly, we provide all information (code, random seeds, documentation) needed to exactly reproduce our results, and report our random search with weight-sharing results for each benchmark on two independent experimental runs.
Convolutional Neural Networks (CNNs) are the state-of-the-art algorithms used in computer vision. However, these models often suffer from the lack of interpretability of their information transformation process. To address this problem, we introduce a novel model called Sparse Deep Predictive Coding (SDPC). In a biologically realistic manner, SDPC mimics how the brain is efficiently representing visual information. This model complements the hierarchical convolutional layers found in CNNs with the feed-forward and feed-back update scheme described in the Predictive Coding (PC) theory and found in the architecture of the mammalian visual system. We experimentally demonstrate on two databases that the SDPC model extracts qualitatively meaningful features. These features, besides being similar to some of the biological Receptive Fields of the visual cortex, also represent hierarchically independent components of the image that are crucial to describe it in a generic manner. For the first time, the SDPC model demonstrates a meaningful representation of features within the hierarchical generative model and of the decision-making process leading to a specific prediction. A quantitative analysis reveals that the features extracted by the SDPC model encode the input image into a representation that is both easily classifiable and robust to noise.
This paper is motivated by the desire to develop distributed algorithms for nonconvex optimization problems with complicated constraints associated with a network. The network can be a physical one, such as an electric power network, where the constraints are nonlinear power flow equations, or an abstract one that represents constraint couplings between decision variables of different agents. Thus, this type of problems are ubiquitous in applications. Despite the recent development of distributed algorithms for nonconvex programs, highly complicated constraints still pose a significant challenge in theory and practice. We first identify some intrinsic difficulties with the existing algorithms based on the alternating direction method of multipliers (ADMM) for dealing with such problems. We then propose a reformulation for constrained nonconvex programs that enables us to design a two-level algorithm, which embeds a specially structured three-block ADMM at the inner level in an augmented Lagrangian method (ALM) framework. Furthermore, we prove the global convergence of this new scheme for both smooth and nonsmooth constrained nonconvex programs. The proof builds on and extends the classic and recent works on ALM and ADMM. Finally, we demonstrate with computation that the new scheme provides convergent and parallelizable algorithms for several classes of constrained nonconvex programs, for all of which existing algorithms may fail. To the best of our knowledge, this is the first distributed algorithm that extends the ADMM architecture to general nonlinear nonconvex constrained optimization. The proposed algorithmic framework also provides a new principled way for parallel computation of constrained nonconvex optimization.
In this paper, we propose a simple, fast and easy to implement algorithm LOSSGRAD (locally optimal step-size in gradient descent), which automatically modifies the step-size in gradient descent during neural networks training. Given a function , a point , and the gradient of , we aim to find the step-size which is (locally) optimal, i.e. satisfies: Making use of quadratic approximation, we show that the algorithm satisfies the above assumption. We experimentally show that our method is insensitive to the choice of initial learning rate while achieving results comparable to other methods.
The increasing occurrence of ordinal data, mainly sociodemographic, led to a renewed research interest in ordinal regression, i.e. the prediction of ordered classes. Besides model accuracy, the interpretation of these models itself is of high relevance, and existing approaches therefore enforce e.g. model sparsity. For high dimensional or highly correlated data, however, this might be misleading due to strong variable dependencies. In this contribution, we aim for an identification of feature relevance bounds which – besides identifying all relevant features – explicitly differentiates between strongly and weakly relevant features.
In recent years, the size of big linked data has grown rapidly and this number is still rising. Big linked data and knowledge bases come from different domains such as life sciences, publications, media, social web, and so on. However, with the rapid increasing of data, it is very challenging for people to acquire a comprehensive collection of cross domain knowledge to meet their needs. Under this circumstance, it is extremely difficult for people without expertise to extract knowledge from various domains. Therefore, nowadays human limited knowledge can’t feed the high requirement for discovering large amount of cross domain knowledge. In this research, we present a big graph analytics framework aims at addressing this issue by providing semantic methods to facilitate the management of big graph data from close domains in order to discover cross domain knowledge in a more accurate and efficient way.
Ecological processes may exhibit memory to past disturbances affecting the resilience of ecosystems to future disturbance. Understanding the role of ecological memory in shaping ecosystem responses to disturbance under global change is a critical step toward developing effective adaptive management strategies to maintain ecosystem function and biodiversity. We developed EcoMem, an R package for quantifying ecological memory functions using common environmental time series data (continuous, count, proportional) applying a Bayesian hierarchical framework. The package estimates memory functions for continuous and binary (e.g., disturbance chronology) variables making no a priori assumption on the form of the functions. EcoMem allows users to quantify ecological memory for a wide range of ecosystem processes and responses. The utility of the package to advance understanding of the memory of ecosystems to environmental drivers is demonstrated using a simulated dataset and a case study assessing the memory of boreal tree growth to insect defoliation.