Given the intractability of large scale HIN, network embedding which learns low dimensional proximity-preserved representations for nodes in the new space becomes a natural way to analyse HIN. However, two challenges arise in HIN embedding. (1) Different HIN structures with different semantic meanings play different roles in capturing relationships among nodes in HIN, how can we learn personalized preferences over different meta-paths for each individual node in HIN? (2) With the number of large scale HIN increasing dramatically in various web services, how can we update the embedding information of new nodes in an efficient way? To tackle these challenges, we propose a Hierarchical Attentive Heterogeneous information network Embedding (HAHE ) model which is capable of learning personalized meta-path preferences for each node as well as updating the embedding information for each new node efficiently with only its neighbor node information. The proposed HAHE model extracts the semantic relationships among nodes in the semantic space based on different meta-paths and adopts a neighborhood attention layer to conduct weighted aggregations of neighborhood structure features for each node, enabling the embedding information of each new node to be updated efficiently. Besides, a meta-path attention layer is also employed to learn the personalized meta-path preferences for each individual node. Extensive experiments on several real-world datasets show that our proposed HAHE model significantly outperforms the state-of-the-art methods in terms of various evaluation metrics.
Performance of clustering algorithms is evaluated with the help of accuracy metrics. There is a great diversity of clustering algorithms, which are key components of many data analysis and exploration systems. However, there exist only few metrics for the accuracy measurement of overlapping and multi-resolution clustering algorithms on large datasets. In this paper, we first discuss existing metrics, how they satisfy a set of formal constraints, and how they can be applied to specific cases. Then, we propose several optimizations and extensions of these metrics. More specifically, we introduce a new indexing technique to reduce both the runtime and the memory complexity of the Mean F1 score evaluation. Our technique can be applied on large datasets and it is faster on a single CPU than state-of-the-art implementations running on high-performance servers. In addition, we propose several extensions of the discussed metrics to improve their effectiveness and satisfaction to formal constraints without affecting their efficiency. All the metrics discussed in this paper are implemented in C++ and are available for free as open-source packages that can be used either as stand-alone tools or as part of a benchmarking system to compare various clustering algorithms.
In this work, we propose to learn a generative model using both learned features (through a latent space) and memories (through neighbors). Although human learning makes seamless use of both learned perceptual features and instance recall, current generative learning paradigms only make use of one of these two components. Take, for instance, flow models, which learn a latent space of invertible features that follow a simple distribution. Conversely, kernel density techniques use instances to shift a simple distribution into an aggregate mixture model. Here we propose multiple methods to enhance the latent space of a flow model with neighborhood information. Not only does our proposed framework represent a more human-like approach by leveraging both learned features and memories, but it may also be viewed as a step forward in non-parametric methods. The efficacy of our model is shown empirically with standard image datasets. We observe compelling results and a significant improvement over baselines.
MapReduce and its variants have significantly simplified and accelerated the process of developing parallel programs. However, most MapReduce implementations focus on data-intensive tasks while many real-world tasks are compute intensive and their data can fit distributedly into the memory. For these tasks, the speed of MapReduce programs can be much slower than those hand-optimized ones. We present Blaze, a C++ library that makes it easy to develop high performance parallel programs for such compute intensive tasks. At the core of Blaze is a highly-optimized in-memory MapReduce function, which has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range. We also offer additional conveniences that make developing parallel programs similar to developing serial programs. These improvements make Blaze an easy-to-use cluster computing library that approaches the speed of hand-optimized parallel code. We apply Blaze to some common data mining tasks, including word frequency count, PageRank, k-means, expectation maximization (Gaussian mixture model), and k-nearest neighbors. Blaze outperforms Apache Spark by more than 10 times on average for these tasks, and the speed of Blaze scales almost linearly with the number of nodes. In addition, Blaze uses only the MapReduce function and 3 utility functions in its implementation while Spark uses almost 30 different parallel primitives in its official implementation.
The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [18, 12, 28, 22], even when unobserved confounding bias may be present. But, as pointed out in [20], causal inference in non-iid contexts is challenging due to the presence of both unobserved confounding and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [21], a generalization of latent projection mixed graphs [30], to represent causal models of this type and provide a complete algorithm for non-parametric identification in these models. We then demonstrate how statistical inference may be performed on causal parameters identified by this algorithm. In particular, we consider cases where only a single sample is available for parts of the model due to full interference, i.e., all units are pathwise dependent and neighbors’ treatments affect each others’ outcomes [26]. We apply these techniques to a synthetic data set which considers users sharing fake news articles given the structure of their social network, user activity levels, and baseline demographics and socioeconomic covariates.
A widespread approach in machine learning to evaluate the quality of a classifier is to cross — classify predicted and actual decision classes in a confusion matrix, also called error matrix. A classification tool which does not assume distributional parameters but only information contained in the data is based on the rough set data model which assumes that knowledge is given only up to a certain granularity. Using this assumption and the technique of confusion matrices, we define various indices and classifiers based on rough confusion matrices.
We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample synthetic records through this copula. Despite the inherently numerical nature of Gaussian correlations we construct a method that is applicable to both numerical and categorical attributes alike. Our mechanism is efficient in that it only takes time proportional to the square of the number of attributes in the dataset. We propose a differentially private way of constructing the Gaussian copula without compromising computational efficiency. Through experiments on three real-world datasets, we show that we can obtain highly accurate answers to the set of all one-way marginal, and two-and three-way positive conjunction queries, with 99\% of the query answers having absolute (fractional) error rates between 0.01 to 3\%. Furthermore, for a majority of two-way and three-way queries, we outperform independent noise addition through the well-known Laplace mechanism. In terms of computational time we demonstrate that our mechanism can output synthetic datasets in around 6 minutes 47 seconds on average with an input dataset of about 200 binary attributes and more than 32,000 rows, and about 2 hours 30 mins to execute a much larger dataset of about 700 binary attributes and more than 5 million rows. To further demonstrate scalability, we ran the mechanism on larger (artificial) datasets with 1,000 and 2,000 binary attributes (and 5 million rows) obtaining synthetic outputs in approximately 6 and 19 hours, respectively.
Perturbative GAN, which replaces convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that adds a fixed noise mask, is proposed. Compared with the convolu-tional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the incep-tion score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Perturbative GAN is evaluated using con-ventional datasets (CIFAR10, LSUN, ImageNet), both in the cases when a perturbation layer is adopted only for Generators and when it is introduced to both Generator and Discriminator.
Deep neural networks have achieved great success in multiple learning problems, and attracted increasing attention from the medicine community. In reality, however, the limited availability and high costs of medical data is a major challenge of applying deep neural networks to computer-aided diagnosis and treatment planning. We address this challenge with adaptive virtual patients (AVPs) and the associated physics-informed learning framework. Specifically, the original training dataset is fused with an additional dataset of AVPs, which are generated by a data-driven model and the associated supervision (e.g., labels) is obtained by a physics-based approach. A key novelty in the proposed framework is the bidirectional and uncoupled generative invertible networks (GIN), which can extract pathophysiological features from the training medical image and generate pathophysiologically meaningful virtual patients. In order to mitigate the possibly high labeling cost of physical experiments, a $\mu$-measure design is conducted: this allows the AVPs to not only further explore the uncertain regions, but also balance the label distribution. We then discuss the pathophysiological interpretability of GIN both theoretically and experimentally, and demonstrate the effectiveness of AVPs using a real medical image dataset, in which the proposed AVPs lower the labeling cost by 90% while achieving a 15% improvement in prediction accuracy.
In the recent years, the scale of graph datasets has increased to such a degree that a single machine is not capable of efficiently processing large graphs. Thereby, efficient graph partitioning is necessary for those large graph applications. Traditional graph partitioning generally loads the whole graph data into the memory before performing partitioning; this is not only a time consuming task but it also creates memory bottlenecks. These issues of memory limitation and enormous time complexity can be resolved using stream-based graph partitioning. A streaming graph partitioning algorithm reads vertices once and assigns that vertex to a partition accordingly. This is also called an one-pass algorithm. This paper proposes an efficient window-based streaming graph partitioning algorithm called WStream. The WStream algorithm is an edge-cut partitioning algorithm, which distributes a vertex among the partitions. Our results suggest that the WStream algorithm is able to partition large graph data efficiently while keeping the load balanced across different partitions, and communication to a minimum. Evaluation results with real workloads also prove the effectiveness of our proposed algorithm, and it achieves a significant reduction in load imbalance and edge-cut with different ranges of dataset.
Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents’ interaction, where well-coordinated actions among the agents are crucial to achieve the target goal better at these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario when (i) the communication bandwidth is limited and (ii) the agents share the communication medium so that only a restricted number of agents are able to simultaneously use the medium, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode the messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcasting their (encoded) messages, by learning the importance of each agent’s partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap between SchedNet and other mechanisms such as the ones without communication and with vanilla scheduling methods, e.g., round robin, ranging from 32% to 43%.
In this paper, we propose a new drag and drop interaction technique for graphs. We designed this interaction to support analysis in complex multidimensional and temporal graphs. The drag and drop interaction is enhanced with an intuitive and controllable animation, in support of comparison tasks.
AI intensive systems that operate upon user data face the challenge of balancing data utility with privacy concerns. We propose the idea and present the prototype of an open-source tool called Privacy Utility Trade-off (PUT) Workbench which seeks to aid software practitioners to take such crucial decisions. We pick a simple privacy model that doesn’t require any background knowledge in Data Science and show how even that can achieve significant results over standard and real-life datasets. The tool and the source code is made freely available for extensions and usage.
In this paper we propose a general framework to analyze prediction in time series models and show how a wide class of popular time series models satisfies this framework. We postulate a set of high-level assumptions, and formally verify these assumptions for the aforementioned time series models. Our framework coincides with that of Beutner et al. (2019, arXiv:1710.00643) who establish the validity of conditional confidence intervals for predictions made in this framework. The current paper therefore complements the results in Beutner et al. (2019, arXiv:1710.00643) by providing practically relevant applications of their theory.
Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous ‘policy gradient theorems’ are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which ‘jumps’ to an intermediate node, not directly to the objective function. We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm.
The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem that learns reliable regression coefficient when features are not accessible entirely at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (\textit{RoOFS}) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. We also prove that our algorithm has a restricted error bound compared to the optimal solution. Extensive empirical experiments in both synthetic and real-world datasets demonstrated that the effectiveness of our new method is superior to that of existing methods in the recovery of both feature selection and regression coefficients, with very competitive efficiency.
The paper surveys recent extensions of the Long-Short Term Memory networks to handle tree structures from the perspective of learning non-trivial forms of isomorph structured transductions. It provides a discussion of modern TreeLSTM models, showing the effect of the bias induced by the direction of tree processing. An empirical analysis is performed on real-world benchmarks, highlighting how there is no single model adequate to effectively approach all transduction problems.
AI software is still software. Software engineers need better tools to make better use of AI software. For example, for software defect prediction and software text mining, the default tunings for software analytics tools can be improved with ‘hyperparameter optimization’ tools that decide (e.g.,) how many trees are needed in a random forest. Hyperparameter optimization is unnecessarily slow when optimizers waste time exploring redundant options (i.e., pairs of tunings with indistinguishably different results). By ignoring redundant tunings, the Dodge(E) hyperparameter optimization tool can run orders of magnitude faster, yet still find better tunings than prior state-of-the-art algorithms (for software defect prediction and software text mining).