Cheeger’s inequality shows that any undirected graph $G$ with minimum nonzero normalized Laplacian eigenvalue $\lambda_G$ has a cut with conductance at most $O(\sqrt{\lambda_G})$. Qualitatively, Cheeger’s inequality says that if the relaxation time of a graph is high, there is a cut that certifies this. However, there is a gap in this relationship, as cuts can have conductance as low as $\Theta(\lambda_G)$. To better approximate the relaxation time of a graph, we consider a more general object. Instead of bounding the mixing time with cuts, we bound it with cuts in graphs obtained by Schur complementing out vertices from the graph $G$. Combinatorially, these Schur complements describe random walks in $G$ restricted to a subset of its vertices. As a result, all Schur complement cuts have conductance at least $\Omega(\lambda_G)$. We show that unlike with cuts, this inequality is tight up to a constant factor. Specifically, there is a Schur complement cut with conductance at most $O(\lambda_G)$.
It is important for big data systems to identify their performance bottleneck. However, the popular indicators such as resource utilizations, are often misleading and incomparable with each other. In this paper, a novel indicator framework which can directly compare the impact of different indicators with each other is proposed to identify and analyze the performance bottleneck efficiently. A methodology which can construct the indicator from the performance change with the CPU frequency scaling is described. Spark is used as an example of a big data system and two typical SQL benchmarks are used as the workloads to evaluate the proposed method. Experimental results show that the proposed method is accurate compared with the resource utilization method and easy to implement compared with some white-box method. Meanwhile, the analysis with our indicators lead to some interesting findings and valuable performance optimization suggestions for big data systems.
Astronomy is well recognized as big data driven science. As the novel observation infrastructures are developed, the sky survey cycles have been shortened from a few days to a few seconds, causing data processing pressure to shift from offline to online. However, existing scientific databases focus on offline analysis of long-term historical data, not real-time and low latency analysis of large-scale newly arriving data. In this paper, a cloud based method is proposed to efficiently analyze scientific events on large-scale newly arriving data. The solution is implemented as a highly efficient system, namely Aserv. A set of compact data store and index structures are proposed to describe the proposed scientific events and a typical analysis pattern is formulized as a set of query operations. Domain aware filter, accuracy aware data partition, highly efficient index and frequently used statistical data designs are four key methods to optimize the performance of Aserv. Experimental results under the typical cloud environment show that the presented optimization mechanism can meet the low latency demand for both large data insertion and scientific event analysis. Aserv can insert 3.5 million rows of data within 3 seconds and perform the heaviest query on 6.7 billion rows of data also within 3 seconds. Furthermore, a performance model is given to help Aserv choose the right cloud resource setup to meet the guaranteed real-time performance requirement.
With today’s abundant streams of data, the only constant we can rely on is change. For stream classification algorithms, it is necessary to adapt to concept drift. This can be achieved by monitoring the model error, and triggering counter measures as changes occur. In this paper, we propose a drift detection mechanism that fits a beta distribution to the model error, and treats abnormal behavior as drift. It works with any given model, leverages prior knowledge about this model, and allows to set application-specific confidence thresholds. Experiments confirm that it performs well, in particular when drift occurs abruptly.
We propose a visual query language for interactively exploring large-scale knowledge graphs. Starting from an overview, the user explores bar charts through three interactions: class expansion, property expansion, and subject/object expansion. A major challenge faced is performance: a state-of-the-art SPARQL engine may require tens of minutes to compute the multiway join, grouping and counting required to render a bar chart. A promising alternative is to apply approximation through online aggregation, trading precision for performance. However, state-of-the-art online aggregation algorithms such as Wander Join have two limitations for our exploration scenario: (1) a high number of rejected paths slows the convergence of the count estimations, and (2) no unbiased estimator exists for counts under the distinct operator. We thus devise a specialized algorithm for online aggregation that augments Wander Join with exact partial computations to reduce the number of rejected paths encountered, as well as a novel estimator that we prove to be unbiased in the case of the distinct operator. In an experimental study with random interactions exploring two large-scale knowledge graphs, our algorithm shows a clear reduction in error with respect to computation time versus Wander Join.
Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called {\em dataset distillation}: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to {\em synthesize} a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress $60,000$ MNIST training images into just $10$ synthetic {\em distilled images} (one per class) and achieve close to original performance with only a few steps of gradient descent, given a particular fixed network initialization. We evaluate our method in a wide range of initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods in most settings.
Standard kernels such as Mat\’ern or RBF kernels only encode simple monotonic dependencies within the input space. Spectral mixture kernels have been proposed as general-purpose, flexible kernels for learning and discovering more complicated patterns in the data. Spectral mixture kernels have recently been generalized into non-stationary kernels by replacing the mixture weights, frequency means and variances by input-dependent functions. These functions have also been modelled as Gaussian processes on their own. In this paper we propose modelling the hyperparameter functions with neural networks, and provide an experimental comparison between the stationary spectral mixture and the two non-stationary spectral mixtures. Scalable Gaussian process inference is implemented within the sparse variational framework for all the kernels considered. We show that the neural variant of the kernel is able to achieve the best performance, among alternatives, on several benchmark datasets.
In this manuscript, we will investigate the deconvolution method for recovering pulse arrival times and amplitudes using synthetic data. For the deconvolution procedure to have hope of recovering amplitudes and arrivals, the average waiting time between events must be at least 10 times the time step.
Dynamic time warping ($\texttt{DTW}$) is one of the most used distance functions to compare time series, e.$\,$g. in nearest neighbor classifiers. Yet, fast state of the art algorithms only compare 1-dimensional time series efficiently. One of these state of the art algorithms uses a lower bound ($\texttt{LB}_\texttt{Keogh}$) introduced by E. Keogh to prune $\texttt{DTW}$ computations. We introduce $\texttt{LB}_\texttt{Box}$ as a canonical extension to $\texttt{LB}_\texttt{Keogh}$ on multi-dimensional time series. We evaluate its performance conceptually and experimentally and show that an alternative to $\texttt{LB}_\texttt{Box}$ is necessary for multi-dimensional time series. We also propose a new algorithm for the dog-keeper distance ($\texttt{DK}$) which is an alternative distance function to $\texttt{DTW}$ and show that it outperforms $\texttt{DTW}$ with $\texttt{LB}_\texttt{Box}$ by more than one order of magnitude on multi-dimensional time series.
Generative Adversarial Networks (GANs) have proven to be a powerful framework for learning to draw samples from complex distributions. However, GANs are also notoriously difficult to train, with mode collapse and oscillations a common problem. We hypothesize that this is at least in part due to the evolution of the generator distribution and the catastrophic forgetting tendency of neural networks, which leads to the discriminator losing the ability to remember synthesized samples from previous instantiations of the generator. Recognizing this, our contributions are twofold. First, we show that GAN training makes for a more interesting and realistic benchmark for continual learning methods evaluation than some of the more canonical datasets. Second, we propose leveraging continual learning techniques to augment the discriminator, preserving its ability to recognize previous generator samples. We show that the resulting methods add only a light amount of computation, involve minimal changes to the model, and result in better overall performance on the examined image and text generation tasks.
Recently, techniques for applying convolutional neural networks to graph-structured data have emerged. Graph convolutional neural networks (GCNNs) have been used to address node and graph classification and matrix completion. Although the performance has been impressive, the current implementations have limited capability to incorporate uncertainty in the graph structure. Almost all GCNNs process a graph as though it is a ground-truth depiction of the relationship between nodes, but often the graphs employed in applications are themselves derived from noisy data or modelling assumptions. Spurious edges may be included; other edges may be missing between nodes that have very strong relationships. In this paper we adopt a Bayesian approach, viewing the observed graph as a realization from a parametric family of random graphs. We then target inference of the joint posterior of the random graph parameters and the node (or graph) labels. We present the Bayesian GCNN framework and develop an iterative learning procedure for the case of assortative mixed-membership stochastic block models. We present the results of experiments that demonstrate that the Bayesian formulation can provide better performance when there are very few labels available during the training process.
Distributed learning systems have enabled training large-scale models over large amount of data in significantly shorter time. In this paper, we focus on decentralized distributed deep learning systems and aim to achieve differential privacy with good convergence rate and low communication cost. To achieve this goal, we propose a new learning algorithm LEASGD (Leader-Follower Elastic Averaging Stochastic Gradient Descent), which is driven by a novel Leader-Follower topology and a differential privacy model.We provide a theoretical analysis of the convergence rate and the trade-off between the performance and privacy in the private setting.The experimental results show that LEASGD outperforms state-of-the-art decentralized learning algorithm DPSGD by achieving steadily lower loss within the same iterations and by reducing the communication cost by 30%. In addition, LEASGD spends less differential privacy budget and has higher final accuracy result than DPSGD under private setting.
The goal of case-based retrieval is to assist physicians in the clinical decision making process, by finding relevant medical literature in large archives. We propose a research that aims at improving the effectiveness of case-based retrieval systems through the use of automatically created document-level semantic networks. The proposed research tackles different aspects of information systems and leverages the recent advancements in information extraction and relational learning to revisit and advance the core ideas of concept-centered hypertext models. We propose a two-step methodology that in the first step addresses the automatic creation of document-level semantic networks, then in the second step it designs methods that exploit such document representations to retrieve relevant cases from medical literature. For the automatic creation of documents’ semantic networks, we design a combination of information extraction techniques and relational learning models. Mining concepts and relations from text, information extraction techniques represent the core of the document-level semantic networks’ building process. On the other hand, relational learning models have the task of enriching the graph with additional connections that have not been detected by information extraction algorithms and strengthening the confidence score of extracted relations. For the retrieval of relevant medical literature, we investigate methods that are capable of comparing the documents’ semantic networks in terms of structure and semantics. The automatic extraction of semantic relations from documents, and their centrality in the creation of the documents’ semantic networks, represent our attempt to go one step further than previous graph-based approaches.
Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks on computer clusters. With the increase of computational power, network communications have become one limiting factor on system scalability. In this paper, we observe that many deep neural networks have a large number of layers with only a small amount of data to be communicated. Based on the fact that merging some short communication tasks into a single one may reduce the overall communication time, we formulate an optimization problem to minimize the training iteration time. We develop an optimal solution named merged-gradient WFBP (MG-WFBP) and implement it in our open-source deep learning platform B-Caffe. Our experimental results on an 8-node GPU cluster with 10GbE interconnect and trace-based simulation results on a 64-node cluster both show that the MG-WFBP algorithm can achieve much better scaling efficiency than existing methods WFBP and SyncEASGD.