We introduce a class of random graphs with a hierarchical community structure, which we call the hierarchical configuration model. On the inter-community level, the graph is a configuration model, and on the intra-community level, every vertex in the configuration model is replaced by a community: a small graph. These communities may have any shape, as long as they are connected. For these hierarchical graphs, we find the size of the largest component, the degree distribution and the clustering coefficient. Furthermore, we determine the conditions under which a giant percolation cluster exists, and find its size.
In the last ten years, the employment of symbolic methods has substantially extended both the theory and the applications of statistics and probability. This survey reviews the development of a symbolic technique arising from classical umbral calculus, as introduced by Rota and Taylor in The usefulness of this symbolic technique is twofold. The first is to show how new algebraic identities drive in discovering insights among topics apparently very far from each other and related to probability and statistics. One of the main tools is a formal generalization of the convolution of identical probability distributions, which allows us to employ compound Poisson random variables in various topics that are only somewhat interrelated. Having got a different and deeper viewpoint, the second goal is to show how to set up algorithmic processes performing efficiently algebraic calculations. In particular, the challenge of finding these symbolic procedures should lead to a new method, and it poses new problems involving both computational and conceptual issues. Evidence of efficiency in applying this symbolic method will be shown within statistical inference, parameter estimation, L\’evy processes, and, more generally, problems involving multivariate functions. The symbolic representation of Sheffer polynomial sequences allows us to carry out a unifying theory of classical, Boolean and free cumulants. Recent connections within random matrices have extended the applications of the symbolic method.
The tripartite graph is one of the commonest topological structures in social tagging systems such as Delicious, which has three types of nodes (i.e., users, URLs and tags). Traditional recommender systems developed based on collaborative filtering for the social tagging systems bring very high demands on CPU time cost. In this paper, to overcome this drawback, we propose a novel approach that extracts non-overlapping user clusters and corresponding overlapping item clusters simultaneously through coarse clustering to accelerate the user-based collaborative filtering and develop a fast recommendation algorithm for the social tagging systems. The experimental results show that the proposed approach is able to dramatically reduce the processing time cost greater than and relatively enhance the accuracy in comparison with the ordinary user-based collaborative filtering algorithm.
Bag-of-ngram based methods still achieve state-of-the-art results for tasks such as sentiment classification of long movie reviews, despite semantic information is partially lost for these methods. Many document embeddings methods have been proposed to capture semantics, but they still can’t outperform bag-of-ngram based methods on this task. In this paper, we modify the architecture of the recently proposed Paragraph Vector, allowing it to learn document vectors by predicting not only words, but n-gram features as well. Our model is able to capture both semantics and word order in documents while keeping the expressive power of learned vectors. Experimental results on IMDB movie review dataset shows that our model outperforms previous deep learning models and bag-of-ngram based models due to the above advantages. More robust results are also obtained when our model is combined with other models. The source code of our model will be also published together with this paper.
Mining movement data to reveal interesting behavioral patterns has gained attention in recent years. One such pattern is the convoy pattern which consists of at least m objects moving together for at least k consecutive time instants where m and k are user-defined parameters. Existing algorithms for detecting convoy patterns, however do not scale to real-life dataset sizes. Therefore a distributed algorithm for convoy mining is inevitable. In this paper, we discuss the problem of convoy mining and analyze different data partitioning strategies to pave the way for a generic distributed convoy pattern mining algorithm.
We explore the problem of binary classification in machine learning, with a twist – the classifier is allowed to abstain on any datum, professing ignorance about the true class label without committing to any prediction. This is directly motivated by applications like medical diagnosis and fraud risk assessment, in which incorrect predictions have potentially calamitous consequences. We focus on a recent spate of theoretically driven work in this area that characterizes how allowing abstentions can lead to fewer errors in very general settings. Two areas are highlighted: the surprising possibility of zero-error learning, and the fundamental tradeoff between predicting sufficiently often and avoiding incorrect predictions. We review efficient algorithms with provable guarantees for each of these areas. We also discuss connections to other scenarios, notably active learning, as they suggest promising directions of further inquiry in this emerging field.
Multi-relational learning has received lots of attention from researchers in various research communities. Most existing methods either suffer from superlinear per-iteration cost, or are sensitive to the given ranks. To address both issues, we propose a scalable core tensor trace norm Regularized Orthogonal Iteration Decomposition (ROID) method for full or incomplete tensor analytics, which can be generalized as a graph Laplacian regularized version by using auxiliary information or a sparse higher-order orthogonal iteration (SHOOI) version. We first induce the equivalence relation of the Schatten p-norm (0<p<\infty) of a low multi-linear rank tensor and its core tensor. Then we achieve a much smaller matrix trace norm minimization problem. Finally, we develop two efficient augmented Lagrange multiplier algorithms to solve our problems with convergence guarantees. Extensive experiments using both real and synthetic datasets, even though with only a few observations, verified both the efficiency and effectiveness of our methods.
In this work we discuss a novel model prior probability for variable selection in linear regression. The idea is to determine the prior mass in an objective sense, by considering the worth of each of the possible regression models, given the number of covariates under consideration. Through a simulation study, we show that the proposed prior outperforms the uniform prior and the Scott \& Berger prior in a scenario of no prior knowledge about the size of the true regression models. We illustrate the use of the prior using two well-known data sets with, respectively, 15 and 4 covariates.
The report proposes a new approach for inverse reinforcement learning based on deep Gaussian process (GP), which is capable of learning complicated reward structures with few demonstrations. The model stacks multiple latent GP layers to learn abstract representations of the state feature space, which is linked to the demonstrations through the Maximum Entropy learning framework. As analytic derivation of the model evidence is prohibitive due to the nonlinearity of latent variables, variational inference is employed for approximate inference, based on a special choice of variational distributions. This guards the model from over training, achieving the Automatic Occam’s Razor. Experiments on the benchmark test, i.e., object world, as well as a new setup, i.e., binary world, are performed, where the proposed method outperforms state-of-the-art approaches.
Vehicles are becoming more and more connected, this opens up a larger attack surface which not only affects the passengers inside vehicles, but also people around them. These vulnerabilities exist because modern systems are built on the comparatively less secure and old CAN bus framework which lacks even basic authentication. Since a new protocol can only help future vehicles and not older vehicles, our approach tries to solve the issue as a data analytics problem and use machine learning techniques to secure cars. We develop a Hidden Markov Model to detect anomalous states from real data collected from vehicles. Using this model, while a vehicle is in operation, we are able to detect and issue alerts. Our model could be integrated as a plug-n-play device in all new and old cars.
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied. We explore this relationship by applying simulated annealing to an SGMCMC algorithm. Furthermore, we extend recent SG-MCMC methods with two key components: i) adaptive preconditioners (as in ADAgrad or RMSprop), and ii) adaptive element-wise momentum weights. The zero-temperature limit gives a novel stochastic optimization method with adaptive element-wise momentum weights, while conventional optimization methods only have a shared, static momentum weight. Under certain assumptions, our theoretical analysis suggests the proposed simulated annealing approach converges close to the global optima. Experiments on several deep neural network models show state-of-the-art results compared to related stochastic optimization algorithms.
The histogram method is a powerful non-parametric approach for estimating the probability density function of a continuous variable. But the construction of a histogram, compared to the parametric approaches, demands a large number of observations to capture the underlying density function. Thus it is not suitable for analyzing a sparse data set, a collection of units with a small size of data. In this paper, by employing the probabilistic topic model, we develop a novel Bayesian approach to alleviating the sparsity problem in the conventional histogram estimation. Our method estimates a unit’s density function as a mixture of basis histograms, in which the number of bins for each basis, as well as their heights, is determined automatically. The estimation procedure is performed by using the fast and easy-to-implement collapsed Gibbs sampling. We apply the proposed method to synthetic data, showing that it performs well.