Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing in an hour from now. Mindful of these challenges we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.
Recent works have highlighted scale invariance or symmetry present in the weight space of a typical deep network and the adverse effect it has on the Euclidean gradient based stochastic gradient descent optimization. In this work, we show that a commonly used deep network, which uses convolution, batch normalization, reLU, max-pooling, and sub-sampling pipeline, possess more complex forms of symmetry arising from scaling-based reparameterization of the network weights. We propose to tackle the issue of the weight space symmetry by constraining the filters to lie on the unit-norm manifold. Consequently, training the network boils down to using stochastic gradient descent updates on the unit-norm manifold. Our empirical evidence based on the MNIST dataset shows that the proposed updates improve the test performance beyond what is achieved with batch normalization and without sacrificing the computational efficiency of the weight updates.
Big Data streams are being generated in a faster, bigger, and more commonplace. In this scenario, Hoeffding Trees are an established method for classification. Several extensions exist, including high-performing ensemble setups such as online and leveraging bagging. Also, $k$-nearest neighbors is a popular choice, with most extensions dealing with the inherent performance limitations over a potentially-infinite stream. At the same time, gradient descent methods are becoming increasingly popular, owing in part to the successes of deep learning. Although deep neural networks can learn incrementally, they have so far proved too sensitive to hyper-parameter options and initial conditions to be considered an effective off-the-shelf’ data-streams solution. In this work, we look at combinations of Hoeffding-trees, nearest neighbour, and gradient descent methods with a streaming preprocessing approach in the form of a random feature functions filter for additional predictive power. We further extend the investigation to implementing methods on GPUs, which we test on some large real-world datasets, and show the benefits of using GPUs for data-stream learning due to their high scalability. Our empirical evaluation yields positive results for the novel approaches that we experiment with, highlighting important issues, and shed light on promising future directions in approaches to data-stream classification.
The linear regression models are widely used statistical techniques in numerous practical applications. The standard regression model requires several assumptions about the regres- sors and the error term. The regression parameters are estimated using the least-squares method. In this paper, we consider the regression model with arbitrary regressors and with- out the error term. An explicit expression for the regression parameters vector is obtained. The unbiasedness approach is used to estimate the regression parameters and its various properties are investigated. It is shown that the resulting unbiased estimator equals the least-squares estimator for the ?xed design model. The analysis of residuals and the regres- sion sum of squares can be carried out in a natural way. The unbiased estimator of the dispersion matrix of the unbiased estimator is also obtained. Applications to AR(p) model and numerical examples are also discussed.
A large part of the use of knowledge base systems is the interpretation of the output by the end-users and the interaction with these users. Even during the development process visualisations can be a great help to the developer. We created IDPD3 as a library to visualise models of logic theories. IDPD3 is a new version of $ID^{P}_{Draw}$ and adds support for visualised interactive simulations.
To exploit the Web Ontology Language OWL as an answer set programming (ASP) language, we introduce the notion of bounded model semantics, as an intuitive and computationally advantageous alternative to its classical semantics. We show that a translation into ASP allows for solving a wide range of bounded-model reasoning tasks, including satisfiability and axiom entailment but also novel ones such as model extraction and enumeration. Ultimately, our work facilitates harnessing advanced semantic web modeling environments for the logic programming community through an ‘off-label use’ of OWL.
We give an overview of the role of information theory in statistics, and particularly in biostatistics. We recall the basic quantities in information theory; entropy, cross-entropy, conditional entropy, mutual information and Kullback-Leibler risk. Then we examine the role of information theory in estimation theory, where the log-klikelihood can be identified as being an estimator of a cross-entropy. Then the basic quantities are extended to estimators, leading to criteria for estimator selection, such as Akaike criterion and its extensions. Finally we investigate the use of these concepts in Bayesian theory; the cross-entropy of the predictive distribution can be used for model selection; a cross-validation estimator of this cross-entropy is found to be equivalent to the pseudo-Bayes factor.
Dimensionality reduction methods are very common in the field of high dimensional data analysis. Typically, algorithms for dimensionality reduction are computationally expensive. Therefore, their applications for the analysis of massive amounts of data are impractical. For example, repeated computations due to accumulated data are computationally prohibitive. In this paper, an out-of-sample extension scheme, which is used as a complementary method for dimensionality reduction, is presented. We describe an algorithm which performs an out-of-sample extension to newly-arrived data points. Unlike other extension algorithms such as Nystr\’om algorithm, the proposed algorithm uses the intrinsic geometry of the data and properties for dimensionality reduction map. We prove that the error of the proposed algorithm is bounded. Additionally to the out-of-sample extension, the algorithm provides a degree of the abnormality of any newly-arrived data point.
Building recommendation algorithms is one of the most challenging tasks in Machine Learning. Although there has been significant progress in building recommendation systems when explicit feedback is available from the users in the form of rating or text, most of the applications do not receive such feedback. Here we consider the recommendation task where the available data is the record of the items selected by different users over time for subscription or purchase. This is known as implicit feedback recommendation. Such data are usually available as large amount of user logs stored over massively distributed storage systems such as Hadoop. Therefore it is essential to have a highly scalable algorithm to build a recommender system for such applications. Here we propose a probabilistic algorithm that takes only two to three passes through the entire dataset to extract the model parameters during the training phase. We demonstrate the competitive performance of our algorithm in several empirical measures as well as the computation time in comparison with the existing algorithms on various publicly available datasets.
Text actionability detection is the problem of classifying user authored natural language text, according to whether it can be acted upon by a responding agent. In this paper, we propose a supervised learning framework for domain-aware, large-scale actionability classification of social media messages. We derive lexicons, perform an in-depth analysis for over 25 text based features, and explore strategies to handle domains that have limited training data. We apply these methods to over 46 million messages spanning 75 companies and 35 languages, from both Facebook and Twitter. The models achieve an aggregate population-weighted F measure of 0.78 and accuracy of 0.74, with values of over 0.9 in some cases.
When partitioning workflows in realistic scenarios, the knowledge of the processing units is often vague or unknown. A naive approach to addressing this issue is to perform many controlled experiments for different workloads, each consisting of multiple number of trials in order to estimate the mean and variance of the specific workload. Since this controlled experimental approach can be quite costly in terms of time and resources, we propose a variant of the Gibbs Sampling algorithm that uses a sequence of Bayesian inference updates to estimate the processing characteristics of the processing units. Using the inferred characteristics of the processing units, we are able to determine the best way to split a workflow for processing it in parallel with the lowest expected completion time and least variance.
This paper presents our recent efforts, zenLDA, an efficient and scalable Collapsed Gibbs Sampling system for Latent Dirichlet Allocation training, which is thought to be challenging that both data parallelism and model parallelism are required because of the Big sampling data with up to billions of documents and Big model size with up to trillions of parameters. zenLDA combines both algorithm level improvements and system level optimizations. It first presents a novel CGS algorithm that balances the time complexity, model accuracy and parallelization flexibility. The input corpus in zenLDA is represented as a directed graph and model parameters are annotated as the corresponding vertex attributes. The distributed training is parallelized by partitioning the graph that in each iteration it first applies CGS step for all partitions in parallel, followed by synchronizing the computed model each other. In this way, both data parallelism and model parallelism are achieved by converting them to graph parallelism. We revisited the tradeoff between system efficiency and model accuracy and presented approximations such as unsynchronized model, sparse model initialization and ‘converged’ token exclusion. zenLDA is built on GraphX in Spark that provides distributed data abstraction (RDD) and expressive APIs to simplify the programming efforts and simultaneously hides the system complexities. This enables us to implement other CGS algorithm with a few lines of code change. To better fit in distributed data-parallel framework and achieve comparable performance with contemporary systems, we also presented several system level optimizations to push the performance limit. zenLDA was evaluated it against web-scale corpus, and the result indicates that zenLDA can achieve about much better performance than other CGS algorithm we implemented, and simultaneously achieve better model accuracy.
In the extreme value analysis of time series, not only the tail behavior is of interest, but also the serial dependence plays a crucial role. Drees and Rootz\’en (2010) established limit theorems for a general class of empirical processes of so-called cluster functionals which can be used to analyse various aspects of the extreme value behavior of mixing time series. However, usually the limit distribution is too complex to enable a direct construction of confidence regions. Therefore, we suggest a multiplier block bootstrap analog to the empirical processes of cluster functionals. It is shown that under virtually the same conditions as used by Drees and Rootz\’en (2010), conditionally on the data, the bootstrap processes converge to the same limit distribution. These general results are applied to construct confidence regions for the empirical extremogram introduced by Davis and Mikosch (2009). In a simulation study, the confidence intervals constructed by our multiplier block bootstrap approach compare favorably to the stationary bootstrap proposed by Davis et al.\ (2012).
Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spatially compact events in text streams. SCSS employs alternating optimization between using semantic scan to estimate contrastive foreground topics in documents, and discovering spatial neighborhoods with high occurrence of documents containing the foreground topics. We evaluate our method on Emergency Department chief complaints dataset (ED dataset) to verify the effectiveness of our method in detecting real-world disease outbreaks from free-text ED chief complaint data.
Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) has been shown to be very effective for modeling and predicting sequential data, e.g. speech utterances or handwritten documents. In this study, we propose to use BLSTM-RNN for a unified tagging solution that can be applied to various tagging tasks including part-of-speech tagging, chunking and named entity recognition. Instead of exploiting specific features carefully optimized for each task, our solution only uses one set of task-independent features and internal representations learnt from unlabeled text for all tasks.Requiring no task specific knowledge or sophisticated feature engineering, our approach gets nearly state-of-the-art performance in all these three tagging tasks.
We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related sampling approaches.
We propose a new notion called extremal depth’ (ED) for functional data, discuss its properties, and compare its performance with existing concepts. The proposed notion is based on a measure of extreme `outlyingness’. ED has several desirable properties that are not shared by other notions and is especially well suited for obtaining central regions of functional data and function spaces. In particular: a) the central region achieves the nominal (desired) simultaneous coverage probability; b) there is a correspondence between ED-based (simultaneous) central regions and appropriate point-wise central regions; and c) the method is resistant to certain classes of functional outliers. The paper examines the performance of ED and compares it with other depth notions. Its usefulness is demonstrated through applications to constructing central regions, functional boxplots, outlier detection, and simultaneous confidence bands in regression problems.
We extend the wavelet tests for fixed effects FANOVA models with iid errors, proposed in Abramovich et al, 2004 to FANOVA models with dependent errors and provide an iterative Cochrane-Orcutt type procedure to estimate the parameters and the functional. The function is estimated through a nonlinear wavelet estimator. Nonparametric tests based on the optimal performance of nonlinear wavelet estimators are also proposed. The method is illustrated on real data sets and in simulated studies. The simulation also addresses the test performance under realistic sample sizes.
In this paper we develop a recurrent neural network (TreeRNN), which is designed to predict a tree rather than a linear sequence as is the case in conventional recurrent neural networks. Our model defines the probability of a sentence by estimating the generation probability of its dependency tree. We construct the tree incrementally by generating the left and right dependents of a node whose probability is computed using recurrent neural networks with shared hidden layers. Application of our model to two language modeling tasks shows that it outperforms or performs on par with related models.
Gaussian processes have been successful in both supervised and unsupervised machine learning tasks, but their computational complexity has constrained practical applications. We introduce a new approximation for large-scale Gaussian processes, the Gaussian Process Random Field (GPRF), in which local GPs are coupled via pairwise potentials. The GPRF likelihood is a simple, tractable, and parallelizeable approximation to the full GP marginal likelihood, enabling latent variable modeling and hyperparameter selection on large datasets. We demonstrate its effectiveness on synthetic spatial data as well as a real-world application to seismic event location.