Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.
For bivariate normal data, all (marginal) posterior moments of Pearson’s correlation coefficient are given in analytic form.
Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms. In the context of large scale learning, as utilized by many Big Data applications, efficient parallelization of SGD is in the focus of active research. Recently, we were able to show that the asynchronous communication paradigm can be applied to achieve a fast and scalable parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce based, parallel algorithms solving large scale machine learning problems. In this paper, we investigate the impact of asynchronous communication frequency and message size on the performance of ASGD applied to large scale ML on HTC cluster and cloud environments. We introduce a novel algorithm for the automatic balancing of the asynchronous communication load, which allows to adapt ASGD to changing network bandwidths and latencies.
This paper examines the role and efficiency of the non-convex loss functions for binary classification problems. In particular, we investigate how to design a simple and effective boosting algorithm that is robust to the outliers in the data. The analysis of the role of a particular non-convex loss for prediction accuracy varies depending on the diminishing tail properties of the gradient of the loss — the ability of the loss to efficiently adapt to the outlying data, the local convex properties of the loss and the proportion of the contaminated data. In order to use these properties efficiently, we propose a new family of non-convex losses named $\gamma$-robust losses. Moreover, we present a new boosting framework, {\it Arch Boost}, designed for augmenting the existing work such that its corresponding classification algorithm is significantly more adaptable to the unknown data contamination. Along with the Arch Boosting framework, the non-convex losses lead to the new class of boosting algorithms, named adaptive, robust, boosting (ARB). Furthermore, we present theoretical examples that demonstrate the robustness properties of the proposed algorithms. In particular, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, we present new theoretical results, based only on local curvatures, which may be used to establish statistical and optimization properties of the proposed Arch boosting algorithms with highly non-convex loss functions. Extensive numerical calculations are used to illustrate these theoretical properties and reveal advantages over the existing boosting methods when data exhibits a number of outliers.
Inference of low-dimensional structures, such as clusters, on large networks is a central problem in network science. An important class of models describing such structures is the Random Dot Product Graph (RDPG), which assigns low dimensional latent position vectors to nodes and computes edge probabilities using dot products between these vectors. The RDPG provides a more flexible network model compared with the standard Stochastic Block Model (SBM). In this paper, we introduce the Logistic RDPG, which uses a logistic link function mapping from latent positions to edge probabilities. The logistic RDPG includes most SBMs as well as other low-dimensional structures, such as degree-corrected models, that are not described by SBMs. Over this model, we derive a method for efficient, asymptotically exact maximum-likelihood inference of latent position vectors. Our method involves computing the top eigenvectors of the mean-centered adjacency matrix and performing a logistic regression step to recover the appropriate eigenvalue scaling. Applied to the network clustering problem on diverse synthetic network models, we illustrate that our method is more accurate and more robust than existing spectral and semidefinite network clustering methods.
This paper presents and analyzes a stochastic version of the Frank-Wolfe algorithm (a.k.a. conditional gradient method or projection-free algorithm) for constrained convex optimization. We first prove that when the quality of gradient estimate improves as ${\cal O}( \sqrt{ \eta_t^{\Delta} / t } )$, where $t$ is the iteration index and $\eta_t^{\Delta}$ is an increasing sequence, then the objective value of the stochastic Frank-Wolfe algorithm converges in at least the same order. When the optimal solution lies in the interior of the constraint set, the convergence rate is accelerated to ${\cal O}(\eta_t^{\Delta} /t)$. Secondly, we study how the stochastic Frank-Wolfe algorithm can be applied to a few practical machine learning problems. Tight bounds on the gradient estimate errors for these examples are established. Numerical simulations support our findings.
Vehicle (bike or car) sharing represents an emerging transportation scheme which may comprise an important link in the green mobility chain of smart city environments. This chapter offers a comprehensive review of algorithmic approaches for the design and management of vehicle sharing systems. Our focus is on one-way vehicle sharing systems (wherein customers are allowed to pick-up a vehicle at any location and return it to any other station) which best suits typical urban journey requirements. Along this line, we present methods dealing with the so-called asymmetric demand-offer problem (i.e. the unbalanced offer and demand of vehicles) typically experienced in one-way sharing systems which severely affects their economic viability as it implies that considerable human (and financial) resources should be engaged in relocating vehicles to satisfy customer demand. The chapter covers all planning aspects that affect the effectiveness and viability of vehicle sharing systems: the actual system design (e.g. number and location of vehicle station facilities, vehicle fleet size, vehicles distribution among stations); customer incentivisation schemes to motivate customer-based distribution of bicycles/cars (such schemes offer meaningful incentives to users so as to leave their vehicle to a station different to that originally intended and satisfy future user demand); cost-effective solutions to schedule operator-based repositioning of bicycles/cars (by employees explicitly enrolled in vehicle relocation) based on the current and future (predicted) demand patterns (operator-based and customer-based relocation may be thought as complementary methods to achieve the intended distribution of vehicles among stations).
This paper describes how to convert a machine learning problem into a series of map-reduce tasks. We study logistic regression algorithm. In logistic regression algorithm, it is assumed that samples are independent and each sample is assigned a probability. Parameters are obtained by maxmizing the product of all sample probabilities. Rapid expansion of training samples brings challenges to machine learning method. Training samples are so many that they can be only stored in distributed file system and driven by map-reduce style programs. The main step of logistic regression is inference. According to map-reduce spirit, each sample makes inference through a separate map procedure. But the premise of inference is that the map procedure holds parameters for all features in the sample. In this paper, we propose Distributed Parameter Map-Reduce, in which not only samples, but also parameters are distributed in nodes of distributed filesystem. Through a series of map-reduce tasks, we assign each sample parameters for its features, make inference for the sample and update paramters of the model. The above processes are excuted looply until convergence. We test the proposed algorithm in actual hadoop production environment. Experiments show that the acceleration of the algorithm is in linear relationship with the number of cluster nodes.
The need to carry out parameter estimation from massive data has reinvigorated interest in iterative estimation methods, in statistics and machine learning. Classic work includes deterministic gradient-based methods, such as quasi-Newton, and stochastic gradient descent and its variants, including adaptive learning rates, acceleration and averaging. Current work increasingly relies on methods that employ proximal operators, leading to updates defined through implicit equations, which need to be solved at each iteration. Such methods are especially attractive in modern problems with massive data because they are numerically stable and converge with minimal assumptions, among other reasons. However, while the majority of existing methods can be subsumed into the gradient-free stochastic approximation framework developed by Robbins and Monro (1951), there is no such framework for methods with implicit updates. Here, we conceptualize a gradient-free implicit stochastic approximation procedure, and develop asymptotic and non-asymptotic theory for it. This new framework provides a theoretical foundation for gradient-based procedures that rely on implicit updates, and opens the door to iterative estimation methods that do not require a gradient, nor a fully known likelihood.
Information on the web is prodigious; searching relevant information is difficult making web users to rely on search engines for finding relevant information on the web. Search engines index and categorize web pages according to their contents using crawlers and rank them accordingly. For given user query they retrieve millions of webpages and display them to users according to web-page rank. Every search engine has their own algorithms based on certain parameters for ranking web-pages. Search Engine Optimization (SEO) is that technique by which webmasters try to improve ranking of their websites by optimizing it according to search engines ranking parameters. It is the aim of this research to identify the most popular SEO techniques used by search engines for ranking web-pages and to establish their importance for indexing and categorizing web data. The research tries to establish that using more SEO parameters in ranking algorithms helps in retrieving better search results thus increasing user satisfaction. In the accomplished research, a web based Meta search engine is proposed to aggregates search results from different search engines and rank web-pages based on new page ranking algorithm which will assign heuristic page rank to web-pages based on SEO parameters such as title tag, Meta description, sitemap etc. The research also provides insight into techniques which webmasters can use for better ranking their websites in Google and Bing. Initial results has shown that using certain SEO parameters in present ranking algorithm helps in retrieving more useful results for user queries. These results generated from Meta search engine outperformed existing search engines in terms of better retrieved search results and high precision.
The task of determining labels of all network nodes based on the knowledge about network structure and labels of some training subset of nodes is called the within-network classification. It may happen that none of the labels of the nodes is known and additionally there is no information about number of classes to which nodes can be assigned. In such a case a subset of nodes has to be selected for initial label acquisition. The question that arises is: ‘labels of which nodes should be collected and used for learning in order to provide the best classification accuracy for the whole network?’. Active learning and inference is a practical framework to study this problem. A set of methods for active learning and inference for within network classification is proposed and validated. The utility score calculation for each node based on network structure is the first step in the process. The scores enable to rank the nodes. Based on the ranking, a set of nodes, for which the labels are acquired, is selected (e.g. by taking top or bottom N from the ranking). The new measure-neighbour methods proposed in the paper suggest not obtaining labels of nodes from the ranking but rather acquiring labels of their neighbours. The paper examines 29 distinct formulations of utility score and selection methods reporting their impact on the results of two collective classification algorithms: Iterative Classification Algorithm and Loopy Belief Propagation. We advocate that the accuracy of presented methods depends on the structural properties of the examined network. We claim that measure-neighbour methods will work better than the regular methods for networks with higher clustering coefficient and worse than regular methods for networks with low clustering coefficient. According to our hypothesis, based on clustering coefficient we are able to recommend appropriate active learning and inference method.
The rooted tree is an important data structure, and the height, depth and subtree size are naturally defined attributes of every node. We consider the problem of the realization of a tree given a sequence of one of these attributes. We also consider the problem of the existence of a tree when multiple sequences of attributes are given or when a sequence of tuples of attributes is given. Our most significant results are the NP-Completeness of problems related to subtree size sequences, which we prove by non-trivial reductions from Numerical Matching with Target Sums. By contrast we give polynomial time algorithms for depth and height sequences. Even for the subtree size problem, we show that the realization of some classes of trees can be solved in O(log(n)) time.