Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models have also been applied to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.
For bivariate normal data, all (marginal) posterior moments of Pearson’s correlation coefficient are given in analytic form.
Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms. In the context of large-scale learning, as utilized by many Big Data applications, efficient parallelization of SGD is the focus of active research. Recently, we were able to show that the asynchronous communication paradigm can be applied to achieve a fast and scalable parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce-based, parallel algorithms for solving large-scale machine learning problems. In this paper, we investigate the impact of asynchronous communication frequency and message size on the performance of ASGD applied to large-scale ML on HTC cluster and cloud environments. We introduce a novel algorithm for the automatic balancing of the asynchronous communication load, which allows ASGD to adapt to changing network bandwidths and latencies.
This paper examines the role and efficiency of non-convex loss functions for binary classification problems. In particular, we investigate how to design a simple and effective boosting algorithm that is robust to outliers in the data. The effect of a particular non-convex loss on prediction accuracy depends on the diminishing tail properties of the gradient of the loss (that is, the ability of the loss to adapt efficiently to outlying data), the local convex properties of the loss, and the proportion of contaminated data. In order to use these properties efficiently, we propose a new family of non-convex losses named $\gamma$-robust losses. Moreover, we present a new boosting framework, {\it Arch Boost}, designed to augment existing work such that the corresponding classification algorithm is significantly more adaptable to unknown data contamination. Along with the Arch Boosting framework, the non-convex losses lead to a new class of boosting algorithms, named adaptive, robust, boosting (ARB). Furthermore, we present theoretical examples that demonstrate the robustness properties of the proposed algorithms. In particular, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, we present new theoretical results, based only on local curvatures, which may be used to establish statistical and optimization properties of the proposed Arch boosting algorithms with highly non-convex loss functions. Extensive numerical calculations are used to illustrate these theoretical properties and reveal advantages over existing boosting methods when the data exhibit a number of outliers.
Inference of low-dimensional structures, such as clusters, on large networks is a central problem in network science. An important class of models describing such structures is the Random Dot Product Graph (RDPG), which assigns low-dimensional latent position vectors to nodes and computes edge probabilities using dot products between these vectors. The RDPG provides a more flexible network model than the standard Stochastic Block Model (SBM). In this paper, we introduce the Logistic RDPG, which uses a logistic link function mapping from latent positions to edge probabilities. The Logistic RDPG includes most SBMs as well as other low-dimensional structures, such as degree-corrected models, that are not described by SBMs. For this model, we derive a method for efficient, asymptotically exact maximum-likelihood inference of latent position vectors. Our method involves computing the top eigenvectors of the mean-centered adjacency matrix and performing a logistic regression step to recover the appropriate eigenvalue scaling. On the network clustering problem for diverse synthetic network models, we show that our method is more accurate and more robust than existing spectral and semidefinite network clustering methods.
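The eigenvector step mentioned in the Logistic RDPG abstract can be illustrated with a small sketch. This is not the authors' method: it assumes a two-block assortative graph, uses plain power iteration (so it finds the eigenvector of largest-magnitude eigenvalue, assumed here to be the top one), splits nodes by sign, and omits the logistic regression step that recovers the eigenvalue scaling. All function names and parameter values are hypothetical.

```python
import random

def sample_two_block_graph(n, p_in, p_out, rng):
    """Sample a symmetric 0/1 adjacency matrix with two equal-sized blocks."""
    labels = [0 if i < n // 2 else 1 for i in range(n)]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1.0
    return adj, labels

def top_eigenvector(mat, rng, iters=100):
    """Power iteration: converges to the eigenvector whose eigenvalue
    has the largest magnitude (assumed to be the top eigenvalue here)."""
    n = len(mat)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in mat]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def spectral_two_way_split(adj, rng):
    """Mean-center the adjacency matrix, then split nodes by the sign
    of the leading eigenvector's entries."""
    n = len(adj)
    mean = sum(map(sum, adj)) / (n * n)
    centered = [[a - mean for a in row] for row in adj]
    v = top_eigenvector(centered, rng)
    return [0 if x >= 0 else 1 for x in v]

rng = random.Random(0)
adj, labels = sample_two_block_graph(80, 0.6, 0.1, rng)
pred = spectral_two_way_split(adj, rng)
```

Cluster labels are only identified up to a swap, so any comparison of `pred` with `labels` should take the better of the two assignments.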
This paper presents and analyzes a stochastic version of the Frank-Wolfe algorithm (a.k.a. conditional gradient method or projection-free algorithm) for constrained convex optimization. We first prove that when the quality of the gradient estimate improves as ${\cal O}( \sqrt{ \eta_t^{\Delta} / t } )$, where $t$ is the iteration index and $\eta_t^{\Delta}$ is an increasing sequence, the objective value of the stochastic Frank-Wolfe algorithm converges at a rate of at least the same order. When the optimal solution lies in the interior of the constraint set, the convergence rate is accelerated to ${\cal O}(\eta_t^{\Delta} /t)$. Second, we study how the stochastic Frank-Wolfe algorithm can be applied to a few practical machine learning problems. Tight bounds on the gradient estimate errors for these examples are established. Numerical simulations support our findings.
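A minimal sketch of a stochastic Frank-Wolfe iteration may help make the abstract above concrete. It is an illustration under stated assumptions, not the paper's algorithm: the objective is the toy problem $f(x) = \frac{1}{2} E\|x - z\|^2$ over the probability simplex, the gradient estimate uses the running mean of all noisy samples seen so far (so its error shrinks roughly like $1/\sqrt{t}$), and the step size is the common choice $2/(t+2)$. The function names, the noise level, and the iteration count are hypothetical.

```python
import random

def lmo_simplex(grad):
    """Linear minimization oracle over the probability simplex:
    the minimizer of <grad, s> is the vertex with the smallest coordinate."""
    i = min(range(len(grad)), key=lambda k: grad[k])
    return [1.0 if k == i else 0.0 for k in range(len(grad))]

def stochastic_frank_wolfe(target, noise_std=0.05, iters=5000, seed=0):
    """Minimize 0.5 * E||x - z||^2 over the simplex, where z = target + noise
    is only observed through samples (toy setting, assumed for illustration)."""
    rng = random.Random(seed)
    d = len(target)
    x = [1.0] + [0.0] * (d - 1)   # start at a vertex of the simplex
    z_mean = [0.0] * d            # running mean of the observed samples
    for t in range(1, iters + 1):
        z = [b + rng.gauss(0.0, noise_std) for b in target]
        z_mean = [m + (zi - m) / t for m, zi in zip(z_mean, z)]
        # Gradient estimate: true gradient is x - E[z]; we plug in the mean.
        grad_est = [xi - mi for xi, mi in zip(x, z_mean)]
        s = lmo_simplex(grad_est)
        gamma = 2.0 / (t + 2.0)
        # Convex combination keeps x feasible without any projection.
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x

x_hat = stochastic_frank_wolfe([0.2, 0.5, 0.3])
```

Because `target` lies in the interior of the simplex here, this toy case corresponds to the interior-solution regime mentioned in the abstract; every iterate stays feasible because each update is a convex combination of feasible points.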
Vehicle (bike or car) sharing is an emerging transportation scheme that may constitute an important link in the green mobility chain of smart city environments. This chapter offers a comprehensive review of algorithmic approaches for the design and management of vehicle sharing systems. Our focus is on one-way vehicle sharing systems (wherein customers are allowed to pick up a vehicle at any location and return it to any other station), which best suit typical urban journey requirements. Along this line, we present methods dealing with the so-called asymmetric demand-offer problem (i.e. the unbalanced offer and demand of vehicles) typically experienced in one-way sharing systems. This problem severely affects their economic viability, as it implies that considerable human (and financial) resources must be engaged in relocating vehicles to satisfy customer demand. The chapter covers all planning aspects that affect the effectiveness and viability of vehicle sharing systems: the actual system design (e.g. number and location of vehicle station facilities, vehicle fleet size, distribution of vehicles among stations); customer incentivisation schemes to motivate customer-based distribution of bicycles/cars (such schemes offer meaningful incentives to users to leave their vehicle at a station different from the one originally intended, so as to satisfy future user demand); and cost-effective solutions to schedule operator-based repositioning of bicycles/cars (by employees explicitly engaged in vehicle relocation) based on current and predicted future demand patterns (operator-based and customer-based relocation may be thought of as complementary methods to achieve the intended distribution of vehicles among stations).
This paper describes how to convert a machine learning problem into a series of map-reduce tasks. We study the logistic regression algorithm. In logistic regression, samples are assumed to be independent and each sample is assigned a probability. Parameters are obtained by maximizing the product of all sample probabilities. The rapid growth of training data poses challenges for machine learning methods: training samples are so numerous that they can only be stored in a distributed file system and processed by map-reduce style programs. The main step of logistic regression is inference. In the map-reduce spirit, inference for each sample is made by a separate map procedure. However, inference presupposes that the map procedure holds the parameters for all features in the sample. In this paper, we propose Distributed Parameter Map-Reduce, in which not only the samples but also the parameters are distributed across the nodes of a distributed file system. Through a series of map-reduce tasks, we assign each sample the parameters for its features, make inference for the sample, and update the parameters of the model. These steps are executed iteratively until convergence. We test the proposed algorithm in a production Hadoop environment. Experiments show that the speedup of the algorithm grows linearly with the number of cluster nodes.
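The three stages named in the abstract above (attach parameters to samples, infer, update) can be sketched as a chain of map-reduce jobs. The following is a single-process simulation under stated assumptions, not the authors' implementation and not Hadoop code: `map_reduce`, `train_step`, the record tags `"P"`, `"S"`, `"G"`, and the learning rate are all hypothetical, and the update is one plain gradient step on the mean negative log-likelihood.

```python
import math
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Single-process stand-in for one map-reduce job (illustration only)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

def train_step(samples, params, lr=0.5):
    """samples: [(sample_id, label, {feature: value})]; params: {feature: weight}."""
    n = len(samples)
    param_records = [("P", feat, w) for feat, w in params.items()]
    sample_records = [("S", sid, label, feats) for sid, label, feats in samples]

    # Job 1: key by feature so each weight meets the samples that use it.
    def map_join(rec):
        if rec[0] == "P":
            yield rec[1], ("P", rec[2])
        else:
            _, sid, label, feats = rec
            for feat, val in feats.items():
                yield feat, ("S", sid, label, val)

    def reduce_join(feat, values):
        w = next((v[1] for v in values if v[0] == "P"), 0.0)
        return [(v[1], v[2], feat, v[3], w) for v in values if v[0] == "S"]

    joined = map_reduce(param_records + sample_records, map_join, reduce_join)

    # Job 2: key by sample, run inference, emit per-feature gradient terms.
    def map_by_sample(row):
        sid, label, feat, val, w = row
        yield sid, (label, feat, val, w)

    def reduce_infer(sid, values):
        score = sum(val * w for _, _, val, w in values)
        prob = 1.0 / (1.0 + math.exp(-score))
        return [("G", feat, (prob - label) * val) for label, feat, val, _ in values]

    grads = map_reduce(joined, map_by_sample, reduce_infer)

    # Job 3: key by feature again, sum the gradient terms, update the weight.
    def map_by_feature(rec):
        yield rec[1], (rec[0], rec[2])

    def reduce_update(feat, values):
        w = next((v for tag, v in values if tag == "P"), 0.0)
        g = sum(v for tag, v in values if tag == "G")
        return [(feat, w - lr * g / n)]

    return dict(map_reduce(param_records + grads, map_by_feature, reduce_update))

samples = [(0, 1, {"a": 1.0, "bias": 1.0}), (1, 0, {"b": 1.0, "bias": 1.0}),
           (2, 1, {"a": 1.0}), (3, 0, {"b": 1.0})]
params = train_step(samples, {"a": 0.0, "b": 0.0, "bias": 0.0})
```

Calling `train_step` repeatedly on its own output plays the role of iterating until convergence. Note that no job ever needs the full parameter vector in one place, which is the point of distributing the parameters as well as the samples.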