Understanding Human Navigation using Bayesian Hypothesis Comparison

Understanding human navigation behavior has implications for a wide range of application scenarios. For example, insights into geo-spatial navigation in urban areas can impact city planning or public transport. Similarly, knowledge about navigation on the web can help to improve web site structures or service experience. In this work, we focus on a hypothesis-driven approach to address the task of understanding human navigation: We aim to formulate and compare ideas – for example stemming from existing theory, literature, intuition, or previous experiments – based on a given set of navigational observations. For example, we may compare whether tourists exploring a city walk ‘short distances’ before taking their next photo or rather ‘travel long distances between points of interest’, or whether users browsing Wikipedia ‘navigate semantically’ vs. ‘click randomly’. For this, the Bayesian method HypTrails has recently been proposed. However, while HypTrails is a straightforward and flexible approach, several major challenges remain: i) HypTrails does not account for heterogeneity (e.g., incorporating differently behaving user groups such as tourists and locals is not possible), ii) HypTrails does not support the user in conceiving novel hypotheses when confronted with a large set of possibly relevant background information or influence factors, e.g., points of interest, popularity of locations, time of the day, or user properties, and finally iii) formulating hypotheses can be technically challenging depending on the application scenario (e.g., due to continuous observations or temporal constraints). In this thesis, we address these limitations by introducing several novel methods and tools and by exploring a wide range of case studies.
In particular, our main contributions are the methods MixedTrails and SubTrails which specifically address the first two limitations: MixedTrails is an approach for hypothesis comparison that extends the previously proposed HypTrails method to allow formulating and comparing heterogeneous hypotheses (e.g., incorporating differently behaving user groups). SubTrails is a method that supports hypothesis conception by automatically discovering interpretable subgroups with exceptional navigation behavior. In addition, our methodological contributions also include several tools consisting of a distributed implementation of HypTrails, a web application for visualizing geo-spatial human navigation in the context of background information, as well as a system for collecting, analyzing, and visualizing mobile participatory sensing data. Furthermore, we conduct case studies in many application domains, which encompass – among others – geo-spatial navigation based on photos from the photo-sharing platform Flickr, browsing behavior on the social tagging system BibSonomy, and task choosing behavior on a commercial crowdsourcing platform. In the process, we develop approaches to cope with application specific subtleties (like continuous observations and temporal constraints). The corresponding studies illustrate the variety of domains and facets in which navigation behavior can be studied and, thus, showcase the expressiveness, applicability, and flexibility of our methods. Using these methods, we present new aspects of navigational phenomena which ultimately help to better understand the multi-faceted characteristics of human navigation behavior.
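In essence, HypTrails-style comparison encodes each hypothesis as Dirichlet pseudo-counts over a Markov chain's transitions and ranks hypotheses by the marginal likelihood (evidence) of the observed transition counts. A minimal sketch of that Dirichlet–multinomial evidence, with toy data and hypothetical names:

```python
import math

def log_evidence(counts, prior):
    """Log marginal likelihood of transition counts under row-wise
    Dirichlet priors. counts[s][t]: observed transitions s -> t;
    prior[s][t]: pseudo-counts expressing a hypothesis (larger where
    the hypothesis predicts transitions)."""
    total = 0.0
    for n_row, a_row in zip(counts, prior):
        total += math.lgamma(sum(a_row)) - math.lgamma(sum(a_row) + sum(n_row))
        for n, a in zip(n_row, a_row):
            total += math.lgamma(a + n) - math.lgamma(a)
    return total

# Toy data: walkers mostly stay near their current location.
counts = [[9, 1], [1, 9]]
short = [[10, 1], [1, 10]]  # 'short distances' hypothesis
rand = [[6, 6], [6, 6]]     # 'click/walk randomly' hypothesis
print(log_evidence(counts, short) > log_evidence(counts, rand))  # True
```

A hypothesis whose pseudo-counts match the observed behavior earns higher evidence, which is the basis for ranking competing explanations.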

Building Recurrent Neural Networks in Tensorflow

In previous blog posts we have seen how to build Convolutional Neural Networks in Tensorflow and how to use Stochastic Signal Analysis techniques to classify signals and time series. In this blog post, let's have a look at how we can build Recurrent Neural Networks in Tensorflow and use them to classify signals.
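Before reaching for TensorFlow, the core recurrence an RNN computes can be sketched in plain NumPy (a simple Elman-style cell; all names here are hypothetical):

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    """Simple RNN recurrence: h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b).
    Returns the final hidden state, which could feed a classifier."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
T, n_in, n_hidden = 5, 3, 4
x_seq = rng.standard_normal((T, n_in))       # one input sequence
W_x = 0.1 * rng.standard_normal((n_in, n_hidden))
W_h = 0.1 * rng.standard_normal((n_hidden, n_hidden))
b = np.zeros(n_hidden)
print(rnn_forward(x_seq, W_x, W_h, b).shape)  # (4,)
```

This is the same computation that `tf.keras.layers.SimpleRNN` performs internally; TensorFlow adds trainable weights, batching, and backpropagation through time.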

Cohort Analysis in the Age of Digital Twins

To be actionable, Big Data and Data Science must get down to the level of the individual – whether the individual is a customer, physician, patient, teacher, student, coach, athlete, technician, mechanic or engineer. This is the ‘Power of One.’ By applying data science to the growing wealth of human purchase, interaction and social engagement data, organizations can capture individuals' tendencies, propensities, inclinations, behaviors, patterns, associations, interests, passions, affiliations and relationships that drive business monetization opportunities.

Reducing bias and ensuring fairness in data science

Here at Civis, we build a lot of models. Most of the time we're modeling people and their behavior because that's what we're particularly good at, but we're hardly the only ones doing this – as we enter the age of ‘big data’ more and more industries are applying machine learning techniques to drive person-level decision-making. This comes with exciting opportunities, but it also introduces an ethical dilemma: when machine learning models make decisions that affect people's lives, how can you be sure those decisions are fair?

Accessing Web Data (JSON) in R using httr

The first and most important step in Data Analysis is gathering data from all possible sources (primary or secondary). Data can be available in all sorts of formats, ranging from flat files (.txt, .csv) to formats like Excel. These files may be stored locally in your system or in your working directory. Packages like base R's utils, readr, data.table, and XLConnect expose methods for accessing such locally saved files.
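The httr workflow is essentially fetch, extract the response body, then parse the JSON into native data structures. For readers outside R, the same parse step looks like this in Python's standard library (the payload is inlined here rather than fetched over HTTP, and the field names are hypothetical):

```python
import json

# A payload as it might come back from a JSON API.
raw = '{"results": [{"name": "Ada", "score": 9.5}, {"name": "Bo", "score": 7.0}]}'

data = json.loads(raw)  # parse text into dicts and lists
rows = [(r["name"], r["score"]) for r in data["results"]]
print(rows)  # [('Ada', 9.5), ('Bo', 7.0)]
```

In R, `httr::GET()` followed by `httr::content()` plays the role of fetching and parsing in one pipeline.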

Standardization in LASSO

In a recent post, we've seen computational aspects of the optimization problem. But I went quickly through the story of the $\ell_1$-norm. It means, somehow, that the values of $\beta_1$ and $\beta_2$ should be comparable. With two significant variables on very different scales, we should expect the orders of magnitude of $\widehat{\beta}_1$ and $\widehat{\beta}_2$ to be very different. So people say that it is therefore necessary to center and reduce (or standardize) the variables.
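Center and reduce means giving each column mean 0 and standard deviation 1, so that an $\ell_1$ penalty treats all coefficients on a comparable scale. A minimal sketch of the transformation (synthetic data, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)           # scale around 1
x2 = 1000 * rng.standard_normal(n)    # very different scale
X = np.column_stack([x1, x2])

# Center and reduce (standardize) each column.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma

print(np.allclose(Z.mean(axis=0), 0), np.allclose(Z.std(axis=0), 1))  # True True

# A coefficient fitted on the standardized column maps back to the
# original scale by dividing by that column's sigma:
# beta_original_j = beta_standardized_j / sigma_j
```

Without this step, the penalty would effectively shrink the large-scale variable's coefficient far more aggressively, simply because its raw coefficient is tiny.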

Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work presents a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient both to reduce the variance of a stochastic gradient estimate and to parallelize SGD, and (2) tail-averaging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD's final iterate. This work presents sharp finite sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between learning rate and batch size when considering the excess risk of the final iterate of an SGD procedure. Next, this mini-batching characterization is utilized in providing a highly parallelizable SGD method that achieves the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound for model averaging (which is a communication efficient parallelization scheme) is provided. Finally, this work sheds light on fundamental differences in SGD's behavior when dealing with mis-specified models in the non-realizable least squares problem. This paper shows that maximal stepsizes ensuring minimax risk for the mis-specified case must depend on the noise properties. The analysis tools used by this paper generalize the operator view of averaged SGD (Défossez and Bach, 2015), developing a novel analysis that bounds these operators to characterize the generalization error.
These techniques are of broader interest in analyzing various computational aspects of stochastic approximation.
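The two averaging schemes the abstract analyzes are easy to state concretely. The sketch below is a generic illustration of mini-batching plus tail-averaging on least squares (not the paper's algorithm or bounds; all names and constants are hypothetical):

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.05, batch=8, epochs=50, tail=0.5, seed=0):
    """Mini-batch SGD on least squares, returning the average of the
    final `tail` fraction of iterates (tail-averaging)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            # Mini-batching: average the per-sample gradients in the batch.
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w = w - lr * grad
            iterates.append(w.copy())
    k = int(len(iterates) * tail)
    return np.mean(iterates[-k:], axis=0)

rng = np.random.default_rng(2)
n, d = 500, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(n)
w_hat = sgd_least_squares(X, y)
print(np.max(np.abs(w_hat - w_true)) < 0.1)  # True
```

Mini-batching reduces the variance of each gradient step (and parallelizes naturally across the batch), while tail-averaging damps the oscillation of the final iterates around the optimum.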

Weak and Strong Bias in Machine Learning

With the arrival of the GDPR there has been increased focus on non-discrimination in machine learning. This post explores different forms of model bias and suggests some practical steps to improve fairness in machine learning.

Time Series Analysis With Documentation And Steps I Follow For Analytics Projects

Setting up RStudio Server, Shiny Server and PostgreSQL

A few months back, I set up a server on Amazon Web Services with a data sciencey toolkit on it. Amongst other things, this means I can collect data around the clock when necessary, as well as host my little RRobot twitter bot, without having a physical machine humming in my living room. There are lots of fiddly things to sort out to make such a setup actually fit for purpose.