Getting Started with Markov Chains

In this post, we’ll explore some basic properties of discrete-time Markov chains using the functions provided by the markovchain package, supplemented with standard R functions and a few functions from other contributed packages. Chapter 11 of Snell’s online probability book will be our guide, and the calculations displayed here illustrate some of the theory developed there. In the text below, section numbers refer to that chapter.
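
For readers following along, here is a minimal sketch of how a discrete-time chain is defined and queried with the markovchain package; the three-state weather chain is an invented example, not one taken from the post or from Snell’s text.

```r
# A minimal sketch using the markovchain package; the three-state
# weather chain below is an invented example, not from the post.
library(markovchain)

weatherStates <- c("sunny", "cloudy", "rain")
transMat <- matrix(c(0.70, 0.20, 0.10,
                     0.30, 0.40, 0.30,
                     0.20, 0.45, 0.35),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(weatherStates, weatherStates))

mcWeather <- new("markovchain",
                 states = weatherStates,
                 transitionMatrix = transMat,
                 name = "Weather")

mcWeather ^ 2            # two-step transition probabilities
steadyStates(mcWeather)  # long-run (stationary) distribution
```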


The Ultimate Plan to Become a Data Scientist in 2016

Data Scientist is one of the hottest jobs of this decade, and the demand for data scientists far exceeds the supply of available candidates (Source). So there is a lot of incentive for people to look at data science as a career option, and that is not going to change in the near future. However, a single Google search can make the dream seem out of reach: there are so many resources, so much advice, and so many suggested paths that it becomes nearly impossible for a beginner to make the right decisions. If you are facing this problem and aspire to become a data scientist, let’s accomplish it in 2016: this annual plan will make things much easier and faster for you. I’ve mentioned only the best resources you should follow. The plan is designed to make you a data scientist by December 2016 at a conservative pace. If you can devote more time, great: you could achieve this feat much faster, or with more depth, by looking at the additional resources (orange bullets).


Bayesian regression with STAN: Part 1 normal regression

This post will introduce you to Bayesian regression in R; see the reference list at the end of the post for further information on this very broad topic.
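
A minimal sketch of what such a model looks like with the rstan package is shown below; the simulated data and the weakly informative priors are illustrative assumptions, not the exact model from the post.

```r
# A minimal sketch of Bayesian normal (linear) regression with rstan;
# the simulated data and weak priors are illustrative assumptions.
library(rstan)

stan_code <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 10);
  beta  ~ normal(0, 10);
  y ~ normal(alpha + beta * x, sigma);
}
"

set.seed(42)
N <- 100
x <- runif(N, 0, 10)
y <- 2 + 0.5 * x + rnorm(N)

fit <- stan(model_code = stan_code,
            data = list(N = N, x = x, y = y),
            iter = 2000, chains = 4)
print(fit)  # posterior summaries for alpha, beta, sigma
```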


Machine Learning is Fun! Part 2

In Part 1, we said that Machine Learning means using generic algorithms to tell you something interesting about your data without writing any code specific to the problem you are solving. (If you haven’t read Part 1 yet, read it now!) This time, we are going to see one of these generic algorithms do something really cool: create video game levels that look like they were made by humans. We’ll build a neural network, feed it existing Super Mario levels, and watch new ones pop out!


A 5-step guide to data visualization

There are many advanced visualizations (e.g., networks, 3D models and map overlays) used for specialized purposes such as 3D medical imaging, urban transportation simulation, and disaster relief monitoring. But regardless of the complexity of a visualization, its purpose is to help readers see a pattern or trend in the data being analyzed, rather than having them read tedious descriptions such as: ‘A’s profit was more than B’s by 2.9% in 2000, and despite a profit growth of 25% in 2001, A’s profit became less than B’s by 3.5% in 2001.’ A good visualization summarizes information and organizes it in a way that enables the reader to focus on the points relevant to the key message being conveyed.
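
To make the contrast concrete, here is a hypothetical sketch in ggplot2 that turns the A-versus-B story into a chart; the absolute profit figures are invented so as to match the percentages in the quoted description.

```r
# Hypothetical profit figures consistent with the quoted story:
# A leads B by 2.9% in 2000, grows 25%, yet trails B by 3.5% in 2001.
library(ggplot2)

profits <- data.frame(
  company = rep(c("A", "B"), each = 2),
  year    = factor(rep(c(2000, 2001), times = 2)),
  profit  = c(102.9, 128.6, 100.0, 133.3)  # invented values
)

ggplot(profits, aes(x = year, y = profit, fill = company)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Year", y = "Profit",
       title = "A vs. B: the reversal is visible at a glance")
```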


Prove Your Point with Data and a Fast Python Library

Harness the power of Python and the command line to prove your point using data and a fast data-processing library.


Thinning-based models in the analysis of integer-valued time series: a review

This article aims to provide a comprehensive survey of recent developments in the field of integer-valued time series modelling, paying particular attention to models obtained as discrete counterparts of conventional autoregressive moving average and bilinear models and based on the concept of thinning. Such models have proven useful in the analysis of many real-world applications, ranging from economics and finance to medicine. We review the literature on the most relevant thinning operators proposed for the analysis of univariate and multivariate integer-valued time series with either finite or infinite support. Finally, we outline and discuss possible directions for future research.
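
To make the central idea concrete: binomial thinning replaces scalar multiplication, with α∘X defined as a sum of X independent Bernoulli(α) variables, so the INAR(1) recursion X_t = α∘X_{t-1} + ε_t always stays integer-valued. The base-R sketch below simulates this model; the Poisson innovations and parameter values are illustrative choices, not taken from the article.

```r
# Base-R sketch of binomial thinning and an INAR(1) simulation;
# Poisson innovations and the parameter values are illustrative.
set.seed(1)

# Binomial thinning: alpha o x is a sum of x Bernoulli(alpha) draws,
# i.e., a single Binomial(x, alpha) draw.
thin <- function(x, alpha) rbinom(1, size = x, prob = alpha)

# Simulate X_t = alpha o X_{t-1} + eps_t, with eps_t ~ Poisson(lambda).
simulate_inar1 <- function(n, alpha = 0.5, lambda = 2) {
  x <- numeric(n)
  x[1] <- rpois(1, lambda / (1 - alpha))  # start near the stationary mean
  for (t in 2:n) {
    x[t] <- thin(x[t - 1], alpha) + rpois(1, lambda)
  }
  x
}

series <- simulate_inar1(200)
plot(series, type = "s", ylab = "count", main = "Simulated INAR(1) series")
```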


Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood

The stochastic block model (SBM) is a mixture model for the clustering of nodes in networks. The SBM has been employed for more than a decade to analyze very different types of networks in many scientific fields, including biology and the social sciences. Recently, an analytical expression based on collapsing the SBM parameters has been proposed, in combination with a sampling procedure that allows the clustering of the vertices and the estimation of the number of clusters to be performed simultaneously. Although the corresponding algorithm can technically accommodate up to 10,000 nodes and millions of edges, the Markov chain tends to exhibit poor mixing properties, that is, low acceptance rates, for large networks. Therefore, the number of clusters tends to be highly overestimated, even for a very large number of samples. In this article, we rely on a similar expression, which we call the integrated complete data log likelihood, and propose a greedy inference algorithm that focuses on maximizing this exact quantity. This algorithm incurs a smaller computational cost than existing inference techniques for the SBM and can be employed to analyze large networks (several tens of thousands of nodes and millions of edges) with no convergence problems. On toy datasets, the algorithm exhibits improvements over existing strategies in terms of both clustering and model selection. An application to a network of blogs related to illustrations and comics is also provided.
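
As background for readers unfamiliar with the model, the sketch below generates a small undirected network from an SBM in base R; the block count, sizes, and connectivity matrix are invented for illustration and are unrelated to the inference algorithm proposed in the paper.

```r
# Generate an undirected network from a stochastic block model;
# the block count and connection probabilities are invented.
set.seed(7)

n <- 60                                   # nodes
K <- 3                                    # latent clusters
z <- sample(1:K, n, replace = TRUE)       # latent cluster labels
P <- matrix(0.02, K, K); diag(P) <- 0.25  # dense within, sparse between

A <- matrix(0L, n, n)
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    A[i, j] <- rbinom(1, 1, P[z[i], z[j]])
    A[j, i] <- A[i, j]
  }
}

# Ordering nodes by cluster label makes the dense diagonal blocks visible.
ord <- order(z)
image(A[ord, ord], main = "SBM adjacency matrix, nodes ordered by cluster")
```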


Managing nonignorable missing data with clustered multinomial responses

Clustered multinomial responses are common in public health studies. In this situation, the baseline logit random effects model is usually suggested as a general modelling approach. When nonignorable missing outcomes exist, naïve methods such as complete-case analysis, or likelihood methods that ignore the missing-data mechanism, may distort the conclusions drawn. While methods to deal with binary and ordinal outcomes have been proposed, no easily implementable method is available specifically for missing clustered nominal responses. Joint modelling is usually one of the available choices, but its likelihood is highly complex, and the numerical integration over both missing data and random effects is challenging. In this study, we have derived a closed form of the likelihood. A simplified likelihood, which extends a previous study, is also proposed. One advantage is that both methods are easily implemented with commonly used software. We illustrate the proposed methods using the Global Youth Tobacco Survey and compare the results obtained by naïve methods that ignore missing data with those obtained using the proposed methods. Our approaches recover the parameter estimates and the predicted probability of each category to an acceptable extent. Analysis guidelines for the use of our methods are provided.
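
The complete-case pitfall described above can be illustrated with a hypothetical simulation; the data-generating model, the missingness rates, and the use of nnet::multinom as a simple baseline-category fit are all assumptions for illustration, and clustering and random effects are omitted for brevity.

```r
# Hypothetical illustration of nonignorable missingness: category-3
# outcomes go missing more often, biasing a complete-case analysis.
# nnet::multinom fits a simple baseline-category logit (no random effects).
library(nnet)

set.seed(123)
n <- 5000
x <- rnorm(n)

# True baseline-category logit with categories 1 (baseline), 2, and 3.
eta2 <- 0.5 + 1.0 * x
eta3 <- -0.5 + 0.5 * x
p <- cbind(1, exp(eta2), exp(eta3))
p <- p / rowSums(p)
y <- apply(p, 1, function(pr) sample(1:3, 1, prob = pr))

# Nonignorable mechanism: missingness depends on the unobserved outcome.
miss  <- rbinom(n, 1, ifelse(y == 3, 0.6, 0.1)) == 1
y_obs <- ifelse(miss, NA, y)

full_fit <- multinom(factor(y) ~ x, trace = FALSE)      # all data
cc_fit   <- multinom(factor(y_obs) ~ x, trace = FALSE)  # complete cases

coef(full_fit)  # close to the true coefficients
coef(cc_fit)    # category-3 intercept is biased downward
```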


satRdays are coming

It’s been only around two months since the idea of community-driven R conferences was born, when Steph Locke first talked publicly about this cool concept, but I am pretty sure we will be able to attend at least one or two satRdays in 2016, as the project has received a great deal of very positive feedback on GitHub, on Twitter, and in in-person conversations. In short, this is a proposal, to be submitted to the R Consortium in the next few days, for free/cheap full-day conferences organized by R users for R users around the world, acting as a bridge between local R User Groups and global conferences.