Nonparametric Latent Dirichlet Allocation

In 2003, David Blei, Andrew Ng, and Michael Jordan presented a groundbreaking statistical model called ‘Latent Dirichlet Allocation’ (LDA). LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distributions over words. For an introduction to LDA, see Edwin Chen’s post. The original LDA model requires the number of topics in the document to be specified as a known parameter of the model. In 2005, Yee Whye Teh and others published a ‘nonparametric’ version of this model that doesn’t require the number of topics to be specified. Instead, this model uses a prior distribution over the topics called a hierarchical Dirichlet process. I wrote an introduction to this HDP-LDA model earlier this year.
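
As a quick sketch of what ‘topics as discrete probability distributions over words’ means, the standard LDA generative process for K topics can be written (in one common notation, not necessarily that of the original papers) as:

    \beta_k \sim \mathrm{Dirichlet}(\eta), \quad k = 1, \dots, K                  % topics: distributions over the vocabulary
    \theta_d \sim \mathrm{Dirichlet}(\alpha)                                      % topic proportions for document d
    z_{d,n} \mid \theta_d \sim \mathrm{Categorical}(\theta_d)                     % topic assignment for word n of document d
    w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Categorical}(\beta_{z_{d,n}})        % observed word

The HDP-LDA variant replaces the fixed K with a hierarchical Dirichlet process prior, so the number of topics is effectively inferred from the data.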

Scientist Sees Squirrel

Why do we make statistics so hard for our students?

8 Productivity hacks for Data Scientists & Business Analysts

Tip 1: Focus on big problems (and big problems only)
Tip 2: Create a presentation of your analysis before you start (with possible layouts and branches)
Tip 3: Define data requirements upfront
Tip 4: Make sure your analysis is reproducible
Tip 5: Keep standard libraries of code ready and accessible
Tip 6: Similarly, keep a library of intermediate datamarts
Tip 7: Always use a holdout sample / cross-validation to avoid over-fitting (see the sketch after this list)
Tip 8: Work in chunks and take breaks regularly
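
As a minimal illustration of Tip 7, here is what a simple random holdout split looks like in R (my own sketch with simulated data, not code from the original post):

    set.seed(42)                                   # reproducibility (see Tip 4)
    my_data <- data.frame(x1 = rnorm(200),         # stand-in data; swap in your own
                          x2 = rnorm(200))
    my_data$y <- rbinom(200, 1, plogis(0.8 * my_data$x1 - 0.5 * my_data$x2))

    n         <- nrow(my_data)
    train_idx <- sample(seq_len(n), size = floor(0.7 * n))
    train     <- my_data[train_idx, ]
    test      <- my_data[-train_idx, ]

    fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
    pred <- predict(fit, newdata = test, type = "response")
    mean((pred > 0.5) == test$y)   # evaluate on the holdout only, never on the training set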

Functional Regression

Functional data analysis (FDA) involves the analysis of data whose ideal units of observation are functions defined on some continuous domain, and the observed data consist of a sample of functions taken from some population, sampled on a discrete grid. Ramsay & Silverman’s (1997) textbook sparked the development of this field, which has accelerated in the past 10 years to become one of the fastest growing areas of statistics, fueled by the growing number of applications yielding this type of data. One unique characteristic of FDA is the need to combine information both across and within functions, which Ramsay and Silverman called replication and regularization, respectively. This article focuses on functional regression, the area of FDA that has received the most attention in applications and methodological development. First, there is an introduction to basis functions, key building blocks for regularization in functional regression methods, followed by an overview of functional regression methods, split into three types:
(a) functional predictor regression (scalar-on-function),
(b) functional response regression (function-on-scalar), and
(c) function-on-function regression.
For each, the role of replication and regularization is discussed and the methodological development described in a roughly chronological manner, at times deviating from the historical timeline to group together similar methods. The primary focus is on modeling and methodology, highlighting the modeling structures that have been developed and the various regularization approaches employed. The review concludes with a brief discussion describing potential areas of future development in this field.
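
To fix ideas for case (a), a generic scalar-on-function regression model with a basis-expanded coefficient function can be written as (a schematic sketch, not the notation of any one method in the review):

    y_i = \alpha + \int_{\mathcal{T}} x_i(t)\, \beta(t)\, dt + \varepsilon_i,
    \qquad \beta(t) = \sum_{k=1}^{K} b_k\, \phi_k(t),

where the \phi_k are basis functions (splines, wavelets, or functional principal components, say). Regularization then amounts to truncating or penalizing the coefficients b_k, while replication enters through pooling information across the sampled curves x_i(t).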

Statistical Causality from a Decision-Theoretic Perspective

We present an overview of the decision-theoretic framework of statistical causality, which is well suited for formulating and solving problems of determining the effects of applied causes. The approach is described in detail, and it is related to and contrasted with other current formulations, such as structural equation models and potential responses. Topics and applications covered include confounding, the effect of treatment on the treated, instrumental variables, and dynamic treatment strategies.
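
To give a flavour of how ‘applied causes’ are formalized in this framework (using the regime-indicator notation common in this literature, which may differ in detail from the article’s own), the average causal effect of a binary treatment X on a response Y compares interventional regimes rather than potential responses:

    \mathrm{ACE} = \mathrm{E}(Y \mid F_X = 1) - \mathrm{E}(Y \mid F_X = 0),

where F_X = x denotes the regime in which X is set to x by intervention and F_X = \emptyset the purely observational regime; assumptions such as ignorability are then expressed as conditional independence statements involving F_X.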

Julia 0.4 Release Announcement

We are pleased to announce the release of Julia 0.4.0. This release contains major language refinements and numerous standard library improvements. A summary of changes is available in the NEWS log found in our main repository. We will be making regular 0.4.x bugfix releases from the release-0.4 branch of the codebase, and we recommend the 0.4.x line for users requiring a more stable Julia environment. The Julia ecosystem continues to grow, and there are now over 700 registered packages! JuliaCon 2015 was held in June, and more than 60 talks are available to view. JuliaCon India will be held in Bangalore on 9 and 10 October. We welcome bug reports on our GitHub tracker, and general usage questions on the users mailing list, StackOverflow, and several community forums. Binaries are available from the main download page, or visit JuliaBox to try 0.4 from the comfort of your browser. Happy Coding!

Multivariate Order Statistics: Theory and Application

This work revisits several proposals for the ordering of multivariate data via a prescribed depth function. We argue that one of these deserves special consideration, namely, Tukey’s halfspace depth, which constructs nested convex sets via intersections of halfspaces. These sets provide a natural generalization of univariate order statistics to higher dimensions and exhibit consistency and asymptotic normality as estimators of corresponding population quantities. For absolutely continuous probability measures in ℝ^d, we present a connection between halfspace depth and the Radon transform of the density function, which is employed to formalize both the finite-sample and asymptotic probability distributions of the random nested sets. We review multivariate goodness-of-fit statistics based on halfspace depths, which were originally proposed in the projection pursuit literature. Finally, we demonstrate the utility of halfspace ordering as an exploratory tool by studying spatial data on maximum and minimum temperatures produced by a climate simulation model.
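
For reference, Tukey’s halfspace depth of a point x with respect to a probability measure P on ℝ^d is

    \mathrm{HD}(x; P) = \inf\{\, P(H) : H \text{ a closed halfspace with } x \in H \,\},

and the nested convex sets referred to above are the depth regions \{ x : \mathrm{HD}(x; P) \ge \alpha \}; replacing P by the empirical measure of a sample yields their finite-sample analogues, which play the role of multivariate order statistics.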

Statistics of Extremes

Statistics of extremes concerns inference for rare events. Often the events have never yet been observed, and their probabilities must therefore be estimated by extrapolation of tail models fitted to available data. Because data concerning the event of interest may be very limited, efficient methods of inference play an important role. This article reviews this domain, emphasizing current research topics. We first sketch the classical theory of extremes for maxima and threshold exceedances of stationary series. We then review multivariate theory, distinguishing asymptotic independence and dependence models, followed by a description of models for spatial and spatiotemporal extreme events. Finally, we discuss inference and describe two applications. Animations illustrate some of the main ideas.
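
As a pointer to the classical theory for maxima, the limiting family is the generalized extreme-value (GEV) distribution, with distribution function

    G(z) = \exp\left\{ -\left[ 1 + \xi\, \frac{z - \mu}{\sigma} \right]_{+}^{-1/\xi} \right\},

where \mu, \sigma > 0 and \xi are location, scale and shape parameters, a_+ = \max(a, 0), and the case \xi = 0 is read as the limit \exp\{-\exp[-(z - \mu)/\sigma]\}; threshold exceedances lead analogously to the generalized Pareto distribution.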

Agent-Based Models and Microsimulation

Agent-based models (ABMs) are computational models used to simulate the actions and interactions of agents within a system. Usually, each agent has a relatively simple set of rules for how it responds to its environment and to other agents. These models are used to gain insight into the emergent behavior of complex systems with many agents, in which the emergent behavior depends upon the micro-level behavior of the individuals. ABMs are widely used in many fields, and this article reviews some of those applications. However, relatively little work has been done on statistical inference for such models, so this article also points out these gaps and recent strategies to address them.
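
As a toy illustration (my own sketch, not from the article) of simple micro-level rules producing emergent macro-level structure, consider agents on a ring who adopt the local majority state with occasional random flips; blocks of aligned agents emerge from a purely local rule:

    set.seed(1)
    n_agents <- 200
    n_steps  <- 100
    state <- sample(c(0, 1), n_agents, replace = TRUE)   # each agent starts in a random state

    for (step in seq_len(n_steps)) {
      left  <- c(state[n_agents], state[-n_agents])      # neighbour to the left (ring)
      right <- c(state[-1], state[1])                    # neighbour to the right (ring)
      local_sum <- left + state + right                  # own state breaks ties
      new_state <- as.integer(local_sum >= 2)            # adopt the local majority
      flip  <- runif(n_agents) < 0.01                    # small amount of random noise
      state <- ifelse(flip, 1L - new_state, new_state)
    }

    mean(state)   # share of agents in state 1 after the run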

Data manipulation with reshape2

In this article, I will show you how to use the reshape2 package to convert data from wide to long format and vice versa. The package was written and is maintained by Hadley Wickham.
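
A minimal round trip between the two formats (toy data of my own, not from the article):

    library(reshape2)

    wide <- data.frame(id     = 1:3,
                       height = c(170, 182, 165),
                       weight = c(65, 80, 58))

    # wide -> long: one row per id/measure combination
    long <- melt(wide, id.vars = "id",
                 variable.name = "measure", value.name = "value")

    # long -> wide: spread 'measure' back out into columns
    wide_again <- dcast(long, id ~ measure, value.var = "value")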

Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients

In the previous part of the tutorial we implemented an RNN from scratch, but didn’t go into detail on how the Backpropagation Through Time (BPTT) algorithm calculates the gradients. In this part we’ll give a brief overview of BPTT and explain how it differs from traditional backpropagation. We will then try to understand the vanishing gradient problem, which has led to the development of LSTMs and GRUs, two of the currently most popular and powerful models used in NLP (and other areas). The vanishing gradient problem was originally discovered by Sepp Hochreiter in 1991 and has been receiving attention again recently due to the increased application of deep architectures. To fully understand this part of the tutorial I recommend being familiar with how partial differentiation and basic backpropagation work. If you are not, you can find excellent tutorials here and here and here, in order of increasing difficulty.
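
The core of the story can be summarized in one equation: by the chain rule, the gradient of the loss at step t with respect to an earlier hidden state h_k involves a product of Jacobians,

    \frac{\partial E_t}{\partial h_k} = \frac{\partial E_t}{\partial h_t} \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}},

and when the norms of these Jacobians are bounded below one the product shrinks exponentially in t - k, so distant time steps contribute almost nothing to the update (norms above one make the gradients explode instead). This is the behaviour analysed in detail in the post.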

The Graphical Network Associated with Customer Churn

The node representing ‘Will Not Stay’ draws our focus toward the left side of the following undirected graph. Customers of a health care insurance provider were asked about their intentions to renew at the next sign-up period. We focus on those indicating the greatest potential for defection by creating a binary indicator separating those who say they will not stay from everyone else. In addition, before telling us whether or not they intended to switch health care providers, these customers were given a checklist and instructed to check all the events that recently occurred (e.g., price increases, higher prescription costs, provider not covering all expenses, hospital and doctor visits, and customer service contacts).
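
A minimal sketch of the data preparation step described above, using simulated stand-in data and a crude correlation matrix in place of the undirected graphical model estimated in the post (variable names are invented):

    set.seed(7)
    n <- 500
    # simulated stand-ins for the checklist of recent events
    churn_survey <- data.frame(price_increase  = rbinom(n, 1, 0.30),
                               higher_rx_costs = rbinom(n, 1, 0.20),
                               service_contact = rbinom(n, 1, 0.25))

    # binary indicator separating 'will not stay' from everyone else
    lp <- -1.5 + 1.2 * churn_survey$price_increase + 0.8 * churn_survey$higher_rx_costs
    churn_survey$will_not_stay <- rbinom(n, 1, plogis(lp))

    # pairwise associations; the post fits an undirected graphical model instead
    round(cor(churn_survey), 2)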

Learning R: Index of Online R Courses, October 2015

Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down and that back-to-school feeling is in the air. And for more people than ever before, it is going to seem like a good time to commit to really learning R. I have some suggestions for R courses below, but first: What does it mean to learn R anyway? My take is that the answer depends on a person’s circumstances and motivation.

How to Change the Reference Map in Choroplethr

Last week I released an update to choroplethr that lets you combine choropleth maps with reference maps. Since that post many people have asked if it’s possible to change the reference map that choroplethr uses. The answer is yes, but it requires some code.
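
For context, the feature that earlier update added looks roughly like this (assuming the choroplethr version described in that post; the lower-level code needed to swap in a different reference map is what the remainder of this post walks through and is not reproduced here):

    library(choroplethr)
    data(df_pop_county)   # demo county-population dataset shipped with choroplethr
    county_choropleth(df_pop_county, reference_map = TRUE)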

Plotting regression curves with confidence intervals for LM, GLM and GLMM in R

Once models have been fitted and checked and re-checked comes the time to interpret them. The easiest way to do so is to plot the response variable versus the explanatory variables (I call them predictors), adding to this plot the fitted regression curve together with (if you are feeling fancy) a confidence interval around it. Now this approach is preferred over the partial-residual one because it allows you to average out any other potentially confounding predictors and so focus only on the effect of one focal predictor on the response. In my work I have been doing this hundreds of times and finally decided to put all this into a function to clean up my code a little bit.
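
A stripped-down version of the idea for a GLM (my own sketch with simulated data; the post wraps code like this into a reusable function and extends it to GLMMs):

    set.seed(1)
    # simulate a binary response that depends on one focal predictor
    x <- runif(100, 0, 10)
    y <- rbinom(100, 1, plogis(-2 + 0.5 * x))
    m <- glm(y ~ x, family = binomial)

    # predictions on the link scale, with standard errors
    newdat <- data.frame(x = seq(0, 10, length.out = 200))
    pr     <- predict(m, newdata = newdat, type = "link", se.fit = TRUE)

    # back-transform the fitted curve and an approximate 95% confidence band
    newdat$fit <- plogis(pr$fit)
    newdat$lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
    newdat$upr <- plogis(pr$fit + 1.96 * pr$se.fit)

    plot(x, y, pch = 16, col = "grey50", xlab = "x", ylab = "P(y = 1)")
    lines(newdat$x, newdat$fit, lwd = 2)
    lines(newdat$x, newdat$lwr, lty = 2)
    lines(newdat$x, newdat$upr, lty = 2)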

User-friendly scaling

Back in the mists of time, whilst programming early versions of Canoco, Cajo ter Braak decided to allow users to specify how species and site ordination scores were scaled relative to one another via a simple numeric coding system. This was fine for the DOS-based software that Canoco was at the time; you entered 2 when prompted and you got species scaling, while -1 got you site or sample scaling with Hill’s scaling or correlation-based scores, depending on whether your ordination was a unimodal or linear method. This system persisted; even in the Windows era of Canoco these numeric codes can be found lurking in the .con files that describe the analysis performed. This use of numeric codes for scaling types was so pervasive that it was logical for Jari Oksanen to include the same system when the first cca() and rda() functions were written, and in doing so Jari perpetuated one of the most frustrating things I’ve ever had to deal with as a user and teacher of ordination methods. But, as of last week, my frustration is no more…
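
For readers wondering what the fix looks like, the new interface in the development version of vegan described in the post lets you request scalings by name plus explicit modifiers instead of by magic number (argument names here follow the post and may differ slightly in later releases):

    library(vegan)
    data(dune)
    ord <- cca(dune)

    # the old way: numeric codes inherited from Canoco
    head(scores(ord, display = "sites", scaling = 2))

    # the new, user-friendly way: descriptive names plus an explicit Hill's-scaling switch
    head(scores(ord, display = "sites", scaling = "sites", hill = TRUE))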