SentimentBuilder – Free Online Natural Language Processing Tool

Sankey diagrams are mostly used to visualize energy flows, material flows and trade-offs. But SentimentBuilder has repurposed them for unstructured text, building on its online NLP tool!

Predicting Titanic deaths on Kaggle VI: Stan

It is a bit of a contradiction. Kaggle provides competitions on data science, while Stan clearly belongs to (Bayesian) statistics. Yet after using random forests, boosting and bagging, I also think this problem has a suitable size for Stan, which I understand can handle larger problems than older Bayesian software such as JAGS. What I aim to do is enter a load of variables in the Stan model. Aliasing will be ignored, and I hope the hierarchical model will provide suitable shrinkage for terms which are not relevant.

The Real Reason for House Price Inflation in New Zealand

Population growth, shortages in housing supply, internal migration, immigration, cheap money, and foreign investors are just a few of the claimed causes of House Price Inflation (HPI) in New Zealand in recent years. The notorious example of HPI in action is NZ’s largest city – Auckland.

An introduction to Apache Drill and why it is useful

With the rapid growth of data and the shift towards rapid development solutions, much data is being stored in non-relational stores such as Hadoop and MongoDB. The infrastructure built upon relational databases, which have been used for decades, cannot keep up with the volume and scope of data being captured. At the same time, SQL is a really good invention: a very widely used method for extracting and analysing data. In short, it will not be replaced by hierarchical query techniques such as XPath any time soon.

Introduction of Markov State Modeling

Modeling and prediction problems occur in different domains and data situations. One type of situation involves sequences of events. For instance, you may want to model the behaviour of customers on your website, looking at the pages they land on or enter by, the links they click, and so on. You may want to do this to understand common issues and needs, and then redesign your website to address them. You may, on the other hand, want to promote certain sections or products on the website and need to understand the right page architecture and layout. In another example, you may be interested in predicting a patient’s next medical visit based on previous visits, or a customer’s next purchase based on previous purchases. While traditional classification-based prediction methods may apply, an additional class of algorithms becomes available if you can classify actions as a finite set of discrete events.
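
To make the idea of a finite set of discrete events concrete, here is a minimal sketch of a first-order Markov chain estimated from page-visit sequences. The session data, page names and function names are hypothetical and chosen only for illustration; the original post may well use a different formulation or tool.

```python
from collections import Counter, defaultdict


def estimate_transitions(sessions):
    """Estimate first-order transition probabilities from sequences of states."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Count each observed (current state -> next state) pair.
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probs[state] = {nxt: n / total for nxt, n in followers.items()}
    return probs


def most_likely_next(probs, state):
    """Return the most probable next state, or None if the state was never left."""
    followers = probs.get(state)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical sessions: each list is the ordered pages one visitor viewed.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart", "checkout"],
    ["search", "product", "home"],
    ["home", "search", "product", "cart", "checkout"],
]

transitions = estimate_transitions(sessions)
print(transitions["product"])                 # estimated P(next page | current page = product)
print(most_likely_next(transitions, "home"))  # most frequent page visited after "home"
```

The same counting approach would apply to the medical-visit or next-purchase examples, as long as each action can be mapped to one of a fixed set of states.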

Can You Say “Heteroscedasticity” 3 Times Fast?

Most books on regression analysis assume homoscedasticity, the situation in which Var(Y | X = t), for a response variable Y and vector of predictor variables X, is the same for all t. Yet, needless to say, almost all data in real life is heteroscedastic. For Y = human weight and X = height, say, we know that the assumption of homoscedasticity can’t be true, even approximately.
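
As a rough illustration of what the definition means in practice, the sketch below simulates two datasets with the same mean function, one with constant noise and one whose noise grows with the predictor, and compares the residual spread across ranges of X. The mean function, noise levels and bin edges are arbitrary assumptions made for the demonstration, not values from the original post.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable


def simulate(n, heteroscedastic):
    """Draw (x, y) pairs with E[Y | X = x] = 2x; noise sd is constant or proportional to x."""
    data = []
    for _ in range(n):
        x = random.uniform(1.0, 10.0)
        sd = 0.5 * x if heteroscedastic else 2.0
        data.append((x, 2.0 * x + random.gauss(0.0, sd)))
    return data


def residual_sd_by_bin(data, edges):
    """Standard deviation of y - 2x within each x bin (the true mean is assumed known here)."""
    result = []
    for lo, hi in zip(edges, edges[1:]):
        residuals = [y - 2.0 * x for x, y in data if lo <= x < hi]
        result.append(statistics.stdev(residuals))
    return result


edges = [1.0, 4.0, 7.0, 10.0]
homo_sd = residual_sd_by_bin(simulate(5000, heteroscedastic=False), edges)
hetero_sd = residual_sd_by_bin(simulate(5000, heteroscedastic=True), edges)

print("constant-variance data:", [round(s, 2) for s in homo_sd])
print("variance grows with x: ", [round(s, 2) for s in hetero_sd])
```

Under homoscedasticity the per-bin spreads should be roughly equal; in the second dataset they increase from the lowest bin to the highest, which is the pattern one would expect for weight against height.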

Code and documentation for the winning solution at the Grasp-and-Lift EEG Detection challenge

The goal of this challenge was to detect 6 different events related to hand movement during a task of grasping and lifting an object, using only the EEG signal. We were asked to provide probabilities for the 6 events at every time sample. The evaluation metric for this challenge was the area under the ROC curve (AUC), averaged over the 6 event types.
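
The averaged metric can be sketched as follows: compute one AUC per event column, then take the mean. This is a plain-Python illustration under stated assumptions, not the competition's scoring code; the function names and the toy data (two event columns instead of six, four time samples) are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    # Pairwise comparison is quadratic in the sample count; fine for a small illustration.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_columnwise_auc(y_true, y_score):
    """Average the per-event AUC over all event columns."""
    n_events = len(y_true[0])
    aucs = []
    for j in range(n_events):
        labels = [row[j] for row in y_true]
        scores = [row[j] for row in y_score]
        aucs.append(roc_auc(labels, scores))
    return sum(aucs) / n_events


# Toy data: rows are time samples, columns are event types.
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_score = [[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]]

print(mean_columnwise_auc(y_true, y_score))  # 0.875 (column AUCs are 1.0 and 0.75)
```

Because each column is scored separately, a model that ranks samples well for one event but poorly for another is penalised in proportion, regardless of how the predicted probabilities are scaled.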

Comparing two timeseries-generating blackboxes

This question on Cross Validated got me interested. I gave a fairly inadequate answer and want to explore a few of the issues. Actually, I have a plan for an effective technique, which is what I think the original post was asking for, but I need to check out a few things first.

R: Utilize function body inline comments for documentation

When writing a long function that has to deal with multiple checks and complex processes, it is valuable to put comments in the function body. This allows readers (including you) to grasp the process workflow without going into details. I’m going to present a way to reuse those comments for documentation. The post is structured around the package development process. The inline function body comments will be used to generate a documentation file that stores a task list.