New ‘R Talk’ podcast with news from R Consortium

There’s a new podcast in town, and it’s focused on R. Four members of the R community (Oliver Keyes, Wikimedia Foundation; Jasmine Dumas, DePaul University; Ted Hart, Silicon Valley ‘fruit vendor’; and Mikhail Popov, Wikimedia Foundation) have banded together to produce R Talk, a podcast about ‘the R Programming language, featuring news, interviews and dives into how R is used in different academic and industry fields’. In the inaugural episode, Jasmine shared her experiences developing R in the Google Summer of Code, and Oliver shared his impressions of (and stories from) the recent useR conference in Denmark. Ted also responded to a user question on the history of CRAN and whether anything is likely to replace it (you can submit your own question to be answered in a future episode by tweeting to @RTalkPodcast).

Data Mining Algorithms: Explained Using R

The essential idea of the book is to describe the basic data mining algorithms and their components by presenting first the building blocks, and then their combinations, in the form of R code. The author notes right away that the book “does not teach [R] nor requires (sic) the readers to learn it. This is because the example code can be run and the results can looked up (sic) with barely any knowledge of R.” Cichosz even writes low-level functions (for example, code for computing means and medians) to show how they might have been coded in R. These low-level functions are sometimes re-used, but more often he turns to the more efficient but less readable built-in versions later in the book. As another example, he sets up an entire classification tree implementation, complete with pruning, in code that is laid out line by line, and then, in later examples, reverts to use of the built-in rpart package.
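
To give a flavour of what "low-level functions for means and medians" can look like, here is a minimal sketch in base R. It is not taken from the book: the names my_mean and my_median are hypothetical, and the code assumes a non-empty numeric vector with no missing values.

```r
# Illustrative only: hand-written mean and median, not the book's code.
# Assumes x is a non-empty numeric vector without NA values.

my_mean <- function(x) {
  sum(x) / length(x)
}

my_median <- function(x) {
  s <- sort(x)            # order the values
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]        # odd length: the middle element
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even length: average the two middle elements
  }
}
```

For example, my_mean(c(1, 2, 3, 4)) gives 2.5 and my_median(c(5, 1, 3)) gives 3, matching the built-in mean() and median(). The built-ins also handle details such as missing values, which this sketch leaves out.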

Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….

Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day when I stumbled upon “Goede Tijden Slechte Tijden” (GTST), a Dutch soap series broadcast by RTL Nederland. I must confess, I was watching (had to watch) this years ago because my wife was watching it... My gut feeling with these daily soap series is that missing a few months or even years does not matter. Once you’ve seen some GTST episodes you’ve seen them all; the story line is always very similar. Can we use some data science to test whether this gut feeling makes sense? I am using R and SAS to investigate this and to see if more interesting soap analytics can be derived.