IPython 3.0 Released
This is a really big release: over 150 contributors and almost 6,000 commits in a bit under a year. Support for languages other than Python is greatly improved, the notebook UI has been significantly redesigned, and the experimental interactive widgets have seen a lot of improvement. The message protocol and document format have both been updated, while maintaining better compatibility with previous versions than prior updates did. The notebook webapp can now edit any text file, and even offers a web-based terminal (on Unix platforms).

3.x will be the last monolithic release of IPython, as the next release cycle will see the growing project split into its Python-specific and language-agnostic components. Language-agnostic projects (notebook, qtconsole, etc.) will move under the umbrella of the new Project Jupyter name, while Python-specific projects (interactive Python shell, Python kernel, IPython.parallel) will remain under IPython and be split into a few smaller packages. To reflect this, IPython is in a bit of a transitional state: the logo on the notebook is now the Jupyter logo, and kernels installed system-wide now go in a jupyter directory. We are going to do our best to ease this transition for users and developers.
Playing around with #rstats twitter data
As a bit of weekend fun, I decided to briefly look into the #rstats twitter data that Stephen Turner collected and made available (thanks!). Essentially, this data set contains some basic information about over 100,000 tweets containing the hashtag “#rstats”, which tweeters use to indicate they are tweeting about R.
One weird trick to compile multipartite dynamic documents with Rmarkdown
This afternoon I stumbled across one weird trick: an undocumented part of the YAML header that gets processed when you click the ‘knit’ button in RStudio. Knitting turns an Rmarkdown document into a specified output format, using the rmarkdown package’s render function to call pandoc (a universal document converter written in Haskell). If you specify a knit: field in an Rmarkdown YAML header, you can replace the default function (rmarkdown::render) that the input file and encoding are passed to with any arbitrarily complex function. For example, the developer of slidify passed in a totally different function in place of render – slidify::knit2slides.
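As a minimal sketch of what such a header might look like (the output_dir destination here is purely illustrative), the knit: field takes a function of the input file and its encoding:

```yaml
---
title: "My report"
# knit: replaces rmarkdown::render as the function RStudio calls on knit;
# it is handed the input file path and its encoding
knit: (function(input, encoding) rmarkdown::render(input, encoding = encoding, output_dir = "docs"))
---
```

Anything could go in that function body, which is what makes the trick so flexible – and so easy to abuse.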
Tools in Tandem – SQL and ggplot. But is it Really R?
Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid: SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it. For example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:
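A sketch of that recipe, assuming a local SQLite export of the ergast data (the file name and the table/column names below are illustrative rather than a guaranteed match for the ergast schema):

```r
library(DBI)       # generic database interface
library(RSQLite)   # SQLite driver
library(ggplot2)   # plotting

# Connect to a local SQLite copy of the ergast data
con <- dbConnect(RSQLite::SQLite(), "ergastdb.sqlite")

# Let SQL do the heavy lifting: count laps led (position 1) per driver,
# returning the result straight into an R dataframe
lapsLed <- dbGetQuery(con, "
  SELECT d.driverRef, COUNT(*) AS lapsLed
    FROM lapTimes l
    JOIN drivers d ON l.driverId = d.driverId
   WHERE l.position = 1
   GROUP BY d.driverRef
   ORDER BY lapsLed DESC")

dbDisconnect(con)

# Hand the dataframe over to ggplot2 for the visual part
ggplot(lapsLed, aes(x = reorder(driverRef, lapsLed), y = lapsLed)) +
  geom_bar(stat = "identity") +
  coord_flip()
```

The division of labour is clear: SQL for the aggregation, R only as glue, ggplot2 for the rendering – hence the “is it really R?” question.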
Scalable Machine Learning for Big Data Using R and H2O
H2O is an open source parallel processing engine for machine learning on Big Data. This prediction engine is built by H2O, a Mountain View-based startup that has implemented a number of impressive statistical and machine learning algorithms to run on HDFS, S3, SQL and NoSQL. We were honored to have Tom Kraljevic (Vice President of Engineering at H2O) demonstrate how this prediction engine is suited for machine learning on Big Data from within R. Yes, that’s right, from within R. Most R users will attest to running into memory issues when crunching millions or billions of data records; that’s exactly what H2O is designed to address. So it was no surprise that most of the R users in attendance, including myself, were impressed by Tom’s demonstration.
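As a rough sketch of what “from within R” looks like with the h2o package (using the built-in iris data purely for illustration rather than anything Big-Data-sized):

```r
library(h2o)

# Start (or connect to) a local H2O instance
h2o.init()

# Push the data into the H2O cluster; R now holds only a
# lightweight reference, so R's own memory is not the bottleneck
iris_hex <- as.h2o(iris)

# Fit a model on the cluster-resident data
# (h2o.gbm is one of the algorithms the package exposes)
model <- h2o.gbm(x = 1:4, y = "Species", training_frame = iris_hex)

h2o.shutdown(prompt = FALSE)
```

The key point is that the data and the model fitting live in the H2O engine; R is just the steering wheel.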