Shiny PCA

Following up on the previous data science graduate course, a nice (new) R package makes it possible to create a Shiny page with interactive output for a principal component analysis.

Research Leaders on Data Mining, Data Science and Big Data key advances, top trends

Research Leaders in Data Science and Big Data reflect on the most important research advances in 2015 and the key trends expected to dominate throughout 2016.

Scikit-learn and Python Stack Tutorials: Introduction, Implementing Classifiers

A small collection of introductory scikit-learn and Python stack tutorials for those with an existing understanding of machine learning looking to jump right into using a new set of tools.

Google Geo Data – Data Access Without Restrictions

Geo-distances are of great importance: researchers from various disciplines rely on geographic distances. Health researchers use geographic data when analyzing the spread of diseases, economists when evaluating the impact of transaction costs on human behavior, and sociologists when evaluating interpersonal distances (based on external factors) in human interaction.

A gentle introduction to parallel computing in R

Let’s talk about the use and benefits of parallel computation in R.

Experience with Rules-Based Programming for Distributed Concurrent Fault-Tolerant Code

As we saw in yesterday’s paper, the authors of RAMCloud settled on a very effective design pattern for writing distributed, concurrent, fault-tolerant (DCFT) modules within their system. They call this pattern ‘rules-based programming’ – a collection of (condition, action) pairs that can execute very efficiently even in the highly demanding, latency-sensitive environment of RAMCloud. If you have to write logic that makes lots of concurrent requests, requires fault-tolerance / failure handling, and must cope with non-determinism – as well as being concise and as easy to understand as the domain allows – then you definitely want to study the rules-based DCFT pattern.
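To make the (condition, action) idea concrete, here is a minimal Python sketch. It is not RAMCloud's implementation: the `Rule` class, the `run_rules` loop, and the toy replication scenario (replica names, a crashed replica, a spare) are all hypothetical and chosen only for illustration. Each rule tests the shared state and, if its condition holds, takes one small step; the loop keeps applying rules until none fires.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class Rule:
    """One (condition, action) pair; names here are illustrative only."""
    name: str
    condition: Callable[[State], bool]
    action: Callable[[State], None]


def run_rules(state: State, rules: List[Rule], max_passes: int = 100) -> List[str]:
    """Apply every rule whose condition holds, pass after pass, until no rule fires."""
    fired = []
    for _ in range(max_passes):
        progressed = False
        for rule in rules:
            if rule.condition(state):
                rule.action(state)
                fired.append(rule.name)
                progressed = True
        if not progressed:
            break
    return fired


# Hypothetical scenario: send a request to replicas a, b, c; b has crashed
# and is replaced by the spare d.
def send_request(s):
    s["in_flight"].append(s["unsent"].pop(0))


def handle_crash(s):
    s["in_flight"] = [r for r in s["in_flight"] if r not in s["crashed"]]
    if s["spares"]:
        s["unsent"].append(s["spares"].pop(0))


def receive_ack(s):
    s["acked"].extend(r for r in s["in_flight"] if r not in s["crashed"])
    s["in_flight"] = [r for r in s["in_flight"] if r in s["crashed"]]


def finish(s):
    s["done"] = True


rules = [
    Rule("send_request", lambda s: bool(s["unsent"]), send_request),
    Rule("handle_crash",
         lambda s: any(r in s["crashed"] for r in s["in_flight"]), handle_crash),
    Rule("receive_ack",
         lambda s: any(r not in s["crashed"] for r in s["in_flight"]), receive_ack),
    Rule("finish",
         lambda s: not s["done"] and not s["unsent"] and not s["in_flight"], finish),
]

state = {"unsent": ["a", "b", "c"], "in_flight": [], "acked": [],
         "crashed": ["b"], "spares": ["d"], "done": False}
fired = run_rules(state, rules)
print(state["acked"], state["done"])
```

Failure handling is just another rule in this sketch: no code path has to anticipate a crash at a particular step, because the crash rule fires whenever its condition becomes true.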

Tutorial – Python List Comprehension With Examples

List comprehension is a powerful, must-know concept in Python. Yet it remains one of the most challenging topics for beginners. With this post, I intend to help each one of you who is facing this trouble in Python.
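For readers who have not met the construct yet, a few small examples may help. These are not taken from the tutorial itself; the variable names are made up for illustration. They show a plain loop next to the equivalent comprehension, followed by a filter, a conditional expression, and a nested comprehension.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Plain loop: build a list of squares step by step.
squares_loop = []
for n in numbers:
    squares_loop.append(n * n)

# Equivalent list comprehension: expression first, then the loop.
squares = [n * n for n in numbers]

# A trailing `if` filters which items are included.
even_squares = [n * n for n in numbers if n % 2 == 0]

# A conditional expression chooses the value for every item.
labels = ["even" if n % 2 == 0 else "odd" for n in numbers[:3]]

# Two `for` clauses read left to right, like nested loops.
pairs = [(x, y) for x in range(3) for y in range(x)]

print(squares)        # [1, 4, 9, 16, 25, 36]
print(even_squares)   # [4, 16, 36]
print(labels)         # ['odd', 'even', 'odd']
print(pairs)          # [(1, 0), (2, 0), (2, 1)]
```

The filter `if` comes after the `for` clause, while the `if ... else` conditional expression is part of the output expression and comes before it.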

Mastering R Plot – Part 1: colors, legends and lines

This is the first post of a series that will look at how to create graphics in R using the plot function from the base package. There are of course other packages to make cool graphs in R (like ggplot2 or lattice), but so far plot has always given me satisfaction.

Large-Scale Language Classification

Identifying the language of a text is an important requirement for any other processing on written text. While the focus so far has been mainly on language detection for long written text, short social media posts like tweets are becoming more important. Furthermore, most language classifiers are designed to work only with dozens of languages. In this paper, we scale up the language identification to work on 200 languages, and additionally present the results of the classifier on a Twitter data set of 70 languages. The classifier reached close to 95% when tested on 200 languages, and the results on the Twitter data set are just short of 90%. Additionally, we made some optimizations to ensure that classifications are made in a timely manner, suitable for use on the web.

Exploring €1.3 trillion in public contracts with graph visualization

The European Union is giving free access to detailed information about public purchasing contracts. That data describes which European institutions are spending money, on what, and who benefits from it. We are going to explore the network of public institutions and suppliers and look for interesting patterns with graph visualization.

The Unreasonable Reputation of Neural Networks

It is hard not to be enamoured by deep learning nowadays, watching neural networks show off their endless accumulation of new tricks. There are, as I see it, at least two good reasons to be impressed:
(1) Neural networks can learn to model many natural functions well, from weak priors.
(2) Neural networks can learn surprisingly useful representations.

R trends in 2015 (based on cranlogs)

It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on top packages published in 2015, I decided to have my own go at the top R trends of 2015. Contrary to Safferling’s post, I’ll also try to (1) look at packages from previous years that hit the big league, (2) see what top R coders we have in the community, and then (3) round up with my own 2015-R-experience.

Heston model for Options pricing with ESGtoolkit

In this post, I’ll show you how to use ESGtoolkit to simulate the Heston stochastic volatility model for stock prices. This is probably my last post on ESGtoolkit before I start working on the project again (yeah, I know it’s been a while since v0.1! 🙂 ).

miniCRAN – developing internal CRAN Repositories

Today, I needed to work on a package that had numerous dependencies on internal packages and ones from CRAN. To be able to handle dependencies in the installation process, I needed something like CRAN so that install.packages() would work correctly. We have an internal CRAN but I wanted to make one specific to this set of packages. Our early guidance, produced by Greg back in 2014, used a number of custom functions and manual folder structure creation. It worked, but it required effort. Since then Revolution Analytics have developed miniCRAN, which was designed to make developing internal CRAN repositories a breeze. It’s been really helpful for developing an internal repository for working on this project, so I wanted to show others how easy it is too. In fact, my entire minimum reproducible example is just nine steps!

Casting a Wide (and Sparse) Matrix in R

I routinely use melt() and cast() from the reshape2 package as part of my data munging workflow. Recently I’ve noticed that the data frames I’ve been casting are often extremely sparse. Stashing these in a dense data structure just feels wasteful. And the dismal drone of page thrashing is unpleasant. So I had a look around for an alternative. As it turns out, it’s remarkably easy to cast a sparse matrix using sparseMatrix() from the Matrix package. Here’s an example.

Formatting table output in R

Formatting data for output in a table can be a bit of a pain in R. The package formattable by Kun Ren and Kenton Russell provides some intuitive functions to quickly create good-looking tables for the R console or HTML. The package home page nicely demonstrates the functions with illustrative examples.