How Big Data Can Improve the Lives of the Poor
The role of Big Data in allowing greater financial inclusion for the poor is also a trending Internet topic, but it is mostly creating optimism and interest rather than controversy and dissent.

A new open source data set for anomaly detection
Yahoo Labs has just released an interesting new data set useful for research on detecting anomalies (or outliers) in time series data. There are many contexts in which anomaly detection is important. For Yahoo, the main use case is in detecting unusual traffic on Yahoo servers. The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.
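
To make the "series plus 0/1 indicator series" layout concrete, here is a minimal Python sketch of how a detector could be scored against such labels. It is not Yahoo's method: the trailing-median rule, the function names, the window and threshold values, and the tiny example series are all assumptions made for illustration, and nothing here reads the actual data files.

```python
from statistics import median


def flag_anomalies(values, window=25, threshold=5.0, min_history=5):
    """Return a 0/1 list marking points far from the trailing-window median.

    Distance is measured in units of the median absolute deviation (MAD)
    of the preceding `window` observations. All defaults are illustrative.
    """
    flags = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < min_history:
            flags.append(0)  # not enough context yet to judge this point
            continue
        med = median(history)
        mad = median([abs(h - med) for h in history])
        scale = mad if mad > 0 else 1e-9  # avoid division by zero on flat history
        flags.append(1 if abs(x - med) / scale > threshold else 0)
    return flags


def precision_recall(predicted, labels):
    """Compare predicted 0/1 flags with the 0/1 indicator series."""
    tp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(predicted, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example series with one labelled spike (not taken from the data set).
values = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 50, 11, 10, 12, 11]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

flags = flag_anomalies(values)
precision, recall = precision_recall(flags, labels)
```

Because every series ships with its own indicator series, the same scoring function could be applied series by series to compare detectors.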

Palladium: Framework for setting up predictive analytics services
Palladium provides means to easily set up predictive analytics services as web services. It is a pluggable framework for developing real-world machine learning solutions. It provides generic implementations for things commonly needed in machine learning, such as dataset loading, model training with parameter search, a web service, and persistence capabilities, allowing you to concentrate on the core task of developing an accurate machine learning model. Having a well-tested core framework that is used for a number of different services can reduce development and maintenance costs, because the services share the same code base and follow identical processes. Palladium has a web service overhead of only a few milliseconds, making it possible to set up services with low response times.

Advice for Data Scientists on Where to Work
Of course, there are other considerations: domain, the company’s brand, the specific technology in use, the culture, the people, and so forth. All of those are equally important. We call out the three above since they are less frequently talked about, yet fundamental to a data scientist’s growth, impact, and happiness. They are also less obvious. We learned these things from experience. At first glance, you would not expect to find these things in a women’s apparel company. However, our very different business model places a huge emphasis on data science, enables some of the richest data in the world, and creates space for a whole new suite of innovative software.

IPython Notebooks with tutorials for #pandas, #scikitlearn, and #numpy
The following IPython Notebooks are the standard training material distributed with the Addfor trainings. For more information about standard and custom training solutions please visit Services @ Addfor. All the IPython notebooks are distributed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Computing Platforms for Analytics, Data Mining, Data Science
The results of the KDnuggets Poll suggest a split between a majority of data miners and data scientists who work with ‘PC-size’ data and a smaller group of Big Data analysts who work with cloud-sized data. Cloud computing, Unix, and especially Mac gained in popularity.

The Price of Fuel: How Bad Could It Get?
The cost of fuel in South Africa (and I imagine pretty much everywhere else) is a contentious topic. It varies from month to month and, although it is clearly related to the price of crude oil and the exchange rate, various other forces play an influential role. According to the Department of Energy the majority of South Africa’s fuel is refined from imported crude oil. The rest is synthetic fuel produced locally from coal and natural gas. The largest expense for a fuel refinery is for raw materials. So little wonder that the cost of fuel should be intimately linked to the price of crude oil. The price of crude oil on the international market is quoted in US Dollars (USD) per barrel, so that the exchange rate between the South African Rand and the US Dollar (ZAR/USD) also exerts a strong influence on the South African fuel price. I am going to adopt a simplistic model in which I’ll assume that the price of South African fuel depends on only those two factors: crude oil price and the ZAR/USD exchange rate. We’ll look at each of them individually before building the model.
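
One possible form of such a two-factor model, sketched in Python: convert the crude price to Rand per litre using the exchange rate, then fit a straight line from that cost to the pump price. This is an assumption about the model's shape, not necessarily the one the author builds, and every number below is made up for illustration (the fuel prices are generated from a known line so the fit can be checked). The only fixed fact used is that a barrel is 42 US gallons, about 158.987 litres.

```python
LITRES_PER_BARREL = 158.987  # 42 US gallons


def crude_cost_zar_per_litre(crude_usd_per_barrel, zar_per_usd):
    """Cost of crude oil in Rand per litre, combining both factors."""
    return crude_usd_per_barrel * zar_per_usd / LITRES_PER_BARREL


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative inputs only: these are NOT real market figures.
crude_usd = [60.0, 80.0, 100.0, 110.0]  # USD per barrel
zar_per_usd = [10.0, 11.0, 12.0, 11.5]  # Rand per US Dollar

costs = [crude_cost_zar_per_litre(c, r) for c, r in zip(crude_usd, zar_per_usd)]

# Synthetic pump prices generated from a known line (6.0 + 1.5 * cost),
# standing in for observed monthly fuel prices.
fuel_zar_per_litre = [6.0 + 1.5 * c for c in costs]

intercept, slope = fit_line(costs, fuel_zar_per_litre)
```

In this form the intercept absorbs everything that does not scale with the crude price, while the slope captures how strongly the Rand cost of crude passes through to the pump price.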

Configuring the R BatchJobs package for Torque batch queues
I was asked recently to look at some R code which performs ‘embarrassingly parallel’ computations (the same function, multiple times, different parameters) and see whether I could modify it to run on one of our high-performance computing clusters. The machine has 63 virtual compute nodes and uses the TORQUE batch queue system to allocate nodes to compute jobs. First stop: the CRAN Task View High-Performance and Parallel Computing with R. Two promising packages there: BatchJobs and BatchExperiments. Their documentation is quite extensive with useful examples, but I found it a little disjointed and confusing. What I wanted was a simple, step-by-step guide to setting up for a first-time user. So here is my attempt. As always, it’s for ‘Linux-like’ systems.
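
The "same function, multiple times, different parameters" pattern can be sketched in a few lines of Python. This is only an illustration of the pattern on a single machine, not of BatchJobs or TORQUE: the `simulate` task and the parameter grid are hypothetical, and a thread pool is used purely to keep the sketch portable (for CPU-bound work a process pool or a batch queue would be the more usual choice).

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(params):
    """Hypothetical task: one independent run for one parameter set."""
    mean, n = params
    return mean * n


def run_all(param_grid, max_workers=4):
    """Apply `simulate` to every parameter set; runs share no state."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(simulate, param_grid))


# Each tuple is one independent job, analogous to one batch submission.
param_grid = [(0.5, 10), (1.0, 20), (2.0, 30)]
results = run_all(param_grid)
```

Because no job depends on another, the work can be split across any number of workers or cluster nodes without changing the function itself.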

An example of drawing beast tree using ggtree
BEAST output is well supported by ggtree, and it’s easy to reproduce such a tree view. ggtree parses BEAST output with the read.beast function, and the tree can be visualized directly with the ggtree function. Since this is a time-scaled tree, we can set the parameter time_scale = TRUE and ggtree will parse the time and use it as branch length.

Introduction to R
Getting started with data analysis can seem overwhelming, but it doesn’t have to be with the use of the right tools. In this lesson, DataCamp will teach you about the fundamentals of R, the increasingly popular statistical programming language. Through case studies and a live tutorial walkthrough, you’ll learn the advantages of R and understand if it’s the right language for you.

Speed test of sequence generation for unbalanced simulation
I have a simulation package that allows for the simulation of regression models, including nested data structures. You can see the package on GitHub here: simReg. Over the weekend I updated the package to allow for the simulation of unbalanced designs. I’m hoping to put together a new vignette soon highlighting the functionality.