Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System. But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark. Spark is apparently much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations.

But even Spark suffers a very practical problem, shared by the others mentioned above: All of these systems are complicated. There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava. Some of this requires systems knowledge that many R users may lack. And once they do get these systems set up, they may be required to design algorithms with world views quite different from R, even though technically they are coding in R.

So, do we really need all that complicated machinery? Hadoop and Spark provide efficient distributed sort operations, but if one’s application does not depend on sorting, we have a cost-benefit issue here.

Here is an alternative, more a general approach than a package, which I call ‘Snowdoop.’ (The name alludes to the fact that it uses the section of the parallel package derived from the old snow package.) The idea is to retain the notion of chunking files into distributed mini-files, but (a) do this on one’s own, and (b) then process those files using ordinary R code, not fancy new functions like Hadoop and Spark require.
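To make the idea concrete, here is a minimal sketch of the two steps, using only base R and the parallel package. It is an illustration under stated assumptions, not a definitive implementation: the toy word-count task, the temporary directory, the chunk file names words.1, words.2, ... and the helper count_chunk are all hypothetical choices made for this example.

```r
library(parallel)

# Hypothetical toy data: one word per line, stored in a temporary directory.
workdir <- tempfile("snowdoop")
dir.create(workdir)
words <- c("a", "b", "a", "c", "b", "a", "d", "a")
nchunks <- 2

# (a) Chunk the data ourselves: chunk i is written to its own mini-file.
#     The naming scheme words.<i> is an assumption made for this sketch.
idx <- splitIndices(length(words), nchunks)
for (i in seq_len(nchunks)) {
  writeLines(words[idx[[i]]], file.path(workdir, paste0("words.", i)))
}

# (b) Process the chunks with ordinary R code on a snow-style cluster.
#     count_chunk is a hypothetical helper: read one chunk, tabulate it.
count_chunk <- function(i, dir) {
  w <- readLines(file.path(dir, paste0("words.", i)))
  table(w)
}

cl <- makeCluster(nchunks)
counts <- parLapply(cl, seq_len(nchunks), count_chunk, dir = workdir)
stopCluster(cl)

# Combine the per-chunk tables into overall counts, again in plain R.
allcounts <- unlist(lapply(counts, function(tb) setNames(as.vector(tb), names(tb))))
total <- tapply(allcounts, names(allcounts), sum)
print(total)
```

Here the workers run on the local machine and share one directory; on a real cluster, each mini-file would instead need to be readable by the node that processes it. Nothing in the per-chunk work is special to the distributed setting: it is the same readLines() and table() one would use on a single file.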