A survey of open source tools for machine learning with big data in the Hadoop ecosystem

With an ever-increasing number of options, the task of selecting machine learning tools for big data can be difficult. The available tools have advantages and drawbacks, and many have overlapping uses. The world’s data is growing rapidly, and traditional tools for machine learning are becoming insufficient as we move towards distributed and real-time processing. This paper is intended to aid the researcher or professional who understands machine learning but is inexperienced with big data. In order to evaluate tools, one should have a thorough understanding of what to look for. To that end, this paper provides a list of criteria for making selections along with an analysis of the advantages and drawbacks of each. We do this by starting from the beginning and looking at what exactly the term “big data” means. From there, we go on to the Hadoop ecosystem for a look at many of the projects that are part of a typical machine learning architecture and an understanding of how everything might fit together. We discuss the advantages and disadvantages of three different processing paradigms along with a comparison of engines that implement them, including MapReduce, Spark, Flink, Storm, and H2O. We then look at machine learning libraries and frameworks, including Mahout, MLlib, and SAMOA, and evaluate them based on criteria such as scalability, ease of use, and extensibility. There is no single toolkit that truly embodies a one-size-fits-all solution, so this paper aims to help make decisions smoother by providing as much information as possible and quantifying what the tradeoffs will be. Additionally, throughout this paper, we review recent research in the field using these tools and talk about possible future directions for toolkit-based learning.


Sharing data to form alliances

The advantages of forming alliances between companies are well understood. From giants such as IBM and Apple to collections of smaller enterprises, the advantages are seen to be the ability to:
• Enter new markets.
• Enhance product and service offerings.
• Extend market coverage.
• Scale up production.
• Benefit from bulk purchases.
• Share costs of resources and technology.
Data may be the pivot around which alliances are formed, and the advantages of data-centric alliances include scalability of operations, flexibility of alliance partnerships, and agility of business goals.


Adding Text to R Plot

Diversity is a real strength. By now it is common knowledge. I often see institutions openly encourage a multinational environment and multidisciplinary professionals, with specific “on-the-job” training tailored to their own needs. No one knows a lot about a lot, so bringing different people together enhances the independent thinking and knowledge available to the organization. Clarity of communication then becomes even more important, and making sure your figures are quickly understandable goes a long way.


The Mathematics of Paul Graham’s Bias Test

A major problem in detecting biased decision-making is the problem of unknown inputs. For example, suppose a venture capitalist has a public portfolio which has funded 75% male founders and 25% female. Is this venture capitalist biased against women? It’s impossible to know – perhaps only 10% of his applicants were female, in which case he might actually be biased in favor of women.
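To make the arithmetic in this example concrete, here is a minimal Python sketch. It only illustrates why the applicant pool matters; it is not the bias test named in the title. The function name `relative_selection_rate` is hypothetical, the 25% funded share and the 10% applicant share come from the example above, and the other applicant shares are made-up values for comparison.

```python
def relative_selection_rate(funded_share, applicant_share):
    """Return the funding rate of one group divided by the funding rate
    of everyone else, assuming exactly two groups.

    funded_share    -- fraction of funded founders in the group
    applicant_share -- fraction of applicants in the group (the unknown input)
    """
    if not 0 < applicant_share < 1:
        raise ValueError("applicant_share must be strictly between 0 and 1")
    if not 0 <= funded_share < 1:
        raise ValueError("funded_share must be in [0, 1)")
    # Each group's funding rate is proportional to funded share / applicant share;
    # the common factor (total funded / total applicants) cancels in the ratio.
    group_rate = funded_share / applicant_share
    other_rate = (1 - funded_share) / (1 - applicant_share)
    return group_rate / other_rate


if __name__ == "__main__":
    funded_female = 0.25  # from the portfolio in the example
    # 0.10 is the scenario in the text; 0.25 and 0.50 are hypothetical.
    for applicants_female in (0.10, 0.25, 0.50):
        ratio = relative_selection_rate(funded_female, applicants_female)
        print(f"female applicants {applicants_female:.0%}: "
              f"relative funding rate {ratio:.2f}")
```

A ratio above 1 means female applicants were funded at a higher rate than male applicants, and a ratio below 1 means the opposite. The same 75/25 portfolio gives a ratio of about 3 when 10% of applicants are female and about 0.33 when half are, which is why the portfolio split alone cannot settle the question.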


October 2015: Scripts of Week

October’s scripts of the week get you started with XGBoost in the up-and-coming Julia language, share a great template for exploratory analyses (and why they’re so important), highlight the power of interactive dygraph visualizations, walk through a method of filling in gaps in time series training sets, and tell a fascinating story on the economics of being a working mom.


A Bayesian Model to Calculate Whether My Wife is Pregnant or Not

On the 21st of February, 2015, my wife had not had her period for 33 days, and as we were trying to conceive, this was good news! An average period is around a month, and if you are a couple trying to go triple, then a missing period is a good sign something is going on. But at 33 days, this was not yet a missing period, just a late one, so how good news was it? Pretty good, really good, or just meh?


Using the wakefield package to easily generate reproducible sample data

Back in 2011, I asked a question on StackOverflow: ‘How to make a great R reproducible example?’. This question attracted some great answers, including answers by Hadley Wickham and Joris Meys (co-author of R for Dummies). In June of this year Tyler Rinker added a new answer.