Open-sourcing Pinball (a scalable workflow manager)
As we continue to build in a fast and dynamic environment, we need a workflow manager that’s flexible and can keep up with our data processing needs. After trying a few options, we decided to build one in-house. Today we’re open-sourcing Pinball, which is designed to accommodate the needs of a wide range of data processing pipelines composed of jobs ranging from simple shell scripts to elaborate Hadoop workloads. Pinball joins our other open-sourced projects like Secor and Bender, available on our GitHub.
Clustering With K-Means in Python
A very common task in data analysis is grouping a set of objects into subsets such that the elements within a group are more similar to one another than they are to elements of other groups. The practical applications of such a procedure are many: given a medical image of a group of cells, a clustering algorithm could aid in identifying the centers of the cells; looking at the GPS data of a user’s mobile device, their most frequently visited locations within a certain radius can be revealed; for any set of unlabeled observations, clustering helps establish the existence of some sort of structure that might indicate that the data is separable.
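As a rough illustration of the grouping described above, here is a minimal sketch of Lloyd's algorithm, the usual k-means iteration, written in plain Python for 2-D points. The function name `kmeans`, its parameters, and the random-sample initialization are assumptions made for this example and are not taken from the article.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct input points as starting centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return centers, labels
```

Calling `kmeans(points, 2)` on a list of `(x, y)` tuples returns the two cluster centers and one label per point. The result depends on the starting centers, so in practice the procedure is often run several times with different seeds.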
The R package ecosystem is one of the cornerstones of the success seen by R. As of this writing, over 6200 packages are on CRAN, with several hundred more at BioConductor and at OmegaHat. Support for multiple repositories is built deeply into R, mostly via the (default) package utils. The update.packages function (along with several others from the utils package) can be used with these three default repositories, as well as many others, with ease. However, support for simple creation and use of local repositories seemed to be missing. Drat tries to help here.
The emergence of Spark
In the continuing Big Data evolution of reinventing everything that happened in HPC a couple of decades ago (with slight modifications), one newer ecosystem that comes up more and more is the Berkeley Data Analytics Stack. Some of the better-known components of this stack are Spark, Mesos, GraphX, and MLlib. Spark in particular has gained interest due in part to very fast computation in-memory or on-disk, generally pulling from Hadoop or Cassandra (courtesy of a connector). Its programming model uses Python, Scala, or Java, which, especially in the case of Python, is very friendly to data scientists. Coincidentally, Spark 1.3 was released today, and it supports the DataFrame abstraction used both in the popular Python pandas library and in R (for which it has an upcoming API).
New Whitepaper: Connect R to other applications with DeployR
DeployR is a server-based framework that provides simple, secure R integration for application developers. It’s available in two editions: DeployR Open, which is free and open-source; and Revolution R Enterprise DeployR, which adds a scalable grid framework and enterprise authentication features for production applications integrated with R. If you’re looking for an overview of what DeployR is and how you can use it to access R from other applications, we’ve just released a new white paper, Using DeployR to Solve the R Integration Problem. This 14-page document steps you through the basic DeployR workflow: the R user creates an R script and publishes it to the server; an application developer links to the R script via the DeployR API; and an application user sees the results of an R calculation in a desktop, mobile or web-based application (most likely without being aware that R was involved at all).
Data-as-a-Service: Real-Life Examples of Companies Who Are Using DaaS to Boost Revenue
Example One – Financial Services
1. Mortgages: First Time Home Buyers
2. Mortgages: Refinance Variation
Example Two – Automotive Industry
1. Broader Universe of Prospects
2. Real-Time Credit Triggers
3. In-Market Social Signaling
4. Digital Audience Expansion
5. Fleet Prospect Identification
Example Three – Furniture Retail
1. DataMentors’ Consumer Database: Identify New Prospects
2. Millennial Data: Target Millennial Consumers with Multi-Channel Messaging
3. Pre-Mover and New-Mover Data: Send Offers to Consumers Who May Soon be In Market
4. Social Signaling Data: Boost Customer Acquisition Through Social Prospecting
5. Onboarded Data: Digitally Addressable Dataset for Real-Time Messaging