The Data Lake Debate: Pros Up First
The data lake is essential for any organization that wants to take full advantage of its data.

SVG patterns for Data Visualization

Mining Web Pages in Parallel
A tool in C# called the ‘Webpage Downloader’. This class can be used within any C# program to download large volumes of webpage content in parallel.
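The idea of downloading many pages in parallel can be sketched in a few lines. The following is a minimal Python illustration, not the C# ‘Webpage Downloader’ class itself; the names fetch, download_all, fetch_one and max_workers are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL concurrently; return {url: body or exception}."""
    urls = list(urls)  # materialize so the input can be iterated twice

    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception as exc:  # keep one failure from aborting the batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = pool.map(safe_fetch, urls)  # results come back in input order
        return dict(zip(urls, bodies))
```

Threads suit this task because the work is dominated by waiting on the network. Passing a different fetch_one makes it possible to try the function without network access.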

Journal of Statistical Software – Vol. 64
• iqLearn: Interactive Q-Learning in R
• Fitting Heavy Tailed Distributions: The poweRlaw Package
• Exploring Diallelic Genetic Markers: The HardyWeinberg Package
• fitdistrplus: An R Package for Fitting Distributions
• Constructing and Modifying Sequence Statistics for relevent Using informR in R
• NHPoisson: An R Package for Fitting and Validating Nonhomogeneous Poisson Processes
• PReMiuM: An R Package for Profile Regression Mixture Models Using Dirichlet Processes
• R Package multgee: A Generalized Estimating Equations Solver for Multinomial Responses
• nparcomp: An R Software Package for Nonparametric Multiple Comparisons and Simultaneous Confidence Intervals
• gems: An R Package for Simulating from Disease Progression Models
• BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments
• Building a Nomogram for Survey-Weighted Cox Models Using R
• SDD: An R Package for Serial Dependence Diagrams

BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments
Empirical analysis of statistical algorithms often demands time-consuming experiments. We present two R packages which greatly simplify working in batch computing environments. The package BatchJobs implements the basic objects and procedures to control any batch cluster from within R. It is structured around cluster versions of the well-known higher-order functions Map, Reduce and Filter from functional programming. Computations are performed asynchronously and all job states are persistently stored in a database, which can be queried at any point in time. The second package, BatchExperiments, is tailored for the still very general scenario of analyzing arbitrary algorithms on problem instances. It extends package BatchJobs by letting the user define an array of jobs of the kind ‘apply algorithm A to problem instance P and store results’. It is possible to associate statistical designs with parameters of problems and algorithms and therefore to systematically study their influence on the results. The packages’ main features are:
(a) Convenient usage: All relevant batch system operations are either handled internally or mapped to simple R functions.
(b) Portability: Both packages use a clear and well-defined interface to the batch system which makes them applicable in most high-performance computing environments.
(c) Reproducibility: Every computational part has an associated seed to ensure reproducibility even when the underlying batch system changes.
(d) Abstraction and good software design: The code layers for algorithms, experiment definitions and execution are cleanly separated and enable the writing of readable and maintainable code.
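The pattern described above (register jobs with a Map step, keep their states in a database, query the states, then Reduce the results) can be illustrated with a toy sketch. This is Python, not R, and it is not the BatchJobs API: the class Registry and its methods are hypothetical, and jobs run sequentially in-process rather than asynchronously on a cluster.

```python
import json
import sqlite3


class Registry:
    """Toy job registry: every job and its state lives in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id INTEGER PRIMARY KEY, arg TEXT, status TEXT, result TEXT)"
        )

    def batch_map(self, args):
        """Register one pending job per argument; nothing runs yet."""
        self.db.executemany(
            "INSERT INTO jobs (arg, status) VALUES (?, 'pending')",
            [(json.dumps(a),) for a in args],
        )
        self.db.commit()

    def run_pending(self, fun):
        """Stand-in for the cluster: run pending jobs and record outcomes."""
        rows = self.db.execute(
            "SELECT id, arg FROM jobs WHERE status = 'pending'"
        ).fetchall()
        for job_id, arg in rows:
            try:
                status, result = "done", json.dumps(fun(json.loads(arg)))
            except Exception as exc:  # a failed job is recorded, not raised
                status, result = "error", json.dumps(str(exc))
            self.db.execute(
                "UPDATE jobs SET status = ?, result = ? WHERE id = ?",
                (status, result, job_id),
            )
        self.db.commit()

    def status(self):
        """Count jobs per state; can be called at any point in time."""
        rows = self.db.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status")
        return dict(rows.fetchall())

    def reduce_results(self, fun, init):
        """Fold the results of all finished jobs, in job-id order."""
        acc = init
        query = "SELECT result FROM jobs WHERE status = 'done' ORDER BY id"
        for (result,) in self.db.execute(query):
            acc = fun(acc, json.loads(result))
        return acc
```

Because the job table is the single source of truth, passing a file path instead of ":memory:" would let a later session reopen the registry and inspect or collect the same jobs.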

5 reasons why Spark Streaming’s batch processing of data streams is not stream processing
1. Stream Processing versus batch-based processing of data streams
2. Data arriving out of time order is a problem for batch-based processing
3. Batch length restricts Window-based analytics
4. Spark is claimed to be faster than Storm but is still performance limited
5. Batch-based systems offer high development costs and low productivity
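Point 2 can be made concrete with a toy example. The sketch below is plain Python, not Spark or Storm code, and the event data is invented: it sums values per 5-second window, once keyed by arrival time (how a batch cut on arrival would group them) and once keyed by the time each event actually occurred.

```python
from collections import defaultdict

# Hypothetical events: (event_time, arrival_time, value), times in seconds.
EVENTS = [
    (1, 1, 10),
    (2, 2, 20),
    (7, 8, 30),
    (4, 11, 40),  # produced at t=4 but delayed until t=11
    (12, 12, 50),
]


def sum_by_window(events, key_index, width=5):
    """Sum values per fixed window of `width` seconds, keyed by window start."""
    totals = defaultdict(int)
    for event in events:
        window_start = (event[key_index] // width) * width
        totals[window_start] += event[2]
    return dict(totals)


by_arrival = sum_by_window(EVENTS, key_index=1)     # grouping by arrival time
by_event_time = sum_by_window(EVENTS, key_index=0)  # grouping by event time

print(by_arrival)     # {0: 30, 5: 30, 10: 90}
print(by_event_time)  # {0: 70, 5: 30, 10: 50}
```

The delayed event is credited to the window starting at 10 when grouping by arrival, although it belongs to the window starting at 0.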

Analyze LinkedIn with R
Some time ago I saw an interesting post in an R-related group on LinkedIn. It was from Michael Piccirilli and he wrote something about his new package Rlinkedin. I was really impressed by his work and so I decided to write a blog post about it.

An Introduction to Statistics
This book assumes that
• you have some basic programming experience (If you have zero prior programming experience, you may want to start out with getting going with Python, using some of the great links given in the text. Starting programming and starting statistics may be a bit much at a time.)
• you have some data that you want to analyze (For almost all cases, a working Python program is provided. All you have to do is select the right program, adjust it so that it reads in your data, and interpret the results.)
• you are not a statistics expert (If you are already a statistics expert, the online help in Python will be sufficient to allow you to do most of your data analysis right away.)
The idea of this book is to give you all (or at least most of) the tools that you will need for your statistical data analysis. In doing so, I try to provide all the background required to understand what you are doing. I will not prove any theorems, and won’t indulge in mathematics where it is unnecessary. This approach explains why so much code is included: in principle, you have to define your problem, select the corresponding program, and adapt it to your needs. This should allow you to get going quickly, even if you have little Python experience. This is also the reason why I have not provided the software as a Python module, since I expect that you have to tailor each program to your specific setup (data format, etc).

On Some Alternatives to Regression Models
When you start discussing with people in machine learning, you quickly hear something like ‘forget your econometric models, your GLMs, I can easily find a machine learning ‘model’ that can beat yours’. I am usually very sceptical, especially when I hear ‘easily’ or ‘always’. I have no problem with the fact that I use old econometric models, but I had the feeling that things aren’t that easy. I can understand that we might have problems when we have a lot of features (I am still working on that, I’ll get back to this point soon), but I have the feeling that I can still capture interactions and non-linearities with standard econometric models as well as any machine learning algorithm.

New Online Tool for Seasonal Adjustment
Seasonal adjustment of time series can be a hassle. The software used by statistical agencies (X-13, X-12, TRAMO-SEATS) has tons of fantastic options, but the steep learning curve prevents users from taking advantage of the functionality of these packages, or from using them at all. The R package seasonal simplifies the task by providing an interface to X-13, the newest seasonal adjustment software by the US Census Bureau. It combines and extends the capabilities of the older X-12ARIMA and TRAMO-SEATS software packages. The simplest use of seasonal requires the application of the main function to a time series, which invokes automated procedures that work well in many circumstances: …

What Consumers Learn Before Deciding to Buy: Representation Learning
Features form the basis for much of our preference modeling. When asked to explain one’s preferences, features are typically accepted as appropriate reasons: this job paid more, that candidate supports tax reform, or it was closer to home. We believe that features must be the drivers since they so easily serve as rationales for past behavior. Choice modeling formalizes this belief by assuming that products and services are feature bundles with the value of the bundle calculated directly from the utilities of its separate features. All that we need to know about a product or service can be represented as the intersection of its features, which is why it is called conjoint analysis.
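The feature-bundle view can be illustrated with a small sketch. It assumes the common additive form, in which the value of a bundle is the sum of the utilities of its feature levels; the source only says the value is calculated directly from the separate utilities. The part-worth numbers and the names PART_WORTHS, bundle_utility and preferred are invented for this example.

```python
# Hypothetical part-worth utilities for a job offer; the numbers are invented.
PART_WORTHS = {
    "salary": {"low": 0.0, "high": 1.2},
    "commute": {"long": 0.0, "short": 0.8},
    "vacation": {"2 weeks": 0.0, "4 weeks": 0.5},
}


def bundle_utility(profile, part_worths=PART_WORTHS):
    """Total value of a bundle: the sum of its feature-level utilities."""
    return sum(part_worths[feature][level] for feature, level in profile.items())


def preferred(profiles, part_worths=PART_WORTHS):
    """Return the profile with the highest total utility."""
    return max(profiles, key=lambda p: bundle_utility(p, part_worths))


offer_a = {"salary": "high", "commute": "long", "vacation": "2 weeks"}
offer_b = {"salary": "low", "commute": "short", "vacation": "4 weeks"}
choice = preferred([offer_a, offer_b])  # offer_b: 0.8 + 0.5 beats 1.2
```

Under these made-up numbers, the lower-paying offer wins because its shorter commute and longer vacation together outweigh the salary difference.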