Accessing Big Data with KNIME
Once you have established that integrating a big data platform into the corporate analytics ecosystem would be beneficial, the work has only just begun.
First, you need to select your platform among the many choices available on the market. Then, you need to configure your analytics environment so that any script/workflow can connect and run on the big data platform of choice.
Notice, also, that we are just at the beginning of the big data era: today's winners might no longer be the best choice in a few years (or even months)! You need the freedom to change platforms quickly, whenever necessary.
KNIME has developed a very flexible, easy-to-implement strategy to access any big data platform.

Global Economic Maps
In this post I am going to show how to extract data from web pages in table format, transform these data into spatial objects in R and then plot them in maps.

Plotly: Online Dashboards That Update Your Data and Graphs
A new online visualization option from Plotly lets you build data visualizations and graphs that update dynamically.

Copulas and Financial Time Series
I was recently asked to write a survey on copulas for financial time series. The paper is, so far, unfortunately, only available in French, at https://…/. It describes various models, including some graphs and statistical output obtained from real data. To illustrate, I've been using weekly log-returns of (crude) oil prices: Brent, Dubaï and Maya.
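The post's examples are in R on real oil-price data; as a minimal sketch in Python, with synthetic prices (and a made-up correlation) standing in for the Brent and Dubaï series, this is how weekly log-returns and a rank-based dependence measure feed into copula modeling. Kendall's tau is useful here because it depends only on the copula, not on the marginal distributions.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
# Synthetic weekly prices for two correlated series (stand-ins for Brent/Dubai).
steps = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=300)
prices = 50 * np.exp(np.cumsum(0.02 * steps, axis=0))

# Weekly log-returns: r_t = log(p_t) - log(p_{t-1})
logret = np.diff(np.log(prices), axis=0)

# A rank-based dependence measure is invariant to the marginals,
# so it reflects the copula alone.
tau, _ = kendalltau(logret[:, 0], logret[:, 1])
print(round(tau, 3))
```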

10 types of regressions. Which one to use?
Should you use linear or logistic regression? In which contexts? There are hundreds of types of regressions. Here is an overview for data scientists and other analytics practitioners, to help you decide which regression to use in your context.
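The basic split is in the target variable: linear regression for a continuous outcome, logistic regression for a binary one. A minimal sketch on synthetic data (the coefficients and the labeling rule are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

# Continuous target -> linear regression.
y_cont = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
lin = LinearRegression().fit(X, y_cont)

# Binary target -> logistic regression.
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
logit = LogisticRegression().fit(X, y_bin)

print(lin.coef_.round(1))           # recovers roughly [3, -2]
print(round(logit.score(X, y_bin), 2))
```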

Predicting Flights Delay Using Supervised Machine Learning
In this post, we’ll use a supervised machine learning technique called logistic regression to predict delayed flights.
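A hedged sketch of the idea, on synthetic data rather than a real flights dataset: the features (scheduled departure hour, route distance) and the delay rule below are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
# Hypothetical features: scheduled departure hour and route distance.
hour = rng.integers(5, 23, size=n)
dist = rng.uniform(100, 2500, size=n)
# Synthetic rule: later departures are delayed more often.
p_delay = 1 / (1 + np.exp(-0.4 * (hour - 17)))
delayed = (rng.random(n) < p_delay).astype(int)

X = np.column_stack([hour, dist / 1000])  # scale distance for the optimizer
X_tr, X_te, y_tr, y_te = train_test_split(X, delayed, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 2))  # held-out accuracy
```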

Using Azure as an R data source, Part 1
This post is the first in a series that covers pulling data from various Windows Azure hosted storage solutions (such as MySQL, or Microsoft SQL Server) to an R client on Windows or Linux. We’ll start with a relatively simple case of pulling data from SQL Azure to an R client on Windows.
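The general pattern is "run a SQL query, get a data frame back". Since no Azure server is available here, this sketch uses a local SQLite database as a stand-in; against SQL Azure you would swap in an ODBC connection, and the table and columns below are invented for illustration.

```python
import sqlite3
import pandas as pd

# Local SQLite stand-in for a remote SQL store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0)])

# Query straight into a data frame, the same shape of workflow the post
# builds in R against SQL Azure.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
print(df)
```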

streaming machine learning with RMOA: stream_in > train > predict
In the example below, we showcase the RMOA package using streaming JSON data, which can come from any NoSQL database that emits JSON. For this example, the jsonlite package provides a nice stream_in function (an example is shown here) that handles streaming JSON data. Plugging streaming machine learning models into RMOA is a breeze.
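The post's pipeline is R (jsonlite + RMOA); the same stream_in, train, predict shape can be sketched in Python with an incremental linear classifier. The JSON records and their labeling rule below are made up for illustration.

```python
import io
import json
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stand-in for a JSON stream (e.g. lines emitted by a NoSQL store).
stream = io.StringIO("\n".join(
    json.dumps({"x1": i % 5, "x2": (i * 7) % 3, "label": int(i % 5 > 2)})
    for i in range(200)
))

model = SGDClassifier(random_state=0)
batch = []
for line in stream:  # read the stream one record at a time
    batch.append(json.loads(line))
    if len(batch) == 50:  # update the model one mini-batch at a time
        X = np.array([[r["x1"], r["x2"]] for r in batch])
        y = np.array([r["label"] for r in batch])
        model.partial_fit(X, y, classes=np.array([0, 1]))
        batch = []

print(model.predict(np.array([[4, 0], [0, 0]])))
```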

Survival analysis: basic terms, the exponential model, censoring, examples in R and JAGS
The material contains:
• Mathematical formulations of key concepts of survival analysis.
• Illustration of the exponential model of failure density.
• Example of the exponential model fitting in R.
• Example of the same model fitting in JAGS.
• More complex model with censoring in JAGS.
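For the simplest of these pieces, the exponential model with density f(t) = λ e^(−λt), the maximum-likelihood estimate of the rate is just the reciprocal of the mean failure time. A minimal sketch in Python (the post uses R and JAGS) on simulated, uncensored data:

```python
import numpy as np

rng = np.random.default_rng(7)
true_rate = 0.5
# Simulated failure times from an exponential distribution (no censoring).
t = rng.exponential(scale=1 / true_rate, size=5000)

# For the exponential model, the MLE of the rate is lambda_hat = n / sum(t),
# i.e. the reciprocal of the mean failure time.
rate_hat = len(t) / t.sum()
print(round(rate_hat, 2))  # close to the true rate of 0.5
```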

Data Science – Short lesson on cluster analysis
In clustering, you let the data be grouped according to similarity. A cluster model is a set of segments (clusters) containing cases (such as clients, patients, cars, etc.). Once a cluster model has been developed, one question arises: how can I describe my model? Here we present a way to approach this question through a coordinate plot in R (code available at the end of the post).
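A coordinate plot describes each cluster by its average feature values. With k-means those per-cluster means are exactly the cluster centers, as this small Python sketch on two made-up "client" segments shows (the post itself works in R):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Two synthetic segments of "clients" described by two features.
seg_a = rng.normal([0, 0], 0.5, size=(100, 2))
seg_b = rng.normal([5, 5], 0.5, size=(100, 2))
X = np.vstack([seg_a, seg_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each center is the mean feature profile of one cluster,
# i.e. one line of a coordinate plot.
for center in km.cluster_centers_:
    print(center.round(1))
```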

Polyglot Persistence?
Yes it’s a real phrase and it’s the secret to picking the right NoSQL database.

Where Big Data Projects Fail
1. Not starting with clear business objectives
2. Not making a good business case
3. Management failure
4. Poor communication
5. Not having the right skills for the job

Columbia data science course, week 1: what is data science?
I’m attending Rachel Schutt’s Columbia University Data Science course on Wednesdays this semester and I’m planning to blog the class. Here’s what happened yesterday at the first meeting.

Tutorial: How to determine the quality and correctness of classification models? Introduction
Classification is the process of assigning every object from a collection to exactly one class from a known set of classes. Examples of classification tasks are:
• assigning a patient (the object) to a group of healthy or ill (the classes) people on the basis of his or her medical record,
• determining the customer’s (the object) credibility during credit application using, for example, demographic and financial data; in this case the classes are ‘credible’ and ‘not credible’,
• determining if the customer (the object) is likely to stop using the company’s services or products on the basis of behavioral and demographic data; in this case the classes are ‘disloyal customers’ and ‘loyal customers’.
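The standard starting point for judging such a model is the confusion matrix, which tabulates true against predicted classes, and the accuracy, the fraction of objects assigned correctly. A minimal sketch with hypothetical labels (say, 1 = 'credible', 0 = 'not credible'):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true vs. predicted classes for ten objects.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
print(cm)                              # rows: true class, columns: predicted class
print(accuracy_score(y_true, y_pred))  # fraction classified correctly -> 0.7
```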