Impact of target class proportions on accuracy of classification
When we build classification models from training data, the proportions of the target classes affect the accuracy of the predictions. This is an experiment to measure how large that effect is.
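A minimal sketch of such an experiment, not the original author's setup: it assumes two classes drawn from unit-variance Gaussians, a simple one-dimensional Gaussian classifier, and a balanced test set. The function names (`make_data`, `fit_threshold`, `evaluate`, `run_experiment`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def make_data(n, p_pos, rng):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1) (assumed)."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_pos else 0
        data.append((rng.gauss(1.5 * y, 1.0), y))
    return data


def fit_threshold(train):
    """Fit a 1-D Gaussian classifier (unit variance assumed).

    Returns the decision threshold: predict class 1 when x > threshold.
    The class priors are estimated from the training proportions, which is
    where class imbalance enters the model.
    """
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    prior_ratio = len(xs0) / len(xs1)
    return (m0 + m1) / 2 + math.log(prior_ratio) / (m1 - m0)


def evaluate(threshold, test):
    """Return (overall accuracy, recall on class 1) on the test set."""
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    positives = [x for x, y in test if y == 1]
    recall = sum(x > threshold for x in positives) / len(positives)
    return correct / len(test), recall


def run_experiment(proportions=(0.5, 0.2, 0.05), n_train=2000, n_test=4000, seed=0):
    """Train on different class-1 proportions, always test on balanced data."""
    rng = random.Random(seed)
    test = make_data(n_test, 0.5, rng)
    results = {}
    for p in proportions:
        threshold = fit_threshold(make_data(n_train, p, rng))
        results[p] = evaluate(threshold, test)
    return results


for p, (acc, recall) in run_experiment().items():
    print(f"class-1 share in training = {p:.2f}: accuracy = {acc:.3f}, recall = {recall:.3f}")
```

Under these assumptions, the more the training set is skewed towards class 0, the further the threshold moves away from the minority class, so recall on class 1 and accuracy on the balanced test set both fall.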

THE PAST (Entity-Attribute-Value) vs THE FUTURE (Sign, Signifier, Signified)
In both semantic model standards, Topic Maps and RDF/OWL, and in many NoSQL approaches to representing relations and relationships efficiently, one major stumbling block remains despite all efforts: the namespace. It is a language problem: the Babel of our civilized world is transferred into our IT systems. But machines do not have to understand our language; we do. The good news is that there is an alternative way of thinking about modelling data:
• token based,
• fully symmetrical,
• bidirectional linking,
• single instance centric,
• data-type agnostic,
• namespace agnostic,
• fully contextualized,
• structure-free,
and many more novelties.

Principal Component Analysis for Dummies
Principal Component Analysis, or PCA, is a statistical method used to reduce the number of variables in a dataset. It does so by lumping highly correlated variables together. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.
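To make the trade-off concrete, here is a small sketch restricted to two variables, where the eigenvalues of the covariance matrix have a closed form. It is an illustration under assumed data (the second variable is roughly twice the first plus a little noise), and `pca_2d` is a hypothetical helper, not a library function.

```python
import math
import random


def pca_2d(xs, ys):
    """PCA for two variables.

    Returns ((lam1, lam2), axis): the eigenvalues of the 2x2 sample covariance
    matrix in descending order, and the unit vector of the first component.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

    # Closed-form eigenvalues of [[sxx, sxy], [sxy, syy]].
    centre = (sxx + syy) / 2
    half_gap = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = centre + half_gap, centre - half_gap

    # Eigenvector belonging to lam1; fall back to a coordinate axis
    # when the two variables are uncorrelated.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)


# Assumed toy data: y is almost a multiple of x, so the two are highly correlated.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]

(lam1, lam2), axis = pca_2d(xs, ys)
explained = lam1 / (lam1 + lam2)
print(f"share of variance kept by one component: {explained:.3f}")
print(f"direction of that component: ({axis[0]:.3f}, {axis[1]:.3f})")
```

The share of variance kept by the first component is the "little accuracy" being traded: when it is close to 1, replacing both variables with their projection onto that single direction loses very little.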

Regression Models, It’s Not Only About Interpretation
Yesterday, I uploaded a post in which I tried to show that ‘standard’ regression models were not performing badly, at least if you include splines (multivariate splines) to take into account joint effects and nonlinearities. So far, I have not discussed the possibly high number of features (but with bootstrap procedures, it is possible to assess something related to variable importance, which people from machine learning like). But my post was not complete: I was simply plotting the predictions obtained by each model. And it ‘looked like’ regression was nice, but so were random forest, k-nearest neighbour and boosting algorithms. What if we compare those models on new data? Here is the code to create all the models (I included another one, a kind of benchmark in which no covariates are included), based on 1,000 simulated values.

SAMOA – Scalable Advanced Massive Online Analysis
SAMOA is a tool for mining big data streams. It is a distributed streaming machine learning (ML) framework, i.e. it is like Mahout, but for stream mining. SAMOA contains a programming abstraction for distributed streaming ML algorithms (refer to this post for stream ML definition) that enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Twitter Storm and S4). SAMOA also provides extensibility for integrating new SPEs into the framework. These features allow SAMOA users to code a distributed streaming ML algorithm once and execute it on multiple SPEs.

R: Parsing Date and Times
R has excellent support for dates and times via the built-in Date and POSIXt classes. Their usage, however, is not always as straightforward as one would want. Certain conversions are more cumbersome than we would like: while as.Date('2015-03-22') works, would it not be nice if as.Date('20150322') (a format often used in logfiles) also worked, or for that matter as.Date(20150322L) using an integer variable, or even as.Date('2015-Mar-22') and as.Date('2015Mar22')? Similarly, many date and time formats suitable for POSIXct (the short form) and POSIXlt (the long form with accessible components) often require rather too much formatting, and/or defaults. Why, for example, does as.POSIXct(as.numeric(Sys.time()), origin='1970-01-01') require the origin argument on the conversion back (from fractional seconds since the epoch) into datetime, when it is not required when creating the double-precision floating point representation of time since the epoch? But thanks to Boost and its excellent Boost Date_Time library, which we already mentioned in this post about the BH package, we can address parsing of dates and times. It permitted us to write a new function toPOSIXct(), which is now part of the RcppBDT package (albeit right now just the GitHub version, but we expect this to migrate to CRAN ‘soon’ as well).
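The general idea of a more forgiving parser can be sketched by trying a list of candidate formats in turn. The example below is written in Python rather than R and is not how toPOSIXct() or Boost Date_Time works internally; `parse_date` and `FORMATS` are hypothetical names, and the month-abbreviation formats assume an English locale.

```python
from datetime import date, datetime

# Candidate formats, tried in order (an assumed list covering the examples above).
FORMATS = (
    "%Y-%m-%d",  # 2015-03-22
    "%Y%m%d",    # 20150322
    "%Y-%b-%d",  # 2015-Mar-22
    "%Y%b%d",    # 2015Mar22
)


def parse_date(value):
    """Parse a string or integer into a date, trying each known format in turn."""
    text = str(value).strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"no known date format matches {text!r}")


for raw in ("2015-03-22", "20150322", 20150322, "2015-Mar-22", "2015Mar22"):
    print(repr(raw), "->", parse_date(raw))
```

Trying formats in sequence is simple but silently picks the first match, so the order of `FORMATS` matters whenever two formats could accept the same input.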

How to get the most from big data
Simply collecting big data does not unleash its potential value. People must do that, especially people who understand how analytics can resolve business issues or capture opportunities. Yet, as most executives know, good data people are hard to come by. According to a McKinsey survey, only 18 percent of companies believe they have the skills necessary to gather and use insights effectively. At the same time, only 19 percent of companies are confident that their insights-gathering processes contribute directly to sales effectiveness. And what if number crunchers aren’t enough? After all, if a great insight derived from advanced analytics is too complicated to understand, business managers just won’t use it. That’s why companies need to recruit and cultivate ‘translators’ – specialists capable of bridging different functions within the organization and effectively communicating between them (exhibit). But looking for a single translator at the right intersection of all the various skills you need is like looking for a unicorn. It’s more realistic to find translators who possess two complementary sets of skills, such as computer programming and finance, statistics and marketing, or psychology and economics. In all but the rarest of cases, you’ll need at least two translators to bridge each pair of functions – one of whom is grounded in his or her own function but has a good enough understanding of the other function to be able to communicate with a counterpart grounded there. That’s because when this process works best, it’s a collaboration rather than a straight ‘translation.’