Efficient aggregation (and more) using data.table

In a recent post I wrote about the aggregate function in base R and gave some examples of its use. This post repeats the same examples using data.table instead, the most efficient implementation of the aggregation logic in R, and adds some further use cases showing the power of the data.table package. The focus is on the aggregation aspect of data.table; its other uses are only touched upon. For a great resource on everything data.table, head to the authors’ own free training material.

Simple Methods to deal with Categorical Variables in Predictive Modeling

Categorical variables are known to hide and mask lots of interesting information in a data set. It’s crucial to learn the methods of dealing with such variables. If you don’t, you will often miss out on finding the most important variables in a model. It has happened to me. Initially, I used to focus more on numerical variables and hence never actually got an accurate model. But later I discovered my flaws and learnt the art of dealing with such variables.

Accessing APIs from R (and a little R programming)

APIs are the driving force behind data mash-ups. It is APIs that allow machines to access data programmatically – that is, automatically from within a program – to make use of API-provided functionality and data. Without APIs, much of today’s Web 2.0, apps and data applications would be outright impossible. This post is about using APIs with R. As an example, we’ll use the EU’s EurLex database API as provided by Buhl Rassmussen. This API is a good example of the APIs you might find in the wild. Of course, there are the APIs of large vendors, like Google or Facebook, that are well thought out and well documented. But then there is the vast majority of smaller APIs for special applications that often lack structure or documentation. Nevertheless, these APIs often provide access to valuable resources.

scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems

Comparing the results obtained by two or more algorithms in a set of problems is a central task in areas such as machine learning or optimization. Drawing conclusions from these comparisons may require the use of statistical tools such as hypothesis testing. There are some interesting papers that cover this topic. In this manuscript we present scmamp, an R package aimed at being a tool that simplifies the whole process of analyzing the results obtained when comparing algorithms, from loading the data to the production of plots and tables. Comparing the performance of different algorithms is an essential step in many research and practical computational works. When new algorithms are proposed, they have to be compared with the state of the art. Similarly, when an algorithm is used for a particular problem, its performance with different sets of parameters has to be compared, in order to tune them for the best results. When the differences are very clear (e.g., when an algorithm is the best in all the problems used in the comparison), the direct comparison of the results may be enough. However, this is an unusual situation and, thus, in most situations a direct comparison may be misleading and not enough to draw sound conclusions; in those cases, the statistical assessment of the results is advisable.

What the candidates say, analyzing Republican debates using R

As most people realize, this is probably one of the most data-rich primary campaigns in history, with hundreds of professional pollsters poring over every data point trying to understand voters’ intentions. So here is another data-rich post to that end. I was glad to discover the University of California at Santa Barbara’s webpage with tons of high-quality data related to the elections. Amongst these are the transcripts of presidential debates going back to 1960, which I will pore over a bit further. Because the Republican race is arguably more fun to watch, I’ll be concentrating on these debates.

Get access to ALL R courses at DataCamp this month for $9

DataCamp is offering R-bloggers readers a holiday promotion. For just $9 (instead of $25) you can gain access to their full curriculum of online R courses, videos and interactive coding challenges. With hands-on learning and instruction from leading experts such as Garrett Grolemund (RStudio), Matt Dowle (data.table) and Bob Muenchen (r4stats), DataCamp’s premium courses can help you acquire new R skills.

Data Science for Losers, Part 5 – Spark DataFrames

Sometimes, the hardest part in writing is completing the very first sentence. I began to write the “Loser’s articles” because I wanted to learn a few bits on Data Science, Machine Learning, Spark, Flink etc., but as time passed the whole thing degenerated into a really chaotic mess. This may be a “creative” chaos, but it’s still way too messy to make any sense to me. I’ve got a few positive comments and also a lot of nice tweets, but quality is not a question of comments or individual Twitter frequency. Do these texts properly describe “Data Science”, or at least some parts of it? Maybe they do, but I don’t know for sure. Whatever, let’s play with Apache Spark’s DataFrames.

So You Want to Implement a Custom Loss Function?

My venerable boss recently took a trip to Amsterdam. As we live in New York City, he needed to board a plane. Days before, we discussed the asymmetric risk of airport punctuality: “if I get to the gate early, it’s really not so bad; if I arrive too late and miss my flight, it really, really sucks.” When deciding just when to leave the house – well – machine learning can help.
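
To make the idea of asymmetric risk concrete, here is a minimal Python sketch of a loss that punishes arriving late more heavily than arriving early. It is not taken from the post itself: the function names, the 10-to-1 weighting and the sample travel times are all assumptions chosen for illustration.

```python
def asymmetric_squared_loss(predicted, actual, late_weight=10.0, early_weight=1.0):
    """Squared error that costs more when the trip took longer than predicted.

    The weights are hypothetical; they only encode "late is much worse than early".
    """
    error = actual - predicted  # positive: the trip took longer, so we arrive late
    weight = late_weight if error > 0 else early_weight
    return weight * error ** 2


def mean_loss(predictions, actuals, **weights):
    """Average asymmetric loss over paired predictions and observed values."""
    pairs = list(zip(predictions, actuals))
    return sum(asymmetric_squared_loss(p, a, **weights) for p, a in pairs) / len(pairs)


# Made-up travel times to the gate, in minutes.
travel_times = [40, 45, 50, 55, 60]

# Search whole-minute constant predictions for the one with the lowest mean loss.
best = min(
    range(30, 91),
    key=lambda c: mean_loss([c] * len(travel_times), travel_times),
)
print(best)
```

With these made-up numbers the plain average travel time is 50 minutes, but the prediction that minimizes the asymmetric loss lands above that: the loss pushes you to leave the house earlier than a symmetric error measure would suggest.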

Choosing a Database for Analytics

When your analytics questions run into the edges of out-of-the-box tools, it’s probably time for you to choose a database for analytics. It’s not a good idea to write scripts to query your production database, because you could reorder the data and likely slow down your app. You might also accidentally delete important info if you have analysts or engineers poking around in there. You need a separate kind of database for analysis. But which one is right? In this post, we’ll go over suggestions and best practices for the average company that’s just getting started. Whichever setup you choose, you can make tradeoffs along the way to improve the performance from what we discuss here.

Official Product Tutorials – SAP Predictive Analytics, SAP Predictive Analysis and SAP InfiniteInsight

The following tutorials have been developed to help you get started using Business Intelligence products. New content is added as soon as it becomes available, so check back on a regular basis.