Will the new Star Wars suck? An analysis of directors and movie involvement

In this article, we’ll use a statistical model to analyze movie directors and their involvement as screenwriters and producers of movies, and we’ll use that same model to predict the Rotten Tomatoes Tomatometer™ score for Star Wars: Episode VII – The Force Awakens. Truth be told, there will be more shooting from the hip than an actual Star Wars battle. There will be a lot of estimating, not a lot of focus on uncertainty, and hell, if we hit something, it’s probably because we got lucky. But to even take a stab at it, we’re going to need some data.


Putting the ‘I’ in open science: How you can change the face of science

If we want to shift from a closed science to an open science, there has to be change at several levels. In this process, it’s easy to push the responsibility (and the power) for reform onto “the system”: “If only journals changed their policy …”, “It’s the responsibility of the granting agencies to change XYZ”, or “Unless university boards change their hiring practices, it is futile to …”. Beyond doubt, change has to occur at the institutional level. In particular, many journals have already done a lot (see, for example, the TOP guidelines or the new registered reports article format). But journal policies aren’t enough, particularly since they are often not enforced. In this blog post, I want to advocate for a complementary position of agency and empowerment: Let’s focus on steps each individual can take!


A Complete Tutorial on Time Series Modeling in R

‘Time’ is the most important factor in ensuring success in a business. It’s difficult to keep up with the pace of time. But technology has developed some powerful methods with which we can ‘see things’ ahead of time. Don’t worry, I am not talking about a time machine. Let’s be realistic here! I’m talking about the methods of prediction and forecasting. One such method, which deals with time-based data, is time series modeling. As the name suggests, it involves working on time-based data (years, days, hours, minutes) to derive hidden insights that support informed decision making. Time series models are very useful when you have serially correlated data. Most business houses work on time series data to analyze sales numbers for the next year, website traffic, competitive position and much more. However, it is also one of the areas that many analysts do not understand.


Big Data Insights – IT Support Log Analysis

This post gives the audience a few glimpses (strictly glimpses) of the insights obtained from a case in which predictive analytics helped a Fortune 1000 client unlock the value in the huge log files of its IT support system. As quick background, a large organization was interested in value-added, actionable insights from thousands of records logged in the past, as it saw expenses increase with no gain in productivity.


Trade-offs to consider when reading a large dataset into R using the RevoScaleR package

There are many R packages dedicated to letting users (or useRs if you prefer) deal with big data in R. (We will intentionally avoid using proper case for ‘big data’, because (1) the term has become somewhat hackneyed, and (2) for the sake of this article we can think of big data as any dataset too large to fit into memory as a data.frame so that standard R functions can run on it.) Even without third-party packages, base R still puts some tools at our disposal, which boil down to doing one of two things: We can either format the data more economically so that it can still be squeezed into memory, or we can deal with the data piecemeal, bypassing the need to load it into memory all at once.


mapview 1.0.0 now on CRAN

We are happy to announce that mapview 1.0.0 has been released on CRAN. Together with Florian Detsch (R, Rcpp), Christoph Reudenbach (js) and Stefan Woellauer (js) we have put together a powerful tool for on-the-fly mapping of spatial data during the workflow.


Talking mats as a visual method for assessing and discussing data visualisations

One of the central research activities during this project involved nine focus groups with a total of 46 participants. The purpose of these focus groups was to invite the participants to spend time looking at, exploring and reflecting on their experiences of working with up to eight different visualisation projects.