Athey and Imbens on Machine Learning and Econometrics

The characteristics of ML are basically (1) emphasis on overall modeling, for prediction (as opposed, for example, to emphasis on inference), (2) moreover, emphasis on non-causal modeling and prediction, (3) emphasis on computationally-intensive methods and algorithmic development, and (4) emphasis on large and often high-dimensional datasets. Readers of this blog will recognize the ML characteristics as closely matching those of TSE! Rob Engle’s V-Lab at the NYU Stern Volatility Institute, for example, embeds all of (1)-(4). So TSE and ML have a lot to learn from each other, but the required bridge is quite short. Interestingly, Athey and Imbens come not from the TSE tradition, but rather from the CSE tradition, which typically emphasizes causal estimation and inference. That makes for a longer required CSE-ML bridge, but it may also make for a larger payoff from building and crossing it (in both directions).

Why Data Scientists Need to be Good Data Storytellers

Storytelling is data with a soul. Data Scientists are extremely good with numbers, but numbers alone are not sufficient to convey results to the end user. Being a good data storyteller is an art as well as a science. Data Scientists use data visualization tools such as Tableau to present data in a visually appealing format. A Data Scientist not only understands the data but also understands the business and the end user very well. A good data storyteller is as essential to a business as a data scientist, because a good storyteller ensures that the results of data analysis and modelling reach the right audience in an understandable format.

Data Science and EU Privacy Regulations: A Storm on the Horizon – Part 2

In Part 1, we introduced a pending EU privacy and data protection regulation (the GDPR) which will carry fines for violations of up to 5% of global annual turnover (1 million Euros for smaller companies). We discussed how this regulation will present particular challenges for collection, storage, and use of data within EU and global organizations. Impact will be felt by data scientists in particular but also across the IT organization. In this post, we focus on the impact on data science and analytic applications and suggest steps to take in the immediate to near future to prevent fines and/or crippling data blackouts.

Modelling Dependence with Copulas in R

A copula is a function which couples a multivariate distribution function to its marginal distribution functions, generally called marginals or simply margins. Copulas are great tools for modelling and simulating correlated random variables. Their main appeal is that they let you model the correlation structure and the marginals (i.e. the distribution of each of your random variables) separately. This can be an advantage because, for some combinations of marginals, there are no built-in functions to generate the desired multivariate distribution. For instance, in R it is easy to generate random samples from a multivariate normal distribution, but it is not so easy to do the same for, say, a distribution whose margins are Beta, Gamma and Student's t, respectively. Furthermore, as you can probably see by googling copulas, there is a wide range of models providing very different, customizable correlation structures that can easily be fitted to observed data and used for simulations. This variety of choice is one of the things I like most about copulas. In this post, we show how to use copulas in R with the copula package and then provide a simple example of an application.
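
As a rough illustration of the idea (not the post's own example; the copula choice and parameter values below are assumptions), the copula package lets you couple a Gaussian copula with arbitrary margins and then simulate from the result:

# Minimal sketch using the copula package; parameters are illustrative
library(copula)
set.seed(100)

# 3-dimensional Gaussian copula with a single exchangeable correlation parameter
cop <- normalCopula(param = 0.5, dim = 3, dispstr = "ex")

# Couple the copula with Beta, Gamma and Student's t margins
mv <- mvdc(copula = cop,
           margins = c("beta", "gamma", "t"),
           paramMargins = list(list(shape1 = 2, shape2 = 2),
                               list(shape = 3, rate = 2),
                               list(df = 5)))

# Simulate correlated draws whose margins are Beta(2,2), Gamma(3,2) and t(5)
x <- rMvdc(1000, mv)
pairs(x)

Swapping normalCopula() for, say, claytonCopula() or gumbelCopula() changes the dependence structure while leaving the margins untouched, which is exactly the separation described above.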

Multiple Hypothesis Testing

In recent years, there has been a lot of attention paid to hypothesis testing and so-called ‘p-hacking’, the misuse of statistical methods to obtain more “significant” results. Rightly so: we spend millions of dollars on medical research, for example, and we don’t want to waste time and money pursuing false leads caused by flaky statistics. But even if all of our assumptions are met and our data collection is flawless, it’s not always easy to get the statistics right; there are still quite a few subtleties we need to be aware of. This post introduces some of the interesting phenomena that can occur when testing hypotheses. First, we consider an example of a single hypothesis test that gives great insight into the difference between significance and ‘being correct’. Next, we look at global testing, where we have many different hypotheses and want to test whether all null hypotheses are true using a single test. We discuss two different tests, Fisher’s combination test and Bonferroni’s method, which lead to rather different results. We save the best till last, when we discuss what to do if we have many hypotheses and want to test each one individually. We introduce the concepts of familywise error rate and false discovery rate, and explain the Benjamini-Hochberg procedure.
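
As a small hedged sketch of the procedures mentioned above (the p-values here are simulated, purely for illustration), Fisher's combination test, Bonferroni's method and the Benjamini-Hochberg procedure can all be tried in a few lines of R:

# Simulated p-values: 95 true nulls and 5 "signals" (illustrative assumption)
set.seed(1)
p <- c(runif(95), rbeta(5, 1, 50))

# Global test: Fisher's combination statistic, -2 * sum(log p) ~ chi-squared with 2k df
fisher_stat <- -2 * sum(log(p))
fisher_p    <- pchisq(fisher_stat, df = 2 * length(p), lower.tail = FALSE)

# Global test: Bonferroni rejects the global null if min(p) < alpha / k
bonf_global <- min(p) < 0.05 / length(p)

# Individual testing: familywise error rate vs. false discovery rate control
rej_bonferroni <- p.adjust(p, method = "bonferroni") < 0.05
rej_bh         <- p.adjust(p, method = "BH") < 0.05   # Benjamini-Hochberg
c(bonferroni = sum(rej_bonferroni), BH = sum(rej_bh))

Typically the Benjamini-Hochberg adjustment rejects more hypotheses than Bonferroni, reflecting the weaker (false discovery rate rather than familywise error rate) guarantee it provides.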

Clique-based semantic kernel with application to semantic relatedness

In this paper, a novel semantic kernel is proposed which is capable of incorporating the relatedness between conceptual features. The kernel leverages clique theory to map data objects into a novel feature space in which complex data objects become comparable. The proposed kernel is relevant to any application that has prior knowledge about the relatedness between features. We concentrate on representing text documents and words using Wikipedia and WordNet, respectively.

SAP Redefines Analytics in the Cloud

SAP unveiled the SAP Cloud for Analytics solution, a planned software-as-a-service (SaaS) offering that aims to bring all analytics capabilities into one solution for an unparalleled user experience (UX). Built natively on SAP HANA Cloud Platform, this high-performing, real-time solution is planned to be embedded in existing SAP solutions and to connect to cloud and on-premise data, delivering planning, predictive and business intelligence (BI) capabilities in one analytics experience. The intent is for organizations to use this one solution to enable their employees to track performance, analyze trends, predict and collaborate to make informed decisions and improve business outcomes.

Basic Forecasting

Forecasting refers to the process of using statistical procedures to predict future values of a time series based on historical trends. For businesses, being able to gauge expected outcomes for a given time period is essential for managing marketing, planning, and finances. For example, an advertising agency may want to use sales forecasts to identify which future months may require increased marketing expenditures. Companies may also use forecasts to identify which salespeople met their expected targets for a fiscal quarter.
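
The post's own examples and tools are not reproduced here; as a minimal sketch using base R's built-in AirPassengers series (an assumption, chosen only for illustration), a seasonal Holt-Winters model can produce the kind of monthly forecast described above:

# Minimal sketch with base R; data and method are illustrative assumptions
fit <- HoltWinters(AirPassengers)          # exponential smoothing with trend and seasonality
fc  <- predict(fit, n.ahead = 12,          # forecast the next 12 months
               prediction.interval = TRUE)
plot(fit, fc)                              # observed series, fitted values and forecast bands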

Book: Mastering Data Analysis with R

So this is not a reference book; it does not even include a single formal mathematical formula. Instead, it provides a practical introduction, many references, and hands-on examples.

Emoticons decoder for social media sentiment analysis in R

If you have ever retrieved data from Twitter, Facebook or Instagram with R, you might have noticed a strange phenomenon: while R seems able to display some emoticons properly, many others it does not, making any further analysis impossible unless you get rid of them. With a little hack, I decoded these emoticons and put them all in a dictionary for further use. I’ll explain how I did it and share the decoder with you.
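
The post's own decoder is not reproduced here, but a common hack of this kind (an assumption, not necessarily the author's exact approach) is to convert emoticons to their byte representation with iconv() and then match those byte codes against a dictionary:

# Assumption: a common approach, not necessarily the post's exact hack
tweets  <- c("great day \U0001F600", "so tired \U0001F634")
decoded <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "byte")
decoded
# e.g. "great day <f0><9f><98><80>" -- the byte code <f0><9f><98><80> can now be
# looked up in a dictionary that maps byte sequences to emoticon descriptions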

Reach for your Matlab data with R

If you work with both Matlab and R, the R.matlab package maintained by Henrik Bengtsson on CRAN helps you to connect the two environments by allowing you to read and write Matlab’s MAT data file format from R (even if you don’t have Matlab installed). This allows you to pass data between Matlab and R via the filesystem.
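
For instance (the file name and variable names below are hypothetical), readMat() and writeMat() from R.matlab handle the round trip:

# Minimal sketch; "data.mat" and the variable names are hypothetical
library(R.matlab)

# Write two R objects into a MAT file that Matlab could load with load('data.mat')
writeMat("data.mat", x = matrix(1:6, nrow = 2), y = c(1.5, 2.5))

# Read the MAT file back into R as a named list
m <- readMat("data.mat")
str(m)   # m$x and m$y contain the stored values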