New Survey Reveals Businesses Are Bullish on Data Lakes

The data lake has long served as a powerful tool for data scientists and data engineers. However, today's business environment often requires that users without coding or scripting skills access the data stored in lakes. The survey results highlight how data lakes support more casual business users, enabling point-and-click visual reporting, analysis, and BI. Though some bemoan the idea that data lakes are turning into data swamps, this survey underscores that businesses are bullish about data lakes and that some have already reaped the benefits of modern lake technologies. In fact, nearly 66 percent of respondents felt that "business users can explore data (e.g., filter, drill) to get the views they want." However, for all the praise data lakes receive, there are still gaps in how many casual users have access to the information and what they are able to do with it.


p-Hacking and False Discovery in A/B Testing

We investigate whether online A/B experimenters ‘p-hack’ by stopping their experiments based on the p-value of the treatment effect. Our data contains 2,101 commercial experiments in which experimenters can track the magnitude and significance level of the effect every day of the experiment. We use a regression discontinuity design to detect p-hacking, i.e., the causal effect of reaching a particular p-value on stopping behavior. Experimenters indeed p-hack, especially for positive effects. Specifically, about 57% of experimenters p-hack when the experiment reaches 90% confidence. Furthermore, approximately 70% of the effects are truly null, and p-hacking increases the false discovery rate (FDR) from 33% to 42% among experiments p-hacked at 90% confidence. Assuming that false discoveries cause experimenters to stop exploring for more effective treatments, we estimate the expected cost of a false discovery to be a loss of 1.95% in lift, which corresponds to the 76th percentile of observed lifts.
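
To make the stopping mechanism concrete, here is a small illustrative R simulation (not from the paper; all parameters are arbitrary assumptions): running A/A experiments with no true effect, peeking at the p-value every "day", and stopping at 90% confidence inflates the false-positive rate well above the nominal 10%.

    set.seed(1)

    # Simulate A/A experiments (true effect = 0) with daily peeking:
    # stop as soon as a two-sample t-test reaches 90% confidence.
    peek_fpr <- function(n_exp = 500, days = 20, n_per_day = 50) {
      stopped <- logical(n_exp)
      for (e in seq_len(n_exp)) {
        a <- c(); b <- c()
        for (d in seq_len(days)) {
          a <- c(a, rnorm(n_per_day)); b <- c(b, rnorm(n_per_day))
          if (t.test(a, b)$p.value < 0.10) { stopped[e] <- TRUE; break }
        }
      }
      mean(stopped)   # realized false-positive rate under optional stopping
    }

    peek_fpr()   # typically far above the nominal 0.10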


Why Data Analytics is Heavy on Data Engineering?

While many companies have embarked on data analytics initiatives, only a few have succeeded. Studies have shown that over 70% of data analytics programs fail to realize their full potential and over 80% of digital transformation initiatives fail. While many factors affect the successful deployment of data analytics, one fundamental reason is the lack of good-quality data. Many business enterprises realize this and invest considerable time and effort in data cleansing and remediation, technically known as data engineering. It is estimated that about 60 to 70% of the effort in data analytics goes into data engineering. Given that data quality is an essential requirement for analytics, there are five key reasons why data analytics is heavy on data engineering:
1. Different systems and technology mechanisms to integrate data
2. Different time frames of data capture
3. Different user value-propositions
4. Different business processes
5. Different aggregations driven by organizational structures


Hierarchical Clustering in R

Clustering is the most common form of unsupervised learning, a type of machine learning algorithm used to draw inferences from unlabeled data. In this tutorial, you will learn to perform hierarchical clustering on a dataset in R.
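
As a quick preview of what the tutorial covers, here is a minimal base-R sketch on a built-in dataset (my own illustration, not the tutorial's code):

    # Hierarchical (agglomerative) clustering on a built-in dataset
    d  <- dist(scale(USArrests))          # Euclidean distances on scaled data
    hc <- hclust(d, method = "complete")  # complete-linkage clustering

    plot(hc, cex = 0.6)                   # dendrogram
    groups <- cutree(hc, k = 4)           # cut the tree into 4 clusters
    table(groups)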


Python List Index()

A data structure is a way to organize and store data that enables efficient access and modification. Lists are a type of data structure that stores a collection of heterogeneous items; they are built into Python. Python also provides many functions and methods for working with lists. In this tutorial, you will learn exclusively about the index() method, which searches for an element in a list and returns its position (index). First, this tutorial will introduce you to lists, and then you will see some simple examples of working with the index() method. Stick around till the very end to learn some additional cool stuff…


Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)

Given the number of people interested in my first post on visualizing classification model results, I've decided to create and share a new function to visualize and compare whole linear regression models with one line of code. These plots will reduce the time we invest in model selection and give us a general understanding of our results.
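
The post shares the author's own plotting function; as a rough idea of the kind of summary such a plot condenses, here is a minimal real-vs-predicted sketch with ggplot2 (an illustration, not the author's function):

    library(ggplot2)

    # Fit a simple linear model and plot real vs. predicted values,
    # a core panel in most regression-results summaries (illustrative only).
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    df  <- data.frame(real = mtcars$mpg, predicted = fitted(fit))

    ggplot(df, aes(real, predicted)) +
      geom_point() +
      geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
      labs(title = "Real vs. predicted",
           subtitle = sprintf("R^2 = %.3f", summary(fit)$r.squared))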


Genetic Algorithm Implementation in Python

This tutorial will implement the genetic algorithm optimization technique in Python based on a simple example in which we are trying to maximize the output of an equation. The tutorial uses a decimal representation for genes, one-point crossover, and uniform mutation.
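
The tutorial's implementation is in Python; for consistency with the other examples in this digest, here is a hedged R sketch of the same recipe (decimal genes, one-point crossover, uniform mutation) on an assumed example equation, not the tutorial's actual code:

    set.seed(42)

    # Equation to maximize: f(x) = sum(w * x), real-valued ("decimal") genes.
    w <- c(4, -2, 3.5, 5, -11, -4.7)        # example weights (an assumption)
    n_genes <- length(w)
    fitness <- function(pop) as.numeric(pop %*% w)

    pop <- matrix(runif(8 * n_genes, -4, 4), nrow = 8)  # initial population

    for (gen in 1:100) {
      parents <- pop[order(fitness(pop), decreasing = TRUE)[1:4], ]  # selection
      # one-point crossover between consecutive parents
      kids <- t(sapply(1:4, function(i) {
        cut <- sample(n_genes - 1, 1)
        c(parents[i, 1:cut], parents[i %% 4 + 1, (cut + 1):n_genes])
      }))
      # uniform mutation: perturb one randomly chosen gene per offspring
      idx <- cbind(1:4, sample(n_genes, 4, replace = TRUE))
      kids[idx] <- kids[idx] + runif(4, -1, 1)
      pop <- rbind(parents, kids)
    }
    max(fitness(pop))   # best objective value found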


Interactive Tutorial on Dirichlet Processes Using R Shiny

My advisor and his collaborator are teaching a short course on Bayesian Nonparametric Methods for Causal Inference at JSM next week. As part of the short course, I made an interactive tutorial on Dirichlet processes using R Shiny. All underlying code is hosted on GitHub, and you can also run the app locally by running the code below. The local version may be faster if there are too many users on the web version.
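
The snippet itself is not reproduced in this digest; a local launch typically looks like the following, where the repository and username are placeholders for the actual GitHub coordinates:

    library(shiny)

    # Placeholder repo/user: substitute the actual GitHub coordinates
    # of the tutorial's code.
    runGitHub(repo = "dp-shiny-tutorial", username = "some-user")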


Boost the speed of R calls from Rcpp

If you work with Rcpp-based packages, or maintain one yourself, you may be interested in the recent development of the unwind API, which can be leveraged to boost performance as of the latest Rcpp update. In a nutshell, until R 3.5.0, every R call from C++ code was executed inside a try-catch, to avoid breaking things apart, which is really slow. From R 3.5.0 on, the unwind API provides a new, safe, and fast evaluation path for such calls.
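
As a minimal sketch of what this enables, the following assumes R >= 3.5.0 and an Rcpp recent enough to ship the unwindProtect plugin (which opts into the unwind API by defining RCPP_USE_UNWIND_PROTECT); the function name and body are my own illustration:

    library(Rcpp)

    # A C++ function that calls back into R on every iteration; with the
    # unwind API enabled, each call avoids the slow try-catch wrapper.
    cppFunction(plugins = "unwindProtect", '
      double call_r_many(Function f, double x, int n) {
        double acc = 0;
        for (int i = 0; i < n; ++i)
          acc += as<double>(f(x));   // an R call from C++
        return acc;
      }
    ')

    call_r_many(sqrt, 2, 1e5)  # benchmark with/without the plugin to compare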


How to Access Any RESTful API Using the R Language

R is an excellent language for data analytics, but it's uncommon to use it for serious development. This means that popular APIs don't offer software development kits (SDKs) or how-to guides for analysts working in R the way they do for more popular languages like Python or Objective-C (for Apple's iOS). This is a how-to guide for connecting to an API to receive stock prices as a data frame when the API doesn't have a specific package for R. For those of you not familiar with R, a data frame is like a spreadsheet, with data arranged in rows and columns. You can then use these same techniques to pull data into R from other APIs.
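
A minimal version of the pattern looks like this, using the httr and jsonlite packages; the endpoint, query fields, and key below are placeholders, not a real stock API:

    library(httr)      # HTTP verbs for R
    library(jsonlite)  # JSON <-> data frame conversion

    # Hypothetical REST endpoint; substitute the API you actually use.
    resp <- GET("https://api.example.com/v1/quotes",
                query = list(symbol = "AAPL", apikey = "YOUR_KEY"))
    stop_for_status(resp)   # fail loudly on HTTP errors

    # Parse the JSON body; fromJSON() simplifies records into a data frame
    quotes <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    head(quotes)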