Echoes of the Future

This report discusses ways to combine graphics output from the ‘graphics’ package and the ‘grid’ package in R and introduces a new function echoGrob in the ‘gridGraphics’ package.


6 data analytics trends that will dominate 2018

As businesses transform into data-driven enterprises, data technologies and strategies need to start delivering value. Here are six data analytics trends to watch in the months ahead.
• Data lakes will need to demonstrate business value or die
• The CDO will come of age
• Rise of the data curator
• Data governance strategies will be key themes for all C-level executives
• The proliferation of metadata management continues
• Predictive analytics helps improve data quality


The end of errors in ANOVA reporting

Psychology still (unfortunately) relies massively on analysis of variance (ANOVA). Despite its relative simplicity, I am very often confronted with errors in its reporting, for instance in students' theses or manuscripts. Beyond incomplete, incomprehensible or just plain wrong reporting, one can find a tremendous number of genuine errors (that could influence the results and their interpretation), even in published papers! (See the excellent statcheck to quickly check the stats of a paper.) This error proneness can be at least partially explained by the fact that copy-pasting the (appropriate) values out of any statistical software and formatting them as text is a very tedious process. How can we end it?
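
The excerpt stops at the question, but the underlying idea is language-agnostic: generate the report string directly from the fitted model instead of copy-pasting numbers by hand. Purely as a hypothetical illustration (not the post's own solution), here is a minimal Python sketch with statsmodels; the data and variable names are made up.

```python
# Sketch: build the ANOVA report string from the fitted model itself,
# so the reported numbers cannot drift from the analysis. Toy data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": np.repeat(["control", "treatment"], 30),
    "score": np.concatenate([rng.normal(100, 15, 30), rng.normal(108, 15, 30)]),
})

model = smf.ols("score ~ group", data=df).fit()
table = anova_lm(model, typ=2)  # one row per term, plus the residual row

f_val = table.loc["group", "F"]
p_val = table.loc["group", "PR(>F)"]
df1 = int(table.loc["group", "df"])
df2 = int(table.loc["Residual", "df"])

# The APA-style sentence is generated, not typed.
print(f"F({df1}, {df2}) = {f_val:.2f}, p = {p_val:.3f}")
```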


Top 8 Sites To Make Money By Uploading Files

Pay-to-upload sites have become quite popular these days and are indeed an easy way to make a few bucks online. It's one of those ways to make money online that doesn't require much effort on your part. Each time you upload your files to their servers and someone downloads them, you get paid a certain amount.


Amazon Alexa and Accented English

Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece was published today in the Washington Post and turned out to be really interesting! Let's walk through the analysis I did for Dylan and Pulse Labs.


Explaining Black-Box Machine Learning Models – Code Part 1: tabular data + caret + iml

This is code that will accompany an article that will appear in a special edition of a German IT magazine. The article is about explaining black-box machine learning models. In that article I'm showcasing three practical examples:
1. Explaining supervised classification models built on tabular data using caret and the iml package
2. Explaining image classification models with keras and lime
3. Explaining text classification models with lime (see the sketch below)
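
The article's own code uses R packages (caret, iml, keras, lime). Purely to illustrate the idea behind example 3, here is a hedged Python sketch with the lime package and a toy scikit-learn model standing in for the real one; the dataset, model, and parameters are stand-ins, not the article's.

```python
# Illustration of LIME on a text classifier (toy stand-in model, not the article's code).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

categories = ["sci.med", "sci.space"]
train = fetch_20newsgroups(subset="train", categories=categories)

# Black-box model: TF-IDF features + logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train.data, train.target)

explainer = LimeTextExplainer(class_names=categories)
explanation = explainer.explain_instance(
    train.data[0],            # the document to explain
    model.predict_proba,      # any function: list of texts -> class probabilities
    num_features=6,           # report the six most influential words
)
print(explanation.as_list())  # [(word, weight), ...] for this single prediction
```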


Benchmarking Feature Selection Algorithms with Xy()

Feature selection is, in my opinion, one of the most interesting fields in machine learning. It sits at the boundary between two different perspectives on machine learning – performance and inference. From a performance point of view, feature selection is typically used to increase model performance or to reduce the complexity of the problem in order to improve computational efficiency. From an inference perspective, it is important to extract variable importance in order to identify the key drivers of a problem. Many people argue that in the era of deep learning feature selection is no longer important. As a form of representation learning, deep learning models can find important features of the input data on their own; those features are essentially nonlinear transformations of the input data space. However, not every problem is suited to neural nets (in fact, many are not). In many practical ML applications feature selection plays a key role on the road to success.
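
As a generic, hypothetical illustration of such a benchmark (not the Xy() API, which the post itself covers), here is a hedged Python sketch: simulate data where only a few features are truly informative, then check whether a selector recovers them.

```python
# Generic feature-selection check on simulated data (illustrative stand-in for the
# Xy()-based benchmark): only 5 of 20 features actually drive the response.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

# Recursive feature elimination keeps the requested number of features.
selector = RFE(LinearRegression(), n_features_to_select=5)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)  # ideally the 5 informative columns
```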


Causation in a Nutshell

Knowing the who, what, when, where, etc., is vital in marketing. Predictive analytics can also be useful for many organizations. However, also knowing the why helps us better understand the who, what, when, where, and so on, and the ways they are tied together. It also helps us predict them more accurately. Knowing the why increases their value to marketers and increases the value of marketing. Analysis of causation can be challenging, though, and there are differences of opinion among authorities. The statistical orthodoxy is that randomized experiments are the best approach. Experiments in many cases are infeasible or unethical, however. They also can be botched or be so artificial that they do not generalize to real world conditions. They may also fail to replicate. They are not magic.


Autoencoder as a Classifier using Fashion-MNIST Dataset

In this tutorial, you will learn and understand how to use an autoencoder as a classifier in Python with Keras. You'll be using the Fashion-MNIST dataset as an example.
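
The tutorial's full architecture and training setup live in the post itself; below is only a minimal, hedged sketch of the two-stage idea in Keras (layer sizes, epochs and other hyperparameters are arbitrary choices, not the tutorial's).

```python
# Minimal sketch: (1) train an autoencoder on Fashion-MNIST, (2) reuse its encoder
# as the feature extractor for a classifier. Hyperparameters are illustrative only.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Stage 1: unsupervised autoencoder (reconstruct the input image).
inputs = keras.Input(shape=(784,))
encoded = layers.Dense(64, activation="relu")(inputs)        # bottleneck features
decoded = layers.Dense(784, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256, validation_split=0.1)

# Stage 2: put a softmax head on the trained encoder (its weights are fine-tuned further).
clf_out = layers.Dense(10, activation="softmax")(encoded)
classifier = keras.Model(inputs, clf_out)
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(x_train, y_train, epochs=5, batch_size=256, validation_split=0.1)
print(classifier.evaluate(x_test, y_test))
```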


Receiver Operating Characteristic Curves Demystified (in Python)

In data science, evaluating model performance is very important, and the most commonly used performance metric is the classification score. However, when dealing with fraud datasets with heavy class imbalance, a classification score does not make much sense. Instead, Receiver Operating Characteristic (ROC) curves offer a better alternative. An ROC curve is a plot of signal (True Positive Rate) against noise (False Positive Rate). Model performance is determined by looking at the area under the ROC curve (AUC). The best possible AUC is 1, while the worst is 0.5 (the 45-degree random line). Any value less than 0.5 means we can simply do the exact opposite of what the model recommends to get the value back above 0.5. While ROC curves are common, there aren't many pedagogical resources out there explaining how they are calculated or derived. In this blog, I will show, step by step, how to plot an ROC curve using Python. After that, I will explain the characteristics of a basic ROC curve.
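
Since the post walks through the calculation step by step, here is a compact, hedged sketch of the same idea on toy data: sweep a decision threshold over the predicted scores, record the (FPR, TPR) pair at each step, and plot the resulting curve. The scores below are simulated, not from a real model.

```python
# ROC curve "by hand" on simulated scores: positives tend to score higher than negatives.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(100), np.zeros(900)])                            # imbalanced labels
scores = np.concatenate([rng.normal(0.7, 0.15, 100), rng.normal(0.4, 0.15, 900)])

# Sweep thresholds from strictest (nothing predicted positive) to loosest.
thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
tpr, fpr = [], []
for t in thresholds:
    pred = scores >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tpr.append(tp / np.sum(y_true == 1))   # signal: True Positive Rate
    fpr.append(fp / np.sum(y_true == 0))   # noise: False Positive Rate

tpr, fpr = np.array(tpr), np.array(fpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal area under the curve

plt.plot(fpr, tpr, label=f"ROC (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```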


Opportunity: data lakes offer a ‘360-degree view’ to an organisation

Data lakes provide a solution for businesses looking to harness the power of data. Stuart Wells, executive vice president, chief product and technology officer at FICO, discusses with Information Age how approaching data in this way can lead to better business decisions.


Using the AWS Glue Data Catalog as the Metastore for Hive

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. AWS Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog. For more information about the Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.
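
In practice (per my reading of the Amazon EMR documentation; verify against the guide before relying on it), pointing Hive on EMR at the Glue Data Catalog comes down to a hive-site configuration classification supplied when the cluster is created:

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```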