Extending the Value of Hadoop and Spark with a Cloud-based Managed Service

With the growth of data and the focus on becoming more data-driven, many organizations have turned to Apache Hadoop and Apache Spark as their big data and analytics framework to store and process their data. While these solutions are powerful, managing a constantly evolving infrastructure that must keep meeting the demands of a modern business can quickly become a logistical and administrative nightmare. Because many organizations plan to adopt cloud technology, and data analytics is one of the top use cases for existing IaaS and PaaS users, a Hadoop- and Spark-based managed service in the cloud is quite appealing. Last year we conducted a rigorous economic validation of Google BigQuery, modeling organizations of different sizes and estimating their costs over a three-year period if they used on-premises Hadoop, another cloud service, or BigQuery. We recently completed another economic validation, this time focused on Google Cloud Dataproc.

Cloudera Enterprise Data Hub on Oracle Cloud Infrastructure

Oracle and Cloudera Provide the Ultimate Big Data Platform. The Cloudera and Oracle partnership allows customers to deploy comprehensive data strategies, from business operations to data warehousing, data science, data engineering, streaming, and real-time analytics, all on a unified enterprise cloud platform. Cloudera Enterprise Data Hub brings together the best big data technologies from the Hadoop ecosystem, and adds consistent security, granular governance, and full support. Oracle Cloud Infrastructure adds unmatched performance, security, and availability, as well as the ability to run on the same private networks as Oracle databases, Exadata, and back-office applications, for easy data sharing and operational analytics.

5 Ways to Detect Outliers/Anomalies That Every Data Scientist Should Know (Python Code)

Detecting anomalies is critical to any business, whether to identify faults or to act proactively. This article discusses five different ways to identify those anomalies.

AutoML for Data Augmentation

DeepAugment is an AutoML tool focusing on data augmentation. It utilizes Bayesian optimization for discovering data augmentation strategies tailored to your image dataset. The main benefits and features of DeepAugment are:
• Reduces the error rate of CNN models (showed a 60% decrease in error for CIFAR10 on WRN-28-10)
• Saves time by automating the process
• 50 times faster than Google’s previous solution, AutoAugment

Explainable AI or Halting Faulty Models ahead of Disaster

A brief overview of a new method for explainable AI (XAI) called anchors, an introduction to its open-source implementation, and a demonstration of how to use it to explain models predicting the survival of Titanic passengers.

Inverse Statistics – and how to create Gain-Loss Asymmetry plots in R

Asset returns have certain statistical properties, also called stylized facts. Important ones are:
• Absence of autocorrelation: basically the direction of the return of one day doesn’t tell you anything useful about the direction of the next day.
• Fat tails: returns are not normal, i.e. there are many more extreme events than there would be if returns were normal.
• Volatility clustering: basically financial markets exhibit high-volatility and low-volatility regimes.
• Leverage effect: high-volatility regimes tend to coincide with falling prices and vice versa.
A good introduction and overview can be found in R. Cont: Empirical properties of asset returns: stylized facts and statistical issues.
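
As a rough illustration of the first three stylized facts, the sketch below checks them on simulated data. The two-regime volatility model, the function names, and every parameter value are assumptions chosen for this example; they are not taken from the article or from Cont's paper, and the leverage effect is not modelled at all.

```python
import random


def autocorr(xs, lag=1):
    """Sample autocorrelation of xs at the given lag."""
    mean = sum(xs) / len(xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(len(xs) - lag))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


def excess_kurtosis(xs):
    """Sample excess kurtosis: close to 0 for normal data, positive for fat tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0


def simulate_returns(n=5000, seed=0, low_vol=0.005, high_vol=0.03, switch_prob=0.02):
    """Toy daily returns (hypothetical model): zero-mean Gaussian noise whose
    volatility occasionally switches between a calm and a turbulent regime."""
    rng = random.Random(seed)
    turbulent = False
    out = []
    for _ in range(n):
        if rng.random() < switch_prob:
            turbulent = not turbulent
        out.append(rng.gauss(0.0, high_vol if turbulent else low_vol))
    return out


returns = simulate_returns()
# Absence of autocorrelation: raw returns should be close to uncorrelated.
print("lag-1 autocorrelation of returns:  ", autocorr(returns))
# Volatility clustering: absolute returns should be clearly autocorrelated.
print("lag-1 autocorrelation of |returns|:", autocorr([abs(r) for r in returns]))
# Fat tails: excess kurtosis should be positive.
print("excess kurtosis:                   ", excess_kurtosis(returns))
```

On real price data the same three numbers can be computed from daily log returns in place of the simulated series.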

Text Summarization using Deep Learning

With the rise of the internet, we now have information readily available to us. We are literally bombarded with it from many sources: news, social media, and office emails, to name a few. If only someone could summarize the most important information for us! Deep Learning is getting there. Through the latest advances in sequence-to-sequence models, we can now develop good text summarization models.

Gentle introduction to Echo State Networks

This post will address the following questions:
1. What are Echo State Networks?
2. Why and when should you use an Echo State Network?
3. Simple implementation example in Python.

Significance of ACF and PACF Plots In Time Series Analysis

This article is for folks who want to know the intuition behind determining the order of auto-regressive (AR) and moving average (MA) series using ACF and PACF plots. Most of us know how to use ACF and PACF plots to obtain the values of p and q to feed into the AR-I-MA model, but we lack the intuition behind why we use PACF and ACF to obtain p and q respectively and not the other way around.
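
The intuition can be checked numerically. The sketch below is a minimal illustration, assuming simulated AR(1) and MA(1) series with a coefficient of 0.7 (values picked for this example, not taken from the article). It computes the sample ACF directly and the PACF with the Durbin-Levinson recursion. For the AR(1) series the PACF should drop to roughly zero after lag 1 while the ACF decays gradually, which is why the PACF is read for p; for the MA(1) series the roles are reversed, which is why the ACF is read for q.

```python
import random


def acf(xs, max_lag):
    """Sample autocorrelations rho[0..max_lag], with rho[0] == 1."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    den = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + k] for i in range(n - k)) / den
            for k in range(max_lag + 1)]


def pacf(rho, max_lag):
    """Partial autocorrelations from autocorrelations via Durbin-Levinson."""
    prev = []    # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    out = [1.0]  # lag 0 by convention
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


def simulate_ar1(phi, n, seed=0):
    """x_t = phi * x_{t-1} + e_t with standard normal noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def simulate_ma1(theta, n, seed=0):
    """x_t = e_t + theta * e_{t-1} with standard normal noise."""
    rng = random.Random(seed)
    prev_e, out = rng.gauss(0.0, 1.0), []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        out.append(e + theta * prev_e)
        prev_e = e
    return out


for name, series in [("AR(1)", simulate_ar1(0.7, 5000)),
                     ("MA(1)", simulate_ma1(0.7, 5000))]:
    rho = acf(series, 4)
    print(name, "ACF: ", [round(r, 2) for r in rho[1:]])
    print(name, "PACF:", [round(p, 2) for p in pacf(rho, 4)[1:]])
```

In practice a library routine would normally be used for these plots; the hand-written recursion is shown only to make the calculation visible.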

Tuning a Multi-Task Fate Grand Order Trained Pytorch Network

In a previous post I did some multi-task learning in Keras (here), and after finishing that one I wanted to do a follow-up post on multi-task learning in Pytorch. This was mostly because I thought it would be a good exercise for me to build it in another framework. In this post, however, I will also go through some extra tuning I did after building the model, which I didn’t cover when I built the Keras-based model.

Pruned Cross Validation

There are several reasons why you would like to use cross-validation: it helps you to assess the quality of the model, optimize its hyperparameters and test various architectures. There are also a variable number of reasons why you wouldn’t like it: this variable is the number of folds. For each fold, the model has to be trained. With computationally intensive algorithms and big datasets, training each fold may be a cumbersome endeavor.
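
To make the cost concrete, the sketch below shows a k-fold loop in which every fold triggers one call to a scoring function, plus one possible pruning rule. The rule (abandon a candidate when its mean score after a few probe folds falls below the best mean seen so far) and all names such as `pruned_cv`, `score_fold`, and `probe_folds` are assumptions made for this illustration; the article's actual pruning criterion may differ. Higher scores are assumed to be better.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def pruned_cv(score_fold, n, k, best_so_far, probe_folds=2, tolerance=0.0):
    """Cross-validate one candidate and return (mean_score, folds_used).

    score_fold(train_idx, val_idx) is a hypothetical callback that trains the
    model on train_idx and returns its validation score on val_idx.
    Evaluation stops after probe_folds if the running mean is already below
    best_so_far - tolerance, which saves the remaining training runs.
    """
    scores = []
    for i, (train, val) in enumerate(k_fold_indices(n, k), start=1):
        scores.append(score_fold(train, val))  # one full training run per fold
        mean = sum(scores) / len(scores)
        if i == probe_folds and mean < best_so_far - tolerance:
            return mean, i
    return sum(scores) / len(scores), len(scores)


# Toy usage: a stand-in scorer that always returns 0.5.
print(pruned_cv(lambda train, val: 0.5, n=100, k=5, best_so_far=0.9))
print(pruned_cv(lambda train, val: 0.5, n=100, k=5, best_so_far=0.4))
```

The trade-off in a rule like this is that a candidate with unlucky early folds can be discarded prematurely; the `tolerance` argument is one way to soften that.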

Hamiltonian Methods for Data Clustering

Imagine you finally land a data science job, and it entails manually checking, labelling and classifying every new datum added to the dataset. Such a job would be dull and tedious! Furthermore, with the volume of data harvested today exceeding the volume of water in the Atlantic Ocean, the task would be impossible even for an army of data scientists, let alone for one person. The solution? You might have heard of data clustering. It is an automated process of sorting data into groups, or clusters. A cluster can be thought of as a collection of data points that are similar to one another and dissimilar to objects in other clusters. In this article I will review, in a simplified way, a type of clustering algorithm that relies on Hamiltonian dynamics. The idea of using Hamiltonian dynamics for data clustering was presented in (Casagrande, Sassano, & Astolfi, 2012). In the next section, I will give a very simplified explanation of Hamiltonian dynamics and the underlying idea behind this type of clustering.

Why Should I Care About Understanding My Model?

Non-parametric machine learning (ML) models (e.g. Random Forests, Neural Networks) are highly flexible but complex models that can attain significantly higher accuracy than parametric models such as regression-based methods (e.g. logistic, linear, polynomial). They can also be easier to use and more robust, leaving less room for improper use and misunderstanding. But these advantages have a cost. Compared to their parametric and often linear cousins, these models do not produce predictions that can be explained, and their structure cannot be directly visualized, i.e. they are not interpretable.

Market Basket Analysis with recommenderlab

Recently I wanted to learn something new and challenged myself to carry out an end-to-end Market Basket Analysis. To continue to challenge myself, I’ve decided to put the results of my efforts before the eyes of the data science community.