Data Science tricks: Simple anomaly detection for metrics with a weekly pattern

A recurring problem for engineers is building an alarm system that gives early warning when things start to go wrong. For example, they may need to monitor CPU or memory usage for a set of virtual machines, or user-behaviour signals. Things may go wrong for a variety of reasons, such as bugs introduced in new code releases or random hardware failures. Simple heuristics can cover the most common cases. However, there are times when simple heuristics are inadequate, even though the metric looks “obviously” problematic to an experienced engineer. This confidence comes from the fact that most aggregated metrics are quite well-behaved and follow predictable patterns.
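As a rough illustration of exploiting such a predictable pattern, the sketch below compares each observation with the values seen at the same point in earlier weeks. It is not taken from the linked article: the function name `weekly_anomalies`, the median/MAD scoring, and the parameters `history_weeks`, `threshold` and `min_scale` are all assumptions chosen for the example.

```python
from statistics import median


def weekly_anomalies(values, period=7, history_weeks=4, threshold=3.5, min_scale=1.0):
    """Return indices whose value deviates strongly from the same weekday in earlier weeks.

    All parameter defaults are illustrative assumptions, not recommendations.
    """
    anomalies = []
    for i in range(period * history_weeks, len(values)):
        # Values observed at the same position in each of the previous weeks.
        history = [values[i - k * period] for k in range(1, history_weeks + 1)]
        center = median(history)
        # Median absolute deviation: a spread estimate that tolerates outliers.
        mad = median(abs(v - center) for v in history)
        # Floor the spread (in the metric's own units) so a flat history
        # does not turn tiny wobbles into alarms.
        scale = max(mad, min_scale)
        if abs(values[i] - center) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Six weeks of a daily metric with quiet weekends, plus one injected spike.
pattern = [100, 120, 130, 125, 110, 60, 50]
series = pattern * 6
series[37] = 300
print(weekly_anomalies(series))  # [37]
```

The low weekend values are not flagged, because they are only compared with earlier weekends; a fixed global threshold would struggle to make that distinction.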


Prescriptive versus Predictive Analytics – A Distinction without a Difference

Is the addition of “Prescriptive” analytics to our nomenclature really worthwhile or are we just confusing our customers?


Understanding the Basics of Supply Chain Analytics

Today’s supply chains move millions of shipments around the world each year, but just think for a moment about the information required to ensure these shipments get from A to B safely and on time. The information flows that support today’s global supply chains, primarily based on EDI/B2B transactions, are growing in volume year on year.


Understanding Bayes: How to become a Bayesian in eight easy steps

It can be hard to know where to start when you want to learn about Bayesian statistics. I am frequently asked to share my favorite introductory resources on Bayesian statistics, and my go-to answer has been to share a Dropbox folder with a bunch of PDFs that aren’t really sorted or cohesive. In some sense I was acting as little more than a glorified Google Scholar search bar.


Auto-scaling scikit-learn with Spark

Data scientists often spend hours or days tuning models to get the highest accuracy. This tuning typically involves running a large number of independent Machine Learning (ML) tasks coded in Python or R. Following some work presented at Spark Summit Europe 2015, we are excited to release a library that dramatically simplifies the life of data scientists using Python. This library, published as spark-sklearn, automatically distributes the most repetitive tasks of model tuning on a Spark cluster, without impacting the workflow of data scientists:
• When used on a single machine, Spark can be used as a substitute for the default multithreading framework used by scikit-learn (Joblib).
• If the work needs to be spread across multiple machines, no code change is required between the single-machine case and the cluster case.
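The reason this kind of tuning distributes so easily is that every parameter combination can be scored independently. The sketch below illustrates the idea with Python's standard library only; it does not use spark-sklearn or its API. The `evaluate` function is a hypothetical stand-in for fitting and scoring a model, and the parameter names `depth` and `rate` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for 'fit a model and return its validation score'."""
    depth, rate = params
    # Toy score that peaks at depth=4, rate=0.1; a real task would train a model.
    return -((depth - 4) ** 2) - (rate - 0.1) ** 2


def grid_search(grid, executor):
    """Score every parameter combination and return the best (params, score)."""
    candidates = list(product(*grid.values()))
    # Each candidate is independent, so the executor is free to run them
    # on any worker it manages.
    scores = list(executor.map(evaluate, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return dict(zip(grid.keys(), candidates[best])), scores[best]


grid = {"depth": [2, 4, 8], "rate": [0.01, 0.1, 1.0]}
with ThreadPoolExecutor(max_workers=4) as pool:
    best_params, best_score = grid_search(grid, pool)
print(best_params)
```

Because `grid_search` only depends on the executor's `map` method, swapping the local thread pool for a different backend leaves the search code untouched, which mirrors the "no change in the code" property described above.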


Data Exploration with Kaggle – R Tutorial

Ever wonder where to begin your data analysis? Exploratory Data Analysis (EDA) is often the best starting point. Take the new hands-on course from Kaggle & DataCamp “Data Exploration with Kaggle Scripts” to learn the essentials of Data Exploration and begin navigating the world of data. By the end of the course you will learn how to apply various R packages and tools in combination in order to extract all of their usefulness for exploring your data. Furthermore, you will also be guided through the process of submitting your first Kaggle Script to your profile, and will publish analysis on Kaggle Scripts that you’ve personalized with information from your own life. (Tip: make sure to share your profile link with hiring managers and peers to easily show off and discuss your work.)


Neglected optimization topic: set diversity

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.


Tutorial: Credit Card Fraud Detection with SQL Server 2016 R Services

If you have a database of credit-card transactions with a small percentage tagged as fraudulent, how can you create a process that automatically flags likely fraudulent transactions in the future? That’s the premise behind the latest Data Science Deep Dive on MSDN. This tutorial provides a step-by-step guide to using the R language and the big-data statistical models of the RevoScaleR package of SQL Server 2016 R Services to build and use a predictive model to detect fraud.


New Release of RStudio (v0.99.878)

We’re pleased to announce that a new release of RStudio (v0.99.878) is available for download now. Highlights of this release include:
• Support for registering custom RStudio Addins.
• R Markdown editing improvements including outline view and inline UI for chunk execution.
• Support for multiple source windows (tear editor tabs off main window).
• Pane zooming for working distraction free within a single pane.
• Editor and IDE keyboard shortcuts can now be customized.
• New Emacs keybindings mode for the source editor.
• Support for parameterized R Markdown reports.
• Various improvements to RStudio Server Pro including multiple concurrent R sessions, use of multiple R versions, and shared projects for collaboration.


Using SVG graphics in blog posts

My traditional workflow for embedding R graphics into a blog post has been via PNG files that I upload online. However, when I created a ‘simple’ graphic with only basic curves and triangles for a recent post, I noticed that the PNG output didn’t look as crisp as I expected it to be. So, eventually I used an SVG (scalable vector graphic) instead. Creating an SVG file with R couldn’t be easier; e.g. use the svg() function in the same way as png(). Next, make the file available online and embed it into your page. There are many ways to do this; in the example here I placed the file into a public GitHub repository.


LDA Topic Modeling on Singapore Parliamentary Debate Records

Using the Python packages gensim and pyLDAvis to create an interactive topic model and explore how much bandwidth the parliament spent on each topic.


Market Basket Analysis With Adobe Analytics And RSiteCatalyst

Analyzing data programmatically seems to have finally taken hold in the digital analytics community. Traffic to my RSiteCatalyst documentation has skyrocketed, Jason has been on a Python kick lately (data cleaning, learning new analysis techniques) and I’m seeing other great posts from my peers, such as how to create a real-time dashboard using the Adobe Analytics API and 6 Marketing Tools [We Use] That No One Else Really Talks About. That said, in many ways the digital analytics industry is just now catching up to where the direct mail/database marketing industry was decades ago. There is plenty of chatter about personalization strategies or A/B testing on websites, but seemingly much less effort is dedicated to understanding which products are purchased together (and more importantly, what the non-obvious product combinations tell us about our customers).
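To make "which products are purchased together" concrete, here is a minimal sketch that computes the lift of each product pair from a list of baskets. It is written in plain Python rather than with RSiteCatalyst or Adobe Analytics data; the function name `pair_lift` and the sample baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(a, b): lift} for every product pair bought together at least once."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the order within pairs
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    lifts = {}
    for (a, b), together in pair_counts.items():
        # lift = P(a and b) / (P(a) * P(b)); a value above 1 means the pair is
        # bought together more often than independence would predict.
        lifts[(a, b)] = (together / n) / ((item_counts[a] / n) * (item_counts[b] / n))
    return lifts


# Made-up transactions, one list of products per order.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread", "milk"],
    ["milk", "jam"],
]
lifts = pair_lift(baskets)
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

Pairs with a lift well above 1 are the candidates worth a closer look; with real data, a minimum count per pair would usually be needed before trusting the ratio.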