Analyzing the first Presidential Debate

Much of the data we encounter daily comes as unstructured free text, so the ability to extract useful information from it can be quite valuable. In this post, we will attempt a basic analysis of the text from the first Presidential debate between Clinton and Trump.

GoodReads: Machine Learning (Part 3)

In the first installment of this series, we scraped reviews from Goodreads. In the second one, we performed exploratory data analysis and created new variables. We are now ready for the “main dish”: machine learning!

A Dramatic Tour through Python’s Data Visualization Landscape

I recently came upon Brian Granger and Jake VanderPlas’s Altair, a promising young visualization library. Altair seems well-suited to addressing Python’s ggplot envy, and its tie-in with JavaScript’s Vega-Lite grammar means that as the latter develops new functionality (e.g., tooltips and zooming), Altair benefits — seemingly for free! Indeed, I was so impressed by Altair that the original thesis of my post was going to be: “Yo, use Altair.”

Unevenly Spaced Data

As the Internet of Things (IoT) comes of age, we’re seeing more and more data from event-triggered sensors instead of sensors that record measurements at regular time intervals. These event-triggered sensors give rise to unevenly spaced time series. Many analysts immediately convert unevenly spaced data to evenly spaced time series so that it is compatible with existing sensor data analytics tools, but we have found that the conversion is usually unnecessary and sometimes even causes problems.
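
As a rough illustration of working with unevenly spaced data directly, the sketch below computes a time-weighted mean without resampling first. The function name, the sample readings, and the assumption that each reading holds until the next event arrives are all hypothetical choices for this example, not something taken from the original post.

```python
def time_weighted_mean(timestamps, values):
    """Mean of a step-wise signal in which each value holds until the next timestamp.

    Assumes timestamps are strictly increasing and given in consistent units.
    """
    if len(timestamps) != len(values) or len(timestamps) < 2:
        raise ValueError("need at least two (timestamp, value) pairs of equal length")

    weighted_sum = 0.0
    # Weight each reading by how long it stayed in effect.
    # The final reading only marks the end of the window and carries no weight.
    for i in range(len(timestamps) - 1):
        duration = timestamps[i + 1] - timestamps[i]
        weighted_sum += values[i] * duration

    return weighted_sum / (timestamps[-1] - timestamps[0])


# Hypothetical event-triggered readings: three quick events, then a long quiet gap.
timestamps = [0.0, 1.0, 2.0, 10.0]
values = [20.0, 30.0, 10.0, 10.0]

weighted = time_weighted_mean(timestamps, values)
unweighted = sum(values) / len(values)
print(weighted, unweighted)
```

The two numbers differ because the plain average treats every event equally, while the time-weighted version accounts for how long each reading was in effect.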

Deep Learning Research Review Week 1: Generative Adversarial Nets

Starting this week, I’ll be doing a new series called Deep Learning Research Review. Every couple of weeks or so, I’ll be summarizing and explaining research papers in specific subfields of deep learning. This week I’ll begin with Generative Adversarial Nets.

Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics

Analyzing web log traffic to gain insights that drive business decisions has historically been performed using batch processing. While effective, this approach results in delayed responses to emerging trends and user activities. There are solutions for processing data in real time using streaming and micro-batching technologies, but they can be complex to set up and maintain. Amazon Kinesis Analytics is a managed service that makes it very easy to identify and respond to changes in behavior in real time. One use case where it’s valuable to have immediate insights is analyzing clickstream data. In the world of digital advertising, an impression occurs when an ad is displayed in a browser, and a clickthrough occurs when a user clicks on that ad. The clickthrough rate (CTR) is one way to monitor an ad’s effectiveness. It is expressed as a percentage: CTR = Clicks / Impressions * 100. Digital marketers are interested in monitoring CTR to know when particular ads perform better than normal, giving them a chance to optimize placements within the ad campaign. They may also be interested in an anomalously low CTR, which could be the result of a bad image or bidding model. In this post, I show an analytics pipeline that detects anomalies in real time for a web traffic stream, using the RANDOM_CUT_FOREST function available in Amazon Kinesis Analytics.
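
The CTR formula above can be sketched in a few lines of Python. The function name and the sample counts are hypothetical, and returning 0.0 when there are no impressions is an assumption of this sketch rather than part of the definition.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return CTR as a percentage: Clicks / Impressions * 100."""
    if impressions <= 0:
        # CTR is undefined without impressions; returning 0.0 is an assumption here.
        return 0.0
    return clicks / impressions * 100


# Hypothetical counts for one ad over some time window.
ctr = click_through_rate(clicks=25, impressions=1000)
print(f"CTR: {ctr:.1f}%")
```

A stream of such per-window CTR values is the kind of series an anomaly detector would then monitor for unusually high or low readings.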