Overview of Apache Flink: Next-Gen Big Data Analytics Framework
These are the slides of my talk on June 30, 2015 at the first event of the Chicago Apache Flink meetup. Although most of the current buzz is about Apache Spark, the talk shows how Apache Flink offers the only hybrid open source (real-time streaming + batch) distributed data processing engine supporting many use cases: real-time stream processing, machine learning at scale, graph analytics and batch processing. In these slides, you will find answers to the following questions: What is the Apache Flink stack and how does it fit into the Big Data ecosystem? How does Apache Flink integrate with Apache Hadoop and other open source tools for data input and output as well as deployment? What is the architecture of Apache Flink? What are the different execution modes of Apache Flink? Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? Who is using Apache Flink? Where can you learn more about Apache Flink?
What is an interaction effect?
One statistical concept that instructors frequently don’t have time to cover in Stat 101 is the ‘interaction’ effect. I will explain this concept using the fantastic interactive graphic by the visualization team at the German publication Zeit (please also read the corresponding post on Junk Charts here for some background.)
Pandas Categoricals
Pandas Categoricals efficiently encode repetitive text data. Categoricals are useful for data like stock symbols, gender, experiment outcomes, cities, states, etc. Categoricals are easy to use and greatly improve performance on this data.
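As a minimal sketch of the idea (not taken from the linked post), the snippet below converts a column of repeated stock symbols to a categorical and compares memory use. The symbol values and the row count are made up for illustration; only standard pandas calls (`astype("category")`, `.cat.categories`, `.cat.codes`, `memory_usage`) are assumed.

```python
import pandas as pd

# Hypothetical data: 100,000 rows drawn from three repeated stock symbols.
symbols = pd.Series(["AAPL", "GOOG", "MSFT", "AAPL"] * 25_000)

# Convert the text column to a categorical: each distinct label is stored
# once, and the column itself is held as small integer codes.
as_category = symbols.astype("category")

# Compare the memory footprint of the plain text column and the categorical.
text_bytes = symbols.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(as_category.cat.categories.tolist())     # the distinct labels
print(as_category.cat.codes.head(4).tolist())  # integer code for each of the first rows
print(text_bytes, category_bytes)              # categorical should be much smaller
```

The savings come from the repetition: the fewer distinct values there are relative to the number of rows, the more a categorical helps.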
Understanding Neural Networks Through Deep Visualization
Deep neural networks have recently been producing amazing results! But how do they do what they do? Historically, they have been thought of as ‘black boxes’, meaning that their inner workings were mysterious and inscrutable. Recently, we and others have started shining light into these black boxes to better understand exactly what each neuron has learned and thus what computation it is performing.
June 2015: Scripts of the Week
Six new competitions launched on Kaggle in June and lots of great activity on scripts quickly followed! It was tough choosing just one script to highlight each week, but we’re confident you’ll find these four visualizations as compelling as we do. Remember, you can click through to the code on Kaggle scripts to understand the process, view the packages, and (in a couple of cases) get interactive.
Web Site Content Extraction with XSLT & R
Sometimes you just need the salient text from a web site, often as a first step towards natural language processing (NLP) or classification. There are many ways to achieve this, but XSLT (eXtensible Stylesheet Language Transformations) was purpose-built for slicing, dicing and transforming XML (and, hence, HTML), so it can make more sense, and even be speedier, to use XSLT transformations than to write a hefty bit of R (or other language) code.
Working with Sessionized Data 1: Evaluating Hazard Models
When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a ‘ready to analyze’ form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this form is a matter of performing the equivalent of a number of SQL joins (for example, Lecture 23 (‘The Shape of Data’) from our paid video course Introduction to Data Science discusses this).
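As a rough illustration of that kind of join (not the course's own example), the sketch below uses Python's built-in `sqlite3` module to combine two small tables into one row per session. The `customers` and `sessions` tables, their columns, and their values are all hypothetical.

```python
import sqlite3

# In-memory database holding two hypothetical systems of record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE sessions  (session_id INTEGER PRIMARY KEY,
                        customer_id INTEGER, minutes REAL);
INSERT INTO customers VALUES (1, 'east'), (2, 'west');
INSERT INTO sessions  VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Join the tables so each session (the "example") gets one row, with the
# customer-level facts repeated as extra columns on that row.
rows = conn.execute("""
SELECT s.session_id, s.minutes, c.customer_id, c.region
FROM sessions AS s
JOIN customers AS c ON c.customer_id = s.customer_id
ORDER BY s.session_id
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

Here the customer's region is duplicated on every session row, which is what makes the result denormalized and ready to analyze one row at a time.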