Data Warehouse vs. Data Lake
In April, I was given the opportunity to present “An Executive’s Cheat Sheet on Hadoop, the Enterprise Data Warehouse and the Data Lake” at the SAS Global Forum Executive Conference in Dallas. During this standing-room-only session, I addressed these five questions:
1. What can Hadoop do that my data warehouse can’t?
2. We’re not doing ‘big’ data, so why do we need Hadoop?
3. Is Hadoop enterprise-ready?
4. Isn’t a data lake just the data warehouse revisited?
5. What are some of the pros and cons of a data lake?
Following is a recap of my comments, along with a few screenshots. See what you think.
How To Analyze Data: 21 Graphs that Explain the Same-Sex Marriage Case, Public Opinion, & Supreme Court
The nine Justices on the United States Supreme Court recently took up a case about same-sex marriage. The question in Obergefell v. Hodges is whether states are required to license and recognize marriages between two people of the same sex. This post examines the same-sex marriage case and Court in three sections about:
• Public opinion on same-sex marriage (10 graphs)
• Politics and voting on the Court (5)
• Justices, clerks, & opinions about the Court (6)
Introductory Point Pattern Analysis of Open Crime Data in London
Police in Britain (http://data.police.uk) not only record every crime they encounter, coordinates included, but also distribute the data free on the web. They offer two ways to get it. The first is an API, which is extremely easy to use but returns only a limited number of crimes per request. The second is a good old manual download from this page http://…/. That page is also extremely easy to use, and they did a very good job of ensuring that people can access and work with these data: we just select the time range and the police force for a given area, then wait for the system to create the dataset for us. I downloaded data from all forces for May and June 2014, and it took less than 5 minutes to prepare them for download. These data are distributed under the Open Government Licence, which allows me to do basically whatever I want with them (even commercially) as long as I cite the origin and the licence.
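As a rough illustration of a first step in point pattern analysis on a download like this, here is a minimal Python sketch of quadrat counting: it bins crime locations into a regular grid and counts the points per cell. It is not the method from the original post. The column names "Longitude" and "Latitude" are assumptions about the layout of the downloaded CSV, the function name and the cell size are arbitrary, and the sample rows are invented purely so the example runs on its own.

```python
import csv
import io
import math
from collections import Counter


def quadrat_counts(rows, cell_size=0.01):
    """Count points per grid cell, with cells cell_size degrees wide.

    Assumes each row is a dict with "Longitude" and "Latitude" keys
    (hypothetical column names; check them against the real file).
    """
    counts = Counter()
    for row in rows:
        try:
            lon = float(row["Longitude"])
            lat = float(row["Latitude"])
        except (KeyError, ValueError):
            # Skip records that have no usable coordinates.
            continue
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        counts[cell] += 1
    return counts


# Invented sample rows standing in for a downloaded file.
SAMPLE = """Month,Longitude,Latitude,Crime type
2014-05,-0.1278,51.5074,Burglary
2014-05,-0.1279,51.5075,Robbery
2014-06,-0.0950,51.5250,Burglary
2014-06,,,Anti-social behaviour
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = quadrat_counts(rows)

# Print the busiest cells first.
for cell, n in counts.most_common():
    print(cell, n)
```

With a real file, the `io.StringIO(SAMPLE)` part would be replaced by an opened CSV file. Cells with unusually high counts are candidates for a closer look, although a grid in raw degrees is only a crude approximation and a projected coordinate system would be a better basis for real analysis.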
How to reduce Data Hoarding, get Better Visualizations and Decisions
Creating a hodge-podge of pretty pictures of every datapoint is a guaranteed way to destroy the value of a visualization. We examine how to reduce such data hoarding and improve decisions.
Data science makes an impact on Wall Street
• Text mining in finance
• Pricing financial products
• Recruiting data professionals
The Unreasonable Effectiveness of Recurrent Neural Networks
There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training, my first baby model (with rather arbitrarily chosen hyperparameters) started to generate very nice-looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was the common wisdom that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.