Spinning Up a Free Hadoop Cluster: Step by Step
The following tutorial shows how you can spin up your own personal cluster on AWS and deploy Hadoop. These are by no means production-level setups, but they help you quickly start interacting with Hadoop’s distributed file system and even run MapReduce jobs.
Spark Grows Up and Scales Out
To understand the furor that’s greeted recent vendor announcements around the open source analytics computing engine Spark, and some commentary seemingly setting up a Spark versus Hadoop battle, it’s worth taking a moment to recap what each actually is (and is not).
SQL for Beginners in a Nutshell
Data sources are an integral part of Big Data, and much corporate data is still housed in databases, data marts and data warehouses accessed through SQL. More and more, traditional business and data analysts are being asked to work directly with enterprise-scale databases in order to procure data sets for early-stage data cleansing and transformation. This means that these professionals must master SQL so they can make simple queries themselves, rather than turning to IT to request data. But learning a programming language might not be an easy task for some. Our friends over at Udemy produced the infographic below containing the basics of SQL for beginners, designed to kick-start this effort.
Deaths in the Netherlands by cause and age
I downloaded counts of deaths by age, year and major cause from the Dutch statistics site. In this post I make some plots to look at causes and changes between the years.
The Workflow of Infinite Shame, and other stories from the R Summit
On day one of the R Summit at Copenhagen Business School there was a lot of talk about the performance of R, and alternative R interpreters.