Hadoop is an open-source framework, written in Java, for storing and analyzing large sets of unstructured data. It is a highly scalable platform that runs many concurrent tasks across anywhere from a single server to thousands of servers. At its core is a distributed file system that moves data and files between nodes in fractions of a second. Because processing continues even when a node fails, it is a reliable technology for companies that cannot afford to delay or halt their operations.
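The split-and-recombine idea behind Hadoop's processing model (MapReduce) can be sketched in miniature. The snippet below is a plain-Python stand-in for a word-count job — `mapper` and `reducer` are illustrative names, and a real cluster would run these two phases in parallel across many nodes (e.g. via Hadoop Streaming) rather than in a single process:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (pairs grouped by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["Hadoop stores data", "Hadoop processes data"])))
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

The fault tolerance mentioned above comes from the same structure: because each map and reduce task is independent, a failed task can simply be rerun on another node.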
This tutorial provides an introduction to Apache Spark: its ecosystem components, its core abstraction (the RDD), and transformations and actions. The objective of this introductory guide is to give a detailed overview of Spark, including its history, architecture, deployment modes, and RDDs.
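The key property of the RDD abstraction is that transformations (such as `map` and `filter`) are lazy — they only record what should be computed — while actions (such as `collect` and `count`) actually trigger the computation. The toy class below is not PySpark; it is a minimal pure-Python illustration of that lazy/eager split:

```python
class MiniRDD:
    """A toy stand-in for a Spark RDD: transformations are lazy, actions compute."""
    def __init__(self, data_fn):
        self._data_fn = data_fn  # deferred computation, nothing runs yet

    def map(self, f):            # transformation: returns a new lazy MiniRDD
        return MiniRDD(lambda: [f(x) for x in self._data_fn()])

    def filter(self, pred):      # transformation: also lazy
        return MiniRDD(lambda: [x for x in self._data_fn() if pred(x)])

    def collect(self):           # action: triggers the actual computation
        return self._data_fn()

    def count(self):             # action
        return len(self._data_fn())

rdd = MiniRDD(lambda: [1, 2, 3, 4, 5])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())  # [4, 16]
```

In real Spark, this laziness is what lets the engine fuse a chain of transformations into an optimized execution plan before any data is touched.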
In this post, we leverage a few other NLP techniques to analyze another text corpus: a collection of tweets. Given a tweet, we would like to extract the key words or phrases that convey the gist of its meaning. For the purpose of this demo, we will extract President Donald Trump's tweets (~3,000 in total) from Twitter using Twitter's API.
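As a rough sketch of what key-word extraction involves (the post itself may use more sophisticated techniques), the simplest baseline is to tokenize, drop stopwords, and rank the remaining words by frequency. The `keywords` function and the tiny `STOPWORDS` set below are hypothetical names for this illustration:

```python
import re
from collections import Counter

# Deliberately tiny stopword list for illustration; a real pipeline would
# use a full list (e.g. from NLTK or spaCy).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "we", "will"}

def keywords(tweet, top_n=3):
    """Very rough keyword extraction: tokenize, drop stopwords, rank by frequency."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

print(keywords("The economy is doing great, great jobs numbers, great economy!"))
```

Frequency alone is a weak signal on 140-character texts, which is why tweet analyses typically layer further techniques (n-grams, TF-IDF across the corpus, part-of-speech filtering) on top of this baseline.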
Doing Data Science can be equally perilous, time-wise. Indeed, if you ignore the time component of your analyses, you may find effects where none exist — or miss gigantic, wedding-sized effects where they do.
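To make the danger concrete, here is a hypothetical, deliberately simplified illustration of finding an effect where none exists: group B is measured later than group A during a period when the metric drifts upward on its own, so a naive comparison of group means reports a "difference" that is purely the time trend.

```python
import statistics

days = list(range(10))
baseline = [10 + 0.5 * d for d in days]   # an underlying upward drift over time
group = ["A"] * 5 + ["B"] * 5             # A measured early, B measured late

mean_a = statistics.mean(v for v, g in zip(baseline, group) if g == "A")
mean_b = statistics.mean(v for v, g in zip(baseline, group) if g == "B")
print(mean_b - mean_a)  # 2.5 — a "group effect" produced entirely by time
```

Plotting the two groups against the date, or comparing them within the same time window, would reveal immediately that the apparent effect is confounded with time.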
In my last post, where I shared the code I used to produce an example analysis to go along with my webinar on building meaningful models for disease prediction, I mentioned that it is advisable to consider over- or under-sampling when you have unbalanced data sets. Because my focus in that webinar was on evaluating model performance, I did not want to add an additional layer of complexity, and therefore did not discuss further how to deal specifically with unbalanced data.
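The simplest of these techniques is random oversampling: duplicate minority-class rows (sampled with replacement) until the classes are balanced. The original analysis was in R, so the Python sketch below, with the illustrative name `oversample`, is only meant to show the idea:

```python
import random

def oversample(X, y, seed=42):
    """Randomly duplicate minority-class rows until all classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 0, 1]            # heavily unbalanced: five 0s, one 1
X_bal, y_bal = oversample(X, y)
print(y_bal.count(0), y_bal.count(1))  # 5 5
```

One caveat worth remembering: resample only the training set, never the test set, or the performance estimates will be optimistically biased.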
At the recent Strata conference in San Jose, several members of the Microsoft Data Science team presented the tutorial Using R for Scalable Data Analytics: Single Machines to Spark Clusters. The materials are all available online, including the presentation slides and hands-on R scripts. You can follow along with the materials at home, using the Data Science Virtual Machine for Linux, which provides all the necessary components like Spark and Microsoft R Server. (If you don’t already have an Azure account, you can get $200 credit with the Azure free trial.)
Data analysis workflows and recipes are commonly used in science. They are indeed indispensable, since reinventing the wheel for each project would be a colossal waste of time. On the other hand, mindlessly applying a workflow can lead to totally wrong conclusions if the required assumptions don't hold. This is why successful data analysts rely heavily on interactive data analysis (IDA). I write today because I am somewhat concerned that the importance of IDA is not fully appreciated by many of the policy makers and thought leaders who will influence how we access and work with data in the future.