Big Data And Hadoop

Hadoop is regarded as one of the best platforms for storing and managing big data. It owes this success to its highly scalable data storage and processing, low price/performance ratio, high performance and availability, high schema flexibility, and its ability to handle all types of data. Unfortunately, Hadoop's APIs, such as those of HDFS, MapReduce, and HBase, are quite complex. They require expertise in Java programming (or similar languages) and in-depth knowledge of how to parallelize query processing efficiently. The downsides of these interfaces are a small target audience, low productivity, and limited tool support.
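To make that complexity concrete, the listing below sketches what a simple aggregation, counting records per group, looks like when written directly against the MapReduce Java API. This is a minimal sketch, assuming the classic org.apache.hadoop.mapreduce API, a tab-delimited input file with the group key in the first column, and input/output paths passed as program arguments; the class name and file layout are illustrative, not taken from any particular product.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Counts the number of records per group key: the MapReduce
    // equivalent of SELECT group_key, COUNT(*) ... GROUP BY group_key.
    public class GroupCount {

      public static class GroupMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text groupKey = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
          // Assumes tab-delimited records with the group key in column 0.
          String[] columns = record.toString().split("\t");
          groupKey.set(columns[0]);
          context.write(groupKey, ONE);
        }
      }

      public static class GroupReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int count = 0;
          for (IntWritable value : values) {
            count += value.get();
          }
          context.write(key, new IntWritable(count));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group count");
        job.setJarByClass(GroupCount.class);
        job.setMapperClass(GroupMapper.class);
        job.setCombinerClass(GroupReducer.class); // pre-aggregate per mapper
        job.setReducerClass(GroupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Roughly fifty lines of Java, plus compilation, packaging, and job submission, for what a SQL user would express as a single GROUP BY statement.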
The Need For SQL-on-Hadoop Engines

What is needed is a programming interface that retains the performance and scalability of HDFS, offers high productivity and maintainability, is familiar to non-technical users, and can be used by many reporting and analytical tools. The obvious choice is SQL. SQL is a high-level, declarative, and standardized database language; it is familiar to countless BI specialists, it is supported by almost all reporting and analytical tools, and it has proven its worth over and over again. To offer SQL on Hadoop, SQL query engines are needed that can query and manipulate data stored in HDFS or HBase. Such products are called SQL-on-Hadoop engines, and their popularity is growing rapidly. Here are just a few of the many SQL-on-Hadoop engines available: Apache Drill, Apache Hive, CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, HP Vertica, InfiniDB, JethroData, MemSQL, Pivotal HAWQ, Progress DataDirect, ScleraDB, Shark, and SpliceMachine.

On the outside, most SQL-on-Hadoop engines look alike: they all support some SQL dialect that can be invoked through ODBC or JDBC.
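By way of illustration, the fragment below shows how an application might invoke such an engine through JDBC. It is a sketch, not code for any specific product: the connection URL, the sales table, and its group_key column are hypothetical placeholders, and each engine ships its own JDBC driver. Everything else is ordinary java.sql usage, which is exactly the point: to the application, the engine looks like any other SQL database.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Queries big data in HDFS through a SQL-on-Hadoop engine's JDBC driver.
    // The URL, table, and column names are hypothetical placeholders.
    public class SqlOnHadoopQuery {
      public static void main(String[] args) throws Exception {
        String url = "jdbc:engine://cluster-host:10000/default"; // placeholder
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT group_key, COUNT(*) AS cnt "
                 + "FROM sales GROUP BY group_key")) {
          while (rs.next()) {
            System.out.println(
                rs.getString("group_key") + "\t" + rs.getLong("cnt"));
          }
        }
      }
    }

Note that this single GROUP BY query performs the same aggregation as the entire MapReduce listing shown earlier.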
Internally, however, the engines can be very different. The differences stem from the purpose for which each engine was designed. Here are some potential use cases for which they may have been designed:

• batch-oriented query environment (data mining)
• interactive query environment (OLAP, self-service BI, data visualization)
• point-queries (retrieving and manipulating individual objects)
• investigative analytics (data science)
• operational intelligence (real-time analytics)
• transactional (production systems)

Undesired Big Data Silos

Most Hadoop-based systems have been designed and developed by organizations for one or two use cases. The workload characteristics of these use cases are usually massive data load and the execution of non-interactive, complex forms of analytics. However, Hadoop implementations can support other use cases, including interactive reporting, data stream processing, transactional processing, and text search. The growing availability of SQL-on-Hadoop engines has widened the range of Hadoop use cases even more. Unfortunately, a Hadoop implementation deployed for a different use case may be unsuitable, in functionality or performance, for that new workload. Developing for another use case may therefore force an organization to build a second solution in which the data is stored once again. In the long run, this results in many data management platforms, each one designed and optimized to support a limited number of use cases. Eventually, this leads to undesirable big data silos. The disadvantages of big data silos are high costs due to data duplication, high data latency, complex data replication solutions, and data quality problems. Silos may work well temporarily, but history has shown that eventually the users of these silos will want to combine data from multiple data sources. When this happens, each application is extended to access multiple data sources, which leads to a dedicated integration solution for each of them. The result is another undesired outcome: an integration labyrinth. For an organization it is almost impossible to guarantee that all these integration solutions are correct, efficient, and lead to consistent results.

The Need For One Data Management Platform

The ROI on all the big data stored in Hadoop increases when it is made available for as wide a range of use cases as possible, including all the new use cases offered by the SQL-on-Hadoop engines. What is needed is one Hadoop data management platform designed to support all current and future use cases, so that the need to duplicate all that big data is minimized and the development of big data silos and an integration labyrinth is avoided.

The Whitepaper

This whitepaper explains what SQL-on-Hadoop engines are, what the technological challenges are, and what the potential use cases of SQL-on-Hadoop are. Besides a high-level comparison of several of these engines, it also contains a detailed description of Apache Drill, which brings to light some of the pertinent issues in providing SQL capabilities on big data. In addition, the MapR Technologies data management platform M7 is described as an example of a big data platform that can support many different use cases.
SQL-on-Hadoop Engines Explained