What is Hive?

Hive is an open source volunteer project under the Apache Software Foundation. Previously a subproject of Apache Hadoop, it has now graduated to become a top-level project of its own.

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage.

Hive provides a mechanism to project structure onto this data, and query it using a Structured Query Language (SQL)-like syntax called HiveQL (Hive Query Language). This language also lets traditional map/reduce programmers plug in their custom mappers and reducers when it's inconvenient or inefficient to express this logic in HiveQL.

That's how the official website puts it.
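
To make that concrete, here is a minimal sketch of HiveQL in both roles. The table, column, and script names (page_views, user_id, my_mapper.py) are hypothetical, chosen only for illustration:

    -- Plain SQL-like aggregation over a (hypothetical) page_views table
    SELECT user_id, COUNT(*) AS visits
    FROM page_views
    GROUP BY user_id;

    -- Plugging a custom map script into the pipeline with TRANSFORM
    SELECT TRANSFORM (user_id, url)
        USING 'python my_mapper.py'   -- hypothetical user-supplied mapper
        AS (user_id, category)
    FROM page_views;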

In Essence...

Hive is a data-warehousing infrastructure for Hadoop; Hadoop, in turn, is a framework for handling large datasets in a distributed computing environment.

Its Evolution

When Hadoop was introduced, Yahoo! started working on a system called Pig for its application deployment, with the ultimate goal of managing its unstructured data. At the same time, engineers at Facebook developed a runtime Hadoop support structure that allowed anyone already fluent in SQL (commonplace for relational database developers) to leverage the Hadoop platform directly.

Facebook's creation was called Hive, and allowed SQL developers to write Hive Query Language (HQL) statements similar to standard SQL statements.

Similar, But Different

At first glance, Hive and Pig would seem to be the same. In practice, you will find one tool or the other being favored by the different groups that use Apache Hadoop.

Pig plays to its data-flow strengths: it takes on the tasks of bringing data into Apache Hadoop and transforming it into a form suitable for querying. All data objects exist and are operated on within the script, and once the script completes, they are deleted unless you explicitly store them.

Hive is considered friendlier and more familiar to users accustomed to using SQL for querying data.

Hive In Action

Hive acts on the Hadoop data store. You can think of it as providing a data workbench where you can examine, modify and manipulate the data in Apache Hadoop.

Any query you make, table you create, or data you copy persists from query to query. When you perform a data processing task, you can execute it one query or line at a time. Once a line executes successfully, you can inspect the data objects to verify that the last operation did what you expected.
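
A sketch of that incremental style, with hypothetical table names and paths; each statement can be run on its own and its result checked before moving on:

    -- Define a table over raw data (names and path are illustrative)
    CREATE TABLE logs (ip STRING, request STRING, status INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Pull a file from HDFS into the table
    LOAD DATA INPATH '/data/raw/logs' INTO TABLE logs;

    -- Check that the load did what you expected
    SELECT * FROM logs LIMIT 10;

    -- The table persists, so a later query can build on it
    SELECT status, COUNT(*) FROM logs GROUP BY status;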

This is in direct contrast with Pig. In Hive, all your data is live. You're able to solve problems bit by bit, and change your mind on what to do next, depending on what you find. This kind of flexibility is Hive’s strength.

Hive is well suited to batch-processing workloads such as log processing, text mining, document indexing, customer-facing business intelligence, predictive modeling, and hypothesis testing.

Under the Hood

Hive supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 file system.

HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive's primary responsibility is to provide data summarization, query and analysis. It supports SQL-like access to structured data, as well as Big Data analysis with the help of MapReduce.
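
You can watch this translation happen with Hive's EXPLAIN statement, which prints the plan Hive generates for a query, including its MapReduce stages (the table name below is hypothetical):

    -- Show the execution plan, including the map and reduce stages,
    -- that Hive generates for this aggregation
    EXPLAIN
    SELECT status, COUNT(*)
    FROM logs
    GROUP BY status;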

Hive data is organized into three units (illustrated in the sketch after this list):

1) Tables: These are very similar to relational database management system (RDBMS) tables and contain rows and columns. Because Hive is layered over the Hadoop Distributed File System (HDFS), each table maps directly to a directory in the file system. Hive also supports tables stored in other native file systems.

2) Partitions: A Hive table can have one or more partitions, each mapped to a subdirectory of the table's directory.

3) Buckets: Data in each partition can be divided further into buckets, stored as files within the partition's directory in the underlying file system.
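
A single (hypothetical) table definition can exercise all three units. Here the table maps to a directory, each value of dt to a subdirectory, and each bucket to a file within it; all names are illustrative:

    CREATE TABLE page_views (user_id BIGINT, url STRING)
        PARTITIONED BY (dt STRING)               -- one subdirectory per date
        CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- 32 files per partition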

Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a column-oriented manner).
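
The format is chosen per table with a STORED AS clause. A minimal sketch, with a hypothetical table name:

    CREATE TABLE page_views_rc (user_id BIGINT, url STRING)
        STORED AS RCFILE;   -- or TEXTFILE (the usual default) or SEQUENCEFILE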

Hive also has a metastore, which stores all the metadata. It is a relational database holding information about Hive schemas (column types, owners, key-value data, statistics and so on). The metastore acts as a system catalog and records how table data should be serialized and deserialized, increasing flexibility in schema design.
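
Much of what the metastore holds is visible from HiveQL itself. DESCRIBE FORMATTED prints the catalog entry for a table (the name here is hypothetical):

    -- Print the metastore's record for a table: columns and types,
    -- owner, storage location, format, and table parameters
    DESCRIBE FORMATTED page_views;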

You can run Hive queries in several ways: from a command-line interface (known as the Hive shell); from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application, using the Hive JDBC/ODBC drivers; or from a Hive Thrift client, which acts much like any database client software and communicates with the Hive services running on the server.
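
In the Hive shell, for example, statements are typed at a hive> prompt and submitted one at a time:

    hive> SHOW TABLES;
    hive> SELECT COUNT(*) FROM page_views;   -- hypothetical table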

Limitations

Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, Hive queries have a very high latency (many minutes). This means that Hive is not appropriate for applications that need very fast response times (such as you might expect with a database like DB2). Moreover, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

Hive is not built to give quick responses to queries. It is built for data mining applications, where an analysis can take anywhere from several minutes to several hours to complete.