Hive Architecture

Hive is a framework for data warehousing on top of Hadoop. It grew from a need to manage and learn from the huge volumes of data that Facebook was producing every day from its burgeoning social network. After trying a few different systems, the team chose Hadoop for storage and processing, since it was cost-effective and met their scalability needs.

Hive was created to make it possible for analysts with strong SQL skills (but meagre Java programming skills) to run queries on the huge volumes of data that Facebook stored in HDFS. Today, Hive is a successful Apache project used by many organizations as a general-purpose, scalable data processing platform.

Hive clients

If you run Hive as a server, there are a number of different mechanisms for connecting to it from applications. The relationship between Hive clients and Hive services is illustrated in the diagram below.

Figure: Hive Architecture

Thrift Client

The Hive Thrift Client makes it easy to run Hive commands from a wide range of programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They can be found in the src/service/src subdirectory of the Hive distribution.

JDBC Driver

Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port. (The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings.)

You may alternatively connect to Hive via JDBC in embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch it as a standalone server; it does not use the Thrift service or the Hive Thrift Client.
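As a rough illustration of the two JDBC modes described above, the following sketch builds both URI forms and runs a simple statement. The host localhost, port 10000, database default, the empty user name and password, and the SHOW TABLES query are assumptions chosen for the example, and the class and method names are hypothetical. It only works with the Hive JDBC driver and its dependencies on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class HiveJdbcSketch {
    // Driver class named in the text above.
    static final String DRIVER = "org.apache.hadoop.hive.jdbc.HiveDriver";

    // URI for a Hive server running in a separate process.
    static String remoteUrl(String host, int port, String dbName) {
        return "jdbc:hive://" + host + ":" + port + "/" + dbName;
    }

    // URI for embedded mode: Hive runs inside the calling JVM.
    static String embeddedUrl() {
        return "jdbc:hive://";
    }

    // Opens a connection, lists the tables, and closes everything again.
    static void showTables(String url) throws ClassNotFoundException, SQLException {
        Class.forName(DRIVER); // register the driver with DriverManager
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }

    public static void main(String[] args) {
        // Assumed defaults; pass a different URI as the first argument.
        String url = args.length > 0 ? args[0] : remoteUrl("localhost", 10000, "default");
        try {
            showTables(url);
        } catch (ClassNotFoundException e) {
            System.err.println("Hive JDBC driver not on classpath: " + e.getMessage());
        } catch (SQLException e) {
            System.err.println("Could not query Hive at " + url + ": " + e.getMessage());
        }
    }
}
```

Passing the result of embeddedUrl() instead of a remote URI would exercise embedded mode, in which case no separate Hive server needs to be running.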

ODBC Driver

The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.) The ODBC driver is still in development, so you should refer to the latest instructions on the Hive wiki for how to build and run it.

There are more details on using these clients on the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient.

The Metastore

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore.
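To make the embedded metastore a little more concrete, the sketch below collects the two connection settings that, to the best of my knowledge, Hive uses by default to point the metastore at an embedded Derby database. The property names and values are stated from memory rather than taken from this post, so treat them as assumptions and check them against your Hive version. The class and method names are hypothetical.

```java
import java.util.Properties;

final class EmbeddedMetastoreSketch {
    static final String URL_KEY = "javax.jdo.option.ConnectionURL";
    static final String DRIVER_KEY = "javax.jdo.option.ConnectionDriverName";

    // Assumed default settings for an embedded, disk-backed Derby metastore.
    static Properties embeddedDefaults() {
        Properties p = new Properties();
        // Derby creates a metastore_db directory in the current working directory.
        p.setProperty(URL_KEY, "jdbc:derby:;databaseName=metastore_db;create=true");
        p.setProperty(DRIVER_KEY, "org.apache.derby.jdbc.EmbeddedDriver");
        return p;
    }

    // True when the configured driver is Derby's in-process (embedded) driver.
    static boolean usesEmbeddedDerby(Properties p) {
        return "org.apache.derby.jdbc.EmbeddedDriver".equals(p.getProperty(DRIVER_KEY));
    }

    public static void main(String[] args) {
        Properties p = embeddedDefaults();
        System.out.println(URL_KEY + " = " + p.getProperty(URL_KEY));
        System.out.println(DRIVER_KEY + " = " + p.getProperty(DRIVER_KEY));
        System.out.println("embedded Derby: " + usesEmbeddedDerby(p));
    }
}
```

In a real installation these values would normally live in Hive's configuration file rather than in application code; the Java form here is only for illustration.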

Schema on Write

In a traditional database, a table’s schema is enforced at data load time. If the data being loaded doesn’t conform to the schema, then it is rejected. This design is sometimes called schema on write, since the data is checked against the schema when it is written into the database.

Schema on Read

Hive, by contrast, doesn’t verify the data when it is loaded; it verifies the data only when a query is issued. This is called schema on read.
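The difference can be sketched with a toy model. This is not Hive's implementation, only an illustration of the idea: loading stores raw lines without checking them, and a hypothetical two-column schema (id as an integer, name as a string) is applied when the data is queried. The sketch assumes that a field which cannot be parsed shows up as null in the result, which is how Hive typically treats malformed values. All names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SchemaOnReadSketch {
    // Raw, unvalidated lines, as they would sit in a file.
    private final List<String> rawLines = new ArrayList<>();

    // "Load" just stores the line; nothing is checked against the schema.
    void load(String line) {
        rawLines.add(line);
    }

    // The schema (id INT, name STRING) is applied here, at query time.
    List<Object[]> query() {
        List<Object[]> rows = new ArrayList<>();
        for (String line : rawLines) {
            String[] fields = line.split(",", -1);
            Integer id = parseIntOrNull(fields[0]);
            String name = fields.length > 1 ? fields[1] : null;
            rows.add(new Object[] { id, name });
        }
        return rows;
    }

    // A value that does not match the column type becomes null.
    static Integer parseIntOrNull(String s) {
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        SchemaOnReadSketch table = new SchemaOnReadSketch();
        table.load("1,alice");
        table.load("oops,bob"); // accepted at load time despite the bad id
        for (Object[] row : table.query()) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

A schema-on-write system would have rejected the second line in load(); here the problem only becomes visible when query() runs.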


Wednesday, August 28, 2013
