Hive is a framework for data warehousing on top of Hadoop.
Hive grew from a need to manage and learn from the huge volumes of data that
Facebook was producing every day from its burgeoning social network. After
trying a few different systems, the team chose Hadoop for storage and
processing, since it was cost-effective and met their scalability needs.
Hive was created to make it possible for analysts with strong
SQL skills (but meagre Java programming skills) to run queries on the huge
volumes of data that Facebook stored in HDFS. Today, Hive is a successful
Apache project used by many organizations as a general-purpose, scalable data
processing platform.
Hive clients
If you run Hive as a server, then there are a number of different
mechanisms for connecting to it from applications. The relationship between Hive
clients and Hive services is illustrated in the diagram below.
[Figure: Hive Architecture]
Thrift Client
The Hive Thrift Client makes it easy to run Hive commands
from a wide range of programming languages. Thrift bindings for Hive are
available for C++, Java, PHP, Python, and Ruby. They can be found in the
src/service/src subdirectory in the Hive distribution.
JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in
the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC
URI of the form jdbc:hive://host:port/dbname, a Java application will connect
to a Hive server running in a separate process at the given host and port. (The
driver makes calls to an interface implemented by the Hive Thrift Client using
the Java Thrift bindings. )
You may alternatively choose to connect to Hive via JDBC in
embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same
JVM as the application invoking it, so there is no need to launch it as a
standalone server; the Thrift service and the Hive Thrift Client are not used
at all.
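As a minimal sketch, the JDBC workflow described above looks like the following. The host, port, and query here are illustrative assumptions; running the main method requires a Hive server on that host and the hive-jdbc jar on the classpath, which is why the example isolates the URI construction in a small helper.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {

    // Build a standalone-server JDBC URI of the form jdbc:hive://host:port/dbname.
    // Passing an empty host and port would correspond to embedded mode (jdbc:hive://).
    static String hiveUrl(String host, int port, String dbname) {
        return "jdbc:hive://" + host + ":" + port + "/" + dbname;
    }

    public static void main(String[] args) throws Exception {
        // Register the Hive driver (requires the hive-jdbc jar on the classpath).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to a Hive server assumed to be running at localhost:10000;
        // use "jdbc:hive://" instead to run Hive inside this JVM.
        try (Connection conn =
                     DriverManager.getConnection(hiveUrl("localhost", 10000, "default"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```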
ODBC Driver
The Hive ODBC Driver allows applications that support the
ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses
Thrift to communicate with the Hive server.) The ODBC driver is still in
development, so you should refer to the latest instructions on the Hive wiki
for how to build and run it.
There are more details on using these clients on the Hive
wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient.
The Metastore
The metastore is the central repository of Hive metadata.
The metastore is divided into two pieces: a service and the backing store for
the data. By default, the metastore service runs in the same JVM as the Hive
service and contains an embedded Derby database instance backed by the local
disk. This is called the embedded metastore.
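As a sketch, the embedded metastore's defaults can be spelled out explicitly in hive-site.xml. The property names below are standard Hive configuration keys, and the values shown are the usual embedded Derby defaults; treat the exact values as an assumption to check against your Hive version:

```xml
<!-- hive-site.xml: embedded metastore (service and Derby database in the Hive JVM) -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
</configuration>
```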
Schema on Write
In a traditional database, a table’s schema is enforced at
data load time. If the data being loaded doesn’t conform to the schema, then it
is rejected. This design is sometimes called schema on write, since the data is
checked against the schema when it is written into the database.
Schema on Read
Hive, by contrast, doesn’t verify the data when it is loaded; instead, it
verifies the data when a query is issued. This is called schema on read.
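The difference can be illustrated with a small, self-contained analogy. This is not Hive code; it simply mimics the two behaviors for a hypothetical one-column integer schema, including Hive's convention of returning null for a field that doesn't conform to the schema at query time:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaExample {

    // Schema on write: validate each value as it is loaded, so non-conforming
    // input is rejected (here, by throwing) before it ever reaches the table.
    static List<Integer> loadSchemaOnWrite(List<String> raw) {
        List<Integer> table = new ArrayList<>();
        for (String value : raw) {
            table.add(Integer.parseInt(value)); // throws NumberFormatException on bad data
        }
        return table;
    }

    // Schema on read: the raw data is stored untouched; the schema is applied
    // per query, and values that don't conform are surfaced as null.
    static List<Integer> querySchemaOnRead(List<String> raw) {
        List<Integer> result = new ArrayList<>();
        for (String value : raw) {
            Integer parsed;
            try {
                parsed = Integer.parseInt(value);
            } catch (NumberFormatException e) {
                parsed = null; // malformed field becomes null, as Hive does
            }
            result.add(parsed);
        }
        return result;
    }
}
```

Loading ("1", "oops", "3") fails outright under schema on write, whereas under schema on read the load always succeeds and a later query sees 1, null, 3.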