Wednesday, August 28, 2013


 Learning Hadoop on SlideShare


1. Introduction to MapReduce, an Abstraction for Large-Scale Computation by Ilan Horn, Google

2. Hadoop, Pig, and Twitter by Kevin Weil, Twitter


3. Introduction to data processing using Hadoop and Pig by Ricardo Varela, Yahoo!

4. Pig, Making Hadoop Easy by Alan F. Gates, Yahoo!

5. Practical Problem Solving with Hadoop and Pig by Milind Bhandarkar, Yahoo!


6. Big Data Analytics with Hadoop by Philippe Julio




7. HIVE Data Warehousing & Analytics on Hadoop by Facebook Data Team


8. Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop by Facebook Data Team


9. Integration of Apache Hive and HBase by Enis Soztutar, Hortonworks


10. Hive Quick Start Tutorial by Cloudera





Hive Data Types



Hive Primitive data types
 Complex Data Type
Hive complex data types



If you’re already familiar with SQL then you may well be thinking about how to add Hadoop skills to your toolbelt as an option for data processing.

From a querying perspective, Apache Hive provides a familiar interface to data held in a Hadoop cluster and is a great way to get started. Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad-hoc querying, and analysis of large datasets. It offers a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

If you really want to get to grips with Hive, then take a look at the full language manual.

Retrieving Information

| Function | MySQL | Hive |
|---|---|---|
| Retrieving Information (General) | SELECT from_columns FROM table WHERE conditions; | SELECT from_columns FROM table WHERE conditions; |
| Retrieving All Values | SELECT * FROM table; | SELECT * FROM table; |
| Retrieving Some Values | SELECT * FROM table WHERE rec_name = "value"; | SELECT * FROM table WHERE rec_name = "value"; |
| Retrieving With Multiple Criteria | SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2"; | SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2"; |
| Retrieving Specific Columns | SELECT column_name FROM table; | SELECT column_name FROM table; |
| Retrieving Unique Output | SELECT DISTINCT column_name FROM table; | SELECT DISTINCT column_name FROM table; |
| Sorting | SELECT col1, col2 FROM table ORDER BY col2; | SELECT col1, col2 FROM table ORDER BY col2; |
| Sorting Reverse | SELECT col1, col2 FROM table ORDER BY col2 DESC; | SELECT col1, col2 FROM table ORDER BY col2 DESC; |
| Counting Rows | SELECT COUNT(*) FROM table; | SELECT COUNT(*) FROM table; |
| Grouping With Counting | SELECT owner, COUNT(*) FROM table GROUP BY owner; | SELECT owner, COUNT(*) FROM table GROUP BY owner; |
| Maximum Value | SELECT MAX(col_name) AS label FROM table; | SELECT MAX(col_name) AS label FROM table; |
| Selecting from multiple tables (join the same table using an alias with "AS") | SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name; | SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name); |
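
The multiple-table row is the only one in this table where the two columns differ: the MySQL form lists both tables in FROM and puts the join condition in WHERE, while the Hive form uses an explicit JOIN ... ON. As a rough sketch of why the two are interchangeable, the snippet below runs both statements against an in-memory SQLite database. SQLite is an assumption made only so the example runs without a cluster, and the pet and event rows are invented sample data.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL/Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pet (name TEXT, owner TEXT);
    CREATE TABLE event (name TEXT, comment TEXT);
    INSERT INTO pet VALUES ('Fluffy', 'Harold'), ('Buffy', 'Harold'), ('Claws', 'Gwen');
    INSERT INTO event VALUES ('Fluffy', 'kittens'), ('Buffy', 'puppies'), ('Fluffy', 'vet visit');
""")

# Implicit join, as written in the MySQL column.
implicit = conn.execute(
    "SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name"
).fetchall()

# Explicit join, as written in the Hive column.
explicit = conn.execute(
    "SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name)"
).fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted results.
same_rows = sorted(implicit) == sorted(explicit)
print(same_rows)
```

Claws has no matching event, so it appears in neither result; both forms behave as inner joins.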

Metadata

| Function | MySQL | Hive |
|---|---|---|
| Selecting a database | USE database; | USE database; |
| Listing databases | SHOW DATABASES; | SHOW DATABASES; |
| Listing tables in a database | SHOW TABLES; | SHOW TABLES; |
| Describing the format of a table | DESCRIBE table; | DESCRIBE table; or DESCRIBE FORMATTED table; or DESCRIBE EXTENDED table; |
| Creating a database | CREATE DATABASE db_name; | CREATE DATABASE db_name; |
| Dropping a database | DROP DATABASE db_name; | DROP DATABASE db_name; or DROP DATABASE db_name CASCADE; |



Command Line

| Function | Hive |
|---|---|
| Run Query | hive -e 'select a.col from tab1 a' |
| Run Query Silent Mode | hive -S -e 'select a.col from tab1 a' |
| Set Hive Config Variables | hive -e 'select a.col from tab1 a' -hiveconf hive.root.logger=DEBUG,console |
| Use Initialization Script | hive -i initialize.sql |
| Run Non-Interactive Script | hive -f script.sql |
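
When these commands are driven from a script rather than typed by hand, it can help to assemble the argument list programmatically. The sketch below is one possible way to do that in Python. The helper `build_hive_command` is a hypothetical name, not part of Hive, and it only builds the argument list from the flags shown above (-e, -S, -i, -f, -hiveconf) without running anything.

```python
import shlex

def build_hive_command(query=None, script=None, init_script=None,
                       silent=False, hiveconf=None):
    """Assemble a hive CLI argument list (hypothetical helper, not part of Hive)."""
    if query is not None and script is not None:
        raise ValueError("pass either query (-e) or script (-f), not both")
    args = ["hive"]
    if silent:
        args.append("-S")                  # silent mode
    if init_script is not None:
        args += ["-i", init_script]        # initialization script
    if query is not None:
        args += ["-e", query]              # inline query
    if script is not None:
        args += ["-f", script]             # non-interactive script file
    for key, value in (hiveconf or {}).items():
        args += ["-hiveconf", f"{key}={value}"]  # one flag per config variable
    return args

cmd = build_hive_command(
    query="select a.col from tab1 a",
    silent=True,
    hiveconf={"hive.root.logger": "DEBUG,console"},
)
# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

On a machine where the hive executable is on the PATH, the resulting list could be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell-quoting problems with the query text.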




Hive Architecture


    Hive is a framework for data warehousing on top of Hadoop. It grew from a need to manage and learn from the huge volumes of data that Facebook was producing every day from its burgeoning social network. After trying a few different systems, the team chose Hadoop for storage and processing, since it was cost-effective and met their scalability needs.

    Hive was created to make it possible for analysts with strong SQL skills (but meagre Java programming skills) to run queries on the huge volumes of data that Facebook stored in HDFS. Today, Hive is a successful Apache project used by many organizations as a general-purpose, scalable data processing platform.

Hive clients

If you run Hive as a server, there are a number of different mechanisms for connecting to it from applications. The relationship between Hive clients and Hive services is illustrated in the diagram below.



Hive Architecture


Thrift Client

The Hive Thrift Client makes it easy to run Hive commands from a wide range of programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They can be found in the src/service/src subdirectory in the Hive distribution.

JDBC Driver

Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process at the given host and port. (The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings.)

You may alternatively choose to connect to Hive via JDBC in embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch it as a standalone server since it does not use the Thrift service or the Hive Thrift Client.
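
The difference between the two connection modes comes down to the shape of the URI. As a small illustration, the sketch below builds both forms. The helper `hive_jdbc_uri` is a hypothetical name introduced here, and the host localhost, port 10000 and database default are example values assumed for the demonstration, not values taken from this post.

```python
def hive_jdbc_uri(host=None, port=None, dbname=None):
    """Return a Hive JDBC URI in one of the two forms described above.

    Hypothetical helper for illustration; it only formats a string.
    """
    if host is None:
        # Embedded mode: Hive runs in the same JVM as the application.
        if port is not None or dbname is not None:
            raise ValueError("embedded mode takes no port or dbname")
        return "jdbc:hive://"
    if port is None or dbname is None:
        raise ValueError("remote mode needs host, port and dbname")
    # Remote mode: connect to a Hive server running in a separate process.
    return f"jdbc:hive://{host}:{port}/{dbname}"

# Example values below are assumptions, not taken from the text.
remote_uri = hive_jdbc_uri("localhost", 10000, "default")
embedded_uri = hive_jdbc_uri()
print(remote_uri)
print(embedded_uri)
```

A Java application would pass one of these strings to the JDBC driver manager; nothing else in the application needs to change to switch between a standalone server and embedded mode.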

ODBC Driver

The Hive ODBC Driver allows applications that support the ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.) The ODBC driver is still in development, so you should refer to the latest instructions on the Hive wiki for how to build and run it.

There are more details on using these clients on the Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient.

The Metastore

The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore.

Schema on Write

In a traditional database, a table’s schema is enforced at data load time. If the data being loaded doesn’t conform to the schema, then it is rejected. This design is sometimes called schema on write, since the data is checked against the schema when it is written into the database.  

Schema on Read

Hive, by contrast, doesn’t verify the data when it is loaded; it verifies it when a query is issued. This is called schema on read.
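
The contrast between the two approaches can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Hive or any database is implemented: the three-column SCHEMA, the comma-separated rows and all function names are made up for the example. Turning an unparseable field into None at query time is meant to resemble the way Hive typically returns NULL for a field that does not match the declared type.

```python
# Hypothetical three-column schema: (column name, type used to cast the field).
SCHEMA = [("id", int), ("name", str), ("age", int)]

def parse_row(line):
    """Cast each comma-separated field to its schema type; raise ValueError on mismatch."""
    fields = line.split(",")
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {name: cast(raw) for (name, cast), raw in zip(SCHEMA, fields)}

def load_schema_on_write(lines):
    """Schema on write: validate every row at load time; a bad row rejects the load."""
    return [parse_row(line) for line in lines]

def load_schema_on_read(lines):
    """Schema on read: loading just stores the raw lines, with no checks."""
    return list(lines)

def query_schema_on_read(stored):
    """Apply the schema at query time; fields that fail to parse become None."""
    for line in stored:
        fields = line.split(",")
        row = {}
        for i, (name, cast) in enumerate(SCHEMA):
            try:
                row[name] = cast(fields[i])
            except (IndexError, ValueError):
                row[name] = None
        yield row

raw = ["1,alice,34", "2,bob,not-a-number"]

try:
    load_schema_on_write(raw)
    write_outcome = "loaded"
except ValueError:
    write_outcome = "rejected"

read_rows = list(query_schema_on_read(load_schema_on_read(raw)))
print(write_outcome)
print(read_rows)
```

The trade-off shows up in where the cost lands: schema on write pays for validation once at load time, while schema on read makes the load a plain copy and defers the parsing work, and any surprises in the data, to each query.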








Hive & Pig cheatsheet 



Hive and Pig commands cheat sheet