Hive is a framework for data warehousing on top of Hadoop.
Hive grew from a need to manage and learn from the huge volumes of data that
Facebook was producing every day from its burgeoning social network. After
trying a few different systems, the team chose Hadoop for storage and
processing, since it was cost-effective and met their scalability needs.
Hive was created to make it possible for analysts with strong
SQL skills (but meagre Java programming skills) to run queries on the huge
volumes of data that Facebook stored in HDFS. Today, Hive is a successful
Apache project used by many organizations as a general-purpose, scalable data
processing platform.
Hive clients
If you run Hive as a server, then there are a number of different
mechanisms for connecting to it from applications. The relationship between Hive
clients and Hive services is illustrated in the diagram below.
[Figure: Hive Architecture]
Thrift Client
The Hive Thrift Client makes it easy to run Hive commands
from a wide range of programming languages. Thrift bindings for Hive are
available for C++, Java, PHP, Python, and Ruby. They can be found in the
src/service/src subdirectory in the Hive distribution.
JDBC Driver
Hive provides a Type 4 (pure Java) JDBC driver, defined in
the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC
URI of the form jdbc:hive://host:port/dbname, a Java application will connect
to a Hive server running in a separate process at the given host and port. (The
driver makes calls to an interface implemented by the Hive Thrift Client using
the Java Thrift bindings. )
You may alternatively choose to connect to Hive via JDBC in
embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same
JVM as the application invoking it, so there is no need to launch it as a
standalone server; the Thrift service and the Hive Thrift Client are not used
at all.
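As a minimal sketch, the JDBC workflow described above looks like the following. The host, port, and query here are illustrative assumptions; running the main method requires a Hive server on that host and the hive-jdbc jar on the classpath, which is why the example isolates the URI construction in a small helper.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {

    // Build a standalone-server JDBC URI of the form jdbc:hive://host:port/dbname.
    // Passing an empty host and port would correspond to embedded mode (jdbc:hive://).
    static String hiveUrl(String host, int port, String dbname) {
        return "jdbc:hive://" + host + ":" + port + "/" + dbname;
    }

    public static void main(String[] args) throws Exception {
        // Register the Hive driver (requires the hive-jdbc jar on the classpath).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        // Connect to a Hive server assumed to be running at localhost:10000;
        // use "jdbc:hive://" instead to run Hive inside this JVM.
        try (Connection conn =
                     DriverManager.getConnection(hiveUrl("localhost", 10000, "default"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```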
ODBC Driver
The Hive ODBC Driver allows applications that support the
ODBC protocol to connect to Hive. (Like the JDBC driver, the ODBC driver uses
Thrift to communicate with the Hive server.) The ODBC driver is still in
development, so you should refer to the latest instructions on the Hive wiki
for how to build and run it.
There are more details on using these clients on the Hive
wiki at https://cwiki.apache.org/confluence/display/Hive/HiveClient.
The Metastore
The metastore is the central repository of Hive metadata.
The metastore is divided into two pieces: a service and the backing store for
the data. By default, the metastore service runs in the same JVM as the Hive
service and contains an embedded Derby database instance backed by the local
disk. This is called the embedded metastore.
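As a sketch, the embedded metastore's defaults can be spelled out explicitly in hive-site.xml. The property names below are standard Hive configuration keys, and the values shown are the usual embedded Derby defaults; treat the exact values as an assumption to check against your Hive version:

```xml
<!-- hive-site.xml: embedded metastore (service and Derby database in the Hive JVM) -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
</configuration>
```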
Schema on Write
In a traditional database, a table’s schema is enforced at
data load time. If the data being loaded doesn’t conform to the schema, then it
is rejected. This design is sometimes called schema on write, since the data is
checked against the schema when it is written into the database.
Schema on Read
Hive, by contrast, doesn’t verify the data when it is loaded; instead, it
verifies the data when a query is issued. This is called schema on read.
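The difference can be illustrated with a small, self-contained analogy. This is not Hive code; it simply mimics the two behaviors for a hypothetical one-column integer schema, including Hive's convention of returning null for a field that doesn't conform to the schema at query time:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaExample {

    // Schema on write: validate each value as it is loaded, so non-conforming
    // input is rejected (here, by throwing) before it ever reaches the table.
    static List<Integer> loadSchemaOnWrite(List<String> raw) {
        List<Integer> table = new ArrayList<>();
        for (String value : raw) {
            table.add(Integer.parseInt(value)); // throws NumberFormatException on bad data
        }
        return table;
    }

    // Schema on read: the raw data is stored untouched; the schema is applied
    // per query, and values that don't conform are surfaced as null.
    static List<Integer> querySchemaOnRead(List<String> raw) {
        List<Integer> result = new ArrayList<>();
        for (String value : raw) {
            Integer parsed;
            try {
                parsed = Integer.parseInt(value);
            } catch (NumberFormatException e) {
                parsed = null; // malformed field becomes null, as Hive does
            }
            result.add(parsed);
        }
        return result;
    }
}
```

Loading ("1", "oops", "3") fails outright under schema on write, whereas under schema on read the load always succeeds and a later query sees 1, null, 3.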