Pig and Hive
Hive or Pig?
Similarities:
- Same motivation:
- Allow non-Java programmers easy access to data in their Hadoop clusters.
- Easy to learn and easy to use.
- Immediately makes data on the cluster accessible to more people.
- Both are open-source projects managed by the Apache Software Foundation.
Differences:
- Hive
- Developed by Facebook
- Metastore database required, used for storing:
- Table definitions (name, columns, data types, etc.)
- Information on where the table data is stored in HDFS
- Row format of files in the table.
- HiveQL is SQL-like and familiar to many developers (declarative)
- Data structure is predefined
- JDBC driver available (limited functionality)
- Hive is not an RDBMS! Not all SQL works
- Pig
- Developed by Yahoo!
- No Metastore or changes to the cluster required (Pig is just a client).
- Pig Latin, a new but simple data-flow (pipeline) language.
- Just an interpreter running on the client machine: it turns Pig Latin into MapReduce jobs and submits them to the cluster.
- Data structure is defined at execution time.
Examples
Follow the Amazon EMR or Google Cloud DataProc documentation, or IERG4300Fall2020Tutorial8, on how to launch a multi-node Hadoop cluster.
Note that the user and password for IERG4300Fall2020Tutorial8 are:
User: ierg4300
Password: fall2020ierg
In this tutorial, we demonstrate how to launch a multi-node Hadoop cluster with Pig and Hive installed, and show some examples of using each.
aws emr create-cluster \
--release-label emr-6.2.0 \
--instance-type m4.large \
--instance-count 2 \
--applications Name=Hadoop Name=Hive Name=Pig \
--region us-west-1 \
--service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName='tutorial 4'
Pig Examples (with S3)
- Download SalesJan2009.csv from https://drive.google.com/uc?export=download&id=1tP8AJGSgDXwI12r2Ap07GyamMj1o0iDD
- Upload the SalesJan2009.csv to s3
salesTable = LOAD 's3://spring2021tutorial4/SalesJan2009.csv' USING PigStorage(',') AS
(Transaction_date:chararray,
Product:chararray,
Price:chararray,
Payment_Type:chararray,
Name:chararray,
City:chararray,
State:chararray,
Country:chararray,
Account_Created:chararray,
Last_Login:chararray,
Latitude:chararray,
Longitude:chararray);
GroupByCountry = GROUP salesTable BY Country;
CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByCountry INTO 's3://spring2021tutorial4/pig_output_sales' USING PigStorage('\t');
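As a sanity check of what the pipeline above computes, here is a minimal Python sketch of the same group-and-count logic. The sample rows are invented for illustration; in the real job the rows come from SalesJan2009.csv:

```python
from collections import Counter

# Each tuple stands in for one CSV row; only the Country field
# (index 7 in the Pig schema above) matters for the count.
sample_rows = [
    ("1/2/09", "Product1", "1200", "Mastercard", "Alice", "Rome", "RM", "Italy", "", "", "", ""),
    ("1/2/09", "Product1", "1200", "Visa", "Bob", "Oslo", "", "Norway", "", "", "", ""),
    ("1/3/09", "Product2", "3600", "Visa", "Carol", "Milan", "MI", "Italy", "", "", "", ""),
]

# GROUP salesTable BY Country followed by COUNT corresponds to a Counter.
counts = Counter(row[7] for row in sample_rows)

# CONCAT(Country, ':', count) produces "Country:count" strings.
output = sorted(f"{country}:{n}" for country, n in counts.items())
print(output)  # ['Italy:2', 'Norway:1']
```

The STORE statement then writes one such string per line to the S3 output directory.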
We can also submit Pig jobs using the WebUI:
- Save the above script as a file, e.g. pig_script.pig
- Upload pig_script.pig to S3
- AWS -> Amazon EMR -> Select your cluster -> Steps -> Add step
- Step type: Pig program
- Script S3 location: s3://spring2021tutorial4/pig_script.pig
- Check your result in s3://spring2021tutorial4/pig_output_sales
Reference:
https://www.guru99.com/introduction-to-pig-and-hive.html
Hive Examples (With HDFS)
- Download the data from https://github.com/RobinDong/hive-examples
- Scp the files salaries.csv and employees.csv to the cluster's master node
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -mkdir employee
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -mkdir salary
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -copyFromLocal employees.csv employee/
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -copyFromLocal salaries.csv salary/
Enter the hive shell:
[hadoop@ip-172-31-4-131 ~]$ hive
Hive Session ID = 58eda43a-a0da-4e36-abcb-22e353513478
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
Hive Session ID = 82f9e474-9ae9-4112-b7d6-798f5475703b
hive>
Create employee table:
create external table employee (
employee_id INT,
birthday DATE,
first_name STRING,
family_name STRING,
gender CHAR(1),
work_day DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\'")
stored as textfile
location '/user/hadoop/employee/';
Create salary table:
create external table salary (
employee_id INT,
salary INT,
start_date DATE,
end_date DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\'"
)
stored as textfile
location '/user/hadoop/salary/';
Queries:
select * from employee order by work_day asc limit 10;
select * from employee order by birthday asc limit 5;
select first_name, family_name, work_day from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';
select e.first_name, e.family_name, avg(s.salary) as avg_salary
from employee as e join salary as s
on (e.employee_id = s.employee_id)
group by e.first_name, e.family_name
order by avg_salary
limit 20;
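The join-and-aggregate query above can be mimicked in a few lines of Python to make its semantics concrete. The tiny in-memory tables here are invented for illustration only, not taken from the real dataset:

```python
from collections import defaultdict

# (employee_id, first_name, family_name) -- stands in for the employee table
employees = [(1, "Georgi", "Facello"), (2, "Bezalel", "Simmel")]
# (employee_id, salary) -- stands in for the salary table (one row per pay period)
salaries = [(1, 60000), (1, 62000), (2, 70000)]

# JOIN on employee_id, then GROUP BY (first_name, family_name).
by_name = defaultdict(list)
for emp_id, first, family in employees:
    for sal_id, salary in salaries:
        if emp_id == sal_id:
            by_name[(first, family)].append(salary)

# avg(salary) per name, then ORDER BY avg_salary ascending.
avg_salary = {name: sum(v) / len(v) for name, v in by_name.items()}
result = sorted(avg_salary.items(), key=lambda kv: kv[1])
print(result)  # [(('Georgi', 'Facello'), 61000.0), (('Bezalel', 'Simmel'), 70000.0)]
```

Hive distributes the same join, grouping, and ordering across the cluster as MapReduce (or Tez) jobs.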
select e.gender, avg(s.salary) as avg_salary
from employee as e join salary as s
on (e.employee_id = s.employee_id)
group by e.gender;