~ Tutorial 4 ~


Pig and Hive

Hive or Pig?

  • Similarities:

    • Same Motivation:
      • Allow non-Java programmers easy access to data stored in Hadoop clusters.
      • Easy to learn and easy to use
      • Immediately makes data on the cluster accessible to more people.
    • Open-source project managed by the Apache Software Foundation.
  • Differences:

    • Hive
      • Developed by Facebook
      • Metastore database required, for storing:
        • Table definitions (name, columns, data types, etc.)
        • Information on where the table data is stored in HDFS
        • Row format of the files in the table
      • HiveQL is SQL-like and familiar to many developers (declarative)
      • Data structure is predefined
      • JDBC driver available (limited functionality)
      • Hive is not an RDBMS! Not all SQL features are supported
    • Pig
      • Developed by Yahoo!
      • No Metastore or changes to the cluster (just a client).
      • Pig Latin, a simple data flow (pipeline) language
      • Just an interpreter running on the client machine that turns Pig Latin into MapReduce jobs and submits them to the cluster
      • Data structure is defined at execution time

Examples

Follow the Amazon EMR or Google Cloud DataProc documentation, or IERG4300Fall2020Tutorial8, for instructions on launching a multi-node Hadoop cluster.

Note that the user name and password for IERG4300Fall2020Tutorial8 are:

User: ierg4300

Password: fall2020ierg

In this tutorial, we demonstrate how to launch a multi-node Hadoop cluster with Pig and Hive installed, and then show some examples of using each. The cluster can be created with the AWS CLI as follows:

aws emr create-cluster \
--release-label emr-6.2.0 \
--instance-type m4.large  \
--instance-count 2 \
--applications Name=Hadoop Name=Hive Name=Pig \
--region us-west-1 \
--service-role EMR_DefaultRole \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName='tutorial 4'

Pig Examples (with S3)

salesTable = LOAD 's3://spring2021tutorial4/SalesJan2009.csv' USING PigStorage(',') AS 
    (Transaction_date:chararray,
    Product:chararray,
    Price:chararray,
    Payment_Type:chararray,
    Name:chararray,
    City:chararray,
    State:chararray,
    Country:chararray,
    Account_Created:chararray,
    Last_Login:chararray,
    Latitude:chararray,
    Longitude:chararray);

GroupByCountry = GROUP salesTable BY Country;

CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));

STORE CountByCountry INTO 's3://spring2021tutorial4/pig_output_sales' USING PigStorage('\t');
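For comparison, the same per-country count can be written declaratively in HiveQL. The sketch below is illustrative only: the table name `sales`, its DDL, and the S3 directory are assumptions (they are not part of the original script), and the CSV would need to sit in its own S3 directory for the external table to pick it up.

```sql
-- Hypothetical external table over the same sales CSV (all columns as strings)
create external table sales (
    transaction_date STRING, product STRING, price STRING,
    payment_type STRING, name STRING, city STRING, state STRING,
    country STRING, account_created STRING, last_login STRING,
    latitude STRING, longitude STRING)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile
location 's3://spring2021tutorial4/sales/';

-- Declarative equivalent of the GROUP ... FOREACH ... COUNT pipeline above
select country, count(*) from sales group by country;
```

Note the contrast: the Pig script spells out each step of the pipeline (LOAD, GROUP, FOREACH, STORE), while the HiveQL version just states the result wanted and lets Hive plan the job.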

We can also submit Pig jobs using the WebUI:

  • Save the above script as a file, e.g. pig_script.pig
  • Upload pig_script.pig to s3
  • AWS -> Amazon EMR -> Select your cluster -> Steps -> Add step
    • Step type: Pig program
    • Script S3 location: s3://spring2021tutorial4/pig_script.pig (the script you uploaded, not the data file)
  • Check your result in s3://spring2021tutorial4/pig_output_sales

Reference:

https://www.guru99.com/introduction-to-pig-and-hive.html

Hive Examples (with HDFS)

[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -mkdir employee
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -mkdir salary
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -copyFromLocal employees.csv employee/
[hadoop@ip-172-31-4-131 ~]$ hdfs dfs -copyFromLocal salaries.csv salary/

Enter the hive shell:

[hadoop@ip-172-31-4-131 ~]$ hive
Hive Session ID = 58eda43a-a0da-4e36-abcb-22e353513478

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j2.properties Async: true
Hive Session ID = 82f9e474-9ae9-4112-b7d6-798f5475703b
hive>
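Before creating tables, it is worth sanity-checking the session with a couple of standard HiveQL commands (a new cluster will typically show only the `default` database and no tables):

```sql
-- List available databases and the tables in the current one
show databases;
show tables;
-- Confirm which database the session is using (Hive starts in `default`)
select current_database();
```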

Create employee table:

create external table employee (
            employee_id INT,
            birthday DATE,
            first_name STRING,
            family_name STRING,
            gender CHAR(1),
            work_day DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties (
    "separatorChar" = ",",
    "quoteChar" = "\'")
stored as textfile
location '/user/hadoop/employee/';

Create salary table:

create external table salary (
            employee_id INT,
            salary INT,
            start_date DATE,
            end_date DATE)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
            "separatorChar" = ",",
            "quoteChar"     = "\'"
)
stored as textfile
location '/user/hadoop/salary/';
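Since external tables only point at existing HDFS directories, it is worth verifying the definitions before running real queries. A minimal check (column names match the DDL above):

```sql
-- Show the full table definition, including HDFS location and SerDe
describe formatted employee;
describe formatted salary;

-- Spot-check that the CSV rows are parsed into the expected columns
select * from employee limit 3;
select * from salary limit 3;
```

One caveat: OpenCSVSerde stores every column as a string internally, regardless of the declared types, so comparisons on fields like `work_day` or `salary` behave as string comparisons unless the values are cast.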

Queries:

select * from employee order by work_day asc limit 10;

select * from employee order by birthday asc limit 5;

select first_name, family_name, work_day from employee where work_day >= '1990-01-01' and work_day <= '1990-01-31';

select e.first_name, e.family_name, avg(s.salary) as avg_salary 
from employee as e join salary as s 
    on (e.employee_id == s.employee_id)
group by e.first_name, e.family_name 
order by avg_salary 
limit 20;

select e.gender, avg(s.salary) as avg_salary
from employee as e join salary as s 
    on (e.employee_id == s.employee_id)
group by e.gender;
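Behind the scenes, Hive compiles each of these queries into MapReduce (or Tez) jobs, much like Pig does with Pig Latin. You can inspect the generated plan with EXPLAIN; the exact stages shown depend on the execution engine and Hive version:

```sql
-- Show the execution plan Hive generates for the join + aggregation
explain
select e.gender, avg(s.salary) as avg_salary
from employee as e join salary as s
    on (e.employee_id == s.employee_id)
group by e.gender;
```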

Reference:

https://github.com/RobinDong/hive-examples