Pig Eval Functions#

Eval Functions#

Function Description
AVG() To compute the average of the numerical values within a bag.
BagToString() To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).
CONCAT() To concatenate two or more expressions of same type.
COUNT() To get the number of elements in a bag, while counting the number of tuples in a bag.
COUNT_STAR() It is similar to the COUNT() function. It is used to get the number of elements in a bag.
DIFF() To compare two bags (fields) in a tuple.
IsEmpty() To check if a bag or map is empty.
MAX() To calculate the highest value for a column (numeric values or chararrays) in a single-column bag.
MIN() To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.
PluckTuple() Using the Pig Latin PluckTuple() function, we can define a string Prefix and filter the columns in a relation that begin with the given prefix.
SIZE() To compute the number of elements based on any Pig data type.
SUBTRACT() To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.
SUM() To get the total of the numeric values of a column in a single-column bag.
TOKENIZE() To split a string (which contains a group of words) in a single tuple and return a bag which contains the output of the split operation.


grunt> AVG(expression)
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
grunt> student_group_all = Group student_details All;
grunt> Dump student_group_all;
grunt> student_age_avg = foreach student_group_all Generate
   (student_details.firstname, student_details.age), AVG(student_details.age);
grunt> Dump student_age_avg;


grunt> BagToString(vals:bag [, delimiter:chararray])
grunt> group_dob = Group student_details All;
grunt> Dump group_dob;
grunt> dob_string = foreach group_dob Generate BagToString(student_details);
grunt> Dump dob_string;


grunt> CONCAT (expression, expression, [...expression])
grunt> student_name_concat = foreach student_details Generate CONCAT (firstname, lastname);
grunt> Dump student_name_concat;
grunt> CONCAT(firstname, '_',lastname);
grunt> student_name_concat = foreach student_details GENERATE CONCAT(firstname, '_',lastname);
grunt> Dump student_name_concat;


grunt> DIFF (expression, expression)

vi ~/pig/emp_sales.txt


vi ~/pig/emp_bonus.txt

$ hdfs dfs -put ~/pig/emp_sales.txt hdfs://localhost:9000/pig_data/
$ hdfs dfs -put ~/pig/emp_bonus.txt hdfs://localhost:9000/pig_data/
grunt> emp_sales = LOAD 'hdfs://localhost:9000/pig_data/emp_sales.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
grunt> emp_bonus = LOAD 'hdfs://localhost:9000/pig_data/emp_bonus.txt' USING PigStorage(',')
   as (sno:int, name:chararray, age:int, salary:int, dept:chararray);
grunt> cogroup_data = COGROUP emp_sales by sno, emp_bonus by sno;
grunt> Dump cogroup_data;
grunt> diff_data = FOREACH cogroup_data GENERATE DIFF(emp_sales,emp_bonus);
grunt> Dump diff_data;


DEFINE pluck PluckTuple(expression1) 
DEFINE pluck PluckTuple(expression1,expression3) 

a = load 'a' as (x, y);
b = load 'b' as (x, y);
c = join a by x, b by x;
DEFINE pluck PluckTuple('a::');
d = foreach c generate FLATTEN(pluck(*));
describe c;
c: {a::x: bytearray,a::y: bytearray,b::x: bytearray,b::y: bytearray}
describe d;
d: {plucked::a::x: bytearray,plucked::a::y: bytearray}
DEFINE pluckNegative PluckTuple('a::','false');
d = foreach c generate FLATTEN(pluckNegative(*));
describe d;
d: {plucked::b::x: bytearray,plucked::b::y: bytearray}