Pigs eat anything! Pigs live anywhere! Pigs are domestic!…and most importantly, Pigs fly!!! Well…at least Apache Pig can! Apache Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. It can easily be extended to operate on data beyond files, including key/value stores, databases, etc. However, Pig scripts need some care to avoid errors. The usual problems we face while writing Pig scripts are failing jobs and never-ending jobs. Getting meaningful error information is a pain too!
How do we approach these problems? We are going to discuss some of them along with potential solutions. These are a few errors I have encountered while writing Pig scripts:
- Data Format/validation error
- Errors from UDF
- Memory warnings from UDF
Why do we need to worry about the above errors?
For the majority of Pig errors, the entire job execution will fail. If Pig catches an exception, it assumes that you are asking to quit everything, and it fails the task. Hadoop will then restart your task. If a particular task keeps failing until Hadoop's retry limit is reached (four attempts by default), Hadoop will not restart it again. Rather, it will kill all the other tasks and declare the expensive cluster job a failure.
Data Format/validation error
This error usually occurs because a few lines of data do not fit the pattern we defined in the script. While processing millions of records, we cannot rule out such misfits. These errors often do not show up while writing and testing the scripts on sample data. Simple data format errors like missing columns or a different date format can stop a Hadoop job.
Unfortunately, there is no ‘try catch’ block in Apache Pig to guard against these data format errors. We can handle them with two approaches:
- A UDF that does the data check, written in Java or Python
Implement a UDF that checks the schema/structure of your data and replaces misfit values with null or default values, based on your needs.
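- Using the SPLIT command to check a particular column value
Pig loads a field that is missing or cannot be cast to the declared type as null, so a null check separates the misfit records from the good ones:
A = LOAD 'data' AS (a0: int, a1: int, a2: int);
SPLIT A INTO bad IF a2 IS NULL, good IF a2 IS NOT NULL;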
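The first approach, a UDF that checks the data, can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names `to_int` and `clean_record`, the three-field layout, and the tab delimiter are hypothetical, and the wiring needed to register the function with Pig (such as Pig's outputSchema decorator for Python UDFs) is left out so the snippet runs on its own.

```python
def to_int(value, default=None):
    """Return value as an int, or default when it is missing or malformed."""
    if value is None:
        return default
    try:
        return int(str(value).strip())
    except ValueError:
        return default


def clean_record(line, n_fields=3, delimiter="\t", default=None):
    """Split one raw line and coerce it to exactly n_fields ints.

    Hypothetical helper: short records are padded with default,
    extra columns are dropped, and unparseable values become default.
    """
    parts = line.rstrip("\n").split(delimiter)
    # Pad short records and truncate long ones to the expected width.
    parts = (parts + [None] * n_fields)[:n_fields]
    return tuple(to_int(p, default) for p in parts)


if __name__ == "__main__":
    for raw in ["1\t2\t3", "1\tx", "1\t2\t3\t4"]:
        print(repr(raw), "->", clean_record(raw))
```

The point of the design is that a misfit record never raises an exception; it comes back with null (None) or a default in the bad positions, so the job keeps running and the bad rows can be filtered or counted later.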
Errors from UDF
When a UDF receives input it cannot handle, Apache Pig will still start the job and run the tasks. All the tasks will fail, and we will get an error message like:
ERROR 2078: Caught error from UDF: my.udf.demo [error message].
We can avoid this by having the UDF declare the input it expects, using the outputSchema method of the EvalFunc class. By default, Apache Pig tries to work out the UDF’s return type from the return type of its EvalFunc implementation, and passes whatever input the script indicates to the UDF. If the UDF implements outputSchema, Apache Pig passes it the schema of the input that the script has indicated, before the job runs. The UDF can then throw an error if that input schema does not match its expectations.
The book Programming Pig [1] gives an example of such an outputSchema method, for a UDF that expects two integers.
With this method added to our UDF, a call that passes anything other than integers (for example: chararray, int) will fail almost immediately with
java.lang.RuntimeException: Expected input of (int, int), but received schema (chararray, int).
This approach saves a lot of time, as Apache Pig checks the parameters passed to the UDF before it starts running a job.
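The book's listing is not reproduced here. As a rough, Pig-independent sketch of the same idea, the check boils down to comparing the received field types against the expected ones and failing fast on a mismatch. In a Java UDF this logic would sit inside EvalFunc's outputSchema(Schema input) method; in the sketch below, the function name `check_input_schema`, the use of plain type-name strings in place of Pig's Schema object, and the "long" return type are all assumptions made for illustration.

```python
def check_input_schema(received, expected=("int", "int")):
    """Fail fast when the received field types differ from the expected ones.

    received and expected are sequences of type names (a stand-in for
    Pig's Schema). Returns the declared output type, here a
    hypothetical "long".
    """
    received = tuple(received)
    expected = tuple(expected)
    if received != expected:
        raise RuntimeError(
            "Expected input of (%s), but received schema (%s)."
            % (", ".join(expected), ", ".join(received))
        )
    return "long"  # hypothetical return type of the UDF


if __name__ == "__main__":
    print(check_input_schema(("int", "int")))
    try:
        check_input_schema(("chararray", "int"))
    except RuntimeError as err:
        print(err)
```

Because the check only looks at types and never at data, it can run when the script is being planned, which is what lets the failure surface before any cluster time is spent.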
Memory warnings from UDF
Another common warning or error message concerns memory in a UDF, when the UDF needs more memory for processing than is available. For example, when we need to calculate the number of unique users per website from web logs, we need to perform a group-by operation.
In this case, it is better to use bags instead of tuples or maps, as the bag is the Apache Pig data type that is capable of spilling. Once a bag crosses the memory threshold, its data is spilled to disk. However, it is best to avoid spilling, as it is expensive.
These are some of my experiences of how NOT to choke the Pig :)
[1] Alan F. Gates, Programming Pig, O’Reilly Media (October 22, 2011)
This blog is written by Mohanapriya Jagannathan, Business Analyst at BRIDGEi2i