Run Hadoop Pig

The Run Hadoop Pig activity runs an Apache Pig script. Apache Pig is a platform that consists of a high-level language for expressing data analysis programs and the infrastructure for evaluating them. Pig programs can be parallelized, which enables them to handle very large data sets.

Run Hadoop Pig lets you select the Pig operations, enter any needed parameters, and have the system generate your Pig script automatically. You can run the Pig script on any Hadoop server.

Run Hadoop Pig works only on Hadoop File Servers. Both Apache Hadoop 1.x and 2.x are supported.

To set the Run Hadoop Pig options:

  1. Drag and drop the Run Hadoop Pig activity to the canvas.
  2. Right-click the Run Hadoop Pig activity and select Options.
  3. The Server Name field indicates the Hadoop server on which the file to be processed resides.
  4. Click the browse button ([...]) to go to the file to be processed.
  5. Select the file type. Run Hadoop Pig supports both delimited files and delimited sequence files.
  6. Select the delimiter and the text qualifier as appropriate.
  7. Click Add in the Fields section and add the fields that are present in the file to be processed. For sequence files, the first field is treated as the key and the remaining fields are treated as the delimited values.
  8. Select the Trim operation as desired. The Trim operation removes white space from the input field before the field is processed.
  9. Go to the Operations tab. Click Add to start adding the Pig operations to be performed on the file. This opens the Operations editor.
  10. Select an operation to be performed. The available operations are as follows:
    • Sort - Sorts the data in alphabetical order.
    • Filter - Filters data according to your requirements.
    • Aggregate - Performs statistical operations, such as Sum and Count, on the data.
    • Distinct - Selects all unique records from the specified field.
    • Limit - Limits the number of records processed to a specified number.
  11. Use the Move Up and Move Down buttons to change the order of operations.
  12. Once you have selected the operations and entered the required input for processing the operations, click Add to save your selection and return to the Pig options editor.
  13. The Pig Script is automatically generated based on the selected operations.
    • The editor allows you to override the generated Pig script with your own script, as needed. Click the Edit Script option and enter your own script in the Pig Script text box. This enables the Regenerate button. To return to the system-generated script, click Regenerate in the Pig Script section.
    • To validate script syntax, click the Validate button.
  14. You can specify the output file under the Variables tab. The output file can be used by subsequent activities.
  15. Click OK to save the Pig Script. By default, the output file type is the same as the input file type. You can change this by editing the generated Pig script.
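
The script that the activity generates is not reproduced in this topic. As a rough illustration only, the following Python sketch shows how an ordered list of the operations above could be turned into Pig Latin statements. The function name build_pig_script, the step0, step1, ... aliases, the tuple format for operations, and the file paths are all hypothetical and are not part of the product; the actual generated script may look quite different. Trim and sequence file handling are left out.

```python
def build_pig_script(input_path, output_path, delimiter, fields, operations):
    """Return a Pig Latin script that applies `operations` in the given order.

    fields:     list of (name, pig_type) pairs, e.g. [("name", "chararray")].
    operations: list of (kind, argument) pairs (hypothetical format):
                ("sort", field), ("filter", condition), ("distinct", field),
                ("aggregate", (function, field)), ("limit", count).
    """
    schema = ", ".join(f"{name}:{pig_type}" for name, pig_type in fields)
    lines = [
        f"step0 = LOAD '{input_path}' USING PigStorage('{delimiter}') AS ({schema});"
    ]

    for i, (kind, arg) in enumerate(operations, start=1):
        prev, cur = f"step{i - 1}", f"step{i}"
        if kind == "sort":
            lines.append(f"{cur} = ORDER {prev} BY {arg} ASC;")
        elif kind == "filter":
            lines.append(f"{cur} = FILTER {prev} BY {arg};")
        elif kind == "distinct":
            # DISTINCT works on whole records, so project the field first.
            # Later steps then see only this one field.
            lines.append(f"{cur}_proj = FOREACH {prev} GENERATE {arg};")
            lines.append(f"{cur} = DISTINCT {cur}_proj;")
        elif kind == "aggregate":
            # Aggregate over the whole relation, e.g. ("SUM", "amount").
            func, field = arg
            lines.append(f"{cur}_grp = GROUP {prev} ALL;")
            lines.append(
                f"{cur} = FOREACH {cur}_grp GENERATE {func}({prev}.{field});"
            )
        elif kind == "limit":
            lines.append(f"{cur} = LIMIT {prev} {int(arg)};")
        else:
            raise ValueError(f"unsupported operation: {kind}")

    last = f"step{len(operations)}"
    lines.append(f"STORE {last} INTO '{output_path}' USING PigStorage('{delimiter}');")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: keep large amounts, sort by name, and keep the first 10 records.
    print(
        build_pig_script(
            "/data/sales.csv",          # hypothetical input file
            "/data/sales_out",          # hypothetical output location
            ",",
            [("name", "chararray"), ("amount", "double")],
            [("filter", "amount > 100.0"), ("sort", "name"), ("limit", 10)],
        )
    )
```

Because each statement reads from the alias produced by the previous one, changing the order of the list changes the result, which is why the Move Up and Move Down buttons matter: applying Limit before Sort, for example, sorts only the records that were kept.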