Restarting Jobs

Overview

An essential component of a highly available grid is the ability to handle SAS jobs that fail or have to be restarted for some reason. If a long-running job fails, it can cause a significant loss of productivity. After the failure is noticed, you must manually resubmit the job and wait while the program starts over again from the beginning. For SAS programs that run for a considerable amount of time, this can cause unacceptable delays.
The SAS Grid Manager Client Utility, combined with LSF queue policies and the SAS checkpoint restart feature, provides two solutions to this problem:
  • the capability to restart a job from the last successful job step
  • the ability to set up a special queue to automatically send failed jobs to another host in the grid to continue execution

Using SAS Checkpoint and Label Restart

The SAS Grid Manager Client Utility includes options that enable you to restart SAS programs from the last successful PROC or DATA step. When the program runs, it records information about the SAS procedures and DATA steps or labels in the program and tracks the ones that have been passed during execution.
If the program fails and has to be restarted, SAS first executes global statements and macros. Then, it reads the checkpoint or label library to determine which checkpoints or labels have been passed. When SAS determines where the program stopped, execution is resumed from that point. Program steps that have already successfully completed are not re-executed.
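The restart behavior described above can be pictured with a small toy model. This is only an illustrative sketch, not how SAS implements checkpointing: the function name `run_with_checkpoints`, the `completed` list standing in for the checkpoint library, and the `fail_at` parameter are all hypothetical, and the re-execution of global statements and macros is not modeled.

```python
def run_with_checkpoints(steps, completed, fail_at=None):
    """Run named steps in order, skipping any already recorded in `completed`.

    steps:     list of (name, callable) pairs, standing in for PROC/DATA steps.
    completed: list used as the checkpoint record; it is updated in place.
    fail_at:   optional step name that raises, to simulate a job failure.
    Returns the names of the steps executed during this run.
    """
    executed = []
    for name, action in steps:
        if name in completed:
            continue  # step finished in an earlier run, so it is not re-executed
        if name == fail_at:
            raise RuntimeError(f"step {name} failed")
        action()
        completed.append(name)  # record the checkpoint only after success
        executed.append(name)
    return executed


if __name__ == "__main__":
    steps = [("data1", lambda: None), ("proc1", lambda: None), ("proc2", lambda: None)]
    checkpoints = []
    try:
        run_with_checkpoints(steps, checkpoints, fail_at="proc1")
    except RuntimeError as err:
        print("first run stopped:", err)
    # The restarted run resumes at the step that failed.
    print("restart executed:", run_with_checkpoints(steps, checkpoints))
```

In this model the failed step itself runs again on restart, because its checkpoint is written only after the step completes.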
The restart capability is available on the grid only if you are using the SAS Grid Manager Client Utility or scheduling grid jobs. It is not available if you are using other application interfaces to submit work to the grid.
If you use the restart options, your SAS WORK library must be on shared storage. Using this capability adds some overhead to your SAS program, so it is not recommended for every SAS program that you run.
To set up the checkpoint or label restart capability, use the SAS Grid Manager Client Utility to submit the SAS program to the grid. Specify either the GRIDRESTARTOK argument (for checkpoints) or the GRIDLRESTARTOK argument (for labels). You cannot specify both arguments.
When you use the GRIDRESTARTOK argument, these options are automatically added to your SAS program:
STEPCHKPT
enables checkpoint mode and causes SAS to record checkpoint-restart data.
STEPRESTART
enables restart mode, ensuring that execution resumes at the proper checkpoint.
When you use the GRIDLRESTARTOK argument, these options are automatically added to your SAS program:
LABELCHKPT
enables checkpoint mode for labeled code sections.
LABELRESTART
enables restart mode, ensuring that execution resumes at the proper labeled section.
Other options are automatically added to control restart mode. See “Checkpoint Mode and Restart Mode” in SAS Language Reference: Concepts for a list of options and their definitions as well as complete information about enabling checkpoint restart mode in your SAS programs.
If the host that is running the job becomes unresponsive, the program is automatically restarted at the last checkpoint.
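As a rough illustration of choosing between the two restart arguments, the following sketch assembles a submission command line as a list of arguments. It rests on assumptions that are not stated in this section: that the client utility executable is named `sasgsub` and that the program is passed with `-GRIDSUBMITPGM`. The helper name `build_submit_command` and the program path are hypothetical. The sketch only builds the argument list and does not run anything.

```python
def build_submit_command(program, restart=None):
    """Return an argument list for submitting `program` to the grid.

    restart: None (no restart support), "checkpoint" for -GRIDRESTARTOK,
             or "label" for -GRIDLRESTARTOK. Taking a single value makes it
             impossible to request both arguments at once.
    """
    restart_flags = {
        None: [],
        "checkpoint": ["-GRIDRESTARTOK"],
        "label": ["-GRIDLRESTARTOK"],
    }
    if restart not in restart_flags:
        raise ValueError("restart must be None, 'checkpoint', or 'label'")
    # Assumption: executable name and -GRIDSUBMITPGM are not taken from this section.
    return ["sasgsub", "-GRIDSUBMITPGM", program] + restart_flags[restart]


if __name__ == "__main__":
    # Hypothetical program path, for illustration only.
    print(" ".join(build_submit_command("/shared/jobs/nightly.sas", restart="checkpoint")))
```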

Setting Up Automatic Job Requeuing

You can set up a queue that automatically requeues and redispatches any job that ends with a specified return code or terminates due to host failure. Job requeuing handles situations where the host or the system fails while the job is running, because any failed jobs are automatically dispatched to another node in the grid.
To use this functionality in a grid, you must use the SAS Grid Manager Client Utility and configure the SAS WORK library to reside on shared storage.
To set up a queue for automatic restart, follow these steps:
  1. Create a queue, including these two options in the queue definition:
    • REQUEUE_EXIT_VALUES=return_code_a return_code_b ... return_code_n. The return_code values are the job exit codes that you want to filter. Any job that exits with one of the specified codes is restarted.
      For example, REQUEUE_EXIT_VALUES=all ~0 ~1 requeues any job that ends with an exit code other than 0 (success) or 1 (warnings).
      Note: If you specify a return_code value higher than 255, LSF uses the modulus of the value with 256. For example, if SAS returns an exit code of 999, LSF sees that value as (999 mod 256) or 231. Therefore, you must specify a value of 231 on REQUEUE_EXIT_VALUES.
    • RERUNNABLE=YES. This specifies that jobs sent from this queue can be rerun if the host running them fails.
  2. Specify the queue that you created in step 1, either by modifying a grid server definition or by specifying the -GRIDJOBOPTS option.
    To create or modify a grid server definition, use the Server Manager plug-in in SAS Management Console. To specify the queue, specify “queue=<name_of_requeue_queue>” in the Additional Options field of the server definition.
    To use -GRIDJOBOPTS, submit the job using the -GRIDJOBOPTS queue=name_of_requeue_queue option.
  3. Submit the job to the requeue queue on the grid. You must use the SAS Grid Manager Client Utility to specify the -GRIDRESTARTOK option. Send the job to the requeue queue by using the server that you specified in step 2.
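The exit-code rules in step 1 can be checked with a short sketch. It is a simplified reading of the two forms shown above (a space-separated list of codes, and all followed by ~code exclusions), not a reimplementation of LSF's parser. The function names `lsf_exit_code` and `is_requeued` are hypothetical.

```python
def lsf_exit_code(sas_exit_code):
    """Exit code as LSF sees it: values above 255 wrap modulo 256."""
    return sas_exit_code % 256


def is_requeued(sas_exit_code, requeue_exit_values):
    """Check an exit code against a REQUEUE_EXIT_VALUES-style string.

    Handles only plain codes ("231 17") and "all" with "~code" exclusions.
    """
    code = lsf_exit_code(sas_exit_code)
    tokens = requeue_exit_values.split()
    excluded = {int(t[1:]) for t in tokens if t.startswith("~")}
    listed = {int(t) for t in tokens if t.isdigit()}
    if "all" in tokens:
        return code not in excluded
    return code in listed


if __name__ == "__main__":
    print(lsf_exit_code(999))                # the wrapped value to list in the queue
    print(is_requeued(999, "231"))           # matches the wrapped value
    print(is_requeued(999, "999"))           # the unwrapped value never matches
    print(is_requeued(1, "all ~0 ~1"))       # warnings are not requeued
    print(is_requeued(2, "all ~0 ~1"))       # any other nonzero code is requeued
```

Checking the wrapped value this way before editing the queue definition helps avoid listing a code that LSF never sees.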