High Availability and SAS Grid Manager :: Grid Computing in SAS(R) 9.4, Third Edition

Your organization might have services and long-running SAS programs that are critical to your operations. The services must be available at all times, even if the servers that are running them become unavailable. The SAS programs must complete in a timely manner, even if something happens to cause them to fail. For a SAS program that takes a long time to run, this means that the program should not have to restart from the beginning if it ends prematurely.

SAS Grid Manager provides high availability through these capabilities:

Multi-machine architecture. Because of how a SAS grid is configured and operates, there is no single point of failure. Jobs are processed on the available grid nodes, so if a node becomes unavailable, other nodes can take over the workload.
Platform Suite for SAS. The default configuration of Platform Suite for SAS provides high availability for the grid operation. The LSF master daemon runs on a specified grid node (usually the grid control server), and a failover node is also identified. If the master daemon node fails, the failover node automatically takes over and broadcasts to the rest of the grid. The grid recognizes the new master daemon node and continues operation without interruption. Platform PM and GMS must be treated as critical services and configured for failover along with all other critical services.
Critical service failover. Certain services and processes are critical to the operation of SAS applications on the grid and must always be available (for example, the SAS Metadata Server). After providing a failover host for the service, you can use Platform Computing’s Enterprise Grid Orchestrator (EGO) to monitor the service, restart the service if it stops, and start the service on the failover host when needed. Once the service has started on the failover host, you can use either hardware (a load balancer) or software (EGO) to automatically direct clients to the failover host. EGO is part of Platform Suite for SAS, which is included with SAS Grid Manager, and it is installed as part of the LSF installation process.
Automatic SAS program failover. If a long-running SAS job fails before completion, rerunning it from the beginning can cause a loss of productivity. You can use the SAS Grid Manager Client Utility to specify that the job is restartable. A restartable job that fails resumes from the last successful procedure, DATA step, or labeled section instead of from the beginning. This capability uses the SAS checkpoint and restart functions so that failed jobs can complete without rerunning work that has already finished. You can also use attributes on the queue definitions in the grid to automatically restart and requeue any job that ends with a specified return code or that terminates due to host failure. Using these options together helps ensure that critical SAS programs complete successfully and in a timely manner, even if they encounter problems.
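The combination of checkpoint-restart and automatic requeue described in the last item can be pictured with a small conceptual model. The following Python sketch is not SAS code and does not use any SAS Grid Manager or LSF interface. Every name in it (run_with_checkpoints, requeue_until_done, the checkpoint file, and the three step names) is hypothetical and exists only for illustration. It assumes that a job is a sequence of named steps, that each completed step is recorded in a checkpoint file, and that a failed job is resubmitted a limited number of times.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, checkpoint_path):
    """Run named steps in order, skipping steps already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, func in steps:
        if name in done:
            continue  # completed in an earlier attempt, so do not repeat it
        func()  # may raise; progress recorded so far stays in the file
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)


def requeue_until_done(steps, checkpoint_path, max_attempts=3):
    """Resubmit the job after a failure; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_with_checkpoints(steps, checkpoint_path)
            return attempt
        except RuntimeError:
            continue  # stands in for a requeue after a failure
    raise RuntimeError("job did not complete within max_attempts")


def demo():
    """Simulate a three-step job whose second step fails once."""
    log = []
    failures_left = {"transform": 1}

    def make_step(name):
        def step():
            log.append(name)
            if failures_left.get(name, 0) > 0:
                failures_left[name] -= 1
                raise RuntimeError("simulated host failure in " + name)
        return step

    steps = [(n, make_step(n)) for n in ("extract", "transform", "load")]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.json")
        attempts = requeue_until_done(steps, path)
    return attempts, log
```

In this model, calling demo() completes the job on the second attempt. The first step runs only once because the checkpoint file records it as finished before the simulated failure, which is the behavior that the restartable-job and requeue options are intended to provide together.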

All of these strategies are independent of one another, so you can implement the ones that provide the greatest benefit to your organization.