Your organization might
have services and long-running SAS programs that are critical to your
operations. The services must be available at all times, even if the
servers that are running them become unavailable. The SAS programs
must complete in a timely manner, even if something happens to cause
them to fail. For a SAS program that takes a long time to run, this
means that the program should not have to restart from the beginning
if it ends prematurely.
SAS Grid Manager provides
high availability through these capabilities:
-
Multi-machine architecture. Because of
how a SAS grid is configured and operates, there is no single point
of failure. Jobs are processed on the available grid nodes, so
if a node becomes unavailable, other nodes can take over the workload.
-
Platform Suite for SAS. The default
configuration of Platform Suite for SAS provides high availability
for the grid operation. The LSF master daemon runs on a specified
grid node (usually the grid control server), and a failover node is
also identified. If the master daemon node fails, the failover node
automatically takes over and broadcasts to the rest of the grid. The
grid recognizes the new master daemon node and continues operation
without interruption. Platform Process Manager (PM) and Grid Management
Services (GMS) must be treated as critical services and configured for
failover along with all other critical services.
-
Critical service failover. There
are certain services and processes that are critical to the operation
of SAS applications on the grid and that must always be available
(for example, the SAS Metadata Server). After providing a failover
host for the service, you can use Platform Computing’s Enterprise
Grid Orchestrator (EGO) to monitor the service, restart the service
if it stops, and start the service on the failover host when needed.
Once the service has started on the failover host, you can use either
hardware (a load balancer) or software (EGO) to automatically direct
clients to the failover host. EGO is part of Platform Suite for SAS,
which is included with SAS Grid Manager and is installed as part of
the LSF installation process.
-
Automatic SAS program failover.
If a long-running SAS job fails before completion, rerunning it from
the beginning can cause a loss of productivity. You can use the SAS
Grid Manager Client Utility to specify that the job is restartable.
This means that a failed job restarts from the last successful procedure,
DATA step, or labeled section. This capability uses the SAS checkpoint
and restart functions to enable failed jobs to complete without causing
delays. You can also use attributes on the queue definitions in the
grid to automatically restart and requeue any job that ends with a
specified return code or that terminates due to host failure. Using
these options together helps ensure that critical SAS programs complete
successfully and in a timely manner, even if they encounter problems.
All of these strategies
are independent of one another, so you can implement the ones that
provide the greatest benefit to your organization.
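
The restart behavior described under automatic SAS program failover can be pictured with a small, language-neutral sketch. The Python below is only a conceptual illustration of step-level checkpoint and restart. It is not how SAS or Platform Suite for SAS implements the feature, and the function name `run_with_checkpoint`, the JSON checkpoint file, and the step names are all hypothetical.

```python
import json
import os


def run_with_checkpoint(steps, checkpoint_path):
    """Run named steps in order, skipping any already recorded as complete.

    steps: list of (name, callable) pairs (hypothetical job steps).
    checkpoint_path: JSON file that stores the names of completed steps.
    Returns the names of the steps executed during this call.
    """
    completed = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            completed = json.load(f)

    executed = []
    for name, func in steps:
        if name in completed:
            continue  # finished in an earlier run, so it is not repeated
        func()  # if this raises, the checkpoint still lists the earlier steps
        completed.append(name)
        executed.append(name)
        # Record progress after every successful step.
        with open(checkpoint_path, "w") as f:
            json.dump(completed, f)
    return executed
```

If the second of three steps fails, calling the function again with the same checkpoint file skips the first step and resumes at the second, which mirrors the idea of a failed job not having to start over from the beginning.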