Setting Up High Availability for Critical Applications

Certain services on a grid must always be available and accessible to clients. These services are vital to the applications running on the grid and to the grid's ability to process SAS jobs. Examples include:
  • SAS Metadata Server
  • SAS object spawner
  • Platform Process Manager
  • Platform Grid Management Service
  • web application tier components
Configuring a grid that provides high availability for these services requires the following:
  • providing failover hosts for machines that run critical applications. Using multiple machines for critical functions eliminates a single point of failure for the grid.
  • providing a way to monitor the high-availability applications on the grid and to automatically restart a failed application on the same host or on a failover host if needed.
  • providing a way to direct the client to the failover host instead of the regular host. This can be done through software (DNS resolution) or hardware (a hardware load balancer), but only one of the two methods is used.
In normal operations, the following sequence takes place:
  1. The client determines that it needs to access a service on a machine in the grid.
  2. The client sends a query to the corporate DNS server. The DNS server looks up the address for the machine and returns that information to the client.
  3. The client uses the address to connect to the machine and use the application.
Figure: Normal Grid Operations
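The normal sequence can be sketched in a few lines of Python. This is a minimal illustration only, not part of any SAS or Platform client: the function names, the host name sasmeta.example.com, and the port in the usage comment are assumptions made for the example.

```python
import socket


def resolve_service_host(hostname, lookup=socket.gethostbyname):
    """Step 2: ask DNS for the address of the machine running the service.

    `lookup` is injectable so the sketch can be exercised without a real
    DNS server; by default it uses the system resolver.
    """
    return lookup(hostname)


def connect_to_service(hostname, port, lookup=socket.gethostbyname, timeout=5.0):
    """Steps 2 and 3: resolve the host name, then connect to the address."""
    address = resolve_service_host(hostname, lookup)
    # The client connects to whatever address DNS returned. It has no
    # knowledge of whether that is the main host or a failover host.
    return socket.create_connection((address, port), timeout=timeout)


# Hypothetical usage (host name and port are placeholders):
#   conn = connect_to_service("sasmeta.example.com", 8561)
```

Because the client only ever sees the address that the lookup returns, changing that answer is enough to send the client to a different machine.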
To provide business continuity for the application, a failover host must be provided for the critical services running in the grid environment. The failover host provides an alternative location for running the critical services and ensures that they remain available to the applications on the grid. In addition, both the main and failover machines must have access to a shared file server. This ensures that the application has access to the data required for operation, regardless of which machine is running the service.
The failover capability must also be automatic. EGO is configured to monitor the critical services running on the grid. If EGO detects that a service has failed or that the machine running it has gone down, it starts the service on the failover server automatically, which enables applications to continue running on the grid.
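The monitoring behavior described above can be sketched as a single monitoring pass. This is an illustrative model, not EGO's actual interface or configuration: ServiceState, ensure_running, and the is_healthy and start callbacks are hypothetical names. The sketch assumes that a restart is tried on the current host first and on the failover hosts afterward.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ServiceState:
    name: str
    hosts: Sequence[str]  # main host first, then failover hosts
    active_host: Optional[str] = None


def ensure_running(
    service: ServiceState,
    is_healthy: Callable[[str, str], bool],
    start: Callable[[str, str], bool],
) -> Optional[str]:
    """Run one monitoring pass and return the host now running the service."""
    current = service.active_host
    if current is not None and is_healthy(current, service.name):
        return current  # nothing to do

    # Try the current host first (a simple restart), then the other hosts
    # in their configured order.
    candidates = [current] if current is not None else []
    candidates.extend(h for h in service.hosts if h != current)

    for host in candidates:
        if start(host, service.name):
            service.active_host = host
            return host

    service.active_host = None  # no host could start the service
    return None
```

A real monitor would repeat this pass on a schedule. The single pass is shown so that the decision logic stays visible.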
However, once the application has started on the failover server, the client must have a way to know which server is running the application. There are two methods for accomplishing this:
  • Using a hardware load balancer. The load balancer serves as an intermediary between the client and the services running on the grid, which decouples the grid operation from the physical structure of the grid. When the client wants to connect to the service, it connects to the load balancer, which then directs the request to the machine that is running the service. The load balancer knows the addresses of both the main and failover machines, so it passes the request on to whichever of the machines is running the services. During normal operation, the request goes to the main machine. When failover occurs, EGO starts the services on the failover host, and the load balancer forwards connections to it (because it is now the host running the services).
    Figure: Grid Failover with a Load Balancer
  • DNS resolution. Once EGO starts the application on the failover server, it sends the address of the failover machine to the corporate DNS server. The entry for the application is updated in the server, so the next time a client requests a connection to the application, the DNS server returns the address of the failover machine.
    Figure: Grid Failover with EGO
    If you do not want EGO to directly update the corporate DNS, you can configure the DNS server to always forward requests for the machine's address to EGO. When EGO starts the application on the failover machine, EGO then returns the address of the new machine.
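The two methods can be contrasted with a small model. This is a sketch under stated assumptions, not the behavior of any particular load balancer or DNS product: the LoadBalancer and DnsTable classes, the host names, and the addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical.

```python
class LoadBalancer:
    """Knows every candidate host and forwards to one that is running the service."""

    def __init__(self, backends, is_running):
        self.backends = list(backends)  # main host first, then failover hosts
        self.is_running = is_running    # callable: host -> bool

    def route(self):
        for host in self.backends:
            if self.is_running(host):
                return host
        raise RuntimeError("no backend is running the service")


class DnsTable:
    """Holds one address per service name; the entry is rewritten on failover."""

    def __init__(self):
        self.records = {}

    def update(self, name, address):
        self.records[name] = address

    def resolve(self, name):
        return self.records[name]


# Load balancer method: the client always targets the balancer, which
# picks the host that is currently running the service.
running = {"main-host"}
lb = LoadBalancer(["main-host", "failover-host"], lambda h: h in running)
before = lb.route()
running = {"failover-host"}  # failover has occurred
after = lb.route()

# DNS method: the record is updated, and later lookups return the new address.
dns = DnsTable()
dns.update("sasmeta.example.com", "192.0.2.10")  # main host
dns.update("sasmeta.example.com", "192.0.2.20")  # updated after failover
```

In the load balancer model the client-facing address never changes. In the DNS model the answer to the lookup changes, so a client sees the new address only the next time it resolves the name.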
The choice of whether to use a load balancer or a DNS solution depends on your organization’s policies. Using DNS resolution saves you from having to purchase an additional piece of hardware (the load balancer). However, your organization’s policies might prohibit the corporate DNS from being changed by an outside DNS (EGO), or might prohibit DNS requests from being forwarded to an outside DNS. If this is the case, the hardware load balancer provides a high-availability solution.