Verifying the Platform Suite for SAS Environment

Verifying That LSF Is Running

After the installation and configuration process is complete, verify that all of the LSF daemons are running on each machine.
For Windows machines, log on to each machine in the grid and check the Services dialog box to verify that these services are running:
  • Platform LIM
  • Platform RES
  • Platform SBD
For UNIX machines, log on to each machine in the grid and run the ps command to check for processes that are running from a subdirectory of LSF_install_dir. An example command is:
ps -ef|grep LSF_install_dir
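The ps check can also be scripted so that it reports which daemons are absent. The following is a minimal sketch, not a supported tool. It assumes that the UNIX process names corresponding to the LIM, RES, and SBD services are lim, res, and sbatchd, and the function name missing_lsf_daemons is hypothetical.

```shell
# Read `ps -ef` output on standard input and print the name of each
# expected LSF daemon that does not appear in it.
# Assumption: the daemon process names are lim, res, and sbatchd.
missing_lsf_daemons() {
    ps_output=$(cat)
    for daemon in lim res sbatchd; do
        # Match the daemon name as the last path component of a command,
        # followed by a space or the end of the line.
        if ! printf '%s\n' "$ps_output" | grep -Eq "(^|[/ ])${daemon}( |\$)"; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Typical use on a grid machine:
#   ps -ef | missing_lsf_daemons
```

No output means that all three process names were found. The match is by name only, so an unrelated process with the same name would also satisfy the check.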
The daemons create log files that can help you to debug problems. The log files are located in the machine's LSF_install_dir\logs directory (Windows) or the shared LSF_TOP/log directory (UNIX). If the daemon does not have access to the share on UNIX, the log files are located in the /tmp directory.
If the daemons are not running, or if an LSF command fails, check the following:
  • Verify that the path to the LSF programs is in the PATH environment variable. For LSF 7, the path is LSF_install_dir/7.0/bin.
  • On UNIX machines, you might have to source the LSF_TOP/conf/profile.lsf file to set up the LSF environment.
  • A machine might not be able to access the configuration files. Verify that the machine has access to the shared directory that contains the binary and configuration files, defined by the LSF_ENVDIR environment variable. If the file server that shares the drive starts after the grid machine that is trying to access it, the daemons on the grid machine might not start. To work around this, add the LSF_GETCONF_TIMES environment variable to the system environment and set its value to the number of times that the daemon should try to access the share before quitting. The daemon makes one attempt every five seconds. For example, a value of 600 results in the node trying for 50 minutes ((600 * 5 seconds) / 60 seconds per minute) before quitting.
  • The license file might be invalid or missing. If LSF cannot find a license file, some daemons might not start or work correctly. Make sure that the license file exists, is properly referenced by the LSF_LICENSE_FILE parameter in the LSF_ENVDIR/conf/lsf.conf file, and is accessible by the daemons.
  • Some daemons might not be running. Restart the daemons on every machine in the grid using the lsfrestart command. If this command does not work, run the /etc/init.d/lsf restart command (UNIX) or use the Services Administration tool (Windows). In Services Administration, stop the SBD, RES, and LIM services (in that order), and then start the LIM, RES, and SBD services (in that order).
  • A grid machine might not be able to connect to the SAS grid control machine. The grid control machine, which is the LSF master host, is the first machine listed in the lsf.cluster.<cluster_name> file. Make sure that the daemons are running on the grid control machine and verify that the machines can communicate with each other.
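The LSF_GETCONF_TIMES arithmetic from the list above can be worked in either direction. The following sketch assumes only the five-second retry interval stated in the text. The function names are hypothetical and are not part of LSF.

```shell
# Minutes that a daemon keeps retrying for a given LSF_GETCONF_TIMES value,
# at one attempt every five seconds. Integer division truncates.
getconf_wait_minutes() {
    echo $(( $1 * 5 / 60 ))
}

# LSF_GETCONF_TIMES value needed to keep retrying for a given number of minutes.
getconf_times_for_minutes() {
    echo $(( $1 * 60 / 5 ))
}

getconf_wait_minutes 600        # prints 50
getconf_times_for_minutes 10    # prints 120
```

For example, if the file server can take up to 10 minutes to come up after a grid node starts, a value of 120 covers that window.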

Verifying LSF Setup

You must verify that all grid machine names are specified correctly in the LSF_ENVDIR/conf/lsf.cluster.<cluster_name> file and the resource is specified in the lsf.shared file. Follow these steps to make sure the configuration is correct:
  1. Log in as an LSF administrator on one of the machines in the grid, preferably the grid control server machine. The LSF administrator ID is listed in the lsf.cluster.<cluster_name> file under the line Administrators=username1 username2 ... usernameN.
  2. Run the command lsadmin ckconfig -v to check the LSF configuration files for errors.
  3. Run the command badmin ckconfig -v to check the batch configuration files for errors.
  4. Run the command lshosts to list all the hosts in LSF and to verify that all the hosts are listed with the proper resources.
  5. Run the command bhosts to list all the hosts in LSF's batch system. Verify that all hosts are listed. Make sure that the STATUS for all hosts is ok and that the MAX column shows the correct number of job slots for each host (the maximum number of jobs that the host can process at the same time).
  6. If you find any problems, correct the LSF configuration file and issue the commands lsadmin reconfig and badmin reconfig so that the daemons use the updated configuration files.
  7. If you added or removed hosts from the grid, restart the master batch daemon by issuing the command badmin mbdrestart. To restart everything, issue the lsfrestart command.
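The bhosts check in step 5 can be partly automated by filtering the command output. The following sketch assumes the default bhosts column layout, in which the first column is HOST_NAME, the second is STATUS, and the fourth is MAX. The function names are hypothetical.

```shell
# Read bhosts output on standard input and print "host status" for every
# host whose STATUS column is not "ok". The header line is skipped.
bhosts_not_ok() {
    awk 'NR > 1 && $2 != "ok" { print $1, $2 }'
}

# Read bhosts output on standard input and print "host max_slots" for
# every host, so that the values can be compared with the intended setup.
bhosts_slots() {
    awk 'NR > 1 { print $1, $4 }'
}

# Typical use on the grid control machine:
#   bhosts | bhosts_not_ok
#   bhosts | bhosts_slots
```

An empty result from bhosts_not_ok means that every listed host reports a status of ok. The filter cannot detect a host that is missing from the list entirely, so the host names still need to be compared with the lsf.cluster.<cluster_name> file.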

Verifying LSF Job Execution

Some problems occur only when you run jobs on the grid. To minimize and isolate these problems, you can run debug jobs on specific machines in the grid.
To submit the debug job, run the command bsub -I -m <host_name> set from the grid client machine, once for each grid node, replacing <host_name> with the name of the node. This command displays the environment for a job running on the remote machine and enables you to verify that a job runs on the machine.
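When the grid has many nodes, the debug job can be submitted in a loop. The following sketch wraps the bsub command shown above. The function name submit_debug_jobs and the DRY_RUN variable are hypothetical conveniences, and the sketch assumes that bsub returns a nonzero exit status when the interactive job fails.

```shell
# Submit the interactive debug job to each host named on the command line.
# With DRY_RUN=1, print the commands instead of running them.
submit_debug_jobs() {
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'bsub -I -m %s set\n' "$host"
        else
            # Assumption: a failed job makes bsub exit with a nonzero status.
            bsub -I -m "$host" set || printf 'debug job failed on %s\n' "$host" >&2
        fi
    done
}

# Preview the commands for two example host names:
#   DRY_RUN=1 submit_debug_jobs nodeA nodeB
```

Running the function in dry-run mode first makes it easy to confirm the host names before any jobs are submitted.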
If this job fails, run the bhist -l <job_id> command, where job_id is the ID of the test job. The output of the command includes the user name of the person who submitted the job, the submitted command, and all the problems that LSF encountered when executing the job. Some messages in the bhist output for common problems are:
Failed to logon user with password
specifies that the password in the Windows passwd.lsfuser file is invalid. Update the password using the lspasswd command.
Unable to determine user account for execution
specifies that the user does not have an account on the destination machine. This condition can occur when a Windows grid client submits a job to a UNIX grid node, because the Windows user has a domain prefixed to the user name. Correct this problem by making sure that the user has an account on the UNIX machines. Also, add the line LSF_USER_DOMAIN= to the Windows lsf.conf file to strip the domain from the user name.
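The two messages above can be looked up automatically in the bhist output. The following sketch searches for the exact message text quoted in this section and prints the corresponding remedy. The function name diagnose_bhist is hypothetical, and the sketch covers only these two messages.

```shell
# Read `bhist -l` output on standard input and print a suggested remedy
# for each known message that appears in it.
diagnose_bhist() {
    bhist_output=$(cat)
    case $bhist_output in
        *"Failed to logon user with password"*)
            echo "Invalid password in passwd.lsfuser: update it with lspasswd." ;;
    esac
    case $bhist_output in
        *"Unable to determine user account for execution"*)
            echo "No account on the destination machine: create one, or set LSF_USER_DOMAIN= in the Windows lsf.conf file." ;;
    esac
}

# Typical use, where 1234 stands in for the ID of the test job:
#   bhist -l 1234 | diagnose_bhist
```

No output means that neither message was found, in which case the full bhist output needs to be read directly.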