After the installation and configuration process is complete, verify that all of the LSF daemons are running on each machine.
For Windows machines, log on to each machine in the grid and check the Services dialog box to verify that the LSF LIM, RES, and SBD services are running.
For UNIX machines, log on to each machine in the grid and run the ps command to check for processes that are running from a subdirectory of LSF_install_dir. For example:

ps -ef | grep LSF_install_dir
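The same check can be scripted so that it reports which daemons are missing instead of leaving you to read the ps output by eye. The following Python sketch assumes that the UNIX daemon processes are named lim, res, and sbatchd (the process behind the SBD service), and that the command path is the eighth column (CMD) of ps -ef output. The function names are hypothetical helpers, not part of LSF.

```python
import subprocess
from typing import List, Set

# Daemons expected on every grid machine. mbatchd runs only on the master
# host, so it is deliberately not in this list.
REQUIRED_DAEMONS = ("lim", "res", "sbatchd")


def running_lsf_daemons(ps_output: str, install_dir: str) -> Set[str]:
    """Return the required daemon names found in `ps -ef` output whose
    command path lies under install_dir (the same filter as the grep)."""
    found = set()
    for line in ps_output.splitlines():
        if install_dir not in line:
            continue  # not a process started from the LSF installation
        fields = line.split()
        if len(fields) < 8:
            continue  # not a full `ps -ef` line
        # `ps -ef` columns: UID PID PPID C STIME TTY TIME CMD
        executable = fields[7].rsplit("/", 1)[-1]
        if executable in REQUIRED_DAEMONS:
            found.add(executable)
    return found


def missing_lsf_daemons(install_dir: str) -> List[str]:
    """Run `ps -ef` on this machine and list required daemons that are absent."""
    result = subprocess.run(["ps", "-ef"], capture_output=True, text=True, check=True)
    found = running_lsf_daemons(result.stdout, install_dir)
    return [name for name in REQUIRED_DAEMONS if name not in found]
```

For example, with a hypothetical installation in /opt/lsf, missing_lsf_daemons("/opt/lsf") returns an empty list when lim, res, and sbatchd are all running. The grep process itself also matches the install directory, but it is ignored because its command name is not a daemon name.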
The daemons create log files that can help you debug problems. The log files are located in the machine's LSF_install_dir\logs directory (Windows) or in the shared LSF_TOP/log directory (UNIX). On UNIX, if a daemon cannot access the share, it writes its log files to the /tmp directory instead.
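When you script the troubleshooting, the log location rule above can be encoded in one small function. This Python sketch is an assumption-laden approximation: lsf_log_dir is a hypothetical helper, and it tests the share with os.access for the user running the script, which only approximates whether the daemon itself can write there.

```python
import ntpath
import os
import posixpath


def lsf_log_dir(install_dir: str, lsf_top: str, windows: bool) -> str:
    """Return the directory in which to look for the daemon log files."""
    if windows:
        # Windows: each machine keeps its own logs in LSF_install_dir\logs.
        return ntpath.join(install_dir, "logs")
    shared = posixpath.join(lsf_top, "log")
    # UNIX: logs normally go to the shared LSF_TOP/log directory. If the share
    # is missing or not writable, the daemons fall back to /tmp. This check
    # uses the current user's access, not the daemon's.
    if os.path.isdir(shared) and os.access(shared, os.W_OK):
        return shared
    return "/tmp"
```

If the function returns /tmp on a UNIX machine, that is itself a hint that the machine cannot reach the shared LSF_TOP directory.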
If any of the daemons are not running, check the following:
- Verify that the directory that contains the LSF programs is in the PATH environment variable. For LSF 7, this directory is LSF_install_dir/7.0/bin.
- On UNIX machines, you might have to source the LSF_TOP/conf/profile.lsf file to set up the LSF environment.
- A machine might not be able to access the configuration files. Verify that the machine has access to the shared directory that contains the binary and configuration files, which is defined by the LSF_ENVDIR environment variable. If the file server that shares the drive starts after the grid machine that is trying to access the drive, the daemons on that machine might not start. In that case, add the LSF_GETCONF_TIMES environment variable to the system environment and set its value to the number of times the daemon should try to access the share, one attempt every five seconds, before it quits. For example, a value of 600 makes the node keep trying for 50 minutes (600 attempts * 5 seconds = 3000 seconds, and 3000 / 60 = 50 minutes) before quitting.
- The license file might be invalid or missing. If LSF cannot find a license file, some daemons might not start or might not work correctly. Make sure that the license file exists, is correctly referenced by the LSF_LICENSE_FILE parameter in the LSF_ENVDIR/conf/lsf.conf file, and is accessible to the daemons.
- Some daemons might not be running. Restart the daemons on every machine in the grid by using the lsfrestart command. If this command does not work, run the /etc/init.d/lsf restart command (UNIX) or use the Services Administration tool (Windows). In Services Administration, stop the SBD, RES, and LIM services, in that order, and then start the LIM, RES, and SBD services, in that order.
- A grid machine might not be able to connect to the SAS grid control machine, which is the first machine listed in the lsf.cluster.<cluster_name> file. Make sure that the daemons are running on this master host, and verify that the machines can communicate with each other.
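Two items in the checklist above involve simple arithmetic or lookups that are easy to get wrong by hand: choosing a value for LSF_GETCONF_TIMES and confirming that the LSF bin directory is really on PATH. The Python sketch below assumes, as the text describes, one access attempt every five seconds. The function names are hypothetical helpers, not LSF commands.

```python
import math
import os
from typing import Optional

# One access attempt per five-second interval, as described above.
RETRY_INTERVAL_SECONDS = 5


def retry_window_minutes(getconf_times: int) -> float:
    """How many minutes a daemon keeps retrying for a given LSF_GETCONF_TIMES."""
    return getconf_times * RETRY_INTERVAL_SECONDS / 60


def getconf_times_for(minutes: float) -> int:
    """Smallest LSF_GETCONF_TIMES value that covers the desired wait."""
    return math.ceil(minutes * 60 / RETRY_INTERVAL_SECONDS)


def lsf_bin_on_path(bin_dir: str, path: Optional[str] = None) -> bool:
    """True if bin_dir (for LSF 7, LSF_install_dir/7.0/bin) is a PATH entry.

    Entries are normalized so that trailing separators and, on Windows,
    letter case do not cause false negatives.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(bin_dir))
    return any(
        os.path.normcase(os.path.normpath(entry)) == wanted
        for entry in path.split(os.pathsep)
        if entry
    )
```

For instance, if the file server can take up to 30 minutes to come up, getconf_times_for(30) gives 360, and retry_window_minutes(600) confirms the 50-minute figure from the example above.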
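The license and grid control machine checks both come down to reading LSF configuration files. The following Python sketch shows one way to automate them under simplifying assumptions: lsf.conf is treated as KEY=VALUE lines with # comments, LSF_LICENSE_FILE is assumed to hold a single file path, and the Host section of lsf.cluster.<cluster_name> is assumed to start with a column-header line followed by one host per line. All function names are hypothetical, and the file checks run with the current user's permissions rather than the daemons'.

```python
import os
from typing import Dict, Optional


def read_lsf_conf(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines from lsf.conf, skipping comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip().strip('"')
    return params


def license_problem(conf: Dict[str, str]) -> Optional[str]:
    """Describe what is wrong with LSF_LICENSE_FILE, or return None if it looks usable."""
    path = conf.get("LSF_LICENSE_FILE")
    if not path:
        return "LSF_LICENSE_FILE is not set in lsf.conf"
    if not os.path.isfile(path):
        return f"license file not found: {path}"
    if not os.access(path, os.R_OK):
        return f"license file not readable: {path}"
    return None


def grid_control_host(cluster_text: str) -> Optional[str]:
    """Return the first host listed in the Host section of lsf.cluster.<cluster_name>."""
    in_host_section = False
    seen_header = False
    for raw in cluster_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        words = line.split()
        if not in_host_section:
            in_host_section = [w.lower() for w in words[:2]] == ["begin", "host"]
            continue
        if words[0].lower() == "end":
            return None  # Host section has no hosts
        if not seen_header:
            seen_header = True  # column header, e.g. HOSTNAME model type ...
            continue
        return words[0]
    return None
```

Once grid_control_host returns a name, you can confirm from each grid machine that the host is reachable before looking further at the daemons.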