After the installation and configuration process is complete, verify that all of the LSF daemons are running on each machine.
For Windows machines, log on to each machine in the grid and check the Services dialog box to verify that the LSF services (LIM, RES, and SBD) are running.
For UNIX machines, log on to each machine in the grid and execute the ps command to check for processes that are running in a subdirectory of LSF_install_dir. An example command is:
ps -ef | grep LSF_install_dir
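The check above can be wrapped in a small function so that the grep process itself does not show up as a false match. This is a minimal sketch, not part of LSF: the function name lsf_procs is hypothetical, and the installation path is a placeholder that you pass as an argument.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and print only the
# lines that mention the given LSF installation directory.
lsf_procs() {
    install_dir=$1
    # -F treats the path as a fixed string; the second grep drops any
    # line that belongs to a grep process.
    grep -F -- "$install_dir" | grep -v 'grep'
}

# Usage (replace the placeholder with your LSF_install_dir):
#   ps -ef | lsf_procs /path/to/LSF_install_dir
```

If the function prints nothing, no process on that machine is running from the installation directory.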
The daemons create log files that can help you to debug problems. The log files are located in the machine's LSF_install_dir\logs directory (Windows) or the shared LSF_TOP/log directory (UNIX). If the daemon does not have access to the share on UNIX, the log files are located in the /tmp directory.
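On UNIX, the fallback rule above can be sketched as a short function that reports which directory to look in. The function name lsf_log_dir is hypothetical, and the test runs as the current user rather than as the daemon, so treat the result as a hint only.

```shell
#!/bin/sh
# Hypothetical helper: print the directory where this host's LSF daemon
# logs are expected. $1 is the shared LSF_TOP directory.
lsf_log_dir() {
    shared_log="$1/log"
    if [ -d "$shared_log" ] && [ -r "$shared_log" ]; then
        echo "$shared_log"
    else
        # The share is not reachable from this host, so fall back to /tmp.
        echo "/tmp"
    fi
}

# Usage (replace the placeholder with your LSF_TOP):
#   ls -lt "$(lsf_log_dir /path/to/LSF_TOP)"
```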
If the command fails, check the following:
- Verify that the path to the LSF programs is in the PATH environment variable. For LSF 7, the path is LSF_install_dir/7.0/bin.
- On UNIX machines, you might have to source the LSF_TOP/conf/profile.lsf file to set up the LSF environment.
- A machine might not be able to access the configuration files. Verify that the machine has access to the shared directory that contains the binary and configuration files, defined by the LSF_ENVDIR environment variable. If the file server that is sharing the drive starts after the grid machine that is trying to access the shared drive, the daemons on the machine might not start. Add the LSF_GETCONF_TIMES environment variable to the system environment and set its value to the number of times that you want the daemon to try accessing the share, at five-second intervals, before the daemon quits. For example, setting the variable to a value of 600 results in the node trying for 50 minutes ((600 * 5 seconds) / 60 seconds per minute) before quitting.
- The license file might be invalid or missing. If LSF cannot find a license file, some daemons might not start or work correctly. Make sure that the license file exists, is properly referenced by the LSF_LICENSE_FILE parameter in the LSF_ENVDIR/conf/lsf.conf file, and is accessible by the daemons.
- Some daemons might not be running. Restart the daemons on every machine in the grid using the lsfrestart command. If this command does not work, run the /etc/init.d/lsf restart command (UNIX) or use the Services Administration tool (Windows). In Services Administration, stop the SBD, RES, and LIM services (in that order), and then start the LIM, RES, and SBD services (in that order).
- A grid machine might not be able to connect to the SAS grid control machine. The grid control machine is the first machine listed in the lsf.cluster.<cluster_name> file. Make sure that the daemons are running on the grid control machine (the LSF master host) and verify that the machines can communicate with each other.
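Two of the checks in the list above can be sketched in shell: whether the LSF program directory is in PATH, and how long a given LSF_GETCONF_TIMES value keeps a node retrying. The function names path_has_dir and retry_minutes are hypothetical and are not part of LSF. The five-second interval is taken from the description above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if directory $1 is an entry in the
# colon-separated list $2 (defaults to the current PATH).
path_has_dir() {
    dir=$1
    list=${2:-$PATH}
    case ":$list:" in
        *":$dir:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print how many whole minutes a node keeps retrying
# for a given LSF_GETCONF_TIMES value, at one attempt every five seconds.
retry_minutes() {
    echo $(( $1 * 5 / 60 ))
}

# Usage (replace the placeholder with your LSF_install_dir):
#   path_has_dir /path/to/LSF_install_dir/7.0/bin || echo "LSF bin directory is not in PATH"
#   retry_minutes 600
```

Calling retry_minutes 600 prints 50, which matches the worked example in the list. The division is integer division, so values that are not a multiple of 12 are rounded down to whole minutes.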