Troubleshooting Scheduling Problems


General

Privilege assignments cause the greatest number of problems when you schedule jobs. Therefore, you should do the following:

  1. Use Platform tools and log files to isolate a problem.
  2. Verify that the command line runs correctly from an operating environment command prompt.
  3. Examine the output from your job.

Foundation SAS exits with warnings that cause an Exit instead of a Done status when using Platform Computing's scheduling servers

Platform LSF treats any non-zero exit code as Exit instead of Done, so a SAS job that finishes with warnings (return code 1) is reported as Exit. Here are two common ways to resolve this issue: use a job starter, or use Platform LSF in conjunction with the sasbatch script from a TUE install. The TUE install creates the sasbatch script to run SAS in batch mode.

For a Windows installation, you can modify the script as follows:

  if not {%username%}=={} (
    call sas.bat %*
  ) else (
    call sas.bat -sasuser work %*
  )

  set rc=%ERRORLEVEL%
  if %rc%==1 goto makenormalexit

  exit %rc%

  :makenormalexit

  exit 0


For a UNIX installation, you can modify the script as follows:

  sas "$@"
  rc=$?
  if [ $rc -eq 1 ]; then
    exit 0
  else
    exit $rc
  fi


The job starter is similar to an automatic job wrapper and is responsible for starting the task.

The job starter for Windows that is described here checks the ERRORLEVEL returned by SAS. If the ERRORLEVEL is 2 or greater, it exits with that error level. If the ERRORLEVEL is 0 or 1, it exits with 0. This addresses the issue of warnings in the SAS program log. The Windows implementation (run_sas_batch.cmd) is listed under Windows JobStarter Script Example, and the steps that follow describe a test case.

To use these test programs, do the following:

  1. Copy run_sas_batch.cmd to a directory, for example,
    D:\SAS_files\scripts
    
  2. Open the lsb.queues file in an editor, for example,
    LSF_TOP\conf\lsbatch\\configdir\lsb.queues
    
  3. Add the JOB_STARTER line to all of the queue definitions:
    Begin Queue
    QUEUE_NAME = normal
    ....
    JOB_STARTER = D:\SAS_files\scripts\run_sas_batch.cmd    #<-- you add this
    ...
    End Queue
    
    Begin Queue
    QUEUE_NAME = priority
    ...
    JOB_STARTER = D:\SAS_files\scripts\run_sas_batch.cmd
    ...
    End Queue
    
    and so on.
  4. After you have modified the configuration, tell the batch system to re-read the configuration by running the command BADMIN RECONFIG. You will need to run this command as one of the Platform LSF administrator accounts. Alternatively, you can stop all the Platform LSF services, and then re-start them.
  5. Test the configuration. For example, create a SAS program such as warn.sas:
    data test;
    x=1;
    run;
    
    Call it from a .bat file, for example, warn.bat:
    "c:\program files\...\sas.exe" D:\SAS_files\scripts\warn.sas -noterminal
    
    Submit the .bat file:
    bsub -o warn.txt warn.bat
    
    When the job finishes, it is in a DONE state.
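
Because the JOB_STARTER line has to be present in every queue definition, a quick check of lsb.queues can catch a queue that was missed. The following is a minimal sketch, assuming the Begin Queue / End Queue layout shown in step 3 and a host where a POSIX shell and awk are available. The function name queues_missing_job_starter is hypothetical, and the path in the usage comment is a placeholder. It prints the name of each queue that has no active JOB_STARTER line.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an LSF command).
# Reads lsb.queues from the named file(s), or from standard input, and
# prints the QUEUE_NAME of every queue block without a JOB_STARTER line.
# Lines that start with # are treated as comments and skipped.
queues_missing_job_starter() {
  awk '
    /^[ \t]*#/ { next }
    /^[ \t]*Begin[ \t]+Queue/ { inq = 1; name = "(unnamed)"; has = 0; next }
    inq && /^[ \t]*QUEUE_NAME[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, "")      # drop everything up to the value
      sub(/[ \t]*(#.*)?$/, "")      # drop trailing blanks and comments
      name = $0
      next
    }
    inq && /^[ \t]*JOB_STARTER[ \t]*=/ { has = 1; next }
    /^[ \t]*End[ \t]+Queue/ { if (inq && !has) print name; inq = 0 }
  ' "$@"
}

# Usage (placeholder path):
#   queues_missing_job_starter /path/to/lsb.queues
```

No output means that every queue definition in the file contains a JOB_STARTER line.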


Note: The run_sas_batch.cmd script relies on the command having the correct file extension. Be sure that the command that is defined in your SAS Batch Server definition has the correct extension, for example, .bat, .cmd, or .exe. Alternatively, if your command is a .bat or .cmd file, you can add call in front of %* in the :program section of run_sas_batch.cmd.

Windows JobStarter Script Example

rem echo off
rem
rem This script wraps a sas.exe invocation and analyzes the ERRORLEVEL
rem set by sas.exe to determine whether the sas program has failed,
rem and returns this information to LSF using 'exit'
rem (since LSF does not interpret ERRORLEVEL).
rem
rem Errorlevels set by SAS:
rem
rem Condition                              Return code
rem =========                              ===========
rem All steps terminated normally               0
rem SAS System issued warning(s)                1
rem SAS issued error(s)                         2
rem User issued the ABORT statement             3
rem User issued the ABORT RETURN statement      4
rem User issued the ABORT ABEND statement       5
rem SAS internal error                          6
rem
rem Any error codes above 6 are returned as a result of using the ABORT
rem statement with a numeric argument. If your program calls ABORT with
rem return codes above 6, you'll need to modify the script.
rem
rem If the error condition is SUCCESS or WARNING, then the script will
rem exit with a zero exit code, thus indicating success to LSF. If the
rem error condition is ERROR, INFORMATIONAL or FATAL, the script will
rem exit with the provided ERRORLEVEL, thus indicating failure to LSF.
rem
rem The script also distinguishes .cmd and .bat script files and runs
rem them using 'call' so that the exit from the script does not exit the
rem entire shell (and thus stop the execution of this script).
rem

rem
rem Check for at least one argument (command to run)
rem
if X%1 == X goto noarg

rem
rem Check if the program to run is a script or not
rem If there is no file extension, assume a regular program
rem
echo "Checking extension ...."
if X%~x1 == X goto program

echo "Is it a cmd file...."
if %~x1 == .cmd goto script

echo "Is it a bat file...."
if %~x1 == .bat goto script

:program
        echo "Running program %1...."
        %*
        goto ran

:script
        echo "Running script %1...."
        call %*
        goto ran

:ran
        echo Errorlevel from %1 is %ERRORLEVEL%
        if %ERRORLEVEL% GEQ 7 goto unknown
        goto level%ERRORLEVEL%

:level0
        echo All steps terminated normally.
        goto success

:level1
        echo SAS issued warning(s).
        goto success

:level2
        echo SAS issued error(s).
        goto failure

:level3
        echo User issued the ABORT statement.
        goto failure

:level4
        echo User issued the ABORT RETURN statement.
        goto failure

:level5
        echo User issued the ABORT ABEND statement.
        goto failure

:level6
        echo SAS internal error.
        goto failure

:unknown
        echo Unknown ERRORLEVEL %ERRORLEVEL%.
        goto failure

:failure
        exit %ERRORLEVEL%

:success
        exit 0

:noarg
        echo "Usage: %0 command [arguments]"
        exit 127

UNIX JobStarter Script Example

#! /bin/ksh

# This script wraps a sas invocation and analyzes the return code
# set by sas to determine whether the sas program has failed,
# and returns this information to LSF using 'exit'
#
# Errorlevels set by SAS:
#
# Condition                             Return code
# =========                             ===========
# All steps terminated normally             0
# SAS System issued warning(s)              1
# SAS issued error(s)                       2
# User issued the ABORT statement           3
# User issued the ABORT RETURN statement    4
# User issued the ABORT ABEND statement     5
# SAS internal error                        6

$*
rc=$?

# if the command exits with 1, exit with 0; otherwise exit with the same value
if [ $rc -eq 1 ]; then
  exit 0
else
  exit $rc
fi
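
The Windows job starter echoes a description of each return code, which helps when you read the job output, but the UNIX example does not. The following is a minimal sketch of an equivalent lookup for the UNIX script, using only the return codes listed in the script comments. The function name describe_sas_rc is hypothetical and is not part of the job starter above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration).
# Prints the condition that corresponds to a SAS return code, following
# the table in the job starter comments. Codes above 6 come from ABORT
# with a numeric argument and are reported as unknown here.
describe_sas_rc() {
  case "$1" in
    0) echo "All steps terminated normally" ;;
    1) echo "SAS System issued warning(s)" ;;
    2) echo "SAS issued error(s)" ;;
    3) echo "User issued the ABORT statement" ;;
    4) echo "User issued the ABORT RETURN statement" ;;
    5) echo "User issued the ABORT ABEND statement" ;;
    6) echo "SAS internal error" ;;
    *) echo "Unknown return code $1" ;;
  esac
}

# Possible use inside a job starter, right after rc=$? is set:
#   describe_sas_rc "$rc" >&2
```

Sending the description to standard error keeps it out of any output that the job itself writes to standard output.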

Diagnostic Output

Process Manager (PM), JobScheduler (JS), and LSF each have a log directory where they write logging information. The default locations are <PM_TOP>/log, <JS_TOP>/log, and <LSF_TOP>/logs, respectively. PM and JS output is in jfd.log.<host>. LSF output is in the log files of the individual LSF daemons.


You can change the configuration files (js.conf or lsf.conf) to have more diagnostic messages printed out. PM/JS logging is controlled by the parameter JS_LOG_MASK. The default value for JS_LOG_MASK is LOG_NOTICE. Debug settings are LOG_DEBUG1, LOG_DEBUG2, LOG_DEBUG3. LSF logging is controlled by a series of options (LSB_DEBUG*).

Platform LSF daemon logging is controlled by the parameter LSF_LOG_MASK. Its value can be any log priority symbol that is defined in /usr/include/sys/syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING. You can temporarily set the message log level by using the following commands:

jreconfigdebug -l debug_level

lsadmin limdebug [-c class_name] [-l debug_level] [-f logfile_name] [-o] [host_name]
lsadmin resdebug [-c class_name] [-l debug_level] [-f logfile_name] [-o] [host_name]
badmin mbddebug [-c class_name] [-l debug_level] [-f logfile_name] [-o]
badmin sbddebug [-c class_name] [-l debug_level] [-f logfile_name] [-o] [host_name]
where -o resets the debug settings back to the daemon starting state.
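
Before you raise a log level, it can help to confirm what a configuration file currently sets. The following is a minimal sketch, assuming that js.conf and lsf.conf hold one NAME=value setting per line and use # for comment lines. The function name conf_value is hypothetical, and the file paths in the usage comment are placeholders. The fallback values in the usage comment are the defaults stated above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an LSF command).
#   $1 = parameter name, $2 = configuration file, $3 = value to print
#        when the parameter is not set in the file
# If the parameter appears more than once, the last setting is printed.
conf_value() {
  awk -v key="$1" -v dflt="$3" '
    /^[ \t]*#/ { next }                       # skip comment lines
    {
      pos = index($0, "=")
      if (pos == 0) next                      # not a NAME=value line
      name = substr($0, 1, pos - 1)
      gsub(/[ \t]/, "", name)
      if (name == key) {
        val = substr($0, pos + 1)
        gsub(/^[ \t"]+|[ \t"]+$/, "", val)    # trim blanks and quotes
        found = 1
      }
    }
    END { print (found ? val : dflt) }
  ' "$2"
}

# Usage (placeholder paths):
#   conf_value JS_LOG_MASK  /path/to/js.conf  LOG_NOTICE
#   conf_value LSF_LOG_MASK /path/to/lsf.conf LOG_WARNING
```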

You can add -DDebug to the Java command line that invokes SAS Management Console. This causes JS to log information about the communication between the client (SAS Management Console) and the server. The output is written to the errorlog.txt file that SAS Management Console generates.

Windows Specific

The Windows security policy has some unique requirements. In order to correctly install the Platform Computing software, you must provide a valid user ID and password to run and administer the services. The user ID that is used to run the installation program must have the "Act as part of the operating system" privilege assigned to it. The user ID that you specify to run the services under must have the "Log on as a batch job" privilege assigned to it.

Problems can also occur because the password has expired or the wrong domain name was provided. Often these user IDs are not your usual user ID. One simple way to test the domain, user ID, and password is as follows:

  1. Bring up a DOS command prompt.
  2. Issue the RUNAS command to bring up a new DOS command prompt running as the other user ID:
    runas /user:DOMAIN\userid cmd
    
  3. Type the password. A new DOS command prompt should start running as that user ID.


You can use this new DOS command prompt to run the various scheduled commands that are failing to find out if they work from the DOS prompt. If they run, you know there is a problem with the scheduler setup. If they don’t run, then it’s probable that you will have additional information on the console of the DOS command window to help you find the problem.

In a multi-user environment in which you want more than one user submitting and running flows, there are privilege settings on folders that need to be in place. The scheduling server folders should already be set for service and administrator accounts and need no further changes. Verify that the LSF installed files have the following privileges:

Folder                       User group             Privileges
LSFTOP\work                  LSF service accounts   full control (All) (All)
LSFTOP\work                  LSF administrators     full control (All) (All)
LSFTOP\work                  Everyone               special access (R) (R)
LSFTOP\logs                  LSF service accounts   full control (All) (All)
LSFTOP\logs                  LSF administrators     full control (All) (All)
LSFTOP\logs                  Everyone               special access (R) (R)
LSFTOP\conf\lsfuser.passwd   JS service accounts    special access (R) (R)

Verify that the scheduling servers installed files have the following privileges:

Folder       User group            Privileges
JSTOP\work   JS service accounts   full control (All) (All)
JSTOP\work   JS administrators     full control (All) (All)
JSTOP\work   Everyone              special access (R) (R)
JSTOP\log    JS service accounts   full control (All) (All)
JSTOP\log    JS administrators     full control (All) (All)
JSTOP\log    Everyone              special access (R) (R)


AIX Specific

The AIX environment is unique in that you can configure your kernel to be in either 32-bit mode or 64-bit mode, and your 64-bit applications will run with either mode. Platform Computing software requires that the kernel be in 64-bit mode for their 64-bit application. This means that you will need the kernel in 64-bit mode in order to use Platform Computing’s scheduling servers and Platform LSF under the AIX environment.

To determine if your kernel is in 64-bit mode, use the LSCONF command. Here is an example of a system that is running the 32-bit kernel and also has the 64-bit kernel installed.

   $ lsconf -k
   Kernel Type: 32-bit
   $ lslpp -l | grep bos | grep 64
     bos.64bit                 5.1.0.50  COMMITTED  Base Operating System 64 bit
     bos.mp64                  5.1.0.50  COMMITTED  Base Operating System 64-bit
     bos.64bit                 5.1.0.35  COMMITTED  Base Operating System 64 bit
     bos.mp64                  5.1.0.35  COMMITTED  Base Operating System 64-bit
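
The check above can also be scripted. The following is a minimal sketch, assuming that a 64-bit kernel is reported as "Kernel Type: 64-bit", by analogy with the 32-bit output shown in the example. The function name kernel_is_64bit is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration).
# Reads the output of "lsconf -k" on standard input and succeeds only
# when it reports a 64-bit kernel. The exact "Kernel Type: 64-bit" text
# is assumed from the 32-bit example output.
kernel_is_64bit() {
  grep -q 'Kernel Type: 64-bit'
}

# Usage on an AIX host:
#   if lsconf -k | kernel_is_64bit; then
#     echo "64-bit kernel: OK"
#   else
#     echo "Kernel is not in 64-bit mode"
#   fi
```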


To switch to 64-bit mode, the system must be booted with the 64-bit kernel. Ask your System Administrator to do this for you. The following article explains how to boot in 64-bit mode: www-106.ibm.com/developerworks/eserver/articles/dutta_cmds.html

Platform Computing's Scheduling Servers and Platform LSF Specific

Q: What does the following message mean when I run the LSADMIN CKCONFIG -V command?
readKernel(): read(/dev/kmem) failed, Bad address.

A: Usually, this warning message means that a binary that is running on the system does not match the bit mode of its operating environment (for example, a 64-bit binary is running on a 32-bit machine, or vice versa).



Q: What does the exception below mean when I try to run Flow Manager?

Exception in thread "main" java.lang.UnsatisfiedLinkError: /usr/local/js/5.31/linux2.4-glibc2.3-ia64/jre/lib/ia64/libfontmanager.so: libstdc++-libc6.2-2.so.3: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1473)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1389)
at java.lang.Runtime.loadLibrary0(Runtime.java:788)
at java.lang.System.loadLibrary(System.java:832)
at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:50)
at java.security.AccessController.doPrivileged(Native Method)
at sun.awt.font.NativeFontWrapper.(NativeFontWrapper.java:42)
at sun.awt.X11GraphicsEnvironment.initDisplay(Native Method)
at sun.awt.X11GraphicsEnvironment.(X11GraphicsEnvironment.java:125)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:140)
at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(GraphicsEnvironment.java:62)
at java.awt.Window.init(Window.java:223)
at java.awt.Window.(Window.java:267)
at java.awt.Frame.(Frame.java:398)
at java.awt.Frame.(Frame.java:363)
at javax.swing.JFrame.(JFrame.java:154)
at com.platform.LSFJobFlow.app.caleditor.JFCalContainer.(JFCalContainer.java:106)
at com.platform.LSFJobFlow.app.caleditor.JFCalContainer.main(JFCalContainer.java:672)


A: The exception indicates that the required version of the stdc++ library (libstdc++-libc6.2-2.so.3 in this trace) is missing. Ask your System Administrator to install the correct version.
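
To see which shared libraries are missing before you contact your System Administrator, you can inspect the library named in the exception with ldd. The following is a minimal sketch, assuming a Linux host where ldd marks unresolved libraries with "=> not found". The function name missing_libs is hypothetical, and the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration).
# Reads ldd output on standard input and prints the name of every
# shared library that ldd reports as "not found".
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage (placeholder path; use the .so file named in the exception):
#   ldd /path/to/libfontmanager.so | missing_libs
```

No output means that ldd resolved every dependency of the library.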