Service Architecture Alert Definitions

The following alerts are defined when you initialize the SAS Environment Manager Service Architecture.
Note: Alerts that are triggered by comparison to a baseline value require that the metric for the alert be monitored long enough to first establish a baseline value.
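The following minimal sketch illustrates how a baseline-relative threshold of this kind could be evaluated. It is not the SAS Environment Manager implementation: the function name `exceeds_baseline`, the constant `MIN_SAMPLES`, and the choice of the mean as the baseline are assumptions made for illustration only.

```python
from statistics import mean
from typing import Optional, Sequence

MIN_SAMPLES = 10  # hypothetical: samples needed before a baseline exists


def exceeds_baseline(history: Sequence[float], current: float,
                     percent: float) -> Optional[bool]:
    """Return True if `current` is above `percent`% of the baseline.

    Returns None while there is too little history to form a baseline,
    mirroring the note that the metric must be monitored long enough first.
    """
    if len(history) < MIN_SAMPLES:
        return None
    baseline = mean(history)  # assumption: baseline is the average of past samples
    return current > baseline * percent / 100.0
```

For example, with ten past samples of 100, a current value of 25 exceeds 20% of the baseline, and a current value of 20 does not.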
Platform Alerts
Resource
Alert name
Description
Linux
CPU Count
Triggered if the number of CPUs on the platform changes. This alert indicates a possible hardware problem.
CPU Usage >70
Triggered if the overall CPU usage in the system exceeds 70%.
CPU Usage >95
Triggered if the overall CPU usage in the system exceeds 95%.
Pct Free Memory
Triggered if the percentage of free memory falls below 20% of the maximum free memory.
Pct Free Swap
Triggered if the percentage of free swap memory falls below 20% of the maximum free swap memory.
Swap Out Rate
Triggered if the number of pages swapped out of memory exceeds 20% of the baseline value of page swaps.
This alert indicates that your system is memory constrained. Swapping occurs when the system requires more memory than is physically available.
TCP Attempt Fails
Triggered if the number of failed attempts to connect to the TCP service exceeds 20% of the baseline value of attempted connections. The number of failed attempts should normally be close to zero.
TCP In Errors
Triggered if the number of TCP interface errors exceeds 20% of the baseline value of TCP interface requests. The number of TCP interface errors should normally be close to zero.
Zombie Processes
Triggered if the number of zombie processes exceeds 20% of the baseline value of total processes. Zombie processes are processes that have completed execution but still have entries in the process table. This alert indicates an application problem.
Win32
CPU Count
Triggered if the number of CPUs on the platform changes. This alert indicates a possible hardware problem.
CPU Usage >70
Triggered if the overall CPU usage in the system exceeds 70%.
CPU Usage >95
Triggered if the overall CPU usage in the system exceeds 95%.
Pct Free Memory
Triggered if the percentage of free memory falls below 20% of the maximum free memory.
Pct Free Swap
Triggered if the percentage of free swap memory falls below 20% of the maximum free swap memory.
Swap Out Rate
Triggered if the number of pages swapped out of memory exceeds 20% of the baseline value of page swaps.
This alert indicates that your system is memory constrained. Swapping occurs when the system requires more memory than is physically available.
TCP Attempt Fails
Triggered if the number of failed attempts to connect to the TCP service exceeds 20% of the baseline value of attempted connections. The number of failed attempts should normally be close to zero.
TCP In Errors
Triggered if the number of TCP interface errors exceeds 20% of the baseline value of TCP interface requests. The number of TCP interface errors should normally be close to zero.
Zombie Processes
Triggered if the number of zombie processes exceeds 20% of the baseline value of total processes. Zombie processes are processes that have completed execution but still have entries in the process table. This alert indicates an application problem.
AIX
CPU Count
Triggered if the number of CPUs on the platform changes. This alert indicates a possible hardware problem.
CPU Usage >70
Triggered if the overall CPU usage in the system exceeds 70%.
CPU Usage >95
Triggered if the overall CPU usage in the system exceeds 95%.
Pct Free Memory
Triggered if the percentage of free memory falls below 20% of the maximum free memory.
Pct Free Swap
Triggered if the percentage of free swap memory falls below 20% of the maximum free swap memory.
Swap Out Rate
Triggered if the number of pages swapped out of memory exceeds 20% of the baseline value of page swaps.
This alert indicates that your system is memory constrained. Swapping occurs when the system requires more memory than is physically available.
TCP Attempt Fails
Triggered if the number of failed attempts to connect to the TCP service exceeds 20% of the baseline value of attempted connections. The number of failed attempts should normally be close to zero.
TCP In Errors
Triggered if the number of TCP interface errors exceeds 20% of the baseline value of TCP interface requests. The number of TCP interface errors should normally be close to zero.
Zombie Processes
Triggered if the number of zombie processes exceeds 20% of the baseline value of total processes. Zombie processes are processes that have completed execution but still have entries in the process table. This alert indicates an application problem.
SAS Application Server Tier
Metadata Cluster Avail
Triggered if the availability of the SAS Metadata Server cluster falls below 100%.
Metadata Quorum Chg
Triggered if the SAS Metadata Server cluster is not in quorum.
SAS License Termination
Triggered if there are fewer than 30 days remaining before the SAS license terminates. If this alert is triggered, it recurs once every 12 hours.
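The fixed-threshold platform alerts above can be sketched as simple comparisons. The function names `cpu_alert` and `license_alert` are hypothetical, and the CPU helper reports only the more severe of the two CPU alerts; how the product itself evaluates or prioritizes these conditions is not described here.

```python
from datetime import date
from typing import Optional


def cpu_alert(cpu_percent: float) -> Optional[str]:
    """Map overall CPU usage to the most severe CPU alert it would trigger."""
    if cpu_percent > 95:
        return "CPU Usage >95"
    if cpu_percent > 70:
        return "CPU Usage >70"
    return None


def license_alert(expires: date, today: date) -> bool:
    """True when fewer than 30 days remain before the license terminates."""
    return (expires - today).days < 30
```

A usage of exactly 70% triggers nothing, because both thresholds are strict ("exceeds"). Likewise, a license with exactly 30 days remaining does not yet trigger the alert.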
Server Alerts
Resource
Alert name
Description
HQ Agent
HQ Agent ERROR message in log
Triggered if an error message appears in the HQ agent log.
HQ Agent Memory
Triggered if the JVM free memory for the HQ agent falls below 14.3 MB.
HQ Time Agent Spends Fetching Metrics
Triggered if the time that the HQ agent spends collecting metric data exceeds five seconds per minute. This alert might indicate an overloaded agent or a problem with the scheduling thread. These problems might already be present when the value of this metric exceeds three or four seconds per minute.
PostgreSQL 9.x
PostgreSQL 9.x - Availability
Triggered if the availability of PostgreSQL falls below 100%.
pg: Buffer Hits % <50% of Max
Triggered if the number of buffer hits is less than 50% of the total block read requests. (A buffer hit is a block read request that is avoided because the block is in the buffer cache.) This alert might indicate that more system memory is needed or that you need to adjust the shared buffers.
pg: Commits per Second >20
Triggered if the number of commits to PostgreSQL is greater than 20 per second. This alert indicates that you might need to provide a durable write cache to prevent potential data loss.
pg: Connection Usage >80% of Max
Triggered if the number of connections used is greater than 80% of the maximum number allowed. This alert indicates that you might need to increase the maximum number of available connections in order to prevent denial of service.
pg: Memory Size changed
Triggered if the memory used by PostgreSQL falls below 90% of the baseline value. If this condition is met, the alert is triggered once every 12 hours.
SAS Config Level Directory 9.4
SASConfig Disk Use % > 95
Triggered if the volume that contains the SASConfig directory is more than 95% full.
SAS Connect Spawner 9.4
Connect Spawner Health % < 100
Triggered if the health of the SAS Connect Spawner falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
SAS Home Directory 9.4
SASHome Disk Use % > 95
Triggered if the volume that contains the SASHome directory is more than 95% full.
SAS Metadata Server 9.4
Metadata - Availability
Triggered if the availability of the SAS Metadata Server falls below 100%.
Metadata Major (page) Faults
Triggered if the number of page faults that require disk activity is above 10% of the baseline value of total page faults. This alert might indicate a memory constraint that is causing slow performance.
Metadata Server ERROR message in log
Triggered if an error message appears in the SAS Metadata Server log.
Metadata Server Health % < 100
Triggered if the health of the SAS Metadata Server falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
Metadata Time in Calls per Minute
Triggered if the time taken by calls to the SAS Metadata Server exceeds 300% of the baseline value of calls to the server. This alert might be an indication of slow performance.
Metadata User Lockout
Triggered if the message “locked out due to excessive log on failures” appears in the SAS Metadata Server log.
SAS OLAP Server 9.4
OLAP - Availability
Triggered if the availability of the OLAP server falls below 100%.
OLAP Server ERROR message in log
Triggered if an error message appears in the OLAP server log.
OLAP Server Health % < 100
Triggered if the health of the OLAP server falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
OLAP Server User Lockout
Triggered if the message “locked out due to excessive log on failures” appears in the OLAP server log.
SAS Object Spawner 9.4
Object Spawner ERROR message in log
Triggered if an error message appears in the SAS Object Spawner log.
Object Spawner User Lockout
Triggered if the message “locked out due to excessive log on failures” appears in the SAS Object Spawner log.
Object Spawner - Availability
Triggered if the availability of the SAS Object Spawner falls below 100%.
Object Spawner Failed Connections
Triggered if the SAS Object Spawner fails to spawn a server.
Object Spawner Major (page) Faults
Triggered if the number of page faults that require disk activity is above 10% of the baseline value of total page faults. This alert might indicate a memory constraint that is causing slow performance.
Object Spawner Server Health % < 100
Triggered if the health of the SAS Object Spawner falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
SAS SMP LASR Server
LASR SMP Major (page) Faults
Triggered if the number of page faults that require disk activity is above 10% of the baseline value of total page faults. This alert might indicate a memory constraint that is causing slow performance.
SMP LASR - Availability
Triggered if the availability of the SAS LASR Analytic Server falls below 100%.
SAS System Info
EMI Event Log Alert
This alert is a template for detecting a string in the EMI Events log (sasev.events). This log contains messages generated by the SAS Environment Manager application. Through the use of macros, you can also write log messages from SAS applications to this log.
To have the alert trigger when a specific string appears in the log, edit the alert and replace the string “match this text” with the string that you want to use.
SpringSource tc Runtime 6.0
Deadlocks Detected
Triggered if a deadlock is detected. A deadlock occurs when multiple actions are each waiting for another to complete, so none of the actions ever finishes.
Excessive Time Spent in Garbage Collection
Triggered if the amount of time spent in garbage collection exceeds 40% of the total process time.
SpringSource tc Runtime 7.0
Deadlocks Detected
Triggered if a deadlock is detected. A deadlock occurs when multiple actions are each waiting for another to complete, so none of the actions ever finishes.
Excessive Time Spent in Garbage Collection
Triggered if the amount of time spent in garbage collection exceeds 40% of the total process time.
Webapp CPU Time in Garbage Collection >30%
Triggered if the amount of time spent in garbage collection exceeds 30% of the total process time.
Webapp Heap Free Memory < 5% of Max
Triggered if the free JVM heap memory falls below 5% of the total memory. It is recommended that you have a minimum of 400 MB of free heap space (calculated after garbage collection).
vFabric Web Server 5.2
Web Server: Apache Idle Workers <20
Triggered if the number of available idle workers falls below 20% of the maximum number of workers. If no idle workers are available, a service interruption occurs.
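As an illustration of the percentage-based server alerts above, the following sketch computes the PostgreSQL buffer hit percentage and connection usage from raw counters. The function and parameter names are assumptions. PostgreSQL exposes comparable counters as `blks_hit` and `blks_read` in its `pg_stat_database` view, but how the monitoring agent collects them is not described here.

```python
def buffer_hit_percent(blocks_hit: int, blocks_read: int) -> float:
    """Percentage of block read requests served from the buffer cache.

    `blocks_hit` counts requests satisfied by the cache; `blocks_read`
    counts requests that went to disk. Total requests = hit + read.
    """
    total = blocks_hit + blocks_read
    if total == 0:
        return 100.0  # assumption: no requests yet is treated as healthy
    return 100.0 * blocks_hit / total


def connection_usage_percent(used: int, max_connections: int) -> float:
    """Connections in use as a percentage of the configured maximum."""
    return 100.0 * used / max_connections


def postgres_alerts(blocks_hit: int, blocks_read: int,
                    used: int, max_connections: int) -> list:
    """Return the names of the alerts whose conditions are met."""
    alerts = []
    if buffer_hit_percent(blocks_hit, blocks_read) < 50:
        alerts.append("pg: Buffer Hits % <50% of Max")
    if connection_usage_percent(used, max_connections) > 80:
        alerts.append("pg: Connection Usage >80% of Max")
    return alerts
```

For instance, 40 hits against 60 disk reads is a 40% hit rate, and 85 of 100 connections in use is 85% usage, so both conditions would be met.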
Service Alerts
Resource
Alert name
Description
FileServer Mount
File Mount Use Pct
Triggered if the percentage of space used on the file mount exceeds 95%.
HTTP
HTTP Response Server Error Code => 500
Triggered if the response code for an HTTP service ping is 500 or greater, indicating a server error. Possible response codes are:
500
Internal Server Error
501
Not Implemented
502
Bad Gateway
503
Service Unavailable
Network Server Interface
NetIF Rcv Dropped
Triggered if the number of dropped network receive packets exceeds 20% of the baseline value of total network receive packets. This alert requires that the metrics be monitored long enough to establish a baseline value of network packets for your system.
NetIF Rcv Errors
Triggered if the number of network receive errors exceeds 20% of the baseline value of total network attempts. This alert requires that the metrics be monitored long enough to establish a baseline value of network traffic for your system.
NetIF Tx Collisions
Triggered if the number of network interface transmit collisions exceeds 20% of the baseline value of total network attempts. This alert requires that the metrics be monitored long enough to establish a baseline value of network traffic for your system.
HQ Agent
HQ Agent ERROR message in log
Triggered if an error message appears in the HQ agent log.
HQ Agent Memory
Triggered if the JVM free memory for the HQ agent falls below 14.3 MB.
HQ Time Agent Spends Fetching Metrics
Triggered if the time that the HQ agent spends collecting metric data exceeds five seconds per minute. This alert might indicate an overloaded agent or a problem with the scheduling thread. These problems might already be present when the value of this metric exceeds three or four seconds per minute.
PostgreSQL 9.x
PostgreSQL 9.x - Availability
Triggered if the availability of PostgreSQL falls below 100%.
pg: Buffer Hits % <50% of Max
Triggered if the number of buffer hits is less than 50% of the total block read requests. (A buffer hit is a block read request that is avoided because the block is in the buffer cache.) This alert might indicate that more system memory is needed or that you need to adjust the shared buffers.
pg: Commits per Second >20
Triggered if the number of commits to PostgreSQL is greater than 20 per second. This alert indicates that you might need to provide a durable write cache to prevent potential data loss.
pg: Connection Usage >80% of Max
Triggered if the number of connections used is greater than 80% of the maximum number allowed. This alert indicates that you might need to increase the maximum number of available connections in order to prevent denial of service.
pg: Memory Size changed
Triggered if the memory used by PostgreSQL falls below 90% of the baseline value. If this condition is met, the alert is triggered once every 12 hours.
SAS Environment Manager Data Mart 9.4 ACM ETL Processing
Data Mart ACM ETL
Triggered if the availability of the ACM ETL process falls below 100%.
SAS Environment Manager Data Mart 9.4 APM ETL Processing
Data Mart APM ETL
Triggered if the availability of the APM ETL process falls below 100%.
SAS Environment Manager Data Mart 9.4 Kits ETL Processing
Data Mart Kits ETL
Triggered if the availability of the kits ETL process falls below 100%.
SAS Home Directory 9.4 SAS Directory
SASWork Disk Use % > 70
Triggered if the volume that contains the SASWork directory is more than 70% full.
SASWork Disk Use % > 95
Triggered if the volume that contains the SASWork directory is more than 95% full.
SAS Object Spawner 9.4 SAS Logical Pooled Workspace Server
Logical Pooled Workspace Server Timed Out Clients
Triggered if there are any failed connections between the logical pooled workspace server and applications that are trying to connect to the server.
Logical Pooled Workspace Server Unauthorized Accesses
Triggered if there are any unauthorized accesses to the logical pooled workspace server.
SAS Object Spawner 9.4 SAS Logical Stored Process Server
Logical Stored Process Server Timed Out Clients
Triggered if there are any failed connections between the logical stored process server and applications that are trying to connect to the server.
Logical Stored Process Server Unauthorized Accesses
Triggered if there are any unauthorized accesses to the logical stored process server.
SAS Object Spawner 9.4 SAS Logical Workspace Server
Logical Workspace Server Unauthorized Accesses
Triggered if there are any unauthorized accesses to the logical workspace server.
SAS Object Spawner 9.4 SAS Pooled Workspace Server
Pooled Workspace Server ERROR message in log
Triggered if an error message appears in the pooled workspace server log.
SAS Object Spawner 9.4 SAS Stored Process Server
Stored Process Server ERROR message in log
Triggered if an error message appears in the stored process server log.
SAS Object Spawner 9.4 SAS Workspace Server
Workspace Server ERROR message in log
Triggered if an error message appears in the workspace server log.
Spring Insight Application
Application error rate is high
Triggered if the application error rate for the past five minutes exceeds 10%.
SpringSource tc Runtime 6.0 Thread Diagnostics Context
Slow or Failed Request
Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 6.0 Thread Diagnostics Engine
Slow or Failed Request
Triggered if a request is taking too long or has failed, which is indicated by an entry appearing in the service’s log.
SpringSource tc Runtime 6.0 Thread Diagnostics Host
Slow or Failed Request
Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 6.0 Tomcat JDBC Connection Pool Context
JDBC Connection Abandoned
Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
JDBC Connection Failed
Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
JDBC Query Failed
Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
Slow JDBC Query
Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
SpringSource tc Runtime 6.0 Tomcat JDBC Connection Pool Global
JDBC Connection Abandoned
Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
JDBC Connection Failed
Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
JDBC Query Failed
Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
Slow JDBC Query
Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
SpringSource tc Runtime 7.0 Executor
Webapp Active Thread Count >250
Triggered if the number of active threads exceeds 250, which indicates heavy use. You can add servers to provide load balancing.
The maximum number of threads allowed is 300, and the minimum is 50. If the number of active threads exceeds 300, the thread queue resets to 100, and then additional threads are refused.
SpringSource tc Runtime 7.0 Manager
Webapp Manager Rejected Sessions
Triggered if the number of rejected sessions exceeds 10% of the baseline number of sessions.
SpringSource tc Runtime 7.0 Thread Diagnostics Context
Slow or Failed Request
Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 7.0 Thread Diagnostics Engine
Slow or Failed Request
Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 7.0 Thread Diagnostics Host
Slow or Failed Request
Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 7.0 Tomcat JDBC Connection Pool Context
JDBC Connection Abandoned
Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
JDBC Connection Failed
Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
JDBC Query Failed
Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
Slow JDBC Query
Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
SpringSource tc Runtime 7.0 Tomcat JDBC Connection Pool Global
JDBC Connection Abandoned
Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
JDBC Connection Failed
Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
JDBC Query Failed
Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
Slow JDBC Query
Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
Application health is degrading
Triggered if the application health metric (measured over the past five minutes) falls below 85%.
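Several of the service alerts above are driven by a marker string appearing in a log, or by an HTTP status code. The sketch below shows one way such checks could be expressed. The names `JDBC_MARKERS`, `jdbc_alert_for`, and `is_server_error` are hypothetical, and simple substring matching is an assumption; the actual matching rules used by the product are not documented here.

```python
from typing import Optional

# Log markers quoted in the alert descriptions, mapped to alert names.
JDBC_MARKERS = {
    "CONNECTION ABANDONED": "JDBC Connection Abandoned",
    "CONNECTION FAILED": "JDBC Connection Failed",
    "FAILED QUERY": "JDBC Query Failed",
    "SLOW QUERY": "Slow JDBC Query",
}


def jdbc_alert_for(line: str) -> Optional[str]:
    """Return the JDBC alert whose marker appears in the log line, if any."""
    for marker, alert in JDBC_MARKERS.items():
        if marker in line:
            return alert
    return None


def is_server_error(status_code: int) -> bool:
    """Following the alert name, codes of 500 and above count as server errors."""
    return status_code >= 500
```

A log line that contains “SLOW QUERY” maps to the Slow JDBC Query alert, and a line without any of the four markers maps to no alert.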