Service Architecture Alert Definitions

The following alerts are defined when you initialize the SAS Environment Manager Service Architecture.

Note: Alerts that are triggered by comparison to a baseline value require that the metric for the alert be monitored long enough to first establish a baseline value.
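
As a rough illustration of why a baseline must exist before such an alert can fire, the following is a minimal sketch, not the product's actual implementation. It assumes the baseline is the mean of the samples collected so far and that "exceeds N% of the baseline" means the value is greater than N/100 times that mean. The names MIN_SAMPLES, baseline, and exceeds_baseline are hypothetical.

```python
from statistics import mean

# Hypothetical: number of samples required before a baseline is considered established.
MIN_SAMPLES = 5


def baseline(samples):
    """Return the mean of the samples, or None if too few have been collected."""
    if len(samples) < MIN_SAMPLES:
        return None
    return mean(samples)


def exceeds_baseline(value, samples, percent):
    """Return True if value is greater than `percent` percent of the baseline.

    Returns False while no baseline exists, so the alert cannot fire early.
    """
    base = baseline(samples)
    if base is None:
        return False
    return value > base * percent / 100.0
```

With this sketch, a call such as exceeds_baseline(3, [10, 10, 10, 10, 10], 20) compares 3 against 20% of a baseline of 10, while the same call with only two samples never triggers.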

Platform Alerts
Resource	Alert name	Description
Linux	CPU Count	Triggered if the number of CPUs on the platform changes. This alert indicates a possible hardware problem.
	CPU Usage >70	Triggered if the overall CPU usage in the system exceeds 70%.
	CPU Usage >95	Triggered if the overall CPU usage in the system exceeds 95%.
	Pct Free Memory	Triggered if the percentage of free memory falls below 20% of the maximum free memory.
	Pct Free Swap	Triggered if the percentage of free swap memory falls below 20% of the maximum free swap memory.
	Swap Out Rate	Triggered if the number of pages swapped out of memory exceeds 20% of the baseline value of page swaps. This alert indicates that your system is memory constrained. Swapping occurs when the system requires more memory than is physically available.
	TCP Attempt Fails	Triggered if the number of failed attempts to connect to the TCP service exceeds 20% of the baseline value of attempted connections. The number of failed attempts should normally be close to zero.
	TCP In Errors	Triggered if the number of TCP interface errors exceeds 20% of the baseline value of TCP interface requests. The number of TCP interface errors should normally be close to zero.
	Zombie Processes	Triggered if the number of zombie processes exceeds 20% of the baseline value of total processes. Zombie processes are processes that have completed execution but still have entries in the process table. This alert indicates an application problem.
Win32	CPU Count	Triggered if the number of CPUs on the platform changes. This alert indicates a possible hardware problem.
	CPU Usage >70	Triggered if the overall CPU usage in the system exceeds 70%.
	CPU Usage >95	Triggered if the overall CPU usage in the system exceeds 95%.
	Pct Free Memory	Triggered if the percentage of free memory falls below 20% of the maximum free memory.
	Pct Free Swap	Triggered if the percentage of free swap memory falls below 20% of the maximum free swap memory.
	Swap Out Rate	Triggered if the number of pages swapped out of memory exceeds 20% of the baseline value of page swaps. This alert indicates that your system is memory constrained. Swapping occurs when the system requires more memory than is physically available.
	TCP Attempt Fails	Triggered if the number of failed attempts to connect to the TCP service exceeds 20% of the baseline value of attempted connections. The number of failed attempts should normally be close to zero.
	TCP In Errors	Triggered if the number of TCP interface errors exceeds 20% of the baseline value of TCP interface requests. The number of TCP interface errors should normally be close to zero.
	Zombie Processes	Triggered if the number of zombie processes exceeds 20% of the baseline value of total processes. Zombie processes are processes that have completed execution but still have entries in the process table. This alert indicates an application problem.
AIX	CPU Count	Triggered if the number of CPUs on the platform changes. This alert indicates a possible hardware problem.
	CPU Usage >70	Triggered if the overall CPU usage in the system exceeds 70%.
	CPU Usage >95	Triggered if the overall CPU usage in the system exceeds 95%.
	Pct Free Memory	Triggered if the percentage of free memory falls below 20% of the maximum free memory.
	Pct Free Swap	Triggered if the percentage of free swap memory falls below 20% of the maximum free swap memory.
	Swap Out Rate	Triggered if the number of pages swapped out of memory exceeds 20% of the baseline value of page swaps. This alert indicates that your system is memory constrained. Swapping occurs when the system requires more memory than is physically available.
	TCP Attempt Fails	Triggered if the number of failed attempts to connect to the TCP service exceeds 20% of the baseline value of attempted connections. The number of failed attempts should normally be close to zero.
	TCP In Errors	Triggered if the number of TCP interface errors exceeds 20% of the baseline value of TCP interface requests. The number of TCP interface errors should normally be close to zero.
	Zombie Processes	Triggered if the number of zombie processes exceeds 20% of the baseline value of total processes. Zombie processes are processes that have completed execution but still have entries in the process table. This alert indicates an application problem.
SAS Application Server Tier	Metadata Cluster Avail	Triggered if the availability of the SAS Metadata Server cluster falls below 100%.
	Metadata Quorum Chg	Triggered if the SAS Metadata Server cluster is not in quorum.
	SAS License Termination	Triggered if there are fewer than 30 days remaining before the SAS license terminates. If this alert is triggered, it recurs once every 12 hours.
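
The fixed-threshold platform alerts in the table above can be pictured with a short sketch. This is only an illustration under stated assumptions: the function names cpu_usage_alerts and license_alert are hypothetical, and the way the product actually samples CPU usage or reads the license termination date is not described here.

```python
from datetime import date


def cpu_usage_alerts(cpu_percent):
    """Return the names of the CPU usage alerts that the given usage would trigger."""
    alerts = []
    if cpu_percent > 70:
        alerts.append("CPU Usage >70")
    if cpu_percent > 95:
        alerts.append("CPU Usage >95")
    return alerts


def license_alert(termination_date, today):
    """Return True if fewer than 30 days remain before the license terminates."""
    return (termination_date - today).days < 30
```

Note that a usage value above 95% satisfies both CPU conditions, so both alerts appear in the returned list.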

Server Alerts
Resource	Alert name	Description
HQ Agent	HQ Agent ERROR message in log	Triggered if an error message appears in the HQ agent log.
	HQ Agent Memory	Triggered if the JVM free memory for the HQ agent falls below 14.3 MB.
	HQ Time Agent Spends Fetching Metrics	Triggered if the time that the HQ agent spends collecting metric data exceeds five seconds per minute. This alert might indicate an overloaded agent or a problem with the scheduling thread. These problems might already be present when this metric exceeds 3 or 4 seconds per minute.
PostgreSQL 9.x	PostgreSQL 9.x - Availability	Triggered if the availability of PostgreSQL falls below 100%.
	pg: Buffer Hits % <50% of Max	Triggered if the number of buffer hits is less than 50% of the total block read requests. (A buffer hit is a block read request that is avoided because the block is in the buffer cache.) This alert might indicate that more system memory is needed or that you need to adjust the shared buffers.
	pg: Commits per Second >20	Triggered if the number of commits to PostgreSQL is greater than 20 per second. This alert indicates that you might need to provide a durable write cache to prevent potential data loss.
	pg: Connection Usage >80% of Max	Triggered if the number of connections used is greater than 80% of the maximum number allowed. This alert indicates that you might need to increase the maximum number of available connections in order to prevent denial of service.
	pg: Memory Size changed	Triggered if the memory used by PostgreSQL falls below 90% of the baseline value. If this condition is met, the alert is triggered once every 12 hours.
SAS Config Level Directory 9.4	SASConfig Disk Use % > 95	Triggered if the volume that contains the SASConfig directory is more than 95% full.
SAS Connect Spawner 9.4	Connect Spawner Health % < 100	Triggered if the health of the SAS Connect Spawner falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
SAS Home Directory 9.4	SASHome Disk Use % > 95	Triggered if the volume that contains the SASHome directory is more than 95% full.
SAS Metadata Server 9.4	Metadata - Availability	Triggered if the availability of the SAS Metadata Server falls below 100%.
	Metadata Major (page) Faults	Triggered if the number of page faults that require disk activity is above 10% of the baseline value of total page faults. This alert might indicate a memory constraint that is causing slow performance.
	Metadata Server ERROR message in log	Triggered if an error message appears in the SAS Metadata Server log.
	Metadata Server Health % < 100	Triggered if the health of the SAS Metadata Server falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
	Metadata Time in Calls per Minute	Triggered if the time taken by calls to the SAS Metadata Server exceeds 300% of the baseline value of calls to the server. This alert might be an indication of slow performance.
	Metadata User Lockout	Triggered if the message “locked out due to excessive log on failures” appears in the SAS Metadata Server log.
SAS OLAP Server 9.4	OLAP - Availability	Triggered if the availability of the OLAP server falls below 100%.
	OLAP Server ERROR message in log	Triggered if an error message appears in the OLAP server log.
	OLAP Server Health % < 100	Triggered if the health of the OLAP server falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
	OLAP Server User Lockout	Triggered if the message “locked out due to excessive log on failures” appears in the OLAP server log.
SAS Object Spawner 9.4	Object Spawner ERROR message in log	Triggered if an error message appears in the SAS Object Spawner log.
	Object Spawner User Lockout	Triggered if the message “locked out due to excessive log on failures” appears in the SAS Object Spawner log.
	Object Spawner - Availability	Triggered if the availability of the SAS Object Spawner falls below 100%.
	Object Spawner Failed Connections	Triggered if the SAS Object Spawner fails to spawn a server.
	Object Spawner Major (page) Faults	Triggered if the number of page faults that require disk activity is above 10% of the baseline value of total page faults. This alert might indicate a memory constraint that is causing slow performance.
	Object Spawner Server Health % < 100	Triggered if the health of the SAS Object Spawner falls below 100%. This metric is the equivalent of the Validate command in SAS Management Console, and confirms that the server is responding.
SAS SMP LASR Server	LASR SMP Major (page) Faults	Triggered if the number of page faults that require disk activity is above 10% of the baseline value of total page faults. This alert might indicate a memory constraint that is causing slow performance.
	SMP LASR - Availability	Triggered if the availability of the SAS LASR Analytic Server falls below 100%.
SAS System Info	EMI Event Log Alert	This alert is a template for detecting a string in the EMI Events log (sasev.events). This log contains messages generated by the SAS Environment Manager application. Through the use of macros, you can also write log messages from SAS applications to this log. To have the alert trigger when a specific string appears in the log, edit the alert and replace the string “match this text” with the string that you want to use.
SpringSource tc Runtime 6.0	Deadlocks Detected	Triggered if a deadlock is detected. A deadlock occurs when multiple actions are each waiting for another to complete, so none of the actions ever finishes.
	Excessive Time Spent in Garbage Collection	Triggered if the amount of time spent in garbage collection exceeds 40% of the total process time.
SpringSource tc Runtime 7.0	Deadlocks Detected	Triggered if a deadlock is detected. A deadlock occurs when multiple actions are each waiting for another to complete, so none of the actions ever finishes.
	Excessive Time Spent in Garbage Collection	Triggered if the amount of time spent in garbage collection exceeds 40% of the total process time.
	Webapp CPU Time in Garbage Collection >30%	Triggered if the amount of time spent in garbage collection exceeds 30% of the total process time.
	Webapp Heap Free Memory < 5% of Max	Triggered if the free JVM heap memory falls below 5% of the total memory. It is recommended that you have a minimum of 400 MB of free heap space (calculated after garbage collection).
vFabric Web Server 5.2	Web Server: Apache Idle Workers <20	Triggered if the number of available idle workers falls below 20% of the maximum number of workers. If no idle workers are available, a service interrupt occurs.
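
The two ratio-based PostgreSQL alerts in the table above can be sketched as follows. This is a minimal illustration, assuming that the counts of buffer hits, block read requests, and connections are already available as numbers. The function names percent and postgresql_alerts are hypothetical, and how the product actually gathers these metrics is not covered here.

```python
def percent(part, whole):
    """Return part as a percentage of whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def postgresql_alerts(buffer_hits, block_read_requests, connections_used, max_connections):
    """Return the names of the ratio-based PostgreSQL alerts that would trigger."""
    alerts = []
    # Skip the buffer check when there were no read requests, to avoid a false alert.
    if block_read_requests and percent(buffer_hits, block_read_requests) < 50:
        alerts.append("pg: Buffer Hits % <50% of Max")
    if percent(connections_used, max_connections) > 80:
        alerts.append("pg: Connection Usage >80% of Max")
    return alerts
```

Skipping the buffer check when no block reads were requested is a design choice of this sketch, so that an idle database does not appear to have a 0% hit rate.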

Service Alerts
Resource	Alert name	Description
FileServer Mount	File Mount Use Pct	Triggered if the percentage of space used on the file mount exceeds 95%.
HTTP	HTTP Response Server Error Code => 500	Triggered if the response code for an HTTP service ping is 500 or greater, indicating an error. Possible response codes are: 500 (Unexpected Error), 501 (Does Not Support), 502 (Overload), and 503 (Gateway Timeout).
Network Server Interface	NetIF Rcv Dropped	Triggered if the number of dropped network receive packets exceeds 20% of the baseline value of total network receive packets. This alert requires that the metrics be monitored long enough to establish a baseline value of network packets for your system.
	NetIF Rcv Errors	Triggered if the number of network receive errors exceeds 20% of the baseline value of total network attempts. This alert requires that the metrics be monitored long enough to establish a baseline value of network traffic for your system.
	NetIF Tx Collisions	Triggered if the number of network interface transmit collisions exceeds 20% of the baseline value of total network attempts. This alert requires that the metrics be monitored long enough to establish a baseline value of network traffic for your system.
HQ Agent	HQ Agent ERROR message in log	Triggered if an error message appears in the HQ agent log.
	HQ Agent Memory	Triggered if the JVM free memory for the HQ agent falls below 14.3 MB.
	HQ Time Agent Spends Fetching Metrics	Triggered if the time that the HQ agent spends collecting metric data exceeds five seconds per minute. This alert might indicate an overloaded agent or a problem with the scheduling thread. These problems might already be present when this metric exceeds 3 or 4 seconds per minute.
PostgreSQL 9.x	PostgreSQL 9.x - Availability	Triggered if the availability of PostgreSQL falls below 100%.
	pg: Buffer Hits % <50% of Max	Triggered if the number of buffer hits is less than 50% of the total block read requests. (A buffer hit is a block read request that is avoided because the block is in the buffer cache.) This alert might indicate that more system memory is needed or that you need to adjust the shared buffers.
	pg: Commits per Second >20	Triggered if the number of commits to PostgreSQL is greater than 20 per second. This alert indicates that you might need to provide a durable write cache to prevent potential data loss.
	pg: Connection Usage >80% of Max	Triggered if the number of connections used is greater than 80% of the maximum number allowed. This alert indicates that you might need to increase the maximum number of available connections in order to prevent denial of service.
	pg: Memory Size changed	Triggered if the memory used by PostgreSQL falls below 90% of the baseline value. If this condition is met, the alert is triggered once every 12 hours.
SAS Environment Manager Data Mart 9.4 ACM ETL Processing	Data Mart ACM ETL	Triggered if the availability of the ACM ETL process falls below 100%.
SAS Environment Manager Data Mart 9.4 APM ETL Processing	Data Mart APM ETL	Triggered if the availability of the APM ETL process falls below 100%.
SAS Environment Manager Data Mart 9.4 Kits ETL Processing	Data Mart Kits ETL	Triggered if the availability of the kits ETL process falls below 100%.
SAS Home Directory 9.4 SAS Directory	SASWork Disk Use % > 70	Triggered if the volume that contains the SASWork directory is more than 70% full.
	SASWork Disk Use % > 95	Triggered if the volume that contains the SASWork directory is more than 95% full.
SAS Object Spawner 9.4 SAS Logical Pooled Workspace Server	Logical Pooled Workspace Server Timed Out Clients	Triggered if there are any failed connections between the logical pooled workspace server and applications that are trying to connect to the server.
	Logical Pooled Workspace Server Unauthorized Accesses	Triggered if there are any unauthorized accesses to the logical pooled workspace server.
SAS Object Spawner 9.4 SAS Logical Stored Process Server	Logical Stored Process Server Timed Out Clients	Triggered if there are any failed connections between the logical stored process server and applications that are trying to connect to the server.
	Logical Stored Process Server Unauthorized Accesses	Triggered if there are any unauthorized accesses to the logical stored process server.
SAS Object Spawner 9.4 SAS Logical Workspace Server	Logical Workspace Server Unauthorized Accesses	Triggered if there are any unauthorized accesses to the logical workspace server.
SAS Object Spawner 9.4 SAS Pooled Workspace Server	Pooled Workspace Server ERROR message in log	Triggered if an error message appears in the pooled workspace server log.
SAS Object Spawner 9.4 SAS Stored Process Server	Stored Process Server ERROR message in log	Triggered if an error message appears in the stored process server log.
SAS Object Spawner 9.4 SAS Workspace Server	Workspace Server ERROR message in log	Triggered if an error message appears in the workspace server log.
Spring Insight Application	Application error rate is high	Triggered if the application error rate for the past five minutes exceeds 10%.
SpringSource tc Runtime 6.0 Thread Diagnostics Context	Slow or Failed Request	Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 6.0 Thread Diagnostics Engine	Slow or Failed Request	Triggered if a request is taking too long or has failed, which is indicated by an entry appearing in the service’s log.
SpringSource tc Runtime 6.0 Thread Diagnostics Host	Slow or Failed Request	Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 6.0 Tomcat JDBC Connection Pool Context	JDBC Connection Abandoned	Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
	JDBC Connection Failed	Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
	JDBC Query Failed	Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
	Slow JDBC Query	Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
SpringSource tc Runtime 6.0 Tomcat JDBC Connection Pool Global	JDBC Connection Abandoned	Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
	JDBC Connection Failed	Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
	JDBC Query Failed	Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
	Slow JDBC Query	Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
SpringSource tc Runtime 7.0 Executor	Webapp Active Thread Count >250	Triggered if the number of active threads exceeds 250, which indicates heavy use. You can add additional servers to provide load balancing. The maximum number of threads allowed is 300, and the minimum is 50. If the number of active threads exceeds 300, the thread queue resets to 100, and then additional threads are refused.
SpringSource tc Runtime 7.0 Manager	Webapp Manager Rejected Sessions	Triggered if the number of rejected sessions exceeds 10% of the baseline number of sessions.
SpringSource tc Runtime 7.0 Thread Diagnostics Context	Slow or Failed Request	Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 7.0 Thread Diagnostics Engine	Slow or Failed Request	Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 7.0 Thread Diagnostics Host	Slow or Failed Request	Triggered if a record is written to the log for the service. This alert indicates that a request is taking too long or has failed.
SpringSource tc Runtime 7.0 Tomcat JDBC Connection Pool Context	JDBC Connection Abandoned	Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
	JDBC Connection Failed	Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
	JDBC Query Failed	Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
	Slow JDBC Query	Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
SpringSource tc Runtime 7.0 Tomcat JDBC Connection Pool Global	JDBC Connection Abandoned	Triggered if a JDBC connection was abandoned, identified by a “CONNECTION ABANDONED” entry in the log.
	JDBC Connection Failed	Triggered if a JDBC connection failed, identified by a “CONNECTION FAILED” entry in the log.
	JDBC Query Failed	Triggered if a JDBC query failed, identified by a “FAILED QUERY” entry in the log.
	Slow JDBC Query	Triggered if some JDBC queries are taking a long time to execute, identified by a “SLOW QUERY” entry in the log.
Spring Insight Application	Application health is degrading	Triggered if the application health metric (measured over the past five minutes) falls below 85%.
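
Several of the service alerts above are identified by a marker string in a log. The following minimal sketch shows one way such matching could work for the JDBC connection pool alerts. The marker strings and alert names come from the table, while the mapping JDBC_MARKERS, the function jdbc_alerts_for_line, and the sample log line format are assumptions made only for illustration.

```python
# Marker strings from the table, mapped to the alert that each one identifies.
JDBC_MARKERS = {
    "CONNECTION ABANDONED": "JDBC Connection Abandoned",
    "CONNECTION FAILED": "JDBC Connection Failed",
    "FAILED QUERY": "JDBC Query Failed",
    "SLOW QUERY": "Slow JDBC Query",
}


def jdbc_alerts_for_line(log_line):
    """Return the JDBC alert names whose marker string appears in the log line."""
    return [name for marker, name in JDBC_MARKERS.items() if marker in log_line]
```

For example, a hypothetical log line that contains the text SLOW QUERY would map to the Slow JDBC Query alert, and a line with none of the markers would map to no alert.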