Shared Appendix 12: Analyzing Your System

Introduction
Examining Resource Consumption Patterns
Examining Service Levels
Analyzing Performance for Applications
Summary of IT Service Vision Performance Measurements
Analyzing Data with IT Service Vision Software
Using the SAS System for Further Data Analysis

Analyzing Your System

Table of contents

Introduction

The document discusses analysis methods useful for evaluating the performance of both networks and workstations using IT Service Vision software and the SAS System.

For workstations, the chapter looks at

CPU
memory
disk I/O
network interface I/O.

For networks, the chapter examines the measurements and methods of analyzing

traffic patterns
congestion
utilization
error rates.

The chapter also describes why performance analysis needs to include an analysis of service levels. It describes the measurements used to report on service levels, such as

system and network response time
application throughput
utilization of key components
availability of hardware and software components.

In addition to these discussions, this document summarizes, in table form, the specific data sources for measurements of interest, which collectors gather these measurements, and where the data are stored in the IT Service Vision performance database.

Performance Analysis Objectives

Analyzing and understanding the performance of your computer system is a necessary first step to diagnosing, solving, and preventing performance problems. The first step in analyzing your system is to establish workload and performance baselines when the system is running smoothly so that you have healthy comparison values for your system when there is a problem. Having a clear idea of baseline values in your network and noticing significant deviations from these values helps you to address problems before they occur. Publicizing baselines enables you to clearly establish users' expectations of service levels. Finally, when you know your system's baseline values, you can predict performance at future load levels.

Service-Level Objectives

In its most basic form, the evaluation of system performance, whether for a computer network, an airline reservation system, or a five-chair beauty salon, consists of this simple process:

Monitor the level of service that the system delivers to its users.
Is the service satisfactory?
If yes, then continue. If no, make changes to the system to attempt to improve performance and monitor the level of service again.

Though this simplistic approach does not define how to improve performance, it is essential to keep this model in mind when designing a performance study. This model is also useful to remember when you prepare reports for users and management; you want to be sure to emphasize relevant measurements and avoid unnecessary reports that obscure the primary information you need to convey.

How IT Service Vision Software Aids in Performance Analysis

IT Service Vision software helps you analyze your system's performance by simplifying the mechanics (such as data cleanup, data validation, and data transformation) of managing and presenting performance data. This frees you to concentrate on selecting the evaluation techniques, measurements, and reports appropriate for your situation. IT Service Vision software helps you diagnose problems, set baseline values, perform capacity planning, and present results in easy-to-understand graphs, charts, and printed reports without programming. Because IT Service Vision software uses the SAS System to manage the data, you can also handle more complex experimental designs and perform additional sophisticated statistical analyses of your data using DATA step programming, procedures, and the SQL interface in SAS software.

Areas to Examine

A complete analysis of the performance of a computer networking system must include examining resource consumption patterns of the network and the workstations attached to it as well as the service levels delivered by the system. The following sections describe suggested measurements for tracking resource consumption and indicate how to access these measurements with IT Service Vision software.

Each section that describes the kind of analysis you might perform also includes a table to summarize useful information for each relevant data collection software product.
Table 4.9 through Table 4.14 contain this information:

the names of PDB tables that contain measurements appropriate for the analysis you are performing
brief descriptions of the tables
the names of variables that contain relevant information
brief descriptions of the variables.

These tables provide a workable but not exhaustive list of places to find useful performance measurements in IT Service Vision data. Many other measurements are also available; you can look through a more comprehensive list by selecting items on the Administration tab in the IT Service Vision window interface. In addition, your hardware vendor can provide more information about the devices and how their performance measurements are recorded.

The information in each of the small tables in the following sections is also summarized in a comprehensive table in Table 4.10

Examining Resource Consumption Patterns

You must become familiar with the logical organization of the system to draw accurate conclusions about the performance of a distributed computer system and to select what factors to change to affect its performance. When you understand the general operation of the system, you can examine many measurements internal to the operation of the system (such as queue lengths, error counts, and paging rates). These measurements indicate what happens to the system under load. Analyzing this type of data helps you to arrive at sound recommendations for change.

Performance problems with client/server applications are more complicated to investigate than those of applications that run on a single system. You need to analyze the resource consumption of both the client and server and all networks connecting them. For example, if you receive complaints about poor response time at a workstation that uses NFS to access files on a remote file server, you might ask these questions to investigate the problem:

Is the workstation CPU constrained?
Is the file server CPU constrained?
Is the workstation memory constrained?
Is the file server memory constrained?
Is the workstation's disk causing the delay?
Is the file server's disk causing the delay?
Is the NFS client causing the delay?
Is the NFS server causing the delay?
Are any of the network segments between the workstation and the file server causing the delay?
Are the routers or hubs connected to the network segments causing the delay?
For each of these questions answered affirmatively, is there any way to avoid the delay?

IT Service Vision software can help you analyze the problem by helping with data reduction and data presentation.

Data Reduction

Consider this example. To get an overview of distributed analysis, you could examine

CPUs, memory, disk I/O, and NFS information for an HP workstation (collected by HP Measureware) and a Sun file server (collected by SunNet Manager)
network segment utilization (collected by TRAKKER)
router/hub utilization (collected by SPECTRUM or SunNet Manager).

These collectors produce a voluminous quantity of data, and the measurements are difficult to correlate because the time stamps used by the collectors are different. IT Service Vision software addresses these problems by summarizing and reducing the data volume and by using standardized time stamps in the performance database for the data from all supported collectors.

Data Presentation

IT Service Vision software also provides example reports and an interactive report builder to facilitate exploring the data after it is in your PDB. IT Service Vision formula variables can be used to calculate measurements that are not native to the collected data. For example, your data collection software may capture the number of packets in and the number of error packets; then, you can calculate the rate of errors=packetsIn/error_packets instead of reporting either of the collected variables.

Most SNMP data are reported in a data type called COUNT. These counters steadily accumulate the number of occurrences of an event such as packets out, packets in, or collisions. Because graphs of count data do not clearly show trends in the data, IT Service Vision software automatically converts count values to rates per second by subtracting the value of the measurement in the previous interval record from the current value and dividing by the length of the interval (DURATION). IT Service Vision software performs this conversion before it records the data in the PDB and displaces the old count value. You can plot or print these calculated rates without further processing.

The following sections describe measurements from the collectors supported by IT Service Vision software that help you analyze complex heterogeneous computer networks, workstations, and the applications that run on them.

Workstation

To analyze the performance of a distributed computing network, you must analyze the individual workstations on the network. The four hardware resources that experience contention and thus affect performance of the workstation are the CPU, memory, disk, and network interface. In addition, major subsystems used by an application, such as NFS, often have performance measurements recorded for them individually. Thus, you can examine these subsystems separately to determine if they are a source of delay in the system.

Many of the devices and collectors supported by IT Service Vision software provide measurements that are appropriate for workstation performance analysis. Several examples of these measurements are discussed in the following sections.

CPU

To analyze the CPU for a workstation, consider these measurements:

CPU busy rate
average ready-queue length (the load average)
context switches.

CPU Busy Rate

CPU utilization is easiest to understand when it is reported as the average percent of time that a CPU is busy over some time interval. In SunNet Manager's host performance table (table SNMHPF, variable HPFCPUP), this measurement is a gauge, that is, a supposedly instantaneous reading of how busy the CPU is at a particular moment (like a speedometer). But, a CPU is either 100% busy or 100% idle at any given instant, so a true gauge reading is not very useful. Consequently, this measurement is usually a moving average calculated over a time interval that can range from a few seconds to a few minutes. The exact size of the interval varies from system to system and is not available to the monitor, so treating the CPU busy rate as a gauge is the practical compromise.

The rate at which the CPU is busy is not directly available from some collectors; they supply only the number of CPU seconds used in an interval. You can create a formula variable to calculate the CPU busy rate as CPU seconds divided by DURATION (the variable names vary with different collectors). For HP Measureware, use variable GLBCPSC from table PCSGLB to calculate the CPU busy rate:

cpubusy=glbcpsc/duration

Average Ready-Queue Length

Another commonly available measurement for the CPU is average ready-queue length. The ready-queue length, which is also called load average, measures the number of processes waiting to get CPU cycles. An increase greater than two or three in the length of the ready queue may indicate an overloaded CPU. Like CPU busy, the average ready-queue length is a moving average. Unlike CPU busy, you know the length of the interval for the moving average. Intervals are generally set for 1, 5, and 15 minutes for calculating the moving average of the queue length. The interval is part of the name of the statistic. For example, SunNet Manager collects ready-queue length statistics in its host performance table, SNMHPF, in the variables HPFAV1, HPFAV5, and HPFAV15.

Context Switches

An increase in the number of context switches in a system may indicate an overcommitment of some resource. Context switches can be correlated to other utilization measurements (such as disk I/O) to identify the problem resource. This measurement appears in SunNet Manager's host performance table (table SNMHPF, variable HPFCSWT).

Summary Table

Table 4.1 summarizes the measurements for analyzing workstation CPUs.

Table 4. 1 Measurements for Analyzing Workstation CPUs

Data
  Collection       PDB Table   Table          Variable
  Software         Name        Description    Name        Variable Description

CPU

  SunNetMgr        SNMHPF      hostperf.data  HPFCPUP     Percent CPU busy

                                              HPFAV1      CPU queue length (1-min avg)

                                              HPFAV5      CPU queue length (5-min avg)

                                              HPFAV15     CPU queue length (15-min avg)

                                              HPFCSWT     Context switches

  HP Measureware   PCSGLB      global data    GLBCPSC     CPU seconds

  PROBE/Net        PRXCPU      CPU            CPUAV1      CPU queue length (1-min avg)

                                              CPUAV5      CPU queue length (5-min avg)

                                              CPUAV15     CPU queue length (15-min avg)

                                              CPUIDLP     Percent CPU idle

Memory Utilization

Another resource that experiences contention and can affect workstation performance is memory utilization. Generally, performance data logs indirectly report memory utilization as average paging rates. An exception is HP Measureware, which provides memory wait times for processes and applications (tables PCSAPP and PCSPRO, variables APPSVM and PROSVM) and the queue length of the memory subsystem (table PCSGLB, variable GLBMEMQ). For other collectors, you can detect problems with memory overcommitment by tracking paging rates. Suspect a memory shortage if the rates become significantly higher than normal. Paging measurements are available in SunNet Manager's host performance table (table SNMHPF, variables HPFPGI, HPFPGO, HPFPSWI, and HPFPSWO). The PROBE/Net buffer cache table contains measurements indicating the effectiveness of the cache on the system (table PRXBUF, variables BUFRCAC and BUFWCAC).

Table 4.2 summarizes the measurements for analyzing workstation memory.

Table 4. 2 Measurements for Analyzing Workstation Memory

Data
  Collection      PDB Table   Table              Variable
  Software        Name        Description        Name        Variable Description

MEMORY

     Accton       ACCESA      pacct data         ACCKMIN     Process memory use
                                                             (K-core min)

                                                 ACCMEMA     Process average memory use


                  ACCHP7      pacct data         ACCKMIN     Process memory use
                                                             (K-core min)

                                                 ACCMEMA     Process memory use

                  ACCRS6      pacct data         ACCKMIN     Process memory use
                                                             (K-core min)

                                                 ACCMEMA     Process average memory use

                  ACCSUN      pacct data         ACCKMIN     Process memory use
                                                             (K-core min)

                                                 ACCMEMA     Process average memory use

  HP Measureware  PCSAPP      application data   APPSVM      Application memory wait time

                  PCSPRO      process data       PROSVM      Process memory wait time

                  PCSGLB      global data        GLBMEMQ     Memory queue length

  SunNetMgr       SNMHPF      hostperf.data      HPFPGI      Pages read in

                                                 HPFPGO      Pages written out

                                                 HPFPSWI     Pages swapped in

                                                 HPFPSWO     Pages swapped out

  PROBE/Net       PRXBUF      buffer cache       BUFRCAC     cache hit ratio

                                                 BUFWCAC     write cache hit ratio

                  PRXSYS      system statistics  SYSPINR     page in rate

                                                 SYSPOTR     page out rate

Disk I/O

Disk I/O, the level of disk activity, is another significant factor in workstation performance. Disk I/O measurements are generally recorded in performance data as inputs and outputs per second rather than as percentages. Measurements that report on the total disk activity of all disks on a host are available in SunNet Manager's host performance data (table SNMHPF, variable HPFDISK). Measurements that report on disk activity by disk volume appear in SunNet Manager's disk I/O table (table SNMIOD) and HP Measureware's global disk metrics table (table PCSGDK). You might also want to examine data on seek times. When seek times increase, they may indicate that data on the disk is fragmented. For this information, also check SunNet Manager's disk I/O table (table SNMIOD).

Table 4.3 summarizes the measurements for analyzing workstation disk I/O.

Table 4. 3 Measurements for Analyzing Workstation Disk I/O

Data
  Collection      PDB Table   Table                Variable
  Software        Name        Description          Name       Variable Description

Disk I/O

  SunNetMgr       SNMHPF      hostperf.data        HPFDISK    Total system disk I/O

                  SNMIOD      iostat.disk          IOD*       Multiple variables
                                                              pertaining to disk I/O
                                                              (seek times, read/write
                                                              rates and times, and so on)
                                                              per disk volume

  PROBE/Net       PRXDSK      disk information     DSK*       Multiple variables
                                                              pertaining  to disk I/O
                                                              (seek times, read/write
                                                              rates and times, and so on)
                                                              per disk volume.


                  PRXFBL      I/O load             FBL*       Multiple variables
                                                              pertaining  to disk I/O
                                                              (seek times, read/write
                                                              rates and times, and so on)
                                                              per disk volume.


  HP Measureware  PCSGDK      global disk metrics  GDK*       Multiple variables
                                                              pertaining  to disk I/O
                                                              (seek times, read/write
                                                              rates and times, and so on)
                                                              per disk volume.

Network Interface I/O

Network interface I/O can also affect workstation performance. Large mainframe systems typically perform work on one host, access data on disks attached to that system, and use the network only to handle terminal traffic. Open computer systems depend heavily on their network interface to service disk data requests and to perform terminal I/O. Client/server applications and distributed processing, in general, have made it important to monitor both the network itself and the network interface on the processor. Network interface I/O rates appear in several collectors' data.

Note: In this and following discussions of SNMP data, you will see the symbol * as the prefix for some table names. These are table definitions for those SNMP data collectors (SunNet Manager, HP Network Node Manager, and IBM Network for AIX) that group the SNMP metrics they collect according to the MIB table definitions. IT Service Vision contains similar table and variable definitions for the tables of these collectors. These tables are labeled "SNMP data collector" below. For SNMP data gathered by SunNet Manager, substitute "S" for the prefix to the table names. For HP Network Node Manager and IBM Netview for AIX, substitute "H" for the prefix.

Table 4.4 summarizes the measurements for analyzing network interface I/O.

Table 4. 4 Measurements for Analyzing Network Interface I/O

Data
  Collection      PDB Table   Table                Variable
  Software        Name        Description          Name        Variable Description

Network
  Interface
  I/O 

  SunNetMgr       SNMHPF      hostperf.data        HPFIPKT     Input packets

                                                   HPFOPKT     Output packets

                  SNME*       Etherif.*                        Multiple variables

                  SNML*       layers.*                         Multiple variables

                  SNMT*       traffic.*                        Multiple variables

  HP Measureware  PCSGLN      global network       GLN*        Multiple variables
                              interface

Network

Client/server applications rely heavily on network resources, so when you analyze the performance of these applications, you must examine the utilization of the network as well as the workstation to investigate performance problems with these applications. To analyze network performance, you monitor

network traffic patterns (who is talking to whom?)
utilization (how busy are the lines?)
congestion (is the network moving data fast enough to avoid retranmissions?)
error rates (how often does something go wrong?).

If you periodically examine measurements in each of these areas, you can establish baseline values that you can use to detect and address problems before they affect performance. To more readily understand the cause of the problem, you can also compare these baseline values to the values you gather when a problem occurs.

Many of the devices and collectors supported by IT Service Vision software provide network performance measurements. Several examples of these measurements are discussed in the following section.

Examining Network Traffic Patterns

You can gain insight into the source of network performance problems by examining the type of traffic on a network segment as well as the quantity of traffic. Two types of traffic that are monitored in MIB-II data are broadcast and non-broadcast (also called unicast) packets. Broadcast packets are addressed to every node; non-broadcast are addressed to only one node. A high ratio of broadcast to non-broadcast packets on a network could indicate that network segment bandwidth is being wasted by inappropriately forwarded broadcast messages. In a transparent or source-route bridged environment, broadcasts must be forwarded to all segments. However, many Ethernet networks can restrict broadcast message propagation and free considerable bandwidth with the judicious addition of intelligent routers or hubs. In addition to restricting the range of the broadcast packets, you can recover considerable useful bandwidth by investigating hosts that are generating excessive broadcasts. You may then be able to stop or slow the rate of packet generation by these hosts.

The SNMP MIB-II interface table (*N2IFT) contains packet counts for broadcast or nonbroadcast packets To convert the input variables to rates, divide the variable by the sum of the broadcast and nonbroadcast input packets; for output variables, divide by the sum of the output packets. These measurements are also available in TRAKKER and SPECTRUM data.

Table 4.5 summarizes the measurements for analyzing network traffic.

Table 4. 5 Measurements for Analyzing Network Traffic

Data
  Collection   PDB Table   Table                Variable
  Software     Name        Description          Name        Variable Description

Network Traffic
   Patterns 

  SNMP data   *N2IFT       ifInterfaces         IFIUCP      Input unicast
  collector                                                 packets/second

                                                IFINUCP     Input broadcast
                                                            packets/second

                                                IFOUCPT     Output unicast
                                                            packets/second

                                                IFONCTP     Output broadcast
                                                            packets/second


  SPECTRUM    C*          <device               CSILOAD     Percent of bandwidtch
                           tables>                          utilized


              CCI*        <Cisco                LOCIBSC     Input bits/second
                           router                              (5-min avg)
                           tables>


                          <Cisco                LOCOBSC     Output bits/second
                           router                              (5-min avg)
                           tables>


                          <Cisco                LOCIPSC     Input pkts/second
                           router                              (5-min avg)
                           tables>


                           <Cisco               LOCOPSC     Output pkts/second
                            router                             (5-min avg)
                            tables>


  TRAKKER     TKRSTS       segment              IPOSPR      IP off-segment pkts
                           statistics                       received/second

                                                IPOSPX      IP off-segment pkts
                                                            xmitted/second

                                                IPPKTS      IP packets/second

                                                IPTRAN      IP transit packets/second

Note: In this table, the symbol * is the prefix for table names. For SunNet Manager MIB-II SNMP data tables, substitute S for the prefix to the table names for IBM NetView and HP-Network Mode Mgr, substitue H..

Monitoring Utilization

Utilization is another measurement of performance for you to monitor to help analyze the network. You monitor the utilization of network segments and links to establish performance baselines, to compare individual samples to established baselines, and to assist in trend analysis and capacity planning. Use SNMP MIB-II interface I/O table (*N2iFT) to calculate link utilization. These measurements give the number of bytes sent and received on an interface. You can calculate other useful values from these variables.

To track the total number of bytes on the interface, add IFIOCTS and IFOOCTS.
To calculate the utilization of the link as a percentage of its capacity, multiply the sum of IFIOCTS and IFOOCTS by eight (to convert from bytes to bits) and divide by the line speed (table <*>IFS, variable IFSPEED).
To determine the utilization of the entire segment from MIB-II data alone, add IFIOCTS and IFOOCTS for all interfaces on a network segment. TRAKKER software collects data on all packets on a network segment; thus, this addition is unnecessary for TRAKKER data.

Table 4.6 summarizes the measurements for analyzing network utilization.

Table 4. 6 Network Utilization

Data         PDB
  Collection   Table                        Variable
  Software     Name     Table Description   Name        Variable Description

Utilization

  SNMP data    *N2IFT   ifInterfaces        IFIOCTS     Input  octets/second
  collector                                 IFOOCTS     Output octets/second
                                            IFSPEED     Interface speed


  SPECTRUM     C*       <device             CSILOAD     Percent of bandwidth
                        tables>                         utilized

               CCI*     <Cisco              LOCIBSC     Input bits/second
                        router                             (5-min avg)
                        tables>

                        <Cisco              LOCOBSC     Output bits/second
                        router                             (5-min avg)
                        tables>

                        <Cisco              LOCIPSC     Input pkts/second
                        router                             (5-min avg)
                        tables>

                        <Cisco              LOCOPSC     Output pkts/second
                        router                             (5-min avg)
                        tables>

  TRAKKER      TKRSTS   segment             DLLBYTS     Segment bytes/second
                        statistics

Note: In this table, the symbol * is the prefix for table names. For SunNet Manager MIB-II SNMP data tables, substitute S for the prefix to the table names. For HP-OV, substitute H.

Detecting High Error Rates

High error rates on a network segment have much the same effect on network performance as high utilization: they cause slow response time, low throughput, and, ultimately, congestion. Errors are reported for all protocols and at different protocol levels (for example, the interface level, IP level, and TCP level). Errors for different protocols are counted separately. That is, a UDP error does not affect the IP error counter. Because errors are counted separately, higher layers in the protocol do not necessarily count errors detected at the lower layers. For example, an IP address error is not counted in the TCP table (unless it causes a TCP retransmission to occur).

Errors generally appear as the rate of error packets per second in IT Service Vision data sets. You can monitor and compare these rates to baseline values to detect changes, or you may find it useful to convert these rates to a percentage of the total packets handled. For example, encountering 1000 error packets when 2000 packets were sent (50 percent errors) indicates a problem, but the same rate of errors, 1000 errors, is probably not significant if a million packets were sent.

Error counts on an interface are available in SNMP MIB-II interface tables (*N2iFT). Error percentage rates are also available as formula variables (ifInErrors/(IfInUCastPkts+ifInNUCastPkts) and ifOutErrors/(ifOutUCastPkts+ifOutNUCastPkts)). A high interface input error percentage may indicate a problem with the network software on a node on the network (for example, sending packets that are too large or too small) or transmission

clocking issues. Interface output errors generally indicate a problem with the physical network medium.

At the IP level, you can use bad input header and address error rates to determine if datagrams were discarded because of bad headers or an invalid destination address. These measurements are in SNMP MIB-II IP input table (*N2IP_). To convert these values to rates, divide them by the total datagrams received (ipInAddrErrors/ipInDelivers and ipInHdrErrors/ipInDelivers).

ICMP error rates appear in SNMP MIB-II ICMP tables (*N2ICM). To convert these rates to percentages, divide them by the total ICMP messages (icmpInErrors/icmpInMsgs or icmpOutErrors/icmpOutMsgs). If you compare the ICMP error percentage to the IP error percentage, you can determine how many of the errors were caused by ICMP-detected errors and how many had other causes. This may help you isolate the cause of the errors.

TCP error rates appear in SNMP MIB-II TCP table (N2TCP). To convert these variables to percentages, divide them by the total TCP segments (tcpInErrs/tcpInSegs or tcpRetransSegs/tcpOutSegs).

UCP error rates appear in SNMP MIB-II UDP table (*N2UDP). To convert these variables to percentages, divide them by the total UDP datagrams ((udpInErrors+udpNoPorts)/udpInDatagrams).

EGP error rates appear in SNMP MIB-II EGP table (*N2EGP). To convert these variables to percentages, divide them by the total EGP messages (egpInErrors/egpInMsgs or egpOutErrors/egpOutMsgs) and (egnInErrors/egnInMsgs or egnOutErrors/egnOutMsgs). The egpNeighTable (*N2EGN) provides rates of error messages by originator and, thus, can be used to isolate which EGP neighbor is causing the errors.

SNMP MIB-II SNMP groups (*N2SNP) count many different kinds of errors that are encountered when sending and receiving SNMP packets. To convert these rates to percentages, divide them by the total SNMP packet rate (SNMPINPKTS or SNMPOUTPKTS).

Table 4.7 summarizes the measurements for analyzing errors on the network.

Table 4. 7 Measurements for Analyzing Network Errors

Data         PDB
  Collection   Table                        Variable
  Software     Name     Table Description   Name      Variable Description

Error
  Rates

  SNMP data    *N2IFT   ifInterfaces        IFIERRS   Input errors/second
  collector                                 IFIUCP    Input  unicast packets/sec
                                            IFINUCP   Input  broadcast packets/sec
                                            IFOERRS   Output errors/second
                                            IFOUCPT   Output unicast packets/sec
                                            IFONCTP   Output broadcast packets/sec

  SNMP data    *N2IP_   ip                  IPIADER   Input address errors/sec
  collector                                 IPIHDER   Input header errors/sec
                                            IPIDELV   Input packets delivered/sec

  SNMP data    *N2ICM   icmp                ICIERRS   Input  errors/sec
  collector                                 ICIMSGS   Input  messages/sec
                                            ICOERRS   Output errors/sec
                                            ICOMSGS   Output messages/sec

  SNMP data    *N2TCP   tcp                 TCPIERR   Input  errors/sec
  collector                                 TCPORST   Retransmissions/sec
                                            TCPINSG   Input  segments/sec
                                            TCPOTSG   Output segments/sec

  SNMP data    *N2UDP   udp                 UDPIERR   Input  errors/sec
  collector                                 UDPINDG   Input datagrams/sec

  SNMP data    *N2EGP   egp                 EGPIERR   Input  errors/sec
  collector                                 EGPIMSG   Input  messages/sec
                                            EGPOERR   Output errors/sec
                                            EGPOMSG   Output messages/sec

  SNMP data    *N2EGN   egn                 EGNIERR   Input  errors/sec
  collector                                 EGNIMSG   Input  messages/sec
                                            EGNOERR   Output errors/sec
                                            EGNOMSG   Output messages/sec

  SNMP data    *N2SNP   snmp                SNPIPKT   Input  packets/sec
  collector                                 SNPI*     Multiple variables
                                                      pertaining to various
                                                      SNMP input errors/second
                                            SNPOPKT   Output packets/sec
                                            SNPO*     Multiple variables
                                                      pertaining to various
                                                      SNMP output errors/second

  SPECTRUM     C*       <device             CSISERT   Soft error rate second
                        tables>

                                            IPIADER   Input address errors/second


                                            IPIHDER   Bad header errors/second

                                            CSIPKRT   Packet rate/second


                                            IPIDELV   Datagrams delivered/second

  TRAKKER      TKRSTS   segment             DLLEFRM   Ethernet frames/second
                        statistics

                                            ELLERRS   Errors/second

Note:
In this table, the
symbol * is the prefix for table names.  For
SunNet Manager MIB-II SNMP data tables, substitute SM2 for the
prefix to the table names.  Use H for HP-OV data.

Finding Possible Congestion

Monitoring network congestion is especially effective when you analyze network performance. Congestion in a network is defined as the point at which throughput on a segment or link is significantly degraded because data cannot be transmitted fast enough to avoid retransmission. Detecting congestion on a network is similar to detecting overutilization of a CPU. For networks, to locate congestion you monitor the length of the output queues associated with the network interfaces of the devices attached to that segment. Look for these two signs of congestion:

a rise in the number of input packets discarded before transmission. This usually occurs when the receiving system is too busy to process the packets before its buffers fill up.
a rise in the number of output packets retransmitted. This usually occurs because the packet never reached its destination.

Network congestion is even more problematic than CPU overload. If a packet waits too long, is discarded, or is destroyed by a collision, it is retransmitted by higher-level software. This multiplies the load on the already-overloaded network.

Congestion is a symptom of a problem in the network rather than a cause of poor response time or low throughput. It is usually caused by overutilization or by error rates that are too high.

The number of packets discarded is available in SNMP MIB-II interface tables (table *N2IFT, variables in In Discards and if Out Discards). Interface output queue length table *N2TP variable IFOQLEN) and the number of retransmissions in a TCP/IP network (table *N2TCP, variable TCPRTSG) are also available in SNMP MIB-II data.

SPECTRUM data provides information on discards and retransmitted segments. Table 4.8 summarizes the measurements for analyzing congestion on the network.

Table 4. 8 Measurements for Analyzing Network Congestion

Data         PDB
  Collection   Table                        Variable
  Software     Name     Table Description   Name      Variable Description

Congestion

  SNMP data   *N2IFT   ifInterfaces        IFODISC    Discards on output
  collector                                IFIDISC    Discards on input
                                           IFOQLEN    Output queue length

                        tcp                IFODISC    Discards on output
                                           TCPRTSG    Segments retransmitted
                                           TCPRALG    Retransmission algorithm



  SPECTRUM     C*       <device         CSIIIND    Interface input discards
                        tables>

                                           CSIIOTD    Interface output discards

                                           IPIDISC    IP  discards

                                           TCPRTSG    Segments retransmitted

                                           TCPOTSG    TCP  segments transmitted

Note: In this table, the symbol * is the prefix for table names. For SunNet Manager MIB-II SNMP data tables, substitute SM2 for the prefix to the table names. Use H for HP-OV data.

Examining Service Levels

Service-level reports can be used to monitor the effectiveness of any changes you make to your computer system as well as to establish an explicit service-level agreement with your user community. Though you may not have a formal, written service-level agreement with users at your site, an idea of what constitutes acceptable throughput and responsiveness of a system always exists in the minds of the users. Service-level reports clarify the expectations of all concerned and provide concrete proof to your customers of the progress you have made in improving performance.

System service levels are generally reported in four areas:

availability. Is the system up?
responsiveness. Are customers' requests being filled quickly?
throughput. Is the necessary work getting done?
utilization. Are major (expensive) system components being used efficiently?

These measurements can also be reported to your users for the system as a whole or for a specific application.

Many of the devices and data collection software supported by IT Service Vision software provide measurements that are appropriate for service-level reporting. The following sections discuss several examples of these measurements.

System Availability

System availability is reported as the percentage of time the system (or a major component of the system) is available to service the users' requests. SunNet Manager ping status (table SNMPGS, variable PGSRCHA) is a good source of availability information because the ICMP request acknowledge packet indicates whether a specific network device (workstation, router, and so on) is available and also shows whether it is reachable over the network. HP Measureware configuration information (table PCSCFG, all variables) contains information about the existence of hardware components at certain times (such as during a system boot) rather than periodic status reports on the components. Thus, HP Measureware configuration information can be used for inventory and maintenance contract purposes. Table 4.9 summarizes the measurements for analyzing availability.

Table 4. 9 Measurements for Analyzing Availability

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

Availability

  SunNetMgr       SNMPGS   ping.stats          PGSRCHA   Reachable

  HP Measureware  PCSCFG   configuration       CFG*      Multiple variables
                           data                          showing the status of
                                                         various hardware
                                                         components at boot

  SPECTRUM        C*       <device tables>     CSICNST   contact status

System Response Time

System response time is the amount of time between the completion of a request for service and the system's response to that request. For interactive work on open systems, this may be defined as the time between the user pressing a mouse button or return key and the system appearing to respond. Whether response time should be measured to the first or last character of the response is sometimes hotly debated in CPE circles, but you rarely have the choice of whether output time is included in the response time value in open systems networks; in addition, the first or last character is irrelevant for windowing applications that buffer all characters of the response and rewrite the entire screen at once.

Response times are generally reported for four types of work in your system:

command response time
user application response time
noninteractive (batch) turnaround time
individual device response time.

Response Time for Specific Commands

Response times for individual commands (or processes) are widely available in UNIX systems in accton process data (IT Service Vision tables prefixed ACC, variable name ACCETM) and in HP Measureware process records (table PCSPRO, variables PRO1RSP and PRORUNT). Monitoring response time for individual commands does convey some information about the level of service being offered, but it is difficult to set up a consistently repeatable test in a heterogeneous, multitasking operating system environment. For example, accton data might be used to track the usage for large applications, such as the SAS System. But the responsiveness of the application may depend on external factors such as the file system or the network. This limitation is addressed somewhat by combining process response data with network response data generated by SNMP ping requests from SunNet Manager (table SNMPGS, variables PGSTAVG, PGSTMAX, and PGSTMIN).

Response Time for User-Defined Applications

Though not as commonly available as process elapsed-time data, response-time measurements associated with a particular, user-defined application are even more informative. These applications are usually more homogeneous, and you are probably more familiar with them. Statistics for these applications reflect both command response time within the application as well as response times for user-defined transaction boundaries, which are typically of more interest to your customers than that of the individual commands. Application boundaries allow data records to be logged at informative times within an application. For example, each PROC or DATA step within a SAS session is an application boundary. Transaction response times appear in HP Measureware application records (table PCSAPP, variables APP1RSP , APPURSP , and APPRUTM).

Turnaround Time for Noninteractive Job

For noninteractive work (such as an application invoked by a cron task), response time equates to turnaround time, the time between submitting the command script and its completion. Because true batch mode is not native to UNIX environments, measurements that show turnaround time are not directly available.

You can calculate the approximate total turnaround time used by a script by gathering and summing accton process accounting data for the commands used in the batch script (variable ACCETM in tables ACCESA, ACCHP7, ACCRS6, and ACCSUN), but a command that forks may hide some resource usage. To overcome this problem, you can use special userids for each script, but this can become an administrative burden. Using a monitor (such as HP Measureware) that enables you to associate an application ID with certain commands is a cleaner solution. HP Measureware application data (table PCSAPP) provides information on application response and process time (variables APP1RSP, APPURSP and APPRUTM).

Response Time for Devices

Response time can also mean the time it takes for various components within a system to respond to a request. For example, the response time of a disk drive is the average length of time it takes for the drive to respond to an I/O request. This type of internal measurement is useful for analyzing the performance of your system and locating bottlenecks, but it is not usually meaningful to your users. Disk response time appears in PROBE/Net disk information records (table PRXDSK, variable DSKRESP).

Summary Table

Table 4.10 summarizes the measurements for analyzing response time.

Table 4. 10 Measurements for Analyzing Response Time

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

Response
  Time 

  Accton          ACCESA   pacct data         ACCETM    Process elapsed time

                  ACCHP7   pacct data         ACCETM    Process elapsed time

                  ACCRS6   pacct data         ACCETM    Process elapsed time

                  ACCSUN   pacct data         ACCETM    Process elapsed time

  HP Measureware  PCSAPP   application data   APP1RSP   Application first
                                                        response

                                              APPURSP   Application transaction
                                                        response

                                              APPRUTM   Application runtime

                  PCSPRO   process data       PRO1RSP   Process first response

                                              PRORUNT   Process runtime

  SunNetMgr       SNMPGS   ping.stats         PGSTAVG   Average roundtrip time

                                              PGSTMAX   Maximum roundtrip time

                                              PGSTMIN   Minimum roundtrip time

  PROBE/Net       PRXDSK   disk info          DSKRESP   Disk response time

System Throughput

Another service level of interest is system throughput. System throughput is the rate at which requests for work are serviced by a system. Up to a certain point, throughput generally increases as the workload on a system increases. Eventually, your system reaches the point where the overhead of managing additional tasks causes less work to be accomplished.

You monitor throughput for two primary reasons:

to track and possibly charge for the work being done.
to determine when the overload to the system causes less work to be accomplished. When you locate this point, you want to adjust the load accordingly to avoid excess overhead.

In addition to balancing the throughput to avoid excessive overhead, you must also keep track of the response time. Unfortunately, throughput and response time are almost always inversely related, so improving one degrades the other. You can determine the point at which response time is not significantly degraded but throughput is still kept high by plotting the measurements.

Examples of throughput measurements are readily available in open systems networks. They include accton process accounting data (tables prefixed ACC, nearly all variables), HP Measureware process and application records (tables PCSPRO and PCSAPP, nearly all variables), and PROBE/Net process disk access records (table PRXPDR, variables DPRPCNT, DPRRDS, and DPRWRTS).

Table 4.11 summarizes the measurements for analyzing throughput.

Table 4. 11 Measurements for Analyzing Throughput

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

Throughput

  Accton          ACCESA    pacct data         ACCUTM     Process user CPU time


                                               ACCSTM     Process system CPU time

                                               ACCRW      Process disk I/O

                  ACCHP7    pacct data         ACCUTM     Process user CPU time

                                               ACCSTM     Process system CPU time

                                               ACCRW      Process disk I/O

                  ACCRS6    pacct data         ACCUTM     Process user CPU time

                                               ACCSTM     Process system CPU time

                                               ACCRW      Process disk I/O

                  ACCSUN    pacct data         ACCUTM     Process user CPU time

                                               ACCSTM     Process  system CPU time

                                               ACCRW      Process disk I/O

  HP Measureware  PCSPRO    process data       PROCPSC    Process CPU time

                                               PRODKTO    Process disk I/O

                  PCSAPP    application data   APPCPSC    Application CPU seconds

                                               APPCTNO    Application process count

                                               APPDKIO    Application disk I/O

                                               APPTRNO    Application transactions

  PROBE/Net       PRXDPR    process disk       DPRPCNT    Process percent disk ops

                                               DPRRDS     Process disk reads

                                               DPRWRTS    Process disk writes

System Utilization

System utilization of a resource can be either

the fractional usage of a resource that can be partially occupied (for example, memory)
the ratio of the time that the resource is servicing requests to the total elapsed time (for resources that cannot be partially busy, for example, a CPU).

Why should you examine utilization as part of service-level reporting? Response time and throughput are clearly evident to your users; therefore, it makes sense to report them when you describe the service the system delivers. However, utilization is usually invisible to users of the system.

The reason for analyzing utilization is to maximize the use of the system and provide this information to the people who purchase the equipment. Keeping system resources busy (especially the expensive ones) is a high priority to those who pay for the resources. You also need to measure utilization to be able to effectively plan capacity. These measurements help you to locate bottlenecks and to project the effect of future workloads on response time and throughput. Monitoring utilization of major system components also helps you decide what additional acquisitions or system design changes are appropriate to improve system performance.

The most appropriate utilization measurements to monitor are those for the expensive, well-known system components, such as CPU utilization, disk I/O rates for a server, or bytes per second on a network communication line. Measurements for smaller components, such as port utilization on a router, are not generally as appropriate to include in service-level reports to users and management. Examples of utilization measurements that are appropriate to track for an audience of users and management are SunNet Manager host performance data (table SNMHPF, variables HPFCPUP and HPFDISK) and SNMP MIB-II data on a hub or router (table *N2IFT, variables iFiOCTS/iFSPEED and iFOOCTS/iFSPEED).

Utilization data is usually reported as rates of usage. A rate does not convey the capacity of the device, however, so it is not clear what the rate really means. One method of analyzing this data is to calculate the percentage of the device's top speed. You can convert rates to percentages by defining a IT Service Vision formulat variable that divides the rate by the nominal capacity of the device or line. You may prefer to record baseline values for these rates and then simply compare rates to the baseline values on a regular basis to spot exceptions or new trends in the rates.

Table 4.12 summarizes the measurements for analyzing utilization.

Table 4. 12 Measurements for Analyzing Utilization

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

Utilization

  SunNetMgr       SNMHPF   hostperf.data       HPFCPUP   Percent CPU busy

                                               HPFDISK   Disk I/O's/second

  SNMP data       *N2IFT   ifInterfaces        IFOOCTS   Output octets/second
  collector
                                               IFIOCTS   Input  octets/second

                                               IFSPEED   Line speed of interface

  TRAKKER         TKRSTS   segment statistics  DLLBYTS   Segment bytes/second

Note: In this table, the symbol * is the prefix for table names. For SunNet Manager MIB-II SNMP data tables, substitute SM2 for the prefix to the table names. Use H for H P-OV data.

User Applications

For most users, the most interesting indicator of service is the performance (the responsiveness and throughput) of the application they use. For many users, the performance of the application is more important than system service levels. Specific measurements for user applications are collected by HP Measureware (table PCSAPP, variables APP1RST and APPRUTM for response time and variables APPCPSC, APPCTNO, APPDKIO, and APPTRNO for throughput). Other collectors record process or program service levels that you can use to calculate application service levels.

Table 4.13 summarizes the measurements for analyzing user applications.

Table 4. 13 Measurements for Analyzing User Applications

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

User
  Applications

  Accton          ACCESA,   pacct data         ACC*      Multiple variables
                  ACCHP7,                                pertainint to resource
                  ACCRS6,                                consumption (CPU
                  ACCSUN                                 seconds, disk I/O/s,
                                                         net I/O's, memory
                                                         utilization)

  HP Measureware  PCSPRO    process data       PRO*      Multiple variables
                                                         to resource
                                                         consumption (CPU
                                                         seconds, disk I/O's, net
                                                         I/O's, memory utilization)

                  PCSAPP    application        APP*      Multiple variables
                            data                         pertaining to resource
                                                         consumption (CPU seconds,
                                                         disk I/O's, net I/O's,
                                                         memory utilization)

  PROBE/Net       PRXDPR    process disk       DPRPCNT   Process percent
                            access                       disk ops

                                               DPRRDS    Process disk reads

                                               DPRWRTS   Process disk writes

                  PRXPGM    program            PGM*      Multiple variables
                            information                  pertaining to resource
                                                         consumption program

  TRAKKER         TKRDLL    Data link table    APPBYTS   Application bytes/second

                                               APPPKTS   Application packets/second

Analyzing Performance for Applications

You can monitor the resource consumption of applications and individual processes on your system to determine which applications and processes consume the most resources. These high-resource applications are good targets for further analysis. Improving these high-resource applications can significantly improve system response time.

Information on processes is recorded in accton process data (tables ACC*) and in HP Measureware process records (table PCSPRO). Information on user-identified applications is recorded in HP Measureware application records (table PCSAPP).

In addition to general purpose process and application information, some data collection software records performance measurements for system applications such as NFS. SunNet Manager records performance information on NFS in its remote procedure call tables (table SNMRNC and table SNMRPS). HP Measureware records this information in its global table (table PCSGLB, variable GLBNRQ). You can compare the values of the call rate variables in these tables to known baseline values to see if NFS is busier than normal when a problem occurs.

Table 4.14 summarizes the measurements for analyzing system applications.

Table 4. 14 Measurements for Analyzing System Applications

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

User
  Applications

  Accton          ACCESA,  pacct data          ACC*      Multiple variables
                  ACCHP7,                                pertaining to resource
                  ACCRS6,                                consumption (CPU seconds,
                  ACCSUN                                 disk I/O's, net I/O's,
                                                         memory utilization)

  HP Measureware  PCSPRO   process data        PRO*      Multiple variables to
                                                         resource consumption (CPU
                                                         seconds, disk I/O's, net
                                                         I/O's, memory utilization)

                  PCSAPP   application data    APP*      Multiple variables
                                                         pertaining to resource
                                                         consumption (CPU
                                                         seconds, disk I/O's, net
                                                         I/O's, memory utilization)

  PROBE/Net       PRXDPR   process disk        DPRPCNT   Process percent disk ops

                                               DPRRDS    Process disk reads

                                               DPRWRTS   Process disk writes

                  PRXPGM   program             PGM*      Multiple variables
                           information                   pertaining to resource
                                                         consumption for a
                                                         program

  TRAKKER         TKRDLL   Data link table     APPBYTS   Application bytes/second

                                               APPPKTS   Application packets/second




System
  Applications

  HP Measureware  PCSGLB   global data         GLBNRQ    NFS queue length

  SunNetMgr       SNMRNC   rpcnfs.client       RNC*      Multiple variables
                                                         pertaining to NFS
                                                         client activity (call
                                                         and wait times, errors,
                                                         and so on)

                  SNMRNS   rpcnfs.server       RNS*      Multiple variables
                                                         pertaining to NFS
                                                         server activity (call
                                                         and wait times, errors,
                                                         and so on)

  PROBE/Net       PRXNFS   Client NFS table    NFSREQP   % requests of all
                                                         clients

                                               NFSRWSV   Service time
                                                         (read/write)

                                               NFSTIMP   % server time of
                                                         all clients

                  PRXPCS   User/proc NFS       PCVSARSP  Average response time
                           table

Summary of IT Service Vision Performance Measurements

This section summarizes the tables included in the separate sections earlier in the chapter. The tables contain this information:

the names of PDB tables that contain measurements appropriate for the analysis you are performing
brief descriptions of the tables
the names of variables that contain relevant information
brief descriptions of the variables.

These tables provide a workable but not exhaustive list of places to find useful performance measurements in IT Service Vision data.

Table 4. 15 Measurements for Analyzing Service Levels

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

Availability 

  SunNetMgr       SNMPGS   ping.stats          PGSRCHA   Reachable

  HP Measureware  PCSCFG   configuration data  CFG*      Multiple variables
                                                         showing the status
                                                         of various hardware
                                                         components at boot

  SPECTRUM        C*      <device tables>               CSICNST  contact status

Response Time 

  Accton          ACCESA   pacct data          ACCETM    Process elapsed time

                  ACCHP7   pacct data          ACCETM    Process elapsed time

                  ACCRS6   pacct data          ACCETM    Process elapsed time

                  ACCSUN   pacct data          ACCETM    Process elapsed time

  HP Measureware  PCSAPP   application data    APP1RSP   Application first
                                                         response

                                               APPURSP   Application
                                                         transaction response

                                               APPRUTM   Application runtime

                  PCSPRO   process data        PRO1RSP   Process first response

                                               PRORUNT   Process runtime

  SunNetMgr       SNMPGS   ping.stats          PGSTAVG   Average roundtrip
                                                         time

                                               PGSTMAX   Maximum roundtrip
                                                         time

                                               PGSTMIN   Minimum roundtrip
                                                         time

  PROBE/Net       PRXDSK   disk info           DSKRESP   Disk response time

Throughput 

  Accton          ACCESA   pacct data          ACCUTM    Process usr CPU time

                                               ACCSTM    Process system CPU
                                                         time

                                               ACCRW     Process disk I/O

                  ACCHP7   pacct data          ACCUTM    Process user CPU time

                                               ACCSTM    Process system CPU time

                                               ACCRW     Process disk I/O

                  ACCRS6   pacct data          ACCUTM    Process user CPU time

                                               ACCSTM    Process system CPU time

                                               ACCRW     Process disk I/O

                  ACCSUN   pacct data          ACCUTM    Process user CPU time

                                               ACCSTM    Process system CPU time

                                               ACCRW     Process disk I/O

  HP Measureware  PCSPRO   process data        PROCPSC   Process CPU time

                                               PRODKTO   Process disk I/O

                  PCSAPP   application data    APPCPSC   Application CPU seconds

                                               APPCTNO   Application process
                                                         count

                                               APPDKIO   Application disk I/O

                                               APPTRNO   Application transactions

  PROBE/Net       PRXDPR   process disk        DPRPCNT   Process percent disk ops
                           access

                                               DPRRDS    Process disk reads

                                               DPRWRTS   Process  disk writes

Utilization

  SunNetMgr       SNMHPF   hostperf.data       HPFCPUP   Percent CPU busy

                                               HPFDISK   Disk I/O's/second

  TRAKKER         TKRSTS   segment             DLLBYTS   Segment bytes/second
                           statistics

Table 4. 16 Measurements for Analyzing the Workstation

Data            PDB
  Collection      Table                        Variable
  Software        Name     Table Description   Name      Variable Description

CPU

  SunNetMgr       SNMHPF   hostperf.data       HPFCPUP   Percent CPU busy

                                               HPFAV1    CPU queue length
                                                         (1-min avg)

                                               HPFAV5    CPU queue length
                                                         (5-min avg)

                                               HPFAV15   CPU queue length
                                                         (15-min avg)

                                               HPFCSWT   Context switches

  HP Measureware  PCSGLB   global data         GLBCPSC   CPU seconds

  PROBE/Net       PRXCPU   CPU                 CPUAV1    CPU queue length
                                                         (1-min avg)

                                               CPUAV5    CPU queue length
                                                         (5-min avg)

                                               CPUAV15   CPU queue length
                                                         (15-min avg)

                                               CPUIDLP   Percent CPU idle

MEMORY 

  Accton          ACCESA   pacct data          ACCKMIN   Process memory use
                                                         (K-core min)

                                               ACCMEMA   Process average
                                                         memory use

                  ACCHP7   pacct data          ACCKMIN   Process  memory use
                                                         (K-core min)

                                               ACCMEMA   Process average memory use

                  ACCRS6  pacct data           ACCKMIN   Process memory use
                                                         (K-core min)

                                               ACCMEMA   Process average memory use

                  ACCSUN  pacct data           ACCKMIN   Process memory use
                                                         (K-core min)

                                               ACCMEMA   Process average memory use

  HP Measureware  PCSAPP   application         APPSVM    Application memory
                           data                          wait time

                  PCSPRO   process data        PROSVM    Process memory wait time

                  PCSGLB   global data         GLBMEMQ   Memory queue length

  SunNetMgr       SNMHPF   hostperf.data       HPFPGI    Pages read in

                                               HPFPGO    Pages written out

                                               HPFPSWI   Pages swapped in

                                               HPFPSWO   Pages swapped  out

  PROBE/Net       PRXBUF   buffer cache        BUFRCAC   Cache hit ratio

                                               BUFWCAC   Write cache hit ratio

                  PRXSYS   system              SYSPINR   Page in rate
                           statistics

                                               SYSPOTR   Page out rate

Disk I/O

  SunNetMgr       SNMHPF   hostperf.data       HPFDISK   Total system disk I/O

                  SNMIOD   iostat.disk         IOD*      Multiple variables pertaining
                                                         to disk I/O (seek times,
                                                         read/write rates and times, and
                                                         so on) per disk volume

  PROBE/Net       PRXDSK   disk information    DSK*      Multiple variables pertaining
                                                         to disk I/O (seek times,
                                                         read/write rates and times, and
                                                         so on) per disk volume

                  PRXFBL   I/O load            FBL*      Multiple  variables pertaining
                                                         to disk I/O (seek times,
                                                         read/write  rates and times, and
                                                         so on) per disk volume

  HP Measureware  PCSGDK   global disk         GDK*      Multiple variables pertaining
                           metrics                       to disk I/O (seek times,
                                                         read/write rates and times, and
                                                         so on) per disk volume

Network
  Interface I/O

  SunNetMgr       SNMHPF   hostperf.data       HPFIPKT   Input packets

                                               HPFOPKT   Output packets

                  SNME*    Etherif.*                     Multiple variables

                  SNML*    layers.*                      Multiple variables

                  SNMT*    traffic.*                     Multiple variables

  HP Measureware  PCSGLN   global network      GLN*      Multiple variables
                           interface

Table 4. 17 Measurements for Analyzing Network Levels

Data            PDB
  Collection      Table                         Variable
  Software        Name     Table Description    Name      Variable Description

Congestion

  SNMP data       *N2IFT   ifInterfaces        IFODISC   Discards on output
  collector                                    IFIDISC   Discards on input
                                               IFOQLEN   Output queue length

                           tcp                 IFODISC   Discards on output
                                               TCPRTSG   Segments retransmitted
                                               TCPRALG   Retransmission algorithm


  SPECTRUM        C*       <device             CSIIIND   Interface input discards
                           tables>

                                               CSIIOTD   Interface output discards

                                               IPIDISC   IP discards

                                               TCPRTSG   Segments retransmitted

                                               TCPOTSG   TCP segments transmitted

Error Rates

  SNMP data       *N2IFT   ifInterfaces        IFIERRS   Input errors/second
  collector                                    IFIUCP    Input  unicast packets/sec
                                               IFINUCP   Input  broadcast packets/sec
                                               IFOERRS   Output errors/second
                                               IFOUCPT   Output unicast packets/sec
                                               IFONCTP   Output broadcast packets/sec

  SNMP data       *N2IP_   ip                  IPIADER   Input address errors/sec
  collector                                    IPIHDER   Input header errors/sec
                                               IPIDELV   Input packets delivered/sec

  SNMP data       *N2ICM   icmp                ICIERRS   Input  errors/sec
  collector                                    ICIMSGS   Input  messages/sec
                                               ICOERRS   Output errors/sec
                                               ICOMSGS   Output messages/sec

  SNMP data       *N2TCP   tcp                 TCPIERR   Input  errors/sec
  collector                                    TCPORST   Retransmissions/sec
                                               TCPINSG   Input  segments/sec
                                               TCPOTSG   Output segments/sec

  SNMP data       *N2UDP   udp                 UDPIERR   Input  errors/sec
  collector                                    UDPINDG   Input datagrams/sec

  SNMP data       *N2EGP   egp                 EGPIERR   Input  errors/sec
  collector                                    EGPIMSG   Input  messages/sec
                                               EGPOERR   Output errors/sec
                                               EGPOMSG   Output messages/sec

  SNMP data       *N2EGN   egn                 EGNIERR   Input  errors/sec
  collector                                    EGNIMSG   Input  messages/sec
                                               EGNOERR   Output errors/sec
                                               EGNOMSG   Output messages/sec

  SNMP data       *N2SNP   snmp                SNPIPKT   Input  packets/sec
  collector                                    SNPI*     Multiple variables
                                                         pertaining to various
                                                         SNMP input errors/second
                                               SNPOPKT   Output packets/sec
                                               SNPO*     Multiple variables
                                                         pertaining to various
                                                         SNMP output errors/second

  SPECTRUM        C*       <device             CSISERT   Soft error rate/second
                            tables>

                                               IPIADER   Input address errors/second

                                               IPIHDER   Bad header errors/second

                                               CSIPKRT   Packet rate/second

                                               IPIDELV   Datagrams delivered/second

  TRAKKER         TKRSTS   segment             DLLEFRM   Ethernet frames/second
                           statistic

                                               ELLERRS   Errors/second

 Utilization

  SNMP data       *N2IFT   ifInterfaces        IFOOCTS   Output octets/second
  collector
                                               IFIOCTS   Input  octets/second

                                               IFSPEED   Line speed of interface


  SPECTRUM        C*       <device             CSILOAD   Percent of bandwidth
                            tables>                      utilized


                  CCI*     <Cisco              LOCIBSC   Input bits/second
                            router                          (5-min avg)
                            tables>

                           <Cisco              LOCOBSC   Output bits/second
                            router                          (5-min avg)
                            tables>

                           <Cisco              LOCIPSC   Input pkts/second
                            router                          (5-min avg)
                            tables>

                           <Cisco              LOCOPSC   Output pkts/second
                            router                          (5-min avg)
                            tables>

Network Traffic
   Patterns

  SNMP data       *N2IFT   ifInterfaces        IFIUCP    Input unicast
  collector                                              packets/second

                                               IFINUCP   Input broadcast
                                                         packets/second

                                               IFOUCPT   Output unicast
                                                         packets/second

                                               IFONCTP   Output broadcast
                                                         packets/second

  SPECTRUM        C*       <device             CSILOAD   Percent of bandwidth
                            tables>                      utilized

                  CCI*     <Cisco              LOCIBSC   Input bits/second
                            router                          (5-min avg)
                            tables>

                           <Cisco              LOCOBSC   Output bits/second
                            router                          (5-min avg)
                            tables>

                           <Cisco              LOCIPSC   Input pkts/second
                            router                          (5-min avg)
                            tables>

                           <Cisco              LOCOPSC   Output pkts/second
                            router                          (5-min avg)
                            tables>

  TRAKKER         TKRSTS   segment             IPOSPR    IP off-segment pkts
                           statistics                    received/second

                                               IPOSPX    IP off-segment pkts
                                                         xmitted/second

                                               IPPKTS    IP packets/second

                                               IPTRAN    IP transit packets/second

Table 4.18 Measurements for Analyzing Applications

Data            PDB
  Collection      Table                         Variable
  Software        Name     Table Description    Name      Variable Description

User
  Applications

  Accton          ACCESA,  pacct data           ACC*     Multiple variables
                  ACCHP7,                                pertaining to resource
                  ACCRS6,                                consumption (CPU seconds,
                  ACCSUN                                 disk I/O's, net I/O's,
                                                         memory utilization)

  HP Measureware  PCSPRO   process data         PRO*     Multiple variables to
                                                         resource consumption (CPU
                                                         seconds, disk I/O's, net
                                                         I/O's, memory utilization)

                  PCSAPP   application         APP*      Multiple variables
                           data                          pertaining to resource
                                                         consumption (CPU seconds,
                                                         disk I/O's, net I/O's,
                                                         memory utilization)

  PROBE/Net       PRXDPR   process disk        DPRPCNT   Process percent disk ops
                           access

                                               DPRRDS    Process disk reads

                                               DPRWRTS   Process disk writes

                  PRXPGM   program             PGM*      Multiple variables
                           information                   pertaining to resource
                                                         consumption for a program

  TRAKKER         TKRDLL   Data link table     APPBYTS   Application bytes/second

                                               APPPKTS   Application packets/second

 System
  Applications 

  HP Measureware  PCSGLB   global data         GLBNRQ    NFS queue length


  SunNetMgr       SNMRNC   rpcnfs.client       RNC*      Multiple variables
                                                         pertaining to NFS client
                                                         activity (call and wait times,
                                                         errors, and so on)

                  SNMRNS   rpcnfs.server       RNS*      Multiple variables
                                                         pertaining to NFS server
                                                         activity (call and wait
                                                         times, errors, and so on)

  PROBE/Net       PRXNFS   Client NFS table    NFSREQP   % requests of all clients

                                               NFSRWSV   Service time (read/write)

                                               NFSTIMP   % server time of all clients

                  PRXPCS   User/proc NFS       PCSARSP   Average response time
                           table

Analyzing Data with IT Service Vision Software

General Information

Before you begin using the reports in IT Service Vision software, you need to understand how the reports classify data, what types of tables contain the data, and how variables can be used to group data. The following sections provide an overview of this information. For more information on table types, variables, or using statistics to summarize data, refer to the Macro reference, which is available from the following path in the IT Service Vision interface.

OnlineHelp -> Documentation Contents -> Usage -> Macro Reference

Data Classification

Performance data can be classified by its

type: character or numeric.
measurement level: nominal or interval.
Nominal variables have a discrete set of values. For example, shift is a nominal measure consisting of a set of labels defining the work shifts within an organization (for example, Prime Time or Off-Hour). Character data always have a nominal measurement level.

Interval measurements are numeric data that vary across a continuous range. Differences between values are significant and the data will have an inherent order. Network Error rate or total number of disk I/O's are examples of interval data measurements.
scale: discreet or continuous
Discrete scales have a fixed set of values. Continuous data can have an infinite number of values. Numeric CPE data is typically measured on a continuous scale.

One additional characteristic of performance data is time. Most performance measurements are collected sequentially and represent a quantity over an interval of time. You can detect significant trends or cycles by plotting the collected data in time-sequence.

Table Types

IT Service Vision software groups performance measurements into two types of tables: interval and event. Interval tables contain performance measurements that represent the quantity or count of a resource that occurs over time. Most data collected and analyzed with IT Service Vision software are stored in interval tables.

Event tables contain the amount of resources consumed during the life of the event. Event table records are recorded only at the termination of the event, whereas interval tables typically contain data sampled at regular intervals. Since many reports analyze data over a time period, you must be careful when you select reports for event tables. Be sure to always specify PERIOD=ASIS for event tables.

Grouping Data Using the BY and CLASS Statements

When you use the Reporting facility to create reports, or when you invoke one of the report macros, you can specify BY and class variables to change the appearance of the report.

BY variables group observations with the same value for the BY variable. For example, if you specify MACHINE as a BY variable, all observations with the same value for machine are treated as a single group. When you specify BY variables for a report, the system generates a separate analysis for each group defined by the values of the BY variables. SAS procedures expect the input dataset to be sorted in order of the BY variables. Graphical procedures typically produce a separate plot for each unique value of the BY variable. The axis for all plots is based on the maximum and minimum values of the analysis variable in the data population.

Class variables can also group data. Class variables are typically used to summarize observations that contain the same value for the class variable. The CLASS statement identifies class variables, which typically have a small number of discrete values or unique levels. Graphical procedures produce one plot for all class values and assign a unique symbol or color to distinguish each of the values of the class variables.

Typically, you use either MACHINE or SHIFT as the class or BY variable. If you use shift as a BY variable, it produces one plot or analysis section per shift, which enables you to distinguish load patterns unique to prime or off-hour work shifts. If you specify MACHINE as a class variable, a one-Y-axis plot produces one line (color or symbol group) per machine in the data. From this plot you can see performance patterns for individual machines in comparison to other machines included in the data.

Using Statistics to Summarize Data

Summarizing measured data is one of the most common problems faced by the performance analyst. A monitor may produce hundreds or even millions of records. You need to use descriptive statistics or percentiles to generalize and compare the collected information. IT Service Vision software provides two principal methods for summarizing data: ad-hoc summary statistics, which are available in the Reporting facility, and performance database reduction.

In addition, base SAS software provides several procedures, such as CORR, FREQ, MEANS, SUMMARY, and UNIVARIATE, which calculate descriptive statistics.

PDB reductions, the Reporting facility in IT Service Vision software, and base SAS statistical procedures all support a number of descriptive statistics. For more information refer to the statistical codes in the Macro Reference, which is available from this path the IT Service Vision window interface:

OnlineHelp -> Documentation Contents -> Usage -> Macro Reference

Using the SAS System for Further Data Analysis

In addition to the capabilities available in IT Service Vision software, the SAS System provides a wealth of analytical tools. Several of these tools are discussed here and in the following sections.

The UNIVARIATE procedure in base SAS software provides a comprehensive statistical analysis of the distribution of performance measurements, including

simple descriptive statistics (maximum, minimum, and mean)
details on extreme values
quantiles, such as the median
histograms and box plots to illustrate data distribution.

The CORR procedure, also in base SAS software, provides insight into variable relationships.

SAS/INSIGHT software produces most descriptive and distribution statistics in an interactive, multi-window exploration format.

Summarizing Data with the UNIVARIATE Procedure

Several SAS procedures produce descriptive statistics (for example, the FREQ, MEANS, and UNIVARIATE procedures), but the UNIVARIATE procedure produces the most extensive summary. For example, to summarize the CPU usage for all machines collected by SunNet Manager hostperf, execute this step:

proc univariate data=detail.snmhpf plot;
   var hpfcpup;
   id machine;
   title '% CPU Usage';
run;

The PROC UNIVARIATE statement asks for a summary of the detail HOSTPERF dataset and requests a histogram with the PLOT option. The VAR statement selects the variable HPFCPUP (% CPU usage), and the ID statement selects MACHINE to identify the extreme observations. Finally, the TITLE statement generates a descriptive title on each page of output.

Moments Section

The moments section of the output contains a number of descriptive statistics. Descriptive statistics provide insight into the character and range of the data and are essential to understanding complex data distributions and relationships. Mean, or the arithmetic average, is the most common measure. N is the number of observations with nonmissing values. Variance and Std Dev are measures of the variance within the data. The kurtosis is a measure of the tendency for the data distribution to be spread out more on one side of the mean than the other. Most of the other statistics are normality and distribution tests. Refer to The SAS Procedures documentation for yur current version of SAS, for a definition of these statistics.

The mean presents the average value within the data population and is commonly used to summarize performance measurements. However, does the mean represent 5 samples or 1000? N provides the answer by displaying the number of nonmissing values used to calculate the mean. If presented with a mean of 5 for 100 samples, is the value 5 for all 100 samples, or is there a large variability within the samples? Variance, standard deviation, and coefficient of variation all provide additional information about the variability within the data sample. Range, the maximum subtracted from the minimum, can also be used.

Quantiles Section

The quantiles section of the output gives distribution information in the form of the following percentiles:

The 0th percentile is the smallest value.
The 25th percentile, or first quartile (Q1), is larger than 25 percent of the data sample.
The median is the 50th percentile and the mode is the data value with the largest number of observations.
The 75th percentile, also called third quartile (Q3), is the value larger than 75 percent of the values within the data sample.
The 100th percentile is the largest value in the sample.

The range is the difference between the largest and smallest value.

For %CPU usage, the quantiles section clearly shows a positive skewness. This can be verified by the positive value of kurtosis (.55569). The histogram on the fourth page of output shows this skewing.

Extremes Section

The Extremes section lists the five lowest and five highest values in the population. The ID is the value of the ID value for the observation selected. In this case, all extreme values are represented by machines SOL and SUPERNOVA.

Missing Values Section

Finally, if your data contain missing values, the Missing values section displays the label used for missing values (in this case, a period), the count of missing values, and the percentage of missing values to the number of observations in the sample.

Correlating Data with the CORR Procedure

The CORR procedure, a base SAS procedure, measures the strength of linear relationships between two variables. If one variable can be expressed exactly as a linear function of another variable, then the correlation is 1. If a variable is inversely related, then the correlation value will be -1. If there is no correlation between two variables, then the correlation value will be 0. For example, to determine if CPU usage, disk I/O, interface packet traffic, and interface output collisions show a strong relationship, execute the following statements:

proc corr data=detail.snmhpf noprob nomiss;
   var hpfcpup hpfdisk hpfopkt hpfcoll hpfipkt;
run;

The PROC CORR statement asks for a correlation analysis of the detail dataset HOSTPERF; the VAR statement selects the variables HPFCPUP HPFDISK, HPFOPKT, HPFCOLL, and HPFIPKT. The NOMISS option drops an observation if any missing value exists for any requested variable. The NOPROB option suppresses printing of significant probabilities. For more information on the CORR procedure and its options, see the SAS Procedures Guide for your current version of SAS.

Using SAS/INSIGHT Software for Exploring Data

SAS/INSIGHT software is an interactive and graphical software package for data exploration. Descriptive statistics, percentiles, charts, and plots can be produced and linked across multiple analysis windows. SAS/INSIGHT software also produces correlation matrices, scatter plots, data transforms, and the general linear model analysis for data fitting, distribution tests, and regression analysis. For more information, refer to the SAS/INSIGHT User's Guide for your current version of SAS software..

Table of Contents

Analyzing Your System

CPU Busy Rate

Average Ready-Queue Length

Context Switches

Summary Table