The OPTGRAPH Procedure

Summary

In PROC OPTGRAPH, various summary statistics for a graph and its nodes can be calculated by invoking the SUMMARY statement. The options for this statement are described in the section SUMMARY Statement.

The SUMMARY statement reports status information in a macro variable called _OPTGRAPH_SUMMARY_. See the section Macro Variable _OPTGRAPH_SUMMARY_ for more information about this macro variable.

Output Data Sets

The summary statistics produced are broken into two categories: statistics on the entire graph and statistics on the nodes and links of the graph. The latter is appended to the output nodes and links data sets specified in the OUT_NODES= and OUT_LINKS= option in the PROC OPTGRAPH statement. The former is contained in the data set that is specified in the OUT= option in the SUMMARY statement.

Let $\delta (u)$ represent the list of nodes connected to node u in an undirected graph. In a directed graph $\delta ^{\textrm{out}}(u)$ represents the list of nodes that are connected from node u (out-links), while $\delta ^{\textrm{in}}(u)$ represents the list of nodes that are connected to node u (in-links).

OUT= Data Set

By default, the summary output data set that is specified in the OUT= option in the SUMMARY statement contains the following columns:

  • nodes: the number of nodes in the graph ($|N|$)

  • links: the number of links in the graph ($|A|$)

  • avg_links_per_node: the average number of links per node

  • density: the number of links in the graph divided by the number of links in a complete graph $\left(\frac{|A|}{|K(N)|}\right)$

  • self_links_ignored: the number of self links ignored

  • dup_links_ignored: the number of duplicate links ignored

  • leaf_nodes: the number of leaf nodes

    • Undirected Graph: $u \in N$ such that $\delta (u)=1$

    • Directed Graph: $u \in N$ such that $\delta ^{\textrm{out}}(u)=0$ and $\delta ^{\textrm{out}}(u)+\delta ^{\textrm{in}}(u) > 0$

  • singleton_nodes: the number of singleton nodes

    • Undirected Graph: $u \in N$ such that $\delta (u)=0$

    • Directed Graph: $u \in N$ such that $\delta ^{\textrm{out}}(u) + \delta ^{\textrm{in}}(u)=0$

You can produce statistics about the connectedness of the graph by using the CONCOMP and BICONCOMP options. For more information about connected components and biconnected components, see the sections Connected Components and Biconnected Components and Articulation Points, respectively. If you use the CONCOMP and BICONCOMP options, the following columns also appear in the summary output data set for undirected graphs:

  • concomp: the number of connected components in the graph

  • biconcomp: the number of biconnected components in the graph

  • artpoints: the number of articulation points in the graph

  • isolated_pairs: the number of isolated pairs of nodes (a connected component of size two)

  • isolated_stars: the number of isolated stars, (a connected component, C, of size greater than two with):

    • one node i with $\delta (i)=|C|-1$ and all other nodes $u \in C \setminus \{ i\} $ with $\delta (u)=1$

and for directed graphs:

  • concomp: the number of strongly connected components in the graph

  • isolated_pairs: the number of isolated pairs of nodes (a weakly connected component of size 2)

  • isolated_stars_out: the number of isolated outward stars, (a weakly connected component, C, of size greater than two with):

    • one node i with $\delta ^{\textrm{out}}(i)=|C|-1$ and all other nodes $u \in C \setminus \{ i\} $ with $\delta ^{\textrm{in}}(u)=1$

  • isolated_stars_in: the number of isolated inward stars, (a weakly connected component, C, of size greater than two with):

    • one node i with $\delta ^{\textrm{in}}(i)=|C|-1$ and all other nodes $u \in C \setminus \{ i\} $ with $\delta ^{\textrm{out}}(u)=1$

You can produce statistics about the shortest paths in the graph by using the SHORTPATH= option. The diameter of a graph is the longest shortest path distance of all possible source-sink pairs in the graph. Calculating the diameter of a graph is an expensive computation, because it involves calculating shortest paths for all pairs. For undirected graphs, an approximate method is available based on Boitmanis et al. (2006). The algorithm can be invoked by using the DIAMETER_APPROX= option. The exact method runs in $O(|N| \times (|N| \log |N| + |A|))$; the approximate method runs in $O(|A| \sqrt {|N|})$ with an additive error of $O(\sqrt {|N|})$. For more information about shortest paths, see the section Shortest Path. If you use the SHORTPATH= option, the following columns also appear in the summary output data set:

  • diameter_wt: longest weighted shortest path in the graph

  • diameter_unwt: longest unweighted shortest path in the graph

  • diameter_approx_wt: approximate longest weighted shortest path in the graph

  • diameter_approx_unwt: approximate longest unweighted shortest path in the graph

  • avg_shortpath_wt: average weighted shortest path in the graph

  • avg_shortpath_unwt: average unweighted shortest path in the graph

Depending on which other options you specify, some of these columns might not appear in the summary output data set.

OUT_NODES= Data Set

In addition, you can produce summary statistics about the nodes of the graph. By default, the following columns are appended to the data set that you specify in the OUT_NODES= option in the PROC OPTGRAPH statement:

  • sum_in_and_out_wt: sum of the link weights from and to the node

  • leaf_node: 1, if the node is a leaf node; otherwise, 0

  • singleton_node: 1, if the node is a leaf node; otherwise, 0

  • isolated_pair: the identifier, if the node is in an isolated pair; otherwise, missing (.)

In addition, for undirected graphs:

  • isolated_star: the identifier, if the node is in an isolated star; otherwise, missing (.)

and for directed graphs:

  • isolated_star_out: the identifier, if the node is in an isolated outward star; otherwise, missing (.)

  • isolated_star_in: the identifier, if the node is in an isolated inward star; otherwise, missing (.)

You can produce statistics about the shortest path distances to and from nodes in the graph by using the SHORTPATH= option. The eccentricity of a node u is the longest shortest path distance of all possible shortest path distances between u and any other node. If you use the SHORTPATH= option, the following columns also appear in the nodes output data set:

  • eccentr_out_wt: the longest weighted shortest path distance from the node

  • eccentr_out_unwt: the longest unweighted shortest path distance from the node

  • eccentr_in_wt: the longest weighted shortest path distance to the node

  • eccentr_in_unwt: the longest unweighted shortest path distance to the node

Summary Statistics of a Simple Directed Graph

This section illustrates the calculation of summary statistics on the simple directed graph G that is shown in Figure 1.115.

Figure 1.115: A Simple Directed Graph G

A Simple Directed Graph


The directed graph G can be represented using the following nodes data set NodeSetIn and links data set LinkSetIn:

data NodeSetIn;
  input node $ @@;
  datalines;
A B C D E F G H I J K L M N O P
;
data LinkSetIn;
   input from $ to $ weight @@;
   datalines;
A B 1 A C 2 A D 2 B A 2 D E 2
D F 1 E F 2 F D 2 F E 1 A A 2
A B 2 I J 5 K L 3 K M 2 N O 1
P O 1
;

The following statements calculate the default summary statistics and output the results in the data set Summary:

proc optgraph
   graph_direction = directed
   data_nodes      = NodeSetIn
   data_links      = LinkSetIn;
   summary
      out          = Summary;
run;

The data set Summary contains the default summary statistics of the input graph and is shown in Figure 1.116.

Figure 1.116: Graph Summary Statistics of a Simple Directed Graph

nodes links avg_links_per_node density self_links_ignored dup_links_ignored leaf_node singleton_node
16 14 0.875 0.058333 1 1 5 2



The following statements calculate the default summary statistics and information about the connectedness of the graph, and they output the results in the data set Summary:

proc optgraph
   graph_direction = directed
   data_nodes      = NodeSetIn
   data_links      = LinkSetIn;
   summary
      concomp
      out          = Summary;
run;

The data set Summary contains the summary statistics of the input graph and is shown in Figure 1.116.

Figure 1.117: Graph Summary and Connectedness Statistics of a Simple Directed Graph

nodes links avg_links_per_node density self_links_ignored dup_links_ignored leaf_node singleton_node concomp isolated_pairs isolated_star_out isolated_star_in
16 14 0.875 0.058333 1 1 5 2 13 1 1 1



Summary Statistics of a Simple Directed Graph by Cluster

Similar to how you can use the BY_CLUSTER option in the CENTRALITY statement, as described in the section Processing by Cluster, you can process a number of induced subgraphs of a graph with only one call to PROC OPTGRAPH by using the BY_CLUSTER option in the SUMMARY statement. In this section, you want to work on the subgraphs induced by node subsets $N_0 = \{ A,B,C\} $, $N_1 = \{ D,E,F\} $, and $N_3 = \{ G,H,I,J,K,L,M,N,O,P\} $ for the directed graph shown in Figure 1.115. The induced subgraphs are shown graphically in Figure 1.118 (the dashed link is removed).

Figure 1.118: Induced Subgraphs of G

Induced Subgraphs of


Define the subgraphs in the nodes data set by using the cluster variable as follows:

data NodeSetIn;
   input node $ cluster @@;
   datalines;
A 0 B 0 C 0 D 1 E 1 
F 1 G 2 H 2 I 2 J 2 
K 2 L 2 M 2 N 2 O 2 
P 2
;

The following statements process the summary statistics for each induced subgraph:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes      = NodeSetIn
   out_links       = LinkSetOut
   out_nodes       = NodeSetOut;
   performance
      nthreads     = 2;
   summary
      by_cluster
      concomp
      out          = Summary;
run;

Notice in this example that you can process each subgraph in parallel by using the NTHREADS= option in the PERFORMANCE statement.

The data sets Summary, NodeSetOut, and LinkSetOut now contain the summary statistics for each induced subgraph; they are shown in Figure 1.119.

Figure 1.119: Summary Statistics for Induced Subgraphs of G

Summary

cluster nodes links avg_links_per_node density self_links_ignored dup_links_ignored leaf_node singleton_node concomp isolated_pairs isolated_star_out isolated_star_in
0 3 3 1.00000 0.50000 1 1 1 0 2 0 0 0
1 3 5 1.66667 0.83333 0 0 0 0 1 0 0 0
2 10 5 0.50000 0.05556 0 0 4 2 10 1 1 1

NodeSetOut

node cluster sum_in_and_out_wt leaf_node singleton_node isolated_pair isolated_star_out isolated_star_in
A 0 6 0 0 . . .
B 0 4 0 0 . . .
C 0 2 1 0 . . .
D 1 5 0 0 . . .
E 1 5 0 0 . . .
F 1 6 0 0 . . .
G 2 0 0 1 . . .
H 2 0 0 1 . . .
I 2 5 0 0 1 . .
J 2 5 1 0 1 . .
K 2 5 0 0 . 1 .
L 2 3 1 0 . 1 .
M 2 2 1 0 . 1 .
N 2 1 0 0 . . 1
O 2 2 1 0 . . 1
P 2 1 0 0 . . 1

LinkSetOut

from to cluster weight isolated_pair isolated_star_out isolated_star_in
A B 0 1 . . .
A C 0 2 . . .
A D . 2 . . .
B A 0 2 . . .
D E 1 2 . . .
D F 1 1 . . .
E F 1 2 . . .
F D 1 2 . . .
F E 1 1 . . .
I J 2 5 1 . .
K L 2 3 . 1 .
K M 2 2 . 1 .
N O 2 1 . . 1
P O 2 1 . . 1



Summary Statistics of a Simple Undirected Graph

This section illustrates the calculation of summary and shortest path statistics on the simple undirected graph G that is shown in Figure 1.120.

Figure 1.120: A Simple Undirected Graph G

A Simple Undirected Graph


The undirected graph G can be represented using the following links data set LinkSetIn:

data LinkSetIn;
   input from $ to $ weight @@;
   datalines;
A B 1 A C 2 A D 2 B A 2 D E 2
D F 1 E F 2 F D 2 F E 1
;

The following statements calculate the default summary statistics and information about shortest path distances of the graph, and they output the results in the data set Summary. In addition, node statistics are produced and output in the data set NodeSetOut.

proc optgraph
   data_links      = LinkSetIn
   out_nodes       = NodeSetOut;
   summary
      out          = Summary
      shortpath    = weight;
run;

The data sets Summary and NodeSetOut now contain the summary statistics of the input graph and is shown in Figure 1.121.

Figure 1.121: Graph Summary and Shortest Path Statistics of a Simple Undirected Graph

Summary

nodes links avg_links_per_node density self_links_ignored dup_links_ignored leaf_node singleton_node diameter_wt avg_shortpath_wt
6 6 1 0.2 0 3 2 0 6 3.13333

NodeSetOut

node sum_in_and_out_wt eccentr_wt_out leaf_node singleton_node
A 5 4 0 0
B 1 5 1 0
C 2 6 1 0
D 5 4 0 0
E 4 6 0 0
F 3 5 0 0