The OPTGRAPH Procedure

Reach (Ego) Network

Subsections:

Output Data Sets
Reach Network of a Simple Directed Graph
Processing Multiple Reach Networks in One Pass
Processing Reach Networks by Cluster
Processing Multiple Reach Networks in One Pass by Cluster
Processing Each Source Reach Network for Hop Limits of Both 1 and 2 in One Pass by Cluster

The reach network of a graph $G = (N,A)$ is a graph $G^ R_ L = (N^ R_ L,A^ R_ L)$ that is defined as the induced subgraph over the set of nodes $N^ R_ L$ that are reachable in L steps (or hops) from a set S of nodes, called the source nodes. Reach networks are often referred to as ego networks in the context of social networks, since they focus around the neighbors of one (or more) particular individuals.

In PROC OPTGRAPH, reach networks can be calculated by using the REACH statement. The options for this statement are described in the section REACH Statement.

The REACH statement reports status information in a macro variable called _OPTGRAPH_REACH_. See the section Macro Variable _OPTGRAPH_REACH_ for more information about this macro variable.

In most cases, the set of source nodes from which to calculate reach are defined in a node subset data set, as described in the section Node Subset Input Data. The node subset data set can be used to define several sets of sources nodes. Each source node set is used to calculate the reach networks. The reach network identifier is given in the node subset data set’s reach column. When you use the EACH_SOURCE option, every node in the original graph’s node set N is used to find a reach network from each node separately.

Output Data Sets

Depending on the options selected, the reach network algorithm produces output data sets as described in the following sections.

OUT_NODES= Data Set

This data set describes the nodes in each reach network that are found from each set of source nodes. The data set contains the following columns:

node: node label for each node in each reach network
reach: reach network identifier (which defines the set of source nodes that was used)

OUT_LINKS= Data Set

This data set describes the links in each reach network that are found from each set of source nodes. Output of the reach network links can sometimes be more costly computationally, relative to calculating only the nodes or counts in the reach networks. This option does not work when you use the BY_CLUSTER option. The data set contains the following columns:

from: the from node label for each link in each reach network
to: the to node label for each link in each reach network
reach: reach network identifier (which defines the set of source nodes that was used)

OUT_COUNTS= Data Set

This data set describes the number of nodes in each reach network for each set of sources nodes. The data set contains the following columns:

node: node label for each node in the source node sets
reach: reach network identifier (which defines the set of source nodes that was used)
count: the number of nodes reachable using outgoing links from the source nodes
count_not: the number of nodes not reachable using outgoing links from the source nodes

If the graph is directed and you use the DIGRAPH option, then the OUT_COUNTS= data set contains the following additional columns:

count_in: the number of nodes reachable using incoming links from the source node
count_out: the number of nodes reachable using outgoing links from the source node (equivalent to count)
count_in_or_out: the number of nodes reachable using incoming or outgoing links (but not both) from the source node
count_in_and_out: the number of nodes reachable using both incoming and outgoing links from the source node

If node weights are present, the OUT_COUNTS= data set contains the following additional columns:

count_wt: the sum of the weights of the nodes reachable using outgoing links from the source node
count_not_wt: the sum of the weights of the nodes not reachable from the source node
count_in_wt: the sum of the weights of the nodes reachable using incoming links from the source node
count_out_wt: the sum of the weights of the nodes reachable using outgoing links from the source node
count_in_or_out_wt: the sum of the weights of the nodes reachable using incoming or outgoing links (but not both) from the source node
count_in_and_out_wt: the sum of the weights of the nodes reachable using both incoming and outgoing links from the source node

When you want to calculate hop limits of 1 and 2 on the same graph, you can use the OUT_COUNTS1= and OUT_COUNTS2= options to do this in one call. This option works only when the EACH_SOURCE and BY_CLUSTER options are specified.

Reach Network of a Simple Directed Graph

This section illustrates the use of the reach networks algorithm on the simple directed graph G that is shown in Figure 1.91.

Figure 1.91: Simple Directed Graph G

The directed graph G can be represented using the following links data set LinkSetIn:

data LinkSetIn;
   input from $ to $ @@;
   datalines;
A B  A C  A D  B C  B E
B F  C E  D E  E D  E F
F G  G H  G I  H G  H I
;

Consider two sets of source nodes, $S_1=\{ A,G\}$ and $S_2=\{ B\}$ . These can be defined separately in two node subset data sets as follows:

data NodeSubSetIn1;
   input node $ reach;
   datalines;
A 1
G 1
;

data NodeSubSetIn2;
   input node $ reach;
   datalines;
B 1
;

For the first set of source nodes, you can use the following statements to calculate the reach network with a hop limit of 1:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes_sub  = NodeSubSetIn1;
   reach
      out_nodes    = ReachNodes1
      out_links    = ReachLinks1
      out_counts   = ReachCounts1
      maxreach     = 1;
run;

The data sets ReachNodes1, ReachLinks1, and ReachCounts1 now contain the nodes, links, and counts of the reach network, respectively, that come from $S_1$ .

Figure 1.92: Reach Network for $S_1=\{ A,G\}$ with Hop Limit of 1

ReachNodes1

reach	node
1	A
1	B
1	C
1	D
1	G
1	H
1	I

ReachLinks1

reach	from	to
1	A	B
1	A	C
1	A	D
1	B	C
1	G	H
1	H	G
1	G	I
1	H	I

ReachCounts1

reach	node	count	count_not
1	A	7	2
1	G	7	2

The results are displayed graphically in Figure 1.93.

Figure 1.93: Reach Network for $S_1=\{ A,G\}$ with Hop Limit of 1

For the second set of source nodes, you can use the following statements to calculate the reach network with a hop limit of 2:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes_sub  = NodeSubSetIn2;
   reach
      out_nodes    = ReachNodes2
      out_links    = ReachLinks2
      out_counts   = ReachCounts2
      maxreach     = 2;
run;

The data sets ReachNodes2, ReachLinks2, and ReachCounts2 now contain the nodes, links, and counts of the reach network, respectively, that come from $S_2$ .

Figure 1.94: Reach Network for $S_2=\{ B\}$ with Hop Limit of 2

ReachNodes2

reach	node
1	B
1	C
1	D
1	E
1	F
1	G

ReachLinks2

reach	from	to
1	B	C
1	B	E
1	B	F
1	C	E
1	D	E
1	E	D
1	E	F
1	F	G

ReachCounts2

reach	node	count	count_not
1	B	6	3

The results are displayed graphically in Figure 1.95.

Figure 1.95: Reach Network for $S_1=\{ B\}$ with Hop Limit of 2

Processing Multiple Reach Networks in One Pass

You can process a set of reach networks from one graph in one pass using one node subset data set. The MAXREACH= option applies to all of the reach networks requested. If the node subset data set column reach is set to 0 or missing (.), then the node is not processed. If the column reach is set to a value greater than 0, then the node is processed with other nodes by using the same marker.

Consider again the graph shown in Figure 1.91, now with source node sets $S_1=\{ C\}$ and $S_2=\{ A,H\}$ . These source node sets can be defined together as follows:

data NodeSubSetIn;
   input node $ reach;
   datalines;
A 2
C 1
H 2
;

You can use the following statements to process the two one-hop-limit reach networks in one pass:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes_sub  = NodeSubSetIn;
   reach
      out_nodes    = ReachNodes
      out_links    = ReachLinks
      out_counts   = ReachCounts
      maxreach     = 1;
run;

The data sets ReachNodes, ReachLinks, and ReachCounts now contain the nodes, links, and counts of the reach networks, respectively, that come from $S_1$ and $S_2$ .

Figure 1.96: Reach Networks for $S_1=\{ C\}$ and $S_2=\{ A,H\}$ with Hop Limit of 1

ReachNodes

reach	node
1	C
1	E
2	A
2	B
2	C
2	D
2	G
2	H
2	I

ReachLinks

reach	from	to
1	C	E
2	A	B
2	A	C
2	A	D
2	B	C
2	G	H
2	H	G
2	G	I
2	H	I

ReachCounts

reach	node	count	count_not
1	C	2	7
2	A	7	2
2	H	7	2

Processing Reach Networks by Cluster

Similar to the usage for centrality described in the section Processing by Cluster, you can use the BY_CLUSTER option in the REACH statement to process a number of induced subgraphs of a graph with only one call to PROC OPTGRAPH. In this section, you want to work on the subgraphs that are induced by node subsets $N_0 = \{ A,C,D,E\}$ and $N_1 = \{ B,F,G,H,I\}$ for the directed graph shown in Figure 1.91. The induced subgraphs are shown graphically in Figure 1.97 and Figure 1.98.

Figure 1.97: Induced Subgraph for $N^0 = \{ A,C,D,E\}$

Figure 1.98: Induced Subgraph for $N^1 = \{ B,F,G,H,I\}$

Define the subgraphs in the nodes data set by using the cluster variable as follows:

data NodeSetIn;
   input node $ cluster @@;
   datalines;
A 0  B 1  C 0  D 0  E 0
F 1  G 1  H 1  I 1
;

In the node subset data set, define the source nodes set $S=\{ B,C\}$ by using the reach variable as follows:

data NodeSubSetIn;
   input node $ reach;
   datalines;
B 1
C 1
;

To process the two-hop-limit reach network for each induced subgraph, you can use the following statements:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes      = NodeSetIn
   data_nodes_sub  = NodeSubSetIn;
   performance
      nthreads     = 2;
   reach
      by_cluster
      out_nodes    = ReachNodes
      out_counts   = ReachCounts
      maxreach     = 2;
run;

Notice in this example that you can process each subgraph in parallel by using the NTHREADS= option in the PERFORMANCE statement.

The data sets ReachNodes and ReachCounts now contain the nodes and counts of the reach networks, respectively, that come from S for each induced subgraph.

Figure 1.99: Reach Networks for $S=\{ B,C\}$ with Hop Limit of 2 for Induced Subgraphs

ReachNodes

reach	node	cluster
1	B	1
1	C	0
1	D	0
1	E	0
1	F	1
1	G	1

ReachCounts

reach	node	cluster	count	count_not
1	C	0	3	1
1	B	1	3	2

Notice that since you are operating on the induced subgraphs (not the original graph), node B cannot reach nodes C and E because they are not in its induced subgraph.

Processing Multiple Reach Networks in One Pass by Cluster

You can also process several reach networks in one pass while looking over decomposed subgraphs. Consider the same original graph and subgraphs from the section Processing Reach Networks by Cluster. Now, suppose you want the one-hop-limit reach network where each original node is its own source node subset. Define nine source sets by using the node subset data set as follows:

data NodeSubSetIn;
   input node $ reach @@;
   datalines;
A 1  B 2  C 3  D 4  E 5
F 6  G 7  H 8  I 9
;

Then, to calculate the reach networks (including the directed graph counts) for each source node set on the induced subgraphs, use the following statements:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes      = NodeSetIn
   data_nodes_sub  = NodeSubSetIn;
   performance
      nthreads     = 2;
   reach
      by_cluster
      digraph
      out_nodes    = ReachNodes
      out_counts   = ReachCounts
      maxreach     = 1;
run;

Notice that you can do the same thing using the EACH_SOURCE option. In this case, you do not need the subset data set.

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes      = NodeSetIn;
   performance
      nthreads     = 2;
   reach
      each_source
      by_cluster
      digraph
      out_nodes    = ReachNodes
      out_counts   = ReachCounts
      maxreach     = 1;
run;

The resulting data sets ReachNodes and ReachCounts are displayed in Figure 1.100.

Figure 1.100: Reach Networks for Each Source for Induced Subgraphs with a Node Hop Limit of 1

ReachNodes

reach	node	cluster
1	A	0
1	C	0
1	D	0
2	B	1
2	F	1
3	C	0
3	E	0
4	D	0
4	E	0
5	D	0
5	E	0
6	F	1
6	G	1
7	G	1
7	H	1
7	I	1
8	G	1
8	H	1
8	I	1
9	I	1

ReachCounts

reach	node	cluster	count	count_not	count_in	count_out	count_in_or_out	count_in_and_out
1	A	0	3	1	0	3	3	0
2	B	1	2	3	0	2	2	0
3	C	0	2	2	1	2	3	0
4	D	0	2	2	2	2	2	1
5	E	0	2	2	2	2	2	1
6	F	1	2	3	1	2	3	0
7	G	1	3	2	2	3	3	1
8	H	1	3	2	1	3	2	1
9	I	1	1	4	2	1	3	0

Processing Each Source Reach Network for Hop Limits of Both 1 and 2 in One Pass by Cluster

In this section, suppose you want to calculate the one-hop- and two-hop-limit reach counts on the same graph for each source node on a set of induced subgraphs. You can do this in one pass by using the OUT_COUNTS1= and OUT_COUNTS2= options, as follows:

proc optgraph
   graph_direction = directed
   data_links      = LinkSetIn
   data_nodes      = NodeSetIn;
   performance
      nthreads     = 2;
   reach
      each_source
      by_cluster
      out_counts1  = ReachCounts1
      out_counts2  = ReachCounts2;
run;

The resulting data sets ReachCounts1 and ReachCounts1 are displayed in Figure 1.101.

Figure 1.101: Reach Counts for Each Source Node for Induced Subgraphs with a Hop Limit of 1 and 2

ReachCounts1

reach	node	cluster	count	count_not
1	A	0	3	1
3	C	0	2	2
4	D	0	2	2
5	E	0	2	2
2	B	1	2	3
6	F	1	2	3
7	G	1	3	2
8	H	1	3	2
9	I	1	1	4

ReachCounts2

reach	node	cluster	count	count_not
1	A	0	4	0
3	C	0	3	1
4	D	0	2	2
5	E	0	2	2
2	B	1	3	2
6	F	1	4	1
7	G	1	3	2
8	H	1	3	2
9	I	1	1	4

For a more detailed example, see Reach Networks for Computation of Market Coverage of a Terrorist Network.

reach	node	cluster
1	A	0
1	C	0
1	D	0
2	B	1
2	F	1
3	C	0
3	E	0
4	D	0
4	E	0
5	D	0
5	E	0
6	F	1
6	G	1
7	G	1
7	H	1
7	I	1
8	G	1
8	H	1
8	I	1
9	I	1

reach	node	cluster	count	count_not	count_in	count_out	count_in_or_out	count_in_and_out
1	A	0	3	1	0	3	3	0
2	B	1	2	3	0	2	2	0
3	C	0	2	2	1	2	3	0
4	D	0	2	2	2	2	2	1
5	E	0	2	2	2	2	2	1
6	F	1	2	3	1	2	3	0
7	G	1	3	2	2	3	3	1
8	H	1	3	2	1	3	2	1
9	I	1	1	4	2	1	3	0

reach	node	cluster
1	A	0
1	C	0
1	D	0
2	B	1
2	F	1
3	C	0
3	E	0
4	D	0
4	E	0
5	D	0
5	E	0
6	F	1
6	G	1
7	G	1
7	H	1
7	I	1
8	G	1
8	H	1
8	I	1
9	I	1

reach	node	cluster	count	count_not	count_in	count_out	count_in_or_out	count_in_and_out
1	A	0	3	1	0	3	3	0
2	B	1	2	3	0	2	2	0
3	C	0	2	2	1	2	3	0
4	D	0	2	2	2	2	2	1
5	E	0	2	2	2	2	2	1
6	F	1	2	3	1	2	3	0
7	G	1	3	2	2	3	3	1
8	H	1	3	2	1	3	2	1
9	I	1	1	4	2	1	3	0

reach	node	cluster
1	A	0
1	C	0
1	D	0
2	B	1
2	F	1
3	C	0
3	E	0
4	D	0
4	E	0
5	D	0
5	E	0
6	F	1
6	G	1
7	G	1
7	H	1
7	I	1
8	G	1
8	H	1
8	I	1
9	I	1

reach	node	cluster	count	count_not	count_in	count_out	count_in_or_out	count_in_and_out
1	A	0	3	1	0	3	3	0
2	B	1	2	3	0	2	2	0
3	C	0	2	2	1	2	3	0
4	D	0	2	2	2	2	2	1
5	E	0	2	2	2	2	2	1
6	F	1	2	3	1	2	3	0
7	G	1	3	2	2	3	3	1
8	H	1	3	2	1	3	2	1
9	I	1	1	4	2	1	3	0