This section illustrates the use of the community detection algorithm in distributed mode on the simple undirected graph depicted in Figure 1.1.
Figure 1.1: A Simple Undirected Graph
The following statements create the data set LinkSetIn:

data LinkSetIn;
   input from $ to $ @@;
   datalines;
A B A C A D B C
C D C E D F F G
F H F I G H G I
I J J K J L K L
;
Before you run community detection in distributed mode, you must convert the undirected graph to a directed graph by replicating links in both directions, as shown in the following statements:
data LinkSetIn;
   set LinkSetIn;
   output;        /* keep the original direction */
   tmp  = from;   /* swap from and to ...        */
   from = to;
   to   = tmp;
   output;        /* ... and output the reverse direction */
   drop tmp;
run;
Then you can use PROC HPDS2 to distribute the link data to the appliance, distributed by the variable from, as follows:

libname gplib greenplm
   server   = "grid001.example.com"
   schema   = public
   user     = dbuser
   password = dbpass
   database = hps
   preserve_col_names = yes;

proc datasets nolist lib=gplib;
   delete LinkSetIn;   /* remove the table if it already exists */
quit;

proc hpds2
   data = LinkSetIn
   out  = gplib.LinkSetIn (distributed_by='distributed by (from)');
   performance host = "grid001.example.com" install = "/opt/TKGrid";
   data DS2GTF.out;
      method run();
         set DS2GTF.in;
      end;
   enddata;
run;
The LIBNAME statement option PRESERVE_COL_NAMES=YES is used because the links data set contains the variable from, which is a reserved keyword for DBMS tables that are accessed through SAS/ACCESS. For more information, see SAS/ACCESS for Relational Databases: Reference.
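If you prefer to keep reserved keywords out of the DBMS altogether, an alternative is to rename the link variables before distributing the data. The following is a minimal sketch, not part of the original example; the names tail and head are illustrative, and later steps would then reference the new names (for example, DISTRIBUTED_BY='distributed by (tail)', and the PROC OPTGRAPH DATA_LINKS_VAR statement to map the nondefault link variable names):

data LinkSetIn;
   set LinkSetIn(rename=(from=tail to=head));   /* tail and head are illustrative names */
run;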
After the data have been distributed to the appliance, you can invoke PROC OPTGRAPH. In this example, some of the output data sets are written to the appliance and others are written to the client machine. It is recommended that large output data sets be stored in the DBMS, because transferring data from the appliance to the client machine can take a very long time. In addition, further analysis (such as centrality computation) can benefit from the data’s already being distributed on the appliance. In the following statements, CommLevelOut and CommOut are stored on the client machine in the WORK library, and the remaining output data sets are stored on the appliance, although the output data sets for this simple example are very small:
proc datasets nolist lib=gplib;
   delete NodeSetOut CommOverlapOut CommLinksOut CommIntraLinksOut;
quit;

proc optgraph
   graph_direction = directed
   data_links      = gplib.LinkSetIn
   out_nodes       = gplib.NodeSetOut;
   performance host = "grid001.example.com" install = "/opt/TKGrid";
   community
      resolution_list      = 0.001
      algorithm            = parallel_label_prop
      out_level            = CommLevelOut
      out_community        = CommOut
      out_overlap          = gplib.CommOverlapOut
      out_comm_links       = gplib.CommLinksOut
      out_intra_comm_links = gplib.CommIntraLinksOut;
run;
The data set NodeSetOut contains the community identifier of each node. It is shown in Figure 1.2.
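Because NodeSetOut is stored on the appliance, you can inspect it from the client by reading it through the gplib libref. The following minimal sketch (not part of the original example) prints the table; pulling rows back to the client this way is practical only for small tables such as this one:

proc print data=gplib.NodeSetOut;   /* reads the distributed table through SAS/ACCESS */
run;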
The data set CommLevelOut contains the number of communities and the corresponding modularity values that are found at each resolution level. It is shown in Figure 1.3.
The data set CommOut contains the number of nodes that are contained in each community. It is shown in Figure 1.4.
Figure 1.4: Community Number of Nodes Output
The data set CommOverlapOut contains the intensity of each node that belongs to multiple communities. It is shown in Figure 1.5. In this example, node F is connected to two communities, with 75% of its links connecting to community 1 and 25% of its links connecting to community 2. (Node F has four links in the graph, to nodes D, G, H, and I; three of them connect to one community and one to the other, which yields the 75%/25% split.)
Figure 1.5: Community Overlap Output
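You can verify this split directly from the distributed link data by counting node F's links according to the community of the neighboring node. The following is a minimal sketch, not part of the original example; the NodeSetOut column names node and community_1 (one community column per resolution level) are assumptions, and tail and head are illustrative rename targets:

data flinks;
   set gplib.LinkSetIn(rename=(from=tail to=head));
   if tail = 'F';   /* keep only the links that leave node F */
run;

proc sql;
   select n.community_1 as community,
          count(*)      as links
   from flinks as l, gplib.NodeSetOut as n
   where l.head = n.node
   group by n.community_1;
quit;

Because every undirected link was replicated in both directions, the rows with tail='F' enumerate all four of node F's links, so the two counts (3 and 1) correspond to the 75% and 25% intensities.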
The data set CommLinksOut shows how the communities are interconnected. It is shown in Figure 1.6.
The data set CommIntraLinksOut shows how the nodes are connected within each community. It is shown in Figure 1.7.
Figure 1.7: Intracommunity Links Output