This section illustrates the use of the community detection algorithm in distributed mode on the simple undirected graph depicted in Figure 1.1.
Figure 1.1: A Simple Undirected Graph
The following statements create the data set LinkSetIn:

data LinkSetIn;
   input from $ to $ @@;
   datalines;
A B A C A D B C
C D C E D F F G
F H F I G H G I
I J J K J L K L
;
Before you run community detection in distributed mode, you must convert the undirected graph to a directed graph by replicating links in both directions, as shown in the following statements:
data LinkSetIn;
   set LinkSetIn;
   output;        /* keep the original direction */
   tmp  = from;   /* swap from and to ...        */
   from = to;
   to   = tmp;
   output;        /* ... and output the reverse direction */
   drop tmp;
run;
Then you can use PROC HPDS2 to distribute the link data to the appliance, distributed by the variable from, as follows:

libname gplib greenplm
   server   = "grid001.example.com"
   schema   = public
   user     = dbuser
   password = dbpass
   database = hps
   preserve_col_names = yes;

proc datasets nolist lib=gplib;
   delete LinkSetIn;   /* remove the table if it already exists */
quit;

proc hpds2
   data = LinkSetIn
   out  = gplib.LinkSetIn (distributed_by='distributed by (from)');
   performance host = "grid001.example.com" install = "/opt/TKGrid";
   data DS2GTF.out;
      method run();
         set DS2GTF.in;
      end;
   enddata;
run;
The LIBNAME statement option PRESERVE_COL_NAMES=YES is used because the links data set contains the variable from, which is a reserved keyword for DBMS tables that are accessed through SAS/ACCESS. For more information, see SAS/ACCESS for Relational Databases: Reference.
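If you prefer to keep reserved keywords out of the DBMS altogether, an alternative is to rename the link variables before distributing the data. The following is a minimal sketch, not part of the original example; the names tail and head are illustrative, and later steps would then reference the new names (for example, DISTRIBUTED_BY='distributed by (tail)', and the PROC OPTGRAPH DATA_LINKS_VAR statement to map the nondefault link variable names):

data LinkSetIn;
   set LinkSetIn(rename=(from=tail to=head));   /* tail and head are illustrative names */
run;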
After the data have been distributed to the appliance, you can invoke PROC OPTGRAPH. In this example, some of the output data sets are written to the appliance and others are written to the client machine. It is recommended that large output data sets be stored in the DBMS, because transferring data from the appliance to the client machine can take a very long time. In addition, further analysis (such as centrality computation) can benefit from the data’s already being distributed on the appliance. In the following statements, CommLevelOut and CommOut are stored on the client machine in the WORK library, and the remaining output data sets are stored on the appliance, although the output data sets for this simple example are very small:
proc datasets nolist lib=gplib;
   delete NodeSetOut CommOverlapOut CommLinksOut CommIntraLinksOut;
quit;

proc optgraph
   graph_direction = directed
   data_links      = gplib.LinkSetIn
   out_nodes       = gplib.NodeSetOut;
   performance host = "grid001.example.com" install = "/opt/TKGrid";
   community
      resolution_list      = 0.001
      algorithm            = parallel_label_prop
      out_level            = CommLevelOut
      out_community        = CommOut
      out_overlap          = gplib.CommOverlapOut
      out_comm_links       = gplib.CommLinksOut
      out_intra_comm_links = gplib.CommIntraLinksOut;
run;
The data set NodeSetOut contains the community identifier of each node. It is shown in Figure 1.2.
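Because NodeSetOut is stored on the appliance, you can inspect it from the client by reading it through the gplib libref. The following minimal sketch (not part of the original example) prints the table; pulling rows back to the client this way is practical only for small tables such as this one:

proc print data=gplib.NodeSetOut;   /* reads the distributed table through SAS/ACCESS */
run;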
The data set CommLevelOut contains the number of communities and the corresponding modularity values that are found at each resolution level. It is shown in Figure 1.3.
The data set CommOut contains the number of nodes that are contained in each community. It is shown in Figure 1.4.
Figure 1.4: Community Number of Nodes Output
The data set CommOverlapOut contains the intensity of each node that belongs to multiple communities. It is shown in Figure 1.5. In this example, node F is connected to two communities, with 75% of its links connecting to community 1 and 25% of its links connecting to community 2. (Node F has four links in the graph, to nodes D, G, H, and I; three of them connect to one community and one to the other, which yields the 75%/25% split.)
Figure 1.5: Community Overlap Output
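You can verify this split directly from the distributed link data by counting node F's links according to the community of the neighboring node. The following is a minimal sketch, not part of the original example; the NodeSetOut column names node and community_1 (one community column per resolution level) are assumptions, and tail and head are illustrative rename targets:

data flinks;
   set gplib.LinkSetIn(rename=(from=tail to=head));
   if tail = 'F';   /* keep only the links that leave node F */
run;

proc sql;
   select n.community_1 as community,
          count(*)      as links
   from flinks as l, gplib.NodeSetOut as n
   where l.head = n.node
   group by n.community_1;
quit;

Because every undirected link was replicated in both directions, the rows with tail='F' enumerate all four of node F's links, so the two counts (3 and 1) correspond to the 75% and 25% intensities.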
The data set CommLinksOut shows how the communities are interconnected. It is shown in Figure 1.6.
The data set CommIntraLinksOut shows how the nodes are connected within each community. It is shown in Figure 1.7.
Figure 1.7: Intracommunity Links Output