High-Performance Features of the OPTGRAPH Procedure

Community Detection Details

Subsections:

Large Community
Output Data Sets

Community detection partitions a graph into communities such that the links within the community subgraphs are more densely connected than the links between communities. You use the COMMUNITY statement to detect communities. The options for this statement are described in the section COMMUNITY Statement.

In distributed mode, PROC OPTGRAPH implements the parallel label propagation algorithm, which was developed at SAS. The goal is to move a node to a community to which most of its neighbors belong. Briefly, the parallel label propagation algorithm does the following:

It initializes each node as its own community.
At each iteration, it randomly chooses some nodes as candidates and moves each candidate from its current community to the neighboring community that has the most nodes. It repeats this step until the number of movements is smaller than the specified tolerance value or the maximum number of iterations has been reached.

The parallel label propagation algorithm is an extension of the synchronous label propagation algorithm proposed in Raghavan, Albert, and Kumara (2007). During each iteration, nodes update their labels simultaneously by using the node label information from the previous iteration. In this approach, node labels can be updated in parallel. However, simultaneous updating of this nature often leads to oscillating labels because of the bipartite subgraph structure often present in large graphs. To address this issue, at each iteration the parallel algorithm skips the labeling step at some randomly chosen nodes in order to break the bipartite structure. You can control the random samples that the algorithm takes by specifying the RANDOM_FACTOR= and RANDOM_SEED= options in the COMMUNITY statement.

As you can see from the description, the algorithm adopts a heuristic local optimization approach. The final result often depends on the sequence of nodes that are presented in the links input data set. Therefore, if the sequence of nodes in the links data set changes, the result is likely to be different.

In distributed mode, the link data must be a directed graph, and you must predistribute the graph to the appliance according to the from column values before you call PROC OPTGRAPH, as described in the section Distributing Input Data to the Appliance. If the original data are an undirected graph, you must convert the undirected graph to a directed graph by replacing each undirected link $A-B$ with two directed links: $A \rightarrow B$ and $B \rightarrow A$ . This ensures that each node can access its immediate neighbors on all computing nodes, thus minimizing data movement among computing nodes.

Large Community

It has often been observed in practice that the number of nodes contained in communities (produced by community detection algorithms) usually follows a power law distribution. That is, a few communities contain a very large number of nodes, whereas most communities contain a small number of nodes. This is especially true for large graphs. PROC OPTGRAPH provides two approaches for you to alleviate this problem:

You can use the RECURSIVE option to recursively break large communities into smaller ones. At the first step, PROC OPTGRAPH processes data as if no RECURSIVE option were specified. At the end of this step, it checks whether the community result has reached the maximum community size. If the community result has reached the maximum community size, PROC OPTGRAPH stops iterations and outputs results. Otherwise, it treats each large community as an independent graph and recursively applies community detection on top of it.

In certain cases, a community is not further split even if it does not meet the maximum community size that you specify. One example is a star-shaped community that contains 200 nodes when MAX_COMM_SIZE is specified as 100.
You can use the RESOLUTION_LIST= option to assign a different value from the default value of 0.001. The resolution value that is specified in this option can be interpreted as the minimal density of communities for an undirected and unweighted graph. The density of a community is defined as the number of links inside the community divided by the total number of possible links. A larger resolution value is likely to result in communities that contain fewer nodes. For more information about resolution values for label propagation, see Traag, Van Dooren, and Nesterov (2011).

If you supply multiple resolution values at one time, the OPTGRAPH procedure performs community detection multiple times, each time with a different resolution value. This is equivalent to calling the OPTGRAPH procedure several times, each time with a different (single) resolution value specified in the RESOLUTION_LIST= option.

The value that you specify in the RESOLUTION_LIST= option has a major impact on the running time of the algorithm. When you specify a large resolution value, the algorithm is likely to create many tiny communities, and nodes are likely to change communities between iterations. Therefore the algorithm might not converge properly. On the other hand, when you specify a small resolution value, the algorithm might find some very large communities, such as a community that contains more than one million nodes. In this case, if you specify the RECURSIVE option, the algorithm spends a long time in the recursive step in order to break large communities into smaller ones.

The recommended approach is to first experiment with a set of resolution values without using the RECURSIVE option. At the end of the run, examine the resulting modularity values and the community size distributions. Remove the resolution values that lead to small modularity values or huge communities. Then add the RECURSIVE option to the COMMUNITY statement, if desired, and run PROC OPTGRAPH again.

Output Data Sets

Community detection produces up to six output data sets. In these data sets, resolution level numbers are in the same order as the values that are specified in the RESOLUTION_LIST= option. For example, if RESOLUTION_LIST=0.001 0.005 0.01, then resolution level 1 is at value 0.001, resolution level 2 is at value 0.005, and resolution level 3 is at value 0.01.

OUT_NODES= Data Set

This data set describes the community identifier of each node. If multiple resolution values have been specified, the data set reports the community identifier of each node at each resolution level. The data set contains the following columns:

node: node label
community_i: community identifier at resolution level i, where i is the resolution level number as previously described. There are K such columns if K different values are specified in the RESOLUTION_LIST= option.

OUT_LEVEL= Data Set

This data set describes the number of communities and their corresponding modularity values at various resolution levels. It contains the following columns:

level: resolution level number
resolution: resolution value
communities: number of communities at the current resolution level
modularity: modularity value at the current resolution level

OUT_COMMUNITY= Data Set

This data set describes the number of nodes in each community. It contains the following columns:

level: resolution level number
resolution: resolution value
community: community identifier
nodes: number of nodes contained in the community

OUT_OVERLAP= Data Set

This data set describes the intensity of each node. At the end of community detection, a node could have links that connect to multiple communities. The intensity of a node is computed as the sum of the link weights that connect to nodes in the specified community divided by the total link weights of the node. This data set is computationally expensive to produce, and it requires a large amount of disk space. Therefore, this data set is not produced if you specify multiple resolution values in the RESOLUTION_LIST= option and ALGORITHM=PARALLEL_LABEL_PROP.

The data set contains the following columns:

node: node label
community: community identifier
intensity: intensity of the node that belongs to the community

OUT_COMM_LINKS= Data Set

This data set describes how communities are connected. This data set is computationally expensive to produce, and it requires a large amount of disk space. Therefore, this data set is not produced if you specify multiple resolution values in the RESOLUTION_LIST= option.

The data set contains the following columns:

from_community: community identifier of the from community
to_community: community identifier of the to community
link_weight: sum of link weights of all links between from_community and to_community

OUT_INTRA_COMM_LINKS= Data Set

This data set describes how the nodes are connected within each community. This data set is computationally expensive to produce, and it requires a large amount of disk space. Therefore, this data set is not produced if you specify multiple resolution values in the RESOLUTION_LIST= option.

The data set contains the following columns:

cluster: the cluster ID (that is, the community ID)
from: the node label of the from node
to: the node label of the to node
weight: the link weight between from and to
weight2: the second link weight between from and to The column weight2 is created if the input links data set has a weight2 column, even though community detection does not use this column during the computation. This is because the OUT_INTRA_COMM_LINKS= data set is used as input to the centrality computation step, in which PROC OPTGRAPH might need two link weight columns: one column for computing PageRank, eigenvector, and hub/authority centralities, and the other column for computing betweenness and closeness centralities. Therefore, community detection carries the second link weight column to the output.