IMSTAT Procedure (Analytics)

HYPERGROUP Statement

The HYPERGROUP statement analyzes a graph whose vertices are identified by values of the analysis variables and whose edges are named by those variables within the same observation. The analysis that can be performed falls into three general areas: structural analysis of the overall graph, centrality calculation for individual vertices, and layout of the graph in either 2-D or 3-D space.

Syntax

HYPERGROUP <variable-list> </ options>;

Optional Argument

variable-list

specifies the variables to include in the analysis. The variables must have a character data type. Separate variable names with a space. If you do not specify any variables, then all character variables in the active table are used in the analysis.

More information, including the namespace specification, is available in Specifying Analysis Variables.

HYPERGROUP Statement Options

C=relative-strength

specifies the relative strength of local forces to global forces with regard to laying out the positions of vertices and edges. The Walshaw layout is a force-directed algorithm that finds positions of vertices so that no vertices are too close together and so that (usually) edges are about the same length. The force term in a force-directed layout algorithm is related to springs. Imagine each vertex is a ring and each edge is a spring whose ends are hooked around the rings of the vertices to which the edge connects.

Each spring is equally springy (has the same spring constant). If a spring is too compressed, it wants to push apart the vertices at its ends. If the spring is too extended, the spring wants to pull the vertices closer together. The forces that are exerted by these springs, for which there is a corresponding edge, are local forces.

In addition, vertices that are near each other that might or might not be connected by an edge are modeled as if there is a temporary spring between them that is capable of repulsion only. This is done to keep vertices from being located too close to each other. If vertices are very close, the repulsion is very great. These forces are known as global forces. There is not necessarily an edge between two vertices that exert global forces against each other.

This option controls the relative strength of local forces to global forces. In general, larger values for C= result in graphs with more space between vertices. Vertices in one strongly connected subgraph repel the vertices in other strongly connected subgraphs, and the effect is to lengthen weak edges. Good values for C= begin at 0.01. Values above 0.1 typically cause global forces to become too strong relative to local forces.

Default	0.01
Applies to	LAYOUT=WALSHAW
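
The balance between local and global forces can be sketched in a few lines of Python. The force functions below (attraction of d*d/k along an edge, repulsion of c*k*k/d between nearby vertices) are assumptions in the spirit of published force-directed layouts, not the formulas that the server is documented to use. The function names and the natural edge length k are hypothetical.

```python
def attraction(d, k):
    """Local force: pull along an edge of current length d (assumed form d^2 / k)."""
    return d * d / k


def repulsion(d, k, c):
    """Global force: push between two nearby vertices (assumed form c * k^2 / d)."""
    return c * k * k / d


def equilibrium_distance(k, c):
    """Distance at which the two assumed forces balance for one connected pair.

    Solving d^2 / k = c * k^2 / d for d gives d = k * c ** (1/3).
    """
    return k * c ** (1.0 / 3.0)


if __name__ == "__main__":
    k = 10.0  # hypothetical natural edge length
    for c in (0.01, 0.05, 0.1):
        d = equilibrium_distance(k, c)
        print(f"C={c}: forces balance at distance {d:.2f}")
```

Under these assumed forces, a larger relative strength moves the balance point outward, which matches the statement that larger values for C= leave more space between vertices.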

CENTRALITY

specifies to quantify the importance of each vertex among its peers. Many types of centrality have been defined. The HYPERGROUP statement supports five that are commonly used. Four of these are based on shortest paths (the smallest number of edges in a path from one vertex to the other). The fifth is a geometric measure that can be calculated when graph layout is performed. For more information, see Centrality Measures.

CLOSITERS=n

specifies the number of layout iterations that are performed before a sub-algorithm determines the vertices that are close to each other. Increasing the value can improve performance because more iterations are performed between attempts to evaluate which nodes are too close together.

Default

COMMALGORITHM= ASYNCHRONOUS | SYNCHRONOUS | SEMISYNCHRONOUS

COMMALGORITHM= LLSYNCHRONOUS | LLSEMISYNCHRONOUS

specifies a particular label propagation algorithm when STRUCTURAL=COMMUNITY analysis is specified. The LL prefix indicates to use a parallel version of the algorithm.

Alias	COMMALG

COMMITERS=n

specifies the number of iterations to perform while determining communities. Communities are determined by a variant of the label propagation algorithm described in Raghavan, Albert, and Kumara (2007). The algorithm is iterative, and stops when COMMITERS= iterations have been performed.

The algorithm might perform fewer iterations than are specified if all vertices have this property: a community c is formed by a set of vertices so that for any vertex v in c, the number of edges directed from v to other vertices in c outnumbers or ties the number of edges directed from v to vertices outside c.

Default	20
Tip	The synchronous algorithms that are available with the COMMALG= option can require larger values for COMMITERS= for convergence to occur.
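
The iteration limit and the early-stopping property can be illustrated with a small, deterministic label propagation sketch in Python. This is not the server's implementation: the visiting order, the tie-breaking rule (prefer the current label, otherwise the largest candidate), and the function names are assumptions made for illustration.

```python
from collections import Counter


def is_stable(adj, labels):
    """True when, for every vertex, edges inside its community outnumber or tie edges outside it."""
    for v, neighbors in adj.items():
        inside = sum(1 for w in neighbors if labels[w] == labels[v])
        if inside < len(neighbors) - inside:
            return False
    return True


def label_propagation(adj, max_iters=20):
    """Asynchronous label propagation; returns (labels, iterations performed)."""
    labels = {v: v for v in adj}  # every vertex starts in its own community
    for iteration in range(1, max_iters + 1):
        for v in sorted(adj):
            counts = Counter(labels[w] for w in adj[v])
            if not counts:
                continue  # isolated vertex keeps its label
            best = max(counts.values())
            candidates = [lab for lab, n in counts.items() if n == best]
            if labels[v] not in candidates:
                labels[v] = max(candidates)  # assumed tie-breaking rule
        if is_stable(adj, labels):
            return labels, iteration  # stop before max_iters
    return labels, max_iters


if __name__ == "__main__":
    # Two triangles joined by the single edge 2-3.
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    labels, iters = label_propagation(adj)
    print(labels, "after", iters, "iteration(s)")
```

Different tie-breaking choices can produce different communities on the same graph, which is why the sketch states its rule explicitly.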

COMMLAYOUTS

specifies to lay out coordinates for the community graph that is produced with the STRUCTURAL=COMMUNITY (or BOTH) option. The coordinates are returned in the _TEMPHYPGRP3_ and _TEMPEDGES3_ tables.

COMMMAX=n

specifies the maximum number of iterations to perform to determine labeling for communities. For the label propagation algorithm used when you specify STRUCTURAL=COMMUNITY, this option, together with COMMPRECEDENCE, alters tie-breaking schemes when there is a choice as to what value should be assigned to a vertex label.

Label propagation is an epidemic algorithm. Avoid setting COMMMAX= too low, because low values tend to infect vertices with the wrong label in early iterations. Different combinations of this option with COMMPRECEDENCE can affect the number and quality of the communities found. Refer to Cordasco and Gargano (2011 and 2012) for a description of the algorithm and these options.

COMMPRECEDENCE

specifies to tune the label propagation algorithm that is used with STRUCTURAL=COMMUNITY analysis. See the explanation of the COMMMAX= option for details.

Alias	COMMPRE

CREATETEMPLAST= NEVER | ALWAYS | MULTIPLE

specifies when to create the _TEMPLAST_ temporary table that identifies the hypergroups and analysis variables. If you use a large active table with the HYPERGROUP statement, then the _TEMPLAST_ temporary table can be large as well.

NEVER

specifies to never create the _TEMPLAST_ temporary table. Be aware that the other in-memory temporary tables like _TEMPHYPGRP_ and _TEMPEDGES_ are created. These have summarized information about the hypergroups and are smaller than the _TEMPLAST_ table.

ALWAYS

specifies to create the _TEMPLAST_ temporary table.

MULTIPLE

specifies to create the _TEMPLAST_ temporary table when the analysis results in more than one hypergroup.

Default	ALWAYS

FAR_AWAY=d

specifies how to tune the layouts when LAYOUT=WALSHAW is specified.

When FAR_AWAY=1, the default value, the Walshaw algorithm models global forces between vertices if these vertices are not far away from each other. How "far away" is determined is complicated: it depends on the size of the graph and on the stage of graph partitioning that is being performed. The FAR_AWAY= option expresses a multiple of the usual value that the algorithm would calculate.

In other words, if d is usually the greatest distance between two vertices that are allowed to exert a global force on each other, then specifying FAR_AWAY=2 indicates that vertices that are twice d away from each other are allowed to exert global forces. Of course, the farther away vertices are from each other, the weaker the global forces are, but even distant vertices (if they are not excluded) can be influential if there are enough of them. The FAR_AWAY= option controls how many of the distant vertices can exert a force.

The result of using larger values for FAR_AWAY= is similar to using larger values for C=. In both cases, larger values make the layouts more spacious, at the expense of laying out all edges to have similar lengths. A distinctive feature of larger FAR_AWAY= values is that they cause vertices to be positioned farther from the center, toward the nearest pane border.

Default	1

FORMATS=("format-specification",...)

specifies the formats for the GROUPBY= variables. If you do not specify the FORMATS= option, or if you do not specify the GROUPBY= option, the default format is applied for that variable.

Enclose each format specification in quotation marks and separate each format specification with a comma.

GRAPHPARTITION

specifies to tune the layout to improve the separation of vertices. This option can increase the processing duration.

Applies to	LAYOUT=WALSHAW or LAYOUT=FRUCHGOLD

GROUPBY=(variable-list)

specifies a list of variable names, or a single variable name, to use as GROUPBY variables in the order of the grouping hierarchy. If you do not specify any GROUPBY variable names, then the calculation is performed across the entire table—possibly subject to a WHERE clause.

GROUPBYLIMIT=n

specifies the maximum number of levels in a GROUPBY set. When the software determines that there are at least n levels in the GROUPBY set, it abandons the action, returns a message, and does not produce a result set. You can specify the GROUPBYLIMIT= option if you want to avoid creating excessively large result sets in GROUPBY operations.

GROUPFILTER=(groupfilter-options)

specifies a section of the GROUPBY= hierarchy to include in the HYPERGROUP computation.

HEIGHT=z

specifies the maximum value for the frame's coordinate space in the Z-axis.

Be aware that the MARGIN= value is subtracted from the HEIGHT= value.

Default	100 units
Interaction	This option is used only when you specify the THREED option.

HIGHDEGREE= 0 | 1

specifies to enable a heuristic that begins partitioning by eliminating vertices of unusually high degree. The degree of a vertex is the number of edges that originate from or are directed toward the vertex. Some graphs have many vertices with low degree but also a few vertices with very high degree. It is often beneficial to treat these high-degree vertices as partitions early in the partitioning algorithm, even if they do not strictly split a graph. This simplifies the processing because the graphs that remain are less dense and faster to process.

Default	0 (disabled)
Range	0 or 1

LAYOUT= WALSHAW | FRUCHGOLD | OTHER

specifies one of three force-directed algorithms to use for graph layout.

If you specify either LAYOUT=WALSHAW or LAYOUT=FRUCHGOLD, you can also specify the GRAPHPARTITION option, so that graph partitioning is used.

The WALSHAW option performs the algorithm described by C. Walshaw (2000). The FRUCHGOLD option performs the algorithm described by T.M.J. Fruchterman and E.M. Reingold (1991). Specifying OTHER performs an algorithm that is proprietary to SAS.

The force term in a force-directed layout algorithm is related to springs. Imagine each vertex is a ring and each edge is a spring whose ends are hooked around the rings of the vertices that the edge connects. Each spring is equally springy. If a spring is too compressed, it wants to push apart the vertices at its ends. If the spring is too extended, the spring wants to pull the vertices closer together. In addition, vertices that are near each other but are not connected by an edge are modeled as if there is a temporary spring between them that is capable of repulsion only. This modeling prevents vertices from being laid out too close to each other. Of course, if vertices are very close, the repulsion is very great.

Default	LAYOUT=WALSHAW

LENGTH=y

specifies the maximum value for the frame's coordinate space in the Y-axis.

Be aware that the MARGIN= value is subtracted from the LENGTH= value.

Default	100 units

MARGIN=n

specifies the size of the border around the frame's coordinate space to remain free of vertices. For example, if you specify LENGTH=100, WIDTH=100, and MARGIN=12, then the frame coordinate space is 100 × 100 units and vertices have coordinates within the corners (12, 12), (12, 88), (88, 12), and (88, 88).
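
The arithmetic in the MARGIN= example can be checked with a short Python sketch. The function names are hypothetical, and the sketch covers only the two-dimensional case that the example describes.

```python
def vertex_bounds(width, length, margin):
    """Return ((x_min, x_max), (y_min, y_max)) for vertex coordinates."""
    if 2 * margin >= min(width, length):
        raise ValueError("margin leaves no room for vertices")
    return (margin, width - margin), (margin, length - margin)


def corners(width, length, margin):
    """Corner points of the region in which vertices can be placed."""
    (x0, x1), (y0, y1) = vertex_bounds(width, length, margin)
    return [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]


if __name__ == "__main__":
    print(corners(width=100, length=100, margin=12))
```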

MAXNODES=n

specifies to tune graph partitioning by specifying the maximum number of nodes to permit in a partition. Each time a partitioning is performed, the resulting set of partitioned subgraphs is examined. If any exceed the maximum number of nodes specified in this option, then the partitioning is repeated on those partitions.

Default

MAXNVALS=i

specifies a positive integer that determines the maximum number of iterations for the percentile algorithm.

Default	1000

NITERATIONS=i

specifies a positive integer that determines the maximum number of iterations to execute for the force-directed layout algorithm. A value between 200 and 5000 produces good results with most data sets. The LAYOUT=WALSHAW layout algorithm might stop before completing all NITERATIONS= iterations if the algorithm detects that convergence has occurred.

If you specify NITERS=0, then it is the same as specifying the NOCOORD option.

Alias	NITERS
Default	1000

NOCOLOR

specifies not to run the graph partitioning algorithm to assign colors to strongly connected communities. The algorithm is run by default. This option is useful if you do not use the color values, because it avoids the processing that is performed to assign color categories.

Alias	NOCOLOUR

NOCOORD

specifies not to perform graph layout of vertices and edges. Graph layout is the most time-consuming calculation that the HYPERGROUP statement performs. This option is useful if you do not need a visual or geometric layout, or calculation of centroid centrality. This option can improve the response time and conserve machine resources.

NOPENDANTS

specifies to simplify the graph layout by removing pendants (nodes of degree one). Pendants are removed repeatedly until none remain in the graph.
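
Repeated pendant removal can be sketched in Python as follows. The adjacency-dictionary representation and the function name are assumptions for illustration. The sketch removes every degree-one node found in a pass, then repeats, because removing one pendant can turn its neighbor into a new pendant.

```python
def remove_pendants(adj):
    """Return a copy of the graph with degree-one nodes removed until none remain."""
    adj = {v: set(neighbors) for v, neighbors in adj.items()}
    while True:
        pendants = [v for v, neighbors in adj.items() if len(neighbors) == 1]
        if not pendants:
            return adj
        for v in pendants:
            for w in adj.pop(v):
                if w in adj:  # the neighbor may already have been removed
                    adj[w].discard(v)


if __name__ == "__main__":
    # A triangle (1, 2, 3) with a two-node tail 3-4-5.
    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
    print(sorted(remove_pendants(graph)))
```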

NOVARS

specifies not to transfer additional variables to the _TEMPLAST_ table. See also VARS=.

PARTITION <=partition-key>

When you specify this option and the table is partitioned, the results are calculated separately for each value of the partition key. In other words, the partition variables function as automatic GROUPBY variables. This mode of executing calculations by partition is more efficient than using the GROUPBY= option. With a partitioned table, the server takes advantage of knowing that observations for a partition cannot be located on more than one worker node.

If you do not specify a partition-key, the analysis is performed for all partitions. If you do specify a partition-key, the analysis is carried out for the specified key value only. You can use the PARTITIONINFO statement to retrieve the valid partition key values for a table.

You can specify a partition-key in two ways. You can supply a single quoted string that is passed to the server, or you can specify the elements of a composite key separated by commas. For example, if you partition a table by variables GENDER and AGE, with formats $1 and BEST12, respectively, then the composite partition key has a length of 13. You can specify the partition for the 11-year-old females as follows:

statement / partition="F          11"; /* passed directly to the server */
statement / partition="F","11";        /* composed by the procedure */

If you choose the second format, the procedure composes a key based on formatting information from the server.

Alias	PART=
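
The layout of the composite key in the example can be reproduced with a small Python sketch. It assumes that character values are left-aligned and numeric values are right-aligned within their format widths, which is consistent with the quoted 13-character string above. The function name and the tuple layout are hypothetical, and the procedure itself composes the key from formatting information on the server.

```python
def compose_partition_key(parts):
    """Build a fixed-width key from (value, width, is_numeric) tuples."""
    pieces = []
    for value, width, is_numeric in parts:
        text = str(value)
        # Assumed alignment: numeric values right-aligned, character values left-aligned.
        pieces.append(text.rjust(width) if is_numeric else text.ljust(width))
    return "".join(pieces)


if __name__ == "__main__":
    # GENDER uses format $1 (width 1); AGE uses format BEST12 (width 12).
    key = compose_partition_key([("F", 1, False), (11, 12, True)])
    print(repr(key), len(key))
```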

RADIANS

specifies to return the centroid centrality angles in radians rather than degrees.

Applies to	CENTRALITY option

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SCALECOORDS

specifies to scale vertex coordinate values so that they are within the boundaries specified with the LENGTH=, WIDTH=, and HEIGHT= options. This option is useful when you specify LAYOUT=FRUCHGOLD or LAYOUT=OTHER and do not specify GRAPHPARTITION.

SEPARATOR= NODES | VERTICES

SEPARATOR= ARCS | EDGES

SEPARATOR= HYBRID

specifies how to tune the graph partitioning algorithm by indicating how to choose partition separators.

Graph partitioning works to find separators that are small and partitions that are large. A vertex separator is a set of vertices that, if removed from the graph, results in two or more separate sub-graphs that correspond to partitions. An edge separator is a set of edges that, if removed, results in two or more separate sub-graphs that correspond to partitions. There is never an edge between vertices in separate partitions.

By default, SEPARATOR=HYBRID. In this case, a vertex separator is ultimately determined, but in the initial stages of graph partitioning, edge separators are determined. As graph partitioning continues, vertex separation is used.

Default	HYBRID

SETSIZE

requests that the server estimate the size of the result set. The procedure does not create a result table if the SETSIZE option is specified. Instead, the procedure reports the number of rows that are returned by the request and the expected memory consumption for the result set (in KB). If you specify the SETSIZE option, the SAS log includes the number of observations and the estimated result set size. See the following log sample:

NOTE: The LASR Analytic Server action request for the STATEMENT
      statement would return 17 rows and approximately
      3.641 kBytes of data.

The typical use of the SETSIZE option is to get an estimate of the size of the result set in situations where you are unsure whether the SAS session can handle a large result set. Be aware that in order to determine the size of the result set, the server has to perform the work as if you were receiving the actual result set. Requesting the estimated size of the result set does consume resources on the server. The estimated number of KB is very close to the actual memory consumption of the result set. It might not be immediately obvious how this size relates to the displayed table, since many tables contain hidden columns. In addition, some elements of the result set might not be converted to tabular output by the procedure.

STRUCTURAL= NONE | COLOR | COLOUR | COMMUNITY | BOTH

Hypergroups (completely disconnected subsets) are always identified within the graph. Specify this option to request additional structural analyses that identify strongly connected components within each hypergroup. This option enables you to find subsets of the graph whose vertices have many interrelationships internally, but fewer with vertices in other subsets. Unlike hypergroups, these subsets are not disconnected from each other.

When you specify this option, two additional temporary tables are created for vertices and edges. These tables depict the strongly connected components as a graph. Using these tables, it is possible to zoom out from the detailed graph—to depict the mesostructure or macrostructure of the graph. In a sense, this is the graph theory equivalent of aggregation on numeric quantities.

BOTH

specifies to perform COLOR and COMMUNITY analysis.

COLOR | COLOUR

specifies to identify the strongly connected components with the graph partition algorithm and assign a color value to each component. A color value is assigned to each vertex and edge. The following table identifies each component, table, and column name that includes a color value.

Component	Table Name	Column Name
Vertices	_TEMPHYPGRP_	_COLOR_
Edges	_TEMPEDGES_	_SCOLOR_ and _TCOLOR_

In addition, the graph of the derived components is described in the _TEMPHYPGRP2_ and _TEMPEDGES2_ temporary tables.

COMMUNITY

specifies to identify the strongly connected components with the label propagation algorithm and assign a community value to each component. A community value is assigned to each vertex and edge. The following table identifies each component, table, and column name that includes a community value.

Component	Table Name	Column Name
Vertices	_TEMPHYPGRP_	_COMMUNITY_
Edges	_TEMPEDGES_	_SCOMMUNITY_ and _TCOMMUNITY_

In addition, the graph of the derived communities is described in the _TEMPHYPGRP3_ and _TEMPEDGES3_ temporary tables.

Sometimes communities are better at indicating the components that are strongly connected. Sometimes colors do better—particularly when the graph is less structured but still can be usefully divided by separators. For many graphs that have structure, vertices that have the same color also have the same community value, although the color value and the community values can be different.

Default	NONE

TEMPTABLE

specifies to store the results of the analysis in in-memory tables on the server. You do not need to specify this option because the HYPERGROUP statement always generates in-memory tables for the result sets.

THREED

specifies to graph the layout in three dimensions instead of two dimensions. The HEIGHT= option controls the maximum values for the Z-axis.

Alias

TOPLEFT

specifies to produce the graph layout coordinates and centroid centrality angles based on an origin at the top left corner of the drawing window. By default, the HYPERGROUP statement generates coordinates based on an origin at the bottom left corner of the drawing window.

VARFORMATS=("format-specification",...)

specifies the formats to apply to the variables. If you do not specify the VARFORMATS= option, the default formats are applied to the variables.

VARIABLES=(variable-1 ... variable-n)

specifies the variables from the active table to transfer to the generated _TEMPLAST_ table as additional ID variables. The variables that are specified after the HYPERGROUP statement are always transferred. By default, all variables are transferred.

Alias	VARS=
See	NOVARS

WIDTH=x

specifies the maximum value for the frame's coordinate space in the X-axis.

Be aware that the MARGIN= value is subtracted from the WIDTH= value.

Default	100 units

Details

Introduction to the HYPERGROUP Statement

Hypergroups extend how SAS LASR Analytic Server can identify data that are connected by variables that have disjoint sets of values. Hypergroup technology is used to perform analytics after data have been split up in meaningful ways. Most IMSTAT procedure statements enable analysis to be split into independent parts by values of GROUPBY variables in individual records. In contrast, hypergroups enable analysis based on data that are spread across more than one record.

The algorithms used to determine hypergroups are based on graph theory. Vertices have names that are values of hypergroup variables, and an edge that connects vertices (v1, v2) is generated if there is any record in the data that has v1 and v2 as values in adjacent hypergroup variables.

Besides generating the _TEMPLAST_ table (similar to the active table, but with an additional _HypGrp_ variable), the HYPERGROUP statement generates the _TEMPHYPGRP_ table that includes information about the vertices and the _TEMPEDGES_ table that includes information about the edges. Both of these tables have indices with respect to all the data, and other indices with respect to vertices in subgraphs that correspond to separate hypergroups.

The _TEMPHYPGRP_ and _TEMPEDGES_ tables also have vertex coordinates and suggested color and community values. These values can assist with plotting the hypergroups. If you use the HYPERGROUP statement to identify groups only and do not plan to use the plotting information, then specify NOCOORD and NOCOLOR to improve processing times.

Specifying Analysis Variables

Simple Syntax

The analysis variables for HYPERGROUP are interpreted to form the graph. At least two variables must be listed. In a given observation, each analysis variable value is interpreted as the name of a vertex. Consecutive analysis variables in the list define an edge between the vertices they name. Thus, in this example:

hypergroup a b;   /* One edge for each observation  */
hypergroup a b c; /* Two edges for each observation */

The first statement defines one edge for each observation—between the vertex identified by the value of variable A and the vertex identified by the value of variable B. The second statement defines two edges per observation—the first between the A and B vertices and the second between the B and C vertices.

Note: When an analysis variable has a missing value, no edge is produced.
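
The rule that consecutive analysis variables define edges can be sketched in Python. The dictionary-per-observation representation, the treatment of None and blank strings as missing, and the function names are assumptions for illustration only.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or str(value).strip() == ""


def edges_from_rows(rows, variables):
    """One edge per pair of consecutive analysis variables in each observation."""
    edges = []
    for row in rows:
        for left, right in zip(variables, variables[1:]):
            a, b = row.get(left), row.get(right)
            if is_missing(a) or is_missing(b):
                continue  # a missing value produces no edge
            edges.append((a, b))
    return edges


if __name__ == "__main__":
    rows = [
        {"a": "x", "b": "y", "c": "z"},
        {"a": "x", "b": "", "c": "z"},  # b is missing in this observation
    ]
    print(edges_from_rows(rows, ["a", "b"]))       # like: hypergroup a b;
    print(edges_from_rows(rows, ["a", "b", "c"]))  # like: hypergroup a b c;
```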

A simple example follows:

libname example sasiola host="grid001.example.com" port=10010 tag=hps;

data example.sales;
  input @1 Customer $ @7 Dealer $12. @21 Model $;
  cards;
Tina  AutoEmporium    Focus
Tony  AutoEmporium    Sonic
Tom   Prestige        TT
Tom   AutoEmporium    ATS
Blake AutoMall        Accord
Bob   AutoMall        RAV4
Bart  AutoMall        Civic
Beth  AutoMall        Accord
;;;

proc imstat data=example.sales;
  hypergroup customer dealer / vars=(model);
run;

  table example.&_TEMPLAST_;
  fetch / format;
run;
  table example.&_TEMPHYPGRP_;
  fetch / format;
run;
  table example.&_TEMPEDGES_;
  fetch / format;
run;

By default, HYPERGROUP considers all vertex identifiers equivalent, regardless of which analysis variables they come from. In the previous simple example, "Tom" is just a vertex, neither a customer nor a dealer. Furthermore, there is nothing preventing you from linking "Bob" and "Tom." All analysis is based on edge relationships, not on attributes of the vertices. Do not misinterpret the syntax as indicating that a link is specifically from "Bob" to "AutoMall." This is not a directed graph. The presence of an edge indicates an undirected connection between "Bob" and "AutoMall," with no more significance in one direction than the other.

Namespace Syntax

The following HYPERGROUP example is similar to the previous example, but demonstrates the syntax for using namespaces.

proc imstat data=example.sales;
  hypergroup (customer, dealer) / vars=(model);
run;

By enclosing variables in parentheses (and separating them with commas), you can indicate that the values from one or more columns form their own vertex namespace. Each comma between variables specifies a different namespace. You can avoid interpreting a customer name and a dealership name as the same vertex by specifying the preceding syntax to use different namespaces.

By default, without the namespace syntax, the vertex identifiers in different analysis variables are equivalent. A larger data set might include a customer named "Don Smith" and a dealership that is also named "Don Smith." In this case, they would be considered equivalent and identify the same vertex unless namespace syntax is used.
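
The effect of namespaces on vertex identity can be sketched in Python by pairing each value with a namespace number. The zero-based numbering, the tuple representation, and the function name are assumptions for illustration and do not describe how the server stores namespaces.

```python
def vertex_ids(row, namespaces):
    """Return (namespace_index, value) identifiers for one observation.

    namespaces is a list of lists of variable names, one inner list per namespace.
    """
    ids = []
    for ns_index, variables in enumerate(namespaces):
        for var in variables:
            ids.append((ns_index, row[var]))
    return ids


if __name__ == "__main__":
    row = {"customer": "Don Smith", "dealer": "Don Smith"}

    # Simple syntax: both variables share one namespace, so there is one vertex.
    print(set(vertex_ids(row, [["customer", "dealer"]])))

    # Namespace syntax: (customer, dealer) puts each variable in its own namespace.
    print(set(vertex_ids(row, [["customer"], ["dealer"]])))
```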

When you specify the namespace syntax, an additional ODS table is generated. The table identifies each analysis variable and a namespace value.

ODS Table for Namespaces

A corresponding _Namespace_ column is also added to the _TEMPHYPGRP_ table. Two additional columns, _Snspace_ and _Tnspace_, are added to the _TEMPEDGES_ table.

Centrality Measures

Overview

When you specify the CENTRALITY option, additional columns are added to the _TEMPHYPGRP_ and _TEMPEDGES_ temporary tables that the HYPERGROUP statement generates. Information about the measures of centrality and the additional columns is included in the following sections.

If you specify COMMCENTRALITY, then similar tables and columns are generated, but they include COMM in the column names. When COMMCENTRALITY is specified, the information is related to centralities, but with respect to each community.

Graph Centrality

Let L_v,w be the length of the shortest path from vertex v to vertex w, when w is reachable from v. Graph centrality for a vertex v is the greatest L_v,w for any reachable w. For graphs that are wider than they are tall, or vice versa, the vertices in the middle have smaller graph centrality than the vertices at the pointy ends.

Graph centrality measures are included in the temporary tables that are generated by the HYPERGROUP statement. Although graph centrality is the name of this centrality, the column names with the measures use the word reach. This is because the quantity measures how far it is to reach the vertex that is farther away than any other.
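
The definition can be sketched in Python with a breadth-first search that counts edges. The adjacency-dictionary input and the function names are assumptions for illustration.

```python
from collections import deque


def bfs_distances(adj, src):
    """Shortest path length (in edges) from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def graph_centrality(adj):
    """Greatest shortest-path length from each vertex to any reachable vertex."""
    return {v: max(bfs_distances(adj, v).values()) for v in adj}


if __name__ == "__main__":
    # A path a-b-c-d-e: the middle vertex has the smallest value.
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
    print(graph_centrality(path))
```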

Closeness Centrality

Let S_v = the sum of L_v,w over all other reachable vertices w, in other words, the sum of the lengths of the shortest paths from vertex v to every other reachable vertex. Let S_max = the greatest S_v. The closeness centrality for vertex v = S_v / S_max.

Closeness centrality measures are included in the temporary tables that are generated by the HYPERGROUP statement. Closeness centrality values can be found in the _CloseComm_ column in _TEMPHYPGRP_ for each vertex, and in the _SourceCloseComm_ and _TargetCloseComm_ columns in _TEMPEDGES_ for the source and target vertices for each edge.
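
A minimal Python sketch of the S_v / S_max calculation follows. The adjacency-dictionary input and the function names are assumptions for illustration. With this definition, vertices near the middle of the graph receive smaller values.

```python
from collections import deque


def bfs_distances(adj, src):
    """Shortest path length (in edges) from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(adj):
    """S_v / S_max, where S_v sums the shortest path lengths from v."""
    sums = {v: sum(bfs_distances(adj, v).values()) for v in adj}
    s_max = max(sums.values(), default=0)
    return {v: (s / s_max if s_max else 0.0) for v, s in sums.items()}


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(closeness_centrality(path))
```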

Stress Centrality

Stress centrality is another centrality that requires the shortest paths to be determined between reachable vertices. In addition, if some vertex v can reach another vertex w, there might be more than one shortest path between them. Such shortest paths are multiple optima. For a vertex v, let N_v = the number of times that v is crossed when traversing all shortest paths, even those that are multiple optima. Let N_max = the greatest N_v. The stress centrality for vertex v = N_v / N_max.

Stress centrality measures are included in the temporary tables that are generated by the HYPERGROUP statement.
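
The N_v / N_max calculation can be sketched in Python by counting shortest paths with a breadth-first search. The sketch assumes that a vertex is "crossed" only when it is an interior vertex of a path (not an endpoint) and counts each ordered pair of endpoints, which does not change the normalized result. The function names are hypothetical.

```python
from collections import deque


def shortest_path_counts(adj, src):
    """BFS from src: distance to, and number of shortest paths to, each reachable vertex."""
    dist, sigma = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]  # every shortest path to v extends to w
    return dist, sigma


def stress_centrality(adj):
    """N_v / N_max, where N_v counts shortest paths that pass through v."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    crossed = {v: 0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in dist_s:
                if v == s or v == t:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    crossed[v] += sigma_s[v] * sigma_v[t]
    n_max = max(crossed.values(), default=0)
    return {v: (n / n_max if n_max else 0.0) for v, n in crossed.items()}


if __name__ == "__main__":
    # s reaches t by two equally short routes (through a or b); u hangs off t.
    adj = {"s": ["a", "b"], "a": ["s", "t"], "b": ["s", "t"],
           "t": ["a", "b", "u"], "u": ["t"]}
    print(stress_centrality(adj))
```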

Betweenness Centrality

Betweenness centrality quantifies the number of times a vertex is crossed along the shortest path, or paths, between two other vertices.

Let T_x,y = the total number of shortest paths from a vertex x to reachable vertex y. Let T_x,y(v) be the number of the paths that cross vertex v. Therefore, the fraction of shortest paths that cross vertex v = T_x,y(v) / T_x,y. Let B_v = the sum of T_x,y(v) / T_x,y for all pairs of reachable vertices x and y. Let B_max = the greatest B_v. The betweenness centrality for vertex v = B_v / B_max.

Betweenness centrality measures are included in the temporary tables that are generated by the HYPERGROUP statement.
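
The B_v / B_max calculation can be sketched in Python in the same way as a shortest-path count, except that each pair of endpoints contributes only the fraction of its shortest paths that cross the vertex. The sketch assumes that endpoints do not count as crossed and that each ordered pair is visited, which leaves the normalized result unchanged. The function names are hypothetical.

```python
from collections import deque


def shortest_path_counts(adj, src):
    """BFS from src: distance to, and number of shortest paths to, each reachable vertex."""
    dist, sigma = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness_centrality(adj):
    """B_v / B_max, where B_v sums T_x,y(v) / T_x,y over reachable pairs."""
    info = {v: shortest_path_counts(adj, v) for v in adj}
    total = {v: 0.0 for v in adj}
    for x in adj:
        dist_x, sigma_x = info[x]
        for y in dist_x:
            if y == x:
                continue
            for v in dist_x:
                if v == x or v == y:
                    continue
                dist_v, sigma_v = info[v]
                if y in dist_v and dist_x[v] + dist_v[y] == dist_x[y]:
                    # Fraction of the x-to-y shortest paths that cross v.
                    total[v] += sigma_x[v] * sigma_v[y] / sigma_x[y]
    b_max = max(total.values(), default=0.0)
    return {v: (b / b_max if b_max else 0.0) for v, b in total.items()}


if __name__ == "__main__":
    adj = {"s": ["a", "b"], "a": ["s", "t"], "b": ["s", "t"],
           "t": ["a", "b", "u"], "u": ["t"]}
    print(betweenness_centrality(adj))
```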

Centroid Centrality

Centroid centrality is different from the other measures of centrality because it uses information from the layout. The X coordinate of the centroid is the sum of X coordinates of other vertices, divided by the number of vertices. The Y coordinate is calculated the same way. The centroid centrality for each vertex is represented as the polar coordinates from the centroid to the vertex and consists of an angle and a magnitude.

Centroid centrality measures are included in the temporary tables that are generated by the HYPERGROUP statement.
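
The polar representation can be sketched in Python for a two-dimensional layout. The sketch assumes that the centroid is the plain average of all vertex coordinates and that angles are measured counterclockwise from the positive X axis in the range -180 to 180 degrees. Neither convention is confirmed by this documentation, and the function name is hypothetical.

```python
import math


def centroid_centrality(coords, radians=False):
    """Return {vertex: (angle, magnitude)} measured from the layout centroid."""
    n = len(coords)
    cx = sum(x for x, _ in coords.values()) / n
    cy = sum(y for _, y in coords.values()) / n
    result = {}
    for v, (x, y) in coords.items():
        angle = math.atan2(y - cy, x - cx)
        if not radians:
            angle = math.degrees(angle)  # degrees unless radians are requested
        result[v] = (angle, math.hypot(x - cx, y - cy))
    return result


if __name__ == "__main__":
    layout = {"a": (0, 0), "b": (2, 0), "c": (2, 2), "d": (0, 2)}
    for vertex, (angle, magnitude) in centroid_centrality(layout).items():
        print(vertex, round(angle, 1), round(magnitude, 3))
```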

Result Tables

Overview

The HYPERGROUP statement generates temporary tables that describe the results of the analysis. Because they are temporary tables, they exist in memory only until your program crosses a step boundary such as the QUIT statement, a new PROC statement, or a DATA statement.

The following code example uses the same data set that is shown in Simple Syntax.

proc imstat data=example.sales;
  hypergroup customer dealer / vars=(model);  /* 1 */
run;

  table example.&_TEMPLAST_;  /* 2 */
  fetch / format;
run;

  table example.&_TEMPHYPGRP_;  /* 3 */
  fetch / format;
run;

  table example.&_TEMPEDGES_;  /* 4 */
  fetch / format;
run;

1	Because the analysis variables, customer and dealer, are specified with the simple syntax, the HYPERGROUP statement considers them both as vertices. The statement does not differentiate between customers and dealers that might have the same value. The VARS= option copies the model variable to the _TEMPLAST_ table.
2	The _TEMPLAST_ temporary table is set as the active table. The FETCH statement prints the first 20 rows from the table and formats the variables.
3	This is identical to the previous description except that the _TEMPHYPGRP_ temporary table is set as the active table.
4	This is identical to the previous description except that the _TEMPEDGES_ temporary table is set as the active table.

Note: The example shows the FETCH statement with the FORMAT option. This improves the readability of the results but converts numeric variables to characters.

If you want to save the tables to the SAS client and preserve the numeric columns, you can use syntax that is similar to FETCH / OUT=libref.HYPGROUPS;. If you want to analyze a table with clients like SAS Visual Analytics, then use the PROMOTE statement to make the temporary table a permanent table. For metadata-aware applications like SAS Visual Analytics, you also need to register the table in SAS metadata.

The _TEMPLAST_ Table

The HYPERGROUP statement generates this table by default. The key features of this table are as follows:

For each row in the active table that is analyzed, subject to a WHERE clause, there is a row in the _TEMPLAST_ table.
The _HypGrp_ variable identifies the hypergroup number (0, 1, 2, and so on) for analysis variables.
The analysis variables are included in the table as well as any variables that you specify in the VARS= option.

The following example output shows how the _HypGrp_ variable identifies the data as two disjoint groups.

Sample _TEMPLAST_ Table

The _TEMPHYPGRP_ Table

The HYPERGROUP statement generates this table by default. The table includes records that are related to the values of the hypergroup variables. The key features of this table are as follows:

The _Value_ variable identifies the values of the hypergroup variables. These are the names of the graph vertices.
The _Index_ variable identifies the index of each vertex.
The _HypGrp_ variable identifies the hypergroup number for the vertex.
The _IndexH_ variable identifies a vertex index within a hypergroup subgraph.
The _XCoord_ and _YCoord_ variables identify the coordinates of the vertex.
The _Color_ variable is the index of a strongly connected component found by the graph partitioning algorithm.

The following display shows the output for the sample data set. Notice that "AutoMall" and "Bart" are both vertices and that they are in the same hypergroup.

Sample _TEMPHYPGRP_ Table

This data contains two hypergroups. One hypergroup is for customers of AutoEmporium and Prestige. The other hypergroup is for customers of AutoMall. AutoEmporium and Prestige are in the same hypergroup because Tom shopped at both dealers. Apart from Tom, there was no cross-shopping. If, however, even one customer of AutoMall had also purchased from one of the other two dealers, then all vertices would have been connected and only one hypergroup would have been identified. In real-world data for car buying, there would likely be enough cross-shopping within a region that only one hypergroup is created. The HYPERGROUP statement provides more sophisticated structural analysis for such scenarios.
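The grouping behavior described above matches the connected components of the graph. The following Python sketch illustrates that idea with a union-find structure. It is not the procedure's implementation. The function name is hypothetical, the rows are illustrative and are not the actual sample data set, and numbering the groups in order of first appearance is an assumption.

```python
def assign_hypergroups(rows):
    """Map each vertex to a group number; each row links two vertices."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for a, b in rows:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # merge the two groups

    numbers = {}                           # root -> group number
    groups = {}
    for v in list(parent):
        root = find(v)
        if root not in numbers:
            numbers[root] = len(numbers)   # assumption: number by first appearance
        groups[v] = numbers[root]
    return groups


# Illustrative (customer, dealer) rows based on the names in the text
rows = [("Tom", "AutoEmporium"), ("Tom", "Prestige"), ("Bart", "AutoMall")]
two_groups = assign_hypergroups(rows)

# One hypothetical cross-shopping purchase connects everything
one_group = assign_hypergroups(rows + [("Bart", "Prestige")])
```

With the three illustrative rows, Tom, AutoEmporium, and Prestige fall into one group and Bart and AutoMall into another. Adding the single hypothetical row that links Bart to Prestige collapses the result into one group.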

The _TEMPEDGES_ Table

The HYPERGROUP statement generates this table by default. The _TEMPEDGES_ table includes the same variables as the _TEMPLAST_ table, except that the hypergroup variables (in this example, Customer and Dealer) are replaced with the _Source_ and _Target_ variables. These two columns have values for the origin and destination of edges. The table also includes index and coordinate variables that are associated with the source and target vertices.

The columns in this table have a similar interpretation to those in the _TEMPHYPGRP_ table, but there is a column for each of the source and target vertices of the edge. (This terminology is a misnomer because the edges are undirected.) Source and target do not need to be considered separately for the hypergroup assignment. Because hypergroups are completely disconnected from each other, edges cannot connect vertices in different hypergroups, so both vertices of an edge belong to the same hypergroup.

The following display shows the results of the COLUMNINFO statement for the _TEMPEDGES_ table. All of the columns are always included except the model column, which is included because it was specified in the VARS= option of the HYPERGROUP statement.

Column Information for the _TEMPEDGES_ Table

The table includes records that are related to the edges between vertices. The key features of this table are as follows:

The _HypGrp_ variable identifies the hypergroup number for analysis variables.
The _Source_ and _Target_ variables have the values of the vertices that each edge connects.
The _Sindex_ and _Tindex_ variables identify the vertex index (0, 1, 2, and so on) for the _Source_ and _Target_ variables.
The _SindexH_ and _TindexH_ variables identify the vertex index (0, 1, 2, and so on) for the _Source_ and _Target_ variables, within each hypergroup subgraph.
The _XCoordS_ and _YCoordS_ variables identify the coordinates of the source vertex.
The _XCoordT_ and _YCoordT_ variables identify the coordinates of the target vertex.
The _SColor_ and _TColor_ variables identify the color index that is associated with the source and target vertices.

The following display shows the output for the sample data set.

Sample _TEMPEDGES_ Table

Additional Tables and Columns in Result Tables

The previous sections describe the temporary table names and columns that are produced by default with the HYPERGROUP statement. The statement offers many options that can add columns and tables. For example, the preceding sections include columns that are related to assigning color values to the vertices. If you specify the NOCOLOR option, then those columns are not included. The same is true for the NOCOORD option. Specifying NOCOORD eliminates the columns that are related to coordinates and improves processing times because calculating coordinates is a computationally intensive task.

If you specify the CENTRALITY option, then the result tables are affected. For more information, see Centrality Measures.

ODS Table Names

The HYPERGROUP statement generates the following ODS tables.

ODS Table Name	Description	Option
HypGrpTables	Temporary hypergroup table names	Default
Namespace	Hypergroup namespaces for a table	When analysis variables are specified with the namespace syntax.

For information about using ODS tables with the SAVE= option, see the Details section of the STORE statement.