IMSTAT Procedure (Analytics)

GROUPBY Statement

The GROUPBY statement derives the grouping hierarchy of the distinct formatted values for the specified variables. If no list of variable names is specified, the grouping hierarchy is computed for all variables in the active table. The statement can return a section of all distinct groupings to the client or save the entire grouping set as a temporary table in the server.

Syntax

GROUPBY Statement Options

Details

ODS Table Names

Syntax

GROUPBY <variable-list> </ options>;

GROUPBY Statement Options

AGGREGATE=(aggregation-method)

lists the aggregator on which the ordering of the result set is based.

lists the aggregator for which the values of the WEIGHT variable are rolled up into a rank order score, provided that a WEIGHT= variable is specified. If no WEIGHT= variable is specified, then the aggregator specification is ignored.

The available aggregation methods for the GROUPBY statement are as follows:

MAX	maximum value
MEAN	arithmetic mean
MIN	minimum value
N	counts the nonmissing values of the weight variable
SUM	sum of the weight values

Alias	AGG=
Default	SUM

DESCENDING

specifies to arrange the returned grouping hierarchy of the variables in descending order of the item rankings. If this option is not specified, the returned items are arranged in ascending order. When combined with the LIMIT=n option, the GROUPBY statement can either return the top n or the bottom n distinct groupings.

Interaction

The DESCENDING option is ignored if the TEMPTABLE option is specified.

FREQ=variable-name

specifies the numeric frequency variable that is used to compute the ranking of a distinct grouping. When this option is specified, the AGGREGATE= and WEIGHT= options are ignored. The following GROUPBY statement requests the top 5 groupings of Region and then Product from the Prdsale table. The groupings are rank ordered by the sum of the Actual column:

Example

proc imstat data = mylasr.prdsale;
    groupby region product / freq = actual limit = 5;
run;

LIMIT=n

specifies the maximum number of distinct groupings to be returned. When combined with the DESCENDING option, the GROUPBY statement can either return the top n or the bottom n distinct groupings. The value for n must be a positive integer. For example, the commands below return the bottom 5 groupings according to their Score values:

Default	0
Interaction	This option is ignored if the TEMPTABLE option is specified.
Tip	If n is zero, then all distinct groupings are returned (up to 2³¹–1). With high-cardinality data sets, setting n to zero can significantly delay the response of the server.
Example	proc imstat data = mylasr.prdsale; groupby region product / weight = actual agg = max valuegt = ("West", "Chair") limit=5; run;

NOMISSING

specifies that missing values are excluded in the determination of GROUPBY values. By default, levels with missing values are included.

Alias

NOMISS

NOTEMPPART

specifies that the temporary table that is generated by the TEMPTABLE option is not partitioned by the group-by variables. When you create a temporary table with the GROUPBY statement, by default, the server partitions the table and each partition has a single row. When the number of groups is large, this results in many tiny partitions and requires additional memory resources to store the partition information for the temporary table.

By specifying this option, the temporary table is organized similarly to the default table, but it is not partitioned. This also enables more efficient processing of the table in threaded computations. For example, it is more efficient if you were to add computed columns to the table that you want to use as dimension keys in subsequent SCHEMA statements.

ORDER=rank-order-type

specifies the rank ordering to use for sorting the distinct groupings. The following rank-order types are valid in the GROUPBY statement:

FREQ	frequency count of the variables
VALUE	formatted values of the variables
WEIGHT	aggregate values of the WEIGHT= variables

Default

FREQ

PARTITION <=partition-key>

specifies that when the active table is partitioned, then the variable-list that you specify for the GROUPBY statement is expanded to include the partition variables. The partition variables are added to the beginning of the variable-list.

If you do not specify a partition-key, the analysis is performed for all partitions. If you do specify a partition-key, the analysis is carried out for the specified key value only. You can use the PARTITIONINFO statement to retrieve the valid partition key values for a table.

The following GROUPBY statement works on the partitioned table Cars. The grouping hierarchy becomes Type

Origin

Make:

Example

data mylasr.cars(partition=(type));
    set sashelp.cars;
run;

proc imstat data=mylasr.cars;
  groupby origin make / partition weight=invoice agg=max
                        descending limit=5 ;
run;

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SETSIZE

requests that the server estimate the size of the result set. The procedure does not create a result table if the SETSIZE option is specified. Instead, the procedure reports the number of rows that are returned by the request and the expected memory consumption for the result set (in KB). If you specify the SETSIZE option, the SAS log includes the number of observations and the estimated result set size. See the following log sample:

NOTE: The LASR Analytic Server action request for the STATEMENT
      statement would return 17 rows and approximately
      3.641 kBytes of data.

The typical use of the SETSIZE option is to get an estimate of the size of the result set in situations where you are unsure whether the SAS session can handle a large result set. Be aware that in order to determine the size of the result set, the server has to perform the work as if you were receiving the actual result set. Requesting the estimated size of the result set does consume resources on the server. The estimated number of KB is very close to the actual memory consumption of the result set. It might not be immediately obvious how this size relates to the displayed table, since many tables contain hidden columns. In addition, some elements of the result set might not be converted to tabular output by the procedure.

SCOREGT=f

specifies the exclusive lower bound of the numeric rank order scores of the distinct groupings to return. All distinct groupings with numeric rank order scores that are greater than f are returned.

Alias	SGT=
Interaction	This option is ignored if the TEMPTABLE option is specified.

SCORELT=f

specifies the exclusive upper bound of the numeric rank order scores of the distinct groupings to return. All distinct groupings with numeric rank order scores that are less than f are returned.

Alias	SLT=
Interaction	This option is ignored if the TEMPTABLE option is specified.

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias

TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias

TN=

TEMPTABLE

generates an in-memory temporary table from the result set. The IMSTAT procedure displays the name of the table and stores it in the &_TEMPLAST_ macro variable, provided that the statement executed successfully.

When the IMSTAT procedure exits, all temporary tables created during the IMSTAT session are removed. Temporary tables are not displayed on a TABLEINFO request, unless the temporary table is the active table for the request.

VALUEGT=("format-specification", ...)

specifies the exclusive lower bound of the variable’s formatted values of the distinct groupings to return. All distinct groupings with formatted values for the variable that are lexicographically greater than the specified bound are returned.

Alias	VGT=
Interaction	This option is ignored if the TEMPTABLE option is specified.

VALUELT=("format-specification", ...)

specifies the exclusive upper bound of the variable’s formatted values of the distinct groupings to return. All distinct groupings with formatted values for the variable that are lexicographically less than the specified bound are returned.

Alias	VGT=
Interaction	This option is ignored if the TEMPTABLE option is specified.

VARFORMATS=("format-specification", ...)

specifies the formats for the variables. If you do not specify the VARFORMATS= option, the default formats are applied for the variables.

WEIGHT=variable-name

specifies the numeric weight variable to use for computing the rank order score of a distinct grouping.

Interaction

When the WEIGHT= option is specified, the server sets the ORDER= option to ORDER=WEIGHT.

Details

ODS Table Names

The GROUPBY statement generates the following ODS table.

ODS Table Name	Description	Option
GroupBy	Top K groupby values	Default
TempTable	Information about a temporary table	TEMPTABLE

For information about using the ODS table with SAVE= option, see the Details section of the STORE statement.