IMSTAT Procedure (Analytics)

ARM Statement

The ARM statement performs association rule mining (ARM). You can use it to derive frequent itemsets, association rules, and sequences.

Syntax

ARM ITEM=item-variable TRAN=transaction-variable </ options>;

Required Arguments

item-variable

specifies the name of the variable in the active table that identifies items.

transaction-variable

specifies the name of the variable in the active table that identifies transactions.

Optional Argument

variable-list

specifies one or more numeric variables. If you do not specify this option, then all numeric variables in the table are used.

ARM Statement Options

AGGREGATE=aggregation-method

specifies the aggregator that combines the scores of an itemset at each occurrence in the data set into the final score of that itemset. If the WEIGHT= variable is not specified, then the aggregator specification is ignored.

The available aggregation methods are as follows:

MAX	maximum value
MEAN	arithmetic mean
MIN	minimum value
SUM	sum of the nonmissing values

Alias	AGG=
Default	SUM
Interaction	You must specify the WEIGHT= option to use this option.

FREQ=variable

specifies the numeric frequency variable to use, along with the WEIGHT= option, for computing the score of each frequent itemset. When the FREQ= variable is not specified, the score of a frequent itemset equals the value of the WEIGHT= variable scaled by 1. Negative values for the specified variable are considered missing.

ITEMAGG=aggregation-method

specifies the aggregator that rolls up the values of the WEIGHT= variable, and optionally the FREQ= variable, into the score of an itemset at each occurrence in the data set. If the WEIGHT= variable is not specified, then the aggregator specification is ignored. The aggregation methods are identical to the list in the AGGREGATE= option.

The ITEMAGG= and AGGREGATE= options work together to derive the final score of an itemset. Given an itemset, the score S is first aggregated by the item aggregator Φ_item(·) over the frequency, f, and weight, w, variables that are associated with each item at each occurrence in a transaction. The intermediate scores of the itemset from all the occurrences are then aggregated again by the set aggregator Φ_set(·). See the following equation.
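One way to write the two-stage aggregation in LaTeX, assuming the form implied by the description above and by the worked example in the Frequent Itemsets Table section, where each occurrence contributes the product of its frequency and weight. The indices t (transactions that contain the itemset) and i (occurrences of the itemset's items within transaction t) are notation introduced here for illustration only:

```latex
% Two-stage score: item aggregation within each transaction t,
% followed by set aggregation across transactions (assumed form).
S = \Phi_{\text{set}}\Bigl( \bigl\{ \Phi_{\text{item}}\bigl( \{ f_{t,i}\, w_{t,i} \}_{i} \bigr) \bigr\}_{t} \Bigr)

% Special case ITEMAGG=SUM and AGGREGATE=MIN:
S = \min_{t} \sum_{i} f_{t,i}\, w_{t,i}
```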

ITEMFMT=("format-specification")

specifies the formats for the ITEM= variable. If you do not specify the ITEMFMT= option, then the unformatted values of the ITEM= variable are used.

Enclose the format specification in quotation marks.

ITEMSTBL

specifies to save the derived frequent itemsets in a temporary table. By default, the frequent itemsets are not saved.

MAXITEMS=n

specifies the maximal number of items to allow in a frequent itemset. The value must be greater than or equal to 1. If an invalid value is specified, then it is replaced with 1, the default value.

Default	1

MINITEMS=n

specifies the minimal number of items to allow in a frequent itemset. The value must be greater than or equal to zero. If an invalid value is specified, then it is replaced with 0, the default value.

If you specify MAXITEMS= < MINITEMS=, the server swaps the values. If you specify MAXITEMS = MINITEMS, the server assigns MAXITEMS = MINITEMS + 1.

Default	0

NOMISSING

specifies that missing values of the ITEM= and TRAN= variables are excluded from analysis. By default, missing values of the ITEM= variable are considered a separate item. Missing values of the TRAN= variable are considered a separate transaction.

Alias	NOMISS

PARTITION <=partition-key>

specifies that partitioning variables are used. When only PARTITION is specified, the table is partitioned first by the TRAN= variable, and the TRANFMT= option is specified, the association rule mining is performed separately for each value of the partition key. If a value for partition-key is specified, then the association rule mining is performed on that partition only.

RELSUPPORT

specifies that the values for LOWER= and UPPER= in the SUPPORT option are relative to the most frequent itemset. For example, if 500 is the support of the most frequent itemset, then specifying RELSUPPORT SUPPORT(LOWER=0.1 UPPER=0.5) means the minimum and the maximum supports for the analysis are 50 and 250, respectively. When using this option, the values for LOWER= and UPPER= must be between 0 and 1. Otherwise, they are set to the default values 0.05 and 1.0, respectively.
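A minimal sketch of the RELSUPPORT usage described above. It assumes the example.assocs table and the Product and Customer variables that are used in other examples in this section, and it assumes that 500 is the support of the most frequent itemset:

```sas
proc imstat data=example.assocs;
    /* Relative bounds: if the most frequent itemset has support 500 (assumed), */
    /* LOWER=0.1 corresponds to 0.1*500 = 50 and UPPER=0.5 to 0.5*500 = 250.    */
    arm item=Product tran=Customer / relsupport
        support(lower=0.1 upper=0.5) itemsTbl;
run;
```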

RULES(<suboptions>)

specifies the requirements for how association rules are generated from frequent itemsets. The following suboptions are available:

AGGREGATE=aggregation-method

specifies the aggregator that combines the scores of a rule at each occurrence in the data set into the final score of that rule. If the WEIGHT= variable is not specified, then the aggregator specification is ignored.

The available aggregation methods are as follows:

MAX	maximum value
MEAN	arithmetic mean
MIN	minimum value
SUM	sum of the nonmissing values

Alias	AGG=
Default	SUM
Interaction	You must specify the WEIGHT= option to use this option.

CONFIDENCE(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum confidence values that the association rules must satisfy. The default value for LOWER= is 0.5.

If you specify UPPER= < LOWER=, the server swaps the values. If you specify the same value for LOWER= and UPPER=, the server adds ε (0.1110223024625157e-12) to the value and uses the result for UPPER=.

Range	0 to 1

FREQ=variable-name

specifies the numeric frequency variable to use, along with the WEIGHT= suboption, for computing the score of each association rule. When a FREQ= variable is not specified, the score of an association rule equals the value of the WEIGHT= variable scaled by 1. Negative values for the specified variable are considered missing.

ITEMAGG=aggregation-method

specifies the aggregator that rolls up the values of the WEIGHT= variable, and optionally the FREQ= variable, into the score of a rule at each occurrence in the data set. If you do not specify a WEIGHT= variable, then the aggregator specification is ignored. The aggregation methods are identical to the list in the AGGREGATE= option.

NUMLHS(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum number of items to allow in the left-hand side (LHS) of a rule. If you specify UPPER= < LOWER=, the server swaps the values.

NUMRHS(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum number of items to allow in the right-hand side (RHS) of a rule. If you specify UPPER= < LOWER=, the server swaps the values.

SCORE(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum scores of the association rules that are derived. If you specify UPPER= < LOWER=, the server swaps the values. If you specify the same value for LOWER= and UPPER=, the server adds ε (0.1110223024625157e-12) to the value and uses the result for UPPER=.

WEIGHT=variable-name

specifies the numeric weight variable to use, along with the FREQ= variable, for computing the score of each association rule. If you do not specify a WEIGHT= variable, then the AGGREGATE=, FREQ=, and ITEMAGG= options are ignored.
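A hedged sketch that combines several RULES suboptions. It assumes the example.assocs table that is used elsewhere in this section, and the specific bounds are illustrative values only. The RULESTBL option and the &_tempARMRules_ macro variable are the ones shown in the Association Rules Table section:

```sas
proc imstat data=example.assocs;
    arm item=Product tran=Customer / maxItems=3 support(lower=100)
        rules(confidence(lower=0.6 upper=0.9)  /* bounds on rule confidence        */
              numlhs(lower=1 upper=2)          /* bounds on the number of LHS items */
              numrhs(lower=1 upper=2))         /* bounds on the number of RHS items */
        rulesTbl;                              /* save the rules in a temporary table */
run;
    table example.&_tempARMRules_;
    fetch / to=10;
run;
```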

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SCORE(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum scores of the frequent itemsets that are derived. If you specify UPPER= < LOWER=, the server swaps the values. If you specify the same value for LOWER= and UPPER=, the server adds ε (0.1110223024625157e-12) to the value and uses the result for UPPER=.

SEQUENCES(TIME=t <sub-options>)

specifies the requirements for how sequences are generated from the original table. The sequences do not necessarily depend on previously generated frequent itemsets. You can specify the following suboptions in the SEQUENCES option:

ADJACENT

specifies that any two events of a sequence must be adjacent to each other in time in a transaction.

For example, in the following table, the transaction supports only two sequences with a chain length of 3. The first is A ==> B ==> C and the second is B ==> C ==> D. The transaction does not support the sequence A ==> B ==> D because events B and D do not happen consecutively in this transaction. By default, the ADJACENT option is not enabled, so the transaction would support the third sequence, A ==> B ==> D, when the chain length is 3.

Transaction	Item	Time
0	A	0
0	B	1
0	C	2
0	D	3
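A sketch of how the transaction above could be loaded and mined with the ADJACENT suboption. The table name example.seqdemo is hypothetical, and the example libref is assumed to point to the server, as in the other examples in this section:

```sas
/* Load the single transaction from the table above (hypothetical table name) */
data example.seqdemo;
    input Transaction Item $ Time;
    datalines;
0 A 0
0 B 1
0 C 2
0 D 3
;
run;

proc imstat data=example.seqdemo;
    /* ADJACENT: events in a sequence must be consecutive in time, */
    /* so A, B, D is not supported by this transaction.            */
    arm item=Item tran=Transaction /
        sequences(time=Time adjacent minItems=3 maxItems=3) sequstbl;
run;
    table example.&_tempARMSequs_;
    fetch / to=10;
run;
```

Removing the adjacent keyword from the SEQUENCES option restores the default behavior, in which nonconsecutive events can also form a sequence.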

AGGREGATE=aggregation-method

specifies the aggregator that combines the scores of a sequence at each occurrence in the data set into the final score of that sequence. If the WEIGHT= variable is not specified, then the aggregator specification is ignored.

The available aggregation methods are as follows:

MAX	maximum value
MEAN	arithmetic mean
MIN	minimum value
SUM	sum of the nonmissing values

Alias	AGG=
Default	SUM
Interaction	You must specify the WEIGHT= option to use this option.

FREQ=variable-name

specifies the numeric frequency variable to use, along with the WEIGHT= suboption, for computing the score of each sequence. When a FREQ= variable is not specified, the score of a sequence equals the value of the WEIGHT= variable scaled by 1. Negative values for the specified variable are considered missing.

INCLUDEMISSTIME

indicates that records with a missing value for the TIME= variable are considered for sequence analysis. If this option is specified, then the missing value for the TIME= variable is treated as the smallest value in a sequence.

ITEMAGG=aggregation-method

specifies the aggregator that rolls up the values of the WEIGHT= variable, and optionally the FREQ= variable, into the score of a sequence at each occurrence in the data set. If the WEIGHT= variable is not specified, then the aggregator specification is ignored. The aggregation methods are identical to the list in the AGGREGATE= option.

ITEMSETFILTER=SINGLETONS | ALLITEMS | NONE

specifies how the sequences are filtered by frequent itemsets. The SINGLETONS setting means that each item of any sequence has to be a frequent singleton. The ALLITEMS setting means that the set of all distinct items of any sequence has to be a frequent itemset. The NONE setting indicates that sequences are not influenced by what frequent itemsets are derived.

The following IMSTAT statements specify that all frequent itemsets must have a support greater than or equal to 150. The support range specified for sequences is LOWER=110 and UPPER=140. The support boundaries on frequent itemsets and sequences are disjoint. However, with FILTER=NONE, the sequences that are generated do not depend on the frequent itemsets.

proc imstat data=example.assocs;
    arm item=Product tran=Customer / maxItems=6 support(lower=150) itemsTbl
        sequs(time=Time minItems=4 maxItems=6 minWindow=0
              support(lower=110 upper=140) filter=none);
run;

Alias	FILTER=
Default	SINGLETONS

LASRRULE=table-name

specifies an in-memory table that contains trained association rules. The rules are used to score the current active transaction table.

LASRSEQU=table-name

specifies an in-memory table that contains trained sequences. The sequences are used to score the current active transaction table.

MAXITEMS=n

specifies the maximal number of items to allow in any sequence. The value must be greater than or equal to 1. If an invalid value is specified, then it is replaced with 1, the default value.

MAXDURATION=t

specifies the maximum duration to allow between the onset time of the first item and the time of the last item in a sequence. If the difference is greater than t, then the sequence is excluded from the result set. The value must be greater than or equal to zero.

MAXWINDOW=t

specifies the maximum difference to allow between the onset of any two adjacent items in a sequence. If the difference is greater than t, then the two items cannot be part of the same sequence. The value must be greater than or equal to zero.

MINDURATION=t

specifies the minimum difference to allow between the onset time of the last item and the first item in a sequence. If the difference is less than t, then the sequence is excluded from the result set. The value must be greater than or equal to zero. If you specify a value for MAXDURATION= that is less than MINDURATION=, the server swaps the values.

MINITEMS=n

specifies the minimal number of items to allow in a sequence. The value must be greater than or equal to 1. If an invalid value is specified, then it is replaced with 1, the default value.

If you specify MAXITEMS= < MINITEMS=, the server swaps the values. If you specify MAXITEMS = MINITEMS, the server assigns MAXITEMS = MINITEMS + 1.

Default	1

MINWINDOW=t

specifies the minimum difference to allow between the onset of any two adjacent items in a sequence. If the difference is less than or equal to t, then the two items are treated as happening at the same time. The value must be greater than or equal to zero.

If you specify MAXWINDOW= < MINWINDOW=, the server swaps the values.

NODUP

specifies that duplicated items within a sequence are not allowed.

NOMERGE

specifies that a transaction supports only one sequence with the same number of events in that transaction. In the transaction table that is shown in the ADJACENT option, the transaction supports only one sequence, A ==> B ==> C ==> D. By default, the NOMERGE option is not enabled.

Interaction	Specifying this option implies the ADJACENT option.

SCORE(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum scores of the sequences that are derived. If you specify UPPER= < LOWER=, the server swaps the values. If you specify the same value for LOWER= and UPPER=, the server adds ε (0.1110223024625157e-12) to the value and uses the result for UPPER=.

SUPPORT(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum support that a sequence must have to be included in the analysis. By default, LOWER=1 and UPPER= is not set. Valid values for LOWER= and UPPER= are integers greater than 0. If you specify an invalid value for LOWER= or UPPER=, the server sets LOWER=1. The value for LOWER= must be less than or equal to the UPPER= value. If you specify UPPER= < LOWER=, the server swaps the values.

Default	LOWER=1
Note	This option does not overwrite the SUPPORT option that is specified for deriving frequent itemsets.

TIME=t

specifies the numeric variable to use for sorting the items in a sequence. This option is required for sequence analysis.

TIMEAGG=aggregation-method

specifies how to aggregate the time values when two adjacent events are the same in a sequence.

The available aggregation methods are as follows:

MAX	maximum value
MEAN	arithmetic mean
MIN	minimum value

For example, see the values in the following table:

Transaction	Item	Time
0	A	0
0	B	1
0	B	2
0	D	3

The aggregated timestamps for the events in the sequence with length three (A ==> B ==> D) are as follows:

TIMEAGG= Value	Item B	Item D
MAX	2	3
MIN	1	3
MEAN	1.5	3

If you also specify MINWINDOW=1, then the sequences will be different from the sequences shown in the previous table.

WEIGHT=variable-name

specifies the numeric weight variable to use, along with the FREQ= variable, for computing the score of each sequence. If you do not specify a WEIGHT= variable, then the AGGREGATE=, FREQ=, and ITEMAGG= options are ignored.

WINDOWAGG=aggregation-method

is used with the MINWINDOW= and MAXWINDOW= options. It specifies the aggregator that is applied to the values of the TIME= variable to update the anchor time. The default value is MEAN.

The available aggregation methods are as follows:

MAX	maximum value
MEAN	arithmetic mean
MIN	minimum value

For example, see the values in the following table:

Transaction	Item	Time
0	A	0
0	B	1
0	C	2
0	D	3

If MINWINDOW=0, then the sequence A ==> B ==> C ==> D is formed with a chain length of four because the time difference between any two adjacent items is greater than 0.

If MINWINDOW=1 and WINDOWAGG=MIN, then the following sequence is formed with a chain length of two.

If MINWINDOW=1 and WINDOWAGG=MAX, then the following sequence is formed, with a chain length of one because after A and B are merged as concurrent items, the WINDOWAGG=MEAN setting defines the time value for A & B to be 0.5. Item C is then merged with A & B with a new merged time value of 1.0. Because the time value for item D is 3, it is not merged with A & B & C.

If MAXWINDOW=1, then no two items in the transaction can form a sequence.

SEQUSTBL

specifies to save the derived sequences from frequent itemsets to a temporary table. By default, the ARM statement does not save sequences.

Example	Sequences Table

SUPPORT(<LOWER=lower-value> <UPPER=upper-value>)

specifies the minimum and maximum frequencies to allow for derived frequent itemsets. If RELSUPPORT is not specified, then LOWER= and UPPER= are the minimum and maximum frequencies of frequent items that appeared in the transactions. If RELSUPPORT is specified, then specify the two values as the ratios for the minimum and maximum frequencies of frequent itemsets to the frequency of the most frequent itemset.

By default, LOWER=1 when RELSUPPORT is not specified and LOWER=0.05 when RELSUPPORT is specified.

The values for LOWER= and UPPER= must be greater than or equal to zero. If you specify UPPER= < LOWER=, then the server swaps the values.

For example, the following statements derive and display frequent itemsets of sizes between MINITEMS=3 and MAXITEMS=4. The support for each frequent itemset must be in the interval [97, 100): LOWER=97 is inclusive and UPPER=100 is exclusive.

proc imstat data=example.assocs;
    arm item=Product tran=Customer / minItems=3 maxItems=4
        support(lower=97 upper=100) itemsTbl;
run;
    table example.&_tempARMItems_;
    fetch / orderby=(_SetSize_ _Count_) desc=_Count_;
run;

The ARM statement produces the following display, indicating that 14 frequent itemsets were derived.

The FETCH statement displays the 14 frequent itemsets.

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias	TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias	TN=

TRANFMT=("format-specification")

specifies the formats for the TRAN= variable. If you do not specify the TRANFMT= option, then the unformatted values of the TRAN= variable are used.

Enclose the format specification in quotation marks.

WEIGHT=variable

specifies the numeric weight variable to use, along with the FREQ= variable, for computing the score of each frequent itemset. If you do not specify a WEIGHT= variable, then the AGGREGATE=, FREQ=, and ITEMAGG= options are ignored.

Details

Overview

Frequent itemsets are the prerequisite information for mining association rules. These are widely used in market basket analysis, web usage mining, and bioinformatics. Association rules are popular for discovering relations among different values of a variable. Sequence mining aims to discover causal relationships among items in transactions, for example, in customer purchasing habits or anti-money laundering.

By specifying ITEM=, TRAN=, and optionally TIME=, you direct the server to derive the frequent itemsets, the association rules, the sequences, or any combination of them. If TIME= is not specified, the server does not generate sequence results. The frequent itemsets, the association rules, and the sequences are stored in separate temporary tables in the server.

Frequent Itemsets Table

The frequent itemsets table is generated when you specify the ITEMSTBL option and it is accessed with the &_tempARMItems_ macro variable. See the following example:

data example.aggdata;
    input customer product $ time price amount product_id;
    datalines;
1 e 0    2.49 2 1
1 t 1 2999.00 1 2
1 e 2    2.49 2 1
1 t 3  499.00 1 2
1 e 4    3.49 3 1
1 t 5  199.00 1 2
2 t 0  199.00 1 2
2 e 1    3.49 2 1
2 h 2   50.00 1 3
2 e 3    3.49 1 1
2 t 4  499.00 1 2
2 e 5    3.49 1 1
;
run;

proc imstat data=example.aggdata;
    arm item=product tran=customer / maxitems=3 freq=amount
        weight=price itemagg=SUM agg=MIN itemstbl;
run;
    table example.&_tempARMItems_;
    fetch / orderby=(_SetSize_);
run;

The preceding statements generate the following output for the sample data:

The columns in the frequent itemsets table are as follows:

_SetSize_

Shows the number of items in the frequent itemset.

_Count_

Shows the frequency for the frequent itemset in all the transactions.

_Support_

Shows the ratio of the _Count_ value to the number of transactions.

_Score_

Shows the aggregated values of the FREQ= and WEIGHT= values, when they are specified.

Consider the frequent itemset for product t (row 3 in the preceding table). It appears 3 times for customer 1 and 2 times for customer 2. First, the server performs item aggregation with each customer. Then, the server performs second stage aggregation to obtain the final score of a frequent itemset. In this case, the intermediate scores of itemset t are (1*2999.00 + 1*499.00 + 1*199.00) = 3697.00 and (1*199.00 + 1*499.00) = 698.00. The final score of this itemset is MIN(3697.00, 698.00) = 698.00.

PRODUCTn

Shows the name of an item in the frequent itemset. The column name is variable. The name is based on the column that is specified in the ITEM= option.
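The score computation for itemset t that is described under _Score_ can be reproduced outside the server as a cross-check. The following sketch uses PROC SQL on a local copy of the five rows for product t from the sample data. The WORK table names are hypothetical, and the code illustrates the arithmetic only, not how the server computes the score:

```sas
/* Local copy of the product t rows from the sample data (hypothetical table) */
data work.t_rows;
    input customer product $ time price amount;
    datalines;
1 t 1 2999.00 1
1 t 3  499.00 1
1 t 5  199.00 1
2 t 0  199.00 1
2 t 4  499.00 1
;
run;

proc sql;
    /* Stage 1 (ITEMAGG=SUM): sum of frequency * weight within each transaction */
    create table work.item_stage as
        select customer, sum(amount * price) as item_score
        from work.t_rows
        group by customer;

    /* Stage 2 (AGG=MIN): aggregate the per-transaction scores */
    select min(item_score) as final_score
        from work.item_stage;
quit;
```

With the sample data, the first query yields item scores of 3697.00 and 698.00, and the second query yields 698.00, which matches the values given in the description of _Score_.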

Association Rules Table

The association rules table is generated when you specify the RULES and RULESTBL options. It is accessed with the &_tempARMRules_ macro variable. For example, the following statements derive association rules from frequent itemsets of sizes between MINITEMS=3 and MAXITEMS=4. The support range of each frequent itemset is set at LOWER=125 and UPPER=130. The minimum confidence value permitted is 0.8. Each association rule's score has to be greater than or equal to 2.

proc imstat data=example.assocs;
    arm item=Product tran=Customer / minItems=3 maxItems=4 itemsTbl
        support(LOWER=125 UPPER=130) weight=TIME 
        rules(confidence(LOWER=0.8) score(LOWER=1) weight=TIME) rulesTbl; 
run;

    table example.&_tempARMRules_;
    fetch _SetSize_ -- _Rule_ / to=10;
run;

Note: The FETCH statement in the preceding example does not include the values from the ITEM= column in the display.

The preceding statements generate the following output for the Assocs data:

The columns in the association rules table are as follows:

_SetSize_

Shows the number of items in the frequent itemset.

_SetCount_

Shows the frequency of the frequent itemset that contains the rule in all the transactions.

_SetSupport_

Shows the ratio of the _SetCount_ value to the total number of transactions.

_SetScore_

Shows the aggregated WEIGHT= values of the frequent itemset, when they are specified. The AGGREGATE= and ITEMAGG= options default to SUM.

_Score_

Shows the aggregated values of the WEIGHT= value in the association rule suboptions. The AGGREGATOR=MAX and ITEMAGG= option defaults to SUM.

_Confidence_

Shows the confidence for the association rule.

_ExpConf_

Shows the expected confidence for the association rule.

_Lift_

Shows the lift for the association rule.

_NumLHS_

Shows the number of items in the left-hand side of a rule.

_NumRHS_

Shows the number of items in the right-hand side of a rule.

_Rule_

Shows the full string of the rule.
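For reference, the conventional definitions of the confidence, expected confidence, and lift of a rule with left-hand side A and right-hand side B can be written as follows. This is a sketch that assumes the server uses the standard definitions. Here N(·) denotes the number of transactions that contain the given items and N_trans denotes the total number of transactions:

```latex
\text{Confidence}(A \Rightarrow B) = \frac{N(A \cup B)}{N(A)}, \qquad
\text{ExpConf}(A \Rightarrow B) = \frac{N(B)}{N_{\text{trans}}}, \qquad
\text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{ExpConf}(A \Rightarrow B)}
```

Under these definitions, a lift greater than 1 indicates that B occurs together with A more often than it would if the two were independent.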

Sequences Table

The sequences table is generated when you specify the SEQUENCES and SEQUSTBL options. It is accessed with the &_tempARMSequs_ macro variable. For example, the following statements derive sequences with MINITEMS=3 and MAXITEMS=3. The support range of each sequence is set at LOWER=110 and UPPER=120.

proc imstat data=example.assocs;
    arm item=Product tran=Customer / maxItems=3
    sequences(time=time minItems=3 maxItems=3 support(lower=110 upper=120)) 
        sequstbl;
run;
  table example.&_tempARMSequs_;
  fetch / to=10;
run;

The preceding statements generate the following output for the Assocs data:

The columns in the sequences table are as follows:

_ChainLength_

Shows the number of items in the sequence.

_Count_

Shows the frequency of transactions that contain the sequence.

_Support_

Shows the ratio of the _Count_ value to the total number of transactions.

_Probability_

Is defined as probability equation

where N() is the count function.

_LiftProduct_

Is defined as lift product equation

where Ntrans is the number of transactions.

_Separatorn_

Shows the relationship of the items to the left and right. "==>" indicates that the item on the left occurs before the item on the right. "&" indicates that the two items are considered to happen at the same time.

ODS Table Names

The ARM statement generates the following ODS table.

ODS Table Name	Description	Option
ARMSummary	Association rule mining summary	Default

For information about using the ODS table with the SAVE= option, see the Details section of the STORE statement.