IMSTAT Procedure (Analytics)

AGGREGATE Statement

The AGGREGATE statement aggregates the values of one or more variables according to variable-specific options. The statement supports both numeric and class variables.

Example:

Aggregating Time Series Data

Syntax

Required Argument

Optional Argument

AGGREGATE Statement Options

Details

ODS Table Names

Syntax

AGGREGATE variable-name <(variable-options)> / < options>;

Required Argument

variable-name

specifies the variable to aggregate.

Optional Argument

variable-options

specifies the options to apply to the variable-name that precedes it. The following options are available.

AGGREGATOR=(aggregate-method)

specifies the aggregate method to apply to the specified variable.

Aggregate Method	Description
N	Number of nonmissing observations.
NMISS	Number of missing values.
NDISTINCT	Number of distinct values.
SUM, TOTAL	Sum of nonmissing values.
MEAN, AVERAGE, AVG	Arithmetic mean.
STD, STDDEV	Standard deviation.
STDERR	Standard error.
VAR	Sample variance.
USS	Uncorrected sum of squares.
CSS	Corrected sum of squares.
Q1	25th percentile.
Q2, MEDIAN, MED	50th percentile.
Q3	75th percentile.
TSTAT	t-statistic for H:mean=0.
PROBT	P-value for t-statistic.
MIN, MINIMUM	Minimum value.
MAX, MAXIMUM	Maximum value.
CV	Coefficient of variation.
LAST	Value at the last time.
LASTNOTEMPTY, LNE	Non-empty (nonmissing) value at the last time.
FIRST	Value at the first time.
FIRSTNOTEMPTY, FNE	Non-empty (nonmissing) value at the first time.
PERCENTILE	Specified list of percentiles.
SKEWNESS, SKEW	Sample skewness.
KURTOSIS, KURT	Sample kurtosis.
MODE	Most frequent value.
PGT	Percentage of cases greater than the specified value.
PLT	Percentage of cases less than the specified value.
PIN	Percentage of cases in the specified range.

Aliases	AGG=
Aliases	ACCUMULATE=
Default	SUM for numeric variables and N for character variables.
Interaction	If you specify PGT or PLT as an aggregate method, then you are required to specify the N= or CHARN= option.

CHARN="value"

specifies the character value for AGGREGATE= methods PGT or PLT. If the associated variable is numeric and the FORMAT= option is not specified, then the CHARN= option is ignored. The value is case-sensitive.

In the following code example, the AGGREGATE statement computes the percentage of cars with a value for the Make variable that sorts alphabetically before CHARN='Honda' and then groups the results by the variable Origin.

Example

data example.cars;
    set sashelp.cars;
run

proc imstat data=example.cars;
    aggregate make (agg=plt charn="Honda") / groupby=(origin);
run;

CHARRANGE<=>("lower-value", "upper-value")

specifies the inclusive character range of values to be considered in the aggregation. Enclose the values in quotation marks. If the associated variable is numeric and the FORMAT= option is not specified, then the CHARRANGE= option is ignored. The values are case-sensitive.

Example

data example.cars;
    set sashelp.cars;
run;

proc imstat data=example.cars;
    aggregate make (agg=n charrange=("B", "V") format="$upcase1.") 
                    / groupby=(origin type);

FORMAT="format-specification"

specifies the format to apply to the aggregated variable. If a format is not specified, then the unformatted values of the variable are used. Enclose the format specification in quotation marks.

Example

data example.letters;
    do id = 1 to 26;
        letter = substr("abcdefghijklmnopqrstuvwxyz", id, 1);  
        output;  
    end; 
run;

proc imstat data=example.letters;
    aggregate letter (agg=plt charn="E" format="$UPCASE1.");
quit;

N=numeric-value

specifies the numeric value for AGGREGATE= methods PGT or PLT. The following example calculates the percentage of cars in the Sashelp.Cars data set that have less than six cylinders.

Example

data example.cars;
    set sashelp.cars;
run;

proc imstat data=example.cars;
    aggregate cylinders (agg=plt n=6);
quit;

PERCENTILE=(percentiles)

specifies one or more percentiles. For example, the following statements aggregate the 25th, 15th, and 40th percentiles of the Msrp variable and groups the results by Make.

Example

data example.cars;
    set sashelp.cars;
run;

proc imstat data=example.cars;
    aggregate msrp (agg=percentile percentile=(25 15 40)) 
                   / groupby=(make);
quit;

RANGE<=>(lower-value, upper-value)

specifies the inclusive numeric range of values to be considered in the aggregation. For example, the following statements calculate the percentage of cars with an Mpg_City value within the range of 12 and 22 and groups the results by the Make variable.

Example

data example.cars;
    set sashelp.cars;
run;

proc imstat data=example.cars;
    aggregate mpg_city (agg=pin range=(12, 22)) / groupby=(make);
quit;

AGGREGATE Statement Options

ALIGN= <BEGINNING | MIDDLE | ENDING>

specifies the alignment of the representative value with respect to an interval or bin. The following example specifies a bin of width 5 years and the representing value of a bin is the bin’s beginning value, by default. Given the bins [1985, 1990], [1990, 1995], …, [2000, 2005], and ALIGN=ENDING, the representing values of these bins are 1990, 1995,..., 2005.

Alias

The values for the ALIGN= option have aliases of B, BEG, M, MID, and E, END, respectively.

Default

BEGINNING

Example

data example.stocks(partition=(stock));
   set sashelp.stocks;
run;

proc imstat data=example.stocks;	  
    aggregate close (agg=mean) close (agg=std) 
              / id=date idfmt="year."  
              bin=('01jan1985'd, '01jan1990'd)
              align=m  
              partition="IBM";
run;
    table example.&_templast_;
    fetch / format orderby=(date) to=20;
quit;

BIN=(lower-number, upper-number)

specifies the minimum and maximum values of one bin. The boundaries for additional bins are extrapolated from the range in the bin that you specify. For example, if the values of the ID= variable range from 0 to 100, and you specify BIN=(5, 15), then the server constructs 11 bins. The first bin is [–5, 5], the second bin is (5, 15], and the last bin is (95, 105]. The values on the upper boundary of a bin belong to this bin. The values on the lower boundary of a bin belong to the adjacent lower bin.

Interaction	The BIN= option has no effect unless the ID= option is specified.
Tip	If you are working with time series data, the INTERVAL= option is an alternative and offers a convenient syntax for binning time series values.

EDGEID=variable-name

specifies the variable to use for identifying the order of the analysis variable. It is required when you specify one of the following aggregation methods:

FIRST
FIRSTNOTEMPTY, FNE
LAST
LASTNOTEMPTY, LNE

Example

data example.stocks;
    set sashelp.stocks;
 run;

proc imstat data=example.stocks;
    where stock="IBM" and date >= '01jan2003'd;

    aggregate close (agg=fne) close (agg=lne) close (agg=mean) 
              / id=date idfmt="yyq6."
                edgeid=date interval="quarter";
quit;

GROUPBY=(variable-list)

specifies a list of variable names, or a single variable name, to use as GROUPBY variables in the order of the grouping hierarchy. If you do not specify any GROUPBY variable names, then the calculation is performed across the entire table—possibly subject to a WHERE clause.

GROUPBYLIMIT=n

specifies the maximum number of levels in a GROUPBY set. When the software determines that there are at least n levels in the GROUPBY set, it abandons the action, returns a message, and does not produce a result set. You can specify the GROUPBYLIMIT= option if you want to avoid creating excessively large result sets in GROUPBY operations.

GROUPFILTER=(filter-options)

specifies a section of the group-by hierarchy to be included in the computation. With this option, you can request that the server performs the analysis for only a subset of all possible groupings. The subset is determined by applying the group filter to a temporary table that you generate with the GROUPBY statement.

You can specify the following suboptions in the GROUPFILTER option:

DESCENDING

specifies the top section or the bottom section of the groupings to be collected. If the DESCENDING option is specified, the top LIMIT=n (where n > 0) groupings are collected. Otherwise, the bottom LIMIT=n groupings are collected.

Alias

DESC

LIMIT=n

specifies the maximum number of distinct groupings to be collected, where integer n >= 0. If n is zero, then all distinct groupings (up to 2³¹–1) that satisfy the boundary constraints, such as LOWERSCORE=f, are collected.

CAUTION:

High Cardinality Data Sets

Setting n to zero with high-cardinality data sets can significantly delay the response of the server.

SCOREGT=f

specifies the exclusive lower bound for the numeric scores of the distinct groupings to collect.

Alias

SGT=

SCORELT=f

specifies the exclusive upper bound for the numeric scores of the distinct groupings to collect.

Alias

SLT=

VALUEGT=("format-name1" <, "format-name2" ...>)

specifies the exclusive lower bound of the group-by variable’s formatted values for the distinct groupings to collect.

Alias

VGT=

VALUELT=("format-name1" <, "format-name2" ...>)

specifies the exclusive upper bound of the group-by variable’s formatted values for the distinct groupings to collect.

Alias

VLT=

TABLE=table-with-groupby-results

specifies the in-memory table from which to load the group-by hierarchy. If the TABLE= option is not specified, then all other GROUPFILTER= options are ignored.

The following program request all the groupings of State, City, and then Trade_In_Model in the Cars_Program_All table. The groupings are ordered by the maximum value of New_Vehicle_Msrp for each grouping:

proc imstat;
    table example.cars_program_all;
    groupby state city trade_in_model / temptable 
                 weight=new_vehicle_msrp 
                 agg   =(max) 
                 order =weight;  
run;

The TEMPTABLE option in the GROUPBY statement directs the server to save all the groupings in a temporary in-memory table. The following DISTINCT statement requests the count of the distinct unformatted values of Sales_Type for each of the selected groupings of State, City, and Trade_In_Model.

    table example.cars_program_all;
    distinct sales_type / groupfilter=(
                 table  =mylasr.&_TEMPLAST_
                 scoregt=40000
                 valuelt=("FL","Ft Myers","")
                 limit  =20
                 descending);
run;

This example considers only groupings that have maximum values of the New_Vehicle_Msrp above 40,000 and with formatted values that are less than State="FL" and City="Ft Myers." The empty quotation marks result in no restriction on Trade_In_Model values. These groupings are ordered according to the maximum values of New_Vehicle_Msrp. Because of the DESCENDING option, this example collects the 20 top groupings within the specified group-by range for the DISTINCT analysis.

Interaction

If you specify the GROUPFILTER= option, then the GROUPBY= and FORMATS= options have no effect.

ID=variable-name

specifies a numeric variable that identifies the time associated with each observation in the input table. The values of the ID= variable are typically SAS date, time, or datetime values, but that is not a requirement.

IDFORMAT=("format-specification")

specifies the format for the ID= variable. If you do not specify this option, the default format for the variable is applied. If the ID= option is not specified, this option is ignored.

Alias

IDFMT=

IDEND=numeric-value

specifies the end value of the ID= variable to be included in the analysis. If the last ID= variable value is less than the specified IDEND= value, then the series is extended with missing values. If the last ID= variable value is greater than the specified IDEND= value, then the series is truncated.

The IDEND= value can be a date ('13SEP1998'd), a time ('12:34:56't), a datetime ('01MAY88:12:34:56'dt) or a numeric value. If the ID= option is not specified, then this option is ignored.

IDSTART=numeric-value

specifies the beginning value of the ID= variable to be included in the analysis. If the first ID= variable value is greater than the specified IDSTART= value, then the series is prefixed with missing values. If the first ID= variable value is less than the specified IDSTART= value, then the series is truncated. If the ID= option is not specified, then this option is ignored.

INTERVAL="interval"

specifies the time period for the accumulation of observations. For example, if you specify INTERVAL='MONTH', then the AGGREGATE statement summarizes the observations in monthly intervals. If the ID= option is not specified, then this option is ignored.

For information about specifying interval values, see About Date and Time Intervals in SAS Language Reference: Concepts.

JUMPINGWINDOW

specifies that during aggregation, a window considers data within a specified multiple of intervals. A jumping window resets the aggregation process when the specified range of time expires. If you do not specify the JUMPINGWINDOW option, a window always retains the same multiple of intervals.

KEEP=(variable-name1 <variable-name2 ...>)

specifies one or more variables to transfer from the active table to the ODS table output or temporary table. When multiple input observations contribute to an output observation (this is the most common situation), the minimum value is used.

Alias	TABLEVARS=
Interaction	This option is ignored unless a PARTITION, GROUPBY=, or ID= option is also specified.

KEEPRECORD

specifies to output an aggregated value for each input observation by aggregating input observations whose ID= values are specified by INTERVAL= and WINDOWINT= options. Be aware that using this option can increase the volume of the result set.

The following code sums the deposit amount and counts the number of deposits, grouped by member ID and date. The aggregation at each output record considers all observations with a date value that is within one year from each record's own date value.

Example

libname example sasiola host="grid001.example.com" port=10010 tag=hps;

data example.retaindata;
  input transaction dt memberid deposit;
  informat dt date9.;
  format   dt date9.;
  datalines;
    1  	15Apr2014	  2	5000
    2	  01May2014	  2	5000
    3 	01May2014 	3 8000
    4 	02May2014 	2	5000
    5 	03May2014	  1	4000
    6 	03May2014 	3	3000
    7 	04May2014 	2	4000
    8 	04May2014 	3	2000
    9 	05May2014 	1	3000
    10 	06May2014 	1	3000
    11 	07May2014 	1	2500
    12 	08May2014 	2	1000
    13 	09May2014	  2	2000
    14 	10May2014 	2	2000
    15 	10May2014 	4	12000
run;

proc imstat data=example.retaindata; 
  aggregate 
     deposit(agg=sum) 
     deposit(agg=n) 
     / 
     groupby=memberid id=dt idfmt='DATE9.' 
     interval='day' windowInt='year' 
     noemptyinterval keeprecord; 
run;

NOEMPTYINTERVAL

specifies that intervals that no ID= variable value belongs to are omitted from the output. By default, the empty intervals contain missing values.

NOMISSING

specifies that you do not want to include missing values in the determination of group-by values. This option also applies to analysis variables when you specify AGGREGATOR=NDISTINCT.

Alias

NOMISS

OFFSET=

specifies the time series shift in order to match up with an existing time series. It can be used to match up with existing time series from previous year, for example.

Alias

DIF=

PARTIALTOINTERVAL=numeric-value

PARTIALTODATE=numeric-value

specifies the time value when the aggregation within an interval or a bin is terminated. For example, if you specify INTERVAL='MONTH' and PARTIALTOINTERVAL='10FEB98'd, then the action aggregates records from the first 10 days of each month only.

Aliases	PTD=
Aliases	PTI=
Example	aggregate var1 (agg=n) / id=date interval='month' windowint='year' partialtointerval='10feb1998'd jumpingwindow;

PARTIALTOWINDOW=numeric-value

specifies the time value when the aggregation within a window interval or a window bin is terminated. For example, if you specify WINDOWINT='YEAR', PARTIALTOWINDOW='10FEB98'd, and JUMPINGWINDOW, then the action aggregates records from the first 41 days of each year for each interval, except the interval contains all 31 days from January of each year. The PARTIALTOWINDOW= and PARTIALTOINTERVAL= options can be used together.

In addition to the above specification, if you specify INTERVAL='MONTH', PARTIALTOINTERVAL='08FEB98'd, then the action counts only the first 8 days from JANUARY and the first 8 days from FEBURARY when it aggregates on intervals FEBURARY, … , DECEMBER for each year.

Interaction	This option is ignored unless you specify the JUMPINGWINDOW option, because the starting time of a sliding window (a non-jumping window) is varying.
Example	aggregate var1 (agg=n) / id=date interval='month' windowint='year' partialtowindow='08feb98'd jumpingwindow;

PARTITION <=partition-key>

specifies that when the active table is partitioned, then the aggregation is performed separately for each specified partition-key. If you do not specify a partition-key, the analysis is performed for all partitions. You can use the PARTITIONINFO statement to retrieve the valid partition key values for a table.

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SUBBINOFFSET=n

specifies the offset from the start of a bin. n must be positive and less or equal than the bin width. If the specified n is out of range, then this option is ignored.

Interaction

You must specify ID= and BIN= to use this option.

SUBBINWIDTH=n

specifies the width of the sub-bin within a bin. For example, if the values of the ID= variable range from 0 to 100, and you specify BIN=(5, 15), SUBBINOFFSET=2, and SUBBINWIDTH=5, then this action summarizes the observations with ID= variable values that fall into the ranges [–3, 2], (7, 12], (17, 22], …, (97, 102]. n must be positive and the sum of SUBBINOFFSET= and SUBBINWIDTH= must be less or equal to the bin width. If the specified n is out of range, then this option is ignored.

Interaction	You must specify ID= and BIN= to use this option.
Example	aggregate var1 (agg=n) / id=id bin=(5, 15) subbinoffset=2 subbinwidth=5;

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias

TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias

TN=

TEMPTABLE

generates an in-memory temporary table from the result set. The IMSTAT procedure displays the name of the table and stores it in the &_TEMPLAST_ macro variable, provided that the statement executed successfully.

When the IMSTAT procedure exits, all temporary tables created during the IMSTAT session are removed. Temporary tables are not displayed on a TABLEINFO request, unless the temporary table is the active table for the request.

Interaction

The TEMPTABLE option requires a group-by analysis or a partitioned analysis with this statement.

WINDOWBIN=(lower-number, upper-number)

specifies the minimum and maximum values of a time window for the aggregation of observations with respect to each bin. The construction of this option is similar to the BIN= option. If the width of the window is not a multiple of the width of the bin, then the action fails. If the value for lower-number is not equal to that of BIN= option, the action fails.

WINDOWINT="interval"

specifies the time window for the aggregation of observations with respect to each time interval. For example, if you specify INTERVAL='MONTH' and WINDOWINT='YEAR', then the AGGREGATE statement summarizes a year’s worth of observations before the end of each monthly interval.

WINDOWOFFSET=n

specifies the time series shift in terms of an integer multiple of the WINDOWINT= or WINDOWBIN= value. For example, if you specify WINDOWINT='YEAR' and WINDOWOFFSET=–3, then at each time interval, the aggregated records are from three years earlier.

Alias

WINDOWDIF=

Details

ODS Table Names

The AGGREGATE statement generates the following ODS table.

ODS Table Name	Description	Option
AggStatistics	Aggregation statistics for one column	Default
TempTable	Information about a temporary table	TEMPTABLE

For information about using the ODS table with SAVE= option, see the Details section of the STORE statement.