The AGGREGATE statement aggregates the values of one or more variables according to variable-specific options. The statement supports both numeric and class variables.
Example: | Aggregating Time Series Data |
specifies the options to apply to the variable-name that precedes it. The following options are available.
specifies the aggregate method to apply to the specified variable.
Aggregate Method
|
Description
|
---|---|
N
|
Number of nonmissing
observations.
|
NMISS
|
Number of missing values.
|
NDISTINCT
|
Number of distinct values.
|
SUM, TOTAL
|
Sum of nonmissing values.
|
MEAN, AVERAGE, AVG
|
Arithmetic mean.
|
STD, STDDEV
|
Standard deviation.
|
STDERR
|
Standard error.
|
VAR
|
Sample variance.
|
USS
|
Uncorrected sum of squares.
|
CSS
|
Corrected sum of squares.
|
Q1
|
25th percentile.
|
Q2, MEDIAN, MED
|
50th percentile.
|
Q3
|
75th percentile.
|
TSTAT
|
t-statistic
for H:mean=0.
|
PROBT
|
P-value
for t-statistic.
|
MIN, MINIMUM
|
Minimum value.
|
MAX, MAXIMUM
|
Maximum value.
|
CV
|
Coefficient of variation.
|
LAST
|
Value at the last time.
|
LASTNOTEMPTY, LNE
|
Non-empty (nonmissing)
value at the last time.
|
FIRST
|
Value at the first time.
|
FIRSTNOTEMPTY, FNE
|
Non-empty (nonmissing)
value at the first time.
|
PERCENTILE
|
Specified list of percentiles.
|
SKEWNESS, SKEW
|
Sample skewness.
|
KURTOSIS, KURT
|
Sample kurtosis.
|
MODE
|
Most frequent value.
|
PGT
|
Percentage of cases
greater than the specified value.
|
PLT
|
Percentage of cases
less than the specified value.
|
PIN
|
Percentage of cases
in the specified range.
|
Aliases | AGG= |
ACCUMULATE= | |
Default | SUM for numeric variables and N for character variables. |
Interaction | If you specify PGT or PLT as an aggregate method, then you are required to specify the N= or CHARN= option. |
specifies the character value for AGGREGATE= methods PGT or PLT. If the associated variable is numeric and the FORMAT= option is not specified, then the CHARN= option is ignored. The value is case-sensitive.
Example | data example.cars; set sashelp.cars; run proc imstat data=example.cars; aggregate make (agg=plt charn="Honda") / groupby=(origin); run; |
specifies the inclusive character range of values to be considered in the aggregation. Enclose the values in quotation marks. If the associated variable is numeric and the FORMAT= option is not specified, then the CHARRANGE= option is ignored. The values are case-sensitive.
Example | data example.cars; set sashelp.cars; run; proc imstat data=example.cars; aggregate make (agg=n charrange=("B", "V") format="$upcase1.") / groupby=(origin type); |
specifies the format to apply to the aggregated variable. If a format is not specified, then the unformatted values of the variable are used. Enclose the format specification in quotation marks.
Example | data example.letters; do id = 1 to 26; letter = substr("abcdefghijklmnopqrstuvwxyz", id, 1); output; end; run; proc imstat data=example.letters; aggregate letter (agg=plt charn="E" format="$UPCASE1."); quit; |
specifies the numeric value for AGGREGATE= methods PGT or PLT. The following example calculates the percentage of cars in the Sashelp.Cars data set that have less than six cylinders.
Example | data example.cars; set sashelp.cars; run; proc imstat data=example.cars; aggregate cylinders (agg=plt n=6); quit; |
specifies one or more percentiles. For example, the following statements aggregate the 25th, 15th, and 40th percentiles of the Msrp variable and groups the results by Make.
Example | data example.cars; set sashelp.cars; run; proc imstat data=example.cars; aggregate msrp (agg=percentile percentile=(25 15 40)) / groupby=(make); quit; |
specifies the inclusive numeric range of values to be considered in the aggregation. For example, the following statements calculate the percentage of cars with an Mpg_City value within the range of 12 and 22 and groups the results by the Make variable.
Example | data example.cars; set sashelp.cars; run; proc imstat data=example.cars; aggregate mpg_city (agg=pin range=(12, 22)) / groupby=(make); quit; |
specifies the alignment of the representative value with respect to an interval or bin. The following example specifies a bin of width 5 years and the representing value of a bin is the bin’s beginning value, by default. Given the bins [1985, 1990], [1990, 1995], …, [2000, 2005], and ALIGN=ENDING, the representing values of these bins are 1990, 1995,..., 2005.
Alias | The values for the ALIGN= option have aliases of B, BEG, M, MID, and E, END, respectively. |
Default | BEGINNING |
Example | data example.stocks(partition=(stock)); set sashelp.stocks; run; proc imstat data=example.stocks; aggregate close (agg=mean) close (agg=std) / id=date idfmt="year." bin=('01jan1985'd, '01jan1990'd) align=m partition="IBM"; run; table example.&_templast_; fetch / format orderby=(date) to=20; quit; |
specifies the minimum and maximum values of one bin. The boundaries for additional bins are extrapolated from the range in the bin that you specify. For example, if the values of the ID= variable range from 0 to 100, and you specify BIN=(5, 15), then the server constructs 11 bins. The first bin is [–5, 5], the second bin is (5, 15], and the last bin is (95, 105]. The values on the upper boundary of a bin belong to this bin. The values on the lower boundary of a bin belong to the adjacent lower bin.
Interaction | The BIN= option has no effect unless the ID= option is specified. |
Tip | If you are working with time series data, the INTERVAL= option is an alternative and offers a convenient syntax for binning time series values. |
specifies the variable to use for identifying the order of the analysis variable. It is required when you specify one of the following aggregation methods:
Example | data example.stocks; set sashelp.stocks; run; proc imstat data=example.stocks; where stock="IBM" and date >= '01jan2003'd; aggregate close (agg=fne) close (agg=lne) close (agg=mean) / id=date idfmt="yyq6." edgeid=date interval="quarter"; quit; |
specifies a list of variable names, or a single variable name, to use as GROUPBY variables in the order of the grouping hierarchy. If you do not specify any GROUPBY variable names, then the calculation is performed across the entire table—possibly subject to a WHERE clause.
specifies the maximum number of levels in a GROUPBY set. When the software determines that there are at least n levels in the GROUPBY set, it abandons the action, returns a message, and does not produce a result set. You can specify the GROUPBYLIMIT= option if you want to avoid creating excessively large result sets in GROUPBY operations.
specifies a section of the group-by hierarchy to be included in the computation. With this option, you can request that the server performs the analysis for only a subset of all possible groupings. The subset is determined by applying the group filter to a temporary table that you generate with the GROUPBY statement.
specifies the top section or the bottom section of the groupings to be collected. If the DESCENDING option is specified, the top LIMIT=n (where n > 0) groupings are collected. Otherwise, the bottom LIMIT=n groupings are collected.
Alias | DESC |
specifies the maximum number of distinct groupings to be collected, where integer n >= 0. If n is zero, then all distinct groupings (up to 231–1) that satisfy the boundary constraints, such as LOWERSCORE=f, are collected.
CAUTION: |
High Cardinality
Data Sets
Setting n to
zero with high-cardinality data sets can significantly delay the response
of the server.
|
specifies the exclusive lower bound for the numeric scores of the distinct groupings to collect.
Alias | SGT= |
specifies the exclusive upper bound for the numeric scores of the distinct groupings to collect.
Alias | SLT= |
specifies the exclusive lower bound of the group-by variable’s formatted values for the distinct groupings to collect.
Alias | VGT= |
specifies the exclusive upper bound of the group-by variable’s formatted values for the distinct groupings to collect.
Alias | VLT= |
specifies the in-memory table from which to load the group-by hierarchy. If the TABLE= option is not specified, then all other GROUPFILTER= options are ignored.
proc imstat; table example.cars_program_all; groupby state city trade_in_model / temptable weight=new_vehicle_msrp agg =(max) order =weight; run;
table example.cars_program_all;
distinct sales_type / groupfilter=(
table =mylasr.&_TEMPLAST_
scoregt=40000
valuelt=("FL","Ft Myers","")
limit =20
descending);
run;
Interaction | If you specify the GROUPFILTER= option, then the GROUPBY= and FORMATS= options have no effect. |
specifies a numeric variable that identifies the time associated with each observation in the input table. The values of the ID= variable are typically SAS date, time, or datetime values, but that is not a requirement.
specifies the format for the ID= variable. If you do not specify this option, the default format for the variable is applied. If the ID= option is not specified, this option is ignored.
Alias | IDFMT= |
specifies the end value of the ID= variable to be included in the analysis. If the last ID= variable value is less than the specified IDEND= value, then the series is extended with missing values. If the last ID= variable value is greater than the specified IDEND= value, then the series is truncated.
specifies the beginning value of the ID= variable to be included in the analysis. If the first ID= variable value is greater than the specified IDSTART= value, then the series is prefixed with missing values. If the first ID= variable value is less than the specified IDSTART= value, then the series is truncated. If the ID= option is not specified, then this option is ignored.
specifies the time period for the accumulation of observations. For example, if you specify INTERVAL='MONTH', then the AGGREGATE statement summarizes the observations in monthly intervals. If the ID= option is not specified, then this option is ignored.
specifies that during aggregation, a window considers data within a specified multiple of intervals. A jumping window resets the aggregation process when the specified range of time expires. If you do not specify the JUMPINGWINDOW option, a window always retains the same multiple of intervals.
specifies one or more variables to transfer from the active table to the ODS table output or temporary table. When multiple input observations contribute to an output observation (this is the most common situation), then the minimum value is used.
Alias | TABLEVARS= |
Interaction | This option is ignored unless a PARTITION, GROUPBY=, or ID= option is also specified. |
specifies to output an aggregated value for each input observation by aggregating input observations whose ID= values are specified by INTERVAL= and WINDOWINT= options. Be aware that using this option can increase the volume of the result set.
Example | libname example sasiola host="grid001.example.com" port=10010 tag=hps; data example.retaindata; input transaction dt memberid deposit; informat dt date9.; format dt date9.; datalines; 1 15Apr2014 2 5000 2 01May2014 2 5000 3 01May2014 3 8000 4 02May2014 2 5000 5 03May2014 1 4000 6 03May2014 3 3000 7 04May2014 2 4000 8 04May2014 3 2000 9 05May2014 1 3000 10 06May2014 1 3000 11 07May2014 1 2500 12 08May2014 2 1000 13 09May2014 2 2000 14 10May2014 2 2000 15 10May2014 4 12000 run; proc imstat data=example.retaindata; aggregate deposit(agg=sum) deposit(agg=n) / groupby=memberid id=dt idfmt='DATE9.' interval='day' windowInt='year' noemptyinterval keeprecord; run; |
specifies that intervals that no ID= variable value belongs to are omitted from the output. By default, the empty intervals contain missing values.
specifies that you do not want to include missing values in the determination of group-by values. This option also applies to analysis variables when you specify AGGREGATOR=NDISTINCT.
Alias | NOMISS |
specifies the time series shift in order to match up with an existing time series. It can be used to match up with existing time series from previous year, for example.
Alias | DIF= |
specifies the time value when the aggregation within an interval or a bin is terminated. For example, if you specify INTERVAL='MONTH' and PARTIALTOINTERVAL='10FEB98'd, then the action aggregates records from the first 10 days of each month only.
Aliases | PTD= |
PTI= | |
Example | aggregate var1 (agg=n) / id=date interval='month' windowint='year' partialtointerval='10feb1998'd jumpingwindow; |
specifies the time value when the aggregation within a window interval or a window bin is terminated. For example, if you specify WINDOWINT='YEAR', PARTIALTOWINDOW='10FEB98'd, and JUMPINGWINDOW, then the action aggregates records from the first 41 days of each year for each interval, except the interval contains all 31 days from January of each year. The PARTIALTOWINDOW= and PARTIALTOINTERVAL= options can be used together.
Interaction | This option is ignored unless you specify the JUMPINGWINDOW option, because the starting time of a sliding window (a non-jumping window) is varying. |
Example | aggregate var1 (agg=n) / id=date interval='month' windowint='year' partialtowindow='08feb98'd jumpingwindow; |
specifies that when the active table is partitioned, then the aggregation is performed separately for each specified partition-key. If you do not specify a partition-key, the analysis is performed for all partitions. You can use the PARTITIONINFO statement to retrieve the valid partition key values for a table.
saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.
specifies the offset from the start of a bin. n must be positive and less or equal than the bin width. If the specified n is out of range, then this option is ignored.
Interaction | You must specify ID= and BIN= to use this option. |
specifies the width of the sub-bin within a bin. For example, if the values of the ID= variable range from 0 to 100, and you specify BIN=(5, 15), SUBBINOFFSET=2, and SUBBINWIDTH=5, then this action summarizes the observations with ID= variable values that fall into the ranges [–3, 2], (7, 12], (17, 22], …, (97, 102]. n must be positive and the sum of SUBBINOFFSET= and SUBBINWIDTH= must be less or equal to the bin width. If the specified n is out of range, then this option is ignored.
Interaction | You must specify ID= and BIN= to use this option. |
Example | aggregate var1 (agg=n) / id=id bin=(5, 15) subbinoffset=2 subbinwidth=5; |
specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.
Alias | TE= |
specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.
Alias | TN= |
generates an in-memory temporary table from the result set. The IMSTAT procedure displays the name of the table and stores it in the &_TEMPLAST_ macro variable, provided that the statement executed successfully.
specifies the minimum and maximum values of a time window for the aggregation of observations with respect to each bin. The construction of this option is similar to the BIN= option. If the width of the window is not a multiple of the width of the bin, then the action fails. If the value for lower-number is not equal to that of BIN= option, the action fails.
specifies the time window for the aggregation of observations with respect to each time interval. For example, if you specify INTERVAL='MONTH' and WINDOWINT='YEAR', then the AGGREGATE statement summarizes a year’s worth of observations before the end of each monthly interval.
specifies the time series shift in terms of an integer multiple of the WINDOWINT= or WINDOWBIN= value. For example, if you specify WINDOWINT='YEAR' and WINDOWOFFSET=–3, then at each time interval, the aggregated records are from three years earlier.
Alias | WINDOWDIF= |
ODS Table Name
|
Description
|
Option
|
---|---|---|
AggStatistics
|
Aggregation statistics
for one column
|
Default
|
TempTable
|
Information about a
temporary table
|
TEMPTABLE
|