| Contents: | Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / See Also / References |
- Changed location measure for AGK from median to mean to allow weights (2Mar95)
%inc "<location of your file containing the STDIZE macro>";
Following this statement, you may call the %STDIZE macro. See the Results tab for an example.
The following arguments may be listed within parentheses in any order, separated by commas:
DATA= SAS data set to be standardized. The default is _LAST_.
Most data set options may be used.
VAR= List of numeric variables to be standardized.
The usual forms of abbreviated lists
(e.g., X1-X100, ABC--XYZ, ABC:) may be used.
Variable names should not begin with an underscore.
The default is all numeric variables not listed in the
BY=, FREQ=, or WEIGHT= lists.
FREQ= A single numeric frequency variable used as in PROC
UNIVARIATE.
WEIGHT= A single numeric weight variable used as in PROC
UNIVARIATE. Only works for MEAN, SUM, EUCLEN, STD, AGK,
and L(p).
BY= List of variables for BY groups. Abbreviated variable
lists (e.g., X1-X100, ABC--XYZ, ABC:) may NOT be used.
OUT= The output data set, which is a copy of the DATA=
data set except that the VAR= variables have been
standardized. The default is _DATA_.
Data set options may NOT be used with the OUT= data set.
METHOD= Method for computing location and scale measures:
| method | scale | location |
| ------ | ----- | -------- |
| MEAN | 1 | mean |
| MEDIAN | 1 | median |
| SUM | sum | 0 |
| EUCLEN | Euclidean length | 0 |
| USTD | standard dev. about origin | 0 |
| STD | standard deviation | mean |
| RANGE | range | minimum |
| MIDRANGE | range/2 | midrange |
| MAXABS | maximum abs value | 0 |
| IQR | interquartile range | median |
| MAD | median abs dev from median | median |
| ABW(c) | biweight A-estimate | biweight 1-step M-estimate |
| AHUBER(c) | Huber A-estimate | Huber 1-step M-estimate |
| AGK(p) | AGK estimate (ACECLUS) | mean |
| SPACING(p) | minimum spacing | mid minimum-spacing |
| L(p) | L(p) | L(p) |
| IN(ds) | read from data set | read from data set |
The default is METHOD=STD.
For METHOD=ABW(c) or METHOD=AHUBER(c), c is a positive
numeric tuning constant (Iglewicz, 1983).
For METHOD=AGK(p), p is a numeric constant giving the
proportion of pairs to be used with METHOD=COUNT in
the ACECLUS procedure (SAS Technical Report P-229).
This is the noniterative univariate form of the
estimator described by Art, Gnanadesikan, & Kettenring
(1982).
For METHOD=SPACING(p), p is a numeric constant giving
the proportion of data to be contained in the spacing.
A spacing is the absolute difference between two data
values. The minimum spacing for a proportion p is the
minimum absolute difference between two data values that
contain a proportion p of the data between them. The
mid minimum-spacing is the mean of these two data
values.
For METHOD=L(p), p is a numeric constant greater than
or equal to 1 specifying the power to which differences
are to be raised in computing an L(p) or Minkowski
metric.
For METHOD=IN(ds), ds is the name of a SAS data set
containing the location and scale measures. The names
of the variables are specified by the LOCATION= and
SCALE= arguments.
For robust estimators, see Iglewicz (1983). MAD has the
highest breakdown point (50%), but isn't very efficient.
ABW and AHUBER provide a good compromise between
breakdown and efficiency. L(p) location estimates are
increasingly robust as p drops from 2 (least squares,
i.e. mean) to 1 (least absolute value, i.e. median),
but the L(p) scale estimates are not robust.
SPACING is robust to both outliers and clustering
(Janssen, Marron, Veraverbeke, and Sarle, 1995) and is
therefore a good choice for cluster analysis or
nonparametric density estimation. The mid minimum
spacing estimates the mode for small p. AGK is also
robust to clustering and more efficient than SPACING,
but is not as robust to outliers and takes longer to
compute. If you expect g clusters, the argument to
SPACING or AGK should be 1/g or less. AGK is less biased
than SPACING in small samples. It would generally be
reasonable to use AGK for samples of size 100 or less
and SPACING for samples of size 1000 or more, with the
treatment of intermediate sample sizes depending on the
available computer resources.
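The minimum spacing and mid minimum-spacing described under METHOD=SPACING(p) can be illustrated outside SAS. The following Python sketch is not the macro's code: the function name `min_spacing` is hypothetical, and it assumes that a window "contains a proportion p of the data" when it spans ceil(p * n) consecutive sorted values. The macro's exact counting rule may differ.

```python
import math

def min_spacing(values, p):
    """Return (scale, location) = (minimum spacing, mid minimum-spacing).

    Assumption: a window holds a proportion p of the data when it covers
    k = ceil(p * n) consecutive sorted values (at least 2).
    """
    xs = sorted(values)
    n = len(xs)
    k = max(2, math.ceil(p * n))
    if k > n:
        raise ValueError("need at least two values and p <= 1")
    # Slide a window of k sorted values and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    lo, hi = xs[best], xs[best + k - 1]
    return hi - lo, (lo + hi) / 2

# A tight cluster near 2 plus a few outliers.
data = [1.0, 2.0, 2.1, 2.2, 2.3, 9.0, 15.0, 30.0]
scale, location = min_spacing(data, 0.5)
print(scale, location)
```

With p = 0.5 the narrowest window here is the cluster from 2.0 to 2.3, so the outliers at 15 and 30 do not affect either measure, which is the sense in which the text calls SPACING robust to outliers.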
FUZZ= Relative fuzz factor. Default is 1E-14.
If abs(score) < scale * fuzz then score = 0;
If abs(location) < scale * fuzz then location = 0;
If scale < abs(location) * fuzz then scale = 0;
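The three FUZZ= rules can be read as plain comparisons. The Python sketch below is only an illustration: the function name `apply_fuzz` is hypothetical, and it assumes the rules are applied in the order listed, so the third test sees a location that may already have been set to zero. The macro itself may order or combine them differently.

```python
def apply_fuzz(score, location, scale, fuzz=1e-14):
    """Apply the three relative-fuzz rules in the order they are listed."""
    if abs(score) < scale * fuzz:
        score = 0.0      # score is negligible relative to the scale
    if abs(location) < scale * fuzz:
        location = 0.0   # location is negligible relative to the scale
    if scale < abs(location) * fuzz:
        scale = 0.0      # scale is negligible relative to the location
    return score, location, scale

# A score that is only rounding error is cleaned to exactly zero.
print(apply_fuzz(3e-16, 5.0, 2.0))
# A scale that is tiny relative to the location is treated as zero.
print(apply_fuzz(1.0, 1e9, 1e-6))
```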
LOCATION= List of numeric variables containing location measures
in the data set specified by METHOD=IN(ds).
The usual forms of abbreviated lists
(e.g., X1-X100, ABC--XYZ, ABC:) may be used.
Variable names should not begin with an underscore.
SCALE= List of numeric variables containing scale measures
in the data set specified by METHOD=IN(ds).
The usual forms of abbreviated lists
(e.g., X1-X100, ABC--XYZ, ABC:) may be used.
Variable names should not begin with an underscore.
INITIAL= Method for computing initial estimates for A estimates.
The default is MAD.
VARDEF= See PROC UNIVARIATE.
PCTLDEF= See PROC UNIVARIATE.
MULT= Constant to multiply each value by after standardizing.
The default is 1.
ADD= Constant to add to each value after standardizing
and multiplying by MULT=. The default is 0.
MISSING= Method or a numeric value for replacing missing values.
Use MISSING= when you want to replace missing values by
something other than the location measure associated
with the METHOD= argument, which is what the REPLACE
option replaces them by. The usual methods include MEAN,
MEDIAN, and MIDRANGE. Any of the values for the METHOD=
argument can also be specified for MISSING=, and the
corresponding location measure will be used to replace
missing values. If a numeric value is given, it replaces
missing values after standardizing the data. However,
the REPONLY option can be used together with MISSING= to
suppress standardization in case you only want to
replace missing values.
OPTIONS= List of additional options separated by blanks:
PRINT Print the standardized variables.
PSTAT Print the location and scale measures.
NOMISS Omit observations with any missing
values among the VAR= variables from
computation of the location and scale
measures. Otherwise, all nonmissing
values are used.
NORM Normalize the scale estimator to be
consistent for the standard deviation
of a normal distribution.
ONLY works for: AGK IQR MAD SPACING
SNORM Normalize the scale estimator to have
an expectation of approximately 1 for
a standard normal distribution.
ONLY works for: AGK IQR MAD SPACING
REPLACE Replace missing data by zero in the
standardized data (which corresponds
to the location measure before
standardizing). To replace missing
data by something else, see the
MISSING= argument.
REPONLY Replace missing data by the location
measure and do _not_ standardize the
data. You may not specify both REPLACE
and REPONLY.
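result = add + multiply * (original - location) / scale
where:
result = the final standardized value
add = the constant to add (ADD= argument)
multiply = the constant to multiply by (MULT= argument)
original = the original input value
location = the location measure
scale = the scale measure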
If BY variables are specified, each BY group is standardized separately.
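As a rough illustration of the formula above, the following Python sketch computes the location and scale for a few of the simpler METHOD= rows and then applies result = add + multiply * (original - location) / scale. The names `location_scale` and `standardize` are hypothetical, the STD row is assumed to use the n-1 divisor, and weights, frequencies, missing values, and the FUZZ= rules are ignored.

```python
import math
import statistics

def location_scale(values, method):
    """Return (location, scale) for a few simple METHOD= rows."""
    lo, hi = min(values), max(values)
    if method == "MEAN":
        return statistics.mean(values), 1.0
    if method == "MEDIAN":
        return statistics.median(values), 1.0
    if method == "SUM":
        return 0.0, sum(values)
    if method == "EUCLEN":
        return 0.0, math.sqrt(sum(v * v for v in values))
    if method == "STD":
        # Assumption: sample standard deviation (n-1 divisor).
        return statistics.mean(values), statistics.stdev(values)
    if method == "RANGE":
        return lo, hi - lo
    if method == "MIDRANGE":
        return (lo + hi) / 2, (hi - lo) / 2
    if method == "MAXABS":
        return 0.0, max(abs(v) for v in values)
    if method == "MAD":
        med = statistics.median(values)
        return med, statistics.median([abs(v - med) for v in values])
    raise ValueError(f"method not covered by this sketch: {method}")

def standardize(values, method="STD", mult=1.0, add=0.0):
    """Apply add + mult * (original - location) / scale to each value."""
    location, scale = location_scale(values, method)
    return [add + mult * (v - location) / scale for v in values]

values = [2.0, 4.0, 6.0, 8.0]
print(location_scale(values, "MAD"))
print(standardize(values, "RANGE", mult=100.0))
print(standardize(values, "STD"))
```

With METHOD=RANGE and MULT=100 the smallest value maps to 0 and the largest to 100. To mimic BY-group processing, the same function would be called once per group.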
%let _notes_=1; %* Prints SAS notes for all steps;
%let _echo_=1; %* Prints the arguments to the STDIZE macro;
%let _echo_=2; %* Prints the arguments to the STDIZE macro
after most defaults have been set;
options mprint; %* Prints SAS code generated by the macro
language;
options mlogic symbolgen; %* Prints lots of macro debugging info;
This macro normally spends a lot of time checking the arguments you specify for validity, in hopes of avoiding mysterious error messages from the generated SAS code. You can reduce the amount of time spent checking arguments (and thereby speed up the macro at the risk of getting inscrutable error messages if you make a mistake) by using one of the following statements before invoking the macro:
%let _check_=1; %* reduce argument checking;
%let _check_=0; %* suppress argument checking--use at your own risk!;
Iglewicz, B. (1983), "Robust scale estimators and confidence intervals for location", in Hoaglin, D.C., Mosteller, M. and Tukey, J.W., eds., _Understanding Robust and Exploratory Data Analysis_, New York: Wiley.
Janssen, P., Marron, J.S., Veraverbeke, N., and Sarle, W.S. (1995), "Scale measures for bandwidth selection", _Journal of Nonparametric Statistics_, 5, 359-380.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.
The variables are id (student identification), Type (type of school attended: "urban"=urban area and "rural"=rural area), and total (total assessment scores in History, Geometry, and Chemistry).
The following DATA step creates the SAS data set TotalScores.
data TotalScores;
title 'High School Scores Data';
input id Type $ total;
datalines;
1 rural 135
2 rural 125
3 rural 223
4 rural 224
5 rural 133
6 rural 253
7 rural 144
8 rural 193
9 rural 152
10 rural 178
11 rural 120
12 rural 180
13 rural 154
14 rural 184
15 rural 187
16 rural 111
17 rural 190
18 rural 128
19 rural 110
20 rural 217
21 urban 192
22 urban 186
23 urban 64
24 urban 159
25 urban 133
26 urban 163
27 urban 130
28 urban 163
29 urban 189
30 urban 144
31 urban 154
32 urban 198
33 urban 150
34 urban 151
35 urban 152
36 urban 151
37 urban 127
38 urban 167
39 urban 170
40 urban 123
;
The following statements use the traditional standardization method to compute the location and scale measures. The PSTAT option displays the table of location and scale measures. The %STDIZE macro uses the mean as the location measure and the standard deviation as the scale measure for standardizing. PROC MEANS shows that the resulting standardized variables have mean zero and standard deviation one.
%inc "<location of your file containing the STDIZE macro>";
%stdize(data=totalscores, var=total, by=type,
out=stdscores, method=std, options=pstat)
proc means data=stdscores;
class type;
var total;
run;
High School Scores Data
Obs _loc1 _sca1
1 167.05 41.9567
2 153.30 30.0668
The MEANS Procedure
Analysis Variable : total
N
Type Obs N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------------
rural 20 20 -2.44249E-16 1.0000000 -1.3597347 2.0485399
urban 20 20 -3.66374E-16 1.0000000 -2.9700565 1.4866912
-------------------------------------------------------------------------------------
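The printed location and scale measures can be cross-checked outside SAS. The Python sketch below recomputes the per-group mean and standard deviation from the TotalScores data and standardizes each group separately, as BY=type does. It assumes the n-1 divisor for the standard deviation and is an independent check, not a reimplementation of the macro.

```python
import statistics

# Scores copied from the DATA step above, split by Type.
rural = [135, 125, 223, 224, 133, 253, 144, 193, 152, 178,
         120, 180, 154, 184, 187, 111, 190, 128, 110, 217]
urban = [192, 186, 64, 159, 133, 163, 130, 163, 189, 144,
         154, 198, 150, 151, 152, 151, 127, 167, 170, 123]

results = {}
for name, scores in (("rural", rural), ("urban", urban)):
    loc = statistics.mean(scores)    # location for METHOD=STD
    sca = statistics.stdev(scores)   # scale, assuming the n-1 divisor
    z = [(s - loc) / sca for s in scores]
    results[name] = (loc, sca, min(z), max(z))
    print(f"{name}: loc={loc:.2f} sca={sca:.4f} "
          f"min={min(z):.7f} max={max(z):.7f}")
```

The values should agree with the _loc1 and _sca1 columns and with the Minimum and Maximum columns of the PROC MEANS output, up to rounding.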
Right-click on the link below and select Save to save the %STDIZE macro definition to a file. It is recommended that you name the file stdize.sas.
After saving the file, edit it to uncomment (remove the leading asterisk from) the %inc statement in the first line and change it to point to the file containing the XMACRO macro definitions on your system.
| Type: | Sample |
| Topic: | Analytics ==> Transformations; SAS Reference ==> Procedures ==> STANDARD; SAS Reference ==> Procedures ==> STDIZE |
| Date Modified: | 2007-08-14 03:03:13 |
| Date Created: | 2005-01-18 07:07:17 |
| Product Family | Product | Host | Starting SAS Release | Ending SAS Release |
| SAS System | Base SAS | All | n/a | n/a |
| SAS System | SAS/STAT | All | n/a | n/a |





