The KDE Procedure

UNIVAR Statement

UNIVAR variable <(v-options)> <…variable <(v-options)>> </ options> ;

The UNIVAR statement computes univariate kernel density estimates. You can specify various v-options for each variable by enclosing them in parentheses after the variable name. You can also specify global options among the UNIVAR statement options following a slash (/). Global options apply to all the variables specified in the UNIVAR statement. However, individual variable v-options override the global options.

Note: The VAR statement supported by PROC KDE in SAS 8 and earlier releases is now obsolete. The VAR statement has been replaced by the UNIVAR and BIVAR statements, which enable you to produce multiple kernel density estimates with a single invocation of the procedure.

Table 48.2 summarizes the options available in the UNIVAR statement.

Table 48.2: UNIVAR Statement Options

Option        Description

BWM=          Specifies a bandwidth multiplier
GRIDL=        Specifies a lower grid limit
GRIDU=        Specifies an upper grid limit
METHOD=       Specifies the method used to compute the bandwidth
NGRID=        Specifies a number of grid points
NOPRINT       Suppresses output tables produced by the UNIVAR statement
OUT=          Specifies the output SAS data set containing the kernel density estimate
PERCENTILES   Requests that a table of percentiles be computed for each UNIVAR variable
PLOTS=        Requests plots of the univariate kernel density estimate
SJPIMAX=      Specifies the maximum grid value in determining the Sheather-Jones plug-in bandwidth
SJPIMIN=      Specifies the minimum grid value in determining the Sheather-Jones plug-in bandwidth
SJPINUM=      Specifies the number of grid values used in determining the Sheather-Jones plug-in bandwidth
SJPITOL=      Specifies the tolerance for termination of the bisection algorithm used in computing the Sheather-Jones plug-in bandwidth
UNISTATS      Produces a table for each variable containing standard univariate statistics and the bandwidth used to compute its kernel density estimate


You can specify the following options in the UNIVAR statement. As noted, some options can be used as v-options.

BWM=number

specifies a bandwidth multiplier used for each kernel density estimate. The default value is 1. Larger multipliers produce a smoother estimate, and smaller ones produce a rougher estimate. To specify different bandwidth multipliers for different variables, specify BWM= as a v-option.
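As a minimal sketch of the effect of the multiplier, the following statements estimate the same density twice, once with a halved bandwidth and once with a doubled bandwidth, and save each estimate with the OUT= option. The data set name MyData, the variable x1, and the output data set names Rough and Smooth are hypothetical.

```sas
/* Hypothetical names: MyData, x1, Rough, Smooth */

/* Bandwidth halved: a rougher estimate */
proc kde data=MyData;
   univar x1 / bwm=0.5 out=Rough;
run;

/* Bandwidth doubled: a smoother estimate */
proc kde data=MyData;
   univar x1 / bwm=2 out=Smooth;
run;
```

Comparing the density variable in the two output data sets should show more local detail in Rough and a flatter curve in Smooth.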

GRIDL=number

specifies a lower grid limit used for each kernel density estimate. The default value for a given variable is the minimum observed value of that variable. To specify different lower grid limits for different variables, specify GRIDL= as a v-option.

GRIDU=number

specifies an upper grid limit used for each kernel density estimate. The default value for a given variable is the maximum observed value of that variable. To specify different upper grid limits for different variables, specify GRIDU= as a v-option.

METHOD=SJPI | SNR | SNRQ | SROT | OS

specifies the method used to compute the bandwidth. Available methods are Sheather-Jones plug-in (SJPI), simple normal reference (SNR), simple normal reference that uses the interquartile range (SNRQ), Silverman’s rule of thumb (SROT), and oversmoothed (OS). For a description of these methods, see the section Bandwidth Selection and Jones, Marron, and Sheather (1996). SJPI is the default method.

NGRID=number
NG=number

specifies a number of grid points used for each kernel density estimate. The default value is 401. To specify different numbers of grid points for different variables, specify NGRID= as a v-option.
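The grid options can be combined to control where the density is evaluated. The following sketch assumes a hypothetical data set MyData with variables x1 and x2, and the limits and grid size are arbitrary illustrative values. If the grid points are evenly spaced and include both limits (an assumption made here for illustration), the grid for x1 would have a spacing of (10 - 0) / (101 - 1) = 0.1.

```sas
/* Hypothetical names: MyData, x1, x2 */
proc kde data=MyData;
   /* x1 uses its own grid limits; x2 uses the global limits */
   univar x1 (gridl=0 gridu=10)
          x2 / gridl=-5 gridu=5 ngrid=101;
run;
```

Here NGRID=101 applies to both variables, while the v-options GRIDL=0 and GRIDU=10 replace the global limits for x1 only.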

NOPRINT

suppresses output tables produced by the UNIVAR statement. You can use the NOPRINT option when you want to produce graphical output only.

OUT=SAS-data-set

specifies the output SAS data set containing the kernel density estimate. This output data set contains the following variables:

  • var, whose value is the name of the variable in the kernel density estimate

  • value, with values corresponding to grid coordinates for the variable

  • density, with values equal to kernel density estimates at the associated grid point

  • count, containing the number of original observations in the bin that corresponds to a grid point
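One possible use of the OUT= data set is to draw the estimated curves yourself. The following sketch assumes a hypothetical input data set MyData with variables x1 and x2 and a hypothetical output data set name KdeOut. It relies only on the var, value, and density variables described above.

```sas
/* Hypothetical names: MyData, x1, x2, KdeOut */
proc kde data=MyData;
   univar x1 x2 / out=KdeOut;
run;

/* One curve per analysis variable, identified by the var column */
proc sgplot data=KdeOut;
   series x=value y=density / group=var;
run;
```

Because the output data set stacks the estimates for all variables, the GROUP= option in the SERIES statement is used to separate the curves.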

PERCENTILES
PERCENTILES=numlist

requests that a table of percentiles be computed for each UNIVAR variable. You can specify a list of percentiles to be computed. The default percentiles are 0.5, 1, 2.5, 5, 10, 25, 50, 75, 90, 95, 97.5, 99, and 99.5.

PLOTS=plot-request | ALL | NONE
PLOTS=(plot-request <…plot-request>)

requests plots of the univariate kernel density estimate. When you specify only one plot request, you can omit the parentheses around the plot request.

ODS Graphics must be enabled before plots can be requested. For example:

ods graphics on;

proc kde data=channel;
   univar length / plots=histdensity;
run;

ods graphics off;

For more information about enabling and disabling ODS Graphics, see the section Enabling and Disabling ODS Graphics in Chapter 21: Statistical Graphics Using ODS.

The following table shows the available plot-requests.

Keyword          Description

ALL              produces all plots
DENSITY          univariate kernel density estimate curve
DENSITYOVERLAY   overlaid univariate kernel density estimate curves
HISTDENSITY      univariate histogram of data overlaid with kernel density estimate curve
HISTOGRAM        univariate histogram of data
NONE             suppresses all plots

By default, if ODS Graphics is enabled and you do not specify the PLOTS= option, then the UNIVAR statement creates a histogram overlaid with a kernel density estimate. If you specify the PLOTS= option, you get only the requested plots.

If you specify more than one variable in the UNIVAR statement, the DENSITYOVERLAY keyword overlays the density curves for all the variables on a single plot. The other keywords each produce a separate plot for every variable listed in the UNIVAR statement.
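As an illustration of multiple plot requests, the following sketch asks for an overlay plot and histograms for two variables. The data set MyData and the variables x1 and x2 are hypothetical.

```sas
/* Hypothetical names: MyData, x1, x2 */
ods graphics on;

proc kde data=MyData;
   /* Parentheses are required because two plot requests are listed */
   univar x1 x2 / plots=(densityoverlay histogram);
run;

ods graphics off;
```

Following the rules described above, this request should yield one plot with both density curves overlaid and a separate histogram for each of x1 and x2.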

SJPIMAX=number

specifies the maximum grid value in determining the Sheather-Jones plug-in bandwidth. The default value is two times the oversmoothed estimate.

SJPIMIN=number

specifies the minimum grid value in determining the Sheather-Jones plug-in bandwidth. The default value is the maximum grid value (see the SJPIMAX= option) divided by 18.

SJPINUM=number

specifies the number of grid values used in determining the Sheather-Jones plug-in bandwidth. The default is 21.

SJPITOL=number

specifies the tolerance for termination of the bisection algorithm used in computing the Sheather-Jones plug-in bandwidth. The default value is 0.001.
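The Sheather-Jones options can be set together when the default search needs adjusting. The following is a sketch only: the data set MyData and the variable x1 are hypothetical, and the values 41 and 0.0001 are arbitrary choices that request a finer search grid and a tighter tolerance than the defaults of 21 and 0.001.

```sas
/* Hypothetical names: MyData, x1; option values are illustrative */
proc kde data=MyData;
   univar x1 / method=sjpi sjpinum=41 sjpitol=0.0001 unistats;
run;
```

The UNISTATS option is included so that the resulting bandwidth appears in the output and can be compared with the bandwidth obtained under the default settings.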

UNISTATS

produces a table for each variable containing standard univariate statistics and the bandwidth used to compute its kernel density estimate. The statistics listed are the mean, variance, standard deviation, range, and interquartile range.
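For example, the following sketch requests both summary tables for two variables and uses the simple normal reference method. The data set MyData and the variables x1 and x2 are hypothetical.

```sas
/* Hypothetical names: MyData, x1, x2 */
proc kde data=MyData;
   /* PERCENTILES without a list requests the default percentiles */
   univar x1 x2 / method=snr unistats percentiles;
run;
```

Rerunning the step with a different METHOD= value and comparing the bandwidth reported in the UNISTATS table is one way to see how much the bandwidth selection method matters for a given variable.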

Examples

Suppose you have the variables x1, x2, x3, and x4 in the SAS data set MyData. You can request a univariate kernel density estimate for each of these variables with the following statements:

proc kde data=MyData; 
   univar x1 x2 x3 x4;
run;

You can also specify different bandwidth multipliers and other options for each variable. For example, the following statements request kernel density estimates that use the Silverman’s rule of thumb (SROT) method for all variables:

proc kde data=MyData; 
   univar x1 (bwm=2) 
          x2 (bwm=0.5 ngrid=100)
          x3 x4 / ngrid=200 method=srot;
run;

The option NGRID=200 applies to the variables x1, x3, and x4, but the v-option NGRID=100 is applied to x2. Bandwidth multipliers of 2 and 0.5 are specified for the variables x1 and x2, respectively.