### HISTOGRAM Statement

HISTOGRAM <variables> < / options> ;

The HISTOGRAM statement creates histograms and optionally superimposes estimated parametric and nonparametric probability density curves. You cannot use the WEIGHT statement with the HISTOGRAM statement. You can use any number of HISTOGRAM statements after a PROC UNIVARIATE statement. The components of the HISTOGRAM statement are as follows.

variables

are the variables for which histograms are to be created. If you specify a VAR statement, the variables must also be listed in the VAR statement. Otherwise, the variables can be any numeric variables in the input data set. If you do not specify variables in a VAR statement or in the HISTOGRAM statement, then by default, a histogram is created for each numeric variable in the DATA= data set. If you use a VAR statement and do not specify any variables in the HISTOGRAM statement, then by default, a histogram is created for each variable listed in the VAR statement.

For example, suppose a data set named Steel contains exactly two numeric variables named Length and Width. The following statements create two histograms, one for Length and one for Width:

proc univariate data=Steel;
histogram;
run;


Likewise, the following statements create histograms for Length and Width:

proc univariate data=Steel;
var Length Width;
histogram;
run;


The following statements create a histogram for Length only:

proc univariate data=Steel;
var Length Width;
histogram Length;
run;

options

add features to the histogram. Specify all options after the slash (/) in the HISTOGRAM statement. Options can be one of the following:

• primary options for fitted parametric distributions and kernel density estimates

• secondary options for fitted parametric distributions and kernel density estimates

• general options for graphics and output data sets

For example, in the following statements, the NORMAL option displays a fitted normal curve on the histogram, the MIDPOINTS= option specifies midpoints for the histogram, and the CTEXT= option specifies the color of the text:

proc univariate data=Steel;
histogram Length / normal
midpoints = 5.6 5.8 6.0 6.2 6.4
ctext     = blue;
run;


Table 4.5 through Table 4.8 list the HISTOGRAM options by function. For complete descriptions, see the sections Dictionary of Options and Dictionary of Common Options.

#### Parametric Density Estimation Options

Table 4.5 lists primary options that display parametric density estimates on the histogram. You can specify each primary option once in a given HISTOGRAM statement, and each primary option can display multiple curves from its family on the histogram.

Table 4.5: Primary Options for Parametric Fitted Distributions

| Option | Description |
|---|---|
| BETA | fits beta distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameters $\alpha$ and $\beta$ |
| EXPONENTIAL | fits exponential distribution with threshold parameter $\theta$ and scale parameter $\sigma$ |
| GAMMA | fits gamma distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameter $\alpha$ |
| GUMBEL | fits Gumbel distribution with location parameter $\mu$ and scale parameter $\sigma$ |
| IGAUSS | fits inverse Gaussian distribution with location parameter $\mu$ and shape parameter $\lambda$ |
| LOGNORMAL | fits lognormal distribution with threshold parameter $\theta$, scale parameter $\zeta$, and shape parameter $\sigma$ |
| NORMAL | fits normal distribution with mean $\mu$ and standard deviation $\sigma$ |
| PARETO | fits generalized Pareto distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameter $\alpha$ |
| POWER | fits power function distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameter $\alpha$ |
| RAYLEIGH | fits Rayleigh distribution with threshold parameter $\theta$ and scale parameter $\sigma$ |
| SB | fits Johnson $S_B$ distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameters $\delta$ and $\gamma$ |
| SU | fits Johnson $S_U$ distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameters $\delta$ and $\gamma$ |
| WEIBULL | fits Weibull distribution with threshold parameter $\theta$, scale parameter $\sigma$, and shape parameter $c$ |

Table 4.6 lists secondary options that specify parameters for fitted parametric distributions and that control the display of fitted curves. Specify these secondary options in parentheses after the primary distribution option. For example, you can fit a normal curve by specifying the NORMAL option as follows:

proc univariate;
histogram / normal(color=red mu=10 sigma=0.5);
run;


The COLOR= normal-option draws the curve in red, and the MU= and SIGMA= normal-options specify the parameters $\mu = 10$ and $\sigma = 0.5$ for the curve. Note that the sample mean and sample standard deviation are used to estimate $\mu$ and $\sigma$, respectively, when the MU= and SIGMA= normal-options are not specified.

You can specify lists of values for secondary options to display more than one fitted curve from the same distribution family on a histogram. Option values are matched by list position. You can specify the value EST in a list of distribution parameter values to use an estimate of the parameter.

For example, the following code displays two normal curves on a histogram:

proc univariate;
histogram / normal(color=(red blue) mu=10 est sigma=0.5 est);
run;


The first curve is red, with $\mu = 10$ and $\sigma = 0.5$. The second curve is blue, with $\mu$ equal to the sample mean and $\sigma$ equal to the sample standard deviation.

See the section Formulas for Fitted Continuous Distributions for detailed information about the families of parametric distributions that you can fit with the HISTOGRAM statement.

Table 4.6: Secondary Options for Parametric Distributions

| Option | Description |
|---|---|
| **Options Used with All Distributions** | |
| COLOR= | specifies colors of density curves |
| FILL | fills area under density curve |
| L= | specifies line types of density curves |
| MIDPERCENTS | prints table of midpoints of histogram intervals |
| NOPRINT | suppresses tables summarizing curves |
| PERCENTS= | lists percents for which quantiles calculated from data and quantiles estimated from curves are tabulated |
| W= | specifies widths of density curves |
| **Beta-Options** | |
| ALPHA= | specifies first shape parameter $\alpha$ for beta curve |
| BETA= | specifies second shape parameter $\beta$ for beta curve |
| SIGMA= | specifies scale parameter $\sigma$ for beta curve |
| THETA= | specifies lower threshold parameter $\theta$ for beta curve |
| **Exponential-Options** | |
| SIGMA= | specifies scale parameter $\sigma$ for exponential curve |
| THETA= | specifies threshold parameter $\theta$ for exponential curve |
| **Gamma-Options** | |
| ALPHA= | specifies shape parameter $\alpha$ for gamma curve |
| ALPHADELTA= | specifies change in successive estimates of $\hat{\alpha}$ at which the Newton-Raphson approximation of $\hat{\alpha}$ terminates |
| ALPHAINITIAL= | specifies initial value for $\hat{\alpha}$ in the Newton-Raphson approximation of $\hat{\alpha}$ |
| MAXITER= | specifies maximum number of iterations in the Newton-Raphson approximation of $\hat{\alpha}$ |
| SIGMA= | specifies scale parameter $\sigma$ for gamma curve |
| THETA= | specifies threshold parameter $\theta$ for gamma curve |
| **Gumbel-Options** | |
| EDFNSAMPLES= | specifies number of samples for EDF goodness-of-fit simulation |
| EDFSEED= | specifies seed value for EDF goodness-of-fit simulation |
| MU= | specifies location parameter $\mu$ for Gumbel curve |
| SIGMA= | specifies scale parameter $\sigma$ for Gumbel curve |
| **IGauss-Options** | |
| EDFNSAMPLES= | specifies number of samples for EDF goodness-of-fit simulation |
| EDFSEED= | specifies seed value for EDF goodness-of-fit simulation |
| LAMBDA= | specifies shape parameter $\lambda$ for inverse Gaussian curve |
| MU= | specifies location parameter $\mu$ for inverse Gaussian curve |
| **Lognormal-Options** | |
| SIGMA= | specifies shape parameter $\sigma$ for lognormal curve |
| THETA= | specifies threshold parameter $\theta$ for lognormal curve |
| ZETA= | specifies scale parameter $\zeta$ for lognormal curve |
| **Normal-Options** | |
| MU= | specifies mean $\mu$ for normal curve |
| SIGMA= | specifies standard deviation $\sigma$ for normal curve |
| **Pareto-Options** | |
| EDFNSAMPLES= | specifies number of samples for EDF goodness-of-fit simulation |
| EDFSEED= | specifies seed value for EDF goodness-of-fit simulation |
| ALPHA= | specifies shape parameter $\alpha$ for generalized Pareto curve |
| SIGMA= | specifies scale parameter $\sigma$ for generalized Pareto curve |
| THETA= | specifies threshold parameter $\theta$ for generalized Pareto curve |
| **Power-Options** | |
| ALPHA= | specifies shape parameter $\alpha$ for power function curve |
| SIGMA= | specifies scale parameter $\sigma$ for power function curve |
| THETA= | specifies threshold parameter $\theta$ for power function curve |
| **Rayleigh-Options** | |
| EDFNSAMPLES= | specifies number of samples for EDF goodness-of-fit simulation |
| EDFSEED= | specifies seed value for EDF goodness-of-fit simulation |
| SIGMA= | specifies scale parameter $\sigma$ for Rayleigh curve |
| THETA= | specifies threshold parameter $\theta$ for Rayleigh curve |
| **Johnson $S_B$-Options** | |
| DELTA= | specifies first shape parameter $\delta$ for Johnson $S_B$ curve |
| FITINTERVAL= | specifies $z$-value for method of percentiles |
| FITMETHOD= | specifies method of parameter estimation |
| FITTOLERANCE= | specifies tolerance for method of percentiles |
| GAMMA= | specifies second shape parameter $\gamma$ for Johnson $S_B$ curve |
| SIGMA= | specifies scale parameter $\sigma$ for Johnson $S_B$ curve |
| THETA= | specifies lower threshold parameter $\theta$ for Johnson $S_B$ curve |
| **Johnson $S_U$-Options** | |
| DELTA= | specifies first shape parameter $\delta$ for Johnson $S_U$ curve |
| FITINTERVAL= | specifies $z$-value for method of percentiles |
| FITMETHOD= | specifies method of parameter estimation |
| FITTOLERANCE= | specifies tolerance for method of percentiles |
| GAMMA= | specifies second shape parameter $\gamma$ for Johnson $S_U$ curve |
| OPTBOUNDRANGE= | specifies the sampling range for parameter starting values in MLE optimization |
| OPTMAXITER= | specifies an iteration limit for MLE optimization |
| OPTMAXSTARTS= | specifies the maximum number of starting points to be used for MLE optimization |
| OPTPRINT | prints an iteration history for MLE optimization |
| OPTSEED= | specifies a seed value for MLE optimization |
| OPTTOLERANCE= | specifies the optimality tolerance for MLE optimization |
| SIGMA= | specifies scale parameter $\sigma$ for Johnson $S_U$ curve |
| THETA= | specifies lower threshold parameter $\theta$ for Johnson $S_U$ curve |
| **Weibull-Options** | |
| C= | specifies shape parameter $c$ for Weibull curve |
| ITPRINT | requests table of iteration history and optimizer details |
| MAXITER= | specifies maximum number of iterations in the Newton-Raphson approximation of $\hat{c}$ |
| SIGMA= | specifies scale parameter $\sigma$ for Weibull curve |
| THETA= | specifies threshold parameter $\theta$ for Weibull curve |

#### Nonparametric Density Estimation Options

Use the option KERNEL(kernel-options) to compute kernel density estimates. Specify the following secondary options in parentheses after the KERNEL option to control features of density estimates requested with the KERNEL option.

Table 4.7: Kernel-Options

| Option | Description |
|---|---|
| C= | specifies standardized bandwidth parameter $c$ |
| COLOR= | specifies color of the kernel density curve |
| FILL | fills area under kernel density curve |
| K= | specifies type of kernel function |
| L= | specifies line type used for kernel density curve |
| LOWER= | specifies lower bound for kernel density curve |
| UPPER= | specifies upper bound for kernel density curve |
| W= | specifies line width for kernel density curve |

#### General Options

Table 4.8 summarizes options for enhancing histograms.

Table 4.8: General Graphics Options

| Option | Description |
|---|---|
| **General Graphics Options** | |
| BARLABEL= | produces labels above histogram bars |
| CLIPCURVES | scales vertical axis without considering fitted curves |
| ENDPOINTS= | lists endpoints for histogram intervals |
| GRID | creates a grid |
| HANGING | constructs hanging histogram |
| HREF= | specifies reference lines perpendicular to the horizontal axis |
| HREFLABELS= | specifies labels for HREF= lines |
| HREFLABPOS= | specifies vertical position of labels for HREF= lines |
| MIDPOINTS= | specifies midpoints for histogram intervals |
| NENDPOINTS= | specifies number of histogram interval endpoints |
| NMIDPOINTS= | specifies number of histogram interval midpoints |
| NOBARS | suppresses histogram bars |
| NOHLABEL | suppresses label for horizontal axis |
| NOPLOT | suppresses plot |
| NOVLABEL | suppresses label for vertical axis |
| NOVTICK | suppresses tick marks and tick mark labels for vertical axis |
| RTINCLUDE | includes right endpoint in interval |
| VAXISLABEL= | specifies label for vertical axis |
| VREF= | specifies reference lines perpendicular to the vertical axis |
| VREFLABELS= | specifies labels for VREF= lines |
| VREFLABPOS= | specifies horizontal position of labels for VREF= lines |
| VSCALE= | specifies scale for vertical axis |
| **Options for Traditional Graphics Output** | |
| ANNOTATE= | specifies annotate data set |
| BARWIDTH= | specifies width for the bars |
| CAXIS= | specifies color for axis |
| CBARLINE= | specifies color for outlines of histogram bars |
| CFILL= | specifies color for filling under curve |
| CFRAME= | specifies color for frame |
| CGRID= | specifies color for grid lines |
| CHREF= | specifies colors for HREF= lines |
| CLIPREF | draws reference lines behind histogram bars |
| CSTATREF= | specifies colors for STATREF= lines |
| CTEXT= | specifies color for text |
| CVREF= | specifies colors for VREF= lines |
| DESCRIPTION= | specifies description for plot in graphics catalog |
| FONT= | specifies software font for text |
| FRONTREF | draws reference lines in front of histogram bars |
| HAXIS= | specifies AXIS statement for horizontal axis |
| HEIGHT= | specifies height of text used outside framed areas |
| HMINOR= | specifies number of horizontal minor tick marks |
| HOFFSET= | specifies offset for horizontal axis |
| INFONT= | specifies software font for text inside framed areas |
| INHEIGHT= | specifies height of text inside framed areas |
| INTERBAR= | specifies space between histogram bars |
| LGRID= | specifies a line type for grid lines |
| LHREF= | specifies line types for HREF= lines |
| LSTATREF= | specifies line types for STATREF= lines |
| LVREF= | specifies line types for VREF= lines |
| NAME= | specifies name for plot in graphics catalog |
| NOFRAME | suppresses frame around plotting area |
| PFILL= | specifies pattern for filling under curve |
| TURNVLABELS | turns and vertically strings out characters in labels for vertical axis |
| VAXIS= | specifies AXIS statement or values for vertical axis |
| VMINOR= | specifies number of vertical minor tick marks |
| VOFFSET= | specifies length of offset at upper end of vertical axis |
| WAXIS= | specifies line thickness for axes and frame |
| WBARLINE= | specifies line thickness for bar outlines |
| WGRID= | specifies line thickness for grid |
| **Options for ODS Graphics Output** | |
| ODSFOOTNOTE= | specifies footnote displayed on histogram |
| ODSFOOTNOTE2= | specifies secondary footnote displayed on histogram |
| ODSTITLE= | specifies title displayed on histogram |
| ODSTITLE2= | specifies secondary title displayed on histogram |
| OVERLAY | overlays histograms for different class levels |
| **Options for Comparative Plots** | |
| ANNOKEY | applies annotation requested in ANNOTATE= data set to key cell only |
| CFRAMESIDE= | specifies color for filling frame for row labels |
| CFRAMETOP= | specifies color for filling frame for column labels |
| CPROP= | specifies color for proportion of frequency bar |
| CTEXTSIDE= | specifies color for row labels of comparative histograms |
| CTEXTTOP= | specifies color for column labels of comparative histograms |
| INTERTILE= | specifies distance between tiles |
| MAXNBIN= | specifies maximum number of bins to display |
| MAXSIGMAS= | limits the number of bins that display to within a specified number of standard deviations above and below mean of data in key cell |
| NCOLS= | specifies number of columns in comparative histogram |
| NROWS= | specifies number of rows in comparative histogram |
| **Miscellaneous Options** | |
| MIDPERCENTS | creates table of histogram intervals |
| OUTHISTOGRAM= | creates a data set containing information about histogram intervals |
| OUTKERNEL= | creates a data set containing kernel density estimates |

#### Dictionary of Options

The following entries provide detailed descriptions of options in the HISTOGRAM statement. Options marked with † are applicable only when traditional graphics are produced. See the section Dictionary of Common Options for detailed descriptions of options common to all plot statements.

ALPHA=value-list

specifies the shape parameter $\alpha$ for fitted curves requested with the BETA, GAMMA, PARETO, and POWER options. Enclose the ALPHA= option in parentheses after the distribution keyword. By default, or if you specify the value EST, the procedure calculates a maximum likelihood estimate for $\alpha$. You can specify A= as an alias for ALPHA= if you use it as a beta-option. You can specify SHAPE= as an alias for ALPHA= if you use it as a gamma-option.

BARLABEL=COUNT | PERCENT | PROPORTION

displays labels above the histogram bars. If you specify BARLABEL=COUNT, the label shows the number of observations associated with a given bar. If you specify BARLABEL=PERCENT, the label shows the percentage of observations represented by that bar. If you specify BARLABEL=PROPORTION, the label displays the proportion of observations associated with the bar.

† BARWIDTH=value

specifies the width of the histogram bars in percentage screen units. If both the BARWIDTH= and INTERBAR= options are specified, the INTERBAR= option takes precedence.

BETA <(beta-options)>

displays fitted beta density curves on the histogram. The BETA option can occur only once in a HISTOGRAM statement, but it can request any number of beta curves. The beta distribution is bounded below by the parameter $\theta$ and above by the value $\theta + \sigma$. Use the THETA= and SIGMA= beta-options to specify these parameters. By default, THETA=0 and SIGMA=1. You can specify THETA=EST and SIGMA=EST to request maximum likelihood estimates for $\theta$ and $\sigma$.

The beta distribution has two shape parameters: $\alpha$ and $\beta$. If these parameters are known, you can specify their values with the ALPHA= and BETA= beta-options. By default, the procedure computes maximum likelihood estimates for $\alpha$ and $\beta$. Note: Three- and four-parameter maximum likelihood estimation may not always converge.

Table 4.6 lists secondary options you can specify with the BETA option. See the section Beta Distribution for details and Example 4.21 for an example that uses the BETA option.
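As a minimal sketch of the BETA option, the following statements fit a beta curve with known bounds and estimated shape parameters. The data set Ratios, the variable Ratio, and the simulated values are hypothetical and serve only to make the example self-contained.

```sas
/* Hypothetical data: 200 simulated proportions between 0 and 1 */
data Ratios;
   call streaminit(2718);
   do i = 1 to 200;
      Ratio = rand('beta', 2, 5);
      output;
   end;
   drop i;
run;

/* THETA=0 and SIGMA=1 fix the bounds at 0 and 1 (the defaults).    */
/* ALPHA= and BETA= are omitted, so both shape parameters are       */
/* estimated by maximum likelihood.                                 */
proc univariate data=Ratios;
   histogram Ratio / beta(theta=0 sigma=1 color=red);
run;
```

Replacing the fixed bounds with THETA=EST and SIGMA=EST requests a four-parameter fit, which, as noted above, may not always converge.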

BETA=value-list
B=value-list

specifies the second shape parameter $\beta$ for beta density curves requested with the BETA option. Enclose the BETA= option in parentheses after the BETA option. By default, or if you specify the value EST, the procedure calculates a maximum likelihood estimate for $\beta$.

C=value-list

specifies the shape parameter $c$ for Weibull density curves requested with the WEIBULL option. Enclose the C= Weibull-option in parentheses after the WEIBULL option. By default, or if you specify the value EST, the procedure calculates a maximum likelihood estimate for $c$. You can specify the SHAPE= Weibull-option as an alias for the C= Weibull-option.

C=value-list

specifies the standardized bandwidth parameter for kernel density estimates requested with the KERNEL option. Enclose the C= kernel-option in parentheses after the KERNEL option. You can specify a list of values to request multiple estimates. You can specify the value MISE to produce the estimate with a bandwidth that minimizes the approximate mean integrated square error (MISE), or SJPI to select the bandwidth by using the Sheather-Jones plug-in method.

You can also use the C= kernel-option with the K= kernel-option (which specifies the kernel function) to compute multiple estimates. If you specify more kernel functions than bandwidths, the last bandwidth in the list is repeated for the remaining estimates. Similarly, if you specify more bandwidths than kernel functions, the last kernel function is repeated for the remaining estimates. If you do not specify the C= kernel-option, the bandwidth that minimizes the approximate MISE is used for all the estimates.
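The list-matching rule for the C= and K= kernel-options can be illustrated with a short sketch. The simulated Steel data below are hypothetical, and the bandwidth values are arbitrary choices for illustration.

```sas
/* Hypothetical data: 200 simulated lengths */
data Steel;
   call streaminit(1234);
   do i = 1 to 200;
      Length = rand('normal', 6.0, 0.2);
      output;
   end;
   drop i;
run;

/* Three kernel functions but only two bandwidths are listed.       */
/* By the rule described above, the last bandwidth (1.0) is reused  */
/* for the third estimate:                                          */
/*   estimate 1: normal kernel,     c = 0.5                         */
/*   estimate 2: quadratic kernel,  c = 1.0                         */
/*   estimate 3: triangular kernel, c = 1.0                         */
proc univariate data=Steel;
   histogram Length / kernel(c=0.5 1.0 k=normal quadratic triangular);
run;
```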

† CBARLINE=color

specifies the color for the outline of the histogram bars when producing traditional graphics. The option does not apply to ODS Graphics output.

† CFILL=color

specifies the color to fill the bars of the histogram (or the area under a fitted density curve if you also specify the FILL option) when producing traditional graphics. See the entries for the FILL and PFILL= options for additional details. Refer to SAS/GRAPH: Reference for a list of colors. The option does not apply to ODS Graphics output.

† CGRID=color

specifies the color for grid lines when a grid displays on the histogram in traditional graphics. This option also produces a grid if the GRID option is not specified.

CLIPCURVES

scales the vertical axis without taking fitted curves into consideration. Curves that extend above the tallest histogram bar can be clipped. You can use this option to avoid compression of the histogram bars due to extremely high fitted curve peaks.

† CLIPREF

draws reference lines requested with the HREF= and VREF= options behind the histogram bars. When the GSTYLE system option is in effect for traditional graphics, reference lines are drawn in front of the bars by default.

CONTENTS=

specifies the table of contents grouping entry for tables associated with a density curve. Enclose the CONTENTS= option in parentheses after the distribution option. You can specify CONTENTS='' to suppress the grouping entry.

DELTA=value-list

specifies the first shape parameter $\delta$ for Johnson $S_B$ and Johnson $S_U$ distribution functions requested with the SB and SU options. Enclose the DELTA= option in parentheses after the SB or SU option. If you do not specify a value for $\delta$, or if you specify the value EST, the procedure calculates an estimate.

EDFNSAMPLES=value

specifies the number of simulation samples used to compute $p$-values for EDF goodness-of-fit statistics for density curves requested with the GUMBEL, IGAUSS, PARETO, and RAYLEIGH options. Enclose the EDFNSAMPLES= option in parentheses after the distribution option. The default value is 500.

EDFSEED=value

specifies an integer value used to start the pseudo-random number generator when creating simulation samples for computing EDF goodness-of-fit statistic $p$-values for density curves requested with the GUMBEL, IGAUSS, PARETO, and RAYLEIGH options. Enclose the EDFSEED= option in parentheses after the distribution option. By default, the procedure uses a random number seed generated from reading the time of day from the computer’s clock.
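A fixed seed makes the simulated goodness-of-fit results repeatable from run to run. The following sketch assumes a hypothetical data set Loads whose values are simulated from a Gumbel distribution by inverting its distribution function. The sample count and seed are arbitrary.

```sas
/* Hypothetical data: Gumbel variates with location 10 and scale 2, */
/* generated by the inverse-CDF method                              */
data Loads;
   call streaminit(555);
   do i = 1 to 150;
      Load = 10 - 2*log(-log(rand('uniform')));
      output;
   end;
   drop i;
run;

/* Use 1000 simulation samples instead of the default 500, and fix  */
/* the seed so that the simulation starts from the same state       */
proc univariate data=Loads;
   histogram Load / gumbel(edfnsamples=1000 edfseed=12345);
run;
```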

ENDPOINTS <=values | KEY | UNIFORM>

uses histogram bin endpoints as the tick mark values for the horizontal axis and determines how to compute the bin width of the histogram bars. The values specify both the left and right endpoint of each histogram interval. The width of the histogram bars is the difference between consecutive endpoints. The procedure uses the same values for all variables.

The range of endpoints must cover the range of the data. For example, if you specify

endpoints=2 to 10 by 2


then all of the observations must fall in the intervals [2,4), [4,6), [6,8), and [8,10]. You must also list evenly spaced endpoints in increasing order.

KEY

determines the endpoints for the data in the key cell. The initial number of endpoints is based on the number of observations in the key cell by using the method of Terrell and Scott (1985). The procedure extends the endpoint list for the key cell in either direction as necessary until it spans the data in the remaining cells.

UNIFORM

determines the endpoints by using all the observations as if there were no cells. In other words, the number of endpoints is based on the total sample size by using the method of Terrell and Scott (1985).

Neither KEY nor UNIFORM applies unless you use the CLASS statement.

If you omit ENDPOINTS, the procedure uses the histogram midpoints as horizontal axis tick values. If you specify ENDPOINTS, the procedure computes the endpoints by using an algorithm (Terrell and Scott, 1985) that is primarily applicable to continuous data that are approximately normally distributed.

If you specify both MIDPOINTS= and ENDPOINTS, the procedure issues a warning message and uses the endpoints.

If you specify RTINCLUDE, the procedure includes the right endpoint of each histogram interval in that interval instead of including the left endpoint.

If you use a CLASS statement and specify ENDPOINTS, the procedure uses ENDPOINTS=KEY as the default. However, if the key cell is empty, then the procedure uses ENDPOINTS=UNIFORM.
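The following sketch applies the endpoint list from the example above together with RTINCLUDE. The data set Parts and the variable Size are hypothetical. The values are simulated so that they lie between 2 and 10, as the endpoint list requires.

```sas
/* Hypothetical data: 100 values between 2 and 10 */
data Parts;
   call streaminit(42);
   do i = 1 to 100;
      Size = 2 + 8*rand('uniform');
      output;
   end;
   drop i;
run;

/* Four bins of width 2 with tick marks at 2, 4, 6, 8, and 10.      */
/* RTINCLUDE places a value that equals an endpoint in the interval */
/* to its left rather than the interval to its right.               */
proc univariate data=Parts;
   histogram Size / endpoints=2 to 10 by 2 rtinclude;
run;
```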

EXPONENTIAL <(exponential-options)>
EXP <(exponential-options)>

displays fitted exponential density curves on the histogram. The EXPONENTIAL option can occur only once in a HISTOGRAM statement, but it can request any number of exponential curves. The parameter $\theta$ must be less than or equal to the minimum data value. Use the THETA= exponential-option to specify $\theta$. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for $\theta$. Use the SIGMA= exponential-option to specify $\sigma$. By default, the procedure computes a maximum likelihood estimate for $\sigma$. Table 4.6 lists options you can specify with the EXPONENTIAL option. See the section Exponential Distribution for details.

FILL

fills areas under the fitted density curve or the kernel density estimate with colors and patterns. The FILL option can occur with only one fitted curve. Enclose the FILL option in parentheses after a density curve option or the KERNEL option. The CFILL= and PFILL= options specify the color and pattern for the area under the curve when producing traditional graphics. For a list of available colors and patterns, see SAS/GRAPH: Reference.

† FRONTREF

draws reference lines requested with the HREF= and VREF= options in front of the histogram bars. When the NOGSTYLE system option is in effect for traditional graphics, reference lines are drawn behind the histogram bars by default, and they can be obscured by filled bars.

GAMMA <(gamma-options)>

displays fitted gamma density curves on the histogram. The GAMMA option can occur only once in a HISTOGRAM statement, but it can request any number of gamma curves. The parameter $\theta$ must be less than the minimum data value. Use the THETA= gamma-option to specify $\theta$. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for $\theta$. Use the ALPHA= and the SIGMA= gamma-options to specify the shape parameter $\alpha$ and the scale parameter $\sigma$. By default, PROC UNIVARIATE computes maximum likelihood estimates for $\alpha$ and $\sigma$. The procedure calculates the maximum likelihood estimate of $\alpha$ iteratively by using the Newton-Raphson approximation. Table 4.6 lists options you can specify with the GAMMA option. See the section Gamma Distribution for details, and see Example 4.22 for an example that uses the GAMMA option.

GAMMA=value-list

specifies the second shape parameter $\gamma$ for Johnson $S_B$ and Johnson $S_U$ distribution functions requested with the SB and SU options. Enclose the GAMMA= option in parentheses after the SB or SU option. If you do not specify a value for $\gamma$, or if you specify the value EST, the procedure calculates an estimate.

GRID

displays a grid on the histogram. Grid lines are horizontal lines that are positioned at major tick marks on the vertical axis.

GUMBEL <(Gumbel-options)>

displays fitted Gumbel density curves on the histogram. The GUMBEL option can occur only once in a HISTOGRAM statement, but it can request any number of Gumbel curves. Use the MU= and the SIGMA= Gumbel-options to specify the location parameter $\mu$ and the scale parameter $\sigma$. By default, PROC UNIVARIATE computes maximum likelihood estimates for $\mu$ and $\sigma$. Table 4.6 lists options you can specify with the GUMBEL option. See the section Gumbel Distribution for details about the Gumbel distribution.

HANGING
HANG

requests a hanging histogram, as illustrated in Figure 4.7.

Figure 4.7: Hanging Histogram

You can use the HANGING option only when exactly one fitted density curve is requested. A hanging histogram aligns the tops of the histogram bars (displayed as lines) with the fitted curve. The lines are positioned at the midpoints of the histogram bins. A hanging histogram is a goodness-of-fit diagnostic in the sense that the closer the lines are to the horizontal axis, the better the fit. Hanging histograms are discussed by Tukey (1977), Wainer (1974), and Velleman and Hoaglin (1981).
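A hanging histogram can be requested as in the following sketch, which fits a single normal curve so that the one-curve requirement is met. The simulated Steel data are hypothetical.

```sas
/* Hypothetical data: 200 simulated lengths */
data Steel;
   call streaminit(1234);
   do i = 1 to 200;
      Length = rand('normal', 6.0, 0.2);
      output;
   end;
   drop i;
run;

/* Exactly one fitted curve (NORMAL) is requested, so HANGING is    */
/* allowed. The bars hang from the fitted normal curve.             */
proc univariate data=Steel;
   histogram Length / normal hanging;
run;
```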

† HOFFSET=value

specifies the offset, in percentage screen units, at both ends of the horizontal axis. You can use HOFFSET=0 to eliminate the default offset.

IGAUSS <(iGauss-options)>

displays fitted inverse Gaussian density curves on the histogram. The IGAUSS option can occur only once in a HISTOGRAM statement, but it can request any number of inverse Gaussian curves. Use the MU= and the LAMBDA= iGauss-options to specify the location parameter $\mu$ and the shape parameter $\lambda$. By default, PROC UNIVARIATE uses the sample mean for $\mu$ and computes a maximum likelihood estimate for $\lambda$. Table 4.6 lists options you can specify with the IGAUSS option. See the section Inverse Gaussian Distribution for details.

† INTERBAR=value

specifies the space between histogram bars in percentage screen units. If both the INTERBAR= and BARWIDTH= options are specified, the INTERBAR= option takes precedence.

K=NORMAL | QUADRATIC | TRIANGULAR

specifies the kernel function (normal, quadratic, or triangular) used to compute a kernel density estimate. You can specify a list of values to request multiple estimates. You must enclose this option in parentheses after the KERNEL option. You can also use the K= kernel-option with the C= kernel-option, which specifies standardized bandwidths. If you specify more kernel functions than bandwidths, the procedure repeats the last bandwidth in the list for the remaining estimates. Similarly, if you specify more bandwidths than kernel functions, the procedure repeats the last kernel function for the remaining estimates. By default, K=NORMAL.

KERNEL<(kernel-options)>

superimposes kernel density estimates on the histogram. By default, the procedure uses the AMISE method to compute kernel density estimates. To request multiple kernel density estimates on the same histogram, specify a list of values for the C= kernel-option or K= kernel-option. Table 4.7 lists options you can specify with the KERNEL option. See the section Kernel Density Estimates for more information about kernel density estimates, and see Example 4.23.

LAMBDA=value

specifies the shape parameter $\lambda$ for fitted curves requested with the IGAUSS option. Enclose the LAMBDA= option in parentheses after the IGAUSS distribution keyword. If you do not specify a value for $\lambda$, the procedure calculates a maximum likelihood estimate.

† LGRID=linetype

specifies the line type for the grid when a grid displays on the histogram. This option also creates a grid if the GRID option is not specified.

LOGNORMAL<(lognormal-options)>

displays fitted lognormal density curves on the histogram. The LOGNORMAL option can occur only once in a HISTOGRAM statement, but it can request any number of lognormal curves. The parameter $\theta$ must be less than the minimum data value. Use the THETA= lognormal-option to specify $\theta$. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for $\theta$. Use the SIGMA= and ZETA= lognormal-options to specify $\sigma$ and $\zeta$. By default, the procedure computes maximum likelihood estimates for $\sigma$ and $\zeta$. Table 4.6 lists options you can specify with the LOGNORMAL option. See the section Lognormal Distribution for details, and see Example 4.22 and Example 4.24 for examples using the LOGNORMAL option.

LOWER=value-list

specifies lower bounds for kernel density estimates requested with the KERNEL option. Enclose the LOWER= option in parentheses after the KERNEL option. If you specify more kernel estimates than lower bounds, the last lower bound is repeated for the remaining estimates. The default is a missing value, indicating no lower bounds for fitted kernel density curves.

MAXNBIN=n

limits the number of bins displayed in the comparative histogram. This option is useful when the scales or ranges of the data distributions differ greatly from cell to cell. By default, the bin size and midpoints are determined for the key cell, and then the midpoint list is extended to accommodate the data ranges for the remaining cells. However, if the cell scales differ considerably, the resulting number of bins can be so great that each cell histogram is scaled into a narrow region. By using MAXNBIN= to limit the number of bins, you can narrow the window about the data distribution in the key cell. This option is not available unless you specify the CLASS statement. The MAXNBIN= option is an alternative to the MAXSIGMAS= option.

MAXSIGMAS=value

limits the number of bins displayed in the comparative histogram to a range of value standard deviations (of the data in the key cell) above and below the mean of the data in the key cell. This option is useful when the scales or ranges of the data distributions differ greatly from cell to cell. By default, the bin size and midpoints are determined for the key cell, and then the midpoint list is extended to accommodate the data ranges for the remaining cells. However, if the cell scales differ considerably, the resulting number of bins can be so great that each cell histogram is scaled into a narrow region. By using MAXSIGMAS= to limit the number of bins, you can narrow the window that surrounds the data distribution in the key cell. This option is not available unless you specify the CLASS statement.
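The effect of MAXSIGMAS= is easiest to see when the cells have very different spreads. The sketch below uses a hypothetical data set Batches with a classification variable Group. The spreads are invented for illustration, and which level the procedure treats as the key cell is not fixed by this example.

```sas
/* Hypothetical data: two groups with the same center but very      */
/* different spreads                                                */
data Batches;
   call streaminit(99);
   do i = 1 to 200;
      Group  = 'A';
      Length = rand('normal', 6.0, 0.2);
      output;
      Group  = 'B';
      Length = rand('normal', 6.0, 2.0);
      output;
   end;
   drop i;
run;

/* Show only bins within 3 key-cell standard deviations of the      */
/* key-cell mean. MAXNBIN=n could be used instead to cap the        */
/* number of bins directly.                                         */
proc univariate data=Batches;
   class Group;
   histogram Length / maxsigmas=3;
run;
```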

MIDPERCENTS

requests a table listing the midpoints and percentage of observations in each histogram interval. If you specify MIDPERCENTS in parentheses after a density estimate option, the procedure displays a table that lists the midpoints, the observed percentage of observations, and the estimated percentage of the population in each interval (estimated from the fitted distribution). See Example 4.18.
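The two placements of MIDPERCENTS can be combined in one statement, as in the following sketch. The simulated Steel data are hypothetical.

```sas
/* Hypothetical data: 200 simulated lengths */
data Steel;
   call streaminit(1234);
   do i = 1 to 200;
      Length = rand('normal', 6.0, 0.2);
      output;
   end;
   drop i;
run;

/* MIDPERCENTS after the slash: table of midpoints and observed     */
/* percentages.                                                     */
/* MIDPERCENTS inside NORMAL(...): table that also lists the        */
/* percentages estimated from the fitted normal curve.              */
proc univariate data=Steel;
   histogram Length / midpercents normal(midpercents);
run;
```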

MIDPOINTS=values | KEY | UNIFORM

specifies how to determine the midpoints for the histogram intervals, where values determines the width of the histogram bars as the difference between consecutive midpoints. The procedure uses the same values for all variables.

The range of midpoints, extended at each end by half of the bar width, must cover the range of the data. For example, if you specify

midpoints=2 to 10 by 0.5


then all of the observations should fall between 1.75 and 10.25. You must use evenly spaced midpoints listed in increasing order.

KEY

determines the midpoints for the data in the key cell. The initial number of midpoints is based on the number of observations in the key cell by using the method of Terrell and Scott (1985). The procedure extends the midpoint list for the key cell in either direction as necessary until it spans the data in the remaining cells.

UNIFORM

determines the midpoints by using all the observations as if there were no cells. In other words, the number of midpoints is based on the total sample size by using the method of Terrell and Scott (1985).

Neither KEY nor UNIFORM applies unless you use the CLASS statement. By default, if you use a CLASS statement, MIDPOINTS=KEY; however, if the key cell is empty, then MIDPOINTS=UNIFORM. Otherwise, the procedure computes the midpoints by using an algorithm (Terrell and Scott, 1985) that is primarily applicable to continuous data that are approximately normally distributed.
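
The following sketch illustrates both forms of the MIDPOINTS= option. It assumes the Steel data set and the Length variable from the earlier example, and it assumes that all Length values fall between 1.75 and 10.25 so that the explicit midpoint list covers the data. The classification variable Supplier is hypothetical.

```sas
/* Explicit, evenly spaced midpoints in increasing order */
proc univariate data=Steel noprint;
   histogram Length / midpoints=2 to 10 by 0.5;
run;

/* Comparative histogram: base the midpoints on all observations */
proc univariate data=Steel noprint;
   class Supplier;                        /* hypothetical classification variable */
   histogram Length / midpoints=uniform;
run;
```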

MU=value-list

specifies the parameter μ for Gumbel, inverse Gaussian, and normal density curves requested with the GUMBEL, IGAUSS, and NORMAL options, respectively. Enclose the MU= option in parentheses after the distribution keyword. By default, or if you specify the value EST, the procedure uses the sample mean for μ for normal and inverse Gaussian distributions and computes a maximum likelihood estimate of μ for the Gumbel distribution. For more details, see the sections Inverse Gaussian Distribution and Gumbel Distribution.

NENDPOINTS=n

uses histogram interval endpoints as the tick mark values for the horizontal axis and determines the number of bins.

NMIDPOINTS=n

specifies the number of histogram intervals.
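
A short sketch of the two options, assuming the Steel data set and the Length variable from the earlier example. The value 8 is arbitrary and only illustrates that each option takes a single count.

```sas
proc univariate data=Steel noprint;
   histogram Length / nendpoints=8;   /* tick marks at interval endpoints */
   histogram Length / nmidpoints=8;   /* request 8 histogram intervals */
run;
```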

NOBARS

suppresses drawing of histogram bars, which is useful for viewing fitted curves only.

NOPLOT
NOCHART

suppresses the creation of a plot. Use this option when you only want to tabulate summary statistics for a fitted density or create an OUTHISTOGRAM= data set.

NOPRINT

suppresses tables summarizing the fitted curve. Enclose the NOPRINT option in parentheses following the distribution option.

NORMAL<(normal-options)>

displays fitted normal density curves on the histogram. The NORMAL option can occur only once in a HISTOGRAM statement, but it can request any number of normal curves. Use the MU= and SIGMA= normal-options to specify μ and σ. By default, the procedure uses the sample mean and sample standard deviation for μ and σ. Table 4.6 lists options you can specify with the NORMAL option. See the section Normal Distribution for details, and see Example 4.19 for an example that uses the NORMAL option.
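
The following sketch contrasts the default normal fit with a curve whose parameters are supplied explicitly. It assumes the Steel data set and the Length variable from the earlier example; the values 10 and 0.3 are hypothetical and would need to suit the actual data.

```sas
proc univariate data=Steel;
   histogram Length / normal;                    /* sample mean and standard deviation */
   histogram Length / normal(mu=10 sigma=0.3);   /* user-specified parameters */
run;
```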

NOTABCONTENTS

suppresses the table of contents entries for tables produced by the HISTOGRAM statement.

OPTBOUNDRANGE=value

defines the sampling range for each parameter during maximum likelihood estimation for the Johnson S_U distribution. PROC UNIVARIATE computes initial estimates for each parameter by using the method of percentiles. The value determines the range of parameter values around the initial estimate that can be sampled for local optimization starting values. The default is 100.

OPTMAXITER=value

limits the number of iterations that are used by the optimizer in maximum likelihood estimation for the Johnson S_U distribution. The default is 500.

OPTMAXSTARTS=N

defines the maximum number of starting points to be used for local optimization in maximum likelihood estimation for the Johnson S_U distribution. That is, no more than N local optimizations are used in the multistart algorithm. The default value is 100.

OPTPRINT

prints the iteration history for the Johnson S_U distribution maximum likelihood estimation.

OPTSEED=value

specifies a positive integer seed for generating random number sequences in Johnson S_U distribution maximum likelihood estimation. You can use this option to replicate results from different runs.

OPTTOLERANCE=value

specifies the tolerance for declaring optimality in maximum likelihood estimation for the Johnson S_U distribution. The default value is 1E-8.
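
The optimization options above might be combined as in the following sketch. Two points are assumptions rather than statements from this section: that the options are enclosed in parentheses after the SU keyword, as other secondary options are, and that maximum likelihood fitting is requested with FITMETHOD=MLE. The seed and limits are arbitrary illustrative values, and Steel and Length come from the earlier example.

```sas
proc univariate data=Steel;
   histogram Length / su(fitmethod=mle     /* assumed: request maximum likelihood fitting */
                         optseed=12345     /* fixed seed so that runs can be replicated */
                         optmaxstarts=50   /* at most 50 local optimizations */
                         optmaxiter=200    /* at most 200 optimizer iterations */
                         optprint);        /* print the iteration history */
run;
```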

OUTHISTOGRAM=SAS-data-set
OUTHIST=SAS-data-set

creates a SAS data set that contains information about histogram intervals. Specifically, the data set contains the midpoints of the histogram intervals (or the lower endpoints of the intervals if you specify the ENDPOINTS option), the observed percentage of observations in each interval, and the estimated percentage of observations in each interval (estimated from each of the specified fitted curves).
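
As a sketch of how OUTHISTOGRAM= and NOPLOT can work together, the following statements save the interval information for a histogram with a fitted normal curve without drawing the plot, and then print the saved data set. Steel and Length come from the earlier example; the output data set name LengthBins is hypothetical.

```sas
proc univariate data=Steel noprint;
   histogram Length / normal noplot outhistogram=LengthBins;
run;

/* Inspect the midpoints and the observed and estimated percentages */
proc print data=LengthBins;
run;
```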

PARETO <(Pareto-options)>

displays fitted generalized Pareto density curves on the histogram. The PARETO option can occur only once in a HISTOGRAM statement, but it can request any number of generalized Pareto curves. The parameter θ must be less than the minimum data value. Use the THETA= Pareto-option to specify θ. By default, THETA=0. Use the SIGMA= and the ALPHA= Pareto-options to specify the scale parameter σ and the shape parameter α. By default, PROC UNIVARIATE computes maximum likelihood estimates for σ and α. Table 4.6 lists options you can specify with the PARETO option. See the section Generalized Pareto Distribution for details.

PERCENTS=values
PERCENT=values

specifies a list of percents for which quantiles calculated from the data and quantiles estimated from the fitted curve are tabulated. The percents must be between 0 and 100. Enclose the PERCENTS= option in parentheses after the curve option. The default percents are 1, 5, 10, 25, 50, 75, 90, 95, and 99.
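
For instance, the following sketch replaces the default percents with a shorter list for a fitted normal curve. It assumes the Steel data set and the Length variable from the earlier example; the list 5, 50, and 95 is illustrative.

```sas
proc univariate data=Steel;
   histogram Length / normal(percents=5 50 95);  /* tabulate observed and estimated quantiles */
run;
```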

† PFILL=pattern

specifies a pattern used to fill the bars of the histograms (or the areas under a fitted curve if you also specify the FILL option) when producing traditional graphics. See the entries for the CFILL= and FILL options for additional details. Refer to SAS/GRAPH: Reference for a list of pattern values. The option does not apply to ODS Graphics output.

POWER <(power-options)>

displays fitted power function density curves on the histogram. The POWER option can occur only once in a HISTOGRAM statement, but it can request any number of power function curves. The parameter θ must be less than the minimum data value. Use the THETA= and SIGMA= power-options to specify θ and σ. The default values are 0 and 1, respectively. Use the ALPHA= power-option to specify the shape parameter, α. By default, PROC UNIVARIATE computes a maximum likelihood estimate for α. Table 4.6 lists options you can specify with the POWER option. See the section Power Function Distribution for details.

RAYLEIGH <(Rayleigh-options)>

displays fitted Rayleigh density curves on the histogram. The RAYLEIGH option can occur only once in a HISTOGRAM statement, but it can request any number of Rayleigh curves. The parameter θ must be less than the minimum data value. Use the THETA= Rayleigh-option to specify θ. By default, THETA=0. Use the SIGMA= Rayleigh-option to specify the scale parameter σ. By default, PROC UNIVARIATE computes a maximum likelihood estimate for σ. Table 4.6 lists options you can specify with the RAYLEIGH option. See the section Rayleigh Distribution for details.

RTINCLUDE

includes the right endpoint of each histogram interval in that interval. By default, the left endpoint is included in the histogram interval.
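
The effect of RTINCLUDE is easiest to see with explicit interval boundaries. In this sketch, which assumes the Steel data set and the Length variable from the earlier example and an illustrative endpoint list, an observation equal to 5 is counted in the interval from 4 to 5 rather than in the interval from 5 to 6.

```sas
proc univariate data=Steel noprint;
   histogram Length / endpoints=2 to 10 by 1 rtinclude;  /* intervals closed on the right */
run;
```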

SB<(S_B-options)>

displays fitted Johnson S_B density curves on the histogram. The SB option can occur only once in a HISTOGRAM statement, but it can request any number of Johnson S_B curves. Use the THETA= and SIGMA= S_B-options to specify θ and σ. By default, the procedure computes maximum likelihood estimates of θ and σ. Table 4.6 lists options you can specify with the SB option. See the section Johnson S_B Distribution for details.

SIGMA=value-list

specifies the parameter σ for the fitted density curve when you request the BETA, EXPONENTIAL, GAMMA, GUMBEL, LOGNORMAL, NORMAL, PARETO, POWER, RAYLEIGH, SB, SU, or WEIBULL options.

See Table 4.9 for a summary of how to use the SIGMA= option. You must enclose this option in parentheses after the density curve option. You can specify the value EST to request a maximum likelihood estimate for σ.

Table 4.9: Uses of the SIGMA= Option

| Distribution Keyword | SIGMA= Specifies | Default Value | Alias |
|---|---|---|---|
| BETA | scale parameter | 1 | SCALE= |
| EXPONENTIAL | scale parameter | maximum likelihood estimate | SCALE= |
| GAMMA | scale parameter | maximum likelihood estimate | SCALE= |
| GUMBEL | scale parameter | maximum likelihood estimate | |
| LOGNORMAL | shape parameter | maximum likelihood estimate | SHAPE= |
| NORMAL | scale parameter | standard deviation | |
| PARETO | scale parameter | 1 | |
| POWER | scale parameter | maximum likelihood estimate | |
| RAYLEIGH | scale parameter | maximum likelihood estimate | |
| SB | scale parameter | 1 | SCALE= |
| SU | scale parameter | percentile-based estimate | |
| WEIBULL | scale parameter | maximum likelihood estimate | SCALE= |

SU<(S_U-options)>

displays fitted Johnson S_U density curves on the histogram. The SU option can occur only once in a HISTOGRAM statement, but it can request any number of Johnson S_U curves. Use the THETA= and SIGMA= S_U-options to specify θ and σ. By default, the procedure computes maximum likelihood estimates of θ and σ. Table 4.6 lists options you can specify with the SU option. See the section Johnson S_U Distribution for details.

THETA=value-list
THRESHOLD=value-list

specifies the lower threshold parameter θ for curves requested with the BETA, EXPONENTIAL, GAMMA, LOGNORMAL, PARETO, POWER, RAYLEIGH, SB, SU, and WEIBULL options. Enclose the THETA= option in parentheses after the curve option. By default, THETA=0. If you specify the value EST, an estimate is computed for θ.
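
The following sketch shows the two ways of handling the threshold for a lognormal fit: leaving it at the default of 0 and requesting an estimate with EST. It assumes the Steel data set and the Length variable from the earlier example.

```sas
proc univariate data=Steel;
   histogram Length / lognormal;             /* threshold fixed at the default, THETA=0 */
   histogram Length / lognormal(theta=est);  /* threshold estimated from the data */
run;
```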

UPPER=value-list

specifies upper bounds for kernel density estimates requested with the KERNEL option. Enclose the UPPER= option in parentheses after the KERNEL option. If you specify more kernel estimates than upper bounds, the last upper bound is repeated for the remaining estimates. The default is a missing value, indicating no upper bounds for fitted kernel density curves.
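
A minimal sketch of a bounded kernel density estimate, assuming the Steel data set and the Length variable from the earlier example. The bound 12 is a hypothetical physical limit on Length, used only for illustration.

```sas
proc univariate data=Steel noprint;
   histogram Length / kernel(upper=12);  /* no density is estimated above 12 */
run;
```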

† VOFFSET=value

specifies the offset, in percentage screen units, at the upper end of the vertical axis.

VSCALE=COUNT | PERCENT | PROPORTION

specifies the scale of the vertical axis for a histogram. The value COUNT requests the data be scaled in units of the number of observations per data unit. The value PERCENT requests the data be scaled in units of percent of observations per data unit. The value PROPORTION requests the data be scaled in units of proportion of observations per data unit. The default is PERCENT.
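
For example, the following sketch switches the vertical axis from the default percent scale to counts. It assumes the Steel data set and the Length variable from the earlier example.

```sas
proc univariate data=Steel noprint;
   histogram Length / vscale=count;  /* vertical axis shows numbers of observations */
run;
```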

† WBARLINE=n

specifies the width of bar outlines when producing traditional graphics. The option does not apply to ODS Graphics output.

WEIBULL<(Weibull-options)>

displays fitted Weibull density curves on the histogram. The WEIBULL option can occur only once in a HISTOGRAM statement, but it can request any number of Weibull curves. The parameter θ must be less than the minimum data value. Use the THETA= Weibull-option to specify θ. By default, THETA=0. You can specify THETA=EST to request the maximum likelihood estimate for θ. Use the C= and SIGMA= Weibull-options to specify the shape parameter c and the scale parameter σ. By default, the procedure computes the maximum likelihood estimates for c and σ. Table 4.6 lists options you can specify with the WEIBULL option. See the section Weibull Distribution for details, and see Example 4.22 for an example that uses the WEIBULL option.

PROC UNIVARIATE calculates the maximum likelihood estimate iteratively by using the Newton-Raphson approximation. See also the C=, SIGMA=, and THETA= Weibull-options.
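
As a sketch, the following statements fit a Weibull curve with the threshold fixed at 0 and with the shape and scale parameters estimated, which matches the defaults described above but spells them out. Steel and Length come from the earlier example.

```sas
proc univariate data=Steel;
   histogram Length / weibull(theta=0 c=est sigma=est);
run;
```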

† WGRID=n

specifies the line thickness for the grid when producing traditional graphics. The option does not apply to ODS Graphics output.

ZETA=value-list

specifies a value for the scale parameter ζ for lognormal density curves requested with the LOGNORMAL option. Enclose the ZETA= lognormal-option in parentheses after the LOGNORMAL option. By default, or if you specify the value EST, the procedure calculates a maximum likelihood estimate for ζ. You can specify the SCALE= option as an alias for the ZETA= option.
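
The following sketch fixes the lognormal scale parameter at a chosen value while the shape parameter is estimated. It assumes the Steel data set and the Length variable from the earlier example; the value 2 for ZETA= is hypothetical.

```sas
proc univariate data=Steel;
   histogram Length / lognormal(zeta=2 sigma=est);  /* fixed scale, estimated shape */
run;
```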