
Producing Charts to Summarize Variables

Creating High-Resolution Histograms


Understanding How to Use the HISTOGRAM Statement

A histogram is similar to a vertical bar chart. This type of bar chart emphasizes the individual ranges of continuous numeric variables and enables you to examine the distribution of your data.

The HISTOGRAM statement in a PROC UNIVARIATE step produces histograms and comparative histograms. PROC UNIVARIATE creates a histogram by dividing the data into intervals of equal length, counting the number of observations in each interval, and plotting the counts as vertical bars that are centered around the midpoint of each interval.

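If you use the HISTOGRAM statement without any options, then PROC UNIVARIATE automatically selects the size of the intervals and the location of their midpoints, and scales the vertical axis to show the percentage of observations in each interval.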

The HISTOGRAM statement provides various options that enable you to control the layout of the histogram and enhance the graph. You can also fit families of density curves and superimpose kernel density estimates on the histograms, which can be useful in examining the data distribution. For additional information about the density curves that SAS computes, see the UNIVARIATE procedure in the Base SAS Procedures Guide.
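
As a minimal sketch of the curve options mentioned above, the following step overlays a fitted normal density curve and a kernel density estimate on a histogram. It assumes a data set named grades that contains the numeric variable ExamGrade1, as in the programs later in this section. NORMAL and KERNEL are HISTOGRAM statement options and are used here with their default settings.

```sas
/* Assumption: the grades data set already exists and contains ExamGrade1. */
proc univariate data=grades noprint;
   /* NORMAL fits a normal density curve to the data.            */
   /* KERNEL superimposes a kernel density estimate on the bars. */
   histogram ExamGrade1 / normal kernel;
   title 'Grades for First Chemistry Exam';
run;
```

Comparing the two curves is one informal way to judge whether a normal distribution is a reasonable description of the data.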

Understanding How to Use SAS/GRAPH to Create Histograms

If your site licenses SAS/GRAPH software, then you can use the HISTOGRAM statement to create high-resolution graphs. When you create charts with a graphics device, you can also use the AXIS, LEGEND, PATTERN, and SYMBOL statements to enhance your plots.

To control the appearance of a high-resolution graph, you can specify a GOPTIONS statement before the PROC step that creates the graph. The GOPTIONS statement changes the values of the graphics options that SAS uses when graphics output is created. Graphics options affect the characteristics of a graph, such as size, colors, type fonts, fill patterns, and line thickness. In addition, they affect the settings of device parameters such as the appearance of the display, the type of output that is produced, and the destination of the output.

Most of the examples in this section use the following GOPTIONS statement:

goptions reset=global
         gunit=pct
         hsize=5.625 in
         vsize=3.5 in
         htitle=4
         htext=3
         vorigin=0 in
         horigin=0 in
         cback=white
         border
         ctext=black
         colors=(black blue green red yellow)
         ftext=swiss
         lfactor=3;

For additional information about how to modify the appearance of your graphics output, see SAS/GRAPH: Reference.


Creating a Simple Histogram

The following program uses the HISTOGRAM statement to create a histogram for the numeric variable ExamGrade1:

proc univariate data=grades noprint;
   histogram ExamGrade1;
   title 'Grades for First Chemistry Exam';
run;

The NOPRINT option suppresses the tables of statistics that the PROC UNIVARIATE statement creates.

The following figure shows the histogram:

[Figure: Using a Histogram to Show Percentages]

The midpoint axis for the above histogram goes from 40 to 100 and is incremented in intervals of 10. The following table shows the values:

Interval     Midpoint
35 to 44     40
45 to 54     50
55 to 64     60
65 to 74     70
75 to 84     80
85 to 94     90
95 to 104    100

Note: Because PROC UNIVARIATE selects the size of the intervals and the location of their midpoints based on all values of the numeric variable, the highest and lowest intervals can extend beyond the values in the data. In this example, the lowest grade is 39, while the lowest interval extends from 35 to 44. Similarly, the highest grade is 98, while the highest interval extends from 95 to 104.
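
The lowest and highest grades cited in the note can be checked against the data. The following is a minimal sketch that assumes the same grades data set; it uses PROC MEANS, which is not otherwise part of this section.

```sas
/* Assumption: the grades data set already exists and contains ExamGrade1. */
proc means data=grades n min max;
   var ExamGrade1;   /* report the count, lowest grade, and highest grade */
   title 'Range of Grades for First Chemistry Exam';
run;
```

If the reported minimum and maximum fall inside the first and last intervals, then every observation is represented in the histogram.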


Changing the Axes of a Histogram


Enhancing the Vertical Axis

The exact value of a histogram bar is sometimes difficult to determine. By default, PROC UNIVARIATE does not provide minor tick marks between the vertical axis values (major tick marks). You can specify the number of minor tick marks between major tick marks with the VMINOR= option.

To make it easier to see the location of major tick marks, you can use the GRID option to add grid lines on the histogram. Grid lines are horizontal lines that are positioned at major tick marks on the vertical axis. PROC UNIVARIATE provides two options to change the appearance of the grid lines:

Action                                 Option
set the color of the grid lines        CGRID=
set the line type of the grid lines    LGRID=

By default, PROC UNIVARIATE draws a solid line using the first color in the device color list. For a list of the available line types, see SAS/GRAPH: Reference.

The following program creates a histogram that displays minor tick marks and grid lines for the numeric variable ExamGrade1:

proc univariate data=grades noprint;
   histogram ExamGrade1 / vminor=4 grid lgrid=34;
   title 'Grades for First Chemistry Exam';
run;

Four minor tick marks are inserted between each major tick mark. Narrowly spaced dots are used to draw the grid lines.

The following figure shows the histogram:

[Figure: Specifying Grid Lines for a Histogram]

Now, the height of each histogram bar is easily determined from the chart. The following table shows the percentage that each interval represents:

Interval     Percent
35 to 44     6
45 to 54     12
55 to 64     28
65 to 74     10
75 to 84     22
85 to 94     20
95 to 104    2
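
The CGRID= option is described above but is not used in the preceding program. The following is a minimal sketch that combines it with LGRID=. It assumes the same grades data set and assumes that blue is available on the graphics device (blue appears in the COLORS= list of the GOPTIONS statement shown earlier).

```sas
/* Assumption: the grades data set already exists and contains ExamGrade1. */
proc univariate data=grades noprint;
   /* GRID draws grid lines at the major tick marks;   */
   /* CGRID= sets their color and LGRID= their type.   */
   histogram ExamGrade1 / vminor=4 grid cgrid=blue lgrid=34;
   title 'Grades for First Chemistry Exam';
run;
```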


Specifying the Vertical Axis Values

PROC UNIVARIATE enables you to specify what the bars in the histogram represent, and the values of the vertical axis. By default, each bar represents the percentage of observations that fall into the given interval.

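The VSCALE= option enables you to specify one of the following scales for the vertical axis: COUNT (the number of observations in each interval), PERCENT (the percentage of observations in each interval, which is the default), or PROPORTION (the proportion of observations in each interval).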

The VAXIS= option enables you to specify evenly spaced tick mark values for the vertical axis. The form of this option is

HISTOGRAM variable / VAXIS=value-list;
where value-list is a list of numbers to use as major tick mark values. The first value must be zero, and the last value must be greater than or equal to the height of the largest bar.

The following program creates a histogram that displays counts on the vertical axis for the numeric variable ExamGrade1:

proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=count vaxis=0 to 16 by 2 vminor=1;
   title 'Grades for First Chemistry Exam';
run;

The values of the vertical axis range from 0 to 16 in increments of two. One minor tick mark is inserted between each major tick mark.

The following figure shows the histogram:

[Figure: Using a Histogram to Show Counts]
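
The same pattern applies to the other vertical scales. The following is a minimal sketch that uses VSCALE=PROPORTION and assumes the same grades data set. The upper limit of 0.3 is chosen because the tallest bar in the percentage table shown earlier is 28 percent; a different data set would need a different value list.

```sas
/* Assumption: the grades data set already exists and contains ExamGrade1. */
proc univariate data=grades noprint;
   /* Bars show proportions (0 to 1) instead of percentages or counts.   */
   /* The VAXIS= list starts at zero and ends above the tallest bar.     */
   histogram ExamGrade1 / vscale=proportion vaxis=0 to 0.3 by 0.05;
   title 'Grades for First Chemistry Exam';
run;
```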


Specifying the Midpoints of a Histogram

You can control the width of the histogram bars by using the MIDPOINTS= option. PROC UNIVARIATE uses the value of the midpoints to determine the width of the histogram bars. The difference between consecutive midpoints is the bar width.

To specify midpoints, use the MIDPOINTS= option in the HISTOGRAM statement. The form of the MIDPOINTS= option is

HISTOGRAM variable / MIDPOINTS=midpoint-list;
where midpoint-list is a list of numbers to use as midpoints. You must use evenly spaced midpoints that are listed in increasing order.

For example, to specify the traditional grading ranges with midpoints from 55 to 95, use the following option:

midpoints=55 65 75 85 95

Or, you can abbreviate this list of midpoints:

midpoints=55 to 95 by 10

The following program uses the MIDPOINTS= option to create a histogram for the numeric variable ExamGrade1:

proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=count vaxis=0 to 16 by 2 vminor=1
                          midpoints=55 65 75 85 95  /* [1] */
                          hoffset=10                /* [2] */
                          vaxislabel='Frequency';   /* [3] */
   title 'Grades for First Chemistry Exam';
run;

The following list corresponds to the numbered items in the preceding program:

[1] The MIDPOINTS= option forces PROC UNIVARIATE to center the five bars around the traditional midpoints for exam grades.

[2] The HOFFSET= option uses a 10 percent offset at both ends of the horizontal axis.

[3] The VAXISLABEL= option uses Frequency as the label for the vertical axis. The default label is Count.

The following figure shows the histogram:

[Figure: Specifying Five Midpoints for a Histogram]

The midpoint axis for the above histogram goes from 55 to 95 and is incremented in intervals of 10. The histogram excludes any exam scores that are below 50.
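
Because the histogram leaves out scores below 50, it can be useful to know how many observations are affected. The following is a minimal sketch that assumes the same grades data set; it uses PROC MEANS with a WHERE statement, which is not otherwise part of this section.

```sas
/* Assumption: the grades data set already exists and contains ExamGrade1. */
proc means data=grades n;
   /* Keep only scores below the lower edge of the first bar.            */
   /* Missing values also satisfy this condition in SAS, but the N       */
   /* statistic counts nonmissing values only, so they are not counted.  */
   where ExamGrade1 < 50;
   var ExamGrade1;
   title 'Number of Exam Scores Below 50';
run;
```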

Displaying Summary Statistics in a Histogram


Understanding How to Use the INSET Statement

PROC UNIVARIATE enables you to add a box or table of summary statistics, called an inset, directly in the histogram. Typically, an inset displays statistics that PROC UNIVARIATE has calculated, but an inset can also display values that you provide in a SAS data set.

To add a table of summary statistics, use the INSET statement. You can use multiple INSET statements in the UNIVARIATE procedure to add more than one table to a histogram. The INSET statements must follow the HISTOGRAM statement that creates the plot that you want augmented. The inset appears in all the graphs that the preceding HISTOGRAM statement produces.

The form of the INSET statement is as follows:

INSET <keyword(s)> </ option(s)>;
You specify the keywords for inset statistics (such as N, MIN, MAX, MEAN, and STD) immediately after the word INSET. You can also specify the keyword DATA= followed by the name of a SAS data set to display customized statistics that are stored in a SAS data set. The statistics will appear in the order in which you specify the keywords.
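
The DATA= keyword can be sketched as follows. This example assumes that the inset data set supplies its entries in two variables named _LABEL_ and _VALUE_, one observation per inset line; confirm these names in the Base SAS Procedures Guide before relying on them. The data set name examinfo and its contents are hypothetical, and the grades data set is assumed to exist.

```sas
/* Hypothetical data set of custom inset entries.                      */
/* Assumption: the INSET statement reads _LABEL_ and _VALUE_ from it.  */
data examinfo;
   length _LABEL_ $ 12 _VALUE_ $ 12;
   _LABEL_ = 'Course'; _VALUE_ = 'Chemistry'; output;
   _LABEL_ = 'Exam';   _VALUE_ = 'First';     output;
run;

proc univariate data=grades noprint;
   histogram ExamGrade1;
   /* Custom entries appear first, followed by the N and MEAN statistics. */
   inset data=examinfo n mean;
   title 'Grades for First Chemistry Exam';
run;
```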

By default, PROC UNIVARIATE uses appropriate labels and appropriate formats to display the statistics in the inset. To customize a label, specify the keyword followed by an equal sign (=) and the desired label in quotation marks. To customize the format, specify a numeric format in parentheses after the keyword. You can assign labels that are up to 24 characters. If you specify both a label and a format for a keyword, then the label must appear before the format. For example,

inset n='Sample Size' std='Std Dev' (5.2);

requests customized labels for two statistics (sample size and standard deviation). The standard deviation is also assigned a format that has a field width of five and includes two decimal places.

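Various options enable you to customize the appearance of the inset. For example, you can specify heading text with the HEADER= option, position the inset with the POSITION= option, apply one format to all the statistics with the FORMAT= option, and suppress the frame with the NOFRAME option.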

For a complete list of the keywords and the options that you can use in the INSET statement, see the Base SAS Procedures Guide.

The Program

The following program uses the INSET statement to add summary statistics for the numeric variable ExamGrade1 to the histogram:

proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=count vaxis=0 to 16 by 2 vminor=1 hoffset=10
                          midpoints=55 65 75 85 95 vaxislabel='Frequency';
   inset n='No. Students' mean='Mean Grade' min='Lowest Grade'  /* [1] */
         max='Highest Grade' / header='Summary Statistics'      /* [2] */
                               position=ne                      /* [3] */
                               format=3.;                       /* [4] */
   title 'Grade Distribution for the First Chemistry Exam';
run;

The following list corresponds to the numbered items in the preceding program:

[1] The statistical keywords N, MEAN, MIN, and MAX specify that the number of observations, the mean exam grade, the minimum exam grade, and the maximum exam grade appear in the inset. Each keyword is assigned a customized label to identify the statistic in the inset.

[2] The HEADER= option specifies the heading text that appears at the top of the inset.

[3] The POSITION= option uses a compass point to position the inset. The table will appear at the northeast corner of the histogram.

[4] The FORMAT= option requests a format with a field width of three for all the statistics in the inset.

The following figure shows the histogram:

[Figure: Adding an Inset to a Histogram]

The histogram shows the data distribution. The table of summary statistics in the upper-right corner of the histogram provides information about the sample size, the mean grade, the lowest value, and the highest value.

Creating a Comparative Histogram


Understanding Comparative Histograms

A comparative histogram is a series of component histograms that are arranged as an array or a matrix. PROC UNIVARIATE uses uniform horizontal and vertical axes to display the component histograms. This enables you to use the comparative histogram to visually compare the distribution of a numeric variable across the levels of up to two classification variables.

You use the CLASS statement with a HISTOGRAM statement to create either a one-way or a two-way comparative histogram. The form of the CLASS statement is as follows:

CLASS variable-1<(variable-option(s))> <variable-2<(variable-option(s))>> </ options>;
Class variables can be numeric or character. Class variables can have continuous values, but they typically have a few discrete values that define levels of the variable. You can reduce the number of classification levels by using a FORMAT statement to combine the values of a class variable.

When you specify one class variable, PROC UNIVARIATE displays an array of component histograms (stacked or side-by-side). To create the one-way comparative histogram, PROC UNIVARIATE categorizes the values of the analysis variable by the formatted values (levels) of the class variable. Each classification level generates a separate histogram.
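
A one-way comparative histogram that also uses a format to combine class levels might look like the following minimal sketch. It assumes the grades data set with a character variable Section whose values are Mon, Wed, and Fri; the format name $daygrp and the grouping are hypothetical.

```sas
/* Hypothetical format that combines two sections into one level. */
proc format;
   value $daygrp 'Mon', 'Wed' = 'Mon/Wed'
                 'Fri'        = 'Fri';
run;

/* Assumption: the grades data set already exists and contains */
/* ExamGrade1 and Section.                                     */
proc univariate data=grades noprint;
   class Section;                      /* one class variable: one-way layout */
   histogram ExamGrade1 / midpoints=45 to 95 by 10 vscale=count;
   format Section $daygrp.;            /* levels are the formatted values    */
   title 'Grade Distribution by Section Group';
run;
```

Because the levels are formed from the formatted values, this step produces two component histograms rather than three.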

When you specify two class variables, PROC UNIVARIATE displays a matrix of component plots. To create the two-way comparative histogram, PROC UNIVARIATE categorizes the values of the analysis variable by the cross-classified values (levels) of the class variables. Each combination of the cross-classified levels generates a separate histogram. The levels of class variable-1 are the labels for the rows of the matrix, and the levels of class variable-2 are the labels for the columns of the matrix.

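You can specify options in the HISTOGRAM statement to customize the appearance of the comparative histogram. For example, you can set the number of rows and columns with the NROWS= and NCOLS= options, and you can set fill colors with the CFRAME=, CFRAMESIDE=, CFRAMETOP=, and CFILL= options.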

For a complete list of the keywords and the options that you can use in the HISTOGRAM statement, see the Base SAS Procedures Guide.

The Program

The following program uses the CLASS statement to create a comparative histogram by gender and section for the numeric variable ExamGrade1:

proc format;
   value $gendfmt 'M'='Male'
                  'F'='Female';                                /* [1] */
run;

proc univariate data=grades noprint;
   class Gender                                                /* [2] */
         Section(order=data);                                  /* [3] */
   histogram ExamGrade1 / midpoints=45 to 95 by 10 vscale=count vaxis=0 to 6 by 2
                          vaxislabel='Frequency' turnvlabels   /* [4] */
                          nrows=2 ncols=3                      /* [5] */
                          cframe=ligr                          /* [6] */
                          cframeside=gwh cframetop=gwh         /* [6] */
                          cfill=gwh;                           /* [7] */
   inset mean(4.1) n / noframe                                 /* [8] */
                       position=(2,65);                        /* [9] */
   format Gender $gendfmt.;                                    /* [1] */
   title 'Grade Distribution for the First Chemistry Exam';
run;

The following list corresponds to the numbered items in the preceding program:

[1] PROC FORMAT creates a user-written format that will label Gender with a character string. The FORMAT statement assigns the format to Gender.

[2] The CLASS statement creates a two-way comparative histogram that uses Gender and Section as the classification variables. PROC UNIVARIATE produces a component histogram for each level (a distinct combination of values) of these variables.

[3] The ORDER= option positions the values of Section according to their order in the input data set. The comparative histogram displays the levels of Section according to the days of the week (Mon, Wed, and Fri). The default order of the levels is determined by sorting the internal values of Section (Fri, Mon, and Wed).

[4] The TURNVLABELS option turns the characters in the vertical axis labels so that they display vertically instead of horizontally.

[5] The NROWS= option and the NCOLS= option specify a 2 × 3 arrangement for the component histograms.

[6] The CFRAME= option specifies the color that fills the area of each component histogram that is enclosed by the axes and the frame. The CFRAMESIDE= option and the CFRAMETOP= option specify the colors that fill the frame areas for the row labels that appear down the side and the column labels that appear across the top of the comparative histogram. By default, these areas are not filled.

[7] The CFILL= option specifies the color to fill the bars of each component histogram. By default, the bars are not filled.

[8] The NOFRAME option suppresses the frame around the inset table.

[9] The POSITION= option uses axis percentage coordinates to position the inset. The position of the bottom-left corner of the inset is 2% of the way across the horizontal axis and 65% of the way up the vertical axis.

The following figure shows the comparative histogram:

[Figure: Using a Comparative Histogram to Examine Exam Grades by Gender and Section]

The comparative histogram is a 2 × 3 matrix of component histograms for each combination of Section and Gender. Each component histogram displays a table of statistics that reports the mean of ExamGrade1 and the number of students. You can easily see that both females and males in the Friday section earned higher grades than their counterparts in the other sections.
