Producing Charts to Summarize Variables
Understanding How to Use the HISTOGRAM Statement
A histogram is similar to a vertical bar chart. This type of bar chart emphasizes the individual ranges of continuous numeric variables and enables you to examine the distribution of your data.
The HISTOGRAM statement in a PROC UNIVARIATE step produces histograms and comparative histograms. PROC UNIVARIATE creates a histogram by dividing the data into intervals of equal length, counting the number of observations in each interval, and plotting the counts as vertical bars that are centered around the midpoint of each interval.
If you use the HISTOGRAM statement without any options, then PROC UNIVARIATE automatically does the following:
scales the vertical axis to show the percentage of observations in an interval
determines the bar width based on the method of Terrell and Scott (1985)
Understanding How to Use SAS/GRAPH to Create Histograms
If your site licenses SAS/GRAPH software, then you can use the HISTOGRAM statement to create high-resolution graphs. When you create charts with a graphics device, you can also use the AXIS, LEGEND, PATTERN, and SYMBOL statements to enhance your plots.
To control the appearance of a high-resolution graph, you can specify a GOPTIONS statement before the PROC step that creates the graph. The GOPTIONS statement changes the values of the graphics options that SAS uses when graphics output is created. Graphics options affect the characteristics of a graph, such as size, colors, type fonts, fill patterns, and line thickness. In addition, they affect the settings of device parameters such as the appearance of the display, the type of output that is produced, and the destination of the output.
Most of the examples in this section use the following GOPTIONS statement:
goptions reset=global
         gunit=pct
         hsize=5.625 in
         vsize=3.5 in
         htitle=4
         htext=3
         vorigin=0 in
         horigin=0 in
         cback=white border
         ctext=black
         colors=(black blue green red yellow)
         ftext=swiss
         lfactor=3;
For additional information about how to modify the appearance of your graphics output, see SAS/GRAPH: Reference.
Creating a Simple Histogram
The following program uses the HISTOGRAM statement to create a histogram for the numeric variable ExamGrade1:
proc univariate data=grades noprint;
   histogram ExamGrade1;
   title 'Grades for First Chemistry Exam';
run;
The NOPRINT option suppresses the tables of statistics that the PROC UNIVARIATE statement creates.
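The programs in this section read a data set named grades that is not shown here. To try them, a stand-in data set is needed. The following DATA step is a minimal sketch of one: the variable names come from the programs in this section, but the Section values (Mon, Wed, Fri) and every data value are invented for illustration. They are not the data behind the figures.

```sas
/* Hypothetical sample data: all values below are made up for      */
/* illustration and do not reproduce the figures in this section.  */
data grades;
   input Gender $ Section $ ExamGrade1;
   datalines;
M Mon 78
F Mon 86
M Wed 64
F Wed 71
M Fri 92
F Fri 88
M Mon 55
F Wed 39
M Fri 98
F Mon 67
;
run;
```

With so few observations the resulting histograms will look different from the figures shown here, but each program should run against this data set.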
The following figure shows the histogram:
Using a Histogram to Show Percentages
The midpoint axis for the above histogram goes from 40 to 100 and is incremented in intervals of 10. The following table shows the values:
Interval | Midpoint |
---|---|
35 to 44 | 40 |
45 to 54 | 50 |
55 to 64 | 60 |
65 to 74 | 70 |
75 to 84 | 80 |
85 to 94 | 90 |
95 to 104 | 100 |
Note: Because PROC UNIVARIATE selects the size of the intervals and the location of their midpoints based on all values of the numeric variable, the highest and lowest intervals can extend beyond the values in the data. In this example the lowest grade is 39 while the lowest interval extends from 35 to 44. Similarly, the highest grade is 98 while the highest interval extends from 95 to 104.
Changing the Axes of a Histogram
The exact value of a histogram bar is sometimes difficult to determine. By default, PROC UNIVARIATE does not provide minor tick marks between the vertical axis values (major tick marks). You can specify the number of minor tick marks between major tick marks with the VMINOR= option.
To make it easier to see the location of major tick marks, you can use the GRID option to add grid lines on the histogram. Grid lines are horizontal lines that are positioned at major tick marks on the vertical axis. PROC UNIVARIATE provides two options to change the appearance of the grid line:
Action | Option |
---|---|
set the color of the grid lines | CGRID= |
set the line type of the grid lines | LGRID= |
By default, PROC UNIVARIATE draws a solid line using the first color in the device color list. For a list of the available line types, see SAS/GRAPH: Reference.
The following program creates a histogram that displays minor tick marks and grid lines for the numeric variable ExamGrade1:
proc univariate data=grades noprint;
   histogram ExamGrade1 / vminor=4
                          grid
                          lgrid=34;
   title 'Grades for First Chemistry Exam';
run;
Four minor tick marks are inserted between each major tick mark. Narrowly spaced dots are used to draw the grid lines.
The following figure shows the histogram:
Specifying Grid Lines for a Histogram
Now, the height of each histogram bar is easily determined from the chart. The following table shows the percentage each interval represents:
Interval | Percent |
---|---|
35 to 44 | 6 |
45 to 54 | 12 |
55 to 64 | 28 |
65 to 74 | 10 |
75 to 84 | 22 |
85 to 94 | 20 |
95 to 104 | 2 |
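The grid-line program above sets only the line type. As a variation, the following sketch also sets the grid color with the CGRID= option. The color blue and line type 2 are arbitrary choices made for illustration, and the program assumes the same grades data set that is used throughout this section.

```sas
proc univariate data=grades noprint;
   histogram ExamGrade1 / vminor=4
                          grid
                          cgrid=blue   /* grid color: arbitrary choice     */
                          lgrid=2;     /* grid line type: arbitrary choice */
   title 'Grades for First Chemistry Exam';
run;
```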
PROC UNIVARIATE enables you to specify what the bars in the histogram represent, and the values of the vertical axis. By default, each bar represents the percentage of observations that fall into the given interval.
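The VSCALE= option enables you to specify the following scales for the vertical axis: COUNT (the number of observations), PERCENT (the percentage of observations, which is the default), and PROPORTION (the proportion of observations).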
The VAXIS= option enables you to specify evenly spaced tick mark values for the vertical axis. The form of this option is
HISTOGRAM variable / VAXIS=value-list;
The following program creates a histogram that displays counts on the vertical axis for the numeric variable ExamGrade1:
proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=count
                          vaxis=0 to 16 by 2
                          vminor=1;
   title 'Grades for First Chemistry Exam';
run;
The values of the vertical axis range from 0 to 16 in increments of two. One minor tick mark is inserted between each major tick mark.
The following figure shows the histogram:
Using a Histogram to Show Counts
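The same pattern applies to the other vertical scales. The following sketch uses VSCALE=PROPORTION, so each bar shows the fraction of observations in its interval. The VAXIS= range of 0 to 0.3 is an assumption chosen for illustration; in practice the list should extend at least as high as the tallest bar.

```sas
proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=proportion
                          vaxis=0 to 0.3 by 0.1;  /* assumed range */
   title 'Grades for First Chemistry Exam';
run;
```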
You can control the width of the histogram bars by using the MIDPOINTS= option. PROC UNIVARIATE uses the value of the midpoints to determine the width of the histogram bars. The difference between consecutive midpoints is the bar width.
To specify midpoints, use the MIDPOINTS= option in the HISTOGRAM statement. The form of the MIDPOINTS= option is
HISTOGRAM variable / MIDPOINTS=midpoint-list;
For example, to specify the traditional grading ranges with midpoints from 55 to 95, use the following option:
midpoints=55 65 75 85 95
Or, you can abbreviate this list of midpoints:
midpoints=55 to 95 by 10
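To see which range of grades each bar covers, you can work out the interval boundaries from the midpoints. The following DATA step is a small sketch of that arithmetic. It assumes that each bar extends half a bar width on either side of its midpoint, and it writes one line per bar to the SAS log. The variable names (width, midpoint, lower, upper) are chosen here for illustration.

```sas
data _null_;
   width = 10;                        /* difference between consecutive midpoints */
   do midpoint = 55 to 95 by width;
      lower = midpoint - width / 2;   /* left edge of the bar  */
      upper = midpoint + width / 2;   /* right edge of the bar */
      put midpoint= lower= upper=;
   end;
run;
```

Under this assumption the first bar starts at 50, so lower grades fall outside the range that the five bars cover.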
The following program uses the MIDPOINTS= option to create a histogram for the numeric variable ExamGrade1:
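proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=count
                          vaxis=0 to 16 by 2
                          vminor=1
                          midpoints=55 65 75 85 95   /* 1 */
                          hoffset=10                 /* 2 */
                          vaxislabel='Frequency';    /* 3 */
   title 'Grades for First Chemistry Exam';
run;
The following list corresponds to the numbered items in the preceding program:
1 The MIDPOINTS= option forces the histogram to have five bars that are centered at 55, 65, 75, 85, and 95.
2 The HOFFSET= option specifies a 10 percent offset at both ends of the horizontal axis.
3 The VAXISLABEL= option specifies Frequency as the label for the vertical axis.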
The following figure shows the histogram:
Specifying Five Midpoints for a Histogram
The midpoint axis for the above histogram goes from 55 to 95 and is incremented in intervals of 10. The histogram excludes any exam scores that are below 50.
Displaying Summary Statistics in a Histogram
PROC UNIVARIATE enables you to add a box or table of summary statistics, called an inset, directly in the histogram. Typically, an inset displays statistics that PROC UNIVARIATE has calculated, but an inset can also display values that you provide in a SAS data set.
To add a table of summary statistics, use the INSET statement. You can use multiple INSET statements in the UNIVARIATE procedure to add more than one table to a histogram. The INSET statements must follow the HISTOGRAM statement that creates the plot that you want augmented. The inset appears in all the graphs that the preceding HISTOGRAM statement produces.
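As a sketch of the multiple-inset case, the following program places two INSET statements after one HISTOGRAM statement. The choice of statistics and of the POSITION= values is arbitrary and is made only for illustration. The program assumes the grades data set that is used throughout this section.

```sas
proc univariate data=grades noprint;
   histogram ExamGrade1;
   inset n mean  / position=nw;   /* first inset, upper-left corner   */
   inset min max / position=ne;   /* second inset, upper-right corner */
   title 'Grades for First Chemistry Exam';
run;
```

Both INSET statements follow the HISTOGRAM statement, so both tables are added to the histogram that it produces.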
The form of the INSET statement is as follows:
INSET <keyword(s)> </ option(s)>;
By default, PROC UNIVARIATE uses appropriate labels and appropriate formats to display the statistics in the inset. To customize a label, specify the keyword followed by an equal sign (=) and the desired label in quotation marks. To customize the format, specify a numeric format in parentheses after the keyword. You can assign labels that are up to 24 characters. If you specify both a label and a format for a keyword, then the label must appear before the format. For example,
inset n='Sample Size' std='Std Dev' (5.2);
requests customized labels for two statistics (sample size and standard deviation). The standard deviation is also assigned a format that has a field width of five and includes two decimal places.
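For context, the following sketch places that INSET statement in a complete PROC UNIVARIATE step. It assumes the grades data set and the ExamGrade1 variable that are used elsewhere in this section.

```sas
proc univariate data=grades noprint;
   histogram ExamGrade1;
   /* Label comes first, then the format in parentheses */
   inset n='Sample Size' std='Std Dev' (5.2);
run;
```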
Various options enable you to customize the appearance of the inset. For example, you can do the following:
Specify graphical enhancements, such as background colors, text colors, text height, text font, and drop shadows.
The following program uses the INSET statement to add summary statistics for the numeric variable ExamGrade1 to the histogram:
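proc univariate data=grades noprint;
   histogram ExamGrade1 / vscale=count
                          vaxis=0 to 16 by 2
                          vminor=1
                          hoffset=10
                          midpoints=55 65 75 85 95
                          vaxislabel='Frequency';
   inset n='No. Students'
         mean='Mean Grade'
         min='Lowest Grade'               /* 1 */
         max='Highest Grade' /
         header='Summary Statistics'      /* 2 */
         position=ne                      /* 3 */
         format=3.;                       /* 4 */
   title 'Grade Distribution for the First Chemistry Exam';
run;
The following list corresponds to the numbered items in the preceding program:
1 The statistical keywords N, MEAN, MIN, and MAX specify that the number of observations, the mean exam grade, the minimum exam grade, and the maximum exam grade appear in the inset. Each keyword is assigned a customized label.
2 The HEADER= option specifies the heading text that appears at the top of the inset.
3 The POSITION= option uses a compass point to position the inset. The table appears at the northeast corner of the histogram.
4 The FORMAT= option requests a format with a field width of three for all the statistics in the inset.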
The following figure shows the histogram:
Adding an Inset to a Histogram
The histogram shows the data distribution. The table of summary statistics in the upper-right corner of the histogram provides information about the sample size, the mean grade, the lowest value, and the highest value.
Creating a Comparative Histogram
A comparative histogram is a series of component histograms that are arranged as an array or a matrix. PROC UNIVARIATE uses uniform horizontal and vertical axes to display the component histograms. This enables you to use the comparative histogram to visually compare the distribution of a numeric variable across the levels of up to two classification variables.
You use the CLASS statement with a HISTOGRAM statement to create either a one-way or a two-way comparative histogram. The form of the CLASS statement is as follows:
CLASS variable-1<(variable-option(s))> <variable-2<(variable-option(s))>> </ options>;
When you specify one class variable, PROC UNIVARIATE displays an array of component histograms (stacked or side-by-side). To create the one-way comparative histogram, PROC UNIVARIATE categorizes the values of the analysis variable by the formatted values (levels) of the class variable. Each classification level generates a separate histogram.
When you specify two class variables, PROC UNIVARIATE displays a matrix of component plots. To create the two-way comparative histogram, PROC UNIVARIATE categorizes the values of the analysis variable by the cross-classified values (levels) of the class variables. Each combination of the cross-classified levels generates a separate histogram. The levels of class variable-1 are the labels for the rows of the matrix, and the levels of class variable-2 are the labels for the columns of the matrix.
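For the one-way case, a minimal sketch might look like the following. It assumes that Section in the grades data set has three levels and uses NROWS=3 to stack the three component histograms in a single column. Both the number of levels and the choice of NROWS= are assumptions made for illustration.

```sas
proc univariate data=grades noprint;
   class Section;                     /* one class variable: one-way comparison */
   histogram ExamGrade1 / midpoints=45 to 95 by 10
                          vscale=count
                          nrows=3;    /* assumes three levels of Section */
   title 'Grade Distribution by Section';
run;
```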
You can specify options in the HISTOGRAM statement to customize the appearance of the comparative histogram. For example, you can do the following:
Specify the number of columns for the comparative histogram.
Specify graphical enhancements, such as background colors and text colors for the labels.
The following program uses the CLASS statement to create a comparative histogram by gender and section for the numeric variable ExamGrade1:
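proc format;
   value $gendfmt 'M'='Male'
                  'F'='Female';                    /* 1 */
run;

proc univariate data=grades noprint;
   class Gender                                    /* 2 */
         Section(order=data);                      /* 3 */
   histogram ExamGrade1 / midpoints=45 to 95 by 10
                          vscale=count
                          vaxis=0 to 6 by 2
                          vaxislabel='Frequency'
                          turnvlabels              /* 4 */
                          nrows=2 ncols=3          /* 5 */
                          cframe=ligr              /* 6 */
                          cframeside=gwh
                          cframetop=gwh
                          cfill=gwh;               /* 7 */
   inset mean(4.1) n / noframe                     /* 8 */
                       position=(2,65);            /* 9 */
   format Gender $gendfmt.;                        /* 1 */
   title 'Grade Distribution for the First Chemistry Exam';
run;
The following list corresponds to the numbered items in the preceding program:
1 PROC FORMAT creates a user-written format that labels the values of Gender with a character string. The FORMAT statement assigns the format to Gender.
2 The CLASS statement creates a two-way comparative histogram that uses Gender and Section as the classification variables.
3 The ORDER= option positions the values of Section according to their order in the input data set.
4 The TURNVLABELS option turns the characters in the vertical axis labels so that they display vertically.
5 The NROWS= option and the NCOLS= option specify a 2 × 3 arrangement for the component histograms.
6 The CFRAME= option specifies the color for the area that is enclosed by the axes and the frame.
7 The CFRAMESIDE= option and the CFRAMETOP= option specify the fill color of the frame areas for the row labels and the column labels. The CFILL= option specifies the color that fills the bars.
8 The NOFRAME option suppresses the frame around the inset.
9 The POSITION= option uses axis percentage coordinates to position the inset. The bottom-left corner of the inset is 2% of the way across the horizontal axis and 65% of the way up the vertical axis.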
The following figure shows the comparative histogram:
Using a Comparative Histogram to Examine Exam Grades by Gender and Section
The comparative histogram is a 2 × 3 matrix of component histograms for each combination of Section and Gender. Each component histogram displays a table of statistics that reports the mean of ExamGrade1 and the number of students. You can easily see that both females and males in the Friday section earned higher grades than their counterparts in the other sections.
Copyright © 2012 by SAS Institute Inc., Cary, NC, USA. All rights reserved.