A bivariate
histogram shows the distribution of data for two continuous numeric
variables. In the following graph, the X axis displays HEIGHT values
and the Y axis displays WEIGHT values. The Z axis represents the frequency
count of observations. The Z values could be some other measure (for
example, percentage of observations), but they can never be negative.
As with a standard histogram, the X and Y variables in the bivariate
histogram have been uniformly binned, which means that their data ranges
have been divided into equal-sized intervals (bins), and that each
observation is assigned to exactly one of the resulting bin combinations.
The BIHISTOGRAM3DPARM statement,
which produced this plot, does not perform any binning computation
on the input columns. Thus, you must pre-bin the data. In the following
example, the binning is done with PROC KDE (part of the
SAS/STAT product).
proc kde data=sashelp.heart;
  bivar height(ngrid=8) weight(ngrid=10) /
        out=kde(keep=value1 value2 count) noprint plots=none;
run;
In this program, the NGRID= option sets the number of bins to create for
each variable. The default for NGRID is 60. The binned values for HEIGHT
are stored in VALUE1, and the binned values for WEIGHT are stored
in VALUE2. This selection of bins produces one observation for each
of the 80 (8 x 10) bin combinations. The frequency count for each bin
combination is stored in the COUNT variable of the output data set.
Notice that when you form the grid by choosing the number of bins, the
resulting bin widths (about 3.5 for HEIGHT and about 26 for WEIGHT) are
usually not integers.
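As a rough check on those widths, the spacing between adjacent grid values
can be recovered from the KDE output data set. The following is a minimal
sketch, not part of the original example. It assumes that the KDE data set
created above is still available and that the grid values in VALUE1 and
VALUE2 are equally spaced, so that the spacing is the range divided by one
less than the number of distinct values. The column names HeightBinWidth
and WeightBinWidth are chosen here for illustration.

```sas
proc sql;
  /* spacing = (largest grid value - smallest) / (number of grid values - 1) */
  select (max(value1) - min(value1)) / (count(distinct value1) - 1)
           as HeightBinWidth,
         (max(value2) - min(value2)) / (count(distinct value2) - 1)
           as WeightBinWidth
  from kde;
quit;
```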
The following template definition displays this data. By default, the
BINAXIS=TRUE setting requests that the X and Y axes show tick values at
bin boundaries. Also by default, XVALUES=MIDPOINTS and YVALUES=MIDPOINTS,
which means that the X and Y columns represent midpoint values rather
than lower bin boundaries (LEFTPOINTS) or upper bin boundaries
(RIGHTPOINTS). Because the graph is small, not all of the bins can be
labeled without collision, so the ticks and tick values were thinned.
To simplify the axis tick values, the non-integer bin values are
displayed as integers (TICKVALUEFORMAT=5.). DISPLAY=ALL means "show
outlined, filled bins."
proc template;
  define statgraph bihistogram1a;
    begingraph;
      entrytitle "Distribution of Height and Weight";
      entryfootnote halign=right "SASHELP.HEART";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on)
          xaxisopts=(linearopts=(tickvalueformat=5.))
          yaxisopts=(linearopts=(tickvalueformat=5.));
        bihistogram3dparm x=value1 y=value2 z=count /
          display=all;
      endlayout;
    endgraph;
  end;
run;
proc sgrender data=kde template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
Eliminating Bins that Have No Data
Notice that the bins with a frequency of 0 (there are several) are
included in the plot. If you want to eliminate the bins that contain no
data, you can render a subset of the data. The subset makes it a bit
clearer where there are bins with small frequency counts versus portions
of the grid with no data.
proc sgrender data=kde template=bihistogram1a;
  where count > 0;
  label value1="Height" value2="Weight";
run;
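Before subsetting, it can be useful to see how many bin combinations are
actually empty. The following sketch is an addition to the original
example. It assumes the KDE data set produced by the PROC KDE step above
and uses the N statistic of PROC MEANS to count the rows that satisfy the
WHERE condition.

```sas
/* N reports the number of bin combinations whose COUNT is 0 */
proc means data=kde n;
  where count = 0;
  var count;
run;
```

Comparing this number with the 80 bin combinations in the grid gives a
sense of how much of the grid the WHERE subset removes.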
Displaying Percentages on Z Axis
To display the percentage of observations on the Z axis instead of the
actual count, you need to perform an additional data transformation that
converts the counts to percentages.
proc kde data=sashelp.heart;
  bivar height(ngrid=8) weight(ngrid=10) /
        out=kde(keep=value1 value2 count) noprint plots=none;
run;
data kde;
  /* On the first iteration, read every COUNT value to accumulate the total */
  if _n_ = 1 then do i=1 to rows;
    set kde(keep=count) point=i nobs=rows;
    TotalObs+count;
  end;
  /* Then read the data sequentially and convert each count to a percentage */
  set kde;
  Count=100*(Count/TotalObs);
  label Count="Percent";
run;
proc sgrender data=kde template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
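The same conversion can also be written as a single PROC SQL step. This is
a sketch of an alternative, not the method used above. It is meant to be
run against the output of the PROC KDE step, relies on PROC SQL remerging
the SUM(COUNT) total with each row (SAS writes a note about remerging to
the log), and writes to a hypothetical data set named KDE_PCT so that the
input data set is left unchanged.

```sas
proc sql;
  /* SUM(count) is computed over all rows and remerged with each row */
  create table kde_pct as
    select value1,
           value2,
           100 * count / sum(count) as Count label="Percent"
    from kde;
quit;

proc sgrender data=kde_pct template=bihistogram1a;
  label value1="Height" value2="Weight";
run;
```

Because the result column is still named COUNT, the BIHISTOGRAM1A template
can be reused without changes.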
Setting Bin Width
Another technique for binning data is to set a bin width and compute the
number of observations in each bin. In the DATA step below, the bin width
is 5 for HEIGHT and 25 for WEIGHT. With this technique you do not know
the exact number of bins in advance, but you can ensure that the bins are
of a "good" size.
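data heart;
  set sashelp.heart(keep=height weight);
  /* keep only observations that have both values */
  if height ne . and weight ne .;
  /* round each value to the nearest multiple of the bin width */
  height=round(height,5);
  weight=round(weight,25);
run;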
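To see how ROUND assigns values to bins, the following sketch (an addition,
with test values chosen arbitrarily) prints the midpoint and the implied
bin limits for a few heights. It assumes a bin width of 5, so each bin
extends 2.5 units on either side of its midpoint. The variable names
MIDPOINT, LOWER, and UPPER are illustrative.

```sas
data _null_;
  do height = 58, 61, 62.4, 64.9;
    midpoint = round(height, 5);   /* nearest multiple of the bin width */
    lower    = midpoint - 2.5;     /* lower limit of the bin */
    upper    = midpoint + 2.5;     /* upper limit of the bin */
    put height= midpoint= lower= upper=;
  end;
run;
```

For example, 62.4 rounds to 60, so it falls in the bin that is centered
on 60 and spans 57.5 to 62.5.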
After
rounding, HEIGHT and WEIGHT can be used as classifiers for a summarization.
Notice that the COMPLETETYPES option forces all possible combinations
of the two variables to be output, even if no data exists for a particular
crossing.
proc summary data=heart nway completetypes;
  class height weight;
  var height;
  output out=stats(keep=height weight count) N=Count;
run;
The template
can be simplified because we know that the bin midpoints are uniformly
spaced integers. For this selection of bin widths, 6 bins were produced
for HEIGHT and 10 for WEIGHT.
proc template;
  define statgraph bihistogram2a;
    begingraph;
      entrytitle "Distribution of Height and Weight";
      entryfootnote halign=right "SASHELP.HEART";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on);
        bihistogram3dparm x=height y=weight z=count /
          display=all;
      endlayout;
    endgraph;
  end;
run;
proc sgrender data=stats template=bihistogram2a;
run;
If you prefer to see the axes labeled with the bin endpoints rather than
the bin midpoints, you can specify ENDLABELS=TRUE in the BIHISTOGRAM3DPARM
statement. Note that the ENDLABELS= option is independent of the XVALUES=
and YVALUES= options.
In the following example, the bin widths are changed to even numbers (10
and 50) so that the bin endpoints are whole numbers. With widths of 5 and
25, the endpoints would fall halfway between the midpoints, at values
such as 62.5:
proc template;
  define statgraph bihistogram2a;
    begingraph;
      entrytitle "Distribution of Height and Weight";
      entryfootnote halign=right "SASHELP.HEART";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on);
        bihistogram3dparm x=height y=weight z=count /
          binaxis=true endlabels=true display=all;
      endlayout;
    endgraph;
  end;
run;
data heart;
  set sashelp.heart(keep=height weight);
  height=round(height,10);
  weight=round(weight,50);
run;
proc summary data=heart nway completetypes;
  class height weight;
  var height;
  output out=stats(keep=height weight count) N=Count;
run;
proc sgrender data=stats template=bihistogram2a;
run;
If you choose bin widths that are too small, "gaps" might appear between
the axis tick values, which might cause the following message:
WARNING: The data for a HISTOGRAMPARM statement is not appropriate.
HISTOGRAMPARM statement expects uniformly-binned data. The
histogram might not be drawn correctly.
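One plausible way to diagnose this situation is to list the distinct bin
values and the spacing between them. The following sketch is an addition
to the original example. It assumes the STATS data set created by the PROC
SUMMARY step above, and HVALS and GAP are hypothetical names. If the bins
are uniform, every printed GAP value equals the bin width; a larger value
marks a bin value that is absent from the data.

```sas
/* Keep one row per distinct HEIGHT bin value, in ascending order */
proc sort data=stats(keep=height) out=hvals nodupkey;
  by height;
run;

/* Print the difference between each bin value and the previous one */
data _null_;
  set hvals;
  gap = dif(height);
  if _n_ > 1 then put height= gap=;
run;
```

The same check can be repeated for WEIGHT by substituting that variable
in both steps.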
Because
BIHISTOGRAM3DPARM is a parameterized plot, you can use it to show
the 3D data summarization of a response variable Z, which must have
non-negative values, by two numeric classification variables that
are uniformly spaced (X and Y). That is, even though the graphical
representation is a bivariate histogram, the Z axis does not have
to display a frequency count or a percent.
data cars;
  set sashelp.cars(keep=weight horsepower mpg_highway);
  if horsepower ne . and weight ne .;
  horsepower=round(horsepower,75);
  weight=round(weight,1000);
run;
proc summary data=cars nway completetypes;
  class weight horsepower;
  var mpg_highway;
  output out=stats mean=Mean;
run;
proc template;
  define statgraph bihistogram2b;
    begingraph;
      entrytitle
        "Distribution of Gas Mileage by Vehicle Weight and Horsepower";
      entryfootnote halign=right "SASHELP.CARS";
      layout overlay3d / cube=false zaxisopts=(griddisplay=on) rotate=130;
        bihistogram3dparm y=weight x=horsepower z=mean / binaxis=true
          display=all;
      endlayout;
    endgraph;
  end;
run;
proc sgrender data=stats template=bihistogram2b;
run;