Example 102.5 A Box Plot of the Square Root Difference Cloud

The Gaussian form selected for the semivariogram in the section Getting Started: VARIOGRAM Procedure is based on consideration of the plots of the sample semivariogram. For the coal thickness data, the Gaussian form appears to be a reasonable choice.

However, a plot of the sample variogram often shows so much scatter that no particular form is evident. One possible cause of this scatter is the presence of one or more outliers in the pairwise differences of the measured quantities.

A method of identifying potential outliers is discussed in Cressie (1993, section 2.2.2). This example illustrates how to use the OUTPAIR= data set from PROC VARIOGRAM to produce a square root difference cloud, which is useful in detecting outliers.

For the SRF $Z(s)$, $s \in \mathbb{R}^2$, the square root difference cloud for a particular direction $e$ is given by $|Z(s_i + he) - Z(s_i)|^{1/2}$ for a given lag distance $h$. In the actual computation, all pairs $P_1P_2$ of points $P_1$, $P_2$ within a distance tolerance around $h$ and an angle tolerance around the direction $e$ are used. This generates a number of point pairs for each lag class $h$. The spread of these values gives an indication of outliers.

Following the example in the section Getting Started: VARIOGRAM Procedure, this example uses a basic lag distance of LAGDISTANCE=7 with a distance tolerance of 3.5, and the N–S direction with an angle tolerance of ATOL=30°.
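With these settings, the selection rule can be written out explicitly. The following sketch rests on two assumptions that are not stated in this example: $d$ denotes the distance between the two points of a pair, and $\theta \in [0^\circ, 180^\circ]$ is the angle between their separation vector and the E–W axis, so that $\cos\theta$ corresponds to the COS variable in the OUTPAIR= data set. That reading of COS is inferred from the comment in the DATA step that follows. The symbols $v_1$ and $v_2$ stand for the two measured values of a pair.

```latex
% Lag class k for a unit lag of 7 and a distance tolerance of 3.5
\[
  k = \left\lfloor \frac{d}{7} + \frac{1}{2} \right\rfloor
  \quad\Longleftrightarrow\quad
  7k - 3.5 \le d < 7k + 3.5
\]
% Angle tolerance of 30 degrees around N-S (theta = 90 degrees)
\[
  |\cos\theta| < \cos 60^\circ = 0.5
  \quad\Longleftrightarrow\quad
  60^\circ < \theta < 120^\circ
\]
% Value contributed to the cloud by each retained pair
\[
  \sqrt{\,|v_1 - v_2|\,}
\]
```

Under these assumptions, lag class 1 collects pairs separated by 3.5 to 10.5 distance units, lag class 2 collects pairs separated by 10.5 to 17.5 units, and so on.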

First, use PROC VARIOGRAM to produce an OUTPAIR= data set. Then use a DATA step to subset this data set by keeping only the pairs within 30° of N–S. In the same step, compute the lag class and square root difference variables, as the following statements show:

title 'Square Root Difference Cloud Example';

data thick;
   input East North Thick @@;
   label Thick='Coal Seam Thickness';
   datalines;
0.7  59.6  34.1   2.1  82.7  42.2   4.7  75.1  39.5
4.8  52.8  34.3   5.9  67.1  37.0   6.0  35.7  35.9
6.4  33.7  36.4   7.0  46.7  34.6   8.2  40.1  35.4
13.3   0.6  44.7  13.3  68.2  37.8  13.4  31.3  37.8
17.8   6.9  43.9  20.1  66.3  37.7  22.7  87.6  42.8
23.0  93.9  43.6  24.3  73.0  39.3  24.8  15.1  42.3
24.8  26.3  39.7  26.4  58.0  36.9  26.9  65.0  37.8
27.7  83.3  41.8  27.9  90.8  43.3  29.1  47.9  36.7
29.5  89.4  43.0  30.1   6.1  43.6  30.8  12.1  42.8
32.7  40.2  37.5  34.8   8.1  43.3  35.3  32.0  38.8
37.0  70.3  39.2  38.2  77.9  40.7  38.9  23.3  40.5
39.4  82.5  41.4  43.0   4.7  43.3  43.7   7.6  43.1
46.4  84.1  41.5  46.7  10.6  42.6  49.9  22.1  40.7
51.0  88.8  42.0  52.8  68.9  39.3  52.9  32.7  39.2
55.5  92.9  42.2  56.0   1.6  42.7  60.6  75.2  40.1
62.1  26.6  40.1  63.0  12.7  41.8  69.0  75.6  40.1
70.5  83.7  40.9  70.9  11.0  41.7  71.5  29.5  39.8
78.1  45.5  38.7  78.2   9.1  41.7  78.4  20.0  40.8
80.5  55.9  38.7  81.1  51.0  38.6  83.8   7.9  41.6
84.5  11.0  41.5  85.2  67.3  39.4  85.5  73.0  39.8
86.7  70.4  39.6  87.2  55.7  38.8  88.1   0.0  41.6
88.4  12.1  41.3  88.4  99.6  41.2  88.8  82.9  40.5
88.9   6.2  41.5  90.6   7.0  41.5  90.7  49.6  38.9
91.5  55.4  39.0  92.9  46.8  39.1  93.4  70.9  39.7
55.8  50.5  38.1  96.2  84.3  40.3  98.2  58.2  39.5
;
proc variogram data=thick outp=outp noprint;
   compute novariogram;
   coordinates xc=East yc=North;
   var Thick;
run;

data sqroot;
   set outp;
   /*- Include only points +/- 30 degrees of N-S -------*/
   where abs(cos) < 0.5;
   /*- Unit lag of 7, distance tolerance of 3.5 --------*/
   lag_class=int(distance/7 + 0.5000001);
   sqr_diff=sqrt(abs(v1-v2));
run;

proc sort data=sqroot;
   by lag_class;
run;
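
The rounding rule in the lag_class assignment can be checked on a few hand-picked distances. The following statements are a minimal sketch: the data set name lag_demo is hypothetical, and its distances are invented purely for illustration and are not taken from the coal seam data.

```sas
/* Hypothetical distances, chosen only to illustrate the rounding rule */
data lag_demo;
   input distance @@;
   /* Same assignment as in the sqroot DATA step: unit lag of 7 */
   lag_class=int(distance/7 + 0.5000001);
   datalines;
1.0  3.4  3.5  7.0  10.4  10.5  14.0
;

proc print data=lag_demo noobs;
run;
```

With a unit lag of 7, the distances 1.0 and 3.4 fall in lag class 0, the distances 3.5, 7.0, and 10.4 fall in lag class 1, and the distances 10.5 and 14.0 fall in lag class 2. The small constant added to 0.5 appears intended to keep a distance that sits exactly on a class boundary from being rounded down because of floating-point error.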

Next, summarize the results by using the MEANS procedure:

proc means data=sqroot noprint n mean std;
   var sqr_diff;
   by lag_class;
   output out=msqrt n=n mean=mean std=std;
run;

title2 'Summary of Results';
proc print data=msqrt;
   id lag_class;
   var n mean std;
run;

The preceding statements produce Output 102.5.1.

Output 102.5.1: Summary of Results

Square Root Difference Cloud Example
Summary of Results

lag_class     n       mean        std
        0     5    0.47300    0.14263
        1    31    0.77338    0.41467
        2    51    1.17052    0.47800
        3    58    1.52287    0.51454
        4    65    1.68625    0.58465
        5    65    1.66963    0.68582
        6    80    1.79693    0.62929
        7    88    1.73334    0.73191
        8    83    1.75528    0.68767
        9   108    1.72901    0.58274
       10    80    1.48268    0.48695
       11    84    1.19242    0.47037
       12    68    0.89765    0.42510
       13    38    0.84223    0.44249
       14     7    1.05653    0.42548
       15     3    1.35076    0.11472

Finally, present the results in a box plot by using the SGPLOT procedure. The box plot facilitates the detection of outliers. The statements are as follows:

proc sgplot data=sqroot;
   xaxis label = "Lag Class";
   yaxis label = "Square Root Difference";
   title "Box Plot of the Square Root Difference Cloud";
   vbox sqr_diff / category=lag_class;
run;

Output 102.5.2 suggests that outliers, if any, do not adversely affect the empirical semivariogram in the N–S direction for the coal seam thickness data. This conclusion is consistent with the earlier semivariogram analysis of the same data set in the section Getting Started: VARIOGRAM Procedure. The effect of the isolated outliers in lag classes 6 and 10–12 in Output 102.5.2 appears as the divergence between the classical and robust empirical semivariance estimates at the larger distances in Figure 102.7. The two estimates differ because of the definition of the robust semivariance estimator (see the section Theoretical and Computational Details of the Semivariogram), which smooths the influence of outliers.

Output 102.5.2: Box Plot of the Square Root Difference Cloud 
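
To inspect individual pairs instead of reading them off the plot, the potential outliers can also be listed. The following statements are a minimal sketch and not part of the original example. They assume that the sqroot data set is still sorted by lag_class, and they use the conventional 1.5 × IQR fences that box plots commonly use to mark outliers. The names quart, flagged, q1, q3, and iqr are hypothetical. The quartile definition in PROC MEANS might differ slightly from the one that PROC SGPLOT uses, so the listed pairs might not match the plotted markers exactly.

```sas
/* Quartiles of the square root differences within each lag class */
proc means data=sqroot noprint;
   var sqr_diff;
   by lag_class;
   output out=quart q1=q1 q3=q3;
run;

/* Keep only the pairs that fall outside the 1.5*IQR fences */
data flagged;
   merge sqroot quart(keep=lag_class q1 q3);
   by lag_class;
   iqr=q3 - q1;
   if sqr_diff < q1 - 1.5*iqr or sqr_diff > q3 + 1.5*iqr;
run;

title2 'Pairs Outside the 1.5*IQR Fences';
proc print data=flagged;
   var lag_class distance v1 v2 sqr_diff;
run;
```

Each listed row is one pair of measurements whose square root difference is unusually large or small for its lag class. A measurement that recurs across many flagged pairs is a natural candidate for closer inspection.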