E(X²) is the expected value of height squared, that is,
the mean value of the population obtained by squaring every value
in the population of heights.
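As a quick numeric illustration (a sketch of my own, not part of the original examples; the data set and variable names are invented), the following steps simulate a population of heights, square every value, and average the results, which approximates E(X) and E(X²):

data heights;
drop n;
do n=1 to 100000;
X=65+3*rannor(12345); /* simulated heights, roughly normal */
XSquared=X**2; /* squared value for estimating E(X**2) */
output;
end;
run;
proc means data=heights mean maxdec=2;
var x xsquared; /* the means approximate E(X) and E(X**2) */
run;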
The field of statistics is largely concerned with the study of the behavior of sample statistics.
options nodate pageno=1 linesize=80 pagesize=52;
title 'Example of Quantiles and Measures of Location';
data random;
drop n;
do n=1 to 1000;
X=floor(exp(rannor(314159)*.8+1.8));
output;
end;
run;
proc univariate data=random nextrobs=0;
var x;
output out=location
mean=Mean mode=Mode median=Median
q1=Q1 q3=Q3 p5=P5 p10=P10 p90=P90 p95=P95
max=Max;
run;

/* store the rounded statistics in macro variables for the %FORMGEN macro */
data _null_;
set location;
call symput('MEAN',round(mean,1));
call symput('MODE',mode);
call symput('MEDIAN',round(median,1));
call symput('Q1',round(q1,1));
call symput('Q3',round(q3,1));
call symput('P5',round(p5,1));
call symput('P10',round(p10,1));
call symput('P90',round(p90,1));
call symput('P95',round(p95,1));
call symput('MAX',min(50,max));
run;
%macro formgen;
%do i=1 %to &max;
%let value=&i;
%if &i=&p5 %then %let value=&value P5;
%if &i=&p10 %then %let value=&value P10;
%if &i=&q1 %then %let value=&value Q1;
%if &i=&mode %then %let value=&value Mode;
%if &i=&median %then %let value=&value Median;
%if &i=&mean %then %let value=&value Mean;
%if &i=&q3 %then %let value=&value Q3;
%if &i=&p90 %then %let value=&value P90;
%if &i=&p95 %then %let value=&value P95;
%if &i=&max %then %let value=>=&value;
&i="&value"
%end;
%mend;
proc format print;
value stat %formgen;
run;
options pagesize=42 linesize=80;
proc chart data=random;
vbar x / midpoints=1 to &max by 1;
format x stat.;
footnote 'P5 = 5TH PERCENTILE';
footnote2 'P10 = 10TH PERCENTILE';
footnote3 'P90 = 90TH PERCENTILE';
footnote4 'P95 = 95TH PERCENTILE';
footnote5 'Q1 = 1ST QUARTILE ';
footnote6 'Q3 = 3RD QUARTILE ';
run;
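If you want to inspect the statistics that drive the generated format before charting, an optional PROC PRINT step (my own addition, not part of the original example) lists the LOCATION data set:

proc print data=location noobs;
run;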
The population variance, usually denoted by σ², is the expected value of the squared difference
of the values from the population mean: σ² = E((X − μ)²). The difference between a value and the mean is
called a deviation from the mean. Thus, the variance approximates the mean of the squared deviations.
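To see the "mean of the squared deviations" idea numerically, here is a small sketch (my own illustration; the data values are invented). It converts the sample variance reported by PROC MEANS, which uses the divisor n − 1, into the mean squared deviation, which uses the divisor n:

data devs;
input X @@;
datalines;
2 4 4 4 5 5 7 9
;
proc means data=devs noprint;
var x;
output out=stats mean=MeanX var=VarX n=N;
run;
data _null_;
set stats;
MeanSqDev=varx*(n-1)/n; /* mean of squared deviations (divisor n) */
put 'Sample variance (divisor n-1): ' varx 8.4;
put 'Mean squared deviation: ' meansqdev 8.4;
run;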
The population skewness measures the asymmetry of the distribution. It is based on the third
central moment, E((X − μ)³), divided by the cube of the standard deviation
to remove the effect of scale, so multiplying all
values by a constant does not change the skewness. Skewness can thus
be interpreted as a tendency for one tail of the population to be
heavier than the other. Skewness can be positive or negative and is
unbounded.
The population kurtosis measures the heaviness of the tails and is based on the fourth central
moment divided by the fourth power of the standard deviation (with 3 subtracted so that a normal
distribution has kurtosis 0). Because of that division, multiplying each value by a constant has no effect
on kurtosis.
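The scale invariance is easy to verify empirically. The following sketch (my own illustration, not from the original text; the data set name is arbitrary) computes skewness and kurtosis for a skewed variable and for the same variable multiplied by 100; both columns report identical values:

data scalechk;
drop n;
do n=1 to 5000;
X=ranexp(20201); /* skewed variable */
X100=100*X; /* same variable on a different scale */
output;
end;
run;
proc means data=scalechk skewness kurtosis maxdec=3;
var x x100;
run;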
Kurtosis must lie between −2 and +∞, inclusive. If γ₁
represents population skewness and γ₂
represents population kurtosis, then γ₂ ≥ γ₁² − 2.
options nodate pageno=1 linesize=80 pagesize=52;
title '10000 Obs Sample from a Normal Distribution';
title2 'with Mean=50 and Standard Deviation=10';
data normaldat;
drop n;
do n=1 to 10000;
X=10*rannor(53124)+50;
output;
end;
run;
proc univariate data=normaldat nextrobs=0 normal
mu0=50 loccount;
var x;
run;

/* picture format that labels the mean and 1, 2, and 3 standard deviations */
proc format;
picture msd
20='20 3*Std' (noedit)
30='30 2*Std' (noedit)
40='40 1*Std' (noedit)
50='50 Mean ' (noedit)
60='60 1*Std' (noedit)
70='70 2*Std' (noedit)
80='80 3*Std' (noedit)
other=' ';
run;
options linesize=80 pagesize=42;
proc chart data=normaldat;
vbar x / midpoints=20 to 80 by 2;
format x msd.;
run;
The standard deviation of the sampling distribution
of the mean is called the standard error of the
mean; for a sample of n independent observations it equals σ/√n, where σ is the
population standard deviation. The standard error of the mean provides
an indication of the accuracy of a sample mean as an estimator of
the population mean.
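For the simulated normal sample above (σ = 10, n = 10000), the standard error of the mean is 10/√10000 = 0.1. A short step like the following (my own addition, not in the original example) asks PROC MEANS for the estimated standard error directly:

proc means data=normaldat mean stderr maxdec=3;
var x; /* STDERR estimates sigma/sqrt(n), about 0.1 for this sample */
run;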
options nodate pageno=1 linesize=80 pagesize=42;
title '1000 Observation Sample';
title2 'from an Exponential Distribution';
data expodat;
drop n;
do n=1 to 1000;
X=ranexp(18746363);
output;
end;
run;
proc format;
value axisfmt
.05='0.05'
.55='0.55'
1.05='1.05'
1.55='1.55'
2.05='2.05'
2.55='2.55'
3.05='3.05'
3.55='3.55'
4.05='4.05'
4.55='4.55'
5.05='5.05'
5.55='5.55'
other=' ';
run;
proc chart data=expodat;
vbar x / axis=300
midpoints=0.05 to 5.55 by .1;
format x axisfmt.;
run;
The theoretical standard error of the mean for samples of 10 observations
from an exponential distribution with unit standard deviation is 1/√10 ≈ .32, whereas the standard deviation of this sample
from the sampling distribution is .30. The skewness (.55) and kurtosis
(−.006) are closer to zero in the sample from the sampling distribution
than in the original sample from the exponential distribution because
the sampling distribution is closer to a normal distribution than
is the original exponential distribution. The CHART procedure displays
a histogram of the 1000 sample means. The shape of the histogram is
much closer to a bell-shaped normal density, but it is still distinctly
lopsided.
options nodate pageno=1 linesize=80 pagesize=48;
title '1000 Sample Means with 10 Obs per Sample';
title2 'Drawn from an Exponential Distribution';
data samp10;
drop n;
do Sample=1 to 1000;
do n=1 to 10;
X=ranexp(433879);
output;
end;
end;
run;
proc means data=samp10 noprint;
output out=mean10 mean=Mean;
var x;
by sample;
run;

proc format;
value axisfmt
.05='0.05'
.55='0.55'
1.05='1.05'
1.55='1.55'
2.05='2.05'
other=' ';
run;
proc chart data=mean10;
vbar mean/axis=300
midpoints=0.05 to 2.05 by .1;
format mean axisfmt.;
run;

options nodate pageno=1 linesize=80 pagesize=48;
title '1000 Sample Means with 50 Obs per Sample';
title2 'Drawn from an Exponential Distribution';
data samp50;
drop n;
do sample=1 to 1000;
do n=1 to 50;
X=ranexp(72437213);
output;
end;
end;
run;
proc means data=samp50 noprint;
output out=mean50 mean=Mean;
var x;
by sample;
run;
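The excerpt stops after computing the 1000 means of 50 observations each. If you also want the corresponding histogram, a chart step analogous to the one used for the MEAN10 data set (my own completion, not shown in the source) would be:

proc chart data=mean50;
vbar mean/axis=300
midpoints=0.05 to 2.05 by .1;
format mean axisfmt.;
run;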
The null hypothesis, H0, is that the mean deviation from normal weight in the population is
zero: μ = 0. The other two hypotheses, called alternative hypotheses, are that the students
are underweight on the average, H1: μ < 0, and that the students are overweight on the average,
H2: μ > 0.
You could measure a sample of students, compute the mean deviation x̄ from normal weight,
and decide among the three hypotheses according to the following rule: choose H1 if x̄ < 0,
choose H0 if x̄ = 0, and choose H2 if x̄ > 0.
The trouble with this rule is that the chances of x̄ being exactly zero are almost nil. If μ is
slightly less than zero, so that H1 is true,
then there might be nearly a 50% chance that x̄
will be greater than zero in repeated sampling,
so the chances of incorrectly choosing H2
would also be nearly 50%. Thus, you have a high probability of making
an error if μ is near zero. In such cases, there is not enough
evidence to make a confident decision, so the best response might
be to reserve judgment until you can obtain more evidence.
How far from zero must x̄ be for you to be able to make a confident decision?
The answer can be obtained by considering the sampling distribution
of x̄. If X has an approximately normal distribution,
then x̄ has an approximately normal sampling distribution.
The mean of the sampling distribution of x̄
is μ. Assume temporarily that σ, the
standard deviation of X, is known to be 12. Then the standard error
of x̄ for samples of nine observations is σ/√n = 12/√9 = 4.
In about 95% of such samples, x̄ falls within two standard errors of μ, so if
the null hypothesis is true, x̄ will be between −2 × 4
and 2 × 4, or between −8 and 8. Consider the chances
of making an error with the following decision rule: choose H1 if x̄ < −8,
reserve judgment if −8 ≤ x̄ ≤ 8, and choose H2 if x̄ > 8.
If the null hypothesis happens to be true, then in 95% of the samples x̄ will be between the critical
values −8 and 8, so you will reserve judgment. In these cases
the statistical evidence is not strong enough to fell the straw man.
In the other 5% of the samples you will make an error; in 2.5% of
the samples you will incorrectly choose H1,
and in 2.5% you will incorrectly choose H2.
The probability of rejecting the null hypothesis when it is in fact true is called the Type I
error rate of the test; with this rule it is 5%. In this example, an x̄
value less than −8
or greater than 8 is said to be statistically significant at the 5% level. You
can adjust the Type I error rate according to your needs by choosing
different critical values. For example, critical values of −4
and 4 would produce a significance level of about 32%, while −12
and 12 would give a Type I error rate of about 0.3%.
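These rates follow from the normal sampling distribution of x̄ with standard error 4. The following DATA step (my own check, not part of the original text) uses the PROBNORM function to reproduce the approximate two-tailed Type I error rates for critical values of ±4, ±8, and ±12:

data _null_;
stderr=4; /* standard error of the mean */
do crit=4, 8, 12;
alpha=2*(1-probnorm(crit/stderr)); /* two-tailed Type I error rate */
put 'Critical values +/-' crit 3. ': alpha = ' alpha percent8.1;
end;
run;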
The power of a statistical test is the probability of rejecting the null hypothesis when it is
false. One minus the power is called the Type II error
rate, which is the probability of not rejecting
a false null hypothesis. The power depends on the true value of the
parameter. In the example, assume that the population mean is 4. The
power for detecting H2 is the probability of
getting a sample mean greater than 8. The critical value 8 is one
standard error higher than the population mean 4. The chance of getting
a value at least one standard deviation greater than the mean from
a normal distribution is about 16%, so the power for detecting the
alternative hypothesis H2 is about 16%. If
the population mean were 8, then the power for H2 would be 50%, whereas a population mean of 12 would yield a power
of about 84%.
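The power figures quoted above can be verified with a short DATA step (again my own check): with critical value 8 and standard error 4, the power for detecting H2 is 1 − PROBNORM((8 − μ)/4).

data _null_;
crit=8; /* upper critical value */
stderr=4; /* standard error of the mean */
do mu=4, 8, 12;
power=1-probnorm((crit-mu)/stderr); /* chance that the sample mean exceeds 8 */
put 'True mean ' mu 2. ': power for H2 = ' power percent8.0;
end;
run;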
In practice, σ is rarely known, so the usual test statistic is Student's t: the
difference between the sample mean and the hypothesized population mean divided by the estimated standard error of the
mean, t = (x̄ − μ0) / (s / √n), where s is the sample standard deviation. If the population is
approximately normal, then under the null hypothesis t has a Student's t distribution with
n − 1 degrees of freedom. This distribution looks very
similar to a normal distribution, but the tails of the Student's t distribution are heavier. As the sample size
gets larger, the sample standard deviation becomes a better estimator
of the population standard deviation, and the t distribution gets closer to a normal distribution.
You can therefore base the test on the t statistic, using critical values from the Student's
t distribution with (n − 1) degrees of freedom. Most common statistics texts
contain a table of Student's t distribution. If you do not have a statistics text handy, then you
can use the DATA step and the TINV function to print any values from
the t distribution.
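For example, the following DATA step (a sketch of my own, not from the original example) prints the two-tailed 5% critical values for a few sample sizes:

data _null_;
file print;
put 'Two-tailed 5% critical values of Student''s t';
do df=1, 2, 5, 8, 10, 20, 30;
t=tinv(.975, df); /* 97.5th percentile gives a two-tailed 5% test */
put df 3. ' df: ' t 7.3;
end;
run;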
The UNIVARIATE procedure computes a t statistic for the null hypothesis that the population
mean is zero, along with related statistics. Use the MU0= option
in the PROC statement to specify another value for the null hypothesis.
The following statements use PROC MEANS to compute the t statistic and its p-value for the
sample of nine deviations from normal weight. Then, the TINV function in a DATA step computes
the critical value of Student's t distribution
for a two-tailed test at the 5% level of significance and eight degrees
of freedom.
data devnorm;
title 'Deviations from Normal Weight';
input X @@;
datalines;
-7 -2 1 3 6 10 15 21 30
;
proc means data=devnorm maxdec=3 n mean
std stderr t probt;
run;
title 'Student''s t Critical Value';
data _null_;
file print;
t=tinv(.975,8);
put t 5.3;
run;
                    Deviations from Normal Weight

                        The MEANS Procedure

                       Analysis Variable : X

     N        Mean     Std Dev   Std Error  t Value  Pr > |t|
    ----------------------------------------------------------
     9       8.556      11.759       3.920     2.18    0.0606
    ----------------------------------------------------------

                     Student's t Critical Value

                               2.306
The p-value is the probability, computed under the null hypothesis, of obtaining a t value at least as extreme as the observed t value. Once the p-value is computed, you can perform a hypothesis test by comparing
the p-value with the desired
significance level. If the p-value is less than or equal to the Type I error rate of the test,
then the null hypothesis can be rejected. The two-tailed p-value, labeled Pr > |t| in the PROC MEANS output, is .0606, so the null hypothesis could
be rejected at the 10% significance level but not at the 5% level.
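As a cross-check (my own addition), the reported t value and p-value can be reproduced from the summary statistics with the PROBT function: t = 8.556/3.920 ≈ 2.18, and the two-tailed p-value is 2 × (1 − PROBT(2.18, 8)) ≈ .06.

data _null_;
mean=8.556; /* sample mean from the PROC MEANS output */
stderr=3.920; /* estimated standard error of the mean */
df=8; /* n - 1 degrees of freedom */
t=mean/stderr; /* t statistic for the null hypothesis that the mean is 0 */
p=2*(1-probt(t, df)); /* two-tailed p-value */
put 't = ' t 5.2 ' Pr > |t| = ' p 6.4;
run;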