Kernel Density Estimates :: SAS/STAT(R) 12.3 User's Guide

Kernel Density Estimates

A weighted univariate kernel density estimate involves a variable X and a weight variable W. Let $(X_{i},W_{i}), \ i=1,2,\ldots ,n$, denote a sample of X and W of size n. The weighted kernel density estimate of $f(x)$, the density of X, is as follows:

$\hat{f}(x) = \frac{1}{\sum _{i=1}^{n} W_{i}} \sum _{i=1}^{n} W_{i} \varphi _{h}(x-X_{i})$

where h is the bandwidth and

$\varphi _{h}(x) = \frac{1}{\sqrt {2\pi }h} \exp \left( -\frac{x^{2}}{2h^{2}} \right)$

is the standard normal density rescaled by the bandwidth. If $h\rightarrow 0$ and $nh\rightarrow \infty$ , then the optimal bandwidth is

$h_{\mathrm{AMISE}} = \left[ \frac{1}{2\sqrt {\pi } n \int (f'')^{2}} \right]^{1/5}$

This optimal value is unknown because it depends on the density being estimated, and so approximation methods are required. For a derivation and discussion of these results, see Silverman (1986, Chapter 3) and Jones, Marron, and Sheather (1996).
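The weighted estimate can be sketched in a few lines of Python. This is an illustration only, not the PROC KDE implementation: the function names and the sample values are hypothetical. The helper `normal_reference_bandwidth` shows one common approximation (an assumption here, not necessarily the method the procedure uses): it evaluates the optimal-bandwidth formula as if the density were normal with standard deviation `sigma`, for which the integral of the squared second derivative equals $3/(8\sqrt{\pi}\sigma^{5})$.

```python
import math


def gaussian_kernel(u, h):
    """phi_h(u): normal density with mean 0 and standard deviation h."""
    return math.exp(-u * u / (2.0 * h * h)) / (math.sqrt(2.0 * math.pi) * h)


def weighted_kde(x, data, weights, h):
    """Weighted kernel density estimate at the point x."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    total_weight = sum(weights)
    kernel_sum = sum(w * gaussian_kernel(x - xi, h)
                     for xi, w in zip(data, weights))
    return kernel_sum / total_weight


def normal_reference_bandwidth(sigma, n):
    """Optimal bandwidth formula evaluated under an ASSUMED normal density.

    For a normal density with standard deviation sigma, the integral of
    (f'')^2 is 3 / (8 * sqrt(pi) * sigma**5).
    """
    roughness = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)
    return (1.0 / (2.0 * math.sqrt(math.pi) * n * roughness)) ** 0.2


# Hypothetical sample and weights, for illustration only.
data = [1.2, 1.9, 2.4, 3.1, 4.0]
weights = [1.0, 2.0, 1.0, 1.0, 0.5]

h = normal_reference_bandwidth(sigma=1.0, n=len(data))
density_at_2 = weighted_kde(2.0, data, weights, h)
```

Because the weights are normalized by their sum, multiplying every weight by the same constant leaves the estimate unchanged.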

For the bivariate case, let $\mathbf{X} = (X,Y)$ be a bivariate random element taking values in $R^2$ with joint density function

$f(x,y), \ (x,y) \in R^2$

and let $\mathbf{X}_{i} = (X_{i},Y_{i}), \ i = 1,2, \ldots , n$, be a sample of size n drawn from this distribution. The kernel density estimate of $f(x,y)$ based on this sample is

$\begin{aligned}
\hat{f}(x,y) &= \frac{1}{n} \sum _{i=1}^{n} \varphi _{\mathbf{h}}(x-X_{i},y-Y_{i}) \\
&= \frac{1}{nh_{X}h_{Y}} \sum _{i=1}^{n}\varphi \left( \frac{x-X_{i}}{h_{X}}, \frac{y-Y_{i}}{h_{Y}} \right)
\end{aligned}$

where $(x,y) \in R^2$, $h_{X}>0$ and $h_{Y}>0$ are the bandwidths, and $\varphi _{\mathbf{h}}(x,y)$ is the rescaled normal density

$\varphi _{\mathbf{h}}(x,y) = \frac{1}{ h_{X}h_{Y}} \varphi \left( \frac{x}{h_{X}}, \frac{y}{h_{Y}} \right)$

where $\varphi (x,y)$ is the standard normal density function

$\varphi (x,y) = \frac{1}{2\pi } \exp \left( -\frac{x^{2}+y^{2}}{2} \right)$
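The bivariate estimate translates directly into code. The following Python sketch is illustrative only: the function names and sample points are hypothetical, and it evaluates the estimate at a single point rather than on a grid as a production routine would.

```python
import math


def std_normal_2d(u, v):
    """Standard bivariate normal density phi(u, v) with independent components."""
    return math.exp(-(u * u + v * v) / 2.0) / (2.0 * math.pi)


def bivariate_kde(x, y, sample, h_x, h_y):
    """Kernel density estimate at (x, y) with bandwidths h_x and h_y."""
    if h_x <= 0 or h_y <= 0:
        raise ValueError("both bandwidths must be positive")
    n = len(sample)
    kernel_sum = sum(std_normal_2d((x - xi) / h_x, (y - yi) / h_y)
                     for xi, yi in sample)
    return kernel_sum / (n * h_x * h_y)


# Hypothetical sample of (X_i, Y_i) pairs, for illustration only.
sample = [(0.0, 0.1), (0.5, -0.2), (1.1, 0.9), (-0.4, 0.3)]
estimate = bivariate_kde(0.2, 0.1, sample, h_x=0.5, h_y=0.4)
```

Using separate bandwidths `h_x` and `h_y` lets the amount of smoothing differ along each axis, which matters when X and Y are measured on different scales.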

Under mild regularity assumptions about $f(x,y)$, the mean integrated squared error (MISE) of $\hat{f}(x,y)$ is

$\begin{aligned}
\textrm{MISE}(h_{X},h_{Y}) &= \textrm{E}\int (\hat{f}-f)^{2} \\
&= \frac{1}{4\pi n h_{X} h_{Y}}+ \frac{h_{X}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial X^{2}}\right)^{2}dxdy \\
&\quad {} + \frac{h_{Y}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial Y^{2}}\right)^{2}dxdy + O\left(h_{X}^{4} + h_{Y}^{4} + \frac{1}{nh_{X}h_{Y}}\right)
\end{aligned}$

as $h_{X} \rightarrow 0$ , $h_{Y} \rightarrow 0$ and $n h_{X} h_{Y} \rightarrow \infty$ .

Now set

$\begin{aligned}
\textrm{AMISE}(h_{X},h_{Y}) &= \frac{1}{4\pi n h_{X} h_{Y}} + \frac{h_{X}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial X^{2}}\right)^{2}dxdy \\
&\quad {} + \frac{h_{Y}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial Y^{2}}\right)^{2}dxdy
\end{aligned}$

which is the asymptotic mean integrated squared error (AMISE). For fixed n, this has a minimum at $(h_{\mathrm{AMISE}\_ X}, h_{\mathrm{AMISE}\_ Y})$, obtained by setting both partial derivatives of AMISE to zero:

$h_{\mathrm{AMISE}\_ X} = \left[\frac{1}{4n\pi \int (\frac{\partial ^{2}f}{\partial X^{2}})^{2}}\right]^{1/6} \left[\frac{\int (\frac{\partial ^{2}f}{\partial Y^{2}})^{2}}{\int (\frac{\partial ^{2}f}{\partial X^{2}})^{2}}\right]^{1/24}$

and

$h_{\mathrm{AMISE}\_ Y} = \left[\frac{1}{4n\pi \int (\frac{\partial ^{2}f}{\partial Y^{2}})^{2}}\right]^{1/6} \left[\frac{\int (\frac{\partial ^{2}f}{\partial X^{2}})^{2}}{\int (\frac{\partial ^{2}f}{\partial Y^{2}})^{2}}\right]^{1/24}$

These are the optimal asymptotic bandwidths in the sense that they minimize AMISE, the leading term of MISE. However, as in the univariate case, these expressions contain the second derivatives of the unknown density being estimated, and so approximations are required. See Wand and Jones (1993) for further details.
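To see how the two bandwidths interact, the AMISE expression can be minimized numerically once values for the two curvature integrals are supplied. The Python sketch below is illustrative only: `a_x` and `a_y` stand for $\int (\partial ^{2}f/\partial X^{2})^{2}$ and $\int (\partial ^{2}f/\partial Y^{2})^{2}$, and the numbers passed in are hypothetical, since in practice these integrals are unknown and must be approximated. The minimizer alternates between the two first-order conditions of AMISE until the bandwidths stop changing.

```python
import math


def amise(h_x, h_y, n, a_x, a_y):
    """AMISE(h_x, h_y) for curvature integrals a_x and a_y."""
    return (1.0 / (4.0 * math.pi * n * h_x * h_y)
            + h_x ** 4 / 4.0 * a_x
            + h_y ** 4 / 4.0 * a_y)


def minimize_amise(n, a_x, a_y, iterations=60):
    """Solve the two first-order conditions by alternating updates.

    d AMISE / d h_x = 0  gives  h_x**5 = c / (a_x * h_y)
    d AMISE / d h_y = 0  gives  h_y**5 = c / (a_y * h_x)
    with c = 1 / (4 * pi * n).
    """
    c = 1.0 / (4.0 * math.pi * n)
    h_x = h_y = 1.0  # arbitrary positive starting values
    for _ in range(iterations):
        h_x = (c / (a_x * h_y)) ** 0.2
        h_y = (c / (a_y * h_x)) ** 0.2
    return h_x, h_y


# Hypothetical curvature integrals and sample size, for illustration only.
h_x, h_y = minimize_amise(n=500, a_x=0.3, a_y=0.05)
minimum = amise(h_x, h_y, n=500, a_x=0.3, a_y=0.05)
```

In this example the density is assumed to curve more sharply along X than along Y (`a_x > a_y`), so the resulting `h_x` is smaller than `h_y`.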
