The KDE Procedure

Kernel Density Estimates

A weighted univariate kernel density estimate involves a variable X and a weight variable W. Let $(X_{i},W_{i}), \  i=1,2,\ldots ,n$, denote a sample of X and W of size n. The weighted kernel density estimate of $f(x)$, the density of X, is as follows:

\[  \hat{f}(x) = \frac{1}{\sum _{i=1}^{n} W_{i}} \sum _{i=1}^{n} W_{i} \varphi _{h}(x-X_{i})  \]

where h is the bandwidth and

\[  \varphi _{h}(x) = \frac{1}{\sqrt {2\pi }h} \exp \left( -\frac{x^{2}}{2h^{2}} \right)  \]

is the standard normal density rescaled by the bandwidth. Under the asymptotic conditions $h\rightarrow 0$ and $nh\rightarrow \infty $, the bandwidth that minimizes the asymptotic mean integrated squared error (AMISE) is

\[  h_\mr {AMISE} = \left[ \frac{1}{2\sqrt {\pi } n \int (f'')^{2}} \right]^{1/5}  \]

This optimal value is unknown because it involves the second derivative of the density $f$ being estimated, and so approximation methods are required. For a derivation and discussion of these results, see Silverman (1986, Chapter 3) and Jones, Marron, and Sheather (1996).
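
As a concrete illustration (not part of the procedure's documented syntax), the following NumPy sketch evaluates the weighted estimate $\hat{f}(x)$ on a grid. The function names are chosen here for exposition, and Silverman's normal-reference rule of thumb (Silverman 1986) is used only as one simple stand-in for the unknown $h_{\mr{AMISE}}$, not as the procedure's own bandwidth selector.

```python
import numpy as np

def weighted_kde(x_grid, X, W, h):
    """Evaluate the weighted Gaussian kernel density estimate on x_grid.

    Implements f_hat(x) = sum_i W_i * phi_h(x - X_i) / sum_i W_i,
    where phi_h is the normal density rescaled by the bandwidth h.
    """
    x_grid = np.asarray(x_grid, dtype=float)
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    u = (x_grid[:, None] - X[None, :]) / h                  # (x - X_i) / h
    phi = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)    # phi_h(x - X_i)
    return phi @ W / W.sum()

def silverman_bandwidth(X):
    """Normal-reference rule of thumb (Silverman 1986), one crude
    approximation to the unknown AMISE-optimal bandwidth."""
    X = np.asarray(X, dtype=float)
    iqr = np.subtract(*np.percentile(X, [75, 25]))
    return 0.9 * min(X.std(ddof=1), iqr / 1.34) * X.size ** (-1 / 5)

# Example: an equally weighted sample from a standard normal distribution
rng = np.random.default_rng(0)
X = rng.normal(size=500)
W = np.ones_like(X)
grid = np.linspace(-4.0, 4.0, 201)
f_hat = weighted_kde(grid, X, W, silverman_bandwidth(X))
```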

For the bivariate case, let $\mb{X} = (X,Y)$ be a bivariate random element taking values in $R^2$ with joint density function

\[  f(x,y), \  (x,y) \in R^2  \]

and let $\mb{X}_{i} = (X_{i},Y_{i}), \  i = 1,2, \ldots , n$, be a sample of size n drawn from this distribution. The kernel density estimate of $f(x,y)$ based on this sample is

\begin{eqnarray*}  \hat{f}(x,y) &  = &  \frac{1}{n} \sum _{i=1}^{n} \varphi _{\mb{h}}(x-X_{i},y-Y_{i}) \\ &  = &  \frac{1}{nh_{X}h_{Y}} \sum _{i=1}^{n}\varphi \left( \frac{x-X_{i}}{h_{X}}, \frac{y-Y_{i}}{h_{Y}} \right) \end{eqnarray*}

where $(x,y) \in R^2$, $h_{X}>0$ and $h_{Y}>0$ are the bandwidths, and $\varphi _{\mb{h}}(x,y)$ is the rescaled normal density

\[  \varphi _{\mb{h}}(x,y) = \frac{1}{ h_{X}h_{Y}} \varphi \left( \frac{x}{h_{X}}, \frac{y}{h_{Y}} \right)  \]

where $\varphi (x,y)$ is the standard normal density function

\[  \varphi (x,y) = \frac{1}{2\pi } \exp \left( -\frac{x^{2}+y^{2}}{2} \right)  \]
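
A minimal NumPy sketch of this bivariate estimator follows. It exploits the fact that the product kernel factorizes over coordinates, so the density on a rectangular grid reduces to a single matrix product; the bandwidth values in the example are arbitrary placeholders rather than recommended or default settings.

```python
import numpy as np

def bivariate_kde(x_grid, y_grid, X, Y, hx, hy):
    """Evaluate f_hat(x, y) = (1/(n*hx*hy)) * sum_i phi((x-X_i)/hx, (y-Y_i)/hy)
    on the Cartesian product of x_grid and y_grid."""
    x_grid = np.asarray(x_grid, dtype=float)
    y_grid = np.asarray(y_grid, dtype=float)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.size
    ux = (x_grid[:, None] - X[None, :]) / hx    # shape (nx, n)
    uy = (y_grid[:, None] - Y[None, :]) / hy    # shape (ny, n)
    # phi(u, v) = exp(-(u^2 + v^2)/2) / (2*pi) factorizes over coordinates,
    # so the sum over the sample becomes one matrix product.
    kx = np.exp(-0.5 * ux**2)
    ky = np.exp(-0.5 * uy**2)
    return kx @ ky.T / (2 * np.pi * n * hx * hy)   # shape (nx, ny)

# Example on a correlated bivariate normal sample
rng = np.random.default_rng(0)
XY = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=400)
xg = np.linspace(-3.0, 3.0, 61)
yg = np.linspace(-3.0, 3.0, 61)
f_hat = bivariate_kde(xg, yg, XY[:, 0], XY[:, 1], hx=0.4, hy=0.4)
```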

Under mild regularity assumptions about $f(x,y)$, the mean integrated squared error (MISE) of $\hat{f}(x,y)$ is

\begin{eqnarray*}  \textrm{MISE}(h_{X},h_{Y}) &  = &  \textrm{E}\int (\hat{f}-f)^{2} \\ &  = &  \frac{1}{4\pi n h_{X} h_{Y}}+ \frac{h_{X}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial X^{2}}\right)^{2}dxdy \\ & &  {} + \frac{h_{Y}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial Y^{2}}\right)^{2}dxdy + O\left(h_{X}^{4} + h_{Y}^{4} + \frac{1}{nh_{X}h_{Y}}\right) \end{eqnarray*}

as $h_{X} \rightarrow 0$, $h_{Y} \rightarrow 0$ and $n h_{X} h_{Y} \rightarrow \infty $.

Now set

\begin{eqnarray*}  \textrm{AMISE}(h_{X},h_{Y}) &  = &  \frac{1}{4\pi n h_{X} h_{Y}} + \frac{h_{X}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial X^{2}}\right)^{2}dxdy \\ & &  {} + \frac{h_{Y}^{4}}{4}\int \left(\frac{\partial ^{2}f}{\partial Y^{2}}\right)^{2}dxdy \end{eqnarray*}

which is the asymptotic mean integrated squared error (AMISE). For fixed n, this has a minimum at $(h_{\mr{AMISE}\_ X}, h_{\mr{AMISE}\_ Y})$ defined as

\[  h_{\mr{AMISE}\_ X} = \left[\frac{\left(\int (\frac{\partial ^{2}f}{\partial Y^{2}})^{2}\right)^{1/4}}{4n\pi \left(\int (\frac{\partial ^{2}f}{\partial X^{2}})^{2}\right)^{5/4}}\right]^{1/6}  \]

and

\[  h_{\mr{AMISE}\_ Y} = \left[\frac{\left(\int (\frac{\partial ^{2}f}{\partial X^{2}})^{2}\right)^{1/4}}{4n\pi \left(\int (\frac{\partial ^{2}f}{\partial Y^{2}})^{2}\right)^{5/4}}\right]^{1/6}  \]
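
These expressions follow from the first-order conditions for minimizing AMISE. Writing $J_{X} = \int (\frac{\partial ^{2}f}{\partial X^{2}})^{2}$ and $J_{Y} = \int (\frac{\partial ^{2}f}{\partial Y^{2}})^{2}$ as shorthand, a brief sketch of that step is

\begin{eqnarray*}  \frac{\partial \, \textrm{AMISE}}{\partial h_{X}} &  = &  -\frac{1}{4\pi n h_{X}^{2} h_{Y}} + h_{X}^{3} J_{X} = 0 \\ \frac{\partial \, \textrm{AMISE}}{\partial h_{Y}} &  = &  -\frac{1}{4\pi n h_{X} h_{Y}^{2}} + h_{Y}^{3} J_{Y} = 0 \end{eqnarray*}

so that $h_{X}^{4} J_{X} = h_{Y}^{4} J_{Y} = \frac{1}{4\pi n h_{X} h_{Y}}$ at the minimum. Eliminating $h_{Y} = h_{X}(J_{X}/J_{Y})^{1/4}$ gives $h_{X}^{6} = J_{Y}^{1/4}/(4n\pi J_{X}^{5/4})$, which is $h_{\mr{AMISE}\_ X}$ above; the expression for $h_{\mr{AMISE}\_ Y}$ follows by symmetry.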

These are the asymptotically optimal bandwidths in the sense that they minimize AMISE, and hence MISE to first order. However, as in the univariate case, these expressions contain the second derivatives of the unknown density $f$ being estimated, and so approximations are required. See Wand and Jones (1993) for further details.
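
To make the formulas concrete, the following NumPy sketch evaluates the closed-form bandwidths for made-up values of the curvature integrals (written here as jx and jy, which are unknown in practice) and checks by a brute-force grid search that they minimize the AMISE expression above.

```python
import numpy as np

def amise(hx, hy, n, jx, jy):
    """AMISE(h_X, h_Y) for the bivariate Gaussian-kernel estimate,
    with jx = int (d^2 f / dX^2)^2 and jy = int (d^2 f / dY^2)^2."""
    return 1.0 / (4 * np.pi * n * hx * hy) + 0.25 * hx**4 * jx + 0.25 * hy**4 * jy

def amise_bandwidths(n, jx, jy):
    """Closed-form minimizers h_AMISE_X, h_AMISE_Y of the AMISE above."""
    hx = (jy**0.25 / (4 * n * np.pi * jx**1.25)) ** (1 / 6)
    hy = (jx**0.25 / (4 * n * np.pi * jy**1.25)) ** (1 / 6)
    return hx, hy

# Illustrative (made-up) values for the unknown curvature integrals
n, jx, jy = 500, 0.3, 1.2
hx_opt, hy_opt = amise_bandwidths(n, jx, jy)

# Brute-force check: the closed form should (approximately) win a grid search
grid = np.linspace(0.05, 1.0, 400)
HX, HY = np.meshgrid(grid, grid, indexing="ij")
i, j = np.unravel_index(np.argmin(amise(HX, HY, n, jx, jy)), HX.shape)
print((hx_opt, hy_opt), (grid[i], grid[j]))
```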