The QUANTREG Procedure

Optimization Algorithms

The optimization problem for median regression has been formulated and solved as a linear programming (LP) problem since the 1950s. Variations of the simplex algorithm, especially the method of Barrodale and Roberts (1973), have been widely used to solve this problem. The simplex algorithm is computationally demanding in large statistical applications, and in theory the number of iterations can increase exponentially with the sample size. This algorithm is often useful with data containing no more than tens of thousands of observations.

Several alternatives have been developed to handle $\text{[math]}$ regression for larger data sets. The interior point approach of Karmarkar (1984) solves a sequence of quadratic problems in which the relevant interior of the constraint set is approximated by an ellipsoid. The worst-case performance of the interior point algorithm has been proved to be better than that of the simplex algorithm. More important, experience has shown that the interior point algorithm is advantageous for larger problems.

Like $\text{[math]}$ regression, general quantile regression fits nicely into the standard primal-dual formulations of linear programming.

In addition to the interior point method, various heuristic approaches are available for computing $\text{[math]}$ -type solutions. Among these, the finite smoothing algorithm of Madsen and Nielsen (1993) is the most useful. It approximates the $\text{[math]}$ -type objective function with a smoothing function, so that the Newton-Raphson algorithm can be used iteratively to obtain a solution after a finite number of iterations. The smoothing algorithm extends naturally to general quantile regression.

The QUANTREG procedure implements the simplex, interior point, and smoothing algorithms. The remainder of this section describes these algorithms in more detail.

Simplex Algorithm

Let $\text{[math]}$ , $\text{[math]}$ , $\text{[math]}$ , and $\text{[math]}$ , where $\text{[math]}$ is the nonnegative part of z.

Let $\text{[math]}$ . For the $\text{[math]}$ problem, the simplex approach solves $\text{[math]}$ by reformulating it as the constrained minimization problem

$\text{[math]}$

where e denotes an $\text{[math]}$ vector of ones.
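The role of the nonnegative parts can be checked numerically. The following is a minimal sketch, assuming the usual split of each residual into a positive part and a negative part; the names `nonneg_part`, `split_residuals`, `y`, and `fitted` are illustrative and do not come from the procedure.

```python
def nonneg_part(z):
    """Return the nonnegative part of z, that is, max(z, 0)."""
    return z if z > 0.0 else 0.0


def split_residuals(y, fitted):
    """Split each residual y_i - fitted_i into two nonnegative parts."""
    pos = [nonneg_part(yi - fi) for yi, fi in zip(y, fitted)]
    neg = [nonneg_part(fi - yi) for yi, fi in zip(y, fitted)]
    return pos, neg


# Hypothetical responses and fitted values, chosen only for illustration.
y = [1.0, 4.0, 2.5, 7.0]
fitted = [2.0, 3.0, 2.5, 5.5]

pos, neg = split_residuals(y, fitted)

# The sum of absolute residuals equals the sum of all entries of both parts,
# which is what multiplying by a vector of ones computes.
l1_direct = sum(abs(yi - fi) for yi, fi in zip(y, fitted))
l1_from_parts = sum(pos) + sum(neg)
```

Because the two parts are never positive at the same time, the absolute value disappears from the objective and only linear terms with nonnegativity constraints remain.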

Let $\text{[math]}$ , $\text{[math]}$ , and $\text{[math]}$ , where $\text{[math]}$ . The reformulation presents a standard LP problem:

$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

This problem has the dual formulation

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

which can be simplified as

$\text{[math]}$

By setting $\text{[math]}$ , the problem becomes

$\text{[math]}$

For quantile regression, the minimization problem is $\text{[math]}$ , and a similar set of steps leads to the dual formulation

$\text{[math]}$
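The quantile regression objective can be illustrated with the standard check loss. This is a sketch under the assumption that the loss has the usual form r(tau - 1[r < 0]); the intercept-only model and the names `check_loss` and `quantile_objective` are hypothetical simplifications, not the procedure's implementation.

```python
def check_loss(r, tau):
    """Standard check loss: r * (tau - 1) for r < 0, r * tau otherwise."""
    return r * (tau - (1.0 if r < 0.0 else 0.0))


def quantile_objective(beta0, ys, tau):
    """Objective for an intercept-only model with fitted value beta0."""
    return sum(check_loss(y - beta0, tau) for y in ys)


# Hypothetical sample; for an intercept-only model a minimizer can always
# be found among the observed values, so a search over them is enough.
ys = [3.0, 1.0, 4.0, 1.5, 9.0, 2.0, 6.0]
tau = 0.25
best = min(ys, key=lambda c: quantile_objective(c, ys, tau))
```

With tau = 0.5 the check loss is half the absolute residual, so the median regression problem is recovered as a special case.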

The QUANTREG procedure solves this LP problem by using the simplex algorithm of Barrodale and Roberts (1973). This algorithm solves the primal LP problem (P) in two stages, which exploit the special structure of the coefficient matrix $\text{[math]}$. The first stage picks the columns in $\text{[math]}$ or $\text{[math]}$ as pivotal columns. The second stage interchanges the columns in $\text{[math]}$ or $\text{[math]}$ as basis or nonbasis columns, respectively. The algorithm obtains an optimal solution by executing these two stages iteratively. Moreover, because of the special structure of $\text{[math]}$, only the main data matrix $\text{[math]}$ needs to be stored in memory.

Although this special version of the simplex algorithm was introduced for median regression, it extends naturally to quantile regression for any given quantile and even to the entire quantile process (Koenker and d’Orey 1994). It greatly reduces the computing time required by the general simplex algorithm, and it is suitable for data sets with fewer than 5,000 observations and 50 variables.

Interior Point Algorithm

There are many variations of interior point algorithms. The QUANTREG procedure uses the primal-dual predictor-corrector algorithm implemented by Lustig, Marsten, and Shanno (1992). The text by Roos, Terlaky, and Vial (1997) provides more information about this particular algorithm. The following brief introduction to this algorithm uses the notation in the first reference.

To be consistent with the conventional LP setting, let $\text{[math]}$ , $\text{[math]}$ , and let u be the general upper bound. The linear program to be solved is

: $\text{[math]}$
subject to: $\text{[math]}$
: $\text{[math]}$

To simplify the computation, this is treated as the primal problem. The problem has n variables. The index i denotes a variable number, and k denotes an iteration number. If k is used as a subscript or superscript, it denotes "of iteration k."

Let v be the primal slack so that $\text{[math]}$ . Associate dual variables w with these constraints. The interior point algorithm solves the system of equations to satisfy the Karush-Kuhn-Tucker (KKT) conditions for optimality:

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
where: $\text{[math]}$ (that is, $\text{[math]}$ if $\text{[math]}$ , $\text{[math]}$ otherwise)
: $\text{[math]}$

These are the conditions for feasibility, with the addition of complementarity conditions $\text{[math]}$ and $\text{[math]}$ . $\text{[math]}$ must occur at the optimum. Complementarity forces the optimal objectives of the primal and dual to be equal, $\text{[math]}$ , as

: $\text{[math]}$

: $\text{[math]}$
: $\text{[math]}$
Therefore
: $\text{[math]}$

The duality gap, $\text{[math]}$ , is used to measure the convergence of the algorithm. You can specify a tolerance for this convergence criterion with the TOLERANCE= option in the PROC statement.
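The link between the duality gap and complementarity can be verified on a small example. The sketch below assumes the conventional bounded LP form (equality constraints, variables between zero and an upper bound, a free dual vector `t`, and nonnegative dual slacks `s` and `w`); all numbers are hypothetical and the variable names are chosen for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical LP data: one equality constraint, three bounded variables.
A = [[1.0, 1.0, 1.0]]
b = [1.5]
u = [1.0, 1.0, 1.0]

z = [0.5, 0.5, 0.5]                      # primal point with A z = b
v = [ui - zi for ui, zi in zip(u, z)]    # primal slack, so z + v = u
t = [0.2]                                # dual variable for A z = b
s = [0.3, 0.1, 0.4]                      # dual slack paired with z
w = [0.1, 0.2, 0.05]                     # dual variable paired with v

# Choose c so that the assumed dual constraint A't + s - w = c holds.
c = [sum(A[i][j] * t[i] for i in range(len(A))) + s[j] - w[j]
     for j in range(len(z))]

# Gap measured from the two objectives ...
gap_from_objectives = dot(c, z) - dot(b, t) + dot(u, w)
# ... equals the total complementarity at any feasible primal-dual point.
gap_from_complementarity = dot(z, s) + dot(v, w)
```

At a feasible point the two quantities agree, so driving the complementarity products to zero is the same as closing the gap between the primal and dual objectives.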

Before the optimum is reached, it is possible for a solution $\text{[math]}$ to violate the KKT conditions in one of several ways:

Primal bound constraints can be broken, $\text{[math]}$ .
Primal constraints can be broken, $\text{[math]}$ .
Dual constraints can be broken, $\text{[math]}$ .
Complementarity conditions are unsatisfied, $\text{[math]}$ and $\text{[math]}$ .

The interior point algorithm works by using Newton’s method to find a direction $\text{[math]}$ to move from the current solution $\text{[math]}$ toward a better solution:

$\text{[math]}$

$\text{[math]}$ is the step length and is assigned a value as large as possible, but not so large that a $\text{[math]}$ or $\text{[math]}$ is "too close" to zero. You can control the step length with the KAPPA= option in the PROC statement.
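A common way to choose such a step length is a damped ratio test. The following sketch is an assumption about the general technique, not the procedure's exact rule; `max_step` and its `damping` argument are hypothetical names, with `damping` playing a role loosely analogous to a step-length control parameter.

```python
def max_step(x, dx, damping=0.99):
    """Largest step in (0, 1] that keeps x + step * dx strictly positive.

    Only components that decrease (dx_i < 0) can reach zero, so the step
    is limited by the smallest ratio -x_i / dx_i, shrunk by `damping`.
    """
    ratios = [-xi / dxi for xi, dxi in zip(x, dx) if dxi < 0.0]
    if not ratios:
        return 1.0
    return min(1.0, damping * min(ratios))


# Hypothetical current values and search direction.
x = [1.0, 2.0, 0.5]
dx = [-2.0, 1.0, -0.25]

alpha = max_step(x, dx)
x_new = [xi + alpha * dxi for xi, dxi in zip(x, dx)]
```

A damping factor closer to 1 takes longer steps but leaves the iterate nearer the boundary, which is the trade-off a step-length control governs.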

The QUANTREG procedure implements a predictor-corrector variant of the primal-dual interior point algorithm. First, Newton’s method is used to find a direction $\text{[math]}$ in which to move. This is known as the affine step.

In iteration k, the affine step system that must be solved is

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$

Therefore, the computations involved in solving the affine step are

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$

: $\text{[math]}$
: $\text{[math]}$

$\text{[math]}$ is the step length as before.

The success of the affine step is gauged by calculating the complementarity of $\text{[math]}$ and $\text{[math]}$ at $\text{[math]}$ and comparing it with the complementarity at the starting point $\text{[math]}$. If the affine step reduces the complementarity by a substantial amount, the need for centering is not great, and a value close to zero is assigned to $\text{[math]}$ in a second linear system (described next), which is used to determine a centering vector. If, however, the affine step is unsuccessful, then centering is deemed beneficial, and a value close to 1.0 is assigned to $\text{[math]}$. In other words, the value of $\text{[math]}$ is adaptively altered depending on the progress made toward the optimum.

The following linear system is solved to determine a centering vector $\text{[math]}$ from $\text{[math]}$ :

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
where
: $\text{[math]}$ , complementarity at the start of the iteration
: $\text{[math]}$ , the affine complementarity
: $\text{[math]}$ , the average complementarity
: $\text{[math]}$
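The adaptive choice of the centering weight can be sketched as follows. This assumes the widely used heuristic from Mehrotra-style predictor-corrector methods, in which the weight is a power (commonly the cube) of the ratio of affine complementarity to starting complementarity; the exponent, the averaging over 2n product pairs, and the function names are assumptions made for illustration.

```python
def centering_parameter(comp_start, comp_affine, power=3):
    """Map the progress of the affine step to a weight in [0, 1].

    A small ratio (good affine step) gives a weight near 0; a ratio
    near 1 (poor affine step) gives a weight near 1.
    """
    ratio = comp_affine / comp_start
    ratio = min(1.0, max(0.0, ratio))
    return ratio ** power


def target_complementarity(sigma, comp_start, n):
    """Scaled average complementarity over the 2n product pairs (assumed)."""
    return sigma * comp_start / (2 * n)


# Hypothetical complementarity values for a problem with n = 3 variables.
comp_start = 0.575
sigma_good = centering_parameter(comp_start, 0.05)   # affine step did well
sigma_poor = centering_parameter(comp_start, 0.55)   # affine step stalled
mu_good = target_complementarity(sigma_good, comp_start, 3)
```

Raising the ratio to a power makes the weight fall off quickly when the affine step succeeds, so centering is applied mainly when it is needed.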

Therefore, the computations involved in solving the centering step are

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$

Then

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$

: $\text{[math]}$
: $\text{[math]}$
: $\text{[math]}$

where, as before, $\text{[math]}$ is the step length assigned a value as large as possible, but not so large that a $\text{[math]}$ , $\text{[math]}$ , $\text{[math]}$ , or $\text{[math]}$ is "too close" to zero.

Although the predictor-corrector variant entails solving two linear systems instead of one, fewer iterations are usually required to reach the optimum. The additional overhead of the second linear system is small because the matrix $\text{[math]}$ has already been factorized in order to solve the first linear system.

You can specify the starting point with the INEST= option in the PROC statement. By default, the starting point is set to be the least squares estimate.

Smoothing Algorithm

To minimize the sum of the absolute residuals $\text{[math]}$ , the smoothing algorithm approximates the nondifferentiable function $\text{[math]}$ by the following smooth function, which is referred to as the Huber function:

$\text{[math]}$

where

$\text{[math]}$

Here $\text{[math]}$ , and the threshold $\text{[math]}$ is a positive real number. The function $\text{[math]}$ is continuously differentiable and a minimizer $\text{[math]}$ of $\text{[math]}$ is close to a minimizer $\text{[math]}$ of $\text{[math]}$ when $\text{[math]}$ is close to zero.
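How closely the smooth function tracks the absolute value can be seen in a short sketch. It assumes the common form of the Huber function (quadratic within the threshold, linear beyond it); the name `huber` and the sample points are illustrative.

```python
def huber(t, gamma):
    """Huber function: t^2 / (2 gamma) if |t| <= gamma, else |t| - gamma / 2."""
    if abs(t) <= gamma:
        return t * t / (2.0 * gamma)
    return abs(t) - gamma / 2.0


gamma = 0.1
points = [-1.0, -0.1, -0.05, 0.0, 0.05, 0.1, 1.0]

# The approximation error |t| - huber(t, gamma) lies between 0 and gamma / 2,
# so it shrinks to zero uniformly as the threshold decreases.
errors = [abs(t) - huber(t, gamma) for t in points]
```

The two branches meet with matching value and slope at the threshold, which is why the function is continuously differentiable even though the absolute value is not.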

The advantage of the smoothing algorithm as described in Madsen and Nielsen (1993) is that the $\text{[math]}$ solution $\text{[math]}$ can be detected when $\text{[math]}$ is small. In other words, it is not necessary to let $\text{[math]}$ converge to zero in order to find a minimizer of $\text{[math]}$ . The algorithm terminates before going through the entire sequence of values of $\text{[math]}$ that are generated by the algorithm. Convergence is indicated by no change of the status of residuals $\text{[math]}$ as $\text{[math]}$ goes through this sequence.

The smoothing algorithm extends naturally from $\text{[math]}$ regression to general quantile regression; refer to Chen (2007). The function

$\text{[math]}$

can be approximated by the smooth function

$\text{[math]}$

where

$\text{[math]}$
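For a general quantile level the smoothing is asymmetric. The sketch below assumes the piecewise form in which the quadratic piece covers residuals between (tau - 1) * gamma and tau * gamma, with linear pieces on either side; `smoothed_check` and `check_loss` are illustrative names, and the exact boundary convention is an assumption.

```python
def check_loss(r, tau):
    """Standard check loss: r * (tau - 1) for r < 0, r * tau otherwise."""
    return r * (tau - (1.0 if r < 0.0 else 0.0))


def smoothed_check(t, gamma, tau):
    """Smooth approximation of the check loss with threshold gamma."""
    lo = (tau - 1.0) * gamma
    hi = tau * gamma
    if t <= lo:
        # Linear piece for large negative residuals.
        return t * (tau - 1.0) - 0.5 * (tau - 1.0) ** 2 * gamma
    if t >= hi:
        # Linear piece for large positive residuals.
        return t * tau - 0.5 * tau ** 2 * gamma
    # Quadratic piece around zero.
    return t * t / (2.0 * gamma)


tau, gamma = 0.25, 0.1
# Far from zero, the smooth function differs from the check loss by a constant.
offset_pos = check_loss(1.0, tau) - smoothed_check(1.0, gamma, tau)
offset_neg = check_loss(-1.0, tau) - smoothed_check(-1.0, gamma, tau)
```

The offsets on the two sides differ unless tau is 0.5, which reflects the unequal weights that the check loss gives to positive and negative residuals.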

The function $\text{[math]}$ is determined by whether $\text{[math]}$ , $\text{[math]}$ , or $\text{[math]}$ . These inequalities divide $\text{[math]}$ into subregions separated by the parallel hyperplanes $\text{[math]}$ and $\text{[math]}$ . The set of all such hyperplanes is denoted by $\text{[math]}$ :

$\text{[math]}$

Define the sign vector $\text{[math]}$ as

$\text{[math]}$

and introduce

$\text{[math]}$
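The sign vector and the derived weights can be computed directly from the residuals. This sketch assumes that a residual is marked -1 below the lower threshold, 1 above the upper threshold, and 0 in between, and that the weight is one minus the squared sign; the treatment of residuals exactly on a threshold and the function names are assumptions.

```python
def sign_vector(residuals, gamma, tau):
    """Classify each residual by the piece of the smooth function it falls in."""
    lo = (tau - 1.0) * gamma
    hi = tau * gamma
    signs = []
    for r in residuals:
        if r <= lo:
            signs.append(-1)
        elif r >= hi:
            signs.append(1)
        else:
            signs.append(0)
    return signs


def quadratic_weights(signs):
    """Weight 1 for residuals in the quadratic piece, 0 otherwise (assumed)."""
    return [1 - s * s for s in signs]


# Hypothetical residuals evaluated at two thresholds.
residuals = [-0.5, -0.01, 0.0, 0.02, 0.3]
s_wide = sign_vector(residuals, gamma=0.1, tau=0.25)
s_narrow = sign_vector(residuals, gamma=0.05, tau=0.25)
w_wide = quadratic_weights(s_wide)
```

Comparing the sign vectors at successive thresholds shows which residuals change status as the threshold shrinks; here the fourth residual leaves the quadratic piece.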

Therefore,

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

yielding

$\text{[math]}$

where $\text{[math]}$ is the diagonal $\text{[math]}$ matrix with diagonal elements $\text{[math]}$ , $\text{[math]}$ , $\text{[math]}$ , and $\text{[math]}$ .

The gradient of $\text{[math]}$ is given by

$\text{[math]}$

and for $\text{[math]}$ the Hessian exists and is given by

$\text{[math]}$

The gradient is a continuous function in $\text{[math]}$ , whereas the Hessian is piecewise constant.

Following Madsen and Nielsen (1993), the vector s is referred to as a $\text{[math]}$ -feasible sign vector if there exists $\text{[math]}$ with $\text{[math]}$ . If s is $\text{[math]}$ -feasible, then $\text{[math]}$ is defined as the quadratic function $\text{[math]}$ that is derived from $\text{[math]}$ by substituting s for $\text{[math]}$ . Thus, for any $\text{[math]}$ with $\text{[math]}$ ,

$\text{[math]}$

In the domain $\text{[math]}$

$\text{[math]}$

For each $\text{[math]}$ and $\text{[math]}$ , there can be one or several corresponding quadratics $\text{[math]}$ . If $\text{[math]}$ then $\text{[math]}$ is characterized by $\text{[math]}$ and $\text{[math]}$ , but for $\text{[math]}$ the quadratic is not unique. Therefore, a reference

$\text{[math]}$

determines the quadratic.

Again following Madsen and Nielsen (1993), let

: $\text{[math]}$ be a feasible reference if s is a $\text{[math]}$ -feasible sign vector with $\text{[math]}$ , and
: $\text{[math]}$ be a solution reference if it is feasible and $\text{[math]}$ minimizes $\text{[math]}$ .

The smoothing algorithm for minimizing $\text{[math]}$ is based on minimizing $\text{[math]}$ for a set of decreasing $\text{[math]}$ . For each new value of $\text{[math]}$ , information from the previous solution is used. Finally, when $\text{[math]}$ is small enough, a solution can be found by the modified Newton-Raphson algorithm as stated by Madsen and Nielsen (1993):

: find an initial solution reference $\text{[math]}$
: repeat
: decrease $\text{[math]}$
: find a solution reference $\text{[math]}$
: until $\text{[math]}$
: $\text{[math]}$ is the solution.

By default, the initial solution reference is found by letting $\text{[math]}$ be the least squares solution. Alternatively, you can specify the initial solution reference with the INEST= option in the PROC statement. Then $\text{[math]}$ and s are chosen according to these initial values.

There are several approaches for determining a decreasing sequence of values of $\text{[math]}$. The QUANTREG procedure uses a strategy by Madsen and Nielsen (1993). The computation involved is negligible compared with the Newton-Raphson step. You can control the ratio of consecutive decreasing values of $\text{[math]}$ with the RRATIO= suboption of the ALGORITHM= option in the PROC statement. By default,

$\text{[math]}$

For $\text{[math]}$ and quantile regression, the smoothing algorithm is very efficient and competitive, especially for a fat data set, that is, when $\text{[math]}$ and $\text{[math]}$ is dense. See Chen (2007) for the complete smoothing algorithm and further details.