This example demonstrates how you can use the QUANTSELECT procedure to select covariate effects for quantile regression. The data set under investigation contains salary and performance information for Major League Baseball (MLB) players, excluding pitchers, who played at least one game in both the 1986 and 1987 seasons. The salaries (Time Inc., 1987) are for the 1987 season, and the performance measures are from 1986 (Reichler, 1987).

    data baseball;
       length team $ 12;
       input name $ 1-18 nAtBat nHits nHome nRuns nRBI nBB
             yrMajor crAtBat crHits crHome crRuns crRbi crBB
             League $ Division $ Team $ Position $
             nOuts nAssts nError Salary;
       label name="Player's Name"
             nAtBat="Times at Bat in 1986"
             nHits="Hits in 1986"
             nHome="Home Runs in 1986"
             nRuns="Runs in 1986"
             nRBI="RBIs in 1986"
             nBB="Walks in 1986"
             yrMajor="Years in the Major Leagues"
             crAtBat="Career times at bat"
             crHits="Career Hits"
             crHome="Career Home Runs"
             crRuns="Career Runs"
             crRbi="Career RBIs"
             crBB="Career Walks"
             League="League at the end of 1986"
             Division="Division at the end of 1986"
             Team="Team at the end of 1986"
             Position="Position(s) in 1986"
             nOuts="Put Outs in 1986"
             nAssts="Assists in 1986"
             nError="Errors in 1986"
             Salary="1987 Salary in $ Thousands";
       if League='American' and Division='East' then Div='AE';
       if League='American' and Division='West' then Div='AW';
       if League='National' and Division='East' then Div='NE';
       if League='National' and Division='West' then Div='NW';
       logSalary = log(Salary);
       datalines;
    Allanson, Andy      293 66 1 30 29 14 1 293 66 1 30 29 14 American East Cleveland C 446 33 20 .
    Ashby, Alan         315 81 7 24 38 39
    ... more lines ...
    Wilson, Willie      631 170 9 77 44 31 11 4908 1457 30 775 357 249 American West KansasCity CF 408 4 3 1000
    ;

Suppose you want to investigate how the MLB players’ salaries for the 1987 season depend on performance measures for the players’ previous season and MLB careers. As a starting point for such an analysis, you can use the following statements to obtain a parsimonious conditional median model at $\tau = 0.5$:

    proc quantselect data=baseball;
       class Div;
       model Salary = nAtBat nHits nHome nRuns nRBI nBB
                      yrMajor crAtBat crHits crHome crRuns crRbi crBB
                      nAssts nError nOuts Div
             / selection=lasso(adaptive stop=aic choose=sbc sh=7);
    run;

The SELECTION=LASSO(ADAPTIVE) option in the MODEL statement specifies the adaptive LASSO method (Zou, 2006), which controls the effect selection process. The STOP=AIC option specifies that Akaike’s information criterion (AIC) be used to determine the stopping condition. The CHOOSE=SBC option specifies that the Schwarz Bayesian information criterion (SBC) be used to determine the final selected model. The SH=7 option sets the stop horizon to 7 steps, which requests that the selection process be stopped whenever the STOP= criterion values at steps $s+1, \ldots, s+7$ are all worse than the value at step $s$ for some step $s$.
Figure 78.1 shows the “Model Information” table, which indicates the effect selection settings. You can see that the default quantile type is single level, so this effect selection applies only to $\tau = 0.5$.
Figure 78.1: Model Information
Model Information | |
---|---|
Data Set | WORK.BASEBALL |
Dependent Variable | Salary |
Selection Method | Adaptive LASSO |
Quantile Type | Single Level |
Stop Criterion | AIC |
Choose Criterion | SBC |
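The first program does not name a quantile level, and the table above reports a single-level fit. As a sketch, assuming the default level is the median, the same selection can be requested with the level written out. The QUANTILES= spelling is borrowed from the two-quantile program later in this example.

```sas
/* Sketch: the first program with the quantile level stated explicitly.   */
/* Assumption: the default single-level fit is the median (0.5), so this  */
/* should reproduce the same selection path.                              */
proc quantselect data=baseball;
   class Div;
   model Salary = nAtBat nHits nHome nRuns nRBI nBB
                  yrMajor crAtBat crHits crHome crRuns crRbi crBB
                  nAssts nError nOuts Div
         / quantiles=0.5
           selection=lasso(adaptive stop=aic choose=sbc sh=7);
run;
```

Writing the level out makes the program easier to compare with later runs that request other quantiles.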
Figure 78.2 summarizes the effect selection process, which starts with an intercept-only model at step 0. At step 1, the effect that corresponds to career runs (crRuns) is added to the model, which reduces the AIC value from 2691.6511 to 2510.7297. Step 10 has the minimum AIC, and step 7 has the minimum SBC. This ordering is expected: for a sample of this size, the SBC penalizes each additional effect more heavily than the AIC does, so it favors a smaller model.
Figure 78.2: Selection Summary
Selection Summary | ||||
---|---|---|---|---|
Step | Effect Entered | Number Effects In | AIC | SBC |
0 | Intercept | 1 | 2691.6511 | 2695.2232 |
1 | crRuns | 2 | 2510.7297 | 2517.8740 |
2 | nHits | 3 | 2470.4807 | 2481.1971 |
3 | crHome | 4 | 2463.5953 | 2477.8839 |
4 | nBB | 5 | 2463.7806 | 2481.6414 |
5 | nOuts | 6 | 2455.6212 | 2477.0541 |
6 | Div AW | 7 | 2451.4609 | 2476.4660 |
7 | nAtBat | 8 | 2445.0446 | 2473.6218* |
8 | crBB | 9 | 2445.5432 | 2477.6926 |
9 | nHome | 10 | 2443.4818 | 2479.2033 |
10 | nRuns | 11 | 2442.6036* | 2481.8973 |
11 | crAtBat | 12 | 2442.8387 | 2485.7046 |
12 | yrMajor | 13 | 2443.5374 | 2489.9754 |
13 | nError | 14 | 2445.2085 | 2495.2187 |
14 | Div NE | 15 | 2446.4042 | 2499.9865 |
15 | nRBI | 16 | 2448.2003 | 2505.3547 |
16 | nAssts | 17 | 2449.9731 | 2510.6997 |
* Optimal Value Of Criterion |
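The gap between the two criteria in Figure 78.2 can be checked by hand. The following is a sketch that assumes the common penalty forms, in which both criteria share the same goodness-of-fit term and the AIC adds $2p$ while the SBC adds $p \ln n$. Here $p$ is the number of effects in the model and $n$ is the number of observations used; neither the formulas nor $n$ is stated in this example.

$$
\mathrm{SBC} - \mathrm{AIC} = p\,(\ln n - 2)
$$

In the table, the gap is $2695.2232 - 2691.6511 = 3.5721$ at step 0 ($p = 1$) and $2473.6218 - 2445.0446 = 28.5772$ at step 7 ($p = 8$), which is about $3.5721$ per effect. Under the assumed forms this corresponds to $\ln n \approx 5.5721$, or $n \approx 263$. Each added effect therefore costs roughly $3.57$ more under the SBC than under the AIC, which is consistent with the SBC reaching its minimum earlier (step 7) than the AIC (step 10).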
Figure 78.3 shows that the selection process stopped at a local minimum of the STOP= criterion, which occurs at step 10. Because SH=7 is specified, the effect selection process is stopped at step 10: all the AIC values for step 11 through step 17 are no less than the AIC at step 10. Step 17 is omitted from the selection summary table because it is the last step.
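The stop-horizon rule can be replayed on the AIC column of Figure 78.2. The DATA steps below are an illustrative sketch, not the procedure's internal logic. The data set names `aic_path` and `horizon_check` are hypothetical. Because step 17 is not listed in the table, only six of the seven look-ahead steps can be checked for step 10.

```sas
/* AIC values for steps 0-16, copied from the selection summary in Figure 78.2. */
data aic_path;
   input step aic;
   datalines;
0  2691.6511
1  2510.7297
2  2470.4807
3  2463.5953
4  2463.7806
5  2455.6212
6  2451.4609
7  2445.0446
8  2445.5432
9  2443.4818
10 2442.6036
11 2442.8387
12 2443.5374
13 2445.2085
14 2446.4042
15 2448.2003
16 2449.9731
;

/* Find the first step s whose AIC is not improved upon by any of the     */
/* next SH steps that are listed in the table.                            */
data horizon_check;
   array c[0:16] _temporary_;
   do until (eof);                    /* load the whole path into memory  */
      set aic_path end=eof;
      c[step] = aic;
   end;
   sh = 7;                            /* stop horizon, as in SH=7         */
   found = 0;
   do s = 0 to 15 while (not found);
      n_ahead = min(sh, 16 - s);      /* later steps available in table   */
      n_worse = 0;
      do j = 1 to n_ahead;
         if c[s + j] >= c[s] then n_worse = n_worse + 1;
      end;
      if n_worse = n_ahead then do;   /* no listed later step is better   */
         found = 1;
         output;
      end;
   end;
   keep s n_ahead n_worse;
run;

proc print data=horizon_check noobs;
run;
```

The single row that is written has `s=10`, with `n_ahead=6` and `n_worse=6`. Earlier candidates such as step 3 and step 7 are rejected because a later step within the horizon has a smaller AIC.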
Figure 78.4 shows how the final selected model is determined. CHOOSE=SBC is specified in this example, so the model at step 7 is chosen as the final selected model.
Figure 78.5 shows the final selected effects and Figure 78.6 shows the parameter estimates for the final selected model.
Figure 78.5: Selected Effects
Selected Effects: | Intercept nAtBat nHits nBB crHome crRuns nOuts Div AW |
---|---|
Figure 78.6: Parameter Estimates
Parameter Estimates | |||
---|---|---|---|
Parameter | DF | Estimate | Standardized Estimate |
Intercept | 1 | -18.187539 | 0 |
nAtBat | 1 | -1.582714 | -0.500417 |
nHits | 1 | 7.044354 | 0.686968 |
nBB | 1 | 2.053726 | 0.097911 |
crHome | 1 | 1.429926 | 0.272726 |
crRuns | 1 | 0.425955 | 0.316167 |
nOuts | 1 | 0.282803 | 0.175489 |
Div AW | 1 | -57.671778 | -0.056862 |
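To see how the estimates in Figure 78.6 translate into a fitted value, the following sketch evaluates the conditional median for the last record shown in the DATA step (Willie Wilson). It assumes that the Div AW parameter multiplies a 0/1 indicator for the American League West. The variable names `divAW` and `medianSalary` are hypothetical.

```sas
/* Illustrative only: fitted conditional median salary for one record.    */
data _null_;
   /* 1986 and career values taken from the Wilson, Willie data line */
   nAtBat = 631; nHits = 170; nBB = 31;
   crHome = 30;  crRuns = 775; nOuts = 408;
   divAW  = 1;   /* assumption: Div AW enters as a 0/1 indicator */

   medianSalary = -18.187539
                  - 1.582714 * nAtBat
                  + 7.044354 * nHits
                  + 2.053726 * nBB
                  + 1.429926 * crHome
                  + 0.425955 * crRuns
                  + 0.282803 * nOuts
                  - 57.671778 * divAW;

   put medianSalary= 8.2;   /* written to the SAS log */
run;
```

The value works out to roughly 675 (in thousands of dollars), compared with a recorded 1987 salary of 1000 for this player.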
Quantile regression can fit a conditional quantile model at any quantile level $\tau \in (0, 1)$, so it can describe the entire distribution of a response variable conditional on covariate effects. To further investigate the effects that might influence the MLB players’ salaries, you can also conduct effect selection at $\tau = 0.1$ and $\tau = 0.9$, which correspond to low-end salaries and high-end salaries, respectively. The following statements use the same selection settings that are used in the previous program:

    proc quantselect data=baseball;
       class Div;
       model Salary = nAtBat nHits nHome nRuns nRBI nBB
                      yrMajor crAtBat crHits crHome crRuns crRbi crBB
                      nAssts nError nOuts Div
             / quantiles=0.1 0.9
               selection=lasso(adaptive stop=aic choose=sbc sh=7);
    run;

Figure 78.7 shows the effect selection summary with $\tau = 0.1$.
Figure 78.7: Selection Summary: $\tau = 0.1$
Selection Summary | |||||
---|---|---|---|---|---|
Step | Effect Entered | Effect Removed | Number Effects In | AIC | SBC |
0 | Intercept | | 1 | 2008.3489 | 2011.9211 |
1 | crRuns | | 2 | 1918.7675 | 1925.9118 |
2 | nHits | | 3 | 1897.2425 | 1907.9590* |
3 | yrMajor | | 4 | 1897.2476 | 1911.5362 |
4 | crBB | | 5 | 1896.1765 | 1914.0373 |
5 | nBB | | 6 | 1894.1257 | 1915.5587 |
6 | Div AW | | 7 | 1891.0577 | 1916.0628 |
7 | crHome | | 8 | 1892.2575 | 1920.8348 |
8 | | crHome | 7 | 1891.0577 | 1916.0628 |
9 | nAtBat | | 8 | 1889.6365* | 1918.2137 |
10 | nHome | | 9 | 1891.2595 | 1923.4089 |
11 | crHome | | 10 | 1890.7463 | 1926.4679 |
12 | nRBI | | 11 | 1892.7421 | 1932.0358 |
13 | Div NW | | 12 | 1892.9745 | 1935.8404 |
14 | | nRBI | 11 | 1891.3217 | 1930.6154 |
15 | crAtBat | | 12 | 1891.8810 | 1934.7468 |
* Optimal Value Of Criterion |
Figure 78.8 shows the parameter estimates for the final selected model with $\tau = 0.1$. You can see from Figure 78.8 that low-end salaries for MLB players depend mainly on career runs and hits in 1986.
Figure 78.8: Parameter Estimates: $\tau = 0.1$
Parameter Estimates | |||
---|---|---|---|
Parameter | DF | Estimate | Standardized Estimate |
Intercept | 1 | -4.397043 | 0 |
nHits | 1 | 0.878564 | 0.085678 |
crRuns | 1 | 0.327350 | 0.242977 |
Figure 78.9 shows the effect selection summary with $\tau = 0.9$.
Figure 78.9: Selection Summary: $\tau = 0.9$
Selection Summary | |||||
---|---|---|---|---|---|
Step | Effect Entered | Effect Removed | Number Effects In | AIC | SBC |
0 | Intercept | | 1 | 2436.7289 | 2440.3011 |
1 | crHits | | 2 | 2197.4349 | 2204.5792 |
2 | crRbi | | 3 | 2183.6148 | 2194.3313 |
3 | nHits | | 4 | 2113.2757 | 2127.5643 |
4 | | crRbi | 3 | 2127.8632 | 2138.5797 |
5 | crRbi | | 4 | 2113.2757 | 2127.5643 |
6 | | crRbi | 3 | 2127.8632 | 2138.5797 |
7 | crRbi | | 4 | 2113.2757 | 2127.5643 |
8 | crHome | | 5 | 2099.2203 | 2117.0811 |
9 | | crRbi | 4 | 2099.3891 | 2113.6777 |
10 | crRbi | | 5 | 2099.2203 | 2117.0811 |
11 | | crRbi | 4 | 2099.3891 | 2113.6777 |
12 | nOuts | | 5 | 2067.1926 | 2085.0533 |
13 | Div AW | | 6 | 2048.2393 | 2069.6723 |
14 | crRuns | | 7 | 2028.8040 | 2053.8090 |
15 | nAtBat | | 8 | 2012.8195 | 2041.3968 |
16 | | crHits | 7 | 2017.0290 | 2042.0341 |
17 | crRbi | | 8 | 2009.3551 | 2037.9324 |
18 | crAtBat | | 9 | 2011.2415 | 2043.3908 |
19 | | crRbi | 8 | 2011.4053 | 2039.9825 |
20 | crRbi | | 9 | 2011.2415 | 2043.3908 |
21 | | crAtBat | 8 | 2009.3551 | 2037.9324 |
22 | crAtBat | | 9 | 2011.2415 | 2043.3908 |
23 | nBB | | 10 | 2004.5033 | 2040.2249 |
24 | | crAtBat | 9 | 2003.1023 | 2035.2517* |
25 | crAtBat | | 10 | 2004.5033 | 2040.2249 |
26 | | crAtBat | 9 | 2003.1023 | 2035.2517 |
27 | crAtBat | | 10 | 2004.5033 | 2040.2249 |
28 | nError | | 11 | 2004.2230 | 2043.5167 |
29 | crHits | | 12 | 2003.0544 | 2045.9203 |
30 | Div NE | | 13 | 2001.9603 | 2048.3983 |
31 | Div AE | | 14 | 2001.8349* | 2051.8451 |
32 | nRuns | | 15 | 2003.5961 | 2057.1784 |
33 | nHome | | 16 | 2004.2721 | 2061.4266 |
34 | nRBI | | 17 | 2006.0023 | 2066.7289 |
35 | yrMajor | | 18 | 2007.9975 | 2072.2963 |
36 | crBB | | 19 | 2009.9514 | 2077.8223 |
37 | nAssts | | 20 | 2011.9095 | 2083.3525 |
* Optimal Value Of Criterion |
Figure 78.10 shows the parameter estimates for the final selected model with $\tau = 0.9$.
Figure 78.10: Parameter Estimates: $\tau = 0.9$
Parameter Estimates | |||
---|---|---|---|
Parameter | DF | Estimate | Standardized Estimate |
Intercept | 1 | 92.893875 | 0 |
nAtBat | 1 | -1.858170 | -0.587509 |
nHits | 1 | 8.155573 | 0.795335 |
nBB | 1 | 3.392794 | 0.161751 |
crHome | 1 | 3.191472 | 0.608700 |
crRuns | 1 | 1.394317 | 1.034939 |
crRbi | 1 | -0.913371 | -0.664951 |
nOuts | 1 | 0.437241 | 0.271323 |
Div AW | 1 | -167.110005 | -0.164764 |
To illustrate visually how the model evolves through the selection process, the QUANTSELECT procedure provides the coefficient plot, the average check loss plot, and several criterion plots in either packed or unpacked form. You can request these plots by using the PLOTS= option. The following statements request all the plots for the baseball data at $\tau = 0.1$; they also use the STOP=AIC criterion, the CHOOSE=SBC criterion, and the SH=7 option:

    ods graphics on;

    proc quantselect data=baseball plots=all;
       class Div;
       model Salary = nAtBat nHits nHome nRuns nRBI nBB
                      yrMajor crAtBat crHits crHome crRuns crRbi crBB
                      nAssts nError nOuts Div
             / quantiles=0.1
               selection=lasso(adaptive stop=aic choose=sbc sh=7);
    run;

Figure 78.11 shows the progression of the parameter estimates as the selection process proceeds.
Figure 78.12 shows the progression of the average check losses as the selection process proceeds.
Figure 78.13 shows the progression of four effect selection criteria as the selection process proceeds.
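The average check loss that Figure 78.12 tracks is not defined in this example. As a reminder, and assuming the standard quantile regression definitions, it can be written as follows. The symbols $y_i$, $\mathbf{x}_i$, $\hat{\boldsymbol\beta}(\tau)$, $n$, and the abbreviation $\mathrm{ACL}$ are notation introduced here for illustration.

$$
\rho_\tau(r) = r\,\bigl(\tau - I(r < 0)\bigr), \qquad
\mathrm{ACL}(\tau) = \frac{1}{n} \sum_{i=1}^{n} \rho_\tau\bigl(y_i - \mathbf{x}_i^{\top} \hat{\boldsymbol\beta}(\tau)\bigr)
$$

At $\tau = 0.1$, a negative residual (fitted value above the observed salary) is weighted by $0.9$ and a positive residual by $0.1$, which is what pulls the fit toward the low end of the salary distribution.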