The DATASOURCE Procedure

OUT= Data Set

The OUT= data set can contain the following variables:

  • the BY variables, which identify cross-sectional dimensions when the input data file contains time series replicated for different values of the BY variables. Use the BY variables in a WHERE statement to process the OUT= data set by cross sections. The order in which BY variables are defined in the OUT= data set corresponds to the order in which the data file is sorted.

  • DATE, a SAS date-, time-, or datetime-valued variable that reports the time period of each observation. The values of the DATE variable may span different time ranges for different BY groups. The format of the DATE variable depends on the INTERVAL= option.

  • the periodic time series variables, which are included in the OUT= data set only if they have data in at least one selected BY group and they are not discarded by a KEEP or DROP statement.

  • the event variables, which are included in the OUT= data set if they are not discarded by a KEEP or DROP statement. By default, these variables are not output to the OUT= data set.

The values of BY variables remain constant in each cross section. Observations within each BY group correspond to the sampling of the series variables at the time periods indicated by the DATE variable.
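
As a minimal sketch of processing a single cross section, assume the OUT= data set is named WORK.DSOUT and contains a character BY variable called COUNTRY with a value of '111'. The data set name, the variable name, and the value are all hypothetical and will differ for your data file.

```sas
/* Assumption: WORK.DSOUT is an OUT= data set created by PROC DATASOURCE */
/* and COUNTRY is a (hypothetical) character BY variable in it.          */
data work.onecs;
   set work.dsout;
   where country = '111';   /* keep only one cross section */
run;
```

The resulting data set holds the observations of one BY group only, one observation per time period identified by DATE.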

You can create a set of single indexes for the OUT= data set by using the INDEX option, provided there are BY variables. Under some circumstances, this may increase the efficiency of subsequent PROC and DATA steps that use BY and WHERE statements. However, there is a cost associated with creation and maintenance of indexes. The SAS Language Reference: Concepts lists the conditions under which the benefits of indexes outweigh the cost.
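
The following sketch shows where the INDEX option might be placed. It assumes an IMF International Financial Statistics file read with FILETYPE=IMFIFSP, a fileref named IFSFILE, and a BY variable named COUNTRY; the file name is a placeholder and the other names should be checked against the documentation for your file type.

```sas
/* Assumption: replace the placeholder with the actual path to your data file. */
filename ifsfile 'host-specific-file-name';

/* INDEX requests a set of single indexes on the BY variables of the OUT= data set. */
proc datasource filetype=imfifsp infile=ifsfile
                interval=month
                out=work.ifs index;
run;

/* PROC CONTENTS lists any indexes defined on the data set. */
proc contents data=work.ifs;
run;

/* A later step that subsets on an indexed BY variable may benefit from the index. */
/* COUNTRY and the value '111' are assumed for illustration.                       */
data work.subset;
   set work.ifs;
   where country = '111';
run;
```

Whether the index actually speeds up the later step depends on the size of the data set and the fraction of observations selected.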

With data files containing cross sections, there can be various degrees of overlap among the series variables. One extreme is when all the series variables contain data for all the cross sections. In this case, the output data set is very compact. In the other extreme case, however, the set of time series variables is unique for each cross section, making the output data set very sparse, as depicted in Table 13.4.

Table 13.4: The OUT= Data Set Containing Unique Series for Each BY Group

| BY Variables | Series in first BY group | Series in second BY group | ${\dots }$ | Series in last BY group |
|---|---|---|---|---|
| BY1 ${\dots }$ BYP | F1 F2 F3 ${\dots }$ FN | S1 S2 S3 ${\dots }$ SM | ${\dots }$ | T1 T2 T3 ${\dots }$ TK |
| BY group 1 | DATA is here | | | |
| BY group 2 | | DATA is here | | |
| ${\vdots }$ | | | DATA is here | |
| BY group N | | | | DATA is here |

Data is missing everywhere except on the diagonal.


The data in Table 13.4 can be represented more compactly if cross-sectional information is incorporated into series variable names.
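
One way to picture the compact form is to merge the cross sections side by side on DATE once each series name already identifies its cross section. The sketch below is an illustration under stated assumptions, not a procedure feature: the data set WORK.DSOUT, the BY variable COUNTRY, its values '111' and '112', and the new variable names are all hypothetical, and F1 and S1 stand for series from the first and second BY groups in Table 13.4.

```sas
/* Assumption: within each cross section, observations are in DATE order. */
data work.compact;
   merge work.dsout(where=(country='111') keep=date country f1
                    rename=(f1=f1_111))
         work.dsout(where=(country='112') keep=date country s1
                    rename=(s1=s1_112));
   by date;
   drop country;   /* the cross section is now encoded in the variable names */
run;
```

The result has one observation per time period, with no block of missing values for series that belong to other cross sections.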