Potential Result Set Differences When Processing Null Data

When your data contains null values or when internal processing generates intermediate data sets that contain null values, you might receive different result sets depending on whether the processing is done by SAS or by the DBMS. Although in many cases this does not present a problem, it is important to understand how these differences occur.
Most relational database systems have a special value called null, which means an absence of information and is analogous to a SAS missing value. SAS/ACCESS translates SAS missing values to DBMS null values when creating DBMS tables from within SAS. Conversely, SAS/ACCESS translates DBMS null values to SAS missing values when reading DBMS data into SAS.
However, there is an important difference in the behavior of DBMS null values and SAS missing values.
  • A DBMS null value is interpreted as the absence of data, so you cannot sort a DBMS null value or evaluate it with standard comparison operators.
  • A SAS missing value is interpreted as its internal floating-point representation because SAS supports 28 missing values (where a period (.) is the most common missing value). Because SAS supports multiple missing values, you can sort a SAS missing value and evaluate it with standard comparison operators.
This means that SAS and the DBMS interpret null values differently, which has significant implications when SAS/ACCESS passes queries to a DBMS for processing. This can be an issue in these situations:
  • when filtering data (for example, in a WHERE clause, a HAVING clause, or an outer join ON clause). SAS interprets null values as missing; many DBMS exclude null values from consideration. For example, if you have null values in a DBMS column that is used in a WHERE clause, your results might differ depending on whether the WHERE clause is processed in SAS or is passed to the DBMS for processing. This is because the DBMS removes null values from consideration in a WHERE clause, but SAS does not.
  • when using certain functions. For example, if you use the MIN aggregate function on a DBMS column that contains null values, the DBMS does not consider the null values, but SAS interprets the null values as missing. This interpretation affects the result.
  • when submitting outer joins where internal processing generates nulls for intermediate result sets.
  • when sorting data. SAS sorts null values low; most DBMSs sort null values high. See Sorting DBMS Data for more information.
For example, create a simple data set that consists of one observation and one variable.
libname myoralib oracle user=testuser password=testpass;
data myoralib.table;
x=.;       /*create a missing value */
run;
Then print the data set using a WHERE clause, which SAS/ACCESS passes to the DBMS for processing.
proc print data=myoralib.table;
   where x<0;
run;
The log indicates that no observations were selected by the WHERE clause, because Oracle interprets the missing value as the absence of data, and does not evaluate it with the less-than (<) comparison operator.
When there is the potential for inconsistency, consider using one of these strategies.
  • Use the DIRECT_SQL= LIBNAME option to control whether SAS or the DBMS handles processing.
  • Use the SQL pass-through facility to ensure that the DBMS handles processing.
  • Add the is not null expression to WHERE clauses and ON clauses to ensure that you obtain the same result regardless of whether SAS or the DBMS does the processing.
Use the NULLCHAR= data set option to specify how the DBMS interprets missing SAS character values when updating DBMS data or inserting rows into a DBMS table.
You can use the first of these strategies to force SAS to process the data in this example.
libname myoralib oracle user=testuser password=testpass
        direct_sql=nowhere;   /* forces SAS to process WHERE clauses */
data myoralib.table;
x=.;       /*create a missing value */
run;
You can then print the data set using a WHERE clause
proc print data=myoralib.table;
   where x<0;
run;
This time the log indicates that one observation was read from the data set because SAS evaluates the missing value as satisfying the less-than-zero condition in the WHERE clause.