Considerations and Limitations

Greenplum, Hadoop, and Teradata

  • If the thread program is run inside the database, the number of threads is set by the SAS In-Database Code Accelerator. When this occurs, the THREADS= argument in the SET FROM statement in the data program has no effect.
  • When a matrix is declared in a thread program, each thread program has its own, individual instance of a matrix. The DS2 matrix package does not support data partitioning between nodes or threads to perform parallel matrix operations. Instead, each thread performs the matrix operations on its own instance of the matrix.
  • The DS2 program fails if you try to use an empty format that you defined with PROC FORMAT.
  • In-database processing does not occur when the following methods are used to load data. Instead, the data and thread programs are run on the client.
    • using an SQLSTMT package
    • using an initialized hash package
    • using an HTTP package
  • Multi-table SET statements and a SET statement with embedded SQL code are allowed. Here is an example.
    set dblib.invoice dblib.paysched;
    
    Note: The librefs in the SET statement must be the same (for example, they must have the same schema, permissions, or use the same catalog). Otherwise, the data and thread programs are run on the client.
    Note: Only one SET statement is allowed. If more than one SET statement is used in the thread program, the thread program is not run inside the database. Instead, the thread program runs on the client.
    Note: Using multi-table SET statements, embedded SQL, or the MERGE statement requires Hive.
    Note: Use of the multi-table SET statement or a SET statement with embedded SQL with Hadoop requires Hive .13 or later.
  • Using an unrecognized catalog in the SET statement causes the thread program to run on the client.
  • MERGE statements are allowed when using the SAS In-Database Code Accelerator.
    Note: Tables with the SPD Engine or HDMD format do not support the MERGE statement.
    Note: Use of the MERGE statement with Hadoop requires Hive .13 or later.
  • Only one SET FROM statement is allowed in the data program. Otherwise, an error occurs.
  • Some data sources choose their one preferred order for columns in the output table from DS2. For example, on Hive, the BYPARTITION columns are always moved to the end of the table. This is common as various data sources try to optimize their performance.
    The order of declaration in a DS2 program might not be used as the order of columns in the data source. For example, if you use keep K1- - K4;, you might not get the columns as you expect or you might get an error because K1 appears after K4 in the CREATE TABLE statement.
  • Custom null values in delimited tables are not supported.
  • Null values are converted to blank values.
  • The NOT IN operator returns null values.

Greenplum

  • Only the thread program runs inside the database.

Hadoop

  • Both the data and thread program can run inside the database if the output table from the data program resides in Hadoop.
    Note: If the data program contains any data transformations beyond creating output table data, the data program is not run inside the database.
    You can use a different LIBNAME statement for the input and output table if the input and output librefs meet the following conditions:
    • The librefs are on the same Hadoop cluster.
    • Both files must be accessible by Hive, or both files must be accessible in HDFS by means of an HDMD file.
    • When the connection strings are compared, they must be identical in value and case except for these values:
      • SCHEMA
      • HDFS_METADIR
      • HDFS_TEMPDIR
      • HDFS_PERMDIR
    If the output table from the data program does not reside in Hadoop, only the thread program is run inside the cluster.
  • If you use a HAVING clause to format output column data, the format is not applied to the output column data when the data is written back to a file. The format is specified in the output column metadata. The SAS/ACCESS Engine for Hadoop is currently unable to understand column format. Therefore, PROC PRINT or PROC CONTENTS do not print or display the contents with the format specified in the column's metadata.
  • A Hive STRING data type is always converted to a VARCHAR data type using the following rules:
    • STRING -> VARCHAR(65355)
    • STRING + SASFMT:CHAR(n) -> VARCHAR(n)
    • STRING + SASFMT:VARCHAR(n) -> VARCHAR(n)
    • STRING + DBMAX_TEXT -> VARCHAR(DBMAX_TEXT)
  • The Hive user needs Read and Write access to the TempDir and the Destination Warehouse directories. In addition, the MapReduce user needs Read and Write permission.
  • When working with delimited files, data is textualized using the closest fitting format. The data is stored in a textualized manner. Therefore, some discrepancies might occur, and the transformation causes alteration of precision. For example, Hadoop would create an HDFS text representation of a floating point DOUBLE value. After retrieving the value, the resulting DOUBLE value could be slightly different from the starting value.
  • The BYPARTITION=NO option in the PROC DS2 statement specifies that the input data is not re-partitioned even if there is a BY statement and enables two-stage aggregation. When using the SAS In-Database Code Accelerator for Hadoop, this option setting is ignored and the BYPARTITION=YES is used. Alternate thread stage aggregation techniques such as a hash object should be used instead of BYPARTITION=NO.
  • Hadoop reserved keywords cannot be used for table names. Quoting table names that are Hadoop reserved keywords does not work.

Teradata

  • Both the data and thread program run inside the database if the output table from the data program resides in Teradata.
    Note: If the data program contains any data transformations beyond creating output table data, the data program is not run inside the database.
    You can use a different LIBNAME statement for the input and output table if the input and output librefs meet the following conditions:
    • The librefs are in the same Teradata database.
    • When the connection strings are compared, they must be identical in value and case except for these values:
      • CATALOG
      • SCHEMA
    If the output table from the data program does not reside in Teradata, only the thread program is run inside the database.