Considerations and Limitations
Greenplum, Hadoop, and Teradata
- If the thread program is run inside the database, the number of threads is set by the SAS In-Database Code Accelerator. When this occurs, the THREADS= argument in the SET FROM statement in the data program has no effect, as the sketch following this item shows.
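The following minimal sketch shows where the THREADS= argument sits. The library dblib and the table and column names are illustrative, not taken from this documentation.
   proc ds2;
   thread work.compute / overwrite=yes;
      dcl double taxed;
      method run();
         set dblib.invoice;      /* each thread processes a slice of the input */
         taxed = amount * 1.05;  /* amount is an assumed input column */
      end;
   endthread;

   data dblib.results / overwrite=yes;
      dcl thread work.compute t;
      method run();
         /* THREADS=4 is honored only when this data program runs on the
            client; inside the database, the accelerator sets the count */
         set from t threads=4;
      end;
   enddata;
   run;
   quit;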
- When a matrix is declared in a thread program, each thread program has its own instance of the matrix. The DS2 matrix package does not support data partitioning between nodes or threads to perform parallel matrix operations. Instead, each thread performs the matrix operations on its own instance of the matrix.
- The DS2 program fails if you try to use an empty format that you defined with PROC FORMAT.
- In-database processing does not occur when the following methods are used to load data. Instead, the data and thread programs are run on the client:
  - using an initialized hash package
- Multi-table SET statements and a SET statement with embedded SQL code are allowed. Here is an example:
     set dblib.invoice dblib.paysched;
Note: The librefs in the SET statement must be the same (for example, they must have the same schema and permissions, or use the same catalog). Otherwise, the data and thread programs are run on the client.
Note: Only one SET statement is allowed. If more than one SET statement is used in the thread program, the thread program is not run inside the database. Instead, the thread program runs on the client.
Note: Using multi-table SET statements, embedded SQL, or the MERGE statement requires Hive.
Note: Use of the multi-table SET statement or a SET statement with embedded SQL with Hadoop requires Hive 0.13 or later.
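A SET statement with embedded SQL looks like this minimal sketch; the table and column names are illustrative:
   method run();
      /* the braces enclose SQL that runs against the data source */
      set {select invoice_id, amount from dblib.invoice};
   end;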
- Using an unrecognized catalog in the SET statement causes the thread program to run on the client.
- MERGE statements are allowed when using the SAS In-Database Code Accelerator, as sketched after this item.
Note: Tables with the SPD Engine or HDMD format do not support the MERGE statement.
Note: Use of the MERGE statement with Hadoop requires Hive 0.13 or later.
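A minimal sketch of a MERGE in a DS2 program; the tables and the BY column are illustrative, and the sketch assumes that both tables are sorted by the key:
   method run();
      merge dblib.invoice dblib.paysched;  /* librefs must match */
      by invoice_id;                       /* assumed shared key column */
   end;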
- Only one SET FROM statement is allowed in the data program. Otherwise, an error occurs.
- Some data sources choose their own preferred order for columns in the output table from DS2. For example, on Hive, the BYPARTITION columns are always moved to the end of the table. This behavior is common as various data sources try to optimize their performance.
The order of declaration in a DS2 program might not be used as the order of columns in the data source. For example, if you use keep K1--K4;, you might not get the columns that you expect, or you might get an error because K1 appears after K4 in the CREATE TABLE statement.
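In the following sketch (table and column names are illustrative), the positional range in the KEEP statement is resolved against the column order in the data source, which might differ from the declaration order:
   data dblib.subset / overwrite=yes;
      method run();
         set dblib.wide;
         keep K1--K4;  /* fails or selects unexpected columns if the data
                          source moved K1 after K4 */
      end;
   enddata;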
- Custom null values in delimited tables are not supported.
- Null values are converted to blank values.
- The NOT IN operator returns null values.
Greenplum
- Only the thread program runs inside the database.
Hadoop
- Both the data and thread programs can run inside the database if the output table from the data program resides in Hadoop.
Note: If the data program contains any data transformations beyond creating the output table data, the data program is not run inside the database.
You can use different LIBNAME statements for the input and output tables if the input and output librefs meet the following conditions (see the sketch after this item):
  - The librefs are on the same Hadoop cluster.
  - Both files are accessible by Hive, or both files are accessible in HDFS by means of an HDMD file.
  - When the connection strings are compared, they are identical in value and case, apart from a small set of values that are exempt from the comparison.
If the output table from the data program does not reside in Hadoop, only the thread program is run inside the cluster.
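For example, a pair of librefs such as the following satisfies these conditions; the server, port, and user shown are placeholders:
   libname hdin  hadoop server='cluster01' port=10000 user=sasuser;
   libname hdout hadoop server='cluster01' port=10000 user=sasuser;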
- If you use a HAVING clause to format output column data, the format is not applied to the output column data when the data is written back to a file. The format is only specified in the output column metadata. The SAS/ACCESS engine for Hadoop is currently unable to interpret column formats. Therefore, PROC PRINT and PROC CONTENTS do not print or display the contents with the format that is specified in the column's metadata.
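The following sketch shows the kind of declaration involved; the names are illustrative:
   data dblib.results / overwrite=yes;
      dcl double amount having format comma12.2;  /* format is recorded in the
                                                     column metadata only */
      method run();
         set dblib.invoice;
      end;
   enddata;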
- A Hive STRING data type is always converted to a VARCHAR data type using the following rules:
  - STRING + SASFMT:CHAR(n) -> VARCHAR(n)
  - STRING + SASFMT:VARCHAR(n) -> VARCHAR(n)
  - STRING + DBMAX_TEXT -> VARCHAR(DBMAX_TEXT)
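As one way to supply the SASFMT annotation that the first two rules consume, a Hive table property can be set through SQL pass-through. This is a sketch only: the connection options and the table and column names are placeholders, and it assumes the SASFMT:column-name table property convention used by SAS/ACCESS for Hadoop.
   proc sql;
      connect to hadoop (server='cluster01' port=10000 user=sasuser);
      execute (
         alter table invoice
            set tblproperties ('SASFMT:descr'='VARCHAR(100)')
      ) by hadoop;
      disconnect from hadoop;
   quit;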
- The Hive user needs Read and Write access to the TempDir and the Destination Warehouse directories. In addition, the MapReduce user needs Read and Write permission.
- When working with delimited files, data is textualized using the closest-fitting format and is stored in that textual form. Therefore, some discrepancies might occur, and the transformation can alter precision. For example, Hadoop creates an HDFS text representation of a floating-point DOUBLE value. After the value is retrieved, the resulting DOUBLE value can differ slightly from the starting value.
- The BYPARTITION=NO option in the PROC DS2 statement specifies that the input data is not re-partitioned even if there is a BY statement, and it enables two-stage aggregation. When using the SAS In-Database Code Accelerator for Hadoop, this option setting is ignored and BYPARTITION=YES is used. Use an alternate thread-stage aggregation technique, such as a hash object, instead of BYPARTITION=NO; see the sketch after this item.
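The following sketch shows the two-stage pattern: each thread produces a partial aggregate, and the data program combines the partials. It uses a retained program-scope variable for a single overall sum; for per-group aggregates, a hash package in the thread program plays the same role. The library, table, and column names are illustrative, and the sketch assumes that the explicit OUTPUT statements disable the implicit output at the end of each RUN iteration.
   proc ds2;
   thread work.partial / overwrite=yes;
      dcl double part_total;  /* program scope: retained across rows */
      method init();
         part_total = 0;
      end;
      method run();
         set dblib.invoice;
         part_total = part_total + amount;  /* amount is an assumed column */
      end;
      method term();
         output;  /* one partial sum per thread */
      end;
   endthread;

   data dblib.totals / overwrite=yes;
      dcl thread work.partial t;
      dcl double grand_total;
      method init();
         grand_total = 0;
      end;
      method run();
         set from t;  /* second stage: combine the thread partials */
         grand_total = grand_total + part_total;
      end;
      method term();
         output;  /* single grand total */
      end;
   enddata;
   run;
   quit;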
- Hadoop reserved keywords cannot be used for table names. Quoting table names that are Hadoop reserved keywords does not work.
Teradata
- Both the data and thread programs run inside the database if the output table from the data program resides in Teradata.
Note: If the data program contains any data transformations beyond creating the output table data, the data program is not run inside the database.
You can use different LIBNAME statements for the input and output tables if the input and output librefs meet the following conditions (see the sketch after this item):
  - The librefs are in the same Teradata database.
  - When the connection strings are compared, they are identical in value and case, apart from a small set of values that are exempt from the comparison.
If the output table from the data program does not reside in Teradata, only the thread program is run inside the database.
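For example, a pair of librefs such as the following satisfies these conditions; the server, database, and credentials are placeholders:
   libname tdin  teradata server='tdprod' database=sales user=sasuser password=XXXXXX;
   libname tdout teradata server='tdprod' database=sales user=sasuser password=XXXXXX;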