Running PROC TRANSPOSE inside the Database

Overview

The TRANSPOSE procedure works within a third-party database through the SAS In-Database Code Accelerator. This in-database processing allows PROC TRANSPOSE to achieve high performance through the distributed computation provided by a massively parallel processing (MPP) database.
The TRANSPOSE procedure performs a dynamic transformation of the data in which the characteristics of the output table, specifically the number and names of the variables as well as their types, are determined from the variable values as well as the characteristics of the input table. This dynamic behavior is achieved by a two-pass process. During the first pass, rows of the input table are examined to determine the characteristics of the output table. During the second pass, the work of transposing the data is performed. Parallel operation within the MPP database speeds up both the first and second pass. In the first pass, rows are examined in parallel and in place, according to the manner in which they are already partitioned across the nodes of a cluster. In the second pass, rows are repartitioned to form BY groups that are then processed independently and in parallel.
Both the first and second passes of the in-database processing are performed by executing a DS2 program within the SAS Embedded Process, which resides within the nodes of the cluster. The database provides the SAS Embedded Process with the ability to read data from tables and write data to tables. The SAS Embedded Process provides an execution context for the DS2 program. Because the two passes of work are expressed in the DS2 language, columns of the tables are cast to variables that have DS2 data types. Data type support within DS2 is more extensive than that provided by the traditional SAS system. Therefore, the ability of the TRANSPOSE procedure to preserve data types and values of input data with the transposed output data is enhanced when the TRANSPOSE procedure is executed inside the database.

Requirements

The TRANSPOSE procedure can perform in-database processing if the following requirements are met:
  • The following products are licensed at your site:
    • third maintenance release of SAS 9.4 or later
    • SAS In-Database Code Accelerator (Hadoop or Teradata)
    • SAS/ACCESS Interface to your data provider (Hadoop or Teradata)
  • The SAS Embedded Process is properly installed and configured on the cluster.
  • A client is installed and configured for the data provider.
  • The input and output files are in Teradata or Hadoop.
    For in-database processing to occur, the data to be transposed must reside within the database. When the results of the transposition are directed to a table in the database, in-database processing results in very little network traffic between SAS and the data provider. In this case, only the metadata that is required to determine the characteristics of the output table is transferred back to SAS. Large-volume data transfers, which can be very slow, are avoided.
  • A valid SAS/ACCESS LIBNAME statement must be provided to communicate with the data provider.
    For more information about the SAS/ACCESS LIBNAME statement, see SAS/ACCESS for Relational Databases: Reference.
  • The following data provider client and SAS environment variables must be set:
    • For the Teradata client, the COPLIB environment variable.
    • For Hadoop, the SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH environment variables.
    For more information about the COPLIB environment variable, see your Teradata documentation. For more information about the SAS Hadoop environment variables, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
  • The SQLGENERATION system option is set to DBMS and the INDB =YES option is specified in the TRANSPOSE procedure statement.
    Currently, in addition to the SQLGENERATION system option being set to allow in-database processing, the INDB=YES option must be specified in the TRANSPOSE statement. In the future, this requirement might be relaxed or deprecated.
    For more information about the SQLGENERATION system option, see SAS/ACCESS for Relational Databases: Reference. For more information about the TRANSPOSE statement, see Base SAS Procedures Guide.
  • One or more variables is specified in a BY statement.
    In an MPP environment, the volume of data processed is expected to be large. To benefit from distributed computation, the data must be sub-divided among nodes in the MPP cluster such that it can be processed in parallel, independently on each node. For the TRANSPOSE procedure, this parallelism is achieved by partitioning the data across the nodes using one or more BY variables. Ideally, the cardinality of the BY variable or combination of BY variables should be large enough so that each BY group is small enough that it can be processed effectively on one node of the cluster. An attempt to process large BY groups can fail if the memory resources required exceed those available on a single node.
    For more information about the BY statement, see the TRANSPOSE procedure in Base SAS Procedures Guide
  • One or more variables is specified in an ID statement.
    Mathematical transposition depends on a defined order of the rows and columns in a matrix. However, in data processing, a defined order of the rows and columns does not necessarily exist. SAS data sets, accessed through the Base engine, maintain the order of observations within the file containing the data so that observations can be read from the data set in an order that is consistent from one read to the next. Therefore, when running PROC TRANSPOSE with a Base SAS data set as input, the first observation of the input data set is transposed to the first result variable of the output data set, the second observation to the second result variable, and so on. However, when reading a table or the results of a query through a SAS/ACCESS engine into SAS, no such consistency of order is guaranteed with an MPP data source. Further, non-deterministic ordering is very common in an MPP environment because of the variations introduced by parallel processing.
    Because there is no guaranteed consistent ordering of input rows from a database, the TRANSPOSE procedure requires the specification of one or more ID variables for in-database processing. The ID variable (or variables) serve to map an input row to an output column. The assignment of input row to output column is based on the value of the ID variable(s) and is accomplished by forming the name of the output column from the value of the ID variable(s). Simply described, the names are formed by concatenating the formatted values of the ID variables in order of the appearance of the variables in the ID statement and then post-processing the concatenation to ensure that the result is a valid SAS name. The SAS name is validated according to SAS variable naming rules and the value of the VALIDVARNAME option. The unique set of these output column names and representative raw values of the ID variables from which they were formed are discovered by a DS2 program, executed within the SAS EP, and returned to SAS during the first pass of transposition inside the database. Subsequently, these names are used in the creation of the second DS2 program, which performs the actual transposition within the DBMS.
    For more information about the ID statement, see the TRANSPOSE procedure in Base SAS Procedures Guide.
  • The LET option is specified in the TRANSPOSE procedure statement.
    When executing in the traditional fashion in Base SAS, the TRANSPOSE procedure stops with an error when it detects that two input rows within a BY group have ID variable values that form the same output variable name. This behavior is designed to prevent unintentional loss of data during a transposition by disallowing two input rows to be transposed to a single output variable. When both rows form the same name, if not prevented, the values of the second rows overwrite the values of the first row and result in data loss. In the case that the loss of data is acceptable and intentional, the LET option can be specified in the TRANSPOSE procedure statement. Specifying the LET option allows execution to continue, issuing a warning for the duplicate observation, and letting the values of the last observation that formed the duplicate name to overwrite previously transposed values. However, to achieve high performance when executing inside the database, the information gathered and transferred back to SAS regarding the variable names formed from the ID variables is minimized. Therefore, the information is insufficient to detect these duplicate input rows to output variable mappings. Not being detectable, execution does not stop, and error or warning messages are not issued for duplicates. Because of this, the LET option, specified in the TRANSPOSE procedure statement, is required for in-database processing. This requirement is meant to serve as an explicit acknowledgment of the potential for data loss by the user.
    Because of the non-deterministic nature of row ordering within a database or file system, if two or more input rows within a BY group map to a single output column, the column values in the transposed output table are also not deterministic and can vary from one run to the next. The values surviving in the output table are associated with the last row mapped from the input table, but the last row mapped can vary due to differences in apparent row ordering. This variation can make comparison of the results of two different transpositions of the same input data difficult. Because data can be lost, even though a user is forced to acknowledge this potential through the use of the LET option, and output results can vary from run to run in the presence of duplicates, the best practice is to ensure that BY and ID variables are chosen such that there is a well-defined, one-to-one mapping between input observations and output variables. Such assurance can be gained through knowledge and inspection of the data being analyzed.
    For more information about the LET option, see the TRANSPOSE procedure in Base SAS Procedures Guide.

Data Type Conversion

In accordance with traditional TRANSPOSE behavior, when no VAR statement is specified, the analysis variables participating in the transposition are those that are numeric in the Base SAS context. To involve other variables in the transposition, explicitly specify them in a VAR statement.
The Base SAS language has two types of variables: numeric and character. Database management and file systems, on the other hand, have many different variable data types. When the TRANSPOSE procedure is run inside the database or file system, the SAS In-Database Code Accelerator is the interface between the two systems. The SAS In-Database Code Accelerator supports more data types than traditional Base SAS but likely fewer than the number of data types that are supported by the database or file system. When running within the SAS Embedded Process, DS2 converts and preserves data types as best it can. When processed inside the database, the TRANSPOSE procedure runs DS2 programs within the SAS Embedded Process. The DS2 programs uses neither the SAS variable types nor the database or file system data types. It uses DS2 data types.
The work of the TRANSPOSE procedure, introduces an additional difficulty. The difficulty is the determination of a data type for output variables that can preserve the values of the input variables. A common data type must be determined for all output variables to hold the transposed results of the input variables. If all input analysis variables (those listed in the VAR statement or implied by the lack of one) are of the same data type, then the data type of all result variables in the output is exactly the same as the input data type.
For example, if all input variables to be transposed are of the double-precision, floating-point data type, then all result variables in the output are double-precision, floating-point. However, if the input analysis variables are not of the same type, then a common type capable of preserving the values of all input variables must be determined. For example, similar to traditional TRANSPOSE operation, if there are two input analysis variables, and one is double-precision, floating-point while the other variable is character, then all result output variables must be of the character type. Using DS2 in the SAS Embedded Process complicates this common type determination because DS2 supports more than just the two traditional SAS variable types. For in-database processing, the TRANSPOSE procedure first attempts to preserve the data type of the input analysis variables in the output. If that is not possible, the TRANSPOSE procedure attempts to determine a common representative type that allows the full precision of input values to be preserved in the output. For numeric data types, if no common output type can preserve full numerical precision, then the TRANSPOSE procedure uses the approximate double-precision, floating-point data type to ensure that the magnitude and range of values are preserved.

In-Database Processing Results

If in-database processing is successful, you will see the following message appear in the SAS log after submission of the TRANSPOSE procedure.
NOTE: The transposition was performed in the DBMS
Some procedure and data set options can prevent in-database processing. If in-database processing is requested but a transposition cannot be performed inside the database, then traditional execution proceeds in the SAS session.
To determine why in-database processing was not attempted, set the MSGLEVEL=I system option and look for additional WARNING messages in the SAS log. For more information about MSGLEVEL, see Using the MSGLEVEL Option to Control Messaging.
If in-database processing is requested but fails, and the reason for the failure is not clear, then set the SQL_IP_TRACE=ALL system option, re-run the job, and look for additional details in the SAS log.