Troubleshooting :: SAS(R) Data Loader 2.4 for Hadoop: User's Guide

Active Directory (LDAP) Authentication

If Active Directory and LDAP (Lightweight Directory Access Protocol) are used to protect your Hadoop cluster, an LDAP user and password must be specified in the Hadoop connection for SAS Data Loader. For more information about the Hadoop connection, see Hadoop Configuration Panel.

Oozie does not support LDAP authentication. SAS Data Loader uses the SQOOP and Oozie components installed with the Hadoop cluster to move data to and from a DBMS. Accordingly, when LDAP authentication is used with your Hadoop cluster, directives that rely on Oozie, such as Copy Data To Hadoop, do not receive the authentication benefits provided by LDAP. However, the operation of these directives is otherwise unaffected.

Change the File Format of Hadoop Target Tables

In Hadoop file system (HDFS), tables are stored as one or more files. Each file is formatted according to the Output Table Format option, which is specified in each file. In SAS Data Loader, when you create a new target table in Hadoop, the Output Table Format option is set by the value of the Output table format field.

You can change the default value of the Output table format field in the SAS Data Loader Configuration window. In any given directive, you can override the default value using the Action menu icon in the Target Table task.

The default format is applied to all new target tables that are created with SAS Data Loader. To override the default format in a new table or an existing table, you select a different format in the directive and run the job.

To change the default value of the Output table format field, go to the SAS Data Loader directives page, click the More icon

, and select Configuration. In the Configuration window, click General Preferences. In the General Preferences panel, change the value of Output table format.

To override the default value of the Output table format field for a given target in a given directive, open the directive and proceed to the Target Table task. Select a target table, and then click the Action menu Action Menu Icon

for that target table. Select Advanced Options, and then, in the Advanced Options window, set the value of Output table format. The format that you select will apply only to the selected target table. The default table format is not changed. The default format will continue to be applied to all new target tables.

The available values of the Output table format field are defined as follows:

Use HIVE default

specifies that the new target table receives the Output Table Format option value that is specified in HDFS. This is the default value for the Output table format field in SAS Data Loader.

Text

specifies that the new target table is formatted as a series of text fields that are separated by delimiters. For this option, you select a value for the Delimiter field. The default value of the Delimiter field is (Use HIVE default). You can also select the value Comma, Space, Tab, or Other. If you select Other, then you enter a delimiter value. Valid values consist of single ASCII characters that are numbered between 0 and 127 (decimal). An ACII character number can be specified as the delimiter using a three-digit octal value between 0 and 177.

Parquet

specifies the Parquet format, which is optimized for nested data. The Parquet algorithm is considered to be more efficient that using flattened nested namespaces. The Parquet format requires HCatalog to be enabled on the Hadoop cluster. Hadoop administrators can refer to the topic "Additional Configuration Needed to Use HCatalog File Format" in the SAS 9.4 In-Database Products: Administrator’s Guide.

Note: If your cluster uses a MapR distribution of Hadoop, or if the selected run-time target is Spark, then the Parquet output table format is not supported.

ORC

specifies the Optimized Row Columnar format, which is a columnar format that efficiently manages large amounts of data in Hive and HDFS.

Sequence

specifies the SequenceFile output format, which enables Hive to efficiently run MapReduce.

Consult your Hadoop administrator for advice about output file formats. Testing might be required to establish the format that has the highest efficiency on your Hadoop cluster.

Change the Maximum Length for SAS Character Columns

By default, the character columns of the source tables of these directives are expanded or truncated to 1024 characters in length. The valid range for the maximum length option is 1–32,767 characters. The default length should perform well in most cases, though there might be situations where a larger value is required.

Note: As you increase the maximum length for SAS character columns, you also increase the likelihood that performance will be affected.

The affected directives are as follows:

Cleanse Data
Transform Data
Transpose Data
Load Data to LASR
Copy Data to Hadoop
Copy Data from Hadoop

To change the default maximum length for SAS character columns for all new directives, go to the SAS Data Loader directives page, click the Action menu Action Menu Icon

, and select Configuration. In the Configuration window, select General Preferences, and specify the desired length for SAS character columns.

Note: If you change the default maximum length of SAS columns, the new value applies only to new directives.

You can override the default maximum length for SAS character columns in individual directives, without changing the default. In one of the directives listed above, open the Source Table task, click the Action menu, and select Advanced Options. In the Advanced Options window, specify the desired length for SAS character columns.

Note: The directives Cleanse Data and Transform Data can be enabled to run in the Spark run-time environment. When Spark support is enabled for one of these directives, the maximum length of SAS columns can be determined by the value of a configuration option. To learn about the configuration option and the change in behavior, see String Truncation in Spark-Enabled Directives.

Change the Temporary Storage Location

If the default temporary storage directory for SAS Data Loader is not appropriate for some reason, you can change that directory. For example, some SAS Data Loader directives might fail to run if they cannot write to the temporary directory. If that happens, ask your Hadoop administrator if the sticky bit has been set on the default temporary directory (typically /tmp). When a directory's sticky bit is set, only the directory's owner, a file’s owner, or the root user can rename or delete the directory or files in the directory. This is a security measure to avoid deletion of critical folders and their contents, even when other users have full permissions. If that is the case, specify a new location for temporary storage. Click More Menu Icon

to open the Configuration window. In the Storage Settings panel, use the SAS HDFS temporary storage location field, as described in Storage Settings Panel.

Discover New Columns Added to a Source after Job Execution

When you add columns to a source table, any directives that need to use the new columns need to discover them. To make the new columns visible in a directive, open the Source Table task, click the source table again, and click Next. The new columns will then be available for use in the body of the directive, in a transformation or query, for example.

Avoid Using Reserved Keywords in Column Names

For column names, avoid using words that are DS2 or DBMS reserved keywords. For some directives, if you use a reserved keyword for the name of a column that is the target of the directive, it can result in a run-time error. For more information, see Naming Requirements for Schemas, Tables, and Columns.

Hive Limit of 127 Expressions per Table

Due to a limitation in the Hive database, tables can contain a maximum of 127 expressions. When the 128th expression is read, the directive fails and the SAS log receives a message similar to the following:

ERROR: java.sql.SQLException: Error while processing statement: FAILED:
Execution Error, return
       code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
ERROR: Unable to execute Hadoop query.
ERROR: Execute error.
SQL_IP_TRACE: None of the SQL was directly passed to the DBMS.

The Hive limitation applies anytime a table is read as part of a directive. For SAS Data Loader, the error can occur in aggregations, profiles, when viewing results, and when viewing sample data.

Override the Default Hive Storage Location for Target Tables

When you work with directives that create target tables, those tables are stored in a directory location in the Hadoop file system. The default location is defined in the Storage Settings Panel of the Configuration window.

Follow these steps to override the default Hive storage location for an individual directive:

Proceed through the initial tasks for the directive as usual. For example, if you are using the Transform Data directive, you would select a source table and specify any transformations for the data.
When you reach the Target Table task in the directive, click to open the Advanced Options window.
On the General Preferences page, select Specify alternate storage location for the Hive storage location setting, and then click to open the Select Directory window.
Navigate to a folder where you want to store the target table and click OK. You can also create a new folder, if needed.

Note: To use the alternate location, you must have appropriate permissions to the selected directory.
Continue through the remaining tasks for the directive to submit the job.

Tip

Due to a defect in org.apache.sqoop.teradata.TeradataConnManager, an insert into an existing Teradata table at an alternative location is not supported for HortonWorks or any distribution that uses org.apache.sqoop.teradata.TeradataConnManager.

Unsupported Hive Data Types and Values

The Hive database in Hadoop identifies table columns by name and data type. To access a column of data, SAS Data Loader first converts the Hadoop column name and data type into its SAS equivalent (numeric or character.) When the transformation is complete, SAS Data Loader writes the data into the target table using the original Hadoop column name and data type.

If your target data is incorrectly formatted, then you might have encountered a data type conversion error.

The Hive database supports a Boolean data type. SAS does not support the Boolean data type in Hive at this time. Boolean columns in source tables will not be available for selection in SAS Data Loader.

The BIGINT data type in Hive supports integer values larger than those that are currently supported in SAS. BIGINT values that exceed +/–9,223,372,036,854,775,807 generate a stack overflow error in SAS.

Complex data types are not supported by SAS Data Loader.

Although SAS Data Loader does not generate HiveQL UNION statements, you can submit them in the directive Run a Hadoop SQL Program. It is also possible to add UNION statements to the code that is generated by the directives Query or Join Data, or Sort and De-Duplicate Data. (Your version of Hive must be new enough to support UNION statements.)

Restarting a Session after Time-out

SAS Data Loader records periods of inactivity in the user interface. After a period of continuous inactivity, the current web page receives a session time-out warning message in a window. If you do not provide input within three minutes after you receive the warning, the current web page is replaced by the Session Time-out page. You can restart your session by clicking the text Return to the SAS Data Loader application.

When a session terminates, any directives that you did not save or run are lost.

To open an unsaved directive that you ran before your session terminated, follow these steps:

Open the Run Status directive.
Locate the entry for your unsaved directive.
If the unsaved directive is still running, click the Refresh button.
If the directive continues to run, either click Stop in the action menu , or wait for the completion of the run.
In the action menu, select Open to open the directive.
In the open directive, select Save from the title bar.

Replace User Preferences after User ID Changes in MapR or Kerberos Environments

Overview

If your cluster uses a MapR distribution of Hadoop, or if your cluster uses Kerberos authentication, then user preferences are dropped when the user ID changes. A new user ID invalidates the existing user preferences, which are associated with the previous user ID. The following preferences can be invalidated, primarily in the Configuration window:

values from the General Preferences panel
storage locations from the Storage Settings panel
default locale from the QKB panel
configuration information in the Profiles panel
list of recently used tables

User preferences are not affected by a user ID change when the Hadoop cluster is not secured with Kerberos and when the Hadoop distribution is not MapR.

Causes

User ID changes take place in MapR when a new ID is entered into the file mapr-user.json. That new user ID must also be entered into the Hadoop Configuration panel of the Configuration window. The new user ID invalidates the existing SAS Data Loader user preferences. The path to the JSON file on the client host is vApp-install-path\SASWorkspace\hadoop\conf\mapr-user.json.

In Kerberos environments, the primary cause of a user ID change occurs when a value appears in the User ID field of the Hadoop Configuration panel. This value needs to be deleted by clicking the trash can icon next to the User ID field. This field is blank in the Kerberos environment because Kerberos uses the operating system login user ID for authentication purposes. At times, a different user ID is applied, particularly when you move from a non-Kerberos environment to a Kerberos environment.

A user ID change can also take place when a user other than yourself starts or restarts the SAS Data Loader vApp.

A third possible user ID change can occur if a SAS Data Loader backup is not secure, and if the restore or migration is secured with Kerberos. To learn more about backups and restores, see the SAS Data Loader for Hadoop: vApp Deployment Guide.

Resolution

To resolve user preferences after a user ID change, open the Configuration window More Menu Icon

and update as needed the values in the affected panels. Click OK to save your changes.

The settings in the panels LASR Analytic Servers and Databases panels are not invalidated as a result of a user ID change.

Note: In the General Preferences panel, if no value is specified for the field Maximum length for SAS columns, then SAS Data Loader supplies a system default of 1024. When Hadoop Spark is specified as the run-time environment, the default value or an explicit value can be overridden by a configuration option, as described inString Truncation in Spark-Enabled Directives.