SAS Federation Server Driver for SASHDAT

About the SAS Federation Server Driver for SASHDAT

The SAS Federation Server Driver for SASHDAT (Driver for SASHDAT) is a write-only driver designed for use with Hadoop on a grid host, such as the SAS LASR Analytic Server. SAS LASR Analytic Server integrates with Hadoop by storing SAS data in the Hadoop Distributed File System (HDFS). Using the Driver for SASHDAT, you can write files into HDFS, which makes them available for loading into SAS LASR Analytic Server. Because the data volumes in HDFS are usually very large, the driver is not designed to read from HDFS and transfer data back to the client.
The Driver for SASHDAT enables a user to create a table and insert into it multiple times. Data is written when the connection is closed or when the user issues a COMMIT or ROLLBACK. When a table is created, the Driver for SASHDAT stores the table definition in the connection. Catalog functions return information about the table, but the information is not written to HDFS until the first row is inserted. The table remains open and available for data until one of the following conditions is met (a short example follows the list):
  • The SQL command COMMIT WORK is executed. This command closes and finalizes all tables with inserted data that are open for the connection.
  • A COMMIT or ROLLBACK is received from the client.
  • The client issues a disconnect event, at which time all tables with inserted data are closed.
  • When COMMIT=STATEMENT is specified in the connection string, the table is closed when the INSERT statement changes to unprepared. If a user creates a table, prepares an INSERT, and executes it multiple times, the table remains open and available for more data. When the statement changes to unprepared, the table is closed. When a table is closed, data can no longer be written. To change the data in an HDFS table, it must be dropped and re-created.
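The following FedSQL sketch illustrates this lifecycle. The catalog, schema, table, and column names are hypothetical and assume a connection through a SASHDAT DSN such as the ones shown later in this section.
CREATE TABLE HDAT.SCHEMA1.SALES (id DOUBLE, region CHAR(10));  /* table definition is stored in the connection */
INSERT INTO HDAT.SCHEMA1.SALES VALUES (1, 'EAST');             /* first insert; table information is written to HDFS and the table stays open */
INSERT INTO HDAT.SCHEMA1.SALES VALUES (2, 'WEST');             /* additional inserts add more data */
COMMIT WORK;                                                   /* closes and finalizes the table */
After COMMIT WORK, the table can no longer receive data; to change it, drop the table and re-create it.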

Connection Options

To access data that is hosted on SAS Federation Server, a client must submit a DSN that defines how to connect to the data. DSNs are associated with a data service that provides the foundation for the connection, such as user access control. See Working with Data Services for additional information.
The Driver for SASHDAT supports the following connection options. For each option, the syntax is shown, followed by a description.
CATALOG
CATALOG=catalog-name;
Specifies the catalog name for the connection.
Note: SAS Federation Server automatically quotes SQL identifiers that do not meet the regular naming convention as defined in the SAS FedSQL Reference Guide.
COMMIT
COMMIT=S|STATEMENT|C|CONNECTION;
Specifies when to close the SASHDAT file. Use S or STATEMENT to close the file when the statement is unprepared, or C or CONNECTION to close it when the connection is disconnected. The default is C.
COPIES
COPIES=number-of-copies;
Specifies how many copies are made when file blocks are written to HDFS. Specifying COPIES=0 is valid and signals the engine that you do not want any replicated copies of the data in HDFS. Defaults for this option depend on the setting for NODIST: the default is 1 when NODIST=NO is specified and 2 when NODIST=YES is specified.
CT_PRESERVE
CT_PRESERVE = STRICT | SAFE | FORCE | FORCE_COL_SIZE
Allows users to control how data types are mapped. Note that data type mapping is disabled when CT_PRESERVE is set to STRICT. If the requested type does not exist on the target database, an error is returned. The options are as follows:
  • STRICT: The requested type must exist in the target database. No type promotion occurs. If the type does not exist, an error is returned.
  • SAFE: Target data types are upscaled only if they do not result in a loss of precision or scale. When character encodings are changed, the new column size is recalculated to ensure that all characters can be stored in the new encoding.
  • FORCE: This is the default for all drivers. The best corresponding target data type is chosen, even if it could potentially result in a loss of precision or scale. When character encodings are changed, the new column size is recalculated to ensure that all characters can be stored in the new encoding.
  • FORCE_COL_SIZE: This option is the same as FORCE, except that the column size for the new encoding is the same as the original encoding. This option can be used to avoid column size creep. However, the resulting column might be too large or too small for the target data.
DEFAULT SCHEMA
DEFSCHEMA=schema-name;
Specifies the default schema for the connection. When using the DEFSCHEMA option, the default schema and path must also be specified with SCHEMA=(NAME=schema-name PATH|PRIMARYPATH=path) in the connection string.
ENCODING
ENCODING=SAS-NLS-encoding-identifier;
Specifies the encoding for SASHDAT data and for character conversions to and from that encoding. If not specified, the default encoding is inherited from SAS Federation Server.
Note: Tables created with the Driver for SASHDAT use the encoding specified in the connection string. If the encoding option is not specified, encoding defaults to the character set associated with the operating system for SAS Federation Server.
HASH
HASH=Y|YES|N|NO;
Specifies the algorithm that determines the distribution of partitions to nodes of the LASR Analytic Server proxy. The default is HASH=NO, which specifies that the distribution scheme depends on a binary tree. HASH=YES indicates that the distribution scheme depends on a hash function. The resulting distribution of partitions is not as balanced, but it uses less memory. HASH=YES is recommended when working with high-cardinality partition keys (on the order of millions of partitions).
HOST
HOST | SERVER=grid-server-name;
Specifies the name of the grid host that has a running Hadoop NameNode. This option is required in the connection string. There is no default.
INSTALL
INSTALL=path;
Specifies the path to the TKGrid installation on the grid host. This option is required in the connection string. There is no default.
LOCALE
LOCALE=SAS-locale-identifier;
Specifies the locale for message text and for character conversions to and from that locale. The default locale is acquired from the server operating system.
NODIST
NODIST|INNAMEONLY=Y|YES|N|NO;
Specifies the mode for placing small tables into HDFS. Use NODIST=YES for small tables. The default is N (No).
PASSWORD
PWD=alternate-password;
Specifies the password for the alternate user when connecting to the grid host with a running Hadoop NameNode.
SCHEMA
SCHEMA=(NAME=schema-name PATH|PRIMARYPATH=path);
This option maps a logical schema name to a specific path on the grid host that has a running Hadoop NameNode. This option can be specified multiple times in a single connection string to define multiple schemas. At least one schema is required; there is no default. The first schema listed in the connection string is designated as the default schema unless DEFSCHEMA= is specified.
SQUEEZE
SQUEEZE=Y|YES|N|NO;
Specifies whether the SASHDAT file is compressed. The default is N (no compression).
TIMEOUT
TIMEOUT=timeout-in-seconds;
Specifies the amount of time to wait while trying to establish a connection before terminating the attempt and generating an error. The default is 20 seconds.
USER
UID=alternate-userid;
Specifies the ID of an alternate user when connecting to the grid host with a running Hadoop NameNode.

Example Connection Strings

The following connection string connects a default user and closes files on disconnect, commit, or rollback:
CATALOG=HDAT;DRIVER=SASHDAT;HOST=hostname;INSTALL="/opt/TKGrid/v940m1/laxnd";
  SCHEMA=(name=SCHEMA1;PATH="/user/test"); 
This connection string connects a test user, defines two schemas, stores data in UTF8, and closes files on statement unprepare, disconnect, commit, or rollback:
CATALOG=HDAT;DRIVER=SASHDAT;COMMIT=S;UID=test;
  ENCODING=UTF8;HOST=hostname;INSTALL="/opt/TKGrid/v940m1/laxnd/";
  SCHEMA=(name=CUSTOMERS;PATH="/user/custs");SCHEMA=(name=Accounts;PATH="/user/accts"); 
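The following sketch shows a connection string that also enables compression, requests two copies of each file block, and sets a 60-second connection timeout. The host name, install path, and schema path are placeholders, and the SQUEEZE, COPIES, and TIMEOUT values are illustrative assumptions:
CATALOG=HDAT;DRIVER=SASHDAT;HOST=hostname;INSTALL="/opt/TKGrid/v940m1/laxnd";
  SCHEMA=(name=SCHEMA1;PATH="/user/test");SQUEEZE=YES;COPIES=2;TIMEOUT=60;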
Last updated: March 6, 2018