In the Hadoop Distributed File System
(HDFS), tables are stored as one or more files. Each file is formatted
according to the Output Table Format option, which is specified in
each file. In SAS Data Loader, when you create a new target table
in Hadoop, the Output Table Format option is set by the value of the Output
table format field.
You can change the default
value of the Output table format field in
the SAS Data Loader Configuration window.
In any given directive, you can override the default value using the Action menu
icon in the Target Table task.
The default format is
applied to all new target tables that are created with SAS Data Loader.
To override the default format in a new table or an existing table,
you select a different format in the directive and run the job.
To change the default value of the Output table format field, go
to the SAS Data Loader directives page, click the More icon, and select
Configuration. In the Configuration window, click General Preferences.
In the General Preferences panel, change the value of Output table format.
To override the default value of the Output table format field for
a given target in a given directive, open the directive and proceed
to the Target Table task. Select a target table, and then click the
Action menu for that target table. Select Advanced Options, and then,
in the Advanced Options window, set the value of Output table format. The
format that you select will apply only to the selected target table.
The default table format is not changed. The default format will continue
to be applied to all new target tables.
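The relationship between the default value and a per-target override can be sketched as follows. This is only an illustration of the precedence described above, not SAS Data Loader code: the TargetTable class, its format_override attribute, and the effective_format function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetTable:
    """Hypothetical stand-in for a target table in a directive."""
    name: str
    # None means that no override was set in Advanced Options.
    format_override: Optional[str] = None


def effective_format(target: TargetTable, default_format: str) -> str:
    """Return the override if one was set, otherwise the configured default."""
    if target.format_override is not None:
        return target.format_override
    return default_format


default_format = "Use HIVE default"        # value from General Preferences
sales = TargetTable("sales", "ORC")        # overridden for this target only
customers = TargetTable("customers")       # no override

print(effective_format(sales, default_format))
print(effective_format(customers, default_format))
```

Setting an override on one target leaves default_format untouched, so any other new target table still receives the default.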
The available values of the Output table format field are defined
as follows:
Use HIVE default
specifies that the
new target table receives the Output Table Format option value that
is specified in HDFS. This is the default value for the Output
table format field in SAS Data Loader.
Text
specifies that the
new target table is formatted as a series of text fields that are
separated by delimiters. For this option, you select a value for the Delimiter field.
The default value of the Delimiter field
is (Use HIVE default). You can also select
the value Comma, Space, Tab,
or Other. If you select Other,
then you enter a delimiter value. Valid values consist of single ASCII
characters that are numbered between 0 and 127 (decimal). An ASCII
character number can be specified as the delimiter using a three-digit
octal value between 000 and 177.
Parquet
specifies the Parquet
format, which is optimized for nested data. The Parquet algorithm
is considered to be more efficient than using flattened nested namespaces.
The Parquet format requires HCatalog to be enabled on the Hadoop cluster.
Hadoop administrators can refer to the topic "Additional Configuration
Needed to Use HCatalog File Format" in the
SAS 9.4 In-Database Products: Administrator’s Guide.
Note: If your cluster uses a MapR
distribution of Hadoop, or if the selected run-time target is Spark,
then the Parquet output table format is not supported.
ORC
specifies the Optimized
Row Columnar format, which is a columnar format that efficiently manages
large amounts of data in Hive and HDFS.
Sequence
specifies the SequenceFile
output format, which enables Hive to efficiently run MapReduce.
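For the Other delimiter of the Text format, it can help to convert between a character and its three-digit octal code before entering the value. The following is a minimal sketch in Python. The function names delimiter_to_octal and octal_to_delimiter are hypothetical and are not part of SAS Data Loader.

```python
def delimiter_to_octal(ch: str) -> str:
    """Return the three-digit octal code for a single ASCII character."""
    if len(ch) != 1:
        raise ValueError("delimiter must be exactly one character")
    code = ord(ch)
    if code > 127:
        raise ValueError("delimiter must be an ASCII character (0-127)")
    # "03o" pads the octal representation with zeros to three digits.
    return format(code, "03o")


def octal_to_delimiter(octal: str) -> str:
    """Return the ASCII character for a three-digit octal code (000-177)."""
    if len(octal) != 3 or any(d not in "01234567" for d in octal):
        raise ValueError("expected exactly three octal digits")
    code = int(octal, 8)
    if code > 0o177:
        raise ValueError("octal value must not exceed 177")
    return chr(code)


print(delimiter_to_octal("|"))    # vertical bar, decimal 124
print(delimiter_to_octal("\t"))   # tab, decimal 9
```

Both functions reject anything outside the range that the Delimiter field accepts, so an out-of-range value fails before it reaches the directive.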
Consult your Hadoop
administrator for advice about output file formats. Testing might
be required to establish the format that has the highest efficiency
on your Hadoop cluster.
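When you compare formats with your administrator, it can be useful to create a small test table in each format directly in Hive. The sketch below builds HiveQL CREATE TABLE statements that use the STORED AS clause. The mapping from the Output table format values to Hive keywords is an assumption made for illustration, and it does not describe the statements that SAS Data Loader itself generates. The names STORED_AS and create_table_ddl are hypothetical.

```python
# Assumed correspondence between the field values described above and
# Hive's STORED AS keywords. None means "emit no clause", which leaves
# the choice of format to Hive's own default.
STORED_AS = {
    "Use HIVE default": None,
    "Text": "TEXTFILE",
    "Parquet": "PARQUET",
    "ORC": "ORC",
    "Sequence": "SEQUENCEFILE",
}


def create_table_ddl(table, columns, output_format="Use HIVE default"):
    """Build a HiveQL CREATE TABLE statement for the given output format.

    columns is a list of (column_name, hive_type) pairs.
    """
    if output_format not in STORED_AS:
        raise ValueError(f"unknown output table format: {output_format!r}")
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE TABLE {table} ({column_list})"
    keyword = STORED_AS[output_format]
    if keyword is not None:
        ddl += f" STORED AS {keyword}"
    return ddl


columns = [("id", "INT"), ("name", "STRING")]
for fmt in STORED_AS:
    print(create_table_ddl("format_test", columns, fmt))
```

Loading the same sample data into each test table and timing a representative query is one way to carry out the testing mentioned above.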