In the Hadoop Distributed File System (HDFS), tables are stored as one
or more files. Each file is formatted according to the Output Table
Format option, which is specified in each file. In SAS Data Loader,
when you create a new target table in Hadoop, the Output Table Format
option is set by the value of the Output table format field.
You can change the default
value of the Output table format field in
the SAS Data Loader Configuration window.
In any given directive, you can override the default value using the Action menu
icon in the Target Table task.
The default format is
applied to all new target tables that are created with SAS Data Loader.
To override the default format in a new table or an existing table,
you select a different format in the directive and run the job.
To change the default value of the Output table format field, go
to the SAS Data Loader directives page, click the More icon, and
select Configuration. In the Configuration window, click General
Preferences. In the General Preferences panel, change the value of
Output table format.
To override the default value of the Output table format field for
a given target in a given directive, open the directive and proceed
to the Target Table task. Select a target table, and then click the
Action menu for that target table. Select Advanced Options, and then,
in the Advanced Options window, set the value of Output table format.
The format that you select applies only to the selected target table.
The default table format is not changed. The default format continues
to be applied to all new target tables.
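The relationship between the default and a per-target override can be
sketched in a few lines of Python. This is only an illustrative model,
not SAS Data Loader code: the class FormatSettings, its methods, and the
directive and table names are hypothetical. It assumes that an override
is keyed by the directive and target table, and that any target without
an override receives the current default.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class FormatSettings:
    """Hypothetical model of the default format plus per-target overrides."""

    # Mirrors the documented initial value of the Output table format field.
    default_format: str = "Use HIVE default"
    # Overrides keyed by (directive name, target table name).
    overrides: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def set_default(self, fmt: str) -> None:
        # Corresponds to changing the value in General Preferences.
        self.default_format = fmt

    def override(self, directive: str, table: str, fmt: str) -> None:
        # Corresponds to Advanced Options for one target table.
        self.overrides[(directive, table)] = fmt

    def format_for(self, directive: str, table: str) -> str:
        # An override wins for its own target; everything else gets the default.
        return self.overrides.get((directive, table), self.default_format)


settings = FormatSettings()
settings.set_default("ORC")
settings.override("Sort Data", "sales_sorted", "Parquet")

print(settings.format_for("Sort Data", "sales_sorted"))  # overridden target
print(settings.format_for("Sort Data", "other_table"))   # falls back to default
```

In this sketch the override never touches default_format, which matches
the behavior described above: the selected format applies to one target
table only.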
The available values of the Output table format field are defined
as follows:
Use HIVE default
specifies that the
new target table receives the Output Table Format option value that
is specified in HDFS. This is the default value for the Output
table format field in SAS Data Loader.
Text
specifies that the
new target table is formatted as a series of text fields that are
separated by delimiters. For this option, you select a value for the Delimiter field.
The default value of the Delimiter field
is (Use HIVE default). You can also select
the value Comma, Space, Tab,
or Other. If you select Other,
then you enter a delimiter value. To see a list of valid delimiter
values, click the question mark icon to the right of the Delimiter field.
Parquet
specifies the Parquet
format, a columnar format that is optimized for nested data. Parquet's
record shredding and assembly algorithm is considered to be more
efficient than simple flattening of nested namespaces.
Note: If your cluster uses a MapR
distribution of Hadoop, then the Parquet output table format is not
supported.
ORC
specifies the Optimized
Row Columnar format, which is a columnar format that efficiently manages
large amounts of data in Hive and HDFS.
Sequence
specifies the SequenceFile
output format, which enables Hive to efficiently run MapReduce. This
format is splittable, so Hive can divide large tables into pieces
that are processed in parallel.
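To relate these values to Hive itself, the following Python sketch
builds a HiveQL CREATE TABLE statement for each format. It is an
assumption about how the values correspond to Hive storage clauses, not
a description of the statements that SAS Data Loader generates. The
function create_table_sql and the two lookup tables are hypothetical.
The STORED AS and ROW FORMAT DELIMITED clauses are standard HiveQL. For
Use HIVE default, the sketch emits no storage clause, so Hive would
apply its own configured default (the hive.default.fileformat property).

```python
from typing import Optional

# Assumed mapping from field values to HiveQL STORED AS keywords.
STORED_AS = {
    "Text": "TEXTFILE",
    "Parquet": "PARQUET",
    "ORC": "ORC",
    "Sequence": "SEQUENCEFILE",
}

# Named delimiter choices; any other value is treated as an "Other" entry.
DELIMITERS = {"Comma": ",", "Space": " ", "Tab": "\\t"}


def create_table_sql(table: str, columns: str,
                     output_format: str = "Use HIVE default",
                     delimiter: Optional[str] = None) -> str:
    """Build an illustrative CREATE TABLE statement for one target table."""
    parts = [f"CREATE TABLE {table} ({columns})"]
    # A delimiter only applies to the Text format; None means Hive's default.
    if output_format == "Text" and delimiter is not None:
        char = DELIMITERS.get(delimiter, delimiter)
        parts.append(f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{char}'")
    # Omit STORED AS entirely so that Hive falls back to its own default.
    if output_format != "Use HIVE default":
        parts.append(f"STORED AS {STORED_AS[output_format]}")
    return " ".join(parts)


print(create_table_sql("sales", "id INT, amount DOUBLE", "ORC"))
print(create_table_sql("sales_txt", "id INT", "Text", "Comma"))
print(create_table_sql("sales_def", "id INT"))
```

Running the statements for the same data in each format is one way to
carry out the efficiency testing that is recommended for your cluster.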
Consult your Hadoop
administrator for advice about output file formats. Testing might
be required to establish the format that has the highest efficiency
on your Hadoop cluster.