Data Types for Hadoop

Overview

Hive is a data warehouse that supplies metadata about data that is stored in Hadoop files. Hive includes a data dictionary and an accompanying SQL-like interface called HiveQL or Hive SQL. HiveQL implements data definition language (DDL) and data manipulation language (DML) statements similar to many relational database management systems. Hive tables are defined with a CREATE TABLE statement, so every column in a table has a name and a data type. The data type indicates the format in which the data is stored. Hive exposes data that is stored in HDFS and other file systems through the data types that are described in this section. These are accepted column types in a Hive CREATE TABLE statement. This section includes information about Hive data types and data conversion between Hive and SAS.
For more information about Hive, Hadoop, and data types, see these online documents at https://cwiki.apache.org/confluence/display/Hive.
  • Hive Getting Started
  • Hive Language Manual
  • Hive Language Manual UDF
  • Hive Tutorial

Simple Hive Data Types

STRING
specifies variable-length character data. The maximum length is 2 gigabytes.
TINYINT
specifies a signed one-byte integer.
SMALLINT
specifies a signed two-byte integer.
INT
specifies a signed four-byte integer.
BIGINT
specifies a signed eight-byte integer.
FLOAT
specifies a signed four-byte, floating-point number.
DOUBLE
specifies a signed eight-byte, floating-point number.
BOOLEAN
specifies a true or false value.
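To illustrate these column types, here is a sketch of a HiveQL CREATE TABLE statement that uses each simple type (the table name simple_types_sample and its column names are hypothetical):
CREATE TABLE simple_types_sample (s STRING, ti TINYINT, 
   si SMALLINT, i INT, bi BIGINT, f FLOAT, d DOUBLE, b BOOLEAN) 
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
   STORED AS TEXTFILE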

Complex Hive Data Types

ARRAY
specifies an ordered collection of elements that are all of the same type (indexable lists).
MAP
specifies an associative array of key-value pairs (for example, string keys that map to string values).
STRUCT
specifies a record of named fields. You access elements within the type by using dot (.) notation.
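For illustration, here is a sketch of a table with complex-type columns and a query that accesses their elements (the table name complex_types_sample and its columns are hypothetical):
CREATE TABLE complex_types_sample (a ARRAY<INT>, 
   m MAP<STRING,STRING>, s STRUCT<city:STRING, zip:INT>)
SELECT a[0], m['country'], s.city FROM complex_types_sample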

Hive Date, Time, and Timestamp Data

Hive does not define data types for dates, times, or timestamps. By convention, these are typically stored in ANSI format in Hive STRING columns. For example, the last day of this millennium is stored as the string ‘2999-12-31’. SAS/ACCESS assumes ANSI format for dates, times, and timestamps that are stored in Hadoop. Therefore, if you use the DBSASTYPE option to indicate that a Hive STRING column contains a date, SAS/ACCESS expects an ANSI date that it converts to SAS date format. For output, SAS DATE, TIME, and DATETIME formats are converted to ANSI format and are stored in Hive STRING columns.
SAS/ACCESS does not currently support conversion of Hadoop binary UNIX timestamps to SAS DATETIME format.
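For example, here is a minimal sketch of reading a Hive STRING column that contains ANSI dates (hdp is a libref that is assigned with the Hadoop engine; the table sample_table and column d are hypothetical; the DBSASTYPE= data set option is described later in this section):
data work.dates;
set hdp.sample_table( dbsastype=(d='DATE(9.0)') );
run;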

SAS Data Types

SAS has two fundamental data types, character and numeric. SAS character variables (columns) are of a fixed length with a maximum of 32,767 characters. SAS numeric variables are signed eight-byte, floating-point numbers. When SAS numerics are used in conjunction with SAS formats, they can represent a number of data types, including DATE, TIME, and DATETIME. For more detailed information about SAS data types, see SAS Language Reference: Concepts.
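For example, this minimal sketch shows how a format gives a SAS numeric variable date semantics:
data _null_;
d='21FEB2012'd;   /* numeric variable that holds a SAS date value */
put d= date9.;    /* prints d=21FEB2012 */
put d= best.;     /* prints the underlying count of days since 01JAN1960: d=19044 */
run;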

Data Conversion from Hive to SAS

This table shows the default SAS formats that are assigned to SAS variables that are created when SAS/ACCESS reads Hive table columns.
Hive to SAS: Default SAS Formats for Hive Data Types
Hive Data Type    SAS Data Type    Default SAS Format
STRING            character        $32767.
TINYINT           numeric          4.
SMALLINT          numeric          6.
INT               numeric          11.
BIGINT            numeric          20.
FLOAT             numeric          none
DOUBLE            numeric          none
BOOLEAN           numeric          1.

Issues When Converting Data from Hive to SAS

Below are some potential conversion issues.
  • Hive STRING columns that contain ANSI date, time, or timestamp values are not automatically converted to the SAS DATE, TIME, and DATETIME types, respectively.
  • STRING: Depending on the length of the Hadoop STRING data, the SAS character $32767. format might be unnecessarily large for short STRING columns, or it can truncate Hadoop STRING columns that contain more than 32767 characters.
  • BIGINT: Converting Hadoop BIGINT to a SAS numeric can result in loss of precision because the internal SAS eight-byte, floating-point format accurately preserves only 15 digits of precision, whereas a BIGINT preserves up to 19. (See the sketch after this list.)
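Here is a minimal sketch of the BIGINT precision issue (the 19-digit value is arbitrary):
data _null_;
x=1234567890123456789;   /* 19 digits stored in an 8-byte floating-point number */
put x= 20.;              /* prints x=1234567890123456768: the low-order digits are lost */
run;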
Workarounds depend on how you access the data. The sections that follow describe how to use table properties, the DBSASTYPE= data set option, and explicit pass-through SQL.

SAS Table Properties for Hive and Hadoop

Although HiveQL supplies critical metadata for Hadoop files, in some cases more metadata is beneficial. Fortunately, HiveQL CREATE TABLE and ALTER TABLE statements provide an extensible feature called table properties. (For more information, see the Hive Language Manual at https://cwiki.apache.org/confluence/display/Hive.) SAS/ACCESS uses table properties to describe and interpret the contents of Hive STRING columns.
Here is an example of a new Hive table that is created with SAS/ACCESS. A SAS variable (column) has an associated SAS DATETIME format, so SAS/ACCESS creates a DATETIME table property for the Hive column.
libname hdp hadoop server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;
data hdp.datetime_tableproperty_sample;
format dt_stamp datetime25.6;
dt_stamp=datetime();
run;
This code creates a new Hive table, DATETIME_TABLEPROPERTY_SAMPLE, by generating this HiveQL:
CREATE TABLE `DATETIME_TABLEPROPERTY_SAMPLE` (`dt_stamp` STRING) 
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
   STORED AS TEXTFILE TBLPROPERTIES ('SASFMT:dt_stamp'='DATETIME(25.6)')
SAS stores dt_stamp as an ANSI timestamp in the Hive STRING column, as in this example:
2012-02-23 09:51:37.218
Based on the SAS DATETIME25.6 format, SAS/ACCESS also generates the Hive table property that describes STRING column dt_stamp as DATETIME(25.6).
When SAS/ACCESS reads this Hive table, the SASFMT table property indicates that STRING column dt_stamp contains an ANSI timestamp. SAS/ACCESS automatically converts and formats it as a SAS DATETIME25.6 variable, as in this example:
data;
set hdp.datetime_tableproperty_sample; 
put dt_stamp=;
run;
dt_stamp=23FEB2012:09:51:37.218000
NOTE: There were 1 observations read from the data set HDP.DATETIME_TABLEPROPERTY_SAMPLE.
When SAS/ACCESS creates a new Hive table, it generates table properties for SAS variables with character, date, datetime, and time formats, all of which produce Hive STRING columns. See the generated table properties in Data Conversion from SAS to Hive.

Data Conversion from SAS to Hive

This table shows the Hive data types and table properties that are assigned when SAS/ACCESS creates a Hive table.
SAS sets table properties only when creating a new Hive table. It does not create or alter table properties when appending to an existing Hive table.
Hive Data Types and Table Properties That Are Created for SAS Data Type and Format Combinations
SAS Data Type    SAS Format               Hive Data Type    Hive Table Property
character        $n.                      STRING            CHAR(n)
numeric          DATETIMEw.p              STRING            DATETIME(w.p)
numeric          DATEw.                   STRING            DATE(w.0)
numeric          TIMEw.p                  STRING            TIME(w.p)
numeric          1. to 4.                 SMALLINT          none
numeric          5. to 9.                 INT               none
numeric          other numeric formats    DOUBLE            none
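For example, per the first row of this table, a SAS character variable with a $10. format produces a Hive STRING column with a CHAR(10) table property. Here is a sketch (the data set name char_tableproperty_sample is hypothetical), followed by HiveQL of the kind that SAS/ACCESS would generate, following the same pattern as the DATETIME example above:
data hdp.char_tableproperty_sample;
format name $10.;
name='sasuser';
run;
CREATE TABLE `CHAR_TABLEPROPERTY_SAMPLE` (`name` STRING) 
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
   STORED AS TEXTFILE TBLPROPERTIES ('SASFMT:name'='CHAR(10)')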

Leverage Table Properties for Existing Hive Tables

SAS/ACCESS generates SAS table properties only when creating a new Hive table. Many or perhaps all of your Hive tables are created by other means. For example, your Hadoop administrator might create Hive table definitions by submitting DDL scripts to the Hive CLI. SAS and SAS users can benefit by adding SAS table properties to existing Hive table definitions. Here is an example, where a Hive table has already been defined.
CREATE EXTERNAL TABLE weblogs (extract_date STRING, 
   extract_type INT, webdata STRING) ROW FORMAT DELIMITED FIELDS 
   TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/web_data'
Based on this table definition, here is how SAS interprets the columns.
libname hdp sasiohdp server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=2012-02-21 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
          Alphabetic List of Variables and Attributes
#  Variable       Type   Len     Format    Informat    Label
1  extract_date   Char   32767   $32767.   $32767.     extract_date
2  extract_type   Num        8   11.       11.         extract_type
3  webdata        Char   32767   $32767.   $32767.     webdata
Notice that Hive describes the extract_date column to SAS as a STRING with a length of 32767, and it describes the webdata column the same way. SAS/ACCESS therefore reads both columns as character data and applies the $32767. format to them. The result is an overly wide SAS data set with an observation (row) width of 64 kilobytes, and extract_date is not formatted as a SAS DATE.
SAS issues a warning message for this situation, which includes the maximum column length that was in the result set. In this example, the maximum length read for the extract_date STRING column is 10 bytes, and the maximum length read for the webdata STRING column is 320 bytes.
WARNING: SAS/ACCESS assigned these columns a length of 32767. If resulting SAS character variables remain this length, SAS performance is impacted. See SAS/ACCESS documentation for details. Columns followed by the maximum length observed were: extract_date:10, webdata:320
The example below assumes that the length of the webdata STRING in Hive never exceeds 1000 characters. A Hadoop user ID with the appropriate authority can issue Hive ALTER TABLE statements to add SAS table properties to the Hive table definition.
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:extract_date'='DATE(9.0)')
ALTER TABLE weblogs SET TBLPROPERTIES ('SASFMT:webdata'='CHAR(1000)')
SAS/ACCESS honors the added properties, and here is the result.
libname hdp sasiohdp server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;
data sheetmetal_sales; set hdp.weblogs(obs=1);
put extract_date= extract_type=;
put webdata=;
run;
extract_date=21FEB2012 extract_type=1
webdata=http://www.sas.com/industry/oilgas
NOTE: There were 1 observations read from the data set HDP.WEBLOGS.
proc contents data=hdp.weblogs; run;
          Alphabetic List of Variables and Attributes
#  Variable       Type   Len     Format    Informat    Label
1  extract_date   Num       8    DATE9.    DATE9.      extract_date
2  extract_type   Num       8    11.       11.         extract_type
3  webdata        Char   1000   $1000.    $1000.      webdata
The resulting SAS data set that is created from the Hive table has a much smaller observation width, which saves disk space and reduces CPU consumption. The extract_date column is also automatically converted and formatted with the standard SAS DATE9. format.
Adding SAS table properties to existing Hive tables does not affect how non-SAS software uses those tables. You can also issue ALTER TABLE and other DDL statements by using SAS/ACCESS explicit SQL. (See SQL Pass-Through Facility Specifics for Hadoop.) Keep in mind that permission to issue DDL such as ALTER TABLE statements might be restricted to the Hadoop administrator.

Address Issues When Converting Data from Hive to SAS with Table Properties

Some issues currently exist when reading Hadoop data into SAS. (See Issues When Converting Data from Hive to SAS.) For example, Hive STRING columns default to the $32767. SAS character format without a defined SASFMT table property or a SAS override option such as DBSASTYPE=.
Here is how you can address specific conversion issues.
STRING issues
To automatically convert Hive STRING columns that contain ANSI date, timestamp, or time values to suitable SAS formats, you can use the following ALTER TABLE statements. The statements use these sample Hive columns: d contains an ANSI date, ts contains an ANSI timestamp, and t contains an ANSI time.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='DATE(9.0)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='DATETIME(25.6)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='TIME(15.6)')
Alternatively, you could use these statements to create SAS character variables of optimal length that contain the same ANSI representations that are stored in Hive:
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='CHAR(9)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='CHAR(25)')
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='CHAR(15)')
You can use the following statement for other Hive STRING columns where the maximum length is less than 32767. Here, the string_col column has a maximum length of 100.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(100)')
However, if you anticipate that the string_col column in Hive might grow to a maximum length of 200 in the future, you could instead use this statement to set the table property.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(200)')
Hive STRING columns longer than 32767 characters are truncated when they are read into SAS. Here is how the warning for this data loss is flagged in the SAS log:
WARNING: Column 'string_col' was truncated 1 times.  
Observation (row) number 2 was the first observation truncated.
BIGINT issues
Converting a Hadoop BIGINT column to a SAS numeric column can cause a loss of precision. A SAS numeric column can accurately preserve only 15 digits of precision, whereas a BIGINT column can preserve up to 19 significant digits of precision. You can address this issue by applying a CHAR(19) table property format. SAS then automatically reads a Hive BIGINT column into a SAS character string with $19. format, which preserves all BIGINT digits in character format. Here is an example, using the bgint BIGINT column.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(19)')
Keep this important consideration in mind, however: For Hive tables that SAS/ACCESS creates, you might not need to issue ALTER TABLE statements. See Data Conversion from SAS to Hive for table properties that SAS/ACCESS automatically generates when it creates a Hive table.
CAUTION:
Do not create multiple table properties for a single Hive column. Unpredictable data conversion can result.

Alternatives to Table Properties for Issues with Data Conversion from Hive to SAS

For various reasons, it might be impractical or undesirable to issue ALTER TABLE statements to create SAS table properties. In such cases, you can instead use these data set options.
DBSASTYPE=
Use DBSASTYPE= in your SAS code to produce the same Hive-to-SAS data conversion as automatic conversion with table properties. Each pair of SAS DATA steps below behaves identically: the first step of each pair relies on a SASFMT table property, and the second uses no table property but adds DBSASTYPE= to achieve the same functionality. (For details, see the DBSASTYPE= data set option.) Here is the SAS LIBNAME statement for all of these SAS DATA steps.
libname hdp sasiohdp server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:d'='DATE(9.0)')
data work.local_sample; set hdp.sample_table( keep=d ); run;

[---no table property for column ‘d’---]
data work.local_sample;
set hdp.sample_table( keep=d dbsastype=(d='DATE(9.0)') ); run;

ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:ts'='DATETIME(25.6)')
data work.local_sample; set hdp.sample_table( keep=ts ); run;

[---no table property for column ‘ts’---]
data work.local_sample;
set hdp.sample_table( keep=ts dbsastype=(ts='DATETIME(25.6)') ); run;


ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:t'='TIME(15.6)')
data work.local_sample; set hdp.sample_table( keep=t ); run;

[---no table property for column ‘t’---]
data work.local_sample;
set hdp.sample_table( keep=t dbsastype=(t='TIME(15.6)') ); run;


ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:string_col'='CHAR(200)')
data work.local_sample; set hdp.sample_table( keep=string_col ); run;

[---no table property for column ‘string_col’---]
data work.local_sample;
set hdp.sample_table( keep=string_col dbsastype=(string_col='CHAR(200)') );
run;
COMPRESS=YES
If you can use neither table properties nor DBSASTYPE= to reduce the default 32767 SAS character length for Hive STRING columns, consider using this data set option. Although the SAS character variables still use the $32767. format, SAS data set compression can yield significant savings in disk space and CPU consumption. Here is an example.
data work.local_sample( COMPRESS=YES ); set hdp.sample_table; run;
Apply COMPRESS=YES to the SAS data set reference, not to the Hadoop table reference. Also consider using COMPRESS=YES even after you reduce the default 32767 SAS character lengths with table properties or DBSASTYPE=, particularly when many strings in the Hadoop data are much shorter than the reduced SAS character variable length. For example, you could use COMPRESS=YES with the table property shown below if most values of address_1 are less than 100 characters.
ALTER TABLE sample_table SET TBLPROPERTIES ('SASFMT:address_1'='CHAR(1000)')
For more information about this data set option, see SAS Data Set Options: Reference.

Address Issues When Converting Data from Hive to SAS for Pass-Through SQL

Neither table properties nor DBSASTYPE= address data conversion issues from Hive to SAS if you use pass-through SQL to read Hive data. For pass-through SQL, you might need to explicitly convert and format each Hive column as you want it to be represented in SAS. To see this, first use SAS to create a table for which table properties are generated for all but the BIGINT column. Here is the table that SAS creates.
libname hdp sasiohdp server=dbihadoop user=hadoop_usr pwd=hadoop_pwd;
data hdp.passthrough_ex( dbtype=(bgint="BIGINT") );
bgint='1234567890123456789';
format ts datetime25.6; ts=datetime();
format d date9.; d=today();
format t time10.; t=time();
format string_col $20.; string_col='hello';
run;
SAS issues this HiveQL when creating the table.
CREATE TABLE `PASSTHROUGH_EX` (`bgint` BIGINT,`ts` STRING,
   `d` STRING,`t` STRING,`string_col` STRING) ROW FORMAT DELIMITED FIELDS 
   TERMINATED BY '\001' STORED AS TEXTFILE TBLPROPERTIES
('SASFMT:ts'='DATETIME(25.6)','SASFMT:d'='DATE(9.0)',
   'SASFMT:t'='TIME(10.0)','SASFMT:string_col'='CHAR(20)')
Next, an ALTER TABLE statement is issued to add a table property for BIGINT column bgint.
ALTER TABLE passthrough_ex SET TBLPROPERTIES ('SASFMT:bgint'='CHAR(19)')
When the table is read into SAS through the LIBNAME engine, the table properties are honored.
data work.local; set hdp.passthrough_ex;
run;
data _null_; set work.local;
put bgint=; put ts=; put d=; put t=; put string_col=;
run;
bgint=1234567890123456789
ts=25FEB2012:02:00:55.141000
d=25FEB2012
t=2:00:55
string_col=hello
This pass-through SQL step converts and formats each column identically to the LIBNAME-based step that applied the table properties.
proc sql; connect to sasiohdp(server=dbihadoop user=hadoop_usr pwd=hadoop_pwd);
create table work.local as select 
bgint length=19 format=$19. informat=$19.,
input(ts, IS8601DT26.) as ts format=datetime25.6 informat=datetime25.6,
input(d, yymmdd10.) as d format=date9. informat=date9.,
input(t, IS8601TM15.) as t format=time15.6 informat=time15.6,
string_col length=20 format=$20. informat=$20.
from connection to sasiohdp( select cast(bgint as STRING) 
   as bgint,ts,d,t,string_col from passthrough_ex );
quit;
data _null_; set work.local;
put bgint=; put ts=; put d=; put t=; put string_col=;
run;
bgint=1234567890123456789
ts=25FEB2012:02:00:55.141000
d=25FEB2012
t=2:00:55.141000
string_col=hello

Hadoop Null Values

Hadoop has a special value called NULL. A Hadoop NULL value means an absence of information and is analogous to a SAS missing value. When SAS/ACCESS reads a Hadoop NULL value, it interprets it as a SAS missing value. For more information about how SAS handles NULL values, see Potential Result Set Differences When Processing Null Data.
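For example, in this minimal sketch (the libref hdp, table sample_table, and numeric column qty are hypothetical), a Hive NULL in qty arrives in SAS as a missing value:
data _null_;
set hdp.sample_table;
if missing(qty) then put 'qty is NULL in Hive and missing in SAS';
run;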