Distributed Server: Co-located HDFS

Introduction

Co-located HDFS is a deployment of Hadoop that meets the following criteria:

The deployment runs on the same hardware as a distributed SAS LASR Analytic Server.
The deployment incorporates services that SAS High-Performance Deployment of Hadoop provides.

SAS High-Performance Deployment of Hadoop adds services to Apache Hadoop (and other supported Hadoop distributions) to provide the following integrated functionality:

SAS uses a special file format (with the filename suffix SASHDAT) to store tables in HDFS. Like any file that is stored in HDFS, a SASHDAT file is distributed as a series of blocks. Copies of blocks are stored to provide data redundancy.
SAS enhances the block distribution algorithm to make sure that blocks are distributed evenly. Because SAS LASR Analytic Server reads blocks of data directly, the even block distribution contributes to an even workload on the machines in the cluster.

This integration enables a distributed SAS LASR Analytic Server to use HDFS to read SASHDAT tables in parallel very efficiently.

Tip

Basic HDFS commands are documented in the SAS LASR Analytic Server: Reference Guide.

About the HDFS Tab

Introduction

To open the HDFS tab, select Tools > Explore HDFS from the main menu in the administrator.

Note: The HDFS tab is available only in deployments that use co-located HDFS. Only users who have the Browse HDFS capability can access the HDFS tab.

The HDFS tab provides a host-layer view of HDFS folders and tables. The view is not mediated by metadata or by your permissions. Instead, a privileged Hadoop account retrieves the information that this tab displays.

You can use the HDFS tab to perform the following tasks:

Browse HDFS folders and tables.
View row count, columns, column information, and block information for tables that have been added to HDFS. Information about block distribution, block redundancy, and measures of block utilization is provided.
Delete HDFS tables that are stored in SASHDAT format. (Files that are not SASHDAT files are listed, but they cannot be deleted.)

System Properties

To view HDFS system properties, click the system properties icon. The following table describes the fields:

HDFS System Properties
Property	Description
Command for setting permissions	This setting is not used.
Set permissions as root?	This setting is not used.
Command for getting file information	This setting is not used.
Data directories	Specifies the directory that is used to store blocks.
Name Node	Specifies the host name of the machine that is used as the Hadoop NameNode.
Live Data Nodes	Specifies the number of Hadoop DataNodes that are reachable.
Dead Data Nodes	Specifies the number of Hadoop DataNodes that are not available.

Basic File Information

To view basic file information, select a file. The following information is provided:

Basic File Information
Field	Description
Name	Specifies the name of the file.
Size	Specifies the file size. This value includes the disk space required to store the data in blocks and metadata about the file.
Date Modified	Specifies the date on which the file was created or replaced.
Path	Specifies the HDFS directory.
Description	Specifies the description that is stored with the data. The description is displayed beside the table name in the explorer interface.
Copies	Specifies the number of redundant copies of the data.
Block Size	Specifies the number of bytes that are used to store each block of data.
Number of Variables	Specifies the number of columns in the HDFS table.
Owner	Specifies the user account that added the data to HDFS.
Group	Specifies the primary UNIX group for the user account.
Permissions	Specifies the Read, Write, and Execute access permissions for owner, group, and other.
SASHDAT file?	Specifies whether the file is in the SASHDAT format. `Yes` indicates that the file is in the SASHDAT format.
Compression	Specifies whether the file is compressed. `Yes` indicates that the file is compressed.
Encryption	Specifies whether the file is encrypted. `Yes` indicates that the file is encrypted.

Note: The HDFS tab might display multiple files for a table as the table is being added to HDFS. After the table is added, the multiple files disappear.

Table Information

To view column information, select a table, and click the column information icon. The following information is provided:

Column Information
Field	Description
Column Name	Specifies the column name from the source table.
Label	Specifies the label for the data set column when the table was added to HDFS.
Type	Specifies whether the column is Numeric or Character. Numeric variables are encoded as `1`.
Offset	Specifies the starting position for the variable in the SASHDAT file.
Length	Specifies the storage used by the variable.
Format	Specifies the format associated with the variable.
Format Length	Specifies the format length of the format that existed on the variable when it was added to HDFS. This value is zero if the variable did not have a format when it was added to HDFS.
Precision	Specifies the precision portion of the format for number formats.
Length (Formatted)	Specifies the length of the variable when formatting is applied.

To view the row count, select a table, and click the row count icon. The following information is provided:

Row Count Information
Field	Description
Rows	Specifies the number of rows in the data.
Blocks	Specifies the number of HDFS blocks that are used to store the data.
Allocated	Specifies the number of bytes allocated to store the data. The value is the product of the block size and the number of blocks. This value is smaller than the file size because it does not include the space needed for the SASHDAT file header.
Used	Specifies the number of bytes within the allocated blocks that are used for storing rows of data.
Utilization	Specifies the percentage of allocated space that is used for storing rows of data.
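
The relationship among Blocks, Allocated, Used, and Utilization can be checked with simple arithmetic. The following Python sketch assumes that Allocated is the block size multiplied by the number of blocks and that Utilization is Used divided by Allocated, as the table describes. The function name and the sample figures are hypothetical; they are not taken from a real table or from any SAS interface.

```python
def utilization_summary(block_size: int, blocks: int, used: int) -> dict:
    """Compute allocated bytes and utilization for one table.

    block_size, blocks, and used mirror the Block Size, Blocks, and Used
    fields; the sample values below are made up for illustration.
    """
    if block_size <= 0 or blocks <= 0:
        raise ValueError("block_size and blocks must be positive")
    allocated = block_size * blocks  # Allocated = block size x number of blocks
    if not 0 <= used <= allocated:
        raise ValueError("used must be between 0 and the allocated size")
    return {
        "allocated": allocated,
        "used": used,
        "utilization_pct": 100.0 * used / allocated,
    }


# Hypothetical table: four 32 MiB blocks, 96 MiB of which hold rows.
summary = utilization_summary(block_size=32 * 1024 * 1024, blocks=4, used=96 * 1024 * 1024)
print(summary)
```

A low utilization value under these assumptions means that much of the allocated block space does not hold rows of data.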

Block Detail Information

To view block details, select a file, and click the block details icon. The following information is provided:

Block Detail Information
Field	Description
Host Name	Specifies the machine in the cluster that stores the block of data.
Block Name	Specifies the filename for the block.
Path	Specifies the directory that contains the block.
Record Length	Specifies the sum of the column lengths for the variables in the data.
Records	Specifies the number of rows stored in the block. Because redundant blocks are listed in the table, the sum of the records listed does not equal the number of rows in the data.
Owner	Specifies the user account that added the data to HDFS.
Group	Specifies the primary UNIX group for the user account that stored the data.
Permissions	Specifies the Read, Write, and Execute access permissions for owner, group, and other.

You can sort by the column headings to identify anomalies. It is normal for several blocks to be stored on the same machine. However, it is not normal for the values of Record Length, Owner, Group, or Permissions to be different from row to row.
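
The consistency check described above can also be expressed programmatically. The following Python sketch assumes that the block detail rows have been copied into a list of dictionaries; the key names, host names, and values are hypothetical, and the HDFS tab is not known to offer an export in this form. The sketch reports every field among Record Length, Owner, Group, and Permissions whose value differs from row to row.

```python
from collections import Counter

# Fields that are expected to be identical in every block detail row.
CHECKED_FIELDS = ("record_length", "owner", "group", "permissions")


def inconsistent_fields(rows):
    """Map each checked field that varies across rows to a count of its values."""
    result = {}
    for field in CHECKED_FIELDS:
        counts = Counter(row[field] for row in rows)
        if len(counts) > 1:  # more than one distinct value is an anomaly
            result[field] = counts
    return result


# Hypothetical rows; the third row has an unexpected owner.
rows = [
    {"host": "node1", "block": "blk_1", "record_length": 48, "records": 1000,
     "owner": "sasdemo", "group": "sas", "permissions": "rw-r--r--"},
    {"host": "node2", "block": "blk_1", "record_length": 48, "records": 1000,
     "owner": "sasdemo", "group": "sas", "permissions": "rw-r--r--"},
    {"host": "node2", "block": "blk_2", "record_length": 48, "records": 800,
     "owner": "hadoop", "group": "sas", "permissions": "rw-r--r--"},
]

anomalies = inconsistent_fields(rows)
print(anomalies)
```

Several rows with the same host name are not flagged, because it is normal for one machine to store several blocks.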

The files added to HDFS are stored as blocks. One block is the preferred block, and additional copies of the blocks are used to provide data redundancy. The Block Distribution dialog box offers two ways to view this information. The Block Detail View tab enables you to select a block number and view the host names that store the original or redundant blocks. The Node Detail View enables you to select a host name and view the block numbers that are stored on the machine.

Block Distribution Information

To view the block distribution, select a table, and click the block distribution icon. The following information is provided:

Block Distribution Information
Field	Description
File Size	Specifies the size of the file in bytes.
Block Size	Specifies the block size for the file.
Blocks	Specifies the number of blocks used to store the original copy of the data.
Machines Used	Specifies the number of machines in the cluster that have original or redundant blocks for the file.
Copies	Specifies the number of redundant block copies of the data.

On the Block Detail View tab, you can select a block number. This enables you to view how many copies of the block exist and the host names for the machines that store the blocks. The value in the Total Copies column equals the number of redundant copies of the block plus the original block. You can select the column heading to sort the rows. In an ideal distribution, the number of total copies is equal for all blocks.

On the Host Detail View tab, you can expand a host name node, and then view the block numbers that are stored on that machine. When you select the block number, the host name and any additional machines with copies of the block are identified in the host name list.
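
The two views present the same information from opposite directions. The following Python sketch illustrates this with a hypothetical mapping from block numbers to the host names that store a copy of each block; the mapping, the host names, and the function names are assumptions made for illustration only. The sketch counts total copies per block (the original block plus redundant copies), checks whether every block has the same number of copies, and inverts the mapping to list the blocks on each host.

```python
from collections import defaultdict


def total_copies(block_hosts):
    """Return the total copies per block: original plus redundant copies."""
    return {block: len(hosts) for block, hosts in block_hosts.items()}


def is_evenly_copied(block_hosts):
    """Return True when every block has the same number of total copies."""
    return len(set(total_copies(block_hosts).values())) <= 1


def blocks_by_host(block_hosts):
    """Invert the block-to-hosts mapping into a host-to-blocks mapping."""
    by_host = defaultdict(list)
    for block, hosts in block_hosts.items():
        for host in hosts:
            by_host[host].append(block)
    return {host: sorted(blocks) for host, blocks in by_host.items()}


# Hypothetical distribution: block 2 is missing its redundant copy.
block_hosts = {
    0: ["node1", "node2"],
    1: ["node2", "node3"],
    2: ["node3"],
}

print(total_copies(block_hosts))
print(is_evenly_copied(block_hosts))
print(blocks_by_host(block_hosts))
```

In this made-up data, block 2 stands out because its total copy count is lower than that of the other blocks.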

How to Introduce an Additional Directory

Each co-located HDFS directory that you use must be represented in metadata by a library that uses the SASHDAT engine. To create the required metadata, see the chapter Connecting to Common Data Sources in the SAS Intelligence Platform: Data Administration Guide.

Here are some key points:

Each directory in co-located HDFS must also have a corresponding LASR library. See Add a LASR Library.
The server tag for the corresponding LASR library must be the source path in dot-delimited format. See Server Tags.
To facilitate parallel loads, use single-level paths that have only eight or fewer characters. For example, use /sales instead of /dept/sales or /sales_department. The path is the basis for the server tag, and the server tag is used as a libref in parallel loads.
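
The path guidance above can be sketched in Python. The sketch assumes that the dot-delimited server tag is formed by dropping the leading slash and replacing the remaining slashes with dots, and it checks only the two recommendations stated here (a single-level path of eight or fewer characters). The function names are hypothetical, and the check does not cover any other naming rules that SAS might apply to librefs.

```python
def path_levels(hdfs_path: str) -> list:
    """Split an HDFS path into its non-empty directory levels."""
    return [level for level in hdfs_path.split("/") if level]


def server_tag(hdfs_path: str) -> str:
    """Return the source path in dot-delimited format."""
    levels = path_levels(hdfs_path)
    if not levels:
        raise ValueError("the path must name at least one directory")
    return ".".join(levels)


def suits_parallel_load(hdfs_path: str) -> bool:
    """Check the stated guidance: one level, eight or fewer characters."""
    levels = path_levels(hdfs_path)
    return len(levels) == 1 and len(levels[0]) <= 8


for path in ("/sales", "/dept/sales", "/sales_department"):
    print(path, server_tag(path), suits_parallel_load(path))
```

Of the three example paths from this section, only /sales passes both checks in this sketch.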

How to Delete an HDFS Table

Right-click the table in the Folders pane, and select Delete.
In the confirmation window, if you want to delete the physical table with the metadata object that represents it, select the Remove from HDFS storage check box.

Tip

You can also delete an HDFS table from the HDFS tab. Select the table, and click Remove from HDFS in the tab’s toolbar.

Last updated: December 18, 2018