Distributed Server: Co-located HDFS

Introduction

Co-located HDFS is a deployment of Hadoop that meets the following criteria:
  • The deployment runs on the same hardware as a distributed SAS LASR Analytic Server.
  • The deployment incorporates services that SAS High-Performance Deployment of Hadoop provides.
SAS High-Performance Deployment of Hadoop adds services to Apache Hadoop (and other supported Hadoop distributions) to provide the following integrated functionality:
  • SAS uses a special file format (with the filename suffix SASHDAT) to store tables in HDFS. Like any file that is stored in HDFS, a SASHDAT file is distributed as a series of blocks. Copies of blocks are stored to provide data redundancy.
  • SAS enhances the block distribution algorithm to make sure that blocks are distributed evenly. Because SAS LASR Analytic Server reads blocks of data directly, the even block distribution contributes to an even workload on the machines in the cluster.
This integration enables a distributed SAS LASR Analytic Server to use HDFS to read SASHDAT tables in parallel very efficiently.
Tip: Basic HDFS commands are documented in the SAS LASR Analytic Server: Reference Guide.

About the HDFS Tab

Introduction

To open the HDFS tab, select Tools, and then select Explore HDFS from the main menu in the administrator.
Note: The HDFS tab is available only in deployments that use co-located HDFS. Only users who have the Browse HDFS capability can access the HDFS tab.
The HDFS tab provides a host-layer view of HDFS folders and tables. The view is not mediated by metadata or by your permissions. Instead, a privileged Hadoop account retrieves the information that this tab displays.
You can use the HDFS tab to perform the following tasks:
  • Browse HDFS folders and tables.
  • View row count, columns, column information, and block information for tables that have been added to HDFS. Information about block distribution, block redundancy, and measures of block utilization is provided.
  • Delete HDFS tables that are stored in SASHDAT format. (Files that are not SASHDAT files are listed, but they cannot be deleted.)

System Properties

To view HDFS system properties, click Properties. The following table describes the fields:
HDFS System Properties

| Property | Description |
| --- | --- |
| Command for setting permissions | This setting is not used. |
| Set permissions as root? | This setting is not used. |
| Command for getting file information | This setting is not used. |
| Data directories | Specifies the directory that is used to store blocks. |
| Name Node | Specifies the host name of the machine that is used as the Hadoop NameNode. |
| Live Data Nodes | Specifies the number of Hadoop DataNodes that are reachable. |
| Dead Data Nodes | Specifies the number of Hadoop DataNodes that are not available. |

Basic File Information

To view basic file information, select a file. The following information is provided:
Basic File Information

| Field | Description |
| --- | --- |
| Name | Specifies the name of the file. |
| Size | Specifies the file size. This value includes the disk space required to store the data in blocks and metadata about the file. |
| Date Modified | Specifies the date on which the file was created or replaced. |
| Path | Specifies the HDFS directory. |
| Description | Specifies the description that is stored with the data. The description is displayed beside the table name in the explorer interface. |
| Copies | Specifies the number of redundant copies of the data. |
| Block Size | Specifies the number of bytes that are used to store each block of data. |
| Number of Variables | Specifies the number of columns in the HDFS table. |
| Owner | Specifies the user account that added the data to HDFS. |
| Group | Specifies the primary UNIX group for the user account. |
| Permissions | Specifies the Read, Write, and Execute access permissions for owner, group, and other. |
| SASHDAT file? | Specifies whether the file is in the SASHDAT format. Yes indicates that the file is in the SASHDAT format. |
| Compression | Specifies whether the file is compressed. Yes indicates that the file is compressed. |
| Encryption | Specifies whether the file is encrypted. Yes indicates that the file is encrypted. |

Note: The HDFS tab might display multiple files for a table as the table is being added to HDFS. After the table is added, the multiple files disappear.

Table Information

To view column information, select a table, and click Column information. The following information is provided:
Column Information

| Field | Description |
| --- | --- |
| Column Name | Specifies the column name from the source table. |
| Label | Specifies the label for the data set column when the table was added to HDFS. |
| Type | Numeric or Character. Numeric variables are encoded as 1. |
| Offset | Specifies the starting position for the variable in the SASHDAT file. |
| Length | Specifies the storage used by the variable. |
| Format | Specifies the format associated with the variable. |
| Format Length | Specifies the format length of the format that existed on the variable when it was added to HDFS. This value is zero if the variable did not have a format when it was added to HDFS. |
| Precision | Specifies the precision portion of the format for number formats. |
| Length (Formatted) | Specifies the length of the variable when formatting is applied. |

To view the row count, select a table, and click Row count. The following information is provided:
Row Count Information

| Field | Description |
| --- | --- |
| Rows | Specifies the number of rows in the data. |
| Blocks | Specifies the number of HDFS blocks that are used to store the data. |
| Allocated | Specifies the number of bytes allocated to store the data. The value is the product of the block size and the number of blocks. This value is smaller than the file size because it does not include the space needed for the SASHDAT file header. |
| Used | Specifies the number of bytes within the allocated blocks that are used for storing rows of data. |
| Utilization | Specifies the percentage of allocated space that is used for storing rows of data. |
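
The relationship among the Blocks, Allocated, Used, and Utilization fields can be illustrated with a short Python sketch. The function name and the sample numbers are hypothetical and are not part of the product. The calculation assumes that Allocated equals the block size multiplied by the number of blocks, and that Utilization is Used divided by Allocated, expressed as a percentage.

```python
def utilization_percent(block_size: int, blocks: int, used: int) -> float:
    """Return the percentage of allocated space that stores rows of data."""
    # Assumption: Allocated is the block size times the number of blocks.
    allocated = block_size * blocks
    if allocated == 0:
        return 0.0
    return 100.0 * used / allocated


# Hypothetical values: 4 blocks of 32 MiB each, with 100 MiB used for rows.
block_size = 32 * 1024 * 1024
print(utilization_percent(block_size, blocks=4, used=100 * 1024 * 1024))  # prints 78.125
```

A low utilization value in this sketch would correspond to blocks that are allocated but only partly filled with rows.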

Block Detail Information

To view block details, select a file, and click Block details. The following information is provided:
Block Detail Information

| Field | Description |
| --- | --- |
| Host Name | Specifies the machine in the cluster that stores the block of data. |
| Block Name | Specifies the filename for the block. |
| Path | Specifies the directory that contains the block. |
| Record Length | Specifies the sum of the column lengths for the variables in the data. |
| Records | Specifies the number of rows stored in the block. Because redundant blocks are listed in the table, the sum of the records listed does not equal the number of rows in the data. |
| Owner | Specifies the user account that added the data to HDFS. |
| Group | Specifies the primary UNIX group for the user account that stored the data. |
| Permissions | Specifies the Read, Write, and Execute access permissions for owner, group, and other. |

You can sort by the column headings to identify anomalies. It is normal for several blocks to be stored on the same machine. However, it is not normal for the values of Record Length, Owner, Group, or Permissions to be different from row to row.
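The anomaly check described above can also be sketched in Python for block details that have been copied out of the tab by some other means. This is only an illustration: the row dictionaries, host names, and the `find_anomalies` helper are hypothetical, and the HDFS tab itself does not expose such a function. The sketch treats the most common value of each field as the expected one.

```python
from collections import Counter

# Fields whose values are expected to be identical in every row.
UNIFORM_FIELDS = ("Record Length", "Owner", "Group", "Permissions")


def find_anomalies(rows):
    """Return {field: [rows whose value differs from the most common value]}."""
    anomalies = {}
    for field in UNIFORM_FIELDS:
        counts = Counter(row[field] for row in rows)
        if len(counts) > 1:
            expected, _ = counts.most_common(1)[0]
            anomalies[field] = [row for row in rows if row[field] != expected]
    return anomalies


# Hypothetical block detail rows; the third row has a different owner.
rows = [
    {"Host Name": "node1", "Record Length": 48, "Owner": "sasdemo",
     "Group": "sas", "Permissions": "rw-r--r--"},
    {"Host Name": "node2", "Record Length": 48, "Owner": "sasdemo",
     "Group": "sas", "Permissions": "rw-r--r--"},
    {"Host Name": "node3", "Record Length": 48, "Owner": "root",
     "Group": "sas", "Permissions": "rw-r--r--"},
]

for field, bad_rows in find_anomalies(rows).items():
    print(field, [row["Host Name"] for row in bad_rows])
```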
The files added to HDFS are stored as blocks. One block is the preferred block, and additional copies of the block provide data redundancy. The Block Distribution dialog box offers two ways to view this information. The Block Detail View tab enables you to select a block number and view the host names that store the original or redundant blocks. The Node Detail View enables you to select a host name and view the block numbers that are stored on that machine.

Block Distribution Information

To view the block distribution, select a table, and click Block distribution. The following information is provided:
Block Distribution Information

| Field | Description |
| --- | --- |
| File Size | Specifies the size of the file in bytes. |
| Block Size | Specifies the block size for the file. |
| Blocks | Specifies the number of blocks used to store the original copy of the data. |
| Machines Used | Specifies the number of machines in the cluster that have original or redundant blocks for the file. |
| Copies | Specifies the number of redundant block copies of the data. |

On the Block Detail View tab, you can select a block number. This enables you to view how many copies of the block exist and the host names for the machines that store the blocks. The value in the Total Copies column equals the number of redundant copies of the block plus the original block. You can select the column heading to sort the rows. In an ideal distribution, the number of total copies is equal for all blocks.
On the Host Detail View tab, you can expand a host name node, and then view the block numbers that are stored on that machine. When you select the block number, the host name and any additional machines with copies of the block are identified in the host name list.
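The two views are inversions of the same mapping, which the following Python sketch illustrates. The block numbers, host names, and function names are hypothetical sample data and helpers, not output or an API from the product. Total copies is computed as the number of hosts that hold a block, that is, the original plus its redundant copies.

```python
from collections import defaultdict

# Hypothetical mapping of block number to the hosts that store a copy.
block_hosts = {
    0: ["node1", "node2"],
    1: ["node2", "node3"],
    2: ["node3", "node1"],
}


def total_copies(block_hosts):
    """Per block: the original plus redundant copies (block-oriented view)."""
    return {block: len(hosts) for block, hosts in block_hosts.items()}


def blocks_by_host(block_hosts):
    """Invert the mapping: block numbers stored on each host (host-oriented view)."""
    by_host = defaultdict(list)
    for block, hosts in sorted(block_hosts.items()):
        for host in hosts:
            by_host[host].append(block)
    return dict(by_host)


def is_evenly_replicated(block_hosts):
    """True when every block has the same number of total copies."""
    return len(set(total_copies(block_hosts).values())) <= 1


print(total_copies(block_hosts))
print(blocks_by_host(block_hosts))
print(is_evenly_replicated(block_hosts))
```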

How to Introduce an Additional Directory

Each co-located HDFS directory that you use must be represented in metadata by a library that uses the SASHDAT engine. To create the required metadata, see the chapter Connecting to Common Data Sources in the SAS Intelligence Platform: Data Administration Guide.
Here are some key points:
  • Each directory in co-located HDFS must also have a corresponding LASR library. See Add a LASR Library.
  • The server tag for the corresponding LASR library must be the source path in dot-delimited format. See Server Tags.
  • To facilitate parallel loads, use single-level paths that have eight or fewer characters. For example, use /sales instead of /dept/sales or /sales_department. The path is the basis for the server tag, and the server tag is used as a libref in parallel loads.
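The naming rules in the key points above can be sketched in Python. Both helper functions are hypothetical and only illustrate the convention. The sketch assumes that the dot-delimited form is obtained by dropping the leading slash and replacing the remaining slashes with periods, and that the eight-character limit applies to the directory name without the leading slash.

```python
def server_tag_from_path(hdfs_path: str) -> str:
    """Convert an HDFS path such as /dept/sales to dot-delimited form."""
    parts = [part for part in hdfs_path.split("/") if part]
    if not parts:
        raise ValueError("path must contain at least one directory name")
    return ".".join(parts)


def is_parallel_load_friendly(hdfs_path: str) -> bool:
    """True for a single-level path of eight or fewer characters."""
    parts = [part for part in hdfs_path.split("/") if part]
    # Assumption: the limit counts the directory name, not the leading slash.
    return len(parts) == 1 and len(parts[0]) <= 8


for path in ("/sales", "/dept/sales", "/sales_department"):
    print(path, server_tag_from_path(path), is_parallel_load_friendly(path))
```

In this sketch, only /sales passes the check, which matches the example paths given in the list.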

How to Delete an HDFS Table

  1. Right-click the table in the Folders pane, and select Delete.
  2. In the confirmation window, if you want to delete the physical table along with the metadata object that represents it, select the Remove from HDFS storage check box.
Tip: You can also delete an HDFS table from the HDFS tab. Select the table, and click Remove from HDFS in the tab's toolbar.
Last updated: December 18, 2018