HDFS does not support updating data. However, because traditional SAS processing involves
updating data,
SPD Server supports SAS Update operations for data stored in HDFS.
To update data in HDFS, SPD Server uses an approach that replaces the table’s
data partition file for each row that is updated. When an update is requested, SPD Server re-creates
the data
partition file in its entirety (including all replications), and then inserts the updated data
into the new data partition file. Because the data partition file is replaced for
each row that is updated, the greater the number of rows to be updated, the longer
the update takes.
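For illustration, here is a minimal sketch of an Update operation against a table stored
in an HDFS domain, assuming a SASSPDS LIBNAME connection of this general form. The libref,
domain name, host, port, credentials, table, and column names are placeholders; adjust
them to your environment.

   /* Connect to an SPD Server domain that is backed by HDFS
      (placeholder domain, host, port, and credentials) */
   libname hdfsdom sasspds 'myhdfsdomain'
      host='spdshost' serv='5400'
      user='myuser' password='XXXXXXXX';

   /* Update one row in place. For each updated row, SPD Server
      re-creates the affected data partition file and writes the
      changed data into the new file. */
   data hdfsdom.sales;
      modify hdfsdom.sales;
      where custid = 1042;
      status = 'CLOSED';
      replace;
   run;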
For general-purpose data storage, the ability to perform small, infrequent updates
can be beneficial. However, updating data in HDFS is intended for situations in which
the benefit of the update outweighs the time that it takes to complete, compared with
the alternatives.
Here are some best practices for Update operations using SPD Server:
- It is recommended that you set up a test in your environment to measure Update
operation performance. For example, update a small number of rows to gauge how long
updates take, and then project the test results to a larger number of rows to determine
whether updating is realistic (see the timing sketch after this list).
- It is recommended that you do not use the SQL procedure to update data in HDFS,
because of how PROC SQL opens, updates, and closes a file. Other SAS methods, such as
the DATA step UPDATE and MODIFY statements, provide better performance.
- Appending to a table can be slower if the table has a unique index. Test case results
show that appending a table to another table without a unique index is significantly
faster than appending the same table to another table with a unique index (see the
append comparison after this list).
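As a rough timing test for the first recommendation, something like the following sketch
could be used to gauge update cost on a small number of rows before committing to a large
update. FULLSTIMER is a standard SAS system option; the libref, table, and WHERE clause
are placeholders carried over from the earlier sketch.

   /* Report full timing statistics in the SAS log */
   options fullstimer;

   /* Update a small, representative set of rows */
   data hdfsdom.sales;
      modify hdfsdom.sales;
      where custid in (1001, 1002, 1003);
      status = 'CLOSED';
      replace;
   run;

   /* Note the real time reported in the log, and scale that
      estimate to the total number of rows to be updated */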
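To see the effect of a unique index on append performance, a comparison along these lines
could be run. It assumes two identical copies of the base table (here called sales_noidx
and sales_idx) and a WORK data set of rows to append (newrows); all of these names, and
the indexed column, are placeholders.

   /* Run 1: append to the copy that has no unique index */
   proc append base=hdfsdom.sales_noidx data=work.newrows;
   run;

   /* Run 2: create a unique index on the second copy, then append
      the same rows and compare the elapsed times in the log */
   proc datasets lib=hdfsdom nolist;
      modify sales_idx;
      index create custid / unique;
   quit;

   proc append base=hdfsdom.sales_idx data=work.newrows;
   run;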