Updating Data in HDFS

HDFS does not support updating data in place. However, because traditional SAS processing involves updating data, SPD Server supports SAS Update operations for data stored in HDFS.
To update data in HDFS, SPD Server uses an approach that replaces the table's data partition file for each row that is updated. When an update is requested, SPD Server re-creates the data partition file in its entirety (including all replications), and then inserts the updated data into the new data partition file. Because the data partition file is replaced for each row that is updated, the update takes longer as the number of rows to be updated grows.
For general-purpose data storage, the ability to perform small, infrequent updates can be beneficial. However, updating data in HDFS is intended for situations in which the benefit of the update justifies the time that the update takes to complete, compared with alternatives such as re-creating the table.
Here are some best practices for Update operations using SPD Server:
  • It is recommended that you set up a test in your environment to measure Update operation performance. For example, update a small number of rows to gauge how long updates take in your environment. Then, project the test results to a larger number of rows to determine whether updating is realistic.
  • It is recommended that you do not use the SQL procedure to update data in HDFS because of how PROC SQL opens, updates, and closes a file. Other SAS methods, such as the DATA step UPDATE and MODIFY statements, provide better performance.
  • Appending data to a table can be slower if the target table has a unique index. Test case results show that appending a table to another table without a unique index is significantly faster than appending the same table to another table with a unique index.
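The DATA step MODIFY approach recommended above can be sketched as follows. This is a minimal illustration, assuming a hypothetical SPD Server LIBNAME connection (the domain, server, table, and variable names are placeholders; adjust them to your environment):

```
/* Hypothetical SPD Server connection to an HDFS-backed domain. */
libname hdfsdom sasspds 'hdfsdomain'
        server=spdshost.spdsname
        user='myuser' password='mypwd';

/* Update rows in place with the MODIFY statement, which avoids   */
/* the open/update/close pattern that makes PROC SQL updates slow. */
data hdfsdom.sales;
   modify hdfsdom.sales;
   if region = 'WEST' then do;
      price = price * 1.05;
      replace;       /* rewrite the current observation */
   end;
run;
```

Before running a statement like this against a large table, run it with a WHERE clause or subset that touches only a small number of rows, note the real time reported in the SAS log (the FULLSTIMER system option adds detail), and project from that measurement to decide whether a larger update is practical.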