In our earlier blogs, we discussed optimizing Delta tables with the OPTIMIZE and ZORDER commands, covering how they work and how they affect table performance. However, it is also important to consider the files that are invalidated when we run OPTIMIZE.
As an example, after running OPTIMIZE on a Delta table, we saw 547 files merged into a single file, and the smaller files were invalidated. This means the same data now exists in two places: the original small files and the new file created by merging them. It is therefore important to manage these data files and reduce the storage cost of the invalidated ones.
In this scenario, the VACUUM command comes in handy.
It helps us to:
a. Reduce the cost of storing these files.
b. Delete the files that contain deleted and updated records, as well as the files invalidated by merging during OPTIMIZE, once they are older than the data retention threshold.
The VACUUM command removes data files that are no longer referenced by the table and are older than the retention threshold. It skips any file or folder whose name starts with an underscore (_), which is why the Delta log folder (_delta_log) does not get deleted. In other words, the command deletes only data files, not the transaction log, although it does record the operation as a new JSON transaction file in the log. Old transaction files are cleaned up automatically after a checkpoint file is created, once they pass their own retention period. By default, the retention period for transaction files is 30 days, while the retention period for data files is 7 days. The 7-day retention period can be changed through configuration.
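To make the selection rule concrete, here is a minimal pure-Python sketch of which files would be eligible for removal. It is a simplified model, not Delta Lake's implementation: the function names, the file names, and the use of a last-modified timestamp as the age of a file are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Default data-file retention discussed above (7 days).
DEFAULT_RETENTION = timedelta(days=7)


def is_hidden(path: str) -> bool:
    """True if any folder or file name in the path starts with an underscore."""
    return any(part.startswith("_") for part in path.split("/") if part)


def vacuum_candidates(files, referenced, now, retention=DEFAULT_RETENTION):
    """Return the paths that are removable under this simplified model.

    files      -- dict mapping relative path -> last-modified datetime (assumed)
    referenced -- set of paths still referenced by the current table version
    """
    cutoff = now - retention
    return sorted(
        path
        for path, modified in files.items()
        if not is_hidden(path)        # skips _delta_log and similar folders
        and path not in referenced    # only invalidated files
        and modified < cutoff         # only files past the retention period
    )


# Hypothetical table contents.
now = datetime(2024, 1, 31)
files = {
    "part-0001.parquet": datetime(2024, 1, 1),   # old small file merged by OPTIMIZE
    "part-0002.parquet": datetime(2024, 1, 29),  # invalidated only recently
    "part-0003.parquet": datetime(2024, 1, 1),   # old but still referenced
    "_delta_log/00000000000000000000.json": datetime(2024, 1, 1),
}
referenced = {"part-0003.parquet"}

print(vacuum_candidates(files, referenced, now))  # ['part-0001.parquet']
```

Only the first file qualifies: the second is still inside the 7-day window, the third is still part of the table, and the log file sits under an underscore-prefixed folder.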
Since the VACUUM command cannot be reverted, it checks whether the retention period you request is shorter than the configured threshold. If it is, the command fails with an error instead of deleting the files. The check can be enabled or disabled using the configuration:
spark.databricks.delta.retentionDurationCheck.enabled

By default, retentionDurationCheck is set to true, which means that the retention duration check is enabled. You can turn it off if you need to vacuum with a shorter retention period.
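As a rough sketch of how the check might be turned off for a single run, the helper below disables it, issues VACUUM with a short retention, and then switches the check back on. It assumes an existing SparkSession passed in as spark. The function name and the table name in the comment are hypothetical, and the helper sets the check back to "true" rather than restoring whatever value was there before.

```python
CHECK_KEY = "spark.databricks.delta.retentionDurationCheck.enabled"


def vacuum_with_short_retention(spark, table_name: str, hours: int) -> str:
    """Run VACUUM with a retention below the threshold, then re-enable the check."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    statement = f"VACUUM {table_name} RETAIN {hours} HOURS"
    spark.conf.set(CHECK_KEY, "false")     # turn the safety check off
    try:
        spark.sql(statement)               # deleted files cannot be recovered
    finally:
        spark.conf.set(CHECK_KEY, "true")  # always turn the check back on
    return statement


# Example call (hypothetical table name, requires a live SparkSession):
# vacuum_with_short_retention(spark, "events", 24)
```

Wrapping the call in try/finally keeps the check from staying disabled for the rest of the session if the VACUUM statement fails.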

The retention period for the VACUUM command can be configured through the table properties. Additionally, you can preview the list of files that would be deleted after running the VACUUM command. You can achieve this by running the VACUUM command with the DRY RUN option.
The DRY RUN option ensures that the command is not executed; rather, it just lists out the files that would be deleted if we run the command. It is always good practice to use the DRY RUN option before performing the actual VACUUM command. This helps to avoid unintended deletion of important files.
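The two steps above, setting the retention period as a table property and previewing a VACUUM with DRY RUN, can be sketched as small helpers that build the SQL statements. The helper names and the table name events are hypothetical. The delta.deletedFileRetentionDuration property is the Delta table property for data-file retention, and the generated statements would be passed to spark.sql(...) in a real session.

```python
def set_retention_sql(table_name: str, days: int) -> str:
    """Build the ALTER TABLE statement that sets the data-file retention period."""
    return (
        f"ALTER TABLE {table_name} SET TBLPROPERTIES "
        f"('delta.deletedFileRetentionDuration' = 'interval {days} days')"
    )


def vacuum_sql(table_name: str, retain_hours=None, dry_run: bool = False) -> str:
    """Build a VACUUM statement, optionally with RETAIN and DRY RUN clauses."""
    parts = ["VACUUM", table_name]
    if retain_hours is not None:
        parts += ["RETAIN", str(retain_hours), "HOURS"]
    if dry_run:
        parts.append("DRY RUN")  # list the files only, delete nothing
    return " ".join(parts)


print(set_retention_sql("events", 14))
print(vacuum_sql("events", dry_run=True))
print(vacuum_sql("events", retain_hours=168, dry_run=True))
```

Running the DRY RUN statement first and the same statement without DRY RUN afterwards keeps the preview and the actual deletion in step.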

In conclusion, this blog has provided insights on the functionality of the VACUUM command in Delta Lake. By using VACUUM, you can effectively manage the invalidated files, reduce storage costs, and delete unnecessary files. Moreover, by previewing the list of files to be deleted using DRY RUN, you can avoid accidental deletion of important files.