What is OPTIMIZE in Synapse Analytics Delta Table?

In our last blog, we saw how a Delta table works. Today’s topic is the OPTIMIZE command, which helps us optimize a Delta table.

In real-world scenarios, the number of data files grows very quickly because of the many transactions performed on the table during day-to-day work. Many of these data files are small. With so many small files, queries that read and combine data from the table slow down.

To improve performance, we can compact the small files into fewer, larger files so that the Delta engine doesn’t waste compute and time opening and combining many small files before sending the results back to the user.

By default, OPTIMIZE targets a size of 1 GB for the files it writes. You can change this by setting the spark.microsoft.delta.optimize.maxFileSize property, which lets you choose a target file size that’s different from the default. The cell below reads the current value.

%%pyspark
spark.conf.get('spark.microsoft.delta.optimize.maxFileSize')
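
To change the target rather than just read it, the same property can be set for the current session. The sketch below is only an illustration: the helper name set_optimize_target_mb and the 256 MB value are made up for this example, and it assumes the property takes a size in bytes, which is worth verifying in your own workspace. In a Synapse notebook, spark is the session object that is already available.

```python
def set_optimize_target_mb(spark, size_mb):
    """Set the OPTIMIZE target file size for the current Spark session.

    Hypothetical helper. Assumption: the property value is a size in bytes.
    """
    size_bytes = size_mb * 1024 * 1024
    # Property name taken from the text above; the value is passed as a string.
    spark.conf.set("spark.microsoft.delta.optimize.maxFileSize", str(size_bytes))
    return size_bytes


# In a notebook cell, where `spark` already exists:
# set_optimize_target_mb(spark, 256)
```

A smaller target usually means shorter OPTIMIZE runs but more files left to read, so the right value depends on the table.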

How does it work?

When we run the OPTIMIZE command, it analyzes the files and merges the small ones into a new, larger file in the table folder. The files that were combined are marked as no longer referenced, so the Delta engine ignores them and reads only the new file.

However, the unreferenced files are not deleted immediately. You can use the VACUUM command later to clean up these files.

You can run OPTIMIZE with a predicate on the date so that you don’t spend compute going over data that has already been compacted.
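
A minimal sketch of such a call from a notebook is shown here. The table name sales, the column order_date, and both helper names are hypothetical. The OPTIMIZE ... WHERE form is Delta Lake SQL; in open-source Delta Lake the predicate has to be on partition columns, so the sketch assumes the table is partitioned by the date column.

```python
def build_optimize_sql(table, date_column, since):
    """Build an OPTIMIZE statement limited to rows on or after `since`.

    Hypothetical helper. Assumption: `date_column` is a partition column.
    """
    return f"OPTIMIZE {table} WHERE {date_column} >= '{since}'"


def optimize_since(spark, table, date_column, since):
    """Run the statement on the given Spark session and return the result."""
    return spark.sql(build_optimize_sql(table, date_column, since))


# In a notebook cell, where `spark` already exists:
# optimize_since(spark, "sales", "order_date", "2023-01-01").show()
print(build_optimize_sql("sales", "order_date", "2023-01-01"))
```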

Internally, the command takes the files that are smaller than the 1 GB target, sorts them by size, and packs them into bins. It keeps adding small files to a bin, and when the next file would push the bin past 1 GB, it opens another bin. This is done separately for each partition.
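
The packing step described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the engine’s actual code: the function name pack_into_bins and the sample sizes are made up, and sizes are given in MB to keep the numbers readable.

```python
def pack_into_bins(file_sizes, target_size):
    """Greedily group small files into bins of at most `target_size`.

    Illustrative sketch for a single partition, not the engine's code.
    """
    # Only files below the target are candidates for compaction.
    small_files = sorted(size for size in file_sizes if size < target_size)

    bins = []
    current_bin, current_size = [], 0
    for size in small_files:
        # Open a new bin when this file would overflow the current one.
        if current_bin and current_size + size > target_size:
            bins.append(current_bin)
            current_bin, current_size = [], 0
        current_bin.append(size)
        current_size += size
    if current_bin:
        bins.append(current_bin)
    return bins


# Hypothetical file sizes in MB, with a 1 GB (1024 MB) target.
sizes_mb = [100, 200, 300, 900, 1200]
print(pack_into_bins(sizes_mb, 1024))
```

In this sample, the 1200 MB file is left alone because it is already above the target, and the 900 MB file ends up in its own bin because adding it to the first bin would exceed 1024 MB.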

Let’s look at the log file after running the OPTIMIZE command. You can see the names of the files that got merged and the final file that was added to the table.
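
Each commit file in the table’s _delta_log folder is a text file with one JSON action per line. An OPTIMIZE commit holds remove actions for the merged files and add actions for the new ones. The sketch below pulls those paths out of such a file. The sample lines and file names are invented for illustration and heavily trimmed; a real commit carries many more fields.

```python
import json


def summarize_commit(lines):
    """Return (removed_paths, added_paths) from the lines of a commit file."""
    removed, added = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "remove" in action:
            removed.append(action["remove"]["path"])
        elif "add" in action:
            added.append(action["add"]["path"])
    return removed, added


# Hypothetical, trimmed-down commit content.
sample_commit = [
    '{"commitInfo": {"operation": "OPTIMIZE"}}',
    '{"remove": {"path": "part-00000-small-a.snappy.parquet"}}',
    '{"remove": {"path": "part-00001-small-b.snappy.parquet"}}',
    '{"add": {"path": "part-00000-compacted.snappy.parquet"}}',
]

removed, added = summarize_commit(sample_commit)
print("merged files:", removed)
print("new file:", added)
```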

Here we have only used the OPTIMIZE command, but it also works well, and gives good performance, together with Z-ORDER. We will discuss Z-ORDER and the VACUUM command in our next blog.

Reference link: https://delta.io/blog/2023-01-25-delta-lake-small-file-compaction-optimize/
