Query Optimization In Synapse Dedicated SQL Pool

Today, we are going to look into one of the commonly faced issues in the field of Data Engineering: slow-running queries in the production environment. I encountered this issue in production, where queries were running for over 13 hours whereas they initially used to take 22-25 minutes.

Use case: One weekend, we saw a failure in PROD. While checking the failure, we discovered an activity timeout. Upon digging deeper, we found that a Stored Procedure had run for over 13 hours and timed out because it exceeded the timeout configured on the activity.

We analyzed the Stored Procedure further and found that it was calling a view where most of the execution time was being spent. We ran the same Stored Procedure again and it worked fine, completing within the previously expected time. However, it failed again the very next day.

In these situations, performance tuning must be performed to make the query faster. There are certain points that need to be taken into consideration.

This post does not solve all query performance optimization issues. Here, I am listing the basic checks that actually helped me optimize my query to the point where it takes less time than it did previously.

First, we need to identify where the query needs to be optimized and pinpoint where it is spending the most time.

Compute Check: Firstly, we checked the compute of the Data Warehouse (DW) to see if it had changed. The compute was running at the same level as earlier.
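As a minimal sketch, one way to confirm the current compute level of a dedicated SQL pool hosted on a logical server is to query the service objective from the master database (the query below is an illustration, not the exact check we ran):

-- Run against the master database of the logical server.
-- service_objective returns the DWU level, e.g. 'DW1000c'.
SELECT  db.[name]              AS pool_name,
        ds.[service_objective] AS dwu_level
FROM    sys.database_service_objectives ds
JOIN    sys.databases db
        ON ds.database_id = db.database_id;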

Patch On DW: We considered whether a recent patch on the database might have caused the increased query duration. However, the maintenance window did not fall on weekends.

Load Capacity Check: We checked the load on the environment, i.e., the number of queries being fired at the database, to see if it had increased or decreased. The database was processing the same load and number of queries as before.
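For reference, the active and queued requests on a dedicated SQL pool can be inspected through the sys.dm_pdw_exec_requests DMV; a sketch of such a check:

-- Snapshot of what the pool is currently executing or queueing.
SELECT  request_id,
        [status],
        submit_time,
        total_elapsed_time,
        command
FROM    sys.dm_pdw_exec_requests
WHERE   [status] IN ('Running', 'Suspended')
ORDER BY submit_time;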

All the above checks passed, and we found nothing abnormal: no change in the compute or the load on the database. Therefore, we knew we had to look into query optimization options.

Analyzing the Query Plan: This is the first and most basic tool that a data engineer can use to understand query behavior and the compute being used. We checked the query plan of the view and found that a “shuffle move” was taking more than 50% of the overall time, and it was occurring in more than one join. The shuffle move is the most expensive and time-consuming data movement operation because it has to redistribute rows across the nodes so that the data being joined ends up on the same distribution.
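In a dedicated SQL pool, the distributed plan, including any SHUFFLE_MOVE steps, can be generated by prefixing the query with EXPLAIN. The table and column names below are hypothetical stand-ins for the view's logic:

-- Returns the distributed plan as XML; look for SHUFFLE_MOVE
-- operations and the estimated rows they move.
EXPLAIN
SELECT  f.OrderKey, d.CalendarYear
FROM    dbo.FactOrders f
JOIN    dbo.DimDate d
        ON f.DateKey = d.DateKey;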

To minimize the data shuffling, we first checked the table distribution. Since the tables were holding approximately 10-11 million records, it was crucial for the tables to have the correct distribution; otherwise, it would create problems.
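The current distribution of each table can be confirmed from the catalog views; a minimal sketch:

-- distribution_policy_desc is ROUND_ROBIN, HASH, or REPLICATE.
SELECT  sc.[name] AS schema_name,
        tb.[name] AS table_name,
        tdp.distribution_policy_desc
FROM    sys.pdw_table_distribution_properties tdp
JOIN    sys.tables tb  ON tdp.[object_id] = tb.[object_id]
JOIN    sys.schemas sc ON tb.[schema_id]  = sc.[schema_id];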

To address this issue, we changed the table distribution from Round Robin to Hash distribution, applying the change to the large tables as well. As a result, the shuffle time reduced to only 5% of the total query time. When we ran the query again, it worked fine and completed in 9 minutes, a significant improvement even compared to the initial 22-25 minutes.
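A dedicated SQL pool does not allow the distribution of an existing table to be altered in place, so the usual pattern is CTAS into a hash-distributed copy followed by a rename. A sketch with hypothetical table and column names:

-- Rebuild the table hash-distributed on a frequently joined key.
CREATE TABLE dbo.FactOrders_hash
WITH
(
    DISTRIBUTION = HASH(DateKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.FactOrders;

-- Swap the new table in place of the old one.
RENAME OBJECT dbo.FactOrders      TO FactOrders_old;
RENAME OBJECT dbo.FactOrders_hash TO FactOrders;

Choosing a hash column that is frequently used in joins and has high cardinality keeps the rows that join together on the same distribution, which is what eliminates the shuffle.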


However, the next day it failed again due to a timeout. Upon investigation, we found that the table distribution was still HASH, so why did it take so long? After further analysis, we discovered that the table statistics had not been updated after the load on the table. We updated the statistics, and once again, the query completed in 9 minutes. After updating the statistics for all the tables, the query execution time reduced drastically from 9 minutes to 1 minute and 32 seconds.

-- Refresh all statistics on a table after loading data (table_name is a placeholder)
UPDATE STATISTICS table_name;

By updating the statistics, you ensure that the engine has the latest information about the table's records, which helps it generate a more optimized query plan.
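To spot stale statistics in the first place, STATS_DATE shows when each statistics object was last updated; a sketch, with dbo.FactOrders again being a placeholder:

-- Last update time of every statistics object on the table;
-- dates older than the latest load suggest the stats are stale.
SELECT  st.[name] AS stats_name,
        STATS_DATE(st.[object_id], st.[stats_id]) AS last_updated
FROM    sys.stats st
WHERE   st.[object_id] = OBJECT_ID('dbo.FactOrders');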

Therefore, we fine-tuned the query by following some of the best practices.

Steps that we followed during the optimization process:

  1. Checked the compute of the database.
  2. Checked for scheduled or manual patches on the database.
  3. Monitored load variation on the database while the query was running.
  4. Checked the table distribution.
  5. Updated the table statistics.

By using some of these basic checks and best practices, we managed to reduce the query execution time from over 13 hours to just 1 minute and 32 seconds.

I hope this blog provided some insight into how doing the basics can have a significant impact on query execution time.
