Delta Lake Solution Architecture in Azure Synapse

Let’s first understand the Delta Lake architecture, so that later on we can move ahead and look at Delta tables in the Delta Lake and their capabilities.

Delta Lake is an optimized, advanced storage layer. It follows the medallion architecture, in which tables are segregated into different categories: Bronze, Silver and Gold. Each of these layers represents a different state of the data and its usability. The Bronze layer is the raw landing area where all your data from various sources lands in its original form. The Silver layer serves as the store of cleaned and transformed data on which teams can perform their analysis. The Gold layer is the business layer where the aggregated, business-driven data resides for any organization that adopts the Delta Lake architecture. All the ETL between these layers is done using Spark on Delta Lake, as sketched below.
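A minimal PySpark sketch of that Bronze-to-Silver-to-Gold flow could look like the following; the storage account, folder paths and column names (order_id, order_date, amount) are placeholders chosen only for illustration, and `spark` is the session pre-created in a Synapse notebook.

```python
from pyspark.sql import functions as F

# Hypothetical ADLS paths for the three medallion layers
bronze_path = "abfss://lake@mystorageaccount.dfs.core.windows.net/bronze/sales"
silver_path = "abfss://lake@mystorageaccount.dfs.core.windows.net/silver/sales"
gold_path   = "abfss://lake@mystorageaccount.dfs.core.windows.net/gold/sales_daily"

# Bronze: land the raw source files as-is, stored in Delta format
raw_df = (spark.read.option("header", "true")
          .csv("abfss://lake@mystorageaccount.dfs.core.windows.net/landing/sales/"))
raw_df.write.format("delta").mode("append").save(bronze_path)

# Silver: clean and conform the raw data
silver_df = (
    spark.read.format("delta").load(bronze_path)
         .dropDuplicates(["order_id"])
         .withColumn("amount", F.col("amount").cast("double"))
         .filter(F.col("amount").isNotNull())
)
silver_df.write.format("delta").mode("overwrite").save(silver_path)

# Gold: aggregate into business-ready numbers
gold_df = (
    spark.read.format("delta").load(silver_path)
         .groupBy("order_date")
         .agg(F.sum("amount").alias("total_sales"))
)
gold_df.write.format("delta").mode("overwrite").save(gold_path)
```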

The Delta Lake architecture has a lot to offer that eliminates the issues and shortcomings found in the data lake and the data warehouse, packaging itself with tools that make it a one-point solution. Delta Lake comes with the concept of three data layers, and each of these layers serves a different purpose for the business's data needs. At the end of the day, Delta Lake gives you data warehouse capabilities within the data lake.

That is why we keep describing it as the optimized combination of the data warehouse and the data lake.

Difference between Data Warehouse, Data Lake and Delta Lake

Let’s have a quick comparison of all three and see the key differences between them.

| Properties | Data Warehouse | Data Lake | Delta Lake |
| --- | --- | --- | --- |
| Storage | Stores the data in an RDBMS system | Stores the data in a storage location such as ADLS, S3, Hadoop etc. | Delta Lake is an optimized layer on top of the storage, so the data gets stored in ADLS, S3 etc. |
| Type | Best suited for relational, structured data | Can handle both semi-structured and unstructured data | Best suited for structured, semi-structured and unstructured data |
| Pricing | Costly and time-consuming to scale | Storage is quite cheap and scalable | Storage is quite cheap and scalable |
| ACID | ACID compliant and guarantees data integrity | Non-ACID compliant; can leave data corrupted | ACID compliant and guarantees data integrity |
| Schema | Schema on write | Schema on read | Schema enforcement |
| Usability | Good for BI | Supports AI and ML | Can be used both for AI/ML and for BI analytics |

Your data does not need a data warehouse in order to be stored: it can sit in your Azure Data Lake Storage as Parquet files while still giving you the capability to perform operations that were not supported in the data lake, such as DML operations, as shown in the sketch below.
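For illustration, assuming the Silver-layer files from earlier are registered as a table (the table name `sales`, the path and the `staged_updates` source view are all placeholders), DML statements run directly against the Parquet-backed Delta table:

```python
# Hypothetical: register the existing Delta files as a table so we can run DML
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales
  USING DELTA
  LOCATION 'abfss://lake@mystorageaccount.dfs.core.windows.net/silver/sales'
""")

# DML that a plain data lake of Parquet files cannot do on its own
spark.sql("UPDATE sales SET amount = amount * 1.1 WHERE region = 'EMEA'")
spark.sql("DELETE FROM sales WHERE order_status = 'CANCELLED'")
spark.sql("""
  MERGE INTO sales AS target
  USING staged_updates AS source      -- staged_updates is a hypothetical view
  ON target.order_id = source.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```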

Delta Lake is backed by Delta tables, which are the default table type in Databricks. Since Delta Lake is a file-based system, it stores the data in snappy-compressed Parquet files, while the transaction logs are stored in the “_delta_log” folder in JSON format.
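You can see this layout for yourself by writing a small Delta table and listing its folder; the path below is a placeholder, and mssparkutils is the file-system utility available in Synapse notebooks.

```python
from notebookutils import mssparkutils

# Write a tiny Delta table to a placeholder location
path = "abfss://lake@mystorageaccount.dfs.core.windows.net/demo/people"
spark.createDataFrame([(1, "Asha"), (2, "Ravi")], ["id", "name"]) \
     .write.format("delta").mode("overwrite").save(path)

# The data files themselves are snappy-compressed Parquet...
for f in mssparkutils.fs.ls(path):
    print(f.name)                          # part-*.snappy.parquet, _delta_log/

# ...while each commit is recorded as a JSON entry in _delta_log
for f in mssparkutils.fs.ls(path + "/_delta_log"):
    print(f.name)                          # 00000000000000000000.json, ...
```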

Table Versioning in Delta Lake

Each write, update or delete on a Delta table creates a new version of the table and updates the transaction log. It is possible to check a previous version of the table at any point in time, and the awesome part is that we can restore our table to a previous version. Reading or restoring a previous version is termed Time Travel in Delta Lake.
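Here is a small sketch of time travel using the hypothetical `sales` table and path from the earlier examples (the RESTORE command requires a reasonably recent Delta Lake version):

```python
# Inspect the version history recorded in the transaction log
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

# Read the table as it looked at an earlier version, or at a timestamp
v1_df = spark.read.format("delta").option("versionAsOf", 1).load(
    "abfss://lake@mystorageaccount.dfs.core.windows.net/silver/sales"
)
spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-15'").show()

# Restore the table back to a previous version
spark.sql("RESTORE TABLE sales TO VERSION AS OF 1")
```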

Handling Concurrency in Delta Lake

Because Delta Lake captures every change in the transaction log, it provides good concurrency and prevents deadlocks when multiple users read from or update the table. It uses snapshot isolation for table reads and, by default, WriteSerializable isolation for table writes and updates.
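If a stricter guarantee is needed, the isolation level for writes can be raised from the default WriteSerializable to Serializable through a table property; the sketch below uses the same hypothetical `sales` table.

```python
# Default is WriteSerializable; Serializable is stricter but can cause more write conflicts
spark.sql("ALTER TABLE sales SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')")

# Verify the property on the table
spark.sql("SHOW TBLPROPERTIES sales").show(truncate=False)
```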

To prevent conflicts during write/append operations it uses OPTIMISTIC CONCURRENCY CONTROL, which follows three steps before committing to the table and making an entry in the transaction log (a retry sketch follows the list below).

The steps involved are:

  1. Read: Reads (if needed) the latest available version of the table to identify which files need to be modified.
  2. Write: Plans all the changes by writing new data files.
  3. Validate and commit: Before committing the changes, checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read. If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds. However, if there are conflicts, the write operation fails: in some situations the engine repeats the process, reads the latest entry in the log, stages the data again and commits, while in other situations it raises a concurrency exception.
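When such an exception does reach your code, the retry can be handled explicitly. Here is a rough sketch, assuming the delta-spark Python package (which exposes delta.exceptions) is available on the Synapse Spark pool and the path is a placeholder:

```python
import time
from delta.exceptions import ConcurrentAppendException

def append_with_retry(df, path, max_retries=3):
    """Optimistic-concurrency style retry: re-attempt the append if another
    writer committed a conflicting change between our read and our commit."""
    for attempt in range(max_retries):
        try:
            df.write.format("delta").mode("append").save(path)
            return
        except ConcurrentAppendException:
            # Another transaction won the race; back off briefly and try
            # again against the latest snapshot recorded in _delta_log.
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Could not commit after {max_retries} attempts")

# Hypothetical usage:
# append_with_retry(new_rows_df, "abfss://lake@mystorageaccount.dfs.core.windows.net/silver/sales")
```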

However, you will see these exceptions in the case of conflicting deletes or updates, which prevents data corruption. We will discuss Delta tables and their properties in our next blog. This is just the beginning, with an overview of the Delta Lake solution architecture.
