Spark Tables in Delta Lake

We saw how to create a database in the Synapse Spark pool in our previous blog. Today we are going to understand the types of Delta tables in Spark.

There are mainly two types of Spark tables:

  1. Managed (Spark internal table)
  2. Unmanaged (Spark external table)

There is a slight difference in how these tables are created. In a nutshell, a managed table does not need any location to be specified while creating the table, whereas an unmanaged Spark table does need a location specified during table creation.

Let’s check the difference between them.

|  | Spark Managed Table | Spark Unmanaged Table |
| --- | --- | --- |
| Table creation | No need to specify a location | A location must be specified at table creation |
| Management | Spark manages both the table metadata and the data in the file store. | Spark manages only the table metadata; the user manages the data files in the file store. |
| Drop operation | Drops the table metadata as well as the data files of the table. | Drops the table metadata but does not delete the underlying data files. |
| Truncate operation | Supported | Not supported |
| Naming convention | Also known as an internal table | Also known as an external table |
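
To see the drop behavior from the table above in action, here is a purely illustrative sketch. It assumes a managed table landing.customer and a hypothetical unmanaged table landing.customer_ext (created later in this post) already exist, and that you are in a Synapse notebook where the mssparkutils helper is available.

%%pyspark
# Illustration only: assumes a managed table landing.customer and a
# hypothetical unmanaged table landing.customer_ext at ext_path already exist
ext_path = 'abfss://delta@stgdeltademo.dfs.core.windows.net/Delta_tables/'

# Dropping a managed table removes the metadata AND the data files
spark.sql('DROP TABLE landing.customer')

# Dropping an unmanaged table removes only the metadata;
# the files at ext_path are left in place
spark.sql('DROP TABLE landing.customer_ext')

# mssparkutils is the Synapse notebook utility; the data files still list
display(mssparkutils.fs.ls(ext_path))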

URL format to point to an ADLS Gen2 file location: abfss://<container name>@<storage account name>.dfs.core.windows.net/<folder path>/<file name>
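
For example, a small sketch that assembles this URL from its parts (the sample values match the examples used later in this post):

%%pyspark
# Assemble an abfss URL from its parts; the sample values below match
# the examples used later in this post
container = 'delta'
storage_account = 'stgdeltademo'
folder_path = 'TEST/customer.parquet'

adls_url = f'abfss://{container}@{storage_account}.dfs.core.windows.net/{folder_path}'
print(adls_url)
# abfss://delta@stgdeltademo.dfs.core.windows.net/TEST/customer.parquet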

Creating a Spark managed table

I am using the PySpark syntax to create the Delta table. You can use SQL syntax as well to create Delta/Parquet tables.

%%pyspark
# Read the Parquet file in ADLS Gen2 into a DataFrame
df = spark.read.load(
    'abfss://delta@stgdeltademo.dfs.core.windows.net/TEST/customer.parquet',
    format='parquet'
)

# Write the DataFrame as a managed Delta table
df.write.format('delta') \
    .partitionBy('year') \
    .mode('overwrite') \
    .saveAsTable('landing.customer')

If you notice in the above script, we have not specified any path option while writing the table (.saveAsTable('landing.customer')), so Spark manages the data and the table gets registered to the default warehouse location (warehouse/landing.db).
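
Once written, the managed table can be read back by name alone; a quick sketch:

%%pyspark
# Read the managed table back by name; no path is needed
customers = spark.read.table('landing.customer')
customers.show(5)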

You can verify the type of table and other information about it by using the query below.

%%sql
DESCRIBE EXTENDED landing.customer
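
If you prefer to check this from PySpark, here is a small sketch that filters the same metadata down to just the Type and Location rows:

%%pyspark
# DESCRIBE EXTENDED returns rows of (col_name, data_type, comment);
# filter down to just the Type and Location entries
info = spark.sql('DESCRIBE EXTENDED landing.customer')
info.filter(info.col_name.isin('Type', 'Location')).show(truncate=False)
# Type shows MANAGED for this table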

Creating a Spark unmanaged table

Again, I am using the PySpark syntax to create the Delta table. You can use SQL syntax as well to create Delta/Parquet tables by specifying the LOCATION clause.

%%pyspark
# Read the Parquet file in ADLS Gen2 into a DataFrame
df = spark.read.load(
    'abfss://delta@stgdeltademo.dfs.core.windows.net/TEST/customer.parquet',
    format='parquet'
)

# Write the DataFrame as an unmanaged Delta table by specifying
# an explicit path option
df.write.format('delta') \
    .option('path', 'abfss://delta@stgdeltademo.dfs.core.windows.net/Delta_tables/') \
    .partitionBy('year') \
    .mode('overwrite') \
    .saveAsTable('landing.customer')

If you notice in the above script, this time we have specified a path option while writing the table (.saveAsTable('landing.customer')). The table data gets registered to the path you specified while creating/writing the table (delta/Delta_tables/).
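
As mentioned above, the same unmanaged table can also be created with SQL syntax by specifying the LOCATION clause. A minimal sketch via spark.sql, where the table name customer_ext is a hypothetical example and the path is the one used above:

%%pyspark
# Hypothetical table name customer_ext; the LOCATION clause (the path used
# above) is what makes the table unmanaged/external
spark.sql("""
    CREATE TABLE IF NOT EXISTS landing.customer_ext
    USING DELTA
    LOCATION 'abfss://delta@stgdeltademo.dfs.core.windows.net/Delta_tables/'
""")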

You can verify the type of table and other information about it by using the query below.

%%sql
DESCRIBE EXTENDED landing.customer
