A Deep Dive into FileDataLakeClient in Azure Synapse Analytics

FileDataLakeClient refers to the azure-storage-file-datalake Python package, which is used to communicate with Azure hierarchical-namespace-enabled storage (ADLS Gen2). This API was added to the Azure Storage SDK specifically for ADLS Gen2.

This package can be used for many operations, such as creating, deleting, and renaming file systems, directories, and files, as well as setting ACLs, acquiring leases, and more.

In ADLS Gen2, a container is called a file system. Note that this package is not part of Python's standard library, so you have to install it manually. When using Synapse, you can make it available either by adding it as a workspace package or by installing it with pip.
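For example, you can install it straight from a notebook cell; a minimal sketch, assuming the %pip magic is enabled in your Synapse notebook session:

%pip install azure-storage-file-datalake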

I will leave a link on how to add an external package to Synapse as a workspace package.

Once you have added the package to the workspace, you can import it in a notebook.

A client represents a connection to an entity; here, the entity can be a file system, a directory, or a file itself.

Once you create a client, you work with it to perform actions on that entity.

The package lets you create a file system, directory, or file client individually via the FileSystemClient, DataLakeDirectoryClient, or DataLakeFileClient classes, which is convenient if you only have to work with one specific file system, directory, or file. But if you have to work with multiple file systems, directories, or files, it becomes cumbersome to create each client and hop from one to another.

In this situation you can create a service client, and from the service client you can create multiple clients separately. The service client is a connection at the account level, so any entity inside the account (a file system, directory, or file) can be reached without making a separate connection to that entity.

For example, if I have to work with two different file systems in the same code, I have to create the connection twice, which is not good practice.

from azure.storage.filedatalake import FileSystemClient

#Connection string of the storage account
conn = '<Enter the connection string of the storage account>'

#Names of the file systems you want to work with
container1 = 'Enter the first container name here'
container2 = 'Enter the second container name here'

#Client for the first file system
file_system1 = FileSystemClient.from_connection_string(conn_str=conn, file_system_name=container1)

#Client for the second file system
file_system2 = FileSystemClient.from_connection_string(conn_str=conn, file_system_name=container2)

In the same way, you would have to create multiple directory or file clients if you had to work with multiple directories or files.

However, to avoid this you can create one service client and, from it, create multiple file system, directory, or file clients.

from azure.storage.filedatalake import DataLakeServiceClient

#Connection string of the storage account
conn = '<Enter the connection string of the storage account>'

#A single service client at the account level
service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)

#File system clients derived from the service client
fsClient1 = service_client.get_file_system_client('dummy1')
fsClient2 = service_client.get_file_system_client('dummy2')

Here you can see that we created a single service client and can now derive multiple clients from it.
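You can drill down further in the same way; a minimal sketch, assuming 'dummy1' contains a hypothetical folder and file:

# Directory and file clients derived without any new account connection
dirClient = fsClient1.get_directory_client('somefolder')
fileClient = dirClient.get_file_client('somefile.txt')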

How to Create and Delete a File System

Here is a simple code example for creating a file system. First, we create a service client by providing the account connection string. Then, using the service client, we utilize the create_file_system method to create the container and the delete_file_system method to remove an existing container.

from azure.storage.filedatalake import DataLakeServiceClient

def create_system(file_system_name):
    service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
    try:
        service_client.create_file_system(file_system_name)
        print('The File System is created successfully')
    except Exception as e:
        print(str(e))

def delete_system(file_system):
    service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
    try:
        service_client.delete_file_system(file_system)
        print('The file system is deleted successfully')
    except Exception as e:
        print(str(e))
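A quick usage sketch (assuming conn holds your account connection string and 'demo-fs' is a hypothetical file system name):

conn = '<Enter the connection string of the storage account>'
create_system('demo-fs')   # creates the container
delete_system('demo-fs')   # removes it again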

Create or Delete a Directory and Subdirectory

If you have an existing file system, you can easily create directories or subdirectories using the package. You can create a directory client whether the directory exists or not. If it doesn’t exist, you can call the create_directory() method to create it. If it already exists, you can perform various actions on this directory, such as creating files, subdirectories, and more.

Directories can be renamed even if they contain subdirectories. If you want to rename a subdirectory, you can create a sub-directory client and perform the desired actions.

Create Directory:

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')

#Creating directory client and then creating the directory
dir_client = fs_client.get_directory_client('filedatalaketestfolder')
dir_client.create_directory()
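If you only want to create the directory when it is missing, recent 12.x releases of the SDK expose an exists() helper on the client; a minimal sketch, assuming such a release:

# Create the directory only if it is not already there
if not dir_client.exists():
    dir_client.create_directory()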

Create Sub-Directory:

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')

#Creating the subdirectory inside the existing directory
dir_client = fs_client.get_directory_client('filedatalaketestfolder')
dir_client.create_sub_directory('nestedfolder')
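As mentioned earlier, actions on the subdirectory itself go through a sub-directory client; a minimal sketch that renames the 'nestedfolder' subdirectory created above ('renamedfolder' is a hypothetical new name):

# The sub-directory client is itself a directory client
sub_client = dir_client.get_sub_directory_client('nestedfolder')
sub_client.rename_directory(sub_client.file_system_name + '/' + 'filedatalaketestfolder' + '/' + 'renamedfolder')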

Rename Directory:

Note that rename_directory expects the new name in the form '<file system>/<new path>', which is why the file system name is prefixed below.

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')

#Renaming the Directory
dir_client = fs_client.get_directory_client('testingdir')
dir_client.rename_directory(dir_client.file_system_name + '/' + 'testname')

Create a File and Upload Data into It

Actions on a file are performed through a file client, which can be created from the file system, directory, or sub-directory client. To create a file client, you simply use the get_file_client method. The file client can be created whether or not the file is already present in the directory.

Since I already have a directory named filedatalaketestfolder, I will use the directory client to create a file inside it.

Create File:

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')
dir_client = fs_client.get_directory_client('filedatalaketestfolder')

file_client = dir_client.get_file_client('testfile.log')
file_client.create_file()

Delete File:

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')
dir_client = fs_client.get_directory_client('filedatalaketestfolder')

file_client = dir_client.get_file_client('testfile.log')
file_client.delete_file()

Rename File:

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')
dir_client = fs_client.get_directory_client('filedatalaketestfolder')

#File Client
file_client = dir_client.get_file_client('testfile.log')
file_client.rename_file(file_client.file_system_name + '/' + 'filedatalaketestfolder' + '/' + 'renamed.txt')

Upload Data to the file after file creation:

The upload_data method overwrites the file and cannot be used to incrementally append data. To append data, use append_data with the appropriate offset values.

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient.from_connection_string(conn_str=conn)
fs_client = service_client.get_file_system_client('test')
dir_client = fs_client.get_directory_client('filedatalaketestfolder')

data = 'This is the test data that will be uploaded to the file'

#File Client
file_client = dir_client.get_file_client('testfile.log')
file_client.create_file()
file_client.upload_data(data=data, overwrite=True)

The upload_data method does not need a separate commit in order to write to the file. However, when appending data to the file, it is mandatory to flush the data; in other words, calling flush commits the changes. If you do not call flush, the data won't be written to the file.
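To make the commit step explicit, here is a minimal sketch, assuming file_client points at an existing file and that flush_data takes the total file size after the write:

more = 'a few extra bytes'
size = file_client.get_file_properties().size
file_client.append_data(data=more, offset=size, length=len(more))
file_client.flush_data(size + len(more))   # commits the appended data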

Note: As of this writing, there is a bug when using the upload_data method: if you pass the overwrite=False parameter, the API fails to write the data.

Error: ResourceModifiedError: (ConditionNotMet) The condition specified using HTTP conditional header(s) is not met.

Append Data to the File

Once the file has been created in the target directory, you can append data to it incrementally: set the offset to the current size of the file and write from there.

Sample code:

from azure.storage.filedatalake import DataLakeServiceClient

class Storagehandler:

    def __init__(self, storage_conn_string):
        self.conn = storage_conn_string

    def data_append(self, container, directory, filename, data):
        service_client = DataLakeServiceClient.from_connection_string(conn_str=self.conn)
        fs_client = service_client.get_file_system_client(container)
        dir_client = fs_client.get_directory_client(directory)

        #File Client
        file_client = dir_client.get_file_client(filename)

        # Append at the current end of the file and commit in one call
        s = file_client.get_file_properties().size
        file_client.append_data(data=data, offset=s, length=len(data), flush=True)

This is just sample code to show how append_data works; it can be adapted to your requirements.
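A usage sketch, assuming the 'test' file system and 'filedatalaketestfolder' directory from the earlier examples and a conn variable holding the connection string:

handler = Storagehandler(conn)
handler.data_append('test', 'filedatalaketestfolder', 'testfile.log', 'another line of data')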

Download File:

from azure.storage.filedatalake import DataLakeServiceClient

class Storagehandler:

    def __init__(self, storage_conn_string):
        self.conn = storage_conn_string

    def file_download(self, container, directory, filename):
        service_client = DataLakeServiceClient.from_connection_string(conn_str=self.conn)
        fs_client = service_client.get_file_system_client(container)
        dir_client = fs_client.get_directory_client(directory)
        file_client = dir_client.get_file_client(filename)

        downloaded_file = file_client.download_file()
        return downloaded_file.readall()

Here, you can either return the file content using the readall() method or use the readinto() method to write the downloaded content to another file.
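For the readinto() variant, a minimal sketch (local_copy.log is a hypothetical local path):

# Stream the downloaded bytes straight into a local file
with open('local_copy.log', 'wb') as local_file:
    file_client.download_file().readinto(local_file)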

There are many more actions you can perform with the azure-storage-file-datalake package; I have tried to cover the most commonly used operations.

Here are some links that will help you delve deeper into it.

https://pypi.org/project/azure-storage-file-datalake/

https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-file-datalake/12.0.0/azure.storage.filedatalake.html
