Today we are going to see how to read an Excel file stored in Azure Storage from a Synapse notebook using the Python openpyxl library.
Openpyxl is a library designed specifically to work with Excel files, and it provides a rich set of methods for working with Excel data. However, openpyxl does not understand distributed file systems.
So, if you are thinking of passing the file name along with its path as an HTTPS URL (plain or with a SAS token) or as a DFS file path, openpyxl is not going to recognize the file location. You might end up with an error stating “There is no file or directory found.”
The URLs that won’t work:
- https://stgdeltademo.blob.core.windows.net/<container>/<filename>/<SAS token>
- https://stgdeltademo.dfs.core.windows.net/<container>/<filename>/<SAS token>
- abfss://test@stgdeltademo.dfs.core.windows.net/<filename>
If you provide the file path along with the SAS token, it will give an error like “.xlsx?SV-vaieK2noisankdsDFfhsaik26nfksa_1aho3 is not supported by openpyxl”.
That is the reason we need to work with the Blob Storage API, which helps us navigate to the file and read it without any issues.
Libraries you would need to install:
- azure-storage-blob
- openpyxl
The Azure Storage Blob API allows us to work with blobs and containers, and openpyxl will be used to read the file.
Let’s first install these libraries with pip (pip is the package manager for Python packages; we can use it to install packages that do not ship with Python).
```
pip install azure-storage-blob
pip install openpyxl
```
If you face any issues during installation, check the pip version and, if needed, upgrade it to the newer version and try again.
First, we will import the BlobServiceClient from the Blob API.
Then we need the storage account’s connection string, the storage account name, the container name where your file lives, and finally the file name. Once you have assigned these values to variables, we can pass them to the client and make the connection.
```python
from azure.storage.blob import BlobServiceClient
import openpyxl as op
from io import BytesIO

conn_string = "DefaultEndpointsProtocol=https;AccountName=stgdeltademo;AccountKey=DK+kahfKAOhahkaHLG35n82nkKAabg/MabHAloqAKH23kja/GFbakenfaKBGIkbagi34==;EndpointSuffix=core.windows.net"
account = "stgdeltademo"
container = "test"
filename = "test.xlsx"

# Connect to the storage account and point a client at the blob (file)
blob_service_client = BlobServiceClient.from_connection_string(conn_string)
blob_client = blob_service_client.get_blob_client(container=container, blob=filename)

# Download the blob into memory as raw bytes
file = blob_client.download_blob().readall()

# Load the workbook from the in-memory bytes and grab the active sheet
wb = op.load_workbook(BytesIO(file))
ws = wb.active
print(ws)
```
Here we made the connection with the storage account and created a blob client. The blob client points to the file you want, so you can perform actions on it such as downloading it, getting its URL, and others.
Openpyxl does not accept a blob client as input, so we need to download the file into a variable first. We have downloaded the file into the variable `file`.
Here the `file` variable is an in-memory buffer holding the file contents in binary format.
Since it is a binary string, we need BytesIO to read it. The io module is built into Python, so you don’t need to install it explicitly. Now we can load the Excel file with openpyxl’s load_workbook method, get the active sheet in the workbook, and print it to the screen.
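To see the BytesIO round trip in isolation (no Azure account needed), here is a minimal sketch that builds a small workbook in memory, serializes it to raw bytes — standing in for what `download_blob().readall()` returns — and loads it back with load_workbook. The sheet contents here are purely illustrative.

```python
from io import BytesIO
import openpyxl as op

# Build a small workbook in memory (a stand-in for the real test.xlsx)
wb_out = op.Workbook()
ws_out = wb_out.active
ws_out.append(["id", "name"])
ws_out.append([1, "alpha"])

# Serialize it to raw bytes, just like download_blob().readall() would return
buffer = BytesIO()
wb_out.save(buffer)
file_bytes = buffer.getvalue()

# Same pattern as above: wrap the bytes in BytesIO and load the workbook
wb = op.load_workbook(BytesIO(file_bytes))
ws = wb.active
rows = [tuple(row) for row in ws.iter_rows(values_only=True)]
print(rows)  # [('id', 'name'), (1, 'alpha')]
```

The same `load_workbook(BytesIO(...))` call works whether the bytes came from local disk, a blob download, or any other source.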
This code will run fine locally, but if you are running it in a Synapse notebook and your Synapse workspace is the managed flavour, you might see the error “ServiceRequestError: <urllib3.connection.HTTPSConnection object at 0x7f0d17f430d0>: Failed to establish a new connection: [Errno -2] Name or service not known”.
This is because you might not have a private endpoint to communicate with the storage account. There is a catch here: you cannot use the ADLS Gen2 (dfs) private endpoint, because we are calling the Blob API.
So, you would need to create a private endpoint for Blob storage; once the endpoint is approved, the notebook can communicate with the storage account without any issues. Now you can perform your actions as per your requirement.

Note: If your file sits in nested directories (as in ADLS Gen2 with the hierarchical namespace enabled), you should use the azure-storage-file-datalake module instead, as it is built specifically for storage accounts with HNS enabled.
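For that HNS case, a minimal sketch of the equivalent read via azure-storage-file-datalake might look like the following. The function name, file-system name, and nested path are placeholders, and the azure import is kept inside the function so the sketch can be defined even where the package is not installed.

```python
from io import BytesIO


def read_excel_from_adls(conn_string, file_system, path):
    """Download an .xlsx from an HNS-enabled account and return the workbook.

    `path` may include nested directories, e.g. "raw/2023/test.xlsx".
    """
    # Imported here so the sketch loads without the package being present
    from azure.storage.filedatalake import DataLakeServiceClient
    import openpyxl as op

    service_client = DataLakeServiceClient.from_connection_string(conn_string)
    file_client = service_client.get_file_system_client(file_system).get_file_client(path)

    # download_file() returns a downloader; readall() yields the raw bytes
    data = file_client.download_file().readall()
    return op.load_workbook(BytesIO(data))


# Usage (placeholder values):
# wb = read_excel_from_adls(conn_string, "test", "nested/dir/test.xlsx")
# print(wb.active)
```

The overall pattern is the same as the Blob API version: download the bytes, wrap them in BytesIO, and hand them to openpyxl.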