PolyBase in Synapse Analytics?

Today we are going to talk about PolyBase, which is quite an amazing feature for data virtualization. If you have data stored in some external source and you want to read that chunk of data and propagate it to your warehouse after transformation, or you just want to load it with an INSERT query into a staging table after some simple joins with other tables to form a meaningful dataset, then PolyBase is a good option.

PolyBase was originally intended to query Hadoop using T-SQL. Later on, many new data sources were added that can be queried with PolyBase, including storage locations like Azure Data Lake Storage Gen2 and S3, along with databases like Oracle, Teradata, etc. Synapse supports PolyBase only for Azure Data Lake Storage (ADLS).

In Synapse you don’t have to worry about installing or configuring PolyBase through any setup wizard, as you have to do in SQL Server 2016 or other versions where you explicitly check the PolyBase option during installation.

Uses of PolyBase in real-time scenarios in Synapse:
1.) Ingest the data sitting in Azure Data Lake Storage (ADLS) as an external table in the database attached to Synapse Analytics.
2.) Data exploration using ad-hoc queries directly on the external table (see the sketch after this list).
3.) Data virtualization after joining with other tables to create a business-meaningful dataset.
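For point 2, a minimal ad-hoc exploration query could look like the sketch below; the schema, table, and column names ([stg].[SalesExternal], SaleDate, Amount) are placeholders assumed for illustration, matching the external table created in the steps further down.

SELECT TOP 100 SaleId, SaleDate, Amount   -- placeholder columns
FROM [stg].[SalesExternal]                -- placeholder external table
WHERE SaleDate >= '2023-01-01'
ORDER BY Amount DESC;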

Note: Here is the support chart for PolyBase from the MS doc.

External data sources | SQL Server 2016-2019 with PolyBase | SQL Server 2022 (16.x) with PolyBase | APS PDW | Azure Synapse Analytics
Oracle, MongoDB, Teradata | Read | Read | No | No
Generic ODBC | Read (Windows only) | Read (Windows only) | No | No
Azure Storage | Read/Write | Read/Write | Read/Write | Read/Write
Hadoop | Read/Write | No | Read/Write | No
SQL Server | Read | Read | No | No
S3-compatible object storage | No | Read/Write | No | No

How to communicate with the data stored in ADLS.

  1. Create a MASTER KEY for the database.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Something>';

  2. Create a Database Scoped Credential.

CREATE DATABASE SCOPED CREDENTIAL [dummy]
WITH IDENTITY = 'Managed Service Identity';

It is better to connect with the MSI (Managed Service Identity) instead of maintaining a number of users with their passwords or service IDs.

You can create the credential with a user, or create it by using the Managed Identity. In the case of MSI, the workspace’s credential will be used while connecting to the external resources.

  3. Create the EXTERNAL DATA SOURCE to which you want to connect.

CREATE EXTERNAL DATA SOURCE [dummy_data_source] WITH (LOCATION = N'abfss://container@storage_name.dfs.core.windows.net', CREDENTIAL = [dummy]);

  4. Create an EXTERNAL FILE FORMAT for the type of data you are going to access, like CSV, Parquet, etc.

You can create a customized file format, or you can use the standard syntax available in the MS doc (a sketch is shown below). Synapse also lets you simply right-click the file on which you want to create an external table >> CREATE EXTERNAL TABLE, and it will write the syntax for you; that data source and file format will then be used for the external table creation.
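As a rough sketch (the format names below are placeholders, not from the MS doc), a Parquet and a CSV file format can look like this:

CREATE EXTERNAL FILE FORMAT [dummy_parquet_ff]
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL FILE FORMAT [dummy_csv_ff]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2   -- skip the header row (supported in dedicated SQL pool)
    )
);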

  5. Create the external table.
    This is the final step: creating the external table in your attached dedicated SQL pool or serverless SQL pool (a sketch follows below).
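Here is a minimal sketch of this final step; the schema, table, column names, and the location path are illustrative placeholders, while the data source and file format refer to the objects created in steps 3 and 4 above.

CREATE EXTERNAL TABLE [stg].[SalesExternal]
(
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION    = '/sales/2023/',        -- folder or file path inside the container
    DATA_SOURCE = [dummy_data_source],   -- created in step 3
    FILE_FORMAT = [dummy_parquet_ff]     -- created in step 4
);

Once created, the external table can be queried like any other table, or used in a CTAS / INSERT…SELECT statement to load the data into a staging table.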

Note:
External tables are just the structure of the table; you cannot perform any DML operation on them.
If you want to add, drop, or modify any column in the external table, you need to DROP and re-create the external table.

Syntax to drop an external table:
DROP EXTERNAL TABLE SCHEMA.TABLE_NAME;

Difference between PolyBase and Elastic Query.

To know more in detail about what Elastic Query is, you can follow along with the MS doc.

But if you want to know the main difference between them, then in a nutshell: Elastic Query allows you to query another database, whereas PolyBase gives you the capability to query structured/semi-structured files sitting in Azure Data Lake Storage (in Synapse).
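For a rough comparison, an Elastic Query external data source points at another database rather than at files. The sketch below is an Azure SQL Database Elastic Query example (not a Synapse one), and the server, database, and credential names are placeholders.

CREATE EXTERNAL DATA SOURCE [RemoteAzureDb]
WITH (
    TYPE = RDBMS,                                    -- Elastic Query: remote database, not files
    LOCATION = 'remoteserver.database.windows.net',  -- placeholder logical server
    DATABASE_NAME = 'RemoteDatabase',                -- placeholder remote database
    CREDENTIAL = [RemoteDbCredential]                -- placeholder SQL credential
);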

Real-time errors while using PolyBase to load/visualize the data.

These are some of the errors which may occur when using an external table in a dedicated SQL pool.

  • Error handling external file: ‘IOError: Peek file failed. HRESULT = 0x8007045D’

This “IO Peek File Failed” error occurs due to insufficient compute to process the file, but if you are using serverless then compute should never be an issue for this error. If you have less compute in your dedicated SQL pool, you should consider scaling it up (a sketch follows below).
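As a sketch of scaling up a dedicated SQL pool with T-SQL (run against the master database; the pool name and service objective below are placeholders, and you can also scale from the Azure portal):

ALTER DATABASE [MyDedicatedPool]
MODIFY (SERVICE_OBJECTIVE = 'DW200c');   -- placeholder target performance level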

  • HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: HadoopExecutionException:

This error is raised when you do not have a proper file format. Let’s say you are saving the data in a Parquet file; then by default it creates the file with the (.parquet) extension and the Parquet properties applied to it. If you are manually typing the name of the file by concatenating the file name with the extension, e.g. @concat(‘filename’, ‘.parquet’), in your sink location, then while querying, this error can surface due to the file not being a proper Parquet file.

  • Error handling external file: ‘Invalid: Parquet file size is 0 bytes’.

You can face this issue when the file does not have any data or the file is corrupted.

  • Not able to validate the external location because the remote server returned an error: (500) Internal Server Error.

This issue occurs when you query the external table but it is unable to make a connection to the storage; this is a temporary error and can be remediated by rerunning the query.

  • Not able to validate the external location because the remote server returned an error: (404) Not Found.

You can face this issue when you query the external table but the file which is being referenced by the external table is not present in the location.
