- AUTHORED BY: Etienne Oosthuysen
- POSTED: June 1, 2020
Disclaimer: as Azure Synapse Analytics is still in Public Preview, some areas may not yet function as they will at full General Availability.
This article contains the Synapse SQL on-demand test drive as well as a cheat sheet that describes how to get up and running step-by-step. I then conclude with some observations, including performance and cost.
But first let’s look at important architecture concepts of the SQL components of Azure Synapse Analytics, the clear benefits of using the new SQL on-demand feature, and who will benefit from it. For those not interested in these background concepts, just skip to the “Steps to get up and running” section later in this article.
Synapse SQL Architecture
Azure Synapse Analytics is a “limitless analytics service that brings together enterprise data warehousing and big data analytics. It gives you the freedom to query data…, using either serverless on-demand compute or provisioned resources—at scale.” https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/
It has two analytics runtimes: Synapse SQL for T-SQL workloads and Synapse Spark for Scala, Python, R and .NET. This article focuses on Synapse SQL, and more specifically the SQL on-demand consumption model.
Synapse SQL leverages Azure Storage, or in this case, Azure Data Lake Gen 2, to store your data. This means that storage and compute charges are incurred separately.
Synapse SQL’s node-based architecture allows applications to connect and issue T-SQL commands to a Control node, which is the single point of entry for Synapse SQL.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-architecture
The Control node of the SQL Pool consumption model (also called provisioned) utilises a massively parallel processing (MPP) engine to optimise queries for parallel processing and then passes operations to Compute nodes to do their work in parallel. SQL Pool allows you to query files in your data lake in a read-only manner, but it also allows you to ingest data into SQL itself and shard it using a Hash, Round Robin or Replicate pattern.
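As a rough illustration of those sharding options (the table and column names below are hypothetical, not from the test drive), creating a distributed table in a dedicated SQL Pool might look something like this:
CREATE TABLE dbo.FactUserActivity
(
    UserId       INT           NOT NULL,
    ActivityDate DATE          NOT NULL,
    Amount       DECIMAL(18,2) NULL
)
WITH
(
    DISTRIBUTION = HASH(UserId),   -- alternatives: ROUND_ROBIN or REPLICATE
    CLUSTERED COLUMNSTORE INDEX
);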
As SQL Pool is a provisioned service, you pay for the resources provisioned and these can be scaled up or down to meet changes in compute demand, or even paused to save costs during periods of no usage.
The Control node of the SQL on-demand consumption model (also called serverless) on the other hand utilises a distributed query processing (DQP) engine to optimise and orchestrate the distribution of queries by splitting them into smaller queries, executed on Compute nodes. SQL on-demand allows for querying files in your data lake in a read-only manner.
SQL on-demand is, as the name suggests, an on-demand service where you pay per query. You are therefore not required to pick a particular size as is the case with SQL Pool, because the system automatically adjusts. The Azure Pricing calculator, https://azure.microsoft.com/en-us/pricing/calculator/, currently shows the cost to query 1 TB of data as A$8.92. I give my observations regarding cost and performance later in this article.
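To put that rate in perspective (a rough back-of-the-envelope figure, not an official quote): a query that scans, say, 50 GB of data would cost roughly (50 / 1,024) x A$8.92, or about A$0.44.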
Now let's focus on SQL on-demand more specifically.
Why SQL on-demand
I can think of several reasons why a business would want to consider Synapse SQL on-demand. Some of these might be:
- It is very useful if you want to discover and explore the data in your data lake which could exist in various formats (Parquet, CSV and JSON), so you can plan how to extract insights from it. This might be the first step towards your logical data warehouse, or towards changes or additions to a previously created logical data warehouse.
- You can build a logical data warehouse by creating a relational abstraction (almost like a virtual data warehouse) on top of raw or disparate data in your data lake without relocating the data.
- You can transform your data to satisfy whichever model you want for your logical data warehouse (for example star schemas, slowly changing dimensions, conformed dimensions, etc.) upon query rather than upon load, which was the regime used in legacy data warehouses. This is done by using simple, scalable and performant T-SQL (for example as views) against the data in your data lake, so it can be consumed by BI and other tools, or even loaded into a relational data store should there be a driver to materialise the data (for example into Synapse SQL Pool, Azure SQL Database, etc.). A sketch of such a view follows this list.
- Cost management, as you pay only for what you use.
- Performance: the architecture auto-scales, so you do not have to worry about infrastructure, managing clusters, etc.
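As a sketch of the transformation point above (the storage account, container and column names are hypothetical and would need to match your own Parquet schema), a dimension in such a logical data warehouse could be nothing more than a view:
CREATE VIEW dbo.vw_DimUser AS
SELECT DISTINCT
    CAST(r.id AS INT)                  AS UserKey,     -- hypothetical columns
    CAST(r.first_name AS VARCHAR(100)) AS FirstName,
    CAST(r.country AS VARCHAR(100))    AS Country
FROM OPENROWSET(
    BULK 'https://<yourdatalake>.dfs.core.windows.net/<container>/',
    FORMAT='PARQUET'
) AS [r];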
Who will benefit from SQL on-demand?
- Data Engineers can explore the lake, then transform the data in ad-hoc queries or build a logical data warehouse with reusable queries.
- Data Scientists can explore the lake to build up context about the contents and structure of the data in the lake and ultimately contribute to the work of the Data Engineer. Features such as OPENROWSET and automatic schema inference are useful in this scenario (see the example below this list).
- Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favourite tools that support connection to SQL on-demand.
- BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview
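By default OPENROWSET infers the schema from the Parquet files, but you can also state it explicitly with a WITH clause, which is handy when you only need a subset of columns. The storage path and column names below are hypothetical:
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<yourdatalake>.dfs.core.windows.net/<container>/',
    FORMAT='PARQUET'
) WITH (
    id         INT,            -- hypothetical columns; must match the Parquet schema
    first_name VARCHAR(100),
    country    VARCHAR(100)
) AS [r];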
Is T-SQL used in Synapse SQL the same as normal T-SQL?
Mostly, yes. Synapse SQL on-demand offers a T-SQL querying surface area which in some areas is more extensive than the T-SQL we are already familiar with, mostly to accommodate the need to query semi-structured and unstructured data. On the other hand, some aspects of familiar T-SQL are not supported due to the design of SQL on-demand.
High-level T-SQL language differences between consumption models of Synapse SQL are described here: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/overview-features.
Let's now look at getting up and running.
Steps to get up and running
I have already provisioned both a data lake and Azure Synapse Analytics. In this section, I will:
- Access my Azure Synapse Analytics workspace.
- Then load five raw Parquet files, each containing approx. 1,000 records, to my data lake.
- Then access the data lake through Synapse and do a simple query over a single file in the data lake.
- Part of this sees me set appropriate RBAC roles on the data lake.
- Then extend the query to include all relevant files.
- Then create the SQL on-demand database and convert the extended query into a reusable view.
- Then publish the changes.
- Then connect to the SQL on-demand database through Power BI and create a simple report.
- Then extend the dataset from 5,000 records to approx. 50,000.
- And test performance over a much larger dataset, i.e. 500,000 records, followed by a new section on performance enhancements and side by side comparisons.
Step 1 – Access my Synapse workspace
Access my workspace via the URL https://web.azuresynapse.net/
I am required to specify my Azure Active Directory tenancy, my Azure Subscription, and finally my Azure Synapse Workspace.
Before users can access the data through the Workspace, their access control must first be set appropriately. This is best done through Security Groups, but in this quick test drive, I used named users.
When I created Azure Synapse Analytics, I specified the data lake I want to use; this is shown under Data > Linked > data lake > containers. I can, of course, link other datasets here too, for example those in other storage accounts or data lakes.
Step 2 – load data to my data lake
I have a data lake container called "rawparquet" where I loaded 5 parquet files containing the same data structure. If I right-click on any of the Parquet files, I can see some useful starter options.
https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-synapse-studio
Step 3 - Initial query test (access the data lake)
I right-clicked and selected "Select TOP 100 rows", which created the following query:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/userdata2.parquet',
FORMAT='PARQUET'
) AS [r];
The first time I ran this query, I got this error:
This was because, by default, SQL on-demand tries to access the file using my Azure Active Directory identity, which did not yet have rights to the file. To resolve this, I granted my identity both the 'Storage Blob Data Contributor' and 'Storage Blob Data Reader' roles on the storage account (i.e. the data lake), as described in:
https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-synapse-studio
and https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/access-control.
Those steps resolved the error.
Step 4 - Extend the SQL
In my theoretical use case, I have a Data Factory pipeline that loads user data from the source into my data lake in Parquet format. I currently have 5 separate Parquet files in my data lake.
The query mentioned previously obviously targeted a specific file explicitly, i.e. "userdata2.parquet". In my scenario, my Parquet files are all delta files, and I want to query the full set. I now simply extend the query by removing the "TOP 100" restriction and pointing the OPENROWSET BULK path at the whole container rather than the specific file. It now looks like this:
SELECT
*
FROM
OPENROWSET(
BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
FORMAT='PARQUET'
) AS [r];
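A useful variation at this point (same container, still read-only) is the filename() function exposed by OPENROWSET, which shows which Parquet file each row came from when querying a whole folder:
SELECT
    r.filename() AS source_file,
    COUNT(*)     AS row_count
FROM
    OPENROWSET(
        BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
        FORMAT='PARQUET'
    ) AS [r]
GROUP BY r.filename();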
Step 5 - Now let's create a database and views dedicated to my SQL on-demand queries
This database serves as my Logical Data Warehouse built over my Data Lake.
I firstly ensure that SQL on-demand and the master database are selected, then run:
CREATE DATABASE SQL_on_demand_demo
I now create a view that will expose all the data in my dedicated container, i.e. "rawparquet", as a single SQL dataset for use by (for example) Power BI.
I firstly ensure that SQL on-demand and the new database SQL_on_demand_demo are selected.
I now run the create view script:
CREATE VIEW dbo.vw_UserData as
SELECT
*
FROM
OPENROWSET(
BULK 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet/',
FORMAT='PARQUET'
) AS [r];
I now test the view by running a simple select statement:
SELECT * FROM dbo.vw_UserData;
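For what it is worth, the kind of aggregate a Direct Query visual would later push down against this view looks something like the following (the gender column is an assumption based on the report described in Step 9):
SELECT gender, COUNT(*) AS user_count   -- gender is assumed to exist in the Parquet schema
FROM dbo.vw_UserData
GROUP BY gender;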
Step 6 – Publish changes
Select Publish to move all the changes to the live environment.
If you now refresh your Data pane, you will see the new database and view appear as an on-demand database. Here you will be able to see both Provisioned (SQL Pool) and on-demand databases:
My data volume at this stage is still very low, only 5,000 records. But we will first hook Power BI on to Synapse, and then throw more data at it to see how it performs.
Step 7 – Query through Power BI
It is possible to build interactive Power BI reports right here in the Synapse workspace, but for now I am going to go old school and create a Direct Query report from the view we created, essentially querying the data in the data lake via the logical data warehouse, SQL_on_demand_demo.
To connect:
- Open a new Power BI Desktop file.
- Select Get Data.
- Select Azure SQL Database.
- Find the server name:
  - Navigate to your Synapse Workspace.
  - Copy the SQL on-demand endpoint from the Overview page.
  - Paste it into the Server field in the Power BI Get Data dialog box.
  - Leave the database name blank.
- Remember to select Direct Query so that query processing is handed over to Synapse and the Power BI footprint is kept to a minimum.
- Select Microsoft Account as the authentication method and sign in with your organisational account.
- Now select the view vw_UserData.
- Transform, then load, or simply load the data.
- Create a simple report, which now runs in Direct Query mode:
Step 8 - Add more files to the data lake and see if it simply flows into the final report
I made arbitrary copies of the original Parquet files in the "rawparquet" container, increasing the number of files from 5 to 55; as they are copies, they obviously all have the same structure.
I simply refreshed the Power BI Report and the results were instantaneous.
Step 9 - Performance over a much larger dataset
For this, I am going to publish the report to Power BI Service to eliminate any potential issues with connectivity or my local machine.
The published dataset must authenticate using OAuth2.
Once the report is published, I select the 'Female' pie slice and the full report renders in approx. 4 seconds. This means the query generated by Power BI is sent to Azure Synapse Analytics, which uses SQL on-demand to query the multiple Parquet files in the data lake and return the results to Power BI to render.
I now again arbitrarily increase the number of files from 55 to 500. Refreshing this new dataset, now containing 498,901 records, took 17 seconds.
Selecting the same 'Female' pie slice initially rendered the full report in approx. 35 seconds, and then in approx. 1 second after that. The same pattern is observed for the other slices.
I am now going to try and improve this performance.
Performance enhancements and side by side comparison
The performance noted above is okay considering the record volumes and the separation of stored data in my data lake from the compute services, but I want the performance to be substantially better, and I want to compare the performance with a competitor product. (Note that the competitor product is not named, as the purpose of this article is a test drive of Azure Synapse Analytics SQL on-demand, not a full-scale competitor analysis.)
To improve performance I followed two best practice guidelines: (a) I decreased the number of Parquet files the system has to contend with (and correspondingly increased the record volumes within each file), and (b) I collocated the data lake and Azure Synapse Analytics in the same region.
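As an aside, the compaction in (a) can be achieved in several ways; one possible approach (not necessarily the one used here) is to use serverless SQL itself via CETAS to rewrite the many small files into a new folder as fewer, larger ones. The data source, file format and folder names below are placeholders, and the statements should be run in the SQL_on_demand_demo database rather than master:
CREATE EXTERNAL DATA SOURCE rawparquet_src
WITH (LOCATION = 'https://xxxxxxdatalakegen2.dfs.core.windows.net/rawparquet');

CREATE EXTERNAL FILE FORMAT parquet_ff
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.UserData_Compacted
WITH (
    LOCATION = 'compacted/userdata/',   -- output folder (must be new or empty)
    DATA_SOURCE = rawparquet_src,
    FILE_FORMAT = parquet_ff
)
AS
SELECT * FROM dbo.vw_UserData;          -- reads the many small source files once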
Tests 1 and 2 show the impact of the performance enhancements, whereas tests 3 and 4 represent my observations when Synapse (test 3) and the competitor product (test 4) are compared side by side.
Test summaries
- Test 1: Azure Synapse, Azure Data Lake, Power BI; large number of files; data lake and Synapse not collocated
- Test 2: Azure Synapse, Azure Data Lake, Power BI; decreased number of files; not collocated
- Test 3: Azure Synapse, Azure Data Lake, Power BI; decreased number of files; collocated
- Test 4: Competitor product, Azure Data Lake, Power BI; decreased number of files; collocated

| | Test 1 | Test 2 | Test 3 | Test 4 |
| --- | --- | --- | --- | --- |
| Record volumes | 500,000 | 500,000 | 500,000 | 500,000 |
| Number of Parquet files | 500 | 20 | 20 | 20 |
| Azure Data Lake Gen 2 region | Australia Southeast | Australia Southeast | Australia East | Australia East |
| Analytics service region | Australia East | Australia East | Australia East | Australia East |
| Initial refresh | 17 seconds | 9 seconds | 3 seconds | 4 seconds |
| Refresh on initial visual interaction | 35 seconds | 4 seconds | 2.5 seconds | 3 seconds |
| Refresh on subsequent visual interaction | 1 second | less than 1 second | less than 1 second | less than 1 second |
Performance Conclusion
The results in the table above show that Azure Synapse performed best in the side-by-side competitor comparison - see tests 3 and 4.
We describe this as a side-by-side test because both Synapse and the competitor analytics service are located in the same Azure region as the data lake, and the same Parquet files are used for both.
Cost observation
With the SQL on-demand consumption model, you pay only for the queries you run, and Microsoft describes the service as auto-scaling to meet your requirements. Running numerous queries across the steps described, over the course of three days, incurred only very nominal query charges according to the cost analysis for the resource group hosting both the data lake and Azure Synapse Analytics.
I did initially observe higher-than-expected storage costs, but these, it turned out, related to a provisioned SQL Pool that had no relation to this SQL on-demand use case. Once that unrelated data was deleted, we were left with only the very nominal storage charge for the large record volumes in the Parquet files in the data lake.
All in all, a very cost-effective solution!
Conclusion
- Getting up and running with Synapse SQL on-demand once data is loaded to the data lake was a very simple task.
- I ran a number of queries over a large dataset over the course of five days. The observed cost was negligible compared to what would be expected with a provisioned consumption model provided by SQL Pools.
- The ability to use T-SQL to query data lake files, and the ability to create a logical data warehouse provides for a very compelling operating model.
- Access via Power BI was simple.
- Performance was really good after the adjustments described in the "Performance enhancements and side by side comparison" section.
- A logical data warehouse holds huge advantages compared to materialised data, as it opens the door to reporting over data streams and real-time data from LOB systems, increases design responsiveness, and more.
Exposé will continue to test drive other aspects of Azure Synapse Analytics such as the Spark Pool runtime for Data Scientists and future integration with the Data Catalog replacement.