Saturday, August 1, 2020

Monitor your Azure Synapse Analytics SQL pool workload using DMVs

This article describes how to use Dynamic Management Views (DMVs) to monitor your workload including investigating query execution in SQL pool.

Permissions

To query the DMVs in this article, you need either VIEW DATABASE STATE or CONTROL permission. Usually, granting VIEW DATABASE STATE is the preferred permission as it is much more restrictive.

SQL
GRANT VIEW DATABASE STATE TO myuser;

Monitor connections

All logins to your data warehouse are logged to sys.dm_pdw_exec_sessions. This DMV contains the last 10,000 logins. The session_id is the primary key and is assigned sequentially for each new logon.

SQL
-- Other Active Connections
SELECT * FROM sys.dm_pdw_exec_sessions WHERE status <> 'Closed' AND session_id <> session_id();

Monitor query execution

All queries executed on SQL pool are logged to sys.dm_pdw_exec_requests. This DMV contains the last 10,000 queries executed. The request_id uniquely identifies each query and is the primary key for this DMV. The request_id is assigned sequentially for each new query and is prefixed with QID, which stands for query ID. Querying this DMV for a given session_id shows all queries for a given logon.
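
For example, the following query (a minimal sketch; replace the hypothetical 'SID####' with a session_id from sys.dm_pdw_exec_sessions) lists every query issued by a particular logon:

SQL
-- List all queries for a specific logon.
SELECT request_id, command, status, submit_time, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE session_id = 'SID####'
ORDER BY submit_time;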

 Note

Stored procedures use multiple Request IDs. Request IDs are assigned in sequential order.

Here are steps to follow to investigate query execution plans and times for a particular query.

STEP 1: Identify the query you wish to investigate

SQL
-- Monitor active queries
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE status not in ('Completed','Failed','Cancelled')
  AND session_id <> session_id()
ORDER BY submit_time DESC;

-- Find the top 10 longest running queries
SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

From the preceding query results, note the Request ID of the query that you would like to investigate.

Queries in the Suspended state can be queued due to a large number of active running queries. These queries also appear in sys.dm_pdw_waits with a wait type of UserConcurrencyResourceType. For information on concurrency limits, see Memory and concurrency limits or Resource classes for workload management. Queries can also wait for other reasons, such as object locks. If your query is waiting for a resource, see the Monitor waiting queries section further down in this article.
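
For example, a query along these lines (a sketch using the DMVs named above) lists the suspended queries that are queued on a concurrency slot:

SQL
-- Sketch: queries queued on concurrency resources
SELECT w.request_id, w.session_id, w.state, r.command
FROM sys.dm_pdw_waits w
JOIN sys.dm_pdw_exec_requests r ON w.request_id = r.request_id
WHERE w.type = 'UserConcurrencyResourceType';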

To simplify the lookup of a query in sys.dm_pdw_exec_requests, use LABEL to assign a comment to your query, which can then be looked up in the sys.dm_pdw_exec_requests view.

SQL
-- Query with Label
SELECT *
FROM sys.tables
OPTION (LABEL = 'My Query')
;

-- Find a query with the Label 'My Query'
-- Use brackets when querying the label column, as it is a keyword
SELECT  *
FROM    sys.dm_pdw_exec_requests
WHERE   [label] = 'My Query';

STEP 2: Investigate the query plan

Use the Request ID to retrieve the query's distributed SQL (DSQL) plan from sys.dm_pdw_request_steps.

SQL
-- Find the distributed query plan steps for a specific query.
-- Replace request_id with value from Step 1.

SELECT * FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID####'
ORDER BY step_index;

When a DSQL plan is taking longer than expected, the cause can be a complex plan with many DSQL steps or just one step taking a long time. If the plan is many steps with several move operations, consider optimizing your table distributions to reduce data movement. The Table distribution article explains why data must be moved to solve a query. The article also explains some distribution strategies to minimize data movement.

To investigate further details about a single step, check the operation_type column of the long-running query step and note its Step Index:

  • Proceed with Step 3 for SQL operations: OnOperation, RemoteOperation, ReturnOperation.
  • Proceed with Step 4 for Data Movement operations: ShuffleMoveOperation, BroadcastMoveOperation, TrimMoveOperation, PartitionMoveOperation, MoveOperation, CopyOperation.

STEP 3: Investigate SQL on the distributed databases

Use the Request ID and the Step Index to retrieve details from sys.dm_pdw_sql_requests, which contains execution information of the query step on all of the distributed databases.

SQL
-- Find the distribution run times for a SQL step.
-- Replace request_id and step_index with values from Steps 1 and 2.

SELECT * FROM sys.dm_pdw_sql_requests
WHERE request_id = 'QID####' AND step_index = 2;

When the query step is running, DBCC PDW_SHOWEXECUTIONPLAN can be used to retrieve the SQL Server estimated plan from the SQL Server plan cache for the step running on a particular distribution.

SQL
-- Find the SQL Server estimated plan for a query running on a specific SQL pool Compute or Control node.
-- Replace distribution_id and spid with values from previous query.

DBCC PDW_SHOWEXECUTIONPLAN(1, 78);

STEP 4: Investigate data movement on the distributed databases

Use the Request ID and the Step Index to retrieve information about a data movement step running on each distribution from sys.dm_pdw_dms_workers.

SQL
-- Find information about all the workers completing a Data Movement Step.
-- Replace request_id and step_index with values from Steps 1 and 2.

SELECT * FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID####' AND step_index = 2;

  • Check the total_elapsed_time column to see if a particular distribution is taking significantly longer than others for data movement.
  • For the long-running distribution, check the rows_processed column to see if the number of rows being moved from that distribution is significantly larger than others. If so, this finding might indicate skew of your underlying data.

If the query is running, you can use DBCC PDW_SHOWEXECUTIONPLAN to retrieve the SQL Server estimated plan from the SQL Server plan cache for the currently running SQL Step within a particular distribution.

SQL
-- Find the SQL Server estimated plan for a query running on a specific SQL pool Compute or control node.
-- Replace distribution_id and spid with values from previous query.

DBCC PDW_SHOWEXECUTIONPLAN(55, 238);

Monitor waiting queries

If you discover that your query is not making progress because it is waiting for a resource, here is a query that shows all the resources a query is waiting for.

SQL
-- Find queries that are waiting for resources
-- Replace request_id with value from Step 1.

SELECT waits.session_id,
      waits.request_id,  
      requests.command,
      requests.status,
      requests.start_time,  
      waits.type,
      waits.state,
      waits.object_type,
      waits.object_name
FROM   sys.dm_pdw_waits waits
   JOIN  sys.dm_pdw_exec_requests requests
   ON waits.request_id=requests.request_id
WHERE waits.request_id = 'QID####'
ORDER BY waits.object_name, waits.object_type, waits.state;

If the query is actively waiting on resources from another query, then the state will be AcquireResources. If the query has all the required resources, then the state will be Granted.
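
For example, filtering the query above on that state shows only the queries still waiting for their resource grants (a minimal sketch):

SQL
-- Sketch: queries still waiting to acquire resources
SELECT waits.request_id, waits.type, waits.object_name, requests.command
FROM sys.dm_pdw_waits waits
JOIN sys.dm_pdw_exec_requests requests ON waits.request_id = requests.request_id
WHERE waits.state = 'AcquireResources';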

Monitor tempdb

Tempdb is used to hold intermediate results during query execution. High utilization of the tempdb database can lead to slow query performance. For every DW100c configured, 399 GB of tempdb space is allocated (DW1000c would have 3.99 TB of total tempdb space). Below are tips for monitoring tempdb usage and for decreasing tempdb usage in your queries.

Monitoring tempdb with views

To monitor tempdb usage, first install the microsoft.vw_sql_requests view from the Microsoft Toolkit for SQL pool. You can then execute the following query to see the tempdb usage per node for all executed queries:

SQL
-- Monitor tempdb
SELECT
    sr.request_id,
    ssu.session_id,
    ssu.pdw_node_id,
    sr.command,
    sr.total_elapsed_time,
    es.login_name AS 'LoginName',
    DB_NAME(ssu.database_id) AS 'DatabaseName',
    (es.memory_usage * 8) AS 'MemoryUsage (in KB)',
    (ssu.user_objects_alloc_page_count * 8) AS 'Space Allocated For User Objects (in KB)',
    (ssu.user_objects_dealloc_page_count * 8) AS 'Space Deallocated For User Objects (in KB)',
    (ssu.internal_objects_alloc_page_count * 8) AS 'Space Allocated For Internal Objects (in KB)',
    (ssu.internal_objects_dealloc_page_count * 8) AS 'Space Deallocated For Internal Objects (in KB)',
    CASE es.is_user_process
    WHEN 1 THEN 'User Session'
    WHEN 0 THEN 'System Session'
    END AS 'SessionType',
    es.row_count AS 'RowCount'
FROM sys.dm_pdw_nodes_db_session_space_usage AS ssu
    INNER JOIN sys.dm_pdw_nodes_exec_sessions AS es ON ssu.session_id = es.session_id AND ssu.pdw_node_id = es.pdw_node_id
    INNER JOIN sys.dm_pdw_nodes_exec_connections AS er ON ssu.session_id = er.session_id AND ssu.pdw_node_id = er.pdw_node_id
    INNER JOIN microsoft.vw_sql_requests AS sr ON ssu.session_id = sr.spid AND ssu.pdw_node_id = sr.pdw_node_id
WHERE DB_NAME(ssu.database_id) = 'tempdb'
    AND es.session_id <> @@SPID
    AND es.login_name <> 'sa'
ORDER BY sr.request_id;

If you have a query that is consuming a large amount of memory, or you have received an error message related to the allocation of tempdb, it could be due to a very large CREATE TABLE AS SELECT (CTAS) or INSERT SELECT statement that is failing in the final data movement operation. This can usually be identified as a ShuffleMove operation in the distributed query plan right before the final INSERT SELECT. Use sys.dm_pdw_request_steps to monitor ShuffleMove operations.
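
For example, a sketch along these lines surfaces the longest-running ShuffleMove steps:

SQL
-- Sketch: find long-running ShuffleMove steps
SELECT request_id, step_index, total_elapsed_time, row_count
FROM sys.dm_pdw_request_steps
WHERE operation_type = 'ShuffleMoveOperation'
ORDER BY total_elapsed_time DESC;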

The most common mitigation is to break your CTAS or INSERT SELECT statement into multiple load statements so the data volume will not exceed the 1 TB per-node tempdb limit. You can also scale your cluster to a larger size, which will spread tempdb across more nodes and reduce the usage on each individual node.
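
As an illustrative sketch (the table and date column are hypothetical), a single oversized INSERT SELECT can be split into date-bounded batches:

SQL
-- Hypothetical example: split one large load into smaller batches
INSERT INTO dbo.FactSales SELECT * FROM dbo.FactSales_stage WHERE SaleDate < '2019-01-01';
INSERT INTO dbo.FactSales SELECT * FROM dbo.FactSales_stage WHERE SaleDate >= '2019-01-01';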

In addition to CTAS and INSERT SELECT statements, large, complex queries running with insufficient memory can spill into tempdb, causing queries to fail. Consider running with a larger resource class to avoid spilling into tempdb.
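
Resource classes are assigned through database role membership; for example, granting a larger static resource class to a loading user looks like this (the user name is hypothetical):

SQL
-- Assign the xlargerc static resource class (user name is hypothetical)
EXEC sp_addrolemember 'xlargerc', 'LoadUser';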

Monitor memory

Memory can be the root cause for slow performance and out of memory issues. Consider scaling your data warehouse if you find SQL Server memory usage reaching its limits during query execution.

The following query returns SQL Server memory usage and memory pressure per node:

SQL
-- Memory consumption
SELECT
  pc1.cntr_value as Curr_Mem_KB,
  pc1.cntr_value/1024.0 as Curr_Mem_MB,
  (pc1.cntr_value/1048576.0) as Curr_Mem_GB,
  pc2.cntr_value as Max_Mem_KB,
  pc2.cntr_value/1024.0 as Max_Mem_MB,
  (pc2.cntr_value/1048576.0) as Max_Mem_GB,
  pc1.cntr_value * 100.0/pc2.cntr_value AS Memory_Utilization_Percentage,
  pc1.pdw_node_id
FROM
-- pc1: current memory
sys.dm_pdw_nodes_os_performance_counters AS pc1
-- pc2: total memory allowed for this SQL instance
JOIN sys.dm_pdw_nodes_os_performance_counters AS pc2
ON pc1.object_name = pc2.object_name AND pc1.pdw_node_id = pc2.pdw_node_id
WHERE
pc1.counter_name = 'Total Server Memory (KB)'
AND pc2.counter_name = 'Target Server Memory (KB)'

Monitor transaction log size

The following query returns the transaction log size on each distribution. If one of the log files is reaching 160 GB, you should consider scaling up your instance or limiting your transaction size.

SQL
-- Transaction log size
SELECT
  instance_name as distribution_db,
  cntr_value*1.0/1048576 as log_file_size_used_GB,
  pdw_node_id
FROM sys.dm_pdw_nodes_os_performance_counters
WHERE
instance_name like 'Distribution_%'
AND counter_name = 'Log File(s) Used Size (KB)'

Monitor transaction log rollback

If your queries are failing or taking a long time to proceed, you can check and monitor if you have any transactions rolling back.

SQL
-- Monitor rollback
SELECT
    SUM(CASE WHEN t.database_transaction_next_undo_lsn IS NOT NULL THEN 1 ELSE 0 END) AS num_rollbacks,
    t.pdw_node_id,
    nod.[type]
FROM sys.dm_pdw_nodes_tran_database_transactions t
JOIN sys.dm_pdw_nodes nod ON t.pdw_node_id = nod.pdw_node_id
GROUP BY t.pdw_node_id, nod.[type]

Monitor PolyBase load

The following query provides an approximate estimate of the progress of your load. The query only shows files currently being processed.

SQL

-- To track bytes and files
SELECT
    r.command,
    s.request_id,
    r.status,
    count(distinct input_name) as nbr_files,
    sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
    sys.dm_pdw_exec_requests r
    inner join sys.dm_pdw_dms_external_work s
        on r.request_id = s.request_id
GROUP BY
    r.command,
    s.request_id,
    r.status
ORDER BY
    nbr_files desc,
    gb_processed desc;

Next steps

For more information about DMVs, see System views.

Unlock Query Performance with SQL Data Warehouse using Graphical Execution Plans


The Graphical Execution Plan feature within SQL Server Management Studio (SSMS) is now supported for SQL Data Warehouse (SQL DW)! With a click of a button, you can create a graphical representation of a distributed query plan for SQL DW.

Before this enhancement, query troubleshooting for SQL DW was often a tedious process that required you to run the EXPLAIN command. SQL DW customers can now seamlessly and visually debug query plans to identify performance bottlenecks directly within the SSMS window. This extends the query troubleshooting experience by displaying costly data movement operations, which are the most common reasons for slow distributed query plans. Below is a simple example of troubleshooting a distributed query plan with SQL DW leveraging the Graphical Execution Plan.

The view below displays the estimated execution plan for a query. As we can see, this is an incompatible join which occurs when there is a join between two tables distributed on different columns. An incompatible join will create a ShuffleMove operation, where temp tables will be created on every distribution to satisfy the join locally before streaming the results back to the user. The ShuffleMove has become a performance bottleneck for this query:

[Screenshot: SSMS estimated execution plan for the query, showing the ShuffleMove operation]

Taking a closer look at our table definition, we can see that the two tables being joined are distributed on different columns:

[Screenshot: table definitions showing the two joined tables distributed on different columns]

By leveraging the estimated Graphical Execution Plan, we have identified that we can improve the performance of this query by redistributing the table on the appropriate column to remove the ShuffleMove operation.
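
A sketch of that fix (table and column names are hypothetical): rebuild one table with CTAS so both sides of the join hash on the same column, then swap it into place.

CREATE TABLE dbo.FactOrders_new
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(CustomerKey))
AS SELECT * FROM dbo.FactOrders;

-- Swap the redistributed table into place
RENAME OBJECT dbo.FactOrders TO FactOrders_old;
RENAME OBJECT dbo.FactOrders_new TO FactOrders;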

Troubleshooting and tuning distributed SQL queries just got easier. Download the latest SSMS release to start using this feature today! If you need help with a POC, contact us directly. Stay up-to-date on the latest Azure SQL DW news and features by following us on Twitter @AzureSQLDW.

Azure SQL Data Warehouse loading patterns and strategies


Abstract

This article provides an overview of the Microsoft Azure SQL Data Warehouse architecture. This platform-as-a-service (PaaS) offering provides independent compute and storage scaling on demand. This document provides data loading guidelines for SQL Data Warehouse. Several common loading options are briefly described, but the main focus is the PolyBase technology, the preferred and fastest loading method for ingesting data into SQL Data Warehouse. See also What is Azure SQL Data Warehouse?

Introduction

Whether you are building a data mart or a data warehouse, the three fundamentals you must implement are an extraction process, a transformation process, and a loading process—also known as extract, transform, and load (ETL). When working with smaller workloads, the general rule from the perspective of performance and scalability is to perform transformations before loading the data. In the era of big data, however, as data sizes and volumes continue to increase, processes may encounter bottlenecks from difficult-to-scale integration and transformation layers.

As workloads grow, the design paradigm is shifting. Transformations are moving to the compute resource, and workloads are distributed across multiple compute resources. In the distributed world, we call this massively parallel processing (MPP), and the order of these processes differs. You may hear it described as ELT—you extract, load, and then transform as opposed to the traditional ETL order. The reason for this change is today’s highly scalable parallel computing powers, which put multiple compute resources at your disposal such as CPU (cores), RAM, networking, and storage, and you can distribute a workload across them.

With SQL Data Warehouse, you can scale out your compute resources as you need them on demand to maximize power and performance of your heavier workload processes.

However, we still need to load the data before we can transform. In this article, we'll explore several loading techniques that help you reach maximum data-loading throughput and identify the scenarios that best suit each of these techniques.

Architecture

SQL Data Warehouse uses the same logical component architecture for the MPP system as the Microsoft Analytics Platform System (APS). APS is the on-premises MPP appliance previously known as the Parallel Data Warehouse (PDW).

As you can see in the diagram below, SQL Data Warehouse has two types of components, a Control node and a Compute node:

Figure 1. Control node and Compute nodes in the SQL Data Warehouse logical architecture


The Control node is the brain and orchestrator of the MPP engine. We connect to this area when using SQL Data Warehouse to manage and query data. When you send a SQL query to SQL Data Warehouse, the Control node processes that query and converts the code to what we call a DSQL plan, or Distributed SQL plan, based on the cost-based optimization engine. After the DSQL plan has been generated, for each subsequent step, the Control node sends the command to run in each of the compute resources.

The Compute nodes are the worker nodes. They run the commands given to them from the Control node. Compute usage is measured using SQL Data Warehouse Units (DWUs). A DWU, similar to the Azure SQL Database DTU, represents the power of the database engine as a blended measure of CPU, memory, and read and write rates. The smallest compute resource (DWU 100) consists of the Control node and one Compute node. As you scale out your compute resources (by adding DWUs), you increase the number of Compute nodes.
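
For example, compute can be rescaled on demand with a single T-SQL command, typically issued from the master database (the database name and target DWU here are placeholders):

ALTER DATABASE MySqlDw MODIFY (SERVICE_OBJECTIVE = 'DW1000');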

Within the Control node and in each of the Compute resources, the Data Movement Service (DMS) component handles the movement of data between nodes—whether between the Compute nodes themselves or from Compute nodes to the Control node.

DMS also includes the PolyBase technology. An HDFS bridge is implemented within the DMS to communicate with the HDFS file system. PolyBase for SQL Data Warehouse currently supports Microsoft Azure Storage Blob and Microsoft Azure Data Lake Store.

Network and data locality

The first considerations for loading data are source-data locality and network bandwidth, utilization, and predictability of the path to the SQL Data Warehouse destination. Depending on where the data originates, network bandwidth will play a major part in your loading performance. For source data residing on your premises, network throughput performance and predictability can be enhanced with a service such as Azure Express Route. Otherwise, you must consider the current average bandwidth, utilization, predictability, and maximum capabilities of your current public Internet-facing, source-to-destination route.

Note ExpressRoute routes your data through a dedicated connection to Azure without passing through the public Internet. ExpressRoute connections offer more reliability, faster speeds, lower latencies, and higher security than typical Internet connections. For more information, see ExpressRoute.

Using PolyBase for SQL Data Warehouse loads

SQL Data Warehouse supports many loading methods, including non-PolyBase options (BCP and the SQLBulkCopy API) and PolyBase options: CTAS/INSERT, PolyBase with SSIS, PolyBase with Azure Databricks, and PolyBase with Azure Data Factory (ADF).

PolyBase is by far the fastest and most scalable SQL Data Warehouse loading method to date, so we recommend it as your default loading mechanism. PolyBase is a scalable query-processing framework compatible with Transact-SQL that can be used to combine and bridge data across relational database management systems, Azure Blob Storage, Azure Data Lake Store, and Hadoop database platform ecosystems (APS only).

Note As a general rule, we recommend making PolyBase your first choice for loading data into SQL Data Warehouse unless you can't accommodate PolyBase-supported file formats. Currently PolyBase can load data from UTF-8 and UTF-16 encoded delimited text files as well as the popular Hadoop file formats RCFile, ORC, and Parquet (non-nested format). PolyBase can load data from gzip, zlib, and Snappy compressed files. PolyBase currently does not support extended ASCII, fixed-file formats, WinZip, or semi-structured data such as Parquet (nested/hierarchical), JSON, and XML. A popular pattern for loading semi-structured data is to use Azure Databricks (or similarly HDInsight/Spark) to load the data, flatten/transform it to a supported format, and then load it into SQL DW.

As the following architecture diagrams show, each HDFS bridge of the DMS service from every Compute node can connect to an external resource such as Azure Blob Storage, and then bidirectionally transfer data between SQL Data Warehouse and the external resource.

Note As of this writing, SQL Data Warehouse supports Azure Blob Storage and Azure Data Lake Store as the external data sources.

Figure 2. Data transfers between SQL Data Warehouse and an external resource


PolyBase data loading is not limited by the Control node, and so as you scale out your DWU, your data transfer throughput also increases. By mapping the external files as external tables in SQL Data Warehouse, the data files can be accessed using standard Transact-SQL commands—that is, the external tables can be referenced as standard tables in your Transact-SQL queries.
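
For example, once the orders_ext external table shown in the next section exists, it can be queried like any local table:

SELECT TOP 10 * FROM orders_ext;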

Copying data into storage

The general load process begins with migrating your data into Azure Blob Storage. Depending on your network's capabilities, reliability, and utilization, you can use AzCopy to upload your source data files to Azure Storage blobs with an upload rate of 80 MB/second to 120 MB/second.

Then, in SQL Data Warehouse, you configure your credentials that will be used to access Azure Blob Storage:

CREATE DATABASE SCOPED CREDENTIAL myid_credential WITH IDENTITY = 'myid', Secret='mysecretkey';

Next you define the external Azure Blob Storage data source with the previously created credential:

CREATE EXTERNAL DATA SOURCE data_1tb WITH (TYPE = HADOOP, LOCATION = 'wasbs://data_1tb@myid.blob.core.windows.net', CREDENTIAL= myid_credential);

And for the source data, define the file format and external table definition:

CREATE EXTERNAL FILE FORMAT pipedelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS(
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = '',
        USE_TYPE_DEFAULT = False)
);

CREATE EXTERNAL TABLE orders_ext (
    o_orderkey bigint NULL,
    o_custkey bigint NULL,
    o_orderstatus char(1),
    o_totalprice decimal(15, 2) NULL,
    o_orderdate date NULL,
    o_orderpriority char(15),
    o_clerk char(15),
    o_shippriority int NULL,
    o_comment varchar(79)
)
WITH (LOCATION='/orders',
    DATA_SOURCE = data_1tb,
    FILE_FORMAT = pipedelimited,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

For more information about PolyBase, see SQL Data Warehouse documentation.

Using CTAS to load initial data

Then you can use a CTAS (CREATE TABLE AS SELECT) operation within SQL Data Warehouse to load the data from Azure Blob Storage to SQL Data Warehouse:

CREATE TABLE orders_load
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(o_orderkey),
    PARTITION (o_orderdate RANGE RIGHT FOR VALUES ('1992-01-01','1993-01-01','1994-01-01','1995-01-01')))
AS SELECT * FROM orders_ext;

CTAS creates a new table. We recommend using CTAS for the initial data load. This is an all-or-nothing operation with minimal logging.

Using INSERT INTO to load incremental data

For an incremental load, use the INSERT INTO operation. This is a fully logged operation when inserting into a populated partition, which will impact load performance. Furthermore, the rollback operation on a large transaction can be expensive. Consider breaking your transaction into smaller batches.

INSERT INTO orders_load
SELECT * FROM orders_current_ext;

Note The source uses a different external table, orders_current_ext. This is the external table that defines the path for the incremental data on Azure Blob Storage.


Another popular pattern is to load into a partition-aligned stage table via CTAS, then switch the partition into the final table, as sketched below.
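
A sketch of that pattern, reusing the tables defined earlier (the partition number is illustrative, and the target partition must be empty before the switch):

-- Load the new data into a stage table aligned with the target
CREATE TABLE orders_stage
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(o_orderkey),
    PARTITION (o_orderdate RANGE RIGHT FOR VALUES ('1992-01-01','1993-01-01','1994-01-01','1995-01-01')))
AS SELECT * FROM orders_current_ext;

-- Switch the loaded partition into the final table
ALTER TABLE orders_stage SWITCH PARTITION 2 TO orders_load PARTITION 2;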

Data reader and writer considerations

The number of external readers and writers varies depending on the following factors:

  • Gen1 or Gen2 instance
  • SLO or size of the instance (DWU)
  • Type of operation (DMS query or PolyBase load)
  • Type of file being loaded (Parquet, text, etc.)
  • Concurrency at the time the operation was submitted (readers/writers auto-adjust dynamically)

As illustrated in Table 1 below, each DWU has a specific number of readers and writers. As you scale out, each node gets an additional number of readers and writers. The static and dynamic resource classes also vary in the number of readers and writers. Note that Parquet files typically have half the number of readers compared to non-Parquet files. The number of readers and writers is an important factor in determining your load performance.

Table 1. Number of readers and writers for Gen 1 SQL DW xlargerc resource class


Table 2. Number of readers and writers for Gen 2 SQL DW xlargerc resource class


To check the number of readers/writers, use the following query (adjust the request_id and step_index as appropriate). For more information, see Monitoring your workload using DMVs.

SELECT type, count(*)
FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QIDXXXX' AND step_index = XX
GROUP BY type;

Best practices and considerations when using PolyBase

Here are a few more things to consider when using PolyBase for SQL Data Warehouse loads:

  • A single PolyBase load operation provides best performance.
  • The load performance scales as you increase DWUs.
  • PolyBase automatically parallelizes the data load process, so you don't need to explicitly break the input data into multiple files and issue concurrent loads, unlike some traditional loading practices. Each reader automatically reads 512 MB for each file on Azure Storage Blob and 256 MB on Azure Data Lake Storage.
  • Multiple readers will not work against gzip files. Only a single reader is used per gzip compressed file, since uncompressing the file in the buffer is single-threaded. Alternatively, generate multiple gzip files. The number of files should be greater than or equal to the total number of readers.
  • Multiple readers will work against compressed columnar/block format files (e.g., ORC, RC) since individual blocks are compressed independently.

Known issues when working with different file formats

In addition to the UTF-8/UTF-16 encoding considerations, other known file format issues can arise when using PolyBase.

Mixed intra-file date formats

In a CREATE EXTERNAL FILE FORMAT command, the DATE_FORMAT argument specifies a single format to use for all date and time data in a delimited text file. If the DATE_FORMAT argument isn’t designated, the following default formats are used:

  • DateTime: 'yyyy-MM-dd HH:mm:ss'
  • SmallDateTime: 'yyyy-MM-dd HH:mm'
  • Date: 'yyyy-MM-dd'
  • DateTime2: 'yyyy-MM-dd HH:mm:ss'
  • DateTimeOffset: 'yyyy-MM-dd HH:mm:ss'
  • Time: 'HH:mm:ss'

For source formats that don’t reflect the defaults, you must explicitly specify a custom date format. However, if multiple non-default formats are used within one file, there is currently no method for specifying multiple custom date formats within the PolyBase command.
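
A common workaround (an illustrative sketch with hypothetical table and column names) is to declare the affected column as varchar in the external table and convert it during the CTAS load:

CREATE TABLE events_load
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN)
AS
SELECT CONVERT(datetime2, event_date_raw, 103) AS event_date, -- raw values in dd/MM/yyyy
       event_name
FROM events_ext; -- external table declares event_date_raw as varchar(30)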

Fixed-length file format not supported

Fixed-length character file formats—for example, where each column has a fixed width of 10 characters—are not supported today.

If you encounter these PolyBase restrictions, consider changing the data extract process to address the limitations. This could mean formatting dates to a PolyBase-supported format, transforming JSON files to text files, and so on. If that is not possible, your option is to use one of the methods in the next section.

Using Control-node and single-client gated load methods

In the Architecture section we mentioned that all incoming connections go through the Control node. Although you can increase and decrease the number of compute resources, there is only a single Control node. And as mentioned earlier, one reason why PolyBase provides a superior load rate is that PolyBase data transfer is not limited by the Control node. But if using PolyBase is not currently an option, the following technologies and methods can be used for loading into SQL Data Warehouse:

  • BCP
  • SQLBulkCopy

For these load methods, the bottleneck is on the client machine and the single Control node. Each load uses a single core on the client machine and only accesses the single Control node. Therefore, the load does not scale if you increase DWUs for an SQL Data Warehouse instance.

Note You can, however, increase load throughput if you add parallel loads into either the same table or different tables.

Using multiple concurrent client executions should improve your load throughput, to a point. The number of parallel loads no longer improves your throughput once the maximum capacity of the Control node is reached.

Best practices and considerations for single-client gated load methods

Consider the following when using BCP, SQLBulkCopy or other Control-node and client-gated loading methods:

  • Issue multiple threads into different tables to improve throughput. SQL DW does not support loading into the same table from multiple threads, since that requires an exclusive lock.
  • Include retry logic.
  • Don’t specify the batch size with Control-node gated methods. The goal is to load all or nothing so that the retry logic will restart the load. If you designate a batch size and the load encounters a failure (for example, the network or database is not available), you may need to add more logic to restart from the last successful commit.

Comparing load method performance characteristics

The following table details the performance test results for the PolyBase options and the non-PolyBase options, BCP and SQLBulkCopy.

Table 3. SQL Data Warehouse performance testing results


As you can see, the PolyBase methods are significantly faster than the BCP and SQLBulkCopy Control-node client gated load methods. If PolyBase is not an option, however, BCP provides the next best load rate.

Regarding increased throughput as you add more threads (the third row in the table), keep in mind that SQL Data Warehouse supports up to 128 concurrent queries/loads. It is unlikely that you can load with 128 threads in parallel, as resource contention will develop. Take the time to balance your load throughput and concurrency for the best mix to meet your ingestion SLA and user query experience. For more information about concurrency, see Concurrency and workload management in SQL Data Warehouse.

Conclusion

SQL DW provides many options to load data, as we discussed in this article. Each method has its own advantages and disadvantages. It’s easy to “lift and shift” your existing SSIS packages, BCP scripts, and other Control-node client gated methods to mitigate migration effort. However, if you require higher speeds for data ingestion, consider rewriting your processes to take advantage of PolyBase with its high-throughput, highly scalable loading methodology.
