Learn how Microsoft has put in a significant effort to maximize the benefits of building a modern Data Warehouse Storage using both Azure Synapses Analytics (formerly SQL Data Warehouse) and Azure Data Lake Storage
Author: Juan A. Vega, Simatree's Director of Technology and Architecture
Publish Date: April 17th, 2020
Microsoft has come a long way in the Data Warehousing technology wars. From their original entry for Massive Parallel Processing (MPP) with SQL Server Parallel Data Warehouse to their latest release of Azure Synapse Analytics. At first, I didn’t quite understand why they changed the name to include “Synapse”. The traditional Data Warehouse has always implied the final truth of a company’s detail, summary key metrics, and key performance indicators (KPIs) for historical and current reporting. So, when I saw they removed “Data Warehouse” and added “Synapse Analytics” I had to investigate further.
The word “Synapse” means, “the junction between two neurons or nerve cells where there is a small gap that neurotransmitters help nerve impulses to cross.” In other words, Azure Synapse Analytics is like a “bridge” or a “junction” or a “connection” between your traditional data warehouse to your “Big Data” as found in the data lake.
Data Lakes came about due to the exponential increases in raw data and that it was simply too cost prohibitive to push petabytes of raw data into your expensive Data Warehouse storage. Therefore, over the past five years many large organizations have built up huge data lakes but are having a hard time linking this into their existing enterprise data warehouse. In many cases, Data Engineers are spending a great deal of time moving traditional data warehouse data (such as detail level FACT and Dimension data) into the Data Lake to allow the Data Scientists to do their jobs.
But with Azure Synapses Analytics you have a new cloud solution that allows you to combine your huge data sets, that could come from Azure Data Lake or BLOB Storage, and be able to process the data within the MPP architecture of your Azure Synapse Analytics (SQL DW). In other words, it is combining Big Data into the parallel architecture of Data Warehousing.
There are many benefits to this new architecture:
Utilize the SQL environment that ETL Developers and BI Specialists have traditionally used for over three decades.
Huge savings on cheaper storage costs using Azure BLOB Storage or Azure Data Lake Gen2. These savings can be from 17% up to 70% depending on many factors such as size of your data warehouse, yearly commitment discounts, etc.… Note, a second article will delve into the details on how much you can save.
Integrate your “Data Warehouse” relational data and “Big Data” within Azure Databricks using Spark. This opens a whole new door for Data Scientists to quickly and easily pull in these data sets and apply the programming language of their choice such as Python, R, Scala, Java, and SQL. A lot of new Data Scientists are getting trained from various universities and online programs using these languages. Databricks will allow them to easily transfer this knowledge and apply it quickly in the real world.
Within Azure Databricks you have literally over dozens of libraries and utilities for Machine Learning and Artificial Intelligence. Azure Databricks easily integrates your FACT/Dimension data as well as your summary Key Performance Indicators (KPIs), from your traditional data warehouse, and links this to your “Big Data” as found in your Azure Data Lake. All this without having to copy data from one place to another. Instead of spending time on shuffling data around, you can focus on the analysis and vision you’re trying to gain. This will allow you to uncover data insights that were previously impractical to find. You can then build rich graphical presentations and save the finalized results of these insights back into Azure Synapse Analytics for historical tracking and reporting.
Use External Tables to point “Big Data” that are in the petabytes scale using classic SQL to join to fact tables or dimensions within the Data Warehouse. See the quote from Rohan Kumar with Microsoft, "key part of Synapse is the ability to query both relational and non-relational data at petabyte scale blazingly fast using an ANSI SQL language." https://azure.microsoft.com/en-us/resources/videos/rohan-kumar-azure-synapse-analytics-demo/
You get all the historical benefits of using traditional SQL Server database that have matured over the past decade such as:
Massive Parallel Processing (MPP)
Store data in relational tables with columnar storage
Columnar storage reduces the data storage costs, and greatly improves query performance (up to 10 to 100 times faster than a traditional SQL Database)
All the traditional user administration and security features found in SQL Database and Active Directory
All the classic syntax for SQL
Includes many of the new versions of SQL syntax for analytics
Stored procedures using T-SQL
Role based security
Auditing
Familiar tools such as SQL Server Management Studio and Visual Studio
Numerous SQL Server utilities that have matured and proven beneficial
Numerous ETL development tools that are available to SQL Server technology
It’s hard to describe how helpful some of these benefits can really be. In the past I recall building file loading processes that would load very large data sets into the data warehouse for further processing. This was a time consuming and, at times, frustrating process. In many cases, you had issues such as file corruption, data type issues, delays in receiving data files, etc.… that caused a lot of development and production support work to maintain the inputs into the data warehouse. Also, storage costs more than doubled since you needed storage for the files and the space inside the data warehouse.
But, with Azure Synapse Analytics these issues are greatly mitigated since the data files will already reside in your Azure BLOB storage or your Azure Data Lake. It’s like saying “once the files have landed in the landing pad (i.e. BLOB storage), they’re ready to be queried with no need to load the data”. This is a tremendous benefit compared to the old Data Warehouse strategy of constantly loading raw data files into your warehouse. At the same time, you don’t need to move this data back to the data lake for the Data Scientists to use. They can access the No SQL data and relational data using Azure Databricks at the same time.
What is also hugely beneficial is the ability to build and deploy the data warehouse using the many tools and languages that have been available for so long. You don’t have to re-learn everything or re-invent the wheel. All the 3rd party ETL tools still work with the Azure Synapse Analytics. Popular tools such as Informatica, SQL Server Integration Services (SSIS), DataStage, Pentaho, Talend, and a lot more still function. And then you have Microsoft’s new ETL / ELT cloud offering called Azure Data Factory, which has over 81 connectors to so many different platforms and databases. Furthermore, all your traditional front-end BI tools work including: PowerBI, Tableau, MicroStrategy, ViveBI, etc.… You don’t have to re-engineer the front-end to work with Azure Synapse Analytics.
Some key questions I want to cover are:
What is the difference between an external table and a regular relational table?
What is better to use, Azure BLOB Storage or Azure Data Lake?
What is the best file format to use within our Azure BLOB Storage or Azure Data Lake?
Question #1) What is the difference between an external table and a regular relational table?
An external table is basically a pointer, or “named reference,” to the data found in your Azure BLOB storage, or Azure Data Lake. Azure Synapse Analytics stores the definition, data structure, and data source connection; this is called your metadata. However, the actual data itself is not stored in Azure Synapse Analytics. Instead, Microsoft developed a technology called PolyBase, which is used to query and/or load huge data sets. Your traditional relational table can be called a “managed table,” which means the database stores both the metadata and the actual data inside the database. But for external tables, these can also be considered “unmanaged tables,” where the database only stores the metadata but not the actual data. The syntax to get this working is not complicated, as shown below with six steps. Note, this is sourced directly from the Azure documentation at:
1. Create a master key on the database, if one does not already exist. This is required to encrypt the credential secret.
SQL
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'password';
Arguments
PASSWORD ='password'
Is the password that is used to encrypt the master key in the database. password must meet the Windows password policy requirements of the computer that is hosting the instance of SQL Server.
2. Create a database scoped credential for Kerberos-secured Hadoop clusters.
SQL
-- IDENTITY: the Kerberos user name.
-- SECRET: the Kerberos password
CREATE DATABASE SCOPED CREDENTIAL HadoopUser1
WITH IDENTITY = '<hadoop_user_name>', Secret = '<hadoop_password>';
3. Create an external data source with CREATE EXTERNAL DATA SOURCE.
SQL
-- LOCATION (Required) : Hadoop Name Node IP address and port.
-- RESOURCE MANAGER LOCATION (Optional): Hadoop Resource Manager location to enable pushdown computation.
-- CREDENTIAL (Optional): the database scoped credential, created above.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
TYPE = HADOOP,
LOCATION ='hdfs://10.xxx.xx.xxx:xxxx',
RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
CREDENTIAL = HadoopUser1
);
4. Create an external file format with CREATE EXTERNAL FILE FORMAT.
SQL
-- FORMAT TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR ='|',
USE_TYPE_DEFAULT = TRUE))
5. Create an external table pointing to data stored in Hadoop with CREATE EXTERNAL TABLE. In this example, the external data contains car sensor data.
SQL
-- LOCATION: path to file or directory that contains the data (relative to HDFS root).
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL,
[CustomerKey] int NOT NULL,
[GeographyKey] int NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/',
DATA_SOURCE = MyHadoopCluster,
FILE_FORMAT = TextFileFormat
);
6. Create statistics on an external table.
SQL
CREATE STATISTICS StatsForSensors on CarSensor_Data(CustomerKey, Speed)
From here you run queries directly on the table named “CarSensor_Data” and can join this to a relational table stored within the data warehouse. That is the beauty and one of the key features of Azure Synapse Analytics.
Question #2) What is better to use Azure BLOB Storage or Azure Data Lake?
In many cases, they are nearly the same. However, in my humble opinion, I would choose Azure Data Lake Gen2. Pricing is the same for both. However, Azure Data Lake Gen2 has some key features that make it immediately usable for Hadoop or HDInsight technology and integrates very well with Azure Databricks. Note, don’t select Azure Data Lake Gen1 which is significantly more expensive and has less features. Based on pricing alone, Microsoft is really pushing everyone towards Azure Data Lake Gen2.
One of the key features of Azure Data Lake Gen2 is it has “Hierarchical namespace” enabled. This feature allows you to create and organize your blob data into directories and stores metadata about each directory and the files within it. This allows operations, such as directory renames/moves/deletes, to be performed in a single operation. Hierarchical namespaces keep the data organized, which yields better storage and retrieval performance for an analytical “use case” and lowers the cost of analysis.
Azure BLOB Storage does not use “Hierarchical namespace” but instead has flat namespace. It’s similar to having a “root” folder and nothing else. Thus, if you have hundreds or thousands of BLOB files, they are much harder to find and retrieve. There are 3rd party utilities that do let you create “virtual” directories in Azure BLOB. But why go thru the trouble if you can get Azure Data Lake Gen2 for the same price. Azure BLOB is also more for image files, video storage, database backups, Virtual Hard Drives, and old data sets for archive storage. Whereas Azure Data Lake Gen2 is more suitable for an analytic or real-time operational environment. Note, Azure Data Lake Gen2 does have hot, cold, and archive storage tiers, like BLOB storage. Most importantly, when you store all your “Big Data” and/or “IOT Data” into your Azure Data Lake Gen2, you have the option to utilize Azure Databricks or HDInsight. HDInsight is Microsoft’s Hadoop stack which includes these open source tools:
Apache Hadoop
Spark
Kafka
HBase
Storm
Interactive Query
Apache Hive
HBase
Spark
Kafka
In both cases, your costs can easily increase depending on your disaster recovery and business continuity (DRBC) solution. For example, Azure offers four levels of redundant storage which:
Local redundant storage (LRS) – fails over to another node in the same data center
Zone redundant storage (ZRS) – fails over to another node in a different data center in the same region
Geo redundant storage (GRS) – fails over to another node in a different data center in a different region often 100s of miles away
read-access geo redundant storage (RAGRS) – same as geo redundant storage but allows read access to the fail-over node.
So, prices for your BLOB storage or Data Lake Solution can easily increase based on your requirements.
Question #3) What is the best file format to use within our Azure BLOB Storage or Azure Data Lake?
In many cases, you really do not have a choice. Your source providers are going to give you the raw data “as is”. If you’re lucky, they can drop the file in the preferred format you want. PolyBase natively supports CSV uncompressed files (or compressed using GZIP), Hive RCFile, Hive ORC, and Parquet files. In all these cases, PolyBase can handle the job. However, if you’re able to make this request, have your source providers send the data using Parquet. Parquet files were designed within the Hadoop ecosystem for fast performance and retrieval. These files can be pulled into the data warehouse up to 13 times faster. You get even more performance gains if you choose fewer columns. Parquet files store the data in a compressed columnar format which works well with summary queries or queries with fewer columns selected.
You also have the choice to use Azure Data Factory (ADF) to convert your source files to Parquet. For example, you can use ADF to convert JSON to Parquet, CSV to Parquet, etc.… If you can get the file format to be Parquet, you’ll have a very powerful solution for querying the data outside of the data warehouse. The main benefit is keeping these huge data sets in the significantly lower cost storage where you can gain tremendous savings.
In summary, Azure Synapse Analytics is breaking the boundaries of traditional data warehousing. With the processing power of MPP combined with the ability to link to your “Big Data” found in the Data Lake, you’re able to quickly accomplish things that previously were not practical. Microsoft has introduced Azure Synapse Analytics as a key bridge technology in the innovative ecosystem of Machine Learning and Artificial Intelligence found within Azure Databricks and Azure Data Lake. The “use cases” that you can now do were not possible with traditional data warehouse solutions. Azure Synapse Analytics breaks barriers and brings a new rich set of capabilities that can take your organization to the next level.
https://azure.microsoft.com/en-us/resources/videos/rohan-kumar-azure-synapse-analytics-demo/
https://azure.microsoft.com/en-us/resources/videos/tour-synapse-analytics-01/
https://azure.microsoft.com/en-us/resources/videos/introducing-azure-synapse-analytics-01/
https://azure.microsoft.com/en-us/blog/azure-sql-data-warehouse-is-now-azure-synapse-analytics/
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices
https://medium.com/@gary.strange/flattening-json-in-azure-data-factory-2f2130794258