---
title: Data virtualization
titleSuffix: Azure SQL Managed Instance
description: Learn about data virtualization capabilities of Azure SQL Managed Instance
services: sql-database
ms.service: sql-managed-instance
ms.subservice: service-overview
ms.custom:
ms.devlang:
ms.topic: conceptual
author: MladjoA
ms.author: mlandzic
ms.reviewer: mathoma, MashaMSFT
ms.date: 03/02/2022
---

# Data virtualization with Azure SQL Managed Instance (Preview)

[!INCLUDE[appliesto-sqlmi](../includes/appliesto-sqlmi.md)]

Azure SQL Managed Instance enables you to execute T-SQL queries that read data from files stored in Azure Data Lake Storage Gen2 or Azure Blob Storage, and to combine that data with locally stored relational data using joins. This way you can transparently access external data while keeping it in its original format and location, a concept known as data virtualization.

## Overview

There are two ways of querying external files, intended for different scenarios:

- OPENROWSET syntax – optimized for ad-hoc querying of files. It's typically used to quickly explore the content and the structure of a new set of files.
- External tables – optimized for repetitive querying of files using identical syntax as if data were stored locally in the database. External tables require a few more preparation steps than the OPENROWSET syntax, but they allow more control over data access. They're typically used in analytical workloads and for reporting.

The file formats directly supported are parquet and delimited text (CSV). The JSON file format is supported indirectly: specify the CSV file format, and queries return every document as a separate row. The rows can be further parsed using JSON_VALUE and OPENJSON.

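As an illustrative sketch of the JSON approach: the data source name `DemoJsonDataSource`, the file `data.jsonl`, and its `date_rep` and `cases` properties below are hypothetical, and the sketch assumes OPENROWSET's CSV options FIELDTERMINATOR and FIELDQUOTE accept terminator characters that don't occur in the documents, so each JSON document arrives as one column:

```sql
--Hypothetical example: read each JSON document as a single varchar(max) column,
--then extract properties with JSON_VALUE.
SELECT
 JSON_VALUE(jsonContent, '$.date_rep') AS date_rep,
 JSON_VALUE(jsonContent, '$.cases') AS cases
FROM OPENROWSET(
 BULK 'data.jsonl',
 DATA_SOURCE = 'DemoJsonDataSource',
 FORMAT = 'CSV',
 FIELDTERMINATOR = '0x0b',
 FIELDQUOTE = '0x0b'
)
WITH (jsonContent varchar(max)) AS filerows
```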
The location of the file(s) to be queried needs to be provided in a specific format, with the location prefix corresponding to the type of the external source and the endpoint/protocol used:

```sql
--Blob Storage endpoint
abs://<container>@<storage_account>.blob.core.windows.net/<path>/<file_name>.parquet

--Data Lake endpoint
adls://<container>@<storage_account>.dfs.core.windows.net/<path>/<file_name>.parquet
```

> [!IMPORTANT]
> Usage of the generic https:// prefix is discouraged and will be disabled in the future. Make sure you use endpoint-specific prefixes to avoid interruptions.

The feature needs to be explicitly enabled before you use it. Run the following commands to enable the data virtualization capabilities:

```sql
exec sp_configure 'polybase_enabled', 1;
go
reconfigure;
go
```

If you're new to data virtualization and want to quickly test functionality, start by querying publicly available data sets in [Azure Open Datasets](https://docs.microsoft.com/azure/open-datasets/dataset-catalog), like the [Bing COVID-19 dataset](https://docs.microsoft.com/azure/open-datasets/dataset-bing-covid-19?tabs=azure-storage) that allows anonymous access:

- Bing COVID-19 dataset - parquet: `abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet`
- Bing COVID-19 dataset - CSV: `abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.csv`

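As a quick first test, the full abs:// path of a public file can be passed directly in the BULK parameter, without creating any database objects first (the statistics examples later in this article use the same pattern):

```sql
SELECT TOP 10 *
FROM OPENROWSET(
 BULK 'abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet',
 FORMAT = 'parquet'
) AS filerows
```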
Once your first queries execute successfully, you may want to switch to private data sets that require configuring specific access rights or firewall rules.

To access a private location, you need to authenticate to the storage account using a Shared Access Signature (SAS) key with the proper access permissions and validity period. The SAS key isn't provided directly in each query. Instead, it's used to create a database-scoped credential, which is in turn provided as a parameter of an external data source.

All the concepts outlined so far are described in detail in the following sections.

## External data source

An external data source is an abstraction intended for easier management of file locations across multiple queries, and for referencing authentication parameters encapsulated in a database-scoped credential.

Public locations are described in an external data source by providing the file location path:

```sql
CREATE EXTERNAL DATA SOURCE DemoPublicExternalDataSource
WITH (
	LOCATION = 'abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest'
	-- LOCATION = 'abs://<container>@<storage_account>.blob.core.windows.net/<path>'
)
```

Private locations, in addition to the path, also require a reference to a credential:

```sql
-- Step0 (optional): Create master key if it doesn't exist in the database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Put Some Very Strong Password Here>'
-- GO

--Step1: Create database-scoped credential (requires database master key to exist):
CREATE DATABASE SCOPED CREDENTIAL [DemoCredential]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<your SAS key without leading "?" mark>';
GO

--Step2: Create external data source pointing to the file path, and referencing database-scoped credential:
CREATE EXTERNAL DATA SOURCE DemoPrivateExternalDataSource
WITH (
	LOCATION = 'abs://<container>@<storage_account>.blob.core.windows.net/<path>',
	CREDENTIAL = [DemoCredential]
)
```

## Query data sources using OPENROWSET

The [OPENROWSET](https://docs.microsoft.com/sql/t-sql/functions/openrowset-transact-sql) syntax enables instant, ad-hoc querying, with only a minimal set of database objects required. The DATA_SOURCE parameter value is automatically prepended to the BULK parameter to form the full path to the file. The format of the file also needs to be provided:

```sql
SELECT TOP 10 *
FROM OPENROWSET(
 BULK 'bing_covid-19_data.parquet',
 DATA_SOURCE = 'DemoPublicExternalDataSource',
 FORMAT = 'parquet'
) AS filerows
```

### Querying multiple files and folders

While the OPENROWSET command in the previous example queried a single file, it can also query multiple files or folders by using wildcards in the BULK path.

Here's an example using the [NYC yellow taxi trip records open data set](https://docs.microsoft.com/azure/open-datasets/dataset-taxi-yellow):

```sql
--Query all files with .parquet extension in folders matching name pattern:
SELECT TOP 10 *
FROM OPENROWSET(
 BULK 'taxi/year=*/month=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiDemoDataSource',--You need to create the data source first
 FORMAT = 'parquet'
) AS filerows
```

When you're querying multiple files or folders, all files accessed with a single OPENROWSET must have the same structure, that is, the same number of columns and the same data types. Folders can't be traversed recursively.

### Schema inference

Automatic schema inference helps you quickly write queries and explore data without knowing the file schemas, as seen in the previous sample scripts.

The cost of this convenience is that inferred data types may be larger than the actual data types, which affects query performance. This happens when there isn't enough information in the source files to make sure the appropriate data type is used. For example, parquet files don't contain metadata about maximum character column length, so the instance infers it as varchar(8000).

> [!NOTE]
> Schema inference works only with files in the parquet format.

You can use the sp_describe_first_result_set stored procedure to check the resulting data types of your query:

```sql
EXEC sp_describe_first_result_set N'
 SELECT
  vendor_id, pickup_datetime, passenger_count
 FROM
  OPENROWSET(
   BULK ''taxi/*/*/*'',
   DATA_SOURCE = ''NYCTaxiDemoDataSource'',
   FORMAT=''parquet''
  ) AS nyc';
```

Once you know the data types, you can specify them using the WITH clause to improve performance:

```sql
SELECT TOP 100
 vendor_id, pickup_datetime, passenger_count
FROM
 OPENROWSET(
  BULK 'taxi/*/*/*',
  DATA_SOURCE = 'NYCTaxiDemoDataSource',
  FORMAT='PARQUET'
 )
WITH (
 vendor_id varchar(4), -- we're using length of 4 instead of the inferred 8000
 pickup_datetime datetime2,
 passenger_count int
) AS nyc;
```

For CSV files, the schema can't be automatically determined, and you need to explicitly specify the columns using the WITH clause:

```sql
SELECT TOP 10 *
FROM OPENROWSET(
 BULK 'population/population.csv',
 DATA_SOURCE = 'PopulationDemoDataSourceCSV',
 FORMAT = 'CSV')
WITH (
 [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
 [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
 [year] smallint,
 [population] bigint
) AS filerows
```

### File metadata functions

When querying multiple files or folders, you can use the filepath and filename functions to read file metadata, and get the part of the path, or the full path and name, of the file that a row in the result set originates from:

```sql
--Query all files and project file path and file name information for each row:
SELECT TOP 10 filerows.filepath(1) as [Year_Folder], filerows.filepath(2) as [Month_Folder],
 filerows.filename() as [File_name], filerows.filepath() as [Full_Path], *
FROM OPENROWSET(
 BULK 'taxi/year=*/month=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiDemoDataSource',
 FORMAT = 'parquet') AS filerows

--List all paths:
SELECT DISTINCT filerows.filepath(1) as [Year_Folder], filerows.filepath(2) as [Month_Folder]
FROM OPENROWSET(
 BULK 'taxi/year=*/month=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiDemoDataSource',
 FORMAT = 'parquet') AS filerows
```

When called without a parameter, the filepath function returns the file path that the row originates from. When DATA_SOURCE is used in OPENROWSET, it returns the path relative to the DATA_SOURCE; otherwise, it returns the full file path.

When called with a parameter, it returns the part of the path that matches the wildcard at the position specified by the parameter. For example, parameter value 1 returns the part of the path that matches the first wildcard.

The filepath function can also be used for filtering and aggregating rows:

```sql
SELECT
 r.filepath() AS filepath
 ,r.filepath(1) AS [year]
 ,r.filepath(2) AS [month]
 ,COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
 BULK 'taxi/year=*/month=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiDemoDataSource',
 FORMAT = 'parquet'
) AS r
WHERE
 r.filepath(1) IN ('2017')
 AND r.filepath(2) IN ('10', '11', '12')
GROUP BY
 r.filepath()
 ,r.filepath(1)
 ,r.filepath(2)
ORDER BY
 filepath;
```

### Creating view on top of OPENROWSET

You can create and use views to wrap OPENROWSET queries, so that the underlying query is easy to reuse:

```sql
CREATE VIEW TaxiRides AS
SELECT *
FROM OPENROWSET(
 BULK 'taxi/year=*/month=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiDemoDataSource',
 FORMAT = 'parquet'
) AS filerows
```

It's also convenient to add columns with file location data to a view, using the filepath function, for easier and more performant filtering. Doing so can reduce the number of files and the amount of data that a query on top of the view needs to read and process when it filters by any of those columns:

```sql
CREATE VIEW TaxiRides AS
SELECT *
 ,filerows.filepath(1) AS [year]
 ,filerows.filepath(2) AS [month]
FROM OPENROWSET(
 BULK 'taxi/year=*/month=*/*.parquet',
 DATA_SOURCE = 'NYCTaxiDemoDataSource',
 FORMAT = 'parquet'
) AS filerows
```

Views also enable reporting and analytics tools like Power BI to consume the results of OPENROWSET.

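For example, assuming the second TaxiRides view defined above, which exposes the [year] and [month] path columns, a report query can filter on those columns directly (note that filepath returns string values):

```sql
--Count rides per month in late 2017, reading only the matching files:
SELECT [year], [month], COUNT_BIG(*) AS rides
FROM TaxiRides
WHERE [year] = '2017' AND [month] IN ('10', '11', '12')
GROUP BY [year], [month];
```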
## External tables

External tables encapsulate access to the files, making the querying experience almost identical to querying local relational data stored in user tables. Creating an external table requires external data source and external file format objects to exist:

```sql
--Create external file format
CREATE EXTERNAL FILE FORMAT DemoFileFormat
WITH (
	FORMAT_TYPE=PARQUET
)
GO

--Create external table:
CREATE EXTERNAL TABLE tbl_TaxiRides(
 vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
 pickup_datetime DATETIME2,
 dropoff_datetime DATETIME2,
 passenger_count INT,
 trip_distance FLOAT,
 fare_amount FLOAT,
 extra FLOAT,
 mta_tax FLOAT,
 tip_amount FLOAT,
 tolls_amount FLOAT,
 improvement_surcharge FLOAT,
 total_amount FLOAT
)
WITH (
	LOCATION = 'taxi/year=*/month=*/*.parquet',
	DATA_SOURCE = DemoDataSource,
	FILE_FORMAT = DemoFileFormat
);
GO
```

Once the external table is created, you can query it just like any other table:

```sql
SELECT TOP 10 *
FROM tbl_TaxiRides
```

Just like OPENROWSET, external tables allow querying multiple files and folders by using wildcards. Schema inference and the filepath/filename functions aren't supported with external tables.

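Because the external table behaves like a regular table, the usual relational constructs such as aggregations apply. A sketch using the columns defined on tbl_TaxiRides above:

```sql
--Average total fare by party size, excluding zero-distance rides:
SELECT passenger_count, AVG(total_amount) AS avg_total_amount
FROM tbl_TaxiRides
WHERE trip_distance > 0
GROUP BY passenger_count
ORDER BY passenger_count;
```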
## Performance considerations

There's no hard limit on the number of files or the amount of data that can be queried, but query performance depends on the amount of data, the data format, and the complexity of queries and joins.

Collecting statistics on your external data is one of the most important things you can do for query optimization. The more the instance knows about your data, the faster it can execute queries. Automatic creation of statistics isn't supported, but you can and should create statistics manually.

### OPENROWSET statistics

Single-column statistics for an OPENROWSET path can be created using the sp_create_openrowset_statistics stored procedure, by passing the SELECT query with a single column as a parameter:

```sql
EXEC sys.sp_create_openrowset_statistics N'
 SELECT pickup_datetime
 FROM OPENROWSET(
  BULK ''abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/*.parquet'',
  FORMAT = ''parquet'') AS filerows
'
```

By default, the instance uses 100% of the data provided in the dataset for creating statistics. You can optionally specify a sample size as a percentage using the TABLESAMPLE option. To create single-column statistics for multiple columns, execute the stored procedure for each of the columns. You can't create multi-column statistics for an OPENROWSET path.

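For example, a sketch of sampled statistics creation, reusing the query from the previous example. The placement of TABLESAMPLE after the query's table alias is an assumption based on T-SQL sampling syntax in other SQL products, so verify it against your instance:

```sql
--Build statistics from a 5 percent sample instead of the full dataset:
EXEC sys.sp_create_openrowset_statistics N'
 SELECT pickup_datetime
 FROM OPENROWSET(
  BULK ''abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/*.parquet'',
  FORMAT = ''parquet'') AS filerows
 TABLESAMPLE (5 PERCENT)
'
```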
To update existing statistics, drop them first using the sp_drop_openrowset_statistics stored procedure, and then recreate them:

```sql
EXEC sys.sp_drop_openrowset_statistics N'
 SELECT pickup_datetime
 FROM OPENROWSET(
  BULK ''abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/bing_covid-19_data/latest/*.parquet'',
  FORMAT = ''parquet'') AS filerows
'
```

### External table statistics

The syntax for creating statistics on external tables resembles the one used for ordinary user tables. To create statistics on a column, provide a name for the statistics object and the name of the column:

```sql
CREATE STATISTICS sVendor
ON tbl_TaxiRides (vendor_id)
WITH FULLSCAN, NORECOMPUTE
```

The WITH options are mandatory, and for the sample size, the allowed options are FULLSCAN and SAMPLE n percent. To create single-column statistics for multiple columns, repeat the CREATE STATISTICS statement for each of the columns. You can't create multi-column statistics.

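For example, a sketch of sampled statistics on another column of tbl_TaxiRides (the statistics object name sPassengerCount is arbitrary):

```sql
--Sampled statistics using SAMPLE n PERCENT instead of FULLSCAN:
CREATE STATISTICS sPassengerCount
ON tbl_TaxiRides (passenger_count)
WITH SAMPLE 5 PERCENT, NORECOMPUTE
```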
## Next steps

- To learn more about the syntax options available with OPENROWSET, see [OPENROWSET T-SQL](https://docs.microsoft.com/sql/t-sql/functions/openrowset-transact-sql).
- For more information about creating an external table in SQL Managed Instance, see [CREATE EXTERNAL TABLE](https://docs.microsoft.com/sql/t-sql/statements/create-external-table-transact-sql).
- To learn more about creating an external file format, see [CREATE EXTERNAL FILE FORMAT](https://docs.microsoft.com/sql/t-sql/statements/create-external-file-format-transact-sql).

 (0)