Commit 60a70e8

Merge branch 'master' of https://github.com/MicrosoftDocs/sql-docs-pr into us1613566a
2 parents a08c148 + 8344911 commit 60a70e8

29 files changed

Lines changed: 194 additions & 170 deletions

.openpublishing.redirection.json

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
 {
   "redirections": [
+    {
+      "source_path": "docs/relational-databases/polybase/data-virtualization-csv.md",
+      "redirect_url": "/sql/big-data-cluster/data-virtualization-csv",
+      "redirect_document_id": true
+    },
     {
       "source_path": "docs/sql-server/sql-server-help-installation.md",
       "redirect_url": "/sql/sql-server/sql-server-offline-documentation",
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
---
title: Virtualize CSV data from storage pool
subtitle: SQL Server Big Data Clusters
description: Steps to create an external table that virtualizes a CSV file in a Big Data Cluster
author: MikeRayMSFT
ms.author: mikeray
ms.reviewer: mikeray
ms.date: 04/24/2020
ms.topic: conceptual
ms.prod: sql
ms.technology: polybase
monikerRange: ">= sql-server-ver15 || = sqlallproducts-allversions"
ms.metadata: seo-lt-2019
---

# Virtualize CSV data from storage pool (Big Data Clusters)

SQL Server Big Data Clusters can virtualize data from CSV files in HDFS. The data stays in its original location, but you can query it from a SQL Server instance like any other table. This feature uses PolyBase connectors and minimizes the need for ETL processes. For more information on data virtualization, see [What is PolyBase?](../relational-databases/polybase/polybase-guide.md)

## Prerequisites

- [A deployed big data cluster](deployment-guidance.md)
- [Azure Data Studio](../azure-data-studio/download-azure-data-studio.md)

## Select or upload a CSV file for data virtualization

In Azure Data Studio (ADS), [connect to the SQL Server master instance](connect-to-big-data-cluster.md#master) of your Big Data Cluster. Once connected, expand the HDFS elements in the object explorer to locate the CSV file or files that you want to virtualize.

For the purposes of this tutorial, create a new directory named **Data**.

1. Right-click the HDFS root directory to open its context menu.
2. Click **New directory**.
3. Name the new directory *Data*.

Upload sample data. For a simple walkthrough, you can use a sample CSV data file. This article uses airline delay cause data from the [US Department of Transportation](https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1). Download the raw data, and extract it to your computer. Name the file *airline_delay_causes.csv*.

To upload the sample file after you extract it:

1. In Azure Data Studio, right-click the new directory you created.
2. Click **Upload files**.

![Example CSV file in HDFS](media/data-virtualization/100-csv-sample-file-hdfs.png)

Azure Data Studio uploads the files to HDFS on the Big Data Cluster.

## Create the storage pool external data source in your target database

The storage pool external data source is not created by default in the databases of your Big Data Cluster. Before you can create the external table, create the default **SqlStoragePool** external data source in your target database with the following Transact-SQL query. Make sure you first change the context of the query to your target database.

```sql
-- Create the default storage pool source for SQL Big Data Cluster
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');
```
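
To confirm that the data source exists in the right database, you can switch context and query the same catalog view that the check above uses. This is a minimal sketch; the database name `SalesDb` is a hypothetical placeholder, not something this article defines.

```sql
-- Hypothetical database name; replace it with your target database.
USE SalesDb;

-- Show the storage pool data source and the location it points to.
SELECT name, location
FROM sys.external_data_sources
WHERE name = 'SqlStoragePool';
```

If the query returns no rows, the data source was probably created in a different database than the one you intend to use.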

## Create the external table

From ADS, right-click the CSV file and select **Create External Table From CSV File** from the context menu. You can also create an external table from a directory in HDFS if the CSV files under the directory all follow the same schema. This virtualizes the data at the directory level, so you get a single result set over the combined data without processing individual files. Azure Data Studio guides you through the steps to create the external table.

Specify the database, the data source, a table name, the schema, and the name for the table's external file format.
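
For orientation, the external file format that the wizard asks you to name corresponds to a T-SQL object along the following lines. This is a sketch under stated assumptions, not the wizard's exact output: the name `csv_file` is hypothetical, and the options assume a comma-delimited file with double-quoted strings and a header row.

```sql
-- Hypothetical format name. The options assume a comma-delimited file
-- with double-quoted strings and one header row.
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'csv_file')
    CREATE EXTERNAL FILE FORMAT csv_file
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2  -- start reading at row 2 to skip the header
        )
    );
```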

Click **Next**.

## Preview Data

Azure Data Studio provides a preview of the imported data.

![Preview of the CSV data](media/data-virtualization/130-csv-preview-data.png)

When you finish viewing the preview, click **Next** to continue.

## Modify Columns

In the next window, you can modify the columns of the external table you intend to create. You can change each column's name and data type, and choose whether the column allows null values.

![Modify columns](media/data-virtualization/140-csv-modify-columns.png)

After you verify the destination columns, click **Next**.

## Summary

This step provides a summary of your selections. It shows the SQL Server name, database name, table name, table schema, and external table information. In this step, you can either generate a script or create the table. **Generate Script** creates a T-SQL script that creates the external table. **Create Table** creates the external table directly.

![Summary screen](media/data-virtualization/150-csv-virtualize-data-summary.png)

If you click **Create Table**, SQL Server creates the external table in the destination database.

If you click **Generate Script**, Azure Data Studio creates the T-SQL query for creating the external table.
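
A script for the sample file might resemble the following sketch. It is illustrative only and is not the script that Azure Data Studio generates: the table name, the shortened column list, and the file format name `csv_file` are all assumptions. A real definition has to list every column of your CSV file, in file order, with the types you chose in the wizard.

```sql
-- Illustrative only: table name, column list, and file format name
-- are assumptions, not the wizard's actual output.
CREATE EXTERNAL TABLE dbo.airline_delay_causes
(
    [year]       INT           NOT NULL,
    [month]      INT           NOT NULL,
    carrier      NVARCHAR(10)  NOT NULL,
    carrier_name NVARCHAR(100) NULL,
    airport      NVARCHAR(10)  NOT NULL,
    arr_flights  FLOAT         NULL,
    arr_delay    FLOAT         NULL
)
WITH (
    DATA_SOURCE = SqlStoragePool,                    -- data source created earlier
    LOCATION    = '/Data/airline_delay_causes.csv',  -- path of the file in HDFS
    FILE_FORMAT = csv_file                           -- hypothetical external file format
);
```

To virtualize a whole directory of files that share one schema, the same statement would point `LOCATION` at the directory (for example `'/Data'`) instead of a single file.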

Once created, the table can be queried directly using T-SQL from the SQL Server instance.
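
For example, queries like these read the CSV data in place. The table name `dbo.airline_delay_causes` and the columns `carrier` and `arr_delay` are assumptions for illustration; use the names you chose in the wizard.

```sql
-- Hypothetical table and column names; adjust them to your external table.

-- Peek at a few rows to confirm the table reads the file.
SELECT TOP (10) *
FROM dbo.airline_delay_causes;

-- Aggregate over the virtualized data like any other table.
SELECT carrier, SUM(arr_delay) AS total_arr_delay
FROM dbo.airline_delay_causes
GROUP BY carrier
ORDER BY total_arr_delay DESC;
```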

## Next steps

For more information on SQL Server Big Data Cluster and related scenarios, see [What is SQL Server Big Data Cluster?](big-data-cluster-overview.md).