| title | Get started with PolyBase | Microsoft Docs | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| ms.custom | |||||||||
| ms.date | 08/15/2017 | ||||||||
| ms.prod | sql | ||||||||
| ms.prod_service | database-engine | ||||||||
| ms.service | |||||||||
| ms.component | polybase | ||||||||
| ms.reviewer | |||||||||
| ms.suite | sql | ||||||||
| ms.technology |
|
||||||||
| ms.tgt_pltfrm | |||||||||
| ms.topic | get-started-article | ||||||||
| helpviewer_keywords |
|
||||||||
| caps.latest.revision | 78 | ||||||||
| author | barbkess | ||||||||
| ms.author | barbkess | ||||||||
| manager | craigg | ||||||||
| ms.workload | On Demand |
[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md]
This topic contains the basics about running PolyBase on a SQL Server instance.
After running the steps below, you will have:
-
PolyBase installed and runnable on your server
-
Examples of statements that create PolyBase objects
-
An understanding of how to manage PolyBase objects in SQL Server Management Studio (SSMS)
-
Examples of queries using PolyBase objects
If you haven't installed PolyBase, see PolyBase installation. The installation article explains the prerequisites.
After installation, run the following command to confirm that PolyBase has been successfully installed. If PolyBase is installed, returns 1; otherwise, 0.
SELECT SERVERPROPERTY ('IsPolybaseInstalled') AS IsPolybaseInstalled; After installing, you must configure SQL Server to use either your Hadoop version or Azure Blob Storage. PolyBase supports two Hadoop providers, Hortonworks Data Platform (HDP) and Cloudera Distributed Hadoop (CDH). The supported external data sources include:
-
Hortonworks HDP 1.3 on Linux/Windows Server
-
Hortonworks HDP 2.1 – 2.6 on Linux
-
Hortonworks HDP 2.1 - 2.3 on Windows Server
-
Cloudera CDH 4.3 on Linux
-
Cloudera CDH 5.1 – 5.5, 5.9 - 5.13 on Linux
-
Azure Blob Storage
Hadoop follows the "Major.Minor.Version" pattern for its new releases. All versions within a supported Major and Minor release are supported.
Note
Azure Data Lake Store connectivity is only supported in Azure SQL Data Warehouse.
-
Run sp_configure (Transact-SQL) ‘hadoop connectivity’ and set an appropriate value. By Default, the hadoop connectivity is set to 7. To find the value, see PolyBase Connectivity Configuration (Transact-SQL).
-- Values map to various external data sources. -- Example: value 7 stands for Azure blob storage and Hortonworks HDP 2.3 on Linux. sp_configure @configname = 'hadoop connectivity', @configvalue = 7; GO RECONFIGURE GO
-
You must restart SQL Server using services.msc. Restarting SQL Server restarts these services:
-
SQL Server PolyBase Data Movement Service
-
SQL Server PolyBase Engine
-
To improve query performance, enable pushdown computation to a Hadoop cluster:
-
Find the file yarn-site.xml in the installation path of SQL Server. Typically, the path is:
C:Program FilesMicrosoft SQL ServerMSSQL13.MSSQLSERVERMSSQLBinnPolybaseHadoopconf -
On the Hadoop machine, find the analogous file in the Hadoop configuration directory. In the file, find and copy the value of the configuration key yarn.application.classpath.
-
On the SQL Server machine, in the yarn.site.xml file, find the yarn.application.classpath property. Paste the value from the Hadoop machine into the value element.
-
For all CDH 5.X versions, you will need to add the mapreduce.application.classpath configuration parameters either to the end of your yarn.site.xml file or into the mapred-site.xml file. HortonWorks includes these configurations within the yarn.application.classpath configurations. See PolyBase configuration for examples.
The PolyBase group feature allows you to create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance.
-
Install SQL Server with PolyBase on multiple machines.
-
Select one SQL Server as head node.
-
Add other instances as compute nodes by running sp_polybase_join_group.
-- Enter head node details: -- head node machine name, head node dms control channel port, head node sql server name EXEC sp_polybase_join_group 'PQTH4A-CMP01', 16450, 'MSSQLSERVER'; -
Restart the PolyBase Data Movement Service on the compute nodes.
For details, see PolyBase scale-out groups.
Create objects depending on the external data source, either Hadoop or Azure Storage.
-- 1: Create a database scoped credential.
-- Create a master key on the database. This is required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';
-- 2: Create a database scoped credential for Kerberos-secured Hadoop clusters.
-- IDENTITY: the Kerberos user name.
-- SECRET: the Kerberos password
CREATE DATABASE SCOPED CREDENTIAL HadoopUser1
WITH IDENTITY = '<hadoop_user_name>', Secret = '<hadoop_password>';
-- 3: Create an external data source.
-- LOCATION (Required) : Hadoop Name Node IP address and port.
-- RESOURCE MANAGER LOCATION (Optional): Hadoop Resource Manager location to enable pushdown computation.
-- CREDENTIAL (Optional): the database scoped credential, created above.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
TYPE = HADOOP,
LOCATION ='hdfs://10.xxx.xx.xxx:xxxx',
RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
CREDENTIAL = HadoopUser1
);
-- 4: Create an external file format.
-- FORMAT TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR ='|',
USE_TYPE_DEFAULT = TRUE)
-- 5: Create an external table pointing to data stored in Hadoop.
-- LOCATION: path to file or directory that contains the data (relative to HDFS root).
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL,
[CustomerKey] int NOT NULL,
[GeographyKey] int NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/',
DATA_SOURCE = MyHadoopCluster,
FILE_FORMAT = TextFileFormat
);
-- 6: Create statistics on an external table.
CREATE STATISTICS StatsForSensors on CarSensor_Data(CustomerKey, Speed)
--1: Create a master key on the database.
-- Required to encrypt the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';
-- Create a database scoped credential for Azure blob storage.
-- IDENTITY: any string (this is not used for authentication to Azure storage).
-- SECRET: your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', Secret = '<azure_storage_account_key>';
--2: Create an external data source.
-- LOCATION: Azure account storage account name and blob container name.
-- CREDENTIAL: The database scoped credential created above.
CREATE EXTERNAL DATA SOURCE AzureStorage with (
TYPE = HADOOP,
LOCATION ='wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
--3: Create an external file format.
-- FORMAT TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR ='|',
USE_TYPE_DEFAULT = TRUE
)
);
--4: Create an external table.
-- The external table points to data stored in Azure storage.
-- LOCATION: path to a file or directory that contains the data (relative to the blob container).
-- To point to all files under the blob container, use LOCATION='/'
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL,
[CustomerKey] int NOT NULL,
[GeographyKey] int NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/',
DATA_SOURCE = AzureStorage,
FILE_FORMAT = TextFileFormat
);
--5: Create statistics on an external table.
CREATE STATISTICS StatsForSensors on CarSensor_Data(CustomerKey, Speed)
There are three functions that PolyBase is suited for:
-
ad-hoc queries against external tables.
-
importing data.
-
exporting data.
-
Ad-hoc queries
-- PolyBase Scenario 1: Ad-Hoc Query joining relational with Hadoop data -- Select customers who drive faster than 35 mph: joining structured customer data stored -- in SQL Server with car sensor data stored in Hadoop. SELECT DISTINCT Insured_Customers.FirstName,Insured_Customers.LastName, Insured_Customers. YearlyIncome, CarSensor_Data.Speed FROM Insured_Customers, CarSensor_Data WHERE Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey and CarSensor_Data.Speed > 35 ORDER BY CarSensor_Data.Speed DESC OPTION (FORCE EXTERNALPUSHDOWN); -- or OPTION (DISABLE EXTERNALPUSHDOWN)
-
Importing data
-- PolyBase Scenario 2: Import external data into SQL Server. -- Import data for fast drivers into SQL Server to do more in-depth analysis and -- leverage Columnstore technology. SELECT DISTINCT Insured_Customers.FirstName, Insured_Customers.LastName, Insured_Customers.YearlyIncome, Insured_Customers.MaritalStatus INTO Fast_Customers from Insured_Customers INNER JOIN ( SELECT * FROM CarSensor_Data where Speed > 35 ) AS SensorD ON Insured_Customers.CustomerKey = SensorD.CustomerKey ORDER BY YearlyIncome CREATE CLUSTERED COLUMNSTORE INDEX CCI_FastCustomers ON Fast_Customers;
-
Exporting data
-- PolyBase Scenario 3: Export data from SQL Server to Hadoop. -- Enable INSERT into external table sp_configure ‘allow polybase export’, 1; reconfigure -- Create an external table. CREATE EXTERNAL TABLE [dbo].[FastCustomers2009] ( [FirstName] char(25) NOT NULL, [LastName] char(25) NOT NULL, [YearlyIncome] float NULL, [MaritalStatus] char(1) NOT NULL ) WITH ( LOCATION='/old_data/2009/customerdata', DATA_SOURCE = HadoopHDP2, FILE_FORMAT = TextFileFormat, REJECT_TYPE = VALUE, REJECT_VALUE = 0 ); -- Export data: Move old data to Hadoop while keeping it query-able via an external table. INSERT INTO dbo.FastCustomer2009 SELECT T.* FROM Insured_Customers T1 JOIN CarSensor_Data T2 ON (T1.CustomerKey = T2.CustomerKey) WHERE T2.YearMeasured = 2009 and T2.Speed > 40;
In SSMS, external tables are displayed in a separate folder External Tables. External data sources and external file formats are in subfolders under External Resources.
Use DMVs to troubleshoot performance and queries. For details, see PolyBase troubleshooting.
After upgrading from SQL Server 2016 RC1 to RC2 or RC3, queries may fail. For details and a remedy, see SQL Server 2016 Release Notes and search for "PolyBase."
To understand the scale-out feature, see PolyBase scale-out groups. To monitor PolyBase, see PolyBase troubleshooting. To troubleshoot PolyBase performance, see PolyBase troubleshooting with dynamic management views.
PolyBase Guide
PolyBase scale-out groups
PolyBase stored procedures
CREATE EXTERNAL DATA SOURCE (Transact-SQL)
CREATE EXTERNAL FILE FORMAT (Transact-SQL)
CREATE EXTERNAL TABLE (Transact-SQL)

