Commit 76b23ba

Merge pull request #7214 from rothja/overviewwork99
Reworking overview for SQL Big Data Clusters
2 parents 623fed9 + 0a59c49 commit 76b23ba

6 files changed

Lines changed: 58 additions & 7 deletions

File tree

docs/big-data-cluster/big-data-cluster-overview.md

@@ -1,5 +1,5 @@
 ---
-title: What is SQL Server Big Data Cluster? | Microsoft Docs
+title: What is SQL Server 2019 Big Data Clusters? | Microsoft Docs
 description:
 author: rothja
 ms.author: jroth
@@ -9,19 +9,70 @@ ms.topic: overview
 ms.prod: sql
 ---

-# What is SQL Server Big Data Cluster?
+# What is SQL Server 2019 Big Data Clusters?

-[!INCLUDE[SQL Server 2019 CTP 2.0](../includes/sssqlv15-md.md)] CTP 2.0 enables you to integrate your "high-value" relational data in SQL Server with your "high-volume" data in big data environments, such as Hadoop.
+Starting with [!INCLUDE[SQL Server 2019](../includes/sssqlv15-md.md)], SQL Server Big Data Clusters allow you to deploy scalable clusters of SQL Server containers on Kubernetes. These containers read, write, and process big data from Transact-SQL, so you can easily combine your high-value relational data with high-volume big data in the same query.
+
+## Big data scenarios
+
+SQL Server Big Data Clusters enable the following scenarios:
+
+### Data virtualization
+
+By leveraging [SQL Server PolyBase](../relational-databases/polybase/polybase-guide.md), SQL Big Data Clusters can query external data sources without importing the data. [!INCLUDE[SQL Server 2019](../includes/sssqlv15-md.md)] introduces new connectors to data sources.
+
+![Data virtualization](media/big-data-cluster-overview/data-virtualization.png)
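
The data virtualization scenario above might be sketched with PolyBase T-SQL along these lines; every name below (the Oracle endpoint, credential, and all table and column names) is a hypothetical placeholder for illustration, not taken from this commit:

```sql
-- Hedged sketch: MyOracleSource, MyOracleCredential, and the table
-- names are hypothetical placeholders.
CREATE EXTERNAL DATA SOURCE MyOracleSource
WITH (
    LOCATION = 'oracle://oracle-host:1521',  -- assumed Oracle connector endpoint
    CREDENTIAL = MyOracleCredential          -- a pre-created database scoped credential
);

-- Expose a remote Oracle table without importing its data.
CREATE EXTERNAL TABLE dbo.OracleSales (
    SalesID INT,
    Amount  DECIMAL(18, 2)
)
WITH (LOCATION = 'ORCL.SALES.ORDERS', DATA_SOURCE = MyOracleSource);

-- Combine high-value relational data with the external data in one query.
SELECT c.CustomerName, SUM(s.Amount) AS Total
FROM dbo.Customers AS c
JOIN dbo.OracleSales AS s ON s.SalesID = c.CustomerID
GROUP BY c.CustomerName;
```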
+
+### Data lake
+
+A SQL Big Data Cluster includes a scalable HDFS [storage pool](concept-storage-pool.md). This can be used to directly store big data, potentially ingested from multiple external sources. Once in the Big Data Cluster, you can analyze and query the data and combine it with your high-value relational data.
+
+![Data lake](media/big-data-cluster-overview/data-lake.png)
+
+### Scale-out data mart
+
+SQL Big Data Clusters provide scale-out compute and storage to improve the performance of analyzing big data. Data from a variety of sources can be ingested and distributed across [data pool](concept-data-pool.md) nodes for further analysis.
+
+![Data mart](media/big-data-cluster-overview/data-mart.png)
+
+### Integrated AI and Machine Learning
+
+SQL Big Data Clusters enable AI and machine learning tasks on the data stored in the HDFS storage pools and the data pools. You can use Spark as well as built-in AI tools in SQL Server, using R, Python, or Java.
+
+![AI and ML](media/big-data-cluster-overview/ai-ml-spark.png)
+
+### Management and Monitoring
+
+Management and monitoring are provided through a combination of open-source components, SQL Server tools, and Dynamic Management Views.
+
+The cluster Admin portal is a web interface that displays the status and health of the pods in the cluster. It also provides links to other dashboards provided by Kubernetes, Grafana, or Kibana.
+
+You can use Azure Data Studio to perform a variety of tasks on the Big Data Cluster, enabled by a new Scale-out Data Management extension. This extension provides:
+
+- Built-in snippets for common management tasks.
+- Reports on the number of compute pools and the status of running jobs.
+- Reports on the status of HDFS and Spark jobs.
+- Ability to browse HDFS, upload files, preview files, and create directories.
+- Ability to create, open, and run Jupyter-compatible notebooks.
+- Data virtualization wizard to simplify the creation of external data sources.
 
 ## Architecture
 
-CTP 2.0 allows you to create and deploy a *data pool* that consists of many SQL Server *data pool instances* in your cluster. You can then ingest your high-volume data from HDFS via Spark streaming jobs into the SQL Server data pool instances by partitioning the data and spreading the partitions across the SQL Server data pool instances in the data pool.
+A SQL Big Data Cluster is a cluster of Linux nodes orchestrated by Kubernetes. Nodes in the cluster are arranged into three logical planes: the control plane, the compute plane, and the data plane. Each plane has different responsibilities in the cluster. Every Kubernetes node in a SQL Big Data Cluster is a member of at least one plane.
+
+![Architecture overview](media/big-data-cluster-overview/architecture-diagram-planes.png)
+
+### Control plane
+
+The control plane provides management and [security](concept-security.md) for the cluster. It contains the Kubernetes master, the [SQL Server master instance](concept-master-instance.md), and other cluster-level services such as the Hive Metastore and Spark Driver.
+
+### Compute plane
 
-Once the high-volume data is stored in partitions in the SQL Server data pool instances on the cluster, you can create an *external table* in the SQL Server *master instance* that represents the high-volume data that resides in the partitions stored in the SQL Server data pool instances in your cluster. This external table can be queried in the master instance just like any other table, but in this case a fan-out query will be simultaneously executed against each of the SQL Server data pool instances to query the partitioned data. This fan-out query runs the filter part of the query and local aggregations in parallel across all of the data pool instances. The results of these queries will be brought back to the master instance and you can optionally choose to join the results of the high-volume data fan-out query with the results of a high-value data query in the SQL Server master instance.
+The compute plane provides computational resources to the cluster. It contains nodes running SQL Server on Linux pods. The pods in the compute plane are divided into [compute pools](concept-compute-pool.md) for specific processing tasks. A compute pool can act as a [PolyBase](../relational-databases/polybase/polybase-guide.md) scale-out group for distributed queries over different data sources, such as Oracle, MongoDB, or Teradata. It can also be configured to cache data returned from external queries.
 
-The following diagram shows the eventual state of the Big Data Cluster architecture:
+### Data plane
 
-![Architecture diagram](./media/big-data-cluster-overview/architecture-diagram.png)
+The data plane is used for data persistence and caching. It contains the SQL data pool and the storage nodes. The SQL [data pool](concept-data-pool.md) consists of one or more nodes running SQL Server on Linux and is used to cache data returned by Spark jobs. SQL Big Data Cluster data marts are persisted in the data pool. The [storage pool](concept-storage-pool.md) consists of storage nodes composed of SQL Server on Linux, Spark, and HDFS. All the storage nodes in a SQL Big Data Cluster are members of an HDFS cluster.
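
As a rough sketch of how HDFS-resident data in the storage pool might be surfaced to Transact-SQL, the following uses a hypothetical built-in data source name (`SqlStoragePool`), path, and column names; none of these specifics come from this commit:

```sql
-- Hedged sketch; the file format, path, and column names are hypothetical.
CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- SqlStoragePool is assumed here to name the cluster's HDFS storage pool.
CREATE EXTERNAL TABLE dbo.Clickstream (
    UserId BIGINT,
    Url    NVARCHAR(400)
)
WITH (
    DATA_SOURCE = SqlStoragePool,
    LOCATION = '/clickstream',
    FILE_FORMAT = csv_file
);

-- Join HDFS-resident big data with relational data in the master instance.
SELECT u.UserName, c.Url
FROM dbo.Users AS u
JOIN dbo.Clickstream AS c ON c.UserId = u.UserId;
```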
 
 ## Next steps
 
