title	Use sparklyr from RStudio
titleSuffix	SQL Server big data clusters
description	Connect to a big data cluster using sparklyr from RStudio.
author	jejiang
ms.author	jejiang
ms.reviewer	mikeray
ms.date	11/04/2019
ms.topic	conceptual
ms.prod	sql
ms.technology	big-data-cluster

Use sparklyr in a SQL Server big data cluster

Applies to: SQL Server 2019 (15.x)

The sparklyr package provides an R interface for Apache Spark and is a popular way for R developers to use Spark. This article describes how to use sparklyr in a SQL Server 2019 big data cluster from RStudio.

Prerequisites

Deploy a SQL Server 2019 big data cluster.

Install RStudio Desktop

Install and configure RStudio Desktop with the following steps:

1. If you are running on a Windows client, download and install R 3.4.4.
2. Download and install RStudio Desktop.

After installation completes, run the following commands in RStudio Desktop to install the required packages:

install.packages("DBI", repos = "https://cran.microsoft.com/snapshot/2019-01-01")
install.packages("dplyr", repos = "https://cran.microsoft.com/snapshot/2019-01-01")
install.packages("sparklyr", repos = "https://cran.microsoft.com/snapshot/2019-01-01")
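Before moving on, it can be useful to confirm that the three packages are actually available in your R library. The following is a minimal sketch using only base R; the helper name `missing_packages` is hypothetical and is not part of sparklyr or any other package.

```r
# Hypothetical helper (illustration only): return the names of the
# packages in `pkgs` that cannot be loaded from the current library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("DBI", "dplyr", "sparklyr")
missing <- missing_packages(required)

if (length(missing) > 0) {
  message("Still missing: ", paste(missing, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If any package is reported as missing, rerun the corresponding install.packages command and check its output for errors.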

Connect to Spark in a big data cluster

You can use sparklyr to connect from a client to the big data cluster using Livy and the HDFS/Spark gateway.

In RStudio, create an R script and connect to Spark as in the following example:

Tip

For the <AZDATA_USERNAME> and <AZDATA_PASSWORD> values, use the username (such as root) and password you set during the big data cluster deployment. For the <IP> and <PORT> values, see the documentation on connecting to a big data cluster.

library(sparklyr)
library(dplyr)
library(DBI)

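# Specify the Knox (gateway) username and password
config <- livy_config(username = "<AZDATA_USERNAME>", password = "<AZDATA_PASSWORD>")

# Turn off SSL certificate and host name verification for HTTPS requests made through httr
httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L))

# Connect to Spark through the Livy endpoint exposed by the gateway
sc <- spark_connect(master = "https://<IP>:<PORT>/gateway/default/livy/v1",
                    method = "livy",
                    config = config)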
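To keep credentials out of the script, the same connection can be parameterized from environment variables. The following is a minimal sketch, not the documented method: the helpers `livy_master_url` and `connect_from_env` and the variable names `BDC_GATEWAY_IP` and `BDC_GATEWAY_PORT` are hypothetical, and `AZDATA_USERNAME` and `AZDATA_PASSWORD` are assumed to be set in your environment.

```r
# Hypothetical helper: build the Livy endpoint URL in the same form
# as the master argument in the example above.
livy_master_url <- function(ip, port) {
  sprintf("https://%s:%s/gateway/default/livy/v1", ip, port)
}

# Hypothetical helper: read the connection settings from environment
# variables (the variable names are assumptions) and open a Livy connection.
connect_from_env <- function() {
  config <- sparklyr::livy_config(
    username = Sys.getenv("AZDATA_USERNAME"),
    password = Sys.getenv("AZDATA_PASSWORD")
  )
  sparklyr::spark_connect(
    master = livy_master_url(Sys.getenv("BDC_GATEWAY_IP"),
                             Sys.getenv("BDC_GATEWAY_PORT")),
    method = "livy",
    config = config
  )
}

# Uncomment once the environment variables are set:
# sc <- connect_from_env()
```

This sketch leaves out the httr::set_config call from the example above; keep that call if your client does not trust the gateway certificate.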

Run sparklyr queries

After connecting to Spark, you can run sparklyr queries. The following example copies the built-in iris dataset to Spark and counts its rows with a SQL query:

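# Copy the local iris data frame to Spark; the table is registered as "iris"
iris_tbl <- copy_to(sc, iris)

# Count the rows with Spark SQL through the DBI interface
iris_count <- dbGetQuery(sc, "SELECT COUNT(*) FROM iris")

# Print the result
iris_count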
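The same kind of query can also be written with dplyr verbs, which sparklyr translates to Spark SQL. The following sketch counts rows per species; the helper name `count_by_species` is hypothetical, and the Spark call assumes the `iris_tbl` table created in the example above.

```r
library(dplyr)

# Hypothetical helper: count rows per species. It accepts any table with
# a Species column, including the Spark table iris_tbl created above.
count_by_species <- function(tbl) {
  tbl %>%
    group_by(Species) %>%
    summarise(n = n()) %>%
    collect()
}

# With a Spark connection:
# count_by_species(iris_tbl)

# Without a cluster, the same function runs on the local data frame:
count_by_species(iris)
```

Calling collect() brings the aggregated result back into the local R session, so only the small summary travels over the Livy connection rather than the full table.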

Distributed R computations

One feature of sparklyr is the ability to distribute R computations with spark_apply.

Because big data clusters use Livy connections, you must set packages = FALSE in the call to spark_apply. For more information, see the Livy section of the sparklyr documentation on distributed R computations. With this setting, the R code passed to spark_apply can use only the R packages that are already installed on your Spark cluster. The following example demonstrates this functionality:

iris_tbl %>% spark_apply(function(e) nrow(e), names = "nrow", group_by = "Species", packages = FALSE)
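It can help to check the function locally before passing it to spark_apply. The following sketch uses only base R to imitate the group_by = "Species" behavior on the local iris data frame. It is an approximation for illustration, not a description of how sparklyr runs the code on the cluster, and the name `count_rows` is hypothetical.

```r
# The function that would be passed to spark_apply(): it returns the
# row count of the data frame it receives.
count_rows <- function(e) nrow(e)

# Imitate group_by = "Species": split the local data frame by species
# and apply the function to each piece.
local_result <- vapply(split(iris, iris$Species), count_rows, integer(1))

local_result
```

Each of the three species in the local iris data frame has 50 rows, so the grouped result from spark_apply should report the same count per species.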

Next steps

For more information about big data clusters, see [What are SQL Server 2019 big data clusters?](big-data-cluster-overview.md).
