title

View and summarize data using R (walkthrough)| Microsoft Docs

ms.date

11/10/2017

ms.reviewer

ms.suite

sql

ms.prod

machine-learning-services

ms.prod_service

machine-learning-services

ms.component

ms.technology

ms.tgt_pltfrm

ms.topic

tutorial

applies_to

SQL Server 2016

dev_langs

ms.assetid

358e1431-8f47-4d32-a02f-f90e519eef49

caps.latest.revision

author

jeannt

ms.author

jeannt

manager

cgronlund

ms.workload

On Demand

View and summarize data using R

[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md-winonly]

Now let's work with the same data using R code. In this lesson, you learn how to use the functions in the RevoScaleR package.

An R script is provided with this walkthrough that includes all the code needed to create the data object, generate summaries, and build models. The R script file, RSQL_RWalkthrough.R, can be found in the location where you installed the script files.

If you are experienced with R, you can run the script all at once.
For people learning to use RevoScaleR, this tutorial goes through the script line by line.
To run individual lines from the script, you can highlight a line or lines in the file and press Ctrl + ENTER.

Tip

Save your R workspace in case you want to complete the rest of the walkthrough later. That way the data objects and other variables are ready for re-use.

Define a SQL Server compute context

Microsoft R makes it easy to get data from [!INCLUDEssNoVersion] to use in your R code. The process is as follows:

Create a connection to a [!INCLUDEssNoVersion] instance
Define a query that has the data you need, or specify a table or view
Define one or more compute contexts to use when running R code
Optionally, define transformations that are applied to the data source while it is being read from the source

The following steps are all part of the R code and should be run in an R environment. We used Microsoft R Client, because it includes all the RevoScaleR packages, as well as a basic, lightweight set of R tools.

If the RevoScaleR package is not already loaded, run this line of R code:
```
library("RevoScaleR")
```
The quotation marks are optional, in this case, though recommended.

If you get an error, make sure that your R development environment is using a library that includes the RevoScaleR package. Use a command such as .libPaths() to view the current library path.
Create the connection string for SQL Server and save it in an R variable, connStr.
```
connStr <- "Driver=SQL Server;Server=your_server_name;Database=Your_Database_Name;Uid=Your_User_Name;Pwd=Your_Password"
```
For the server name, you might be able to use only the instance name, or you might need to fully qualify the name, depending on your network.

For Windows authentication, the syntax is a bit different:
```
connStr <- "Driver=SQL Server;Server=SQL_instance_name;Database=database_name;Trusted_Connection=Yes"
```
The R script available for download uses SQL logins only. Generally, we recommend that you use Windows authentication where possible, to avoid saving passwords in your R code. However, to ensure that the code in this tutorial matches the code downloaded from Github, we'll use a SQL login for the rest of the walkthrough.
Define variables to use in creating a new compute context. After you create the compute context object, you can use it to run R code on the SQL Server instance.
```
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
```
- R uses a temporary directory when serializing R objects back and forth between your workstation and the [!INCLUDEssNoVersion] computer. You can specify the local directory that is used as sqlShareDir, or accept the default.
- Use sqlWait to indicate whether you want R to wait for results from the server. For a discussion of waiting vs. non-waiting jobs, see Distributed and parallel computing with ScaleR in Microsoft R.
- Use the argument sqlConsoleOutput to indicate that you don't want to see output from the R console.
You call the RxInSqlServer constructor to create the compute context object with the variables and connection strings already defined, and save the new object in the R variable sqlcc.
```
sqlcc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, wait = sqlWait, consoleOutput = sqlConsoleOutput)
```
By default, the compute context is local, so you need to explicitly set the active compute context.
```
rxSetComputeContext(sqlcc)
```
- rxSetComputeContext returns the previously active compute context invisibly so that you can use it
- rxGetComputeContext returns the active compute context
Note that setting a compute context only affects operations that use functions in the RevoScaleR package; the compute context does not affect the way that open source R operations are performed.

Create a data source using RxSqlServer

In Microsoft R, a data source is an object you create using RevoScaleR functions. The data source object specifies some set of data that you want to use for a task, such as model training or feature extraction. You can get data from a variety of sources; for the list of currently supported sources, see RxDataSource.

Earlier you defined a connection string, and saved that information in an R variable. You can re-use that connection information to specify the data you want to get.

Save a SQL query as a string variable. The query defines the data for training the model.
```
sampleDataQuery <- "SELECT tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude FROM nyctaxi_sample"
```
We've used a TOP clause here to make things run faster, but the actual rows returned by the query can vary depending on order. Hence, your summary results might also be different from those listed below. Feel free to remove the TOP clause.
Pass the query definition as an argument to the RxSqlServerData function.
```
inDataSource <- RxSqlServerData(
  sqlQuery = sampleDataQuery,
  connectionString = connStr,
  colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric",
  dropoff_longitude = "numeric", dropoff_latitude = "numeric"),
  rowsPerRead=500
  )
```
- The argument colClasses specifies the column types to use when moving the data between [!INCLUDEssNoVersion] and R. This is important because [!INCLUDEssNoVersion] uses different data types than R, and more data types. For more information, see R Libraries and Data Types.
- The argument rowsPerRead is important for managing memory usage and efficient computations. Most of the enhanced analytical functions in[!INCLUDErsql_productname] process data in chunks and accumulate intermediate results, returning the final computations after all of the data has been read. By adding the rowsPerRead parameter, you can control how many rows of data are read into each chunk for processing. If the value of this parameter is too large, data access might be slow because you don’t have enough memory to efficiently process such a large chunk of data. On some systems, setting rowsPerRead to an excessively small value can also provide slower performance.
At this point, you've created the inDataSource object, but it doesn't contain any data. The data is not pulled from the SQL query into the local environment until you run a function such as rxImport or rxSummary.

However, now that you've defined the data objects, you can use it as the argument to other functions.

Use the SQL Server data in R summaries

In this section, you'll try out several of the functions provided in [!INCLUDErsql_productname] that support remote compute contexts. By applying R functions to the data source, you can explore, summarize, and chart the [!INCLUDEssNoVersion] data.

Call the function rxGetVarInfo to get a list of the variables in the data source and their data types.

rxGetVarInfo is a handy function; you can call it on any data frame, or on a set of data in a remote data object, to get information such as the maximum and minimum values, the data type, and the number of levels in factor columns.

Consider running this function after any kind of data input, feature transformation, or feature engineering. By doing so, you can ensure that all the features you want to use in your model are of the expected data type and avoid errors.
```
rxGetVarInfo(data = inDataSource)
```
Results
```
Var 1: tipped, Type: integer
Var 2: fare_amount, Type: numeric
Var 3: passenger_count, Type: integer
Var 4: trip_time_in_secs, Type: numeric, Storage: int64
Var 5: trip_distance, Type: numeric
Var 6: pickup_datetime, Type: character
Var 7: dropoff_datetime, Type: character
Var 8: pickup_longitude, Type: numeric
Var 9: pickup_latitude, Type: numeric
Var 10: dropoff_longitude, Type: numeric
```
Now, call the RevoScaleR function rxSummary to get more detailed statistics about individual variables.

rxSummary is based on the R summary function, but has some additional features and advantages. rxSummary works in multiple compute contexts and supports chunking. You can also use rxSummary to transform values, or summarize based on factor levels.

In this example, you summarize the fare amount based on the number of passengers.
```
start.time <- proc.time()
rxSummary(~fare_amount:F(passenger_count,1,6), data = inDataSource)
used.time <- proc.time() - start.time
print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds,
  Elapsed Time=", round(used.time[3],2),
  " seconds to summarize the inDataSource.", sep=""))
```
- The first argument to rxSummary specifies the formula or term to summarize by. Here, the F() function is used to convert the values in passenger_count into factors before summarizing. You also have to specify the minimum value (1) and maximum value (6) for the passenger_count factor variable.
- If you do not specify the statistics to output, by default rxSummary outputs Mean, StDev, Min, Max, and the number of valid and missing observations.
- This example also includes some code to track the time the function starts and completes, so that you can compare performance.
Results

If the rxSummary function runs successfuly, you should see results like these, followed by a list of statistics by category.
```
rxSummary(formula = ~fare_amount:F(passenger_count, 1,6), data = inDataSource)
Data: inDataSource (RxSqlServerData Data Source)
Number of valid observations: 1000
```

Bonus exercise on big data

Try defining a new query string with all the rows. we recommend you set up a new data source object for this experiment. You might also try changing the rowsToRead parameter to see how it affects throughput.

bigDataQuery  <- "SELECT tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude FROM nyctaxi_sample"

bigDataSource <- RxSqlServerData(
      sqlQuery = bigDataQuery,
      connectionString = connStr,
      colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric",
      dropoff_longitude = "numeric", dropoff_latitude = "numeric"),
      rowsPerRead=500
      )

start.time <- proc.time()
rxSummary(~fare_amount:F(passenger_count,1,6), data = bigDataSource)
used.time <- proc.time() - start.time
print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds,
  Elapsed Time=", round(used.time[3],2),
  " seconds to summarize the inDataSource.", sep=""))

Tip

While this is running, you can use a tool like Process Explorer or SQL Profiler to see how the connection is made and the R code is run using SQL Server services.

Another option is to monitor R jobs running on SQL Server using these custom reports.

Next lesson

Create graphs and plots using R

Previous lesson

Explore the data using SQL

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

View and summarize data using R

Define a SQL Server compute context

Create a data source using RxSqlServer

Use the SQL Server data in R summaries

Bonus exercise on big data

Next lesson

Previous lesson

FilesExpand file tree

walkthrough-view-and-summarize-data-using-r.md

Latest commit

History

walkthrough-view-and-summarize-data-using-r.md

File metadata and controls

View and summarize data using R

Define a SQL Server compute context

Create a data source using RxSqlServer

Use the SQL Server data in R summaries

Bonus exercise on big data

Next lesson

Previous lesson