title

View and Summarize Data using R (Data Science End-to-End Walkthrough) | Microsoft Docs

ms.custom

SQL2016_New_Updated

ms.date

03/17/2017

ms.prod

sql-server-2016

ms.reviewer

ms.suite

ms.technology

r-services

ms.tgt_pltfrm

ms.topic

article

applies_to

SQL Server 2016

dev_langs

ms.assetid

358e1431-8f47-4d32-a02f-f90e519eef49

caps.latest.revision

author

jeannt

ms.author

jeannt

manager

jhubbard

Lesson 2-1 - View and Summarize Data using R

Now you'll work with the same data using R code. You'll also learn how to use the functions in the RevoScaleR package included with [!INCLUDErsql_productname].

An R script is provided with this walkthrough that includes all the code needed to create the data object, generate summaries, and build models. The R script file, RSQL_RWalkthrough.R, can be found in the location where you installed the script files.

If you are experienced with R, you can run the script all at once.
For people learning to use RevoScaleR, this tutorial goes through the script line by line
To run individual lines from the script, you can highlight a line or lines in the file and press Ctrl + ENTER.

Save your R workspace in case you want to complete the rest of the walkthrough later. That way the data objects and other variables will be ready for re-use.

Define a SQL Server Compute Context

To get data from [!INCLUDEssNoVersion] to use in your R code, you need to:

Create a connection to a [!INCLUDEssNoVersion] instance
Define a query that has the data you need, or specify a table or view
Define one or more compute contexts to use when running R code
Optionally, you can define transformation that are applied to the data source while it is being read from the source

The following steps are all part of the R code and should be run in an R IDE.

If the RevoScaleR package is not already loaded, run:
```
library("RevoScaleR")`.  
```
If you get an error, make sure that your R development environment is using the library that includes the RevoScaleR package. Use a command such as .libPaths()) to view the current path.
Create the connection string for SQL Server.
```
# SQL authentication  
connStr <- "Driver=SQL Server;Server=Your_Server_Name.somedomain.com;Database=Your_Database_Name;Uid=Your_User_Name;Pwd=Your_Password"


# Windows authentication  
connStrWin <- "Driver=SQL Server;Server=SQL_instance_name;Database=database_name;Trusted_Connection=Yes" 

# Map the connection string to the one used in the rest of this tutorial
connStr \<- connStrWin 
```
[!NOTE] The R script available for download uses SQL logins only. In this tutorial, we've provided examples of both SQL logins and Windows integrated authentication. We recommend that you use Windows authentication where possible, to avoid saving passwords in your R code.

Regardless of which credentials you use, the account that you use must have permissions to read data and to create new tables in the specified database. For information about how to add users to the SQL database and give them the correct permissions, see Post-Installation Server Configuration (SQL Server R Services).
Define variables to use in a new compute context. By setting the compute context to the [!INCLUDEssNoVersion] instance, you can take advantage of server resources, rather than trying to run it in the memory on your laptop.
```
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")  
sqlWait <- TRUE  
sqlConsoleOutput <- FALSE    
```
- R uses a temporary directory when serializing R objects back and forth between your workstation and the [!INCLUDEssNoVersion] computer. You can specify the local directory that is used as sqlShareDir, or accept the default.
- Use sqlWait to indicate whether you want R to wait for results or not. For a discussion of waiting vs. non-waiting jobs, see ScaleR Distributed Computing.
- Use the argument sqlConsoleOutput to indicate that you dont want to see output from the R console.

Instantiate the compute context object with the variables and connection strings already defined, and save it in the R variable sqlcc.

cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,   
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)

By default, the compute context is local, so you'll need to explicitly set the active compute context.
```
rxSetComputeContext(cc)  
```
- rxSetComputeContext returns the previously active compute context invisibly so that you can use it
- rxGetComputeContext returns the active compute context
Note that setting a compute context only affects operations that use functions in the RevoScaleR package; the compute context does not affect the way that open source R operations are performed.

Create an RxSqlServer Data Source

A data source specifies some set of data that you want to use for a task, such as training, exploration, scoring, or generating features.

You already defined the database you want to work with, and saved that information in R variables. Now you can re-use that data connection to create data objects, by calling the RxSqlServer function.

Save the SQL statement as a string variable. The query defines the data you'll use to train the model.

sampleDataQuery <- "SELECT TOP 1000 tipped, fare_amount, 
       passenger_count,trip_time_in_secs,trip_distance,   
       pickup_datetime, dropoff_datetime, pickup_longitude, 
       pickup_latitude, dropoff_longitude,    
       dropoff_latitude 
       FROM nyctaxi_sample"

Pass the query definition as an argument to the RxSqlServerData function.
```
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, 
     connectionString = connStr,
     colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric",
     dropoff_longitude = "numeric", dropoff_latitude = "numeric"),  
     rowsPerRead=500)  
```
- The argument colClasses specifies the column types to use when moving the data between [!INCLUDEssNoVersion] and R. This is important because [!INCLUDEssNoVersion] uses different data types than R, and more data types. For more information, see Working with R Data Types.
- The argument rowsPerRead is important for handling memory usage and efficient computations. Most of the enhanced analytical functions in[!INCLUDErsql_productname] process data in chunks and accumulate intermediate results, returning the final computations after all of the data has been read. By adding the rowsPerRead parameter, you can control how many rows of data are read into each chunk for processing. If the value of this parameter is too large, data access might be slow because you don’t have enough memory to efficiently process such a large chunk of data. On some systems, setting rowsPerRead to too small a value can also provide slower performance.
At this point, the inDataSource object doesn't contain any data from the SQL query. The data is not pulled into the local environment until you run a function such as rxImport or rxSummary.

However, this object is a convenient shortcut for defining the data. You can call the data source using multiple functions, to move data, to get a summary of the data and its variables, to manipulate and transform the data, or to use it for training a model.

Use the SQL Server Data in R

You can now apply R functions to the data source, to explore, summarize, and chart the [!INCLUDEssNoVersion] data. In this section, you 'll try out several of the functions provided in [!INCLUDErsql_productname] that support remote compute contexts.

Call the function rxGetVarInfo, using the data source inDataSource as an argument, to get a list of the variables in the data source and their data types.
```
rxGetVarInfo(data = inDataSource)  
```
Results:
Var 1: tipped, Type: integer
Var 2: fare_amount, Type: numeric
Var 3: passenger_count, Type: integer
Var 4: trip_time_in_secs, Type: numeric, Storage: int64
Var 5: trip_distance, Type: numeric
Var 6: pickup_datetime, Type: character
Var 7: dropoff_datetime, Type: character
Var 8: pickup_longitude, Type: numeric
Var 9: pickup_latitude, Type: numeric
Var 10: dropoff_longitude, Type: numeric

rxGetVarInfo can be used with any data frame, or a set of data in a remote data object, to get information such as the maximum and minimum values, the data type, and the number of levels in factor columns.

Consider running this function after any kind of data input, feature transformation, or feature engineering. By doing so you can ensure that all the features you want to use in your model are of the expected data type and avoid errors.
Call the RevoScaleR function rxSummary to summarize the fare amount, based on the number of passengers.

This function can be used to get more detailed statistics about individual variables. You can also transform values, compute summaries using factor levels, and save the summaries for re-use.
```
start.time <- proc.time()  
rxSummary(~fare_amount:F(passenger_count,1,6), data = inDataSource)  
used.time <- proc.time() - start.time  
print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds,
  Elapsed Time=", round(used.time[3],2), 
  " seconds to summarize the inDataSource.", sep=""))
```
- The first argument to rxSummary specifies the formula or term to summarize by. Here, the F() function is used to convert the values in passenger_count into factors before summarizing. The rxSummary function also requires that you specify minimum and maximum values for the passenger_count column: : here, 1 and 6.
- If you do not specify the statistics to output, by default rxSummary outputs Mean, StDev, Min, Max, and the number of valid and missing observations.
- This example also includes some code to track the time the function starts and completes, so that you can compare performance.
Results
rxSummary(formula = ~fare_amount:F(passenger_count), data = inDataSource)
Summary Statistics Results for: ~fare_amount:F(passenger_count)
Data: inDataSource (RxSqlServerData Data Source)
Number of valid observations: 1000
Name Mean StdDev Min Max ValidObs MissingObs
fare_amount:F_passenger_count 12.4875 9.682605 2.5 64 1000 0
Statistics by category (6 categories):
Category F_passenger_count Means StdDev Min
fare_amount for F(passenger_count)=1 1 12.00901 9.219458 2.5
fare_amount for F(passenger_count)=2 2 11.61893 8.858739 3.0
fare_amount for F(passenger_count)=3 3 14.40196 10.673340 3.5
fare_amount for F(passenger_count)=4 4 13.69048 8.647942 4.5
fare_amount for F(passenger_count)=5 5 19.30909 14.122969 3.5
fare_amount for F(passenger_count)=6 6 12.00000 NA 12.0
Max ValidObs
55 666
52 206
52 51
39 21
64 55
12 1
"It takes CPU Time=0.5 seconds, Elapsed Time=4.59 seconds to summarize the inDataSource."

Next Step

Create Graphs and Plots Using R (Data Science End-to-End Walkthrough)

Previous Lesson

Lesson 1: Prepare the Data (Data Science End-to-End Walkthrough)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lesson 2-1 - View and Summarize Data using R

Define a SQL Server Compute Context

Create an RxSqlServer Data Source

Use the SQL Server Data in R

Next Step

Previous Lesson

FilesExpand file tree

walkthrough-view-and-summarize-data-using-r.md

Latest commit

History

walkthrough-view-and-summarize-data-using-r.md

File metadata and controls

Lesson 2-1 - View and Summarize Data using R

Define a SQL Server Compute Context

Create an RxSqlServer Data Source

Use the SQL Server Data in R

Next Step

Previous Lesson