| title | View and Summarize Data using R (Data Science End-to-End Walkthrough) | Microsoft Docs | |
|---|---|---|
| ms.custom |
|
|
| ms.date | 03/17/2017 | |
| ms.prod | sql-server-2016 | |
| ms.reviewer | ||
| ms.suite | ||
| ms.technology |
|
|
| ms.tgt_pltfrm | ||
| ms.topic | article | |
| applies_to |
|
|
| dev_langs |
|
|
| ms.assetid | 358e1431-8f47-4d32-a02f-f90e519eef49 | |
| caps.latest.revision | 21 | |
| author | jeannt | |
| ms.author | jeannt | |
| manager | jhubbard |
Now you'll work with the same data using R code. You'll also learn how to use the functions in the RevoScaleR package included with [!INCLUDErsql_productname].
An R script is provided with this walkthrough that includes all the code needed to create the data object, generate summaries, and build models. The R script file, RSQL_RWalkthrough.R, can be found in the location where you installed the script files.
- If you are experienced with R, you can run the script all at once.
- For people learning to use RevoScaleR, this tutorial goes through the script line by line
- To run individual lines from the script, you can highlight a line or lines in the file and press Ctrl + ENTER.
Save your R workspace in case you want to complete the rest of the walkthrough later. That way the data objects and other variables will be ready for re-use.
To get data from [!INCLUDEssNoVersion] to use in your R code, you need to:
- Create a connection to a [!INCLUDEssNoVersion] instance
- Define a query that has the data you need, or specify a table or view
- Define one or more compute contexts to use when running R code
- Optionally, you can define transformation that are applied to the data source while it is being read from the source
The following steps are all part of the R code and should be run in an R IDE.
-
If the RevoScaleR package is not already loaded, run:
library("RevoScaleR")`.
If you get an error, make sure that your R development environment is using the library that includes the RevoScaleR package. Use a command such as
.libPaths())to view the current path. -
Create the connection string for SQL Server.
# SQL authentication connStr <- "Driver=SQL Server;Server=Your_Server_Name.somedomain.com;Database=Your_Database_Name;Uid=Your_User_Name;Pwd=Your_Password" # Windows authentication connStrWin <- "Driver=SQL Server;Server=SQL_instance_name;Database=database_name;Trusted_Connection=Yes" # Map the connection string to the one used in the rest of this tutorial connStr \<- connStrWin
[!NOTE] The R script available for download uses SQL logins only. In this tutorial, we've provided examples of both SQL logins and Windows integrated authentication. We recommend that you use Windows authentication where possible, to avoid saving passwords in your R code.
Regardless of which credentials you use, the account that you use must have permissions to read data and to create new tables in the specified database. For information about how to add users to the SQL database and give them the correct permissions, see Post-Installation Server Configuration (SQL Server R Services).
-
Define variables to use in a new compute context. By setting the compute context to the [!INCLUDEssNoVersion] instance, you can take advantage of server resources, rather than trying to run it in the memory on your laptop.
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="") sqlWait <- TRUE sqlConsoleOutput <- FALSE
-
R uses a temporary directory when serializing R objects back and forth between your workstation and the [!INCLUDEssNoVersion] computer. You can specify the local directory that is used as sqlShareDir, or accept the default.
-
Use sqlWait to indicate whether you want R to wait for results or not. For a discussion of waiting vs. non-waiting jobs, see ScaleR Distributed Computing.
-
Use the argument sqlConsoleOutput to indicate that you dont want to see output from the R console.
-
-
Instantiate the compute context object with the variables and connection strings already defined, and save it in the R variable sqlcc.
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir, wait = sqlWait, consoleOutput = sqlConsoleOutput) -
By default, the compute context is local, so you'll need to explicitly set the active compute context.
rxSetComputeContext(cc)rxSetComputeContextreturns the previously active compute context invisibly so that you can use itrxGetComputeContextreturns the active compute context
Note that setting a compute context only affects operations that use functions in the RevoScaleR package; the compute context does not affect the way that open source R operations are performed.
A data source specifies some set of data that you want to use for a task, such as training, exploration, scoring, or generating features.
You already defined the database you want to work with, and saved that information in R variables. Now you can re-use that data connection to create data objects, by calling the RxSqlServer function.
-
Save the SQL statement as a string variable. The query defines the data you'll use to train the model.
sampleDataQuery <- "SELECT TOP 1000 tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude FROM nyctaxi_sample"
-
Pass the query definition as an argument to the RxSqlServerData function.
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr, colClasses = c(pickup_longitude = "numeric", pickup_latitude = "numeric", dropoff_longitude = "numeric", dropoff_latitude = "numeric"), rowsPerRead=500)
-
The argument colClasses specifies the column types to use when moving the data between [!INCLUDEssNoVersion] and R. This is important because [!INCLUDEssNoVersion] uses different data types than R, and more data types. For more information, see Working with R Data Types.
-
The argument rowsPerRead is important for handling memory usage and efficient computations. Most of the enhanced analytical functions in[!INCLUDErsql_productname] process data in chunks and accumulate intermediate results, returning the final computations after all of the data has been read. By adding the
rowsPerReadparameter, you can control how many rows of data are read into each chunk for processing. If the value of this parameter is too large, data access might be slow because you don’t have enough memory to efficiently process such a large chunk of data. On some systems, settingrowsPerReadto too small a value can also provide slower performance.
-
-
At this point, the inDataSource object doesn't contain any data from the SQL query. The data is not pulled into the local environment until you run a function such as rxImport or rxSummary.
However, this object is a convenient shortcut for defining the data. You can call the data source using multiple functions, to move data, to get a summary of the data and its variables, to manipulate and transform the data, or to use it for training a model.
You can now apply R functions to the data source, to explore, summarize, and chart the [!INCLUDEssNoVersion] data. In this section, you 'll try out several of the functions provided in [!INCLUDErsql_productname] that support remote compute contexts.
-
Call the function rxGetVarInfo, using the data source inDataSource as an argument, to get a list of the variables in the data source and their data types.
rxGetVarInfo(data = inDataSource)
Results:
Var 1: tipped, Type: integer
Var 2: fare_amount, Type: numeric
Var 3: passenger_count, Type: integer
Var 4: trip_time_in_secs, Type: numeric, Storage: int64
Var 5: trip_distance, Type: numeric
Var 6: pickup_datetime, Type: character
Var 7: dropoff_datetime, Type: character
Var 8: pickup_longitude, Type: numeric
Var 9: pickup_latitude, Type: numeric
Var 10: dropoff_longitude, Type: numericrxGetVarInfo can be used with any data frame, or a set of data in a remote data object, to get information such as the maximum and minimum values, the data type, and the number of levels in factor columns.
Consider running this function after any kind of data input, feature transformation, or feature engineering. By doing so you can ensure that all the features you want to use in your model are of the expected data type and avoid errors.
-
Call the RevoScaleR function rxSummary to summarize the fare amount, based on the number of passengers.
This function can be used to get more detailed statistics about individual variables. You can also transform values, compute summaries using factor levels, and save the summaries for re-use.
start.time <- proc.time() rxSummary(~fare_amount:F(passenger_count,1,6), data = inDataSource) used.time <- proc.time() - start.time print(paste("It takes CPU Time=", round(used.time[1]+used.time[2],2)," seconds, Elapsed Time=", round(used.time[3],2), " seconds to summarize the inDataSource.", sep=""))
- The first argument to rxSummary specifies the formula or term to summarize by. Here, the
F()function is used to convert the values in passenger_count into factors before summarizing. The rxSummary function also requires that you specify minimum and maximum values for the passenger_count column: : here, 1 and 6. - If you do not specify the statistics to output, by default rxSummary outputs Mean, StDev, Min, Max, and the number of valid and missing observations.
- This example also includes some code to track the time the function starts and completes, so that you can compare performance.
Results
rxSummary(formula = ~fare_amount:F(passenger_count), data = inDataSource)
Summary Statistics Results for: ~fare_amount:F(passenger_count)
Data: inDataSource (RxSqlServerData Data Source)
Number of valid observations: 1000
Name Mean StdDev Min Max ValidObs MissingObs
fare_amount:F_passenger_count 12.4875 9.682605 2.5 64 1000 0
Statistics by category (6 categories):
Category F_passenger_count Means StdDev Min
fare_amount for F(passenger_count)=1 1 12.00901 9.219458 2.5
fare_amount for F(passenger_count)=2 2 11.61893 8.858739 3.0
fare_amount for F(passenger_count)=3 3 14.40196 10.673340 3.5
fare_amount for F(passenger_count)=4 4 13.69048 8.647942 4.5
fare_amount for F(passenger_count)=5 5 19.30909 14.122969 3.5
fare_amount for F(passenger_count)=6 6 12.00000 NA 12.0
Max ValidObs
55 666
52 206
52 51
39 21
64 55
12 1
"It takes CPU Time=0.5 seconds, Elapsed Time=4.59 seconds to summarize the inDataSource." - The first argument to rxSummary specifies the formula or term to summarize by. Here, the
Create Graphs and Plots Using R (Data Science End-to-End Walkthrough)
Lesson 1: Prepare the Data (Data Science End-to-End Walkthrough)