| title | Transform Data Using R | Microsoft Docs | |
|---|---|---|
| ms.custom |
|
|
| ms.date | 05/18/2017 | |
| ms.prod | sql-server-2016 | |
| ms.reviewer | ||
| ms.suite | ||
| ms.technology |
|
|
| ms.tgt_pltfrm | ||
| ms.topic | article | |
| applies_to |
|
|
| dev_langs |
|
|
| ms.assetid | 0327e788-94cc-4a47-933b-7c5c027b9208 | |
| caps.latest.revision | 19 | |
| author | jeannt | |
| ms.author | jeannt | |
| manager | jhubbard |
The RevoScaleR package provides multiple functions for transforming data at various stages of your analysis:
-
rxDataStep can be used to create and transform subsets of data.
-
rxImport supports transforming data as data is being imported to or from an XDF file or an in-memory data frame.
-
Although not specifically for data movement, the functions rxSummary, rxCube, rxLinMod, and rxLogit all support data transformations.
In this section, you'll learn how to use these functions. Let's start with rxDataStep.
The rxDataStep function processes data one chunk at a time, reading from one data source and writing to another. You can specify the columns to transform, the transformations to load, and so forth.
To make this example interesting, you’ll use a function from another R package to transform your data. The boot package is one of the "recommended" packages, meaning that boot is included with every distribution of R, but is not loaded automatically on start-up. Therefore, the package should already be available on the [!INCLUDEssNoVersion] instance that you've been using with [!INCLUDErsql_productname].
From the boot package, you'll use the function inv.logit, which computes the inverse of a logit. That is, the inv.logit function converts a logit back to a probability on the [0,1] scale.
Tip
Another way to get predictions in this scale would be to set the type parameter to response in the original call to rxPredict.
-
Start by creating a data source to hold the data destined for the table, ccScoreOutput.
sqlOutScoreDS <- RxSqlServerData( table = "ccScoreOutput", connectionString = sqlConnString, rowsPerRead = sqlRowsPerRead )
-
Add another data source to hold the data for the table ccScoreOutput2.
sqlOutScoreDS2 <- RxSqlServerData( table = "ccScoreOutput2", connectionString = sqlConnString, rowsPerRead = sqlRowsPerRead )
In the new table, you will get all the variables from the previous ccScoreOutput table, plus the newly created variable.
-
Set the compute context to the [!INCLUDEssNoVersion] instance.
rxSetComputeContext(sqlCompute) -
Use the function rxSqlServerTableExists to check whether the output table ccScoreOutput2 already exists; and if so, use the function rxSqlServerDropTable to delete the table.
if (rxSqlServerTableExists("ccScoreOutput2")) rxSqlServerDropTable("ccScoreOutput2")
-
Call the rxDataStep function, and specify the desired transforms in a list.
rxDataStep(inData = sqlOutScoreDS, outFile = sqlOutScoreDS2, transforms = list(ccFraudProb = inv.logit(ccFraudLogitScore)), transformPackages = "boot", overwrite = TRUE)
When you define the transformations that are applied to each column, you can also specify any additional R packages that are needed to perform the transformations. For more information about the types of transformations that you can perform, see Transforming and Subsetting Data
-
Call rxGetVarInfo to view a summary of the variables in the new data set.
rxGetVarInfo(sqlOutScoreDS2)Results
Var 1: ccFraudLogitScore, Type: numeric
Var 2: state, Type: character
Var 3: gender, Type: character
Var 4: cardholder, Type: character
Var 5: balance, Type: integer
Var 6: numTrans, Type: integer
Var 7: numIntlTrans, Type: integer
Var 8: creditLine, Type: integer
Var 9: ccFraudProb, Type: numeric
The original logit scores are preserved, but a new column, ccFraudProb, has been added, in which the logit scores are represented as values between 0 and 1.
Notice that the factor variables have been written to the table ccScoreOutput2 as character data. To use them as factors in subsequent analyses, use the parameter colInfo to specify the levels.