title	Perform chunking analysis using rxDataStep (SQL and R deep dive)\| Microsoft Docs
ms.prod	sql
ms.technology	machine-learning
ms.date	04/15/2018
ms.topic	tutorial
author	HeidiSteen
ms.author	heidist
manager	cgronlun

Perform chunking analysis using rxDataStep (SQL and R deep dive)

[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md-winonly]

This article is part of the Data Science Deep Dive tutorial, on how to use RevoScaleR with SQL Server.

In this lesson, you use the rxDataStep function to process data in chunks, rather than requiring that the entire dataset be loaded into memory and processed at one time, as in traditional R. The rxDataStep functions reads the data in chunk, applies R functions to each chunk of data in turn, and then saves the summary results for each chunk to a common [!INCLUDEssNoVersion] data source. When all data has been read, the results are combined.

Tip

For this lesson, you compute a contingency table by using the table function in R. This example is meant for instructional purposes only.

If you need to tabulate real-world data sets, we recommend that you use the rxCrossTabs or rxCube functions in RevoScaleR, which are optimized for this sort of operation.

Partition data by values

Create a custom R function that calls the R table function on each chunk of data, and name the new function ProcessChunk.

ProcessChunk <- function( dataList) {
# Convert the input list to a data frame and compute contingency table
chunkTable <- table(as.data.frame(dataList))

# Convert table output to a data frame with a single row
varNames <- names(chunkTable)
varValues <- as.vector(chunkTable)
dim(varValues) <- c(1, length(varNames))
chunkDF <- as.data.frame(varValues)
names(chunkDF) <- varNames

# Return the data frame, which has a single row
return( chunkDF )
}

Set the compute context to the server.
```
rxSetComputeContext( sqlCompute )
```

Define a SQL Server data source to hold the data you're processing. Start by assigning a SQL query to a variable. Then, use that variable in the sqlQuery argument of a new [!INCLUDEssNoVersion] data source.

dayQuery <-  "SELECT DayOfWeek FROM AirDemoSmallTest"
inDataSource <- RxSqlServerData(sqlQuery = dayQuery,
    connectionString = sqlConnString,
    rowsPerRead = 50000,
    colInfo = list(DayOfWeek = list(type = "factor",
        levels = as.character(1:7))))

Optionally, you can run rxGetVarInfo on this data source. At this point, it contains a single column: Var 1: DayOfWeek, Type: factor, no factor levels available

Before applying this factor variable to the source data, create a separate table to hold the intermediate results. Again, you just use the RxSqlServerData function to define the data, makign sure to delete any existing tables of the same name.

iroDataSource = RxSqlServerData(table = "iroResults",   connectionString = sqlConnString)
# Check whether the table already exists.
if (rxSqlServerTableExists(table = "iroResults",  connectionString = sqlConnString))  { rxSqlServerDropTable( table = "iroResults", connectionString = sqlConnString) }

Call the custom function ProcessChunk to transform the data as it is read, by using it as the transformFunc argument to the rxDataStep function.
```
rxDataStep( inData = inDataSource, outFile = iroDataSource, transformFunc = ProcessChunk, overwrite = TRUE)
```
To view the intermediate results of ProcessChunk, assign the results of rxImport to a variable, and then output the results to the console.
```
iroResults <- rxImport(iroDataSource)
iroResults
```

Partial results

	1	2	3	4	5	6	7
1	8228	8924	6916	6932	6944	5602	6454
2	8321	5351	7329	7411	7409	6487	7692

To compute the final results across all chunks, sum the columns, and display the results in the console.
```
finalResults <- colSums(iroResults)
finalResults
```

Results

1	2	3	4	5	6	7
97975	77725	78875	81304	82987	86159	94975

To remove the intermediate results table, make a call to rxSqlServerDropTable.

rxSqlServerDropTable( table = "iroResults", connectionString = sqlConnString)

Next step

Analyze data in local compute context

Previous step

Create new SQL Server table using rxDataStep

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perform chunking analysis using rxDataStep (SQL and R deep dive)

Partition data by values

Next step

Previous step

FilesExpand file tree

deepdive-perform-chunking-analysis-using-rxdatastep.md

Latest commit

History

deepdive-perform-chunking-analysis-using-rxdatastep.md

File metadata and controls

Perform chunking analysis using rxDataStep (SQL and R deep dive)

Partition data by values

Next step

Previous step