---
# required metadata
title: "rx_featurize: Data Transformation for revoscalepy data sources"
description: "Transforms data from an input data set to an output data set."
keywords: "transform, featurizer"
author: WilliamDAssafMSFT
ms.author: wiassaf
manager: "cgronlun"
ms.date: 07/15/2019
ms.topic: "reference"
ms.prod: "sql"
ms.technology: "machine-learning-services"
ms.service: ""
ms.assetid: ""

# optional metadata
ROBOTS: ""
audience: ""
ms.devlang: "Python"
ms.reviewer: ""
ms.suite: ""
ms.tgt_pltfrm: ""
ms.custom: ""
monikerRange: ">=sql-server-2017||>=sql-server-linux-ver15"
---

# *microsoftml.rx_featurize*: Data transformation for data sources

## Usage

```
microsoftml.rx_featurize(data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    pandas.core.frame.DataFrame],
    output_data: typing.Union[revoscalepy.datasource.RxDataSource.RxDataSource,
    str] = None, overwrite: bool = False,
    data_threads: int = None, random_seed: int = None,
    max_slots: int = 5000, ml_transforms: list = None,
    ml_transform_vars: list = None, row_selection: str = None,
    transforms: dict = None, transform_objects: dict = None,
    transform_function: str = None,
    transform_variables: list = None,
    transform_packages: list = None,
    transform_environment: dict = None, blocks_per_read: int = None,
    report_progress: int = None, verbose: int = 1,
    compute_context: revoscalepy.computecontext.RxComputeContext.RxComputeContext = None)
```

## Description

Transforms data from an input data set to an output data set.

## Arguments

### data

A [revoscalepy](/machine-learning-server/python-reference/revoscalepy/revoscalepy-package) data source object, a data frame, or the path to a `.xdf` file.

### output_data

Output text or xdf file name or an `RxDataSource` with write capabilities in which to store transformed data. If *None*, a data frame is returned. The default value is *None*.

### overwrite

If `True`, an existing `output_data` is overwritten; if `False` an existing `output_data` is not overwritten.
The default value is `False`.

### data_threads

An integer specifying the desired degree of parallelism in the data pipeline. If *None*, the number of threads used is determined internally. The default value is *None*.

### random_seed

Specifies the random seed. The default value is *None*.

### max_slots

Maximum number of slots to return for vector-valued columns (<=0 to return all). The default value is `5000`.

### ml_transforms

Specifies a list of MicrosoftML transforms to be performed on the data before training, or *None* if no transforms are to be performed. See [`featurize_text`](featurize-text.md), [`categorical`](categorical.md), and [`categorical_hash`](categorical-hash.md) for transformations that are supported. These transformations are performed after any specified Python transformations. The default value is *None*.

### ml_transform_vars

Specifies a list of variable names to be used in `ml_transforms`, or *None* if none are to be used. The default value is *None*.

### row_selection

NOT SUPPORTED. Specifies the rows (observations) from the data set that are to be used by the model with the name of a logical variable from the data set (in quotes) or with a logical expression using variables in the data set. For example:

* `row_selection = "old"` will only use observations in which the value of the variable `old` is `True`.

* `row_selection = (age > 20) & (age < 65) & (log(income) > 10)` only uses observations in which the value of the `age` variable is between 20 and 65 and the value of the `log` of the `income` variable is greater than 10.

The row selection is performed after processing any data transformations (see the arguments `transforms` or `transform_function`). As with all expressions, `row_selection` can be defined outside of the function call using the `expression` function.

### transforms

NOT SUPPORTED. An expression that represents the first round of variable transformations.
As with all expressions, `transforms` (or `row_selection`) can be defined outside of the function call using the `expression` function. The default value is *None*.

### transform_objects

NOT SUPPORTED. A named list that contains objects that can be referenced by `transforms`, `transform_function`, and `row_selection`. The default value is *None*.

### transform_function

The variable transformation function. The default value is *None*.

### transform_variables

A list of input data set variables needed for the transformation function. The default value is *None*.

### transform_packages

NOT SUPPORTED. A list specifying additional Python packages (outside of those specified in `RxOptions.get_option("transform_packages")`) to be made available and preloaded for use in variable transformation functions. For example, those explicitly defined in [revoscalepy](/machine-learning-server/python-reference/revoscalepy/revoscalepy-package) functions via their `transforms` and `transform_function` arguments or those defined implicitly via their `formula` or `row_selection` arguments. The `transform_packages` argument may also be *None*, indicating that no packages outside `RxOptions.get_option("transform_packages")` are preloaded.

### transform_environment

NOT SUPPORTED. A user-defined environment to serve as a parent to all environments developed internally and used for variable data transformation. If `transform_environment = None`, a new “hash” environment with parent `revoscalepy.baseenv` is used instead. The default value is *None*.

### blocks_per_read

Specifies the number of blocks to read for each chunk of data read from the data source.

### report_progress

An integer value that specifies the level of reporting on the row processing progress:

* `0`: no progress is reported.

* `1`: the number of processed rows is printed and updated.

* `2`: rows processed and timings are reported.

* `3`: rows processed and all timings are reported.

The default value is `1`.

### verbose

An integer value that specifies the amount of output wanted. If `0`, no verbose output is printed during calculations. Integer values from `1` to `4` provide increasing amounts of information. The default value is `1`.

### compute_context

Sets the context in which computations are executed, specified with a valid `revoscalepy.RxComputeContext`. Currently local and [revoscalepy.RxInSqlServer](/machine-learning-server/python-reference/revoscalepy/RxInSqlServer) compute contexts are supported.

## Returns

A data frame or a [revoscalepy.RxDataSource](/machine-learning-server/python-reference/revoscalepy/RxDataSource) object representing the created output data.

## See also

[`rx_predict`](rx-predict.md),
[revoscalepy.rx_data_step](/machine-learning-server/python-reference/revoscalepy/rx-data-step),
[revoscalepy.rx_import](/machine-learning-server/python-reference/revoscalepy/rx-import).

## Example

```
'''
Example with rx_featurize.
'''
import numpy
import pandas
from microsoftml import rx_featurize, categorical

# rx_featurize gives direct access to the output of the MicrosoftML transforms.
# This example looks at the output of the categorical transform.

# Create the data: a single categorical column
categorical_data = pandas.DataFrame(data=dict(places_visited=[
                "London", "Brunei", "London", "Paris", "Seria"]),
                dtype="category")

print(categorical_data)

# Invoke the categorical transform; the output columns are prefixed with "xdatacat"
categorized = rx_featurize(data=categorical_data,
                           ml_transforms=[categorical(cols=dict(xdatacat="places_visited"))])

# Now let's look at the data
print(categorized)
```

Output:

```
  places_visited
0         London
1         Brunei
2         London
3          Paris
4          Seria
Beginning processing data.
Rows Read: 5, Read Time: 0, Transform Time: 0
Beginning processing data.
Beginning processing data.
Rows Read: 5, Read Time: 0, Transform Time: 0
Beginning processing data.
Elapsed time: 00:00:00.0521300
Finished writing 5 rows.
Writing completed.
  places_visited  xdatacat.London  xdatacat.Brunei  xdatacat.Paris  \
0         London              1.0              0.0             0.0
1         Brunei              0.0              1.0             0.0
2         London              1.0              0.0             0.0
3          Paris              0.0              0.0             1.0
4          Seria              0.0              0.0             0.0

   xdatacat.Seria
0             0.0
1             0.0
2             0.0
3             0.0
4             1.0
```
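
To make the shape of that output easier to read, the following standard-library sketch rebuilds the same indicator columns without `microsoftml`. It is an illustration only, not the implementation of the `categorical` transform: the helper name `one_hot_first_seen` is hypothetical, and the column order (categories in order of first appearance) is an assumption inferred from the output shown above rather than documented behavior.

```python
def one_hot_first_seen(values, prefix):
    """Return (column_names, rows) of 0.0/1.0 indicators, one column per distinct value.

    Hypothetical helper for illustration; this is not part of microsoftml.
    """
    # dict keeps insertion order, so each category gets the index of its first appearance
    # (assumed ordering, chosen to match the example output).
    index = {}
    for value in values:
        index.setdefault(value, len(index))

    columns = [f"{prefix}.{value}" for value in index]

    rows = []
    for value in values:
        row = [0.0] * len(index)
        row[index[value]] = 1.0  # exactly one slot is set per input row
        rows.append(row)
    return columns, rows


places_visited = ["London", "Brunei", "London", "Paris", "Seria"]
columns, rows = one_hot_first_seen(places_visited, prefix="xdatacat")

print(columns)
for place, row in zip(places_visited, rows):
    print(place, row)
```

Each input row maps to a vector with a single `1.0`, which is why the five input rows expand into four `xdatacat.*` columns: `London` appears twice but produces only one column.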