---
title: Data Wrangling using PROSE Code Accelerator | Microsoft Docs
description:
author: rothja
ms.author: jroth
manager: craigg
ms.date: 09/24/2018
ms.topic: conceptual
ms.prod: sql
---

# Data Wrangling using PROSE Code Accelerator

PROSE Code Accelerator generates readable Python code for your data wrangling tasks. You can mix the generated code with your hand-written code in a seamless manner while working in a notebook within Azure Data Studio. This article provides an overview of how you can use the Code Accelerator.

> [!NOTE]
> Program Synthesis using Examples, also known as PROSE, is a Microsoft technology that generates human-readable code using AI. It does so by analyzing a user's intent as well as the data, generating several candidate programs, and picking the best program using ranking algorithms. To learn more about the PROSE technology, visit the [PROSE homepage](https://microsoft.github.io/prose/).

The Code Accelerator comes pre-installed with Azure Data Studio. You can import it like any other Python package in a notebook. By convention, we import it as `cx` for short.

```python
import prose.codeaccelerator as cx
```

In the current release, the Code Accelerator can intelligently generate Python code for the following tasks:

- Reading data files into a pandas or PySpark dataframe.
- Fixing datatypes in a dataframe.
- Finding regular expressions that represent patterns in a list of strings.

To get a general overview of Code Accelerator methods, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/overview?view=azure-accelerator-py).

## Reading data from a file to a dataframe

Often, reading a file into a dataframe involves looking at the content of the file and determining the correct parameters to pass to a data-loading library. Depending on the complexity of the file, identifying the correct parameters may require several iterations.

PROSE Code Accelerator solves this problem by analyzing the structure of the data file and automatically generating code to load the file. In most cases, the generated code parses the data correctly. In a few cases, you might need to tweak the code to meet your needs.

Consider the following example:

```python
import prose.codeaccelerator as cx

# Call the ReadCsvBuilder builder to analyze the file content and generate code to load it
builder = cx.ReadCsvBuilder(r'C:/911.txt')

# Set target to pyspark if generating code that uses the pyspark library
# builder.Target = "pyspark"

# Get the generated code to read the file
builder.learn().code()
```

The previous code block prints the following Python code to read the delimited file. Notice how PROSE automatically figures out the number of lines to skip, the header, quote characters, delimiters, and so on.

```python
import pandas as pd

def read_file(file):
    names = ["lat",
             "lng",
             "desc",
             "zip",
             "title"]

    df = pd.read_csv(file,
                     skiprows = 11,
                     header = None,
                     names = names,
                     quotechar = "\"",
                     delimiter = "|",
                     index_col = False,
                     dtype = str,
                     na_values = [],
                     keep_default_na = False,
                     skipinitialspace = True)
    return df
```
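
To see what these parameters do, here is a minimal, self-contained sketch that applies the same `read_csv` arguments to an inline sample. The sample rows are invented for illustration; the real `911.txt` file is not shown in this article.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the shape the generated reader expects:
# an 11-line preamble to skip, then pipe-delimited rows with no header row.
sample = "preamble line\n" * 11 + (
    '40.29|-75.58|"REINDEER CT & DEAD END"|19525|EMS: BACK PAINS\n'
    '40.12|-75.35|"BRIAR PATH & WHITEMARSH LN"|19446|EMS: DIABETIC\n'
)

df = pd.read_csv(io.StringIO(sample),
                 skiprows=11,               # skip the preamble
                 header=None,               # the file has no header row
                 names=["lat", "lng", "desc", "zip", "title"],
                 quotechar='"',             # strip quotes around quoted fields
                 delimiter="|",
                 index_col=False,
                 dtype=str,                 # keep everything as strings (e.g. ZIP codes)
                 na_values=[],
                 keep_default_na=False,     # do not turn empty strings into NaN
                 skipinitialspace=True)

print(df.shape)           # (2, 5)
print(df.loc[0, "desc"])  # REINDEER CT & DEAD END
```

Because `dtype=str` is set, even numeric-looking columns such as `zip` stay strings; the next section shows how datatypes can then be fixed deliberately.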

Code Accelerator can generate code to load delimited, JSON, and fixed-width files into a dataframe. For reading fixed-width files, the `ReadFwfBuilder` optionally takes a human-readable schema file that it can parse to get the column positions. To learn more, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/intro?view=azure-accelerator-py).

## Fixing datatypes in a dataframe

It is common to have a pandas or PySpark dataframe with wrong datatypes. Often, this happens because of a few non-conforming values in a column. As a result, integers are read as floats or strings, and dates are read as strings. Conversely, values such as ZIP codes that should be read as strings are read as integers by default. The effort required to manually fix the datatypes is proportional to the number of columns.

You can use the `DetectTypesBuilder` in these situations. It analyzes the data and, rather than fixing the datatypes in a black-box manner, generates code for fixing them. The code serves as a starting point that you can review, use, or modify as needed.

```python
import prose.codeaccelerator as cx

builder = cx.DetectTypesBuilder(df)

# Set target to pyspark if working with pyspark
# builder.Target = "pyspark"

# Get the generated code to fix the datatypes
builder.learn().code()
```

To learn more, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/fixdatatypes?view=azure-accelerator-py).
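
The generated code typically amounts to per-column conversions. The following is a minimal sketch, not the builder's actual output, of what such conversions look like in plain pandas, using an invented dataframe with a non-conforming value:

```python
import pandas as pd

# Hypothetical dataframe where every column was read as strings.
df = pd.DataFrame({
    "zip":   ["19525", "19446", "19044"],
    "count": ["3", "7", "unknown"],          # one non-conforming value
    "date":  ["2018-08-30", "2018-09-24", "2018-10-01"],
})

# The kind of per-column fixes a type-detection step produces (sketch only):
df["zip"] = df["zip"].astype(str)                          # ZIP codes stay strings
df["count"] = pd.to_numeric(df["count"], errors="coerce")  # non-numbers become NaN
df["date"] = pd.to_datetime(df["date"], errors="coerce")   # strings become datetimes

print(df.dtypes)
```

Writing this by hand for a handful of columns is easy; the builder's value is generating it for wide dataframes and surfacing the non-conforming values for review.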

## Identifying patterns in strings

Another common scenario is detecting patterns in a string column for the purpose of cleaning or grouping. For example, you may have a date column with dates in multiple different formats. To standardize the values, you might want to write conditional statements using regular expressions.

| |Name |BirthDate |
|---|:-------------------------|:--------------|
| 0 |Bertram du Plessis |1995 |
| 1 |Naiara Moravcikova |Unknown |
| 2 |Jihoo Spel |2014 |
| 3 |Viachaslau Gordan Hilario |22-Apr-67 |
| 4 |Maya de Villiers |19-Mar-60 |

Depending on the volume and diversity of the data, writing regular expressions for the different patterns in a column can be a very time-consuming task. The `FindPatternsBuilder` is a powerful code acceleration tool that solves this problem by generating regular expressions for a list of strings.

```python
import prose.codeaccelerator as cx

builder = cx.FindPatternsBuilder(df['BirthDate'])

# Set target to pyspark if working with pyspark
# builder.Target = "pyspark"

builder.learn().regexes
```

Here are the regular expressions generated by the `FindPatternsBuilder` for the above data:

```
^[0-9]{2}-[A-Z][a-z]+-[0-9]{2}$
^[0-9]{2}[\s][A-Z][a-z]+[\s][0-9]{4}$
^[0-9]{4}$
^Unknown$
```
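
Once you have these regexes, you can use them with Python's built-in `re` module to bucket or validate values. A minimal sketch, using the sample `BirthDate` values from the table above:

```python
import re

# The regexes produced above, compiled once.
patterns = [re.compile(p) for p in [
    r"^[0-9]{2}-[A-Z][a-z]+-[0-9]{2}$",
    r"^[0-9]{2}[\s][A-Z][a-z]+[\s][0-9]{4}$",
    r"^[0-9]{4}$",
    r"^Unknown$",
]]

birth_dates = ["1995", "Unknown", "2014", "22-Apr-67", "19-Mar-60"]

def classify(value):
    """Return the index of the first matching pattern, or None."""
    for i, pattern in enumerate(patterns):
        if pattern.match(value):
            return i
    return None

print([classify(v) for v in birth_dates])  # [2, 3, 2, 0, 0]
```

Values that return the same index share a format and can be standardized with the same conversion; a `None` result flags a value no generated pattern covers.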

Apart from generating regular expressions, `FindPatternsBuilder` can also generate code to cluster values based on the generated regexes. It can also assert that all the values in a column conform to the generated regular expressions. To learn more and see other useful scenarios, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/findpatterns?view=azure-accelerator-py).