Commit c39cf54 (parent cd1a253): Adding draft of PROSE article

---
title: Data Wrangling using PROSE Code Accelerator | Microsoft Docs
description:
author: rothja
ms.author: jroth
manager: craigg
ms.date: 09/24/2018
ms.topic: conceptual
ms.prod: sql
---

# Data Wrangling using PROSE Code Accelerator

PROSE Code Accelerator generates readable Python code for your data wrangling tasks. While working in a notebook in Azure Data Studio, you can seamlessly mix the generated code with your hand-written code. This article provides an overview of how to use Code Accelerator.

> [!NOTE]
> Program Synthesis using Examples, also known as PROSE, is a Microsoft technology that generates human-readable code using AI. It does so by analyzing a user's intent as well as the data, generating several candidate programs, and picking the best one using ranking algorithms. To learn more about the PROSE technology, visit the [PROSE homepage](https://microsoft.github.io/prose/).

Code Accelerator comes preinstalled with Azure Data Studio, and you can import it like any other Python package in a notebook. By convention, it is imported as `cx` for short.

```python
import prose.codeaccelerator as cx
```

In the current release, Code Accelerator can intelligently generate Python code for the following tasks:

- Reading data files into a pandas or PySpark dataframe.
- Fixing datatypes in a dataframe.
- Finding regular expressions that represent patterns in a list of strings.

For a general overview of Code Accelerator methods, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/overview?view=azure-accelerator-py).

## Reading data from a file into a dataframe

Reading a file into a dataframe often involves inspecting the file's content to determine the correct parameters to pass to a data-loading library. Depending on the complexity of the file, identifying the correct parameters may take several iterations.

PROSE Code Accelerator solves this problem by analyzing the structure of the data file and automatically generating code to load it. In most cases, the generated code parses the data correctly. In a few cases, you might need to tweak the code to meet your needs.

Consider the following example:

```python
import prose.codeaccelerator as cx

# Create a ReadCsvBuilder to analyze the file content and generate code to load it
builder = cx.ReadCsvBuilder(r'C:/911.txt')

# Set the target to pyspark to generate code that uses the PySpark library
# builder.Target = "pyspark"

# Get the generated code to read the file
builder.learn().code()
```

The previous code block prints the following Python code for reading the delimited file. Notice how PROSE automatically figures out the number of lines to skip, the header, quote characters, delimiters, and so on.

```python
import pandas as pd

def read_file(file):
    names = ["lat",
             "lng",
             "desc",
             "zip",
             "title"]

    df = pd.read_csv(file,
                     skiprows = 11,
                     header = None,
                     names = names,
                     quotechar = "\"",
                     delimiter = "|",
                     index_col = False,
                     dtype = str,
                     na_values = [],
                     keep_default_na = False,
                     skipinitialspace = True)
    return df
```

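To sanity-check what those parameters do, you can run the same `pd.read_csv` arguments against a small in-memory sample. The rows below are invented for illustration and merely stand in for the real file's contents:

```python
import io
import pandas as pd

# Invented sample rows in the same shape the generated reader expects:
# one leading line to skip, pipe-delimited fields, quoted description.
sample = (
    "header line to skip\n"
    '40.29|-75.58|"REINDEER CT & DEAD END"|19525|EMS: BACK PAINS\n'
    '40.26|-75.26|"BRIAR PATH & WHITEMARSH LN"|19446|EMS: DIABETIC\n'
)

df = pd.read_csv(io.StringIO(sample),
                 skiprows = 1,   # the real 911.txt needs skiprows = 11
                 header = None,
                 names = ["lat", "lng", "desc", "zip", "title"],
                 quotechar = "\"",
                 delimiter = "|",
                 index_col = False,
                 dtype = str,
                 na_values = [],
                 keep_default_na = False,
                 skipinitialspace = True)

print(df.shape)  # (2, 5)
```

Note how `quotechar` strips the quotes around the description field, and `dtype = str` keeps the ZIP codes as strings.
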
Code Accelerator can generate code to load delimited, JSON, and fixed-width files into a dataframe. For reading fixed-width files, `ReadFwfBuilder` optionally takes a human-readable schema file that it can parse to get the column positions. To learn more, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/intro?view=azure-accelerator-py).

## Fixing datatypes in a dataframe

It is common to end up with a pandas or PySpark dataframe that has the wrong datatypes. Often this happens because of a few non-conforming values in a column: integers are read as floats or strings, and dates are read as strings. Conversely, values such as ZIP codes that should be read as strings are read as integers by default. The effort required to fix the datatypes manually is proportional to the number of columns.

You can use `DetectTypesBuilder` in these situations. It analyzes the data and, rather than fixing the datatypes in a black-box manner, generates code for fixing them. The code serves as a starting point that you can review, use, or modify as needed.

```python
import prose.codeaccelerator as cx

builder = cx.DetectTypesBuilder(df)

# Set the target to pyspark if working with PySpark
# builder.Target = "pyspark"

# Get the generated code to fix the datatypes
builder.learn().code()
```

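The generated code is specific to your data, but conceptually it boils down to per-column conversions. As a rough, hand-written illustration of the idea (the column names and values here are invented, and this is plain pandas, not the builder's actual output):

```python
import pandas as pd

# Invented sample dataframe with typical datatype problems.
df = pd.DataFrame({
    "zip":   ["19525", "19446", "19401"],            # should stay a string
    "count": ["3", "7", "n/a"],                      # integers with a bad value
    "date":  ["22-Apr-67", "19-Mar-60", "Unknown"],  # dates with a sentinel
})

fixed = df.copy()
# Keep ZIP codes as strings; coerce non-conforming values to NaN/NaT.
fixed["count"] = pd.to_numeric(fixed["count"], errors="coerce")
fixed["date"] = pd.to_datetime(fixed["date"], format="%d-%b-%y", errors="coerce")

print(fixed.dtypes)
```

The generated code follows the same pattern but covers every column it detected, which is exactly the per-column boilerplate that is tedious to write by hand.
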
To learn more, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/fixdatatypes?view=azure-accelerator-py).

## Identifying patterns in strings

Another common scenario is detecting patterns in a string column in order to clean or group its values. For example, you may have a date column with dates in several different formats, and to standardize the values you might want to write conditional statements that use regular expressions.

| |Name |BirthDate |
|---|:-------------------------|:--------------|
| 0 |Bertram du Plessis |1995 |
| 1 |Naiara Moravcikova |Unknown |
| 2 |Jihoo Spel |2014 |
| 3 |Viachaslau Gordan Hilario |22-Apr-67 |
| 4 |Maya de Villiers |19-Mar-60 |

Depending on the volume and diversity of the data, writing regular expressions for the different patterns in a column can be a very time-consuming task. `FindPatternsBuilder` is a powerful code acceleration tool that solves this problem by generating regular expressions for a list of strings.

```python
import prose.codeaccelerator as cx

builder = cx.FindPatternsBuilder(df['BirthDate'])

# Set the target to pyspark if working with PySpark
# builder.Target = "pyspark"

builder.learn().regexes
```

Here are the regular expressions generated by `FindPatternsBuilder` for the above data:

```
^[0-9]{2}-[A-Z][a-z]+-[0-9]{2}$
^[0-9]{2}[\s][A-Z][a-z]+[\s][0-9]{4}$
^[0-9]{4}$
^Unknown$
```

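As a quick sanity check (hand-written here, not generated by the builder), the standard `re` module confirms that each sample value matches exactly one of these patterns:

```python
import re

# The regexes produced by FindPatternsBuilder for the sample column.
regexes = [
    r"^[0-9]{2}-[A-Z][a-z]+-[0-9]{2}$",
    r"^[0-9]{2}[\s][A-Z][a-z]+[\s][0-9]{4}$",
    r"^[0-9]{4}$",
    r"^Unknown$",
]

birth_dates = ["1995", "Unknown", "2014", "22-Apr-67", "19-Mar-60"]

for value in birth_dates:
    matching = [rx for rx in regexes if re.match(rx, value)]
    assert len(matching) == 1, value
    print(f"{value!r} matches {matching[0]}")
```
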
Apart from generating regular expressions, `FindPatternsBuilder` can also generate code that clusters values based on the generated regexes, and it can assert that all the values in a column conform to the generated regular expressions. To learn more and see other useful scenarios, see the [documentation](https://docs.microsoft.com/python/api/overview/azure/prose/findpatterns?view=azure-accelerator-py).
