Skip to content

Latest commit

 

History

History
153 lines (117 loc) · 5.85 KB

File metadata and controls

153 lines (117 loc) · 5.85 KB
title Python tutorial: Prepare cluster data
description In part two of this four-part tutorial series, you'll prepare SQL data to perform clustering in Python with SQL Server Machine Learning Services.
ms.prod sql
ms.technology machine-learning
ms.devlang python
ms.date 08/30/2019
ms.topic tutorial
author garyericson
ms.author garye
ms.reviewer davidph
ms.custom seo-lt-2019
monikerRange >=sql-server-2017||>=sql-server-linux-ver15||=sqlallproducts-allversions

Tutorial: Prepare data to categorize customers in Python with SQL Server Machine Learning Services

[!INCLUDEappliesto-ss-xxxx-xxxx-xxx-md]

In part two of this four-part tutorial series, you'll restore and prepare the data from a SQL database using Python. Later in this series, you'll use this data to train and deploy a clustering model in Python with SQL Server Machine Learning Services.

In this article, you'll learn how to:

[!div class="checklist"]

  • Separate customers along different dimensions using Python
  • Load the data from the SQL database into a Python data frame

In part one, you installed the prerequisites and restored the sample database.

In part three, you'll learn how to create and train a K-Means clustering model in Python.

In part four, you'll learn how to create a stored procedure in a SQL database that can perform clustering in Python based on new data.

Prerequisites

  • Part two of this tutorial assumes you have fulfilled the prerequisites of part one.

Separate customers

To prepare for clustering customers, you'll first separate customers along the following dimensions:

  • orderRatio = return order ratio (total number of orders partially or fully returned versus the total number of orders)
  • itemsRatio = return item ratio (total number of items returned versus the number of items purchased)
  • monetaryRatio = return amount ratio (total monetary amount of items returned versus the amount purchased)
  • frequency = return frequency

Open a new notebook in Azure Data Studio and enter the following script.

In the connection string, replace connection details as needed.

# Load packages.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import revoscalepy as revoscale
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster

################################################################################################

## Connect to DB and select data

################################################################################################

# Connection string to connect to SQL Server named instance.
conn_str = 'Driver=SQL Server;Server=localhost;Database=tpcxbb_1gb;Trusted_Connection=True;'

input_query = '''SELECT
ss_customer_sk AS customer,
ROUND(COALESCE(returns_count / NULLIF(1.0*orders_count, 0), 0), 7) AS orderRatio,
ROUND(COALESCE(returns_items / NULLIF(1.0*orders_items, 0), 0), 7) AS itemsRatio,
ROUND(COALESCE(returns_money / NULLIF(1.0*orders_money, 0), 0), 7) AS monetaryRatio,
COALESCE(returns_count, 0) AS frequency
FROM
(
  SELECT
    ss_customer_sk,
    -- return order ratio
    COUNT(distinct(ss_ticket_number)) AS orders_count,
    -- return ss_item_sk ratio
    COUNT(ss_item_sk) AS orders_items,
    -- return monetary amount ratio
    SUM( ss_net_paid ) AS orders_money
  FROM store_sales s
  GROUP BY ss_customer_sk
) orders
LEFT OUTER JOIN
(
  SELECT
    sr_customer_sk,
    -- return order ratio
    count(distinct(sr_ticket_number)) as returns_count,
    -- return ss_item_sk ratio
    COUNT(sr_item_sk) as returns_items,
    -- return monetary amount ratio
    SUM( sr_return_amt ) AS returns_money
FROM store_returns
GROUP BY sr_customer_sk ) returned ON ss_customer_sk=sr_customer_sk'''


# Define the columns we wish to import.
column_info = {
    "customer": {"type": "integer"},
    "orderRatio": {"type": "integer"},
    "itemsRatio": {"type": "integer"},
    "frequency": {"type": "integer"}
}

Load the data into a data frame

Results from the query are returned to Python using the revoscalepy RxSqlServerData function. As part of the process, you'll use the column information you defined in the previous script.

data_source = revoscale.RxSqlServerData(sql_query=input_query, column_Info=column_info,
                                        connection_string=conn_str)
revoscale.RxInSqlServer(connection_string=conn_str, num_tasks=1, auto_cleanup=False)
# import data source and convert to pandas dataframe.
customer_data = pd.DataFrame(revoscale.rx_import(data_source))

Now display the beginning of the data frame to verify it looks correct.

print("Data frame:", customer_data.head(n=5))
Rows Read: 37336, Total Rows Processed: 37336, Total Chunk Time: 0.172 seconds
Data frame:     customer  orderRatio  itemsRatio  monetaryRatio  frequency
0    29727.0    0.000000    0.000000       0.000000          0
1    97643.0    0.068182    0.078176       0.037034          3
2    57247.0    0.000000    0.000000       0.000000          0
3    32549.0    0.086957    0.068657       0.031281          4
4     2040.0    0.000000    0.000000       0.000000          0

Clean up resources

If you're not going to continue with this tutorial, delete the tpcxbb_1gb database from SQL Server.

Next steps

In part two of this tutorial series, you completed these steps:

  • Separate customers along different dimensions using Python
  • Load the data from the SQL database into a Python data frame

To create a machine learning model that uses this customer data, follow part three of this tutorial series:

[!div class="nextstepaction"] Tutorial: Create a predictive model in Python with SQL Server Machine Learning Services