| title | Plot a histogram for data exploration with Python |
|---|---|
| titleSuffix | SQL machine learning |
| description | Learn how to create a histogram to visualize data using Python. |
| author | dphansen |
| ms.author | davidph |
| ms.date | 07/14/2020 |
| ms.topic | how-to |
| ms.prod | sql |
| ms.technology | machine-learning |
| monikerRange | >=sql-server-2017||>=sql-server-linux-ver15||=azuresqldb-mi-current||=azuresqldb-current||=sqlallproducts-allversions |
[!INCLUDESQL Server SQL DB SQL MI]
This article describes how to plot data using the Python package pandas'.hist(). A SQL database is the source used to visualize the histogram data intervals that have consecutive, non-overlapping values.
::: moniker range=">=sql-server-2017||>=sql-server-linux-ver15"
- SQL Server for Windows or for Linux ::: moniker-end
::: moniker range="=azuresqldb-current"
- Azure SQL Database ::: moniker-end
::: moniker range="=azuresqldb-mi-current"
-
SQL Server Management Studio for restoring the sample database to Azure SQL Managed Instance. ::: moniker-end
-
Azure Data Studio. To install, see Azure Data Studio.
-
Restore sample DW database to get sample data used in this article.
You can verify that the restored database exists by querying the Person.CountryRegion table:
USE AdventureWorksDW;
SELECT * FROM Person.CountryRegion;Download and Install Azure Data Studio.
Install the following Python packages:
- pyodbc
- pandas
To install these packages:
- In your Azure Data Studio notebook, select Manage Packages.
- In the Manage Packages pane, select the Add new tab.
- For each of the following packages, enter the package name, click Search, then click Install.
The distributed data displayed in the histogram is based on a SQL query from AdventureWorksDW. The histogram visualizes data and the frequency of data values. Edit the connection string variables: 'server', 'database', 'username', and 'password' to connect to SQL database.
To create a new notebook:
- In Azure Data Studio, select File, select New Notebook.
- In the notebook, select kernel Python3, select the +code.
- Paste code in notebook, select Run All.
import pyodbc
import pandas as plt
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
server = 'servername'
database = 'AdventureWorksDW'
username = 'yourusername'
password = 'databasename'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
sql = "SELECT DATEDIFF(year, c.BirthDate, GETDATE()) AS Age FROM [dbo].[FactInternetSales] s INNER JOIN dbo.DimCustomer c ON s.CustomerKey = c.CustomerKey"
df = pd.read_sql(sql, cnxn)
df.hist(bins=10)The display shows the age distribution of customers in the FactInternetSales table.
