docs/database-engine/availability-groups/windows/monitor-performance-for-always-on-availability-groups.md
Lines changed: 231 additions & 5 deletions
@@ -34,7 +34,7 @@ manager: craigg
34
34
35
35
-[Useful extended events](#BKMK_XEVENTS)
36
36
37
-
## <a name="BKMK_DATA_SYNC_PROCESS"></a> Data synchronization process
37
+
## Data synchronization process
38
38
To estimate the time to full synchronization and to identify the bottleneck, you need to understand the synchronization process. A performance bottleneck can be anywhere in the process, and locating it can help you dig deeper into the underlying issues. The following figure and table illustrate the data synchronization process:
39
39
40
40

@@ -49,7 +49,7 @@ manager: craigg
49
49
|5|Harden|Log is flushed on the secondary replica for hardening. After the log flush, an acknowledgement is sent back to the primary replica.<br /><br /> Once the log is hardened, data loss is avoided.|Performance counter [SQL Server:Database > Log Bytes Flushed/sec](~/relational-databases/performance-monitor/sql-server-databases-object.md)<br /><br /> Wait type [HADR_LOGCAPTURE_SYNC](~/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql.md)|
50
50
|6|Redo|Redo the flushed pages on the secondary replica. Pages are kept in the redo queue as they wait to be redone.|[SQL Server:Database Replica > Redone Bytes/sec](~/relational-databases/performance-monitor/sql-server-database-replica.md)<br /><br /> [redo_queue_size](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md) (KB) and [redo_rate](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md).<br /><br /> Wait type [REDO_SYNC](~/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql.md)|
51
51
52
-
## <a name="BKMK_FLOW_CONTROL_GATES"></a> Flow control gates
52
+
## Flow control gates
53
53
Availability groups are designed with flow control gates on the primary replica to avoid excessive resource consumption, such as network and memory resources, on all availability replicas. These flow control gates do not affect the synchronization health state of the availability replicas, but they can affect the overall performance of your availability databases, including RPO.
54
54
55
55
After the logs have been captured on the primary replica, they are subject to two levels of flow controls, as shown in the following table.
@@ -66,7 +66,7 @@ manager: craigg
66
66
67
67
Two useful performance counters, [SQL Server:Availability Replica > Flow control/sec](~/relational-databases/performance-monitor/sql-server-availability-replica.md) and [SQL Server:Availability Replica > Flow Control Time (ms/sec)](~/relational-databases/performance-monitor/sql-server-availability-replica.md), show you, within the last second, how many times flow control was activated and how much time was spent waiting on flow control. A higher wait time on flow control translates to a higher RPO. For more information on the types of issues that can cause a high wait time on flow control, see [Troubleshoot: Availability group exceeded RPO](troubleshoot-availability-group-exceeded-rpo.md).
68
68
69
-
## <a name="BKMK_RTO"></a> Estimating failover time (RTO)
69
+
## Estimating failover time (RTO)
70
70
The RTO in your SLA depends on the failover time of your Always On implementation at any given time, which can be expressed in the following formula:
71
71
72
72

@@ -84,7 +84,7 @@ manager: craigg
84
84
85
85
The failover overhead time, Toverhead, includes the time it takes to fail over the WSFC cluster and to bring the databases online. This time is usually short and constant.
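As a rough illustration of how the pieces of failover time add up, the following Python sketch sums a detection time, an overhead time, and a redo time. The three-term decomposition is an assumption based on the components discussed in this article; the function name, parameter names, and sample values are hypothetical.

```python
def estimate_failover_seconds(t_detection: float, t_overhead: float, t_redo: float) -> float:
    """Sum the assumed components of failover time (all in seconds).

    t_detection: time to detect the failure (hypothetical input)
    t_overhead:  time to fail over the cluster and bring databases online
    t_redo:      time for the secondary to redo the log in its redo queue
    """
    for name, value in (("t_detection", t_detection),
                        ("t_overhead", t_overhead),
                        ("t_redo", t_redo)):
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")
    return t_detection + t_overhead + t_redo


# Hypothetical values: 10 s detection, 5 s overhead, 30 s of redo work.
total = estimate_failover_seconds(10, 5, 30)
```

Because the overhead term is usually short and constant, the redo term is typically the one worth tracking over time.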
86
86
87
-
## <a name="BKMK_RPO"></a> Estimating potential data loss (RPO)
87
+
## Estimating potential data loss (RPO)
88
88
The RPO in your SLA depends on the possible data loss of your Always On implementation at any given time. This possible data loss can be expressed in the following formula:
89
89
90
90

@@ -97,8 +97,234 @@ manager: craigg
97
97
The log send queue represents all the data that can be lost from a catastrophic failure. At first glance, it is curious that the log generation rate is used instead of the log send rate (see [log_send_rate](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md)). However, remember that using the log send rate only gives you the time to synchronize, while RPO measures data loss based on how fast it is generated, not on how fast it is synchronized.
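The distinction between the two rates can be made concrete with a small Python sketch. It assumes the potential data loss is the log send queue divided by the log generation rate, as the paragraph above describes; the function name, parameter names, and units (KB and KB/sec) are hypothetical choices for illustration.

```python
from typing import Optional


def estimate_data_loss_seconds(log_send_queue_kb: float,
                               log_generation_rate_kb_per_sec: float) -> Optional[float]:
    """Estimate potential data loss as a time window, in seconds.

    Returns None when the queue is non-empty but the generation rate is
    zero or negative, because no meaningful estimate exists in that case.
    """
    if log_send_queue_kb < 0:
        raise ValueError("log_send_queue_kb must be non-negative")
    if log_send_queue_kb == 0:
        return 0.0  # nothing is waiting to be sent, so nothing can be lost
    if log_generation_rate_kb_per_sec <= 0:
        return None
    return log_send_queue_kb / log_generation_rate_kb_per_sec


# Hypothetical values: 3000 KB queued, log generated at 500 KB/sec.
window = estimate_data_loss_seconds(3000, 500)
```

Dividing the same queue by the log send rate instead would answer a different question: how long the secondary needs to catch up, not how much committed work is at risk.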
98
98
99
99
A simpler way to estimate Tdata_loss is to use [last_commit_time](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md). The DMV on the primary replica reports this value for all replicas. You can calculate the difference between the value for the primary replica and the value for the secondary replica to estimate how fast the log on the secondary replica is catching up to the primary replica. As stated previously, this calculation does not tell you the potential data loss based on how fast the log is generated, but it should be a close approximation.
100
+
101
+
## Estimate RTO & RPO with the SSMS dashboard
102
+
In Always On Availability Groups, the RTO and RPO are calculated and displayed for the databases hosted on the secondary replicas. On the dashboard of the primary replica, the RTO and RPO are grouped by the secondary replica.
103
+
104
+
To view the RTO and RPO within the dashboard, do the following:
105
+
1. In SQL Server Management Studio, expand the **Always On High Availability** node, right-click the name of your availability group, and select **Show Dashboard**.
106
+
1. Select **Add/Remove Columns** under the **Group by** tab. Check both **Estimated Recovery Time(seconds)** [RTO] and **Estimated Data Loss (time)** [RPO].
The recovery time calculation determines how much time is needed to recover the *secondary database* after a failover happens. The failover time is usually short and constant. The detection time depends on cluster-level settings and not on the individual availability replicas.
112
+
113
+
114
+
For a secondary database (DB_sec), calculation and display of its RTO is based on its **redo_queue_size** and **redo_rate**:
115
+
116
+

117
+
118
+
Except in corner cases, the formula to calculate a secondary database's RTO is:
119
+
120
+

121
+
122
+
123
+
124
+
### Calculation of secondary database RPO
125
+
126
+
For a secondary database (DB_sec), its RPO is calculated and displayed based on its **is_failover_ready** and **last_commit_time** values and on the **last_commit_time** of its correlated primary database (DB_pri). When **is_failover_ready** = 1 for the secondary database, the data is synchronized and no data loss will occur upon failover. However, if this value is 0, there is a gap between the **last_commit_time** on the primary database and the **last_commit_time** on the secondary database.
127
+
128
+
For the primary database, the **last_commit_time** is the time when the latest transaction has been committed. For the secondary database, the **last_commit_time** is the latest commit time for the transaction on the primary database that has been successfully hardened on the secondary database as well. This number should be the same for both the primary and secondary database. A gap between these two values is the duration in which pending transactions have not been hardened on the secondary database, and will be lost in the event of a failover.
129
+
130
+

131
+
132
+
### Performance Counters used in RTO/RPO formulas
133
+
134
+
- **redo_queue_size** (KB) [*used in RTO*]: The redo queue size is the size of the transaction log between **last_received_lsn** and **last_redone_lsn**. **last_received_lsn** is the log block ID identifying the point up to which all log blocks have been received by the secondary replica that hosts this secondary database. **last_redone_lsn** is the log sequence number of the last log record that was redone on the secondary database. Based on these two values, we can find the IDs of the starting log block (**last_received_lsn**) and the end log block (**last_redone_lsn**). The space between these two log blocks represents how many transaction log blocks have not yet been redone. This is measured in kilobytes (KB).
135
+
- **redo_rate** (KB/sec) [*used in RTO*]: An accumulative value that represents how much of the transaction log has been redone on the secondary database over a period of elapsed time, in kilobytes (KB)/second.
136
+
- **last_commit_time** (Datetime) [*used in RPO*]: For the primary database, **last_commit_time** is the time when the latest transaction has been committed. For the secondary database, the **last_commit_time** is the latest commit time for the transaction on the primary database that has been successfully hardened on the secondary database as well. Since this value on the secondary should be synchronized with the same value on the primary, any gap between these two values is the estimate of data loss (RPO).
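The arithmetic behind these three values can be sketched in Python as follows. This is an illustration of the calculations described above, not a reproduction of the dashboard's logic: the function names are hypothetical, the inputs stand in for values you would read from the DMV columns, and the handling of unavailable values (returning `None`) is an assumption.

```python
from datetime import datetime
from typing import Optional


def estimate_rto_seconds(redo_queue_size_kb: Optional[float],
                         redo_rate_kb_per_sec: Optional[float]) -> Optional[float]:
    """RTO estimate: time to drain the redo queue at the current redo rate."""
    if redo_queue_size_kb is None:
        return None  # not available
    if redo_queue_size_kb == 0:
        return 0.0  # nothing left to redo
    if not redo_rate_kb_per_sec:
        return None  # rate is missing or zero, so no estimate is possible
    return redo_queue_size_kb / redo_rate_kb_per_sec


def estimate_rpo_seconds(is_failover_ready: bool,
                         primary_last_commit: datetime,
                         secondary_last_commit: datetime) -> float:
    """RPO estimate: gap between primary and secondary last_commit_time."""
    if is_failover_ready:
        return 0.0  # data is synchronized, so no loss on failover
    gap = (primary_last_commit - secondary_last_commit).total_seconds()
    return max(gap, 0.0)  # clamp in case the sampled values are skewed


# Hypothetical sample values.
rto = estimate_rto_seconds(10240, 2048)
rpo = estimate_rpo_seconds(False,
                           datetime(2018, 1, 1, 12, 0, 10),
                           datetime(2018, 1, 1, 12, 0, 3))
```

With these sample values, the redo queue drains in 5 seconds and the commit-time gap is 7 seconds.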
137
+
138
+
## Estimate RTO and RPO using DMVs
139
+
140
+
It is possible to query the DMVs [sys.dm_hadr_database_replica_states](../../../relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md) and [sys.dm_hadr_database_replica_cluster_states](../../../relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-cluster-states-transact-sql.md) to estimate the RPO and RTO of a database. The following queries create stored procedures that estimate each value.
141
+
142
+
>[!NOTE]
143
+
> Be sure to create and run the stored procedure to estimate the RTO first, as the values it produces are necessary to run the stored procedure for estimating the RPO.
144
+
145
+
### Create a stored procedure to estimate RTO
146
+
147
+
1. On the target secondary replica, create stored procedure **proc_calculate_RTO**. If this stored procedure already exists, drop it first, and then recreate it.
148
+
149
+
```sql
150
+
if object_id(N'proc_calculate_RTO', 'p') is not null
151
+
drop procedure proc_calculate_RTO
152
+
go
153
+
154
+
raiserror('creating procedure proc_calculate_RTO', 0,1) with nowait
155
+
go
156
+
--
157
+
-- name: proc_calculate_RTO
158
+
--
159
+
-- description: Calculate RTO of a secondary database.
160
+
--
161
+
-- parameters: @secondary_database_name nvarchar(max): name of the secondary database.
if @is_primary_replica is null or @is_failover_ready is null or @redo_queue_size is null or @replica_id is null or @group_database_id is null or @group_id is null
193
+
begin
194
+
print 'RTO of Database '+ @secondary_database_name +' is not available'
195
+
return
196
+
end
197
+
else if @is_primary_replica =1
198
+
begin
199
+
print 'You are visiting wrong replica';
200
+
return
201
+
end
202
+
203
+
if @redo_queue_size =0
204
+
set @RTO =0
205
+
else if @redo_rate is null or @redo_rate =0
206
+
begin
207
+
print 'RTO of Database '+ @secondary_database_name +' is not available'
208
+
return
209
+
end
210
+
else
211
+
set @RTO = CAST(@redo_queue_size AS float) / @redo_rate
212
+
213
+
print 'RTO of Database '+ @secondary_database_name +' is '+convert(varchar, ceiling(@RTO))
214
+
print 'group_id of Database '+ @secondary_database_name +' is '+convert(nvarchar(50), @group_id)
215
+
print 'replica_id of Database '+ @secondary_database_name +' is '+convert(nvarchar(50), @replica_id)
216
+
print 'group_database_id of Database '+ @secondary_database_name +' is '+convert(nvarchar(50), @group_database_id)
217
+
end
218
+
```
219
+
220
+
2. Execute **proc_calculate_RTO** with the target secondary database name:
3. The output displays the RTO value of the target secondary replica database. Save the *group_id*, *replica_id*, and *group_database_id* to use with the RPO-estimation stored procedure.
225
+
226
+
Sample Output:
227
+
<br>RTO of Database DB_sec is 0
228
+
<br>group_id of Database DB4 is F176DD65-C3EE-4240-BA23-EA615F965C9B
229
+
<br>replica_id of Database DB4 is 405554F6-3FDC-4593-A650-2067F5FABFFD
230
+
<br>group_database_id of Database DB4 is 39F7942F-7B5E-42C5-977D-02E7FFA6C392
231
+
232
+
### Create a stored procedure to estimate RPO
233
+
1. On the primary replica, create stored procedure **proc_calculate_RPO**. If it already exists, drop it first, and then recreate it.
234
+
235
+
```sql
236
+
if object_id(N'proc_calculate_RPO', 'p') is not null
237
+
drop procedure proc_calculate_RPO
238
+
go
239
+
240
+
raiserror('creating procedure proc_calculate_RPO', 0,1) with nowait
241
+
go
242
+
--
243
+
-- name: proc_calculate_RPO
244
+
--
245
+
-- description: Calculate RPO of a secondary database.
246
+
--
247
+
-- parameters: @group_id uniqueidentifier: group_id of the secondary database.
248
+
-- @replica_id uniqueidentifier: replica_id of the secondary database.
249
+
-- @group_database_id uniqueidentifier: group_database_id of the secondary database.
3. The output displays the RPO value of the target secondary replica database.
325
+
100
326
101
-
## <a name="BKMK_Monitoring_for_RTO_and_RPO"></a> Monitoring for RTO and RPO
327
+
## Monitoring for RTO and RPO
102
328
This section demonstrates how to monitor your availability groups for RTO and RPO metrics. This demonstration is similar to the GUI tutorial given in [The Always On health model, part 2: Extending the health model](http://blogs.msdn.com/b/sqlalwayson/archive/2012/02/13/extending-the-alwayson-health-model.aspx).
103
329
104
330
Elements of the failover time and potential data loss calculations in [Estimating failover time (RTO)](#BKMK_RTO) and [Estimating potential data loss (RPO)](#BKMK_RPO) are conveniently provided as performance metrics in the policy management facet **Database Replica State** (see [View the policy-based management facets on a SQL Server object](~/relational-databases/policy-based-management/view-the-policy-based-management-facets-on-a-sql-server-object.md)). You can monitor these two metrics on a schedule and be alerted when the metrics exceed your RTO and RPO, respectively.
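The alerting idea can be sketched outside of policy-based management as a simple threshold check. The Python below is a minimal illustration only: the function name, the three status strings, and the choice to report an unavailable estimate as `"unknown"` rather than as healthy are all assumptions, not part of the policy management facet.

```python
from typing import Optional


def check_against_sla(estimated_seconds: Optional[float],
                      target_seconds: float) -> str:
    """Compare an RTO or RPO estimate against its SLA target.

    Returns "unknown" when no estimate is available, "alert" when the
    estimate exceeds the target, and "ok" otherwise.
    """
    if estimated_seconds is None:
        return "unknown"
    return "alert" if estimated_seconds > target_seconds else "ok"


# Hypothetical targets: a 60-second RTO and a 5-second RPO.
rto_status = check_against_sla(45.0, 60.0)
rpo_status = check_against_sla(7.0, 5.0)
```

Running a check like this on a schedule mirrors the approach described above: each metric is evaluated against its own target, and only the one that exceeds it raises an alert.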