Commit ad8602c

Merge pull request #6884 from MashaMSFT/20180813_agrpo
added rpo and rto info to AG doc (vsts 1291690)
2 parents 6db2cd9 + 772679e commit ad8602c

6 files changed

Lines changed: 231 additions & 5 deletions

docs/database-engine/availability-groups/windows/monitor-performance-for-always-on-availability-groups.md

@@ -34,7 +34,7 @@ manager: craigg
3434

3535
- [Useful extended events](#BKMK_XEVENTS)
3636

37-
## <a name="BKMK_DATA_SYNC_PROCESS"></a> Data synchronization process
37+
## Data synchronization process
3838
To estimate the time to full synchronization and to identify the bottleneck, you need to understand the synchronization process. A performance bottleneck can be anywhere in the process, and locating it helps you dig deeper into the underlying issues. The following figure and table illustrate the data synchronization process:
3939

4040
![Availability group data synchronization](media/always-onag-datasynchronization.gif "Availability group data synchronization")
@@ -49,7 +49,7 @@ manager: craigg
4949
|5|Harden|Log is flushed on the secondary replica for hardening. After the log flush, an acknowledgement is sent back to the primary replica.<br /><br /> Once the log is hardened, data loss is avoided.|Performance counter [SQL Server:Database > Log Bytes Flushed/sec](~/relational-databases/performance-monitor/sql-server-databases-object.md)<br /><br /> Wait type [HADR_LOGCAPTURE_SYNC](~/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql.md)|
5050
|6|Redo|Redo the flushed pages on the secondary replica. Pages are kept in the redo queue as they wait to be redone.|[SQL Server:Database Replica > Redone Bytes/sec](~/relational-databases/performance-monitor/sql-server-database-replica.md)<br /><br /> [redo_queue_size](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md) (KB) and [redo_rate](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md).<br /><br /> Wait type [REDO_SYNC](~/relational-databases/system-dynamic-management-views/sys-dm-os-wait-stats-transact-sql.md)|
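
To check whether the wait types named in the harden and redo rows above are accumulating on an instance, a query along these lines can be run against `sys.dm_os_wait_stats`. This is a minimal sketch that covers only the two wait types visible in those rows. The values are cumulative since the instance started or since the wait statistics were last cleared, so compare two samples taken some time apart instead of reading a single value.

```sql
-- Sketch: cumulative waits for the harden and redo steps.
-- A wait type that has never occurred on this instance may return no row.
select
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms
from sys.dm_os_wait_stats
where wait_type in (N'HADR_LOGCAPTURE_SYNC', N'REDO_SYNC')
order by wait_time_ms desc;
```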
5151

52-
## <a name="BKMK_FLOW_CONTROL_GATES"></a> Flow control gates
52+
## Flow control gates
5353
Availability groups are designed with flow control gates on the primary replica to avoid excessive resource consumption, such as network and memory resources, on all availability replicas. These flow control gates do not affect the synchronization health state of the availability replicas, but they can affect the overall performance of your availability databases, including RPO.
5454

5555
After the logs have been captured on the primary replica, they are subject to two levels of flow controls, as shown in the following table.
@@ -66,7 +66,7 @@ manager: craigg
6666

6767
Two useful performance counters, [SQL Server:Availability Replica > Flow control/sec](~/relational-databases/performance-monitor/sql-server-availability-replica.md) and [SQL Server:Availability Replica > Flow Control Time (ms/sec)](~/relational-databases/performance-monitor/sql-server-availability-replica.md), show you, within the last second, how many times flow control was activated and how much time was spent waiting on flow control. A higher wait time on flow control translates to a higher RPO. For more information on the types of issues that can cause a high wait time on flow control, see [Troubleshoot: Availability group exceeded RPO](troubleshoot-availability-group-exceeded-rpo.md).
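
If Performance Monitor is not at hand, the same counters can usually be read from `sys.dm_os_performance_counters`. The following is a sketch, not a definitive query: the `like` filters are assumptions about how the object and counter names appear on your instance, and on a case-sensitive collation the casing must match exactly. For per-second counters this DMV typically exposes a cumulative raw value, so take two samples and divide the difference by the elapsed seconds.

```sql
-- Sketch: raw flow control counters for each availability replica instance.
-- The name filters below are assumptions; adjust them to match your instance.
select
    rtrim(object_name)   as counter_object,
    rtrim(counter_name)  as counter_name,
    rtrim(instance_name) as replica_name,
    cntr_value
from sys.dm_os_performance_counters
where object_name like N'%Availability Replica%'
  and counter_name like N'Flow Control%';
```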
6868

69-
## <a name="BKMK_RTO"></a> Estimating failover time (RTO)
69+
## Estimating failover time (RTO)
7070
The RTO in your SLA depends on the failover time of your Always On implementation at any given time, which can be expressed in the following formula:
7171

7272
![Availability groups RTO calculation](media/always-on-rto.gif "Availability groups RTO calculation")
@@ -84,7 +84,7 @@ manager: craigg
8484

8585
The failover overhead time, Toverhead, includes the time it takes to fail over the WSFC cluster and to bring the databases online. This time is usually short and constant.
8686

87-
## <a name="BKMK_RPO"></a> Estimating potential data loss (RPO)
87+
## Estimating potential data loss (RPO)
8888
The RPO in your SLA depends on the possible data loss of your Always On implementation at any given time. This possible data loss can be expressed in the following formula:
8989

9090
![Availability groups RPO calculation](media/always-on-rpo.gif "Availability groups RPO calculation")
@@ -97,8 +97,234 @@ manager: craigg
9797
The log send queue represents all the data that can be lost from a catastrophic failure. At first glance, it is curious that the log generation rate is used instead of the log send rate (see [log_send_rate](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md)). However, remember that using the log send rate only gives you the time to synchronize, while RPO measures data loss based on how fast it is generated, not on how fast it is synchronized.
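
The formula image is not reproduced in this view. Read from the description above, the relationship can be sketched as follows, with a worked example that uses hypothetical numbers chosen only for illustration.

```latex
T_{data\_loss} = \frac{\text{log\_send\_queue}}{\text{log generation rate}}

% Hypothetical example: a 10,240 KB log send queue while log is generated at 2,048 KB/sec
T_{data\_loss} = \frac{10240\ \text{KB}}{2048\ \text{KB/sec}} = 5\ \text{sec}
```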
9898

9999
A simpler way to estimate Tdata_loss is to use [last_commit_time](~/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md). The DMV on the primary replica reports this value for all replicas. You can calculate the difference between the value for the primary replica and the value for the secondary replica to estimate how far the log on the secondary replica lags behind the primary replica. As stated previously, this calculation does not tell you the potential data loss based on how fast the log is generated, but it should be a close approximation.
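
The **last_commit_time** comparison described above can be sketched as a single query. It assumes the query runs on the primary replica (a secondary replica only reports its own rows, so the self-join would return nothing) and that `is_primary_replica` is available on your version. The column aliases are illustrative.

```sql
-- Sketch: run on the primary replica.
-- Compares each secondary database's last_commit_time with the primary's.
select
    dbcs.database_name,
    ar.replica_server_name,
    pri.last_commit_time as primary_last_commit_time,
    sec.last_commit_time as secondary_last_commit_time,
    datediff(second, sec.last_commit_time, pri.last_commit_time) as estimated_data_loss_sec
from sys.dm_hadr_database_replica_states as sec
join sys.dm_hadr_database_replica_states as pri
    on  pri.group_database_id = sec.group_database_id
    and pri.is_primary_replica = 1
join sys.dm_hadr_database_replica_cluster_states as dbcs
    on  dbcs.replica_id = sec.replica_id
    and dbcs.group_database_id = sec.group_database_id
join sys.availability_replicas as ar
    on  ar.replica_id = sec.replica_id
where sec.is_primary_replica = 0;
```

A `null` in the last column means one of the two commit times is not available for that database.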
100+
101+
## Estimate RTO & RPO with the SSMS dashboard
102+
In Always On Availability Groups, the RTO and RPO are calculated and displayed for the databases hosted on the secondary replicas. On the dashboard of the primary replica, the RTO and RPO are grouped by secondary replica.
103+
104+
To view the RTO and RPO within the dashboard, do the following:
105+
1. In SQL Server Management Studio, expand the **Always On High Availability** node, right-click the name of your availability group, and select **Show Dashboard**.
106+
1. Select **Add/Remove Columns** under the **Group by** tab. Check both **Estimated Recovery Time(seconds)** [RTO] and **Estimated Data Loss (time)** [RPO].
107+
108+
![rto-rpo-dashboard.png](media/rto-rpo-dashboard.png)
109+
110+
### Calculation of secondary database RTO
111+
The recovery time calculation determines how much time is needed to recover the *secondary database* after a failover happens. The failover time is usually short and constant. The detection time depends on cluster-level settings and not on the individual availability replicas.
112+
113+
114+
For a secondary database (DB_sec), calculation and display of its RTO is based on its **redo_queue_size** and **redo_rate**:
115+
116+
![Calculation of RTO](media/calculate-rto.png)
117+
118+
Except in corner cases, the formula to calculate a secondary database's RTO is:
119+
120+
![Formula to calculate RTO](media/formula-calc-second-dba-rto.png)
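
As a quick check without the dashboard, the same ratio can be computed directly from the DMV. This is a minimal sketch that assumes it runs on the secondary replica. It treats an empty redo queue as 0 seconds and returns `null` when **redo_rate** is missing or zero, and the column alias is illustrative.

```sql
-- Sketch: run on the secondary replica.
-- Estimated redo time (seconds) = redo_queue_size (KB) / redo_rate (KB/sec).
select
    db_name(database_id) as database_name,
    redo_queue_size,
    redo_rate,
    case
        when redo_queue_size = 0 then 0
        when redo_rate is null or redo_rate = 0 then null
        else ceiling(cast(redo_queue_size as float) / redo_rate)
    end as estimated_redo_seconds
from sys.dm_hadr_database_replica_states
where is_local = 1
  and is_primary_replica = 0;
```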
121+
122+
123+
124+
### Calculation of secondary database RPO
125+
126+
For a secondary database (DB_sec), calculation and display of its RPO is based on its **is_failover_ready** value, its **last_commit_time**, and the **last_commit_time** of its correlated primary database (DB_pri). When **is_failover_ready** = 1 for the secondary database, the data is synchronized and no data loss will occur upon failover. However, if this value is 0, then there is a gap between the **last_commit_time** on the primary database and the **last_commit_time** on the secondary database.
127+
128+
For the primary database, the **last_commit_time** is the time when the latest transaction has been committed. For the secondary database, the **last_commit_time** is the latest commit time for the transaction on the primary database that has been successfully hardened on the secondary database as well. This number should be the same for both the primary and secondary database. A gap between these two values is the duration in which pending transactions have not been hardened on the secondary database, and will be lost in the event of a failover.
129+
130+
![Calculation of RPO](media/calculate-rpo.png)
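
The image is not reproduced in this view. Based only on the two paragraphs above, the calculation can be written as the following sketch, where the subscripts `pri` and `sec` stand for DB_pri and DB_sec.

```latex
RPO_{DB_{sec}} =
\begin{cases}
0, & \text{if is\_failover\_ready} = 1 \\
\text{last\_commit\_time}_{pri} - \text{last\_commit\_time}_{sec}, & \text{if is\_failover\_ready} = 0
\end{cases}
```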
131+
132+
### Performance Counters used in RTO/RPO formulas
133+
134+
- **redo_queue_size** (KB) [*used in RTO*]: The redo queue size is the size of the transaction log between **last_received_lsn** and **last_redone_lsn**. **last_received_lsn** is the log block ID identifying the point up to which all log blocks have been received by the secondary replica that hosts this secondary database. **last_redone_lsn** is the log sequence number of the last log record that was redone on the secondary database. Based on these two values, we can find the IDs of the starting log block (**last_received_lsn**) and the end log block (**last_redone_lsn**). The space between these two log blocks represents how many transaction log blocks have not yet been redone. This is measured in kilobytes (KB).
135+
- **redo_rate** (KB/sec) [*used in RTO*]: A cumulative value that represents how much of the transaction log has been redone on the secondary database over an elapsed period of time, in kilobytes per second (KB/sec).
136+
- **last_commit_time** (Datetime) [*used in RPO*]: For the primary database, **last_commit_time** is the time when the latest transaction has been committed. For the secondary database, the **last_commit_time** is the latest commit time for the transaction on the primary database that has been successfully hardened on the secondary database as well. Since this value on the secondary should be synchronized with the same value on the primary, any gap between these two values is the estimate of data loss (RPO).
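
To look at the raw inputs listed above before applying any formula, a query such as the following can be used. It is a sketch that returns only the rows for databases hosted on the replica where it runs.

```sql
-- Sketch: raw values behind the RTO and RPO estimates for local databases.
select
    db_name(database_id) as database_name,
    last_received_lsn,
    last_redone_lsn,
    redo_queue_size,   -- KB
    redo_rate,         -- KB/sec
    last_commit_time
from sys.dm_hadr_database_replica_states
where is_local = 1;
```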
137+
138+
## Estimate RTO and RPO using DMVs
139+
140+
It is possible to query the DMVs [sys.dm_hadr_database_replica_states](../../../relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql.md) and [sys.dm_hadr_database_replica_cluster_states](../../../relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-cluster-states-transact-sql.md) to estimate the RPO and RTO of a database. The following queries create stored procedures that estimate each value.
141+
142+
>[!NOTE]
143+
> Be sure to create and run the stored procedure to estimate the RTO first, as the values it produces are necessary to run the stored procedure for estimating the RPO.
144+
145+
### Create a stored procedure to estimate RTO
146+
147+
1. On the target secondary replica, create stored procedure **proc_calculate_RTO**. If this stored procedure already exists, drop it first, and then recreate it.
148+
149+
```sql
-- Drop the procedure if it already exists so that it can be recreated.
if object_id(N'proc_calculate_RTO', 'p') is not null
    drop procedure proc_calculate_RTO
go

raiserror('creating procedure proc_calculate_RTO', 0,1) with nowait
go
--
-- name: proc_calculate_RTO
--
-- description: Calculate RTO of a secondary database.
--
-- parameters: @secondary_database_name nvarchar(max): name of the secondary database.
--
-- security: this is a public interface object.
--
create procedure proc_calculate_RTO
(
    @secondary_database_name nvarchar(max)
)
as
begin
    declare @is_primary_replica bit
    declare @is_failover_ready bit
    declare @redo_queue_size bigint
    declare @redo_rate bigint
    declare @replica_id uniqueidentifier
    declare @group_database_id uniqueidentifier
    declare @group_id uniqueidentifier
    declare @RTO float

    -- Read the local replica state for the requested database.
    select
        @is_primary_replica = dbr.is_primary_replica,
        @is_failover_ready = dbcs.is_failover_ready,
        @redo_queue_size = dbr.redo_queue_size,
        @redo_rate = dbr.redo_rate,
        @replica_id = dbr.replica_id,
        @group_database_id = dbr.group_database_id,
        @group_id = dbr.group_id
    from sys.dm_hadr_database_replica_states dbr
    join sys.dm_hadr_database_replica_cluster_states dbcs
        on dbr.replica_id = dbcs.replica_id
        and dbr.group_database_id = dbcs.group_database_id
    where dbcs.database_name = @secondary_database_name

    -- Stop if the state is missing or if this is the primary replica.
    if @is_primary_replica is null or @is_failover_ready is null or @redo_queue_size is null or @replica_id is null or @group_database_id is null or @group_id is null
    begin
        print 'RTO of Database '+ @secondary_database_name +' is not available'
        return
    end
    else if @is_primary_replica = 1
    begin
        print 'You are visiting wrong replica';
        return
    end

    -- RTO (seconds) = redo_queue_size (KB) / redo_rate (KB/sec).
    if @redo_queue_size = 0
        set @RTO = 0
    else if @redo_rate is null or @redo_rate = 0
    begin
        print 'RTO of Database '+ @secondary_database_name +' is not available'
        return
    end
    else
        set @RTO = CAST(@redo_queue_size AS float) / @redo_rate

    -- Cast to bigint before converting so that large values are not printed in scientific notation.
    print 'RTO of Database '+ @secondary_database_name +' is ' + convert(varchar(20), cast(ceiling(@RTO) as bigint))
    print 'group_id of Database '+ @secondary_database_name +' is ' + convert(nvarchar(50), @group_id)
    print 'replica_id of Database '+ @secondary_database_name +' is ' + convert(nvarchar(50), @replica_id)
    print 'group_database_id of Database '+ @secondary_database_name +' is ' + convert(nvarchar(50), @group_database_id)
end
```
219+
220+
2. Execute **proc_calculate_RTO** with the target secondary database name:
221+
```sql
exec proc_calculate_RTO @secondary_database_name = N'DB_sec'
```
224+
3. The output displays the RTO value of the target secondary replica database. Save the *group_id*, *replica_id*, and *group_database_id* to use with the RPO-estimation stored procedure.
225+
226+
Sample Output:
227+
<br>RTO of Database DB_sec is 0
<br>group_id of Database DB_sec is F176DD65-C3EE-4240-BA23-EA615F965C9B
<br>replica_id of Database DB_sec is 405554F6-3FDC-4593-A650-2067F5FABFFD
<br>group_database_id of Database DB_sec is 39F7942F-7B5E-42C5-977D-02E7FFA6C392
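
As an alternative to copying these identifiers from the printed output, they can also be listed with a query. This is a sketch that assumes it runs on the primary replica, which reports rows for all replicas. It is not part of the stored procedures in this article.

```sql
-- Sketch: run on the primary replica.
-- Lists the identifiers of every secondary database in one result set.
select
    dbcs.database_name,
    ar.replica_server_name,
    dbr.group_id,
    dbr.replica_id,
    dbr.group_database_id
from sys.dm_hadr_database_replica_states as dbr
join sys.dm_hadr_database_replica_cluster_states as dbcs
    on  dbcs.replica_id = dbr.replica_id
    and dbcs.group_database_id = dbr.group_database_id
join sys.availability_replicas as ar
    on  ar.replica_id = dbr.replica_id
where dbr.is_primary_replica = 0;
```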
231+
232+
### Create a stored procedure to estimate RPO
233+
1. On the primary replica, create stored procedure **proc_calculate_RPO**. If it already exists, drop it first, and then recreate it.
234+
235+
```sql
-- Drop the procedure if it already exists so that it can be recreated.
if object_id(N'proc_calculate_RPO', 'p') is not null
    drop procedure proc_calculate_RPO
go

raiserror('creating procedure proc_calculate_RPO', 0,1) with nowait
go
--
-- name: proc_calculate_RPO
--
-- description: Calculate RPO of a secondary database.
--
-- parameters: @group_id uniqueidentifier: group_id of the secondary database.
--             @replica_id uniqueidentifier: replica_id of the secondary database.
--             @group_database_id uniqueidentifier: group_database_id of the secondary database.
--
-- security: this is a public interface object.
--
create procedure proc_calculate_RPO
(
    @group_id uniqueidentifier,
    @replica_id uniqueidentifier,
    @group_database_id uniqueidentifier
)
as
begin
    declare @db_name sysname
    declare @is_primary_replica bit
    declare @is_failover_ready bit
    declare @is_local bit
    declare @last_commit_time_sec datetime
    declare @last_commit_time_pri datetime
    declare @RPO nvarchar(max)

    -- secondary database's last_commit_time
    select
        @db_name = dbcs.database_name,
        @is_failover_ready = dbcs.is_failover_ready,
        @last_commit_time_sec = dbr.last_commit_time
    from sys.dm_hadr_database_replica_states dbr
    join sys.dm_hadr_database_replica_cluster_states dbcs
        on dbr.replica_id = dbcs.replica_id
        and dbr.group_database_id = dbcs.group_database_id
    where dbr.group_id = @group_id and dbr.replica_id = @replica_id and dbr.group_database_id = @group_database_id

    -- correlated primary database's last_commit_time
    select
        @last_commit_time_pri = dbr.last_commit_time,
        @is_local = dbr.is_local
    from sys.dm_hadr_database_replica_states dbr
    join sys.dm_hadr_database_replica_cluster_states dbcs
        on dbr.replica_id = dbcs.replica_id
        and dbr.group_database_id = dbcs.group_database_id
    where dbr.group_id = @group_id and dbr.is_primary_replica = 1 and dbr.group_database_id = @group_database_id

    -- @db_name is null when no matching row was found; isnull keeps the message from printing as empty.
    if @is_local is null or @is_failover_ready is null
    begin
        print 'RPO of database '+ isnull(@db_name, N'(unknown)') +' is not available'
        return
    end

    -- The primary database row must be local, so run this on the primary replica.
    if @is_local = 0
    begin
        print 'You are visiting wrong replica'
        return
    end

    if @is_failover_ready = 1
        set @RPO = '00:00:00'
    else if @last_commit_time_sec is null or @last_commit_time_pri is null
    begin
        print 'RPO of database '+ @db_name +' is not available'
        return
    end
    else
    begin
        if DATEDIFF(ss, @last_commit_time_sec, @last_commit_time_pri) < 0
        begin
            print 'RPO of database '+ @db_name +' is not available'
            return
        end
        else
            -- Style 114 prints only the time portion (hh:mi:ss:mmm), so gaps of 24 hours or more wrap around.
            set @RPO = CONVERT(varchar, DATEADD(ms, datediff(ss ,@last_commit_time_sec, @last_commit_time_pri) * 1000, 0), 114)
    end
    print 'RPO of database '+ @db_name +' is ' + @RPO
end
```
316+
317+
2. Execute **proc_calculate_RPO** with the target secondary database's *group_id*, *replica_id*, and *group_database_id*.
318+
319+
```sql
exec proc_calculate_RPO @group_id = 'F176DD65-C3EE-4240-BA23-EA615F965C9B',
    @replica_id = '405554F6-3FDC-4593-A650-2067F5FABFFD',
    @group_database_id = '39F7942F-7B5E-42C5-977D-02E7FFA6C392'
```
324+
3. The output displays the RPO value of the target secondary replica database.
325+
100326

101-
## <a name="BKMK_Monitoring_for_RTO_and_RPO"></a> Monitoring for RTO and RPO
327+
## Monitoring for RTO and RPO
102328
This section demonstrates how to monitor your availability groups for RTO and RPO metrics. This demonstration is similar to the GUI tutorial given in [The Always On health model, part 2: Extending the health model](http://blogs.msdn.com/b/sqlalwayson/archive/2012/02/13/extending-the-alwayson-health-model.aspx).
103329

104330
Elements of the failover time and potential data loss calculations in [Estimating failover time (RTO)](#BKMK_RTO) and [Estimating potential data loss (RPO)](#BKMK_RPO) are conveniently provided as performance metrics in the policy management facet **Database Replica State** (see [View the policy-based management facets on a SQL Server object](~/relational-databases/policy-based-management/view-the-policy-based-management-facets-on-a-sql-server-object.md)). You can monitor these two metrics on a schedule and be alerted when the metrics exceed your RTO and RPO, respectively.
