
Commit f16131f

Merge pull request #2118 from mihaelablendea/mihaelab-slesupdates-6-20
Mihaelab slesupdates 6 20
2 parents 9a63454 + 9e09c40 commit f16131f

4 files changed

Lines changed: 75 additions & 29 deletions

docs/includes/ss-linux-cluster-pacemaker-concepts.md

Lines changed: 8 additions & 2 deletions
````diff
@@ -22,13 +22,19 @@ A user can choose to override the default behavior, and configure the availabili
 To set `REQUIRED_COPIES_TO_COMMIT` to 0, run:
 
 ```bash
-sudo pcs resource update <**ag1**> required_copies_to_commit=0
+sudo pcs resource update <**ag_cluster**> required_copies_to_commit=0
+```
+
+The equivalent command using crm (on SLES) is:
+
+```bash
+sudo crm resource param <**ag_cluster**> set required_synchronized_secondaries_to_commit 0
 ```
 
 To revert to default computed value, run:
 
 ```bash
-sudo pcs resource update <**ag1**> required_copies_to_commit=
+sudo pcs resource update <**ag_cluster**> required_copies_to_commit=
 ```
 
 >[!NOTE]
````
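As an aside on this setting's semantics: a commit on the primary waits until at least `required_copies_to_commit` synchronized secondary replicas have hardened the transaction log, so 0 means commits never wait on a secondary. A minimal illustrative sketch in plain shell (not part of the resource agent; names and values are examples):

```shell
#!/bin/sh
# Illustrative sketch only: how required_copies_to_commit gates a commit.
# A commit on the primary is acknowledged once at least this many
# synchronized secondaries have hardened the transaction log.
required_copies_to_commit=1

commit_ready() {
    acks=$1   # number of synchronized secondaries that acknowledged
    [ "$acks" -ge "$required_copies_to_commit" ]
}

for acks in 0 1 2; do
    if commit_ready "$acks"; then
        echo "acks=$acks: commit proceeds"
    else
        echo "acks=$acks: commit waits"
    fi
done
```

With the setting at 0, `commit_ready` would succeed even with no acknowledgments, which is the trade-off the note above warns about.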

docs/linux/sql-server-linux-availability-group-cluster-rhel.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -83,7 +83,7 @@ sudo pcs property set stonith-enabled=false
 `Start-failure-is-fatal` indicates whether a failure to start a resource on a node prevents further start attempts on that node. When set to `false`, the cluster will decide whether to try starting on the same node again based on the resource's current failure count and migration threshold. So, after failover occurs, Pacemaker will retry starting the availability group resource on the former primary once the SQL instance is available. Pacemaker will take care of demoting the replica to secondary and it will automatically rejoin the availability group.
 To update the property value to `false`, run:
 ```bash
-pcs property set start-failure-is-fatal=false
+sudo pcs property set start-failure-is-fatal=false
 ```
 If the property has the default value of `true` and the first attempt to start the resource fails, user intervention is required after an automatic failover to clean up the resource failure count and reset the configuration using the `pcs resource cleanup <resourceName>` command.
````
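The interplay between the failure count and the migration threshold described here can be sketched in plain shell (illustrative pseudologic only, not Pacemaker's actual implementation; the threshold value is an example):

```shell
#!/bin/sh
# Illustrative sketch: with start-failure-is-fatal=false, the cluster keeps
# retrying the resource on the same node until its failure count reaches
# migration-threshold, then bans the node. The threshold here is an example.
migration_threshold=3

start_decision() {
    failcount=$1
    if [ "$failcount" -lt "$migration_threshold" ]; then
        echo "retry start on this node"
    else
        echo "ban this node and move the resource"
    fi
}

start_decision 0
start_decision 3
```

This is why setting `start-failure-is-fatal=false` without reviewing the migration threshold can leave a node retrying far longer than intended.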

docs/linux/sql-server-linux-availability-group-cluster-sles.md

Lines changed: 65 additions & 25 deletions
````diff
@@ -31,7 +31,7 @@ This guide provides instructions to create a two-node cluster for SQL Server on
 For more details on cluster configuration, resource agent options, management, best practices, and recommendations, see [SUSE Linux Enterprise High Availability Extension 12 SP2](https://www.suse.com/documentation/sle-ha-12/index.html).
 
 >[!NOTE]
->At this point, SQL Server's integration with Pacemaker on Linux is not as coupled as with WSFC on Windows. SQL Server service on Linux is not cluster aware. Pacemaker controls all of the orchestration of the cluster resources as if SQL Server were a standalone instance. Also, virtual network name is specific to WSFC, there is no equivalent of the same in Pacemaker. On Linux, Always On Availability Group Dynamic Management Views (DMVs) will return empty rows. You can still create a listener to use it for transparent reconnection after failover, but you will have to manually register the listener name in the DNS server with the IP used to create the virtual IP resource (as explained below).
+>At this point, SQL Server's integration with Pacemaker on Linux is not as coupled as with WSFC on Windows. The SQL Server service on Linux is not cluster aware. Pacemaker controls all of the orchestration of the cluster resources, including the availability group resource. On Linux, you should not rely on Always On Availability Group Dynamic Management Views (DMVs) that provide cluster information, like sys.dm_hadr_cluster. Also, the virtual network name is specific to WSFC; there is no equivalent in Pacemaker. You can still create a listener to use for transparent reconnection after failover, but you will have to manually register the listener name in the DNS server with the IP used to create the virtual IP resource (as explained below).
 
 ## Roadmap
````
````diff
@@ -49,38 +49,39 @@ The steps to create an availability group on Linux servers for high availability
 >[!IMPORTANT]
 >Production environments require a fencing agent, like STONITH for high availability. The demonstrations in this documentation do not use fencing agents. The demonstrations are for testing and validation only.
->A Linux cluster uses fencing to return the cluster to a known state. The way to configure fencing depends on the distribution and the environment. At this time, fencing is not available in some cloud environments. See [SUSE Linux Enterprise High Availability Extension](https://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html#cha.ha.fencing).
+>A Pacemaker cluster uses fencing to return the cluster to a known state. The way to configure fencing depends on the distribution and the environment. At this time, fencing is not available in some cloud environments. See [SUSE Linux Enterprise High Availability Extension](https://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html#cha.ha.fencing).
 
 5. [Add the availability group as a resource in the cluster](sql-server-linux-availability-group-cluster-sles.md#configure-the-cluster-resources-for-sql-server).
 
 ## Prerequisites
 
-To complete the end-to-end scenario below you need two machines to deploy the two nodes cluster. The steps below outline how to configure these servers.
+To complete the end-to-end scenario below, you need three machines to deploy the three-node cluster. The steps below outline how to configure these servers.
 
 ## Setup and configure the operating system on each cluster node
 
 The first step is to configure the operating system on the cluster nodes. For this walk through, use SLES 12 SP2 with a valid subscription for the HA add-on.
 
 ### Install and configure SQL Server service on each cluster node
 
-1. Install and setup SQL Server service on both nodes. For detailed instructions see [Install SQL Server on Linux](sql-server-linux-setup.md).
+1. Install and set up the SQL Server service on all nodes. For detailed instructions, see [Install SQL Server on Linux](sql-server-linux-setup.md).
 
-1. Designate one node as primary and the other as secondary, for purposes of configuration. Use these terms throughout this guide.
+1. Designate one node as primary and the other nodes as secondaries. Use these terms throughout this guide.
 
 1. Make sure nodes that are going to be part of the cluster can communicate to each other.
 
-   The following example shows `/etc/hosts` with additions for two nodes named SLES1 and SLES2.
+   The following example shows `/etc/hosts` with additions for three nodes named SLES1, SLES2, and SLES3.
 
    ```
    127.0.0.1     localhost
-   10.128.18.128 SLES1
+   10.128.16.33  SLES1
    10.128.16.77  SLES2
+   10.128.16.22  SLES3
   ```
 
 All cluster nodes must be able to access each other via SSH. Tools like `hb_report` or `crm_report` (for troubleshooting) and Hawk's History Explorer require passwordless SSH access between the nodes, otherwise they can only collect data from the current node. If you use a non-standard SSH port, use the `-X` option (see the `man` page). For example, if your SSH port is 3479, invoke `crm_report` with:
 
 ```bash
-crm_report -X "-p 3479" [...]
+sudo crm_report -X "-p 3479" [...]
 ```
 
 For additional information, see the [SLES Administration Guide - Miscellaneous section](http://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html#sec.ha.troubleshooting.misc).
````
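The `/etc/hosts` additions above can be scripted idempotently; a sketch using the walkthrough's example names and IPs (it writes to a demo file so it can run anywhere; on a real node, point `HOSTS_FILE` at `/etc/hosts` and run as root):

```shell
#!/bin/sh
# Sketch: add the three cluster nodes to a hosts file only if not present.
# HOSTS_FILE is a local demo file here; use /etc/hosts (as root) on a node.
HOSTS_FILE=./hosts.demo
printf '127.0.0.1 localhost\n' > "$HOSTS_FILE"

for entry in "10.128.16.33 SLES1" "10.128.16.77 SLES2" "10.128.16.22 SLES3"; do
    name=${entry##* }                                  # hostname = last field
    grep -qw "$name" "$HOSTS_FILE" || printf '%s\n' "$entry" >> "$HOSTS_FILE"
done

cat "$HOSTS_FILE"
```

The `grep -qw` guard makes the script safe to re-run without duplicating entries.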
````diff
@@ -113,7 +114,7 @@ On Linux servers configure the availability group and then configure the cluster
 1. Log in as `root` to the physical or virtual machine you want to use as cluster node.
 2. Start the bootstrap script by executing:
    ```bash
-   ha-cluster-init
+   sudo ha-cluster-init
    ```
 
    If NTP has not been configured to start at boot time, a message appears.
````
````diff
@@ -134,15 +135,15 @@ On Linux servers configure the availability group and then configure the cluster
 4. For any details of the setup process, check `/var/log/sleha-bootstrap.log`. You now have a running one-node cluster. Check the cluster status with crm status:
 
    ```bash
-   crm status
+   sudo crm status
    ```
 
    You can also see cluster configuration with `crm configure show xml` or `crm configure show`.
 
 5. The bootstrap procedure creates a Linux user named hacluster with the password linux. Replace the default password with a secure one as soon as possible:
 
    ```bash
-   passwd hacluster
+   sudo passwd hacluster
    ```
 
 ## Add nodes to the existing cluster
````
````diff
@@ -157,7 +158,7 @@ If you have configured the existing cluster nodes with the `YaST` cluster module
 2. Start the bootstrap script by executing:
 
    ```bash
-   ha-cluster-join
+   sudo ha-cluster-join
    ```
 
    If NTP has not been configured to start at boot time, a message appears.
````
````diff
@@ -168,14 +169,14 @@ After logging in to the specified node, the script will copy the Corosync config
 6. For details of the process, check `/var/log/ha-cluster-bootstrap.log`.
 
-1. Check the cluster status with `crm status`. If you have successfully added a second node, the output will be similar to the following:
+1. Check the cluster status with `sudo crm status`. If you have successfully added the nodes, the output will be similar to the following:
 
    ```bash
-   crm status
+   sudo crm status
 
-   2 nodes configured
+   3 nodes configured
    1 resource configured
-   Online: [ SLES1 SLES2 ]
+   Online: [ SLES1 SLES2 SLES3 ]
    Full list of resources:
    admin_addr     (ocf::heartbeat:IPaddr2):       Started SLES1
    ```
````
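When scripting health checks, the `Online:` line of `crm status` output (as in the sample above) can be parsed with standard tools; a sketch over saved sample text, since the real command needs a running cluster:

```shell
#!/bin/sh
# Sketch: extract the online node list from `crm status`-style output.
# The sample text mirrors the walkthrough; on a cluster use: sudo crm status
status='3 nodes configured
1 resource configured

Online: [ SLES1 SLES2 SLES3 ]

Full list of resources:

 admin_addr     (ocf::heartbeat:IPaddr2):       Started SLES1'

online=$(printf '%s\n' "$status" | sed -n 's/^Online: \[ \(.*\) \]$/\1/p')
echo "online nodes: $online"
```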
````diff
@@ -185,29 +186,68 @@ After logging in to the specified node, the script will copy the Corosync config
 
 After adding all nodes, check if you need to adjust the no-quorum-policy in the global cluster options. This is especially important for two-node clusters. For more information, refer to Section 4.1.2, Option no-quorum-policy.
 
+## Set cluster property start-failure-is-fatal to false
+
+`Start-failure-is-fatal` indicates whether a failure to start a resource on a node prevents further start attempts on that node. When set to `false`, the cluster will decide whether to try starting on the same node again based on the resource's current failure count and migration threshold. So, after failover occurs, Pacemaker will retry starting the availability group resource on the former primary once the SQL instance is available. Pacemaker will take care of demoting the replica to secondary and it will automatically rejoin the availability group. Also, when `start-failure-is-fatal` is set to `false`, the cluster falls back to the failcount limits configured with `migration-threshold`, so make sure the migration threshold default is updated accordingly.
+
+To update the property value to `false`, run:
+
+```bash
+sudo crm configure property start-failure-is-fatal=false
+sudo crm configure rsc_defaults migration-threshold=5000
+```
+
+If the property has the default value of `true` and the first attempt to start the resource fails, user intervention is required after an automatic failover to clean up the resource failure count and reset the configuration using the `sudo crm resource cleanup <resourceName>` command.
+
+For more details on Pacemaker cluster properties, see [Configuring Cluster Resources](https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_config_crm_resources.html).
+
+## Configure fencing (STONITH)
+
+Pacemaker cluster vendors require STONITH to be enabled and a fencing device configured for a supported cluster setup. When the cluster resource manager cannot determine the state of a node, or of a resource on a node, fencing is used to bring the cluster to a known state again.
+
+Resource-level fencing mainly ensures that there is no data corruption in case of an outage. You can use resource-level fencing, for instance, with DRBD (Distributed Replicated Block Device) to mark the disk on a node as outdated when the communication link goes down.
+
+Node-level fencing ensures that a node does not run any resources. This is done by resetting the node, and the Pacemaker implementation of it is called STONITH (which stands for "shoot the other node in the head"). Pacemaker supports a great variety of fencing devices, for example an uninterruptible power supply or management interface cards for servers.
+
+For more details, see [Pacemaker Clusters from Scratch](http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch05.html), [Fencing and Stonith](http://clusterlabs.org/doc/crm_fencing.html), and [SUSE HA documentation: Fencing and STONITH](https://www.suse.com/documentation/sle_ha/book_sleha/data/cha_ha_fencing.html).
+
+At cluster initialization time, STONITH is disabled if no configuration is detected. It can be enabled later by running the following command:
+
+```bash
+sudo crm configure property stonith-enabled=true
+```
+
+>[!IMPORTANT]
+>Disabling STONITH is just for testing purposes. If you plan to use Pacemaker in a production environment, you should plan a STONITH implementation depending on your environment, and keep it enabled. Note that SUSE does not provide fencing agents for any cloud environments (including Azure) or Hyper-V. Consequently, the cluster vendor does not offer support for running production clusters in these environments. We are working on a solution for this gap that will be available in future releases.
+
 ## Configure the cluster resources for SQL Server
 
 Refer to the [SLES Administration Guide](https://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html#cha.ha.manual_config).
 
 ### Create availability group resource
 
-The following command creates and configures the availability group resource for 2 replicas of availability group [ag1]. Run the command on one of the nodes in the cluster:
+The following command creates and configures the availability group resource for 3 replicas of availability group [ag1]. The monitor operations and timeouts have to be specified explicitly in SLES, because timeouts are highly workload dependent and need to be carefully adjusted for each deployment.
+
+Run the command on one of the nodes in the cluster:
 
 1. Run `crm configure` to open the crm prompt:
 
    ```bash
-   crm configure
+   sudo crm configure
    ```
 
 1. In the crm prompt, run the command below to configure the resource properties.
 
    ```bash
-   primitive ag_cluster \
-      ocf:mssql:ag \
-      params ag_name="ag1"
-   ms ms-ag_cluster ag_cluster \
-      meta notify="true"
-   commit
+   primitive ag_cluster \
+      ocf:mssql:ag \
+      params ag_name="ag1" \
+      op start timeout=60s \
+      op stop timeout=60s \
+      op promote timeout=60s \
+      op demote timeout=10s \
+      op monitor timeout=60s interval=10s \
+      op monitor timeout=60s interval=11s role="Master" \
+      op monitor timeout=60s interval=12s role="Slave" \
+      op notify timeout=60s
+   ms ms-ag_cluster ag_cluster \
+      meta master-max="1" master-node-max="1" clone-max="3" \
+      clone-node-max="1" notify="true"
+   commit
    ```
 
 ### Create virtual IP resource
@@ -246,7 +286,7 @@ To prevent the IP address from temporarily pointing to the node with the pre-fai
 To add an ordering constraint, run the following command on one node:
 
 ```bash
-crm configure \
+sudo crm configure \
 order ag_first inf: ms-ag_cluster:promote admin_addr:start
 ```
````
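One note on the constraint above: an ordering constraint alone does not keep the virtual IP on the primary replica's node, so it is normally paired with a colocation constraint. A sketch using this walkthrough's resource names (a config fragment; verify against the full SLES guide before use):

```bash
sudo crm configure \
colocation vip_on_master inf: admin_addr ms-ag_cluster:Master
```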

docs/linux/sql-server-linux-availability-group-cluster-ubuntu.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -158,7 +158,7 @@ sudo pcs property set stonith-enabled=false
 To update the property value to `false` run:
 
 ```bash
-pcs property set start-failure-is-fatal=false
+sudo pcs property set start-failure-is-fatal=false
 ```
 If the property has the default value of `true` and the first attempt to start the resource fails, user intervention is required after an automatic failover to clean up the resource failure count and reset the configuration using the `pcs resource cleanup <resourceName>` command.
````
