This Knowledge Base article explains how, under certain conditions, Split-brain detection may prevent failover when the active server fails.
In the following environment, Split-brain detection will, by design, prevent failover to the passive server in order to protect data integrity when the channel fails. This prevents the passive server from becoming active while the primary server is still active.
Split-brain detection on the secondary is set to monitor the IP address 192.168.1.1, and the primary is configured to monitor 192.168.2.1.
If a "Blue Screen" or similar situation occurs on the active server, in which Windows has crashed but the principal network adapter is still responding, a failover will not occur.
Impact: Clients cannot access the protected application, the secondary server is still passive, and the primary is displaying the "Blue Screen".
Orion Failover Engine Split-brain detection can still ping the primary network IP address, because the NIC is still reachable on the network despite the "Blue Screen". Failover will not occur because the passive server cannot take over the active server's IP address while that address is still in use on the network. This is by design, and manual intervention is required.
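The decision described above can be sketched in a few lines of Python. This is only an illustrative model of the behavior this article describes, not Orion Failover Engine's actual implementation; the class PassiveServerView and the function passive_should_fail_over are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassiveServerView:
    """What the passive server can observe (hypothetical model)."""
    channel_connected: bool      # channel to the active server is up
    principal_ip_responds: bool  # ping to the monitored principal IP succeeds


def passive_should_fail_over(view: PassiveServerView) -> bool:
    """Return True only when taking over cannot create two active servers."""
    if view.channel_connected:
        # Channel is healthy, so the active server is known to be running.
        return False
    if view.principal_ip_responds:
        # Address is still in use on the network: stay passive.
        # This is the "Blue Screen" case covered by this article.
        return False
    # Channel lost and the address is gone from the network: failover can proceed.
    return True


blue_screen = PassiveServerView(channel_connected=False, principal_ip_responds=True)
after_shutdown = PassiveServerView(channel_connected=False, principal_ip_responds=False)
print(passive_should_fail_over(blue_screen))     # False
print(passive_should_fail_over(after_shutdown))  # True
```

The second case shows why shutting down the crashed server is enough to unblock failover: once the principal IP address stops responding, the condition that kept the secondary passive no longer holds.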
To recover from a Windows server “Blue Screen” where:
- The network IP address is still visible on the network.
- Orion Failover Engine is configured to check the principal IP for Split-brain Avoidance.
- Failover has NOT occurred.
The following manual steps are required to restore client connectivity to the application. Data integrity has not been compromised by a Split-brain syndrome.
- Shut down the failed primary-active server. Do not restart it, because the primary server will return as passive and Orion Failover Engine will shut down if there are no active-passive configured servers present.
- The main network IP address will no longer be visible on the network and Orion Failover Engine will initiate a failover to the secondary passive server.
- Observe the secondary server as it becomes active on the network. The protected application should start normally if all other dependent services are available.
- Network clients should now have access.
- Unplug the primary server from the network and start the server.
- Confirm that the primary is now passive. Orion Failover Engine by default starts as passive following a system failure.
- Reconnect the network cable and allow the Orion Failover Engine servers to Verify and Synchronize.
- Initiate a switchback to the original primary-active and secondary-passive mode once the primary server has completed Verify and Synchronize and is confirmed as operational.
To help eliminate the downtime in this scenario, the alerting system should be configured to warn administrators that the SolarWinds channel has disconnected.
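As a rough illustration of what such an alert could tell an administrator, the sketch below classifies the situation from two observations: whether the channel is connected and whether the active server's IP address still answers a ping. The function channel_alert and its message texts are hypothetical and are not part of the Orion Failover Engine alerting system; how the two inputs are collected is left to the monitoring tool in use.

```python
from typing import Optional


def channel_alert(channel_connected: bool, principal_ip_responds: bool) -> Optional[str]:
    """Return an alert message for administrators, or None if no alert is needed."""
    if channel_connected:
        # Normal operation: nothing to report.
        return None
    if principal_ip_responds:
        # The scenario in this article: failover is blocked by design.
        return ("Channel disconnected, but the active server's IP address still "
                "responds. Failover will not occur; manual intervention required.")
    # The address is no longer on the network, so failover can proceed.
    return ("Channel disconnected and the active server's IP address is "
            "unreachable; failover expected.")


print(channel_alert(channel_connected=False, principal_ip_responds=True))
```

Separating the two disconnected cases lets the alert state directly whether an administrator needs to shut down the failed server by hand.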