This article discusses Switchovers and Failovers, including their similarities and differences. It also discusses a condition called False Failover, which can result in Split-Brain Syndrome.
At the end of the session you should be able to:
- Recall specifics about the Switchover process.
- Recall specifics about the Failover process.
- Identify the triggers related to the Switchover process.
- Identify the trigger related to the Failover process.
- Recall specifics about False Failover.
Neverfail Continuity Engine's Switchover Process
Neverfail Continuity Engine's Switchover process is initiated either manually or automatically. A Managed Switchover lets the administrator manually change the roles of the active and passive servers. When a Managed Switchover is initiated, the running of protected applications is transferred from the active machine to a passive machine in the cluster, and the server roles are reversed.
Continuity Engine stops the protected applications on the active server. After the protected applications stop, no more disk updates are generated. Continuity Engine sends all updates that are still queued on the active server to the passive server. It then re-designates the passive server as the new active server, hides the previously active server from the network, and makes the newly active server visible on the network. The new passive server begins accepting updates from the new active server.
Continuity Engine then starts the protected applications on the new active server.
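The ordering of the steps above matters: applications stop before the queue is drained, and they restart only after the roles and network visibility have changed. The following is a minimal sketch of that ordering in Python. It is a toy model for illustration only; the `Server` class, its fields, and `managed_switchover` are hypothetical names and are not part of Continuity Engine.

```python
# Illustrative model only: none of these names come from Continuity Engine.
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    role: str                      # "active" or "passive"
    visible: bool = False          # visible on the public network?
    apps_running: bool = False     # are protected applications running?
    queue: list = field(default_factory=list)    # updates not yet sent
    applied: list = field(default_factory=list)  # updates received from peer


def managed_switchover(active: Server, passive: Server) -> list:
    """Swap roles in the order described above and return the steps taken."""
    steps = []

    # 1. Stop protected applications so no new disk updates are generated.
    active.apps_running = False
    steps.append("stop applications")

    # 2. Send every update still queued on the active server to the passive one.
    passive.applied.extend(active.queue)
    active.queue.clear()
    steps.append("send queued updates")

    # 3. Reverse the server roles.
    active.role, passive.role = "passive", "active"
    steps.append("swap roles")

    # 4. Hide the previously active server and expose the newly active one.
    active.visible, passive.visible = False, True
    steps.append("switch network visibility")

    # 5. Start protected applications on the new active server.
    passive.apps_running = True
    steps.append("start applications")
    return steps
```

In this model, calling `managed_switchover(primary, secondary)` with `primary` active and holding queued updates leaves `secondary` active, visible, and holding those updates, while `primary` becomes the hidden passive server.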
Automatic Switchover, or Auto-Switchover, is similar to Failover but is triggered automatically when system monitoring detects a failure of a protected application. Like a Managed Switchover, an Auto-Switchover changes the server roles, but it then stops Continuity Engine on the previously active server so that the administrator can investigate the cause of the Auto-Switchover and verify the integrity of the data.
After the cause of the Auto-Switchover is determined and corrected, the administrator can use the Engine Management Service or the Neverfail Advanced Management Client to return the server roles to their original state.
Overview of Failover
Neverfail Continuity Engine performs a Failover when the passive server detects that the active server has failed. A Failover is triggered only by missed Neverfail Continuity Engine heartbeats. An Automatic Failover is similar to the Automatic Switchover discussed previously, but it is triggered when a passive server detects that the active server is no longer running properly, at which point the passive server assumes the role of the active server. Because the active server has failed, Continuity Engine performs no operations on it.
Continuity Engine performs four steps on the passive server. If any of the steps fails, Continuity Engine logs an exception and then continues to the next step.
- Continuity Engine processes replication queue updates.
- Continuity Engine exposes the server to the network.
- Continuity Engine starts intercepting updates and the server becomes active.
- Continuity Engine starts protected applications.
You should monitor the process to ensure that the process completes successfully.
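The "log an exception and continue" behavior is the reason monitoring matters: a Failover can run to the end even though one of its steps did not succeed. The sketch below illustrates that pattern in Python. It is an assumption-laden toy; `run_failover` and the step names are hypothetical and do not reflect Continuity Engine's internals.

```python
# Illustrative sketch only: not Continuity Engine code.
import logging

logger = logging.getLogger("failover_sketch")


def run_failover(steps):
    """Run each (name, action) pair in order.

    A step that raises is logged and recorded, and the next step still runs.
    Returns the names of the steps that failed.
    """
    failed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            # Log the exception, then carry on with the remaining steps.
            logger.exception("Failover step failed: %s", name)
            failed.append(name)
    return failed
```

A non-empty return value in this model corresponds to a Failover that finished but needs the administrator's attention.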
A Managed Failover is similar to an Automatic Failover in that the passive server automatically determines that the active server has failed and can warn the system administrator about the failure. However, no Failover actually occurs until the system administrator manually triggers the operation.
Overview of False Failover
A False Failover causes two servers to be active at the same time, a condition also called Split-Brain Syndrome. False Failovers may occur for various reasons:
- Channel connectivity. False Failovers are frequently caused by channel connectivity problems between the active and passive servers while Continuity Engine is running. If the passive server can't see the active server across the channel, it will automatically make itself active after the configured number of missed heartbeats.
- High Loads / Low Resources. Another common cause of False Failovers is the failure of the active server to respond to heartbeat messages because of high application load or low Windows or hardware resources. Although the channel is connected and responds to pings, the active server is unable to respond to heartbeats.
- User error. User errors, such as misconfiguring the software or hardware or unplugging network cables, can lead to False Failovers.
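All three causes above look identical from the passive server's side: heartbeats stop arriving. The following Python sketch shows why a missed-heartbeat counter cannot tell a dead active server from a broken channel or an overloaded one. It assumes the threshold counts consecutive misses, which the text does not state; `should_fail_over` and `max_missed` are hypothetical names, not Continuity Engine settings.

```python
# Illustrative sketch only: the counting rule is an assumption.
def should_fail_over(heartbeats_received, max_missed):
    """Return True once `max_missed` consecutive heartbeats have been missed.

    `heartbeats_received` is a sequence of booleans, one per heartbeat
    interval: True if a heartbeat arrived, False if it did not.
    """
    missed = 0
    for received in heartbeats_received:
        # Any heartbeat that arrives resets the counter.
        missed = 0 if received else missed + 1
        if missed >= max_missed:
            return True
    return False
```

The function receives only "arrived" or "did not arrive", so a cable pulled from the channel and a genuinely failed server produce the same input and the same decision.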
Results of a False Failover
When a False Failover occurs, resulting in two active servers, both servers become available to users. As users work with their applications, each server is updated independently of the other. Consequently, with two active servers, replication between the servers is impossible.
If Continuity Engine can see that two servers are active at the same time and the channel connection is still available, it automatically shuts down itself and the protected applications as a precautionary measure. During a Split-Brain, users may experience problems with their applications, and IP address and machine name conflicts may occur.
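The precautionary shutdown described above depends on two conditions holding together: both servers are active, and the channel is available so that each can see the other's role. A minimal Python sketch of that decision follows. The function name and return values are hypothetical and serve only to make the logic explicit.

```python
# Illustrative sketch only: not Continuity Engine code.
def split_brain_action(local_role, peer_role, channel_up):
    """Decide what a server does given what it can observe.

    Returns "shutdown" when a Split-Brain is visible, "undetected" when the
    channel is down and the peer's role cannot be seen, else "continue".
    """
    if not channel_up:
        # Without the channel, the peer's role cannot be observed at all.
        return "undetected"
    if local_role == "active" and peer_role == "active":
        # Both servers active and visible to each other: stop as a precaution.
        return "shutdown"
    return "continue"
```

The "undetected" branch reflects the case in which the channel itself has failed: the Split-Brain exists, but neither server can observe it.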
After you correct the Split-Brain Syndrome by returning one of the servers to the passive role, some data is inevitably lost. Each server has had different updates applied during the Split-Brain, and one of the servers will now be overwritten. Continuity Engine has Split-Brain Avoidance features to help prevent this problem where possible. See the Neverfail knowledge base for more information about this feature.
How Do I Recover from a False Failover?
The following steps will help you to recover from a False Failover:
- Determine the cause of the Failover and rectify it.
- Determine which server has the most up-to-date data.
- Use the Configure Server Wizard to verify that the server with the most up-to-date data is configured as the active server. If not, make the required changes.
- Restart Continuity Engine, if necessary, and allow it to re-sync data.
- After the data is synchronized, use a Managed Switchover to return service to the primary server if desired. Although this process is seamless, schedule this last step for a time when the fewest users would be affected in the unlikely event that the Switchover is problematic.
Wrap Up
This session covered Neverfail Continuity Engine's Switchover and Failover processes. Remember the following key points:
- The Switchover process is initiated manually or automatically and contains 5 steps. If any step fails, Continuity Engine logs an exception and continues. Monitor the process to make sure it completes successfully.
- Failover is an emergency process triggered only by missed heartbeats. It contains four steps. If any step fails, Continuity Engine logs an exception and continues. Monitor the process to make sure it completes successfully.
- A False Failover makes both servers active at the same time, causing Split-Brain Syndrome. The causes of False Failovers are high load, low system resources, channel connectivity problems while Continuity Engine is running, and user error.
- Follow the five steps of the False Failover recovery process to return your system to normal.