This Knowledgebase article provides information about the failover process and how to recover from a failover.
A failover should not be confused with a switchover. A switchover is a controlled switch (initiated from the Engine Management Service) between the Primary, Secondary, or Tertiary (if installed) servers. A failover may happen when the active server suffers a failure of one or more of the following: power, hardware, or communications. The passive server counts a preconfigured number of missed heartbeats before beginning a failover; when that number is reached, it automatically assumes the active role and starts running the protected applications.
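The missed-heartbeat rule described above can be illustrated with a short sketch. This is not Neverfail code: the class, the `max_missed` parameter name, and the choice to reset the count whenever a heartbeat arrives are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Counts missed heartbeats on the passive server (illustrative only)."""

    max_missed: int   # hypothetical name for the preconfigured threshold
    missed: int = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one heartbeat interval; return True when a failover should begin."""
        if heartbeat_received:
            self.missed = 0   # assumption: a received heartbeat resets the count
        else:
            self.missed += 1
        return self.missed >= self.max_missed


monitor = HeartbeatMonitor(max_missed=3)
# Two misses, one heartbeat, then three consecutive misses.
intervals = [False, False, True, False, False, False]
decisions = [monitor.record(received) for received in intervals]
print(decisions)  # prints [False, False, False, False, False, True]
```

In this sketch the failover is only signalled on the last interval, once three heartbeats in a row have been missed.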
The failover process
When the passive server detects that the active server is no longer running properly, it assumes the role of the active server by initiating a failover and takes the following steps:
- It applies any intercepted updates that are currently saved in the passive server queue, that is, the log of update records that have been saved on the passive server, but not applied to the replicated files.
The size of the passive server queue influences the length of time it takes to complete the failover process. If the passive server queue is large, the system must wait for all of the passive server queue updates to be applied before the rest of the process can take place. When there are no more complete update records to be applied, any incomplete update records will be discarded. An update record can only be applied if all earlier update records were applied, and the completion status for the update is in the passive server queue.
- The passive server changes its role and mode of operation from passive to active.
The server’s public identity is enabled. This principal (public) IP address can only be enabled on one of the two servers at any time. When the public identity is enabled, any clients that were connected to the server before the failover will now be able to reconnect.
- The newly active server starts intercepting updates to the protected data. Any updates to the protected data will be saved in the local active server queue.
- The now active server starts all protected applications. The applications will be able to use the replicated application data to recover, and then accept re-connections from any clients. Any updates that the applications make to the protected data will be intercepted and logged.
At this stage, the originally active server is “offline”. The originally passive server has taken over the role of the active server and is running the protected applications. As the originally active server stopped abruptly, the protected applications may have lost some data. The application clients can reconnect to the application and continue running as before.
Note: During a failover, the data held in the active server queue is lost.
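The rule for applying the passive server queue during a failover can be sketched as follows. This is a simplified model, not the product's implementation: `UpdateRecord`, its fields, and `drain_passive_queue` are hypothetical names, and the sketch assumes that every record after the first incomplete one is discarded, because a record can only be applied once all earlier records have been applied.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UpdateRecord:
    sequence: int    # hypothetical ordering field
    complete: bool   # True if the completion status reached the passive server queue


def drain_passive_queue(queue: List[UpdateRecord]) -> Tuple[List[int], List[int]]:
    """Return (applied, discarded) sequence numbers for a queue held in order."""
    applied: List[int] = []
    for index, record in enumerate(queue):
        if not record.complete:
            # First incomplete record: it and everything after it cannot be applied.
            discarded = [r.sequence for r in queue[index:]]
            return applied, discarded
        applied.append(record.sequence)
    return applied, []


queue = [
    UpdateRecord(1, True),
    UpdateRecord(2, True),
    UpdateRecord(3, False),  # completion status never arrived
    UpdateRecord(4, True),
]
applied, discarded = drain_passive_queue(queue)
print(applied, discarded)  # prints [1, 2] [3, 4]
```

The loop also shows why a large queue lengthens the failover: every complete record has to be applied before the server can change roles.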
How to recover from a failover
This recovery scenario is based on Neverfail Continuity Engine in a default pair configuration with the Primary server as active and the Secondary server as passive.
A failover has occurred and the Secondary server is now running as the active server.
- Check the event logs on both servers at this point to determine the cause of the failover. If you are unsure how to do this, please use the Neverfail Log Collector tool to collect information and send the output to Neverfail Support. See Knowledgebase article #115000768188 - How to Retrieve Neverfail Continuity Engine Logs and Other Useful Information for Support Purposes.
If any of the following has occurred on the Primary server, performing a switchover back to the Primary server may not be possible until other important actions are carried out. Do not restart Neverfail Continuity Engine until these issues have been resolved:
- Hard Disk Failure - Disk may need replacing.
- Power Failure - Power may need to be restored to the Primary server.
- Virus - Server should be cleaned of all viruses before starting Neverfail Continuity Engine.
- Communications - Physical network hardware may need to be replaced.
- Blue Screen - Cause should be determined and resolved. This may require you to submit the Blue Screen dump file to Neverfail Support for analysis.
- On the Primary server, run the Server Configuration wizard and check that the server is set to Primary and passive. Click Finish to accept the changes.
- Disconnect the channel network cables or disable the network card.
- Resolve the problem (see the list of possible failures above).
- Reboot this server, then reconnect the channel network cables or re-enable the network card.
- After the reboot, check that the Taskbar icon now reflects the changes by showing P / - (Primary and passive).
- On the Secondary active server or from a remote client, launch the Engine Management Service and confirm that the Secondary server is reporting as active.
If the Secondary server is not displaying as active, follow the steps below:
- If the Engine Management Service is unable to connect remotely, try running it locally. If you are still unable to connect locally, check that the service is running via the Service Control Manager. If it is not, check the event logs for a cause.
- Run the Server Configuration wizard and check that the server is set to Secondary and active. Click Finish to accept the changes.
- Determine if the protected application is accessible from clients. If it is, start Neverfail Continuity Engine on the Secondary server.
If the application is not accessible, check the application logs to determine why the application is not running.
- Run the Server Configuration wizard and check that the server is set to Secondary and active.
- Click Finish to accept any changes.
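The check in the steps above on whether the protected application is accessible from clients can be partly scripted from a client machine. The sketch below is a generic TCP connection test, not a Neverfail tool; `is_reachable` is a hypothetical helper, and the host and port must be replaced with the values for your own protected application. A successful connection only shows that something is listening on the port, not that the application is fully healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `is_reachable("app.example.com", 443)` would test a placeholder host on port 443. If the call returns False, check the application logs as described above.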
At this stage, you should now be ready to start Neverfail Continuity Engine on the Secondary active server.
Note: The data on this server should be the most up to date and this server should also be the live server on your network. Once Neverfail Continuity Engine starts, it will overwrite all the protected data (configured in the File Filter list) on the Primary passive server. If you are not sure that the data on the active server is 100% up to date, please contact Neverfail Support. Only go on to the next step if you are sure that you want to overwrite the protected data on the passive server.
- Start Neverfail Continuity Engine on the Secondary active server and check that the Taskbar icon now reflects the correct status by showing S / A (Secondary and active).
- After you have verified that the Secondary server is operating properly as active, that the Primary server is operating properly as passive, and that the servers are synchronized, you can perform a managed switchover to return the servers to their original roles, if desired.