Neverfail Continuity Engine Channel Disconnects

Summary  

The Neverfail Channel allows replicated data and status/control information to pass between the servers in a Neverfail group. When the Neverfail Channel disconnects unexpectedly, it is referred to as a channel drop. Channel drops interrupt the replication of data and transmission of status/control information and can result in false failovers. This behaviour is not by design and can be caused by the issues described below.

More Information

Here are the main causes, their symptoms and the solutions that can be applied:

1. Performance issues

Symptoms

The message "java.io.IOException: An existing connection was forcibly closed by the remote host" appears in the active server's NFLog.txt file, and the channel connection between the servers is lost.
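
To confirm this symptom, the log can be searched for the message quoted above. The following Python snippet is a minimal sketch rather than a Neverfail tool: the function name is hypothetical, and it assumes only that NFLog.txt is a plain-text file in which the message appears on a single line.

```python
from typing import Iterable, List, Tuple

# Message text quoted in the Symptoms paragraph above.
FORCED_CLOSE = "An existing connection was forcibly closed by the remote host"


def find_forced_closes(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every log line containing the message."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if FORCED_CLOSE in line:
            hits.append((number, line.rstrip()))
    return hits
```

To use it, open the active server's NFLog.txt (its location depends on the installation) and pass the file object to find_forced_closes(). The line numbers returned can then be matched against the timestamps of reboots or service failures recorded in the Windows event logs on the passive server.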

Cause

This condition is unusual and generally points to a fault in an application, or in Windows itself, on the passive server. The most likely issue is a sudden restart of the passive server, which may be due to one of the following causes:

  • The server is configured for automatic software update management and some updates force a server reboot.
  • There is a software or Operating System issue which occasionally results in a BSOD and system restart.
  • The Neverfail Server R2 service itself experiences problems and may hang or terminate unexpectedly.

Resolution

  • If the passive server is hanging or restarting, it should be possible to determine the likely source by examining the Windows event logs.
  • Alternatively, if the server does not show any evidence of a system restart or application hang, the issue may be due to one or both of the channel NICs forcing a channel disconnection. See section 3, "Hardware or driver issues on channel NICs", below for more information on this topic.

2. Passive server does not meet minimum hardware requirements

Symptoms

The data rate between the servers is very high during a Full System Check and the channel drops.

Cause

The passive server does not meet the recommended hardware requirements for Neverfail Engine, or it meets them but is much less powerful than the active server. The underpowered server cannot apply the replication data as quickly as the active server sends it.

Resolution

To avoid reinstalling your Neverfail Engine solution, it is best to tackle this issue by upgrading the hardware (for example, memory or CPU) on the passive server. It is important to establish the identity (Primary, Secondary, or Tertiary) of the affected server before you perform the upgrade. An upgraded Primary server may require a new Neverfail license if the HBSIG is changed. Upgrading the Secondary server will not require a new license.

3. Hardware or driver issues on channel NICs

Symptoms

The Neverfail Channel drops or disconnects and reconnects intermittently.

Cause

  • Outdated or incorrect drivers on the channel NICs.
  • If the physical connection for the Neverfail Channel connection uses a hub or Ethernet switch, a hardware fault may cause the channel to drop.
  • Defective Ethernet patch or crossover cables.
  • Improper configuration of the NICs used for the channel connection.
  • ISP problems in a WAN environment.

Resolution

  • Verify that channel NIC drivers are the correct/latest versions. This is a known issue with HP/Compaq ProLiant NC67xx/NC77xx Gigabit Ethernet NICs but may affect other NIC types as well. See Knowledgebase article # 116 - 'Neverfail and Gigabit Ethernet NIC drivers. (NC77XX)'.
  • Verify hubs and Ethernet switches are operating properly. Identify and replace any defective components.
  • Test for defective Ethernet patch or crossover cables and replace if defective.
  • Correctly configure the NICs used for the channel connection.
  • Check the physical link for ISP problems.

5. Firewall connection

In either a LAN or a WAN deployment of Neverfail Engine, the channel may be connected via one or more Internet firewalls. Since firewalls are intended to block unauthorised network traffic, it is important to ensure that any firewalls along the route of the channel are configured to allow channel traffic.

Symptoms

The Neverfail Channel cannot connect or connects and disconnects continuously.

Cause

In a WAN deployment, port #57348 (or any other port configured for the Neverfail Channel) is closed on one or more firewalls on the route between the channel NIC on the Primary server and its counterpart on the Secondary server.

Resolution

Open port #57348 (and any other port configured for the Neverfail Channel) on all firewalls on the route between the channel NIC on the Primary server and its counterpart on the Secondary server.
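
After the firewall change, reachability of the port can be checked from the other server. The Python sketch below is illustrative only. It assumes that the channel uses TCP (which the "connection was forcibly closed" message in section 1 suggests) and that Neverfail Engine is running and listening on the remote end; the function name and the timeout value are hypothetical.

```python
import socket

DEFAULT_CHANNEL_PORT = 57348  # port number quoted in this article


def channel_port_reachable(host: str,
                           port: int = DEFAULT_CHANNEL_PORT,
                           timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Call the function with the channel IP address of the other server. A False result means only that the connection could not be opened; it does not distinguish a blocked firewall port from a stopped service or a routing problem, so the remaining sections of this article may still apply.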

6. Incorrect Neverfail Channel configuration

Symptoms

  • IP conflicts are encountered on one of the channel IP addresses.
  • The Neverfail Channel does not connect or connects and disconnects.

Cause

The channel has identical IP addresses at each end, has IP addresses in different subnets without static routing at each end, or has a channel NIC configured for DHCP when a DHCP server is not available.

During the installation of Neverfail Engine, some configuration data from the Primary server is copied to the Secondary server. This includes configuration information for any NICs. The Help text displayed on the Neverfail Engine Deployment wizard describes how to configure the IP address for each NIC on the Secondary server. If this step is not completed, it is possible for one or more channel NICs on the Secondary server to contain a variety of incorrect addresses derived from the Primary server.

For example, assume you want the following correct configuration after deployment:

Primary:

Public NIC: 192.169.1.101
Channel NIC #1: 9.0.0.1
Channel NIC #2: 10.0.0.1

Secondary:

Public NIC: 192.169.1.101
Channel NIC #1: 9.0.0.2
Channel NIC #2: 10.0.0.2

Immediately following the restore/Plug and Play phase of the Secondary installation, Channel NIC #1 on the Secondary server may have acquired any of the following IP addresses:

(a) 192.169.1.101
(b) 9.0.0.1
(c) 10.0.0.1
(d) No static IP address (i.e. NIC is configured to use DHCP)

Clearly, none of these will allow a connection to address 9.0.0.1 on the Primary server - (a) and (c) are in a different subnet, (b) is a duplicate IP address, and (d) will fail because there is normally no DHCP server connected to the channel NICs.
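
The reasoning above can be reproduced with Python's standard ipaddress module. This is a sketch under a stated assumption: the article does not give the subnet mask of the channel, so the /24 prefix length used here is hypothetical, as is the function name.

```python
import ipaddress


def check_channel_peer(primary_ip: str, secondary_ip: str,
                       prefix_len: int = 24) -> str:
    """Classify the Secondary channel address relative to the Primary one.

    prefix_len is an assumption; the article does not state the subnet mask.
    """
    primary = ipaddress.ip_interface(f"{primary_ip}/{prefix_len}")
    secondary = ipaddress.ip_address(secondary_ip)
    if secondary == primary.ip:
        return "duplicate"
    if secondary not in primary.network:
        return "different subnet"
    return "ok"


# Cases (a) to (c) from the example, followed by the intended address.
for candidate in ("192.169.1.101", "9.0.0.1", "10.0.0.1", "9.0.0.2"):
    print(candidate, check_channel_peer("9.0.0.1", candidate))
```

Case (d) is not covered by the check, because a NIC left on DHCP has no static address to compare.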

Which address is assigned to Channel NIC #1 on the Secondary server depends on the exact driver configuration of the NICs on that server, as compared with the NICs on the Primary server. The most likely result, and the one usually expected during deployment, would be for the IP address of Channel NIC #1 on the Primary server to be transferred to Channel NIC #1 on the Secondary server.

On rare occasions, if the Primary and Secondary servers have NICs of the same type in a different order, both the name and IP address of a channel NIC on the Primary server may be transferred to the principal (public) NIC on the Secondary; or the name and IP address of the principal (public) NIC may be transferred to a channel NIC. Similarly, the names of the channel NICs may be reversed on the Secondary server under these circumstances. If this happens, it can be hard to reconcile the names of the NICs with their physical identities, making it difficult to assign the correct IP address to each NIC on the Secondary server.

Resolution

It is part of the normal Neverfail Engine installation process to manually assign the correct IP addresses to each NIC on the Secondary server. If there is no channel connection between the servers, check that the IP addresses on the Secondary server's channel NICs are correctly configured. You should also double-check the settings for the principal (public) NIC, since any configuration error here may not be apparent until a switchover is performed or a failover occurs.

It is possible to capture the identities of all of the NICs on the Secondary server prior to installing Neverfail Engine, by opening a Windows Command Prompt on that server and executing the following command:

ipconfig /all > ipconfig.txt

This saves the current name, TCP/IP configuration, and MAC address of each NIC on the Secondary server to a file called ipconfig.txt, which will be present on that server after the Plug and Play phase of the Neverfail Engine install has completed. At this point, it is possible to compare the pre-install and post-install state of each NIC by running 'ipconfig /all' from a Windows command prompt and comparing the output of this command with the content of the file ipconfig.txt. The MAC address of each NIC is tied to the physical identity of each card, and never changes - so it is possible to identify each NIC by its MAC address and determine its original name and network configuration, even if these have been updated by the Plug and Play process.

7. Incorrect Connection Selected in the Public Page of the Configure Server UI

Symptoms

  • The Neverfail Channel drops or disconnects and reconnects repeatedly.
  • With Neverfail Engine running, ping packets are lost.
  • When Neverfail Engine is stopped, no ping packets are lost.

Cause

Configuration of the Public page of the Configure Server wizard is incorrect. The Neverfail Channel connection was selected and appears in the NIC field of the Public page in the Configure Server wizard.

Resolution

Reconfigure the Public page of the Configure Server wizard so that the NIC field contains the Public connection.

  1. Stop Neverfail Engine.
  2. Launch the Neverfail Configure Server wizard.
  3. Select the Public tab.
  4. Change the value of the NIC field to reflect Public.
  5. Click Finish.
  6. Start Neverfail Engine.

8. Routing issues

a. In a LAN

Symptoms

The Neverfail Channel disconnects or fails to connect in a LAN deployment.

Cause

The Neverfail Channel may disconnect or fail to connect when the principal (public) NIC and one or more channel NICs, or two or more channel NICs, share the same subnet.

Resolution

If Neverfail Engine is deployed in a LAN environment, the principal (public) IP address and the channel IP address on a server should be in separate subnets. If there are multiple redundant channels, each should have its own subnet. Check the network configuration for each NIC on both servers in the pair, and correct any issues.

Note: If it is not possible to use different subnets for the Public and Channel, static routes between the two channel connections might have to be configured.
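
The subnet check on each server can be sketched with Python's standard ipaddress module. The function name and NIC labels below are hypothetical, and the prefix lengths must be taken from the actual subnet masks of the NICs.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def overlapping_subnets(nics: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of NIC labels whose subnets overlap.

    nics maps a NIC label to "address/prefix", for example "10.0.0.1/24".
    """
    networks = {label: ipaddress.ip_interface(cidr).network
                for label, cidr in nics.items()}
    return [(a, b) for a, b in combinations(networks, 2)
            if networks[a].overlaps(networks[b])]
```

An empty list means that the public NIC and every channel NIC on that server are in separate subnets. Any pair returned identifies two NICs that need to be moved to separate subnets or covered by static routes, as described in the note above.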

b. In a WAN

Symptoms

The Neverfail Channel disconnects or fails to connect in a WAN deployment.

Cause

When the Neverfail Channel disconnects or fails to connect in a WAN deployment, the static route may be missing or incorrectly configured.

When Neverfail Engine is deployed in a WAN, it is generally not possible for the principal (public) IP address and the channel IP addresses to be in different subnets, since there is usually a single network path between the two servers. In order to ensure that channel traffic is routed only between the endpoints of the channel, it is necessary to configure a static route between these endpoints.

Resolution

Please refer to Knowledge Base article #466 - 'How to create a static route for the Neverfail Channel connection in a WAN environment', for a detailed discussion about WAN channel routing issues, and for instructions on how to configure a static route for the Neverfail Channel.

9. Clock/Time settings

Symptoms

The Neverfail Channel disconnects (apparently) at random, but reconnects soon after.

Cause

The time is changed on one of the servers. The Windows Event Log might contain an event detailing by how much the time has been changed:

"The system time has changed to <new time stamp> from <old time stamp>"

The time change causes the server to incorrectly assume that it has not received any response from the other server, and it reports a "Channel Disconnected" event.

Resolution

  • Stop Windows Time on all servers protected by Neverfail. For more information, see Knowledgebase article #79 - Windows Time Service Should Not Run when Neverfail is Running.
  • If the servers protected by Neverfail Engine are virtual machines, they should not be set to synchronise their time with the physical host.

 

Applies To

All Versions 

Related Information

None

 

KBID-2855

