MaxDiskUsage Errors in Neverfail Heartbeat v6.2 and Later

Summary

This Knowledgebase article describes the Neverfail active server's send queue and the passive server's receive queue, and explains the symptoms, causes, and resolutions of MaxDiskUsage errors.


More Information

Disk Usage and Disk Quota Issues

Neverfail Heartbeat uses queues to buffer the flow of replication data from the active server to the passive server. This configuration provides resilience in the event of user activity spikes, channel bandwidth restrictions, or channel drops (which may be encountered when operating in a WAN deployment). Some types of file write activity may also require buffering as they may cause a sharp increase in the amount of channel traffic. The queues used by Neverfail Heartbeat are referred to as either the send queue (on the active server) or the receive queue (on the passive server).

Send queue

Neverfail considers the send queue 'unsafe' because the data in this queue has not yet been replicated across the channel to the passive server and may therefore be lost in the event of a failover. As a result, some data loss on failover is inevitable whenever the send queue is not empty, with the exact amount depending upon the relationship between current channel bandwidth and the required data transmission rate. If the required data transmission rate exceeds current channel bandwidth, the send queue will fill; if the current channel bandwidth exceeds the required data transmission rate, the send queue will empty. A filling send queue is most commonly seen in a WAN environment, where channel bandwidth may be restricted. In a LAN, which normally has high bandwidth on a dedicated channel, the size of the send queue will be zero or near zero most of the time. Note that on a server which is not protected with Neverfail Heartbeat, all data is technically 'unsafe', and it is therefore possible to lose all data if the server fails.
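
As a rough illustration of the fill-or-empty relationship described above, the following Python sketch estimates how long the send queue would take to reach its quota. The function name and the sample rates are hypothetical and are not part of Neverfail Heartbeat; only the 10 GB default quota comes from this article, and the sketch treats it as 10 GiB.

```python
def seconds_until_quota(quota_bytes, required_rate, channel_rate, queued_bytes=0):
    """Estimate seconds until the send queue reaches its quota.

    Rates are in bytes per second. Returns None when the channel keeps up
    (the queue drains or stays level), which is the normal LAN case.
    """
    net_growth = required_rate - channel_rate  # bytes added to the queue per second
    if net_growth <= 0:
        return None
    remaining = max(quota_bytes - queued_bytes, 0)
    return remaining / net_growth


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical WAN case: 5 MiB/s of changes over a 1 MiB/s channel.
wan_seconds = seconds_until_quota(10 * GIB, 5 * MIB, 1 * MIB)
# Hypothetical LAN case: the channel is faster than the change rate.
lan_seconds = seconds_until_quota(10 * GIB, 5 * MIB, 100 * MIB)
```

With these assumed figures, the WAN case reaches the quota after 2560 seconds (roughly 43 minutes), while the LAN case never does.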

Receive queue

Neverfail considers the receive queue 'safe' because the data in this queue has already been transmitted across the channel from the active server. It will therefore not be lost in the event of a failover, since all updates to the passive server are applied as part of the failover process.

The queues (both send and receive) are stored on disk, by default in <Neverfail Install Directory>\R2\log, with a quota configured for the maximum permitted queue size (by default, 10 GB on each server). Both the queue location and the quota are configurable.
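
To see how close the on-disk queues are to the quota, the size of the queue directory can be measured directly. The following Python sketch is one way to do that; the function names are hypothetical, and the directory path must be supplied by the user because it depends on the local installation.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A queue file may be removed while we walk; skip it.
                continue
    return total


def quota_fraction_used(queue_dir, quota_bytes):
    """Return the on-disk queue size as a fraction of the quota."""
    return directory_size_bytes(queue_dir) / quota_bytes
```

A result close to 1.0 from `quota_fraction_used` would suggest that a MaxDiskUsage error is likely if the queue keeps growing.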

There are two ways to set the queue size:

  • With Neverfail Heartbeat started, open the Neverfail Heartbeat Management Client and select Data: Traffic/Queues -> Configure. Set the value of Allow a maximum of ___ GB of disk space per server, for the Communications Send and Receive Queues and click OK. It is necessary to shut down and restart Neverfail Heartbeat for the change to take effect. This can be done without stopping the protected applications.
  • With Neverfail Heartbeat shut down, open the Configure Server wizard and select the Logs tab. Set the value of Maximum Disk Usage and click Finish.

Note: Neverfail Heartbeat is a symmetrical system, and can operate with either server in the active role. For this reason, the queue size is always set to the same value for both servers.

MaxDiskUsage Errors

If Neverfail Heartbeat exceeds its pre-configured queue size, it will report an error message. There are several possible reasons for this, with the most common ones shown below.

When Neverfail reports '[L9]Exceeded the maximum disk usage (NFChannelExceededMaxDiskUsageException)', the meaning depends on which server raised the error:

  • On the active server, it indicates that the size of the send queue has exceeded the disk quota allocated for it.
  • On the passive server, it indicates that the size of the receive queue has exceeded the disk quota allocated for it.

Neither of these conditions is necessarily fatal, or even harmful; but it is important to try to determine the sequence of events which led to the condition appearing in the first place.

[L9]Exceeded the maximum disk usage on the ACTIVE server

Symptoms

Replication stops and restarts or stops completely (if the event occurs while a Full System Check is in progress) and the Neverfail Heartbeat Event Log displays the error '[L9]Exceeded the maximum disk usage', originating from the ACTIVE server.

Causes

As stated previously, if there is a temporary interruption in the Neverfail Channel, or there is insufficient channel bandwidth to cope with the current volume of replication traffic, the send queue may begin to fill. If the situation persists, the size of the queue may eventually exceed the configured disk quota.

Resolution

If the Neverfail Channel is slow and a queue buildup is repeatedly experienced, measure the available and needed bandwidth and, based on SCOPE measurements, increase the bandwidth to accommodate all the changes during a peak day.

If the Neverfail Channel is having trouble keeping up when server or application maintenance tasks are running, reschedule the maintenance tasks so that they start at different times during the maintenance period, allowing 10-15 minutes between two consecutive tasks. Additionally, check if there are any task configuration settings which could be adjusted to reduce the disk I/O during the maintenance tasks.
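
One way to plan the staggered start times described above is sketched below in Python. The task names, durations, and maintenance window are hypothetical, and the sketch assumes the gap is counted from the estimated end of one task to the start of the next.

```python
from datetime import datetime, timedelta


def stagger_tasks(window_start, tasks, gap_minutes=15):
    """Assign start times so each task begins gap_minutes after the previous one ends.

    tasks is a list of (name, estimated_duration_minutes) pairs.
    """
    schedule = []
    next_start = window_start
    for name, duration_minutes in tasks:
        schedule.append((name, next_start))
        next_start += timedelta(minutes=duration_minutes + gap_minutes)
    return schedule


# Hypothetical maintenance window and task durations.
window = datetime(2024, 1, 7, 1, 0)
plan = stagger_tasks(window, [("backup", 30), ("defrag", 20), ("index rebuild", 45)])
```

With these assumed durations, the three tasks start at 01:00, 01:45, and 02:20, so no two tasks generate replication traffic at the same time.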

A less desirable option is to measure the amount of data generated and, based upon the measurements, increase the size of the queue accordingly.

Caution: Increasing the size of the affected queue also increases the volume of data at risk should a failover occur. Before adjusting the queue size, the user should review their Recovery Point Objective (RPO) to ensure that the volume of potential data loss resulting from a failover is within their RPO.
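
The RPO review can be approximated with simple arithmetic. The Python sketch below converts a full send queue into the amount of change time it represents and compares that with an RPO. It assumes the RPO is expressed as a time interval and that the average change rate has been measured beforehand; the function names and figures are hypothetical.

```python
def seconds_of_changes_at_risk(queue_bytes, change_rate):
    """Convert a full send queue into the seconds of changes it represents.

    change_rate is the average rate of replicated changes in bytes per second.
    """
    if change_rate <= 0:
        raise ValueError("change_rate must be positive")
    return queue_bytes / change_rate


def queue_size_within_rpo(queue_bytes, change_rate, rpo_seconds):
    """Return True if a full send queue still falls within the RPO."""
    return seconds_of_changes_at_risk(queue_bytes, change_rate) <= rpo_seconds


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical figures: 4 MiB/s average change rate, one-hour RPO.
current_ok = queue_size_within_rpo(10 * GIB, 4 * MIB, 3600)
doubled_ok = queue_size_within_rpo(20 * GIB, 4 * MIB, 3600)
```

Under these assumptions, a full 10 GiB queue represents 2560 seconds of changes and stays within a one-hour RPO, while doubling the queue to 20 GiB (5120 seconds) does not.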

[L9]Exceeded the maximum disk usage on the PASSIVE server

Symptoms

Replication stops and restarts or stops completely (if the event occurs while a Full System Check is in progress) and the Neverfail Heartbeat Event Log displays the error '[L9]Exceeded the maximum disk usage', originating from the PASSIVE server.

Causes

  • In this situation, the bottleneck lies between the Neverfail Channel NIC and the disk subsystem on the passive server. Replication traffic therefore passes across the channel faster than it can be written to disk on the passive server; it is buffered temporarily in the receive queue. As before, if this situation persists, the size of the queue may eventually exceed the disk quota allotted for it.
  • If the passive server is much less powerful than the active server, in terms of processor speed, RAM or disk performance, it may lag behind the active server during periods of high replication activity. If you suspect this is the case, it may be useful to monitor one or more Windows performance counters in order to determine which component is experiencing sustained high activity. Intensive page file use or a persistently large disk queue length may indicate a problem, which can be solved by upgrading one or more physical components of the server.

    Note that either server can be active or passive. If the Secondary server is more powerful than the Primary server, hardware-related issues might only occur while the Secondary server is in the active role.

Resolution

  • If you have multiple physical disks on each server, it may be worth locating the Neverfail send and receive queues on a separate physical disk, away from the Windows directory, the Windows page file, and any protected files, to help alleviate disk performance issues. To do this:
    1. Shut down Neverfail Heartbeat.
    2. Open the Configure Server wizard and select the Logs tab.
    3. Set the desired path for Message Queue Logs Location and click Finish.
    4. Start Neverfail Heartbeat on both servers.

      Note: The selected path will be applied to all Neverfail queues on both servers.

  • As before, you may alleviate the symptoms of this problem by simply increasing the amount of disk space allotted to the queues. However, if you have reason to suspect that a hardware issue is the root of the problem, it is better to correct that problem at the source if possible.
  • It is also possible for the size of the receive queue to increase sharply in response to certain types of file write activity on the active server. This is most obvious when Neverfail Heartbeat is replicating a large number of very small updates (typically a few bytes each) - the volume of update traffic may be far greater than the physical size of the files on the disk, and so the receive queue in particular may become disproportionately large. This pattern of disk activity is often seen during the population of Full-Text Catalogs in Microsoft SQL Server.

    In this case, you should increase the amount of disk space available for the queues, as described above; moving the queues to their own physical disk, or upgrading memory or the disk subsystem may also help to alleviate the issue.

  • Neverfail Heartbeat requires a certain amount of system resource for its own basic operations and requires some additional resources for processing replication traffic. This is in addition to the resources used by Windows and other applications running on the server (including critical applications protected by Heartbeat). It is always a good idea to ensure that there are sufficient resources for all of the applications and services running on such a server, in order to provide maximum performance, stability, and resilience in the face of changing client, server, and network activity.

[L20]Out of disk space (NFChannelOutOfDiskSpaceException)

Symptoms

Replication stops, and the Neverfail Heartbeat Event Log displays the error '[L20]Out of disk space', originating from either server in the pair.

Causes

This is similar to the '[L9]Exceeded the maximum disk usage' scenario, with one important difference - one of the queues has exceeded the amount of physical disk space available for it, without reaching its quota limit. So, for example, if the maximum queue size is set to 5 GB, but only 3 GB of physical disk space remains, this message will be reported if one of the queues exceeds 3 GB in size.

Resolution

The strategy for dealing with this is simple - it is necessary either to free up more disk space, or to move the queues to a disk with sufficient free space to accommodate queue sizes up to the limit configured for Maximum Disk Usage.
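
A quick check of which limit a queue would hit first can be sketched in Python as follows. The function names are hypothetical; `shutil.disk_usage` is the standard-library call used to read the free space on the volume that holds the queue directory.

```python
import shutil


def first_limit_reached(free_bytes, quota_bytes, queue_bytes=0):
    """Return 'disk' if free space runs out before the quota, else 'quota'."""
    # Free space already excludes what the queue currently occupies,
    # so compare it with the quota headroom that is still unused.
    quota_headroom = quota_bytes - queue_bytes
    return "disk" if free_bytes < quota_headroom else "quota"


def check_queue_volume(queue_dir, quota_bytes, queue_bytes=0):
    """Apply first_limit_reached to the volume that holds queue_dir."""
    free_bytes = shutil.disk_usage(queue_dir).free
    return first_limit_reached(free_bytes, quota_bytes, queue_bytes)


GB = 10 ** 9
# The example from this section: 5 GB quota, 3 GB of free disk space.
example = first_limit_reached(free_bytes=3 * GB, quota_bytes=5 * GB)
```

For the 5 GB quota and 3 GB of free space used in the example above, the result is 'disk', meaning the '[L20]Out of disk space' error would occur before the quota is reached.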


Applies To

Neverfail Heartbeat v6.2 and Later


Related Information

Knowledgebase article #993 - MaxDiskUsage Errors in Neverfail Heartbeat v5.3 through v6.0

KBID-2056
