Effects of Disk Bottlenecks on Neverfail Heartbeat System Performance (Queue)


Summary

This knowledge base article describes the effects of disk bottlenecks on Neverfail Heartbeat system performance.


More Information

Disk performance problems often exist before Neverfail Heartbeat is installed. The additional disk load created by replication often exposes an underlying or latent problem. For this reason, use Neverfail SCOPE to analyze your system before installing Neverfail Heartbeat. The Resolution section below describes remedial actions that can improve system performance; where possible, make these changes before installing Neverfail Heartbeat. Then re-run Neverfail SCOPE to confirm that the expected performance improvement has occurred.

Symptoms

A disk bottleneck is characterized by the following system behaviors:

  1. The application starts running slowly on the active server.
  2. Users complain that the protected application takes a long time to process requests.
  3. Stopping Neverfail Heartbeat while leaving the application running returns the application to normal operation.
  4. The NFLog.txt file on the passive server contains a large number of entries such as the following:

    'com.neverfail.nfchannel.nflog Delay Factor now 2'
    'com.neverfail.nfchannel.nflog Delay Factor now 4'
    'com.neverfail.nfchannel.nflog Delay Factor now 1'
    'com.neverfail.nfchannel.nflog Delay Factor now 8'

  5. SCOPE graphs show sustained peaks in disk activity.

    Note: If this is visible before Neverfail Heartbeat is installed, it should be rectified before installation. Refer to the Resolution section.

    If it only appears after Neverfail Heartbeat is installed, the additional read/write overhead of protecting the data is too great for the disk. Refer to the Resolution section.

  6. Periods of high CPU utilization.
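To judge whether symptom 4 is present, the Delay Factor entries in NFLog.txt can be counted. The following is a minimal Python sketch, assuming only that each relevant line contains the text "Delay Factor now <N>" as in the entries quoted above; the file encoding is an assumption, and the function names are hypothetical helpers, not part of Neverfail Heartbeat.

```python
import re
from collections import Counter

# Assumption: each relevant NFLog.txt line contains "Delay Factor now <N>", as in the
# entries quoted above. Nothing else about the line format is relied on.
DELAY_RE = re.compile(r"Delay Factor now (\d+)")


def count_delay_factors(lines):
    """Count how many times each Delay Factor value appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DELAY_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def count_delay_factors_in_file(path):
    """Scan a log file line by line; the UTF-8 encoding is an assumption."""
    # errors="replace" keeps the scan going if the file contains bytes that are not UTF-8.
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_delay_factors(log)
```

For example, count_delay_factors_in_file("path/to/NFLog.txt") returns a Counter keyed by Delay Factor value; a large total number of entries corresponds to symptom 4.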

In these cases the load on the disks is too great, and the passive server cannot apply the replicated changes to disk fast enough. As a result, the Neverfail Heartbeat queues build up on disk (for example, in c:\neverfail\r2\log). When the queues grow beyond a certain size, Neverfail Heartbeat starts to slow the system down to give the updates time to be written to disk, which keeps the servers synchronized and minimizes switchover and failover times. If the configured queue limit (1 GB for v6.0 or earlier, 10 GB for v6.2 or later) is reached, Neverfail Heartbeat shuts down and leaves the protected application running. These actions are intended to protect the integrity of the data on the active server.
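Because the queue grows whenever changes arrive faster than the passive server can apply them, two size measurements of the queue folder give a rough estimate of how long remains before the configured limit is reached. The Python sketch below is illustrative only: it assumes linear growth between samples, takes 1 GB as 1024**3 bytes (an assumption), and its function names are hypothetical. The article gives no limit for v6.1, so pass the limit explicitly if your version is not covered.

```python
import os

GB = 1024 ** 3  # assumption: the article's "GB" taken as binary gigabytes
LIMIT_V60_OR_EARLIER = 1 * GB   # limit quoted for v6.0 or earlier
LIMIT_V62_OR_LATER = 10 * GB    # limit quoted for v6.2 or later


def directory_size(path):
    """Total size in bytes of all files under path, e.g. the queue folder.

    A missing path yields 0 because os.walk silently skips it.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a queue file may be removed between listing and sizing
    return total


def seconds_until_limit(size_before, size_after, interval_s, limit):
    """Linear estimate of seconds until the queue reaches limit.

    size_before and size_after are queue sizes in bytes measured interval_s
    seconds apart. Returns None when the queue is not growing.
    """
    growth_per_s = (size_after - size_before) / interval_s
    if growth_per_s <= 0:
        return None
    return max(0.0, (limit - size_after) / growth_per_s)
```

In practice, call directory_size on the queue folder (for example c:\neverfail\r2\log), wait a few minutes, call it again, and pass both sizes with the elapsed seconds and the limit for your version. A steadily shrinking estimate during busy periods is consistent with the passive server falling behind.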

Causes

The following is a list of the common causes of disk bottlenecks and poor system performance:

  1. Poorly configured applications. This is by far the most common cause of system performance problems. For Exchange and SQL Server, Microsoft recommends storing logs and databases on separate disks (physically separate disk systems, not just separate partitions).
  2. A poorly configured disk subsystem on the passive server. Check the configuration: are the RAID levels the same on the Primary and Secondary servers? Check that you are using the latest SCSI or RAID drivers.
  3. Malfunctioning disks. Check for errors in the Windows Application and System event logs.
  4. An underspecified disk subsystem on the passive server. Are the manufacturer's specifications similar for the active and passive servers? Large differences in read/write speeds can cause a bottleneck.
  5. Disk-intensive activities, for example database defragmentation and other database housekeeping tasks.

Failing to follow point 1 is likely to cause bottlenecks on busy systems, because every write goes to the logs first before being applied to the database. This condition predates the Neverfail Heartbeat installation and can be identified by disk utilization of 70-80% or more in the SCOPE reports. Periods of high utilization may be short, with sudden flurries of activity, or sustained for long periods, as in the case of point 5. Regardless of their duration, investigate the cause, because such periods indicate impending problems if the load on the server increases due to:

  • more users
  • heavier and more complex queries
  • a larger number of transactions
  • additional services or applications being installed

The impact becomes obvious once Neverfail Heartbeat is introduced, because disk reads and writes immediately increase on both the active and passive servers. Of particular concern is the additional load on the passive server, which comes from replicating the protected data. See the Resolution section for how best to configure the system and limit the potential for bottlenecks.

Resolution

Perform the analysis that matches your situation:

  • Pre-Neverfail Heartbeat

Run Neverfail SCOPE or Windows Perfmon and analyze the output. Look for periods where disk time (percent usage) exceeds 60%. Peaks over 80% are of great concern, but periods between 60% and 80% should be investigated as well.
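If Perfmon data is exported to CSV, the 60% and 80% thresholds above can be applied per drive with a short script. This is a minimal Python sketch, assuming a CSV whose first column is a timestamp and whose remaining columns are % Disk Time counters, one per drive; the column layout and function names are assumptions, not a SCOPE format.

```python
import csv
import io

INVESTIGATE = 60.0  # % disk time: periods above this should be investigated
CONCERN = 80.0      # % disk time: peaks above this are of great concern


def classify(value):
    """Map one % Disk Time sample to the article's three bands."""
    if value > CONCERN:
        return "concern"
    if value > INVESTIGATE:
        return "investigate"
    return "ok"


def summarize_disk_time(csv_text):
    """Summarize a CSV export whose first column is a timestamp and whose
    other columns are % Disk Time counters (one per drive)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counters = header[1:]  # skip the timestamp column
    summary = {name: {"ok": 0, "investigate": 0, "concern": 0, "peak": 0.0}
               for name in counters}
    for row in reader:
        for name, raw in zip(counters, row[1:]):
            try:
                value = float(raw)
            except ValueError:
                continue  # skip blank or non-numeric samples
            stats = summary[name]
            stats[classify(value)] += 1
            stats["peak"] = max(stats["peak"], value)
    return summary
```

The counts are numbers of samples, so multiplying them by the sampling interval gives the approximate time each drive spent in each band. Short bursts and sustained periods both deserve attention, as described under Causes.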

  • Post-Neverfail Heartbeat

  1. Run Neverfail SCOPE Professional on the Active and Passive servers (without network monitoring) while Neverfail Heartbeat is running, and analyze the data after 24 hours.
  2. Collect Perfmon statistics for the individual drives on the Active and Passive servers.
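Per-drive Perfmon statistics can also be collected from the command line with typeperf, which ships with Windows. The Python sketch below builds a typeperf command that logs the standard PhysicalDisk counters for every physical disk to CSV over 24 hours; the 15-second interval, the choice of counters and the function names are assumptions chosen for illustration. Run it separately on the Active and Passive servers with different output file names.

```python
import subprocess

# Standard Windows PhysicalDisk counters; (*) selects every physical disk.
COUNTERS = [
    r"\PhysicalDisk(*)\% Disk Time",
    r"\PhysicalDisk(*)\Avg. Disk Queue Length",
]


def typeperf_command(output_csv, interval_s=15, hours=24):
    """Build a typeperf command that samples COUNTERS to a CSV file for `hours` hours."""
    sample_count = hours * 3600 // interval_s
    return ["typeperf", *COUNTERS,
            "-si", str(interval_s),     # sample interval in seconds
            "-sc", str(sample_count),   # number of samples before typeperf exits
            "-f", "CSV",
            "-o", output_csv]


def collect(output_csv):
    """Windows only: blocks until all samples have been written."""
    subprocess.run(typeperf_command(output_csv), check=True)
```

With the defaults, collect("active.csv") takes 5760 samples (24 hours at 15-second intervals).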

After analyzing the information, make sure the application data layout suits the server and its load. Where possible, follow this advice:

  1. Avoid placing application database files, database logs, and Neverfail logs on the same disk.
  2. Avoid placing Neverfail Heartbeat logs and application logs on the same disk.
  3. Where possible, avoid placing logs on the Windows system partition.
  4. Use separate disks for each of the file types described.

Spare disks are not always immediately available, so a compromise based on the available resources may be necessary. For this reason, analyze disk usage before making changes and again afterwards to confirm that the bottleneck has been removed. Use the resulting graphs to balance the load so that it is spread across multiple disks.
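One rough way to use the measured figures when spreading load is a greedy "largest first onto the least-loaded disk" assignment. This Python sketch is a planning aid under strong assumptions, not a Neverfail tool: it assumes identical disks, assumes each file type's utilization was measured on its own and simply adds up when types share a disk, and uses a hypothetical function name.

```python
def spread_load(loads, disk_count):
    """Greedy largest-first assignment of file types to disks.

    loads maps a file type to its measured disk utilization (%). Assumes identical
    disks and that utilizations add when file types share a disk (a rough model).
    """
    disks = [{"types": [], "load": 0.0} for _ in range(disk_count)]
    # Heaviest workloads first; ties broken by name so the result is repeatable.
    for name, load in sorted(loads.items(), key=lambda item: (-item[1], item[0])):
        target = min(disks, key=lambda disk: disk["load"])  # first least-loaded disk
        target["types"].append(name)
        target["load"] += load
    return disks
```

The heuristic only balances totals and ignores the placement rules above (for example, keeping Neverfail and application logs apart), so check any suggested layout against those rules and confirm it with fresh measurements.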

Ideal
c:\ = Windows partition & Neverfail logs
d:\ = application database
e:\ = application logs

Where only two drives are available, try one of the following compromises:

Compromise 1
c:\ = Windows partition & application database
d:\ = application logs & Neverfail logs

Compromise 2
c:\ = Windows partition & application logs
d:\ = application database & Neverfail logs

While Compromise 1 or 2 may be sufficient in the short term, it may not meet future demands on the system. Monitor disk usage closely, with a view to upgrading the disks in the future.
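The placement rules in this section can be expressed as a small checker and applied to the Ideal layout and the two compromises. This is a minimal Python sketch with hypothetical role names; it assumes each drive letter is a physically separate disk, as recommended under Causes. A disk holding all three file types from rule 1 necessarily trips both pairwise log checks, so rule 1 needs no separate test.

```python
# Hypothetical role labels for this sketch; they are not Neverfail identifiers.
WINDOWS = "windows"
DATABASE = "app_database"
APP_LOGS = "app_logs"
NF_LOGS = "neverfail_logs"


def check_layout(layout):
    """layout maps a drive letter (assumed to be its own physical disk) to a set of roles.

    Returns a list of (drive, finding) tuples in drive order.
    """
    findings = []
    for drive, roles in layout.items():
        if {DATABASE, APP_LOGS} <= roles:
            findings.append((drive, "database and its logs share one disk"))
        if {APP_LOGS, NF_LOGS} <= roles:
            findings.append((drive, "Neverfail and application logs share one disk"))
        if WINDOWS in roles and roles & {APP_LOGS, NF_LOGS}:
            findings.append((drive, "logs on the Windows system partition (avoid where possible)"))
    return findings


LAYOUTS = {
    "Ideal": {"c": {WINDOWS, NF_LOGS}, "d": {DATABASE}, "e": {APP_LOGS}},
    "Compromise 1": {"c": {WINDOWS, DATABASE}, "d": {APP_LOGS, NF_LOGS}},
    "Compromise 2": {"c": {WINDOWS, APP_LOGS}, "d": {DATABASE, NF_LOGS}},
}

for name, layout in LAYOUTS.items():
    for drive, finding in check_layout(layout):
        print(f"{name}: {drive}:\\ {finding}")
```

Each layout trips exactly one rule: the Ideal keeps Neverfail logs on the system partition, Compromise 1 shares a disk between the two sets of logs, and Compromise 2 puts application logs on the system partition. This reflects the trade-offs a compromise layout involves.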


Applies To

All Versions


Related Information

None

KBID-245
