This article provides additional information about how the File State Manager (FSM) operates. It was written to supplement readers' understanding of the FSM, in particular why certain configuration values exist and how they can be changed. It is aimed at QA staff and developers who are not already familiar with the FSM.
In this document the word "node" is used to mean either a file or a directory.
Synchronized: we think that the protected file system on the active and passive servers is the same.
Unchecked: we know that we have not successfully verified and synchronized the file system.
Out of Synch: we think that the protected file system on the active and passive servers is not the same.
When a node has been identified as being "Out of Synch", or it fails to verify or synchronize, the FSM places it in a container called the BadState. The BadState is just a record of the nodes that the FSM has found to be "Out of Synch".
The size of the BadState can be determined using nfclient:
BadState in : 1550
BadState size : 0
The FSM is task based: when something happens on the active file system that requires the FSM to act, a task is loaded into the FSM TaskMgr. The majority of these tasks work out which nodes are included in the task and pass them to the FSM's VerifierSynchronizer to be verified or synchronized.
The FSM can have a current task that it is working on and a queue of tasks that are waiting to be done. The current task can always be cancelled via the GUI. The GUI only ever displays up to TaskListSize tasks from the task queue. TaskListSize (defaulted to 10) exists to reduce the traffic between the client and the server. This default value can be read or changed using nfclient.
setp NewFileStateMgr TaskListSize 15
The number of tasks inserted into the queue and the number of tasks that have been removed from the queue since Neverfail Heartbeat started replicating can be found using nfclient.
Tasks in : 1 out : 1 size : 0
Size in bytes : 0
When a task is added to the task queue, the FSM checks whether the queue already holds an identical task or a task that already contains the work of the new task. If it does, the new task is not added. AddTaskOptimisation (default true) determines whether this check is performed. For example, if the queue already contains a FullSystemCheckTask then an attempt to add a SystemTask will not succeed, since a FullSystemCheckTask contains the work included in a SystemTask. This optimization can be switched on and off using nfclient.
setp NewFileStateMgr AddTaskOptimisation true //on
setp NewFileStateMgr AddTaskOptimisation false //off
Additionally, the FSM keeps track of how much memory is being used by the tasks in the queue so that it can limit the maximum amount of memory that the queue can use. When the queue reaches the maximum size in bytes, TaskWorkSize (defaulted to 10MB), the FSM collapses all the tasks into a single FullSystemCheckTask. This may cause the FSM to repeatedly perform FullSystemCheckTasks: while the current FullSystemCheckTask is being worked on, new tasks can be loaded into the queue, and these can cause the queue to exceed TaskWorkSize again so that a new FullSystemCheckTask is loaded. TaskWorkSize can be modified using nfclient.
setp NewFileStateMgr TaskWorkSize 15728640
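The two queue behaviours described above (rejecting work that is already covered, and collapsing the queue once it reaches TaskWorkSize) can be illustrated with a minimal Python sketch. This is not the FSM's implementation: the `Task` and `TaskQueue` classes, the per-task `size_bytes` accounting, and the zero size given to the collapsed task are assumptions made for illustration. Only the rule that a FullSystemCheckTask contains the work of a SystemTask comes from the text.

```python
from collections import deque
from dataclasses import dataclass

FULL_CHECK = "FullSystemCheckTask"


@dataclass(frozen=True)
class Task:
    kind: str        # e.g. "SystemTask" or "FullSystemCheckTask"
    size_bytes: int  # memory attributed to the task (hypothetical accounting)


def covers(queued: Task, new: Task) -> bool:
    """True if `queued` is identical to `new` or already contains its work."""
    if queued == new:
        return True
    # From the article: a FullSystemCheckTask contains the work of a SystemTask.
    return queued.kind == FULL_CHECK and new.kind == "SystemTask"


class TaskQueue:
    def __init__(self, task_work_size=10 * 1024 * 1024, add_task_optimisation=True):
        self.task_work_size = task_work_size
        self.add_task_optimisation = add_task_optimisation
        self.tasks = deque()

    def size_bytes(self) -> int:
        return sum(t.size_bytes for t in self.tasks)

    def add(self, task: Task) -> bool:
        """Add a task; return False if the optimisation rejected it."""
        if self.add_task_optimisation and any(covers(t, task) for t in self.tasks):
            return False
        self.tasks.append(task)
        if self.size_bytes() >= self.task_work_size:
            # Queue reached its byte limit: collapse everything into one full check.
            self.tasks.clear()
            self.tasks.append(Task(FULL_CHECK, 0))
        return True
```

With a deliberately small limit of 100 bytes, adding a 40 byte SystemTask succeeds, adding an identical one is rejected, and adding a further 70 byte SystemTask pushes the queue over the limit so that it collapses to a single FullSystemCheckTask.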
The GUI can be used to navigate through the protected set. The number of child nodes that are displayed for a particular node is limited to MaxDoApplyFilterResultSize (defaulted to 1000). MaxDoApplyFilterResultSize can be modified using nfclient.
setp NewFileStateMgr MaxDoApplyFilterResultSize 2000
Some tasks are recursive and are represented by a directory. The work content for a recursive task includes all the children of the directory, the children of those children, and so on. A recursive task starts at some directory and recursively lists the contents (nodes) of the directory into a single file called a ListFile. The ListFile can be read from and written to at the same time, which allows the FSM to list files while it verifies and synchronizes files. While the ListFile is being created the GUI displays "Calculating work content". The ListFile is created in the temporary directory <installDir>\Neverfail\R2\FSMtemp.
An FSM component called the TaskMgr reads up to MaxVerifySyncFiles nodes (defaulted to 1000) from the ListFile and either verifies or synchronizes them, depending on the type of task that is being executed. MaxVerifySyncFiles limits the number of files and folders that the TaskMgr attempts to work on at any one time in order to manage memory. It can be modified using nfclient.
setp NewFileStateMgr MaxVerifySyncFiles 1500
The size of the ListFile is not limited. This means that on a large file server containing 30 million nodes the ListFile could be 30,000,000 (nodes) x 200 (bytes) = 6,000,000,000 bytes, or approximately 5.59GB, in length (assuming a file name of 100 wide characters at 2 bytes each). Neverfail must therefore be installed into a location that has enough disk capacity to support a ListFile of this size.
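The ListFile estimate can be reproduced with a short calculation. The function name and its parameters are hypothetical; the figures (30 million nodes, 100 wide characters per name, 2 bytes per wide character) are the ones assumed in the text, and the result ignores any per-entry overhead the real ListFile may carry.

```python
def list_file_size_bytes(node_count, name_chars=100, bytes_per_char=2):
    """Rough ListFile size: one file name of `name_chars` wide characters per node."""
    return node_count * name_chars * bytes_per_char


size = list_file_size_bytes(30_000_000)
print(size)                    # 6000000000
print(round(size / 2**30, 2))  # 5.59 (size in GB, taking 1GB as 2**30 bytes)
```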
Non Recursive Tasks
Some tasks are non-recursive. This means that the nodes to be included in the task are explicitly named in the task and no listing is carried out.
Some tasks involve a passive exist check: the nodes that are listed on the active server are checked to see whether they exist on the passive server. This is done to prevent an attempt to verify a file that does not exist on the passive. The exist check also deletes files on the passive that exist as directories on the active and, similarly, deletes directories on the passive that exist as files on the active. The exist check is carried out in parallel with verification and synchronization, so that while the FSM component called the VerifierSynchronizer is verifying or synchronizing a batch of nodes, the TaskMgr is exist checking the next set of nodes.
Verification and Synchronization
The FSM uses a threshold, VerifySize (defaulted to 4KB), to determine whether to verify or synchronize a file. If the file is smaller than VerifySize then the file is synchronized without verification. No attempt is made to verify directories; they are always synchronized. VerifySize can be modified using nfclient.
setp NewFileStateMgr VerifySize 1024
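The VerifySize rule amounts to a simple decision per node. The sketch below is illustrative only: the function name, its parameters, and the string return values are hypothetical, and it assumes that a file exactly equal to VerifySize is verified, which the text does not state.

```python
def choose_operation(is_directory, size_bytes, verify_size=4096):
    """Pick the first operation for a node under the VerifySize rule."""
    # Directories are never verified; small files skip verification.
    if is_directory or size_bytes < verify_size:
        return "synchronize"
    return "verify"


print(choose_operation(False, 1024))     # synchronize
print(choose_operation(False, 1 << 20))  # verify
print(choose_operation(True, 0))         # synchronize
```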
Once the operation (verify or synchronize) has been determined, the node is placed in a RequestQueue. Nodes are loaded into the RequestQueue by the TaskMgr in batches of MaxVerifySyncFiles; each entry is called a Request. The component that reads from the RequestQueue is called the VerifierSynchronizer. The VerifierSynchronizer reads the Requests from the RequestQueue, verifies or synchronizes them, and places each Request in a WaitingRequestQueue. This queue holds Requests that have been made (verified or synchronized) but have not yet received a response from the passive server. The number of bytes included in each verify/synchronize Request is noted and added to a count. This count is kept because the VerifierSynchronizer has a threshold, MaxVolumeOfWaitingRequests (defaulted to 30MB), that it uses to determine whether it should wait for a Response or proceed to process the next Request from the RequestQueue. The purpose of limiting the maximum volume of waiting requests is to throttle the FSM. MaxVolumeOfWaitingRequests can be modified using nfclient.
setp NewFileStateMgr MaxVolumeOfWaitingRequests 10485760
Once a Response arrives from the passive server, the corresponding WaitingRequest is removed from the WaitingRequestQueue and inserted into the ResponseQueue. The VerifierSynchronizer always removes Responses from the ResponseQueue and processes them before proceeding with a Request. Once a Response has been processed, the number of bytes previously counted is subtracted from the count, so that the volume of waiting requests is kept below the threshold MaxVolumeOfWaitingRequests. The in/out values for the RequestQueue, WaitingRequestQueue and ResponseQueue can be displayed using nfclient doPrintStats:
Verify Sync VS Request in : 10000 out : 10000 size : 0
VS Waiting Requests in : 10000 out : 10000 size : 0
VS Buffers in : 0 out : 0 size : 0
VS Waiting Responses in : 10000 out : 10000 size : 0
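The interaction between the three queues and the MaxVolumeOfWaitingRequests threshold can be sketched as a small simulation. This is an illustration under stated assumptions, not the VerifierSynchronizer's code: the function name is hypothetical, the passive server is modelled as answering the oldest waiting request whenever the sender has to wait, and the threshold is checked before each send, so the waiting volume can overshoot the threshold by up to one request.

```python
from collections import deque


def run_batch(request_sizes, max_volume_of_waiting_requests=30 * 1024 * 1024):
    """Simulate one batch; return the peak volume (bytes) awaiting a response."""
    request_queue = deque(enumerate(request_sizes))  # (request id, bytes)
    waiting = deque()    # sent to the passive, not yet answered
    responses = deque()  # answered, not yet processed
    waiting_volume = 0
    peak = 0
    while request_queue or waiting or responses:
        # Responses are always processed before the next Request.
        while responses:
            _, nbytes = responses.popleft()
            waiting_volume -= nbytes
        if request_queue and waiting_volume < max_volume_of_waiting_requests:
            request = request_queue.popleft()
            waiting.append(request)
            waiting_volume += request[1]
            peak = max(peak, waiting_volume)
        elif waiting:
            # Throttled, or nothing left to send: wait for a Response
            # (modelled as the oldest waiting request being answered).
            responses.append(waiting.popleft())
    return peak


print(run_batch([10] * 10, max_volume_of_waiting_requests=30))  # 30
```

In this model ten 10 byte requests against a 30 byte threshold never have more than 30 bytes outstanding, because the fourth request is only sent after a response has reduced the count.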
When a Response is processed, the node is marked as either "Out of Synch" or "In Synch". If the Response is "In Synch" then the file is removed from the BadState and the Request is discarded. If the initial Request was for a verify and the Response is "Out of Synch" then the Request is placed back into the RequestQueue with its operation switched to synchronize. If the initial Request was for a synchronize and the Response is "Out of Synch" then the Request is placed back into the RequestQueue with its operation unchanged. In both cases the Request is retried. Retries are attempted up to MaxSynAttempts times (defaulted to 3). If this threshold is exceeded, the node represented by the Request is left on the BadState, the VerifierSynchronizer proceeds with the next piece of work, and the current task will complete with the node still on the BadState.
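The retry behaviour can be sketched as follows. The function and its arguments are hypothetical, the passive server's answers are supplied as a list of booleans (True meaning "In Synch"), and the sketch assumes that the initial verify counts as one of the MaxSynAttempts attempts, which the text leaves open.

```python
def verify_sync_node(node, operation, passive_results, bad_state, max_syn_attempts=3):
    """Retry a node until it is in synch or attempts run out.

    Returns the list of operations that were attempted, in order.
    """
    attempted = []
    results = iter(passive_results)
    for _ in range(max_syn_attempts):
        attempted.append(operation)
        in_synch = next(results)
        if in_synch:
            bad_state.discard(node)  # node leaves the BadState
            return attempted
        bad_state.add(node)
        # A failed verify is retried as a synchronize; a failed
        # synchronize is retried unchanged.
        operation = "synchronize"
    return attempted  # node stays on the BadState


bad_state = set()
print(verify_sync_node("c:\\a\\file.txt", "verify", [False, True], bad_state))
# ['verify', 'synchronize']
print(bad_state)  # set()
```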
Various operations performed by the application are dealt with by the FSM. These operations include renames from protected to protected, renames from unprotected to protected, and passive requests for synchronization of nodes. This type of work is called SystemWork. It is difficult to predict the volume of SystemWork, since it is mainly initiated directly or indirectly by the application. The FSM therefore provides a component, also called SystemWork, that gives this work an express route to completion. The use of SystemWork can be switched off using nfclient.
setp NewFileStateMgr UseSystemWork false
SystemWork provides the following functionality:
It collates work so that non-recursive verification and synchronization can be executed within the context of the current task by the VerifierSynchronizer.
It collates work so that recursive verification and synchronization can be inserted at the head of the task queue.
It optimizes work being added by removing duplicates and by checking whether the work is already covered by previously added work.
It limits its own size by collapsing non-recursive work into recursive work once the size exceeds SystemWorkMaxSize (defaulted to 10MB). This collapsing is achieved by finding a common parent for individual pieces of work. For example c:\a\b\c, c:\a\b\d and c:\a\b\e could all be collapsed to a recursive c:\a\b.
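The collapsing step can be illustrated with Python's standard `ntpath.commonpath`, which finds the longest common Windows-style parent path. This is a sketch rather than the FSM's algorithm: the function name, the dictionary of path to size in bytes, and the choice to collapse all pending work into a single parent are assumptions. Note that `ntpath.commonpath` raises `ValueError` if the paths are on different drives, which this sketch does not handle.

```python
import ntpath


def collapse_if_too_large(work, system_work_max_size=10 * 1024 * 1024):
    """Return a recursive root if the pending work exceeds the limit, else None.

    `work` maps each non-recursive path to the bytes it occupies (hypothetical).
    """
    if sum(work.values()) <= system_work_max_size:
        return None
    # Collapse the individual pieces of work to their common parent.
    return ntpath.commonpath(list(work))


work = {r"c:\a\b\c": 6, r"c:\a\b\d": 6, r"c:\a\b\e": 6}
print(collapse_if_too_large(work, system_work_max_size=10))  # c:\a\b
```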
Listed below are the tasks that the FSM loads and their characteristics:
BadStateTask
This task takes the nodes in the BadState and attempts to verify and synchronize them. By default the FSM is configured to automatically create a BadStateTask when it has no other work to do and there are nodes in the BadState. This feature can be switched on and off by setting the persistent property AutoSweep to true or false using nfclient:
getr NewFileStateMgr
setp NewFileStateMgr AutoSweep true
setp NewFileStateMgr AutoSweep false
Canceling: The synchronization status is unchanged.
Corresponding user events: FileSystemBadStateStartedEvent, FileSystemBadStateDoneEvent, FileSystemBadStateCancelledEvent
FilterChangeTask "Filter Change Check"
This task is loaded when new filters are configured and Neverfail is replicating. The task comprises the additional nodes included by the filter change. This task can be triggered by changing the current filters.
Canceling: The synchronization status is set to Unchecked.
Corresponding events: FileSystemFilterChangeStartedEvent, FileSystemFilterChangeDoneEvent, FileSystemFilterChangeCancelledEvent
FullMarkStateTask "Full Mark State As In Synch"
This task does not involve any verify or sync work; it just empties the BadState and marks the full file system as Synchronized.
Canceling: The synchronization status is unchanged.
Corresponding events: FileSystemFullMarkStateStartedEvent, FileSystemFullMarkStateDoneEvent, FileSystemFullMarkStateCancelledEvent
FullSystemCheckTask "Full system Check"
Used to verify and synchronize the full protected set identified by the filters. This task is automatically loaded when Neverfail starts replicating and AutoCheck is on, which it is by default. AutoCheck can be configured using nfclient:
getr NewFileStateMgr (identifies the current value of AutoCheck)
setp NewFileStateMgr AutoCheck true
setp NewFileStateMgr AutoCheck false
Canceling: The synchronization status is set to Unchecked.
The task can be manually loaded using nfclient:
doFullSystemCheck
Corresponding events: FileSystemCheckingStartedEvent, FileSystemCheckingDoneEvent, FileSystemCheckingCancelledEvent
MarkStateTask "Mark State As In Sync"
This task does not involve any verify or sync work; it just removes the node from the BadState if it exists in the BadState.
Canceling: The synchronization status is unchanged.
This task can be manually loaded using nfclient:
doMarkState C:\Protected<DIR> (This just marks the single directory as In Synch)
doMarkStateRecursive C:\Protected<DIR> (This marks the directory and its contents as In Synch)
Corresponding events: FileSystemMarkStateStartedEvent, FileSystemMarkStateDoneEvent, FileSystemMarkStateCancelledEvent
SystemTask "Verify & Synchronize"
This task is loaded as a result of SystemWork (see above).
Canceling: If the task is recursive then the synchronization status is set to Unchecked. If the task is non-recursive then the files and directories included in the task are placed on the BadState and the synchronization status will be Out of Synch.
Corresponding events: This task does not raise events.
This task is used to synchronize files and directories. It places each file and directory on the BadState and can be for a single node, a list of nodes, or recursive.
Canceling: The synchronization status is unchanged.
This task can be manually loaded using nfclient:
doSync C:\Protected<DIR>
doSyncRecursive C:\Protected<DIR>
Corresponding events: FileSystemSyncStartedEvent, FileSystemSyncDoneEvent, FileSystemSyncCancelledEvent
This task is used to verify and then synchronize files and directories. It only places files and directories on the BadState if they fail verification or synchronization.
Canceling: The synchronization status is unchanged.
This task can be manually loaded using nfclient:
doVerifyAndSync C:\Protected<DIR>
doVerifyAndSyncRecursive C:\Protected<DIR>
Corresponding events: FileSystemVerifySyncStartedEvent, FileSystemVerifySyncDoneEvent, FileSystemVerifySyncCancelledEvent
When a file is being synchronized it is broken into 1MB chunks and copied to the passive machine. If the file fails to synchronize then the whole file is re-synchronized chunk by chunk. FileSections break large files (> 10MB) into sections, so that a 100MB file is broken into ten 10MB sections. Each section is then broken into chunks and copied to the passive. If a section fails to synchronize then only that section of the file is resynchronized.
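The section and chunk layout described above can be sketched as a pair of range calculations. The function names and the tuple layout are hypothetical; the 10MB section length and 1MB chunk size are the values given in the text. A file no larger than one section simply yields a single section covering the whole file.

```python
MB = 1024 * 1024


def ranges(start, length, step):
    """Split [start, start + length) into (offset, size) pieces of at most `step` bytes."""
    end = start + length
    return [(off, min(step, end - off)) for off in range(start, end, step)]


def plan_sync(file_size, section_length=10 * MB, chunk_size=1 * MB):
    """Return [(section, chunks)] where section and each chunk are (offset, size)."""
    return [
        (section, ranges(section[0], section[1], chunk_size))
        for section in ranges(0, file_size, section_length)
    ]


plan = plan_sync(100 * MB)
print(len(plan))        # 10 sections
print(len(plan[3][1]))  # 10 chunks in the fourth section
```

If the fourth section failed to synchronize, only the chunks in `plan[3]` (offsets 30MB up to 40MB) would need to be copied again, rather than the whole 100MB file.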
The FSM has two sets of known configuration values: one set for low bandwidth (< 10Mbit/s) and another for high bandwidth (> 10Mbit/s). The configuration of the FSM can be switched from one to the other using the configuration wizard by enabling or disabling the low bandwidth optimization check box. Please note that each time the OK button is pressed on the configuration wizard the values below are reset, even if the check box selection has not changed. The configuration values are given below:
Low Bandwidth
SectionLength = 1048576 (1MB)
MaxVolumeOfWaitingRequests = 3145728 (3MB)
VerifySize = 100 (100 bytes)
High Bandwidth
SectionLength = 10485760 (10MB)
MaxVolumeOfWaitingRequests = 31457280 (30MB)
VerifySize = 4096 (4KB)
The FSM logs its performance statistics into NFLog.txt. It measures how long it spends in various low-level calls within the FSM, essentially recording how long it takes to verify and synchronize each batch of MaxVerifySyncFiles nodes (defaulted to 1000). This facility is switched on by default and can be switched off using nfclient:
setp NewFileStateMgr LogPerfCounters false
The following list describes counters that may be of interest to the reader.
NO_OF_VERIFY : the number of requests verified in the current batch.
VERIFY_MBYTE : data in mbytes verified in the current batch. Directories and zero length files represent 540 bytes.
NO_OF_SYNC : the number of requests synchronized in the current batch.
SYNC_MBYTE : data in mbytes synchronized in the current batch. Directories and zero length files represent 540 bytes.
VS_WAIT_MSEC : the time spent in msecs by the verify sync thread waiting for requests or responses from the passive server.
VS_WORK_MSEC : the time spent in msecs working by the verify sync thread. For interpretation VS_WORK_MSEC (approx) = VERIFY_MSEC + SYNC_MSEC + FSM_FILE_MSEC + USF_MSEC + RESPONSE_MSEC + WORK_DONE_REPORT_MSEC + WAIT_TO_RETRY_MSEC.
SYSTEM_WORK_MSEC : the time spent in msecs working on SystemWork. This will be some portion of VS_WORK_MSEC, so it cannot be added to or subtracted from VS_WORK_MSEC.
WAIT_TO_RETRY_MSEC : the time spent in msecs waiting to retry nodes. If nodes fail to verify / synchronize then they are retried after a delay.
PROT_STARTUP_MSEC : the time spent in the startupverifysynchronize method in msecs.
PROT_INFO_MSEC : the time spent in the information method in msecs.
PROT_VERIFY_MSEC : the time spent in the verify method in msecs.
PROT_SYNC_MSEC : the time spent in the synchronize method in msecs.
PROT_CLEANUP_MSEC : the time spent in the cleanup method in msecs.
FSM_FILE_MSEC : the time spent constructing the FSM's representation of each of the nodes in msecs.
USF_MSEC : the time spent checking for unprotected features in msecs.
VERIFY_MSEC : the time spent verifying in msecs approx = PROT_STARTUP_MSEC + PROT_INFO_MSEC + PROT_VERIFY_MSEC + PROT_CLEANUP_MSEC.
SYNC_MSEC : the time spent synchronizing in msecs approx = PROT_STARTUP_MSEC + PROT_INFO_MSEC + PROT_SYNC_MSEC + PROT_CLEANUP_MSEC.
EXIST_CHECK_MSEC : the time spent checking nodes that exist on the active exist on the passive. NB the exist check executes in parallel with verification and synchronization so should not be added to other counters during interpretation.
RESPONSE_MSEC : the time spent processing responses from the passive server, i.e. whether the nodes passed or failed to verify / sync.
WORK_DONE_REPORT_MSEC : the time spent reporting work done. This includes adding work to progress objects and raising events to the GUI.
BATCH_SIZE_MBYTE : data in mbytes loaded into the verify synchronizer during this batch. It includes any system work performed in the context of the batch. approx = VERIFY_MBYTE + SYNC_MBYTE.
TIME_TAKEN_SEC : time in secs to verify / sync requests in the batch. approx = (VS_WAIT_MSEC + VS_WORK_MSEC) / 1000.
VS_RATE_MBYTE/SEC : rate in MB/s = BATCH_SIZE_MBYTE / TIME_TAKEN_SEC.
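The relationships between the batch counters can be checked with a short calculation. The function name is hypothetical and the input values are made up purely for illustration; they are not taken from a real NFLog.txt. The conversion from msecs to secs is an assumption implied by the counter names.

```python
def batch_summary(verify_mbyte, sync_mbyte, vs_wait_msec, vs_work_msec):
    """Derive BATCH_SIZE_MBYTE, TIME_TAKEN_SEC and VS_RATE_MBYTE/SEC (approximate)."""
    batch_size_mbyte = verify_mbyte + sync_mbyte
    time_taken_sec = (vs_wait_msec + vs_work_msec) / 1000  # msecs to secs
    rate_mbyte_per_sec = batch_size_mbyte / time_taken_sec
    return batch_size_mbyte, time_taken_sec, rate_mbyte_per_sec


# Illustrative values only: 30MB verified, 10MB synchronized, 3s waiting, 5s working.
print(batch_summary(30, 10, 3000, 5000))  # (40, 8.0, 5.0)
```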
The following counters describe the current task:
TOTAL_NO_OF_VERIFY : the number of requests verified in the task.
TOTAL_NO_OF_SYNC : the number of requests synchronized in the task.
TOTAL_SYNC_MBYTE : data in mbytes synchronized in the task.
TOTAL_VERIFY_MBYTE : data in mbytes verified in the task.
TOTAL_TIME_TAKEN_SEC : time in secs to verify / sync requests in the task.
TOTAL_VS_RATE_MBYTE/SEC : rate in MB/s for the task.
Unprotected Feature Detection
Certain file system features are not protected by Neverfail Heartbeat. If a node has an unsupported feature and it is detected then the node is not verified or synchronized, and the UnsupportedFeatureMgr performs one or a combination of the following actions:
LogAsError : the detection of the feature is logged in NFLog.txt as an error.
MakeOutOfSync : the node is placed on the BadState and AutoSweep is switched off (this prevents a cycle of detection).
RaiseStopEvent : a stop event is raised, which causes replication to stop.
RaiseUserEvent : an event is raised to the user identifying the unsupported feature (the number of user events is limited to MaxOccurences (defaulted to 100) to prevent the log filling).
The UnsupportedFeatureMgr manages the configuration of the unprotected features; the table below describes the actions that are taken when they are detected.
Feature             LogAsError  MakeOutOfSync  RaiseStopEvent  RaiseUserEvent
ExtendedAttributes  true        true           false           true
HardLink            true        false          true            true
ReparsePoint        true        false          true            true
Encryption          true        true           false           true
SparseFile          true        false          false           true
DiskQuota           true        false          false           true
This configuration can be modified using nfclient.
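For readers who prefer to see the table as data, the sketch below encodes it as a Python dictionary and looks up the actions enabled for a feature. The dictionary and function names are hypothetical and this is not how the UnsupportedFeatureMgr stores its configuration; only the true/false values are taken from the table above.

```python
ACTION_NAMES = ("LogAsError", "MakeOutOfSync", "RaiseStopEvent", "RaiseUserEvent")

# Default actions per unprotected feature, copied from the table.
DEFAULT_ACTIONS = {
    "ExtendedAttributes": (True, True, False, True),
    "HardLink": (True, False, True, True),
    "ReparsePoint": (True, False, True, True),
    "Encryption": (True, True, False, True),
    "SparseFile": (True, False, False, True),
    "DiskQuota": (True, False, False, True),
}


def actions_for(feature):
    """Return the names of the actions enabled for an unprotected feature."""
    flags = DEFAULT_ACTIONS[feature]
    return [name for name, enabled in zip(ACTION_NAMES, flags) if enabled]


print(actions_for("HardLink"))  # ['LogAsError', 'RaiseStopEvent', 'RaiseUserEvent']
```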