HybriStor Common Maintenance Practices, Notable Errors, Useful Commands and Information

This document describes common scenarios seen with the HybriStor and the ways to correct them.
Each scenario is given a severity level and a short description of its impact on the appliance.
It also gives an overview of specific HybriStor services and useful commands that can assist in supporting the appliance.

For information not covered in this document, please refer to https://support.neverfail.com/hc/en-us/sections/207361627-HybriStor, which contains links to KB articles and other best practice documents.
If any other issues occur, please escalate to Neverfail HybriStor support.

List of Core HybriStor Daemon Services

statelocation - This daemon service is in charge of handling states of internal HybriStor objects. It is the most essential daemon and must be running for most other daemons to be active.

inlinecache - This daemon service performs inline, in-memory deduplication. Data written to an active shared filesystem is checked in memory for duplicate blocks, so duplicate data is not written to the data back end.

postdedupe - This daemon service is the heart of the HybriStor appliance. A pass is triggered either by the amount of data ingested or by the time elapsed. When a pass starts, it checks all data written or deleted since the last pass, cleans up duplicated data, and stages data for discharge from the appliance. It also creates recovery points on OBS backed appliances for disaster recovery.

discharge - This daemon service handles the processing of deletions from the appliance. Once the postdedupe daemon has determined data is cleared for full deletion, the discharge daemon handles the removal of the specific data blocks.

history - This daemon service handles all statistics gathering and tracking. It uses sqlite3 databases to track the statistics history for all daemons.
A program called 'hsdr-historyextract' can be used to show output from the databases in a readable form.

replication-source - This daemon service handles the replication process for the appliance as a source. When replication jobs are created to replicate out to a target HybriStor appliance, this daemon controls all the processes.

replication-target - This daemon service handles the replication process for the appliance as a target. When a replication job targets the appliance, this daemon controls all the processes.

Common Useful Commands From HybriStor CLI

hsdr-stat - This command shows the status of all core HybriStor daemons and active shared filesystems. It also shows the process ID, start time, and active running time of each daemon service.

hsdr-pdu -n - This command forces postdedupe to run a single pass. Forcing a pass is useful for discharging data from an appliance faster, or for processing data before the size or time trigger is reached.

hsdr-collect-logs PATHTOSTOREZIP FILENAMEOFZIP - This command line tool collects all necessary support logs from the HybriStor appliance and ZIPs them at the path and filename you specify.

hsdr-license-activation - This command line tool is used for everything related to HybriStor licensing on the appliance. It is needed to get the system code for offline licensing, and it can be used to refresh or override the license key on an appliance or to query a license's capacity.

hsdr-networking-setup - This command line tool walks you through interactive steps that allow you to modify networking on connected NICs. It currently does not allow for modification of bonds.

hsdr-datagen - This command line tool allows you to internally write data to a specified location. It is used mostly for testing and development. Make sure you understand how it works before using it.

hsdr-historyextract - As mentioned above for the history daemon, this command line tool allows you to extract statistical history information in a human readable format.

systemctl {stop/status/start/restart} DAEMONNAME.service - These commands allow you to stop, check the status, start, or restart the core daemon service. They are needed when a daemon has failed or needs to be examined.

systemctl {stop/status/start/restart} srv-share-SHARENAME.service - These commands allow you to stop, check the status, start, or restart a shared filesystem. They are needed when a shared filesystem has failed or needs to be examined.

Notable Errors Requiring Immediate Attention

[ERROR] Hybristor Shared Filesystem 'SHARENAME' has failed.

This error will occur when a HybriStor shared filesystem has gone offline for an unknown reason. Please follow scenario below: Shared Filesystem has Failed or Crashed

[ERROR] Hybristor Core Daemon 'COREDAEMON' has failed

This error will occur when a HybriStor core daemon service has gone offline for an unknown reason. 
If the inlinecache daemon service was the failed daemon service then you should follow all steps under: Inlinecache Daemon Service has Failed or Crashed.
If it occurs for any other core daemon service follow the steps under: Core Daemon Service has Failed or Crashed.

[ERROR] ../../../hummingbird/include/LogTracker.h:58 Not Enough Disk Space: The available space: 63762563072, for: /srv/dedupe/data/1, is less than the recommended threshold: 64424509440, preventing future writes

This error will occur when a physical HybriStor appliance using a disk based storage back end has reached full capacity. This check is in place to stop the appliance from overfilling and causing issues with deletion.
This is fully expected when a unit is at capacity. To correct it, follow the steps under scenario: HybriStor Physical Appliance Reaches Full Capacity.
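The figures in the message above can be pulled out and compared with a few lines of Python. This is only a sketch: the pattern is written against the single sample message quoted in this document, the function name is made up for illustration, and it assumes both figures are in bytes (64424509440 is exactly 60 * 1024^3), which the message itself does not state.

```python
import re

# Pattern written against the sample "Not Enough Disk Space" message only.
DISK_SPACE_RE = re.compile(
    r"Not Enough Disk Space: The available space: (?P<available>\d+), "
    r"for: (?P<path>\S+), "
    r"is less than the recommended threshold: (?P<threshold>\d+)"
)

GIB = 1024 ** 3  # assumption: the logged figures are bytes


def parse_disk_space_error(line):
    """Return (path, available, threshold, shortfall), or None if the line does not match."""
    match = DISK_SPACE_RE.search(line)
    if match is None:
        return None
    available = int(match.group("available"))
    threshold = int(match.group("threshold"))
    return match.group("path"), available, threshold, threshold - available


if __name__ == "__main__":
    sample = (
        "[ERROR] ../../../hummingbird/include/LogTracker.h:58 Not Enough Disk Space: "
        "The available space: 63762563072, for: /srv/dedupe/data/1, "
        "is less than the recommended threshold: 64424509440, preventing future writes"
    )
    path, available, threshold, shortfall = parse_disk_space_error(sample)
    print(f"{path}: {available / GIB:.2f} GiB available, "
          f"threshold {threshold / GIB:.2f} GiB, short by {shortfall} bytes")
```

The shortfall tells you roughly how much data has to be discharged before writes are accepted again.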

[ERROR] ../../../hummingbird/include/LogTracker.h:58 License over capacity, preventing future writes

This error will occur when a virtual HybriStor appliance using an object storage back end has reached its license capacity. This check is in place to stop the appliance from ingesting data past its allowed capacity.
This is fully expected when the unit hits its license capacity. To correct it, follow the steps under scenario: HybriStor Virtual Appliance Reaches License Capacity.

Common Maintenance Scenarios

Shared Filesystem has Failed or Crashed

Severity Level: High

Possible Cause:

A HybriStor shared filesystem can fail for a number of reasons, such as a memory issue or a critical error while writing. Although most of these are rare, it is important that the failure is corrected quickly.

Error Message/Code:

hsdr-ui[30132]: [ERROR] Hybristor Shared Filesystem 'SHARENAME' has failed. Please contact HybriStor support for assistance.

Impact to appliance:

When a HybriStor shared filesystem has gone offline the impact is very high to the appliance. During this time no writes or reads can be made from this share. Customers will see job failures on their end. So it is very important that this is corrected as soon as possible.

Correction Steps:

First, simply attempt to restart the shared filesystem service. Sometimes that is all that is needed.

Scenario 1

a) Run command from CLI 'systemctl restart srv-share-SHARENAME.service'.

b) We receive no error response from the CLI. Now run command 'hsdr-stat' and confirm the shared filesystem service is running. Use screenshot to determine.

Scenario 2

a) Run command from CLI 'systemctl restart srv-share-SHARENAME.service'.

b) We receive no error response from the CLI, but 'hsdr-stat' shows that the shared filesystem service is not running. Use screenshot to determine.

c) Run command 'systemctl status srv-share-SHARENAME.service'. This will show any messages sent by the daemon upon failure to start. Check the existing KB articles for relevant material, or escalate the issue to Neverfail HybriStor support for more help.
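The restart-then-verify flow in the two scenarios above can be sketched in Python as follows. This is an illustration under stated assumptions, not a supported tool: it uses 'systemctl is-active' for the "is it running" check instead of 'hsdr-stat', because the output format of 'hsdr-stat' is not described in this document, and the function names are invented for the example.

```python
import subprocess


def run(cmd):
    """Run a command and return (exit_code, combined stdout and stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def restart_and_check(unit, runner=run):
    """Restart a systemd unit and report which scenario applies.

    Returns ("running", "") when the restart succeeds and the unit is active
    (Scenario 1), otherwise ("failed", status_text) where status_text is the
    'systemctl status' output to check against KB articles (Scenario 2).
    """
    restart_code, _ = runner(["systemctl", "restart", unit])
    # 'systemctl is-active --quiet' exits with 0 only when the unit is active.
    active_code, _ = runner(["systemctl", "is-active", "--quiet", unit])
    if restart_code == 0 and active_code == 0:
        return "running", ""
    _, status_text = runner(["systemctl", "status", unit, "--no-pager"])
    return "failed", status_text


# Example call on an appliance (SHARENAME is a placeholder for the real share name):
#   state, details = restart_and_check("srv-share-SHARENAME.service")
```

The runner parameter exists only so the logic can be exercised without touching a real appliance. The same function applies to a core daemon unit such as postdedupe.service.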

Inlinecache Daemon Service has Failed or Crashed

Severity Level: Moderate

Possible Cause:

As shown in the first error example below, the most likely reason for the crash is an out of memory condition. Because most HybriStor appliances are now virtual, we try to keep their memory allocation as low as we can.
The inlinecache daemon can occasionally grow to use a large amount of memory as the amount of data stored on the appliance grows. If the unit reaches its memory limit, the Linux kernel will kill a process to free memory, typically the one using the most.
There are, however, other reasons the daemon could crash. If memory was not the direct cause, Correction Step 1 can be skipped.

Error Message/Code:

[ERROR] Hybristor Core Daemon 'inlinecache' has failed. Please contact HybriStor support for assistance. -- /var/log/messages -- 
kernel: Out of memory: Kill process 1571 (hsdr-inlinecach) score 236 or sacrifice child
kernel: Killed process 1571 (hsdr-inlinecach) total-vm:9277404kB, anon-rss:7774132kB, file-rss:2476kB
systemd: inlinecache.service: main process exited, code=killed, status=9/KILL
systemd: Unit inlinecache.service entered failed state.

[ERROR] Hybristor Core Daemon 'inlinecache' has failed. Please contact HybriStor support for assistance. -- /var/log/messages -- 
hsdr-filesystemd[30938]: [ERROR] ../../../hummingbird/include/InlineCacheClient.h:88 Read response failed: End of file

hsdr-filesystemd[30938]: [ERROR] ../../../hummingbird/include/InlineCacheClient.h:88 Read response failed: End of file
hba-0436 systemd: inlinecache.service: main process exited, code=killed, status=9/KILL
hba-0436 systemd: Unit inlinecache.service entered failed state.
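To decide which of the two examples above you are looking at, the log lines can be scanned for the kernel's "Killed process" message. The sketch below is written against the sample lines in this document only, and its function names are illustrative. It matches on 'hsdr-inlinecach' because the kernel truncates process names to 15 characters, as the sample shows.

```python
import re

# Matches the kernel's OOM kill line as it appears in the first example above.
OOM_KILL_RE = re.compile(r"kernel: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def oom_killed_processes(log_lines):
    """Return (pid, name) for every process the kernel reports as killed."""
    killed = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            killed.append((int(match.group("pid")), match.group("name")))
    return killed


def inlinecache_was_oom_killed(log_lines):
    """True if any killed process name starts with the truncated inlinecache name."""
    return any(name.startswith("hsdr-inlinecach")
               for _, name in oom_killed_processes(log_lines))


if __name__ == "__main__":
    sample = [
        "kernel: Out of memory: Kill process 1571 (hsdr-inlinecach) score 236 or sacrifice child",
        "kernel: Killed process 1571 (hsdr-inlinecach) total-vm:9277404kB, anon-rss:7774132kB, file-rss:2476kB",
        "systemd: inlinecache.service: main process exited, code=killed, status=9/KILL",
    ]
    print(inlinecache_was_oom_killed(sample))
```

If the function returns False for the relevant lines of /var/log/messages, the crash was probably not memory related and Correction Step 1 does not apply.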


Impact to appliance:

HybriStor is a hybrid deduplicating appliance. It uses the inlinecache daemon to recognize duplicate data in memory, and then the postdedupe daemon to perform the final deduplication.
When the inlinecache daemon is down, all writes go through in full and postdedupe has to perform the entire deduplication. There is no inherent danger in this, as data is written properly and will be deduplicated during postdedupe.
However, it will reduce write performance into the appliance, take up more data storage space, and cause postdedupe passes to take longer.

Correction Steps:

1)  This step should only be taken if the inlinecache daemon failed or crashed due to the first error example for out of memory kill.

We need to edit the allowed size of the inlinecache daemon which is determined by a setting in our configuration file. There are two factors that can help us with this decision.

a) If you have access to the HybriStor GUI, login and proceed to Dashboard page. On the left side there will be a graph called Potential Cache Hit Ratio. Click this graph to reveal a more detailed larger graph.
From here hover your mouse over until you find the value that reads 'Covers 70% at'. Note that GiB size.

b) Run command from CLI 'free' and note the available memory value. Provided is an example which is highlighted. By default 'free' reports values in kibibytes, so dividing this number by 1000000 gives the approximate free memory in GiB.
Take that amount and subtract 5. Note this GiB size.

c) Use the following logic to get your new inlinecache value (1073741824 is the number of bytes in 1 GiB):

if (No GUI Access in step a)

      inlinecache memory = value(b) * 1073741824

else if (value(a) <= value(b))

      inlinecache memory = value(a) * 1073741824

else

      inlinecache memory = value(b) * 1073741824

d) Now that we have the proper value we need to edit the proper setting.

i) Run command from CLI 'vim /etc/hsdr/settings.conf'
ii) Search for line highlighted in screenshot
iii) Move down to this line, press 'i' or 'Insert' to enter insert mode, and change the value after the equal sign to the value calculated above.
iv) Press Esc, then type ':wq' and press Enter to save the changes and exit.

e) Now perform the following three commands:

systemctl stop postdedupe.service
systemctl restart postdedupe.service
hsdr-pdu -n

2) Now we need to restart the inlinecache daemon

Scenario 1

a) Run command from CLI 'systemctl restart inlinecache.service'

b) We receive no error response from CLI. Now run command: 'hsdr-stat' and confirm inlinecache daemon is running. Use screenshot to determine.

Scenario 2

a) Run command from CLI 'systemctl restart inlinecache.service'

b) We do receive an error response from CLI. Refer to screenshot to confirm.

c) Run command from CLI 'systemctl status inlinecache.service'

d) Refer to screenshot. If the highlighted section reports 'Address already in use', continue with the next steps.
     If the reason for failure does not match this, check the existing KBs for information or escalate to Neverfail HybriStor support.

e) Edit inlinecache daemon port value

i) Run command from CLI 'vim /etc/hsdr/settings.conf'
ii) Search for line highlighted in screenshot
iii) Move down to this line, press 'i' or 'Insert' to enter insert mode, and change the value after the equal sign to: current value + 1.
iv) Press Esc, then type ':wq' and press Enter to save the changes and exit.

f) Now run command from CLI 'systemctl restart inlinecache.service'

g) We receive no error response from CLI. Now run command: 'hsdr-stat' and confirm inlinecache daemon is running. Use screenshot to determine.

h) Finally, run the command 'pkill -HUP -f hsdr-filesystemd', which sends a HUP signal to the hsdr-filesystemd processes.

3) At this point the inlinecache daemon should be running, and after the next postdedupe pass its memory use should stay within the configured limit.

a) There is a small chance that the inlinecache daemon will crash again during a very long postdedupe pass.
If so, run command 'systemctl restart inlinecache.service'.
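The sizing logic from Correction Step 1 (steps a to c) can be sketched in Python. This is only an illustration of the arithmetic described above: the function and parameter names are made up, the final rounding to a whole number of bytes is an assumption, and the name of the setting in /etc/hsdr/settings.conf is not reproduced here because it appears only in the screenshot.

```python
GIB = 1073741824   # bytes per GiB, the multiplier used in step c)
HEADROOM_GIB = 5   # amount subtracted from free memory in step b)


def inlinecache_memory_bytes(available_kib, covers_70_gib=None):
    """Compute the new inlinecache memory value.

    available_kib: the 'available' figure reported by 'free' (kibibytes).
    covers_70_gib: the 'Covers 70% at' size from the GUI in GiB, or None
                   when there is no GUI access.
    """
    # Step b): approximate free GiB, minus 5 GiB of headroom.
    value_b = available_kib / 1000000 - HEADROOM_GIB
    if value_b <= 0:
        raise ValueError("not enough free memory to size the inlinecache daemon")
    # Step c): use the smaller of the two values, or value(b) without GUI access.
    chosen = value_b if covers_70_gib is None else min(covers_70_gib, value_b)
    return int(chosen * GIB)


if __name__ == "__main__":
    # Hypothetical appliance: 'free' shows 16000000 available, GUI shows 8 GiB.
    print(inlinecache_memory_bytes(16000000, covers_70_gib=8))
```

With no GUI access, call the function with only the 'free' figure and it falls back to value(b).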

Core Daemon Service has Failed or Crashed

Severity Level: High

Possible Cause:

A HybriStor core daemon service can fail for a number of reasons, such as a memory issue or a critical error while writing. Although most of these are rare, it is important that the failure is corrected quickly.

Error Message/Code:

hsdr-ui[30132]: [ERROR] Hybristor Core Daemon 'DAEMONSERVICE' has failed. Please contact HybriStor support for assistance.

Impact to appliance:

The impact varies considerably depending on which core daemon service has failed or crashed. The daemon service descriptions above show how each one works with the appliance and what its failure affects. Either way, a failure is usually a sign of a real issue and needs to be corrected quickly.

Correction Steps:

First, simply attempt to restart the core daemon service. Sometimes that is all that is needed.

Scenario 1

a) Run command from CLI 'systemctl restart COREDAEMON.service'.

b) We receive no error response from the CLI. Now run command 'hsdr-stat' and confirm the core daemon service is running.

Scenario 2

a) Run command from CLI 'systemctl restart COREDAEMON.service'.

b) We receive no error response from the CLI, but 'hsdr-stat' shows that the core daemon service is not running.

c) Run command 'systemctl status COREDAEMON.service'. This will show any messages sent by the daemon upon failure to start. Check the existing KB articles for relevant material, or escalate the issue to Neverfail HybriStor support for more help.

HybriStor Virtual Appliance Reaches License Capacity

Severity Level: High

Possible Cause:

The HybriStor virtual appliance has reached its license capacity. The customer needs to either delete data or update retention policies so that less data and smaller chains are stored, or purchase a higher license capacity tier.

Error Message/Code:

[ERROR] ../../../hummingbird/include/LogTracker.h:58 License over capacity, preventing future writes.

Impact to appliance:

When this error first occurs, it will immediately block any new incoming writes to the HybriStor appliance. During this time any new jobs that are attempting to write to the appliance will fail. However all reads and deletes to the appliance will be successful.

Correction Steps:

There are two solutions to correcting this issue.

Scenario 1: Customer does not want to purchase a higher license capacity tier.

a) Inform customer they have reached their license capacity on the appliance and they must remove data before new data can be written.

It is important to educate the customer on best practices if using a tool such as Veeam. They can update retention policies to store smaller chains and less data to fit into this appliance.

b) To expedite the process, once data has been removed run command from CLI 'hsdr-pdu -n' to force a postdedupe pass.

Repeat step b) until the discharge process has kicked in. It may take up to 3 postdedupe passes before data is processed into discharge.

Scenario 2: Customer wants to purchase a higher license capacity tier.

a) Put customer in contact with Sales to handle purchase of higher tier.

b) Once purchased, the HybriStor Appliance record in Salesforce must be updated to match the newly purchased license capacity.

c) The HybriStor license will be automatically updated on the license server back end.

d) The HybriStor virtual appliance will automatically pick up the license change.

i) To expedite the process, run command from CLI 'hsdr-license-activation -r -b LICENSEKEY', LICENSEKEY being the HybriStor license key that was updated.

ii) Finally, run command 'systemctl restart postdedupe.service'. This updates the postdedupe daemon service with the new information.

HybriStor Physical Appliance Reaches Full Capacity

Severity Level: High

Possible Cause:

The HybriStor physical appliance has reached full capacity due to storage limits on its data back end. Most likely the customer needs to update their retention policies to store less data and smaller chains, or they will need an expansion shelf for more capacity.

Error Message/Code:

[ERROR] ../../../hummingbird/include/LogTracker.h:58 Not Enough Disk Space: The available space: 63762563072, for: /srv/dedupe/data/1, is less than the recommended threshold: 64424509440, preventing future writes.

Impact to appliance:

When this error first occurs, it will immediately block any new incoming writes to the HybriStor appliance. During this time any new jobs that are attempting to write to the appliance will fail. However all reads and deletes to the appliance will be successful.

Correction Steps:

There are two solutions to correcting this issue.

Scenario 1: Customer does not want to purchase an expansion shelf for the appliance.

a) Inform customer they have reached their storage limit on the appliance and they must remove data before new data can be written.

It is important to educate the customer on best practices if using a tool such as Veeam. They can update retention policies to store smaller chains and less data to fit into this appliance.

b) To expedite the process, once data has been removed run command from CLI 'hsdr-pdu -n' to force a postdedupe pass.

Repeat step b) until the discharge process has kicked in. It may take up to 3 postdedupe passes before data is processed into discharge.

Scenario 2: Customer purchases an expansion shelf for the appliance.

a) The expansion shelf has been purchased and installed onto the physical appliance.

b) We now need to run our command line expansion tool so that the new space is added to the HybriStor appliance and data is migrated.

i) First run command from CLI 'chmod a+x /usr/bin/hsdr-expand-capacity' to allow execution of the expansion tool.

ii) Run command 'hsdr-expand-capacity -a'. If the expansion shelf was properly installed, we will see an available drive, as shown in the screenshot, with the proper size associated with it.

If the expansion shelf is a JBOD or another type of hardware RAID set, it should appear as one disk.

iii) Stop all HybriStor daemons and shared filesystems using command 'systemctl stop hsdr-mounts.target' and wait until it completes.

iv) Run command 'hsdr-expand-capacity -l data -r hardware -d /dev/sdb' *Note* the -d should correspond to the disk shown in the above screenshot.

v) Follow the interactive steps to complete the process. It may take some time to migrate data onto the new disks.

c) We must edit a configuration file to raise the data storage threshold used by future monitoring checks.

i) Run command from CLI 'vim /etc/hsdr/system.conf'
ii) Search for line highlighted in screenshot
iii) Move down to this line, press 'i' or 'Insert' to enter insert mode, and change the value after the equal sign to: current value + (size of the installed expansion shelf in kilobytes).

aa) Run the command 'fdisk -l | grep -e "/dev/sdb"', where /dev/sdb is the drive name used in the above expansion. Get the value highlighted in screenshot and divide it by 1024.

bb) For example, for a disk of 1099511627776 bytes (1 TiB) the equation is (1099511627776 / 1024) = 1073741824.

cc) Add the calculated value to the current value on the threshold line above in the screenshot. (64424509440 + 1073741824) = 65498251264.

dd) Finally, replace the value after the equal sign with the newly calculated value.

iv) Press Esc, then type ':wq' and press Enter to save the changes and exit.

d) Reboot the appliance to complete the expansion.
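The threshold arithmetic in step c) can be checked with a short Python sketch. It reproduces the worked example from steps bb) and cc) only. The function name is illustrative, and the name of the threshold setting in /etc/hsdr/system.conf is not shown because it appears only in the screenshot.

```python
def new_threshold(current_value, shelf_size_bytes):
    """Return the current threshold plus the shelf size divided by 1024 (steps aa to cc)."""
    return current_value + shelf_size_bytes // 1024


if __name__ == "__main__":
    # Values from the worked example: a 1099511627776 byte disk and a
    # current threshold of 64424509440.
    print(new_threshold(64424509440, 1099511627776))
```

Substitute the size reported by 'fdisk -l' for your own expansion disk and the current value from the configuration file.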