
Problem

If your system does not automatically import one or more pools at boot as expected, it is usually related to one of a few issues.

Solution

You can review the system log to see what the system attempted at boot time. First, however, we will explain how BrickStor imports pools, since it is designed as a storage appliance and does not automatically scan and import all pools.

Background

By default, a BrickStor system does not attempt to detect pools and automatically import them. While this may seem like a missing feature, it is done this way on purpose. The goal is to reduce or eliminate the risks associated with unintended importing of pools, and the corruption that can occur if a pool happens to be imported and made active on more than one system. Part of the import process is to confirm that it is in fact OK to import a pool, even if that pool is known to the system. This is a readiness check of sorts.

Via the myRack Manager GUI it is possible to import any pool, whether or not it originated on the system performing the import. However, various safety checks will prevent unsafe imports.

When a pool is first created, or explicitly imported (using the GUI, for example), a mechanism on the BrickStor marks the pool as importable. This means that when the system boots, the pool is brought online and its filesystems are activated and shared, assuming the pool is in a state in which it can be brought online. If devices are missing from the pool as a result of reconfiguration, physical relocation, failure of a shelf, etc., the pool will most likely not be in a state where it can be brought online.

There is a counter that tracks import successes and failures for each individual pool. To protect the system from getting stuck in perpetual reboot loops, which can happen if a pool is damaged in a way that triggers system panics, these counters are consulted each time the system handles importing of pools, which normally only happens at boot. If a non-fatal problem occurs during pool import, such as a failure to export a dataset via NFS, the counter is not incremented. But if components of a pool are missing, say disks are gone because one of the shelves is offline or disconnected, import will be attempted once, the failure will be detected, and the pool's fault counter will be incremented. The system may panic at this point as a safety valve, dumping memory and creating a crash dump (which RackTop engineers will later need to analyze) in the process, then reboot. This is a desired result, even though it may seem somewhat severe.

Once the counter is incremented, the next time the system comes back up, which normally happens a few moments after the panic, that pool will no longer be considered importable and will be skipped by the import mechanism. There may be other, less critical conditions that result in the same outcome. Even after the underlying problem is observed and corrected, manual intervention is required on the system.
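
You can inspect the per-pool counter yourself. As an illustrative sketch, the command below reads the same racktop:poolimport:p01:last_import_good property that is used later in this article (it is stored as a user property on the bp pool); substitute your own pool name for p01 if it differs. Interpreting values other than the reset value shown later is specific to BrickStor's implementation.

# zfs get -H -o value racktop:poolimport:p01:last_import_good bp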

As a customer, you will typically observe that the system has rebooted and that one or more pools are no longer online. If this defensive mechanism is triggered, it usually means that attention from someone at RackTop is necessary.

Why Do Imports Sometimes Fail?

Pool imports fail for many reasons, and it is not always due to pool corruption. Most often, especially after a power outage event or events, one or more parts of the overall storage system is malfunctioning. For example, with systems that have multiple shelves, it is not uncommon for one shelf, or the power supplies in that shelf, to be plugged into a circuit that is not restored at the same time as the other circuits, so as the system powers back up, one or more shelves are no longer visible to it. Pools not confined to a single shelf could therefore appear faulted to the system, because a non-trivial number of drives that are part of the pool will be missing or assumed damaged. The system will attempt to bring the pool(s) online, and of course with an entire shelf, or multiple shelves, missing it will fail to do so, at which point the failure counter will be incremented.
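
If you suspect missing shelves or drives, a quick way to see what the operating system can currently find is a read-only scan for importable pools. Running zpool import with no arguments only scans and reports; it does not import anything, so it is safe while investigating. Pools with missing devices are typically reported in a DEGRADED or UNAVAIL state along with the devices that cannot be opened.

# zpool import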

Something similar happens when, after a service outage, a system with a number of shelves is not powered up in the proper order. What can happen is that the controller node or nodes are powered on, and the shelves are powered up only as the controllers begin to boot. At the time the system scans for disks, only some of the shelves are fully initialized, which may not be immediately obvious, but the import process will fail in a similar way. Even when everything appears to be OK later, perhaps after another reboot initiated by the system administrator in the hope that it will clear whatever the problem is, the system will not attempt to import the pool again because the counter was incremented previously.
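
Before retrying an import, it helps to confirm that the system actually sees all of the expected disks. On an illumos-based appliance such as BrickStor, one common (if somewhat blunt) way to list the disks the OS can currently enumerate is to run the format utility non-interactively. If the disk count is lower than expected, one or more shelves are likely still offline or not fully initialized, and any import attempt will fail in the manner described above.

# format </dev/null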

Reviewing The Logs

To get a little bit of insight into what might be going on, we should look at the log for the service responsible for handling the import of pools. Below we open the log file using the less pager. There will most likely be a lot of data in this log from previous imports, so jump to the end of the file and review the last several lines. We are likely to observe something similar to the output below. Again, this assumes p01 as the pool name, and your results may be slightly different. If we observe a mention of "Import Safety Check Failed, will not attempt import.", we know right away that we had a problem with the pool which the system could not automatically correct.

# less `svcs -L poolimport`

... lines skipped ...
[INFO] Importing pool p01
[INFO] Import Safety Check Passed, it is OK to proceed with pool import.
[INFO] All previously imported pool(s) re-imported successfully.
[ Jan 29 12:20:57 Method "start" exited with status 0. ]
[ Jan 29 16:16:37 Executing start method ("/lib/svc/method/svc-poolimport start"). ]
[INFO] Importing pool p01
[ERROR] Import Safety Check Failed, will not attempt import. Please check pool health.
[ERROR] Some pools did not import correctly.
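
The poolimport SMF service itself can also be checked directly. The service name here is taken from the svcs -L poolimport command above; svcs -xv summarizes the service state and points at the same log file if the service ended up in maintenance:

# svcs -xv poolimport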

If the pool is online and not faulted, it most likely means we need to reset the safe-to-import counter using the following command:

# zfs set racktop:poolimport:p01:last_import_good=0 bp
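
To double-check that the property took effect before rebooting, you can read it back with the standard zfs get command (the value shown should match what was just set):

# zfs get racktop:poolimport:p01:last_import_good bp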

After a reboot, we should expect the pool to be online. We can confirm this easily with the following command. In this case we have two pools online and imported: bp, which is for the system's internal use, and p01.

# zpool list -Ho name
bp
p01
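
As an additional check, zpool status -x reports only pools with problems and prints a short all-healthy message when every imported pool is in good shape (exact wording may vary by release):

# zpool status -x
all pools are healthy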
