  1. If any systems are still trying to access the SAN and failing because of the pool service outage, shut them all down. We need to reboot the SAN, but before we do, we should make sure that anything the SAN depends on for name resolution, domain membership, and the like is online and working, if possible.
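
    Before powering anything down, it can be worth verifying from the head that name resolution and the domain are reachable; the hostnames below are placeholders, substitute your own DNS server or domain controller.

    Code Block
    languagebash
    # getent hosts dc1.example.com    # placeholder name; succeeds only if DNS resolution is working
    # ping dc1.example.com            # placeholder name; prints "dc1.example.com is alive" if reachable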
  2. Power off the head unit (controller node). If the OS does not respond to shutdown requests, unplug the unit.
  3. Power down all the shelves as well. With the head still powered off, turn the shelves back on one at a time. Once the shelves are online and have had a chance to POST, which should take no more than about 20 to 30 seconds from the flip of the power switch to being ready, power up the head.
  4. Power up the head and wait for it to boot. Depending on the state of the pool(s), some pools may or may not come back online.
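
    As a quick sanity check once the head is up, we can confirm that the shelves and their disks enumerated; this is a generic illumos disk listing, not specific to this procedure.

    Code Block
    languagebash
    # echo | format    # lists every disk the head can see; the shelf disks should all appear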
  5. If any pools appear imported but faulted, as reported by zpool list (an unlikely event), power off the head unit, unplug the SAS cables from the storage, reboot the head, and then plug the storage back in.

    Code Block
    languagebash
    # zpool list
  6. At this point we need to mark the faults as repaired. It is a good idea to review all of the faults, because some reports may concern other parts of the system; there is always a chance of hardware failure during climate events such as high temperature. Use fmadm faulty | less to get a pageable list of faults and review them carefully. We should see a number of faults having to do with disk drives going offline. There are likely to be ZFS-related faults as well, which are a direct result of drives becoming unavailable.

    Code Block
    languagebash
    # fmadm faulty | less
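
    The summary form of the same command prints one fault per line, which is the output the repair loop in the next step parses.

    Code Block
    languagebash
    # fmadm faulty -s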
  7. Once we have a good feel for the things reported as faulted, we should mark them repaired, which will allow the disks to become visible again. The following loop should do what we want in one shot:

    Code Block
    languagebash
    # for fault in `fmadm faulty -s | tail +4 | awk '{print $4}'` ; do fmadm repair $fault ; done
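
    After the loop finishes, running fmadm faulty again should produce no output; anything still listed failed to repair and deserves a closer look before proceeding.

    Code Block
    languagebash
    # fmadm faulty    # expect no output once all faults are marked repaired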
  8. Check the state of the pools without importing them by running zpool import. Without any arguments, zpool import only shows which pools are available to import. We should now see all of the pools the system had online before, and they should hopefully report healthy.

    Code Block
    languagebash
    # zpool import
  9. If the pools appear to be OK, import them one by one with the following command, replacing poolname with the actual name of each pool:

    Code Block
    languagebash
    # zpool import -o cachefile=/etc/.zpools/spa_info poolname
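
    As each pool imports, zpool status -x is a quick health check; it reports the pool as healthy in a single line and prints full detail only when something is wrong.

    Code Block
    languagebash
    # zpool status -x poolname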
  10. Reset the pool import failure count to 0 so that the pool(s) will be imported automatically the next time the system boots. For each pool on the system, run the following command, replacing <pname> with the actual pool name. For example, for a pool named p01, the command is zfs set racktop:poolimport:p01:last_import_good=0 bp.

    Code Block
    languagebash
    # zfs set racktop:poolimport:<pname>:last_import_good=0 bp
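
    With several pools, a small loop saves retyping. This sketch assumes bp is the boot pool and does not need an entry of its own; adjust the grep if that is not the case on your system.

    Code Block
    languagebash
    # for p in `zpool list -H -o name | grep -v '^bp$'` ; do zfs set racktop:poolimport:${p}:last_import_good=0 bp ; done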
  11. Once the pools import, and assuming the other parts of the infrastructure are online, check the system log and confirm that the datasets are accessible. At this point you should be ready to restart the services in the infrastructure that depend on the SAN.
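
    A minimal way to check, assuming the standard illumos log location: tail the system log for I/O or ZFS errors and confirm that the expected datasets are mounted.

    Code Block
    languagebash
    # tail -50 /var/adm/messages    # look for I/O or ZFS errors logged since the imports
    # zfs mount                     # lists every mounted ZFS dataset; the expected datasets should appear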
