SAN problems of 27th March 2014
Following the unplanned power cut on Tuesday, one of our SAN machines (ifevo4) started reporting a problem with the flash cache memory in one of its controllers. The machine has two controllers, A and B, for redundancy, and each controller has two fibre channel (FC) connections, one to each of our fabrics (networks). This means that should one controller fail, the other will take over its duties and service will remain uninterrupted.
After we reported the fault with controller A, our supplier shipped a replacement controller to swap in for the faulty one. To minimise the length of time the ifevo4 was running in a degraded state, we decided to replace the controller on Thursday after 5pm. Due to the redundancy, this should not have caused any problems for the running service.
The redundancy only fully works if the client machines (our file servers) are configured to use multiple paths to the FC connections on both controllers. I assumed they were (but didn't actually check), as that's how it should be, and because we'd had a separate fabric failure a week previously, during which all the servers continued to work via the one remaining path/fabric without any issue. Unfortunately, not all the volumes on the ifevo4 were as fully redundant as they should have been. In some cases the volumes were only accessible via controller A, not via both A and B: a configuration error that had probably gone unnoticed since November 2013.
So when I removed controller A, the volumes that were only accessible via that controller became inaccessible to the servers mounting them, causing problems for anyone trying to access data on those volumes. As it is generally group file space that is mounted from the SAN (home volumes are on disks local to the servers), not many people noticed at this point.
Unfortunately, reattaching the failed volumes (once controller A had been replaced) typically means checking the consistency of (salvaging) all the data on the server, during which time the file server will not serve any files, even those unaffected by the loss of controller A. As our file servers have several terabytes of data to check, this means no access to any files for a couple of hours.
To give people a chance to finish anything they were working on, I mailed out to explain that I'd reattach, and salvage, the affected volumes at 8pm. As it turned out, after rebooting the servers at 8pm, I was able to salvage the volumes individually without affecting the availability of the working volumes. So, apart from a five minute break at about 8pm, file access continued to work. Over the next couple of hours the volumes affected by the controller A replacement gradually came back on-line, and most files were back by 10:30pm.
The reason that some of the volumes were incorrectly configured to use only one controller is unknown. The most likely explanation is that they were all on the JBOD part of ifevo4. The JBOD is an expansion unit containing just extra disks. It was previously attached to an older version of the SAN hardware (ifevo2), which also had dual controllers and multiple FC connections to our fabrics. Back in November 2013 we shut down ifevo2, disconnected the JBOD, and attached it to the new ifevo4. At that point everything seemed to be working fine: the file servers just continued to access the volumes from their new location, and multiple paths were available to the data, so we appeared to have redundancy. I suspect this is where the problem was introduced; had we looked more closely, we would have seen that although we had multiple paths via our two different fabrics, they all went to a single controller.
Since the problem on Thursday, all the paths have been checked and updated where necessary, to make sure there are paths to both controllers on our ifevo3 and ifevo4. And should we need to change a controller again in future, we will double-check that those paths are still in place before doing so.
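For the curious, a check along these lines can be scripted on the file servers themselves. The sketch below is illustrative rather than our actual procedure: it assumes the standard Linux multipath/sysfs layout (device-mapper devices under /sys/block, FC target WWPNs under /sys/class/fc_transport), and the controller port WWPNs listed are hypothetical placeholders, the real values come from the SAN's own configuration.

    #!/usr/bin/env python3
    """Flag multipath devices that can only reach one SAN controller.

    A minimal sketch, not our production check: the controller port WWPNs
    below are hypothetical placeholders, and the standard Linux sysfs
    layout for device-mapper and fibre channel devices is assumed.
    """

    import glob
    import os

    # Hypothetical WWPNs of the FC target ports on each controller.
    # The real values come from the SAN's own configuration.
    CONTROLLER_PORTS = {
        "A": {"0x21000024ff000001", "0x21000024ff000002"},
        "B": {"0x21000024ff000003", "0x21000024ff000004"},
    }

    def controllers_for(dm):
        """Return the set of controllers reachable from one dm-* device."""
        seen = set()
        for sd in os.listdir(f"/sys/block/{dm}/slaves"):
            # /sys/block/sdX/device resolves to .../H:C:T:L for SCSI disks
            hctl = os.path.basename(os.path.realpath(f"/sys/block/{sd}/device"))
            target = ":".join(hctl.split(":")[:3])  # host:channel:target
            wwpn_file = f"/sys/class/fc_transport/target{target}/port_name"
            try:
                with open(wwpn_file) as f:
                    wwpn = f.read().strip()
            except OSError:
                continue  # not a fibre channel path
            for ctrl, ports in CONTROLLER_PORTS.items():
                if wwpn in ports:
                    seen.add(ctrl)
        return seen

    for path in sorted(glob.glob("/sys/block/dm-*")):
        dm = os.path.basename(path)
        if not os.path.isdir(f"{path}/slaves"):
            continue
        ctrls = controllers_for(dm)
        if ctrls and ctrls != set(CONTROLLER_PORTS):
            print(f"{dm}: paths only via controller(s) {sorted(ctrls)} - not redundant")

Running something like this after any recabling or controller work would have flagged the single-controller volumes on the JBOD straight away, rather than leaving them to be discovered during a controller swap.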
Neil