Network disruption yesterday
Apologies for the short network disruption yesterday afternoon. It was caused by a 10Gbps forwarding loop, which was created as the second-last fibre was being connected as part of our core switch upgrade programme. As soon as we realised there was a problem, the fibre was disconnected again and the port configuration was corrected.
Background: we (Informatics, and the constituent Departments before that) set up our network with redundant paths for resilience, using Rapid Spanning Tree Protocol to manage the links and prevent loops. EdLAN as a whole has different constraints, and they run a different STP variant across the core and no STP at the edge. Over the years there have been incompatibilities between the way these variants operate, and we have seen some instability as a result of STP-related events elsewhere. We have therefore filtered BPDUs at all of our interfaces to EdLAN for many years, and this has generally worked well.
So what went wrong yesterday? The cards in the new switch that was being installed are slightly different from the ones in the old switch, and the port involved in yesterday’s problems was previously set up as a hot-spare EdLAN link. (We keep some links pre-configured so that they can be quickly swapped into operation should there be a fault with our principal link.) As part of the upgrade process that port became one of our “normal” infrastructure links and the hot-spare EdLAN link was moved to a different port. The VLAN configurations were moved correctly, but the BPDU filtering was accidentally left applied to the old port. When that port was patched in, therefore, no BPDUs passed across it, so STP could not see the redundant path and did not know to block one of the downstream links, and a loop was set up. Unicast traffic would still have been forwarded normally, but we carry enough multicast traffic that, once it was looping, it completely saturated our infrastructure links.
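To illustrate why an unblocked loop is so damaging for flooded traffic, here is a deliberately simplified sketch. It is not a model of our actual switches or topology: the three switch names are made up, there is no MAC learning or multicast snooping, and each switch simply floods a frame out of every forwarding link except the one it arrived on. With one link blocked (as STP would arrange) a flooded frame dies out after a couple of hops; with all three links forwarding it circulates indefinitely.

```python
def flood(links, source, max_hops):
    """Follow a single flooded frame through a set of forwarding links.

    links:    iterable of frozenset({a, b}) pairs (hypothetical switch names).
    source:   switch where the frame is first injected.
    max_hops: number of forwarding rounds to simulate.

    Returns (total link transmissions, copies still in flight at the end).
    """
    neighbours = {}
    for link in links:
        a, b = tuple(link)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Each in-flight copy is (switch holding it, switch it arrived from).
    in_flight = [(source, None)]
    transmissions = 0
    for _ in range(max_hops):
        next_round = []
        for here, came_from in in_flight:
            for peer in neighbours.get(here, ()):
                if peer != came_from:  # never send back out of the ingress link
                    next_round.append((peer, here))
                    transmissions += 1
        in_flight = next_round
        if not in_flight:
            break  # the frame has reached every edge and died out
    return transmissions, len(in_flight)


# Hypothetical triangle of switches A, B and C.
loop = [frozenset("AB"), frozenset("BC"), frozenset("CA")]
# The same triangle with the C-A link blocked, as STP would do.
tree = [frozenset("AB"), frozenset("BC")]

looped_result = flood(loop, "A", 10)
tree_result = flood(tree, "A", 10)
```

In this toy model the blocked topology needs two transmissions to deliver the frame everywhere and then goes quiet, whereas the looped one is still carrying two copies after ten rounds and would carry on for as long as it was left running. A real multicast source sends continuously, so each new frame adds to the circulating load until the links are full.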
The fix was to disconnect the problem link, thereby breaking the loop. The BPDU filter was then applied to the correct link, and everything was connected up again.
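A mistake of this kind could in principle be caught by a simple consistency check before a port is patched in. The following is only a sketch under stated assumptions, not a description of our tooling: the port names, the "role" labels and the dictionary layout are all hypothetical, and the policy it encodes (BPDU filtering on EdLAN-facing ports and on no others) is a simplification of the arrangement described above.

```python
def audit_bpdu_filter(ports):
    """Return the names of ports whose BPDU filter setting does not match their role.

    ports: mapping of port name -> {"role": "edlan" or "internal",
                                    "bpdu_filter": bool}
    Assumed policy: filter BPDUs on EdLAN-facing ports only.
    """
    problems = []
    for name, cfg in sorted(ports.items()):
        should_filter = cfg["role"] == "edlan"
        if cfg["bpdu_filter"] != should_filter:
            problems.append(name)
    return problems


# Hypothetical state mirroring the incident: the old hot-spare port has become
# an internal link but still filters BPDUs, and the new hot-spare port does not.
example_ports = {
    "port-old-spare": {"role": "internal", "bpdu_filter": True},
    "port-new-spare": {"role": "edlan", "bpdu_filter": False},
    "port-uplink": {"role": "edlan", "bpdu_filter": True},
    "port-core": {"role": "internal", "bpdu_filter": False},
}

mismatched = audit_bpdu_filter(example_ports)
```

Here mismatched would list the two spare ports, which is exactly the pair whose settings needed swapping.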
As usual, our technical network documentation is here.