June Ceph Update – the Queen’s CURRY
We had the opportunity to try a cold start of all our servers in the College Server Room over the weekend. We shut down all machines on Friday 11th June so that contractors could work on the electricity supply on Saturday.
From my perspective the most interesting aspect was shutting down the Ceph cluster, as we hadn't done that since we started using CephFS for general file serving. During the day on Friday we reduced the number of active MDSs to 1; this took a few minutes while all metadata were flushed. After 6pm on Friday we shut down all compute boxes and VMs, reducing the number of CephFS clients, and finally we also shut down the virtualisation service. We then took the CephFS offline:
ceph fs set one down true
This again took a few minutes while all metadata were flushed to the Ceph metadata pool. Next we shut down the Ceph cluster itself:
ceph osd set noout        # don't mark stopped OSDs "out" (avoids data migration)
ceph osd set nobackfill   # suspend backfill
ceph osd set norecover    # suspend recovery
ceph osd set norebalance  # suspend rebalancing
ceph osd set nodown       # don't mark unresponsive OSDs "down"
ceph osd set pause        # pause all client I/O
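One step not in our write-up, but worth doing before powering anything off, is double-checking that the flags have actually been applied cluster-wide, for example:

```shell
# Show the cluster-wide OSD flags; after the commands above this should
# list pauserd,pausewr,noout,nobackfill,norebalance,norecover,nodown
# (the pause flag expands to pauserd and pausewr)
ceph osd dump | grep flags
```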
We then switched off the Ceph nodes. Finally, we shut down the two remaining servers, which provide the DHCP and DNS services. The whole procedure took about 1 hour.
The electrical work was carried out on Saturday morning and was completed by 14:00. Time to switch everything back on. This is where we hit the first issue: all machines had been switched off, including the DHCP servers, and our remote controls get their IP addresses from DHCP, so we couldn't switch on the servers remotely. Ah well, nothing a bike ride couldn't fix. The DHCP/DNS servers came up fine. Next I switched on the Ceph nodes. Once all Ceph nodes were up again I reversed the various options set during the shutdown. Finally, we re-enabled the CephFS:
ceph fs set one down false
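Reversing the shutdown flags is simply a matter of unsetting each one, roughly in the opposite order they were set:

```shell
ceph osd unset pause        # resume client I/O
ceph osd unset nodown       # allow OSDs to be marked down again
ceph osd unset norebalance  # resume rebalancing
ceph osd unset norecover    # resume recovery
ceph osd unset nobackfill   # resume backfill
ceph osd unset noout        # allow OSDs to be marked out again
```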
This procedure was nice and quick and took about 5 minutes. Next on the list was the virtualisation service, which came up without any problems and also started all the virtual machines, including the file server frontends. We used this opportunity to switch the frontends to Ubuntu. Finally, we switched on all compute boxes. We were back in business within an hour of starting the cold boot.
We spotted one more problem: we use NFS to export our storage to Linux desktops, served by an active-active Ganesha NFS server cluster. Root could access the mounts, but users hit locking issues. It turned out the shared state DB was in error because we had removed the old SL7 servers. Once this was resolved, NFS worked again.
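An active-active Ganesha cluster like ours typically uses the rados_cluster recovery backend, which keeps the shared grace/recovery state in a RADOS object. Assuming that backend, and with hypothetical pool, namespace, and node names (ours differ), the cleanup looks roughly like:

```shell
# Inspect the shared grace database; each Ganesha node has an entry
ganesha-rados-grace --pool cephfs_metadata --ns ganesha dump

# Remove the entries for the decommissioned SL7 nodes so the
# remaining servers can leave the grace period and hand out locks
ganesha-rados-grace --pool cephfs_metadata --ns ganesha remove sl7-nfs1 sl7-nfs2
```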
All in all, the super CURRY worked out quite well. Decoupling the file servers from the storage is a huge benefit: it allowed us to switch OS without having to transfer any data. My impression is that the new Ubuntu-based file server frontends are snappier as well.