Magnus Hagdorn

Research Software Engineer

Anatomy of a CephFS Disaster

This post describes in detail how we ended up with a damaged CephFS and our attempts to fix it.


On 3rd September we scheduled some downtime to reconfigure the network of our ceph cluster. Prior to that we used two networks: one frontend network used by the ceph clients (and the rest of the machines in our server room) and one private network used exclusively by the ceph cluster for internal traffic. Recently we upgraded our switches to 10GE. We figured that should be plenty of bandwidth for all our ceph traffic and decided to retire the cluster network. The cluster network is used by the OSDs. In order to reconfigure them they had to be restarted and the cluster quiesced. We followed the instructions in this mailing list post:

  1. we stopped all our clients that use RBDs
  2. we shut down all our VMs and marked the CephFS as down
    ceph fs set one down true
  3. we set some cluster flags, namely noout, nodown, pause, nobackfill, norebalance and norecover
  4. we waited for the ceph cluster to quieten down, reconfigured ceph and restarted the OSDs one failure domain at a time

Once all OSDs had been restarted we switched off the cluster network switches and made sure ceph was still happy. We then reversed the procedure.
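
The flag handling in step 3, and its reversal afterwards, can be sketched as a small shell helper. This is only an illustration, not the exact commands we typed: the function name apply_flags and the RUN dry-run switch are made up for this example, while the flag names are the ones listed above. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: set or clear the cluster flags named in step 3.
# RUN defaults to "echo", so commands are printed, not executed.
# Set RUN to the empty string (RUN= ) to run them for real.
RUN="${RUN-echo}"
FLAGS="noout nodown pause nobackfill norebalance norecover"

# apply_flags set    runs "ceph osd set <flag>" for every flag
# apply_flags unset  runs "ceph osd unset <flag>" for every flag
apply_flags() {
    action="$1"
    for flag in $FLAGS; do
        $RUN ceph osd "$action" "$flag"
    done
}
```

Calling apply_flags set before the maintenance and apply_flags unset once the OSDs are back keeps the two lists of flags from drifting apart.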

Out of the Frying Pan

Ceph didn’t report any problems, and our fileservers that use RBDs came back happily. I then re-enabled the cephfs. I was running ceph -w to watch for any problems and noticed that our MDS fell over with the following error:
replayed ESubtreeMap at 8537805160800 subtree root 0x1 not in cache
failure replaying journal (EMetaBlob)

I changed the number of active MDS to 1 and restarted the MDS. The restarted MDS did not rejoin the cluster and eventually the cephfs ran out of active MDS and crashed.

My memory of events and what we did gets a bit hazy here. Thinking we had a disaster on our hands, we initiated the disaster recovery procedure and reset the journal. The MDS started again and we had our cephfs back. Unfortunately, it crashed as soon as we started writing data to it.
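
For reference, the journal part of the disaster recovery procedure looks roughly like the sketch below. This follows the sequence in the Ceph disaster recovery documentation rather than a record of what we actually typed, so treat it as an assumption. The rank one:0 (filesystem one, rank 0), the backup file name backup.bin, the function name and the RUN dry-run switch are all chosen for this example. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch only: journal backup, salvage and reset as described in the
# Ceph disaster recovery docs. Assumes filesystem "one", rank 0.
# RUN defaults to "echo" (dry run); set RUN= to execute for real.
RUN="${RUN-echo}"
FS_RANK="${FS_RANK:-one:0}"

journal_recovery() {
    # 1. Keep a copy of the journal before modifying anything.
    $RUN cephfs-journal-tool --rank="$FS_RANK" journal export backup.bin
    # 2. Write recoverable inodes/dentries from the journal to the metadata pool.
    $RUN cephfs-journal-tool --rank="$FS_RANK" event recover_dentries summary
    # 3. Reset (truncate) the damaged journal.
    $RUN cephfs-journal-tool --rank="$FS_RANK" journal reset
    # 4. Wipe the session table so stale client sessions are dropped.
    $RUN cephfs-table-tool all reset session
}
```

Resetting the journal discards any metadata updates that could not be salvaged in step 2, which is why the export in step 1 is worth the extra time.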

Into the Fire

We noticed a whole bunch of
bad backtrace on directory
errors. Some more reading suggested that we should do a scrub of the filesystem. We did that but were too impatient and did not let it finish before we started using the filesystem again. The cephfs crashed again on write.

At this stage we decided to follow the rest of the disaster recovery procedure. Our cephfs contains ~40TB of data. It took 4 workers over 4 days to scan the file extents. During this time we observed ceph read activity between ~500 op/s and ~2 kop/s. It would have been helpful if the documentation gave some hints as to how long a very long time is and how many workers are reasonable for a file system of a given size. For the second phase, scanning the inodes, we used 16 workers, which completed the task in a few hours. During this phase we maintained ceph read activity of ~60 kop/s. It would also be useful to know whether this process can be distributed over a number of machines. The remaining phases completed relatively quickly.
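
Running a scan phase with several workers can be sketched as follows. Each worker gets its own index via --worker_n and the total count via --worker_m, as described in the Ceph documentation for cephfs-data-scan. The pool name cephfs_data, the function name run_scan_phase and the RUN dry-run switch are assumptions made for this example and not taken from our cluster. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch only: start all workers of one cephfs-data-scan phase in
# parallel and wait for them to finish.
# RUN defaults to "echo" (dry run); set RUN= to execute for real.
RUN="${RUN-echo}"

# run_scan_phase <phase> <workers> <data-pool>
run_scan_phase() {
    phase="$1"
    workers="$2"
    pool="$3"
    n=0
    while [ "$n" -lt "$workers" ]; do
        # Worker n of m handles its own slice of the objects in the pool.
        $RUN cephfs-data-scan "$phase" --worker_n "$n" --worker_m "$workers" "$pool" &
        n=$((n + 1))
    done
    wait   # return only once every background worker has exited
}
```

With this helper the two phases described above would be run_scan_phase scan_extents 4 cephfs_data followed by run_scan_phase scan_inodes 16 cephfs_data. All workers of one phase have to finish before the next phase starts, which is what the wait is for.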


Rereading the disaster recovery documentation suggested that the cleanup phase is optional, so we reasoned that we could reactivate the cephfs. At this stage it was still in a failed state. Running
ceph mds repaired 0
marked the filesystem as repaired and ready to be used. We then started a filesystem scrub and repair
ceph tell mds.a scrub start / recursive repair
which found some issues that ended up in lost+found.

We then tested the filesystem by writing data. It passed the test and we fired up our VMs again. We are back in business.


Reconstructing a 40TB distributed filesystem takes a long time. The filesystem scrub and repair should have done the trick to fix the issues we saw. We do need to chase up some issues where we get regular error messages that clients are failing to respond to cache pressure.

Update: 10/09/2020

In a, perhaps foolish, attempt to be tidy I decided to delete the entries in the lost+found directory. This crashed the MDS. One by one the MDS crashed, restarted and gave up after a few attempts. Eventually all the standby MDS were used up. This is similar to what we saw in the first place. I then applied the lessons I learned over the last few days and

  1. put on the emergency brakes (the MDS were all down anyway)
    ceph fs fail one
  2. restarted all the MDS, e.g.
    systemctl reset-failed ceph-mds@store09.service
    systemctl start ceph-mds@store09.service
  3. marked the cephfs as up
    ceph fs set one joinable true
  4. and finally restarted the scrub
    ceph tell mds.store08 scrub start / recursive repair

NB: for historic reasons our cephfs is called one.

The good news is that all the cephfs clients continued during the brief downtime. But we still need to figure out how to tidy up the stuff in lost+found. One issue might be that our clients are quite a bit older.
