School of GeoSciences File Storage
Providing secure, reliable, performant file storage is one of the fundamental services an IT team provides. This post describes in detail how the new CephFS based storage at the School of GeoSciences works.
All our Linux machines (servers, compute boxes and desktops) run Scientific Linux and are managed using LCFG. We also use LCFG to manage our ceph cluster.
When I started working at the School of GeoSciences over 10 years ago, we had a number of file servers spread over our three buildings. Each front-line file server was paired with a second-line server in another building that took regular snapshot copies. In case of disaster a second-line server could become front-line. The disks of each server were configured using RAID6 to provide redundancy in case of disk failure and carved up into smaller volumes using LVM.
Some time ago we moved most of our servers into the College Server Room (CSR) in the JCMB and replaced the second-line servers by backing up to the centrally provided ECDF DataStore. Consolidating our servers into a single location has the advantage that we do not need to look after three locations and our compute servers are close to the file storage. Potential disaster is mitigated by using the DataStore service which uses a different data centre. Files on DataStore are also regularly snapshotted and backed up to tape.
A few years ago we started using the virtualisation system developed by the School of Engineering based on OpenNebula and ceph for clustered remote block devices. Ceph is a distributed storage platform which is scalable and avoids single points of failure.
The Current File Servers
When it came to refreshing our file servers we decided to replace them with ceph remote block devices (RBDs) used by three simple frontends: one for staff/visitors, one for students and one for group storage.
We are still using the volumes from the first iteration of file servers. The logical volumes from a number of servers became RBDs stored in a ceph RADOS pool. The frontends export volumes using SAMBA to Windows and Mac clients and NFSv4 to Linux clients. Outside the CSR we cannot trust networked computers. We therefore only allow NFS access with a valid kerberos ticket.
The main advantages of using ceph RBDs to hold the volumes are that
- we maintained the paradigm of named volumes;
- we can easily grow the volumes;
- we gain resilience: we can lose an entire server without the loss of any service; and
- we can easily upgrade the hardware by adding new servers and, once ceph has rebalanced itself, removing the old ones.
Although we gained resilience, we still had single points of failure in the frontends. This is mitigated to some extent by our ability to reconfigure any of the frontends to take on the role of another. However, system software upgrades still require an interruption to service.
CephFS: the next step
Meanwhile, the latest iteration of the virtualisation service developed by the School of Engineering places the virtual machine disk images into a CephFS directory. This allowed us to familiarise ourselves with the new technology and gain some confidence that we can use it for our general file store.
The remainder of this post describes our CephFS setup and how we manage user and group directories.
CephFS is a POSIX-compliant, parallel file system built on top of ceph’s distributed object store.
The file data and metadata are stored in two different RADOS pools. File metadata are accessed via metadata servers (MDS). Currently, we are running two active MDS with two hot standby MDS.
Currently, we have a total of 6 nodes in the ceph cluster:
- 3 Dell PowerEdge R730xd with 2×5.5TB, 6×11TB disks and 8×186GB SSD
- 3 Dell PowerEdge R7415 with 6×5.5TB and 6×7.5TB disks
The nodes are connected via 2×10GE link-aggregated Ethernet connections. We do not use an internal cluster network.
The CephFS data pool is limited to the hard drives, while the metadata pool makes use of the SSDs. Initially, the metadata pool used the hard drives as well, but we found that performance during the nightly backup, which accesses the metadata of all files, could be doubled by keeping the metadata on the faster SSDs.
All nodes have 128GB of memory. Each object storage daemon (OSD), one per disk, uses 4GB of memory. The MDS are limited to 1GB of memory.
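As a rough check of the memory budget, the figures above can be combined in a few lines of Python. This is only a sketch: it assumes that every disk listed for a node, SSDs included, runs exactly one OSD, and it ignores the memory used by the MDS, the operating system and any other daemons.

```python
# Disk counts per node type, taken from the hardware list in this post.
NODE_DISKS = {
    "R730xd": 2 + 6 + 8,  # 2 + 6 hard drives and 8 SSDs
    "R7415": 6 + 6,       # 12 hard drives
}
NODE_MEMORY_GB = 128  # memory per node
OSD_MEMORY_GB = 4     # memory per OSD


def osd_memory_gb(disks, per_osd_gb=OSD_MEMORY_GB):
    """Memory claimed by the OSDs on one node, assuming one OSD per disk."""
    return disks * per_osd_gb


for model, disks in NODE_DISKS.items():
    used = osd_memory_gb(disks)
    print(f"{model}: {disks} OSDs use {used}GB, leaving {NODE_MEMORY_GB - used}GB")
```

Under these assumptions the OSDs take at most half of a node's memory, which leaves room for recovery work when the cluster rebalances.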
The size of the CephFS depends on the pool size and the replication level, ie the number of copies kept. Increasing the size of the file system is then simply a matter of adding more OSDs or replacing existing disks with larger disks.
We are using a replication level of 3. This means that we can lose an entire node without losing service. The ceph cluster reshuffles data to reflect the loss of any OSDs automatically. Should the second copy also disappear, the data stored in that location becomes read-only to ensure data consistency. The actual metadata stored for the CephFS is small but very valuable. We are therefore thinking of increasing the replication level of the metadata pool because we really do not want the metadata to become read-only since that would freeze the entire filesystem.
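The relation between raw pool size, replication level and usable space is simple division. The sketch below uses the 463TB raw size of our pool as input. It assumes the same replication level everywhere and ignores filesystem overhead and the free space ceph needs for rebalancing, so it gives an upper bound rather than an exact figure.

```python
def usable_tb(raw_tb, replication=3):
    """Upper bound on usable space when every object is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication level must be at least 1")
    return raw_tb / replication


raw = 463  # raw pool size in TB
print(f"{raw}TB raw at replication 3 gives about {usable_tb(raw):.0f}TB usable")

# Adding an extra 11TB disk grows the usable space by a third of its size.
print(f"one extra 11TB disk adds about {usable_tb(11):.1f}TB usable")
```

The same function shows the cost of a higher replication level: at a level of 4 the same raw capacity would yield a quarter, not a third, of its size.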
The filesystem is huge – currently the raw size of our ceph pool is 463TB or about 150TB usable space once the replication level is taken into account. We have split the root of the filesystem into 3 top level directories:
/datastores holds the disk images for our virtualisation service
/pacific holds our users’ data
/arctic holds archived data
We use a water body themed naming convention for our managed Linux machines. It seemed apt to use oceans to hold our data. The atlantic pool holds the RBDs for our current file servers.
The top level /pacific directory is further split up into
scratch directories. We create daily snapshots of the /pacific/backedup directory. The current snapshot gets rsync’d to the ECDF DataStore. The snapshots are exposed to the user through a hidden directory .snapshots. The snapshots of subdirectories can be found in the corresponding
/pacific/scratch directory contains users’ scratch directories in /pacific/scratch/users and not backed up group storage in
The subdirectories of /pacific/backedup are the same with the addition of extra subdirectories that hold our users’ webspace and space for our jupyter notebook and shiny services.
We reduce the number of directory entries in the user directories using a simple hash. Student user names are of the form s1234567. The hash for the students’ directories consists of the last digit of their user name. For all other users we use the first letter of their user name as the hash.
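The hashing rule is easy to express in code. The following is a minimal sketch, not the code used by our tools: it assumes that a student user name is exactly the letter s followed by seven digits, and the function name is made up for illustration.

```python
import re

# Assumption: student user names are "s" followed by exactly seven digits.
STUDENT_UUN = re.compile(r"^s\d{7}$")


def directory_hash(uun):
    """Return the one-character bucket a user's directory is filed under."""
    if not uun:
        raise ValueError("empty user name")
    if STUDENT_UUN.match(uun):
        return uun[-1]  # students: last digit of the user name
    return uun[0]       # everyone else: first letter of the user name


print(directory_hash("s1234567"))
print(directory_hash("mhagdorn"))
```

Using the last digit for students spreads them evenly over ten buckets, whereas the first character would put every student into the same bucket.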
User and group directories get archived by simply moving the directory to the corresponding location of the /arctic directory tree. Archived data are kept for a while before they are removed from the system. Archiving a directory is atomic since it is a simple move operation on the same file system.
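A move of this kind can be sketched with a single rename call. The function below is an illustration under stated assumptions, not our archiving script: the function name and arguments are hypothetical, and the device check is an extra safeguard added here to make sure the move stays a rename and never turns into a copy.

```python
import os
import tempfile


def archive_directory(src, live_root, archive_root):
    """Move src to the same relative location below archive_root.

    os.rename is atomic only within one file system, so the function
    refuses to run if the two trees live on different devices.
    """
    rel = os.path.relpath(src, live_root)
    if rel == os.curdir or rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{src} is not below {live_root}")
    if os.stat(src).st_dev != os.stat(archive_root).st_dev:
        raise OSError("source and archive are on different file systems")
    dest = os.path.join(archive_root, rel)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.rename(src, dest)  # a single metadata operation, no data is copied
    return dest
```

On our file system a call could look like archive_directory("/pacific/scratch/users/m/mhagdorn", "/pacific", "/arctic"), where the exact directory layout is assumed for the example.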
The size of the user and group directories is limited by setting quotas. CephFS quotas are implemented using extended attributes. We only set the maximum bytes limit. Quotas are cooperative, ie the CephFS client stops writing once the limit has been reached. Quotas can be queried using
getfattr -n ceph.quota.max_bytes /some/dir
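The same attribute can be read and written from Python with the standard os.getxattr and os.setxattr calls, which are available on Linux. This is a minimal sketch rather than the code of our tools. It assumes that a missing attribute or a value of 0 both mean that no quota is set, and the getxattr and setxattr parameters exist only so that the functions can be tried without a CephFS mount.

```python
import errno
import os

QUOTA_ATTR = "ceph.quota.max_bytes"


def get_quota_bytes(path, getxattr=None):
    """Return the quota of path in bytes, or None if no quota is set."""
    getxattr = getxattr or os.getxattr
    try:
        raw = getxattr(path, QUOTA_ATTR)
    except OSError as err:
        if err.errno == errno.ENODATA:  # attribute not present
            return None
        raise
    value = int(raw.decode())
    return value or None  # a value of 0 is treated as "no quota"


def set_quota_bytes(path, max_bytes, setxattr=None):
    """Set the quota of path; the value is stored as a decimal string."""
    if max_bytes < 0:
        raise ValueError("quota must not be negative")
    setxattr = setxattr or os.setxattr
    setxattr(path, QUOTA_ATTR, str(max_bytes).encode())
```

For example, set_quota_bytes("/some/dir", 100 * 2**30) would request a 100GiB limit, provided the ceph user of the client is allowed to change quotas.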
As a consequence of the quota handling the MDS keeps recursive directory stats which are also accessed via extended attributes. You can get all recursive stats using
getfattr -d -m ceph.dir.* /some/dir
This is much faster than using the traditional
We are considering tying some of the metadata servers to specific subdirectories. It would make sense to have one MDS for the users’ home directories, one MDS for their scratch directories and one for the remaining directories.
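One mechanism CephFS offers for this is export pinning, where the ceph.dir.pin extended attribute of a directory names the MDS rank that should serve the subtree. The sketch below shows how such a plan could be applied. We have not decided to do it this way: the plan, the function names and the /pacific/backedup/users path are assumptions made for illustration, and a third pinned subtree would need a third active MDS.

```python
import os

PIN_ATTR = "ceph.dir.pin"

# Hypothetical plan: directory -> MDS rank. The paths are assumed for this example.
EXAMPLE_PLAN = {
    "/pacific/backedup/users": 0,
    "/pacific/scratch/users": 1,
}


def pin_directory(path, rank, setxattr=None):
    """Pin the subtree at path to an MDS rank; a rank of -1 removes the pin."""
    if rank < -1:
        raise ValueError("rank must be -1 or a non-negative MDS rank")
    setxattr = setxattr or os.setxattr
    setxattr(path, PIN_ATTR, str(rank).encode())


def apply_plan(plan, setxattr=None):
    """Apply every pin of the plan in a stable (sorted) order."""
    for path, rank in sorted(plan.items()):
        pin_directory(path, rank, setxattr)
```

Directories without a pin of their own inherit the pin of their closest pinned parent, so pinning a few top level directories is enough to cover the whole tree.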
On our managed Linux machines we use autofs to mount user/group directories on demand. We keep the automounter maps in our LDAP. Home directories get mounted in the usual place, ie
/scratch/UUN. We mount group directories in
Access to parts of the CephFS is restricted using secrets associated with ceph users. We deliver the secrets to our trusted machines using subversion. In this context we only trust the computers in the CSR because physical access is restricted. So, inside the server room the data are accessed via the kernel cephfs module. On untrusted networks the data are accessed via kerberised NFSv4.
We spread the load over three NFS servers accessing the same CephFS using the ganesha server: one for staff, one for students and one for group shares. The ganesha server communicates directly with the ceph cluster. Ganesha is a user-space NFS server that supports various filesystem backends. It supports GSSAPI and can be clustered. We run the ganesha server as an active/active cluster. The state of the NFS cluster is kept in the CephFS metadata pool.
In both trusted and untrusted cases we use the same automounter map. So for example my entry looks like
# mhagdorn, auto.cfshome, geos.ed.ac.uk
PREFIX are set when the automounter is started. They are
- in the CSR:
STAFFSERVER=(no particular server required) and
- outside the CSR:
There are similar variables for student and group servers. The file system type is set in the
- for cephfs we append to the map
- and for NFS
The NFS servers also run SAMBA to export the data to our Windows and Mac users. Although SAMBA can also communicate directly with the ceph cluster we decided to use vanilla SAMBA on top of automounted CephFS directories. This approach has two advantages:
- quotas work out of the box since the cephfs kernel client reports quota as file system size
- symbolic links to other file systems work on all supported platforms, eg a symbolic link in the user’s home directory to their scratch directory resolves correctly
The NFS/SAMBA file servers do not need many resources: they do not need local storage and they do not need much memory since caching files makes no sense in a clustered file system. They do need plenty of network bandwidth. We are therefore running the servers on our virtualisation service. This way we gain some resilience should there be a hardware fault.
The NFS server cluster and the other CephFS clients use consistent file locking. They are, however, not aware of the SAMBA locks. So files can get corrupted if they are accessed from both Linux and Windows at the same time.
User Web Directories
We give our users file-based webspace on request which can be accessed in the usual manner, ie www.geos.ed.ac.uk/~UUN. We allow the users to run CGI scripts. To offer some protection the web pages and scripts are mounted read-only. The web server is only allowed to write to a log and a data directory which are not exposed via the web. Read-only access is enforced by using a ceph user that has read-only access to the web directory and read-write access to the log/data directory. On the web servers the web directory is mounted in
/home/UUN while elsewhere it is mounted on
We use a similar trick on our jupyter notebook and shiny service to replace the user’s general home directory with the user’s jupyter or shiny directory.
The second aspect to providing school storage is managing it, ie creating user/group directories, changing their quota and eventually archiving the directories. The LDAP entries also need to be maintained. We have written a number of python scripts that do these jobs together with LCFG configuration. For historical reasons the tool is called quadrat – Quota, User and Directory Remote Administration Tool.
We have created a backend virtual machine that has access to the entire CephFS filesystem. The backend is responsible for creating directories, changing quotas, creating snapshots and archiving data. The backend also modifies the LDAP automounter maps. Write access to the LDAP is granted to a process with a suitable kerberos ticket.
Group storage is managed via an inifile configured by LCFG. A python program is run when the configuration changes. We also set up user projects that consist of a directory structure containing a project directory and user directories with symbolic links to the projects that users have access to. Permissions are handled using FACLs. Users only have direct access to their own directory.
User directories, ie home, scratch and web directories, are managed via the user’s LDAP entry. In addition to the account information we also store quota records for each user directory. The user entry can be modified on any machine. A process on the backend runs regularly to look for quota changes or whether new directories need to be created. New directories are populated using files configured in another LCFG managed inifile. Again the backend program updates the automounter entries for the users once the new directories are created.
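The regular check on the backend amounts to comparing the quota records from LDAP with what exists on the file system. The sketch below shows that comparison with plain dictionaries. It is not the quadrat code: the function name, the action names and the example paths are hypothetical, and reading LDAP or touching the file system is left out on purpose.

```python
def plan_changes(desired, current):
    """Work out what the backend has to do.

    desired maps a directory path to the quota recorded for it (in bytes).
    current maps each existing directory to its quota, or None if unset.
    Returns a list of (action, path, quota) tuples in a stable order.
    """
    actions = []
    for path, quota in sorted(desired.items()):
        if path not in current:
            actions.append(("create", path, quota))     # new directory needed
        elif current[path] != quota:
            actions.append(("set_quota", path, quota))  # quota record changed
    return actions


desired = {"/home/a": 10, "/home/b": 20, "/home/c": 30}
current = {"/home/a": 10, "/home/b": 5}
for action in plan_changes(desired, current):
    print(action)
```

Separating the planning step from the execution makes it possible to log or review the pending changes before anything is modified.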
The archiving script checks for unexpected directories and inactive users. These directories are simply moved to the corresponding location in the /arctic directory tree. It also removes the automounter entries from the LDAP. Currently, we do not run the archiving script automatically.
Using CephFS greatly simplifies our file serving. The ceph system is very good at looking after itself. Normal failures like a failing disk or even an entire server failing are dealt with by the system without our intervention. We obviously still need to replace whatever stopped working but the service is not interrupted. We can easily grow the amount of storage by adding more servers. We can also easily retire old servers by just adding new servers and removing the old ones.
In the server room we also gain resilience because there is no single frontend server. Outside the server room we still have single points of failure. However, we have mitigated some failure modes by virtualising the frontends.
The next step is to look into clustering of the NFS and SAMBA servers to remove those single points of failure as well. Apart from failures this would also allow us to do routine maintenance like software upgrades that require a reboot without disrupting the service.
SAMBA also supports exposing CephFS snapshots via the Windows Explorer’s Previous Versions dialog using the