Retirement of our legacy SGE Compute Cluster
After many years of tireless service, we’re finally going to retire our old SGE Compute Cluster in favour of our newer compute facilities – in particular our new Slurm Compute Cluster.
This blog post outlines the timetable and rationale for this work, and provides some guidance to help you prepare for this change.
What are we doing?
We are shortly going to decommission our old SGE Compute Cluster. We’ve already decommissioned some parts of it, so only a very small number of people from the Institute of Condensed Matter & Complex Systems have been using it lately.
If you’re not sure whether you use this cluster yourself, it’s the one that you access using commands like qsub, qstat etc.
Our newer Slurm Compute Cluster (accessed with commands like sbatch, srun, sinfo etc.) is not affected by this. Indeed – it’s actually about to get a bit bigger! But that’s a story for another day…
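If you have old job scripts or shell history lying around and aren’t sure which cluster they target, a rough check like the one below may help. It is only a sketch: it looks for the commands mentioned above, plus a few assumptions of ours that are not mentioned in this post (the `#$` and `#SBATCH` directive prefixes, and the related commands qdel, squeue and scancel). The function name is made up for illustration.

```python
import re

# Markers that usually indicate SGE usage: "#$ " directive lines or SGE commands.
SGE_PATTERN = re.compile(r"^\s*#\$\s|\b(qsub|qstat|qdel)\b", re.MULTILINE)

# Markers that usually indicate Slurm usage: "#SBATCH" lines or Slurm commands.
SLURM_PATTERN = re.compile(
    r"^\s*#SBATCH\b|\b(sbatch|srun|sinfo|squeue|scancel)\b", re.MULTILINE
)


def detect_schedulers(text):
    """Return the set of schedulers ("sge", "slurm") that the text appears to use."""
    found = set()
    if SGE_PATTERN.search(text):
        found.add("sge")
    if SLURM_PATTERN.search(text):
        found.add("slurm")
    return found
```

To try it, read a job script into a string and pass it to detect_schedulers. A result containing "sge" suggests the script will need attention before the old cluster goes away. It is a heuristic, so a word like "qstat" in a comment will also trigger a match.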
Why are we doing this?
Our old SGE Compute Cluster went into service back in 2015/2016. It ran the Son of Grid Engine (SGE) job scheduler on Scientific Linux 7 (SL7), with a small bunch of compute nodes that were bought around the same time.
In 2020 – and following the announced closure of the Scientific Linux project – we started developing our new Ubuntu Linux Platform as a replacement for Scientific Linux. We also decided to invest in a refresh of our School’s computation infrastructure, purchasing 20 brand new compute nodes. In tandem with this, we decided to move from SGE to Slurm as our new job scheduler, as Slurm was being used in many nearby HPC facilities like Cirrus (and subsequently ARCHER2) whereas the SGE project was effectively moribund.
We now have 25 newer compute nodes in operation in our main server room… and we have 7 brand new ones almost ready for people to start using! (More on this soon…)
In tandem with these new developments, we’ve continued to run our older SGE Compute Cluster so that users can migrate their workloads from SGE/SL7 to Slurm/Ubuntu at a time that suits them.
However, we’re now at the stage where we need to start winding the old cluster down. Most of the hardware is now very old – and some has failed already – and its SGE/SL7 software stack hasn’t been getting much love recently, as we’ve wanted to focus on developing and improving our Slurm & Ubuntu platforms.
Related to this, we’re also now looking at ending support for SL7 some time in 2023. (An exact timetable for this will be announced later.) Most users have now moved to Ubuntu, but there are still a few users who haven’t wanted to migrate to Ubuntu yet. Retiring our SGE cluster will remove the last remaining SL7 compute facilities within our School, leaving only a few research desktop PCs running SL7. We’re hoping this will finally encourage the last remaining users to migrate to Ubuntu!
How will this impact me?
If you’re already using our newer computation facilities – such as our Slurm Compute Cluster – then you’ll continue to use those in the usual way. (If it helps, our newer compute nodes are all named phcomputeNNN, where NNN is a 3-digit number. Our older nodes are a bit more random… some were actually named after members of the So Solid Crew!)
If you currently use both Slurm and SGE clusters, then you’ll soon only be able to use Slurm. This might allow you to simplify some of your workflows to get rid of SGE-specific stuff.
For the small number of SGE users who haven’t yet accessed Slurm or our new compute nodes, you’re now going to have to port your workflows – and potentially your codes – over to Slurm and our Ubuntu Linux platform. This will require some effort. We can provide help with this – please ask!
Everyone who has used our SGE cluster in the last 6 months has been given access to our Slurm Compute Cluster, so you should be able to start using Slurm immediately.
If you are running computational workloads that only work on Scientific Linux 7, we might be able to get those running on Ubuntu via a cool piece of software called Singularity. We’ll probably need to help you with this – so please ask us.
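To give a feel for what this looks like, here is a minimal sketch of how a program might be launched inside a Singularity container. The image name sl7.sif, the program name and its arguments are all hypothetical placeholders, and the helper function is ours rather than part of any School tooling. The only real piece is the general `singularity exec IMAGE COMMAND` form.

```python
import shlex


def singularity_command(image, command):
    """Build the argument list for running `command` inside a Singularity image."""
    return ["singularity", "exec", image, *command]


# Hypothetical example: "sl7.sif" and "./my_sl7_program" are placeholder names.
argv = singularity_command("sl7.sif", ["./my_sl7_program", "--input", "data.txt"])

# Show the command as it would be typed in a shell.
print(shlex.join(argv))
```

On a host with Singularity installed, the same list could be handed to subprocess.run(argv, check=True). Building a suitable SL7-compatible image is the harder part, which is where we can help.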
Proposed retirement timetable
We’re proposing the following timetable for retiring our SGE Compute cluster:
9am on Friday 27th January 2023
At this time, we’ll tell the SGE cluster to stop running new jobs.
Any jobs that are already running will be allowed to run to completion over the next week or so.
In the unlikely event that there are any jobs waiting in the queue at this time, we’ll cancel these and let the owner(s) know.
Monday 6th February 2023
By this time, there will be no more jobs running on the cluster.
Any data left on the local scratch disks (/scratch) will be moved to our /storage/scratch shared scratch server. Users will need to reclaim any data they wish to keep from here before it gets auto-deleted after 1 month, i.e. in early March. We’ll contact all users with data there to give them advance warning.
We’ll then start decommissioning the SGE cluster and its compute nodes.
We’ll also remove the SGE software from all our Linux hosts. After that, common SGE commands like qsub and qstat will give you a “command not found” error.
Which School compute facilities can I use now?
A detailed overview of School (and nearby) compute facilities can be found in:
Linux computing facilities in the School and beyond
If you’ve only been using our SGE cluster, you might first want to try:
- Our Slurm Compute Cluster: This behaves in a similar way to SGE, but uses the Slurm job scheduler instead of SGE. Our Slurm documentation includes some guidance for migrating from SGE to Slurm.
- Our walk-in compute nodes: You can SSH into these and run computations directly. You might find this helpful for porting your workflows and codes over to Ubuntu.
We’re happy to provide guidance and help in using these.
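As a rough illustration of what porting a job script involves, the sketch below rewrites a handful of common SGE `#$` directives as Slurm `#SBATCH` lines. It is not our official migration tool and it is far from complete: the option mapping reflects our general understanding of the two schedulers, the function names are made up, and it assumes a `-pe` request means CPU cores on a single node (parallel environment names are site-specific). Anything it doesn’t recognise is flagged for manual attention rather than guessed at. Please treat our Slurm documentation as the authoritative guide.

```python
import re

# Only wall-clock limits written as H:MM:SS are translated automatically. A bare
# number means seconds in SGE but minutes in Slurm, so that case is flagged instead.
H_RT = re.compile(r"^h_rt=(\d+:\d{2}:\d{2})$")

# SGE options that map directly onto a single Slurm option.
SIMPLE_OPTIONS = {
    "-N": "--job-name",
    "-o": "--output",
    "-e": "--error",
    "-t": "--array",
}


def translate_directive(line):
    """Translate one '#$ ...' line into a list of replacement lines."""
    match = re.match(r"^#\$\s+(.*)$", line)
    if not match:
        return [line]  # not an SGE directive: keep unchanged
    args = match.group(1).split()
    if not args:
        return [line]
    opt, values = args[0], args[1:]
    if opt in SIMPLE_OPTIONS and len(values) == 1:
        return [f"#SBATCH {SIMPLE_OPTIONS[opt]}={values[0]}"]
    if opt == "-cwd":
        # Slurm starts jobs in the submission directory by default.
        return []
    if opt == "-l" and len(values) == 1 and H_RT.match(values[0]):
        return [f"#SBATCH --time={H_RT.match(values[0]).group(1)}"]
    if opt == "-pe" and len(values) == 2:
        # Assumption: the parallel environment is a shared-memory one.
        return [f"#SBATCH --cpus-per-task={values[1]}"]
    return [f"# TODO: translate manually: {line}"]


def translate_script(text):
    """Translate every SGE directive in a job script, leaving other lines alone."""
    out = []
    for line in text.splitlines():
        out.extend(translate_directive(line))
    return "\n".join(out) + "\n"
```

File names given to `-o` and `-e` are copied as-is, so any SGE-specific variables in them will still need changing by hand. The translated script would then be submitted with sbatch rather than qsub.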
If you have any questions about this work, please contact us:
- You can email the School Helpdesk: email@example.com
- Alternatively, you can post in the SoPA Research Computing space in Teams.