Long Running Jobs and AFS
As most of you will be aware, access to the School’s AFS file system requires that the user be in possession of a valid Kerberos ticket. Most of the time, this is handled behind the scenes and doesn’t cause any problems. The default Kerberos ticket is only valid for 18 hours, though, and this can cause problems when users attempt to run jobs which last longer than 18 hours and require access to AFS space. Once the 18 hours are up, the Kerberos ticket associated with the job expires and the job loses access to the file system. This is probably not what you want.
Fortunately, there are ways around this. The tickets issued to Informatics users can be renewed for up to 28 days using a program called krenew. For jobs which need to run for even longer than this, the k5start program can use information held in a local file on a given host to obtain Kerberos tickets indefinitely. None of this is straightforward to do, however, and it is all too easy to make a minor error on the command line which leads to a job failing 18 hours later. Waiting 18 hours to see if something works makes for an awfully long run/debug/fix loop.
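To make the thresholds above concrete, here is a minimal Python sketch of the decision they imply: given an expected job duration, which mechanism is needed to keep a ticket valid. The function and constant names are hypothetical and purely illustrative; only the 18-hour and 28-day figures and the tool names krenew and k5start come from this post. It assumes the job starts with a freshly issued ticket.

```python
from datetime import timedelta

# Figures taken from the text above; the names are illustrative only.
DEFAULT_TICKET_LIFETIME = timedelta(hours=18)
MAX_RENEWABLE_LIFETIME = timedelta(days=28)


def choose_ticket_strategy(expected_duration: timedelta) -> str:
    """Return the mechanism needed to keep a ticket valid for the job.

    Assumes the ticket is freshly issued when the job starts; a ticket
    that is already partly used up leaves less time than this suggests.
    """
    if expected_duration <= timedelta(0):
        raise ValueError("expected_duration must be positive")
    if expected_duration <= DEFAULT_TICKET_LIFETIME:
        # The default ticket outlives the job; nothing extra is needed.
        return "default ticket"
    if expected_duration <= MAX_RENEWABLE_LIFETIME:
        # The ticket must be renewed periodically, which krenew does.
        return "krenew"
    # Beyond the renewable lifetime, new tickets must be obtained,
    # which k5start does using information held in a local file.
    return "k5start"


if __name__ == "__main__":
    for hours in (6, 48, 24 * 40):
        duration = timedelta(hours=hours)
        print(f"{hours} hours: {choose_ticket_strategy(duration)}")
```

This only captures the choice of tool, not how to invoke it correctly, which is where most of the difficulty described above lies.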
To simplify the lives of our users, we have written a wrapper script called longjob, now available on all DICE hosts, which takes care of much of the minutiae of setting up long-running jobs. Given an indication of how long a job is expected to last, the script will check whether suitable Kerberos tickets are in place, prompting the user for their Kerberos password if necessary to obtain new tickets, and then start the job. There is a man page which prospective users are encouraged to study, and User Support will of course be happy to answer any questions about this script and indeed about long-running jobs in general.