Like it or not, jobs fail often, and when they do it can be hard to figure out what went wrong. The slurmR package has some tools that can help you deal with this.
The documentation that follows applies to jobs submitted with sbatch, that is, jobs that were submitted using either Slurm_lapply, Slurm_sapply, Slurm_Map, or Slurm_EvalQ.
When calling any of the *apply family functions, slurmR creates a folder with the name equal to job_name in tmp_path, containing the following files:
00-rscript.r: The R script that is used to load the data and execute whatever the instruction is (sapply, lapply, Map, etc.).
01-bash.sh: The Slurm configuration bash file. It passes all the SBATCH options the user specified and calls Rscript to run the job.
02-output-%A-%a.out: The name pattern of the log files generated by Rscript. In the case of job arrays, %A is the job id and %a is the array id. This is usually the place to look for useful information on why the script failed.
03-answer-%03i.rds: The name pattern of the output rds files. Usually, the jobs end up writing an output, e.g., the results of the lapply call, and the %03i in the pattern is the array id, zero-padded to three digits.
*.rds: Further R objects that were exported for this particular job. In the case of Slurm_lapply, for example, these usually include the X1.rds, X2.rds, …, X[njobs].rds files. Other R objects needed for the call are saved in this same folder as well.
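To make the name patterns above concrete, here is a small base-R sketch that builds the file names one would expect to find in a job folder. The folder, job id, and number of array jobs are illustrative assumptions, not values produced by slurmR in this snippet.

```r
# Illustrative values (assumptions): adjust to your own job.
job_dir  <- "/auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616"
job_id   <- 6163924L
array_id <- 1:4

# Log files follow 02-output-%A-%a.out (%A = job id, %a = array id).
log_files <- file.path(job_dir, sprintf("02-output-%d-%d.out", job_id, array_id))

# Result files follow 03-answer-%03i.rds (array id zero-padded to 3 digits).
answer_files <- file.path(job_dir, sprintf("03-answer-%03i.rds", array_id))

basename(log_files)
basename(answer_files)
```

Knowing the expected names makes it easier to spot which array ids never wrote a log or an answer file.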
If there’s an issue with the submitted job, the user can take a look at these files. In general, looking at the log files is enough to figure out what could be going on. Let’s see the following example:
library(slurmR)
x <- Slurm_lapply(
  1:1000, function(x) complicated_algorithm(x),
  njobs = 4,
  plan = "submit"
)
By printing the output, you may see something like this:
x
Call:
Slurm_lapply(X = 1:1000, FUN = function(x) complicated_algorithm(x), njobs = 4,
plan = "submit")
job_name : slurmr-job-5724cb1616
tmp_path : /auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616
job ID : 6163924
Status: All jobs are pending resource allocation or are on it's way to start. (Code 1)
This is a job array. The status of each job, by array id, is the following:
done : -
failed : -
pending : -
running : 1, 2, 3, 4.
The question is what happens if some of these fail, for example, jobs 1 and 3:
x
Call:
Slurm_lapply(X = 1:1000, FUN = function(x) complicated_algorithm(x), njobs = 4,
plan = "submit")
job_name : slurmr-job-5724cb1616
tmp_path : /auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616
job ID : 6163924
Status: One or more jobs failed. (Code 99)
This is a job array. The status of each job, by array id, is the following:
done : 2, 4.
failed : 1, 3.
pending : -
running : -
We can check the log files of the failed jobs using Slurm_log. For example, to check the log file of the first job of the array, we can select it through the which. argument.
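A minimal sketch of that call, assuming x is still the slurm_job object returned by Slurm_lapply in the example above and that array id 1 is the job of interest:

```r
library(slurmR)

# x is assumed to be the slurm_job object from the earlier Slurm_lapply call.
# which. selects the array id whose log file should be opened.
Slurm_log(x, which. = 1)
```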
By default, while in interactive mode, you will get a prompt telling you that less (the default) will be called through the system2 function, and asking whether you wish to continue.
You can change the way the log file is displayed by using an alternative command, such as cat.
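As a sketch, and assuming the alternative command is passed through an argument named cmd (check ?Slurm_log for your installed version), the call could look like this:

```r
library(slurmR)

# x is assumed to be the slurm_job object from the earlier Slurm_lapply call.
# Show the log of array id 1 with cat instead of the default pager, less.
Slurm_log(x, which. = 1, cmd = "cat")
```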
Again, while in interactive mode, you will get a prompt asking you to enter "y" or "n". If the command fails, it is usually due to a missing log: either you entered an invalid number in which., or the job array never started the log file. If the error has to do with the latter, you can always inspect the files located in the job folder directly, e.g., using command-line tools.
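The same inspection can also be done without leaving R. The following is a base-R sketch, not part of slurmR: tail_logs is a hypothetical helper, and the folder path is assumed to be the one printed as tmp_path in the example above.

```r
# Hypothetical helper (not part of slurmR): print the last n lines of the
# log files of a given array id found in a job folder.
tail_logs <- function(job_dir, array_id, n = 20) {
  # Log files are named 02-output-<job id>-<array id>.out
  pattern <- sprintf("^02-output-.*-%d\\.out$", array_id)
  logs <- list.files(job_dir, pattern = pattern, full.names = TRUE)
  for (f in logs) {
    cat("==>", f, "<==\n")
    writeLines(utils::tail(readLines(f, warn = FALSE), n))
  }
  invisible(logs)
}

# Assumed job folder, taken from the printed slurm_job object.
job_dir <- "/auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616"

list.files(job_dir)    # everything written for this job (empty if missing)
tail_logs(job_dir, 1)  # tail of the log(s) for array id 1, if any
```

If no log file exists for the requested array id, the helper prints nothing and returns an empty character vector, which is itself a hint that the job never started.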
Following the previous case, let’s imagine that the failure was due to some unexpected error (e.g., the node failed), so we can simply resubmit the job. To do so, we can use the sbatch function.
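A sketch of the resubmission, assuming that extra arguments given to slurmR's sbatch function are forwarded to Slurm as options, so that array here plays the role of Slurm's --array flag:

```r
library(slurmR)

# x is assumed to be the slurm_job object whose array ids 1 and 3 failed.
# Restrict the resubmission to those two array ids.
sbatch(x, array = "1,3")
```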
This will resubmit the job, but only components 1 and 3. Once it is done, the user can collect the results using Slurm_collect. This will read in the results of all jobs, not just 1 and 3.
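For instance, assuming the resubmitted jobs have finished and x is the same slurm_job object, collecting could be sketched as:

```r
library(slurmR)

# Read the 03-answer-*.rds files of every array id and combine the results.
res <- Slurm_collect(x)

# Quick look at the top level of what came back.
str(res, max.level = 1)
```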
If for some reason the R session was closed before you were able to save the slurm_job object, you can always recover it with the read_slurm_job function.
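A sketch of the recovery, assuming read_slurm_job accepts the path to the job folder (here, the one printed as tmp_path in the example above):

```r
library(slurmR)

# Rebuild the slurm_job object from the files stored in the job folder.
x <- read_slurm_job("/auto/rcf-40/vegayon/slurmR/slurmr-job-5724cb1616")

# Printing it shows the current status of each array id again.
x
```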