- If it’s long enough for a progress bar, it’s long enough to store results incrementally.
- Make the job resumable. If the job fails, you should be able to restart it from the last point at which it was still working, not from the beginning.
- Make the job detachable, especially if you are working on a remote machine. Tools like tmux or screen allow you to start a job in a terminal session, then detach your terminal from the process, leaving it running.
- Fail early. Inspect results from the job as soon as they are produced, not once the job is over. This is easier if you save results incrementally.
  Useful terminal commands:
  - “Tail the logs” with tail -f: this shows the end of the log file and keeps following it as new lines are appended.
  - watch: this reruns a command at a regular interval and displays its latest output. Here, command is something informative about your job’s status. For jobs involving NVIDIA GPU resources, you can watch the output of nvidia-smi.
- Have expectations about the behavior of your job. Form estimates of:
  - How long you expect the job to take to complete.
  - The likely ways it could fail.
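
The points above about storing results incrementally, resuming, and failing early can be combined in one small pattern. The following is a minimal Python sketch, not a definitive implementation. The names `run_job`, `load_done`, and `square`, the JSON Lines results file, and the optional `check` callback are all assumptions made for illustration. It also assumes that each item is JSON-serializable and hashable, and that every line in the results file was written whole.

```python
import json
from pathlib import Path


def square(x):
    # Hypothetical stand-in for one expensive unit of work.
    return x * x


def load_done(results_path):
    """Return the set of items already recorded in the results file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    done = set()
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                # Assumes each line is a complete JSON record; a truncated
                # line raises an error here rather than being ignored.
                done.add(json.loads(line)["item"])
    return done


def run_job(items, results_path, work=square, check=None):
    """Process items, appending each result to results_path as it is produced.

    Items already present in the file are skipped, so rerunning after a
    failure resumes the job instead of starting over. Returns the number
    of items processed in this run.
    """
    done = load_done(results_path)
    processed = 0
    with open(results_path, "a") as f:
        for item in items:
            if item in done:
                continue
            result = work(item)
            # Fail early: stop as soon as a result looks wrong.
            if check is not None and not check(result):
                raise ValueError(f"suspicious result for {item!r}: {result!r}")
            f.write(json.dumps({"item": item, "result": result}) + "\n")
            f.flush()  # push the record to the file now, not at exit
            processed += 1
    return processed
```

Calling `run_job(range(1000), "results.jsonl")` a second time after a crash would only process the items missing from `results.jsonl`. Because each record is flushed as soon as it is written, the results file can also be inspected while the job is still running.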