1. If it’s long enough for a progress bar, it’s long enough to store results incrementally.


  2. Make the job resumable. If the job fails, you should be able to restart it from the last point where it was still working, not from the beginning.


  3. Make the job detachable, especially if you are working on a remote machine. Tools like tmux or screen allow you to start a job in a terminal session, then detach your terminal from the process, leaving it running.


  4. Fail early. Inspect results from the job as soon as they are produced, not once the job is over. This is easier if you save results incrementally.

    Useful terminal commands:


    “Tail the logs”:

    tail -f path/to/your-job.log

    This command shows the end of the log file and keeps printing new lines as they are appended.


    watch:

    watch -n 1 'command'

    In this snippet, command is something informative about your job’s status, and -n 1 reruns it every second. For jobs that use NVIDIA GPUs, you can watch the output of nvidia-smi.


  5. Have expectations about the behavior of your job. Before you start it, ask yourself:

    • How long do you expect the job to take to complete?
    • What are the likely ways that it would fail?
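As a rough illustration of points 1 and 2, here is a minimal Python sketch of a job that stores results incrementally and resumes where it left off. It assumes the work can be split into independent items that each have a stable id; the names run_job, load_done, ends_mid_line, and the JSON Lines results file are hypothetical choices for this example, not part of any particular library.

```python
import json
import os


def load_done(results_path):
    """Return the ids of items already recorded in the results file."""
    done = set()
    if os.path.exists(results_path):
        with open(results_path) as f:
            for line in f:
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError):
                    # A crash mid-write can leave a truncated line; ignore it
                    # so that item is simply recomputed on the next run.
                    continue
    return done


def ends_mid_line(path):
    """True if the file is non-empty and does not end with a newline."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        f.seek(-1, os.SEEK_END)
        return f.read(1) != b"\n"


def run_job(items, process, results_path):
    """Process (id, payload) pairs, saving each result as soon as it is ready."""
    done = load_done(results_path)
    needs_newline = ends_mid_line(results_path)
    with open(results_path, "a") as f:
        if needs_newline:
            f.write("\n")  # seal off a truncated line left by an earlier crash
        for item_id, payload in items:
            if item_id in done:
                continue  # finished in a previous run, skip it
            result = process(payload)
            f.write(json.dumps({"id": item_id, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before moving on
```

Calling run_job(items, process, "results.jsonl") a second time after a failure skips every id already in the file and continues with the rest. Because each result is a complete line on disk as soon as it is computed, the same file can also be inspected while the job is running, which helps with point 4.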