[SOLVED] Job error on Colossus nodes

Yesterday at around 15:15, about half of the Colossus nodes were reinstalled. Unfortunately, one Slurm plugin was out of sync, which caused jobs to fail to start properly on those nodes. As a result, about 40 jobs exited with an empty slurm-NNN.out file before the nodes were automatically taken out of production. The problem was discovered and fixed within an hour, but the failed jobs must be resubmitted.

We apologize for the inconvenience.

Published Oct. 10, 2017 12:46 PM - Last modified Oct. 16, 2017 8:59 AM