[CLOSED] Colossus: $SCRATCH file system problem

Yesterday, between 14:00 and 21:00, many jobs failed to start due to a problem with the scratch file system. These jobs have now been requeued and should start normally.

We are still trying to determine the cause. The indications so far are that the file system became full, either in terms of disk space or number of files. If that is the case, jobs using $SCRATCH may have been affected or may even have crashed, so please check your jobs.

Update, 2020-09-27: We have confirmed that one or more jobs filled up $SCRATCH by creating too many files. We are setting up monitoring so that we can identify which user's jobs are responsible should it happen again.
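
A file system can be full in either of the two senses mentioned above: it can run out of disk space, or it can run out of file entries (inodes) while space is still available. If you want to see how close a directory's file system is to either limit, the following is a minimal Python sketch. It is only an illustration and not the monitoring we are setting up; the function name filesystem_usage is made up for this example, and the script falls back to the current directory if the SCRATCH environment variable is not set.

```python
import os


def filesystem_usage(path):
    """Return the fraction of disk blocks and of inodes in use on the
    file system that holds `path` (hypothetical helper for illustration)."""
    st = os.statvfs(path)

    def used_fraction(total, free):
        # Some file systems report 0 total inodes; no meaningful ratio then.
        if total == 0:
            return None
        return 1.0 - free / total

    return {
        "space_used": used_fraction(st.f_blocks, st.f_bfree),
        "inodes_used": used_fraction(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    # Assumption: SCRATCH points at the scratch directory; otherwise use ".".
    scratch = os.environ.get("SCRATCH", ".")
    for name, fraction in filesystem_usage(scratch).items():
        if fraction is None:
            print(f"{name}: not reported by this file system")
        else:
            print(f"{name}: {fraction:.1%}")
```

A high value for inodes_used together with a low value for space_used would point to the "too many files" situation rather than a lack of disk space.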

 

Published Sep. 24, 2020 10:24 AM - Last modified Sep. 29, 2020 11:13 AM