TSD Operational Log - Page 18

Published Nov. 4, 2015 12:22 AM

Dear TSD-users,

there will be a maintenance stop of the TSD infrastructure on Thursday 5/11 from 15:00 to 15:30 CET. During the downtime, users will not be able to access TSD. The VMs will probably be rebooted at the end of the downtime, so all running processes will be stopped. The TSD downtime coincides with the Colossus maintenance stop, so no jobs will be running on the cluster during the downtime. The short notice is due to our decision to merge two maintenance stops, namely HNAS and Colossus, to minimise the number of outages.

Update: the downtime lasted from 15:00 to 15:10, and everything is back up and working. Performance should be better.

Sorry for the inconvenience.
Regards,
Francesca

Published Oct. 20, 2015 10:02 PM

Dear TSD-user,

tomorrow there will be an upgrade of the Cerebrum instance in TSD. The outage will last for the entire day. As a consequence of the maintenance stop, brukerinfo will not work.
You will receive an informative email when the maintenance is finished.
Sorry for the inconvenience. 

Regards,

TSD team

Published Sep. 16, 2015 3:58 PM

Dear TSD-users,

the issue regarding the broken communication between Colossus and the Domain Controllers has been solved, and Colossus is now back in production as usual.
We expect that very few (if any) jobs failed during the unplanned outage.

Sorry for the inconvenience.

Happy computing!

Francesca@TSD

Published Sep. 16, 2015 1:50 PM

Dear TSD user

We have a problem with Colossus: our Domain Controller update caused an unwanted situation. We hope to get things back on track today; we'll keep you posted.

For those of you paying for CPU hours who have had jobs killed, please email us at tsd-drift@usit.uio.no to get this refunded with interest.

Sorry for the inconvenience.

Gard

Published June 8, 2015 10:38 AM

Dear TSD user,

on 10 June 2015 at 12:00 CEST there will be an update of the TSD disk. We expect the upgrade to last for an hour. During this period the system might hang for about one minute at 30-minute intervals (the first time at 12:00, the second at 12:30, and so on).

For the Colossus users: all the jobs on Colossus nodes will keep running as usual during the upgrade. Notice, however, that jobs finishing during the upgrade might crash because the processes writing data back to the VMs may fail. We therefore advise you to schedule your jobs (when possible) so that they finish well after the upgrade period; see the sketch below.
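As an illustration only, here is a minimal sketch of one way to stay clear of the window with a Slurm job script on Colossus; the account name, wall time, and program are hypothetical placeholders:

    #!/bin/bash
    # Sketch of a job that cannot finish inside the upgrade window:
    # ask Slurm not to start it before the window has closed.
    # Account name, wall time, and program are hypothetical placeholders.
    #SBATCH --account=pXX
    #SBATCH --time=04:00:00
    #SBATCH --begin=2015-06-10T13:00:00

    ./my_analysis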

Regards,

TSD@USIT

Published May 18, 2015 1:34 PM

Dear TSD-user,

the Linux VMs will now be shut down for maintenance, as previously announced.

You will be informed when we are finished.

Regards,

Francesca@TSD

Published May 18, 2015 12:46 PM

Dear User,

the login problem we experienced this morning has just been solved.

Regards,

TSD@USIT

Published May 18, 2015 10:55 AM

Dear Users

We have an issue with the two-factor login. The problem occurs the second time you try to log in today. We are on the case; hopefully it will be solved quite soon.

Best

Gard@TSD

Published May 11, 2015 10:16 PM

Dear TSD user,

the maximum wall-time limit for jobs running on Colossus has now been increased to 28 days. This will facilitate the execution of long simulations/calculations. However, we strongly advise you not to run jobs for more than 7 days unless you have enabled checkpointing (...
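As a rough sketch of what a long job could look like under the new limit (the account name, program, and restart flag are hypothetical placeholders):

    #!/bin/bash
    # Request the new 28-day maximum wall time
    # (Slurm format: days-hours:minutes:seconds).
    # The account name is a hypothetical placeholder.
    #SBATCH --account=pXX
    #SBATCH --time=28-00:00:00

    # For runs this long, application-level checkpointing is strongly
    # advised: restarting from the latest checkpoint means an outage does
    # not cost the whole run. Program name and flag are hypothetical.
    ./my_simulation --restart-from latest_checkpoint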

Published May 11, 2015 10:02 AM

Dear TSD user,

due to maintenance, on 18 May 2015 at 13:30 CEST all the Linux machines in TSD will be shut down. We expect the downtime to last for an hour. You will receive a notification by mail when the maintenance stop is over.

For the Colossus users: all the jobs on Colossus nodes will keep running as usual during the downtime. Notice, however, that jobs finishing during the downtime might crash because the processes writing data back to the VMs may fail. We therefore advise you to schedule your jobs (when possible) so that they finish well after the downtime period.

Regards,

TSD@USIT

Published Apr. 30, 2015 8:23 PM

- This work has been finished and we are back in production (08:50, 6/5-15)

Because USIT is starting to use a new certificate when talking to MinID (ID-porten/Difi), there will be a restart of Nettskjema on 6/5-15 at 08:30. Estimated downtime is about 20 minutes.

Nettskjema will not work during this downtime. We will try to update here when it is back online again.

Best,

TSD@USIT

Published Apr. 14, 2015 3:51 PM

The SPICE proxy was accidentally rebooted at 15:40 today. This incident caused a 5-minute downtime for remote connections to Linux machines in TSD using SPICE. We are sorry for any inconvenience this may have caused.

Published Mar. 27, 2015 11:05 AM

The SPICE proxy is now up and running again.

Dear TSD users

As usual, things do not go as planned. The machine has been moved, but we cannot get the SPICE connection working. We will come back to this shortly. As a workaround, log in to a Windows machine first and then use PuTTY to reach your Linux VM; see the sketch below.
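A minimal sketch of the workaround, assuming PuTTY is available on the Windows machine; the username and VM hostname are hypothetical placeholders:

    rem On the Windows login machine (Command Prompt):
    rem username and VM hostname below are hypothetical placeholders.
    putty.exe -ssh myuser@my-linux-vm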

We are moving the SPICE proxy machine to VMware today at 14:00, so there will be a short downtime; the proxy should be back up no later than 14:30. All SPICE connections will be lost. If you log in on Windows and then use SSH to your machine, you will not be affected.

We will update this logpost when done.

Published Mar. 12, 2015 5:02 PM

Dear TSD users

We have fixed the LDAP issue in TSD. Everything should work, except for a known problem with p21 in the file-sluice.

We are very sorry for the downtime. If there are any more issues, please report to tsd-drift@usit.uio.no.

Best regards

Gard@TSD

Published Mar. 11, 2015 3:33 PM

Dear TSD-users

NB: the amount of data inside the file-sluice was so large that we must extend the downtime for the file-sluice until 17:00 today.

We are moving one of the file-sluice machines to VMware tomorrow morning; thus, no files can be imported or exported tomorrow morning from about 09:00 to 12:00. Data will not be lost, but all jobs and connections will be cut off at 09:00 tomorrow morning. Nettskjema answers from this period will show up inside TSD once the server is back online.

Nettskjema will be stopped on 19/3-15 from 08:30 to 10:30. No answers can be handed in during this time, and you cannot log in to create or change forms at nettskjema.uio.no. This downtime is due to a major upgrade to Nettskjema version 15.

Best regards

Gard

Published Feb. 27, 2015 10:23 AM

We had a DNS issue yesterday at about 15:30. It was quick-fixed yesterday afternoon, and a permanent fix was in place today at 10:15.

The reason was that after migrating machines to VMware, the old machines were left in RHEV, turned off. Some eager users had restarted these machines (we totally understand why, as you believed they were down), and this resulted in duplicate machines with the same names and IP addresses, which in turn caused DNS to fail.

Sorry for the inconvenience.

Gard

Published Feb. 24, 2015 3:59 PM

We have solved the issue with import/export.
Sorry for the delay with the fix.

Gard

Published Feb. 19, 2015 1:21 PM

Dear TSD-users

You may log in to TSD now. Problem solved.

TSD-team

Published Feb. 19, 2015 12:50 PM

Dear TSD-users

You may experience problems with logging in to TSD. We are on the case and working to solve it as soon as possible.

Sorry for the inconvenience.

TSD-Team

Published Feb. 16, 2015 8:53 AM

We had a disk run full on one of our login machines late last night. This caused logins to fail. The case has been solved, and login is now enabled again.

Gard

Published Feb. 11, 2015 8:48 AM

Dear TSD users

TSD is now back up. All Windows machines have been, or will be, restarted. We are restarting a few Linux boxes now, assuming most users can start their own.

Best regards

Gard@TSD

Published Feb. 2, 2015 3:37 PM

We have detected a problem with the copying of data in and out of TSD. We know what is wrong, but not yet why. We are working on it and will update here once it is solved.

Update: the problem is now solved. If you still see errors, please report them to tsd-drift@usit.uio.no with the filename and project number.

Published Feb. 2, 2015 1:04 PM

The HPC resource Colossus is still down due to a security update.

We hope that it will be back up again during Monday 2/2-15.

Update: our work with Colossus yesterday unfortunately had to give way to an emergency maintenance stop of Abel. The new ETA is late today, Tuesday the 3rd.

Update: Colossus came back up yesterday at about 16:00. Unfortunately, queued jobs did not manage to start and have to be resubmitted.

Published Jan. 30, 2015 1:56 PM

We are glad to inform you that TSD services, except Colossus, are now open. We will update you once Colossus is online.

Thank you for your patience,

The TSD team

Published Jan. 29, 2015 4:20 PM

As the glibc error showed up worldwide yesterday, we also had to fix this issue during our downtime.
Things have not gone as smoothly as planned, so we must prolong our downtime until further notice.

We have several people working on the case right now so we hope to see a solution fairly soon.

We will post updates on the email list and the website as soon as we know more.

We are sorry for the inconvenience.