Break in file service fs on 2012-03-30, 10:41 - 16:15

Submitted by pmtonter on March 30, 2012 - 12:33

Schedule:

2012-03-30 10:41 to 16:15

Duration:

5h 34min

Affected services:

File services (fs), cluster file services (hpc-fs), web services, entry server shell.

Description:

Our disk array system suffered from performance problems on thin-LUN. To fix this, both storage processors were rebooted (separately, of course). During this, multipathd on HIIT's file server Frodo handled the reboot of the first storage processor (SPA) just beautifully but dumped core when the paths via the second storage processor (SPB) went down. multipathd was restarted via /etc/init.d/multipath-tools and its functionality was restored, but too late: nfsd had erroneously continued to access SAN disks via the failed paths and, unsurprisingly, failed, causing the kernel to hang those nfsd processes. This is why Frodo had to be rebooted.
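As an illustration only: processes stuck on I/O to unreachable disks typically show up on Linux in state D (uninterruptible sleep). The sketch below lists such processes by reading /proc/<pid>/stat. It assumes that the hung nfsd processes would have appeared in state D, which this report does not state, and the function names parse_stat and blocked_processes are hypothetical, not part of any tool used on Frodo.

```python
import os


def parse_stat(stat_line):
    """Return (command name, state letter) from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the LAST ')' in the line.
    left, right = stat_line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def blocked_processes(name="nfsd", proc_root="/proc"):
    """List PIDs of processes called `name` that are in state D.

    Assumption: a hung nfsd shows up as state D (uninterruptible sleep).
    """
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry is unreadable
        if comm == name and state == "D":
            blocked.append(int(entry))
    return sorted(blocked)


if __name__ == "__main__":
    print("nfsd processes in state D:", blocked_processes())
```

A non-empty list that stays the same over repeated runs would suggest the processes are not recovering on their own, which is the kind of situation that led to the reboot described above.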

Frodo's /group file system was previously resized on-line from 9TB to 11TB. Because Linux's resize2fs isn't able to update all superblocks of an ext3 file system when an on-line resize is performed, a file system check is needed on the next mount. Thus /group will be checked. The file system check is expected to take approximately 12 hours.

Other file systems, excluding /group, will be up sooner.

Updates to this break report will be added as the situation progresses.

Update at 13:42: It appears that the previously described superblock update problem with resize2fs has been fixed, and thus running a file system check on /group is not necessary. Connections for services using disks from the file server are currently being restored, and the break is expected to end much sooner than previously announced.

Update at 14:42: File services (fs) and cluster file services (hpc-fs) are working normally. There are still problems with the entry server shell and web services.

Update at 15:22: Entry server shell and web services are working normally.

Update at 15:56: We are installing upgrades and rebooting the server during the same break.

Update at 16:15: The break is now over.

Last updated on 10 Apr 2012 by Markus Nuorento - Page created on 30 Mar 2012 by Pekka Tonteri