[Techtalk] I/O error on /var

Alexandra West a.west at acu.ac.uk
Tue Dec 2 15:42:49 EST 2003


Hello Everybody,

I've not put in an appearance so far but have been a silent and
interested visitor to the list for some time.

Now I am hoping that you will forgive me for jumping straight in and
sending you my plea for help.

I'm alone with a number of systems while my boss (technically one of the
best people I've come across) is on holiday. Generally our systems are
so good that they don't require a lot of attention but this time, sod's
law, a lot of things seem to be going up the creek. My main problem is this:

1/ We moved one of our servers, a Linux 2.4.22-xfs box (Debian Woody), to
a new location.
[RAID-5, 9 disks, GDT Storage RAID Controller Driver, version 2.05.]
2/ Afterwards, it wouldn't reboot.
3/ We tracked this to a SCSI conflict: a Tekram tape-drive controller
was fighting with the RAID controller.
4/ Mike (my boss) swapped out the Tekram card for another, slightly
different Tekram card and rebuilt the kernel to support it.
5/ The box rebooted fine.
Then he went on holiday.

6/ From my home box I tried to access my email (IMAP, SquirrelMail) and
found the box running (SquirrelMail interface available) but email down
(authentication failed).
7/ Hunting for clues, I found that /var (on sda8) had unmounted itself. I
saw that Mike had encountered a similar problem before and in the end
had to reboot the box, so I did too.
8/ Rebooted but the box didn't come back up.
9/ Next morning I removed the Tekram card and the box booted. In dmesg,
I found a clue:
XFS mounting filesystem sd(8,8)
Starting XFS recovery on filesystem: sd(8,8) (dev: 8/8)
Ending XFS recovery on filesystem: sd(8,8) (dev: 8/8)

10/ I hunted around for some more documentation and found something that
Mike had noted a few months earlier:

<quote>
Looking at the system somewhat later produced an I/O error when trying
to read
anything in /var. Squirrelmail has stopped working due to this, and the imap
processes are in general not happy. Can't check the logs, as there's no
/var/log visible to look in. However, dmesg sheds a little light:-

xfs_inotobp: xfs_imap()  returned an error 22 on sd(8,8).  Returning error.
xfs_iunlink_remove: xfs_inotobp()  returned an error 22 on sd(8,8).
 Returning error.
xfs_inactive:   xfs_ifree() returned an error = 22 on sd(8,8)
xfs_force_shutdown(sd(8,8),0x1) called from line 1844 of file
xfs_vnodeops.c.  Return address = 0xc01e1f64
Filesystem "sd(8,8)": I/O Error Detected.  Shutting down filesystem: sd(8,8)
Please umount the filesystem, and rectify the problem(s)

Guessing that the hard power cycling hasn't done the box any favours.

Shut down as many processes as possible, but /var still won't unmount.

Gave in and rebooted the box remotely. Back up about 4 minutes later,
and dmesg says:-

XFS mounting filesystem sd(8,8)
Starting XFS recovery on filesystem: sd(8,8) (dev: 8/8)
Ending XFS recovery on filesystem: sd(8,8) (dev: 8/8)

Hopefully that was enough to fix it.... If not, the fs will probably
auto-shut down again. I think I'll probably take it down and check
all the filesystems properly anyway, but I might wait for the new CPU
fan before doing that.

19/3/03

The /var filesystem shut down again last night. Guessing there's something
wrong with it that the amanda run throws up, as that's been when it's
happened both nights. Took the box down, booted from rescue floppies, and
ran xfs_repair over all the filesystems. One disconnected inode was
reconnected to lost+found on /var....	
</quote>

I know for sure that this is exactly the same problem, as the box fell
over again last night after starting the amanda backup. Amanda obviously
hits something on /var that makes the system go "tilt".

What I would like to know is: has anybody an idea for fixing this other
than booting from rescue floppies and running xfs_repair? Could I run
xfs_repair without taking the whole system down, by just unmounting /var
(I haven't done a lot of reading on xfs_repair yet), or would it be
advisable to check the whole lot? I read somewhere that you should back
up the data before running xfs_repair, because if the underlying fault
is in the hardware, you might lose your data. Considering that we
haven't had a good backup in about a week, I'm rather reluctant to run
that risk. What would be the best way to back up the data?
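For what it's worth, here is the sequence I've been sketching out for
myself before I dare touch anything, written as a dry-run script so I
can sanity-check the order first (the run() wrapper only prints each
command; replacing the echo with "$@" would execute them for real). The
big assumption is that /var really is /dev/sda8 -- the sd(8,8) in dmesg
would be major 8, minor 8, which matches sda8 -- and the dump file path
is just a placeholder. Please tell me if any of this looks wrong:

```shell
#!/bin/sh
# Dry-run sketch of the repair sequence (untested -- corrections welcome).
# Assumption: /var lives on /dev/sda8 (sd(8,8) = major 8, minor 8).
DEV=/dev/sda8

run() { echo "would run: $*"; }   # dry-run wrapper; swap echo for "$@" to execute

run xfsdump -l 0 -f /root/var-level0.dump /var  # level-0 dump before touching anything
run fuser -vm /var                              # list processes still holding /var open
run umount /var
run xfs_repair -n "$DEV"                        # -n = no-modify mode, report only
run xfs_repair "$DEV"                           # the real repair
run mount /var
```

My thinking with xfs_repair -n first is that it only reports what it
would change, so I'd at least see how bad things are before committing.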

I hope I will be forgiven this rather lengthy message!

Thanks for any advice you might have for me,

Alexandra



