Re: Root filesystem unexpectedly remounted read-only

From: "Tim Connors" <tim.w.connors@gmail.com>
On Fri, 19 Sep 2014, Craig Sanders wrote:
On Thu, Sep 18, 2014 at 10:48:00AM +0200, Michele Bert wrote:
1) Can bad block appear on a virtual disk too? Even if it is eventually just a flat file in the host filesystem? 2) Are those bad blocks related to real bad blocks on the physical host file system?
vmware will tend to drop disk paths well before linux would have a problem with them, in the name of High Availability. Whilst Linux would just log a 120s hangcheck timer alert to the syslog if the disk didn't answer in 120 seconds, vmware might respond to the same disk outage by
I have seen the same with Oracle VirtualBox. The "disk" was an ordinary file. The write timed out, so Linux (the same Ubuntu 12.04) perceived it as a disk failure and remounted read-only. There were no physical hardware errors involved. It caught me by surprise when it happened, and it took a little while until I understood what was going on. The system just behaved weirdly (it was only one of three virtual disks, so the system worked partially). Since then I have a Nagios/Icinga script checking the expected mounts. It compares the current mounts with a mount "snapshot" written to a file after installation. Regards Peter
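A check like the one Peter describes could be sketched as below. This is a minimal sketch under assumptions, not his actual script: it snapshots the first three /proc/mounts fields plus the rw/ro flag, and uses the usual Nagios exit codes (0 for OK, 2 for CRITICAL).

```shell
#!/bin/sh
# Sketch of a Nagios/Icinga-style mount check (assumed layout, not
# Peter's actual script). Compares the current mounts against a
# snapshot written once after installation.

check_mounts() {
    snapshot=$1                         # e.g. /etc/mounts.snapshot (assumed path)
    current=$(mktemp)
    # Keep device, mountpoint, fstype and the rw/ro flag (always the
    # first mount option in /proc/mounts); the other options churn.
    awk '{ split($4, o, ","); print $1, $2, $3, o[1] }' /proc/mounts \
        | sort > "$current"
    if diff -u "$snapshot" "$current"; then
        echo "OK: mounts match snapshot"
        status=0                        # Nagios OK
    else
        echo "CRITICAL: mounts differ from snapshot"
        status=2                        # Nagios CRITICAL
    fi
    rm -f "$current"
    return $status
}
```

The snapshot itself would be written the same way (same awk/sort pipeline) right after installation, so a remount from rw to ro shows up as a one-line diff.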

On Mon, 22 Sep 2014, Peter Ross wrote:
From: "Tim Connors" <tim.w.connors@gmail.com>
On Fri, 19 Sep 2014, Craig Sanders wrote:
On Thu, Sep 18, 2014 at 10:48:00AM +0200, Michele Bert wrote:
1) Can bad block appear on a virtual disk too? Even if it is eventually just a flat file in the host filesystem? 2) Are those bad blocks related to real bad blocks on the physical host file system?
vmware will tend to drop disk paths well before linux would have a problem with them, in the name of High Availability. Whilst Linux would just log a 120s hangcheck timer alert to the syslog if the disk didn't answer in 120 seconds, vmware might respond to the same disk outage by
I have seen the same with Oracle VirtualBox. The "disk" was an ordinary file.
The write timed out, so Linux (the same Ubuntu 12.04) perceived it as a disk failure and remounted read-only.
I don't think it's Linux doing the timeout here. The hypervisor typically has heartbeats and the like, and it ends up sending a SCSI error code back to Linux, which of course then propagates up to the FS.
There were no physical hardware errors involved.
It caught me by surprise when it happened, and it took a little while until I understood what was going on. The system just behaved weirdly (it was only one of three virtual disks, so the system worked partially).
I have a nagios/Icinga script checking the expected mounts since then.
It compares the mounts with a mount "snapshot" written in a file after installation.
What I'm finding quite manageable, despite the number of machines, is to rsync all of the system configs (including /proc/mounts, munged to get rid of dynamic data[1]) to a central host, check them into git half-hourly, and mail the git diff (further munged) to the admins. It would notice /usr going from ,rw to ,ro (and has, when I've missed the syslog messages because there happened to be an unrelated syslog storm at the time).

[1] find */var/local/recovery_data/mounts | while read i ; do
        sed -i '/ \(nfs\|nfs4\|fuse\|fuse.sshfs\) /d' "$i"
        sort "$i" > "$i.tmp" && mv "$i.tmp" "$i"
    done

-- Tim Connors
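The "check into git and mail the diff" half of that loop might look something like this. A sketch under assumptions: the repo path is invented, and the mail step is left to a cron wrapper rather than done in the function.

```shell
#!/bin/sh
# Sketch of the half-hourly "check into git, report the diff" step.
# The layout is an assumption, not Tim's actual setup; a cron wrapper
# would pipe this function's output to mail(1) for the admins.

snapshot_and_diff() {
    repo=$1                   # central config mirror, e.g. /srv/configs (assumed)
    cd "$repo" || return 1
    git add -A
    # Nothing staged means nothing changed since the last run.
    git diff --cached --quiet && return 0
    git commit -q -m "config snapshot $(date -u '+%F %T')"
    # Print what changed; fall back to 'git show' on the very first commit.
    git diff HEAD~1 HEAD 2>/dev/null || git show HEAD
}
```

Because each host's munged /proc/mounts is one of the rsynced files, a /usr remount from ,rw to ,ro turns up as a one-line change in the mailed diff.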
participants (2)
- Peter Ross
- Tim Connors