
One of my Xen DomUs is getting memory corruption. I'm not sure why. I've replaced all the RAM in the system and run memtest86+. The other DomUs work well. Everything was fine before I upgraded to Debian/Testing a week ago, so presumably it's a software bug. The most common symptoms are application SEGVs and GLIBC reporting heap corruption for no apparent reason.

Anyway I moved that system to btrfs and since then I've got a couple of corrupted files. Both files were from Debian packages so the --reinstall option to apt-get fixed that.

It seems that when a file on disk has a checksum mismatch for all copies you get ESTALE as well as messages on the system console like the following:

[ 6747.164889] btrfs: corrupt leaf, bad key order: block=1116618752,root=1, slot=0

BTRFS is doing some good already!

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
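For readers who want to find such files before an application trips over them: dpkg keeps a per-package checksum list under /var/lib/dpkg/info (the ed.md5sums file that appears later in this thread is one of them), in the usual md5sum layout of a hex digest, two spaces, and a path relative to the filesystem root. The following is only a minimal sketch of checking one such list; the function name `verify_md5sums` and the `root` parameter are made up for illustration, and the debsums package is the proper tool for this job.

```python
import hashlib
import os


def verify_md5sums(md5sums_path, root="/"):
    """Check every file named in a dpkg-style md5sums list.

    Returns a list of (path, reason) pairs for files that cannot be
    read or whose MD5 digest differs from the recorded one.
    """
    problems = []
    with open(md5sums_path, "r", encoding="utf-8",
              errors="surrogateescape") as listing:
        for line in listing:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line is "<hex digest>  <path relative to root>".
            expected, _, relative = line.partition("  ")
            path = os.path.join(root, relative)
            try:
                digest = hashlib.md5()
                with open(path, "rb") as handle:
                    # Read in 1 MiB chunks so large files are not
                    # loaded into memory at once.
                    for chunk in iter(lambda: handle.read(1 << 20), b""):
                        digest.update(chunk)
            except OSError as exc:
                # This is where a failing read (ESTALE, EIO, ...) shows up.
                problems.append((path, f"unreadable: {exc.strerror}"))
                continue
            if digest.hexdigest() != expected:
                problems.append((path, "checksum mismatch"))
    return problems
```

Calling `verify_md5sums("/var/lib/dpkg/info/ed.md5sums")` would list any files of the ed package that fail the check. If the md5sums list itself is the unreadable file, the first `open` raises and nothing is checked.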

On Thu, 5 Jul 2012, Russell Coker <russell@coker.com.au> wrote:
Anyway I moved that system to btrfs and since then I've got a couple of corrupted files. Both files were from Debian packages so the --reinstall option to apt-get fixed that.
It seems that when a file on disk has a checksum mismatch for all copies you get ESTALE as well as messages on the system console like the following:
[ 6747.164889] btrfs: corrupt leaf, bad key order: block=1116618752,root=1, slot=0
Sometimes you solve a problem once, think it's solved properly, post to a list, and then discover you can't reproduce it. The second corrupt file gives the following:

# ls -al /var/lib/dpkg/info/ed.md5sums
ls: cannot access /var/lib/dpkg/info/ed.md5sums: Stale NFS file handle
# rm -f /var/lib/dpkg/info/ed.md5sums
rm: cannot remove `/var/lib/dpkg/info/ed.md5sums': Stale NFS file handle
# unlink /var/lib/dpkg/info/ed.md5sums
unlink: cannot unlink `/var/lib/dpkg/info/ed.md5sums': Stale NFS file handle

This seems a bit bogus. There's no reason why an rm or unlink should fail if the corruption only affects one file.

Russell Coker wrote:
Sometimes you solve a problem once, think it's solved properly, post to a list, and then discover you can't reproduce it.
The second corrupt file gives the following:

# ls -al /var/lib/dpkg/info/ed.md5sums
ls: cannot access /var/lib/dpkg/info/ed.md5sums: Stale NFS file handle
# rm -f /var/lib/dpkg/info/ed.md5sums
rm: cannot remove `/var/lib/dpkg/info/ed.md5sums': Stale NFS file handle
# unlink /var/lib/dpkg/info/ed.md5sums
unlink: cannot unlink `/var/lib/dpkg/info/ed.md5sums': Stale NFS file handle

This seems a bit bogus. There's no reason why an rm or unlink should fail if the corruption only affects one file.
Even if the corrupt "file" in question is /var/lib/dpkg/info (i.e. the dir, not the file inside it)?

On Thu, 5 Jul 2012, "Trent W. Buck" <trentbuck@gmail.com> wrote:
This seems a bit bogus. There's no reason why an rm or unlink should fail if the corruption only affects one file.
Even if the corrupt "file" in question is /var/lib/dpkg/info (i.e. the dir, not the file inside it)?
The rest of the directory appears OK, I can access files there and I can create and delete files. So obviously the directory isn't entirely borked. A directory corruption that only affects one directory entry should allow that entry to be unlinked.
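The "rest of the directory appears OK" check can be made systematic by touching every entry and recording which ones fail and with which errno. The sketch below is illustrative only: the name `find_unreadable` is invented here, and it assumes that a damaged entry surfaces as an OSError on lstat or read (ESTALE in the case above; other kernels or failure modes may report a different code such as EIO).

```python
import errno
import os
import stat


def errno_name(exc):
    """Map an OSError to its symbolic errno name, e.g. 'ESTALE'."""
    return errno.errorcode.get(exc.errno, str(exc.errno))


def find_unreadable(root, chunk_size=1 << 20):
    """Walk `root`, lstat and fully read every regular file.

    Returns a list of (path, errno name) pairs for anything that
    raised OSError, including directories that could not be listed.
    """
    failures = []

    def on_walk_error(exc):
        # os.walk reports directories it cannot list through this hook.
        failures.append((exc.filename, errno_name(exc)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not stat.S_ISREG(os.lstat(path).st_mode):
                    continue  # skip symlinks, devices, sockets
                with open(path, "rb") as handle:
                    # Reading the whole file forces the filesystem to
                    # fetch (and, on btrfs, checksum) every block.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, errno_name(exc)))
    return failures
```

Running `find_unreadable("/var/lib/dpkg/info")` would show whether ed.md5sums is the only entry affected. It does not attempt the unlink, so it says nothing about why rm fails.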

Quoting Russell Coker (russell@coker.com.au):
One of my Xen DomUs is getting memory corruption. I'm not sure why. I've replaced all the RAM in the system and run memtest86+.
memtest86+ (or memtest86), even if left running overnight, won't necessarily always find a bad stick of RAM. However, massively parallel kernel recompiles will -- or running Cerberus Test Control System, which includes parallel kernel recompiles. I detail how, here:

http://linuxmafia.com/pipermail/conspire/2006-December/002662.html
http://linuxmafia.com/pipermail/conspire/2006-December/002668.html
http://linuxmafia.com/pipermail/conspire/2007-January/002743.html

On Thu, 5 Jul 2012, Rick Moen <rick@linuxmafia.com> wrote:
One of my Xen DomUs is getting memory corruption. I'm not sure why. I've replaced all the RAM in the system and run memtest86+.
memtest86+ (or memtest86), even if left running overnight, won't necessarily always find a bad stick of RAM.
True, but the fact that the problem appeared after installing a new kernel and hypervisor seems relevant. Also I've done things like stopping all DomUs and then just starting the problem one and got the same result. I presume that if it was a physical RAM issue then the mapping of DomU to RAM would be based on startup order and thus changing it would give the crashes to a different DomU.

As for a physical RAM problem, if that's the case then I'll have to replace the system to avoid down-time. I've got an almost identical spare system so I just need to upgrade the RAM and run Memtest86+ for a day or two before swapping the disks.

Rick Moen wrote:
Quoting Russell Coker (russell@coker.com.au):
One of my Xen DomUs is getting memory corruption. I'm not sure why. I've replaced all the RAM in the system and run memtest86+.
memtest86+ (or memtest86), even if left running overnight, won't necessarily always find a bad stick of RAM.
However, massively parallel kernel recompiles will -- or running Cerberus Test Control System, which includes parallel kernel recompiles.
IIRC busybox's build system is much more parallelizable than linux's; would busybox be a better candidate for the test?
http://linuxmafia.com/pipermail/conspire/2006-December/002662.html
http://linuxmafia.com/pipermail/conspire/2006-December/002668.html
http://linuxmafia.com/pipermail/conspire/2007-January/002743.html
I hope you didn't answer my question in these links, because I haven't read them yet ^_^;;

Quoting Trent W. Buck (trentbuck@gmail.com):
IIRC busybox's build system is much more parallelizable than linux's; would busybox be a better candidate for the test?
Honestly, I have no idea, but 'make -j 256' is so extremely reliable and revealing that the question of whether something's a better candidate has never arisen.
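As a rough illustration of the repeated parallel-build test (not taken from the linked posts): the idea is that a build which succeeds once should keep succeeding on sound hardware, so an intermittent failure across identical rounds is suspicious. The helper name `stress_rounds`, the round count, the `make clean` step, and the source path in the usage note are all assumptions made for this sketch; only the `-j 256` figure comes from the message above.

```python
import subprocess


def stress_rounds(commands, rounds=10, cwd=None):
    """Run each command in order, `rounds` times over.

    Returns None if every command exited with status 0, otherwise a
    (round number, command, return code) tuple for the first failure.
    A negative return code means the command itself was killed by a
    signal, as reported by subprocess.
    """
    for round_no in range(1, rounds + 1):
        for command in commands:
            # Output is left attached to the terminal so compiler
            # errors remain visible in the build log.
            result = subprocess.run(command, cwd=cwd)
            if result.returncode != 0:
                return (round_no, command, result.returncode)
    return None
```

A run might look like `stress_rounds([["make", "clean"], ["make", "-j", "256"]], rounds=20, cwd="/usr/src/linux")`, where the path is a placeholder for wherever a configured kernel tree lives. A failure in one round that does not recur at the same place in the next is the pattern to look for; a failure at the same file every time more likely points to a genuine build error.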
participants (3): Rick Moen, Russell Coker, Trent W. Buck