
On Thu, Apr 5, 2012 at 00:25, Russell Coker <russell@coker.com.au> wrote:
> On Wed, 4 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
>> OTOH, have you seen what use google's ganeti has made of DRBD layered
>> on top of LVM?
>
> No, but my recent experience with DRBD hasn't made me inclined to go
> back for more. :(

We run an 8-node ganeti cluster with xen. A node can serve as a drbd primary for some VM volumes and as a secondary for others, and it usually does both: the built-in VM allocation algorithm picks whichever node is most suitable, based on that node's free disk/memory/cpu at the time of VM creation.

We have issues where the monthly mdadm raid check grinds the system to a halt. Initially we thought this was due to crappy raid cards, which we ran with the disks in JBOD mode under software raid (to get the battery-backed cache). Removing the raid cards and plugging the disks directly into the motherboard did alleviate the problems somewhat, and all the monthly checkarray scripts completed.

However, now that drbd is used for the VMs' primary and secondary volumes, we are seeing very similar issues. The monthly raid check causes iowait to increase to the point where the recheck is crawling along at 0k/years to complete and the VMs are completely blocked. This may be an mdadm issue[1], a xen issue, or a drbd issue. The jury is still out on that one.

Either way, the only solution is to reboot. This causes a cascading reboot of various different nodes, because they are all running as both primaries and secondaries. I second Russell's earlier comments that drbd's auto-reboot behaviour is wrong, as this is what causes the cascade. drbd doesn't seem to log why it rebooted and, more importantly, there is no way to see the running drbd configuration. Also, ganeti manages drbd itself, so the standard config files don't get read.

Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues. Google do note that xen+drbd has some IO issues, and that the 2.6.18 xen patches worked a lot better than 2.6.3* kernels[2].

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
[2] http://ganeti.googlecode.com/files/XenAtGoogle2011.pdf

--
Marcus Furlong
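
For anyone who wants to put a number on how slowly a check is crawling, here is a minimal Python sketch that pulls the progress line for each md array out of /proc/mdstat. The function name parse_mdstat and the sample text are made up for illustration (the numbers are not from the cluster described above), and the regular expression assumes the usual progress-line layout printed by the md driver, so it may need adjusting for other kernel versions.

```python
import re

# Progress line printed by the md driver while a sync action runs, e.g.
#   [=>....]  check =  5.5% (53687091/976630336) finish=120.5min speed=127400K/sec
PROGRESS_RE = re.compile(
    r"(?P<action>check|resync|recovery|reshape)\s*=\s*(?P<percent>[\d.]+)%"
    r".*?finish=(?P<finish_min>[\d.]+)min\s+speed=(?P<speed_kb>\d+)K/sec"
)
# Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_RE = re.compile(r"^(md\S*)\s*:")

# Illustrative sample only; these numbers are invented.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]
      [=>...................]  check =  5.5% (53687091/976630336) finish=120.5min speed=127400K/sec

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array_name: progress} for arrays with a sync action in flight."""
    progress = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        found = PROGRESS_RE.search(line)
        if found and current:
            progress[current] = {
                "action": found.group("action"),
                "percent": float(found.group("percent")),
                "finish_min": float(found.group("finish_min")),
                "speed_kb": int(found.group("speed_kb")),
            }
    return progress


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info)
```

On a live node, pass the contents of /proc/mdstat instead of SAMPLE. Arrays that are idle have no progress line and are simply left out of the result.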