
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
> On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
> > We have issues where the monthly mdadm raid check grinds the system to a halt.
> do you find that these monthly cron jobs are actually useful? i've never found it to be so, and suspect that it will actually cause problems because the heavy io load might be enough to push a borderline drive into failure.
deb http://www.coker.com.au squeeze misc

In the above Debian repository (for i386 and amd64) I have a version of mdadm patched to send email when the disks have different content. I am seeing lots of errors from all systems. It seems that the RAID code in the kernel reports 128 sectors (64K) of disk space as wrong for every error (all the reported numbers are multiples of 128). I also suspect that the Squeeze kernel has a bug in this regard; I'm still tracking it down.
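For reference, the monthly check is driven by /etc/cron.d/mdadm calling checkarray, and the same scan can be run by hand through sysfs. A rough sketch, assuming the standard Debian paths and using md0 as an example device:

  # what the Debian cron job runs on the first Sunday of the month
  /usr/share/mdadm/checkarray --cron --all --idle --quiet

  # or start a check on one array directly
  echo check > /sys/block/md0/md/sync_action

  # after it finishes, the number of sectors that didn't match
  cat /sys/block/md0/md/mismatch_cnt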
> this is probably what you want in a data center with spare disks ready and waiting but not really what you want happening at home on a sunday morning. the computer shops are shut, the nearest swap meet might be the other side of town that week, and fixing a dead fs with the cheery sound of lawnmowers in the background is enough to send you postal :)
If you have a RAID stripe that doesn't match then you really want it to be fixed even if replacing a disk is not possible. Having two reads from the same address on a RAID-1 return different results is a bad thing. Having the data on a RAID-5 or RAID-6 array change in the process of recovering from a dead disk is also a bad thing.
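Something like the following makes the kernel rewrite the mismatched stripes in place (a sketch, again using md0 as an example; note that on RAID-1 md just copies the first disk's version over the others, it has no way of knowing which copy is correct):

  # rewrite inconsistent stripes: RAID-1 propagates the first copy,
  # RAID-5/6 regenerates parity from the data blocks
  echo repair > /sys/block/md0/md/sync_action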
> > [story about a cascading DRBD reboot in a cluster which is a perfect match for the term "cluster-fuck" snipped]
> > Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
> i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
I've considered that with NBD instead of iSCSI. I'm also idly considering a RAID-1 across a single local disk and a single remote disk, with BTRFS using its internal RAID-1 on top of a pair of such arrays. That way BTRFS would deal with the case of a single read error on a local disk that's mostly working, and RAID-1 would deal with an entire system dying. While BTRFS RAID-1 has got to have a performance overhead, that should be more than compensated for by having two independent local disks for different filesystems. A rough sketch of that layering is below.

The advantage of DRBD is that it's written with split-brain issues in mind. The Linux software RAID code is written with the idea that it's impossible for the two disks to be separated and used at the same time; in the normal case that isn't possible unless a disk is physically removed.
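To make that concrete, a sketch assuming nbd-client is already attached to the remote machine's exports, with hypothetical device names:

  # pair each local disk with a remote disk; --write-mostly makes md
  # prefer the local disk for reads
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/sda4 --write-mostly /dev/nbd0
  mdadm --create /dev/md1 --level=1 --raid-devices=2 \
      /dev/sdb4 --write-mostly /dev/nbd1

  # BTRFS internal RAID-1 across the two arrays, so a checksum failure
  # on one array can be corrected from the copy on the other
  mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1
  btrfs device scan
  mount /dev/md0 /mnt

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/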