
On 10/04/12 11:45, Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
On 05/04/12 17:42, Craig Sanders wrote:
Overall Ganeti is really nice, but it feels like DRBD is missing some pieces that would help in debugging issues.
I'm wondering if iSCSI kind of obsoletes DRBD, and whether mdadm RAID1 over two iSCSI exports would be better than DRBD.
Oooh, no, don't do that. We've tried it. It didn't work out.
It sounds like a good idea at first, but every time you need to reboot one or the other of the iSCSI targets (e.g. for kernel updates or suchlike) you'll need to rebuild the RAID array, and rebuild performance over Ethernet blows.
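For reference, the setup being discussed would look roughly like this (a sketch only; the portal addresses, IQNs and device names are made up):

    # import a LUN from each of the two iSCSI targets
    iscsiadm -m discovery -t sendtargets -p 192.168.1.10
    iscsiadm -m discovery -t sendtargets -p 192.168.1.11
    iscsiadm -m node -T iqn.2012-04.example:store1 -p 192.168.1.10 --login
    iscsiadm -m node -T iqn.2012-04.example:store2 -p 192.168.1.11 --login

    # mirror the two imported LUNs (appearing here as sdb and sdc)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc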
If you use an internal bitmap to indicate which parts of the RAID aren't synchronised then there shouldn't be much data to transfer.
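For what it's worth, adding a write-intent bitmap to an existing md array is a one-liner (the array must be clean):

    # record dirty regions so a resync only copies what changed
    mdadm --grow /dev/md0 --bitmap=internal
    cat /proc/mdstat    # shows the bitmap state alongside the array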
http://www.coker.com.au/bonnie++/zcav/results.html
Also, given that the maximum contiguous transfer rates I've seen are under 120MB/s, it seems unlikely that GigE will be a significant bottleneck. I'm sure there are disks faster than the 1TB disk I tested, but it should be noted that the inner tracks of that 1TB disk ran at about half GigE speed. Also, if performance matters when synchronising a RAID array, then you probably have other load, which means that synchronisation speed is well below the maximum speed of the disk.
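Rough numbers, assuming about 10% protocol overhead on GigE (the disk figures are from the zcav results above):

    GigE raw:              1000 Mbit/s / 8 = 125 MB/s
    usable after TCP/IP + iSCSI overhead  ~= 110-115 MB/s
    1TB disk, outer tracks                 < 120 MB/s
    1TB disk, inner tracks                ~=  60 MB/s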
If you're designing one of these systems for high availability, it's because you have lots of I/O all the time and can't afford to stop it during a RAID rebuild. Random I/O interspersed with the rebuild I/O totally trashes rebuild performance. Random I/O over iSCSI has also sucked on the stable Debian kernels (I believe the better-performing iSCSI drivers, a totally independent rewrite, have finally made it into wheezy though). So you end up in a situation where the real I/O performs badly, AND the rebuild takes so long that there's a sizeable window in which another disk error could occur.
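The kernel's md resync throttles are the usual knobs for trading rebuild time against foreground I/O; a sketch, with illustrative values:

    # per-device resync floor/ceiling, in KB/s
    sysctl dev.raid.speed_limit_min    # default 1000
    sysctl dev.raid.speed_limit_max    # default 200000

    # raise the floor so a rebuild still makes progress under load
    sysctl -w dev.raid.speed_limit_min=10000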
Some people claim that RAID bitmaps hurt performance; I haven't tested that yet. But a full RAID rebuild is going to seriously hurt performance for a long time, so if performance matters it's probably better to take a small loss all the time than a large loss for the hours or days a full rebuild requires. Also note that a long rebuild increases the probability of a second failure while it's running...
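If anyone wants to test that claim, the bitmap can be toggled on a clean array, and a larger bitmap chunk means fewer bitmap updates (the chunk size here is illustrative; measure your own workload):

    mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=65536  # 64MB; mdadm takes the size in KB
    # ... run the benchmark, then remove the bitmap and rerun ...
    mdadm --grow /dev/md0 --bitmap=none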
Agreed with your sentiment that it's better to have a constant small performance loss (which you can design for) than an occasional massive one. If you try RAID with a bitmap over iSCSI, I'd be interested to hear how it works out for you. In the long run, though, I think cluster filesystems are a better bet. Still waiting on GlusterFS, Ceph, etc. to reach maturity :(

Toby