Re: drdb and mdadm and more (was Re: mail storage in a distributed database)

10 Apr 2012

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
...
...
...
It sounds like a good idea at first, but every time you need to reboot
one or other of the iSCSI targets (e.g. for kernel updates or suchlike)
you'll need to rebuild the RAID array, and the performance of that over
Ethernet blows.

If you use an internal bitmap to indicate which parts of the RAID aren't
synchronised then there shouldn't be much data to transfer.
http://www.coker.com.au/bonnie++/zcav/results.html
Also, given that the maximum contiguous transfer rates I've seen are under
120MB/s, it seems unlikely that GigE is going to be a significant
bottleneck. I'm sure that there are disks that are faster than the 1TB
disk I tested, but it should be noted that the inner tracks of that 1TB
disk managed only about half GigE speed.  Also, if performance matters
while a RAID array is synchronising then you probably have other load,
which means that the synchronisation speed is well below the maximum
speed of the disk.

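To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 120MB/s disk rate is the figure mentioned above and 125MB/s is the theoretical GigE line rate before protocol overhead; the 1% dirty fraction and the function name are assumptions made up purely for illustration, not measurements.

```python
# Rough resync-time estimate; all figures are illustrative assumptions.
GIGE_MB_S = 125.0  # theoretical 1 Gbit/s = 125 MB/s, before protocol overhead


def resync_seconds(dirty_mb, disk_mb_s, link_mb_s=GIGE_MB_S):
    """Time to copy dirty_mb when limited by the slower of disk and link."""
    return dirty_mb / min(disk_mb_s, link_mb_s)


array_mb = 1_000_000  # a 1TB array, in decimal megabytes

# Full rebuild: every block is copied.
full = resync_seconds(array_mb, disk_mb_s=120.0)

# Hypothetical case: the write-intent bitmap marks 1% of the array as dirty.
partial = resync_seconds(array_mb * 0.01, disk_mb_s=120.0)

print(f"full resync:   {full / 3600:.1f} hours")
print(f"bitmap resync: {partial / 60:.1f} minutes")
```

With these assumed figures the disk, not the link, is the limit in both cases, and the bitmap turns a resync of hours into one of minutes. Competing load would stretch both numbers.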

If you're designing one of these systems so that you have high
availability of your system, then it's because you do have lots of I/O
all the time and can't afford to stop it during a RAID rebuild.

Yes, that is why bitmaps are a good thing.
...
The random I/O interspersed with the rebuild I/O has the effect of
totally trashing the rebuild performance. Random I/O over iSCSI has
sucked on the stable Debian kernels. (I believe the better-performing
iSCSI drivers, which are a totally independent rewrite, have finally
made it into wheezy though.)

How can someone write iSCSI drivers that hurt random IO?  The reports I have 
seen about command queuing in hard drives indicate that it generally doesn't 
give more than about a 10% benefit.  So an iSCSI driver that lacks command 
queuing and loses about 10% probably wouldn't count as making performance 
suck.
...
In the long run, I think cluster filesystems are a better bet though.
Still waiting on GlusterFS, Ceph, etc. to reach maturity :(

I think that BigTable-type systems are the way to go.  It seems that cluster
filesystems generally either try for full POSIX compliance or implement a
subset that doesn't match the subset you want.  When applications use
Cassandra or other distributed database technologies they can relax the
consistency requirements as they wish.

For example, when designing a mail server there is no need to have the
creation of a new message appear instantly; it just has to reliably appear.
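
A toy sketch of what that relaxed requirement looks like, in Python. This is not Cassandra's API; the MailStore and Replica classes, the quorum of 2 out of 3, and the explicit replicate() step are all hypothetical, chosen only to show a delivery that is reliably stored before it is visible everywhere.

```python
import uuid


class Replica:
    """One copy of the mailbox (hypothetical, in-memory only)."""

    def __init__(self):
        self.messages = {}


class MailStore:
    """Toy store: a delivery is acknowledged once a quorum of replicas has it."""

    def __init__(self, n_replicas=3, write_quorum=2):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.write_quorum = write_quorum
        self.pending = []  # (replica, msg_id, body) copies still to be made

    def deliver(self, body):
        """Write to a quorum now and queue the remaining copies for later."""
        msg_id = uuid.uuid4().hex
        for i, replica in enumerate(self.replicas):
            if i < self.write_quorum:
                replica.messages[msg_id] = body
            else:
                self.pending.append((replica, msg_id, body))
        return msg_id

    def list_messages(self, replica_index):
        """What a reader talking to one particular replica can see."""
        return set(self.replicas[replica_index].messages)

    def replicate(self):
        """Background step: copy queued messages to the lagging replicas."""
        for replica, msg_id, body in self.pending:
            replica.messages[msg_id] = body
        self.pending.clear()


store = MailStore()
msg_id = store.deliver("hello")
print(msg_id in store.list_messages(2))  # False: not yet visible on replica 2
store.replicate()
print(msg_id in store.list_messages(2))  # True: it reliably appears later
```

The message is safe on two replicas as soon as deliver() returns, which is the part a mail server cares about. A reader on the third replica simply sees it a little later.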

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

Russell Coker