mail storage in a distributed database

Does anyone know of a mail store that uses a distributed database like Cassandra?
I want something that has a delivery agent with a similar interface to maildrop or procmail and which has POP and IMAP servers to provide client access.
IBM had a research project implementing a mail server on Cassandra but as far as I can tell they released nothing other than a PDF.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Does anyone know of a mail store that uses a distributed database like Cassandra?
I want something that has a delivery agent with a similar interface to maildrop or procmail and which has POP and IMAP servers to provide client access.
IBM had a research project implementing a mail server on Cassandra but as far as I can tell they released nothing other than a PDF.
MS Exchange ;) Does MySQL support enough replication these days to meet your requirements? What sort of geographic distribution are you after? James

On Wed, 4 Apr 2012, James Harper <james.harper@bendigoit.com.au> wrote:
Does MySQL support enough replication these days to meet your requirements? What sort of geographic distribution are you after?
I want replication that doesn't have a special operation for a node leaving the cluster (e.g. unlike what might be necessary if a MySQL master went down) and which allows new nodes to be added at run-time.
Also for email it doesn't matter much if writes don't propagate instantly.
At this time I don't plan distribution outside one site.
I recall reading about an MTA that had a MySQL backend; the performance reports were not good at all.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Wed, 4 Apr 2012, James Harper <james.harper@bendigoit.com.au> wrote:
Does MySQL support enough replication these days to meet your requirements? What sort of geographic distribution are you after?
I want replication that doesn't have a special operation for a node leaving the cluster (e.g. unlike what might be necessary if a MySQL master went down) and which allows new nodes to be added at run-time.
Also for email it doesn't matter much if writes don't propagate instantly.
At this time I don't plan distribution outside one site.
I recall reading about an MTA that had a MySQL backend; the performance reports were not good at all.
Do you need replication for load spreading or redundancy? If the former, Cyrus IMAP is pretty good at spreading mailboxes across multiple servers while retaining a single IMAP namespace. If the latter then I recommend looking at DRBD, although of course that is limited to 2 node clusters unless you run iscsi on top of it... Unless of course you do find the mythical clustered database backend you seek! James

On Wed, 4 Apr 2012, James Harper <james.harper@bendigoit.com.au> wrote:
Do you need replication for load spreading or redundancy?
Both.
If the former, Cyrus IMAP is pretty good at spreading mailboxes across multiple servers while retaining a single IMAP namespace.
Perdition gives the same result and is a lot easier to manage.
If the latter then I recommend looking at DRBD, although of course that is limited to 2 node clusters unless you run iscsi on top of it...
DRBD has some big down-sides. It loses performance and it will reboot nodes if it thinks that there is a split-brain.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Wed, Apr 04, 2012 at 11:03:07PM +1000, Russell Coker wrote:
DRBD has some big down-sides. It loses performance and it will reboot nodes if it thinks that there is a split-brain.
OTOH, have you seen what use google's ganeti has made of DRBD layered on top of LVM?
each node in a ganeti cluster has a vg with the same name, when you provision a new VM, it builds a new lv on two of the cluster nodes and binds them together with drbd. redundancy and migration/failover without a SAN.
add more nodes to the cluster and it can even re-balance the vms to spread the load evenly. very clever.
http://code.google.com/p/ganeti/
craig
--
craig sanders <cas@taz.net.au>
BOFH excuse #27: radiosity depletion

On Wed, Apr 04, 2012 at 07:38:15PM +1000, Russell Coker wrote:
Does anyone know of a mail store that uses a distributed database like Cassandra?
http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
BlueRunner: Building an Email Service in the Cloud, by Jun Rao, IBM Almaden Research Center, Apache Cassandra Committer
found with:
http://www.google.com.au/search?q=apache+cassandra+%2B%22mail+store%22
it occurs to me that openstack's Swift[1] object store might be good for this. store the message with an object id of the message-id. you probably don't even have to care about the fact that message-id is only unique(*) per message, not per recipient (in fact, that's probably an advantage).
i've always been against the idea of storing mail in a database, but an object store isn't a database....it's more like an enormous flat filesystem (with buckets) or a giant key/value-pair store...a much better fit for this task than a relational database.
[1] http://swift.openstack.org/
hmmm. you'd still need some sort of database so that you could get from a recipient address, subject, and/or other fields to the message-id (and hence to the msg body in the object store). apart from offloading the fulltext storage to something outside of the db, there might not be enough value in doing this. not sure.
might be good in combination with cassandra.
I want something that has a delivery agent with a similar interface to maildrop or procmail and which has POP and IMAP servers to provide client access.
you'd have to write a swift access module for the pop/imap daemon of your choice (dovecot is quite modular and would probably be a good choice), and inserting incoming messages into the store would be a simple wrapper around either the command-line tools or the http api. there are also python libs.
(*) for pretty-damn-good values of "unique".
note: you need at least three nodes (preferably more than 5) to run swift. you also need a second NIC for the nodes to talk to each other - they chatter a LOT. you can imagine it as something like:
node1 -> node2: do you have version x of foo?
node1 -> node3: do you have version x of foo?
node2 -> node1: yes.
node2 -> node1: do you have version y of bar?
node3 -> node1: i have a later version, here it is.
node3 -> node2: do you have version x of foo?
node1 -> node2: no, gimme.
node2 -> node3: yes.
node2 -> node1: node 5 has gone down, you're secondary so grab a copy of this.
node2 -> node1: here it is.
blah blah blah. the chatter is constant. however the data is highly redundant, highly available and the data store is self-repairing. it's also massively scalable - add more nodes as storage and load requires.
craig
--
craig sanders <cas@taz.net.au>
BOFH excuse #89: Electromagnetic energy loss
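A minimal sketch of the insertion wrapper described above, using the python-swiftclient library; the auth endpoint, credentials and container name are placeholders, and a real deployment would still write per-recipient index data to a separate database as noted earlier in the thread.

```python
#!/usr/bin/env python
# Sketch only: store an RFC 2822 message in Swift, keyed by its Message-ID.
# The auth URL, credentials and container name below are placeholders.
import email
import hashlib
import sys

from swiftclient.client import Connection  # python-swiftclient

AUTH_URL = "http://swift.example.com:8080/auth/v1.0"  # placeholder endpoint
USER = "mail:delivery"                                # placeholder account
KEY = "secret"                                        # placeholder key
CONTAINER = "mailstore"                               # placeholder container


def deliver(raw_message: bytes) -> str:
    """Store one message and return the object name it was stored under."""
    msg = email.message_from_bytes(raw_message)
    # Fall back to a content hash if the MTA supplied no Message-ID.
    object_name = msg.get("Message-ID") or hashlib.sha1(raw_message).hexdigest()

    conn = Connection(authurl=AUTH_URL, user=USER, key=KEY)
    conn.put_container(CONTAINER)  # harmless if the container already exists
    conn.put_object(CONTAINER, object_name,
                    contents=raw_message,
                    content_type="message/rfc822")
    return object_name


if __name__ == "__main__":
    # maildrop/procmail-style interface: the message arrives on stdin.
    print(deliver(sys.stdin.buffer.read()))
```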

On Wed, 4 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Wed, Apr 04, 2012 at 07:38:15PM +1000, Russell Coker wrote:
Does anyone know of a mail store that uses a distributed database like Cassandra?
http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
BlueRunner: Building an Email Service in the Cloud, by Jun Rao, IBM Almaden Research Center, Apache Cassandra Committer
found with:
http://www.google.com.au/search?q=apache+cassandra+%2B%22mail+store%22
Yes, that's the one that has nothing released apart from a PDF.
it occurs to me that openstack's Swift[1] object store might be good for this. store the message with an object id of the message-id. you probably don't even have to care about the fact that message-id is only unique(*) per message, not per recipient (in fact, that's probably an advantage).
Does IMAP allow altering a message? If so you would need copy on write in that case. Also the "Delivered-To" header would need to be fudged somehow. Also rumor has it that some MTAs duplicate the message-ID. I haven't tried to verify that claim.
hmmm. you'd still need some sort of database so that you could get from a recipient address, subject, and/or other fields to the message-id (and hence to the msg body in the object store). apart from offloading the fulltext storage to something outside of the db, there might not be enough value in doing this. not sure.
might be good in combination with cassandra.
It apparently worked well for IBM, but they didn't share the code.
I want something that has a delivery agent with a similar interface to maildrop or procmail and which has POP and IMAP servers to provide client access.
you'd have to write a swift access module for the pop/imap daemon of your choice (dovecot is quite modular and would probably be a good choice), and inserting incoming messages into the store would be a simple wrapper around either the command-line tools or the http api. there are also python libs.
If I was going to write it myself then I would look at Dovecot and Cassandra. But I really don't have the time for it, so if I end up doing some coding on such things it'll be helping out with someone else's project.
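As a rough illustration of what the Cassandra side of such a delivery agent could look like, here is a sketch; the keyspace, table and schema are invented for this example, and come from neither the IBM project nor any released mail store. The DataStax cassandra-driver is used as the client purely for illustration.

```python
#!/usr/bin/env python
# Illustrative sketch only: the Cassandra side of a maildrop/procmail-style
# delivery agent.  The keyspace, table and schema are invented here.
import email
import sys
import uuid

from cassandra.cluster import Cluster  # DataStax cassandra-driver

KEYSPACE_DDL = ("CREATE KEYSPACE IF NOT EXISTS mail WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 3}")
TABLE_DDL = """
CREATE TABLE IF NOT EXISTS mail.messages (
    mailbox text,
    msg_id  timeuuid,
    subject text,
    raw     blob,
    PRIMARY KEY (mailbox, msg_id)
) WITH CLUSTERING ORDER BY (msg_id DESC)
"""


def deliver(mailbox: str, raw: bytes) -> None:
    """Append one message to the given mailbox."""
    msg = email.message_from_bytes(raw)
    cluster = Cluster(["127.0.0.1"])  # placeholder contact point
    session = cluster.connect()
    session.execute(KEYSPACE_DDL)
    session.execute(TABLE_DDL)
    session.execute(
        "INSERT INTO mail.messages (mailbox, msg_id, subject, raw) "
        "VALUES (%s, %s, %s, %s)",
        (mailbox, uuid.uuid1(), msg.get("Subject", ""), raw))
    cluster.shutdown()


if __name__ == "__main__":
    # Like procmail/maildrop: recipient as an argument, message on stdin.
    deliver(sys.argv[1], sys.stdin.buffer.read())
```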
note: you need at least three nodes (preferably more than 5) to run swift. you also need a second NIC for the nodes to talk to each other - they chatter a LOT. you can imagine it as something like:
3 nodes is the practical minimum for any sort of distributed system no matter how you do it. With less than 3 you can't have quorum if one node goes away.
A second Ethernet card on each server with a GigE switch doesn't add much to the cost.
On Wed, 4 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
OTOH, have you seen what use google's ganeti has made of DRBD layered on top of LVM?
No, but my recent experience with DRBD hasn't made me inclined to go back for more. :(
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

note: you need at least three nodes (preferably more than 5) to run swift. you also need a second NIC for the nodes to talk to each other - they chatter a LOT. you can imagine it as something like:
3 nodes is the practical minimum for any sort of distributed system no matter how you do it. With less than 3 you can't have quorum if one node goes away.
You can have a dedicated quorum/witness server (or device) that doesn't provide any other cluster resources. There are often better ways of achieving similar results than a 2 node cluster though. James
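The quorum arithmetic behind this exchange, as a quick illustration (a dedicated witness counts as a voting member here even though it serves no storage):

```python
def quorum(nodes: int) -> int:
    """Votes needed for a strict majority of a cluster of `nodes` members."""
    return nodes // 2 + 1

for n in (2, 3, 5):
    survivors = n - 1  # one member goes away (node failure or partition)
    print(f"{n} nodes: quorum={quorum(n)}, "
          f"survives a single failure: {survivors >= quorum(n)}")
# 2 nodes: quorum=2, survives a single failure: False
# 3 nodes: quorum=2, survives a single failure: True
# 5 nodes: quorum=3, survives a single failure: True
```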

On Thu, 5 Apr 2012, James Harper <james.harper@bendigoit.com.au> wrote:
3 nodes is the practical minimum for any sort of distributed system no matter how you do it. With less than 3 you can't have quorum if one node goes away.
You can have a dedicated quorum/witness server (or device) that doesn't provide any other cluster resources. There are often better ways of achieving similar results than a 2 node cluster though.
Calling the "dedicated quorum server" something less than a "node" doesn't mean much unless you pay MS style license fees for each node.
If you are doing something like renting servers from a company like Hetzner.de then each server has the same amount of storage so it wouldn't be practical to have a server manage the quorum but not have storage.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Thu, Apr 5, 2012 at 00:25, Russell Coker <russell@coker.com.au> wrote:
On Wed, 4 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
OTOH, have you seen what use google's ganeti has made of DRBD layered on top of LVM?
No, but my recent experience with DRBD hasn't made me inclined to go back for more. :(
We run an 8 node ganeti cluster with xen. A node can be used as both a primary and a secondary drbd for a given VM volume. It usually is, using the builtin VM allocation algorithm, which checks which node would be most suitable to use, based on that node's free disk/memory/cpu at the time of VM creation.
We have issues where the monthly mdadm raid check grinds the system to a halt.
Initially we thought this was due to crappy raid cards, with the disks in JBOD mode, and using software raid (to get the battery backed cache). Removing the raid cards and plugging disks directly into the motherboard did alleviate the problems somewhat, and all the monthly checkarray scripts completed.
However, now when drbd is used for the VM's primary and secondary volumes, we are seeing very similar issues. The monthly raid check now causes iowait to increase to the point where the recheck is crawling along at 0K/s with years to complete and the VMs are completely blocked. This may be an mdadm issue[1] or xen issue, or a drbd issue. The jury is still out on that one.
Either way, the only solution is to reboot. This causes a cascading reboot of various different nodes because they are all running as both primaries and secondaries. I second Russell's earlier comments that drbd's auto-reboot behaviour is wrong, as this is what causes this cascading reboot.
drbd doesn't seem to log why it rebooted and, more importantly, there is no way to see the running drbd configuration. Also, ganeti manages drbd itself, so the standard config files don't get read.
Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
Google do note that xen+drbd has some IO issues, and that the 2.6.18 xen patches worked a lot better than 2.6.3* kernels[2].
[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
[2] http://ganeti.googlecode.com/files/XenAtGoogle2011.pdf
--
Marcus Furlong

On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
We have issues where the monthly mdadm raid check grinds the system to a halt.
do you find that these monthly cron jobs are actually useful? i've never found it to be so, and suspect that it will actually cause problems because the heavy io load might be enough to push a borderline drive into failure.
this is probably what you want in a data center with spare disks ready and waiting but not really what you want happening at home on a sunday morning. the computer shops are shut, the nearest swap meet might be the other side of town that week, and fixing a dead fs with the cheery sound of lawnmowers in the background is enough to send you postal :)
Initially we thought this was due to crappy raid cards, with the disks in JBOD mode, and using software raid (to get the battery backed cache). Removing the raid cards and plugging disks directly into the motherboard did alleviate the problems somewhat, and all the monthly checkarray scripts completed.
raid cards often use raid-mode timeouts even in jbod mode, causing TLER problems with slow drives, as the timeouts tend to have enterprise grade 15K RPM drives in mind. some raid cards have alternate firmware (generally referred to as "Initiator Target" or "IT" mode) which alleviates that problem. probably why the problem partly cleared up when you switched to using motherboard drive ports. recommended practice when using mdadm (or zfs or other software-raid like thing) is to use plain sata ports or IT mode firmware. BTW, this is why supermicro motherboards with LSI SAS controllers built-in have a raid-mode/IT mode switch right next to the drive sockets.
Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
part of my curiosity is due to the fact that i prefer zfs to lvm, and iscsi ... when i get time i intend to experiment with ganeti and see if i can come up with a zfs+iscsi+mdadm storage module for it as an alternative to lvm+drbd.
this would also allow skipping the io-hogging mdadm check (replaced with a weekly or monthly zpool scrub).
craig
--
craig sanders <cas@taz.net.au>
BOFH excuse #291: Due to the CDA, we no longer have a root account.

On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
We have issues where the monthly mdadm raid check grinds the system to a halt.
do you find that these monthly cron jobs are actually useful? i've never found it to be so, and suspect that it will actually cause problems because the heavy io load might be enough to push a borderline drive into failure.
deb http://www.coker.com.au squeeze misc
In the above Debian repository for i386 and amd64 I have a version of mdadm patched to send email when the disks have different content. I am seeing lots of errors from all systems, it seems that the RAID code in the kernel is reporting that 128 sectors (64K) of disk space is wrong for every error (all reported numbers are multiples of 128).
Also I suspect that the Squeeze kernel has a bug in regard to this. I'm still tracking it down.
this is probably what you want in a data center with spare disks ready and waiting but not really what you want happening at home on a sunday morning. the computer shops are shut, the nearest swap meet might be the other side of town that week, and fixing a dead fs with the cheery sound of lawnmowers in the background is enough to send you postal :)
If you have a RAID stripe that doesn't match then you really want it to be fixed even if replacing a disk is not possible. Having two reads from the same address on a RAID-1 give different results is a bad thing. Having the data on a RAID-5 or RAID-6 array change in the process of recovering from a dead disk is also a bad thing.
[story about cascading DRBD reboot in a cluster which is a perfect map for the term "cluster-fuck" snipped]
Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
I've considered that with NBD instead of ISCSI.
Also I'm idly considering a RAID-1 across a single local disk and a single remote disk with BTRFS using internal RAID-1 on top of that. That way BTRFS would deal with the case of a single read error on a local disk that's mostly working and RAID-1 would deal with an entire system dying. While BTRFS RAID-1 has got to have a performance overhead, that should be more than compensated by having two independent local disks for different filesystems.
Now the advantage of DRBD is that it's written with split-brain issues in mind. The Linux software RAID code is written with the idea that it's impossible for the two disks to be separated and used at the same time. In the normal case this is not possible unless a disk is physically removed.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Russell Coker wrote:
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
I've considered that with NBD instead of ISCSI.
Why NBD over AOE? Last time I looked at HA storage I looked at drbd vs. AOE+mdadm, and I concluded the latter was not acceptable because even a momentary network outage between the AOE nodes would cause mdadm to degrade the array. (IIRC I didn't know about -binternal at the time, tho.) I didn't end up deploying either, because the customer's use case for HA was silly and I managed to convince them they didn't need it.

On Thu, Apr 05, 2012 at 06:31:49PM +1000, Russell Coker wrote:
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
We have issues where the monthly mdadm raid check grinds the system to a halt.
do you find that these monthly cron jobs are actually useful? [...]
deb http://www.coker.com.au squeeze misc
In the above Debian repository for i386 and amd64 I have a version of mdadm patched to send email when the disks have different content. I am seeing lots of errors from all systems, it seems that the RAID code in the kernel is reporting that 128 sectors (64K) of disk space is wrong for every error (all reported numbers are multiples of 128).
if mdadm software raid is doing that, then to me it says "don't use mdadm raid" rather than "stress-test raid every month and hope for the best".
however, i've been using mdadm for years without seeing any sign of that (and yes, with the monthly mdadm raid checks enabled. i used to grumble about it slowing my system down but never made the decision to disable it).
first question that occurs to me is: is there a bug in the raid code itself or is the bug in the raid checking code?
Also I suspect that the Squeeze kernel has a bug in regard to this. I'm still tracking it down.
i never really used squeeze for long on real hardware (as opposed to on VMs)...except in passing when sid was temporarily rather similar to what squeeze became. and i've always used later kernels - either custom-compiled or (more recently) by installing the later linux-image packages.
If you have a RAID stripe that doesn't match then you really want it to be fixed even if replacing a disk is not possible. Having two reads from the same address on a RAID-1 give different results is a bad thing. Having the data on a RAID-5 or RAID-6 array change in the process of recovering from a dead disk is also a bad thing.
true, but as above that's a "don't do that, then" situation. if you are getting symptoms like the above then either your hardware is bad or your kernel version is broken. in either case, don't do that. backup your data immediately and do something else that isn't going to lose your data.
Now the advantage of DRBD is that it's written with split-brain issues in mind. The Linux software RAID code is written with the idea that it's impossible for the two disks to be separated and used at the same time. In the normal case this is not possible unless a disk is physically removed.
yep, and the re-sync is a pain, even with bitmaps.
this interesting article from 2006 that i just spotted may indicate an alternative: ZFS on iscsi
http://www.cuddletech.com/blog/pivot/entry.php?id=566
in short: it's possible to build a zpool using iscsi devices. whether it's reliable if one of the iscsi devices disappears, i don't know. zfs already copes well with degraded vdevs...with a mirrored vdev, it shouldn't be a problem (and fairly easily repaired with zfs online if it reappears or zfs replace if it's gone for good). with raidz-n, it would depend on how many disappeared and which ones.
and this far more recent post (Aug 2011):
http://cloudcomputingresourcecenter.com/roll-your-own-fail-over-san-cluster-...
in short: zfs and glusterfs, written by someone who'd given up on drbd.
craig
ps: one of the reasons i love virtualisation is that it makes it so easy to experiment with this stuff and get an idea of whether it's worthwhile trying on real hardware. spinning up a few new vms is much less hassle than scrounging parts to build another test system.
--
craig sanders <cas@taz.net.au>
BOFH excuse #336: the xy axis in the trackball is coordinated with the summer solstice
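A sketch of the zpool operations alluded to above, wrapped in Python to match the other examples; the pool name and the iSCSI-backed device paths are placeholders, and whether ZFS copes this gracefully when one of them vanishes is exactly the open question raised here.

```python
#!/usr/bin/env python
# Sketch of the zpool operations mentioned above.  The pool name and the
# iSCSI-backed device paths are placeholders only.
import subprocess

POOL = "mailpool"                      # hypothetical pool name
DEV_A = "/dev/disk/by-id/iscsi-lun-a"  # placeholder device paths
DEV_B = "/dev/disk/by-id/iscsi-lun-b"


def zpool(*args: str) -> None:
    subprocess.run(["zpool", *args], check=True)


# Build a mirrored pool from the two network block devices.
zpool("create", POOL, "mirror", DEV_A, DEV_B)

# Periodic integrity check; this is the zpool scrub that would replace the
# monthly mdadm checkarray run.
zpool("scrub", POOL)

# If DEV_B dropped out and came back, bring it online and let ZFS resilver;
# if it is gone for good, swap in a replacement device instead.
zpool("online", POOL, DEV_B)
# zpool("replace", POOL, DEV_B, "/dev/disk/by-id/iscsi-lun-c")
```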

ps: one of the reasons i love virtualisation is that it makes it so easy to experiment with this stuff and get an idea of whether it's worthwhile trying on real hardware. spinning up a few new vms is much less hassle than scrounging parts to build another test system.
+1 Being able to simulate all sorts of funky failure scenarios is really cool too. James

On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 06:31:49PM +1000, Russell Coker wrote:
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
On Thu, Apr 05, 2012 at 01:44:00PM +1000, Marcus Furlong wrote:
We have issues where the monthly mdadm raid check grinds the system to a halt.
do you find that these monthly cron jobs are actually useful? [...]
deb http://www.coker.com.au squeeze misc
In the above Debian repository for i386 and amd64 I have a version of mdadm patched to send email when the disks have different content. I am seeing lots of errors from all systems, it seems that the RAID code in the kernel is reporting that 128 sectors (64K) of disk space is wrong for every error (all reported numbers are multiples of 128).
if mdadm software raid is doing that, then to me it says "don't use mdadm raid" rather than "stress-test raid every month and hope for the best".
however, i've been using mdadm for years without seeing any sign of that (and yes, with the monthly mdadm raid checks enabled. i used to grumble about it slowing my system down but never made the decision to disable it).
You won't see such an obvious sign because you aren't running the version of mdadm that I patched to send email about it. Maybe logwatch/logcheck would inform you.
first question that occurs to me is: is there a bug in the raid code itself or is the bug in the raid checking code?
Other reports that I've seen from a reliable source say that you can get lots of errors while still having the files match the correct md5sums. This suggests that the problem is in the RAID checking code as the actual data returned is still correct.
Also I suspect that the Squeeze kernel has a bug in regard to this. I'm still tracking it down.
i never really used squeeze for long on real hardware (as opposed to on VMs)...except in passing when sid was temporarily rather similar to what squeeze became. and i've always used later kernels - either custom-compiled or (more recently) by installing the later linux-image packages.
In my tests so far I haven't been able to reproduce such problems with Debian's 3.2.0 kernel.
If you have a RAID stripe that doesn't match then you really want it to be fixed even if replacing a disk is not possible. Having two reads from the same address on a RAID-1 give different results is a bad thing. Having the data on a RAID-5 or RAID-6 array change in the process of recovering from a dead disk is also a bad thing.
true, but as above that's a "don't do that, then" situation. if you are getting symptoms like the above then either your hardware is bad or your kernel version is broken. in either case, don't do that. backup your data immediately and do something else that isn't going to lose your data.
Unless of course you have those things reported regularly without data loss.
ps: one of the reasons i love virtualisation is that it makes it so easy to experiment with this stuff and get an idea of whether it's worthwhile trying on real hardware. spinning up a few new vms is much less hassle than scrounging parts to build another test system.
Yes. It is unfortunate that the DRBD server reboot problem never appeared on any of my VM tests (not even when I knew what to look for) and only appeared in production.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Thu, Apr 05, 2012 at 06:31:49PM +1000, Russell Coker wrote:
deb http://www.coker.com.au squeeze misc
In the above Debian repository for i386 and amd64 I have a version of mdadm patched to send email when the disks have different content. I am seeing lots of errors from all systems, it seems that the RAID code in the kernel is reporting that 128 sectors (64K) of disk space is wrong for every error (all reported numbers are multiples of 128). Also I suspect that the Squeeze kernel has a bug in regard to this. I'm still tracking it down.
if you are looking at raid1 or raid10, then see 'man md' as mismatches are expected. eg.
http://fedorapeople.org/gitweb?p=dledford/public_git/mdadm.git;a=commitdiff;...
64k is prob the dma size the fs or app is using. we extensively investigated mismatches with md raid1 and ext3 (or 2, I forget) a few years ago and the different blocks were indeed always in the unused part of the device - never in the fs. I might even have a reproducer for this somewhere - wasn't too hard to setup as I recall.
if not raid1/10 then it's probably real. which kernel is 'squeeze' based on? have you ever seen data damage in the fs due to mismatches?
cheers,
robin

On 05/04/12 17:42, Craig Sanders wrote:
Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
Oooh, no, don't do that. We've tried it. It didn't work out.
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
part of my curiosity is due to the fact that i prefer zfs to lvm, and iscsi ... when i get time i intend to experiment with ganeti and see if i can come up with a zfs+iscsi+mdadm storage module for it as an alternative to lvm+drbd.
this would also allow skipping the io-hogging mdadm check (replaced with a weekly or monthly zpool scrub).
I'd be interested to hear your experiences of that, if you get it up and running. -Toby

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
On 05/04/12 17:42, Craig Sanders wrote:
Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
Oooh, no, don't do that. We've tried it. It didn't work out.
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
If you use an internal bitmap to indicate which parts of the RAID aren't synchronised then there shouldn't be much data to transfer.
http://www.coker.com.au/bonnie++/zcav/results.html
Also given that the maximum contiguous transfer rates I've seen are under 120MB/s it seems unlikely that GigE is going to be a significant bottleneck. I'm sure that there are disks that are faster than the 1TB disk I tested, but it should be noted that the inner tracks of that 1TB disk were about half GigE speed. Also when synchronising a RAID array if performance matters then you probably have other load which means that synchronisation speed is well below the maximum speed of the disk.
Some people claim that RAID bitmaps hurt performance, I haven't yet tested that. But a full RAID rebuild is going to seriously hurt performance for a long time, so if performance matters it's probably best to have a small loss all the time than a large loss for the hours or days required for a full rebuild. Also note that a long rebuild increases the probability of a second failure while it's rebuilding...
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
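A sketch of the write-intent bitmap setup being discussed, wrapping the mdadm command line from Python; the device names are placeholders, with /dev/nbd0 standing in for the remote (NBD or iSCSI) half of the mirror.

```python
#!/usr/bin/env python
# Sketch of the write-intent bitmap setup discussed above, wrapping mdadm.
# Device names are placeholders; /dev/nbd0 stands in for the remote half.
import subprocess


def mdadm(*args: str) -> None:
    subprocess.run(["mdadm", *args], check=True)


# Create a two-way mirror with an internal write-intent bitmap, so a member
# that drops out briefly only resyncs the regions dirtied while it was gone,
# not the whole device.
mdadm("--create", "/dev/md0",
      "--level=1", "--raid-devices=2",
      "--bitmap=internal",
      "/dev/sda2", "/dev/nbd0")

# An existing array that was created without a bitmap can gain one later
# (or drop it again) without a rebuild:
# mdadm("--grow", "/dev/md0", "--bitmap=internal")
# mdadm("--grow", "/dev/md0", "--bitmap=none")
```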

On 10/04/12 11:45, Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale<toby.corkindale@strategicdata.com.au> wrote:
On 05/04/12 17:42, Craig Sanders wrote:
Overall ganeti is really nice, but it feels like drbd has some missing pieces that would help in debugging issues.
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
Oooh, no, don't do that. We've tried it. It didn't work out.
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
If you use an internal bitmap to indicate which parts of the RAID aren't synchronised then there shouldn't be much data to transfer.
http://www.coker.com.au/bonnie++/zcav/results.html
Also given that the maximum contiguous transfer rates I've seen are under 120MB/s it seems unlikely that GigE is going to be a significant bottleneck. I'm sure that there are disks that are faster than the 1TB disk I tested, but it should be noted that the inner tracks of that 1TB disk were about half GigE speed. Also when synchronising a RAID array if performance matters then you probably have other load which means that synchronisation speed is well below the maximum speed of the disk.
If you're designing one of these systems so that you have high availability of your system, then it's because you do have lots of I/O all the time and can't afford to stop it during a RAID rebuild.
The random i/o interspersed with the rebuild i/o has the effect of totally trashing the rebuild performance. Random i/o over iscsi has sucked on the stable Debian kernels. (I believe the better-performing iscsi drivers (which are a totally independent rewrite) have finally made it into wheezy though.)
And so, you end up in the situation where the real I/O performs badly, AND the rebuild takes so long that you're concerned there's a sizeable window where another disk error could occur.
Some people claim that RAID bitmaps hurt performance, I haven't yet tested that. But a full RAID rebuild is going to seriously hurt performance for a long time, so if performance matters it's probably best to have a small loss all the time than a large loss for the hours or days required for a full rebuild. Also note that a long rebuild increases the probability of a second failure while it's rebuilding...
Agreed with your sentiment that it's better to have a small performance loss constantly (that you can design for) rather than an occasional massive perf loss.
If you try it with the RAID bitmap over iscsi, I'd be interested to hear how it works out for you.
In the long run, I think cluster filesystems are a better bet though. Still waiting on GlusterFS, Ceph, etc to reach maturity :(
Toby

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
If you use an internal bitmap to indicate which parts of the RAID aren't synchronised then there shouldn't be much data to transfer.
http://www.coker.com.au/bonnie++/zcav/results.html
Also given that the maximum contiguous transfer rates I've seen are under 120MB/s it seems unlikely that GigE is going to be a significant bottleneck. I'm sure that there are disks that are faster than the 1TB disk I tested, but it should be noted that the inner tracks of that 1TB disk were about half GigE speed. Also when synchronising a RAID array if performance matters then you probably have other load which means that synchronisation speed is well below the maximum speed of the disk.
If you're designing one of these systems so that you have high availability of your system, then it's because you do have lots of I/O all the time and can't afford to stop it during a RAID rebuild.
Yes, that is why bitmaps are a good thing.
The random i/o interspersed with the rebuild i/o has the effect of totally trashing the rebuild performance. Random i/o over iscsi has sucked on the stable Debian kernels. (I believe the better-performing iscsi drivers (which are a totally independent rewrite) have finally made it into wheezy though.)
How can someone write iSCSI drivers that hurt random IO? The reports I have seen about command queuing in hard drives indicate that it generally doesn't give more than about a 10% benefit. So an iSCSI driver that lacks command queuing and loses about 10% probably wouldn't count as making performance suck.
In the long run, I think cluster filesystems are a better bet though. Still waiting on GlusterFS, Ceph, etc to reach maturity :(
I think that BigTable type systems are the way to go. It seems that cluster filesystems generally either try for full POSIX compliance or implement a subset that doesn't match the subset you want.
When applications use Cassandra or other distributed database technologies they can relax the consistency requirements as they wish. For example when designing a mail server there is no need to have the creation of a new message appear instantly, it just has to reliably appear.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
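A sketch of the relaxed-consistency point, again using the DataStax cassandra-driver against the invented mail.messages table from the earlier delivery-agent sketch; this is illustrative only, not the design of any existing mail server.

```python
# Deliveries do not have to be visible instantly, only stored durably, so a
# single replica acknowledgement is enough; reads can ask for more.
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect("mail")

# Delivery: acknowledge after one replica has the write.
deliver = SimpleStatement(
    "INSERT INTO messages (mailbox, msg_id, subject, raw) "
    "VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(deliver, ("user@example.com", uuid.uuid1(), "hi", b"raw message"))

# A mailbox listing for an IMAP/POP client can ask for a stronger guarantee.
listing = SimpleStatement(
    "SELECT msg_id, subject FROM messages WHERE mailbox = %s",
    consistency_level=ConsistencyLevel.QUORUM)
for row in session.execute(listing, ("user@example.com",)):
    print(row.msg_id, row.subject)

cluster.shutdown()
```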

Russell Coker wrote:
Some people claim that RAID bitmaps hurt performance
So does journalling, but they're both still good ideas -- at least in typical use cases. I had problems a while back, but they turned out to be 100% caused by collectd's pathological write workload (lots of RRDs). Cranking the buffering of those up to 1hr fixed that, and the -binternal overhead turned out to be negligible.

Toby Corkindale wrote:
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
Didja have write-intent bitmaps on? Not sure[0] if they help in that case...
[0] my caffeine stream has a little too much blood in it atm.

On 10/04/12 12:20, Trent W. Buck wrote:
Toby Corkindale wrote:
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
Didja have write-intent bitmaps on? Not sure[0] if they help in that case...
No, pretty sure they didn't. I didn't build the systems in question, so not sure why not -- maybe it just wasn't considered stable back when Debian Etch was the stable platform? *shrug*
The rebuild performance (while there was also active i/o going on) was still cripplingly slow though.
Also worth noting that the md layer and the iscsi layer didn't interact all that well -- if the iscsi target dropped out, I seem to remember that the md layer didn't respond quickly. It wasn't like with disks, where it'd kick out a non-responding disk soon and keep going -- instead it'd hang for aeons. This might have improved in more recent kernels.
Toby

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
No, pretty sure they didn't. I didn't build the systems in question, so not sure why not -- maybe it just wasn't considered stable back when Debian Etch was the stable platform? shrug
http://en.wikipedia.org/wiki/Debian#Releases
Lenny was released in early 2009.
http://etbe.coker.com.au/2008/01/28/write-intent-bitmaps/
In early 2008 I converted all my software RAID systems to use bitmaps. The vast majority of such systems would have been running Etch. So it seems that bitmaps were stable in Etch. As far as I recall they weren't even new then - I just hadn't noticed them earlier.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Tue, 10 Apr 2012, Toby Corkindale wrote:
Also worth noting that the md layer and the iscsi layer didn't interact all that well -- if the iscsi target dropped out, I seem to remember that the md layer didn't respond quickly. It wasn't like with disks, where it'd kick out a non-responding disk soon and keep going -- instead it'd hang for aeons. This might have improved in more recent kernels..
Don't think it's improved. It's a pain that on raid1 or other levels where there is some redundancy, if a read from one disk is taking some time because of contention/spinup, etc, linux raid doesn't automatically immediately resend the same command to one of the other available disks if their queues are empty.
When I was raiding 2 external drives that I *wanted* to spin down because they were usually only accessed twice a day, I still had to wait for it to consecutively spin up both drives before I could start browsing my backups (it'd send a read command to one disk, wait for a response, then stripe a read command to the other disk. Almost as if there was a queue depth of only 1, and there would be absolutely no point in striping the reads across both disks since it's got to wait for the response from each one anyway! It just ensures the second disk won't have the heads in the right place in a consecutive read!). It should have had to only wait for 1 drive to spin up. Or send read commands to both disks simultaneously. Or stripe it properly since readahead should realise there's more data to come from the second disk. Who knows...
--
Tim Connors

On 10/04/12 15:58, Tim Connors wrote:
On Tue, 10 Apr 2012, Toby Corkindale wrote:
Also worth noting that the md layer and the iscsi layer didn't interact all that well -- if the iscsi target dropped out, I seem to remember that the md layer didn't respond quickly. It wasn't like with disks, where it'd kick out a non-responding disk soon and keep going -- instead it'd hang for aeons. This might have improved in more recent kernels..
Don't think it's improved. It's a pain that on raid1 or other levels where there is some redundancy, if a read from one disk is taking some time because of contention/spinup, etc, linux raid doesn't automatically immediately resend the same command to one of the other available disks if their queues are empty.
When I was raiding 2 external drives that I *wanted* to spin down because they were usually only accessed twice a day, I still had to wait for it to consecutively spin up both drives before I could start browsing my backups (it'd send a read command to one disk, wait for a response, then stripe a read command to the other disk. Almost as if there was a queue depth of only 1, and there would be absolutely no point in striping the reads across both disks since it's got to wait for the response from each one anyway! It just ensures the second disk won't have the heads in the right place in a consecutive read!). It should have had to only wait for 1 drive to spin up. Or send read commands to both disks simultaneously. Or stripe it properly since readahead should realise there's more data to come from the second disk. Who knows...
I think the RAID10 code might be better -- and you can actually configure it to work with just two disks, so it's just like RAID1. It can do smarter things like scatter data at opposite ends of the disks, so your average seek time can be reduced. (Enable the "far" option.)
As to your case -- there is the --write-mostly option for RAID1, which says that one of your disks should only be used for writes, not reads (unless the primary disk fails).
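For reference, the two mdadm features mentioned here might be set up along these lines; device names are placeholders only.

```python
#!/usr/bin/env python
# Sketch of the two mdadm features mentioned above.  Device names are
# placeholders only.
import subprocess


def mdadm(*args: str) -> None:
    subprocess.run(["mdadm", *args], check=True)


# RAID10 with the "far" layout on just two disks: behaves like RAID1 but
# keeps the second copy at the far end of each disk, which can reduce
# average seek times for reads.
mdadm("--create", "/dev/md1",
      "--level=10", "--layout=f2", "--raid-devices=2",
      "/dev/sda3", "/dev/sdb3")

# RAID1 with one member marked write-mostly: reads are served from the
# other member where possible, useful when one half is a slow external or
# network-backed device.
mdadm("--create", "/dev/md2",
      "--level=1", "--raid-devices=2",
      "/dev/sda4", "--write-mostly", "/dev/sdb4")
```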

On Tue, Apr 10, 2012 at 11:31:20AM +1000, Toby Corkindale wrote:
On 05/04/12 17:42, Craig Sanders wrote:
i'm wondering if iscsi kind of obsoletes drbd, and if mdadm raid1 over two iscsi exports would be better than drbd.
Oooh, no, don't do that. We've tried it. It didn't work out.
It sounds like a good idea at first, but every time you need to reboot one or other of the iscsi targets (eg. for kernel updates or suchlike) you'll need to rebuild the RAID array, and the performance of that over ethernet blows.
does using a bitmap help? with local disks, it makes a massive improvement to resync times.
part of my curiosity is due to the fact that i prefer zfs to lvm, and iscsi ... when i get time i intend to experiment with ganeti and see if i can come up with a zfs+iscsi+mdadm storage module for it as an alternative to lvm+drbd.
I'd be interested to hear your experiences of that, if you get it up and running.
well, given that i'll be experimenting with it on virtual machines, i'm not expecting the performance to be anything to write home about. i'm more interested in finding out how viable it is, and what the behaviour is when bad things happen. and whether it works better than lvm+drbd.
another option would be for two ganeti nodes to create a zvol each, export them via iscsi, build a mirrored zpool from them, and give that to the VM...i suspect that would work much better than mdadm over iscsi.
craig
--
craig sanders <cas@taz.net.au>
Participants (8): Craig Sanders, James Harper, Marcus Furlong, Robin Humble, Russell Coker, Tim Connors, Toby Corkindale, Trent W. Buck