Re: mail storage in a distributed database

4 Apr 2012

On Wed, Apr 04, 2012 at 07:38:15PM +1000, Russell Coker wrote:
...
> Does anyone know of a mail store that uses a distributed database like
> Cassandra?

http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

  BlueRunner: Building an Email Service in the Cloud
  by Jun Rao
    IBM Almaden Research Center
    Apache Cassandra Committer

found with:

http://www.google.com.au/search?q=apache+cassandra+%2B%22mail+store%22

it occurs to me that openstack's Swift[1] object store might be good
for this. store the message with an object id of the message-id. you
probably don't even have to care about the fact that message-id is only
unique(*) per message, not per recipient (in fact, that's probably an
advantage).
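
a rough sketch of that idea in python, stdlib only. the names
object_name() and store() are made up for illustration, and a plain dict
stands in for the object store. the percent-encoding and the content-hash
fallback for messages with no message-id are my assumptions, not anything
swift requires.

```python
import email
import hashlib
from urllib.parse import quote


def object_name(raw_message: bytes) -> str:
    """Derive an object name from the Message-ID header (hypothetical helper)."""
    msg = email.message_from_bytes(raw_message)
    msgid = msg.get("Message-ID")
    if msgid is None:
        # assumption: messages without a Message-ID fall back to a content hash
        return "sha256-" + hashlib.sha256(raw_message).hexdigest()
    # drop surrounding whitespace and angle brackets, then percent-encode so
    # that "/" and similar characters cannot produce surprising object paths
    return quote(msgid.strip().strip("<>"), safe="")


def store(objects: dict, raw_message: bytes) -> str:
    """Store a message once, however many recipients it was addressed to."""
    name = object_name(raw_message)
    objects.setdefault(name, raw_message)  # a second copy is a no-op
    return name
```

delivering the same message to two recipients calls store() twice but
leaves one object behind, which is the "advantage" mentioned above.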

i've always been against the idea of storing mail in a database, but
an object store isn't a database... it's more like an enormous flat
filesystem (with containers, swift's name for buckets) or a giant
key/value-pair store... a much better fit for this task than a
relational database.

[1] http://swift.openstack.org/

hmmm.  you'd still need some sort of database so that you could get
from a recipient address, subject, and/or other fields to the message-id
(and hence to the msg body in the object store).  apart from offloading
the fulltext storage to something outside of the db, there might not be 
enough value in doing this.  not sure.
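
to make the "you'd still need some sort of database" point concrete,
here's a minimal sketch of such an index. sqlite is used only because it
ships with python; in the setup being discussed this table would live in
cassandra or whatever distributed db you pick. the table and function
names are invented for the example.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the recipient -> message-id index."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS delivery ("
        " recipient TEXT NOT NULL,"
        " message_id TEXT NOT NULL,"
        " subject TEXT,"
        " PRIMARY KEY (recipient, message_id))"
    )
    return db


def record_delivery(db, recipient, message_id, subject):
    """Note that a recipient has a given message; the body lives elsewhere."""
    db.execute(
        "INSERT OR IGNORE INTO delivery VALUES (?, ?, ?)",
        (recipient, message_id, subject),
    )
    db.commit()


def message_ids_for(db, recipient):
    """Return the message-ids to fetch from the object store for a mailbox."""
    rows = db.execute(
        "SELECT message_id FROM delivery WHERE recipient = ? ORDER BY message_id",
        (recipient,),
    )
    return [row[0] for row in rows]
```

the index holds one small row per (recipient, message-id) pair, so a
message sent to many people is indexed many times but stored once.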

might be good in combination with cassandra.
...
> I want something that has a delivery agent with a similar interface to
> maildrop or procmail and which has POP and IMAP servers to provide client
> access.

you'd have to write a swift access module for the pop/imap daemon of
your choice (dovecot is quite modular and would probably be a good
choice), and inserting incoming messages into the store would be
a simple wrapper around either the command-line tools or the http
api.  there are also python libs.
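
a sketch of what the http side of such a wrapper might look like, using
only the python stdlib. swift stores an object with a PUT to
<storage-url>/<container>/<object> and an X-Auth-Token header; the
storage url, container name and token here are placeholders you'd get
from your own auth setup, and build_put() / deliver() are names i made
up. treat it as an outline, not a tested delivery agent.

```python
import urllib.request
from urllib.parse import quote


def build_put(storage_url, container, name, token, raw_message):
    """Build the PUT request that would store one message in swift."""
    url = "/".join([
        storage_url.rstrip("/"),
        quote(container, safe=""),
        quote(name, safe=""),  # encode "@", "/" etc. in the object name
    ])
    return urllib.request.Request(
        url,
        data=raw_message,
        method="PUT",
        headers={"X-Auth-Token": token, "Content-Type": "message/rfc822"},
    )


def deliver(request):
    """Send the request; swift answers 201 Created on success."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

a procmail-style wrapper would read the message from stdin, call
build_put() with the message-id as the object name, and exit non-zero if
deliver() raises or returns anything other than 201 so the MTA can retry.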

(*) for pretty-damn-good values of "unique".

note: you need at least three nodes (preferably more than 5) to run
swift. you also need a second NIC for the nodes to talk to each
other - they chatter a LOT. you can imagine it as something like:

node1 -> node2: do you have version x of foo?
node1 -> node3: do you have version x of foo?
node2 -> node1: yes.
node2 -> node1: do you have version y of bar?
node3 -> node1: i have a later version, here it is.
node3 -> node2: do you have version x of foo?
node1 -> node2: no, gimme.
node2 -> node3: yes.
node2 -> node1: node 5 has gone down, you're secondary so grab a copy of this.
node2 -> node1: here it is.
blah blah blah.

the chatter is constant. however the data is highly redundant, highly
available and the data store is self-repairing. it's also massively
scalable - add more nodes as storage and load requires.

craig

-- 
craig sanders <cas@taz.net.au>

BOFH excuse #89:

Electromagnetic energy loss
