
On Wed, 17 Oct 2012, Craig Sanders wrote:
The headers are essentially impossible to dedup, as they differ in the final stage of delivery even when a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit, as there usually aren't that many duplicates, even counting spam and jokes.
The scenario I was thinking of was internal email memos sent to "all staff", with a stupidly large Word .doc or .pdf file attached.
For an ISP mail server, de-duping isn't likely to help much (if at all). For a small-to-medium business or corporate mail server, it could help a lot.
I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.
ZFS de-duping is done at the block level. If a block's hash is an exact match for another block's hash, then it can be de-duped.
And guess what happens when, 200 bytes into the message, Delivered-To: changes from 123@abc.corp to 1234@abc.corp? Every subsequent byte is out by one, and no subsequent block looks the same.
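To make the offset problem concrete, here's a minimal Python sketch, assuming a 128K recordsize and SHA-256 per-block hashing (this isn't how the ZFS DDT is actually implemented, it just illustrates fixed-offset blocking):

    import hashlib
    import random

    BLOCK = 128 * 1024  # assumed 128K recordsize

    def block_hashes(data: bytes):
        """Hash each fixed-size block, measured from the start of the file."""
        return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)]

    # Two deliveries of the "same" message; the second header is one byte longer.
    random.seed(0)
    body = random.randbytes(4 * BLOCK)                # stand-in for a big attachment
    msg_a = b"Delivered-To: 123@abc.corp\n" + body
    msg_b = b"Delivered-To: 1234@abc.corp\n" + body

    a, b = block_hashes(msg_a), block_hashes(msg_b)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    print(f"{matches} of {len(a)} blocks match")      # 0 of 5 -- nothing dedups

Only if the attachment happened to start exactly on a block boundary in both copies would any blocks line up.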
Editing large video files, perhaps. Multiple cycles of edit and versioned save would use not much more space than the original file plus the size of the diffs.
Would multiple large video edits that insert or delete a frame here or there result in a non-integer number of filesystem blocks being inserted? I can't imagine things lining up neatly on filesystem block boundaries like that.
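As a rough sketch of that (the frame size and recordsize here are assumptions, not measurements):

    # Does anything line up again after inserting a chunk mid-file?
    recordsize = 128 * 1024            # assumed ZFS recordsize
    inserted   = 1920 * 1080 * 3 // 2  # one raw 1080p 4:2:0 frame = 3,110,400 bytes (hypothetical)

    shift = inserted % recordsize
    print(f"every later block boundary is shifted by {shift} bytes")
    # 3110400 % 131072 == 95744, so no later block re-aligns and none of it dedups.
    # Only an insertion that's an exact multiple of the recordsize would line up again.

So unless an edit happens to insert an exact multiple of the recordsize, every block after the edit point hashes differently.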
VMs are quite often touted as a good reason for de-duping: hundreds of almost identical zvols. I remain far from convinced that de-duping is the best use of available RAM on a virtualisation server, or that upgrading or adding disks wouldn't be better.
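For a sense of the RAM cost, a back-of-envelope sketch using the commonly quoted figure of roughly 320 bytes of core per dedup-table entry (the pool size and average block size below are just assumptions):

    # Rough DDT RAM estimate; all inputs are assumptions, not measurements.
    ddt_entry_bytes = 320               # commonly quoted in-core size per DDT entry
    pool_bytes      = 10 * 2**40        # say, 10 TiB of allocated data
    avg_block       = 64 * 1024         # average block size (often well under
                                        # the 128K recordsize on mixed workloads)

    blocks  = pool_bytes // avg_block
    ram_gib = blocks * ddt_entry_bytes / 2**30
    print(f"~{blocks:,} blocks -> ~{ram_gib:.0f} GiB of RAM just for the DDT")
    # ~167 million blocks -> ~50 GiB, which is why a few more disks often
    # looks cheaper than the RAM needed to keep the dedup table resident.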
We don't do it. Meh, 20GB per VM is common, when the rest of the 500GB-20TB is unique on each system. I get the feeling parts of our SAN are ZFS underneath, but the SAN controller is going to have a bit of trouble keeping track of cache for hundreds of TB of disk. -- Tim Connors