
On Wed, 17 Oct 2012, Craig Sanders wrote:
The headers are essentially impossible to dedup, as they differ in the final stage of delivery even when a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit, as there usually aren't that many duplicates, even counting spam and jokes.
The scenario I was thinking of was internal email memos sent to "all staff", with a stupidly large Word .doc or .pdf file attached.
For an ISP mail server, de-duping isn't likely to help much (if at all). For a small-to-medium business or corporate mail server, it could help a lot.
I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.
ZFS de-duping is done at the block level. If a block's hash is an exact match for another block's hash, then it can be de-duped.
And guess what happens when, 200 bytes into the message, Delivered-To: changes from 123@abc.corp to 1234@abc.corp? Every subsequent byte is out by one, and no subsequent block looks the same.
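To make the offset problem concrete, here's a minimal Python sketch, assuming a 128K recordsize and SHA-256 per-block hashing (this isn't how the ZFS DDT is actually implemented, it just illustrates fixed-offset blocking):

    import hashlib
    import random

    BLOCK = 128 * 1024  # assumed 128K recordsize

    def block_hashes(data: bytes):
        """Hash each fixed-size block, measured from the start of the file."""
        return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
                for i in range(0, len(data), BLOCK)]

    # Two deliveries of the "same" message; the second header is one byte longer.
    random.seed(0)
    body = random.randbytes(4 * BLOCK)                # stand-in for a big attachment
    msg_a = b"Delivered-To: 123@abc.corp\n" + body
    msg_b = b"Delivered-To: 1234@abc.corp\n" + body

    a, b = block_hashes(msg_a), block_hashes(msg_b)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    print(f"{matches} of {len(a)} blocks match")      # 0 of 5 -- nothing dedups

Only if the attachment happened to start exactly on a block boundary in both copies would any blocks line up.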
Editing large video files, perhaps. Multiple cycles of edit and versioned save would use not much more space than the original file plus the size of the diffs.
Would multiple large video edits that insert or delete a frame here or there result in a non-integer number of filesystem blocks being inserted? I can't imagine things lining up neatly on filesystem block boundaries like that.
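As a rough sketch of that (the frame size and recordsize here are assumptions, not measurements):

    # Does anything line up again after inserting a chunk mid-file?
    recordsize = 128 * 1024            # assumed ZFS recordsize
    inserted   = 1920 * 1080 * 3 // 2  # one raw 1080p 4:2:0 frame = 3,110,400 bytes (hypothetical)

    shift = inserted % recordsize
    print(f"every later block boundary is shifted by {shift} bytes")
    # 3110400 % 131072 == 95744, so no later block re-aligns and none of it dedups.
    # Only an insertion that's an exact multiple of the recordsize would line up again.

So unless an edit happens to insert an exact multiple of the recordsize, every block after the edit point hashes differently.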
VMs are quite often touted as a good reason for de-duping: hundreds of almost identical zvols. I remain far from convinced that de-duping is the best use of available RAM on a virtualisation server, or that upgrading or adding disks wouldn't be better.
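For a sense of the RAM cost, a back-of-envelope sketch using the commonly quoted figure of roughly 320 bytes of core per dedup-table entry (the pool size and average block size below are just assumptions):

    # Rough DDT RAM estimate; all inputs are assumptions, not measurements.
    ddt_entry_bytes = 320               # commonly quoted in-core size per DDT entry
    pool_bytes      = 10 * 2**40        # say, 10 TiB of allocated data
    avg_block       = 64 * 1024         # average block size (often well under
                                        # the 128K recordsize on mixed workloads)

    blocks  = pool_bytes // avg_block
    ram_gib = blocks * ddt_entry_bytes / 2**30
    print(f"~{blocks:,} blocks -> ~{ram_gib:.0f} GiB of RAM just for the DDT")
    # ~167 million blocks -> ~50 GiB, which is why a few more disks often
    # looks cheaper than the RAM needed to keep the dedup table resident.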
We don't do it. Meh, 20GB per VM is common, when the rest of the 500GB-20TB is unique on each system. I get the feeling parts of our SAN are ZFS underneath, but the SAN controller is going to have a bit of trouble keeping track of cache for hundreds of TB of disk. -- Tim Connors