
On Wed, 17 Oct 2012, Craig Sanders <cas@taz.net.au> wrote:
The last time I checked, the average message size on a medium-sized mail spool was about 70K.
compression would bring that down to (very roughly) an average of about 5-15K per message.
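(For what it's worth, that estimate is easy to check against a real spool. The rough Python sketch below compresses each file in a Maildir directory with zlib and reports the average raw and compressed sizes. The path and compression level are arbitrary placeholders, and ZFS compresses per record rather than per file, so treat the output as a ballpark figure only.)

#!/usr/bin/env python3
# Rough sketch: estimate per-message compression on a real Maildir.
# Path and compression level are placeholders, not anything specific to ZFS.
import os, sys, zlib

maildir = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~/Maildir/cur")

total_raw = total_comp = count = 0
for name in os.listdir(maildir):
    path = os.path.join(maildir, name)
    if not os.path.isfile(path):
        continue
    data = open(path, "rb").read()
    total_raw += len(data)
    total_comp += len(zlib.compress(data, 6))   # gzip-6-ish, whole message at once
    count += 1

if count:
    print("messages: %d" % count)
    print("average raw size:        %6.1f KiB" % (total_raw / count / 1024.0))
    print("average compressed size: %6.1f KiB" % (total_comp / count / 1024.0))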
Yes, that could be a real win. If multiple processes are doing synchronous writes at the same time, does ZFS bundle them in the same transaction? At busy times I have 12 processes doing synchronous delivery at the same time. If ZFS were to slightly delay a couple of them so that 5+ synchronous file writes went out in the same operation, it could improve overall performance and leave more disk bandwidth for the occasional read.
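(That bundling question could be tested directly. Below is a rough Python sketch that spawns a dozen writer processes, each fsync()ing a stream of ~70K files into a directory on the filesystem under test, and reports the aggregate rate. The writer count, file count and file size are just stand-ins matching the figures above. Running it once on a ZFS dataset and once on another filesystem on the same disks would show whether concurrent synchronous writes are being batched.)

#!/usr/bin/env python3
# Sketch: measure aggregate throughput of many concurrent synchronous writers.
# All parameters are illustrative stand-ins for the figures mentioned above.
import os, sys, time
from multiprocessing import Process

TARGET_DIR = sys.argv[1] if len(sys.argv) > 1 else "/tmp/syncwrite-test"
WRITERS = 12           # matches the 12 concurrent deliveries mentioned above
MESSAGES = 200         # files written per writer
MSG_SIZE = 70 * 1024   # ~70K, the average message size quoted above

def writer(wid):
    payload = os.urandom(MSG_SIZE)
    for i in range(MESSAGES):
        path = os.path.join(TARGET_DIR, "w%02d-%05d" % (wid, i))
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, payload)
            os.fsync(fd)           # synchronous delivery, as an MTA would do
        finally:
            os.close(fd)

if __name__ == "__main__":
    os.makedirs(TARGET_DIR, exist_ok=True)
    start = time.time()
    procs = [Process(target=writer, args=(w,)) for w in range(WRITERS)]
    for p in procs: p.start()
    for p in procs: p.join()
    elapsed = time.time() - start
    total = WRITERS * MESSAGES
    print("%d fsync'd files in %.1fs (%.1f files/sec)" % (total, elapsed, total / elapsed))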
The headers are essentially impossible to dedup as they differ in the final stage of delivery, even if a single SMTP operation was used to send to multiple local users. Deduping the message body seems unlikely to provide a significant benefit either, as there usually aren't that many duplicates, not even when you count spam and jokes.
the scenario I was thinking of was internal email memos sent to "all staff", with a stupidly large Word .doc or .pdf file attached.
for an ISP mail server, de-duping isn't likely to help much (if at all). For a small-medium business or corporate mail server, it could help a lot.
True. But if that sort of thing comprises a significant portion of your email then there are better ways of solving the problem, and a Wiki is often part of the solution.
I'm assuming that ZFS is even capable of deduplicating files which have the duplicate part at different offsets, but I don't care enough about this to even look it up.
zfs de-duping is done at block level. if a block's hash is an exact match with another block's hash then it can be de-duped.
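(Which is why the different-offsets case mentioned above is essentially hopeless: identical data shifted by a few hundred bytes of per-recipient headers lands in blocks that hash differently. A small illustrative Python sketch, with made-up block size, header lengths and attachment:)

#!/usr/bin/env python3
# Illustrative sketch: block-level dedup only matches whole identical blocks,
# so the same attachment at different byte offsets shares (almost) no blocks.
import hashlib, os

BLOCK = 128 * 1024                        # ZFS's default recordsize
attachment = os.urandom(5 * 1024 * 1024)  # the same 5MB attachment sent to two users

msg_a = b"X" * 700 + attachment           # different per-recipient headers,
msg_b = b"Y" * 950 + attachment           # so the attachment starts at a
                                          # different offset in each file

def block_hashes(data):
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

common = block_hashes(msg_a) & block_hashes(msg_b)
print("identical blocks between the two messages:", len(common))  # almost always 0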
So I guess it does no good for email then, unless your MTA stores attachments as separate files (i.e. not Maildir).
me too. i don't use zfs de-dupe at all. it is, IMO, of marginal use. adding more disks (or replacing with larger disks) is almost always going to be cheaper and better. but there are some cases where it could be useful...so I don't want to dismiss it just because I have no personal need for it.
Presumably the Sun people who dedicated a lot of engineering and testing time to developing the feature had some reason to do so.
Editing large video files, perhaps. multiple cycles of edit & versioned save wouldn't use much more space than the original file plus the size of the diffs.
For uncompressed video that could be the case. One of my clients currently has some problems with that sort of thing: they are using local non-RAID storage on Macs and then saving the result to the file server, because of problems transferring files >2G.
VMs are quite often touted as a good reason for de-duping - hundreds of almost identical zvols. I remain far from convinced that de-duping is the best use of available RAM on a virtualisation server, or that upgrading/adding disks wouldn't be better.
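(The RAM cost is easy to put rough numbers on. The Python sketch below uses the commonly quoted figure of roughly 320 bytes per in-core DDT entry, which is an assumption here rather than a measurement, to estimate the dedup table size for a 3TB pool at a few average block sizes.)

#!/usr/bin/env python3
# Back-of-the-envelope sketch of the RAM argument above.  The ~320 bytes per
# in-core DDT (dedup table) entry is a commonly quoted approximation, not a
# measured figure.
DDT_ENTRY_BYTES = 320

def ddt_ram_gib(pool_bytes, avg_block_bytes):
    """Approximate RAM needed to keep the whole dedup table in memory."""
    blocks = pool_bytes / avg_block_bytes
    return blocks * DDT_ENTRY_BYTES / 2**30

TIB = 2**40
for blocksize in (8 * 1024, 64 * 1024, 128 * 1024):
    print("3 TiB pool, %3dK average blocks: ~%5.1f GiB of DDT"
          % (blocksize // 1024, ddt_ram_gib(3 * TIB, blocksize)))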
For a VM you have something between 500M and 5G of OS data; if it's closer to 5G then it's probably fairly usage-specific, so there's less to dedup. For most of the VMs I run the application data vastly exceeds the OS data, so the savings would be at most 10%. Not to mention that the most common VM implementations use local storage, which means that with a dozen VMs on a single system running several different distributions there is little opportunity for dedup. Finally, if you have 5G OS images for virtual machines then you could fit more than 500 such images on a 3TB disk, so even if you can save disk space it's still going to be easier and cheaper to buy more disk.

--
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/