
On Wednesday, 23 May 2018 1:10:08 PM AEST Craig Sanders via luv-main wrote:
far too much RAM to be worth doing. It's a great way to minimise use of cheap disks ($60 per TB or less) by using lots of very expensive RAM ($15 per GB or more).
A very rough rule of thumb is that de-duplication uses around 1GB of RAM per TB of storage. Definitely not worth it. About the only good use case I've seen for de-duping is a server with hundreds of GBs of RAM providing storage for lots of mostly-duplicate clone VMs, like at an ISP or other hosting provider. It's only worthwhile there because of the performance improvement that comes from NOT having multiple copies of the same data-blocks (taking more space in the ARC & L2ARC caches, and causing more seek time delays if using spinning rust rather than SSDs). Even then, it's debatable whether just adding more disk would be better.
http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-113-siz...

Some Google results suggest it's up to 5G of RAM per TB of storage, the above URL seems to suggest 2.4G/TB. At your prices 2.4G of RAM costs $36 so if it could save you 600G of disk space (i.e. 1.6TB of regular storage deduped to 1TB of disk space, which means 38% of blocks being duplicates) it would save money in theory. In practice it's probably more about which resource you run out of and which you can easily increase. Buying bigger disks generally seems to be easier than buying more RAM due to the limited number of DIMM slots and unreasonable prices for the larger DIMMs.
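The break-even arithmetic above can be written out explicitly. A minimal sketch using the figures quoted in the thread (2.4GB of RAM per TB of deduped pool, $15/GB RAM, $60/TB disk); it is not part of the original posts:

awk 'BEGIN {
    ram_gb_per_tb = 2.4    # rough DDT RAM estimate per TB of deduped pool
    ram_price     = 15     # $ per GB of RAM
    disk_price    = 60     # $ per TB of disk
    ram_cost = ram_gb_per_tb * ram_price            # = $36 per deduped TB
    breakeven = ram_cost / (ram_cost + disk_price)  # duplicate fraction needed
    printf "RAM cost per deduped TB: $%.0f\n", ram_cost
    printf "break-even duplicate fraction: %.1f%%\n", breakeven * 100
}'

This prints a break-even point of about 37.5%, matching the ~38% of duplicate blocks mentioned above.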
Compression's worth doing on most filesystems, though. lz4 is a very fast, very low cpu usage algorithm, and (depending on what kind of data) on average you'll probably get about 1/3rd to 1/2 reduction of space used by compressible files. e.g. some of the datasets on the machine I just built (called "hex"):
# zfs get compressratio hex hex/home hex/var/log hex/var/cache
NAME           PROPERTY       VALUE  SOURCE
hex            compressratio  1.88x  -
hex/home       compressratio  2.00x  -
hex/var/cache  compressratio  1.09x  -
hex/var/log    compressratio  4.44x  -
The first entry is the overall compression ratio for the entire pool: 1.88:1. So compression is currently saving me nearly half of my disk usage. It's a new machine, so there's not much on it at the moment.
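For anyone reading along, turning this on is a single property on the pool root, which child datasets inherit. A generic sketch using the pool name "hex" from the example above (not commands taken from the original message):

zfs set compression=lz4 hex
zfs get -r compression hex    # confirm the children inherit the setting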
Strangely I never saw such good compression when storing email on ZFS. One would expect email to compress well (for starters anything like Huffman coding will give significant benefits) but it seems not.
I'd probably get even better compression on the logs (at least 6x, probably more) if I set it to use gzip for that dataset with:
zfs set compression=gzip hex/var/log
I never knew about that, it would probably have helped the mail store a lot.
(note that this won't re-compress existing data; only new data will be compressed with the new algorithm)
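If you did want the existing files stored with the new algorithm, one crude approach (not suggested anywhere in this thread, and only safe while nothing is writing to the files) is to rewrite them in place so the current compression property applies:

cd /var/log
for f in *; do
    [ -f "$f" ] || continue                        # skip directories etc.
    cp -a -- "$f" "$f.tmp" && mv -- "$f.tmp" "$f"  # rewrite = re-compress
done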
If you are storing logs on a filesystem that supports compression you should turn off your distribution's support for compressing logs. That will read and rewrite the log files from a cron job without saving much additional space.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Wed, May 23, 2018 at 10:42:01PM +1000, russell@coker.com.au wrote:
On Wednesday, 23 May 2018 1:10:08 PM AEST Craig Sanders via luv-main wrote:

http://www.oracle.com/technetwork/articles/servers-storage-admin/o11-113-siz...
Some Google results suggest it's up to 5G of RAM per TB of storage, the above URL seems to suggest 2.4G/TB.
I've seen estimates ranging from 1GB per TB to around 8GB per TB. Nobody really seems to know for sure (and, like most things, it will vary greatly depending on the nature of the data being de-duped). Even at the most optimistic rate of 1GB/TB, it's not worth doing except perhaps in some very specialised circumstances. For the average home or small-medium business user, adding more disk space is much easier and much cheaper. And there are better uses for any extra RAM than de-duping - like ARC or other disk caching/buffering, or just running programs.
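For a number based on your own data rather than a rule of thumb, zdb can simulate de-duplication on an existing pool. A sketch only (it is slow, since it reads the whole pool, and the per-entry size is approximate):

zdb -S hex
# prints a simulated DDT histogram and an estimated dedup ratio; the total
# number of DDT entries times roughly 320 bytes gives a ballpark figure for
# the in-core dedup table size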
At your prices 2.4G of RAM costs $36 so if it could save you 600G of disk space (i.e. 1.6TB of regular storage deduped to 1TB of disk space, which means 38% of blocks being duplicates) it would save money in theory.
In theory, maybe it could. In practice, it probably wouldn't.

You don't buy RAM in 2.4GB DIMMs. You buy RAM in 2(*), 4, 8, 16, 32, 64, etc GB sizes and usually install them in pairs (or fours or eights) depending on whether you have dual-, quad-, or eight-channel memory. So probably a minimum of two 4GB DIMMs @ $15/GB = $120 (or more for ECC RAM). That's the price of a pair of 1TB drives.

(*) I'm not even sure 2 GB DIMMs are still available new anywhere. You probably wouldn't waste a server's DIMM sockets on anything less than a 4 or 8 GB DIMM anyway, and at the scale where de-duping might be worth it, probably not less than 32 or 64 GB.

I use 4 & 8 GB DIMMs in my DDR3 machines here, and 16 GB DIMMs in my new DDR4 box - to me, that was part of the benefit of moving to the platform: DDR3 is effectively obsolete AND DDR4 is cheaper than DDR3 in large sizes. Adding more RAM is still one of the best ways to improve performance on Linux boxes if the bottleneck is mostly disk I/O rather than CPU.
In practice it's probably more about which resource you run out of and which you can easily increase. Buying bigger disks generally seems to be easier than buying more RAM due to the limited number of DIMM slots and unreasonable prices for the larger DIMMs.
Yeah, disk is cheap and easy to expand. RAM, significantly less so on both counts.
Strangely I never saw such good compression when storing email on ZFS. One would expect email to compress well (for starters anything like Huffman coding will give significant benefits) but it seems not.
Logs almost certainly compress better than mail: lots more repeated text "phrases". I enabled gzip on /var/log last night and remembered to disable logrotate compression on this new machine. I'll check what the compression ratio looks like in a few days.
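Checking later is just the same property query as shown earlier, e.g.:

zfs get compressratio hex/var/log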
If you are storing logs on a filesystem that supports compression you should turn off your distribution's support for compressing logs. That will read and rewrite the log files from a cron job without saving much additional space.
I was going to mention that but thought I'd already written more than enough :)

It's even worse than what you say - not only does gzipping the log files create entirely new files, but you'll still have the uncompressed versions of the logs in your snapshots until the snapshots are expired or you delete them. Using the "compress" option in logrotate's conf files actually uses MORE space, not less. Fixable with:

sed -E -i 's/^\s*.*compress/#&/' /etc/logrotate.d/*

Also make sure that "compress" isn't the global default in /etc/logrotate.conf -- comment it out, or optionally set "nocompress" explicitly as the global default instead.

.deb packages with support for logrotate (most service/daemon type packages) typically have compress enabled by default. You'll need to remember to fix that if/when you install a new package which does that, or if you let dpkg/apt replace your modified conf files with the packaged ones on an upgrade.

craig

--
craig sanders <cas@taz.net.au>
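For reference, the global part of the logrotate config mentioned above looks roughly like this on a stock Debian-style system; a generic sketch of /etc/logrotate.conf, not Craig's actual file:

# /etc/logrotate.conf (excerpt)
weekly
rotate 4
create
# comment out the global default...
#compress
# ...or state the opposite explicitly:
nocompress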