interesting thread on zfs-discuss

FYI: the thread titled "What is the current status of native ZFS on Linux?" http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/th... includes this paragraph from http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/msg/9bd84a5e8d93... by Richard Yao:

"As far as performance goes, ZFSOnLinux raidz2 outperforms a combination of MD RAID 6, LVM and ext4 on 6 Samsung HD204UI drives that I own. ZFS raidz2 has 220 MB/sec write performance in writing a 4GB file while the combination of MD RAID 6, LVM and ext4 only managed 20MB/sec."

Somebody in the thread mentioned that they get abysmal performance with zvols (chunks of a zpool allocated to be a "disk" - similar to an LVM logical volume). That really surprises me; I've had fantastic performance from them. All of my VMs are now on zvols, and I've done a lot of testing of ZFS in a VM with zpools made up of lots of 100-200MB zvols.

The same person also mentioned stability problems until they changed from the onboard drive controller to an LSI controller. Unfortunately, he doesn't mention what motherboard or kind of ports, or what kind of LSI controller (I'd guess one of the cheap HBAs like the 9211-8i or the IBM M1015, as they are very popular for running ZFS on Linux, OpenSolaris and FreeBSD).

craig

ps: speaking of cheap HBAs, I bought a few of the IBM M1015s from ebay; they took only a few days to arrive. I reflashed one of them to IT mode on a spare machine and then replaced my Supermicro card with it (I then reflashed the Supermicro card to IT mode in the spare machine too - I didn't want to risk that until now in case I bricked the card; it will go into my myth box to replace the 4-port Adaptec 1430SA).

Haven't noticed any difference at all, but that's good. I wasn't expecting any performance difference, just more relaxed timeouts for the consumer-grade WD Green drives I'm using. When prices come down enough I'll replace them with 3TB drives, probably Hitachi or Seagate.

--
craig sanders <cas@taz.net.au>

BOFH excuse #126: it has Intel Inside

Hi Craig,
Somebody in the thread mentioned that they get abysmal performance with zvols (chunks of a zpool allocated to be a "disk" - similar to an LVM logical volume). That really surprises me; I've had fantastic performance from them. All of my VMs are now on zvols, and I've done a lot of testing of ZFS in a VM with zpools made up of lots of 100-200MB zvols.
Recently I had a performance problem as well, under FreeBSD. Overnight I filled a 900GB zvol: a backup took up too much space (the script wasn't working as expected), so at one stage there were just 4GB left.

I had our Zimbra mail server running on this machine, inside VirtualBox. It complained about SCSI timeouts (virtual, inside the VM), and finally remounted the virtual disk read-only (practically "killing" the mail server). The zvol was never running out of disk space, just nearly 100% full.

According to other discussions ZFS slows down if it is filling up. That would explain my problem.

BTW: zfs receive seems to need additional disk space while "unpacking", and releases some when finished. I haven't gotten around to looking into it in more detail, but I should, to avoid more "surprises" of this kind.

Regards
Peter

Peter Ross wrote:
The zvol was never running out of disk space, just nearly 100% full. According to other discussions ZFS slows down if it is filling up. That would explain my problem.
IME, and according to #btrfs on freenode, that happens at about 80% full, IOW if you have a 2TB zpool its effective storage capacity is only 1.6TB, since after that it becomes unusably slow. "Unusably" as in "it took me hours to delete a single 20MB daily snapshot, while the system was disconnected from the network entirely and all its other I/O-heavy processes were turned off."

On Tue, 24 Apr 2012, Trent W. Buck wrote:
Peter Ross wrote:
The zvol was never running out of disk space, just nearly 100% full. According to other discussions ZFS slows down if it is filling up. That would explain my problem.
IME, and according to #btrfs on freenode, that happens at about 80% full, IOW if you have a 2TB zpool its effective storage capacity is only 1.6TB, since after that it becomes unusably slow.
"Unusably" as in "it took me hours to delete a single 20MB daily snapshot, while the system was disconnected from the network entirely and all its other I/O-heavy processes were turned off."
Just now on the machine:
zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
zpool  912G   774G  138G  84%  1.00x  ONLINE  -
At the moment I don't see a performance problem. There are ca. 50 people connected to the mail server, it runs a MediaWiki, a PHP developer box, MySQL for it etc.

It looks as if I have to keep an eye on it. I mirror a samba server here; maybe I should do it on another box...

Regards
Peter

Peter Ross wrote:
"Unusably" as in "it took me hours to delete a single 20MB daily snapshot, while the system was disconnected from the network entirely and all its other I/O-heavy processes were turned off."
Just now on the machine:
zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
zpool  912G   774G  138G  84%  1.00x  ONLINE  -
At the moment I don't see a performance problem. There are ca. 50 people connected to the mail server, it runs a MediaWiki, a PHP developer box, MySQL for it etc.
It looks as if I have to keep an eye on it. I mirror a samba server here; maybe I should do it on another box...
Well, I imagine it gets exponentially worse as you approach 100%. IIRC I was around the 94% mark when users first started complaining that their backups were failing out and I actually looked at the box. I went on a killing spree and it's now under 50%, so I can't easily reproduce the issue, and ICBF pulling any notes that were made at the time out of our stinky in-house ticket system.

On Tue, Apr 24, 2012 at 12:45:30PM +1000, Trent W. Buck wrote:
reproduce the issue, and ICBF pulling any notes that were made at the time out of our stinky in-house ticket system.
Whatever your stinky in-house ticket system is, be glad it's not BMC Remedy. And be afraid, very afraid, that someone in management will get glossy-brochured into buying it.

Where I work, we were forced by management decisions from far above to stop using our in-house ticketing system - which was perfectly functional, written in PHP & MySQL to ITIL principles and methodology, and was also our configuration management and authoritative machine database (incl. ip-address/mac-address/dns/etc - dhcpd.conf and zone files are built from this) - and to use Remedy instead. We still use the original system for dhcp/dns etc but have to use Remedy for tickets, so now our tickets are not associated with specific machines. Or specific users, either.

Unfortunately, Remedy is essentially unusable garbage. And I'm referring to the latest release, which was a vast improvement over the previous version.

The big problem with Remedy from my pov is that its focus is *entirely* wrong. Ridiculously huge amounts of screen space are devoted to micro-recording of time spent on a job; almost no screen space is available for problem description or working notes - about half an inch square on a 24" monitor(*). It is a tool designed to micromanage call-center serfs, not a tool to support techs doing their work. In fact, it gets in the way of them doing their work by adding about 15-20 minutes of pointless bureaucratic procedure that achieves nothing but some pretty time-and-motion graphs for mgmt.

Also, they've gone to a huge effort to replicate all of the ugly unusability of a Visual Basic app in the web interface... which is the only part of it that almost works on linux. One must almost admire their perverse dedication to a brain-damaged design and UI model.

Another big problem is the way it has been configured: the queues set up by $WORK_CENTRAL_IT are too broad, so I end up being spammed by dozens to hundreds of irrelevant (to me) end-user desktop support tickets for every one sysadmin-relevant ticket (so I don't see the relevant ones because they're lost in all the noise).

(*) oh yeah, the font sizes and window sizes and everything are hard-coded and fixed - unreadably tiny on a modern monitor. So even if you maximise the window, you just get the same unusable crap interface centered in white, with lots of little fields and several horizontal and vertical scroll bars for particular table views.

Also, the Remedy devs seem to be unfamiliar with highly advanced topics like multiple browser windows or tabbed browsers.

craig

--
craig sanders <cas@taz.net.au>

On Tue, 24 Apr 2012, Craig Sanders wrote:
On Tue, Apr 24, 2012 at 12:45:30PM +1000, Trent W. Buck wrote:
reproduce the issue, and ICBF pulling any notes that were made at the time out of our stinky in-house ticket system.
whatever your stinky in-house ticket system is, be glad it's not BMC Remedy.
and be afraid, very afraid, that someone in management will get glossy-brochured into buying it.
Heh. We were asked/demanded this morning to come up with savings of [CENSORED] million dollars without affecting our public presence (hah!). I forgot to mention our ticketing system; at least this one sort of works. They actually did buy ARSe years ago, and never got *it* to work, surprisingly enough.
also, they've gone to a huge effort to replicate all of the ugly unusability of a Visual Basic app in the web interface....which is the only part of it that almost works on linux. one must almost admire their perverse dedication to a brain-damaged design and UI model.
I am so glad I have never seen it in action.
also the Remedy devs seem to be unfamiliar with highly advanced topics like multiple browser windows or tabbed browsers.
Hey, maybe isupport and ARSe are actually the same thing? And of course everything is javashite, so you can't just open a bunch of search results in new windows; you have to continually go back, open the search dropdown box again, find which ticket you last opened, and click the next one, instead of just middle click, down, middle click, down, middle click through the (unlabelled, with just ticket number) list.

--
Tim Connors

Craig Sanders wrote:
On Tue, Apr 24, 2012 at 12:45:30PM +1000, Trent W. Buck wrote:
reproduce the issue, and ICBF pulling any notes that were made at the time out of our stinky in-house ticket system.
whatever your stinky in-house ticket system is, be glad it's not BMC Remedy.
Yeah, my situation could be a lot worse. For example, when the ticket system pays me, I believe it now uses double-ledger accounting in some places! Dealing with it is still the worst part of my job, though.

On 24/04/12 14:15, Craig Sanders wrote:
whatever your stinky in-house ticket system is, be glad it's not BMC Remedy.
Honestly, it's not so bad. In my day job I get to use pretty much every ticket system on the planet (at least those that interact with e-mail), and the Remedy-based ones are nowhere near the worst around; usually it is the stinky in-house systems that are.

I should add I've never had to use anything that's obviously the official Remedy UI; the cases I know are Remedy have all got their own custom web UI. That's not to say I don't long for the days of $JOB[-1] and $JOB[-3], which were heavy Jira and RT users respectively.

On Tue, 24 Apr 2012, Peter Ross wrote:
On Tue, 24 Apr 2012, Trent W. Buck wrote:
Peter Ross wrote:
The zvol was never running out of disk space, just nearly 100% full. According to other discussions ZFS slows down if it is filling up. That would explain my problem.
IME, and according to #btrfs on freenode, that happens at about 80% full, IOW if you have a 2TB zpool its effective storage capacity is only 1.6TB, since after that it becomes unusably slow.
"Unusably" as in "it took me hours to delete a single 20MB daily snapshot, while the system was disconnected from the network entirely and all its other I/O-heavy processes were turned off."
Just now on the machine:
zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
zpool  912G   774G  138G  84%  1.00x  ONLINE  -
At the moment I don't see a performance problem. There are ca. 50 people connected to the mail server, it runs a MediaWiki, a PHP developer box, MySQL for it etc.
It looks as if I have to keep an eye on it. I mirror a samba server here; maybe I should do it on another box...
Do you use snapshots? Extensively? Perhaps continual use of snapshots greatly fragments the pool. I was personally hoping not to lose 20% of my disk! But since my usage won't involve snapshots (except every month or so when I send a snapshot to offsite storage then immediately remove the local snapshot), maybe I'll be ok.

--
Tim Connors

On Tue, Apr 24, 2012 at 02:45:39PM +1000, Tim Connors wrote:
On Tue, 24 Apr 2012, Peter Ross wrote:
On Tue, 24 Apr 2012, Trent W. Buck wrote:
Peter Ross wrote:
The zvol was never running out of disk space, just nearly 100% full. According to other discussions ZFS slows down if it is filling up. That would explain my problem.
IME, and according to #btrfs on freenode, that happens at about 80% full, IOW if you have a 2TB zpool its effective storage capacity is only 1.6TB, since after that it becomes unusably slow.
"Unusably" as in "it took me hours to delete a single 20MB daily snapshot, while the system was disconnected from the network entirely and all its other I/O-heavy processes were turned off."
Just now on the machine:
zpool list
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
zpool  912G   774G  138G  84%  1.00x  ONLINE  -
At the moment I don't see a performance problem. There are ca. 50 people connected to the mail server, it runs a MediaWiki, a PHP developer box, MySQL for it etc.
It looks as if I have to keep an eye on it. I mirror a samba server here; maybe I should do it on another box...
Do you use snapshots? Extensively? Perhaps continual use of snapshots greatly fragments the pool. I was personally hoping not to lose 20% of my disk! But since my usage won't involve snapshots (except every month or so when I send a snapshot to offsite storage then immediately remove the local snapshot), maybe I'll be ok.
ZFS is fragmentation city, pretty much by design - that's what you get with COW and variable block sizes.

I don't think ZFS snapshots are any different to all the other i/o from a fragmentation point of view. what is the difference between a fs being 50% full with static snapshots or 50% full with static data? it's all just blocks that can't be shifted around.

having said that, all fs's get fragmented and all hate being nearly full, and I doubt ZFS's %full drop-off point is more than 5% of disk capacity different from any other fs's.

cheers,
robin

On Tue, 24 Apr 2012, Robin Humble wrote:
On Tue, Apr 24, 2012 at 02:45:39PM +1000, Tim Connors wrote:
On Tue, 24 Apr 2012, Peter Ross wrote:
It looks as if I have to keep an eye on it. I mirror a samba server here; maybe I should do it on another box...
Do you use snapshots? Extensively? Perhaps continual use of snapshots greatly fragments the pool. I was personally hoping not to lose 20% of my disk! But since my usage won't involve snapshots (except every month or so when I send a snapshot to offsite storage then immediately remove the local snapshot), maybe I'll be ok.
ZFS is fragmentation city, pretty much by design - that's what you get with COW and variable block sizes.
Yeah, that's what I feared. Why can't someone invent a perfect filesystem (that's why I didn't invent ZFS[1] :), dagnamit?
I don't think ZFS snapshots are any different to all the other i/o from a fragmentation point of view. what is the difference between a fs being 50% full with static snapshots or 50% full with static data? it's all just blocks that can't be shifted around.
Does zfs send or similar defragment objects before sending? Doubt it - it'd make sending extremely slow! But it would be nice to be able to do something like send somewhere (my storage is about to go offsite) and back again occasionally.

[1] It's like back in uni when we were tasked with writing a BIGNUM handler in assembler. The memory allocator was brk(). I tried forever to write an allocator thinking that I could somehow reclaim memory once I finished with it. I was even going to do double indirection. But I failed, and decided just to do a half arsed job and get the rest of the project working. Then I found out that the pros only ever grow memory and never shrink it again, because when combined with fragmentation, you never have space at the end to be able to shrink. That's why you end up with bloated pigs like mozilla always ever expanding and never shrinking when you close a bunch of tabs. You can't shrink unless brk() takes a negative number and all your allocations were at the start of your heap. I wouldn't have invented *THAT* API!

--
Tim Connors
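To make the grow-only behaviour concrete, here is a minimal sketch in C of the kind of brk()-based bump allocator being described (hypothetical helper names, assuming Linux's sbrk()): memory can only be handed back when the freed block happens to sit at the very top of the heap, and anything freed underneath a live allocation is stranded.

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

/* Grow-only bump allocation: every request just pushes the break up. */
static void *bump_alloc(size_t n)
{
    void *p = sbrk((intptr_t)n);
    return (p == (void *)-1) ? NULL : p;
}

/* Shrinking only works when the block being freed is the topmost
 * allocation; anything below a live block cannot be returned, which
 * is the fragmentation problem described above. */
static int bump_release(void *p, size_t n)
{
    if ((char *)p + n != sbrk(0))
        return -1;                    /* not at the top of the heap */
    return (sbrk(-(intptr_t)n) == (void *)-1) ? -1 : 0;
}

int main(void)
{
    char *a = bump_alloc(1000);
    char *b = bump_alloc(1000);

    int r1 = bump_release(a, 1000);   /* -1: 'b' still lives above 'a' */
    int r2 = bump_release(b, 1000);   /*  0: 'b' is on top, break lowered */

    return (r1 == -1 && r2 == 0) ? 0 : 1;
}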

On Tue, 24 Apr 2012, Tim Connors <tconnors@rather.puzzling.org> wrote:
ZFS is fragmentation city, pretty much by design - that's what you get with COW and variable block sizes.
Yeah, that's what I feared. Why can't someone invent a perfect filesystem (that's why I didn't invent ZFS[1] :), dagnamit?
I have the impression that ZFS is like NetApp WAFL in that it streams out contiguous writes for separate files. So it makes writes contiguous at the expense of making later reads fragmented.

For situations where write performance is more of a bottleneck than read performance (which means every mail server, most database servers, and a significant portion of all other servers) this is a good thing! You can improve read performance in most cases by adding more cache; ZFS has new caching methods, and RAM is constantly dropping in price (32G for a personal workstation isn't impossible nowadays). Write performance has been a serious problem for years.
you close a bunch of tabs. You can't shrink unless brk() takes a negative number and all your allocations were at the start of your heap. I wouldn't have invented *THAT* API!
That's why modern systems have malloc() calling mmap() not brk().

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Tim Connors wrote:
[1] It's like back in uni when we were tasked with writing a BIGNUM handler in assembler. The memory allocator was brk(). I tried forever to write an allocator thinking that I could somehow reclaim memory once I finished with it. I was even going to do double indirection. But I failed, and decided just to do a half arsed job and get the rest of the project working. Then I found out that the pros only ever grow memory and never shrink it again, because when combined with fragmentation, you never have space at the end to be able to shrink. That's why you end up with bloated pigs like mozilla always ever expanding and never shrinking when you close a bunch of tabs. You can't shrink unless brk() takes a negative number and all your allocations were at the start of your heap. I wouldn't have invented *THAT* API!
*blink* - are you saying that bloatware (moco and friends) don't use malloc and free? WTF?! You'd think they can't read a manpage:

    $ man brk | grep malloc
    Avoid using brk() and sbrk(): the malloc(3) memory allocation package is the portable and comfortable way of allocating memory.

On Tue, 24 Apr 2012, Trent W. Buck wrote:
*blink* - are you saying that bloatware (moco and friends) don't use malloc and free? WTF?! You'd think they can't read a manpage:
    $ man brk | grep malloc
    Avoid using brk() and sbrk(): the malloc(3) memory allocation package is the portable and comfortable way of allocating memory.
Of course they use malloc. But for small allocations, malloc uses brk() to grow its heap. For larger allocations, it of course uses mmap(). Depends where the cutoff is, and the distribution of mallocs that are being made. I can easily imagine that most of Mozilla's allocations (indeed, most stuff using higher level languages, because it's too damn easy to forget what you're doing when given too many language smarts) are a KB or less (a 1x1 gif here, a line of text there, a widget up there), and might miss the cutoff.

I think I read somewhere in one of the perennial "emacs23 eats ram like never before" threads that emacs uses 10k chunks for *everything*. Which means horrible wastage and fragmentation for anything smaller, and everything larger misses out on the benefit of using mmap(), because it misses that cutoff.

--
Tim Connors

Tim Connors wrote:
On Tue, 24 Apr 2012, Trent W. Buck wrote:
*blink* - are you saying that bloatware (moco and friends) don't use malloc and free? WTF?! You'd think they can't read a manpage:
    $ man brk | grep malloc
    Avoid using brk() and sbrk(): the malloc(3) memory allocation package is the portable and comfortable way of allocating memory.
Of course they use malloc. But for small allocations, malloc uses brk() to grow its heap. For larger allocations, it of course uses mmap(). Depends where the cutoff is, and the distribution of mallocs that are being made. I can easily imagine that most of Mozilla's allocations (indeed, most stuff using higher level languages, because it's too damn easy to forget what you're doing when given too many language smarts) are a KB or less (a 1x1 gif here, a line of text there, a widget up there), and might miss the cutoff.
I think I read somewhere in one of the perennial "emacs23 eats ram like never before" threads that emacs uses 10k chunks for *everything*. Which means horrible wastage and fragmentation for anything smaller, and everything larger misses out on the benefit of using mmap(), because it misses that cutoff.
My memory was that modern linux, at least, if an app said "can I have 1kB? now another? and a little bit more" the allocator would (under the hood - transparent to the app) initially allocate the minimum sane chunk up-front, and dole out from that as necessary. If someone can cite up-to-date accurate details, I'll happily defer to them.

PS: I've always assumed the reason moco et al used a lot of ram was because they were poorly (if at all) designed, and that nobody bothers to go back and fix any given bit of it to use one-tenth the RAM because "RAM is cheap, I have 16GB per user, so you should too".

On Tue, 24 Apr 2012, "Trent W. Buck" <trentbuck@gmail.com> wrote:
My memory was that modern linux, at least, if an app said "can I have 1kB? now another? and a little bit more" the allocator would (under the hood - transparent to the app) initially allocate the minimum sane chunk up-front, and dole out from that as necessary.
I just wrote a small test program and a few calls to malloc() of up to 400 bytes will come from brk(). I didn't test when it switches to mmap().

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
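For reference, a minimal sketch of that kind of test program (not the actual one Russell ran; it assumes glibc on Linux, where small requests come from the brk()-managed heap and requests above the documented default M_MMAP_THRESHOLD of 128 KiB are satisfied via mmap()):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void report(const char *label)
{
    /* sbrk(0) returns the current program break without moving it */
    printf("%-30s break = %p\n", label, sbrk(0));
}

int main(void)
{
    report("at start");

    /* small allocations: served from the heap, so the break moves */
    for (int i = 0; i < 100; i++)
        (void)malloc(400);
    report("after 100 x 400-byte mallocs");

    /* one large allocation: above the default M_MMAP_THRESHOLD
     * (128 KiB on glibc), so it should go via mmap() and leave
     * the break where it was */
    void *big = malloc(1024 * 1024);
    report("after a 1 MiB malloc");

    free(big);
    return 0;
}

Running it under strace -e brk,mmap shows the same split at the syscall level.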

Robin Humble wrote:
having said that, all fs's get fragmented and all hate being nearly full, and I doubt ZFS's %full drop-off point is more than 5% of disk capacity different from any other fs's.
BTW, this is hidden from ext users, because the default "5% reserved for root user" is not shown in df, and is used as the "wiggle room". If you set it to 0%, or wilfully and repeatedly fill the fs as root, the fragmentation of the ext fs in question rapidly rises from 2% to (say) 75%. ...not that I've ever done that before, oh no >whistles innocently<

On Tue, 24 Apr 2012, "Trent W. Buck" <trentbuck@gmail.com> wrote:
BTW, this is hidden from ext users, because the default "5% reserved for root user" is not shown in df, and is used as the "wiggle room". If you set it to 0%, or wilfully and repeatedly fill the fs as root, the fragmentation of the ext fs in question rapidly rises from 2% to (say) 75%.
Which generally doesn't cause problems. I've got a lot of systems that had "tune2fs -m0" run on them and I've never seen any serious performance problems as a result. I don't recall seeing any ext[34] system have a performance problem when almost full that it didn't have when ~75% full. Certainly nothing like the timeout problems referred to earlier in this thread. It probably helps that the most write-intensive systems I run are mail servers, which have an average file size of something less than 100KB, which leaves less possibility of file fragmentation.

I think that one reason for this is that an almost-full ZFS/BTRFS system will suddenly start getting serious fragmentation of writes which would otherwise be contiguous, while for ext[23] and other non-COW filesystems you get writes all over the place for metadata anyway, so it doesn't change things as much.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Tim Connors wrote:
Do you use snapshots? Extensively?
Yes. The ZFS box (5.11 snv_111b i86pc i386 i86pc Solaris) is basically doing the same job as rsnapshot, except using daily rsync + zfs snapshots instead of daily rsync + hard links.

Also noteworthy -- until it ran out of disk, nobody had gotten around to expiring old snapshots, so there were a LOT of them (probably around 1000 per backup). That would certainly increase the time to remove a single snapshot significantly, and AFAICT when you tell it to remove >1 snapshot it still does so by removing them one at a time.
Perhaps continual use of snapshots greatly fragments the pool.
NFI. I will happily concede that my anecdote is not a very reliable datapoint, but I'm gonna keep using ext until cmason rubber stamps btrfs as being 100% a-OK for production use.

On Tue, 24 Apr 2012, Trent W. Buck wrote:
NFI. I will happily concede that my anecdote is not a very reliable datapoint, but I'm gonna keep using ext until cmason rubber stamps btrfs as being 100% a-OK for production use.
Hope it's not just a rubber stamp! :) It's not ready yet.

--
Tim Connors
participants (7)

- Craig Sanders
- Julien Goodwin
- Peter Ross
- Robin Humble
- Russell Coker
- Tim Connors
- Trent W. Buck