
has anyone else had issues with ZFS on recent kernels and distros?

I'm a bit of a newbie to daily use of ZFS, but I found it's fairly easy to completely lock it up with lots of metadata ops (e.g. du or a big rsync). the thing that fixed it for me is booting with cgroup_disable=memory, which I found via https://github.com/zfsonlinux/zfs/issues/4726 but the patch committed there didn't fix it for me. I'm not sure if that's quite the right issue, or if I should be patching ZFS 0.7.0 instead of 0.6.5.7? if the root cause is not enough GPL symbols exported from the kernel then that's sad. the perils of non-GPL code :-/

BTW this is on fedora 24 with root on ZFS, but it sounds like ubuntu has similar issues. the symptoms feel like a livelock in some slab handling rather than an outright OOM: 100% system time on all cores, zero fs activity, and no way out except a reset. unfortunately root on ZFS on a laptop means no way that I can think of to get stack traces or logs :-/

cheers,
robin
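PS: for anyone who wants to try the same workaround, it's just a kernel command line option. on Fedora something like this should do it (the grub.cfg path shown is the usual BIOS one and differs on EFI systems):

    # add cgroup_disable=memory to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
    grub2-mkconfig -o /boot/grub2/grub.cfg
    # after a reboot the memory controller should show enabled=0 here:
    cat /proc/cgroups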

On Sunday, 7 August 2016 1:58:25 AM AEST Robin Humble via luv-main wrote:
has anyone else had issues with ZFS on recent kernels and distros?
Debian/Jessie (the latest version of Debian) is working really well for me. I have several systems in a variety of configurations without any problems at all. Earlier versions had ZFS sometimes not mounting on boot, but that seems to have gone away.
I'm a bit of a newbie to daily use of ZFS, but I found it's fairly easy to completely lockup with lots of metadata ops (eg. du or big rsync).
I had some problems when I first started with a system that had 4G of RAM. I determined that it would be cheaper for my client to buy more RAM than to pay me to figure out why it was getting an OOM - and more RAM helps performance. Since then I haven't had a problem.
BTW this is on fedora 24 with root on ZFS, but it sounds like ubuntu has similar issues. symptoms feel like a livelock in some slab handling rather than an outright OOM. there's 100% system time on all cores, zero fs activity, no way out except to reset. unfortunately root on ZFS on a laptop means no way that I can think of to get stack traces or logs :-/
I never had ZFS work in a way that was suitable for root, and I have had a number of situations where ZFS systems required the ability to boot without ZFS mounted.

For the laptops I run I use BTRFS. It gives all the benefits of ZFS, without the pain, for a configuration that doesn't have anything better than RAID-1 and doesn't support SSD cache (i.e. laptop hardware). ZFS is necessary if you need RAID-Z/RAID-5 type functionality (I wouldn't trust BTRFS RAID-5 at this stage), if you are running a server (BTRFS performance sucks and reliability isn't adequate for a remote DC), or if you need L2ARC/ZIL type functionality.

For a laptop, slow disk performance usually isn't a problem and ZFS probably isn't going to do much better if you have a single HDD. If you have an SSD in your laptop (which costs $200 for 500G) then BTRFS performance will be great.

I wouldn't recommend using ZFS with Fedora. New releases come out too quickly, and out-of-tree kernel modules have more potential for breaking than regular kernel modules.

On Mon, Aug 08, 2016 at 09:23:56PM +1000, Russell Coker wrote:
On Sunday, 7 August 2016 1:58:25 AM AEST Robin Humble via luv-main wrote:
has anyone else had issues with ZFS on recent kernels and distros?

Debian/Jessie (the latest version of Debian) is working really well for me. Several systems in a variety of configurations without any problems at all.
thanks for the data point. I assume that's running an old kernel though? if it's not 4.6+ then it shouldn't hit this problem. memcg needs to be in use too:

    % cat /proc/self/cgroup | grep mem
    9:memory:/user.slice/user-XXXX.slice
I'm a bit of a newbie to daily use of ZFS, but I found it's fairly easy to completely lockup with lots of metadata ops (eg. du or big rsync).

I had some problems when I first started with a system that had 4G of RAM.
I've 8G of RAM, which should be heaps. limiting the l2arc to 1G didn't help either.

actually, come to think of it, I could get logs out if ZFS locks up via rsyslog to something LAN/cloudy. I'll try that next time.
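something like this in /etc/rsyslog.d/ should be enough to ship kernel messages to another box on the LAN (the address is just a placeholder):

    # /etc/rsyslog.d/remote.conf - forward kernel messages over UDP (single @ = UDP, @@ = TCP)
    kern.*    @192.168.1.10:514

plus enabling the imudp module on the receiving box so it actually listens on port 514.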
For the laptops I run I use BTRFS. It gives all the benefits of ZFS for a configuration that doesn't have anything better than RAID-1 and doesn't support SSD cache (IE laptop hardware) without the pain.
fair enough. I wanted to test ZFS for other reasons though - Lustre ZFS OSDs. BTW does btrfs still have issues when the filesystem fills? does ZFS?
For a laptop slow disk performance usually isn't a problem and ZFS probably isn't going to do much better if you have a single HDD. If you have a SSD in your laptop (which costs $200 for 500G) then BTRFS performance will be great.
on my intel SSD, ZFS is noticeably slower than ext4. part of that is ZFS's poor integration with linux's virtual memory system - the two sets of caches clearly fight each other - but presumably it's also slower because it has more features (e.g. checksums, compression).

cheers,
robin

On Monday, 8 August 2016 2:05:47 PM AEST Robin Humble via luv-main wrote:
thanks for the data point. I assume that's running an old kernel though? if it's not 4.6+ then it shouldn't hit this problem. memcg needs to be in use too:

    % cat /proc/self/cgroup | grep mem
    9:memory:/user.slice/user-XXXX.slice
3.16 is the kernel for Jessie.
For the laptops I run I use BTRFS. It gives all the benefits of ZFS for a configuration that doesn't have anything better than RAID-1 and doesn't support SSD cache (IE laptop hardware) without the pain.
fair enough. I wanted to test ZFS for other reasons though - Lustre ZFS OSDs.
Why can't Lustre run on BTRFS?
BTW does btrfs still have issues when the filesystem fills? does ZFS?
It's been a long time since it had serious issues, but there are still problems with balancing after a filesystem becomes full.
For a laptop slow disk performance usually isn't a problem and ZFS probably isn't going to do much better if you have a single HDD. If you have a SSD in your laptop (which costs $200 for 500G) then BTRFS performance will be great.

on my intel SSD, ZFS is noticably slower than ext4. part of it's because of ZFS's poor integration with linux's virtual memory system and both sets of caches clearly fighting each other, but presumably it's slower 'cos it has more features (eg. checksum, compression) too.
Checksums take little time on any CPU made in the last decade or so, and compression shouldn't slow it down either. The tree updates to the root of the filesystem will slow things down, however, and the duplicate metadata will also reduce performance.

On Wed, Aug 10, 2016 at 03:39:40PM +1000, Russell Coker wrote:
On Monday, 8 August 2016 2:05:47 PM AEST Robin Humble via luv-main wrote:
I wanted to test ZFS for other reasons though - Lustre ZFS OSDs.

Why can't Lustre run on BTRFS?
Lustre doesn't use the ZFS POSIX layer. infographics tell me that it hooks into the ZFS ZAP and DMU directly - for performance reasons, I presume. I think I vaguely knew what those were once, but no longer. the entirety of the Lustre server code is in-kernel and uses in-kernel APIs, so adding btrfs to the object storage options (currently ~ext4 and ZFS) wouldn't be trivial. I think there was some talk of btrfs a few years ago though, and the re-architecting that allowed ZFS in has no doubt made it easier to add a third choice. it might happen one day, especially if btrfs is more stable now.

hmmm, maybe Lustre not using the POSIX layer also means that there are no ZFS vs. Linux VM cache conflicts, because it's _all_ ZFS cache. that would be nice. I'll have to run some tests and check that...

so is it possible to very heavily prioritise metadata over data in the ZFS read caches? never dropping metadata would be my preference. that's how we get excellent IOPS from our (ext4 OSD) Lustre currently. sadly it's all too easy for GB/s of use-once data to flush metadata out of the caches, which then takes ages to re-read from disk...

cheers,
robin

On Mon, Aug 08, 2016 at 02:05:47PM -0400, Robin Humble wrote:
I've 8G ram which should be heaps. limiting l2arc to 1G didn't help either.
l2arc or arc? if l2arc, 1G isn't really worth bothering with and may even hurt performance.
actually, come to think of it, I could get logs out if ZFS locks up via rsyslog to something lan/cloudy. I'll try that next time.
yep, that or a serial console should do it. unless the kernel locks up completely before it can get out a log packet or output to tty :(
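netconsole is another option for getting kernel messages out over the LAN without a serial cable. roughly (addresses, interface, and MAC are placeholders):

    # on the sick box: local-port@local-ip/interface,remote-port@remote-ip/remote-mac
    modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
    # on the receiving box, listen on UDP 6666
    nc -u -l 6666        # or 'nc -u -l -p 6666' with traditional netcat

same caveat applies though - if the kernel is completely wedged, nothing gets out.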
BTW does btrfs still have issues when the filesystem fills? does ZFS?
In my experience, ZFS performance starts to suck when you get over 80% full, and really sucks when you get to 90+%. don't do that. that was on raidz (which is what inspired converting my backup pool from 4x1TB RAID-Z to 4x4TB mirrored pairs). I haven't yet got to 80+% full with zfs mirrors, so I don't know if it's as bad there.
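the CAP column in zpool list is the quick way to keep an eye on that:

    zpool list          # per-pool size, free space, and % capacity
    zfs list            # per-filesystem usage, if you want more detail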
on my intel SSD, ZFS is noticably slower than ext4. part of it's because of ZFS's poor integration with linux's virtual memory system and both sets of caches clearly fighting each other,
have you tried setting zfs_arc_min = zfs_arc_max? that should stop ARC from releasing memory for linux buffers to use.
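on ZFS on Linux those are module parameters, so something like this should pin the ARC (2G is just an example for an 8G machine):

    # /etc/modprobe.d/zfs.conf - example values only, both in bytes (2 GiB here)
    options zfs zfs_arc_min=2147483648 zfs_arc_max=2147483648

they can also be poked at runtime via /sys/module/zfs/parameters/ (zfs_arc_max at least) if you don't want to reload the module.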
but presumably it's slower 'cos it has more features (eg. checksum, compression) too.
compression usually speeds up disk access. The most likely cause is that ext4 has excellent SSD TRIM support but ZFS on linux doesn't. There's a patch that's (finally!) going to be merged "soon": https://github.com/zfsonlinux/zfs/pull/3656

there used to be another ZoL TRIM patch some time ago, but it was scrapped to avoid duplication of effort with illumos/freebsd.

interestingly, there's also a pull request for compressing the ARC and L2ARC: https://github.com/zfsonlinux/zfs/pull/4768 - depending on data compressibility, that should help a lot on low-memory systems.

On Thu, Aug 11, 2016 at 02:47:10AM +1000, Craig Sanders via luv-main wrote:
On Mon, Aug 08, 2016 at 02:05:47PM -0400, Robin Humble wrote:
I've 8G ram which should be heaps. limiting l2arc to 1G didn't help either.

l2arc or arc? if l2arc, 1G isn't really worth bothering with and may even hurt performance.
<looks through command history>

    echo "options zfs zfs_arc_max=$((2*1024*1024*1024))" > /etc/modprobe.d/zfs.conf
on my intel SSD, ZFS is noticably slower than ext4. part of it's because of ZFS's poor integration with linux's virtual memory system and both sets of caches clearly fighting each other,

have you tried setting zfs_arc_min = zfs_arc_max? that should stop ARC from releasing memory for linux buffers to use.
is that what most folks do? as the SSD is fast at large reads (500MB/s), I could also just cache metadata and not data. would that make sense, do you think?
The most likely cause is that ext4 has excellent SSD TRIM support but ZFS on linux doesn't. There's a patch that's (finally!) going to be merged "soon": https://github.com/zfsonlinux/zfs/pull/3656
neat! cheers, robin

On Thu, Aug 11, 2016 at 02:53:10AM -0400, Robin Humble wrote:
have you tried setting zfs_arc_min = zfs_arc_max? that should stop ARC from releasing memory for linux buffers to use.
is that what most folks do?
no idea. it just seems like something that's worth trying.
as the SSD is fast at large reads (500MB/s), I could also just cache metadata and not data. would that make sense do you think?
it might help. I used to do it and it didn't seem to do much, but that was on a low-memory system with only 1GB of ARC. IIRC, I have primarycache=metadata and secondarycache=all.

you can set that per filesystem or zvol, for both ARC and L2ARC, with 'zfs set':

    primarycache=all | none | metadata
        Controls what is cached in the primary cache (ARC). If this property is set to all, then both user data and metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this property is set to metadata, then only metadata is cached. The default value is all.

    secondarycache=all | none | metadata
        Controls what is cached in the secondary cache (L2ARC). If this property is set to all, then both user data and metadata is cached. If this property is set to none, then neither user data nor metadata is cached. If this property is set to metadata, then only metadata is cached. The default value is all.
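e.g. (the pool/dataset names here are just placeholders):

    zfs set primarycache=metadata rpool/home
    zfs get primarycache,secondarycache rpool/home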

On Fri, Aug 12, 2016 at 12:38:41PM +1000, Craig Sanders wrote:
as the SSD is fast at large reads (500MB/s), I could also just cache metadata and not data. would that make sense do you think?
it might help. I used to do it and it didn't seem to do much, but that was on a low memory system with only 1GB of ARC. IIRC, I have primarycache=metadata and secondarycache=all
i should have mentioned that this was on a system with lots of background processes and cron jobs, many of which recurse directories, so lots of metadata churn. e.g. running find on a large directory (like a debian mirror) will do it.

btw, i guess you've already read http://open-zfs.org/wiki/Performance_tuning

also, this is a bit dated but still worth reading: https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-repl...

On Mon, Aug 08, 2016 at 09:23:56PM +1000, russell@coker.com.au wrote:
On Sunday, 7 August 2016 1:58:25 AM AEST Robin Humble via luv-main wrote:
has anyone else had issues with ZFS on recent kernels and distros?
Debian/Jessie (the latest version of Debian) is working really well for me. Several systems in a variety of configurations without any problems at all.
me too, no problems on sid with 4.6.x kernels and zfs-dkms 0.6.5.7-1 on several different machines.

i recently upgraded my main system from 16GB to 32GB, but that was because I started using chromium again and it really uses a LOT of memory. leaks a lot too. I took the opportunity to tune zfs_arc_min & zfs_arc_max to 4GB & 8GB (they had been set to 1 & 4GB), and have zswap configured to use up to 25% of RAM for compressed swap.
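for reference, zswap at 25% is normally just kernel boot parameters, roughly like this (the compressor can also be picked with zswap.compressor if your kernel has others built in):

    # on the kernel command line (e.g. GRUB_CMDLINE_LINUX)
    zswap.enabled=1 zswap.max_pool_percent=25
    # check the current settings at runtime
    grep -r . /sys/module/zswap/parameters/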
BTW this is on fedora 24 with root on ZFS, but it sounds like ubuntu has similar issues. symptoms feel like a livelock in some slab handling rather than an outright OOM. there's 100% system time on all cores, zero fs activity, no way out except to reset. unfortunately root on ZFS on a laptop means no way that I can think of to get stack traces or logs :-/
syslog over the LAN? serial console?
For the laptops I run I use BTRFS. It gives all the benefits of ZFS for a configuration that doesn't have anything better than RAID-1 and doesn't support SSD cache (IE laptop hardware) without the pain.
I'm probably going to do this when i replace my boot SSDs sometime in the nearish future (currently mdadm raid-1 partitions for / and /boot, with other partitions for swap, mirrored ZIL, and L2ARC).

I'd like to use zfs for root (i'm happy enough to net- or usb-boot a rescue image with ZFS tools built in if/when i ever need to do any maintenance without the rpool mounted), except for the fact that ZFS is only just about to get good TRIM support for VDEVs. If it's ready and well-tested by the time i replace my SSDs, I may even go ahead with that. being able to use 'zfs send' instead of rsync to back up the root filesystems on all machines on my LAN will be worth it.

speaking of which, have you ever heard of any tools that can interpret a btrfs send stream and extract files from it? and maybe even merge in future incremental streams? in other words, btrfs send to any filesystem (including zfs). something like that would make btrfs for rootfs and zfs for bulk storage / backup really viable.

I need a good excuse to start learning Go, so i think i'll start playing with that idea on my ztest vm (initially created for zfs testing but now has the 5GB boot virtual disk + 12 x 200MB more disks for mdadm, lvm, btrfs, and zfs testing). BTW, there's a bug in seabios which causes a VM to lock up on "warm" reboot if there are more than 8 virtual disks and the BIOS boot menu is enabled... which is an improvement over what it used to do, which was lock up even on an initial "cold" boot.

it may not even be possible, though - the idea is based on fuzzy memories from years ago that a btrfs send stream contains a sequence of commands (and data) which are interpreted and executed by btrfs receive. IIRC, the btrfs devs' original plan was to make it tar compatible, but tar couldn't do what they needed so they wrote their own.
ZFS is necessary if you need RAID-Z/RAID-5 type functionality (I wouldn't trust BTRFS RAID-5 at this stage), if you are running a server (BTRFS performance sucks and reliability isn't adequate for a remote DC), or if you need L2ARC/ZIL type functionality.
i made the mistake of using raidz when i first started using zfs years ago. it's not buggy (it's rock-solid reliable), it's just that mirrors (raid1 or raid10) are much faster and easier to expand. it made sense, financially, at the time to use 4x1TB drives in raid-z1, but I'm only using around 1.8TB of that, so I'm planning to replace them with either 2x2TB or 2x4TB - maybe even 4x2TB for better performance.

the performance difference is significant: my "backup" pool has two mirrored pairs, while my main "export" pool has raid-z. scrubs run at 200-250MB/s on "backup", and around 90-130MB/s on "export".

i also use raidz on my mythtv box. performance isn't terribly important on that, but storage capacity is. even so, mirrored pairs would be easier to upgrade than raidz - cheaper too, because I'd only have to upgrade a pair at a time rather than all four raid-z members. I have no intention of replacing any until the drives start dying or 8+TB drives are cheap enough to consider buying a pair of them.
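if you want to compare on your own pools, the scrub rate shows up in zpool status while it's running (the pool name here is just an example):

    zpool scrub backup
    zpool status backup     # the 'scan:' line reports the current scrub speed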