
On Fri, Apr 06, 2012 at 11:43:40AM -0400, Robin Humble wrote:
fair enough. I guess if remapped sectors are incrementing in the drive's SMART data then it's probably working, but if the drives are just timing out SCSI commands randomly (should they really do that?!) then that wouldn't show up.
they're consumer drives so, yeah, they have long retry times. enterprise drives are far quicker to return an error... consumer drives keep on trying (non-tunable), and the kernel has a few retries too (tunable), so it can sometimes take a minute or more. that can be enough for zfs (or the SAS card, if it's still flashed with raid firmware like mine is) to decide the drive is failing.
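for reference, a rough sketch of the kernel-side and drive-side knobs involved. the device name sda is just a placeholder, and most consumer drives will reject the ERC command entirely:

  # how long the kernel's SCSI layer waits for a command before giving up, in seconds (default 30)
  $ cat /sys/block/sda/device/timeout
  # raise it (as root) if the drive's internal retries regularly outlast the kernel
  $ echo 120 > /sys/block/sda/device/timeout

  # query, then set, the drive's error recovery control to a 7 second read/write limit.
  # most enterprise drives support this, most consumer drives don't.
  $ smartctl -l scterc /dev/sda
  $ smartctl -l scterc,70,70 /dev/sda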
any idea what's causing the deadlocks?
the traces, and some builds back and forward through git commits, give some idea. I'm guessing that attempts by lustre to send data using the zero-copy write hooks in zfs are racing with (non-zero-copy?) metadata (attr) updates. I'll email Brian and see if he can suggest something, and/or which mailing list, jira, or github issue to post to.
I assume regular ZFS is ok and stable because it doesn't attempt zero-copy writes.
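for what it's worth, a minimal sketch of how that kind of back-and-forward search through commits can be driven with git bisect. the repository path and the known-good revision are placeholders:

  $ cd zfs
  $ git bisect start
  $ git bisect bad                      # the current HEAD deadlocks under the write load
  $ git bisect good <last-known-good>   # a tag or commit that survived the same test
  # ... build, install, re-run the write test, then tell bisect the result:
  $ git bisect good                     # or 'git bisect bad', as appropriate
  $ git bisect reset                    # when finished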
ah, okay. well, at least you know it is a priority for the LLNL zfs + lustre project :)
only when writing, or reading too? random or sequential writes?
just writes. sometimes sequential and sometimes random. always with at least 32, and often 128, 1MB i/os in flight from the clients.
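for anyone else reading: those in-flight counts are client-side lustre tunables. a hedged sketch, assuming the stock osc parameter names (the values are just examples):

  # on a lustre client: RPCs kept in flight per OST, and dirty cache per OSC (in MB)
  $ lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
  $ lctl set_param osc.*.max_rpcs_in_flight=32
  $ lctl set_param osc.*.max_dirty_mb=128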
you got any SSDs for ZIL?
so I guess you're running with ashift=12 and a limit on zfs_arc_max?
yep. 4GB zfs_arc_max. I set it to that when my machine only had 8GB RAM, and didn't bother changing it when i upgraded to 16GB. seems good for a shared desktop/server running a few bloated apps like chromium.

$ cat /etc/modprobe.d/zfs.conf
# use minimum 1GB and maximum of 4GB RAM for ZFS ARC
options zfs zfs_arc_min=1073741824 zfs_arc_max=4294967296

and yes, using ashift=12 because some of my drives are 4K-sector ("advanced format") drives, and i expect all of them will be replaced with 4K drives in the not-too-distant future.

on my zfs backup server with 24GB RAM at work, i have:

# use minimum 4GB and maximum of 12GB RAM for ZFS ARC
options zfs zfs_arc_min=4294967296 zfs_arc_max=12884901888

i'd be comfortable increasing that up to 20 or 22GB. usage is mostly writes (rsync backups), so it doesn't matter much.
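a small aside: the running ARC counters and the module parameter can both be inspected without a reboot (paths as on a stock zfsonlinux install):

  # current ARC size vs. its configured ceiling, in bytes
  $ awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
  # the module parameter is also exposed here; whether writing a new value takes
  # effect immediately or only after a module reload depends on the zfsonlinux release
  $ cat /sys/module/zfs/parameters/zfs_arc_max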
I'm also using zfs_prefetch_disable=1 (helps lustre reads), but apart from that no other zfs tweaks, no l2arc SSDs yet etc.
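for completeness, a sketch of where that setting would go, in the same modprobe.d style as the arc options quoted above:

  # disable the file-level prefetcher
  options zfs zfs_prefetch_disable=1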
L2ARC isn't any use for writes anyway, except indirectly as it helps reduce the read i/o load on the disks.

i'd try putting in a good fast SLC SSD as ZIL and see if that helps. it should certainly smooth out the metadata updates. for ZIL you probably don't need more than a few GB (maybe as much as 4 or even 8; OTOH it doesn't hurt to have more ZIL than you need, it just doesn't get used, so use the entire device), so the write speed of the SSD is far more important than its capacity. and for multiple simultaneous writers you're probably better off with multiple small ZIL devices than one larger one.

note, however, that larger SSDs tend to be faster than smaller ones due to the internal raid0-like configuration of the individual flash chips. 120GB and 240GB SSDs, for example, are noticeably faster than 60GB ones. the excess space could just be wasted, or perhaps the SSD could be partitioned and the excess used as L2ARC (but that would impact the ZIL partition's write performance, so it's probably better to just use the entire ssd as ZIL regardless of size).

fortunately, log (ZIL) and cache (L2ARC) devices can be added and removed at whim with zfs, so you can experiment easily with different configurations (a quick sketch of the commands is below, after the sig). watching 'zpool iostat -v' will tell you ZIL usage.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #327:

The POP server is out of Coke
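PS: the sketch mentioned above. the pool name 'tank' and the ssd-* device ids are placeholders:

  # add two small SSDs as separate log (ZIL) devices, and a third as cache (L2ARC)
  $ zpool add tank log /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B
  $ zpool add tank cache /dev/disk/by-id/ssd-C
  # log and cache devices can be removed again without touching the data vdevs
  $ zpool remove tank /dev/disk/by-id/ssd-A
  # per-vdev activity, including the log devices, refreshed every 5 seconds
  $ zpool iostat -v tank 5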